🕸️ Gumo

"Gumo" (蜘蛛) is Japanese for "spider".


Overview 👓

A web-crawler (get it?) and scraper that extracts data from a family of nested dynamic webpages with added enhancements to assist in knowledge mining applications. Written in NodeJS.

Table of Contents 📖

  • Features
  • Installation
  • Usage
  • Configuration
  • ElasticSearch
  • GraphDB
  • TODO

Features 🌟

  • Crawl hyperlinks present on the pages of any domain and its subdomains.
  • Scrape meta-tags and body text from every page.
  • Store entire sitemap in a GraphDB (currently supports Neo4J).
  • Store page content in ElasticSearch for easy full-text lookup.

Installation 🏗️

NPM
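
Assuming the package is published on npm under the same name used in the require call below:

npm install gumo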

Usage 👨‍💻

From code:

// 1: import the module
const gumo = require('gumo')

// 2: instantiate the crawler
let cron = new gumo()

// 3: call the configure method and pass the configuration options
cron.configure({
    'neo4j': { // replace with your details or remove if not required
        'url' : 'neo4j://localhost',
        'user' : 'neo4j',
        'password' : 'gumo123'
    },
    'elastic': { // replace with your details or remove if not required
        'url' : 'http://localhost:9200',
        'index' : 'myIndex'
    },
    'crawler': {
        'url': 'https://www.example.com',
    }
});

// 4: start crawling
cron.insert()

Note: The config params passed to cron.configure above are the default values. Please refer to the Configuration section below to learn more about the customization options that are available.

Configuration ⚙️

The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following attributes can be configured:

| Attribute (* = mandatory) | Type | Accepted Values | Description | Default Value | Default Behavior |
| --- | --- | --- | --- | --- | --- |
| * crawler.url | string |  | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string |  | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes" / "No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes" / "No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | 1 to 5000 | Maximum number of requests to be sent to the target in one second | 5000 |  |
| crawler.maxConcurrentRequests | int | 1 to 5000 | Maximum number of concurrent connections to be created with the host at any given time | 5000 |  |
| crawler.whiteList | Array(string) |  | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) |  | If populated, these URLs will be ignored | [] (empty array) |  |
| crawler.depth | int | 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 |  |
| * elastic.url | string |  | URI of the ElasticSearch instance to connect to | "http://localhost:9200" |  |
| * elastic.index | string |  | Name of the ElasticSearch index to store results in | "myIndex" |  |
| * neo4j.url | string |  | URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" |  |
| * neo4j.user | string |  | Neo4J server username | "neo4j" |  |
| * neo4j.password | string |  | Neo4J server password | "gumo123" |  |
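
For reference, here is a sketch of a fuller configuration combining the attributes above (the values are illustrative, not recommendations):

cron.configure({
    'crawler': {
        'url': 'https://www.example.com',       // mandatory: base URL to start scanning from
        'Cookie': 'session=abc123',             // attached to every request (for pages behind auth)
        'saveOutputAsHtml': 'No',
        'saveOutputAsJson': 'Yes',              // write scraped content to output/json/
        'maxRequestsPerSecond': 100,
        'maxConcurrentRequests': 10,
        'whiteList': [],                        // empty: traverse every URL on the same hostname
        'blackList': ['https://www.example.com/private'],
        'depth': 3                              // follow nested hyperlinks up to 3 levels deep
    },
    'neo4j': {
        'url': 'neo4j://localhost',
        'user': 'neo4j',
        'password': 'gumo123'
    },
    'elastic': {
        'url': 'http://localhost:9200',
        'index': 'myIndex'
    }
});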

ElasticSearch

The content of each crawled page is stored along with its URL and a hash. The target index is selected through the elastic.index config attribute; if the index already exists in ElasticSearch it is reused, otherwise it is created.

Each page is written as a document of the following shape:

id: hash, index: config.index, type: 'pages', body: JSON.stringify(page content)
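
As a concrete illustration, this is roughly what that write looks like using the legacy elasticsearch Node client (a sketch only: the client Gumo uses internally may differ, and the pageContent shape and md5 hashing scheme here are assumptions):

const elasticsearch = require('elasticsearch')
const crypto = require('crypto')

const client = new elasticsearch.Client({ host: 'http://localhost:9200' })

// Illustrative page content; the real crawler stores scraped meta-tags and body text
const pageContent = { link: 'https://www.example.com', title: 'Example Domain', text: '...' }
const hash = crypto.createHash('md5').update(pageContent.link).digest('hex') // hash scheme is an assumption

client.index({
    id: hash,                          // crawler-generated UID
    index: 'myIndex',                  // from the elastic.index config attribute
    type: 'pages',
    body: JSON.stringify(pageContent)  // stored page content
}).then(() => console.log('indexed page', hash))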

GraphDB ☋

The sitemap of all traversed pages is stored as a graph with the following structure of nodes and relationships:

Nodes

  • Label: Page
  • Properties:
| Property Name | Type | Description |
| --- | --- | --- |
| pid | String | UID generated by the crawler; uniquely identifies a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |

Relationships

| Name | Direction | Condition |
| --- | --- | --- |
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
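
With that structure in place, the sitemap can be read back with a plain Cypher query. A minimal sketch using the official neo4j-driver package (connection details match the config defaults above):

const neo4j = require('neo4j-driver')

const driver = neo4j.driver('neo4j://localhost', neo4j.auth.basic('neo4j', 'gumo123'))
const session = driver.session()

// List page-to-page links recorded by the crawler
session.run('MATCH (a:Page)-[:links_to]->(b:Page) RETURN a.link AS from, b.link AS to LIMIT 25')
    .then(result => {
        result.records.forEach(r => console.log(r.get('from'), '->', r.get('to')))
        return session.close()
    })
    .then(() => driver.close())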

TODO ☑️

  • Make it executable from the CLI
  • Allow config parameters to be passed when invoking gumo
  • Write more tests