- url - The URL to start scraping from.
- maxDepth - The maximum depth to crawl down to from the start URL.
- maxPages - The maximum number of pages for the entire scrape job. (A job stops crawling when it reaches maxDepth or maxPages, whichever comes first.)
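For illustration, a job built from these three parameters might look like the sketch below; the field names follow the list above, while the exact payload shape is an assumption rather than the repo's documented API.

```js
// Hypothetical scrape-job payload: names match the parameters above,
// but the object shape is an assumption, not the repo's documented API.
const job = {
  url: 'https://example.com', // the URL to start scraping from
  maxDepth: 2,                // crawl at most 2 links deep from the start URL
  maxPages: 50                // stop after 50 pages, whichever limit is hit first
};
```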
Each scraped page is reported with the following fields:
- title - The document.title of the page.
- depth - The depth at which the page was scraped.
- url - The URL that was scraped.
- links - All hrefs of the anchor tags on the page.
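Put together, a single page record with these fields could look like the following (values are made up for illustration):

```js
// Illustrative page record built from the fields above (values are made up).
const page = {
  title: 'Example Domain',                        // document.title of the page
  depth: 1,                                       // depth at which the page was reached
  url: 'https://example.com/',                    // the URL that was scraped
  links: ['https://www.iana.org/domains/example'] // hrefs of anchor tags on the page
};
```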
To run the project locally:
- Run `git clone https://github.com/PerachBD/WebCrawler.git`
- Run `npm i && npm start`
Built with:
- NodeJS
- React
- Express
- Web Storage
- Socket.IO - Enables real-time, bidirectional, event-based communication.
- Lowdb - Small JSON database for Node, Electron and the browser. Powered by Lodash.
- node-html-parser - A very fast HTML parser that generates a simplified DOM tree, with basic element query support (see the sketch below).
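As a rough illustration of how two of these libraries could fit together, the sketch below parses a fetched page with node-html-parser into the result fields listed earlier and pushes it to clients over Socket.IO. The wiring and the `pageScraped` event name are assumptions, not the repo's actual code.

```js
// Minimal sketch of one way these libraries could fit together; this is
// an assumption about the architecture, not the repo's actual code.
const http = require('http');
const { Server } = require('socket.io');       // Socket.IO v4 style
const { parse } = require('node-html-parser');

const httpServer = http.createServer();
const io = new Server(httpServer);
httpServer.listen(3001);

// Parse raw HTML into the per-page record described earlier.
function extractPage(html, url, depth) {
  const root = parse(html);                     // simplified DOM tree
  const titleEl = root.querySelector('title');
  return {
    title: titleEl ? titleEl.text : '',
    depth,
    url,
    links: root.querySelectorAll('a')           // all anchor tags on the page
      .map(a => a.getAttribute('href'))
      .filter(Boolean)                          // drop anchors without an href
  };
}

// Hypothetical event name: push each scraped page to connected clients in real time.
const page = extractPage('<title>Example</title><a href="/next">next</a>',
                         'https://example.com', 0);
io.emit('pageScraped', page);
```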
Planned improvements:
- Save running time by reusing results across overlapping scrape jobs.
- Calculate the number of workers dynamically, based on the load and the number of scrape jobs to be performed (one possible heuristic is sketched below).
- Add options to delete, pause, and resume a scrape job.
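For the dynamic worker count, one simple heuristic would be to scale with the number of queued jobs, capped by the number of CPU cores. This is purely a sketch of one option, not an approach the repo has committed to; `workerCount` is a hypothetical helper.

```js
const os = require('os');

// Hypothetical heuristic: one worker per queued job, capped by CPU cores.
// This is an assumption about a possible design, not the repo's plan.
function workerCount(queuedJobs) {
  return Math.max(1, Math.min(queuedJobs, os.cpus().length));
}

console.log(workerCount(8)); // e.g. 8 on a machine with at least 8 cores
```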