Ask HN: Advanced web crawling resources?
39 points by throwawayasdasd on April 28, 2017 | 10 comments
Does anyone know any good resources for advanced web scraping (scraping at scale, getting around various tricks to prevent crawling, etc.)?

I've looked around a lot, but nearly all the resources I find are the same: a short description, a small code snippet, and that's it.

I'm really looking for more.




Scrapinghub publishes some useful blog posts at https://blog.scrapinghub.com/. They naturally revolve around their own frameworks and services, so they may not all apply in your case.


Here is a sample crawler I wrote to harvest Yelp results; feel free to look at how it was written: https://github.com/deepanprabhu/yelp-crawler . It might not work any more, since Yelp may have made cosmetic changes, but the general approach should help you write one of your own!

I also have a more advanced scraper that can harvest AJAX-heavy sites like http://venture-capital-firms.findthecompany.com/. I scraped their entire site using a Chrome plugin that exported results through a web server. It is a somewhat involved procedure, because you have to be inside a live browser to hijack the results. That site even blocks headless browsers, so it was tricky.

I can share the code if you are interested; a rough sketch of the collecting end is below. Scaling the scraping up is an interesting problem in its own right.
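
To give a feel for the "export through a web server" half, here is a rough Python sketch of the collecting end (not the actual code; the /collect path and the results.jsonl file name are just placeholders). The extension's content script POSTs whatever the page has rendered to this endpoint:

    # Rough sketch of the collecting end, not the actual code. The browser
    # extension's content script POSTs the rendered records it sees to /collect
    # (a made-up path); each record is appended as one JSON line.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CollectHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != '/collect':
                self.send_error(404)
                return
            length = int(self.headers.get('Content-Length', 0))
            record = json.loads(self.rfile.read(length))
            with open('results.jsonl', 'a') as f:
                f.write(json.dumps(record) + '\n')
            self.send_response(204)
            self.end_headers()

    HTTPServer(('127.0.0.1', 8000), CollectHandler).serve_forever()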


Hi there,

You can start by reading this article about the BFS (breadth-first search) algorithm: https://fr.khanacademy.org/computing/computer-science/algori...

I built a personal web crawler using PHP, Redis, and Gearman on a single (personal) computer with many VMs to emulate AWS instances, and it works great! You can surely improve on this by using technologies other than PHP (Python, C, Node.js) and Gearman (Kafka, RabbitMQ).
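
To make the BFS part concrete, the core loop is tiny. Here is a rough single-process sketch in Python (my version is PHP, with the queue and the "seen" set living in Redis and the fetches farmed out as Gearman jobs, but the idea is the same):

    # Minimal BFS crawl loop: a FIFO frontier plus a "seen" set. In a real
    # distributed setup the deque and the set would live in Redis and the
    # fetches would run as worker jobs.
    from collections import deque
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def bfs_crawl(seed, max_pages=100):
        queue = deque([seed])   # frontier, explored level by level
        seen = {seed}           # never enqueue the same URL twice
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            fetched += 1
            yield url, html
            for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
                link = urljoin(url, a['href'])
                if link.startswith('http') and link not in seen:
                    seen.add(link)
                    queue.append(link)

    for url, html in bfs_crawl('https://example.com'):
        print(url, len(html))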

Hope this helps


I did this for sites with paging: https://github.com/indatawetrust/reporter . It pulls the data according to the properties you want and saves it to a JSON file. It is not very polished, but it could be improved if you wanted to.
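
The general pattern is just: walk the page parameter until a page comes back empty, pull out the properties you care about, and dump everything to JSON. Roughly like this (a generic Python sketch, not the code in the repo; the URL and the selectors are placeholders):

    # Generic paging sketch; the URL pattern and the .title/.price selectors
    # stand in for whatever properties you actually want to collect.
    import json
    import requests
    from bs4 import BeautifulSoup

    items = []
    page = 1
    while True:
        resp = requests.get('https://example.com/listings?page=%d' % page, timeout=10)
        rows = BeautifulSoup(resp.text, 'html.parser').select('.listing')
        if not rows:            # an empty page means we ran past the last one
            break
        for row in rows:
            items.append({
                'title': row.select_one('.title').get_text(strip=True),
                'price': row.select_one('.price').get_text(strip=True),
            })
        page += 1

    with open('items.json', 'w') as f:
        json.dump(items, f, indent=2)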


Scraping is only one part. How are you going to categorize, store, and search the data?!


Yes, indeed, scraping is the easiest part.

Saving everything in a way that lets you use it later is much harder (and more expensive), IMHO.


I'd argue that this depends heavily on the type of data you scrape and what you want to do with it.

If you have a good data model, then categorizing, storing, and searching the final result isn't a big problem, and the scraping is the complicated part. If you aren't scraping a specific kind of resource and just dump everything into some storage solution with no structure, then that becomes the hard part, while the scraping is easy.


In theory, say you want to index one billion (10^9) web sites. With modern hardware you should be able to crawl 10,000 web pages per second, which would take about 30 hours, and if you save 1 KB of text from each site, that would be about 1 TB of data. Doing a text search over 1 TB of text would take some time, though, maybe minutes. You could partition the data across servers.
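
The arithmetic, spelled out:

    # Back-of-envelope numbers from the paragraph above.
    pages = 10**9                  # one billion pages
    rate = 10000                   # pages crawled per second
    bytes_per_page = 1000          # ~1 KB of text kept per page

    hours = pages / rate / 3600.0              # ~27.8, call it 30 hours
    terabytes = pages * bytes_per_page / 1e12  # 1.0 TB
    print(hours, terabytes)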


I use CouchDB with replication and PostgreSQL as a data warehouse.

Anyway, I'm a noob, but after reading here and there that's what I decided to use.

For scraping I'm using Scrapy + Selenium and a modified JS script that uses Chrome (webscraper.io).
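
The plain (non-Selenium) Scrapy spiders are basically this shape -- not my actual spider, just a bare-bones sketch with a placeholder URL and selectors:

    # Bare-bones Scrapy spider sketch; run with:
    #   scrapy runspider listings_spider.py -o items.json
    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = 'listings'
        start_urls = ['https://example.com/listings']

        def parse(self, response):
            for row in response.css('.listing'):
                yield {
                    'title': row.css('.title::text').extract_first(),
                    'url': response.urljoin(row.css('a::attr(href)').extract_first()),
                }
            # queue the next page, if there is one
            next_page = response.css('a.next::attr(href)').extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Selenium only comes in when the page needs a real browser to render.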


I would just start with a naive implementation instead of searching for the optimal tools and solutions.



