Collecting public data from Internet

Friday, 01 January 2021

Earlier we wrote that recently we've gotten few tasks related to Machine Learning. The prerequisites to such task is to collect and prepare the input data. Usually the required data is scattered across public sites, some of them are in plain text format (or close to it), but others are accessible as output of public applications. To obtain the required data for such sites you have to navigate thourgh pages, which often requires keeping state between navigations.

In order to implement this task you need some kind of crawler/scraper of the websites. Fortunately, there are a lot of frameworks, libraries and tools in C# (and in other languages too) that allow to do this (visit this or this site to see most popular of them), for example:

ScrapySharp
ABot
HtmlAgilityPack
DotnetSpider

There are pros and cons of using these libraries. Most crucial cons is a lack of support of rich UI based on heavy client-side scripts and client-side state support. Since not all such libraries implement fully browser emulation and even more, some of them do not support Javascript execution. So, they suit for gathering information from simple web pages, but no library allows to easy navigate to some page of a web application that keeps rich client-side state. Even best of them, like ScrapySharp, require heavy programming to achieve the result.

Then, suddenly, we've recalled that already for several years we're using Selenium and web drivers to automate web tests for AngularJS/Angular projects. After short discussion we came to conclusion that there is no big difference between testing web application and collecting data, since one of testing stages is collecting of actual results (data) from the tested page, and usually our tests consist of chains of actions performed on consequently visited pages.

This way we came to idea to use WebDriver API implemented by Selenium project. There are implementations of this API in different languages, and in C# too.

Using WebDriver we easily implement cumbersome navigation of a complex web application and can collect required data. Moreover, it allows to run WebDriver in screenless mode. Some of its features allow to create a snapshots of virtual screen and store HTML sources that would resulted of Javascript execution. These features are very useful during run-time troubleshooting. To create a complex web application navigation we need only a bit more knowledge than usual web application's user - we need to identify somehow pages' elements for example by CSS selectors or by id of HTML elements (as we do this for tests). All the rest, like coockies, view state (if any), value of hidden fields, some Javascript events will be transparent in this case.

Although one may say that approach with Selenium is rather fat, it's ought to mention that it is rather scalable. You may either to run several threads with different WebDriver instances in each thread or run several processes simultaneously.

However, beside pros there are cons in the solution with Selenium. They will appear when you'll decide to publish it, e.g. to Azure environment. Take a note that approach with Selenium requires a browser on the server, there is also a problem with Azure itself, as it's Microsoft's platform and Selenium is a product of their main competitor Google... So, some issues aren't techincals. The only possible solution is to use PaaS approach instead of SaaS, but in this case you have to support everything by yourself...

The other problem is that if your application will implement rather aggressive crawling, so either servers where you gather data or your own host might ban it. So, be gentle, play nice, and implement delays between requests.

Also, take into account that when you're implementing any crawler some problems may appear on law level, since not all web sites allow pull anything you want. Many sites use terms & conditions that defines rules for the site users (that you cralwer should follow to), otherwise legal actions may be used against them (or their owners in case of crawler). There is very interesting article that describes many pitfalls when you implement your own crawler.

To summarize everything we told early, the Selenium project could be used in many scenarios, and one of them is to create a powerful crawler.

Friday, 01 January 2021 14:34:37 UTC

Comments [0] -
ML.NET | Thinking aloud | Tips and tricks

Comments are closed.