<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:pingback="http://madskills.com/public/xml/rss/module/pingback/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:georss="http://www.georss.org/georss" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
  <channel>
    <title>Nesterovsky bros - ML.NET</title>
    <link>http://www.nesterovsky-bros.com/weblog/</link>
    <description />
    <language>en-us</language>
    <copyright>Nesterovsky bros</copyright>
    <lastBuildDate>Fri, 01 Jan 2021 14:34:37 GMT</lastBuildDate>
    <generator>newtelligence dasBlog 2.3.12105.0</generator>
    <managingEditor>contact@nesterovsky-bros.com</managingEditor>
    <webMaster>contact@nesterovsky-bros.com</webMaster>
    <item>
      <trackback:ping>http://www.nesterovsky-bros.com/weblog/Trackback.aspx?guid=b69227a9-de75-4062-aa5b-82c8dff548e1</trackback:ping>
      <pingback:server>http://www.nesterovsky-bros.com/weblog/pingback.aspx</pingback:server>
      <pingback:target>http://www.nesterovsky-bros.com/weblog/PermaLink,guid,b69227a9-de75-4062-aa5b-82c8dff548e1.aspx</pingback:target>
      <dc:creator>Arthur Nesterovsky</dc:creator>
      <georss:point>0 0</georss:point>
      <wfw:commentRss>http://www.nesterovsky-bros.com/weblog/SyndicationService.asmx/GetEntryCommentsRss?guid=b69227a9-de75-4062-aa5b-82c8dff548e1</wfw:commentRss>
      <title>Collecting public data from Internet</title>
      <guid isPermaLink="false">http://www.nesterovsky-bros.com/weblog/PermaLink,guid,b69227a9-de75-4062-aa5b-82c8dff548e1.aspx</guid>
      <link>http://www.nesterovsky-bros.com/weblog/2021/01/01/CollectingPublicDataFromInternet.aspx</link>
      <pubDate>Fri, 01 Jan 2021 14:34:37 GMT</pubDate>
      <description>&lt;p&gt;
Earlier we wrote that we have recently taken on a few tasks related to Machine Learning.
A prerequisite for such tasks is collecting and preparing the input data. Usually the
required data is scattered across public sites. Some of it is in plain text format
(or close to it), but the rest is accessible only as the output of public web applications. To obtain
the required data from such sites you have to navigate through pages, which often requires
keeping state between navigations.
&lt;/p&gt;
&lt;p&gt;
To implement this task you need some kind of crawler/scraper for websites.
Fortunately, there are a lot of frameworks, libraries and tools in C# (and in other
languages too) that allow you to do this (visit &lt;a href="https://nugetmusthaves.com/Tag/crawler" target="_blank"&gt;this&lt;/a&gt; or &lt;a href="https://prowebscraper.com/blog/50-best-open-source-web-crawlers/" target="_blank"&gt;this&lt;/a&gt; site
to see the most popular of them), for example:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
ScrapySharp&lt;/li&gt;
&lt;li&gt;
Abot&lt;/li&gt;
&lt;li&gt;
HtmlAgilityPack&lt;/li&gt;
&lt;li&gt;
DotnetSpider&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
These libraries have pros and cons. The most crucial drawback is the lack of support
for rich UIs built on heavy client-side scripts and client-side state. Not all
such libraries fully emulate a browser, and some of them
do not support JavaScript execution at all. So they are suitable for gathering information from
simple web pages, but none of them makes it easy to navigate to a particular page of a web application
that keeps rich client-side state. Even the best of them, like ScrapySharp, require heavy
programming to achieve the result.
&lt;/p&gt;
&lt;p&gt;
Then, suddenly, we recalled that for several years we have been using Selenium
and web drivers to automate web tests for AngularJS/Angular projects. After a short
discussion we came to the conclusion that there is no big difference between testing a web
application and collecting data, since one of the testing stages is collecting the actual
results (data) from the tested page, and our tests usually consist of chains of actions
performed on consecutively visited pages.
&lt;/p&gt;
&lt;p&gt;
This led us to the idea of using the WebDriver API implemented by the &lt;a href="https://www.selenium.dev/documentation/en/" target="_blank"&gt;Selenium
project&lt;/a&gt;. There are implementations of this API in different languages,
including C#.
&lt;/p&gt;
&lt;p&gt;
Using WebDriver we can easily implement cumbersome navigation through a complex web application
and collect the required data. Moreover, WebDriver can run in headless (screenless)
mode. Some of its features allow you to take snapshots of the virtual screen and to store
the HTML source that results from JavaScript execution. These features are very
useful for run-time troubleshooting. To script navigation through a complex web application
we need only a bit more knowledge than an ordinary user of that application: we need to identify
page elements somehow, for example by CSS selectors or by the ids of HTML elements (as
we do for tests). All the rest, like cookies, view state (if any), values of
hidden fields and some JavaScript events, stays transparent to us in this case.
&lt;/p&gt;
&lt;p&gt;
Although one may say that the Selenium approach is rather heavyweight, it is worth mentioning
that it scales rather well. You may either run several threads with a separate WebDriver
instance in each thread, or run several processes simultaneously.
&lt;/p&gt;
&lt;p&gt;
However, besides the pros there are cons to the Selenium solution. They appear
when you decide to deploy it, e.g. to an Azure environment. Note that the Selenium
approach requires a browser on the server. We also suspect a non-technical side to this on Azure:
it is Microsoft's platform, while Selenium is most often paired with Chrome, a product of its competitor
Google... So, some issues aren't technical. The workable solution we see is to use
an IaaS approach (e.g. a virtual machine) instead of PaaS, but in this case you have to support everything
yourself...
&lt;/p&gt;
&lt;p&gt;
The other problem is that if your application crawls rather aggressively,
either the servers you gather data from or your own hosting provider might ban it. So, be gentle,
play nice, and implement delays between requests.
&lt;/p&gt;
&lt;p&gt;
Also, take into account that when you implement any crawler, some problems may
arise at the legal level, since not all websites allow you to pull anything you want. Many sites
publish terms &amp; conditions that define rules for the site's users (which your crawler should
follow too); otherwise legal action may be taken against them (or against their owners in the case
of a crawler). There is a &lt;a href="https://www.c-sharpcorner.com/article/web-crawling-with-c-sharp/" target="_blank"&gt;very
interesting article&lt;/a&gt; that describes many pitfalls of implementing your own crawler.
&lt;/p&gt;
&lt;p&gt;
To summarize everything we said above, the Selenium project can be used in many
scenarios, and one of them is building a powerful crawler.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.nesterovsky-bros.com/weblog/aggbug.ashx?id=b69227a9-de75-4062-aa5b-82c8dff548e1" /&gt;</description>
      <comments>http://www.nesterovsky-bros.com/weblog/CommentView,guid,b69227a9-de75-4062-aa5b-82c8dff548e1.aspx</comments>
      <category>ML.NET</category>
      <category>Thinking aloud</category>
      <category>Tips and tricks</category>
    </item>
    <item>
      <trackback:ping>http://www.nesterovsky-bros.com/weblog/Trackback.aspx?guid=638aefac-fd7a-431a-9d7f-0aeef885d9a5</trackback:ping>
      <pingback:server>http://www.nesterovsky-bros.com/weblog/pingback.aspx</pingback:server>
      <pingback:target>http://www.nesterovsky-bros.com/weblog/PermaLink,guid,638aefac-fd7a-431a-9d7f-0aeef885d9a5.aspx</pingback:target>
      <dc:creator>Arthur Nesterovsky</dc:creator>
      <georss:point>0 0</georss:point>
      <wfw:commentRss>http://www.nesterovsky-bros.com/weblog/SyndicationService.asmx/GetEntryCommentsRss?guid=638aefac-fd7a-431a-9d7f-0aeef885d9a5</wfw:commentRss>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>
          <img src="https://i.ytimg.com/vi/VJTpbZrBAeU/hqdefault.jpg" width="200" style="float: left; margin-right: 5px;" />Eventually
we started to deal with tasks that required machine learning. Thus, a good tutorial
for ML.NET was needed, and we <a href="https://youtu.be/VJTpbZrBAeU" target="_blank">found
this one</a>, which comes with <a href="https://github.com/jeffprosise/ML.NET" target="_blank">good,
simple code samples</a>. Thanks to Jeff Prosise. We hope it will be helpful to you too.
</p>
        <img width="0" height="0" src="http://www.nesterovsky-bros.com/weblog/aggbug.ashx?id=638aefac-fd7a-431a-9d7f-0aeef885d9a5" />
      </body>
      <title>ML.NET tutorial</title>
      <guid isPermaLink="false">http://www.nesterovsky-bros.com/weblog/PermaLink,guid,638aefac-fd7a-431a-9d7f-0aeef885d9a5.aspx</guid>
      <link>http://www.nesterovsky-bros.com/weblog/2020/12/16/MLNETTutorial.aspx</link>
      <pubDate>Wed, 16 Dec 2020 11:38:57 GMT</pubDate>
      <description>&lt;p&gt;
&lt;img src="https://i.ytimg.com/vi/VJTpbZrBAeU/hqdefault.jpg" width="200" style="float: left; margin-right: 5px;" /&gt;Eventually
we started to deal with tasks that required machine learning. Thus, a good tutorial
for ML.NET was needed, and we &lt;a href="https://youtu.be/VJTpbZrBAeU" target="_blank"&gt;found
this one&lt;/a&gt;, which comes with &lt;a href="https://github.com/jeffprosise/ML.NET" target="_blank"&gt;good,
simple code samples&lt;/a&gt;. Thanks to Jeff Prosise. We hope it will be helpful to you too.
&lt;/p&gt;
&lt;img width="0" height="0" src="http://www.nesterovsky-bros.com/weblog/aggbug.ashx?id=638aefac-fd7a-431a-9d7f-0aeef885d9a5" /&gt;</description>
      <comments>http://www.nesterovsky-bros.com/weblog/CommentView,guid,638aefac-fd7a-431a-9d7f-0aeef885d9a5.aspx</comments>
      <category>.NET</category>
      <category>ML.NET</category>
      <category>Tips and tricks</category>
    </item>
  </channel>
</rss>