Scraping a website once is not a challenge; doing it at scale and continuously can be very hard. In this deck for my talk at Web Extract Summit 2020 I've tried to highlight some of the challenges, and some best practices to follow, when you run a business based on web-scraped data.
2. Who I am
Pierluigi Vinciguerra, from Milan, Italy, co-founder of Re Analytics.
I've spent all my working life managing data, from business intelligence projects in consultancy to roles in industry.
Five years ago I started my adventure at Re Analytics, where I oversee the whole data acquisition process for the websites we scrape.
3. What we do at Re Analytics
We call ourselves a data boutique because we sell meaningful data and insights for selected industries (luxury goods and travel at the moment).
We start from web scraping at large scale, on the order of billions of data points per month; we process the data, integrate it with other sources and with our knowledge of the industry, and on top we apply algorithms and AI to extract insights to sell.
4. Why on this virtual stage
Reaching this scale of scraping, with our current level of efficiency, has not been easy.
We have more than 5,000 scrapers written and countless hours spent on Scrapy and web scraping in general.
But that's only the beginning: without the processes and the automation we've built over these years, we simply could not scale to these levels, because of all the changes and anti-bot mechanisms our sources implement every day.
Today I want to share with you some of the most important lessons we've learned on our journey to becoming what we are today.
5. Our journey
In 2010, my friend Andrea and I were intrigued by the amount of data passing through the web with no one collecting and analyzing it.
We started with dozens of scrapers in C++ ( !! ) for the real estate market.
They were hell to maintain, but we built an MVP and started learning what it means to do web scraping.
After founding Re Analytics in 2015, scalability became one of our main problems: we could not grow to have enough interesting data for our customers unless we created the right processes and automation steps, used the proper tools and radically changed the way we did web scraping.
6. So, data is the new gold
We've heard many times that data is the new oil or gold…
…. but web scraping is also much more similar to the gold mining process than we might think.
It starts with finding the right spot to dig (a problem to solve for an industry); then we begin digging (scraping), breaking the rocks we collect and separating the gold nuggets from them (storing and cleaning data). From the single nuggets we then create ingots to sell (cleaned datasets) or, if we want to do more, even jewels (insights from data).
7. Find the right spot to start
Like every new startup or business, your company must find a problem to solve before starting web scraping in a massive way.
It could be a niche problem or a vast one, but your solution should be something original (please, not another Amazon price tracker) that, leveraging web-scraped data, gives your customers a solution much better than the current one.
Try to involve industry experts, to better understand what really needs to be extracted and at which frequency.
8. It’s a long road to the gold
It will take time to structure a fully scalable and modern web data integration pipeline.
The sooner you understand what data the industry needs, the sooner you can be sure you're scraping your sources correctly.
You don't want to waste weeks or months scraping partial or worthless data.
9. So, let’s start digging
We need our shovels and excavators to start digging.
We can choose between third-party tools like import.io or packages like Scrapy.
In our experience, for web scraping at massive scale, code is preferred for its flexibility: it allows us to face all the possible challenges we encounter. Given this opportunity, we also write our own scheduling program, so that for each website the scraper knows where to be executed and with which set of options.
The con of this approach is a high risk of ending up with a spaghetti code base; nowadays there are many tools, like Crawlera or Airflow, that simplify some of these tasks.
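As a rough illustration of what such a scheduling layer can look like, here is a minimal sketch in Python. All names, fields and job entries are hypothetical, not Re Analytics' actual system:

```python
from dataclasses import dataclass, field

@dataclass
class ScrapeJob:
    """One website's scraping configuration (all names are illustrative)."""
    website: str
    target_cloud: str   # which cloud environment runs this spider
    schedule_cron: str  # when to run it, in cron syntax
    options: dict = field(default_factory=dict)  # per-site spider settings

# A hypothetical registry the launcher iterates over.
JOBS = [
    ScrapeJob("example-shop.com", target_cloud="aws-eu",
              schedule_cron="0 3 * * *",
              options={"use_residential_proxy": False}),
    ScrapeJob("example-travel.com", target_cloud="gcp-us",
              schedule_cron="0 */6 * * *",
              options={"use_residential_proxy": True, "render_js": True}),
]

def jobs_for_cloud(cloud: str) -> list:
    """Return the jobs a given cloud environment should launch."""
    return [job for job in JOBS if job.target_cloud == cloud]
```

Keeping this registry as data rather than hard-coding it into each spider is one way to avoid the spaghetti-code trap the slide mentions.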
10. Sometimes it's not easy to reach the gold
We might find some blocks of hard rock on our path. That's why we should always have some TNT in our pockets.
Some useful tools we have to avoid blocks while scraping are:
• Multi-cloud environments
• IP rotation via proxy services
• Captcha breakers
• Headless browsers with JavaScript support (Splash)
While using these tools, and during the whole web scraping process, you should always be aware of the terms of use of the tools themselves and of the website you're going to scrape, and check whether the industry you're selling your data to bans some specific techniques.
11. Our approach to ethical web scraping
In finance, where most of our customers come from, the rules for data sourcing are stricter than in other sectors, so we decided
to adhere to the Investment Data Standards Organization's Best Practices on Web Scraping.
• Browse-wrap terms of use are OK, click-wrap ones are not
• Follow what the robots.txt file says
• Web scraping should not interfere with the website's operations
• Do not make copies of the website
• Scraped information should be public
• Use APIs when possible
• Don't scrape a website to gain a competitive edge over it
In any case, all of the techniques and tools we'll see in the next slides should be used in compliance with the Terms of Use of
the target website.
The full document can be found at this link: https://www.investmentdata.org/publications
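Following robots.txt can be automated before a URL is ever queued. A minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt content and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, user_agent: str = "*") -> bool:
    """Check a URL against the parsed robots.txt before scheduling it."""
    return parser.can_fetch(user_agent, url)
```

In a real crawler you would fetch each site's live robots.txt (e.g. with `RobotFileParser.set_url()` and `read()`) instead of parsing a hard-coded string.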
12. Multi cloud environment
Web scraping at large scale can't rely on one or a few machines in a local environment.
A cloud environment, with at least one vendor, is a must for every project: it allows you to multiply the IPs used by your scrapers.
A multi-cloud environment with several vendors is highly recommended, because it's not uncommon to find websites that block requests coming from one provider or another.
13. IP rotation via proxies
Strictly tied to the multi-cloud environment, IP rotation via proxies is the other way we can rotate IPs during our crawls.
Tools like Crawlera, Luminati and many others do an excellent job at it.
On top of that, some websites, without worrying about SEO, block all requests coming from datacenter IPs; this is why we'll also need proxy providers that let us use residential proxies.
There are also websites that geolocate your IP to show different versions of themselves, so geographical proxies are also needed to see the version we want.
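The routing logic itself can be as simple as a round-robin over a proxy pool, with an escape hatch for sites that need residential IPs. A minimal stdlib sketch (the proxy endpoints are placeholders, not real providers):

```python
from itertools import cycle

# Hypothetical proxy endpoints; a real pool would come from a proxy provider.
DATACENTER_PROXIES = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]
RESIDENTIAL_PROXIES = ["http://res-gw.example:9000"]

_rotation = cycle(DATACENTER_PROXIES)

def pick_proxy(needs_residential: bool = False) -> str:
    """Return the proxy endpoint to use for the next request.

    Sites that block datacenter IPs get the residential gateway;
    everything else round-robins over the datacenter pool.
    """
    if needs_residential:
        return RESIDENTIAL_PROXIES[0]
    return next(_rotation)
```

Services like Crawlera hide this rotation behind a single endpoint; the sketch just shows the idea.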
14. Captcha Breakers and Javascript
Some websites (typically behind some configuration of CloudFront) ask you to solve a JavaScript challenge before they can be scraped.
Scrapinghub developed Splash for this case (and we are all grateful for it).
Captchas, instead, are really bad, because there's no simple way to handle them.
They can be broken in two ways:
• Third-party services that use humans to solve captchas and send back the result. They have high latency and, personally, I have some concerns about the working conditions in these places.
• OCR software; I don't personally know of an out-of-the-box solution that works for every kind of challenge a captcha raises.
The obvious solution is: do your best not to trigger captchas.
15. We have the rocks, now what?
At the end of the scraping process, our data should be stored in a database (except in some rare cases).
Of course, depending on the type of data you're scraping, there will be a database most suitable for your needs.
Be aware that data doesn't only need to be stored: before becoming valuable, it must continue its road along the value pipeline.
16. Turning rocks into gold
A single data snapshot from a website has no value by itself; it needs to pass some other steps before becoming valuable.
• We must check its completeness: did we crawl the whole website / all the information we need?
• We must check its correctness: is the data we downloaded correct?
• Taking the two steps above for granted, the data needs to enter a standardization process, to blend with other data in our database or in the database of the final customer.
• Standardizing is not enough. We also need an enrichment process, so we can speak the same language as our customers.
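The completeness and correctness checks in the first two bullets can be sketched as a small validation function; the threshold and field names below are illustrative assumptions, not actual rules:

```python
def check_snapshot(records: list, expected_min_rows: int) -> list:
    """Run basic completeness and correctness checks on one day's scrape.

    Field names and thresholds are illustrative examples.
    """
    issues = []
    # Completeness: did we get roughly the row count we expect from this site?
    if len(records) < expected_min_rows:
        issues.append(f"only {len(records)} rows, expected >= {expected_min_rows}")
    # Correctness: spot-check field-level constraints on each record.
    for i, rec in enumerate(records):
        if not rec.get("product_name"):
            issues.append(f"row {i}: missing product_name")
        if not isinstance(rec.get("price"), (int, float)) or rec["price"] <= 0:
            issues.append(f"row {i}: invalid price {rec.get('price')!r}")
    return issues
```

An empty issue list means the snapshot can move to the next pipeline step; a non-empty one is the flag that gets assigned to whoever owns that website.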
17. Checking the data
At Re Analytics our approach is a mix of automation and human checks.
Each data ingestion is programmatically controlled to check its completeness according to our scope.
If something is missing, we raise a flag signaling it and assign it to the person in charge of that website.
We integrate our data quality algorithms with manual checks, so in most cases we can also catch partial data ingestions.
It's a never-ending learning curve, where automation, industry knowledge and the human touch go hand in hand.
18. Enriching data
Now that we're sure the data is correct and clean, we need to enrich it with features external to the original data source (in our case, for example, stock tickers, ISO currency codes, ISO country codes, business-related segmentation).
Without the proper segmentation coming from this so-called “small data”, the data we're amassing in the database is almost unusable for the final user.
Users should be familiar with the language they will see in the dataset, and it should be easy for them to integrate it with other datasets; that's why it's crucial to standardize our data.
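A toy version of such an enrichment step joins scraped records against "small data" lookup tables. The tables below are small illustrative examples, not a real reference dataset:

```python
# Hypothetical lookup tables ("small data") used to enrich scraped records.
CURRENCY_ISO = {"€": "EUR", "$": "USD", "£": "GBP"}
BRAND_TICKER = {"Gucci": "KER.PA", "Moncler": "MONC.MI"}  # illustrative pairs

def enrich(record: dict) -> dict:
    """Attach ISO currency codes and stock tickers to one scraped record."""
    out = dict(record)  # never mutate the raw snapshot
    out["currency_iso"] = CURRENCY_ISO.get(record.get("currency_symbol"), "UNKNOWN")
    out["ticker"] = BRAND_TICKER.get(record.get("brand"))
    return out
```

Keeping the lookups as separate tables means the "language" of the output dataset can be updated without re-scraping anything.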
19. Trends
Point-in-time data snapshots can be enough for some businesses, but for most of them what really matters are trends over time.
You don't need to be able to scrape a website only once. You need to be able to scrape it every time you need it (be it every day, every hour or every week).
The higher the frequency, the more important it is to have a strong and timely process like the one described up to now.
20. Publishing
Not all the extracted data can be published. Even with the most advanced acquisition pipeline, you'll find some gaps or incomplete data.
Having automatic fill-forward/backward rules, accepted by customers, will greatly improve the overall data quality of the output dataset.
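A simplified sketch of what such a fill-forward/backward rule can look like on a daily series (the exact rules would of course be agreed with customers, and a real pipeline might use pandas instead):

```python
def fill_gaps(series: list) -> list:
    """Fill missing daily values (None): forward-fill from the last
    observation, then back-fill any leading gaps from the first one."""
    filled = list(series)
    last = None
    for i, value in enumerate(filled):        # forward fill
        if value is None:
            filled[i] = last
        else:
            last = value
    for i in range(len(filled) - 2, -1, -1):  # back-fill leading gaps
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled
```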
After the publishing phase, the value pipeline is over and the ingested data can be sold.
21. The value pipeline
[Hypothetical architecture diagram: websites → scrapers, run by a launcher across a multi-cloud environment with proxies → database, where data moves through staging area → cleaning → enrichment → publishing, with algorithms applied along the way]
The value of the data increases at each step in the database. Single snapshots could have some value in some cases, but cleaning them, enriching them and adding them to the historical trend definitely adds more value.
22. Selling gold bars or jewels?
The value pipeline has produced a sellable dataset, but we can do more.
We can add a presentation layer, where we highlight insights that emerge from the data, ML algorithms to make forecasts, or anything useful for the industry we're operating in.
This will mean higher margins and a stronger asset, more difficult to copy…
…. but it's also a completely different job, and it should be done by another department of the company.
23. Key Takeaways
• Start with a validated MVP and then scale up massively
• Interact with business experts to understand their needs
• Web scraping at large scale is complex, but it's only the first step of the value pipeline
• It's a British bulldog game. Be sure to use the latest techniques so you don't get cut off from the websites you're scraping
• Log everything, to be sure you're not losing something in the pipeline
• Also be sure to take all the precautions needed to scrape your target websites legally, without disturbing their operations and business
• Scaling requires processes, templates and automation at every step of the value pipeline. We don't dig for gold with pickaxes anymore, we use machines. Use machines for the tasks in your company too, and don't give people tasks that should be automated
• Don't let dirty data make it into your dataset: it will undermine your reputation. Use a publishing process to sell a stable and reliable flow of data
24. Thank you!
If you want to share your experiences with me, or ask me anything beyond this webinar, please feel free to add me on LinkedIn or write to me at pierluigi.vinciguerra@re-analytics.com