Pierluigi Vinciguerra
Running a Business on Web Scraped Data
If data is the new gold, then web scraping at scale is like gold mining
Who I am
Pierluigi Vinciguerra, from Milan, Italy, co-founder of Re Analytics.
I've spent my whole working life managing data, from business intelligence projects in consultancy to roles in industry.
Five years ago I started my adventure at Re Analytics, where I oversee the entire data acquisition process for the websites we scrape.
What we do at Re Analytics
We call ourselves a data boutique because we sell meaningful data and insights for selected industries (luxury goods and travel at the moment).
We start from large-scale web scraping, on the order of billions of data points per month; we process the data, integrate it with other sources and with our knowledge of the industry, and on top we apply algorithms and AI to extract insights to sell.
Why on this virtual stage
Reaching this scale of scraping, at our current level of efficiency, has not been easy.
We have written more than 5,000 scrapers and have spent a huge number of hours on Scrapy and web scraping in general.
But that's only the beginning: without the processes and automation we've built over these years, we simply could not have scaled to these levels, given all the changes and anti-bot mechanisms our sources roll out every day.
Today I want to share with you some of the most
important lessons we’ve learnt during our journey to
becoming what we are today.
Our journey
In 2010, my friend Andrea and I were intrigued by the amount of data flowing through the web without anyone collecting and analyzing it.
We started with dozens of scrapers written in C++ ( !! ) for the real estate market.
They were hell to maintain, but we built an MVP and started learning what it really means to do web scraping.
After founding Re Analytics in 2015, scalability became the main problem: we could not grow and offer enough interesting data to our customers unless we created the right processes and automation steps, used the proper tools, and radically changed the way we did web scraping.
So, data is the new gold
We’ve heard many times that data is new oil or
gold….
…. but also web scraping is much more similar to
gold mining process than we might think.
It starts finding the right spot to start (a problem to
solve for an industry), then we begin digging
(scraping), breaking the rocks we collect and
separate them from gold nuggets (store and clean
data). With all the single nuggets we create then
some ingot to sell (cleaned datasets) or, in case we
want to do more, even jewels (insights from data).
Find the right spot to start
Like every new startup or business, your company must find a problem to solve before starting to web scrape at scale.
It could be a niche problem or a vast one, but your solution should be something original (please, not another Amazon price tracker) that, by leveraging web-scraped data, gives your customer something much better than the current solution.
Try to involve industry experts to better understand what really needs to be extracted, and at which frequency.
It’s a long road to the gold
It will take time to structure a fully scalable and modern web data integration pipeline.
The sooner you understand what data the industry needs, the sooner you can be sure you're scraping your sources correctly.
You don't want to waste weeks or months scraping partial or worthless data.
So, let’s start digging
We need our shovels and excavators to start digging.
We can choose between third-party tools like import.io or packages like Scrapy.
In our experience, for massive-scale web scraping, code is preferable for its flexibility: it allows us to face all the possible challenges we encounter. Taking advantage of this, we also write our own scheduling program, so the scraper for each website knows where it should be executed and with which set of options.
The downside of this approach is a high risk of ending up with a spaghetti-code codebase; nowadays there are tools like Crawlera or Airflow that simplify some of these tasks.
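To make the code-first approach concrete, here is a minimal sketch of the kind of Scrapy spider we mean; the target site, CSS selectors and item fields are hypothetical placeholders, not one of our actual sources.

```python
import scrapy


class ProductSpider(scrapy.Spider):
    """Minimal code-based scraper; the URL, selectors and fields are assumptions."""
    name = "product_spider"
    start_urls = ["https://www.example.com/catalog"]  # placeholder target

    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,   # be gentle with the target website
        "ROBOTSTXT_OBEY": True,  # respect robots.txt (see the ethics slide)
    }

    def parse(self, response):
        # One item per product card; the CSS classes are illustrative.
        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy crawl product_spider -o items.json`; an external launcher can then decide when and where each of these spiders is executed.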
Sometimes it's not easy to reach the gold
We might find some blocks of hard rock on our path. That's why we should always have some TNT in our pockets.
Some useful tools to avoid blocks while scraping are:
• Multi-cloud environment
• IP rotation via proxy services
• Captcha breakers
• Headless browsers with JavaScript rendering (Splash)
When using these tools, and throughout the whole web scraping process, you should always be aware of the terms of use of the tools themselves and of the website you're going to scrape, and of whether the industry you're trying to sell your data to bans some specific techniques.
Our approach to ethical web scraping
In finance, where most of our customers come from, the rules for data sourcing are stricter than in other sectors, so we decided
to adhere to the Investment Data Standards Organization Best Practices on Web scraping.
• Browse-wrap terms of use are OK, click-wrap NO
• Follow what the robots.txt file says
• Web scraping should not interfere with the website’s operations
• Do not make copies of the website
• Scraped information should be public
• Use APIs if possible
• Don’t scrape a website to gain a competitive edge on it
In any case, all of the techniques and tools we'll see in the next slides should be used in compliance with the Terms of Use of the target website.
The full document can be found at this link: https://www.investmentdata.org/publications
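As a small sketch of putting the robots.txt rule into practice: Scrapy can enforce it with the ROBOTSTXT_OBEY setting, and outside Scrapy the Python standard library's urllib.robotparser can be used before scheduling a URL; the user agent string and URLs below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Check whether a URL may be fetched before scheduling it.
# The user agent string and target URLs are illustrative placeholders.
USER_AGENT = "ReAnalyticsBot"

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()

url = "https://www.example.com/catalog/page-1"
if robots.can_fetch(USER_AGENT, url):
    print(f"Allowed to fetch {url}")
else:
    print(f"robots.txt disallows {url}, skipping it")
```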
Multi-cloud environment
Web scraping at large scale can't rely on one or a few machines in a local environment.
A cloud environment with at least a single vendor is a must for every project: it allows you to multiply the IPs used by the scrapers.
A multi-cloud environment with several vendors is highly recommended, because it's not uncommon to find websites that block requests coming from one provider or another.
IP rotation via proxies
Strictly tied to the multi-cloud environment, proxy-based IP rotation is the other way we can rotate IPs during our crawling.
Tools like Crawlera, Luminati and many others do an excellent job at it.
On top of that, some websites, not caring about SEO, block all requests coming from datacenter IPs; this is why we also need proxy providers that offer residential proxies.
There are also websites that geolocate your IP and show different versions of themselves, so geo-targeted proxies are needed to see the version we want.
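A minimal sketch of per-request IP rotation as a Scrapy downloader middleware; the proxy endpoints and project module are hypothetical, and commercial services such as Crawlera usually ship their own ready-made middleware instead.

```python
import random


class RotatingProxyMiddleware:
    """Assign a random proxy to every outgoing request.

    Sketch only: the endpoints below are placeholders for datacenter,
    residential and geo-targeted proxies from a proxy provider.
    """

    PROXIES = [
        "http://user:pass@dc-proxy.example.com:8000",    # datacenter
        "http://user:pass@resi-proxy.example.com:8000",  # residential
        "http://user:pass@it-proxy.example.com:8000",    # geo-targeted (Italy)
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)


# Enable it in settings.py (module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}
```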
Captcha breakers and JavaScript
Some websites (typically with some configuration of CloudFront) ask you to solve a JavaScript challenge before they can be scraped. Scrapinghub developed Splash for this case (and we're all grateful for it).
Captchas, instead, are really bad, because there's no simple way to handle them. They can be broken in two ways:
• Third-party services that use humans to solve captchas and send back the result. They have high latency, and personally I have some concerns about worker conditions in these places.
• OCR software; I personally don't know of an out-of-the-box solution that works for every kind of challenge a captcha can raise.
The obvious solution is: do your best not to trigger captchas.
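For the JavaScript-rendering side, here is a minimal sketch using the scrapy-splash plugin against a Splash instance assumed to run locally on port 8050; the target URL is a placeholder, and the full middleware configuration from the scrapy-splash README is also required.

```python
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash


class JsRenderedSpider(scrapy.Spider):
    """Render JavaScript pages through Splash before parsing them."""
    name = "js_rendered"

    # Assumes a local Splash, e.g. `docker run -p 8050:8050 scrapinghub/splash`;
    # the scrapy-splash downloader/spider middlewares must also be enabled.
    custom_settings = {"SPLASH_URL": "http://localhost:8050"}

    def start_requests(self):
        # Placeholder URL for a JavaScript-heavy page.
        yield SplashRequest(
            "https://www.example.com/js-heavy-page",
            callback=self.parse,
            args={"wait": 2},  # give the page time to run its scripts
        )

    def parse(self, response):
        # The response body now contains the rendered HTML.
        yield {"title": response.css("title::text").get()}
```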
We have rocks, now what?
At the end of the scraping process, our data should be stored in a database (except in some rare cases).
Of course, depending on the type of data you're scraping, there will be a database that best suits your needs.
Be aware that data doesn't just need to be stored: before becoming valuable, it must continue its journey along the value pipeline.
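A minimal sketch of how scraped items can land in a staging table through a Scrapy item pipeline; SQLite keeps the example self-contained, and the table layout and fields are assumptions.

```python
import sqlite3


class StagingAreaPipeline:
    """Write every scraped item into a raw staging table for later cleaning.

    Sketch only: SQLite is used to keep the example self-contained; in
    production a server database would be used and the columns would differ.
    """

    def open_spider(self, spider):
        self.conn = sqlite3.connect("staging.db")
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS raw_items (
                   spider TEXT, scraped_at TEXT, name TEXT, price TEXT, url TEXT
               )"""
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO raw_items VALUES (?, datetime('now'), ?, ?, ?)",
            (spider.name, item.get("name"), item.get("price"), item.get("url")),
        )
        return item


# Enable it in settings.py (module path is hypothetical):
# ITEM_PIPELINES = {"myproject.pipelines.StagingAreaPipeline": 300}
```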
Turning rocks into gold
A single data snapshot from a website has no value by itself; it needs to go through a few more steps before becoming valuable.
• We must check its completeness: did we crawl the whole website / all the information we need?
• We must check its correctness: is the data we downloaded correct?
• Taking the two previous steps for granted, the data then needs to go through a standardization process, so it can blend with the other data in our database or in the database of the final customer.
• Standardizing is not enough: we also need an enrichment process, so we can speak the same language as our customers.
Checking the data
At Re Analytics our approach is a mix of automated and human checks.
Each data ingestion is programmatically checked for completeness against our scope.
If something is missing, we raise a flag and assign it to the person in charge of that website.
We complement our data quality algorithms with manual checks, so in most cases we can also catch partial data ingestions.
It's a never-ending learning curve, where automation, industry knowledge and the human touch go hand in hand.
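A minimal sketch of such an automated completeness check: today's row count is compared against the average of the previous week, and a flag is raised when it drops too far. The table matches the staging sketch above, and the 90% threshold is an arbitrary assumption to tune per website.

```python
import sqlite3


def check_completeness(db_path="staging.db", min_ratio=0.9):
    """Raise a flag when today's ingestion is much smaller than the recent baseline."""
    conn = sqlite3.connect(db_path)
    today = conn.execute(
        "SELECT COUNT(*) FROM raw_items WHERE date(scraped_at) = date('now')"
    ).fetchone()[0]
    baseline = conn.execute(
        """SELECT AVG(daily) FROM (
               SELECT COUNT(*) AS daily FROM raw_items
               WHERE date(scraped_at) >= date('now', '-7 day')
                 AND date(scraped_at) < date('now')
               GROUP BY date(scraped_at)
           )"""
    ).fetchone()[0]
    conn.close()

    if baseline and today < baseline * min_ratio:
        # In production this would open a task for the person in charge of the website.
        print(f"ALERT: only {today} rows vs ~{baseline:.0f} expected, flag for review")
    else:
        print(f"OK: {today} rows ingested")
```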
Enriching data
Now that we're sure the data is correct and clean, we need to enrich it with features external to the original data source (in our case, for example, stock tickers, ISO currency codes, ISO country codes, and business-related segmentation).
Without the proper segmentation coming from this so-called "small data", the data we're amassing in the database is almost unusable to the final user.
Users should be familiar with the language they will see in the dataset, and it should be easy for them to integrate it with other datasets; that's why it's crucial to standardize our data.
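A minimal sketch of this enrichment step with pandas: a small hand-curated lookup table of "small data" is joined onto the cleaned rows; the brands, tickers and segments below are hypothetical examples.

```python
import pandas as pd

# Cleaned rows as they come out of the previous step (values are illustrative).
scraped = pd.DataFrame({
    "brand": ["BrandA", "BrandB"],
    "country": ["Italy", "France"],
    "currency": ["EUR", "EUR"],
})

# Hand-curated "small data": the mappings below are hypothetical examples.
reference = pd.DataFrame({
    "brand": ["BrandA", "BrandB"],
    "stock_ticker": ["BRA.MI", "BRB.PA"],
    "segment": ["Leather goods", "Apparel"],
})
country_iso = {"Italy": "IT", "France": "FR"}

# Enrich: join brand-level features and map country names to ISO codes.
enriched = scraped.merge(reference, on="brand", how="left")
enriched["country_iso"] = enriched["country"].map(country_iso)
print(enriched)
```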
Trends
Point-in-time data snapshots can be enough for some businesses, but for most of them what really matters are trends over time.
It's not enough to be able to scrape a website once: you need to be able to scrape it every time you need it, be it every day, every hour or every week.
The higher the frequency, the more important it is to have a strong and timely process like the one described so far.
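A minimal sketch of the launcher side of this: a per-website schedule (the spider names and frequencies are assumptions) is checked in a loop and `scrapy crawl` is started for whatever is due; our real scheduler also decides on which machine each crawl runs and with which options.

```python
import subprocess
import time
from datetime import datetime, timedelta

# Per-website schedule; spider names and frequencies are hypothetical.
SCHEDULE = {
    "product_spider": timedelta(days=1),
    "js_rendered": timedelta(hours=6),
}
last_run = {name: datetime.min for name in SCHEDULE}


def run_due_spiders():
    """Launch every spider whose interval has elapsed since its last run."""
    now = datetime.now()
    for name, interval in SCHEDULE.items():
        if now - last_run[name] >= interval:
            # Fire-and-forget here; in production exit codes and logs are tracked.
            subprocess.Popen(["scrapy", "crawl", name])
            last_run[name] = now


if __name__ == "__main__":
    while True:
        run_due_spiders()
        time.sleep(60)  # re-check the schedule once a minute
```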
Publishing
Not all of the extracted data can be published: even with the most advanced acquisition pipeline, you'll find some gaps or incomplete data.
Having automatic fill-forward/fill-backward rules, accepted by customers, will greatly improve the overall data quality of the output dataset.
After the publishing phase, the value pipeline for the ingested data is complete and the data can be sold.
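A minimal sketch of such a fill rule with pandas: a daily series with a gap (the values are illustrative) is filled forward, then backward, before being published.

```python
import pandas as pd

# Daily price series with missing days in the middle (values are illustrative).
prices = pd.Series(
    [100.0, None, None, 103.0, 104.0],
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
    name="price",
)

# Fill forward first (carry the last known value), then fill backward
# to cover any gap left at the very start of the series.
published = prices.ffill().bfill()
print(published)
```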
The value pipeline (hypothetical architecture)
[Architecture diagram: websites are crawled by scrapers driven by a launcher, running in a multi-cloud environment behind proxies; the data lands in a database and flows through a staging area, cleaning, enrichment and publishing steps, supported by algorithms.]
The value of the data increases at each step in the database. Single snapshots can have some value in some cases, but cleaning them, enriching them and adding them to the historical trend definitely adds more value.
Selling gold bars or jewels?
The value pipeline has produced a sellable dataset, but we can do more.
We can add a presentation layer where we highlight insights that emerge from the data, ML algorithms to make forecasts, or anything else useful for the industry we're operating in.
This means higher margins and a stronger asset, more difficult to copy...
... but it's also a completely different job, and it should be done by another department of the company.
Key Takeaways
• Start with a validated MVP and then scale up massively
• Interact with business experts to understand their needs
• Web scraping at large scale is complex, but it's only the first step of the value pipeline
• It's a game of British bulldog. Be sure to use the latest techniques so you don't get cut off from the websites you're scraping
• Log everything, to be sure you're not losing something in the pipeline
• Also be sure to take every precaution to scrape your target websites legally, without disturbing their operations and business
• Scaling requires processes, templates and automation at every step of the value pipeline. We don't dig for gold with a pickaxe anymore, we use machines. Use machines for the tasks in your company too, and don't give people tasks that should be automated
• Don't let dirty data make it into your dataset: it will undermine your reputation. Use a publishing process to sell a stable and reliable flow of data
Thank you!
If you want to share your experiences with me, or ask me something beyond this webinar, please feel free to add me on LinkedIn or write to me at pierluigi.vinciguerra@re-analytics.com