By Nikolai Avteniev (Sr Software Engineer, LinkedIn)
LinkedIn is the professional profile of record for our 370M+ members globally, but many people don't realize the full potential of their LinkedIn profile, especially on mobile. Adding blogs, photos, and other rich content to your profile on a small-screen device can get tedious. That's why LinkedIn created Satori, a Hadoop-based tool that crawls the web and extracts data to discover members' professional content online. Satori uses machine learning techniques and leverages open source tools like Nutch and Gobblin to match members with relevant content and maximize their professional profiles. In this talk, Nikolai will share his experience building the product and discuss the challenges and opportunities encountered along the way.
6. What we thought we needed
The BIG Idea
Inspired by Hsieh, Jonathan M., Steven D. Gribble, and Henry M. Levy. "The Architecture and Implementation of an Extensible Web Crawler." NSDI 2010.
7. Questions we wanted to answer
Focused our Vision
Who would use this tool?
Do we need to crawl the entire web?
Do we need to process the pages nearline?
Where would we store this data?
How would we correct mistakes in the flow?
9. Virtually All Member Value Relies On Identity Data
[Diagram: an example member profile (Susan Kaplan, Sr. Marketing Manager at Weblo) surrounded by the products that identity data powers]
• SEARCH: Research & Contact
• AD TARGETING: Market Products & Services
• PYMK (People You May Know): Build Your Network
• RECRUITER: Recruit & Hire
• FEED: Get Daily News
• NETWORK: Keep in Touch
• RECOMMENDATIONS: Get a Job/Gig
• WVMP (Who's Viewed My Profile): Establish Yourself as an Expert
10. Identity Use Case
A smarter way to build your profile
• Suggest 1-click profile updates to members
• Using this, we can help members easily fill in profile gaps & get credit for certificates, patents, publications…
12. • Avg. HTML document transfer size is 6K; 37% are under 10K [1]
• Samza can handle 1.2M messages per second per node [2]
• Stream retention is limited: data is kept for only 7 to 30 days
• Most of the data is filtered out
• Need to bootstrap Samza stores
Not a perfect fit
1. HTML document transfer size: http://httparchive.org/interesting.php?a=All&l=Oct%2015%202015#bytesHtmlDoc
2. Feng, Tao. "Benchmarking Apache Samza: 1.2 million messages per second on a single node." https://engineering.linkedin.com/performance/benchmarking-apache-samza-12-million-messages-second-single-node
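A rough back-of-envelope check of why the fit was poor; a sketch only, where the 6K and 1.2M msg/s figures come from references [1] and [2] above, and the crawl volume is an illustrative assumption:

```python
# Back-of-envelope: could a Samza/Kafka pipeline hold full page bodies?
# Figures from refs [1] and [2]; the crawl volume is a made-up assumption.

AVG_HTML_BYTES = 6 * 1024          # avg HTML transfer size [1]
MSGS_PER_SEC_PER_NODE = 1_200_000  # Samza benchmark throughput [2]
RETENTION_DAYS = 7                 # lower bound of the retention window

# Throughput is not the problem: one node could in principle absorb
# ~6 KB * 1.2M msg/s ≈ 7 GB/s of page bodies.
bytes_per_sec = AVG_HTML_BYTES * MSGS_PER_SEC_PER_NODE
print(f"per-node ingest ceiling: {bytes_per_sec / 2**30:.1f} GiB/s")

# Retention is the real constraint: pages older than the window vanish,
# so a store fed only by the stream must be re-bootstrapped from bulk data.
pages_per_day = 10_000_000         # hypothetical crawl volume
retained_bytes = pages_per_day * RETENTION_DAYS * AVG_HTML_BYTES
print(f"pages retained at best: {pages_per_day * RETENTION_DAYS:,}")
print(f"≈ {retained_bytes / 2**40:.2f} TiB before the window rolls over")
```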
13. Help 400M members fully realize their professional identity on LinkedIn.
Find sources of professional content on the public internet.
Fetch the content, extract structured data, and match it to member profiles.
The Project: Satori
15. • Enterprise vs. social web use cases
• Web sources
• Wrappers
Web Data Extraction System
3. Ferrara, Emilio, et al. "Web data extraction, applications and techniques: A survey." Knowledge-Based Systems 70 (2014): 301-323.
17. Induce wrappers based on data [4]
Build wrappers that are robust [5]
Cluster similar pages by URL [6]
The web is huge and there are interesting things in the long tail [7]
Industrial Web Data Extraction
4. Dalvi, Nilesh, Ravi Kumar, and Mohamed Soliman. "Automatic wrappers for large scale web extraction." Proceedings of the VLDB Endowment 4.4 (2011): 219-230.
5. Dalvi, Nilesh, Philip Bohannon, and Fei Sha. "Robust web extraction: an approach based on a probabilistic tree-edit model." Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 2009.
6. Blanco, Lorenzo, Nilesh Dalvi, and Ashwin Machanavajjhala. "Highly efficient algorithms for structural clustering of large websites." Proceedings of the 20th International Conference on World Wide Web. ACM, 2011.
7. Dalvi, Nilesh, Ashwin Machanavajjhala, and Bo Pang. "An analysis of structured data on the web." Proceedings of the VLDB Endowment 5.7 (2012): 680-691.
19. HERITRIX powers archive.org
NUTCH powers Common Crawl
BUbiNG is part of LAW (the Laboratory for Web Algorithmics)
Scrapy is used within LinkedIn
The Contestants
8. Olston, C., and M. Najork. "Web crawling." Foundations and Trends in Information Retrieval, 2010.
9. Dan, A., and K. Michele. "An Introduction to Heritrix: An Open Source Archival Quality Web Crawler." 2004.
10. Boldi, P., A. Marino, M. Santini, and S. Vigna. "BUbiNG: massive crawling for the masses." 2014.
11. Khare, R., D. Cutting, K. Sitaker, and A. Rifkin. "Nutch: A Flexible and Scalable Open-Source Web Search Engine." CommerceNet Labs, CN-TR-04-04, November 2004.
22. • Built on Nutch 1.9
• Runs on Hadoop 2.3
• Scheduled to run every 5 hours
• Respects robots.txt
• Default crawl delay of 5 seconds (see the politeness sketch below)
Crawl Flow
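To make the politeness rules concrete, here is a minimal sketch of a fetcher that honors robots.txt and a per-site crawl delay, using only the Python standard library. The 5-second default mirrors the slide; the agent name and everything else are illustrative assumptions, not Satori's actual code (Satori relies on Nutch for this):

```python
import time
import urllib.request
from urllib import robotparser
from urllib.parse import urlparse

AGENT = "satori-demo-bot"      # hypothetical user-agent string
DEFAULT_DELAY = 5.0            # default crawl delay from the slide, in seconds

def polite_fetch(urls):
    """Fetch URLs from a single site, honoring robots.txt and crawl delay."""
    site = urlparse(urls[0])
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site.scheme}://{site.netloc}/robots.txt")
    rp.read()
    # Prefer the site's declared Crawl-delay; fall back to the default.
    delay = rp.crawl_delay(AGENT) or DEFAULT_DELAY
    pages = {}
    for url in urls:
        if not rp.can_fetch(AGENT, url):
            continue                         # robots.txt forbids this path
        req = urllib.request.Request(url, headers={"User-Agent": AGENT})
        with urllib.request.urlopen(req) as resp:
            pages[url] = resp.read()
        time.sleep(delay)                    # politeness pause between hits
    return pages
```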
23. • Output into target schema
• Apply XPath wrappers
• Wrappers are hierarchical mappings of schema fields to XPath expressions (see the sketch below)
• Indexed by data domain and data source
Extract Flow
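As an illustration of what such a wrapper might look like: a mapping from target-schema fields to XPath expressions, keyed by (data domain, data source). The schema, field names, site, and XPaths below are invented for the example; lxml stands in for whatever extraction library the real flow uses:

```python
from lxml import html

# Hypothetical wrapper registry, indexed by (data domain, data source).
# Each wrapper maps a target-schema field to an XPath expression.
WRAPPERS = {
    ("publications", "example-journal.org"): {
        "title":   "//h1[@class='article-title']/text()",
        "authors": "//span[@class='author']/text()",
        "date":    "//time[@itemprop='datePublished']/@datetime",
    },
}

def extract(domain, source, page_html):
    """Apply the registered wrapper to a page, emitting a schema record."""
    wrapper = WRAPPERS[(domain, source)]
    tree = html.fromstring(page_html)
    record = {}
    for field, xpath in wrapper.items():
        values = tree.xpath(xpath)
        # Multi-valued fields (authors) keep the list; others take first match.
        record[field] = values if field == "authors" else (values[0] if values else None)
    return record
```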
24. Crawl rate is bound by the number of sites and the per-site crawl delay.
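A quick worked example of that bound; the 5-second delay comes from the crawl-flow slide, while the site count is a hypothetical:

```python
# Upper bound on crawl throughput: with one in-flight request per site,
# each site yields at most 1/delay pages per second, regardless of cluster size.
crawl_delay_s = 5          # per-site politeness delay (from the slide)
num_sites = 1_000          # hypothetical number of sites being crawled

pages_per_site_per_hour = 3600 / crawl_delay_s        # = 720
total_pages_per_hour = num_sites * pages_per_site_per_hour
print(f"{total_pages_per_hour:,.0f} pages/hour at best")  # 720,000
```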
25. Common Crawl is a great source: https://commoncrawl.org/
Gobblin is a great ingestion framework: https://github.com/linkedin/gobblin
Bootstrap From Bulk Sources
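As a sketch of bootstrapping from a bulk source, the snippet below pulls HTML pages out of a downloaded Common Crawl WARC segment using the open source warcio library. The segment filename is illustrative, and Satori's actual ingestion used Gobblin rather than a hand-rolled loop like this:

```python
from warcio.archiveiterator import ArchiveIterator

# Illustrative local path to one downloaded Common Crawl WARC segment.
WARC_PATH = "CC-MAIN-20151001-00000.warc.gz"

def iter_html_pages(path):
    """Yield (url, body_bytes) for HTML response records in a WARC file."""
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):   # handles .gz transparently
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            yield url, record.content_stream().read()

for url, body in iter_html_pages(WARC_PATH):
    print(url, len(body))
```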
31. Match using global identifiers such as email or full name.
The data might not be clean after extraction.
Start with a small set of data and get it to users quickly.
Start Simple
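A minimal sketch of that first-pass matcher, assuming normalized email and full-name keys; the record shapes and the norm helper are invented for illustration:

```python
def norm(s):
    """Cheap normalization: extracted data may be messy, so lowercase and
    collapse whitespace before comparing keys."""
    return " ".join(s.lower().split()) if s else None

def match(content_items, members):
    """First pass: join extracted content to members on email, then full name."""
    by_email = {norm(m["email"]): m for m in members if m.get("email")}
    by_name = {norm(m["full_name"]): m for m in members}
    matches = []
    for item in content_items:
        member = (by_email.get(norm(item.get("author_email")))
                  or by_name.get(norm(item.get("author_name"))))
        if member:
            matches.append((item, member))
    return matches
```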
32. Narrow the candidates with LSH [1]
Use the simple model to generate the ground truth
Train using a simple algorithm and a few hundred features
Keep It Simple
1. https://en.wikipedia.org/wiki/Locality-sensitive_hashing
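To make the LSH step concrete, here is a hedged sketch using the open source datasketch library's MinHash LSH. The trigram features, threshold, and sample names are illustrative, not Satori's actual pipeline; the point is that a query returns only near-duplicate candidates instead of scoring against every member:

```python
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    """Build a MinHash signature from a set of string tokens."""
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

def trigrams(name):
    """Character trigrams of a name; an illustrative feature choice."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Index members by name trigrams (toy data).
members = {"m1": "Nikolai Avteniev", "m2": "Susan Kaplan"}
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for member_id, name in members.items():
    lsh.insert(member_id, minhash(trigrams(name)))

# At match time, only these candidates are scored by the full model.
candidates = lsh.query(minhash(trigrams("N. Avteniev")))
print(candidates)  # candidate set should include "m1"
```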