Context from Big Data
Startup Showcase
IEEE Big Data Conference
November 1, 2015
Santa Clara, CA
Delroy Cameron, Data Scientist
@urxtech | urx.com | research@urx.com
Who is URX?
URX is a mobile technology platform that focuses on publisher monetization,
content distribution, and user engagement.
People
URX has 40 people: 75% product/eng, 25% business.
Customers
URX partners with the world’s top publishers & advertisers.
Funding
URX raised $15M from Accel, Google Ventures, and others.
What problem does URX solve?
URX serves contextually relevant native ads.
URX interprets page context to dynamically determine the best message & action.
How does URX affect the mobile ecosystem?
Why is this a Big Data problem?
Volume (apps), volume (web pages), variety (entities)
Rhapsody (Music), Fansided (Sports), Apple (Music, TV, Books)
1.6M apps (Android), 1.5M apps (Apple Store)
Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/
How do we collect, store, and process the data needed
to build our machine learning models?
Important tasks
1. Data Collection and Parsing
2. Data Storage
• Persistent Storage
• Search Index
3. Data Processing
• Dictionary Building
• Vectorization (Feature Vector Creation)
1. Data collection & parsing
Wikipedia Corpus (English): https://dumps.wikimedia.org/enwiki/latest/
• 11GB XML dump (gzip file)
• 15M pages (but only 4M articles)
• Wikitext Grammar
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility"/>
<revision>
<id>631144794</id>
<parentid>381202555</parentid>
<timestamp>2014-10-26T04:50:23Z</timestamp>
<contributor>
<username>Paine Ellsworth</username>
<id>9092818</id>
</contributor>
<comment>add [[WP:RCAT|rcat]]s</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
#REDIRECT [[Computer accessibility]] {{Redr|move|from CamelCase|up}}
</text>
<sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
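A minimal sketch of how a dump in this shape could be streamed page by page with a generator, using only Python's standard library. The function names `iter_pages` and `local`, the trimmed `SAMPLE` document, and the choice of `xml.etree.ElementTree.iterparse` are assumptions made for illustration; the talk itself only mentions a sax library and generators.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the dump; a real dump wraps many <page> elements.
SAMPLE = """<mediawiki>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility"/>
    <revision>
      <id>631144794</id>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop an XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page>, clearing each element once it is consumed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page = {"title": None, "ns": None, "redirect": None, "text": ""}
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                page["title"] = node.text
            elif name == "ns":
                page["ns"] = int(node.text)
            elif name == "redirect":
                page["redirect"] = node.get("title")
            elif name == "text":
                page["text"] = node.text or ""
        yield page
        elem.clear()  # keep memory flat while streaming a large file


pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
# Namespace 0 without a redirect is one plausible filter for "articles".
articles = [p for p in pages if p["ns"] == 0 and p["redirect"] is None]
print(pages[0]["title"], "->", pages[0]["redirect"])
```

Filtering on `ns` and `redirect` is one plausible way to get from all pages down to articles; the exact filter used in the talk is not stated.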
1. Data collection & parsing
• FullWikiParser (mediawikiparser): sax library, generator; 20 secs/doc, ~10 years
• FastWikiParser (mwparserfromhell): sax library, generator; 200 docs/sec, ~21 hours
• HTMLWikiParser (URX Index): hbase, lxml parser; 6 docs/sec, ~one month
• GensimWikiCorpusParser: multithreading, generator; ~3 hours
• wikipedia-parser:
  1. pyspark (64 cores, 8GB RAM)
  2. wikihadoop (StreamWikiDumpInputFormat): split input file
  3. mwparserfromhell: parse to raw text
  4. ~20 minutes
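The wall-clock estimates above follow from the per-document throughput and the 15M pages in the dump. The arithmetic can be checked with a short script; the helper name `hours_to_parse` is made up for this sketch, and the calculation assumes every page costs the same and parsing is sequential.

```python
TOTAL_PAGES = 15_000_000  # "15M pages" from the corpus slide


def hours_to_parse(docs_per_sec, total=TOTAL_PAGES):
    """Hours needed to parse `total` pages at a constant rate."""
    return total / docs_per_sec / 3600.0


full_hours = hours_to_parse(1 / 20)  # 20 secs/doc
fast_hours = hours_to_parse(200)     # 200 docs/sec
html_hours = hours_to_parse(6)       # 6 docs/sec

print(f"FullWikiParser: {full_hours / 24 / 365:.1f} years")  # 9.5 years
print(f"FastWikiParser: {fast_hours:.1f} hours")             # 20.8 hours
print(f"HTMLWikiParser: {html_hours / 24:.1f} days")         # 28.9 days
```

The results line up with the rounded figures on the slide: about 10 years, about 21 hours, and about one month.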
2. Data storage
[Diagram: wikipedia-parser; HDFS (Namenode, datanode 1, datanode 2, ..., datanode n); wikipedia-indexer; Elasticsearch Index (ClusterNode 1, ClusterNode 2, ..., ClusterNode m)]
3. Data Processor (Dictionary building)
(0 taylor) . . . (1999995 zion)
(1 alison) . . . (1999996 dozer)
(2 swift) . . . (1999997 tank)
(3 born) . . . (1999998 trinity)
(4 december) . . . (1999999 neo)
• Pyspark (Gensim): wikihadoop, StreamWikiDumpInputFormat; dictionary, tfidfmodel; ~1 hour
• GensimWikiCorpusParser: multithreading, generator; corpus, dictionary, tfidfmodel; ~6 hours
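To make the dictionary and TF-IDF steps concrete, here is a small pure-Python sketch of the idea: map each token to an integer id, count document frequencies, then weight a document's tokens. It is an illustration only, not Gensim's implementation. The toy documents, the function names, and the log base 2 weighting with L2 normalization are assumptions chosen for this example.

```python
import math
from collections import Counter

# Toy corpus (made up); the first document mirrors the ids shown on the slide.
docs = [
    "taylor alison swift born december".split(),
    "swift is a programming language".split(),
    "taylor swift released an album".split(),
]


def build_dictionary(tokenized_docs):
    """Assign token ids in first-seen order and count document frequencies."""
    token2id, doc_freq = {}, Counter()
    for tokens in tokenized_docs:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
        doc_freq.update(set(tokens))
    return token2id, doc_freq


def tfidf(tokens, token2id, doc_freq, n_docs):
    """Return {token_id: weight} using tf * log2(N / df), L2-normalized."""
    counts = Counter(t for t in tokens if t in token2id)
    weights = {
        token2id[t]: tf * math.log2(n_docs / doc_freq[t])
        for t, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    # Tokens that occur in every document get weight 0 and are dropped.
    return {i: w / norm for i, w in weights.items() if w > 0}


token2id, doc_freq = build_dictionary(docs)
vec = tfidf(docs[0], token2id, doc_freq, len(docs))
print(sorted(token2id.items(), key=lambda kv: kv[1])[:5])
print(vec)
```

On a corpus the size of Wikipedia the same counting has to be split across workers and merged, which is the step the slide attributes to the Pyspark and wikihadoop setup.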
4. Data Processor (Vectorization)
Alias         Candidate Entity                        f1    f2    …  fn
Taylor Swift  wikipedia:Taylor_Swift                  0.91  0.81  …  0.34
              wikipedia:Taylor_Swift_(album)          0.42  0.10  …  0.42
              wikipedia:1989_(Taylor_Swift_album)     0.71  0.23  …  0.31
              wikipedia:Fearless_(Taylor_Swift_song)  0.13  0.22  …  0.23
              wikipedia:John_Swift                    0.00  0.19  …  0.56
• Gensim: ~350 ms to predict the entity per alias
• Cython: ~100 ms to predict the entity per alias
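As a rough illustration of how a feature table like this could be turned into a prediction, the sketch below scores each candidate and keeps the best one. The talk does not say how the features are combined, so the equal weights and the functions `score` and `predict_entity` are purely hypothetical. Only the three feature values visible on the slide are used.

```python
# Rows copied from the slide. The columns hidden behind "…" are unknown,
# so each candidate carries only the three visible values (f1, f2, fn).
CANDIDATES = {
    "wikipedia:Taylor_Swift": [0.91, 0.81, 0.34],
    "wikipedia:Taylor_Swift_(album)": [0.42, 0.10, 0.42],
    "wikipedia:1989_(Taylor_Swift_album)": [0.71, 0.23, 0.31],
    "wikipedia:Fearless_(Taylor_Swift_song)": [0.13, 0.22, 0.23],
    "wikipedia:John_Swift": [0.00, 0.19, 0.56],
}

# Hypothetical equal weights; a trained model would supply its own.
WEIGHTS = [1.0, 1.0, 1.0]


def score(features, weights=WEIGHTS):
    """Weighted sum of a candidate's feature values."""
    return sum(f * w for f, w in zip(features, weights))


def predict_entity(candidates):
    """Return the candidate entity with the highest score."""
    return max(candidates, key=lambda name: score(candidates[name]))


best = predict_entity(CANDIDATES)
print(best)  # wikipedia:Taylor_Swift
```

Any real system would replace the weighted sum with whatever the Machine Learning Module learns; the point here is only the shape of the data, one feature vector per (alias, candidate) pair.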
[Diagram: end-to-end pipeline with seven numbered steps. Components: Wikipedia Corpus, Wikilinks Corpus, X Corpus; corpus-parser; corpus-indexer; HDFS (Wikipedia), HDFS (Wikilinks), HDFS (X Corpus); Elasticsearch 1, Elasticsearch 2, ..., Elasticsearch n; Data Processor; Dictionary; TF-IDF Model; Machine Learning Module]
Demo
Linked Entities
1. http://en.wikipedia.org/wiki/Macgyver
2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
4. http://en.wikipedia.org/wiki/Josh_Holloway
5. http://en.wikipedia.org/wiki/NBC
6. http://en.wikipedia.org/wiki/CBS
7. http://en.wikipedia.org/wiki/James_Wan
8. http://en.wikipedia.org/wiki/Netflix
9. http://en.wikipedia.org/wiki/America_America
Input page: http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
Things to watch out for
● Tuning pyspark jobs (64 cores, 8GB Driver RAM)
● Bringing down the Elasticsearch cluster
● Rejoining the union after secession (Elasticsearch nodes)
● Text cleaning (lowercasing, character encoding)
● Merging in Hadoop for dictionary creation
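For the text-cleaning item, a minimal sketch of what lowercasing and character-encoding cleanup might look like before tokens go into the dictionary. The function name `clean_text`, the NFKC normalization form, and the whitespace collapsing are assumptions for this example; the talk names only the two concerns.

```python
import unicodedata


def clean_text(raw, encoding="utf-8"):
    """Decode bytes if needed, normalize Unicode, lowercase, collapse spaces."""
    if isinstance(raw, bytes):
        # Replace undecodable bytes instead of failing on a bad page.
        raw = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFKC", raw)
    return " ".join(text.lower().split())


print(clean_text(b"Taylor\xc2\xa0Swift"))          # taylor swift
print(clean_text("  Neil  deGrasse\tTyson "))      # neil degrasse tyson
```

Applying the same function at both training and test time keeps dictionary lookups consistent, since "Swift" and "swift" would otherwise receive different ids.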
Getting started is easy.
Sign Up, Download SDK, Start Building
Visit http://urx.com/sign-up for more information.
Thank you.
delroy@urx.com

Editor's Notes

  1. I am Delroy, Data Scientist @URX. Challenges and experiences.
  2. URX is a mobile advertising company in SF. Spotify, Bandsintown, Lyft, SeatGeek, StubHub, Airbnb.
  3. Mobile app or mobile web: ad from left field.
  4. Create a more cohesive and relevant mobile experience. Deeplinking refers to the use of a hyperlink that links to a specific, generally searchable or indexed, piece of web content on a website. Deeplinks are important because they connect the content within one app directly to the content within another app, e.g. fansided.com to seatgeek.com.
  5. Enable developers to better monetize by linking the content within apps. Create more engagement by allowing users to convert intent into actions.
  6. The Statistics Portal (July 2015): 1.6M apps (Android), 1.5M apps (Apple Store). 1M pages: Rhapsody, Pandora (URX Index).
  7. Entity linking problem. What happens if we just search Wikipedia?
  8. Wikipedia: comprehensive, accurate (due to crowdsourcing). Wikification, D2W.
  9. FullWikiParser: 3 docs/min, 180/hr, ~4,300/day, ~1.5M/year.
  10. Wikipedia pages + mentions index. Batch updates to reduce IO, ~15 or more hours. Persistent. Cluster node rejoins the union if it dies.
  11. What happens at test time? www.dancingastronaut.com
  12. http compute:65432/?scores=max url=http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
      http compute:65432/mentions url=http://zap2it.com/2015/10/finding-carter-season-2b-ben-escape-cash-explained/
      http compute:65432/?scores=max url=http://zap2it.com/2015/10/finding-carter-season-2b-ben-escape-cash-explained/
  13. http compute:65432/?scores=max url=http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/