Helping travelers make better hotel choices
500 million times a month*
Steffen Wenz, CTO TrustYou
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
✓ Excellent hotel!*
✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe » (“superb view”)
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
TrustYou Architecture
Crawling → Semantic Analysis → DB → TrustYou Analytics / API (200 million reqs/month) → Kayak, …
Crawling

Basic crawling setup
Seed URLs (/find?q=Berlin, /find?q=Munich) feed a frontier of discovered links (/meetup/BerlinPyData, /meetup/BerlinCyclists, /find?q=Munich&page=2, /meetup/BerlinPolitics, /find?q=Munich&page=3, …) that the crawler works through.
… if only it were so easy
In practice the frontier fills up with crawler traps: near-endless pagination (/find?q=Munich&page=99999999, …), links discovered more than once (/meetup/BerlinCyclists), and junk domains (facebok.com/meetup).
Scrapy
● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
Intro to Scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "my_spider"
    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]
    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about the meetup from HTML
        m = MeetupItem()
        yield m
Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
[Chart: Number of registered meetups]
Crawling at TrustYou scale
● 2-3 million new reviews/week
● Customers want alerts 8-24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
Crawling at TrustYou scale
● Custom framework, very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … Though problem not 100% suitable for MapReduce
○ Nodes mostly waiting
○ Coordination/messaging between nodes required:
■ Distributed queue
■ Rate limiting
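The rate-limiting piece mentioned above can be sketched as a token bucket per crawled host. This is an illustrative toy, not TrustYou's actual implementation; the class name and parameters are made up:

```python
class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of last request

    def allow(self, now):
        # Refill tokens for the time elapsed since the last request
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
# Two requests at t=0 pass (burst), the third is throttled:
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0)])  # [True, True, False]
print(bucket.allow(1.0))  # True: one token refilled after a second
```

In a distributed crawler, each host would get its own bucket, and a worker only dispatches a request when allow() returns True, capping sustained throughput at `rate` requests per second while permitting short bursts.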
Textual Data
Treating textual data
raw text → sentence splitting → tokenization → stopword filtering → stemming
Tokenization
>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
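The deck shows no code for the later pipeline stages; here is a pure-Python toy sketch of stopword filtering and stemming. The stopword set and suffix rules are made-up stand-ins for NLTK's `stopwords` corpus and `PorterStemmer`:

```python
STOPWORDS = {"we", "are", "for", "to", "and", "in", "the"}  # toy list

def filter_stopwords(tokens):
    # Drop high-frequency function words that carry little meaning
    return [t for t in tokens if t.lower() not in STOPWORDS]

def stem(token):
    # Crude suffix stripping; real systems use e.g. nltk.stem.PorterStemmer
    for suffix in ("ing", "ers", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["We", "are", "looking", "for", "interesting", "talks"]
print([stem(t) for t in filter_stopwords(tokens)])  # ['look', 'interest', 'talk']
```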
Grammars and Parsing
“great rooms” → JJ NN
“great hotel” → JJ NN
“rooms are terrible” → NN VB JJ
“hotel is terrible” → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (JJ great) (NN rooms))
WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]
>>> # meh :/
Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جيدة” (Arabic: “Good service”)
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: scale out CPU
○ ~1B opinions in DB
● Python for ML & NLP libraries
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
Fun with Word2Vec
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● Used, together with other features, for various classifiers
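As a toy illustration of the embedding idea only: representing a document by the average of its word vectors. gensim's actual Doc2Vec trains document vectors jointly with word vectors; the 2-d vectors below are invented for the example:

```python
# Toy word vectors (2-d; real models use 100+ dimensions)
word_vecs = {
    "clean":    [0.9, 0.1],
    "spotless": [0.8, 0.2],
    "noisy":    [0.1, 0.9],
    "loud":     [0.2, 0.8],
}

def embed(doc):
    # Represent a document as the average of its word vectors
    vecs = [word_vecs[w] for w in doc.split() if w in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

clean_hotel = embed("clean spotless")
noisy_hotel = embed("noisy loud")
# Hotels described with similar words end up close in embedding space:
print(dot(clean_hotel, clean_hotel) > dot(clean_hotel, noisy_hotel))  # True
```

Such a document vector can then be fed, alongside other features, into a downstream classifier.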
Workflow Management & Scaling Up

Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()

    def output(self):
        return luigi.LocalTarget("data/my_task_output")

    def run(self):
        with self.output().open("w") as out:
            out.write("foo")
Luigi tasks vs. Makefiles
data/my_task_output: DependentTask
	run
	run
	run ...
Example: Wrap crawl in Luigi task

class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city", "-a",
            "city={}".format(self.city), "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
Luigi dependency graphs
Hadoop!
● MapReduce: programming model for distributed computation problems
● Express your algorithm as a sequence of operations:
a. Map: do a linear pass over your data, emit (k, v)
b. (Distributed sort)
c. Reduce: linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
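The map / sort / reduce steps above can be simulated in plain Python. Here is the canonical word-count example (not code from the talk):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reduce: sum the counts for one word
    yield key, sum(values)

lines = ["great rooms", "great hotel", "great view"]

# 1. Map phase (would run distributed over many nodes)
pairs = [kv for line in lines for kv in mapper(line)]
# 2. "Distributed sort": group all (k, v) pairs by key
pairs.sort(key=itemgetter(0))
# 3. Reduce phase: each reducer sees all values for one key
counts = dict(
    kv
    for key, group in groupby(pairs, key=itemgetter(0))
    for kv in reducer(key, (v for _, v in group))
)
print(counts)  # {'great': 3, 'hotel': 1, 'rooms': 1, 'view': 1}
```

On a real cluster, Hadoop performs the sort-and-shuffle between the map and reduce phases; the per-record logic stays exactly this simple.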
Luigi Hadoop integration

class HadoopTask(luigi.hadoop.JobTask):
    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
Luigi Hadoop integration: what happens when the task runs?
1. Your input data is sitting in
distributed file system (HDFS)
2. Luigi packages your code as a .tar.gz; Hadoop distributes it to the machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: more complete Python support
Workflows at TrustYou
We’re hiring! steffen@trustyou.com
