Helping travelers make better hotel choices
500 million times a month*
Steffen Wenz, CTO TrustYou
What does TrustYou do?
For every hotel on the planet, provide a summary of traveler reviews.
✓ Excellent hotel!*
✓ Nice building: “Clean, hip & modern, excellent facilities”
✓ Great view: « Vue superbe » (“superb view”)
✓ Great for partying: “Nice weekend getaway or for partying”
✗ Solo travelers complain about TVs
ℹ You should check out Reichstag, KaDeWe & Gendarmenmarkt.
*) nhow Berlin (Full summary)
TrustYou Architecture
Crawling → Semantic Analysis → DB → TrustYou Analytics / API (200 million reqs/month) → Kayak, …
Crawling

Basic crawling setup
Seed URLs (/find?q=Berlin, /find?q=Munich) feed a frontier of discovered links (/meetup/BerlinPyData, /meetup/BerlinCyclists, /find?q=Munich&page=2, /meetup/BerlinPolitics, /find?q=Munich&page=3, …) that the crawler works through.
… if only it were so easy
In practice the frontier fills up with crawler traps: near-endless pagination (/find?q=Munich&page=99999999, …), links discovered more than once (/meetup/BerlinCyclists), and junk domains (facebok.com/meetup).
Scrapy
● Build your own web crawlers
○ Extract data via CSS selectors, XPath, regexes …
○ Handles queuing, request parallelism, cookies, throttling …
● Comprehensive and well-designed
● Commercial support by http://scrapinghub.com/
Intro to Scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "my_spider"
    # start with this URL
    start_urls = ["http://www.meetup.com/find/?allMeetups=true&radius=50&userFreeform=Berlin"]
    # follow these URLs, and call self.parse_meetup to extract data from them
    rules = [
        Rule(LinkExtractor(allow=[
            "^http://www.meetup.com/[^/]+/$",
        ]), callback="parse_meetup"),
    ]

    def parse_meetup(self, response):
        # Extract data about the meetup from HTML
        m = MeetupItem()
        yield m
Try it out!
$ scrapy crawl city -a city=Berlin -t jsonlines -o - 2>/dev/null
{"url": "http://www.meetup.com/Making-Customers-Happy-Berlin/", "name": "eCommerce - Making Customers Happy - Berlin", "members": "774"}
{"url": "http://www.meetup.com/Berlin-Scrum-Meetup/", "name": "Berlin Scrum Meetup", "members": "368"}
{"url": "http://www.meetup.com/Clojure-Berlin/", "name": "The Clojure Conspiracy (Berlin)", "members": "545"}
{"url": "http://www.meetup.com/appliedJavascript/", "name": "Applied Javascript", "members": "494"}
{"url": "http://www.meetup.com/englishconversationclubberlin/", "name": "English Conversation Club Berlin", "members": "1"}
{"url": "http://www.meetup.com/Berlin-Nights-Out-and-Daylight-Catch-Up/", "name": "Berlin Nights Out and Daylight Catch Up", "members": "1"}
...
Full code on GitHub, dump of all Berlin meetups
(Note: Meetup also has an API …)
[Chart: Number of registered meetups]
Crawling at TrustYou scale
● 2-3 million new reviews/week
● Customers want alerts 8-24h after review publication!
● Smart crawl frequency & depth, but still high overhead
● Pools of constantly refreshed EC2 proxy IPs
● Direct API connections with many sites
Crawling at TrustYou scale
● Custom framework, very similar to Scrapy
● Runs on Hadoop cluster (100 nodes)
● … Though problem not 100% suitable for MapReduce
○ Nodes mostly waiting
○ Coordination/messaging between nodes required:
■ Distributed queue
■ Rate limiting
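The rate-limiting piece mentioned above can be sketched as a token bucket per crawled host. This is an illustrative toy, not TrustYou's actual implementation; the class name and parameters are made up:

```python
class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # timestamp of last request

    def allow(self, now):
        # Refill tokens for the time elapsed since the last request
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)
# Two requests at t=0 pass (burst), the third is throttled:
print([bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0)])  # [True, True, False]
print(bucket.allow(1.0))  # True: one token refilled after a second
```

In a distributed crawler, each host would get its own bucket, and a worker only dispatches a request when allow() returns True, capping sustained throughput at `rate` requests per second while permitting short bursts.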
Textual Data
Treating textual data
raw text → sentence splitting → tokenization → stopword filtering → stemming
Tokenization
>>> import nltk
>>> raw = "We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers. Please get in touch using info@pydata.berlin."
>>> nltk.sent_tokenize(raw)
['We are always looking for interesting talks, locations to host meetups and enthusiastic volunteers.', 'Please get in touch using info@pydata.berlin.']
>>> nltk.word_tokenize(raw)
['We', 'are', 'always', 'looking', 'for', 'interesting', 'talks', ',', 'locations', 'to', 'host', 'meetups', 'and', 'enthusiastic', 'volunteers.', 'Please', 'get', 'in', 'touch', 'using', 'info', '@', 'pydata.berlin', '.']
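The deck shows no code for the later pipeline stages; here is a pure-Python toy sketch of stopword filtering and stemming. The stopword set and suffix rules are made-up stand-ins for NLTK's `stopwords` corpus and `PorterStemmer`:

```python
STOPWORDS = {"we", "are", "for", "to", "and", "in", "the"}  # toy list

def filter_stopwords(tokens):
    # Drop high-frequency function words that carry little meaning
    return [t for t in tokens if t.lower() not in STOPWORDS]

def stem(token):
    # Crude suffix stripping; real systems use e.g. nltk.stem.PorterStemmer
    for suffix in ("ing", "ers", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = ["We", "are", "looking", "for", "interesting", "talks"]
print([stem(t) for t in filter_stopwords(tokens)])  # ['look', 'interest', 'talk']
```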
Grammars and Parsing
“great rooms” → JJ NN
“great hotel” → JJ NN
“rooms are terrible” → NN VB JJ
“hotel is terrible” → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NN COP JJ
... OPINION -> JJ NN
... NN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... JJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (JJ great) (NN rooms))
WordNet
>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('coded', wn.VERB)
'code'
>>> wn.synsets("python")
[Synset('python.n.01'), Synset('python.n.02'), Synset('python.n.03')]
>>> wn.synset('python.n.01').hypernyms()
[Synset('boa.n.02')]
>>> # meh :/
Semantic Analysis at TrustYou
● “Nice room”
● “Room wasn’t so great”
● “The air-conditioning was so powerful that we were cold in the room even when it was off.”
● “อาหารรสชาติดี” (Thai: “The food tastes good”)
● “خدمة جيدة” (Arabic: “Good service”)
● 20 languages
● Linguistic system (morphology, taggers, grammars, parsers …)
● Hadoop: scale out CPU
○ ~1B opinions in DB
● Python for ML & NLP libraries
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar, because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040, -0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166, 0.3312, -0.0928,
-0.0967, -0.0199, -0.2498, -0.4445, -0.0445,
# ...
-1.0090, -0.2553, 0.2686, -0.4121, 0.3116,
-0.0639, -0.3688, -0.0273, -0.1266, -0.2606,
-0.1549, 0.0023, 0.0084, 0.2169, 0.0060],
dtype=float32)
Fun with Word2Vec
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.8382717370986938), (u'php', 0.8266388773918152), (u'django', 0.8189617991447449)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["berlin"])[:3]
[(u'paris', 0.8339072465896606), (u'lisbon', 0.7986686825752258), (u'holland', 0.7970746755599976)]
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.8175351619720459), (u'mamas', 0.745951771736145), (u'gals', 0.7336771488189697)]
ML @ TrustYou
● gensim doc2vec model to create hotel embeddings
● Used, together with other features, for various classifiers
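As a toy illustration of the embedding idea only: representing a document by the average of its word vectors. gensim's actual Doc2Vec trains document vectors jointly with word vectors; the 2-d vectors below are invented for the example:

```python
# Toy word vectors (2-d; real models use 100+ dimensions)
word_vecs = {
    "clean":    [0.9, 0.1],
    "spotless": [0.8, 0.2],
    "noisy":    [0.1, 0.9],
    "loud":     [0.2, 0.8],
}

def embed(doc):
    # Represent a document as the average of its word vectors
    vecs = [word_vecs[w] for w in doc.split() if w in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

clean_hotel = embed("clean spotless")
noisy_hotel = embed("noisy loud")
# Hotels described with similar words end up close in embedding space:
print(dot(clean_hotel, clean_hotel) > dot(clean_hotel, noisy_hotel))  # True
```

Such a document vector can then be fed, alongside other features, into a downstream classifier.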
Workflow Management & Scaling Up

Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
● Some support for Hadoop
● Pythonic replacement for Oozie
● Can be combined with Pig, Hive
class MyTask(luigi.Task):
    def requires(self):
        return DependentTask()

    def output(self):
        return luigi.LocalTarget("data/my_task_output")

    def run(self):
        with self.output().open("w") as out:
            out.write("foo")
Luigi tasks vs. Makefiles
data/my_task_output: DependentTask
	run
	run
	run ...
Example: Wrap crawl in Luigi task

class CrawlTask(luigi.Task):
    city = luigi.Parameter()

    def output(self):
        output_path = os.path.join("data", "{}.jsonl".format(self.city))
        return luigi.LocalTarget(output_path)

    def run(self):
        tmp_output_path = self.output().path + "_tmp"
        subprocess.check_output(["scrapy", "crawl", "city", "-a",
            "city={}".format(self.city), "-o", tmp_output_path, "-t", "jsonlines"])
        os.rename(tmp_output_path, self.output().path)
Luigi dependency graphs
Hadoop!
● MapReduce: programming model for distributed computation problems
● Express your algorithm as a sequence of operations:
a. Map: do a linear pass over your data, emit (k, v)
b. (Distributed sort)
c. Reduce: linear pass over all (k, v) for the same k
● Python on Hadoop: Hadoop streaming, MRJob, Luigi
(Just go learn PySpark instead)
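The map / sort / reduce steps above can be simulated in plain Python. Here is the canonical word-count example (not code from the talk):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reduce: sum the counts for one word
    yield key, sum(values)

lines = ["great rooms", "great hotel", "great view"]

# 1. Map phase (would run distributed over many nodes)
pairs = [kv for line in lines for kv in mapper(line)]
# 2. "Distributed sort": group all (k, v) pairs by key
pairs.sort(key=itemgetter(0))
# 3. Reduce phase: each reducer sees all values for one key
counts = dict(
    kv
    for key, group in groupby(pairs, key=itemgetter(0))
    for kv in reducer(key, (v for _, v in group))
)
print(counts)  # {'great': 3, 'hotel': 1, 'rooms': 1, 'view': 1}
```

On a real cluster, Hadoop performs the sort-and-shuffle between the map and reduce phases; the per-record logic stays exactly this simple.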
Luigi Hadoop integration

class HadoopTask(luigi.hadoop.JobTask):
    def output(self):
        return luigi.HdfsTarget("output_in_hdfs")

    def requires(self):
        return {
            "some_task": SomeTask(),
            "some_other_task": SomeOtherTask()
        }

    def mapper(self, line):
        key, value = line.rstrip().split("\t")
        yield key, value

    def reducer(self, key, values):
        yield key, ", ".join(values)
Luigi Hadoop integration: what happens when the task runs?
1. Your input data is sitting in
distributed file system (HDFS)
2. Luigi packages your code as a .tar.gz; Hadoop distributes it to the machines
3. mapper() gets run (distributed)
4. Data gets re-sorted by key
5. reducer() gets run (distributed)
6. Output gets saved in HDFS
Beyond MapReduce
● Batch, never real time
● Slow even for batch (lots of disk IO)
● Limited expressiveness (remedies/crutches: MRJob, Pig, Hive)
● Spark: more complete Python support
Workflows at TrustYou
We’re hiring! steffen@trustyou.com
