2. Project Goal
Integrate game logs at a large social gaming company:
IsCool Entertainment (Euronext: ALWEK), 70 people,
€10M revenues.
Around 30 GB of raw logs per day across 7 games (web, mobile).
That's about 10 TB per year.
At the end, some Hadoop'ing + analytics SQL, but
in the middle, lots of data integration:
Any kind of logs and data
Partial database extracts
Apache/Nginx logs
Tracking logs (web analytics, etc.)
Application logs
REST APIs (currency exchange, geo data,
Facebook APIs, ...)
Dataiku™
3. As a reminder
What do most data scientists do?
On LinkedIn & Twitter: "Data Science", "Recommendation",
"Clustering algorithms", "Big Data", "Machine Learning",
"Hidden Markov Model", "Predictive Analytics", "Logistic Regression"
In real life: 80% of the time is spent getting the data right,
19% analytics, 1% Twitter & LinkedIn
4. Goal
A previous project based on an ETL solution had failed.
Need for:
Agility
To manage any data
To be quick
The answer is...
PYTHON!
5. Step 1: Open your favorite editor, write a .py file
Scripts for data parsing, filling up the
database, enrichment, cleanup,
etc.
Around 2,000 lines of code
5 man-days of work
Good, but hard to maintain
in the long run
Not fun
I switched from Emacs to
SublimeText 2 in the meantime; that
was cool.
6. Step 2: Abstract and Generalize. PyBabe
Micro-ETL in Python
Can read and write: FTP, HTTP, SQL, filesystem, Amazon S3, e-mail, ZIP,
GZIP, MongoDB, Excel, etc.
Basic file filters and transformations (filters, regular expressions, date parsing,
geoip, transpose, sort, group, ...)
Uses yield and named tuples
Open source
https://github.com/fdouetteau/PyBabe
And the old project?
The old project became 200 lines of specific code.
7. Sample PyBabe script
(1) Fetch log files from S3 and integrate them into a database
babe = Babe()
## Fetch multiple CSV files from S3 and cache them locally
babe = babe.pull(url="s3://myapp/mydir/2012-07-07_*.csv.gz", cache=True)
## Take the IP from the "ip" field and look up the country via GeoIP
babe = babe.geoip_country_code(field="ip", country_code="country", ignore_error=True)
## Parse the user agent and store the browser name
babe = babe.user_agent(field="user_agent", browser="browser")
## Keep only the relevant fields
babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])
## Store the result in a database
babe.push_sql(database="mydb", table="mytable", username="...")
8. Sample PyBabe script
(2) Large file sort, join
babe = Babe()
## Fetch a large CSV file
babe = babe.pull(filename="mybigfile.csv")
## Perform a disk-based sort, batching 100k lines in memory
babe = babe.sortDiskBased(field="uid", nsize=100000)
## Group by uid and sum revenue per user
babe = babe.groupBy(field="uid", reducer=lambda x, y: (x.uid, x.amount + y.amount))
## Join this stream on "uid" with data pulled from a SQL table
babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"), "uid", "uid")
## Store the result in an Excel file
babe.push(filename="reports.xlsx")
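The groupBy step above relies on the preceding sort: once rows with the same key are adjacent, each group can be folded with a pairwise reducer without holding the whole file in memory. A minimal sketch of that mechanic (illustrative names, not the actual PyBabe implementation):

```python
from collections import namedtuple
from itertools import groupby

Row = namedtuple("Row", ["uid", "amount"])

def group_reduce(rows, key, reducer):
    # Assumes the input is already sorted by key (as after a disk-based sort),
    # so each group is contiguous and can be reduced in a single streaming pass.
    for _, group in groupby(rows, key=key):
        acc = next(group)
        for row in group:
            acc = reducer(acc, row)
        yield acc

rows = [Row(1, 10), Row(1, 5), Row(2, 7)]
totals = list(group_reduce(rows, key=lambda r: r.uid,
                           reducer=lambda x, y: Row(x.uid, x.amount + y.amount)))
# totals == [Row(uid=1, amount=15), Row(uid=2, amount=7)]
```

The reducer signature matches the lambda style used in the slide: it takes two rows and returns one accumulated row.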
9. Sample PyBabe script
(3) Mail a report
babe = Babe()
## Pull the result of a SQL query
babe = babe.pull(database="mydb", name="First Query", query="SELECT .... ")
## Pull the result of a second SQL query
babe = babe.pull(database="mydb", name="Second Query", query="SELECT ....")
## Send the overall (concatenated) stream as an e-mail, with the content
## attached as Excel and some sample data in the body
babe = babe.sendmail(subject="Your Report", recipients="fd@me.com", data_in_body=True,
                     data_in_body_row_limit=10, attach_formats="xlsx")
10. Some Design Choices
Use collections.namedtuple
Use generators
Nice and easy programming style:
def filter(stream, f):
    for data in stream:
        if isinstance(data, StreamMeta):
            yield data
        elif f(data):
            yield data
IO streaming whenever possible
An HTTP-downloaded file starts being processed as soon as the download starts
Use bulk loaders (SQL) or an external program when faster than the Python
implementation (e.g. gzip)
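The namedtuple-plus-generator style can be sketched end to end as a tiny pipeline (hypothetical names, not the real PyBabe API): sources, transformations, and sinks are all generators chained together, so nothing is buffered until the sink consumes the stream.

```python
from collections import namedtuple

# One fixed row type per stream, as with PyBabe's named tuples (illustrative)
Row = namedtuple("Row", ["uid", "amount"])

def pull(records):
    # Source generator: yields rows lazily, one at a time
    for uid, amount in records:
        yield Row(uid, amount)

def keep(stream, predicate):
    # Transformation generator: rows flow through without buffering the stream
    for row in stream:
        if predicate(row):
            yield row

def push(stream):
    # Sink: the whole pipeline only executes when the stream is consumed here
    return list(stream)

result = push(keep(pull([(1, 10), (2, 0), (3, 5)]), lambda r: r.amount > 0))
# result == [Row(uid=1, amount=10), Row(uid=3, amount=5)]
```

Because every stage is lazy, the same structure works for a 30 GB log file as for three tuples.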
11. PyBabe data model
def sample_pull():
header =
StreamHeader(name=”visits”,
partition={‘day’:‘2012-09-14’},
A Babe works on a fields=[“name”, “day”])
generator that contains
yield header
a sequence of partition
yield header.makeRow(‘Florian’,‘2012-09-14’)
A Partition is yield header.makeRow(‘John’, ‘2012-09-14’)
composed of a header yield StreamFooter()
(StreamHeader), rows,
yield header.replace(partition={‘day’:‘2012-09-15’})
and a Footer
yield header.makeRow(‘Phil’, ‘2012-09-15’)
yield StreamFooter()
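A consumer of this model walks the flat generator and uses the header/footer markers to delimit partitions. The sketch below uses simplified stand-ins for the real PyBabe classes (namedtuple for StreamHeader, a bare class for StreamFooter) just to show the traversal logic:

```python
from collections import namedtuple

# Simplified stand-ins for PyBabe's stream markers (illustrative, not the real classes)
StreamHeader = namedtuple("StreamHeader", ["name", "partition", "fields"])
class StreamFooter:
    pass

Row = namedtuple("Row", ["name", "day"])

def sample_pull():
    header = StreamHeader(name="visits",
                          partition={"day": "2012-09-14"},
                          fields=["name", "day"])
    yield header
    yield Row("Florian", "2012-09-14")
    yield Row("John", "2012-09-14")
    yield StreamFooter()
    yield header._replace(partition={"day": "2012-09-15"})
    yield Row("Phil", "2012-09-15")
    yield StreamFooter()

def rows_per_partition(stream):
    # Count the rows between each header and its matching footer
    counts, current = {}, None
    for item in stream:
        if isinstance(item, StreamHeader):
            current = item.partition["day"]
            counts[current] = 0
        elif isinstance(item, StreamFooter):
            current = None
        else:
            counts[current] += 1
    return counts

# rows_per_partition(sample_pull()) == {"2012-09-14": 2, "2012-09-15": 1}
```

Transformations that only touch rows can pass headers and footers through unchanged, which is exactly what the StreamMeta check in the filter example on the previous slide does.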
12. Some thoughts and associated projects
strptime and performance:
Parsing a date with time.strptime or datetime.strptime takes
about 30 microseconds, vs. 3 microseconds for regexp matching!
"Tarpys": a date-parsing library, with date guessing
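The trade-off can be illustrated with a precompiled regex for a fixed format; both parsers below produce the same datetime, but the regex skips strptime's per-call format interpretation (time the two with timeit to reproduce the gap on your machine; exact numbers will vary):

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_regex(s):
    # Precompiled pattern: only integer conversion and datetime construction per call
    m = DATE_RE.match(s)
    return datetime(int(m.group(1)), int(m.group(2)), int(m.group(3)))

def parse_strptime(s):
    # strptime re-interprets the format string on every call
    return datetime.strptime(s, "%Y-%m-%d")

assert parse_regex("2012-09-14") == parse_strptime("2012-09-14")
```

This only pays off when the format is known and fixed, which is typically the case for machine-generated log files.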
Charset management (pyencoding_cleaner):
Sniff ISO or UTF-8 charset over a fragment
Optionally try to fix bad encoding (î, ÃÂ, ü)
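A minimal sketch of the sniffing idea, assuming only the two charsets mentioned above: UTF-8 is self-validating, so accented ISO-8859-1 bytes almost never decode as valid UTF-8 (pure-ASCII fragments are reported as UTF-8, which is harmless since the bytes are identical in both).

```python
def sniff_charset(fragment: bytes) -> str:
    # Try UTF-8 first: multi-byte sequences must be well-formed, so
    # ISO-8859-1 text containing accents will raise UnicodeDecodeError
    try:
        fragment.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

# "café" encoded both ways: b"caf\xc3\xa9" (UTF-8) vs b"caf\xe9" (ISO-8859-1)
assert sniff_charset("café".encode("utf-8")) == "utf-8"
assert sniff_charset("café".encode("iso-8859-1")) == "iso-8859-1"
```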
Python 2.x's csv module is OK, but...
No Unicode support
Separator sniffing buggy on edge cases
13. Future
Need to split the GitHub project into core and plugins
Rewrite the CSV module in C?
Configurable error system: should an error row fail the whole
stream, fail the whole babe, trigger a warning, or be skipped?
Pandas/NumPy integration
A homepage, docs, etc.
14. Questions?
babe = Babe().pull("questions.csv")
babe = babe.filter(smart=True)
babe = babe.mapTo(oracle)
babe.push("answers.csv")
Florian Douetteau
@fdouetteau
CEO, Dataiku
Dataiku: Our Goal
- Leverage and provide the best of open source technologies
to help people build their own data science platform