2. Project Goal
Integrate game logs at a large social gaming company:
IsCool Entertainment (Euronext: ALWEK), 70 people,
€10M revenues.
Around 30 GB of raw logs per day across 7 games (web, mobile).
That's about 10 TB per year.
At the end, some Hadoop'ing + analytics SQL, but
in the middle, lots of data integration:
Any kind of logs and data
Partial database extracts
Apache/Nginx logs
Tracking logs (web analytics, etc.)
Application logs
REST APIs (currency exchange, geo data,
Facebook APIs, ...)
Dataiku™
3. As a reminder
What do most data scientists do?
On LinkedIn & Twitter: "Data Science", "Recommendation",
"Clustering algorithms", "Big Data", "Machine Learning",
"Hidden Markov Model", "Predictive Analytics", "Logistic Regression"
In real life: 80% of the time is spent getting the data right,
19% analytics, 1% Twitter & LinkedIn
4. Goal
A previous project based on an ETL solution had failed.
Need for:
Agility
To manage any data
To be quick
The answer is...
PYTHON!
5. Step 1: Open your favorite editor, write a .py file
Scripts for data parsing, filling up the
database, enrichment, cleanup,
etc.
Around 2,000 lines of code
5 man-days of work
Good, but hard to maintain
in the long run
Not fun
I switched from Emacs to
SublimeText 2 in the meantime; that
was cool.
6. Step 2: Abstract and Generalize. PyBabe
Micro-ETL in Python
Can read and write: FTP, HTTP, SQL, filesystem, Amazon S3, e-mail, ZIP,
GZIP, MongoDB, Excel, etc.
Basic file filters and transformations (filters, regular expressions, date parsing,
geoip, transpose, sort, group, ...)
Uses yield and named tuples
Open source
https://github.com/fdouetteau/PyBabe
And the old project?
The old project became 200 lines of specific code.
7. Sample PyBabe script
(1) Fetch log files from S3 and integrate them into a database
babe = Babe()
## Fetch multiple CSV files from S3 and cache them locally
babe = babe.pull(url="s3://myapp/mydir/2012-07-07_*.csv.gz", cache=True)
## Take the IP from the "ip" field and look up the country via GeoIP
babe = babe.geoip_country_code(field="ip", country_code="country", ignore_error=True)
## Parse the user agent and store the browser name
babe = babe.user_agent(field="user_agent", browser="browser")
## Keep only the relevant fields
babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])
## Store the result in a database
babe.push_sql(database="mydb", table="mytable", username="...")
8. Sample PyBabe script
(2) Large file sort, join
babe = Babe()
## Fetch a large CSV file
babe = babe.pull(filename="mybigfile.csv")
## Perform a disk-based sort, batching 100k lines in memory
babe = babe.sortDiskBased(field="uid", nsize=100000)
## Group by uid and sum revenue per user
babe = babe.groupBy(field="uid", reducer=lambda x, y: (x.uid, x.amount + y.amount))
## Join this stream on "uid" with data pulled from a SQL table
babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"), "uid", "uid")
## Store the result in an Excel file
babe.push(filename="reports.xlsx")
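The groupBy step above relies on the preceding sort: once rows with the same key are adjacent, each group can be folded with a pairwise reducer without holding the whole file in memory. A minimal sketch of that mechanic (illustrative names, not the actual PyBabe implementation):

```python
from collections import namedtuple
from itertools import groupby

Row = namedtuple("Row", ["uid", "amount"])

def group_reduce(rows, key, reducer):
    # Assumes the input is already sorted by key (as after a disk-based sort),
    # so each group is contiguous and can be reduced in a single streaming pass.
    for _, group in groupby(rows, key=key):
        acc = next(group)
        for row in group:
            acc = reducer(acc, row)
        yield acc

rows = [Row(1, 10), Row(1, 5), Row(2, 7)]
totals = list(group_reduce(rows, key=lambda r: r.uid,
                           reducer=lambda x, y: Row(x.uid, x.amount + y.amount)))
# totals == [Row(uid=1, amount=15), Row(uid=2, amount=7)]
```

The reducer signature matches the lambda style used in the slide: it takes two rows and returns one accumulated row.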
9. Sample PyBabe script
(3) Mail a report
babe = Babe()
## Pull the result of a SQL query
babe = babe.pull(database="mydb", name="First Query", query="SELECT .... ")
## Pull the result of a second SQL query
babe = babe.pull(database="mydb", name="Second Query", query="SELECT ....")
## Send the overall (concatenated) stream as an e-mail, with the content
## attached as Excel and some sample data in the body
babe = babe.sendmail(subject="Your Report", recipients="fd@me.com", data_in_body=True,
                     data_in_body_row_limit=10, attach_formats="xlsx")
10. Some Design Choices
Use collections.namedtuple
Use generators
Nice and easy programming style:
def filter(stream, f):
    for data in stream:
        if isinstance(data, StreamMeta):
            yield data
        elif f(data):
            yield data
IO streaming whenever possible
An HTTP-downloaded file starts being processed as soon as the download starts
Use bulk loaders (SQL) or an external program when faster than the Python
implementation (e.g. gzip)
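The namedtuple-plus-generator style can be sketched end to end as a tiny pipeline (hypothetical names, not the real PyBabe API): sources, transformations, and sinks are all generators chained together, so nothing is buffered until the sink consumes the stream.

```python
from collections import namedtuple

# One fixed row type per stream, as with PyBabe's named tuples (illustrative)
Row = namedtuple("Row", ["uid", "amount"])

def pull(records):
    # Source generator: yields rows lazily, one at a time
    for uid, amount in records:
        yield Row(uid, amount)

def keep(stream, predicate):
    # Transformation generator: rows flow through without buffering the stream
    for row in stream:
        if predicate(row):
            yield row

def push(stream):
    # Sink: the whole pipeline only executes when the stream is consumed here
    return list(stream)

result = push(keep(pull([(1, 10), (2, 0), (3, 5)]), lambda r: r.amount > 0))
# result == [Row(uid=1, amount=10), Row(uid=3, amount=5)]
```

Because every stage is lazy, the same structure works for a 30 GB log file as for three tuples.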
11. PyBabe data model
def sample_pull():
header =
StreamHeader(name=”visits”,
partition={‘day’:‘2012-09-14’},
A Babe works on a fields=[“name”, “day”])
generator that contains
yield header
a sequence of partition
yield header.makeRow(‘Florian’,‘2012-09-14’)
A Partition is yield header.makeRow(‘John’, ‘2012-09-14’)
composed of a header yield StreamFooter()
(StreamHeader), rows,
yield header.replace(partition={‘day’:‘2012-09-15’})
and a Footer
yield header.makeRow(‘Phil’, ‘2012-09-15’)
yield StreamFooter()
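A consumer of this model walks the flat generator and uses the header/footer markers to delimit partitions. The sketch below uses simplified stand-ins for the real PyBabe classes (namedtuple for StreamHeader, a bare class for StreamFooter) just to show the traversal logic:

```python
from collections import namedtuple

# Simplified stand-ins for PyBabe's stream markers (illustrative, not the real classes)
StreamHeader = namedtuple("StreamHeader", ["name", "partition", "fields"])
class StreamFooter:
    pass

Row = namedtuple("Row", ["name", "day"])

def sample_pull():
    header = StreamHeader(name="visits",
                          partition={"day": "2012-09-14"},
                          fields=["name", "day"])
    yield header
    yield Row("Florian", "2012-09-14")
    yield Row("John", "2012-09-14")
    yield StreamFooter()
    yield header._replace(partition={"day": "2012-09-15"})
    yield Row("Phil", "2012-09-15")
    yield StreamFooter()

def rows_per_partition(stream):
    # Count the rows between each header and its matching footer
    counts, current = {}, None
    for item in stream:
        if isinstance(item, StreamHeader):
            current = item.partition["day"]
            counts[current] = 0
        elif isinstance(item, StreamFooter):
            current = None
        else:
            counts[current] += 1
    return counts

# rows_per_partition(sample_pull()) == {"2012-09-14": 2, "2012-09-15": 1}
```

Transformations that only touch rows can pass headers and footers through unchanged, which is exactly what the StreamMeta check in the filter example on the previous slide does.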
12. Some thoughts and associated projects
strptime and performance:
Parsing a date with time.strptime or datetime.strptime takes
about 30 microseconds, vs. 3 microseconds for regexp matching!
"Tarpys": a date-parsing library, with date guessing
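The trade-off can be illustrated with a precompiled regex for a fixed format; both parsers below produce the same datetime, but the regex skips strptime's per-call format interpretation (time the two with timeit to reproduce the gap on your machine; exact numbers will vary):

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def parse_regex(s):
    # Precompiled pattern: only integer conversion and datetime construction per call
    m = DATE_RE.match(s)
    return datetime(int(m.group(1)), int(m.group(2)), int(m.group(3)))

def parse_strptime(s):
    # strptime re-interprets the format string on every call
    return datetime.strptime(s, "%Y-%m-%d")

assert parse_regex("2012-09-14") == parse_strptime("2012-09-14")
```

This only pays off when the format is known and fixed, which is typically the case for machine-generated log files.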
Charset management (pyencoding_cleaner):
Sniff ISO or UTF-8 charset over a fragment
Optionally try to fix bad encoding (î, ÃÂ, ü)
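A minimal sketch of the sniffing idea, assuming only the two charsets mentioned above: UTF-8 is self-validating, so accented ISO-8859-1 bytes almost never decode as valid UTF-8 (pure-ASCII fragments are reported as UTF-8, which is harmless since the bytes are identical in both).

```python
def sniff_charset(fragment: bytes) -> str:
    # Try UTF-8 first: multi-byte sequences must be well-formed, so
    # ISO-8859-1 text containing accents will raise UnicodeDecodeError
    try:
        fragment.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

# "café" encoded both ways: b"caf\xc3\xa9" (UTF-8) vs b"caf\xe9" (ISO-8859-1)
assert sniff_charset("café".encode("utf-8")) == "utf-8"
assert sniff_charset("café".encode("iso-8859-1")) == "iso-8859-1"
```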
Python 2.x's csv module is OK, but...
No Unicode support
Separator sniffing buggy on edge cases
13. Future
Need to split the GitHub project into core and plugins
Rewrite the CSV module in C?
Configurable error system: should an error row fail the whole
stream, fail the whole babe, trigger a warning, or be skipped?
Pandas/NumPy integration
A homepage, docs, etc.
14. Questions?
babe = Babe().pull("questions.csv")
babe = babe.filter(smart=True)
babe = babe.mapTo(oracle)
babe.push("answers.csv")
Florian Douetteau
@fdouetteau
CEO, Dataiku
Dataiku: Our Goal
- Leverage and provide the best of open source technologies
to help people build their own data science platform