Building a data preparation pipeline with Pandas and AWS Lambda
https://youtu.be/pc0Xn0uAm34?t=9m15s
2. Building a data preparation pipeline with Pandas and AWS Lambda
What Will You Learn?
▸ What is data preparation and why it is required.
▸ How to prepare data with pandas.
▸ How to set up a pipeline with AWS Lambda
3. Building a data preparation pipeline with Pandas and AWS Lambda
About Me
▸ Based in Tokyo
▸ Using python with data for 6 years
▸ Freelance Data Products Developer and Consultant (data visualization, machine learning)
▸ Former Orange Labs and Locarise (connected sensors data processing and visualization)
▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
5. Building a data preparation pipeline with Pandas and AWS Lambda
So you've got data, now what?
▸ Showing it to an audience:
▸ a report from a survey?
▸ a news article with charts?
▸ a sales dashboard?
6. Building a data preparation pipeline with Pandas and AWS Lambda
But a lot of available data is messy
▸ incomplete or missing data
▸ mis-formatted, mis-typed data
▸ wrong / corrupted values
7. Building a data preparation pipeline with Pandas and AWS Lambda
It has every reason to be messy
▸ non-availability
▸ no appropriate means of collection
▸ lack of validation
▸ human errors
8. Building a data preparation pipeline with Pandas and AWS Lambda
And this can have very bad consequences
▸ Crash in your report generator
▸ incomplete reports
▸ report reaches wrong conclusions
▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
9. Building a data preparation pipeline with Pandas and AWS Lambda
It is not just about quality (ETL)
▸ Enriching the data
▸ Aggregating
▸ Classification (ML)
▸ Predictions (ML)
[Diagram: input 1 + input 2 → clean → aggregate, classify, … → output → visualize]
10. Building a data preparation pipeline with Pandas and AWS Lambda
Example: data journalism & interactive visualization
▸ Often manually gathered data in spreadsheets
▸ Data cleaning required
▸ Data aggregation/preprocessing required
▸ Data may be updated on a weekly basis
11. Building a data preparation pipeline with Pandas and AWS Lambda
If it is a product, it needs to deal with data updates
[Diagram: current data + new data → preparation script → visualisation-ready data → visualisation]
▸ Who is going to run the script?
▸ It needs to be automated (the pipeline)
12. Building a data preparation pipeline with Pandas and AWS Lambda
What does it apply to?
[Chart: application vs. solution, positioned by data quality (low to high) and data update frequency (once, monthly, daily, real-time):
▸ ad hoc data analysis: Jupyter notebook
▸ interactive reports, email reports: prototype
▸ dashboards, data products, data journalism: automated preparation pipeline (batch) (our focus)
▸ real-time: micro-batch or real-time processing pipeline]
14. Building a data preparation pipeline with Pandas and AWS Lambda
common operations
▸ Date parsing
▸ Deciding on a strategy for null or non-parseable values
▸ Enforcing value ranges
▸ Sanitising strings
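These operations map directly onto pandas calls. A minimal sketch (the column names and value ranges are hypothetical):

```python
import pandas as pd

# hypothetical raw records: dates as strings, a missing age, an out-of-range value
raw = pd.DataFrame({
    'date': ['2017-01-03', 'not a date', '2017-02-14'],
    'name': ['  Alice', 'BOB ', 'carol'],
    'age': [34, None, 420],
})

# date parsing: non-parseable values become NaT instead of raising
raw['date'] = pd.to_datetime(raw['date'], errors='coerce')

# strategy for null values: here, fill missing ages with the median
raw['age'] = raw['age'].fillna(raw['age'].median())

# enforce value ranges: clip ages into a plausible interval
raw['age'] = raw['age'].clip(0, 120)

# sanitise strings: strip whitespace and normalise case
raw['name'] = raw['name'].str.strip().str.lower()
```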
15. Building a data preparation pipeline with Pandas and AWS Lambda
Existing tools
▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
▸ great tools to check data quality and define transformations
16. Building a data preparation pipeline with Pandas and AWS Lambda
So why custom solutions with Python and Pandas?
▸ With python, you can do anything!
▸ It is not that difficult
▸ Pandas is a versatile tool that manipulates DataFrames
▸ Easy to specify transformations
▸ You are not limited to Pandas: the whole python ecosystem is available, like scikit-learn
17. Building a data preparation pipeline with Pandas and AWS Lambda
Example from a Jupyter notebook
▸ load a simple file with a list of names and ages of different people
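That load step might look like this (a sketch; the file contents stand in for the actual data):

```python
import pandas as pd
from io import StringIO

# stand-in for a simple CSV file with names and ages of different people
csv_data = StringIO("name,age\nalice,34\nbob,27\ncarol,41\n")

df = pd.read_csv(csv_data)
# df now has columns 'name' and 'age', one row per person
```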
18. Building a data preparation pipeline with Pandas and AWS Lambda
Example: statistics on groups (names)
▸ Is there a relationship between name length and median age?
▸ Chain operations
▸ Plot the length of name vs. age for each name
[Chart annotations: warning, outlier]
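The chained computation could be written like this (a sketch assuming `name` and `age` columns; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'alice', 'carol', 'bob'],
    'age': [34, 27, 30, 41, 29],
})

# chain operations: median age per name, then derive the name length
stats = (
    df.groupby('name')['age']
      .median()
      .reset_index()
      .assign(name_length=lambda d: d['name'].str.len())
)
# stats has one row per name, ready to plot name_length against age
```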
19. Building a data preparation pipeline with Pandas and AWS Lambda
[Screenshot: something is wrong: null values, label issues]
20. Building a data preparation pipeline with Pandas and AWS Lambda
Let’s fix this
▸ deal with missing values with `dropna` or `fillna`
▸ clean names
▸ reject outliers
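Those three fixes in pandas (a sketch; the sample data and the outlier threshold are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'name': [' Alice', 'bob', None, 'Carol '],
    'age': [34.0, None, 25.0, 150.0],
})

# deal with missing values: drop rows with no name, fill missing ages
df = df.dropna(subset=['name'])
df['age'] = df['age'].fillna(df['age'].median())

# clean names: strip whitespace, normalise case
df['name'] = df['name'].str.strip().str.lower()

# reject outliers: keep only plausible ages
df = df[df['age'].between(0, 120)]
```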
21. Building a data preparation pipeline with Pandas and AWS Lambda
Close the loop to improve the data entry/acquisition
▸ Many errors can be avoided during data collection:
▸ form / column validation
▸ drop down selections for categories
▸ Report rejected rows to improve collection process
[Diagram: data → preparation script → list of issues → improve forms…]
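Reporting rejected rows can be as simple as splitting the frame on a validity mask (a sketch; the validation rule is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', '', 'carol'],
    'age': [34, 27, -3],
})

# validity mask: non-empty name and plausible age
valid = (df['name'].str.len() > 0) & df['age'].between(0, 120)

clean_rows = df[valid]
rejected = df[~valid]  # feed this back to whoever maintains the forms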
22. Building a data preparation pipeline with Pandas and AWS Lambda
Testing your preparation
▸ Unit tests
▸ Test for anticipated edge cases (defensive programming)
▸ Property based testing (http://hypothesis.works/)
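A unit test for a cleaning step might look like this (a sketch with plain asserts; `clean_names` is a hypothetical helper):

```python
import pandas as pd

def clean_names(s):
    """Strip whitespace and lowercase a Series of names."""
    return s.str.strip().str.lower()

def test_clean_names():
    # anticipated edge cases: extra whitespace, mixed case, empty string
    dirty = pd.Series(['  Alice ', 'BOB', ''])
    assert list(clean_names(dirty)) == ['alice', 'bob', '']

test_clean_names()
```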
23. Building a data preparation pipeline with Pandas and AWS Lambda
More references for data cleaning
▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
25. Building a data preparation pipeline with Pandas and AWS Lambda
Some challenges
▸ Don’t let users run scripts
▸ Automating is part of a quality process
▸ Keeping things simple…
▸ and cheap
26. Building a data preparation pipeline with Pandas and AWS Lambda
What is AWS Lambda: a serverless solution
▸ Serverless offering by AWS
▸ No lifecycle to manage or shared state => resilient
▸ Auto-scaling
▸ Pay for actual running time: low cost
▸ No server or infra management: reduced dev / devops cost
[Diagram: events → lambda function → output]
27. Building a data preparation pipeline with Pandas and AWS Lambda
Creating a function
just a python function
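A minimal sketch of such a function (the event fields are hypothetical):

```python
import json

def lambda_handler(event, context):
    # AWS passes the trigger payload in `event`; `context` holds runtime info
    name = event.get('name', 'world')
    return {'statusCode': 200, 'body': json.dumps('Hello ' + name)}

# locally, it can be called like any function (context is unused here)
result = lambda_handler({'name': 'pandas'}, None)
```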
28. Building a data preparation pipeline with Pandas and AWS Lambda
Creating a function: options
29. Building a data preparation pipeline with Pandas and AWS Lambda
Creating an “architecture” with triggers
30. Building a data preparation pipeline with Pandas and AWS Lambda
Batch processing at regular interval
▸ cron scheduling
▸ let your function get some data and process it at regular intervals
31. Building a data preparation pipeline with Pandas and AWS Lambda
An API / webhook
▸ on API call
▸ Can be triggered from a google spreadsheet
32. Building a data preparation pipeline with Pandas and AWS Lambda
Setting up AWS Lambda for Pandas
Pandas and dependencies need to be compiled for Amazon Linux x86_64.
1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries
shell:
# install compilation environment
sudo yum -y update
sudo yum -y upgrade
sudo yum groupinstall "Development Tools"
sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel
# create and activate virtual env
virtualenv pdenv
source pdenv/bin/activate
# install pandas
pip install pandas
# zip the environment content
cd ~/pdenv/lib/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc
cd ~/pdenv/lib64/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc
# add the supporting libraries
cd ~/
mkdir -p libs
cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
zip -r ~/pdenv.zip libs
33. Building a data preparation pipeline with Pandas and AWS Lambda
Using pandas from a lambda function
▸ The lambda process needs to access those binaries
▸ Set up env variables
▸ Call a subprocess
▸ And pickle the function input
▸ AWS will call `lambda_function.lambda_handler`
python: lambda_function.py
import os, sys, subprocess, json
import cPickle as pickle

LIBS = os.path.join(os.getcwd(), 'local', 'lib')

def handler(filename):
    def handle(event, context):
        pickle.dump(event, open('/tmp/event.p', 'wb'))
        env = os.environ.copy()
        env.update(LD_LIBRARY_PATH=LIBS)
        proc = subprocess.Popen(
            ('python', filename),
            env=env,
            stdout=subprocess.PIPE)
        proc.wait()
        return proc.stdout.read()
    return handle

lambda_handler = handler('my_function.py')
34. Building a data preparation pipeline with Pandas and AWS Lambda
The actual function
▸ Get the input data from a google spreadsheet, a csv file on S3, an FTP
▸ Clean it
▸ Copy it somewhere
python: my_function.py
import pandas as pd
import pickle
import requests
from StringIO import StringIO

def run():
    # get the lambda call arguments
    event = pickle.load(open('/tmp/event.p', 'rb'))
    # load some data from a google spreadsheet
    r = requests.get('https://docs.google.com/spreadsheets'
                     + '/d/{sheet_id}/export?format=csv&gid={page_id}')
    data = r.content.decode('utf-8')
    df = pd.read_csv(StringIO(data))
    # Do something
    # save as file
    file_ = StringIO()
    df.to_csv(file_, encoding='utf-8')
    # copy the result somewhere

if __name__ == '__main__':
    run()
35. Building a data preparation pipeline with Pandas and AWS Lambda
upload and test
▸ add your lambda function code to the environment zip.
▸ upload your function
36. Building a data preparation pipeline with Pandas and AWS Lambda
caveat 1: python 2.7
▸ officially, only python 2.7 is supported
▸ But python 3 is available and can be called as a
subprocess
▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
37. Building a data preparation pipeline with Pandas and AWS Lambda
caveat 2: max process memory (1.5GB) and execution time
▸ need to split the dataset if too large
▸ loop over it in your lambda call:
▸ may exceed the timeout
▸ map to multiple lambda calls:
▸ need to merge the dataset at the end
▸ Lambda functions should be simple, chain if required
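The split-then-merge approach can be sketched as follows (`split_into_chunks` is a hypothetical helper; the `boto3` invocation is shown only as a comment):

```python
import pandas as pd

def split_into_chunks(df, n_chunks):
    """Split a DataFrame into roughly equal parts for parallel processing."""
    size = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]

df = pd.DataFrame({'value': range(10)})
chunks = split_into_chunks(df, 3)

# each chunk would be serialised and sent to its own lambda call, e.g.
# boto3.client('lambda').invoke(FunctionName=..., InvocationType='Event',
#                               Payload=chunk.to_json())

# merge the results at the end
merged = pd.concat(chunks)
```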
39. Building a data preparation pipeline with Pandas and AWS Lambda
Takeaways
▸ Know your data and your target
▸ Pandas can solve many issues
▸ Defensive programming and closing the loop
▸ AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams