Building a data preparation pipeline with Pandas and AWS Lambda
https://youtu.be/pc0Xn0uAm34?t=9m15s
2. Building a data preparation pipeline with Pandas and AWS Lambda
What Will You Learn?
▸ What is data preparation and why it is required.
▸ How to prepare data with pandas.
▸ How to set up a pipeline with AWS Lambda
3. Building a data preparation pipeline with Pandas and AWS Lambda
About Me
▸ Based in Tokyo
▸ Using python with data for 6 years
▸ Freelance Data Products Developer and Consultant (data visualization, machine learning)
▸ Former Orange Labs and Locarise (connected sensors data processing and visualization)
▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
5. Building a data preparation pipeline with Pandas and AWS Lambda
So you've got data, now what?
▸ Showing it to an audience:
▸ a report from a survey?
▸ a news article with charts?
▸ a sales dashboard?
6. Building a data preparation pipeline with Pandas and AWS Lambda
But a lot of available data is messy
▸ incomplete or missing data
▸ mis-formatted, mis-typed data
▸ wrong / corrupted values
7. Building a data preparation pipeline with Pandas and AWS Lambda
It has every reason to be messy
▸ non-availability
▸ no appropriate means of collection
▸ lack of validation
▸ human errors
8. Building a data preparation pipeline with Pandas and AWS Lambda
And this can have very bad consequences
▸ Crash in your report generator
▸ incomplete reports
▸ report reaches wrong conclusions
▸ Ultimately, if your data is really bad, you cannot trust any conclusion from it
9. Building a data preparation pipeline with Pandas and AWS Lambda
It is not just about quality (ETL)
▸ Enriching the data
▸ Aggregating
▸ Classification (ML)
▸ Predictions (ML)
[Diagram: input 1 + input 2 → clean → aggregate, classify, … → output → visualize]
10. Building a data preparation pipeline with Pandas and AWS Lambda
Example: data journalism & interactive visualization
▸ Often manually gathered data in spreadsheets
▸ Data cleaning required
▸ Data aggregation/preprocessing required
▸ Data may be updated on a weekly basis
11. Building a data preparation pipeline with Pandas and AWS Lambda
If it is a product, it needs to deal with data updates
[Diagram: current data + new data → preparation script → visualisation-ready data → visualisation]
▸ Who is going to run the script?
▸ It needs to be automated (the pipeline)
12. Building a data preparation pipeline with Pandas and AWS Lambda
What does it apply to?
[Chart: application vs. solution, positioned by data quality (low to high) and data update frequency (once, monthly, daily, real-time):
▸ ad hoc data analysis: Jupyter notebook
▸ interactive reports, email reports: prototype
▸ dashboards, data products, data journalism: automated preparation pipeline (batch) (our focus)
▸ real-time: micro-batch or real-time processing pipeline]
14. Building a data preparation pipeline with Pandas and AWS Lambda
common operations
▸ Date parsing
▸ Deciding on a strategy for null or non-parseable values
▸ Enforcing value ranges
▸ Sanitising strings
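These operations map directly onto pandas calls. A minimal sketch (the column names and value ranges are hypothetical):

```python
import pandas as pd

# hypothetical raw records: dates as strings, a missing age, an out-of-range value
raw = pd.DataFrame({
    'date': ['2017-01-03', 'not a date', '2017-02-14'],
    'name': ['  Alice', 'BOB ', 'carol'],
    'age': [34, None, 420],
})

# date parsing: non-parseable values become NaT instead of raising
raw['date'] = pd.to_datetime(raw['date'], errors='coerce')

# strategy for null values: here, fill missing ages with the median
raw['age'] = raw['age'].fillna(raw['age'].median())

# enforce value ranges: clip ages into a plausible interval
raw['age'] = raw['age'].clip(0, 120)

# sanitise strings: strip whitespace and normalise case
raw['name'] = raw['name'].str.strip().str.lower()
```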
15. Building a data preparation pipeline with Pandas and AWS Lambda
Existing tools
▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
▸ great tools to check data quality and define transformations
16. Building a data preparation pipeline with Pandas and AWS Lambda
So why custom solutions with Python and Pandas?
▸ With python, you can do anything!
▸ It is not that difficult
▸ Pandas is a versatile tool that manipulates DataFrames
▸ Easy to specify transformations
▸ You are not limited to Pandas: the whole python ecosystem is available, like scikit-learn
17. Building a data preparation pipeline with Pandas and AWS Lambda
Example from a Jupyter notebook
▸ load a simple file with a list of names and ages of different people
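That load step might look like this (a sketch; the file contents stand in for the actual data):

```python
import pandas as pd
from io import StringIO

# stand-in for a simple CSV file with names and ages of different people
csv_data = StringIO("name,age\nalice,34\nbob,27\ncarol,41\n")

df = pd.read_csv(csv_data)
# df now has columns 'name' and 'age', one row per person
```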
18. Building a data preparation pipeline with Pandas and AWS Lambda
Example: statistics on groups (names)
▸ Is there a relationship between name length and median age?
▸ Chain operations
▸ Plot the length of name vs. age for each name
[Chart annotations: warning, outlier]
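The chained computation could be written like this (a sketch assuming `name` and `age` columns; the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'alice', 'carol', 'bob'],
    'age': [34, 27, 30, 41, 29],
})

# chain operations: median age per name, then derive the name length
stats = (
    df.groupby('name')['age']
      .median()
      .reset_index()
      .assign(name_length=lambda d: d['name'].str.len())
)
# stats has one row per name, ready to plot name_length against age
```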
19. Building a data preparation pipeline with Pandas and AWS Lambda
[Screenshot: something is wrong: null values, label issues]
20. Building a data preparation pipeline with Pandas and AWS Lambda
Let’s fix this
▸ deal with missing values with `dropna` or `fillna`
▸ clean names
▸ reject outliers
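Those three fixes in pandas (a sketch; the sample data and the outlier threshold are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'name': [' Alice', 'bob', None, 'Carol '],
    'age': [34.0, None, 25.0, 150.0],
})

# deal with missing values: drop rows with no name, fill missing ages
df = df.dropna(subset=['name'])
df['age'] = df['age'].fillna(df['age'].median())

# clean names: strip whitespace, normalise case
df['name'] = df['name'].str.strip().str.lower()

# reject outliers: keep only plausible ages
df = df[df['age'].between(0, 120)]
```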
21. Building a data preparation pipeline with Pandas and AWS Lambda
Close the loop to improve the data entry/acquisition
▸ Many errors can be avoided during data collection:
▸ form / column validation
▸ drop down selections for categories
▸ Report rejected rows to improve collection process
[Diagram: data → preparation script → list of issues → improve forms…]
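Reporting rejected rows can be as simple as splitting the frame on a validity mask (a sketch; the validation rule is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', '', 'carol'],
    'age': [34, 27, -3],
})

# validity mask: non-empty name and plausible age
valid = (df['name'].str.len() > 0) & df['age'].between(0, 120)

clean_rows = df[valid]
rejected = df[~valid]  # feed this back to whoever maintains the forms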
22. Building a data preparation pipeline with Pandas and AWS Lambda
Testing your preparation
▸ Unit tests
▸ Test for anticipated edge cases (defensive programming)
▸ Property based testing (http://hypothesis.works/)
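A unit test for a cleaning step might look like this (a sketch with plain asserts; `clean_names` is a hypothetical helper):

```python
import pandas as pd

def clean_names(s):
    """Strip whitespace and lowercase a Series of names."""
    return s.str.strip().str.lower()

def test_clean_names():
    # anticipated edge cases: extra whitespace, mixed case, empty string
    dirty = pd.Series(['  Alice ', 'BOB', ''])
    assert list(clean_names(dirty)) == ['alice', 'bob', '']

test_clean_names()
```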
23. Building a data preparation pipeline with Pandas and AWS Lambda
More references for data cleaning
▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
25. Building a data preparation pipeline with Pandas and AWS Lambda
Some challenges
▸ Don’t let users run scripts
▸ Automating is part of a quality process
▸ Keeping things simple…
▸ and cheap
26. Building a data preparation pipeline with Pandas and AWS Lambda
What is AWS Lambda: a serverless solution
▸ Serverless offering by AWS
▸ No lifecycle to manage or shared state => resilient
▸ Auto-scaling
▸ Pay for actual running time: low cost
▸ No server or infra management: reduced dev / devops cost
[Diagram: events → lambda function → output]
27. Building a data preparation pipeline with Pandas and AWS Lambda
Creating a function
just a python function
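A minimal sketch of such a function (the event fields are hypothetical):

```python
import json

def lambda_handler(event, context):
    # AWS passes the trigger payload in `event`; `context` holds runtime info
    name = event.get('name', 'world')
    return {'statusCode': 200, 'body': json.dumps('Hello ' + name)}

# locally, it can be called like any function (context is unused here)
result = lambda_handler({'name': 'pandas'}, None)
```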
28. Building a data preparation pipeline with Pandas and AWS Lambda
Creating a function: options
29. Building a data preparation pipeline with Pandas and AWS Lambda
Creating an “architecture” with triggers
30. Building a data preparation pipeline with Pandas and AWS Lambda
Batch processing at regular interval
▸ cron scheduling
▸ let your function get some data and process it at regular intervals
31. Building a data preparation pipeline with Pandas and AWS Lambda
An API / webhook
▸ on API call
▸ Can be triggered from a google spreadsheet
32. Building a data preparation pipeline with Pandas and AWS Lambda
Setting up AWS Lambda for Pandas
Pandas and dependencies need to be compiled for Amazon Linux x86_64.
1. Launch an EC2 instance and connect to it
2. Install pandas in a virtualenv
3. Zip the installed libraries
shell:
# install compilation environment
sudo yum -y update
sudo yum -y upgrade
sudo yum groupinstall "Development Tools"
sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel
# create and activate virtual env
virtualenv pdenv
source pdenv/bin/activate
# install pandas
pip install pandas
# zip the environment content
cd ~/pdenv/lib/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc
cd ~/pdenv/lib64/python2.7/site-packages/
zip -r ~/pdenv.zip . --exclude *.pyc
# add the supporting libraries
cd ~/
mkdir -p libs
cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
zip -r ~/pdenv.zip libs
33. Building a data preparation pipeline with Pandas and AWS Lambda
Using pandas from a lambda function
▸ The lambda process needs to access those binaries
▸ Set up env variables
▸ Call a subprocess
▸ And pickle the function input
▸ AWS will call `lambda_function.lambda_handler`
python: lambda_function.py
import os, sys, subprocess, json
import cPickle as pickle

LIBS = os.path.join(os.getcwd(), 'local', 'lib')

def handler(filename):
    def handle(event, context):
        pickle.dump(event, open('/tmp/event.p', 'wb'))
        env = os.environ.copy()
        env.update(LD_LIBRARY_PATH=LIBS)
        proc = subprocess.Popen(
            ('python', filename),
            env=env,
            stdout=subprocess.PIPE)
        proc.wait()
        return proc.stdout.read()
    return handle

lambda_handler = handler('my_function.py')
34. Building a data preparation pipeline with Pandas and AWS Lambda
The actual function
▸ Get the input data from a google spreadsheet, a csv file on S3, an FTP
▸ Clean it
▸ Copy it somewhere
python: my_function.py
import pandas as pd
import pickle
import requests
from StringIO import StringIO

def run():
    # get the lambda call arguments
    event = pickle.load(open('/tmp/event.p', 'rb'))
    # load some data from a google spreadsheet
    r = requests.get('https://docs.google.com/spreadsheets'
                     + '/d/{sheet_id}/export?format=csv&gid={page_id}')
    data = r.content.decode('utf-8')
    df = pd.read_csv(StringIO(data))
    # Do something
    # save as file
    file_ = StringIO()
    df.to_csv(file_, encoding='utf-8')
    # copy the result somewhere

if __name__ == '__main__':
    run()
35. Building a data preparation pipeline with Pandas and AWS Lambda
upload and test
▸ add your lambda function code to the environment zip.
▸ upload your function
36. Building a data preparation pipeline with Pandas and AWS Lambda
caveat 1: python 2.7
▸ officially, only python 2.7 is supported
▸ But python 3 is available and can be called as a
subprocess
▸ details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
37. Building a data preparation pipeline with Pandas and AWS Lambda
caveat 2: max process memory (1.5GB) and execution time
▸ need to split the dataset if too large
▸ loop over it in your lambda call:
▸ may exceed the timeout
▸ map to multiple lambda calls:
▸ need to merge the dataset at the end
▸ Lambda functions should be simple, chain if required
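The split-then-merge approach can be sketched as follows (`split_into_chunks` is a hypothetical helper; the `boto3` invocation is shown only as a comment):

```python
import pandas as pd

def split_into_chunks(df, n_chunks):
    """Split a DataFrame into roughly equal parts for parallel processing."""
    size = -(-len(df) // n_chunks)  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]

df = pd.DataFrame({'value': range(10)})
chunks = split_into_chunks(df, 3)

# each chunk would be serialised and sent to its own lambda call, e.g.
# boto3.client('lambda').invoke(FunctionName=..., InvocationType='Event',
#                               Payload=chunk.to_json())

# merge the results at the end
merged = pd.concat(chunks)
```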
39. Building a data preparation pipeline with Pandas and AWS Lambda
Takeaways
▸ Know your data and your target
▸ Pandas can solve many issues
▸ Defensive programming and closing the loop
▸ AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams