PyconJP: Building a data preparation pipeline with Pandas and AWS Lambda

Building a data preparation pipeline with Pandas and AWS Lambda

What is data preparation and why it is required.
How to prepare data with pandas.
How to set up a pipeline with AWS Lambda.

https://youtu.be/pc0Xn0uAm34?t=9m15s

1. PyconJP, 2016-09-22. Fabian Dubois: Building a data preparation pipeline with Pandas and AWS Lambda
2. What Will You Learn?
   ▸ What is data preparation and why it is required
   ▸ How to prepare data with pandas
   ▸ How to set up a pipeline with AWS Lambda
3. About Me
   ▸ Based in Tokyo
   ▸ Using Python with data for 6 years
   ▸ Freelance data products developer and consultant (data visualization, machine learning)
   ▸ Formerly at Orange Labs and Locarise (connected sensor data processing and visualization)
   ▸ Current side project: denryoku.io, an API for electric grid power demand and capacity prediction
4. Why Data Preparation?
5. So you have got data, now what?
   ▸ Showing it to an audience:
     ▸ a report from a survey?
     ▸ a news article with charts?
     ▸ a sales dashboard?
6. But a lot of available data is messy
   ▸ incomplete or missing data
   ▸ mis-formatted, mis-typed data
   ▸ wrong / corrupted values
7. It has all the reasons to be messy
   ▸ non-availability
   ▸ no appropriate means of collection
   ▸ lack of validation
   ▸ human errors
8. And this can have very bad consequences
   ▸ crashes in your report generator
   ▸ incomplete reports
   ▸ reports that reach wrong conclusions
   ▸ ultimately, if your data is really bad, you cannot trust any conclusion drawn from it
9. It is not just about quality (ETL)
   ▸ Enriching the data
   ▸ Aggregating
   ▸ Classification (ML)
   ▸ Predictions (ML)
   [diagram: input 1 and input 2 are each cleaned, then aggregated / classified into one output to visualize]
10. Example: data journalism & interactive visualization
    ▸ Often manually gathered data in spreadsheets
    ▸ Data cleaning required
    ▸ Data aggregation / preprocessing required
    ▸ Data may be updated on a weekly basis
11. If it is a product, it needs to deal with data updates
    [diagram: current data and new data go through the preparation script to produce visualisation-ready data for the visualisation]
    ▸ Who is going to run the script? It needs to be automated (the pipeline)
12. What does it apply to?
    [chart: applications and solutions positioned by data quality (low to high) and data update frequency (once, monthly, daily, real-time). Applications range from ad hoc data analysis to dashboards, data products, data journalism, interactive and email reports; solutions range from a Jupyter notebook prototype to an automated preparation pipeline (batch) and a micro-batch or real-time processing pipeline. The automated batch pipeline is our focus.]
13. How to prepare data?
14. Common operations
    ▸ Date parsing
    ▸ Deciding on a strategy for null or non-parseable values
    ▸ Enforce value ranges
    ▸ Sanitise strings
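A minimal pandas sketch of these operations (the file name and column names are illustrative assumptions, not from the talk):

python (illustrative):
import pandas as pd

df = pd.read_csv('people.csv')  # hypothetical input file

# date parsing: unparseable values become NaT instead of raising
df['birth_date'] = pd.to_datetime(df['birth_date'], errors='coerce')

# null strategy: drop rows that are missing the key column
df = df.dropna(subset=['age'])

# enforce value ranges: clamp ages to a plausible interval
df['age'] = df['age'].clip(lower=0, upper=120)

# sanitise strings: trim whitespace and normalise capitalisation
df['name'] = df['name'].str.strip().str.title()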
15. Existing tools
    ▸ Trifacta Wrangler, Talend Dataprep, Google Open Refine
    ▸ Great tools to check data quality and define transformations
16. So why custom solutions with Python and Pandas?
    ▸ With Python, you can do anything!
    ▸ It is not that difficult
    ▸ Pandas is a versatile tool for manipulating DataFrames
    ▸ Easy to specify transformations
    ▸ You are not limited to Pandas: the whole Python ecosystem is available, like scikit-learn
17. Example from a Jupyter notebook
    ▸ Load a simple file with a list of names and ages of different persons
18. Example: statistics on groups (names)
    ▸ Is there a relationship between name length and median age?
    ▸ Chain operations
    ▸ Plot the length of each name vs. its median age (warning: there is an outlier)
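A sketch of what those notebook steps could look like (the file name and column names are assumptions for illustration):

python (illustrative):
import pandas as pd

df = pd.read_csv('names_and_ages.csv')  # hypothetical file, one row per person

# chain operations: median age per name, then the length of each name
stats = df.groupby('name')['age'].median().reset_index()
stats.columns = ['name', 'median_age']
stats['name_length'] = stats['name'].str.len()

# scatter plot to look for a relationship between name length and median age
stats.plot.scatter(x='name_length', y='median_age')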
19. Something is wrong: null values, label issues
20. Let's fix this
    ▸ Deal with missing values with `dropna` or `fillna`
    ▸ Clean names
    ▸ Reject outliers
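One way those fixes might look in pandas, continuing the same example (thresholds and column names are illustrative assumptions):

python (illustrative):
import pandas as pd

df = pd.read_csv('names_and_ages.csv')  # hypothetical file from the previous example

# missing values: drop rows without a name, fill missing ages with the median
df = df.dropna(subset=['name'])
df['age'] = df['age'].fillna(df['age'].median())

# clean names: strip whitespace and unify capitalisation so labels match
df['name'] = df['name'].str.strip().str.title()

# reject outliers: keep only plausible ages
df = df[df['age'].between(0, 120)]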
21. Close the loop to improve the data entry/acquisition
    ▸ Many errors can be avoided during data collection:
      ▸ form / column validation
      ▸ drop-down selections for categories
    ▸ Report rejected rows to improve the collection process
    [diagram: data goes into the preparation script, which emits a list of issues used to improve the forms…]
22. Testing your preparation
    ▸ Unit tests
    ▸ Test for anticipated edge cases (defensive programming)
    ▸ Property-based testing (http://hypothesis.works/)
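For example, a property-based test with hypothesis can assert that the cleaning step never returns an out-of-range age, whatever messy input it receives (`clean_ages` is a hypothetical stand-in for your own preparation code):

python (illustrative):
import pandas as pd
from hypothesis import given
import hypothesis.strategies as st

def clean_ages(values):
    # hypothetical cleaning step: coerce to numbers, drop the rest, clamp the range
    ages = pd.to_numeric(pd.Series(values), errors='coerce').dropna()
    return ages.clip(lower=0, upper=120)

@given(st.lists(st.one_of(st.none(), st.text(),
                          st.floats(allow_nan=True, allow_infinity=True))))
def test_ages_always_in_range(values):
    cleaned = clean_ages(values)
    assert cleaned.between(0, 120).all()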
23. More references for data cleaning
    ▸ Data cleaning with Pandas: https://www.youtube.com/watch?v=_eQ_8U5kruQ
    ▸ Data cleanup with Python: http://kjamistan.com/automating-your-data-cleanup-with-python/
    ▸ Modern Pandas: Tidy Data: https://tomaugspurger.github.io/modern-5-tidy.html
24. Setting up a pipeline with AWS Lambda
25. Some challenges
    ▸ Don't let users run scripts
    ▸ Automating is part of a quality process
    ▸ Keeping things simple…
    ▸ and cheap
26. What is AWS Lambda: a serverless solution
    ▸ Serverless offering by AWS
    ▸ No lifecycle to manage or shared state => resilient
    ▸ Auto-scaling
    ▸ Pay for actual running time: low cost
    ▸ No server or infrastructure management: reduced dev / devops cost
    [diagram: events trigger the lambda function, which produces output]
27. Creating a function: just a Python function
28. Creating a function: options
29. Creating an "architecture" with triggers
30. Batch processing at regular intervals
    ▸ cron scheduling
    ▸ Let your function fetch some data and process it at a regular interval
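A hedged boto3 sketch of wiring such a schedule to a function with CloudWatch Events (the rule name, schedule expression and function name are placeholders, not from the talk):

python (illustrative):
import boto3

events = boto3.client('events')
lam = boto3.client('lambda')

# a rule that fires once a day; cron expressions such as 'cron(0 3 * * ? *)' also work
rule = events.put_rule(Name='daily-data-prep',
                       ScheduleExpression='rate(1 day)',
                       State='ENABLED')

# let CloudWatch Events invoke the function, then point the rule at it
lam.add_permission(FunctionName='my-data-prep',
                   StatementId='daily-data-prep-event',
                   Action='lambda:InvokeFunction',
                   Principal='events.amazonaws.com',
                   SourceArn=rule['RuleArn'])
events.put_targets(Rule='daily-data-prep',
                   Targets=[{'Id': '1',
                             'Arn': 'arn:aws:lambda:ap-northeast-1:123456789012:function:my-data-prep'}])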
31. An API / webhook
    ▸ On API call
    ▸ Can be triggered from a Google spreadsheet
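If the function sits behind an API Gateway endpoint, the webhook call is just an HTTP request; the URL and payload below are made up for illustration:

python (illustrative):
import requests

# hypothetical API Gateway endpoint in front of the lambda function
url = 'https://abc123.execute-api.ap-northeast-1.amazonaws.com/prod/prepare'
response = requests.post(url, json={'sheet_id': 'my-spreadsheet-id'})
print(response.status_code, response.text)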
32. Setting up AWS Lambda for Pandas
    Pandas and its dependencies need to be compiled for Amazon Linux x86_64:
    1. Launch an EC2 instance and connect to it
    2. Install pandas in a virtualenv
    3. Zip the installed libraries

    shell
    # install compilation environment
    sudo yum -y update
    sudo yum -y upgrade
    sudo yum groupinstall "Development Tools"
    sudo yum install blas blas-devel lapack lapack-devel Cython --enablerepo=epel
    # create and activate virtual env
    virtualenv pdenv
    source pdenv/bin/activate
    # install pandas
    pip install pandas
    # zip the environment content
    cd ~/pdenv/lib/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude *.pyc
    cd ~/pdenv/lib64/python2.7/site-packages/
    zip -r ~/pdenv.zip . --exclude *.pyc
    # add the supporting libraries
    cd ~/
    mkdir -p libs
    cp /usr/lib64/liblapack.so.3 /usr/lib64/libblas.so.3 /usr/lib64/libgfortran.so.3 /usr/lib64/libquadmath.so.0 libs/
    zip -r ~/pdenv.zip libs
33. Using pandas from a lambda function
    ▸ The lambda process needs to access those binaries
    ▸ Set up env variables
    ▸ Call a subprocess
    ▸ And pickle the function input
    ▸ AWS will call `lambda_function.lambda_handler`

    python: lambda_function.py
    import os, sys, subprocess, json
    import cPickle as pickle

    LIBS = os.path.join(os.getcwd(), 'local', 'lib')

    def handler(filename):
        def handle(event, context):
            # pickle the event so the subprocess can read the call arguments
            pickle.dump(event, open('/tmp/event.p', 'wb'))
            env = os.environ.copy()
            env.update(LD_LIBRARY_PATH=LIBS)
            proc = subprocess.Popen(('python', filename),
                                    env=env, stdout=subprocess.PIPE)
            proc.wait()
            return proc.stdout.read()
        return handle

    lambda_handler = handler('my_function.py')
34. The actual function
    ▸ Get the input data from a Google spreadsheet, a csv file on S3, an FTP…
    ▸ Clean it
    ▸ Copy it somewhere

    python: my_function.py
    import pandas as pd
    import pickle
    import requests
    from StringIO import StringIO

    def run():
        # get the lambda call arguments
        event = pickle.load(open('/tmp/event.p', 'rb'))

        # load some data from a google spreadsheet
        r = requests.get('https://docs.google.com/spreadsheets'
                         '/d/{sheet_id}/export?format=csv&gid={page_id}')
        data = r.content.decode('utf-8')
        df = pd.read_csv(StringIO(data))

        # Do something

        # save as file
        file_ = StringIO()
        df.to_csv(file_, encoding='utf-8')
        # copy the result somewhere

    if __name__ == '__main__':
        run()
35. Upload and test
    ▸ Add your lambda function code to the environment zip
    ▸ Upload your function
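One way to script those two steps from your machine (the function name and file names are assumptions):

python (illustrative):
import zipfile
import boto3

# add the handler and the actual processing script to the packaged environment
with zipfile.ZipFile('pdenv.zip', 'a') as archive:
    archive.write('lambda_function.py')
    archive.write('my_function.py')

# push the new archive to the existing lambda function
client = boto3.client('lambda')
with open('pdenv.zip', 'rb') as f:
    client.update_function_code(FunctionName='my-data-prep', ZipFile=f.read())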
36. Caveat 1: Python 2.7
    ▸ Officially, only Python 2.7 is supported
    ▸ But Python 3 is available and can be called as a subprocess
    ▸ Details here: http://www.cloudtrek.com.au/blog/running-python-3-on-aws-lambda/
37. Caveat 2: max process memory (1.5GB) and execution time
    ▸ Need to split the dataset if it is too large
    ▸ Looping over the chunks in a single lambda call may exceed the timeout
    ▸ Better to map the chunks to multiple lambda calls, then merge the dataset at the end
    ▸ Lambda functions should be simple; chain them if required
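A rough sketch of that fan-out, invoking the same function asynchronously once per chunk (the function name and payload shape are hypothetical):

python (illustrative):
import json
import boto3

lam = boto3.client('lambda')

# describe each chunk by the row range it should process
chunks = [{'start_row': start, 'end_row': start + 10000}
          for start in range(0, 100000, 10000)]

for chunk in chunks:
    # 'Event' means asynchronous invocation: the caller does not wait for the result
    lam.invoke(FunctionName='my-data-prep',
               InvocationType='Event',
               Payload=json.dumps(chunk))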
38. Takeaways
39. Takeaways
    ▸ Know your data and your target
    ▸ Pandas can solve many issues
    ▸ Defensive programming and closing the loop
    ▸ AWS Lambda is a powerful and flexible tool for time- and resource-constrained teams
40. Thanks. Questions?
    @fabian_dubois
    fabian@datamaplab.com
    check denryoku.io
