BDX 2015 - Scaling out big-data computation & machine learning using Pig, Python and Luigi
1. Scaling out big-data computation & machine learning using Pig, Python and Luigi
   Ron Reiter, VP R&D, Crosswise
2. AGENDA
§ The goal
§ Data processing at Crosswise
§ The basics of prediction using machine learning
§ The "big data" stack
§ An introduction to Pig
§ Combining Pig and Python
§ Workflow management using Luigi and Amazon EMR
3. THE GOAL
1. Process huge amounts of data points
2. Allow data scientists to focus on their research
3. Adjust production systems according to research conclusions quickly, without duplicating logic between research and production systems
4. DATA PROCESSING AT CROSSWISE
§ We are building a graph of devices that belong to the same user, based on users' browsing data
5. DATA PROCESSING AT CROSSWISE
§ Interesting facts about our data processing pipeline:
  § We process 1.5 trillion data points from 1 billion devices
  § 30 TB of compressed data
  § A cluster with 1,600 cores running for 24 hours
6. DATA PROCESSING AT CROSSWISE
§ Our constraints
  § We are dealing with massive amounts of data, and we have to go for a solid, proven and truly scalable solution
  § Our machine learning research team uses Python and sklearn
  § We are in a race against time (to market)
  § We do not want the overhead of maintaining two separate processing pipelines, one for research and one for large-scale prediction
7. PREDICTING AT SCALE
[Diagram: model building phase (small / large scale): Labeled Data -> Train Model -> Evaluate Model -> Model; prediction phase (massive scale): Unlabeled Data + Model -> Predict -> Output]
8. PREDICTING AT SCALE
§ Steps
  § Training & evaluating the model (iterations on training and evaluation are done until the model's performance is acceptable)
  § Predicting using the model at massive scale
§ Assumptions
  § Distributed learning is not required
  § Distributed prediction is required
  § Distributed learning can be achieved, but not all machine learning models support it, and not all infrastructures know how to do it
9. THE "BIG DATA" STACK
[Diagram: the stack, layer by layer]
§ Workflow Management: Oozie, Luigi, Azkaban
§ High Level Language: Pig, Hive, Scalding, Spark Program, GraphLab Script
§ Computation Framework: MapReduce, Tez, Spark, GraphLab
§ Resource Manager: YARN, Mesos
10. PIG
§ Pig is a high-level, SQL-like language, which runs on Hadoop
§ Pig also supports User Defined Functions written in Java and Python
11. HOW DOES PIG WORK?
§ Pig converts SQL-like queries to MapReduce iterations
§ Pig builds a work plan based on a DAG it calculates
§ Newer versions of Pig know how to run on different computation engines, such as Apache Tez and Spark, which offer a higher level of abstraction than MapReduce
[Diagram: Pig Runner driving a chain of Map/Reduce stages]
12. PIG DIRECTIVES
The most common Pig directives are:
§ LOAD/STORE – load and save data sets
§ FOREACH – map function which constructs a new row for each row in a data set
§ FILTER – filters rows in or out according to given criteria
§ GROUP – groups rows by a specific column or set of columns
§ JOIN – joins two data sets on a specific column
And many more functions: http://pig.apache.org/docs/r0.14.0/func.html
13. PIG CODE EXAMPLE

customers = LOAD 'customers.tsv' USING PigStorage('\t') AS
    (customer_id, first_name, last_name);
orders = LOAD 'orders.tsv' USING PigStorage('\t') AS
    (customer_id, price);
aggregated = FOREACH (GROUP orders BY customer_id) GENERATE
    group AS customer_id,
    SUM(orders.price) AS price_sum;
joined = JOIN customers BY customer_id, aggregated BY customer_id;
STORE joined INTO 'customers_total.tsv' USING PigStorage('\t');
14. COMBINING PIG & PYTHON
15. COMBINING PIG AND PYTHON
§ Pig gives you the power to scale and process data conveniently with an SQL-like syntax
§ Python is easy and productive, and has many useful scientific packages available (sklearn, nltk, numpy, scipy, pandas)
16. MACHINE LEARNING IN PYTHON USING SCIKIT-LEARN
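A minimal scikit-learn sketch of the train-then-pickle flow that the next slides build on (column names and file paths are illustrative, not from the slides):

import pickle

import pandas
from sklearn.linear_model import SGDClassifier

# Train a simple classifier on labelled data and pickle it for later prediction
df = pandas.read_csv('labelled_logs.csv')
clf = SGDClassifier()
clf.fit(df[["a", "b", "c"]].values, df["class"].values)

with open('model.pkl', 'wb') as fd:
    pickle.dump(clf, fd)

# Later, load the model back and predict on new feature vectors
with open('model.pkl', 'rb') as fd:
    clf = pickle.load(fd)
print(clf.predict([[1.0, 2.0, 3.0]]))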
17. PYTHON UDF
§ Pig provides two Python UDF (user-defined function) engines: Jython (JVM) and CPython
§ Mortar (mortardata.com) added support for CPython UDFs, which support scientific packages (numpy, scipy, sklearn, nltk, pandas, etc.)
§ A Python UDF is a function with a decorator that specifies the output schema (since Python is dynamic, the input schema is not required)

from pig_util import outputSchema

@outputSchema('value:int')
def multiply_by_two(num):
    return num * 2
18. USING THE PYTHON UDF
§ Register the Python UDF:
REGISTER 'udfs.py' USING streaming_python AS udfs;
§ If you prefer speed over package compatibility, use Jython:
REGISTER 'udfs.py' USING jython AS udfs;
§ Then, use the UDF within a Pig expression:
processed = FOREACH data GENERATE udfs.multiply_by_two(num);
19. CONNECT PIG AND PYTHON JOBS
§ In many common scenarios, especially in machine learning, a classifier can usually be trained using a simple Python script
§ Using the classifier we trained, we can now predict on a massive scale using a Python UDF
§ Requires a higher-level workflow manager, such as Luigi
[Diagram: Python job produces a pickled model (S3://model.pkl), which a Pig job consumes through a Python UDF]
20. WORKFLOW MANAGEMENT
[Diagram: data flow through Task A, Task B and Task C, chained by REQUIRES edges; each task OUTPUTS targets and USES data stored on S3, HDFS, SFTP, FILE or DB]
21. WORKFLOW MANAGEMENT WITH LUIGI
§ Unlike Oozie and Azkaban, which are heavy workflow managers, Luigi is more of a Python package.
§ Luigi works based on dependency resolving, similar to a Makefile (or SCons)
§ Luigi defines an interface of "Tasks" and "Targets", which we use to connect the two tasks using dependencies.
[Diagram: per-day pipelines, e.g. LABELED LOGS 2014-01-01 -> TRAINED MODEL 2014-01-01; UNLABELED LOGS 2014-01-01 + TRAINED MODEL 2014-01-01 -> OUTPUT 2014-01-01; likewise for 2014-01-02]
22. EXAMPLE - TRAIN MODEL LUIGI TASK
§ Let's see how it's done:

import pickle

import luigi
import pandas
import sklearn.linear_model
from luigi.contrib.s3 import S3Target  # luigi.s3 in older Luigi releases


class TrainModel(luigi.Task):
    target_date = luigi.DateParameter()

    def requires(self):
        return LabelledLogs(self.target_date)

    def output(self):
        return S3Target('s3://mybucket/model_%s.pkl' % self.target_date)

    def run(self):
        clf = sklearn.linear_model.SGDClassifier()
        # Read the labelled logs produced by the upstream task
        with self.input().open('r') as in_fd:
            df = pandas.read_csv(in_fd)
        clf.fit(df[["a", "b", "c"]].values, df["class"].values)
        # Pickle the trained model and write it to S3
        with self.output().open('w') as out_fd:
            out_fd.write(pickle.dumps(clf))
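A minimal way to run the task above end to end, as a sketch using Luigi's local scheduler (the date is illustrative; LabelledLogs is assumed to be defined elsewhere as on the slide):

import datetime

import luigi

# Build and run the task graph for a single day on the local scheduler
luigi.build([TrainModel(target_date=datetime.date(2014, 1, 1))],
            local_scheduler=True)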
23. PREDICT RESULTS LUIGI TASK
§ We predict using a Pig task which has access to the pickled model:

import luigi
from luigi.contrib.s3 import S3Target  # luigi.s3 in older Luigi releases


class PredictResults(PigTask):  # PigTask: Crosswise's custom Pig-running base task
    PIG_SCRIPT = """
        REGISTER 'predict.py' USING streaming_python AS udfs;
        data = LOAD '$INPUT' USING PigStorage('\t');
        predicted = FOREACH data GENERATE user_id, udfs.predict_results(*);
        STORE predicted INTO '$OUTPUT' USING PigStorage('\t');
    """
    PYTHON_UDF = 'predict.py'

    target_date = luigi.DateParameter()

    def requires(self):
        return {'logs': UnlabelledLogs(self.target_date),
                'model': TrainModel(self.target_date)}

    def output(self):
        return S3Target('s3://mybucket/results_%s.tsv' % self.target_date)
24. PREDICTION PIG USER-DEFINED FUNCTION (PYTHON)
§ We can then generate a custom UDF while replacing the $MODEL with an actual model file.
§ The model will be loaded when the UDF is initialized (this will happen on every map/reduce task using the UDF)

from pig_util import outputSchema
import numpy, pickle

clf = pickle.load(download_s3('$MODEL'))

@outputSchema('value:int')
def predict_results(feature_vector):
    return clf.predict(numpy.array(feature_vector))[0]
25. PITFALLS
§ For the classifier to work on your Hadoop cluster, you have to install the required packages on all of your Hadoop nodes (numpy, sklearn, etc.)
§ Sending arguments to a UDF is tricky; there is no way to initialize a UDF with arguments. To load a classifier into a UDF, you should generate the UDF from a template that contains the model you wish to use (a sketch follows)
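A minimal sketch of that template substitution, assuming a template file predict_template.py that contains the literal $MODEL placeholder (file names here are illustrative):

from string import Template

def render_udf(template_path, output_path, model_s3_path):
    # Read the UDF template and substitute the pickled model's location
    # for the literal $MODEL placeholder.
    with open(template_path) as fd:
        template = Template(fd.read())
    rendered = template.substitute(MODEL=model_s3_path)
    with open(output_path, 'w') as fd:
        fd.write(rendered)

render_udf('predict_template.py', 'predict.py',
           's3://mybucket/model_2014-01-01.pkl')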
26. CLUSTER PROVISIONING WITH LUIGI
§ To conserve resources, we use clusters only when needed, so we created the StartCluster task (a sketch follows below)
§ With this mechanism in place, we also have a cron job that kills idle clusters and saves even more money.
§ We use both EMR clusters and clusters provisioned by Xplenty, who provide us with their Hadoop provisioning infrastructure.
[Diagram: PigTask REQUIRES StartCluster, which OUTPUTS a ClusterTarget that the Pig task USES]
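StartCluster itself is not shown in the deck; below is a possible minimal sketch using boto3's EMR client, with a hypothetical ClusterTarget that simply records the cluster id (the target class, instance types and counts are illustrative):

import os

import boto3
import luigi


class ClusterTarget(luigi.Target):
    """Hypothetical target: exists once a cluster id has been recorded."""
    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)

    def write(self, cluster_id):
        with open(self.path, 'w') as fd:
            fd.write(cluster_id)


class StartCluster(luigi.Task):
    cluster_name = luigi.Parameter(default='pig-cluster')

    def output(self):
        return ClusterTarget('/tmp/%s.cluster_id' % self.cluster_name)

    def run(self):
        emr = boto3.client('emr')
        # Request an EMR cluster that stays alive between steps
        response = emr.run_job_flow(
            Name=self.cluster_name,
            ReleaseLabel='emr-4.2.0',
            Instances={
                'MasterInstanceType': 'm3.xlarge',
                'SlaveInstanceType': 'm3.xlarge',
                'InstanceCount': 10,
                'KeepJobFlowAliveWhenNoSteps': True,
            },
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
        )
        # Record the cluster id so downstream Pig tasks can use it
        self.output().write(response['JobFlowId'])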
27. USING LUIGI WITH OTHER COMPUTATION ENGINES
§ Luigi acts like the "glue" of data pipelines, and we use it to interconnect Pig and GraphLab jobs
§ Pig is very convenient for large-scale data processing, but it is very weak when it comes to graph analysis and iterative computation
§ One of the main disadvantages of Pig is that it has no conditional statements, so we need to use other tools to complete our arsenal
[Diagram: Pig task -> GraphLab task -> Pig task]
28. GRAPHLAB AT CROSSWISE
§ We use GraphLab to run graph processing at scale, for example to run connected components and create "users" from a graph of devices that belong to the same user
29. PYTHON API
§ Pig is a "data flow" language, not a general-purpose programming language. Its abilities are limited: there are no conditional blocks or loops. Loops are required when trying to reach "convergence", such as when finding connected components in a graph. To overcome this limitation, a Python API has been created.

from org.apache.pig.scripting import Pig

P = Pig.compile(
    "A = LOAD '$input' AS (name, age, gpa);" +
    "STORE A INTO '$output';")

Q = P.bind({
    'input': 'input.csv',
    'output': 'output.csv'})

result = Q.runSingle()
30. CROSSWISE HADOOP SSH JOB RUNNER
31. STANDARD LUIGI WORKFLOW
§ Standard Luigi Hadoop tasks need a correctly configured Hadoop client to launch jobs.
§ This can be a pain when running an automatically provisioned Hadoop cluster (e.g. an EMR cluster).
[Diagram: Luigi drives a local Hadoop client that talks to the Hadoop master node (NameNode, Job Tracker) and its slave nodes]
32. LUIGI HADOOP SSH RUNNER
§ At Crosswise, we implemented a Luigi task for running Hadoop JARs (e.g. Pig) remotely, just like the Amazon EMR API enables (sketch below).
§ Instead of launching steps through the EMR API, we implemented our own runner, to enable running steps concurrently.
[Diagram: Luigi connects over API/SSH to Hadoop client instances on the cluster master node, which drive the EMR slave nodes]
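A minimal sketch of such an SSH runner using paramiko (host, user, key path and script name are illustrative, not Crosswise's actual implementation):

import os

import luigi
import paramiko


class SSHPigTask(luigi.Task):
    """Sketch: run a Pig script on a remote Hadoop master node over SSH."""
    master_host = luigi.Parameter()
    pig_script = luigi.Parameter(default='job.pig')

    def output(self):
        return luigi.LocalTarget('%s.done' % self.pig_script)

    def run(self):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(self.master_host, username='hadoop',
                       key_filename=os.path.expanduser('~/.ssh/emr.pem'))
        try:
            # Copy the script to the master node and run it with the pig CLI
            sftp = client.open_sftp()
            sftp.put(self.pig_script, self.pig_script)
            _, stdout, _ = client.exec_command('pig -f %s' % self.pig_script)
            if stdout.channel.recv_exit_status() != 0:
                raise RuntimeError('Pig job failed')
        finally:
            client.close()
        # Mark the task as complete
        with self.output().open('w') as fd:
            fd.write('ok')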
33. WHY RUN HADOOP JOBS EXTERNALLY?
Working with the EMR API is convenient, but Luigi expects to run jobs from the master node rather than through the EMR job submission API.
Advantages:
§ Does not require running on a locally configured Hadoop client
§ Allows provisioning the clusters as a task (using Amazon EMR's API, for example)
§ The same Luigi process can utilize several Hadoop clusters at once
34. NEXT STEPS AT CROSSWISE
§ We are planning on moving to Apache Tez, since MapReduce has a high overhead for complicated processes, and it is hard to tweak and utilize the framework properly
§ We are also investigating Dato's distributed data processing, training and prediction capabilities at scale (using GraphLab Create)
35. QUESTIONS?
36. THANK YOU!