I work in a Data Innovation Lab with a horde of Data Scientists. Data Scientists gather data, clean data, apply Machine Learning algorithms and produce results, all of that with specialized tools (Dataiku, Scikit-Learn, R...). These processes run on a single machine, on data that is fixed in time, and they have no constraint on execution speed.
With my fellow Developers, our goal is to bring these processes to production. Our constraints are very different: we want the code to be versioned, to be tested, to be deployed automatically and to produce logs. We also need it to run in production on distributed architectures (Spark, Hadoop), with fixed versions of languages and frameworks (Scala...), and with data that changes every day.
In this talk, I will explain how we, Developers, work hand-in-hand with Data Scientists to shorten the path to running data workflows in production.
2. Who I am
• Software engineer for 15 years
• Consultant at Ippon Technologies in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
3. Ippon Technologies
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
4. The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
6. Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
12. Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
13. How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
15. Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
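Step 3.1 of the workflow above can be sketched in a few lines of plain Python; the 60/20/20 ratios and the seeded shuffle are assumptions for illustration, not values from the talk:

```python
import random

def split_dataset(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle and split rows into training/validation/test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded shuffle keeps the split reproducible
    n = len(rows)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (rows[:n_train],                    # training set
            rows[n_train:n_train + n_valid],   # validation set
            rows[n_train + n_valid:])          # test set

train_set, valid_set, test_set = split_dataset(range(100))
```

The training set fits the model, the validation set tunes its parameters, and the test set gives an unbiased estimate of accuracy.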
16. Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
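A minimal sketch of such a cleansing step in plain Python; the field names and the policy of mapping bad values to None are illustrative assumptions:

```python
from datetime import datetime

def clean_record(raw):
    """Convert a raw record of strings into typed values.

    Unparseable or missing fields become None so a later step can
    decide how to handle them."""
    def to_int(s):
        try:
            return int(s)
        except (TypeError, ValueError):
            return None  # missing value or incorrect format

    def to_date(s):
        try:
            return datetime.strptime(s, "%Y-%m-%d").date()
        except (TypeError, ValueError):
            return None  # date in an unexpected format

    return {
        "age": to_int(raw.get("age")),
        "subscribed": (raw.get("subscribed") or "").lower() in ("true", "yes", "1"),
        "birth_date": to_date(raw.get("birth_date")),
    }
```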
17. Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
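Two of the transformations above fit in a few lines; this is a textbook dynamic-programming Levenshtein distance, not the project's actual implementation:

```python
def levenshtein(a, b):
    """Edit distance between two strings, e.g. two names to match."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def age_from_birth_year(birth_year, current_year):
    """Turn a birth date (simplified here to a year) into an age feature."""
    return current_year - birth_year
```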
18. Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
19. Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
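RMSE, the first metric listed, is simple enough to compute by hand; a minimal version:

```python
import math

def rmse(predictions, targets):
    """Root Mean Square Error between predicted and actual values."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(errors) / len(errors))
```

A lower RMSE means predictions are closer to the observed values; 0 means a perfect fit on that dataset.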
23. Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
24. Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
26. Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
27. Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
29. Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
30. Programming languages
• Use widely spread languages
• Spark in Python/Scala
• Support for R is too young
• Provide assistance to ease the adoption!
35. Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
36. Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
41. Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
42. Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
47. Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
48. Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
50. Library of transformations
• Dataiku “shakers”:
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
51. Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
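The "transformation = DataFrame → DataFrame" pattern makes transformations composable and unit-testable. A Spark-free sketch of the idea, modelling a DataFrame as a list of row dicts (all names here are illustrative, not from the talk):

```python
from functools import reduce

# Each transformation takes a "frame" (list of row dicts) and returns a new one,
# mirroring the DataFrame → DataFrame signature.
def with_age(frame, current_year=2016):
    return [dict(row, age=current_year - row["birth_year"]) for row in frame]

def drop_nulls(frame, column):
    return [row for row in frame if row.get(column) is not None]

def pipeline(frame, *transformations):
    """Chain transformations left to right, like successive DataFrame stages."""
    return reduce(lambda f, t: t(f), transformations, frame)
```

Because each stage is a plain function, it can be reused across projects and tested in isolation with mock data.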
• Unit tests
53. Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
54. Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
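The "compare lists of lists" approach can be wrapped in a small helper that ignores row order, since collected DataFrame rows come back in no guaranteed order; a sketch in plain Python, with the output of a `collect()` call simulated by plain lists:

```python
def assert_frames_equal(actual_rows, expected_rows):
    """Compare two collected DataFrames (lists of rows) ignoring row order."""
    assert sorted(actual_rows) == sorted(expected_rows), (
        "DataFrames differ: %r != %r" % (actual_rows, expected_rows))
```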
56. Assemble the Workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark