I work in a Data Innovation Lab with a horde of Data Scientists. Data Scientists gather data, clean data, apply Machine Learning algorithms and produce results, all of that with specialized tools (Dataiku, Scikit-Learn, R...). These processes run on a single machine, on data that is fixed in time, and they have no constraint on execution speed.
With my fellow Developers, our goal is to bring these processes to production. Our constraints are very different: we want the code to be versioned, to be tested, to be deployed automatically and to produce logs. We also need it to run in production on distributed architectures (Spark, Hadoop), with fixed versions of languages and frameworks (Scala...), and with data that changes every day.
In this talk, I will explain how we, Developers, work hand-in-hand with Data Scientists to shorten the path to running data workflows in production.
2. Who I am
• Software engineer for 15 years
• Consultant at Ippon Technologies in Paris, France
• Favorite subjects: Spark, Cassandra, Ansible, Docker
• @aseigneurin
3. Ippon Technologies
• 200 software engineers in France and the US
• In the US: offices in DC, NYC and Richmond, Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster, Tatami, etc.
• @ipponusa
4. The project
• Data Innovation Lab of a large insurance company
• Data → Business value
• Team of 30 Data Scientists + Software Developers
6. Skill set of a Data Scientist
• Strong in:
• Science (maths / statistics)
• Machine Learning
• Analyzing data
• Good / average in:
• Programming
• Not good in:
• Software engineering
12. Skill set of a Developer
• Strong in:
• Software engineering
• Programming
• Good / average in:
• Science (maths / statistics)
• Analyzing data
• Not good in:
• Machine Learning
13. How Developers work
• Programming languages
• Java
• Scala
• Development environment
• Eclipse
• IntelliJ IDEA
• Toolbox
• Maven
• …
15. Workflow
1. Data Cleansing
2. Feature Engineering
3. Train a Machine Learning model
1. Split the dataset: training/validation/test datasets
2. Train the model
4. Apply the model on new data
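Step 3.1 of the workflow above can be sketched in a few lines of plain Python; the 60/20/20 ratios and the seeded shuffle are assumptions for illustration, not values from the talk:

```python
import random

def split_dataset(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle and split rows into training/validation/test subsets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded shuffle keeps the split reproducible
    n = len(rows)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (rows[:n_train],                    # training set
            rows[n_train:n_train + n_valid],   # validation set
            rows[n_train + n_valid:])          # test set

train_set, valid_set, test_set = split_dataset(range(100))
```

The training set fits the model, the validation set tunes its parameters, and the test set gives an unbiased estimate of accuracy.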
16. Data Cleansing
• Convert strings to numbers/booleans/…
• Parse dates
• Handle missing values
• Handle data in an incorrect format
• …
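A minimal sketch of such a cleansing step in plain Python; the field names and the policy of mapping bad values to None are illustrative assumptions:

```python
from datetime import datetime

def clean_record(raw):
    """Convert a raw record of strings into typed values.

    Unparseable or missing fields become None so a later step can
    decide how to handle them."""
    def to_int(s):
        try:
            return int(s)
        except (TypeError, ValueError):
            return None  # missing value or incorrect format

    def to_date(s):
        try:
            return datetime.strptime(s, "%Y-%m-%d").date()
        except (TypeError, ValueError):
            return None  # date in an unexpected format

    return {
        "age": to_int(raw.get("age")),
        "subscribed": (raw.get("subscribed") or "").lower() in ("true", "yes", "1"),
        "birth_date": to_date(raw.get("birth_date")),
    }
```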
17. Feature Engineering
• Transform data into numerical features
• E.g.:
• A birth date → age
• Dates of phone calls → Number of calls
• Text → Vector of words
• 2 names → Levenshtein distance
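Two of the transformations above fit in a few lines; this is a textbook dynamic-programming Levenshtein distance, not the project's actual implementation:

```python
def levenshtein(a, b):
    """Edit distance between two strings, e.g. two names to match."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def age_from_birth_year(birth_year, current_year):
    """Turn a birth date (simplified here to a year) into an age feature."""
    return current_year - birth_year
```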
18. Machine Learning
• Train a model
• Test an algorithm with different params
• Cross validation (Grid Search)
• Compare different algorithms, e.g.:
• Logistic regression
• Gradient boosting trees
• Random forest
19. Machine Learning
• Evaluate the accuracy of the model
• Root Mean Square Error (RMSE)
• ROC curve
• …
• Examine predictions
• False positives, false negatives…
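RMSE, the first metric listed, is simple enough to compute by hand; a minimal version:

```python
import math

def rmse(predictions, targets):
    """Root Mean Square Error between predicted and actual values."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(errors) / len(errors))
```

A lower RMSE means predictions are closer to the observed values; 0 means a perfect fit on that dataset.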
23. Distribute the processing
• Data Scientists work with data samples
• No constraint on processing time
• Processing on the Data Scientist’s workstation (IPython Notebook) or on a single server (Dataiku)
24. Distribute the processing
• In production:
• H/W resources are constrained
• Large data sets to process
• Spark:
• Included in CDH
• DataFrames (Spark 1.3+) ≃ Pandas DataFrames
• Fast!
26. Use a centralized data store
• Data Scientists store data on their workstations
• Limited storage
• Data not shared within the team
• Data privacy not enforced
• Subject to data losses
27. Use a centralized data store
• Store data on HDFS:
• Hive tables (SQL)
• Parquet files
• Security: Kerberos + permissions
• Redundant + potentially unlimited storage
• Easy access from Spark and Dataiku
29. Programming languages
• Data Scientists write code on their workstations
• This code may not run in the datacenter
• Language variety → Hard to share knowledge
30. Programming languages
• Use widely spread languages
• Spark in Python/Scala
• Support for R is too young
• Provide assistance to ease the adoption!
35. Source Control
• Data Scientists work on their workstations
• Code is not shared
• Code may be lost
• Intermediate versions are not preserved
• Lack of code review
36. Source Control
• Git + GitHub / GitLab
• Versioning
• Easy to go back to a version running in production
• Easy sharing (+permissions)
• Code review
41. Secure the build process
• Data Scientists may commit code… without running tests first!
• Quality may decrease over time
• Packages built by hand on a workstation are not reproducible
42. Secure the build process
• Jenkins
• Unit test report
• Code coverage report
• Packaging: Jar / Egg
• Dashboard
• Notifications (Slack + email)
47. Adapt to living data
• Data Scientists work with:
• Frozen data
• Samples
• Risks with data received on a regular basis:
• Incorrect format (dates, numbers…)
• Corrupt data (incl. encoding changes)
• Missing values
48. Adapt to living data
• Data Checking & Cleansing
• Preliminary steps before processing the data
• Decide what to do with invalid data
• Thetis
• Internal tool
• Performs most checking & cleansing operations
50. Library of transformations
• Dataiku “shakers”:
• Parse dates
• Split a URL (protocol, host, path, …)
• Transform a post code into a city / department name
• …
• Cannot be used outside Dataiku
51. Library of transformations
• All transformations should be code
• Reuse transformations between projects
• Provide a library
• Transformation = DataFrame → DataFrame
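The "transformation = DataFrame → DataFrame" pattern makes transformations composable and unit-testable. A Spark-free sketch of the idea, modelling a DataFrame as a list of row dicts (all names here are illustrative, not from the talk):

```python
from functools import reduce

# Each transformation takes a "frame" (list of row dicts) and returns a new one,
# mirroring the DataFrame → DataFrame signature.
def with_age(frame, current_year=2016):
    return [dict(row, age=current_year - row["birth_year"]) for row in frame]

def drop_nulls(frame, column):
    return [row for row in frame if row.get(column) is not None]

def pipeline(frame, *transformations):
    """Chain transformations left to right, like successive DataFrame stages."""
    return reduce(lambda f, t: t(f), transformations, frame)
```

Because each stage is a plain function, it can be reused across projects and tested in isolation with mock data.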
• Unit tests
53. Unit test the data pipeline
• Independent data processing steps
• Data pipeline not often tested from beginning to end
• Data pipeline easily broken
54. Unit test the data pipeline
• Unit test each data transformation stage
• Scala: ScalaTest
• Python: unittest
• Use mock data
• Compare DataFrames:
• No library (yet?)
• Compare lists of lists
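The "compare lists of lists" approach can be wrapped in a small helper that ignores row order, since collected DataFrame rows come back in no guaranteed order; a sketch in plain Python, with the output of a `collect()` call simulated by plain lists:

```python
def assert_frames_equal(actual_rows, expected_rows):
    """Compare two collected DataFrames (lists of rows) ignoring row order."""
    assert sorted(actual_rows) == sorted(expected_rows), (
        "DataFrames differ: %r != %r" % (actual_rows, expected_rows))
```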
56. Assemble the Workflow
• Separate transformation processes:
• Transformations applied to some data
• Results are frozen and used in other processes
• Jobs are launched manually
• No built-in scheduler in Spark