1. Data Science at Scale - The DevOps Approach
DevOps Practices for Data Scientists and Engineers
Mihai Criveti
22nd September 2019
https://github.com/crivetimihai
http://galaxy.ansible.com/crivetimihai
2. 1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
3. Speaker Bio
Mihai Criveti, IBM
Designs and builds multi-cloud
customer solutions for Cloud Native
applications, Big Data analytics and
Machine Learning.
Pursuing an MSc in Data Science at UCD.
Leads the Cloud Native competency
for IBM Cloud Solutioning.
Passionate about Open Source
software, DevOps and Data Science.
5. What is Data Science?
Data Science
Multi-disciplinary field that brings together Computer Science, Statistics/Machine Learning, and
Data Analysis to understand and extract insights from ever-increasing amounts of data.
Machine Learning
The science of getting computers to act without being explicitly programmed.
Deep Learning
Family of machine learning methods based on learning data representations, as opposed to
task-specific algorithms. Can be supervised, semi-supervised or unsupervised.
AI
Intelligent machines that work and react like humans.
6. Moving Towards Big Data
Figure 1: Powerful models and big data support Machine Learning
7. Data Scientist Domain
Data Engineering:
• Linux, Cloud, Big Data Platforms.
• Streaming, big data pipelines.
Software Development:
• Coding skills, such as Python, R, SQL.
• Development practices: Agile, DevOps,
CI/CD and using GitOps effectively.
Figure 2: Data Scientist Venn Diagram
“While data scientists are recognized for their brilliant algorithms, up to 80% of their time
could be spent collecting, cleaning and organizing data.” 1
1 Forbes: Radical Change Is Coming To Data Science Jobs
8. Data Science Roles
Data Scientist / Analyst:
• Turn raw data into valuable insights that an organization needs in order to grow or compete.
• Analytical data experts with the skill to solve complex problems and the curiosity to explore
what problems need solving.
• They use Data, visualization, machine learning, deep learning, pattern recognition, natural
language processing, analytics.
• Always curious: "What can we learn from this data? What actions can we take afterwards?"
Data Engineer / Data Architect:
• Prepare the “big data” infrastructure to be analysed by Data Scientists.
• Software engineers who design, build, and integrate data from various sources, and manage big data.
10. Often, it’s a manual process
Figure 3: Pen and paper - planning
11. Data Science is awesome
Data Science is OSEMN (pronounced AWESOME!)2 - an iterative process that consists, largely, of the following steps (a minimal code sketch follows the list):
1. Inquire: ask a meaningful question.
2. Obtain: get the required data.
3. Scrub: clean the data.
4. Explore: learn about your data, try stuff.
5. Model: create a couple of models and test them.
6. iNterpret: gain insight from data, present it in a usable form (reports, dashboards,
applications, etc).
2 O'Reilly - "Data Science at the Command Line: Facing the Future with Time-Tested Tools", Jeroen Janssens
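A minimal sketch of one pass through Obtain → Scrub → Explore → Model → iNterpret, assuming a recent scikit-learn and using its bundled iris data as a stand-in for real project data:

# obtain: load a small bundled dataset (stands in for a real data source)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame

# scrub: drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# explore: quick look at shape, summary statistics and class balance
print(df.shape)
print(df.describe())
print(df["target"].value_counts())

# model: train a simple classifier and check hold-out accuracy
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# interpret: report a single, easy-to-communicate metric
print("hold-out accuracy:", model.score(X_test, y_test))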
12. CRISP-DM - Cross-industry standard process for data mining
CRISP-DM is a widely used, general-purpose, tool-independent data-mining process model.
Stage                       Description
1. Business Understanding   Ask relevant questions, define objectives
2. Data Mining              Gather the necessary data
3. Data Cleaning            Scrub and fix data inconsistencies
4. Data Exploration         Form hypotheses about the data
5. Feature Engineering      Select / construct important features
6. Predictive Modeling      Train Machine Learning models
7. Data Visualization       Communicate findings to key stakeholders
8. Data Automation          Automate and deploy ML models
Figure 4: CRISP-DM
17. Types of Data
Data can be:
• Structured: tables, spreadsheets, relational databases.
• Unstructured: text, images, audio, video.
• Numerical / Quantitative: ex: pulse, temperature.
• Categorical / Qualitative: ex: hair colour.
• Big Data: massive data sets that cannot fit in memory on a single machine.
Figure 8: Different types of data
Data becomes information when viewed in context or post-analysis.
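As a small illustration, pandas makes the numerical-vs-categorical distinction explicit through column dtypes; the tiny dataset below is made up for the example:

import pandas as pd

# made-up structured data: one numerical and one categorical column
df = pd.DataFrame({
    "pulse": [62, 75, 88, 71],                           # numerical / quantitative
    "hair_colour": ["brown", "black", "red", "brown"],   # categorical / qualitative
})

# mark the qualitative column explicitly as categorical
df["hair_colour"] = df["hair_colour"].astype("category")

print(df.dtypes)                          # pulse: int64, hair_colour: category
print(df["pulse"].mean())                 # numerical columns support arithmetic
print(df["hair_colour"].value_counts())   # categorical columns support counts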
18. Obtaining Data
Private / Enterprise Data
• Private data can often be found in: data warehouses, SQL databases, NoSQL stores, data lakes or HDFS, document repositories, private wikis, ERP and CRM platforms, object storage and, more often than not, spreadsheets.
Public Data
• Weather, social media, location / geographical data, stock data, public internet data (scraping), wikis, Eurostat, Kaggle and government data portals are all sources of external data (see the sketch below).
Data compliance, governance and security are key to a successful data strategy.
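As a sketch of how simple obtaining public data can be (the URL and dataset below are hypothetical placeholders, not a real endpoint), many data portals publish CSV endpoints that pandas can read directly:

import pandas as pd

# hypothetical CSV endpoint on a public data portal; substitute a real dataset URL
URL = "https://data.example.gov/datasets/air-quality.csv"

df = pd.read_csv(URL)            # pandas can read directly from HTTP(S) URLs
print(df.head())                 # quick sanity check of the obtained data
print(len(df), "rows obtained")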
20. Common Data Formats
• XML, JSON, YAML
• CSV, TSV, Parquet, XLSX
• Markdown, HTML, DOCX
• TXT, PDF
• Audio, Video
• Data APIs that return JSON or XML
• Streaming data
• SQL and other database formats
• HDFS and other big data stores or encapsulated data on object storage
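A minimal sketch that round-trips a tiny made-up table through several of these formats with pandas (Parquet and XLSX support assume the optional pyarrow and openpyxl packages are installed):

import pandas as pd

# tiny made-up table, written out and read back in several formats
df = pd.DataFrame({"id": [1, 2, 3], "value": [3.2, 4.8, 1.5]})

df.to_csv("sample.csv", index=False)
df.to_json("sample.json", orient="records")
df.to_parquet("sample.parquet")            # requires pyarrow or fastparquet
df.to_excel("sample.xlsx", index=False)    # requires openpyxl

print(pd.read_csv("sample.csv"))
print(pd.read_json("sample.json"))
print(pd.read_parquet("sample.parquet"))
print(pd.read_excel("sample.xlsx"))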
21. The Big Data Challenge
Figure 10: Making sense of ever-growing data sets through automation and machine learning
23. Tools Data Scientists use
• Mathematics - Linear Algebra, Statistics, Combinatorics
• Some of them use R - focusing on statistics
• A lot of them use Python - usually with Jupyter Notebook as a front-end
• Libraries such as pandas and NumPy are very handy!
• Natural Language Processing with NLTK
• or Machine Learning libraries - scikit-learn, TensorFlow or PyTorch
• SQL and databases tend to be quite popular. After all, where does data live?
• NoSQL databases such as MongoDB are quite useful too…
• And a whole bunch of Big Data tools: Hadoop, Spark, Kafka, etc.
• They write papers too, so Markdown and LaTeX come in handy!
• Lots of code, so typical software development tools (git, IDEs, CI/CD, etc.)
• Processes (SCRUM, Agile, Lean, CRISP-DM, Design Thinking)
24. Tools for the IOSEMN process
+-----------------+ Project Management / Lifecycle
| INQUIRE | Git, Github, Gitlab (Project documentation)
+-----------------+ Documentation systems
v
+------------------+ Requests, APIs, sensors, surveys
| OBTAIN | SQL, CSV, JSON, XLS, NoSQL, Hadoop, Spark
+------------------+ Store / Cache data locally (SQLite, PostgreSQL)
v (Gather internal and external data)
+-----------------+ Jupyter Notebook
| SCRUB | Regular Expression (re), BeautifulSoup
+-----------------+ SQLite, ETL, Glue
25. Tools (continued)
+-----------------+ Jupyter Notebook
| EXPLORE | Pandas, Orange
+-----------------+ Matplotlib
^ v (Explore and understand the data)
+-----------------+ SciKit-Learn, Tensorflow
| MODEL | PyTorch, NumPy
+-----------------+ Machine Learning
RE-INQUIRE | (Model: predict, check accuracy, evaluate model)
^ +-----------------+ Jupyter Notebook, MatplotLib
+--------- | INTERPRET | Bokeh, D3.JS, XLSXWriter
+-----------------+ Dashboards, Reports, etc.
(Choose a good representation, interpret the results)
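To illustrate the Scrub step from the diagram, a minimal sketch using the standard-library re and sqlite3 modules (the raw strings and table name below are made up for the example):

import re
import sqlite3

# made-up raw survey answers with inconsistent formatting
raw = ["  Temp: 21.5 C ", "temp:19.0c", "TEMP : 23.4 C"]

# scrub: extract the numeric value with a case-insensitive regular expression
pattern = re.compile(r"temp\s*:\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)
values = [float(pattern.search(s).group(1)) for s in raw]

# cache the cleaned values locally in SQLite for the later Explore/Model steps
conn = sqlite3.connect("scrubbed.db")
conn.execute("CREATE TABLE IF NOT EXISTS temperature (celsius REAL)")
conn.executemany("INSERT INTO temperature (celsius) VALUES (?)", [(v,) for v in values])
conn.commit()

print(conn.execute("SELECT COUNT(*), AVG(celsius) FROM temperature").fetchone())
conn.close()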
36. Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.3
Characteristics
On-demand Self-service
Broad network access
Resource pooling
Rapid elasticity
Measured service
Service Models
IaaS Infrastructure as a Service
PaaS Platform as a Service
FaaS Functions as a Service
SaaS Software as a Service
BPaaS Business Process as a Service
Deployment Models
Private Cloud
Community Cloud
Public Cloud
Hybrid Cloud
Multi Cloud
3 The NIST Definition of Cloud Computing, Special Publication 800-145
37. Why take advantage of Cloud in Data Science projects
1. Big Data: analytics for ever increasing, varied data sets.
2. IoT: Easily gather and process data from smart devices (Internet of Things).
3. Collaborative: Set up data environments with ease, and collaborate with your team.
4. Scalable: Access to virtually unlimited resources, GPU computing, etc.
5. Automated: orchestrate and provision systems, and tear them down when no longer needed (see the boto3 sketch below).
Chances are, you're already using it. GitHub? Kaggle notebooks? Google Docs? Dropbox? AWS Free Tier? JupyterHub?
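As a rough sketch of that "provision, use, tear down" pattern with boto3 (the region and AMI ID are placeholders, not real values from this deck):

import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")

# provision: launch a small instance from a placeholder AMI ID
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder; use a real AMI in your region
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
print("running:", instance.id)

# ... run your experiment or data pipeline on the instance ...

# tear down: terminate the instance when it is no longer needed
instance.terminate()
instance.wait_until_terminated()
print("terminated:", instance.id)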
38. Machine Learning as a Service
Figure 18: IBM Watson Machine Learning Services
41. Kubernetes: container platform that enables portability across infrastructure providers
Kubernetes facilitates both declarative configuration and automation.
Container Orchestration Benefits:
• Cluster management: Federate hosts and manage them.
• Self-healing: Detect and replace unhealthy container Pods and hosts. Attempt to bring the cluster back to the desired state.
• Replication: Ensure that the desired number of Pod replicas is running (see the sketch after this list).
• Service discovery: Locate and distribute client requests across running containers.
• Scheduling: Distribute containers across the worker nodes.
• Scaling: Add or remove containers to match the workload.
• Persistent Storage: Manage persistent storage.
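A small sketch of the replication idea using the official kubernetes Python client (assumes a working kubeconfig; the deployment name and namespace are hypothetical):

from kubernetes import client, config

# load credentials from the local kubeconfig (e.g. ~/.kube/config)
config.load_kube_config()

# list running Pods across all namespaces
core = client.CoreV1Api()
for pod in core.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

# compare desired vs. available replicas for a (hypothetical) deployment
apps = client.AppsV1Api()
dep = apps.read_namespaced_deployment(name="model-serving", namespace="default")
print("desired replicas:", dep.spec.replicas,
      "available:", dep.status.available_replicas)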
44. Cultural Transformation
• Culture: Build trust and align your team with better communication and transparency.
• Discover: Understand the problem domain and align on common goals.
• Think: Know your audience and meet its needs faster than the competition.
• Develop: Collaborate to build, continuously integrate and deliver high-quality code.
• Reason: Apply AI techniques so that you can make better decisions.
• Operate: Harness the power of the cloud to quickly get your minimum viable product (MVP)
into production, and monitor and manage your applications to a high degree of quality and
meet your service level agreements. Grow or shrink your resources based on demand.
• Learn: Gain insights from your users as they interact with your application.
45. End-to-End CD4ML Process by Martin Fowler
Figure 22: Continuous Delivery for Machine Learning end-to-end process
46. Build reproducible images with Packer
Figure 23: Packer building a VirtualBox image for RHEL 8 using Kickstart Automated Install
47. Secure your environment with OpenSCAP:
Figure 24: Automate Continuous Compliance and Remediation (ex: HIPAA, PCI)
49. What can I do with Ansible?
Figure 26: Automate your entire infrastructure and data pipeline using Ansible
50. Molecule: Test your Ansible Playbooks on Docker, Vagrant or Cloud
Molecule provides support for testing with multiple instances, operating systems and distributions,
virtualization providers, test frameworks and testing scenarios.
Instantly create a Vagrant or Docker machine and test your playbook:
molecule create -s vagrant-centos-7
molecule converge -s vagrant-centos-7
molecule login
Integrate your tests into your CI/CD pipeline:
molecule test
53. The Open Practice Library
Figure 28: openpracticelibrary.com: A community-driven repository of practices and tools
An Outcome Delivery framework:
• Discovery - generate the Outcomes
• Options - identify how to get there
• Delivery - implement and put ideas to the test.
Learn what works and what doesn’t.
54. The Open Practice Library - Discovery
Figure 29: What problems are you trying to solve, for whom and why? How will you measure Outcomes?
55. The Open Practice Library - Options Pivot
Figure 30: What are the different options? What do you need to make this happen?
56. The Open Practice Library - Delivery
Figure 31: What was the measured impact? What did you learn?
57. The Open Practice Library - Foundation
Figure 32: Creating a team culture, an environment of collaboration and technical engineering practices
60. Example skills for Data Science scenarios:
1. Data Engineering: Infrastructure as Code (CloudFormation, Terraform), shell scripting, Cloud APIs (boto3), etc.
2. DevOps: putting together CI/CD pipelines with Jenkins, CodeStar, CodePipeline, CodeBuild, etc.
3. Event-Driven Architectures: putting an object on S3 triggers a Lambda function that performs ETL and loads the result into Redshift (see the sketch after this list).
4. GPU computing: accelerate workloads for general-purpose scientific and engineering
computing. 4
5. Container platforms: develop microservices and set up environments with ease.
4 Forbes: From Deep Learning To Data Science: Everything You Need To Know
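A hedged sketch of that event-driven flow as a Lambda handler: the event shape is the standard S3 notification, the load uses the Redshift Data API, and the cluster, database, user, table and IAM role names are hypothetical placeholders:

import urllib.parse
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # S3 put notification: extract the bucket and (URL-encoded) object key
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # minimal "ETL": load the new object straight into a staging table via COPY
    copy_sql = (
        f"COPY staging.events FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster name
        Database="analytics",                    # hypothetical database
        DbUser="etl_user",                       # hypothetical database user
        Sql=copy_sql,
    )
    return {"loaded": f"s3://{bucket}/{key}"}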
63. Soft Skills and Career
Figure 36: Career Development, Soft Skills, Stakeholder Management
64. Meetups and Events
Networking:
• Absorb ideas and filter them through your own experience at meetups and events.
• Build a strong network.
Give back:
• Give back to the community by speaking at events, supporting, sponsoring or co-organising them.
• Contribute to Open Source projects. Meetups and hackathons are a great place to start!
Social Eminence:
• Be an active member of the Data Science community and build your social eminence!
69. Example Courses
cognitiveclass.ai
Learning Paths and Badges on Containers, Kubernetes, SQL, Big Data, Python, R, Deep Learning,
Analytics, Hadoop and Spark.
katacoda.com
Interactive, hands-on courses on Containers, Machine Learning, DevOps, Software Engineering Practices and more, straight in your browser.
Data Science Courses
• edx.org - you can ‘audit’ courses for free.
• coursera.org - great machine learning courses (ex: Andrew Ng.)
• fast.ai - free courses on Deep Learning, Computational Linear Algebra.
70. What they don’t teach you
Sometimes, it's just you… and there won't be:
• A Data Engineer to build and automate your pipelines
• A Cloud Architect to design your infrastructure
• A Network Engineer
• A DevOps Engineer to create your code deployment pipelines and help with CI/CD
• A Systems Administrator to support your Linux environment.
Go SaaS
• Could I go SaaS?
Build it yourself!
• How do I build all this myself?