1. Data Science at Scale - The DevOps Approach
DevOps Practices for Data Scientists and Engineers
Mihai Criveti
22nd September 2019
https://github.com/crivetimihai
http://galaxy.ansible.com/crivetimihai
2. 1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
3. Speaker Bio
Mihai Criveti, IBM
Designs and builds multi-cloud
customer solutions for Cloud Native
applications, Big Data analytics and
Machine Learning.
Pursuing an MSc in Data Science at UCD.
Leads the Cloud Native competency
for IBM Cloud Solutioning.
Passionate about Open Source
software, DevOps and Data Science.
5. What is Data Science?
Data Science
Multi-disciplinary field that brings together Computer Science, Statistics/Machine Learning, and
Data Analysis to understand and extract insights from ever-increasing amounts of data.
Machine Learning
The science of getting computers to act without being explicitly programmed.
Deep Learning
Family of machine learning methods based on learning data representations, as opposed to
task-specific algorithms. Can be supervised, semi-supervised or unsupervised.
AI
Intelligent machines that work and react like humans.
6. Moving Towards Big Data
Figure 1: Powerful models and big data support Machine Learning
7. Data Scientist Domain
Data Engineering:
• Linux, Cloud, Big Data Platforms.
• Streaming, big data pipelines.
Software Development:
• Coding skills, such as Python, R, SQL.
• Development practices: Agile, DevOps,
CI/CD and using GitOps effectively.
Figure 2: Data Scientist Venn Diagram
“While data scientists are recognized for their brilliant algorithms, up to 80% of their time
could be spent collecting, cleaning and organizing data.” 1
1 Forbes: Radical Change Is Coming To Data Science Jobs
8. Data Science Roles
Data Scientist / Analyst:
• Turn raw data into valuable insights that an organization needs in order to grow or compete.
• Analytical data experts with the skill to solve complex problems and the curiosity to explore
what problems need solving.
• They use Data, visualization, machine learning, deep learning, pattern recognition, natural
language processing, analytics.
• Always curious: "What can we learn from this data? What actions can we take afterwards?"
Data Engineer / Data Architect:
• Prepare the “big data” infrastructure to be analysed by Data Scientists.
• Software engineers who design, build, and integrate data from various sources, and manage big data.
10. Often, it’s a manual process
Figure 3: Pen and paper - planning
11. Data Science is awesome
Data Science is OSEMN (pronounced AWESOME!)2 - an iterative process that consists, largely, of the following steps (a minimal code sketch follows the list):
1. Inquire: ask a meaningful question.
2. Obtain: get the required data.
3. Scrub: clean the data.
4. Explore: learn about your data, try stuff.
5. Model: create a couple of models and test them.
6. iNterpret: gain insight from data, present it in a usable form (reports, dashboards,
applications, etc).
2 O'Reilly - "Data Science at the Command Line: Facing the Future with Time-Tested Tools", Jeroen Janssens
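A minimal sketch of one pass through Obtain → Scrub → Explore → Model → iNterpret, assuming a recent scikit-learn and using its bundled iris data as a stand-in for real project data:

# obtain: load a small bundled dataset (stands in for a real data source)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
df = iris.frame

# scrub: drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# explore: quick look at shape, summary statistics and class balance
print(df.shape)
print(df.describe())
print(df["target"].value_counts())

# model: train a simple classifier and check hold-out accuracy
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# interpret: report a single, easy-to-communicate metric
print("hold-out accuracy:", model.score(X_test, y_test))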
12. CRISP-DM - Cross-industry standard process for data mining
CRISP-DM is a widely used, general-purpose, tool-independent data-mining process model.
Stage                       Description
1. Business Understanding   Ask relevant questions, define objectives
2. Data Mining              Gather the necessary data
3. Data Cleaning            Scrub and fix data inconsistencies
4. Data Exploration         Form hypotheses about the data
5. Feature Engineering      Select / construct important features
6. Predictive Modeling      Train Machine Learning models
7. Data Visualization       Communicate findings to key stakeholders
8. Data Automation          Automate and deploy ML models
Figure 4: CRISP-DM
17. Types of Data
Data can be:
• Structured: tables, spreadsheets, relational databases.
• Unstructured: text, images, audio, video.
• Numerical / Quantitative: ex: pulse, temperature.
• Categorical / Qualitative: ex: hair colour.
• Big Data: massive data sets that cannot fit in memory on a single machine.
Figure 8: Different types of data
Data becomes information when viewed in context or post-analysis.
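As a small illustration, pandas makes the numerical-vs-categorical distinction explicit through column dtypes; the tiny dataset below is made up for the example:

import pandas as pd

# made-up structured data: one numerical and one categorical column
df = pd.DataFrame({
    "pulse": [62, 75, 88, 71],                           # numerical / quantitative
    "hair_colour": ["brown", "black", "red", "brown"],   # categorical / qualitative
})

# mark the qualitative column explicitly as categorical
df["hair_colour"] = df["hair_colour"].astype("category")

print(df.dtypes)                          # pulse: int64, hair_colour: category
print(df["pulse"].mean())                 # numerical columns support arithmetic
print(df["hair_colour"].value_counts())   # categorical columns support counts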
18. Obtaining Data
Private / Enterprise Data
• Private data can often be found in: data warehouses, SQL databases, NoSQL stores, data lakes or HDFS, document repositories, private wikis, ERP and CRM platforms, object storage and, more often than not, spreadsheets.
Public Data
• Weather, social media, location / geographical data, stock data, public internet data (scraping), wikis, Eurostat, Kaggle and government data portals are all sources of external data (see the sketch below).
Data compliance, governance and security are key to a successful data strategy.
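As a sketch of how simple obtaining public data can be (the URL and dataset below are hypothetical placeholders, not a real endpoint), many data portals publish CSV endpoints that pandas can read directly:

import pandas as pd

# hypothetical CSV endpoint on a public data portal; substitute a real dataset URL
URL = "https://data.example.gov/datasets/air-quality.csv"

df = pd.read_csv(URL)            # pandas can read directly from HTTP(S) URLs
print(df.head())                 # quick sanity check of the obtained data
print(len(df), "rows obtained")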
20. Common Data Formats
• XML, JSON, YAML
• CSV, TSV, Parquet, XLSX
• Markdown, HTML, DOCX
• TXT, PDF
• Audio, Video
• Data APIs that return JSON or XML
• Streaming data
• SQL and other database formats
• HDFS and other big data stores or encapsulated data on object storage
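A minimal sketch that round-trips a tiny made-up table through several of these formats with pandas (Parquet and XLSX support assume the optional pyarrow and openpyxl packages are installed):

import pandas as pd

# tiny made-up table, written out and read back in several formats
df = pd.DataFrame({"id": [1, 2, 3], "value": [3.2, 4.8, 1.5]})

df.to_csv("sample.csv", index=False)
df.to_json("sample.json", orient="records")
df.to_parquet("sample.parquet")            # requires pyarrow or fastparquet
df.to_excel("sample.xlsx", index=False)    # requires openpyxl

print(pd.read_csv("sample.csv"))
print(pd.read_json("sample.json"))
print(pd.read_parquet("sample.parquet"))
print(pd.read_excel("sample.xlsx"))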
21. The Big Data Challenge
Figure 10: Making sense of ever-growing data sets through automation and machine learning
23. Tools Data Scientists use
• Mathematics - Linear Algebra, Statistics, Combinatorics
• Some of them use R - focusing on statistics
• A lot of them use Python - usually with Jupyter Notebook as a front-end
• Libraries such as pandas and NumPy are very handy!
• Natural Language Processing with NLTK
• or Machine Learning libraries - scikit-learn, TensorFlow or PyTorch
• SQL and databases tend to be quite popular. After all, where does data live?
• NoSQL databases such as MongoDB are quite useful too…
• And a whole bunch of Big Data tools: Hadoop, Spark, Kafka, etc.
• They write papers too, so Markdown and LaTeX come in handy!
• Lots of code, so typical software development tools (git, IDEs, CI/CD, etc.)
• Processes (SCRUM, Agile, Lean, CRISP-DM, Design Thinking)
24. Tools for the IOSEMN process
+-----------------+ Project Management / Lifecycle
| INQUIRE | Git, Github, Gitlab (Project documentation)
+-----------------+ Documentation systems
v
+------------------+ Requests, APIs, sensors, surveys
| OBTAIN | SQL, CSV, JSON, XLS, NoSQL, Hadoop, Spark
+------------------+ Store / Cache data locally (SQLite, PostgreSQL)
v (Gather internal and external data)
+-----------------+ Jupyter Notebook
| SCRUB | Regular Expression (re), BeautifulSoup
+-----------------+ SQLite, ETL, Glue
25. Tools (continued)
+-----------------+ Jupyter Notebook
| EXPLORE | Pandas, Orange
+-----------------+ Matplotlib
^ v (Explore and understand the data)
+-----------------+ SciKit-Learn, Tensorflow
| MODEL | PyTorch, NumPy
+-----------------+ Machine Learning
RE-INQUIRE | (Model: predict, check accuracy, evaluate model)
^ +-----------------+ Jupyter Notebook, MatplotLib
+--------- | INTERPRET | Bokeh, D3.JS, XLSXWriter
+-----------------+ Dashboards, Reports, etc.
(Choose a good representation, interpret the results)
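To illustrate the Scrub step from the diagram, a minimal sketch using the standard-library re and sqlite3 modules (the raw strings and table name below are made up for the example):

import re
import sqlite3

# made-up raw survey answers with inconsistent formatting
raw = ["  Temp: 21.5 C ", "temp:19.0c", "TEMP : 23.4 C"]

# scrub: extract the numeric value with a case-insensitive regular expression
pattern = re.compile(r"temp\s*:\s*([0-9]+(?:\.[0-9]+)?)", re.IGNORECASE)
values = [float(pattern.search(s).group(1)) for s in raw]

# cache the cleaned values locally in SQLite for the later Explore/Model steps
conn = sqlite3.connect("scrubbed.db")
conn.execute("CREATE TABLE IF NOT EXISTS temperature (celsius REAL)")
conn.executemany("INSERT INTO temperature (celsius) VALUES (?)", [(v,) for v in values])
conn.commit()

print(conn.execute("SELECT COUNT(*), AVG(celsius) FROM temperature").fetchone())
conn.close()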
36. Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.3
Characteristics
On-demand Self-service
Broad network access
Resource pooling
Rapid elasticity
Measured service
Service Models
IaaS Infrastructure as a Service
PaaS Platform as a Service
FaaS Functions as a Service
SaaS Software as a Service
BPaaS Business Process as a Service
Deployment Models
Private Cloud
Community Cloud
Public Cloud
Hybrid Cloud
Multi Cloud
3 The NIST Definition of Cloud Computing, Special Publication 800-145
37. Why take advantage of Cloud in Data Science projects
1. Big Data: analytics for ever increasing, varied data sets.
2. IoT: Easily gather and process data from smart devices (Internet of Things).
3. Collaborative: Set up data environments with ease, and collaborate with your team.
4. Scalable: Access to virtually unlimited resources, GPU computing, etc.
5. Automated: orchestrate and provision systems, and tear them down when no longer needed (see the boto3 sketch below).
Chances are, you're already using it. GitHub? Kaggle notebooks? Google Docs? Dropbox? AWS Free Tier? JupyterHub?
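As a rough sketch of that "provision, use, tear down" pattern with boto3 (the region and AMI ID are placeholders, not real values from this deck):

import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")

# provision: launch a small instance from a placeholder AMI ID
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder; use a real AMI in your region
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
print("running:", instance.id)

# ... run your experiment or data pipeline on the instance ...

# tear down: terminate the instance when it is no longer needed
instance.terminate()
instance.wait_until_terminated()
print("terminated:", instance.id)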
38. Machine Learning as a Service
Figure 18: IBM Watson Machine Learning Services
41. Kubernetes: container platform that enables portability across infrastructure providers
Kubernetes facilitates both declarative configuration and automation.
Container Orchestration Benefits:
• Cluster management: Federate hosts and manage them.
• Self-healing: Detect and replace unhealthy container Pods and hosts. Attempt to bring the cluster back to the desired state.
• Replication: Ensure that the desired number of Pod replicas is running (see the sketch after this list).
• Service discovery: Locate and distribute client requests across running containers.
• Scheduling: Distribute containers across the worker nodes.
• Scaling: Add or remove containers to match the workload.
• Persistent Storage: Manage persistent storage.
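A small sketch of the replication idea using the official kubernetes Python client (assumes a working kubeconfig; the deployment name and namespace are hypothetical):

from kubernetes import client, config

# load credentials from the local kubeconfig (e.g. ~/.kube/config)
config.load_kube_config()

# list running Pods across all namespaces
core = client.CoreV1Api()
for pod in core.list_pod_for_all_namespaces(watch=False).items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

# compare desired vs. available replicas for a (hypothetical) deployment
apps = client.AppsV1Api()
dep = apps.read_namespaced_deployment(name="model-serving", namespace="default")
print("desired replicas:", dep.spec.replicas,
      "available:", dep.status.available_replicas)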
44. Cultural Transformation
• Culture: Build trust and align your team with better communication and transparency.
• Discover: Understand the problem domain and align on common goals.
• Think: Know your audience and meet its needs faster than the competition.
• Develop: Collaborate to build, continuously integrate and deliver high-quality code.
• Reason: Apply AI techniques so that you can make better decisions.
• Operate: Harness the power of the cloud to quickly get your minimum viable product (MVP)
into production, and monitor and manage your applications to a high degree of quality and
meet your service level agreements. Grow or shrink your resources based on demand.
• Learn: Gain insights from your users as they interact with your application.
45. End-to-End CD4ML Process by Martin Fowler
Figure 22: Continuous Delivery for Machine Learning end-to-end process
46. Build reproducible images with Packer
Figure 23: Packer building a VirtualBox image for RHEL 8 using Kickstart Automated Install
47. Secure your environment with OpenSCAP:
Figure 24: Automate Continuous Compliance and Remediation (ex: HIPAA, PCI)
49. What can I do with Ansible?
Figure 26: Automate your entire infrastructure and data pipeline using Ansible
50. Molecule: Test your Ansible Playbooks on Docker, Vagrant or Cloud
Molecule provides support for testing with multiple instances, operating systems and distributions,
virtualization providers, test frameworks and testing scenarios.
Instantly create a Vagrant or Docker machine and test your playbook:
molecule create -s vagrant-centos-7
molecule converge -s vagrant-centos-7
molecule login
Integrate your tests into your CI/CD pipeline:
molecule test
53. The Open Practice Library
Figure 28: openpracticelibrary.com: A community-driven repository of practices and tools
An Outcome Delivery framework:
• Discovery - generate the Outcomes
• Options - identify how to get there
• Delivery - implement and put ideas to the test.
Learn what works and what doesn’t.
54. The Open Practice Library - Discovery
Figure 29: What problems are you trying to solve, for whom and why? How will you measure Outcomes?
55. The Open Practice Library - Options Pivot
Figure 30: What are the different options? What do you need to make this happen?
56. The Open Practice Library - Delivery
Figure 31: What was the measured impact? What did you learn?
57. The Open Practice Library - Foundation
Figure 32: Creating a team culture, an environment of collaboration and technical engineering practices
60. Example skills for Data Science scenarios:
1. Data Engineering: Infrastructure as Code (CloudFormation, Terraform), shell scripting, Cloud APIs (boto3), etc.
2. DevOps: putting together CI/CD pipelines with Jenkins, CodeStar, CodePipeline, CodeBuild, etc.
3. Event-Driven Architectures: putting an object on S3 triggers a Lambda function that performs ETL and loads the result into Redshift (see the sketch after this list).
4. GPU computing: accelerate workloads for general-purpose scientific and engineering
computing. 4
5. Container platforms: develop microservices and set up environments with ease.
4 Forbes: From Deep Learning To Data Science: Everything You Need To Know
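A hedged sketch of that event-driven flow as a Lambda handler: the event shape is the standard S3 notification, the load uses the Redshift Data API, and the cluster, database, user, table and IAM role names are hypothetical placeholders:

import urllib.parse
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # S3 put notification: extract the bucket and (URL-encoded) object key
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # minimal "ETL": load the new object straight into a staging table via COPY
    copy_sql = (
        f"COPY staging.events FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",   # hypothetical cluster name
        Database="analytics",                    # hypothetical database
        DbUser="etl_user",                       # hypothetical database user
        Sql=copy_sql,
    )
    return {"loaded": f"s3://{bucket}/{key}"}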
63. Soft Skills and Career
Figure 36: Career Development, Soft Skills, Stakeholder Management
64. Meetups and Events
Networking:
• Absorb ideas and filter them through your own experience at meetups and events.
• Build a strong network.
Give back:
• Give back to the community by speaking at events, supporting, sponsoring or co-organising them.
• Contribute to Open Source projects. Meetups and hackathons are a great place to start!
Social Eminence:
• Be an active member of the Data Science community and build your social eminence!
69. Example Courses
cognitiveclass.ai
Learning Paths and Badges on Containers, Kubernetes, SQL, Big Data, Python, R, Deep Learning,
Analytics, Hadoop and Spark.
katacoda.com
Interactive, hands-on courses on Containers, Machine Learning, DevOps, Software Engineering Practices and more, straight in your browser.
Data Science Courses
• edx.org - you can ‘audit’ courses for free.
• coursera.org - great machine learning courses (ex: Andrew Ng.)
• fast.ai - free courses on Deep Learning, Computational Linear Algebra.
70. What they don’t teach you
Sometimes, it's just you… and there won't be:
• A Data Engineer to build and automate your pipelines
• A Cloud Architect to design your infrastructure
• A Network Engineer
• A DevOps Engineer to create your code deployment pipelines and help with CI/CD
• A Systems Administrator to support your Linux environment.
Go SaaS
• Could I go SaaS?
Build it yourself!
• How do I build all this myself?