1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE
CRASH COURSE
BERLIN 2018
Robert Hryniewicz
Data Evangelist
@RobHryniewicz
AI
MACHINE LEARNING
DEEP LEARNING
WHY NOW?
Key drivers behind the AI explosion
• Exponential data growth
– And the ability to process all data, both structured and unstructured
• Faster & open distributed systems
– Such as Hadoop, Spark, TensorFlow, …
• Smarter algorithms
– Especially in the Machine Learning and Deep Learning domains
– More accurate models → better ROI for customers
Source: Deloitte Tech Trends 2017 report
Healthcare
Predict diagnosis
Prioritize screenings
Reduce re-admittance rates
Financial services
Fraud Detection/prevention
Predict underwriting risk
New account risk screens
Public Sector
Analyze public sentiment
Optimize resource allocation
Law enforcement & security
Retail
Product recommendation
Inventory management
Price optimization
Telco/mobile
Predict customer churn
Predict equipment failure
Customer behavior analysis
Oil & Gas
Predictive maintenance
Seismic data management
Predict well production levels
DATA
Google does not have better algorithms, only more data.
-- Peter Norvig, Director of Research, Google
50ZB+ in 2021
Data: The New Oil
Training Data: The New New Oil
“Effectiveness of AI technologies will be only as good as the data they have
access to, and the most valuable data may exist beyond the borders of
one’s own organization.”
DATA SCIENCE PREREQUISITES
Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
DATA SCIENCE & MACHINE LEARNING
WHAT IS A MODEL?
What is an ML Model?
• A mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
• E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input → Model → Output
1, 0, 7, 2, … → y = 2x + 5 → 7, 5, 19, 9, …
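As a minimal sketch of what "model training" means for the line above, the closed-form least-squares fit below recovers m and c from the four input/output pairs shown on the slide. The function name `fit_line` is illustrative and not taken from the deck.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Input/output pairs from the slide.
xs = [1, 0, 7, 2]
ys = [7, 5, 19, 9]

m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")
```

The learned parameters are the "model"; predicting a new value is just evaluating m * x + c.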
ALGORITHMS
START
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Regression
• Linear Regression
Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
CLASSIFICATION
Identifying which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
• Logistic Regression
– Fast training, linear model
– Classes expressed as probabilities
• Support Vector Machines (SVM)
– Often described as the “best” supervised learning algorithm; effective
– More robust to outliers than Logistic Regression
– Handles non-linearity (via kernels)
• Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and feature interaction
• Naïve Bayes
– Good for text classification
– Assumes features are independent of each other given the class
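To illustrate "classes expressed as probabilities", here is a small sketch of one-feature logistic regression trained with plain gradient descent. The data (a single numeric feature and a 0/1 label), the learning rate, and the epoch count are made-up assumptions for illustration; a real project would more likely use a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_proba(x, w, b):
    """Probability that x belongs to class 1."""
    return sigmoid(w * x + b)


# Hypothetical, deliberately non-separable toy data.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(xs, ys)
for x in (0.5, 2.5, 4.5):
    print(f"x={x}: p(class 1) = {predict_proba(x, w, b):.2f}")
```

The output is a probability rather than a hard label; a threshold (commonly 0.5) turns it into a class decision.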
CLASSIFICATION
Visual Intro to Decision Trees
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – automatically group customers into different market segments
Algorithms: K-means, Latent Dirichlet Allocation (LDA)
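A rough sketch of K-means (Lloyd's algorithm) on a handful of two-dimensional points may help make "automatic grouping" concrete. The points and the starting centroids are invented for illustration; production implementations choose initial centroids more carefully.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers described by two numeric attributes.
points = [(1.0, 2.0), (1.5, 1.8), (1.0, 0.6), (8.0, 8.0), (9.0, 11.0), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 2.0), (8.0, 8.0)])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No labels are supplied anywhere: the segments emerge only from distances between points, which is what makes clustering unsupervised.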
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
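The idea of filling in missing matrix entries can be sketched with a rank-1 version of Alternating Least Squares: each rating is approximated as user_factor * item_factor, and the two sets of factors are solved for in turn. This is a simplified illustration, not the Spark implementation; the ratings matrix, the `reg` parameter default, and the function name are assumptions.

```python
def als_rank1(ratings, iterations=100, reg=0.0):
    """Rank-1 ALS: approximate ratings[i][j] by u[i] * v[j],
    using only the observed (non-None) entries."""
    n_users, n_items = len(ratings), len(ratings[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors, solve a 1-D least-squares problem per user.
        for i in range(n_users):
            obs = [j for j in range(n_items) if ratings[i][j] is not None]
            u[i] = sum(ratings[i][j] * v[j] for j in obs) / (
                sum(v[j] ** 2 for j in obs) + reg
            )
        # Fix user factors, solve per item.
        for j in range(n_items):
            obs = [i for i in range(n_users) if ratings[i][j] is not None]
            v[j] = sum(ratings[i][j] * u[i] for i in obs) / (
                sum(u[i] ** 2 for i in obs) + reg
            )
    return u, v


# Hypothetical user-item matrix; None marks a missing rating.
ratings = [
    [1.0, 2.0, None],
    [2.0, 4.0, 8.0],
    [None, 6.0, 12.0],
]

u, v = als_rank1(ratings)
print("predicted rating for user 0, item 2:", round(u[0] * v[2], 3))
print("predicted rating for user 2, item 0:", round(u[2] * v[0], 3))
```

Real recommenders use more than one latent factor per user and item, plus regularization, but the alternating structure is the same.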
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
• Removing noise in images by selecting only “important” features
• Removing redundant features, e.g. MPH & KPH are linearly dependent
Algorithms: Principal Component Analysis (PCA)
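The MPH/KPH example can be checked numerically. The sketch below builds the 2x2 covariance matrix of a speed recorded in both units and computes its eigenvalues in closed form; the speed values are invented for illustration. Because the two columns are linearly dependent, one principal component should explain essentially all of the variance.

```python
import math


def covariance_2d(xs, ys):
    """Return the entries (var_x, cov_xy, var_y) of the sample covariance matrix."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y


def eigenvalues_symmetric_2x2(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first."""
    centre = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + radius, centre - radius


# Hypothetical speeds; the KPH column is just the MPH column rescaled.
mph = [30.0, 45.0, 60.0, 75.0, 90.0]
kph = [s * 1.609344 for s in mph]

first, second = eigenvalues_symmetric_2x2(*covariance_2d(mph, kph))
explained = first / (first + second)
print(f"variance explained by the first component: {explained:.6f}")
```

With the second eigenvalue at (numerically) zero, dropping that component loses no information, which is the sense in which one of the two features is redundant.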
START
Classification
• XGBoost (Extreme Gradient Boosting)
Deep Learning
• Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
Regression
• Local Regression (LOESS)
Collaborative Filtering
• Weighted Alternating Least Squares (WALS)
Clustering
• Yinyang K-Means
Dimensionality Reduction
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
DEEP LEARNING
TensorFlow Playground
playground.tensorflow.org
Identify the right Deep Learning problems
• DL is terrific at language tasks, image classification, speech translation, machine translation, and game playing (e.g. Chess, Go, StarCraft).
• It is less performant at traditional Machine Learning tasks such as credit card fraud detection, asset pricing, and credit scoring.
Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
Limits to Deep Learning (DL)
• We don’t have infinite datasets
– DL is not great at generalizing
– ImageNet example: a 9-layer network with 60 million parameters and 650,000 nodes, trained on about 1 million examples from 1,000 categories
• Top 10 challenges for Deep Learning
1. Data hungry
2. Shallow and limited capacity for transfer
3. No natural way to deal w/ hierarchical structure
4. Struggles w/ open-ended inference
5. Not sufficiently transparent
6. Not well integrated w/ prior knowledge
7. Cannot distinguish causation from correlation
8. Presumes stable world
9. Works well as an approximation, but answers cannot be fully trusted
10. Difficult to engineer with
Source: arxiv.org/pdf/1801.00631.pdf
AI HACKING
Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
DATA SCIENCE JOURNEY
Start by Asking Relevant Questions
• Specific (can you think of a clear answer?)
• Measurable (quantifiable? data driven?)
• Actionable (if you had an answer, could you do something with it?)
• Realistic (can you get an answer with data you have?)
• Timely (answer in reasonable timeframe?)
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate the sequence of preparation operations)
3. Validation (correctness evaluated against a representative sample dataset)
4. Transformation (the actual preparation process takes place)
5. Backflow of cleaned data (replace the original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for one U.S. state: California, CA, Cal., Cal
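The state-name example is a typical transformation step. A minimal sketch, assuming a hand-maintained alias table (the `STATE_ALIASES` mapping and `normalize_state` helper are hypothetical names), could look like this:

```python
# Hypothetical alias table: every known spelling maps to one canonical code.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}


def normalize_state(raw):
    """Map a raw state string to its canonical code.
    Unknown values are returned trimmed but otherwise unchanged,
    so they can be audited rather than silently altered."""
    cleaned = raw.strip()
    return STATE_ALIASES.get(cleaned.lower(), cleaned)


dirty = ["California", "CA", "Cal.", " Cal ", "Texas"]
clean = [normalize_state(value) for value in dirty]
print(clean)
```

Leaving unknown values untouched keeps the step safe to re-run and makes gaps in the alias table visible during validation.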
Feature Selection
• Also known as variable or attribute selection
• Why is it important?
– Simplification of models → easier for researchers/users to interpret
– Shorter training times
– Enhanced generalization by reducing overfitting
• Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
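One simple (and by no means the only) way to approach that question is a filter method: score each feature by the absolute Pearson correlation with the target and keep the top k. The house-price style data below is invented for illustration, and `select_top_k` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_top_k(features, target, k):
    """Keep the k feature names most correlated (in absolute value) with the target.
    Columns are included or excluded as-is; no new attributes are created."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical housing data: two informative features and one irrelevant one.
features = {
    "sqft": [800, 1200, 1500, 2000, 2400],
    "bedrooms": [1, 2, 3, 3, 4],
    "house_number": [12, 7, 33, 5, 21],
}
price = [150, 210, 260, 330, 400]

selected = select_top_k(features, price, k=2)
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so this is a starting point rather than a complete selection strategy.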
Hyperparameters
• Define higher-level model properties, e.g. complexity or learning rate
• Cannot be learned during training → need to be predefined
• Can be decided by
– setting different values
– training different models
– choosing the values that test better
• Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in a k-means clustering
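The "set values, train, keep what tests better" loop can be sketched with a tiny k-nearest-neighbours classifier, where k is the hyperparameter. The data, the candidate values, and the use of a separate held-out validation list are assumptions made for this illustration.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0/1)."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0


def accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


# Hypothetical (feature, label) pairs.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
validation = [(1.2, 0), (2.2, 0), (2.8, 1), (3.8, 1)]

# Try each candidate value, score it on held-out data, keep the best.
candidates = [1, 3, 5]
scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Scoring candidates on data the model was not trained on is what keeps the chosen value from simply rewarding memorization.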
• Residuals
– The residual of an observed value is the difference between the observed value and the estimated value
• R² (R Squared): Coefficient of Determination
– Indicates goodness of fit
– R² of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
– Measure of the differences between the values predicted by a model and the values actually observed
– Good measure of accuracy, but only for comparing the errors of different models on the same data, because it is scale-dependent
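These three quantities are short enough to compute by hand. The sketch below uses the line y = 2x + 5 as the model and slightly noisy observations that are invented for illustration.

```python
import math


def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]


def rmse(observed, predicted):
    """Root of the mean squared residual (same units as the target)."""
    errs = residuals(observed, predicted)
    return math.sqrt(sum(e ** 2 for e in errs) / len(errs))


def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(e ** 2 for e in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


xs = [1, 0, 7, 2]
predicted = [2 * x + 5 for x in xs]   # model: y = 2x + 5
observed = [7.5, 4.5, 19.0, 9.0]      # hypothetical measurements

res = residuals(observed, predicted)
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
print(res, round(error, 4), round(fit, 4))
```

RMSE is expressed in the units of the target, which is why it is meaningful for comparing models on one dataset but not across datasets with different scales.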
With that in mind…
• No simple formula for “good questions”, only general guidelines
• The right data is better than lots of data
• Understanding relationships matters
Training Set → Learning Algorithm → h (hypothesis/model)
input → h → output
Steps: Ingest / Enrich Data; Clean / Transform / Filter; Select / Create New Features; Evaluate Accuracy / Score
MODEL TRAINING
Scatter Data
|label|features|
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
Training Result
Coefficients: 2.81, Intercept: 3.05
y = 2.81x + 3.05
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
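With two features the trained model is a plane rather than a line: prediction is the dot product of the coefficients with the feature vector, plus the intercept. The sketch below plugs in the coefficients and intercept reported above; the input feature values are made up for illustration.

```python
# Values reported on the slide.
coefficients = [0.464, 0.464]
intercept = 0.0563


def predict(features):
    """y = w1*x1 + w2*x2 + intercept"""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


# Hypothetical feature vector [x1, x2].
prediction = predict([1.0, 2.0])
print(round(prediction, 4))
```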
SPARK ML PIPELINES
Train: Input DataFrame → Pipeline (Feature transform 1 → Feature transform 2 → Combine features → Linear Regression) → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
Spark ML Pipeline
• fit() is for training
• transform() is for prediction
Train: Input DataFrame → Pipeline.fit() → Pipeline Model
Predict: Input DataFrame → Pipeline Model.transform() → Output DataFrame
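The fit/transform split can be illustrated without a cluster. The toy classes below are plain Python and are not the Spark API; `MeanCenterer`, `ToyPipeline`, and the list-of-numbers "DataFrame" are all hypothetical stand-ins. They only mimic the pattern: fit() learns from training data and returns a fitted model, and transform() applies what was learned to new data.

```python
class MeanCenterer:
    """Toy estimator: fit() learns the mean of the training values."""

    def fit(self, rows):
        return MeanCentererModel(sum(rows) / len(rows))


class MeanCentererModel:
    """Fitted model: transform() subtracts the mean learned during fit()."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [value - self.mean for value in rows]


class ToyPipeline:
    """Chains estimators; fit() trains each stage on the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """Result of ToyPipeline.fit(); transform() replays every fitted stage."""

    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


train_data = [1.0, 2.0, 3.0]
test_data = [4.0, 5.0]

pipeline_model = ToyPipeline(stages=[MeanCenterer()]).fit(train_data)  # train
output = pipeline_model.transform(test_data)                           # predict
print(output)
```

The key point is that the test data never influences the learned mean: everything learned comes from fit(), and transform() only applies it.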
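SAMPLE CODE
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
# indexer, parser, hashingTF and vecAssembler are feature stages defined elsewhere
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train model
results = model.transform(testData)  # Test model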
DSX + HDP
[Figure: ML Code. Source: Google NIPS]
DATA SCIENCE PLATFORM
Community
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects
Open Source
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE and Shiny
• Apache Spark
• Your favorite libraries
Scale & Enterprise Security
• Data Science at Scale
• Run Spark Jobs on HDP Cluster
• Secure Hadoop Support
• Ranger Atlas Support for Data
• Support for ABAC
Model Management
• Data Shaping Pipeline UI
• Auto-data preparation & modeling
• Advanced Visualizations
• Model management & deployment
• Documented Model APIs
Data Science Experience
SPSS Modeler for DSX
DO for DSX
DSX: Data Science Experience (DSX) Local
Enterprise Data Science platform for teams
Livy REST interface
HDP: Hortonworks Data Platform (HDP)
Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)
FINAL NOTES
“The business value of AI consists of its ability to lower the cost of
prediction, just as computers lowered the cost of arithmetic.”
Building a Business Around AI - HBR
1. Find and own valuable data no one else has
2. Take a systemic view of your business, and find data adjacencies
3. Package AI for the customer experience
"No single tool, even one as powerful as AI, determines the fate of a business. As
much as the world changes, deep truths — around unearthing customer
knowledge, capturing scarce goods, and finding profitable adjacencies — will
matter greatly. As ever, the technology works to the extent that its owners
know what it can do, and know their market."
Robert Hryniewicz
@RobHryniewicz
Thanks!

More Related Content

What's hot

SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsDataWorks Summit
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseDataWorks Summit
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationAbdelkrim Hadjidj
 
Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...DataWorks Summit
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationAbdelkrim Hadjidj
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)Abdelkrim Hadjidj
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsDataWorks Summit/Hadoop Summit
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...DataWorks Summit/Hadoop Summit
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...DataWorks Summit
 
Overcoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onOvercoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onDataWorks Summit
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveDataWorks Summit
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...DataWorks Summit
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationDataWorks Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash CourseDataWorks Summit
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudDataWorks Summit
 

What's hot (20)

SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use case
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
 
Overcoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onOvercoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus on
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Shaping a Digital Vision
Shaping a Digital VisionShaping a Digital Vision
Shaping a Digital Vision
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 

Similar to Data Science Crash Course

Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXKirk Haslbeck
 
Enterprise data science at scale
Enterprise data science at scaleEnterprise data science at scale
Enterprise data science at scaleCarolyn Duby
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceThiago Santiago
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingDataWorks Summit
 
Enterprise Data Science at Scale
Enterprise Data Science at ScaleEnterprise Data Science at Scale
Enterprise Data Science at ScaleArtem Ervits
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution Hortonworks
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...DataWorks Summit
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Powering the Future of Data  
Powering the Future of Data	   Powering the Future of Data	   
Powering the Future of Data  Bilot
 
Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stackKirk Haslbeck
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...DataWorks Summit
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataMatt Stubbs
 
Hortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud EventHortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud EventThiago Santiago
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...Jürgen Ambrosi
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Daniel Madrigal
 

Similar to Data Science Crash Course (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 
Enterprise data science at scale
Enterprise data science at scaleEnterprise data science at scale
Enterprise data science at scale
 
Data science workshop
Data science workshopData science workshop
Data science workshop
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data Science
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
 
Enterprise Data Science at Scale
Enterprise Data Science at ScaleEnterprise Data Science at Scale
Enterprise Data Science at Scale
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Powering the Future of Data  
Powering the Future of Data	   Powering the Future of Data	   
Powering the Future of Data  
 
Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stack
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
 
Hortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud EventHortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud Event
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ
 

Data Science Crash Course

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE CRASH COURSE BERLIN 2018 Robert Hryniewicz Data Evangelist @RobHryniewicz
  • 2. AI, MACHINE LEARNING, DEEP LEARNING: WHY NOW?
  • 3. (no text on this slide)
  • 4. Key drivers behind AI Explosion
      ▪ Exponential data growth
          – And the ability to process all data, both structured & unstructured
      ▪ Faster & open distributed systems
          – Such as Hadoop, Spark, TensorFlow, …
      ▪ Smarter algorithms
          – Esp. in the Machine Learning and Deep Learning domains
          – More accurate models → better ROI for customers
      Source: Deloitte Tech Trends 2017 report
  • 5. Healthcare: predict diagnosis; prioritize screenings; reduce re-admittance rates
      Financial services: fraud detection/prevention; predict underwriting risk; new account risk screens
      Public sector: analyze public sentiment; optimize resource allocation; law enforcement & security
      Retail: product recommendation; inventory management; price optimization
      Telco/mobile: predict customer churn; predict equipment failure; customer behavior analysis
      Oil & gas: predictive maintenance; seismic data management; predict well production levels
  • 6. DATA
  • 7. "Google does not have better algorithms, only more data." -- Peter Norvig, Dir of Research, Google
  • 8. 50ZB+ in 2021
  • 9. Data: The New Oil
      Training Data: The New New Oil
  • 10. “Effectiveness of AI technologies will be only as good as the data they have access to, and the most valuable data may exist beyond the borders of one’s own organization.”
  • 11. (no text on this slide)
  • 12. DATA SCIENCE PREREQUISITES
  • 13. Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 14. (no text on this slide)
  • 15. DATA SCIENCE & MACHINE LEARNING: WHAT IS A MODEL?
  • 16. (no text on this slide)
  • 17. What is a ML Model?
      ▪ Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
      ▪ E.g. linear regression
          – Goal: fit a line y = mx + c to data points
          – After model training: y = 2x + 5
      Input: 1, 0, 7, 2, … → Model: y = 2x + 5 → Output: 7, 5, 19, 9, …
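The training step on slide 17 can be sketched in a few lines of plain Python. This is a minimal illustration using the closed-form least-squares solution, not the method used in the talk; the function name `fit_line` is made up for the example, and only the four input/output pairs shown on the slide are used.

```python
# Minimal sketch: ordinary least squares for one feature, standard library only.

def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    c = mean_y - m * mean_x
    return m, c

inputs = [1, 0, 7, 2]     # inputs shown on the slide
outputs = [7, 5, 19, 9]   # outputs shown on the slide

m, c = fit_line(inputs, outputs)
print(f"y = {m:g}x + {c:g}")  # prints "y = 2x + 5"
```

Here the "parameters learned from the data" are just `m` and `c`; larger models follow the same idea with many more parameters.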
  • 18. ALGORITHMS
  • 19. (no text on this slide)
  • 20. START
      ▪ Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes
      ▪ Regression: Linear Regression
      ▪ Collaborative Filtering: Alternating Least Squares (ALS)
      ▪ Clustering: K-Means, LDA
      ▪ Dimensionality Reduction: Principal Component Analysis (PCA)
  • 21. CLASSIFICATION
      Identifying which category an object belongs to
      Examples: spam detection, diabetes diagnosis, text labeling
      Algorithms:
      ▪ Logistic Regression
          – Fast training, linear model
          – Classes expressed in probabilities
      ▪ Support Vector Machines (SVM)
          – “Best” supervised learning algorithm, effective
          – More robust to outliers than Logistic Regression
          – Handles non-linearity
      ▪ Random Forest
          – Fast training
          – Handles categorical features
          – Does not require feature scaling
          – Captures non-linearity and feature interaction
      ▪ Naïve Bayes
          – Good for text classification
          – Assumes independent variables
  • 22. CLASSIFICATION: Visual Intro to Decision Trees
      ▪ http://www.r2d3.us/visual-intro-to-machine-learning-part-1
  • 23. REGRESSION
      Predicting a continuous-valued output
      Example: predicting house prices based on number of bedrooms and square footage
      Algorithms: Linear Regression
  • 24. CLUSTERING
      Automatic grouping of similar objects into sets (clusters)
      Example: market segmentation, i.e. automatically grouping customers into different market segments
      Algorithms: K-means, LDA
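To make the clustering slide concrete, here is a small sketch of plain k-means in standard-library Python. The customer data (two made-up numeric features per customer) and the choice of initial centroids are assumptions for illustration only; a real segmentation would use a library implementation and a proper initialization scheme.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (monthly spend in hundreds, store visits per month)
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Assumption: start from one customer in each apparent group.
centroids, segments = kmeans(customers, centroids=[customers[0], customers[3]])
for centroid, segment in zip(centroids, segments):
    print(centroid, segment)
```

The number of clusters (two here) is chosen up front; it is a hyperparameter rather than something the algorithm learns.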
  • 25. COLLABORATIVE FILTERING
      Fill in the missing entries of a user-item association matrix
      Applications: product/movie recommendation
      Algorithms: Alternating Least Squares (ALS)
  • 26. DIMENSIONALITY REDUCTION
      Reducing the number of redundant features/variables
      Applications:
      ▪ Removing noise in images by selecting only “important” features
      ▪ Removing redundant features, e.g. MPH & KPH are linearly dependent
      Algorithms: Principal Component Analysis (PCA)
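The MPH/KPH example can be checked numerically. The sketch below is not PCA; it is a simpler correlation check that flags two features as redundant when one is (almost) a linear function of the other. The speed values and the 0.999 threshold are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical speed readings in miles per hour ...
mph = [10.0, 25.0, 40.0, 55.0, 70.0]
# ... and the same readings in kilometres per hour (1 mile = 1.609344 km).
kph = [v * 1.609344 for v in mph]

r = pearson(mph, kph)
redundant = abs(r) > 0.999  # assumed threshold for "linearly dependent"
print(f"correlation = {r:.6f}, redundant = {redundant}")
```

A correlation of essentially 1 means the second column carries no extra information, so one of the two can be dropped; PCA reaches the same conclusion by finding that a single component explains all the variance of the pair.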
  • 27. START: Regression, Classification, Deep Learning, Clustering, Dimensionality Reduction, Collaborative Filtering
      Algorithms shown: XGBoost (Extreme Gradient Boosting); Recurrent Neural Network (RNN); Convolutional Neural Network (CNN); Yinyang K-Means; t-Distributed Stochastic Neighbor Embedding (t-SNE); Local Regression (LOESS); Weighted Alternating Least Squares (WALS)
  • 28. DEEP LEARNING
  • 29. (no text on this slide)
  • 30. TensorFlow Playground: playground.tensorflow.org
  • 31. (no text on this slide)
  • 32. Identify the right Deep Learning problems
      ▪ DL is terrific at language tasks, image classification, speech translation, machine translation, and game playing (e.g. Chess, Go, Starcraft).
      ▪ It is less performant at traditional Machine Learning tasks such as credit card fraud detection, asset pricing, and credit scoring.
      Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
  • 33. Limits to Deep Learning (DL)
      ▪ We don’t have infinite datasets
          – DL is not great at generalizing
          – ImageNet: 9 layers and 60 million parameters with 650,000 nodes, trained on 1 million examples with 1,000 categories
      ▪ Top 10 challenges for Deep Learning
          1. Data hungry
          2. Shallow and limited capacity for transfer
          3. No natural way to deal with hierarchical structure
          4. Struggles with open-ended inference
          5. Not sufficiently transparent
          6. Not well integrated with prior knowledge
          7. Cannot distinguish causation from correlation
          8. Presumes stable world
          9. Works well as an approximation, but answers cannot be fully trusted
          10. Difficult to engineer with
      Source: arxiv.org/pdf/1801.00631.pdf
  • 34. (no text on this slide)
  • 35. (no text on this slide)
  • 36. AI HACKING
  • 37. Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 38. Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 39. DATA SCIENCE JOURNEY
  • 40. (no text on this slide)
  • 41. Start by Asking Relevant Questions
      ▪ Specific (can you think of a clear answer?)
      ▪ Measurable (quantifiable? data driven?)
      ▪ Actionable (if you had an answer, could you do something with it?)
      ▪ Realistic (can you get an answer with data you have?)
      ▪ Timely (answer in reasonable timeframe?)
  • 42. Data Preparation
      1. Data analysis (audit for anomalies/errors)
      2. Creating an intuitive workflow (formulate sequence of prep operations)
      3. Validation (correctness evaluated against a representative sample dataset)
      4. Transformation (actual prep process takes place)
      5. Backflow of cleaned data (replace original dirty data)
      Approx. 80% of a Data Analyst’s job is Data Preparation!
      Example of multiple values used for U.S. states → California, CA, Cal., Cal
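The state-name example on slide 42 is a typical transformation step. A minimal sketch, assuming a hand-written alias table (the `STATE_ALIASES` mapping and the `normalize_state` helper are hypothetical, and only cover the spellings listed on the slide):

```python
# Hypothetical alias table: every known spelling maps to one canonical code.
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(raw):
    """Map a raw state string to its canonical code; leave unknown values unchanged."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key, raw.strip())

raw_values = ["California", "CA", "Cal.", "Cal", " ca "]
cleaned = [normalize_state(v) for v in raw_values]
print(cleaned)  # ['CA', 'CA', 'CA', 'CA', 'CA']
```

Values that are not in the table pass through untouched, so they can be reviewed in the audit step instead of being silently rewritten.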
  • 43. Feature Selection
      ▪ Also known as variable or attribute selection
      ▪ Why important?
          – simplification of models → easier to interpret by researchers/users
          – shorter training times
          – enhanced generalization by reducing overfitting
      ▪ Dimensionality reduction vs feature selection
          – Dimensionality reduction: create new combinations of attributes
          – Feature selection: include/exclude attributes in data without changing them
      Q: Which features should you use to create a predictive model?
  • 44. Hyperparameters
      ▪ Define higher-level model properties, e.g. complexity or learning rate
      ▪ Cannot be learned during training → need to be predefined
      ▪ Can be decided by
          – setting different values
          – training different models
          – choosing the values that test better
      ▪ Hyperparameter examples
          – Number of leaves or depth of a tree
          – Number of latent factors in a matrix factorization
          – Learning rate (in many models)
          – Number of hidden layers in a deep neural network
          – Number of clusters in a k-means clustering
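The "set different values, train different models, choose the values that test better" loop can be sketched as a tiny grid search. The model here is a one-feature k-nearest-neighbours regressor, chosen only because it fits in a few lines; the training and validation points are made up, and `k` plays the role of the hyperparameter.

```python
def knn_predict(train, x, k):
    """Predict by averaging the labels of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(label for _, label in nearest) / k

def validation_mse(train, held_out, k):
    """Mean squared error on held-out data for a given k."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in held_out]
    return sum(errors) / len(errors)

# Hypothetical (feature, label) pairs, roughly following y = 2x + 5.
train = [(0, 5.2), (1, 6.9), (2, 9.1), (3, 10.8),
         (4, 13.3), (5, 14.9), (6, 17.0), (7, 19.2)]
validation = [(0.5, 6.0), (2.5, 10.0), (4.5, 14.0), (6.5, 18.0)]

candidates = [1, 2, 4, 8]  # hyperparameter values to try
scores = {k: validation_mse(train, validation, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k in candidates:
    print(f"k={k}: validation MSE = {scores[k]:.3f}")
print("best k:", best_k)
```

The key design point is that the score comes from data the model did not train on; picking `k` by training error would always favour `k=1`.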
  • 45. ▪ Residuals
          – the residual of an observed value is the difference between the observed value and the estimated value
      ▪ R² (R Squared), the Coefficient of Determination
          – indicates goodness of fit
          – R² of 1 means the regression line perfectly fits the data
      ▪ RMSE (Root Mean Square Error)
          – measure of differences between values predicted by a model and values actually observed
          – good measure of accuracy, but only to compare forecasting errors of different models (individual variables are scale-dependent)
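The three metrics on slide 45 follow directly from their definitions. A minimal sketch in plain Python, reusing the y = 2x + 5 example from earlier in the deck; the second set of predictions is made up to show a non-perfect fit.

```python
import math

def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r ** 2 for r in residuals)               # residual sum of squares
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r_squared, rmse

observed = [7.0, 5.0, 19.0, 9.0]

# Predictions from y = 2x + 5 on inputs 1, 0, 7, 2: a perfect fit.
perfect = [2 * x + 5 for x in [1, 0, 7, 2]]
_, r2_perfect, rmse_perfect = regression_metrics(observed, perfect)
print(r2_perfect, rmse_perfect)  # 1.0 0.0

# Hypothetical predictions that are each off by one unit.
off_by_one = [8.0, 4.0, 18.0, 10.0]
residuals, r2, rmse = regression_metrics(observed, off_by_one)
print(residuals, round(r2, 4), rmse)
```

Note that RMSE is expressed in the units of the target variable, which is why it is only comparable across models that predict the same quantity.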
  • 46. With that in mind…
      ▪ No simple formula for “good questions”, only general guidelines
      ▪ The right data is better than lots of data
      ▪ Understanding relationships matters
  • 47. Training Set → Learning Algorithm → h (hypothesis/model); input → h → output
      Steps: Ingest / Enrich Data; Clean / Transform / Filter; Select / Create New Features; Evaluate Accuracy / Score
  • 48. MODEL TRAINING
  • 49. Scatter Data
      |label|features|
      |-12.0|  [-4.9]|
      | -6.0|  [-4.5]|
      | -7.2|  [-4.1]|
      | -5.0|  [-3.2]|
      | -2.0|  [-3.0]|
      | -3.1|  [-2.1]|
      | -4.0|  [-1.5]|
      | -2.2|  [-1.2]|
      | -2.0|  [-0.7]|
      |  1.0|  [-0.5]|
      | -0.7|  [-0.2]|
      |  ...|     ...|
  • 50. Training Result
      Coefficients: 2.81
      Intercept: 3.05
      y = 2.81x + 3.05
  • 51. Linear Regression (two features)
      Coefficients: [0.464, 0.464]
      Intercept: 0.0563
  • 52. SPARK ML PIPELINES
  • 53. Pipeline stages: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
      Train: Input DataFrame → Pipeline → Pipeline Model
      Predict: Input DataFrame → Pipeline Model → Output DataFrame
      Export Model
  • 54. Spark ML Pipeline
      ▪ fit() is for training
      ▪ transform() is for prediction
      Train: Input DataFrame → Pipeline (fit) → Pipeline Model
      Predict: Input DataFrame → Pipeline Model (transform) → Output DataFrame
  • 55. SAMPLE CODE
      from pyspark.ml import Pipeline
      from pyspark.ml.classification import RandomForestClassifier

      # indexer, parser, hashingTF and vecAssembler are feature stages assumed to be defined earlier
      rf = RandomForestClassifier(numTrees=100)
      pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
      model = pipe.fit(trainData)            # Train model
      results = model.transform(testData)    # Test model
  • 56. DSX + HDP
  • 57. Source: Google NIPS ML Code
  • 59. DATA SCIENCE PLATFORM
      Community: Find tutorials and datasets; Connect with Data Scientists; Ask questions; Read articles and papers; Fork and share projects.
      Open Source: Code in Scala/Python/R/SQL; Zeppelin & Jupyter Notebooks; RStudio IDE and Shiny; Apache Spark; Your favorite libraries.
      Scale & Enterprise Security: Data Science at Scale; Run Spark Jobs on HDP Cluster; Secure Hadoop Support; Ranger Atlas Support for Data; Support for ABAC.
      Model Management: Data Shaping Pipeline UI; Auto-data preparation & modeling; Advanced Visualizations; Model management & deployment; Documented Model APIs.
      Products: Data Science Experience; SPSS Modeler for DSX; DO for DSX.
  • 60. DSX and HDP. Data Science Experience (DSX) Local: enterprise data science platform for teams. Connected via the Livy REST interface to Hortonworks Data Platform (HDP): enterprise compute (Spark/Hive) & storage (HDFS/Ozone).
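To give a feel for what going through the Livy REST interface involves, here is a minimal sketch that builds (but does not send) the two HTTP requests Apache Livy uses for interactive work: creating a session and submitting a statement. The host name is hypothetical, the helper names are invented for this example, and nothing here is claimed to be how DSX itself talks to Livy.

```python
import json
from urllib.request import Request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def _post(path, payload):
    """Build a JSON POST request for a Livy endpoint (not sent here)."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request():
    # POST /sessions starts an interactive session of the given kind.
    return _post("/sessions", {"kind": "pyspark"})


def run_statement_request(session_id, code):
    # POST /sessions/{id}/statements runs a code snippet in that session.
    return _post(f"/sessions/{session_id}/statements", {"code": code})


session_req = create_session_request()
statement_req = run_statement_request(0, "spark.range(10).count()")
```

Passing either request to `urllib.request.urlopen` would send it; a real client would also poll the session and statement states and handle authentication, which this sketch leaves out.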
  • 61. FINAL NOTES
  • 62. “The business value of AI consists of its ability to lower the cost of prediction, just as computers lowered the cost of arithmetic.”
  • 63. Building a Business Around AI (HBR)
      1. Find and own valuable data no one else has
      2. Take a systemic view of your business, and find data adjacencies
      3. Package AI for the customer experience
      "No single tool, even one as powerful as AI, determines the fate of a business. As much as the world changes, deep truths — around unearthing customer knowledge, capturing scarce goods, and finding profitable adjacencies — will matter greatly. As ever, the technology works to the extent that its owners know what it can do, and know their market."
  • 64. Robert Hryniewicz @RobHryniewicz Thanks!