SlideShare une entreprise Scribd logo
1  sur  25
DONALD MINER
End-to-end Big Data
Projects with Python
DONALD MINER
StampedeCon Big Data
July 25th, 2017
dminer@minerkasch.com
Donald Miner
2009
BIG DATA =
Real “Big Data”
isn’t just about
platforms anymore
Real “Big Data”
isn’t just about
platforms anymore
Streaming
Infrastructure
Cloud
Applications
Mobile
{
Real “Big Data”
isn’t just about
data processing
anymore
Real “Big Data”
isn’t just about
data processing
anymore
Machine Learning
Data Science
NLP
Deep Learning
Visualization
{
2017
BIG DATA =
& integrated and user facing
& advanced analytics
Big Data in 2009 was so Java oriented.
It was easier to use Java for everything or
use a collection of random languages.
Python seemed to have everything we
wanted, except for Big Data
Some brave souls tried:
Hadoopy, mrjob, Pig+Python
PySpark
PySpark was the missing piece of the Big Data Python picture
The first major Big Data platform with first-class Python support
Thanks to PySpark, Python is now a viable and competitive option
for end-to-end systems that utilize Big Data
What’s the big deal?
Python has best in class functionality for all the other things we want
to do with Big Data:
Data manipulation, Machine Learning, Text, Applications, Visualization
In 2017,
we can build end-to-end Big Data systems entirely in Python:
from ingest to user experience and everything between
The case for Python
Succinct code that’s easy to read
The case for Python
A language people know
The case for Python
Interpreted, not compiled
The Python Big Data Architecture
Distributed Computing
# Read data as lines from a source
lines = spark.read.text(inpath).rdd.map(lambda r: r[0])
# Count the data
counts = lines.flatMap(lambda x: x.split(' ')) 
.map(lambda x: (x, 1)) 
.reduceByKey(add)
# Bring it locally
output = counts.collect()
Machine Learning
# Initialize Random Forest classifier
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
# Train Random Forest classifier
clf = clf.fit(feature_vectors, labels)
Why sklearn over MLLib?
Deep Learning
# Load the ImageNet pretrained network
model = VGG16(weights="imagenet")
# Run the model on an image
preds = model.predict(preprocess_input(image))
# Hotdog or not hotdog?
print(decode_predictions(preds))
& Keras
Also excited about pytorch!
Visualization
Lots of visualization options in Python
• Seaborn
• Matplotlib
• Bokeh
• ggplot
seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
Integration
# http://56.120.177.55/hello?name=Don
@app.route('/hello')
def say_hello():
name = request.args.get(‘name’)
return json.dumps({ ‘query’ : name, 
‘message’: ‘HELLO ‘ + name })
# returns { ‘query’ : ‘Don’, ‘message’ : ‘HELLO Don’ }
Workflows
# Run this every day at 3:45AM
mdag = DAG(’DRSpark', description=’DailyRun', schedule_interval=’45 3 * * *')
sp1 = PythonOperator(task_id=‘sp1’, python_callable=runspark1, dag=mdag)
sp2 = PythonOperator(task_id=’sp2’, python_callable=runspark2, dag=mdag)
ou = PythonOperator(task_id=‘clean’, python_callable=cleanupresults, dag=mdag)
sp1 >> ou # sp1 happens before ou
sp2 >> ou # sp2 happens before ou, but doesn’t depend on sp1
# Spark job to build feature vectors
rows = myrdd.map(lambda r: r[0].split(‘,’))
out = rows.map(lambda row: (row[0], row)) 
.groupByKey().map(build_feature_vector) # outputs [(FV, label)]
# Bring data down locally and prepare it
localout = counts.collect()
X = [ row[0] for row in localout ] # feature is set of 40 aggregate properties
t = [ row[1] for row in localout ] # potential labels are types of devices
# Train a RF classifier on it
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
clf = clf.fit(X, y)
# save the model (maybe to s3 instead?)
pickle.dump(clf, open(‘/models/behavior.sklearn’, ‘w’))
--- training data sample of netflow ---
SOURCE IP DEST IP DATE STIME ETIME DATAIN DATAOUT
123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB
123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB
123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB
123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
# http://56.120.177.55/predictip?ip=159.31.120.44
# http://56.120.177.55/predictfv -- for POST of feature vector
_MODEL = pickle.load(open(’/models/maintenance.sklearn’))
@app.route('/predictrepair')
def predicttypefrombehavior():
netflowlog = request.form[‘logcsv’]
fv = build_feature_vetor(netflowlog)
pr = _MODEL.predict(fv)
return json.dumps({ ‘query’ : fv, 
‘prediction’ : pr })
# returns { ‘query’ : [9, 4, 123.1, …], ‘prediction’ : ‘HTTP PROXY’ }
--- training data sample of netflow ---
SOURCE IP DEST IP DATE STIME ETIME DATAIN DATAOUT
123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB
123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB
123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB
123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
PYTHON!
Viable option for Big Data Analytics with PySpark
Tie it all together and integrate into the enterprise with the
same language
Leverage the benefits of Python for data analysis
Get projects done faster

Contenu connexe

Similaire à End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...TigerGraph
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017Amazon Web Services
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Map, flatmap and reduce are your new best friends (javaone, svcc)
Map, flatmap and reduce are your new best friends (javaone, svcc)Map, flatmap and reduce are your new best friends (javaone, svcc)
Map, flatmap and reduce are your new best friends (javaone, svcc)Chris Richardson
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkDatabricks
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data ExpoBigDataExpo
 
Google Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine LearningGoogle Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine LearningDataWorks Summit/Hadoop Summit
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakPyData
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
Towards a Serverless Platform for Edge AI
Towards a Serverless Platform for Edge AITowards a Serverless Platform for Edge AI
Towards a Serverless Platform for Edge AIThomas Rausch
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)Amazon Web Services Korea
 

Similaire à End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017 (20)

Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017
Scalable Deep Learning on AWS Using Apache MXNet - AWS Summit Tel Aviv 2017
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Map, flatmap and reduce are your new best friends (javaone, svcc)
Map, flatmap and reduce are your new best friends (javaone, svcc)Map, flatmap and reduce are your new best friends (javaone, svcc)
Map, flatmap and reduce are your new best friends (javaone, svcc)
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
 
Google Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine LearningGoogle Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine Learning
 
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
MXNet Workshop
MXNet WorkshopMXNet Workshop
MXNet Workshop
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
Towards a Serverless Platform for Edge AI
Towards a Serverless Platform for Edge AITowards a Serverless Platform for Edge AI
Towards a Serverless Platform for Edge AI
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
 

Plus de StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
 

Plus de StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 

Dernier

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Dernier (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017

  • 1. DONALD MINER End-to-end Big Data Projects with Python DONALD MINER StampedeCon Big Data July 25th, 2017
  • 4. Real “Big Data” isn’t just about platforms anymore
  • 5. Real “Big Data” isn’t just about platforms anymore Streaming Infrastructure Cloud Applications Mobile {
  • 6. Real “Big Data” isn’t just about data processing anymore
  • 7. Real “Big Data” isn’t just about data processing anymore Machine Learning Data Science NLP Deep Learning Visualization {
  • 8. 2017 BIG DATA = & integrated and user facing & advanced analytics
  • 9. Big Data in 2009 was so Java oriented. It was easier to use Java for everything or use a collection of random languages.
  • 10. Python seemed to have everything we wanted, except for Big Data Some brave souls tried: Hadoopy, mrjob, Pig+Python
  • 11. PySpark PySpark was the missing piece of the Big Data Python picture The first major Big Data platform with first-class Python support Thanks to PySpark, Python is now a viable and competitive option for end-to-end systems that utilize Big Data
  • 12. What’s the big deal? Python has best in class functionality for all the other things we want to do with Big Data: Data manipulation, Machine Learning, Text, Applications, Visualization In 2017, we can build end-to-end Big Data systems entirely in Python: from ingest to user experience and everything between
  • 13. The case for Python Succinct code that’s easy to read
  • 14. The case for Python A language people know
  • 15. The case for Python Interpreted, not compiled
  • 16. The Python Big Data Architecture
  • 17. Distributed Computing # Read data as lines from a source lines = spark.read.text(inpath).rdd.map(lambda r: r[0]) # Count the data counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) # Bring it locally output = counts.collect()
  • 18. Machine Learning # Initialize Random Forest classifier clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250) # Train Random Forest classifier clf = clf.fit(feature_vectors, labels) Why sklearn over MLLib?
  • 19. Deep Learning # Load the ImageNet pretrained network model = VGG16(weights="imagenet") # Run the model on an image preds = model.predict(preprocess_input(image)) # Hotdog or not hotdog? print(decode_predictions(preds)) & Keras Also excited about pytorch!
  • 20. Visualization Lots of visualization options in Python • Seaborn • Matplotlib • Bokeh • ggplot seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
  • 21. Integration # http://56.120.177.55/hello?name=Don @app.route('/hello') def say_hello(): name = request.args.get(‘name’) return json.dumps({ ‘query’ : name, ‘message’: ‘HELLO ‘ + name }) # returns { ‘query’ : ‘Don’, ‘message’ : ‘HELLO Don’ }
  • 22. Workflows # Run this every day at 3:45AM mdag = DAG(’DRSpark', description=’DailyRun', schedule_interval=’45 3 * * *') sp1 = PythonOperator(task_id=‘sp1’, python_callable=runspark1, dag=mdag) sp2 = PythonOperator(task_id=’sp2’, python_callable=runspark2, dag=mdag) ou = PythonOperator(task_id=‘clean’, python_callable=cleanupresults, dag=mdag) sp1 >> ou # sp1 happens before ou sp2 >> ou # sp2 happens before ou, but doesn’t depend on sp1
  • 23. # Spark job to build feature vectors rows = myrdd.map(lambda r: r[0].split(‘,’)) out = rows.map(lambda row: (row[0], row)) .groupByKey().map(build_feature_vector) # outputs [(FV, label)] # Bring data down locally and prepare it localout = counts.collect() X = [ row[0] for row in localout ] # feature is set of 40 aggregate properties t = [ row[1] for row in localout ] # potential labels are types of devices # Train a RF classifier on it clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250) clf = clf.fit(X, y) # save the model (maybe to s3 instead?) pickle.dump(clf, open(‘/models/behavior.sklearn’, ‘w’)) --- training data sample of netflow --- SOURCE IP DEST IP DATE STIME ETIME DATAIN DATAOUT 123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB 123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB 123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB 123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
  • 24. # http://56.120.177.55/predictip?ip=159.31.120.44 # http://56.120.177.55/predictfv -- for POST of feature vector _MODEL = pickle.load(open(’/models/maintenance.sklearn’)) @app.route('/predictrepair') def predicttypefrombehavior(): netflowlog = request.form[‘logcsv’] fv = build_feature_vetor(netflowlog) pr = _MODEL.predict(fv) return json.dumps({ ‘query’ : fv, ‘prediction’ : pr }) # returns { ‘query’ : [9, 4, 123.1, …], ‘prediction’ : ‘HTTP PROXY’ } --- training data sample of netflow --- SOURCE IP DEST IP DATE STIME ETIME DATAIN DATAOUT 123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB 123.41.59.99, 123.41.155.32, 2017-02-01, 09:00, 09:59, 44KB, 884KB 123.41.12.31, 123.41.155.32, 2017-02-01, 10:00, 10:59, 3KB, 9KB 123.41.59.99, 123.41.155.32, 2017-02-01, 10:00, 10:59, 4KB, 15KB
  • 25. PYTHON! Viable option for Big Data Analytics with PySpark Tie it all together and integrate into the enterprise with the same language Leverage the benefits of Python for data analysis Get projects done faster