This talk will cover how to build an end-to-end data processing system in Python: from data ingest, to data analytics, to machine learning, to user presentation. Developments in both established and newer tools have made this practical today. In particular, the talk will cover Airflow for workflow orchestration, PySpark for data processing, the Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python.
System architects, software engineers, data scientists, and business leaders can all benefit from attending. They will learn how to build more agile data processing systems and take away ideas for making their own data systems simpler and more powerful.
9. Big Data in 2009 was so Java-oriented.
It was easier to use Java for everything or to cobble together a collection of random languages.
10. Python seemed to have everything we wanted, except for Big Data.
Some brave souls tried: Hadoopy, mrjob, Pig+Python
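For a taste of that era, a word count in mrjob looked roughly like this (a minimal sketch based on the mrjob documentation, not code from the talk):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Emit (word, 1) for every word in every input line
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Sum the counts for each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()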
11. PySpark
PySpark was the missing piece of the Big Data Python picture
The first major Big Data platform with first-class Python support
Thanks to PySpark, Python is now a viable and competitive option for end-to-end systems that utilize Big Data
12. What’s the big deal?
Python has best-in-class functionality for all the other things we want to do with Big Data:
Data manipulation, Machine Learning, Text, Applications, Visualization
In 2017, we can build end-to-end Big Data systems entirely in Python:
from ingest to user experience and everything in between
13. The case for Python
Succinct code that’s easy to read
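As a tiny illustration of that succinctness (a hypothetical snippet, not from the talk; corpus.txt is a placeholder file name):

from collections import Counter

# Word frequencies for a whole file in two readable lines
with open('corpus.txt') as f:
    counts = Counter(word for line in f for word in line.split())
print(counts.most_common(10))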
17. Distributed Computing
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read data as lines from a source (inpath is the input location)
lines = spark.read.text(inpath).rdd.map(lambda r: r[0])
# Count the words
counts = (lines.flatMap(lambda x: x.split(' '))
               .map(lambda x: (x, 1))
               .reduceByKey(add))
# Bring the result to the driver
output = counts.collect()
18. Machine Learning
import sklearn.ensemble

# Initialize a Random Forest classifier with 250 trees
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
# Train the Random Forest classifier
clf = clf.fit(feature_vectors, labels)
Why sklearn over MLlib? Once the features have been aggregated, the training data often fits in memory on a single machine, where sklearn's broader algorithm selection and simpler API win out.
19. Deep Learning (Keras)
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image as keras_image

# Load the ImageNet-pretrained network
model = VGG16(weights="imagenet")
# Load an image as a 1-image batch (VGG16 expects 224x224 RGB; the path is a placeholder)
img = keras_image.img_to_array(keras_image.load_img("photo.jpg", target_size=(224, 224)))
# Run the model on the image
preds = model.predict(preprocess_input(np.expand_dims(img, axis=0)))
# Hotdog or not hotdog?
print(decode_predictions(preds))
Also excited about PyTorch!
20. Visualization
Lots of visualization options in Python
• Seaborn
• Matplotlib
• Bokeh
• ggplot
seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
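To run the swarmplot line above, the classic iris dataset needs to be loaded and melted to long form first (a minimal complete version following the seaborn documentation example):

import pandas as pd
import seaborn
import matplotlib.pyplot as plt

# Load the iris dataset and reshape it to long form
iris = seaborn.load_dataset("iris")
iris = pd.melt(iris, "species", var_name="measurement")

# One point per observation, grouped by measurement and colored by species
seaborn.swarmplot(x="measurement", y="value", hue="species", data=iris)
plt.show()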
22. Workflows
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Run this every day at 3:45 AM
mdag = DAG('DRSpark', description='DailyRun',
           schedule_interval='45 3 * * *', start_date=datetime(2017, 1, 1))
sp1 = PythonOperator(task_id='sp1', python_callable=runspark1, dag=mdag)
sp2 = PythonOperator(task_id='sp2', python_callable=runspark2, dag=mdag)
ou = PythonOperator(task_id='clean', python_callable=cleanupresults, dag=mdag)
sp1 >> ou  # sp1 happens before ou
sp2 >> ou  # sp2 happens before ou, but doesn't depend on sp1
23. # Spark job to build feature vectors
import pickle
import sklearn.ensemble

# myrdd holds the raw netflow CSV lines (sample below)
rows = myrdd.map(lambda r: r[0].split(','))
out = (rows.map(lambda row: (row[0], row))
           .groupByKey().map(build_feature_vector))  # outputs [(FV, label)]
# Bring data down locally and prepare it
localout = out.collect()
X = [row[0] for row in localout]  # feature is a set of 40 aggregate properties
y = [row[1] for row in localout]  # potential labels are types of devices
# Train a Random Forest classifier on it
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=250)
clf = clf.fit(X, y)
# Save the model in binary mode (maybe to S3 instead?)
pickle.dump(clf, open('/models/behavior.sklearn', 'wb'))
--- training data sample of netflow ---
SOURCE IP,    DEST IP,        DATE,       STIME, ETIME, DATAIN, DATAOUT
123.41.12.31, 123.41.155.32,  2017-02-01, 09:00, 09:59, 103KB,  959KB
123.41.59.99, 123.41.155.32,  2017-02-01, 09:00, 09:59, 44KB,   884KB
123.41.12.31, 123.41.155.32,  2017-02-01, 10:00, 10:59, 3KB,    9KB
123.41.59.99, 123.41.155.32,  2017-02-01, 10:00, 10:59, 4KB,    15KB
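The talk doesn't show build_feature_vector itself; as a rough idea of the per-device aggregation it might perform over the grouped netflow rows, here is a purely hypothetical sketch (column indices follow the sample above; lookup_device_type is an invented stand-in for wherever the training labels come from):

def build_feature_vector(item):
    # Hypothetical sketch: the real job aggregates 40 properties per device
    source_ip, rows = item
    rows = list(rows)
    data_in = [int(r[5].strip().rstrip('KB')) for r in rows]   # DATAIN column
    data_out = [int(r[6].strip().rstrip('KB')) for r in rows]  # DATAOUT column
    features = [
        len(rows),                         # number of hourly flow records
        sum(data_in), max(data_in),        # inbound volume stats
        sum(data_out), max(data_out),      # outbound volume stats
        len(set(r[1] for r in rows)),      # distinct destination IPs
    ]
    label = lookup_device_type(source_ip)  # hypothetical label lookup
    return (features, label)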
24. # http://56.120.177.55/predictip?ip=159.31.120.44
# http://56.120.177.55/predictfv -- for POST of feature vector
import json
import pickle
from flask import Flask, request

app = Flask(__name__)
_MODEL = pickle.load(open('/models/behavior.sklearn', 'rb'))

@app.route('/predicttype', methods=['POST'])
def predicttypefrombehavior():
    netflowlog = request.form['logcsv']
    fv = build_feature_vector(netflowlog)
    pr = _MODEL.predict([fv])[0]  # predict() takes a list of samples
    return json.dumps({'query': fv, 'prediction': pr})
# returns { 'query' : [9, 4, 123.1, …], 'prediction' : 'HTTP PROXY' }
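A hypothetical client call to the service above (assuming the requests library is installed; the sample row comes from the netflow table on the previous slide):

import requests

# POST one device's netflow log to the prediction endpoint
resp = requests.post(
    'http://56.120.177.55/predicttype',
    data={'logcsv': '123.41.12.31, 123.41.155.32, 2017-02-01, 09:00, 09:59, 103KB, 959KB'},
)
print(resp.json())  # e.g. {'query': [...], 'prediction': 'HTTP PROXY'}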
25. PYTHON!
Viable option for Big Data Analytics with PySpark
Tie it all together and integrate into the enterprise with the same language
Leverage the benefits of Python for data analysis
Get projects done faster