The developer world is changing as we create and generate new data patterns and handling processes within our applications. With the massive interest in machine learning and advanced analytics, how can we as developers build intelligence directly into our applications and integrate it with the data and data paths we are creating? The answer is Azure Databricks. By attending this session you will learn to confidently develop smarter, more intelligent applications and solutions that can be built upon continuously and that scale with the growing demands of a modern application estate.
The Developer Data Scientist – Creating New Analytics Driven Applications using Apache Spark with Azure Databricks
3. BIG DATA & ADVANCED ANALYTICS AT A GLANCE
Sources: custom apps, sensors and devices, business apps, SQL, Kafka
Ingest: Data Factory (data movement, pipelines & orchestration), Event Hub, IoT Hub
Store: Blobs, Data Lake
Prep & Train: Databricks, HDInsight, Data Lake Analytics, Machine Learning
Model & Serve: Cosmos DB, SQL Data Warehouse, SQL Database, Analysis Services
Intelligence: analytical dashboards, predictive apps, operational reports
4. A fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure
Best of Databricks:
• Designed in collaboration with the founders of Apache Spark
• One-click setup and streamlined workflows
• Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts
Best of Microsoft:
• Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
• Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
9. Infrastructure management
Data exploration and visualization at scale
Time to value: from model iterations to intelligence
Integrating various ML tools to stitch a solution together
Operationalizing ML models to integrate them into applications
10. AZURE DATABRICKS
Optimized Databricks Runtime Engine: Apache Spark, Databricks I/O, Serverless
Collaborative Workspace: data engineers, data scientists, and business analysts enhance productivity together
Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notifications & logs
Inputs: cloud storage, data warehouses, Hadoop storage, IoT / streaming data, REST APIs
Outputs: machine learning models, BI tools, data exports, data warehouses
Build on a secure & trusted cloud; scale without limits
11. Easy to create and manage compute clusters that auto-scale
Rapid development using the integrated workspace that facilitates cross-team collaboration
Interactive exploration with notebooks and dashboards
Seamless integration with ML ecosystem libraries and tools
Deep learning support with GPUs (coming soon in the next release)
33. Data Science vs. Software Engineering
Data Science: prototype (Python/R), create the model
Software Engineering: re-implement the model for production (Java), deploy the model
34. Data Science vs. Software Engineering
Data Science: prototype (Python/R), create a Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make a prediction
Software Engineering: re-implement the Pipeline for production (Java), deploy the Pipeline
The handoff costs: extra implementation work, different code paths, synchronization overhead
35. Data Science vs. Software Engineering
Data Science: prototype (Python/R), create a Pipeline
Persist the model or Pipeline: model.save("path://...")
Software Engineering: load the Pipeline (Scala/Java): PipelineModel.load("path://...")
Deploy in production (a sketch follows below)
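MLlib's persistence API is what removes the re-implementation step. A minimal PySpark sketch, assuming a toy text-classification Pipeline (the columns, stages, and path are illustrative, not from the deck):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Toy training data: text plus a binary label.
training = spark.createDataFrame(
    [("spark is great", 1.0), ("the movie was bad", 0.0)],
    ["text", "label"])

# Prototype side: assemble and fit the Pipeline in Python.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)

# Persist: the saved format is language-neutral.
model.save("/tmp/demo-pipeline")

# Production side: the same artifact loads in Python, Scala, or Java
# (PipelineModel.load in each) with no re-implementation.
reloaded = PipelineModel.load("/tmp/demo-pipeline")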
48. Use Azure Databricks to scale out ML tasks
Leverage well-known model architectures
The MLlib Pipeline API simplifies ML workflows
Leverage pre-trained models for common tasks (see the sketch below)
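The deck shows no code for this bullet. One common scale-out pattern, sketched here under stated assumptions (the model file, table name, and feature columns are hypothetical), is to broadcast a pre-trained single-node model to the cluster and score it in parallel with a pandas UDF:

import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Load the pre-trained scikit-learn model once on the driver,
# then broadcast it to every executor.
model = joblib.load("/dbfs/models/churn.pkl")   # hypothetical path
bc_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict(age, tenure, spend):
    # Each executor scores its partitions of rows in parallel.
    features = pd.concat([age, tenure, spend], axis=1)
    return pd.Series(bc_model.value.predict(features))

df = spark.table("customers")                   # hypothetical table
scored = df.withColumn("prediction", predict("age", "tenure", "spend"))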
58. A flight-data graph: airports (JFK, IAD, LAX, SFO, SEA, DFW) are vertices, flights are edges.
vertices DataFrame (id, city, state):
  SEA | Seattle       | WA
  SFO | San Francisco | CA
  JFK | New York      | NY
edges DataFrame (src, dest, delay, tripid):
  SFO | SEA | 45 | 105892
  LAX | JFK | 52 | 410022
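A minimal PySpark sketch of building that graph with the GraphFrames library. Values mirror the slide; note that GraphFrames expects the edge columns to be named src and dst, so the slide's dest column is renamed, and an LAX vertex is added so both edges resolve:

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame(
    [("SEA", "Seattle", "WA"),
     ("SFO", "San Francisco", "CA"),
     ("JFK", "New York", "NY"),
     ("LAX", "Los Angeles", "CA")],   # added so both edges resolve
    ["id", "city", "state"])

edges = spark.createDataFrame(
    [("SFO", "SEA", 45, 105892),
     ("LAX", "JFK", 52, 410022)],
    ["src", "dst", "delay", "tripid"])  # GraphFrames requires src/dst

g = GraphFrame(vertices, edges)
g.edges.filter("delay > 50").show()   # e.g. flights delayed over 50 minutes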
63. Save & load the DataFrames.
from graphframes import GraphFrame

# Reload the persisted vertices & edges and rebuild the graph.
vertices = sqlContext.read.parquet(...)
edges = sqlContext.read.parquet(...)
g = GraphFrame(vertices, edges)

# Persist the graph's DataFrames as Parquet.
g.vertices.write.parquet(...)
g.edges.write.parquet(...)
Editor's notes
Contributions estimated from GitHub commit logs, with some effort to de-duplicate entities.
No time to mention:
User-defined functions (UDFs)
Optimizations: code gen, predicate pushdown
Model training / tuning
Regularization: a parameter that controls how well the linear model does on unseen data.
There is no single good value for the regularization parameter.
One common method to find one is to try out different values.
This technique is called cross-validation (CV): you split your training data into two sets, one used to fit the model with a given regularization parameter, and another to evaluate how well it does with that parameter.
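A minimal MLlib sketch of that search, reusing the hypothetical pipeline, lr, and training objects from the persistence example above:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Candidate regularization strengths for the LogisticRegression stage.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .build())

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3)  # each fold: fit on one split, evaluate on the held-out split

cvModel = cv.fit(training)   # best regParam chosen by held-out performance
best = cvModel.bestModel     # refit on the full training data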