This document discusses the challenges of big data analytics and how Apache Spark and Databricks can help address them. It summarizes that:
1) There is a gap between the growth of data and ability to perform real-time analytics on that data due to challenges in managing infrastructure, empowering teams, and establishing production-ready applications.
2) Databricks provides a cloud-hosted platform that uses Apache Spark to allow for just-in-time processing of data across storage silos, with an integrated workspace for interactive exploration, machine learning, and production-ready workflows.
3) Databricks Enterprise Security provides an end-to-end security solution for Apache Spark to address challenges in securing file
6. Why is there a gap?
6
Real-time Data-Driven Analytics Applications
Manage Data
infrastructure
• Create, tune, monitor compute clusters.
• Securely access silos of disparate data sources.
• Enforce proper data governance.
•1
Empower teams to be
productive
• Securely share big data clusters among analysts.
• Interactively explore data and prototype ideas.
• Debug, troubleshoot, version-control big data applications.•
•
•
2
Establish Production-
Ready Applications
• Setup robust data pipelines for ETL/ELT.
• Productionize real-time applications with HA, FT.
• Build, serve, maintain advanced machine learning models.
•
3
Siloed, Fast-Growing Size, Cost
7. Databricks Cloud-Hosted Platform
7
• Separate compute & storage
• Integrate existing data stores
• Efficient cache on first access
Just-in-Time Data
Platform
1
Agile
• Workflow scheduler for ML,
streaming, SQL, ETL
• High availability, fault-
tolerant, performance-
optimized
Automated Apache
Spark Management
3
Production-Ready
• Interactive notebooks,
dashboards, reports
• Real-time exploration,
machine learning, graph use
cases
Integrated
Workspace
2
Democratize Big Data
8. HADOOP /
DATA LAKES
DATA
WAREHOUSE
S
YOUR
STORAGE
CLOUD
STORAGE
8
Databricks Just-in-Time Data Platform
INTEGRATED
WORKSPACEDASHBOARD
S
Reports
NOTEBOOKS
github, viz,
collaboration
BI TOOLS
JUST-IN-TIME
PROCESSING
POWERED
BY
APACHE
CLUSTERS: Auto-scaled, resilient, multi-tenant
DATA INTEGRATION: secure and fast data source
integrations
INTERFACES: REST APIs & BI tools
DATABRICKS SERVICES
+
YOUR CUSTOM SPARK
APPS
PRODUCTION JOBS
DATA LAKE
DATA HUB
9. The Challenge of Securing Analytics
9
End-to-end security a challenge for enterprises
Securing file
management
Secure table
management
Secure
cluster
management
Secure
job
workflows
Secure
dashboards,
report, notebook
management
Today there are piecemeal solutions, but no comprehensive
solution
10. Databricks Enterprise Security (DBES)
10
Holistic end-to-end security for Data Analytics
Tables Clusters Workflow
s
Notebooks,
Dashboards,
Reports
Files
• Role-based access control
• Auditing and governance
• Integrated identity-management
• Encryption on-disk and on-the-
wire
DBES provides
The First End-to-End Security Solution for Apache Spark
11. Enterprise use-cases
11
Preventing credit card fraud
Predict energy demand based on massive weather data
Predict player churn, predicting network outages
Natural language processing to extract author graph
Generating tailored programs based on big data
13. Try Apache Spark with Databricks
13
http://databricks.com/try
Try latest version of Apache Spark and preview of Spark 2.0
Notes de l'éditeur
I HAVE AN ANNOUNCEMENT TODAY
BUT FIRST, WANT TO TALK ABOUT DATABRICKS
THIS IS NOT JUST SCI-FI
Establish production-quality infrastructure
1: Global publisher Elsevier – team in US and EU perform natural language processing on all their content to experiment with new product ideas.
2: Energy analytics company DNV GL – Databricks sped up analytics of IoT data from weather and grid sensor by 100x.
3: Financial service provider LendUp– Databricks enabled them to update their machine models daily instead of weekly.