Why build your own analytics application on top of Delta Lake:
• Every enterprise is building a data lake. However, these data lakes are plagued by low user adoption and poor data quality, which result in lower ROI.
• BI tools may not be enough for your use case, especially when you want to build a data-driven analytical web application such as Paysa.
• Delta's ACID guarantees allow you to build a real-time reporting app that displays consistent and reliable data.
In this talk we will learn:
• How to build your own analytics app on top of Delta Lake.
• How Delta Lake helps you build a pristine data lake, with several ways to expose data to end users.
• How an analytics web application can be backed by a custom query layer that executes Spark SQL on a remote Databricks cluster.
• Various options for building an analytics application with different back-end technologies.
• Architecture patterns, components, and frameworks that can be used to build a custom analytics platform quickly.
• How to leverage machine learning to build advanced analytics applications.
Demo: an analytics application built on Play Framework (back end) and React (front end), with Structured Streaming ingesting data from a Delta table. The demo covers:
• Live query analytics on real-time data
• ML predictions based on analytics data
3. Agenda
• Delta Lake - What and Why?
• Common Delta Lake use cases
• Data as a Service (DaaS)
• Our Approach
• Use Cases
• Demo
• Q&A
4. What’s a Data Lake?
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
“If you think of a datamart as a store of bottled water – cleansed and
packaged and structured for easy consumption – the data lake is a
large body of water in a more natural state. The contents of the data
lake stream in from a source to fill the lake, and various users of the
lake can come to examine, dive in, or take samples.” - James Dixon
5. Why Data Lake?
Sources: lakes, streams, warehouses, NoSQL, CSV, JSON, TXT…
Challenges with Data Warehouse
• Big Data problem
• Expensive (build, store and process)
• Proprietary technology (processing and
storage)
• Vendor lock-in
• Lack of ML capabilities
6. Data Lake: Aspiration
Real-time Streaming,
Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Use AI and machine learning on a variety of data sources to outperform
your competition, retain your customers, and boost your productivity
at a lower TCO
7. Data Lake: Reality
Real-time Streaming,
Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
The majority of these
projects are failing!
• Unreliable, low-quality data
• Slow performance
8. Why?
Strengths of Data Warehouse
• Full ACID Transaction
• Insert, Delete, Update w/ SCD-II
• Indexing for faster query response
• Schema-On-Write
Strengths of Data Lake
• Open Source, Open Standards
• Powered By Apache Spark
• Scale
• Unified platform for data & AI
● Unification of Batch & Streaming workloads
● Incrementally improve the quality of your data until it is ready for
consumption (Multi-hop pipelines)
● Dramatically reduces legacy Spark/Hive operational burdens
● Scalable Metadata Handling
9. What’s a Delta Lake
A Data Lake Powered By Delta
Sources: lakes, streams, warehouses, NoSQL, CSV, JSON, TXT…
Delta Lake tables:
• Bronze: raw ingestion
• Silver: filtered, cleaned, augmented
• Gold: business-level aggregates
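The Bronze, Silver, Gold flow on this slide can be illustrated with a small sketch. This is not the pipeline from the talk: it uses plain Python lists as a stand-in for Delta tables and Spark jobs, and the record fields (`user`, `amount`, `country`) and validation rules are hypothetical, chosen only to show how each hop improves data quality.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (assumed fields).
bronze = [
    {"user": "a", "amount": "10.5", "country": "US"},
    {"user": "b", "amount": "oops", "country": "US"},  # malformed amount
    {"user": "c", "amount": "4.0", "country": None},   # missing country
    {"user": "d", "amount": "7.5", "country": "DE"},
]


def to_silver(rows):
    """Filter and clean: drop rows that fail validation and cast amount to float."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # skip rows whose amount cannot be parsed
        if not row["country"]:
            continue  # skip rows without a country
        cleaned.append(
            {"user": row["user"], "amount": amount, "country": row["country"]}
        )
    return cleaned


def to_gold(rows):
    """Business-level aggregate: total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real deployment each hop would typically be a Spark job reading from one Delta table and writing to the next; the sketch only mirrors the shape of that flow.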
10. Common Delta Lake Use Cases
• Interactive Queries
• BI reporting and dashboards
• Train and Build Machine Learning Models
• Create Data Warehouse
• Create / Monetize Data Products
• Sell or Share curated data to partners, vendors and internal
customers
• Feed data back to source systems, web applications, Mobile
Apps
12. Serving Data From Delta Lake
Web app
Mobile app
ERP
Storage
Data product
Data enrichment
Data Integration
Data export
14. Serving Data From Delta Lake
Storage
S3 ADLS HDFS
Catalog
Consumers
Compute
Serving
API
Access
Management
Data Service
Metadata Service
15. Serving Data From Delta Lake
Data-as-a-Service (DaaS)
• REST APIs
• Read-Only
• Data Format
• Delivery mechanism
Challenges
• Security
• Latency
• Throughput
• SLA
• Data licensing, ownership
and monetization model
• Managing evolving
requirements
• Minimizing Information Silos
16. Use Cases for Demo App
• MVP features for the demo app
• End-to-end ETL pipeline writing into Delta Lake
• DaaS REST endpoint to export data
• Front-end app to consume data and build a dashboard
• UI to interact with Delta Lake
• Export classified and aggregated data out of Delta Lake to be
consumed by a client app
18. DaaS APIs
GET delta-meta-service/getDbDetails
GET delta-meta-service/previewTable?table=db.tablename
POST delta-sql-service/exportSqlData -d
{
"inputSql": "select * from db.table where condition",
"outputPath": "/path/",
"format": "json"
}
GET delta-sql-service/getRunStatus?run_id=id
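A client for these endpoints might look like the sketch below. Only the paths, query parameters, and the `exportSqlData` body come from the slide. Everything else is an assumption: the base URL, the idea that the export call returns a JSON object with a `run_id` field, and the `state` field with `SUCCESS`/`FAILED` values in the status response. The transport is injectable so the flow can be exercised without a running service.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9000"  # assumption: the slides give no host or port


def http_call(method, url, body=None):
    """Default transport: send an optional JSON body, parse a JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def preview_table_url(table, base_url=BASE_URL):
    """URL for GET delta-meta-service/previewTable?table=db.tablename."""
    query = urllib.parse.urlencode({"table": table})
    return f"{base_url}/delta-meta-service/previewTable?{query}"


def export_request(input_sql, output_path, fmt="json", base_url=BASE_URL):
    """URL and JSON body for POST delta-sql-service/exportSqlData."""
    url = f"{base_url}/delta-sql-service/exportSqlData"
    body = {"inputSql": input_sql, "outputPath": output_path, "format": fmt}
    return url, body


def run_status_url(run_id, base_url=BASE_URL):
    """URL for GET delta-sql-service/getRunStatus?run_id=id."""
    query = urllib.parse.urlencode({"run_id": run_id})
    return f"{base_url}/delta-sql-service/getRunStatus?{query}"


def export_and_wait(input_sql, output_path, fmt="json",
                    call=http_call, poll_seconds=2.0, max_polls=30):
    """Start an export, then poll its status until it reaches a terminal state."""
    url, body = export_request(input_sql, output_path, fmt)
    run = call("POST", url, body)
    run_id = run["run_id"]  # assumption: response field name is hypothetical
    for _ in range(max_polls):
        status = call("GET", run_status_url(run_id))
        # assumption: "state" and its terminal values are hypothetical
        if status.get("state") in ("SUCCESS", "FAILED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

Calling `export_and_wait("select * from db.table where condition", "/path/")` would issue the POST from the slide and then poll `getRunStatus`. Passing a stub as `call` lets the same flow run in a unit test.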