Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of its telemetry backbone is RDK (Reference Design Kit), pre-bundled open-source firmware for a complete home platform covering video, broadband, and IoT devices. The RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million video, broadband, and IoT devices installed in customer homes. The team runs ETL and aggregation pipelines and publishes analytical dashboards daily to reduce customer calls and guide firmware rollouts. The analysis also feeds the WiFi Happiness Index, a critical KPI for Comcast customer experience.
In addition, the RDK team does release tracking by analyzing RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics, with up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the "Test and Learn" with SQL Analytics and Delta Engine that we conducted in partnership with the Databricks team: a quick demo introducing the native SQL interface, the challenges we faced with migration, the results of the execution, and our journey of productionizing this at scale.
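The scale quoted above can be sanity-checked with back-of-the-envelope arithmetic. The device count and reporting interval come from the abstract; the per-report payload size is purely an assumption for illustration:

```python
# Back-of-the-envelope telemetry volume, using the figures quoted above.
DEVICES = 70_000_000          # devices reporting in
REPORT_INTERVAL_MIN = 15      # each device reports every 15 minutes

reports_per_device_per_day = 24 * 60 // REPORT_INTERVAL_MIN   # 96
reports_per_day = DEVICES * reports_per_device_per_day

# Assumed average payload size per report -- illustrative, not from the talk.
ASSUMED_KB_PER_REPORT = 2
daily_volume_tb = reports_per_day * ASSUMED_KB_PER_REPORT / 1024**3

print(reports_per_day)               # 6720000000
print(round(daily_volume_tb, 2))     # 12.52
```

Even under this conservative payload assumption, the pipelines ingest billions of records and on the order of tens of terabytes per day, which is why query cost and runtime dominate the discussion later in the deck.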
SQL Analytics Powering Telemetry Analysis at Comcast
1. SQL Analytics Powering Telemetry Analysis at Comcast
Suraj Nesamani, Principal Engineer, RDK Analytics @ Comcast
and
Molly Nagamuthu, Sr Resident Solutions Architect @ Databricks
2. Agenda
▪ Introduction to the Lakehouse Platform
▪ SQL Analytics - under the hood
(by Molly Nagamuthu)
▪ RDK challenges at Comcast
▪ SQL Analytics Test and Learn
(by Suraj Nesamani)
3. Molly Nagamuthu
Sr Resident Solutions Architect @ Databricks
20+ years in Product Development, Engineering & Professional Services
Telecom | Healthcare & Media | Finance
6. Today, most enterprises struggle with data
Siloed stacks increase data architecture complexity
[Diagram: four siloed stacks]
▪ Data Warehousing: structured data → extract/load/transform → data warehouse → data marts → analytics and BI
▪ Data Engineering: structured, semi-structured, and unstructured data → data lake → data prep
▪ Streaming: streaming data sources → streaming data engine → real-time database
▪ Data Science & Machine Learning: structured, semi-structured, and unstructured data → data lake → machine learning and data science
7. Today, most enterprises struggle with data (cont.)
Example tools in each siloed stack:
▪ Data Warehousing: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse
▪ Data Engineering: Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, Cloudera
▪ Streaming: Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Tibco Spotfire, Confluent
▪ Data Science & Machine Learning: Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, PyTorch
Disconnected systems and proprietary data formats make integration difficult
8. Today, most enterprises struggle with data (cont.)
Each siloed stack also has its own siloed team: data analysts and data engineers for warehousing and data engineering, data engineers for streaming, and data scientists for machine learning.
Siloed data teams decrease productivity
9. The Lakehouse Platform
[Diagram: structured, semi-structured, unstructured, and streaming data flowing into an open data lake, with data management & governance underpinning Data Engineering, BI & SQL Analytics, Real-time Data Applications, and Data Science & Machine Learning]
SIMPLE | OPEN | COLLABORATIVE
11. Databricks SQL Analytics
Delivering analytics on the freshest data with data warehouse performance and data lake economics
■ Query your lakehouse with better price/performance
■ Simplify discovery and sharing of new insights
■ Connect to familiar BI tools, like Tableau or Power BI
■ Simplify administration and governance
12. Broad integration with BI tools
Connect your preferred BI tools to your data lake with optimized connectors that provide fast performance, low latency, and high user concurrency.
More connectors coming soon.
13. Use Cases
▪ Collaborative exploratory data analysis on your data lake
▪ Data-enhanced applications
▪ Connect existing BI tools and use one source of truth for all your data
14. Under the hood: SQL Analytics
▪ BI/SQL Connectors: connect your existing BI tools to your data lake with optimized connectors and ODBC/JDBC drivers
▪ SQL Interface: based on Redash; query your entire data lake with SQL and visualize results
▪ SQL Endpoints: quickly set up SQL/BI-optimized compute with the best price/performance and track usage
▪ Vectorized Query Engine: next-generation query engine that provides real-life performance for all queries
▪ Delta Lake: build a curated cloud data lake on an open format over structured, semi-structured, and unstructured data - Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented), Gold (business-level aggregates)
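The Bronze/Silver/Gold curation flow above can be sketched in plain SQL. The snippet below uses SQLite purely for illustration; the table names and telemetry rows are hypothetical, and in SQL Analytics these would be Delta tables queried through an endpoint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Bronze: raw ingestion and history -- keep everything, including bad rows.
cur.execute("CREATE TABLE bronze_telemetry (device_id TEXT, metric TEXT, value REAL)")
cur.executemany(
    "INSERT INTO bronze_telemetry VALUES (?, ?, ?)",
    [("d1", "wifi_rssi", -52.0), ("d2", "wifi_rssi", -71.0),
     ("d1", "wifi_rssi", None),   # corrupt report, kept only in Bronze
     ("d2", "uptime_s", 86400.0)],
)

# Silver: filtered, cleaned, augmented.
cur.execute("""
    CREATE TABLE silver_telemetry AS
    SELECT device_id, metric, value
    FROM bronze_telemetry
    WHERE value IS NOT NULL
""")

# Gold: business-level aggregates that feed dashboards.
cur.execute("""
    CREATE TABLE gold_metric_avg AS
    SELECT metric, AVG(value) AS avg_value, COUNT(*) AS n
    FROM silver_telemetry
    GROUP BY metric
""")

print(cur.execute("SELECT * FROM gold_metric_avg ORDER BY metric").fetchall())
# [('uptime_s', 86400.0, 1), ('wifi_rssi', -61.5, 2)]
```

Each layer is a plain table derived from the one below it, so BI tools can point at Gold while engineers retain the raw history in Bronze for reprocessing.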
18. Related Talks
WEDNESDAY
• 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
• 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm,
Databricks
• 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume
THURSDAY
• 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
• 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks
FRIDAY
• 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
20. Suraj Nesamani
Principal Engineer, RDK @ Comcast
15+ years of experience in engineering, mostly specializing in RDK telemetry and big data analysis. Experienced in handling petabyte-scale IoT data.
21. RDK - Reference Design Kit
The RDK is a whole-home open-source software platform powering video, broadband, and IoT devices.
It enables operators to manage devices and easily customize their UIs and apps, and provides analytics to improve the customer experience and drive business results.
➢ https://rdkcentral.com/
25. RDK Analytics Challenges
● Queries that consume excessive CPU and time on the cluster
● Workload Management (WLM) aborts
● Expensive to add more nodes
● Query outputs are still needed to run the business
26. RDK-Databricks Partnership
● Migrated a complex Redshift pipeline using Spark 3.0 on Databricks
● Migrated some EMR workloads
● Optimizations and Databricks training
● Delta Lake Test and Learn
● Upgraded to the E2 version of the Databricks platform, which is more secure, scalable, and simpler to manage
28. SQL Analytics Test and Learn - Scope
Analyze Redshift workloads:
• The 10 slowest-performing queries on Redshift
• Average runtime for each of these queries is 30 minutes
• Currently run on 12 dc2.8xlarge nodes
Disclaimer: For the Test and Learn, we could not reproduce the Redshift production environment, so we only compared query runtimes.
29. Test and Learn Design
Timeline: 2-3 weeks
Migrate workloads to Databricks SQL Analytics:
• Metastore decisions
• Table Access Control Lists (ACLs)
• Convert Redshift queries to Spark SQL
• Test workloads as-is, in Parquet format
• Convert to Delta Lake and test against Photon
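Converting Redshift queries to Spark SQL is largely a matter of mechanical rewrites of dialect-specific constructs. The sketch below illustrates the kind of substitutions involved; the rule list and regexes are illustrative only (a real migration needs a proper SQL parser), not the tooling used in the Test and Learn:

```python
import re

# Illustrative Redshift -> Spark SQL rewrites. These regexes only cover
# simple, unquoted cases and are meant to show the flavor of the work.
RULES = [
    # Redshift GETDATE() -> Spark current_timestamp()
    (re.compile(r"\bGETDATE\(\)", re.IGNORECASE), "current_timestamp()"),
    # Redshift LISTAGG(col, 'sep') has no direct Spark 3.0 equivalent;
    # concat_ws over collect_list is the usual rewrite.
    (re.compile(r"\bLISTAGG\((\w+),\s*'([^']*)'\)", re.IGNORECASE),
     r"concat_ws('\2', collect_list(\1))"),
    # Redshift's col::int cast shorthand -> ANSI CAST syntax.
    (re.compile(r"(\w+)::int\b", re.IGNORECASE), r"CAST(\1 AS INT)"),
]

def to_spark_sql(query: str) -> str:
    """Apply each rewrite rule in order to a Redshift query string."""
    for pattern, replacement in RULES:
        query = pattern.sub(replacement, query)
    return query

q = "SELECT LISTAGG(model, ','), fw_version::int FROM devices WHERE ts < GETDATE()"
print(to_spark_sql(q))
# SELECT concat_ws(',', collect_list(model)), CAST(fw_version AS INT)
# FROM devices WHERE ts < current_timestamp()
```

Queries that survive such rewrites unchanged can then be tested as-is against Parquet, and again after converting the tables to Delta Lake.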
31. Observations
The native SQL interface was very intuitive and easy to use.
Creating endpoints is extremely simple, which helps SQL analysts to a great extent.
SQL Analytics does not support UDFs.
We did not test ACLs extensively, but they seemed simple enough. A centralized catalog would be a welcome addition.
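The lack of UDF support means custom logic must be re-expressed with built-in SQL functions where possible. As a hypothetical example (not from the talk), a "clamp score to 0-100" UDF maps directly onto LEAST/GREATEST; the pure-Python sketch below just checks that the two formulations agree:

```python
# Hypothetical UDF logic: clamp a raw score into the 0-100 range.
def clamp_udf(x: float) -> float:
    return max(0.0, min(100.0, x))

# Built-in equivalent, usable without UDF support, e.g. in Spark SQL:
#   SELECT LEAST(GREATEST(score, 0.0), 100.0) AS clamped FROM scores
def clamp_builtin(x: float) -> float:
    # LEAST(GREATEST(x, 0.0), 100.0) written out in Python
    return min(max(x, 0.0), 100.0)

for raw in (-5.0, 42.0, 250.0):
    assert clamp_udf(raw) == clamp_builtin(raw)
print([clamp_builtin(v) for v in (-5.0, 42.0, 250.0)])   # [0.0, 42.0, 100.0]
```

Logic that cannot be expressed with built-ins is what made the missing UDF support a notable limitation during the Test and Learn.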