Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of its telemetry backbone is RDK (Reference Design Kit), pre-bundled open-source firmware for a complete home platform covering video, broadband, and IoT devices. The RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million video, broadband, and IoT devices installed in customer homes. The team runs ETL and aggregation pipelines and publishes analytical dashboards daily to reduce customer calls and guide firmware rollouts. The analysis also feeds the WiFi Happiness Index, a critical KPI for Comcast customer experience.
In addition, the RDK team does release tracking by analyzing RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics, with up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the "Test and Learn" with SQL Analytics and Delta Engine that we conducted in partnership with the Databricks team: a quick demo introducing the native SQL interface, the challenges we faced with migration, the results of the execution, and our journey of productionizing this at scale.
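The scale quoted above can be sanity-checked with back-of-the-envelope arithmetic. The device count and reporting interval come from the abstract; the per-report payload size is purely an assumption for illustration:

```python
# Back-of-the-envelope telemetry volume, using the figures quoted above.
DEVICES = 70_000_000          # devices reporting in
REPORT_INTERVAL_MIN = 15      # each device reports every 15 minutes

reports_per_device_per_day = 24 * 60 // REPORT_INTERVAL_MIN   # 96
reports_per_day = DEVICES * reports_per_device_per_day

# Assumed average payload size per report -- illustrative, not from the talk.
ASSUMED_KB_PER_REPORT = 2
daily_volume_tb = reports_per_day * ASSUMED_KB_PER_REPORT / 1024**3

print(reports_per_day)               # 6720000000
print(round(daily_volume_tb, 2))     # 12.52
```

Even under this conservative payload assumption, the pipelines ingest billions of records and on the order of tens of terabytes per day, which is why query cost and runtime dominate the discussion later in the deck.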
SQL Analytics Powering Telemetry Analysis at Comcast
1. SQL Analytics Powering Telemetry Analysis at Comcast
Suraj Nesamani, Principal Engineer, RDK Analytics @ Comcast
and
Molly Nagamuthu, Sr Resident Solutions Architect @ Databricks
2. Agenda
▪ Introduction to the Lakehouse Platform
▪ SQL Analytics - under the hood
(by Molly Nagamuthu)
▪ RDK challenges at Comcast
▪ SQL Analytics Test and Learn
(by Suraj Nesamani)
3. Molly Nagamuthu
Sr Resident Solutions Architect @ Databricks
20+ years in Product Development, Engineering & Professional Services
Telecom | Healthcare & Media | Finance
6. Today, most enterprises struggle with data
Siloed stacks increase data architecture complexity
[Diagram: four siloed stacks]
▪ Data Warehousing: structured data → extract/load/transform → data warehouse → data marts → analytics and BI
▪ Data Engineering: structured, semi-structured, and unstructured data → data lake → data prep
▪ Streaming: streaming data sources → streaming data engine → real-time database
▪ Data Science & Machine Learning: structured, semi-structured, and unstructured data → data lake → machine learning and data science
7. Today, most enterprises struggle with data (cont.)
Example tools in each siloed stack:
▪ Data Warehousing: Amazon Redshift, Teradata, Azure Synapse, Google BigQuery, Snowflake, IBM Db2, SAP, Oracle Autonomous Data Warehouse
▪ Data Engineering: Hadoop, Apache Airflow, Amazon EMR, Apache Spark, Google Dataproc, Cloudera
▪ Streaming: Apache Kafka, Apache Spark, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Tibco Spotfire, Confluent
▪ Data Science & Machine Learning: Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, PyTorch
Disconnected systems and proprietary data formats make integration difficult
8. Today, most enterprises struggle with data (cont.)
Each siloed stack also has its own siloed team: data analysts and data engineers for warehousing and data engineering, data engineers for streaming, and data scientists for machine learning.
Siloed data teams decrease productivity
9. The Lakehouse Platform
[Diagram: structured, semi-structured, unstructured, and streaming data flowing into an open data lake, with data management & governance underpinning Data Engineering, BI & SQL Analytics, Real-time Data Applications, and Data Science & Machine Learning]
SIMPLE | OPEN | COLLABORATIVE
11. Databricks SQL Analytics
Delivering analytics on the freshest data with data warehouse performance and data lake economics
■ Query your lakehouse with better price/performance
■ Simplify discovery and sharing of new insights
■ Connect to familiar BI tools, like Tableau or Power BI
■ Simplify administration and governance
12. Broad integration with BI tools
Connect your preferred BI tools to your data lake with optimized connectors that provide fast performance, low latency, and high user concurrency.
More connectors coming soon.
13. Use Cases
▪ Collaborative exploratory data analysis on your data lake
▪ Data-enhanced applications
▪ Connect existing BI tools and use one source of truth for all your data
14. Under the hood: SQL Analytics
▪ BI/SQL Connectors: connect your existing BI tools to your data lake with optimized connectors and ODBC/JDBC drivers
▪ SQL Interface: based on Redash; query your entire data lake with SQL and visualize results
▪ SQL Endpoints: quickly set up SQL/BI-optimized compute with the best price/performance and track usage
▪ Vectorized Query Engine: next-generation query engine that provides real-life performance for all queries
▪ Delta Lake: build a curated cloud data lake on an open format over structured, semi-structured, and unstructured data - Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented), Gold (business-level aggregates)
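The Bronze/Silver/Gold curation flow above can be sketched in plain SQL. The snippet below uses SQLite purely for illustration; the table names and telemetry rows are hypothetical, and in SQL Analytics these would be Delta tables queried through an endpoint:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Bronze: raw ingestion and history -- keep everything, including bad rows.
cur.execute("CREATE TABLE bronze_telemetry (device_id TEXT, metric TEXT, value REAL)")
cur.executemany(
    "INSERT INTO bronze_telemetry VALUES (?, ?, ?)",
    [("d1", "wifi_rssi", -52.0), ("d2", "wifi_rssi", -71.0),
     ("d1", "wifi_rssi", None),   # corrupt report, kept only in Bronze
     ("d2", "uptime_s", 86400.0)],
)

# Silver: filtered, cleaned, augmented.
cur.execute("""
    CREATE TABLE silver_telemetry AS
    SELECT device_id, metric, value
    FROM bronze_telemetry
    WHERE value IS NOT NULL
""")

# Gold: business-level aggregates that feed dashboards.
cur.execute("""
    CREATE TABLE gold_metric_avg AS
    SELECT metric, AVG(value) AS avg_value, COUNT(*) AS n
    FROM silver_telemetry
    GROUP BY metric
""")

print(cur.execute("SELECT * FROM gold_metric_avg ORDER BY metric").fetchall())
# [('uptime_s', 86400.0, 1), ('wifi_rssi', -61.5, 2)]
```

Each layer is a plain table derived from the one below it, so BI tools can point at Gold while engineers retain the raw history in Bronze for reprocessing.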
18. Related Talks
WEDNESDAY
• 03:50 PM (PT): Databricks SQL Analytics Deep Dive for the Data Analyst - Doug Bateman, Databricks
• 04:25 PM (PT): Radical Speed for SQL Queries on Databricks: Photon Under the Hood - Greg Rahn & Alex Behm,
Databricks
• 04:25 PM (PT): Delivering Insights from 20M+ Smart Homes with 500M+ devices - Sameer Vaidya, Plume
THURSDAY
• 11:00 AM (PT): Getting Started with Databricks SQL Analytics - Simon Whiteley, Advancing Analytics
• 03:15 PM (PT): Building Lakehouses on Delta Lake and SQL Analytics - A Primer - Franco Patano, Databricks
FRIDAY
• 10:30 AM (PT): SQL Analytics Powering Telemetry Analysis at Comcast - Suraj Nesamani, Comcast
& Molly Nagamuthu, Databricks
20. Suraj Nesamani
Principal Engineer, RDK @ Comcast
15+ years of experience in engineering, mostly specializing in RDK telemetry and big data analysis. Experienced in handling petabyte-scale IoT data.
21. RDK - Reference Design Kit
The RDK is a whole-home open-source software platform powering video, broadband, and IoT devices.
It enables operators to manage devices and easily customize their UIs and apps, and provides analytics to improve the customer experience and drive business results.
➢ https://rdkcentral.com/
25. RDK Analytics Challenges
● Queries that consume excessive CPU and time on the cluster
● Workload Management (WLM) aborts
● Expensive to add more nodes
● Query outputs are still needed to run the business
26. RDK-Databricks Partnership
● Migrated a complex Redshift pipeline using Spark 3.0 on Databricks
● Migrated some EMR workloads
● Optimizations and Databricks training
● Delta Lake Test and Learn
● Upgraded to the E2 version of the Databricks platform, which is more secure, scalable, and simpler to manage
28. SQL Analytics Test and Learn - Scope
Analyze Redshift workloads:
• The 10 slowest-performing queries on Redshift
• Average runtime for each of these queries is 30 minutes
• Currently run on 12 dc2.8xlarge nodes
Disclaimer: For the Test and Learn, we could not reproduce the Redshift production environment, so we only compared query runtimes.
29. Test and Learn Design
Timeline: 2-3 weeks
Migrate workloads to Databricks SQL Analytics:
• Metastore decisions
• Table Access Control Lists (ACLs)
• Convert Redshift queries to Spark SQL
• Test workloads as-is, in Parquet format
• Convert to Delta Lake and test against Photon
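Converting Redshift queries to Spark SQL is largely a matter of mechanical rewrites of dialect-specific constructs. The sketch below illustrates the kind of substitutions involved; the rule list and regexes are illustrative only (a real migration needs a proper SQL parser), not the tooling used in the Test and Learn:

```python
import re

# Illustrative Redshift -> Spark SQL rewrites. These regexes only cover
# simple, unquoted cases and are meant to show the flavor of the work.
RULES = [
    # Redshift GETDATE() -> Spark current_timestamp()
    (re.compile(r"\bGETDATE\(\)", re.IGNORECASE), "current_timestamp()"),
    # Redshift LISTAGG(col, 'sep') has no direct Spark 3.0 equivalent;
    # concat_ws over collect_list is the usual rewrite.
    (re.compile(r"\bLISTAGG\((\w+),\s*'([^']*)'\)", re.IGNORECASE),
     r"concat_ws('\2', collect_list(\1))"),
    # Redshift's col::int cast shorthand -> ANSI CAST syntax.
    (re.compile(r"(\w+)::int\b", re.IGNORECASE), r"CAST(\1 AS INT)"),
]

def to_spark_sql(query: str) -> str:
    """Apply each rewrite rule in order to a Redshift query string."""
    for pattern, replacement in RULES:
        query = pattern.sub(replacement, query)
    return query

q = "SELECT LISTAGG(model, ','), fw_version::int FROM devices WHERE ts < GETDATE()"
print(to_spark_sql(q))
# SELECT concat_ws(',', collect_list(model)), CAST(fw_version AS INT)
# FROM devices WHERE ts < current_timestamp()
```

Queries that survive such rewrites unchanged can then be tested as-is against Parquet, and again after converting the tables to Delta Lake.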
31. Observations
The native SQL interface was very intuitive and easy to use.
Creating endpoints is extremely simple, which helps SQL analysts to a great extent.
SQL Analytics does not support UDFs.
We did not test ACLs extensively, but they seemed simple enough. A centralized catalog would be a welcome addition.
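The lack of UDF support means custom logic must be re-expressed with built-in SQL functions where possible. As a hypothetical example (not from the talk), a "clamp score to 0-100" UDF maps directly onto LEAST/GREATEST; the pure-Python sketch below just checks that the two formulations agree:

```python
# Hypothetical UDF logic: clamp a raw score into the 0-100 range.
def clamp_udf(x: float) -> float:
    return max(0.0, min(100.0, x))

# Built-in equivalent, usable without UDF support, e.g. in Spark SQL:
#   SELECT LEAST(GREATEST(score, 0.0), 100.0) AS clamped FROM scores
def clamp_builtin(x: float) -> float:
    # LEAST(GREATEST(x, 0.0), 100.0) written out in Python
    return min(max(x, 0.0), 100.0)

for raw in (-5.0, 42.0, 250.0):
    assert clamp_udf(raw) == clamp_builtin(raw)
print([clamp_builtin(v) for v in (-5.0, 42.0, 250.0)])   # [0.0, 42.0, 100.0]
```

Logic that cannot be expressed with built-ins is what made the missing UDF support a notable limitation during the Test and Learn.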