Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas
Agenda
Mate Gulyas
Practice Lead and Principal Instructor
Databricks
Shasidhar Eranti
Resident Solutions Engineer
Databricks
Introduction
▪ SEGA is a worldwide leader in interactive entertainment
▪ Huge franchises including Sonic, Total War and Football Manager
▪ SEGA is currently celebrating its long-awaited 60th anniversary
▪ SEGA also produces arcade machines, holiday resorts, films and merchandise

▪ Real-time data from SEGA titles is crucial for business users
▪ SEGA's 6 studios send data to one centralised data platform
▪ New events are frequently added and event schemas evolve over time
▪ Over 300 event types from over 40 SEGA titles (constantly growing)
▪ Events arrive at a rate of 8,000 every second
What is the GOAL, and what is the CHALLENGE?
▪ A real-time data lake
▪ No upfront information about the schemas or the upcoming schema changes
▪ No downtime
Architecture
Key Requirements
▪ Ingest different types of JSON at scale
▪ Handle schema evolution dynamically
▪ Serve unstructured data in a structured form for business users
Architecture
● Delta Architecture (Bronze and Silver layers)
● Ingestion Stream (Bronze), using foreachBatch()
  ○ Dump JSON into a Delta table
  ○ Track schema changes
● Stream multiplexing using Delta
● Event Streams (Silver)
  ○ Read from the Bronze table
  ○ Fetch the event schema
  ○ Apply the schema using from_json()
  ○ Write to the Silver table
Sample Data
Bronze Table
Silver Tables: Event Type 1.1, Event Type 2.1
Schema Inference
{
  "event_type": "1.1",
  "user_agent": "chrome"
}

Schema Changes
{
  "event_type": "1.1",
  "user_agent": "firefox",
  "has_plugins": "true"
}
{
  "event_type": "1.1",
  "user_agent": "chrome"
}
Schema Variation Hash
1. Raw message
{
  "event_type": "1.1",
  "user_agent": "chrome"
}
2. Sorted list of ALL columns (including nested)
["event_type", "user_agent"]
3. Calculate SHA1 hash
7862AF20813560D9AAEAF38D7E
Schema Repository

A new message arrives:
1. Raw message
{
  "event_type": "1.1",
  "user_agent": "chrome",
  "has_plugins": "true"
}
2. Sorted list of ALL columns (including nested)
["event_type", "user_agent", "has_plugins"]
3. Calculate SHA1 hash
BEA2ACAF2081350D9AAEAF38D7E

Not in the Schema Repository — we need to update the schema for 1.1.
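The three steps above can be sketched in plain Python. This is an illustrative sketch, not the production implementation from the talk: the helper names (`collect_columns`, `schema_variation_hash`) are invented here, and the hashes shown on the slides are truncated for display, whereas a real SHA1 digest is 40 hex characters.

```python
import hashlib
import json

def collect_columns(obj, prefix=""):
    """Recursively collect every column path in a JSON object, including nested fields."""
    cols = []
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        cols.append(path)
        if isinstance(value, dict):
            cols.extend(collect_columns(value, path))
    return cols

def schema_variation_hash(raw_message: str) -> str:
    """SHA1 over the sorted list of all column names; values are ignored."""
    columns = sorted(collect_columns(json.loads(raw_message)))
    return hashlib.sha1(json.dumps(columns).encode("utf-8")).hexdigest().upper()

msg_v1 = '{"event_type": "1.1", "user_agent": "chrome"}'
msg_v1b = '{"event_type": "1.1", "user_agent": "firefox"}'  # same columns, different values
msg_v2 = '{"event_type": "1.1", "user_agent": "chrome", "has_plugins": "true"}'

print(schema_variation_hash(msg_v1) == schema_variation_hash(msg_v1b))  # True: values do not matter
print(schema_variation_hash(msg_v1) == schema_variation_hash(msg_v2))   # False: a new column changes the hash
```

Because only column names feed the hash, millions of messages with the same shape map to one repository entry, and any structural change — even a new nested field — produces a new hash.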
Foreach Batch

Update the Schema
▪ The new, so far UNSEEN message
▪ All of the old prototypes from the Schema Repository (we have only 1 now, but there could be more)

from typing import List
from pyspark.sql import Row
from pyspark.sql.types import DataType

def inferSchema(protoPayloads: List[str]) -> DataType:
    schemaProtoDF = spark.createDataFrame([Row(json=x) for x in protoPayloads])
    return (spark
            .read
            .option("inferSchema", True)
            .json(schemaProtoDF.rdd.map(lambda r: r.json))
            .schema)

Schema Repository
We now have a new schema that incorporates all the previous prototypes from all known schema variations.
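The lookup-then-update flow around inferSchema() can be sketched with a toy in-memory repository. This is a hypothetical sketch: the talk does not specify how the Schema Repository is stored (in practice it would live in a table, not in memory), and the class and method names below are invented for illustration.

```python
class SchemaRepository:
    """Toy in-memory stand-in for the Schema Repository described above."""

    def __init__(self):
        # event_type -> {schema_variation_hash: prototype payload}
        self._prototypes = {}

    def is_known(self, event_type: str, variation_hash: str) -> bool:
        return variation_hash in self._prototypes.get(event_type, {})

    def register(self, event_type: str, variation_hash: str, payload: str) -> None:
        """Store an unseen prototype; this is the point where the schema for
        the event type would be re-inferred over ALL stored prototypes."""
        self._prototypes.setdefault(event_type, {})[variation_hash] = payload

    def prototypes(self, event_type: str) -> list:
        """All prototypes for an event type -- the input to inferSchema()."""
        return list(self._prototypes.get(event_type, {}).values())

repo = SchemaRepository()
repo.register("1.1", "7862AF20...", '{"event_type": "1.1", "user_agent": "chrome"}')

# A message whose variation hash is not in the repository triggers a schema update:
if not repo.is_known("1.1", "BEA2ACAF..."):
    repo.register("1.1", "BEA2ACAF...",
                  '{"event_type": "1.1", "user_agent": "chrome", "has_plugins": "true"}')

print(len(repo.prototypes("1.1")))  # 2
```

Inferring over all stored prototypes, rather than just the newest message, is what lets the resulting schema cover every variation seen so far.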
Silver tables

Foreach Batch
Retrieve the schema:

def assert_and_process(event_type: String, target: String)(df: DataFrame,
    batchId: Long): Unit = {
  val (schema, schemaVersion) = get_schema(schema_repository, event_type)
  df
    .transform(process_raw_events(schema, schemaVersion))
    .write.format("delta").mode("append").partitionBy(partitionColumns: _*)
    .option("mergeSchema", true)
    .save(target)
}
Productionization
(Deployment and Monitoring)
Deploying Event Streams
● Events are grouped logically
● Stream groups are deployed on job clusters
● Two main aspects:
  ○ Schema change
  ○ New schema detected

Schema change
● Incompatible schema changes cause stream failures
● Stream monitoring in job clusters

New schema detected
Management Stream / EventGroup table
● Tracks schema changes from the schemaRegistry table
● Two types of source changes:
  ○ Change in schema
  ○ New schema detected
● Change in schema: no action
● New schema detected:
  ○ Add a new entry in the event group table
  ○ A new stream is launched automatically
Monitoring
● Use Structured Streaming listener APIs to track metrics
● Dump streaming metrics to a central dashboarding tool
● Key metrics tracked in the monitoring dashboard:
  ○ Stream status
  ○ Streaming latency
● Enable stream metrics capture for Ganglia using
  spark.sql.streaming.metricsEnabled=true
Key takeaways
▪ Architecture: Delta helps with its schema evolution and stream multiplexing capabilities
▪ Implementation: a Schema Variation Hash to detect schema changes
▪ Productionizing: job clusters to run streams in production
Felix Baker, SEGA:
“This has revolutionised the flow of analytics from our games and has enabled business users to analyse and react to data far more quickly than we have been able to do previously.”
Feedback
Your feedback is important to us. Don’t forget to rate and review the sessions.
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Dernier

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...kumargunjan9515
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 

Dernier (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 

Designing and Implementing a Real-time Data Lake with Dynamically Changing Schema

  • 1. Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas
  • 2. Agenda: Mate Gulyas (Practice Lead and Principal Instructor, Databricks), Shasidhar Eranti (Resident Solutions Engineer, Databricks)
  • 7. ▪ SEGA is a worldwide leader in interactive entertainment ▪ Huge franchises including Sonic, Total War and Football Manager ▪ SEGA is currently celebrating its long-awaited 60th anniversary ▪ SEGA also produces arcade machines, holiday resorts, films and merchandise
  • 12. ▪ Real-time data from SEGA titles is crucial for business users ▪ SEGA’s 6 studios send data to one centralised data platform ▪ New events are frequently added and event schemas evolve over time ▪ Over 300 event types from over 40 SEGA titles (constantly growing) ▪ Events arrive at a rate of 8,000 every second
  • 13. What are the GOAL and the CHALLENGE we are trying to achieve? A real-time data lake. No upfront information about the schemas or the upcoming schema changes. No downtime.
  • 15. Key Requirements: Ingest different types of JSON at scale. Handle schema evolution dynamically. Serve unstructured data in a structured form for business users.
  • 20. Architecture ● Delta Architecture (Bronze - Silver layers) ● Ingestion Stream (Bronze) using foreachBatch() ○ Dump JSON into Delta table ○ Track schema changes ● Stream multiplexing using Delta ● Event Streams (Silver) ○ Read from Bronze table ○ Fetch event schema ○ Apply schema using from_json() ○ Write to Silver table
  • 22. Sample Data: Bronze Table → Silver Tables (Event Type 1.1, Event Type 2.1)
  • 24. Schema Changes: { "event_type": "1.1", "user_agent": "chrome" } → { "event_type": "1.1", "user_agent": "firefox", "has_plugins": "true" }
  • 27. Schema Variation Hash 1. Raw message: { "event_type": "1.1", "user_agent": "chrome" } 2. Sorted list of ALL columns (including nested): ["event_type", "user_agent"] 3. Calculate SHA1 hash: 7862AF20813560D9AAEAF38D7E
  • 34. Schema Variation Hash 1. Raw message: { "event_type": "1.1", "user_agent": "chrome", "has_plugins": "true" } 2. Sorted list of ALL columns (including nested): ["event_type", "has_plugins", "user_agent"] 3. Calculate SHA1 hash: BEA2ACAF2081350D9AAEAF38D7E → Not in Schema Repository → We need to update the schema for 1.1
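The three hashing steps above can be sketched in plain Python. This is an illustrative sketch, not SEGA's production code; the helper names `column_paths` and `schema_variation_hash` are hypothetical:

```python
import hashlib
import json

def column_paths(obj, prefix=""):
    """Recursively collect dotted column paths, including nested fields."""
    paths = []
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.extend(column_paths(value, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths

def schema_variation_hash(raw_message: str) -> str:
    """Sorted list of ALL columns -> SHA1 hash identifying this schema variation."""
    columns = sorted(column_paths(json.loads(raw_message)))
    return hashlib.sha1(",".join(columns).encode("utf-8")).hexdigest()
```

Because only column names (not values) feed the hash, two messages with the same fields hash identically, while adding a field such as "has_plugins" produces a new hash that will not be found in the schema repository.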
  • 37. Update the Schema The new, so far UNSEEN message
  • 38. Update the Schema All of the old prototypes from the Schema Repository (we have only 1 now, but there could be more)
  • 45. Update the Schema
from typing import List
from pyspark.sql import Row

def inferSchema(protoPayloads: List[str]) -> "DataType":
    schemaProtoDF = spark.createDataFrame(list(map(lambda x: Row(json=x), protoPayloads)))
    return (spark
        .read
        .option("inferSchema", True)
        .json(schemaProtoDF.rdd.map(lambda r: r.json))
        .schema)
  • 48. We now have a new schema that incorporates all the previous prototypes from all known schema variations
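Stripped of Spark, the "infer over all prototypes" idea reduces to taking the union of column paths across every known prototype payload, so the result is a superset of each individual variation. A minimal sketch under that assumption (the helpers `flatten_columns` and `merged_schema` are hypothetical, not the production code):

```python
import json

def flatten_columns(obj, prefix=""):
    """Collect dotted column paths, recursing into nested objects."""
    cols = set()
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            cols |= flatten_columns(value, prefix=f"{path}.")
        else:
            cols.add(path)
    return cols

def merged_schema(proto_payloads):
    """Union of columns across all prototype JSON payloads (superset schema)."""
    cols = set()
    for payload in proto_payloads:
        cols |= flatten_columns(json.loads(payload))
    return sorted(cols)
```

In the real pipeline Spark's JSON schema inference also reconciles the field types; this sketch only shows why feeding every prototype in at once yields a schema that covers all known variations.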
  • 54. Retrieve the schema
def assert_and_process(event_type: String, target: String)(df: DataFrame, batchId: Long): Unit = {
  val (schema, schemaVersion) = get_schema(schema_repository, event_type)
  df
    .transform(process_raw_events(schema, schemaVersion))
    .write.format("delta").mode("append").partitionBy(partitionColumns: _*)
    .option("mergeSchema", true)
    .save(target)
}
  • 62. Deploying Event Streams ● Events are grouped logically ● Stream groups are deployed on job clusters ● Two main aspects ○ Schema change ○ New schema detected. Schema change: ● Incompatible schema changes cause stream failures ● Stream monitoring in job clusters. New schema detected:
  • 67. Management Stream EventGroup table ● Tracks schema changes from the schemaRegistry table ● Two types of source changes ○ Change in schema ○ New schema detected ● Change in schema (no action) ● New schema detected ○ Add new entry in event group table ○ New stream is launched automatically
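The management stream's decision rule above can be sketched as a pure-Python classifier over two schema-registry snapshots. This is a simplification for illustration (the function name and dict keys are hypothetical): a change to an already-known event type needs no action, while an unseen event type triggers launching a new stream.

```python
def classify_registry_changes(known_event_types, incoming_event_types):
    """Split incoming event types into known ones (schema change, no action)
    and brand-new ones (a new stream must be launched)."""
    known = set(known_event_types)
    incoming = set(incoming_event_types)
    return {
        "no_action": sorted(incoming & known),       # change in schema
        "launch_stream": sorted(incoming - known),   # new schema detected
    }
```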
  • 72. Monitoring ● Use Structured Streaming listener APIs to track metrics ● Dump streaming metrics to a central dashboarding tool ● Key metrics tracked in the monitoring dashboard ○ Stream status ○ Streaming latency ● Enable stream metrics capture for Ganglia using spark.sql.streaming.metricsEnabled=true
  • 73. Key takeaways ● Architecture: Delta helps with schema evolution and stream multiplexing capabilities ● Implementation: Schema variation hash to detect schema changes ● Productionizing: Job clusters to run streams in production
  • 74. Felix Baker, SEGA: "This has revolutionised the flow of analytics from our games and has enabled business users to analyse and react to data far more quickly than we have been able to do previously."
  • 75. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.