Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
7. ▪ SEGA is a worldwide leader in interactive entertainment
▪ Huge franchises including Sonic, Total War and Football Manager
▪ SEGA is currently celebrating its long-awaited 60th anniversary.
▪ SEGA also produces arcade machines, holiday resorts, films and merchandise
12. ▪ Real time data from SEGA titles is crucial for business users.
▪ SEGA’s 6 studios send data to one centralised data platform.
▪ New events are frequently added and event schemas evolve over time.
▪ Over 300 event types from over 40 SEGA titles (constantly growing)
▪ Events arrive at a rate of 8,000 every second
13. What is the GOAL, and what CHALLENGE are we trying to overcome?
▪ Real time data lake
▪ No upfront information about the schemas or the upcoming schema changes
▪ No downtime
15. Key Requirements
▪ Ingest different types of JSON at scale
▪ Handle schema evolution dynamically
▪ Serve unstructured data in a structured form for business users
34. Schema Variation Hash
1. Raw message
{
"event_type": "1.1",
"user_agent": "chrome",
"has_plugins": "true"
}
2. Sorted list of ALL columns (including nested)
["event_type", "user_agent", "has_plugins"]
3. Calculate SHA1 Hash
BEA2ACAF2081350D9AAEAF38D7E
Not in Schema Repository
We need to update the schema for 1.1
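The three steps on this slide (raw message, sorted list of all columns including nested ones, SHA1 hash) can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function names, the dotted naming for nested columns, and the in-memory set standing in for the schema repository are all assumptions.

```python
import hashlib
import json


def column_paths(value, prefix=""):
    """Collect every column path in a parsed JSON value (dotted names for nested fields)."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            nested = column_paths(child, path)
            # A leaf contributes its own path; a container contributes its children.
            paths.extend(nested if nested else [path])
    elif isinstance(value, list):
        for item in value:
            paths.extend(column_paths(item, prefix))
    return paths


def schema_variation_hash(raw_message: str) -> str:
    """SHA1 over the sorted, de-duplicated column list; values are ignored."""
    columns = sorted(set(column_paths(json.loads(raw_message))))
    return hashlib.sha1(json.dumps(columns).encode("utf-8")).hexdigest().upper()


def is_new_variation(raw_message: str, known_hashes: set) -> bool:
    """True when the hash is absent from the (hypothetical) schema repository."""
    return schema_variation_hash(raw_message) not in known_hashes


# Hypothetical in-memory stand-in for the schema repository.
known_hashes = set()
message = '{"event_type": "1.1", "user_agent": "chrome", "has_plugins": "true"}'

if is_new_variation(message, known_hashes):
    # Not in the repository: this is where the schema for 1.1 would be updated.
    known_hashes.add(schema_variation_hash(message))
```

Because only the column names are hashed, two messages with the same fields in a different order, or with different values, map to the same variation, so the repository lookup only fires when the set of columns actually changes.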
62. Deploying Event Streams
● Events are grouped logically
● Stream groups are deployed on job clusters
● Two main aspects
○ Schema change
○ New Schema detected
Schema change
● Incompatible schema changes cause stream failures
● Stream monitoring in job clusters
New Schema detected
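The slide does not spell out which schema changes count as incompatible. As a rough sketch, assume that adding columns is compatible (it can be merged into the target table) and that changing the type of an existing column is incompatible. The function name and the plain dictionaries of column name to type name are hypothetical.

```python
def classify_schema_change(current: dict, incoming: dict) -> str:
    """Compare two schemas given as {column name: type name} mappings.

    Assumed rules (not taken from the talk):
      - an existing column whose type differs -> "incompatible"
      - only new columns added                -> "compatible"
      - otherwise                             -> "unchanged"
    Columns missing from `incoming` are ignored, on the assumption that
    they would simply be written as nulls.
    """
    shared = current.keys() & incoming.keys()
    if any(current[col] != incoming[col] for col in shared):
        return "incompatible"
    if incoming.keys() - current.keys():
        return "compatible"
    return "unchanged"


current = {"event_type": "string", "user_agent": "string"}

# A new column appears: the stream can keep running once the schema is merged.
added = classify_schema_change(current, {**current, "has_plugins": "string"})

# An existing column changes type: this is the case expected to fail the stream.
retyped = classify_schema_change(current, {"event_type": "double", "user_agent": "string"})
```

A check like this could run before a stream is restarted, so that incompatible changes are flagged by monitoring rather than discovered through a failed job.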
67. Management Stream EventGroup table
● Tracks schema changes from schemaRegistry table
● Two types of source changes
○ Change in schema
○ New schema detected
● Change in schema (No action)
● New schema detected
○ Add new entry in event group table
○ New stream is launched automatically
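The decision logic of the management stream can be illustrated with plain Python. This is a sketch under stated assumptions: `RegistryChange`, `apply_registry_changes`, the `group_for` lookup and the `launch_stream` callback are hypothetical names, and a dictionary stands in for the event group table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class RegistryChange:
    """One change row read from the schemaRegistry table (hypothetical shape)."""
    event_type: str
    schema_hash: str


def apply_registry_changes(
    changes: Iterable[RegistryChange],
    event_groups: Dict[str, str],
    launch_stream: Callable[[str, str], None],
    group_for: Callable[[str], str],
) -> List[str]:
    """Process registry changes and return the event types that got a new stream."""
    launched = []
    for change in changes:
        if change.event_type in event_groups:
            # Change in schema for a known event type: no action.
            continue
        # New schema detected: add an entry to the event group table ...
        group = group_for(change.event_type)
        event_groups[change.event_type] = group
        # ... and launch the new stream automatically.
        launch_stream(change.event_type, group)
        launched.append(change.event_type)
    return launched


# Usage with stand-ins for the table and the launcher.
event_groups = {"1.1": "web"}
started = []
new_types = apply_registry_changes(
    [RegistryChange("1.1", "AAA"), RegistryChange("2.0", "BBB")],
    event_groups,
    launch_stream=lambda event_type, group: started.append((event_type, group)),
    group_for=lambda event_type: "default",
)
```

Passing the launcher in as a callback keeps the decision logic separate from however streams are actually started on the job clusters.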
72. Monitoring
● Use Structured Streaming listener APIs to track metrics
● Dump Streaming metrics to central dashboarding tool
● Key metrics tracked in monitoring dashboard
○ Stream Status
○ Streaming latency
● Enable Stream metrics capture for Ganglia using spark.sql.streaming.metricsEnabled=true
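As an illustration of the metrics a dashboard might receive, the sketch below reduces one Structured Streaming progress report to a flat record. It assumes the progress is available as a dictionary in the shape Spark reports (for example from `query.lastProgress` in PySpark). The function names are hypothetical, and treating `durationMs.triggerExecution` as the streaming latency is an assumption rather than the presenters' definition.

```python
def summarize_progress(progress: dict) -> dict:
    """Flatten one streaming progress report into dashboard-friendly metrics."""
    durations = progress.get("durationMs", {})
    return {
        "stream": progress.get("name"),
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "input_rows_per_second": progress.get("inputRowsPerSecond", 0.0),
        "processed_rows_per_second": progress.get("processedRowsPerSecond", 0.0),
        # Assumed latency proxy: total time spent executing the trigger, in ms.
        "trigger_latency_ms": durations.get("triggerExecution"),
    }


def is_falling_behind(summary: dict) -> bool:
    """Rough status check: rows arrive faster than they are processed."""
    return summary["input_rows_per_second"] > summary["processed_rows_per_second"]


# Sample progress report with made-up numbers, for illustration only.
sample = {
    "name": "events_web",
    "batchId": 42,
    "numInputRows": 16000,
    "inputRowsPerSecond": 8000.0,
    "processedRowsPerSecond": 9500.0,
    "durationMs": {"triggerExecution": 1684},
}
summary = summarize_progress(sample)
```

A streaming query listener could call `summarize_progress` on each progress update and forward the result to the dashboarding tool.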
73. Key takeaways
Architecture: Delta helps with Schema Evolution and Stream Multiplexing capabilities
Implementation: Schema Variation hash to detect schema changes
Productionizing: Job clusters to run streams in production
74. “This has revolutionised the flow of analytics from our games and has enabled business users to analyse and react to data far more quickly than we have been able to do previously.”
Felix Baker, SEGA