Building a Flexible, Real-time
Big Data Applications Platform
on Cassandra with Kiji
Cassandra Day Silicon Valley
07 April 2014
Clint Kelly
Member of Technical Staff
WibiData
1
Overview
• The Kiji Project
• The Kiji data model and KijiSchema
• Mapping Kiji to Cassandra
• Status and future work
• Try it now!
2
Should there be any intro
page that talks about
WibiData anywhere?
The Kiji Project
3
4
Have this...
Want to build this...
Open source components
• Batch processing
– Extract, transform, load
– Train machine learning models
• Scalable storage
– Time-series data
• Serialization
– Complex data types
7
Hadoop, C*, HBase, Avro
KijiSchema
KijiMR KijiREST
KijiHive KijiScoring
KijiExpress
KijiSchema
• Schemas and data serialization
• Complex, atomic data types
8
record UserLog {
  long timestamp;
  int user_id;
  string url;
  long session_id;
}
• Schema evolution
• Table metadata
Kiji batch components
• Scala DSL ➔ describe
MapReduce computations
• Machine learning library
• Hive adapter
9
Kiji real-time components
• REST server
• Scoring server
10
Kiji Summary
• Bridge between open-source technologies
and real-time, big data applications
• Users are building real systems with Kiji now!
– Personalized recommendation systems for retail
– Energy usage and analytics reporting
11
The Kiji data model and
KijiSchema
12
row
13
Tables are composed of rows.
entity ID data
14
We call row keys “entity IDs.”
0xfa “bob” | data
15
We support composite entity IDs (with
hashed and unhashed components).
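The idea of a composite entity ID can be sketched as follows. This is a minimal illustration, not Kiji's actual key encoding: the `EntityId` type, the `make_entity_id` helper, the use of MD5, and the 2-byte prefix length are all assumptions made for the example.

```python
import hashlib
from typing import NamedTuple, Tuple


class EntityId(NamedTuple):
    """Hypothetical composite row key: a hashed prefix plus readable components."""
    hashed: bytes              # spreads rows evenly across the cluster
    unhashed: Tuple[str, ...]  # stored as-is, so they stay readable and ordered


def make_entity_id(hashed_component: str, *unhashed: str, prefix_len: int = 2) -> EntityId:
    """Build a composite entity ID (illustrative only, not Kiji's real format)."""
    digest = hashlib.md5(hashed_component.encode("utf-8")).digest()
    # Keep only a short prefix of the hash; the original value is kept unhashed.
    return EntityId(hashed=digest[:prefix_len], unhashed=(hashed_component, *unhashed))


bob = make_entity_id("bob")
print(bob.hashed.hex(), bob.unhashed)
```

The hashed prefix plays the role of the `0xfa` component in the diagrams, and the unhashed component is the readable name `“bob”`.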
0xfa “bob” | info | songs
16
Data in rows is organized into “column
families.”
0xfa “bob” | info:email | info:payment | songs:let it be | songs:help | songs:helter skelter
17
Column families contain columns,
named as “family:qualifier.”
0xfa “bob” | info:email | info:payment | songs:let it be (several stacked versions, one at timestamp 1396560123) | songs:help | songs:helter skelter
18
Individual columns can have many
different timestamped versions.
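A toy in-memory model can make the time dimension concrete. The `Row` class and its `put`/`get` methods are hypothetical names for this sketch and are not the KijiSchema API. Only the timestamp 1396560123 comes from the slides; the other timestamps and the values are made up.

```python
from collections import defaultdict


class Row:
    """Toy row: (family, qualifier) -> {timestamp: value}."""

    def __init__(self):
        self._cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        # Writing at a new timestamp adds a version instead of overwriting.
        self._cells[(family, qualifier)][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


row = Row()
row.put("songs", "let it be", 1396550000, "play-1")  # made-up timestamp
row.put("songs", "let it be", 1396560000, "play-2")  # made-up timestamp
row.put("songs", "let it be", 1396560123, "play-3")
latest = row.get("songs", "let it be")
history = row.get("songs", "let it be", max_versions=10)
```

Asking for one version returns the most recent play, and a larger `max_versions` returns the history in reverse chronological order.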
19
Data values can be complex records
record SongPlay {
  long song_id;
  int user_rating;
  long session_id;
  device_type device;
}
20
Locality groups
Separate logical organization of data
(column families) from physical
attributes (caching, compression, etc.)
entity ID | info | songs_today | songs_prev_year
21
Need this data ASAP
for real-time scoring.
Use this data only for
batch jobs.
entity ID | info | songs_today | songs_prev_year
“real_time” (in-memory, uncompressed, TTL = 1 day)
“batch” (compressed, TTL = 12 months)
22
Locality groups
Always refer to columns by logical name
(“family:qualifier”).
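The separation of logical names from physical attributes can be sketched like this. The `LocalityGroup` type and `locate` helper are hypothetical, the assignment of families to groups is an assumption read off the family names, and 12 months is approximated as 365 days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalityGroup:
    """Physical attributes shared by every family stored in the group."""
    name: str
    in_memory: bool
    compressed: bool
    ttl_seconds: int


DAY = 24 * 60 * 60
REAL_TIME = LocalityGroup("real_time", in_memory=True, compressed=False, ttl_seconds=DAY)
BATCH = LocalityGroup("batch", in_memory=False, compressed=True, ttl_seconds=365 * DAY)

# Assumed assignment of column families to locality groups.
FAMILY_TO_GROUP = {
    "info": REAL_TIME,
    "songs_today": REAL_TIME,
    "songs_prev_year": BATCH,
}


def locate(column: str) -> LocalityGroup:
    """Resolve a logical "family:qualifier" name to its physical group."""
    family, _, _qualifier = column.partition(":")
    return FAMILY_TO_GROUP[family]


print(locate("info:email").name)            # real_time
print(locate("songs_prev_year:help").name)  # batch
```

Callers only ever see the logical column name, so a family could be moved to another group without changing client code.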
KijiSchema summary
• Data model similar to Cassandra, HBase,
BigTable
• Contains time dimension (not present in C*)
• Logical and physical organization separate
• Complex schemas with Avro
23
Mapping Kiji to Cassandra
24
Implementation notes
25
• Built for Cassandra 2.0.6+
• Native protocol / Java driver (no Thrift)
• Asynchronous API
• Assume users have Hadoop, ZooKeeper
Mapping a Kiji table ➔ Cassandra
• Locality group ➔ Table
• Entity ID ➔ Primary key
– Hashed components ➔ partition key
– Unhashed components ➔ clustering columns
• Family, qualifier, timestamp ➔ clustering columns
• Cell values ➔ blobs
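The mapping rules above can be sketched as a small generator that turns an entity ID description into CQL. The function name `locality_group_ddl` and its argument shape are assumptions for illustration. It is not how Kiji builds its tables internally.

```python
def locality_group_ddl(table, hashed, unhashed):
    """Build CREATE TABLE text for one locality group.

    hashed / unhashed: lists of (column_name, cql_type) for the entity ID
    components. Hypothetical helper, for illustration only.
    """
    columns = list(hashed) + list(unhashed) + [
        ("family", "text"),
        ("qualifier", "text"),
        ("timestamp", "bigint"),
        ("value", "blob"),  # cell values are stored as opaque blobs
    ]
    partition = ", ".join(name for name, _ in hashed)
    clustering = [name for name, _ in unhashed] + ["family", "qualifier", "timestamp"]
    col_lines = ",\n".join(f"  {name} {ctype}" for name, ctype in columns)
    # Newest versions first: timestamp is the only descending column.
    order = ", ".join(
        f"{c} {'DESC' if c == 'timestamp' else 'ASC'}" for c in clustering
    )
    return (
        f"CREATE TABLE {table} (\n{col_lines},\n"
        f"  PRIMARY KEY (({partition}), {', '.join(clustering)})\n"
        f") WITH CLUSTERING ORDER BY ({order});"
    )


ddl = locality_group_ddl(
    "users_locality_group_fast",
    hashed=[("userid", "bigint")],
    unhashed=[("username", "text")],
)
print(ddl)
```

The double parentheses in `PRIMARY KEY ((...), ...)` mark the partition key explicitly, which also covers entity IDs with more than one hashed component.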
26
CQL for Kiji locality group
CREATE TABLE users_locality_group_fast (
  userid bigint,
  username text,
  family text,
  qualifier text,
  timestamp bigint,
  value blob,
  PRIMARY KEY (userid, username, family, qualifier, timestamp)
) WITH CLUSTERING ORDER BY (
  username ASC, family ASC, qualifier ASC, timestamp DESC);
27
TODO: Show row diagram,
arrows pointing to components?
28
cqlsh:kiji_music> SELECT * FROM kiji_table_users;
userid | username | family | qualifier | timestamp | value
--------+----------+--------+----------------+-----------+---------------
0xfa | bob | info | email | 139653249 | 1243970104327
0xfa | bob | songs | abbey road | 139656012 | 0981274331032
0xfa | bob | songs | help | 139625013 | 9074132704129
0xfa | bob | songs | help | 139621359 | 1923079210370
0xfa | bob | songs | help | 139625013 | 4745018223497
0xfa | bob | songs | helter skelter | 139621324 | 7710423974234
Physical organization of data on disk
29
0xfa:bob:info:email:t0:bob@gmail.com
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:help:t2:…
0xfa:bob:songs:helter skelter:t1:…
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:…
0xfa:bob:songs:let it be:t2:…
Efficient queries =
continuous scans!
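Why a family-level query is a single scan can be seen by sorting cells the way the clustering order does. This is a rough sketch: the tuple layout, `sort_key`, and `prefix_scan` are illustrative names, the integer timestamps stand in for t0..t5, and the `"..."` values are placeholders.

```python
from bisect import bisect_left


def sort_key(cell):
    """Order cells like the clustering order: all ascending, timestamp descending."""
    userid, username, family, qualifier, timestamp, _value = cell
    return (userid, username, family, qualifier, -timestamp)


cells = [
    (0xFA, "bob", "songs", "let it be", 5, "..."),
    (0xFA, "bob", "info", "email", 0, "bob@gmail.com"),
    (0xFA, "bob", "songs", "help", 2, "..."),
    (0xFA, "bob", "songs", "let it be", 2, "..."),
    (0xFA, "bob", "info", "payment", 1, "AMEX1234..."),
    (0xFA, "bob", "songs", "let it be", 4, "..."),
    (0xFA, "bob", "songs", "helter skelter", 1, "..."),
]
on_disk = sorted(cells, key=sort_key)


def prefix_scan(sorted_cells, prefix):
    """Return the (start, end) slice of cells whose key begins with prefix."""
    keys = [sort_key(c)[: len(prefix)] for c in sorted_cells]
    start = bisect_left(keys, prefix)
    end = start
    while end < len(keys) and keys[end] == prefix:
        end += 1
    return start, end


start, end = prefix_scan(on_disk, (0xFA, "bob", "songs"))
for cell in on_disk[start:end]:
    print(cell[2], cell[3], cell[4])
```

Every cell matching the prefix sits in one unbroken slice, so the read is a seek followed by a sequential scan.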
Kiji queries ➔ CQL queries
All data in “info” column family for “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa
AND username='bob'
AND family='info';
30
Kiji queries ➔ CQL queries
Data in “info:email” and last play of “help” for “bob” ➔
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='info' AND qualifier='email';
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='songs' AND qualifier='help' LIMIT 1;
31
Kiji queries ➔ CQL queries
All songs played by “bob” on April 2nd ➔
SELECT qualifier, value FROM music WHERE
userid=0xfa AND username='bob' AND family='songs'
AND timestamp >= 1396396800
AND timestamp <= 1396483200
ALLOW FILTERING; 😱😱
32
Kiji queries ➔ CQL queries
33
Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
Queries that do not map well to CQL
• Break up into multiple CQL queries
– Hooray for Session#executeAsync!
• Filter on the client
– Potentially very expensive, but functional
– Provide warning to user
• Educate users about table layout
– Layout in previous example is terrible for that query
• Most issues related to “time” dimension
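The first two workarounds can be sketched as follows. The helpers `per_qualifier_queries` and `filter_on_client` are hypothetical, the column names follow the table definition shown earlier, and the qualifier list is assumed to be known in advance. The statement text is interpolated only for readability; real code would use bound parameters.

```python
def per_qualifier_queries(userid, username, family, qualifiers, start_ts, end_ts):
    """Return one CQL statement per qualifier.

    Fixing the qualifier with an equality makes the timestamp range a legal
    restriction, so no ALLOW FILTERING is needed.
    """
    template = (
        "SELECT qualifier, value FROM music "
        "WHERE userid={userid} AND username='{username}' AND family='{family}' "
        "AND qualifier='{qualifier}' "
        "AND timestamp >= {start_ts} AND timestamp <= {end_ts};"
    )
    return [
        template.format(userid=userid, username=username, family=family,
                        qualifier=q, start_ts=start_ts, end_ts=end_ts)
        for q in qualifiers
    ]


def filter_on_client(rows, start_ts, end_ts):
    """Fallback: fetch the whole family, then drop out-of-range rows locally."""
    return [r for r in rows if start_ts <= r["timestamp"] <= end_ts]


queries = per_qualifier_queries(
    "0xfa", "bob", "songs", ["help", "let it be"], 1396396800, 1396483200
)
rows = [
    {"qualifier": "help", "timestamp": 1396400000},  # made-up sample rows
    {"qualifier": "help", "timestamp": 1396300000},
]
kept = filter_on_client(rows, 1396396800, 1396483200)
```

Each generated statement is independent of the others, so they could be submitted concurrently through the driver's asynchronous execution call and merged as results arrive.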
34
MapReduce
• Wrote new InputFormat, OutputFormat
• Hadoop 2.x
• Multiple C* queries per RecordReader
• Does not use Thrift
35
Project status and next steps
36
Initial release in ~ 2 weeks
37
• Cassandra as part of the Bento Box
• Cassandra working in KijiSchema, KijiMR
Support in the coming months
• Cassandra integration with KijiREST,
KijiScoring, KijiExpress, etc.
• Expose Cassandra-specific features to users
– Variable consistency levels
– Load-balancing policies
– Diagnostics (e.g., route tracing)
• Kiji support in CQLSH
– Decode Avro values
38
Thanks to Cassandra community
• Great help on mailing lists for users, dev, Java
driver
• Webinars, meetups, C* Summit all available
online
• Free training from DataStax
• Very easy to get up to speed
39
Try it now -- Kiji Bento Box
• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks
40
www.kiji.org/getstarted
Mention hiring?
KijiSchema
• Schemas and data serialization
• Complex data types (e.g.,
nested maps)
• Schema evolution
• Metadata
• Composite row keys
• Transparent paging
• Data-definition language, REPL
41
42
Schema support
Support for complex schemas with Avro
record UserLog {
  long timestamp;
  int user_id;
  string url;
}
KijiSchema allows schema versioning
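Schema versioning rests on Avro-style schema resolution: data written with an older schema is read through a newer one, with defaults filling in added fields. The sketch below imitates that rule in plain Python. It uses neither the Avro library nor KijiSchema; the `resolve` helper, the default of `-1` for `session_id`, and the sample values are assumptions.

```python
_REQUIRED = object()  # marker for reader fields that have no default


def resolve(record, reader_fields):
    """Read a record through a newer reader schema.

    reader_fields: list of (name, default) pairs. Fields the writer did not
    know about take the reader's default; fields the reader does not list
    are dropped.
    """
    resolved = {}
    for name, default in reader_fields:
        if name in record:
            resolved[name] = record[name]
        elif default is not _REQUIRED:
            resolved[name] = default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Written with the three-field UserLog schema (sample values are made up).
old_record = {"timestamp": 1396560123, "user_id": 7, "url": "http://example.com/"}

# Reader schema adds session_id with an assumed default of -1.
user_log_v2 = [
    ("timestamp", _REQUIRED),
    ("user_id", _REQUIRED),
    ("url", _REQUIRED),
    ("session_id", -1),
]
upgraded = resolve(old_record, user_log_v2)
```

Old cells stay readable after the schema gains a field, as long as the new field carries a default.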
43
Column name translation
• “family:qualifier” -> “A:B”
• Saves disk space
• Improves performance
• User-facing tools translate names
• Possible to turn this off
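A rough sketch of name translation: long logical names are swapped for short generated identifiers before storage and swapped back for display. The `NameTranslator` class is hypothetical, and drawing family and qualifier identifiers from a single shared counter is an assumption chosen so that the first column maps to “A:B” as on the slide.

```python
class NameTranslator:
    """Bidirectional map between long logical names and short stored names."""

    def __init__(self):
        self._to_short = {}
        self._to_long = {}

    @staticmethod
    def _encode(index):
        """0 -> A, 1 -> B, ..., 25 -> Z, 26 -> BA, ..."""
        letters = ""
        while True:
            index, rem = divmod(index, 26)
            letters = chr(ord("A") + rem) + letters
            if index == 0:
                return letters

    def shorten(self, long_name):
        if long_name not in self._to_short:
            short = self._encode(len(self._to_short))
            self._to_short[long_name] = short
            self._to_long[short] = long_name
        return self._to_short[long_name]

    def translate(self, column):
        """Logical "family:qualifier" -> short stored name."""
        family, qualifier = column.split(":", 1)
        return f"{self.shorten(family)}:{self.shorten(qualifier)}"

    def restore(self, short_column):
        """Short stored name -> logical "family:qualifier"."""
        family, qualifier = short_column.split(":", 1)
        return f"{self._to_long[family]}:{self._to_long[qualifier]}"


translator = NameTranslator()
stored = translator.translate("info:email")
print(stored, "->", translator.restore(stored))
```

Because the short name is repeated in every stored cell, trimming it reduces the bytes written per cell, which is where the disk-space saving comes from.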
Kiji queries ➔ CQL queries
All data in family “songs” for user “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob'
AND family='songs';
44
