Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache Cassandra with Kiji

Building a Flexible, Real-time
Big Data Applications Platform
on Cassandra with Kiji
Cassandra Day Silicon Valley
07 April 2014
Clint Kelly
Member of Technical Staff
WibiData

Overview
• The Kiji Project
• The Kiji data model and KijiSchema
• Mapping Kiji to Cassandra
• Status and future work
• Try it now!

Have this...
Want to build this...

Open source components
• Batch processing
– Extract, transform, load
– Train machine learning models
• Scalable storage
– Time-series data
• Serialization
– Complex data types
Hadoop, C*, HBase, Avro
KijiSchema, KijiMR, KijiREST, KijiHive, KijiScoring, KijiExpress

KijiSchema
• Schemas and data serialization
• Complex, atomic data types
record UserLog {
long timestamp;
int user_id;
string url;
long session_id;
}
• Schema evolution
• Table metadata

Kiji batch components
• Scala DSL ➔ describe
MapReduce computations
• Machine learning library
• Hive adapter

Kiji real-time components
• REST server
• Scoring server

Kiji Summary
• Bridge between open-source technologies
and real-time, big data applications
• Users are building real systems with Kiji now!
– Personalized recommendation systems for retail
– Energy usage and analytics reporting

The Kiji data model and
KijiSchema

Tables are composed of rows.

Row: entity ID | data
We call row keys “entity IDs.”

Row: 0xfa, “bob” | data
We support composite entity IDs (with hashed and unhashed components).
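
A minimal Python sketch of a composite entity ID, assuming a hypothetical layout in which a short hash prefix of the first component forms the hashed part and the components themselves stay unhashed. The function name, the use of MD5, and the prefix length are illustrative assumptions, not KijiSchema's actual encoding.

```python
import hashlib


def make_entity_id(components, hash_size=2):
    """Return (hash_prefix, *components) for a composite row key.

    Hypothetical helper: the prefix is derived from the first component
    so rows spread evenly, while the plain components stay readable.
    """
    if not components:
        raise ValueError("an entity ID needs at least one component")
    first = str(components[0]).encode("utf-8")
    prefix = hashlib.md5(first).digest()[:hash_size]
    return (prefix, *components)


entity_id = make_entity_id(["bob"])
```

The hashed prefix plays the role of 0xfa in the diagrams, and “bob” is the unhashed component.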

Row: 0xfa, “bob” | info | songs
Data in rows is organized into “column families.”

Row: 0xfa, “bob” | info:email | info:payment | songs:let it be | songs:help | songs:helter skelter
Column families contain columns, named as “family:qualifier.”

Row: 0xfa, “bob” | info:email | info:payment | songs:let it be (several versions, one at timestamp 1396560123) | songs:help | songs:helter skelter
Individual columns can have many different timestamped versions.
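
The data model described so far can be sketched in Python as a nested map from (family, qualifier) to timestamped versions. The `Row` class and its method names are hypothetical and the cell values are made up; this illustrates the model and is not the KijiSchema API.

```python
from collections import defaultdict


class Row:
    """One row: entity ID plus cells keyed by family, qualifier, timestamp."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        # (family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, family, qualifier, timestamp, value):
        self._cells[(family, qualifier)][timestamp] = value

    def get(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells.get((family, qualifier), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


row = Row(("0xfa", "bob"))
row.put("songs", "let it be", 1396560123, "play-3")
row.put("songs", "let it be", 1396550000, "play-2")
row.put("info", "email", 1396500000, "bob@gmail.com")
latest = row.get("songs", "let it be")
```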

Data values can be complex records:
record SongPlay {
long song_id;
int user_rating;
long session_id;
device_type device;
}

Locality groups
Separate logical organization of data
(column families) from physical
attributes (caching, compression, etc.)
Row: entity ID | info | songs_today | songs_prev_year
Need this data ASAP for real-time scoring: “real_time” (in-memory, uncompressed, TTL = 1 day)
Use this data only for batch jobs: “batch” (compressed, TTL = 12mo)
Always refer to columns by logical name (“family:qualifier”).
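
A small Python sketch of the idea: physical attributes live on the locality group, and each column family only points at a group. The dictionary keys, the assignment of families to groups, and the approximation of 12 months as 365 days are assumptions made for illustration; they are not the Kiji table layout format.

```python
DAY_SECONDS = 24 * 60 * 60

# Physical attributes are declared once per locality group.
LOCALITY_GROUPS = {
    "real_time": {"in_memory": True, "compressed": False,
                  "ttl_seconds": 1 * DAY_SECONDS},
    "batch": {"in_memory": False, "compressed": True,
              "ttl_seconds": 365 * DAY_SECONDS},
}

# Assumed assignment of the families from the diagram to groups.
FAMILY_TO_GROUP = {
    "info": "real_time",
    "songs_today": "real_time",
    "songs_prev_year": "batch",
}


def physical_attributes(column):
    """Look up storage attributes from a logical 'family:qualifier' name."""
    family, _, _qualifier = column.partition(":")
    return LOCALITY_GROUPS[FAMILY_TO_GROUP[family]]


attrs = physical_attributes("songs_prev_year:help")
```

Callers only ever pass the logical column name, so moving a family to another locality group changes one entry in the mapping and nothing else.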

KijiSchema summary
• Data model similar to Cassandra, HBase,
BigTable
• Contains time dimension (not present in C*)
• Logical and physical organization separate
• Complex schemas with Avro

Implementation notes
• Built for Cassandra 2.0.6+
• Native protocol / Java driver (no Thrift)
• Asynchronous API
• Assume users have Hadoop, ZooKeeper

Mapping a Kiji table ➔ Cassandra
• Locality group ➔ Table
• Entity ID ➔ Primary key
– Hashed components ➔ partition key
– Unhashed components ➔ clustering columns
• Family, qualifier, timestamp ➔ clustering columns
• Cell values ➔ blobs
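
The mapping above can be sketched as a Python function that emits a CREATE TABLE statement for one locality group. The function name and argument shapes are hypothetical, and this is not how Kiji itself generates DDL. The partition key is written in the explicit `((...), ...)` form, which for a single hashed component is equivalent to listing it first in the primary key.

```python
def locality_group_cql(table, hashed, unhashed):
    """Build CQL for one locality group.

    hashed / unhashed: lists of (column_name, cql_type) for the entity ID.
    """
    columns = list(hashed) + list(unhashed) + [
        ("family", "text"),
        ("qualifier", "text"),
        ("timestamp", "bigint"),
        ("value", "blob"),  # cell values are stored as blobs
    ]
    col_defs = ",\n".join(f"  {name} {ctype}" for name, ctype in columns)
    partition = ", ".join(name for name, _ in hashed)
    clustering = [name for name, _ in unhashed]
    clustering += ["family", "qualifier", "timestamp"]
    # Newest versions first within each column.
    order = ", ".join(
        f"{name} {'DESC' if name == 'timestamp' else 'ASC'}"
        for name in clustering
    )
    return (
        f"CREATE TABLE {table} (\n{col_defs},\n"
        f"  PRIMARY KEY (({partition}), {', '.join(clustering)})\n"
        f") WITH CLUSTERING ORDER BY ({order});"
    )


cql = locality_group_cql(
    "users_locality_group_fast",
    hashed=[("userid", "bigint")],
    unhashed=[("username", "text")],
)
```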

CQL for Kiji locality group
CREATE TABLE users_locality_group_fast (
userid bigint,
username text,
family text,
qualifier text,
timestamp bigint,
value blob,
PRIMARY KEY (userid, username, family, qualifier, timestamp)
) WITH CLUSTERING ORDER BY (
username ASC, family ASC, qualifier ASC, timestamp DESC);

cqlsh:kiji_music> SELECT * FROM kiji_table_users;
userid | username | family | qualifier | timestamp | value
--------+----------+--------+----------------+-----------+---------------
0xfa | bob | info | email | 139653249 | 1243970104327
0xfa | bob | songs | abbey road | 139656012 | 0981274331032
0xfa | bob | songs | help | 139625013 | 9074132704129
0xfa | bob | songs | helter skelter | 139621324 | 7710423974234

Physical organization of data on disk
0xfa:bob:info:email:t0:bob@gmail.com
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:…
0xfa:bob:songs:let it be:t2:…
0xfa:bob:songs:help:t2:…
0xfa:bob:songs:helter skelter:t1:…
Efficient queries =
continuous scans!
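
A Python sketch of why this layout makes family-level reads cheap: when cells are sorted by (userid, username, family, qualifier, timestamp descending), everything in one family forms a single contiguous slice. The in-memory list and the `scan_family` helper are stand-ins for illustration, not Cassandra's storage engine.

```python
import bisect

# (userid, username, family, qualifier, timestamp) taken from the listing above.
cells = [
    ("0xfa", "bob", "songs", "let it be", 5),
    ("0xfa", "bob", "info", "payment", 1),
    ("0xfa", "bob", "songs", "help", 2),
    ("0xfa", "bob", "songs", "let it be", 2),
    ("0xfa", "bob", "info", "email", 0),
    ("0xfa", "bob", "songs", "helter skelter", 1),
    ("0xfa", "bob", "songs", "let it be", 4),
]


def sort_key(cell):
    userid, username, family, qualifier, ts = cell
    return (userid, username, family, qualifier, -ts)  # newest first


ordered = sorted(cells, key=sort_key)
keys = [sort_key(c) for c in ordered]


def scan_family(userid, username, family):
    """Return one family's cells as a single slice of the sorted data."""
    lo = bisect.bisect_left(keys, (userid, username, family))
    hi = bisect.bisect_left(keys, (userid, username, family + "\x00"))
    return ordered[lo:hi]


songs = scan_family("0xfa", "bob", "songs")
```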

Kiji queries ➔ CQL queries
All data in “info” column family for “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa
AND username='bob'
AND family='info';

Data in “info:email” and last play of “help” for “bob” ➔
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='info' AND qualifier='email';
SELECT value FROM music WHERE userid=0xfa AND
username='bob' AND family='songs' AND qualifier='help' LIMIT 1;
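
A hedged Python sketch of how such requests could be turned into parameterized CQL. `build_select` is a hypothetical helper, not part of Kiji; it uses `?` placeholders as in prepared statements. Because `timestamp` is clustered in descending order, `LIMIT n` returns the newest n versions, but only when the qualifier is fixed, so the sketch rejects a version limit without a qualifier.

```python
def build_select(table, family, qualifier=None, max_versions=None):
    """Return a CQL string for one family or one fully qualified column."""
    clauses = ["userid = ?", "username = ?", "family = ?"]
    if qualifier is not None:
        clauses.append("qualifier = ?")
    cql = (
        f"SELECT qualifier, timestamp, value FROM {table} WHERE "
        + " AND ".join(clauses)
    )
    if max_versions is not None:
        if qualifier is None:
            # LIMIT applies to the whole result, not to each column.
            raise ValueError("max_versions requires a qualifier")
        cql += f" LIMIT {int(max_versions)}"
    return cql + ";"


last_play = build_select("music", "songs", qualifier="help", max_versions=1)
all_info = build_select("music", "info")
```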

All songs played by “bob” on April 2nd ➔
SELECT qualifier, value FROM music WHERE
userid=0xfa AND username='bob' AND family='songs'
AND timestamp >= 1396396800
AND timestamp <= 1396483200
ALLOW FILTERING; 😱😱

Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)

Queries that do not map well to CQL
• Break up into multiple CQL queries
– Hooray for Session#executeAsync!
• Filter on the client
– Potentially very expensive, but functional
– Provide warning to user
• Educate users about table layout
– Layout in previous example is terrible for that query
• Most issues related to “time” dimension
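
The first two strategies can be sketched in Python. A thread pool stands in for the Java driver's asynchronous execution, and an in-memory list stands in for the table; both are assumptions for illustration, as are the sample rows and function names. The fan-out version issues one query per qualifier, each of which can legally restrict `timestamp` because `qualifier` is fixed by equality.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the "songs" family: (qualifier, timestamp, value).
FAKE_SONGS = [
    ("help", 1396400000, "play-a"),
    ("help", 1396300000, "play-b"),
    ("let it be", 1396450000, "play-c"),
    ("let it be", 1396560123, "play-d"),
]


def query_one_qualifier(qualifier, start, end):
    """Stand-in for: ... AND qualifier = ? AND timestamp >= ? AND timestamp <= ?"""
    return [r for r in FAKE_SONGS if r[0] == qualifier and start <= r[1] <= end]


def plays_in_range(qualifiers, start, end):
    """Strategy 1: break the request into one query per qualifier."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(query_one_qualifier, q, start, end) for q in qualifiers
        ]
        return [row for f in futures for row in f.result()]


def plays_in_range_client_filter(start, end):
    """Strategy 2: read the whole family, then filter on the client."""
    return [r for r in FAKE_SONGS if start <= r[1] <= end]


APRIL_2_START, APRIL_2_END = 1396396800, 1396483200
result = plays_in_range(["help", "let it be"], APRIL_2_START, APRIL_2_END)
```

The fan-out version needs the list of qualifiers up front; the client-side filter does not, but it reads every version in the family.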

MapReduce
• Wrote new InputFormat, OutputFormat
• Hadoop 2.x
• Multiple C* queries per RecordReader
• Does not use Thrift

Project status and next steps

Initial release in ~ 2 weeks
• Cassandra as part of the Bento Box
• Cassandra working in KijiSchema, KijiMR

Support in the coming months
• Cassandra integration with KijiREST,
KijiScoring, KijiExpress, etc.
• Expose Cassandra-specific features to users
– Variable consistency levels
– Load-balancing policies
– Diagnostics (e.g., route tracing)
• Kiji support in CQLSH
– Decode Avro values

Thanks to Cassandra community
• Great help on mailing lists for users, dev, and the Java driver
• Webinars, meetups, C* Summit all available
online
• Free training from DataStax
• Very easy to get up-to-speed

Try it now -- Kiji Bento Box
• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks
www.kiji.org/getstarted

KijiSchema
• Schemas and data serialization
• Complex data types (e.g.,
nested maps)
• Schema evolution
• Metadata
• Composite row keys
• Transparent paging
• Data-definition language, REPL

Schema support
Support for complex schemas with Avro
record UserLog {
long timestamp;
int user_id;
string url;
}
KijiSchema allows schema versioning

Column name translation
• “family:qualifier” ➔ “A:B”
• Saves disk space
• Improves performance
• User-facing tools translate names
• Possible to turn this off
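
A minimal Python sketch of column name translation, assuming short names are handed out in order of first use (A, B, ..., Z, AA, ...). The `NameTranslator` class and this numbering scheme are hypothetical; the slides do not describe how Kiji assigns short names.

```python
import string


class NameTranslator:
    """Map long logical names to short stored names and back."""

    def __init__(self):
        self._short_by_long = {}
        self._long_by_short = {}

    @staticmethod
    def _encode(index):
        # 0 -> "A", 25 -> "Z", 26 -> "AA", ...
        letters = string.ascii_uppercase
        out = ""
        index += 1
        while index > 0:
            index, rem = divmod(index - 1, 26)
            out = letters[rem] + out
        return out

    def shorten(self, name):
        if name not in self._short_by_long:
            short = self._encode(len(self._short_by_long))
            self._short_by_long[name] = short
            self._long_by_short[short] = name
        return self._short_by_long[name]

    def shorten_column(self, column):
        family, _, qualifier = column.partition(":")
        return f"{self.shorten(family)}:{self.shorten(qualifier)}"

    def expand_column(self, short_column):
        family, _, qualifier = short_column.partition(":")
        return f"{self._long_by_short[family]}:{self._long_by_short[qualifier]}"


translator = NameTranslator()
stored = translator.shorten_column("info:email")
```

User-facing tools would call `expand_column` on stored names, so users only ever see the logical “family:qualifier” form.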

All data in family “songs” for user “bob” ➔
SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob'
AND family='songs';
