18. Denormalization
Query-based data model

Start from the relational model: Organization (OrganizationID, Name) and Employee (EmployeeID, OrganizationID, Name). Create one denormalized table per query:

1. Select all employees for a given organizationID:
   a table partitioned by OrganizationID with columns (EmployeeID, Name)
2. Select employee for a given employeeID:
   a table keyed by EmployeeID with columns (Name, OrganizationID)
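The two query tables above can be sketched in CQL; the keyspace and table names here are illustrative, not from the slides:

```sql
-- Query 1: all employees for a given organization.
-- Partition key = organizationid, so one partition holds the whole org.
CREATE TABLE keyspace.employees_by_organization (
    organizationid bigint,
    employeeid     bigint,
    name           text,
    PRIMARY KEY (organizationid, employeeid)
);

-- Query 2: a single employee by id.
CREATE TABLE keyspace.employee_by_id (
    employeeid     bigint PRIMARY KEY,
    name           text,
    organizationid bigint
);
```

Every write now has to touch both tables, which is where the data-management cost of denormalization comes from.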
20. Denormalization
Secondary indexes

Keep the relational model, Organization (OrganizationID, Name) and Employee (EmployeeID, OrganizationID, Firstname, Lastname, Email, ...), and add a secondary index on Employee.OrganizationID so the same two queries can be served:

1. Select all employees for a given organization (via the secondary index)
2. Select employee for a given employeeID (via the primary key)

Performance impact: a secondary-index query is not bound to a single partition, so it fans out across the cluster.
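In CQL the statement is CREATE INDEX (there is no literal CREATE SECONDARY INDEX). A sketch with illustrative names:

```sql
CREATE TABLE keyspace.employee (
    employeeid     bigint PRIMARY KEY,
    organizationid bigint,
    firstname      text,
    lastname       text,
    email          text
);

-- Secondary index on the organization "foreign key"
CREATE INDEX employee_org_idx ON keyspace.employee (organizationid);

-- Query 1: served by the secondary index (scatter/gather across nodes)
SELECT employeeid, firstname, lastname
  FROM keyspace.employee
 WHERE organizationid = 42;

-- Query 2: served directly by the primary key
SELECT * FROM keyspace.employee WHERE employeeid = 1001;
```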
21. Denormalization
PROS
Fast reads
One query per request (usually)
Scalable (probably)
CONS
Complex data management
Insert/update/delete can be extremely hard and complex
Need to know all queries upfront
22. UDTs
CREATE TYPE keyspace.employee (
  employeeid bigint,
  firstname text,
  lastname text,
  email text
);

CREATE TABLE keyspace.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees list<frozen<employee>>
);
Each organization row holds its employees inline: OrganizationID, Name, Employees = [Employee, Employee, ...]
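Reading then becomes a single query per request. Writing the whole list in one statement might look like this (values are illustrative):

```sql
INSERT INTO keyspace.organization (organizationid, name, employees)
VALUES (42, 'Acme', [
    {employeeid: 1, firstname: 'Ada',  lastname: 'Lovelace', email: 'ada@acme.example'},
    {employeeid: 2, firstname: 'Alan', lastname: 'Turing',   email: 'alan@acme.example'}
]);

-- One query per request:
SELECT name, employees FROM keyspace.organization WHERE organizationid = 42;
```

Because the list elements are frozen, an element can only be replaced as a whole, which is the "no partial updates" con below.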
23. UDTs
PROS
Fast(er) reads
One query per request
Scalable (should be!!)
Indexing?
CONS
Complex data management
No partial updates
Need to know all queries upfront
Indexing?
24. Blob data
CREATE TABLE keyspace.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees text  -- or: employees blob
);
Each organization row stores the whole employees list in one column, as JSON text or as a serialized-objects blob.
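With the text variant the application serializes the list itself and stores it as an opaque string (values illustrative):

```sql
INSERT INTO keyspace.organization (organizationid, name, employees)
VALUES (42, 'Acme',
  '[{"employeeid": 1, "firstname": "Ada", "lastname": "Lovelace", "email": "ada@acme.example"}]');
```

Cassandra sees only an opaque text (or blob) value, which is why there is no indexing option for the individual fields.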
25. Blob data
PROS
Fast reads
One query per request
No need to serialize into JSON (the blob variant accepts any binary format)
CONS
Complex data management
No partial updates
Need to know all queries upfront
No indexing option
31. How to insert data
Insert AS JSON (Cassandra 2.2+)
Inserted as a JSON string, stored using the table's column types
Easy to manage and debug
Keep track of the data size!!!
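Applied to the organization table from the earlier slides, a JSON insert could look like this; the JSON string is parsed and stored using the regular column types, including the employee UDT:

```sql
INSERT INTO keyspace.organization JSON '{
    "organizationid": 42,
    "name": "Acme",
    "employees": [
        {"employeeid": 1, "firstname": "Ada", "lastname": "Lovelace",
         "email": "ada@acme.example"}
    ]
}';
```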
32. Spark dataframe UDT mapping
One to many:

dataframe.as("parent").join(
  child.groupBy(seq.map(col): _*)
    .agg(collect_list(struct(columns.map(col): _*)).alias(alias)),
  seq, joinType)

One to one:

dataframe.join(
  child.withColumn(alias, struct(child.columns.map(col): _*))
    .select(joinColumn, alias),
  Seq(joinColumn), joinType)
33. Inserting from Spark
// Save to Cassandra
dataframe.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> s"$keyspace",
    "table" -> s"$table"
  ))
  .mode(SaveMode.Append)
  .save()
34. Indexing UDTs
Not possible with just Cassandra
Lucene/Solr based secondary index
Indexing of fields on nested UDTs
Field analyzers
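As a sketch, a Lucene-backed custom index (for example the Stratio cassandra-lucene-index plugin) that reaches into nested UDT fields could look roughly like this; the option names and dotted-field syntax are assumptions that depend on the plugin version:

```sql
-- Hypothetical example: index UDT fields inside the employees list
CREATE CUSTOM INDEX organization_lucene_idx ON keyspace.organization ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds': '1',
  'schema': '{
    fields: {
      "employees.lastname": {type: "string"},
      "employees.email":    {type: "string"}
    }
  }'
};
```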
37. Closing notes
The Cassandra data model supports a lot of use cases
Data modeling skills are required
Relational-style modeling is hard, but not impossible
Additional tools in the ecosystem help (Spark, Lucene/Solr-based indexes)
Don’t be stubborn