Polyglot persistence for Java developers: time to move out of the relational comfort zone? (gids 2015)

Chris Richardson
Author of POJOs in Action
Founder of the original CloudFoundry.com
@crichardson
chris@chrisrichardson.net
http://plainoldobjects.com

@crichardson
Presentation Goal
The benefits and drawbacks
of polyglot persistence
and
How to design applications
that use this approach

@crichardson
About Chris
Founder of a startup that’s creating
a platform for developing
event-driven microservices

@crichardson
Agenda
• Why polyglot persistence?
• Persisting entities with MongoDB and Cassandra
• Querying data with MongoDB and Cassandra
• Scaling MongoDB and Cassandra

@crichardson
Relational Databases

@crichardson
Example: Food to Go
• Take-out food delivery
service
• “Launched” in 2006

@crichardson
Food to Go Architecture
[Diagram: CONSUMER and RESTAURANT OWNER use the Order taking and Restaurant Management components, which share a MySQL Database]

@crichardson
Example: Device management
server ~ 2003
• Everything was stored in an Oracle database
• Device metadata
• Firmware patches!
• ….

@crichardson
RDBMS are great
• SQL = Rich, declarative query language
• Database enforces referential integrity
• ACID semantics
• Well understood by developers
• Well supported by frameworks and tools, e.g. Spring JDBC,
Hibernate, JPA
• Well understood by operations

@crichardson
Impact of SSD/Flash storage
• HDD = 200 IOPS vs. SSD = 100K IOPS
• Massive performance improvement
• Expands the range of use cases that a single RDBMS server
can cost-effectively support

@crichardson
AWS Aurora
• Hosted relational database
• Compatible with MySQL 5.6 but with 5x performance
• Vertically scales to 32 vCPUs and 244 GiB of RAM
• SSD-backed virtualized storage layer, purpose-built for database workloads and replicated 6 ways across 3 AZs
• Up to 15 replicas that share storage with the master - minimal replication lag
• Fast restart after crash - no redo log replay
• Fast fail-over to replica after master instance failure without data loss
http://aws.amazon.com/rds/aurora/details/

NEW SQL
• Next generation SQL databases,
e.g. VoltDB, MemSQL, ...
• Leverage modern, multi-core,
commodity hardware
• In-memory
• Horizontally scalable
• Transparently shardable
• ACID
“Current databases are designed for 1970s
hardware and for both OLTP and data
warehouses”
http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf

@crichardson
An RDBMS is great for many
applications but ….

@crichardson
Limitations of relational
databases
• Scalability
• Multi data center, distributed database
• Schema updates
• O/R impedance mismatch
• Handling semi-structured data

@crichardson
Solution: Spend $$$ on Oracle’s
high-end databases and servers

@crichardson
Not so bad…
http://www.powerandmotoryacht.com/megayachts/megayacht-musashi

@crichardson
… or is it?
http://www.iwtg.net/

@crichardson
Solution: Spend $$$ - open-
source stack + DevOps people
http://www.trekbikes.com/us/en/bikes/road/race_performance/madone_5_series/madone_5_2/#

@crichardson
Apply the scale cube
• X axis - horizontal duplication
• Y axis - functional decomposition: scale by splitting different things
• Z axis - data partitioning: scale by splitting similar things

@crichardson
Applying the scale cube
• Y-axis splits/functional decomposition
• Application = Set[Microservice] - each with its own database
• Monolithic database is functionally decomposed
• Different types of entities in different databases
• Z-axis splits/sharding
• Entities of the same type partitioned across multiple databases
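A Z-axis split needs some rule that maps each entity to one database. The following is a minimal sketch of one such rule (hash of the id modulo the number of shards). The class name `ShardRouter` and the JDBC URLs are hypothetical and only serve the illustration; real sharding schemes also have to deal with resharding, which this sketch ignores.

```java
import java.util.List;

/** Hypothetical Z-axis router: picks a database for an entity by hashing its id. */
class ShardRouter {
    private final List<String> shardUrls;

    ShardRouter(List<String> shardUrls) {
        if (shardUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shardUrls = List.copyOf(shardUrls);
    }

    /** The same id always maps to the same shard; different ids spread across shards. */
    String shardFor(long entityId) {
        int index = Math.floorMod(Long.hashCode(entityId), shardUrls.size());
        return shardUrls.get(index);
    }
}
```

With two shards, `new ShardRouter(List.of(urlA, urlB)).shardFor(restaurantId)` returns the URL of the database that holds that restaurant.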

@crichardson
How does each service access data?
Factors to consider:
• Velocity and Volume
• Variety of Data
• Fixed or ad hoc queries
• Access patterns
• Distribution
• Latency

@crichardson
Velocity and Volume?
• Velocity - speed at which data moves
• Volume - the amount of data
• Does it fit on a single machine?

@crichardson
Variety of Data?
• Relational
• Aggregate oriented
• Graph
• Complex nested structures
• Semi structured
• Text
• Binary blobs, e.g. images

@crichardson
Fixed or ad hoc queries?
• Fixed set of queries
• Known in advance
• Slowly changing
• Ad hoc queries
• Users can submit ad hoc queries

@crichardson
Access patterns
• PK-oriented access, e.g. load-modify-update a business entity
• Bulk queries and/or updates
• Non-relational queries:
• text search
• graph-oriented
• geo search
• …

@crichardson
Reads vs. Writes
• Mix of reads and writes
• Write intensive, e.g. logging application
• Read intensive
• Data analytics/warehouse
• Slowly changing data
• …

@crichardson
Distribution
• Single database
• Multiple active databases
• on a LAN (low latency)
• on a WAN (high latency)

@crichardson
Transactions
• Mandatory ACID
• Eventual consistency OK?

@crichardson
Latency
• When should new data show up in results?
• Low latency - seconds, milliseconds?
• High latency - next day?

@crichardson
And then pick your database…

@crichardson
Use a NoSQL database
Benefits
• Higher performance
• Higher scalability
• Richer data-model
• Schema-less
Drawbacks
• Limited transactions
• Limited querying
• Relaxed consistency
• Unconstrained data

@crichardson
Example NoSQL Databases
Database    Key features
Cassandra   Extensible column store, very scalable, distributed
MongoDB     Document-oriented, fast, scalable
Redis       Key-value store, very fast
DynamoDB    AWS hosted key-value and document store
Neo4j       Graph database
http://nosql-database.org/ lists 150 NoSQL databases

@crichardson
Relative popularity
http://www.indeed.com/jobtrends/mongodb%2Ccassandra%2Credis%2Cneo4j%2Cdynamodb.html

@crichardson
But there are many other options
• Blob store, e.g. AWS S3
• Text search engine, e.g. ElasticSearch, AWS CloudSearch, …
• Big data technology: Apache Hadoop, Apache Spark, …
• Real time streaming: Storm, Spark Streaming, …

@crichardson
Polyglot persistence
IEEE Software Sept/October 2010 - Debasish Ghosh / Twitter @debasishg
Event sourcing and CQRS are a great approach

@crichardson
Food to Go – Domain model (partial)
class Restaurant {
    long id;
    String name;
    Set<String> serviceArea;
    Set<TimeRange> openingHours;
    List<MenuItem> menuItems;
}
class MenuItem {
    String name;
    double price;
}
class TimeRange {
    long id;
    int dayOfWeek;
    int openTime;
    int closeTime;
}

@crichardson
Database schema
RESTAURANT table
ID  Name               …
1   Ajanta
2   Montclair Eggshop
RESTAURANT_ZIPCODE table
Restaurant_id  zipcode
1              94707
1              94619
2              94611
2              94619
RESTAURANT_TIME_RANGE table
Restaurant_id  dayOfWeek  openTime  closeTime
1              Monday     1130      1430
1              Monday     1730      2130
2              Tuesday    1130      …

@crichardson
RestaurantRepository
public interface RestaurantRepository {
    void addRestaurant(Restaurant restaurant);
    Restaurant findById(long id);
    ...
}
Food To Go will eventually have scaling issues

@crichardson
MongoDB
• Document-oriented database
• JSON-style documents: Lists, Maps, primitives
• Schema-less
• Transaction = update of a single document
• Rich query language for dynamic/ad hoc queries + geo queries
• Tunable writes: speed vs. reliability
• Highly scalable and available

@crichardson
MongoDB use cases
• High volume writes
• Complex data
• Semi-structured data

@crichardson
MongoDB data model
Server
Database: Food To Go
Collection: Restaurants
{
  "_id": ObjectId("4bddc2f49d1505567c6220a0"),
  "name": "Ajanta",
  "serviceArea": ["94619", "99999"],
  "openingHours": [
    { "dayOfWeek": 1, "open": 1130, "close": 1430 },
    { "dayOfWeek": 2, "open": 1130, "close": 1430 },
    …
  ]
}
• "_id" is the PK
• BSON = binary JSON
• Sequence of bytes on disk → fast I/O
• 16 MB document size limit

@crichardson
Many NoSQL Databases
=
Aggregate-oriented

@crichardson
Basic MongoDB collection
operations...
• insert(document(s), options)
  • Application-assigned ids
  • MongoDB-generated ObjectId
• update(query, update, options)
  • query - selects document(s)
  • update - replace or modify document (e.g. increment a field)
  • options - upsert, multi, … (optional)
• remove(query, options)

@crichardson
....Basic MongoDB collection
operations
• find/findOne(criteria, projection)
• criteria - query
• projection - fields to return (optional)

@crichardson
Using Spring Data for Mongo
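import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.stereotype.Repository;
@Repository
class RestaurantRepositoryMongoDbImpl implements RestaurantRepository {
    // Not shown on the original slide: assumed to be injected by Spring
    @Autowired
    private MongoTemplate mongoTemplate;
    @Override
    public void add(Restaurant restaurant) {
        // Stores the restaurant aggregate as one document
        mongoTemplate.insert(restaurant, "restaurants");
    }
    @Override
    public Restaurant findDetailsById(int id) {
        // Loads the document by its _id and maps it back to a Restaurant
        return mongoTemplate.findById(id, Restaurant.class, "restaurants");
    }
}
Spring Data’s Generic Repositories = even less code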

@crichardson
Apache Cassandra
• Distributed/Extensible row store: row ~= java.util.SortedMap
• Transaction = update of a row
• Fast writes = append to a log
• Tunable reads/writes: consistency vs. latency/availability
• Extremely scalable
• Transparent and dynamic clustering
• Rack and datacenter aware data replication

@crichardson
Apache Cassandra use cases
• Big data
• Multiple Data Center distributed database
• (Write intensive) Logging
• High-availability (writes)

@crichardson
Cassandra data model
Keyspace → Table → Rows
Row key  Columns (column name, column value, timestamp)
K1       (N1, V1, TS1) (N2, V2, TS2) (N3, V3, TS3)
K2       (N1, V1, TS1) (N2, V2, TS2) (N3, V3, TS3)
Column name/value: number, string, Boolean, timestamp, counter, and composite

@crichardson
Inserting/updating data
table.insert(key=K1, (N4, V4, TS4), …)
Idempotent = transaction
Table before: K1  (N1, V1, TS1) (N2, V2, TS2) (N3, V3, TS3)
Table after:  K1  (N1, V1, TS1) (N2, V2, TS2) (N3, V3, TS3) (N4, V4, TS4)
Optional column TTL
Application-assigned keys - natural or UUID

@crichardson
Reading data
table.slice(key=K1, startColumn=N2, endColumn=N4)
Table:  K1  (N1, V1, TS1) (N2, V2, TS2) (N3, V3, TS3) (N4, V4, TS4)
Result: K1  (N2, V2, TS2) (N3, V3, TS3) (N4, V4, TS4)
Cassandra has secondary indexes but they aren’t always helpful
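The "row ~= java.util.SortedMap" analogy can be made concrete with a toy model. The sketch below is not Cassandra code: `WideRow` and `Cell` are hypothetical names, and conflict resolution is reduced to "highest timestamp wins". It only shows why re-applying the same insert is harmless and why a slice over sorted column names is cheap.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of one storage row: column name -> (value, timestamp), sorted by column name. */
class WideRow {
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final TreeMap<String, Cell> columns = new TreeMap<>();

    /** Re-applying the same write changes nothing; a write with an older timestamp is ignored. */
    void insert(String name, String value, long timestamp) {
        columns.merge(name, new Cell(value, timestamp),
                (old, fresh) -> fresh.timestamp >= old.timestamp ? fresh : old);
    }

    /** Returns the columns whose names lie between startColumn and endColumn, inclusive. */
    SortedMap<String, Cell> slice(String startColumn, String endColumn) {
        return columns.subMap(startColumn, true, endColumn, true);
    }
}
```

Because the columns are kept sorted, `slice("N2", "N4")` is a contiguous range lookup rather than a scan of the whole row.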

@crichardson
Cassandra Query Language
• SQL-like
• DDL: Create table, ...
• DML: Insert, Update, Select, ...
• Restricted WHERE clauses, e.g. PK equality only (if you want efficiency)
• Primary key:
  • Simple - 1 storage table row → 1 CQL row
  • Compound - 1 storage table row → multiple CQL rows! (clustered rows)

@crichardson
Representing restaurants
create table restaurant (
  restaurant_id int PRIMARY KEY,
  name text,
  service_area set<text>,
  day_of_weeks list<int>,
  opening_times list<int>,
  closing_times list<int>
);

@crichardson
Inserting and retrieving
restaurants
insert into restaurants.restaurant(
  restaurant_id, name, service_area,
  day_of_weeks, opening_times,
  closing_times)
values(?, ?, ?, ?, ?, ?)
select *
from restaurants.restaurant
where restaurant_id = ?

@crichardson
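Storing restaurants in Cassandra
Row key  Column name        Column value
1        name               Ajanta
         serviceArea:94619  -        (set member in the column name)
         serviceArea:94618  -        (set member in the column name)
         daysOfWeeks:0      Monday   (element index in the name, element value in the value)
         daysOfWeeks:1      Monday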

@crichardson
Cassandra Java APIs
• Java Driver
• https://github.com/datastax/java-driver
• Netflix Astyanax
• http://techblog.netflix.com/2013/12/astyanax-update.html
• Spring Data for Cassandra
• http://projects.spring.io/spring-data-cassandra/

@crichardson
Java Driver: Inserting a restaurant
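import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;
public class AvailableRestaurantRepositoryCassandraImpl ... {
    private final Session session;
    private final PreparedStatement insertStatement;
    public AvailableRestaurantRepositoryCassandraImpl(Session session) {
        this.session = session;
        // Prepare once, bind many times
        insertStatement = session.prepare(
            "insert into restaurants.restaurant(restaurant_id, name, service_area, day_of_weeks, " +
            "opening_times, closing_times) Values(?, ?, ?, ?, ?, ?);"
        );
        ...
    }
    @Override
    public void add(Restaurant restaurant) {
        // Flatten the set of TimeRanges into three parallel lists, one per CQL list column
        List<Integer> dayOfWeeks = new ArrayList<Integer>();
        List<Integer> openingTimes = new ArrayList<Integer>();
        List<Integer> closingTimes = new ArrayList<Integer>();
        for (TimeRange tr : restaurant.getOpeningHours()) {
            dayOfWeeks.add(tr.getDayOfWeek());
            // Getter names assumed from the TimeRange fields openTime and closeTime
            openingTimes.add(tr.getOpenTime());
            closingTimes.add(tr.getCloseTime());
        }
        session.execute(insertStatement.bind(restaurant.getId(),
            restaurant.getName(),
            restaurant.getServiceArea(),
            dayOfWeeks,
            openingTimes,
            closingTimes
        ));
    }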

@crichardson
Java Driver: Finding a restaurant
public class AvailableRestaurantRepositoryCassandraImpl
        implements AvailableRestaurantRepository {
    // Constructor signature assumed to match the one on the previous slide
    public AvailableRestaurantRepositoryCassandraImpl(Session session) {
        this.findByIdStatement = session.prepare(
            "select * from restaurants.restaurant where restaurant_id = ?;");
        ...
    }
    @Override
    public Restaurant findDetailsById(int id) {
        // Note: get(0) throws if no restaurant has this id
        Row row = session.execute(findByIdStatement.bind(id)).all().get(0);
        List<Integer> dayOfWeeks = row.getList("day_of_weeks", Integer.class);
        List<Integer> openingTimes = row.getList("opening_times", Integer.class);
        List<Integer> closingTimes = row.getList("closing_times", Integer.class);
        // Rebuild the TimeRanges from the three parallel lists
        Set<TimeRange> openingHours = new HashSet<TimeRange>();
        for (int i = 0; i < dayOfWeeks.size(); i++) {
            openingHours.add(
                new TimeRange(dayOfWeeks.get(i), openingTimes.get(i), closingTimes.get(i)));
        }
        Restaurant r = new Restaurant(row.getString("name"), ...,
            row.getSet("service_area", String.class), openingHours, null);
        r.setId(id);
        return r;
    }

@crichardson
Finding available restaurants
Available restaurants =
Serve the zip code of the delivery address
AND
Are open at the delivery time
public interface AvailableRestaurantRepository {
    List<AvailableRestaurant>
        findAvailableRestaurants(Address deliveryAddress, Date deliveryTime);
    ...
}

@crichardson
Finding available restaurants on Monday, 6.15pm for
94619 zipcode
Straightforward three-way join
select r.*
from restaurant r
  inner join restaurant_time_range tr
    on r.id = tr.restaurant_id
  inner join restaurant_zipcode sa
    on r.id = sa.restaurant_id
where '94619' = sa.zip_code
  and tr.day_of_week = 'monday'
  and tr.openingtime <= 1815
  and 1815 <= tr.closingtime

@crichardson
MongoDB = easy to query
{
  serviceArea: "94619",
  openingHours: {
    $elemMatch: {
      "dayOfWeek": "Monday",
      "open": {$lte: 1815},
      "close": {$gte: 1815}
    }
  }
}
DBCursor cursor = collection.find(qbeObject);
while (cursor.hasNext()) {
  DBObject o = cursor.next();
  …
}
db.availableRestaurants.ensureIndex({serviceArea: 1})
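Why `$elemMatch` matters here: it requires a single `openingHours` element to satisfy all three conditions at once. The plain-Java sketch below mimics that semantics and contrasts it with a looser match in which each condition may be satisfied by a different array element. `ElemMatchSketch` and its methods are hypothetical names for this illustration and do not use the MongoDB driver.

```java
import java.util.List;

class ElemMatchSketch {
    static final class TimeRange {
        final int dayOfWeek;
        final int open;
        final int close;

        TimeRange(int dayOfWeek, int open, int close) {
            this.dayOfWeek = dayOfWeek;
            this.open = open;
            this.close = close;
        }
    }

    /** $elemMatch-style: ONE element must satisfy every condition. */
    static boolean isOpen(List<TimeRange> openingHours, int dayOfWeek, int time) {
        return openingHours.stream().anyMatch(tr ->
                tr.dayOfWeek == dayOfWeek && tr.open <= time && time <= tr.close);
    }

    /** Looser match: each condition may be satisfied by a different element. */
    static boolean looselyMatches(List<TimeRange> openingHours, int dayOfWeek, int time) {
        return openingHours.stream().anyMatch(tr -> tr.dayOfWeek == dayOfWeek)
                && openingHours.stream().anyMatch(tr -> tr.open <= time)
                && openingHours.stream().anyMatch(tr -> time <= tr.close);
    }
}
```

For a restaurant open Monday 1130-1430 and Tuesday 1730-2130, a Monday 1815 delivery fails `isOpen` but passes `looselyMatches`, which is the false positive that `$elemMatch` prevents.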

@crichardson
Using Spring Data for Mongo
@Repository
class RestaurantRepositoryMongoDbImpl implements RestaurantRepository {
    @Override
    public List<AvailableRestaurant> findAvailableRestaurants(
            Address deliveryAddress, Date deliveryTime) {
        int timeOfDay = DateTimeUtil.timeOfDay(deliveryTime);
        int dayOfWeek = DateTimeUtil.dayOfWeek(deliveryTime);
        // Same query as the JSON version: zip code match plus one opening-hours element that covers the time
        Query query =
            new Query(
                where("serviceArea").is(deliveryAddress.getZip())
                    .and("openingHours")
                    .elemMatch(
                        where("dayOfWeek").is(dayOfWeek)
                            .and("openingTime").lte(timeOfDay)
                            .and("closingTime").gte(timeOfDay)));
        return mongoTemplate.find(
            query, AvailableRestaurant.class,
            AVAILABLE_RESTAURANTS_COLLECTION);
    }
}

@crichardson
BUT how to do this with
Cassandra??!
• How can Cassandra support a query that has
  • A 3-way join
  • Multiple =
  • > and <
  ?
→ We need to denormalize the data!!

@crichardson
Simplification #1: Denormalization
Restaurant_id  Day_of_week  Open_time  Close_time  Zip_code
1              Monday       1130       1430        94707
1              Monday       1130       1430        94619
1              Monday       1730       2130        94707
1              Monday       1730       2130        94619
2              Monday       0700       1430        94619
…
SELECT restaurant_id
FROM time_range_zip_code
WHERE day_of_week = 'Monday'
  AND zip_code = 94619
  AND 1815 < close_time
  AND open_time < 1815
Simpler query:
• No joins
• Two = and two <
@crichardson
Simplification #2: Application filtering
SELECT restaurant_id, open_time
FROM time_range_zip_code
WHERE day_of_week = 'Monday'
  AND zip_code = 94619
  AND 1815 < close_time
(the open_time < 1815 condition moves out of the query and is checked by the application)
Even simpler query
• No joins
• Two = and one <
This is a CQL query!
@crichardson
Available restaurants table
create table available_restaurants (
  id int,
  name text,
  zip_code text,
  day_of_week int,
  open_time int,
  close_time int,
  primary key ((zip_code, day_of_week), close_time, id)
);
• Compound primary key: ((zip_code, day_of_week), close_time, id)
• Composite partition key = row key: (zip_code, day_of_week)
• Clustering columns (close_time, id) prefix column names

@crichardson
Cassandra available_restaurants table
primary key ((zip_code, day_of_week), close_time, id)
Row key (zipcode:day of week)  Column name (close_time:id:≪column name≫)  Value
94619:Monday                   1430:1:name                                Ajanta
                               1430:1:open_time                           1130
                               1730:1:name                                Ajanta
                               1730:1:open_time                           2130
                               1430:2:name                                Egg shop
                               1430:2:open_time                           0800

@crichardson
Finding available restaurants
select *
from available_restaurants
where
zip_code = '94619'
and day_of_week = 1
and close_time > 1815;
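The query above is efficient because the two equality conditions select one partition and `close_time` is the first clustering column, so matching rows are contiguous. The toy model below imitates that layout with nested sorted maps and applies the `open_time` check in the application. It is an illustration only: `AvailableRestaurantsTable` and its methods are hypothetical, and no Cassandra API is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AvailableRestaurantsTable {
    static final class Row {
        final int id;
        final String name;
        final int openTime;
        final int closeTime;

        Row(int id, String name, int openTime, int closeTime) {
            this.id = id;
            this.name = name;
            this.openTime = openTime;
            this.closeTime = closeTime;
        }
    }

    // partition key "zip:day" -> close_time -> id -> row (sorted like clustering columns)
    private final Map<String, TreeMap<Integer, TreeMap<Integer, Row>>> partitions = new HashMap<>();

    void insert(String zipCode, int dayOfWeek, Row row) {
        partitions.computeIfAbsent(zipCode + ":" + dayOfWeek, k -> new TreeMap<>())
                .computeIfAbsent(row.closeTime, k -> new TreeMap<>())
                .put(row.id, row);
    }

    /** Range scan on close_time > time inside one partition, then filter on open_time. */
    List<String> findAvailable(String zipCode, int dayOfWeek, int time) {
        List<String> names = new ArrayList<>();
        TreeMap<Integer, TreeMap<Integer, Row>> partition = partitions.get(zipCode + ":" + dayOfWeek);
        if (partition == null) {
            return names;
        }
        for (TreeMap<Integer, Row> sameCloseTime : partition.tailMap(time, false).values()) {
            for (Row row : sameCloseTime.values()) {
                if (row.openTime <= time) {
                    names.add(row.name);
                }
            }
        }
        return names;
    }
}
```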

@crichardson
Cassandra query
@Repository
class AvailableRestaurantRepositoryCassandraImpl
        implements RestaurantRepository {
    // Constructor signature assumed to match the earlier Java Driver slides
    public AvailableRestaurantRepositoryCassandraImpl(Session session) {
        this.findAvailable = session.prepare(
            "Select open_time, restaurant_name " +
            "From restaurants.available_restaurants " +
            "Where zip_code = ? " +
            "And day_of_week = ? " +
            "And close_time >= ?;"
        );
        …
    }

@crichardson
Cassandra query
@Repository
class AvailableRestaurantRepositoryCassandraImpl implements
        RestaurantRepository {
    @Override
    public List<AvailableRestaurant> findAvailableRestaurants(
            Address deliveryAddress, Date deliveryTime) {
        List<AvailableRestaurant> result =
            new ArrayList<AvailableRestaurant>();
        int timeOfDay = DateTimeUtil.timeOfDay(deliveryTime);
        BoundStatement bound = findAvailable.bind(deliveryAddress.getZip(),
            DateTimeUtil.dayOfWeek(deliveryTime), timeOfDay);
        for (Row row : session.execute(bound).all()) {
            // Application-side filter: the query only restricts close_time
            if (row.getInt("open_time") <= timeOfDay) {
                result.add(
                    new AvailableRestaurant(row.getString("restaurant_name"))
                );
            }
        }
        return result;
    }
}

@crichardson
NoSQL = denormalized representation for each query

@crichardson
Sorry Ted!
http://en.wikipedia.org/wiki/Edgar_F._Codd

@crichardson
About Cassandra and MongoDB
• Cassandra:
  • Efficient storage of complex aggregates
  • Limited queries requiring denormalized representation
• MongoDB:
  • Efficient storage of complex aggregates
  • Rich ad hoc queries
But where they get really interesting is
when it comes to scaling

Scaling MongoDB: Replica Sets
Replica Set: Mongod (primary) + 2 × Mongod (secondary)
• Client connects to seed servers
• Writes and consistent reads go to the primary
• The primary replicates to the secondaries; reads from a secondary may be inconsistent
• Automatic master election
http://docs.mongodb.org/manual/replication/

Scaling MongoDB: Sharding
• Client → Mongos (one or more) → shards
• Mongos uses key-based routing or scatter/gather
• Config Server: 3 × mongod
• Replica Set 1 (aka. Shard 1): Mongod (primary) + 2 × Mongod (secondary)
• Replica Set 2 (aka. Shard 2): Mongod (primary) + 2 × Mongod (secondary)
http://docs.mongodb.org/manual/core/sharding-introduction/

@crichardson
MongoDB Sharding
• Collection is partitioned into chunks
• Each shard is responsible for one or more chunks
• Range-based sharding
• Each chunk is responsible for a range of keys
• Efficient execution of range queries BUT risk of uneven distribution
• Hash-based sharding
• Key is hashed and mapped into chunk
• Good distribution BUT range queries processed by all shards
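The trade-off between the two strategies can be seen in a small routing sketch. It is a simplification, not MongoDB's implementation: `ChunkRouter` is a hypothetical class, chunks are identified only by their lower bound, and the hash variant maps straight to a shard number instead of to a chunk.

```java
import java.util.Map;
import java.util.TreeMap;

class ChunkRouter {
    // Lower bound (inclusive) of each chunk's key range -> shard that owns the chunk
    private final TreeMap<String, String> chunkLowerBounds = new TreeMap<>();

    void addChunk(String lowerBoundInclusive, String shard) {
        chunkLowerBounds.put(lowerBoundInclusive, shard);
    }

    /** Range-based: neighbouring keys land in the same chunk, so range queries hit few shards. */
    String shardForRange(String key) {
        Map.Entry<String, String> chunk = chunkLowerBounds.floorEntry(key);
        if (chunk == null) {
            throw new IllegalArgumentException("no chunk covers key " + key);
        }
        return chunk.getValue();
    }

    /** Hash-based: neighbouring keys scatter, so load is even but range queries touch all shards. */
    static int shardForHash(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }
}
```

With chunks starting at `""` and `"50000"`, the zip codes 94619 and 94707 both route to the second chunk's shard; if most restaurants are in that range, that shard becomes the hot spot the slide warns about.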

@crichardson
MongoDB reads and writes
• Writes
• Trade-off: request latency vs. safety
• No acknowledgement!
• Acknowledgement by primary or by primary & N - 1 replicas
• Acknowledgement after committing to journal
• Tag-based, e.g. write to servers in different data centers
• Reads
• Read uncommitted isolation - reads can return data that has not been committed yet
• Master - the default
• Secondary - if stale data is ok
• Use tags
{ w: N,
j: true/false,
wtimeout: timeout
}

@crichardson
Cassandra cluster
• Key → Partitioner → 64/128-bit hash (a.k.a. token)
• Partitioner hash functions: MurmurHash, MD5
• VNode owns a range of hash values
• Node owns a collection of vnodes
• Replicas (labelled on the ring diagram)
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
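The token-to-node lookup can be sketched as a sorted ring. This is a toy model under stated assumptions: `TokenRing` is a hypothetical class, the caller supplies the already-hashed token (the partitioner itself is not implemented), and replica placement is left out. Each vnode is taken to own the range ending at its own token, with wrap-around at the end of the ring.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenRing {
    // vnode token -> physical node that owns the vnode
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addVNode(long token, String node) {
        ring.put(token, node);
    }

    /** Owner = first vnode whose token is >= the key's token, wrapping to the smallest token. */
    String ownerOf(long keyToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no vnodes");
        }
        Map.Entry<Long, String> vnode = ring.ceilingEntry(keyToken);
        return (vnode != null ? vnode : ring.firstEntry()).getValue();
    }
}
```

Because one physical node owns many small, scattered vnodes, adding or removing a node moves many small ranges spread across the cluster instead of one large contiguous range.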

@crichardson
Multiple data centers
DC 1 DC 2

@crichardson
Cassandra reads and writes
• Any node can handle any request
• Plays the role of coordinator
• Communicates with replica nodes
• Write request
• Update is written to commit log of one or more replicas
• Other replicas are updated asynchronously
• Read request
• Read data from one or more replicas
• Choose the most recent data based on timestamp
• Read repair: sends updates to stale replicas
No master!
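The read path above (pick the most recent data by timestamp, then repair stale replicas) can be sketched in a few lines. The names `ReadCoordinator`, `Versioned`, `mostRecent` and `staleReplicas` are hypothetical, and a real coordinator does considerably more (digests, timeouts, tie-breaking); this only illustrates the timestamp comparison.

```java
import java.util.ArrayList;
import java.util.List;

class ReadCoordinator {
    /** A value as returned by one replica, with its write timestamp. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Picks the response with the highest timestamp; expects at least one response. */
    static Versioned mostRecent(List<Versioned> replicaResponses) {
        Versioned newest = replicaResponses.get(0);
        for (Versioned candidate : replicaResponses) {
            if (candidate.timestamp > newest.timestamp) {
                newest = candidate;
            }
        }
        return newest;
    }

    /** Read repair targets: indexes of replicas whose response is older than the newest one. */
    static List<Integer> staleReplicas(List<Versioned> replicaResponses) {
        long newestTimestamp = mostRecent(replicaResponses).timestamp;
        List<Integer> stale = new ArrayList<>();
        for (int i = 0; i < replicaResponses.size(); i++) {
            if (replicaResponses.get(i).timestamp < newestTimestamp) {
                stale.add(i);
            }
        }
        return stale;
    }
}
```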

@crichardson
Cassandra read and write
consistency
• For each read and write request you specify:
• How many nodes to read/write before responding
• Local (single DC) vs. Multi-DCs
• All replicas in all DCs will eventually be updated
• Trade-off:
• More nodes: greater consistency but less availability and higher latency
• Fewer nodes: less consistency but higher availability and lower latency
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html

@crichardson
Consistency examples
• High-performance, high-availability writes, e.g. logging
• Write consistency of ANY - the write succeeds even if all replicas are down
• Read consistency of ONE - any replica
• Consistent reads
• (nodes_written + nodes_read) > replication_factor
• Read/Write consistency of LOCAL_QUORUM
• Globally consistent reads
• Read/write consistency of QUORUM
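The rule `(nodes_written + nodes_read) > replication_factor` is simple arithmetic: if the write set and the read set together exceed the number of replicas, they must overlap in at least one replica that holds the latest write. A small sketch, with the hypothetical helper class `ConsistencyMath`:

```java
class ConsistencyMath {
    /** Quorum size: a majority of the replicas. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must overlap every write set in at least one replica. */
    static boolean readsSeeLatestWrite(int nodesWritten, int nodesRead, int replicationFactor) {
        return nodesWritten + nodesRead > replicationFactor;
    }
}
```

With a replication factor of 3, quorum is 2, so quorum writes plus quorum reads give 2 + 2 > 3 and reads are consistent. Writing and reading at ONE gives 1 + 1, which is not greater than 3, so a read may return stale data.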

@crichardson
Comparing Cassandra and
MongoDB
• Cassandra
  • Replica model
  • Write to any replica (or node)
  • Sync locally/async globally
• MongoDB
  • Master/slave model
  • Write to master
  • Sync to possibly remote master

@crichardson
Summary
• Each SQL/NoSQL database = set of tradeoffs
• NoSQL databases:
• Diverse
• Aggregate-oriented (typically)
• Use query-oriented data modeling (typically)
• Polyglot persistence: leverage the strengths of SQL and NoSQL
databases

@crichardson
Questions?
@crichardson chris@chrisrichardson.net
http://plainoldobjects.com
