Strata NYC 2016. Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.
Data Modeling for Microservices with Cassandra and Spark
1. Strata + Hadoop World NYC Sept 26-29, 2016
Jeff Carpenter, Choice Hotels International
Data modeling for microservices with Cassandra and Spark
2. Strata + Hadoop World NYC Sept 26-29, 2016
Agenda
1 IT Transformation – Distribution and Analytics
2 Creating a Data Architecture
3 Data Modeling for Microservices
4 Using Metadata for Diagnostics and Analytics
5 Challenges
3. Strata + Hadoop World NYC Sept 26-29, 2016
IT Capabilities
Corporate IT
Guest
Franchise Relations
Hotel Management
Business Intelligence
Distribution
This talk: Distribution and Business Intelligence
4. Strata + Hadoop World NYC Sept 26-29, 2016
Distribution - Central Reservation System
CRS
Web and Mobile
External Channels
Customer & Loyalty
Billing
Property Systems
Reporting & Analytics
Distribution Domain
Guest Domain
Franchisee Domain
Hotel Management Domain
Business Intelligence Domain
5. Strata + Hadoop World NYC Sept 26-29, 2016
Current Reservation System – By The Numbers
25 years
6,000 hotels
4,000 transactions / second
50 distribution channels
1 instance
6. Strata + Hadoop World NYC Sept 26-29, 2016
New Systems: Distribution and Data Platforms
Distribution Platform
Data Platform
History
Realtime data
See: Choice Hotels's journey to better understand its customers through self-service analytics
This Talk: how we model data and use the self-service platform
7. Strata + Hadoop World NYC Sept 26-29, 2016
Distribution Platform - Architecture Tenets
Cloud-native
Microservices
Open Source Infrastructure
Extensibility
Stable, Scalable, Secure
8. Strata + Hadoop World NYC Sept 26-29, 2016
What is a Microservice? (one definition)
Client
REST API
Composing Service
Entity Service
Message Driven Service
AMQ
Events
Persistence
DB
Data Ownership
9. Strata + Hadoop World NYC Sept 26-29, 2016
How can we design our data
architecture & models to be…
• Scalable?
• Extensible?
• Maintainable?
• Analytics-ready?
10. Strata + Hadoop World NYC Sept 26-29, 2016
Our Data Stack
Non-relational storage
Long Term Storage
Logging
Reporting & Analytics
Metrics
11. Strata + Hadoop World NYC Sept 26-29, 2016
Data Modeling – Then and Now
Isolated Systems
Data Dictionary
SOA and Canonical Data Model
Services own data
Conceptual Data Model
• Identifying domains and relationships
Logical Data Model
• Identifying data types and relationships
Physical Models
• Java APIs
• RESTful APIs (JSON)
• Events (JSON)
• Cassandra Schemas
12. Strata + Hadoop World NYC Sept 26-29, 2016
Conceptual Data Model - Domains
hotels, rates, inventory, offers, reservations
13. Strata + Hadoop World NYC Sept 26-29, 2016
Conceptual Data Model – Domain Relationships
Distribution Domain
• hotels
• rates
• inventory
• offers
• reservations
Guest Domain
• guest
• loyalty
Hotel Management Domain
• stay
14. Strata + Hadoop World NYC Sept 26-29, 2016
Logical Data Model – Identifying Types
Rates Domain
Composite Rate Service
Rate Plan Service
Rate Service
Rate Plan
• id
• code
• hotelId
• effectiveDates
• Conditions
Rate
• id
• ratePlanId
• productId
• hotelId
• dateSpan
Price
• condition
• amount
Product
• id
• code
• hotelId
• features
• …
15. Strata + Hadoop World NYC Sept 26-29, 2016
Standardizing Common Data Types
• Instead of a Canonical Data Model,
we standardize basic building blocks
– Feature, Category, Brand
– Geospatial
– Financial
– Time
– Contact information
Address
• lines[]
• city
• subdivision
• country
• postalCode
16. Strata + Hadoop World NYC Sept 26-29, 2016
Data Types → Microservice Identification
Internal / External Client Apps
Data Maintenance Apps
Hotel Service
Rates Service
Inventory Service
Offer Service
Reservation Service
Hotel Domain
Rates Domain
Inventory Domain
Offer Domain
Reservation Domain
17. Strata + Hadoop World NYC Sept 26-29, 2016
Physical Data Models
Physical Models
• Java APIs
• RESTful APIs (JSON)
• Events (JSON)
• Cassandra Schemas
JSON = primary definition of the data type owned by each service
18. Strata + Hadoop World NYC Sept 26-29, 2016
Key Data Types → RESTful Resource Paths
Hotel Service: /hotels
Rates Service: /rates
Inventory Service: /inventory
Offer Service: /offers
Reservation Service: /reservations
19. Strata + Hadoop World NYC Sept 26-29, 2016
Java and RESTful APIs – common pattern
REST | Java API
GET /types/<id> | Type getTypeById()
GET /types?<query parameters> | Type[] searchType(TypeSearchCriteria)
POST /types/ (JSON body) | createType(Type)
PUT /types/ (JSON body) | updateType(Type)
DELETE /types/<id> | deleteType(TypeId)
20. Strata + Hadoop World NYC Sept 26-29, 2016
Cassandra Data Modeling (an idealized view)
22. Strata + Hadoop World NYC Sept 26-29, 2016
Cassandra Data Modeling – Chebotko Diagrams
Queries: Q1, Q2, Q3, Q4, Q5
hotels_by_poi
• poi_name K
• hotel_id C↑
• name
• phone
• address
hotels
• hotel_id K
• name
• phone
• address
pois_by_hotel
• hotel_id K
• poi_name C↑
• description
available_rooms_by_hotel_date
• hotel_id K
• date C↑
• room_number C↑
• is_available
amenities_by_room
• hotel_id K
• room_id K
• amenity_name C↑
• description
23. Strata + Hadoop World NYC Sept 26-29, 2016
Cassandra Data Modeling - Physical
hotel keyspace
hotels_by_poi
• poi_name text K
• hotel_id text C↑
• name text
• phone text
• address *address*
hotels
• hotel_id text K
• name text
• phone text
• address *address*
pois_by_hotel
• hotel_id text K
• poi_name text C↑
• description text
available_rooms_by_hotel_date
• hotel_id text K
• date date C↑
• room_number smallint C↑
• is_available boolean
amenities_by_room
• hotel_id text K
• room_number smallint K
• amenity_name text C↑
• description text
*address* (user-defined type)
• street text
• city text
• state_or_province text
• postal_code text
• country text
24. Strata + Hadoop World NYC Sept 26-29, 2016
Cassandra Data Modeling - Schemas
CREATE KEYSPACE hotel
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TYPE hotel.address (
  street text,
  city text,
  state_or_province text,
  postal_code text,
  country text
);

CREATE TABLE hotel.hotels_by_poi (
  poi_name text,
  hotel_id text,
  name text,
  phone text,
  address frozen<address>,
  PRIMARY KEY ((poi_name), hotel_id)
) WITH CLUSTERING ORDER BY (hotel_id ASC);
25. Strata + Hadoop World NYC Sept 26-29, 2016
And now…
Back to reality
26. Strata + Hadoop World NYC Sept 26-29, 2016
Access Patterns and Denormalization
Access patterns:
• Locate hotel by identifier
• Find hotels within X miles of point Y
• Find hotels by city, state, country
• Find hotels by postal code
• Hotels by amenity
• Find hotels by brand
• Hotels by this
• Hotels by that
• Hotels by something else
Keyspace hotel:
• hotels_by_id
• hotels_by_brand
• hotels_by_postal_code
• …
27. Strata + Hadoop World NYC Sept 26-29, 2016
Metadata
Request Context
• Requestor
• Tracking ID
• Token
• Locale
Incoming Request
Service
AMQ
Events
Logs
ELK Stack
29. Strata + Hadoop World NYC Sept 26-29, 2016
Putting It Together – Diagnostics
Incoming Request
Service
C* (node × 4)
Data
History
Logs
Metrics
Data Platform
ELK Stack
Metrics Store
30. Strata + Hadoop World NYC Sept 26-29, 2016
Putting It Together – Long Term Storage
Data Platform
ELK Stack
Metrics Store
C* (node × 4)
Long Term Storage
31. Strata + Hadoop World NYC Sept 26-29, 2016
Separating Active and History Data
Rate + Inventory Data
Time axis, marked at "Now"
Yesterday’s data is ancient history
32. Strata + Hadoop World NYC Sept 26-29, 2016
History architecture
Data Platform - Cloudera
History capture: Service → AMQ → Kafka → Spark (node × 4) → S3
Other subscribers
History retrieval: Customer Service Apps → History Service → Impala*
33. Strata + Hadoop World NYC Sept 26-29, 2016
Microservice Data Challenges
No Joins?
Data Maintenance
Data Integrity
Cascading Deletes
Transactions
34. Strata + Hadoop World NYC Sept 26-29, 2016
Distributed Transactions, Anyone?
Booking Client
Reservation Service: commit the contract (reservations)
Inventory Service: reserve the inventory (inventory)
Data Maintenance Apps: data synchronization
35. Strata + Hadoop World NYC Sept 26-29, 2016
Alternatives to Distributed Transactions
Approach | Example | Scope
C* Lightweight Transaction | Updating inventory counts | Data Tier
C* Logged Batch | Writing to multiple denormalized hotel tables | Data Tier
Retrying failed calls | Data synchronization, reservation processing | Service
Compensating transactions | Verifying reservation processing | System
Consistency spectrum labels: Strong consistency, Eventual consistency
36. Strata + Hadoop World NYC Sept 26-29, 2016
Final Thoughts
Data Models > Microservices
Events = Streams
Use Metadata Everywhere
37. Strata + Hadoop World NYC Sept 26-29, 2016
Now Available!
Cassandra: The Definitive Guide, 2nd Edition
Completely reworked for Cassandra 3.X:
• Data modeling in CQL
• SASI indexes
• Materialized views
• Lightweight transactions
• DataStax drivers
• New chapters on security, deployment, and integration
38. Strata + Hadoop World NYC Sept 26-29, 2016
Contact Info
@choicehotels
careers.choicehotels.com
@jscarp
jeffreyscarpenter
Editor's notes
Thanks for staying for the last session of the conference
I’m Jeff Carpenter, system architect at CHI
In this session we’re going to talk about why data modeling is important for both transactional and analytics systems, and how we’ve put this into practice at Choice as we’re building new systems.
Overview of what we'll cover
Choice Hotels is a technology-centric company. We operate according to a franchise model, and a lot of the value proposition of our IT organization is based on the services we provide to our franchisees. As a result we’re continually innovating and looking to modernize key systems.
We’ve divided our IT capabilities into several domains, including guest, franchise relations, hotel property management, and our corporate IT services.
This presentation focuses especially on systems in the distribution domain, such as our reservation system, and the business intelligence domain, which includes analytics and reporting systems.
In particular, we’re going to focus on the relationship between the distribution domain and the BI domain
Key systems in the distribution domain include the Central Reservation System (CRS), our website and mobile apps.
Last year we launched new versions of our website and mobile app, and we are currently working on a new reservation system
The reservation system interfaces to many of our other IT systems so replacing it is a major undertaking
Internal channels like our website and mobile applications allow customers to shop and book rooms
External channels as well
We interface with property systems so our franchisees can tell us about their room types, rates, and inventory
We interface with customer and loyalty systems to credit stays and support reward reservations
Reporting and billing systems pull information about reservations
Our current reservation system is over 25 years old - written in C and running on a large UNIX box with traditional RDBMS
We’re currently making reservations for over 6000 hotels worldwide, and distributing through over 50 different channels: everything from our own website and mobile apps to GDS and OTA partners
This system is very performant and reliable, servicing over 4000 TPS
However, the system scales vertically - we need horizontal scalability for future growth
Pulling back the covers a bit, we have the unique opportunity to make major improvements to two important areas of the enterprise at once
We’re replacing our legacy Central Reservation System with a new Distribution Platform which we hope will have the longevity of the previous one
We’re also modernizing our business intelligence approach with a new data platform
One of the major themes of this talk is the relationship between these two systems, and the role that data modeling plays in it.
Specifically, we’re using the data platform to capture changes as they occur in the distribution platform for analytics purposes
There are also some use cases in the distribution system where we need to access that historic data for customer service and diagnostic purposes, so we are actually implementing a limited capability to pull some of that data back out.
I’m not going to give a detailed presentation of the self-service data platform because my colleague Narasimhan from Choice and Avinash from Clairvoyant have already provided a great talk on that earlier today.
Instead we’ll focus on how the distribution platform makes use of capabilities of the data platform
Here are some of the tenets of our architecture:
We designed for the cloud to run anywhere, in multiple data centers worldwide
We wanted a microservices architecture based primarily on RESTful APIs and event publishing
We use open source infrastructure as much as feasible
Since this system needs to work for the next 25 years, we want a design which is easily extensible to new features and business areas
The key architectural –ilities we repeat again and again are scalability, stability and security
So when I say we have a microservice based architecture, that could mean a lot of things
We use a mixture of synchronous and asynchronous approaches in order to support shopping and booking hotel rooms and notification of various partners
In our architecture, a typical microservice exposes a RESTful API which allows it to be accessed by clients.
The entity service manages the persistence of a specific data type, and publishes events when data is created, updated, or deleted.
We have other types of services which compose the entity services, and message driven services which respond to events and generate other events or interface to external systems.
But the bedrock principle I want to call out is data ownership. Every data type is owned by a single service, and that service alone manages its persistence.
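To make the entity-service idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not our production code: the names `EntityService` and `Event` are hypothetical, a dictionary stands in for the database, and a plain callback stands in for the message broker.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """Change notification published by an entity service (hypothetical shape)."""
    action: str       # "created", "updated", or "deleted"
    entity_type: str
    entity_id: str
    payload: dict


@dataclass
class EntityService:
    """Owns one data type; nothing else reads or writes its store directly."""
    entity_type: str
    publish: Callable[[Event], None]  # stand-in for publishing to a message broker
    _store: Dict[str, dict] = field(default_factory=dict)  # stand-in for the database

    def create(self, entity_id: str, data: dict) -> None:
        self._store[entity_id] = dict(data)
        self.publish(Event("created", self.entity_type, entity_id, dict(data)))

    def update(self, entity_id: str, data: dict) -> None:
        if entity_id not in self._store:
            raise KeyError(entity_id)
        self._store[entity_id] = dict(data)
        self.publish(Event("updated", self.entity_type, entity_id, dict(data)))

    def delete(self, entity_id: str) -> None:
        removed = self._store.pop(entity_id)  # raises KeyError if absent
        self.publish(Event("deleted", self.entity_type, entity_id, removed))

    def get(self, entity_id: str) -> dict:
        return dict(self._store[entity_id])


# Usage: collect published events in a list instead of a real broker.
published: List[Event] = []
hotels = EntityService("hotel", published.append)
hotels.create("H1", {"name": "Example Inn"})
hotels.update("H1", {"name": "Example Inn & Suites"})
hotels.delete("H1")
```

Every write goes through the owning service, so every change produces exactly one event that other services can react to.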
So, designing a new distribution platform – a greenfield design, so many choices available, the world is new…
First, let’s talk about our technology selection - these are a few of the elements in our data stack
We use Cassandra as our primary data store
We use Amazon S3 and Glacier for medium to long term storage
We use the ELK stack for logging
We use Spark, Impala and other technologies in our data platform for reporting and analytics
We use KairosDB as our metrics store and Grafana to construct operational dashboards
For messaging, we use ActiveMQ when message ordering is required and Kafka for fast streaming between systems
With that technology foundation, we recognized at the project outset the important role that data modeling would play in the development of our reservation system, especially since it touches so many other systems
In the past, our enterprise consisted of multiple stovepipe systems, each with their own data models. There were efforts to reconcile these in a corporate data dictionary, which was a massive undertaking.
As we started doing more SOA style work, we began wrapping many of these systems, and developed a canonical data model as a way to enforce a common language across these services. This is a centralized definition.
This proved difficult to maintain, and when we started work on the new reservation system, we decided to allow each service to own its own data model.
We used the classic levels of data modeling – conceptual, logical, and physical, to drive our identification of data types in the system and the microservices that manage each data type, and then the various physical representations we need to drive software development.
The next few slides take us through that process.
Let’s introduce some of the key data types within the distribution domain
Hotels - descriptive data about the hotels and their products, and policies. Quite static
Rates - prices that are charged for the products.
These can change many times a day, and could involve an automated pricing system
Inventory - constantly changes as rooms are booked, cancelled, etc. Data quality and currency is extremely important here so we don’t oversell our hotels
Reservations - contract with the customer. Generally only changed when initiated by the customer, infrequent changes
(Talk to nuances on inventory buckets, rate plans, packages, rules)
An important part of our conceptual data model was defining the boundaries and relationships between domains
This includes the distribution domain and its sub-domains, and relationships to other domains
As we see, the inventory and rates domains reference the hotel domain – the inventory and rates are for products at specific hotels
Offers and reservations, in turn reference hotels, inventory and rates
As we look to relationships outside the distribution domain, a reservation can reference customer and loyalty accounts that are managed by other systems. In these cases our reservation system holds references to those external data types, since the system of record is external
Another interesting case comes when there are data types that form the boundary of relationships between systems. This occurs in the case of the reservation. Reservations are created and managed by both central reservation systems and by property management systems which reside in the Hotel Management domain.
In our case, we have an internal representation of a reservation which forms the basis of our exchange with the property management systems. They also need a copy of the reservation so they can manage the guest stay.
While it is possible for a reservation itself to be updated while the guest is on property (for example, adding an extra night) we’ve made a clear boundary so that stay information doesn’t start creeping into our reservation definition.
Let’s consider the rates domain, this is a sub-domain within distribution
As we begin to model this in UML, we see there are distinct types for rates and rate plans. Rate plans comprise the rules that hotels use to describe how to get access to a particular set of rates
The rates describe what customers will be charged on a given day, at a given hotel, in association with a rate plan
The rates themselves may consist of multiple price points, for example, a one person rate, a two person rate, an extra person rate.
We may have references from these data types to data types outside the rates domain. For example a rate references a product or products to which it applies. This is a unique ID reference
We draw boundaries around portions of the data model to be owned by each service. In this example we derive microservices for rate plans and rates. In this way each service represents a bounded context.
However, from a deployment perspective, it may make sense to reduce the number of services, especially if rates and rate plans are most frequently accessed together (as they are)
Identifying potential services at a fairly low level and then potential compositions helps us make sure we maintain an extensible design.
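The rates types from the logical model can be sketched as plain Python data classes. The field names follow the slide; the field types, the `DateSpan` alias, the `prices` list on `Rate`, and the `price_for` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Tuple

DateSpan = Tuple[date, date]  # assumed representation: inclusive (start, end)


@dataclass(frozen=True)
class Price:
    condition: str   # e.g. "one person", "two persons", "extra person"
    amount: Decimal


@dataclass
class RatePlan:
    id: str
    code: str
    hotel_id: str
    effective_dates: DateSpan
    conditions: List[str] = field(default_factory=list)


@dataclass
class Rate:
    id: str
    rate_plan_id: str  # reference by ID to a RatePlan
    product_id: str    # reference by ID to a Product owned outside the rates domain
    hotel_id: str
    date_span: DateSpan
    prices: List[Price] = field(default_factory=list)

    def price_for(self, condition: str) -> Decimal:
        """Return the amount for one price point, e.g. the two-person rate."""
        for price in self.prices:
            if price.condition == condition:
                return price.amount
        raise KeyError(condition)


# Usage with made-up values.
rate = Rate(
    id="R1",
    rate_plan_id="RP1",
    product_id="P1",
    hotel_id="H1",
    date_span=(date(2016, 9, 26), date(2016, 9, 29)),
    prices=[Price("one person", Decimal("99.00")), Price("two persons", Decimal("109.00"))],
)
```

Note that `Rate` holds only the IDs of the rate plan and product, not copies of them, which is what keeps each type inside its own bounded context.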
An example of the common building blocks that demonstrates why standardization is important is the concept of an address.
A problem with some of our historic systems has been inconsistent support for the number of address lines: many were coded with two lines, some with three. We’re constructing our new services to support addresses with a variable number of lines, and using validation to control how long the list can be.
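A small Python sketch of the address building block with a validated, variable-length list of lines. The field names come from the slide; the limit of four lines (`MAX_ADDRESS_LINES`) is a made-up value for illustration, not our actual rule.

```python
from dataclasses import dataclass
from typing import List

MAX_ADDRESS_LINES = 4  # hypothetical limit, chosen only for this example


@dataclass
class Address:
    lines: List[str]
    city: str
    subdivision: str
    country: str
    postal_code: str

    def __post_init__(self) -> None:
        # Validation controls the list length instead of fixed line1/line2 fields.
        if not 1 <= len(self.lines) <= MAX_ADDRESS_LINES:
            raise ValueError(
                f"address must have 1 to {MAX_ADDRESS_LINES} lines, got {len(self.lines)}"
            )
        if any(not line.strip() for line in self.lines):
            raise ValueError("address lines must not be blank")


# Usage: two-line and three-line addresses share one definition.
two_lines = Address(["1 Main St", "Suite 200"], "Rockville", "MD", "US", "20850")
three_lines = Address(["Building C", "1 Main St", "Suite 200"], "Rockville", "MD", "US", "20850")
```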
This summarizes the sort of results that arise from identifying services based on the logical data model: we end up with microservices organized around the domains of hotels, rates, inventory, offers, and reservations
Each service is the owner of a specific set of data types, approaching a shared-nothing architecture style.
We’re then able to build client applications such as our website and mobile apps on top of these services, and build integrations with external partners
We also built data maintenance applications to:
Synchronize data from other systems, including our legacy system and systems that will stay in operation, such as property management systems
Verify data accuracy across systems and across service boundaries
Correct data issues caused by defects
Once we’ve identified our services, we can approach the physical data modeling associated with each service as an internal concern of that service.
This includes the Java and RESTful APIs, events published by the service, and Cassandra schemas for data storage
These representations are all derived from the logical data model
We decided that the JSON representation of the resource owned by the service was the authoritative definition of the resource from the perspective of external services
Services are organized around the RESTful resource paths they own
Consider these RESTful resource paths as namespaces – need to manage these as well
Our usage of RESTful APIs helped reinforce the focus around data types.
While we do not adhere to some of the strictest definitions of what is RESTful, focusing on resources rather than actions helped keep our APIs clean and relatively free of RPC-style interactions
The common pattern that emerged for both our Java and RESTful APIs was to have simple CRUD operations
The cookie cutter approach was helpful in being able to generate common templates to kickstart development on each new service
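The CRUD pattern can be sketched as a tiny dispatcher that maps each REST call onto one service method. This is a Python illustration only (our services expose Java APIs); `TypeService` and `dispatch` are hypothetical names, and a dictionary stands in for persistence.

```python
from typing import Dict, List, Optional


class TypeService:
    """Generic CRUD service for one data type, keyed by the item's "id" field."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def get_type_by_id(self, type_id: str) -> dict:
        return dict(self._items[type_id])

    def search_type(self, criteria: dict) -> List[dict]:
        # An item matches when every criterion equals the item's field value.
        return [
            dict(item)
            for item in self._items.values()
            if all(item.get(key) == value for key, value in criteria.items())
        ]

    def create_type(self, item: dict) -> None:
        if item["id"] in self._items:
            raise ValueError("already exists: " + item["id"])
        self._items[item["id"]] = dict(item)

    def update_type(self, item: dict) -> None:
        if item["id"] not in self._items:
            raise KeyError(item["id"])
        self._items[item["id"]] = dict(item)

    def delete_type(self, type_id: str) -> None:
        del self._items[type_id]


def dispatch(service: TypeService, method: str, path: str,
             query: Optional[dict] = None, body: Optional[dict] = None):
    """Route a request under /types to the matching CRUD method."""
    parts = [part for part in path.split("/") if part]  # "/types/42" -> ["types", "42"]
    if parts[:1] != ["types"]:
        raise ValueError("unknown resource path: " + path)
    type_id = parts[1] if len(parts) > 1 else None
    if method == "GET" and type_id:
        return service.get_type_by_id(type_id)
    if method == "GET":
        return service.search_type(query or {})
    if method == "POST":
        return service.create_type(body)
    if method == "PUT":
        return service.update_type(body)
    if method == "DELETE" and type_id:
        return service.delete_type(type_id)
    raise ValueError("unsupported request: " + method + " " + path)
```

Because every service follows the same shape, a template like this only needs the resource name and data type swapped out to start a new service.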
One of the tenets of Cassandra data modeling is to identify access patterns and design tables around those access patterns
We followed this pretty strictly at first, but soon ran into cases where adding a table per unique access pattern proved to be too much
Take for example hotels and the number of ways by which various clients could search for hotels
Since the hotel records are quite large, imagine the impact of all of these tables on our cluster size and storage requirements for 6000+ hotels.
We reined this in by designing tables to support multiple queries, making selective use of indexes, and doing some filtering at the service layer, which also helped control our computing costs.
We’re also looking to move to Cassandra 3.X in order to take advantage of materialized views, which will allow us to shift some of the processing burden back to the database
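As a rough illustration of serving several access patterns from one table, the Python sketch below reads a single partition and filters the rows in the service. The in-memory dictionary stands in for a table such as hotels_by_postal_code; the row fields, brand names, and `find_hotels` function are hypothetical.

```python
from typing import Dict, List, Optional

# Stand-in for one denormalized table: partition key (postal code) -> rows.
HOTELS_BY_POSTAL_CODE: Dict[str, List[dict]] = {
    "10001": [
        {"hotel_id": "H1", "brand": "BrandA", "amenities": ["pool", "wifi"]},
        {"hotel_id": "H2", "brand": "BrandB", "amenities": ["wifi"]},
        {"hotel_id": "H3", "brand": "BrandA", "amenities": ["wifi"]},
    ],
}


def find_hotels(postal_code: str, brand: Optional[str] = None,
                amenity: Optional[str] = None) -> List[dict]:
    """Read one partition, then narrow the rows in the service layer."""
    rows = list(HOTELS_BY_POSTAL_CODE.get(postal_code, []))  # single-partition read
    if brand is not None:
        rows = [row for row in rows if row["brand"] == brand]
    if amenity is not None:
        rows = [row for row in rows if amenity in row["amenities"]]
    return rows
```

The trade-off is that the service reads more rows than it returns, which is acceptable only while partitions stay small; in exchange we avoid keeping a separate copy of each large hotel record for every combination of search criteria.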
Switching gears, the concept of metadata was very important to us. Keeping a common request context helps us track interactions between services, correlate events, find key interactions in log files, and so on
Putting this all together in the context of a service
A service receives an incoming request with data including metadata
The data is written to Cassandra
An event is generated which is captured as history
The operation, metadata and data identifiers are logged
The elapsed time for the operation is captured as metrics
Now the operations team is able to configure alarms on service state and metrics
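The steps above can be sketched in Python as follows. This is an illustration under assumptions, not our implementation: in-memory lists and a dictionary stand in for Cassandra, the event queue, the log pipeline, and the metrics store, and `InventoryService` and its `put` method are hypothetical names. The request context fields match the metadata slide.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RequestContext:
    requestor: str
    tracking_id: str
    token: str
    locale: str


class InventoryService:
    def __init__(self) -> None:
        self.table: Dict[str, dict] = {}             # stand-in for the data store
        self.events: List[dict] = []                 # stand-in for the event queue
        self.logs: List[str] = []                    # stand-in for the log pipeline
        self.metrics: List[Tuple[str, float]] = []   # stand-in for the metrics store

    def put(self, ctx: RequestContext, key: str, data: dict) -> None:
        started = time.perf_counter()
        # 1. Write the data.
        self.table[key] = dict(data)
        # 2. Publish an event carrying the metadata, to be captured as history.
        self.events.append({
            "action": "updated", "key": key, "data": dict(data),
            "tracking_id": ctx.tracking_id, "requestor": ctx.requestor,
        })
        # 3. Log the operation, metadata, and identifiers (never the token).
        self.logs.append(
            f"op=put key={key} tracking_id={ctx.tracking_id} requestor={ctx.requestor}"
        )
        # 4. Record elapsed time as a metric.
        self.metrics.append(("put.elapsed_seconds", time.perf_counter() - started))


# Usage with made-up values.
ctx = RequestContext("web", "trk-123", "secret-token", "en_US")
service = InventoryService()
service.put(ctx, "H1:2016-09-27", {"available": 5})
```

Because the same tracking ID lands in the event and the log line, one search on that ID ties together everything a request touched.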
We have policies in place to ensure all of our application, logging and metrics data is captured in appropriate tools for long term storage and archival in S3 and Glacier
We have separated the shopping and booking concerns from our analysis and history uses, which means that in the reservation system, data about the past is of little use.
As we insert our data into Cassandra, we set the TTL for when it will no longer be needed, which saves us from developing our own cleanup process and reduces our storage footprint.
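One way to derive the TTL is from the stay date the row describes. The Python sketch below is an assumption about how that could look, not our actual policy: the one-day grace period and the `ttl_seconds` helper are hypothetical, and the CQL string simply shows where the value would be bound, using the available_rooms_by_hotel_date table from the earlier slide.

```python
from datetime import date, datetime, time, timedelta, timezone

# CQL accepts a bind marker for the TTL value in a prepared statement.
INSERT_WITH_TTL = (
    "INSERT INTO hotel.available_rooms_by_hotel_date "
    "(hotel_id, date, room_number, is_available) "
    "VALUES (?, ?, ?, ?) USING TTL ?"
)


def ttl_seconds(stay_date: date, now: datetime,
                grace: timedelta = timedelta(days=1)) -> int:
    """Seconds from `now` until the end of `stay_date` (UTC) plus a grace period."""
    end_of_stay_date = datetime.combine(stay_date, time.min, tzinfo=timezone.utc) + timedelta(days=1)
    remaining = int((end_of_stay_date + grace - now).total_seconds())
    # A TTL of 0 means "never expire" in Cassandra, so clamp to at least 1 second.
    return max(remaining, 1)


# Usage: a row for the night of 2016-09-27, written at noon UTC on 2016-09-26.
example_ttl = ttl_seconds(date(2016, 9, 27), datetime(2016, 9, 26, 12, 0, tzinfo=timezone.utc))
```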
We still need the historic data for analysis and customer service purposes, though, so we store it in a separate data platform which we feed from the reservation system using asynchronous event processing
Let’s talk about how we architected the history features of our platform.
Since we’re already capturing the event streams in the data platform, we can reach back into that platform to access data for our customer service applications. These are the applications we use to help answer customer questions about their reservation including what was changed, when it was changed, and who changed it. We also use these applications when things go wrong to diagnose problems
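To show how captured events can answer "what changed, when, and who changed it", here is a minimal Python sketch. It assumes each event carries a full snapshot of the reservation in `data`, plus `timestamp` and `requestor` fields; those field names and the `change_history` function are hypothetical.

```python
from typing import Dict, List


def change_history(events: List[dict]) -> List[dict]:
    """Turn a stream of reservation snapshots into field-level changes."""
    changes: List[dict] = []
    previous: Dict[str, object] = {}
    # ISO-8601 timestamps in the same timezone sort correctly as strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        current = event["data"]
        for name in sorted(set(previous) | set(current)):
            if previous.get(name) != current.get(name):
                changes.append({
                    "field": name,
                    "old": previous.get(name),
                    "new": current.get(name),
                    "when": event["timestamp"],
                    "who": event["requestor"],
                })
        previous = current
    return changes


# Usage with made-up events, deliberately out of order.
events = [
    {"timestamp": "2016-09-27T10:00:00Z", "requestor": "agent-7",
     "data": {"nights": 3, "room": "KING"}},
    {"timestamp": "2016-09-26T09:00:00Z", "requestor": "web",
     "data": {"nights": 2, "room": "KING"}},
]
history = change_history(events)
```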
I’ll refer you to Narasimhan’s talk for the complete architecture of our self-service data platform. What I’ve highlighted are the elements we use
On ingestion, we tie into the ActiveMQ event queues published by our services and bridge the events to Kafka to stream them into our Spark cluster backed by S3.
To retrieve historical data, our customer service apps call history services, which make SQL queries via Impala to retrieve the data
We work closely between the teams to manage the SLAs for this data retrieval
Here are some of the challenges we’ve encountered in working with this kind of architecture
When data is spread across multiple services, it can be hard to get a picture of the relationships between data types – you can’t just do a join in Cassandra
We’re investigating use of Spark in some environments in order to support ad-hoc searching and exploration
There are also challenges to maintaining data in this environment. Teams have to be conscious that changing data in one service may affect other services. What happens if I delete a hotel using the Hotel Service but don’t delete the inventory and rates?
We’ve had to put tighter controls around deletes and manage cascading deletes at the application level to prevent these issues. Thankfully, data deletion is more of a maintenance activity and not a regular operational practice
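A rough Python sketch of an application-level cascading delete. It is illustrative only: dictionaries stand in for the data owned by each service, and `delete_hotel` is a hypothetical helper rather than one of our maintenance tools.

```python
from typing import Dict, List


def delete_hotel(hotel_id: str, hotels: Dict[str, dict],
                 dependents: List[Dict[str, dict]]) -> int:
    """Delete dependent rows first, then the hotel; returns rows removed.

    Deleting dependents first means a failure partway through leaves the
    hotel in place, so no remaining row points at a missing hotel.
    """
    removed = 0
    for store in dependents:
        doomed = [key for key, row in store.items() if row["hotel_id"] == hotel_id]
        for key in doomed:
            del store[key]
            removed += 1
    if hotels.pop(hotel_id, None) is not None:
        removed += 1
    return removed


# Usage with made-up data.
hotels = {"H1": {"name": "Example Inn"}, "H2": {"name": "Other Inn"}}
rates = {"R1": {"hotel_id": "H1"}, "R2": {"hotel_id": "H2"}}
inventory = {"I1": {"hotel_id": "H1"}, "I2": {"hotel_id": "H1"}}
removed_count = delete_hotel("H1", hotels, [rates, inventory])
```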
Another issue comes when we need to commit changes to multiple data types at the same time. Let’s look at an example
One of the challenges of a microservices architecture is keeping changes in sync across service boundaries. One example situation is in booking a reservation.
Since the reservation represents our contract with the customer to reserve a specific room at a specific price and with certain conditions, we need to mark a reservation as committed at the same time as we reserve the inventory.
This is important so that we don’t accidentally overbook our hotel. Making the situation more complicated, there could be simultaneous bookings and data maintenance activities also trying to access the same inventory
Since these types are split across microservice boundaries, there is no transaction mechanism that covers both. In fact, since the data is in different rows (and different tables), a single Cassandra lightweight transaction cannot cover both changes.
We solved this with a layered approach: LWTs to protect inventory counts, retries within the booking service, and compensating processes to detect and clean up failures
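The layered approach can be sketched in Python as below. This is a simplified illustration, not our booking code: `InventoryStore.update_if` is an in-memory stand-in that mimics the shape of a CQL conditional update (`UPDATE ... SET available = ? WHERE ... IF available = ?`), and the function names and retry count are hypothetical.

```python
from typing import Callable


class InventoryStore:
    """In-memory stand-in for an inventory count guarded by compare-and-set."""

    def __init__(self, available: int) -> None:
        self.available = available

    def update_if(self, expected: int, new: int) -> bool:
        # Applies the write only if the current value is still `expected`.
        if self.available != expected:
            return False
        self.available = new
        return True


def reserve_room(store: InventoryStore, max_attempts: int = 3) -> bool:
    """Decrement the count, retrying if another writer got in first."""
    for _ in range(max_attempts):
        current = store.available
        if current <= 0:
            return False  # sold out: never go below zero
        if store.update_if(current, current - 1):
            return True
    return False


def release_room(store: InventoryStore, max_attempts: int = 3) -> bool:
    """Compensating action: give the room back."""
    for _ in range(max_attempts):
        current = store.available
        if store.update_if(current, current + 1):
            return True
    return False


def book(store: InventoryStore, commit_reservation: Callable[[], None]) -> bool:
    """Reserve inventory, commit the reservation, compensate on failure."""
    if not reserve_room(store):
        return False
    try:
        commit_reservation()
        return True
    except Exception:
        release_room(store)
        return False
```

In a real deployment the compensation step would also have to run when the booking service itself crashes between the two calls, which is why a separate verification process is needed in addition to the in-line cleanup shown here.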
Thankfully we have a variety of tools in our toolbox for guaranteeing consistency. Some of these are provided by Cassandra but some of them are architecture approaches.
Use data modeling to identify key types and bounded contexts, let that drive the microservice design
Events are great for decoupling services and systems, and you can leverage the streams for history as well
Use metadata across services and infrastructure to provide a common thread for debugging and performance analysis