Namely is a late-stage startup that builds HR, Payroll, and Benefits software for mid-sized businesses. Over the years, we've ended up with a number of monolithic and legacy applications covering overlapping domain concepts, which has limited our ability to deliver new and innovative features to our customers. We need a way to get our data out of the monoliths to decouple our systems and increase our velocity. We've chosen Kafka as our way to liberate our data in a reliable, scalable, and maintainable way. This talk covers specific examples of successes and missteps in our move to Kafka as the backbone of our architecture. It then looks to the future - where we are trying to go, and how we plan on getting there, from both short-term and long-term perspectives.
Key Takeaways:
● Successful and unsuccessful approaches to gradually introducing Kafka to a large organization in a way that meets the short- and long-term needs of the business.
● Successful and unsuccessful patterns for using Kafka.
● Pragmatism versus purism: building Kafka-first systems, and migrating legacy systems to Kafka with Debezium.
● Combining event-driven systems with RPC-based systems.
● Observability, alerting, and testing.
● Actionable steps that you can take to your organization to help drive adoption.
2. Meet Namely
● NYC startup
● All-in-one HR platform for midsized businesses.
● Originally just HRIS.
● Acquired a payroll company in ~2015.
… and now there are TWO monoliths.
This talk is about unifying these monoliths using Kafka.
3. Monolith Isn't A Four Letter Word!
● Monoliths aren't bad.
● Our core domain is duplicated across the two monoliths.
● Changes don't flow between systems.
● Unnecessary work for engineering, operations, and clients.
4. Why Is Kafka A Good Match?
● HR domain is naturally event-driven.
● History of a domain entity is interesting.
● Customers demand rich reporting.
● Auditing requirements are strict.
● A lot of systems to integrate.
5. We Want Domain Events
Commands are requests for state change. Commands can fail.
CreateCompany (Please)
Domain Events are facts about changes to Domain Entities as
a result of Commands.
CompanyCreated
Events can't fail - they already happened!
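The deck has no code, but a minimal sketch of the distinction may help. This is illustrative Go; none of these names are Namely's actual schema:

```go
package domain

import "time"

// CreateCompany is a Command: a request for a state change.
// Validation can reject it, so callers must handle failure.
type CreateCompany struct {
	Name    string
	Country string
}

// CompanyCreated is a Domain Event: an immutable fact recorded
// only after the command has been validated and applied.
// Events can't fail - they already happened.
type CompanyCreated struct {
	CompanyID  string
	Name       string
	Country    string
	OccurredAt time.Time
}
```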
6. Kafka As The Source Of Truth
Kafka is a log of Domain Events.
One Kafka topic per type of Domain Entity.
Other services create Views of Domain Events.
Domain Event data wrapped in Envelope with metadata.
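A sketch of such an envelope, continuing the illustrative package above (the field names are assumptions; the slide only says events are wrapped with metadata):

```go
// Envelope wraps every Domain Event written to Kafka with
// metadata that consumers need regardless of event type.
type Envelope struct {
	EventID    string    // unique per event; lets consumers deduplicate
	EntityID   string    // Domain Entity ID, also the Kafka message key
	EventType  string    // e.g. "CompanyCreated"
	OccurredAt time.Time
	Payload    []byte    // serialized Domain Event (Avro at Namely)
}
```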
7. Building Kafka-First Services
The Domain Service gets a Command (RPC), validates it, and
produces one or more Domain Events.
Readers consume events and (usually) provide a read API.
[Diagram: RPC → Domain Service (API) → CompanyCreated → Kafka → Readers]
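A sketch of that write path in the same illustrative package. EventWriter and encodeAvro are stand-ins, not a real client library, and JSON stands in for Avro only to keep the sketch runnable:

```go
import (
	"context"
	"encoding/json"
	"errors"
	"time"

	"github.com/google/uuid"
)

// EventWriter is a stand-in for a Kafka producer that appends an
// Envelope to the company topic.
type EventWriter interface {
	Write(ctx context.Context, e Envelope) error
}

// encodeAvro stands in for a real Avro serializer.
func encodeAvro(v any) ([]byte, error) { return json.Marshal(v) }

type CompanyService struct {
	events EventWriter
}

// HandleCreateCompany is the RPC entry point: validate the
// Command, then record the fact as a Domain Event. State is
// written nowhere except Kafka.
func (s *CompanyService) HandleCreateCompany(ctx context.Context, cmd CreateCompany) (string, error) {
	if cmd.Name == "" {
		return "", errors.New("company name is required") // commands can fail
	}
	id := uuid.NewString()
	payload, err := encodeAvro(CompanyCreated{
		CompanyID: id, Name: cmd.Name, Country: cmd.Country, OccurredAt: time.Now(),
	})
	if err != nil {
		return "", err
	}
	return id, s.events.Write(ctx, Envelope{
		EventID:    uuid.NewString(),
		EntityID:   id,
		EventType:  "CompanyCreated",
		OccurredAt: time.Now(),
		Payload:    payload,
	})
}
```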
8. Events Give Reporting + Auditing
Big advantage: rich reporting.
● How many people work here?
● How much is payroll projected to be next cycle?
● How much would it cost to open a new office in SF and
hire 20 engineers over 6 months?
● Who approved that bonus payment?
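For instance, the headcount question is just a fold over the event stream. A toy reader, with event and field names invented for illustration:

```go
// EmploymentEvent is an illustrative decoded event.
type EmploymentEvent struct {
	CompanyID string
	Type      string // "EmployeeHired" or "EmployeeTerminated"
}

// HeadcountView answers "how many people work here?" by folding
// the event stream into a running count per company. Because the
// full history is retained, the same stream can answer richer
// questions (projections, audits) with different folds.
type HeadcountView struct{ counts map[string]int }

func NewHeadcountView() *HeadcountView {
	return &HeadcountView{counts: make(map[string]int)}
}

func (v *HeadcountView) Apply(e EmploymentEvent) {
	switch e.Type {
	case "EmployeeHired":
		v.counts[e.CompanyID]++
	case "EmployeeTerminated":
		v.counts[e.CompanyID]--
	}
}

func (v *HeadcountView) Headcount(companyID string) int {
	return v.counts[companyID]
}
```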
9. Challenges We Faced
1. Tooling Is Essential
2. Schema Evolution
3. Eventual Consistency
4. Error Handling
5. Data Migration
10. Tooling Is Essential
Need tools!
● Developer Self Service
● KSQL
● Lenses.io
● Libraries and templates
11. Schema Evolution
Data is forever. Want to make sure it's useful.
Schema backward compatibility: old messages should be
readable by newer readers. Use Avro for this.
Data backward compatibility: old messages should be
meaningful to newer readers. Populate deprecated fields (e.g.
keep writing name alongside new first_name/last_name) until the
migration is complete.
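A sketch of data backward compatibility using the name example from the slide (the struct is illustrative; in practice this would be an Avro record):

```go
// EmployeeNamed shows a schema mid-migration: the deprecated
// Name field is still populated so old readers stay meaningful,
// while new readers use FirstName/LastName.
type EmployeeNamed struct {
	Name      string // deprecated: drop once every reader has migrated
	FirstName string
	LastName  string
}

func NewEmployeeNamed(first, last string) EmployeeNamed {
	return EmployeeNamed{
		Name:      first + " " + last, // keep the old field meaningful
		FirstName: first,
		LastName:  last,
	}
}
```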
12. Eventual Consistency
How do we read our writes with Kafka? The user saves, but the
page shows old data until they refresh.
The writer does validation before emitting the event. The UI
updates its state from the successful write response, without
hitting the reader.
[Diagram: Writer → CompanyCreated → Kafka → Readers]
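One way to sketch this, reusing the illustrative CompanyService from earlier: the write RPC echoes the validated state back, so the UI can render it without querying a reader that may not have consumed the event yet.

```go
// CreateCompanyResponse echoes the validated state to the caller.
type CreateCompanyResponse struct {
	CompanyID string
	Name      string
	Country   string
}

func (s *CompanyService) CreateCompanyRPC(ctx context.Context, cmd CreateCompany) (*CreateCompanyResponse, error) {
	id, err := s.HandleCreateCompany(ctx, cmd) // validates, then writes the event
	if err != nil {
		return nil, err
	}
	// Safe to echo: the event is durably in Kafka, and readers
	// will converge to this same state.
	return &CreateCompanyResponse{CompanyID: id, Name: cmd.Name, Country: cmd.Country}, nil
}
```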
13. Consumer Error Handling
Transient: can't write to database or call other service
because it's unavailable (5xx). Just retry.
Permanent: Can't deserialize message/message doesn't make
sense. Stop!
Need metrics, and give services flags to skip permanent
failures (or use etcd to push configs).
Transient failures are expected.
Permanent failures are bugs.
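A sketch of that policy in consumer code. PermanentError and the skip set are illustrative; the slide mentions service flags or etcd-pushed config for skipping, and Envelope is from the earlier sketch:

```go
import (
	"context"
	"errors"
	"log"
	"time"
)

// PermanentError marks failures that retrying cannot fix, e.g. a
// message that fails to deserialize.
type PermanentError struct{ Reason string }

func (e *PermanentError) Error() string { return e.Reason }

type Handler func(ctx context.Context, msg Envelope) error

// processMessage encodes the policy from this slide: transient
// failures are expected (retry with backoff); permanent failures
// are bugs (stop), unless an operator has flagged the message to
// be skipped.
func processMessage(ctx context.Context, h Handler, msg Envelope, skip map[string]bool) error {
	for {
		err := h(ctx, msg)
		if err == nil {
			return nil
		}
		var perm *PermanentError
		if errors.As(err, &perm) {
			if skip[msg.EventID] {
				log.Printf("skipping poison message %s: %v", msg.EventID, err)
				return nil // metrics/alerting would fire here
			}
			return err // halt the consumer: a human must look
		}
		// Transient (e.g. downstream 5xx): back off and retry.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```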
14. Migration
Would be great if we were building a brand new system…
But we have existing data in several databases, and a lot of
software that relies on that data.
[Diagram: existing data spread across Legacy DB #1, Legacy DB #2, and Legacy DB #3]
15. Migration - APIs
First, create a write API proxy for the services. Migrate all of
the places that wrote to the legacy systems to use this proxy API.
The proxy API is just a pass-through layer in front of your
legacy systems.
[Diagram: writes → Proxy API → Legacy DB #1, Legacy DB #2, Legacy DB #3]
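A sketch of the pass-through fan-out (LegacyWriter is an illustrative client interface). Note the failure mode it sets up: if the third write fails after the first two succeed, the legacy systems disagree, which is exactly the problem the next slide addresses.

```go
// LegacyWriter is an illustrative client for one legacy system.
type LegacyWriter interface {
	WriteCompany(ctx context.Context, c CreateCompany) error
}

// ProxyAPI fans a single logical write out to every legacy
// system. A failure partway through leaves some systems updated
// and others not.
type ProxyAPI struct {
	legacy []LegacyWriter
}

func (p *ProxyAPI) CreateCompany(ctx context.Context, c CreateCompany) error {
	for _, w := range p.legacy {
		if err := w.WriteCompany(ctx, c); err != nil {
			return err // partial write: systems now disagree
		}
	}
	return nil
}
```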
16. Migration - Domain Events
The challenge with this is multiple writes - what if the write to
the third system fails?
Retry idempotently - but you need to store that state somewhere.
So have the API proxy (no longer a proxy) write the change to
Kafka as a Domain Event, then use that Domain Event to populate
the legacy systems.
[Diagram: Proxy API → Kafka → Legacy DB #1, Legacy DB #2, Legacy DB #3]
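Continuing the sketch, the former proxy now makes a single durable write. Fixing the EventID before the first attempt is one way to make retries idempotent: a duplicate produce can be deduplicated downstream by EventID. All types are the illustrative ones from earlier.

```go
// DomainAPI is the former proxy: one durable write to Kafka, and
// backfillers (next slide) populate the legacy systems from the
// event.
type DomainAPI struct{ events EventWriter }

func (a *DomainAPI) CreateCompany(ctx context.Context, c CreateCompany) error {
	id := uuid.NewString()
	payload, err := encodeAvro(CompanyCreated{
		CompanyID: id, Name: c.Name, Country: c.Country, OccurredAt: time.Now(),
	})
	if err != nil {
		return err
	}
	env := Envelope{
		EventID:    uuid.NewString(), // fixed up front: stable across retries
		EntityID:   id,
		EventType:  "CompanyCreated",
		OccurredAt: time.Now(),
		Payload:    payload,
	}
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if lastErr = a.events.Write(ctx, env); lastErr == nil {
			return nil // either the event is in Kafka or the request fails
		}
	}
	return lastErr
}
```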
17. Migration - Backfillers
Backfillers read the Kafka topic and populate the legacy systems.
One backfiller per legacy system. Legacy DBs are augmented with a
link to the new-world record.
[Diagram: Domain Service → CompanyCreated → Kafka → one Backfiller each for Legacy DB #1, #2, #3]
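A sketch of one backfiller, where the stored link to the new-world record doubles as an idempotency key (LegacyCompanyDB is illustrative):

```go
// LegacyCompanyDB is an illustrative client for one legacy
// database that has been augmented with the new-world entity ID.
type LegacyCompanyDB interface {
	HasNewWorldID(ctx context.Context, id string) (bool, error)
	InsertCompany(ctx context.Context, newWorldID, name, country string) error
}

type Backfiller struct{ db LegacyCompanyDB }

// Handle applies one CompanyCreated event to the legacy system.
// The new-world ID check makes replays and retries no-ops.
func (b *Backfiller) Handle(ctx context.Context, e CompanyCreated) error {
	exists, err := b.db.HasNewWorldID(ctx, e.CompanyID)
	if err != nil {
		return err // transient: the consumer loop retries
	}
	if exists {
		return nil // already applied: safe under replays
	}
	return b.db.InsertCompany(ctx, e.CompanyID, e.Name, e.Country)
}
```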
18. Migration - Existing Data
Services have a flag-enabled migration path that takes legacy
data from the databases and writes it to their Kafka topic. That
way the write logic is reused.
If the migrate flag is set, the service reads from the legacy DB
to generate the initial set of Domain Events in Kafka.
[Diagram: Legacy DB #1 → Domain Service → CompanyCreated → Kafka → Backfiller]
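A sketch of the flag-enabled path, reusing the illustrative HandleCreateCompany so migrated rows produce the same Domain Events as new writes (LegacyReader and LegacyCompany are invented for the example):

```go
// LegacyCompany and LegacyReader stand in for a legacy database
// row and its read client.
type LegacyCompany struct{ Name, Country string }

type LegacyReader interface {
	ListCompanies(ctx context.Context) ([]LegacyCompany, error)
}

// MigrateExisting is the flag-enabled path: replay legacy rows
// through the normal write logic so migrated data gets the same
// validation and event shape as new data.
func (s *CompanyService) MigrateExisting(ctx context.Context, legacy LegacyReader, migrate bool) error {
	if !migrate {
		return nil
	}
	companies, err := legacy.ListCompanies(ctx)
	if err != nil {
		return err
	}
	for _, c := range companies {
		if _, err := s.HandleCreateCompany(ctx, CreateCompany{Name: c.Name, Country: c.Country}); err != nil {
			return err
		}
	}
	return nil
}
```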
19. Migration - Existing Data Pt. 2
Not always so easy!
What if the data in one legacy system doesn't agree with the
other (e.g. a different name in each)?
What if data in one system can't be represented in another?
You will find lots of fun and exciting things about your data!
20. Migration - Push Ops To Clients
Push your operations away from engineering and toward the
clients.
Build tooling and features to help clients and operations
teams clean up data (e.g. ask the client to input a name when a
disagreement is detected that can't be reconciled).
Build tooling to support the migration (enable new feature →
give data cleanup workflow to client or operations).
21. Migration - Insights
1. Start with new clients only.
2. Plan for migration of existing clients from the start.
3. Push operations toward customer!
4. Consolidate data into one legacy system first if
possible.
22. Summary
● Know your domain - find your aggregates.
● Buy and build tooling.
● Eventual consistency is hard. It takes time to understand.
● Plan for migration.
● Separate feature development from migration.
● Push operations away from engineering toward the customer.
● Embrace eventual consistency.
24. Services Own Domain Events
Typical flow:
1. Command (RPC) to change entity.
2. Domain Service validates that the command makes sense.
3. Domain Service writes one or more Domain Events to
Kafka topic(s).
4. Other systems read Domain Events, possibly storing
them.