If you’re considering -- or planning -- a cloud migration, you may be concerned about risks to your data and your mental health. Migrations at scale are fraught with risk. You absolutely can’t lose data, compromise its integrity, or suffer downtime, so you want to be slow and careful. On the other hand, you’re paying two providers for every day the migration goes on, so you need to move as fast as possible.
Unity Technologies accumulates vast amounts of data. We recently moved our data infrastructure from Amazon Web Services (AWS) to Google Cloud Platform (GCP) as part of a major cloud migration.
To minimize risk and costs, our team used Apache Kafka and Confluent Platform, engaging Confluent Professional Services to help ensure a speedy and seamless migration. Kafka was already serving as the backbone of our data infrastructure, which handles over half a million events per second, and during the migration it also served as the bridge between AWS and GCP.
Join us at this session to learn about the processes and tools used, the challenges faced, and the lessons learned as we moved our operations and petabytes of data from AWS to GCP with zero downtime.
2. About Unity
— Millions of Creators, Billions of Gamers
— Half of Top 1000 Mobile Games
— Products include Analytics, Monetization, Crash and Performance Reporting, Asset Store, Collaborate, and Cloud Build
— Beyond Game Development:
– Automotive, Transportation & Manufacturing
– Film, Animation & Cinematics
– Architecture, Engineering & Construction
3. Apache Kafka at Unity
— In production since Kafka 0.8
— Tens of billions of events every day
— Served as the backbone of a massive cloud migration
5.
“Migrations are the only mechanism to effectively manage technical debt as your company and code grows. If you don't get effective at software and system migrations, you'll end up languishing in technical debt.”
— Will Larson, “An Elegant Puzzle: Systems of Engineering Management”
9. Why are migrations hard?
— Can we stop the world?
– Synchronizing starting state is trying to hit a moving target.
— How about double writes and double reads?
– Too much organizational complexity involved.
— Why solve the same problem twice? (or 100 times)
– Every team will need their own pipeline for their own purposes.
— Are we even speaking the same language?
– Legacy systems might not be compatible with newer cloud-based applications.
16. Event-driven architecture to the rescue
— Can we stop the world?
– Stream processing is the perfect tool to deal with changing state.
— How about double writes and double reads?
– Kafka Connect or CDC solutions can act as the bridge.
— Why solve the same problem twice? (or 100 times)
– Confluent Replicator acts as the single integration pipeline.
— Are we even speaking the same language?
– Kafka client libraries and Kafka Connect make sure every system can be connected.
18. Event-driven architecture to the rescue
— Event-driven architecture enables hybrid or multi-cloud deployments that keep operating during migrations.
— Tools for this: MirrorMaker, uReplicator, Confluent Replicator.
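At their core, all of these replication tools run a consume-and-forward loop between two clusters. The following is a minimal in-memory sketch of that idea, with topics mocked as plain Python lists rather than real Kafka clients; it is an illustration of the pattern, not how MirrorMaker or Replicator is actually implemented.

```python
# Minimal in-memory sketch of the consume-and-forward loop that
# replication tools (MirrorMaker, uReplicator, Confluent Replicator)
# implement at much larger scale. Topics here are mocked as lists;
# a real bridge would use Kafka consumer and producer clients.

def replicate(source_topic, dest_topic, last_replicated_offset):
    """Copy records the destination has not seen yet; return the new checkpoint."""
    for offset in range(last_replicated_offset + 1, len(source_topic)):
        dest_topic.append(source_topic[offset])   # preserves source ordering
    return len(source_topic) - 1

source = ["event-0", "event-1", "event-2"]        # source-side topic (mocked)
dest = []                                         # destination-side topic (mocked)

checkpoint = replicate(source, dest, -1)          # initial sync
source.append("event-3")                          # producers keep writing
checkpoint = replicate(source, dest, checkpoint)  # incremental catch-up

print(dest)  # ['event-0', 'event-1', 'event-2', 'event-3']
```

Because the loop tracks a checkpoint offset, producers can keep writing to the source cluster while the bridge catches up, which is what makes a zero-downtime cutover possible.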
20. Preparing for the migration
— You can’t overdesign
– A migration planned for the happy path will fail
– Try to uncover harder paths and edge cases as early as possible
— Pre-mortem everything
– If you think it can break, it will.
— Minimize one-way doors
– Plan a way to revert as many operations as possible
21. Preparing for the migration
— “Festina lente”
– Make haste slowly. Tooling and documentation built before kicking off the migration act as a force multiplier.
— It’s OK to ask for help
– Get aligned with other internal teams who are subject matter experts on infrastructure, networking, project management…
– Get external help when necessary: professional services, training, etc.
22. Preparing for the migration
— Make sure Kafka is sufficiently resourced on both sides
– More memory is better (at least 32 GB)
– Multiple disks (we had 8; ext4 or XFS)
– Uniform nodes
– Network… read the fallacies of distributed computing
— Use a tool to simplify Data Center Interconnect
— Install the Replicator Monitoring Extension
— Don’t trust the network. Don’t trust ZooKeeper (KIP-500)
23. Preparing for the migration
— Make sure the JVM is properly configured
— Make sure Kafka brokers and producers are configured to minimize the chance of data loss
-Xms6g -Xmx6g -XX:MetaspaceSize=96m -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
-XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
unclean.leader.election.enable=false
default.replication.factor=3
min.insync.replicas=2
acks=all
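A quick back-of-the-envelope check shows why these settings minimize data loss. The numbers below mirror the config shown above; this is plain arithmetic, not a Kafka API.

```python
# Rough durability check for the broker/producer settings above.
replication_factor = 3   # default.replication.factor
min_insync = 2           # min.insync.replicas

# With acks=all, a write is acknowledged only once min.insync.replicas
# replicas have it, so an acked record survives this many broker losses:
survivable_failures = min_insync - 1

# A partition keeps accepting writes while at least min.insync.replicas
# replicas are alive, i.e. with this many brokers down:
writable_with_down = replication_factor - min_insync

print(survivable_failures, writable_with_down)  # 1 1
```

In other words, with these settings the cluster tolerates one broker failure without losing acknowledged data and without refusing writes; `unclean.leader.election.enable=false` additionally prevents an out-of-sync replica from becoming leader and silently discarding records.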
24. Replicator
— Prevents cyclic replication of topics
– Enables two Kafka clusters to run in active-active mode, with producing and consuming on both sides
— Timestamp preservation
– Replicator preserves the timestamp of each message from the source cluster on the target cluster
— Consumer offset translation
– Replicator automatically translates offsets using timestamps, so consumers can start consuming data in the destination cluster where they left off in the origin cluster.
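The idea behind timestamp-based offset translation can be sketched in a few lines. Offsets differ between clusters, but because timestamps are preserved, a consumer can resume at the first destination record whose timestamp is at least the timestamp of its last committed source record. This is a hedged illustration of the concept, not Replicator's actual implementation.

```python
# Sketch of timestamp-based offset translation. Offsets are not
# comparable across clusters, but preserved timestamps are, so we
# binary-search the destination partition for the resume position.
import bisect

def translate_offset(dest_timestamps, source_commit_ts):
    """dest_timestamps: record timestamps for one destination partition,
    in ascending offset order (list index == offset)."""
    return bisect.bisect_left(dest_timestamps, source_commit_ts)

# Destination partition timestamps (ms); offsets are the list indices.
dest_ts = [100, 105, 105, 120, 130]

print(translate_offset(dest_ts, 105))  # 1 -> resume at offset 1
print(translate_offset(dest_ts, 121))  # 4 -> resume at offset 4
```

Resuming at the first record with an equal-or-later timestamp may reprocess a few records when timestamps tie, but it never skips data, which is the right trade-off for a migration.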
25. Finalizing the preparation
— By this point:
– If you don’t have a good idea of what the objectives and timelines look like (total migration count, etc.), go back to planning. You can’t hit a target you can’t see.
– If a majority of teams are not willing to prioritize the migration, go back to clarifying the objectives or reframing the discussion.
– If a team cannot run their migration using the documentation and self-serve tooling provided by the migration owner, go back to improving them.
26. Running the migration
— The best migration is one you don’t have to do
– If the preparation steps went well, most simple migrations can be automated by this point
— Track the migration separately from the team’s regular workload
– Separate board, separate meetings, separate reporting, a DRI, etc.
– The only progress that counts is completed migrations
– Report progress as “completed migrations / total migrations”
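The single recommended metric is simple enough to compute anywhere. A tiny sketch, with made-up team names:

```python
# "Completed migrations / total migrations" is the only progress metric.
# Team names below are hypothetical examples.
migrations = {
    "analytics-pipeline": True,   # done
    "crash-reporting": True,      # done
    "asset-store": False,         # in progress
    "cloud-build": False,         # not started
}

completed = sum(migrations.values())
total = len(migrations)
print(f"{completed}/{total} migrations complete ({completed / total:.0%})")
# -> 2/4 migrations complete (50%)
```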
27. Running the migration
— Thresholds and alerts are your friends
– Link pagers with Control Center
— Remember that you’re paying for both sides during the migration
– Migrations are only successful at 100%
— Wrap it up
– In edge cases where automated or self-serve tooling is not enough and the service team can’t prioritize the migration, step in and push it over the finish line
28. Finishing the migration
— Recognize and celebrate
— Be diligent in shutting down the old infrastructure (remember one-way doors)
— Start tracking new tech debt caused by shortcuts taken during the migration
29. Wrap-up
— Plan deeply. Resources and configuration require careful thought.
— Go slow to go fast.
— Automation, self-serve tooling, and documentation will lead to success.
— Replicator enables many use cases.
— Migrations are only successful at 100%.