Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3/Hadoop Continuously

(Henry Cai, Pinterest) Kafka Summit SF 2018

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Pinterest has a hundred billion pins stored in MySQL at a scale of 100 TB, and most of this data is needed to build data-driven products for machine learning and data analytics.

This talk discusses how Pinterest designed and built a continuous database (DB) ingestion system that moves MySQL data into near-real-time computation pipelines with only 15 minutes of latency, in support of our dynamic personalized recommendations and search indices. Pinterest helps people discover and do things that they love. We have billions of core objects (pins/boards/users) stored in MySQL at the scale of 100 TB, and all of this data needs to be ingested onto S3/Hadoop for machine learning and data analytics. As Pinterest moves towards real-time computation, we face stringent service-level agreement requirements, such as making the MySQL data available on S3/Hadoop within 15 minutes and serving the DB data incrementally in stream processing. We designed WaterMill, a continuous DB ingestion system that listens for MySQL binlog changes, publishes the MySQL changelogs as an Apache Kafka® change stream, and ingests and compacts the stream into Parquet columnar tables in S3/Hadoop within 15 minutes.

We would like to share how we solved the problems of:
- Scalable data partitioning and an efficient compaction algorithm
- Stories on schema migration, rewind and recovery
- PII (personally identifiable information) processing
- Columnar storage for efficient incremental queries
- How the DB change stream powers other use cases, such as cache invalidation across multiple datacenters
- How we deal with S3 eventual consistency and rate limiting

Related technologies: Apache Kafka, stream processing, MySQL binlog processing, Amazon S3, Hadoop and Parquet columnar storage

  1. Moving the needle of the Pin: Streaming 100 TB of pins from MySQL to S3/Hadoop continuously @ Pinterest. Henry Cai, Oct 2018, www.linkedin.com/in/hecai
  2. Pinterest is the visual discovery engine. Mission: Help people discover and do what they love.
  3. >250M monthly active users; 100B Pins and 2B Boards; 80%, 75%: of signups are from outside the U.S., of Pinners use Pinterest from mobile
  4. Data-driven products
     • Personalized recommendation
     • Spam control
     • Search quality
     • A/B experiments
     • Related Pins
     • …
  5. Data pipeline stats
     • >1 PB data/day
     • >10M messages/second
     • >800B messages/day
     • >2,000 Kafka brokers
     • >50,000 client hosts
  6. Data ingestion types
     • Online logging
     • Database snapshots
  7. 2016 pipeline
  8. Data ingestion @ Pinterest, 2016 [diagram labels: Pinterest Services, Singer, Kafka]
  9. Same diagram, adding: events
  10. Adding: Real-time consumers, Merced, Tracker
  11. Adding: Databases
  12. Same labels as slide 11
  13. Adding: Logical backup
  14. DB ingestion @ Pinterest, Version 1 [diagram labels: Databases; Shard1 and Shard2, each with Master, Slave and DrSlave; Mysqldump; Hadoop Streaming; Mapper1 and Mapper2]
  15. DB ingestion @ Pinterest, Version 2 [diagram labels: Databases, logical CSV backup, Tracker; shown alongside the Version 1 diagram]
  16. Pain points / Constraints
     • Reliability problems caused by MySQL host hiccups
     • Pulling over 100 TB of data daily, but only a few TB change every day
     • Long latency (>24 hours)
     Future: DB change streams
     • Truly captures DB transactions
     • Cross-region cache invalidation
     • Real-time search index building
     • Real-time recommendation engine
  17. The new pipeline
  18. Data ingestion @ Pinterest, now [diagram labels: Pinterest Services, Singer, Kafka, events]
  19. Same diagram, adding: Databases, DB/Kafka Bridge
  20. Adding: Merced
  21. Adding: Watermill
  22. Adding: Real-time consumers
  23. DB/Kafka Bridge (Maxwell) [same diagram labels as slide 22]
  24. DB/Kafka Bridge [diagram labels: Replica-Set Node, Maxwell_position, Maxwell_schema, MySQL Processes and Schemas, Maxwell Tables, Binlog File, Shard1, Shard2, Shard3, User Tables]
  25. DB/Kafka Bridge [same diagram, adding: MySQL Processes (co-located with MySQL process), BinLog Tailer Thread, InMemory Queue, Async Kafka Producer Thread, Kafka User Topic, Kafka Pin Topic]
     • Based on Maxwell/Binlog-Connector
     • Add GTID support
     • Add handling for retry/out-of-order messages
     • Co-locate with MySQL
     • Listens on master/slave
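
The tailer thread, in-memory queue and producer thread named on slide 25 can be illustrated with a minimal Python sketch. This is not Maxwell or Pinterest code: `run_bridge`, `tail_binlog`, `produce` and the `send` callback are hypothetical names, `binlog_events` stands in for a real binlog stream, and `send` stands in for a real Kafka producer call.

```python
import queue
import threading

_STOP = object()  # sentinel telling the producer thread to exit


def tail_binlog(binlog_events, q):
    """Tailer thread: push each binlog event onto the bounded queue."""
    for event in binlog_events:
        q.put(event)  # blocks when the queue is full (backpressure)
    q.put(_STOP)


def produce(q, send):
    """Producer thread: drain the queue in FIFO order and hand events to send()."""
    while True:
        event = q.get()
        if event is _STOP:
            break
        send(event)  # hypothetical stand-in for a Kafka producer send


def run_bridge(binlog_events, send, max_queue=1000):
    """Run one tailer and one producer; a single FIFO queue keeps event order."""
    q = queue.Queue(maxsize=max_queue)
    tailer = threading.Thread(target=tail_binlog, args=(binlog_events, q))
    producer = threading.Thread(target=produce, args=(q, send))
    tailer.start()
    producer.start()
    tailer.join()
    producer.join()
```

With one tailer, one queue and one producer, events reach `send` in binlog order; retries inside a real asynchronous producer can still reorder or duplicate messages, which is the case the slide's "retry/out-of-order" bullet refers to.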
  26. DB/Kafka Bridge, Watermill compaction [diagram labels: Pinterest Services, Singer, Kafka, events, Real-time consumers, Databases, Merced, Watermill]
  27. Compaction for one shard
     • Hash join between snapshot and delta
     • Delta is loaded in memory first as a side lookup
     • Base snapshot is piped through the mapper node and compared against the lookup table
       - Lookup fails: the snapshot record is emitted to output
       - Lookup succeeds, but the snapshot record is old: skip the snapshot record
       - Lookup succeeds, but the snapshot record is newer: remove the lookup record
     • At the end, append the remaining lookup records to output
     [diagram labels: Delta Shard 1, Old Snapshot Shard 1, Compactor, New Snapshot Shard 1]
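
The compaction steps on slide 27 can be sketched in a few lines of Python. This is an illustration under assumptions, not the Watermill implementation: the `Record` type and its `version` field (used to decide which copy is newer) are hypothetical, equal versions are treated as "snapshot is current", the snapshot record is assumed to be emitted when it is newer, and deletes are not handled.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    key: int       # primary key of the row
    version: int   # hypothetical ordering field (e.g. update sequence)
    payload: str


def build_lookup(delta: Iterable[Record]) -> Dict[int, Record]:
    """Load the delta into memory, keeping the newest record per key."""
    lookup: Dict[int, Record] = {}
    for rec in delta:
        current = lookup.get(rec.key)
        if current is None or rec.version >= current.version:
            lookup[rec.key] = rec
    return lookup


def compact(snapshot: Iterable[Record], delta: Iterable[Record]) -> Iterator[Record]:
    """Stream the old snapshot past the in-memory delta lookup."""
    lookup = build_lookup(delta)
    for snap in snapshot:
        hit = lookup.get(snap.key)
        if hit is None:
            yield snap                 # lookup fails: emit snapshot record
        elif snap.version < hit.version:
            continue                   # snapshot record is old: skip it
        else:
            del lookup[snap.key]       # snapshot is newer: drop the lookup record
            yield snap                 # assumption: the newer snapshot row is kept
    yield from lookup.values()         # append the remaining lookup records
```

Only the delta has to fit in memory; the snapshot is read once as a stream, which matches the "piped through the mapper node" wording on the slide.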
  28. Incremental DB ingestion sequence [diagram labels: MySQL, Maxwell, Kafka]
  29. Same diagram, adding: Merced, Delta
  30. Adding: Periodic Compaction, Snapshot1, Snapshot2
  31. Adding: Tracker, Batch Backup, Backup Snapshot, Bootstrapper
  32. Adding: Periodic File GC, Differ
  33. Adding: SELECT FROM rt_users, Custom Input Format
  34. Data lifecycle and timeline management [timeline: Daily Dump 11:30, Bootstrap Snapshot 11:55]
  35. Same timeline, adding: Kafka, Merced, Delta 12:01
  36. Adding: Compaction, Snapshot 12:10AM
  37. Adding: Select 12:15
  38. Adding markers: ProcessedUpTo, CurrentSnapshot
  39. Adding a second Daily Dump 11:45 and Bootstrap Snapshot 12:20
  40. Adding: Select 12:25
  41. Markers: CurrentSnapshot, ProcessedUpto
  42. Markers: ProcessedUpto, then "… NextCompaction …"
  43. Markers: CurrentSnapshot, Periodic GC
  44. Markers: PossibleRewind, Periodic GC
  45. Consistency
     • MySQL master/slave failover, shard migration
     • MySQL transactions:
       • Split between tables, split between Kafka messages
     • Ordering
       • Between INSERT and UPDATE
       • Between UPDATE and DELETE
     • Soft DELETE vs. hard DELETE
     • Consistency between multiple bootstrap and incremental streams
     • Duplicate records
  46. Scalability (slide callout: 10X)
     • Partitioning
     • Sharded MySQL
       - Shard-based DB snapshot and delta files
       - Two-level sharding in the case that the original shards are not balanced
     • Unsharded dataset
       - Use hash + mod to partition the data on both the snapshot and delta file
     • File filtering using predicate pushdown:
       • On shard/partition level
       • On S3 directory, file and record level
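
The "hash + mod" partitioning for unsharded datasets on slide 46 could look like the following Python sketch. The function names, the `id` key field and the choice of MD5 as a stable hash are assumptions for illustration; the slides do not say which hash function is used.

```python
import hashlib


def partition_for(primary_key, num_partitions):
    """Map a primary key to a partition with a stable hash followed by mod."""
    # hashlib gives the same result on every host and run, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    digest = hashlib.md5(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def split_by_partition(records, num_partitions, key_of=lambda r: r["id"]):
    """Group records (snapshot or delta) by partition number."""
    parts = {p: [] for p in range(num_partitions)}
    for rec in records:
        parts[partition_for(key_of(rec), num_partitions)].append(rec)
    return parts
```

Applying the same function to both the snapshot and the delta means every row and its later changes land in the same partition, so each partition can be compacted independently.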
  47. Kafka nuances
     • Message ordering:
       • Async producer, but still need to maintain message order
       • Maintain order between S3 files and within an S3 file
     • At-least-once delivery
       • Duplicate messages
     • MySQL GTID not always increasing
     • Deal with Kafka cluster hiccup:
       • producer acks=2
       • clean leader election
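
Handling the duplicate messages that at-least-once delivery produces can be sketched as a per-shard high-water mark. This is a hedged illustration, not the approach the talk describes: the `ChangeEvent` type and its `seq` field (a per-shard, strictly increasing sequence number assigned at the source) are hypothetical, and GTID is deliberately not used because the slide notes it is not always increasing.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class ChangeEvent(NamedTuple):
    shard: str
    seq: int       # hypothetical per-shard sequence number from the source
    payload: str


def drop_duplicates(events: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Yield each event once, skipping redelivered ones."""
    high_water: Dict[str, int] = {}
    for ev in events:
        if ev.seq <= high_water.get(ev.shard, -1):
            continue  # already seen this position for the shard
        high_water[ev.shard] = ev.seq
        yield ev
```

A high-water mark also drops events that arrive late and out of order, so a pipeline that must keep those would need to buffer and reorder instead.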
  48. S3 nuances
     ● Eventual consistency
       ● Read-after-write is OK, but not PUT followed by LIST
     ● Directory listing is slow
       ● Shorter SLA -> more, smaller files
       ● In early iterations, directory listing >> file content reading
     ● Rate limit:
       ● Launching thousands of mappers would quickly hit the S3 rate limit
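
One common way for clients to cope with a storage rate limit is retrying with exponential backoff and jitter. The slides do not say how Watermill handles throttling, so the Python sketch below is a generic illustration only; `ThrottledError`, `call_with_backoff` and all parameter values are hypothetical, and `op` stands in for any S3 request.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the store rejects a request as too fast."""


def call_with_backoff(op, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call op(); on throttling, wait a random time up to a doubling cap and retry."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)  # jitter spreads retries from many mappers apart
```

The `sleep` and `rng` parameters are injectable so the behaviour can be checked without real waiting.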
  49. PII processing
     • username, email address, etc. need to be filtered out
     • IP address needs to be filtered out
     [example values shown: john.doe@abc.com, Justin Bieber, 192.168.0.1]
  50. PII processing (as slide 49), plus: Columnar layout and incremental processing
     • Use Parquet format to support fast queries on a subset of columns
     • ingest_time as a new column to get the incremental result since the last processing
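
Both ideas on slides 49 and 50, dropping PII columns and using an `ingest_time` column to read only new rows, can be shown in a small Python sketch. The column names in `PII_COLUMNS`, the function names and the integer timestamps are assumptions for illustration; only `ingest_time` comes from the slide.

```python
# Hypothetical column names; the slides only say username, email address
# and IP address are filtered out.
PII_COLUMNS = frozenset({"username", "email", "ip_address"})


def scrub(row, ingest_time, pii_columns=PII_COLUMNS):
    """Return a copy of the row without PII columns, stamped with ingest_time."""
    clean = {k: v for k, v in row.items() if k not in pii_columns}
    clean["ingest_time"] = ingest_time
    return clean


def rows_since(rows, last_processed):
    """Incremental read: only rows ingested after the last processed time."""
    return [r for r in rows if r["ingest_time"] > last_processed]
```

For example, scrubbing `{"id": 1, "username": "Justin Bieber", "email": "john.doe@abc.com", "ip_address": "192.168.0.1", "board_count": 3}` leaves only `id`, `board_count` and `ingest_time`.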
  51. Operation
  52. Bootstrap, synchronize & rewind [diagram labels: MySQL, Tracker, Batch Backup, Backup Snapshot, Maxwell, Merced, Periodic Compaction, Snapshot1, Delta, Snapshot2, Bootstrapper, Kafka]
  53. Bootstrap, synchronize & rewind (cont.) [same diagram as slide 52]
     • We have the ability to synchronize and rewind
     • In case of software bugs or network glitches
     • Snapshot(s) onto Bootstrap to synchronize
     • Ability to rewind via the Snapshots/Bootstrap mechanism
  54. Schema management and schema change
     • Schema is used to:
       • Identify the primary key of the row
       • Drive the Parquet file generation
     • Dealing with schema change
       • Will issue a new bootstrap on offline table schema
       • Compaction will still use the snapshot schema (which might be old)
     [table illustration: dbname.table_name with columns ID, C1, C2, rows 123 to 126, and a new_column]
  55. Validation
     • Validation
       • Creating compaction based on from and to GTID range
       • Compaction output vs. batch backup output
     • Monitoring
       • Error, failure, stall
       • Latency on compaction
     [diagram labels: Backup Snapshot, Periodic Compaction, Snapshot1, Snapshot2, Differ, Bootstrapper]
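
The "compaction output vs. batch backup output" check on slide 55 amounts to a keyed diff of two datasets. The Python sketch below is a minimal in-memory illustration; `diff_outputs`, the `id` key and the report field names are hypothetical, and the actual Differ component is not described in the slides.

```python
def diff_outputs(compacted, backup, key="id"):
    """Compare two row sets by primary key and report the differences."""
    c = {row[key]: row for row in compacted}
    b = {row[key]: row for row in backup}
    return {
        # keys present in the batch backup but absent from the compaction
        "missing_in_compaction": sorted(b.keys() - c.keys()),
        # keys present in the compaction but absent from the batch backup
        "extra_in_compaction": sorted(c.keys() - b.keys()),
        # keys present in both whose row contents differ
        "mismatched": sorted(k for k in c.keys() & b.keys() if c[k] != b[k]),
    }
```

An empty report for all three lists means the compacted snapshot and the batch backup agree for the compared range.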
  56. Summary
  57. Comparison to other technologies
     • Uber Hudi (Hoodie)
       • Not supporting S3, only supports Java 8+, Avro
  58. Comparison to other technologies (as slide 57), adding:
     • Kafka Connect
       • Only ingestion, no compacting, synchronize between bootstrap/incremental
  59. Comparison to other technologies
     • Uber Hudi (Hoodie)
       • Not supporting S3, only supports Java 8+, Avro
     • Kafka Connect/Debezium
       • Only ingestion, no compacting, synchronize between bootstrap/incremental
     • Apache Sqoop
       • Based on batch mode
  60. Takeaway
     • Scalability
       • Support 100 TB of database data
       • E2E latency of 15 minutes
     • Reliability
       • Strong database consistency on global transactions, message ordering, duplicate message handling
       • Validation and monitoring
     • Operability
       • Bootstrap, re-synchronize
       • Schema management
  61. Future work
     • Adopting the Kafka exactly-once processing model
     • Kafka as the database change stream
       • Cache invalidation across data centers
       • Building materialized views for MySQL
       • Generating incremental recommendation signals
     • Open source
  62. Acknowledgement
     • Joint work from many engineers, including Yu Yang, Chunyan Wang, Indy Prentice, Shawn Nguyen, Yinian Qi, and many others
  63. Thanks!
  64. © Copyright, All Rights Reserved, Pinterest Inc. 2018
