Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with Special Guests, Airbnb & Viber - STG312 - re:Invent 2017

Learn how to build a data lake for analytics in Amazon S3 and Amazon Glacier. In this session, we discuss best practices for data curation, normalization, and analysis on Amazon object storage services. We examine ways to reduce or eliminate costly extract, transform, and load (ETL) processes using query-in-place technology, such as Amazon Athena and Amazon Redshift Spectrum. We also review custom analytics integration using Apache Spark, Apache Hive, Presto, and other technologies in Amazon EMR. You'll also get a chance to hear from Airbnb & Viber about their solutions for Big Data analytics using S3 as a data lake.


  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
     AWS re:INVENT
     Best Practices for Building a Data Lake on Amazon S3 & Amazon Glacier, with special guests: Viber and Airbnb
     John Mallory, Business Development, Storage
     PD Dutta, Sr. Product Manager, Amazon S3
     STG312, November 30, 2017
  2. What to expect
     • Defining the AWS data lake on Amazon S3 and Amazon Glacier
     • Data cataloging
     • Security, performance, and analytics best practices
     • Special guests Viber and Airbnb
  3. Defining the AWS data lake
     A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.
     Key data lake attributes:
     • Decoupled storage and compute
     • Rapid ingest and transformation
     • Secure multi-tenancy
     • Query in place
     • Schema on read
  4. What can you do with a data lake?
     Storage: Amazon S3, Amazon Glacier
     • Amazon Redshift: data warehouse
     • Amazon EMR: Hadoop/Hive/Presto batch processing
     • Amazon Athena: clusterless SQL query
     • AWS Glue: clusterless ETL
     • BI & visualization
  5. What can you do with a data lake?
     Storage: Amazon S3, Amazon Glacier
     Streaming and real-time analytics:
     • AWS Lambda
     • Amazon Elasticsearch Service
     • Apache Storm on EMR
     • Apache Flink on EMR
     • Amazon Kinesis Analytics
     • Spark Streaming on EMR
     • Amazon ElastiCache
  6. What can you do with a data lake?
     Storage: Amazon S3, Amazon Glacier
     AI and machine learning:
     • Amazon Polly: life-like speech
     • Amazon Lex: conversational engine
     • Amazon Rekognition: image analysis
     • Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
  7. Reasons to choose Amazon S3 for data lake
     • Unmatched durability, availability, and scalability
     • Best security, compliance, and audit capability
     • Object-level control at any scale
     • Business insight into your data
     • Twice as many partner integrations
     • Most ways to bring data in
  8. Optimize costs with data tiering
     Tiers from hot to cold: HDFS, Amazon S3 standard, Amazon S3 infrequent access, Amazon Glacier
     ✓ Use EMR/Hadoop with local HDFS for hottest data sets
     ✓ Store cooler data in S3 and Glacier to reduce costs
     ✓ Use S3 Analytics to optimize tiering strategy
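The tiering advice on slide 8 can be expressed as an S3 lifecycle rule. The sketch below only builds the rule as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the `logs/` prefix and the 30-day and 90-day thresholds are illustrative assumptions, not values from the talk.

```python
import json


def tiering_rule(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build one lifecycle rule that moves objects under `prefix` to colder tiers."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "ID": "tier-" + (prefix.strip("/") or "all"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Hypothetical thresholds: warm data to infrequent access, cold data to Glacier.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


lifecycle = {"Rules": [tiering_rule("logs/")]}
print(json.dumps(lifecycle, indent=2))
```

In practice the thresholds would come from observed access patterns, which is what the slide suggests using S3 Analytics for.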
  9. Multiple data lake ingestion methods
     • AWS Snowball and AWS Snowmobile: PB-scale migration
     • AWS Storage Gateway: migrate legacy files
     • Native/ISV connectors: ecosystem integration
     • Amazon S3 Transfer Acceleration: long-distance data transfer
     • AWS Direct Connect: on-premises integration
     • Amazon Kinesis Firehose: ingest device streams; transform and store on Amazon S3
  10. Catalog your S3 data
      Diagram flow: S3 ObjectCreated and ObjectDeleted events trigger AWS Lambda, which calls PutItem on the metadata index (DynamoDB). The DynamoDB update stream triggers a second AWS Lambda function, which extracts search fields and updates the search index (Amazon Elasticsearch Service or Amazon CloudSearch).
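As a rough illustration of the first Lambda function on slide 10, the sketch below turns an S3 event notification into put and delete operations for a metadata index. It is a pure function so it runs without AWS access; the item attributes (`bucket`, `key`, `size`, `etag`) are assumed names, and a real handler would pass each item to DynamoDB. Note that S3 delivers deletions as `ObjectRemoved` events, which the slide labels ObjectDeleted.

```python
from urllib.parse import unquote_plus


def s3_event_to_index_ops(event: dict) -> list:
    """Translate S3 event records into ("put" | "delete", item) operations."""
    ops = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        key = unquote_plus(obj["key"])  # object keys arrive URL-encoded in S3 events
        item_key = {"bucket": bucket, "key": key}
        if record["eventName"].startswith("ObjectCreated"):
            item = dict(item_key, size=obj.get("size", 0), etag=obj.get("eTag", ""))
            ops.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved"):
            ops.append(("delete", item_key))
    return ops


# Hypothetical event with one upload and one deletion.
sample_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "examplebucket"},
                "object": {"key": "photos/my+photo.jpg", "size": 1024, "eTag": "abc"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "examplebucket"},
                "object": {"key": "photos/old.jpg"}}},
    ]
}
index_ops = s3_event_to_index_ops(sample_event)
print(index_ops)
```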
  11. AWS Glue analytics data catalog
  12. AWS Glue analytics data catalog
      Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.
      We added a few extensions:
      • Search over metadata for data discovery
      • Connection info: JDBC URLs, credentials
      • Classification for identifying and parsing files
      • Versioning of table metadata as schemas evolve and other metadata are updated
      Populate using Hive DDL, bulk import, or automatically through crawlers.
  13. Populating the AWS Glue data catalog
      Crawlers automatically build your data catalog and keep it in sync.
      • Automatically discover new data and extract schema definitions
      • Detect schema changes and version tables
      • Detect Hive-style partitions on Amazon S3
      • Built-in classifiers for popular types; custom classifiers using Grok expressions
      • Run via Lambda triggers or on a schedule; serverless, so you only pay when a crawler runs
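To make "Hive-style partitions" on slide 13 concrete: partition columns are encoded in the object key as `name=value` path segments. The sketch below extracts them from a key. It illustrates the naming convention only and is not how the Glue crawler is implemented; the example key is hypothetical.

```python
def hive_partitions(key: str) -> dict:
    """Return the name=value partition segments found in an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            parts[name] = value
    return parts


example_partitions = hive_partitions("events/year=2017/month=11/day=30/part-0000.parquet")
print(example_partitions)
```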
  14. Securing your data on Amazon S3
  15. AWS data lake security entitlements
      Encryption:
      • Default encryption
      • Server-side encryption
      • Client-side encryption
      • SSL endpoints
      • Encryption status in inventory
      • CRR with KMS
      Identity and access:
      • Amazon Macie
      • Permission checks
      • AWS Config Rules
      • IAM & bucket policies
      • Access control lists
      Compliance:
      • Certifications: HIPAA, FedRAMP, PCI-DSS
      • Cloud HSM integration
      • Versioning & MFA deletes
      • Audit logging
  16. Security: Access to encryption keys
      Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys, ciphertext key, plaintext key
  17. Security: Access to encryption keys
      Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys, ciphertext key, plaintext key, Amazon S3, S3 object (Name: MyData, Key: Ciphertext Key), My Data
  18. Security for your data lake (Pro Tip)
      • IAM best practices
      • SSL/TLS connections
      • Server-side encryption
      • Bucket policies
      • Versioning; recycle bin
      • MFA deletes
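Two of the controls on slide 18, SSL/TLS connections and server-side encryption, are commonly enforced with a bucket policy. The sketch below builds such a policy as a Python dictionary using the `aws:SecureTransport` and `s3:x-amz-server-side-encryption` condition keys. The bucket name is a placeholder, and requiring `aws:kms` on every upload is an assumption; a bucket that relies on default encryption may need a different condition.

```python
import json


def data_lake_bucket_policy(bucket: str) -> dict:
    """Deny plain-HTTP access and uploads that do not request SSE-KMS."""
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", objects],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": objects,
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
                },
            },
        ],
    }


policy = data_lake_bucket_policy("examplebucket")
print(json.dumps(policy, indent=2))
```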
  19. Optimizing performance on Amazon S3
  20. Getting high throughput with Amazon S3
      examplebucket/232a-2017-26-05-15-00-00/cust1234234/photo1.jpg
      examplebucket/7b54-2017-26-05-15-00-00/cust3857422/photo2.jpg
      examplebucket/921c-2017-26-05-15-00-00/cust1248473/photo2.jpg
      A bit more LIST friendly:
      examplebucket/animations/232a-2017-26-05-15-00-00/cust1234234/animation1.obj
      examplebucket/videos/ba65-2017-26-05-15-00-00/cust8474937/video2.mpg
      examplebucket/photos/8761-2017-26-05-15-00-00/cust1248473/photo3.jpg
      • Random hash should come before patterns such as dates and sequential IDs
      • Always first ensure that your application can accommodate
      • Most customers need not worry about introducing entropy in key names
      • Consider 3-4 character hash for higher requests per second
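The key layout on slide 20 puts a short hash in front of the date and customer ID. One way to derive such a prefix, sketched below, is to take the first few hex characters of a digest of the natural key. The choice of SHA-256 and the `hashed_key` helper are assumptions for illustration; the talk does not say how the hash is produced.

```python
import hashlib


def hashed_key(natural_key: str, hash_chars: int = 4, category: str = "") -> str:
    """Prefix a key with a short deterministic hash, optionally under a category folder."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    key = f"{digest[:hash_chars]}-{natural_key}"
    # A leading category such as "photos/" keeps LIST operations by type practical.
    return f"{category.strip('/')}/{key}" if category else key


natural = "2017-26-05-15-00-00/cust1234234/photo1.jpg"
print(hashed_key(natural))
print(hashed_key(natural, category="photos"))
```

Because the prefix is derived from the key itself, a reader that knows the natural key can recompute the full object key without a lookup table.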
  21. Optimizing data lake performance (Pro Tip)
      • Aggregate small files: EMR S3distcp, Amazon Kinesis Firehose
      • S3 Select: big data cheaper, faster; up to 400% faster
      • Data formats: columnar formats
      • EMRFS consistent view: Amazon S3, Amazon DynamoDB
  22. Big data analytics & query in place
  23. Amazon analytics end-to-end architecture
      Diagram labels: Amazon S3, Data Catalog, Athena, EMR, Amazon Redshift Spectrum, Amazon ML/MXNet, RDS, Amazon QuickSight, Kinesis, Database Migration Service, AWS Glue, IAM, other sources
  24. Introducing Amazon S3 Select
      • Simple API to retrieve a subset of data based on a SQL expression
      • Accelerate performance for data retrieval and processing by up to 400%
      • Simplify compute by retrieving a subset of data in a common format
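A minimal sketch of what an S3 Select request looks like from Python, assuming a gzip-compressed CSV object with a header row. The function only assembles the keyword arguments for boto3's `select_object_content`; the bucket, key, and `status` column are hypothetical, and the actual call is shown in comments because it needs boto3 and AWS credentials.

```python
def select_request(bucket: str, key: str, expression: str) -> dict:
    """Assemble keyword arguments for an S3 Select query over a gzipped CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},  # first row holds the column names
            "CompressionType": "GZIP",
        },
        "OutputSerialization": {"JSON": {}},
    }


params = select_request(
    "examplebucket",
    "logs/events.csv.gz",  # hypothetical object
    "SELECT s.* FROM S3Object s WHERE s.status = 'error'",
)
print(params)

# With boto3 installed and credentials configured, the request would be sent as:
#   import boto3
#   response = boto3.client("s3").select_object_content(**params)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode("utf-8"))
```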
  25. Amazon EMR: Decouple compute & storage (Pro Tip)
      • Highly distributed processing frameworks such as Hadoop/Spark
      • Compress datasets
      • Columnar file formats
      • Aggregate small files
      • S3distcp "group-by" clause
  26. Amazon Redshift Spectrum: Exabyte-scale query-in-place (Pro Tip)
      • Structured data w/ joins
      • Multiple on-demand clusters (scale concurrency)
      • Columnar file formats
      • Data partitioning
      • Better query performance with predicate pushdown
  27. Amazon Athena: Query without ETL (Pro Tip)
      • Serverless service
      • Schema on read
      • Compress datasets
      • Columnar file formats
      • Optimize file sizes
      • Optimize querying (Presto backend)
  28. Use the right data formats
      • Pay by the amount of data scanned per query
      • Use compressed columnar formats: Parquet, ORC
      • Easy to integrate with a wide variety of tools
      Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
      Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
      Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
      Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
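The cost column on slide 28 follows directly from paying per byte scanned. The sketch below reproduces the arithmetic under two assumptions that are implied by the table rather than stated on the slide: a price of $5 per TB scanned ($5.75 / 1.15 TB) and 1 TB = 1024 GB.

```python
PRICE_PER_TB_USD = 5.00  # assumption: implied by $5.75 for 1.15 TB scanned
GB_PER_TB = 1024         # assumption: binary units


def query_cost(scanned_tb: float) -> float:
    """Cost of one query when billing is proportional to data scanned."""
    return scanned_tb * PRICE_PER_TB_USD


text_cost = query_cost(1.15)                 # text logs: 1.15 TB scanned
parquet_cost = query_cost(2.69 / GB_PER_TB)  # Parquet logs: 2.69 GB scanned
size_reduction = 1 - 130 / GB_PER_TB         # 130 GB of Parquet vs 1 TB of text

print(f"text query:    ${text_cost:.2f}")
print(f"parquet query: ${parquet_cost:.3f}")
print(f"storage saved: {size_reduction:.0%}")
```

The Parquet query is cheaper for two compounding reasons: the files are compressed, and a columnar layout lets the engine read only the columns the query touches.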
  29. Special guest: Viber
      Viber data lake
      Amir Ish-Shalom, Chief Architect, Viber
  30. • Messaging (including group)
      • Secure end-to-end encryption
      • Rich media and chat extensions
      • Full multiple device support
      • HD video and voice calls
      • Viber out and Viber in
      • Public chats and accounts
      • Chatbots
  31. (No text on this slide.)
  32. Big data @ Viber
      • Close to 1 billion users worldwide
      • Globally used in 230 countries
      • 10-15 billion events daily (2 TB)
      • 300,000 events per second (peak hours)
      • 5 PB of data stored on Amazon S3/Amazon Glacier
      • NoSQL DB (Couchbase) performing 2 million TPS on 20 TB of data with 35 billion keys
  33. Viber: big data architecture
      Diagram labels:
      • Viber backend servers
      • Events
      • RT data pipeline: Kinesis
      • RT data processor: Apache Storm
      • Raw data backup: Kinesis Firehose
      • Data lake: Amazon S3
      • ETL jobs: Spark, Presto, Pig, Lambda functions
      • Query engines: Presto, Athena, Spark SQL, Pig
      • Reporting tools: Tableau, Redash, Zeppelin, others
      • Databases & data warehouses: Amazon Redshift, Aurora, MySQL
      • NoSQL profile database: Couchbase
  34. Data lake challenges
      • Use case #1: S3 performance
      • Use case #2: Data access rights
      • Use case #3: Encrypted data storage
      • Use case #4: Storage of data from third parties
  35. Use case: S3 performance
      Challenge:
      • Over 300 different event types with large throughput variance
      • Storm data processor created many small files, especially for lower-throughput events
      • Events were stored in Hive partitioned folders (Y/M/D/H), which are not optimal for Amazon S3
      • Running a query over these events using Presto could generate up to 15K TPS on a single S3 bucket, resulting in 5xx errors and throttling the whole bucket for other processes
  36. Use case: S3 performance
      Solution:
      • Concatenate small files into large files, optimally 100 MB+
      • Convert files into a columnar file format such as Parquet or ORC
      Diagram flow: S3, then S3DistCp (concatenate files), then S3, then Spark (convert to Parquet), then S3
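The concatenation step on slide 36 needs a plan for which small files go into which large file. The sketch below shows one simple greedy approach: keep adding files to a batch until it reaches the 100 MB target from the slide. It is not how S3DistCp groups files (that is driven by its group-by pattern), and the file names and sizes are made up.

```python
MB = 1024 * 1024


def plan_batches(files, target_bytes=100 * MB):
    """Group (key, size) pairs into batches of at least `target_bytes` each.

    Files are taken in the order given; only the final batch may fall short.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover files that did not reach the target
    return batches


small_files = [("evt-0001", 60 * MB), ("evt-0002", 50 * MB), ("evt-0003", 10 * MB)]
batches = plan_batches(small_files)
print(batches)
```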
  37. Use case: S3 performance
      Future solution:
      • Concatenate and convert files in a single process (Glue?)
      • Use a better partitioned Hive directory format (H/D/M/Y instead of Y/M/D/H)
      • Use even larger files for high-throughput events
      Diagram flow: S3, then Spark/AWS Glue (concatenate files & convert to Parquet), then S3
  38. Use case: Data access rights
      Challenge:
      • Events can contain sensitive personal data
      • Allow access to events without exposing personal data
  39. Use case: Data access rights
      Solution #1:
      • Create separate Hive metastores for full and redacted data
      • Reporting tools select the relevant metastore based on the current user
      Diagram labels: event definitions (Aurora RDS), full-access Hive metastore, redacted-access Hive metastore
  40. Use case: Data access rights
      Solution #2:
      • Anonymize sensitive personal data
      • Legal compliance issues
      • Limits data science capabilities
      Diagram labels: RT data pipeline (Kinesis), RT data processor (Apache Storm), data lake (S3), raw data, anonymized data
  41. Use case: Encrypted data storage
      Challenge:
      • Store daily backups in S3
      • Backup must be encrypted
      • Strict access control
      • Regional replication
      • Complex data retention
  42. Use case: Encrypted data storage
      Solution:
      • Security: encrypt using SSE-KMS
      • Access: require permissions to both the S3 bucket and the KMS key
      • Tagging: use S3 object-level tagging to apply different lifecycle policies to certain objects
      • Multiple regions: use CRR-KMS
      Diagram labels: Viber back-end servers, SSE-KMS, local region data storage (Amazon S3), cross-region replication (CRR-KMS), remote regional data storage (Amazon S3)
  43. Use case: Storage of data from third parties
      Challenge:
      • Securely store data from third parties in the data lake
      • Validate data before storing
      • Allow optional data transformation
  44. Use case: Storage of data from third parties
      Solution:
      • Third-party access via access keys
      • Lambda validates the data, performs transformations, and uploads it using KMS to another S3 bucket in the Viber data lake
      Diagram labels: third party, S3 (external bucket), Lambda, SSE-KMS, S3 data lake
  45. Viber data lake: summary
  46. Special guest: Airbnb
      Airbnb: tiered storage system
      Hongbo Zeng, Software Engineer, Airbnb
  47. Agenda
      • Challenges
      • Tiered storage system
      • S3A+ file system
  48. Motivations
      • 3x+ YoY data growth
      • HDFS bottlenecks
        • Namenode scalability
        • Cost
      • S3 is an object store
        • Eventual consistency
        • Metadata retrieval performance
        • Read/write performance
  49. Tiered storage system
      • HDFS + S3
        • Hot data on HDFS
        • Warm and cold data on S3
      • Bring the best of both together
        • Performance
        • Scalability
        • Cost
      Diagram labels: clients (Hive, Presto, Spark, etc.), HDFS cluster, S3
  50. Architecture
      Diagram labels: HDFS/Hive cluster, S3, archive policy, retention policy, UI, FSImage / Metastore pipeline, storage processor, data archive, data retention, data backup
  51. Backup
      Example input: DB foo, table bar, partition baz
      Stages, each run by reducers (in the example, one reducer handles a0 and a1, another handles a2 and a3):
      • Generate paths for files (dest paths for foo/bar/baz/a0 through foo/bar/baz/a3)
      • Copy files
      • Generate and compare CRC
      • Update the metadata for partitions (tags for foo/bar/baz)
      Src | Dest
      foo/bar/baz/a0 | s3://bkt/foo/bar/baz/data/a0
      foo/bar/baz/a1 | s3://bkt/foo/bar/baz/data/a1
      foo/bar/baz/a2 | s3://bkt/foo/bar/baz/data/a2
      foo/bar/baz/a3 | s3://bkt/foo/bar/baz/data/a3
  52. Archive
      • Metadata validation
        • Is there a successful backup?
        • Is the backup location valid?
        • Has anything changed since the latest backup?
      • Data validation
        • File count
        • File size
      • Archive
        • Update the location of partitions
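The backup path mapping on slide 51 and the data validation on slide 52 can be sketched in a few lines. The `backup_dest` and `safe_to_archive` helpers are hypothetical names; the path layout (`s3://bkt/<partition path>/data/<file>`) follows the Src/Dest table on the slide, while the CRC comparison and metadata checks are left out.

```python
import posixpath


def backup_dest(src: str, bucket: str = "bkt") -> str:
    """Map an HDFS-relative file path to its S3 backup location."""
    directory, name = posixpath.split(src)
    return f"s3://{bucket}/{directory}/data/{name}"


def safe_to_archive(hdfs_files: dict, s3_files: dict) -> bool:
    """Data validation before archiving: same file count and same size per file.

    Both arguments map file name -> size in bytes.
    """
    same_count = len(hdfs_files) == len(s3_files)
    same_sizes = all(s3_files.get(name) == size for name, size in hdfs_files.items())
    return same_count and same_sizes


print(backup_dest("foo/bar/baz/a0"))
print(safe_to_archive({"a0": 100, "a1": 200}, {"a0": 100, "a1": 200}))
```

Only when the check passes would the partition location be switched to S3 and the HDFS copy deleted.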
  53. The journey of a partition
      Timeline dates as shown on the slide: 2017-12-31, 2017-12-30, 2017-11-30, 2017-11-30
      Steps:
      • Generate a partition: foo.db/bar/ds=11-30
      • Back up the partition to s3://bkt/foo.db/bar/ds=11-30
      • Archive the partition foo.db/bar/ds=11-30
      • Delete the data from HDFS
  54. Problem solved?
      • HDFS bottlenecks
        • Namenode scalability
        • Cost
      • S3 is an object store
        • Eventual consistency
        • Metadata retrieval performance
        • Read/write performance
  55. Problem solved … partly
      • HDFS bottlenecks
        • Namenode scalability
        • Cost
      • S3 is an object store
        • Eventual consistency
        • Metadata retrieval performance
        • Read/write performance
  56. S3A+ file system
      • Cache metadata
      • Leverage S3 multipart API
      • Prefetch data for reads
  57. Metadata cache in MySQL
      Path | Is dir | Is empty | Length
      s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
      s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
      s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
      s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300
  58. Metadata cache in MySQL
      Path | Is dir | Is empty | Length
      s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
      s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
      s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
      s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300
      30x speed-up
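The idea behind the metadata cache on slides 57 and 58 is to answer directory listings from a database table instead of issuing S3 LIST requests. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs anywhere; the table and column names are assumptions modeled on the slide's columns, and cache invalidation is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL cache in the talk
conn.execute(
    "CREATE TABLE metadata_cache ("
    "path TEXT PRIMARY KEY, is_dir INTEGER, is_empty INTEGER, length INTEGER)"
)
conn.executemany(
    "INSERT INTO metadata_cache VALUES (?, ?, ?, ?)",
    [
        ("s3a://bucket/foo/bar/baz/data", 1, 0, 0),
        ("s3a://bucket/foo/bar/baz/data/a0", 0, 0, 100),
        ("s3a://bucket/foo/bar/baz/data/a1", 0, 0, 200),
        ("s3a://bucket/foo/bar/baz/data/a2", 0, 0, 300),
    ],
)


def list_under(conn, directory: str) -> list:
    """Return (path, length) for every cached entry below `directory`."""
    prefix = directory.rstrip("/") + "/"
    cursor = conn.execute(
        "SELECT path, length FROM metadata_cache WHERE path LIKE ? ORDER BY path",
        (prefix + "%",),
    )
    return cursor.fetchall()


listing = list_under(conn, "s3a://bucket/foo/bar/baz/data")
print(listing)
```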
  59. S3 multipart API
      • Improved throughput
      • Quick recovery from any network issues
      • Pause and resume object uploads
      • Begin an upload before you know the final object size
  60. S3 multipart API
      • Improved throughput
      • Quick recovery from any network issues
      • Pause and resume object uploads
      • Begin an upload before you know the final object size
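The throughput and recovery benefits listed for the multipart API come from splitting an object into independently uploaded parts. The sketch below computes the byte range of each part for a given object size. The 8 MiB default part size is an arbitrary assumption; the 5 MiB minimum part size (all parts except the last) and the 10,000-part maximum are S3 limits.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts per upload


def part_ranges(object_size: int, part_size: int = 8 * MIB) -> list:
    """Return inclusive (first_byte, last_byte) ranges, one per part."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum")
    ranges = [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]
    if len(ranges) > MAX_PARTS:
        raise ValueError("too many parts; use a larger part_size")
    return ranges


ranges = part_ranges(20 * MIB)
print(ranges)
```

Each range can be uploaded in parallel and retried on its own, which is where the improved throughput and quick recovery from network issues come from.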
  61. S3A multipart API
  62. Read prefetch
  63. Performance
      Chart: Hive Query Performance. X-axis: Hive queries (1 to 6). Y-axis: latency in seconds (0 to 12000). Series: S3A latency, S3A+ latency.
  64. Putting it all together
  65. To summarize
      ✓ Always store a copy of the raw input
      ✓ Use automation with S3 events to enable trigger-based workflows
      ✓ Implement the right security controls
      ✓ Use a format that supports your data, rather than forcing your data into the format
      ✓ Partition data to improve performance
      ✓ Apply compression to lower network load and cost
  66. New storage training
      For Enterprise Storage Engineers:
      • Learn how to architect and manage highly available solutions on AWS storage services
      • Advance toward AWS certifications
      • Help your organization migrate to the cloud faster
      Online at www.aws.training:
      • Access 100+ new digital training courses including advanced training on storage
      • Deep dives on Amazon S3, EFS, and EBS
      • Migrating and tiering storage to AWS (hybrid solutions)
      At re:Invent:
      • Visit Hands-on Labs at the Venetian
      • Attend a proctored "Introduction to EFS" Spotlight Lab on Thursday at 3pm at the Venetian
      • Meet storage experts at the Ask the Experts in Hands-on Labs room at the Venetian
  67. Q&A
      Amazon S3, Amazon Glacier
  68. Thank You!
