
(BDT317) Building A Data Lake On AWS


"Conceptually, a data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created “on demand”, providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we will introduce key concepts for a data lake and present aspects related to its implementation. We will discuss critical success factors, pitfalls to avoid as well as operational aspects such as security, governance, search, indexing and metadata management. We will also provide insight on how AWS enables a data lake architecture.  

A data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created ""on demand"", providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we introduce key concepts for a data lake and present aspects related to its implementation. We discuss critical success factors and pitfalls to avoid, as well as operational aspects such as security, governance, search, indexing, and metadata management. We also provide insight on how AWS enables a data lake architecture. Attendees get practical tips and recommendations to get started with their data lake implementations on AWS."

Published in: Technology

(BDT317) Building A Data Lake On AWS

  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ian Meyers, Principal Solution Architect, AWS October 2015 BDT317 Building Your Data Lake on AWS
  2. 2. Benefits of the Enterprise Data Warehouse • Self documenting schema • Enforced data types • Ubiquitous and common security model • Simple tools to access, robust ecosystem • Transactionality
  3. 3. STORAGE COMPUTE
  4. 4. But customers have additional requirements…
  5. 5. The Rise of “Big Data” Enterprise data warehouse Amazon EMR Amazon S3
  6. 6. STORAGE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE COMPUTE
  7. 7. Benefits of Separation of Compute & Storage • All your data, without paying for unused cores • Independent cost attribution per dataset • Use the right tool for a job, at the right time • Increased durability without operations • Common model for data, without enforcing access method
  8. 8. Comparison of a Data Lake to an Enterprise Data Warehouse • Complementary to the EDW (not a replacement); the data lake can be a source for the EDW • Schema on read (no predefined schemas) vs. schema on write (predefined schemas) • Structured, semi-structured, and unstructured data vs. structured data only • Fast ingestion of new data/content vs. time-consuming introduction of new content • Data science, prediction/advanced analytics, and BI use cases vs. BI use cases only • Data at a low level of detail/granularity vs. data at a summary/aggregated level • Loosely defined SLAs vs. tight SLAs (production schedules) • Flexibility in tools (open source/tools for advanced analytics) vs. limited flexibility in tools (SQL only)
  9. 9. The New Problem EMR + S3 ≠ enterprise data warehouse • "Which system has my data?" • "How can I do machine learning against the DW?" • "I built this in Hive, can we get it into the Finance reports?" • "These sources are giving different results…" • "But I implemented the algorithm in Anaconda…"
  10. 10. Dive Into The Data Lake EMR + S3 ≠ enterprise data warehouse
  11. 11. Dive Into The Data Lake EMR + S3 and the enterprise data warehouse exchange data (load cleansed data in, export computed aggregates out) • Ingest any data • Data cleansing • Data catalogue • Trend analysis • Machine learning • Structured analysis • Common access tools • Efficient aggregation • Structured business rules
  12. 12. Components of a Data Lake Data Storage • High durability • Stores raw data from input sources • Support for any type of data • Low cost Streaming • Streaming ingest of feed data • Provides the ability to consume any dataset as a stream • Facilitates low latency analytics Storage & Streams Catalogue & Search Entitlements API & UI
  13. 13. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI Catalogue • Metadata lake • Used for summary statistics and data classification management Search • Simplified access model for data discovery
  14. 14. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI Entitlements system • Encryption • Authentication • Authorisation • Chargeback • Quotas • Data masking • Regional restrictions
  15. 15. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI API & User Interface • Exposes the data lake to customers • Programmatically query catalogue • Expose search API • Ensures that entitlements are respected
  16. 16. STORAGE High durability Stores raw data from input sources Support for any type of data Low cost Storage & Streams Catalogue & Search Entitlements API & UI
  17. 17. Amazon Simple Storage Service Highly scalable object storage for the Internet • 1 byte to 5 TB in size • Designed for 99.999999999% durability, 99.99% availability • Regional service, no single points of failure • Server-side encryption
  18. 18. Storage Lifecycle Integration S3 – Standard S3 – Infrequent Access Amazon Glacier
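The tiering shown above (S3 Standard → S3 Infrequent Access → Amazon Glacier) can be expressed as an S3 lifecycle configuration. A minimal sketch in the shape boto3's `put_bucket_lifecycle_configuration` expects; the bucket name, prefix, and day thresholds are illustrative:

```python
# Sketch: tier raw data to Infrequent Access after 30 days, Glacier after 90.
# Bucket name, prefix, and thresholds are placeholders, not recommendations.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applying it would be:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="mydatalake", LifecycleConfiguration=lifecycle)
```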
  19. 19. Data Storage Format • Not all data formats are created equally • Unstructured vs. semi-structured vs. structured • Store a copy of raw input • Data standardisation as a workflow following ingest • Use a format that supports your data, rather than force your data into a format • Consider how data will change over time • Apply common compression
  20. 20. Consider Different Types of Data Unstructured • Store native file format (logs, dump files, whatever) • Compress with a streaming codec (LZO, Snappy) Semi-structured - JSON, XML files, etc. • Consider evolution ability of the data schema (Avro) • Store the schema for the data as a file attribute (metadata/tag) Structured • Lots of data is CSV! • Columnar storage (Orc, Parquet)
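As a minimal illustration of applying common compression to a semi-structured record before storage, a standard-library sketch (the record is invented, and gzip stands in for the streaming codecs such as LZO or Snappy the slide suggests for large files):

```python
import gzip
import json

# An illustrative semi-structured record.
record = {"event": "page_view", "user": "u-123", "ts": "2015-10-08T12:00:00Z"}

# Serialize as JSON and compress before writing to the storage layer.
raw = json.dumps(record).encode("utf-8")
compressed = gzip.compress(raw)

# Reading it back is symmetric.
restored = json.loads(gzip.decompress(compressed))
assert restored == record
```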
  21. 21. Where to Store Data • Amazon S3 storage uses a flat keyspace • Separate data storage by business unit, application, type, and time • Natural data partitioning is very useful • Paths should be self documenting and intuitive • Changing prefix structure in future is hard/costly
  22. 22. Metadata Services – Resource Oriented Architecture • CRUD API, Query API, and Analytics API as systems of reference • Return URLs as deeplinks to applications, file exchanges via S3 (RESTful file services), or manifests for big data analytics / HPC • Integration layer: system to system via Amazon SNS/Amazon SQS, system to user via mobile push, Amazon Simple Workflow for high-level system integration/orchestration • Key scheme: s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied} • http://en.wikipedia.org/wiki/Resource-oriented_architecture
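A self-documenting key scheme like the one above can be generated with a small helper so every writer partitions data the same way. A sketch; the function name and example values are hypothetical, and the entitlement fragment is omitted:

```python
from datetime import date

def object_key(system, application, resource, resource_id, day=None):
    """Build a self-documenting, date-partitioned S3 key:
    system/application/YYYY-MM-DD/resource/resourceID."""
    day = day or date.today()
    return f"{system}/{application}/{day:%Y-%m-%d}/{resource}/{resource_id}"

key = object_key("sales", "weblogs", "clickstream", "00042", date(2015, 10, 8))
# -> "sales/weblogs/2015-10-08/clickstream/00042"
```

Centralizing key construction matters because, as the slides note, changing the prefix structure later is hard and costly.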
  23. 23. STREAMING Streaming ingest of feed data Provides the ability to consume any dataset as a stream Facilitates low latency analytics Storage & Streams Catalogue & Search Entitlements API & UI
  24. 24. Why Do Streams Matter? • Latency between event & action • Most BI systems target event to action latency of 1 hour • Streaming analytics would expect event to action latency < 2 seconds • Stream orientation simplifies architecture, but can increase operational complexity • Increase in complexity needs to be justified by business value of reduced latency
  25. 25. Amazon Kinesis Managed service for real-time big data processing • Create streams to produce & consume data • Elastically add and remove shards for performance • Use the Amazon Kinesis Client Library to process data • Integration with S3, Amazon Redshift, and DynamoDB
  26. 26. Amazon Kinesis Architecture • Data sources write to an AWS endpoint • Shard 1, Shard 2 … Shard N, replicated across Availability Zones • Consuming applications: App.1 [Archive/Ingestion], App.2 [Sliding Window Analysis], App.3 [Data Loading], App.4 [Event Processing Systems] • Downstream stores: S3, DynamoDB, Amazon Redshift
  27. 27. Streaming Storage Integration Object store Amazon S3 Streaming store Amazon Kinesis Analytics applications Read & write file dataRead & write to streams Archive stream Replay history
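Writing feed data into the streaming store might look like the following sketch. The stream name and record are hypothetical; the request shape follows the Kinesis PutRecord API, and the actual network call is left commented out:

```python
import json

def kinesis_put_request(stream_name, record, partition_key):
    """Shape a Kinesis PutRecord request: Data must be bytes, and all
    records sharing a PartitionKey are routed to the same shard."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(record).encode("utf-8"),
        "PartitionKey": partition_key,
    }

req = kinesis_put_request("clickstream", {"page": "/home", "user": "u-123"}, "u-123")

# The actual call would be:
# import boto3
# boto3.client("kinesis").put_record(**req)
```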
  28. 28. CATALOGUE & SEARCH Metadata lake Used for summary statistics and data classification management Simplified model for data discovery & governance Storage & Streams Catalogue & Search Entitlements API & UI
  29. 29. Building a Data Catalogue • Aggregated information about your storage & streaming layer • Storage service for metadata: ownership, data lineage • Data abstraction layer: customer data = collection of prefixes • Enabling data discovery • API for use by the entitlements service
  30. 30. Data Catalogue – Metadata Index • Stores data about your Amazon S3 storage environment • Total size & count of objects by prefix, data classification, refresh schedule, object version information • Amazon S3 events processed by Lambda function • DynamoDB metadata tables store required attributes
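The S3-event-to-metadata-index flow above might look like this Lambda sketch. The attribute names are illustrative and the event shape is the standard S3 notification; the DynamoDB write itself is only indicated in a comment:

```python
def s3_event_to_metadata_items(event):
    """Extract the attributes a metadata index needs (names illustrative)
    from an S3 event notification."""
    items = []
    for rec in event.get("Records", []):
        obj = rec["s3"]["object"]
        items.append({
            "bucket": rec["s3"]["bucket"]["name"],
            "key": obj["key"],
            "size": obj.get("size", 0),
            "event": rec["eventName"],
        })
    return items

def handler(event, context):
    """Lambda entry point: in the real function each item would be written
    to the DynamoDB metadata table, e.g. table.put_item(Item=item)."""
    return s3_event_to_metadata_items(event)
```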
  31. 31. http://amzn.to/1LSSbFp
  32. 32. Amazon DynamoDB Managed NoSQL • Provisioned throughput NoSQL database • Fast, predictable, configurable performance • Fully distributed, fault-tolerant HA architecture • Integration with Amazon EMR & Hive
  33. 33. AWS Lambda Serverless compute • Fully managed event processor • Node.js or Java, integrated AWS SDK • Natively compile & install any Node.js modules • Specify runtime RAM & timeout • Automatically scaled to support event volume • Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda • Integrated CloudWatch logging
  34. 34. Data Catalogue – Search • Ingestion and pre-processing • Text processing (normalization): tokenization, downcasing, stemming, stopword removal, synonym addition • Indexing • Matching • Ranking and relevance: TF-IDF plus additional criteria (rating, user behavior, freshness, etc.) • Sources: RDBMS, NoSQL, files, any source feeding the search index processor
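The normalization steps named above can be sketched in a few lines; the stopword list, synonym map, and suffix-stripping rule are purely illustrative stand-ins for a real analyzer:

```python
# Illustrative, tiny stand-ins for real analyzer resources.
STOPWORDS = {"the", "a", "an", "of", "and"}
SYNONYMS = {"ml": "machine-learning"}

def normalize(text):
    """Tokenize, downcase, drop stopwords, apply naive suffix stemming,
    and add synonyms - the slide's pipeline in simplified form."""
    tokens = [t.lower() for t in text.split()]          # tokenize + downcase
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    stemmed = [t[:-1] if t.endswith("s") and len(t) > 3 else t
               for t in tokens]                         # naive stemming
    expanded = []
    for t in stemmed:                                   # synonym addition
        expanded.append(t)
        if t in SYNONYMS:
            expanded.append(SYNONYMS[t])
    return expanded
```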
  35. 35. Amazon CloudSearch & Amazon Elasticsearch Service – Features and Benefits • Easy to set up and operate: AWS Management Console, SDK, CLI • Scalable: automatic scaling on data size and traffic • Reliable: automatic recovery of instances, Multi-AZ, etc. • High performance: low latency and high throughput through in-memory caching • Fully managed: no capacity guessing • Rich features: faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc. • Cost effective: pay as you go
  36. 36. Data Catalogue – Building Search Index • Enable DynamoDB Update Stream for metadata index table • Additional AWS Lambda function reads Update Stream and extracts index fields from S3 object • Update to search domain
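The stream-to-index mapping described above might be sketched as follows, assuming hypothetical metadata attributes (key, bucket, size); the record follows the DynamoDB update-stream shape, and the update to the search domain itself is omitted:

```python
def stream_record_to_search_doc(record):
    """Map a DynamoDB update-stream record for the metadata table into a
    search-domain document (field names are illustrative)."""
    image = record["dynamodb"].get("NewImage", {})
    return {
        "id": image["key"]["S"],          # document id = object key
        "fields": {
            "bucket": image["bucket"]["S"],
            "size": int(image["size"]["N"]),  # DynamoDB numbers arrive as strings
        },
    }
```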
  37. 37. Catalogue & Search Architecture
  38. 38. ENTITLEMENTS Encryption Authentication Authorisation Chargeback Quotas Data masking Regional restrictions Storage & Streams Catalogue & Search Entitlements API & UI
  39. 39. Data Lake != Open Access
  40. 40. Identity & Access Management • Manage users, groups, and roles • Identity federation with OpenID providers • Temporary credentials with the AWS Security Token Service (AWS STS) • Stored policy templates • Powerful policy language • Amazon S3 bucket policies
  41. 41. IAM Policy Language • JSON documents • Can include variables which extract information from the request context: aws:CurrentTime – for date/time conditions; aws:EpochTime – the date in epoch/UNIX time, for date/time conditions; aws:TokenIssueTime – when temporary security credentials were issued, for date/time conditions; aws:principaltype – whether the principal is an account, user, federated, or assumed role; aws:SecureTransport – Boolean, whether the request was sent using SSL; aws:SourceIp – the requester's IP address, for IP address conditions; aws:UserAgent – information about the requester's client application, for string conditions; aws:userid – the unique ID of the current user; aws:username – the friendly name of the current user
  42. 42. IAM Policy Language Example: Allow a user to access a private part of the data lake
     {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Action": ["s3:ListBucket"],
           "Effect": "Allow",
           "Resource": ["arn:aws:s3:::mydatalake"],
           "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
         },
         {
           "Action": ["s3:GetObject", "s3:PutObject"],
           "Effect": "Allow",
           "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
         }
       ]
     }
  43. 43. IAM Federation • IAM allows federation to Active Directory and other identity providers (Amazon, Facebook, Google) • AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS, over AWS Direct Connect or a hardware VPN
  44. 44. Extended user defined security
  45. 45. Entitlements Engine: Amazon STS Token Vending Machine http://amzn.to/1FMPrTF
  46. 46. Data Encryption • AWS CloudHSM: dedicated tenancy, SafeNet Luna SA HSM device, Common Criteria EAL4+, NIST FIPS 140-2 • AWS Key Management Service: automated key rotation & auditing, integration with other AWS services • AWS server-side encryption: AWS-managed key infrastructure
  47. 47. Entitlements – Access to Encryption Keys Customer Master Key Customer Data Keys Ciphertext Key Plaintext Key IAM Temporary Credential Security Token Service MyData MyData S3 S3 Object … Name: MyData Key: Ciphertext Key …
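The envelope-encryption flow above (master key → data keys → ciphertext key stored with the object) can be sketched as bookkeeping code. The field names and key alias are hypothetical; the response shape follows the KMS GenerateDataKey API (Plaintext, CiphertextBlob), and no AWS call is made here:

```python
def envelope_metadata(object_name, data_key_response):
    """Record envelope-encryption bookkeeping for an object: only the
    ciphertext data key is stored alongside the object; the plaintext key
    encrypts the payload and is then discarded."""
    return {
        "Name": object_name,
        "Key": data_key_response["CiphertextBlob"],  # decrypted via KMS at read time
    }

# Synthetic response in the GenerateDataKey shape (no AWS call made):
fake_response = {"Plaintext": b"0" * 32, "CiphertextBlob": b"wrapped-key"}
meta = envelope_metadata("MyData", fake_response)

# Real key material would come from:
# import boto3
# resp = boto3.client("kms").generate_data_key(KeyId="alias/datalake", KeySpec="AES_256")
```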
  48. 48. Secure Data Flow IAM Amazon S3 API Gateway Users Temporary Credential s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID} Encrypted Data Metadata Index – DynamoDB TVM – Elastic Beanstalk Security Token Service
  49. 49. API & UI Exposes the data lake to customers Programmatically query catalogue Expose search API Ensures that entitlements are respected Storage & Streams Catalogue & Search Entitlements API & UI
  50. 50. Data Lake API & UI • Exposes the Metadata API, search, and Amazon S3 storage services to customers • Can be based on TVM/STS Temporary Access for many services, and a bespoke API for Metadata • Drive all UI operations from API?
  51. 51. AMAZON API GATEWAY
  52. 52. Amazon API Gateway Host multiple versions and stages of APIs Create and distribute API keys to developers Leverage AWS Sigv4 to authorize access to APIs Throttle and monitor requests to protect the backend Leverages AWS Lambda
  53. 53. Additional Features • Managed cache to store API responses • Reduced latency and DDoS protection through Amazon CloudFront • SDK generation for iOS, Android, and JavaScript • Swagger support • Request/response data transformation and API mocking
  54. 54. An API Call Flow Internet Mobile Apps Websites Services API Gateway AWS Lambda functions AWS API Gateway cache Endpoints on Amazon EC2 Any other publicly accessible endpoint Amazon CloudWatch monitoring Amazon CloudFront
  55. 55. API & UI Architecture API Gateway UI – Elastic Beanstalk AWS Lambda Metadata Index Users IAM TVM – Elastic Beanstalk
  56. 56. Putting It All Together
  57. 57. A Data Lake Is… • A foundation of highly durable data storage and streaming of any type of data • A metadata index and workflow which helps us categorise and govern data stored in the data lake • A search index and workflow which enables data discovery • A robust set of security controls – governance through technology, not policy • An API and user interface that expose these features to internal and external users
  58. 58. Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier • Entitlements: IAM, Security Token Service, KMS, encrypted data • Data Catalogue & Search: AWS Lambda, search index, metadata index • API & UI: API Gateway, Users, UI – Elastic Beanstalk, TVM – Elastic Beanstalk
  59. 59. Remember to complete your evaluations!
  60. 60. Thank you! Ian Meyers, Principal Solution Architect
