© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Big Data @Scale
Session Code: 194326
Gargi Singh Chhatwal, Associate Solutions Architect, AWS
Dr. Nitin Naik, Chief Technology Officer, Census
Nandakumar Sreenivasan, Senior Solutions Architect, AWS
Key Takeaways
1. Why big data?
2. How to do big data processing on AWS?
3. Architectural patterns
4. US Census data lake overview
Ever-Increasing Data
International Data Corporation (IDC), Digital Universe
2016: 16.1 zettabytes (ZB); 2025: 163 zettabytes (ZB)
Volume
Velocity
Variety
1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB
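The unit ladder and the two IDC figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) prefixes, as the slide does; the variable names are illustrative, and the compound annual growth rate is derived here rather than quoted from IDC.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# The conversions stated on the slide.
zb_in_eb = ZB // EB   # exabytes per zettabyte
zb_in_pb = ZB // PB   # petabytes per zettabyte
zb_in_tb = ZB // TB   # terabytes per zettabyte

# IDC digital-universe estimates quoted on the slide, in zettabytes.
size_2016_zb = 16.1
size_2025_zb = 163.0

growth_factor = size_2025_zb / size_2016_zb
# Compound annual growth rate over the nine years from 2016 to 2025.
annual_growth = growth_factor ** (1 / 9) - 1

print(f"1 ZB = {zb_in_eb:,} EB = {zb_in_pb:,} PB = {zb_in_tb:,} TB")
print(f"Growth 2016 to 2025: {growth_factor:.1f}x, about {annual_growth:.0%} per year")
```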
Big Data Processing @ Scale
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
data → answers
COLLECT
Logging: Amazon CloudWatch, AWS CloudTrail
IoT: Devices, Sensors & IoT solutions, AWS IoT Analytics
Applications: Mobile apps, Web apps, Enterprise apps
Getting data into AWS
AWS Direct Connect
AWS Snowball
Amazon Kinesis Firehose
AWS Storage Gateway
STORE
Database
Search: Amazon Elasticsearch Service
SQL: Amazon Redshift, Amazon RDS
NoSQL: Amazon DynamoDB
File/Object Storage
Amazon S3
Streaming data (IoT / applications / devices streams)
Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams
PROCESS / ANALYZE
Data Enrichment
Analyze: Batch, Interactive, Streaming
Extract, Transform, Load (ETL)
Data Lake
Amazon EMR, Amazon Kinesis, AWS Glue
Amazon EMR, Amazon Kinesis, Amazon QuickSight, Amazon Redshift*
Amazon ES, Amazon EMR, Amazon S3, Amazon Athena
Amazon EMR, AWS Glue, Amazon Redshift*
PROCESS / ANALYZE
Amazon Elastic MapReduce (EMR)
Fully managed Hadoop cluster framework
Supports big data frameworks such as Hive, Impala, Presto, Spark, and more
EMR File System (EMRFS) lets Amazon EMR clusters use Amazon S3 efficiently and securely for storage at any scale
Integrated with Amazon S3, Amazon RDS, Amazon Redshift, and any JDBC-compliant data store
On-demand and Spot pricing; pay as you go
PROCESS / ANALYZE
Amazon Redshift
Fully managed relational data warehouse
Massively parallel; petabyte scale
Data compression greatly reduces I/O
Columnar data storage designed for scale
$1,000/TB/year; starts at $0.25/hour
a lot faster
a lot simpler
a lot cheaper
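To see why columnar storage and compression reduce I/O, the sketch below pivots a few rows into columns and run-length encodes one of them. It is a generic illustration in plain Python, not a description of how Amazon Redshift is implemented; the sample rows and column names are invented.

```python
from itertools import groupby

# Hypothetical sample rows; the column names are invented for illustration.
rows = [
    {"region": "east", "status": "ok", "amount": 10},
    {"region": "east", "status": "ok", "amount": 12},
    {"region": "east", "status": "ok", "amount": 7},
    {"region": "west", "status": "ok", "amount": 3},
    {"region": "west", "status": "error", "amount": 9},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)
encoded_region = run_length_encode(columns["region"])

# A query such as SUM(amount) reads only one column, not every field of every row.
total_amount = sum(columns["amount"])

print(encoded_region)
print(total_amount)
```

Values in a single column tend to repeat, so five stored values shrink to two pairs here, and an aggregate over one column never touches the others.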
PROCESS / ANALYZE
Amazon Kinesis
Managed service for real-time big data processing
Kinesis Data Streams: create streams to produce and consume data; elastically add and remove shards for performance and scale
Kinesis Data Firehose: easily load massive amounts of streaming data into S3 and Redshift
Kinesis Data Analytics: easily analyze data streams using standard SQL queries; elastically scales to match data throughput
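Shards matter because each record's partition key decides which shard receives it. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below imitates that routing locally. It assumes evenly split ranges (real streams are resharded through split and merge operations), and the helper names and keys are hypothetical.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 yields a 128-bit value

def shard_ranges(shard_count):
    """Split the 128-bit hash space into contiguous, evenly sized ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges

def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside the configured ranges")

two_shards = shard_ranges(2)
four_shards = shard_ranges(4)  # the same stream after scaling out

for key in ["sensor-1", "sensor-2", "sensor-3"]:  # hypothetical partition keys
    print(key, shard_for_key(key, two_shards), shard_for_key(key, four_shards))
```

Because the ranges are contiguous, doubling the shard count splits each shard's traffic in two without moving keys between unrelated shards.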
PROCESS / ANALYZE
Amazon Athena
An interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL
Serverless: no infrastructure or resources to manage at any scale
Schema on read: same data, many views
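"Schema on read" means the stored bytes carry no fixed schema; column names and types are applied when a query reads them. The sketch below shows the idea with the Python standard library only. It does not use Athena, and the raw data and column names are invented.

```python
import csv
import io

# Raw text as it might sit in object storage; the contents are invented.
RAW = """2018-06-01,east,10.5
2018-06-01,west,3.0
2018-06-02,east,7.25
"""

def read_with_schema(raw_text, schema):
    """Parse raw CSV text, applying (name, type) pairs only at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows

# View 1: every field typed.
typed_view = read_with_schema(RAW, [("day", str), ("region", str), ("amount", float)])

# View 2: the same bytes read as plain strings under different column names.
string_view = read_with_schema(RAW, [("event_date", str), ("area", str), ("value", str)])

total = sum(row["amount"] for row in typed_view)
print(total)
print(string_view[0])
```

Both views read the same unchanged data, which is the "same data, many views" point on the slide.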
PROCESS / ANALYZE
AWS Glue
Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables
Managed Transform Engine: auto-generates ETL code; built on open frameworks (Python and Spark)
Job Scheduler: runs jobs on a serverless Spark platform; massively scalable
Integrated with Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon Athena, and any JDBC-compliant data store
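The crawler idea, sampling data and deriving a table definition from it, can be sketched as a small type-inference routine. This is only an illustration of the concept and is not how AWS Glue crawlers are implemented; the type names, column names, and sample values are assumptions.

```python
def infer_type(value):
    """Return the narrowest of 'bigint', 'double', or 'string' that fits a text value."""
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            cast(value)
            return type_name
        except ValueError:
            continue
    return "string"

# Widening order used when samples of one column disagree.
WIDTH = {"bigint": 0, "double": 1, "string": 2}

def infer_schema(header, sample_rows):
    """Infer one type per column by taking the widest type seen in the samples."""
    schema = {}
    for position, column in enumerate(header):
        seen = [infer_type(row[position]) for row in sample_rows]
        schema[column] = max(seen, key=WIDTH.get)
    return schema

header = ["respondent_id", "score", "state"]  # hypothetical columns
samples = [["101", "3.5", "MD"], ["102", "4", "VA"], ["103", "2.75", "DC"]]

# A catalog entry is then just the table name plus the inferred columns.
catalog_entry = {"table": "survey_sample", "columns": infer_schema(header, samples)}
print(catalog_entry)
```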
CONSUME
Apps & Services
API
Amazon QuickSight
Analysis and Visualization Notebooks
Putting It All Together
COLLECT
Applications: Mobile apps, Web apps, Enterprise apps
IoT: Devices, Sensors & IoT solutions, AWS IoT Analytics
Logging: Amazon CloudWatch, AWS CloudTrail
STORE
Categories: Search, SQL, NoSQL, File, Stream
Amazon Elasticsearch Service, Amazon RDS, Amazon Redshift, Amazon DynamoDB, Amazon S3, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS/ANALYZE
Categories: Batch, Interactive, Stream, ML
ETL
Streaming: Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda
Amazon Redshift, Amazon Machine Learning, Presto, Amazon EMR, Amazon EC2
CONSUME
Amazon QuickSight
Apps & Services
Analysis & visualization, Notebooks, API
Architectural Patterns
Building Event-Driven Batch Analytics on AWS
Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data
Staging data
Input validation/conversion layer
Pre-processed data
Input tracking layer
Aggr job submission and monitoring layer
AWS Lambda
State management store
Aggregation and load layer
Amazon Redshift
Amazon EMR
Identity and Access Management (IAM)
Monitoring and logging (CloudWatch)
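The interplay of the input tracking layer, the state management store, and the job submission layer can be sketched as an event handler that records each arriving input and submits the aggregation job once everything it needs is present. This is a minimal sketch under stated assumptions: an in-memory dict stands in for the state store, no AWS APIs are called, and all job, input, and function names are hypothetical.

```python
# In-memory stand-in for the state management store; a real deployment would
# use a durable store. All names below are hypothetical.
REQUIRED_INPUTS = {"daily_sales": {"orders", "returns"}}

state_store = {}      # job name -> set of inputs that have arrived
submitted_jobs = []   # record of submitted aggregation jobs

def track_input(job_name, input_name):
    """Input tracking layer: record that a validated input has arrived."""
    state_store.setdefault(job_name, set()).add(input_name)

def submit_if_ready(job_name):
    """Job submission layer: submit once every required input is present."""
    arrived = state_store.get(job_name, set())
    if REQUIRED_INPUTS[job_name] <= arrived and job_name not in submitted_jobs:
        submitted_jobs.append(job_name)
        return True
    return False

def handle_event(event):
    """Entry point shaped like an event handler: one event per arriving file."""
    track_input(event["job"], event["input"])
    return submit_if_ready(event["job"])

first = handle_event({"job": "daily_sales", "input": "orders"})
second = handle_event({"job": "daily_sales", "input": "returns"})
print(first, second, submitted_jobs)
```

Keeping the arrival state outside the handler is what lets short-lived, stateless functions coordinate a multi-input batch job.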
Real-Time and Batch Analytics Using the Big Data Architecture
Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data
Raw data in: Kinesis Data Firehose
Batch Layer: raw data, S3 buckets, Amazon EMR
Speed Layer: Kinesis Data Firehose, Kinesis Data Analytics, user device settings
Serving Layer: pre-processed views, filtered data, S3 buckets
Athena, Amazon QuickSight
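The three layers divide the work as follows: the batch layer periodically recomputes views from the full raw history, the speed layer covers events that arrived after the last batch run, and the serving layer combines both to answer queries. The sketch below shows that split on a toy event list. It is a plain-Python illustration with an invented event format and cutoff, not the pipeline from the slide.

```python
from collections import Counter

# Hypothetical event stream: (timestamp, device_id). The batch layer has
# processed everything up to and including BATCH_CUTOFF.
BATCH_CUTOFF = 100
events = [(90, "a"), (95, "b"), (99, "a"), (101, "a"), (105, "c")]

def batch_view(all_events, cutoff):
    """Batch layer: recompute counts from the full raw history up to the cutoff."""
    return Counter(device for ts, device in all_events if ts <= cutoff)

def speed_view(all_events, cutoff):
    """Speed layer: count only the recent events the batch run has not seen."""
    return Counter(device for ts, device in all_events if ts > cutoff)

def serve(batch, speed):
    """Serving layer: answer queries by combining both views."""
    return dict(batch + speed)

result = serve(batch_view(events, BATCH_CUTOFF), speed_view(events, BATCH_CUTOFF))
print(result)
```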
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes, No Loading Required
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Amazon Redshift
JDBC/ODBC
1 2 3 4 ... N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
Data lake on Amazon S3 with AWS Glue
Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data
Amazon QuickSight
AWS Glue ETL
U.S. Census: Enterprise Data Lake
Official US Statistics
• Collection and dissemination: mostly the same since World War II
• Multi-agency effort
• Surveys are the dominant data source
• Administrative records support surveys
Users want more, faster, current…
Users want more:
• Timely and detailed estimates
• Statistics that link with other data
• Microdata
• Relevant data
Big Data Benefits for Census
Enhance current surveys
Reduce respondent burden
Improve timeliness of release
Better information for unique situations
Granularity enhanced
Optimize data quality process
Problem Statement
Today, the process surrounding data access for the Census’s MathStats and Data Scientists is manual, cumbersome, and slow. Whether users are gaining access to data or linking data across datasets (e.g., AdRecs, multi-survey data, and multi-period data) for longitudinal or other studies, the Census’s data stewardship policies must be respected. The resulting data may inherit controls from the source data (e.g., Title 13, Title 26, and more), and manual efforts are currently required to track the data lineage from source to resulting data. Additionally, multiple IT environments are installed to handle each project’s survey instance.
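The two manual chores named above, carrying controls from source data to derived data and tracking lineage, lend themselves to automation. The sketch below shows one way a derived dataset could inherit the union of its sources' controls while recording where it came from. It is a hypothetical model, not the Census implementation: the class, function, and dataset names are invented, and the actual legal rules for combining Title 13 and Title 26 data are not modeled.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    """A dataset with its governing controls and the sources it was derived from."""
    name: str
    controls: frozenset = frozenset()
    sources: tuple = ()

def link(name, *sources):
    """Create a derived dataset that inherits the union of its sources' controls."""
    inherited = frozenset().union(*(s.controls for s in sources))
    return Dataset(name=name, controls=inherited, sources=tuple(sources))

def lineage(dataset):
    """Return the names of all upstream datasets, depth first."""
    names = []
    for source in dataset.sources:
        names.append(source.name)
        names.extend(lineage(source))
    return names

# Hypothetical datasets; only the control labels come from the text above.
survey = Dataset("survey_2017", frozenset({"Title 13"}))
adrecs = Dataset("adrecs_extract", frozenset({"Title 26"}))
linked = link("linked_study", survey, adrecs)

print(sorted(linked.controls))
print(lineage(linked))
```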
Decentralized Data Management Limitations
• Linking data across surveys is difficult
• Sharing data is a manual exercise
• Data is copied multiple times
• Honoring data stewardship policies requires distributed manual efforts
Security Control Limitations
• Controls must be duplicated for every survey system
• Governance and security measures are cumbersome
• Auditing and monitoring capabilities are inconsistent
Processing Approach Limitations
• Data processing code is inconsistently managed from one group to the next
• Reproducing results from base data is not feasible since data lineage is not consistently tracked
Technology Limitations
• Current approach requires constant acquisition of new servers
• Technology is inconsistent from one group or survey to the next
• Handling large datasets with complex calculations is challenging
[Figure: Census Data Limitation. Surveys S1, S2, S3, S4 … Sn, Sn+1 in the DEMO and ECON survey portfolios, shown against time period (M1, M2, M3 …; Y1, Y2 … Yn), with callouts 01 to 04.]
Census Security Control and Usage of Data
Enterprise Data Lake (EDL) Solution Supports the Mission
Security as a Service
Analytics as a Service
Enterprise Data Lake
Data as a Service
Content Repositories
Infrastructure & Operations as a Service
Data/Code Repository
Computational Environment
Data Ingestion Services
Transactional Systems / Data Sources
Legend: Cloud; Standardized Cloud Services; Standardized EDL Services; Component of EDL Ecosystem Specific to the EDL
The proposed EDL solution will support the business process by storing and analyzing any data with associated code at any time throughout the lifecycle.
data, encryption key, permissions, monitoring
Proposed Enterprise Data Lake in the Cloud
The data lake will streamline time-consuming tasks and simplify complex processes to make the Business and IT users’ lives easier. MathStats and Data Scientists will be able to focus on their data, models, and products rather than on administrative tasks.
[Figure: three per-survey stacks (Survey N, Survey N + 1, Survey N + 1…), each with its own Security, Governance, Infrastructure Management, Data Management, and Analytics, labeled DEMOGRAPHICS, DECENNIAL, ECON, and OTHER PROGRAMS; and an Enterprise stack with Directorate Analytics (three instances) over EDL Standard Services, Standardized Cloud Services, and Standardized Census Data Services, plus Governance and Security.]
Please complete the session survey in
the summit mobile app.
Thank you!
Questions?

Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Big Data@Scale

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Big Data @Scale. Session Code: 194326. Gargi Singh Chhatwal, Associate Solutions Architect, AWS; Nandakumar Sreenivasan, Senior Solutions Architect, AWS; Dr. Nitin Naik, Chief Technology Officer, Census.
  • 2. Key Takeaways: (1) Why big data? (2) How to do big data processing on AWS? (3) Architectural patterns. (4) US Census data lake overview.
  • 3. Ever-Increasing Data. International Data Corporation (IDC) digital universe: 16.1 zettabytes (ZB) in 2016; 163 ZB in 2025. Volume, velocity, variety. 1 zettabyte = 1,000 exabytes = 1 million PB = 1 billion TB.
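The unit conversions and the growth implied by the IDC figures on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units; the variable names are illustrative and do not come from the deck.

```python
# Decimal (SI) storage units, expressed in bytes.
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# Conversions stated on the slide.
zb_in_eb = ZB // EB   # exabytes per zettabyte
zb_in_pb = ZB // PB   # petabytes per zettabyte
zb_in_tb = ZB // TB   # terabytes per zettabyte

# IDC figures quoted on the slide, in zettabytes.
size_2016_zb = 16.1
size_2025_zb = 163.0

# Overall growth factor and the implied compound annual growth rate.
growth_factor = size_2025_zb / size_2016_zb
years = 2025 - 2016
cagr = growth_factor ** (1 / years) - 1

print(f"1 ZB = {zb_in_eb:,} EB = {zb_in_pb:,} PB = {zb_in_tb:,} TB")
print(f"Growth 2016 to 2025: {growth_factor:.1f}x, about {cagr:.0%} per year")
```

The compound annual rate is derived here from the two quoted data points only; it is not a figure stated in the deck.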
  • 4. Big Data Processing @ Scale: COLLECT → STORE → PROCESS/ANALYZE → CONSUME (from data to answers).
  • 5. COLLECT. Logging: Amazon CloudWatch, AWS CloudTrail. IoT: devices, sensors and IoT solutions, AWS IoT Analytics. Applications: mobile apps, web apps, enterprise apps.
  • 6. Getting data into AWS: AWS Direct Connect, AWS Snowball, Amazon Kinesis Firehose, AWS Storage Gateway.
  • 7. COLLECT → STORE (from data to answers).
  • 8. STORE. Database: Amazon Elasticsearch Service (search), Amazon Redshift and Amazon RDS (SQL), Amazon DynamoDB (NoSQL). File/object storage: Amazon S3. Streaming data (IoT, application, and device streams): Amazon Kinesis Firehose, Amazon Kinesis Streams, Apache Kafka, Amazon DynamoDB Streams.
  • 9. COLLECT → STORE → PROCESS/ANALYZE (from data to answers).
  • 10. PROCESS / ANALYZE. Categories: data enrichment; analyze (batch, interactive, streaming); extract, transform, load (ETL); data lake. Services shown across these categories: Amazon EMR, Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Redshift*, Amazon ES, Amazon S3, Amazon Athena.
  • 11. PROCESS / ANALYZE: Amazon EMR (Elastic MapReduce). Fully managed Hadoop cluster framework. Supports big data frameworks such as Hive, Impala, Presto, Spark, and more. The EMR File System (EMRFS) allows Amazon EMR clusters to efficiently and securely use Amazon S3 for storage at any scale. Integrated with Amazon S3, Amazon RDS, Amazon Redshift, and any JDBC-compliant data store. On-demand and Spot pricing; pay as you go.
  • 12. PROCESS / ANALYZE: Amazon Redshift. Fully managed relational data warehouse. Massively parallel; petabyte scale. Data compression greatly reduces I/O. Columnar data storage designed for scale. $1,000/TB/year; starts at $0.25/hour. A lot faster, a lot simpler, a lot cheaper.
  • 13. PROCESS / ANALYZE: Amazon Kinesis, a managed service for real-time big data processing. Kinesis Data Streams: create streams to produce and consume data; elastically add and remove shards for performance and scale. Kinesis Data Firehose: easily load massive amounts of streaming data into S3 and Redshift. Kinesis Data Analytics: easily analyze data streams using standard SQL queries; elastically scales to match data throughput.
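To make the role of shards concrete: Kinesis Data Streams routes each record by taking the MD5 hash of its partition key as a 128-bit integer and matching it against the hash key range owned by each shard. The sketch below imitates that routing in plain Python with no AWS calls. The evenly split ranges, the function names, and the `device-N` keys are assumptions made for illustration.

```python
import hashlib

HASH_SPACE = 2**128  # MD5 digests are 128 bits wide


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, evenly sized ranges (an assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash fell outside all shard ranges")


def distribution(keys, shard_count):
    """Count how many keys land on each shard for a given shard count."""
    ranges = shard_ranges(shard_count)
    counts = [0] * shard_count
    for key in keys:
        counts[shard_for_key(key, ranges)] += 1
    return counts


# Hypothetical partition keys, e.g. one per device.
keys = [f"device-{n}" for n in range(1000)]
print("2 shards:", distribution(keys, 2))
print("4 shards:", distribution(keys, 4))
```

Adding shards narrows the range each shard owns, which is why spreading records over many distinct partition keys matters for throughput.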
  • 14. PROCESS / ANALYZE: Amazon Athena. An interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Serverless: no infrastructure or resources to manage at any scale. Schema on read: same data, many views.
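"Schema on read" means the stored data carries no fixed schema; each reader applies one at query time. The following is a minimal plain-Python sketch of that idea, not Athena itself. The sample rows, column names, and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Raw data stored once, as plain text (standing in for an object in S3).
RAW = """2018-06-01,web,120
2018-06-01,mobile,80
2018-06-02,web,150
"""


def read_with_schema(raw_text, schema):
    """Apply column names and type converters at read time; the raw text is unchanged."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so a schema may cover only leading columns.
        rows.append({name: convert(value) for (name, convert), value in zip(schema, record)})
    return rows


# View 1: typed columns for numeric analysis.
typed_schema = [("day", str), ("channel", str), ("visits", int)]
# View 2: only the first two columns, read as labels.
label_schema = [("day", str), ("channel", str)]

typed = read_with_schema(RAW, typed_schema)
labels = read_with_schema(RAW, label_schema)

total_visits = sum(row["visits"] for row in typed)
channels = sorted({row["channel"] for row in labels})
print(total_visits, channels)
```

Both views read the same bytes; only the schema applied at read time differs.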
  • 15. PROCESS / ANALYZE: AWS Glue. Data Catalog: Hive Metastore compatible with enhanced functionality; crawlers automatically extract metadata and create tables. Managed transform engine: auto-generates ETL code; built on open frameworks (Python and Spark). Job scheduler: runs jobs on a serverless Spark platform; massively scalable. Integrated with Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Athena, and any JDBC-compliant data store.
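The crawler idea on this slide (sample the data, infer column types, emit a table definition) can be illustrated with a toy example. This sketch is not Glue's actual classifier logic; the type names, the `infer_type` and `crawl` helpers, and the sample rows are assumptions chosen for illustration.

```python
def infer_type(values):
    """Pick the narrowest type that fits every sample value: bigint, double, or string."""
    def all_parse(parser):
        for value in values:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "bigint"
    if all_parse(float):
        return "double"
    return "string"


def crawl(header, rows):
    """Build a table definition (column name -> inferred type) from sample rows."""
    columns = list(zip(*rows))  # transpose rows into per-column tuples
    return {name: infer_type(column) for name, column in zip(header, columns)}


# Hypothetical sample of a delimited file.
header = ["survey_id", "response_rate", "region"]
rows = [
    ["101", "0.82", "northeast"],
    ["102", "0.9", "south"],
    ["103", "1", "west"],
]

table = crawl(header, rows)
print(table)
```

A real catalog entry would also record location, format, and partitions; the point here is only that the schema is derived from the data rather than declared up front.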
  • 16. COLLECT → STORE → PROCESS/ANALYZE → CONSUME (from data to answers).
  • 17. CONSUME: apps and services, API, Amazon QuickSight, analysis and visualization, notebooks.
  • 18. Putting It All Together
  • 19. COLLECT (logging, IoT, applications): mobile apps, web apps, enterprise apps; devices, sensors and IoT solutions, AWS IoT Analytics; Amazon CloudWatch, AWS CloudTrail. STORE (search, SQL, NoSQL, file, stream): Amazon Elasticsearch Service, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon DynamoDB Streams, Amazon Redshift. PROCESS/ANALYZE (batch, interactive, stream, ML; ETL; streaming): Amazon Kinesis Analytics, Amazon KCL apps, AWS Lambda, Amazon Redshift, Amazon Machine Learning, Presto, Amazon EMR, Amazon EC2. CONSUME (analysis and visualization, notebooks, API): Amazon QuickSight, apps and services.
  • 20. Architectural Patterns
  • 21. Building Event-Driven Batch Analytics on AWS. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Stages: staging data; input validation/conversion layer; pre-processed data; input tracking layer; aggr. job submission and monitoring layer; state management store; aggregation and load layer. Services: AWS Lambda, Amazon EMR, Amazon Redshift. Also shown: Identity and Access Management (IAM); monitoring and logging (CloudWatch).
  • 22. Real-Time and Batch Analytics Using the Big Data Architecture. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Raw data in: Kinesis Data Firehose. Batch layer: raw data, S3 buckets, Amazon EMR. Speed layer: Kinesis Data Firehose, Kinesis Data Analytics, user device settings. Serving layer: pre-processed views, filtered data, S3 buckets. Query and visualization: Athena, Amazon QuickSight.
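The batch, speed, and serving layers on this slide follow the pattern commonly called the Lambda architecture: a batch layer periodically recomputes views from all raw data, a speed layer covers events that arrived since the last batch run, and a serving layer merges the two at query time. The sketch below shows that merge in plain Python. The event shape, the page-count metric, and the function names are hypothetical and are not taken from the deck.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute counts from all raw events up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only events newer than the last batch run."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def serve(batch, speed):
    """Serving layer: answer queries by combining both views."""
    return batch + speed


events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "home"},
    {"ts": 3, "page": "search"},
    {"ts": 5, "page": "home"},    # arrived after the last batch run
    {"ts": 6, "page": "search"},  # arrived after the last batch run
]
LAST_BATCH_TS = 3

merged = serve(batch_view(events, LAST_BATCH_TS), speed_view(events, LAST_BATCH_TS))
print(dict(merged))
```

Because the two views partition events at the cutoff, no event is counted twice, and the merged result matches a full recount.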
  • 23. Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes: No Loading Required. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… Diagram: JDBC/ODBC; Amazon Redshift; 1, 2, 3, 4 ... N; Amazon S3 (exabyte-scale object storage); Data Catalog; Apache Hive Metastore.
  • 24. Data lake on Amazon S3 with AWS Glue. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. AWS Glue ETL; Amazon QuickSight.
  • 25. U.S. Census: Enterprise data lake
  • 26. Official US Statistics. Collection and dissemination: mostly the same since World War II. Multi-agency effort. Surveys are the dominant data source. Administrative records support surveys.
  • 27. Users want more, faster, current… Users want more: timely and detailed estimates; statistics that link with other data; microdata; relevant data.
  • 28. Big Data Benefits for Census: enhance current surveys; reduce respondent burden; improve timeliness of release; better information for unique situations; enhanced granularity; optimize the data quality process.
  • 29. Problem Statement. Today, the processes surrounding data access for the Census's MathStats and Data Scientists are manual, cumbersome, and slow. Whether the goal is to gain access to data or to link data across datasets (e.g., AdRecs, multi-survey data, and multi-period data) for longitudinal or other studies, the Census's data stewardship policies must be respected. The resulting data may inherit controls from the source data (e.g., Title 13, Title 26, and more), and manual effort is currently required to track data lineage from source to resulting data. Additionally, multiple IT environments are installed to handle each project's survey instance. Decentralized Data Management Limitations: linking data across surveys is difficult; sharing data is a manual exercise; data is copied multiple times; honoring data stewardship policies requires distributed manual efforts. Security Control Limitations: controls must be duplicated for every survey system; governance and security measures are cumbersome; auditing and monitoring capabilities are inconsistent. Processing Approach Limitations: data processing code is inconsistently managed from one group to the next; reproducing results from base data is not feasible since data lineage is not consistently tracked. Technology Limitations: the current approach requires constant acquisition of new servers; technology is inconsistent from one group or survey to the next; handling large datasets with complex calculations is challenging. [Diagram: Census Data Limitation, by survey portfolio (DEMO, ECON) and time period.]
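The statement that resulting data "may inherit controls from the source data" and that lineage must be tracked can be pictured as a small data model. The sketch below is only an illustration under stated assumptions: it takes the union of all source controls as the inheritance rule (the slide says only "may inherit"), and the dataset names, the `Dataset` class, and the `link` and `lineage` helpers are hypothetical, not the Census's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    controls: frozenset = frozenset()  # e.g. {"Title 13"}
    sources: tuple = ()                # upstream datasets, recorded for lineage


def link(name, *inputs):
    """Create a derived dataset that inherits every control of its inputs (assumed rule)."""
    inherited = frozenset().union(*(d.controls for d in inputs))
    return Dataset(name=name, controls=inherited, sources=tuple(inputs))


def lineage(dataset):
    """Return the names of all upstream datasets, depth first."""
    names = []
    for source in dataset.sources:
        names.append(source.name)
        names.extend(lineage(source))
    return names


# Hypothetical source datasets carrying the controls named on the slide.
survey_2017 = Dataset("survey_2017", frozenset({"Title 13"}))
admin_records = Dataset("admin_records", frozenset({"Title 26"}))
survey_2018 = Dataset("survey_2018", frozenset({"Title 13"}))

linked = link("linked_study", survey_2017, admin_records)
longitudinal = link("longitudinal_study", linked, survey_2018)

print(sorted(longitudinal.controls))
print(lineage(longitudinal))
```

Recording sources at creation time is what makes lineage a lookup rather than a manual reconstruction.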
  • 30. Census Security Control and Usage of Data
  • 31. Enterprise Data Lake (EDL) Solution Supports the Mission. The proposed EDL solution will support the business process by storing and analyzing any data with associated code at any time throughout the lifecycle. Diagram components: Security as a Service; Analytics as a Service; Data as a Service; Infrastructure & Operations as a Service; Enterprise Data Lake; content repositories; data/code repository; computational environment; data ingestion services; transactional systems / data sources. Legend: cloud; standardized cloud services; standardized EDL services; component of EDL ecosystem; specific to the EDL. Labels: data encryption key permissions monitoring.
  • 32. Proposed Enterprise Data Lake in the Cloud. The data lake will streamline time-consuming tasks and simplify complex processes to make the Business and IT users' lives easier. MathStats and Data Scientists will be able to focus on their data, models, and products rather than on administrative tasks. Diagram labels: Security, Governance, Infrastructure Management, Data Management, Analytics (repeated for Survey N, Survey N + 1, Survey N + 1…); Demographics, Decennial, Econ, Other Programs; Enterprise; Directorate Analytics; EDL Standard Services; Standardized Cloud Services; Standardized Census Data Services; Governance; Security.
  • 33. Please complete the session survey in the summit mobile app.
  • 34. Thank you! Questions?