The AWS Big Data services are inherently built to run at scale. In this session, you will learn how to develop an enterprise-scale big data application using AWS services such as Amazon EMR, Amazon Redshift and Redshift Spectrum, Amazon Athena, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight, and AWS Glue. This session will also cover different architectural patterns and customer use cases.
26. Official US Statistics
• Collection and dissemination: mostly the same since World War II
• Multi-agency effort
• Surveys are dominant data source
• Administrative records support surveys
27. Users want more, faster, current…
Users want more:
• Timely and detailed estimates
• Statistics that link with other data
• Microdata
• Relevant data
28. Big Data Benefits for Census
• Enhance current surveys
• Reduce respondent burden
• Improve timeliness of release
• Better information for unique situations
• Granularity enhanced
• Optimize Data Quality Process
29. Problem Statement
Today, the process by which the Census’s MathStats and Data Scientists access data is manual, cumbersome, and slow. Whether they need to gain access to data or to link data across datasets (e.g., AdRecs, multi-survey data, and multi-period data) for longitudinal or other studies, the Census’s data stewardship policies must be respected. The resulting data may inherit controls from the source data (e.g., Title 13, Title 26, and more), and manual effort is currently required to track data lineage from the source to the resulting data. Additionally, multiple IT environments are installed to handle each project’s survey instance.
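The inheritance of controls and the lineage tracking described in the problem statement can be illustrated with a minimal sketch. The slides do not say how the EDL implements either one, so everything here is an assumption for illustration: the `Dataset` class, the `link` and `lineage` helpers, and the dataset names are all hypothetical. The only idea taken from the text is that a linked dataset carries the controls of every source it was built from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical model: a dataset, its stewardship controls, and its sources."""
    name: str
    controls: frozenset = frozenset()
    parents: tuple = ()  # datasets this one was derived from


def link(name, *sources):
    """Derive a dataset from sources; it inherits the union of their controls."""
    inherited = frozenset().union(*(s.controls for s in sources))
    return Dataset(name=name, controls=inherited, parents=tuple(sources))


def lineage(ds):
    """Return the sorted names of all base datasets that ds was derived from."""
    if not ds.parents:
        return [ds.name]
    names = set()
    for parent in ds.parents:
        names.update(lineage(parent))
    return sorted(names)


# Illustrative datasets only; the names are made up for this sketch.
adrecs = Dataset("adrecs", frozenset({"Title 26"}))
survey = Dataset("survey_a", frozenset({"Title 13"}))
linked = link("linked_study", adrecs, survey)

print(sorted(linked.controls))  # controls inherited from both sources
print(lineage(linked))          # base datasets behind the linked result
```

Recording the parents at the moment a dataset is derived is what lets lineage be computed rather than reconstructed by hand, which is the manual effort the slide describes.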
Decentralized Data Management Limitations
• Linking data across surveys is difficult
• Sharing data is a manual exercise
• Data is copied multiple times
• Honoring data stewardship policies requires distributed manual efforts
Security Control Limitations
• Controls must be duplicated for every survey system
• Governance and security measures are cumbersome
• Auditing and monitoring capabilities are inconsistent
Processing Approach Limitations
• Data processing code is inconsistently managed from one group to the next
• Reproducing results from base data is not feasible since data lineage is not consistently tracked
Technology Limitations
• Current approach requires constant acquisition of new servers
• Technology is inconsistent from one group or survey to the next
• Handling large datasets with complex calculations is challenging
[Figure: Census Data Limitation. Survey Portfolio (DEMO and ECON; surveys S1, S2, S3, S4, …, Sn, Sn+1) against Time Period (M1, M2, M3, …; Y1, Y2, …, Yn).]
31. Enterprise Data Lake (EDL) Solution Supports the Mission
The proposed EDL solution will support the business process by storing and analyzing any data, with its associated code, at any time throughout the lifecycle.
[Figure: Enterprise Data Lake architecture. Service layers: Security as a Service; Analytics as a Service; Data as a Service; Infrastructure & Operations as a Service. Components: Enterprise Data Lake; Content Repositories; Data/Code Repository; Computational Environment; Data Ingestion Services; Transactional Systems / Data Sources. Other labels: data, encryption key, permissions, monitoring. Legend: Cloud; Standardized Cloud Services; Standardized EDL Services; Component of EDL Ecosystem Specific to the EDL.]
32. Proposed Enterprise Data Lake in the Cloud
The data lake will streamline time-consuming tasks and simplify complex processes to make the Business and IT users’ lives easier. MathStats and Data Scientists will be able to focus on their data, models, and products rather than on administrative tasks.
[Figure: Proposed Enterprise Data Lake in the cloud. Per-survey stacks (Survey N, Survey N + 1, …), each labeled Security, Governance, Infrastructure Management, Data Management, Analytics. Program areas: DEMOGRAPHICS, DECENNIAL, ECON, OTHER PROGRAMS. Enterprise labels: Directorate Analytics; EDL Standard Services; Standardized Cloud Services; Standardized Census Data Services; Governance; Security.]