Learn about data lifecycle best practices in the AWS Cloud, so you can optimize performance and lower the costs of data ingestion, staging, storage, cleansing, analytics and visualization, and archiving.
The ingest mechanism describes the movement of data from an external source into the data lifecycle. Data ingest refers to identifying the correct data sources, validating and importing the data files from those sources, and sending the data to the desired destination.
Data sources include transactions, ERP systems, clickstream data, log files, devices, and disparate databases being migrated. Generally, the destination is some form of storage or a database (we discuss the “destination” in the following chapter on Data Staging).
Amazon Kinesis Data Firehose provides a simple way to capture and load streaming data. You can create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously into AWS, all in just a few minutes.
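Once a delivery stream exists, producers can push records to it through the Firehose API. Below is a minimal sketch using the boto3 Python SDK; the stream name and the event fields are hypothetical placeholders.

    import json
    import boto3

    # Minimal sketch: send one clickstream event to an existing Firehose
    # delivery stream. The stream name "clickstream-ingest" is hypothetical.
    firehose = boto3.client("firehose")

    event = {"user_id": "u-123", "page": "/home", "ts": "2019-06-01T12:00:00Z"}

    firehose.put_record(
        DeliveryStreamName="clickstream-ingest",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

Firehose buffers records like these and delivers them continuously to the configured destination, such as Amazon S3.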
If you tried to use one giant all-purpose tool to do every function, it wouldn't be really good at any single thing. That's often what customers see running their systems, both in the cloud and on premises. Because of this, AWS offers a set of data tier services, ranging from traditional relational stores such as Aurora, Oracle, and SQL Server, to NoSQL databases that store key/value, document, and graph data, to in-memory stores that provide microsecond retrieval of re-hydratable data.
Different data structures often require different types of storage.
Key/value stores are great for data that needs to be stored and queried quickly; examples include session data or the latitude/longitude tracking data of car locations (as Lyft does on AWS). Other use cases call for different storage: heavily connected graph data, data warehousing data, and relational data each suit a different engine. The way the data gets queried is also a major characteristic, closely related to the structure of the data. These are the characteristics, really the design decisions, that determine which database to use. The two most important factors to focus on are the use case and the shape of the data; the rest is often driven by those.
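To make the key/value case concrete, here is a minimal sketch with DynamoDB, one of the AWS key/value stores. The table name, key schema, and items are hypothetical, and the table is assumed to already exist.

    import boto3

    # Minimal key/value sketch: store and fetch the latest position of a
    # vehicle. Table "vehicle-locations" (partition key "vehicle_id") is a
    # hypothetical, pre-existing table.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("vehicle-locations")

    # Write the latest lat/long for one vehicle.
    table.put_item(Item={"vehicle_id": "car-42", "lat": "47.6062", "lon": "-122.3321"})

    # Read it back by key; point lookups like this return in milliseconds.
    item = table.get_item(Key={"vehicle_id": "car-42"})["Item"]
    print(item)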
Staging provides the opportunity to perform any data housekeeping tasks prior to making the data available to the organization or its users for analytics.
One of the most common challenges we hear from customers is that their organization has data in multiple systems or locations, including data warehouses, spreadsheets, databases, and text files. Not only is the variety expanding, but its volume in many cases is growing exponentially. Add the complexity of mandatory data security and governance, user access, and the data demands of the analytics, business, and reporting teams, and an organization can find itself unable to see a way forward.
Before data is analyzed, data cleansing ensures that data is transformed and presented in a format that is optimized for the code that consumes it. The extract, transform, load (ETL) process is carried out as part of the data cleansing stage in the lifecycle. For example, a field may contain a date/time in a format that does not meet an algorithmic requirement, or a name field may need the first and last names separated out. Other concerns addressed during the data cleansing stage include merging data sources, aligning formats, converting strings to numerical data, and summarizing data.
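As an illustration, here is a minimal cleansing sketch in Python with pandas covering the two examples above; the column names and date format are hypothetical.

    import pandas as pd

    # Hypothetical raw records with a combined name field and a
    # non-standard date/time string.
    df = pd.DataFrame({
        "full_name": ["Ada Lovelace", "Alan Turing"],
        "event_time": ["06/01/2019 12:30", "06/02/2019 09:15"],
    })

    # Separate first and last names into their own fields.
    df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

    # Normalize the date/time strings into a proper timestamp type.
    df["event_time"] = pd.to_datetime(df["event_time"], format="%m/%d/%Y %H:%M")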
Problem #1 – Many organizations don’t know what they have.
When you accumulate such a diversity of data, you need mechanisms to understand what data you have, where it is located, and what format it is in.
This is metadata management. And if not managed properly (or at all), the data is essentially lost. It is taking up space, but you have no means to put it to use.
A common issue, regardless of whether the data is on premises or in the cloud, is the lack of a metadata management approach from the outset.
The Financial Industry Regulatory Authority (FINRA) oversees more than 3,900 securities firms with approximately 640,000 brokers.
FINRA processes approximately 6 terabytes of data and 37 billion records on an average day to build a complete, holistic picture of market trading in the U.S. On busy days, the stock markets can generate 75 billion+ records.
The way they’re able to make all this data useful, whether to data scientists or business users or others, is through a metadata system they developed and open sourced, called HERD.
This is the same platform that is used by LinkedIn, for example.
But most organizations don't actually go off and build their own tooling.
Ivy Tech is a community college with 60,000 online and in-person course sections, 8,300 staff, 170,000 students, and 130 locations.
Ivy Tech uses metadata capabilities provided by AWS to manage their information.
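The AWS Glue Data Catalog is one example of such a managed metadata capability. The sketch below lists the databases and tables it knows about; it assumes the catalog has already been populated (by crawlers or applications), and pagination is omitted for brevity.

    import boto3

    # Minimal sketch: browse the Glue Data Catalog to see what data you
    # have, where it lives, and how it is structured. (First page only;
    # pagination omitted for brevity.)
    glue = boto3.client("glue")

    for database in glue.get_databases()["DatabaseList"]:
        for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print(f"{database['Name']}.{table['Name']} -> {location}")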
Data analytics is the stage where an organization can identify ways to increase revenue or reduce cost. Analytics and visualization deliver to decision makers the insights to transform an organization, whether by identifying unmet customer needs or by optimizing operational processes. Data-driven decisions transform how managers allocate resources and evaluate results within an organization. Reliance on data reduces the role of hearsay and instinct when making choices. A manager's intuition is now backed with data at the front end of the planning process, through the course of implementation, and when evaluating the impact of his or her decisions.
Key considerations in this phase include clearly defining the requirements for analytics, aligning the output to the use cases, and ensuring that the consumers of data within the organization find the generated insights actionable. Let's review some of the solutions available for analytics within the AWS portfolio during this stage.
Picking the right analytical engine for your needs
AWS offers analytical engines for several use cases such as big data processing, data warehousing, ad-hoc analysis, real-time streaming, and operational/log analytics.
In this section, you will learn which engines you can use for your use case to analyze all of your data stored in your Amazon S3 data lake in open formats.
You will also learn how to use these engines together for generating new insights, such as complementing your data warehouse workloads with ad-hoc and real-time analytics engines to incorporate new data into your reports.
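For instance, an ad-hoc query against the data lake can run in Athena directly over S3. The sketch below submits one such query via boto3; the database, table, and result bucket are hypothetical placeholders.

    import boto3

    # Minimal sketch: run an ad-hoc SQL query over open-format data in S3.
    # "datalake", "clickstream", and the output bucket are hypothetical.
    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll get_query_execution for status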
Redshift Spectrum supports a range of open file formats and SerDes, including Avro, Parquet, TextFile, SequenceFile, RCFile, ORC, RegexSerDe, Grok, and OpenCSV.
Athena supports SequenceFile, TextFile, RCFile, ORC, Parquet, and Avro.
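Because both engines read the same open formats in S3, a table registered once can serve ad-hoc and warehouse queries alike. Below is a sketch that registers hypothetical Parquet data as an external table through Athena DDL; the table name, columns, and S3 paths are placeholders.

    import boto3

    # Minimal sketch: register Parquet files in S3 as an external table.
    # Table name, columns, and S3 locations are hypothetical.
    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS datalake.trades (
        trade_id string,
        symbol   string,
        price    double
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/trades/'
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )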
As a rule of thumb: Athena is simple, Redshift is fast, and EMR is configurable.
Amazon Web Services offers a complete set of cloud storage services for archiving. You can choose Amazon S3 Glacier or S3 Glacier Deep Archive for affordable, non-time-sensitive cloud storage, or Amazon S3 for faster storage, depending on your needs.
AWS cloud storage solutions have achieved numerous compliance standards and security certifications, and provide built-in encryption, helping to ensure that the data you store in AWS meets the requirements of your business. AWS cloud storage solutions make the archival process easy to manage and allow you to focus on the storage of your data, rather than the management of your tape systems and libraries.
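Archiving is typically automated with S3 lifecycle rules. The sketch below shows one hypothetical configuration that moves objects under a prefix to S3 Glacier after 90 days and to Glacier Deep Archive after a year; the bucket, prefix, and day counts are placeholders.

    import boto3

    # Minimal sketch: lifecycle rule that archives aging objects.
    # Bucket name, prefix, and transition days are hypothetical.
    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-archive-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-data",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }]
        },
    )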