2. What is a Data Lake?
- It is a collection of raw, structured, semi-structured, and unstructured data in one place, enabled by low-cost technologies, from which downstream applications can consume data and act.
- It can keep data in its original, native format, including streaming data and big data. This gives high agility to configure and reconfigure the data as needs change.
- Any data source can contribute to a data lake.
3. What is the use of a Data Lake?
- It eliminates data silos and simplifies management.
- It eliminates redundant data movement across platforms.
- It provides a common platform for data access, data processing, data analytics, and data presentation.
- Orchestration becomes possible.
- Streaming data can be accommodated.
4. Enabling Data Lake Architecture with Open Source
- Data sources: social media, sensors, video, files, logs, enterprise transactions (OLTP, ERP, CRM) are ingested into the data lake.
- Data lake storage: distributed file system, NoSQL.
- Data access and processing: batch and stream processing (SparkSQL, MLlib, SparkR).
- Data analytics and presentation: dashboards, predictive models, RDBMS.
5. Data Pipeline & Processing
- All data is fed into the Hadoop data lake.
- Data is prepared and enriched as needed.
- The processed data is stored back into the data lake, or placed in in-memory databases for low-latency applications.
- Streaming data can be processed using Kafka and Flume. Both allow connections directly into Hive and HBase, and Spark can ingest and process data without ever writing to disk.
6. Data Pipeline & Processing (cont.)
Spark - Continuous Application
- Spark supports near-real-time processing of streams via Structured Streaming.
- Structured Streaming allows applications to connect to Kafka sources and apply Dataset operations to unbounded ("infinite") tables. The link below gives the details of continuous applications:
https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html