Harness the Power of Data in a
Big Data Lake
Saurabh K. Gupta
Manager, Data & Analytics, GE
www.amazon.com/author/saurabhgupta
@saurabhkg
Disclaimer:
“This report has been prepared by the Authors who are part of GE. The opinions expressed herein by the Authors
and the information contained herein are in good faith and the Authors and GE disclaim any liability for the content in
the report. The report is the property of GE and GE is the holder of the copyright or any intellectual property over
the report. No part of this document may be reproduced in any manner without the written permission of GE. This
Report also contains certain information available in the public domain, created and maintained by private and public
organizations. GE does not control or guarantee the accuracy, relevance, timeliness or completeness of such
information. This Report constitutes a view as on the date of publication and is subject to change. GE does not
warrant or solicit any kind of act or omission based on this Report.”
About me
• Manager, Data & Analytics at GE Digital
• 10+ years of experience in data architecture,
engineering, and analytics
• Authored a couple of books
• Oracle Advanced PL/SQL Developer Professional
Guide (2012)
• Advanced Oracle PL/SQL Developer’s Guide
(2016)
• Current session - excerpt from an “untitled book”
(TBD)
• Speaker at AIOUG, IOUG, NASSCOM
• Twitter: @saurabhkg, Blog: sbhoracle.wordpress.com
AIOUG Sangam’17
Why am I here?
• “Data lake” is a relatively new term among the many buzzwords that have appeared since the industry
realized the potential of its data. Organizations are planning to adopt the big data lake as their key data
store, but what challenges them is the traditional approach. Traditional approaches to data pipelines, data
processing, and data security still hold good, but architects do need to go the extra mile when designing a
big data lake.
• This session focuses on this shift in approach. We will explore the roadblocks in setting up a data lake
and how to size the key milestones. The health and efficiency of a data lake largely depend on two
factors: data ingestion and data processing. Attend this session to learn key practices for data ingestion
under different circumstances. Data processing for a variety of scenarios will be covered as well.
AIOUG Sangam’17
Agenda
• Data Lake – the evolution
• Data Lake architecture
• Data Lake ingestion strategy
• Structured
• Unstructured
• Change data capture
• Principles
• Data Processing techniques
AIOUG Sangam’17
Data Explosion
Future data trends
AIOUG Sangam’17
• Data as a Service
• Cybersecurity
• Augmented Analytics
• Machine Intelligence
• “Fast data” and “actionable data”
Evolution of Data Lake
• James Dixon’s “time machine” vision of data
• Leads Data-as-an-Asset strategy
“If you think of a datamart as a store of bottled water
– cleansed and packaged and structured for easy
consumption – the data lake is a large body of water
in a more natural state. The contents of the data
lake stream in from a source to fill the lake, and
various users of the lake can come to examine, dive
in, or take samples.”
-James Dixon
What is Data Lake? And Why?
What?
• A data lake represents the state of the enterprise at a given point in time
• Stores all data, in full detail, in one place
• Empowers business analytics applications, predictive models, and deep learning models with one
“time machine” data store
• Unifies data discovery, data science, and enterprise BI in an organization
Why?
• Scalable framework to store massive volumes of data and churn out analytical insights
• The data ingestion and processing framework becomes the cornerstone, rather than just the
storage
• Warrants data discovery platforms that can soak up data trends at horizontal scale and
produce visual insights
Data lake – constructing strategy
Data lake vs Data warehouse
Data Lake
• Gets the data in its raw format, unprocessed and untouched, to build the mirror layer; follows a
schema-on-read strategy
• War room for data scientists, analysts, and data engineering specialists from multiple domains to
run analytical models
• Primarily hosted on Hadoop, which is open source and comes with free community support
Data Warehouse
• Pulls data from the source and takes it through a processing layer, i.e. schema-on-write
• Targets management staff and business leaders who expect structured analyst reports at the end
of the day
• Runs on relatively expensive storage to withstand high-scale data processing
Data Lake architecture
[Architecture diagram] Source data systems (RDBMS, web, IoT, files) → Data Acquisition Layer
(source-to-data-lake ETL) → Mirror Layer → intermediate layers (optional) → Analytics Layer →
Data Consumption: data visualization, data science and exploration, data discovery/exploration,
deep learning models — all within the ENTERPRISE DATA LAKE
Data Lake operational pillars
• Data-as-an-Asset
• Data Management
• Data Ingestion
• Data Engineering
• Data Visualization
Data Ingestion
Data Ingestion challenges
Understand the data
• Structured data is an organized piece of
information
• Aligns strongly with the relational
standards
• Defined metadata
• Easy ingestion, retrieval, and processing
• Unstructured data lacks structure and
metadata
• Not so easy to ingest
• Complex retrieval
• Complex processing
Understand the data sources
• OLTP and Data warehouses – structured data from typical relational data stores.
• Data management systems – documents and text files
• Legacy systems – essential for historical and regulatory analytics
• Sensors and IoT devices – devices installed on healthcare, home, and mobile appliances and on
large machines can upload logs to the data lake at periodic intervals or within a secure
network region
• Web content – data from the web world (retail sites, blogs, social media)
• Geographical data – data flowing from location data, maps, and geo-positioning systems
Data Ingestion Framework
Design Considerations
• Data format – What format is the data to be ingested?
• Data change rate – Critical for CDC design and streaming data. Performance is a
derivative of throughput and latency.
• Data location and security –
• Whether data is located on-premises or on public cloud infrastructure. When fetching data
from cloud instances, network bandwidth plays an important role.
• If the data source is enclosed within a security layer, the ingestion framework should be
able to establish a secure tunnel to collect data for ingestion
• Transfer data size (file compression and file splitting) – what will be the average and
maximum size of a block or object in a single ingestion operation?
• Target file format – data from a source system needs to be ingested in a Hadoop-compatible
file format
ETL vs ELT for Data Lake
• Heavy transformation may restrict the data
surface area available for data exploration
• Brings down data agility
• Transformation on huge volumes of data
may introduce latency between the data
source and the data lake
• Curated layer to empower analytical
models
[Diagram: ETL vs ELT flows]
Batched data ingestion principles
Structured data
• The data collector fires a SELECT query (also known as a filter query) on the source to pull
incremental records or a full extract
• Query performance and the source workload determine how efficient the data collector is
• Robust and flexible
Ingestion techniques
• Change-track flag – flag rows with the operation code
• Incremental extraction – pull all the changes after a certain timestamp (a minimal sketch follows below)
• Full extraction – refresh the target on every ingestion run
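A minimal sketch of the batched filter-query pattern, assuming an Oracle source reachable via SQL*Plus; the table, columns, credentials, and high-water-mark value are illustrative, not from the deck.

```bash
# Incremental extraction: pull only the rows changed after the last successful run
# and land them as a delimited file for ingestion. All names here are illustrative.
LAST_RUN="2017-12-01 00:00:00"   # high-water mark recorded after the previous run

sqlplus -s app_user/secret@SRCDB <<SQL > /staging/orders_delta.csv
SET PAGESIZE 0 FEEDBACK OFF HEADING OFF COLSEP ','
SELECT order_id, product, sales, last_update_ts
FROM   orders
WHERE  last_update_ts > TO_TIMESTAMP('${LAST_RUN}', 'YYYY-MM-DD HH24:MI:SS');
EXIT;
SQL
```

The same pattern degrades to a full extract by dropping the WHERE predicate and refreshing the target on every run.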
Change Data Capture
• Log-mining process
• Captures changed data from the source system’s transaction logs and integrates it with the
target
• Eliminates the need to run SQL queries on the source system, so it incurs no load overhead on a
transactional source
• Achieves near real-time replication between source and target
Change Data Capture
Design Considerations
• The source database must be enabled for logging
• Commercial tools – Oracle GoldenGate, HVR, Talend CDC, custom replicators
• Keys are extremely important for replication
• Help the capture job establish the uniqueness of a record in the changed data set
• The source PK ensures the changes are applied to the correct record on the target
• PK not available? Establish uniqueness based on composite columns
• Establishing uniqueness based on a unique constraint – terrible design!!
• Trigger-based CDC (see the sketch below)
• A DML event on a table triggers the change to be captured in a change-log table
• The change-log table is merged with the target
• Works when source transaction logs are not available
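The trigger-based option can be sketched as below, assuming an Oracle source; the table, key column, and change-log structure are hypothetical and only illustrate the pattern described above.

```bash
# Hypothetical trigger-based CDC: every DML event on the source table writes a row
# into a change-log table, which a downstream job later merges with the target.
sqlplus -s app_user/secret@SRCDB <<'SQL'
-- change-log table recording the key, the operation code, and a change timestamp
CREATE TABLE orders_changelog (
  order_id  NUMBER,
  op_code   CHAR(1),
  change_ts TIMESTAMP DEFAULT SYSTIMESTAMP
);
-- capture trigger: flags each changed row with I / U / D
CREATE OR REPLACE TRIGGER trg_orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  IF DELETING THEN
    INSERT INTO orders_changelog (order_id, op_code) VALUES (:OLD.order_id, 'D');
  ELSIF UPDATING THEN
    INSERT INTO orders_changelog (order_id, op_code) VALUES (:NEW.order_id, 'U');
  ELSE
    INSERT INTO orders_changelog (order_id, op_code) VALUES (:NEW.order_id, 'I');
  END IF;
END;
/
SQL
```

The obvious trade-off is the extra DML cost on the source, which is why log mining is preferred when transaction logs are available.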
LinkedIn Databus
CDC capture pipeline
• The Relay is responsible for pulling the most recent committed
transactions from the source
• Relays are implemented through the Tungsten Replicator
• The Relay stores the changes in logs or a cache, in compressed
format
• The Consumer pulls the changes from the Relay
• Bootstrap component – a snapshot of the data source on a
temporary instance, consistent with the changes captured by
the Relay
• If any consumer falls behind and can’t find the changes in the
Relay, the bootstrap component transforms and packages the
changes for the consumer
• A new consumer, with the help of the client library, can apply all
the changes from the bootstrap component up to a point in time;
the client library then points the consumer to the Relay to
continue pulling the most recent changes
Apache Sqoop
• Native member of the Hadoop tech stack for data ingestion
• Batched ingestion, no CDC
• Java-based utility (web interface in Sqoop2) that spawns Map jobs from the MapReduce
engine to store data in HDFS
• Provides full-extract as well as incremental import modes (a basic import is sketched below)
• Runs on an HDFS cluster and can populate tables in Hive and HBase
• Can establish a data integration layer between NoSQL and HDFS
• Can be integrated with Oozie to schedule import/export tasks
• Supports connectors to multiple relational databases such as Oracle, SQL Server, and MySQL
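A basic full-extract import, sketched with illustrative connection details, table names, and paths; nothing here is specific to the deck’s environment.

```bash
# Full extract of a relational table into HDFS, exposed as a Hive table.
sqoop import \
  --connect jdbc:oracle:thin:@//srcdb:1521/ORCL \
  --username app_user \
  --password-file /user/etl/.sqoop_pwd \
  --table ORDERS \
  --target-dir /data/lake/mirror/orders \
  --hive-import \
  --hive-table orders_mirror
```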
Sqoop architecture
• Uses mapper jobs of the MapReduce
processing layer in Hadoop
• By default, a Sqoop job has four
mappers
• Rule of split (see the sketch below)
• Values of the --split-by column
must be distributed evenly across the
mappers
• The --split-by column should ideally be
the primary key (or another evenly
distributed numeric column)
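A sketch of the split rule with illustrative names: the primary-key column is handed to --split-by so its value range is divided evenly across the mappers.

```bash
# Range-partition the extract across mappers on the primary key column.
# The default is four mappers; --num-mappers sets it explicitly.
sqoop import \
  --connect jdbc:oracle:thin:@//srcdb:1521/ORCL \
  --username app_user \
  --password-file /user/etl/.sqoop_pwd \
  --table ORDERS \
  --split-by ORDER_ID \
  --num-mappers 4 \
  --target-dir /data/lake/mirror/orders
```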
Sqoop
Design considerations - I
• Mappers
• --num-mappers [n] argument
• Mappers run in parallel within Hadoop. There is no formula; the number needs to be set judiciously
• Cannot split?
• --autoreset-to-one-mapper performs an unsplit, single-mapper extraction (see the sketch below)
• Source has no PK
• Split based on a natural or surrogate key
• Source has character keys
• Divide and conquer! Create manual partitions and run one mapper per partition
• If the key value is an integer, no worries
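For a table with no usable split key, a single unsplit mapper is the safe fallback; all names below are illustrative.

```bash
# No primary key and no --split-by supplied: fall back to one unsplit mapper
# instead of failing the job.
sqoop import \
  --connect jdbc:oracle:thin:@//srcdb:1521/ORCL \
  --username app_user \
  --password-file /user/etl/.sqoop_pwd \
  --table ORDER_NOTES \
  --autoreset-to-one-mapper \
  --target-dir /data/lake/mirror/order_notes
```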
Sqoop
Design considerations - II
• If only a subset of columns is required from the source table, specify the column list in the
--columns argument
• For example, --columns “orderId,product,sales”
• If limited rows are required to be “sqooped”, specify the --where argument with the predicate
clause
• For example, --where “sales > 1000”
• If the result of a structured query needs to be imported, use the --query argument
• For example, --query ‘select orderId, product, sales from orders where sales > 1000 and $CONDITIONS’
(Sqoop requires the $CONDITIONS placeholder in free-form queries)
• Use the --hive-partition-key and --hive-partition-value attributes to create
partitions on a column key from the import
• Delimiters can be handled in either of the following ways (combined sketch below) –
• Specify --hive-drop-import-delims to remove delimiters during the import
• Specify --hive-delims-replacement to replace delimiters with an alternate character
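The options above combine naturally into a single command; this sketch reuses the slide’s illustrative columns and predicate, with assumed connection details and paths.

```bash
# Column subset, row filter, and delimiter handling in one import.
sqoop import \
  --connect jdbc:oracle:thin:@//srcdb:1521/ORCL \
  --username app_user \
  --password-file /user/etl/.sqoop_pwd \
  --table ORDERS \
  --columns "orderId,product,sales" \
  --where "sales > 1000" \
  --hive-import \
  --hive-table orders_high_value \
  --hive-drop-import-delims \
  --target-dir /data/lake/mirror/orders_high_value
```

A free-form --query import would replace --table/--columns/--where and must keep the $CONDITIONS token in its WHERE clause.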
Oracle copyToBDA
• Licensed under Oracle Big Data SQL
• Stack of Oracle BDA, Exadata, and InfiniBand
• Helps in loading Oracle Database tables into Hadoop by –
• Dumping the table data in Data Pump format
• Copying the dump files into HDFS
• Full extract and load only
• If the source data changes –
• Rerun the utility to refresh the Hive tables
Greenplum’s GPHDFS
• Set up on all segment nodes of
a Greenplum cluster
• All segments concurrently push
their local copies of the data splits to
the Hadoop cluster
• The cluster segments yield the
power of parallelism (see the sketch below)
[Diagram: writable external tables on the Greenplum cluster pushing data in parallel to the Hadoop cluster]
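A hypothetical sketch of the gphdfs pattern; the database, table, columns, and NameNode URL are all assumptions, not values from the deck.

```bash
# Each Greenplum segment writes its local slice of the data directly to HDFS
# through a writable external table defined over the gphdfs protocol.
psql -d analytics <<'SQL'
CREATE WRITABLE EXTERNAL TABLE orders_to_hdfs (
  order_id  integer,
  product   text,
  sales     numeric
)
LOCATION ('gphdfs://namenode:8020/data/lake/mirror/orders')
FORMAT 'TEXT' (DELIMITER ',');

INSERT INTO orders_to_hdfs
SELECT order_id, product, sales FROM orders;
SQL
```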
Ingest unstructured data using Flume
• Distributed system to capture and load large volumes of log data from different source
systems into the data lake
• Collection and aggregation of streaming data as events (a minimal agent is sketched below)
[Diagram: a client PUTs incoming events into the Flume agent’s source; events pass through the
channel under source and sink transactions; the sink TAKEs events and delivers the outgoing data]
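A minimal single-agent sketch: a spooling-directory source feeding an in-memory channel and an HDFS sink. The agent name, directories, and HDFS path are illustrative assumptions.

```bash
# Write a minimal Flume agent definition and start the agent.
cat > lake-agent.conf <<'EOF'
lake.sources  = logs
lake.channels = mem
lake.sinks    = out

# source: pick up completed log files dropped into a spooling directory
lake.sources.logs.type     = spooldir
lake.sources.logs.spoolDir = /var/log/app/spool
lake.sources.logs.channels = mem

# channel: in-memory buffer between source and sink
lake.channels.mem.type                = memory
lake.channels.mem.capacity            = 10000
lake.channels.mem.transactionCapacity = 1000

# sink: land events in the raw zone of the data lake
lake.sinks.out.type                   = hdfs
lake.sinks.out.channel                = mem
lake.sinks.out.hdfs.path              = /data/lake/raw/applogs/%Y-%m-%d
lake.sinks.out.hdfs.fileType          = DataStream
lake.sinks.out.hdfs.useLocalTimeStamp = true
EOF

flume-ng agent --name lake --conf-file lake-agent.conf --conf "$FLUME_HOME/conf"
```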
Apache Flume
Design considerations - I
• Channel type
• MEMORY – events are read from the source into memory
• Good performance, but volatile and not cost effective
• FILE – events are read from the source into the file system
• Controllable performance; persistent with a transactional guarantee
• JDBC – events are read and stored in a Derby database
• Slow performance
• Kafka – events are stored in a Kafka topic
• Event batch size – maximum number of events that can be batched by a source or sink in a
single transaction
• The fatter the better – but not for the FILE channel
• A stable number ensures data consistency
Apache Flume
Design considerations - II
• Channel capacity and transaction capacity
• For the MEMORY channel, channel capacity is
limited by RAM size
• For the FILE channel, channel capacity is limited
by disk size
• The sink batch size should not exceed the
transaction capacity
• Channel selector (see the configuration sketch below)
• An event can either be replicated or
multiplexed across channels
• Preferable vs conditional routing
• Handle high-throughput systems
• Tiered architecture to handle event flow
• Aggregate-and-push approach
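A sketch of these knobs, appended to the illustrative agent definition above: a durable FILE channel with explicit capacities, plus a multiplexing selector that routes events by a header value. All names, paths, and sizes are assumptions.

```bash
# Add a durable FILE channel and route events to a channel by the "region" header
# (assumes an upstream interceptor sets that header on each event).
cat >> lake-agent.conf <<'EOF'

lake.channels              = mem disk
lake.sources.logs.channels = mem disk

# FILE channel: capacity bounded by disk; transactions persisted for replay.
# Keep the sink batch size at or below transactionCapacity.
lake.channels.disk.type                = file
lake.channels.disk.checkpointDir       = /var/lib/flume/checkpoint
lake.channels.disk.dataDirs            = /var/lib/flume/data
lake.channels.disk.capacity            = 1000000
lake.channels.disk.transactionCapacity = 10000

# multiplexing selector: events with region=eu go to the FILE channel,
# everything else to the MEMORY channel
lake.sources.logs.selector.type       = multiplexing
lake.sources.logs.selector.header     = region
lake.sources.logs.selector.mapping.eu = disk
lake.sources.logs.selector.default    = mem
EOF
```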
Questions?
AIOUG Sangam’17