Enterprise Data Science
Frank Kienle
Big Data Overview
1. Understand the business
2. Understand data
3. Prepare data
4. Model
5. Evaluation
6. Deployment
CRISP Value Process
Data are individual units of
information
We store more and more data which
leads to
Big Data
Data to Big Data
Erik Larson, Harper’s magazine:
‘The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes
other than originally intended.’
(Reality today: private data is becoming commoditized)
Big Data definitions 1989
Doug Laney, Gartner, 2001:
‘3-D Data Management: Controlling Data Volume, Velocity and Variety’
Big data definition 2001
Source: avyamuthanna.files.wordpress.com/2013/01/picture-11.png
Big Data is any data that is expensive to manage and hard to extract value from
(Source: Michael Franklin, Director of the Algorithms, Machines and People Lab (AMPLab), University of California, Berkeley)
Extracting value out of big data is all about predicting the future based on
observation of the past
Big Data today: it’s all about value
Big Data: the four V’s https://www.ibmbigdatahub.com/infographic/four-vs-big-data
handling (big) data is an art - not a value
§ up to 75 control devices in each BMW
§ ~1,000 individual configurations possible
§ ~1 GByte functional software, 15 GByte data in the car
§ ~2,000 customer functions implemented
§ ~12,000 error storage memories for onboard
§ up to 60,000 diagnostic processes daily worldwide
§ centralized data storage and organization
§ data fusion and data mining for quality assurance and a better understanding of realistic environments
Source: Bitcom BMW keynote talk
source: pixabay
Tracking the data in a car can
have benefits
but
comes with security / privacy
challenges
See lecture on ethical
challenges
Big Data Sources: Car black boxes
source: Los Angeles Times
A gas turbine has up to 1,000 sensors
§ Each sensor can (theoretically) process data in the millisecond range
§ example real-life setup:
§ averages are stored per second (history kept for one year)
§ often a long history is available, e.g. back to the year 2000 in 5-minute averages
IoT Sensor Data example: Gas Turbine
source: pixabay
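The averaging described for the gas turbine (raw sensor readings reduced to per-second or 5-minute averages) can be sketched as follows. This is a minimal illustration, not the setup used in the talk: the function name `average_per_bucket`, the integer millisecond timestamps, and the sample readings are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean


def average_per_bucket(samples, bucket_ms):
    """Group (timestamp_ms, value) tuples into fixed-width time buckets
    and return the mean value per bucket, keyed by bucket start time."""
    buckets = defaultdict(list)
    for timestamp_ms, value in samples:
        bucket_start = (timestamp_ms // bucket_ms) * bucket_ms
        buckets[bucket_start].append(value)
    return {start: fmean(values) for start, values in sorted(buckets.items())}


# Hypothetical readings from one sensor: one value every 20 ms for 2 seconds.
samples = [(i * 20, float(i)) for i in range(100)]

# Reduce 100 raw tuples to two per-second averages.
per_second = average_per_bucket(samples, bucket_ms=1000)
print(per_second)
```

The same function yields 5-minute averages with `bucket_ms=5 * 60 * 1000`; only the bucket width changes, which is why long histories are usually kept at the coarser resolution.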
Realistic scenario: store tuples (timestamp, value)
• new sensors will be introduced, sensors might change
Theoretical data stream storage, gas turbine example
§ (timestamp, value) 64 Byte x 1000 sensors →
  64 kByte      Time: 20 ms
  3.2 MByte     Time: 1 s
  276 GByte     Time: 1 day
  100.9 TByte   Time: 1 year
  x 100 engines in one data center → 10 PByte   Time: 1 year
Reality:
► 1 year stored in 1 s averages: 200 GByte   Time: 1 year
► 10 years stored in 5 min averages: ~7 TByte   Time: 10 years
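The theoretical stream volume can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 kByte = 1000 Byte), a 365-day year, and one 64-byte tuple per sensor every 20 ms; the constant and function names are illustrative only. Note that 3.2 MByte per second accumulates to roughly 276 GByte per day and about 100.9 TByte per year.

```python
TUPLE_BYTES = 64          # one (timestamp, value) tuple
SENSORS = 1000            # sensors per gas turbine
SAMPLES_PER_SECOND = 50   # one sample every 20 ms
ENGINES = 100             # engines in one data center

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365       # assumption: no leap year


def human(n_bytes):
    """Format a byte count with decimal (power-of-1000) units."""
    for unit in ("Byte", "kByte", "MByte", "GByte", "TByte", "PByte"):
        if n_bytes < 1000 or unit == "PByte":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000


per_sample = TUPLE_BYTES * SENSORS              # all sensors, one 20 ms tick
per_second = per_sample * SAMPLES_PER_SECOND
per_day = per_second * SECONDS_PER_DAY
per_year = per_day * DAYS_PER_YEAR
per_year_all_engines = per_year * ENGINES

for label, value in [
    ("20 ms", per_sample),
    ("1 s", per_second),
    ("1 day", per_day),
    ("1 year", per_year),
    ("1 year, 100 engines", per_year_all_engines),
]:
    print(f"{label:>20}: {human(value)}")
```

Storing averages instead of raw samples is what makes the volume manageable: the storage need falls in proportion to the averaging interval.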
Big Data Landscape - Data Lake Architecture
Components overview and terminology
The data lake is one part in the overall data-to-value path
Source → Manage → Value
Source: Raw (big) data typically comes from different sources and has many different data types (mostly structured, semi-structured, unstructured). Examples: Twitter, WWW, social, sensors, mobile, payments, transactions, transport, video, pictures, voice.
Manage: A data lake is a storage repository that holds a vast amount of (big) data in its native format until it is needed and provides intelligent (semi-structured) access.
Value: The value of data is delivered via enterprise systems / UX components with the overall goal of making data-driven decisions.
Stages of Data in Data Lake – High Level Architecture
The data flow and the technology, tools, and programming used depend on the data type and the final application layer
Data path: Data Sources (business systems) → Ingestion → Raw Data → Transform / Curate → Enriched Data → Delivery → Applications & Visualizations
Ingestion
  What: file transfer, RDB import, REST APIs
  How: stream or batch transfer
Raw Data
  What: initial raw data storage
  How: distributed storage (e.g. Hadoop)
Transform / Curate
  What: cleansing / transform for purpose
  How: distributed storage (e.g. Hadoop)
Enriched Data
  What: add semantics, searchable, anonymized, …
  How: databases for purpose
Delivery
  What: semantic data access
  How: on-request data services
Simplified data lake data path
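As a rough illustration of the data path (ingestion, raw storage, curation, enrichment, delivery), the stages can be written as small functions chained together. This is a toy sketch: the function names, the JSON record layout, and the use of a truncated SHA-256 hash as a stand-in for anonymization (strictly speaking pseudonymization) are all assumptions, and a real data lake would use distributed storage rather than Python lists.

```python
import hashlib
import json


def ingest(lines):
    """Ingestion: parse incoming JSON lines (stands in for file transfer or REST input)."""
    return [json.loads(line) for line in lines]


def curate(raw_records):
    """Transform / curate: drop records without a value and normalise the value type."""
    return [
        {"user": record["user"], "value": float(record["value"])}
        for record in raw_records
        if record.get("value") is not None
    ]


def enrich(curated_records):
    """Enrich: replace the user name with a one-way hash (pseudonymization)."""
    return [
        {
            "user_id": hashlib.sha256(record["user"].encode()).hexdigest()[:12],
            "value": record["value"],
        }
        for record in curated_records
    ]


def deliver(enriched_records):
    """Delivery: answer an on-request query, here the mean value."""
    values = [record["value"] for record in enriched_records]
    return sum(values) / len(values) if values else None


# Hypothetical input from a business system.
incoming = [
    '{"user": "alice", "value": "3.5"}',
    '{"user": "bob", "value": null}',
    '{"user": "carol", "value": 1.5}',
]

raw = ingest(incoming)                 # raw data is kept unchanged
enriched = enrich(curate(raw))         # cleansed and pseudonymized copy
result = deliver(enriched)
print(result)
```

Keeping `raw` untouched and deriving `enriched` from it mirrors the idea that the lake stores data in its native format and transforms it only for a given purpose.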
Exemplary high-level walk-through to extract, store, and deliver trend information
Data Source ((semi-)unstructured or raw data)
  WWW sources: large-scale web crawlers download all links found
Information retrieving
  Saved webpages → relevant Internet webpages for topic
Mining big data
  Search and mine data to extract semantics (relevance)
Data Lake: storing and mining relevant information
  Structured (graph) database of trends to allow for easy access
Final Presentation (clean, structured data)
  Drill-down boards, trend report
source: trends.google
A data to value architecture is composed of many building blocks
Depending on the data type and final business application, different elements are utilized:
• Data sources and data ingestion
• Data Storage
• Data Access / Pipelines
• Value Delivery
• Business Application
• Data Governance
• Functional Layer
• Deployment / Physical
raw data input → valuable data output
A data lake is often a fundamental part of the data to value stack and focuses on
the technical management of big data
(often) focus of data lake architectures
Data Lake High level architecture with different possibilities to store, process, and
deliver valuable information
Depending on the data type and final business application, different elements are utilized
Data Sources
  Unstructured: text, emails, documents; video, media; voice, music, sound
  Semi-structured data: XML, JSON, sensor, IoT
  Structured data: databases, ERP core
Data Ingestion: stream, batch, hybrid
Data Storage
  Relational: row based, column based
  Non-relational: graph DB, document DB, key-value
Data Access / Pipelines: stream, batch, interactive
Value Delivery: descriptive, predictive, prescriptive
Business Application: visualizations, interfaces, reporting; operational, tactical, strategic
Data Governance: availability, data security, compliance & controls, roles & responsibility, data quality, …
Functional Layer
Deployment: on premise, cloud, hybrid; application life cycle
The design of a data pipeline / data lake depends on the business, technical, and non-functional requirements
Data Requirements: Which data are needed?
Business Requirements: Why do we need this?
Technical Requirements: How to realize it?
Non-functional requirements: What constraints apply?
Example questions to be answered:
Business requirements
• Who is the customer (internal, external)?
• How does it help, and in which situation / process?
• Which value do we expect?
• If we improve quality by x%, which benefit do we expect?
• How to visualize / serve the results / back integration?
Non-functional requirements
• Which service level does the solution have (on request, 99% uptime)?
• Where is the data allowed to be stored, e.g. under GDPR?
• Who has access to the application / data?
• How is the support organized?
• Which security level is granted?
Technical requirements
• How does the application provide the result, e.g. via which technical interface?
• How is the data stored, and what are the latency requirements for read / write?
• How to ensure a test / production setup?
• Where do we compute, and with which libraries?
• Which algorithms best serve the requirements?
For each layer in the data stack many different vendors and applications exist
Layers: Data Storage, Data Access / Pipelines, Value Delivery, Business Application, Functional Layer, Deployment / Physical
Managing big data and data pipelines
• Infrastructure and hardware for Big Data
• Big Data distributions (e.g. Hadoop)
• Components for data management (distributed data systems, in-memory databases, …)
Focus
Extracting value
• Full business SaaS services
• Toolboxes for visualization
• Workflow enablement
Nearly all technical Big Data / Data Lake platforms are based on the (open source) Hadoop ecosystem.
Component   Description
HDFS        The Hadoop Distributed File System.
Mahout      Machine learning on HDFS.
ZooKeeper   A centralized service for maintaining synchronization and group services.
YARN        Hadoop’s resource manager and job scheduler.
HBase       The Hadoop database.
Pig         A high-level data-flow language and execution framework for parallel computation.
Spark SQL   A module for structured and semi-structured data processing.
Hive        A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop       A tool to move data from RDBMS to Hadoop.
Flume       A service for moving log data into Hadoop.
Stack diagram (components by layer): Flume (unstructured or semi-structured data) and Sqoop (structured data) feed HDFS (Hadoop Distributed File System) and HBase; on top sit the MapReduce framework, Apache Oozie (workflow), Hive (DW system), Pig Latin (data analysis), and Mahout (machine learning), with ZooKeeper alongside.
Focus in stack: Data Storage, Data Access / Pipelines, ingestion functions
Nearly all Big Data / Data Lake platforms are based on the (open source) Hadoop ecosystem. However, only enterprise Big Data platforms ensure professional management.
Component   Description
Ambari      An open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters.
HDFS        The Hadoop Distributed File System.
ZooKeeper   A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
YARN        Hadoop’s resource manager and job scheduler.
HBase       The Hadoop database.
Pig         A high-level data-flow language and execution framework for parallel computation.
Spark SQL   A module for structured and semi-structured data processing.
Hive        A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop       A tool to move data from RDBMS to Hadoop.
Flume       A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop.
Kafka       A high-throughput, distributed, publish-subscribe messaging system.
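The MapReduce framework in the Hadoop stack follows a map, shuffle, reduce pattern. The sketch below imitates that pattern in plain Python on a single machine, using a word count as the example; it does not use Hadoop or any of its APIs, and the function names and sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key, here by summing the counts."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input splits; on a cluster each would be mapped on a different node.
documents = ["big data is big", "data lake holds raw data"]

pairs = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each map call touches only its own split and each reduce call only one key, both phases can run in parallel across machines, which is the property the distributed framework exploits.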
Visualization Tools example for Data Scientists
(some practical tools/libraries, the purpose defines the tool)
Focus in stack: Visualization

Purpose: General
Example: Excel
Description: Used by nearly everybody for visualization due to its powerful capabilities and wide adoption

Purpose: Web
Example: D3.js + derivatives
Description: Focus on interactive data visualizations in web browsers. JavaScript library for manipulating documents based on data

Purpose: Rapid Prototyping
Example: Python (Matplotlib), R (Shiny)
Description: Open source programming languages, active community participation, quick results, and must-have know-how for a data scientist

Purpose: Professional Visual Exploration
Example: Tableau, Qlik, MS Power BI
Description: Professional interactive visualization tools with a focus on quick insights, with the goal of providing business intelligence (BI) for an enterprise
Libraries/Algorithms/Programming/Tools
(some practical tools/libraries, the purpose defines the selection)
Focus in stack: Functional Layer, Data Pipelines

Purpose: General
Example: Excel
Description: The most widely used tool worldwide for data processing / calculation purposes, with powerful capabilities (mostly not known)

Purpose: Statistics / Machine Learning
Example: Python + R
Description: The two most important languages for data science (there exist many more)

Purpose: (Big) Data Processing
Example: Spark + SQL
Description: Query languages and stream / batch processing programming paradigms with easy access to managed big data (there exist many more)

Purpose: Tool Providers Statistics / ML
Example: SAS, RapidMiner, KNIME, Matlab, …
Description: Professional tools with the goal of providing packaged, maintained, and easily consumable analytics for professional and citizen data scientists
Big Data Landscape 2012
Big Data Landscape v 3.0 by Sub-Categories (source kdnuggets.com)
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 

Introduction Big Data

  • 1. Enterprise Data Science: Big Data Overview. Frank Kienle
  • 2. CRISP Value Process: 1. Understand the business 2. Understand data 3. Prepare data 4. Model 5. Evaluation 6. Deployment
  • 3. Data to Big Data: Data are individual units of information. We store more and more data, which leads to Big Data.
  • 4. Big Data definitions 1989: Erik Larson, Harper’s Magazine: ‘The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.’ (Reality today: private data is becoming commoditized)
  • 5. Big data definition 2001: Doug Laney, Gartner, 2001: ‘3-D Data Management: Controlling Data Volume, Velocity and Variety’. Source: avyamuthanna.files.wordpress.com/2013/01/picture-11.png
  • 6. Big Data today: it’s all about value. Big Data is any data that is expensive to manage and hard to extract value from (Source: Michael Franklin, Director of the Algorithms, Machines and People Lab (AMPLab), University of California, Berkeley). Extracting value out of big data is all about predicting the future based on observations of the past.
  • 7. Big Data: the four V’s. Source: https://www.ibmbigdatahub.com/infographic/four-vs-big-data
  • 8. handling (big) data is an art - not a value
  • 9. Up to 75 control devices in each BMW; ~1,000 individual configurations possible; ~1 GByte functional software, 15 GByte data in the car; ~2,000 customer functions implemented; ~12,000 error storage memories for onboard; daily up to 60,000 diagnostic processes worldwide; centralized data storage and organization; data fusion and data mining for quality assurance and better understanding of realistic environments. Source: Bitcom BMW keynote talk; source: pixabay
  • 10. Big Data Sources: Car black boxes. Tracking the data in a car can have benefits but comes with security / privacy challenges. See lecture on ethical challenges. Source: Los Angeles Times
  • 11. IoT Sensor Data example: Gas Turbine. A gas turbine has up to 1000 sensors. Each sensor can (theoretically) process data in the millisecond range. Example real-life setup: averages are stored per second (history kept for one year); often a long history is available, e.g. back to the year 2000 at 5-minute resolution (averages). Source: pixabay
  • 12. Theoretical data stream storage, gas turbine example. Realistic scenario: store tuples (timestamp, value); new sensors will be introduced, sensors might change. Theoretical: (timestamp, value) 64 Byte × 1000 sensors → 64 kByte per 20 ms → 3.2 MByte per 1 s → 276 GByte per 1 day → 100.9 TByte per 1 year; × 100 engines in one data center → ~10 PByte per 1 year. Reality: ► 1 year stored in 1 s averages: 200 GByte (time: 1 year) ► 10 years stored in 5 min averages: ~7 TByte (time: 10 years)
  • 13. Big Data Landscape - Data Lake Architecture: components overview and terminology
  • 14. The data lake is one part in the overall data to value path (Source → Manage → Value). Source: raw (big) data typically comes from different sources (twitter, www, social, sensors, mobile, payments, transactions, transport, video, pictures, voice) and has many different data types (unstructured, semi-structured, mostly structured). Manage: a data lake is a storage repository that holds a vast amount of (big) data in its native format until it is needed and provides intelligent (semi-structured) access. Value: the value of data is delivered via enterprise systems / UX components with the overall goal of making data-driven decisions.
  • 15. Stages of Data in Data Lake: High Level Architecture (simplified data lake data path). The data flow and the technology, tools and programming used depend on the data type and the final application layer. Path: Data Sources (business systems) → Ingestion → Raw Data → Transform / Curate → Enriched Data → Delivery → Applications & Visualizations. What and how per stage: Ingestion: stream or batch transfer; file transfer, RDB import, REST APIs. Raw Data: initial raw data storage; distributed storage (e.g. Hadoop). Transform / Curate: cleansing / transform for purpose; distributed storage (e.g. Hadoop). Enriched Data: add semantic, searchable, anonymized, …; databases for purpose. Delivery: semantic data access; on request data services.
  • 16. Exemplary high-level walk through to extract, store and deliver trend information. Data Source: WWW sources; relevant Internet webpages for the topic. Information retrieving: large-scale web crawlers download all links found. Data Lake (storing and mining relevant information): saved webpages ((semi-)unstructured or raw data); mining big data: search and mine the data to extract semantics (relevance); structured (graph) database of trends to allow for easy access (clean, structured data). Final Presentation: trend report; drill-down boards. Source: trends.google
  • 17. A data to value architecture is composed of many building blocks. Depending on the data type and final business application, different elements are utilized. Functional Layer (raw data input → valuable data output): Data sources and data ingestion → Data Storage → Data Access / Pipelines → Value Delivery → Business Application. Further layers: Data Governance; Deployment / Physical.
  • 18. A data lake is often a fundamental part of the data to value stack and focuses on the technical management of big data. Depending on the data type and final business application, different elements are utilized. Functional Layer (raw data input → valuable data output): Data sources and data ingestion → Data Storage → Data Access / Pipelines → Value Delivery → Business Application. Further layers: Data Governance; Deployment / Physical. One part of this stack is marked as the (often) focus of data lake architectures.
  • 19. Data Lake: high level architecture with different possibilities to store, process, and deliver valuable information. Depending on the data type and final business application, different elements are utilized. Data Sources: unstructured (text, emails, documents; video, media; voice, music, sound); semi-structured data (XML, JSON; sensor; IoT); structured data (databases; ERP core). Data Ingestion: stream, batch, hybrid. Data Storage: relational (row based, column based); non-relational (graph DB, document DB, key-value). Data Access / Pipelines: stream, batch, interactive. Value Delivery: descriptive, predictive, prescriptive. Business Application: visualizations, interfaces, reporting; operational, tactical, strategic. Data Governance: availability, data security, compliance & controls, roles & responsibility, data quality, … Deployment: on premise, cloud, hybrid; application life cycle.
  • 20. The design of a data pipeline / data lake depends on the business, technical, and non-functional requirements. Business Requirements: Why do we need this? Data Requirements: Which data are needed? Technical Requirements: How to realize? Non-functional requirements: What constraints? Elements shown: Data Storage (relational: row based, column based; non-relational: graph DB, document DB, key-value); Data Access / Pipelines (stream, batch, interactive); Value Delivery (descriptive, predictive, prescriptive); Business Application (visualizations, interfaces, reporting; operational, tactical, strategic); availability, data security, roles & responsibility, data quality; Deployment (on premise, cloud, hybrid; application life cycle).
  • 21. The design of a data pipeline / data lake depends on the business, technical, and non-functional requirements: example questions to be answered. Business Requirements (Why do we need this?): Who is the customer (internal, external)? How does it help, and in which situation / process? Which value do we expect? If we improve quality by x%, which benefit do we expect? How to visualize / serve the results / back integration? Non-functional requirements (What constraints?): Which service level does the solution have (on request, 99% uptime)? Where is the data allowed to be stored, e.g. GDPR? Who has access to the application / data? How is the support organized? Which security level is granted? Technical Requirements (How to realize?): How does the application provide the result, e.g. which technical interface? How is the data stored, and what are the latency requirements for read / write? How to ensure a test / productive setup? Where do we compute, and with which libraries? Which algorithms best serve the requirements?
  • 22. For each layer in the data stack many different vendors and applications exist. Layers: Data Storage, Data Access / Pipelines, Value Delivery, Business Application (Functional Layer); Deployment / Physical. Focus: managing big data and data pipelines: infrastructure and hardware for big data; big data distributions (e.g. Hadoop); components for data management (distributed data systems, in-memory databases, …). Focus: extracting value: full business SaaS services; tool boxes for visualization; workflow enablement.
  • 23. Nearly all technical Big Data / Data Lakes are based on the (open source) Hadoop & Ecosystem. Components (component: description): HDFS: the Hadoop Distributed File System. Mahout: machine learning on HDFS. ZooKeeper: a centralized service for maintaining synchronization and group services. YARN: Hadoop’s resource manager and job scheduler. HBase: the Hadoop database. Pig: a high-level data-flow language and execution framework for parallel computation. Spark SQL: a module for structured and semi-structured data processing. Hive: a data warehouse infrastructure supporting data summarization, query, and analysis. Sqoop: a tool to move data from RDBMS to Hadoop. Flume: a service for moving log data into Hadoop. Diagram (focus in stack; layers: Ingestion, Data Storage, Data Access / Pipelines, Functions): Flume ingests unstructured or semi-structured data and Sqoop ingests structured data into HDFS (Hadoop Distributed File System) and HBase; on top sit the MapReduce framework, Apache Oozie (workflow), Hive (DW system), Pig Latin (data analysis) and Mahout (machine learning), with ZooKeeper alongside.
  • 24. Nearly all Big Data / Data Lakes are based on the (open source) Hadoop & Ecosystem. However, only Enterprise Big Data Platforms ensure professional management. Components (component: description): Ambari: an open operational framework for provisioning, managing and monitoring Apache Hadoop clusters. HDFS: the Hadoop Distributed File System. ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. YARN: Hadoop’s resource manager and job scheduler. HBase: the Hadoop database. Pig: a high-level data-flow language and execution framework for parallel computation. Spark SQL: a module for structured and semi-structured data processing. Hive: a data warehouse infrastructure supporting data summarization, query, and analysis. Sqoop: a tool to move data from RDBMS to Hadoop. Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop. Kafka: a high-throughput, distributed, publish-subscribe messaging system.
  • 25. Visualization tools, examples for data scientists (some practical tools/libraries; the purpose defines the tool). Focus in stack: Visualization. Purpose: example: description. General purpose: Excel: used by nearly everybody for visualization due to its powerful capabilities and wide penetration. Web: D3.js + derivatives: JavaScript library for manipulating documents based on data, with a focus on interactive data visualizations in web browsers. Rapid prototyping: Python (Matplotlib), R (Shiny): open source programming languages, active community participation, quick results, and must-have know-how for a data scientist. Professional visual exploration: Tableau, Qlik, MS PowerBI: professional interactive visualization tools with a focus on quick insights, with the goal of providing business intelligence (BI) for an enterprise.
  • 26. Libraries/Algorithms/Programming/Tools (some practical tools/libraries; the purpose defines the selection). Focus in stack: Functional Layer, Data Pipelines. Purpose: example: description. General purpose: Excel: the most used tool worldwide for data processing/calculation purposes, with powerful capabilities (mostly not known). Statistics / Machine Learning: Python + R: the two most important languages for data science (there exist many more). (Big) Data Processing: Spark + SQL: query languages and stream/batch processing programming paradigms with easy access to managed big data (there exist many more). Tool providers, statistics/ML: SAS, RapidMiner, KNIME, Matlab, …: professional tools with the goal of providing packaged, maintained and easily consumable analytics for professional and citizen data scientists.
  • 27. Big Data Landscape 2012
  • 30. Big Data Landscape v 3.0 by Sub-Categories (source kdnuggets.com)
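
The storage arithmetic of the gas turbine example (slide 12) can be retraced with a short sketch. It is not part of the original slides. It takes the premises stated there (64 Byte per (timestamp, value) tuple, 1000 sensors, one sample every 20 ms, 100 engines in one data center) and assumes decimal prefixes, 365 days per year, and no compression or storage overhead. The names `stored_bytes` and `human` are hypothetical helpers introduced only for illustration.

```python
# Storage estimate for the gas turbine example (slide 12).
# Premises taken from the slide: 64 Byte per (timestamp, value) tuple,
# 1000 sensors, one sample every 20 ms, 100 engines in one data center.
# Assumptions of this sketch: decimal prefixes (1 kByte = 1000 Byte),
# 365 days per year, no compression or storage overhead.

BYTES_PER_TUPLE = 64
SENSORS = 1000
ENGINES = 100

MS_PER_SECOND = 1000
MS_PER_DAY = 24 * 60 * 60 * MS_PER_SECOND
MS_PER_YEAR = 365 * MS_PER_DAY


def stored_bytes(duration_ms, interval_ms, engines=1):
    """Bytes written when every sensor stores one tuple per interval."""
    samples = duration_ms // interval_ms
    return samples * SENSORS * BYTES_PER_TUPLE * engines


def human(n_bytes):
    """Format a byte count with decimal prefixes, e.g. 3200000 -> '3.2 MByte'."""
    value = float(n_bytes)
    units = ["Byte", "kByte", "MByte", "GByte", "TByte", "PByte"]
    for unit in units[:-1]:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} {units[-1]}"


if __name__ == "__main__":
    raw_ms = 20  # raw sampling interval from the slide

    # Theoretical raw stream, one engine.
    print("20 ms :", human(stored_bytes(raw_ms, raw_ms)))         # 64.0 kByte
    print("1 s   :", human(stored_bytes(MS_PER_SECOND, raw_ms)))  # 3.2 MByte
    print("1 day :", human(stored_bytes(MS_PER_DAY, raw_ms)))     # 276.5 GByte
    print("1 year:", human(stored_bytes(MS_PER_YEAR, raw_ms)))    # 100.9 TByte

    # Theoretical raw stream, 100 engines, one year.
    print("1 year, 100 engines:",
          human(stored_bytes(MS_PER_YEAR, raw_ms, engines=ENGINES)))  # 10.1 PByte

    # Averaged storage: 1 s averages for one year, one engine.
    print("1 s averages, 1 year:",
          human(stored_bytes(MS_PER_YEAR, MS_PER_SECOND)))  # 2.0 TByte

    # Averaged storage: 5 min averages for ten years, 100 engines.
    five_min_ms = 5 * 60 * MS_PER_SECOND
    print("5 min averages, 10 years, 100 engines:",
          human(stored_bytes(10 * MS_PER_YEAR, five_min_ms, engines=ENGINES)))  # 6.7 TByte
```

Under these assumptions, the result for 5-minute averages over 10 years and 100 engines (about 6.7 TByte) appears consistent with the ~7 TByte on the slide. The other "reality" figure may rest on premises the slide does not state (for example, tuple size or number of engines), so the sketch does not try to reproduce it.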