Big Data, Hadoop, NoSQL DB - Introduction
Ing. Ľuboš Takáč, PhD.

University of Žilina

November, 2013
Overview
• Big Data

• Hadoop
– HDFS
– Map Reduce Paradigm

• NoSQL Databases
Big Data
• the origin of the term “BIG DATA” is unclear

• there are many definitions,
e.g. “Big data is now almost universally understood to refer to the
realization of greater business intelligence by storing, processing, and
analyzing data that was previously ignored due to the limitations of traditional
data management technologies.” (Matt Aslett)
Big Data
• can be defined by the (original) 3 Vs
– Volume (a lot of data)
– Variety (structured, semi-structured, and unstructured data)
– Velocity (data arrives quickly and must be processed quickly)
– other Vs
• Veracity (IBM)
• Value (Oracle)
• etc.
Where are Big Data Generated
Sample of Big Data Use Cases Today
Hadoop
• a new approach to storing and processing distributed data
• open source project based on Google's GFS (Google File System) and the Map Reduce paradigm
– Google published papers about GFS and Map Reduce in 2003-2004

• the open source community led by Doug Cutting applied these tools to the open source search engine Nutch
• in 2006 it became a separate project named Hadoop
Different Approach for Data Processing
• powerful hardware vs. commodity hardware
HDFS (Hadoop Distributed File System)
• the core part of Hadoop
• open source implementation of Google's GFS (Google File System)
• designed for commodity hardware
• responsible for distributing files throughout the cluster (the machines connected in a Hadoop installation)
• designed for high throughput rather than low latency
• typical files are gigabytes in size
• files are broken down into blocks (typically 64 MB or 128 MB)
• blocks are replicated (typically 3 replicas)
• rack-aware; write-once (data can be appended but not modified in place)
• fault tolerant
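The block and replica figures above translate into simple arithmetic. The following is a minimal sketch in plain Python, not Hadoop code; the function name `hdfs_footprint` and its parameters are hypothetical and chosen only for illustration. It assumes the last block of a file occupies only as much space as the data it holds.

```python
MB = 1024 * 1024
GB = 1024 * MB

def hdfs_footprint(file_size, block_size=64 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Hypothetical helper for illustration; sizes are in bytes.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size // block_size)
    # Every block is stored `replication` times, so raw storage is the
    # file size multiplied by the replication factor.
    raw_bytes = file_size * replication
    return num_blocks, raw_bytes

blocks, raw = hdfs_footprint(1 * GB)
print(blocks, raw // GB)  # 16 3
```

With these assumptions, a 1 GB file is split into 16 blocks of 64 MB (or 8 blocks of 128 MB) and occupies 3 GB of raw disk space across the cluster.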
HDFS – Usage Example

• $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
– (HDFS behaves like a virtual folder; once the files are copied, every machine in the cluster can access them)

• $ bin/hadoop dfs -ls /user/hadoop
– (the virtual folder is accessible via familiar file commands)
Map Reduce Paradigm
• processing of data stored in HDFS
• map task – works locally on a part of the overall data
• reduce task – collects and processes the results of the map tasks
Map Reduce Example “Hello World”

• text files over HDFS
• word count – counting the frequency of words
Map Reduce Example (Code)
Map phase
Reduce Phase
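The code on the original slide is not preserved in this transcript. As a stand-in, here is a minimal sketch of the word count idea in plain Python. It simulates the map, shuffle, and reduce steps in memory on one machine and does not use the Hadoop API; all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    # In a real cluster each line (or file block) would be mapped on the
    # node that stores it; here everything runs in a single process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())

print(word_count(["Hello World", "Hello Hadoop"]))
# {'hello': 2, 'world': 1, 'hadoop': 1}
```

The map function sees only its own piece of the input, and the reduce function sees only the values for one key, which is what lets a framework run many copies of each in parallel.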
Map Reduce Example (How it works)
Map Reduce Task (Execution)

• $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir

• $ bin/hadoop dfs -cat /user/hadoop/output_dir/part-r-00000
Map Reduce Task – Monitoring & Debugging
• Hadoop has an interactive web interface for monitoring tasks and the cluster
• log files
Hadoop Ecosystem
• other tools that work with Hadoop (or were made for Hadoop)
Hadoop Ecosystem
• Hadoop (HDFS, Map Reduce Framework)

• Avro (data serialization)
• Chukwa (monitoring large clustered systems)
• Flume (data collection and aggregation)

• HBase (real-time read and write database)
• Hive (data summarization and querying)
• Lucene (text search)
• Pig (programming and query language)
• Sqoop (data transfer between hadoop and databases)
• Oozie (work flow and job orchestration)
• etc.
Hadoop Distributions
• open source (hard to configure), http://hadoop.apache.org/

• commercial solutions
– debugged ready-made solutions with support
– include proprietary software and hardware
– user friendly interfaces, also in cloud
– IBM
• InfoSphere BigInsights
– Cloudera
– Oracle
• Exadata
• Exalytics
NoSQL Databases
• SQL – traditional relational DBMS
• not every data management/analysis problem is best solved exclusively using a traditional relational DBMS

• NoSQL = No SQL = not using traditional relational DBMS
• NoSQL = not only SQL
• NoSQL is not a substitute for SQL DBMSs and does not try to replace them
• often used for Big Data
NoSQL Databases
• designed for fast retrieval and appending operations
• no fixed data structure (schema-less)
• types
– document store
– graph databases
– key-value store
– etc.

• key-value store (like a relational table with two columns, key and value)
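The two-column picture of a key-value store can be sketched with a dictionary. This is a toy in-memory illustration in Python, not the API of any real NoSQL product; the class `KeyValueStore` and its method names are hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key-value store: each key maps to one opaque value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not inspect the value; its structure is known
        # only to the application.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieval is a single lookup by key.
        return self._data.get(key, default)

    def append(self, key, item):
        # Appending adds to the list stored under the key, creating it if needed.
        self._data.setdefault(key, []).append(item)

store = KeyValueStore()
store.put("user:42", {"name": "Anna"})
store.append("log:42", "login")
store.append("log:42", "logout")
print(store.get("user:42"))  # {'name': 'Anna'}
print(store.get("log:42"))   # ['login', 'logout']
```

Because every operation touches exactly one key, the keys can be spread over many machines, which is one reason such stores parallelize well.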
NoSQL Databases
• advantages
– low latency, high throughput
– highly parallelizable, massive scalability
– simplicity of design, easy to set up

– relaxed consistency => higher performance and availability

• disadvantages
– no declarative query language => more programming
– relaxed consistency => fewer guarantees
– absence of model => data model is inside the application (a big step back)

• examples: MongoDB, Neo4j, Dynamo, HBase, Allegro, Cassandra, etc.
Summary
• Big Data
– typically unstructured, machine-generated data (sensors, applications) with potential value
– often not used before
– volume, variety, velocity => hard to process with traditional technologies

• Hadoop
– open source technology for storing and processing distributed data
– processing Big Data on commodity hardware cluster
– HDFS, Map Reduce (and the other components of Hadoop Ecosystem)

• NoSQL Databases
– not using traditional relational DBMS
– typically key-value stores, easy to set up
– designed for fast retrieval and appending operations
– highly parallelizable
References
•

[1] JP. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012.

•

[2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013.

•

[3] O. Dolák, Big Data, http://www.systemonline.cz, 2012.

•

[4] P. Zikopoulos, D. Deroos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data,
ISBN 978-0-07-180817-0, 2013.

•

[5] http://www.go-globe.com, 2013.

•

[6] Kanik T., Kováč M., NOSQL - Non-Relational Database Systems as the New Generation of DBMS, OSSConf,
2012.

•

[7] http://wiki.apache.org/hadoop, 2013.

•

[8] http://hadoop.apache.org, 2013.

•

[9] L22: SC Report, Map Reduce, The University of Utah

•

[10] http://bigdatauniversity.com, 2013.

•

[11] http://en.wikipedia.org/wiki/NoSQL
Thank you for your attention!
lubos.takac@gmail.com

Big data, Hadoop, NoSQL DB - introduction

  • 1. Big Data, Hadoop, NoSQL DB - Introduction Ing. Ľuboš Takáč, PhD. University of Žilina November, 2013
  • 2. Overview • Big Data • Hadoop – HDFS – Map Reduce Paradigm • NoSQL Databases
  • 3. Big Data • the origin of the term “BIG DATA” is unclear • there are a lot of definitions, e.g. “Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” Matt Aslett
  • 4. Big Data • Can be defined by the (original) 3 Vs – Volume (a lot of data) – Variety (variously structured data) – Velocity (fast processing) – other Vs • Veracity (IBM) • Value (Oracle) • Etc.
  • 5. Where are Big Data Generated
  • 6. Sample of Big Data Use Cases Today
  • 7. Hadoop • new idea to store and process distributed data • open source project based on Google's GFS (Google File System) and the Map Reduce paradigm – Google published papers about GFS and Map Reduce in 2003-2004 • the open source community led by Doug Cutting applied these tools to the open source search engine Nutch • in 2006 it became a separate project named HADOOP
  • 8. Different Approach for Data Processing • powerful hardware vs. commodity hardware
  • 9. HDFS (Hadoop Distributed File System) • the core part of Hadoop • open source implementation of Google's GFS (Google File System) • designed for commodity hardware • responsible for distributing files throughout the cluster (the connected PCs in Hadoop) • designed for high throughput rather than low latency • typical files are gigabytes in size • files are broken down into blocks (64MB, 128MB) • blocks are replicated (typically 3 replicas) • rack aware, write once (append) • fault tolerant
  • 10. HDFS – usage example • $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg – (the target is something like a virtual folder; after copying, all PCs in the cluster can access those files) • $ bin/hadoop dfs -ls /user/hadoop – (the virtual folder is accessible via common commands)
  • 11. Map Reduce Paradigm • processing of data stored in HDFS • map task – works locally on a part of the overall data • reduce task – collects and processes the results of the map tasks
  • 12. Map Reduce Example “Hello World” • text files over HDFS • word count – counting the frequency of words
  • 13. Map Reduce Example (Code) Map phase Reduce Phase
  • 14. Map Reduce Example (How it works)
  • 15. Map Reduce Task (Execution) • $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir • $ bin/hadoop dfs -cat /user/hadoop/gutenberg-output/part-r-00000
  • 16. Map Reduce Task – Monitoring & Debugging • Hadoop has an interactive web interface for monitoring tasks and the cluster • log files
  • 17.
  • 18.
  • 19. Hadoop Ecosystem • other tools usable with Hadoop (or made for Hadoop)
  • 20. Hadoop Ecosystem • Hadoop (HDFS, Map Reduce Framework) • Avro (data serialization) • Chukwa (monitoring large clustered systems) • Flume (data collection and navigation) • HBase (real-time read and write database) • Hive (data summarization and querying) • Lucene (text search) • Pig (programming and query language) • Sqoop (data transfer between Hadoop and databases) • Oozie (workflow and job orchestration) • etc.
  • 21. Hadoop Distributions • open source (hard to configure), http://hadoop.apache.org/ • commercial solutions – debugged ready-made solutions with support – include proprietary software and hardware – user friendly interfaces, also in cloud – IBM • InfoSphere BigInsights • Cloudera – ORACLE • Exadata • Exalytics
  • 22. NoSQL Databases • SQL – traditional relational DBMS • not every data management/analysis problem is best solved exclusively using a traditional relational DBMS • NoSQL = No SQL = not using a traditional relational DBMS • NoSQL = not only SQL • NoSQL databases are not a substitute for SQL DBMSs and do not try to replace them • often used for Big Data
  • 23. NoSQL Databases • designed for fast retrieval and appending operations • no data structures • types – document store – graph databases – key-value store – etc. • key-value store (like a relational table with two columns, key and value)
  • 24. NoSQL Databases • advantages – low latency, high throughput – highly parallelizable, massive scalability – simplicity of design, easy to set up – relaxed consistency => higher performance and availability • disadvantages – no declarative query language => more programming – relaxed consistency => fewer guarantees – absence of model => data model is inside the application (a big step back) • examples: MongoDB, Neo4j, Dynamo, HBase, Allegro, Cassandra, etc.
  • 25. Summary • Big Data – typically unstructured, generated data (sensors, applications) with potential – often not used before – volume, variety, velocity => hard to process with traditional technologies • Hadoop – open source technology for storing and processing distributed data – processing Big Data on a cluster of commodity hardware – HDFS, Map Reduce (and the other components of the Hadoop Ecosystem) • NoSQL Databases – not using a traditional relational DBMS – typically key-value stores, easy to set up – designed for fast retrieval and appending operations – highly parallelizable
  • 26. References • [1] JP. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012. • [2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013. • [3] O. Dolák, Big Data, http://www.systemonline.cz, 2012. • [4] P. Zikopoulos, D. Deroos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data, ISBN 978-0-07-180817-0, 2013. • [5] http://www.go-globe.com, 2013. • [6] Kanik T., Kováč M., NOSQL - Non-Relational Database Systems as the New Generation of DBMS, OSSConf, 2012. • [7] http://wiki.apache.org/hadoop, 2013. • [8] http://hadoop.apache.org, 2013. • [9] L22: SC Report, Map Reduce, The University of Utah • [10] http://bigdatauniversity.com, 2013. • [11] http://en.wikipedia.org/wiki/NoSQL
  • 27. Thank you for your attention! lubos.takac@gmail.com
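
The word count example from slides 11-14 can be sketched in plain Python to show how the map and reduce tasks fit together. This is only an illustrative single-process simulation, not the Java code from the slides and not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for this sketch, and the shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map task (hypothetical name): emit (word, 1) for each word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key; in Hadoop the framework does this step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce task (hypothetical name): sum all counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence within a single process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Sample input is made up for illustration; in Hadoop the lines would come from HDFS.
    sample = ["Hello World", "Hello Hadoop"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In a real cluster, each map task would run on the node that holds its block of the input, and only the grouped intermediate pairs would travel over the network to the reduce tasks.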