An Overview of Big data & Hadoop
Prepared & presented by
Tony Nguyen
July 2014
Presentation outline
 Introduces Big Data concepts and gives an overview of the main Big Data technologies
 Explains the different tools and how to choose the right ones for DW and ETL
 How does current BI/DW fit into the Big Data context?
 How do Microsoft BI and Hadoop work together?
What is big data?
 Refers to any collection of data sets so large and complex (for example, hundreds of petabytes) that traditional data processing tools struggle to handle it
Why does Big Data matter?
• 2 billion internet users in the world today
• 7.3 billion active cell phones in 2014
• 7 TB of data is processed by Twitter every day
• 500 TB of data is processed by Facebook every day
• With such a massive quantity of data, businesses need fast, reliable, deeper data insight
Big Data Technologies
What is Hadoop?
 Refers to an ecosystem built around a large-scale distributed file system, used to store and process big data across multiple servers.
 Core Hadoop technologies include MapReduce and the Hadoop Distributed File System (HDFS)
Who are the major Hadoop vendors?
 IBM InfoSphere BigInsights: IBM packages Hadoop with its own products, including Text Analytics, Social Data Analytics Accelerator, Big SQL and Big R
 Cloudera: packages the Hadoop core components with its well-known analytic SQL product, Impala, and provides enterprise support. Current Cloudera Hadoop versions include CDH 4.7 and CDH 5.1
 Hortonworks: a company formed by Yahoo and Benchmark Capital; it makes Hadoop enterprise-ready, with HDP 2.1 as the latest version
 Microsoft: contributes HDInsight, Hadoop on the Windows platform
HDFS
 The Hadoop distributed file system
(HDFS) is a distributed, scalable, and
portable file-system written in Java for the
Hadoop framework.
 It is designed to run across low-cost
commodity hardware
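To make the storage model more concrete, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 mirror common HDFS defaults, but the round-robin placement, the function name `plan_blocks` and the node names are simplifying assumptions for illustration, not the placement policy HDFS actually uses.

```python
def plan_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Return a list of (block_index, block_size_mb, replica_nodes) tuples.

    Hypothetical helper: real HDFS placement is rack-aware and decided
    by the NameNode; here replicas are simply assigned round-robin.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    plan = []
    offset = 0
    index = 0
    while offset < file_size_mb:
        # The last block may be smaller than the configured block size.
        size = min(block_size_mb, file_size_mb - offset)
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append((index, size, replicas))
        offset += size
        index += 1
    return plan


if __name__ == "__main__":
    for block in plan_blocks(300, ["node1", "node2", "node3", "node4"]):
        print(block)
```

Because every block lives on several machines, losing one low-cost server does not lose data, which is what makes commodity hardware viable.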
MapReduce
 MapReduce is a programming model and an
associated implementation for processing
and generating large data sets with a
parallel, distributed algorithm on a cluster.
 Hadoop 2 introduced YARN (Yet Another Resource Negotiator), which separates cluster resource management from the MapReduce programming model.
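The programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using the classic word-count example; the function names are hypothetical and nothing here uses the Hadoop API or runs on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big hadoop"]))
```

On a real cluster the map calls run in parallel on the nodes holding the data, and the shuffle moves intermediate pairs across the network to the reducers.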

Core components on top of Hadoop
1. Hive (Facebook)
2. Pig (Yahoo)
3. HBase
4. HCatalog
5. Knox
6. ZooKeeper
7. Sqoop
Pig
1. Originally developed by Yahoo
2. Best used for ETL on large data sets
3. Its dataflow scripting language, Pig Latin, is a high-level language designed to remove the complexities of coding MapReduce applications.
4. Pig converts its operators into MapReduce code.
5. Instead of needing Java programming skills and an understanding of the MapReduce coding infrastructure, people with little programming experience can simply invoke operators such as ORDER BY or FILTER without having to code a MapReduce application to accomplish those tasks.
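The filter-then-sort dataflow described above can be mimicked in plain Python. The comments show roughly what the equivalent Pig Latin statements might look like; the relation names, field names and sample records are made up for illustration, and the Python helpers are not part of Pig.

```python
# Rough Pig Latin equivalent (illustrative names, not from the slides):
#   logs   = LOAD 'logs' AS (user:chararray, bytes:int);
#   big    = FILTER logs BY bytes > 100;
#   ranked = ORDER big BY bytes DESC;

records = [("alice", 120), ("bob", 80), ("carol", 300)]


def filter_by(rows, predicate):
    """Keep only the rows for which the predicate holds (like FILTER)."""
    return [row for row in rows if predicate(row)]


def order_by(rows, key, descending=False):
    """Return the rows sorted by the given key (like ORDER ... BY)."""
    return sorted(rows, key=key, reverse=descending)


big = filter_by(records, lambda row: row[1] > 100)
ranked = order_by(big, key=lambda row: row[1], descending=True)

if __name__ == "__main__":
    print(ranked)
```

Each step names an intermediate relation, which is the dataflow style that distinguishes Pig Latin from a single SQL statement.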
Hive
 Originally developed by Facebook in 2007
 Hive is a data warehouse built on top of the Hadoop file system (HDFS) that lets developers use SQL-like scripts (called HiveQL or HQL) to create databases and tables.
 Hive translates the SQL-like scripts into MapReduce jobs that store and process large data sets.
 Short learning curve, as BI developers use familiar SQL-like scripts
Hive (Cont’d)
 Hive (as of version 0.13) does not allow UPDATE or DELETE on individual records, but INSERT INTO is supported.
 One way to work around this limitation is to use partitions: if you receive different batches of ids separately, you can redesign your table so that it is partitioned by id, and then easily drop the partitions for the ids you want to get rid of.
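The partition workaround can be modelled with a small in-memory sketch. The class below is a hypothetical stand-in for a partitioned Hive table, where each batch id owns its own partition; in Hive itself the drop step would be a statement along the lines of `ALTER TABLE events DROP IF EXISTS PARTITION (batch_id=2)`, with `events` and `batch_id` being illustrative names.

```python
class PartitionedTable:
    """Toy model of a table partitioned by batch id (not Hive code)."""

    def __init__(self):
        self._partitions = {}  # batch id -> list of rows

    def insert(self, batch_id, rows):
        """Append rows to the partition for this batch (like INSERT INTO)."""
        self._partitions.setdefault(batch_id, []).extend(rows)

    def drop_partition(self, batch_id):
        """Remove a whole batch at once; no row-level DELETE is needed."""
        self._partitions.pop(batch_id, None)

    def scan(self):
        """Return all remaining rows, partition by partition."""
        return [row
                for batch_id in sorted(self._partitions)
                for row in self._partitions[batch_id]]


if __name__ == "__main__":
    table = PartitionedTable()
    table.insert(1, ["a", "b"])
    table.insert(2, ["c"])
    table.drop_partition(2)
    print(table.scan())
```

The design trades row-level deletes for coarse, cheap removal of an entire batch.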
HBase
 HBase is a column-oriented database management system that runs on top of HDFS
 It is modelled after Google’s Bigtable technology and was created to host very large tables with billions of rows and millions of columns.
 An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database
 HBase provides random, real-time access to your Big Data.
 It does not support a structured query language like SQL
 It is referred to as a NoSQL technology (NoSQL means "Not Only SQL"), as HBase is not intended to replace your traditional RDBMS
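A rough way to picture this data model is a map from row key to a set of named columns, kept sorted by row key. The class below is a hypothetical in-memory sketch: it supports random reads by key and range scans, but it leaves out cell versions, timestamps, regions and everything else the real system does, and it does not use the HBase client API.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy sorted key-value table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted order
        self._rows = {}  # row key -> column map

    def put(self, row_key, column, value):
        """Write one cell; rows can have any number of columns."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch a single row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Range scan over row keys in [start, stop)."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    table = TinyTable()
    table.put("user2", "info:name", "Bob")
    table.put("user1", "info:name", "Ann")
    print(table.get("user1"))
    print(table.scan("user1", "user3"))
```

Access goes through row keys and scans rather than SQL queries, which is why row key design matters so much in this kind of store.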
HCatalog
1. HCatalog is a table and storage management layer
for Hadoop that enables users with different data
processing tools – Apache Pig, Apache MapReduce,
and Apache Hive – to more easily read and write data
on the grid
2. Frees the user from having to know where the data is
stored, with the table abstraction
3. Enables notifications of data availability
4. Provides visibility for data cleaning and archiving
tools
Knox
A system that provides a single point of
authentication and access for Apache Hadoop
services in a cluster. The goal of the project is
to simplify Hadoop security for users who
access the cluster data and execute jobs, and
for operators who control access and manage
the cluster.
ZooKeeper
Apache ZooKeeper provides operational services for a Hadoop cluster, including high availability, a naming service, notifications, and message queuing.
Sqoop
Sqoop provides a way to import and export data between relational database tables (for example, SQL Server) and HDFS.
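The import direction can be pictured as "read every row of a relational table and write it out as delimited text". The sketch below does exactly that with SQLite and an in-memory string standing in for the relational source and the HDFS file; it is not Sqoop, and the function name, table name and sample rows are assumptions for illustration.

```python
import csv
import io
import sqlite3


def export_table_as_text(conn, table):
    """Dump all rows of a table as comma-delimited lines.

    The table name is interpolated into the query, so this helper is
    only suitable for trusted, hard-coded names.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY 1')
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
text = export_table_as_text(conn, "customers")

if __name__ == "__main__":
    print(text)
```

A real transfer tool additionally splits the table across parallel tasks and writes the pieces directly into the cluster's file system.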
Nine Hadoop SQL databases
 Apache Hive
 Impala
 Presto (Facebook)
 Shark
 Apache Drill
 EMC/Pivotal HAWQ
 BigSQL by IBM
 Apache Phoenix (for HBase)
 Apache Tajo
Three popular open source Hadoop-based
SQL databases
1. Impala (Cloudera)
2. Stinger (Hortonworks), also known as Hive 0.11, Hive 0.12, Hive 0.13 or Hive-on-Tez
3. Presto (Facebook)
Impala
1. Developed by Cloudera in 2012
2. SQL query engine that runs natively in Apache Hadoop
3. Queries data using SELECT, JOIN, and aggregate functions in real time
4. Accesses HDFS directly and uses MPP computation instead of MapReduce, which provides near-real-time data access
5. Processing happens in memory, which avoids much of the disk I/O latency that MapReduce jobs incur.
MPP vs MapReduce
Both are distributed data processing systems, but they differ as follows:
MPP | MapReduce
Used on expensive, specialized hardware tuned for CPU, storage and network performance | Deployed to clusters of commodity servers that in turn use commodity disks
Faster | Slower
In-memory computation | Disk I/O computation
Queried with SQL | Programmed in Java code
Declarative query | Imperative code
SQL is easier and more productive | More difficult for IT professionals
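The declarative versus imperative contrast in the comparison above can be shown on a tiny aggregation. In this sketch SQLite stands in for an MPP SQL engine and a hand-written Python loop stands in for MapReduce-style Java code; both substitutions, and the sample sales data, are assumptions made only to keep the example self-contained.

```python
import sqlite3
from collections import defaultdict

sales = [("north", 10), ("south", 5), ("north", 7)]


def totals_declarative(rows):
    """State WHAT is wanted; the engine decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))
    conn.close()
    return result


def totals_imperative(rows):
    """Spell out HOW to compute it, step by step."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    print(totals_declarative(sales))
    print(totals_imperative(sales))
```

Both functions return the same totals; the difference is who is responsible for the execution plan, the engine or the programmer.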
Stinger
1. Refers to new versions of Hive (versions
0.11 - 0.13) to overcome the performance
barrier of MapReduce computation
2. More SQL compliance for Hive SQL
 http://hortonworks.com/labs/stinger/
Stinger’s Hive SQL new features
Presto
1. In response to Cloudera Impala, Facebook began developing Presto in 2012 and open-sourced it in 2013
2. Presto is similar in approach to Impala: it is designed to provide an interactive experience while still using your existing datasets stored in Hadoop.
It provides:
 JDBC Drivers
 ANSI-SQL syntax support (presumably ANSI-92)
 A set of ‘connectors’ used to read data from existing data sources. Connectors
include: HDFS, Hive, and Cassandra.
 Interop with the Hive metastore for schema sharing
How do Hive, Impala and Presto work?
Comparison of Hive, Impala, Presto and Stinger
Feature | Hive | Impala | Presto | Stinger
Year | 2007 | 2012 | Developing | Developing
Original developer | Facebook | Cloudera | Facebook | Hortonworks
Main purpose | Data warehouse | Enable analysts and data scientists to directly interact with any data stored in Hadoop. Offload self-service business intelligence to Hadoop. | RDBMS | RDBMS
Computation approach | MapReduce | Massively parallel processing (MPP) architecture | MPP | MPP
Performance | Low | Fast | Fast | Fast
Latency | High | Low | Low | Low
Language | SQL-like script | ANSI-92 SQL support with user-defined functions (UDFs) | SQL including RANK, LEAD, LAG | SQL-like script
Interfaces | CLI, Web, ODBC, JDBC | ODBC, JDBC, impala-shell, web | JDBC | JDBC
High availability | Hadoop 2.0/CDH4 has HA at the HDFS level | Yes | Hadoop 2.0/CDH4 has HA at the HDFS level | Hadoop 2.0/CDH4 has HA at the HDFS level
Replication | Yes | Supported between two CDH 5 clusters | Unknown | Unknown
Hive pros and cons
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
Advantages
• It has been around for 5 years, so it can be considered a mature and proven solution.
• Runs on the proven MapReduce framework
• Good support for user-defined functions
• It can be mapped to HBase and other systems easily
Disadvantages
• Because it uses MapReduce, it carries all of MapReduce's drawbacks, such as an expensive shuffle phase and heavy I/O
• Limited use of multiple reducers (for example, a global ORDER BY runs through a single reducer) makes queries such as GROUP BY and ORDER BY a lot slower
• A lot slower than its competitors.
Impala pros and cons
Advantages
• Lightning speed, with the promise of near-real-time ad hoc query processing.
• Computation happens in memory, which greatly reduces latency and disk I/O
• Latest version supports UDFs
• Open source, Apache licensed
Disadvantages
• No fault tolerance for running queries. If a query fails on a node, it has to be reissued; it can't resume from where it failed.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
Presto pros and cons
Advantages
• Lightning fast, with the promise of near-real-time interactive querying.
• Used extensively at Facebook, so it is proven and stable.
• Open source, with strong momentum behind it ever since it was open-sourced.
• It also uses a distributed query processing engine, so it avoids the latency and disk I/O issues of traditional MapReduce.
• Well documented. Perhaps the first open source software from Facebook to have a dedicated website from day 1.
Disadvantages
• It is very new. It needs a wait-and-watch approach, since active development is still going on.
• As of now it supports only Hive managed tables. Though the website claims one can also query HBase, the feature is still under development.
• No UDF support yet. This is the most requested feature.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
Performance comparison
Performance test by Justin Erickson, Marcel Kornacker, and Dileep Kumar, May 29, 2014
[Figures: benchmark charts spread over four slides]
Comments on Impala
Among Impala, Hive and Presto, Impala appears to be the most mature SQL-on-Hadoop engine
Impala appears to be the winner in terms of performance and maturity
Hadoop DW/BI Solutions
Combining Hadoop and SQL Server tools
 Both Hadoop and SQL Server have strengths and weaknesses
 Combining Hadoop and SQL Server tools lets the strengths of each technology offset the weaknesses of the other
SQL Server vs SQL on Hadoop
SQL Server | SQL on Hadoop
Enforces data quality and consistency better (unique indexes, keys and foreign keys) | Lacks data quality enforcement
Has a scalability limit | Better for scaling and processing massive data
Deployment options
 Hadoop on premises
 Hadoop in the cloud
1. Infrastructure as a Service (IaaS): providers offer computers, physical or (more often) virtual machines
2. Platform as a Service (PaaS): providers offer a platform including operating system, programming language execution environment, database, and web server.
3. Software as a Service (SaaS): providers offer access to application software and databases
Deployment options scorecard
Why move Hadoop to cloud?
Save time and money
Scalability
Microsoft BI combined with Hadoop
Move Microsoft BI to cloud
Use the right ETL tools
 SSIS: use when the organisation already has the skills, the data needs transformation, and performance tuning is important
 Pig: use when the data set is very large, you want to take advantage of Hadoop's scalability, and IT staff are comfortable learning a new language
 Sqoop: use when the data needs little transformation, ease of use matters, IT staff aren't comfortable with SSIS or Pig, and SQL tables are loaded directly into Hadoop.
SQL Server Parallel Data Warehouse: a high-performance but expensive solution
 SQL Server Parallel Data Warehouse (PDW) is the MPP edition of SQL Server.
 Unlike the Standard, Enterprise or Datacenter editions, PDW is a hardware and software bundle rather than just a piece of software. Microsoft calls it a database "appliance".
 It isn't a substitute for SSIS, SSAS and SSRS. It's Microsoft's answer for customers who need to process tens or hundreds of terabytes and want to scale out large workloads across multiple servers, large storage arrays and many processors.
 It includes:
◦ Microsoft PolyBase
◦ Microsoft Analytics Platform System (APS)
◦ Runs on top of Hadoop
SQL Server Parallel Data Warehouse (cont’d)
References
 Microsoft Big Data Solutions, Wiley, February 2014
 Microsoft SQL Server 2012 with Hadoop, Debarchan Sarkar, Packt Publishing Ltd, 2013
 Cloudera.com
 Hortonworks.com
 Hadoop.apache.org
 Microsoft.com/bigdata
 Impala.io
 Prestodb.io
 Hive.apache.org
Q & A

Contenu connexe

Tendances

Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
Cloudera, Inc.
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
joshwills
 

Tendances (20)

Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Big data hadoop rdbms
Big data hadoop rdbmsBig data hadoop rdbms
Big data hadoop rdbms
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hw09 Welcome To Hadoop World
Hw09   Welcome To Hadoop WorldHw09   Welcome To Hadoop World
Hw09 Welcome To Hadoop World
 
Apache Hadoop at 10
Apache Hadoop at 10Apache Hadoop at 10
Apache Hadoop at 10
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 

En vedette

En vedette (8)

Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stackBig Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
Big Data on the Microsoft Platform - With Hadoop, MS BI and the SQL Server stack
 
How to get started with R programming
How to get started with R programmingHow to get started with R programming
How to get started with R programming
 
Ebd 1° trimestre 2017 lição 9 fidelidade, firmes na fé.
Ebd 1° trimestre 2017 lição 9  fidelidade, firmes na fé.Ebd 1° trimestre 2017 lição 9  fidelidade, firmes na fé.
Ebd 1° trimestre 2017 lição 9 fidelidade, firmes na fé.
 
[系列活動] 使用 R 語言建立自己的演算法交易事業
[系列活動] 使用 R 語言建立自己的演算法交易事業[系列活動] 使用 R 語言建立自己的演算法交易事業
[系列活動] 使用 R 語言建立自己的演算法交易事業
 
Macy's Marketing Challenge 2017
Macy's Marketing Challenge 2017Macy's Marketing Challenge 2017
Macy's Marketing Challenge 2017
 
Working With Big Data
Working With Big DataWorking With Big Data
Working With Big Data
 
3 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 20173 Things Every Sales Team Needs to Be Thinking About in 2017
3 Things Every Sales Team Needs to Be Thinking About in 2017
 
How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 

Similaire à Overview of Big data, Hadoop and Microsoft BI - version1

Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
Thanh Nguyen
 

Similaire à Overview of Big data, Hadoop and Microsoft BI - version1 (20)

Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Big data
Big dataBig data
Big data
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
HDFS
HDFSHDFS
HDFS
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Bigdata ppt
Bigdata pptBigdata ppt
Bigdata ppt
 
Bigdata
BigdataBigdata
Bigdata
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 

Dernier

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 

Dernier (20)

Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 

Overview of Big data, Hadoop and Microsoft BI - version1

  • 1. An Overview of Big data & Hadoop Prepared & presented by Tony Nguyen July 2014
  • 2. Presentation outline  This presentation gives Big data concepts and an overview of different Big Data technologies  Understand different tools and use the right tools for DW and ETL  How does current BI/DW fit to the Big Data context?  How do Microsoft BI and Hadoop get married?
  • 3. What is big data?  Refers to any collection of data sets so large and complex i.e. hundreds of Petabytes
  • 4. Why is Big Data concerned? • 2 billion internet users in the world today, • 7.3 billion active cell phones in 2014 • 7TB of data is processed by Twitter everyday • 500TB of data is processed by Facebook everyday • With massive quantity of data, businesses need fast, reliable, deeper data insight
  • 6. What is Hadoop?  refers an ecosystem which includes large scale distributed filesystem in order to store and process big data across multiple storage servers.  Hadoop technologies include MapReduce & Hadoop Distributed Filesytem (HDFS)
  • 7. Who are the major Hadoop vendors?  IBM InfoSphere BigInsights : IBM packs Hadoop with its products including Text analytics, Social Data Analytics Accelerator, Big SQL, Big R  Clourera: pack Hadoop core components with its well- known analytic SQL product named Impala and provides enterprise support. Current Clourera Hadoop versions includes CDH4.7 and CDH5.1  Hortonworks: a company is formed by Yahoo and Benchmark Capital, Hortonworks makes Hadoop ready for enterprise with the latest version of HDP 2.1  Microsoft: contributes HDInsight as Hadoop on Windows platform
  • 8. HDFS  The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework.  It is designed to run across low-cost commodity hardware
  • 9. MapReduce  MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.  From Hadoop version 2.1, Yet Another MapReduce (YARN) was introduced.
  • 10.
  • 11. Core components on the top of Hadoop 1. Hive (Facebook) 2. Pig (Yahoo) 3. Hbase 4. HCatalog 5. Knox 6. ZooKeeper 7. Sqoop
  • 12. Pig 1. Originally developed by Yahoo 2. Best used for large data set ETL 3. Dataflow scripting language called PigLatin, a High-level language designed to remove the complexities of coding MapReduce applications. 4. Pig converts its operators into MapReduce code. 5. Instead of needing Java programming skills and an understanding of the MapReduce coding infrastructure, people with little programming skills, can simply invoke SORT or FILTER operators without having to code a MapReduce application to accomplish those tasks.
  • 13. Hive  Originally developed by facebook in 2007  Hive is a data warehouse built on the top of Hadoop file system (HDFS) and allowing developers use SQL-like scripts (called Hive SQL or HQL) to create databases & tables.  Hive translates the SQL-like scripts into the MapReduce algorithm to store and process large data sets.  The short learning curve as BI developers use familiar SQL-like scripts
  • 14. Hive (Cont’d)  Updating or deleting an individual record with UPDATE or DELETE isn't allowed in Hive, but INSERT INTO is.  One way to work around this limitation is to use partitions: if you receive different batches of ids separately, you can redesign your table so that it is partitioned by id, and then simply drop the partitions for the ids you want to get rid of.
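The partition workaround described on the slide above can be sketched as a small Python helper that only builds HiveQL statement strings; it does not connect to Hive. The table name `events`, the column list and the helper names are hypothetical, and the partition column is assumed to be an integer id.

```python
def create_table_ddl(table, columns, partition_column):
    """Build a CREATE TABLE statement partitioned by an integer id column."""
    cols = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    return f"CREATE TABLE {table} ({cols}) PARTITIONED BY ({partition_column} INT)"


def drop_partition_ddl(table, partition_column, ids):
    """Build one ALTER TABLE ... DROP PARTITION statement per id to discard."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_column}={int(i)})"
        for i in ids
    ]


if __name__ == "__main__":
    # Hypothetical table: event rows arrive in batches, one batch per id.
    print(create_table_ddl("events", [("payload", "STRING")], "id"))
    for statement in drop_partition_ddl("events", "id", [7, 12]):
        print(statement)
```

Dropping a partition removes a whole batch at once, which is why the approach only fits data that arrives and expires in batches rather than record by record.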
  • 15. HBase  HBase is a column-oriented database management system that runs on top of HDFS  It is modelled after Google’s BigTable technology and was created for hosting very large tables with billions of rows and millions of columns.  An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database  HBase provides random, real-time access to your Big Data.  It does not support a structured query language like SQL  It is referred to as a NoSQL technology (NoSQL means Not Only SQL), as HBase is not intended to replace your traditional RDBMS
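As a rough illustration of the data model on the HBase slide, the toy class below stores each row as a map from a `family:qualifier` column name to a value and looks rows up by row key. It is an in-memory sketch only, not the HBase client API; the class name `ToyColumnTable` and its methods are assumptions made for the example.

```python
class ToyColumnTable:
    """In-memory imitation of a table addressed by row key and column name."""

    def __init__(self):
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; rows may hold different sets of columns."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Random access: return all cells of one row, or {} if absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in row-key order."""
        return [
            (key, dict(self._rows[key]))
            for key in sorted(self._rows)
            if start <= key < stop
        ]


if __name__ == "__main__":
    table = ToyColumnTable()
    table.put("user1", "info:name", "Ann")
    table.put("user1", "stats:visits", 3)
    table.put("user2", "info:name", "Bob")
    print(table.get("user1"))
    print(table.scan("user1", "user3"))
```

There is no query language in the sketch: data is reached by row key or by a key range, which mirrors the slide's point that HBase is not queried with SQL.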
  • 16. HCatalog 1. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid 2. Frees the user from having to know where the data is stored, with the table abstraction 3. Enables notifications of data availability 4. Provides visibility for data cleaning and archiving tools
  • 17. Knox A system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
  • 18. ZooKeeper Apache ZooKeeper provides operational services for a Hadoop cluster, including high availability, a naming service, notifications and message queues.
  • 19. Sqoop Sqoop provides a way to import and export data between relational database tables (for example, in SQL Server) and HDFS.
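A Sqoop import is normally started from the command line. The sketch below only assembles such a command as an argument list in Python so its parts are easy to see; it does not run Sqoop. The host `dbhost`, database `Sales`, user `etl_user`, table `Orders` and target directory are all hypothetical values, and the helper name is an assumption.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=1):
    """Assemble the argument list for a basic `sqoop import` of one table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,         # JDBC connection string of the source database
        "--username", username,
        "--table", table,              # relational table to copy
        "--target-dir", target_dir,    # HDFS directory that receives the files
        "--num-mappers", str(mappers), # number of parallel map tasks
    ]


if __name__ == "__main__":
    args = sqoop_import_args(
        "jdbc:sqlserver://dbhost:1433;databaseName=Sales",
        "etl_user",
        "Orders",
        "/data/orders",
        mappers=4,
    )
    # shlex.join quotes the JDBC URL so the line can be pasted into a shell.
    print(shlex.join(args))
```

In practice the command also needs credentials (for example a password option) and the SQL Server JDBC driver on the Sqoop classpath; those details are left out here.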
  • 20. Nine Hadoop SQL databases  Apache Hive  Impala  Presto (Facebook)  Shark  Apache Drill  EMC/Pivotal HAWQ  BigSQL by IBM  Apache Phoenix (for HBase)  Apache Tajo
  • 21. Three popular open source Hadoop-based SQL databases 1. Impala (Cloudera) 2. Stinger (Hortonworks), also known as Hive 0.11, Hive 0.12, Hive 0.13 or Hive-on-Tez 3. Presto (Facebook)
  • 22. Impala 1. Developed by Cloudera in 2012 2. A SQL query engine that runs natively in Apache Hadoop 3. Queries data with SELECT, JOIN and aggregate functions in real time 4. Accesses HDFS directly and uses MPP computation instead of MapReduce, and therefore provides nearly real-time data access 5. The entire process happens in memory, which eliminates the disk I/O latency that occurs extensively during a MapReduce job.
  • 23. MPP vs MapReduce Both are distributed data processing systems, but they differ as follows:
      MPP | MapReduce
      Runs on expensive, specialized hardware tuned for CPU, storage and network performance | Deployed to clusters of commodity servers that in turn use commodity disks
      Faster | Slower
      In-memory computation | Disk I/O computation
      Queried with SQL | Programmed in Java code
      Declarative query | Imperative code
      SQL is easier and more productive | More difficult for IT professionals
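The "declarative query" versus "imperative code" row can be illustrated with a small Python sketch that computes the same per-region totals in two ways. The in-memory `sqlite3` database is only a stand-in for a SQL engine (it is neither MPP nor Hadoop), and the `SALES` sample rows and function names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, amount) pairs.
SALES = [("north", 10), ("south", 5), ("north", 7)]


def totals_declarative(rows):
    """Declarative: state WHAT is wanted in SQL; the engine decides how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    conn.close()
    return result


def totals_imperative(rows):
    """Imperative: spell out HOW to group and add, step by step."""
    totals = defaultdict(int)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


if __name__ == "__main__":
    print(totals_declarative(SALES))
    print(totals_imperative(SALES))
```

Both functions return the same totals; the difference is that the SQL version leaves the execution plan to the engine, while the loop fixes it in code, much as a hand-written MapReduce job does.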
  • 24. Stinger 1. Refers to the new versions of Hive (0.11 to 0.13) that aim to overcome the performance barrier of MapReduce computation 2. Better SQL compliance for Hive SQL  http://hortonworks.com/labs/stinger/
  • 25. Stinger’s Hive SQL new features
  • 26. Presto 1. In response to Cloudera Impala, Facebook introduced Presto in 2012 2. Presto is similar in approach to Impala in that it is designed to provide an interactive experience while still using your existing datasets stored in Hadoop. It provides:  JDBC drivers  ANSI SQL syntax support (presumably ANSI-92)  A set of "connectors" used to read data from existing data sources. Connectors include HDFS, Hive and Cassandra.  Interoperability with the Hive metastore for schema sharing
  • 27. How Hive, Impala and Presto work?
  • 28. Comparison of Hive, Impala, Presto and Stinger
      (columns: Hive | Impala | Presto | Stinger)
      Year: 2007 | 2012 | Developing | Developing
      Original developer: Facebook | Cloudera | Facebook | Hortonworks
      Main purpose: Data warehouse | Enable analysts and data scientists to directly interact with any data stored in Hadoop; offload self-service business intelligence to Hadoop | RDBMS | RDBMS
      Computation approach: MapReduce | Massively parallel processing (MPP) architecture | MPP | MPP
      Performance: Low | Fast | Fast | Fast
      Latency: High | Low | Low | Low
      Language: SQL-like script | ANSI-92 SQL support with user-defined functions (UDFs) | SQL including RANK, LEAD, LAG | SQL-like script
      Interfaces: CLI, Web, ODBC, JDBC | ODBC, JDBC, impala-shell, Web | JDBC | JDBC
      High availability: Hadoop 2.0/CDH4 has HA at the HDFS level | Yes | Hadoop 2.0/CDH4 has HA at the HDFS level | Hadoop 2.0/CDH4 has HA at the HDFS level
      Replication: Yes | Supported between two CDH 5 clusters | Unknown | Unknown
  • 29. Hive pros and cons
      Advantages:
       It has been around for 5 years, so it can be considered a mature and proven solution.
       Runs on the proven MapReduce framework
       Good support for user-defined functions
       It can be mapped to HBase and other systems easily
      Disadvantages:
       Because it uses MapReduce, it carries all of MapReduce's drawbacks, such as an expensive shuffle phase and heavy I/O
       Hive still does not support multiple reducers, which makes queries such as GROUP BY and ORDER BY a lot slower
       A lot slower than its competitors
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 30. Impala pros and cons
      Advantages:
       Lightning speed, with the promise of near real-time ad hoc query processing
       The computation happens in memory, which greatly reduces latency and disk I/O
       The latest version supports UDFs
       Open source, Apache licensed
      Disadvantages:
       No fault tolerance for running queries. If a query fails on a node, it has to be reissued; it can't resume from where it failed.
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 31. Presto pros and cons
      Advantages:
       Lightning fast, with the promise of near real-time interactive querying
       Used extensively at Facebook, so it is proven and stable
       Open source, with strong momentum behind it ever since it was open sourced
       It also uses a distributed query processing engine, so it avoids the latency and disk I/O issues of traditional MapReduce
       Well documented. Perhaps this is the first open source software from Facebook that got a dedicated website from day 1.
      Disadvantages:
       It is very new. It is worth waiting and watching, since active development is still going on.
       As of now it supports only Hive managed tables. Although the website claims that HBase can also be queried, that feature is still under development.
       No UDF support yet. This is the most requested feature.
      Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
  • 32. Performance comparison Performance Test by Justin Erickson, Marcel Kornacker, and Dileep Kumar May 29, 2014
  • 33. Performance comparison (cont’d) Performance Test by Justin Erickson, Marcel Kornacker, and Dileep Kumar May 29, 2014
  • 34. Performance comparison (cont’d) Performance Test by Justin Erickson, Marcel Kornacker, and Dileep Kumar May 29, 2014
  • 35. Performance comparison (cont’d) Performance Test by Justin Erickson, Marcel Kornacker, and Dileep Kumar May 29, 2014
  • 36. Comments on Impala  Among Impala, Hive and Presto, Impala seems to be the most mature SQL-on-Hadoop engine  Impala appears to be the winner in terms of performance and maturity
  • 38. Combining Hadoop and SQL Server tools  Both Hadoop and SQL Server have strengths and weaknesses  Combining Hadoop and SQL Server tools lets the strengths of each technology offset the weaknesses of the other
  • 39. SQL Server vs SQL on Hadoop
      SQL Server | SQL on Hadoop
      Enforces data quality and consistency better (unique indexes, keys and foreign keys) | Lacks data quality enforcement
      Has a scalability limit | Better for scaling and processing massive data
  • 40. Deployment options  Hadoop on premises  Hadoop in the cloud 1. Infrastructure as a Service (IaaS): providers offer computers, either physical or (more often) virtual machines 2. Platform as a Service (PaaS): includes the operating system, programming language execution environment, database and web server 3. Software as a Service (SaaS): provides access to application software and databases
  • 42. Why move Hadoop to the cloud?  Save time and money  Scalability
  • 43. Microsoft BI gets married to Hadoop
  • 44. Move Microsoft BI to the cloud
  • 45. Use the right ETL tools  SSIS: use when the skills already exist in the organisation, the data needs transformation, and performance tuning is important  Pig: use when the data set is very large, you want to take advantage of the scalability of Hadoop, and IT staff are comfortable learning a new language  Sqoop: use when there is little need to transform the data, ease of use matters, IT staff aren't comfortable with SSIS or Pig, and SQL tables are loaded directly into Hadoop
  • 46. SQL Server Parallel Data Warehouse: a high-performance and expensive solution  SQL Server Parallel Data Warehouse (PDW) is the MPP edition of SQL Server.  Unlike the Standard, Enterprise or Datacenter editions, PDW is a hardware and software bundle rather than just a piece of software. Microsoft calls it a database "appliance".  It isn't a substitute for SSIS, SSAS and SSRS. It's Microsoft's answer for customers who need to process tens or hundreds of terabytes and want to scale out large workloads across multiple servers, large storage arrays and many processors.  It includes: ◦ Microsoft PolyBase ◦ Microsoft Analytics Platform System (APS) ◦ The ability to run on top of Hadoop
  • 47. SQL Server Parallel Data Warehouse (cont’d)
  • 48. SQL Server Parallel Data Warehouse (cont’d)
  • 49. References  Microsoft Big Data Solutions, Wiley, February 2014  Microsoft SQL 2012 Server with Hadoop, Debarchan Sarkar, published by Packt Publishing Ltd 2013  Cloudera.com  Hortonworks.com  Hadoop.apache.org  Microsoft.com/bigdata  Impala.io  Prestodb.io  Hive.apache.org
  • 50. Q & A