Research on big data

Research on Big Data
- FlexDB: A cloud-scale database engine
based on Hadoop
Jidong Chen (jidong.chen@emc.com)
Manager, Research Scientist, Big Data Lab

EMC Labs China
Sept. 2011

© Copyright 2011 EMC Corporation. All rights reserved.

Grand Opening Announcement

EMC Labs China is formed from EMC Research China and the
Advanced Technology Venture group, which were established in
2007 by the Office of the CTO.


EMC Labs China - Vision and Mission
• Vision
– Become an elite research and advanced technology institute in China
– Become the model for future EMC Labs worldwide
• Mission
– Advanced Technology Research and Development
– University Collaboration
– Industry Standards Office
– IP Portfolio Development
• Labs
– Big Data Lab
– Cloud Infrastructure and System Lab
– Cloud Platform and Applications Lab


Outline

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on
Hadoop
• Summary


The Digital Universe 2009-2020

• 2009: 0.8 zettabytes (ZB)
• 2020: 35.2 zettabytes (ZB)
• Growing by a factor of 44
Source: IDC Digital Universe Study, sponsored by EMC, May 2010


Big Data is Changing the World
Expanding Data Sources
• Science and research
– Gene sequences
– LHC accelerator
– Earth and space exploration
• Enterprise applications
– Email, documents, files
– Application logs
– Transaction records
• Web 2.0 data
– Search log / click stream
– Twitter / Blog / SNS
– Wiki
• Other unstructured data
– Video/Movie
– Graphics
– Digital widgets

Bigger Challenges
• Scale out automatically
– Vs. scale up manually
• More capacity and bigger pool
– E.g., 10 PB in a single file system
• New process capability
– Loading, Analyzing, Moving data
– Intelligence
• Better performance
– Linear vs. exponential
– Faster
• Autonomous
– Less human intervention
– Lower cost


Research Scopes and Topics in Big Data
• Search and Analytics
– Search: Entity Search, Faceted Search, Associative Search
– Analytics: Text Analysis, Activity Modeling and Sequence Analysis,
Real-time Data Analysis for Streaming, Parallel Data Mining
Algorithms
• MPP Databases and Data Services
– Parallel Database: Parallel Query Optimization, Data Partitioning
and Replication, Distributed Transaction
– In-memory Database: Cache, Recovery, Consistency
– Database as a Service: Multi-tenant Data Management, Auto-
Administration
• Hadoop/NoSQL
– Hadoop: Single-node Failure, Performance, Real-time MapReduce
Scheduler and Fault Tolerance
– NoSQL: Key-Value Store, Document Store, Graph Data Store


Project Overview
• Hadoop/NoSQL
– vHadoop - joint project with VMware
• Parallel SAN file system for DISC on virtualized platform
– Online MapReduce for Real-time Data Analytics
• Pipelined task execution, Group task scheduling, Enhanced fault tolerance
• Parallel Data Mining
– FlexDB: Cloud-scale Parallel Database for OLAP
• MapReduce integration into DBMS, Parallel query execution, Cost-based query
optimization
– Cloud-scale Parallel Database for OLTP
• Intelligent database sharding and resharding
• Active-active (eager) replication with group communication service
• Multiple masters with elastic distributed coordination


Cloud Databases
• Two largest components of data management market
– Transactional Data Management
• Banks, airline reservation, online e-commerce
• ACID, write-intensive
– Analytical Data Management
• Business planning, decision support
• Query-intensive

• Challenges of data management in the Cloud
– Scalability
– Fault Tolerance
– Availability & Consistency
– Transaction Management
– Flexible Schemas


Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• Transactional data management in the cloud
– NoSQL Store
– SQL Database
• Cloud data services (Database as a Service)
– Multi-tenant data management
– Auto-administration


Commercial Landscape: Major Players

• Amazon EC2
– IaaS abstraction
– Data management using S3 and SimpleDB
• Microsoft Azure
– PaaS abstraction
– Relational engine (SQL Azure)
• Google AppEngine
– PaaS abstraction
– Data management using Google MegaStore


Data Analytics in the Cloud

• Scalability to large data volumes:
– Scan 100 TB on 1 node @ 50 MB/sec = 23 days
– Scan on 1000-node cluster = 33 minutes
→ Divide-and-conquer (i.e., data partitioning)

• Cost-efficiency:
– Commodity nodes (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
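
The scan-time figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^6 MB) and perfectly even partitioning with no coordination overhead; the function name `scan_seconds` is made up for illustration.

```python
# Back-of-the-envelope scan times, using decimal units (1 TB = 10**6 MB).
# Data size, throughput, and node count are the figures quoted on the slide.

def scan_seconds(data_tb: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Seconds to scan data_tb terabytes split evenly across `nodes`."""
    total_mb = data_tb * 1_000_000
    return total_mb / (mb_per_sec * nodes)

single = scan_seconds(100, 50)          # one node at 50 MB/sec
cluster = scan_seconds(100, 50, 1000)   # 1000 nodes, ideal partitioning

print(f"1 node:     {single / 86_400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Real clusters fall short of this ideal because of skew, stragglers, and network transfer, so the numbers are best read as upper bounds on the benefit of partitioning.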


Solutions for Large-scale Data Analysis

• Parallel DBMS technologies
– Proposed in late eighties
– Matured over the last two decades
– Multi-billion dollar industry: Proprietary DBMS Engines
intended as Data Warehousing solutions for very large
enterprises
• MapReduce
– Pioneered by Google
– Popularized by Yahoo! (Hadoop)


Parallel DBMS technologies
• Popularly used for more than two decades
– Research Projects: Gamma, Grace, …
– Commercial: Teradata, Greenplum (acquired by EMC), Netezza
(acquired by IBM), DATAllegro (acquired by Microsoft), Vertica
(acquired by HP), Aster Data (acquired by Teradata)
• Shared-nothing node clusters
• Relational Data Model
• Indexing
• Familiar SQL interface
• Parallel query execution
– Horizontal partitioning of relational tables with partitioned execution of
SQL queries
• Advanced query optimization
• Well understood and studied
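
Horizontal partitioning with partitioned execution can be illustrated with a small sketch. It assumes hash partitioning on a key column and a SUM ... GROUP BY query that each node answers on its own rows before a final merge. The table, column names, and helper functions are hypothetical, and the "nodes" are plain Python lists rather than separate machines.

```python
import zlib
from collections import defaultdict

def partition(rows, key, n_nodes):
    """Hash-partition rows horizontally: each row lands on exactly one node."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode())  # deterministic hash
        nodes[h % n_nodes].append(row)
    return nodes

def local_sum(rows, group_key, value_key):
    """Partial aggregate computed on one node: SUM(value) GROUP BY group."""
    out = defaultdict(int)
    for row in rows:
        out[row[group_key]] += row[value_key]
    return out

def merge(partials):
    """Combine the per-node partial sums into the final answer."""
    out = defaultdict(int)
    for partial in partials:
        for k, v in partial.items():
            out[k] += v
    return dict(out)

orders = [
    {"order_id": 1, "region": "east", "amount": 10},
    {"order_id": 2, "region": "west", "amount": 25},
    {"order_id": 3, "region": "east", "amount": 30},
    {"order_id": 4, "region": "west", "amount": 35},
]
nodes = partition(orders, "order_id", 4)
result = merge(local_sum(n, "region", "amount") for n in nodes)
print(result)
```

Because SUM is associative, the merged result does not depend on how the rows were distributed, which is what lets each node work independently.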


Greenplum: A Shared-nothing Parallel DBMS
• Greenplum’s MPP Database has extreme scalability
– Optimized for BI and analytics
– Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
• Provides automatic parallelization
– No need for manual partitioning or tuning
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
– All nodes can scan and process in parallel
– No I/O contention between segments
• Linear scalability by adding nodes
– Each adds storage, query performance and loading performance
(Figure labels: Interconnect, Loading)


Greenplum Database Architecture
• MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Interfaces: SQL, MapReduce
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.


Example of Parallel Query Optimization

Query:
    select
        c_custkey, c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal, n_name, c_address, c_phone, c_comment
    from
        customer, orders, lineitem, nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1994-08-01'
        and o_orderdate < date '1994-08-01' + interval '3 month'
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey, c_name, c_acctbal,
        c_phone, n_name, c_address, c_comment
    order by
        revenue desc

Parallel plan (root to leaves):
    Gather Motion 4:1 (slice 3)
      Sort
        HashAggregate
          HashJoin
            Redistribute Motion 4:4 (slice 1)
              HashJoin
                Seq Scan on lineitem
                Hash
                  Seq Scan on orders
            Hash
              HashJoin
                Seq Scan on customer
                Hash
                  Broadcast Motion 4:4 (slice 2)
                    Seq Scan on nation


MapReduce

• Overview
– large-scale, massively parallel data access platform
– Simple data-parallel programming model to express relatively
sophisticated distributed programs
– An associated parallel and distributed implementation for commodity
clusters
• Pioneered by Google
– Processes 20 PB of data per day
• Popularized by open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, and the list is growing …


Programming Framework

Raw Input: <key, value>

MAP

<K1, V1> <K2,V2> <K3,V3>

REDUCE


MapReduce Example: WordCount

Map(K, V) {
    For each word w in V
        Collect(w, 1);
}

Combine(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}

Reduce(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}

Data flow: input (Cat, Bat, Dog, other words; size: TByte) → split → map → combine → reduce → part0, part1, part2
Example output: Cat 3, Bat 4, Dog 3, …
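
The WordCount pseudocode can be tried out in a single process. The following is a minimal sketch, not Hadoop code: it runs map, combine, and reduce sequentially over in-memory strings, and the helper names (`map_fn`, `combine_fn`, `reduce_fn`, `group`, `word_count`) and the sample splits are invented for illustration.

```python
from collections import defaultdict

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input split."""
    for word in text.split():
        yield word, 1

def combine_fn(key, values):
    """Combine: pre-aggregate counts on the mapper side."""
    yield key, sum(values)

# For WordCount, reduce uses the same summing logic as combine.
reduce_fn = combine_fn

def group(pairs):
    """Group values by key (stands in for the shuffle/sort step)."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def word_count(splits):
    combined = []
    for i, split in enumerate(splits):          # one map task per split
        for k, vs in group(map_fn(i, split)).items():
            combined.extend(combine_fn(k, vs))  # local combine per split
    result = {}
    for k, vs in group(combined).items():       # one reduce call per key
        for key, total in reduce_fn(k, vs):
            result[key] = total
    return result

splits = ["Cat Bat Dog", "Bat Cat Bat", "Dog Cat Bat Dog"]
counts = word_count(splits)
print(counts)
```

The combiner only reduces the volume of intermediate data; dropping it leaves the final counts unchanged because addition is associative and commutative.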

MapReduce Implementation in Hadoop
• The client submits a job to the master
• The master assigns map tasks and reduce tasks
• Map phase: mappers read the input files (split0 … split4) and write intermediate files to local disk
• Reduce phase: reducers read the intermediate files remotely and write the output files (file0, file1)


MapReduce Advantages
• Automatic Parallelization:
– Depending on the size of the raw input data → instantiate
multiple MAP tasks
– Similarly, depending on the number of intermediate <key,
value> partitions → instantiate multiple REDUCE tasks
• Run-time:
– Data partitioning
– Task scheduling
– Handling machine failures
– Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
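
How the task counts follow from the data can be sketched as below. This assumes one map task per fixed-size input split and a hash partitioner that routes each intermediate key to one reduce task. The 64 MB split size, the function names, and the use of CRC32 are illustrative assumptions, not details taken from the slides or from Hadoop's actual implementation.

```python
import math
import zlib

def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """One map task per input split (assumed fixed split size)."""
    return max(1, math.ceil(input_bytes / split_bytes))

def reduce_partition(key: str, num_reducers: int) -> int:
    """Route an intermediate key to a reduce task by hashing it."""
    return zlib.crc32(key.encode()) % num_reducers

MB = 1024 * 1024
maps = num_map_tasks(1000 * MB, 64 * MB)  # assumed 64 MB splits
print("map tasks:", maps)

for word in ["Cat", "Bat", "Dog"]:
    print(word, "-> reducer", reduce_partition(word, 4))
```

Every occurrence of a key hashes to the same partition, so a single reduce task sees all values for that key.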


Possible Applications
• Special-purpose programs to process large amounts
of data: crawled documents, Web query logs, etc.
– ETL and “read once” data sets
– Complex analytics
– Semi-structured data, key-value pairs
• At Google and others (Yahoo!, Facebook):
– Inverted index
– Graph structure of the WEB documents
– Summaries of #pages/host, set of frequent queries, etc.
– Ad Optimization
– Spam filtering


MapReduce vs. Parallel DBMS

                                                | Parallel DBMS             | MapReduce
Schema Support                                  | ✓                         | Not out of the box
Indexing                                        | ✓                         | Not out of the box
Programming Model                               | Declarative (SQL)         | Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | ✓                         | Not out of the box
Flexibility                                     | Not out of the box        | ✓
Fault Tolerance                                 | Coarse grained techniques | ✓


Further Analysis and Comparison
• Limitations of some current parallel databases / data warehouses
– Often use expensive/specialized hardware
– Difficult to scale to more than 100 nodes
– Difficult to parallelize data mining applications
• MPI …
– Difficult to deal with unstructured data
– Fault tolerance
• One node fails, restart whole query
– Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
– A sub-optimal brute force implementation: No indexing, No JOINs
• E.g., find the employees whose salary is $10,000
– Row based storage, Updates?
– Not SQL/BI tool compatible
– No support for schema
– Non-declarative programming model
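
The cost of having no index shows up even in the salary lookup mentioned above. The sketch below contrasts a brute-force scan with a lookup through a simple in-memory hash index. The employee rows and function names are made up for illustration, and a dictionary stands in for a real database index.

```python
from collections import defaultdict

# Hypothetical (name, salary) rows.
employees = [("ann", 10_000), ("bob", 12_000), ("cho", 10_000), ("dee", 9_000)]

def full_scan(rows, salary):
    """Brute force: inspect every row, as a scan-only engine must."""
    return [name for name, s in rows if s == salary]

def build_index(rows):
    """Build a hash index on salary once, ahead of query time."""
    index = defaultdict(list)
    for name, s in rows:
        index[s].append(name)
    return index

def index_lookup(index, salary):
    """Indexed access: jump straight to the matching rows."""
    return index.get(salary, [])

salary_index = build_index(employees)
print(full_scan(employees, 10_000))
print(index_lookup(salary_index, 10_000))
```

Both calls return the same rows; the difference is that the scan touches every row on every query, while the index pays its cost once at build time.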


MapReduce Integration in DBMS Context

• FlexDB - A Cloud-scale Parallel Database Engine based on
Hadoop MapReduce (A Research Project)
– An architectural hybrid of MapReduce and DBMS
technologies
– Use the fault tolerance and scalability of the MapReduce
framework
– Leverage advanced data processing techniques (e.g.,
Query Optimization) of an RDBMS for high performance
– Expose a declarative interface to the user
• Goal: Leverage the best of both worlds


FlexDB Architecture
• FlexDB Master: Query Parser, Query Optimizer, Job Generator, Job Executor, Catalog Manager
• Example query: SELECT * FROM Account WHERE balance > 30
• The Job Executor submits jobs to the MapReduce framework (Mappers and Reducers)
• Each job carries a subquery (SELECT * FROM Account WHERE balance > 30) to a Database on a worker node
• The Account table (rows r0 n0 m0 … r9 n9 m9) is partitioned across the Databases: {r0, r1}, {r2, r3}, {r4, r5}, {r6, r7}, {r8, r9}
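
The idea of pushing a subquery down to per-node databases and merging the results can be sketched as follows. This is an illustration under stated assumptions, not FlexDB's implementation: in-memory SQLite databases stand in for the Postgres/Greenplum instances, a plain Python loop stands in for Hadoop map and reduce tasks, and the table columns, sample rows, and function names are hypothetical.

```python
import sqlite3

# The subquery each map task runs against its local database.
SUBQUERY = "SELECT * FROM Account WHERE balance > 30"

def make_node(rows):
    """Create one node-local database holding a partition of Account."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return db

def map_task(db):
    """Map: run the subquery locally and emit the matching rows."""
    return db.execute(SUBQUERY).fetchall()

def reduce_task(partials):
    """Reduce: union the per-node results into the final answer."""
    return sorted(row for part in partials for row in part)

# Account rows split across three nodes as (id, balance) pairs.
partitions = [[(0, 10), (1, 45)], [(2, 31), (3, 30)], [(4, 99), (5, 5)]]
nodes = [make_node(p) for p in partitions]
result = reduce_task(map_task(db) for db in nodes)
print(result)
```

For a simple selection like this one, the reduce step only concatenates results; queries with joins or aggregates would need the optimizer to decide which parts run inside the databases and which run in MapReduce.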


Comparison with other systems

                              | FlexDB                      | Hive              | HadoopDB                                | Traditional parallel database
Query Language                | SQL                         | HQL               | SQL (joins not currently supported)     | SQL
Storage                       | Postgres/Greenplum          | HDFS              | JDBC compatible                         | Native OS files
Optimizer                     | Cost based (DB/MR paths)    | Simple rule based | Simple rule based                       | Cost based
Physical storage organization | Column/Row based            | Row based         | Currently Row based                     | Column/Row based
Implementation                | FlexDB Master + Hadoop + DB | Hive + Hadoop     | Hive (rev) + Hadoop + DB                | Native
Efficiency                    | High                        | Low               | Middle                                  | Very High
Scale                         | Large                       | Large             | Large                                   | Middle
Cost                          | Low                         | Low               | Low                                     | High


Summary
• New in cloud computing
– Elasticity/Scalability
– Resource sharing (multi-tenancy)
– Focus on failure
• Data analytics in the cloud: Different solutions suitable for
different workloads
– Parallel DBMSs excel at efficient querying of large data sets
– MR-style systems excel at complex analytics and ETL tasks
• Combine MapReduce with a shared-nothing DBMS to produce a
system that better fits the cloud computing market


Acknowledgements

• Some slides are adapted from the following references:
– Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud
Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
– Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik
Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel
DBMSs: Friends or Foes?”, Communications of the ACM 2010


EMC Labs China
Dr. Bo Tao
Head of EMC Labs China

Blog: http://blog.sina.com.cn/emclabschina
Weibo: http://weibo.com/emclabschina


THANK YOU

