Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop
Jidong Chen (jidong.chen@emc.com)
Manager, Research Scientist, Big Data Lab
EMC Labs China
Sept. 2011




© Copyright 2011 EMC Corporation. All rights reserved.                                               1
Grand Opening Announcement




                      EMC Labs China is formed from EMC Research China and the
                      Advanced Technology Venture group, which were established in
                      2007 by the Office of the CTO.



EMC Labs China - Vision and Mission
• Advanced Technology Research and Development
          – Big Data Lab
          – Cloud Infrastructure and System Lab
          – Cloud Platform and Applications Lab
• University Collaboration
• Industry Standards Office
• IP Portfolio Development
• Vision
          – Become an elite research and advanced technology institute in China
          – Become the model for future EMC Labs worldwide




Outline

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
           – Parallel DBMS
           – MapReduce
• FlexDB - A cloud-scale database engine based on
  Hadoop
• Summary



The Digital Universe 2009-2020



• 2009: 0.8 ZB
• 2020: 35.2 ZB (zettabytes)
• Growing by a factor of 44
Source: IDC Digital Universe Study, sponsored by EMC, May 2010




Big Data is Changing the World
Expanding Data Sources
• Science and research
          – Gene sequences
          – LHC accelerator
          – Earth and space exploration
• Enterprise applications
          – Email, documents, files
          – Application logs
          – Transaction records
• Web 2.0 data
          – Search logs / click streams
          – Twitter / blogs / SNS
          – Wikis
• Other unstructured data
          – Video / movies
          – Graphics
          – Digital widgets

Bigger Challenges
• Scale out automatically
          – Vs. scale up manually
• More capacity and bigger pool
          – E.g., 10 PB in a single file system
• New process capability
          – Loading, analyzing, moving data
          – Intelligence
• Better performance
          – Linear vs. exponential
          – Faster
• Autonomous
          – Less human intervention
          – Lower cost



Research Scopes and Topics in Big Data
• Search and Analytics
          – Search: Entity Search, Faceted Search, Associative Search
          – Analytics: Text Analysis, Activity Modeling and Sequence Analysis,
            Real-time Data Analysis for Streaming, Parallel Data Mining
            Algorithms
• MPP Databases and Data Services
          – Parallel Database: Parallel Query Optimization, Data Partitioning
            and Replication, Distributed Transaction
          – In-memory Database: Cache, Recovery, Consistency
          – Database as a Service: Multi-tenant Data Management, Auto-
            Administration
• Hadoop/NoSQL
          – Hadoop: Single-node Failure, Performance, Real-time MapReduce
            Scheduler and Fault Tolerance
          – NoSQL: Key-Value Store, Document Store, Graph Data Store

Project Overview
• Hadoop/NoSQL
          – vHadoop - joint project with VMware
                    • Parallel SAN file system for DISC on virtualized platform
          – Online MapReduce for Real-time Data Analytics
                    • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
                    • Parallel Data Mining
          – FlexDB: Cloud-scale Parallel Database for OLAP
                    • MapReduce integration into DBMS, Parallel query execution, Cost-based query
                      optimization
          – Cloud-scale Parallel Database for OLTP
                    • Intelligent database sharding and resharding
                    • Active-active (eager) replication with group communication service
                    • Multiple masters with elastic distributed coordination




Cloud Databases
  • Two largest components of data management market
            – Transactional Data Management
                      • Banks, airline reservation, online e-commerce
                      • ACID, write-intensive
            – Analytical Data Management
                      • Business planning, decision support
                      • Query-intensive

  • Challenges of data management in the Cloud
            – Scalability
            – Fault Tolerance
            – Availability & Consistency
            – Transaction Management
            – Flexible Schemas




Cloud Databases
  • Data analytics in the cloud
            – Parallel DBMS
            – MapReduce
  • Transactional data management in the cloud
            – NoSQL Store
            – SQL Database
  • Cloud data services (Database as a Service)
            – Multi-tenant data management
            – Auto-administration




Commercial Landscape: Major Players

  • Amazon EC2
            – IaaS abstraction
            – Data management using S3 and SimpleDB
  • Microsoft Azure
            – PaaS abstraction
            – Relational engine (SQL Azure)
  • Google AppEngine
            – PaaS abstraction
            – Data management using Google MegaStore



Data Analytics in the Cloud

• Scalability to large data volumes:
           – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
           – Scan on 1000-node cluster = 33 minutes
           → Divide and conquer (i.e., data partitioning)

• Cost-efficiency:
           –     Commodity nodes (cheap, but unreliable)
           –     Commodity network
           –     Automatic fault-tolerance (fewer admins)
           –     Easy to use (fewer programmers)
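
The scan-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the data across nodes; the function name `scan_seconds` is illustrative, not from the slides.

```python
# Back-of-the-envelope scan times, assuming decimal units and an even data split.
TB = 10**12
MB = 10**6

def scan_seconds(data_bytes: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Seconds to scan the data when each node reads its share at mb_per_sec."""
    per_node_bytes = data_bytes / nodes
    return per_node_bytes / (mb_per_sec * MB)

single_node_days = scan_seconds(100 * TB, 50) / 86_400
cluster_minutes = scan_seconds(100 * TB, 50, nodes=1000) / 60

print(f"1 node:     {single_node_days:.1f} days")    # about 23 days
print(f"1000 nodes: {cluster_minutes:.1f} minutes")  # about 33 minutes
```

The 1000x speedup only holds if the data is partitioned evenly and the nodes do not contend for a shared resource, which is the motivation for the shared-nothing designs discussed later.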

Solutions for Large-scale Data Analysis

  • Parallel DBMS technologies
            – Proposed in the late 1980s
            – Matured over the last two decades
            – Multi-billion dollar industry: proprietary DBMS engines
              intended as data warehousing solutions for very large
              enterprises
  • MapReduce
            – Pioneered by Google
            – Popularized by Yahoo! (Hadoop)



Parallel DBMS technologies
  • Popularly used for more than two decades
            – Research Projects: Gamma, Grace, …
            – Commercial: Teradata, Greenplum (acquired by EMC), Netezza
              (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica
              (acquired by HP), Aster Data (acquired by Teradata)
  • Shared-nothing node clusters
  • Relational Data Model
  • Indexing
  • Familiar SQL interface
  • Parallel query execution
            – Horizontal partitioning of relational tables with partitioned execution of
              SQL queries
  • Advanced query optimization
  • Well understood and studied
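
Horizontal partitioning with partitioned execution can be illustrated with a minimal sketch: rows are hash-partitioned across nodes, each node computes a partial aggregate over its own rows, and the partial results are merged. The table, column names, and `NUM_NODES` are hypothetical, and in-process lists stand in for real nodes.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def partition(rows, key, num_nodes=NUM_NODES):
    """Hash-partition rows (dicts) across nodes on the given column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def local_sum(rows, group_col, value_col):
    """Per-node partial result for: SELECT group_col, SUM(value_col) GROUP BY group_col."""
    partial = defaultdict(int)
    for row in rows:
        partial[row[group_col]] += row[value_col]
    return partial

def merge(partials):
    """Combine the per-node partial sums into the final answer."""
    total = defaultdict(int)
    for partial in partials:
        for group, value in partial.items():
            total[group] += value
    return dict(total)

orders = [
    {"order_id": 1, "region": "east", "amount": 10},
    {"order_id": 2, "region": "west", "amount": 20},
    {"order_id": 3, "region": "east", "amount": 30},
    {"order_id": 4, "region": "west", "amount": 15},
]
nodes = partition(orders, "order_id")
result = merge(local_sum(node_rows, "region", "amount") for node_rows in nodes)
print(result)
```

Each node only touches its own partition, so the scan and the partial aggregation run in parallel; only the small partial results cross the network.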


Greenplum: A Shared-nothing Parallel DBMS
• Greenplum’s MPP Database has extreme scalability
          – Optimized for BI and analytics
          – Fault-tolerant reliability and optimized performance
            using commodity CPUs, disks and networking
• Provides automatic parallelization
          – No need for manual partitioning or tuning
          – Just load and query like any database
          – Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
          – All nodes can scan and process in parallel
          – No I/O contention between segments
• Linear scalability by adding nodes
          – Each node adds storage, query performance and loading
            performance
[Figure labels: Interconnect, Loading]




Greenplum Database Architecture
• MPP (Massively Parallel Processing), shared-nothing architecture
• Interfaces: SQL, MapReduce
• Master servers: query planning & dispatch
• Network interconnect
• Segment servers: query processing & data storage
• External sources: loading, streaming, etc.




Example of Parallel Query Optimization
Query:

    select
        c_custkey, c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal, n_name, c_address, c_phone, c_comment
    from
        customer, orders, lineitem, nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1994-08-01'
        and o_orderdate < date '1994-08-01' + interval '3 month'
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey, c_name, c_acctbal,
        c_phone, n_name, c_address, c_comment
    order by
        revenue desc

Parallel plan:

    Gather Motion 4:1 (slice 3)
      Sort
        HashAggregate
          HashJoin
            Redistribute Motion 4:4 (slice 1)
              HashJoin
                Seq Scan on lineitem
                Hash
                  Seq Scan on orders
            Hash
              HashJoin
                Seq Scan on customer
                Hash
                  Broadcast Motion 4:4 (slice 2)
                    Seq Scan on nation
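
The Redistribute Motion and Broadcast Motion operators in this plan can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Greenplum code: lists stand in for the four segments, keys are small integers placed by `key % NUM_SEGMENTS`, and the helper names (`redistribute`, `broadcast`, `hash_join`) are hypothetical.

```python
NUM_SEGMENTS = 4  # the "4" in "Motion 4:4"

def redistribute(segments, key):
    """Redistribute motion: re-hash every row on `key` so equal keys share a segment."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for segment in segments:
        for row in segment:
            out[row[key] % NUM_SEGMENTS].append(row)
    return out

def broadcast(segments):
    """Broadcast motion: every segment receives a full copy of the table."""
    all_rows = [row for segment in segments for row in segment]
    return [list(all_rows) for _ in range(NUM_SEGMENTS)]

def hash_join(probe_rows, build_rows, probe_key, build_key):
    """Segment-local hash join: build a hash table on one side, probe with the other."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    return [{**p, **b} for p in probe_rows for b in table.get(p[probe_key], [])]

customers = [{"c_custkey": i, "c_nationkey": i % 3} for i in range(8)]
nations = [{"n_nationkey": i, "n_name": n} for i, n in enumerate(["CHINA", "FRANCE", "PERU"])]
orders = [{"o_orderkey": i, "o_custkey": i % 8} for i in range(16)]

# Initial placement: each table is distributed on its own key.
customer_segs = redistribute([customers], "c_custkey")
nation_segs = redistribute([nations], "n_nationkey")
order_segs = redistribute([orders], "o_orderkey")

# Small table: broadcast nation, then join with customer locally on each segment.
nation_copies = broadcast(nation_segs)
cust_nation = [hash_join(c, n, "c_nationkey", "n_nationkey")
               for c, n in zip(customer_segs, nation_copies)]

# Large table: redistribute orders on o_custkey so they co-locate with customer.
orders_by_cust = redistribute(order_segs, "o_custkey")
joined = [hash_join(o, c, "o_custkey", "c_custkey")
          for o, c in zip(orders_by_cust, cust_nation)]
total_rows = sum(len(segment) for segment in joined)
print(total_rows)
```

Broadcasting suits a small table such as nation, since copying it everywhere is cheap; redistribution suits large inputs, since each row moves at most once.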




MapReduce

  • Overview
            – large-scale, massively parallel data access platform
            – Simple data-parallel programming model to express relatively
              sophisticated distributed programs
            – An associated parallel and distributed implementation for commodity
              clusters
  • Pioneered by Google
            – Processes 20 PB of data per day
  • Popularized by open-source Hadoop project
            – Used by Yahoo!, Facebook, Amazon, and the list is growing …




Programming Framework

Raw Input: <key, value> → MAP → <K1, V1>, <K2, V2>, <K3, V3> → REDUCE


MapReduce Example: WordCount

Map(K, V) {
  For each word w in V
    Collect(w, 1);
}

Combine(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

[Figure: input words (Cat, Bat, Dog, other words; size: TByte) → split → map → combine → reduce → output partitions part0, part1, part2 (e.g., Cat 3, Bat 4, Dog 3, …)]
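
A runnable single-process version of the WordCount pseudocode might look like the sketch below. It only mimics the data flow (split → map → combine → partition → reduce); nothing runs in parallel, and the helper names, the three-reducer default, and the byte-sum partitioner are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the split."""
    for word in text.split():
        yield word, 1

def combine_fn(word, counts):
    """Combine: pre-sum the counts produced by a single mapper."""
    yield word, sum(counts)

def reduce_fn(word, counts):
    """Reduce: sum the (already combined) counts from all mappers."""
    yield word, sum(counts)

def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_wordcount(splits, num_reducers=3):
    # Map and combine each split, then route each word to one reducer partition.
    partitions = [[] for _ in range(num_reducers)]
    for split_id, text in enumerate(splits):
        for word, counts in group_by_key(map_fn(split_id, text)).items():
            for w, c in combine_fn(word, counts):
                # Deterministic stand-in for a hash partitioner.
                partitions[sum(w.encode()) % num_reducers].append((w, c))
    # Each reducer sees every count for the words assigned to it.
    output = {}
    for part in partitions:
        for word, counts in group_by_key(part).items():
            for w, total in reduce_fn(word, counts):
                output[w] = total
    return output

counts = run_wordcount(["Cat Bat Dog", "Bat Cat Bat", "Dog Bat Cat Dog"])
print(counts)
```

The combiner is optional for correctness here because addition is associative and commutative; its purpose is to shrink the intermediate data before it crosses the network.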
MapReduce Implementation in Hadoop
[Figure: A client submits a job to the master, which assigns map and reduce tasks. In the map phase, mappers read the input files (split0 to split4) and write intermediate files to local disk. In the reduce phase, reducers read the intermediate files remotely and write the output files (file0, file1).]

MapReduce Advantages
     • Automatic Parallelization:
                – Depending on the size of the raw input data → instantiate
                  multiple MAP tasks
                – Similarly, depending on the number of intermediate <key,
                  value> partitions → instantiate multiple REDUCE tasks
     • Run-time:
                –     Data partitioning
                –     Task scheduling
                –     Handling machine failures
                –     Managing inter-machine communication
     • Completely transparent to the programmer/analyst/user
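
As a rough illustration of how task counts follow from the data, the sketch below derives the number of map tasks from the input size and the number of reduce tasks from the partition count. The 64 MiB split size is an assumption chosen for the example (the slides do not state one), and both function names are hypothetical.

```python
import math

SPLIT_BYTES = 64 * 1024**2  # assumed split size (64 MiB); not stated in the slides

def num_map_tasks(input_bytes: int, split_bytes: int = SPLIT_BYTES) -> int:
    """One map task per input split; always at least one task."""
    return max(1, math.ceil(input_bytes / split_bytes))

def num_reduce_tasks(num_partitions: int) -> int:
    """One reduce task per intermediate <key, value> partition."""
    return num_partitions

print(num_map_tasks(1024**4))  # map tasks for 1 TiB of input
print(num_reduce_tasks(3))     # reduce tasks for three partitions
```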


Possible Applications
  • Special-purpose programs to process large amounts
    of data: crawled documents, Web query logs, etc.
            – ETL and “read once” data sets
            – Complex analytics
            – Semi-structured data, key-value pairs
  • At Google and others (Yahoo!, Facebook):
            –      Inverted index
            –      Graph structure of the WEB documents
            –      Summaries of #pages/host, set of frequent queries, etc.
            –      Ad Optimization
            –      Spam filtering

Map Reduce vs Parallel DBMS
                                                  Parallel DBMS               MapReduce
Schema support                                    Yes                         Not out of the box
Indexing                                          Yes                         Not out of the box
Programming model                                 Declarative (SQL)           Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (compression, query optimization)   Yes                         Not out of the box
Flexibility                                       Not out of the box          Yes
Fault tolerance                                   Coarse-grained techniques   Yes


Further Analysis and Comparison
• Limitations of some current parallel databases / data warehouses
           – Often use expensive/specialized hardware
           – Difficult to scale to more than 100 nodes
           – Difficult to parallelize data mining applications
                     • MPI …
           – Difficult to deal with unstructured data
           – Fault tolerance
                     • If one node fails, the whole query restarts
           – Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
           – A sub-optimal brute-force implementation: no indexing, no JOINs
                     • E.g., find the employees whose salary is $10,000
           – Row-based storage; updates?
           – Not SQL/BI tool compatible
           – No support for schema
           – Non-declarative programming model
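
The salary lookup mentioned on this slide shows why missing indexes matter. The sketch below contrasts a brute-force scan, which examines every row, with a lookup in a prebuilt index; the employee data is synthetic and a plain dictionary stands in for a database index.

```python
from collections import defaultdict

# Synthetic data: 1,000 employees with salaries from 5,000 to 14,000.
employees = [{"name": f"emp{i}", "salary": 5_000 + 1_000 * (i % 10)} for i in range(1_000)]

def full_scan(rows, salary):
    """Brute force: examine every row, as a system without indexes must."""
    examined = 0
    hits = []
    for row in rows:
        examined += 1
        if row["salary"] == salary:
            hits.append(row)
    return hits, examined

def build_index(rows, column):
    """Build a value -> rows index once, so later lookups skip the scan."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row)
    return index

salary_index = build_index(employees, "salary")

scan_hits, rows_examined = full_scan(employees, 10_000)
index_hits = salary_index.get(10_000, [])

print(len(scan_hits), rows_examined, len(index_hits))
```

Both paths return the same rows, but the scan's cost grows with the table size on every query, whereas the index pays that cost once at build time.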


MapReduce Integration in DBMS Context

  • FlexDB - A Cloud-scale Parallel Database Engine based on
    Hadoop MapReduce (A Research Project)
      – An architectural hybrid of MapReduce and DBMS
        technologies
      – Use the fault tolerance and scalability of the MapReduce
        framework
      – Leverage advanced data processing techniques (e.g.,
        query optimization) of an RDBMS for high performance
      – Expose a declarative interface to the user
  • Goal: Leverage the best of both worlds



FlexDB Architecture




FlexDB Master
• Master components (top to bottom): Query Parser, Query Optimizer, Job Generator, Job Executor; plus a Catalog manager
• Example query: SELECT * FROM Account WHERE balance > 30
• The Job Executor issues multiple jobs to the MapReduce framework (Mapper, Reducer)
• Each task carries the query as a subquery (SELECT * FROM Account WHERE balance > 30) to a Database instance
• The Account table (rows r0 n0 m0, r1 n1 m1, …) is partitioned across the databases, two rows per partition: r0-r1, r2-r3, r4-r5, r6-r7, r8-r9
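
The subquery pushdown shown on this slide can be imitated in a few lines. This is a sketch only, not FlexDB code: in-memory SQLite databases stand in for the per-node Postgres/Greenplum instances, plain function calls stand in for Hadoop map and reduce tasks, and the Account schema (`id`, `balance`) is invented for the example.

```python
import sqlite3

QUERY = "SELECT * FROM Account WHERE balance > 30"  # the example query from the slide

def make_partition(rows):
    """One in-memory SQLite database standing in for a per-node database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return db

def map_task(db):
    """Mapper: push the subquery down to the local database."""
    return db.execute(QUERY).fetchall()

def reduce_task(partials):
    """Reducer: merge the per-partition results."""
    return sorted(row for part in partials for row in part)

accounts = [(i, 10 * i) for i in range(10)]  # balances 0, 10, ..., 90
# Five partitions of two rows each, as in the figure.
partitions = [make_partition(accounts[i:i + 2]) for i in range(0, 10, 2)]
result = reduce_task(map_task(db) for db in partitions)
print(result)
```

Filtering inside each database means only qualifying rows leave a node, so the local engine's indexes and optimizer do the heavy lifting while MapReduce handles distribution and fault tolerance.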


Comparison with other systems

                                FlexDB                        Hive                HadoopDB                              Traditional parallel database
Query language                  SQL                           HQL                 SQL (joins not currently supported)   SQL
Storage                         Postgres/Greenplum            HDFS                JDBC compatible                       Native OS files
Optimizer                       Cost based (DB/MR paths)      Simple rule based   Simple rule based                     Cost based
Physical storage organization   Column/Row based              Row based           Currently row based                   Column/Row based
Implementation                  FlexDB Master + Hadoop + DB   Hive + Hadoop       Hive (rev) + Hadoop + DB              Native
Efficiency                      High                          Low                 Medium                                Very high
Scale                           Large                         Large               Large                                 Medium
Cost                            Low                           Low                 Low                                   High




Summary
  • New in cloud computing
            – Elasticity/Scalability
            – Resource sharing (multi-tenancy)
            – Focus on failure
  • Data analytics in the cloud: Different solutions suitable for
    different workloads
            – Parallel DBMSs excel at efficient querying of large data sets
            – MR-style systems excel at complex analytics and ETL tasks
  • Combine MapReduce with a shared-nothing DBMS to produce a
    system that better fits the cloud computing market




Acknowledgements

  • Some slides are adapted from the following references:
            – Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud
              Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
            – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik
              Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel
              DBMSs: Friends or Foes?”, Communications of the ACM 2010




EMC Labs China
Dr. Tao Bo
Director, EMC Labs China

Blog: http://blog.sina.com.cn/emclabschina
Weibo: http://weibo.com/emclabschina




THANK YOU



[OSDC.tw 2011] The Path to Pass into PaaS -- How We Build the Solution[OSDC.tw 2011] The Path to Pass into PaaS -- How We Build the Solution
[OSDC.tw 2011] The Path to Pass into PaaS -- How We Build the Solution
 
Complex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBaseComplex Er[jl]ang Processing with StreamBase
Complex Er[jl]ang Processing with StreamBase
 
Hw09 Data Processing In The Enterprise
Hw09   Data Processing In The EnterpriseHw09   Data Processing In The Enterprise
Hw09 Data Processing In The Enterprise
 
IBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesIBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use cases
 
Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012Hadoop Twelve Predictions for 2012
Hadoop Twelve Predictions for 2012
 
Leaders in the Cloud: Identifying Cloud Business Value for Customers
Leaders in the Cloud: Identifying Cloud Business Value for CustomersLeaders in the Cloud: Identifying Cloud Business Value for Customers
Leaders in the Cloud: Identifying Cloud Business Value for Customers
 
ECS/Cloud Object Storage - DevOps Day
ECS/Cloud Object Storage - DevOps DayECS/Cloud Object Storage - DevOps Day
ECS/Cloud Object Storage - DevOps Day
 
Infrastructure Consolidation and Virtualization
Infrastructure Consolidation and VirtualizationInfrastructure Consolidation and Virtualization
Infrastructure Consolidation and Virtualization
 
Data distribution in the cloud with Node.js
Data distribution in the cloud with Node.jsData distribution in the cloud with Node.js
Data distribution in the cloud with Node.js
 
Using BrightWork for Project Management with SharePoint 2010 - from Atidan
Using BrightWork for Project Management with SharePoint 2010 - from AtidanUsing BrightWork for Project Management with SharePoint 2010 - from Atidan
Using BrightWork for Project Management with SharePoint 2010 - from Atidan
 
Dell OpenStack Powered Cloud Solution and Case Sharing
Dell OpenStack Powered Cloud Solution and Case SharingDell OpenStack Powered Cloud Solution and Case Sharing
Dell OpenStack Powered Cloud Solution and Case Sharing
 
Cloud Computing and Big Data
Cloud Computing and Big DataCloud Computing and Big Data
Cloud Computing and Big Data
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17
 
Self-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big DataSelf-Service Access and Exploration of Big Data
Self-Service Access and Exploration of Big Data
 

Similar to Research on big data

Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperabilityparker01
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deckKeithETD_CTO
 
Extending The Value Of Oracle Crm On Demand Through Cloud Based Extensibility
Extending The Value Of Oracle Crm On Demand Through Cloud Based ExtensibilityExtending The Value Of Oracle Crm On Demand Through Cloud Based Extensibility
Extending The Value Of Oracle Crm On Demand Through Cloud Based ExtensibilityJerome Leonard
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Cloud foundry elastic architecture and deploy based on openstack
Cloud foundry elastic architecture and deploy based on openstackCloud foundry elastic architecture and deploy based on openstack
Cloud foundry elastic architecture and deploy based on openstackOpenCity Community
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28korusamol
 
Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters
Boosting Hadoop Performance with  Emulex OneConnect® 10Gb Ethernet Adapters Boosting Hadoop Performance with  Emulex OneConnect® 10Gb Ethernet Adapters
Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters Emulex Corporation
 
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackCMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackJoe Arnold
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
The Rise of Big Data and On-Demand IT
The Rise of Big Data and On-Demand ITThe Rise of Big Data and On-Demand IT
The Rise of Big Data and On-Demand ITInnoTech
 
From open data to API-driven business
From open data to API-driven businessFrom open data to API-driven business
From open data to API-driven businessOpenDataSoft
 
Cloud Computing - Making IT Simple
 Cloud Computing - Making IT Simple Cloud Computing - Making IT Simple
Cloud Computing - Making IT SimpleBob Rhubart
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranJAX London
 
Cloud Computing: Making IT Simple
Cloud Computing: Making IT SimpleCloud Computing: Making IT Simple
Cloud Computing: Making IT SimpleBob Rhubart
 

Similar to Research on big data (20)

Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperability
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
Extending The Value Of Oracle Crm On Demand Through Cloud Based Extensibility
Extending The Value Of Oracle Crm On Demand Through Cloud Based ExtensibilityExtending The Value Of Oracle Crm On Demand Through Cloud Based Extensibility
Extending The Value Of Oracle Crm On Demand Through Cloud Based Extensibility
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Cloud foundry elastic architecture and deploy based on openstack
Cloud foundry elastic architecture and deploy based on openstackCloud foundry elastic architecture and deploy based on openstack
Cloud foundry elastic architecture and deploy based on openstack
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
Cloud computingjun28
Cloud computingjun28Cloud computingjun28
Cloud computingjun28
 
Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters
Boosting Hadoop Performance with  Emulex OneConnect® 10Gb Ethernet Adapters Boosting Hadoop Performance with  Emulex OneConnect® 10Gb Ethernet Adapters
Boosting Hadoop Performance with Emulex OneConnect® 10Gb Ethernet Adapters
 
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStackCMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
CMPE 297 Lecture: Building Infrastructure Clouds with OpenStack
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
The Rise of Big Data and On-Demand IT
The Rise of Big Data and On-Demand ITThe Rise of Big Data and On-Demand IT
The Rise of Big Data and On-Demand IT
 
EMC Unified Analytics Platform. Gintaras Pelenis
EMC Unified Analytics Platform. Gintaras PelenisEMC Unified Analytics Platform. Gintaras Pelenis
EMC Unified Analytics Platform. Gintaras Pelenis
 
From open data to API-driven business
From open data to API-driven businessFrom open data to API-driven business
From open data to API-driven business
 
Cloud Computing - Making IT Simple
 Cloud Computing - Making IT Simple Cloud Computing - Making IT Simple
Cloud Computing - Making IT Simple
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve Loughran
 
Cloud Computing: Making IT Simple
Cloud Computing: Making IT SimpleCloud Computing: Making IT Simple
Cloud Computing: Making IT Simple
 

Recently uploaded

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Research on big data

  • 1. Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop Jidong Chen (jidong.chen@emc.com) Manager, Research Scientist, Big Data Lab EMC Labs China Sept. 2011 © Copyright 2011 EMC Corporation. All rights reserved. 1
  • 2. Grand Opening Announcement: EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, which were established in 2007 by the Office of the CTO.
  • 3. EMC Labs China - Vision and Mission
      • Advanced Technology Research and Development
          – Big Data Lab
          – Cloud Infrastructure and System Lab
          – Cloud Platform and Applications Lab
      • University Collaboration
      • Industry Standards Office
      • IP Portfolio Development
      • Vision
          – Become an elite research and advanced technology institute in China
          – Become the model for future EMC Labs worldwide
  • 4. Outline
      • Big Data projects overview at EMC Labs China
      • Introduction to Cloud Databases
      • Data analytics in the cloud
          – Parallel DBMS
          – MapReduce
      • FlexDB - A cloud-scale database engine based on Hadoop
      • Summary
  • 5. The Digital Universe 2009-2020: Growing by a Factor of 44
      • 2009: 0.8 zettabytes (ZB)
      • 2020: 35.2 zettabytes (ZB)
      • Source: IDC Digital Universe Study, sponsored by EMC, May 2010
  • 6. Big Data is Changing the World
      Expanding Data Sources
      • Science and research
          – Gene sequences
          – LHC accelerator
          – Earth and space exploration
      • Enterprise applications
          – Email, documents, files
          – Application logs
          – Transaction records
      • Web 2.0 data
          – Search log / click stream
          – Twitter / Blog / SNS
          – Wiki
      • Other unstructured data
          – Video/Movie
          – Graphics
          – Digital widgets
      Bigger Challenges
      • Scale out automatically
          – Vs. scale up manually
      • More capacity and bigger pool
          – E.g., 10 PB in a single file system
      • New process capability
          – Loading, analyzing, moving data
          – Intelligence
      • Better performance
          – Linear vs. exponential
          – Faster
      • Autonomous
          – Less human intervention
          – Lower cost
  • 7. Research Scopes and Topics in Big Data
      • Search and Analytics
          – Search: Entity Search, Faceted Search, Associative Search
          – Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
      • MPP Databases and Data Services
          – Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transaction
          – In-memory Database: Cache, Recovery, Consistency
          – Database as a Service: Multi-tenant Data Management, Auto-Administration
      • Hadoop/NoSQL
          – Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
          – NoSQL: Key-Value Store, Document Store, Graph Data Store
  • 8. Project Overview • Hadoop/NoSQL – vHadoop - joint project with VMware • Parallel SAN file system for DISC on virtualized platform – Online MapReduce for Real-time Data Analytics • Pipelined task execution, Group task scheduling, Enhanced fault tolerance • Parallel Data Mining – FlexDB: Cloud-scale Parallel Database for OLAP • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization – Cloud-scale Parallel Database for OLTP • Intelligent database sharding and resharding • Active-active (eager) replication with group communication service • Multiple masters with elastic distributed coordination
  • 9. Cloud Databases
      • The two largest components of the data management market
          – Transactional Data Management
              • Banks, airline reservation, online e-commerce
              • ACID, write-intensive
          – Analytical Data Management
              • Business planning, decision support
              • Query-intensive
      • Challenges of data management in the Cloud
          – Scalability
          – Fault Tolerance
          – Availability & Consistency
          – Transaction Management
          – Flexible Schemas
  • 10. Cloud Databases
      • Data analytics in the cloud
          – Parallel DBMS
          – MapReduce
      • Transactional data management in the cloud
          – NoSQL Store
          – SQL Database
      • Cloud data services (Database as a Service)
          – Multi-tenant data management
          – Auto-administration
  • 11. Commercial Landscape: Major Players
      • Amazon EC2
          – IaaS abstraction
          – Data management using S3 and SimpleDB
      • Microsoft Azure
          – PaaS abstraction
          – Relational engine (SQL Azure)
      • Google AppEngine
          – PaaS abstraction
          – Data management using Google Megastore
  • 12. Data Analytics in the Cloud
      • Scalability to large data volumes:
          – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
          – Scan on 1000-node cluster = 33 minutes
          – Hence divide-and-conquer (i.e., data partitioning)
      • Cost-efficiency:
          – Commodity nodes (cheap, but unreliable)
          – Commodity network
          – Automatic fault-tolerance (fewer admins)
          – Easy to use (fewer programmers)
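The scan-time figures on slide 12 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly linear speedup across nodes, and the function name `scan_seconds` is made up for this example.

```python
def scan_seconds(data_bytes: float, mb_per_sec_per_node: float, nodes: int = 1) -> float:
    """Time to scan data_bytes when every node reads its share in parallel."""
    bytes_per_sec = mb_per_sec_per_node * 1e6 * nodes  # aggregate scan rate
    return data_bytes / bytes_per_sec

TB = 1e12  # decimal terabyte (assumption)

single = scan_seconds(100 * TB, 50)               # one node
cluster = scan_seconds(100 * TB, 50, nodes=1000)  # 1000-node cluster

print(f"{single / 86400:.1f} days")    # 23.1 days
print(f"{cluster / 60:.1f} minutes")   # 33.3 minutes
```

Real clusters fall short of this ideal because of skew, stragglers and coordination overhead, so the 33-minute figure is best read as a lower bound.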
  • 13. Solutions for Large-scale Data Analysis
      • Parallel DBMS technologies
          – Proposed in the late eighties
          – Matured over the last two decades
          – Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
      • MapReduce
          – Pioneered by Google
          – Popularized by Yahoo! (Hadoop)
  • 14. Parallel DBMS technologies
      • Widely used for more than two decades
          – Research projects: Gamma, Grace, …
          – Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
      • Shared-nothing node clusters
      • Relational data model
      • Indexing
      • Familiar SQL interface
      • Parallel query execution
          – Horizontal partitioning of relational tables with partitioned execution of SQL queries
      • Advanced query optimization
      • Well understood and studied
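Horizontal partitioning with partitioned execution, as listed on slide 14, can be sketched in miniature: rows are hash-partitioned across nodes, each node aggregates its own partition, and a coordinator merges the partial results. This is a toy illustration and not how any particular product implements it; the helper names, the `custkey`/`price` columns and the modulo hash are all assumptions made for the example.

```python
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning key (integer keys assumed)."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[row[key] % num_nodes].append(row)
    return partitions

def local_sum(partition, group_key, value_key):
    """Work done independently on one node: a partial GROUP BY ... SUM."""
    totals = defaultdict(float)
    for row in partition:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

def gather(partials):
    """Coordinator step: merge the per-node partial aggregates."""
    merged = defaultdict(float)
    for partial in partials:
        for group, value in partial.items():
            merged[group] += value
    return dict(merged)

orders = [
    {"custkey": 1, "price": 10.0},
    {"custkey": 2, "price": 5.0},
    {"custkey": 1, "price": 2.5},
    {"custkey": 3, "price": 7.0},
]

parts = hash_partition(orders, "custkey", num_nodes=2)
revenue = gather(local_sum(p, "custkey", "price") for p in parts)
print(revenue)
```

Because the table is partitioned on the grouping key, every group lives on exactly one node and the merge step never has to combine two partial sums for the same customer; partitioning on a different column would still give the right answer, but the coordinator would do more work.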
  • 15. Greenplum: A Shared-nothing Parallel DBMS
      • Greenplum's MPP Database has extreme scalability
          – Optimized for BI and analytics
          – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
      • Provides automatic parallelization
          – No need for manual partitioning or tuning
          – Just load and query like any database
          – Tables are automatically distributed across nodes
      • Extremely scalable and I/O optimized
          – All nodes can scan and process in parallel
          – No I/O contention between segments
      • Linear scalability by adding nodes
          – Each adds storage, query performance and loading performance
  • 16. Greenplum Database Architecture: MPP (Massively Parallel Processing), Shared-Nothing Architecture
      • Interfaces: SQL, MapReduce
      • Master Servers: query planning & dispatch
      • Network Interconnect
      • Segment Servers: query processing & data storage
      • External Sources: loading, streaming, etc.
  • 17. Example of Parallel Query Optimization
      Query:
        select c_custkey, c_name,
               sum(l_extendedprice * (1 - l_discount)) as revenue,
               c_acctbal, n_name, c_address, c_phone, c_comment
        from customer, orders, lineitem, nation
        where c_custkey = o_custkey
          and l_orderkey = o_orderkey
          and o_orderdate >= date '1994-08-01'
          and o_orderdate < date '1994-08-01' + interval '3 month'
          and l_returnflag = 'R'
          and c_nationkey = n_nationkey
        group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
        order by revenue desc
      Plan operators in the diagram: Gather Motion 4:1 (slice 3); Sort; HashAggregate; HashJoin (three); Hash (three); Redistribute Motion 4:4 (slice 1); Broadcast Motion 4:4 (slice 2); Seq Scan on lineitem; Seq Scan on customer; Seq Scan on orders; Seq Scan on nation
  • 18. MapReduce
      • Overview
          – Large-scale, massively parallel data access platform
          – Simple data-parallel programming model to express relatively sophisticated distributed programs
          – An associated parallel and distributed implementation for commodity clusters
      • Pioneered by Google
          – Processes 20 PB of data per day
      • Popularized by the open-source Hadoop project
          – Used by Yahoo!, Facebook, Amazon, and the list is growing …
  • 19. Programming Framework: Raw Input <key, value> → MAP → <K1, V1>, <K2, V2>, <K3, V3> → REDUCE
  • 20. MapReduce Example: WordCount
      Map(K, V) {
        For each word w in V
          Collect(w, 1);
      }
      Combine(K, V[ ]) {
        Int count = 0;
        For each v in V
          count += v;
        Collect(K, count);
      }
      Reduce(K, V[ ]) {
        Int count = 0;
        For each v in V
          count += v;
        Collect(K, count);
      }
      Data flow in the diagram: input words (Cat, Bat, Dog, other words; size: TByte) → split → map → combine → partitions (part0, part1, part2) → reduce → output (Cat 3, Bat 4, Dog 3, …)
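The WordCount pseudocode on slide 20 can be exercised in a single process. The sketch below is an illustration only: it runs map, combine and reduce sequentially instead of on a cluster, and the helper names (`map_fn`, `combine_fn`, `reduce_fn`, `group_by_key`, `word_count`) and the three sample splits are assumptions chosen so that the totals match the counts shown on the slide.

```python
from collections import defaultdict

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1

def combine_fn(word, counts):
    """Combine: pre-aggregate counts on the mapper side."""
    yield word, sum(counts)

def reduce_fn(word, counts):
    """Reduce: sum the (already combined) counts for one word."""
    yield word, sum(counts)

def group_by_key(pairs):
    """Group values by key, as the framework does before combine and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def word_count(splits):
    combined = []
    for split_id, text in enumerate(splits):           # one map task per split
        mapped = map_fn(split_id, text)
        for word, ones in group_by_key(mapped).items():
            combined.extend(combine_fn(word, ones))    # local combine
    result = {}
    for word, partials in group_by_key(combined).items():
        for key, total in reduce_fn(word, partials):   # final reduce
            result[key] = total
    return result

counts = word_count(["Cat Bat Dog", "Bat Cat Bat", "Dog Cat Bat Dog"])
print(counts)  # {'Cat': 3, 'Bat': 4, 'Dog': 3}
```

The combiner is safe here because addition is associative and commutative, so summing per split and then summing the partial sums gives the same totals as summing everything in the reducer.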
  • 21. MapReduce Implementation in Hadoop
      • Client submits a job to the master
      • Master assigns map tasks and reduce tasks
      • Input files (split0 to split4) → map phase: mappers read the splits and write intermediate files to local disk
      • Reduce phase: reducers read the intermediate files remotely and write the output files (file0, file1)
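The step between the map and reduce phases on slide 21 depends on every occurrence of a key being routed to the same reducer. A minimal sketch of that routing follows. It uses CRC32 modulo the reducer count as a stand-in hash, which is an assumption for this example and not Hadoop's own partitioner, and the names `partition_for` and `shuffle` are hypothetical.

```python
import zlib

def partition_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route each mapper's (key, value) pairs to per-reducer buckets, grouped by key."""
    reducer_inputs = [dict() for _ in range(num_reducers)]
    for pairs in mapper_outputs:                # one list of pairs per mapper
        for key, value in pairs:
            bucket = reducer_inputs[partition_for(key, num_reducers)]
            bucket.setdefault(key, []).append(value)
    return reducer_inputs

mapper_outputs = [
    [("Cat", 1), ("Bat", 2)],   # intermediate output of mapper 0
    [("Cat", 2), ("Dog", 3)],   # intermediate output of mapper 1
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
for reducer_id, bucket in enumerate(reducer_inputs):
    print(reducer_id, bucket)
```

Python's built-in `hash()` is deliberately avoided for strings because it is salted per process, which would send the same key to different reducers on different machines.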
  • 22. MapReduce Advantages
      • Automatic parallelization:
        – Depending on the size of the raw input data → instantiate multiple MAP tasks
        – Similarly, depending on the number of intermediate <key, value> partitions → instantiate multiple REDUCE tasks
      • Run-time:
        – Data partitioning
        – Task scheduling
        – Handling machine failures
        – Managing inter-machine communication
      • Completely transparent to the programmer/analyst/user
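As a rough illustration of the automatic parallelization described on slide 22, the sketch below derives a map task count from the input size and assigns intermediate keys to reduce partitions. The 64 MB split size, the CRC32-based partitioner, and the function names are assumptions for illustration; a real framework makes these choices from its own configuration.

```python
import zlib

ASSUMED_SPLIT_BYTES = 64 * 1024 * 1024  # assumed split size, not taken from the slides


def num_map_tasks(input_bytes, split_bytes=ASSUMED_SPLIT_BYTES):
    """One map task per input split: ceiling of input size over split size."""
    return (input_bytes + split_bytes - 1) // split_bytes


def partition(key, num_reduce_tasks):
    """Assign an intermediate key to a reduce task by hashing the key.

    CRC32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


if __name__ == "__main__":
    print(num_map_tasks(10 * 1024 ** 3))  # map tasks for a 10 GiB input
    for word in ["Cat", "Bat", "Dog"]:
        print(word, "-> reduce task", partition(word, 3))
```

All pairs with the same key land in the same partition, which is what lets each reduce task work independently of the others.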
  • 23. Possible Applications
      • Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
        – ETL and "read once" data sets
        – Complex analytics
        – Semi-structured data, key-value pairs
      • At Google and others (Yahoo!, Facebook):
        – Inverted index
        – Graph structure of Web documents
        – Summaries of #pages/host, set of frequent queries, etc.
        – Ad optimization
        – Spam filtering
  • 24. MapReduce vs Parallel DBMS
      | | Parallel DBMS | MapReduce |
      | Schema Support | ✓ | Not out of the box |
      | Indexing | ✓ | Not out of the box |
      | Programming Model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive |
      | Optimizations (Compression, Query Optimization) | ✓ | Not out of the box |
      | Flexibility | Not out of the box | ✓ |
      | Fault Tolerance | Coarse grained techniques | ✓ |
  • 25. Further Analysis and Comparison
      • Limitations of some current parallel databases / data warehouses
        – Often use expensive/specialized hardware
        – Difficult to scale to more than 100 nodes
        – Difficult to parallelize data mining applications
          • MPI …
        – Difficult to deal with unstructured data
        – Fault tolerance
          • If one node fails, the whole query restarts
        – Expensive
      • Disadvantages of some MapReduce-based solutions (Hive)
        – A sub-optimal brute-force implementation: no indexing, no JOINs
          • Example: find the people whose salary is $10,000
        – Row-based storage; updates?
        – Not SQL/BI tool compatible
        – No support for schema
        – Non-declarative programming model
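The salary example on slide 25 contrasts a brute-force scan with an indexed lookup. A minimal sketch of the difference follows, assuming hypothetical rows with `name` and `salary` fields and a plain dictionary standing in for a database index.

```python
from collections import defaultdict


def scan(rows, salary):
    """Brute force: inspect every row, as a system without indexes must."""
    return [row["name"] for row in rows if row["salary"] == salary]


def build_index(rows):
    """Build a salary -> names index once, up front."""
    index = defaultdict(list)
    for row in rows:
        index[row["salary"]].append(row["name"])
    return index


def lookup(index, salary):
    """Indexed access: one dictionary probe, independent of table size."""
    return index.get(salary, [])


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    rows = [{"name": "Ann", "salary": 10000},
            {"name": "Bo", "salary": 8000},
            {"name": "Cy", "salary": 10000}]
    print(scan(rows, 10000))
    print(lookup(build_index(rows), 10000))
```

Both calls return the same names; the scan touches every row on every query, while the index pays a one-time build cost and then answers each query with a single probe.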
  • 26. MapReduce Integration in DBMS Context
      • FlexDB: a cloud-scale parallel database engine based on Hadoop MapReduce (a research project)
        – An architectural hybrid of MapReduce and DBMS technologies
        – Uses the fault tolerance and scalability of the MapReduce framework
        – Leverages advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
        – Exposes a declarative interface to the user
      • Goal: leverage the best of both worlds
  • 27. FlexDB Architecture
  • 28. FlexDB Master
      [Figure: the FlexDB Master contains a Query Parser, Query Optimizer, Job Generator, Catalog manager, and Job Executor. An example query, SELECT * FROM Account WHERE balance > 30, is turned into jobs for the MapReduce framework (Mapper, Reducer). Each mapper issues the same statement as a subquery against a Database node. The Account table is shown as rows r0 n0 m0 through r7 n7 m7, and the Database nodes are shown holding rows r0 n0 m0 through r9 n9 m9, two rows per node.]
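The idea on slide 28, where each mapper runs the same SQL subquery against its own database node, can be sketched as follows. This is not FlexDB code: in-memory SQLite databases stand in for the database nodes, the `Account` schema (`id`, `balance`) and the row placement are assumed, and `make_node`, `map_task`, and `run_query` are hypothetical names.

```python
import sqlite3

# The statement shown on the slide, pushed down unchanged to every node.
SUBQUERY = "SELECT * FROM Account WHERE balance > 30"


def make_node(rows):
    """Create one stand-in database node holding a partition of Account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """A mapper: run the subquery on the local node and return its rows."""
    return conn.execute(SUBQUERY).fetchall()


def run_query(nodes):
    """Run one map task per node and merge the partial results."""
    results = []
    for conn in nodes:
        results.extend(map_task(conn))
    return sorted(results)


if __name__ == "__main__":
    nodes = [make_node([(0, 10), (1, 50)]), make_node([(2, 31), (3, 30)])]
    print(run_query(nodes))
```

A selection like this needs no reduce step, because each node can filter its own partition independently; joins and aggregations are where the optimizer's choice between database and MapReduce execution paths would matter.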
  • 29. Comparison with other systems
      | | FlexDB | Hive | HadoopDB | Traditional parallel database |
      | Query language | SQL | HQL | SQL (join not currently supported) | SQL |
      | Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files |
      | Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based |
      | Physical storage organization | Column/Row based | Row based | Currently row based | Column/Row based |
      | Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native |
      | Efficiency | High | Low | Middle | Very High |
      | Scale | Large | Large | Large | Middle |
      | Cost | Low | Low | Low | High |
  • 30. Summary
      • New in cloud computing
        – Elasticity/Scalability
        – Resource sharing (multi-tenancy)
        – Focus on failure
      • Data analytics in the cloud: different solutions suit different workloads
        – Parallel DBMSs excel at efficient querying of large data sets
        – MR-style systems excel at complex analytics and ETL tasks
      • Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market
  • 31. Acknowledgements
      • Some slides are adapted from the following references:
        – Divy Agrawal, Sudipto Das, and Amr El Abbadi, "Big Data and Cloud Computing: New Wine or just New Bottles?", VLDB 2010 Tutorial
        – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 2010
  • 32. EMC Labs China
      Dr. Tao Bo (陶波), Head of EMC Labs China
      Blog: http://blog.sina.com.cn/emclabschina
      Weibo: http://weibo.com/emclabschina
  • 33. THANK YOU