Using distributed technologies
to analyze Big Data

                    Abhijit Sharma
                    Innovation Lab
                    BMC Software




Data Explosion in Data Center
• Performance / Time Series Data
    § Incoming data rates: ~ millions of data points / min
    § Data generated per server per year: ~ 2 GB
    § 50 K servers: ~ 100 TB of data / year
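
The sizing figures above can be checked with one line of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 1000 GB) and the slide's round numbers; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the slide's sizing figures.
# Assumption: decimal units, 1 TB = 1000 GB.
SERVERS = 50_000             # "50 K servers"
GB_PER_SERVER_PER_YEAR = 2   # "~ 2 GB" generated per server per year

total_gb = SERVERS * GB_PER_SERVER_PER_YEAR
total_tb = total_gb / 1000

print(f"{total_tb:.0f} TB per year")  # prints "100 TB per year"
```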




Online Warehouse - Time Series
   § Extreme storage requirements – TS data for a whole data center, e.g. for the
       last year
   § Online TS data availability, i.e. no separate ETL step
   § Support for common analytics operations
           § Roll-up data, e.g. CPU/min to CPU/hour, CPU/day etc.
           § Slice and dice – e.g. CPU util. for UNIX servers in the SFO data center last week
           § Statistical operations: sum, count, avg., var., std. dev., moving avg., frequency
                distributions, forecasting etc.
   § Ease of use – SQL interface, a schema designed for TS data
   § Horizontal scaling – lower-cost commodity hardware
   § High R/W volume

   [Figure: Data Cube for the CPU metric, with dimensions OS, Time and Data Center]
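
The roll-up operation listed above (CPU/min to CPU/hour or CPU/day) can be sketched in a few lines. This is only an illustration, assuming samples arrive as (epoch seconds, value) pairs and that the roll-up is a plain average; the function name `roll_up` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def roll_up(samples, bucket_seconds):
    """Average (epoch_seconds, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # start of the enclosing bucket
        buckets[bucket_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}

# Hypothetical per-minute CPU utilisation: 10% for the first hour, 30% for the second.
per_minute = [(minute * 60, 10.0 if minute < 60 else 30.0) for minute in range(120)]

per_hour = roll_up(per_minute, 3600)    # CPU/min -> CPU/hour
per_day = roll_up(per_minute, 86400)    # CPU/min -> CPU/day
```

Here `per_hour` is `{0: 10.0, 3600: 30.0}` and `per_day` is `{0: 20.0}`. In the warehouse described in this deck the same grouping would be expressed as a SQL GROUP BY over a truncated timestamp.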




Why not use RDBMS based Data Warehousing?
Star schema – dimensions & facts
      § Offline data availability – ETL required – not online
      § Expensive to scale vertically – high-end hardware & software
      § Limits to vertical scaling – big data may not fit
      § Features like transactions are unnecessary and an overhead
          for certain applications
      § Large-scale distribution/partitioning is painful – suboptimal
          at high W/R ratios
      § A flexible schema that can be changed on the fly is
          not possible

Page 4 | 6/5/11
High Level Architecture


  Inputs: real-time continuous load of metric & dimension data; schema & query

  Layers, top to bottom:
    Hive – Distributed SQL
    NoSQL Column Store – HBase
    Hadoop HDFS & Map Reduce Framework
    Map Reduce & HDFS Nodes
Map Reduce - Recap
Map Function
      § Apply to input data, emits reduction key and value
      § Output of Map is sorted and partitioned for use by Reducers
Reduce Function
      § Apply to data grouped by reduction key
      § Often 'reduces' data (for example – sum(values))
Mappers and Reducers can be chained together
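
The recap above can be made concrete with a small in-memory sketch. It is not Hadoop code: `run_stage` and the mapper/reducer functions are hypothetical names, and everything runs in one process. It shows a mapper emitting (reduction key, value) pairs, the grouping and sorting by key, a reducer summing values, and two stages chained together.

```python
from collections import defaultdict

def run_stage(records, mapper, reducer):
    """Run one map -> shuffle/sort -> reduce stage in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (reduction key, value)
            groups[key].append(value)       # shuffle: group values by key
    # sort: reducers receive keys in sorted order
    return [reducer(key, groups[key]) for key in sorted(groups)]

def split_words(line):
    """Stage 1 mapper: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]

def by_frequency(word_and_freq):
    """Stage 2 mapper: emit (frequency, 1) for every counted word."""
    _word, freq = word_and_freq
    return [(freq, 1)]

def sum_values(key, values):
    """Reducer shared by both stages: 'reduces' the values to their sum."""
    return (key, sum(values))

lines = ["hive hbase hdfs", "hive hbase", "hive"]
word_freq = run_stage(lines, split_words, sum_values)          # stage 1
freq_of_freq = run_stage(word_freq, by_frequency, sum_values)  # stage 2, chained
```

With this input `word_freq` is `[("hbase", 2), ("hdfs", 1), ("hive", 3)]`, and the chained stage counts how many words share each frequency.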
HDFS Sweet spot
      § Big Data storage: optimized for large files (ETL)
      § Writes are create, append, and large
      § Reads are mostly big and streaming
      § Throughput is more important than latency
      § Distributed, HA, transparent replication
When is raw HDFS unsuitable?
• Mutable data – Create, Update, Delete
• Small writes
• Random reads, a high % of small reads
• Structured data
• Online access to data – HDFS Loading is
   offline / batch process


NoSQL Data stores - Column
         § Excellent concurrent W/R performance – fast writes
             and fast reads (random and sequential) – this is
             required for near-real-time updates to TS data
         § Distributed architecture, horizontal scaling, transparent
             replication of data
         § Highly Available (HA) and Fault Tolerant (FT), with no
             SPOF – shared-nothing architecture
         § Reasonably rich data model
         § Flexible in terms of schema – amenable to ad-hoc
             changes even at runtime
HBase
         § A (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
         § A table is split into multiple roughly equal-sized regions, each of which is a range of
             sorted keys (partitioned automatically by the key)
         § Rows are ordered by key; columns are ordered within a Column Family
         § The table schema defines Column Families
         § Rows can have different numbers of columns
         § Columns have a value and versions (any number)
         § Column range and key range queries

          Row Key        Column Family (dimensions)       Column Family (metric)
          112334-7782    server:host1   dc:PUNE           value:20
          112334-7783    server:host2                     value:10
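
The data model above can be illustrated with a toy in-memory table. This is a sketch under stated assumptions, not the HBase client API: the class `TinyTable`, the family names `dimensions` and `metric`, and the integer timestamps are all hypothetical. It shows the tuple-to-value mapping, rows kept in key order, rows with different columns, versioned cells, and a key range query.

```python
import bisect

class TinyTable:
    """Toy model of the (row, family:column, timestamp) -> value map."""

    def __init__(self, families):
        self.families = set(families)  # the schema fixes column families only
        self.row_keys = []             # kept sorted, like row keys in a region
        self.cells = {}                # row -> {"family:column": {timestamp: value}}

    def put(self, row, column, timestamp, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row not in self.cells:
            bisect.insort(self.row_keys, row)
            self.cells[row] = {}
        self.cells[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell."""
        versions = self.cells[row][column]
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Key range query: row keys with start_row <= key < stop_row, in order."""
        lo = bisect.bisect_left(self.row_keys, start_row)
        hi = bisect.bisect_left(self.row_keys, stop_row)
        return self.row_keys[lo:hi]

table = TinyTable(["dimensions", "metric"])
table.put("112334-7783", "dimensions:server", 1, "host2")
table.put("112334-7783", "metric:value", 1, 10)
table.put("112334-7783", "metric:value", 2, 12)   # hypothetical newer version
table.put("112334-7782", "dimensions:server", 1, "host1")
table.put("112334-7782", "dimensions:dc", 1, "PUNE")
table.put("112334-7782", "metric:value", 1, 20)

latest = table.get("112334-7783", "metric:value")
rows = table.scan("112334-7782", "112334-7783")
```

Note that the second row simply has no `dimensions:dc` cell; nothing in the schema forces every row to carry the same columns.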
Hive – Distributed SQL > MR
       § MR is not easy to code for analytics tasks (e.g. group, aggregate etc.); chaining
           several Mappers & Reducers is required
       § Hive provides familiar SQL queries which automatically get translated to a flow
           of appropriate Mappers and Reducers that execute the query leveraging MR
       § Leverages the Hadoop ecosystem – MR, HDFS, HBase
       § Hive defines a schema for the meta-tables in which it stores the schema its SQL
           queries use, along with other metadata
       § Storage Handlers for HDFS, HBase
       § Hive SQL supports common SQL clauses: select, filter, grouping, aggregation, insert etc.
       § Hive stores data in partitions (you can specify the partitioning key
           while loading Hive tables) and buckets (useful for statistical operations like
           sampling)
       § Hive queries can also include custom map/reduce tasks as scripts
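
The partition and bucket layout mentioned above can be sketched as follows. This is an illustration only: `zlib.crc32` stands in for whatever hash function Hive applies to the bucketing column, and the bucket count, the column names `dc` and `server`, and the helper names are assumptions.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign a row to a bucket by hashing its bucketing column."""
    # Assumption: crc32 is a stand-in, not Hive's actual hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def load(rows):
    """Lay rows out as partition (dc) -> bucket (hash of server) -> rows."""
    layout = defaultdict(lambda: defaultdict(list))
    for row in rows:
        layout[row["dc"]][bucket_for(row["server"])].append(row)
    return layout

rows = [{"dc": dc, "server": f"host{i}", "value": i}
        for dc in ("PUNE", "SFO") for i in range(100)]
layout = load(rows)

# A query filtered on the partitioning key reads a single partition ...
pune_rows = [row for bucket in layout["PUNE"].values() for row in bucket]
# ... and one bucket out of NUM_BUCKETS can serve as a sample of it.
sample = layout["PUNE"][0]
```

Because the bucket is a pure function of the key, the same server always lands in the same bucket, which is what makes a single bucket usable as a repeatable sample.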
Hive Queries - CREATE
TABLE

    CREATE TABLE wordfreq (word STRING, freq INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;

    LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;

EXTERNAL TABLE

    CREATE EXTERNAL TABLE iops (key string, os string, deploymentsize string,
                                ts int, value int)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" =
        ":key,data:os,data:deploymentSize,data:ts,data:value");
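
The `hbase.columns.mapping` string in the external table definition pairs Hive columns with HBase columns by position: the first entry, `:key`, is the HBase row key, and each later entry is a `family:column` name. A minimal sketch of that pairing follows; the helper `pair_columns` is hypothetical and not part of Hive.

```python
def pair_columns(hive_columns, mapping):
    """Pair Hive columns with HBase columns by position."""
    hbase_columns = [entry.strip() for entry in mapping.split(",")]
    if len(hbase_columns) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    return dict(zip(hive_columns, hbase_columns))

# Column list and mapping string taken from the iops table definition.
hive_columns = ["key", "os", "deploymentsize", "ts", "value"]
mapping = ":key,data:os,data:deploymentSize,data:ts,data:value"

pairs = pair_columns(hive_columns, mapping)
for hive_name, hbase_name in pairs.items():
    print(f"{hive_name:15} -> {hbase_name}")
```

All four data columns live in a single column family, `data`, while the Hive column `key` reads the row key itself.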
Hive Queries - SELECT
TABLE

    select * from wordfreq where freq > 100 sort by freq desc limit 3;

    explain select * from wordfreq where freq > 100 sort by freq desc limit 3;

    select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;

EXTERNAL TABLE

    select ts, avg(value) as cpu from cpu_util_5min group by ts;

    select architecture, avg(value) as cpu from cpu_util_5min group by architecture;
Hive – SQL -> Map Reduce
     CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; filter by server-type = Unix and group by timestamp

     SELECT timestamp, AVG(value)
     FROM timeseries
     WHERE server_type = 'Unix'
     GROUP BY timestamp

     [Figure: data flow from the timeseries table through Map, Shuffle / Sort and Reduce]
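
The translation on this slide can be traced by hand. The sketch below is an assumption about how such a query decomposes, not Hive's actual plan: the WHERE filter runs in the map step, the GROUP BY column becomes the reduction key, and AVG is computed in the reduce step. The sample rows and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the timeseries table (5-minute samples).
timeseries = [
    {"timestamp": 0,   "server_type": "Unix",    "value": 20},
    {"timestamp": 0,   "server_type": "Unix",    "value": 40},
    {"timestamp": 0,   "server_type": "Windows", "value": 90},
    {"timestamp": 300, "server_type": "Unix",    "value": 10},
]

def map_row(row):
    """Map: apply the WHERE filter, emit (GROUP BY key, value)."""
    if row["server_type"] == "Unix":
        yield row["timestamp"], row["value"]

def shuffle_sort(pairs):
    """Shuffle / Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_group(key, values):
    """Reduce: compute AVG(value) for one timestamp."""
    return key, sum(values) / len(values)

mapped = [pair for row in timeseries for pair in map_row(row)]
result = [reduce_group(key, values) for key, values in shuffle_sort(mapped)]
```

For these rows `result` is `[(0, 30.0), (300, 10.0)]`: the Windows sample is dropped in the map step and never reaches a reducer.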
Thanks




Using the cloud and distributed technologies to analyze big data in the enterprise - Indicthreads cloud computing conference 2011
