SlideShare a Scribd company logo
1 of 29
Apache Drill


©MapR Technologies - Confidential   1
My Background

     Startups
       – Aptex, MusicMatch, ID Analytics, Veoh
       – Big data since before big

     Open source
       – since the dark ages before the internet
       – Mahout, Zookeeper, Drill
       – bought the beer at first HUG

 MapR
 Founding member of Apache Drill


©MapR Technologies - Confidential       2
MapR Technologies

     The open enterprise-grade distribution for Hadoop
       – Easy, dependable and fast
       – Open source with standards-based extensions


     MapR is deployed at 1000’s of companies
       –   From small Internet startups to the world’s largest enterprises

     MapR customers analyze massive amounts of data:
       – Hundreds of billions of events daily
       – 90% of the world’s Internet population monthly
       – $1 trillion in retail purchases annually


     MapR has partnered with Google to provide Hadoop on Google Compute
      Engine


©MapR Technologies - Confidential               3
Agenda

     What?
       –   What exactly does Drill do?
     Why?
       –   Why do we need Apache Drill?
     Who?
       –   Who is doing this?
     How?
       –   How does Drill work inside?
     Conclusion
       –   How can you help?
       –   Where can you find out more?


©MapR Technologies - Confidential         4
Apache Drill Overview

   Drill overview
    – Low latency interactive queries
    – Standard ANSI SQL support


   Open-Source
    – 100’s involved across US and Europe
    – Community consensus on API, functionality


   PMC expects first version late this quarter
    – Several components already developed


©MapR Technologies - Confidential   5
Big Data Processing – Hadoop

                                    Batch processing
  Query runtime                     Minutes to hours

  Data volume                       TBs to PBs
  Programming                       MapReduce
  model
  Users                             Developers

  Google project                    MapReduce
  Open source                       Hadoop
  project                           MapReduce




©MapR Technologies - Confidential                      6
Big Data Processing – Hadoop and Storm

                                    Batch processing   Stream processing
  Query runtime                     Minutes to hours   Never-ending

  Data volume                       TBs to PBs         Continuous stream
  Programming                       MapReduce          DAG
  model                                                (pre-programmed)
  Users                             Developers         Developers

  Google project                    MapReduce
  Open source                       Hadoop             Storm or Apache S4
  project                           MapReduce




©MapR Technologies - Confidential                       7
Big Data Processing – The missing part

                                    Batch processing   Interactive analysis   Stream processing
  Query runtime                     Minutes to hours                          Never-ending

  Data volume                       TBs to PBs                                Continuous stream
  Programming                       MapReduce                                 DAG
  model                                                                       (pre-programmed)
  Users                             Developers                                Developers

  Google project                    MapReduce
  Open source                       Hadoop                                    Storm and S4
  project                           MapReduce




©MapR Technologies - Confidential                       8
Big Data Processing – The missing part

                                    Batch processing   Interactive analysis   Stream processing
  Query runtime                     Minutes to hours   Milliseconds to        Never-ending
                                                       minutes
  Data volume                       TBs to PBs         GBs to PBs             Continuous stream
  Programming                       MapReduce          Queries                DAG
  model                                                (ad hoc)               (pre-programmed)
  Users                             Developers         Analysts and           Developers
                                                       developers
  Google project                    MapReduce
  Open source                       Hadoop                                    Storm and S4
  project                           MapReduce




©MapR Technologies - Confidential                       9
Big Data Processing

                                    Batch processing   Interactive analysis   Stream processing
  Query runtime                     Minutes to hours   Milliseconds to        Never-ending
                                                       minutes
  Data volume                       TBs to PBs         GBs to PBs             Continuous stream
  Programming                       MapReduce          Queries                DAG
  model
  Users                             Developers         Analysts and           Developers
                                                       developers
  Google project                    MapReduce          Dremel
  Open source                       Hadoop                                    Storm and S4
  project                           MapReduce




©MapR Technologies - Confidential                       10
Big Data Processing

                                    Batch processing   Interactive analysis   Stream processing
  Query runtime                     Minutes to hours   Milliseconds to        Never-ending
                                                       minutes
  Data volume                       TBs to PBs         GBs to PBs             Continuous stream
  Programming                       MapReduce          Queries                DAG
  model
  Users                             Developers         Analysts and           Developers
                                                       developers
  Google project                    MapReduce          Dremel
  Open source                       Hadoop                                    Storm and S4
  project                           MapReduce


                    Introducing Apache Drill
©MapR Technologies - Confidential                       11
Latency Matters

     Ad-hoc analysis with interactive tools


     Real-time dashboards


     Event/trend detection and analysis
       –   Network intrusions
       –   Fraud
       –   Failures




©MapR Technologies - Confidential     12
Nested Query Languages

     DrQL
       –   SQL-like query language for nested data
       –   Compatible with Google BigQuery/Dremel
            •    BigQuery applications should work with Drill
       –   Designed to support efficient column-based processing
            •    No record assembly during query processing


     Mongo Query Language
       –   {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}


     Other languages/programming models can plug in


©MapR Technologies - Confidential                    13
Nested Data Model

     The data model in Dremel is Protocol Buffers
       –   Nested
       –   Schema
     Apache Drill is designed to support multiple data models
       –   Schema: Protocol Buffers, Apache Avro, …
       –   Schema-less: JSON, BSON, …
     Flat records are supported as a special case of nested data
       –   CSV, TSV, …

                                    Avro IDL                               JSON
             enum Gender {                                   {
               MALE, FEMALE                                      "name": "Srivas",
             }                                                   "gender": "Male",
                                                                 "followers": 100
             record User {                                   }
               string name;                                  {
               Gender gender;                                    "name": "Raina",
               long followers;                                   "gender": "Female",
             }                                                   "followers": 200,
                                                                 "zip": "94305"
                                                             }
©MapR Technologies - Confidential                     14
Extensibility

     Nested query languages
       – Pluggable model
       – DrQL
       – Mongo Query Language
       – Cascading


     Distributed execution engine
       – Extensible model (eg, Dryad)
       – Low-latency
       – Fault tolerant


     Nested data formats
       – Pluggable model
       – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
       – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)


     Scalable data sources
       – Pluggable model
       – Hadoop
       – HBase

©MapR Technologies - Confidential                          15
Design Principles

             Flexible                               Easy
             • Pluggable query languages            •   Unzip and run
             • Extensible execution engine          •   Zero configuration
             • Pluggable data formats               •   Reverse DNS not needed
               • Column-based and row-based         •   IP addresses can change
               • Schema and schema-less             •   Clear and concise log messages
             • Pluggable data sources


             Dependable                             Fast
             • No SPOF                              • C/C++ core with Java support
             • Instant recovery from crashes          • Google C++ style guide
                                                    • Min latency and max throughput
                                                      (limited only by hardware)




©MapR Technologies - Confidential              16
Apache DRill


©MapR Technologies - Confidential   17
Architecture




     Only the execution engine knows the physical attributes of the cluster
       –   # nodes, hardware, file locations, …


     Public interfaces enable extensibility
       – Developers can build parsers for new query languages
       – Developers can provide an execution plan directly


     Each level of the plan has a human readable representation
       –   Facilitates debugging and unit testing


©MapR Technologies - Confidential                   18
Execution Engine Layers

     Drill execution engine has two layers
       –   Operator layer is serialization-aware
            •    Processes individual records
       –   Execution layer is not serialization-aware
            •    Processes batches of records (blobs)
            •    Responsible for communication, dependencies and fault tolerance




©MapR Technologies - Confidential                    19
DrQL Example


                         SELECT DocId AS Id,
                           COUNT(Name.Language.Code) WITHIN Name AS
                         Cnt,
                           Name.Url + ',' + Name.Language.Code AS
                         Str
                         FROM t
                         WHERE REGEXP(Name.Url, '^http')
                           AND DocId < 20;




©MapR Technologies - Confidential             20             * Example from the Dremel paper
Query Components

     Query components:
       –   SELECT
       –   FROM
       –   WHERE
       –   GROUP BY
       –   HAVING
       –   (JOIN)


     Key logical operators:
       – Scan
       – Filter
       – Aggregate
       – (Join)



©MapR Technologies - Confidential   21
Logical Plan

                                    scan-json        "table-1"



                                      filter     exp1



                                     flatten



                                    aggregate        exp2



©MapR Technologies - Confidential               22
Logical Plan Syntax
                                    {op: "sequence",
                                      do: [
                                        {op: "scan",
                                          source: "table-1.json"
                                          selection: "*"
                                        },
                                        {op: "filter",
                                          expr: <expr>
                                        },
                                        {op: "flatten",
                                          expr: <expr>,
                                          drop: "false"
                                        },
                                        {op: "aggregate",
                                          type: repeat,
                                          keys: [<name>,...],
                                          aggregations: [
                                            {ref: <name>, expr: <aggexpr> },...
                                          ]
                                        }
                                      ]
                                    }
©MapR Technologies - Confidential                           23
Representing a DAG

                                18


                       aggregate       exp2


                                19
                                     { @id: 19, op: "aggregate",
                                       input: 18,
                                       type: <simple|running|repeat>,
                                       keys: [<name>,...],
                                       aggregations: [
                                         {ref: <name>, expr: <aggexpr> },...
                                       ]
                                     }




©MapR Technologies - Confidential                24
Multiple Inputs

                  id                23 24     id



                                    cogroup
                                                   { @id: 25, op: "cogroup",
                                                       groupings: [
                                      25               {ref: 23, expr: “id”}, {ref:
                                                   24, expr: “id”}
                                                       ]
                                                   }




©MapR Technologies - Confidential                       25
Scan Operators

  • Drill supports multiple data formats by having per-format scan operators
     • Queries involving multiple data formats/sources are supported

  • Fields and predicates can be pushed down into the scan operator

  • Scan operators may have adaptive side-effects (database cracking)
     • Produce ColumnIO from RecordIO
     • Google PowerDrill stores materialized expressions with the data
                                    Scan with schema                          Scan without schema

   Operator                         Protocol Buffers                          JSON-like (MessagePack)
   output
   Supported                        ColumnIO (column-based protobuf/Dremel)   JSON
   data formats                     RecordIO (row-based protobuf)             HBase
                                    CSV
   SELECT …                         ColumnIO(proto URI, data URI)             Json(data URI)
   FROM …                           RecordIO(proto URI, data URI)             HBase(table name)

©MapR Technologies - Confidential                                    26
Design Principles

             Flexible                               Easy
             • Pluggable query languages            •   Unzip and run
             • Extensible execution engine          •   Zero configuration
             • Pluggable data formats               •   Reverse DNS not needed
               • Column-based and row-based         •   IP addresses can change
               • Schema and schema-less             •   Clear and concise log messages
             • Pluggable data sources


             Dependable                             Fast
             • No SPOF                              • C/C++ core with Java support
             • Instant recovery from crashes          • Google C++ style guide
                                                    • Min latency and max throughput
                                                      (limited only by hardware)




©MapR Technologies - Confidential              27
Hadoop Integration

     Hadoop data sources
       –   Hadoop FileSystem API (HDFS/MapR-FS)
       –   HBase
     Hadoop data formats
       –   Apache Avro
       –   RCFile
     MapReduce-based tools to create column-based formats
     Table registry in HCatalog
     Run long-running services in YARN




©MapR Technologies - Confidential         28
Get Involved!

     Download these slides
       –   http://www.mapr.com/company/events/hug-france-12-04-2012


     Join the project
       –   drill-dev-subscribe@incubator.apache.org
       –   #apachedrill

     Contact me:
       –   tdunning@maprtech.com
       –   tdunning@apache.org
       –   ted.dunning@maprtech.com
       –   @ted_dunning


     Join MapR
       –   jobs@mapr.com


©MapR Technologies - Confidential                 29

More Related Content

What's hot

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19jasonfrantz
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Stormboorad
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down InternetMapR Technologies
 
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerMapR Technologies
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 

What's hot (20)

Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Hug france-2012-12-04
Hug france-2012-12-04Hug france-2012-12-04
Hug france-2012-12-04
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19Drill Bay Area HUG 2012-09-19
Drill Bay Area HUG 2012-09-19
 
Realtime Computation with Storm
Realtime Computation with StormRealtime Computation with Storm
Realtime Computation with Storm
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David TuckerChallenges & Capabilites in Managing a MapR Cluster by David Tucker
Challenges & Capabilites in Managing a MapR Cluster by David Tucker
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Drill dchug-29 nov2012
Drill dchug-29 nov2012Drill dchug-29 nov2012
Drill dchug-29 nov2012
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 

Viewers also liked

Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreModern Data Stack France
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotModern Data Stack France
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Modern Data Stack France
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaModern Data Stack France
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiModern Data Stack France
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainModern Data Stack France
 

Viewers also liked (20)

M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
IBM Stream au Hadoop User Group
IBM Stream au Hadoop User GroupIBM Stream au Hadoop User Group
IBM Stream au Hadoop User Group
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 
Analyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien CabotAnalyse prédictive en assurance santé par Julien Cabot
Analyse prédictive en assurance santé par Julien Cabot
 
Cascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand DechouxCascalog présenté par Bertrand Dechoux
Cascalog présenté par Bertrand Dechoux
 
Hadoop on Azure
Hadoop on AzureHadoop on Azure
Hadoop on Azure
 
Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)Talend Open Studio for Big Data (powered by Apache Hadoop)
Talend Open Studio for Big Data (powered by Apache Hadoop)
 
Cassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy HannaCassandra Hadoop Best Practices by Jeremy Hanna
Cassandra Hadoop Best Practices by Jeremy Hanna
 
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr KołaczkowskiCassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGainHadoop HPC, calcul de VAR sur Hadoop vs GridGain
Hadoop HPC, calcul de VAR sur Hadoop vs GridGain
 
Dépasser map() et reduce()
Dépasser map() et reduce()Dépasser map() et reduce()
Dépasser map() et reduce()
 
Hadoop chez Kobojo
Hadoop chez KobojoHadoop chez Kobojo
Hadoop chez Kobojo
 
Big Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent HeuschlingBig Data et SEO, par Vincent Heuschling
Big Data et SEO, par Vincent Heuschling
 
HCatalog
HCatalogHCatalog
HCatalog
 
Hadopp Vue d'ensemble
Hadopp Vue d'ensembleHadopp Vue d'ensemble
Hadopp Vue d'ensemble
 
Hadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas VialHadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas Vial
 
Retour Hadoop Summit 2012
Retour Hadoop Summit 2012Retour Hadoop Summit 2012
Retour Hadoop Summit 2012
 

Similar to Hug france-2012-12-04

PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batchboorad
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in businessMapR Technologies
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San DiegoMapR Technologies
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batchboorad
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Anil_BigData Resume
Anil_BigData ResumeAnil_BigData Resume
Anil_BigData ResumeAnil Sokhal
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceCsaba Toth
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudKhazret Sapenov
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Cloudera, Inc.
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introductiondewang_mistry
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 

Similar to Hug france-2012-12-04 (20)

PhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond BatchPhillyDB Talk - Beyond Batch
PhillyDB Talk - Beyond Batch
 
HUG France - Apache Drill
HUG France - Apache DrillHUG France - Apache Drill
HUG France - Apache Drill
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in business
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San Diego
 
TriHUG - Beyond Batch
TriHUG - Beyond BatchTriHUG - Beyond Batch
TriHUG - Beyond Batch
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
London hug
London hugLondon hug
London hug
 
Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)
 
Anil_BigData Resume
Anil_BigData ResumeAnil_BigData Resume
Anil_BigData Resume
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
The elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloudThe elephantintheroom bigdataanalyticsinthecloud
The elephantintheroom bigdataanalyticsinthecloud
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
Keynote – From MapReduce to Spark: An Ecosystem Evolves by Doug Cutting, Chie...
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop - A Very Short Introduction
Hadoop - A Very Short IntroductionHadoop - A Very Short Introduction
Hadoop - A Very Short Introduction
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Recently uploaded

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

Hug france-2012-12-04

  • 2. My Background  Startups – Aptex, MusicMatch, ID Analytics, Veoh – Big data since before big  Open source – since the dark ages before the internet – Mahout, Zookeeper, Drill – bought the beer at first HUG  MapR  Founding member of Apache Drill ©MapR Technologies - Confidential 2
  • 3. MapR Technologies  The open enterprise-grade distribution for Hadoop – Easy, dependable and fast – Open source with standards-based extensions  MapR is deployed at 1000’s of companies – From small Internet startups to the world’s largest enterprises  MapR customers analyze massive amounts of data: – Hundreds of billions of events daily – 90% of the world’s Internet population monthly – $1 trillion in retail purchases annually  MapR has partnered with Google to provide Hadoop on Google Compute Engine ©MapR Technologies - Confidential 3
  • 4. Agenda  What? – What exactly does Drill do?  Why? – Why do we need Apache Drill?  Who? – Who is doing this?  How? – How does Drill work inside?  Conclusion – How can you help? – Where can you find out more? ©MapR Technologies - Confidential 4
  • 5. Apache Drill Overview  Drill overview – Low latency interactive queries – Standard ANSI SQL support  Open-Source – 100’s involved across US and Europe – Community consensus on API, functionality  PMC expects first version late this quarter – Several components already developed ©MapR Technologies - Confidential 5
  • 6. Big Data Processing – Hadoop Batch processing Query runtime Minutes to hours Data volume TBs to PBs Programming MapReduce model Users Developers Google project MapReduce Open source Hadoop project MapReduce ©MapR Technologies - Confidential 6
  • 7. Big Data Processing – Hadoop and Storm Batch processing Stream processing Query runtime Minutes to hours Never-ending Data volume TBs to PBs Continuous stream Programming MapReduce DAG model (pre-programmed) Users Developers Developers Google project MapReduce Open source Hadoop Storm or Apache S4 project MapReduce ©MapR Technologies - Confidential 7
  • 8. Big Data Processing – The missing part Batch processing Interactive analysis Stream processing Query runtime Minutes to hours Never-ending Data volume TBs to PBs Continuous stream Programming MapReduce DAG model (pre-programmed) Users Developers Developers Google project MapReduce Open source Hadoop Storm and S4 project MapReduce ©MapR Technologies - Confidential 8
  • 9. Big Data Processing – The missing part Batch processing Interactive analysis Stream processing Query runtime Minutes to hours Milliseconds to Never-ending minutes Data volume TBs to PBs GBs to PBs Continuous stream Programming MapReduce Queries DAG model (ad hoc) (pre-programmed) Users Developers Analysts and Developers developers Google project MapReduce Open source Hadoop Storm and S4 project MapReduce ©MapR Technologies - Confidential 9
  • 10. Big Data Processing Batch processing Interactive analysis Stream processing Query runtime Minutes to hours Milliseconds to Never-ending minutes Data volume TBs to PBs GBs to PBs Continuous stream Programming MapReduce Queries DAG model Users Developers Analysts and Developers developers Google project MapReduce Dremel Open source Hadoop Storm and S4 project MapReduce ©MapR Technologies - Confidential 10
  • 11. Big Data Processing Batch processing Interactive analysis Stream processing Query runtime Minutes to hours Milliseconds to Never-ending minutes Data volume TBs to PBs GBs to PBs Continuous stream Programming MapReduce Queries DAG model Users Developers Analysts and Developers developers Google project MapReduce Dremel Open source Hadoop Storm and S4 project MapReduce Introducing Apache Drill ©MapR Technologies - Confidential 11
  • 12. Latency Matters  Ad-hoc analysis with interactive tools  Real-time dashboards  Event/trend detection and analysis – Network intrusions – Fraud – Failures ©MapR Technologies - Confidential 12
  • 13. Nested Query Languages  DrQL – SQL-like query language for nested data – Compatible with Google BigQuery/Dremel • BigQuery applications should work with Drill – Designed to support efficient column-based processing • No record assembly during query processing  Mongo Query Language – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}  Other languages/programming models can plug in ©MapR Technologies - Confidential 13
  • 14. Nested Data Model  The data model in Dremel is Protocol Buffers – Nested – Schema  Apache Drill is designed to support multiple data models – Schema: Protocol Buffers, Apache Avro, … – Schema-less: JSON, BSON, …  Flat records are supported as a special case of nested data – CSV, TSV, … Avro IDL JSON enum Gender { { MALE, FEMALE "name": "Srivas", } "gender": "Male", "followers": 100 record User { } string name; { Gender gender; "name": "Raina", long followers; "gender": "Female", } "followers": 200, "zip": "94305" } ©MapR Technologies - Confidential 14
  • 15. Extensibility  Nested query languages – Pluggable model – DrQL – Mongo Query Language – Cascading  Distributed execution engine – Extensible model (eg, Dryad) – Low-latency – Fault tolerant  Nested data formats – Pluggable model – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV) – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)  Scalable data sources – Pluggable model – Hadoop – HBase ©MapR Technologies - Confidential 15
  • 16. Design Principles Flexible Easy • Pluggable query languages • Unzip and run • Extensible execution engine • Zero configuration • Pluggable data formats • Reverse DNS not needed • Column-based and row-based • IP addresses can change • Schema and schema-less • Clear and concise log messages • Pluggable data sources Dependable Fast • No SPOF • C/C++ core with Java support • Instant recovery from crashes • Google C++ style guide • Min latency and max throughput (limited only by hardware) ©MapR Technologies - Confidential 16
  • 17. Apache DRill ©MapR Technologies - Confidential 17
  • 18. Architecture  Only the execution engine knows the physical attributes of the cluster – # nodes, hardware, file locations, …  Public interfaces enable extensibility – Developers can build parsers for new query languages – Developers can provide an execution plan directly  Each level of the plan has a human readable representation – Facilitates debugging and unit testing ©MapR Technologies - Confidential 18
  • 19. Execution Engine Layers  Drill execution engine has two layers – Operator layer is serialization-aware • Processes individual records – Execution layer is not serialization-aware • Processes batches of records (blobs) • Responsible for communication, dependencies and fault tolerance ©MapR Technologies - Confidential 19
  • 20. DrQL Example SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS Str FROM t WHERE REGEXP(Name.Url, '^http') AND DocId < 20; ©MapR Technologies - Confidential 20 * Example from the Dremel paper
  • 21. Query Components  Query components: – SELECT – FROM – WHERE – GROUP BY – HAVING – (JOIN)  Key logical operators: – Scan – Filter – Aggregate – (Join) ©MapR Technologies - Confidential 21
  • 22. Logical Plan scan-json "table-1" filter exp1 flatten aggregate exp2 ©MapR Technologies - Confidential 22
  • 23. Logical Plan Syntax {op: "sequence", do: [ {op: "scan", source: "table-1.json" selection: "*" }, {op: "filter", expr: <expr> }, {op: "flatten", expr: <expr>, drop: "false" }, {op: "aggregate", type: repeat, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ] } ] } ©MapR Technologies - Confidential 23
  • 24. Representing a DAG 18 aggregate exp2 19 { @id: 19, op: "aggregate", input: 18, type: <simple|running|repeat>, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ] } ©MapR Technologies - Confidential 24
  • 25. Multiple Inputs id 23 24 id cogroup { @id: 25, op: "cogroup", groupings: [ 25 {ref: 23, expr: “id”}, {ref: 24, expr: “id”} ] } ©MapR Technologies - Confidential 25
  • 26. Scan Operators • Drill supports multiple data formats by having per-format scan operators • Queries involving multiple data formats/sources are supported • Fields and predicates can be pushed down into the scan operator • Scan operators may have adaptive side-effects (database cracking) • Produce ColumnIO from RecordIO • Google PowerDrill stores materialized expressions with the data Scan with schema Scan without schema Operator Protocol Buffers JSON-like (MessagePack) output Supported ColumnIO (column-based protobuf/Dremel) JSON data formats RecordIO (row-based protobuf) HBase CSV SELECT … ColumnIO(proto URI, data URI) Json(data URI) FROM … RecordIO(proto URI, data URI) HBase(table name) ©MapR Technologies - Confidential 26
  • 27. Design Principles Flexible Easy • Pluggable query languages • Unzip and run • Extensible execution engine • Zero configuration • Pluggable data formats • Reverse DNS not needed • Column-based and row-based • IP addresses can change • Schema and schema-less • Clear and concise log messages • Pluggable data sources Dependable Fast • No SPOF • C/C++ core with Java support • Instant recovery from crashes • Google C++ style guide • Min latency and max throughput (limited only by hardware) ©MapR Technologies - Confidential 27
  • 28. Hadoop Integration  Hadoop data sources – Hadoop FileSystem API (HDFS/MapR-FS) – HBase  Hadoop data formats – Apache Avro – RCFile  MapReduce-based tools to create column-based formats  Table registry in HCatalog  Run long-running services in YARN ©MapR Technologies - Confidential 28
  • 29. Get Involved!  Download these slides – http://www.mapr.com/company/events/hug-france-12-04-2012  Join the project – drill-dev-subscribe@incubator.apache.org – #apachedrill  Contact me: – tdunning@maprtech.com – tdunning@apache.org – ted.dunning@maprtech.com – @ted_dunning  Join MapR – jobs@mapr.com ©MapR Technologies - Confidential 29

Editor's Notes

  1. No graphic changes….Note for Bullet changes:Open Source-- Community consensusAPIAvailable for all Distributions--
  2. Likely to support theseCould add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon queryPig as wellPluggabilityData formatQuery languageSomething 6-9 months alpha qualityCommunity driven, I can’t speak for projectMapRFS gives better chunk size controlNFS support may make small test drivers easierUnified namespace will allow multi-cluster accessMight even have drill component that autoformats dataRead only model
  3. Protocol buffers are conceptual data modelWill support multiple data modelsWill have to define a way to explain data format (filtering, fields, etc)Schema-less will have perf penaltyHbase will be one format
  4. Note: we have an already partially built execution engine
  5. Example query that Drill should supportNeed to talk more here about what Dremel does
  6. Be prepared for Apache questionsCommitter vs committee vs contributorIf can’t answer question, ask them to answer and contributeLisa - Need landing pageReferences to paper and such at end