Using realtime SQL2003 to query
JSON on Hadoop with Apache Drill
               January 28, 2013
                    Jacques Nadeau
     Apache Drill Contributor @ MapR Technologies
Me
• Apache Drill and HBase Contributor
• Sponsored by MapR Technologies to lead Apache Drill
  contributions


• MapR Technologies
   – Enterprise-grade high performance distribution for
     Hadoop
   – Open source plus standards-based extensions
   – Large number of Fortune 100 customers, startups too.
   – Free distribution for unlimited nodes
   – Partnered to provide on Google Compute Engine and
     Amazon Elastic MapReduce
Jane works as an Analyst at an ecommerce website

How does she figure out good targeting segments for the next
marketing campaign?

She has some ideas and lots of data:
• Transaction information
• User profiles
• Access logs
Let’s try using existing options
•   Use Oracle
     – Write flattening MongoDB query for export and generate giant CSV. Work with MapReduce
       team to build a MapReduce job that provides export. Contact DBA to import data exports. Use
       Oracle SQL to determine answers.
•   Use Hive
     – Pull up Hive. Start writing queries. Realize that Hive/Mongo interconnector doesn’t support
       nested data. Realize that Hive doesn’t have JDBC/ODBC storage handler. Query data from
       Oracle and copy to Hadoop. Query flattened Mongo data and copy into Hadoop. Write HiveQL
       query. Wait 30 minutes for result. Repeat until desired outcome. Avoid frustration along the
       way with the flattened Mongo data, portion of Oracle extraction, and the lack of major
       portions of SQL syntax.
•   Use Data Virtualization Solution
     – Write SQL query against virtualization interface. Realize that you still need to ETL Mongo data
       since it isn’t natively supported. Query runs slowly since virtualization solution doesn’t run
       locally against Hadoop data and fails to effectively distribute your query.
•   Use MapReduce
     – Work with Engineering to define a specification for needs. Use Sqoop to set up regular ETL
       from Oracle. Define a custom MapReduce job to import Mongo data.
     – Look at output and realize different analyses should be done, repeat cycle (or learn Java)
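
To make the flattening step in the Oracle and Hive options concrete, here is a minimal Python sketch. The document shape (`user_id`, `name`, `orders`, `items`) and the helper `flatten_user` are hypothetical, not taken from the talk. The sketch only illustrates how exporting a nested record to CSV repeats the parent fields on every row and discards the hierarchy.

```python
import csv
import io

# Hypothetical nested user profile, shaped like a MongoDB document.
user = {
    "user_id": 7,
    "name": "Jane",
    "orders": [
        {"order_id": "A1", "items": [{"sku": "pen", "qty": 2}, {"sku": "ink", "qty": 1}]},
        {"order_id": "B2", "items": [{"sku": "pad", "qty": 5}]},
    ],
}


def flatten_user(doc):
    """Yield one flat row per (order, item) pair, repeating the parent fields."""
    for order in doc["orders"]:
        for item in order["items"]:
            yield {
                "user_id": doc["user_id"],
                "name": doc["name"],
                "order_id": order["order_id"],
                "sku": item["sku"],
                "qty": item["qty"],
            }


rows = list(flatten_user(user))

# Write the flat rows as CSV, the kind of export a relational import would consume.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "name", "order_id", "sku", "qty"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

One document becomes three rows here, and a user with no orders would produce no rows at all, which is one way a flattened export can quietly lose data.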
Why are things so hard?
• Slow
   – Virtualization solutions don’t support data locality and pushdown
   – MapReduce sacrifices performance to support long running jobs, recoverability, and
     ultimate flexibility
• Old
   – Most systems assume flat data with well-defined static schemas
• Hard
   – Write queries in multiple languages (Does anybody know MongoQL, CQL, HiveQL and
     SQL?)
   – Analysts often need custom development help
• Error Prone
   – ETL leads to data synchronization issues
   – Lack of query transparency leads to incorrect assumptions and bad business conclusions
• Expensive
   – Commercial solutions are very expensive
   – Typically provide poor compatibility with newer NoSQL technologies
Open Source Mantra: WWGD?
                 Distributed                 Interactive   Batch
                 File System   Datastore     analysis      processing

Google           GFS           BigTable      Dremel        MapReduce

Open source      HDFS          HBase                       Hadoop MapReduce

Build Apache Drill to provide a true open source
   solution to interactive analysis of Big Data
Apache Drill Overview
• Drill overview
   –   Low latency interactive queries
   –   Standard ANSI SQL2003 support
   –   Domain Specific Languages / Your own QL
   –   Inspired by, compatible with Google BigQuery/Dremel
   –   Supports Nested/Hierarchical Data Formats
   –   Supports RDBMS, Hadoop and NoSQL alike

• Open-Source and Flexible
   – Apache Incubator
   – Hundreds of people involved across the US and Europe
   – Community consensus on API, functionality
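
One way an engine can handle nested/hierarchical data without a predeclared schema is to infer the shape of each record as it is read. The following is a minimal Python sketch of that idea. The function `discover_schema` and the sample record are hypothetical, and this is not a description of how Drill itself represents schemas.

```python
import json


def discover_schema(value):
    """Return a nested description of the types found in a parsed JSON value."""
    if isinstance(value, dict):
        return {key: discover_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Describe a list by the distinct schemas of its elements.
        element_schemas = []
        for element in value:
            schema = discover_schema(element)
            if schema not in element_schemas:
                element_schemas.append(schema)
        return {"list_of": element_schemas}
    # Scalars are described by their Python type name.
    return type(value).__name__


record = json.loads('{"user": {"id": 7, "tags": ["a", "b"]}, "active": true, "score": 1.5}')
schema = discover_schema(record)
```

The result keeps the nesting of the input, so a field such as `user.tags` stays a list inside `user` instead of being flattened into separate columns.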
Why do we need another tool?

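Query classes, from fastest to slowest:
• Point queries: 0-100ms
• Interactive Queries: 100ms – 3 minutes
• Data Analyst & Reporting Queries: 3 minutes – 20 minutes
• Data Mining and Major ETL: 20 minutes – 20 hours

Tools, positioned along the same scale:
• Per system interfaces (fastest end)
• Apache Drill (middle)
• MapReduce, Hive and PIG (slowest end)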
Why not improve Hive or Pig?
•   Different Goals
•   SQL should be first class concern
•   MapReduce severely hampers processing model and performance
     –   Startup cost is high
     –   Map:Reduce recoverability and barrier disadvantages
     –   Job:Job recoverability and barrier disadvantages (chained jobs)
•   Need to build from in-memory representation
     –   Two canonical in-memory formats (row-based and columnar)
     –   Support much larger memory sizes
     –   Smaller memory footprint per record
     –   Avoid serialization/deserialization and object creation costs between nodes and operations
•   Performance of interactive queries is critical
     –   Evaluation and Operator code generation & compilation
•   First class recognition of nested types without metadata requirement
     –   Schema Discovery and standard schema representation
•   Clear delineation between important stages
     –   Support for multiple optimizers and researcher experimentation
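
The difference between the two in-memory formats named above (row-based and columnar) can be sketched in a few lines of Python. The field names and values are hypothetical, and the standard-library `array` type stands in for whatever buffers a real engine would use; this is an illustration of the layouts, not Drill's implementation.

```python
from array import array

# Row-based: one Python object per record, every field stored together.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 0.5},
]

# Columnar: one typed, contiguous array per field.
columns = {
    "user_id": array("q", (row["user_id"] for row in rows)),  # 64-bit integers
    "amount": array("d", (row["amount"] for row in rows)),    # 64-bit floats
}

# The same aggregate over both layouts.
row_total = sum(row["amount"] for row in rows)  # touches every record object
column_total = sum(columns["amount"])           # touches only the one column

# A typed column is already a flat byte buffer, so it can be handed to
# another operator or node without building per-record objects.
amount_bytes = columns["amount"].tobytes()
```

The columnar aggregate reads a single packed array, which is the kind of layout that helps with a smaller footprint per record and with avoiding object creation between operations.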
How does it work?
• Drillbits run on each node to minimize
  network transfer
• Queries can be fed to any Drillbit.
• Coordination, query planning,
  optimization, scheduling, and
  execution are distributed

    SELECT * FROM
      oracle.transactions,
      mongo.users,
      hdfs.events
    LIMIT 1
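
The idea that any Drillbit can accept a query while scans run next to the data can be sketched with a toy model. Everything here is hypothetical: the `Drillbit` class, its methods, the node names, and the sample rows are illustrative and are not Drill's API. The sketch also does not implement a join across sources; it only pushes a per-source limit down to the nodes.

```python
class Drillbit:
    """Toy stand-in for a per-node query daemon; not Drill's real API."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data  # source name -> rows stored on this node
        self.peers = []               # filled in once the cluster is assembled

    def scan_local(self, source, limit):
        """Read up to `limit` rows from data stored on this node."""
        return self.local_data.get(source, [])[:limit]

    def execute(self, sources, limit):
        """Act as coordinator: ask each node to scan its own data, then merge."""
        results = {}
        for source in sources:
            rows = []
            for node in self.peers:
                remaining = limit - len(rows)
                if remaining <= 0:
                    break
                # Only the limited result leaves the node, not the raw data.
                rows.extend(node.scan_local(source, remaining))
            results[source] = rows
        return results


def build_cluster(layout):
    """Create one Drillbit per node and make every node aware of all nodes."""
    nodes = [Drillbit(name, data) for name, data in layout.items()]
    for node in nodes:
        node.peers = nodes
    return nodes


cluster = build_cluster({
    "node-a": {"hdfs.events": [{"event": "click"}, {"event": "view"}]},
    "node-b": {"hdfs.events": [{"event": "buy"}], "mongo.users": [{"name": "Jane"}]},
})

sources = ["mongo.users", "hdfs.events"]
# Any node can coordinate the query; both give the same answer.
from_a = cluster[0].execute(sources, limit=1)
from_b = cluster[1].execute(sources, limit=1)
```

Because the limit is applied where the rows live, the coordinator receives at most one row per source regardless of how much data each node holds.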
Flexibility with Strongly Defined Tiers and APIs
Apache Drill currently in development
• Heavy active development by multiple
  supporting organizations
• Available
  – Logical plan syntax and interpreter
  – Reference Interpreter
• In progress
  – SQL interpreter
  – Storage Engine implementations for Accumulo,
    Cassandra, HBase, and HDFS file formats
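
As a rough illustration of what interpreting a logical plan means, the sketch below runs a list of operator steps over in-memory rows. The plan format, the operator names (`scan`, `filter`, `project`, `limit`), and the sample table are all hypothetical; the slides do not show Drill's actual logical plan syntax, and this is not it.

```python
def run_plan(plan, tables):
    """Interpret a list of operator steps in order and return the resulting rows."""
    rows = []
    for step in plan:
        op = step["op"]
        if op == "scan":
            rows = list(tables[step["table"]])
        elif op == "filter":
            rows = [r for r in rows if r[step["field"]] == step["equals"]]
        elif op == "project":
            rows = [{f: r[f] for f in step["fields"]} for r in rows]
        elif op == "limit":
            rows = rows[: step["count"]]
        else:
            raise ValueError(f"unknown operator: {op}")
    return rows


# Hypothetical table and plan, roughly "SELECT user FROM events WHERE type = 'buy' LIMIT 1".
tables = {
    "events": [
        {"user": "jane", "type": "view"},
        {"user": "jane", "type": "buy"},
        {"user": "omar", "type": "buy"},
    ]
}
plan = [
    {"op": "scan", "table": "events"},
    {"op": "filter", "field": "type", "equals": "buy"},
    {"op": "project", "fields": ["user"]},
    {"op": "limit", "count": 1},
]
result = run_plan(plan, tables)
```

Keeping the plan as plain data separate from the interpreter is what lets different front ends (SQL or another query language) target the same execution layer.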
Conclusion & Questions
• Put Apache Drill on your roadmap, we’ll make your life
  easier

• Join the community
   – Code: http://github.com/apache/incubator-drill
   – Mailing List: drill-user@incubator.apache.org
   – Wiki: https://cwiki.apache.org/confluence/display/DRILL

• Access this presentation: http://bit.ly/Wo6DLd

• Contact Me:
   – jacques.drill@gmail.com

GBDC 2013-01-28
