Harnessing the Hadoop Ecosystem – Optimizations in Apache Hive

Presented by Jason Huang, Sr. Solutions Architect at Qubole
At the New York Data Summit, May 12, 2015


1. H104: Harnessing the Hadoop Ecosystem – Optimizations in Apache Hive
   Jason Huang, Senior Solutions Architect – Qubole, Inc.
   May 12, 2015, NYC Data Summit Hadoop Day
2. A little bit about Qubole
   Ashish Thusoo, Founder & CEO; Joydeep Sen Sarma, Founder & CTO
   Founded in 2011 by the pioneers of “big data” @ Facebook and the creators of the Apache Hive project. Based in Mountain View, CA, with offices in Bangalore, India. Investments by Charles River, LightSpeed, Norwest Ventures. Named to the 2015 CNBC Disruptor 50 – announced today! World-class product and engineering team.
3. Hive – SQL on Hadoop
   ●  A system for managing and querying unstructured data as if it were structured
   ●  Uses MapReduce for execution
   ●  HDFS for storage (or Amazon S3)
   ●  Key building principles:
      ●  SQL as a familiar data warehousing tool
      ●  Extensibility (pluggable map/reduce scripts in the language of your choice, rich and user-defined data types, user-defined functions)
      ●  Interoperability (an extensible framework to support different file and data formats)
      ●  Performance
4. Why Hive?
   ●  Problem: unlimited data – terabytes every day
   ●  Wide adoption of Hadoop – scalable/available
   ●  But Hadoop can be…
      ●  Complex
      ●  A different paradigm
      ●  MapReduce is hard to program
5. Qubole DataFlow Diagram
   [Architecture diagram: users reach Qubole’s AWS account via the Qubole UI in a browser, the SDK, or ODBC over the REST API (HTTPS). Qubole’s account holds an ephemeral web tier, an encrypted result cache, RDS with Qubole user/account configurations (encrypted credentials), and the default Hive metastore. In the customer’s AWS account, Qubole manages ephemeral Hadoop clusters (master/slave with encrypted HDFS) over SSH and reads/writes Amazon S3 with S3 server-side encryption; an optional custom Hive metastore and other sources (RDS, Redshift) can also be attached.]
   Encryption options:
   a) Qubole can encrypt the result cache
   b) Qubole supports encryption of the ephemeral drives used for HDFS
   c) Qubole supports S3 server-side encryption
6. De-normalizing data
   Normalization:
   -  models data tables with certain rules to deal with redundancy
   -  creates multiple relational tables
   -  requires joins at runtime to produce results
   Joins are expensive, difficult operations and are one of the most common causes of performance issues. Because of this, it is a good idea to avoid highly normalized table structures, since they force join queries just to derive the desired metrics (see the sketch below).
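   A minimal sketch of this idea in HiveQL; the orders, customers, and orders_denorm tables and their columns are hypothetical names, not from the deck:
   -- Normalized layout: every report has to join orders with customers.
   --   SELECT c.region, SUM(o.amount)
   --   FROM orders o JOIN customers c ON (o.customer_id = c.customer_id)
   --   GROUP BY c.region;
   -- De-normalized layout: copy the frequently used customer attributes into
   -- the fact table once, so routine queries can skip the join.
   CREATE TABLE orders_denorm STORED AS ORC AS
   SELECT o.order_id, o.amount, o.customer_id, c.region, c.segment
   FROM orders o JOIN customers c ON (o.customer_id = c.customer_id);
   SELECT region, SUM(amount) FROM orders_denorm GROUP BY region;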
7. Partitioning tables
   Hive partitioning is an effective way to improve query performance on larger tables. The partition key is best chosen as a low-cardinality attribute.
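   A minimal sketch of date-based partitioning in HiveQL; the page_views and raw_page_views tables and their columns are hypothetical:
   -- Partition by date so queries that filter on dt read only the matching directories.
   CREATE TABLE page_views (user_id BIGINT, url STRING)
   PARTITIONED BY (dt STRING)
   STORED AS ORC;
   -- Dynamic-partition inserts need these settings.
   SET hive.exec.dynamic.partition = true;
   SET hive.exec.dynamic.partition.mode = nonstrict;
   INSERT OVERWRITE TABLE page_views PARTITION (dt)
   SELECT user_id, url, dt FROM raw_page_views;
   -- Only the 2015-05-12 partition is scanned here.
   SELECT COUNT(*) FROM page_views WHERE dt = '2015-05-12';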
8. Bucketing
   Improves join performance when the bucket key and the join keys are the same columns.
9. Bucketing
   -  improves join performance when the bucket key and the join keys are the same columns
   -  distributes the data into different buckets based on the hash of the bucket key
   -  reduces I/O scans during the join when the join happens on the same keys (columns)
   Note: set the bucketing flag (hive.enforce.bucketing) each time before writing data to a bucketed table. To leverage bucketing in a join, set hive.optimize.bucketmapjoin=true; this hints to Hive to do a bucket-level join during the map-stage join (see the sketch below).
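   A minimal sketch of bucketed tables and a bucket map join in HiveQL; the users, clicks, and raw_clicks tables are hypothetical:
   -- Both tables are bucketed on the join key into the same number of buckets.
   CREATE TABLE users (user_id BIGINT, country STRING)
   CLUSTERED BY (user_id) INTO 32 BUCKETS STORED AS ORC;
   CREATE TABLE clicks (user_id BIGINT, url STRING)
   CLUSTERED BY (user_id) INTO 32 BUCKETS STORED AS ORC;
   -- Enforce bucketing while loading the bucketed table.
   SET hive.enforce.bucketing = true;
   INSERT OVERWRITE TABLE clicks SELECT user_id, url FROM raw_clicks;
   -- Ask Hive for a bucket-level join during the map stage.
   SET hive.optimize.bucketmapjoin = true;
   SELECT c.url, u.country
   FROM clicks c JOIN users u ON (c.user_id = u.user_id);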
10. Map join
   Very efficient when the table on the other side of the join is small enough to fit in memory (see the sketch below).
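   A minimal sketch of letting Hive convert a join into a map join; the sales and dim_stores tables are hypothetical, and the size threshold is illustrative:
   -- Convert the join to a map join automatically when the small table
   -- fits under the size threshold (in bytes).
   SET hive.auto.convert.join = true;
   SET hive.mapjoin.smalltable.filesize = 25000000;
   -- dim_stores is loaded into memory on each mapper, so the
   -- shuffle/reduce phase of the join is skipped.
   SELECT s.store_id, d.store_name, SUM(s.amount)
   FROM sales s JOIN dim_stores d ON (s.store_id = d.store_id)
   GROUP BY s.store_id, d.store_name;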
11. File input formats
   -  play a critical role in Hive performance
   -  e.g. JSON and other text-based input formats are not a good choice for a large production system where data volume is really high
   -  human-readable formats take a lot of space and carry parsing overhead (e.g. JSON parsing)
   To address these problems, Hive comes with columnar input formats such as RCFile and ORC. Columnar formats reduce read operations in queries by allowing each column to be accessed individually. Other binary formats such as Avro, SequenceFile, and Thrift can be effective in various use cases (see the sketch below).
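   A minimal sketch of rewriting a text/JSON-backed table into ORC; the events_json and events_orc tables and their columns are hypothetical:
   -- Raw data lands as text (e.g. JSON lines) in events_json; rewrite it
   -- into a columnar ORC table for analytical queries.
   CREATE TABLE events_orc (event_id BIGINT, event_type STRING, ts STRING)
   STORED AS ORC
   TBLPROPERTIES ('orc.compress' = 'SNAPPY');
   INSERT OVERWRITE TABLE events_orc
   SELECT event_id, event_type, ts FROM events_json;
   -- Only the event_type column is read from disk for this query.
   SELECT event_type, COUNT(*) FROM events_orc GROUP BY event_type;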
12. Compress map/reduce output
   -  reduces the intermediate data volume
   -  reduces the amount of data transferred between mappers and reducers over the network
   Note: gzip-compressed files are not splittable, so apply with caution. File sizes should not be larger than a few hundred megabytes, otherwise the job can become imbalanced. Compression codec options include Snappy, LZO, bzip2, etc.
   For map output compression: set mapred.compress.map.output=true
   For job output compression: set mapred.output.compress=true
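   A minimal sketch of the corresponding session settings, assuming the Snappy codec is installed on the cluster:
   -- Compress intermediate map output with Snappy.
   SET mapred.compress.map.output = true;
   SET mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
   -- Compress the final job output as well.
   SET mapred.output.compress = true;
   SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;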
13. Parallel execution
   Hadoop can execute MapReduce jobs in parallel, and Hive queries whose stages are independent of each other can automatically use this parallelism (see the sketch below).
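   A minimal sketch of enabling parallel execution of independent stages; the thread count is illustrative:
   -- Run independent stages of a query (e.g. the branches of a UNION ALL)
   -- as concurrent MapReduce jobs.
   SET hive.exec.parallel = true;
   SET hive.exec.parallel.thread.number = 8;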
14. Vectorization
   -  allows Hive to process a batch of rows in ORC format together instead of processing one row at a time
   Each batch consists of column vectors, which are usually arrays of primitive types. Operations are performed on the entire column vector, which improves instruction pipelining and cache usage.
   To enable: set hive.vectorized.execution.enabled=true
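   A minimal sketch, reusing the hypothetical ORC table from the file-format example above:
   -- Vectorized execution applies to queries over ORC data.
   SET hive.vectorized.execution.enabled = true;
   SELECT event_type, COUNT(*) FROM events_orc GROUP BY event_type;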
15. Sampling
   -  allows users to take a subset of a dataset and analyze it without having to analyze the entire dataset
   Hive offers a built-in TABLESAMPLE clause that allows you to sample your tables. TABLESAMPLE can sample at various granularity levels:
   -  return only a subset of buckets (bucket sampling)
   -  HDFS blocks (block sampling)
   -  the first N records from each input split
   Alternatively, you can implement your own UDF that filters out records according to your sampling algorithm.
16. Sampling on Buckets
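   A minimal sketch of bucket and block sampling with TABLESAMPLE, reusing the hypothetical clicks table bucketed on user_id:
   -- Bucket sampling: read 1 of the 32 buckets, hashed on user_id.
   SELECT COUNT(*) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
   -- Block sampling: read roughly 10 percent of the table's HDFS blocks.
   SELECT COUNT(*) FROM clicks TABLESAMPLE (10 PERCENT);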
17. Unit testing
   -  In Hive, you can unit test UDFs, SerDes, streaming scripts, Hive queries, and more.
   -  Verify the correctness of your whole HiveQL query without touching a Hadoop cluster.
   -  Executing a HiveQL query in local mode takes literally seconds, compared to the minutes, hours, or days it can take in Hadoop mode.
   Various tools are available, e.g. HiveRunner, Hive_test, and Beetest.
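   A minimal sketch of switching a session to local mode for quick test runs; the thresholds shown are illustrative:
   -- Let Hive run small jobs locally instead of submitting them to the cluster.
   SET hive.exec.mode.local.auto = true;
   SET hive.exec.mode.local.auto.inputbytes.max = 134217728;
   SET hive.exec.mode.local.auto.input.files.max = 4;
   -- A small test query over a sample table then completes in seconds.
   SELECT event_type, COUNT(*) FROM events_orc GROUP BY event_type LIMIT 10;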
18. Qubole Data Service
19.
20. Use Cases and Additional Information
21. “Qubole has enabled more users within Pinterest to get to the data and has made the data platform a lot more scalable and stable.”
   – Mohammad Shahangian, Lead, Data Science and Infrastructure
   User and query growth: moved to Qubole from Amazon EMR because of stability, and rapidly expanded big data usage beyond developers by giving more users access to the data (240 users out of a 600-person company).
   Use cases: rapid expansion in use cases ranging from ETL, search, and ad hoc querying to product analytics; rock-solid infrastructure sees 50% fewer failures compared to AWS Elastic MapReduce; enterprise-scale processing and data access.
22. “We needed something that was reliable and easy to learn, setup, use and put into production without the risk and high expectations that comes with committing millions of dollars in upfront investment. Qubole was that thing.”
   – Marc Rosen, Sr. Director, Data Analytics
   [Chart: number of commands per month, Aug-13 through Feb-14, on a 0–5,000 scale.]
   Moved to big data on the cloud (from internal Oracle clusters) because getting to analysis was much quicker than operating infrastructure themselves. Used to answer client queries and power client dashboards.
   Use cases:
   -  segment audiences based on their behavior, including topics such as user pathways and multi-dimensional recency analysis
   -  build customer profiles (both uni- and multivariate) across thousands of first-party (i.e., client CRM files) and third-party (i.e., demographic) segments
   -  simplify attribution insights showing the effects of upper-funnel prospecting on lower-funnel remarketing media strategies
23. Operations Analyst, Marketing Ops Analyst, Data Architect, Business Users, Product Support, Customer Support, Developer, Sales Ops, Product Managers, Data Infrastructure
   Links for more information:
   -  http://www.datacenterknowledge.com/archives/2015/04/02/hybrid-clouds-need-for-speed/
   -  http://engineering.pinterest.com/post/92742371919/powering-big-data-at-pinterest
   -  http://www.itbusinessedge.com/slideshows/six-details-your-big-data-provider-wont-tell-you.html
   -  http://www.marketwired.com/press-release/qubole-reports-rapid-adoption-of-its-self-service-big-data-analytics-platform-1990272.htm
