
HBase Global Indexing to support large-scale data ingestion at Uber

Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.

Data ingestion, in its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing data for efficient reading involves factoring in query patterns and partitioning data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data: whether it is append-only or updatable.

At Uber, we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. Ingesting such datasets requires a component that bookkeeps the data layout and annotates each incoming change with the location in HDFS where it should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, which duplicates data and breaks both data correctness and user queries. This component is key to scaling our ingestion jobs, which now handle more than 500 billion writes a day, and it must provide strong consistency and high throughput for index writes and reads.
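To make the bookkeeping concrete, here is a minimal, illustrative sketch (not Uber's actual code) of the kind of lookup a global index serves: given a record key, consult an HBase index table to decide whether the change is an insert or an update of an existing HDFS file. The table, column family, and qualifier names are assumptions for illustration; in practice such lookups would be batched rather than issued one Get at a time.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Table}
import org.apache.hadoop.hbase.util.Bytes

object GlobalIndexLookupSketch {
  // Illustrative schema: one row per record key, one cell holding the HDFS
  // file that currently contains the record.
  private val IndexFamily   = Bytes.toBytes("i")
  private val FileQualifier = Bytes.toBytes("file")

  /** Returns None for a brand-new record (insert) or Some(hdfsFile) for an
    * existing record that must be updated in place in that file. */
  def lookup(indexTable: Table, recordKey: String): Option[String] = {
    val get    = new Get(Bytes.toBytes(recordKey)).addColumn(IndexFamily, FileQualifier)
    val result = indexTable.get(get)
    if (result.isEmpty) None
    else Some(Bytes.toString(result.getValue(IndexFamily, FileQualifier)))
  }

  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val table = conn.getTable(TableName.valueOf("trips_index")) // hypothetical index table
      lookup(table, "trip-uuid-123") match {
        case Some(file) => println(s"update: rewrite $file with the new record version")
        case None       => println("insert: route the record to a new HDFS file")
      }
    } finally conn.close()
  }
}
```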

At Uber, we chose HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our ingestion jobs to more than 500 billion writes a day. In this talk, we will discuss data at Uber and expound on why we built the global index using Apache HBase and how this helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to load HFiles directly into the backend, circumventing the normal write path, when bootstrapping our ingestion tables to avoid QPS constraints; as well as other learnings from bringing this system up in production at the scale of data that Uber encounters daily.
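The bulk-loading approach referenced above matches HBase's standard "complete bulk load" path: pre-generated HFiles are handed directly to the region servers instead of going through puts, the WAL, and the memstore. A hedged sketch using the HBase 1.x client API (the loader class moved to the org.apache.hadoop.hbase.tool package in HBase 2.x); the table name and directory path are placeholders:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles

object BulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    try {
      val tableName = TableName.valueOf("trips_index")      // hypothetical index table
      // Directory of pre-generated HFiles, laid out as <dir>/<columnFamily>/<hfile>.
      val hfileDir  = new Path("/tmp/global-index-hfiles")  // placeholder path
      // Assign each HFile to the region whose key range covers it, bypassing
      // the normal write path (and its QPS limits) entirely.
      new LoadIncrementalHFiles(conf).doBulkLoad(
        hfileDir,
        conn.getAdmin,
        conn.getTable(tableName),
        conn.getRegionLocator(tableName))
    } finally conn.close()
  }
}
```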

HBase Global Indexing to support large-scale data ingestion at Uber

  1. HBase Global Indexing to support large-scale data ingestion @ Uber, May 21, 2019
  2. Danny Chen dannyc@uber.com ● Engineering Manager on Hadoop Data Platform team ● Leading Data Ingestion team ● Previously worked on storage team (Manhattan) ● Enjoys playing basketball, biking, and spending time w/ my kids.
  3. Uber Apache Hadoop Platform Team Mission Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Apache Hadoop ecosystem. Apache Hadoop is either a registered trademark or trademark of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of this mark.
  4. Overview ● High-Level Ingestion & Dispersal introduction ● Different types of workloads ● Need for Global Index ● How Global Index Works ● Generating Global Indexes with HFiles ● Throttling HBase Access ● Next Steps
  5. High Level Ingestion/Dispersal Introduction
  6. Hadoop Data Ecosystem at Uber Apache Hadoop Data Lake Schemaless Analytical Processing Apache Kafka, Cassandra, Spark, and HDFS logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks. Data Ingestion Data Dispersal
  7. Hadoop Data Ecosystem at Uber
  8. Different Types of Workloads
  9. Bootstrap ● One time only at beginning of lifecycle ● Large amounts of data ● Millions of QPS throughput ● Need to finish in a matter of hours ● NoSQL stores cannot keep up
  10. Incremental ● Dominates lifecycle of Hive table ingestion ● Incremental upstream changes from Kafka or other data sources ● 1,000s of QPS per dataset ● Reasonable throughput requirements for NoSQL stores
  11. Cell vs Row Changes
  12. Need for Global Index
  13. Requirements for Global Index ● Large amounts of historical data ingested in short amount of time ● Append only vs Append-plus-update ● Data layout and partitioning ● Bookkeeping for data layout ● Strong consistency ● High throughput ● Horizontally scalable ● Required a NoSQL store
  14. ● Decision was to use HBase ● Trade Availability for Consistency ● Automatic rebalancing of HBase tables via region splitting ● Global view of dataset via master/slave architecture VS
  15. How Global Index Works
  16. Generating Global Indexes
  17. Batch and One Time Index Upload
  18. Data Model For Global Index
  19. Spark & RDD Transformations for index generation
  20. HFile Upload Process
  21. HFile Index Job Tuning ● Explicitly register classes with Kryo serialization (see the sketch after the slide list) ● Reduce 3 shuffle stages to one ● Proper HFile size ● Proper partition count ● 13 TB index data with 54 billion indexes ○ 2 hours to generate indexes ○ 10 min to load
  22. Throttling HBase Access
  23. The need for throttling HBase Access
  24. Horizontal Scalability & Throttling
  25. Next Steps
  26. Next Steps ● Handle non-append-only data during bootstrap ● Explore other indexing solutions
  27. Useful Links https://github.com/uber/marmaray https://github.com/uber/hudi https://eng.uber.com/data-partitioning-global-indexing/ https://eng.uber.com/uber-big-data-platform/ https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
  28. Other Dataworks Summit Talks Marmaray: Uber’s Open-sourced Generic Hadoop Data Ingestion and Dispersal Framework Wednesday at 11 am
  29. Attribution Kaushik Devarajaiah Nishith Agarwal Jing Li
  30. Positions available: Seattle, Palo Alto & San Francisco email: hadoop-platform-jobs@uber.com We are hiring!
  31. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. Questions: email ospo@uber.com Follow our Facebook page: www.facebook.com/uberopensource
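As referenced in slide 21, one of the HFile index job tunings is explicit Kryo class registration. A minimal sketch of what that Spark configuration could look like; the exact class list is an assumption for illustration, not the one used in Uber's job:

```scala
import org.apache.spark.SparkConf
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

object IndexJobKryoConfigSketch {
  // Register the types actually shuffled by the index-generation job so Kryo
  // does not write fully qualified class names with every record, and fail
  // fast if an unregistered class slips into a shuffle.
  def sparkConf(): SparkConf =
    new SparkConf()
      .setAppName("hfile-index-generation")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[ImmutableBytesWritable], // key type expected by HFileOutputFormat2
        classOf[KeyValue]                // HBase cell written into each HFile
      ))
}
```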

Editor's notes

  • Lots of effort went into making a completely self-serve onboarding process
    Analytical users with little technical knowledge of Spark, Hadoop, Hive, etc. will still be able to take advantage of our platform


    Our assertion is that when relevant data is discoverable in the appropriate data stores for analytical purposes, there can be substantial gains in efficiency and value for your business. Marmaray is critical for ensuring data is in the appropriate data store.
    Familiarity with the suite of tools in our Hadoop ecosystem covers many potential use cases for extracting insights out of raw data
  • Completion of the Hadoop Ecosystem of tools at Uber and original vision of the Data Processing Platform
    Heatpipe/Watchtower produce quality schematized data
    Ingest the data via Marmaray
    Orchestrate jobs via Workflow Management System to run analytics and generate derived datasets, or build models using Michelangelo
    Disperse the data using Marmaray to stores with low latency semantics

    What sets it apart
    Generic ingestion framework
    Not tightly coupled to any source or sink

    Shouldn’t be coupled to a specific source or a specific sink (product teams focus on this)
  • Dividing bootstrap and incremental allows us to choose a KV store that can scale for incremental-phase indexing but not necessarily for bootstrapping of data.
  • HBase automatically rebalances tables within a cluster by splitting up key ranges when a region gets too large.

    Can also load balance by having new regions moved to other servers

    The master-slave architecture enables getting a global view of the spread of a dataset across the cluster, which we utilize in customizing dataset specific throughputs to our HBase cluster.
  • During incremental ingestion
    We work in mini batches. It is the job of the work unit calculator to provide the required level of throttling
  • We work in mini batches. It is the job of the work unit calculator to provide the required level of throttling
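    A rough, hypothetical sketch of what a work unit calculator could do for this throttling; the names and sizing formula are invented for illustration and are not Marmaray's actual API:

```scala
// One mini batch's slice of the upstream backlog (e.g. a Kafka offset range).
final case class WorkUnit(startOffset: Long, endOffset: Long)

object WorkUnitCalculatorSketch {
  /** Cap the records taken in one mini batch so that the index lookups they
    * trigger stay within this dataset's share of HBase QPS. */
  def nextWorkUnit(backlogStart: Long,
                   backlogEnd: Long,
                   allottedQps: Long,     // dataset's granted queries per second
                   batchSeconds: Long): WorkUnit = {
    val maxRecords = allottedQps * batchSeconds
    WorkUnit(backlogStart, math.min(backlogEnd, backlogStart + maxRecords))
  }
}
```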
  • Our Big Data ecosystem’s model of indexes stored in HBase contains entities shown in green that help identify files that need to be updated corresponding to a given record in an append-plus-update dataset.

    The layout of index entries in HFiles lets us sort based on key value and column.
  • This is for the one time upload case



    The flatMapToPair transformation in Apache Spark does not preserve the ordering of entries, so a partition-isolated sort is performed. The partitioning is left unchanged to ensure each partition still corresponds to a non-overlapping key range.
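    A sketch of the partition-isolated sort described above, assuming the RDD carries an HBase row key plus the KeyValue cell destined for the HFile; this is illustrative, not the job's actual code:

```scala
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.RDD

object PartitionIsolatedSortSketch {
  /** Sort each partition in isolation: HFileOutputFormat2 requires cells in
    * key order, and we must not reshuffle because every partition already
    * maps to a non-overlapping HBase key range. */
  def sortWithinPartitions(
      entries: RDD[(ImmutableBytesWritable, KeyValue)]
  ): RDD[(ImmutableBytesWritable, KeyValue)] =
    entries.mapPartitions(
      iter => iter.toArray.sortWith { case ((a, _), (b, _)) => a.compareTo(b) < 0 }.iterator,
      preservesPartitioning = true // keep the existing non-overlapping ranges
    )
}
```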
  • HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.

    - HFile upload can be severely affected by splitting
    - Pre-split the HBase table into as many regions as there are HFiles so each HFile fits within a region
    - We avoid splitting HFiles based on HFile size, since splitting severely impacts HFile upload time; with pre-splitting, upload takes about 10 min even for tens of TB
    - Done by pre-splitting the HBase table so each HFile fits within a separate HBase region with non-overlapping keys
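    A hedged sketch of that pre-splitting step, using the HBase 1.x-style admin API; the column family name and the derivation of split points from HFile start keys are assumptions for illustration:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object PresplitIndexTableSketch {
  /** Create the index table with one region per prepared HFile so that no
    * HFile straddles a region boundary and the bulk load never splits files. */
  def createPresplit(tableName: String, hfileStartKeys: Array[Array[Byte]]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val desc = new HTableDescriptor(TableName.valueOf(tableName))
      desc.addFamily(new HColumnDescriptor("i")) // illustrative column family
      // Split points = start keys of every HFile except the first (assumed
      // already sorted), giving one region per HFile with non-overlapping ranges.
      val splits = hfileStartKeys.drop(1)
      val admin  = conn.getAdmin
      if (splits.isEmpty) admin.createTable(desc)
      else admin.createTable(desc, splits)
    } finally conn.close()
  }
}
```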

  • HFiles are written to the cluster where HBase is hosted to ensure HBase region servers have access to them during the upload process.
  • Three Apache Spark jobs corresponding to three different datasets access their respective HBase index tables, creating load on the HBase region servers hosting these tables.
  • Adding more servers to the HBase cluster correlates linearly with a QPS increase for a single dataset using the global index, even though the dataset's QPSFraction remains constant.
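    The scaling arithmetic implied here, as a hedged sketch: each dataset keeps a fixed QPSFraction of total cluster capacity, so its granted QPS grows linearly with the number of region servers. The constants and the exact formula are illustrative, not Uber's production values.

```scala
object QpsAllotmentSketch {
  /** Granted QPS for one dataset = cluster capacity * its fixed fraction. */
  def allottedQps(regionServers: Int,
                  qpsPerRegionServer: Long, // measured per-server index throughput
                  qpsFraction: Double): Long =
    (regionServers * qpsPerRegionServer * qpsFraction).toLong

  def main(args: Array[String]): Unit = {
    // Doubling the region servers doubles the dataset's budget even though
    // its QPSFraction stays constant at 0.3 (numbers are illustrative).
    println(allottedQps(10, 10000L, 0.3)) // 30000
    println(allottedQps(20, 10000L, 0.3)) // 60000
  }
}
```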
  • Explore other indexing solutions to possibly merge bootstrap and incremental indexing solutions for easier maintenance.
