Friction-free ETL: Automating data transformation with Impala | Strata + Hadoop World London 2015

Speaker: Marcel Kornacker

As data is ingested into Apache Hadoop at an increasing rate from a diverse range of data sources, it is becoming more and more important for users that new data be accessible for analysis as quickly as possible—because “data freshness” can have a direct impact on business results.

In the traditional ETL process, raw data is transformed from the source into a target schema, possibly requiring flattening and condensing, and then loaded into an MPP DBMS. However, this approach has multiple drawbacks that make it unsuitable for real-time, "at-source" analytics: the "ETL lag" reduces data freshness, and the inherent complexity of the process makes it costly to deploy and maintain and slows the introduction of new analytic applications.

In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics—including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines. This performance level is attained in spite of the most challenging and diverse input formats, which are addressed through an automated background conversion process into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem.

In this talk, attendees will learn about Impala’s upcoming features that will enable at-source analytics: support for nested structures such as JSON and XML documents, which allows direct querying of the source schema; automated background file format conversion into Parquet, the high-performance, open source columnar format that has been widely adopted across the Hadoop ecosystem; and automated creation of declaratively-specified derived data for simplified data cleansing and transformation.

1. Friction-Free ELT: Automating Data Transformation with Cloudera Impala
Marcel Kornacker // Cloudera, Inc.

2. Outline
• Evolution of ELT in the context of analytics
  • traditional systems
  • Hadoop today
• Cloudera's vision for ELT
  • make most of it disappear
  • automate data transformation

3. Traditional ELT
• Extract: physical extraction from the source data store
  • could be an RDBMS acting as an operational data store
  • or log data materialized as JSON
• Load: the targeted analytic DBMS converts the transformed data into its binary format (typically columnar)
• Transform:
  • data cleansing and standardization
  • conversion of naturally complex/nested data into a flat relational schema

4. Traditional ELT
• Three aspects to the traditional ELT process:
  1. semantic transformation, such as data standardization/cleansing
     » makes data more queryable, adds value
  2. representational transformation: from source to target schema (from complex/nested to flat relational)
     » a "lateral" transformation that doesn't change semantics, adds operational overhead
  3. data movement: from source to staging area to target system
     » adds yet more operational overhead

5. Traditional ELT
• The goals of "analytics with no ELT":
  • simplify aspect 1
  • eliminate aspects 2 and 3

6. ELT with Hadoop Today
A typical ELT workflow with Hadoop looks like this: raw source data initially lands in HDFS (examples: text/XML/JSON log files) …

[Diagram: raw Text/XML/JSON files landing in HDFS]

7. ELT with Hadoop Today
… map that data into a table to make it queryable …

  CREATE TABLE RawLogData (…)
  ROW FORMAT DELIMITED
  LOCATION '/raw-log-data/';

[Diagram: the raw Text/XML/JSON files, labeled "Original Data"]

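To make the elided DDL concrete, here is a minimal sketch of such a raw-table mapping for tab-delimited text logs; the column list, delimiter, and EXTERNAL keyword are illustrative assumptions, not from the slide:

  -- Hypothetical raw log columns and delimiter, for illustration only.
  CREATE EXTERNAL TABLE RawLogData (
    event_time STRING,
    user_id BIGINT,
    url STRING,
    referrer STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/raw-log-data/';
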
8. ELT with Hadoop Today
… create the target table at a different location …

  CREATE TABLE LogData (…) STORED AS PARQUET
  LOCATION '/log-data/';

[Diagram: the Original Data (Text/XML/JSON) alongside the new Parquet location]

9. ELT with Hadoop Today
… convert the raw source data to the target format …

  INSERT INTO LogData SELECT * FROM RawLogData;

[Diagram: the Original Data (Text/XML/JSON) converted into Parquet]

10. ELT with Hadoop Today
… the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search)

[Diagram: both the Original Data and the Parquet data queried by Impala, Hive, Spark, Pig]

11. ELT with Hadoop Today
• Compared to traditional ELT, this has several advantages:
  • Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data
  • data does not need to be moved between multiple platforms/clusters
  • data in the raw source format is queryable as soon as it lands, although at reduced performance compared to an optimized columnar data format

12. ELT with Hadoop Today
• However, even this still has drawbacks:
  • new data needs to be loaded periodically into the target table
  • doing that reliably and within SLAs can be a challenge
  • you now have two tables: one with current but slow data, another with lagging but fast data

[Diagram: the Original Data (Text/XML/JSON) feeding Parquet via "another user-managed data pipeline!", queried by Impala, Hive, Spark, Pig]

13. A Vision for Analytics with No ELT
• No explicit loading/conversion step to move raw data into a target table
• A single view of the data that is up-to-date and (mostly) in an efficient columnar format
• automated with custom logic

[Diagram: Text/XML/JSON converted to Parquet automatically; Impala, Hive, Spark, Pig query a single view]

14. A Vision for Analytics with No ELT
• support for complex/nested schemas
  » avoid remapping of raw data into a flat relational schema
• background and incremental data conversion
  » retain an in-place single view of the entire data set, with most data being in an efficient format
• bonus: schema inference and schema evolution
  » start analyzing data as soon as it arrives, regardless of its complexity

15. Support for Complex Schemas in Impala
• Standard relational: all columns have scalar values: CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.
• Complex types: structs, arrays, maps; in essence, a nested relational schema
• Supported file formats: Parquet, JSON, XML, Avro
• Design principle for SQL extensions: maintain SQL's way of dealing with multi-valued data

16. Support for Complex Schemas in Impala
Sample schema:

  CREATE TABLE Customers (
    cid BIGINT,
    address STRUCT {
      street STRING,
      zip INT,
      city STRING
    },
    orders ARRAY<STRUCT {
      oid BIGINT,
      total DECIMAL(9, 2),
      items ARRAY<STRUCT {
        iid BIGINT,
        qty INT,
        price DECIMAL(9, 2)
      }>
    }>
  )

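The braces above are the slide's presentation shorthand. The complex-type syntax Impala eventually shipped (2.3+) spells struct fields as name:type pairs inside angle brackets, and complex types require Parquet storage; a sketch of the equivalent DDL under that assumption:

  -- Same schema in Impala's shipped complex-type syntax (2.3+).
  CREATE TABLE Customers (
    cid BIGINT,
    address STRUCT<street:STRING, zip:INT, city:STRING>,
    orders ARRAY<STRUCT<
      oid:BIGINT,
      total:DECIMAL(9,2),
      items:ARRAY<STRUCT<iid:BIGINT, qty:INT, price:DECIMAL(9,2)>>
    >>
  )
  STORED AS PARQUET;
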
17. SQL Syntax Extensions: Structs
• Paths in expressions: generalization of column references to drill through structs
• Can appear anywhere a conventional column reference is legal

  SELECT address.zip, c.address.street
  FROM Customers c
  WHERE address.city = 'Oakland'

18. SQL Syntax Extensions: Arrays and Maps
• Basic idea: nested collections are referenced like tables in the FROM clause

  SELECT SUM(i.price * i.qty)
  FROM Customers.orders.items i
  WHERE i.price > 10

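Two small follow-on sketches under the same model. The first keeps parent columns available by correlating the collection with its parent alias; the second shows map access, assuming a hypothetical tags MAP<STRING, STRING> column that is not part of the sample schema:

  -- Correlate the collection with its parent to keep parent columns.
  SELECT c.cid, SUM(i.price * i.qty)
  FROM Customers c, c.orders.items i
  GROUP BY c.cid;

  -- Map entries are exposed through key/value pseudo-columns.
  -- tags is a hypothetical MAP<STRING, STRING> column, for illustration.
  SELECT c.cid, t.key, t.value
  FROM Customers c, c.tags t;
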
19. SQL Syntax Extensions: Arrays and Maps
• Parent/child relationships don't require join predicates

  SELECT c.cid, o.total
  FROM Customers c, c.orders o

identical to

  SELECT c.cid, o.total
  FROM Customers c INNER JOIN c.orders o

20. SQL Syntax Extensions: Arrays and Maps
• All join types are supported (INNER, OUTER, SEMI, ANTI)
• Advanced querying capabilities via correlated inline views and subqueries

  SELECT c.cid
  FROM Customers c LEFT ANTI JOIN c.orders o

  SELECT c.cid
  FROM Customers c
  WHERE EXISTS (SELECT * FROM c.orders.items WHERE iid = 117)

21. SQL Syntax Extensions: Arrays and Maps
• Aggregates over collections are concise and easy to read

  SELECT c.cid, COUNT(c.orders), AVG(c.orders.items.price)
  FROM Customers c

instead of

  SELECT c.cid, a, b
  FROM Customers c,
    (SELECT COUNT(*) AS a FROM c.orders),
    (SELECT AVG(price) AS b FROM c.orders.items)

22. SQL Syntax Extensions: Arrays and Maps
• Aggregates over collections can replace subqueries

  SELECT c.cid
  FROM Customers c
  WHERE COUNT(c.orders) > 10

instead of

  SELECT c.cid
  FROM Customers c
  WHERE (SELECT COUNT(*) FROM c.orders) > 10

23. Complex Schemas with Parquet
• Parquet optimizes storage of complex schemas by inlining structural data into the columnar data (repetition/definition levels)
• No performance penalty for accessing nested collections!
• Parquet translates logical hierarchy into physical collocation:
  • you don't need parent-child joins
  • queries run faster!

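A rough sketch of the repetition/definition-level idea (hand-worked Dremel-style encoding, not output from any tool; it assumes orders and items are the only repeated fields on the path). Suppose customer 1 has two orders with item prices (5.00, 8.00) and (3.00), and customer 2 has no orders. The column Customers.orders.items.price then stores each value with two small integers from which the nesting is reconstructed without parent-child joins:

  price  rep  def
  5.00   0    2    (rep 0: starts a new customer; def 2: orders and items both present)
  8.00   2    2    (rep 2: next item within the same order)
  3.00   1    2    (rep 1: first item of the next order)
  NULL   0    0    (def 0: customer 2 has no orders at all)
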
24. Automated Data Conversion
• Automated data conversion takes care of
  • format conversion: from text/JSON/Avro/… to Parquet
  • compaction: multiple smaller files are coalesced into a single larger one
• Sample workflow:
  • create a table for the data:

      CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;

  • attach data to the table:

      LOAD DATA INPATH '/raw-log-data/file1' INTO LogData SOURCE FORMAT JSON;

25. Automated Data Conversion
[Diagram: JSON files converted to Parquet in the background; Impala, Hive, Spark, Pig see a single-table view]

26. Automated Data Conversion
• Conversion process:
  • atomic: the switch from the source to the target data files is atomic from the perspective of a running query (but any running query sees the full data set)
  • redundant: with the option to retain the original data
  • incremental: Impala's catalog service automatically detects new data files that are not in the target format

27. Automating Data Transformation: Derived Tables
• Specify the data transformation via a "view" definition:
  • the derived table is expressed as Select, Project, Join, Aggregation
  • custom logic is incorporated via UDFs, UDAs

  CREATE DERIVED TABLE CleanLogData AS
  SELECT StandardizeName(name), StandardizeAddr(addr, zip), …
  FROM LogData
  STORED AS PARQUET;

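For contrast, a sketch of how the same transformation is written today without derived tables: a one-shot CTAS snapshot that the user must re-run whenever LogData grows, which is exactly the pipeline work derived tables are meant to automate (the StandardizeName/StandardizeAddr UDFs are the slide's, and hypothetical):

  -- Static snapshot: correct at creation time, stale afterwards.
  CREATE TABLE CleanLogData STORED AS PARQUET AS
  SELECT StandardizeName(name) AS name,
         StandardizeAddr(addr, zip) AS addr
  FROM LogData;
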
28. Automating Data Transformation: Derived Tables
• From the user's perspective:
  • the table is queryable like any other table (but doesn't allow INSERTs)
  • it reflects all data visible in the source tables at the time of the query (not: at the time of CREATE)
  • performance is close to that of a table created with CREATE TABLE … AS SELECT (i.e., that of a static snapshot)

29. Automating Data Transformation: Derived Tables
• From the system's perspective:
  • the table is the union of
    • physically materialized data, derived from the input tables as of some point in the past
    • a view over the yet-unprocessed data of the input tables
  • the table is updated incrementally (and in the background) when new data is added to the input tables

30. Automating Data Transformation: Derived Tables
[Diagram: base table updates flow into materialized data plus a delta query over new data; Impala, Hive, Spark, Pig see a single-table view]

31. Schema Inference and Schema Evolution
• Schema inference from data files is useful to reduce the barrier to analyzing complex source data
  • as an example, log data often has hundreds of fields
  • the time required to create the DDL manually is substantial
• Example: schema inference from structured data files
  • available today:

      CREATE TABLE LogData LIKE PARQUET '/log-data.pq'

  • future formats: XML, JSON, Avro

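As a usage note: once a table has been created this way, the inferred schema can be inspected immediately, which is a quick sanity check before querying; a minimal sketch assuming the LogData table created above:

  -- Inspect the inferred column names and types.
  DESCRIBE LogData;
  -- Or retrieve the full generated DDL.
  SHOW CREATE TABLE LogData;
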
32. Schema Inference and Schema Evolution
• Schema evolution:
  • a necessary follow-on to schema inference: every schema evolves over time, and explicit maintenance is as time-consuming as the initial creation
  • algorithmic schema evolution requires sticking to generally safe schema modifications, namely adding new fields:
    • adding new top-level columns
    • adding fields within structs

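For scale, the manual counterpart of one such safe modification is a hand-written ALTER TABLE per new field, which is what the automated expansion on the next slide replaces; the column name below is hypothetical:

  -- Manual equivalent of one safe evolution step; column name is hypothetical.
  ALTER TABLE LogData ADD COLUMNS (session_id STRING);
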
33. Schema Inference and Schema Evolution
• Example workflow:

      LOAD DATA INPATH '/path' INTO LogData
      SOURCE FORMAT JSON WITH SCHEMA EXPANSION;

  • scans the data to determine the new columns/fields to add
  • synchronous: if there is an error, the load is aborted and the user is notified

34. Timeline of Features in Impala
• CREATE TABLE … LIKE <File>:
  • available today for Parquet
  • Impala 2.3+ for Avro, JSON, XML
• Nested types: Impala 2.3
• Background format conversion: Impala 2.4
• Derived tables: beyond Impala 2.4

35. Conclusion
• Hadoop offers a number of advantages over traditional multi-platform ETL solutions:
  • availability of all data sets on a single platform
  • data becomes accessible through SQL as soon as it lands
• However, this can be improved further:
  • a richer analytic SQL, extended to handle nested data
  • an automated background conversion process that preserves an up-to-date view of all data while providing BI-typical performance
  • a declarative transformation process that focuses on application semantics and removes operational complexity
  • simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive

36. Thank you
Marcel Kornacker | marcel@cloudera.com
