Sqoop 2 is Sqoop as a service. It focuses on ease of use, extensibility, and security. Recently, Sqoop 2 was refactored to handle generic data transfer needs.
3. Introduction to Sqoop 2
• Ease of use
– Provide a REST API and a Java API for easy integration
– Existing clients include a Hue UI and a command line client
• Extensible
– Provide a connector SDK and focus on pluggability
– Existing connectors include the Generic JDBC connector and the HDFS connector
• Security
– Emphasize separation of responsibilities
– Eventually have ACLs or RBAC
4. Life of a Request
• Client
– Talks to server over REST + JSON
– Does nothing but send requests
• Server
– Extracts metadata from data source
– Delegates to execution engine
– Does all the heavy lifting really
• MapReduce
– Parallelizes execution of the job
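To make the client/server split concrete, here is a minimal sketch of a client talking to the server over REST + JSON. The localhost host, port 12000, and the /sqoop/version path are assumptions for illustration, not a documented contract:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SqoopRestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical server location; adjust host/port for a real deployment.
        URL url = new URL("http://localhost:12000/sqoop/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON body; the client does nothing else
            }
        }
    }
}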
8. Connector Definitions
• Connectors define:
– How to connect to a data source
– How to extract data from a data source
– How to load data to a data source
public Importer getImporter(); // Supply extract method
public Exporter getExporter(); // Supply load method
public Class getConnectionConfigurationClass();
public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
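A hypothetical connector satisfying this surface might look like the sketch below; the stand-in interfaces, the JobType enum (standing in for MJob.Type), and the configuration bean names are all invented for illustration:

// Stand-ins for the SDK types named above; not the real Sqoop classes.
interface Importer { }
interface Exporter { }
enum JobType { IMPORT, EXPORT } // stands in for MJob.Type

public class ExampleConnector {
    public Importer getImporter() {
        return new Importer() { }; // would supply the extract logic
    }
    public Exporter getExporter() {
        return new Exporter() { }; // would supply the load logic
    }
    public Class<?> getConnectionConfigurationClass() {
        return ConnectionConfig.class;
    }
    public Class<?> getJobConfigurationClass(JobType type) {
        return type == JobType.IMPORT ? ImportJobConfig.class : ExportJobConfig.class;
    }
    // Hypothetical configuration beans the framework would introspect.
    public static class ConnectionConfig { public String connectionString; }
    public static class ImportJobConfig { public String tableName; }
    public static class ExportJobConfig { public String targetTable; }
}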
9. Intermediate Data Format
• Describe a single record as it moves through Sqoop
• Currently available
– CSV
col1,col2,col3,...
col1,col2,col3,...
...
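A sketch of what rendering one in-flight record in this CSV form might look like; the single-quote escaping of text fields is an assumption about the format, and all names here are illustrative:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvRecordSketch {
    // Render one record as a single CSV line; text fields assumed single-quoted.
    static String toCsvLine(List<Object> fields) {
        return fields.stream()
                .map(f -> f instanceof String
                        ? "'" + ((String) f).replace("'", "\\'") + "'"
                        : String.valueOf(f))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(Arrays.<Object>asList(1, "O'Reilly", 3.14)));
        // prints: 1,'O\'Reilly',3.14
    }
}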
10. What’s Wrong w/ Current Implementation?
• Treating Hadoop as a first-class citizen prevents transfers between components within the Hadoop ecosystem
– HBase to HDFS not supported
– HDFS to Accumulo not supported
• Hadoop ecosystem not well defined
– Accumulo was not considered part of the Hadoop ecosystem
– What’s next? Kafka?
11. Refactoring
• Connectors already define extractors and loaders
– Refactor the connector SDK
• Pull out HDFS integration to a connector
• Improve Schema integration
Transfer data from Connector A to Connector B
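The goal line above can be sketched as a tiny pipeline: the FROM connector's extractor produces records and the TO connector's loader consumes them. These interface names are hypothetical stand-ins, not the SDK's:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

interface Extractor { Iterator<List<Object>> extract(); }       // FROM side
interface Loader { void load(Iterator<List<Object>> records); } // TO side

public class GenericTransferSketch {
    static void transfer(Extractor from, Loader to) {
        to.load(from.extract()); // Connector A -> intermediate records -> Connector B
    }

    public static void main(String[] args) {
        Extractor from = () -> Arrays.asList(
                Arrays.<Object>asList(1, "a"),
                Arrays.<Object>asList(2, "b")).iterator();
        Loader to = records -> records.forEachRemaining(System.out::println);
        transfer(from, to);
    }
}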
12. Connector SDK
• Connectors assume all roles (any connector can be the FROM or the TO side)
• Add Direction for FROM and TO
• Initializers and destroyers for both directions
Connector responsibilities
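A sketch of what the direction-aware SDK surface might look like; the enum and method names below are assumptions modeled on the bullets above, using plain Runnable as a stand-in for the real initializer and destroyer types:

enum Direction { FROM, TO }

public class DirectionAwareConnector {
    // One initializer/destroyer hook per direction, per the bullets above.
    public Runnable getInitializer(Direction d) {
        return () -> System.out.println("initialize for " + d);
    }
    public Runnable getDestroyer(Direction d) {
        return () -> System.out.println("clean up for " + d);
    }

    public static void main(String[] args) {
        DirectionAwareConnector c = new DirectionAwareConnector();
        c.getInitializer(Direction.FROM).run();
        c.getDestroyer(Direction.TO).run();
    }
}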
13. HDFS Connector
• Move Hadoop role to connector
• Schemaless
• Data formats
– Text (CSV)
– Sequence
– etc.
14. Schema Improvements
• Schema per connector
• Intermediate data format (IDF) has a Schema
• Introduce matcher
• Schema represents data as it moves through the system
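A minimal sketch of a Schema as an ordered, named list of columns that travels with the IDF; the column type strings and all class names here are illustrative, not the Sqoop schema API:

import java.util.ArrayList;
import java.util.List;

public class SchemaSketch {
    static class Column {
        final String name, type;
        Column(String name, String type) { this.name = name; this.type = type; }
    }

    static class Schema {
        final String name;
        final List<Column> columns = new ArrayList<>();
        Schema(String name) { this.name = name; }
        Schema addColumn(String colName, String type) {
            columns.add(new Column(colName, type));
            return this;
        }
    }

    public static void main(String[] args) {
        // The FROM connector describes its output; the IDF carries this along.
        Schema from = new Schema("employees")
                .addColumn("id", "FixedPoint")
                .addColumn("name", "Text");
        System.out.println(from.name + ": " + from.columns.size() + " columns");
    }
}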
15. Matcher
• Matcher ensures data goes to right place
• Combinations
– FROM and TO schema
– FROM schema
– TO schema
– No schema = Error
16. Matcher
• Matcher types: Location, Name, User defined
• The Location matcher ensures that the FROM schema matches the TO schema by the index location of each column in the Schema
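A sketch of index-based (Location) matching: the i-th FROM column feeds the i-th TO column. The class and method names are hypothetical:

import java.util.Arrays;
import java.util.List;

public class LocationMatcherSketch {
    // Copy the i-th FROM column into the i-th TO column.
    static Object[] matchByLocation(List<Object> fromRecord, int toWidth) {
        Object[] toRecord = new Object[toWidth];
        for (int i = 0; i < toWidth && i < fromRecord.size(); i++) {
            toRecord[i] = fromRecord.get(i); // same index, same column
        }
        return toRecord;
    }

    public static void main(String[] args) {
        List<Object> from = Arrays.<Object>asList(1, "alice", true);
        System.out.println(Arrays.toString(matchByLocation(from, 3)));
        // prints: [1, alice, true]
    }
}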