Sqoop 2 is Sqoop as a service. It focuses on ease of use, extensibility, and security. Recently, Sqoop 2 was refactored to handle generic data transfer needs.
3. Introduction to Sqoop 2
• Ease of use
– Provide a REST API and a Java API for easy integration
– Existing clients include a Hue UI and a command line client
• Extensible
– Provide a connector SDK and focus on pluggability
– Existing connectors include the Generic JDBC connector and the HDFS connector
• Security
– Emphasize separation of responsibilities
– Eventually have ACLs or RBAC
4. Life of a Request
• Client
– Talks to server over REST + JSON
– Does nothing but send requests
• Server
– Extracts metadata from data source
– Delegates to execution engine
– Does all the heavy lifting really
• MapReduce
– Parallelizes execution of the job
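To make the client/server split concrete, here is a minimal sketch of a client talking to the server over REST + JSON. The localhost host, port 12000, and the /sqoop/version path are assumptions for illustration, not a documented contract:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SqoopRestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical server location; adjust host/port for a real deployment.
        URL url = new URL("http://localhost:12000/sqoop/version");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON body; the client does nothing else
            }
        }
    }
}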
8. Connector Definitions
• Connectors define:
– How to connect to a data source
– How to extract data from a data source
– How to load data to a data source
public Importer getImporter(); // Supply extract method
public Exporter getExporter(); // Supply load method
public Class getConnectionConfigurationClass();
public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
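A hypothetical connector satisfying this surface might look like the sketch below; the stand-in interfaces, the JobType enum (standing in for MJob.Type), and the configuration bean names are all invented for illustration:

// Stand-ins for the SDK types named above; not the real Sqoop classes.
interface Importer { }
interface Exporter { }
enum JobType { IMPORT, EXPORT } // stands in for MJob.Type

public class ExampleConnector {
    public Importer getImporter() {
        return new Importer() { }; // would supply the extract logic
    }
    public Exporter getExporter() {
        return new Exporter() { }; // would supply the load logic
    }
    public Class<?> getConnectionConfigurationClass() {
        return ConnectionConfig.class;
    }
    public Class<?> getJobConfigurationClass(JobType type) {
        return type == JobType.IMPORT ? ImportJobConfig.class : ExportJobConfig.class;
    }
    // Hypothetical configuration beans the framework would introspect.
    public static class ConnectionConfig { public String connectionString; }
    public static class ImportJobConfig { public String tableName; }
    public static class ExportJobConfig { public String targetTable; }
}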
9. Intermediate Data Format
• Describe a single record as it moves through Sqoop
• Currently available
– CSV
col1,col2,col3,...
col1,col2,col3,...
...
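A sketch of what rendering one in-flight record in this CSV form might look like; the single-quote escaping of text fields is an assumption about the format, and all names here are illustrative:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvRecordSketch {
    // Render one record as a single CSV line; text fields assumed single-quoted.
    static String toCsvLine(List<Object> fields) {
        return fields.stream()
                .map(f -> f instanceof String
                        ? "'" + ((String) f).replace("'", "\\'") + "'"
                        : String.valueOf(f))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(Arrays.<Object>asList(1, "O'Reilly", 3.14)));
        // prints: 1,'O\'Reilly',3.14
    }
}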
10. What’s Wrong w/ Current Implementation?
• Treating Hadoop as a first-class citizen prevents transfers between components within the Hadoop ecosystem
– HBase to HDFS not supported
– HDFS to Accumulo not supported
• Hadoop ecosystem not well defined
– Accumulo was not considered part of the Hadoop ecosystem
– What’s next? Kafka?
11. Refactoring
• Connectors already define extractors and loaders
– Refactor the connector SDK
• Pull out HDFS integration to a connector
• Improve Schema integration
Transfer data from Connector A to Connector B
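The goal line above can be sketched as a tiny pipeline: the FROM connector's extractor produces records and the TO connector's loader consumes them. These interface names are hypothetical stand-ins, not the SDK's:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

interface Extractor { Iterator<List<Object>> extract(); }       // FROM side
interface Loader { void load(Iterator<List<Object>> records); } // TO side

public class GenericTransferSketch {
    static void transfer(Extractor from, Loader to) {
        to.load(from.extract()); // Connector A -> intermediate records -> Connector B
    }

    public static void main(String[] args) {
        Extractor from = () -> Arrays.asList(
                Arrays.<Object>asList(1, "a"),
                Arrays.<Object>asList(2, "b")).iterator();
        Loader to = records -> records.forEachRemaining(System.out::println);
        transfer(from, to);
    }
}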
12. Connector SDK
• Connectors assume all roles (any connector can be the FROM or the TO side)
• Add Direction for FROM and TO
• Initializers and destroyers for both directions
Connector responsibilities
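A sketch of what the direction-aware SDK surface might look like; the enum and method names below are assumptions modeled on the bullets above, using plain Runnable as a stand-in for the real initializer and destroyer types:

enum Direction { FROM, TO }

public class DirectionAwareConnector {
    // One initializer/destroyer hook per direction, per the bullets above.
    public Runnable getInitializer(Direction d) {
        return () -> System.out.println("initialize for " + d);
    }
    public Runnable getDestroyer(Direction d) {
        return () -> System.out.println("clean up for " + d);
    }

    public static void main(String[] args) {
        DirectionAwareConnector c = new DirectionAwareConnector();
        c.getInitializer(Direction.FROM).run();
        c.getDestroyer(Direction.TO).run();
    }
}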
13. HDFS Connector
• Move Hadoop role to connector
• Schemaless
• Data formats
– Text (CSV)
– Sequence
– etc.
14. Schema Improvements
• Schema per connector
• Intermediate data format (IDF) has a Schema
• Introduce matcher
• Schema represents data as it moves through the system
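A minimal sketch of a Schema as an ordered, named list of columns that travels with the IDF; the column type strings and all class names here are illustrative, not the Sqoop schema API:

import java.util.ArrayList;
import java.util.List;

public class SchemaSketch {
    static class Column {
        final String name, type;
        Column(String name, String type) { this.name = name; this.type = type; }
    }

    static class Schema {
        final String name;
        final List<Column> columns = new ArrayList<>();
        Schema(String name) { this.name = name; }
        Schema addColumn(String colName, String type) {
            columns.add(new Column(colName, type));
            return this;
        }
    }

    public static void main(String[] args) {
        // The FROM connector describes its output; the IDF carries this along.
        Schema from = new Schema("employees")
                .addColumn("id", "FixedPoint")
                .addColumn("name", "Text");
        System.out.println(from.name + ": " + from.columns.size() + " columns");
    }
}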
15. Matcher
• Matcher ensures data goes to right place
• Combinations
– FROM and TO schema
– FROM schema
– TO schema
– No schema = Error
16. Matcher
• Matcher types: Location, Name, User defined
• The Location matcher ensures that the FROM schema matches the TO schema by the index location of each column in the Schema
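A sketch of index-based (Location) matching: the i-th FROM column feeds the i-th TO column. The class and method names are hypothetical:

import java.util.Arrays;
import java.util.List;

public class LocationMatcherSketch {
    // Copy the i-th FROM column into the i-th TO column.
    static Object[] matchByLocation(List<Object> fromRecord, int toWidth) {
        Object[] toRecord = new Object[toWidth];
        for (int i = 0; i < toWidth && i < fromRecord.size(); i++) {
            toRecord[i] = fromRecord.get(i); // same index, same column
        }
        return toRecord;
    }

    public static void main(String[] args) {
        List<Object> from = Arrays.<Object>asList(1, "alice", true);
        System.out.println(Arrays.toString(matchByLocation(from, 3)));
        // prints: [1, alice, true]
    }
}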