Vineet Kumar
Data Migration with Spark
#UnifiedAnalytics #SparkAISummit
Why Data Migration?
• Business requirement - Migrate historical data into the data lake for analysis; this may require some light transformation.
• Data Science - This data can provide better business insight. More data -> better models -> better predictions.
• Enterprise Data Lake - There may be thousands of data files sitting in multiple source systems.
• EDW - Archive data from the EDW into the lake.
• Standards - Store data with some standards - for example: partition strategy, storage formats (Parquet, Avro, etc.).
Data Lake
1000s of files
- Text Files
- XML
- JSON
RDBMS
- EDW
- Data Archive
Migration Issues
• Schema for text files - header missing, or first line as a header.
• Data in files is neither consistent nor in the target standard format.
– Example – timestamp: Hive's standard format is yyyy-MM-dd HH:mm:ss
• Over time, the file structure(s) might have changed: some new columns added or removed.
• There could be thousands of files with different structures - separate ETL mapping/code for each file?
• Target table can be partitioned or non-partitioned.
• Source data size can range from a few megabytes to terabytes, and from a few columns to thousands of columns.
• Partitioned column can be one of the existing columns, or a custom column based on a value passed as an argument.
– partitionBy=partitionField, partitionValue=<<Value>>
– Example: partitionBy='ingestion_date' partitionValue='20190401'
Data Migration Approach
Create Spark/Hive context:
val sparksession = SparkSession
  .builder()
  .enableHiveSupport()
  .getOrCreate()
Text Files (mydata.csv):
val df = sparksession.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("mydata.csv")
XML Files:
val df = sparksession.read.format("com.databricks.spark.xml")
  .load("mydata.xml")
RDBMS: read via a JDBC call.
Write:
<<transformation logic>>
df.write.mode("append").insertInto(HivetableName)
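The RDBMS path is only named on the slide; a minimal sketch of that JDBC read might look like the following. The URL, table, and credential names are assumptions, not from the original deck:
val jdbcDF = sparksession.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//edw-host:1521/EDW")  // assumed connection string
  .option("dbtable", "edw.history_table")                  // assumed source table
  .option("user", dbUser)                                  // hypothetical credentials
  .option("password", dbPassword)
  .load()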
Wait… Schema ?
Text Files – how to get schema?
• File header is not present in the file.
• Spark can infer the schema, but what about column names?
• Data spans the last couple of years, and the file format has evolved.
scala> df.columns
res01: Array[String] = Array(_c0, _c1, _c2)
XML, JSON or RDBMS sources
• Schema present in the source.
• Spark can automatically infer schema from the source.
scala> df.columns
res02: Array[String] = Array(id, name, address)
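For self-describing sources this inference is direct; a minimal sketch for JSON (the file name is an assumption):
// Spark reads column names and types from the JSON documents themselves.
val jsonDF = sparksession.read.json("mydata.json")
jsonDF.printSchema()   // e.g. id, name, address inferred from the source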
Schema for Text files
• Option 1: File header exists in the first line.
• Option 2: File header from an external file – JSON.
• Option 3: Create an empty table that corresponds to the CSV file structure.
• Option 4: Define the schema with StructType or a case class (sketch below).
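Option 4 is not expanded in the deck; a minimal sketch, assuming the same four columns as file1.csv in the next slide, could be:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Assumed column names/types matching the pipe-delimited sample file.
val fileSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("name", StringType),
  StructField("address", StringType),
  StructField("dob", StringType)   // kept as string; converted to timestamp later
))

val df = sparksession.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .schema(fileSchema)
  .load("file1.csv")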
Schema for CSV files : Option 3
• Create an empty table that corresponds to the text file structure
– Example – Text file : file1.csv
1|John|100 street1,NY|10/20/1974
– Create a Hive structure corresponding to the file : file1_structure
create table file1_raw(id string, name string, address string, dob timestamp)
– Map the Dataframe columns to the structure from the previous step
val hiveSchema = sparksession.sql("select * from file1_raw where 1=0")
val dfColumns = hiveSchema.schema.fieldNames.toList
// You should check that the column count matches before mapping (see the sketch below)
val textFileWithSchemaDF = df.toDF(dfColumns: _*)
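A minimal sketch of that column-count check, assuming you simply want to fail fast on a mismatch before the toDF call above:
// Guard against files whose layout no longer matches the Hive structure.
require(df.columns.length == dfColumns.length,
  s"file1.csv has ${df.columns.length} columns, file1_raw expects ${dfColumns.length}")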
Before - Column Names :
scala> df.show()
+---+-----+--------------+----------+
|_c0|  _c1|           _c2|       _c3|
+---+-----+--------------+----------+
|  1| John|100 street1,NY|10/20/1975|
|  2|Chris|Main Street,KY|10/20/1975|
|  3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
After :
scala> df2.show()
+---+-----+--------------+----------+
| id| name|       address|       dob|
+---+-----+--------------+----------+
|  1| John|100 street1,NY|10/20/1975|
|  2|Chris|Main Street,KY|10/20/1975|
|  3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
Dates/timestamp ?
• Historical files - Date format can change over time.
Files from year 2004:
1|John|100 street1,NY|10/20/1974
2|Chris|Main Street,KY|10/01/1975
3|Marry|park Avenue,TN|11/10/1972
…
…
Files from year 2018 onwards:
1|John|100 street1,NY|1975-10-02
2|Chris|Main Street,KY|2010-11-20|Louisville
3|Marry|park Avenue,TN|2018-04-01 10:20:01.001
Files have different formats:
File1 – dd/mm/yyyy
File2 – mm/dd/yyyy
File3 – yyyy/mm/dd:hh24:mi:ss
….
(Slide callouts: date format changed over time; new columns added; same file, but different date formats.)
Timestamp columns
• Target Hive Table:
id int,
name string,
dob timestamp,
address string,
move_in_date timestamp,
rent_due_date timestamp
PARTITION COLUMNS…
Timestamp columns can be at any location in the target table. Find these first:
for (i <- 0 to (hiveSchemaArray.length - 1)) {
  if (hiveSchemaArray(i).toString.contains("Timestamp")) {
    // column at position i has a timestamp type in the target table
  }
}
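The positional scan above follows the slide; an equivalent sketch that reads the types straight off the schema (hiveSchema is the empty-table Dataframe from Option 3) would be:
import org.apache.spark.sql.types.TimestampType

// Names of every timestamp column in the target table, in one pass.
val timestampColumns = hiveSchema.schema.fields
  .filter(_.dataType == TimestampType)
  .map(_.name)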
Transformation for Timestamp data
• Hive timestamp format - yyyy-MM-dd HH:mm:ss
• Create a UDF to return a valid date format. Input can be in any format.
val getHiveDateFormatUDF = udf(getValidDateFormat)
10/12/2019 → getHiveDateFormatUDF → 2019-10-12 00:00:00
UDF Logic for Date transformation
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// The list of input formats can be built from a file passed as an argument.
val inputFormats = Array("MM/dd/yyyy", "yyyyMMdd", …….)
val validHiveFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val inputDate = "10/12/2019"   // parameter to the function

for (format <- inputFormats) {
  try {
    val dateFormat = DateTimeFormatter.ofPattern(format)
    val newDate = LocalDateTime.parse(inputDate, dateFormat)
    newDate.format(validHiveFormat)
  } catch {
    case e: DateTimeParseException => null
  }
}
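Wrapped up as a complete function, a sketch might look like the following. Note that date-only patterns such as MM/dd/yyyy need LocalDate rather than LocalDateTime, and first-match-wins is an assumption:
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import org.apache.spark.sql.functions.udf

// Try each known input pattern; return the first successful parse in
// Hive format, or null (-> NULL in the table) if nothing matches.
def getValidDateFormat(inputDate: String): String = {
  inputFormats.view
    .flatMap { format =>
      try {
        Some(LocalDate.parse(inputDate, DateTimeFormatter.ofPattern(format))
          .atStartOfDay()
          .format(validHiveFormat))
      } catch {
        case _: DateTimeParseException => None   // fall through to the next pattern
      }
    }
    .headOption
    .orNull
}

val getHiveDateFormatUDF = udf(getValidDateFormat _)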
Date transformation – Hive format
• Find columns with a timestamp data type and apply the UDF to those columns
var newdf = df
for (i <- 0 to (hiveSchemaArray.length - 1)) {
  if (hiveSchemaArray(i).toString.contains("Timestamp")) {
    // Extract the column name from "StructField(name,type,nullable)"
    val field = hiveSchemaArray(i).toString.replace("StructField(", "").split(",")(0)
    val tempfield = field + "_tmp"
    newdf = newdf.withColumn(tempfield, getHiveDateFormatUDF(col(field)))
      .drop(field)
      .withColumnRenamed(tempfield, field)
  }
}
Before applying UDF :
scala> df2.show()
+---+-----+--------------+----------+
| id| name|       address|       dob|
+---+-----+--------------+----------+
|  1| John|100 street1,NY|10/20/1975|
|  2|Chris|Main Street,KY|10/20/1975|
|  3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
After applying date UDF :
scala> newdf.show()
+---+-----+--------------+-------------------+
| id| name|       address|                dob|
+---+-----+--------------+-------------------+
|  1| John|100 street1,NY|1975-10-20 00:00:00|
|  2|Chris|Main Street,KY|1975-10-20 00:00:00|
|  3|Marry|park Avenue,TN|1975-10-20 00:00:00|
+---+-----+--------------+-------------------+
Column/Data Element Position
• Spark Dataframe(df) format from text file: name at position 2, address at position 3
id| name| address| dob
--------------------------------
1|John|100 street1,NY|10/20/1974
df.write.mode("append").insertInto(HivetableName)
Or
val query = "INSERT OVERWRITE TABLE hivetable SELECT * FROM textFileDF"
sparksession.sql(query)
• Hive table format : address at position 2
• name is switched to address, and address to name:
id| address| name| dob
----------------------------------------------------------
1, John, 100 street1 NY, Null

File format:           Hive structure:
  Id int                 Id int
  Name string            Address string
  Address string         Name string
  Dob string             Dob timestamp
Column/Data Element Position
• Relational world
INSERT INTO hivetable(id,name,address)
SELECT id,name,address from <<textFiledataFrame>>
• Read target hive table structure and column position.
// Get the Hive columns with position from the target table
val hiveTableColumns = sparksession.sql("select * from hivetable where 1=0").columns
val columns = hiveTableColumns.map(x => col(x))
// Select the columns from the source data frame in table order and insert into the target table.
dfnew.select(columns:_*).write.mode("append").insertInto(tableName)
Partitioned Table.
• Target Hive tables can be partitioned or non-partitioned.
newDF.select(columns:_*).write.mode("append").insertInto(tableName)
• Partitioned daily, monthly or hourly.
• Partitioned by different columns.
Example:
Hivetable1 – Frequency daily – partition column – load_date
Hivetable2 – Frequency monthly – partition column – load_month
• Add the partition column as the last column before inserting into Hive:
val newDF2 = newDF.withColumn("load_date", lit(current_timestamp()))
Partitioned Table.
• Partition column is one of the existing fields from the file.
val hiveTableColumns = sparksession.sql("select * from hivetable where 1=0").columns
val columns = hiveTableColumns.map(x => col(x))
newDF.select(columns:_*).write.mode("append").insertInto(tableName)
• Partition is based on a custom field and value passed as arguments from the command line.
spark2-submit arguments : partitionBy="field1", partitionValue="1100"
• Add the partition column as the last column in the Dataframe:
val newDF2 = newDF.withColumn(partitionBy.trim().toLowerCase(), lit(partitionValue))
• Final step: before inserting into the Hive table
newDF2.select(columns:_*).write.mode("append").insertInto(tableName)
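The deck doesn't show how partitionBy/partitionValue reach the job; a hedged sketch of parsing them from the spark2-submit program arguments (the key=value convention and args from main(args: Array[String]) are assumptions):
// Parse "partitionBy=field1 partitionValue=1100"-style program arguments.
val argMap = args
  .map(_.split("=", 2))
  .collect { case Array(k, v) => k.trim -> v.trim }
  .toMap

val partitionBy    = argMap.getOrElse("partitionBy", "ingestion_date")  // assumed default
val partitionValue = argMap.getOrElse("partitionValue", "")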
Performance: Cluster Resources
• Migration runs in its own YARN pool.
• Large (file size > 10GB, or with 1000+ columns)
# repartition size = min(fileSize(MB)/256, 50)
# executors
# executor-cores
# executor-size
• Medium (file size 1–10GB, or with < 1000 columns)
# repartition size = min(fileSize(MB)/256, 20)
# executors
# executor-cores
# executor-size
• Small
# executors
# executor-cores
# executor-size
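A sketch of applying the sizing rule above before the write; fileSizeMB and the lower bound of 1 are assumptions:
// Pick the partition count from the file size, capped at 50 for large files.
val numPartitions = math.max(1, math.min((fileSizeMB / 256).toInt, 50))
val repartitionedDF = newDF2.repartition(numPartitions)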
Data Pipeline
Read datafile → Dataframe → Apply schema on the Dataframe from the Hive table corresponding to the text file → Perform transformations (timestamp conversion etc.) → Add partition column to the Dataframe → Write to Hive table (Parquet)
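Stitching the earlier snippets together, an end-to-end sketch of this pipeline (convertTimestamps is a hypothetical helper wrapping the UDF loop above; inputPath and the other names reuse earlier slides):
// 1. Read the raw, header-less text file.
val raw = sparksession.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .load(inputPath)

// 2. Apply column names taken from the empty Hive "structure" table.
val withSchema = raw.toDF(dfColumns: _*)

// 3. Normalize timestamp columns to Hive format (see the UDF loop above).
val transformed = convertTimestamps(withSchema)   // hypothetical helper

// 4. Add the partition column last, then write in Hive column order.
transformed
  .withColumn(partitionBy.trim().toLowerCase(), lit(partitionValue))
  .select(columns: _*)
  .write.mode("append")
  .insertInto(tableName)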
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT