Jennifer Shin
Founder, 8 Path Solutions LLC
Lecturer, UC Berkeley
Fuzzy Matching on Apache Spark
Agenda
• Intro to fuzzy matching: what you need to know
• Use Case: a fuzzy solution for surveys
• Fuzzy implementations: real world considerations
© 2017 8 Path Solutions LLC. All Rights Reserved.
Intro to Fuzzy Matching
What You Need To Know
Fuzzy Matching
(aka Approximate String Matching)
• process of finding strings that approximately match a given
pattern
• closeness of a match is measured in terms of an edit
distance, i.e. the number of operations necessary to convert
the string into an exact match.
Fuzzy Matching
The edit distance is the number of primitive operations
necessary to convert the string into an exact match.
Examples of primitive operations are:
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost
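The three primitive operations can be told apart mechanically. The sketch below is only an illustration: the function name `single_edit` is hypothetical and does not come from the deck. It reports which single operation, if any, turns one string into another.

```python
def single_edit(source, target):
    """Return the one primitive operation that turns source into target,
    or None if the strings are not exactly one edit apart."""
    if len(source) == len(target):
        # Same length: only a substitution can separate the strings.
        differences = sum(a != b for a, b in zip(source, target))
        return "substitution" if differences == 1 else None
    if abs(len(source) - len(target)) != 1:
        return None
    short, long_ = sorted((source, target), key=len)
    for i in range(len(long_)):
        # Dropping one character from the longer string must give the shorter.
        if long_[:i] + long_[i + 1:] == short:
            return "insertion" if len(target) > len(source) else "deletion"
    return None


for source, target in [("cot", "coat"), ("coat", "cot"), ("coat", "cost")]:
    print(source, "->", target, ":", single_edit(source, target))
```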
What is fuzzy matching?
• A fuzzy matching program is used to return a list of results that
are not necessarily an exact match for the term being searched
– search cab argument words
– spellings may not exactly match.
Why use fuzzy matching?
• Not all data is clean
• Not all formatting is consistent
• Not all databases are structured
• Not all text is correct
• People are not perfect
When can we use fuzzy matching?
• Case by case basis
• Data cleaning
• Entity/Name matching
• Recommendations
• Predictive text
Use Case
A Fuzzy Solution For Surveys
Data: Survey
• Comprehensive survey about attitudes, usage, purchases
• 6,000 products
• 20,000 variables
• 26 feed categories
Problem Description
A: Dental Floss: Light Users: 0-2 Times/Last 7 Days: Total Category
B: Dental Floss: Times/Last 7 Days: Light (0-2)
How similar is A to B?
A + B = how many new questions?
Old label: Anxiety/Panic Used a branded prescription remedy
New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
Word Based Comparison Model (WCM)
Old label: Anxiety/Panic Used a branded prescription remedy
New label: Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy
Score: 6 (good match)
Then set a threshold: a match with a score above 5 is a good match.
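The deck does not spell out how the WCM score is computed. As a minimal sketch, assuming the score is simply the number of distinct words the two labels share (split on whitespace, lowercased, with the ':' separators stripped), the example above can be reproduced as follows. The names `words` and `wcm_score` and the tokenization rule are illustrative assumptions, not the author's implementation.

```python
def words(label):
    """Split a label into lowercase words, dropping the ':' cell separators."""
    tokens = (token.strip(":").lower() for token in label.split())
    return {token for token in tokens if token}


def wcm_score(old_label, new_label):
    """Assumed WCM score: the number of distinct words the two labels share."""
    return len(words(old_label) & words(new_label))


old = "Anxiety/Panic Used a branded prescription remedy"
new = ("Ailments/Remedies: : Anxiety/Panic: In last 12 months: "
       "Used a branded prescription remedy")

THRESHOLD = 5  # from the slide: scores above 5 count as a good match
score = wcm_score(old, new)
print(score, "good match" if score > THRESHOLD else "no match")
```

Because every shared word counts equally under this assumption, two labels that differ only in a brand name or a number still share most of their words and score high.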
Word Based Comparison Model (WCM)
Label 1: Any air conditioner Amount spent : total :in last 12 months: $1000+
Label 2: Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+
By cells:
Label 1: anyairconditioneramountspent | total | inlast12months | $1000+
Label 2: shoesamountspentintotal | anynikeair | inlast12months | $1000+
Two cells do not have a match, even though most of the words do have matches.

Word Based Comparison Model (WCM)
Wrong matches due to changes of brand names:
Tires: Total Users: Bought in Last 12 Months: Hankook
Batteries: Total Users: Bought in Last 6 Months: Kodak
Score: 7
Prescription Brands - Used: : Evista (men only): In last 12 months
Prescription Brands - Used: : Avodart (men only): In last 12 months
Score: 9
A match with a score above 5 can be a wrong match!
Why does Word-based Comparison Model (WCM) perform so poorly?
Wrong matches due to different numbers:
Athletic Shoes - Amount Spent in Total: : Baseball/Softball shoes: In last 12 months: $75 - $149
Athletic Shoes - Amount Spent in Total: Baseball/Softball shoes: In last 12 months: $50 - $74
Score: 12
Athletic Shoes - Number of pairs bought: : Baseball/Softball shoes: In last 12 months: 2+
Athletic Shoes - Number of pairs bought: Baseball/Softball shoes: In last 12 months: 2
Score: 11
Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category
Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category
Score: 12
A match with a score above 5 can be a wrong match!
Criteria:
• Check if one cell is a subset of another cell.
• If all the cells in the shorter label can find their counterparts, a match is found.
Fuzzy Matching: Levenshtein distance
New Approach Proposed by Gan Song
• Levenshtein distance is a string metric for measuring the
difference between two sequences.
• Informally, the Levenshtein distance between two words is the
minimum number of single-character edits
(i.e. insertions, deletions or substitutions) required to change one word into the other.
Levenshtein Distance
smtchgy → smmtchg
Path 1 (2 edits): smtchgy → smmtchgy (insert 'm') → smmtchg (delete 'y')
Path 2 (5 edits): smtchgy → smmchgy (change 't' to 'm') → smmthgy (change 'c' to 't') → smmtcgy (change 'h' to 'c') → smmtchy (change 'g' to 'h') → smmtchg (change 'y' to 'g')
Levenshtein Distance
[Figure: the 24 orderings of the four cells H, O, A, N, listed for each of two labels that both start as H O A N]
Shuffle!
Find a match!
Cell-based Comparison Model (CCM)
Label: Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends
Cells: ['socialnetworkinglinkedinhowimportanttoyou', 'notatallimportant', 'keepintouchwithfamilyfriends']
Label: Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important
Cells: ['socialnetworkinglinkedincomhowimportanttoyou', 'keepintouchwithfamilyfriends', 'notatallimportant']
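A minimal sketch of how labels might be turned into cells like the ones above. It assumes that cells are separated by ':' and that each cell is normalized by lowercasing and dropping everything except letters, digits, '$' and '+'. The function name `to_cells` and the exact character rule are assumptions inferred from the slide examples, not a rule stated in the deck.

```python
import re


def to_cells(label):
    """Split a label on ':' and normalize each non-empty cell."""
    cells = []
    for part in label.split(":"):
        # Lowercase, then keep only letters, digits, '$' and '+' (assumed rule).
        cell = re.sub(r"[^a-z0-9$+]", "", part.lower())
        if cell:
            cells.append(cell)
    return cells


label = ("Social Networking – LinkedIn.com How important to you: : "
         "Keep in touch with family/friends: : Not at all Important")
print(to_cells(label))
```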
Levenshtein edit operations (Old cell → New cell)
Old 'socialnetworkinglinkedinhowimportanttoyou'
  → New 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 3, 'replace': 0, 'delete': 0}
  → New 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 21, 'delete': 13}
  → New 'notatallimportant': {'insert': 0, 'replace': 4, 'delete': 24}
Old 'notatallimportant'
  → New 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 27, 'replace': 4, 'delete': 0}
  → New 'keepintouchwithfamilyfriends': {'insert': 11, 'replace': 11, 'delete': 0}
  → New 'notatallimportant': {'insert': 0, 'replace': 0, 'delete': 0}
Old 'keepintouchwithfamilyfriends'
  → New 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 16, 'replace': 20, 'delete': 0}
  → New 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 0, 'delete': 0}
  → New 'notatallimportant': {'insert': 0, 'replace': 11, 'delete': 11}
Only a small number of insertions or deletions is accepted.
Any other combination of operations is rejected as a match.
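Operation counts like those above can be obtained by filling the usual Levenshtein table and walking back along one minimum-cost path. The following is a sketch under stated assumptions: `edit_op_counts`, `cells_match` and the cutoff `max_edits=3` are hypothetical names and values chosen for illustration. When several minimum-cost paths exist, the split between replacements and insert/delete pairs can differ from the slide's figures.

```python
def edit_op_counts(old, new):
    """Count insert/replace/delete operations on one minimum-cost edit path."""
    rows, cols = len(old) + 1, len(new) + 1
    # dist[i][j] is the Levenshtein distance between old[:i] and new[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if old[i - 1] == new[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete old[i-1]
                             dist[i][j - 1] + 1,         # insert new[j-1]
                             dist[i - 1][j - 1] + cost)  # keep or replace
    counts = {"insert": 0, "replace": 0, "delete": 0}
    i, j = len(old), len(new)
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0:
            cost = 0 if old[i - 1] == new[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["replace"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["delete"] += 1
            i -= 1
        else:
            counts["insert"] += 1
            j -= 1
    return counts


def cells_match(old, new, max_edits=3):
    """Accept only a few insertions or a few deletions (assumed cutoff)."""
    c = edit_op_counts(old, new)
    one_kind_only = c["replace"] == 0 and (c["insert"] == 0 or c["delete"] == 0)
    return one_kind_only and c["insert"] + c["delete"] <= max_edits


old_cell = "socialnetworkinglinkedinhowimportanttoyou"
new_cell = "socialnetworkinglinkedincomhowimportanttoyou"
print(edit_op_counts(old_cell, new_cell), cells_match(old_cell, new_cell))
```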
Process
1. Preprocess the labels
2. Remove duplicates
3. Compare the labels by using CCM
4. Find out good matches
5. Output the ‘old not in new’ and ‘new not in old’
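The five process steps can be sketched end to end in a few lines. This is an illustration only, not the deck's code: all function names, the normalization rule and the cutoff `max_extra=3` are assumptions. A cell pair is accepted when the shorter cell is a subsequence of the longer one and at most `max_extra` characters are missing, which corresponds to an edit path made only of insertions (or only of deletions).

```python
import re


def to_cells(label):
    """Step 1: split on ':' and normalize each cell (assumed rule)."""
    parts = (re.sub(r"[^a-z0-9$+]", "", p.lower()) for p in label.split(":"))
    return [p for p in parts if p]


def cells_match(a, b, max_extra=3):
    """True if the shorter cell becomes the longer one by inserting at most
    max_extra characters, so no substitutions are needed."""
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) > max_extra:
        return False
    remaining = iter(long_)
    return all(ch in remaining for ch in short)  # subsequence test


def labels_match(old, new, max_extra=3):
    """Step 3: every cell of the shorter label needs a counterpart."""
    short, long_ = sorted((to_cells(old), to_cells(new)), key=len)
    return all(any(cells_match(s, c, max_extra) for c in long_) for s in short)


def compare(old_labels, new_labels, max_extra=3):
    """Steps 2, 4 and 5: deduplicate, match, and report what is unmatched."""
    old_unique = list(dict.fromkeys(old_labels))  # keeps first occurrences
    new_unique = list(dict.fromkeys(new_labels))
    old_not_in_new = [o for o in old_unique
                      if not any(labels_match(o, n, max_extra) for n in new_unique)]
    new_not_in_old = [n for n in new_unique
                      if not any(labels_match(o, n, max_extra) for o in old_unique)]
    return old_not_in_new, new_not_in_old


old_labels = [
    "Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends",
    "Tires: Total Users: Bought in Last 12 Months: Hankook",
    "Tires: Total Users: Bought in Last 12 Months: Hankook",
]
new_labels = [
    "Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important",
    "Batteries: Total Users: Bought in Last 6 Months: Kodak",
]
old_only, new_only = compare(old_labels, new_labels)
print(old_only)
print(new_only)
```

Cell order is ignored on purpose, so the two LinkedIn labels match even though their last two cells are swapped.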
Fuzzy Implementations
Real World Considerations
Implementation Considerations
1. Data Suitability
2. Process Design
3. Validation Methodology
4. Computing Resources
Python
def levenshtein(s1, s2):
    # Keep s1 as the longer string so each row has len(s2) + 1 entries
    if len(s1) < len(s2):
        return levenshtein(s2, s1)
    if len(s2) == 0:
        return len(s1)
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            # j + 1 because the rows are one entry longer than s2
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]
Scala
def min(nums: Int*): Int = nums.min

def levenshtein(str1: String, str2: String): Int = {
  val lenStr1 = str1.length
  val lenStr2 = str2.length
  val d: Array[Array[Int]] = Array.ofDim(lenStr1 + 1, lenStr2 + 1)
  for (i <- 0 to lenStr1) d(i)(0) = i
  for (j <- 0 to lenStr2) d(0)(j) = j
  for (i <- 1 to lenStr1; j <- 1 to lenStr2) {
    val cost = if (str1(i - 1) == str2(j - 1)) 0 else 1
    d(i)(j) = min(
      d(i - 1)(j) + 1,       // deletion
      d(i)(j - 1) + 1,       // insertion
      d(i - 1)(j - 1) + cost // substitution
    )
  }
  d(lenStr1)(lenStr2)
}
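Spark
pyspark.sql.functions.levenshtein(left, right)
Computes the Levenshtein distance of the two given strings.
from pyspark.sql import SparkSession
from pyspark.sql.functions import levenshtein
spark = SparkSession.builder.getOrCreate()
word_1, word_2 = 'kitten', 'sitting'  # example values for <word 1> and <word 2>
df = spark.createDataFrame([(word_1, word_2)], ['l', 'r'])
df.select(levenshtein('l', 'r').alias('d')).collect()  # [Row(d=3)]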
Example: kitinmy vs. sitting
[('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
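The list above reads as a sequence of (operation, source position, target position) tuples. That is the format returned by the `editops` function of the python-Levenshtein package, although the slide does not say which tool produced it, so treat that reading as an assumption. Under it, the small helper below (the name `apply_editops` is hypothetical) replays the operations on the source string in pure Python.

```python
def apply_editops(ops, source, target):
    """Replay (operation, source_pos, target_pos) tuples on source."""
    result = []
    src_pos = 0
    for op, spos, tpos in ops:
        # Copy the untouched characters before this operation.
        result.extend(source[src_pos:spos])
        src_pos = spos
        if op == "replace":
            result.append(target[tpos])  # swap source[spos] for target[tpos]
            src_pos += 1
        elif op == "insert":
            result.append(target[tpos])  # add target[tpos] before source[spos]
        elif op == "delete":
            src_pos += 1                 # skip source[spos]
        else:
            raise ValueError(f"unknown operation: {op}")
    result.extend(source[src_pos:])      # copy whatever is left
    return "".join(result)


ops = [("replace", 0, 0), ("insert", 2, 2), ("delete", 5, 6), ("replace", 6, 6)]
print(apply_editops(ops, "kitinmy", "sitting"))
```

Four operations are replayed, so under this reading the edit distance between the two strings is 4.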
Example: Kitten vs Sitting
Example: Kitten vs Sitten
Jennifer Shin
jshin@8pathsolutions.com
Thank You.

Contenu connexe

Tendances

Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkDongwon Kim
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsDatabricks
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdfBOSupport
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language Biswanath Dutta
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...InfluxData
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 

Tendances (20)

Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Hive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmarkHive, Presto, and Spark on TPC-DS benchmark
Hive, Presto, and Spark on TPC-DS benchmark
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Performance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud EnvironmentsPerformance Analysis of Apache Spark and Presto in Cloud Environments
Performance Analysis of Apache Spark and Presto in Cloud Environments
 
ETL VS ELT.pdf
ETL VS ELT.pdfETL VS ELT.pdf
ETL VS ELT.pdf
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 

Similaire à Fuzzy Matching on Apache Spark with Jennifer Shin

Fuzzy Matching to the Rescue
Fuzzy Matching to the RescueFuzzy Matching to the Rescue
Fuzzy Matching to the RescueDomino Data Lab
 
GPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWSGPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWSAmazon Web Services
 
AWS Migration - General
AWS Migration - GeneralAWS Migration - General
AWS Migration - GeneralKenji Morooka
 
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...Amazon Web Services
 
ALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and MemoryALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and MemoryAmazon Web Services
 
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...Amazon Web Services
 
Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017Amazon Web Services
 
AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會Amazon Web Services
 
AWS Migration - As-Is Tool
AWS Migration - As-Is ToolAWS Migration - As-Is Tool
AWS Migration - As-Is ToolKenji Morooka
 
Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done" Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done" EBG Consulting, Inc.
 
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...Amazon Web Services
 
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)Rosenfeld Media
 
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdfGAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdfAmazon Web Services
 
Latam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americoLatam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americoSandro Borges
 
Quarterly Planning Deck
Quarterly Planning DeckQuarterly Planning Deck
Quarterly Planning Deckjessicawishart
 
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the HaystackGPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the HaystackAmazon Web Services
 
CityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques ArchipelagoCityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques ArchipelagoHernan Garcia
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAmazon Web Services Japan
 
AWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイドAWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイドAmazon Web Services Japan
 

Similaire à Fuzzy Matching on Apache Spark with Jennifer Shin (20)

Fuzzy Matching to the Rescue
Fuzzy Matching to the RescueFuzzy Matching to the Rescue
Fuzzy Matching to the Rescue
 
GPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWSGPSBUS206_Best Practices for Building a Partner Database Practice on AWS
GPSBUS206_Best Practices for Building a Partner Database Practice on AWS
 
AWS reInvent 2017 Recap Webinar
AWS reInvent 2017 Recap WebinarAWS reInvent 2017 Recap Webinar
AWS reInvent 2017 Recap Webinar
 
AWS Migration - General
AWS Migration - GeneralAWS Migration - General
AWS Migration - General
 
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
The Enterprise Fast Lane - What Your Competition Doesn't Want You to Know abo...
 
ALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and MemoryALX401-Advanced Alexa Skill Building Conversation and Memory
ALX401-Advanced Alexa Skill Building Conversation and Memory
 
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
Keynote: What Transformation Really Means for the Enterprise - AWS Transforma...
 
Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017Conversation and Memory - ALX401-R - re:Invent 2017
Conversation and Memory - ALX401-R - re:Invent 2017
 
AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會AWS reInvent Recap 線上研討會
AWS reInvent Recap 線上研討會
 
AWS Migration - As-Is Tool
AWS Migration - As-Is ToolAWS Migration - As-Is Tool
AWS Migration - As-Is Tool
 
Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done" Making Your User Stories "Ready" to Get to "Done"
Making Your User Stories "Ready" to Get to "Done"
 
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
MCL301_Building a Voice-Enabled Customer Service Chatbot Using Amazon Lex and...
 
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
Do More of This and Less of That (Sam Yen at Enterprise UX 2017)
 
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdfGAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
GAM311-How Linden Lab Built a Virtual World on the AWS Cloud.pdf
 
Latam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americoLatam virtual event_keynote-pt-br_americo
Latam virtual event_keynote-pt-br_americo
 
Quarterly Planning Deck
Quarterly Planning DeckQuarterly Planning Deck
Quarterly Planning Deck
 
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the HaystackGPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
GPSBUS216-GPS Applying AI-ML to Find Security Needles in the Haystack
 
CityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques ArchipelagoCityWallet - From Mount Augustus to Los Roques Archipelago
CityWallet - From Mount Augustus to Los Roques Archipelago
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLift
 
AWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイドAWS上でのオンラインゲームリリースガイド
AWS上でのオンラインゲームリリースガイド
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Fuzzy Matching on Apache Spark with Jennifer Shin

  • 1. Jennifer Shin Founder, 8 Path Solutions LLC Lecturer, UC Berkeley Fuzzy Matching on Apache Spark
  • 2. Agenda • Intro to fuzzy matching: what you need to know • Use Case: a fuzzy solution for surveys • Fuzzy implementations: real world considerations © 2017 8 Path Solutions LLC. All Rights Reserved.
  • 3. Intro to Fuzzy Matching What You Need To Know
  • 4. Fuzzy Matching (aka Approximate String Matching) • The process of finding strings that approximately match a given pattern • The closeness of a match is measured in terms of an edit distance, i.e. the number of operations necessary to convert the string into an exact match.
  • 5. Fuzzy Matching: The edit distance is the number of primitive operations necessary to convert the string into an exact match. Examples of primitive operations are: insertion: cot → coat; deletion: coat → cot; substitution: coat → cost.
  • 6. What is fuzzy matching? • A fuzzy matching program returns a list of results that are not an exact match for the term being searched – search cab argument words – spellings may not exactly match.
  • 7. Why use fuzzy matching? • Not all data is clean • Not all formatting is consistent • Not all databases are structured • Not all text is correct • People are not perfect
  • 8. When can we use fuzzy matching? • Case by case basis • Data cleaning
  • 9. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching
  • 10. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching • Recommendations
  • 11. When can we use fuzzy matching? • Case by case basis • Data cleaning • Entity/Name matching • Recommendations • Predictive text
  • 12. Use Case A Fuzzy Solution For Surveys
  • 13. Data: Survey • Comprehensive survey about attitudes, usage, purchases • 6,000 products • 20,000 variables • 26 feed categories
  • 14. Problem Description: Label A: "Dental Floss: Light Users: 0-2 Times/Last 7 Days: Total Category". Label B: "Dental Floss: Times/Last 7 Days: Light (0-2)". How similar is A to B? A + B = how many new questions?
  • 15. Word Based Comparison Model (WCM): Old label: "Anxiety/Panic Used a branded prescription remedy". New label: "Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy".
  • 16. Word Based Comparison Model (WCM): Old label: "Anxiety/Panic Used a branded prescription remedy". New label: "Ailments/Remedies: : Anxiety/Panic: In last 12 months: Used a branded prescription remedy". Score: 6, a good match. Then set a threshold: a match with a score above 5 is a good match.
  • 18. Word Based Comparison Model (WCM): Label 1: "Any air conditioner Amount spent : total :in last 12 months: $1000+". Label 2: "Shoes - Amount Spent in Total: any Nike air: In last 12 months: $1000+". By cells: [anyairconditioneramountspent, total, inlast12months, $1000+] vs. [shoesamountspentintotal, anynikeair, inlast12months, $1000+]. Two cells have no match, even though most of the words do have matches.
  • 19. Why does the Word-based Comparison Model (WCM) perform so poorly? Wrong matches due to changes of brand names: "Tires: Total Users: Bought in Last 12 Months: Hankook" vs. "Batteries: Total Users: Bought in Last 6 Months: Kodak"; "Prescription Brands - Used: : Evista (men only): In last 12 months" vs. "Prescription Brands - Used: : Avodart (men only): In last 12 months". Scores: 7, 9. A match with a score above 5 can be a wrong match!
  • 20. Why does the Word-based Comparison Model (WCM) perform so poorly? Wrong matches due to different numbers: "Athletic Shoes - Amount Spent in Total: : Baseball /Softball shoes: In last 12 months: $75 - $149" vs. "Athletic Shoes - Amount Spent in Total: Baseball /Softball shoes: In last 12 months: $50 - $74"; "Athletic Shoes - Number of pairs bought: : Baseball/Softball shoes: In last 12 months: 2+" vs. "Athletic Shoes - Number of pairs bought: Baseball/Softball shoes: In last 12 months: 2"; "Hair Tonic Or Dressing (Men): Heavy Users: 8+ Times/Last 7 Days: Total Category" vs. "Hair Tonic Or Dressing (Men): Heavy Users: 3+ Times/Last 7 Days: Total Category". Scores: 12, 11, 12. A match with a score above 5 can be a wrong match!
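The slides do not say how the WCM score is computed. As a minimal sketch, assume it counts the distinct words two labels share; `words`, `wcm_score` and `THRESHOLD` are hypothetical names introduced here for illustration only. Under that assumption the old/new label pair from the slides clears the threshold of 5, but so does the unrelated Tires/Batteries pair, which is the kind of false match described above.

```python
import re

def words(label):
    """Lowercase a label and split it into words, treating ':' as a separator."""
    return set(re.sub(r":", " ", label.lower()).split())

def wcm_score(old_label, new_label):
    """Assumed score (not confirmed by the slides): number of shared distinct words."""
    return len(words(old_label) & words(new_label))

THRESHOLD = 5  # the slides treat scores above 5 as a good match

old = "Anxiety/Panic Used a branded prescription remedy"
new = ("Ailments/Remedies: : Anxiety/Panic: In last 12 months: "
       "Used a branded prescription remedy")
tires = "Tires: Total Users: Bought in Last 12 Months: Hankook"
batteries = "Batteries: Total Users: Bought in Last 6 Months: Kodak"

# A genuine match is accepted ...
print(wcm_score(old, new) > THRESHOLD)
# ... but so is a pair about different products and brands.
print(wcm_score(tires, batteries) > THRESHOLD)
```

Counting shared words ignores which words differ, so a changed brand name or number barely moves the score.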
  • 21. New Approach Proposed by Gan Song: Criteria: • Check if one cell is a subset of another cell. • If all the cells in the shorter label can find their counterparts, a match is found. Fuzzy Matching: Levenshtein distance
  • 22. Levenshtein Distance • Levenshtein distance is a string metric for measuring the difference between two sequences. • Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
  • 23. Levenshtein Distance: smtchgy vs. smmtchg
  • 24. Levenshtein Distance: smtchgy vs. smmtchg. Two edits: smtchgy → smmtchgy → smmtchg (insert 'm', delete 'y'). Five edits: smtchgy → smmchgy → smmthgy → smmtcgy → smmtchy → smmtchg (change 't' to 'm', 'c' to 't', 'h' to 'c', 'g' to 'h', 'y' to 'g').
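The smtchgy/smmtchg example can be checked with a short dynamic-programming sketch that also reports which kinds of edits a shortest edit sequence uses. The function name `edit_op_counts` and the tie-breaking order in the backtrace are choices made for this illustration, not something taken from the slides.

```python
def edit_op_counts(source, target):
    """Count the inserts, replaces and deletes on one shortest edit path."""
    m, n = len(source), len(target)
    # d[i][j] = Levenshtein distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = source[i - 1] != target[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # delete from source
                          d[i][j - 1] + 1,         # insert into source
                          d[i - 1][j - 1] + cost)  # replace (or keep)
    # Walk back from the bottom-right corner to recover the operations.
    counts = {"insert": 0, "replace": 0, "delete": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1]:
            i, j = i - 1, j - 1                    # characters agree, no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            counts["replace"] += 1
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            counts["insert"] += 1
            j -= 1
        else:
            counts["delete"] += 1
            i -= 1
    return counts

print(edit_op_counts("smtchgy", "smmtchg"))
print(edit_op_counts("cot", "coat"))
```

For smtchgy and smmtchg the shortest path is the two-edit one (one insertion, one deletion), so the five-substitution route is not what the distance measures.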
  • 25. [Figure: grid of the permutations of the letters H, O, A, N] Shuffle!
  • 26. [Figure: grid of the permutations of the letters H, O, A, N] Find a match!
  • 27. Cell-based Comparison Model (CCM): Label 1: "Social Networking – LinkedIn How important to you: : Not at all Important :: Keep in touch with family/friends", cells ['socialnetworkinglinkedinhowimportanttoyou', 'notatallimportant', 'keepintouchwithfamilyfriends']. Label 2: "Social Networking – LinkedIn.com How important to you: : Keep in touch with family/friends: : Not at all Important", cells ['socialnetworkinglinkedincomhowimportanttoyou', 'keepintouchwithfamilyfriends', 'notatallimportant'].
  • 28. Levenshtein: edit operations needed to turn each old cell (row) into each new cell (column).
        Old cells: ['socialnetworkinglinkedinhowimportanttoyou', 'notatallimportant', 'keepintouchwithfamilyfriends']
        New cells: ['socialnetworkinglinkedincomhowimportanttoyou', 'keepintouchwithfamilyfriends', 'notatallimportant']
        Old 'socialnetworkinglinkedinhowimportanttoyou':
          vs. new 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 3, 'replace': 0, 'delete': 0}
          vs. new 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 21, 'delete': 13}
          vs. new 'notatallimportant': {'insert': 0, 'replace': 4, 'delete': 24}
        Old 'notatallimportant':
          vs. new 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 27, 'replace': 4, 'delete': 0}
          vs. new 'keepintouchwithfamilyfriends': {'insert': 11, 'replace': 11, 'delete': 0}
          vs. new 'notatallimportant': {'insert': 0, 'replace': 0, 'delete': 0}
        Old 'keepintouchwithfamilyfriends':
          vs. new 'socialnetworkinglinkedincomhowimportanttoyou': {'insert': 16, 'replace': 20, 'delete': 0}
          vs. new 'keepintouchwithfamilyfriends': {'insert': 0, 'replace': 0, 'delete': 0}
          vs. new 'notatallimportant': {'insert': 0, 'replace': 11, 'delete': 11}
        Only a small number of insertions or deletions is accepted. Any other combination of operations is rejected as a match.
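A minimal sketch of the cell-based comparison might look like the following. Several details are assumptions rather than the author's implementation: cells are obtained by splitting on ':' and keeping only letters, digits, '$' and '+'; "a small number of insertions or deletions" is read as "the shorter cell is a subsequence of the longer one and at most `max_edits` characters shorter"; and the default of 3 is picked only because it covers the "com" example. All function names are hypothetical.

```python
import re

def cells(label):
    """Split a label on ':' and normalize each cell (assumed normalization)."""
    parts = (re.sub(r"[^a-z0-9$+]", "", p.lower()) for p in label.split(":"))
    return [p for p in parts if p]

def is_subsequence(short, long):
    """True if short can be obtained from long by deletions only."""
    it = iter(long)
    return all(ch in it for ch in short)

def cells_match(a, b, max_edits=3):
    """Accept pure insertions or pure deletions, up to max_edits characters."""
    short, long = sorted((a, b), key=len)
    return len(long) - len(short) <= max_edits and is_subsequence(short, long)

def labels_match(label_a, label_b, max_edits=3):
    """Every cell of the shorter label must find its own counterpart."""
    shorter, longer = sorted((cells(label_a), cells(label_b)), key=len)
    remaining = list(longer)
    for cell in shorter:
        hit = next((c for c in remaining if cells_match(cell, c, max_edits)), None)
        if hit is None:
            return False
        remaining.remove(hit)  # each counterpart can be used only once
    return True

old = ("Social Networking – LinkedIn How important to you: : "
       "Not at all Important :: Keep in touch with family/friends")
new = ("Social Networking – LinkedIn.com How important to you: : "
       "Keep in touch with family/friends: : Not at all Important")
print(cells(old))
print(labels_match(old, new))
```

Because cells are compared as whole units and in any order, reordered cells still match, while a cell that needs substitutions (a different brand, for example) does not. The greedy pairing is a simplification; note also that under this reading a cell such as '2+' would still match '2', so numeric cells may need stricter handling.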
  • 30. Process: 1. Preprocess the labels 2. Remove duplicates 3. Compare the labels by using CCM 4. Find out good matches 5. Output the ‘old not in new’ and ‘new not in old’
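The five steps can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the pipeline used in the talk: `preprocess`, `same_cells` and `compare_label_sets` are hypothetical names, the normalization is assumed, and `same_cells` is a simple stand-in for the CCM comparison that can be swapped for a fuzzier matcher through the `is_match` parameter. The sample labels are shortened, made-up variants of the ones in the slides.

```python
import re

def preprocess(label):
    """Step 1: normalize a label into a tuple of cells (assumed normalization)."""
    parts = (re.sub(r"[^a-z0-9$+]", "", p.lower()) for p in label.split(":"))
    return tuple(p for p in parts if p)

def same_cells(old_cells, new_cells):
    """Stand-in for the CCM comparison: the same cells in any order."""
    return sorted(old_cells) == sorted(new_cells)

def compare_label_sets(old_labels, new_labels, is_match=same_cells):
    # Steps 1-2: preprocess, then remove duplicates while keeping first-seen order.
    old = list(dict.fromkeys(preprocess(label) for label in old_labels))
    new = list(dict.fromkeys(preprocess(label) for label in new_labels))
    # Steps 3-4: compare every old label with every new label, keep the matches.
    matches = [(o, n) for o in old for n in new if is_match(o, n)]
    matched_old = {o for o, _ in matches}
    matched_new = {n for _, n in matches}
    # Step 5: report what is left unmatched on each side.
    old_not_in_new = [o for o in old if o not in matched_old]
    new_not_in_old = [n for n in new if n not in matched_new]
    return matches, old_not_in_new, new_not_in_old

old_labels = ["Dental Floss: Light Users", "Dental Floss: Light Users", "Tires: Hankook"]
new_labels = ["Light Users: Dental Floss", "Batteries: Kodak"]
matches, old_only, new_only = compare_label_sets(old_labels, new_labels)
print(matches)
print(old_only)
print(new_only)
```

Removing duplicates before the comparison matters because step 3 compares every old label with every new one, so the work grows with the product of the two list sizes.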
  • 32. Implementation Considerations: 1. Data Suitability 2. Process Design 3. Validation Methodology 4. Computing Resources
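  • 33. Python
        def levenshtein(s1, s2):
            # Make sure s1 is the longer string so each row is as short as possible.
            if len(s1) < len(s2):
                return levenshtein(s2, s1)
            # If the shorter string is empty, the distance is the length of the other.
            if len(s2) == 0:
                return len(s1)
            previous_row = range(len(s2) + 1)
            for i, c1 in enumerate(s1):
                current_row = [i + 1]
                for j, c2 in enumerate(s2):
                    # j + 1 instead of j since previous_row and current_row
                    # are one character longer than s2
                    insertions = previous_row[j + 1] + 1
                    deletions = current_row[j] + 1
                    substitutions = previous_row[j] + (c1 != c2)
                    current_row.append(min(insertions, deletions, substitutions))
                previous_row = current_row
            return previous_row[-1]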
  • 34. Scala
        def levenshtein(str1: String, str2: String): Int = {
          val lenStr1 = str1.length
          val lenStr2 = str2.length
          val d: Array[Array[Int]] = Array.ofDim(lenStr1 + 1, lenStr2 + 1)
          for (i <- 0 to lenStr1) d(i)(0) = i
          for (j <- 0 to lenStr2) d(0)(j) = j
          for (i <- 1 to lenStr1; j <- 1 to lenStr2) {
            val cost = if (str1(i - 1) == str2(j - 1)) 0 else 1
            d(i)(j) = min(
              d(i-1)(j  ) + 1,     // deletion
              d(i  )(j-1) + 1,     // insertion
              d(i-1)(j-1) + cost   // substitution
            )
          }
          d(lenStr1)(lenStr2)
        }
        def min(nums: Int*): Int = nums.min
  • 35. Spark: pyspark.sql.functions.levenshtein(left, right) computes the Levenshtein distance of the two given strings.
        from pyspark.sql.functions import levenshtein
        df = spark.createDataFrame([(<word 1>, <word 2>,)], ['l', 'r'])
        df.select(levenshtein('l', 'r').alias('d')).collect()
  • 36. Example: kitinmy vs. sitting: [('replace', 0, 0), ('insert', 2, 2), ('delete', 5, 6), ('replace', 6, 6)]
  • 37. Example: Kitten vs Sitting
  • 38. Example: Kitten vs Sitten