SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Benchmark MinHash +
LSH Algorithm on Spark
Insight Data Engineering Fellow Program, Silicon Valley
Xiaoqian Liu
June 2016
What’s next?
Post Recommendation
● Data: Reddit posts and titles in 12/2014
● Similarity metric: Jaccard Similarity
○ (%common) on titles
Pairwise Similarity Calculation is
Expensive!!
● ~700k posts in 12/2014
● Individual lookup: 700K times, O(n)
● Pairwise calculation: 490B times, O(n^2)
MinHash: Dimension Reduction
Post 1 Dave Grohl tells a story
Post 2 Dave Grohl shares a story with Taylor Swift
Post 3 I knew it was trouble when they drove by
Min hash 1 Min hash 2 Min hash 3 Min hash 4
Post 1 932378 11070 107000 195512
Post 2 20930 213012 107000 195512
Post 3 27698 14136 104464 154376
4 hash funcs
LSH (Locality Sensitive Hashing)
● Further reduce the dimension
● Suppose the table is divided into 2 bands w/ width of 2
● Rehash on each item
● Use (Band id, Band hash) to find similar items
Band 1 Band 2
Post 1 Hash (932378,11070) Hash (107000,195512)
Post 2 Hash (20930, 213012) Hash (107000,195512)
Post 3 Hash (27698,14136) Hash (104464,154376)
Dave Grohl
Dave Grohl
Trouble
*Algorithm source: Mining of Massive Datasets (Rajaraman,Leskovec)
Infrastructure for Evaluation
● Batch implementation+Eval
● Real-time implementation+Eval
Preprocessing
(tokenize, remove
stopwords)
Minhash+LSH
(batch version)
Minhash+LSH
(online version)
Reddits (1/2015)
Reddits (12/2014)
Export LSH+post info
Group lookup+update
6 nodes
(m4.xlarge)
3 nodes
(m4.xlarge)
Evaluation &
Lookup
6 nodes
(m4.xlarge)
1 node
(m4.xlarge)
Batch Processing Optimization on Spark
● SparkSQL join, cartesian product
● Reduce Shuffle times for joining two different datasets:
○ Co-partition before joining
● Persist the data before actions
○ Storage level depends on the RDD size
● Filter results before joining and calculating similarities
○ filter(), reducebyKey()
Batch Processing: Brute-force vs
Minhash+LSH (10 hash funcs, 2 bands)
100k entries, 12/2014 Reddits
Precision and Recall
● 100k entries, estimated threshold = 0.44
Parameters Items
>=threshold
Total
count
Time (sec) Precision Recall num
partitions
Brute-force 16,046 9.99B 29,880 1 1 3,600
k=10, b=2 585 65,353 7.68 0.009 0.036 60
780k reddit posts, precision vs k values
K = # hash functions
780k reddit posts, time vs k values
Streaming: Average Time
● Throughput: 315 events/sec, 10 sec time window
● 8 sec/microbatch, 6 nodes,
Conclusion
● Effectively speed up on batch processing
● Use 400-500 hash functions, set the threshold above .65
○ Filter out pairs w/ low similarities
○ Linear scan for pairs w/ 0 neighbors
● Only for Jaccard Similarity.
○ For cosine similarity: LSH + random projection
About Me
● BS, MS in Systems Engineering
(CS minor), UVA
● Operations/Data Science Intern,
Samsung Austin R&D Center
● ML, NLP at scale
● Music, Singing
“We can have a party, just listening to music”
Backup Slides
Limits & Future Work
● Investigate recall values vs parameters/time/...
○ More recall and precision comparison btw Brute-Force and LSH+MinHash
○ More comparison between different parameter comparisons
● Benchmark for batch processing:
○ Size vs Time
● More detailed benchmark on real-time processing
● More runs of experiments:
○ More representative data
● Optimize resource utilization
MapReduce version of MinHash+LSH
● Mapper side: for each post
○ Calculate min hash values
○ Create bands and band hashes
● Reducer side:
○ Get similar items grouped by (band id, band hash)
○ Calculate jaccard similarity on each item combination ->
find the most similar pair
Threshold of MinHash + LSH
● Estimated Similarity Lower bound for each band:
○ ~(1/#bands)^(1/#rows)
● e.g. k =4, 2 bands and 2 rows. at least 0.70 similar
● Collision
● Higher k, more accurate, but slower
Streaming: Kafka
● Throughput: 146 events/sec, 10 ms time window
Streaming: Average Time
● Throughput:120 events/sec, 10 ms time window
● 8 sec/microbatch, 6 nodes, 1024 MB memory/node
Streaming: Kafka
● Throughput: 315.30 events/sec, 10 ms time window
780k reddit posts, precision vs time
Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9
CPU usage
●
Task Diagram
●

Contenu connexe

Tendances

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
Databricks
 

Tendances (20)

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 
Deep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming ApplicationsDeep Dive and Best Practices for Real Time Streaming Applications
Deep Dive and Best Practices for Real Time Streaming Applications
 
Real-Time Event Processing
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014Netflix viewing data architecture evolution - QCon 2014
Netflix viewing data architecture evolution - QCon 2014
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 

Similaire à Benchmark MinHash+LSH algorithm on Spark

Similaire à Benchmark MinHash+LSH algorithm on Spark (20)

Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
February 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDBFebruary 2016 Webinar Series - Introduction to DynamoDB
February 2016 Webinar Series - Introduction to DynamoDB
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
 
Deep Dive - DynamoDB
Deep Dive - DynamoDBDeep Dive - DynamoDB
Deep Dive - DynamoDB
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in production
 
Artmosphere Demo
Artmosphere DemoArtmosphere Demo
Artmosphere Demo
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Deep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDBDeep Dive: Amazon DynamoDB
Deep Dive: Amazon DynamoDB
 
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDBAWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
AWS December 2015 Webinar Series - Design Patterns using Amazon DynamoDB
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討Amazon DynamoDB 深入探討
Amazon DynamoDB 深入探討
 
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB DayAmazon Dynamo DB for Developers (김일호) - AWS DB Day
Amazon Dynamo DB for Developers (김일호) - AWS DB Day
 
Druid
DruidDruid
Druid
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
DPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for AnalysisDPF 2017: GPUs in LHCb for Analysis
DPF 2017: GPUs in LHCb for Analysis
 

Dernier

Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 

Dernier (20)

Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 

Benchmark MinHash+LSH algorithm on Spark

  • 1. Benchmark MinHash + LSH Algorithm on Spark Insight Data Engineering Fellow Program, Silicon Valley Xiaoqian Liu June 2016
  • 3. Post Recommendation ● Data: Reddit posts and titles in 12/2014 ● Similarity metric: Jaccard Similarity ○ (%common) on titles
  • 4. Pairwise Similarity Calculation is Expensive!! ● ~700k posts in 12/2014 ● Individual lookup: 700K times, O(n) ● Pairwise calculation: 490B times, O(n^2)
  • 5. MinHash: Dimension Reduction Post 1 Dave Grohl tells a story Post 2 Dave Grohl shares a story with Taylor Swift Post 3 I knew it was trouble when they drove by Min hash 1 Min hash 2 Min hash 3 Min hash 4 Post 1 932378 11070 107000 195512 Post 2 20930 213012 107000 195512 Post 3 27698 14136 104464 154376 4 hash funcs
  • 6. LSH (Locality Sensitive Hashing) ● Further reduce the dimension ● Suppose the table is divided into 2 bands w/ width of 2 ● Rehash on each item ● Use (Band id, Band hash) to find similar items Band 1 Band 2 Post 1 Hash (932378,11070) Hash (107000,195512) Post 2 Hash (20930, 213012) Hash (107000,195512) Post 3 Hash (27698,14136) Hash (104464,154376) Dave Grohl Dave Grohl Trouble *Algorithm source: Mining of Massive Datasets (Rajaraman,Leskovec)
  • 7. Infrastructure for Evaluation ● Batch implementation+Eval ● Real-time implementation+Eval Preprocessing (tokenize, remove stopwords) Minhash+LSH (batch version) Minhash+LSH (online version) Reddits (1/2015) Reddits (12/2014) Export LSH+post info Group lookup+update 6 nodes (m4.xlarge) 3 nodes (m4.xlarge) Evaluation & Lookup 6 nodes (m4.xlarge) 1 node (m4.xlarge)
  • 8. Batch Processing Optimization on Spark ● SparkSQL join, cartesian product ● Reduce Shuffle times for joining two different datasets: ○ Co-partition before joining ● Persist the data before actions ○ Storage level depends on the RDD size ● Filter results before joining and calculating similarities ○ filter(), reducebyKey()
  • 9. Batch Processing: Brute-force vs Minhash+LSH (10 hash funcs, 2 bands) 100k entries, 12/2014 Reddits
  • 10. Precision and Recall ● 100k entries, estimated threshold = 0.44 Parameters Items >=threshold Total count Time (sec) Precision Recall num partitions Brute-force 16,046 9.99B 29,880 1 1 3,600 k=10, b=2 585 65,353 7.68 0.009 0.036 60
  • 11. 780k reddit posts, precision vs k values K = # hash functions
  • 12. 780k reddit posts, time vs k values
  • 13. Streaming: Average Time ● Throughput: 315 events/sec, 10 sec time window ● 8 sec/microbatch, 6 nodes,
  • 14. Conclusion ● Effectively speed up on batch processing ● Use 400-500 hash functions, set the threshold above .65 ○ Filter out pairs w/ low similarities ○ Linear scan for pairs w/ 0 neighbors ● Only for Jaccard Similarity. ○ For cosine similarity: LSH + random projection
  • 15. About Me ● BS, MS in Systems Engineering (CS minor), UVA ● Operations/Data Science Intern, Samsung Austin R&D Center ● ML, NLP at scale ● Music, Singing “We can have a party, just listening to music”
  • 17. Limits & Future Work ● Investigate recall values vs parameters/time/... ○ More recall and precision comparison btw Brute-Force and LSH+MinHash ○ More comparison between different parameter comparisons ● Benchmark for batch processing: ○ Size vs Time ● More detailed benchmark on real-time processing ● More runs of experiments: ○ More representative data ● Optimize resource utilization
  • 18. MapReduce version of MinHash+LSH ● Mapper side: for each post ○ Calculate min hash values ○ Create bands and band hashes
  • 19. ● Reducer side: ○ Get similar items grouped by (band id, band hash) ○ Calculate jaccard similarity on each item combination -> find the most similar pair
  • 20. Threshold of MinHash + LSH ● Estimated Similarity Lower bound for each band: ○ ~(1/#bands)^(1/#rows) ● e.g. k =4, 2 bands and 2 rows. at least 0.70 similar ● Collision ● Higher k, more accurate, but slower
  • 21. Streaming: Kafka ● Throughput: 146 events/sec, 10 ms time window
  • 22. Streaming: Average Time ● Throughput:120 events/sec, 10 ms time window ● 8 sec/microbatch, 6 nodes, 1024 MB memory/node
  • 23. Streaming: Kafka ● Throughput: 315.30 events/sec, 10 ms time window
  • 24. 780k reddit posts, precision vs time Threshold: 0.4-0.5 Threshold: 0.6-0.7 Threshold: 0.8-0.9