Document Similarity with Cloud
Computing
by Bryan Bende
What is Cloud Computing?
"A style of computing in which dynamically scalable and often
virtualized resources are provided as a service over the Internet." -
Wikipedia
● Resources could be storage, processing power, applications, etc.
● Third-party providers own the cloud
● Customers rent resources at an affordable price
Amazon Web Services
● Amazon provides several web services that utilize cloud
computing:
○ Elastic Compute Cloud (EC2)
○ Simple Storage Service (S3)
○ Simple DB
○ Simple Queue Service (SQS)
○ Elastic MapReduce
● Pay only for what you use - services typically charge based on
bandwidth in and out, plus hourly or monthly usage; rates are very
affordable
Amazon Elastic Compute Cloud (EC2)
● Provides resizable computing capacity
● Customer requests a number of instances and the type of OS
image to load on each instance
● Instances are allocated on-demand and can be added at any time
(more than 20 instances requires approval)
○ Small Instance (Default)
■ 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute
Unit), 160 GB of instance storage
○ Large Instance
■ 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute
Units each), 850 GB of instance storage
○ Extra Large Instance
■ 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute
Units each), 1690 GB of instance storage
On-Demand Instances   Linux/UNIX Usage   Windows Usage
Small (Default)       $0.10 per hour     $0.125 per hour
Large                 $0.40 per hour     $0.50 per hour
Extra Large           $0.80 per hour     $1.00 per hour
Also pay $0.10/GB for data in, $0.17/GB for data out (for the first 10 TB)
Amazon Simple Storage Service (S3)
● Provides data storage in the cloud
● Write, read, and delete objects up to 5 GB in size; the number of
objects is unlimited
● Each object is stored in a bucket and retrieved via a unique,
developer-assigned key
Storage
● $0.150 per GB – first 50 TB / month of storage used
Data Transfer
● $0.10 per GB – all data transfer in
● $0.17 per GB – first 10 TB / month data transfer out
Requests
● $0.01 per 1,000 PUT, COPY, POST, or LIST requests
● $0.01 per 10,000 GET and all other requests*
How do we use these services?
Typical Scenario:
1. Transfer data to be processed into S3
2. Launch a cluster of machines on EC2
3. Transfer data from S3 onto master node of cluster
4. Launch a job that uses the cluster to process the data
5. Send results back to S3, or SCP back to local machine
6. Shutdown EC2 instances
All data on the EC2 instances is lost when they are shut down
How do we use the cluster to process the data?
Map Reduce / Hadoop
● Map Reduce is a software framework to support
distributed computing on large data sets
● Does not have to be used with cloud computing,
could be used with a personal cluster of machines
● Map task produces key/value pairs from input
● Reduce task receives all the key/value pairs with the
same key
● Framework handles distributing the data; developers write
only the Map and Reduce operations
● Hadoop is a Java-based open-source
implementation of Map Reduce
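The grouping step described above (every value emitted for the same key is delivered to one reduce call) is what the framework handles between map and reduce. It can be illustrated with a minimal in-memory word-count sketch in plain Java; this has no Hadoop dependency, and the class and method names are illustrative, not from the original code:

```java
import java.util.*;

public class MiniMapReduce {
    // Map: emit a (word, 1) pair for every token in one line of input
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Shuffle: group all values emitted for the same key (the framework's job)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce: sum the values that arrived for one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static Map<String, Integer> wordCount(String... lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) emitted.addAll(map(line));
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(emitted).forEach((k, vs) -> counts.put(k, reduce(vs)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("the cat sat", "the dog"));
        // prints {cat=1, dog=1, sat=1, the=2}
    }
}
```

In real Hadoop the shuffle also partitions keys across machines; here it is a single sorted map, which is enough to show the contract the Map and Reduce functions rely on.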
Diagram from:
http://www.sigcrap.org/2008/01/23/mapreduce-a-major-disruptionto-database-dogma/
Hadoop Continued...
Map Function:
public void map(LongWritable key, Text value,
                OutputCollector<Text, Tuple> output,
                Reporter reporter) throws IOException {
    ...
}
Reduce Function:
public void reduce(Text key, Iterator<Tuple> values,
                   OutputCollector<Text, Tuple> output,
                   Reporter reporter) throws IOException {
    ...
}
Main Method
● Creates a Job
● Specifies the Map class, Reduce class, Input path, Output path, Number of
Map tasks, and Number of Reduce tasks
● Submits the job
Experiment: Compute Document Similarity
Motivated by Pairwise Document Similarity in Large Collections with MapReduce by Tamer
Elsayed, Jimmy Lin, and Douglas W. Oard
● Score every document in a large collection against every other
document in the collection
● Similar to the process of scoring a query against a document, but
instead of a query we have another document
● Data Set - Wikipedia Abstracts provided by DBPedia
○ Pre-processed so each abstract is on a single line, with the Wikipedia URL
at the beginning of each line
○ Used the Wikipedia URL as the document ID and the rest of the text as the document
Example Data:
<http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry
originated in the 1970's and was most intense in the late 1980s - early 1990's, a
period when the Bulls' superstar, Michael Jordan, ...
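Because the URL is the first whitespace-delimited token on each line, splitting on the first space recovers the (document ID, document text) pair. A small sketch (class and field names are illustrative):

```java
public class AbstractLine {
    final String docId;
    final String text;

    AbstractLine(String docId, String text) { this.docId = docId; this.text = text; }

    // Split on the first space: the URL becomes the doc id,
    // everything after it is the document text.
    static AbstractLine parse(String line) {
        int i = line.indexOf(' ');
        return new AbstractLine(line.substring(0, i), line.substring(i + 1).trim());
    }

    public static void main(String[] args) {
        AbstractLine a = AbstractLine.parse(
            "<http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry ...");
        System.out.println(a.docId);
        // prints <http://dbpedia.org/resource/Bulls-Pistons_rivalry>
    }
}
```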
Step 1 - Inverted File with Map Reduce
● Each line of the input file gets passed to a Mapper (i.e., each
mapper handles one document at a time because of the DBPedia
format, which simplifies everything)
● Mapper tokenizes and normalizes the text
● Produces key value pairs where each key is a word and the value is
a tuple containing the doc id, doc term frequency, and doc length
○ <word1, (doc1, dtf1, docLength)>
○ <word2, (doc1, dtf2, docLength)>
● Each Reducer receives all the records for a single key at one time
(handled by the framework)
● Iterates over each record and uses the dtf and doc length to
calculate a score for the word in the given document
● Produces a posting list for the word
○ <word1, (doc1, w1), (doc2, w2) ... (docN, wN)>
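The mapper side of this step (tokenize, count term frequencies, emit one tuple per word) and the reducer's grouping can be sketched in plain Java without Hadoop; the class names, the regex normalization, and the in-memory grouping are illustrative stand-ins for the original code:

```java
import java.util.*;

public class InvertedFileBuilder {
    // A posting as emitted by the mapper: (docId, dtf, docLength)
    static class Posting {
        final String docId; final int dtf; final int docLength;
        Posting(String d, int f, int l) { docId = d; dtf = f; docLength = l; }
        public String toString() { return "(" + docId + "," + dtf + "," + docLength + ")"; }
    }

    // Mapper: tokenize/normalize one document, emit (word -> posting)
    static Map<String, Posting> map(String docId, String text) {
        String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9 ]", " ")
                              .trim().split("\\s+");
        Map<String, Integer> tf = new TreeMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        Map<String, Posting> out = new TreeMap<>();
        tf.forEach((word, f) -> out.put(word, new Posting(docId, f, tokens.length)));
        return out;
    }

    // Reducer side: collect every posting emitted for the same word
    static Map<String, List<Posting>> invert(Map<String, String> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        docs.forEach((id, text) ->
            map(id, text).forEach((word, p) ->
                index.computeIfAbsent(word, w -> new ArrayList<>()).add(p)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "the cat sat on the mat");
        docs.put("doc2", "the dog");
        System.out.println(invert(docs));
    }
}
```

The real reducer would apply the scoring function to each (dtf, docLength) pair before writing the posting list; this sketch stops at the raw tuples.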
Inverted File Example
Scoring Function
● Okapi Term Weighting - a variation from the paper by Scott
Olsson and Douglas Oard, Improving Text Classification
w(tf, dl) = tf / ( 0.5 + 1.5 * ( dl / avdl ) + tf )
tf = term frequency in the document
dl = document length
avdl = average document length for the collection
● Wrote a utility to pre-compute the AVDL, hard-coded into the
Inverted File Reducer
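The weighting formula translates directly to code. A sketch, with the AVDL passed as a parameter rather than hard-coded (class and method names are illustrative):

```java
public class Okapi {
    // w(tf, dl) = tf / (0.5 + 1.5 * (dl / avdl) + tf)
    static double weight(int tf, int dl, double avdl) {
        return tf / (0.5 + 1.5 * (dl / avdl) + tf);
    }

    public static void main(String[] args) {
        // A term occurring twice in a document of exactly average length:
        System.out.println(weight(2, 100, 100.0));  // 2 / (0.5 + 1.5 + 2) = 0.5
    }
}
```

Note the weight saturates toward 1 as tf grows and is discounted for longer-than-average documents, which is the behavior the scoring relies on.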
Step 2 - Map Over the Inverted File
● Each Mapper receives one posting list at a time
● For each posting, pair it with every other posting and produce a
tuple where the key contains the doc IDs of the two postings and
the value contains the product of their weights
○ <(doc1, doc2), combined weight>
○ <(doc1, doc3), combined weight>
○ <(doc1, docN), combined weight>
● Each Reducer receives all records for one pair of doc ids at one
time
● Sums all the combined weights to get the total score for doc X vs
doc Y
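The pair generation and the summation can be sketched together in plain Java; here the reduce-side sum is folded into a running map, and the key construction (joining two doc IDs with "|") is an illustrative stand-in for the original tuple key:

```java
import java.util.*;

public class PairwiseSimilarity {
    // Mapper over one posting list: for every pair of postings sharing this
    // word, emit the product of their weights keyed by the doc-id pair.
    // Emitting only (docI, docJ) with i < j avoids duplicate (doc2, doc1) keys.
    static void mapPostingList(List<String> docIds, List<Double> weights,
                               Map<String, Double> partialScores) {
        for (int i = 0; i < docIds.size(); i++) {
            for (int j = i + 1; j < docIds.size(); j++) {
                String key = docIds.get(i) + "|" + docIds.get(j);
                // Reducer step folded in: sum the partial products per pair
                partialScores.merge(key, weights.get(i) * weights.get(j), Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new TreeMap<>();
        // word1 appears in doc1 (0.5) and doc2 (0.4); word2 in doc1 (0.2) and doc2 (0.1)
        mapPostingList(Arrays.asList("doc1", "doc2"), Arrays.asList(0.5, 0.4), scores);
        mapPostingList(Arrays.asList("doc1", "doc2"), Arrays.asList(0.2, 0.1), scores);
        System.out.println(scores);  // doc1|doc2 -> 0.5*0.4 + 0.2*0.1 ≈ 0.22
    }
}
```

This also shows why long posting lists are expensive: a list of n postings emits n(n-1)/2 pairs, which is the quadratic blow-up the DF-Cut later addresses.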
Tools and Technologies
● Amazon EC2 and S3
● Map Reduce / Hadoop 0.17
● Cloud9
○ Library developed by Jimmy Lin at the University of Maryland
○ Helper classes and scripts for working with Hadoop
● JetS3t Cockpit
○ Application to manage S3 buckets
● Small Text
○ Java Library for performing external sorting of large files
Experiment Steps
1. Transfer DBPedia data into S3
2. Start EC2 Cluster using Cloud9 scripts
3. Transfer the DBPedia data onto the master node of the cluster
4. Put DBPedia data into Hadoop's Distributed File System (HDFS)
5. Transfer the JAR file containing the Mapper, Reducer, and Main Method onto
the master node
6. Submit the Inverted File job
7. Submit the Document Similarity Job
8. SCP the results of the Document Similarity job back to local
machine
9. Concatenate all the partial result files to one file
10. Run SmallText to perform an external sort on the concatenated file
First Attempt
● Used a subset of the DBPedia data
○ First 160K Abstracts from full set
○ 98 MB file
○ 460k Unique Words
● Inverted File
○ 2 Small EC2 Instances
○ 2 Map Tasks, 1 Reduce Task so all output is one file
○ Completed in approximately 5 minutes
○ 170MB Inverted File
● Document Similarity
○ 20 Small EC2 Instances
○ 5 Map Tasks per instance (100 total)
○ 1 Reduce Task per instance (20 total)
○ Map phase only 50% complete after 12 hours
What was the problem?
Document Frequency Cut (DF-Cut)
● Document Frequency Cut is the process of ignoring the most
frequent terms in the collection when generating the inverted file
● Most frequent terms generate the longest posting lists
● Longest posting lists generate the most pairs during the mapping
phase
● Common words in the collection aren't a big factor in determining
document similarity
● The paper by Elsayed et al. describes using a 1% DF-Cut, which
ignored the nine thousand most frequent words
● Used a 0.5% DF-Cut on the DBPedia data set, which ignored
approximately 2,300 words
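Selecting which words a DF-Cut drops is a simple ranking over document frequencies. A plain-Java sketch (class and method names are illustrative, not from the original code):

```java
import java.util.*;

public class DfCut {
    // Given each word's document frequency, return the words to KEEP after
    // dropping the top `cutFraction` (e.g. 0.005 for a 0.5% DF-Cut) most
    // frequent words in the vocabulary.
    static Set<String> keepAfterDfCut(Map<String, Integer> docFreq, double cutFraction) {
        List<String> words = new ArrayList<>(docFreq.keySet());
        // Most frequent words first
        words.sort((a, b) -> Integer.compare(docFreq.get(b), docFreq.get(a)));
        int drop = (int) (words.size() * cutFraction);
        return new TreeSet<>(words.subList(drop, words.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000); df.put("cat", 3); df.put("mat", 2); df.put("xylophone", 1);
        // A 25% cut on a 4-word vocabulary drops the single most frequent word
        System.out.println(keepAfterDfCut(df, 0.25));  // [cat, mat, xylophone]
    }
}
```

In the actual pipeline the cut would be applied when generating the inverted file, so the dropped words never produce posting lists at all.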
Second Attempt
● Same parameters as the first attempt, except the Inverted File used a
0.5% DF-Cut
● Inverted File
○ Size reduced to around 80MB
○ Completed slightly faster
● Document Similarity
○ Completed in approximately 1.5 hours
○ Produced 16GB of output
● Small Text Sorting
○ Completed in approximately 30 mins
● Most Similar Documents
○ African_Broadbill and African_Shrike-flycatcher
Third Attempt
● Increased size of data set
○ First 600k Abstracts
○ 400MB File
○ 1.2 Million Words
● Same number of instances and tasks as previous attempts
● Same DF-Cut which removed 6k words
● Changed Inverted File to produce postings list sorted by Document
Id
● Changed Document Similarity to not produce <doc1, doc2> and
<doc2, doc1>
● Document Similarity completed in 2.5 hours
● Produced 30GB of output files
● SmallText completed sorting in 2 hours
● The most similar documents seemed inaccurate
Conclusions
● Amazon Web Services makes it easy for anyone to use
cloud computing for data mining tasks
● Map Reduce / Hadoop makes it easy to implement
distributed processing and hides the complexity from the developer
● It can be hard to debug problems in the cloud
● A more efficient way to store and read the inverted file is needed
● The scoring function may not be accurate

 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 

Document Similarity with Cloud Computing

  • 1. Document Similarity with Cloud Computing by Bryan Bende
  • 2. What is Cloud Computing ? "A style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet." - Wikipedia ● Resources could be storage, processing power, applications, etc ● Third-party providers own the cloud ● Customers rent resources for an affordable price
  • 3. Amazon Web Services ● Amazon provides several web services that utilize cloud computing: ○ Elastic Compute Cloud (EC2) ○ Simple Storage Service (S3) ○ Simple DB ○ Simple Queue Service (SQS) ○ Elastic Map Reduce ● Pay only for what you use - services typically charge based on bandwidth in and out, and hourly or monthly usage; rates are very affordable
  • 4. Amazon Elastic Compute Cloud (EC2) ● Provides resizable computing capacity ● Customer requests a number of instances and the type of OS image to load on each instance ● Instances are allocated on-demand and can be added at any time (more than 20 instances requires approval) ○ Small Instance (Default) ■ 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage ○ Large Instance ■ 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage ○ Extra Large Instance ■ 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage ● On-Demand Instance pricing: Small (Default) $0.10/hour Linux/UNIX, $0.125/hour Windows; Large $0.40/hour Linux/UNIX, $0.50/hour Windows; Extra Large $0.80/hour Linux/UNIX, $1.00/hour Windows ● Also pay $0.10/GB data in, $0.17/GB data out (for the first 10 TB)
  • 5. Amazon Simple Storage Service (S3) ● Provides data storage in the cloud ● Write, read, and delete objects up to 5 GB in size; the number of objects is unlimited ● Each object is stored in a bucket and retrieved via a unique, developer-assigned key Storage ● $0.150 per GB – first 50 TB / month of storage used Data Transfer ● $0.10 per GB – all data transfer in ● $0.17 per GB – first 10 TB / month data transfer out Requests ● $0.01 per 1,000 PUT, COPY, POST, or LIST requests ● $0.01 per 10,000 GET and all other requests*
  • 6. How do we use these services ? Typical Scenario: 1. Transfer data to be processed into S3 2. Launch a cluster of machines on EC2 3. Transfer data from S3 onto master node of cluster 4. Launch a job that uses the cluster to process the data 5. Send results back to S3, or SCP back to local machine 6. Shutdown EC2 instances All data on the EC2 instances is lost when shutting down How do we use the cluster to process the data ?
  • 7. Map Reduce / Hadoop ● Map Reduce is a software framework to support distributed computing on large data sets ● Does not have to be used with cloud computing, could be used with a personal cluster of machines ● Map task produces key/value pairs from input ● Reduce task receives all the key/value pairs with the same key ● Framework handles distributing the data, developers only write the Map and Reduce operations ● Hadoop is a Java-based open-source implementation of Map Reduce Diagram from: http://www.sigcrap.org/2008/01/23/mapreduce-a-major-disruptionto-database-dogma/
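The Map/Reduce contract described on the slide above can be sketched in plain Python (this is an illustration, not the author's Hadoop job): the map task emits key/value pairs, and the framework's job of routing every pair with the same key to a single reduce call is played here by a dictionary.

```python
from collections import defaultdict

def map_phase(documents):
    """Map task: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce task: receives all the values for one key at a time."""
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the cat sat", "the dog sat"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```

The developer writes only `map_phase` and `reduce_phase`; in Hadoop the shuffle step, and the distribution of both phases across machines, are handled entirely by the framework.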
  • 8. Hadoop Continued... Map Function: public void map(LongWritable key, Text value, OutputCollector&lt;Text, Tuple&gt; output, Reporter reporter) throws IOException { ... } Reduce Function: public void reduce(Text key, Iterator&lt;Tuple&gt; values, OutputCollector&lt;Text, Tuple&gt; output, Reporter reporter) throws IOException { ... } Main Method ● Creates a Job ● Specifies the Map class, Reduce class, Input path, Output path, Number of Map tasks, and Number of Reduce tasks ● Submits the job
  • 9. Experiment: Compute Document Similarity Motivated by Pairwise Document Similarity in Large Collection with Map Reduce by Tamer Elsayed, Jimmy Lin, and Douglas W. Oard ● Score every document in a large collection against every other document in the collection ● Similar to the process of scoring a query against a document, but instead of a query we have another document ● Data Set - Wikipedia Abstracts provided by DBPedia ○ Pre-processed so each abstract is on a single line with the wikipedia URL at the beginning of each line ○ Used Wikipedia URL as a document id, rest of the text as the document Example Data: <http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry originated in the 1970's and was most intense in the late 1980s - early 1990's, a period when the Bulls' superstar, Michael Jordan, ...
  • 10. Step 1 - Inverted File with Map Reduce ● Each line of the input file gets passed to a Mapper (i.e., each Mapper handles one document at a time because of the DBPedia format, which simplifies everything) ● Mapper tokenizes and normalizes the text ● Produces key/value pairs where each key is a word and the value is a tuple containing the doc id, doc term frequency, and doc length ○ <word1, (doc1, dtf1, docLength)> ○ <word2, (doc1, dtf2, docLength)> ● Each Reducer receives all the records for a single key at one time (handled by the framework) ● Iterates over each record and uses the dtf and doc length to calculate a score for the word in the given document ● Produces a posting list for the word ○ <word1, (doc1, w1), (doc2, w2) ... (docN, wN)>
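Step 1 can be sketched in plain Python (the real job is a Hadoop Mapper/Reducer pair). The weighting function is passed in as a parameter here because the slides compute Okapi weights in the reducer; the `tf / dl` lambda below is just a placeholder.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text, emit):
    """Mapper: tokenize/normalize, then emit <word, (doc, dtf, docLength)>."""
    tokens = text.lower().split()
    dtf = Counter(tokens)                  # document term frequencies
    for word, tf in dtf.items():
        emit(word, (doc_id, tf, len(tokens)))

def build_inverted_file(docs, weight):
    """Reducer side: turn (doc, dtf, dl) tuples into (doc, weight) postings."""
    postings = defaultdict(list)
    emit = lambda w, t: postings[w].append(t)
    for doc_id, text in docs.items():
        map_document(doc_id, text, emit)
    return {w: [(d, weight(tf, dl)) for d, tf, dl in tuples]
            for w, tuples in postings.items()}

docs = {"d1": "apple banana apple", "d2": "banana cherry"}
index = build_inverted_file(docs, weight=lambda tf, dl: tf / dl)
# index["banana"] == [("d1", 1/3), ("d2", 0.5)]
```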
  • 12. Scoring Function ● Okapi Term Weighting - variation from the paper by Scott Olsson and Douglas Oard, Improving Text Classification: w(tf, dl) = tf / ( 0.5 + 1.5( dl / avdl ) + tf ) where tf = term frequency in the document, dl = document length, avdl = average document length for the collection ● Wrote utility to pre-compute AVDL, hard-coded into the Inverted File Reducer
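The Okapi term weight above translates directly into code; a minimal version (with avdl as an explicit parameter rather than hard-coded, as it was in the slides' reducer):

```python
def okapi_weight(tf, dl, avdl):
    """Okapi term weight from the slides:
    w(tf, dl) = tf / (0.5 + 1.5 * (dl / avdl) + tf)

    tf   -- term frequency in the document
    dl   -- document length
    avdl -- average document length for the collection (pre-computed)
    """
    return tf / (0.5 + 1.5 * (dl / avdl) + tf)

# For a document of exactly average length (dl == avdl) the weight
# simplifies to tf / (2 + tf), e.g. tf=2 -> 0.5
```

The weight grows with tf but saturates toward 1, and longer-than-average documents are penalized via the dl / avdl term.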
  • 13. Step 2 - Map Over the Inverted File ● Each Mapper receives one posting list at a time ● For each posting, go to every other posting and produce a tuple where the key contains the doc ids of each posting, and the value contains the product of the weights ○ <(doc1, doc2), combined weight> ○ <(doc1, doc3), combined weight> ○ <(doc1, docN), combined weight> ● Each Reducer receives all records for one pair of doc ids at one time ● Sums all the combined weights to get the total score for doc X vs doc Y
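Step 2 can likewise be sketched in plain Python: map over each posting list, emit a partial score for every pair of documents that share the term, and sum per pair (the reducer's job). Using `combinations()` here mirrors the third attempt's later optimization of not producing both <doc1, doc2> and <doc2, doc1>.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(inverted_file):
    """Sum w_i * w_j over all terms shared by each document pair."""
    scores = defaultdict(float)
    for word, postings in inverted_file.items():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            scores[(d1, d2)] += w1 * w2   # reducer sums the partial products
    return dict(scores)

index = {"banana": [("d1", 0.4), ("d2", 0.5)],
         "cherry": [("d2", 0.5), ("d3", 0.2)],
         "apple":  [("d1", 0.6), ("d2", 0.1), ("d3", 0.3)]}
sims = pairwise_similarity(index)
# sims[("d1", "d2")] == 0.4*0.5 + 0.6*0.1 == 0.26
```

This also makes the first attempt's blow-up visible: a posting list of length n produces n(n-1)/2 pairs, so the most frequent terms dominate the map phase's output.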
  • 14. Tools and Technologies ● Amazon EC2 and S3 ● Map Reduce / Hadoop 0.17 ● Cloud9 ○ Library developed by Jimmy Lin at University of Maryland ○ Helper classes and script for working with Hadoop ● JetS3t Cockpit ○ Application to manage S3 buckets ● Small Text ○ Java Library for performing external sorting of large files
  • 15. Experiment Steps 1. Transfer DBPedia data into S3 2. Start EC2 Cluster using Cloud9 scripts 3. Transfer DBPedia data onto the master node of the cluster 4. Put DBPedia data into Hadoop's Distributed File System (HDFS) 5. Transfer the JAR file containing the Mapper, Reducer, and Main Method to the master node 6. Submit the Inverted File job 7. Submit the Document Similarity job 8. SCP the results of the Document Similarity job back to the local machine 9. Concatenate all the partial result files into one file 10. Run SmallText to perform an external sort on the concatenated file
  • 16. First Attempt ● Used subset of DBPedia data ○ First 160k abstracts from the full set ○ 98 MB file ○ 460k unique words ● Inverted File ○ 2 Small EC2 Instances ○ 2 Map Tasks, 1 Reduce Task so all output is one file ○ Completed in approximately 5 minutes ○ 170 MB Inverted File ● Document Similarity ○ 20 Small EC2 Instances ○ 5 Map Tasks per instance (100 total) ○ 1 Reduce Task per instance (20 total) ○ Map phase only 50% complete after 12 hours What was the problem?
  • 17. Document Frequency Cut (DF-Cut) ● Document Frequency Cut is the process of ignoring the most frequent terms in the collection when generating the inverted file ● Most frequent terms generate the longest posting lists ● Longest posting lists generate the most pairs during the mapping phase ● Common words in the collection aren't a big factor in determining document similarity ● Paper by Elsayed describes using a 1% DF-Cut, which ignored the nine thousand most frequent words ● Used a 0.5% DF-Cut on the DBPedia data set, which ignored approximately 2,300 words
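A DF-cut can be sketched as follows — a hedged illustration, assuming the cut fraction is applied to the vocabulary size (the slides don't spell out the exact mechanics):

```python
from collections import Counter

def df_cut(doc_tokens, cut_fraction):
    """Return the vocabulary with the most frequent terms removed.

    doc_tokens   -- list of token sets, one per document
    cut_fraction -- fraction of the vocabulary to discard (e.g. 0.005)
    """
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))            # count documents, not occurrences
    n_cut = int(len(df) * cut_fraction)
    stop = {w for w, _ in df.most_common(n_cut)}
    return set(df) - stop

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
vocab = df_cut(docs, cut_fraction=0.25)   # 4 words * 0.25 -> drop 1
# "the" (df=3) is dropped; vocab == {"cat", "dog", "fish"}
```

In the actual pipeline the cut would be applied before (or while) generating the inverted file, so the discarded terms never produce posting lists at all.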
  • 18. Second Attempt ● Same parameters as the first attempt except the Inverted File used a 0.5% DF-Cut ● Inverted File ○ Size reduced to around 80 MB ○ Completed slightly faster ● Document Similarity ○ Completed in approximately 1.5 hours ○ Produced 16 GB of output ● Small Text Sorting ○ Completed in approximately 30 minutes ● Most Similar Documents ○ African_Broadbill and African_Shrike-flycatcher
  • 19. Third Attempt ● Increased size of data set ○ First 600k Abstracts ○ 400MB File ○ 1.2 Million Words ● Same number of instances and tasks as previous attempts ● Same DF-Cut which removed 6k words ● Changed Inverted File to produce postings list sorted by Document Id ● Changed Document Similarity to not produce <doc1, doc2> and <doc2, doc1> ● Document Similarity completed in 2.5 hours ● Produced 30GB of output files ● SmallText completed sorting in 2 hours ● Most similar documents seem inaccurate
  • 20. Conclusions ● Amazon Web Services makes it easy for anyone to use Cloud Computing for data mining tasks ● Map Reduce / Hadoop makes it easy to implement distributed processing and hides the complexity from the developer ● Can be hard to debug problems in the cloud ● Future work: a more efficient way to store and read the inverted file ● Scoring function may not be accurate