Document Similarity with Cloud
Computing
by Bryan Bende
What is Cloud Computing?
"A style of computing in which dynamically scalable and often
virtualized resources are provided as a service over the Internet." -
Wikipedia
● Resources could be storage, processing power, applications, etc.
● Third-party providers own the cloud
● Customers rent resources at an affordable price
Amazon Web Services
● Amazon provides several web services that utilize cloud
computing:
○ Elastic Compute Cloud (EC2)
○ Simple Storage Service (S3)
○ Simple DB
○ Simple Queue Service (SQS)
○ Elastic MapReduce
● Pay only for what you use - services typically charge based on
bandwidth in and out, plus hourly or monthly usage; rates are very
affordable
Amazon Elastic Compute Cloud (EC2)
● Provides resizable computing capacity
● Customer requests a number of instances and the type of OS
image to load on each instance
● Instances are allocated on-demand and can be added at any time
(more than 20 instances requires approval)
○ Small Instance (Default)
■ 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute
Unit), 160 GB of instance storage
○ Large Instance
■ 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute
Units each), 850 GB of instance storage
○ Extra Large Instance
■ 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute
Units each), 1690 GB of instance storage
On-Demand Instances   Linux/UNIX Usage   Windows Usage
Small (Default)       $0.10 per hour     $0.125 per hour
Large                 $0.40 per hour     $0.50 per hour
Extra Large           $0.80 per hour     $1.00 per hour
Also pay $0.10/GB for data in, $0.17/GB for data out (for the first 10 TB)
Amazon Simple Storage Service (S3)
● Provides data storage in the cloud
● Write, read, and delete objects up to 5 GB in size; the number of
objects is unlimited
● Each object is stored in a bucket and retrieved via a unique,
developer-assigned key
Storage
● $0.150 per GB – first 50 TB / month of storage used
Data Transfer
● $0.10 per GB – all data transfer in
● $0.17 per GB – first 10 TB / month data transfer out
Requests
● $0.01 per 1,000 PUT, COPY, POST, or LIST requests
● $0.01 per 10,000 GET and all other requests*
How do we use these services?
Typical Scenario:
1. Transfer data to be processed into S3
2. Launch a cluster of machines on EC2
3. Transfer data from S3 onto master node of cluster
4. Launch a job that uses the cluster to process the data
5. Send results back to S3, or SCP back to local machine
6. Shutdown EC2 instances
All data on the EC2 instances is lost when they are shut down
How do we use the cluster to process the data?
Map Reduce / Hadoop
● Map Reduce is a software framework to support
distributed computing on large data sets
● Does not have to be used with cloud computing,
could be used with a personal cluster of machines
● Map task produces key/value pairs from input
● Reduce task receives all the key/value pairs with the
same key
● Framework handles distributing the data; developers write
only the Map and Reduce operations
● Hadoop is a Java-based open-source
implementation of Map Reduce
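The grouping step described above (every value emitted for the same key is delivered to one reduce call) is what the framework handles between map and reduce. It can be illustrated with a minimal in-memory word-count sketch in plain Java; this has no Hadoop dependency, and the class and method names are illustrative, not from the original code:

```java
import java.util.*;

public class MiniMapReduce {
    // Map: emit a (word, 1) pair for every token in one line of input
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Shuffle: group all values emitted for the same key (the framework's job)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce: sum the values that arrived for one key
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static Map<String, Integer> wordCount(String... lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) emitted.addAll(map(line));
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(emitted).forEach((k, vs) -> counts.put(k, reduce(vs)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("the cat sat", "the dog"));
        // prints {cat=1, dog=1, sat=1, the=2}
    }
}
```

In real Hadoop the shuffle also partitions keys across machines; here it is a single sorted map, which is enough to show the contract the Map and Reduce functions rely on.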
Diagram from:
http://www.sigcrap.org/2008/01/23/mapreduce-a-major-disruptionto-database-dogma/
Hadoop Continued...
Map Function:
public void map(LongWritable key, Text value,
                OutputCollector<Text, Tuple> output,
                Reporter reporter) throws IOException {
    ...
}
Reduce Function:
public void reduce(Text key, Iterator<Tuple> values,
                   OutputCollector<Text, Tuple> output,
                   Reporter reporter) throws IOException {
    ...
}
Main Method
● Creates a Job
● Specifies the Map class, Reduce class, Input path, Output path, Number of
Map tasks, and Number of Reduce tasks
● Submits the job
Experiment: Compute Document Similarity
Motivated by Pairwise Document Similarity in Large Collections with MapReduce by Tamer
Elsayed, Jimmy Lin, and Douglas W. Oard
● Score every document in a large collection against every other
document in the collection
● Similar to the process of scoring a query against a document, but
instead of a query we have another document
● Data Set - Wikipedia Abstracts provided by DBPedia
○ Pre-processed so each abstract is on a single line, with the Wikipedia URL
at the beginning of each line
○ Used the Wikipedia URL as the document ID and the rest of the text as the document
Example Data:
<http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry
originated in the 1970's and was most intense in the late 1980s - early 1990's, a
period when the Bulls' superstar, Michael Jordan, ...
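Because the URL is the first whitespace-delimited token on each line, splitting on the first space recovers the (document ID, document text) pair. A small sketch (class and field names are illustrative):

```java
public class AbstractLine {
    final String docId;
    final String text;

    AbstractLine(String docId, String text) { this.docId = docId; this.text = text; }

    // Split on the first space: the URL becomes the doc id,
    // everything after it is the document text.
    static AbstractLine parse(String line) {
        int i = line.indexOf(' ');
        return new AbstractLine(line.substring(0, i), line.substring(i + 1).trim());
    }

    public static void main(String[] args) {
        AbstractLine a = AbstractLine.parse(
            "<http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry ...");
        System.out.println(a.docId);
        // prints <http://dbpedia.org/resource/Bulls-Pistons_rivalry>
    }
}
```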
Step 1 - Inverted File with Map Reduce
● Each line of the input file gets passed to a Mapper (i.e., each
mapper handles one document at a time because of the DBPedia
format, which simplifies everything)
● Mapper tokenizes and normalizes the text
● Produces key value pairs where each key is a word and the value is
a tuple containing the doc id, doc term frequency, and doc length
○ <word1, (doc1, dtf1, docLength)>
○ <word2, (doc1, dtf2, docLength)>
● Each Reducer receives all the records for a single key at one time
(handled by the framework)
● Iterates over each record and uses the dtf and doc length to
calculate a score for the word in the given document
● Produces a posting list for the word
○ <word1, (doc1, w1), (doc2, w2) ... (docN, wN)>
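The mapper side of this step (tokenize, count term frequencies, emit one tuple per word) and the reducer's grouping can be sketched in plain Java without Hadoop; the class names, the regex normalization, and the in-memory grouping are illustrative stand-ins for the original code:

```java
import java.util.*;

public class InvertedFileBuilder {
    // A posting as emitted by the mapper: (docId, dtf, docLength)
    static class Posting {
        final String docId; final int dtf; final int docLength;
        Posting(String d, int f, int l) { docId = d; dtf = f; docLength = l; }
        public String toString() { return "(" + docId + "," + dtf + "," + docLength + ")"; }
    }

    // Mapper: tokenize/normalize one document, emit (word -> posting)
    static Map<String, Posting> map(String docId, String text) {
        String[] tokens = text.toLowerCase().replaceAll("[^a-z0-9 ]", " ")
                              .trim().split("\\s+");
        Map<String, Integer> tf = new TreeMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        Map<String, Posting> out = new TreeMap<>();
        tf.forEach((word, f) -> out.put(word, new Posting(docId, f, tokens.length)));
        return out;
    }

    // Reducer side: collect every posting emitted for the same word
    static Map<String, List<Posting>> invert(Map<String, String> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        docs.forEach((id, text) ->
            map(id, text).forEach((word, p) ->
                index.computeIfAbsent(word, w -> new ArrayList<>()).add(p)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "the cat sat on the mat");
        docs.put("doc2", "the dog");
        System.out.println(invert(docs));
    }
}
```

The real reducer would apply the scoring function to each (dtf, docLength) pair before writing the posting list; this sketch stops at the raw tuples.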
Inverted File Example
Scoring Function
● Okapi Term Weighting - a variation from the paper by Scott
Olsson and Douglas Oard, Improving Text Classification
w(tf, dl) = tf / ( 0.5 + 1.5 * ( dl / avdl ) + tf )
tf = term frequency in the document
dl = document length
avdl = average document length for the collection
● Wrote a utility to pre-compute the AVDL, hard-coded into the
Inverted File Reducer
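The weighting formula translates directly to code. A sketch, with the AVDL passed as a parameter rather than hard-coded (class and method names are illustrative):

```java
public class Okapi {
    // w(tf, dl) = tf / (0.5 + 1.5 * (dl / avdl) + tf)
    static double weight(int tf, int dl, double avdl) {
        return tf / (0.5 + 1.5 * (dl / avdl) + tf);
    }

    public static void main(String[] args) {
        // A term occurring twice in a document of exactly average length:
        System.out.println(weight(2, 100, 100.0));  // 2 / (0.5 + 1.5 + 2) = 0.5
    }
}
```

Note the weight saturates toward 1 as tf grows and is discounted for longer-than-average documents, which is the behavior the scoring relies on.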
Step 2 - Map Over the Inverted File
● Each Mapper receives one posting list at a time
● For each posting, pair it with every other posting and produce a
tuple where the key contains the doc IDs of the two postings and
the value contains the product of their weights
○ <(doc1, doc2), combined weight>
○ <(doc1, doc3), combined weight>
○ <(doc1, docN), combined weight>
● Each Reducer receives all records for one pair of doc ids at one
time
● Sums all the combined weights to get the total score for doc X vs
doc Y
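The pair generation and the summation can be sketched together in plain Java; here the reduce-side sum is folded into a running map, and the key construction (joining two doc IDs with "|") is an illustrative stand-in for the original tuple key:

```java
import java.util.*;

public class PairwiseSimilarity {
    // Mapper over one posting list: for every pair of postings sharing this
    // word, emit the product of their weights keyed by the doc-id pair.
    // Emitting only (docI, docJ) with i < j avoids duplicate (doc2, doc1) keys.
    static void mapPostingList(List<String> docIds, List<Double> weights,
                               Map<String, Double> partialScores) {
        for (int i = 0; i < docIds.size(); i++) {
            for (int j = i + 1; j < docIds.size(); j++) {
                String key = docIds.get(i) + "|" + docIds.get(j);
                // Reducer step folded in: sum the partial products per pair
                partialScores.merge(key, weights.get(i) * weights.get(j), Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new TreeMap<>();
        // word1 appears in doc1 (0.5) and doc2 (0.4); word2 in doc1 (0.2) and doc2 (0.1)
        mapPostingList(Arrays.asList("doc1", "doc2"), Arrays.asList(0.5, 0.4), scores);
        mapPostingList(Arrays.asList("doc1", "doc2"), Arrays.asList(0.2, 0.1), scores);
        System.out.println(scores);  // doc1|doc2 -> 0.5*0.4 + 0.2*0.1 ≈ 0.22
    }
}
```

This also shows why long posting lists are expensive: a list of n postings emits n(n-1)/2 pairs, which is the quadratic blow-up the DF-Cut later addresses.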
Tools and Technologies
● Amazon EC2 and S3
● Map Reduce / Hadoop 0.17
● Cloud9
○ Library developed by Jimmy Lin at the University of Maryland
○ Helper classes and scripts for working with Hadoop
● JetS3t Cockpit
○ Application to manage S3 buckets
● Small Text
○ Java Library for performing external sorting of large files
Experiment Steps
1. Transfer DBPedia data into S3
2. Start EC2 Cluster using Cloud9 scripts
3. Transfer the DBPedia data onto the master node of the cluster
4. Put DBPedia data into Hadoop's Distributed File System (HDFS)
5. Transfer the JAR file containing the Mapper, Reducer, and Main Method onto
the master node
6. Submit the Inverted File job
7. Submit the Document Similarity Job
8. SCP the results of the Document Similarity job back to local
machine
9. Concatenate all the partial result files to one file
10. Run SmallText to perform an external sort on the concatenated file
First Attempt
● Used a subset of the DBPedia data
○ First 160K Abstracts from full set
○ 98 MB file
○ 460k Unique Words
● Inverted File
○ 2 Small EC2 Instances
○ 2 Map Tasks, 1 Reduce Task so all output is one file
○ Completed in approximately 5 minutes
○ 170MB Inverted File
● Document Similarity
○ 20 Small EC2 Instances
○ 5 Map Tasks per instance (100 total)
○ 1 Reduce Task per instance (20 total)
○ Map phase only 50% complete after 12 hours
What was the problem?
Document Frequency Cut (DF-Cut)
● Document Frequency Cut is the process of ignoring the most
frequent terms in the collection when generating the inverted file
● Most frequent terms generate the longest posting lists
● Longest posting lists generate the most pairs during the mapping
phase
● Common words in the collection aren't a big factor in determining
document similarity
● The paper by Elsayed et al. describes using a 1% DF-Cut, which
ignored the nine thousand most frequent words
● Used a 0.5% DF-Cut on the DBPedia data set, which ignored
approximately 2,300 words
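Selecting which words a DF-Cut drops is a simple ranking over document frequencies. A plain-Java sketch (class and method names are illustrative, not from the original code):

```java
import java.util.*;

public class DfCut {
    // Given each word's document frequency, return the words to KEEP after
    // dropping the top `cutFraction` (e.g. 0.005 for a 0.5% DF-Cut) most
    // frequent words in the vocabulary.
    static Set<String> keepAfterDfCut(Map<String, Integer> docFreq, double cutFraction) {
        List<String> words = new ArrayList<>(docFreq.keySet());
        // Most frequent words first
        words.sort((a, b) -> Integer.compare(docFreq.get(b), docFreq.get(a)));
        int drop = (int) (words.size() * cutFraction);
        return new TreeSet<>(words.subList(drop, words.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000); df.put("cat", 3); df.put("mat", 2); df.put("xylophone", 1);
        // A 25% cut on a 4-word vocabulary drops the single most frequent word
        System.out.println(keepAfterDfCut(df, 0.25));  // [cat, mat, xylophone]
    }
}
```

In the actual pipeline the cut would be applied when generating the inverted file, so the dropped words never produce posting lists at all.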
Second Attempt
● Same parameters as the first attempt, except the Inverted File used a
0.5% DF-Cut
● Inverted File
○ Size reduced to around 80MB
○ Completed slightly faster
● Document Similarity
○ Completed in approximately 1.5 hours
○ Produced 16GB of output
● Small Text Sorting
○ Completed in approximately 30 mins
● Most Similar Documents
○ African_Broadbill and African_Shrike-flycatcher
Third Attempt
● Increased size of data set
○ First 600k Abstracts
○ 400MB File
○ 1.2 Million Words
● Same number of instances and tasks as previous attempts
● Same DF-Cut which removed 6k words
● Changed Inverted File to produce postings list sorted by Document
Id
● Changed Document Similarity to not produce <doc1, doc2> and
<doc2, doc1>
● Document Similarity completed in 2.5 hours
● Produced 30GB of output files
● SmallText completed sorting in 2 hours
● The most similar documents seemed inaccurate
Conclusions
● Amazon Web Services makes it easy for anyone to use
cloud computing for data mining tasks
● Map Reduce / Hadoop makes it easy to implement
distributed processing and hides the complexity from the developer
● It can be hard to debug problems in the cloud
● A more efficient way to store and read the inverted file is needed
● The scoring function may not be accurate

 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 

Document Similarity with Cloud Computing

  • 1. Document Similarity with Cloud Computing by Bryan Bende
  • 2. What is Cloud Computing ? "A style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet." - Wikipedia ● Resources could be storage, processing power, applications, etc ● Third-party providers own the cloud ● Customers rent resources for an affordable price
  • 3. Amazon Web Services ● Amazon provides several web services that utilize cloud computing: ○ Elastic Compute Cloud (EC2) ○ Simple Storage Service (S3) ○ Simple DB ○ Simple Queue Service (SQS) ○ Elastic Map Reduce ● Pay only for what you use - services typically charge based on bandwidth in and out, and hourly or monthly usage; rates are very affordable
  • 4. Amazon Elastic Compute Cloud (EC2) ● Provides resizable computing capacity ● Customer requests a number of instances and the type of OS image to load on each instance ● Instances are allocated on-demand and can be added at any time (more than 20 instances requires approval) ○ Small Instance (Default) ■ 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage ○ Large Instance ■ 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage ○ Extra Large Instance ■ 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage ● On-Demand Instance pricing: Small (Default) $0.10/hour Linux/UNIX, $0.125/hour Windows; Large $0.40/hour Linux/UNIX, $0.50/hour Windows; Extra Large $0.80/hour Linux/UNIX, $1.00/hour Windows ● Also pay $0.10/GB data in, $0.17/GB data out (for the first 10 TB)
  • 5. Amazon Simple Storage Service (S3) ● Provides data storage in the cloud ● Write, read, and delete objects up to 5 GB in size; the number of objects is unlimited ● Each object is stored in a bucket and retrieved via a unique, developer-assigned key Storage ● $0.150 per GB – first 50 TB / month of storage used Data Transfer ● $0.10 per GB – all data transfer in ● $0.17 per GB – first 10 TB / month data transfer out Requests ● $0.01 per 1,000 PUT, COPY, POST, or LIST requests ● $0.01 per 10,000 GET and all other requests*
  • 6. How do we use these services ? Typical Scenario: 1. Transfer data to be processed into S3 2. Launch a cluster of machines on EC2 3. Transfer data from S3 onto master node of cluster 4. Launch a job that uses the cluster to process the data 5. Send results back to S3, or SCP back to local machine 6. Shutdown EC2 instances All data on the EC2 instances is lost when shutting down How do we use the cluster to process the data ?
  • 7. Map Reduce / Hadoop ● Map Reduce is a software framework to support distributed computing on large data sets ● Does not have to be used with cloud computing, could be used with a personal cluster of machines ● Map task produces key/value pairs from input ● Reduce task receives all the key/value pairs with the same key ● Framework handles distributing the data, developers only write the Map and Reduce operations ● Hadoop is a Java-based open-source implementation of Map Reduce Diagram from: http://www.sigcrap.org/2008/01/23/mapreduce-a-major-disruptionto-database-dogma/
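The Map/Reduce contract described on the slide above can be sketched in plain Python (this is an illustration, not the author's Hadoop job): the map task emits key/value pairs, and the framework's job of routing every pair with the same key to a single reduce call is played here by a dictionary.

```python
from collections import defaultdict

def map_phase(documents):
    """Map task: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce task: receives all the values for one key at a time."""
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the cat sat", "the dog sat"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```

The developer writes only `map_phase` and `reduce_phase`; in Hadoop the shuffle step, and the distribution of both phases across machines, are handled entirely by the framework.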
  • 8. Hadoop Continued... Map Function: public void map(LongWritable key, Text value, OutputCollector&lt;Text, Tuple&gt; output, Reporter reporter) throws IOException { ... } Reduce Function: public void reduce(Text key, Iterator&lt;Tuple&gt; values, OutputCollector&lt;Text, Tuple&gt; output, Reporter reporter) throws IOException { ... } Main Method ● Creates a Job ● Specifies the Map class, Reduce class, Input path, Output path, Number of Map tasks, and Number of Reduce tasks ● Submits the job
  • 9. Experiment: Compute Document Similarity Motivated by Pairwise Document Similarity in Large Collection with Map Reduce by Tamer Elsayed, Jimmy Lin, and Douglas W. Oard ● Score every document in a large collection against every other document in the collection ● Similar to the process of scoring a query against a document, but instead of a query we have another document ● Data Set - Wikipedia Abstracts provided by DBPedia ○ Pre-processed so each abstract is on a single line with the wikipedia URL at the beginning of each line ○ Used Wikipedia URL as a document id, rest of the text as the document Example Data: <http://dbpedia.org/resource/Bulls-Pistons_rivalry> The Bulls-Pistons rivalry originated in the 1970's and was most intense in the late 1980s - early 1990's, a period when the Bulls' superstar, Michael Jordan, ...
  • 10. Step 1 - Inverted File with Map Reduce ● Each line of the input file gets passed to a Mapper (i.e., each Mapper handles one document at a time because of the DBPedia format, which simplifies everything) ● Mapper tokenizes and normalizes the text ● Produces key/value pairs where each key is a word and the value is a tuple containing the doc id, doc term frequency, and doc length ○ <word1, (doc1, dtf1, docLength)> ○ <word2, (doc1, dtf2, docLength)> ● Each Reducer receives all the records for a single key at one time (handled by the framework) ● Iterates over each record and uses the dtf and doc length to calculate a score for the word in the given document ● Produces a posting list for the word ○ <word1, (doc1, w1), (doc2, w2) ... (docN, wN)>
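Step 1 can be sketched in plain Python (the real job is a Hadoop Mapper/Reducer pair). The weighting function is passed in as a parameter here because the slides compute Okapi weights in the reducer; the `tf / dl` lambda below is just a placeholder.

```python
from collections import Counter, defaultdict

def map_document(doc_id, text, emit):
    """Mapper: tokenize/normalize, then emit <word, (doc, dtf, docLength)>."""
    tokens = text.lower().split()
    dtf = Counter(tokens)                  # document term frequencies
    for word, tf in dtf.items():
        emit(word, (doc_id, tf, len(tokens)))

def build_inverted_file(docs, weight):
    """Reducer side: turn (doc, dtf, dl) tuples into (doc, weight) postings."""
    postings = defaultdict(list)
    emit = lambda w, t: postings[w].append(t)
    for doc_id, text in docs.items():
        map_document(doc_id, text, emit)
    return {w: [(d, weight(tf, dl)) for d, tf, dl in tuples]
            for w, tuples in postings.items()}

docs = {"d1": "apple banana apple", "d2": "banana cherry"}
index = build_inverted_file(docs, weight=lambda tf, dl: tf / dl)
# index["banana"] == [("d1", 1/3), ("d2", 0.5)]
```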
  • 12. Scoring Function ● Okapi Term Weighting - variation from the paper by Scott Olsson and Douglas Oard, Improving Text Classification: w(tf, dl) = tf / ( 0.5 + 1.5( dl / avdl ) + tf ) where tf = term frequency in the document, dl = document length, avdl = average document length for the collection ● Wrote utility to pre-compute AVDL, hard-coded into the Inverted File Reducer
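The Okapi term weight above translates directly into code; a minimal version (with avdl as an explicit parameter rather than hard-coded, as it was in the slides' reducer):

```python
def okapi_weight(tf, dl, avdl):
    """Okapi term weight from the slides:
    w(tf, dl) = tf / (0.5 + 1.5 * (dl / avdl) + tf)

    tf   -- term frequency in the document
    dl   -- document length
    avdl -- average document length for the collection (pre-computed)
    """
    return tf / (0.5 + 1.5 * (dl / avdl) + tf)

# For a document of exactly average length (dl == avdl) the weight
# simplifies to tf / (2 + tf), e.g. tf=2 -> 0.5
```

The weight grows with tf but saturates toward 1, and longer-than-average documents are penalized via the dl / avdl term.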
  • 13. Step 2 - Map Over the Inverted File ● Each Mapper receives one posting list at a time ● For each posting, go to every other posting and produce a tuple where the key contains the doc ids of each posting, and the value contains the product of the weights ○ <(doc1, doc2), combined weight> ○ <(doc1, doc3), combined weight> ○ <(doc1, docN), combined weight> ● Each Reducer receives all records for one pair of doc ids at one time ● Sums all the combined weights to get the total score for doc X vs doc Y
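Step 2 can likewise be sketched in plain Python: map over each posting list, emit a partial score for every pair of documents that share the term, and sum per pair (the reducer's job). Using `combinations()` here mirrors the third attempt's later optimization of not producing both <doc1, doc2> and <doc2, doc1>.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(inverted_file):
    """Sum w_i * w_j over all terms shared by each document pair."""
    scores = defaultdict(float)
    for word, postings in inverted_file.items():
        for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
            scores[(d1, d2)] += w1 * w2   # reducer sums the partial products
    return dict(scores)

index = {"banana": [("d1", 0.4), ("d2", 0.5)],
         "cherry": [("d2", 0.5), ("d3", 0.2)],
         "apple":  [("d1", 0.6), ("d2", 0.1), ("d3", 0.3)]}
sims = pairwise_similarity(index)
# sims[("d1", "d2")] == 0.4*0.5 + 0.6*0.1 == 0.26
```

This also makes the first attempt's blow-up visible: a posting list of length n produces n(n-1)/2 pairs, so the most frequent terms dominate the map phase's output.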
  • 14. Tools and Technologies ● Amazon EC2 and S3 ● Map Reduce / Hadoop 0.17 ● Cloud9 ○ Library developed by Jimmy Lin at University of Maryland ○ Helper classes and script for working with Hadoop ● JetS3t Cockpit ○ Application to manage S3 buckets ● Small Text ○ Java Library for performing external sorting of large files
  • 15. Experiment Steps 1. Transfer DBPedia data into S3 2. Start EC2 Cluster using Cloud9 scripts 3. Transfer DBPedia data onto the master node of the cluster 4. Put DBPedia data into Hadoop's Distributed File System (HDFS) 5. Transfer the JAR file containing the Mapper, Reducer, and Main Method to the master node 6. Submit the Inverted File job 7. Submit the Document Similarity job 8. SCP the results of the Document Similarity job back to the local machine 9. Concatenate all the partial result files into one file 10. Run SmallText to perform an external sort on the concatenated file
  • 16. First Attempt ● Used subset of DBPedia data ○ First 160k abstracts from the full set ○ 98 MB file ○ 460k unique words ● Inverted File ○ 2 Small EC2 Instances ○ 2 Map Tasks, 1 Reduce Task so all output is one file ○ Completed in approximately 5 minutes ○ 170 MB Inverted File ● Document Similarity ○ 20 Small EC2 Instances ○ 5 Map Tasks per instance (100 total) ○ 1 Reduce Task per instance (20 total) ○ Map phase only 50% complete after 12 hours What was the problem?
  • 17. Document Frequency Cut (DF-Cut) ● Document Frequency Cut is the process of ignoring the most frequent terms in the collection when generating the inverted file ● Most frequent terms generate the longest posting lists ● Longest posting lists generate the most pairs during the mapping phase ● Common words in the collection aren't a big factor in determining document similarity ● Paper by Elsayed describes using a 1% DF-Cut, which ignored the nine thousand most frequent words ● Used a 0.5% DF-Cut on the DBPedia data set, which ignored approximately 2,300 words
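A DF-cut can be sketched as follows — a hedged illustration, assuming the cut fraction is applied to the vocabulary size (the slides don't spell out the exact mechanics):

```python
from collections import Counter

def df_cut(doc_tokens, cut_fraction):
    """Return the vocabulary with the most frequent terms removed.

    doc_tokens   -- list of token sets, one per document
    cut_fraction -- fraction of the vocabulary to discard (e.g. 0.005)
    """
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))            # count documents, not occurrences
    n_cut = int(len(df) * cut_fraction)
    stop = {w for w, _ in df.most_common(n_cut)}
    return set(df) - stop

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "fish"}]
vocab = df_cut(docs, cut_fraction=0.25)   # 4 words * 0.25 -> drop 1
# "the" (df=3) is dropped; vocab == {"cat", "dog", "fish"}
```

In the actual pipeline the cut would be applied before (or while) generating the inverted file, so the discarded terms never produce posting lists at all.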
  • 18. Second Attempt ● Same parameters as the first attempt except the Inverted File used a 0.5% DF-Cut ● Inverted File ○ Size reduced to around 80 MB ○ Completed slightly faster ● Document Similarity ○ Completed in approximately 1.5 hours ○ Produced 16 GB of output ● Small Text Sorting ○ Completed in approximately 30 minutes ● Most Similar Documents ○ African_Broadbill and African_Shrike-flycatcher
  • 19. Third Attempt ● Increased size of data set ○ First 600k Abstracts ○ 400MB File ○ 1.2 Million Words ● Same number of instances and tasks as previous attempts ● Same DF-Cut which removed 6k words ● Changed Inverted File to produce postings list sorted by Document Id ● Changed Document Similarity to not produce <doc1, doc2> and <doc2, doc1> ● Document Similarity completed in 2.5 hours ● Produced 30GB of output files ● SmallText completed sorting in 2 hours ● Most similar documents seem inaccurate
  • 20. Conclusions ● Amazon Web Services makes it easy for anyone to use Cloud Computing for data mining tasks ● Map Reduce / Hadoop makes it easy to implement distributed processing and hides the complexity from the developer ● Can be hard to debug problems in the cloud ● Future work: a more efficient way to store and read the inverted file ● Scoring function may not be accurate