Implementation of Classifier tool in Twister (Iterative MapReduce)

Magesh khanna Vadivelu
Shivaraman Janakiraman

Apriori
• Generating frequent 1-itemsets
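
As a rough illustration of this first Apriori pass, the sketch below counts how many transactions contain each item and keeps the items that reach a minimum support count. It is plain single-process Java, not Twister code; the class name `FrequentOneItemsets`, the method `find`, and the comma-delimited transaction format are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; not part of Twister or the authors' code.
class FrequentOneItemsets {

    // Counts how many transactions contain each item and keeps the items
    // whose count reaches minSupportCount.
    static Map<String, Integer> find(List<String> transactions, int minSupportCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String transaction : transactions) {
            // Assumption: a transaction is a comma-delimited list of distinct items.
            for (String item : transaction.split(",")) {
                counts.merge(item.trim(), 1, Integer::sum);
            }
        }
        // TreeMap keeps the result sorted by item name for stable printing.
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupportCount) {
                frequent.put(e.getKey(), e.getValue());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<String> transactions =
                Arrays.asList("bread,milk", "bread,butter", "milk,bread,eggs");
        System.out.println(find(transactions, 2)); // prints {bread=3, milk=2}
    }
}
```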

Twister
• Iterative MapReduce
• Configure once, use many times
• Map -> Reduce -> Combine
• Static data is configured through a partition file and reused across iterations
• Provides a fault-tolerant solution

Implementation
• Candidate generation
• Map
• Reduce
• Combine
• Generate frequent items
• Iterate
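
The steps above can be sketched as a single-process loop. This is only an illustration of the control flow, assuming that map counts candidates per data partition, reduce sums the partial counts, and combine applies the minimum support threshold. It does not use the Twister API, and all class and method names (`AprioriLoopSketch`, `map`, `reduce`, `combine`, `nextCandidates`, `run`) are hypothetical. For brevity the candidate step only joins frequent itemsets and omits the usual Apriori subset pruning, which affects speed but not the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Single-process sketch; itemsets are sorted, comma-delimited strings.
class AprioriLoopSketch {

    // Map: within one partition, count the transactions containing each candidate.
    static Map<String, Integer> map(List<Set<String>> partition, Set<String> candidates) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> transaction : partition) {
            for (String candidate : candidates) {
                if (transaction.containsAll(Arrays.asList(candidate.split(",")))) {
                    counts.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Reduce: sum the partial counts produced by all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> totals.merge(k, v, Integer::sum));
        }
        return totals;
    }

    // Combine: keep the candidates that reach the minimum support count.
    static Set<String> combine(Map<String, Integer> totals, int minSupportCount) {
        Set<String> frequent = new TreeSet<>();
        totals.forEach((k, v) -> { if (v >= minSupportCount) frequent.add(k); });
        return frequent;
    }

    // Candidate generation: join frequent itemsets into itemsets of the given size.
    static Set<String> nextCandidates(Set<String> frequent, int size) {
        Set<String> candidates = new TreeSet<>();
        for (String a : frequent) {
            for (String b : frequent) {
                Set<String> union = new TreeSet<String>(Arrays.asList(a.split(",")));
                union.addAll(Arrays.asList(b.split(",")));
                if (union.size() == size) candidates.add(String.join(",", union));
            }
        }
        return candidates;
    }

    // Iterate until no candidate is frequent; returns frequent itemsets by size.
    static List<Set<String>> run(List<List<Set<String>>> partitions, int minSupportCount) {
        // Initial candidates: every single item seen in the data.
        Set<String> candidates = new TreeSet<>();
        for (List<Set<String>> p : partitions) for (Set<String> t : p) candidates.addAll(t);

        List<Set<String>> frequentBySize = new ArrayList<>();
        for (int k = 1; !candidates.isEmpty(); k++) {
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (List<Set<String>> p : partitions) partials.add(map(p, candidates));
            Set<String> frequent = combine(reduce(partials), minSupportCount);
            if (frequent.isEmpty()) break;
            frequentBySize.add(frequent);
            candidates = nextCandidates(frequent, k + 1);
        }
        return frequentBySize;
    }
}
```

In this sketch the partitions play the role of the static data that stays fixed across iterations, while the candidate set is the only input that changes from one iteration to the next.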

Data Structures
• Vector
• Comma-delimited String
• StringValue
• HashMap<String, Integer>
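
One way these structures could fit together is shown below: an itemset held in a `Vector<String>` is turned into a sorted, comma-delimited `String` so that it can serve as a key in a `HashMap<String, Integer>` of support counts. The class `ItemsetKeys` and its methods are hypothetical, and sorting the items before joining is an assumption made so that the same itemset always yields the same key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.Vector;

// Hypothetical helper; not part of Twister or the authors' code.
class ItemsetKeys {

    // Sort the items and join them with commas to get a canonical key.
    static String toKey(Vector<String> items) {
        return String.join(",", new TreeSet<String>(items));
    }

    // Split a key back into its items.
    static Vector<String> fromKey(String key) {
        return new Vector<String>(Arrays.asList(key.split(",")));
    }

    public static void main(String[] args) {
        Map<String, Integer> supportCounts = new HashMap<>();
        // Both orderings map to the same key, so their counts accumulate together.
        supportCounts.merge(toKey(new Vector<String>(Arrays.asList("milk", "bread"))), 1, Integer::sum);
        supportCounts.merge(toKey(new Vector<String>(Arrays.asList("bread", "milk"))), 1, Integer::sum);
        System.out.println(supportCounts); // prints {bread,milk=2}
    }
}
```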

Inputs
• Configuration file
– Number of items & transactions
– Minimum support count (%)

• Partition file
– Split data
– Number of items & transactions
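
Since the minimum support is given as a percentage while itemset counts are absolute, the threshold has to be converted at some point. The sketch below shows one way to do this with integer arithmetic. The class `SupportThreshold`, the method `toCount`, and the choice to round up are assumptions; the slides do not say how the conversion was done.

```java
// Hypothetical helper; the rounding rule is an assumption.
class SupportThreshold {

    // Converts a minimum support percentage into an absolute transaction
    // count, rounding up so the threshold is never lower than requested.
    static long toCount(int minSupportPercent, long numTransactions) {
        return (minSupportPercent * numTransactions + 99) / 100;
    }

    public static void main(String[] args) {
        System.out.println(toCount(5, 10000)); // prints 500
        System.out.println(toCount(30, 7));    // prints 3 (30% of 7 is 2.1)
    }
}
```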

Inputs

[Figure: input example annotated with "Number of transactions" and "Number of items"]

Challenges
• Twister API
– StringValue
– Vector<String>
– StringVector
• toByte, fromByte
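
To illustrate the kind of byte-level conversion that values need when they are passed between tasks, the sketch below round-trips a map of itemset counts through a byte array using only the Java standard library. It does not implement or call any Twister interface; the class `CountsCodec` and its `toBytes` and `fromBytes` methods are hypothetical stand-ins for the conversion methods named on the slide.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical codec; not a Twister class.
class CountsCodec {

    // Layout: entry count, then (UTF key, int value) for each entry.
    static byte[] toBytes(Map<String, Integer> counts) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(counts.size());
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Reads the entries back in the same order they were written.
    static Map<String, Integer> fromBytes(byte[] bytes) {
        Map<String, Integer> counts = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                String key = in.readUTF();
                counts.put(key, in.readInt());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return counts;
    }
}
```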

Challenges
• runMapReduce()
• runMapReduce(List<KeyValuePair>)
• runMapReduceBCast(StringValue)

Time vs. Transactions

[Figure: execution time (y-axis, 0 to 14) vs. number of transactions (x-axis: 10000, 20000, 30000)]

Time vs. Itemsets

[Figure: execution time in seconds (y-axis, 0 to 250) vs. number of itemsets (x-axis: 25, 50, 75)]

Time vs. Itemsets

[Figure: execution time in seconds (y-axis, 0 to 250) vs. number of itemsets (x-axis: 25, 50, 75), with one series for 5 mappers and one for 20 mappers]

Implementation of Classifier Tool in Twister
Magesh khanna Vadivelu, Shivaraman Janakiraman
magevadi@indiana.edu, shivjana@indiana.edu

Motivation:
Mining frequent itemsets from large-scale databases has emerged as an important problem in the data mining and knowledge discovery research community. To address this problem, we propose to implement the Apriori algorithm, a classification algorithm, in Twister, a distributed framework that makes use of MapReduce. We specify a map function that processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Our implementation of the Apriori algorithm runs on a large cluster of machines and is highly scalable. At the application level, this Apriori algorithm can be used to identify the patterns in which customers buy products in a supermarket.

Architecture:
Twister has several components. The client side drives MapReduce jobs. Daemons and workers, which live on compute nodes, manage MapReduce tasks. Connections between components are based on SSH and messaging software. To drive MapReduce jobs, the client first needs to configure the job: it configures the MapReduce methods for the job, prepares KeyValue pairs, and configures static data for the MapReduce tasks through a partition file if required. Messages are transmitted through a network of message brokers with a publish/subscribe mechanism.

Results:
[Figure: Time vs. Itemsets]
[Figure: Time vs. Transactions]
Increasing the number of transactions increases the execution time, but not as much as increasing the number of itemsets. This is because transactions are static data cached in memory across map-reduce cycles, whereas itemsets are broadcast for each map-reduce cycle.
