Social Media Mining using GAE Map Reduce

Social Network Mining
Solutions using Google App Engine Map Reduce

J Singh, DataThinks.org

October 19, 2011

MapReduce: A Genealogical Perspective
• Roots
– Lisp, Scheme
– APL

• Google MapReduce paper (OSDI), 2004
– Exploit extreme parallelism of data

• Apache Top Level Project (Hadoop)

• GAE MapReduce borrows from these

© J Singh, 2011 2

Social Network Mining
• Finding people based on data in social networks
– Love and Romance
– Common interests
– Similar buying habits
– Similar voting propensities
– Location

• It's not a new problem
– We have additional solutions for the old problem
• Examples based on proprietary data: eHarmony, etc.
• Early examples based on social network data: ShoutFlow,
WhoIsJustLikeMe.


Based on clustering algorithms
• On-line demo of clustering
• Resource intensive
– Best done in batch mode

• Exploit data parallelism of the
algorithm
– App Engine Map Reduce,
employing one map job for
each cluster
– App Engine Pipeline API,
employing one stage of the
pipeline for each 'step'

• But first, a detour into Map
Reduce…

MapReduce Conceptual Underpinnings
• Based on Functional Programming model
– From Lisp / Scheme
• (map square '(1 2 3 4)) → (1 4 9 16)
• (reduce plus '(1 4 9 16)) → 30
– From APL
• +/ N, where N ← 1 2 3 4

• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
– Needed because hundreds or thousands of low-end servers run at the
same time, so individual failures are expected
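The Lisp and APL one-liners above translate almost directly into Python. This is a minimal sketch using only the standard library; the names square, numbers, squares, total and apl_sum are chosen here for illustration.

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

numbers = [1, 2, 3, 4]

# map: each element is handled independently, which is what makes it easy to distribute
squares = list(map(square, numbers))

# reduce: fold the mapped values into a single result
total = reduce(add, squares)

# the APL expression +/ N is a plain sum-reduction over N
apl_sum = reduce(add, numbers)

print(squares, total, apl_sum)
```

Because square never looks at any other element, the four calls could run on four different machines and be combined afterwards.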


MapReduce Flow


MapReduce Components in GAE 2011
• Input Reader
– Several provided by GAE, can write your own

• Map function: Written by Programmer

• Shuffle function:
– Provided by GAE, can write your own

• Reduce function: Written by Programmer

• Output Writer
– Several provided by GAE, can write your own
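To make the five components concrete, here is a single-process sketch of how they could fit together. It is plain Python and does not use the GAE library; every function name below is an assumption made for illustration, and the word-count job is only an example workload.

```python
from collections import defaultdict

def input_reader(documents):
    """Input reader: hand records to the mapper one at a time."""
    for doc in documents:
        yield doc

def map_fn(line):
    """Map (written by the programmer): emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):
    """Reduce (written by the programmer): collapse one key's values."""
    yield key, sum(values)

def output_writer(results):
    """Output writer: here it just collects results into a dict."""
    return dict(results)

def run_job(documents):
    mapped = (pair for record in input_reader(documents)
              for pair in map_fn(record))
    reduced = (out for key, values in shuffle(mapped)
               for out in reduce_fn(key, values))
    return output_writer(reduced)

counts = run_job(["a b a", "b c"])
print(counts)
```

Only map_fn and reduce_fn are specific to the job; the other three stages are generic, which matches the split between "provided by GAE" and "written by programmer" above.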


Invoking GAE Map Reduce
class MapreducePipeline(…):
    def run(self,
            job_name,            # A string
            mapper_spec,         # Mapper function
            reducer_spec,        # Reducer function
            input_reader_spec,   # Input reader fn
            output_writer_spec,  # Output writer
            mapper_params,       # A dictionary
            reducer_params,      # A dictionary
            shards):             # An int
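The sketch below shows the shape of the arguments, assuming the *_spec values are dotted-path strings that name a function or class, as I understand the library's demo application to use. describe_job and resolve are hypothetical stand-ins written for this example, not part of GAE, and the spec strings are placeholders (two from the Python 3 standard library so the sketch runs anywhere, two invented).

```python
import importlib

def resolve(spec):
    """Turn a dotted path like 'module.name' into the object it names."""
    module_name, _, attr = spec.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

def describe_job(job_name, mapper_spec, reducer_spec,
                 input_reader_spec, output_writer_spec,
                 mapper_params, reducer_params, shards):
    """Hypothetical stand-in for run(): validate and bundle the arguments."""
    if not isinstance(shards, int) or shards < 1:
        raise ValueError("shards must be a positive int")
    return {
        "job_name": job_name,
        "mapper": resolve(mapper_spec),
        "reducer": resolve(reducer_spec),
        "input_reader": input_reader_spec,    # left as a string in this sketch
        "output_writer": output_writer_spec,  # left as a string in this sketch
        "mapper_params": dict(mapper_params),
        "reducer_params": dict(reducer_params),
        "shards": shards,
    }

job = describe_job(
    job_name="demo_job",
    mapper_spec="math.sqrt",        # placeholder: any importable callable
    reducer_spec="builtins.sum",    # placeholder: any importable callable
    input_reader_spec="hypothetical.readers.LineReader",
    output_writer_spec="hypothetical.writers.ListWriter",
    mapper_params={"source": "example"},
    reducer_params={},
    shards=4,
)
print(job["job_name"], job["shards"])
```

Passing names rather than function objects means the job description is just strings, dictionaries and an int, so it can be stored and handed to whichever worker picks up a shard.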


GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
– Perform complex operations by piping data between primitives
– But the primitives are not so primitive
– Data lives in permanent storage between pipeline stages

• MapreducePipeline (prev page) was just one type of pipeline
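The point about data living in storage between stages can be illustrated without App Engine. In this sketch a dict stands in for permanent storage, and run_stage, clean and dedupe are hypothetical names; it shows the idea only, not how the Pipeline API is implemented.

```python
def run_stage(name, func, input_key, store):
    """Read a stage's input from storage, run it, and persist its output."""
    store[name] = func(store[input_key])
    return name  # the key under which the next stage finds its input

def clean(names):
    """Stage 1: normalise raw strings."""
    return [n.strip().lower() for n in names]

def dedupe(names):
    """Stage 2: drop duplicates and sort."""
    return sorted(set(names))

store = {"input": [" Ann", "bob ", "ANN"]}  # stand-in for permanent storage

key = run_stage("cleaned", clean, "input", store)
key = run_stage("unique", dedupe, key, store)

print(store["unique"])
```

Since every intermediate result stays in store, a stage that fails can be rerun from its saved input instead of restarting the whole chain.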


Pipeline API Example Code
Split and Merge example

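# Import paths depend on how the Pipeline library is bundled with the app
import pipeline
from pipeline import common

class aPipe(pipeline.Pipeline):
    def run(self, e_kind, prop_name, *value_list):
        all_bs = []
        for v in value_list:
            # Split: start one bPipe child (defined elsewhere) per value
            stage = yield bPipe(e_kind, prop_name, v)
            all_bs.append(stage)
        # Merge: combine the children's outputs into one list
        yield common.Append(*all_bs)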
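The split-and-merge pattern relies on a generator that yields work requests and receives something back from each yield. The toy driver below shows that mechanism in plain Python. All names (drive, a_pipe, b_stage, append_stage) are hypothetical, and the driver runs children one after another and sends back real results, whereas the Pipeline library, as far as I know, hands back placeholder objects and schedules the children itself.

```python
def b_stage(kind, prop, value):
    """Hypothetical child stage: describe the query it would run."""
    return "%s.%s=%s" % (kind, prop, value)

def append_stage(*values):
    """Merge step: combine the children's outputs into one list."""
    return list(values)

def a_pipe(kind, prop, *values):
    """Split: one child request per value. Merge: one final append request."""
    results = []
    for v in values:
        result = yield (b_stage, (kind, prop, v))
        results.append(result)
    yield (append_stage, tuple(results))

def drive(gen):
    """Run each yielded (function, args) request and send the result back."""
    result = None
    try:
        request = next(gen)
        while True:
            func, args = request
            result = func(*args)
            request = gen.send(result)
    except StopIteration:
        return result  # output of the last request the generator made

merged = drive(a_pipe("Person", "city", "Boston", "Austin"))
print(merged)
```

The generator itself never calls b_stage; it only describes what should run, which is what lets a framework decide where and when each child executes.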


Pause and Assess
• Assertion:
– GAE Map/Reduce is a complete solution for social network
mining analysis
– We know it will scale, the question is how far.

• Working on one Proof of Concept for Social Network Mining
– Recruiting a second test case

• Will report back in 3-4 months with data on
– Performance
– Cost
– Limits of scalability


Adapting the algorithm to M/R
• Clustering Algorithm

1. Create k randomly placed centroids

2. Find the centroid (1-k) closest to each data point
(Map: each data point)

3. Move each centroid to the average of its members
(Reduce: each centroid)

4. Repeat 2 and 3 until there is no more change
(Connect to next stage using Pipeline API)
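The four steps can be sketched as a single-process k-means loop with the map and reduce roles kept as separate functions. This is an illustration under stated assumptions, not the author's implementation: the function names are invented here, points are tuples of numbers, step 1 picks k existing data points as the starting centroids, and a plain loop stands in for chaining rounds with the Pipeline API.

```python
import random
from collections import defaultdict

def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def map_point(point, centroids):
    """Map, once per data point: emit (centroid index, point)."""
    yield closest(point, centroids), point

def reduce_centroid(index, members):
    """Reduce, once per centroid: move it to the mean of its members."""
    n = len(members)
    yield index, tuple(sum(coord) / n for coord in zip(*members))

def kmeans(points, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 1 (assumption: seed from data)
    for _ in range(max_rounds):              # step 4: repeat until stable
        groups = defaultdict(list)           # shuffle: group points by centroid
        for point in points:
            for idx, p in map_point(point, centroids):   # step 2
                groups[idx].append(p)
        updated = list(centroids)            # a centroid with no members stays put
        for idx, members in groups.items():
            for i, c in reduce_centroid(idx, members):   # step 3
                updated[i] = c
        if updated == centroids:
            break
        centroids = updated
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = kmeans(points, 2)
print(sorted(centroids))
```

Each round is one map/reduce job, and the output centroids of one round are the input of the next, which is the hand-off the Pipeline API would handle.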


About Us
• Involved with Map/Reduce and NoSQL technologies on several
platforms
– Google App Engine, MongoDB

• DataThinks.org is a new service of Early Stage IT
– Building and operating “Big Data” analytics services

Thanks
