Social Media Mining using GAE Map Reduce
- 1. Social Network Mining
Solutions using Google App Engine Map Reduce
J Singh, DataThinks.org
October 19, 2011
- 2. MapReduce: A Genealogical Perspective
• Roots
– Lisp, Scheme
– APL
• Google OS papers, 2004
– Exploit extreme parallelism of data
• Apache Top Level Project (Hadoop)
• MapReduceGAE borrows from these
© J Singh, 2011 2
- 3. Social Network Mining
• Finding people based on data in social networks
– Love and Romance
– Common interests
– Similar buying habits
– Similar voting propensities
– Location
• It's not a new problem
– We have additional solutions for the old problem
• Examples based on proprietary data: eHarmony, etc.
• Early examples based on social network data: ShoutFlow, WhoIsJustLikeMe.
- 4. Based on clustering algorithms
• On-line demo of clustering
• Resource intensive.
– Best done in batch mode
• Exploit data parallelism of the algorithm
– App Engine Map Reduce, employing one map job for each cluster
– App Engine Pipeline API, employing one stage of the pipeline for each 'step'
• But first, a detour into Map Reduce…
- 5. MapReduce Conceptual Underpinnings
• Based on Functional Programming model
– From Lisp / Scheme
• (map square '(1 2 3 4)) → (1 4 9 16)
• (reduce plus '(1 4 9 16)) → 30
– From APL
• +/ N, where N ← 1 2 3 4
• Easy to distribute (based on each element of the vector)
• New for Map/Reduce: Nice failure/retry semantics
– Needed because hundreds or thousands of low-end servers run at the same time, so individual failures are routine
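The Lisp and APL one-liners above translate almost directly into Python. This is only a small sketch of the functional idea using the standard library, not GAE code; the names `square`, `numbers`, `squares`, `total` and `apl_sum` are made up for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Applied independently to each element, so it is easy to distribute."""
    return x * x


numbers = [1, 2, 3, 4]

# Equivalent of (map square '(1 2 3 4))
squares = list(map(square, numbers))

# Equivalent of (reduce plus '(1 4 9 16))
total = reduce(add, squares)

# Equivalent of APL's +/ N with N set to 1 2 3 4
apl_sum = reduce(add, numbers)

print(squares, total, apl_sum)
```

Because `square` never looks at any element other than its own argument, each call could run on a different machine and be retried on its own if that machine fails.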
- 7. MapReduce Components in GAE 2011
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function:
– Provided by GAE, can write your own
• Reduce function: Written by Programmer
• Output Writer
– Several provided by GAE, can write your own
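To make the five roles concrete, here is a single-process sketch in plain Python. It does not use the GAE library at all; every function name below (`input_reader`, `map_fn`, `shuffle`, `reduce_fn`, `output_writer`, `run_job`) is a hypothetical stand-in, and the word-count job is just a convenient example.

```python
from collections import defaultdict


def input_reader(lines):
    """Input reader: hand records to the mapper one at a time."""
    for line in lines:
        yield line


def map_fn(record):
    """Map (written by the programmer): emit (key, value) pairs."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group every value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce (written by the programmer): collapse one key's values."""
    return key, sum(values)


def output_writer(results):
    """Output writer: here the results are simply collected into a dict."""
    return dict(results)


def run_job(lines):
    """Wire the five components together in order."""
    pairs = (pair for record in input_reader(lines) for pair in map_fn(record))
    return output_writer(reduce_fn(k, vs) for k, vs in shuffle(pairs))


counts = run_job(["alice likes hiking", "bob likes hiking"])
print(counts)
```

In a real deployment the reader, shuffler and writer would be the framework-provided pieces, and only `map_fn` and `reduce_fn` would be job-specific.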
- 8. Invoking GAE Map Reduce
class MapreducePipeline(…):
    def run(self,
            job_name,            # A string
            mapper_spec,         # Mapper function
            reducer_spec,        # Reducer function
            input_reader_spec,   # Input reader fn
            output_writer_spec,  # Output writer
            mapper_params,       # A dictionary
            reducer_params,      # A dictionary
            shards,              # An int
            )
- 9. GAE Pipeline API
• Based on Python Generator functions
• The old Unix idea on steroids:
– Perform complex operations by piping data between primitives
– But the primitives are not so primitive
– Data lives in permanent storage between pipeline stages
• MapreducePipeline (prev page) was just one type of pipeline
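The generator idea can be illustrated with a toy runner in plain Python. This is not the Pipeline API: `Stage` and `run_pipeline` are hypothetical helpers, stages run synchronously, and an in-memory list stands in for the storage that holds data between stages.

```python
class Stage:
    """Hypothetical stand-in for one pipeline stage: a function and its arguments."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args


def run_pipeline(generator):
    """Drive a generator: run each yielded Stage and send its result back in."""
    store = []      # stands in for storage between pipeline stages
    result = None
    try:
        stage = next(generator)
        while True:
            result = stage.fn(*stage.args)
            store.append(result)
            stage = generator.send(result)
    except StopIteration:
        pass
    return result, store


def split_and_merge(values):
    """Fan out one stage per value, then merge the outputs in a final stage."""
    parts = []
    for v in values:
        doubled = yield Stage(lambda x: x * 2, v)
        parts.append(doubled)
    yield Stage(sum, parts)


final, history = run_pipeline(split_and_merge([1, 2, 3]))
print(final, history)
```

The point of the sketch is the control flow: the generator only describes which stage comes next, while the runner decides how and where each stage executes and keeps every intermediate result.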
- 10. Pipeline API Example Code
Split and Merge example
class aPipe(pipeline.Pipeline):
    def run(self, e_kind, prop_name, *value_list):
        all_bs = []
        for v in value_list:
            # Split: start one child bPipe stage per value
            stage = yield bPipe(e_kind, prop_name, v)
            all_bs.append(stage)
        # Merge: combine the outputs of all child stages
        yield common.Append(*all_bs)
- 11. Pause and Assess
• Assertion:
– GAE Map/Reduce is a complete solution for social network mining
– We know it will scale, the question is how far.
• Working on one Proof of Concept for Social Network Mining
– Recruiting a second test case
• Will report back in 3-4 months with data on
– Performance
– Cost
– Limits of scalability
- 12. Adapting the algorithm to M/R
• Clustering Algorithm
1. Create k randomly placed centroids
2. Find the centroid (1-k) closest to each data point (Map: each data point)
3. Move each centroid to the average of its members (Reduce: each centroid)
4. Repeat 2 and 3 until there is no more change (connect to the next stage using the Pipeline API)
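The four steps can be sketched as a single-process Python loop in which step 2 plays the role of the map and step 3 the role of the reduce. This is an illustration under assumptions, not the deck's implementation: the helper names are made up, distance is taken to be squared Euclidean, and the starting centroids are passed in explicitly instead of being placed at random so that the run is repeatable.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_point(point, centroids):
    """Map (step 2): emit (centroid index, point) for one data point."""
    return closest(point, centroids), point


def reduce_centroid(points):
    """Reduce (step 3): average the members of one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, max_iters=100):
    """Step 4: repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iters):
        groups = defaultdict(list)
        for point in points:
            idx, member = map_point(point, centroids)
            groups[idx].append(member)
        # A centroid with no members keeps its previous position.
        new_centroids = [
            reduce_centroid(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(result)
```

Each pass of the loop corresponds to one map/reduce job; in the pipelined version each pass would be a separate stage whose output centroids feed the next stage.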
- 13. About Us
• Involved with Map/Reduce and NoSQL technologies on several
platforms
– Google App Engine, MongoDB
• DataThinks.org is a new service of Early Stage IT
– Building and operating “Big Data” analytics services
Thanks