SNS Analysis using Cloud Computing Services
1. PlatformDay2009
SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis
DongWoo Lee
oiko.cloud@gmail.com
OikoLab
CloudKR
2. Agenda
‣ Introduction
• Social Network Service
• Motivation : Visualization, Social Network Analysis
• SocialFlow
• Scale Out Technologies : Cloud Computing
‣ SNS Analysis Architecture based on Cloud
• Overall Process
• Crawling
• DHT Storage (CouchDB)
• MapReduce
• Pair-Wise Similarity
‣ Cloud Computing Service
• Amazon Web Service
• EC2 / S3 / Elastic MapReduce
• Tips
‣ References
3. Introduction
Social Network · Cloud Computing · Mobile Device
4. Social Network Service
“Social Applications = Social Networks”
“A social network is a collection of people bound together
through a specific set of social relations.”
“A collection of people is a social network if and only if it is
possible for something to spread virally through that collection.”
13. SocialFlow
‣ Thoughts, Feelings, Interests, Relationship and Information of SNS
‣ Real-time Massive Social Data Streams
‣ Difficult to follow the Social Streams
‣ Need a way to get a summary or clustered information based on Common Interests
14. SocialFlow
‣ Getting Common Flows of people through Content Similarities
‣ Reflecting Short-Term Interests of People
‣ Extracting Hot Issues
‣ Revealing Relationships among In/Out Resources
‣ Implementing Scale-Out Technologies
‣ Evolving toward a Recommendation System based on Collective Intelligence
19. Experimental Project
‣Python / Django / Boto
‣ML / Data Mining
‣DHT / CouchDB
‣Cloud / AWS S3, EC2, Hadoop MapReduce
20. Workflow
SNS → Crawler → MapReduce → Post-Processing → CDN → User
(Crawler: in-house cluster in a local data center; MapReduce and post-processing: cloud service)
21. Technologies : Before
‣ Crawlers
‣ Consistent Hash_ring (DHT)
‣ CouchDB Key-Value Storage
‣ CouchJS MapReduce
‣ Home-Made Machine Learning
22. Technologies : After
‣ Crawlers
‣ Consistent Hash_ring (DHT)
‣ CouchDB Key-Value Storage → AWS S3 Storage
‣ MapReduce → Hadoop on EC2
‣ Home-Made Machine Learning
23. Crawling
‣ Fetching recent postings of SNS
‣ Storing fetched postings to CouchDB Storage through the DHT Layer (which selects a server)
‣ Pushing raw data into the Cloud to process them with MapReduce
[Diagram: multiple Crawlers write [term, doc] entries to per-node DBs through the DHT layer, which also handles replication; an Indexer/Mapper later reads the stored postings into an Index File]
24. Consistent DHT (Distributed Hash Table)
‣ Uniform key distribution and load balancing with a good hash function
‣ Minimizing the effects of a storage crash or temporary downtime
‣ High availability with a replication scheme
‣ Notice: a real node owns non-linear portions of the total key space
[Diagram: hash ring 0..N-1 with Node k-1, Node k, Node k+1; Replicate(k, k-1, k+1) places replicas on the neighbouring nodes]
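The ring above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's actual Hash_ring code: node names and the virtual-node count are made up, and MD5 stands in for whatever hash function the real system uses.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent hash ring sketch. A key maps to the first
    node clockwise from its hash position, so adding a node only
    remaps the keys that fall into that node's new arcs."""

    def __init__(self, nodes, vnodes=100):
        # virtual nodes give each real node many small, non-linear
        # arcs of the key space, smoothing the key distribution
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring position at or after the key's hash, wrapping at 0
        idx = bisect(self._positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["couch-a", "couch-b", "couch-c"])
owners = {k: ring.node_for(k) for k in (f"post-{i}" for i in range(1000))}
bigger = HashRing(["couch-a", "couch-b", "couch-c", "couch-d"])
moved = sum(owners[k] != bigger.node_for(k) for k in owners)
print(f"{moved}/1000 keys remapped after adding a node")
```

With a plain `hash(key) % N` scheme, adding a node would remap almost every key; here only roughly a quarter of them move, which is what makes crawler storage survive a node crash or addition cheaply.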
25. Consistent DHT (Distributed Hash Table)
[Diagram: Admin traffic and anonymous user traffic enter through a DHT front end with a memory cache; the SNS Crawler and SNS Analysis push generated contents (HTML, images) to AWS S3; behind the front end sits the DHT ring (Node k-1, Node k, Node k+1)]
26. Consistent DHT : Replication
‣ Replica = 2
[Diagram: nodes A, B, C, D on the ring; each key range (A-B, B-C, C-D, D-A) is stored on its primary node and replicated to the two neighbouring nodes]
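The Replica = 2 placement can be sketched as a small helper. This is an illustrative simplification (node names and the hash choice are assumptions): the primary node k for a key plus its two ring neighbours, mirroring Replicate(k, k-1, k+1).

```python
import hashlib

def ring_position(name):
    # place node names and keys on the same ring by hashing them
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def replica_nodes(nodes, key):
    """With Replica = 2, a key lives on its primary node k plus the
    two neighbouring nodes: Replicate(k, k-1, k+1). Assumes at
    least three nodes, otherwise replicas would collapse."""
    ordered = sorted(nodes, key=ring_position)
    pos = ring_position(key)
    # primary = first node clockwise from the key's position
    idx = next((i for i, n in enumerate(ordered)
                if ring_position(n) >= pos), 0)
    return [ordered[(idx + d) % len(ordered)] for d in (-1, 0, 1)]

print(replica_nodes(["A", "B", "C", "D"], "post-42"))
```

Because placement depends only on hashes, any crawler can compute the same three target nodes for a posting without coordination.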
30. Map & Reduce : Pair-Wise Similarity
[Pipeline: Indexer emits [term, doc] to an Index File; Grouper (Mapper + Reducer) groups postings to [term, {docs}] in a Doc Group File; Combinator (Mapper) expands each group to candidate pairs [doc1, doc2]; DocPair Counter (Reducer) emits [freq, doc1, doc2] to the Result File]
‣ Indexer and Grouper handle Korean-language processing.
‣ No NLP and no structural analysis.
‣ Produces a pairwise similarity score between two postings.
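The pipeline above can be simulated in-memory to show the data flow between the stages. This is a sketch, not the Hadoop jobs themselves: the tokenizer is a naive whitespace split (the real Indexer handles Korean), and raw term co-occurrence counts stand in for the similarity score.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(docs):
    """Simulates: Indexer [term, doc] -> Grouper [term, {docs}]
    -> Combinator [doc1, doc2] -> DocPair Counter [freq, doc1, doc2]."""
    # Indexer + Grouper: invert documents into term -> {docs} postings
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)

    # Combinator: every pair of docs sharing a term is a candidate;
    # DocPair Counter: reduce candidate pairs to co-occurrence counts
    pair_counts = defaultdict(int)
    for doc_ids in postings.values():
        for d1, d2 in combinations(sorted(doc_ids), 2):
            pair_counts[(d1, d2)] += 1
    return dict(pair_counts)

sims = pairwise_similarity({
    "p1": "cloud mapreduce hadoop",
    "p2": "cloud hadoop couchdb",
    "p3": "django python",
})
print(sims)  # {('p1', 'p2'): 2}
```

The key group size concern on the next slide shows up here directly: a very common term puts thousands of docs in one group, and the Combinator then emits a quadratic number of pairs for it.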
31. Map & Reduce : Optimization
‣ Concerns
• Key group size distribution
• Data load balancing
• Barrier points
‣ Sample Data
• Two months of my friends' postings
• Reachable graph: 4,060 people
• Total postings: 206,115
39. Before the Cloud Age
‣ Smart Shell Guru’s Daily Work : Parallel Sort
$ wc -l data
$ split -l 1000k data
(scp the chunks to worker machines over NFS)
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted
(scp the sorted chunks back)
$ sort -rm data*.sorted > data.sorted
Complexity:
➡ Need to prepare/maintain physical machines and resources
➡ Need to monitor job progress (wait and watch the job's status)
➡ Need to cope with machine failures (slave nodes / storage / networks)
➡ Need to schedule multiple jobs
43. AWS : Paid AMI / The Cloud Market
AMI
Amazon Machine Image
Paid AMI
44. AWS : How to make an AMI (1)
Loopback File
# dd if=/dev/zero of=new_image.fs bs=1M count=1024
Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc
Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, shm, /proc, /sys)
Create yum-xen.conf
# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
# chroot /mnt/ec2-fs /bin/sh
Edit services
45. AWS : How to make an AMI (2)
Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
Local Machine Root File System
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest
-a my-aws-access-key-id -s my-secret-key-id
Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx
Testing
# ec2-describe-images ami-xxxx
Deregister AMI
# ec2-deregister ami-xxxx
Running AMI
# ec2-run-instances ami-xxxx -n 1
http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
56. AWS: Elastic MapReduce
‣ Failed tasks will be rescheduled in other Hadoop slaves.
‣ Once one copy of a task finishes, any duplicate (speculative) instance of it is killed by the tracker.
58. AWS: SocialFlow Automation
[Diagram: Home IDC (local) hosts the DHT with read/write access for the Admin; results flow to Amazon S3 (global), which serves Users read-only; a Renderer and boto (Python) scripts launch the EC2 pool]
64. 10 Cent Tips
‣ AWS EC2
‣ Minimizing set-up time with prepared shell scripts
‣ Use Boto for automating deployments
‣ Use S3 (Free of Charge between S3 and EC2 in the same region)
‣ $0.030 per GB through June 30, 2010 ($0.10 per GB normal price)
‣ AWS Elastic MapReduce
‣ Enabling the SSH port (22) and Hadoop-related ports (9100, 9101)
‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name
‣ Double Check (PATH, etc)
‣ Debug, Debug, Debug
‣ Use EC2 for Hadoop (e.g. Cloudera's Hadoop AMI) (No extra cost for Hadoop!)
65. 10 Cent Tips
‣ AWS S3
‣ Setting HTTP header for images and static resources.
‣ Cache-Control: max-age=31536000
‣ Block Search Bots
‣ robots.txt at the root of a Bucket
‣ User-agent: *
‣ Disallow: /
‣ Using BitTorrent for large files
‣ http://s3.xyz.com/xfile.zip?torrent
‣ Compress Rendered HTML with gzip
‣ Content-Encoding: gzip
$ s3cmd put index.html s3://s3.xyz.com/www \
    --mime-type "text/html" \
    --add-header "Content-Encoding: gzip" \
    --acl-public
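The same pre-compression step can be done from Python with the standard library before handing the file to s3cmd (or boto). A minimal sketch; the function name and sample page are illustrative:

```python
import gzip

def gzip_for_s3(html, path):
    """Compress rendered HTML to a file on disk. Upload it with
    Content-Encoding: gzip set, and browsers decompress it
    transparently while S3 serves fewer bytes."""
    data = gzip.compress(html.encode("utf-8"))
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

page = "<html><body>" + "hello cloud " * 500 + "</body></html>"
size = gzip_for_s3(page, "index.html.gz")
print(size, "<", len(page.encode("utf-8")))
```

Note that S3 itself does not negotiate compression: the object is stored and served exactly as uploaded, so every client must accept gzip when this header is set.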