SNS Analysis using Cloud Computing Services
1. PlatformDay2009
SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis
DongWoo Lee
oiko.cloud@gmail.com
OikoLab
CloudKR
2. Agenda
‣ Introduction
• Social Network Service
• Motivation : Visualization, Social Network Analysis
• SocialFlow
• Scale Out Technologies : Cloud Computing
‣ SNS Analysis Architecture based on Cloud
• Overall Process
• Crawling
• DHT Storage (CouchDB)
• MapReduce
• Pair-Wise Similarity
‣ Cloud Computing Service
• Amazon Web Service
• EC2 / S3 / Elastic MapReduce
• Tips
‣ References
3. Introduction
Social Network · Cloud Computing · Mobile Device
4. Social Network Service
“Social Applications = Social Networks”
“A social network is a collection of people bound together
through a specific set of social relations.”
“A collection of people is a social network if and only if it is
possible for something to spread virally through that collection.”
13. SocialFlow
‣ Thoughts, Feelings, Interests, Relationship and Information of SNS
‣ Real-time Massive Social Data Streams
‣ Difficult to follow the Social Streams
‣ Need a way to get a summary or clustered information based on Common Interests
14. SocialFlow
‣ Getting Common Flows of people through Content Similarities
‣ Reflecting Short-Term Interests of People
‣ Extracting Hot Issues
‣ Revealing Relationships among In/Out Resources
‣ Implementing Scale-Out Technologies
‣ Evolving toward a Recommendation System based on Collective Intelligence
19. Experimental Project
‣Python / Django / Boto
‣ML / Data Mining
‣DHT / CouchDB
‣Cloud / AWS S3, EC2, Hadoop MapReduce
20. Workflow
SNS → Crawler → MapReduce → Post-Processing → CDN → User
(Crawler: in-house cluster in a local data center; MapReduce and post-processing: cloud service)
21. Technologies : Before
‣ Crawlers
‣ Consistent Hash_ring (DHT)
‣ CouchDB Key-Value Storage
‣ CouchJS MapReduce
‣ Home-Made Machine Learning
22. Technologies : After
‣ Crawlers
‣ Consistent Hash_ring (DHT)
‣ CouchDB Key-Value Storage → AWS S3 Storage
‣ MapReduce → Hadoop on EC2
‣ Home-Made Machine Learning
23. Crawling
‣ Fetching recent postings of SNS
‣ Storing fetched postings to CouchDB Storage through the DHT Layer (which selects a server)
‣ Pushing raw data into the Cloud to process them with MapReduce
[Diagram: multiple Crawlers write [term, doc] entries to per-node DBs through the DHT layer, which also handles replication; an Indexer/Mapper later reads the stored postings into an Index File]
24. Consistent DHT (Distributed Hash Table)
‣ Uniform key distribution and load balancing with a good hash function
‣ Minimizing the effects of a storage crash or temporary downtime
‣ High availability with a replication scheme
‣ Notice: a real node owns non-linear portions of the total key space
[Diagram: hash ring 0..N-1 with Node k-1, Node k, Node k+1; Replicate(k, k-1, k+1) places replicas on the neighbouring nodes]
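The ring above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's actual Hash_ring code: node names and the virtual-node count are made up, and MD5 stands in for whatever hash function the real system uses.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent hash ring sketch. A key maps to the first
    node clockwise from its hash position, so adding a node only
    remaps the keys that fall into that node's new arcs."""

    def __init__(self, nodes, vnodes=100):
        # virtual nodes give each real node many small, non-linear
        # arcs of the key space, smoothing the key distribution
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring position at or after the key's hash, wrapping at 0
        idx = bisect(self._positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["couch-a", "couch-b", "couch-c"])
owners = {k: ring.node_for(k) for k in (f"post-{i}" for i in range(1000))}
bigger = HashRing(["couch-a", "couch-b", "couch-c", "couch-d"])
moved = sum(owners[k] != bigger.node_for(k) for k in owners)
print(f"{moved}/1000 keys remapped after adding a node")
```

With a plain `hash(key) % N` scheme, adding a node would remap almost every key; here only roughly a quarter of them move, which is what makes crawler storage survive a node crash or addition cheaply.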
25. Consistent DHT (Distributed Hash Table)
[Diagram: Admin traffic and anonymous user traffic enter through a DHT front end with a memory cache; the SNS Crawler and SNS Analysis push generated contents (HTML, images) to AWS S3; behind the front end sits the DHT ring (Node k-1, Node k, Node k+1)]
26. Consistent DHT : Replication
‣ Replica = 2
[Diagram: nodes A, B, C, D on the ring; each key range (A-B, B-C, C-D, D-A) is stored on its primary node and replicated to the two neighbouring nodes]
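The Replica = 2 placement can be sketched as a small helper. This is an illustrative simplification (node names and the hash choice are assumptions): the primary node k for a key plus its two ring neighbours, mirroring Replicate(k, k-1, k+1).

```python
import hashlib

def ring_position(name):
    # place node names and keys on the same ring by hashing them
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def replica_nodes(nodes, key):
    """With Replica = 2, a key lives on its primary node k plus the
    two neighbouring nodes: Replicate(k, k-1, k+1). Assumes at
    least three nodes, otherwise replicas would collapse."""
    ordered = sorted(nodes, key=ring_position)
    pos = ring_position(key)
    # primary = first node clockwise from the key's position
    idx = next((i for i, n in enumerate(ordered)
                if ring_position(n) >= pos), 0)
    return [ordered[(idx + d) % len(ordered)] for d in (-1, 0, 1)]

print(replica_nodes(["A", "B", "C", "D"], "post-42"))
```

Because placement depends only on hashes, any crawler can compute the same three target nodes for a posting without coordination.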
30. Map & Reduce : Pair-Wise Similarity
[Pipeline: Indexer emits [term, doc] to an Index File; Grouper (Mapper + Reducer) groups postings to [term, {docs}] in a Doc Group File; Combinator (Mapper) expands each group to candidate pairs [doc1, doc2]; DocPair Counter (Reducer) emits [freq, doc1, doc2] to the Result File]
‣ Indexer and Grouper handle Korean-language processing.
‣ No NLP and no structural analysis.
‣ Produces a pairwise similarity score between two postings.
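The pipeline above can be simulated in-memory to show the data flow between the stages. This is a sketch, not the Hadoop jobs themselves: the tokenizer is a naive whitespace split (the real Indexer handles Korean), and raw term co-occurrence counts stand in for the similarity score.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(docs):
    """Simulates: Indexer [term, doc] -> Grouper [term, {docs}]
    -> Combinator [doc1, doc2] -> DocPair Counter [freq, doc1, doc2]."""
    # Indexer + Grouper: invert documents into term -> {docs} postings
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)

    # Combinator: every pair of docs sharing a term is a candidate;
    # DocPair Counter: reduce candidate pairs to co-occurrence counts
    pair_counts = defaultdict(int)
    for doc_ids in postings.values():
        for d1, d2 in combinations(sorted(doc_ids), 2):
            pair_counts[(d1, d2)] += 1
    return dict(pair_counts)

sims = pairwise_similarity({
    "p1": "cloud mapreduce hadoop",
    "p2": "cloud hadoop couchdb",
    "p3": "django python",
})
print(sims)  # {('p1', 'p2'): 2}
```

The key group size concern on the next slide shows up here directly: a very common term puts thousands of docs in one group, and the Combinator then emits a quadratic number of pairs for it.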
31. Map & Reduce : Optimization
‣ Concerns
• Key group size distribution
• Data load balancing
• Barrier points
‣ Sample Data
• Two months of my friends' postings
• Reachable graph: 4,060 people
• Total postings: 206,115
39. Before the Cloud Age
‣ Smart Shell Guru’s Daily Work : Parallel Sort
$ wc -l data
$ split -l 1000k data
(scp the chunks to worker machines over NFS)
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted
(scp the sorted chunks back)
$ sort -rm data*.sorted > data.sorted
Complexity:
➡ Need to prepare/maintain physical machines and resources
➡ Need to monitor job progress (wait and watch the job's status)
➡ Need to cope with machine failures (slave nodes / storage / networks)
➡ Need to schedule multiple jobs
43. AWS : Paid AMI / The Cloud Market
AMI
Amazon Machine Image
Paid AMI
44. AWS : How to make an AMI (1)
Loopback File
# dd if=/dev/zero of=new_image.fs bs=1M count=1024
Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc
Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, shm, /proc, /sys)
Create yum-xen.conf
# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
# chroot /mnt/ec2-fs /bin/sh
Edit services
45. AWS : How to make an AMI (2)
Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
Local Machine Root File System
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest
-a my-aws-access-key-id -s my-secret-key-id
Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx
Testing
# ec2-describe-images ami-xxxx
Deregister AMI
# ec2-deregister ami-xxxx
Running AMI
# ec2-run-instances ami-xxxx -n 1
http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
56. AWS: Elastic MapReduce
‣ Failed tasks will be rescheduled in other Hadoop slaves.
‣ Once one copy of a task finishes, any duplicate (speculative) instance of it is killed by the tracker.
58. AWS: SocialFlow Automation
[Diagram: Home IDC (local) hosts the DHT with read/write access for the Admin; results flow to Amazon S3 (global), which serves Users read-only; a Renderer and boto (Python) scripts launch the EC2 pool]
64. 10 Cent Tips
‣ AWS EC2
‣ Minimizing set-up time with prepared shell scripts
‣ Use Boto for automating deployments
‣ Use S3 (Free of Charge between S3 and EC2 in the same region)
‣ $0.030 per GB through June 30, 2010 ($0.10 per GB normal price)
‣ AWS Elastic MapReduce
‣ Enabling the SSH port (22) and Hadoop-related ports (9100, 9101)
‣ Access to Master Node: ssh -i keypair hadoop@public_dns_name
‣ Double Check (PATH, etc)
‣ Debug, Debug, Debug
‣ Use EC2 for Hadoop (e.g. Cloudera's Hadoop AMI) (No extra cost for Hadoop!)
65. 10 Cent Tips
‣ AWS S3
‣ Setting HTTP header for images and static resources.
‣ Cache-Control: max-age=31536000
‣ Block Search Bots
‣ robots.txt at the root of a Bucket
‣ User-agent: *
‣ Disallow: /
‣ Using BitTorrent for large files
‣ http://s3.xyz.com/xfile.zip?torrent
‣ Compress Rendered HTML with gzip
‣ Content-Encoding: gzip
$ s3cmd put index.html s3://s3.xyz.com/www \
    --mime-type "text/html" \
    --add-header "Content-Encoding: gzip" \
    --acl-public
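The same pre-compression step can be done from Python with the standard library before handing the file to s3cmd (or boto). A minimal sketch; the function name and sample page are illustrative:

```python
import gzip

def gzip_for_s3(html, path):
    """Compress rendered HTML to a file on disk. Upload it with
    Content-Encoding: gzip set, and browsers decompress it
    transparently while S3 serves fewer bytes."""
    data = gzip.compress(html.encode("utf-8"))
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

page = "<html><body>" + "hello cloud " * 500 + "</body></html>"
size = gzip_for_s3(page, "index.html.gz")
print(size, "<", len(page.encode("utf-8")))
```

Note that S3 itself does not negotiate compression: the object is stored and served exactly as uploaded, so every client must accept gzip when this header is set.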