4. Grid
• Core:
• Data mining
• Machine Learning
• Collect data from users and logs, and work out the strategy
• Organize the data into a proper form, so that we can use it at any time
Data -> Information
5. Ad Server
• Ranking
• According to the “information” in Grid, decide which ad should be shown
• show proper ads to website visitors
7. Stream Computing
• Core:
• logging
• feedback
• anti-cheating
• pricing
• post-process everything emitted by Ad Server, and feed useful information back to Grid
• serve as the entry point of the advertising system
8. Hadoop
• an open-source software framework for distributed storage and processing of large data sets
• derives from Google’s MapReduce and Google File System (GFS) papers
• written in Java
• can be divided into 2 components:
• MapReduce
• HDFS (Hadoop distributed file system)
• a yellow elephant
9. Why Hadoop?
• moving computation is much cheaper and easier than moving data
• “Big Data”: the amount of data has become too large, so we need an effective way to manage it
• so does computation
• high fault-tolerance
• developed largely at Yahoo! (created by Doug Cutting and Mike Cafarella)
10. MapReduce
• a programming model for processing “large data sets” with a “parallel, distributed” algorithm on a cluster
• different from map/reduce, the functional programming concept, but both share the same idea: “divide and conquer”
• proposed by Google
11. Functional “map/reduce”
• map()/reduce() in Python
• map(function(elem), list) -> list
• reduce(function(elem1, elem2), list) -> single result
• e.g.
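• list(map(lambda x: x*2, [1,2,3,4])) => [2,4,6,8] (in Python 3, map() returns an iterator)
• functools.reduce(lambda x,y: x+y, [1,2,3,4]) => 10 (in Python 3, reduce() lives in functools)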
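A minimal sketch of the two calls chained together, assuming Python 3. The variable names `numbers`, `doubled` and `total` are illustrative only. Feeding the output of map() into reduce() is the same pipeline shape that the parallel version spreads across machines.

```python
from functools import reduce  # reduce() is not a builtin in Python 3

numbers = [1, 2, 3, 4]

# map() applies the function to each element independently;
# it is lazy in Python 3, so list() is used to materialize the values
doubled = list(map(lambda x: x * 2, numbers))

# reduce() folds the mapped values into a single result, left to right
total = reduce(lambda x, y: x + y, doubled)

print(doubled)  # [2, 4, 6, 8]
print(total)    # 20
```

Because each map() call touches only one element, the map stage can be split across workers without coordination; only the reduce stage needs to bring values together.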
12. Parallel “MapReduce” 5 Steps
• prepare the map() input for the mappers
• mappers run the map() code and generate intermediate pairs
• dispatch the intermediate pairs to the reducers
• reducers run the reduce() code and aggregate the results
• prepare the output from the results of reduce()
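The five steps can be sketched in a single Python process with a word count. This is only an illustration of the data flow, not Hadoop's API: `map_fn`, `reduce_fn`, `run_job` and the `num_mappers` parameter are hypothetical names, and a real cluster would run the mappers and reducers on separate machines.

```python
from collections import defaultdict


def map_fn(line):
    # Mapper: emit one intermediate (word, 1) pair per word in the line
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    # Reducer: aggregate every count that was emitted for one key
    return word, sum(counts)


def run_job(lines, num_mappers=2):
    # Step 1: prepare the map() input by splitting it into one chunk per mapper
    chunks = [lines[i::num_mappers] for i in range(num_mappers)]

    # Step 2: each (simulated) mapper runs map_fn over its own chunk
    intermediate = []
    for chunk in chunks:
        for line in chunk:
            intermediate.extend(map_fn(line))

    # Step 3: dispatch intermediate pairs so each key lands in one group
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Step 4: each (simulated) reducer runs reduce_fn on one key group
    reduced = [reduce_fn(key, values) for key, values in groups.items()]

    # Step 5: prepare the output, here a dict ordered by word
    return dict(sorted(reduced))


result = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Grouping by key in step 3 is what lets the reducers work independently: every count for a given word ends up in exactly one group.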