36. 36
MapReduce
• Oldie but goody
• Restrictive Framework / Innovated Work Around
• Extreme Batch
Confidentiality Information Goes Here
37. 37
MapReduce Basic High Level
Confidentiality Information Goes Here
Mapper
HDFS
(Replicated)
Native File System
Block of
Data
Temp Spill
Data
Partitioned
Sorted Data
Reducer
Reducer
Local Copy
Output File
38. 38
Abstractions
• SQL
– Hive
• Script/Code
– Pig: Pig Latin
– Crunch: Java/Scala
– Cascading: Java/Scala
Confidentiality Information Goes Here
39. 39
Spark
• The New Kid that isn’t that New Anymore
• Easily 10x less code
• Extremely Easy and Powerful API
• Very good for machine learning
• Scala, Java, and Python
• RDDs
• DAG Engine
Confidentiality Information Goes Here
44. 44
Why sessionize?
Confidentiality Information Goes Here
Helps answers questions like:
• What is my website’s bounce rate?
– i.e. how many % of visitors don’t go past the landing page?
• Which marketing channels (e.g. organic search, display ad, etc.) are
leading to most sessions?
– Which ones of those lead to most conversions (e.g. people buying things,
signing up, etc.)
• Do attribution analysis – which channels are responsible for most
conversions?
45. 45
How to Sessionize?
Confidentiality Information Goes Here
1. Given a list of clicks, determine which clicks
came from the same user
2. Given a particular user's clicks, determine if a
given click is a part of a new session or a
continuation of the previous session
64. The image cannot be displayed. Your computer may not have enough memory to open the image, or the
image may have been corrupted. Restart your computer, and then open the file again. If the red x still
appears, you may have to delete the image and then insert it again.
Thank you