Functional Big Data
Agenda 
MapReduce 
Google 
Scaling Out 
Key Value Store 
Chaining 
Fault Tolerance 
Functional Example 
Business Problem 
Design 
Processes 
Schema 
Big Data Guidelines
MapReduce
Google MapReduce 
+ Paper published in 2004 
+ Implemented in 2003 
+ Production use at Google 
+ Built for Google 
+ Not open sourced
Google in 2004 
+ Clusters of 100s or 1000s of servers 
o Linux 
o dual-processor x86 
o 2-4 GB memory 
o 100BaseT or GigE 
o inexpensive IDE hard drives 
+ Servers fail every day 
+ Network maintenance is constant
Scaling Out 
+ Scaling up (faster computer) doesn’t get far 
+ Scaling out is the only next step 
+ Hundreds/thousands of modest computers 
outperform the biggest single computers 
+ Scaling one to a few is hard 
+ Scaling a few to many is easy 
+ Scaling many to massive is (almost) trivial
Concurrency
Intermediate Data 
+ Input data is split between the workers 
+ Map workers create key/value pairs 
+ Reduce workers read in all intermediate 
data for their keys and sort it by key 
+ Reduce workers then iterate over the sorted 
data producing a result for each key
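
The split, map, sort and reduce steps above can be sketched in a few lines of Python. This is a single-process illustration only, not Google's implementation: the function names (`map_links`, `reduce_count`, `map_reduce`), the round-robin split and the sample pages are all assumptions, and a real system would run the map and reduce workers concurrently on separate servers.

```python
from itertools import groupby
from operator import itemgetter

def map_links(page):
    """Map: emit one (target, 1) pair for every outgoing link on a page."""
    _source, targets = page
    for target in targets:
        yield (target, 1)

def reduce_count(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn, workers=2):
    # Split the input between the map workers (round-robin, an assumption).
    splits = [inputs[i::workers] for i in range(workers)]
    # Each map worker turns its split into intermediate key/value pairs.
    intermediate = [pair
                    for split in splits
                    for item in split
                    for pair in map_fn(item)]
    # The reduce side sorts by key, then iterates over each key's values.
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(key, [value for _, value in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Hypothetical sample data: (page, pages it links to).
pages = [("a.example", ["b.example", "c.example"]),
         ("b.example", ["c.example"]),
         ("d.example", ["c.example", "b.example"])]
inbound = map_reduce(pages, map_links, reduce_count)
```

Here `inbound` holds one result per key: the number of inbound links for each page that is linked to.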
Key Value Store
Rinse and Repeat 
+ Often the results of one MapReduce are 
used as input to another 
+ Building on a powerful basic functional 
model, complex data processing can be 
accomplished
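
As a rough sketch of feeding one run into the next, the example below chains two stages in plain Python. The tiny `map_reduce` helper, the word-count task and the sample lines are assumptions made for illustration; they stand in for whatever framework and data a real job would use.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Tiny single-process stand-in for one MapReduce run."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Stage 1: count occurrences of each word.
def map_words(line):
    return [(word, 1) for word in line.split()]

def reduce_sum(word, ones):
    return (word, sum(ones))

# Stage 2: consumes stage 1's output and groups words by their count.
def map_by_count(word_count):
    word, count = word_count
    return [(count, word)]

def reduce_words(count, words):
    return (count, sorted(words))

lines = ["map reduce map", "reduce map chain"]  # hypothetical input
word_counts = map_reduce(lines, map_words, reduce_sum)
words_by_count = map_reduce(word_counts, map_by_count, reduce_words)
```

The second stage never sees the raw lines; it only needs the key/value results of the first, which is what makes the stages composable.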
Chaining
Fault Tolerance 
+ Likelihood of failure rises with number of 
servers and processing time 
+ Resiliency is a necessity at scale 
+ Scheduler/Supervisor (master) reassigns 
failed jobs and ensures reduce workers find 
the (right) data
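
To make the first point concrete, here is a small calculation of how the chance of seeing at least one failure grows with server count and run time. It assumes independent failures at a constant per-server hourly rate; that model and the rate of 0.0001 are illustrative assumptions, not figures from the talk.

```python
def p_any_failure(servers, hours, p_fail_per_server_hour):
    """Probability that at least one server fails during the job,
    assuming independent failures at a constant hourly rate."""
    p_one_survives = (1.0 - p_fail_per_server_hour) ** hours
    return 1.0 - p_one_survives ** servers

# Hypothetical rate: each server has a 0.01% chance of failing per hour.
small = p_any_failure(servers=10, hours=1, p_fail_per_server_hour=0.0001)
large = p_any_failure(servers=2000, hours=8, p_fail_per_server_hour=0.0001)
```

Under these assumptions the small job almost never sees a failure, while the large one sees a failure on most runs, which is why reassignment by the master has to be routine rather than exceptional.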
Scheduling
Supervision
Functional Example
Example Business Problem 
Scenario: 
A mobile operator wants to know if an instant 
messaging (IM) service would be useful to 
current subscribers. 
Question: 
What percentage of text messages (SMS) 
are part of a conversation?
Challenge 
✓ 10 million subscribers 
✓ average of 100 SMS a month per subscriber 
✓ ∴ one billion SMS each month 
✓ call detail records (CDR) include SMS but also 
voice and data events 
✓ ∴ 20 billion (20,000,000,000) records/month
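
The arithmetic behind these figures, worked through in Python. The subscriber, SMS and record counts come from the slide; the 30-day month used for the per-second rate is an assumption.

```python
subscribers = 10_000_000
sms_per_subscriber_per_month = 100
records_per_month = 20_000_000_000

sms_per_month = subscribers * sms_per_subscriber_per_month
# Share of all CDR records that are SMS events and survive filtering.
sms_share = sms_per_month / records_per_month
# Average rate needed to read a month of records within a 30-day month.
records_per_second = records_per_month / (30 * 24 * 60 * 60)
```

Only one record in twenty is an SMS, so discarding non-SMS events early removes most of the volume before any per-subscriber work is done.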
Requirements 
+ Identify SMS conversations 
o messages sent or received with one other party 
o interval between messages < 10 minutes 
o at least three messages exchanged 
+ Provide result as 
o ratio of conversational to non-conversational SMS 
o per subscriber 
o per month
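
One possible reading of these requirements, as a Python sketch: a conversation is a run of at least three messages with the same other party in which each gap is under ten minutes. The function names and the choice to count every message in a qualifying run are assumptions for illustration.

```python
WINDOW_SECONDS = 10 * 60   # interval between messages < 10 minutes
MIN_MESSAGES = 3           # at least three messages exchanged

def conversational_count(timestamps):
    """Count messages that belong to a conversation.
    `timestamps` are seconds, sorted ascending, for messages between
    one subscriber and one other party (sent or received)."""
    total = 0
    run = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts - previous < WINDOW_SECONDS:
            run += 1
        else:
            # The run ended; keep it only if it was long enough.
            if run >= MIN_MESSAGES:
                total += run
            run = 1
        previous = ts
    if run >= MIN_MESSAGES:
        total += run
    return total

def ratio(conversational, total):
    """Ratio of conversational to non-conversational SMS.
    Returns None when every message is conversational."""
    non_conversational = total - conversational
    return conversational / non_conversational if non_conversational else None
```

For example, `conversational_count([0, 60, 120, 5000, 5100, 9000])` finds one qualifying run of three messages; the pair at 5000 and 5100 is too short to count.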
Process Design
Filter 
+ Read events from CDR files 
o records are in chronological order 
o read files in chronological order 
+ Discard non-SMS events 
+ Distribute SMS events to Map processes 
o Consistent distribution by subscriber
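
A minimal sketch of the filter step in Python. The CDR layout used here (CSV with `timestamp,event_type,subscriber,other_party`) and the sample numbers are hypothetical; real CDR formats vary by operator and switch vendor.

```python
import csv
import io

def sms_events(cdr_lines):
    """Yield only SMS events from an iterable of CSV lines.
    Hypothetical layout: timestamp,event_type,subscriber,other_party."""
    for timestamp, event_type, subscriber, other_party in csv.reader(cdr_lines):
        if event_type == "SMS":
            yield int(timestamp), subscriber, other_party

# Made-up records, already in chronological order.
sample = io.StringIO(
    "1000,VOICE,5550001234,5550005678\n"
    "1060,SMS,5550001234,5550005678\n"
    "1090,DATA,5550009999,-\n"
    "1120,SMS,5550005678,5550001234\n"
)
events = list(sms_events(sample))
```

Because the function is a generator, records are discarded as they are read and the full file never has to be held in memory.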
Hashing 
+ To analyze interval between 
messages one process must 
handle all events for a 
particular subscriber 
+ Simple Hash: 
o M = last four digits of subscriber’s 
mobile number 
o N = number of processes available 
o Pid = M rem N
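
The simple hash can be written directly in Python. For non-negative integers, Python's `%` gives the same result as the remainder operation (`rem`) on the slide. The function names and the event tuple layout `(timestamp, subscriber, other_party)` are assumptions for illustration.

```python
def process_index(mobile_number, process_count):
    """Pid = M rem N, where M is the last four digits of the number."""
    m = int(mobile_number[-4:])
    return m % process_count

def distribute(events, process_count):
    """Group (timestamp, subscriber, other_party) events by target process."""
    queues = [[] for _ in range(process_count)]
    for event in events:
        subscriber = event[1]
        queues[process_index(subscriber, process_count)].append(event)
    return queues

# Made-up events for two subscribers, spread over 8 processes.
queues = distribute(
    [(1, "5550001234", "x"), (2, "5550009999", "y"), (3, "5550001234", "z")],
    8,
)
```

Every event for a given subscriber lands in the same queue, and the events keep their chronological order within it.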
Map 
+ Read subscriber’s stored data 
+ Find other party in set 
+ Increment total count of messages 
+ Is previous message < 10 minutes ago? 
o Is next previous message < 10 minutes before previous? 
- Increment conversational messages count 
+ Update previous and next previous times
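
The map steps above might look like this in Python. The store is modelled as nested dictionaries and the field names are assumptions. The code follows the steps as listed, so a message is counted only once two earlier messages are already within the window; the first two messages of a run are not counted.

```python
WINDOW_SECONDS = 10 * 60

def map_event(store, timestamp, subscriber, other_party):
    """Apply one SMS event to the in-memory store (a dict of dicts)."""
    # Read the subscriber's stored data and find the other party in it.
    parties = store.setdefault(subscriber, {})
    party = parties.setdefault(other_party, {
        "total": 0, "conversational": 0,
        "previous": None, "next_previous": None,
    })
    party["total"] += 1
    prev, next_prev = party["previous"], party["next_previous"]
    # Previous message < 10 minutes ago?
    if prev is not None and timestamp - prev < WINDOW_SECONDS:
        # Next previous message < 10 minutes before previous?
        if next_prev is not None and prev - next_prev < WINDOW_SECONDS:
            party["conversational"] += 1
    # Update previous and next previous times.
    party["next_previous"] = prev
    party["previous"] = timestamp
    return party

store = {}
for ts in (0, 60, 120, 180, 5000):  # made-up timestamps in seconds
    map_event(store, ts, "5550001234", "5550005678")
counts = store["5550001234"]["5550005678"]
```

After these five events `counts` records five messages in total, two of them conversational (those at 120 and 180 seconds).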
Schema Design
Interim Data 
+ We are using an in-memory key/value store 
+ The key is the subscriber number 
+ The value is a set of OtherParty 
+ OtherParty data structure contains counts 
+ When the map is complete we transfer the 
data to disk for persistence
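
A sketch of this schema in Python, under some assumptions: the "set of OtherParty" is modelled as a dictionary keyed by the other party's number, `OtherParty` holds just the two counts, and JSON is used as the on-disk format. None of these choices are taken from the talk.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict

@dataclass
class OtherParty:
    total: int = 0            # all messages exchanged with this party
    conversational: int = 0   # messages counted as part of a conversation

def persist(store, path):
    """Write {subscriber: {other_party: OtherParty}} to disk as JSON."""
    plain = {sub: {num: asdict(party) for num, party in parties.items()}
             for sub, parties in store.items()}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(plain, handle)

def load(path):
    """Read the interim data back into OtherParty objects."""
    with open(path, encoding="utf-8") as handle:
        plain = json.load(handle)
    return {sub: {num: OtherParty(**fields) for num, fields in parties.items()}
            for sub, parties in plain.items()}

# Made-up interim data for one subscriber.
store = {"5550001234": {"5550005678": OtherParty(total=5, conversational=2)}}
path = os.path.join(tempfile.mkdtemp(), "interim.json")
persist(store, path)
restored = load(path)
```

Keying by subscriber number keeps each lookup during the map step to a single dictionary access, and the round trip through disk returns the same structure for the reduce step to read.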
Reduce 
+ Collect intermediate data 
from disk copies 
+ Iterate through all parties for 
each subscriber 
+ Total all party counts 
+ Provide result as the percentage 
of total messages that are 
conversational
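
The reduce steps can be sketched as follows. The interim data is assumed to be a dictionary of per-party counts per subscriber, already loaded from disk; the names and sample values are illustrative.

```python
def reduce_subscriber(parties):
    """Total the per-party counts for one subscriber and return the
    percentage of conversational messages (0.0 when there are none)."""
    total = sum(p["total"] for p in parties.values())
    conversational = sum(p["conversational"] for p in parties.values())
    percent = 100.0 * conversational / total if total else 0.0
    return {"total": total, "conversational": conversational,
            "percent": percent}

def reduce_all(interim):
    """Produce one result per subscriber."""
    return {sub: reduce_subscriber(parties) for sub, parties in interim.items()}

# Made-up interim data: one subscriber, two other parties.
interim = {
    "5550001234": {
        "5550005678": {"total": 5, "conversational": 2},
        "5550009999": {"total": 3, "conversational": 0},
    },
}
results = reduce_all(interim)
```

For this subscriber, two of eight messages are conversational, so `results["5550001234"]["percent"]` is 25.0.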
Big Data Guidelines 
+ Find opportunities for concurrency 
+ Choose the right containers for your data 
+ Use memory as effectively as possible 
+ Minimize copying data 
+ Avoid any unnecessary overhead 
+ Anything you are going to do hundreds of 
billions of times should be efficient!
Thank you.
SLASSCOM TECH TALKS 
https://www.facebook.com/SlasscomTechnologyForum 
http://www.slasscom.lk/events 
https://twitter.com/slasscom 
www.slideshare.net/slasscomtechforum


Editor's Notes

  1. Successfully handling really big data requires massive concurrency, and in the real world this requires fault tolerance.
  2. Google didn’t invent map and reduce but they were the first to apply the paradigm in a general way on a massive scale.
  3. … or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
  4. Here is an example of something which Google do as part of their core business. Google places web sites which are linked to by many other web sites higher in search results (PageRank). To determine this a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
  5. The results from one MapReduce can be, and often are, provided as input for further MapReduce runs.
  6. Something like RAID, maybe Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
  7. The user process forks all of the other processes which will be used including a master process. The master then assigns those processes work to perform, either map or reduce roles.
  8. The master process monitors each worker by sending a ping periodically. When it detects that a server has failed (or is no longer reachable) it will reassign that server’s work to another worker. After this reassignment each of the reduce workers will be notified to ignore the failed server and instead get the interim data from the newly assigned server.
  9. This is a contrived example.
  10. That’s billion with a ‘B’. In Canada that’s 1,000 million.
  11. There is an obvious hole in this pseudo code: the first two messages of the conversation are not included in the conversational totals. I could have accommodated that but I left it out to keep the example as simple as possible.