SOCIAL DATA AND DBS
IVAN SANCHEZ
JULIO SALINAS
MARLENE ROBLES
CONTENTS
•Background
•Case Studies:
•Twitter: Real-time search (Earlybird)
•Facebook: Storage (TAO)
•LinkedIn: Storage (Voldemort)
•Conclusion
•References
BACKGROUND OSN
● Huge amount of data, diverse and changing over time. Likes,
sharing, comments, logins, page-views, search queries.
● New approaches to manipulate it.
● Distributed Databases, NoSQL.
● How to retrieve the data (Search relevance,
Recommendations, Security against abusive behavior,
Newsfeed features)
● Goal: massive scaling of demand: Unstructured, Semi-
Facebook: 2.7M likes & comments/day
Twitter: 500M tweets/day
LinkedIn: 300+ M users (2 new/s), 200 group conversations/min
STORING AND QUERYING AT TWITTER
● Storage:
o MySQL used as key-value store.
o FlockDB for the Twitter social graph.
● Desired queries:
o Trending topics
o Breaking news
o Sentiment
REAL TIME SEARCH AT TWITTER: EARLYBIRD
TAO AND THE FACEBOOK SOCIAL GRAPH
TAO
o Architecture and data model:
  - Objects: (id) → (otype, (key → value)*)
  - Associations: (id1, atype, id2) → (time, (key → value)*)
o MySQL as the storage layer.
o Main challenges:
  - Efficiency at scale.
  - Very fast response time.
  - High read availability.
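The object and association mappings above can be sketched as a small in-memory store. This is only an illustration of the data model, not Facebook's implementation: the class and method names (`TinyGraphStore`, `add_object`, `add_assoc`, `newest_assocs`) are hypothetical, and the real system is a distributed service backed by MySQL.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GraphObject:
    """Value side of (id) -> (otype, (key -> value)*)."""
    otype: str
    data: Dict[str, str] = field(default_factory=dict)


@dataclass
class GraphAssoc:
    """Value side of (id1, atype, id2) -> (time, (key -> value)*)."""
    time: int
    data: Dict[str, str] = field(default_factory=dict)


class TinyGraphStore:
    """Hypothetical single-process toy; it only mirrors the two mappings."""

    def __init__(self) -> None:
        self.objects: Dict[int, GraphObject] = {}
        self.assocs: Dict[Tuple[int, str, int], GraphAssoc] = {}

    def add_object(self, oid: int, otype: str, **data: str) -> None:
        self.objects[oid] = GraphObject(otype, dict(data))

    def add_assoc(self, id1: int, atype: str, id2: int, time: int, **data: str) -> None:
        self.assocs[(id1, atype, id2)] = GraphAssoc(time, dict(data))

    def newest_assocs(self, id1: int, atype: str, limit: int) -> List[Tuple[int, GraphAssoc]]:
        """Return up to `limit` (id2, association) pairs, newest first."""
        matches = [
            (key[2], assoc)
            for key, assoc in self.assocs.items()
            if key[0] == id1 and key[1] == atype
        ]
        matches.sort(key=lambda pair: pair[1].time, reverse=True)
        return matches[:limit]


store = TinyGraphStore()
store.add_object(1, "user", name="alice")
store.add_object(2, "post", text="hello")
store.add_object(3, "post", text="second post")
store.add_assoc(1, "authored", 2, time=10)
store.add_assoc(1, "authored", 3, time=20)
print([id2 for id2, _ in store.newest_assocs(1, "authored", limit=5)])
```

Keeping the time on every association is what makes "newest first" queries cheap to express, which fits a read-heavy newsfeed workload.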
Professional Social Network
Data-Driven Features:
● Recommendation system
(people you may know)
● People search (job
search, candidates)
● Who viewed your profile?
● Events you may be
STORAGE - VOLDEMORT
Highly available distributed key-value (KV) store
10 Voldemort clusters (100+ nodes), 9 of them on BDB
Layered design
All layers share a single interface:
-Put/Delete/Get
-Flexible
-Every layer decorates the next one
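The layered design described above, where every layer exposes the same put/get/delete interface and decorates the next one, can be sketched as follows. The layer names (`InMemoryStore`, `NamespacedStore`, `CountingStore`) are hypothetical and chosen only for illustration; they are not Voldemort's actual layers.

```python
from typing import Dict, Optional


class Store:
    """The single interface shared by every layer."""

    def get(self, key: str) -> Optional[bytes]:
        raise NotImplementedError

    def put(self, key: str, value: bytes) -> None:
        raise NotImplementedError

    def delete(self, key: str) -> bool:
        raise NotImplementedError


class InMemoryStore(Store):
    """Bottom layer: stands in for a real storage engine."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: str) -> bool:
        return self._data.pop(key, None) is not None


class NamespacedStore(Store):
    """Decorator layer: prefixes keys, then delegates to the inner store."""

    def __init__(self, inner: Store, namespace: str) -> None:
        self._inner = inner
        self._prefix = namespace + ":"

    def get(self, key: str) -> Optional[bytes]:
        return self._inner.get(self._prefix + key)

    def put(self, key: str, value: bytes) -> None:
        self._inner.put(self._prefix + key, value)

    def delete(self, key: str) -> bool:
        return self._inner.delete(self._prefix + key)


class CountingStore(Store):
    """Decorator layer: counts operations, then delegates to the inner store."""

    def __init__(self, inner: Store) -> None:
        self._inner = inner
        self.calls = 0

    def get(self, key: str) -> Optional[bytes]:
        self.calls += 1
        return self._inner.get(key)

    def put(self, key: str, value: bytes) -> None:
        self.calls += 1
        self._inner.put(key, value)

    def delete(self, key: str) -> bool:
        self.calls += 1
        return self._inner.delete(key)


# Layers compose freely because they all speak the same interface.
stack = CountingStore(NamespacedStore(InMemoryStore(), "members"))
stack.put("42", b"profile-bytes")
print(stack.get("42"), stack.calls)
```

Because each layer only knows about the interface, layers can be added, removed, or reordered without touching the others, which is presumably what the slide means by "flexible".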
STORAGE - VOLDEMORT
Voldemort provides:
•High availability
•Low latency
•Distribution
Like a distributed hash table (DHT).
Data storage engine on nodes:
•Compact index
•Data files
DISTRIBUTED HASHING ALGORITHM
This slide is from Roshan Sumbaly & Jay Kreps! (thanks Rosh & Jay)
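The diagram for this slide is not preserved in the text. As a rough illustration of how a distributed hash table can map keys to nodes, here is a minimal consistent-hash ring. It is a generic sketch under stated assumptions, not Voldemort's actual partitioning code: the class name `HashRing`, the use of MD5, and the `points_per_node` and `replicas` parameters are all choices made for this example.

```python
import bisect
import hashlib
from typing import List, Sequence


def _hash(text: str) -> int:
    """Map a string to a 64-bit position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with several points per node."""

    def __init__(self, nodes: Sequence[str], points_per_node: int = 8) -> None:
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._positions = [position for position, _ in self._ring]
        self._node_count = len(set(nodes))

    def preference_list(self, key: str, replicas: int = 2) -> List[str]:
        """Walk clockwise from the key's position, collecting distinct nodes."""
        wanted = min(replicas, self._node_count)
        if wanted <= 0:
            return []
        start = bisect.bisect_right(self._positions, _hash(key))
        chosen: List[str] = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.preference_list("member:42", replicas=2))
```

The useful property is that removing a node only moves the keys that node owned; keys whose first node was a different one keep their placement.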
SUMMARY
System | Problem solved | Main advantages
EarlyBird | Real-time search | Fast indexing, concurrency management
TAO | Storing the Facebook social graph | Very fast response time, high read availability
Voldemort | Simple data partitioning to meet scalability needs | Highly scalable, seamless replication
CONCLUSION
• The choice of database system depends on the
needs of the applications and the primary type of
information in the social network.
• Many OSNs have developed their own solutions to cope with
the ever-growing nature of big data and its challenges.
• In summary, the main features that data solutions
should have are:
• Storage of huge amounts of data.
• Fast reads and low latency.
• Processing of big data (meaningful results).
• Streaming and indexing are critical.
EXTREMELY DIFFICULT QUESTIONS
1. Why did LinkedIn need to build its own
solution, Voldemort?
2. How does TAO resolve the challenges it was built
for?
3. How does the real-time search service work at
Twitter?
REFERENCES
● Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., … Zhang, J.
(2012). Data Infrastructure at LinkedIn. In Data Engineering (ICDE), 2012 IEEE 28th
International Conference on (pp. 1370–1381).
● N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo,
S. Kulkarni, and H. Li, “Tao: Facebook’s distributed data store for the social graph,” in
USENIX ATC, 2013.
● N. Ruflin, H. Burkhart, and S. Rizzotti, “Social-data storage-systems,” Databases Soc.
Networks - DBSocial ’11, pp. 7–12, 2011.
● A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H.
Liu, “Data warehousing and analytics infrastructure at facebook,” Proceedings of the
2010 ACM SIGMOD International Conference on Management of data. ACM,
Indianapolis, Indiana, USA, pp. 1013–1020, 2010.
● D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a Needle in Haystack:
Facebook’s Photo Storage,” in OSDI, 2010, vol. 2010, pp. 47–60.
● M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-Time
Search at Twitter,” Proceedings of the 2012 IEEE 28th International Conference on Data
Engineering. IEEE Computer Society, pp. 1360–1369, 2012.
REFERENCES (II)
● D. Borthakur, J. Gray, J. Sen Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K.
Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache hadoop goes
realtime at Facebook,” Proceedings of the 2011 ACM SIGMOD International Conference on
Management of data. ACM, Athens, Greece, pp. 1071–1080, 2011.
● C. Chen, F. Li, B. C. Ooi, and S. Wu, “TI: an efficient indexing mechanism for real-time search
on tweets,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of
data. ACM, Athens, Greece, pp. 649–660, 2011.
● G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin, “Fast data in the era of big data: Twitter’s real-
time related query suggestion architecture,” Proceedings of the 2013 ACM SIGMOD
International Conference on Management of Data. ACM, New York, New York, USA, pp. 1147–
1158, 2013.
● S. Cohen and B. Kimelfeld, “A Social Network Database that Learns How to Answer Queries,”
2013.
LINKS
● https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
● http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
● http://www.slideshare.net/linkedin/jay-kreps-on-project-voldemort-scaling-simple-storage-at-linkedin
● http://data.linkedin.com/
● http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
THE END

Social databases - A brief overview


Editor's notes

  1. Web applications and online social networks (OSNs) produce a huge amount of real-time content, including the social data generated by users of Facebook, Twitter, and LinkedIn. This content grows quickly, so new approaches are needed to manipulate it: the data becomes difficult to capture and impossible to store in conventional database systems. At Facebook and Twitter, for example, the volume is around 7 TB per day (2+ PB per year). Social data is also diverse and changes over time, which makes its structure a challenge for storage; the NoSQL community has created different types of storage systems for recording it. Event data includes (1) user activity events corresponding to logins, page views, clicks, “likes”, sharing, comments, and search queries; and (2) operational metrics. This data now requires online consumption for (1) search relevance; (2) recommendations, which may be driven by item popularity or co-occurrence in the activity stream; (3) security applications that protect against abusive behaviour such as spam or unauthorized data scraping; (4) newsfeed features that aggregate user status updates or actions for their “friends” or “connections” to read; and (5) real-time dashboards of various service metrics. Graph: nodes (people) and edges (interactions/relationships).
  2. Storage: a column store could be a good option for social data because it is almost as scalable as a key-value store and can also store semi-structured data with simple indices, which allows simple queries. Twitter uses MySQL as a key-value store. Hadoop: distributed file system (automatic replication, fault tolerance); MapReduce-based parallel computation (key-value based computation); powerful (sorts data very fast); open source; scalable. A user queries a social network in pursuit of a desired outcome. People search Twitter to find temporally relevant information (e.g., breaking news, real-time content, and popular trends) and information related to people (e.g., content directed at the searcher, information about people of interest, and general sentiment and opinion). Twitter queries are shorter and popular. (Tweets are short and frequent, and do not change after being posted.) Such systems incorporate abstract predicates relevant to social networks as primitive building blocks in the query language and use machine learning as an integral part of the query processor to select and improve upon the predicate implementations. For example, Twitter allows users to search by words, people, places, and tweet properties. Examples of social network query languages: SoQL and SociQL, based on SQL; SNQL, etc.
  3. Earlybird is the retrieval engine at the core of Twitter’s real-time search service, designed specifically to handle tweets. It maintains an inverted index, manages concurrency, and uses a ranking function that combines relevance signals and the user’s local social graph to compute a personalized relevance score for each tweet. Ingested tweets first fill up a segment before proceeding to the next one. At any given time, at most one index segment is actively being modified; the remaining segments are read-only. Once an index segment stops accepting new tweets, it is converted from a write-friendly structure into an optimized read-only structure. The highest-ranking, most recent tweets are returned to the Blender, which merges and re-ranks the results before returning them to the user.
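
The segment behaviour described in note 3 can be sketched roughly as follows. This is only an illustration, not Earlybird's actual implementation: the class names `Segment` and `SegmentedIndex`, the tiny segment capacity, and whitespace tokenization are assumptions made for the example, and ranking is replaced by plain newest-first order.

```python
from collections import defaultdict


class Segment:
    """One index segment: term -> tweet ids in arrival order (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.postings = defaultdict(list)
        self.size = 0
        self.read_only = False

    def add(self, tweet_id, text):
        if self.read_only:
            raise RuntimeError("segment is read-only")
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)
        self.size += 1

    def is_full(self):
        return self.size >= self.capacity

    def freeze(self):
        # Stand-in for the conversion to an optimized read-only structure:
        # here the posting lists simply become immutable tuples.
        self.postings = {t: tuple(ids) for t, ids in self.postings.items()}
        self.read_only = True

    def search(self, term):
        return list(self.postings.get(term.lower(), ()))


class SegmentedIndex:
    """Keeps at most one writable segment; all older segments are frozen."""

    def __init__(self, segment_capacity=2):
        self.segment_capacity = segment_capacity
        self.segments = [Segment(segment_capacity)]

    def ingest(self, tweet_id, text):
        active = self.segments[-1]
        if active.is_full():
            active.freeze()
            active = Segment(self.segment_capacity)
            self.segments.append(active)
        active.add(tweet_id, text)

    def search(self, term, limit=10):
        results = []
        # Walk segments newest-first, and postings newest-first within each.
        for segment in reversed(self.segments):
            results.extend(reversed(segment.search(term)))
            if len(results) >= limit:
                break
        return results[:limit]


index = SegmentedIndex(segment_capacity=2)
index.ingest(1, "breaking news from Athens")
index.ingest(2, "news about databases")
index.ingest(3, "more breaking news")
hits = index.search("news")
```

Searching newest-first means a query can stop as soon as it has enough recent hits, which is one reason to keep the segments ordered by ingestion time.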
  4. LinkedIn has a number of data-driven features, including People You May Know, Jobs You May Be Interested In, and LinkedIn Skills. Building these features involves two phases: offline computation and online serving. As an early adopter of Hadoop, LinkedIn was able to scale the offline computation phase successfully. The difficult part has been bulk loading the output of this computation phase into the online serving system without causing performance degradation. Algorithms are batch-computed (MapReduce on Hadoop). To make all this offline data available to the live site, LinkedIn developed a multi-terabyte-scale data pipeline from Hadoop to its online serving layer, Project Voldemort. How do we serve these massive outputs to 300 million members? RDBMS: Oracle. Espresso: document-oriented data store with hierarchical indexing. Kafka: high-volume, low-latency messaging system for collecting and delivering event data. It uses a messaging API to support real-time and offline consumption, follows a publisher/consumer scheme, and handles 10 B message writes/day. It solves real-time log processing by moving around large amounts of data in a robust and scalable manner. Streaming: Databus (DB stream replication) and Kafka (pub/sub for user activity and logs). Databus: timeline-consistent change data capture. Kafka: provides feeds of data to applications and other data systems and enables near real-time processing of data (newsfeed and other asynchronous transport of updates to subscriber systems). Kafka alone is not sufficient, as it lacks a processing engine and the ability to persist data over long spans.
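
The publisher/consumer scheme in note 4, where the same event feed serves both real-time and offline consumers, can be illustrated with a small in-memory sketch. It does not use Kafka's client API: `TopicLog`, `publish`, and `poll` are hypothetical names, and keeping consumer offsets inside the log object is a simplification made for brevity.

```python
class TopicLog:
    """Append-only log for one topic; each consumer reads at its own offset."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer name -> next offset to read

    def publish(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def poll(self, consumer, max_messages=100):
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
log.publish({"event": "page_view", "member": 1})
log.publish({"event": "search", "member": 2})
realtime_batch = log.poll("newsfeed")       # reads events shortly after they arrive
log.publish({"event": "like", "member": 1})
offline_batch = log.poll("hadoop-loader")   # catches up on the whole log later
realtime_next = log.poll("newsfeed")        # only sees what it has not read yet
```

Because each consumer advances independently, a slow batch loader does not hold back a real-time consumer reading the same feed.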
  5. Simple data partitioning to meet scalability needs. Primary goals: high performance and availability. The DB supports only the most minimal schema; the schema is in JSON. Speed and availability -> distributed key-value system. Bulk loading of massive data sets -> offload index construction to the processing system. Port: 666. Storage is secondary; it is really about distributing and recovering data across a set of nodes. The main cluster has 60% reads and 40% writes. Techniques: decentralized -> no master; data partitioned and replicated via consistent hashing; multi-node reads and writes for redundancy; versioning for consistency, using vector clocks. Non-locking (optimistic locking): (nodeID, counter) tuples on each node. Each object has a vector clock, updated on each write and examined on each read. Pluggable persistence (BDB, MySQL, Hadoop (read-only), MongoDB). We currently house roughly ten clusters, spanning more than a hundred nodes, holding several hundred stores (database tables). Berkeley DB (BDB) is a software library that provides a high-performance embedded database for key/value data. Data is cached in Oracle storage. Voldemort is a distributed key-value storage system for high-capacity storage. Data is stored under a key, partitioned, and replicated among multiple servers. In case of failure, conflicts are resolved using versioning. Voldemort is not a relational DB; it is a “big, distributed, persistent, fault-tolerant hash table”. The API involves only three operations: Get, Put, and Delete. Keys and values can be complex objects. The client allows pluggable interfaces. Stack: Conflict resolution: multiple reads and writes, multiple versions; the latest version hides most of the difficulties of getting the last version. Serialization. Network: you can choose client side or server side. The client connects to the cluster and gets metadata describing how the nodes and the hashing are set
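
The vector clocks mentioned in note 5, sets of (nodeID, counter) tuples updated on each write and examined on each read, can be sketched as below. This is a generic illustration rather than Voldemort's own classes: the functions `increment` and `compare` and the node names are assumptions made for the example.

```python
def increment(clock, node_id):
    """Return a copy of the clock with node_id's counter advanced by one."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a vs b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither version supersedes the other: a conflict


v1 = increment({}, "node-A")   # first write, handled by node-A
v2 = increment(v1, "node-A")   # later write through node-A again
v3 = increment(v1, "node-B")   # independent write through node-B
```

A read that finds two versions whose clocks are "concurrent" cannot pick a winner by itself, which is why conflict resolution is needed in that case.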
  6. Data is replicated automatically over multiple servers. Storage is pluggable on disk using BerkeleyDB or MySQL, and can be used with Hadoop. Information is stored in HDFS by several Hadoop MapReduce jobs and is then pulled by Voldemort. A store is like a DB table; each store is split into partitions, and these partitions into chunk sets. One reducer = one chunk set. Chunk set = index + data file: the storage engine is made up of “chunk sets”, multiple pairs of index and data files. The index file is a compact structure containing a hash of the key followed by the offset to the corresponding value in the data file. Node independence: each node is independent of other nodes, with no central point of failure or coordination. For versioning and conflict resolution: vector clocks (supporting optimistic locking in the client). It is similar to shared-nothing architectures in distributed systems, which can grow indefinitely. LinkedIn also uses Oracle. Voldemort has been designed for frequent transient and short-term failures. Each store maps to a single cluster, with the store partitioned over all nodes in the cluster. Lookups are O(1).
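
The index-plus-data-file layout of a chunk set described in note 6 can be sketched as follows. The details are assumptions made for illustration only: the 8-byte truncated SHA-256 key hash, the 4-byte offset, the length-prefixed values, sorting the index entries, and the binary search are not taken from the notes, and hash collisions are ignored.

```python
import hashlib
import struct

ENTRY = struct.Struct(">8sI")  # 8-byte key hash + 4-byte offset (assumed widths)
LENGTH = struct.Struct(">I")   # 4-byte length prefix before each value


def key_hash(key: bytes) -> bytes:
    return hashlib.sha256(key).digest()[:8]


def build_chunk(records):
    """Build (index_bytes, data_bytes) from a dict of key -> value bytes."""
    data = bytearray()
    entries = []
    for key, value in records.items():
        entries.append((key_hash(key), len(data)))  # offset where value starts
        data += LENGTH.pack(len(value)) + value
    entries.sort()  # sorted by hash so the index can be binary-searched
    index = b"".join(ENTRY.pack(h, off) for h, off in entries)
    return index, bytes(data)


def lookup(index, data, key):
    """Binary-search the index for the key's hash, then read the value."""
    target = key_hash(key)
    lo, hi = 0, len(index) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        h, offset = ENTRY.unpack_from(index, mid * ENTRY.size)
        if h == target:
            (length,) = LENGTH.unpack_from(data, offset)
            start = offset + LENGTH.size
            return data[start:start + length]
        if h < target:
            lo = mid + 1
        else:
            hi = mid
    return None


index_file, data_file = build_chunk({b"member:1": b"alice", b"member:2": b"bob"})
value = lookup(index_file, data_file, b"member:1")
```

Keeping only fixed-width (hash, offset) entries in the index is what makes it compact: the index size depends on the number of keys, not on the size of the keys or values.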
  7. Voldemort uses consistent hashing to distribute load evenly across the servers. A key hashes to a point on a fixed circular space. This avoids the problem of linear hashing, where removing a bucket forces all keys to move. The hash space (a 32-bit number) has fixed boundaries. Buckets can be resized, which is very efficient for rebalancing. In case of server failure, its keys are distributed equally among the other servers in the cluster. The ring is divided into Q partitions, more than the number of nodes S. Each node is assigned Q/S partitions, so it appears multiple times in the ring. Keys are held by the primary node as well as its K unique successors in the clockwise direction. This gives predictable performance compared with SQL queries in a relational DB. The Get operation returns all values associated with a key, organized into lists.
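
The partitioned ring in note 7 can be sketched as below. It is an illustration under stated assumptions, not Voldemort's code: the round-robin partition assignment, the SHA-256-based 32-bit hash, and the function names `build_ring`, `partition_for`, `preference_list`, and `reassign` are all hypothetical. Here `replicas` counts the primary plus its successors, so it corresponds to K + 1.

```python
import hashlib


def build_ring(nodes, num_partitions):
    """Assign Q partitions round-robin, so each of S nodes holds Q/S of them."""
    return [nodes[p % len(nodes)] for p in range(num_partitions)]


def partition_for(key, num_partitions):
    """Hash the key to a 32-bit point and map it to a fixed partition."""
    point = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    return point % num_partitions


def preference_list(ring, key, replicas):
    """Primary node plus unique successors found walking clockwise."""
    start = partition_for(key, len(ring))
    owners = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node not in owners:
            owners.append(node)
        if len(owners) == replicas:
            break
    return owners


def reassign(ring, dead):
    """Spread a failed node's partitions evenly over the surviving nodes."""
    survivors = sorted(set(ring) - {dead})
    new_ring = list(ring)
    i = 0
    for p, node in enumerate(ring):
        if node == dead:
            new_ring[p] = survivors[i % len(survivors)]
            i += 1
    return new_ring


ring = build_ring(["A", "B", "C"], num_partitions=12)
owners = preference_list(ring, "member:42", replicas=2)
after_failure = reassign(ring, dead="B")
```

Since keys map to fixed partitions rather than directly to nodes, losing node B only moves the keys in B's partitions; keys owned by A and C stay where they were.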
  8. We need to agree on this