11. Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
12. Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
13. Starting off
➔ User data in MySQL
➔ Normalised document tables
➔ Quite a few joins...
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
14. But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
15. What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
16. So much choice..
But they mostly miss out on good scalable processing.
And many more...
17. HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with MapReduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
19. How we store data
➔ Mostly documents
➔ Column Families for different data
➔ Metadata / raw pdf files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
20. Example Schema
Row          Column family    Qualifier
sha1_hash    metadata         document
                              date_added
                              date_modified
                              source
             content          pdf
                              full_text
                              entity_extraction
             canonical_id     version_live
● All data for documents in one table
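As a rough illustration of the schema above (not Mendeley's actual code), writing and reading a document row with the older HBase Java client could look like the sketch below. The table name "documents" and the placeholder values are assumptions; the row key and column family/qualifier names come from the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DocumentStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table name is assumed; the slides only show the row/column layout.
        HTable table = new HTable(conf, "documents");

        byte[] row = Bytes.toBytes("<sha1_hash_of_pdf>");  // row key: SHA-1 of the file
        byte[] protoDoc = new byte[0];                      // serialised metadata protobuf (placeholder)
        byte[] pdfBytes = new byte[0];                      // raw PDF bytes (placeholder)

        // Metadata and raw content live in separate column families,
        // so metadata-only scans never have to read the large PDF values.
        Put put = new Put(row);
        put.add(Bytes.toBytes("metadata"), Bytes.toBytes("document"), protoDoc);
        put.add(Bytes.toBytes("metadata"), Bytes.toBytes("source"), Bytes.toBytes("crawler"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("pdf"), pdfBytes);
        table.put(put);

        // Reading back just the metadata column family.
        Get get = new Get(row);
        get.addFamily(Bytes.toBytes("metadata"));
        Result result = table.get(get);
        byte[] storedProto = result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("document"));

        table.close();
    }
}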
21. How we process data
➔ Java MapReduce
➔ More control over data flows
➔ Allows us to do more complex work
➔ Pig
➔ Don't have to think in MapReduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than Java MapReduce
➔ Quick example...
22. Example
➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple MapReduce jobs
➔ 100s of lines of Java... (the map side is sketched below)
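For contrast with the Pig script on the next slides, here is a rough sketch of just the map side of such a Java MapReduce job; the reducer that sums the counts and the second pass that regroups by keyword are omitted. It assumes a protobuf-generated class DocumentProto with year and keyword fields; that class name and its accessors are illustrative, not Mendeley's actual schema.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Map side only: emit ((keyword, year), 1) for every keyword of every document.
// DocumentProto stands in for the real protobuf-generated metadata class.
public class KeywordYearMapper extends TableMapper<Text, LongWritable> {

    private static final byte[] METADATA = Bytes.toBytes("metadata");
    private static final byte[] DOCUMENT = Bytes.toBytes("document");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        byte[] protoBytes = value.getValue(METADATA, DOCUMENT);
        if (protoBytes == null) {
            return;
        }
        DocumentProto doc = DocumentProto.parseFrom(protoBytes);
        for (String keyword : doc.getKeywordsList()) {
            // Composite key "keyword<TAB>year"; the counts are summed in the reducer.
            context.write(new Text(keyword + "\t" + doc.getYear()), ONE);
        }
    }
}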
23. Pig Example
-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
USING HbaseLoader('metadata:document')
AS (protodoc);
-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
DocumentProtobufBytesToTuple(protodoc) AS doc;
-- Get keyword, year tuples
tagYear = FOREACH docs GENERATE
FLATTEN(doc.keywords_bag) AS keyword,
doc.year AS year;
24. -- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);
-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
FLATTEN(group) AS (keyword, year),
COUNT(tagYear) AS count;
-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;
-- Reshape each group into (keyword, {(year, count)})
tagYearCounts = FOREACH tagYearCounts GENERATE
group AS keyword,
yearTagCount.(year, count) AS years;
STORE tagYearCounts INTO 'tag_year_counts';
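A script along these lines would typically be saved to a file and run with the Pig client, for example "pig -f tag_year_counts.pig" (the filename is just for illustration). Each record stored in 'tag_year_counts' then has the shape (keyword, {(year, count), ...}), one per keyword.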
25. Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
26. Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad-core Intel CPUs
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes