11. Mendeley in numbers
➔ 600,000+ users
➔ 50+ million user documents
➔ Since January 2009
➔ 30 million unique documents
➔ De-duplicated from user and other imports
➔ 5TB of papers
12. Data Mining Team
➔ Catalogue
➔ Importing
➔ Web Crawling
➔ De-duplication
➔ Statistics
➔ Related and recommended research
➔ Search
13. Starting off
➔ User data in MySQL
➔ Normalised document tables
➔ Quite a few joins...
➔ Stuck with MySQL for data mining
➔ Clustering and de-duplication
➔ Got us to launch the article pages
14. But..
➔ Re-process everything often
➔ Algorithms with global counts
➔ Modifying algorithms affects everything
➔ Iterating over tables was slow
➔ Could not easily scale processing
➔ Needed to shard for more documents
➔ Daily stats took > 24h to process...
15. What we needed
➔ Scale to 100s of millions of documents
➔ ~80 million papers
➔ ~120 million books
➔ ~2-3 billion references
➔ More projects using data and processing
➔ Update the data more often
➔ Rapidly prototype and develop
➔ Cost effective
16. So much choice..
But they mostly miss out on good scalable processing.
And many more...
17. HBase and Hadoop
➔ Scalable storage
➔ Scalable processing
➔ Designed to work with MapReduce
➔ Fast scans
➔ Incremental updates
➔ Flexible schema
19. How we store data
➔ Mostly documents
➔ Column Families for different data
➔ Metadata / raw pdf files
➔ More efficient scans
➔ Protocol Buffers for metadata
➔ Easy to manage 100+ fields
➔ Faster serialisation
20. Example Schema
Row          Column family    Qualifier
sha1_hash    metadata         document
                              date_added
                              date_modified
                              source
             content          pdf
                              full_text
                              entity_extraction
             canonical_id     version_live
● All data for documents in one table
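As a rough illustration of the schema above (not Mendeley's actual code), writing and reading a document row with the older HBase Java client could look like the sketch below. The table name "documents" and the placeholder values are assumptions; the row key and column family/qualifier names come from the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DocumentStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table name is assumed; the slides only show the row/column layout.
        HTable table = new HTable(conf, "documents");

        byte[] row = Bytes.toBytes("<sha1_hash_of_pdf>");  // row key: SHA-1 of the file
        byte[] protoDoc = new byte[0];                      // serialised metadata protobuf (placeholder)
        byte[] pdfBytes = new byte[0];                      // raw PDF bytes (placeholder)

        // Metadata and raw content live in separate column families,
        // so metadata-only scans never have to read the large PDF values.
        Put put = new Put(row);
        put.add(Bytes.toBytes("metadata"), Bytes.toBytes("document"), protoDoc);
        put.add(Bytes.toBytes("metadata"), Bytes.toBytes("source"), Bytes.toBytes("crawler"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("pdf"), pdfBytes);
        table.put(put);

        // Reading back just the metadata column family.
        Get get = new Get(row);
        get.addFamily(Bytes.toBytes("metadata"));
        Result result = table.get(get);
        byte[] storedProto = result.getValue(Bytes.toBytes("metadata"), Bytes.toBytes("document"));

        table.close();
    }
}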
21. How we process data
➔ Java MapReduce
➔ More control over data flows
➔ Allows us to do more complex work
➔ Pig
➔ Don't have to think in MapReduce
➔ Twitter's Elephant Bird decodes protocol buffers
➔ Enables rapid prototyping
➔ Less efficient than Java MapReduce
➔ Quick example...
22. Example
➔ Trending keywords over time
➔ For a given keyword, how many documents per year?
➔ Multiple MapReduce jobs
➔ 100s of lines of Java... (the map side is sketched below)
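For contrast with the Pig script on the next slides, here is a rough sketch of just the map side of such a Java MapReduce job; the reducer that sums the counts and the second pass that regroups by keyword are omitted. It assumes a protobuf-generated class DocumentProto with year and keyword fields; that class name and its accessors are illustrative, not Mendeley's actual schema.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Map side only: emit ((keyword, year), 1) for every keyword of every document.
// DocumentProto stands in for the real protobuf-generated metadata class.
public class KeywordYearMapper extends TableMapper<Text, LongWritable> {

    private static final byte[] METADATA = Bytes.toBytes("metadata");
    private static final byte[] DOCUMENT = Bytes.toBytes("document");
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        byte[] protoBytes = value.getValue(METADATA, DOCUMENT);
        if (protoBytes == null) {
            return;
        }
        DocumentProto doc = DocumentProto.parseFrom(protoBytes);
        for (String keyword : doc.getKeywordsList()) {
            // Composite key "keyword<TAB>year"; the counts are summed in the reducer.
            context.write(new Text(keyword + "\t" + doc.getYear()), ONE);
        }
    }
}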
23. Pig Example
-- Load the document bag
rawDocs = LOAD 'hbase://canonical_documents'
USING HbaseLoader('metadata:document')
AS (protodoc);
-- De-serialise protocol buffer
docs = FOREACH rawDocs GENERATE
DocumentProtobufBytesToTuple(protodoc) AS doc;
-- Get keyword, year tuples
tagYear = FOREACH docs GENERATE
FLATTEN(doc.keywords_bag) AS keyword,
doc.year AS year;
24. -- Group unique (keyword, year) tuples
yearTag = GROUP tagYear BY (keyword, year);
-- Create (keyword, year, count) tuples
yearTagCount = FOREACH yearTag GENERATE
FLATTEN(group) AS (keyword, year),
COUNT(tagYear) AS count;
-- Group the counts by keyword
tagYearCounts = GROUP yearTagCount BY keyword;
-- Reshape each group into (keyword, {(year, count)})
tagYearCounts = FOREACH tagYearCounts GENERATE
group AS keyword,
yearTagCount.(year, count) AS years;
STORE tagYearCounts INTO 'tag_year_counts';
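A script along these lines would typically be saved to a file and run with the Pig client, for example "pig -f tag_year_counts.pig" (the filename is just for illustration). Each record stored in 'tag_year_counts' then has the shape (keyword, {(year, count), ...}), one per keyword.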
25. Challenges
➔ MySQL hard to export from
➔ Many joins slow things down
➔ Don't normalise if you don't have to!
➔ HBase needs memory
➔ Stability issues if you give it too little
26. Challenges: Hardware
➔ Knowing where to start is hard...
➔ 2x quad-core Intel CPUs
➔ 4x 1TB disks
➔ Memory
➔ Started with 8GB, then 16GB
➔ Upgrading to 24GB soon
➔ Currently 15 nodes