I was invited to give a keynote presentation at the RecSysTEL Workshop (http://bit.ly/b2Bg2J) on 2010/09/30.
The talk presents Mendeley's tools for researchers and the data sets that we made available for the dataTEL challenge, designed to provide new large-scale data for researchers in recommendation systems.
The event was really enjoyable and the participants were excited about Mendeley.
Mendeley, putting data into the hands of researchers
1. Mendeley, putting data
into the hands of
researchers
Kris Jack, PhD
Data Mining Team Coordinator
2. “All the time we are very
conscious of the huge
challenges that human
society has now – curing
cancer, understanding the
brain for Alzheimer’s [...].
But a lot of the state of
knowledge of the human race
is sitting in the scientists’
computers, and is currently
not shared […] We need to
get it unlocked so we can
tackle those huge problems.”
3. Summary
➔ idea behind mendeley
➔ our features
➔ our technical challenges and solutions
➔ what does this mean for you?
4. Mendeley Last.fm
works like this:
1) Install “Audioscrobbler”
2) Listen to music
3) Last.fm builds your music profile and recommends you music you also could like...
and it’s the world’s biggest open music database
5. Mendeley Last.fm
Last.fm: music libraries, artists, songs, genres
Mendeley: research libraries, researchers, papers, disciplines
6. Summary
➔ idea behind mendeley
➔ our features
➔ our technical challenges and solutions
➔ what does this mean for you?
9. Mendeley helps researchers work smarter
Mendeley extracts research data..
..and aggregates research data in the cloud
10. By doing this, Mendeley makes science
more collaborative and transparent
18. Summary
➔ idea behind mendeley
➔ our features
➔ our technical challenges and solutions
➔ what does this mean for you?
19. 500,000+ users; the 20 largest userbases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
39,000,000+ articles
20. we can only use algorithms that scale up
readership statistics
search
most frequent tags
related research
+ dozens of other services
21. most frequent tags on our scale
readership statistics
search
most frequent tags
related research
22. most frequent tags on our scale
most frequent tags
for each document (called 39,000,000 times)
    for each tag in document (called ~3 times)
        increment count for tag (called ~39,000,000 x 3 = ~117,000,000 times)
sort tags by frequency
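The loop on this slide can be sketched on a single machine as follows. This is only an illustration of the pseudocode, not Mendeley's implementation: the function name `most_frequent_tags`, the `top_n` parameter, and the tiny `docs` sample are assumptions made up for the example.

```python
from collections import Counter

def most_frequent_tags(documents, top_n=None):
    """Count tags across all documents on one machine, then rank them."""
    counts = Counter()
    for tags in documents:      # outer loop: runs once per document
        for tag in tags:        # inner loop: runs once per tag in that document
            counts[tag] += 1    # increment count for tag
    # most_common() sorts tags by descending frequency
    return counts.most_common(top_n)

# Hypothetical sample data: each document is represented by its list of tags.
docs = [
    ["recsys", "tel"],
    ["recsys", "hadoop", "mapreduce"],
    ["recsys", "tel", "crf"],
]
top = most_frequent_tags(docs, top_n=2)
print(top)
```

With 39,000,000 documents and about 3 tags each, the innermost statement runs roughly 117,000,000 times, and all of that work happens in one process.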
23. solution: distributed computing
map
    for each document
        for each tag in document
            increment count for tag
    for each tag counted
        emit the tag and frequency
reduce
    sort tags by frequency
Jeffrey Dean and Sanjay Ghemawat
MapReduce: Simplified Data Processing on Large Clusters
In Proceedings of OSDI 2004, San Francisco, CA, 2004.
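A minimal in-process sketch of the map and reduce steps above, in plain Python rather than on a real cluster. The names `map_partition`, `shuffle`, and `reduce_counts` are hypothetical, and summing the partial frequencies in the reduce step is an assumption: the slide only says that the reduce side sorts tags by frequency.

```python
from collections import Counter, defaultdict

def map_partition(documents):
    """Map step: count tags within one partition, then emit (tag, frequency)."""
    local = Counter()
    for tags in documents:
        for tag in tags:
            local[tag] += 1
    for tag, freq in local.items():
        yield tag, freq

def shuffle(emitted):
    """Group emitted pairs by tag (a real framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for tag, freq in emitted:
        grouped[tag].append(freq)
    return grouped

def reduce_counts(grouped):
    """Reduce step (assumed): sum partial frequencies, then sort by frequency."""
    totals = {tag: sum(freqs) for tag, freqs in grouped.items()}
    # Ties are broken alphabetically so the output order is deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample: two partitions, as if the documents lived on two workers.
partitions = [
    [["recsys", "tel"], ["recsys", "hadoop"]],
    [["recsys", "tel", "crf"]],
]
emitted = [pair for part in partitions for pair in map_partition(part)]
ranked = reduce_counts(shuffle(emitted))
print(ranked)
```

Each partition can be mapped independently, which is what lets the counting work spread across many machines.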
24. solution: distributed computing
hadoop
Jeffrey Dean and Sanjay Ghemawat
MapReduce: Simplified Data Processing on Large Clusters
In Proceedings of OSDI 2004, San Francisco, CA, 2004.
26. conditional random fields
Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit:
An open-source CRF reference string parsing package.
In Proceedings of LREC 08, Marrakesh, Morocco.
27. deduplication
crowd sourcing new articles from users
collapse metadata and update canonical docs
file hash check
metadata comparison
document fingerprinting
39,000,000 canonical documents
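A rough sketch of how the first two checks on this slide could feed a canonical document store. Everything here is an assumption for illustration: the `CanonicalStore` class, the SHA-1 file hash, and the title-plus-year metadata key are invented for the example, and document fingerprinting is left out entirely.

```python
import hashlib

def file_hash(data: bytes) -> str:
    """File hash check: identical bytes give an identical digest."""
    return hashlib.sha1(data).hexdigest()

def metadata_key(title: str, year: int) -> tuple:
    """Metadata comparison (assumed rule): case- and whitespace-insensitive title plus year."""
    return " ".join(title.lower().split()), year

class CanonicalStore:
    """Hypothetical store mapping user uploads onto canonical document ids."""

    def __init__(self):
        self.by_hash = {}      # file digest -> canonical id
        self.by_metadata = {}  # metadata key -> canonical id
        self.copies = {}       # canonical id -> number of user copies seen
        self.next_id = 0

    def add(self, data: bytes, title: str, year: int) -> int:
        digest = file_hash(data)
        key = metadata_key(title, year)
        doc_id = self.by_hash.get(digest)          # 1) file hash check
        if doc_id is None:
            doc_id = self.by_metadata.get(key)     # 2) metadata comparison
        if doc_id is None:                         # no match: new canonical doc
            doc_id = self.next_id
            self.next_id += 1
            self.copies[doc_id] = 0
        # Collapse the upload into the canonical document and update its indexes.
        self.by_hash[digest] = doc_id
        self.by_metadata.setdefault(key, doc_id)
        self.copies[doc_id] += 1
        return doc_id

store = CanonicalStore()
a = store.add(b"%PDF-A", "MapReduce: Simplified Data Processing", 2004)
b = store.add(b"%PDF-A", "mapreduce simplified", 2004)                      # same file bytes
c = store.add(b"%PDF-B", "mapreduce:  simplified data processing", 2004)   # same metadata
d = store.add(b"%PDF-C", "ParsCit", 2008)                                   # new document
print(a, b, c, d, store.copies)
```

Running the cheap exact-match check first means the fuzzier comparison is only needed for uploads whose bytes have not been seen before.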
40. “All the time we are very
conscious of the huge
challenges that human
society has now – curing
cancer, understanding the
brain for Alzheimer’s [...].
But a lot of the state of
knowledge of the human race
is sitting in the scientists’
computers, and is currently
not shared […] We need to
get it unlocked so we can
tackle those huge problems.”