2. First off...
About me
Consultant for SARA's eScience & Cloud Services
Technical lead for LifeWatch Netherlands
Lead Hadoop infrastructure
About you
Who uses large-scale computing as a supporting tool?
For whom is large-scale computing core business?
3. In this talk
Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
4. Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
18. More business is done online
Mobile devices are more sophisticated
Governments collect more data
Sensing devices are becoming a commodity
Technology has advanced: DNA sequencers!
Enormous funding for research infrastructures
And so on...
Lesson: everybody collects data
Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2011–2016
14. How are these observations addressed?
We collect data, we store data, we have the knowledge to interpret data. What tools do we have that bring these together?
Pioneers: HPC centers, universities, and in recent years, Internet companies. (Lots of knowledge exchange, by the way.)
16. Some background (bear with me...) 2/3
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
17. Some background (bear with me...) 3/3
Nodes (x2000):
8GB DRAM
4 x 1TB disks
Rack:
40 nodes
1Gbps switch
Datacenter:
8Gbps rack-to-cluster switch connection
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
20. Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
21. SARA
the national center for scientific computing
Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
24. Case Study: Virtual Knowledge Studio
How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
2.7 TB raw text, single file
Java application that searches for categories in wiki markup, like [[Category:NAME]] (see the sketch below)
Executed on the Grid
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
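As an illustration of that extraction step (not the VKS application itself), a minimal Java sketch that pulls category names out of wiki markup could look like this; the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract [[Category:NAME]] tags from a chunk of wiki markup.
public class CategoryExtractor {
    // Matches [[Category:Some name]] (optionally with a |sortkey) and captures "Some name".
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]+)(\\|[^\\]]*)?\\]\\]");

    public static List<String> extract(String wikiMarkup) {
        List<String> categories = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikiMarkup);
        while (m.find()) {
            categories.add(m.group(1).trim());
        }
        return categories;
    }

    public static void main(String[] args) {
        // Prints: [Dutch scientists]
        System.out.println(extract("Some text... [[Category:Dutch scientists]] more text"));
    }
}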
25. Case Study: Virtual Knowledge Studio
Method
Take an article, including history, as input
Extract categories and links for each revision
Output all links for each category, per revision
Aggregate all links for each category, per revision
Generate graph linking all categories on links, per revision
26. Case Study: Virtual Knowledge Studio
1.1) Copy file from local machine to Grid storage
2.1) Stream file from Grid storage to a single machine
2.2) Cut into pieces of 10 GB
2.3) Stream back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
27. Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
28. A bit of history
2002: Nutch*
2004: MapReduce / GFS**
2006: Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
http://labs.google.com/papers/gfs.html
29. 2010 - 2012: A Hype in Production
http://wiki.apache.org/hadoop/PoweredBy
30. What's different about Hadoop?
No more do-it-yourself parallelism – it's hard!
But rather linearly scalable data parallelism
Separating the what from the how
(The Datacenter as a Computer, 2009, Luiz André Barroso and Urs Hölzle)
31. Core principles
Scale out, not up
Move processing to the data
Process data sequentially, avoid random reads
Seamless scalability
(Jimmy Lin, University of Maryland / Twitter, 2011)
32. A typical data-parallel problem in abstraction
Iterate over a large number of records
Extract something of interest
Create an ordering in intermediate results
Aggregate intermediate results
Generate output
MapReduce: functional abstraction of step 2 & step 4
(Jimmy Lin, University of Maryland / Twitter, 2011)
33. MapReduce
Programmer specifies two functions
map(k, v) → <k', v'>*
reduce(k', v'*) → <k', v''>*
All values associated with a single key are sent to the same reducer
The framework handles the rest (a word-count sketch follows below)
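To make these two signatures concrete, a minimal word-count mapper and reducer written against the Hadoop MapReduce Java API might look like this (an illustrative sketch, not code from the talk):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(k, v) -> <k', v'>*: for each input line, emit (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate <k', v'> pair
            }
        }
    }
}

// reduce(k', v'*) -> <k', v''>*: all counts for one word arrive at the same reducer; sum them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}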
35. Case Study: Virtual Knowledge Studio
This is how it would be done with Hadoop
1) Load file into HDFS
2) Submit code to MapReduce (see the driver sketch below)
Automatic distribution of data
Parallelism based on the data
Automatic ordering of intermediate results
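As a sketch of step 2, a minimal driver that points the word-count job above at data already loaded into HDFS could look like this; the paths and class names are illustrative, not the VKS code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configure the job and submit it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the data copied into HDFS (step 1), the packaged jar is typically submitted with hadoop jar, passing the input and output paths as arguments.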
37. Large-scale data processing?
Large-scale @ SARA & BiG Grid
An introduction to Hadoop & MapReduce
Hadoop @ SARA & BiG Grid
38. Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
6 machines * 4 cores / 24 TB storage / 16 GB RAM
Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
72 machines * 8 cores / 8 TB storage / 64 GB RAM
Integration with Kerberos for secure multi-tenancy
42. What are scientists doing?
Information Retrieval
Natural Language Processing
Machine Learning
Econometrics
Bioinformatics
Computational Ecology / Ecoinformatics
44. Structural health monitoring
145 sensors × 100 Hz × 60 seconds × 60 minutes × 24 hours × 365 days = large data (worked out below)
(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
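Worked out, that product is 145 sensors × 100 Hz × 31,536,000 seconds per year ≈ 4.6 × 10^11 sensor readings per year.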
45. And others: NLP & IR
e.g. ClueWeb: a ~13.4 TB webcrawl
e.g. Twitter gardenhose data
e.g. Wikipedia dumps
e.g. del.icio.us & Flickr tags
Finding named entities: [person, company, place] names
Creating inverted indexes (see the sketch after this list)
Piloting real-time search
Personalization
Semantic web
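As an illustration of how one of these tasks maps onto Hadoop, a minimal inverted-index mapper and reducer might look like the following sketch (illustrative only; it uses the input file name as a stand-in for a document id):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// map emits (term, document); reduce collects the posting list per term.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text term = new Text();
    private final Text doc = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The name of the input file stands in for the document id.
        String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
        doc.set(docId);
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                term.set(token);
                context.write(term, doc);
            }
        }
    }
}

class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        // Concatenate the (possibly duplicated) document ids into a posting list.
        StringBuilder postings = new StringBuilder();
        for (Text d : docs) {
            if (postings.length() > 0) postings.append(",");
            postings.append(d.toString());
        }
        context.write(term, new Text(postings.toString()));
    }
}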
48. Experience: How we embrace Hadoop
Parallelism has never been easy… so we teach!
December 2010: hackathon (~50 participants - full)
April 2011: Workshop for Bioinformaticians
November 2011: 2 day PhD course (~60 participants – full)
June 2012: 1 day PhD course
The data scientist is still in school... so we fill the gap!
Devops maintain the system, fix bugs, and develop new functionality
Technical consultants learn how to efficiently implement algorithms
50. Final thoughts
Hadoop is the first to provide large-scale data processing on commodity hardware
Hadoop is not the only option
Hadoop is probably not the best
Hadoop has momentum
What degree of diversification of infrastructure should we embrace?
MapReduce fits surprisingly well as a programming model for data-parallelism
Where is the data scientist?
Teach. A lot. And work together.