Simon Metson of Bristol University and CERN's CMS experiment discusses how to use Hadoop to process CERN event data and other data generated by the experiment.
2. About us
• CMS will take 1-10PB of data a year
• We’ll generate approximately the same amount of simulation data
• The experiment could run for 20-30 years
• We have ~80 large computing centres around the world (>0.5 PB and hundreds of job slots each)
• ~3000 members of the collaboration
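
As a rough back-of-envelope check on the figures above, the sketch below multiplies out the quoted ranges. It assumes that simulation volume exactly equals real data volume and that the yearly rate stays constant over the whole run; neither is stated on the slide, and the function name is made up for illustration.

```python
def total_volume_pb(pb_per_year, years, simulation_factor=1.0):
    """Total stored volume in PB: real data plus simulation.

    simulation_factor=1.0 encodes the assumption that simulation
    produces about the same volume as real data.
    """
    return pb_per_year * years * (1.0 + simulation_factor)


low = total_volume_pb(1, 20)    # 1 PB/year for 20 years
high = total_volume_pb(10, 30)  # 10 PB/year for 30 years
print(f"Lifetime volume: {low:.0f}-{high:.0f} PB")
```

Replication of data across sites is not included, so the real storage requirement would be higher.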
Wednesday, 12 August 2009
3. Why so much data?
• We have a very big digital camera
• Each event is ~1 MB for normal running
• Size increases for heavy-ion (HI) and upgrade studies
• Need many millions of events to get statistically significant results out for rare processes
• In my thesis I started with ~5M events to see an eventual “signal” of ~300
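
To put those thesis numbers in context, here is a small sketch of the arithmetic. It assumes simple Poisson counting statistics (relative uncertainty of 1/sqrt(N)) and ignores background, which a real analysis would not; the variable names are illustrative only.

```python
import math

events_in = 5_000_000   # ~5M starting events
signal_out = 300        # ~300 events in the final "signal"

# Fraction of the starting sample that survives to the signal
selection_fraction = signal_out / events_in

# Relative statistical uncertainty on a Poisson count of N is 1/sqrt(N)
relative_uncertainty = 1.0 / math.sqrt(signal_out)

print(f"selection fraction: {selection_fraction:.1e}")
print(f"relative uncertainty on the signal count: {relative_uncertainty:.1%}")
```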
4. What’s an event?
• We have protons colliding, which contain quarks
• Quarks interact to produce excited states of matter
• These excited states decay and we record the decay products
• We then work back from the products to “see” the original event
• Many events happen at once
• Think of working out how a carburettor works by crashing 6 cars together on a motorway
6. Duplication of data
• We keep events in multiple “tiers” of data
• Each tier contains a subset of the information of the parent tier
• We do this to let people work on huge amounts of data quickly
• In reality this style of working hasn’t really kicked off yet, but it’s early days
• Data is housed at >1 site
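
The idea of each tier holding a subset of its parent's information can be sketched as a simple projection. The tier names, field names and the `derive_tier` helper below are all invented for illustration; they are not the actual CMS data tiers or the CMSSW data model.

```python
# Hypothetical fields kept at each tier; every tier is a subset of its parent.
TIER_FIELDS = {
    "full": ["event_id", "raw_hits", "tracks", "jets", "muons"],
    "reduced": ["event_id", "tracks", "jets", "muons"],
    "summary": ["event_id", "jets", "muons"],
}


def derive_tier(event, tier):
    """Project an event (a dict) onto the fields kept by the given tier."""
    return {name: event[name] for name in TIER_FIELDS[tier]}


event = {
    "event_id": 1,
    "raw_hits": [0.1] * 1000,   # bulky low-level detector data
    "tracks": [(1.2, 0.3)],
    "jets": [45.0],
    "muons": [],
}

reduced = derive_tier(event, "reduced")
summary = derive_tier(reduced, "summary")  # derived from the parent tier
print(sorted(summary))
```

Because each tier drops the bulkiest fields of its parent, an analysis that only needs the smaller tier reads far less data per event.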
7. Duplication of work
• One person’s signal is another’s background
• Common framework (CMSSW) for analysis but very little ability to share large amounts of work
• People coalesce into working groups, but these are generally small
• While everyone is trying to do the same thing they’re all trying to do it in different ways
• I suspect this is different from, say, Yahoo or last.fm
8. How we work
• Large, ~dedicated compute farms
• PBS/Torque/Maui/SGE accessed via grid interface
• ACLs to prevent misuse of resources
• Not worried about people reading our data, but worried they might delete it accidentally
• Prevent DDoS
9. Where we use Hadoop
• We currently use Hadoop’s HDFS at some of our T2 sites, mainly in the US
• Led by Nebraska; this has been very successful to date
• I suspect more people will switch as centres expand
• Administration tools and performance are particularly appreciated
• Alternatives are academic/research projects and tend to have a different focus (pub for details/rants)
• Maintenance and stability of code are a big issue
• Storage in WNs (worker nodes) is also interesting
10. What would we have to do to run analysis with Hadoop?
• Split events sensibly over the cluster
• By event? By file? Don’t care?
• Data files are ~2 GB - need to reliably reconstruct these files for export if we split them up
• Have CMSSW run in Hadoop
• Many, many pitfalls there, may not even be possible...
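
The requirement to rebuild a file reliably after splitting it can be sketched as follows. This is a toy illustration in plain Python, not how HDFS works internally (HDFS does its own block splitting and checksumming); the function names, the block size and the stand-in data are all assumptions made for the example.

```python
import hashlib


def split_file(data, block_size):
    """Split bytes into (index, block, sha256) tuples."""
    blocks = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        blocks.append((index, block, hashlib.sha256(block).hexdigest()))
    return blocks


def reassemble(blocks):
    """Rebuild the original bytes, checking each block's checksum."""
    parts = []
    for index, block, digest in sorted(blocks, key=lambda b: b[0]):
        if hashlib.sha256(block).hexdigest() != digest:
            raise ValueError(f"block {index} is corrupt")
        parts.append(block)
    return b"".join(parts)


original = bytes(range(256)) * 40          # small stand-in for a ~2 GB data file
blocks = split_file(original, block_size=1000)
blocks.reverse()                           # blocks may come back in any order
restored = reassemble(blocks)
print(restored == original, len(blocks))
```

Note that this sketch detects corrupt blocks but not missing ones; a real export step would also need to record the expected block count or a whole-file checksum. Splitting by raw byte offset also ignores event boundaries, which is exactly the "by event? by file?" question above.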
11. Metadata
• Lots of metadata associated with the data itself
• Moving that to HBase or similar and mining with Hadoop would be interesting
• Currently this is stored in big Oracle databases
• Also, log mining - probably harder to get people interested in this
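
As a sketch of what mining metadata in a MapReduce style might look like, the example below counts events per site using plain Python in place of Hadoop. The record layout, dataset names and site names are invented; the real CMS metadata schema is not described in these slides.

```python
from collections import defaultdict

# Hypothetical metadata records: (dataset name, site, number of events)
records = [
    ("dataset_a", "site_1", 1000),
    ("dataset_a", "site_2", 1000),
    ("dataset_b", "site_1", 250),
    ("dataset_b", "site_3", 750),
]


def map_record(record):
    """Map step: emit (key, value) pairs, here events keyed by site."""
    dataset, site, n_events = record
    yield site, n_events


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


# Shuffle: group mapped values by key, as the framework would do
grouped = defaultdict(list)
for record in records:
    for key, value in map_record(record):
        grouped[key].append(value)

events_per_site = dict(reduce_counts(k, v) for k, v in grouped.items())
print(events_per_site)
```

The same map/shuffle/reduce shape would apply to log mining, with log lines as the input records.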
12. Issues
• Some analyses don’t map onto MapReduce
• Data is complex and in a weird file format
• CMSSW has a large memory footprint
• Not efficient to run only a few events as start-up/tear-down is expensive
• Sociologically it would be difficult to persuade people to move to MapReduce algorithms
• Until people see benefits - demonstrating those benefits is hard; physicists don’t think in cost terms
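
The start-up/tear-down point can be made concrete with a small model: the useful fraction of a job's time is the event-processing time divided by the total including fixed overhead. The timings below are made-up numbers chosen only to show the trend, not measurements of CMSSW.

```python
def cpu_efficiency(n_events, seconds_per_event, overhead_seconds):
    """Fraction of a job's wall time spent processing events."""
    useful = n_events * seconds_per_event
    return useful / (useful + overhead_seconds)


# Hypothetical numbers: 1 s per event, 60 s combined start-up and tear-down
for n in (10, 100, 10_000):
    print(n, round(cpu_efficiency(n, 1.0, 60.0), 3))
```

Under these assumed numbers, a task that handles only a handful of events spends most of its time on overhead, which is why fine-grained map tasks would be a poor fit unless the framework stays loaded between tasks.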