Hadoop & HEP

Simon Metson of Bristol University and CERN's CMS experiment discusses how Hadoop could be used to process CERN event data and other data generated in or by the experiment.

Hadoop and HEP

Simon

Wednesday, 12 August 2009

About us
• CMS will take 1-10 PB of data a year
  • We’ll generate approximately the same amount in simulation data
• It could run for 20-30 years
• We have ~80 large computing centres around the world (>0.5 PB and hundreds of job slots each)
• ~3000 members of the collaboration

Why so much data?
• We have a very big digital camera
• Each event is ~1 MB for normal running
  • Size increases for HI (heavy ion) and upgrade studies
• Need many millions of events to get statistically significant results out for rare processes
  • In my thesis I started with ~5M events to see an eventual “signal” of ~300
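
As a rough back-of-envelope check on the numbers above, the sketch below turns the quoted data volume and event size into an event count per year, and works out the signal fraction from the thesis example. It assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes); the function names are illustrative and not from the talk.

```python
# Back-of-envelope arithmetic using the figures quoted on the slides.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 MB = 10**6 bytes).

PB = 10**15
MB = 10**6


def events_per_year(data_bytes_per_year: int, event_size_bytes: int = 1 * MB) -> int:
    """Number of ~1 MB events that fit in one year's worth of recorded data."""
    return data_bytes_per_year // event_size_bytes


def signal_fraction(signal_events: int, starting_events: int) -> float:
    """Fraction of the starting sample that survives as signal."""
    return signal_events / starting_events


low = events_per_year(1 * PB)            # lower end of the 1-10 PB range
high = events_per_year(10 * PB)          # upper end of the 1-10 PB range
frac = signal_fraction(300, 5_000_000)   # ~300 signal events out of ~5M

print(f"events per year: {low:.0e} to {high:.0e}")
print(f"thesis signal fraction: {frac:.0e}")
```

So on these figures the experiment records of order a billion to ten billion events a year, and only a few events in every hundred thousand end up in the final signal.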

What’s an event?
• We collide protons, which contain quarks
• Quarks interact to produce excited states of matter
• These excited states decay and we record the decay products
• We then work back from the products to “see” the original event
• Many events happen at once
• Think of working out how a carburettor works by crashing 6 cars together on a motorway

An event
• [Image-only slide; no text in the transcript]

Duplication of data
• We keep events in multiple “tiers” of data
• Each tier contains a subset of the information of the parent tier
• We do this to let people work on huge amounts of data quickly
  • In reality this style of working hasn’t really kicked off yet, but it’s early days
• Data is housed at >1 site

Duplication of work
• One person’s signal is another’s background
• Common framework (CMSSW) for analysis but very little ability to share large amounts of work
  • People coalesce into working groups, but these are generally small
• While everyone is trying to do the same thing they’re all trying to do it in different ways
• I suspect this is different from, say, Yahoo or last.fm

How we work
• Large, ~dedicated compute farms
• PBS/Torque/Maui/SGE accessed via grid interface
• ACLs to prevent misuse of resources
  • Not worried about people reading our data, but worried they might delete it accidentally
• Prevent DDoS

Where we use Hadoop
• We currently use Hadoop’s HDFS at some of our T2 sites, mainly in the US
  • Led by Nebraska, and very successful to date
  • I suspect more people will switch as centres expand
• Administration tools as well as performance particularly appreciated
  • Alternatives are academic/research projects and tend to have a different focus (pub for details/rants)
  • Maintenance & stability of code a big issue
• Storage in WNs (worker nodes) is also interesting

What would we have to do to run analysis with Hadoop?
• Split events sensibly over the cluster
  • By event? By file? Don’t care?
  • Data files are ~2 GB, so we need to reliably reconstruct these files for export if we split them up
• Have CMSSW run in Hadoop
  • Many, many pitfalls there, may not even be possible...
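
One way to picture the "reliably reconstruct these files" requirement is the sketch below. It is plain Python, not the HDFS API and not how any CMS site does it: a byte string stands in for a ~2 GB data file, and the helper names (`split_bytes`, `reassemble`) and the choice of SHA-256 are assumptions made for illustration. Each chunk carries its position and a digest, so the file can be rebuilt in order and checked before export.

```python
import hashlib


def split_bytes(data: bytes, chunk_size: int) -> list[dict]:
    """Split data into ordered chunks, each tagged with its index and SHA-256 digest."""
    chunks = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        payload = data[start:start + chunk_size]
        chunks.append({
            "index": index,
            "sha256": hashlib.sha256(payload).hexdigest(),
            "payload": payload,
        })
    return chunks


def reassemble(chunks: list[dict], expected_sha256: str) -> bytes:
    """Rebuild the original bytes, verifying every chunk and then the whole file."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    # Indices must run 0, 1, 2, ... with no gaps or repeats.
    if [c["index"] for c in ordered] != list(range(len(ordered))):
        raise ValueError("missing or duplicated chunk")
    for c in ordered:
        if hashlib.sha256(c["payload"]).hexdigest() != c["sha256"]:
            raise ValueError(f"chunk {c['index']} is corrupt")
    data = b"".join(c["payload"] for c in ordered)
    # The whole-file digest also catches a lost trailing chunk.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("reassembled file does not match the original digest")
    return data


original = bytes(range(256)) * 40        # small stand-in for a ~2 GB data file
digest = hashlib.sha256(original).hexdigest()
pieces = split_bytes(original, chunk_size=1000)
pieces.reverse()                         # chunks may come back in any order
restored = reassemble(pieces, digest)
print(restored == original, len(pieces))
```

Splitting on fixed byte boundaries is the "don’t care" option from the list above; splitting by event would need a reader that understands the file format.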

Metadata
• Lots of metadata associated with the data itself
• Moving that to HBase or similar and mining with Hadoop would be interesting
• Currently this is stored in big Oracle databases
• Also, log mining - probably harder to get people interested in this
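
To give a flavour of what log mining in the MapReduce style could look like, here is a minimal sketch in plain Python that imitates the map, shuffle and reduce steps in a single process. It does not use Hadoop or HBase, and the log format and site names (`site_a`, `site_b`) are hypothetical, not a real CMS format.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines of the form "<site> <job_id> <status>".
LOG_LINES = [
    "site_a job001 OK",
    "site_a job002 FAILED",
    "site_b job003 OK",
    "site_a job004 OK",
    "site_b job005 FAILED",
]


def map_line(line):
    """Map step: emit ((site, status), 1) for one log line."""
    site, _job_id, status = line.split()
    yield (site, status), 1


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts for one key."""
    return key, sum(values)


mapped = chain.from_iterable(map_line(line) for line in LOG_LINES)
counts = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
for (site, status), n in sorted(counts.items()):
    print(site, status, n)
```

Counting job outcomes per site fits this pattern well because each log line can be processed independently of every other line.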

Issues
• Some analyses don’t map onto MapReduce
• Data is complex and in a weird file format
• CMSSW has a large memory footprint
• Not efficient to run only a few events as start-up/tear-down is expensive
• Sociologically it would be difficult to persuade people to move to MapReduce algorithms
  • Until people see benefits - demonstrating those benefits is hard, physicists don’t think in cost terms
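
The start-up/tear-down point can be made concrete with a simple model: if a job spends a fixed overhead starting and stopping the framework and a fixed time per event, the useful fraction of its wall time is n·t / (n·t + overhead). The sketch below evaluates that. The 60 s overhead and 0.1 s per event are made-up illustrative numbers, not measurements of CMSSW.

```python
def useful_fraction(n_events: int, per_event_s: float, overhead_s: float) -> float:
    """Fraction of a job's wall time spent on events rather than start-up and tear-down."""
    busy = n_events * per_event_s
    return busy / (busy + overhead_s)


# Illustrative numbers only: 60 s of fixed overhead, 0.1 s per event.
for n in (10, 1_000, 100_000):
    fraction = useful_fraction(n, per_event_s=0.1, overhead_s=60.0)
    print(f"{n:>7} events: {fraction:.3f} of wall time is useful work")
```

Under these assumed numbers a 10-event job spends almost all of its time on overhead, which is why splitting work into very small tasks is unattractive.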
