Inside hadoop-dev
Steve Loughran – Hortonworks
@steveloughran

ApacheCon EU, November 2012




© Hortonworks Inc. 2012
stevel@apache.org




• HP Labs:
   –Deployment, cloud infrastructure, Hadoop-in-Cloud
• Apache – member and committer
   –Ant (author, Ant in Action), Axis 2
   –Hadoop
• Joined Hortonworks in 2012
   –UK based R&D

Hadoop is the OS for the datacentre




History: ASF releases slowed
[Timeline of releases: 0.20.0, 0.20.1, 0.20.2, 0.21.0, 0.20.20{3,4,5}.0]




• 64 Releases from 2006-2011
• Branches from the last 2.5 years:
   –0.20.{0,1,2} – Stable release without security
   –0.20.2xx.y – Stable release with security
   –0.21.0 – released, unstable, deprecated
   –0.22.0 – orphan, unstable, lack of community
   –0.23.x
• Cloudera CDH: fork w/ patches pushed back

Now: 2 ASF branches
Hadoop 1.x
• Stable, used in production systems
• Features focus on fixes & low-risk performance


Hadoop 2.x/trunk
• The successor
• Alpha-release. Download and test
• Where features & fixes first go in
• Your new code goes here.




Loosely coupled projects form the stack




Incubating & graduate projects


• Kafka
• Giraph
• HCatalog
• Templeton
• Ambari


Integration is a major undertaking

• Latest ASF artifacts
• Stable, tested ASF artifacts
• ASF + own artifacts




What does all this mean?




There is more work than
we can cope with



Hadoop is CS-Hard
• Core HDFS, MR and YARN
  – Distributed Computing
  – Consensus Protocols & Consistency Models
  – Work Scheduling & Data Placement
  – Reliability theory
  – CPU Architecture; x86 assembler
• Others
  – Machine learning
  – Distributed Transactions
  – Graph Theory
  – Queue Theory
  – Correctness proofs



If you have these skills,
come and play!


http://hortonworks.com/careers/
But there are barriers




Your time & cluster

• Full time core business @ Hortonworks + Cloudera

• Full time projects at others:
 LinkedIn, IBM, MSFT, VMware

• Single developers can't compete

• Small test runs take too long

• Your cluster probably isn't as big as Yahoo!'s

• Review-then-Commit neglects everyone's patches



Fear of damage
The worth of Hadoop is the data in HDFS
• the worth of all companies whose data it is
• cost to individuals of data loss
• cost to governments of losing their data

∴ resistance to radical changes in HDFS

Scheduling performance is worth $100Ks to individual organisations

∴ resistance to radical work in the compute layer, except by people with a track record


Fear of support and maintenance costs

• What will show up on Yahoo!-scale clusters?

• Costs of regression testing

• Who maintains the code if the author disappears?

• Documentation?

The 80%-done problem




How to get your code in

• Trust: get known in the -dev lists, meet-ups

• Competence: help with patches other than your own.

• Don't attempt rewrites of the core services

• Help develop plugin-points

• Test across the configuration space

• Test at scale, complexity, “unusualness”
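
One way to read "test across the configuration space" is to enumerate combinations of settings and run the same test suite against each one. The sketch below is only an illustration of that idea, not part of the talk: the three property names are real Hadoop settings, but the candidate values, the `CONFIG_SPACE` dictionary and the `config_matrix` helper are assumptions made up for this example.

```python
from itertools import product

# Illustrative test matrix. The property names exist in Hadoop, but the
# candidate values here are assumptions chosen for the example only.
CONFIG_SPACE = {
    "dfs.replication": [1, 3],
    "io.file.buffer.size": [4096, 65536],
    "hadoop.security.authentication": ["simple", "kerberos"],
}


def config_matrix(space):
    """Yield one dict per combination of the option values in `space`."""
    keys = sorted(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    # Each printed dict is one configuration a test run could be given.
    for config in config_matrix(CONFIG_SPACE):
        print(config)
```

The full cross-product grows quickly, so in practice a sampled or pairwise subset of these combinations may be the only affordable option on a small cluster.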




Testing: not just for the 1%




Testing: not just for the 1%
You have network and scale issues




Documentation & Books




Challenge: Major Works
• YARN and HDFS HA
  – Branch then final review at merge
  – Agile; merge costs scale w/ duration of branch


• Independent works
  – Things that didn't get in: my lifecycle work, …
  – VMware virtualisation: initial failure topology
  – How best to get this stuff in?


• Postgraduate Research
  – How to get the next generation of postgraduate researchers
    developing in and with Apache Hadoop?



A mentoring program?
Guided support for associated projects, with the goal of
merging them into the Hadoop codebase.

Who has the time to mentor?




Better Distributed Development

• Regional developer workshops
  – with local university participation?


• Online meet-ups: Google+ Hangouts?
  – Shared IDEA or other editor sessions
  – Remote presentations and demos




Git + Gerrit




Get involved!
svn.apache.org
issues.apache.org
{hadoop,hbase, mahout, pig, oozie, …}.apache.org




hortonworks.com







Editor's notes

  1. This is my background. Key point: until 2012 I was working on my own things inside a large organisation; now I am full-time on Hadoop.
  2. There's a conflict of interest here between trunk features and branch-1 commits: the latter get into people's hands faster, but threaten the very feature, stability, that justifies branch-1's existence. All the interesting stuff goes into trunk, which is where I push most of my patches (it's easier to avoid backporting).
  3. Bigtop is ±Fedora: bleeding edge, but it also defines the RPM installation layout and startup scripts for everyone, for consistency. Hortonworks trails with the stable artifacts; the team manages the Apache Hadoop releases and the QA team tests them all. Cloudera do a mix of ASF + their own artifacts; they have their own fork of Hadoop with a different set/ordering of patches. CDH vs HDP is a matter of argument. One thing to know is that everyone now tends to use Git to manage their individual branches.
  4. If you think
  5. If you think
  6. Plugin points: yes, I think Google Guice would be the alternative, but, well…
  7. Most people here do not have clusters of 500+ nodes with double-digit PB of storage. Those clusters are the best for stress testing the storage and compute layers, but only a few people have them at this scale: Y!, FB. We use Y!'s test clusters for all the Apache & Hortonworks releases.
  8. You have your own issues. Does it scale down enough? Does it assume the LAN is well managed, clocks are in sync, and DNS and rDNS work? Your problems, especially the networking ones, are your own. This is why testing them matters.
  9. I'm proposing people write books for the benefit of the project, not the fame and money which comes with writing a book. Anyone else who has written a book will know precisely why I'm doing that.
  10. We do have this for the Apache Incubator -but they are projects above and alongside the existing codebase. I'm wondering here how to get medium-sized bits of work done in a way that is timely, not wasted.
  11. There are no easy answers here, but here are some things I think could be good. Git workflow support: stops people having to resubmit patches all the time; git pull can be used to grab and apply a patch. Gerrit code review: makes reviewing much, much easier. We have HUG events, but they tend not to delve into the codebase. I'm proposing doing exactly that, in regions other than just the Bay Area. I will back this up by offering to host an all-day one at a bar/café near me in Bristol if enough people are interested. I'm also advocating university involvement so that they get more of an idea of Hadoop internals. For those of us outside the Bay Area, remote events are good. We've had some good WebEx'd events recently (e.g. the YARN one), but could do with more. I'd like to see something more interactive, and think we could/should try an online-only Google+ Hangout coding event, possibly using a shared IDE.