Apache Bigtop Working Group
7/14/2013
Basic Skills
Hadoop Pipelines (Roman's/Ron's Idea)
Career positioning
Basic Skills
● Working group: you set your own goals. Structure: do a demo in
front of the class. Focus on skills employers are looking for.
● Cluster skills using AWS: create instances with ec2-api, then
extend this using scripts or your own code. You have to demo some
skill.
– Goal: manage multiple instances. You can do this manually, but the number
of keystrokes goes up exponentially as you add new components. You need
some automation or code.
– Bash scripts are good because they are used in Bigtop init.d files and Roman's
code, e.g. copy the mkdir commands into a script and run them.
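The automation goal above can be sketched as follows. This is a minimal illustration in Python rather than Bash. The host names, directory paths and the `ec2-user` login are hypothetical placeholders, not values from the working group's cluster. The script only builds and prints the commands (a dry run).

```python
import shlex


def build_remote_commands(hosts, directories, user="ec2-user"):
    """Return one ssh argument list per host that creates every directory.

    `user` is an assumed login name; change it to match your AMI.
    """
    # Quote each path so spaces or shell metacharacters cannot break the command.
    mkdir = "mkdir -p " + " ".join(shlex.quote(d) for d in directories)
    return [["ssh", f"{user}@{host}", mkdir] for host in hosts]


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    hosts = ["node1.example.com", "node2.example.com"]
    dirs = ["/tmp/bigtop-demo/data", "/tmp/bigtop-demo/logs"]
    for cmd in build_remote_commands(hosts, dirs):
        # Dry run: print the command instead of executing it.
        print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the commands, each argument list could be passed to `subprocess.run(cmd, check=True)`. Adding a host then costs one list entry instead of another round of typing.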
Basic Skills
● Hadoop*: all the features of 2.0.0. No training
course can give this to you. You will have to
work through it yourself.
– Use the 2.0.X unit test code as a base
Hadoop 2.0.0
● Basic FS Review:
– Copy On Write
– Write Through/Write Back, FSCK
– Inodes/BTrees, NN/DN
Working Group
● Not a class which gives you answers. The
answers classes give you are too simple to be
valuable.
● E.g.: does YARN/Hadoop 2.0.X support
multitenancy? Multitenancy means multiple users/companies can't
see each other's data, and if they run a query,
they can't crash the cluster for other users. This
isn't the case now.
Hadoop 2.0.X
● ZooKeeper in HDFS requires some
administration. Do you need to roll back the
ZooKeeper logs when a ZK cluster fails?
Bigtop Basic Skills
● Run Bigtop in AWS in distributed mode, start
w/HDFS
● Create Hadoop* pipelines (Roman's/Ron's idea)
– Ron: book. Great idea!
● Run mvn verify; learn to debug and write tests here
● Will take months, demo driven. People do demos.
Career positioning
● Choose where to spend time.
● Bigdata =
– 1) Devops
– 2) App development (Astyanax)
– 3) Internals
● Don't get distracted into 3). There is not enough time to do all three well. Let
Cloudera ppl help you.
● Do something new that people care about
– Don't try to be better than people w/the same job skill
– Learn efficiently: practice, practice, practice. You can't learn by watching.
Big Company vs. Small
● Big:
– Interpolate Cloudera's strategy. Hadoop 2.0.X runs in the cloud, users access it
from the desktop via a browser and can run Hive/Pig on YOUR data. If you need to ingest
data, e.g. w/Flume, a sys admin has to set this up. So don't spend time getting
Flume to work in Hue. But make sure you know the 2.0.x security models/LDAP,
pipeline debugging when things get stuck, failover, application development
– HUE != Ambari. Why?
– Value in building apps in HUE or w/HUE. The approach for webapps is changing away
from HUE to something like Ambari, which is a simpler user-defined MVC pattern.
– User-defined MVC is better. Why? Think like a manager: what happens as
Django adds more complicated features?
– e.g. Jetty/J2EE example
Small
● Do everything, use Bigtop (BT), get to a working app as
fast as possible. 1) and 2) are very important. You have
to do things quickly.
● You decide how to spend your own time
Structure
● Schedule 3 more meetings after this one, every 2 weeks
● Individual demos
● Install Bigtop, demo WC (word count), PI, demo components
and pipelines.
● Turn pipeline demos into integration tests
● Test in pseudo-distributed mode and on a cluster
● Listen to Roman: Hue....
HBase/Hadoop
● HBase requirements: R/S (RegionServer) 48GB, 8-12
cores/node
● Memory: M/R 1-2GB+, R/S 32GB+, OS 4-8GB,
HDFS
● Disk: 25% for shuffle files for HDFS, <50% full,
JBOD, no RAID
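The sizing figures above can be turned into a quick budget check. The sketch below rests on two assumptions that the slide does not state: that 48GB is the total RAM of a RegionServer node, and that the disk rule means "reserve 25% of raw disk for shuffle files, then keep the remainder under 50% full". The function names and the 12 TB example disk are hypothetical.

```python
def memory_headroom_gb(node_ram_gb, mapreduce_gb, regionserver_gb, os_gb):
    """RAM left for HDFS daemons and everything else on the node."""
    return node_ram_gb - (mapreduce_gb + regionserver_gb + os_gb)


def target_hdfs_data_tb(raw_disk_tb, shuffle_fraction=0.25, max_fill=0.50):
    """Raw disk that should hold HDFS blocks under the assumed reading.

    This is raw block storage, i.e. before dividing by the replication factor.
    """
    after_shuffle = raw_disk_tb * (1.0 - shuffle_fraction)
    return after_shuffle * max_fill


if __name__ == "__main__":
    # Upper ends of the slide's ranges: M/R 2GB, R/S 32GB, OS 8GB on a 48GB node.
    print("RAM headroom (GB):", memory_headroom_gb(48, 2, 32, 8))
    # Hypothetical node with 12 TB of raw JBOD disk.
    print("HDFS data target (TB):", target_hdfs_data_tb(12.0))
```

If the headroom comes out negative, the node is oversubscribed under these assumptions and one of the allocations has to shrink.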
Starting Hadoop, M/R
● Look at the logs in /var/log/hadoop-hdfs
● Cluster ID: under ~/cache/.../data, in the VERSION file;
change the text. DEMO
● No connection: check ping, check core-site.xml and
/etc/hosts
● M/R/YARN: mapred-site.xml. NOTE: M/R uses
port 8021 and so does the NameNode. Keep this
port, run on a different server; open port 8031
● telnet jt:8021, turn off iptables, disable SELinux
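The telnet check above can also be scripted so that it runs against every node. This is a minimal sketch using only the Python standard library. The host name `jt` and ports 8021 and 8031 are taken from the slide, and the 2-second timeout is an arbitrary assumption.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # "jt" must resolve, e.g. through /etc/hosts, for these checks to pass.
    for host, port in [("jt", 8021), ("jt", 8031)]:
        state = "open" if port_is_open(host, port) else "unreachable"
        print(f"{host}:{port} is {state}")
```

An "unreachable" result does not say why the connection failed. The checklist above still applies: ping, /etc/hosts, core-site.xml, iptables, SELinux.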
M/R Setup
● 1 node manager
– WRONG_REDUCE=0
– File Input Format Counters
– Bytes Read=1180
– File Output Format Counters
– Bytes Written=97
– Job Finished in 92.72 seconds
– Estimated value of Pi is 3.14080000000000000000
M/R AWS
● 3 node managers
● File Output Format Counters
● Bytes Written=97
● Job Finished in 86.762 seconds
● Estimated value of Pi is 3.14080000000000000000
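The two Pi runs can be compared with a little arithmetic. The sketch below uses only the figures reported on the slides (92.72 s with 1 node manager, 86.762 s with 3, estimate 3.1408). The helper names are hypothetical, and the efficiency figure assumes the 1-node run is a fair single-worker baseline.

```python
import math


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


def parallel_efficiency(baseline_seconds, new_seconds, workers):
    """Speedup divided by worker count; 1.0 would be perfect scaling."""
    return speedup(baseline_seconds, new_seconds) / workers


if __name__ == "__main__":
    one_node, three_nodes = 92.72, 86.762  # seconds, from the slides
    print(f"speedup:    {speedup(one_node, three_nodes):.3f}")
    print(f"efficiency: {parallel_efficiency(one_node, three_nodes, 3):.3f}")
    print(f"pi error:   {abs(3.1408 - math.pi):.6f}")
```

A speedup close to 1 from tripling the node managers suggests that, for a job this small, fixed startup and scheduling costs probably dominate the run time.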
Zookeeper Administration
Many options for projects
● Integration code testing when Roman gets here
in 2 weeks
● Work w/Ron or Victor on projects
● Update the wiki w/ AWS cluster setup;
automate w/Whirr? + Chef/Puppet?
● Add HBase, ZooKeeper management for
Hadoop (monit/supervisord)
