SlideShare a Scribd company logo
1 of 33
October 2011 – CHADNUG – Chattanooga, TN What is Hadoop? Josh Patterson | Sr Solution Architect
Who is Josh Patterson? josh@cloudera.com Twitter: @jpatanooga Master’s Thesis: self-organizing mesh networks  Published in IAAI-09: TinyTermite: A Secure Routing Algorithm Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) Led team which designed classification techniques for time series and Map Reduce Open source work at  http://openpdc.codeplex.com https://github.com/jpatanooga Today Sr. Solutions Architect at Cloudera
Outline Let’s Set the Stage Story Time: Hadoop and the Smartgrid The Enterprise and Hadoop Use Cases and Tools 3
“After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.” --- Excerpt from the book “The American Gas Station” Data Today: The Oil Industry Circa 1900 4
DNA Sequencing Trends Cost of DNA Sequencing Falling Very Fast 5
Unstructured Data Explosion 6 Complex,  Unstructured Relational ,[object Object]
 Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year,[object Object]
Sometimes makes the data unwieldy
Customers are not creating schemas for all of their data
Yet still may want to join data sets
Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
They are throwing data away because its too expensive to hold
Similar to the oil industry in 1900,[object Object]
Story Time: Hadoop and the Smartgrid 9 “We’re gonna need a bigger boat.” --- Roy Scheider, “Jaws”
NERC Sensor Data Collection openPDC PMU Data Collection circa 2009  ,[object Object]
30 samples/second
4.3B Samples/day
Housed in Hadoop,[object Object]
Major Themes From openPDC Velocity of incoming data Where to put the data? Wanted to scale out, not up Wanted linear scalability in cost vs size Wanted system robust in the face of HW failure Not fans of vendor lock-in What can we realistically expect from analysis and extraction at this scale? How long does it take to scan a Petabyte @ 40MB/s?
Apache Hadoop Open Source Distributed Storage and Processing Engine ,[object Object]
 Move complex and relational data into a single repository
Stores Inexpensively
 Keep raw data always available
 Use industry standard hardware
Processes at the Source
 Eliminate ETL bottlenecks
 Mine data first, govern later MapReduce Hadoop Distributed File System (HDFS)
What Hadoop does ,[object Object]
Scales to petabytes without modification
Manages fault tolerance and data replication automatically
Processes semi-structured and unstructured data easily
Supports MapReduce natively to analyze data in parallel,[object Object]
Hadoop is not a database

More Related Content

What's hot

Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformDatabricks
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxLex Avstreikh
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureDatabricks
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...Databricks
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkDatabricks
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkDatabricks
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersDatabricks
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleDatabricks
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksDatabricks
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Democratizing PySpark for Mobile Game Publishing
Democratizing PySpark for Mobile Game PublishingDemocratizing PySpark for Mobile Game Publishing
Democratizing PySpark for Mobile Game PublishingDatabricks
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!Databricks
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleDatabricks
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Databricks
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningDatabricks
 

What's hot (20)

Lessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics PlatformLessons Learned from Modernizing USCIS Data Analytics Platform
Lessons Learned from Modernizing USCIS Data Analytics Platform
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache Spark
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
 
Is This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the PeopleIs This Thing On? A Well State Model for the People
Is This Thing On? A Well State Model for the People
 
AI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with DatabricksAI Modernization at AT&T and the Application to Fraud with Databricks
AI Modernization at AT&T and the Application to Fraud with Databricks
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Democratizing PySpark for Mobile Game Publishing
Democratizing PySpark for Mobile Game PublishingDemocratizing PySpark for Mobile Game Publishing
Democratizing PySpark for Mobile Game Publishing
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Anomaly Detection at Scale!
Anomaly Detection at Scale!Anomaly Detection at Scale!
Anomaly Detection at Scale!
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
 
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep LearningLeveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
Leveraging Apache Spark for Scalable Data Prep and Inference in Deep Learning
 

Viewers also liked

Day of Presentation 2015 (SIS)
Day of Presentation 2015 (SIS)Day of Presentation 2015 (SIS)
Day of Presentation 2015 (SIS)Tracey Cantabene
 
Stiftung Endamarariek - Health Centre in Tanzania
Stiftung Endamarariek - Health Centre in Tanzania Stiftung Endamarariek - Health Centre in Tanzania
Stiftung Endamarariek - Health Centre in Tanzania Marc Hänggi
 
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñO
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñOAudi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñO
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñOkator
 
Fragmento Ariadne Auf Naxos
Fragmento Ariadne Auf NaxosFragmento Ariadne Auf Naxos
Fragmento Ariadne Auf Naxoslidiuska
 
Control y Dirección Administrativa por Ricardo Efrén Bravo Mendoza
Control y Dirección Administrativa por Ricardo Efrén Bravo MendozaControl y Dirección Administrativa por Ricardo Efrén Bravo Mendoza
Control y Dirección Administrativa por Ricardo Efrén Bravo Mendozaricardoefrenbravomendoza
 
INFORMATION COMMUNICATION TECHNOLOGY I
INFORMATION COMMUNICATION TECHNOLOGY IINFORMATION COMMUNICATION TECHNOLOGY I
INFORMATION COMMUNICATION TECHNOLOGY IYekini Nureni
 
Directorio agroindusria
Directorio agroindusriaDirectorio agroindusria
Directorio agroindusriaProColombia
 
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...Think Wood
 
Unilever AXE Angels Vietnam - Mindshare
Unilever AXE Angels Vietnam - Mindshare Unilever AXE Angels Vietnam - Mindshare
Unilever AXE Angels Vietnam - Mindshare MSSOCIAL
 
How to Startup Company from 1 people to 100 people
How to Startup Company from 1 people to 100 peopleHow to Startup Company from 1 people to 100 people
How to Startup Company from 1 people to 100 peopleDuong The Vinh
 
Ragout de seitan al moscatel de chiclana
Ragout de seitan al moscatel de chiclanaRagout de seitan al moscatel de chiclana
Ragout de seitan al moscatel de chiclanaAremy Palma
 
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...Forum for the Future
 
Gestion de conflictos, negociacion y consenso para proyectos de inversion
Gestion de conflictos, negociacion y consenso para proyectos de inversionGestion de conflictos, negociacion y consenso para proyectos de inversion
Gestion de conflictos, negociacion y consenso para proyectos de inversionUniversidad de Lima
 
Daxten praesentation zum green it bb award 2012
Daxten praesentation zum green it bb award 2012Daxten praesentation zum green it bb award 2012
Daxten praesentation zum green it bb award 2012Netzwerk GreenIT-BB
 

Viewers also liked (20)

Directiva requerimientos b y s, planillas
Directiva requerimientos b y s, planillasDirectiva requerimientos b y s, planillas
Directiva requerimientos b y s, planillas
 
conceptbakery - marketing across media
conceptbakery - marketing across mediaconceptbakery - marketing across media
conceptbakery - marketing across media
 
Insights success The 50 Fastest Growing consultant Company august 2016
Insights success The 50 Fastest Growing consultant Company  august 2016Insights success The 50 Fastest Growing consultant Company  august 2016
Insights success The 50 Fastest Growing consultant Company august 2016
 
Day of Presentation 2015 (SIS)
Day of Presentation 2015 (SIS)Day of Presentation 2015 (SIS)
Day of Presentation 2015 (SIS)
 
Comentario Domingo II
Comentario Domingo IIComentario Domingo II
Comentario Domingo II
 
Stiftung Endamarariek - Health Centre in Tanzania
Stiftung Endamarariek - Health Centre in Tanzania Stiftung Endamarariek - Health Centre in Tanzania
Stiftung Endamarariek - Health Centre in Tanzania
 
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñO
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñOAudi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñO
Audi A3 1.6 Fsi Ambition 115cv 3p. En Venta En Salamanca. AñO
 
Fragmento Ariadne Auf Naxos
Fragmento Ariadne Auf NaxosFragmento Ariadne Auf Naxos
Fragmento Ariadne Auf Naxos
 
Motor eléctrico
Motor eléctricoMotor eléctrico
Motor eléctrico
 
Control y Dirección Administrativa por Ricardo Efrén Bravo Mendoza
Control y Dirección Administrativa por Ricardo Efrén Bravo MendozaControl y Dirección Administrativa por Ricardo Efrén Bravo Mendoza
Control y Dirección Administrativa por Ricardo Efrén Bravo Mendoza
 
SANYIKO CRM/ERP
SANYIKO CRM/ERPSANYIKO CRM/ERP
SANYIKO CRM/ERP
 
INFORMATION COMMUNICATION TECHNOLOGY I
INFORMATION COMMUNICATION TECHNOLOGY IINFORMATION COMMUNICATION TECHNOLOGY I
INFORMATION COMMUNICATION TECHNOLOGY I
 
Directorio agroindusria
Directorio agroindusriaDirectorio agroindusria
Directorio agroindusria
 
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...
Wood Products and Green Building: Rating Systems Recognizing Wood’s Environme...
 
Unilever AXE Angels Vietnam - Mindshare
Unilever AXE Angels Vietnam - Mindshare Unilever AXE Angels Vietnam - Mindshare
Unilever AXE Angels Vietnam - Mindshare
 
How to Startup Company from 1 people to 100 people
How to Startup Company from 1 people to 100 peopleHow to Startup Company from 1 people to 100 people
How to Startup Company from 1 people to 100 people
 
Ragout de seitan al moscatel de chiclana
Ragout de seitan al moscatel de chiclanaRagout de seitan al moscatel de chiclana
Ragout de seitan al moscatel de chiclana
 
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...
Toolkit part 4 - A3 Posters: how to work with scenarios; filtering ideas; val...
 
Gestion de conflictos, negociacion y consenso para proyectos de inversion
Gestion de conflictos, negociacion y consenso para proyectos de inversionGestion de conflictos, negociacion y consenso para proyectos de inversion
Gestion de conflictos, negociacion y consenso para proyectos de inversion
 
Daxten praesentation zum green it bb award 2012
Daxten praesentation zum green it bb award 2012Daxten praesentation zum green it bb award 2012
Daxten praesentation zum green it bb award 2012
 

Similar to Oct 2011 CHADNUG Presentation on Hadoop

Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Edureka!
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchHortonworks
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)GeeksLab Odessa
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 

Similar to Oct 2011 CHADNUG Presentation on Hadoop (20)

Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Actian DataFlow Whitepaper
Actian DataFlow WhitepaperActian DataFlow Whitepaper
Actian DataFlow Whitepaper
 
Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015Webinar: Ways to Succeed with Hadoop in 2015
Webinar: Ways to Succeed with Hadoop in 2015
 
Architecting the Future of Big Data and Search
Architecting the Future of Big Data and SearchArchitecting the Future of Big Data and Search
Architecting the Future of Big Data and Search
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)
FOSS Sea 2014_DataWarehouse & BigData_Владимир Слободянюк ( Luxoft)
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 

More from Josh Patterson

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Josh Patterson
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial IntelligenceJosh Patterson
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecJosh Patterson
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecJosh Patterson
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseJosh Patterson
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksJosh Patterson
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning ModelsJosh Patterson
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Josh Patterson
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JJosh Patterson
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Josh Patterson
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Josh Patterson
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JJosh Patterson
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Josh Patterson
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopJosh Patterson
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNJosh Patterson
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Josh Patterson
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Josh Patterson
 

More from Josh Patterson (20)

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4J
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Oct 2011 CHADNUG Presentation on Hadoop

  • 1. October 2011 – CHADNUG – Chattanooga, TN What is Hadoop? Josh Patterson | Sr Solution Architect
  • 2. Who is Josh Patterson? josh@cloudera.com Twitter: @jpatanooga Master’s Thesis: self-organizing mesh networks Published in IAAI-09: TinyTermite: A Secure Routing Algorithm Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA) Led team which designed classification techniques for time series and Map Reduce Open source work at http://openpdc.codeplex.com https://github.com/jpatanooga Today Sr. Solutions Architect at Cloudera
  • 3. Outline Let’s Set the Stage Story Time: Hadoop and the Smartgrid The Enterprise and Hadoop Use Cases and Tools 3
  • 4. “After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.” --- Excerpt from the book “The American Gas Station” Data Today: The Oil Industry Circa 1900 4
  • 5. DNA Sequencing Trends Cost of DNA Sequencing Falling Very Fast 5
  • 6.
  • 7.
  • 8. Sometimes makes the data unwieldy
  • 9. Customers are not creating schemas for all of their data
  • 10. Yet still may want to join data sets
  • 11. Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
  • 12. They are throwing data away because its too expensive to hold
  • 13.
  • 14. Story Time: Hadoop and the Smartgrid 9 “We’re gonna need a bigger boat.” --- Roy Scheider, “Jaws”
  • 15.
  • 18.
  • 19. Major Themes From openPDC Velocity of incoming data Where to put the data? Wanted to scale out, not up Wanted linear scalability in cost vs size Wanted system robust in the face of HW failure Not fans of vendor lock-in What can we realistically expect from analysis and extraction at this scale? How long does it take to scan a Petabyte @ 40MB/s?
  • 20.
  • 21. Move complex and relational data into a single repository
  • 23. Keep raw data always available
  • 24. Use industry standard hardware
  • 26. Eliminate ETL bottlenecks
  • 27. Mine data first, govern later MapReduce Hadoop Distributed File System (HDFS)
  • 28.
  • 29. Scales to petabytes without modification
  • 30. Manages fault tolerance and data replication automatically
  • 31. Processes semi-structured and unstructured data easily
  • 32.
  • 33. Hadoop is not a database
  • 35. Hadoop is batch oriented
  • 38.
  • 39. MapReduce In simple terms, it’s an application with 2 functions Map Function Think massively parallel “group by” Reduce Function Think “aggregation + processing” Not hard to write Can be challenging to refactor existing algorithms Designed to work hand-in-hand with HDFS Minimizes disk seeks, operates at “transfer rate” of disk
  • 40. Speed @ Scale Scenario 1 million sensors, collecting sample / 5 min 5 year retention policy Storage needs of 15 TB Processing Single Machine: 15TB takes 2.2 DAYS to scan MapReduce@ 20 nodes: Same task takes 11 Minutes
  • 41. MapReduce Tools Pig Procedural language compiled into MR Hive SQL-like language compiled into MR Mahout Collection of data mining algorithms for Hadoop Streaming Ability to write MR with tools such as python, etc
  • 42. “What if you don’t know the questions?” --- Forrester Report The Enterprise and Hadoop 20
  • 43. Apache Hadoop in Production ©2011 Cloudera, Inc. All Rights Reserved. 21 How Apache Hadoop fits into your existing infrastructure. CUSTOMERS ANALYSTS BUSINESS USERS OPERATORS ENGINEERS Web Application Management Tools IDE’s BI / Analytics Enterprise Reporting Enterprise Data Warehouse Operational Rules Engines Logs Files Web Data Relational Databases
  • 44. What the Industry is Doing Microsoft ships CTP of Hadoop Connectors for SQL Server and Parallel Data Warehouse (based on Cloudera’sSqoop) http://blogs.technet.com/b/dataplatforminsider/archive/2011/08/25/microsoft-ships-ctp-of-hadoop-connectors-for-sql-server-and-parallel-data-warehouse.aspx Oracle Announcments at Oracle Open World Connector to Hadoop allowing data flow into Oracle Hadoop Accelerator for Exalogic In-memory processing for MapReduce ETL using Hadoop Integrated analytics on Oracle and Hadoop Copyright 2010 Cloudera Inc. All rights reserved 22
  • 45. Integrating With the Enterprise IT Ecosystem 23 Drivers, language enhancements, testing File System Mount UI Framework SDK FUSE-DFS HUE HUE SDK Workflow Scheduling Metadata Sqoop* frame-work, adapters APACHE OOZIE APACHE OOZIE APACHE HIVE More coming… Data Integration Fast Read/Write Access Languages / Compilers APACHE PIG, APACHE HIVE APACHE FLUME, APACHE SQOOP APACHE HBASE Coordination APACHE ZOOKEEPER Packaging, testing *Sqoop supports JDBC, MySQL, Postgres, HSQLDB
  • 46. Forester Report  World Economic Forum  Declared that data is a new asset class Big data is an applied science project in most companies Major potential constraint is not the cost of the computing technology but the skilled people needed to carry out these projects the data scientists What if you don’t know the questions? Big data is all about exploration without preconceived notions Need tools to ask questions to understand the right questions to ask Much of the software is based on open-source Hadoop http://blogs.forrester.com/brian_hopkins/11-09-30-big_data_will_help_shape_your_markets_next_big_winners Copyright 2010 Cloudera Inc. All rights reserved 24
  • 47. Ever been recommended a friend on Facebook? Ever been recommended a product on Amazon? Ever used the homepage at Yahoo? What Can Hadoop Do For Me? 25
  • 48. Problems Addressed With Hadoop Text Mining Index Building Search Engines Graph Creation Twitter, Facebook Pattern Recognition Naïve Bayes Classification Recommendation Engines Predictive Models Risk Assessment Copyright 2010 Cloudera Inc. All rights reserved 26
  • 49. A Few Named Examples Analyze search terms and subsequent user purchase decisions to tune search results, increase conversion rates Digest long-term historical trade data to identify fraudulent activity and build real-time fraud prevention Model site visitor behavior with analytics that deliver better recommendations for new purchases Continually refine predictive models for advertising response rates to deliver more precisely targeted advertisements Replace expensive legacy ETL system with more flexible, cheaper infrastructure that is 20 times faster Correlate educational outcomes with programs and student histories to improve results Copyright © 2011, Cloudera, Inc. All Rights Reserved. 27
  • 50. Packages For Hadoop DataFu From Linkedin http://sna-projects.com/datafu/ UDFs in Pig used at LinkedIn in many of off-line workflows for data derived products "People You May Know” "Skills” Techniques PageRank Quantiles (median), variance, etc. Sessionization Convenience bag functions Convenience utility functions 28
  • 51. Integration with Libs Mix MapReduce with Machine Learning Libs WEKA KXEN CPLEX Map side “groups data” Reduce side processes groups of data with Lib in parallel Involves tricks in getting K/V pairs into lib Pipes, tmp files, task cache dir, etc 29
  • 52. Ask the right questions up front.. Is the job disk bound? What is the latency requirement on the job? Does it need a sub-minute latency? (not good for Hadoop!) Does the job look at a lot or all of the data at the same time? Hadoop is good at looking at all data, complex/fuzzy joins Is large amounts of ETL processing needed before analysis? Hadoop is good at ETL pre-processing work Can the analysis be converted into MR / Pig / Hive? Copyright 2010 Cloudera Inc. All rights reserved 30
  • 53. What Hadoop Not Good At in Data Mining Anything highly iterative Anything that is extemely CPU bound and not disk bound Algorithms that can’t be inherently parallelized Examples Stochastic Gradient Descent (SGD) Support Vector Machines (SVM) Doesn’t mean they arent great to use
  • 54. Questions? (Thank You!) Hadoop World 2011 http://www.hadoopworld.com/ Cloudera’s Distribution including Apache Hadoop (CDH): http://www.cloudera.com Resources http://www.slideshare.net/cloudera/hadoop-as-the-platform-for-the-smartgrid-at-tva http://gigaom.com/cleantech/the-google-android-of-the-smart-grid-openpdc/ http://news.cnet.com/8301-13846_3-10393259-62.html http://gigaom.com/cleantech/how-to-use-open-source-hadoop-for-the-smart-grid/ Timeseries blog article http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/ 32
  • 55.
  • 57. Lots of great use cases.
  • 58. Check out the downloads page at
  • 60. Get your own copy of Cloudera Distribution for Apache Hadoop (CDH)
  • 61. Grab Demo VMs, Connectors, other useful tools.
  • 62. Contact Josh with any questions at
  • 63. josh@cloudera.comCopyright 2010 Cloudera Inc. All rights reserved 33

Editor's Notes

  1. Its all about the love, baby.
  2. Theme: they through away a lot of valuable gas and oil just like we through away data today
  3. Example of data production trends
  4. But what if some constraints changed?
  5. Talk about changing market dynamics of storage costWhat if some of the previously held constraints changed? Enter hadoop
  6. Let’s set the stage in the context of story, why we were looking at big data for time series.
  7. Ok, so how did we get to this point?Older SCADA systems take 1 data point per 2-4 seconds --- PMUs --- 30 times a sec, 120 PMUs, Growing by 10x factor
  8. Data was sampled 30 times a secondNumber of sensors (Phasor Measurement Units / PMU) was increasing rapidly was 120, heading towards 1000 over next 2 years, currently taking in 4.3 billion samples per dayCost of SAN storage became excessiveLittle analysis possible on SAN due to poor read rates on large amounts (TBs) of data
  9. Pool commodity servers in a single hierarchical namespace.Designed for large files that are written once and read many times.Example here shows what happens with a replication factor of 3, each data block is present in at least 3 separate data nodes.Typical Hadoop node is eight cores with 16GB ram and four 1TB SATA disks.Default block size is 64MB, though most folks now set it to 128MB
  10. Note: these are not simple queries, they are DEEP COMPLEX SCANS
  11. Note: these are not simple queries, they are DEEP COMPLEX SCANS
  12. Theme: they through away a lot of valuable gas and oil just like we through away data today
  13. Apache Hadoop is a new solution in your existing infrastructure.It does not replace any existing major existing investment.Apache brings data that you’re already generating into context and integrates it with your business.You get access to key information about how your business is operating but pulling togetherWeb and application logsUnstructured filesWeb dataRelational dataHadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies
  14. Allows vendor specific solutions to build on topCertain custom code would be written for the platform as well
  15. Theme: they through away a lot of valuable gas and oil just like we through away data today
  16. Transition into “hadoop as a platform”, but vendors are building on top of it for specific vertical challenges
  17. This is probably a better slide 30
  18. Check this against the Mahout impl