SlideShare une entreprise Scribd logo
1  sur  35
Bridging Unstructured & Structured Data with Hadoop and Vertica Glenn Gebhart 	ggebhart@vertica.com Steve Watt         swatt@hp.com
Contents ,[object Object]
Accelerating and monitoring Apache Hadoop deployments with HP CMU
I have my Apache Hadoop Cluster deployed….. Now what ?
Sample application scenario with Apache Hadoop and Vertica,[object Object]
Managing Scale Out with HP CMU ,[object Object]
11 Years Experience
Proven with clusters of 3500+ nodes
Deployment and Management
Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
Command Line or GUI based cluster wide configuration
Manage a node individually or manage a cluster as a whole
Monitoring
Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
Extensible through Collectl integration,[object Object]
6 HP Confidential Tech Bubble?  What does the Data Say? Attribution: CC PascalTerjan via Flickr
7 HP Confidential
But what if I could turn that into this? 8 HP Confidential
And see how the amount invested this year differs from previous years?
10 HP Confidential Where is the money going?
What type of startups get the most investment funding?
Amount invested in Software Startups by Zip Code
How did you do that? 13 HP Confidential How did you  Do that? Attribution: CC  Colin_K on Flickr
14 HP Confidential Apache  Identify Optimal Seed URLs & Crawl to a depth of 2 http://www.crunchbase.com/companies?c=a&q=privately_held Crawl data is stored in segment dirs on the HDFS
15 HP Confidential
16 HP Confidential Making the data STRUCTURED Retrieving HTML Prelim Filtering on URL Company POJO then /t Out
17 HP Confidential Aargh! My viz tool requires zipcodes to plot geospatially!
Apache Pig Script to Join on City to get Zip Code and Write the results to VerticaZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('') AS (State:chararray, City:chararray, ZipCode:int);CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('') AS (Company:chararray,City:chararray,State:chararray,Sector:chararray,Round:chararray,Month:int,Year:int,Investor:chararray,Amount:int);CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);STORECrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor int, Amount varchar(40))}’USINGcom.vertica.pig.VerticaStorer(‘VerticaServer','OSCON','5433','dbadmin','');
The Story So Far Used Nutch to retrieve investment data from web site. Used Hadoop to extract and structure the data Used Pig to add zipcode data. End result is a collection of relations describing investment activity. We’ve got raw data, now we need to understand it.
Why Vertica? Vertica and Hadoop are complementary technologies. Hadoop’s strengths:  Analysis of unstructured data (screen scraping, natural language recognition)  Non-numeric operations (graphics preparation) Vertica’s strengths  Counting, adding, grouping, sorting, …  Rich suite of advanced analytic functions  All at TB+ scales.
Built from the Ground Up: The Four C’s of Vertica Columnar storage and execution Continuous performance Clustering Compression Achieve best data query performance with unique Vertica column store Linear scaling by adding more resources on the fly Store more data, provide more views, use less hardware Query and load 24x7 with zero administration
Getting Data From Here To There
Connecting Vertica And Hadoop Vertica provides connectors for Hadoop 20.2 and Pig 0.7. Acts as a passive component; Hadoop/Pig connect to Vertica to read/write data. Input retrieved from Vertica using standard SQL query. Output written to Vertica table.
Vertica As a M/R Data Source // Set up the configuration and job objects Configuration conf = getConf();  Job job = new Job(conf);   // Set the input format to retrieve data from Vertica job.setInputFormatClass(VerticaInputFormat.class); // Set the query to retrieve data from the Vertica DB  VerticaInputFormat.setInput( 	job, 	“SELECT * FROM foo WHERE bar = ‘baz’ );

Contenu connexe

Tendances

Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Hadoop World Vertica
Hadoop World VerticaHadoop World Vertica
Hadoop World VerticaOmer Trajman
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to HiveDatabricks
 
Apache Spark Side of Funnels
Apache Spark Side of FunnelsApache Spark Side of Funnels
Apache Spark Side of FunnelsDatabricks
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsKasper Sørensen
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksGoDataDriven
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 

Tendances (20)

Containerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta LakeContainerized Stream Engine to Build Modern Delta Lake
Containerized Stream Engine to Build Modern Delta Lake
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Hadoop World Vertica
Hadoop World VerticaHadoop World Vertica
Hadoop World Vertica
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Data Migration with Spark to Hive
Data Migration with Spark to HiveData Migration with Spark to Hive
Data Migration with Spark to Hive
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache Spark Side of Funnels
Apache Spark Side of FunnelsApache Spark Side of Funnels
Apache Spark Side of Funnels
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Percona Lucid Db
Percona Lucid DbPercona Lucid Db
Percona Lucid Db
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data points
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 

En vedette

Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks
 
Vertica loading best practices
Vertica loading best practicesVertica loading best practices
Vertica loading best practicesZvika Gutkin
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guideneinamat
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)LivePerson
 
Vertica mpp columnar dbms
Vertica mpp columnar dbmsVertica mpp columnar dbms
Vertica mpp columnar dbmsZvika Gutkin
 
Vertica finalist interview
Vertica finalist interviewVertica finalist interview
Vertica finalist interviewMITX
 
Optimize Your Vertica Data Management Infrastructure
Optimize Your Vertica Data Management InfrastructureOptimize Your Vertica Data Management Infrastructure
Optimize Your Vertica Data Management InfrastructureImanis Data
 
Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro wayZvika Gutkin
 
Vertica 7.0 Architecture Overview
Vertica 7.0 Architecture OverviewVertica 7.0 Architecture Overview
Vertica 7.0 Architecture OverviewAndrey Karpov
 
How to install Vertica in a single node.
How to install Vertica in a single node.How to install Vertica in a single node.
How to install Vertica in a single node.Anil Maharjan
 
A short introduction to Vertica
A short introduction to VerticaA short introduction to Vertica
A short introduction to VerticaTommi Siivola
 
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed RJorge Martinez de Salinas
 
HPE Vertica Chile Desayuno Oct 2016
HPE Vertica Chile Desayuno Oct 2016HPE Vertica Chile Desayuno Oct 2016
HPE Vertica Chile Desayuno Oct 2016Analytics10
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataVoltDB
 
Hortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinarHortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinarHortonworks
 

En vedette (20)

Hortonworks and HP Vertica Webinar
Hortonworks and HP Vertica WebinarHortonworks and HP Vertica Webinar
Hortonworks and HP Vertica Webinar
 
Vertica loading best practices
Vertica loading best practicesVertica loading best practices
Vertica loading best practices
 
Vertica-Database
Vertica-DatabaseVertica-Database
Vertica-Database
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 
Hp vertica certification guide
Hp vertica certification guideHp vertica certification guide
Hp vertica certification guide
 
Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)Introduction to Vertica (Architecture & More)
Introduction to Vertica (Architecture & More)
 
Vertica mpp columnar dbms
Vertica mpp columnar dbmsVertica mpp columnar dbms
Vertica mpp columnar dbms
 
Vertica finalist interview
Vertica finalist interviewVertica finalist interview
Vertica finalist interview
 
Optimize Your Vertica Data Management Infrastructure
Optimize Your Vertica Data Management InfrastructureOptimize Your Vertica Data Management Infrastructure
Optimize Your Vertica Data Management Infrastructure
 
Vertica
VerticaVertica
Vertica
 
Vertica the convertro way
Vertica   the convertro wayVertica   the convertro way
Vertica the convertro way
 
Vertica 7.0 Architecture Overview
Vertica 7.0 Architecture OverviewVertica 7.0 Architecture Overview
Vertica 7.0 Architecture Overview
 
How to install Vertica in a single node.
How to install Vertica in a single node.How to install Vertica in a single node.
How to install Vertica in a single node.
 
HP Vertica basics
HP Vertica basicsHP Vertica basics
HP Vertica basics
 
Vertica
VerticaVertica
Vertica
 
A short introduction to Vertica
A short introduction to VerticaA short introduction to Vertica
A short introduction to Vertica
 
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
 
HPE Vertica Chile Desayuno Oct 2016
HPE Vertica Chile Desayuno Oct 2016HPE Vertica Chile Desayuno Oct 2016
HPE Vertica Chile Desayuno Oct 2016
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time Data
 
Hortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinarHortonworks and Voltage Security webinar
Hortonworks and Voltage Security webinar
 

Similaire à Bridging Structured and Unstructred Data with Apache Hadoop and Vertica

Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!Guido Schmutz
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGMatthew McCullough
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
AWS Office Hours: Amazon Elastic MapReduce
AWS Office Hours: Amazon Elastic MapReduce AWS Office Hours: Amazon Elastic MapReduce
AWS Office Hours: Amazon Elastic MapReduce Amazon Web Services
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...Big Data Spain
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfishFei Dong
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingMax Kleiner
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupLandoop Ltd
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.Sergey Zelvenskiy
 
How to generate a 100+ page website using parameterisation in R
How to generate a 100+ page website using parameterisation in RHow to generate a 100+ page website using parameterisation in R
How to generate a 100+ page website using parameterisation in RPaul Bradshaw
 
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataDynamical Software, Inc.
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sri Ambati
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 

Similaire à Bridging Structured and Unstructred Data with Apache Hadoop and Vertica (20)

Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!
 
Cascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUGCascading Through Hadoop for the Boulder JUG
Cascading Through Hadoop for the Boulder JUG
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
AWS Office Hours: Amazon Elastic MapReduce
AWS Office Hours: Amazon Elastic MapReduce AWS Office Hours: Amazon Elastic MapReduce
AWS Office Hours: Amazon Elastic MapReduce
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 
Cascading on starfish
Cascading on starfishCascading on starfish
Cascading on starfish
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
maxbox starter72 multilanguage coding
maxbox starter72 multilanguage codingmaxbox starter72 multilanguage coding
maxbox starter72 multilanguage coding
 
MapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London MeetupMapReduce with Scalding @ 24th Hadoop London Meetup
MapReduce with Scalding @ 24th Hadoop London Meetup
 
Spark Streaming, Machine Learning and meetup.com streaming API.
Spark Streaming, Machine Learning and  meetup.com streaming API.Spark Streaming, Machine Learning and  meetup.com streaming API.
Spark Streaming, Machine Learning and meetup.com streaming API.
 
How to generate a 100+ page website using parameterisation in R
How to generate a 100+ page website using parameterisation in RHow to generate a 100+ page website using parameterisation in R
How to generate a 100+ page website using parameterisation in R
 
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
Three Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big DataThree Functional Programming Technologies for Big Data
Three Functional Programming Technologies for Big Data
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for Cassandra
 

Plus de Steve Watt

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusionedSteve Watt
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systemsSteve Watt
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoopSteve Watt
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureSteve Watt
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using HadoopSteve Watt
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 

Plus de Steve Watt (12)

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusioned
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systems
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoop
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructure
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using Hadoop
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
Final deck
Final deckFinal deck
Final deck
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Extractiv
ExtractivExtractiv
Extractiv
 

Dernier

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 

Bridging Structured and Unstructred Data with Apache Hadoop and Vertica

  • 1. Bridging Unstructured & Structured Data with Hadoop and Vertica Glenn Gebhart ggebhart@vertica.com Steve Watt swatt@hp.com
  • 2.
  • 3. Accelerating and monitoring Apache Hadoop deployments with HP CMU
  • 4. I have my Apache Hadoop Cluster deployed….. Now what ?
  • 5.
  • 6.
  • 8. Proven with clusters of 3500+ nodes
  • 10. Clone a Node (Hadoop Slave) and Deploy to an entire Logical Group.
  • 11. Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
  • 12. Command Line or GUI based cluster wide configuration
  • 13. Manage a node individually or manage a cluster as a whole
  • 15. Scalable Non-intrusive Monitoring across a wide set of infrastructure metrics
  • 16.
  • 17. 6 HP Confidential Tech Bubble? What does the Data Say? Attribution: CC PascalTerjan via Flickr
  • 19. But what if I could turn that into this? 8 HP Confidential
  • 20. And see how the amount invested this year differs from previous years?
  • 21. 10 HP Confidential Where is the money going?
  • 22. What type of startups get the most investment funding?
  • 23. Amount invested in Software Startups by Zip Code
  • 24. How did you do that? 13 HP Confidential How did you Do that? Attribution: CC  Colin_K on Flickr
  • 25. 14 HP Confidential Apache Identify Optimal Seed URLs & Crawl to a depth of 2 http://www.crunchbase.com/companies?c=a&q=privately_held Crawl data is stored in segment dirs on the HDFS
  • 27. 16 HP Confidential Making the data STRUCTURED Retrieving HTML Prelim Filtering on URL Company POJO then /t Out
  • 28. 17 HP Confidential Aargh! My viz tool requires zipcodes to plot geospatially!
  • 29. Apache Pig Script to Join on City to get Zip Code and Write the results to VerticaZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('') AS (State:chararray, City:chararray, ZipCode:int);CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('') AS (Company:chararray,City:chararray,State:chararray,Sector:chararray,Round:chararray,Month:int,Year:int,Investor:chararray,Amount:int);CrunchBaseZip = JOIN CrunchBase BY (City,State), ZipCodes BY (City,State);STORECrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor int, Amount varchar(40))}’USINGcom.vertica.pig.VerticaStorer(‘VerticaServer','OSCON','5433','dbadmin','');
  • 30. The Story So Far Used Nutch to retrieve investment data from web site. Used Hadoop to extract and structure the data Used Pig to add zipcode data. End result is a collection of relations describing investment activity. We’ve got raw data, now we need to understand it.
  • 31. Why Vertica? Vertica and Hadoop are complementary technologies. Hadoop’s strengths: Analysis of unstructured data (screen scraping, natural language recognition) Non-numeric operations (graphics preparation) Vertica’s strengths Counting, adding, grouping, sorting, … Rich suite of advanced analytic functions All at TB+ scales.
  • 32. Built from the Ground Up: The Four C’s of Vertica Columnar storage and execution Continuous performance Clustering Compression Achieve best data query performance with unique Vertica column store Linear scaling by adding more resources on the fly Store more data, provide more views, use less hardware Query and load 24x7 with zero administration
  • 33. Getting Data From Here To There
  • 34. Connecting Vertica And Hadoop Vertica provides connectors for Hadoop 20.2 and Pig 0.7. Acts as a passive component; Hadoop/Pig connect to Vertica to read/write data. Input retrieved from Vertica using standard SQL query. Output written to Vertica table.
  • 35. Vertica As a M/R Data Source // Set up the configuration and job objects Configuration conf = getConf(); Job job = new Job(conf); // Set the input format to retrieve data from Vertica job.setInputFormatClass(VerticaInputFormat.class); // Set the query to retrieve data from the Vertica DB VerticaInputFormat.setInput( job, “SELECT * FROM foo WHERE bar = ‘baz’ );
  • 36. Vertica As a M/R Data Sink // Set up the configuration and job objects Configuration conf = getConf(); Job job = new Job(conf); // Set the output format to to write data to Vertica job.setOutputKeyClass(Text.class); job.setOutputValueClass(VerticaRecord.class); job.setOutputFormatClass(VerticaOutputFormat.class); // Define the table which will hold the output VerticaOutputFormat.setOutput( job, <table name>, <truncate table?>, <col 1 def>, <col 2 def>, …, <col N def> );
  • 37. Reading Data Via Pig # Read some tuples A = LOAD 'sql://< Your query here >' USING com.vertica.pig.VerticaLoader( ‘server1,server2,server3', ‘< DB Name>','5433',‘< user >',‘< password >’ ); 26
  • 38. Writing Data Via Pig # Write some tuples STORE < some var > INTO '{ < table name > (< col 1 def >, < col 2 def >, … ) }' USING com.vertica.pig.VerticaStorer( ‘< server >',‘< DB >','5433',‘< user >',‘< password >’ ); 27
  • 39. Reporting And Data Visualization
  • 40. Does My Favorite Application Work With Vertica? Vertica is an ANSI SQL99 compliant DB. Comes with drivers for ODBC, JDBC, and ADO.Net. If your tool uses a SQL DB, and speaks one of these protocols, it’ll work just fine.
  • 42. Traditional Reports Integrates smoothly with reporting frontends such as Jasper and Pentaho. Scriptable via the vsqlcommand line tool. C/C++ SDK for parallelized, in-DB computation. But… you have to know what questions you want to ask.
  • 45. Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges. Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data. Getting them to work together is easy. In Closing…