SlideShare une entreprise Scribd logo
1  sur  8
Apache HCatalog
● What is it ?
● How does it work ?
● Interfaces
● Architecture
● Example
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – What is it ?
● A Hive metastore interface set
● Shared schema and data types for Hadoop tools
● Rest interface for external data access
● Assists inter operability between
– Pig, Hive and Map Reduce
● Table abstraction of data storage
● Will provide data availability notifications
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – How does it work ?
● Pig
– HCatLoader + HCatStorer interface
● Map Reduce
– HCatInputFormat + HCatOutputFormat interface
● Hive
– No interface necessary
– Direct access to meta data
● Notifications when data available
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Interfaces
● Interface via
– Pig
– Map Reduce
– Hive
– Streaming
● Access data via
– Orc file
– RC file
– Text file
– Sequence file
– Custom format
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Interfaces
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Architecture
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Example
A data flow example from hive.apache.org
First Joe in data acquisition uses distcp to get data onto the grid.
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
Second Sally in data processing uses Pig to cleanse and prepare the data.
Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) = 0;
…
store Z into 'data/processedevents/20100819/data';
With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started.
A = load 'rawevents' using HCatLoader();
B = filter A by date = '20100819' and by bot_finder(zeta) = 0;
…
store Z into 'processedevents' using HcatStorer("date=20100819");
Note that the pig job refers to the data by name rawevents rather than a location
Now access the data via Hive QL
select advertiser_id, count(clicks) from processedevents
where date = ‘20100819’ group by advertiser_id;
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You can just pay for those hours that you need
● To solve your problems

Contenu connexe

En vedette

Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera managerChris Westin
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Sparkdatamantra
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectGreg Makowski
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNHortonworks
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herreroBigData_Europe
 
Deploying and Managing Hadoop Clusters with AMBARI
Deploying and Managing Hadoop Clusters with AMBARIDeploying and Managing Hadoop Clusters with AMBARI
Deploying and Managing Hadoop Clusters with AMBARIDataWorks Summit
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache SqoopAvkash Chauhan
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
BigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigData_Europe
 

En vedette (18)

SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
 
Avro introduction
Avro introductionAvro introduction
Avro introduction
 
Cluster management and automation with cloudera manager
Cluster management and automation with cloudera managerCluster management and automation with cloudera manager
Cluster management and automation with cloudera manager
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot Project
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARN
 
Restaurant management
Restaurant managementRestaurant management
Restaurant management
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 
Hive ppt (1)
Hive ppt (1)Hive ppt (1)
Hive ppt (1)
 
Deploying and Managing Hadoop Clusters with AMBARI
Deploying and Managing Hadoop Clusters with AMBARIDeploying and Managing Hadoop Clusters with AMBARI
Deploying and Managing Hadoop Clusters with AMBARI
 
Big data 101
Big data 101Big data 101
Big data 101
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
BigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate ChangeBigDataEurope - Big Data & Climate Change
BigDataEurope - Big Data & Climate Change
 
Data analysis of weather forecasting
Data analysis of weather forecastingData analysis of weather forecasting
Data analysis of weather forecasting
 

Plus de Mike Frampton (20)

Apache Airavata
Apache AiravataApache Airavata
Apache Airavata
 
Apache MADlib AI/ML
Apache MADlib AI/MLApache MADlib AI/ML
Apache MADlib AI/ML
 
Apache MXNet AI
Apache MXNet AIApache MXNet AI
Apache MXNet AI
 
Apache Gobblin
Apache GobblinApache Gobblin
Apache Gobblin
 
Apache Singa AI
Apache Singa AIApache Singa AI
Apache Singa AI
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
OrientDB
OrientDBOrientDB
OrientDB
 
Prometheus
PrometheusPrometheus
Prometheus
 
Apache Tephra
Apache TephraApache Tephra
Apache Tephra
 
Apache Kudu
Apache KuduApache Kudu
Apache Kudu
 
Apache Bahir
Apache BahirApache Bahir
Apache Bahir
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
JanusGraph DB
JanusGraph DBJanusGraph DB
JanusGraph DB
 
Apache Ignite
Apache IgniteApache Ignite
Apache Ignite
 
Apache Samza
Apache SamzaApache Samza
Apache Samza
 
Apache Flink
Apache FlinkApache Flink
Apache Flink
 
Apache Edgent
Apache EdgentApache Edgent
Apache Edgent
 
Apache CouchDB
Apache CouchDBApache CouchDB
Apache CouchDB
 
An introduction to Apache Mesos
An introduction to Apache MesosAn introduction to Apache Mesos
An introduction to Apache Mesos
 
An introduction to Pentaho
An introduction to PentahoAn introduction to Pentaho
An introduction to Pentaho
 

Dernier

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

An introduction to Apache HCatalog

  • 1. Apache HCatalog ● What is it ? ● How does it work ? ● Interfaces ● Architecture ● Example www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 2. HCatalog – What is it ? ● A Hive metastore interface set ● Shared schema and data types for Hadoop tools ● Rest interface for external data access ● Assists inter operability between – Pig, Hive and Map Reduce ● Table abstraction of data storage ● Will provide data availability notifications www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 3. HCatalog – How does it work ? ● Pig – HCatLoader + HCatStorer interface ● Map Reduce – HCatInputFormat + HCatOutputFormat interface ● Hive – No interface necessary – Direct access to meta data ● Notifications when data available www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 4. HCatalog – Interfaces ● Interface via – Pig – Map Reduce – Hive – Streaming ● Access data via – Orc file – RC file – Text file – Sequence file – Custom format www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 7. HCatalog – Example A data flow example from hive.apache.org First Joe in data acquisition uses distcp to get data onto the grid. hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'" Second Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS. A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …); B = filter A by bot_finder(zeta) = 0; … store Z into 'data/processedevents/20100819/data'; With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started. A = load 'rawevents' using HCatLoader(); B = filter A by date = '20100819' and by bot_finder(zeta) = 0; … store Z into 'processedevents' using HcatStorer("date=20100819"); Note that the pig job refers to the data by name rawevents rather than a location Now access the data via Hive QL select advertiser_id, count(clicks) from processedevents where date = ‘20100819’ group by advertiser_id; www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 8. Contact Us ● Feel free to contact us at – www.semtech-solutions.co.nz – info@semtech-solutions.co.nz ● We offer IT project consultancy ● We are happy to hear about your problems ● You can just pay for those hours that you need ● To solve your problems