1. Apache HCatalog
● What is it?
● How does it work?
● Interfaces
● Architecture
● Example
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
2. HCatalog – What is it?
● A set of interfaces to the Hive metastore
● Shared schema and data types for Hadoop tools
● REST interface for external data access
● Assists interoperability between
– Pig, Hive and MapReduce
● Table abstraction of data storage
● Will provide data availability notifications
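The REST interface mentioned above can be illustrated with a small sketch that only builds the URL a client would request to describe a table. It assumes the WebHCat (Templeton) URL layout and its usual default port 50111; the host name, user name and the helper function name are hypothetical and chosen for illustration only.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the URL of a WebHCat 'describe table' request (an HTTP GET).

    Assumption: the server exposes the /templeton/v1/ddl/... layout and
    accepts the caller's identity in the user.name query parameter.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe=""))
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and user; 'rawevents' is the table used in the example slide.
url = webhcat_table_url("hcat-host.example.com", "default", "rawevents", "sally")
print(url)
```

The sketch stops short of sending the request, so it runs without a cluster; any HTTP client could issue a GET against the resulting URL.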
3. HCatalog – How does it work?
● Pig
– HCatLoader + HCatStorer interface
● MapReduce
– HCatInputFormat + HCatOutputFormat interface
● Hive
– No interface necessary
– Direct access to metadata
● Notifications when data becomes available
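The two ideas on this slide (tools address data by table name, and consumers are notified when data arrives) can be sketched with a toy in-memory catalog. This is not the HCatalog API: the class `ToyCatalog` and its methods are hypothetical stand-ins, and a plain callback takes the place of the notification message a real deployment would send.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a metastore (illustration only, not HCatalog)."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # table -> {partition value: location}
        self._listeners = defaultdict(list)   # table -> list of callbacks

    def subscribe(self, table, callback):
        """Register a callback invoked whenever a partition is added to table."""
        self._listeners[table].append(callback)

    def add_partition(self, table, value, location):
        """Record where a partition lives, then notify every subscriber."""
        self._partitions[table][value] = location
        for callback in self._listeners[table]:
            callback(table, value)

    def locate(self, table, value):
        """Resolve a table name and partition value to a storage location."""
        return self._partitions[table][value]


catalog = ToyCatalog()
started = []  # records which jobs would have been triggered

# The consumer subscribes once instead of polling the file system.
catalog.subscribe("rawevents", lambda table, value: started.append((table, value)))

# The producer registers the new partition; the subscriber is notified.
catalog.add_partition("rawevents", "20100819",
                      "hdfs://data/rawevents/20100819/data")

print(started)
print(catalog.locate("rawevents", "20100819"))
```

The consumer never handles a path directly: it asks for `rawevents` and a partition value, and the catalog supplies the location.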
4. HCatalog – Interfaces
● Interface via
– Pig
– MapReduce
– Hive
– Streaming
● Access data via
– ORC file
– RCFile
– Text file
– SequenceFile
– Custom format
7. HCatalog – Example
A data flow example from hive.apache.org
First, Joe in data acquisition uses distcp to get data onto the grid.
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
Second, Sally in data processing uses Pig to cleanse and prepare the data.
Without HCatalog, Sally must either be told by Joe when the data is available or poll HDFS for it.
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) == 0;
…
store Z into '/data/processedevents/20100819/data';
With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started.
A = load 'rawevents' using HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
…
store Z into 'processedevents' using HCatStorer('date=20100819');
Note that the Pig job refers to the data by the table name rawevents rather than by a location.
Third, the processed data can be queried with HiveQL.
select advertiser_id, count(clicks) from processedevents
where date = '20100819' group by advertiser_id;
8. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems