Project Voldemort: Big data loading

Dan Harvey

Lightning talk about loading big data into Voldemort read-only stores.

Big Data Loading
● So you've processed your data...
● Now, how to get that to people quickly?

● Project Voldemort's Read-Only stores
● Simple key-value store
● Based upon Amazon Dynamo
● Simple Java interface and operation
● Immutable read-only stores

Read-Only Stores
● Precompute in Hadoop or elsewhere
● Creates an indexed key-value store
● One reducer (or file) per node
● Replicated data for failover
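
To make the "indexed key-value store" idea concrete, here is a minimal sketch of an immutable store that is built once and then only read. It is a toy under stated assumptions, not Voldemort's actual file format: the class name ToyReadOnlyStore and its in-memory layout (a sorted key index plus one packed data array) are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Toy immutable key-value store: a sorted index over one packed data array. */
final class ToyReadOnlyStore {
    private final String[] keys;  // sorted keys (the "index")
    private final int[] offsets;  // value i occupies data[offsets[i] .. offsets[i + 1])
    private final byte[] data;    // all values packed back to back (the "data file")

    private ToyReadOnlyStore(String[] keys, int[] offsets, byte[] data) {
        this.keys = keys;
        this.offsets = offsets;
        this.data = data;
    }

    /** Build step: done once, offline (in the talk this work happens in Hadoop). */
    static ToyReadOnlyStore build(Map<String, String> entries) {
        TreeMap<String, String> sorted = new TreeMap<>(entries);
        String[] keys = new String[sorted.size()];
        int[] offsets = new int[sorted.size() + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            keys[i] = e.getKey();
            offsets[i] = out.size();
            byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
            out.write(value, 0, value.length);
            i++;
        }
        offsets[i] = out.size(); // end marker for the last value
        return new ToyReadOnlyStore(keys, offsets, out.toByteArray());
    }

    /** Read step: binary search the index, then slice the packed data. */
    String get(String key) {
        int pos = Arrays.binarySearch(keys, key);
        if (pos < 0) {
            return null; // key not present
        }
        int length = offsets[pos + 1] - offsets[pos];
        return new String(data, offsets[pos], length, StandardCharsets.UTF_8);
    }
}
```

Because nothing changes after build(), lookups need no locking, and the whole structure can be copied to a node as plain files.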

● Atomically loads into nodes
● Copy from HDFS or another HTTP source
● Very fast, limited by network or storage I/O
● Can be throttled so live services are not affected
● Can also roll back to previous versions
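
The atomic load and rollback behaviour can be sketched as a pointer swap between complete, immutable versions. This is only an illustration of the idea, assuming each version fits in an in-memory map; VersionedStore, swap and rollback are hypothetical names, not Voldemort's API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Toy store that serves one immutable version at a time and keeps older ones. */
final class VersionedStore {
    // Readers always see one complete version, never a half-loaded one.
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Map.of());
    // Older versions, most recent first, kept for rollback.
    private final Deque<Map<String, String>> previous = new ArrayDeque<>();

    String get(String key) {
        return current.get().get(key);
    }

    /** Atomically replaces the served version with a fully built new one. */
    synchronized void swap(Map<String, String> newVersion) {
        previous.push(current.getAndSet(Map.copyOf(newVersion)));
    }

    /** Restores the most recent earlier version; returns false if there is none. */
    synchronized boolean rollback() {
        if (previous.isEmpty()) {
            return false;
        }
        current.set(previous.pop());
        return true;
    }
}
```

The new data is prepared completely before swap() is called, so the switch itself is a single reference update and a bad load can be undone by calling rollback().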

Example Hadoop Store Builder
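import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import voldemort.store.readonly.mr.AbstractHadoopStoreBuilderMapper;

// Imports assume json-simple and Voldemort's Hadoop store builder classes.
public class JsonStoreBuilder
        extends AbstractHadoopStoreBuilderMapper<LongWritable, Text> {

    private final JSONParser parser = new JSONParser();

    // Key: the "name" field of the JSON record on this line.
    @Override
    public Object makeKey(LongWritable lineNo, Text line) {
        try {
            JSONObject json = (JSONObject) parser.parse(line.toString());
            return json.get("name");
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid JSON at input position " + lineNo, e);
        }
    }

    // Value: the whole JSON line, stored unchanged.
    @Override
    public Object makeValue(LongWritable lineNo, Text line) {
        return line.toString();
    }
}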

Example Hadoop Job
$VOLDEMORT_HOME/bin/hadoop-build-readonly-store.sh \
    --input hdfs/JsonFile.json \
    --output hdfs/StoreOut \
    --tmpdir hdfs/temp_dir \
    --mapper uk.co.danharvey.hadoop.JsonStoreBuilder \
    --jar hadoop-core.jar \
    --cluster config/cluster.xml \
    --storename example_store \
    --storedefinitions config/store.xml \
    --chunksize 1073741824 \
    --replication 1

Pig to JSON Index
● Output JSON from Pig
STORE bag INTO 'data.json' USING JsonStorage();

● JsonStoreBuilder
● Extends Voldemort StoreBuilder
● Easily index any field

● Code is available here:
http://github.com/danharvey/pigJsonUtils
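
As a rough illustration of "index any field", the sketch below picks the key out of each JSON line by field name and keeps the whole line as the value. It is a standalone toy, not the pigJsonUtils code: FieldIndexer is a hypothetical name, and the regex is an assumption that only handles simple string-valued fields without escaped quotes. A real job would use a proper JSON parser.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy indexer: key = chosen JSON field, value = the whole JSON line. */
final class FieldIndexer {

    /** Returns the first string value found for the field, or null if absent. */
    static String extractField(String jsonLine, String field) {
        // Naive match of "field" : "value"; does not handle escapes or non-string values.
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(field) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    /** Builds key -> line pairs; lines without the field are skipped. */
    static Map<String, String> index(List<String> lines, String field) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : lines) {
            String key = extractField(line, field);
            if (key != null) {
                out.put(key, line);
            }
        }
        return out;
    }
}
```

Changing the field argument is all it takes to index the same data by a different field, which mirrors what makeKey does in the store builder.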
