SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
NoSQL solutions

A comparative study
Brief history of databases




 ● 1980: Oracle DBMS released
 ● 1995: first release of MySQL, a lightweight, asynchronously
   replicated database
 ● 2000: MySQL 3.23 is adopted by most startups as a
   database plateform
 ● 2004: Google develops BigTable
 ● 2009: The term NoSQL is coined out
Evolution of database storage

● Monolithic: using traditional databases (Oracle, Sybase...)
   ○ High cost in infrastructure (big CPU, big SAN)
   ○ High cost in software licenses (commercial software)
   ○ Qualified personnel required (certified DBAs)
● LAMP platform (circa 2000)
   ○ Free software
   ○ Runs on commodity hardware
   ○ Low administration (don't need a DBA until your data
     grows important)
● NoSQL
   ○ Scales indefinitely (replication only scales vertically)
   ○ Logic is in the application
   ○ Doesn't require SQL or internals knowledge
Scaling
● Vertical scaling (MySQL)
   ○ Scaling by adding more replicas
   ○ Load is evenly distributed across the replicas
   ○ Problems !
       ■ Cached data is also distributed evenly: inefficient
         resource usage
       ■ Efficiency goes down as dataset exceeds available
         memory, cannot get more performance by adding
         replicas
       ■ Write bottleneck

● Horizontal scaling (NoSQL)
   ○ Data is distributed evenly across the nodes (hashing)
   ○ More capacity ? Just add one node
   ○ Loss of traditional database properties (ACID)
NoSQL definition




● Not only SQL (is not exposed via query language)
● Non-relational (denormalized data)
● Distributed (horizontal partitioning)
● Different implementations :
   ○ Key-value store
   ○ Document database
   ○ Graph database
Key-value stores

  ● Schema-less storage
  ● Basic associative arrays
{ "username"=> "guillaume" }
  ● Key-value stores can have column families and subkeys
{ "user:name"=> "guillaume", "user:uid" => 1000 }

  ● Implementations
      ○ K/V caches: Redis, memcache
         ■ in-memory databases
      ○ Column databases: Cassandra, HBase
         ■ Data is stored in a tabular fashion (as opposed to
           rows in traditional RDBMS)
Document Databases

  ● Data is organized into documents:
FirstName="Frank", City="Haifa", Hobby="Photographing"
  ● No strong typing or predefined fields; additional
   information can be added easily
FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:"
English"}, {Name:"Tagalog"}]
  ● An ensemble of documents is called a collection
  ● Uses structured standards: XML, JSON
  ● Implementations
      ○ CouchDB (Erlang)
      ○ MongoDB (C++)
      ○ RavenDB (.NET)
Graph Databases

● Uses graph theory structure to represent information
● Typically used for relations
   ○ Example: followers/following in Twitter
Databases at Toluna

 ● MySQL
    ○ Traditional Master-Slave configuration
    ○ Very efficient for small requests
    ○ Not good for analytics
    ○ "Big Data" issues (i.e. usersvotes)

 ● Microsoft SQL Server
    ○ Good all-around performance
    ○ Monolithic
        ■ Suffers from locking issues
        ■ Hard to scale (many connections)
    ○ Potentially complex SQL programming to get the better of it
Solutions ?

Let's evaluate some products...
Apache HBase

  ● A column database based on the Hadoop architecture
  ● Commercially supported (Cloudera)
  ● Available on Red Hat, Debian
  ● Designed for very big data storage (Terabytes)
  ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter

Pros
  ● Pure Java implementation
  ● Access to Hadoop MapReduce data via column storage
  ● True clustered architecture

Cons
 ● Java
 ● Hard to deploy and maintain
 ● Limited options via the API (get, put, scans)
Apache HBase: architecture

● Data is stored in cells
   ○ Primary row key
   ○ Column family
      ■ Limited
      ■ May have indefinite qualifiers
   ○ Timestamp (version)
● Example cell structure
RowId     Column Family:Qualifier   Timestamp    Value

1000      user:name                 1312868789   guillaume

1000      user:email                1312868789   g@dragonscale.eu
HBase Data operations
  ● Creating a table and put some data
create table 'usersvotes', 'userid', 'date', 'meta'
hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid,
HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/
  ● Retrieve data
hbase(main):002:0> get 'usersvotes', 1071726
COLUMN                 CELL
 date:           timestamp=1312780185940, value=1296523245
 meta:answer         timestamp=1312780185940, value=2
 meta:country        timestamp=1312780185940, value=ES
 userid:          timestamp=1312780185940, value=685352
4 row(s) in 0.0720 seconds
  ● The last versioned row (higher timestamp) for the specified primary
    key is retrieved
HBase Data operations: API

● If not using JAVA, Hbase must be queried using a
  webservice (XML, JSON or protobuf)

● Type of operations
   ○ Get (read single value)
   ○ Put (write single value)
   ○ Get multi (read multiple versions)
   ○ Scan (retrieve multiple rows via a scan)

● MapReduce jobs can be run vs. the database using JAVA
  code
What is MapReduce ?
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes
those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
The worker node processes that smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them
in some way to get the output – the answer to the problem it was originally trying to solve.
MongoDB
 ● A document-oriented database
 ● Written in C++
 ● Uses javascript as functional specification
 ● Commercial support (10gen)
 ● Large user base (NY Times, Disney, MTV, foursquare)

Pros
  ● Easy installation and deployment
  ● Sharding and replication
  ● Easy API (javascript), multiple languages
  ● Similarities to MySQL (indexes, queries)

Cons
 ● Versions < 1.8.0 had many issues (development not mature):
   consistency, crashes...
MongoDB data structure

  ● No predefinition of fields; all operations are implicit

# Create document
> d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3,
"country" : "GB" };
# Save it into a new collection
> db.usersvotes.save(d);
# Retrieve documents from the collection
> db.usersvotes.find();
{ "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" :
NumberLong(1293874493), "answer" : 3, "country" : "GB" }
MongoDB Indexes and Queries
Let's create an index...
db.usersvotes.ensureIndex({pollid: 1, date :-1})

 ● Indexes can be created in the background
 ● Index keys are sortable
 ● Queries without indexes are slow (scans)
Let's get a usersvotes stream
db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
# equals to the following SQL : SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY DATE
DESC LIMIT 10,10;
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong
(1295077466), "answer" : 1, "country" : "GB" }
{ "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong
(1295077466), "answer" : 5, "country" : "GB" }
MongoDB MapReduce
   ● Uses javascript functions
   ● Single-threaded (javascript limitation)
   ● Parallelized (runs across all shards)

# Aggregate data over the country field
map = function() { emit ( this.country, { count: 1 } ) };
# Count items in each country
r=function(k,vals) {
  var result = { count: 0 };
vals.forEach(function(value) {
result.count += value.count;
}); return result; }
# Start the job
res=db.usersvotes.mapReduce(m,r, {out: { inline : 1}});
VoltDB A SQL alternative
 ● "NewSQL"
 ● In-memory database

Pros
 ● Lock-free
 ● ACID properties
 ● Linear scalability
 ● Fault tolerant
 ● Java implementation

Cons
 ● Cannot query the database directly; java stored procs only
 ● Database shutdown needed to modify schema
 ● Database shutdown needed to add cluster nodes
 ● No more memory = no more storage
What about MySQL ?

   Potential solutions
MySQL some interesting facts

 ● In single node tests, MySQL was always faster than NoSQL
   solutions
 ● Load data was faster
     ○ Sample usersvotes data (1G tsv file)
         ■ MySQL: 20 seconds
         ■ MongoDB : >10 minutes
         ■ HBase: >30 minutes
 ● Proven technology
MySQL Analytics

 ● MySQL might be outperformed by other solutions in
   analytics depending on the data size
 ● There are several column database solutions existing for
   MySQL (Infobright, ICE, Tokutek)
 ● Word count operations can be offloaded to a full-text search
   engine (Sphinx, SolR, Lucene)
MySQL Big Data
  ● The Vote Stream case
  ● Simple query
 explain select * from toluna_polls.usersvotes where pollid=843206 order by votedate desc limit
10,20
1, 'SIMPLE', 'usersvotes', 'ref', 'Index_POLLID', 'Index_POLLID', '8', 'const', 2556, 'Using where;
Using filesort'


  ● Can be easily solved by covering index
ALTER TABLE usersvotes ADD KEY (pollid, votedate)


  ● But !
     ○ usersvotes = 160Gb datafiles
     ○ Adding index: offline operation, would take hours
     ○ Online schema change could be used, but might run out of
        space and/or take days
Conclusions

● HBase: good choice for analytics, but not very adapted to
  traditional database operations
    ○ Most companies use HBase/Hadoop to offload analytical
      data from their main database
    ○ Java experience needed (which Toluna has imho)
    ○ IT must be trained
● MongoDB
    ○ Very good choice for starting new web applications "from
      the ground up"
● VoltDB
    ○ Great technology but lack of flexibility
● Traditional databases
    ○ Will probably be around for long time
Questions ?

Contenu connexe

Tendances

NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and consFabio Fumarola
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In DepthFabio Fumarola
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache CassandraSaeid Zebardast
 
9b. Document-Oriented Databases lab
9b. Document-Oriented Databases lab9b. Document-Oriented Databases lab
9b. Document-Oriented Databases labFabio Fumarola
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databasesAshwani Kumar
 
8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databasesFabio Fumarola
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageBethmi Gunasekara
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sqlRam kumar
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...DataStax
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.Navdeep Charan
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرdatastack
 

Tendances (20)

NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
9b. Document-Oriented Databases lab
9b. Document-Oriented Databases lab9b. Document-Oriented Databases lab
9b. Document-Oriented Databases lab
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
NoSQL Seminer
NoSQL SeminerNoSQL Seminer
NoSQL Seminer
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databases
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
No SQL- The Future Of Data Storage
No SQL- The Future Of Data StorageNo SQL- The Future Of Data Storage
No SQL- The Future Of Data Storage
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Non relational databases-no sql
Non relational databases-no sqlNon relational databases-no sql
Non relational databases-no sql
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
An Effective Approach to Migrate Cassandra Thrift to CQL (Yabin Meng, Pythian...
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 

Similaire à NoSQL Solutions - a comparative study

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSqlOmid Vahdaty
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsShawn Zhu
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataJihoon Son
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...Marcin Bielak
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
No sql bigdata and postgresql
No sql bigdata and postgresqlNo sql bigdata and postgresql
No sql bigdata and postgresqlZaid Shabbir
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAsLuis Marques
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive Omid Vahdaty
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Lviv Startup Club
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of dataPiyush Katariya
 

Similaire à NoSQL Solutions - a comparative study (20)

Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Introduction to NoSql
Introduction to NoSqlIntroduction to NoSql
Introduction to NoSql
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on Hadoop
 
No sql bigdata and postgresql
No sql bigdata and postgresqlNo sql bigdata and postgresql
No sql bigdata and postgresql
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Apache Hive for modern DBAs
Apache Hive for modern DBAsApache Hive for modern DBAs
Apache Hive for modern DBAs
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Really Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DWReally Big Elephants: PostgreSQL DW
Really Big Elephants: PostgreSQL DW
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
NoSQL
NoSQLNoSQL
NoSQL
 
Handling the growth of data
Handling the growth of dataHandling the growth of data
Handling the growth of data
 
Beyond Relational Databases
Beyond Relational DatabasesBeyond Relational Databases
Beyond Relational Databases
 

Dernier

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

NoSQL Solutions - a comparative study

  • 2. Brief history of databases ● 1980: Oracle DBMS released ● 1995: first release of MySQL, a lightweight, asynchronously replicated database ● 2000: MySQL 3.23 is adopted by most startups as a database plateform ● 2004: Google develops BigTable ● 2009: The term NoSQL is coined out
  • 3. Evolution of database storage ● Monolithic: using traditional databases (Oracle, Sybase...) ○ High cost in infrastructure (big CPU, big SAN) ○ High cost in software licenses (commercial software) ○ Qualified personnel required (certified DBAs) ● LAMP platform (circa 2000) ○ Free software ○ Runs on commodity hardware ○ Low administration (don't need a DBA until your data grows important) ● NoSQL ○ Scales indefinitely (replication only scales vertically) ○ Logic is in the application ○ Doesn't require SQL or internals knowledge
  • 4. Scaling ● Vertical scaling (MySQL) ○ Scaling by adding more replicas ○ Load is evenly distributed across the replicas ○ Problems ! ■ Cached data is also distributed evenly: inefficient resource usage ■ Efficiency goes down as dataset exceeds available memory, cannot get more performance by adding replicas ■ Write bottleneck ● Horizontal scaling (NoSQL) ○ Data is distributed evenly across the nodes (hashing) ○ More capacity ? Just add one node ○ Loss of traditional database properties (ACID)
  • 5. NoSQL definition ● Not only SQL (is not exposed via query language) ● Non-relational (denormalized data) ● Distributed (horizontal partitioning) ● Different implementations : ○ Key-value store ○ Document database ○ Graph database
  • 6. Key-value stores ● Schema-less storage ● Basic associative arrays { "username"=> "guillaume" } ● Key-value stores can have column families and subkeys { "user:name"=> "guillaume", "user:uid" => 1000 } ● Implementations ○ K/V caches: Redis, memcache ■ in-memory databases ○ Column databases: Cassandra, HBase ■ Data is stored in a tabular fashion (as opposed to rows in traditional RDBMS)
  • 7. Document Databases ● Data is organized into documents: FirstName="Frank", City="Haifa", Hobby="Photographing" ● No strong typing or predefined fields; additional information can be added easily FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:" English"}, {Name:"Tagalog"}] ● An ensemble of documents is called a collection ● Uses structured standards: XML, JSON ● Implementations ○ CouchDB (Erlang) ○ MongoDB (C++) ○ RavenDB (.NET)
  • 8. Graph Databases ● Uses graph theory structure to represent information ● Typically used for relations ○ Example: followers/following in Twitter
  • 9. Databases at Toluna ● MySQL ○ Traditional Master-Slave configuration ○ Very efficient for small requests ○ Not good for analytics ○ "Big Data" issues (i.e. usersvotes) ● Microsoft SQL Server ○ Good all-around performance ○ Monolithic ■ Suffers from locking issues ■ Hard to scale (many connections) ○ Potentially complex SQL programming to get the better of it
  • 10. Solutions ? Let's evaluate some products...
  • 11. Apache HBase ● A column database based on the Hadoop architecture ● Commercially supported (Cloudera) ● Available on Red Hat, Debian ● Designed for very big data storage (Terabytes) ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter Pros ● Pure Java implementation ● Access to Hadoop MapReduce data via column storage ● True clustered architecture Cons ● Java ● Hard to deploy and maintain ● Limited options via the API (get, put, scans)
  • 12. Apache HBase: architecture ● Data is stored in cells ○ Primary row key ○ Column family ■ Limited ■ May have indefinite qualifiers ○ Timestamp (version) ● Example cell structure RowId Column Family:Qualifier Timestamp Value 1000 user:name 1312868789 guillaume 1000 user:email 1312868789 g@dragonscale.eu
  • 13. HBase Data operations ● Creating a table and put some data create table 'usersvotes', 'userid', 'date', 'meta' hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid, HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/ ● Retrieve data hbase(main):002:0> get 'usersvotes', 1071726 COLUMN CELL date: timestamp=1312780185940, value=1296523245 meta:answer timestamp=1312780185940, value=2 meta:country timestamp=1312780185940, value=ES userid: timestamp=1312780185940, value=685352 4 row(s) in 0.0720 seconds ● The last versioned row (higher timestamp) for the specified primary key is retrieved
  • 14. HBase Data operations: API ● If not using JAVA, Hbase must be queried using a webservice (XML, JSON or protobuf) ● Type of operations ○ Get (read single value) ○ Put (write single value) ○ Get multi (read multiple versions) ○ Scan (retrieve multiple rows via a scan) ● MapReduce jobs can be run vs. the database using JAVA code
  • 15. What is MapReduce ? "Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node. "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
  • 16. MongoDB ● A document-oriented database ● Written in C++ ● Uses javascript as functional specification ● Commercial support (10gen) ● Large user base (NY Times, Disney, MTV, foursquare) Pros ● Easy installation and deployment ● Sharding and replication ● Easy API (javascript), multiple languages ● Similarities to MySQL (indexes, queries) Cons ● Versions < 1.8.0 had many issues (development not mature): consistency, crashes...
  • 17. MongoDB data structure ● No predefinition of fields; all operations are implicit # Create document > d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }; # Save it into a new collection > db.usersvotes.save(d); # Retrieve documents from the collection > db.usersvotes.find(); { "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }
  • 18. MongoDB Indexes and Queries Let's create an index... db.usersvotes.ensureIndex({pollid: 1, date :-1}) ● Indexes can be created in the background ● Index keys are sortable ● Queries without indexes are slow (scans) Let's get a usersvotes stream db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10); # equals to the following SQL : SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY DATE DESC LIMIT 10,10; { "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong (1295077466), "answer" : 1, "country" : "GB" } { "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong (1295077466), "answer" : 5, "country" : "GB" }
  • 19. MongoDB MapReduce ● Uses javascript functions ● Single-threaded (javascript limitation) ● Parallelized (runs across all shards) # Aggregate data over the country field map = function() { emit ( this.country, { count: 1 } ) }; # Count items in each country r=function(k,vals) { var result = { count: 0 }; vals.forEach(function(value) { result.count += value.count; }); return result; } # Start the job res=db.usersvotes.mapReduce(m,r, {out: { inline : 1}});
  • 20. VoltDB A SQL alternative ● "NewSQL" ● In-memory database Pros ● Lock-free ● ACID properties ● Linear scalability ● Fault tolerant ● Java implementation Cons ● Cannot query the database directly; java stored procs only ● Database shutdown needed to modify schema ● Database shutdown needed to add cluster nodes ● No more memory = no more storage
  • 21. What about MySQL ? Potential solutions
  • 22. MySQL some interesting facts ● In single node tests, MySQL was always faster than NoSQL solutions ● Load data was faster ○ Sample usersvotes data (1G tsv file) ■ MySQL: 20 seconds ■ MongoDB : >10 minutes ■ HBase: >30 minutes ● Proven technology
  • 23. MySQL Analytics ● MySQL might be outperformed by other solutions in analytics depending on the data size ● There are several column database solutions existing for MySQL (Infobright, ICE, Tokutek) ● Word count operations can be offloaded to a full-text search engine (Sphinx, SolR, Lucene)
  • 24. MySQL Big Data ● The Vote Stream case ● Simple query explain select * from toluna_polls.usersvotes where pollid=843206 order by votedate desc limit 10,20 1, 'SIMPLE', 'usersvotes', 'ref', 'Index_POLLID', 'Index_POLLID', '8', 'const', 2556, 'Using where; Using filesort' ● Can be easily solved by covering index ALTER TABLE usersvotes ADD KEY (pollid, votedate) ● But ! ○ usersvotes = 160Gb datafiles ○ Adding index: offline operation, would take hours ○ Online schema change could be used, but might run out of space and/or take days
  • 25. Conclusions ● HBase: good choice for analytics, but not very adapted to traditional database operations ○ Most companies use HBase/Hadoop to offload analytical data from their main database ○ Java experience needed (which Toluna has imho) ○ IT must be trained ● MongoDB ○ Very good choice for starting new web applications "from the ground up" ● VoltDB ○ Great technology but lack of flexibility ● Traditional databases ○ Will probably be around for long time