NOSQL, NO?
 Introductory presentation
RELATIONAL

 SQL                            ACID

 Relational algebra             Optimal for ad-hoc queries

 Tables, Columns, Rows          Sharding can be difficult

 Metadata separate from data

 Normalized data

 Optimized storage
POPULAR RDBMS

 MySQL                  Informix

 SQL Server             Progress

 Oracle                 Pervasive

 Postgres               Sybase

 DB2                    Access

 Interbase, Firebird   …
SQL

 Unified language to create and query both data and metadata

 Similar to English

 Verbose(!)

 Can get complex for non-trivial queries

 Does not expose execution plan – you say what you want it to
return, not how
SQL EXAMPLES
 If you can say what you mean, you can query the existing data
 Results are near-instant when querying based on primary key
select * from valute where id=1 and sid=42

 Results are fast when querying based on non-unique index
select valuta from valute where ((id=1 and sid=42)) and (valute.firma_id=123 and
valute.firma__sid=1)

 Very readable for trivial queries
select r.customer,sum(rs.iznos) sveukupno from racuni r
join racuni_stavke rs on r.id=rs.racun_id
where r.id=5
group by r.customer
SQL EXAMPLES

 Not so readable for non-trivial queries
select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina, (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos)
rabat_iznos, (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 -
mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos, (mprac_stavke.kolicina* ifnull((select
sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and
mprac_skl.skl__sid=skl_stavke.skl__sid where mprac_skl.mprac_id=mprac.id and mprac_skl.mprac__sid=mprac.sid and
skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid ),0) ) iznos_nabavno, ifnull( (select
sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on ambalaze.id=artikli_ambalaze.ambalaza_id and
ambalaze.sid=artikli_ambalaze.ambalaza__sid where artikli_ambalaze.artikl_id=artikli.id and artikli_ambalaze.artikl__sid=artikli.sid and
ambalaze.kalkulacija="N" ),0) naknada, radnici_komercijalisti.ime racun_komercijalist_ime, (select naziv from skladista where skladista.tip_skladista="M"
and pj_id=mprac.pj_id limit 1) skladiste_naziv , pj.naziv pj_naziv, mprac.datum,
cast(concat("(",if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1),") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja",
if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda",
if(DayOfWeek(mprac.datum)=5,"5 Četvrtak", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15))
dan_u_tjednu, cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal,
cast(year(mprac.datum) as unsigned) godina, cast(if(tipovi_komitenata.tip="F",trim(concat(partneri.ime," ",partneri.prezime)),partneri.naziv) as char(200))
kupac_naziv, partneri_mjesta.postanski_broj kupac_mjesto, partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …
RDBMS SCALING

 Vertical scaling
     •   Better CPU, more CPUs
     •   More RAM
     •   More disks
     •   SAN

 Partitioning

 Sharding
PARTITIONING

 With many rows and heavy usage, partitioning is a must

 What to partition
     • Tables
     • Indexes
     • Views

 Typical cases
     • Monthly data
     • Alphabetical keys
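
The two typical cases above can be sketched as a mapping from a row's key to a partition name. This is only an illustration: a real RDBMS does this routing internally once the partitions are declared, and the naming scheme and the letter ranges below are assumptions, not something from the slides.

```python
from datetime import date


def monthly_partition(table: str, day: date) -> str:
    """Return a hypothetical partition name for rows dated `day` (one per month)."""
    return f"{table}_{day.year:04d}_{day.month:02d}"


def alphabetical_partition(table: str, key: str,
                           ranges=("a-f", "g-m", "n-s", "t-z")) -> str:
    """Return a hypothetical partition name based on the first letter of `key`."""
    first = key[:1].lower()
    for letter_range in ranges:
        low, high = letter_range.split("-")
        if low <= first <= high:
            return f"{table}_{low}_{high}"
    # Keys that do not start with a-z (digits, empty strings) go to a catch-all.
    return f"{table}_other"
```

With monthly partitions, a query restricted to one month only has to touch one partition, which is the main reason the monthly case is so common.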
RDBMS SHARDING

 Sharding means using several databases, each holding part of the
data (500 clients on one server, another 500 on another)

 Requires changing application code
     connect(calculate_server_from(sharding_key))

 Impossible to join data from different databases, so choose your
sharding key wisely

 Very difficult to repartition your databases based on a new key
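
A minimal sketch of what calculate_server_from might look like, assuming a simple hash-modulo scheme. The host names are placeholders and the hashing choice is an assumption; the slides do not say how the server is calculated. The helper moved_keys is hypothetical and only shows why repartitioning hurts.

```python
import zlib

# Hypothetical shard list; the host names are placeholders, not real servers.
SHARDS = ["db0.example.internal", "db1.example.internal"]


def calculate_server_from(sharding_key: str, shards=SHARDS) -> str:
    """Map a sharding key to one server using a stable hash."""
    # crc32 gives the same value in every process, unlike Python's hash() for str.
    index = zlib.crc32(sharding_key.encode("utf-8")) % len(shards)
    return shards[index]


def moved_keys(keys, old_shards, new_shards):
    """Return the keys that land on a different server after the shard list changes."""
    return [key for key in keys
            if calculate_server_from(key, old_shards)
            != calculate_server_from(key, new_shards)]
```

Calling moved_keys with a list of client keys, the two-server list, and the same list plus a third server shows that a large share of keys change servers under this scheme, which is one concrete reason repartitioning is painful.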
RDBMS METADATA

 Metadata: data describing other data

 RDBMS structures are explicitly defined, and each data type is
optimized for storage

 Lots of constraints

 Can get slow with a lot of data
NOSQL

 “Not SQL”, “Not only SQL”

 The core NoSQL databases were invented mostly because RDBMS made
life very hard for huge, heavy-traffic web databases

 NoSQL databases are the ones that differ significantly from
relational databases
NOSQL TYPES

 Wide Column Store / Column Families
 Document Store
 Key Value / Tuple Store
 Graph Databases
 Object Databases
 XML Databases
 Multivalue Databases
4 MAIN DATA MODELS

 Key-Value Stores

 BigTable Clones (aka "ColumnFamily")

 Document Databases

 Graph Databases
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
KEY/VALUE STORES

 Lineage: Amazon's Dynamo paper and Distributed HashTables.

 Data model: A global collection of key-value pairs.

 Example: Voldemort, Dynomite, Tokyo Cabinet
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
BIGTABLE CLONES

 Lineage: Google's BigTable paper.

 Data model: Column family, i.e. a tabular model where each row at
least in theory can have an individual configuration of columns.

 Example: HBase, Hypertable, Cassandra
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
DOCUMENT DATABASES

 Lineage: Inspired by Lotus Notes.

 Data model: Collections of documents, which contain key-value
collections (called "documents").

 Example: CouchDB, MongoDB, Riak
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
GRAPH DATABASES

 Lineage: Draws from Euler and graph theory.

 Data model: Nodes & relationships, both of which can hold key-value
pairs

 Example: AllegroGraph, InfoGrid, Neo4j
Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
POPULAR NOSQL

 Hadoop / HBase     MemcacheDB

 Cassandra          Voldemort

 Amazon SimpleDB    Hypertable

 MongoDB            Cloudata

 CouchDB            IBM Lotus/Domino

 Redis
NOSQL CHARACTERISTICS

 Almost infinite horizontal scaling
 Very fast
 Performance doesn’t deteriorate with growth (much)
 No fixed table schemas
 No join operations
 Ad-hoc queries difficult or impossible
 Structured storage
 Almost everything happens in RAM
REAL-WORLD USE
 Cassandra
      •   Facebook (original developer, used it till late 2010)
      •   Twitter
      •   Digg
      •   Reddit
      •   Rackspace
      •   Cisco
 BigTable
      •   Google (HBase is an open-source implementation modeled after it)
 MongoDB
      •   Foursquare
      •   Craigslist
      •   Bit.ly
      •   SourceForge
      •   GitHub
WHY NOSQL?

 Handles huge databases (I know, I said it before)

 Redundancy: data is pretty safe on commodity hardware

 Super flexible queries using map/reduce

 Rapid development (no fixed schema, yeah!)

 Very fast for common use cases
PERFORMANCE

 RDBMS uses buffer to ensure ACID properties

 Most NoSQL databases do not guarantee full ACID, which lets them be much faster

 We don’t need ACID everywhere!

 I used MySQL and switched to MongoDB for my analytics app
     • Data processing (every minute) is 4x faster with MongoDB, despite
       being a lot more detailed (due to much simpler development)
SCALING

 Simple web application with not much traffic
     • Application server, database server all on one machine
SCALING

 More traffic comes in
     • Application server
     • Database server
SCALING

 Even more traffic comes in
     • Load balancer
     • Application server x2
     • Database server
SCALING

 Even more traffic comes in
     • Load balancer x N
         • easy
     • Application server x N
         • easy
     • Database server x N
         • hard for SQL databases
SQL SLOWDOWN

 Not linear!
 http://www.slideshare.net/rightscale/scaling-sql-and-nosql-databases-in-the-cloud
NOSQL SCALING

 Need more storage?
     • Add more servers!

 Need higher performance?
     • Add more servers!

 Need better reliability?
     • Add more servers!
SCALING SUMMARY

 You can scale SQL databases (Oracle, MySQL, SQL Server…)
     • This will cost you dearly
     • If you don’t have a lot of money, you will reach limits quickly

 You can scale NoSQL databases
     •   Very easy horizontal scaling
     •   Lots of open-source solutions
     •   Scaling is one of the basic incentives for design, so it is well handled
     •   Scaling forces trade-offs, which is why you end up having to use
         map/reduce
RAM

 Why map/reduce? I just need some simple queries. Tomorrow I
will need some other queries….

 SQL databases are optimized for very efficient disk access, but
need RAM caching for significant scaling (MySQL+memcached)

 NoSQL databases are designed to keep the whole working set in RAM
WORKING SET

 In real-world use the working set is much smaller than the complete database
     • For analytics 99% of queries will be regarding last 30 days

 As you need RAM only for working set, you can use commodity
servers, VPS, and just add more as your app becomes more popular
WORKING SET WOES

 Foursquare has millions of users and a working set as large as the database
 They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large
Instance (with cheese) for millions of users
 When their RAM usage reached 65GB, they decided to shard
 Too late: the server had started swapping to disk
 Disk is much slower than RAM - 100x slowdown
 The server could not keep up due to swapping
 11-hour outage (ouch!)
MAP/REDUCE

 Google’s framework for processing highly distributable
problems across huge datasets using a large number of
computers

 Let’s define a large number of computers
     • Cluster if all of them have the same hardware
     • Grid otherwise (if !Cluster, for old-style programmers)
MAP/REDUCE

 Process split into two phases
     • Map
          • Take the input, partition it, and delegate the parts to other machines
          • Other machines can repeat the process, leading to a tree structure
          • Each machine returns results to the machine that gave it the task
     • Reduce
          • Collect results from the machines you gave the tasks to
          • Combine the results and return them to the requester
     • Slower than sequential data processing, but massively parallel
     • Can sort a petabyte of data in a few hours
     • Input, Map, Shuffle, Reduce, Output
MAP/REDUCE EXAMPLE

 You need to write two functions

 Count different words in a set of documents
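
The word-count example can be sketched in plain Python as follows. The two functions you would write yourself are map_words and reduce_counts; shuffle stands in for the grouping step that a real framework performs between the two phases. All names here are illustrative, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_words(document: str):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts):
    """Reduce phase: combine all values emitted for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run Input -> Map -> Shuffle -> Reduce -> Output over a list of strings."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())
```

Because map_words looks at one document at a time and reduce_counts looks at one word at a time, both phases can be spread over many machines without changing the functions.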
MONGODB

 Document store

 Basic support for dynamic (ad hoc) queries

 Query by example (nice!)
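
Query by example means the query is itself a partial document: a stored document matches when it contains every field of the example with the same value. The sketch below shows the idea over plain Python dictionaries. It is not the MongoDB driver API; the function names and the sample data are made up for illustration.

```python
def matches(document: dict, example: dict) -> bool:
    """True when every field of the example appears in the document with an equal value."""
    return all(key in document and document[key] == value
               for key, value in example.items())


def find(collection, example):
    """Return all documents in an in-memory list that match the example."""
    return [doc for doc in collection if matches(doc, example)]


# Hypothetical sample data.
users = [
    {"name": "Ana", "city": "Zagreb", "age": 34},
    {"name": "Ivan", "city": "Split"},
    {"name": "Maja", "city": "Zagreb", "age": 27},
]
zagreb_users = find(users, {"city": "Zagreb"})
```

Note that the documents do not all have the same fields, which is the "no fixed schema" property in practice.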
MONGODB

 Conditional Operators
     • <, <=, >, >=
     • $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type




  Regular expressions
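
A rough sketch of how such operators extend query by example, again over plain Python dictionaries rather than a real MongoDB connection. In MongoDB the comparison operators are spelled $lt, $lte, $gt and $gte; only a subset of the operators listed above is covered, and the handling of missing fields is a simplification assumed for this sketch.

```python
import re

_MISSING = object()  # marks a field the document does not have


def _test(op, value, arg):
    """Evaluate one operator against one field value."""
    if op == "$exists":
        return (value is not _MISSING) == bool(arg)
    if value is _MISSING:
        # Simplification: a missing field only satisfies the negative operators.
        return op in ("$ne", "$nin")
    if op == "$gt":
        return value > arg
    if op == "$gte":
        return value >= arg
    if op == "$lt":
        return value < arg
    if op == "$lte":
        return value <= arg
    if op == "$ne":
        return value != arg
    if op == "$in":
        return value in arg
    if op == "$nin":
        return value not in arg
    if op == "$mod":
        divisor, remainder = arg
        return value % divisor == remainder
    raise ValueError(f"unsupported operator: {op}")


def matches_query(document: dict, query: dict) -> bool:
    """True when the document satisfies every condition in the query."""
    for field, condition in query.items():
        value = document.get(field, _MISSING)
        if isinstance(condition, re.Pattern):
            # Regular expression condition on a string field.
            if not (isinstance(value, str) and condition.search(value)):
                return False
        elif isinstance(condition, dict):
            # Operator document such as {"$gt": 30, "$lt": 40}.
            if not all(_test(op, value, arg) for op, arg in condition.items()):
                return False
        elif value is _MISSING or value != condition:
            # Plain value: exact match, as in query by example.
            return False
    return True
```

For example, matches_query(doc, {"age": {"$gt": 30, "$lt": 40}}) keeps documents whose age lies strictly between 30 and 40.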
MONGODB
    Data is stored as BSON (binary JSON)
         •    Makes it very well suited for languages with native JSON support
    Map/Reduce written in JavaScript
         •    Slow! There is one single thread of execution in JavaScript
    Master/slave replication (auto failover with replica sets)
    Sharding built-in
    Uses memory mapped files for data storage
    Performance over features
    On 32-bit systems, limited to ~2.5GB
    An empty database takes up 192MB
    GridFS to store big data + metadata (not actually an FS)
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
CASSANDRA
 Written in: Java
 Protocol: Custom, binary (Thrift)
 Tunable trade-offs for distribution and replication (N, R, W)
 Querying by column, range of keys
 BigTable-like features: columns, column families
 Writes are much faster than reads (!)
         • Constant write time regardless of database size
 Map/reduce possible with Apache Hadoop
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
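
The (N, R, W) trade-off can be made concrete with a little arithmetic. This sketch assumes the usual Dynamo-style meaning of the letters, which the slide does not spell out: N is the number of replicas of each value, W is how many replicas must acknowledge a write, and R is how many replicas are consulted on a read. The function names are illustrative.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> int:
    """Replicas that may be down while both reads and writes can still succeed."""
    return n - max(r, w)
```

With N=3, choosing R=2 and W=2 gives overlapping quorums and survives one failed replica, while R=1 and W=1 is faster and survives two failures but a read may return stale data.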
HBASE
     Written in: Java
     Main point: Billions of rows X millions of columns
     Modeled after BigTable
     Map/reduce with Hadoop
     Query predicate push down via server side scan and get filters
     Optimizations for real time queries
     A high performance Thrift gateway
     HTTP supports XML, Protobuf, and binary
     Cascading, hive, and pig source and sink modules
     No single point of failure
     While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs. HBase is a column-oriented key/value store and
allows for low-latency reads and writes.
     Random access performance is like MySQL
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
REDIS
   Written in: C
   Main point: Blazing fast
   Disk-backed in-memory database
   Master-slave replication
   Simple values or hash tables by keys
   Has sets (also union/diff/inter)
   Has lists (also a queue; blocking pop)
   Has hashes (objects of multiple fields)
   Sorted sets (high score table, good for range queries)
   Has transactions (!)
   Values can be set to expire (as in a cache)
   Pub/Sub lets one implement messaging (!)

Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
COUCHDB
    Written in: Erlang
    Main point: DB consistency, ease of use
    Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)
    MVCC - write operations do not block reads
    Previous versions of documents are available
    Crash-only (reliable) design
    Needs compacting from time to time
    Views: embedded map/reduce
    Formatting views: lists & shows
    Server-side document validation possible
    Authentication possible
    Real-time updates via _changes (!)
    Attachment handling
    CouchApps (standalone JS apps)
Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
HADOOP

 Apache project

 A framework that allows for the distributed processing of large
data sets across clusters of computers

 Designed to scale up from single servers to thousands of machines

 Designed to detect and handle failures at the application layer,
instead of relying on hardware for it
HADOOP
   Created by Doug Cutting, who named it after his son's toy elephant
   Hadoop-related Apache projects
        •    Cassandra
        •    HBase
        •    Pig
   Hive was a Hadoop subproject, but is now a top-level Apache project
   Used by many large & famous organizations
        •    http://wiki.apache.org/hadoop/PoweredBy
   Scales to hundreds or thousands of computers, each with several processor cores
   Designed to efficiently distribute large amounts of work across a set of machines
   Hundreds of gigabytes of data constitute the low end of Hadoop-scale
   Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
HADOOP

 See http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

 Uses Java, but allows streaming so other languages can easily send
and accept data items to/from Hadoop
HADOOP

 Uses a distributed file system (HDFS)
     • Designed to hold very large amounts of data (terabytes or even
       petabytes)
     • Files are stored in a redundant fashion across multiple machines to
       ensure their durability to failure and high availability to very parallel
       applications
     • Data organized into directories and files
     • Files are divided into blocks (64MB by default in early Hadoop releases,
       128MB in later ones) and distributed across nodes
 Design of HDFS is based on the design of the Google File System
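
The block layout above comes down to simple arithmetic, sketched here. The 64MB block size is the default named on the slide. The replication factor of 3 is an assumption added for illustration (it is the usual HDFS default, but the slide only says "redundant"), and the function names are made up.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default block size named on the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of `file_size` bytes is split into."""
    # Integer ceiling division; the last block may be only partly filled.
    return -(-file_size // block_size)


def raw_storage(file_size: int, replication: int = 3) -> int:
    """Bytes stored across the cluster when every block is kept `replication` times."""
    # replication=3 is an assumption for this sketch, not a figure from the slides.
    return file_size * replication
```

A 1GB file therefore becomes 16 blocks that can be read by 16 tasks in parallel, which is what makes the layout useful for map/reduce jobs.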
HIVE

 A petabyte-scale data warehouse system for Hadoop

 Easy data summarization, ad-hoc queries

 Query the data using a SQL-like language called HiveQL

 Hive compiler generates map-reduce jobs for most queries
PIG

 Platform for analyzing large data sets

 High-level language for expressing data analysis programs

 Compiler produces sequences of Map-Reduce programs

 Textual language called Pig Latin
     • Ease of programming
     • System optimizes task execution automatically
     • Users can create their own functions
PIG LATIN

 Pig Latin – high level Map/Reduce programming

 Plays the role for Hadoop that SQL plays for an RDBMS

 Pig Latin can be extended using Java User Defined Functions

 “Word Count” script in Pig Latin
MY MONGODB
SUMMARY

 NoSQL is a great problem solver if you need it

 Choose your NoSQL platform carefully, as each is designed for
a specific purpose

 Get used to Map/Reduce

 It’s not a sin to use NoSQL alongside a (yes)SQL database

 I am really happy to work with MongoDB instead of MySQL

Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

NoSQL

  • 1. NOSQL, NO? Introductory presentation
  • 2. RELATIONAL  SQL  ACID  Relational algebra  Optimal for ad-hoc queries  Tables, Columns, Rows  Sharding can be difficult  Metadata separate from data  Normalized data  Optimized storage
  • 3. POPULAR RDBMS  MySQL  Informix  SQL Server  Progress  Oracle  Pervasive  Postgres  Sybase  DB2  Access  Interbase, Firebird …
  • 4. SQL  Unified language to create and query both data and metadata  Similar to English  Verbose(!)  Can get complex for non-trivial queries  Does not expose execution plan – you say what you want it to return, not how
  • 5. SQL EXAMPLES  If you can say what you mean, you can query the existing data  Results are near-instant when querying based on primary key select * from valute where id=1 and sid=42  Results are fast when querying based on non-unique index select valuta from valute where ((id=1 and sid=42)) and (valute.firma_id=123 and valute.firma__sid=1)  Very readable for trivial queries select r.customer,sum(rs.iznos) sveukupno from racuni r join racuni_stavke rs on r.id=rs.racun_id where r.id=5 order by rs.ordinal
  • 6. SQL EXAMPLES  Not so readable for non-trivial queries select "MP" tip_prometa, mprac.broj broj_racuna, mprac_stavke.kolicina kolicina, (mprac.tecaj*mprac_stavke.kolicina*mprac_stavke.rabat_iznos) rabat_iznos, (round(mprac_stavke.cijena - mprac_stavke.rabat_iznos - mprac_stavke.rabat2_iznos - mprac_stavke.rabat3_iznos - mprac_stavke.porez1 - mprac_stavke.porez2 - mprac_stavke.porez_potrosnja,6)*mprac_stavke.kolicina) iznos, (mprac_stavke.kolicina* ifnull((select sum(pn_cijena*kolicina)/sum(kolicina) from mprac_skl left join skl_stavke on mprac_skl.skl_id=skl_stavke.skl_id and mprac_skl.skl__sid=skl_stavke.skl__sid where mprac_skl.mprac_id=mprac.id and mprac_skl.mprac__sid=mprac.sid and skl_stavke.artikl_id=mprac_stavke.artikl_id and skl_stavke.artikl__sid=mprac_stavke.artikl__sid ),0) ) iznos_nabavno, ifnull( (select sum(mprac_stavke.kolicina*ambalaze.naknada_kom) from artikli_ambalaze left join ambalaze on ambalaze.id=artikli_ambalaze.ambalaza_id and ambalaze.sid=artikli_ambalaze.ambalaza__sid where artikli_ambalaze.artikl_id=artikli.id and artikli_ambalaze.artikl__sid=artikli.sid and ambalaze.kalkulacija="N" ),0) naknada, radnici_komercijalisti.ime racun_komercijalist_ime, (select naziv from skladista where skladista.tip_skladista="M" and pj_id=mprac.pj_id limit 1) skladiste_naziv , pj.naziv pj_naziv, mprac.datum, cast(concat("(",if(DayOfWeek(mprac.datum)=1,7,DayOfWeek(mprac.datum)-1),") ", if(DayOfWeek(mprac.datum)=1,"1 Nedjelja", if(DayOfWeek(mprac.datum)=2,"2 Ponedjeljak", if(DayOfWeek(mprac.datum)=3,"3 Utorak", if(DayOfWeek(mprac.datum)=4,"4 Srijeda", if(DayOfWeek(mprac.datum)=5,"5 Četvrtak", if(DayOfWeek(mprac.datum)=6,"6 Petak", if(DayOfWeek(mprac.datum)=7,"7 Subota","")))))))) as char(15)) dan_u_tjednu, cast(month(mprac.datum) as unsigned) mjesec, cast(week(mprac.datum) as unsigned) tjedan, cast(quarter(mprac.datum) as unsigned) kvartal, cast(year(mprac.datum) as unsigned) godina, cast(if(tipovi_komitenata.tip="F",trim(concat(partneri.ime," ",partneri.prezime)),partneri.naziv) as char(200)) kupac_naziv, partneri_mjesta.postanski_broj kupac_mjesto, partneri_mjesta.mjesto kupac_mjesto_naziv, partneri_grupe_mjesta.naziv …
  • 7. RDBMS SCALING  Vertical scaling • Better CPU, more CPUs • More RAM • More disks • SAN  Partitioning  Sharding
  • 8. PARTITIONING  With many rows and heavy usage, partitioning is a must  What to partition • Tables • Indexes • Views  Typical cases • Monthly data • Alphabetical keys
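The two typical cases on the partitioning slide (monthly data, alphabetical keys) can be sketched as a function that decides which partition a row belongs to. This is only an illustration: the `table_YYYY_MM` and `table_X` naming schemes are assumptions, and a real RDBMS normally does this routing itself once the partitions are declared.

```python
from datetime import date


def monthly_partition(table: str, day: date) -> str:
    """Route a row to a per-month partition, e.g. racuni_2011_03 (hypothetical naming)."""
    return f"{table}_{day.year:04d}_{day.month:02d}"


def alphabetical_partition(table: str, key: str) -> str:
    """Route a row by the first letter of its key; anything outside A-Z goes to 'other'."""
    first = key[:1].upper()
    bucket = first if "A" <= first <= "Z" else "other"
    return f"{table}_{bucket}"


print(monthly_partition("racuni", date(2011, 3, 15)))
print(alphabetical_partition("partneri", "marko"))
```

Queries that filter on the partition key (a date range, a name prefix) then only need to touch the matching partitions.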
  • 9. RDBMS SHARDING  Sharding means using several databases where each represents part of data (500 clients on one server, another 500 on another)  Requires changing application code connect(calculate_server_from(sharding_key))  Impossible to join data from different databases, so choose your sharding key wisely  Very difficult to repartition your databases based on a new key
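A minimal sketch of what `calculate_server_from(sharding_key)` from the sharding slide might look like, assuming simple hash-modulo routing. The host names are made up for illustration, and the deck does not say which routing scheme the author had in mind.

```python
import hashlib

# Hypothetical shard list; real host names would come from configuration.
SHARDS = ["db0.example.internal", "db1.example.internal"]


def calculate_server_from(sharding_key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so the same key always lands on the same server."""
    digest = hashlib.sha256(sharding_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def moved_keys(keys, old_shards, new_shards) -> int:
    """Count keys whose server changes when the shard list changes."""
    return sum(
        calculate_server_from(k, old_shards) != calculate_server_from(k, new_shards)
        for k in keys
    )


keys = [f"client-{i}" for i in range(1000)]
print(calculate_server_from("client-42"))
print(moved_keys(keys, SHARDS, SHARDS + ["db2.example.internal"]))
```

The application would then call its own `connect(calculate_server_from(key))`. The `moved_keys` helper illustrates the slide's last point: with modulo routing, adding one server reassigns a large share of the keys, which is why repartitioning is painful.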
  • 10. RDBMS METADATA  Metadata: data describing other data  RDBMS structures are explicitly defined, and each data type is optimized for storage  Lots of constraints  Can get slow with a lot of data
  • 11. NOSQL  “Not SQL”, “Not only SQL”  Core NoSQL databases were invented mostly because RDBMS made life very hard for huge, heavy-traffic web databases  NoSQL databases are the ones significantly different from relational databases
  • 12. NOSQL TYPES  Wide Column Store / Column Families  Document Store  Key Value / Tuple Store  Graph Databases  Object Databases  XML Databases  Multivalue Databases
  • 13. 4 MAIN DATA MODELS  Key-Value Stores  BigTable Clones (aka "ColumnFamily")  Document Databases  Graph Databases Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 14. KEY/VALUE STORES  Lineage: Amazon's Dynamo paper and Distributed HashTables.  Data model: A global collection of key-value pairs.  Example: Voldemort, Dynomite, Tokyo Cabinet Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 15. BIGTABLE CLONES  Lineage: Google's BigTable paper.  Data model: Column family, i.e. a tabular model where each row at least in theory can have an individual configuration of columns.  Example: HBase, Hypertable, Cassandra Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 16. DOCUMENT DATABASES  Lineage: Inspired by Lotus Notes.  Data model: Collections of documents, which contain key-value collections (called "documents").  Example: CouchDB, MongoDB, Riak Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 17. GRAPH DATABASES  Lineage: Draws from Euler and graph theory.  Data model: Nodes & relationships, both which can hold key-value pairs  Example: AllegroGraph, InfoGrid, Neo4j Source: http://blogs.neotechnology.com/emil/2009/11/nosql-scaling-to-size-and-scaling-to-complexity.html
  • 18. POPULAR NOSQL  Hadoop / Hbase  MemcacheDB  Cassandra  Voldemort  Amazon SimpleDB  Hypertable  MongoDB  Cloudata  CouchDB  IBM Lotus/Domino  Redis
  • 19. NOSQL CHARACTERISTICS  Almost infinite horizontal scaling  Very fast  Performance doesn’t deteriorate with growth (much)  No fixed table schemas  No join operations  Ad-hoc queries difficult or impossible  Structured storage  Almost everything happens in RAM
  • 20. REAL-WORLD USE  Cassandra • Facebook (original developer, used it till late 2010) • Twitter • Digg • Reddit • Rackspace • Cisco  BigTable • Google (open-source version is HBase)  MongoDB • Foursquare • Craigslist • Bit.ly • SourceForge • GitHub
  • 21. WHY NOSQL?  Handles huge databases (I know, I said it before)  Redundancy, data is pretty safe on commodity hardware  Super flexible queries using map/reduce  Rapid development (no fixed schema, yeah!)  Very fast for common use cases
  • 22. PERFORMANCE  RDBMS uses buffer to ensure ACID properties  NoSQL does not guarantee ACID and is therefore much faster  We don’t need ACID everywhere!  I used MySQL and switched to MongoDB for my analytics app • Data processing (every minute) is 4x faster with MongoDB, despite being a lot more detailed (due to much simpler development)
  • 23. SCALING  Simple web application with not much traffic • Application server, database server all on one machine
  • 24. SCALING  More traffic comes in • Application server • Database server
  • 25. SCALING  Even more traffic comes in • Load balancer • Application server x2 • Database server
  • 26. SCALING  Even more traffic comes in • Load balancer x N • easy • Application server x N • easy • Database server xN • hard for SQL databases
  • 27. SQL SLOWDOWN  Not linear!  http://www.slideshare.net/rightscale/scaling-sql-and-nosql-databases-in-the-cloud
  • 28. NOSQL SCALING  Need more storage? • Add more servers!  Need higher performance? • Add more servers!  Need better reliability? • Add more servers!
  • 29. SCALING SUMMARY  You can scale SQL databases (Oracle, MySQL, SQL Server…) • This will cost you dearly • If you don’t have a lot of money, you will reach limits quickly  You can scale NoSQL databases • Very easy horizontal scaling • Lots of open-source solutions • Scaling is one of the basic incentives for design, so it is well handled • Scaling is the cause of trade-offs causing you to have to use map/reduce
  • 30. RAM  Why map/reduce? I just need some simple queries. Tomorrow I will need some other queries….  SQL databases are optimized for very efficient disk access, but for significant scaling need RAM caching (MySQL+memcached)  NoSQL databases are designed to keep whole working set in RAM
  • 31. WORKING SET  In real-world use working set is much less than complete database • For analytics 99% of queries will be regarding last 30 days  As you need RAM only for working set, you can use commodity servers, VPS, and just add more as your app becomes more popular
  • 32. WORKING SET WOES  Foursquare has millions of users and a working set the same size as the database  They used a single 66GB Amazon EC2 High-Memory Quadruple Extra Large Instance (with cheese) for millions of users  When their RAM usage was 65GB, they decided to shard  Too late, they started to have disk swaps  Disk is much slower than RAM - 100x slowdown  Server could not keep up due to swapping  11-hour outage (ouch!)
  • 33. MAP/REDUCE  Google’s framework for processing highly distributable problems across huge datasets using a large number of computers  Let’s define large number of computers • Cluster if all of them have same hardware • Grid unless Cluster (if !Cluster for old-style programmers)
  • 34. MAP/REDUCE  Process split into two phases • Map • Take the input, partition it, and delegate the parts to other machines • Other machines can repeat the process, leading to a tree structure • Each machine returns results to the machine that gave it the task • Reduce • Collect results from the machines you gave the tasks to • Combine the results and return them to the requester • Slower than sequential data processing, but massively parallel • Sort a petabyte of data in a few hours • Input, Map, Shuffle, Reduce, Output
  • 35. MAP/REDUCE EXAMPLE  You need to write two functions  Count different words in a set of documents
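The word-count task on this slide can be sketched in plain Python, following the Input, Map, Shuffle, Reduce, Output pipeline from the previous slide. This is a single-process illustration, not the code from the original slide; the function names are assumptions, and a real framework would run the map and reduce functions on many machines.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: combine all values collected for one key."""
    return word, sum(counts)


def word_count(documents):
    """Run the whole pipeline in one process."""
    pairs = [pair for doc in documents for pair in map_words(doc)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["NoSQL is fast", "SQL is readable", "nosql scales"]))
```

The two functions the slide says you need to write are `map_words` and `reduce_counts`; the shuffle step is what the framework does for you between them.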
  • 36.
  • 37. MONGODB  Document store  Basic support for dynamic (ad hoc) queries  Query by example (nice!)
  • 38. MONGODB  Conditional Operators • <, <=, >, >= • $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type  Regular expressions
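To show what "query by example" with conditional operators means, here is a toy in-memory matcher. It is not MongoDB's implementation and supports only a handful of the operators listed above; the helper names (`matches`, `find`) and the sample documents are assumptions made for illustration. Nested documents and arrays are not handled.

```python
_MISSING = object()  # marks a field that is absent from the document

OPERATORS = {
    "$gt": lambda value, arg: value is not _MISSING and value > arg,
    "$gte": lambda value, arg: value is not _MISSING and value >= arg,
    "$lt": lambda value, arg: value is not _MISSING and value < arg,
    "$lte": lambda value, arg: value is not _MISSING and value <= arg,
    "$ne": lambda value, arg: value != arg,
    "$in": lambda value, arg: value in arg,
    "$nin": lambda value, arg: value not in arg,
    "$exists": lambda value, arg: (value is not _MISSING) == arg,
}


def _check(value, condition):
    """A dict condition holds operators; anything else is an exact-match example."""
    if isinstance(condition, dict):
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    return value == condition


def matches(document, example):
    """True if every field in the example is satisfied by the document."""
    return all(
        _check(document.get(field, _MISSING), condition)
        for field, condition in example.items()
    )


def find(collection, example):
    return [doc for doc in collection if matches(doc, example)]


users = [
    {"name": "Ana", "age": 30, "city": "Zagreb"},
    {"name": "Ivan", "age": 17},
]
print(find(users, {"age": {"$gte": 18}, "city": {"$exists": True}}))
```

The example document has the same shape as the stored documents, which is why the slide calls the style "nice": there is no separate query language to learn for simple lookups.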
  • 39. MONGODB  Data is stored as BSON (binary JSON) • Makes it very well suited for languages with native JSON support  Map/Reduce written in JavaScript • Slow! There is one single thread of execution in JavaScript  Master/slave replication (auto failover with replica sets)  Sharding built-in  Uses memory mapped files for data storage  Performance over features  On 32-bit systems, limited to ~2.5 GB  An empty database takes up 192 MB  GridFS to store big data + metadata (not actually an FS) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  • 40. CASSANDRA  Written in: Java  Protocol: Custom, binary (Thrift)  Tunable trade-offs for distribution and replication (N, R, W)  Querying by column, range of keys  BigTable-like features: columns, column families  Writes are much faster than reads (!) • Constant write time regardless of database size  Map/reduce possible with Apache Hadoop Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
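The "tunable trade-offs (N, R, W)" bullet refers to the Dynamo-style rule that a read is guaranteed to see the latest acknowledged write when the read and write replica sets must overlap, that is when R + W > N. A small sketch of that arithmetic, with function names chosen here for illustration:

```python
def quorum(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True if any R replicas read must overlap any W replicas written (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


for r, w in [(1, 1), (2, 2), (1, 3)]:
    print(f"N=3 R={r} W={w} overlap={reads_see_latest_write(3, r, w)}")
```

Lower R and W give faster, more available operations at the cost of possibly stale reads; raising them toward a quorum trades speed for consistency.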
  • 41. HBASE  Written in: Java  Main point: Billions of rows X millions of columns  Modeled after BigTable  Map/reduce with Hadoop  Query predicate push down via server side scan and get filters  Optimizations for real time queries  A high performance Thrift gateway  HTTP supports XML, Protobuf, and binary  Cascading, hive, and pig source and sink modules  No single point of failure  While Hadoop streams data efficiently, it has overhead for starting map/reduce jobs. HBase is column oriented key/value store and allows for low latency read and writes.  Random access performance is like MySQL Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  • 42. REDIS  Written in: C/C++  Main point: Blazing fast  Disk-backed in-memory database,  Master-slave replication  Simple values or hash tables by keys,  Has sets (also union/diff/inter)  Has lists (also a queue; blocking pop)  Has hashes (objects of multiple fields)  Sorted sets (high score table, good for range queries)  Has transactions (!)  Values can be set to expire (as in a cache)  Pub/Sub lets one implement messaging (!) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  • 43. COUCHDB  Written in: Erlang  Main point: DB consistency, ease of use  Bi-directional (!) replication, continuous or ad-hoc, with conflict detection, thus, master-master replication. (!)  MVCC - write operations do not block reads  Previous versions of documents are available  Crash-only (reliable) design  Needs compacting from time to time  Views: embedded map/reduce  Formatting views: lists & shows  Server-side document validation possible  Authentication possible  Real-time updates via _changes (!)  Attachment handling  CouchApps (standalone JS apps) Source: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
  • 44. HADOOP  Apache project  A framework that allows for the distributed processing of large data sets across clusters of computers  Designed to scale up from single servers to thousands of machines  Designed to detect and handle failures at the application layer, instead of relying on hardware for it
  • 45. HADOOP  Created by Doug Cutting, who named it after his son's toy elephant  Hadoop-related Apache projects • Cassandra • HBase • Pig  Hive was a Hadoop subproject, but is now a top-level Apache project  Used by many large & famous organizations • http://wiki.apache.org/hadoop/PoweredBy  Scales to hundreds or thousands of computers, each with several processor cores  Designed to efficiently distribute large amounts of work across a set of machines  Hundreds of gigabytes of data constitute the low end of Hadoop-scale  Built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes
  • 46. HADOOP  See http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig  Uses Java, but allows streaming so other languages can easily send and accept data items to/from Hadoop
  • 47. HADOOP  Uses distributed file system (HDFS) • Designed to hold very large amounts of data (terabytes or even petabytes) • Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications • Data organized into directories and files • Files are divided into blocks (64 MB by default) and distributed across nodes  Design of HDFS is based on the design of the Google File System
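A quick worked example of the block arithmetic on this slide, using the 64 MB default block size it quotes. The replication factor of 3 is an assumption used here for illustration (the slide only says files are stored redundantly), and the function names are made up.

```python
BLOCK_SIZE = 64 * 1024 ** 2  # 64 MB, the default quoted on the slide


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file is split into; the last block may be partly filled."""
    return -(-file_size_bytes // block_size)  # ceiling division on integers


def stored_replicas(file_size_bytes: int, replication: int = 3) -> int:
    """Total block copies kept in the cluster (replication=3 is an assumption)."""
    return block_count(file_size_bytes) * replication


one_gib = 1024 ** 3
print(block_count(one_gib), stored_replicas(one_gib))
```

Because every block can live on a different node, a single large file can be read and processed by many machines in parallel.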
  • 48. HIVE  A petabyte-scale data warehouse system for Hadoop  Easy data summarization, ad-hoc queries  Query the data using a SQL-like language called HiveQL  Hive compiler generates map-reduce jobs for most queries
  • 49. PIG  Platform for analyzing large data sets  High-level language for expressing data analysis programs  Compiler produces sequences of Map-Reduce programs  Textual language called Pig Latin • Ease of programming • System optimizes task execution automatically • Users can create their own functions
  • 50. PIG LATIN  Pig Latin – high level Map/Reduce programming  Equivalent to SQL for RDBMS systems.  Pig Latin can be extended using Java User Defined Functions  “Word Count” script in Pig Latin
  • 53. SUMMARY  NoSQL is a great problem solver if you need it  Choose your NoSQL platform carefully as each is designed for specific purpose  Get used to Map/Reduce  It’s not a sin to use NoSQL alongside (yes)SQL database  I am really happy to work with MongoDB  instead of MySQL