1. Concepts of MongoDB
Juan Antonio Roy Couto
Twitter: @juanroycouto
Website: www.juanroy.es
September 2014
2. Contents
Why?
Who?
DB Ranking
Community
Drivers
Characteristics
Shell
Utilities
Terms
Schema design
Replication
Replica Set
Failover
Indexes
Sharding
Pre-splitting
Questions?
3. Why?
MongoDB
● Horizontal scalability
● Real time analytics
● Better strategic decisions
● Unstructured data
● Reduce costs and time to market
● Faster development
Apps
● Internet of Things
● Wearables
● Smart cities
● Cloud computing
4. Who?
Who provides MongoDB in the cloud?
http://www.mongodb.com/partners/list
Who is using MongoDB?
http://www.mongodb.com/who-uses-mongodb
5. DB Ranking
http://db-engines.com/en/ranking
6. Community
8 million+ downloads
200k+ education registrations
30k+ MongoDB User Group members
7. Drivers
http://docs.mongodb.org/ecosystem/drivers/
App → Driver → MongoDB
● C
● C++
● C#
● Java
● Node.js
● Perl
● PHP
● Python
● Ruby
● Scala
8. Characteristics
http://www.mongodbspain.com/en/2014/08/17/mongodb-characteristics-future/
General purpose NoSQL database
Document oriented (stores data as documents in BSON – Binary JSON)
Schemaless (dynamic schema)
Open source
High availability (replica sets)
Horizontal scalability (commodity servers)
Aggregation framework
Map Reduce
Hadoop connector (for processing large volumes of data in batch)
Native replication
Auto sharding & load balancing
Security
Automatic failover
JSON objects
MMS (continuous monitoring in the cloud)
Geospatial queries
In-memory performance
ACID compliant at the document level
9. Advanced characteristics
GridFS (diagram: a file stored as Chunk 1, Chunk 2, Chunk 3)
TTL (special indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time)
Capped collections
Index intersection
...
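The TTL behaviour described above can be illustrated with a small simulation. This is only a sketch in plain JavaScript over an in-memory array, not MongoDB's implementation: `expireAfterSeconds` mirrors the name of the TTL index option, while the function name `removeExpired` and the `sessions` documents are hypothetical.

```javascript
// Simulation only: models the effect of a TTL index on an in-memory array.
// removeExpired and the sample documents are hypothetical names for this sketch.
function removeExpired(docs, dateField, expireAfterSeconds, now) {
  const cutoff = now.getTime() - expireAfterSeconds * 1000;
  // Keep documents that are newer than the cutoff, or that carry no date at all.
  return docs.filter((doc) => {
    const value = doc[dateField];
    return !(value instanceof Date) || value.getTime() > cutoff;
  });
}

const now = new Date("2014-09-01T12:00:00Z");
const sessions = [
  { _id: 1, createdAt: new Date("2014-09-01T11:59:00Z") }, // 1 minute old
  { _id: 2, createdAt: new Date("2014-09-01T10:00:00Z") }, // 2 hours old
  { _id: 3 }, // no date field, so it is never removed
];

// With a one-hour lifetime, only the two-hour-old document is removed.
const remaining = removeExpired(sessions, "createdAt", 3600, now);
console.log(remaining.map((d) => d._id));
```

In a real deployment the server performs this clean-up in the background; the application only declares the index.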
10. Shell
● Administrative tasks
● Full featured
● JavaScript interpreter
● Standalone MongoDB client
● Allows interaction with a MongoDB instance from the command line
11. Utilities
MongoDB tools for backup:
mongoexport: utility that generates a JSON or CSV file of data from a MongoDB instance
mongoimport: imports content from a JSON, CSV or TSV export
mongodump: utility for creating a binary export
mongorestore: writes data to a MongoDB instance from a binary file
MongoDB tools for tracking instances:
mongostat: provides a quick overview of the status of a running mongod or mongos instance
mongotop: provides a method to track the amount of time a MongoDB instance spends reading and writing data. mongotop provides statistics on a per-collection level. By default, mongotop returns values every second
12. Basic terms to know
MongoDB      SQL
database     database
collection   table
document     row
field        column
embedding    join
13. Geospatial indexes
MongoDB has two types of indexes for supporting geographical queries.
● 2d indexes: for calculations on a flat surface
● 2dsphere indexes: for calculations on an Earth-like sphere
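The difference between the two index types comes down to the geometry used for the calculation. The sketch below contrasts a flat-surface distance with a great-circle distance in plain JavaScript. It is an illustration of the two geometries, not of MongoDB's index internals; the function names, the mean Earth radius and the city coordinates are assumptions made for the example.

```javascript
// Flat-surface distance, the kind of calculation a 2d index is meant for.
function planarDistance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Great-circle (haversine) distance in kilometres, the kind of calculation
// a 2dsphere index is meant for. The radius is an approximation.
const EARTH_RADIUS_KM = 6371;
function sphericalDistanceKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Approximate coordinates, used only as sample data.
const madrid = { lat: 40.4168, lon: -3.7038 };
const barcelona = { lat: 41.3874, lon: 2.1686 };

console.log(planarDistance({ x: 0, y: 0 }, { x: 3, y: 4 })); // 5
console.log(Math.round(sphericalDistanceKm(madrid, barcelona)), "km");
```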
14. SQL Schema Design
Tables
Customers: Customer key, First name, Last name, Phone number
Addresses: Address key, Customer key, Street, Number, Location, Postal Code
Pets: Pet key, Customer key, Type, Breed, Name, Age
15. MongoDB Schema Design
Customers collection
> db.customers.findOne()
{
"_id" : ObjectId("54131863041cd2e6181156ba"),
"first_name" : "Peter",
"last_name" : "Keil",
"phone_number" : 619123456,
"address" : {
"street" : "C/Alcalá",
"number" : 123,
"location" : "Madrid",
"postal_code" : 12345
},
"pets" : [
{
"type" : "Dog",
"breed" : "Airedale Terrier",
"name" : "Linda",
"age" : 2
},
{
"type" : "Dog",
"breed" : "Akita",
"name" : "Bruto",
"age" : 10
}
]
}
>
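Diagram: each document in the customers collection holds the customer info (First name, Last name, Phone number), the address (Street, Number, Location, Postal Code) and the pets (Type, Breed, Name, Age for each pet)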
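To make the step from the SQL design to the document design concrete, here is a minimal sketch that folds rows from the three tables into one embedded document. The row arrays and the helper names (`withoutKeys`, `toCustomerDocument`) are hypothetical; the field names follow the `findOne()` output above.

```javascript
// Hypothetical rows, shaped like the Customers, Addresses and Pets tables.
const customers = [
  { customer_key: 1, first_name: "Peter", last_name: "Keil", phone_number: 619123456 },
];
const addresses = [
  { address_key: 10, customer_key: 1, street: "C/Alcalá", number: 123, location: "Madrid", postal_code: 12345 },
];
const pets = [
  { pet_key: 100, customer_key: 1, type: "Dog", breed: "Airedale Terrier", name: "Linda", age: 2 },
  { pet_key: 101, customer_key: 1, type: "Dog", breed: "Akita", name: "Bruto", age: 10 },
];

// Copy a row without its relational keys; embedding makes them unnecessary.
function withoutKeys(row, keys) {
  const copy = { ...row };
  for (const key of keys) delete copy[key];
  return copy;
}

// Replace the two joins with one embedded sub-document and one embedded array.
function toCustomerDocument(customer) {
  const address = addresses.find((a) => a.customer_key === customer.customer_key);
  return {
    ...withoutKeys(customer, ["customer_key"]),
    address: address ? withoutKeys(address, ["address_key", "customer_key"]) : null,
    pets: pets
      .filter((p) => p.customer_key === customer.customer_key)
      .map((p) => withoutKeys(p, ["pet_key", "customer_key"])),
  };
}

const customerDoc = toCustomerDocument(customers[0]);
console.log(JSON.stringify(customerDoc, null, 2));
```

Reading a customer with its address and pets then takes a single document lookup instead of two joins.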
16. Replication
Replica Set: Primary, Secondary 1, Secondary 2
● High availability
● Data safety
● Read preference
● Asynchronous
● Single primary
● Statement based
● Master-slave
● Automatic failover
● Automatic node recovery
17. Failover scenario
Diagram: the replica set before the failover (Primary, Secondary 1, Secondary 2) and after it (Secondary 2, Primary, Secondary 1)
1) Primary goes down
2) New election (majority of the set)
3) Primary comes back (now as secondary)
4) The new primary assumes replication tasks
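The election step can be sketched as follows. This is a deliberately simplified model, not MongoDB's election protocol: it only checks that a majority of the set is reachable and then, as an assumption of this sketch, picks the reachable member with the most recent applied operation. The member objects and the names `hasMajority` and `electPrimary` are hypothetical.

```javascript
// True when the reachable members form a strict majority of the whole set.
function hasMajority(setSize, reachableCount) {
  return reachableCount > Math.floor(setSize / 2);
}

// members: [{ name, up, lastOp }] where lastOp is the last applied operation number.
// Returns the name of the new primary, or null when no election is possible.
function electPrimary(members) {
  const reachable = members.filter((m) => m.up);
  if (!hasMajority(members.length, reachable.length)) return null;
  // Assumption of this sketch: the most up-to-date reachable member wins.
  return reachable.reduce((best, m) => (m.lastOp > best.lastOp ? m : best)).name;
}

const replicaSet = [
  { name: "node0", up: false, lastOp: 42 }, // the old primary, now down
  { name: "node1", up: true, lastOp: 42 },
  { name: "node2", up: true, lastOp: 40 },
];
console.log(electPrimary(replicaSet)); // node1

// With two of three members down there is no majority, so no primary.
const isolated = replicaSet.map((m) => ({ ...m, up: m.name === "node2" }));
console.log(electPrimary(isolated)); // null
```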
18. Failover scenario with rollback
Diagram: the replica set before the failover (Primary, Secondary 1, Secondary 2) and after it (Secondary 2, Primary, Secondary 1); the rollback data goes to the hard disk and can be reapplied with mongorestore
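A rough way to picture the rollback: when the old primary rejoins, any writes it accepted but never replicated are set aside, and it then catches up with what the new primary accepted in the meantime. The sketch below models oplogs as plain arrays of operation ids, which is an assumption made for illustration; `findRollbackOps` is a hypothetical name, not a MongoDB function.

```javascript
// Oplogs are modelled as arrays of operation ids in the order they were applied.
function findRollbackOps(oldPrimaryOplog, newPrimaryOplog) {
  // The common prefix is the last point where both histories agree.
  let common = 0;
  while (
    common < oldPrimaryOplog.length &&
    common < newPrimaryOplog.length &&
    oldPrimaryOplog[common] === newPrimaryOplog[common]
  ) {
    common += 1;
  }
  return {
    rolledBack: oldPrimaryOplog.slice(common), // written only on the old primary
    toCatchUp: newPrimaryOplog.slice(common), // accepted while it was down
  };
}

// w3 and w4 never reached a secondary before the old primary went down.
const recovery = findRollbackOps(["w1", "w2", "w3", "w4"], ["w1", "w2", "w5"]);
console.log(recovery.rolledBack); // [ 'w3', 'w4' ]
console.log(recovery.toCatchUp); // [ 'w5' ]
```

The rolled-back writes are the ones the slide shows going to disk, from where an administrator can decide whether to reapply them.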
19. Replica Set principles
● A write is truly committed once it has been applied on the majority of the set
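The majority rule can be expressed as a small calculation: given the last operation each member has applied, the newest operation present on a majority of members marks what is committed. This is a sketch under the assumption that operations are numbered sequentially; `majorityCommitPoint` and `isCommitted` are hypothetical helper names.

```javascript
// appliedOps[i] is the number of the last operation applied by member i.
// Returns the newest operation number that a majority of members have applied.
function majorityCommitPoint(appliedOps) {
  const newestFirst = [...appliedOps].sort((a, b) => b - a);
  const majority = Math.floor(newestFirst.length / 2) + 1;
  return newestFirst[majority - 1];
}

// An operation is committed when it is at or before the majority commit point.
function isCommitted(opNumber, appliedOps) {
  return opNumber <= majorityCommitPoint(appliedOps);
}

// Primary is at operation 10, the secondaries lag at 8 and 7.
const applied = [10, 8, 7];
console.log(majorityCommitPoint(applied)); // 8
console.log(isCommitted(8, applied)); // true
console.log(isCommitted(9, applied)); // false
```

Operations 9 and 10 exist only on the primary in this example, so they would be lost in a rollback if the primary failed now.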
20. Replica Set: read preference
Reasons
Geographically dispersed nodes
Separate a workload
Availability
Types
Primary
Primary preferred
Secondary
Secondary preferred
Nearest
Tags
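The read preference types above can be sketched as a selection function. This is a simplified model rather than driver code: it always picks the lowest-latency candidate, it leaves out tags, and the member objects and `selectMember` are hypothetical. The mode strings follow the names of the read preference modes.

```javascript
// members: [{ name, role: "primary" | "secondary", up, pingMs }]
function selectMember(members, mode) {
  const up = members.filter((m) => m.up);
  const primary = up.find((m) => m.role === "primary") || null;
  const secondaries = up.filter((m) => m.role === "secondary");
  // Simplification: choose the single member with the lowest ping.
  const nearest = (list) =>
    list.length === 0 ? null : list.reduce((a, b) => (b.pingMs < a.pingMs ? b : a));

  switch (mode) {
    case "primary":
      return primary;
    case "primaryPreferred":
      return primary || nearest(secondaries);
    case "secondary":
      return nearest(secondaries);
    case "secondaryPreferred":
      return nearest(secondaries) || primary;
    case "nearest":
      return nearest(up);
    default:
      throw new Error(`unknown read preference mode: ${mode}`);
  }
}

const members = [
  { name: "madrid", role: "primary", up: true, pingMs: 40 },
  { name: "paris", role: "secondary", up: true, pingMs: 15 },
  { name: "tokyo", role: "secondary", up: true, pingMs: 250 },
];
console.log(selectMember(members, "primary").name); // madrid
console.log(selectMember(members, "secondary").name); // paris
console.log(selectMember(members, "nearest").name); // paris
```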
21. Sharding
Diagram of a sharded CLUSTER:
Client, Client, Client
Query router, Query router, ...
Config server, Config server, Config server
Shard 0, Shard 1, Shard 2, ..., Shard N-1 (each shard is a replica set: Primary, Secondary, Secondary)
22. Sharding: concepts
Data are uniformly distributed across the shards using the shard key
Each shard holds the documents that belong to its own range
Sharding improves efficiency and, therefore, performance, because queries are routed only to the shards where the data reside
23. Sharding: metadata
The config servers hold the config database, which contains the cluster metadata
Metadata describes what is in the cluster and what is contained in the shards
It is a map of the data itself
Range-based partitioning
Shard key: lastname
           Low      High        Shard
Range 0    Martín   Pérez       0
Range 1    Pérez    Rodriguez   1
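The range table can be read as a lookup that a query router performs for each shard key value. The sketch below uses the two ranges from the slide. Treating the low bound as inclusive and the high bound as exclusive, and comparing strings with JavaScript's default ordering, are assumptions of this sketch; `findShard` is a hypothetical name.

```javascript
// The two ranges from the slide, keyed on lastname.
const chunkRanges = [
  { low: "Martín", high: "Pérez", shard: 0 },
  { low: "Pérez", high: "Rodriguez", shard: 1 },
];

// Returns the shard whose range contains the key (low inclusive, high exclusive),
// or null when the key falls outside the ranges listed here.
function findShard(lastname, ranges) {
  const match = ranges.find((r) => lastname >= r.low && lastname < r.high);
  return match ? match.shard : null;
}

console.log(findShard("Núñez", chunkRanges)); // 0
console.log(findShard("Pérez", chunkRanges)); // 1
console.log(findShard("Quintero", chunkRanges)); // 1
console.log(findShard("Alonso", chunkRanges)); // null
```

The `null` case only appears because the slide lists two ranges; in a real cluster the ranges cover the whole key space.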
24. Sharding: chunks, split and migrate
Chunk: range data subset. Approximately 1 chunk per 60MB
Split: runs in background. When a chunk grows beyond 60MB it is split into two equal chunks
Migrate: runs in background. Chunks are moved across the shards in order to achieve balance
The MongoDB goal is to achieve a uniform data distribution across all the shards
MongoDB balances the number of chunks per shard (not documents or bytes)
By default all collections belong to shard 0
An empty collection has only one chunk (shard 0)
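A minimal sketch of the split step, assuming a chunk is represented as a key range plus a sorted list of documents with known sizes. The 60MB threshold is the approximate figure from the slide; the chunk structure and `splitIfNeeded` are hypothetical, and the real split logic is more involved.

```javascript
const MAX_CHUNK_MB = 60; // approximate figure taken from the slide

// chunk: { low, high, docs: [{ key, sizeMb }] } with docs sorted by key.
// Returns [chunk] when no split is needed, otherwise two chunks of similar size.
function splitIfNeeded(chunk, maxMb = MAX_CHUNK_MB) {
  const total = chunk.docs.reduce((sum, d) => sum + d.sizeMb, 0);
  if (total <= maxMb || chunk.docs.length < 2) return [chunk];

  // Walk the documents until roughly half of the data is on the left side.
  let acc = 0;
  let i = 0;
  while (i < chunk.docs.length - 1 && acc + chunk.docs[i].sizeMb <= total / 2) {
    acc += chunk.docs[i].sizeMb;
    i += 1;
  }
  const cut = Math.max(i, 1); // make sure neither side ends up empty
  const splitKey = chunk.docs[cut].key;
  return [
    { low: chunk.low, high: splitKey, docs: chunk.docs.slice(0, cut) },
    { low: splitKey, high: chunk.high, docs: chunk.docs.slice(cut) },
  ];
}

const bigChunk = {
  low: "A",
  high: "Z",
  docs: [
    { key: "A", sizeMb: 20 },
    { key: "F", sizeMb: 20 },
    { key: "M", sizeMb: 20 },
    { key: "T", sizeMb: 20 },
  ],
};
const halves = splitIfNeeded(bigChunk);
console.log(halves.map((c) => `${c.low}..${c.high}`)); // [ 'A..M', 'M..Z' ]
```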
25. Sharding: chunks, split and migrate (2)
Diagram: mongos in front of Shard 0 and Shard 1, showing chunk 0 and chunk 1 being placed on the shards
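The migration step balances the number of chunks per shard, as stated on the previous slide. The sketch below moves one chunk at a time from the fullest shard to the emptiest until the counts differ by at most one. That stopping rule is a simplification chosen for this example, not the balancer's actual thresholds, and `balance` is a hypothetical name.

```javascript
// chunkCounts: { shardName: numberOfChunks }
// Returns the balanced counts and the list of migrations that were performed.
function balance(chunkCounts) {
  const counts = { ...chunkCounts };
  const names = Object.keys(counts);
  const migrations = [];
  while (names.length >= 2) {
    const most = names.reduce((a, b) => (counts[b] > counts[a] ? b : a));
    const least = names.reduce((a, b) => (counts[b] < counts[a] ? b : a));
    // Simplified stopping rule: balanced when the gap is one chunk or less.
    if (counts[most] - counts[least] <= 1) break;
    counts[most] -= 1;
    counts[least] += 1;
    migrations.push({ from: most, to: least });
  }
  return { counts, migrations };
}

// All six chunks start on shard0, as for a newly sharded collection.
const balanced = balance({ shard0: 6, shard1: 0 });
console.log(balanced.counts); // { shard0: 3, shard1: 3 }
console.log(balanced.migrations.length); // 3
```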
26. Pre-splitting
Used in batch/bulk loads
No splits or migrations take place during the load
Metadata are not altered
Data are stored directly in their shard
Diagram: data → mongos → Shard 0, Shard 1, Shard 2
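The idea behind pre-splitting is to decide the chunk boundaries before the bulk load, so that each document lands on its final shard straight away. A minimal sketch, assuming a sorted sample of shard key values is available in advance; `computeSplitPoints`, `shardFor` and the sample surnames are hypothetical.

```javascript
// Pick shardCount - 1 split points from a sorted sample of shard key values.
function computeSplitPoints(sortedKeys, shardCount) {
  const points = [];
  for (let i = 1; i < shardCount; i += 1) {
    const index = Math.floor((i * sortedKeys.length) / shardCount);
    points.push(sortedKeys[index]);
  }
  return points;
}

// A key belongs to the shard numbered by how many split points are <= the key.
function shardFor(key, splitPoints) {
  return splitPoints.filter((p) => key >= p).length;
}

const sampleKeys = [
  "Alonso", "Blanco", "Castro", "Díaz", "García",
  "López", "Martín", "Pérez", "Ruiz",
];
const splitPoints = computeSplitPoints(sampleKeys, 3);
console.log(splitPoints); // [ 'Díaz', 'Martín' ]
console.log(shardFor("Blanco", splitPoints)); // 0
console.log(shardFor("García", splitPoints)); // 1
console.log(shardFor("Ruiz", splitPoints)); // 2
```

With the boundaries fixed up front, the load itself triggers no splitting or balancing work.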
27. Summary
Designed to be:
● Fast (no joins, in-memory performance),
● Flexible (schemaless),
● Scalable (horizontal vs vertical),
● Easy to learn
Designed to:
● Reduce administrative tasks (replica set, sharding, disaster recovery)
With powerful:
● Analysis tools (aggregation framework, map reduce, Hadoop connector),
● Characteristics such as geospatial indexes, GridFS, etc.
29. Thank you for your attention!
Juan Antonio Roy Couto
Email: juanroycouto@gmail.com
September 2014
Editor's notes
NoSQL emerged because of globalization: applications need a very high read and write rate, support for large volumes of data, maximum availability, many requests, ...
Performance
Reliability
Scalability
Replica Set
Sharding Clusters
Automatic load balancing
Fewer of the typical database administration tasks (list which ones and why)
Faster time to production for a process, because product development time is shorter
NoSQL means Not only SQL
It emerges at the point where the relational model can no longer meet today's needs for storing and processing the huge amount of data now being generated (IoT, social networks, ...)
The data generated today are multidisciplinary and do not follow a fixed schema
MongoDB does not expect anyone to change their database if it already gives them performance and reliability they are satisfied with. Instead, it focuses its efforts on small companies or startups taking on new projects, and also on companies of any size that want or need to improve the performance of a running application.
BBVA, Telefónica, Santander, ...
Because it is the market-leading non-relational database
Open-source db used by companies of all sizes, across all industries and for a wide variety of applications. It is an agile database that allows schemas to change quickly as applications evolve, while still providing the functionality developers expect from traditional databases, such as secondary indexes, a full query language and strict consistency.
MongoDB is built for scalability, performance and high availability, scaling from single server deployments to large, complex multi-site architectures. By leveraging in-memory computing, MongoDB provides high performance for both reads and writes. MongoDB’s native replication and automated failover enable enterprise-grade reliability and operational flexibility.
Horizontal Scalability. As the data volume and throughput grow, developers can take advantage of commodity hardware and cloud infrastructure to increase the capacity of the MongoDB system.
High Availability. Multiple copies of data are maintained with native replication. Automatic failover to secondary nodes, racks and data centers makes it possible to achieve enterprise-grade uptime without custom code and complicated tuning.
In-Memory Performance. Data is read and written to RAM while also persisted to disk for durability, providing fast performance and eliminating the need for a separate caching layer.
Aggregation - Batch processing of data and aggregate calculations
JavaScript execution - Ability to store JavaScript functions on the server
It is a general-purpose database: it does not focus on doing one thing well, as key-value stores do, which offer the fastest response times on the market. Its goal is to cover as much ground as possible and, therefore, it offers all, or almost all, of the features of relational databases together with the advantages of non-relational ones, such as schemaless design, performance, ...
All mapReduce functions in MongoDB are JavaScript and run on the database nodes.
Besides these tools there are other backup techniques, such as simply copying the data files
MongoDB has been designed to be fast (no joins but embedded documents)
Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon.
For supporting geospatial queries (2d and 2dsphere)
Failover:
- The process from the moment the primary goes down until another node takes over its role
Node recovery:
- Rollback of all writes on the primary that were never replicated (if there were any).
- Receipt of all the operations performed while it was down.
- It starts working as a secondary
Slave Delay:
The delay before a secondary is updated.
It is used in situations where a mistake has been made (fat fingers) and you need to go back quickly without waiting for a restore from a backup.
Tags:
Used to choose which servers we want to talk to
The routers (mongos) route client requests to the shard or shards involved
The client does not know whether the collection is sharded or not, nor on which shard the data it needs reside. Therefore, the application code does not need to change
MongoDB leverages horizontal scalability effortlessly by using commodity computers
Replica:
High availability
Data safety
Disaster recovery
Sharding:
Scale out
Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
1 chunk is about 60MB of data
Chunks > 60 MB → split
Uniform data distribution across shards (chunks / shard)
Balancer decides when to migrate chunks and to which shard
Performance
Horizontal scalability with commodity hardware
Replica Set
Sharding Clusters
Auto load balancing
high availability
In-memory performance
Schema less
Failover
Data safety
Disaster recovery
MongoDB has been designed to be fast (no joins but embedded documents), flexible (schemaless), scalable (horizontal, not vertical), to keep administration tasks to a minimum (replica set, failover, sharding), to be fun and quick for programmers to learn, and to come with powerful data analysis tools (aggregation framework), geospatial indexes, GridFS, and so on.
MongoDB does not support multi-document transactions.
However, MongoDB does provide atomic operations on a single document. Often these document-level atomic operations are sufficient to solve problems that would require ACID transactions in a relational database. Relational databases might represent the same kind of data with multiple tables and rows, which would require transaction support to update the data atomically.