A quick overview of the options we have right now when choosing the architecture of our data storage system. Relational is fantastic, but sometimes we have the feeling our problem won’t fit very well there… are there any other options? We will review a few different DB engines and see examples of how we can use them from our Microsoft .NET applications (or any others, really).
Originally introduced in http://www.meetup.com/es/dotnetMALAGA/events/226374459/
and
http://www.meetup.com/es/MalagaMakers/events/225695665/
4. SQL
Commercial example: Oracle | OS example: (Oracle) MySQL
NoSQL
“Mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.”
“Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.”
NoSQL systems are also sometimes called "Not only SQL".
SQL? ACID? Relations? Distributed?
Commercial example: DynamoDB | OS example: MongoDB
NewSQL
Modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
OS example: VoltDB
NoSQL vs. SQL vs. NewSQL
Wikipedia
No-sql.org
5. More Database classifications
On premises vs. Cloud “As a service” (Azure DocumentDB)
Memory / Disk vs. Only in memory (OrigoDB, Redis, SQL Server)
OLTP vs. OLAP
Databases vs. Not a database but a data store (Zookeeper, Kafka)
CAP classifications
22. Use cases…
General data management
Network and IT operations
Recommendation engines
Fraud detection
Social networks
Graph DBs
Just a few slides remaining…
41. Key Takeaways
Always think about the schema
(even with schema less DBs)
Best DB? “It depends”
• Prototyping?
• Domain?
• How the data is going to be used?
Most of us don’t work with “big data” but “small or medium”
44. Resources
Different DB images: https://www.thoughtworks.com/insights/blog/nosql-databases-overview
Polyglot persistence images: http://www.slideshare.net/mongodb/webinar-mongodb-and-polyglot-persistence-architecture
DATABASE NAME | AVAILABLE FOR WINDOWS?
Redis | Yes (C)
MongoDB | Yes (C++)
Cassandra | Yes (Java)
Neo4j | Yes (Java)
ElasticSearch | Yes (Java)
InfluxDB | Yes (Go)
EventStore | Yes
OrientDB | Yes (Java)
SQL Server | Yes (C++)
Editor's notes
· Welcome
· About me & Sequel Business Solutions
· Thanks MalagaMakers & dotnetMalaga
We only have 1 hour, and we have a lot to talk about
Start with a brief review of several key concepts we are going to work with
o Difference between the “trending” NoSQL movement and the old school SQL
o Review 9 different databases (this talk could have been called 9 databases in 1 hour but I didn’t know how many I was going to review when I set up the meeting in meetup.com)
o Docker
· Briefly talk about Polyglot persistence
· Statistics about databases
· Caveats
o Not exhaustive
o Tools / Info on how to choose
o Don’t ask me “what is the difference between X and Y”
· How many developers are in this room?
· What is NoSQL?
· A few ambiguous definitions:
o Not using the relational model
o Running well on clusters
o Mostly open-source
o Built for the 21st century web estates
o Schema-less
· We can always find an example of a NoSQL database that violates these statements
· I have tried to summarise what NoSQL is in this slide by copying definitions from the internet
· Not sure if it is clear yet…
· Notice there is even a new generation of databases called NewSQL
· The databases can be stored in your local PC/cluster or they can be in the cloud, as SaaS. An example of a cloud database is Azure DocumentDB, where you can use JavaScript directly inside the database engine. You can scale storage and throughput linearly with cost via combinable units as your application grows, or tune consistency via levels (strong, session, eventual) to suit application scenarios. All via the Azure web management console - SLIDER
· There are circumstances in which you might prefer to store data in memory, with no need to save it to disk. The in-memory database can be restored from disk on startup. Microsoft SQL Server allows you to have some tables in memory so it is amazingly fast. – CONTENTION
· OLTP is online transaction processing. This is what we usually do when users read / write to our database through the application. OLAP is online analytical processing: multidimensional analysis of business data, complex calculations, etc.
· There are some “data stores” that cannot be considered databases but can help with the management of data. A couple of examples: ZooKeeper keeps data in distributed environments, usually for synchronization, and Kafka is a pub/sub messaging system that works like a transaction log. These 2 work as a base for other data storage systems.
· CAP is a classification where consistency, availability and partition tolerance are the 3 corners of a triangle and we can only have 2 of them. Some authors say the concepts are too strict and are no longer relevant.
· I’ve just put this here to show you how Microsoft scores at the Gartner Magic Quadrant, although this means probably nothing to you.
· These are the databases we are going to cover. How many of those sound familiar?
· I will try to do a very quick demo of some aspect of each database so you can get at least “a feeling” for what those DBs are about.
I was going to use RavenDB as the example but then this presentation would be too “Microsoft”
· Key-value databases are those in which we index everything based on a particular key, so in theory you cannot have 2 entries with the same key.
· Redis in particular is used in a lot of big companies, and is particularly useful to speed up legacy applications for example.
· Redis is called a data structure server, as the “value” bit can be a list, map, string, binary, etc.
· Redis is fast
Redis is called the data structure server
Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob that the data store just stores, without caring or knowing what's inside; it's the responsibility of the application to understand what was stored.
Since key-value stores always use primary-key access, they generally have great performance and can be easily scaled.
Some of the popular key-value databases are Riak, Redis (often referred to as Data Structure server), Memcached and its flavors, Berkeley DB, HamsterDB (especially suited for embedded use), Amazon DynamoDB (not open-source), Project Voldemort and Couchbase.
Not all key-value databases are the same; there are major differences between these products. For example, Memcached data is not persistent while in Riak it is, and these features are important when implementing certain solutions. Let's say we need to implement caching of user preferences. Implementing it in Memcached means that when the node goes down all the data is lost and needs to be refreshed from the source system. If we store the same data in Riak we may not need to worry about losing data, but we must also consider how to update stale data. It's important not only to choose a key-value database based on your requirements, but also to choose which key-value database.
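To make the get / put / delete API described above a bit more concrete, here is a minimal in-memory sketch in Python. It is not a client for Redis, Memcached or Riak: the class `TinyKeyValueStore`, its `clock` parameter and the key `user:42:prefs` are all made up for illustration. It only shows that the store treats the value as an opaque blob, and that a TTL is one simple way to deal with stale cached data.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class TinyKeyValueStore:
    """Hypothetical in-memory key-value store (illustration only, not a real client)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (opaque value, expiry time or None when the entry never expires)
        self._data: Dict[str, Tuple[bytes, Optional[float]]] = {}

    def put(self, key: str, value: bytes, ttl_seconds: Optional[float] = None) -> None:
        """Store a blob under a key, replacing any previous value for that key."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[bytes]:
        """Return the blob for a key, or None if it is missing or has expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # expired entries are dropped lazily on read
            return None
        return value

    def delete(self, key: str) -> bool:
        """Remove a key; return True if it existed."""
        return self._data.pop(key, None) is not None


# The store never looks inside the value: the application serialises and parses it.
store = TinyKeyValueStore()
store.put("user:42:prefs", json.dumps({"theme": "dark"}).encode("utf-8"), ttl_seconds=300)
prefs = json.loads(store.get("user:42:prefs"))
```

Everything goes through the key, which is why these stores are easy to partition across nodes: the key alone decides where an entry lives.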
DEMO – Redis PUB SUB
· Redis Desktop
· Run 2 instances
This is a bit of code in C#: store a string, set a TTL, store JSON. You can also store lists, sets, etc. and do operations with them
· This was the first well known NoSQL database
· Most used document database
· The documents have an ID
· Documents contain documents, are similar to each other but do not have to be exactly the same.
· Documents contain references to other documents (but there are no joins)
Documents are the main concept in document databases. The database stores and retrieves documents, which can be XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures which can consist of maps, collections, and scalar values. The documents stored are similar to each other but do not have to be exactly the same. Document databases store documents in the value part of the key-value store; think about document databases as key-value stores where the value is examinable. Document databases such as MongoDB provide a rich query language and constructs such as database, indexes etc allowing for easier transition from relational databases.
Some of the popular document databases we have seen are MongoDB, CouchDB, Terrastore, OrientDB, RavenDB, and of course the well-known and often reviled Lotus Notes that uses document storage.
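The idea of "a key-value store where the value is examinable" can be sketched in a few lines of Python. This is not the MongoDB API: the collections `authors` and `posts`, the helpers `get_path` and `find`, and all the sample data are assumptions made up for this example. It shows documents that are similar but not identical, a query on a nested field, and a reference resolved by the application instead of by a join.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

Document = Dict[str, Any]

# Hypothetical collections: document ID -> document (a nested structure, like JSON).
authors: Dict[str, Document] = {
    "a1": {"name": "Ana", "city": "Malaga"},
}
posts: Dict[str, Document] = {
    "p1": {"title": "Redis in an hour", "author_id": "a1",
           "tags": ["redis", "cache"], "stats": {"views": 120}},
    # Similar to p1 but not identical: no tags, and an extra "draft" field.
    "p2": {"title": "Graphs", "author_id": "a1",
           "stats": {"views": 15}, "draft": True},
}


def get_path(document: Document, path: str) -> Any:
    """Follow a dotted path such as 'stats.views'; None if any part is missing."""
    current: Any = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection: Dict[str, Document], path: str,
         predicate: Callable[[Any], bool]) -> Iterator[Tuple[str, Document]]:
    """Yield (id, document) pairs whose value at the given path satisfies the predicate."""
    for doc_id, doc in collection.items():
        value = get_path(doc, path)
        if value is not None and predicate(value):
            yield doc_id, doc


# Query on a nested field: the store can look inside the value.
popular = [doc_id for doc_id, _ in find(posts, "stats.views", lambda v: v > 100)]

# No joins: the application follows the reference with a second lookup.
author_of_p1 = authors[posts["p1"]["author_id"]]["name"]
```

A real document database adds indexes on those paths so that `find` does not have to scan every document.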
DEMO
· Meteor
· MongoDB visualizer
· For this example I am going to use Meteor, a framework to create web applications based on Node.js and MongoDB. The whole thing is fully integrated and even some code is shared between client and server. Meteor is reactive at its foundations: a change made in 1 session is immediately reflected in other sessions.
· Wide Column databases are those that can handle millions of columns without any trouble.
· Cassandra in particular is known for its high availability via clustering and for its performance, both reading and writing.
· It is being used by monsters like eBay, Spotify and Netflix
· Netflix – 50 clusters, 750 nodes, in AWS. Nearly all film metadata is there, user ratings, recommendations
· Spotify – Playlist storage, like a version control system, more than 1 billion playlists, > 40k requests per second, concurrent changes
1) Why should I choose C*?
a. linear scalability, throughput scales "almost" linearly with the number of nodes
b. almost unbounded extensivity (there is no limit, or at least a huge limit, in terms of the number of nodes you can have on a cluster)
c. operational simplicity due to the master-less architecture. This feature, although quite transparent for developers, is a key selling point. Having suffered when manually installing a Hadoop cluster, I happen to love the deployment simplicity of C*: only one process per node, no moving parts.
d. high availability. C* clearly trades consistency for availability, so you can expect something like 99.99% uptime. A strong selling point for critical businesses which need to be up all the time
e. support for multiple data centers out of the box. Again, on the operational side, it's a great feature if you plan a worldwide deployment
That's all I can see for now
2) Why shouldn't I choose C*?
a. need for strong consistency most of the time. Although you can perform all requests with consistency level ALL, it's clearly not the best use of C*. You'll suffer from higher latency and reduced availability. Even the new "lightweight transaction" feature is not meant to be used on a large scale
b. very complicated and changing queries. Denormalizing is great when you know ahead of time exactly how you'll query your data. Once done, any new way of querying will require new coding & new tables to support it
c. ridiculous data load. I've seen people in prod using C* for only 200 GB because they want to be trendy and use bleeding edge technologies. They'd be better off using a classical RDBMS solution that fits their load perfectly
The main principle in designing a table is not its relationship to other tables, as it is in relational database modeling. Data in Cassandra is often arranged as one query per table, and data is repeated amongst many tables
· CQL exposes a Cassandra DB in a very similar way to SQL (but there are no joins)
· Sets, lists, maps – we can easily store denormalised data in a row
· Speed up reads by writing in several places – in the same way an index is automatically maintained by the database, we are responsible for keeping all the column families (tables) in sync
· Just wanted to show you how the rows in CQL are actually stored as columns in Cassandra
· Partition key + Clustering key
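The "one table per query" idea and the partition key + clustering key layout can be sketched with plain Python dictionaries. This is not CQL and not a Cassandra driver: the table names `ratings_by_user` and `ratings_by_film` and the helper `add_rating` are made up for this example. It only illustrates that the application writes the same fact into every table that serves a query, and that rows inside a partition stay sorted by the clustering key.

```python
from bisect import insort
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# Each hypothetical "table" maps a partition key to the rows of that partition.
# Rows are (clustering key, value) tuples kept sorted by the clustering key.
ratings_by_user: DefaultDict[str, List[Tuple[str, int]]] = defaultdict(list)  # partition: user, clustering: film
ratings_by_film: DefaultDict[str, List[Tuple[str, int]]] = defaultdict(list)  # partition: film, clustering: user


def add_rating(user_id: str, film_id: str, stars: int) -> None:
    """Write the same rating to both tables; the application keeps them in sync.

    Simplification: a second rating for the same (user, film) pair is appended
    rather than replacing the first one, which is not how a real upsert behaves.
    """
    insort(ratings_by_user[user_id], (film_id, stars))
    insort(ratings_by_film[film_id], (user_id, stars))


add_rating("u1", "f2", 4)
add_rating("u1", "f1", 5)
add_rating("u2", "f1", 3)

# Query 1: "all ratings by user u1" reads a single partition of one table.
u1_ratings = ratings_by_user["u1"]

# Query 2: "all ratings for film f1" reads a single partition of the other table.
f1_ratings = ratings_by_film["f1"]
```

Both reads touch exactly one partition, which is what makes them fast; the price is the duplicated write.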
· Graph databases are very trendy lately, and Neo4j is one of the most famous ones.
· They are used by companies like meetic and infojobs (to store relationships)
· We have nodes with properties and then we have the relationships between nodes.
· Use cases
DEMO
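As a rough illustration of nodes with properties plus relationships between them, here is a tiny Python sketch of a "people you may know" recommendation. It is not Neo4j or Cypher: the `people` and `knows` structures, the helper functions and the sample names are assumptions made up for this example.

```python
from typing import Dict, List, Set

# Nodes with properties (hypothetical sample data).
people: Dict[str, Dict[str, str]] = {
    "ana": {"city": "Malaga"},
    "bob": {"city": "Madrid"},
    "carla": {"city": "Sevilla"},
    "dani": {"city": "Malaga"},
}

# One relationship type, KNOWS, stored as an adjacency set per node.
knows: Dict[str, Set[str]] = {name: set() for name in people}


def add_knows(a: str, b: str) -> None:
    """Record a KNOWS relationship; treated as symmetric in this sketch."""
    knows[a].add(b)
    knows[b].add(a)


def friends_of_friends(person: str) -> List[str]:
    """People exactly two hops away: known by a friend, but not yet a friend."""
    direct = knows[person]
    candidates: Set[str] = set()
    for friend in direct:
        candidates |= knows[friend]
    return sorted(candidates - direct - {person})


add_knows("ana", "bob")
add_knows("bob", "carla")
add_knows("bob", "dani")
add_knows("ana", "dani")

suggestions = friends_of_friends("ana")
```

The traversal only follows relationships out of the starting node, so its cost depends on the neighbourhood visited rather than on the total number of nodes. That is the kind of query graph databases are built for.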
· This is a very specific database: even though it can be used as the single database in an application, its main objective is to perform searches, and that’s how it is used in those 2 companies.
We usually feed Elasticsearch from another data source, for example by setting up replication between CouchDB (a document database) and Elasticsearch.
· I am going to show you another example: the analysis of logs for an application. There are commercial cloud solutions like Raygun.io where you can log and analyse anything, but you can get something similar in house in .NET with the Elasticsearch logger for log4net (similar to log4j).
· We can then visualise the results with Kibana, which is a generic data visualizer for ElasticSearch.
· Time series databases are used by companies like SoundCloud or Digital Ocean (hosting provider).
· They store a series of data entries that change with time, for example, temperature from sensors.
· Particularly good at answering queries based on time windows, etc.
· They can be easily replaced by a database like PostgreSQL if the size of your data is not HUGE. There is one on top of Cassandra called OpenTSDB.
· In the same way Kibana is there for ElasticSearch, for a few time series databases (not just InfluxDB) we have Grafana.
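A query based on a time window can be sketched like this in Python. It is not the InfluxDB API: the class `Series`, its methods and the sample temperatures are made up for illustration, and it assumes points arrive in time order (as sensor readings usually do).

```python
from bisect import bisect_left
from statistics import mean
from typing import List


class Series:
    """Hypothetical append-only series of (timestamp, value) points."""

    def __init__(self) -> None:
        self._timestamps: List[float] = []
        self._values: List[float] = []

    def append(self, timestamp: float, value: float) -> None:
        """Add a point; points must arrive in non-decreasing time order."""
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("out-of-order point")
        self._timestamps.append(timestamp)
        self._values.append(value)

    def window(self, start: float, end: float) -> List[float]:
        """Values with start <= timestamp < end, located by binary search."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_left(self._timestamps, end)
        return self._values[lo:hi]


# Temperature readings from a sensor, one per minute (timestamps in seconds).
temperature = Series()
for ts, celsius in [(0, 20.0), (60, 21.0), (120, 23.0), (180, 22.0)]:
    temperature.append(ts, celsius)

readings = temperature.window(60, 180)
average = mean(readings)
```

Because the data is stored ordered by time, a window query is two binary searches plus a slice, however long the series grows.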
· This is another very specific database as it cannot be used as the main database and is not a general purpose database.
· Its mission is to store events in an event sourcing application.
· If we have a bank account there are 2 ways to store information about that bank account:
o Store the amount of money you have
o Store the movements that lead to that final scenario – This is what these DBs are for (DEPOSIT / WITHDRAWAL)
Event sourcing is typically used with the CQRS pattern, where you separate reads and writes into two different systems / objects / repositories.
CQRS can be done with or without event sourcing.
An event sourcing database has no DELETE action: events are only appended.
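The bank account example can be sketched in Python as follows. This is not the EventStore API: the names `Event` and `AccountEventLog` and the `up_to` parameter are made up for illustration. The balance is never stored; it is rebuilt by replaying the DEPOSIT / WITHDRAWAL events.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    """An immutable fact about the account (hypothetical shape)."""
    kind: str  # "DEPOSIT" or "WITHDRAWAL"
    amount: Decimal


class AccountEventLog:
    """Append-only log: there is deliberately no update or delete method."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def balance(self, up_to: Optional[int] = None) -> Decimal:
        """Replay the first `up_to` events (all of them by default)."""
        total = Decimal("0")
        for event in self._events[:up_to]:
            if event.kind == "DEPOSIT":
                total += event.amount
            elif event.kind == "WITHDRAWAL":
                total -= event.amount
            else:
                raise ValueError(f"unknown event kind: {event.kind}")
        return total


log = AccountEventLog()
log.append(Event("DEPOSIT", Decimal("100")))
log.append(Event("WITHDRAWAL", Decimal("30")))
log.append(Event("DEPOSIT", Decimal("5")))

current_balance = log.balance()             # replay everything
balance_after_two = log.balance(up_to=2)    # the state as it was after 2 events
```

Replaying a prefix of the log gives the state at any point in the past, which is something the "store only the amount" approach cannot do.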
· We have seen a few different DBs, each one is strong in a particular area but… what is stopping DB creators from mixing characteristics from different databases?
· For example OrientDB is a graph database AND a document database.
· On paper everything looks great, although I have read it is a bit buggy.
· I don’t have any demos (sorry)
· I didn’t want to leave this without quickly going through the existing relational databases
· Who is the guy at the top left? Martin Fowler
· We have seen a few databases in the last hour, but if our system is big enough we will usually end up not with just 1 but with a combination of databases, to get the most out of each.
· LinkedIn is an example where we can see graph databases, relational databases, etc.
· Need to be careful as the system might become messy
· There are things like Kafka to help organise the mess; here it is used as the main pipe to move data that will end up in a data store.
· These are some graphs taken from a web site that show
o How popular each database (and database type) is
o How “trendy” each database type is
A triplestore or RDF store is a purpose-built database for the storage and retrieval of triples through semantic queries. A triple is a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred".
Much like a relational database, one stores information in a triplestore and retrieves it via a query language. Unlike a relational database, a triplestore is optimized for the storage and retrieval of triples. In addition to queries, triples can usually be imported/exported using Resource Description Framework (RDF) and other formats.
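A subject-predicate-object store can be sketched in Python like this. It is not RDF or SPARQL: the `triples` list, the `match` helper and the extra "Fred knows Alice" triple are made up for illustration, and `None` is used as a wildcard.

```python
from typing import Any, List, Optional, Tuple

Triple = Tuple[str, str, Any]

# "Bob is 35" and "Bob knows Fred" from the text, plus one extra made-up triple.
triples: List[Triple] = [
    ("Bob", "age", 35),
    ("Bob", "knows", "Fred"),
    ("Fred", "knows", "Alice"),
]


def match(subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Any = None) -> List[Triple]:
    """Return the triples matching a pattern; None matches anything."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


everything_about_bob = match(subject="Bob")
who_knows_whom = match(predicate="knows")
who_knows_fred = [s for (s, _, _) in match(predicate="knows", obj="Fred")]
```

A real triplestore indexes the triples by subject, predicate and object so that these pattern queries do not have to scan the whole list.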
Relational – think about schema NoSQL – think about schema!
· What is going to happen in a few years when a new developer joins the company and has to maintain the application?
There is no best database (unless we have created it ourselves from scratch :)), and it all depends on what we want to do
o Domain (what the business is about)
o How we want to extract the data
o Do we know everything? For example Event Stores are very good in terms of “new ways to explore the data” as we are able to rebuild different things from the event series
Lastly, remember that hardware is much more powerful than we think. Most of us are probably working with small or medium data rather than “big data”