1. The document discusses various technologies for building big data architectures, including NoSQL databases, distributed file systems, and data partitioning techniques.
2. Key-value stores, document databases, and graph databases are introduced as alternatives to relational databases for large, unstructured data.
3. The document also covers approaches for scaling databases horizontally, such as sharding, replication, and partitioning data across multiple servers.
7. Latency
● Hibernia Express
● 3,000-mile fiber-optic cable
● across the Atlantic Ocean to connect London to New York
● Goal: cut latency on the route by about 5 ms
● To be used by financial institutions for trading
Src: http://shop.oreilly.com/product/0636920028048.do
22. CAP theorem
Strong Consistency, High Availability, and Partition-Tolerance
Img-src:http://image.slidesharecdn.com/cap-131117230434-phpapp02/95/dynamo-and-bigtable-in-light-of-the-cap-theorem-12-638.jpg?cb=1384729712
29. BASE
● Basically Available: If a single node fails, part of the
data won't be available, but the entire data layer stays
operational.
● Soft state: data that is not persisted on disk, yet
may still be restorable after a failure.
● Eventually consistent: the system will become
consistent over time, provided that it receives no
new input during that time.
33. Implementation 2 - Associative Arrays
Key    Value
user1  Mike
user2  Mary
user3  Nina
On hotspace
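A minimal sketch of the associative-array idea, assuming a Python dict as the in-memory map. The `put` and `get` helper names are hypothetical; the data is the table from the slide.

```python
# Associative array: each key maps directly to one value.
users = {"user1": "Mike", "user2": "Mary", "user3": "Nina"}

def put(store, key, value):
    """Insert or overwrite the value stored under key."""
    store[key] = value

def get(store, key):
    """Return the value for key, or None if the key is absent."""
    return store.get(key)

put(users, "user1", "Michael")  # overwriting replaces the old value
```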
34. Simple Storage Design
- put key value: appends the pair to a file as one line
- get key: greps the file for the key and returns its
value
What are the problems with this?
Activity
How can we improve this?
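The put/get design above can be sketched as follows. This is an illustration only: the `LogStore` class, the `key,value` line format, and the temporary file path are assumptions, and keys are assumed to contain no commas or newlines.

```python
import os
import tempfile

class LogStore:
    """Append-only store: one 'key,value' record per line."""

    def __init__(self, path):
        self.path = path
        open(self.path, "a", encoding="utf-8").close()  # create if missing

    def put(self, key, value):
        # Always append; old records for the same key stay in the file.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")

    def get(self, key):
        # Scan the whole file (like grep); the last match wins.
        result = None
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result

log_store = LogStore(os.path.join(tempfile.mkdtemp(), "data.log"))
log_store.put("user1", "Mike")
log_store.put("user2", "Mary")
log_store.put("user1", "Michael")
```

Writes are cheap appends, but every `get` reads the entire file, and overwritten records are never reclaimed.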
35. Simple Storage Design
- Add an in-memory index that maps each key to the byte offset of its record in the file.
What are the problems with this?
Activity
How can we improve this?
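A sketch of the in-memory index, under the same assumptions as a one-record-per-line file. `IndexedLogStore` is a hypothetical name; the index here is a plain dict from key to byte offset.

```python
import os
import tempfile

class IndexedLogStore:
    """Append-only file plus an in-memory map: key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(self.path, "ab").close()  # create if missing

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # where this record will start
            f.write(record)
        self.index[key] = offset  # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scan
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]

idx_store = IndexedLogStore(os.path.join(tempfile.mkdtemp(), "data.log"))
idx_store.put("user1", "Mike")
idx_store.put("user2", "Mary")
idx_store.put("user1", "Michael")
```

Reads now need one seek, but all keys must fit in memory and the index is lost on restart unless it is rebuilt from the file.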
36. Simple Storage Design
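- Segments: close the file at a size limit and continue writing in a new one
- Compaction: drop overwritten records, keeping only the latest value per key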
What are the problems with this?
Activity
How can we improve this?
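Compaction across segments can be sketched as below. Segments are modelled as in-memory lists of `(key, value)` records, oldest first, which is an assumption made for brevity; a real store would read them from files.

```python
def compact(segments):
    """Merge segments (oldest first), keeping the newest value per key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # newer records overwrite older ones
    return list(latest.items())

old_segment = [("user1", "Mike"), ("user2", "Mary"), ("user1", "Michael")]
new_segment = [("user2", "Maria"), ("user3", "Nina")]
merged_segment = compact([old_segment, new_segment])
```

Five records shrink to three, one per live key.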
37. Simple Storage Design
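- Sorted Key-Value: keep the records of each segment sorted by key
- Sparse index: index only some keys, then scan from the nearest indexed key
- SSTable (Sorted String Table)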
What are the problems with this?
Activity
How can we improve this?
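A sketch of a lookup through a sparse index over sorted records. The function names and the `every` parameter are hypothetical, and the SSTable is modelled as a sorted in-memory list rather than a file.

```python
import bisect

def build_sparse_index(sstable, every=2):
    """Index every `every`-th key as (key, position in the sorted table)."""
    return [(sstable[i][0], i) for i in range(0, len(sstable), every)]

def lookup(sstable, sparse_index, key):
    keys = [k for k, _ in sparse_index]
    slot = bisect.bisect_right(keys, key) - 1  # last indexed key <= key
    if slot < 0:
        return None  # key sorts before the first record
    start = sparse_index[slot][1]
    if slot + 1 < len(sparse_index):
        end = sparse_index[slot + 1][1]
    else:
        end = len(sstable)
    for k, v in sstable[start:end]:  # short scan inside one block
        if k == key:
            return v
    return None

sstable = [(f"key{i}", f"value{i}") for i in range(1, 6)]  # already sorted
sparse_index = build_sparse_index(sstable, every=2)
```

Because the records are sorted, only a fraction of the keys has to live in memory.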
39. Simple Storage Design - Overall
- Writes go into an in-memory balanced tree (red-black or AVL), the memtable
=> faster writes
- When the memtable reaches 64MB, write it to disk as an SSTable and
clear the memtable
- Reads check the memtable first, then the most recent segments via their
in-memory sparse index (SSTable) => faster reads
- Run a merging and compaction process in the background
=> less storage and faster reads
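The overall flow can be sketched in a few lines. Several simplifications are assumed: a dict sorted at flush time stands in for the red-black or AVL tree, the flush threshold is an entry count rather than 64MB, and SSTables are kept as in-memory lists. `TinyLSM` is a hypothetical name.

```python
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # entries, not bytes (assumption)
        self.memtable = {}
        self.sstables = []  # newest last; each is a sorted list of pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out in key order, then start a fresh one.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:  # newest data first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then newest segment first
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all segments, letting newer values replace older ones.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []

db = TinyLSM(memtable_limit=2)
for k, v in [("user1", "Mike"), ("user2", "Mary"),
             ("user1", "Michael"), ("user3", "Nina")]:
    db.put(k, v)
segments_before = len(db.sstables)
db.compact()
```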
50. Why?
❏ Modelling and storing relationships in an RDBMS is
complicated
❏ Performance degrades with the number and depth of
relationships
❏ Query complexity grows
❏ Adding a new type requires schema redesign
51. With Neo4j, you can traverse 4M+ relationships
per second per core
63. When is this better?
❏ Huge number of columns, with queries touching only a few of them
❏ Aggregation
❏ Column-level updates
❏ Column data is uniform, so it compresses better
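A small sketch of why a column layout helps with aggregation and compression. The truck rows are made-up sample data, and run-length encoding is used here as one simple example of a compression scheme.

```python
from itertools import groupby

def to_columns(rows):
    """Turn a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse runs of equal values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

rows = [  # hypothetical sample data
    {"truck": "T1", "status": "ok", "speed": 60},
    {"truck": "T2", "status": "ok", "speed": 55},
    {"truck": "T3", "status": "ok", "speed": 65},
    {"truck": "T4", "status": "late", "speed": 40},
]
columns = to_columns(rows)
# Aggregation reads a single column instead of every full row.
average_speed = sum(columns["speed"]) / len(columns["speed"])
# Uniform column values compress into a few runs.
status_runs = run_length_encode(columns["status"])
```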
64. Time Series Data
A measurement and the time of measurement, recorded repeatedly
Img src: https://www.safaribooksonline.com/library/view/time-series-databases/9781491920909/images/tsdn_0103.png.jpg
66. When - Time Series Data
● Huge amount of data
● Queries are mostly based on time
● Stock exchange
● Sensor data, e.g. trucks
● Cell towers for usage patterns
68. Replication
81. Federated Tables
A Federated Table is a table which points
to a table in another database instance
(usually on another server). It can be
seen as a view of this remote database
table.
- Administration overhead
- Security
- Access over network
- Okay for reporting/analytical tasks
84. Hash based
Take the hash of the key, apply a modulo operation,
and put the data on the server selected by the
remainder.
- Uniform distribution
- Range queries may be slow, since adjacent keys land on different servers
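Hash-based placement can be sketched as below. SHA-256 is an assumed choice of hash, picked because Python's built-in `hash()` for strings changes between processes; `server_for` and the sample keys are hypothetical.

```python
import hashlib

def server_for(key, num_servers):
    """Hash the key, then take the remainder to pick a server."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_servers

# Place 100 sample keys on 4 servers.
placement = {}
for i in range(100):
    key = f"user{i}"
    placement.setdefault(server_for(key, 4), []).append(key)
```

Neighbouring keys such as `user1` and `user2` usually end up on different servers, so a range query has to ask every server.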
85. Co-ordinators
- Take the request; if it contains the shard key, talk to the correct shard
- Otherwise co-ordinate across shards and return the combined result
- Monitor health
- Take care of rebalancing
- Can be a random node, which will complete the task
- Can also be a dedicated set of co-ordinators
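A minimal sketch of the routing part of a co-ordinator, assuming hash-based shards held in memory. The `Shard` and `Coordinator` classes, the `healthy` flag, and the `scan` method are hypothetical; rebalancing is left out.

```python
import hashlib

class Shard:
    def __init__(self):
        self.data = {}
        self.healthy = True

class Coordinator:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest, "big") % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key).data[key] = value

    def get(self, key):
        # Key is in the request: talk to exactly one shard.
        return self._shard_for(key).data.get(key)

    def scan(self, predicate):
        # No key in the request: ask every healthy shard and merge.
        results = {}
        for shard in self.shards:
            if shard.healthy:
                results.update(
                    {k: v for k, v in shard.data.items() if predicate(k, v)}
                )
        return results

    def unhealthy_shards(self):
        # Health monitoring reduced to reading a flag.
        return [i for i, s in enumerate(self.shards) if not s.healthy]

coordinator = Coordinator(num_shards=3)
for k, v in [("user1", "Mike"), ("user2", "Mary"), ("user3", "Nina")]:
    coordinator.put(k, v)
names_with_m = coordinator.scan(lambda k, v: v.startswith("M"))
```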
86. Take care while sharding
● Balance your shards with a proper shard key
● Choose the correct number of shards, e.g. 12
● Give time for rebalancing. When increasing server capacity,
add nodes early and give your shards time to move.
● Shard on denormalized data.
● Try to have the shard key as part of your queries.