4. (4)
Big Data
Definition of big data…
Big Data refers to data sets that have grown so large that they are hard to capture, store and process using conventional methods.
The data is hard to handle because it:
• Arrives in large amounts
• Arrives at high speed
• Comes in different formats
5. (5)
Digital Data Years
The rapid growth of the internet over 35 years
Every day we create as much information as we did from the beginning of time until 2003
The total amount of data being captured and stored by industry doubles every 1.2 years
[Timeline: 0000, 2003, 2019, 2020]
Make sure you know your zettabytes (10^21) from your yottabytes (10^24)
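The unit sizes and the doubling claim can be made concrete with a little arithmetic. This is a minimal sketch, assuming decimal (SI) byte units and a steady doubling period of 1.2 years; the function name `growth_factor` is made up for illustration.

```python
# Decimal (SI) byte units named on the slide.
ZETTABYTE = 10**21
YOTTABYTE = 10**24


def growth_factor(years: float, doubling_period_years: float = 1.2) -> float:
    """Factor by which stored data grows if it doubles every doubling period."""
    return 2 ** (years / doubling_period_years)


# One yottabyte is a thousand zettabytes.
print(YOTTABYTE // ZETTABYTE)
# Six years is five doubling periods, so the volume grows about 32-fold.
print(round(growth_factor(6.0), 1))
```

Under this assumption the volume grows by three orders of magnitude (one unit prefix) roughly every 12 years, since 2^10 is about 1000.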
Over 95% of all the data in the world was created in the past 2 years
Every minute in 2014
• We sent 204 million emails
• We generated 1.8 million Facebook likes
• We sent 278 thousand Tweets
• We uploaded 200,000 photos to Facebook
• For comparison, 3.5 billion searches were made in a single day
Internet
• More than 3.7 billion
humans use the internet
• We conduct more than half
of our web searches from a
mobile phone now
• On average, Google now
processes more than
40,000 searches EVERY
second (3.5 billion searches
per day)!
• Only accelerating with the
growth of the Internet of
Things (IoT)
A Day of Data
• 500 million tweets are sent
• 294 billion emails are sent
• 4 petabytes of data are
created on Facebook
• 4 terabytes of data are
created from each
connected car
• 65 billion messages are
sent on WhatsApp
• 5 billion searches are made
Exponential Growth
• By 2025, it’s estimated that 463 exabytes of data will be created each day globally. At 4.7 GB per single-layer DVD that is roughly 98.5 billion DVDs per day (the often-quoted 212,765,957 DVDs is what a single exabyte works out to)
[Timeline: 2014, 2025]
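The DVD comparison can be checked with a few lines of arithmetic. This sketch assumes decimal units and a 4.7 GB single-layer DVD (the slide does not state the capacity, so that figure is an assumption). Under those assumptions 212,765,957 DVDs corresponds to one exabyte, and 463 exabytes comes to roughly 98.5 billion DVDs.

```python
EXABYTE = 10**18            # decimal exabyte, in bytes
DVD_BYTES = 4_700_000_000   # assumed capacity of a single-layer DVD (4.7 GB)


def dvds_needed(exabytes: float) -> float:
    """Number of DVDs needed to hold the given number of exabytes."""
    return exabytes * EXABYTE / DVD_BYTES


print(f"{dvds_needed(1):,.0f}")     # DVDs per exabyte
print(f"{dvds_needed(463):,.0f}")   # DVDs for 463 exabytes
```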
7. (7)
Changing Landscape of Big Data
Big Data Challenges…
Volume
More data coming in huge quantities
• Your own data (archives, junk, logs), free public data and premium data all add to the volume
Velocity
The speed of the incoming data
• Data arrives at high speed, mainly because of the increasing number of users and interactions
Variety
Different types of data
• Data comes in many types: unstructured (files), semi-structured (JSON, graphs) and structured (relational)
Veracity
The quality or truth of the data
• It is challenging to identify misinformation and invalid data within such a large volume
9. (9)
What we used to do...
Data is stored in the form of tables.
It supports multiple users.
It maintains the relationships among the tables.
It has higher hardware and software requirements.
An RDBMS supports integrity constraints at the schema level.
Data can be easily accessed using SQL queries.
Examples: MySQL, Oracle, SQL Server
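A small illustration of schema-level integrity constraints and SQL access, sketched with Python's built-in `sqlite3` module rather than any of the products named above. The `customer` and `orders` tables and the helper `try_insert` are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " total REAL NOT NULL CHECK (total >= 0))"
)
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 25.0)")


def try_insert(customer_id, total):
    """Attempt to insert an order; the schema itself rejects invalid rows."""
    try:
        conn.execute(
            "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
            (customer_id, total),
        )
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_insert(99, 10.0))   # no such customer: foreign key constraint
print(try_insert(1, -5.0))    # negative total: CHECK constraint

# The relationship between the tables is followed with a SQL join.
rows = conn.execute(
    "SELECT c.name, o.total FROM orders o JOIN customer c ON c.id = o.customer_id"
).fetchall()
print(rows)
```

The point of the sketch is that the rules live in the schema, so every application using the database gets the same guarantees.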
[Diagram: Traditional Approach. The user connects to a single server running the RDBMS (a centralized system).]
10. (10)
What we do now...
[Diagram: Distributed Approach. The user connects to a server on a distributed network made up of several database/server pairs (DB1 on Server1, DB2 on Server2, and further nodes).]
12. (12)
RDBMS Challenges
Modern applications present…
When we implement modern applications, we face new challenges with a traditional solution:
Expensive to scale up
Expensive to scale down
Hard to process high volumes near real-time
Requires DBAs to manage and tune
Designed for relational data
13. CAP THEOREM
Brewer’s Theorem
CONSISTENCY
Every read receives the most
recent write or an error
PARTITION TOLERANCE
The system continues to operate even if
communication between some of its parts fails
AVAILABILITY
Every request receives a (non-
error) response
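The trade-off behind these three properties shows up when a network partition occurs: a system must then either refuse requests (keeping consistency) or answer with possibly stale data (keeping availability). The toy model below is only an illustration; `TwoNodeStore`, its "CP"/"AP" modes and `demo` are invented for this sketch and do not correspond to any real database.

```python
class TwoNodeStore:
    """Two replicas, A and B. `partitioned=True` means they cannot communicate."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value_a = None
        self.value_b = None
        self.partitioned = False

    def write(self, value):
        # Writes arrive at replica A and are copied to B when the link is up.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot replicate during partition")
        self.value_a = value
        if not self.partitioned:
            self.value_b = value

    def read_from_b(self):
        # A CP system returns an error rather than risk a stale answer.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm latest write")
        return self.value_b


def demo(mode):
    store = TwoNodeStore(mode)
    store.write("v1")
    store.partitioned = True          # the network splits
    try:
        store.write("v2")
        return store.read_from_b()
    except RuntimeError as exc:
        return f"error ({exc})"


print(demo("AP"))   # answers, but with the stale value "v1"
print(demo("CP"))   # refuses instead of returning stale data
```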
15. (15)
NoSQL Database
What is a NoSQL Database…
A NoSQL database provides storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases
Popularized by large-scale systems built at Google and Amazon (AWS)
A set of characteristics, not a single defined product
Non-relational, Highly scalable
16. (16)
Why do we need NoSQL?
The Why…
Large amount of data being generated
Connections between data are growing
Adaptable to changing structure of data
Using advanced server architecture
Designed for non-relational data
When we need high availability
17. (17)
Main Use Cases
When do we use NoSQL databases?
• Large Data Volumes
Massively distributed architecture required to store data (Google,
Amazon, Facebook)
• Extreme Query Workload
Impossible to efficiently do joins at that scale with an RDBMS
• Schema Evolution
Schema changes are easy to accommodate with a NoSQL solution
18. (18)
PROS AND CONS
NoSQL
PROS
Massive Scalability
High Availability
Economical
Schema Flexibility
Sparse and semi-structured data
CONS
Limited query capabilities
Not standardized
Still developing
Less support
Limited support for business-related analytics
21. (21)
Big Table
NoSQL Database Structures
• Resembles a relational database in some ways, but does not support a full relational data model
• Designed to work with a lot of data… A REALLY BIG LOT of data
• Created by Google, now used by many others
• It is a sparse, distributed, persistent, multidimensional sorted map
• Indexed by row key, column key and timestamp
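The "sparse map indexed with a timestamp" idea can be sketched in memory. This is only a toy model of the data layout, not distributed or persistent, and the class `SparseTable` and its methods are hypothetical names rather than any real Bigtable API.

```python
from collections import defaultdict


class SparseTable:
    """Maps (row key, column key, timestamp) to a value; only written cells exist."""

    def __init__(self):
        self._cells = defaultdict(dict)   # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the latest version, or the latest one at or before `timestamp`."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def row_keys(self):
        """Row keys in sorted order, as a sorted map would expose them."""
        return sorted({row for row, _ in self._cells})


table = SparseTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 3, "<html>v3</html>")
print(table.get("com.example.www", "contents:"))               # latest version
print(table.get("com.example.www", "contents:", timestamp=2))  # version as of t=2
print(table.get("com.example.www", "anchor:other"))            # never written: None
```

Cells that were never written take no space, which is what makes the table sparse.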
22. (22)
Key Value
NoSQL Database Structures
• Each piece of data is stored in a single collection
• Each collection can have different types of data
• Values are opaque to the database: it only knows the key
• To find out what the value is we
need to query using the key
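A minimal sketch of the key-value idea, using a plain dictionary in place of a real store. `KeyValueStore` and the key "shape:B" are made up for illustration; the store treats each value as an opaque blob of bytes and can only look it up by key.

```python
import json


class KeyValueStore:
    """Stores opaque byte values; the only supported lookup is by exact key."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


store = KeyValueStore()
store.put("shape:B", json.dumps({"shape": "triangle", "color": "red"}).encode("utf-8"))

# The store cannot answer "which keys hold triangles?"; the application has to
# fetch the value by key and decode it itself.
raw = store.get("shape:B")
shape = json.loads(raw)["shape"]
print(shape)
```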
23. (23)
Document Store
NoSQL Database Structures
• Very similar to a key value
database
• Each collection can have different
types of data
• The difference is that the database can see inside the values, so you can query on their contents
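For contrast with a key-value store, here is a sketch in which the store understands the structure of its documents and can filter on their fields. `DocumentStore` and `find` are hypothetical names for this example, not the API of any real document database.

```python
class DocumentStore:
    """Stores dictionaries and can filter on the fields inside them."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document: dict) -> None:
        self._docs[doc_id] = document

    def find(self, **criteria):
        """Return ids of documents whose fields match all the given criteria."""
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


docs = DocumentStore()
docs.insert("A", {"shape": "circle"})
docs.insert("B", {"shape": "triangle", "color": "red"})
docs.insert("C", {"shape": "triangle", "sides": 3})   # documents need not share fields

print(docs.find(shape="triangle"))
```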
24. (24)
Graph Database
NoSQL Database Structures
• Focus is on modelling the structure of the data
• Inspired by graph theory
• Scales well with the complexity of the data's structure
• The use cases are mainly those where the relationships between items matter most
• Examples: machine learning, mapping, supply chain transparency
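To show what "modelling the structure" means in practice, the sketch below stores relationships as an adjacency list and answers a path question with breadth-first search. The supply-chain node names are invented, and real graph databases offer their own query languages for this kind of traversal.

```python
from collections import deque

# Directed edges: who ships to whom (hypothetical supply chain).
edges = {
    "Factory": ["Warehouse"],
    "Warehouse": ["Store A", "Store B"],
    "Store A": ["Customer"],
    "Store B": [],
    "Customer": [],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest list of nodes from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None   # goal is not reachable from start


print(shortest_path(edges, "Factory", "Customer"))
```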
26. (26)
What are the database
technologies we use in our system?
Infor Nexus
https://wiki.gtnexus.info/display/dev/Core+Data+Systems
27. (27)
Data Modeling
Why do we need NoSQL data modeling?
• Understand the data
• Plan the database structure
• Understand application-specific queries
• Document and communicate
design and content
28. (28)
Data Modeling Differences
Relational vs NoSQL data modeling
Relational
Fixed set of columns
Atomic fields
Highly normalized
Slow to change
Avoid duplication of data
NoSQL
Unstructured/Semi-Structured data
Aggregations of data
Highly denormalized
Rapidly changing
Duplication of data is supported
29. (29)
Denormalization
Replication of data…
• Copies the same data into multiple documents or tables
• Simplifies queries
• Optimizes query processing
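A small sketch of what denormalization looks like with plain Python data. The `customers` and `orders` structures and the `denormalize` helper are hypothetical; the idea is that customer fields are copied into every order document so a single read answers the query.

```python
# Normalized: customer details live in one place and orders refer to them by id.
customers = {1: {"name": "Alice", "city": "Colombo"}}
orders = [{"order_id": 100, "customer_id": 1, "total": 25.0}]


def denormalize(orders, customers):
    """Copy the customer fields into each order document."""
    return [
        {**order, "customer": dict(customers[order["customer_id"]])}
        for order in orders
    ]


order_docs = denormalize(orders, customers)

# One lookup now answers "who placed order 100?" without a second query.
print(order_docs[0]["customer"]["name"])
```

The cost of this choice is on the write side: if a customer's details change, every copy has to be updated.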
30. (30)
Application side JOINs
Joins are not encouraged…
• Joins are rarely supported in NoSQL solutions.
• Many-to-many relationships are often modeled as joins performed in the application
• We can use aggregations where
possible
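An application-side join simply means fetching the related collections separately and combining them in code. The sketch below uses in-memory lists in place of database queries; the student and course data and the function name `courses_by_student` are invented for illustration.

```python
# Three separately fetched collections modelling a many-to-many relationship.
students = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Ben"}]
courses = [{"id": "db", "title": "Databases"}, {"id": "ml", "title": "Machine Learning"}]
enrollments = [
    {"student_id": 1, "course_id": "db"},
    {"student_id": 1, "course_id": "ml"},
    {"student_id": 2, "course_id": "db"},
]


def courses_by_student(students, courses, enrollments):
    """Join the three collections in application code using lookup dictionaries."""
    titles = {course["id"]: course["title"] for course in courses}
    names = {student["id"]: student["name"] for student in students}
    result = {name: [] for name in names.values()}
    for enrollment in enrollments:
        student_name = names[enrollment["student_id"]]
        result[student_name].append(titles[enrollment["course_id"]])
    return result


print(courses_by_student(students, courses, enrollments))
```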
32. (32)
Elasticsearch
What is Elasticsearch?
Elasticsearch is a search
engine based on the Lucene
library
• It was developed in Java
• Multitenant-capable
• Full text search engine
• Works over an HTTP interface
• Stores JSON documents
• Official clients are available for Java, .NET (C#), PHP, Python, Apache Groovy and Ruby
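Because the interface is HTTP with JSON bodies, requests can be put together with nothing but the standard library. The sketch below only builds the requests and does not send them. The address `http://localhost:9200`, the index name `talks` and the helper names are assumptions for illustration, and the `_doc` and `_search` paths follow recent Elasticsearch versions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"   # assumption: a local single-node cluster
INDEX = "talks"                      # hypothetical index name


def index_request(doc_id, document):
    """Build a PUT /<index>/_doc/<id> request that stores one JSON document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(field, text):
    """Build a POST /<index>/_search request with a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


put_req = index_request(1, {"title": "Big data and NoSQL"})
req = search_request("title", "big data")
print(put_req.get_method(), put_req.full_url)
print(req.get_method(), req.full_url)
# To send a request against a running cluster: urllib.request.urlopen(req)
```

In practice one of the official clients would usually be preferred over hand-built requests.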
Atomicity: All of a transaction happens or none of it does
Consistency: A database is initially in a consistent state, and it should remain consistent after every transaction.
Isolation: One transaction cannot read data from another transaction that is not yet completed
Durability: Once a transaction is complete, it is guaranteed to have been written to a durable medium
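Atomicity is the easiest of these properties to demonstrate. The sketch below uses Python's built-in `sqlite3` module; the `account` table and the `transfer` and `balances` helpers are hypothetical. A transfer that would overdraw an account fails partway through and leaves both balances unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(amount):
    """Move money from alice to bob; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bob'", (amount,)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'alice'", (amount,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))


print(transfer(150), balances())   # second update violates the CHECK: nothing changes
print(transfer(40), balances())    # both updates succeed together
```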
Basically available: No locking scenarios. Even if a node fails, the system should still work and remain available.
Soft state: The state of the system may keep changing over time.
Q: What’s there in B? A: A triangle
Examples: Redis, Dynamite, Voldemort
Q: Ask B, bring me the triangles? A: Here you go
Examples: CouchDB, MongoDB, Apache Lucene, Elasticsearch