The concept of Big Data in the context of real-world data scenarios.
Presented at DotNetters Tech Summit - 2015, RUET
Presenter: Md. Delwar Hiossain
Event URL: https://www.facebook.com/events/512834685530439/
5. Limits of traditional systems
Traditional software tools struggle with large files:
100 MB document: unable to send
100 GB image: unable to view
100 TB video: unable to edit
6. Data comes from many sources:
Social media sites
Sensors
Business transactions
Location-based data
7. Volume: large volumes of data
Velocity: rapidly moving data
Variety: structured, unstructured, images, audio, video, etc.
8. Business value: outcomes tied to the business strategy, resulting from better business decisions.
Higher productivity, faster time to complete tasks
Lower total cost of ownership and greater efficiency in IT
10. Scalability
Semi-structured and unstructured data
High velocity
11. Schemaless: the data structure is not predefined
Focus on retrieval of data and on appending new data
Focus on key-value stores that can be used to locate data objects
Focus on supporting storage of large quantities of unstructured data
SQL is not used for storage or retrieval of data
No ACID guarantees (atomicity, consistency, isolation, durability)
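The schemaless key-value idea on this slide can be sketched in a few lines. This is a minimal illustration, not a real NoSQL engine: values are free-form dicts, so records in the same store need not share any fields, and there is no schema, SQL layer, or transaction support. The `post:1`-style key names are invented for the example.

```python
class KeyValueStore:
    """Toy schemaless key-value store: values are arbitrary objects."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Append or overwrite; no schema is enforced on the value.
        self._data[key] = value

    def get(self, key, default=None):
        # Retrieval is by key only, the primary access path in a KV store.
        return self._data.get(key, default)


store = KeyValueStore()
store.put("post:1", {"author": "delwar", "tag": ["tools", "database"]})
store.put("post:2", {"title": "untitled"})  # different fields: allowed
print(store.get("post:1")["author"])  # delwar
```

Note that both records coexist despite having entirely different shapes, which is exactly what "data structure is not predefined" means in practice.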
12. Built for the cloud
Scale-out architecture
Agility afforded by cloud computing
Sharding automatically distributes data evenly across multi-node clusters
Automatically manages redundant servers (replica sets)
Horizontal scalability
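One common way sharding spreads data evenly, hinted at on this slide, is hashing each key to pick a node. The sketch below shows that idea only; it is not MongoDB's actual shard-key machinery (which involves config servers and ranged or hashed shard keys), and the `doc-N` key names and four-shard cluster are assumptions for illustration.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard by hashing it (a common sharding scheme)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Simulate distributing 1000 document keys across a 4-node cluster.
shards = {i: [] for i in range(4)}
for doc_id in (f"doc-{n}" for n in range(1000)):
    shards[shard_for(doc_id, 4)].append(doc_id)

# Hashing spreads the keys roughly evenly across the nodes.
print([len(docs) for docs in shards.values()])
```

Because the hash output is effectively uniform, each node ends up with close to a quarter of the keys, which is the "distributes data evenly" property the slide claims.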
Document-oriented data model, e.g. a MongoDB-style document:
{
  author: "delwar",
  date: new Date(),
  topics: "Big Data Concept",
  tag: ["tools", "database"]
}
Volume. A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests 500 terabytes of new data every day; a Boeing 737 generates 240 terabytes of flight data during a single flight across the US; smartphones proliferate along with the data they create and consume; and sensors embedded in everyday objects will soon produce billions of new, constantly updated data feeds containing environmental, location, and other information.
Velocity. Clickstreams and ad impressions capture user behavior at millions of events per second; high-frequency stock-trading algorithms reflect market changes within microseconds; machine-to-machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real time; online gaming systems support millions of concurrent users, each producing multiple inputs per second.
Variety. Big Data isn't just numbers, dates, and strings. It is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
Selecting a Big Data Technology: Operational vs. Analytical
The Big Data landscape is dominated by two classes of technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored; and systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. These classes of technology are complementary and frequently deployed together.
Operational and analytical workloads for Big Data present opposing requirements, and systems have evolved to address their particular demands separately and in very different ways; each has driven the creation of new technology architectures. Operational systems, such as NoSQL databases, focus on servicing many concurrent requests with low latency, typically over highly selective access criteria. Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and may touch most if not all of the data in the system at any time. Both kinds of system tend to run over many servers operating in a cluster, managing tens or hundreds of terabytes of data across billions of records.
Operational Big Data
For operational Big Data workloads, NoSQL Big Data systems such as document databases have emerged to address a broad set of applications, and other architectures, such as key-value stores, column family stores, and graph databases are optimized for more specific applications. NoSQL technologies, which were developed to address the shortcomings of relational databases in the modern computing environment, are faster and scale much more quickly and inexpensively than relational databases.
Critically, NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.
In addition to user interactions with data, most operational systems need to provide some degree of real-time intelligence about the active data in the system. For example in a multi-user game or financial application, aggregates for user activities or instrument performance are displayed to users to inform their next actions. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
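The kind of real-time intelligence described above can be approximated by maintaining per-user aggregates as events stream in, so current totals are available without a separate batch pass. This is a minimal single-process sketch; the event shape (`user`, `score` fields) is an assumption for illustration, not a fixed API of any NoSQL system.

```python
from collections import defaultdict

# Running per-user aggregates, updated as each event arrives.
totals = defaultdict(lambda: {"events": 0, "score": 0})

def ingest(event):
    """Fold one incoming event into that user's running totals."""
    agg = totals[event["user"]]
    agg["events"] += 1
    agg["score"] += event["score"]

stream = [
    {"user": "a", "score": 10},
    {"user": "b", "score": 5},
    {"user": "a", "score": 7},
]
for event in stream:
    ingest(event)

print(totals["a"])  # {'events': 2, 'score': 17}
```

At any point mid-stream, `totals` reflects everything seen so far, which is what lets a game or financial dashboard show users up-to-date aggregates to inform their next action.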
Analytical Big Data
Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database systems and MapReduce. These technologies are also a reaction to the limitations of traditional relational databases, in particular their inability to scale beyond the resources of a single server. Furthermore, MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL.
As applications gain traction and their users generate increasing volumes of data, there are a number of retrospective analytical workloads that provide real value to the business. Where these workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native MapReduce functionality that allows for analytics to be performed on operational data in place. Alternately, data can be copied from NoSQL systems into analytical systems such as Hadoop for MapReduce.
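The MapReduce model referred to above can be sketched in miniature: a map phase emits (key, value) pairs, a shuffle groups pairs by key, and a reduce phase folds each group. The classic word-count example below runs in a single process purely to show the programming model; real MapReduce frameworks such as Hadoop distribute each phase across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit one (word, 1) pair per word in the document.
    for word in doc.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values under their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data concept", "big data tools"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 2
```

Because map and reduce are independent per document and per key, each phase parallelizes naturally, which is why the model scales to the retrospective, whole-dataset analyses described in this section.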