2. What is Big Data?
• Big data is an all-encompassing term for any
collection of data sets so large and complex
that they are difficult to process using on-hand
data management tools or traditional data
processing applications.
• Big Data refers to extremely large volumes of
multi-structured data that have typically been
cost-prohibitive to store and analyze.
3. In the simplest terms, Big Data can be broken
down into two basic types: structured and
unstructured data.
• Structured – data with a predefined data model
• Spreadsheets, Oracle relational databases
• Unstructured – data with no predefined data model,
or not organized in a predefined manner
• Video, audio, images, metadata, etc.
• Semi-structured – structured data embedded
with some unstructured data
• Email, text messaging
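The distinction above can be made concrete with a short Python sketch (illustrative data only, not from any real system): structured records expose named fields, semi-structured data has a predictable envelope around free-form text, and unstructured bytes carry no inherent data model.

```python
import csv
import io

# Structured: rows conform to a predefined schema (fixed columns),
# like a spreadsheet or a relational table.
reader = csv.DictReader(io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n"))
rows = list(reader)

# Semi-structured: a predictable, parseable envelope (email headers)
# wrapping a free-form, unstructured body.
email = "From: alice@example.com\nSubject: Lunch\n\nSee you at noon?"
headers_text, _, body = email.partition("\n\n")
headers = dict(line.split(": ", 1) for line in headers_text.splitlines())

# Unstructured: raw bytes (e.g. image or audio data) with no data model;
# meaning must be extracted by specialized processing.
image_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a real image

print(rows[0]["name"])      # structured fields are addressable by name
print(headers["Subject"])   # only the envelope of the email is addressable
```

Structured fields can be queried directly; the email body and the raw bytes would need text mining or media processing before they can be analyzed.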
4. • Where does big data come from?
• A simple answer is ‘everywhere’.
• The sources we once ignored because of technical limitations
are treated as gold mines today.
• Big data may come from web logs, RFID tags, GPS systems,
sensor networks, social networks, IoT devices, search indices,
call detail records, scientific experiments such as nuclear
physics, medical records, military surveillance, photo archives,
video archives, e-commerce transactions, etc.
• Since the advent of data warehouses in the early 1990s,
companies have been storing relevant data in large volumes.
• Many believe that big data depends not only on the data itself;
variety, velocity, veracity, variability and value proposition
are also important aspects of Big Data.
5. • Covers varying types of data sources
Data can be streaming, batch, structured, unstructured or
semi-structured, depending on the information type, where
it comes from and its primary use. Big Data must be able
to accommodate all of these various types of data at very
large scale.
• Analytics
Big Data must provide mechanisms for ad-hoc queries,
data discovery and experimentation on large data sets,
to effectively correlate various events and data types
and produce an understanding of the data that is useful
and addresses business needs.
6. • Big data is typically defined by three “V”s :
– Volume,
– Variety and
– Velocity.
• In addition to these three, leading big data solution
providers added other Vs such as
– Veracity (IBM)
– Variability (SAS)
– Value Proposition
7. • Although there are a number of different
technologies that are useful in analyzing Big
Data, most of them share some common
characteristics.
• Three Big Data technologies stand out
from the rest:
– MapReduce
– Hadoop
– NoSQL
Big Data Technologies
8. • MapReduce is a technique popularized by Google
that distributes the processing of very large, multi-
structured data files across a large cluster of
machines.
• High performance is achieved by breaking the
processing into small units of work that can be run in
parallel across thousands of cluster nodes.
• MapReduce helps organizations process and
analyze large volumes of multi-structured data, for
example in graph analysis, text analysis, machine
learning and data transformation.
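The map/reduce pattern described above can be sketched in plain Python as a single-process word count (in a real cluster, a framework distributes the map and reduce work and performs the shuffle across machines; the function names here are illustrative, not any framework's API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (key, 1) pairs for each word in one input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # 2
```

Because each map call touches only its own input split and each reduce call touches only one key's values, both phases parallelize naturally across thousands of nodes.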
9. • Hadoop is an open source framework for processing,
storing and analyzing massive amounts of distributed,
unstructured data.
• Hadoop was inspired by MapReduce and was
designed to handle petabytes and exabytes of data.
• Rather than hammering away at a huge block of data
with a single machine, Hadoop breaks Big Data up into
multiple parts so that each part can be processed and
analyzed at the same time.
• Sources of data may include log files, social media
feeds and internal data sources.
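The split-and-process-in-parallel idea can be illustrated with a minimal Python sketch (a toy analogy, not Hadoop itself: HDFS splits files into blocks across many machines, while this uses threads on one machine and made-up log lines):

```python
from concurrent.futures import ThreadPoolExecutor

def count_errors(chunk):
    # Each worker scans only its own part of the data, independently.
    return sum(1 for line in chunk if "ERROR" in line)

log_lines = [
    "INFO start", "ERROR disk full", "INFO retry",
    "ERROR timeout", "INFO done", "ERROR disk full",
]

# Split the data set into parts (here: chunks of 2 lines),
# analogous to splitting a huge file into blocks across nodes.
chunks = [log_lines[i:i + 2] for i in range(0, len(log_lines), 2)]

# Process every part at the same time, then combine partial results.
with ThreadPoolExecutor() as pool:
    total_errors = sum(pool.map(count_errors, chunks))

print(total_errors)  # 3
```

The combining step (summing the per-chunk counts) mirrors how results from many Hadoop nodes are aggregated into one answer.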