13. Simple to start
What is the maximum file size you have dealt with so far?
Think of the movies, files, or streaming video you have used. What have you observed?
What is the maximum download speed you get?
Now do a simple computation: how much time would it take just to transfer that file? A sketch follows below.
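As a quick illustration (the 25 GB file and 100 Mbit/s link below are made-up values, not measurements), a few lines of Python estimate the raw transfer time:

```python
# Rough transfer-time estimate: time = size / speed.
# File size and link speed are illustrative assumptions.
file_size_gb = 25            # e.g., a large 4K movie (assumed)
download_mbps = 100          # advertised link speed in megabits/s (assumed)

file_size_bits = file_size_gb * 8 * 10**9          # GB -> bits (decimal units)
seconds = file_size_bits / (download_mbps * 10**6)

print(f"~{seconds / 60:.1f} minutes to transfer {file_size_gb} GB "
      f"at {download_mbps} Mbit/s")
# 25 GB at 100 Mbit/s is roughly 33 minutes at line rate; real transfers
# are slower once protocol overhead and congestion are included.
```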
14. Introduction to Big Data
Big Data is a term used for collections of data sets
so large and complex that they are difficult to store
and process using available database management
tools or traditional data processing applications.
The challenges include capturing, curating, storing,
searching, sharing, transferring, analyzing and
visualizing this data.
18. Big data spans three dimensions: Volume, Velocity and Variety
Volume: Volume refers to the ‘amount of data’, which is growing day
by day at a very fast pace.
The size of data generated by humans, machines and their interactions
on social media alone is massive.
Researchers predicted that 40 zettabytes (40,000 exabytes) would be
generated by 2020, a 300-fold increase over 2005.
20. VELOCITY
Velocity is defined as the pace at which different sources generate the data
every day.
This flow of data is massive and continuous. Facebook, for example,
reports 1.03 billion daily active users (DAU) on mobile, an increase of
22% year-over-year.
This shows how fast the number of users is growing on social media and
how fast data is being generated every day.
21. Examples of Big Data
Every day we upload enormous volumes of data; 90% of the world’s data has
been created in the last two years.
22. Examples of Big Data
Walmart handles more than 1 million customer transactions every hour.
Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
More than 230 million tweets are created every day.
More than 5 billion people are calling, texting, tweeting and browsing on
mobile phones worldwide.
YouTube users upload 48 hours of new video every minute of the day.
23. Contd.
Amazon processes 15 million customer clickstream records per day to
recommend products.
294 billion emails are sent every day; email services analyze this data to
filter out spam.
Modern cars have close to 100 sensors monitoring fuel level, tire
pressure, and so on; each vehicle generates a large amount of sensor data.
24. Traits of Big data
The eight (8) ‘V’ Dimension Characteristics of Big Data:
Part One: Volume, Velocity, Variety
Part Two: Variability (Unpredictability), Veracity
(Reliability), Virality (Circulated rapidly), Visualization and
Value.
25. Veracity
Big Data veracity refers to the biases, noise and abnormality in data: is
the data being stored and mined meaningful to the problem being
analyzed? Inderpal feels that veracity is the biggest challenge in data
analysis when compared to things like volume and velocity.
Validity
Akin to veracity is the issue of validity: is the data correct
and accurate for the intended use? Clearly, valid data is key to making the
right decisions.
26. Volatility
Big data volatility refers to how long data is valid and how
long it should be stored.
In this world of real-time data, you need to determine at
what point data is no longer relevant to the current
analysis.
27. Challenges of Conventional Systems
• Conventional analytical tools and techniques are
inadequate to handle data that is unstructured
(like text data), that is too large in size, or that is
growing rapidly like social media data.
• A cluster analysis on a 200MB file with 1 million
customer records is manageable, but the same
cluster analysis on 1000GB of Facebook customer
profile information will take a considerable
amount of time if conventional tools and
techniques are used.
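As a rough illustration of the manageable small-scale case (scikit-learn and the synthetic data below are my assumptions, not part of the slide), a conventional in-memory cluster analysis on about a million records looks like this:

```python
# Minimal sketch: in-memory k-means on ~1 million synthetic "customer" records.
# Assumes numpy and scikit-learn; the data and k=5 are made up.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
customers = rng.normal(size=(1_000_000, 10))   # 1M records, 10 features

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)         # fits comfortably in RAM

print(labels[:10])
# At 1000 GB the data no longer fits in a single machine's memory, so this
# single-node approach breaks down; that is where distributed tools come in.
```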
28. Challenges of Conventional Systems
• Facebook as well as entities like Google and
Walmart generate data in petabytes every day.
• Traditional analytics works on a known data
environment, and only on data that is already well
understood. It cannot work on unstructured
data efficiently.
• Traditional analytics is built on top of the
relational data model: relationships between
the subjects of interest have been created
inside the system, and the analysis is done
based on them. This approach is not
adequate for big data analytics.
29. Challenges of Conventional Systems
• Traditional analytics is batch oriented: we need to
wait for nightly ETL (extract, transform and load) and
transformation jobs to complete before the required
insight is obtained.
• Parallelism in a traditional analytics system is achieved
through costly hardware such as MPP
(Massively Parallel Processing) systems.
30. Other Challenges of Conventional
Systems
• Data challenges
• Volume, velocity, veracity, variety
• Data discovery and comprehensiveness
• Scalability
• Process challenges
• Capturing Data
• Aligning data from different sources
• Transforming data into suitable form for data analysis
• Modeling data (Mathematically, simulation)
• Understanding output, visualizing results, and display issues
on mobile devices.
31. Traits or Characteristics of Big Data
The eight (8) ‘V’ Dimension Characteristics of Big
Data:
Part One: Volume, Velocity, Variety
Part Two: Variability (Unpredictability), Veracity
(Reliability), Virality (Circulated rapidly),
Visualization and Value.
32. Characteristics of Big Data
The original three ‘V’ dimension characteristics of
Big Data, identified in 2001, are Volume, Velocity and Variety.
34. Features of Big Data -
Security, Compliance,
Auditing and Protection
The sheer size of a Big Data repository brings with it a major
security challenge, generating the age-old question presented to
IT: How can the data be protected?
Steps to Securing Big Data
Classifying Data
Protecting Big Data Analytics
Big Data and Compliance
The Intellectual Property Challenge
35. Security, Compliance, Auditing
and Protection
Data Access:
Data can be easily protected, but only if you eliminate access to the data.
That’s not a pragmatic solution, to say the least. The key is to control access,
but even then, knowing the who, what, when, and where of data access is only
a start.
Data availability:
Controlling where the data are stored and how the data are distributed. The
more control you have, the better you are positioned to protect the data.
36. Security, Compliance,
Auditing and Protection
Performance:
Higher levels of encryption, complex security methodologies, and
additional security layers can all improve security. However, these
security techniques all carry a processing burden that can severely
affect performance.
Liability:
Accessible data carry with them liability, such as the sensitivity of the
data, the legal requirements connected to the data, privacy issues, and
intellectual property concerns.
37. PRAGMATIC STEPS TO SECURING BIG DATA
First, get rid of data that are no longer needed. If you do not need
certain information, it should be destroyed, because it represents a risk
to the organization.
Some information cannot legally be destroyed; in that case, the information
should be securely archived by an offline method.
The real challenge is to decide which data are needed, because value can be
found in unexpected places. For example, getting rid of activity logs may
be a smart move from a security standpoint, yet those same logs may later
prove valuable for analytics.
38. Classifying data
Protecting data becomes much easier if the data are classified—that is, the
data should be divided into appropriate groupings for management
purposes.
For example, internal e-mails between two colleagues should not be secured or
treated the same way as financial reports, human resources (HR) information, or
customer data.
Classification can become a powerful tool for determining the sensitivity of
data.
A simple approach may just include classifications such as financial, HR, sales,
inventory, and communications, each of which is self-explanatory and offers
insight into the sensitivity of the data.
o Once organizations better understand their data, they can take important
steps to segregate the information, which will make the deployment of
security measures like encryption and monitoring more manageable.
39. PROTECTING BIG DATA ANALYTICS
The real cause of concern is the fact that Big Data contains all of the
things you don’t want to see when you are trying to protect data.
Big Data can contain unique sample sets, for example, data
from devices that monitor physical elements (e.g., traffic, movement,
soil pH, rain, wind) on a frequent schedule, accumulated
continuously and in real time.
All of the data are unique to the moment, and if they are lost, they are
impossible to recreate.
That uniqueness also means you cannot leverage time-saving backup
preparation and security technologies, such as deduplication.
40. This greatly increases the capacity requirements for backup
subsystems, slows down security scanning, makes it harder to detect
data corruption, and complicates archiving.
There is also the issue of the large size and number of files often
found in Big Data analytic environments.
In order for a backup application and associated appliances or
hardware to churn through a large number of files, bandwidth to the
backup systems and/or the backup appliance must be large, and the
receiving devices must be able to ingest data at the rate that the data
can be delivered.
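To make the bandwidth point concrete (the 500 TB data set and 8-hour window below are assumptions for illustration), a quick calculation shows the sustained ingest rate a backup window implies:

```python
# Back-of-the-envelope: what ingest rate does a backup window imply?
# The data size and window length are illustrative assumptions.
data_tb = 500              # data to back up (assumed)
window_hours = 8           # nightly backup window (assumed)

bytes_total = data_tb * 10**12
seconds = window_hours * 3600
gbps = bytes_total * 8 / seconds / 10**9

print(f"Backing up {data_tb} TB in {window_hours} h needs "
      f"~{gbps:.0f} Gbit/s sustained to the backup target")
# 500 TB in 8 hours works out to roughly 139 Gbit/s end to end, which both
# the network and the receiving backup appliance must be able to sustain.
```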
41. BIG DATA AND COMPLIANCE
Compliance has a major effect on how Big Data is protected, stored,
accessed, and archived.
Big Data is not easily handled by the RDBMS; this makes it harder
to understand how compliance affects the data.
Big Data is transforming the storage and access paradigms to an
emerging new world of horizontally scaling, unstructured databases,
which are better at solving some old business problems through
analytics.
New data types and methodologies are still expected to meet the
legislative requirements expected by compliance laws.
42. Health care probably provides the best example for those charged
with compliance as they examine how Big Data creation, storage,
and flow work in their organizations.
Electronic health record systems, driven by the Health Insurance
Portability and Accountability Act (HIPAA), store personal
information that must be protected.
Unfortunately, most of the data stores in use today—including
Hadoop, Cassandra, and MongoDB—do not incorporate sufficient
data security tools to provide enterprises with the peace of mind that
confidential data will remain safe and secure at all times.
43. THE INTELLECTUAL PROPERTY CHALLENGE
One of the biggest issues around Big Data is the concept of
intellectual property (IP).
IP refers to creations of the human mind, such as inventions, literary
and artistic works, and symbols, names, images, and designs used in
commerce.
Between 1985 and 2010, the number of patents granted worldwide
rose from slightly less than 400,000 to more than 900,000, an increase
of more than 125 percent over one generation (25 years).
The same IP concepts simply have to be extended into the realm of Big
Data. Some basic rules are as follows:
44. Understand what IP is and know what you have to protect:
Determine what needs protection, how to protect it, and whom to protect it
from. To do so, those responsible for IP security in IT (usually a computer
security officer, or CSO) must communicate on an ongoing basis with the
executives who oversee intellectual capital, meeting at least quarterly.
Corporate leaders will be the foundation for protecting IP.
o Prioritize protection:
CSOs with extensive experience normally recommend doing a risk and
cost-benefit analysis. This requires you to create a map of your company’s
assets and determine what information, if lost, would hurt your company
the most. This helps you figure out where to best allocate your protective
efforts.
46. Label:
Confidential information should be labeled appropriately. If company
data are proprietary, note that on every log-in screen.
o Lock it up:
Physical as well as digital protection schemes are a must. Rooms that
store sensitive data should be locked. This applies to everything from the
server farm to the file room. Keep track of who has the keys, always use
complex passwords, and limit employee access to important databases.
o Educate employees.
o Know your tools:
Those tools can locate sensitive documents and keep track of how they
are being used and by whom.
o Use a counterintelligence mind-set:
If you were spying on your own company, how would you do it?
47. These guidelines can be applied to almost any information
security paradigm that is geared toward protecting IP. The
same guidelines can be used when designing IP protection for a
Big Data platform.
48. Analysis vs Reporting
Where does "Reporting" stop and "Analytics" kick in?
Let us try to understand the differences first.
While Reporting provides data, Analytics is supposed to
provide answers.
Reporting is typically standardized, while Analytics is
customized.
Reporting has a stringent format, while Analytics is
flexible.
Reporting provides what is typically asked for,
while Analytics caters to the underlying need.
49. Analysis vs Reporting
The output of Reporting is in the form of
canned reports, dashboards and alerts,
while Analytics produces presentations
comprising insights, recommended actions,
and a forecast of their impact on the company.
Reporting includes building, configuring,
consolidating, organizing, formatting, and
summarizing data, while Analytics consists of
questioning, examining, interpreting, predicting
and prescribing.
50. Analysis vs Reporting
Both reporting and analysis play their roles in
influencing and driving the actions in an
organization, with the ultimate goal of value
maximization.
51. Analysis vs Reporting
• Canned reports:
• These are the out-of-the-box and custom reports that
you can access within the analytics tool.
• In general, some canned reports are more valuable
than others, and a report’s value may depend on how
relevant it is to an individual’s role (e.g., SME or
specialist vs. web producer).
• Dashboards:
• These custom-made reports combine different KPIs
and reports to provide a comprehensive, high-level
view of business performance for specific audiences.
Dashboards may include data from various data
sources and are also usually fairly static.
52. Analysis vs Reporting
• Alerts:
• These conditional reports are triggered when data
falls outside of expected ranges or some other pre-
defined criterion is met. Once people are notified of
what happened, they can take appropriate action as
necessary.
53. Analysis vs Reporting
The first hurdle is the initial confusion between Reporting
and Analytics; overcoming it is the leap toward deriving
the real benefits of analysis.
54. Analysis vs Reporting
The four stages of the Analytics maturity model:
Descriptive (Pure play Reporting)
Diagnostic
Predictive
Prescriptive
55. Analytic Processes and Tools
• The process of examining large data sets containing
a variety of data types – i.e., Big Data – to uncover market
trends, customer preferences, and other useful
information.
• Companies and enterprises that implement Big
Data Analytics often realize several business
benefits, such as more effective marketing campaigns, new
revenue opportunities, improved customer service
delivery, more efficient operations, and competitive
advantages.
56. Analytic Processes and Tools
• Companies implement Big Data Analytics
because they want to make more informed
business decisions.
• Big Data Analytics gives analytics professionals,
such as data scientists and predictive modelers,
the ability to analyze Big Data from multiple
and varied sources, including transactional data
and other structured data.
57. Analytic Processes and Tools
Types of Big Data Analytics Tools
Big Data Analytics tools are important for companies and
enterprises because of the huge volume of Big Data now
generated and managed by modern organizations.
Big Data Analytics tools also help businesses save time and
money in gaining insights to inform data-driven decisions.
The different types of Big Data Analytics tools are:
Data storage and management, Data cleaning, Data
mining, Data analysis, Data visualization, Data Integration,
and Data collection.
58. Types of Big Data Analytics Tools and Environment
59. Modern Data Analytic Tools
• Whenever analysts or journalists assemble lists of the top
trends for this year, "big data" is almost certain to be on
the list.
• Big data isn't really a new concept. Computers have
always worked with large and growing sets of data, and
we've had databases and data warehouses for years.
• What is new is… how much bigger that data is, how
quickly it is growing and how complicated it is.
Enterprises understand that the data in their systems
represents a wealth of insights that could help them improve their
processes and their performance.
• But they need tools that will allow them to collect and
analyze that data.
60. Modern Data Analytic Tools
• Interestingly, many of the best and best known
big data tools available are open source
projects. The very best known of these is
Hadoop, which is spawning an entire industry
of related services and products.
• Alongside Hadoop there are 49 other notable big data
projects: many Apache projects related to Hadoop, as
well as open source NoSQL databases,
business intelligence tools, development tools
and much more.
• Here they are…
61. Modern Data Analytic Tools
Big Data Companies: The Leaders
Tableau
Tableau started out by offering visualization techniques for
exploring and analyzing relational databases and data cubes
and has expanded to include Big Data research.
It offers visualization of data from any source, from Hadoop to
Excel files.
New Relic
New Relic uses a SaaS model for monitoring Web and mobile
applications in real-time that run in the cloud, on-premises, or
in a hybrid mix.
Its plug-ins cover PaaS/cloud services, caching, databases,
Web servers and queuing.
62. Modern Data Analytic Tools
IBM
IBM offers cloud services for massive compute scale
through its SoftLayer subsidiary. On the software side, its
DB2, Informix and InfoSphere platforms support Big Data analytics,
while its Cognos and SPSS analytics software specialize in BI.
IBM also positions InfoSphere as its data integration and data
warehousing platform for Big Data scenarios.
VMware
VMware has incorporated Big Data into its flagship
virtualization product, called VMware vSphere Big Data
Extensions. BDE is a virtual appliance that enables
administrators to deploy and manage the Hadoop clusters
under vSphere. It supports a number of Hadoop
distributions, including Apache, Cloudera, Hortonworks,
MapR and Pivotal.
63. Modern Data Analytic Tools
SAP
SAP's main Big Data tool is its HANA. It can run analytics on
80 terabytes of data and integrates with Hadoop. It can also
perform advanced analytics, like predictive analytics, spatial
data processing, text analytics, text search, streaming
analytics, and graph data processing and has ETL (Extract,
Transform, and Load) capabilities.
Oracle
Oracle has its Big Data Appliance, bundled with a number of software
products. They include Oracle NoSQL Database, Apache Hadoop,
Oracle Data Integrator Application Adapter for Hadoop, Oracle
Loader for Hadoop, the Oracle R Enterprise tool for the R
programming language, Oracle Linux, and so on.
64. Modern Data Analytic Tools
Pentaho
Pentaho is a suite of open source-based tools for business
analytics that has expanded to cover Big Data. The suite
offers data integration, OLAP services, reporting, a
dashboard, data mining and ETL capabilities. Pentaho for
Big Data is a data integration tool specifically
designed for executing ETL jobs in and out of Big Data
environments such as Apache Hadoop or Hadoop
distributions on Amazon and Cloudera.
Thoughtworks
Thoughtworks incorporates Agile software development
principles into building Big Data applications through its Agile
Analytics product. It builds applications for data warehousing
and business intelligence using the fast-paced Agile process
for quick and continuous delivery of new applications to
extract insight from data.
65. Modern Data Analytic Tools
Amazon Web Services
Amazon has a number of enterprise Big Data platforms,
including the Hadoop-based Elastic MapReduce, Kinesis
Firehose for streaming massive amounts of data into AWS,
Kinesis Analytics to analyze that data, the DynamoDB NoSQL
big data database, and HBase. All of these services work
within the greater Amazon Web Services offering.
Microsoft
It has a partnership with Hortonworks and offers the
HDInsight tool for analyzing structured and
unstructured data on Hortonworks. SQL Server 2016 comes
with a connector to Hadoop for Big Data processing, and
Microsoft recently acquired Revolution Analytics, which made
the only Big Data analytics platform written in R, a
programming language for building Big Data apps without
requiring the skills of a data scientist.
66. Modern Data Analytic Tools
Some other Big Data companies are:
• Tibco Jaspersoft
• Google
• Mu Sigma
• HP Enterprise
• Big Panda
• Cogito
• Alation
• Splunk
• ……
67. Modern Data Analytic Tools
Open Source Big Data Analysis Platforms and Tools
Hadoop
You simply can't talk about big data without
mentioning Hadoop. The Apache distributed data processing
software is so pervasive that often the terms "Hadoop" and "big
data" are used synonymously. The Apache Foundation also
sponsors a number of related projects that extend the
capabilities of Hadoop. In addition, numerous vendors offer
supported versions of Hadoop and related technologies.
Operating System: Windows, Linux, OS X.
68. Modern Data Analytic Tools
MapReduce
Originally developed by Google, MapReduce is described
on its website as "a programming
model and software framework for writing applications
that rapidly process vast amounts of data in parallel on
large clusters of compute nodes." It is used by Hadoop,
as well as many other data processing applications.
Operating System: OS Independent.
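As a minimal sketch of the programming model (plain single-process Python; the classic word-count task is my illustrative choice, not from the slide), map emits key-value pairs and reduce aggregates them:

```python
# Minimal word-count sketch of the MapReduce model in plain Python.
# Real frameworks such as Hadoop distribute these phases across a cluster.
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reducer: group pairs by key and sum the counts.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(dict(reduce_phase(pairs)))   # e.g. 'the' -> 3, 'fox' -> 2
```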
69. • GridGain
• It offers an alternative to Hadoop's MapReduce that is
compatible with the Hadoop Distributed File System.
• It offers in-memory processing for fast analysis of
real-time data. You can download the open source version
from GitHub or purchase a commercially supported
version from the link above.
• Operating System: Windows, Linux, OS X.
• HPCC Systems
Developed by LexisNexis Risk Solutions, HPCC
Systems is short for "high performance computing cluster."
It claims to offer superior performance to Hadoop.
• Both free community versions and paid enterprise
versions are available. Operating System: Linux.
Open Source Big Data Analysis Platforms and Tools
70. • Storm
• Now owned by Twitter, Storm offers distributed
real-time computation capabilities and is often
described as the "Hadoop of realtime." It's highly
scalable, robust, fault-tolerant and works with nearly
all programming languages.
• Operating System: Linux.
71. Open Source Big Data Business Intelligence Tools
• Talend
• Talend offers Open Studio for Big Data, a set of data
integration tools that support Hadoop, HDFS, Hive, HBase
and Pig.
• The company also sells an enterprise edition and other
commercial products and services.
• Operating System: Windows, Linux, OS X.
• Jaspersoft
• It claims to make "the most flexible, cost effective and widely
deployed business intelligence software in the world."
72. • Jedox
• The open source Palo Suite includes an OLAP Server,
Palo Web, Palo ETL Server and Palo for Excel.
• Jedox offers commercial software based on the same
tools.
• Operating System: OS Independent.
• SpagoBI
• It claims to be "the only entirely open source business
intelligence suite." Commercial support, training and
services are available.
• Operating System: OS Independent.
73. Statistical Concepts: Sampling
Distributions
The sampling distribution is a distribution of a sample statistic. While the concept of a distribution of a
set of numbers is intuitive for most students, the concept of a distribution of a set of statistics is not.
The sampling distribution is a model of a distribution of scores, like the population distribution, except
that the scores are not raw scores but statistics. It is a thought experiment: "What would the world be
like if a person repeatedly took samples of size N from the population distribution and computed a
particular statistic each time?" The resulting distribution of statistics is called the sampling distribution
of that statistic.
For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of
the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again
computed. If this process were repeated an infinite number of times, the distribution of the now infinite
number of sample means would be called the sampling distribution of the mean.
Every statistic has a sampling distribution. For example, suppose that instead of the mean, medians were
computed for each sample. The infinite number of medians would be called the sampling distribution of
the median.
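A short simulation makes the thought experiment concrete (a minimal sketch; the exponential population, N=16, and 100,000 replicates are my illustrative choices):

```python
# Simulate the sampling distribution of the mean.
# The exponential population and the replicate count are assumptions.
import numpy as np

rng = np.random.default_rng(42)
N = 16                 # sample size, matching the slide's example
trials = 100_000       # stand-in for "infinitely many" samples

# Draw many samples of size N and record each sample mean.
sample_means = rng.exponential(scale=2.0, size=(trials, N)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f} (population mean = 2.0)")
print(f"std of sample means:  {sample_means.std():.3f} (theory: 2.0/sqrt(16) = 0.5)")
# Using np.median(..., axis=1) instead of .mean(axis=1) would yield the
# sampling distribution of the median.
```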
74. Re-Sampling
In statistics, resampling is any of a variety of methods for doing one of the
following:
Estimating the precision of sample statistics (medians, variances, percentiles)
by using subsets of available data (jackknifing) or drawing randomly with
replacement from a set of data points (bootstrapping; see the sketch after
this list).
Exchanging labels on data points when performing significance tests
(permutation tests, also called exact tests, randomization tests, or
re-randomization tests).
Validating models by using random subsets (bootstrapping, cross-validation).
Common resampling techniques include bootstrapping, jackknifing and
permutation tests.
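A minimal bootstrap sketch, assuming NumPy and a small made-up data set (the statistic, the ten data values, and the 10,000 replicates are my choices):

```python
# Bootstrap estimate of the standard error of the median.
# The data values and replicate count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = np.array([3.1, 4.7, 2.9, 5.3, 4.1, 3.8, 6.0, 2.5, 4.4, 5.1])

# Resample with replacement and recompute the statistic each time.
reps = rng.choice(data, size=(10_000, data.size), replace=True)
medians = np.median(reps, axis=1)

print(f"sample median: {np.median(data):.2f}")
print(f"bootstrap standard error of the median: {medians.std():.2f}")
```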
75. Statistical Inference
Statistical Inference, Model & Estimation
Recall, statistical inference aims at learning characteristics of the population
from a sample; the population characteristics are parameters and the sample
characteristics are statistics.
A statistical model is a representation of the complex phenomenon that generated
the data.
It has mathematical formulations that describe relationships between random
variables and parameters.
It makes assumptions about the random variables, and sometimes about the parameters.
A general form: data = model + residuals.
The model should explain most of the variation in the data.
Residuals are a representation of lack-of-fit, that is, of the portion of the data
unexplained by the model.
76. Estimation is the process of learning and determining the population parameter based on
the model fitted to the data.
Point estimation, interval estimation, and hypothesis testing are the three main ways of learning
about the population parameter from the sample statistic.
An estimator is a particular example of a statistic; it becomes an estimate when the formula is
evaluated with actual observed sample values.
Point estimation = a single value that estimates the parameter. Point estimates are single values
calculated from the sample.
Confidence intervals = a range of values for the parameter. Interval estimates are intervals within
which the parameter is expected to fall, with a certain degree of confidence.
Hypothesis tests = tests for a specific value(s) of the parameter.
In order to perform these inferential tasks, i.e., make inference about the unknown population parameter
from the sample statistic, we need to know the likely values of the sample statistic. What would happen if
we sampled many times?
We need the sampling distribution of the statistic. It depends on the model assumptions about the
population distribution, and/or on the sample size.
Standard error: the standard deviation of a sampling distribution.
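To tie these together, here is a minimal sketch computing a point estimate and a 95% confidence interval for a mean (the ten data values and the normal approximation with z = 1.96 are my assumptions):

```python
# Point estimate and ~95% confidence interval for a population mean,
# using the normal approximation; the data values are made up.
import math

data = [5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 5.0, 5.6, 5.3, 4.7]
n = len(data)

mean = sum(data) / n                                  # point estimate
var = sum((x - mean) ** 2 for x in data) / (n - 1)    # sample variance
se = math.sqrt(var / n)                               # standard error

z = 1.96                                              # ~95% normal quantile
print(f"point estimate: {mean:.3f}")
print(f"95% CI: ({mean - z * se:.3f}, {mean + z * se:.3f})")
```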
77. Prediction Error
Prediction error is a discontinuity attribute that
removes the predictable image components and
reveals the unpredictable.
To use prediction error as a discontinuity attribute -
the original goal and starting point of my project - one
has to devise a prediction-error computation that
predicts and removes the plane-wave volumes of
sedimentary layers but that is incapable of predicting
the discontinuities.
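In the simplest general setting (a one-step linear extrapolation predictor and a toy 1-D series, both my assumptions rather than the plane-wave scheme described above), prediction error is just the residual between the predicted and the actual next value:

```python
# Prediction error of a naive one-step linear predictor on a toy 1-D series.
# The series and the predictor are illustrative assumptions, not the
# plane-wave computation described in the slide.
signal = [2.0, 4.0, 6.0, 8.0, 10.0, 5.0, 7.0, 9.0]  # jump at index 5

errors = []
for t in range(2, len(signal)):
    predicted = 2 * signal[t - 1] - signal[t - 2]   # extrapolate last slope
    errors.append(signal[t] - predicted)            # prediction error

print(errors)   # [0.0, 0.0, 0.0, -7.0, 7.0, 0.0]
# Smooth, predictable stretches give ~0 error; the jump shows up as large
# residuals, which is what makes prediction error a discontinuity attribute.
```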