BIG DATA
Dept. of Computer Engineering, GPTC Kothamangalam
1. INTRODUCTION
Big data is a broad term for data sets so large or complex that traditional data-processing
applications are inadequate. Challenges include analysis, capture, data curation, search,
sharing, storage, transfer, visualization, and information privacy. The term often refers
simply to the use of predictive analytics or certain other advanced methods to extract value
from data, and seldom to a particular size of data set. Accuracy in big data may lead to more
confident decision making, and better decisions can mean greater operational efficiency, cost
reductions and reduced risk.
Analysis of data sets can find new correlations to "spot business trends, prevent
diseases, combat crime and so on." Scientists, practitioners of media and advertising, and
governments alike regularly meet difficulties with large data sets in areas including Internet
search, finance and business informatics. Scientists encounter limitations in e-Science
work, including meteorology, genomics, connectomics, complex physics simulations, and
biological and environmental research.
Data sets grow in size in part because they are increasingly being gathered by cheap
and numerous information-sensing mobile devices, aerial (remote sensing), software logs,
cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor
networks. The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s; as of 2012, about 2.5 exabytes (2.5×10^18 bytes) of
data were created every day. The challenge for large enterprises is determining who should own
big data initiatives that straddle the entire organization.
Work with big data is necessarily uncommon; most analysis is of "PC size" data, on a
desktop PC or notebook that can handle the available data set.
Relational database management systems and desktop statistics and visualization
packages often have difficulty handling big data. The work instead requires "massively parallel
software running on tens, hundreds, or even thousands of servers". What is considered "big
data" varies depending on the capabilities of the users and their tools, and expanding
capabilities make Big Data a moving target. Thus, what is considered to be "Big" in one year
will become ordinary in later years. "For some organizations, facing hundreds of gigabytes of
data for the first time may trigger a need to reconsider data management options. For others, it
may take tens or hundreds of terabytes before data size becomes a significant consideration."
2. DEFINITION
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big
data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to
many petabytes of data. Big data is a set of techniques and technologies that require new
forms of integration to uncover large hidden values from large datasets that are diverse,
complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug
Laney defined data growth challenges and opportunities as being three-dimensional, i.e.
increasing volume (amount of data), velocity (speed of data in and out), and variety (range of
data types and sources). Gartner, and now much of the industry, continue to use this "3Vs"
model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is
high volume, high velocity, and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization."
Additionally, some organizations add a new V, "veracity", to describe it.
A more recent, consensual definition states that "Big Data represents the Information assets
characterized by such a High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value".
The term has been in use since the 1990s, with some giving credit to John Mashey for
popularizing it. Big data philosophy encompasses unstructured, semi-structured and structured
data; however, the main focus is on unstructured data.
"Variety", "veracity" and various other "Vs" are added by some organizations to describe it, a
revision challenged by some industry authorities.
A 2018 definition states "Big data is where parallel computing tools are needed to handle data",
and notes, "This represents a distinct and clearly defined change in the computer science used,
via parallel programming theories, and losses of some of the guarantees and capabilities made
by Codd's relational model."
The growing maturity of the concept more starkly delineates the difference between "big data"
and "Business Intelligence":
 Business Intelligence uses applied mathematics tools and descriptive statistics with data of
high information density to measure things, detect trends, etc.
 Big data uses mathematical analysis, optimization, inductive statistics and concepts
from nonlinear system identification to infer laws (regressions, nonlinear relationships, and
causal effects) from large sets of data with low information density to reveal relationships
and dependencies, or to perform predictions of outcomes and behaviors.
3. CHARACTERISTICS
Big data can be described by the following characteristics:
 Volume
 Variety
 Velocity
 Variability
 Veracity
 Complexity
VOLUME
The quantity of data that is generated is very important in this context. It is the size of the data
that determines its value and potential, and whether it can actually be considered big data or
not. The name 'Big Data' itself contains a term related to size, and hence the characteristic.
Within the Social Media space for example, Volume refers to the amount of data generated
through websites, portals and online applications. Especially for B2C companies, Volume
encompasses the available data that are out there and need to be assessed for relevance.
Consider the following: Facebook has 2 billion users, YouTube 1 billion users, Twitter 350
million users and Instagram 700 million users. Every day, these users generate billions of
images, posts, videos and tweets. You can now imagine the insanely large amount, or volume,
of data that is generated every minute and every hour.
The sheer scale of the information processed helps define big data systems. These
datasets can be orders of magnitude larger than traditional datasets, which demands more thought
at each stage of the processing and storage life cycle. Often, because the work requirements
exceed the capabilities of a single computer, this becomes a challenge of pooling, allocating, and
coordinating resources from groups of computers. Cluster management and algorithms capable
of breaking tasks into smaller pieces become increasingly important.
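To make the idea of splitting work into smaller pieces concrete, here is a minimal sketch that chunks a large dataset and aggregates it with a pool of worker processes. It is only an illustration using the Python standard library on one machine; a real cluster would use a framework such as Hadoop or Spark, and all names here are invented for the example.

```python
# Minimal sketch: split a large dataset into chunks and aggregate the
# partial results with a pool of workers (a single-machine stand-in
# for a cluster).
from multiprocessing import Pool

def summarize_chunk(chunk):
    # Each worker computes a partial aggregate for its chunk.
    return sum(chunk), len(chunk)

def split(data, n_chunks):
    # Break the dataset into roughly equal pieces.
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))          # stand-in for a huge dataset
    with Pool(processes=4) as pool:
        partials = pool.map(summarize_chunk, split(data, 4))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    print("mean =", total / count)
```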
VARIETY
The next aspect of Big Data is its variety. This means that the category to which Big Data
belongs to is also a very essential fact that needs to be known by the data analysts. This helps
the people, who are closely analyzing the data and are associated with it, to effectively use the
data to their advantage and thus upholding the importance of the Big Data.
Variety in big data refers to all the structured and unstructured data that can be generated
either by humans or by machines. The most commonly added data are structured: texts, tweets,
pictures and videos. However, unstructured data like emails, voicemails, hand-written text,
ECG readings and audio recordings are also important elements under variety. Variety is all
about the ability to classify the incoming data into various categories.
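As a toy illustration of classifying incoming data into categories, the sketch below routes records into "structured", "semi-structured" and "unstructured" buckets based on their form. The categories and rules are hypothetical examples for this report, not a standard taxonomy.

```python
# Toy router: sort incoming records into coarse variety buckets.
import json

def classify(record):
    if isinstance(record, dict):
        return "structured"           # e.g. rows with named fields
    if isinstance(record, str):
        try:
            json.loads(record)
            return "semi-structured"  # e.g. JSON text
        except ValueError:
            return "unstructured"     # e.g. free text, transcripts
    return "unstructured"             # e.g. raw bytes of audio or images

incoming = [{"user": 1, "amount": 9.5}, '{"tweet": "hello"}', "hand-written note", b"\x00\x01"]
buckets = {}
for item in incoming:
    buckets.setdefault(classify(item), []).append(item)
print({kind: len(items) for kind, items in buckets.items()})
```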
VELOCITY
The term ‘velocity’ in the context refers to the speed of generation of data or how fast the data
is generated and processed to meet the demands and the challenges which lie ahead in the path
of growth and development.
With velocity we refer to the speed at which data are being generated. Staying with our
social media example, every day 900 million photos are uploaded to Facebook, 500 million
tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube and 3.5
billion searches are performed on Google. This is like a nuclear data explosion. Big data helps
a company absorb this explosion, accept the incoming flow of data and, at the same time,
process it fast enough that it does not create bottlenecks.
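The sketch below is a minimal producer/consumer illustration of handling a fast incoming flow: events are buffered and processed in small batches so the consumer keeps up instead of stalling on every single event. It is a single-process toy under assumed event and batch sizes, not a real streaming system such as Kafka or Flink.

```python
# Toy stream handling: buffer fast-arriving events and process them in batches.
from collections import deque

def event_source(n):
    # Stand-in for a firehose of incoming events (tweets, clicks, photos...).
    for i in range(n):
        yield {"id": i, "value": i % 10}

def process_batch(batch):
    # Cheap aggregate so processing keeps pace with arrival.
    return sum(e["value"] for e in batch)

buffer, running_total = deque(), 0
for event in event_source(10_000):
    buffer.append(event)
    if len(buffer) >= 500:             # micro-batch to avoid per-event overhead
        running_total += process_batch(buffer)
        buffer.clear()
if buffer:                             # flush the remainder
    running_total += process_batch(buffer)
print("running total:", running_total)
```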
VARIABILITY
This is a factor which can be a problem for those who analyze the data. It refers to the
inconsistency which the data can show at times, hampering the process of handling and
managing the data effectively.
Variability in big data's context refers to a few different things. One is the number of
inconsistencies in the data. These need to be found by anomaly and outlier detection methods in
order for any meaningful analytics to occur.
Big data is also variable because of the multitude of data dimensions resulting from multiple
disparate data types and sources. Variability can also refer to the inconsistent speed at which big
data is loaded into your database.
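As a minimal example of the anomaly and outlier detection mentioned above, the sketch below flags values that sit far from the mean using a simple z-score rule. The cutoff of 2.5 standard deviations and the sample readings are illustrative assumptions; real pipelines would use more robust methods.

```python
# Simple z-score outlier detection over a numeric field.
from statistics import mean, stdev

def find_outliers(values, threshold=2.5):
    # 2.5 is an illustrative cutoff; other thresholds are equally common.
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 42.0, 10.0, 9.7]
print(find_outliers(readings))   # the inconsistent reading 42.0 is flagged
```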
VERACITY
The quality of the data being captured can vary greatly. Accuracy of analysis depends on the
veracity of the source data.
This is one of the unfortunate characteristics of big data. As any or all of the above properties
increase, the veracity (confidence or trust in the data) drops. It is similar to, but not the same
as, validity or volatility. Veracity refers more to the provenance or reliability of the data
source, its context, and how meaningful it is to the analysis based on it.
For example, consider a data set of statistics on what people purchase at restaurants and these
items' prices over the past five years. You might ask: Who created the source? What
methodology did they follow in collecting the data? Were only certain cuisines or certain types
of restaurants included? Did the data creators summarize the information? Has the information
been edited or modified by anyone else?
Answers to these questions are necessary to determine the veracity of this information.
Knowledge of the data's veracity in turn helps us better understand the risks associated with
analysis and business decisions based on this particular data set.
COMPLEXITY
Data management can become a very complex process, especially when large volumes of data
come from multiple sources. These data need to be linked, connected and correlated in order to
grasp the information they are supposed to convey. This situation is therefore termed the
'complexity' of big data.
Factory work and Cyber-physical systems may have a 6C system:
1. Connection (sensor and networks),
2. Cloud (computing and data on demand),
3. Cyber (model and memory),
4. content/context (meaning and correlation),
5. community (sharing and collaboration), and
6. customization (personalization and value).
In this scenario, in order to provide useful insight to factory management and gain correct
content, data has to be processed with advanced tools (analytics and algorithms) to generate
meaningful information. Considering the presence of visible and invisible issues in an industrial
factory, the information generation algorithm has to be capable of detecting and addressing
invisible issues such as machine degradation and component wear on the factory floor.
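To illustrate the linking and correlating described above, the sketch below joins records from two hypothetical sources (streamed sensor readings and a machine registry) on a shared key so they can be analyzed together. All field names and values are invented for the example.

```python
# Minimal sketch: correlate records from two sources by a shared key.
sensor_readings = [                     # e.g. streamed from the factory floor
    {"machine_id": "M1", "vibration": 0.42},
    {"machine_id": "M2", "vibration": 1.97},
]
machine_registry = [                    # e.g. an asset-management database
    {"machine_id": "M1", "line": "A", "installed": 2015},
    {"machine_id": "M2", "line": "B", "installed": 2009},
]

registry_by_id = {m["machine_id"]: m for m in machine_registry}
linked = [
    {**reading, **registry_by_id.get(reading["machine_id"], {})}
    for reading in sensor_readings
]
for row in linked:
    # Older machines with high vibration are candidates for inspection.
    if row["vibration"] > 1.5 and row.get("installed", 9999) < 2012:
        print("flag for inspection:", row)
```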
4. ARCHITECTURE
In 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed
file-sharing framework for data storage and querying. Structured, semi-structured and/or
unstructured data is stored and distributed across multiple servers. Querying is done in ECL, a
modified C++ dialect that applies a schema-on-read method to structure the stored data at query
time. In 2004 LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc.
together with its high-speed parallel processing platform. The two platforms were merged into
HPCC Systems, which was open-sourced under the Apache v2.0 License in 2011. Currently HPCC
and the Quantcast File System are the only publicly available platforms capable of analyzing
multiple exabytes of data.
In 2004, Google published a paper on a process called MapReduce that used such an
architecture. The MapReduce framework provides a parallel processing model and an associated
implementation to process huge amounts of data. With MapReduce, queries are split and
distributed across parallel nodes and processed in parallel (the Map step). The results are then
gathered and delivered (the Reduce step). The framework was very successful, so others
wanted to replicate the algorithm. Therefore, an implementation of the MapReduce
framework was adopted by an Apache open-source project named Hadoop.
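To make the Map and Reduce steps concrete, here is a minimal word-count sketch in plain Python: the Map step runs in parallel over documents and emits partial word counts, and the Reduce step gathers and merges them. It only imitates the pattern on one machine; Hadoop distributes the same idea across many servers, and the sample documents are invented.

```python
# Word count in the MapReduce style: parallel map, then gather and reduce.
from multiprocessing import Pool
from collections import Counter

def map_step(document):
    # Emit a partial count of words for one document (the "Map" step).
    return Counter(document.lower().split())

def reduce_step(partial_counts):
    # Gather partial results and merge them (the "Reduce" step).
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data needs massively parallel software",
        "parallel software runs on many servers",
        "big data is a moving target",
    ]
    with Pool(processes=3) as pool:
        partials = pool.map(map_step, documents)   # distribute the Map step
    print(reduce_step(partials).most_common(3))
```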
MIKE2.0 is an open approach to information management that acknowledges the need
for revisions due to big data implications identified in an article titled "Big Data Solution
Offering". The methodology addresses handling big data in terms of useful permutations of data
sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual
records.
Studies from 2012 show that a multiple-layer architecture is one option for dealing
with big data. A distributed parallel architecture distributes data across multiple processing
units, and parallel processing units provide data much faster by improving processing speeds.
This type of architecture inserts data into a parallel DBMS, which implements the use of
MapReduce and Hadoop frameworks. This type of framework looks to make the processing power
transparent to the end user by using a front-end application server.
Big data analytics for manufacturing applications can be based on a 5C architecture
(connection, conversion, cyber, cognition, and configuration).
Big data lake: with the changing face of business and the IT sector, the capture and storage of
data has grown into a sophisticated system. The big data lake allows an organization to shift
its focus from centralized control to a shared model in order to respond to the changing
dynamics of information management. This enables quick segregation of data into the data lake,
thereby reducing the overhead time.
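As a rough illustration of "quick segregation of data into the data lake", the sketch below lands incoming files in a raw zone partitioned by source and date, leaving curation for later. The directory layout, zone names and file contents are assumptions made for this example, not a standard.

```python
# Toy data-lake landing: drop raw data into source/date partitions.
from pathlib import Path
from datetime import date

LAKE_ROOT = Path("datalake")            # hypothetical lake location

def land_raw(source: str, payload: bytes, name: str) -> Path:
    # The raw zone keeps data as-is; curated zones would be built downstream.
    partition = LAKE_ROOT / "raw" / source / date.today().isoformat()
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / name
    target.write_bytes(payload)
    return target

print(land_raw("web_logs", b'{"path": "/home", "ms": 42}', "events-0001.json"))
```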
Big data repositories have existed in many forms, often built by corporations with a special need.
Commercial vendors historically offered parallel database management systems for big data
beginning in the 1990s. For many years, WinterCorp published the largest database report.
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata
systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were
2.5 GB in 1991 so the definition of big data continuously evolves according to Kryder's Law.
Teradata installed the first petabyte class RDBMS based system in 2007. As of 2017, there are a
few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50
PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added
unstructured data types including XML, JSON, and Avro.
The HPCC Systems platform automatically partitions, distributes, stores and delivers structured,
semi-structured, and unstructured data across multiple commodity servers. Users write data
processing pipelines and queries in ECL, a declarative dataflow programming language. Data
analysts working in ECL are not required to define data schemas up front and can instead focus
on the particular problem at hand, reshaping data in the best possible manner as they develop the
solution. LexisNexis used the platform to integrate the data systems of ChoicePoint Inc. after
acquiring that company in 2008.
CERN and other physics experiments have collected big data sets for many decades, usually
analyzed via high-throughput computing rather than the map-reduce architectures usually meant
by the current "big data" movement.
Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it
adds the ability to set up many operations (not just map followed by reduce).
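A small PySpark sketch of the point above: with an RDD, several operations (map, filter, reduceByKey) can be chained rather than being limited to a single map-then-reduce pass. This assumes a local Spark installation with the pyspark package available, and the log lines are invented; it is illustrative only.

```python
# Chaining several operations on a Spark RDD (assumes pyspark is installed).
from pyspark import SparkContext

sc = SparkContext("local[*]", "chained-ops-sketch")
lines = sc.parallelize([
    "error disk full", "info job done", "error network timeout", "info job done",
])
error_counts = (
    lines.map(lambda line: line.split())          # tokenize each line
         .filter(lambda toks: toks[0] == "error") # keep only error records
         .map(lambda toks: (toks[1], 1))          # key by error subsystem
         .reduceByKey(lambda a, b: a + b)         # count per subsystem
         .collect()
)
print(error_counts)
sc.stop()
```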
5. TECHNOLOGIES
Big data requires exceptional technologies to efficiently process large quantities of data
within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include
A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning,
natural language processing, signal processing, simulation, time series analysis and
visualization.
Multidimensional big data can also be represented as tensors, which can be more
efficiently handled by tensor-based computation, such as multilinear subspace learning.
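As a small illustration of treating multidimensional data as a tensor, the sketch below builds a 3-way array (sensor × time × trial) with NumPy and computes a mode-1 unfolding, the basic matricization step that multilinear subspace methods operate on. The shapes and the random data are arbitrary example values, and the SVD shown is only one possible building block.

```python
# Represent multidimensional measurements as a 3-way tensor and unfold it.
import numpy as np

sensors, timesteps, trials = 4, 100, 20
tensor = np.random.rand(sensors, timesteps, trials)   # sensor x time x trial

def unfold(t, mode):
    # Mode-n unfolding: bring axis `mode` to the front, flatten the rest.
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

X1 = unfold(tensor, 0)        # shape (4, 100*20): one row per sensor
print(X1.shape)

# A truncated SVD of an unfolding is one building block of multilinear
# subspace methods (e.g. finding a low-dimensional sensor subspace).
U, s, Vt = np.linalg.svd(X1, full_matrices=False)
print(U[:, :2].shape)         # a 2-D subspace for the sensor mode
```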
Additional technologies being applied to big data include massively parallel-processing
(MPP) databases, search-based applications, data mining, distributed file systems,
distributed databases, cloud-based infrastructure (applications, storage and computing
resources) and the Internet.
Some but not all MPP relational databases have the ability to store and manage petabytes of
data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data
tables in the RDBMS.
DARPA’s Topological Data Analysis program seeks the fundamental structure of massive
data sets and in 2008 the technology went public with the launch of a company called
Ayasdi.
The practitioners of big data analytics processes are generally hostile to slower shared
storage, preferring direct-attached storage (DAS) in its various forms from solid state drive
(SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of
shared storage architectures—Storage area network (SAN) and Network-attached storage
(NAS) —is that they are relatively slow, complex, and expensive. These qualities are not
consistent with big data analytics systems that thrive on system performance, commodity
infrastructure, and low cost.
Real or near-real time information delivery is one of the defining characteristics of big
data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory
is good—data on spinning disk at the other end of a FC SAN connection is not. The cost of a
SAN at the scale needed for analytics applications is very much higher than other storage
techniques.
6. APPLICATIONS
Big data has increased the demand for information management specialists to the extent that
Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more
than $15 billion on software firms specializing in data management and analytics. In 2010, this
industry was worth more than $100 billion and was growing at almost 10 percent a year: about
twice as fast as the software business as a whole.
Developed economies make increasing use of data-intensive technologies. There are
4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people
accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered
the middle class, which means more and more people who gain money will become more
literate, which in turn leads to information growth. The world's effective capacity to exchange
information through telecommunication networks was 281 petabytes in 1986, 471 petabytes
in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and it is predicted that the amount of
traffic flowing over the internet will reach 667 exabytes annually by 2014. It is estimated that
one third of the globally stored information is in the form of alphanumeric text and still image
data, which is the format most useful for most big data applications. This also shows the
potential of yet unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf solutions for Big Data, experts recommend the
development of in-house solutions custom-tailored to solve the company's problem at hand if
the company has sufficient technical capabilities.
6.1 GOVERNMENT
The use and adoption of Big Data within governmental processes is beneficial and allows
efficiencies in terms of cost, productivity, and innovation. That said, this process does not come
without its flaws. Data analysis often requires multiple parts of government (central and local)
to work in collaboration and create new and innovative processes to deliver the desired
outcome.
One of the greatest strengths of big data is its flexibility and universal application to so many
different industries. Along with many other areas, big data in government can have an
enormous impact at the local, national and global level. With so many complex issues on the table
today, governments have their work cut out trying to make sense of all the information they
receive and make vital decisions that affect millions of people. Not only is it difficult to sift
through all the information, but it’s sometimes difficult to verify the reality of the information
itself. Faulty information can have awful consequences.
By implementing a big data platform, governments can access vast amounts of relevant
information important to their daily functions. The positive effect it can have is nearly endless.
It’s so important because it not only allows the government to pinpoint areas that need attention,
but it also gives them that information in real time. In a society that moves so quickly from one
thing to the next, real-time analysis is vital. It allows governments to make faster decisions, and
it allows them to monitor those decisions and quickly enact changes if necessary. Here are just a
few of the areas that big data can positively affect at the government level.
Below are some leading examples within the governmental big data space.
6.2 UNITED STATES OF AMERICA
In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address important
problems faced by the government. The initiative is composed of 84 different big data
programs spread across six departments.
Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
The United States Federal Government owns six of the ten most
powerful supercomputers in the world.
The Utah Data Center is a data center currently being constructed by the United
States National Security Agency. When finished, the facility will be able to handle
a large amount of information collected by the NSA over the Internet. The exact
amount of storage space is unknown, but more recent sources claim it will be on the order
of a few exabytes.
6.3 INDIA
Big data analysis was, in part, responsible for the BJP and its allies winning the highly
successful 2014 Indian general election.
The Indian government utilises numerous techniques to ascertain how the Indian
electorate is responding to government action, as well as ideas for policy augmentation.
6.4 UNITED KINGDOM
Examples of uses of big data in public services:
Data on prescription drugs: by connecting origin, location and the time of each
prescription, a research unit was able to exemplify the considerable delay between the
release of any given drug, and a UK-wide adaptation of the National Institute for
Health and Care Excellence guidelines. This suggests that new/most up-to-date drugs
take some time to filter through to the general patient.
Joining up data: a local authority blended data about services, such as road gritting
rotas, with services for people at risk, such as 'meals on wheels'. The connection of
data allowed the local authority to avoid any weather related delay.
6.5 INTERNATIONAL DEVELOPMENT
Research on the effective usage of information and communication technologies for
development (also known as ICT4D) suggests that big data technology can make important
contributions but also present unique challenges to International development. Advancements
in big data analysis offer cost-effective opportunities to improve decision-making in critical
development areas such as health care, employment, economic productivity, crime, security,
and natural disaster and resource management. However, longstanding challenges for
developing regions such as inadequate technological infrastructure and economic and human
resource scarcity exacerbate existing concerns with big data such as privacy, imperfect
methodology, and interoperability issues.
6.6 MANUFACTURING
Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality
provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for
transparency in the manufacturing industry, which is the ability to unravel uncertainties such as
inconsistent component performance and availability. Predictive manufacturing, as an applicable
approach toward near-zero downtime and transparency, requires a vast amount of data and advanced
prediction tools to systematically process data into useful information.
A conceptual framework of predictive manufacturing begins with data acquisition, where different
types of sensory data are available to acquire, such as acoustics, vibration, pressure, current, voltage
and controller data. Vast amounts of sensory data, in addition to historical data, constitute the big data
in manufacturing. The generated big data acts as the input to predictive tools and preventive strategies
such as Prognostics and Health Management (PHM).
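A minimal sketch of the kind of health-monitoring step described above: compute a simple feature (RMS) over windows of a vibration signal and flag windows that drift above a baseline-derived threshold. The signal, window size and threshold rule are all invented for illustration; real PHM systems use far richer features and models.

```python
# Toy health indicator: windowed RMS of a vibration signal vs. a baseline.
import math, random

def rms(window):
    return math.sqrt(sum(x * x for x in window) / len(window))

random.seed(0)
signal = [random.gauss(0, 1.0) for _ in range(4000)]           # healthy period
signal += [random.gauss(0, 2.5) for _ in range(1000)]          # degrading period

window_size = 200
windows = [signal[i:i + window_size] for i in range(0, len(signal), window_size)]
baseline = rms([x for w in windows[:10] for x in w])           # early-life reference
threshold = 1.5 * baseline                                     # illustrative rule

for idx, w in enumerate(windows):
    if rms(w) > threshold:
        print(f"window {idx}: RMS {rms(w):.2f} exceeds threshold {threshold:.2f}")
```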
6.7 CYBER-PHYSICAL MODELS
Current PHM implementations mostly utilize data from actual usage, while analytical algorithms
can perform more accurately when more information from throughout the machine's lifecycle, such
as system configuration, physical knowledge and working principles, is included. There is a need to
systematically integrate, manage and analyze machinery or process data during the different stages
of the machine's life cycle in order to handle data and information more efficiently and to achieve
better transparency of machine health condition for the manufacturing industry.
With such motivation, a cyber-physical (coupled) model scheme has been developed (see
http://www.imscenter.net/cyber-physical-platform). The coupled model is a digital twin of
the real machine that operates in the cloud platform and simulates the health condition with
integrated knowledge from both data-driven analytical algorithms and other available
physical knowledge. It can also be described as a 5S systematic approach consisting of
Sensing, Storage, Synchronization, Synthesis and Service. The coupled model first constructs
a digital image of the machine from the early design stage. System information and physical
knowledge are logged during product design, based on which a simulation model is built as a
reference for future analysis. Initial parameters may be statistically generalized, and they can
be tuned using data from testing or from the manufacturing process using parameter estimation.
After that, the simulation model can be considered a mirrored image of the real machine, able to
continuously record and track machine condition during the later utilization stage. Finally,
with the ubiquitous connectivity offered by cloud computing technology, the coupled model also
provides better accessibility of machine condition for factory managers in cases where
physical access to the actual equipment or machine data is limited.
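To illustrate the parameter-estimation step mentioned above, the sketch below tunes one parameter of a very simple machine model (an exponential temperature-decay constant) by least squares against test data. The model, the ambient and initial temperatures, and the synthetic data are invented for the example; an actual coupled model would be far more detailed.

```python
# Tune a simple model parameter from test data via least squares.
import numpy as np

# Hypothetical test data: temperature decaying toward ambient after shutdown.
t = np.linspace(0, 10, 50)                       # minutes
true_k = 0.35
measured = 20 + 60 * np.exp(-true_k * t) + np.random.normal(0, 0.5, t.size)

# Assumed model: T(t) = 20 + 60 * exp(-k t).  Linearize: log((T - 20) / 60) = -k t,
# then estimate k with a least-squares fit through the origin.
y = np.log(np.clip((measured - 20) / 60, 1e-6, None))
k_est = -np.linalg.lstsq(t.reshape(-1, 1), y, rcond=None)[0][0]

print(f"estimated decay constant k = {k_est:.3f} (true value {true_k})")
# The tuned parameter would then be pushed into the digital twin so its
# simulated behaviour tracks the real machine.
```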
6.8 MEDIA
6.8.1 INTERNET OF THINGS (IOT)
To understand how the media utilises big data, it is first necessary to provide some context
about the mechanisms used in the media process. Nick Couldry and Joseph Turow have
suggested that practitioners in media and advertising approach big data as many actionable
points of information about millions of individuals. The industry appears to be moving away
from the traditional approach of using specific media environments such as newspapers,
magazines, or television shows, and instead taps into consumers with technologies that reach
targeted people at optimal times in optimal locations. The ultimate aim is to serve, or convey,
a message or content that is (statistically speaking) in line with the consumer's mindset. For
example, publishing environments are increasingly tailoring messages (advertisements) and
content (articles) to appeal to consumers, based on information gleaned exclusively through
various data-mining activities.
 Targeting of consumers (for advertising by marketers)
 Data capture
Big data and the IoT work in conjunction. From a media perspective, data is the key
derivative of device interconnectivity and allows accurate targeting. The Internet of Things,
with the help of big data, therefore transforms the media industry, companies and even
governments, opening up a new era of economic growth and competitiveness. The
intersection of people, data and intelligent algorithms has far-reaching impacts on media
efficiency. The wealth of data generated adds an elaborate layer to the industry's present
targeting mechanisms.
6.8.2 TECHNOLOGY
eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB
Hadoop cluster for search, consumer recommendations, and merchandising.
Amazon.com handles millions of back-end operations every day, as well as queries
from more than half a million third-party sellers. The core technology that keeps
Amazon running is Linux-based and as of 2005 they had the world’s three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Facebook handles 50 billion photos from its user base.
As of August 2012, Google was handling roughly 100 billion searches per month.
6.9 PRIVATE SECTOR
6.9.1 RETAIL
Walmart handles more than 1 million customer transactions every hour, which are
imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes)
of data – the equivalent of 167 times the information contained in all the books in the
US Library of Congress.
6.9.2 RETAIL BANKING
FICO Card Detection System protects accounts world-wide.
The volume of business data worldwide, across all companies, doubles every 1.2
years, according to estimates.
6.9.3 REAL ESTATE
Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers
to help new home buyers determine their typical drive times to and from work
throughout various times of the day.
6.9.4 SCIENCE
The Large Hadron Collider experiments represent about 150 million sensors delivering data
40 million times per second. There are nearly 600 million collisions per second. After
filtering and refraining from recording more than 99.99995% of these streams, there are 100
collisions of interest per second.
As a result, working with less than 0.001% of the sensor stream data, the data
flow from all four LHC experiments represents a 25-petabyte annual rate before
replication (as of 2012). This becomes nearly 200 petabytes after replication.
If all sensor data were to be recorded at the LHC, the data flow would be extremely hard
to work with. It would exceed a 150 million petabyte annual rate, or nearly 500 exabytes
per day, before replication. To put the number in perspective, this is equivalent to
500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources
combined in the world.
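As a rough sanity check of the arithmetic above (a sketch assuming the quoted figure of roughly 150 million petabytes per year before replication), the annual rate can be converted to a daily rate:

```python
# Sanity check: convert an annual data rate into a daily rate (decimal units).
PB = 10**15                      # bytes in a petabyte
EB = 10**18                      # bytes in an exabyte

annual_bytes = 150e6 * PB        # ~150 million petabytes per year (assumed figure)
per_day = annual_bytes / 365
print(f"{per_day / EB:.0f} exabytes per day")
# Prints roughly 411 EB/day, i.e. about 4x10^20 bytes, the same order of
# magnitude as the "nearly 500 exabytes per day" quoted above.
```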
The Square Kilometre Array is a telescope which consists of millions of antennas and is
expected to be operational by 2024. Collectively, these antennas are expected to gather
14 exabytes and store one petabyte per day. It is considered to be one of the most
ambitious scientific projects ever undertaken.
6.10 SCIENCE AND RESEARCH
When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in
2000, it amassed more in its first few weeks than all data collected in the history of
astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more
than 140 terabytes of information. When the Large Synoptic Survey Telescope,
successor to SDSS, comes online in 2016 it is anticipated to acquire that amount of
data every five days.
Decoding the human genome originally took 10 years to process; now it can be
achieved in less than a day. DNA sequencers have divided the sequencing cost by
10,000 in the last ten years, which is 100 times cheaper than the reduction in cost
predicted by Moore's Law.
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of
climate observations and simulations on the Discover supercomputing cluster.
7. ADVANTAGES
 Using big data cuts your costs.
 Using big data increases your efficiency.
 Using big data improves your pricing.
 You can compete with big businesses.
 Allows you to focus on local preferences.
 Using big data helps you increase sales and loyalty.
 Using big data ensures you hire the right employees.
8. DISADVANTAGES
 Traditional storage can cost a lot of money to store big data.
 Lots of big data is unstructured.
 Big data analysis violates principles of privacy.
 It can be used for manipulation of customer records.
 It may increase social stratification.
 Big data analysis is not useful in the short run; it needs to be analyzed over a longer duration to
leverage its benefits.
 Big data analysis results are sometimes misleading.
 Speedy updates in big data can mismatch real figures.
9. CONCLUSION
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history of data
analysis. The convergence of these trends means that we have the capabilities required to
analyze astonishing data sets quickly and cost-effectively for the first time in history. These
capabilities are neither theoretical nor trivial. They represent a genuine leap forward and a
clear opportunity to realize enormous gains in terms of efficiency, productivity, revenue, and
profitability.
The Age of Big Data is here, and these are truly revolutionary times if both business
and technology professionals continue to work together and deliver on the promise.
As the career paths available in big data continue to grow, so does the shortage of big data
professionals needed to fill those positions. Characteristics such as communication, knowledge
of big data concepts, and agility are equally as important as the technical skills of big data.
Big data professionals are the bridge between raw data and usable information. They should
have the skills to manipulate data at the lowest levels, and they must know how to interpret its
trends, patterns, and outliers in many different forms. The languages and methods used to
achieve these goals are growing in strength and numbers, a pattern unlikely to change in the near
future, especially as more languages and tools enter and gain popularity in the big data fray.
Regardless of language, method, or specialization, big data scientists face a unique
challenge: working in a field where their exact role lacks a clear definition. Within an
organization, they help to solve problems, but even these problems may be undefined. To further
complicate matters, some data scientists work outside any specific organization and its direction,
as in academic research. Concrete applications of big data across many disciplines demonstrate
how diversely big data scientists can work.
Contenu connexe

Tendances (20)

IoT(Internet of Things) Report
IoT(Internet of Things) ReportIoT(Internet of Things) Report
IoT(Internet of Things) Report
 
IoT and its Applications
IoT and its ApplicationsIoT and its Applications
IoT and its Applications
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)
 
Simple Internet Of Things (IoT) PPT 2020
Simple Internet Of Things (IoT) PPT 2020 Simple Internet Of Things (IoT) PPT 2020
Simple Internet Of Things (IoT) PPT 2020
 
IOT report
IOT reportIOT report
IOT report
 
Big Data PPT by Rohit Dubey
Big Data PPT by Rohit DubeyBig Data PPT by Rohit Dubey
Big Data PPT by Rohit Dubey
 
Ethical, Legal and Social issues IoT
Ethical, Legal and Social issues IoTEthical, Legal and Social issues IoT
Ethical, Legal and Social issues IoT
 
Big data
Big dataBig data
Big data
 
Introduction to IoT
Introduction to IoTIntroduction to IoT
Introduction to IoT
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
IoT(Internet of Things) ppt
IoT(Internet of Things) pptIoT(Internet of Things) ppt
IoT(Internet of Things) ppt
 
Big data
Big dataBig data
Big data
 
THE INTERNET OF THINGS
THE INTERNET OF THINGSTHE INTERNET OF THINGS
THE INTERNET OF THINGS
 
Applications of Big Data
Applications of Big DataApplications of Big Data
Applications of Big Data
 
Edge computing
Edge computingEdge computing
Edge computing
 
Internet Of Things
 Internet Of Things Internet Of Things
Internet Of Things
 
Big_data_ppt
Big_data_ppt Big_data_ppt
Big_data_ppt
 

Similaire à Big Data

Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big DataIRJET Journal
 
Know The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdfKnow The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdfAnil
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET Journal
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.saranya270513
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIIJCSEA Journal
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIIJCSEA Journal
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIIJCSEA Journal
 
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...IJERDJOURNAL
 
Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesEditor IJCATR
 
Whitebook on Big Data
Whitebook on Big DataWhitebook on Big Data
Whitebook on Big DataViren Aul
 
Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)Sonu Gupta
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)NikitaRajbhoj
 
Big data analytics in Business Management and Businesss Intelligence: A Lietr...
Big data analytics in Business Management and Businesss Intelligence: A Lietr...Big data analytics in Business Management and Businesss Intelligence: A Lietr...
Big data analytics in Business Management and Businesss Intelligence: A Lietr...IRJET Journal
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Gregg Barrett
 

Similaire à Big Data (20)

Analysis of Big Data
Analysis of Big DataAnalysis of Big Data
Analysis of Big Data
 
Big data
Big dataBig data
Big data
 
Know The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdfKnow The What, Why, and How of Big Data_.pdf
Know The What, Why, and How of Big Data_.pdf
 
IRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth EnhancementIRJET- Big Data Management and Growth Enhancement
IRJET- Big Data Management and Growth Enhancement
 
1
11
1
 
Introduction to big data – convergences.
Introduction to big data – convergences.Introduction to big data – convergences.
Introduction to big data – convergences.
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AI
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AI
 
DEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AIDEALING CRISIS MANAGEMENT USING AI
DEALING CRISIS MANAGEMENT USING AI
 
Big Data.pdf
Big Data.pdfBig Data.pdf
Big Data.pdf
 
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
 
Big data upload
Big data uploadBig data upload
Big data upload
 
Big Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New ChallengesBig Data Analytics: Recent Achievements and New Challenges
Big Data Analytics: Recent Achievements and New Challenges
 
Whitebook on Big Data
Whitebook on Big DataWhitebook on Big Data
Whitebook on Big Data
 
Big data - a review (2013 4)
Big data - a review (2013 4)Big data - a review (2013 4)
Big data - a review (2013 4)
 
Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)Nikita rajbhoj(a 50)
Nikita rajbhoj(a 50)
 
new.pptx
new.pptxnew.pptx
new.pptx
 
Big data analytics in Business Management and Businesss Intelligence: A Lietr...
Big data analytics in Business Management and Businesss Intelligence: A Lietr...Big data analytics in Business Management and Businesss Intelligence: A Lietr...
Big data analytics in Business Management and Businesss Intelligence: A Lietr...
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...Overview of mit sloan case study on ge data and analytics initiative titled g...
Overview of mit sloan case study on ge data and analytics initiative titled g...
 

Dernier

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Dernier (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service

data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a new V, "veracity", to describe it.
Although Gartner's 3Vs definition is still widely used, the growing maturity of the concept has sharpened the distinction between big data and business intelligence, both in the data involved and in how they are used. A more recent, consensual definition states that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value."

The term has been in use since the 1990s, with some crediting John Mashey with popularizing it. Big data philosophy encompasses unstructured, semi-structured and structured data, although the main focus is on unstructured data. "Variety", "veracity" and various other "Vs" have been added by some organizations to describe it, a revision challenged by some industry authorities. A 2018 definition states, "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd's relational model."

The growing maturity of the concept more starkly delineates the difference between big data and business intelligence:
• Business intelligence uses applied mathematics tools and descriptive statistics on data with high information density to measure things, detect trends, etc.
• Big data uses mathematical analysis, optimization, inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density, in order to reveal relationships and dependencies or to predict outcomes and behaviors. A minimal illustration of this contrast follows below.
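To make the contrast concrete, the following sketch (illustrative only; the advertising-spend and sales figures are invented for the example) first computes simple descriptive statistics in the business-intelligence style and then fits a basic regression in the inductive, predictive style associated with big data analytics. It uses only the Python standard library.

import random
import statistics

# Invented example data: daily advertising spend and the resulting sales.
random.seed(42)
ad_spend = [random.uniform(100, 1000) for _ in range(365)]
sales = [50 + 0.8 * x + random.gauss(0, 40) for x in ad_spend]

# Business-intelligence style: descriptive statistics on dense data.
print("mean daily sales:", round(statistics.mean(sales), 1))
print("sales std dev:   ", round(statistics.stdev(sales), 1))

# Big-data style: inductive statistics, i.e. fit a simple linear regression
# (ordinary least squares) and use it to predict an unseen case.
mean_x, mean_y = statistics.mean(ad_spend), statistics.mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) / \
        sum((x - mean_x) ** 2 for x in ad_spend)
intercept = mean_y - slope * mean_x
print("predicted sales for an ad spend of 1200:", round(intercept + slope * 1200, 1))

The point of the sketch is the difference in intent: the first part summarizes what already happened, while the second infers a relationship and uses it to predict an outcome that has not been observed.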
3. CHARACTERISTICS

Big data can be described by the following characteristics:

• Volume
• Variety
• Velocity
• Variability
• Veracity
• Complexity
VOLUME

The quantity of data generated is central in this context. The size of the data determines its value and potential, and whether it can actually be considered big data at all; the name "big data" itself refers to size. Within the social media space, for example, volume refers to the amount of data generated through websites, portals and online applications. Especially for B2C companies, volume encompasses all the available data that must be assessed for relevance. Consider that Facebook has 2 billion users, YouTube 1 billion, Instagram 700 million and Twitter 350 million. Every day these users contribute billions of images, posts, videos and tweets, so an enormous volume of data is generated every minute and every hour.

The sheer scale of the information processed helps define big data systems. These data sets can be orders of magnitude larger than traditional data sets, which demands more thought at each stage of the processing and storage life cycle. Because the work requirements often exceed the capabilities of a single computer, the problem becomes one of pooling, allocating, and coordinating resources from groups of computers. Cluster management and algorithms capable of breaking tasks into smaller pieces become increasingly important, as illustrated below.
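The following is a minimal sketch of the "break the task into smaller pieces" idea (the records and chunk size are invented). It spreads work over several local processes with Python's multiprocessing pool; real clusters use frameworks such as Hadoop or Spark, but the principle of partitioning and coordinating partial results is the same.

from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real work, e.g. parsing and aggregating log records.
    return sum(len(record) for record in chunk)

def split_into_chunks(records, chunk_size):
    # Break the full data set into smaller, independently processable pieces.
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

if __name__ == "__main__":
    records = [f"log line {i}" for i in range(100_000)]   # invented data
    chunks = list(split_into_chunks(records, chunk_size=10_000))
    with Pool(processes=4) as pool:                       # four local "workers"
        partial_results = pool.map(process_chunk, chunks)
    print("total characters of log text:", sum(partial_results))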
VARIETY

The next aspect of big data is its variety: the category to which the data belongs is an essential fact that data analysts need to know. It helps the people who closely analyze the data, and who are associated with it, to use it effectively to their advantage, and thus upholds the importance of big data. Variety refers to all the structured and unstructured data that may be generated either by humans or by machines. The most commonly handled data are structured: texts, tweets, pictures and videos. However, unstructured data such as emails, voicemails, hand-written text, ECG readings and audio recordings are also important elements under variety. Variety is ultimately about the ability to classify incoming data into categories, as in the sketch below.
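A small, purely illustrative sketch of such classification; the file extensions, category names and incoming file list are assumptions for the example, not a standard taxonomy.

import json

# Assumed mapping from file extension to a coarse variety category.
CATEGORIES = {
    ".csv": "structured", ".json": "semi-structured", ".xml": "semi-structured",
    ".txt": "unstructured", ".jpg": "unstructured", ".wav": "unstructured",
}

def classify(filename: str) -> str:
    for ext, category in CATEGORIES.items():
        if filename.lower().endswith(ext):
            return category
    return "unknown"

incoming = ["orders.csv", "tweet_stream.json", "voicemail_0412.wav", "ecg_trace.txt"]
by_category = {}
for name in incoming:
    by_category.setdefault(classify(name), []).append(name)

print(json.dumps(by_category, indent=2))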
VELOCITY

The term "velocity" refers to the speed at which data is generated, and how fast it must be processed to meet the demands and challenges that lie ahead on the path of growth and development. Staying with the social media example, every day roughly 900 million photos are uploaded to Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded to YouTube and 3.5 billion searches are performed on Google. This resembles a data explosion. Big data systems help a company absorb this explosion, accept the incoming flow of data and process it quickly enough that it does not create bottlenecks, for example by buffering the stream and consuming it in small batches, as sketched below.
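A minimal sketch of that buffering idea, using a thread-safe queue as a stand-in for a real message broker such as Kafka; the event count, buffer size and batch size are invented for illustration.

import queue
import threading
import time

events = queue.Queue(maxsize=10_000)    # buffer between producer and consumer

def producer():
    # Stand-in for a firehose of incoming events (tweets, clicks, sensor readings).
    for i in range(50_000):
        events.put({"id": i, "payload": "event"})
    events.put(None)                    # sentinel: no more data

def consumer(batch_size=1_000):
    processed, batch = 0, []
    while True:
        item = events.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:    # process in micro-batches to keep up
            processed += len(batch)
            batch.clear()
    processed += len(batch)
    print("events processed:", processed)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
print("elapsed seconds:", round(time.time() - start, 2))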
VARIABILITY

Variability is a factor that can be a problem for those who analyze the data. It refers to the inconsistency the data can show at times, which hampers the process of handling and managing the data effectively. In the big data context, variability refers to a few different things. One is the number of inconsistencies in the data, which need to be found by anomaly and outlier detection methods before any meaningful analytics can occur; a simple outlier check is sketched below. Big data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Finally, variability can refer to the inconsistent speed at which big data is loaded into the database.
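As a minimal illustration of the outlier detection mentioned above, the sketch below uses a modified z-score based on the median and median absolute deviation (the readings and the 3.5 cut-off are arbitrary choices for the example; many other methods exist).

import statistics

# Invented sensor readings; two values are clearly inconsistent with the rest.
readings = [21.2, 20.8, 21.0, 21.5, 20.9, 58.3, 21.1, 20.7, 21.3, -3.9]

median = statistics.median(readings)
mad = statistics.median(abs(x - median) for x in readings)

def modified_z(x):
    # Robust score: large outliers barely distort the median and MAD themselves.
    return 0.6745 * (x - median) / mad

outliers = [x for x in readings if abs(modified_z(x)) > 3.5]
print("median:", median, "MAD:", mad)
print("flagged outliers:", outliers)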
VERACITY

The quality of captured data can vary greatly, and the accuracy of any analysis depends on the veracity of the source data. This is one of the unfortunate characteristics of big data: as any or all of the other properties increase, the veracity (confidence or trust in the data) tends to drop. Veracity is similar to, but not the same as, validity or volatility; it refers to the provenance or reliability of the data source, its context, and how meaningful it is to the analysis based on it.

For example, consider a data set of statistics on what people purchase at restaurants and the prices of those items over the past five years. You might ask: Who created the source? What methodology did they follow in collecting the data? Were only certain cuisines or certain types of restaurants included? Did the data creators summarize the information? Has the information been edited or modified by anyone else? Answers to these questions are necessary to determine the veracity of the information, and knowledge of the data's veracity in turn helps us better understand the risks associated with analyses and business decisions based on that particular data set.
COMPLEXITY

Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information they are supposed to convey; this situation is therefore termed the "complexity" of big data. A small example of such linking is sketched at the end of this section.

Factory work and cyber-physical systems may follow a 6C system:

1. Connection (sensors and networks)
2. Cloud (computing and data on demand)
3. Cyber (model and memory)
4. Content/context (meaning and correlation)
5. Community (sharing and collaboration)
6. Customization (personalization and value)

In this scenario, and in order to provide useful insight to factory management and gain correct content, data has to be processed with advanced tools (analytics and algorithms) to generate meaningful information. Considering the presence of visible and invisible issues in an industrial factory, the information-generation algorithm has to be capable of detecting and addressing invisible issues such as machine degradation and component wear on the factory floor.
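A minimal sketch of linking records from two sources on a shared key; the field names, records and the 180-day service threshold are invented for illustration.

from collections import defaultdict

# Two invented sources describing the same machines.
sensor_readings = [
    {"machine_id": "M1", "vibration_rms": 0.8},
    {"machine_id": "M2", "vibration_rms": 2.9},
]
maintenance_log = [
    {"machine_id": "M1", "last_service_days_ago": 12},
    {"machine_id": "M2", "last_service_days_ago": 340},
]

# Link (join) the sources on machine_id so readings can be interpreted in context.
linked = defaultdict(dict)
for row in sensor_readings + maintenance_log:
    linked[row["machine_id"]].update(row)

for machine_id, record in linked.items():
    overdue = record["last_service_days_ago"] > 180
    print(machine_id, record, "-> service overdue" if overdue else "-> ok")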
4. ARCHITECTURE

In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying. Structured, semi-structured and/or unstructured data is stored and distributed across multiple servers. Data is queried using a language called ECL, which applies a schema on read to impose structure on stored data at query time. In 2004 LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. along with its high-speed parallel processing platform. The two platforms were merged into HPCC Systems, which was open-sourced under the Apache v2.0 license in 2011. HPCC and the Quantcast File System are currently the only publicly available platforms capable of analyzing multiple exabytes of data.

In 2004, Google published a paper on a process called MapReduce that uses such an architecture. The MapReduce framework provides a parallel processing model and an associated implementation for processing huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step); the results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm, and an implementation of MapReduce was adopted by the Apache open-source project Hadoop. The sketch below illustrates the map and reduce steps.
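A minimal, single-machine sketch of the map and reduce steps described above, counting words across several documents (the documents are invented; a real MapReduce system runs the same logic distributed over many nodes and performs the shuffle step across the network).

from collections import defaultdict
from itertools import chain

documents = [
    "big data needs parallel processing",
    "map reduce splits work across nodes",
    "parallel nodes process data in parallel",
]

# Map step: each document independently becomes (word, 1) pairs,
# so this work can be spread across many nodes.
def map_step(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_step(doc) for doc in documents)

# Shuffle: group intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: combine the grouped values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)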
MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications, identified in an article titled "Big Data Solution Offering". The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.

Studies from 2012 showed that a multiple-layer architecture is one option for dealing with big data. A distributed parallel architecture distributes data across multiple servers, and these parallel execution environments can dramatically improve processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of the MapReduce and Hadoop frameworks, and it aims to make the processing power transparent to the end user by means of a front-end application server. Big data analytics for manufacturing applications can be based on a 5C architecture (connection, conversion, cyber, cognition, and configuration).

Big data lake: with the changing face of the business and IT sector, the capture and storage of data has grown into a sophisticated system. The big data lake allows an organization to shift its focus from centralized control to a shared model in order to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing overhead time.

Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s, and for many years WinterCorp published the largest-database report. Teradata Corporation marketed the parallel-processing DBC 1012 system in 1984, and Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder's law. Teradata installed the first petabyte-class RDBMS-based system in 2007. As of 2017, a few dozen petabyte-class Teradata relational databases are installed, the largest of which exceeds 50 PB.
Systems up until 2008 were 100% structured relational data; since then, Teradata has added unstructured data types including XML, JSON, and Avro.

The HPCC Systems platform (Seisint is now part of LexisNexis Risk Solutions) automatically partitions, distributes, stores and delivers structured, semi-structured and unstructured data across multiple commodity servers. Users write data processing pipelines and queries in ECL, a declarative dataflow programming language; data analysts working in ECL are not required to define data schemas upfront and can instead focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. LexisNexis used this platform to integrate the data systems of ChoicePoint Inc. after acquiring that company in 2008.

CERN and other physics experiments have collected big data sets for many decades, usually analyzed via high-throughput computing rather than the map-reduce architectures usually meant by the current "big data" movement.

Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm: it adds the ability to chain many operations, not just a map followed by a reduce. A brief sketch of that style of chained processing follows.
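A minimal sketch of chaining several operations in Spark, assuming PySpark is installed and a local Spark runtime is available; the input lines and the length filter are invented for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-ops-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "big data needs parallel processing",
    "spark chains many operations",
    "parallel nodes process data in parallel",
])

# Several chained operations, not just one map followed by one reduce.
counts = (lines.flatMap(lambda line: line.split())
               .filter(lambda word: len(word) > 3)
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()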
5. TECHNOLOGIES

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests that suitable technologies include A/B testing, crowdsourcing, data fusion and integration, genetic algorithms, machine learning, natural language processing, signal processing, simulation, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be handled more efficiently by tensor-based computation such as multilinear subspace learning. Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet. A small illustration of one of these techniques, A/B testing, is given below.
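A minimal illustration of A/B testing on click-through data; the visitor and click counts are invented, and the standard two-proportion z-test is used here only as one simple way to judge the result.

import math

# Invented experiment: variant A vs variant B of a web page.
clicks_a, visitors_a = 5_120, 100_000
clicks_b, visitors_b = 5_490, 100_000

p_a, p_b = clicks_a / visitors_a, clicks_b / visitors_b
p_pool = (clicks_a + clicks_b) / (visitors_a + visitors_b)

# Two-proportion z-test: is the difference in click-through rates likely real?
se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se

print(f"CTR A = {p_a:.4f}, CTR B = {p_b:.4f}, z = {z:.2f}")
print("difference significant at ~95% level" if abs(z) > 1.96 else "no clear difference")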
Some, but not all, MPP relational databases can store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of large data tables in the RDBMS. DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets, and in 2008 the technology went public with the launch of a company called Ayasdi.

Practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid-state drives (SSD) to high-capacity SATA disks buried inside parallel processing nodes. The perception of shared storage architectures, storage area networks (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.

Real or near-real-time information delivery is one of the defining characteristics of big data analytics, so latency is avoided whenever and wherever possible. Data in memory is good; data on a spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is much higher than that of other storage techniques.
6. APPLICATIONS

Big data has increased the demand for information management specialists to the point that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year, about twice as fast as the software business as a whole.

Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people access the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people earning money will become more literate, which in turn leads to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2014. It is estimated that one third of globally stored information is in the form of alphanumeric text and still-image data, the format most useful for most big data applications. This also shows the potential of as-yet-unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf solutions for big data, experts recommend developing in-house solutions custom-tailored to the company's problem at hand if the company has sufficient technical capabilities.

6.1 GOVERNMENT

The use and adoption of big data within governmental processes is beneficial and allows efficiencies in terms of cost, productivity, and innovation. That said, the process does not come without flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and to create new and innovative processes to deliver the desired outcome.

One of the greatest strengths of big data is its flexibility and universal applicability across many different industries. Along with many other areas, big data in government can have an enormous impact at the local, national and global levels. With so many complex issues on the table
today, governments have their work cut out trying to make sense of all the information they receive and to make vital decisions that affect millions of people. Not only is it difficult to sift through all the information, it is sometimes difficult to verify the reality of the information itself, and faulty information can have awful consequences.

By implementing a big data platform, governments can access vast amounts of relevant information important to their daily functions, and the positive effect this can have is nearly endless. It not only allows a government to pinpoint areas that need attention, it also provides that information in real time. In a society that moves so quickly from one thing to the next, real-time analysis is vital: it allows governments to make faster decisions, to monitor those decisions, and to quickly enact changes if necessary. Below are some thought-leading examples within the governmental big data space.

6.2 UNITED STATES OF AMERICA

In 2012, the Obama administration announced the Big Data Research and Development Initiative to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. The United States federal government owns six of the ten most powerful supercomputers in the world. The Utah Data Center is a data center being constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of the information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes.
6.3 INDIA

Big data analysis was partly responsible for the BJP and its allies winning the highly successful 2014 Indian general election. The Indian government utilises numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.

6.4 UNITED KINGDOM

Examples of uses of big data in public services:

• Data on prescription drugs: by connecting the origin, location and time of each prescription, a research unit was able to exemplify the considerable delay between the release of any given drug and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or the most up-to-date drugs take some time to filter through to the general patient.
• Joining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as meals on wheels. The connection of data allowed the local authority to avoid weather-related delays.

6.5 INTERNATIONAL DEVELOPMENT

Research on the effective usage of information and communication technologies for development (also known as ICT4D) suggests that big data technology can make important contributions, but can also present unique challenges, to international development. Advances in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. However, long-standing challenges for developing regions, such as inadequate technological infrastructure and economic and human resource scarcity, exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.
6.6 MANUFACTURING

Based on the TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for transparency in the manufacturing industry, that is, the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires a vast amount of data and advanced prediction tools for the systematic processing of data into useful information. A conceptual framework of predictive manufacturing begins with data acquisition, where different types of sensory data are available, such as acoustics, vibration, pressure, current, voltage and controller data. This vast amount of sensory data, in addition to historical data, constitutes big data in manufacturing. The generated big data acts as the input to predictive tools and preventive strategies such as prognostics and health management (PHM); a simple health indicator computed from vibration data is sketched below.
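The following is a minimal, purely illustrative sketch of turning raw sensory data into a simple health indicator; the synthetic vibration signal, the RMS indicator and the alarm threshold are assumptions for the example, not a prescribed PHM method.

import math
import random

# Invented vibration signal: a healthy baseline with growing wear-related noise.
random.seed(1)
signal = [math.sin(0.1 * t) + random.gauss(0, 0.05 + 0.0005 * t) for t in range(2000)]

def rms(window):
    # Root-mean-square amplitude, a common simple health indicator.
    return math.sqrt(sum(x * x for x in window) / len(window))

window_size, threshold = 200, 0.9
for start in range(0, len(signal), window_size):
    indicator = rms(signal[start:start + window_size])
    status = "ALERT: possible degradation" if indicator > threshold else "ok"
    print(f"window starting at t={start:4d}: RMS={indicator:.2f} {status}")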
6.7 CYBER-PHYSICAL MODELS

Current PHM implementations mostly utilize data gathered during actual usage, while analytical algorithms can perform more accurately when more information from throughout the machine's lifecycle, such as system configuration, physical knowledge and working principles, is included. There is a need to systematically integrate, manage and analyze machinery or process data during the different stages of the machine's life cycle in order to handle data and information more efficiently and to achieve better transparency of machine health condition for the manufacturing industry. With this motivation, a cyber-physical (coupled) model scheme has been developed (see http://www.imscenter.net/cyber-physical-platform). The coupled model is a digital twin of the real machine that operates in the cloud platform and simulates the health condition using integrated knowledge from both data-driven analytical algorithms and other available physical knowledge. It can also be described as a 5S systematic approach consisting of sensing, storage, synchronization, synthesis and service.

The coupled model first constructs a digital image of the machine from the early design stage. System information and physical knowledge are logged during product design, and a simulation model is built from them as a reference for future analysis. Initial parameters may be statistically generalized, and they can be tuned using data from testing or from the manufacturing process using parameter estimation; a toy example of such tuning is sketched below. Afterwards, the simulation model can be considered a mirrored image of the real machine, able to continuously record and track machine condition during the later utilization stage. Finally, with the ubiquitous connectivity offered by cloud computing technology, the coupled model also provides better accessibility of machine condition for factory managers in cases where physical access to the actual equipment or machine data is limited.
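A toy sketch of the parameter-estimation step mentioned above; the exponential wear model, the measurements and the brute-force grid search are all invented for illustration, whereas real implementations would use proper system-identification tooling.

import math

# Invented test measurements of a wear indicator over operating hours.
hours = [0, 100, 200, 300, 400, 500]
measured_wear = [0.02, 0.05, 0.09, 0.16, 0.27, 0.45]

def simulate(rate, t):
    # Assumed simple wear model used by the simulation: w(t) = w0 * exp(rate * t)
    return 0.02 * math.exp(rate * t)

# Tune the model parameter by minimizing squared error against the test data.
best_rate, best_error = None, float("inf")
for step in range(1, 2001):
    rate = step * 1e-5                      # candidate rates from 0.00001 to 0.02
    error = sum((simulate(rate, t) - w) ** 2 for t, w in zip(hours, measured_wear))
    if error < best_error:
        best_rate, best_error = rate, error

print(f"estimated wear rate: {best_rate:.5f} per hour (squared error {best_error:.5f})")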
6.8 MEDIA

6.8.1 INTERNET OF THINGS (IOT)

To understand how the media utilises big data, it is first necessary to provide some context on the mechanisms the media uses. Nick Couldry and Joseph Turow have suggested that practitioners in media and advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows, and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations. The ultimate aim is to serve, or convey, a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments increasingly tailor messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various data-mining activities. Big data is therefore used in media for:

• Targeting of consumers (for advertising by marketers)
• Data capture
Big data and the IoT work in conjunction. From a media perspective, data is the key derivative of device interconnectivity and allows accurate targeting. The Internet of Things, with the help of big data, is therefore transforming the media industry, companies and even governments, opening up a new era of economic growth and competitiveness. The intersection of people, data and intelligent algorithms has far-reaching impacts on media efficiency, and the wealth of data generated allows an elaborate layer on the present targeting mechanisms of the industry.

6.8.2 TECHNOLOGY

eBay.com uses two data warehouses at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster, for search, consumer recommendations, and merchandising. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers; the core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion photos from its user base, and as of August 2012, Google was handling roughly 100 billion searches per month.

6.9 PRIVATE SECTOR

6.9.1 RETAIL

Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
6.9.2 RETAIL BANKING

The FICO Card Detection System protects accounts worldwide. The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.

6.9.3 REAL ESTATE

Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work at various times of the day.

6.9.4 SCIENCE

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second; after filtering and refraining from recording more than 99.99995% of these streams, there are 100 collisions of interest per second. As a result, working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents a 25 petabyte annual rate before replication (as of 2012), which becomes nearly 200 petabytes after replication.

If all sensor data were recorded at the LHC, the data flow would be extremely hard to work with: it would exceed a 150 million petabyte annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day, almost 200 times more than all the other sources combined in the world. The small calculation below shows how the annual and daily figures relate.

The Square Kilometre Array is a telescope consisting of millions of antennas, expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.
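A small back-of-the-envelope check of the conversion between the annual and daily figures quoted above; this is a rough calculation only, and it ignores replication and rounding in the source figures.

PETABYTE = 10 ** 15
EXABYTE = 10 ** 18

annual_rate_pb = 150_000_000             # ~150 million petabytes per year (unfiltered)
daily_bytes = annual_rate_pb * PETABYTE / 365

print(f"daily rate: {daily_bytes / EXABYTE:.0f} exabytes per day")   # roughly 400-500 EB
print(f"daily rate: {daily_bytes:.2e} bytes per day")                # on the order of 5 x 10^20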
6.10 SCIENCE AND RESEARCH

When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all the data collected previously in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2016, it is anticipated to acquire that amount of data every five days. Decoding the human genome originally took 10 years to process; now it can be achieved in less than a day. DNA sequencers have divided the sequencing cost by 10,000 over the last ten years, which is 100 times cheaper than the cost reduction predicted by Moore's law. The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
7. ADVANTAGES

• Using big data cuts your costs.
• Using big data increases your efficiency.
• Using big data improves your pricing.
• You can compete with big businesses.
• It allows you to focus on local preferences.
• Using big data helps you increase sales and loyalty.
• Using big data helps ensure you hire the right employees.
8. DISADVANTAGES

• Traditional storage can cost a lot of money when used to store big data.
• Much big data is unstructured.
• Big data analysis can violate principles of privacy.
• It can be used to manipulate customer records.
• It may increase social stratification.
• Big data analysis is not useful in the short run; data needs to be analyzed over a longer duration to leverage its benefits.
• Big data analysis results are sometimes misleading.
• Speedy updates in big data can mismatch real figures.
9. CONCLUSION

The availability of big data, low-cost commodity hardware, and new information management and analytic software has produced a unique moment in the history of data analysis. The convergence of these trends means that, for the first time in history, we have the capabilities required to analyze astonishing data sets quickly and cost-effectively. These capabilities are neither theoretical nor trivial: they represent a genuine leap forward and a clear opportunity to realize enormous gains in efficiency, productivity, revenue, and profitability. The age of big data is here, and these are truly revolutionary times if business and technology professionals continue to work together and deliver on the promise.

As the career paths available in big data continue to grow, so does the shortage of big data professionals needed to fill those positions. The previous sections of this report introduced and explained the characteristics needed to be successful in the field of big data. Characteristics such as communication, knowledge of big data concepts, and agility are just as important as the technical skills. Big data professionals are the bridge between raw data and usable information: they should have the skills to manipulate data at the lowest levels, and they must know how to interpret its trends, patterns, and outliers in many different forms. The languages and methods used to achieve these goals are growing in strength and number, a pattern unlikely to change in the near future, especially as more languages and tools enter and gain popularity in the big data field.

Regardless of language, method, or specialization, big data scientists face a unique challenge: working in a field where their exact role lacks a clear definition. Within an organization they help to solve problems, but even these problems may be undefined. To complicate matters further, some data scientists work outside any specific organization and its direction, as in academic research. The concrete applications surveyed above demonstrate how diversely big data scientists can work across multiple disciplines.