1. DATA MINING ON BIG DATA
Presented by - Swapnil H. Chaudhari
Guided by
Prof. B. R. Mandre
DEPARTMENT OF COMPUTER ENGINEERING
SSVPS’s B. S. DEORE COLLEGE OF ENGINEERING, DHULE
2013 - 2014
18-Jan-16
2. OBJECTIVE :
Brief introduction on Big Data
What is Data Mining
Rise of big data
Big Data Characteristics: HACE Theorem
Data Mining Challenges with Big Data
A Big Data processing framework
3. BIG DATA AND DATA MINING
Big Data concerns large-volume, complex, growing data sets with
multiple, autonomous sources.
Data Mining is the process of semi-automatically analyzing large
databases to find patterns that are:
valid: hold on new data with some certainty
useful: should be possible to act on the item
understandable: humans should be able to interpret the
pattern
Also known as Knowledge Discovery in Databases (KDD)
4. HOW BIG IS BIG DATA?
- What is big today may not be big tomorrow
- Fast-growing Big Data can challenge our current technology in several respects:
- Volume
- Communication
- Speed of Generating
- Meaningful Analysis
5. BIG DATA VECTORS (4VS)
- Volume
amount of data
- Velocity
the speed at which data is collected, acquired, generated, or
processed
- Variety
different data types, such as audio, video, and image data (mostly
unstructured data)
- Variability
semantics, or the variability of meaning in language.
[Gartner 2012]
6. EXAMPLES:
Government
On 4 October 2012, the first presidential debate between President
Barack Obama and Governor Mitt Romney triggered more than 10
million tweets within 2 hours
Private Sector
Walmart handles more than 1 million customer transactions every hour,
which are imported into databases estimated to contain more than 2.5
petabytes of data
Facebook handles 40 billion photos from its user base.
Flickr, a public picture-sharing site, received 1.8 million photos
per day, on average, from February to March 2012 [5]. Assuming the
size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB)
of storage every single day.
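The storage arithmetic above can be checked directly; the upload figure and the 2 MB average photo size are taken from the example:

```python
# Daily storage needed for Flickr uploads, using the slide's figures.
photos_per_day = 1_800_000       # average uploads per day (Feb-Mar 2012)
photo_size_mb = 2                # assumed average photo size in MB

total_mb = photos_per_day * photo_size_mb
total_tb = total_mb / 1_000_000  # 1 TB = 1,000,000 MB (decimal units)

print(f"{total_tb:.1f} TB per day")  # -> 3.6 TB per day
```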
7. BIG DATA CHARACTERISTICS: HACE THEOREM
HACE Theorem. Big Data starts with large volume,
Heterogeneous, Autonomous sources with distributed and
decentralized control, and seeks to explore Complex and
Evolving relationships among data [1].
8. Fig. The blind men and the giant elephant: the localized (limited) view of each blind man leads to a
biased conclusion.
9. BIG DATA CHARACTERISTICS
Huge Data with Heterogeneous and Diverse Dimensionality.
Autonomous Sources with Distributed and Decentralized
Control.
Complex and Evolving Relationships.
10. CONCEPTUAL VIEW OF THE BIG DATA
PROCESSING FRAMEWORK
Fig. A Big Data processing framework
11. A BIG DATA PROCESSING FRAMEWORK:
Tier I: focuses on low-level data accessing and
computing.
Tier II: concentrates on high-level semantics, application
domain knowledge, and user privacy issues.
Tier III: addresses the challenges of the actual mining algorithms.
12. TIER I: BIG DATA MINING PLATFORM
Tier I focuses on low-level data accessing and
computing.
One of the most important characteristics of Big Data is the need to
carry out computing on petabyte (PB)- or even exabyte (EB)-level
data with a complex computing process.
13. Small-scale data mining tasks:
a single desktop computer, with its hard disk and CPU, is
sufficient to fulfill the data mining goals.
Medium-scale data mining tasks:
data are typically large (and possibly distributed) and cannot fit
into main memory. Common solutions are to rely on parallel
computing [3], [4] or collective mining [2] to sample and aggregate
data from different sources, and then use parallel programming
(such as the Message Passing Interface, MPI) to carry out the
mining process.
Big Data mining tasks:
a data mining task is deployed by running parallel
programming tools, such as MapReduce or Enterprise
Control Language (ECL), on a large number of computing
nodes (i.e., clusters).
14. MAPREDUCE TECHNIQUE
MapReduce is a programming model for distributed systems.
A MapReduce program executes in three stages:
Map
Shuffle
Reduce
MapReduce is a batch-oriented parallel computing model [7]
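The three stages can be illustrated with a minimal in-memory word count, the canonical MapReduce example (a single-process sketch in Python, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data mining", "data mining on big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'mining': 2, 'on': 1}
```

In a real framework such as Hadoop, the shuffle stage moves intermediate pairs across the network between mapper and reducer nodes; here it is a plain in-memory grouping.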
20. TIER II: BIG DATA SEMANTICS AND
APPLICATION KNOWLEDGE
Information Sharing and Data Privacy
To protect privacy, two common approaches are to
1. restrict access to the data
such as adding certification or access control to the data entries, so
that sensitive information is accessible only to a limited group of users
2. anonymize data fields
so that sensitive information cannot be pinpointed to an individual
record [15].
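A minimal sketch of the second approach, replacing sensitive fields with salted one-way hashes; the record layout, field names, and salt are illustrative assumptions, not the scheme from [15]:

```python
import hashlib

SENSITIVE_FIELDS = {"name", "ssn"}   # which fields to anonymize (assumed)
SALT = b"per-dataset-secret-salt"    # keeps hashes unlinkable across datasets

def anonymize(record):
    # Replace sensitive values with salted one-way hashes; keep other fields.
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            out[field] = digest[:16]   # truncated pseudonym
        else:
            out[field] = value
    return out

record = {"name": "Alice", "ssn": "123-45-6789", "zip": "94105"}
print(anonymize(record))
```

Note that hashing direct identifiers is only pseudonymization: remaining quasi-identifiers (such as the zip code above) may still allow re-identification through linkage, which is why the anonymization literature goes further.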
21. TIER II: BIG DATA SEMANTICS AND
APPLICATION KNOWLEDGE
Domain and Application Knowledge
Domain and application knowledge [28] provides essential
information for designing Big Data mining algorithms and
systems.
Helps identify the right features for modeling the underlying data.
Helps design achievable business objectives using Big Data
analytical techniques.
22. TIER III: BIG DATA MINING ALGORITHMS
Local Learning and Model Fusion for Multiple Information
Sources
Mining from Sparse, Uncertain, and Incomplete Data
Mining Complex and Dynamic Data
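The first item, learning a local model at each source and fusing the results, can be illustrated with per-source least-squares fits whose predictions are averaged (a toy single-machine sketch; real fusion strategies would weight sources by reliability):

```python
# Each information source trains a local model; a fusion step combines them.
def fit_local(points):
    # Least-squares fit y = a*x + b for one source's (x, y) data.
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def fuse_predict(models, x):
    # Model fusion: average the local models' predictions.
    preds = [a * x + b for a, b in models]
    return sum(preds) / len(preds)

source1 = [(0, 0.1), (1, 1.9), (2, 4.1)]   # noisy samples of y ~ 2x
source2 = [(0, -0.2), (1, 2.2), (2, 3.9)]
models = [fit_local(s) for s in (source1, source2)]
print(round(fuse_predict(models, 3), 2))   # -> 6.05
```

Keeping the learning local avoids shipping raw data between autonomous sources; only the fitted model parameters are exchanged for fusion.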
23. CONCLUSION
To explore Big Data, we have analyzed several challenges at the
data, model, and system levels.
To support Big Data mining, high-performance computing
platforms are required, which in turn call for systematic designs
that unleash the full power of Big Data.
24. REFERENCES
1. Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding, "Data Mining with Big Data," IEEE Transactions on
Knowledge and Data Engineering, vol. 26, no. 1, January 2014
2. B. Brown, M. Chui, and J. Manyika, "Are You Ready for the Era of 'Big Data'?" McKinsey Quarterly, October
2011, McKinsey Global Institute
3. C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, "The Meaningful Use of Big Data: Four Perspectives, Four
Challenges," SIGMOD Record, vol. 40, no. 4, December 2011
4. D. Boyd and K. Crawford, "Six Provocations for Big Data," A Decade in Internet Time: Symposium on the
Dynamics of the Internet and Society, September 2011, Oxford Internet Institute
5. D. Agrawal, S. Das, and A. El Abbadi, "Big Data and Cloud Computing: Current State and Future
Opportunities," EDBT 2011, Uppsala, Sweden
6. D. Agrawal, S. Das, and A. El Abbadi, "Big Data and Cloud Computing: New Wine or Just New Bottles?"
VLDB 2010, vol. 3, no. 2
7. F. J. Alexander, A. Hoisie, and A. Szalay, "Big Data," IEEE Computing in Science and Engineering,
2011
8. O. Trelles, P. Prins, M. Snir, and R. C. Jansen, "Big Data, but Are We Ready?" Nature Reviews, February 2011
9. K. Bakshi, "Considerations for Big Data: Architecture and Approach," IEEE Aerospace Conference, 2012
10. S. Lohr, "The Age of Big Data," The New York Times, February 2012
11. M. Nielsen, "A Guide to the Day of Big Data," Nature, vol. 462, December 2009
These characteristics make it an extreme challenge to discover useful knowledge from Big Data.

Imagine that a number of blind men are trying to size up a giant elephant (see Fig. 1), which represents Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant from the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant "feels" like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, assume that 1) the elephant is growing rapidly and its pose changes constantly, and 2) each blind man may have his own (possibly unreliable and inaccurate) information sources that tell him biased knowledge about the elephant (e.g., one blind man may exchange his impression of the elephant with another blind man, where the exchanged knowledge is inherently biased). Exploring Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources (the blind men) to help draw the best possible picture that reveals the genuine gesture of the elephant in real time. Indeed, this task is not as simple as asking each blind man to describe his impression of the elephant and then getting an expert to draw one single picture with a combined view, considering that each individual may speak a different language (heterogeneous and diverse information sources) and they may even have privacy concerns about the messages they deliver in the information exchange process.