1. DATA MINING ON BIG DATA
Presented by - Swapnil H. Chaudhari
Guided by
Prof. B. R. Mandre
DEPARTMENT OF COMPUTER ENGINEERING
SSVPS’s B. S. DEORE COLLEGE OF ENGINEERING, DHULE
2013 - 2014
18-Jan-16
2. OBJECTIVE :
Brief introduction on Big Data
What is Data Mining
Rise of big data
Big Data Characteristics: HACE Theorem
Data Mining Challenges with Big Data
A Big Data processing framework
3. BIG DATA AND DATA MINING
Big Data concerns large-volume, complex, growing data sets with
multiple, autonomous sources.
Data Mining is the process of semi-automatically analyzing large
databases to find patterns that are:
valid: hold on new data with some certainty
useful: should be possible to act on the item
understandable: humans should be able to interpret the
pattern
Also known as Knowledge Discovery in Databases (KDD)
4. HOW BIG IS BIG DATA?
- What is big today may not be big tomorrow
- Fast-growing Big Data can challenge our current technology in several respects:
- Volume
- Communication
- Speed of Generating
- Meaningful Analysis
5. BIG DATA VECTORS (4VS)
- Volume
amount of data
- Velocity
the speed at which data is collected, acquired, generated, or
processed
- Variety
different data types, such as audio, video, and image data (mostly
unstructured data)
- Variability
semantics, or the variability of meaning in language.
[Gartner 2012]
6. EXAMPLES:
Government
On 4 October 2012, the first presidential debate between President
Barack Obama and Governor Mitt Romney triggered more than 10
million tweets within 2 hours
Private Sector
Walmart handles more than 1 million customer transactions every hour,
which are imported into databases estimated to contain more than 2.5
petabytes of data
Facebook handles 40 billion photos from its user base.
Flickr, a public picture-sharing site, received 1.8 million photos
per day, on average, from February to March 2012 [5]. Assuming the
size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB)
of storage every single day.
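The storage arithmetic above can be checked directly; the upload figure and the 2 MB average photo size are taken from the example:

```python
# Daily storage needed for Flickr uploads, using the slide's figures.
photos_per_day = 1_800_000       # average uploads per day (Feb-Mar 2012)
photo_size_mb = 2                # assumed average photo size in MB

total_mb = photos_per_day * photo_size_mb
total_tb = total_mb / 1_000_000  # 1 TB = 1,000,000 MB (decimal units)

print(f"{total_tb:.1f} TB per day")  # -> 3.6 TB per day
```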
7. BIG DATA CHARACTERISTICS: HACE THEOREM
HACE Theorem. Big Data starts with large volume,
Heterogeneous, Autonomous sources with distributed and
decentralized control, and seeks to explore Complex and
Evolving relationships among data [1].
8. Fig. The blind men and the giant elephant: the localized (limited) view of each blind man leads to a
biased conclusion.
9. BIG DATA CHARACTERISTICS
Huge Data with Heterogeneous and Diverse Dimensionality.
Autonomous Sources with Distributed and Decentralized
Control.
Complex and Evolving Relationships.
10. CONCEPTUAL VIEW OF THE BIG DATA
PROCESSING FRAMEWORK
Fig. A Big Data processing framework
11. A BIG DATA PROCESSING FRAMEWORK:
Tier I: focuses on low-level data accessing and
computing.
Tier II: concentrates on high-level semantics, application
domain knowledge, and user privacy issues.
Tier III: addresses the challenges of the actual mining algorithms.
12. TIER I: BIG DATA MINING PLATFORM
Tier I focuses on low-level data accessing and
computing.
One of the most important characteristics of Big Data is the need to
carry out computing on petabyte (PB)- or even exabyte (EB)-level
data with a complex computing process.
13. Small-scale data mining tasks:
a single desktop computer, with its hard disk and CPU, is
sufficient to fulfill the data mining goals.
Medium-scale data mining tasks:
data are typically large (and possibly distributed) and cannot fit
into main memory. Common solutions are to rely on parallel
computing [3], [4] or collective mining [2] to sample and aggregate
data from different sources, and then use parallel programming
(such as the Message Passing Interface, MPI) to carry out the
mining process.
Big Data mining tasks:
a data mining task is deployed by running parallel
programming tools, such as MapReduce or Enterprise
Control Language (ECL), on a large number of computing
nodes (i.e., clusters).
14. MAPREDUCE TECHNIQUE
MapReduce is a programming model for distributed systems.
A MapReduce program executes in three stages:
Map
Shuffle
Reduce
MapReduce is a batch-oriented parallel computing model [7]
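The three stages can be illustrated with a minimal in-memory word count, the canonical MapReduce example (a single-process sketch in Python, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data mining", "data mining on big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'mining': 2, 'on': 1}
```

In a real framework such as Hadoop, the shuffle stage moves intermediate pairs across the network between mapper and reducer nodes; here it is a plain in-memory grouping.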
20. TIER II: BIG DATA SEMANTICS AND
APPLICATION KNOWLEDGE
Information Sharing and Data Privacy
To protect privacy, two common approaches are to
1. restrict access to the data
such as adding certification or access control to the data entries, so
that sensitive information is accessible only to a limited group of users
2. anonymize data fields
so that sensitive information cannot be pinpointed to an individual
record [15].
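A minimal sketch of the second approach, replacing sensitive fields with salted one-way hashes; the record layout, field names, and salt are illustrative assumptions, not the scheme from [15]:

```python
import hashlib

SENSITIVE_FIELDS = {"name", "ssn"}   # which fields to anonymize (assumed)
SALT = b"per-dataset-secret-salt"    # keeps hashes unlinkable across datasets

def anonymize(record):
    # Replace sensitive values with salted one-way hashes; keep other fields.
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            out[field] = digest[:16]   # truncated pseudonym
        else:
            out[field] = value
    return out

record = {"name": "Alice", "ssn": "123-45-6789", "zip": "94105"}
print(anonymize(record))
```

Note that hashing direct identifiers is only pseudonymization: remaining quasi-identifiers (such as the zip code above) may still allow re-identification through linkage, which is why the anonymization literature goes further.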
21. TIER II: BIG DATA SEMANTICS AND
APPLICATION KNOWLEDGE
Domain and Application Knowledge
Domain and application knowledge [28] provides essential
information for designing Big Data mining algorithms and
systems.
Helps identify the right features for modeling the underlying data.
Helps design achievable business objectives using Big Data
analytical techniques.
22. TIER III: BIG DATA MINING ALGORITHMS
Local Learning and Model Fusion for Multiple Information
Sources
Mining from Sparse, Uncertain, and Incomplete Data
Mining Complex and Dynamic Data
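The first item, learning a local model at each source and fusing the results, can be illustrated with per-source least-squares fits whose predictions are averaged (a toy single-machine sketch; real fusion strategies would weight sources by reliability):

```python
# Each information source trains a local model; a fusion step combines them.
def fit_local(points):
    # Least-squares fit y = a*x + b for one source's (x, y) data.
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def fuse_predict(models, x):
    # Model fusion: average the local models' predictions.
    preds = [a * x + b for a, b in models]
    return sum(preds) / len(preds)

source1 = [(0, 0.1), (1, 1.9), (2, 4.1)]   # noisy samples of y ~ 2x
source2 = [(0, -0.2), (1, 2.2), (2, 3.9)]
models = [fit_local(s) for s in (source1, source2)]
print(round(fuse_predict(models, 3), 2))   # -> 6.05
```

Keeping the learning local avoids shipping raw data between autonomous sources; only the fitted model parameters are exchanged for fusion.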
23. CONCLUSION
To explore Big Data, we have analyzed several challenges at the
data, model, and system levels.
To support Big Data mining, high-performance computing
platforms are required, which in turn call for systematic designs
that unleash the full power of Big Data.
24. REFERENCES
1. Xindong Wu, Xingquan Zhu, Gong-Qing Wu, and Wei Ding, "Data Mining with Big Data," IEEE Transactions on
Knowledge and Data Engineering, vol. 26, no. 1, January 2014
2. B. Brown, M. Chui, and J. Manyika, "Are You Ready for the Era of 'Big Data'?" McKinsey Quarterly, October
2011, McKinsey Global Institute
3. C. Bizer, P. Boncz, M. L. Brodie, and O. Erling, "The Meaningful Use of Big Data: Four Perspectives, Four
Challenges," SIGMOD Record, vol. 40, no. 4, December 2011
4. D. Boyd and K. Crawford, "Six Provocations for Big Data," A Decade in Internet Time: Symposium on the
Dynamics of the Internet and Society, September 2011, Oxford Internet Institute
5. D. Agrawal, S. Das, and A. El Abbadi, "Big Data and Cloud Computing: Current State and Future
Opportunities," EDBT 2011, Uppsala, Sweden
6. D. Agrawal, S. Das, and A. El Abbadi, "Big Data and Cloud Computing: New Wine or Just New Bottles?"
VLDB 2010, vol. 3, no. 2
7. F. J. Alexander, A. Hoisie, and A. Szalay, "Big Data," IEEE Computing in Science and Engineering,
2011
8. O. Trelles, P. Prins, M. Snir, and R. C. Jansen, "Big Data, but Are We Ready?" Nature Reviews, February 2011
9. K. Bakshi, "Considerations for Big Data: Architecture and Approach," IEEE Aerospace Conference, 2012
10. S. Lohr, "The Age of Big Data," The New York Times, February 2012
11. M. Nielsen, "A Guide to the Day of Big Data," Nature, vol. 462, December 2009
These characteristics make it an extreme challenge to discover useful knowledge from Big Data.

Imagine that a number of blind men are trying to size up a giant elephant (see Fig. 1), which represents Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant from the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant "feels" like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, assume that 1) the elephant is growing rapidly and its pose changes constantly, and 2) each blind man may have his own (possibly unreliable and inaccurate) information sources that tell him biased knowledge about the elephant (e.g., one blind man may exchange his impression of the elephant with another blind man, where the exchanged knowledge is inherently biased). Exploring Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources (the blind men) to help draw the best possible picture that reveals the genuine gesture of the elephant in real time. Indeed, this task is not as simple as asking each blind man to describe his impression of the elephant and then getting an expert to draw one single picture with a combined view, considering that each individual may speak a different language (heterogeneous and diverse information sources) and they may even have privacy concerns about the messages they deliver in the information exchange process.