2. Introduction
Data mining and big data analytics are
processes of examining data to
uncover hidden patterns, unknown
correlations and other useful
information that can be used to make
better decisions.
3. Definitions:
Big Data is a phrase used to
mean a massive volume of
both structured and
unstructured data that is so
large it is difficult to process
using traditional database
and software techniques.
Data mining is about
finding new information in a
lot of data. The information
obtained from data
mining is hopefully both
new and useful. In many
cases, data is stored so it
can be used later.
4. Interesting Facts
The volume of business data worldwide, across all companies, doubles
every 1.2 years (was 1.5 years)
Every day, 2,500 quadrillion (2.5 quintillion) bytes of data are produced, and
more than 90 percent of all data was produced within the past two years.
An ordinary person today processes more data daily than a 16th-century
individual did in an entire lifetime
In recent years, the cost of storage and processing power has dropped significantly
Bad data or poor data quality costs US businesses $600 billion annually
By 2015, 4.4 million IT jobs globally will be created to support big data
(Gartner)
Facebook processes 10 TB of data every day / Twitter 7 TB
Google has over 3 million servers processing over 2 trillion searches per
year in 2012 (only 22 million in 2000)
5. Characteristics of Big Data
Volume - The quantity of data
Variety - the categories (types) of data
Velocity - the speed at which data is generated or processed
Variability - inconsistency of the data
Complexity - the difficulty of managing the data
6. Big Data Mining Algorithm
Big data applications gather information from many sources.
To mine the data, we would need to gather all the distributed data at a
centralized site. But this is prohibitive because of high data transmission
costs and privacy concerns.
In most mining tasks, however, the correlations or patterns of interest
are discovered from a combination of sources.
Global data mining is done through a two-step process.
Model level
Knowledge level.
Each local site uses its local data to calculate data statistics
and shares this information with other sites to obtain the global data
distribution at the data level.
7. At the model level, each site produces local patterns by mining its
local data.
By sharing these local patterns with other local sites, we can produce a
single global pattern.
At the knowledge level, model correlation analysis investigates the
relevance between models generated from various data sources to
determine how closely the data sources are correlated with each other,
and how to form accurate decisions based on models built from
autonomous sources.
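The three levels above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sites and their data are invented, the "patterns" are simple co-occurring item pairs, and the Jaccard overlap is only a stand-in for model correlation analysis. The point is that sites exchange summaries, never raw data.

```python
from collections import Counter
from itertools import combinations

def local_statistics(values):
    """Data level: summarize one site's numbers without sharing raw rows."""
    return {"n": len(values), "total": sum(values)}

def global_mean(stats_list):
    """Combine the shared local statistics into one global statistic."""
    n = sum(s["n"] for s in stats_list)
    total = sum(s["total"] for s in stats_list)
    return total / n

def local_patterns(transactions):
    """Model level: count item pairs that co-occur in one site's transactions."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

def global_patterns(local_counts, min_count):
    """Merge shared local patterns and keep those frequent across all sites."""
    merged = Counter()
    for counts in local_counts:
        merged.update(counts)
    return {pair: k for pair, k in merged.items() if k >= min_count}

def source_similarity(patterns_a, patterns_b):
    """Knowledge level (stand-in): Jaccard overlap of two sites' pattern sets."""
    a, b = set(patterns_a), set(patterns_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical sites; the raw data never leaves a site, only summaries do.
site_a = {"values": [10, 20, 30],
          "transactions": [["milk", "bread"], ["milk", "bread", "eggs"]]}
site_b = {"values": [40, 50],
          "transactions": [["bread", "milk"], ["eggs", "tea"]]}
sites = [site_a, site_b]

mean = global_mean([local_statistics(s["values"]) for s in sites])
per_site = [local_patterns(s["transactions"]) for s in sites]
patterns = global_patterns(per_site, min_count=3)
similarity = source_similarity(per_site[0], per_site[1])

print(mean)        # 30.0
print(patterns)    # {('bread', 'milk'): 3}
print(similarity)  # 0.25
```

Only counts and sums cross site boundaries here, which is what keeps transmission cost low and avoids moving private records to a central site.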
8. Applications of Big Data
Healthcare organizations can achieve better insight into disease trends
and patient treatments.
Public sector agencies can catch fraud and other threats in real-time.
Applications of multimedia data:
Finding the travel patterns of travelers
CCTV camera footage
Photos and videos from social networks
Recommender systems
Integration and mining of biodata from various sources in biological
networks, by the NSF (National Science Foundation).
Classifying big data streams at run time, by the Australian Research
Council.
9. Applications of Data Mining
In healthcare, data mining uses data and analytics to identify best practices
that improve care and reduce costs.
Market basket analysis is a modelling technique based upon a theory that if
you buy a certain group of items you are more likely to buy another group of
items. This technique may allow the retailer to understand the purchase
behaviour of a buyer.
A newly emerging field, called Educational Data Mining, is concerned with
developing methods that discover knowledge from data originating from
educational environments.
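The market basket idea mentioned above is usually quantified with two measures, support and confidence. The sketch below is a minimal illustration in Python; the baskets and item names are invented, and real retailers would typically use a dedicated frequent-itemset algorithm rather than this direct counting.

```python
def support(transactions, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "tea"],
]

rule_support = support(baskets, ["bread", "butter"])
rule_confidence = confidence(baskets, ["bread"], ["butter"])

print(rule_support)               # 0.5  (2 of 4 baskets hold both items)
print(f"{rule_confidence:.2f}")   # 0.67 (2 of the 3 bread baskets hold butter)
```

A rule such as "bread implies butter" is only interesting when both numbers are high enough; the thresholds are a choice the analyst makes.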
10. DATA MINING CHALLENGES WITH BIG DATA
The main challenge for an intelligent database is handling Big Data. The
important thing is to scale to the large amount of data and to provide
solutions to this problem, guided by the HACE theorem
11. Challenges
Location of Big Data sources - Big Data is commonly stored in different locations
Volume of Big Data - the size of Big Data grows continuously
Hardware resources - RAM capacity
Privacy - medical reports, bank transactions
Having domain knowledge
Getting meaningful information
12. Solutions
Parallel programming
An efficient computing platform will not have centralized data storage; instead,
storage will be distributed at large scale.
Restricting access to the data
13. BIG Data Mining Tools
Hadoop
Apache S4
Storm
Apache Mahout
MOA
14. Hadoop
It is an open-source software platform for scalable, distributed
computing, developed as an Apache Software Foundation project.
Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
Hadoop provides fast and reliable analysis of both structured and
unstructured data.
It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
Hadoop uses the MapReduce programming model to mine data.
A MapReduce program splits the input dataset into independent
subsets, which are processed by map tasks in parallel.
Map() procedure that performs filtering and sorting
Reduce() procedure that performs a summary operation
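The Map() and Reduce() roles can be illustrated with a word count written in plain Python. This is only a sketch of the programming model, not the Hadoop API: the function names and input splits are invented, and everything runs sequentially in one process, whereas Hadoop would run the map tasks in parallel on different machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map(): emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce(): summarize all values for one key, here by summing the counts."""
    return key, sum(values)

# Hypothetical input, already divided into two independent splits.
splits = [
    ["big data needs big tools"],
    ["data mining finds patterns in data"],
]

# Each split is mapped independently; this is the step that can run in parallel.
mapped = [pair for chunk in splits for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_counts["data"])  # 3
print(word_counts["big"])   # 2
```

Because each split is mapped without looking at any other split, adding machines adds capacity, which is the property that lets the model scale.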
15. Data Mining Software
•Weka - open-source software for data mining
•RapidMiner - an open-source system for data and text mining
•KNIME - an open-source data integration, processing, analysis, and exploration
platform
•The Mahout machine learning library - mining large data sets. It supports
recommendation mining, clustering, classification and frequent itemset mining.
•Rattle - a GUI for data mining using R
16. From the dawn of civilization until
2003, humankind generated five
exabytes of data. Now we produce
five exabytes every two days…and
the pace is accelerating.
Eric Schmidt,
Executive Chairman, Google
Editor's notes
Sources
Social network
Satellite data
Geographical data
Live streaming data