Abstract:
Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
2. Problem Definition
Purpose
What is ….
Challenges with data
Big data algorithms
How To Produce The Big Data
Big Data Characteristics
Applications of Data Mining
Field of Big Data
Variety (Complexity)
Real-time/Fast Data
Real-Time Analytics/Decision Requirement
A Single View to the Customer
What’s driving Big Data
Benefits
3. Big Data consists of large-volume, complex,
growing data sets with multiple, independent
sources. With the fast development of networking,
data storage, and data collection capacity, Big
Data is now expanding rapidly in all science and
engineering domains, including the physical,
biological and biomedical sciences. This paper
presents a HACE theorem that characterizes the
features of the Big Data revolution, and proposes a
Big Data processing model from the data mining
perspective.
4. This requires carefully designed algorithms to
analyze model correlations between distributed sites,
and fuse decisions from multiple sources to gain a
best model out of the Big Data. Developing a safe
and sound information sharing protocol is a major
challenge. To support Big Data mining, high-
performance computing platforms are required,
which impose systematic designs to unleash the full
power of the Big Data. Big Data is an emerging trend,
and the need for Big Data mining is rising in all
science and engineering domains.
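One simple way to picture "fusing decisions from multiple sources" is a majority vote over models trained locally at each site. The sketch below is only an illustration under stated assumptions, not the method of the paper: the one-feature threshold models, the labels "high" and "low", and the names `train_local_threshold` and `fuse_by_majority` are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Each site trains on its own data and shares only its model (a callable),
# never the raw records. The threshold rule is a hypothetical stand-in model.
Model = Callable[[float], str]

def train_local_threshold(values: Sequence[float], labels: Sequence[str]) -> Model:
    """Learn a threshold halfway between the means of the two classes."""
    high = [v for v, y in zip(values, labels) if y == "high"]
    low = [v for v, y in zip(values, labels) if y == "low"]
    threshold = (sum(high) / len(high) + sum(low) / len(low)) / 2
    return lambda x: "high" if x >= threshold else "low"

def fuse_by_majority(models: List[Model], x: float) -> str:
    """Fuse the local decisions by taking the most common predicted label."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three sites with different local samples (made-up numbers).
site_a = train_local_threshold([1, 2, 8, 9], ["low", "low", "high", "high"])
site_b = train_local_threshold([0, 3, 7, 10], ["low", "low", "high", "high"])
site_c = train_local_threshold([2, 4, 9, 12], ["low", "low", "high", "high"])

fused = fuse_by_majority([site_a, site_b, site_c], 6.0)
print(fused)
```

Sharing only the trained models, rather than the records, is one way to limit what each site has to disclose. It does not by itself amount to a safe information sharing protocol.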
5. What is …… ?
Data Mining
computational process of discovering patterns in large data sets
Big Data
Big data is data characterized by three attributes: volume, variety and
velocity.
It is the term for a collection of data sets so large and complex that it becomes
difficult to process.
Data shows exponential growth, both structured and unstructured.
Data: data is any set of characters that has been gathered and translated
for some purpose, usually analysis. It can be any character, including text and
numbers, pictures, sound, or video. Data that is not put into context means
nothing to a human or a computer.
6. How much data exists?
• 2.5 quintillion bytes of data are created
EVERY DAY
• IBM: 90 percent of the data in the world today
was produced within the past two years
• Forms of data?
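To put the daily figure above in more familiar units, the arithmetic can be worked out directly. This is a small sketch that assumes decimal (SI) units, where 1 EB = 10^18 bytes and 1 TB = 10^12 bytes, and an even spread over the day.

```python
BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes, the figure quoted above
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Decimal (SI) units are assumed here: 1 EB = 1e18 bytes, 1 TB = 1e12 bytes.
exabytes_per_day = BYTES_PER_DAY / 1e18
terabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e12

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

Under those assumptions, 2.5 quintillion bytes per day is 2.5 exabytes per day, or roughly 29 terabytes every second.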
7. Data Mining Challenges with Big Data
• Big Data Mining Platform
• Big Data Semantics and Application Knowledge
I. Information Sharing and Data Privacy
II. Domain and Application Knowledge
• Big Data Mining Algorithm
I. Local Learning and Model Fusion for Multiple
Information Sources
II. Mining from Sparse, Uncertain, and Incomplete Data
III. Mining Complex and Dynamic Data
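As a small illustration of the "incomplete data" challenge listed above, the sketch below fills missing values with the mean of the observed values in the same column. Mean imputation is only one simple baseline, chosen here for brevity; the slides do not prescribe it. The function name `impute_column_means` and the sample readings are hypothetical.

```python
from typing import List, Optional

def impute_column_means(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace each missing value (None) with the mean of its column's observed values."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # A column with no observed values falls back to 0.0 (an arbitrary choice).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        for row in rows
    ]

# Made-up sensor readings with gaps.
readings = [[1.0, None], [3.0, 4.0], [None, 8.0]]
completed = impute_column_means(readings)
print(completed)
```

Imputation hides the uncertainty of the filled-in values, so any model trained on the completed table should be read with that in mind.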
9. Data Mining Algorithm
Decision tree induction classification algorithms
Evolutionary based classification algorithms
Partitioning based clustering algorithms
Hierarchical based clustering algorithms
Model based clustering algorithms
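As an example of the partitioning based family listed above, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. It assumes 2-D points and caller-supplied starting centroids. The function names and the sample points are illustrative only, and a production system would more likely use a library implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    assignments: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments

# Two visibly separate groups of made-up points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, centroids=[points[0], points[3]])
print(centroids, labels)
```

The result depends on the starting centroids, which is why they are passed in explicitly here rather than chosen at random.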
10. How To Produce The Big Data
Big Data Types
• Enterprise Data
• Transactions
• Public Data
• Social Media
• Sensor Data
11. Big Data Characteristics
Data has grown tremendously.
Big Data starts with large-volume, heterogeneous, autonomous sources
with distributed and decentralized systems.
12. Applications of Data Mining
Marketing
o Analysis of consumer behavior
o Advertising campaigns
o Targeted mailings
Finance
o Creditworthiness of clients
o Performance analysis of finance investments
Manufacturing
o Optimization of resources
o Optimization of manufacturing processes
15. Variety (Complexity)
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
o Social Network, Semantic Web (RDF), …
Streaming Data
o You can only scan the data once
A single application can be generating/collecting many types of data
Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together
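The single-scan constraint on streaming data noted above shapes which algorithms can be used. As a sketch, Welford's algorithm computes the count, mean and sample variance in one pass without storing the stream. The function name and the sample values are illustrative assumptions.

```python
from typing import Iterable, Tuple

def one_pass_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Welford's algorithm: count, mean and sample variance in a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

# A generator can be consumed only once, much like a real stream.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = one_pass_mean_variance(readings)
print(count, mean, variance)
```

Only three numbers are kept in memory regardless of how long the stream runs, which is the property that matters when the data cannot be revisited.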
16. Real-time/Fast Data
Progress and innovation are no longer hindered by the ability to collect
data, but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable
fashion.
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
17. Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
• Product Recommendations that are Relevant & Compelling
• Friend Invitations to join a Game or Activity that expands business
• Preventing Fraud as it is Occurring & preventing more proactively
• Learning why Customers Switch to competitors and their offers; in time to Counter
• Improving the Marketing Effectiveness of a Promotion while it is still in Play
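To make "preventing fraud as it is occurring" concrete, one very simple real-time check is a velocity rule over a sliding time window. The sketch below is a toy illustration, not a real fraud system: the class name `VelocityRule`, the thresholds and the card identifiers are hypothetical, and timestamps are assumed to arrive in order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict

class VelocityRule:
    """Flag a card that makes more than `max_events` transactions within `window_seconds`."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, card_id: str, timestamp: float) -> bool:
        """Record one transaction and return True if it should be flagged."""
        recent = self._recent[card_id]
        recent.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_events

rule = VelocityRule(max_events=3, window_seconds=60.0)
events = [("card-1", 0.0), ("card-1", 10.0), ("card-2", 12.0),
          ("card-1", 20.0), ("card-1", 30.0), ("card-1", 200.0)]
flags = [rule.observe(card, ts) for card, ts in events]
print(flags)
```

Each decision uses only the events seen so far, so the flag can be raised while the suspicious burst is still in progress rather than in a later batch report.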
18. A Single View to the Customer
Customer
• Social Media
• Gaming
• Entertain
• Banking
• Finance
• Our Known History
• Purchase
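A single customer view can be sketched as grouping records from each source under one customer key. The example below assumes every source already carries a shared `customer_id` field, which in practice usually requires a separate record-matching step. The source names, field names and records are made up for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]

def build_customer_view(
    sources: Dict[str, Iterable[Record]]
) -> Dict[str, Dict[str, List[Record]]]:
    """Group records from every source under the customer they belong to."""
    view: Dict[str, Dict[str, List[Record]]] = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            # Assumes each record has a "customer_id" key (hypothetical schema).
            view[record["customer_id"]][source_name].append(record)
    # Convert the nested defaultdicts to plain dicts before returning.
    return {cid: dict(by_source) for cid, by_source in view.items()}

sources = {
    "banking": [{"customer_id": "c1", "balance": 1200}],
    "purchase": [{"customer_id": "c1", "item": "laptop"},
                 {"customer_id": "c2", "item": "phone"}],
    "social_media": [{"customer_id": "c2", "handle": "@example"}],
}
customer_view = build_customer_view(sources)
print(sorted(customer_view["c1"]))
```

Keeping the records grouped by source, rather than flattening them into one row, preserves where each fact came from.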
19. 5 Vs of Big Data
Volume
• Data quantity
Velocity
• Data Speed
Variety
• Data Types
Veracity
• Authenticity
Value
• Statistical
• Events
20. What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time
21. Benefits
Cost & management
Economies of scale, “out-sourced” resource management
Reduced Time to deployment
Ease of assembly, works “out of the box”
Scaling
On demand provisioning, co-locate data and compute
Reliability
Massive, redundant, shared resources
Sustainability
Hardware not owned