2. Why use Data Mining?
• Explosive growth of available data
• Major sources:
• Business: Web, E-Commerce, transactions
• Science: remote sensing, bioinformatics, …
• Society : news, gadgets, social media
• Too much data but too little information
• Data mining extracts useful information from the data and helps interpret it
• It can automate the process of finding relationships and patterns in raw data
3. What is Data Mining?
• Also called Knowledge Discovery in Databases, or "KDD"
• The process of extracting hidden predictive information from large data sets
• Converts information into knowledge that helps predict future trends and supports decisions
• Examples:
• Consumer buying behavior in retail supermarket sales
• Google Instant, YouTube Instant
• Blogs and news: Technorati, News360 and so on
• Social mining: Livehoods, which finds patterns and behaviors in Foursquare check-in data
4. Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM) defines six phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
5. Techniques
I. Association rules: also known as market basket analysis; discover interesting associations between attributes
II. Classification: a machine-learning technique that assigns items to predefined classes, using mathematical methods such as decision trees, linear programming, neural networks and statistics
III. Clustering: groups objects that have similar characteristics into meaningful or useful clusters
IV. Prediction: discovers relationships among independent variables, and between dependent and independent variables (for example, via regression)
V. Sequential patterns: discover similar patterns in transaction data over a business period
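As an illustration of technique I, the sketch below computes the two measures most often used to rank association rules, support and confidence, on a toy set of shopping baskets. The baskets, the thresholds and the function names are all hypothetical and chosen only for this example; the brute-force enumeration is a minimal sketch, not how a production miner such as Apriori prunes candidates.

```python
from itertools import combinations

# Hypothetical toy transactions (not from the slides): one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def rules(transactions, min_support=0.5, min_confidence=0.6):
    """Enumerate single-item rules lhs -> rhs that pass both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s >= min_support:
                c = confidence({lhs}, {rhs}, transactions)
                if c >= min_confidence:
                    found.append((lhs, rhs, s, c))
    return found

if __name__ == "__main__":
    for lhs, rhs, s, c in rules(baskets):
        print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

With these assumed thresholds, a rule is kept only if the pair appears in at least half of the baskets and the right-hand item appears in at least 60% of the baskets containing the left-hand item.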
6. Tools
• There are three categories of tools for data mining:
i. Traditional Data Mining Tools
ii. Dashboards
iii. Text-mining Tools
• Some data mining tools:
• R: r-project.org
• Datameer Analytics Solution: datameer.com
• SAS Analytics: sas.com
• Google Chart API: code.google.com/apis/chart
7. Column Stores
• Store each table column by column rather than row by row
• Column-oriented DBMSs:
• Bigtable, HBase, Hypertable, Cassandra (NoSQL)
• Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (relational)
• Used in systems such as data warehouses and data mining platforms
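• Example:
Emp_ID  Emp_Name  Emp_Dept   Emp_Salary
1       Smith     IT         40000
2       Adam      Sales      35000
3       Jones     Marketing  45000
The database must flatten this two-dimensional table into a one-dimensional sequence for the operating system to write to storage
• A column store serializes all values of one column together, then the next column:
1, 2, 3
Smith, Adam, Jones
IT, Sales, Marketing
40000, 35000, 45000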
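The two ways of flattening the example table can be sketched in a few lines of Python. This is only an in-memory illustration using plain lists; the function names are made up for this example, and a real DBMS writes encoded bytes to disk pages rather than Python objects.

```python
# The example table from the slide, as a list of rows.
COLUMNS = ["Emp_ID", "Emp_Name", "Emp_Dept", "Emp_Salary"]
ROWS = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]

def row_layout(rows):
    """Row store: all fields of row 1, then all fields of row 2, and so on."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Column store: all values of column 1, then all values of column 2, and so on."""
    return [value for column in zip(*rows) for value in column]

def average_salary(column_major, n_rows, columns=COLUMNS):
    """Aggregate one column by slicing only its contiguous run of values."""
    start = columns.index("Emp_Salary") * n_rows
    salaries = column_major[start:start + n_rows]
    return sum(salaries) / n_rows

if __name__ == "__main__":
    print(row_layout(ROWS))
    print(column_layout(ROWS))
    print(average_salary(column_layout(ROWS), len(ROWS)))
```

In the column layout the salaries sit next to each other, so an aggregate over Emp_Salary touches one contiguous slice. In the row layout the same query would have to step over names and departments to reach each salary.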
8. Advantages and Disadvantages of Column Stores
Advantages
• Only relevant data needs to be read (improved bandwidth utilization)
• Improved cache locality: no need to transmit surrounding attributes
• Compression efficiency: columns compress better than rows, because a row contains values from different domains
Row-store compression ratio: 1:3
Column-store compression ratio: 1:10
Disadvantages
• Increased disk seek time
• Increased cost of inserts
• Increased tuple reconstruction costs
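One way to see why values from a single domain compress well is run-length encoding (RLE). The sketch below is a generic illustration with hypothetical data; the slides do not say which compression scheme any particular column store uses, and the ratios quoted above are not reproduced by this toy example.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical department column, sorted so equal values sit next to each other.
dept_column = ["IT", "IT", "IT", "Marketing", "Marketing", "Sales"]

# The same departments interleaved row by row with an ID attribute: no runs form.
row_major = [1, "IT", 2, "IT", 3, "IT", 4, "Marketing", 5, "Marketing", 6, "Sales"]

if __name__ == "__main__":
    print(rle_encode(dept_column))      # three pairs describe six values
    print(len(rle_encode(row_major)))   # one pair per value, so nothing is saved
```

Stored as a column, the six department values shrink to three pairs. Interleaved with another attribute, every run has length one and RLE gains nothing.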
9. Case Study: Bazaarvoice
• Had difficulty aggregating large amounts of data on the fly, in real time, for its analytics product
• Common pattern among queries: a small number of columns, with most values being aggregates such as counts, sums and averages
• Adopted Infobright, an open-source database built on MySQL
• Tested on a data set with 100 million records in the main fact table
• Analytical queries ran on average 20x faster than on MySQL
10. Case Study: Bazaarvoice (cont.)
• Disk footprint was over 10x smaller than with MySQL, thanks to data compression
• Why?
• Column store: less disk I/O
• “Knowledge grid”: aggregate data that Infobright calculates during data loading
• E.g. it pre-calculates the min, max, and avg values for each column in a data pack
• Limitations of Infobright
• Does not support DML
• The only way to add data is to bulk load it with the “LOAD DATA INFILE …” command
• No way to update or delete existing data without reloading the table
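The “knowledge grid” idea from this case study, statistics computed once per pack at load time and consulted before any data is read, can be sketched as follows. This is a simplified illustration, not Infobright's actual implementation: the class name, the function names and the pack size of 4 are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class PackStats:
    """Summary computed once at load time for one pack of column values."""
    minimum: float
    maximum: float
    total: float
    count: int

def load_column(values, pack_size=4):
    """Split a column into fixed-size packs and precompute stats for each pack."""
    packs = [values[i:i + pack_size] for i in range(0, len(values), pack_size)]
    stats = [PackStats(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, stats

def count_greater_than(packs, stats, threshold):
    """Count values > threshold, opening a pack only when its stats are inconclusive."""
    matched, packs_read = 0, 0
    for pack, s in zip(packs, stats):
        if s.maximum <= threshold:
            continue                    # no value can match: skip the pack
        if s.minimum > threshold:
            matched += s.count          # every value matches: answer from stats alone
            continue
        packs_read += 1                 # mixed pack: must scan the actual values
        matched += sum(v > threshold for v in pack)
    return matched, packs_read

def column_average(stats):
    """Average of the whole column from precomputed sums and counts only."""
    return sum(s.total for s in stats) / sum(s.count for s in stats)

if __name__ == "__main__":
    packs, stats = load_column([1, 2, 3, 4, 10, 20, 30, 40, 5, 50, 6, 60])
    print(count_greater_than(packs, stats, 8))
    print(column_average(stats))
```

In this sketch only one of the three packs has to be scanned to answer the filter, and the average needs no pack data at all. That is the kind of saving that makes aggregate-heavy analytical queries cheap, at the price of doing the extra work during loading.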