Data mining & column stores

Aung Thu Rha Hein

Why use Data Mining?
• Explosive growth of available data
• Major sources:
  • Business: Web, e-commerce, transactions
  • Science: remote sensing, bioinformatics, …
  • Society: news, gadgets, social media

• Too much data, but too little information
• Data mining extracts useful information from the data and helps interpret it
• It can automate the process of finding relationships and patterns in raw data

What is Data Mining?
• Also known as Knowledge Discovery in Databases, or “KDD”
• The process of extracting hidden predictive information from large data sets
• Converts information into knowledge that helps predict future trends and supports decisions
• Examples:
  • Consumer buying behavior in retail supermarket sales
  • Google Instant, YouTube Instant
  • Blogs and news: Technorati, News360 and so on
  • Social mining: Livehoods, which finds patterns and behaviors in Foursquare check-in data

Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM) has six phases:
I. Business understanding
II. Data understanding
III. Data preparation
IV. Modeling
V. Evaluation
VI. Deployment

Techniques
I. Association rules: also known as market basket analysis
  • Discover interesting associations between attributes
II. Classification: a technique based on machine learning
  • Uses mathematical techniques such as decision trees, linear programming, neural networks and statistics
III. Clustering: groups objects with similar characteristics into meaningful or useful clusters
IV. Prediction: discovers relationships among independent variables, and between dependent and independent variables
V. Sequential patterns: discover similar patterns in transaction data over a business period
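
As an illustration of technique I, the following is a minimal Python sketch of market basket analysis. The baskets, the thresholds and the function names (`support`, `confidence`, `pair_rules`) are assumptions made up for this example and do not come from a particular tool. It scores rules between pairs of items with the two usual measures: support (how often the items occur together) and confidence (how often the right-hand item appears when the left-hand item does).

```python
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs above both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, baskets)
            if conf >= min_confidence:
                rules.append((lhs, rhs, pair_support, conf))
    return rules

rules = pair_rules(baskets)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```

The sketch checks every pair by brute force, which is only practical for tiny data sets; real tools prune the search space (for example with the Apriori algorithm).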

Tools
• There are three categories of data mining tools:
i. Traditional data mining tools
ii. Dashboards
iii. Text-mining tools

Some data mining tools:
• R: r-project.org
• Datameer Analytics Solution: datameer.com
• SAS Analytics: sas.com
• Google Chart API: code.google.com/apis/chart

Column Stores
• Store each table column by column rather than row by row
• Column-oriented DBMSs:
  • Bigtable, HBase, Hypertable, Cassandra (NoSQL)
  • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (relational)
• Used for workloads such as data warehousing and data mining
• Example:
  Emp_ID  Emp_Name  Emp_Dept   Emp_Salary
  1       Smith     IT         40000
  2       Adam      Sales      35000
  3       Jones     Marketing  45000
• To store the table, the database must serialize its two dimensions into a one-dimensional sequence for the operating system
• A column store writes the values of each column together:
  1, 2, 3
  Smith, Adam, Jones
  IT, Sales, Marketing
  40000, 35000, 45000
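
The two ways of serializing the example table can be sketched in Python as follows. This is only an illustration with flat lists standing in for disk storage; the function names `row_layout` and `column_layout` are assumptions made for the example.

```python
# The employee table from the example, one tuple per record.
rows = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]

def row_layout(rows):
    """Row store: the values of one record sit next to each other."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Column store: the values of one column sit next to each other."""
    return [value for column in zip(*rows) for value in column]

row_major = row_layout(rows)
col_major = column_layout(rows)

n_rows, n_cols = len(rows), len(rows[0])
salary_index = 3  # Emp_Salary is the fourth column

# In the row layout the salaries are scattered: every n_cols-th slot.
salary_from_rows = sum(row_major[salary_index::n_cols])
# In the column layout they form one contiguous slice.
salary_from_cols = sum(col_major[salary_index * n_rows:(salary_index + 1) * n_rows])

print(col_major)
print(salary_from_rows, salary_from_cols)
```

Both sums give the same answer; the difference is that the column layout lets an aggregate over one attribute read a single contiguous range instead of stepping over the other attributes.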

Advantages and Disadvantages of Column Stores
Advantages
• Only the relevant columns need to be read (improved bandwidth utilization)
• Improved cache locality
  • No need to transmit surrounding attributes
• Compression efficiency: columns compress better than rows
  • Rows contain values from different domains, whereas a column holds values from a single domain
  • Row-store compression ratio: 1:3
  • Column-store compression ratio: 1:10
Disadvantages
• Increased disk seek time
• Increased cost of inserts
• Increased tuple reconstruction costs
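
To see why a column tends to compress better than a row, here is a small Python sketch using run-length encoding. Run-length encoding is chosen for this illustration only; the slides do not name a compression scheme. The records are hypothetical and the sketch does not reproduce the ratios quoted above.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical (department, salary band) records, sorted by department.
records = [
    ("IT", "A"), ("IT", "B"), ("IT", "A"),
    ("Sales", "B"), ("Sales", "A"), ("Sales", "B"),
]

# Row layout: values from different domains alternate, so no runs form.
row_major = [value for record in records for value in record]
# Column layout: the department column holds one domain with long runs.
dept_column = [record[0] for record in records]

encoded_rows = run_length_encode(row_major)
encoded_dept = run_length_encode(dept_column)

print(len(row_major), "values ->", len(encoded_rows), "runs (row layout)")
print(len(dept_column), "values ->", len(encoded_dept), "runs (department column)")
```

In the row layout every value differs from its neighbor, so the encoding saves nothing, while the six department values collapse into two runs.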

Case Study: Bazaarvoice
• Bazaarvoice had difficulty aggregating large amounts of data on the fly, in real time, for its analytics product
• What the queries have in common: they touch a small number of columns, and most values are aggregates such as counts, sums and averages
• Solution: Infobright, an open source database built on MySQL
• Test results on a data set with 100 million records in the main fact table:
• Analytical queries ran 20x faster on average than on MySQL

Case Study: Bazaarvoice (cont.)
• Disk footprint was over 10x smaller than MySQL's, due to data compression
• Why?
  • Column stores: less disk I/O
  • The “knowledge grid”: aggregate data that Infobright calculates during data loading
    • E.g. it pre-calculates the min, max and avg values for each column in the pack
• Limitations of Infobright
  • Does not support DML
  • The only way to add data is a bulk load with the “LOAD DATA INFILE …” command
  • No way to update or delete existing data without reloading the table
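
The idea behind pre-calculated per-pack aggregates can be sketched in Python as below. This is a simplified illustration and not Infobright's actual implementation: the `Pack` class, the pack size, the sample salaries and the choice to store a sum instead of an average are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Pack:
    """A chunk of one column plus aggregates computed at load time."""
    values: list
    minimum: int
    maximum: int
    total: int  # stored instead of avg in this sketch; avg = total / len(values)

def build_packs(column, pack_size):
    """Split a column into packs and pre-calculate their aggregates."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(Pack(chunk, min(chunk), max(chunk), sum(chunk)))
    return packs

def sum_at_least(packs, threshold):
    """Sum all values >= threshold; return (result, number of packs scanned)."""
    result, scanned = 0, 0
    for pack in packs:
        if pack.maximum < threshold:
            continue  # no value can qualify: skip the pack entirely
        if pack.minimum >= threshold:
            result += pack.total  # every value qualifies: use the stored sum
            continue
        scanned += 1  # only mixed packs have to be read value by value
        result += sum(value for value in pack.values if value >= threshold)
    return result, scanned

# Hypothetical salary column, loaded in packs of three values.
salaries = [30000, 32000, 35000, 40000, 42000, 45000, 60000, 65000, 70000]
packs = build_packs(salaries, pack_size=3)
result, scanned = sum_at_least(packs, 41000)
print(result, scanned)
```

Here only one of the three packs has to be read value by value; the other two are answered from their stored minimum, maximum and sum alone.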

