Data Mining & Column Stores
Aung Thu Rha Hein
Why use Data Mining?
• Explosive growth of data available
• Major sources:
          • Business: Web, E-Commerce, transactions
          • Science: remote sensing, bioinformatics, ...
          • Society : news, gadgets, social media

• Too much data but too little information
• Data mining extracts useful information from the data and helps
  interpret it
• It can automate the process of finding relationships and patterns
  in raw data
What is Data Mining?
• Also known as Knowledge Discovery in Databases (KDD)
• The process of extracting hidden predictive information
  from large data sets
• Converts information into knowledge that helps predict
  future trends and support decisions
• Examples:
             Retail: consumer buying behavior in supermarket sales data
             Search: Google Instant, YouTube Instant
             Blogs and news: Technorati, News360, and so on
             Social mining: Livehoods, which finds patterns and behaviors
              in Foursquare check-in data
Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM) has six phases:

      1. Business understanding
      2. Data understanding
      3. Data preparation
      4. Modeling
      5. Evaluation
      6. Deployment
Techniques
I.   Association rules: also known as market basket analysis.
           Discover interesting associations between attributes
II.  Classification: a technique based on machine learning.
           Uses mathematical techniques such as decision trees, linear
           programming, neural networks, and statistics
III. Clustering: groups objects that have similar characteristics
           into meaningful or useful clusters
IV.  Prediction: discovers relationships among independent variables,
           and between dependent and independent variables
V.   Sequential patterns: discover similar patterns in transaction data
           over a business period
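
To make the association rule technique concrete, here is a minimal Python sketch of the two measures usually used in market basket analysis, support and confidence. The baskets and item names are invented for illustration and do not come from the slides.

```python
# Hypothetical toy baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent (an estimate of P(consequent | antecedent))."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


if __name__ == "__main__":
    print("support(bread, milk) =", support({"bread", "milk"}, transactions))
    print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread -> milk" is usually kept only if both values clear thresholds chosen by the analyst.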
Tools
• There are three categories of tools for data mining:
      i. Traditional Data Mining Tools
      ii. Dashboards
      iii. Text-mining Tools


Some data mining tools:
      •   R: r-project.org
      •   Datameer Analytics Solution: datameer.com
      •   SAS Analytics: sas.com
      •   Google Chart API: code.google.com/apis/chart
Column Stores
• Stores data tables as columns of data rather than as rows
   • Column-oriented DBMSs:
       • Bigtable, DBase, Hypertable, Cassandra (NoSQL)
       • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (relational)
• Used in systems such as data warehouses and data mining platforms
• Example:
                   Emp_ID     Emp_Name     Emp_Dept      Emp_Salary
                   1          Smith        IT            40000
                   2          Adam         Sales         35000
                   3          Jones        Marketing     45000
The database must serialize this two-dimensional table into a one-dimensional
sequence for the operating system to store. A column store writes the values
of each column together:
             • 1, 2, 3
               Smith, Adam, Jones
               IT, Sales, Marketing
               40000, 35000, 45000
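
The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the serialization order using the employee table above. The function names are hypothetical, and a real DBMS works on disk pages rather than Python lists.

```python
# The employee table from the example, one tuple per row.
rows = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]


def row_layout(rows):
    """Row store: all values of one record are stored next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column store: all values of one attribute are stored next to each other."""
    return [value for column in zip(*rows) for value in column]


def column_slice(flat, n_rows, column_index):
    """In the column layout, one attribute is a single contiguous slice."""
    start = column_index * n_rows
    return flat[start:start + n_rows]


if __name__ == "__main__":
    print(row_layout(rows))
    print(column_layout(rows))
    salaries = column_slice(column_layout(rows), n_rows=len(rows), column_index=3)
    print("average salary:", sum(salaries) / len(salaries))
```

An aggregate over Emp_Salary reads one contiguous run in the column layout, while the row layout forces a scan across every record to pick out each salary.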
Advantages and Disadvantages of Column Stores
Advantages
• Only the relevant columns need to be read (improved bandwidth utilization)
• Improved cache locality
       No need to transmit surrounding attributes
• Compression efficiency: columns compress better than rows
       A row mixes values from different domains, while a column holds
       values from a single domain
       Row-store compression ratio: 1:3
       Column-store compression ratio: 1:10
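
One way to see why a single-domain column compresses well is run-length encoding (RLE). The slides do not say which compression scheme is used, so RLE is only an assumed example here, and the department column below is invented for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


if __name__ == "__main__":
    # Hypothetical Emp_Dept column, sorted so that equal values are adjacent.
    dept_column = ["IT", "IT", "IT", "Marketing", "Sales", "Sales"]
    encoded = rle_encode(dept_column)
    print(encoded)
    assert rle_decode(encoded) == dept_column
```

The same values interleaved with names and salaries in a row layout would form no runs, so this scheme would gain nothing there.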
Disadvantages
• Increased disk seek time
• Increased cost of inserts (a new row must be written to every column)
• Increased tuple reconstruction costs (a row must be reassembled from
  several columns)
Case Study: Bazaarvoice
• Problem: difficulty aggregating large amounts of data on the fly, in real
  time, for an analytics product
• Common query pattern: a small number of columns, with most values
  being aggregates such as counts, sums, and averages
• Solution: Infobright, an open source database built on MySQL
• Tested on a data set with 100 million records in the main fact table
• Average execution time for analytical queries was 20x faster than
  MySQL's
Case Study: Bazaarvoice (cont.)
• Disk footprint was over 10x smaller than with MySQL, due to data
  compression
• Why?
  • Column stores: less disk I/O
  • The "knowledge grid": aggregate data that Infobright calculates during
    data loading
      • E.g. it pre-calculates the min, max, and avg values for each column
        in a pack
  • Limitations of Infobright
      • Does not support DML
      • The only way to add data is a bulk load using the "LOAD DATA INFILE …"
        command
      • No way to update or delete existing data without reloading the table
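
The knowledge grid idea can be illustrated with a simplified sketch: keep min, max, sum, and count for each pack of a column, then use them to answer a query without opening packs whose statistics already settle the answer. This is not Infobright's actual implementation. The class and function names, the tiny pack size, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PackStats:
    """Aggregates computed once, at load time, for one pack of a column."""
    minimum: int
    maximum: int
    total: int
    count: int


def build_packs(column, pack_size):
    """Split a column into fixed-size packs and pre-compute stats for each."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    stats = [PackStats(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, stats


def count_greater_than(packs, stats, threshold):
    """Count values > threshold, reading a pack only when its stats are inconclusive.

    Returns (matching_values, packs_opened).
    """
    matched = 0
    packs_opened = 0
    for pack, s in zip(packs, stats):
        if s.maximum <= threshold:
            continue              # no value in this pack can match
        if s.minimum > threshold:
            matched += s.count    # every value matches: answered from stats
            continue
        packs_opened += 1         # mixed pack: the data itself must be scanned
        matched += sum(1 for v in pack if v > threshold)
    return matched, packs_opened


if __name__ == "__main__":
    column = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 20, 5]
    packs, stats = build_packs(column, pack_size=4)
    print(count_greater_than(packs, stats, threshold=8))
```

In this example only one of the three packs has to be read, and an average over a whole pack can likewise be taken from `total / count` without touching the data.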
References
Data Mining
•   http://en.wikipedia.org/wiki/Data_mining
•   http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html
•   http://www.dataminingtechniques.net/
•   http://www.unc.edu/~xluan/258/datamining.html
•   http://www.data-miners.com/
•   http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html
•   http://livehoods.org/


Column Stores
• http://en.wikipedia.org/wiki/Column_store
• http://developer.bazaarvoice.com/why-columns-are-cool
• http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices-in_the_Use_of_Columnar_Databases.pdf
