Data Mining & Column Stores
Aung Thu Rha Hein
Why use Data Mining?
• Explosive growth in the amount of available data
• Major sources:
          • Business: web, e-commerce, transactions
          • Science: remote sensing, bioinformatics, …
          • Society: news, gadgets, social media

• Too much data, but too little information
• Data mining extracts useful information from the data and helps
  interpret it
• It can automate the process of finding relationships and patterns
  in raw data
What is Data Mining?
• Also referred to as Knowledge Discovery in Databases, or "KDD"
• The process of extracting hidden predictive information
  from large data sets
• Converts information into knowledge that can be used to predict
  future trends and support decisions
• Examples:
             Consumer buying behavior in retail supermarket sales
             Google Instant, YouTube Instant
             Blogs and news: Technorati, News360, and so on
             Social mining: Livehoods finds patterns and behaviors
              in Foursquare check-in data
Data Mining Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM) has six phases:
      i. Business understanding
      ii. Data understanding
      iii. Data preparation
      iv. Modeling
      v. Evaluation
      vi. Deployment
Techniques
I.   Association rules: also known as market basket analysis.
           Discovers interesting associations between attributes
II.  Classification: a technique based on machine learning.
           Uses mathematical techniques such as decision trees, linear
            programming, neural networks, and statistics
III. Clustering: groups objects that have similar characteristics
                 into meaningful or useful clusters
IV.  Prediction: discovers relationships among independent variables
                 and between dependent and independent variables
V.   Sequential patterns: discovers similar patterns in transaction data
                 over a business period
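
As an illustration of the association rule technique, the sketch below computes the two measures usually used to rank a rule, support and confidence, for the rule "diapers → beer". The baskets and the helper names (`count`, `support`, `confidence`) are made up for this example and do not come from any particular library.

```python
# Hypothetical market baskets; each set is one customer transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


# Rule under test: customers who buy diapers also buy beer.
rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(rule_support, rule_confidence)
```

A rule is normally reported only when both measures exceed thresholds chosen by the analyst.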
Tools
• There are three categories of data mining tools:
      i. Traditional data mining tools
      ii. Dashboards
      iii. Text-mining tools

Some data mining tools:
      •   R: r-project.org
      •   Datameer Analytics Solution: datameer.com
      •   SAS Analytics: sas.com
      •   Google Chart API: code.google.com/apis/chart
Column Stores
• Stores each data table column by column rather than row by row
   • Column-oriented DBMSs:
       • Bigtable, HBase, Hypertable, Cassandra (NoSQL)
       • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (relational)
• Used in systems such as data warehouses and data mining
• Example:
                   Emp_ID     Emp_Name         Emp_Dept      Emp_Salary
                   1          Smith            IT            40000
                   2          Adam             Sales         35000
                   3          Jones            Marketing     45000
The database must flatten this two-dimensional table into a one-dimensional
sequence of values for the operating system to write to storage. A column
store writes all the values of one column together:
             • 1, 2, 3
               Smith, Adam, Jones
               IT, Sales, Marketing
               40000, 35000, 45000
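
The two layouts of the employee table can be sketched in a few lines. This is only an in-memory illustration; the variable names and the `reconstruct` helper are assumptions made for the example, not how any particular DBMS lays out its files.

```python
# The example table as a list of records (row-oriented view).
column_names = ("Emp_ID", "Emp_Name", "Emp_Dept", "Emp_Salary")
rows = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]

# Row store: the values of one record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column store: the values of one column sit next to each other.
column_layout = {
    name: list(values) for name, values in zip(column_names, zip(*rows))
}

# An aggregate touches a single column and ignores the rest.
total_salary = sum(column_layout["Emp_Salary"])


def reconstruct(index):
    """Rebuild one record; this needs one lookup per column."""
    return tuple(column_layout[name][index] for name in column_names)


print(row_layout)
print(column_layout)
print(total_salary, reconstruct(1))
```

The `reconstruct` helper shows why reading a whole record is the more expensive operation in a column layout: every column has to be visited.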
Advantages and Disadvantages of Column Stores
Advantages
• Only the relevant columns need to be read (improved bandwidth utilization)
• Improved cache locality
       No need to transfer surrounding attributes
• Compression efficiency: columns compress better than rows
       A row mixes values from different domains; a column holds values from one domain
       Row-store compression ratio: 1:3
       Column-store compression ratio: 1:10
Disadvantages
• Increased disk seek time
• Increased cost of inserts
• Increased tuple reconstruction costs
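
One way to see why a column compresses better than a row is run-length encoding, shown in the sketch below. Run-length encoding is used here only as a simple stand-in for whatever scheme a real column store applies, and the sample data is hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal adjacent values into a (value, length) pair."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Column layout: department values are adjacent, so equal values form runs.
dept_column = ["IT", "IT", "IT", "Sales", "Sales", "Marketing"]
encoded_column = run_length_encode(dept_column)

# Row layout: IDs and departments alternate, so no run is longer than one.
row_stream = [1, "IT", 2, "IT", 3, "IT", 4, "Sales", 5, "Sales", 6, "Marketing"]
encoded_rows = run_length_encode(row_stream)

print(encoded_column)     # three runs describe six values
print(len(encoded_rows))  # one run per value, so nothing is saved
```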
Case Study: Bazaarvoice
• Had difficulty aggregating large amounts of data on the fly, in real time,
  for its analytics product
• Common pattern among queries: a small number of columns, with most values
  being aggregates such as counts, sums, and averages
• Used Infobright, an open-source database built on MySQL
• Tested with a data set of 100 million records in the main fact table
• Analytical queries ran on average 20x faster than on MySQL
Case Study: Bazaarvoice (cont.)
• Disk footprint was over 10x smaller than with MySQL due to data
  compression
• Why?
  • Column store: less disk I/O
  • "Knowledge grid": aggregate data that Infobright calculates during data
    loading
      • E.g. pre-calculated min, max, and avg values for each column in the
        data pack
  • Limitations of Infobright
      • Does not support DML
      • The only way to add data is a bulk load using the "LOAD DATA INFILE …" command
      • No way to update or delete existing data without reloading the table
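
The idea behind the "knowledge grid" can be sketched as per-pack statistics that let a query skip or short-circuit whole packs. The code below is a simplified illustration under stated assumptions, not Infobright's implementation: the `Pack` class, the pack size of 3, and storing a sum instead of an average are all choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    """A chunk of one column plus statistics computed once at load time."""
    values: list
    minimum: int
    maximum: int
    total: int


def build_packs(column, pack_size):
    """Split a column into packs and pre-calculate their statistics."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(Pack(chunk, min(chunk), max(chunk), sum(chunk)))
    return packs


def sum_greater_than(packs, threshold):
    """SUM of values above threshold; returns (sum, number of packs scanned)."""
    total = 0
    packs_scanned = 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue  # no value can qualify: skip the pack entirely
        if pack.minimum > threshold:
            total += pack.total  # every value qualifies: use the stored sum
            continue
        packs_scanned += 1  # mixed pack: the values must be read
        total += sum(v for v in pack.values if v > threshold)
    return total, packs_scanned


# Hypothetical salary column.
salaries = [30000, 32000, 35000, 38000, 40000, 45000, 60000, 65000, 70000]
packs = build_packs(salaries, pack_size=3)
result = sum_greater_than(packs, 39000)
print(result)
```

In this run only one of the three packs has to be read; the other two are answered from the statistics alone.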
References
Data Mining
•   http://en.wikipedia.org/wiki/Data_mining
•   http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html
•   http://www.dataminingtechniques.net/
•   http://www.unc.edu/~xluan/258/datamining.html
•   http://www.data-miners.com/
•   http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html
•   http://livehoods.org/


Column Stores
• http://en.wikipedia.org/wiki/Column_store
• http://developer.bazaarvoice.com/why-columns-are-cool
• http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices-in_the_Use_of_Columnar_Databases.pdf
