SlideShare a Scribd company logo
1 of 27
Download to read offline
ITITITIT
[[[[1111]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Madrid JUGMadrid JUGMadrid JUGMadrid JUG ---- Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining)
9 de Mayo 2013
Jose María Gómez Hidalgo (@jmgomez)
Guillermo Santos García (@gsantosgo)
DATA MINING
ITITITIT
[[[[2222]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
INDEXINDEXINDEXINDEX
Madrid JUG - Minería de Datos sobre Weka (Data Mining)...............................................................................................1
INDEX......................................................................................................................................................................................2
1. Artificial Intelligence. Conceptual Map............................................................................................................................ 4
1.1 Knowledge Based System vs. Machine Learning System.......................................................................................5
2. Data Mining Process.........................................................................................................................................................6
2.1 Machine Learning.........................................................................................................................................................7
2.1.1 Supervised Machine Learning...................................................................................................................................7
2.1.2 Unsupervised Machine Learning............................................................................................................................8
2.1.3 The Top Ten Algorithms in Data Mining.................................................................................................................9
3. Tools...................................................................................................................................................................................10
3.1 WEKA (Waikato Environment for Knowledge Analysis) ...................................................................................10
3.2 R (#RStats)...........................................................................................................................................................10
3.3 RapidMiner.............................................................................................................................................................11
3.4 KNIME Desktop......................................................................................................................................................11
3.5 Orange...................................................................................................................................................................12
3.6 Polls.......................................................................................................................................................................13
3.6.1 What programming/statistics languages you used for analytics / data mining in the past 12 months?
[579 voters] (Aug 2012).............................................................................................................................................13
3.6.2 What Analytics, Data mining, Big Data software you used in the past 12 months for a real project?
(May 2012) ..................................................................................................................................................................13
4. Examples............................................................................................................................................................................15
4.1 Predicting Price House........................................................................................................................................15
4.2 Lending Club........................................................................................................................................................16
4.3 Spam or Ham Email............................................................................................................................................. 17
4.4 Handwritten Digit Recognition ..........................................................................................................................18
ITITITIT
[[[[3333]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4.5 Human Activity Recognition using Smartphones............................................................................................19
4.6 Inventory .............................................................................................................................................................20
4.7 Image Classification............................................................................................................................................21
4.8 Clustering............................................................................................................................................................22
5. Supervised Machine Learning........................................................................................................................................23
6. Evaluation.........................................................................................................................................................................24
6.1 Random Subsampling ..............................................................................................................................................24
6.2 Cross Validation (K-FOLD).......................................................................................................................................24
6.3 Confusion Matrix......................................................................................................................................................25
A.1. ¿What is a DATASET?....................................................................................................................................................26
A.2 Types of variables..........................................................................................................................................................26
ITITITIT
[[[[4444]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
1.1.1.1. Artificial Intelligence.Artificial Intelligence.Artificial Intelligence.Artificial Intelligence. ConceptualConceptualConceptualConceptual MapMapMapMap
Link: http://en.wikipedia.org/wiki/Artificial_intelligence
DATA MINING. LEARN FROM DATA
Artificial
Intelligence
Problem Solving
Search Methods
Logic
Agents
Fuzzy Logic
Automatic
Classification
Information
Retrieval
Filtering
Autromatic
Categorization
Knowledge Based
System
Expert System
Knowledge
representation
Data Mining
Data Acquisition
Machine
Learning
Supervised
Unsupervised
Natural Language
Processing
Statistical NLP
Knowlegde Based
NLP
Robotics
ITITITIT
[[[[5555]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
1111....1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System
Knowledge Based System (Expert System)Knowledge Based System (Expert System)Knowledge Based System (Expert System)Knowledge Based System (Expert System)
- Rules are codified manually (Represent knowledge)
- Experts (expert is a person with extensive knowledge about domain).
- Cost.
Expert Sytems (Credit Expert System)
If (Annual Income > 3 * Annual Debt) Then CREDIT = YES
Annual IncomeAnnual IncomeAnnual IncomeAnnual Income Annual DebtAnnual DebtAnnual DebtAnnual Debt CreditCreditCreditCredit
42.000 € 15.000 € NO
37.000 € 12.000 € SI
80.000 € 40.500 € NO
150.000 € 45.000€ SI
Machine Learning SystemMachine Learning SystemMachine Learning SystemMachine Learning System
- The manual process is automated.
- There aren’t experts.
- We take us advantage of data classified manually over years.
- Training phase and testing phase.
- At first, machine learning systems aren’t as accurate as knowledge based systems, however they’re can evolve
and get better through time. (Ex. Spam Detection Spam)
ITITITIT
[[[[6666]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
2. Data Mining Process2. Data Mining Process2. Data Mining Process2. Data Mining Process
KDD (Knowlegde Discovery in Databases)
Source: From Data Mining to Knowledge Discovery in Databases (Fayyad. 1997)
1. Selection. The data relevant to select.
2. Preprocessing.
3. Transformation.
4. Data-Mininq. Building Models and Patterns. (MODELLING)
5. Interpretation/Evaluation . Evaluation and Results
The term DATA-MINING sometimes refers to the complete process KDD, and sometimes refers only to the phase
of MODELLING (4). Here mainly are applied algorithms in the scope of Machine Learning.
ITITITIT
[[[[7777]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
2.2.2.2.1 Machine Learning1 Machine Learning1 Machine Learning1 Machine Learning
Aim. Building or creating programs capable of generalizgeneralizgeneralizgeneralizinginginging behaviorbehaviorbehaviorbehavior from weakly structured information.
2.2.2.2.1.11.11.11.1 SupervisedSupervisedSupervisedSupervised Machine LearningMachine LearningMachine LearningMachine Learning
Aim. Predict the value of a variable based on a number of input variables.
Regression Problem.
Classification Problem.
Result: PREDICTIVE MODELSPREDICTIVE MODELSPREDICTIVE MODELSPREDICTIVE MODELS or CLASSIFIERSCLASSIFIERSCLASSIFIERSCLASSIFIERS.
DATA
PREDICTIVE MODELS
DESCRIPTIVE MODELS
ATTRIBUTES
ITITITIT
[[[[8888]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
2.2.2.2.1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning
Aim. Describe patterns or associations among a set of input measures.
Patterns or Associations
Clustering
Result: DESCRIPTIVEDESCRIPTIVEDESCRIPTIVEDESCRIPTIVE MODELSMODELSMODELSMODELS
ITITITIT
[[[[9999]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
2.2.2.2.1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining
IEEE International Conference on Data Mining (ICDM). http://www.cs.uvm.edu/~icdm/
The most influential algorithms used in the Data Mining Community.
1. C 4.5 (Decision Tree).
2. K-Means.
3. Support Vector Machine (SVM). The Best Generalization Ability
4. Apriori. To find frequent itemsets from a transaction dataset and derive association rules
5. EM (Expectation- Maximization) Pattern Recognition
6. PageRank. Link-based ranking algorithm, which also powers the Google search engine.
7. AdaBoost.
8. k-Nearest Neighbors (k-NN)
9. Naïve Bayes.
10. CART. Classification and Regression Trees
Source: http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
ITITITIT
[[[[10101010]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
3. Tools3. Tools3. Tools3. Tools
3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis)
http://www.cs.waikato.ac.nz/ml/weka/
- Data Mining Software in Java.
- Implemented in Java
- Multi-platform
- GUI (Limitations)
- GPL License.
- University of Waikato, New Zealand
3.23.23.23.2 RRRR (#RStats)(#RStats)(#RStats)(#RStats)
http://www.r-project.org/
R is a language and environment for statistical computing and graphics.
- S Language (Bell Laboratories)
- Implemented in C/C++
- Highly extensible. R can be extended via packages.
- R Environment. Uses a command line interface. (NO GUI)
- RStudio. Graphical User Interfaces (GUI)
- GPL License.
- Created by University of Auckland, New Zealand and currently developed R Development Core Team
Links: How R grows
Books: Machine Learning for Hackers, The Elements of Statistical Learning: Data Mining, Inference and Prediction,
OpenIntro Statistics
Enterprises: Revolution Analytics, Oracle R Enterprise, …
R for LinuxR for LinuxR for LinuxR for Linux R for Mac OSXR for Mac OSXR for Mac OSXR for Mac OSX R for WindowsR for WindowsR for WindowsR for Windows
RWekaRWekaRWekaRWeka
ITITITIT
[[[[11111111]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
3.33.33.33.3 RRRRapidMinerapidMinerapidMinerapidMiner
http://rapid-i.com/content/view/181/190/lang,en/
- Open-Source Data Mining and Analysis System
- Implemented in Java
- Multi-platform
- Machine Learning library Weka fully integrated.
- Access to data sources: Excel, MySQL, Oracle
- ETL
- Reporting
- Data Analysis
- AGPL License
- Created by Dortmund University of Technology
3.4 KNIME Desktop3.4 KNIME Desktop3.4 KNIME Desktop3.4 KNIME Desktop
http://www.knime.org/knime
- Data Analytics (Data access, data transformation, predictive analytics, visualization and reporting).
- Implemented in Java (Based in Eclipse Platform)
- Reporting
- ETL
- KNIME Extensions. Excel support, R integration, Weka
- GPL License
- Konstanz University, Germany
RRRR Extension for RapidMinerExtension for RapidMinerExtension for RapidMinerExtension for RapidMiner
R Extension for KnimeR Extension for KnimeR Extension for KnimeR Extension for Knime
ITITITIT
[[[[12121212]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
3.5 Orange3.5 Orange3.5 Orange3.5 Orange
http://orange.biolab.si/
- A component-based data mining and machine learning software suite
- A visual programming front-end for explorative data analysis and visualization
- Multi-platform.
- Python
- GPL License
- University of Ljubljana, Slovenia
ITITITIT
[[[[13131313]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
3.6 Polls3.6 Polls3.6 Polls3.6 Polls
3.6.13.6.13.6.13.6.1 What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12
months? [579 voters]months? [579 voters]months? [579 voters]months? [579 voters] (Aug 2012)(Aug 2012)(Aug 2012)(Aug 2012)
Source: http://www.kdnuggets.com/polls/2012/analytics-data-mining-programming-languages.html
3.6.23.6.23.6.23.6.2 What Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in the past 12 months for a realhe past 12 months for a realhe past 12 months for a realhe past 12 months for a real
project?project?project?project? (May 2012)(May 2012)(May 2012)(May 2012)
ITITITIT
[[[[14141414]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Source: http:/www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
ITITITIT
[[[[15151515]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4444.... ExamplesExamplesExamplesExamples
4444.1 Predicting Price House.1 Predicting Price House.1 Predicting Price House.1 Predicting Price House
SizeSizeSizeSize Price (K)Price (K)Price (K)Price (K)
80 70
90 83
100 74
110 93
140 89
140 58
150 85
160 114
180 95
200 100
240 138
250 111
270 124
320 161
350 172
Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/predictHousePrice.md
Regression ProblemRegression ProblemRegression ProblemRegression Problem
ITITITIT
[[[[16161616]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4444.2 Lending Club.2 Lending Club.2 Lending Club.2 Lending Club
Peer to peer lending company.
What are the variables associated with the interest rate of a loan? Multivariate
Links: http://www.lendingclub.com/
http://en.wikipedia.org/wiki/Lending_Club
https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/loansLendingClub.md
Regression ProblemRegression ProblemRegression ProblemRegression Problem
ITITITIT
[[[[17171717]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4444.3.3.3.3 Spam orSpam orSpam orSpam or HamHamHamHam EmailEmailEmailEmail
Links: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/spam.md
Classification ProblemClassification ProblemClassification ProblemClassification Problem
ITITITIT
[[[[18181818]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4444.4.4.4.4 HandwrittenHandwrittenHandwrittenHandwritten Digit RecognitionDigit RecognitionDigit RecognitionDigit Recognition
Identification the numbers in a handwritten ZIP code, from a digitized image.
001 002 003 004 ... 015 016
017 018 019 020 ... 031 032
033 034 035 036 ... 037 038
| | | | ... | |
209 210 211 212 ... 223 224
225 226 227 228 ... 239 240
241 242 243 244 ... 255 256
Each image is a 16 x 16 (256) 8-bit grayscale representation of a handwritten digit
http://www.kaggle.com/c/digit-recognizer
Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md
16x16
Classification ProblemClassification ProblemClassification ProblemClassification Problem
ITITITIT
[[[[19191919]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4444.5.5.5.5 Human Activity RHuman Activity RHuman Activity RHuman Activity Recognition using Smartphonesecognition using Smartphonesecognition using Smartphonesecognition using Smartphones
We used data obtained from accelerometer and gyroscope sensor signals of the smartphones
3-axial linear acceleration
3-axial angular velocity
We can monitor acceleration, positions, rotation and angular motion.
Laying, Sitting, Standing, Walk, WalkDown, WalkUp
ITITITIT
[[[[20202020]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
DataSet: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Source: Activity Recognition using Cell Phone Accelerometers
http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf
Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md
4444.6.6.6.6 InventoryInventoryInventoryInventory
A large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
Classification ProblemClassification ProblemClassification ProblemClassification Problem
Regression ProblemRegression ProblemRegression ProblemRegression Problem
ITITITIT
[[[[21212121]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4.74.74.74.7 Image ClassificationImage ClassificationImage ClassificationImage Classification
Computer Vision (C.V.)
Haralick texture features. Haralick described 14 statistics that can be calculated from the co-occurrence matrix
with the intent of describing the texture of the image:
- Angular Second Moment
- Constrast
- Correlation
..
Source: https://github.com/gsantosgo/RStats/tree/master/MadridJUG-DataMining/data/faces.arff
Alessandra Ambrosio
Jessica Alba
Megan Fox
Links: http://murphylab.web.cmu.edu/publications/boland/boland_node26.html
Classification ProblemClassification ProblemClassification ProblemClassification Problem
ITITITIT
[[[[22222222]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
4.84.84.84.8 ClusteringClusteringClusteringClustering
Google News
News Clustering
Source: http://news.google.es/
Clustering ProblemClustering ProblemClustering ProblemClustering Problem
ITITITIT
[[[[23232323]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
5. Supervised Machine Learning5. Supervised Machine Learning5. Supervised Machine Learning5. Supervised Machine Learning
Guide for Supervised Machine Learning
Training Phase Testing Phase
Training DataSet
(Colección de Entrenamiento)
Attributes Selection and Extraction
(Selección y Extracción de Atributos)
Filtered DataSet
(Colección filtrada)
Learning or Training
(Entrenamiento o Aprendizaje)
Predictive Model or Classifier
(Modelo Predictivo o Clasificador)
Testing DataSet
(Colección de Datos Reales)
Filtering Attributes
(Filtrado de Atributos)
Filtered DataSet
(Colección filtrada)
Classification
(Clasificación)
Classified Data
(Datos Clasificados)
ITITITIT
[[[[24242424]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
6666. Evaluation. Evaluation. Evaluation. Evaluation
STATE OF ARTSTATE OF ARTSTATE OF ARTSTATE OF ART
6666.1 Random Subsampling.1 Random Subsampling.1 Random Subsampling.1 Random Subsampling
1. Use the training set.
2. Split it into training set (66.66 %) and testing set (33.33%). (RANDOM)
3. Build a model on the training set.
4. Evaluate on the test set.
6666.2 Cross Validation (K.2 Cross Validation (K.2 Cross Validation (K.2 Cross Validation (K----FOLD)FOLD)FOLD)FOLD)
1. Use the training set.
2. Split it into training/test sets.
3. Build a model on the training set
4. Evaluate on the test set.
5. Repeat and average the estimated
Never Overlap!
K-FOLD
K = 1
K = 2
…….
K = 10
	 	 =
1
Test Data
Test Data
Test Data
Training Data
Training Data
Training DataTr. Data
Test Training DataTest
Data
Test
Data
Test
Data
Test
Data
Test
Data
Test Training DataTest
Data
Test
Data
Test
Data
Test
Data
Test
Data
ITITITIT
[[[[25252525]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Link: https://es.wikipedia.org/wiki/Validaci%C3%B3n_cruzada
6666.3 Confusion Matrix.3 Confusion Matrix.3 Confusion Matrix.3 Confusion Matrix
- Accuracy (Precisión o Efectividad) . The rate of correct predictions
- Error rate. The rate of incorrect predictions.
- Performance (Eficiencia). The algorithm is quick or nor in the training phase or in the testing phase.
Actual/Real ClassActual/Real ClassActual/Real ClassActual/Real Class Predicted ClassPredicted ClassPredicted ClassPredicted Class TotalTotalTotalTotal
Yes No
Yes (1)Yes (1)Yes (1)Yes (1) True Positive (TP) False Negative (FN) Total Positive Real (TPR)
No (0)No (0)No (0)No (0) False Positive (FP) True Negative (TN) Total Negative Real (TNR)
TotalTotalTotalTotal Total Positive Predicted
(TPP)
Total Negative Predicted
(TNP)
Total
Link: http://en.wikipedia.org/wiki/Confusion_matrix
ITITITIT
[[[[26262626]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
A.1A.1A.1A.1.... ¿What is a DATASET?¿What is a DATASET?¿What is a DATASET?¿What is a DATASET?
Example: Dataset email50
Row represents a casecasecasecase, a unit of observationunit of observationunit of observationunit of observation, an observational unitobservational unitobservational unitobservational unit, an instanceinstanceinstanceinstance. OBSERVATIONS.OBSERVATIONS.OBSERVATIONS.OBSERVATIONS.
EXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARY....
Column represents an attributeattributeattributeattribute, a variablevariablevariablevariable, a featurefeaturefeaturefeature (represent characteristics).
Special column. the classthe classthe classthe class, the class labelthe class labelthe class labelthe class label ( two values or multi-valued)
For example: The email 4, which is not spam, contains 2454 characters, 61 line breaks, is written in Text format
(0=text, 1=html), and contains only small numbers.
Variable Description
spam Specifies whether the message was spam
num_char The number of characters in the email
line_breaks The number of line breaks in the email (not including text
wrapping)
Format Indicates if the email contained special formatting, such as
bolding, tables or links, which would indicate the message is
in HTML format
Number Indicates whether the email contained no number, a small
number (under 1 million) or a large number
DatasetDatasetDatasetDataset represents a data matrixdata matrixdata matrixdata matrix, data framedata framedata framedata frame. Each row of a data matrix corresponds to unique case
(example), and each column corresponds to a variable.
A.2A.2A.2A.2 Types of variablesTypes of variablesTypes of variablesTypes of variables
ITITITIT
[[[[27272727]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology
num_char and line_breaks CUANTITATIVE, NUMERICAL AND CONTINOUS VARIABLES.
spam CUANTITATIVE, NUMERICAL AND DISCRETE VARIABLE.
number indicates whether the email contained no number, a small number (under 1 million) or a large number. It
takes values none, small and big. The different levels have a natural ordering. CUALITATIVE, CATEGORICAL
VARIABLES AND ORDINAL VARIABLE.
Variables
Numerical
Continuous Discretes
Categorical
Regular
Categorical
Ordinal

More Related Content

Similar to MadridJUG Mineria de Datos-Data Mining.09.may.2013

Ecosystm IoT combined forecast 2017 - 2022
Ecosystm IoT combined forecast 2017 - 2022Ecosystm IoT combined forecast 2017 - 2022
Ecosystm IoT combined forecast 2017 - 2022Chris White
 
Tracxn - Database Technology Startup Landscape
Tracxn - Database Technology Startup LandscapeTracxn - Database Technology Startup Landscape
Tracxn - Database Technology Startup LandscapeTracxn
 
KloudData Omnitrol Asset Intelligence Solution
KloudData Omnitrol Asset Intelligence SolutionKloudData Omnitrol Asset Intelligence Solution
KloudData Omnitrol Asset Intelligence SolutionKloudData Inc
 
IoT digital disruption and new IoT business models
IoT digital disruption and new IoT business modelsIoT digital disruption and new IoT business models
IoT digital disruption and new IoT business modelsIoTAnalytics
 
Tracxn - Big Data Infrastructure Startup Landscape
Tracxn - Big Data Infrastructure Startup LandscapeTracxn - Big Data Infrastructure Startup Landscape
Tracxn - Big Data Infrastructure Startup LandscapeAmar Christy
 
Internet of Everything. El Reto de la Innovación
Internet of Everything. El Reto de la InnovaciónInternet of Everything. El Reto de la Innovación
Internet of Everything. El Reto de la InnovaciónAMETIC
 
Pascal A1 ASIC investors presentation
Pascal A1 ASIC investors presentationPascal A1 ASIC investors presentation
Pascal A1 ASIC investors presentationPascal Project
 
Regulators' traceability requirements and solutions for i gambling operators ...
Regulators' traceability requirements and solutions for i gambling operators ...Regulators' traceability requirements and solutions for i gambling operators ...
Regulators' traceability requirements and solutions for i gambling operators ...Market Engel SAS
 
Tracxn - Data Center Infrastructure Startup Landscape
Tracxn - Data Center Infrastructure Startup LandscapeTracxn - Data Center Infrastructure Startup Landscape
Tracxn - Data Center Infrastructure Startup LandscapeTracxn
 
Tracxn - Big Data Analytics Startup Landscape
Tracxn - Big Data Analytics Startup LandscapeTracxn - Big Data Analytics Startup Landscape
Tracxn - Big Data Analytics Startup LandscapeTracxn
 
Tracxn - Internet of Things Infrastructure Startup Landscape
Tracxn - Internet of Things Infrastructure Startup LandscapeTracxn - Internet of Things Infrastructure Startup Landscape
Tracxn - Internet of Things Infrastructure Startup LandscapeTracxn
 
Tracxn - Embedded Systems Startup Landscape
Tracxn - Embedded Systems Startup LandscapeTracxn - Embedded Systems Startup Landscape
Tracxn - Embedded Systems Startup LandscapeTracxn
 
Intellisoft introduction @ Recruitment Camp L131229
Intellisoft introduction @ Recruitment Camp L131229Intellisoft introduction @ Recruitment Camp L131229
Intellisoft introduction @ Recruitment Camp L131229Sham Yemul
 
Your Next IoT Journey is NOW
Your Next IoT Journey is NOWYour Next IoT Journey is NOW
Your Next IoT Journey is NOWDr. Mazlan Abbas
 
Advanced manufacturing syposium 2016 siaa colin koh
Advanced manufacturing syposium 2016 siaa colin kohAdvanced manufacturing syposium 2016 siaa colin koh
Advanced manufacturing syposium 2016 siaa colin kohColin Koh (許国仁)
 
Capturing the IOT Opportunity
Capturing the IOT OpportunityCapturing the IOT Opportunity
Capturing the IOT Opportunityliao yong
 
Capturing the IOT Opportunity
Capturing the IOT OpportunityCapturing the IOT Opportunity
Capturing the IOT Opportunityliao yong
 
Internet of Things - We Are at the Tip of an Iceberg
Internet of Things - We Are at the Tip of an IcebergInternet of Things - We Are at the Tip of an Iceberg
Internet of Things - We Are at the Tip of an IcebergDr. Mazlan Abbas
 
Trends, Potentials and Challenges in ICT Market
Trends, Potentials and Challenges in ICT MarketTrends, Potentials and Challenges in ICT Market
Trends, Potentials and Challenges in ICT MarketNovida Global
 

Similar to MadridJUG Mineria de Datos-Data Mining.09.may.2013 (20)

Ecosystm IoT combined forecast 2017 - 2022
Ecosystm IoT combined forecast 2017 - 2022Ecosystm IoT combined forecast 2017 - 2022
Ecosystm IoT combined forecast 2017 - 2022
 
Tracxn - Database Technology Startup Landscape
Tracxn - Database Technology Startup LandscapeTracxn - Database Technology Startup Landscape
Tracxn - Database Technology Startup Landscape
 
KloudData Omnitrol Asset Intelligence Solution
KloudData Omnitrol Asset Intelligence SolutionKloudData Omnitrol Asset Intelligence Solution
KloudData Omnitrol Asset Intelligence Solution
 
IoT digital disruption and new IoT business models
IoT digital disruption and new IoT business modelsIoT digital disruption and new IoT business models
IoT digital disruption and new IoT business models
 
Tracxn - Big Data Infrastructure Startup Landscape
Tracxn - Big Data Infrastructure Startup LandscapeTracxn - Big Data Infrastructure Startup Landscape
Tracxn - Big Data Infrastructure Startup Landscape
 
Internet of Everything. El Reto de la Innovación
Internet of Everything. El Reto de la InnovaciónInternet of Everything. El Reto de la Innovación
Internet of Everything. El Reto de la Innovación
 
Pascal A1 ASIC investors presentation
Pascal A1 ASIC investors presentationPascal A1 ASIC investors presentation
Pascal A1 ASIC investors presentation
 
Regulators' traceability requirements and solutions for i gambling operators ...
Regulators' traceability requirements and solutions for i gambling operators ...Regulators' traceability requirements and solutions for i gambling operators ...
Regulators' traceability requirements and solutions for i gambling operators ...
 
Tracxn - Data Center Infrastructure Startup Landscape
Tracxn - Data Center Infrastructure Startup LandscapeTracxn - Data Center Infrastructure Startup Landscape
Tracxn - Data Center Infrastructure Startup Landscape
 
Tracxn - Big Data Analytics Startup Landscape
Tracxn - Big Data Analytics Startup LandscapeTracxn - Big Data Analytics Startup Landscape
Tracxn - Big Data Analytics Startup Landscape
 
Tracxn - Internet of Things Infrastructure Startup Landscape
Tracxn - Internet of Things Infrastructure Startup LandscapeTracxn - Internet of Things Infrastructure Startup Landscape
Tracxn - Internet of Things Infrastructure Startup Landscape
 
Tracxn - Embedded Systems Startup Landscape
Tracxn - Embedded Systems Startup LandscapeTracxn - Embedded Systems Startup Landscape
Tracxn - Embedded Systems Startup Landscape
 
Intellisoft introduction @ Recruitment Camp L131229
Intellisoft introduction @ Recruitment Camp L131229Intellisoft introduction @ Recruitment Camp L131229
Intellisoft introduction @ Recruitment Camp L131229
 
Your Next IoT Journey is NOW
Your Next IoT Journey is NOWYour Next IoT Journey is NOW
Your Next IoT Journey is NOW
 
Advanced manufacturing syposium 2016 siaa colin koh
Advanced manufacturing syposium 2016 siaa colin kohAdvanced manufacturing syposium 2016 siaa colin koh
Advanced manufacturing syposium 2016 siaa colin koh
 
Capturing the IOT Opportunity
Capturing the IOT OpportunityCapturing the IOT Opportunity
Capturing the IOT Opportunity
 
Capturing the IOT Opportunity
Capturing the IOT OpportunityCapturing the IOT Opportunity
Capturing the IOT Opportunity
 
Internet of Things - We Are at the Tip of an Iceberg
Internet of Things - We Are at the Tip of an IcebergInternet of Things - We Are at the Tip of an Iceberg
Internet of Things - We Are at the Tip of an Iceberg
 
1Q18 Global ISG Index
1Q18 Global ISG Index1Q18 Global ISG Index
1Q18 Global ISG Index
 
Trends, Potentials and Challenges in ICT Market
Trends, Potentials and Challenges in ICT MarketTrends, Potentials and Challenges in ICT Market
Trends, Potentials and Challenges in ICT Market
 

More from Guillermo Santos

Handwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemHandwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemGuillermo Santos
 
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Guillermo Santos
 
Data Analysis. Regression. LendingClub Loans
Data Analysis. Regression. LendingClub LoansData Analysis. Regression. LendingClub Loans
Data Analysis. Regression. LendingClub LoansGuillermo Santos
 
Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Guillermo Santos
 
Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Guillermo Santos
 
Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Guillermo Santos
 

More from Guillermo Santos (6)

Handwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification ProblemHandwritten Digit recognition with R. Classification Problem
Handwritten Digit recognition with R. Classification Problem
 
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
Data Analysis. Predictive Analysis. Activity Prediction that a subject perfor...
 
Data Analysis. Regression. LendingClub Loans
Data Analysis. Regression. LendingClub LoansData Analysis. Regression. LendingClub Loans
Data Analysis. Regression. LendingClub Loans
 
Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012Presentación Geolocalización Noticias (geo news).2012
Presentación Geolocalización Noticias (geo news).2012
 
Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012Algoritmos Aprendizaje Automático.2012
Algoritmos Aprendizaje Automático.2012
 
Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012Kettle. Recuperación y Procesado de datos.2012
Kettle. Recuperación y Procesado de datos.2012
 

Recently uploaded

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

MadridJUG Mineria de Datos-Data Mining.09.may.2013

  • 1. ITITITIT [[[[1111]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Madrid JUGMadrid JUGMadrid JUGMadrid JUG ---- Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining)Minería de Datos sobre Weka (Data Mining) 9 de Mayo 2013 Jose María Gómez Hidalgo (@jmgomez) Guillermo Santos García (@gsantosgo) DATA MINING
  • 2. ITITITIT [[[[2222]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology INDEXINDEXINDEXINDEX Madrid JUG - Minería de Datos sobre Weka (Data Mining)...............................................................................................1 INDEX......................................................................................................................................................................................2 1. Artificial Intelligence. Conceptual Map............................................................................................................................ 4 1.1 Knowledge Based System vs. Machine Learning System.......................................................................................5 2. Data Mining Process.........................................................................................................................................................6 2.1 Machine Learning.........................................................................................................................................................7 2.1.1 Supervised Machine Learning...................................................................................................................................7 2.1.2 Unsupervised Machine Learning............................................................................................................................8 2.1.3 The Top Ten Algorithms in Data Mining.................................................................................................................9 3. Tools...................................................................................................................................................................................10 3.1 WEKA (Waikato Environment for Knowledge Analysis) ...................................................................................10 3.2 R (#RStats)...........................................................................................................................................................10 3.3 RapidMiner.............................................................................................................................................................11 3.4 KNIME Desktop......................................................................................................................................................11 3.5 Orange...................................................................................................................................................................12 3.6 Polls.......................................................................................................................................................................13 3.6.1 What programming/statistics languages you used for analytics / data mining in the past 12 months? [579 voters] (Aug 2012).............................................................................................................................................13 3.6.2 What Analytics, Data mining, Big Data software you used in the past 12 months for a real project? (May 2012) ..................................................................................................................................................................13 4. Examples............................................................................................................................................................................15 4.1 Predicting Price House........................................................................................................................................15 4.2 Lending Club........................................................................................................................................................16 4.3 Spam or Ham Email............................................................................................................................................. 17 4.4 Handwritten Digit Recognition ..........................................................................................................................18
  • 3. ITITITIT [[[[3333]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4.5 Human Activity Recognition using Smartphones............................................................................................19 4.6 Inventory .............................................................................................................................................................20 4.7 Image Classification............................................................................................................................................21 4.8 Clustering............................................................................................................................................................22 5. Supervised Machine Learning........................................................................................................................................23 6. Evaluation.........................................................................................................................................................................24 6.1 Random Subsampling ..............................................................................................................................................24 6.2 Cross Validation (K-FOLD).......................................................................................................................................24 6.3 Confusion Matrix......................................................................................................................................................25 A.1. ¿What is a DATASET?....................................................................................................................................................26 A.2 Types of variables..........................................................................................................................................................26
  • 4. ITITITIT [[[[4444]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 1.1.1.1. Artificial Intelligence.Artificial Intelligence.Artificial Intelligence.Artificial Intelligence. ConceptualConceptualConceptualConceptual MapMapMapMap Link: http://en.wikipedia.org/wiki/Artificial_intelligence DATA MINING. LEARN FROM DATA Artificial Intelligence Problem Solving Search Methods Logic Agents Fuzzy Logic Automatic Classification Information Retrieval Filtering Autromatic Categorization Knowledge Based System Expert System Knowledge representation Data Mining Data Acquisition Machine Learning Supervised Unsupervised Natural Language Processing Statistical NLP Knowlegde Based NLP Robotics
  • 5. ITITITIT [[[[5555]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 1111....1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System1 Knowledge Based System vs. Machine Learning System Knowledge Based System (Expert System)Knowledge Based System (Expert System)Knowledge Based System (Expert System)Knowledge Based System (Expert System) - Rules are codified manually (Represent knowledge) - Experts (expert is a person with extensive knowledge about domain). - Cost. Expert Sytems (Credit Expert System) If (Annual Income > 3 * Annual Debt) Then CREDIT = YES Annual IncomeAnnual IncomeAnnual IncomeAnnual Income Annual DebtAnnual DebtAnnual DebtAnnual Debt CreditCreditCreditCredit 42.000 € 15.000 € NO 37.000 € 12.000 € SI 80.000 € 40.500 € NO 150.000 € 45.000€ SI Machine Learning SystemMachine Learning SystemMachine Learning SystemMachine Learning System - The manual process is automated. - There aren’t experts. - We take us advantage of data classified manually over years. - Training phase and testing phase. - At first, machine learning systems aren’t as accurate as knowledge based systems, however they’re can evolve and get better through time. (Ex. Spam Detection Spam)
  • 6. ITITITIT [[[[6666]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 2. Data Mining Process2. Data Mining Process2. Data Mining Process2. Data Mining Process KDD (Knowlegde Discovery in Databases) Source: From Data Mining to Knowledge Discovery in Databases (Fayyad. 1997) 1. Selection. The data relevant to select. 2. Preprocessing. 3. Transformation. 4. Data-Mininq. Building Models and Patterns. (MODELLING) 5. Interpretation/Evaluation . Evaluation and Results The term DATA-MINING sometimes refers to the complete process KDD, and sometimes refers only to the phase of MODELLING (4). Here mainly are applied algorithms in the scope of Machine Learning.
  • 7. ITITITIT [[[[7777]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 2.2.2.2.1 Machine Learning1 Machine Learning1 Machine Learning1 Machine Learning Aim. Building or creating programs capable of generalizgeneralizgeneralizgeneralizinginginging behaviorbehaviorbehaviorbehavior from weakly structured information. 2.2.2.2.1.11.11.11.1 SupervisedSupervisedSupervisedSupervised Machine LearningMachine LearningMachine LearningMachine Learning Aim. Predict the value of a variable based on a number of input variables. Regression Problem. Classification Problem. Result: PREDICTIVE MODELSPREDICTIVE MODELSPREDICTIVE MODELSPREDICTIVE MODELS or CLASSIFIERSCLASSIFIERSCLASSIFIERSCLASSIFIERS. DATA PREDICTIVE MODELS DESCRIPTIVE MODELS ATTRIBUTES
  • 8. ITITITIT [[[[8888]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 2.2.2.2.1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning1.2 Unsupervised Machine Learning Aim. Describe patterns or associations among a set of input measures. Patterns or Associations Clustering Result: DESCRIPTIVEDESCRIPTIVEDESCRIPTIVEDESCRIPTIVE MODELSMODELSMODELSMODELS
  • 9. ITITITIT [[[[9999]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 2.2.2.2.1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining1.3 The Top Ten Algorithms in Data Mining IEEE International Conference on Data Mining (ICDM). http://www.cs.uvm.edu/~icdm/ The most influential algorithms used in the Data Mining Community. 1. C 4.5 (Decision Tree). 2. K-Means. 3. Support Vector Machine (SVM). The Best Generalization Ability 4. Apriori. To find frequent itemsets from a transaction dataset and derive association rules 5. EM (Expectation- Maximization) Pattern Recognition 6. PageRank. Link-based ranking algorithm, which also powers the Google search engine. 7. AdaBoost. 8. k-Nearest Neighbors (k-NN) 9. Naïve Bayes. 10. CART. Classification and Regression Trees Source: http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
  • 10. ITITITIT [[[[10101010]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 3. Tools3. Tools3. Tools3. Tools 3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis)3.1 WEKA (Waikato Environment for Knowledge Analysis) http://www.cs.waikato.ac.nz/ml/weka/ - Data Mining Software in Java. - Implemented in Java - Multi-platform - GUI (Limitations) - GPL License. - University of Waikato, New Zealand 3.23.23.23.2 RRRR (#RStats)(#RStats)(#RStats)(#RStats) http://www.r-project.org/ R is a language and environment for statistical computing and graphics. - S Language (Bell Laboratories) - Implemented in C/C++ - Highly extensible. R can be extended via packages. - R Environment. Uses a command line interface. (NO GUI) - RStudio. Graphical User Interfaces (GUI) - GPL License. - Created by University of Auckland, New Zealand and currently developed R Development Core Team Links: How R grows Books: Machine Learning for Hackers, The Elements of Statistical Learning: Data Mining, Inference and Prediction, OpenIntro Statistics Enterprises: Revolution Analytics, Oracle R Enterprise, … R for LinuxR for LinuxR for LinuxR for Linux R for Mac OSXR for Mac OSXR for Mac OSXR for Mac OSX R for WindowsR for WindowsR for WindowsR for Windows RWekaRWekaRWekaRWeka
  • 11. ITITITIT [[[[11111111]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 3.33.33.33.3 RRRRapidMinerapidMinerapidMinerapidMiner http://rapid-i.com/content/view/181/190/lang,en/ - Open-Source Data Mining and Analysis System - Implemented in Java - Multi-platform - Machine Learning library Weka fully integrated. - Access to data sources: Excel, MySQL, Oracle - ETL - Reporting - Data Analysis - AGPL License - Created by Dortmund University of Technology 3.4 KNIME Desktop3.4 KNIME Desktop3.4 KNIME Desktop3.4 KNIME Desktop http://www.knime.org/knime - Data Analytics (Data access, data transformation, predictive analytics, visualization and reporting). - Implemented in Java (Based in Eclipse Platform) - Reporting - ETL - KNIME Extensions. Excel support, R integration, Weka - GPL License - Konstanz University, Germany RRRR Extension for RapidMinerExtension for RapidMinerExtension for RapidMinerExtension for RapidMiner R Extension for KnimeR Extension for KnimeR Extension for KnimeR Extension for Knime
  • 12. ITITITIT [[[[12121212]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 3.5 Orange3.5 Orange3.5 Orange3.5 Orange http://orange.biolab.si/ - A component-based data mining and machine learning software suite - A visual programming front-end for explorative data analysis and visualization - Multi-platform. - Python - GPL License - University of Ljubljana, Slovenia
  • 13. ITITITIT [[[[13131313]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 3.6 Polls3.6 Polls3.6 Polls3.6 Polls 3.6.13.6.13.6.13.6.1 What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12What programming/statistics languages you used for analytics / data mining in the past 12 months? [579 voters]months? [579 voters]months? [579 voters]months? [579 voters] (Aug 2012)(Aug 2012)(Aug 2012)(Aug 2012) Source: http://www.kdnuggets.com/polls/2012/analytics-data-mining-programming-languages.html 3.6.23.6.23.6.23.6.2 What Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in tWhat Analytics, Data mining, Big Data software you used in the past 12 months for a realhe past 12 months for a realhe past 12 months for a realhe past 12 months for a real project?project?project?project? (May 2012)(May 2012)(May 2012)(May 2012)
  • 14. ITITITIT [[[[14141414]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Source: http:/www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html
  • 15. ITITITIT [[[[15151515]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4444.... ExamplesExamplesExamplesExamples 4444.1 Predicting Price House.1 Predicting Price House.1 Predicting Price House.1 Predicting Price House SizeSizeSizeSize Price (K)Price (K)Price (K)Price (K) 80 70 90 83 100 74 110 93 140 89 140 58 150 85 160 114 180 95 200 100 240 138 250 111 270 124 320 161 350 172 Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/predictHousePrice.md Regression ProblemRegression ProblemRegression ProblemRegression Problem
  • 16. ITITITIT [[[[16161616]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4444.2 Lending Club.2 Lending Club.2 Lending Club.2 Lending Club Peer to peer lending company. What are the variables associated with the interest rate of a loan? Multivariate Links: http://www.lendingclub.com/ http://en.wikipedia.org/wiki/Lending_Club https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/loansLendingClub.md Regression ProblemRegression ProblemRegression ProblemRegression Problem
  • 17. ITITITIT [[[[17171717]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4444.3.3.3.3 Spam orSpam orSpam orSpam or HamHamHamHam EmailEmailEmailEmail Links: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/spam.md Classification ProblemClassification ProblemClassification ProblemClassification Problem
  • 18. ITITITIT [[[[18181818]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4444.4.4.4.4 HandwrittenHandwrittenHandwrittenHandwritten Digit RecognitionDigit RecognitionDigit RecognitionDigit Recognition Identification the numbers in a handwritten ZIP code, from a digitized image. 001 002 003 004 ... 015 016 017 018 019 020 ... 031 032 033 034 035 036 ... 037 038 | | | | ... | | 209 210 211 212 ... 223 224 225 226 227 228 ... 239 240 241 242 243 244 ... 255 256 Each image is a 16 x 16 (256) 8-bit grayscale representation of a handwritten digit http://www.kaggle.com/c/digit-recognizer Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md 16x16 Classification ProblemClassification ProblemClassification ProblemClassification Problem
  • 19. ITITITIT [[[[19191919]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4444.5.5.5.5 Human Activity RHuman Activity RHuman Activity RHuman Activity Recognition using Smartphonesecognition using Smartphonesecognition using Smartphonesecognition using Smartphones We used data obtained from accelerometer and gyroscope sensor signals of the smartphones 3-axial linear acceleration 3-axial angular velocity We can monitor acceleration, positions, rotation and angular motion. Laying, Sitting, Standing, Walk, WalkDown, WalkUp
  • 20. ITITITIT [[[[20202020]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology DataSet: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones Source: Activity Recognition using Cell Phone Accelerometers http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf Link: https://github.com/gsantosgo/RStats/blob/master/MadridJUG-DataMining/handwritten.md 4444.6.6.6.6 InventoryInventoryInventoryInventory A large inventory of identical items. You want to predict how many of these items will sell over the next 3 months. Classification ProblemClassification ProblemClassification ProblemClassification Problem Regression ProblemRegression ProblemRegression ProblemRegression Problem
  • 21. ITITITIT [[[[21212121]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4.74.74.74.7 Image ClassificationImage ClassificationImage ClassificationImage Classification Computer Vision (C.V.) Haralick texture features. Haralick described 14 statistics that can be calculated from the co-occurrence matrix with the intent of describing the texture of the image: - Angular Second Moment - Constrast - Correlation .. Source: https://github.com/gsantosgo/RStats/tree/master/MadridJUG-DataMining/data/faces.arff Alessandra Ambrosio Jessica Alba Megan Fox Links: http://murphylab.web.cmu.edu/publications/boland/boland_node26.html Classification ProblemClassification ProblemClassification ProblemClassification Problem
  • 22. ITITITIT [[[[22222222]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 4.84.84.84.8 ClusteringClusteringClusteringClustering Google News News Clustering Source: http://news.google.es/ Clustering ProblemClustering ProblemClustering ProblemClustering Problem
  • 23. ITITITIT [[[[23232323]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 5. Supervised Machine Learning5. Supervised Machine Learning5. Supervised Machine Learning5. Supervised Machine Learning Guide for Supervised Machine Learning Training Phase Testing Phase Training DataSet (Colección de Entrenamiento) Attributes Selection and Extraction (Selección y Extracción de Atributos) Filtered DataSet (Colección filtrada) Learning or Training (Entrenamiento o Aprendizaje) Predictive Model or Classifier (Modelo Predictivo o Clasificador) Testing DataSet (Colección de Datos Reales) Filtering Attributes (Filtrado de Atributos) Filtered DataSet (Colección filtrada) Classification (Clasificación) Classified Data (Datos Clasificados)
  • 24. ITITITIT [[[[24242424]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology 6666. Evaluation. Evaluation. Evaluation. Evaluation STATE OF ARTSTATE OF ARTSTATE OF ARTSTATE OF ART 6666.1 Random Subsampling.1 Random Subsampling.1 Random Subsampling.1 Random Subsampling 1. Use the training set. 2. Split it into training set (66.66 %) and testing set (33.33%). (RANDOM) 3. Build a model on the training set. 4. Evaluate on the test set. 6666.2 Cross Validation (K.2 Cross Validation (K.2 Cross Validation (K.2 Cross Validation (K----FOLD)FOLD)FOLD)FOLD) 1. Use the training set. 2. Split it into training/test sets. 3. Build a model on the training set 4. Evaluate on the test set. 5. Repeat and average the estimated Never Overlap! K-FOLD K = 1 K = 2 ……. K = 10 = 1 Test Data Test Data Test Data Training Data Training Data Training DataTr. Data Test Training DataTest Data Test Data Test Data Test Data Test Data Test Training DataTest Data Test Data Test Data Test Data Test Data
  • 25. ITITITIT [[[[25252525]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Link: https://es.wikipedia.org/wiki/Validaci%C3%B3n_cruzada 6666.3 Confusion Matrix.3 Confusion Matrix.3 Confusion Matrix.3 Confusion Matrix - Accuracy (Precisión o Efectividad) . The rate of correct predictions - Error rate. The rate of incorrect predictions. - Performance (Eficiencia). The algorithm is quick or nor in the training phase or in the testing phase. Actual/Real ClassActual/Real ClassActual/Real ClassActual/Real Class Predicted ClassPredicted ClassPredicted ClassPredicted Class TotalTotalTotalTotal Yes No Yes (1)Yes (1)Yes (1)Yes (1) True Positive (TP) False Negative (FN) Total Positive Real (TPR) No (0)No (0)No (0)No (0) False Positive (FP) True Negative (TN) Total Negative Real (TNR) TotalTotalTotalTotal Total Positive Predicted (TPP) Total Negative Predicted (TNP) Total Link: http://en.wikipedia.org/wiki/Confusion_matrix
  • 26. ITITITIT [[[[26262626]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology A.1A.1A.1A.1.... ¿What is a DATASET?¿What is a DATASET?¿What is a DATASET?¿What is a DATASET? Example: Dataset email50 Row represents a casecasecasecase, a unit of observationunit of observationunit of observationunit of observation, an observational unitobservational unitobservational unitobservational unit, an instanceinstanceinstanceinstance. OBSERVATIONS.OBSERVATIONS.OBSERVATIONS.OBSERVATIONS. EXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARYEXAMPLE OR EXEMPLARY.... Column represents an attributeattributeattributeattribute, a variablevariablevariablevariable, a featurefeaturefeaturefeature (represent characteristics). Special column. the classthe classthe classthe class, the class labelthe class labelthe class labelthe class label ( two values or multi-valued) For example: The email 4, which is not spam, contains 2454 characters, 61 line breaks, is written in Text format (0=text, 1=html), and contains only small numbers. Variable Description spam Specifies whether the message was spam num_char The number of characters in the email line_breaks The number of line breaks in the email (not including text wrapping) Format Indicates if the email contained special formatting, such as bolding, tables or links, which would indicate the message is in HTML format Number Indicates whether the email contained no number, a small number (under 1 million) or a large number DatasetDatasetDatasetDataset represents a data matrixdata matrixdata matrixdata matrix, data framedata framedata framedata frame. Each row of a data matrix corresponds to unique case (example), and each column corresponds to a variable. A.2A.2A.2A.2 Types of variablesTypes of variablesTypes of variablesTypes of variables
  • 27. ITITITIT [[[[27272727]]]]@gsantosgo@gsantosgo@gsantosgo@gsantosgo Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology Information TecnologyInformation TecnologyInformation TecnologyInformation Tecnology num_char and line_breaks CUANTITATIVE, NUMERICAL AND CONTINOUS VARIABLES. spam CUANTITATIVE, NUMERICAL AND DISCRETE VARIABLE. number indicates whether the email contained no number, a small number (under 1 million) or a large number. It takes values none, small and big. The different levels have a natural ordering. CUALITATIVE, CATEGORICAL VARIABLES AND ORDINAL VARIABLE. Variables Numerical Continuous Discretes Categorical Regular Categorical Ordinal