Machine Learning for Language Technology 2015
Preliminaries
Understanding and Preprocessing Data
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
Lecture 2: Preliminaries 1
Acknowledgements
• Weka Slides (teaching material*), Wikipedia,
MathIsFun and other websites.
* http://www.cs.waikato.ac.nz/ml/weka/book.html
Outline
– Raw Data and Feature Representation:
• Concepts, instances, attributes
– Digression 1: Pills of Statistics
• Sampling, mean, variance, standard deviation,
normalization, standardization, etc.
– Digression 2: Data Visualization
• How to read a histogram, scatter plot, etc.
DATA, CONCEPTS, INSTANCES,
ATTRIBUTES, FEATURES
Raw Data and Data Representation
What is data?
• Data is a collection of facts, such as numbers,
words, measurements, observations or even
just descriptions of things.
• Data can be qualitative or quantitative.
– Qualitative data is descriptive information (it
describes something)
– Quantitative data is numeric information
(numbers).
Singular or Plural?
• The singular form of "data" is "datum".
– Ex: "that datum is very high"
• The plural form of "datum" is "data".
• "Data" takes a plural verb when it refers to many individual items (each one a datum)
– Ex: "the data are available"
• But "data" can also refer to a collection of facts. In this case it is uncountable and takes a singular verb
– Ex: "the data is available"
http://www.theguardian.com/news/datablog/2010/jul/16/data-plural-singular
Qualitative Data
• Categorical values
– Nominal (ex: eye colour)
– Ordinal (ex: street numbers)
Quantitative Data
• Quantitative data can be discrete or continuous.
• Discrete data is counted; continuous data is measured.
– Discrete data can only take certain values (like whole numbers)
– Continuous data can take any value (within a range)
Concepts, Instances, and Attributes
• Components of the input:
– Concepts: kinds of things that can be learned
– Instances: the individual, independent examples of a concept
– Attributes: measuring aspects of an instance
The importance of feature selection
and representation
Binary data is a special type of categorical
data. Binary data takes only two values.
GETTING TO KNOW YOUR DATA
Missing Data/Values
• Types: unknown, unrecorded, irrelevant, etc.
• Reasons:
– collation of different datasets
– measurement not possible
– etc.
• Missing data may have significance in itself (e.g. a missing test in a medical examination)
• Most ML schemes assume that missing data have no special significance. So… be careful and make your own decisions.
Inaccurate values
• Typographical errors in nominal attributes → values need to be checked for consistency
• Typographical and measurement errors in numeric attributes → outliers need to be identified
Noise
• Noise is any unwanted anomaly in the data.
• In ML the presence of noise may cause
difficulties in learning the classes and produce
unreliable classifiers.
• Noise can be caused by:
– imprecision in recording input attributes
– errors in labelling
– etc.
Getting to know the data
• Simple visualization tools are very useful
– Nominal attributes: histograms
– Numeric attributes: graphs
• Too much data to inspect? Take a sample!
ARFF FORMAT
Weka (Waikato Environment for Knowledge Analysis)
Weka Software Package
http://www.cs.waikato.ac.nz/ml/weka/
Weka (Waikato Environment for Knowledge Analysis) is developed at the University of Waikato in New Zealand.
It is a collection of state-of-the-art machine learning algorithms and data preprocessing tools.
It is open source and written in Java.
It contains implementations of learning algorithms that you can apply to your datasets.
Weka input data formats
• General formats:
• Weka:
– ARFFAttribute-Relation File format.
– It is an ASCII file that describes a list of instances
sharing a set of attributes.
Lecture 2: Preliminaries 18
The ARFF format
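As a minimal sketch of what such a file can look like, the Python snippet below builds a tiny ARFF document as a string. The helper `to_arff`, the relation name, the attribute names and the values are all made up for illustration; only the `@relation` / `@attribute` / `@data` layout follows the ARFF format.

```python
# Build a tiny ARFF document by hand (illustrative, hypothetical data).
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type) pairs, where type is 'numeric'
    or a list of nominal values; rows: list of value lists."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):              # nominal attribute
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)


arff_text = to_arff(
    "emails",                                   # hypothetical relation name
    [("count_and", "numeric"), ("genre", ["spam", "ham"])],
    [[57, "ham"], [12, "spam"]],
)
print(arff_text)
```

The header declares one attribute per line, and each line after `@data` is one instance with its values in the same order as the attribute declarations.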
Sparse data
• In some applications most attribute values in a dataset are zero
– E.g.: word counts in a text categorization problem
• ARFF supports sparse data. The dense rows
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
can be written in sparse form, where each entry is an attribute index (starting at 0) followed by its value:
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}
• This also works for nominal attributes (where the first value corresponds to "zero")
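The conversion from a dense row to the sparse notation can be sketched in Python as follows. The function name `to_sparse_row` is hypothetical, and the sketch only drops numeric zeros; it does not handle the nominal case in which the first declared value counts as "zero".

```python
def to_sparse_row(values):
    """Convert one dense ARFF data row into sparse ARFF notation."""
    parts = []
    for index, value in enumerate(values):      # indices start at 0
        if value == 0:
            continue                            # zeros are left implicit
        if isinstance(value, str):
            value = f'"{value}"'                # quote string values
        parts.append(f"{index} {value}")
    return "{" + ", ".join(parts) + "}"


row_a = to_sparse_row([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"])
row_b = to_sparse_row([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"])
print(row_a)
print(row_b)
```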
SAMPLING
NORMAL DISTRIBUTION
MEASURES OF CENTRAL TENDENCY
Digression: Pills of Statistics
Population and Sample
– Population: the whole group of "things" we want to study
• Ex: all students born between 1980 and 2000
– Sample: a selection taken from a larger group (the "population") so that you can examine it to find out something about the larger group
• Ex: 100 randomly chosen students born between 1980 and 2000
In other words:
The "population" is the entire pool from which a statistical sample is drawn.
The information obtained from the sample allows statisticians to develop hypotheses about the larger population.
Researchers gather information from a sample because of the difficulty of studying the entire population.
Sampling
• Sampling is a science in itself and there are
different methods to sample a population
– Ex: random sampling, stratified sampling, multi-stage sampling, quota sampling, etc.
• The main concern: the sample should be representative of the population.
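Two of these methods can be sketched with Python's standard `random` module. The population below (student id and birth year pairs) is invented for illustration, and the function names are hypothetical; the stratified version simply draws the same fraction from every birth year.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, seed=0):
    """Draw k items uniformly at random, without replacement."""
    return random.Random(seed).sample(population, k)


def stratified_sample(population, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in population:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: (student_id, birth_year) pairs, 1980-2000.
students = [(i, 1980 + i % 21) for i in range(2100)]
random_100 = simple_random_sample(students, 100)
per_year = stratified_sample(students, key=lambda s: s[1], fraction=0.05)
print(len(random_100), len(per_year))
```

Stratifying by birth year guarantees that every year is represented in the sample, which a simple random sample does not.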
Distributions
Normal Distribution
• A normal distribution is an arrangement of a
data set in which most values cluster in the
middle of the range and the rest taper off
symmetrically toward either extreme.
Skewness
• When data is "skewed", it shows a long tail on one side or the other.
Outliers
• An outlier is an observation point that is
distant from other observations.
Measures of Central Tendency
• In a normal distribution, the mean, mode and
median are all the same.
Right Skewed Distribution
Negative Skewed Distribution
Mean
• The mean is the average of the numbers: a calculated
"central" value of a set of numbers. To calculate: Just
add up all the numbers, then divide by how many
numbers there are.
Ex: what is the mean of 2, 7 and 9?
• Add the numbers: 2 + 7 + 9 = 18
• Divide by how many numbers (i.e. we added 3
numbers): 18 ÷ 3 = 6
• The Mean is 6
Median
• The Median is the middle number (in a sorted
list of numbers). To find the Median, place
the numbers you are given in value order and
find the middle number. (If there are two
middle numbers, you average them.)
• Find the Median of {13, 23, 11, 16, 15, 10, 26}.
• Put them in order: {10, 11, 13, 15, 16, 23, 26}
• The middle number is 15, so the median is 15.
Mode
• The Mode is the number which appears most
often in a set of numbers.
• In {6, 3, 9, 6, 6, 5, 9, 3} the Mode is 6 (it occurs
most often).
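The three worked examples for mean, median and mode can be checked with Python's standard `statistics` module. This is only a quick sketch that reuses the numbers from the slides.

```python
import statistics

mean_value = statistics.mean([2, 7, 9])
median_value = statistics.median([13, 23, 11, 16, 15, 10, 26])
mode_value = statistics.mode([6, 3, 9, 6, 6, 5, 9, 3])

print(mean_value)     # 6
print(median_value)   # 15
print(mode_value)     # 6
```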
Frequency Table
• Ex of a frequency table:
The mean of a frequency table
• In a frequency table, the mean is calculated as follows:
– multiply each score by its frequency, add up all these products, and divide by the sum of the frequencies
Mean: Formula
• x̄ = Σfx / Σf
• x̄ (the x with the bar on top) means "mean of x"
• Σ (sigma) means "sum up"
• Σfx means "sum up all the frequencies times the matching scores"
• Σf means "sum up all the frequencies"
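A minimal Python sketch of the same calculation, Σfx divided by Σf, is given below. The frequency table is hypothetical (it is not the table from the quiz), and `frequency_mean` is just an illustrative name.

```python
def frequency_mean(table):
    """table maps each score x to its frequency f; returns sum(f*x) / sum(f)."""
    total_fx = sum(score * freq for score, freq in table.items())
    total_f = sum(table.values())
    return total_fx / total_f


# Hypothetical frequency table: score -> frequency.
scores = {1: 2, 2: 5, 3: 3}
table_mean = frequency_mean(scores)
print(table_mean)   # (1*2 + 2*5 + 3*3) / (2 + 5 + 3) = 21 / 10
```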
Quiz: The mean of a frequency table
• Calculate the mean of the following frequency table using
the mean formula:
Answers (only one is correct)
• 2.05
• 5.2
• 3.7
MEASURES OF DISPERSION
Digression: Pills of Statistics
Measures of Dispersion
• Dispersion is a general term for different
statistics that describe how values are
distributed around the centre
Measures of Dispersion
• range
• quartiles
• interquartile range
• percentiles
• mean deviation
• variance
• standard deviation
• etc.
Range
• The range is the difference between the
lowest and highest values.
– Example: In {4, 6, 9, 3, 7} the lowest value is 3,
and the highest is 9. So the range is 9-3 = 6.
Quartiles
• Quartiles are the values that divide a list of numbers into quarters.
– First put the list of numbers in order
– Then cut the list into four equal parts
– The quartiles are at the "cuts"
• Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 (the numbers must be in order)
• Cut the list into quarters. Taking Q2 as the median of the whole list, and Q1 and Q3 as the medians of the lower and upper halves, the result is:
• Quartile 1 (Q1) = 3
• Quartile 2 (Q2), which is also the median = 5.5
• Quartile 3 (Q3) = 7
Interquartile Range
• The "Interquartile Range" (IQR) is the range from Q1 to Q3.
• To calculate it, just subtract Quartile 1 from Quartile 3: IQR = Q3 − Q1
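Quartiles and the interquartile range can be sketched in Python as below. The sketch assumes the "median of halves" convention (Q1 and Q3 are the medians of the lower and upper halves of the sorted list); textbooks and software use several different conventions, so results may differ slightly from values obtained another way. The function name `quartiles` is hypothetical.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # values below the median position
    upper = ordered[(n + 1) // 2:]       # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```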
Percentiles
• A percentile is the value below which a given percentage of the data falls (the data needs to be in order)
• Example: you are the 4th tallest person in a group of 20; 80% of people are shorter than you. That means you are at the 80th percentile.
• That is, if your height is 1.85m then "1.85m" is the 80th percentile height in that group.
Mean Deviation
• It is the mean of the distances of each value from their mean.
• Three steps:
– 1. Find the mean of all values
– 2. Find the distance of each value from that mean (subtract the mean from each value and take the absolute value, i.e. ignore minus signs)
– 3. Then find the mean of those distances
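The three steps translate directly into a short Python function. The data values below are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def mean_deviation(values):
    mean = sum(values) / len(values)                 # step 1: the mean
    distances = [abs(v - mean) for v in values]      # step 2: absolute distances
    return sum(distances) / len(distances)           # step 3: mean of distances


data = [3, 6, 6, 7, 8, 11, 15, 16]   # hypothetical example values, mean = 9
deviation = mean_deviation(data)
print(deviation)   # distances 6, 3, 3, 2, 1, 2, 6, 7 sum to 30; 30 / 8 = 3.75
```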
Variance: σ²
• The Variance is the average of the squared
differences from the mean.
• To calculate the variance follow these steps:
– Work out the mean.
– Then for each number: subtract the Mean and
square the result (the squared difference).
– Then work out the average of those squared
differences.
Example: Compute the Variance
For the following dataset find the variance: {600, 470, 170, 430, 300}.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 394
For each number subtract the mean:
600 − 394 = 206; 470 − 394 = 76; 170 − 394 = −224; 430 − 394 = 36; 300 − 394 = −94
Take each difference, square it, and then average the results: (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5 = 108,520 / 5. The variance is 21,704.
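The same computation can be written as a few lines of Python. This is a sketch of the population variance (dividing by the number of values), applied to the dataset from the example; the function name is illustrative.

```python
def population_variance(values):
    mean = sum(values) / len(values)                     # work out the mean
    squared_diffs = [(v - mean) ** 2 for v in values]    # squared differences
    return sum(squared_diffs) / len(squared_diffs)       # average them


data = [600, 470, 170, 430, 300]
variance = population_variance(data)
print(variance)   # 21704.0
```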
Standard Deviation: σ
• The Standard Deviation is one of the most reliable measures of how spread out numbers are.
• The formula is easy: it is the square root of the variance.
Standard Deviation Formula (population)
σ = √( (1/N) Σ_{i=1}^{N} (x_i − μ)² )
• μ = the mean
• x_i = an individual value of the dataset
• (x_i − μ)² = for each value subtract the mean and square the result
• N = the total number of values in the dataset
• i=1 = start at this value (here the first number of the dataset)
• Σ = add up all the values
• 1/N = divide by the total number of values in the dataset
• √ = take the square root of the whole calculation
Standard Deviation Formula (sample)
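The usual sample version divides the sum of squared differences by N − 1 instead of N and uses the sample mean in place of μ. As a sketch, Python's standard `statistics` module offers both variants: `pstdev` for the population formula and `stdev` for the sample formula. The dataset is reused from the variance example.

```python
import statistics

data = [600, 470, 170, 430, 300]
population_sd = statistics.pstdev(data)   # divides by N
sample_sd = statistics.stdev(data)        # divides by N - 1
print(round(population_sd, 2), round(sample_sd, 2))
```

Because it divides by a smaller number, the sample standard deviation is always a little larger than the population one computed on the same values.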
Standard Deviation is the most reliable
measure of dispersion
• Depending on the situation, not all measures of dispersion are equally reliable.
• For example, the range can be misleading when there are extremely high or low values.
– Example: in {8, 11, 5, 9, 7, 6, 3616} the lowest value is 5 and the highest is 3616, so the range is 3616 − 5 = 3611.
• However, the single value of 3616 makes the range large, while most values are around 10.
• So we may be better off using other measures, such as the standard deviation (here 1262.65).
Normal Distribution and Standard Deviation
Standard Deviation vs Variance
• A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.
• In other words: the standard deviation is expressed in the same units as the mean, whereas the variance is expressed in squared units. So the standard deviation is more intuitive…
• Note that a normal distribution with mean = 10 and standard deviation = 3 is exactly the same thing as a normal distribution with mean = 10 and variance = 9.
• Watch out and be clear about which one you are using!
Quiz: Standard Deviation
68% of the frequency values of the word “and” in a
corpus of email (assume emails have equal length) are
between 51 and 64. Assuming this data is normally
distributed, what are the mean and standard
deviation?
1. Mean = 57; S.D. = 6.5
2. Mean = 57.5 ; S.D. = 6.5
3. Mean = 57.5; S.D. = 13
These notions will be taken up again later...
• … when dealing with statistical inference and other statistical methods.
• Standard Deviation Calculator: http://www.mathsisfun.com/data/standard-deviation-calculator.html
NORMALIZATION AND STANDARDISATION
Digression: Pills of Statistics
Normalization
• To normalize data means to fit the data within
unity, so all the data will take on a value
between 0 and 1. Many formulas are
available:
• Ex:
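One common choice is min-max scaling, x' = (x − min) / (max − min). It is assumed here as an illustration and is not necessarily the formula shown on the original slide. A minimal Python sketch, with a hypothetical function name and made-up values:

```python
def min_max_normalize(values):
    """Rescale values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]     # all values equal: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2, 4, 6, 10])
print(normalized)   # [0.0, 0.25, 0.5, 1.0]
```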
Standardization
• Standardization converts all variables to a common scale and reflects how many standard deviations from the mean a data point falls.
• The number of standard deviations from the mean is also called the "Standard Score", "sigma" or "z-score".
How to standardize
• z = (x − μ) / σ
• z is the "z-score" (Standard Score)
• x is the value to be standardised
• μ is the mean
• σ is the standard deviation
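As a sketch, the z-score of every value in a dataset can be computed in Python by subtracting the mean and dividing by the standard deviation. The population standard deviation (`statistics.pstdev`) is assumed here, the function name `z_scores` is illustrative, and the dataset is reused from the variance example.

```python
import statistics


def z_scores(values):
    mu = statistics.mean(values)        # the mean
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(x - mu) / sigma for x in values]


data = [600, 470, 170, 430, 300]
standardized = z_scores(data)
print([round(z, 2) for z in standardized])
```

After standardization the values have mean 0 and standard deviation 1, which makes variables measured on different scales comparable.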
Why Standardize?
• It can help us make decisions about our data.
CHARTS AND GRAPHS
Data Visualization
Weka: Data Visualization
Outline
• Bar chart
• Histogram
• Pie chart
• Line chart
• Scatter plot
• Dot plot
• Box plot
Axes and Coordinates
• The left-right (horizontal) direction is commonly called X or abscissa.
• The up-down (vertical) direction is commonly called Y or ordinate.
• The coordinates are always written in a certain order: the horizontal distance first, then the vertical distance.
Revision: read this web page carefully:
https://www.mathsisfun.com/data/cartesian-coordinates.html
Bar Chart
• A Bar Chart (also called Bar Graph) is a
graphical display of data using bars of
different heights.
• Bar charts are used to graph categorical data.
Example:
Histogram
• With continuous data, histograms are used.
• Histograms are similar to bar charts, but a histogram
groups numbers into ranges.
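The grouping into ranges can be sketched in a few lines of Python. The sentence lengths and the bin width below are hypothetical, and the text "bars" only stand in for a real plot.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall in each range [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sentence lengths (in words), grouped in ranges of 5.
lengths = [1, 2, 5, 7, 11, 12, 13]
bin_width = 5
bins = histogram(lengths, bin_width)
for start, count in bins.items():
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * count}")
```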
Pie Chart
• It is a special chart that uses "slices" to show
relative sizes of data.
• Pie charts have been criticized because slice angles and areas are harder to compare by eye than bar lengths.
Line Chart
• A line chart is a graph that shows information that is connected in some way (such as change over time).
Scatter plot
• A scatter plot has points that show the
relationship between two sets of data.
• Example: each dot shows one person's weight
versus their height.
Line of best fit
• Draw a "Line of Best Fit" (also called a "Trend Line") on the scatter plot to predict values that might not be on the plot.
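One standard way to compute such a line, assumed here because the slide does not name a method, is ordinary least squares. The Python sketch below fits y = slope · x + intercept and uses it to predict a value that is not in the data; the heights and weights are invented, and `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical heights (cm) and weights (kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 75, 83]
slope, intercept = fit_line(heights, weights)
predicted = slope * 175 + intercept     # weight predicted for 175 cm
print(round(slope, 2), round(predicted, 1))
```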
Correlations
• Scatter plots are useful to detect correlations
between the sets of data.
– Correlation is Positive when the values increase together
– Correlation is Negative when one value decreases as the other increases
More on scatter plots: https://www.mathsisfun.com/data/scatter-xy-plots.html
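The strength and direction of a linear correlation are often summarised with Pearson's correlation coefficient r, which ranges from −1 (perfect negative) to +1 (perfect positive). The slides do not name a specific coefficient, so the choice of Pearson's r is an assumption; the Python sketch below uses made-up data and a hypothetical function name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # values increase together
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # one falls as the other rises
print(r_positive, r_negative)
```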
Quiz: Scatter Plot
• The correlation seen in the graph at the right
would be best described as:
1. high positive correlation
2. low positive correlation
3. high negative correlation
4. low negative correlation
Dot Plot
• A dot plot is a graphical display of data using dots.
• It is an alternative to the bar chart, in which dots are
used to depict the quantitative values (e.g. counts)
associated with categorical variables.
Box Plot
• Box plots are useful to highlight outliers, the median and the interquartile range.
• They are also known as box-and-whisker plots.
The End

Contenu connexe

Tendances

Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsViet-Trung TRAN
 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data MiningValerii Klymchuk
 
Distributed system Tanenbaum chapter 1,2,3,4 notes
Distributed system Tanenbaum chapter 1,2,3,4 notes Distributed system Tanenbaum chapter 1,2,3,4 notes
Distributed system Tanenbaum chapter 1,2,3,4 notes SAhammedShakil
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisJaclyn Kokx
 
Random Number Generation
Random Number GenerationRandom Number Generation
Random Number GenerationRaj Bhatt
 
Exploratory data analysis project
Exploratory data analysis project Exploratory data analysis project
Exploratory data analysis project BabatundeSogunro
 
Dempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDigiGurukul
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptxRoshan86572
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learningmahutte
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonYash Khanna
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learningSANTHOSH RAJA M G
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptxmaha797959
 

Tendances (20)

Dimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applicationsDimensionality reduction: SVD and its applications
Dimensionality reduction: SVD and its applications
 
K means clustering
K means clusteringK means clustering
K means clustering
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Distributed system Tanenbaum chapter 1,2,3,4 notes
Distributed system Tanenbaum chapter 1,2,3,4 notes Distributed system Tanenbaum chapter 1,2,3,4 notes
Distributed system Tanenbaum chapter 1,2,3,4 notes
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Clustering - K-Means, DBSCAN
Clustering - K-Means, DBSCANClustering - K-Means, DBSCAN
Clustering - K-Means, DBSCAN
 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
 
Random Number Generation
Random Number GenerationRandom Number Generation
Random Number Generation
 
Exploratory data analysis project
Exploratory data analysis project Exploratory data analysis project
Exploratory data analysis project
 
Dempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th SemDempster Shafer Theory AI CSE 8th Sem
Dempster Shafer Theory AI CSE 8th Sem
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
 
Foundations of Machine Learning
Foundations of Machine LearningFoundations of Machine Learning
Foundations of Machine Learning
 
Summer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of LondonSummer Report on Mathematics for Machine learning: Imperial College of London
Summer Report on Mathematics for Machine learning: Imperial College of London
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learning
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 

En vedette

Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationMarina Santini
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part) Marina Santini
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Marina Santini
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Complexity of Project And Impacts on Preliminaries of Construction Project At...
Complexity of Project And Impacts on Preliminaries of Construction Project At...Complexity of Project And Impacts on Preliminaries of Construction Project At...
Complexity of Project And Impacts on Preliminaries of Construction Project At...Ir. Abdul Aziz Abas
 
Towards Contextualized Information: How Automatic Genre Identification Can Help
Towards Contextualized Information: How Automatic Genre Identification Can HelpTowards Contextualized Information: How Automatic Genre Identification Can Help
Towards Contextualized Information: How Automatic Genre Identification Can HelpMarina Santini
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Building contracts and the JCT
Building contracts and the JCTBuilding contracts and the JCT
Building contracts and the JCTJulian Swindell
 
Week 01 Preliminaries Works, Soil Investigate & Ground Water Control
Week 01 Preliminaries Works, Soil Investigate & Ground Water ControlWeek 01 Preliminaries Works, Soil Investigate & Ground Water Control
Week 01 Preliminaries Works, Soil Investigate & Ground Water Controlnik kin
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebMarina Santini
 
Statistics Notes
Statistics NotesStatistics Notes
Statistics Notessd
 
09 semantic web & ontologies
09 semantic web & ontologies09 semantic web & ontologies
09 semantic web & ontologiesMarina Santini
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 

En vedette (20)

Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?Lecture 1: What is Machine Learning?
Lecture 1: What is Machine Learning?
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Chapter 1
Chapter 1Chapter 1
Chapter 1
 
Complexity of Project And Impacts on Preliminaries of Construction Project At...
Complexity of Project And Impacts on Preliminaries of Construction Project At...Complexity of Project And Impacts on Preliminaries of Construction Project At...
Complexity of Project And Impacts on Preliminaries of Construction Project At...
 
fidic , jct and nec contracts
fidic , jct and nec contractsfidic , jct and nec contracts
fidic , jct and nec contracts
 
Towards Contextualized Information: How Automatic Genre Identification Can Help
Towards Contextualized Information: How Automatic Genre Identification Can HelpTowards Contextualized Information: How Automatic Genre Identification Can Help
Towards Contextualized Information: How Automatic Genre Identification Can Help
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Building contracts and the JCT
Building contracts and the JCTBuilding contracts and the JCT
Building contracts and the JCT
 
Construction tender process
Construction tender processConstruction tender process
Construction tender process
 
Week 01 Preliminaries Works, Soil Investigate & Ground Water Control
Week 01 Preliminaries Works, Soil Investigate & Ground Water ControlWeek 01 Preliminaries Works, Soil Investigate & Ground Water Control
Week 01 Preliminaries Works, Soil Investigate & Ground Water Control
 
Lecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language TechnologyLecture 2 Basic Concepts in Machine Learning for Language Technology
Lecture 2 Basic Concepts in Machine Learning for Language Technology
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Statistics Notes
Statistics NotesStatistics Notes
Statistics Notes
 
09 semantic web & ontologies
09 semantic web & ontologies09 semantic web & ontologies
09 semantic web & ontologies
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture 9 Perceptron
Lecture 9 PerceptronLecture 9 Perceptron
Lecture 9 Perceptron
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 

Similaire à Lecture 2: Preliminaries (Understanding and Preprocessing data)

Biostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBiostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBatizemaryam
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Sherri Gunder
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptxCalvinAdorDionisio
 
Class1.ppt
Class1.pptClass1.ppt
Class1.pptGautam G
 
Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1RajnishSingh367990
 
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICSSTATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICSnagamani651296
 
Biostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptxBiostatistics mean median mode unit 1.pptx
Biostatistics mean median mode unit 1.pptxSailajaReddyGunnam
 
MEASURES OF CENTRAL TENDENCY AND MEASURES OF DISPERSION
MEASURES OF CENTRAL TENDENCY AND  MEASURES OF DISPERSION MEASURES OF CENTRAL TENDENCY AND  MEASURES OF DISPERSION
MEASURES OF CENTRAL TENDENCY AND MEASURES OF DISPERSION Tanya Singla
 
Spss basic Dr Marwa Zalat
Spss basic Dr Marwa ZalatSpss basic Dr Marwa Zalat
Spss basic Dr Marwa ZalatMarwa Zalat
 
3. Descriptive statistics.pdf
3. Descriptive statistics.pdf3. Descriptive statistics.pdf
3. Descriptive statistics.pdfYomifDeksisaHerpa
 
ANALYSIS OF DATA ANALYSIS TOOLS IN RESEARCH PPT
ANALYSIS OF DATA ANALYSIS TOOLS IN RESEARCH  PPTANALYSIS OF DATA ANALYSIS TOOLS IN RESEARCH  PPT
ANALYSIS OF DATA ANALYSIS TOOLS IN RESEARCH PPTsweetymitra4
 

Lecture 2: Preliminaries (Understanding and Preprocessing data)

  • 1. Machine Learning for Language Technology 2015 Preliminaries Understanding and Preprocessing Data Marina Santini santinim@stp.lingfil.uu.se Department of Linguistics and Philology Uppsala University, Uppsala, Sweden Autumn 2015 Lecture 2: Preliminaries 1
  • 2. Acknowledgements • Weka Slides (teaching material*), Wikipedia, MathIsFun and other websites. * http://www.cs.waikato.ac.nz/ml/weka/book.html
  • 3. Outline – Raw Data and Feature Representation: • Concepts, instances, attributes – Digression 1: Pills of Statistics • Sampling, mean, variance, standard deviation, normalization, standardization, etc. – Digression 2: Data Visualization • How to read a histogram, scatter plot, etc.
  • 4. DATA, CONCEPTS, INSTANCES, ATTRIBUTES, FEATURES Raw Data and Data Representation
  • 5. What is data? • Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things. • Data can be qualitative or quantitative. – Qualitative data is descriptive information (it describes something) – Quantitative data is numeric information (numbers).
  • 6. Singular or Plural? • The singular form of "data" is "datum". – Ex: "that datum is very high" • The plural form of "datum" is "data". • "Data" is plural when it indicates many individual items (each one a datum) – Ex: "the data are available" • But "data" can also refer to a collection of facts. In this case it is uncountable and takes a singular verb – Ex: "the data is available" http://www.theguardian.com/news/datablog/2010/jul/16/data-plural-singular
  • 7. Qualitative Data • Categorical values – Nominal (ex: eye colour) – Ordinal (ex: street numbers)
  • 8. Quantitative Data • Quantitative data can be discrete or continuous. • Discrete data is counted; continuous data is measured – Discrete data can only take certain values (like whole numbers) – Continuous data can take any value (within a range)
  • 9. Concepts, Instances, and Attributes • Components of the input: – Concepts: kinds of things that can be learned – Instances: the individual, independent examples of a concept – Attributes: measuring aspects of an instance
  • 10. The importance of feature selection and representation • Binary data is a special type of categorical data. Binary data takes only two values.
  • 11. GETTING TO KNOW YOUR DATA
  • 12. Missing Data/Values • Types: unknown, unrecorded, irrelevant, etc. • Reasons: collation of different datasets, measurement not possible, etc. • Missing data may have significance in itself (e.g. a missing test in a medical examination) • Most ML schemes assume that missing data have no special significance. So… be careful and make your own decisions.
  • 13. Inaccurate values • Typographical errors in nominal attributes: values need to be checked for consistency • Typographical and measurement errors in numeric attributes: outliers need to be identified
  • 14. Noise • Noise is any unwanted anomaly in the data. • In ML the presence of noise may cause difficulties in learning the classes and produce unreliable classifiers. • Noise can be caused by: – imprecisions in recording input attributes – errors in labelling – etc.
  • 15. Getting to know the data • Simple visualization tools are very useful – Nominal attributes: histograms – Numeric attributes: graphs • Too much data to inspect? Take a sample!
  • 16. ARFF FORMAT Weka (Waikato Environment for Knowledge Analysis)
  • 17. Weka Software Package http://www.cs.waikato.ac.nz/ml/weka/ • Weka (Waikato Environment for Knowledge Analysis) is developed at the University of Waikato in New Zealand. • It is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. • It is open source and written in Java. • It contains implementations of learning algorithms that you can apply to your datasets.
  • 18. Weka input data formats • General formats: • Weka: – ARFF: Attribute-Relation File Format. – It is an ASCII file that describes a list of instances sharing a set of attributes.
  • 19. The ARFF format
  • 20. Sparse data • In some applications most attribute values in a dataset are zero – E.g.: word counts in a text categorization problem • ARFF supports sparse data: the dense rows 0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A" and 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B" become {1 26, 6 63, 10 "class A"} and {3 42, 10 "class B"}, where each entry is a zero-based attribute index followed by its value • This also works for nominal attributes (where the first value corresponds to "zero")
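The dense-to-sparse conversion on the sparse data slide can be sketched in a few lines of Python. This is only an illustration: the function name `dense_to_sparse` is hypothetical, and the sketch does not handle the nominal-attribute case where the first declared value counts as "zero".

```python
def dense_to_sparse(row):
    """Return an ARFF-style sparse row: '{index value, ...}' with zero-based indices."""
    entries = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zeros are implicit in the sparse form
        # Quote strings (e.g. class labels); leave numbers bare.
        text = f'"{value}"' if isinstance(value, str) else str(value)
        entries.append(f"{index} {text}")
    return "{" + ", ".join(entries) + "}"


row_a = [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"]
row_b = [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"]
print(dense_to_sparse(row_a))  # {1 26, 6 63, 10 "class A"}
print(dense_to_sparse(row_b))  # {3 42, 10 "class B"}
```

Only the non-zero positions are stored, which is why the sparse form pays off for word-count features.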
  • 21. SAMPLING, NORMAL DISTRIBUTION, MEASURES OF CENTRAL TENDENCY Digression: Pills of Statistics
  • 22. Population and Sample • Population: the whole group of "things" we want to study – Ex: all students born between 1980 and 2000 • Sample: a selection taken from a larger group (the "population") so that you can examine it to find out something about the larger group – Ex: 100 randomly chosen students born between 1980 and 2000 • In other words: the "population" is the entire pool from which a statistical sample is drawn. The information obtained from the sample allows statisticians to develop hypotheses about the larger population. Researchers gather information from a sample because of the difficulty of studying the entire population.
  • 23. Sampling • Sampling is a science in itself and there are different methods to sample a population – Ex: random sampling, stratified sampling, multi-stage sampling, quota sampling, etc. • The main concern: the sample should be representative of the population.
  • 25. Normal Distribution • A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme.
  • 26. Skewness • When data is "skewed", it shows a long tail on one side or the other.
  • 27. Outliers • An outlier is an observation point that is distant from other observations.
  • 28. Measures of Central Tendency • In a normal distribution, the mean, mode and median are all the same.
  • 29. Right Skewed Distribution
  • 31. Mean • The mean is the average of the numbers: a calculated "central" value of a set of numbers. • To calculate: just add up all the numbers, then divide by how many numbers there are. • Ex: what is the mean of 2, 7 and 9? – Add the numbers: 2 + 7 + 9 = 18 – Divide by how many numbers (i.e. we added 3 numbers): 18 ÷ 3 = 6 – The mean is 6
  • 32. Median • The median is the middle number (in a sorted list of numbers). To find the median, place the numbers you are given in value order and find the middle number. (If there are two middle numbers, you average them.) • Find the median of {13, 23, 11, 16, 15, 10, 26}. – Put them in order: {10, 11, 13, 15, 16, 23, 26} – The middle number is 15, so the median is 15.
  • 33. Mode • The mode is the number which appears most often in a set of numbers. • In {6, 3, 9, 6, 6, 5, 9, 3} the mode is 6 (it occurs most often).
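The three worked examples for mean, median and mode can be checked with Python's standard `statistics` module. This is a small sketch that reuses the numbers from the slides; the variable names are illustrative only.

```python
from statistics import mean, median, mode

mean_value = mean([2, 7, 9])                          # (2 + 7 + 9) / 3
median_value = median([13, 23, 11, 16, 15, 10, 26])   # sorts internally
mode_value = mode([6, 3, 9, 6, 6, 5, 9, 3])           # most frequent value

print(mean_value)    # 6
print(median_value)  # 15
print(mode_value)    # 6
```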
  • 34. Frequency Table • Example of a frequency table:
  • 35. The mean of a frequency table • In a frequency table, the mean is calculated by multiplying each score by its frequency, adding up the products, and dividing by the sum of the frequencies.
  • 36. Mean: Formula • $\bar{x} = \frac{\sum fx}{\sum f}$ • The x with the bar on top means "mean of x" • Σ (sigma) means "sum up" • Σ fx means "sum up all the frequencies times the matching scores" • Σ f means "sum up all the frequencies"
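As a sketch of the frequency-table mean (Σ fx divided by Σ f), the snippet below uses a made-up table, since the table on the slide did not survive in this transcript. The name `frequency_table_mean` and the sample scores are assumptions for illustration.

```python
def frequency_table_mean(table):
    """table maps each score x to its frequency f; returns sum(f*x) / sum(f)."""
    total_fx = sum(score * freq for score, freq in table.items())
    total_f = sum(table.values())
    return total_fx / total_f


# Hypothetical table: score -> frequency
table = {1: 1, 2: 2, 3: 3, 4: 4}
print(frequency_table_mean(table))  # 3.0  (30 / 10)
```

The result equals the plain mean of the expanded list 1, 2, 2, 3, 3, 3, 4, 4, 4, 4.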
  • 37. Quiz: The mean of a frequency table • Calculate the mean of the following frequency table using the mean formula: • Answers (only one is correct): – 2.05 – 5.2 – 3.7
  • 38. MEASURES OF DISPERSION Digression: Pills of Statistics
  • 39. Measures of Dispersion • Dispersion is a general term for different statistics that describe how values are distributed around the centre
  • 40. Measures of Dispersion • range • quartiles • interquartile range • percentiles • mean deviation • variance • standard deviation • etc.
  • 41. Range • The range is the difference between the lowest and highest values. – Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 − 3 = 6.
  • 42. Quartiles • Quartiles are the values that divide a list of numbers into quarters. – First put the list of numbers in order – Then cut the list into four equal parts – The quartiles are at the "cuts" • Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 (the numbers must be in order) • Cut the list into quarters, taking the median of each half. The result is: – Quartile 1 (Q1) = 3 – Quartile 2 (Q2), which is also the median = 5.5 – Quartile 3 (Q3) = 7
  • 43. Interquartile Range • The "Interquartile Range" is from Q1 to Q3. • To calculate it just subtract Quartile 1 from Quartile 3: IQR = Q3 − Q1
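Several conventions exist for computing quartiles. The sketch below assumes the "median of each half" method and applies it to the ten-number list 1, 3, 3, 4, 5, 6, 6, 7, 8, 8, for which that method gives Q1 = 3, Q2 = 5.5 and Q3 = 7. The function name `quartiles` is illustrative, and other methods (for example interpolation-based ones) can give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # For an odd-length list the middle value is left out of both halves.
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
iqr = q3 - q1  # interquartile range: Q3 minus Q1
print(q1, q2, q3)  # 3 5.5 7
print(iqr)         # 4
```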
  • 44. Percentiles • A percentile is the value below which a given percentage of the data falls (the data needs to be in order) • Example: You are the 4th tallest person in a group of 20, so 16 of the 20 people (80%) are shorter than you: that means you are at the 80th percentile. • That is, if your height is 1.85m then "1.85m" is the 80th percentile height in that group.
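The percentile example can be reproduced with a short sketch that counts how many values lie below a given one. The heights are invented (20 distinct values in centimetres), and `percent_below` is a hypothetical helper name.

```python
def percent_below(values, x):
    """Percentage of values strictly smaller than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical group of 20 people, heights in cm: 150, 152, ..., 188
heights_cm = list(range(150, 190, 2))
fourth_tallest = sorted(heights_cm)[-4]

print(fourth_tallest)                             # 182
print(percent_below(heights_cm, fourth_tallest))  # 80.0
```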
  • 45. Mean Deviation • It is the mean of the distances of each value from their mean. • Three steps: – 1. Find the mean of all values – 2. Find the distance of each value from that mean (subtract the mean from each value and ignore minus signs, i.e. take the absolute value) – 3. Then find the mean of those distances
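The three steps for the mean deviation translate directly into code. The data set below is an illustrative example that is not taken from the slides, and `mean_deviation` is a hypothetical name.

```python
def mean_deviation(values):
    """Mean of the absolute distances of each value from the mean."""
    mean = sum(values) / len(values)              # step 1: the mean
    distances = [abs(v - mean) for v in values]   # step 2: absolute distances
    return sum(distances) / len(distances)        # step 3: mean of the distances


data = [3, 6, 6, 7, 8, 11, 15, 16]  # mean is 9
print(mean_deviation(data))  # 3.75
```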
  • 46. Variance: σ² • The variance is the average of the squared differences from the mean. • To calculate the variance follow these steps: – Work out the mean. – Then for each number: subtract the mean and square the result (the squared difference). – Then work out the average of those squared differences.
  • 47. Example: Compute the Variance • For the following dataset find the variance: {600, 470, 170, 430, 300}. • Mean = (600 + 470 + 170 + 430 + 300) / 5 = 394 • For each number subtract the mean: 600 − 394 = 206; 470 − 394 = 76; 170 − 394 = −224; 430 − 394 = 36; 300 − 394 = −94 • Take each difference, square it, and then average the results. The variance is 21,704.
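The variance example can be verified step by step. This sketch follows the slide's procedure (dividing by the number of values, i.e. the population variance); the variable names are illustrative.

```python
data = [600, 470, 170, 430, 300]

mean = sum(data) / len(data)                     # 394.0
differences = [x - mean for x in data]           # 206, 76, -224, 36, -94
squared = [d ** 2 for d in differences]          # squared differences
variance = sum(squared) / len(squared)           # average of the squares

print(mean)      # 394.0
print(variance)  # 21704.0
```

Note that the difference for 170 is negative (−224); the sign disappears once it is squared.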
  • 48. Standard Deviation: σ • The standard deviation is one of the most reliable measures of how spread out numbers are. • The formula is easy: it is the square root of the variance.
  • 49. Standard Deviation Formula (population) • $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$ • μ = the mean • $x_i$ = the individual value of a dataset • $(x_i - \mu)^2$ = for each value subtract the mean and square the result • N = the total number of values in the dataset • i=1 = start at this value (here the first number of the dataset) • Σ = add up all the values • 1/N = divide by the total number of values in the dataset • √ = take the square root of the whole calculation
  • 50. Standard Deviation Formula (sample)
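The difference between the population and the sample formula can be sketched with the standard `statistics` module. The sample formula itself is not reproduced in this transcript; the sketch assumes the usual definition, which divides the sum of squared differences by N − 1 instead of N. The data set is reused from the variance example.

```python
import math
from statistics import pstdev, stdev

data = [600, 470, 170, 430, 300]

population_sd = pstdev(data)  # divides by N
sample_sd = stdev(data)       # divides by N - 1 (assumed sample formula)

print(round(population_sd, 2))  # 147.32
print(round(sample_sd, 2))      # 164.71

# Both are square roots of the corresponding variances.
assert math.isclose(population_sd, math.sqrt(21704))
assert math.isclose(sample_sd, math.sqrt(108520 / 4))
```

The sample value is always the larger of the two, and the gap shrinks as the number of values grows.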
  • 51. Standard Deviation is the most reliable measure of dispersion • Depending on the situation, not all measures of dispersion are equally reliable. • For example, the range can be misleading when there are extremely high or low values. – Example: In {8, 11, 5, 9, 7, 6, 3616} the lowest value is 5 and the highest is 3616, so the range is 3616 − 5 = 3611. • However, the single value 3616 makes the range large, while most values are around 10. • So we may be better off using other measures, such as the standard deviation (here 1262.65).
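The two figures quoted for this data set can be recomputed as follows. The sketch assumes the population formula (dividing by N), which is the one that reproduces 1262.65.

```python
from statistics import pstdev

data = [8, 11, 5, 9, 7, 6, 3616]

value_range = max(data) - min(data)  # highest minus lowest
population_sd = pstdev(data)         # population standard deviation

print(value_range)              # 3611
print(round(population_sd, 2))  # 1262.65
```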
  • 52. Normal Distribution and Standard Deviation
  • 53. Standard Deviation vs Variance • A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data. • In other words: the standard deviation is expressed in the same units as the mean, whereas the variance is expressed in square units. So the standard deviation is more intuitive… • Note that a normal distribution with mean = 10 and standard deviation = 3 is exactly the same thing as a normal distribution with mean = 10 and variance = 9. • Watch out and be clear about what you are using!
  • 54. Quiz: Standard Deviation • 68% of the frequency values of the word "and" in a corpus of emails (assume the emails have equal length) are between 51 and 64. Assuming this data is normally distributed, what are the mean and standard deviation? 1. Mean = 57; S.D. = 6.5 2. Mean = 57.5; S.D. = 6.5 3. Mean = 57.5; S.D. = 13
  • 55. These notions will be revisited later... • … when dealing with statistical inference and other statistical methods. • Standard Deviation Calculator: http://www.mathsisfun.com/data/standard-deviation-calculator.html
  • 56. NORMALIZATION AND STANDARDISATION Digression: Pills of Statistics
  • 57. Normalization • To normalize data means to fit the data within unity, so all the data will take on a value between 0 and 1. Many formulas are available: • Ex:
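The formula shown on the normalization slide is not preserved in this transcript. One common choice, used here purely as an assumption, is min-max scaling: subtract the minimum and divide by the range, so the smallest value maps to 0 and the largest to 1. The name `min_max_normalize` is illustrative.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] via (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; the range is zero")
    return [(x - lowest) / (highest - lowest) for x in values]


print(min_max_normalize([2, 4, 10]))  # [0.0, 0.25, 1.0]
```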
  • 58. Standardization • Standardization coverts all variables to a common scale and reflects how many standard deviations from the mean that the data point falls • The number of standard deviations from the mean is also called the "Standard Score", "sigma" or "z-score". Lecture 2: Preliminaries 58
  • 59. How to standardize • z = (x - μ) / σ, where: • z is the "z-score" (Standard Score) • x is the value to be standardised • μ is the mean • σ is the standard deviation
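The symbols above translate directly into code. This is a minimal sketch that assumes the population standard deviation is used for σ; the function name `z_scores` and the sample data are hypothetical.

```python
import statistics


def z_scores(values):
    """Return the z-score (x - mu) / sigma for each value.

    Assumption: sigma is the population standard deviation.
    """
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical data with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(data)
print(scores)
```

A z-score of 2 means the value lies two standard deviations above the mean; a negative z-score means it lies below the mean.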
  • 60. Why Standardize? • It can help us make decisions about our data: once variables are on a common scale, values measured in different units can be compared directly.
  • 61. CHARTS AND GRAPHS Data Visualization
  • 62. Weka: Data Visualization
  • 63. Outline • Bar chart • Histogram • Pie chart • Line chart • Scatter plot • Dot plot • Box plot
  • 64. Axes and Coordinates • The left-right (horizontal) direction is commonly called X or abscissa. • The up-down (vertical) direction is commonly called Y or ordinate. • The coordinates are always written in a certain order: the horizontal distance first, then the vertical distance. • Repetition: read this web page carefully: https://www.mathsisfun.com/data/cartesian-coordinates.html
  • 65. Bar Chart • A Bar Chart (also called Bar Graph) is a graphical display of data using bars of different heights. • Bar charts are used to graph categorical data.
  • 66. Histogram • With continuous data, histograms are used. • Histograms are similar to bar charts, but a histogram groups numbers into ranges.
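Grouping numbers into ranges can be sketched without any plotting library. The function name `histogram_counts`, the bin width, the starting point and the sample values are all hypothetical choices made for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width, start):
    """Count how many values fall into each range [lower, lower + bin_width)."""
    counts = Counter()
    for v in values:
        # Index of the bin that contains v, counting from `start`.
        index = int((v - start) // bin_width)
        lower = start + index * bin_width
        counts[(lower, lower + bin_width)] += 1
    # Sort by the lower bound so the bins read left to right.
    return dict(sorted(counts.items()))


# Hypothetical continuous measurements.
lengths = [3.2, 4.8, 5.1, 5.9, 6.3, 7.7, 9.4, 9.9]
bins = histogram_counts(lengths, bin_width=2, start=2)

for (lower, upper), count in bins.items():
    # One '#' per value gives a rough text version of the histogram.
    print(f"{lower}-{upper}: {'#' * count}")
```

Each bar of the histogram then corresponds to one range, and its height to the count for that range.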
  • 67. Pie Chart • It is a special chart that uses "slices" to show relative sizes of data. • Pie charts have been criticized.
  • 68. Line Chart • A line chart is a graph that shows information that is connected in some way (such as change over time).
  • 69. Scatter plot • A scatter plot has points that show the relationship between two sets of data. • Example: each dot shows one person's weight versus their height.
  • 70. Line of best fit • Draw a "Line of Best Fit" (also called a "Trend Line") on the scatter plot to predict values that might not be on the plot.
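The slide does not say how the line is drawn. A common approach is ordinary least squares, which the sketch below assumes. The function name `fit_line` and the height/weight pairs are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x alone.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: height in cm and weight in kg for five people.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 74, 79]

slope, intercept = fit_line(heights, weights)

# Predict the weight for a height that is not in the data.
predicted = slope * 175 + intercept
print("slope:", slope, "intercept:", intercept, "prediction:", predicted)
```

Predictions are most trustworthy inside the range of the observed x values; extending the line far beyond them is riskier.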
  • 71. Correlations • Scatter plots are useful to detect correlations between the sets of data. – Correlation is Positive when the values increase together – Correlation is Negative when one value decreases as the other increases • More on scatter plots: https://www.mathsisfun.com/data/scatter-xy-plots.html
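The sign of a correlation can also be computed rather than read off a plot. The slide does not name a coefficient; the sketch below assumes Pearson's r, which ranges from -1 to 1. The function name `pearson_r` and the sample series are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series: one tends to rise with x, the other falls as x rises.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [9, 7, 6, 4, 1]

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print("rising:", round(r_rising, 3), "falling:", round(r_falling, 3))
```

A value above 0 indicates a positive correlation and a value below 0 a negative one; the closer the value is to 1 or -1, the stronger the linear relationship.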
  • 72. Quiz: Scatter Plot • The correlation seen in the graph at the right would be best described as: 1. high positive correlation 2. low positive correlation 3. high negative correlation 4. low negative correlation
  • 73. Dot Plot • A dot plot is a graphical display of data using dots. • It is an alternative to the bar chart, in which dots are used to depict the quantitative values (e.g. counts) associated with categorical variables.
  • 74. Box Plot • Box plots are useful to highlight outliers, the median and the interquartile range. • They are also known as box-and-whisker plots.
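The quantities a box plot displays can be computed directly. The sketch below reuses the data set from slide 51 and assumes two conventions that the slide does not specify: quartiles computed with the "inclusive" method of `statistics.quantiles`, and outliers defined as values more than 1.5 times the interquartile range beyond the quartiles. The function name `box_plot_summary` is hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Compute the median, quartiles, IQR and outliers shown in a box plot.

    Assumptions: 'inclusive' quartiles and the 1.5 * IQR outlier rule.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}


# Data set from slide 51, which contains one extreme value.
data = [8, 11, 5, 9, 7, 6, 3616]
summary = box_plot_summary(data)
print(summary)
```

Unlike the range, the median and the interquartile range are barely affected by the single extreme value, which is instead flagged as an outlier.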
  • 75. The End