FOUNDATIONS OF MODELING
AND MACHINE LEARNING
Stephen Coggeshall
Retired Chief Analytics and Science Officer, ID Analytics, Lifelock
Professor USC, UCSD
17 October, 2017
What is a Model?
• A model is a representation of reality
• Could be a statistical model, “learned” from data examples
• Could be first principles equations
• Could be simulations, rule systems
What is a Statistical Model?
• A statistical model is a functional relationship, y = f(x), between a bunch of inputs and an
output, where this relationship is approximated from discrete data point examples.
• Examples:
• Will this person pay back a loan? (credit score)
• Is this healthcare claim a fraud?
• Will John buy this product if I offer it to him?
• Where will the stock market be in a month?
• A statistical model is built (trained) by “showing” it a data set of examples.
There are Many Methods for Statistical Models
• Statistics community:
  • Linear, logistic regressions
  • PCR, PLS
  • Factor analysis
  • ARIMA
  • …
  • Generally show all the data at once.
  • Generally linear.
• Machine learning community:
  • Decision trees
  • Neural nets/MLP
  • SVM
  • Nearest neighbor
  • Clustering
  • Boosted trees
  • Random forests
  • Bayesian networks
  • Deep learning
  • Convolutional neural nets
  • …
  • Generally learn incrementally, point by point.
  • Generally nonlinear.
What Does Data Look Like?
• Think of the data as a flat table: each row is a record with input values x1 x2 x3 … xn and an output y; typically many millions of rows and many hundreds of columns.
  • Think this way for data preparation, statistical analysis, normalization, standardization… Look at means, sd's, distributions, normalizations.
• Think of the data as points in the n-dimensional input space, with the output y a surface over that space.
  • Think this way for the model building process, algorithm design and selection.
Note – We first need to build these variables (x's) from the raw data.
How to Build Variables
• Need to convert the raw data into normalized numeric variables
• Numeric fields: scale carefully, log transforms…
• Categorical fields: Generally need to convert to a number
• Build combinations of fields: sums, ratios, min/max’s…
• Build expert variables
• Talk to domain experts, understand the phenomenon as well as possible
• Build special, explicit variables that best relate the raw fields to the output
• Examples:
• Credit risk scores: open-to-buy, revolve/transact (pay/bal), max # delinquencies in past 12 months…
• General fraud: velocity (# events/time window), # events grouped by address…
• Tax preparer fraud: % returns with foster children, % returns with schedule C…
• Marketing: Income/age, % purchases in a category (e.g. travel, shopping, online…)
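A minimal sketch, in Python with pandas/numpy (libraries not part of the original slides), of building a few such variables; the column names (amount, n_events_90d, age, income) are made-up placeholders:

```python
# A minimal sketch of expert-variable construction with pandas;
# the column names are hypothetical placeholders.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [120.0, 5400.0, 80.0, 9.5],
    "n_events_90d": [1, 14, 3, 0],
    "age": [34, 52, 41, 23],
    "income": [40_000, 120_000, 65_000, 18_000],
})

# Log transform to tame a heavy-tailed dollar field
df["log_amount"] = np.log1p(df["amount"])

# Velocity-style variable: events per unit time window
df["events_per_month"] = df["n_events_90d"] / 3.0

# Ratio of two raw fields (an "expert" combination)
df["income_per_age"] = df["income"] / df["age"]
```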
How to Scale Variables
• Most ML algorithms are sensitive to the differences in scale of the input values:
• Counts are typically a few, maybe less than ~10
• $ values can be thousands, millions…
• Generally need to put all variables on the same footing or scale; Critical for clustering.
• Could divide by the range: x′ = (x − μ) / (x_max − x_min), but this is very sensitive to outliers. (Bad)
• Better is z scaling: x′ = (x − μ) / σ, but be careful when calculating μ and σ. (Good)
• Another very good method is binning: (Good)
  • Bin the data into bins of equal population (the example figure shows 20 bins, each holding 5% of the data)
  • Replace the value by the bin number
  • Good to use a large number of bins, like 1000
  • Particularly good when combining multiple model outputs/scores
(figure: the distribution of xi, # records with each value of xi, cut into 20 equal-population bins)
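A minimal sketch of these scaling options in Python with numpy/pandas (library choice is an assumption, not from the slides):

```python
# A minimal sketch of range scaling, z scaling, and equal-population binning.
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 100.0])  # note the outlier

# Range scaling: sensitive to the outlier
x_range = (x - x.mean()) / (x.max() - x.min())

# z scaling: subtract the mean, divide by the standard deviation
x_z = (x - x.mean()) / x.std()

# Equal-population binning: replace each value by its bin number
# (here 3 bins for illustration; in practice use many, e.g. 1000)
bins = pd.qcut(x, q=3, labels=False, duplicates="drop")
```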
How to Handle Categorical Variables
Let's say you have a field x that can take values A, B, C…
How do you encode these categories A, B, C… into numbers to go into a modeling algorithm?
• Ordinal coding (Bad): Assign an integer to each category:
  • [A, B, C…] -> [1, 2, 3…]
  • Problem: many modeling algorithms assume there is a metric (1 is closer to 2 than to 3)
• Dummy, one hot (Bad): Expand the category list into new variables:
  • [A, B, C…] -> x1, x2, x3… with binary values. So a record with value “B” -> 0,1,0,0…
  • Problem: explodes dimensionality, which is the opposite of what we want in modeling
  • There are many dimensionality-expanding methods (contrast, Helmert, GLM, comparison…); all have this problem
• Risk tables (Good; for supervised models only): For each possible category assign a specific value, then replace the field by its assigned value.
  • No dimensionality expansion; each categorical variable becomes one continuous variable
  • Assigned value: average of the dependent variable for all records in that category
  • Good – direct encoding to what you're trying to predict
  • Bad – loss of interaction information
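A minimal sketch of the risk-table (target-mean) encoding in Python with pandas; the library and the column names are assumptions:

```python
# A minimal sketch of a risk table for a categorical field:
# each category is replaced by the average of the dependent variable.
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "B", "C", "A"],
    "y":        [  1,   0,   1,   0,   0,   1,   0,   1],
})

# Average of the dependent variable for each category (learn this on training data)
risk_table = df.groupby("category")["y"].mean()

# Replace the categorical field by its assigned value (one continuous variable)
df["category_risk"] = df["category"].map(risk_table)
```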
How to Handle Missing Values
• First, don’t forget that “Missing” can be a relevant value for a categorical variable (particularly for fraud)
• Ignore the records that have a missing field? – not a good approach. Could be many records
• Replace the missing field value with something reasonable:
• Build a model to predict the missing value, given the other values. Usually too much work; do this only if you think it's very important.
• Fill in the missing value with the average or most common value of that field over all records, or
• The average or most common value of that field over a relevant subset of records:
• Select one or more other fields that you think are important in determining the missing field
• Bin or group the selected field(s) into categories
• Replace the missing field with the average or most common value for its binned or other appropriate
group
Given data (about buildings):

Record   # stories   Zip     sq feet   value
1        2           22041   294       1032
2        1           22043   495       539
3        3           22042   3847      NaN
4        3           22041   9278      4837
5        2           22052   129       462
6        NaN         22041   948       583
7        4           22043   847       2094
8        3           22052   1029      7632
9        2           22051   947       489

Fill in the missing # stories (record 6) with:
• Average # stories across all records, or
• Average # stories for records in that zip
• …

Fill in the missing value (record 3) with:
• Average value across all records, or
• Average value for records in zip 22042, or
• Average value for all records with # stories = 3, or
• Average value for all records in zip 22042 and # stories = 3
• …
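A minimal sketch of this kind of fill-in with pandas (the library choice is an assumption), using a few of the building records above:

```python
# A minimal sketch of filling a missing field with the average over all
# records, or over a relevant subset of records (here grouped by zip).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "zip":     [22041, 22043, 22042, 22041, 22052, 22041],
    "stories": [2,     1,     3,     3,     2,     np.nan],
    "value":   [1032., 539.,  np.nan, 4837., 462.,  583.],
})

# Simplest: fill with the average of that field over all records
df["value_filled_global"] = df["value"].fillna(df["value"].mean())

# Better: fill with the average of that field within the record's zip group
df["stories_filled_by_zip"] = df.groupby("zip")["stories"].transform(
    lambda s: s.fillna(s.mean())
)
```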
The Dark Side of Modeling: Overfitting
• What if you want to fit a model to this 1-d data (a scatter of noisy y vs. x points)?
• With a sufficiently complex model you can always get a perfect fit
• But when new data comes in that your model hasn't seen before, this fit can be very bad
• A simpler model likely fits new data better
Overly-complex models don't generalize well – Overfitting
Must balance model complexity with generalization
How to Avoid Overfitting: Training/Testing Data
• Randomly separate the data into two sets: one for training and one for testing
• Build the model on the training data, then evaluate it on the testing data
(figure: a measure of model fitness, for example model error, plotted against model complexity (# free parameters, # training iterations…). The model error for the training data keeps falling as complexity grows, while the model error for the testing data reaches a minimum and then rises. Stop at that minimum; the testing error is the REAL model performance.)
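A minimal sketch of the idea in Python with scikit-learn (assumed library, synthetic data): sweep a complexity knob and watch the testing error turn up while the training error keeps falling.

```python
# A minimal sketch of training/testing separation and an overfitting check.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 2, 4, 8, 16]:          # model complexity
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    err_train = mean_squared_error(y_train, model.predict(X_train))
    err_test = mean_squared_error(y_test, model.predict(X_test))
    print(depth, round(err_train, 3), round(err_test, 3))
```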
Feature Selection Methods
• Filter – independent of any modeling method
• PCA, Pearson correlations, mutual information, univariate KS…
• Wrapper – uses a model “wrapped” around the feature selection process
• stepwise selection
• Embedded – does feature selection as the model is built
• Trees, use of a regularization/loss function (e.g., Lasso)
• Methods can be linear or nonlinear
• Some supervised, some unsupervised
Always best to work in as low a dimensionality as possible.
Try to eliminate less important features (aka dimensions, inputs, variables…)
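A minimal sketch of one filter method (mutual information) and one embedded method (Lasso) in Python with scikit-learn; the library and synthetic data are assumptions:

```python
# A minimal sketch of filter and embedded feature selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, noise=5.0, random_state=0)

# Filter: score each feature independently of any model
mi = mutual_info_regression(X, y, random_state=0)

# Embedded: Lasso's L1 penalty drives unimportant coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_ != 0)
print("features kept by Lasso:", kept)
```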
Model Measures of Goodness
• Continuous target ($ loss, profit…). Use sum of errors
• Each record has an error. Just add up the errors, or the square of the
errors (MSE)
• Marketing models typically use lift
• Lift of 3 means response is 3 times random
• Binary target (good/bad). Use FDR, KS, ROC
• KS is a robust measure of how well two distributions
are separated (goods vs bads)
(figure: histograms of goods and bads, # in each score bin, showing how well the two distributions are separated)
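A minimal sketch of computing KS between the score distributions of goods and bads, using scipy (assumed library, synthetic scores):

```python
# A minimal sketch of the KS measure for a binary target: the maximum gap
# between the cumulative score distributions of the goods and the bads.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_goods = rng.normal(0.3, 0.15, size=5000)   # model scores for the goods
scores_bads = rng.normal(0.6, 0.15, size=500)     # model scores for the bads

result = ks_2samp(scores_goods, scores_bads)       # max gap between the two CDFs
print("KS =", round(result.statistic, 3))
```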
Decision Trees, Boosted Trees
• A decision tree cuts up input space into boxes and assigns a value to each box
• A boosted tree model is a linear combination of many weak models
(figures: a decision tree model partitions the x1-x2 input space into boxes, each box assigned a value such as y = 23; a boosted decision tree model sums many weak trees, y = 1.00 + .99 + .97 + .96 + .94 …)
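A minimal sketch in Python with scikit-learn (assumed library, synthetic data) of a single decision tree next to a boosted-tree model:

```python
# A minimal sketch: a single decision tree vs. a boosted-tree model,
# which is a combination of many shallow (weak) trees.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # cuts input space into boxes
boosted = GradientBoostingClassifier(
    n_estimators=200, max_depth=3, learning_rate=0.1
).fit(X, y)                                            # combination of many weak trees
```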
Boosting
• A way of training a series of weak learners so that together they form a strong learner. Any learner can be used, but trees are the most common.
• The first boosting algorithms (1989) built the next weak learner using the misclassified records.
• Adaptive Boosting (AdaBoost, 1995) assigns a new weight to each data record for training the next weak learner in the series (it uses all records, but with weights).
• AdaBoost increases the weights on misclassified records so the next iteration can pay more attention to them. Popular AdaBoost variants: GentleBoost, LogitBoost, BrownBoost, DiscreteAB, KLBoost, RealAB…
• It's often called gradient boosting because one descends the model error in a greedy (local gradient) way.
• Stochastic gradient boosting uses a random subset of the data at each iteration.
• Gradient boosting is currently one of the most popular ML methods.
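A minimal sketch of the AdaBoost-style reweighting loop, using scikit-learn stumps on synthetic data (both assumptions); this illustrates the weight update only, not a production implementation:

```python
# A minimal sketch of the AdaBoost idea: after each weak learner, increase the
# weights of misclassified records so the next learner pays more attention to them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
w = np.full(len(y), 1.0 / len(y))            # start with uniform record weights

for _ in range(5):                            # a short series of weak learners
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = stump.predict(X) != y
    err = w[miss].sum() / w.sum()             # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(alpha * np.where(miss, 1.0, -1.0))   # up-weight misses, down-weight hits
    w /= w.sum()                              # renormalize the record weights
```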
Bootstrap and Bagging
• Bootstrapping - Build many models, each using only a sample of the records, drawn with replacement, so there is overlap between the many data samples.
• Bagging – Bootstrap Aggregation. The final model is a simple combination of all the
sampled models, typically the average over the sample models.
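A minimal sketch of bagging with scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of bagging: average many models, each trained on a
# bootstrap sample (records drawn with replacement).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

bagged = BaggingRegressor(
    DecisionTreeRegressor(),   # the base model being bagged
    n_estimators=100,          # number of bootstrap samples / models
    bootstrap=True,            # sample records with replacement
    random_state=0,
).fit(X, y)
```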
Random Forests
• Build many trees, each uses only a randomly-chosen subset of features
• Different from bootstrap, which uses a random subset of records
• Combine all the results by averaging or voting
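A minimal sketch of a random forest with scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of a random forest: many trees, with a random subset of the
# features considered for each split, and the results averaged / voted.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
).fit(X, y)
```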
Neural Net
(figure: input layer → hidden nodes → output layer)
• A typical neural net consists of an input layer, some number of hidden
layers and an output layer.
• All the independent variables (x’s) are the input layer.
• The dependent variable y is the output layer.
• Each node in the hidden layer receives weighted signals from all the
nodes in the previous layer and does a nonlinear transform on this
linear combination of signals.
• The nonlinear transform/activation function can be a sigmoid-shaped function (hyperbolic tangent, logistic), or something else.
• The weights are trained by backpropagating the error, record by record.
• Data is shown to the neural net, record by record, and the weights are
slightly adjusted for each record.
• The entire data set is passed through many times, each complete pass
is called a training epoch.
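A minimal sketch of a small neural net (MLP) with scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of a multi-layer perceptron trained by backpropagation
# over many passes (epochs) through the data.
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)    # neural nets want scaled inputs

net = MLPRegressor(
    hidden_layer_sizes=(20, 10),   # two hidden layers
    activation="tanh",             # nonlinear transform at each hidden node
    max_iter=500,                  # maximum number of training epochs
    random_state=0,
).fit(X, y)
```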
Deep Learning Net
Deep Learning is a neural net architecture with more than one hidden layer (hence “deep”). It's not really new, but it wasn't very practical until hardware and algorithms improved sufficiently.
Convolutional Neural Nets (CNNs) are deep NNs with loosely connected first layers followed by fully connected layers. They are designed to work well with images.
Support Vector Machine (SVM)
• Use a linear classifier (separator), but with the data projected to a higher dimension.
• Since all we care about is distance we can use a “kernel
trick” to construct a distance measure in an abstract higher
dimension.
• Also use the concept of “margin” to find a more robust
linear separator location.
• The separator is completely defined by the location of the
data points on the boundary, which are called the “support
vectors.”
(figures: data is separable in a higher dimension; margin optimization gives better separation)
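A minimal sketch of an SVM with an RBF kernel in scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of an SVM classifier: the RBF kernel plays the role of the
# "kernel trick", and C controls the margin / error trade-off.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("number of support vectors:", svm.support_vectors_.shape[0])
```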
Each ML Method Has an Objective Function and Learning Rules
The objective function is typically a Loss (error) term plus a Regularization term.
• The loss/error term can be:
  • MSE
  • Logistic error
• The regularization term minimizes model complexity by keeping the parameters a small. Can use a variety of regularization forms:
  • L1 norm (Lasso)
  • L2 norm
Each modeling method also has a set of training rules to adjust the adjustable parameters a, typically governed by a learning rate.
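For reference, a sketch of the standard forms these pieces usually take; the notation (parameters a_j, regularization strength λ, predicted probability p_i, learning rate η) is assumed, not taken from the slide:

```latex
% Objective = loss (error) + regularization, with adjustable parameters a_j
E(a) = \sum_i L\bigl(y_i, f(x_i; a)\bigr) + \lambda\, R(a)

% Regularization forms
R_{L1}(a) = \sum_j \lvert a_j \rvert \quad \text{(Lasso)}, \qquad
R_{L2}(a) = \sum_j a_j^2

% Loss forms
L_{\mathrm{MSE}} = \sum_i \bigl(y_i - f(x_i; a)\bigr)^2, \qquad
L_{\mathrm{logistic}} = -\sum_i \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]

% A typical training rule: gradient descent with learning rate \eta
a_j \leftarrow a_j - \eta\, \frac{\partial E}{\partial a_j}
```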
How To Choose Which Method to Use
• Do you have labeled data? supervised vs. unsupervised
• Nature of output to be predicted – categorical, binary, continuous
• Dimensionality, noise, nature of raw data/features/variables
• Try several methods
• Start simple (linear/logistic regression), increase complexity
• Always strive to minimize dimensionality. In high dimensions:
• All points become outliers (points become closer to outside boundaries than center)
• Everything looks linear (# data points required to see nonlinearity increases exponentially)
Summary
• A statistical model is a functional relationship y=f(x) found using data
• Modeling process:
• Prepare data: build expert variables, encode categoricals,
scale, missing values, feature selection
• Build models: training/testing to avoid overfitting,
measures of goodness
• There are a lot of nonlinear modeling methods. Understand them before you use them.
• Always seek simplicity
• Minimize dimensionality via feature selection. Data is always sparse in high dimensions
• Minimize complexity. Start simple
What To Do When You Don’t Have Tags?
(figure: events plotted by # address links vs. # apps in last 3 months; the anomalies/outliers are people/events with unusual behavior, possible frauds)
• How can you build a model to predict something when you don’t have examples of it? In this
situation you can build an unsupervised model
• Unsupervised modeling examines how the data points are distributed in the input space only
(there is no output/label)
• You simply look at how the data points scatter. Can you find clusters? Can you see outliers?
• Clusters: groups of people or events that are similar in nature
• Outliers: unusual, anomalous events – potential frauds
• Can work well for fraud models where you’re looking for anomalies
• Hardest part is figuring out what variables (features, dimensions) to examine
Principal Component Analysis and Regression (PCA, PCR)
• PCA finds the dominant directions in the data
• The PC’s are orthogonal, and ordered by the
variance (spread) in their direction
• The magnitudes of the PC's are their eigenvalues
• The eigenvalues are proportional to the
variance in that PC’s direction
One can look at the decaying of the eigenvalues and
select a subset of the PC’s as a subspace to work in.
One can then regress in this PC subspace. That’s PCR.
(figure: a data cloud with its principal directions PC1 and PC2)
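A minimal sketch of PCA and PCR with scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of PCA: the explained variances are the eigenvalues, and one
# can keep a leading subset of PCs and regress in that subspace (PCR).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X)            # keep the 5 dominant directions
print("eigenvalues:", np.round(pca.explained_variance_, 2))

# PCR: ordinary regression on the PC scores
Z = pca.transform(X)
pcr = LinearRegression().fit(Z, y)
```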
Modeling Method – k Nearest Neighbor (kNN)
• Put all data into a selected subspace
(variable creation and selection)
• For any new (test) record, see where it falls in this space, grow a sphere around it until it contains the “k nearest neighbors”
• Return the average value for these k
records as the predicted value
• Selecting the right features is important
• Requires no training, but can be slow for
test/go-forward use
(figure: the predicted value for a test point (e.g., ozone levels) is the average over its k nearest neighbors)
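A minimal sketch of kNN regression with scikit-learn (assumed library, synthetic data); note the scaling step, since kNN is distance-based:

```python
# A minimal sketch of k nearest neighbors: the prediction for a new point is
# the average y of its k closest training records (so scaling matters).
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=5, noise=5.0, random_state=0)
scaler = StandardScaler().fit(X)                 # put the features on the same scale

knn = KNeighborsRegressor(n_neighbors=10).fit(scaler.transform(X), y)
y_new = knn.predict(scaler.transform(X[:3]))     # average y of the 10 nearest neighbors
```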
Modeling Method – K-Means, Fuzzy C-Means, SOM
Unsupervised modeling method – finds natural groupings of data points in
independent variable (x) space
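A minimal sketch of k-means with scikit-learn (assumed library, synthetic data):

```python
# A minimal sketch of k-means clustering: find natural groupings in x space
# only (no label is used).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)
X = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each record
centers = km.cluster_centers_  # the cluster centroids
```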
Mahalanobis Distance
(figure: events plotted by X1 – distance from home vs. X2 – time since last event, with one red and one green point)
Which point is more anomalous, the red or the green?
The Mahalanobis distance takes into account the different scales and correlations. It draws equal-distance contours by scaling based on the correlations and the different standard deviations. This is also called “whitening” the data (without rotation).
The Mahalanobis distance from the center is the same as first transforming to the Principal Component space and scaling by the square roots of the eigenvalues, i.e., first PCA then z scaling.
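A minimal sketch of the Mahalanobis distance in numpy (assumed library, synthetic two-variable data):

```python
# A minimal sketch of the Mahalanobis distance: account for the different
# scales and correlations of the variables via the inverse covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=1000)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x, mu=mu, cov_inv=cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(np.array([3.0, 0.5])))   # distance of one point from the center
```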
Another Unsupervised Model Method
• Use an Autoencoder:
• This is nonlinear and thus more general than the previous
methods (PCA, Mahalanobis), which are linear
(figure: original data records → autoencoder → reproduced data records)
• Train the model to reproduce the original data
• The reproduction error is a measure of the
record’s unusualness, and is thus a fraud
score
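A minimal sketch of the autoencoder idea using a small scikit-learn MLP trained to reproduce its inputs; the library is an assumed stand-in for a dedicated deep-learning framework, and the data is synthetic:

```python
# A minimal sketch of an autoencoder-style anomaly score: train a narrow model
# to reproduce its own inputs; the reproduction error is the score.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(2000, 10)))

auto = MLPRegressor(hidden_layer_sizes=(4,),   # narrow bottleneck layer
                    activation="tanh", max_iter=1000, random_state=0).fit(X, X)

recon = auto.predict(X)
score = np.mean((X - recon) ** 2, axis=1)      # per-record reproduction error
```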
Some Fuzzy Matching Algorithms for Individual Field Values
• Levenshtein edit distance: minimum # single character edits (insertion, deletion,
substitutions)
• Damerau-Levenshtein: Levenshtein plus transpositions
• Jaro-Winkler: single adjacent character transpositions
• Hamming: % characters that are different (substitutions only)
• N grams: a sliding window of n characters. Can count how many are the same between two
fields.
• Soundex: a phonetic (“sounds like”) comparison
• Stemming, root match: determine the root of the word and compare by roots
• Field specific, ad hoc
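A minimal sketch of the Levenshtein edit distance in plain Python (the example strings are made up):

```python
# A minimal sketch of the Levenshtein edit distance (insertions, deletions,
# substitutions) via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("jonathan", "jonthan"))   # 1: one deleted character
```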
Some Fuzzy Matching Algorithms for Columns
• Linear correlation
• Divergence
• Kullback-Leibler distance (KL)
• Kolmogorov-Smirnov distance (KS)
Some Multidimensional Distance Metrics (Rows)
• Euclidean (L-2) (only useful after proper scaling)
• Mahalanobis: z scaling then Euclidean
• Minkowski (L-n)
• Manhattan: distance (L-1) traversed along axes
• Chebyshev (L-infinity): max along any axis
Glossary 1
Model – A representation of something. Models can be physical (e.g., a model airplane, a person wearing new clothes) or
mathematical. They can be dynamic (models of processes) or static (no time evolution). Mathematical models could be based
on first principles (e.g., solving differential equations), expert knowledge (e.g., a priori rule systems), or based on data (statistical
models). Statistical models are the ones built in the discipline called data science, machine learning, etc.
Input, x, variable, dimension, feature, independent variables – The numbers that are the predictors or inputs to the functional
form of the model y = f(x).
Output, tag, label, y, target, dependent variable – The quantity or category that you’re trying to predict with a model. It could be a
continuous value (regression) or a categorical value (classification). Many times for classification the target/output is binary:
yes/no, 0/1, good/bad.
Supervised modeling – Frequently in modeling one has an identified output that one is trying to fit/predict for each record. The
output or dependent variable is the target of the model, and we try to learn the functional relationship between the inputs and
this output. This is called supervised learning since the learning algorithm is supervised during training by constantly looking at
the error between the predicted and actual outputs.
Unsupervised modeling – Here we don’t have an output or dependent variable, all we are given is a set of independent
variables. There are several things we can do with independent variables only, for example we can
• Look for interrelationships between the variables, either linear (e.g. PCA) or nonlinear
• Look for macro structure in the given data in input/independent variable space alone. Methods include PCA and
clustering.
• Look for outliers or anomalies in the records, looking only at the inputs/independent variables. This is frequently a task
in building fraud models in situations when you have no labeled data, that is, no records that have already been
determined to be fraud. Generally you look for what is normal in the data and then look for outliers to this typical data.
Glossary 2
Data science, machine learning, statistical modeling, data mining, predictive analytics, data analytics – Roughly the same
meaning but many people ascribe some subtle differences between these terms. They all relate to the process of building
statistical models from sets of data.
Variable creation, feature engineering, variable encoding – the process of constructing thoughtful inputs to a model. Typically
one examines the fields in the raw data, does analysis, cleaning, standardization, and then transformations and combinations
of these fields to create special variables that are candidate inputs to models. Examples of this process are encoding of
categorical variables, risk tables, z-scaling, other normalizations and outlier suppression, nonlinear transformations such as
taking the log or binning, construction of ratios or products of fields.
Feature selection, variable reduction, dimensionality reduction – The process of reducing the number of inputs to a model by
considering which inputs are the most important to the model. Most modeling methods degrade when presented with more
inputs than is needed for a robust, stable model.
Model validation – Typically one separates the data into multiple sets to ensure that the model is robust. In good modeling
practices one reserves a set of data that is never used during training and testing that is used for evaluation of the model on
data that it has never seen before. Frequently we reserve this holdout sample from a time that was not used during training
and testing, and that is called out of time validation.
Boosting – an iterative procedure to create a series of weak models, where the final model is then a linear combination of this
series of weak models. Generally the next model in the series is trained on a weighted data set where the records with the
largest error so far are more heavily weighted.
Bagging – the term comes from “bootstrap aggregation.” Technique to improve model stability and accuracy. It
combines/aggregates the outcomes of many models, each having been built via bootstrap sampling.
Glossary 3
Cross validation – The process of splitting data into many separate bunches. We train on all but one of the bunches, test on the
remaining bunch, and then continue by designating a different bunch as the testing data set and training on all the others. We get an
ensemble of model performance measures which we can average. Cross validation uses all records once per iteration whereas
bootstrap will use records a random # of times.
Bootstrap – A method of selecting training data sets that are a subset of the data through random sampling of the records many times,
always with replacement. Provides estimates for the variance of standard statistical measures that would otherwise be point estimates
without bootstrap.
SOM and K Means - The difference is that in SOM the distance of each input from all of the reference vectors (not just the closest one) is taken into account, weighted by the neighborhood kernel h. Thus the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero.
Random Forests – Using an ensemble of trees and averaging the result across all the trees. Each tree is built from a random subset of features from the entire feature set, and bagging (bootstrap sampling of the records) is typically used for each tree as well.
Reject inference – The process of inferring the labels on records that were rejected at some point in the process, where the actual
outcome isn’t known. This is done so that the model is hardened, or made more robust, to this larger population that it will see when
implemented (the entire TTD, Through The Door, population).
Goods, Bads – When we are doing a 2-class classification problem (as opposed to a regression where the output is continuous) we
frequently refer to one class as Goods and the other as Bads. For fraud problems the Bads are the records labeled as fraud and the
Goods are the records labeled as non fraud.
False Positives – (FP) Goods that have been incorrectly classified as Bads.
True Positives – (TP) Bads correctly classified as Bad.
True Negatives – (TN) Goods correctly classified as Good.
False Negative – (FN) Bads that have been incorrectly classified as Goods.
False Positive Rate is FPR = #FP / #examined = #FP / (#FP + #TP).
False Positive Ratio = #FP / #TP.
Contenu connexe

Tendances

Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodHonglin Yu
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideSalford Systems
 
L9 using datawarrior for scientific data visualization
L9 using datawarrior for scientific data visualizationL9 using datawarrior for scientific data visualization
L9 using datawarrior for scientific data visualizationSeppo Karrila
 
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forestsR User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forestsWei Zhong Toh
 
Introduction to spss 1
Introduction to spss 1Introduction to spss 1
Introduction to spss 1Michael Taiwo
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitizationVenkata Reddy Konasani
 
Introduction to business statistics
Introduction to business statisticsIntroduction to business statistics
Introduction to business statisticsAakash Kulkarni
 
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetSalford Systems
 
Business Statistics
Business StatisticsBusiness Statistics
Business StatisticsTim Walters
 
CABT SHS Statistics & Probability - Estimation of Parameters (intro)
CABT SHS Statistics & Probability -  Estimation of Parameters (intro)CABT SHS Statistics & Probability -  Estimation of Parameters (intro)
CABT SHS Statistics & Probability - Estimation of Parameters (intro)Gilbert Joseph Abueg
 
Fundamentals of data analysis
Fundamentals of data analysisFundamentals of data analysis
Fundamentals of data analysisShameem Ali
 
Fairly Measuring Fairness In Machine Learning
Fairly Measuring Fairness In Machine LearningFairly Measuring Fairness In Machine Learning
Fairly Measuring Fairness In Machine LearningHJ van Veen
 

Tendances (20)

Borderline Smote
Borderline SmoteBorderline Smote
Borderline Smote
 
Decision tree
Decision treeDecision tree
Decision tree
 
What is Machine Learning
What is Machine LearningWhat is Machine Learning
What is Machine Learning
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
 
L9 using datawarrior for scientific data visualization
L9 using datawarrior for scientific data visualizationL9 using datawarrior for scientific data visualization
L9 using datawarrior for scientific data visualization
 
Random forest
Random forestRandom forest
Random forest
 
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forestsR User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
 
Introduction to spss 1
Introduction to spss 1Introduction to spss 1
Introduction to spss 1
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Introduction to business statistics
Introduction to business statisticsIntroduction to business statistics
Introduction to business statistics
 
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example Dataset
 
Business statistics
Business statisticsBusiness statistics
Business statistics
 
Business Statistics
Business StatisticsBusiness Statistics
Business Statistics
 
Explore ml day 2
Explore ml day 2Explore ml day 2
Explore ml day 2
 
CABT SHS Statistics & Probability - Estimation of Parameters (intro)
CABT SHS Statistics & Probability -  Estimation of Parameters (intro)CABT SHS Statistics & Probability -  Estimation of Parameters (intro)
CABT SHS Statistics & Probability - Estimation of Parameters (intro)
 
Fundamentals of data analysis
Fundamentals of data analysisFundamentals of data analysis
Fundamentals of data analysis
 
Fairly Measuring Fairness In Machine Learning
Fairly Measuring Fairness In Machine LearningFairly Measuring Fairness In Machine Learning
Fairly Measuring Fairness In Machine Learning
 

En vedette

Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
 
QGIS第三講—地圖展示與匯出
QGIS第三講—地圖展示與匯出QGIS第三講—地圖展示與匯出
QGIS第三講—地圖展示與匯出Chengtao Lin
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
AI for Good Global Summit - 2017 Report
AI for Good Global Summit - 2017 ReportAI for Good Global Summit - 2017 Report
AI for Good Global Summit - 2017 ReportITU
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017NVIDIA
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017NVIDIA
 

En vedette (7)

Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
QGIS第三講—地圖展示與匯出
QGIS第三講—地圖展示與匯出QGIS第三講—地圖展示與匯出
QGIS第三講—地圖展示與匯出
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
AI for Good Global Summit - 2017 Report
AI for Good Global Summit - 2017 ReportAI for Good Global Summit - 2017 Report
AI for Good Global Summit - 2017 Report
 
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
Revolutionizing Radiology with Deep Learning: The Road to RSNA 2017
 
AWS AI Solutions
AWS AI SolutionsAWS AI Solutions
AWS AI Solutions
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017
 

Similaire à Foundations of Machine Learning - StampedeCon AI Summit 2017

DutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical PerspectiveDutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical PerspectiveBigML, Inc
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balanceAlex Henderson
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Maninda Edirisooriya
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyAlon Bochman, CFA
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programmingSoumya Mukherjee
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsMark Peng
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningSanghamitra Deb
 
Machine Learning in Production
Machine Learning in ProductionMachine Learning in Production
Machine Learning in ProductionBen Freundorfer
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data PreparationUmair Shafique
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Dhilsath Fathima
 

Similaire à Foundations of Machine Learning - StampedeCon AI Summit 2017 (20)

Mini datathon
Mini datathonMini datathon
Mini datathon
 
DutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical PerspectiveDutchMLSchool. ML: A Technical Perspective
DutchMLSchool. ML: A Technical Perspective
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balance
 
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
Lecture 9 - Decision Trees and Ensemble Methods, a lecture in subject module ...
 
Kaggle Gold Medal Case Study
Kaggle Gold Medal Case StudyKaggle Gold Medal Case Study
Kaggle Gold Medal Case Study
 
random forest.pptx
random forest.pptxrandom forest.pptx
random forest.pptx
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
Classification.pptx
Classification.pptxClassification.pptx
Classification.pptx
 
PCA.pptx
PCA.pptxPCA.pptx
PCA.pptx
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
 
General Tips for participating Kaggle Competitions
General Tips for participating Kaggle CompetitionsGeneral Tips for participating Kaggle Competitions
General Tips for participating Kaggle Competitions
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Machine Learning in Production
Machine Learning in ProductionMachine Learning in Production
Machine Learning in Production
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Intro to ml_2021
Intro to ml_2021Intro to ml_2021
Intro to ml_2021
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Machine Learning (Decisoion Trees)
Machine Learning (Decisoion Trees)Machine Learning (Decisoion Trees)
Machine Learning (Decisoion Trees)
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16Data extraction, cleanup & transformation tools 29.1.16
Data extraction, cleanup & transformation tools 29.1.16
 

Plus de StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016StampedeCon
 
Visualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsVisualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsStampedeCon
 

Plus de StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
 
Visualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsVisualizing Big Data – The Fundamentals
Visualizing Big Data – The Fundamentals
 

Dernier

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Foundations of Machine Learning - StampedeCon AI Summit 2017

  • 8. How to Handle Categorical Variables
  Let’s say you have a field x that can take values A, B, C… How do you encode these categories into numbers to go into a modeling algorithm?
  • Ordinal coding (Bad): assign an integer to each category: [A, B, C…] -> [1, 2, 3…]
    • Problem: many modeling algorithms assume there is a metric (1 is closer to 2 than to 3)
  • Dummy / one-hot coding (Bad): expand the category list into new variables: [A, B, C…] -> x1, x2, x3… with binary values, so a record with value “B” -> 0, 1, 0, 0…
    • Problem: explodes dimensionality, which is the opposite of what we want in modeling
    • There are many dimensionality-expanding methods (contrast, Helmert, GLM, comparison…); all have this problem
  • Risk tables (Good; for supervised models only): for each possible category assign a specific value, and replace the field by its assigned value
    • No dimensionality expansion: each categorical variable becomes one continuous variable
    • Assigned value: the average of the dependent variable over all records in that category
    • Good: direct encoding to what you’re trying to predict
    • Bad: loss of interaction information
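To make the risk-table idea concrete, here is a minimal sketch in Python with pandas (my own illustration, not from the slides); the field name and the toy values are made up. Each category is replaced by the average of the dependent variable over the records in that category, with a one-hot expansion shown for contrast.

```python
import pandas as pd

# Toy data: a categorical field x and a binary target y (hypothetical values)
df = pd.DataFrame({"x": ["A", "B", "A", "C", "B", "A"],
                   "y": [1, 0, 1, 0, 0, 1]})

# Risk table: average of the dependent variable for each category
risk_table = df.groupby("x")["y"].mean()        # e.g. A -> 1.0, B -> 0.0, C -> 0.0
df["x_risk"] = df["x"].map(risk_table)          # replace the category by its risk value

# One-hot (dummy) encoding for comparison; note the dimensionality expansion
df_onehot = pd.get_dummies(df["x"], prefix="x")

print(df[["x", "x_risk"]])
print(df_onehot)
```

In practice the risk table should normally be built on training data only (often with smoothing for rare categories), so the target is not leaked into the encoding.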
  • 9. How to Handle Missing Values
  • First, don’t forget that “Missing” can itself be a relevant value for a categorical variable (particularly for fraud)
  • Ignore the records that have a missing field? Not a good approach; that could be many records
  • Replace the missing field value with something reasonable:
    • Build a model to predict the missing value, given the other values. Usually too much work; do this only if you think it’s very important
    • Fill in the missing value with the average or most common value of that field over all records, or
    • The average or most common value of that field over a relevant subset of records:
      • Select one or more other fields that you think are important in determining the missing field
      • Bin or group the selected field(s) into categories
      • Replace the missing field with the average or most common value for its bin or group
  • Given data (about buildings):

    Record  # stories  Zip    sq feet  value
    1       2          22041  294      1032
    2       1          22043  495      539
    3       3          22042  3847     NaN
    4       3          22041  9278     4837
    5       2          22052  129      462
    6       NaN        22041  948      583
    7       4          22043  847      2094
    8       3          22052  1029     7632
    9       2          22051  489      947

  • Fill in the missing # stories (record 6) with the average # stories across all records, or the average # stories for records in that zip, …
  • Fill in the missing value (record 3) with the average value across all records, or the average value for records in zip 22042, or the average value for all records with # stories = 3, or the average value for all records in zip 22042 and # stories = 3, …
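A short sketch of these fill-in strategies, assuming Python with pandas; the columns mirror the buildings example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stories": [2, 1, 3, 3, 2, np.nan, 4, 3, 2],
    "zip":     ["22041", "22043", "22042", "22041", "22052",
                "22041", "22043", "22052", "22051"],
    "value":   [1032, 539, np.nan, 4837, 462, 583, 2094, 7632, 489],
})

# Simplest: fill with the overall mean of that field across all records
df["stories_filled"] = df["stories"].fillna(df["stories"].mean())

# Better: fill with the mean over a relevant subset, e.g. records in the same zip
zip_mean = df.groupby("zip")["value"].transform("mean")
df["value_filled_by_zip"] = df["value"].fillna(zip_mean)

print(df)
```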
  • 10. The Dark Side of Modeling: Overfitting
  • What if you want to fit a model to this 1-d data (y vs. x)?
  • With a sufficiently complex model you can always get a perfect fit
  • But when new data comes in that your model hasn’t seen before, this fit can be very bad
  • A simpler model likely fits new data better
  • Overly-complex models don’t generalize well – overfitting
  • Must balance model complexity with generalization
  • 11. How to Avoid Overfitting: Training/Testing Data
  • Randomly separate the data into two sets: one for training and one for testing
  • Build the model on the training data, then evaluate it on the testing data
  (Figure: a measure of model fitness, for example model error, plotted against model complexity (# free parameters, # training iterations…). The error on the training data keeps decreasing, while the error on the testing data reaches a minimum and then rises. Stop at that minimum; the testing error there is the REAL model performance.)
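A minimal sketch of this train/test discipline, assuming Python with numpy and scikit-learn, using polynomial degree as a stand-in for model complexity; the data are synthetic. Typically the training error keeps falling with degree while the testing error bottoms out and then worsens.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)   # noisy 1-d data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Model complexity = polynomial degree; compare training vs. testing error
for degree in [1, 3, 6, 9, 15]:
    coefs = np.polyfit(x_train, y_train, degree)              # fit on training data only
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```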
  • 12. Feature Selection Methods
  Always best to work in as low a dimensionality as possible. Try to eliminate less important features (aka dimensions, inputs, variables…).
  • Filter – independent of any modeling method
    • PCA, Pearson correlations, mutual information, univariate KS…
  • Wrapper – uses a model “wrapped” around the feature selection process
    • Stepwise selection
  • Embedded – does feature selection as the model is built
    • Trees, use of a regularization/loss function (e.g., Lasso)
  • Methods can be linear or nonlinear
  • Some are supervised, some unsupervised
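The sketch below (assuming scikit-learn, on synthetic data) shows one filter method, mutual information with the target, and one embedded method, an L1-regularized (Lasso-style) logistic regression that drives unimportant coefficients to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

# Filter method: rank features by mutual information with the target
mi = mutual_info_classif(X, y, random_state=0)
print("top features by mutual information:", np.argsort(mi)[::-1][:5])

# Embedded method: the L1 penalty zeroes out coefficients of unimportant features
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso_like.coef_[0])
print("features kept by L1-regularized logistic regression:", kept)
```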
  • 13. Model Measures of Goodness
  • Continuous target ($ loss, profit…): use a sum of errors
    • Each record has an error; just add up the errors, or the squares of the errors (MSE)
  • Marketing models typically use lift
    • A lift of 3 means the response is 3 times random
  • Binary target (good/bad): use FDR, KS, ROC
    • KS is a robust measure of how well two distributions (goods vs. bads) are separated
  (Figure: the score distributions of goods and bads, # of records in each score bin.)
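A small sketch of the KS and lift ideas, assuming Python with numpy and scipy; the score distributions are synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical model scores: bads tend to score higher than goods
scores_goods = rng.normal(0.3, 0.15, 10_000)
scores_bads  = rng.normal(0.6, 0.15, 1_000)

# KS = maximum gap between the two cumulative score distributions
ks_stat, p_value = ks_2samp(scores_goods, scores_bads)
print(f"KS = {ks_stat:.3f}")

# Lift in the top-scoring 10%: bad rate in that bin divided by the overall bad rate
scores = np.concatenate([scores_goods, scores_bads])
labels = np.concatenate([np.zeros_like(scores_goods), np.ones_like(scores_bads)])
cutoff = np.quantile(scores, 0.9)
lift = labels[scores >= cutoff].mean() / labels.mean()
print(f"lift in top decile = {lift:.1f}x random")
```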
  • 14. Decision Trees, Boosted Trees
  • A decision tree cuts up the input space into boxes and assigns a value to each box
  • A boosted tree model is a linear combination of many weak models
  (Figures: a decision tree model partitions the x1 vs. x2 plane into boxes, each with its own predicted value, e.g. y = 23; a boosted decision tree model adds up many small trees with decreasing weights, e.g. y = 1.00·(tree 1) + .99·(tree 2) + .97·(tree 3) + .96·(tree 4) + .94·(tree 5) + …)
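A quick comparison, assuming scikit-learn and synthetic data: a single decision tree versus a boosted combination of many shallow trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single decision tree: cuts the input space into boxes
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# A boosted tree model: a weighted sum of many shallow ("weak") trees
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1, random_state=0).fit(X_tr, y_tr)

print("single tree test accuracy  :", tree.score(X_te, y_te))
print("boosted trees test accuracy:", gbm.score(X_te, y_te))
```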
  • 15. Boosting
  • A way of training a series of weak learners so that the result is a strong learner. Any learner can be used, but trees are the most common.
  • The first boosting algorithms (1989) built the next weak learner using the misclassified records.
  • Adaptive Boosting (AdaBoost, 1995) assigns a new weight to each data record for training the next weak learner in the series (it uses all records, but with weights).
  • AdaBoost increases the weights on misclassified records so the next iteration can pay more attention to them. Popular AdaBoost algorithms: GentleBoost, LogitBoost, BrownBoost, DiscreteAB, KLBoost, RealAB…
  • It’s often called gradient boosting because one descends the model error in a greedy (local gradient) way.
  • Stochastic gradient boosting uses a random subset of the data at each iteration.
  • Gradient boosting is currently one of the most popular ML methods.
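The reweighting idea can be sketched in a few lines. This is a simplified discrete-AdaBoost illustration with depth-1 trees (stumps) as the weak learners, not a reference implementation; the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)                    # labels in {-1, +1}

n = len(y)
w = np.full(n, 1.0 / n)                        # start with equal record weights
stumps, alphas = [], []

for _ in range(50):                            # 50 boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # its weight in the final model
    w *= np.exp(-alpha * y * pred)                     # up-weight misclassified records
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final strong model: the sign of the weighted vote of all the weak learners
final = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy of boosted stumps:", np.mean(final == y))
```

The three key lines are the weighted error, the learner weight alpha, and the exponential up-weighting of the records the current learner got wrong.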
  • 16. Bootstrap and Bagging
  • Bootstrapping – build many models, each using only a sample of the records, drawn with replacement, so there is overlap between the many data samples.
  • Bagging – Bootstrap Aggregation. The final model is a simple combination of all the sampled models, typically the average over the sample models.
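A minimal bagging sketch, assuming scikit-learn trees as the base models; each model sees a bootstrap sample and the final prediction is a simple vote/average.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

models = []
for _ in range(25):
    # Bootstrap sample: draw n records WITH replacement, so the samples overlap
    idx = rng.integers(0, n, size=n)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Bagging = bootstrap aggregation: average (here, majority-vote) the sampled models
votes = np.mean([m.predict(X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)
print("bagged training accuracy:", np.mean(bagged_pred == y))
```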
  • 17. Random Forests
  • Build many trees, each using only a randomly-chosen subset of features
  • Different from bootstrap, which uses a random subset of records
  • Combine all the results by averaging or voting
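In practice one rarely codes this by hand; a short sketch with scikit-learn's RandomForestClassifier, whose flavor of the idea bootstraps records and randomizes the features considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is grown on a bootstrap sample of records; max_features limits the
# features considered at each split, which decorrelates the trees
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X_tr, y_tr)
print("random forest test accuracy:", rf.score(X_te, y_te))
```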
  • 18. Neural Net
  (Figure: a network with an input layer, hidden nodes and an output layer.)
  • A typical neural net consists of an input layer, some number of hidden layers and an output layer.
  • All the independent variables (x’s) are the input layer.
  • The dependent variable y is the output layer.
  • Each node in a hidden layer receives weighted signals from all the nodes in the previous layer and applies a nonlinear transform to this linear combination of signals.
  • The nonlinear transform/activation function can be a logistic-type function (hyperbolic tangent, sigmoid), or something else.
  • The weights are trained by backpropagating the error, record by record.
  • Data is shown to the neural net record by record, and the weights are slightly adjusted for each record.
  • The entire data set is passed through many times; each complete pass is called a training epoch.
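A bare-bones numpy sketch (my own, not from the slides) of one hidden layer trained record by record over several epochs; the layer sizes, learning rate and toy data are arbitrary. It is only meant to show the forward pass and the backpropagated weight updates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                         # toy inputs
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(float)   # toy binary target

n_in, n_hid = 3, 8
W1 = rng.normal(0, 0.5, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, n_hid);         b2 = 0.0
lr = 0.05

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(20):                               # each full pass = one training epoch
    for x, t in zip(X, y):                            # weights nudged record by record
        h = np.tanh(W1 @ x + b1)                      # hidden layer: nonlinear transform
        out = sigmoid(W2 @ h + b2)                    # output layer
        # Backpropagate the squared-error gradient
        d_out = (out - t) * out * (1 - out)
        d_hid = W2 * d_out * (1 - h ** 2)
        W2 -= lr * d_out * h;          b2 -= lr * d_out
        W1 -= lr * np.outer(d_hid, x); b1 -= lr * d_hid

pred = sigmoid(np.tanh(X @ W1.T + b1) @ W2 + b2) > 0.5
print("training accuracy:", np.mean(pred == y))
```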
  • 19. Deep Learning Net
  • Deep Learning is a neural net architecture with more than one hidden layer (hence “deep”). It’s not really new, but it wasn’t very practical until hardware and algorithms improved sufficiently.
  • Convolutional Neural Nets (CNNs) are deep NNs with loosely connected first layers followed by fully connected layers. They are designed to work well with images.
  • 20. Support Vector Machine (SVM)
  • Use a linear classifier as the separator, but with the data projected to a higher dimension.
  • Since all we care about is distance, we can use a “kernel trick” to construct a distance measure in an abstract higher dimension.
  • Also use the concept of “margin” to find a more robust location for the linear separator.
  • The separator is completely defined by the locations of the data points on the boundary, which are called the “support vectors.”
  (Figure captions: data is separable in a higher dimension; margin optimization gives better separation.)
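A short scikit-learn sketch on synthetic data that is not linearly separable in its original 2-d space; the RBF kernel does the implicit projection.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2-d space
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# RBF kernel = the "kernel trick": a max-margin linear separator in an implicit
# higher-dimensional space, defined entirely by the support vectors
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("test accuracy           :", svm.score(X_te, y_te))
print("number of support vectors:", svm.support_vectors_.shape[0])
```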
  • 21. Each ML Method Has an Objective Function and Learning Rules
  • The objective function is typically a Loss (error) term plus a Regularization term: J(a) = Loss + λ·R(a), where a are the adjustable parameters
  • The loss/error term can be, for example,
    • MSE: Loss = Σi (yi − f(xi; a))²
    • Logistic error: Loss = −Σi [ yi log ŷi + (1 − yi) log(1 − ŷi) ]
  • The regularization term minimizes model complexity by keeping the parameters a small. A variety of regularization forms can be used:
    • L1 norm (Lasso): R(a) = Σj |aj|
    • L2 norm: R(a) = Σj aj²
  • Each modeling method also has a set of training rules to adjust the parameters a, typically a gradient step with a learning rate η: aj ← aj − η ∂J/∂aj
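As a worked instance of this structure, here is a numpy sketch (mine, not the slide's) of gradient descent on an MSE loss plus an L2 penalty (ridge-style regression); λ, η and the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
true_a = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ true_a + rng.normal(0, 0.1, 500)

a = np.zeros(5)          # adjustable parameters
lam, eta = 0.1, 0.01     # regularization strength, learning rate

for _ in range(2000):
    resid = X @ a - y
    # Objective J(a) = mean(resid^2) + lam * sum(a^2)  (MSE loss + L2 regularization)
    grad = 2 * X.T @ resid / len(y) + 2 * lam * a
    a -= eta * grad      # training rule: step down the gradient of J

print("learned parameters:", np.round(a, 2))
```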
  • 22. How to Choose Which Method to Use
  • Do you have labeled data? Supervised vs. unsupervised
  • Nature of the output to be predicted – categorical, binary, continuous
  • Dimensionality, noise, nature of the raw data/features/variables
  • Try several methods
    • Start simple (linear/logistic regression), then increase complexity
  • Always strive to minimize dimensionality. In high dimensions:
    • All points become outliers (points become closer to the outside boundaries than to the center)
    • Everything looks linear (the # of data points required to see nonlinearity increases exponentially)
  • 23. Summary
  • A statistical model is a functional relationship y = f(x) found using data
  • Modeling process:
    • Prepare data: build expert variables, encode categoricals, scale, handle missing values, feature selection
    • Build models: training/testing to avoid overfitting, measures of goodness
  • There are a lot of nonlinear modeling methods. Understand them before you use them.
  • Always seek simplicity
    • Minimize dimensionality via feature selection. Data is always sparse in high dimensions
    • Minimize complexity. Start simple
  (Figures recap: the training vs. testing error curves, where the stopping point gives the REAL model performance, and a boosted decision tree model as a weighted sum of small trees.)
  • 25. What to Do When You Don’t Have Tags?
  • How can you build a model to predict something when you don’t have examples of it? In this situation you can build an unsupervised model
  • Unsupervised modeling examines how the data points are distributed in the input space only (there is no output/label)
  • You simply look at how the data points scatter. Can you find clusters? Can you see outliers?
    • Clusters: groups of people or events that are similar in nature
    • Outliers: unusual, anomalous events – potential frauds
  • Can work well for fraud models, where you’re looking for anomalies
  • The hardest part is figuring out what variables (features, dimensions) to examine
  (Figure: a scatter of events by # address links vs. # apps in the last 3 months; the anomalies/outliers are people/events with unusual behavior – possible frauds.)
  • 26. Principal Component Analysis and Regression (PCA, PCR)
  • PCA finds the dominant directions in the data (PC1, PC2, …)
  • The PCs are orthogonal, and ordered by the variance (spread) in their direction
  • Each PC has an associated eigenvalue, which gives the magnitude of that component
  • The eigenvalues are proportional to the variance in that PC’s direction
  • One can look at the decay of the eigenvalues and select a subset of the PCs as a subspace to work in. One can then regress in this PC subspace; that’s PCR.
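A compact PCR sketch, assuming scikit-learn and synthetic correlated data: inspect the eigenvalue decay, keep the leading PCs, then regress in that subspace.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(1000, 5))   # correlated, redundant features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 1000)

# PCA on standardized data; the explained variances are the eigenvalues
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
print("eigenvalue decay:", np.round(pca.explained_variance_, 2))

# PCR: keep the leading PCs as a subspace, then regress in that subspace
n_keep = 5
Z = pca.transform(X_std)[:, :n_keep]
pcr = LinearRegression().fit(Z, y)
print("R^2 of regression in the PC subspace:", round(pcr.score(Z, y), 3))
```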
  • 27. Modeling Method – k Nearest Neighbor (kNN)
  • Put all the data into a selected subspace (variable creation and selection)
  • For any new (test) record, see where it falls in this space and grow a sphere around it until it contains the “k nearest neighbors”
  • Return the average value of these k records as the predicted value
  • Selecting the right features is important
  • Requires no training, but can be slow for test/go-forward use
  (Figure: ozone-levels example – the predicted value for a test point is the average over its k nearest neighbors.)
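A minimal kNN regression sketch, assuming scikit-learn and synthetic data; note the scaling step, since kNN is distance based.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 1000)

# kNN is distance based, so the features must be on a comparable scale
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "Training" is just storing the data; prediction averages the k nearest records
knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)
print("test R^2:", round(knn.score(X_te, y_te), 3))
```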
  • 28. Modeling Method – K-Means, Fuzzy C-Means, SOM
  • Unsupervised modeling methods – they find natural groupings of data points in independent-variable (x) space
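A k-means sketch, assuming scikit-learn; the three synthetic groups are made up and no labels are used anywhere.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic groups of points in x-space (no labels used anywhere)
X = np.vstack([rng.normal(loc, 0.5, size=(200, 2)) for loc in ([0, 0], [4, 0], [2, 3])])

X = StandardScaler().fit_transform(X)          # scaling is critical for clustering
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print("cluster centers:\n", np.round(km.cluster_centers_, 2))
```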
  • 29. Mahalanobis Distance
  (Figure: a scatter of events with x1 = distance from home and x2 = time since last event. Which point is more anomalous, the red or the green?)
  • The Mahalanobis distance takes into account the different scales and the correlations between the variables.
  • It draws equal-distance contours by scaling based on the correlations and the different standard deviations. This is also called “whitening” the data (without rotation).
  • The Mahalanobis distance from the center is the same as first transforming to the principal component space and then z scaling each component (dividing by the square root of its eigenvalue), i.e., first PCA, then z scaling.
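A numpy sketch of the Mahalanobis distance, d(x) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)), on correlated toy data; the two test points are equally far from the center in Euclidean terms but not in Mahalanobis terms.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-d toy data: x1 = distance from home, x2 = time since last event
x1 = rng.normal(10, 5, 2000)
x2 = 0.8 * x1 + rng.normal(0, 2, 2000)
X = np.column_stack([x1, x2])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(point):
    d = point - mu
    return float(np.sqrt(d @ cov_inv @ d))   # sqrt((x - mu)^T Sigma^-1 (x - mu))

# Same Euclidean offset length, very different anomalousness
print("along the correlation direction  :", round(mahalanobis(mu + [5, 4]), 2))
print("against the correlation direction:", round(mahalanobis(mu + [5, -4]), 2))
```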
  • 30. Another Unsupervised Model Method • Use an Autoencoder: • This is nonlinear and thus more general than the previous methods (PCA, Mahalanobis), which are linear 30 Original data records Reproduced data records • Train the model to reproduce the original data • The reproduction error is a measure of the record’s unusualness, and is thus a fraud score 10/24/2017
  • 31. Some Fuzzy Matching Algorithms for Individual Field Values
  • Levenshtein edit distance: minimum # of single-character edits (insertions, deletions, substitutions)
  • Damerau-Levenshtein: Levenshtein plus transpositions
  • Jaro-Winkler: single adjacent-character transpositions
  • Hamming: % of characters that are different (substitutions only)
  • N-grams: a sliding window of n characters; count how many are the same between two fields
  • Soundex: a phonetic (“sounds like”) comparison
  • Stemming / root match: determine the root of the word and compare by roots
  • Field-specific, ad hoc methods
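A plain-Python sketch of the Levenshtein edit distance via the classic dynamic program; the example strings are made up.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # Classic dynamic-programming table, kept one row at a time
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
            # (Damerau-Levenshtein would add a transposition case here)
        prev = curr
    return prev[-1]

print(levenshtein("JONATHAN", "JOHNATHAN"))  # 1: one insertion
print(levenshtein("SMITH", "SMYTHE"))        # 2: one substitution + one insertion
```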
  • 32. Some Fuzzy Matching Algorithms for Columns
  • Linear correlation
  • Divergence
  • Kullback-Leibler distance (KL)
  • Kolmogorov-Smirnov distance (KS)
  • 33. Some Multidimensional Distance Metrics (Rows)
  • Euclidean (L-2): only useful after proper scaling
  • Mahalanobis: decorrelate and z scale (PCA then z scaling), then Euclidean
  • Minkowski (L-n)
  • Manhattan (L-1): the distance traversed along the axes
  • Chebyshev (L-infinity): the max along any axis
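These metrics are one-liners in numpy; a small sketch on an arbitrary pair of rows.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 7.0])
d = a - b                                                 # componentwise differences

print("Euclidean (L-2)  :", np.sqrt(np.sum(d ** 2)))      # sqrt(1 + 4 + 16)
print("Manhattan (L-1)  :", np.sum(np.abs(d)))            # 1 + 2 + 4 = 7
print("Chebyshev (L-inf):", np.max(np.abs(d)))            # largest component = 4
print("Minkowski (L-3)  :", np.sum(np.abs(d) ** 3) ** (1 / 3))
```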
  • 34. Glossary 1
  Model – A representation of something. Models can be physical (e.g., a model airplane, a person wearing new clothes) or mathematical. They can be dynamic (models of processes) or static (no time evolution). Mathematical models can be based on first principles (e.g., solving differential equations), on expert knowledge (e.g., a priori rule systems), or on data (statistical models). Statistical models are the ones built in the discipline called data science, machine learning, etc.
  Input, x, variable, dimension, feature, independent variables – The numbers that are the predictors or inputs to the functional form of the model y = f(x).
  Output, tag, label, y, target, dependent variable – The quantity or category that you’re trying to predict with a model. It could be a continuous value (regression) or a categorical value (classification). Often for classification the target/output is binary: yes/no, 0/1, good/bad.
  Supervised modeling – Frequently in modeling one has an identified output that one is trying to fit/predict for each record. The output or dependent variable is the target of the model, and we try to learn the functional relationship between the inputs and this output. This is called supervised learning because the learning algorithm is supervised during training by constantly looking at the error between the predicted and actual outputs.
  Unsupervised modeling – Here we don’t have an output or dependent variable; all we are given is a set of independent variables. There are several things we can do with independent variables only, for example:
  • Look for interrelationships between the variables, either linear (e.g., PCA) or nonlinear
  • Look for macro structure in the given data in input/independent-variable space alone. Methods include PCA and clustering.
  • Look for outliers or anomalies in the records, looking only at the inputs/independent variables. This is frequently a task in building fraud models when you have no labeled data, that is, no records that have already been determined to be fraud. Generally you look for what is normal in the data and then look for outliers relative to that typical data.
  • 35. Glossary 2
  Data science, machine learning, statistical modeling, data mining, predictive analytics, data analytics – Roughly the same meaning, though many people ascribe subtle differences to these terms. They all relate to the process of building statistical models from sets of data.
  Variable creation, feature engineering, variable encoding – The process of constructing thoughtful inputs to a model. Typically one examines the fields in the raw data; does analysis, cleaning and standardization; and then builds transformations and combinations of these fields to create special variables that are candidate inputs to models. Examples of this process are encoding of categorical variables, risk tables, z scaling, other normalizations and outlier suppression, nonlinear transformations such as taking the log or binning, and construction of ratios or products of fields.
  Feature selection, variable reduction, dimensionality reduction – The process of reducing the number of inputs to a model by considering which inputs are the most important to the model. Most modeling methods degrade when presented with more inputs than are needed for a robust, stable model.
  Model validation – Typically one separates the data into multiple sets to ensure that the model is robust. In good modeling practice one reserves a set of data that is never used during training and testing, and uses it to evaluate the model on data it has never seen before. Frequently we reserve this holdout sample from a time period that was not used during training and testing; that is called out-of-time validation.
  Boosting – An iterative procedure that creates a series of weak models, where the final model is a linear combination of this series of weak models. Generally the next model in the series is trained on a weighted data set in which the records with the largest error so far are more heavily weighted.
  Bagging – The term comes from “bootstrap aggregation.” A technique to improve model stability and accuracy. It combines/aggregates the outcomes of many models, each having been built via bootstrap sampling.
  • 36. Glossary 3
  Cross validation – The process of splitting the data into many separate bunches. We train on all but one of the bunches, test on the remaining bunch, and then continue by designating a different bunch as the testing data set and training on all the others. We get an ensemble of model performance measures, which we can average. Cross validation uses each record once per iteration, whereas bootstrap will use records a random # of times.
  Bootstrap – A method of selecting training data sets that are a subset of the data by randomly sampling the records many times, always with replacement. It provides estimates of the variance of standard statistical measures that would otherwise be point estimates.
  SOM and K-Means – The difference is that in SOM the distance of each input from all of the reference vectors is taken into account (instead of just the closest one), weighted by the neighborhood kernel h. Thus the SOM functions as a conventional clustering algorithm if the width of the neighborhood kernel is zero.
  Random Forests – Using an ensemble of trees and averaging the result across all the trees. Each tree is built from a random subset of features from the entire feature set, and bagging (bootstrap sampling of the records) is used for each tree.
  Reject inference – The process of inferring the labels on records that were rejected at some point in the process, where the actual outcome isn’t known. This is done so that the model is hardened, or made more robust, to the larger population it will see when implemented (the entire TTD, Through The Door, population).
  Goods, Bads – When we are doing a 2-class classification problem (as opposed to a regression, where the output is continuous) we frequently refer to one class as Goods and the other as Bads. For fraud problems the Bads are the records labeled as fraud and the Goods are the records labeled as non-fraud.
  False Positives (FP) – Goods that have been incorrectly classified as Bads. True Positives (TP) – Bads correctly classified as Bad. True Negatives (TN) – Goods correctly classified as Good. False Negatives (FN) – Bads that have been incorrectly classified as Goods.
  False Positive Rate: FPR = #FP / #examined = #FP / (#FP + #TP). False Positive Ratio = #FP / #TP.