Machine Learning in the Financial Industry
1. Machine Learning in the Financial Industry
A Bird's-Eye View
Subrat Panda and Biswa G Singh
2. Brief Introduction - Subrat
● BTech (2002), PhD (2009) – CSE, IIT Kharagpur
● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary
(Principal Architect - AI)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz
(Synopsys) and Biswa Gourav Singh (AMD)
● https://www.facebook.com/groups/idliai/
● LinkedIn - https://www.linkedin.com/in/subratpanda/
● Facebook - https://www.facebook.com/subratpanda
● Twitter - @subratpanda
3. Brief Introduction - Biswa
● BTech (NIST, 2005), MS (2009) – Clemson University
● Synopsys (EDA), IBM (CPU), ARM, AMD, Capillary (Lead ML Engineer - Data
Sciences)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz
(Synopsys) and Subrat Panda
● https://www.facebook.com/groups/idliai/
● LinkedIn - https://www.linkedin.com/in/biswagsingh/
● Facebook - https://www.facebook.com/biswa.singh
● Kaggle Expert, Winner of AV (Click stream prediction)
8. Machine Learning Classical Definition
▪ Arthur Samuel (1959): "computer’s ability to learn without being
explicitly programmed."
▪ Tom M. Mitchell (1998): "A computer program is said to learn from
experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P,
improves with experience E."
▪ Optimize a performance criterion using example data or past
experience.
9. Types of Machine Learning Algorithms
▪ Supervised Learning: Input data with labeled
responses
▪ Regression : Given a picture of a person, we have to
predict their age on the basis of the given picture
▪ Classification : Given a patient with a tumor, we
have to predict whether the tumor is malignant or
benign.
[Figure panels: Iris dataset species classification; text classification; image classification; linear regression; non-linear regression]
10. Types of Machine Learning Algorithms
▪ Unsupervised Learning: Input data without labeled responses.
▪ Clustering: Take a collection of 1,000,000 different genes, and find a way to
automatically group these genes into groups that are somehow similar or
related by different variables, such as lifespan, location, roles, and so on.
▪ Non-clustering: Exploratory data analysis (PCA, auto-encoders)
[Figure panels: customer segmentation; MNIST digit segmentation]
13. Pop Quiz
▪ Predicting housing prices based on input parameters like house size, number of
rooms, location of house etc. falls under which category of machine learning
problem:
▪ A) Regression
▪ B) Classification
▪ C) Clustering
▪ D) None
▪ Automatically segmenting your customers according to the customer
information falls under which category of machine learning problem:
▪ A) Regression
▪ B) Classification
▪ C) Clustering
▪ D) None
17. 1) Review credit application with an expert
2) Learn algorithms to replicate expert judgement
3) Use of traditional data
4) Additional Insight:
a) Give Applicant a questionnaire
b) Add the questionnaire data to predict outcome
c) Long-term effort, as risk outcomes need to be
observed
d) Mining Voice data
Map Expert Judgement to improve
18. Why Machines?
- For banks, NPAs (non-performing assets) are a big mess (whether big or small)
- NPAs happen because of a lot of reasons:
- Human Error in Judgement
- Lack of analysis of all available data points
- Long term data and multiple data sources not considered together
- Inherent biases in some people
- Big view analytics of data missing
- Incomplete Risk analysis
- Market dynamics and correlation change over time
- Machine Learning Algorithms can model most if not all of the
conditions
- Can assist Risk Analysts - Augmented Intelligence
- Multiple models can be used, with voting between them, so that
the people responsible don’t get blind-sided
19. Techniques we will discuss
- Logistic Regression
- Discussion of Concepts
- Demo
- Boosting
- Discussion of Concepts
- Demo
- Time Series Analysis ( Discussion )
21. Introduction
▪ It is an approach to the classification problem.
▪ The output vector is either 1 or 0 instead of a continuous range of
values
▪ y ∈ {0,1}
▪ Binary classification problem (two values)
▪ Linear regression won't work well for the classification problem
22. Logistic Regression: Hypothesis
▪ The hypothesis should satisfy
▪ 0 ≤ h(x) ≤ 1
▪ We use the "Sigmoid Function," also called the
"Logistic Function":
▪ $h_\theta(x) = g(\theta^T x)$, where $g(z) = \frac{1}{1 + e^{-z}}$
▪ We want to restrict the range to between 0 and 1.
This is accomplished by plugging $\theta^T x$ into the
Logistic Function
23. Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as
follows:
$h_\theta(x) \ge 0.5 \rightarrow y = 1$
$h_\theta(x) < 0.5 \rightarrow y = 0$
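The hypothesis and the 0.5 decision rule can be sketched in a few lines of plain Python. This is a minimal illustration, not numerically hardened code (very negative inputs would overflow `math.exp`); the parameter values in `theta` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); the output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for plain Python lists theta and x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict(theta, x, threshold=0.5):
    """Return class 1 when h_theta(x) >= threshold, otherwise class 0."""
    return 1 if hypothesis(theta, x) >= threshold else 0

# Hypothetical parameters: an intercept weight and one feature weight.
theta = [-1.0, 2.0]

# The first entry of each x is the constant 1 that multiplies the intercept.
print(hypothesis(theta, [1.0, 0.5]))  # theta^T x = 0, so h = 0.5
print(predict(theta, [1.0, 0.5]))     # 0.5 >= 0.5, so class 1
print(predict(theta, [1.0, 0.0]))     # theta^T x = -1, h < 0.5, so class 0
```

Note that h(x) ≥ 0.5 exactly when θᵀx ≥ 0, so the decision boundary is the set of points where θᵀx = 0.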
24. Cost Function
▪ We cannot use the squared-error cost function: combined with the Logistic Function it makes the
cost surface wavy (non-convex), with many local optima.
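26. Maximizing log likelihood
We have to maximize the log likelihood:
$\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Similar to linear regression, we use gradient descent, here on the negative log likelihood. The
updates look like this:
$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$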
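A minimal sketch of this training loop in plain Python follows. It assumes the standard logistic-regression update, in which each parameter moves against the average of (prediction − label) × feature. The toy data, the learning rate `alpha=0.1` and the 500 iterations are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta, X, y):
    """Sum over examples of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total

def gradient_step(theta, X, y, alpha):
    """One update: theta_j := theta_j - (alpha / m) * sum_i (h_i - y_i) * x_ij."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (h - yi) * v
    return [t - alpha * g / m for t, g in zip(theta, grad)]

# Toy data (hypothetical): the first column is the constant 1 for the intercept.
# The classes overlap on purpose, so the fitted weights stay finite.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 1.0], [1.0, 2.5], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
before = log_likelihood(theta, X, y)
for _ in range(500):
    theta = gradient_step(theta, X, y, alpha=0.1)
after = log_likelihood(theta, X, y)

print(before, after)  # the log likelihood increases as training proceeds
print(theta)
```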
27. Bias/Variance
▪ Bias is the algorithm's tendency to consistently learn the wrong thing by
not taking into account all the information in the data
▪ Variance is the algorithm's tendency to learn random things irrespective of the
real signal by fitting highly flexible models that follow the error/noise in
the data too closely
28. Problem of high Bias/Variance
• Generalization ability is an algorithm’s ability to give accurate
predictions on new, previously unseen data
• Models that are too complex for the amount of training data
available are said to overfit and are not likely to generalize well to
new examples
• High variance can cause an algorithm to model the random noise in
the training data, rather than the intended outputs (overfitting).
• Models that are too simple, that do not even do well on training data,
are said to underfit and are also not likely to generalize well.
• High bias can cause an algorithm to miss the relevant relations
between features and target outputs (underfitting).
30. Bias/Variance is a Way to Understand Overfitting and Underfitting
[Figure: error/loss on the training set Dtrain and on an unseen test set Dtest, plotted from a simple classifier (“too simple”) to a complex classifier (“too complex”), with a "high error" region marked]
31. Definitions
• Overfitting: too much reliance on the training data
• Underfitting: a failure to learn the relationships in the training data
• High Variance: model changes significantly based on training data
• High Bias: assumptions about model lead to ignoring training data
• Overfitting and underfitting cause poor generalization on the test set
• A validation set for model tuning can help prevent underfitting and overfitting
32. Ways to Deal with Overfitting and Underfitting
▪ Underfitting:
▪ Easier to resolve
▪ Try different machine learning models
▪ Try stronger models with higher capacity
(hyperparameter tuning)
▪ Try more features
▪ Overfitting:
▪ Use a resampling technique like K-fold cross validation
▪ Improve the feature quality or remove some features
▪ Training with more data
▪ Early stopping
▪ Regularization
▪ Ensembling
[Figure: Early Stopping]
33. Regularization
• Regularization penalizes the coefficients. In machine learning, it
actually penalizes the weight matrices of the nodes.
• L1 and L2 are the most common types of regularization.
• These update the general cost function by adding another term
known as the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization term
34. L1 and L2 Regularization
▪ In L2, we have:
$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert^2$
▪ Here, lambda is the regularization parameter and m is the number of training examples. Lambda is a hyperparameter whose
value is tuned for better results. L2 regularization is also known as weight
decay as it forces the weights to decay towards zero (but not exactly zero).
▪ In L1, we have:
$\text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum \lVert w \rVert$
▪ In this, we penalize the absolute value of the weights. Unlike L2, the weights
may be reduced to zero here.
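As a small sketch, the two penalties can be computed as follows. The λ/(2m) scaling is an assumption here: it is one common convention, and other texts use plain λ or λ/2. The function names and the example numbers are made up for illustration.

```python
def l2_penalty(weights, lam, m):
    """L2 term: (lambda / 2m) * sum of squared weights (scaling is an assumed convention)."""
    return lam / (2.0 * m) * sum(w * w for w in weights)

def l1_penalty(weights, lam, m):
    """L1 term: (lambda / 2m) * sum of absolute weights (scaling is an assumed convention)."""
    return lam / (2.0 * m) * sum(abs(w) for w in weights)

def regularized_cost(loss, weights, lam, m, kind="l2"):
    """Cost function = Loss + Regularization term."""
    if kind == "l2":
        return loss + l2_penalty(weights, lam, m)
    if kind == "l1":
        return loss + l1_penalty(weights, lam, m)
    raise ValueError("kind must be 'l1' or 'l2'")

weights = [3.0, -4.0]   # hypothetical model weights
loss = 1.0              # hypothetical unregularized loss value

print(regularized_cost(loss, weights, lam=2.0, m=1, kind="l2"))  # 1 + (2/2) * 25 = 26.0
print(regularized_cost(loss, weights, lam=2.0, m=1, kind="l1"))  # 1 + (2/2) * 7  = 8.0
```

A larger lambda makes large weights more expensive, which pushes the fit towards simpler models.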
36. Decision Tree
▪ A decision tree is a supervised learning algorithm.
▪ We split the population or sample into two or more homogeneous
sets (or sub-populations) based on the most significant differentiator
among the input variables.
1. Root Node: It represents the entire population or sample, which is
further divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes,
it is called a decision node.
4. Leaf/Terminal Node: Nodes that do not split are called leaf or terminal
nodes.
38. Methods of splitting: Information gain
Which node can be described most easily?
▪ Information theory provides a measure of this degree of disorganization in a system, known as
Entropy:
$\text{Entropy} = -p \log_2 p - q \log_2 q$
Here p and q are the probabilities of success and failure respectively in that node.
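A short sketch of entropy and information gain for binary labels follows. It assumes the usual two-class entropy, −p·log2(p) − q·log2(q) with q = 1 − p, and treats 0·log2(0) as 0. The label lists are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy in bits for success probability p; 0 * log2(0) is taken as 0."""
    q = 1.0 - p
    return 0.0 - sum(v * math.log2(v) for v in (p, q) if v > 0)

def node_entropy(labels):
    """Entropy of a node given its 0/1 labels (p is the share of 1s)."""
    return entropy(sum(labels) / len(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the (non-empty) child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * node_entropy(c) for c in children)
    return node_entropy(parent) - weighted

parent = [1, 1, 0, 0]

print(node_entropy(parent))                          # 50/50 node: entropy 1.0
print(information_gain(parent, [[1, 1], [0, 0]]))    # pure children: gain 1.0
print(information_gain(parent, [[1, 0], [1, 0]]))    # children as mixed as parent: gain 0.0
```

A split is preferred when it yields a higher information gain, that is, purer child nodes.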
39. Other Tree based methods
▪ Trade-off management of bias-variance errors.
▪ Bagging is a simple ensembling technique in which we
build many independent predictors/models/learners
and combine them using some model averaging
techniques.
▪ Ensemble methods involve group of predictive models
to achieve a better accuracy and model stability.
▪ Random Forest: Multiple Trees instead
of single tree. It’s a bagging method
▪ To classify a new object based on
attributes, each tree gives a classification
and we say the tree “votes” for that class.
40. Other Tree based methods
▪ Gradient Boosting is a tree ensemble technique that creates a
strong classifier from a number of weak classifiers.
▪ It combines weak learners in an additive model.
▪ Boosting is an ensemble technique in which the predictors are not
made independently, but sequentially.
42. DEFINITION
• The term ‘Boosting’ refers to a family of algorithms that convert weak learners into strong learners.
• Let’s understand this definition in detail by solving a problem of spam email identification:
• How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify
‘spam’ and ‘not spam’ emails using the following criteria. If:
• Email has only one image file (promotional image), it’s SPAM
• Email has only link(s), it’s SPAM
• Email body consists of a sentence like “You won a prize money of $ xxxxxx”, it’s SPAM
• Email is from our official domain “metu.edu.tr”, not SPAM
• Email is from a known source, not SPAM
• Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But do you think these rules
individually are strong enough to successfully classify an email? No.
• Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these
rules are called weak learners.
43. DEFINITION
• To convert weak learners into a strong learner, we combine the
predictions of the weak learners using methods like:
• Taking an average or weighted average
• Taking the prediction with the most votes
• For example: Above, we have defined 5 weak learners. Out of these
5, 3 vote ‘SPAM’ and 2 vote ‘Not a SPAM’. In this case,
by default, we’ll consider the email as SPAM because ‘SPAM’ has the
higher number of votes (3).
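The two combination rules can be sketched as follows, using the 3-versus-2 vote from the example. The weights in the second call are hypothetical, chosen only to show that a weighted vote can overturn a simple majority.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label with the most votes (on a tie, the label seen first wins)."""
    return Counter(votes).most_common(1)[0][0]

def weighted_vote(votes, weights):
    """Return the label whose votes carry the largest total weight."""
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Five weak learners: three vote SPAM, two vote Not a SPAM.
votes = ["SPAM", "SPAM", "SPAM", "Not a SPAM", "Not a SPAM"]

print(majority_vote(votes))                              # SPAM (3 votes against 2)
print(weighted_vote(votes, [0.1, 0.1, 0.1, 0.5, 0.5]))   # Not a SPAM (weight 1.0 against 0.3)
```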
44. How Do Boosting Algorithms Work?
• To find weak rule, we apply base learning algorithms with a different distribution. Each time base
learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process.
After many iterations, the boosting algorithm combines these weak rules into a single strong
prediction rule.
• The distribution is chosen with the following steps:
Step 1: The base learner takes the whole distribution and assigns equal weight or attention to each
observation.
Step 2: If the first base learning algorithm makes prediction errors, we pay higher
attention to the observations with prediction errors. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 until the limit on the number of base learners is reached or a higher accuracy is
achieved.
• Finally, boosting combines the outputs of the weak learners into a strong learner, which
improves the prediction power of the model. Boosting focuses more on examples that are
misclassified or have higher errors under the preceding weak rules.
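One concrete way to "pay higher attention" to misclassified observations is the AdaBoost weight update. The steps above are described generically, so treating them as AdaBoost is an assumption made for this sketch. Labels are encoded as +1/−1, and the weak learner's predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting round for labels in {+1, -1}.

    Returns (alpha, new_weights). Assumes the weighted error is strictly
    between 0 and 1, otherwise the logarithm is undefined.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1.0 - err) / err)  # the weak learner's say in the final vote
    # Correct predictions (t * p = +1) shrink; errors (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Step 1: equal weight on each of five observations.
weights = [0.2] * 5
y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]   # hypothetical weak learner: wrong on the last one only

# Step 2: the misclassified observation gets more weight for the next learner.
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(weights)
```

After the update the misclassified observation carries half of the total weight, so the next weak learner cannot do well without getting it right.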
45. Types of Boosting Algorithms
• The underlying engine used for boosting algorithms can be anything. It
can be a decision stump, a margin-maximizing classification algorithm,
etc. There are many boosting algorithms which use other types of
engine, such as:
• AdaBoost (Adaptive Boosting)
• Gradient Tree Boosting
• GentleBoost
• LPBoost
• BrownBoost
• XGBoost
46. Gradient Boosting
• Gradient boosting trains many models sequentially. Each new
model gradually minimizes the loss function of the whole system (for y = ax + b + e, the error term e needs
special attention) using the Gradient Descent method. The learning procedure
consecutively fits new models to provide a more accurate estimate of
the response variable.
• The principal idea behind this algorithm is to construct new base
learners that are maximally correlated with the negative gradient of
the loss function associated with the whole ensemble.
47. Gradient Boosting
• Type of Problem – You have a set of variables x1, x2 and x3. You need to predict y,
which is a continuous variable.
• Steps of the Gradient Boost algorithm
Step 1 : Take the mean of y as the initial prediction for every observation.
Step 2 : Calculate the error of each observation from the mean (latest prediction).
Step 3 : Find the variable that best splits the errors and find the value for the split. This
is assumed to be the latest prediction.
Step 4 : Calculate the error of each observation from the mean on each side of the split (latest
prediction).
Step 5 : Repeat steps 3 and 4 until the objective function is maximized/minimized.
Step 6 : Take a weighted mean of all the classifiers to come up with the final model.
• We have excluded the mathematical formulation of boosting algorithms to keep
the presentation simple.
48. Example
• Assume you are given a previous model M to improve on. Currently you observe that the model
has an accuracy of 80% (any metric). How do you go further?
• One simple way is to build an entirely different model using a new set of input variables and try
better ensemble learners. Alternatively, I have a much simpler way to suggest. It goes like
this:
Y = M(x) + error
• What if the error is not white noise but has some correlation with the
outcome (Y) value? What if we can develop a model on this error term? Like,
error = G(x) + error2
49. Example
• Probably, you’ll see the accuracy improve to a higher number, say
84%. Let’s take another step and regress against error2.
error2 = H(x) + error3
• Now we combine all these together:
Y = M(x) + G(x) + H(x) + error3
• This will probably have an accuracy of even more than 84%. What if I
can find optimal weights for each of the three learners?
Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
50. Example
• If we found good weights, we probably have made even a better
model. This is the underlying principle of a boosting learner.
• Boosting is generally done on weak learners, which do not have a
capacity to leave behind white noise.
• Boosting can lead to overfitting, so we need to stop at the right point.
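The idea of repeatedly modelling the remaining error can be sketched with one-split decision stumps on a single feature. This sketch assumes squared-error loss, for which the negative gradient is simply the residual. The stump learner, the toy data and the `learning_rate` default are illustrative choices, not taken from the slides.

```python
def fit_stump(x, residuals):
    """Find the threshold on x whose two-sided means best fit the residuals.

    Returns (threshold, left_mean, right_mean). Needs at least two distinct x values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, rounds, learning_rate=1.0):
    """Start from the mean of y, then repeatedly fit a stump to the current errors."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]   # error = Y - current model
        stump = fit_stump(x, residuals)                      # model the error term
        stumps.append(stump)
        preds = [pi + learning_rate * stump_predict(stump, xi) for xi, pi in zip(x, preds)]
    return base, stumps, preds

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Toy data, made up for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]

for rounds in (0, 1, 3):
    print(rounds, mse(y, boost(x, y, rounds)[2]))  # training error shrinks with each round
```

Because the training error keeps shrinking as rounds are added, the number of rounds is usually chosen on held-out data, in line with the overfitting warning above.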
51. XGBoost (Extreme Gradient Boosting)
• Execution Speed: Generally, XGBoost is fast, often considerably faster
than other implementations of gradient boosting.
• Model Performance: XGBoost performs very well on structured or tabular
datasets for classification and regression predictive modeling
problems.
• One piece of evidence is that it is a go-to algorithm for competition winners
on the Kaggle competitive data science platform.
52. What Algorithm Does XGBoost Use?
• The XGBoost library implements the gradient boosting decision tree algorithm.
• This algorithm goes by lots of different names such as gradient boosting, multiple additive
regression trees, stochastic gradient boosting or gradient boosting machines.
• Boosting is an ensemble technique where new models are added to correct the errors made by
existing models. Models are added sequentially until no further improvements can be made. A
popular example is the AdaBoost algorithm that weights data points that are hard to predict.
• Gradient boosting is an approach where new models are created that predict the residuals or
errors of prior models and then added together to make the final prediction. It is called gradient
boosting because it uses a gradient descent algorithm to minimize the loss when adding new
models.
• This approach supports both regression and classification predictive modeling problems.
53. XGBoost (Extreme Gradient Boosting)
• What is the difference between the R gbm (gradient boosting machine) and xgboost
(extreme gradient boosting)?
• Both xgboost and gbm follow the principle of gradient boosting. There are, however,
differences in modeling details. Specifically, xgboost uses a more regularized model
formalization to control over-fitting, which gives it better performance.
• Objective Function : Training Loss + Regularization
• The regularization term controls the complexity of the model, which helps us to avoid overfitting.
This sounds a bit abstract, so let us consider the following problem in the following picture. You
are asked to fit visually a step function given the input data points on the upper left corner of the
image. Which solution among the three do you think is the best fit?
58. ● Assumes equal cost for both kinds of errors – cost(b-type-
error) = cost (c-type-error)
● Is 99% accuracy good? – can be excellent, good, mediocre,
poor, terrible – depends on problem
● Is 10% accuracy bad? – information retrieval
● BaseRate = accuracy of predicting predominant class (on
most problems obtaining Base Rate accuracy is easy)
PREDICTION THRESHOLD
59. In which of the following scenarios would a high accuracy value suggest that the ML model is doing a good job?
● An expensive robotic chicken crosses a very busy road a thousand times per day. An ML model
evaluates traffic patterns and predicts when this chicken can safely cross the street with an accuracy of
99.99%.
○ A 99.99% accuracy value on a very busy road strongly suggests that the ML model is far better than chance. In some
settings, however, the cost of making even a small number of mistakes is still too high. 99.99% accuracy means that
the expensive chicken will need to be replaced, on average, every 10 days. (The chicken might also cause extensive
damage to cars that it hits.)
● A deadly, but curable, medical condition afflicts 0.01% of the population. An ML model uses
symptoms as features and predicts this affliction with an accuracy of 99.99%.
○ Accuracy is a poor metric here. After all, even a "dumb" model that always predicts "not sick" would still be 99.99%
accurate. Mistakenly predicting "not sick" for a person who actually is sick could be deadly.
● In the game of roulette, a ball is dropped on a spinning wheel and eventually lands in one of 38
slots. Using visual features (the spin of the ball, the position of the wheel when the ball was
dropped, the height of the ball over the wheel), an ML model can predict the slot that the ball will
land in with an accuracy of 4%.
○ This ML model is making predictions far better than chance; a random guess would be correct 1/38 of the time,
yielding an accuracy of 2.6%. Although the model's accuracy is "only" 4%, the benefits of success far outweigh the
disadvantages of failure.
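The medical scenario can be checked with a few lines of Python. The sketch builds a hypothetical population of 10,000 people in which exactly one (0.01%) is sick, and scores a "dumb" model that always predicts "not sick".

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives (label 1) that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Hypothetical population: 1 sick person (label 1) among 10,000.
y_true = [1] + [0] * 9999
# "Dumb" model: always predicts "not sick" (label 0).
y_pred = [0] * 10000

print(accuracy(y_true, y_pred))  # 0.9999
print(recall(y_true, y_pred))    # 0.0, the one sick person is missed
```

The accuracy looks excellent while the model finds none of the sick patients, which is why the base rate of the predominant class has to be considered alongside accuracy.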
60. PRECISION
● Precision attempts to answer the following question:
What proportion of positive identifications was actually correct?
Precision is defined as follows:
$\text{Precision} = \frac{TP}{TP + FP}$
where TP is the number of true positives and FP is the number of false positives.
Note: A model that produces no false positives has a precision of 1.0.
In the tumor example, when the model predicts a tumor is malignant, it is correct 50% of the time.
61. RECALL
● Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
Mathematically, recall is defined as follows:
$\text{Recall} = \frac{TP}{TP + FN}$
where TP is the number of true positives and FN is the number of false negatives.
Note: A model that produces no false negatives has a recall of 1.0.
Our model has a recall of 0.11; in other words, it correctly identifies 11% of all malignant tumors.
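Both metrics follow directly from confusion-matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The counts used below (1 true positive, 1 false positive, 8 false negatives) are an assumption: they are not given in the text and were chosen only because they reproduce the quoted 50% precision and 0.11 recall.

```python
def precision(tp, fp):
    """Of all positive predictions, the share that was actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, the share that was predicted positive."""
    return tp / (tp + fn)

# Assumed counts for the tumor example (hypothetical, see the note above).
tp, fp, fn = 1, 1, 8

print(precision(tp, fp))          # 0.5
print(round(recall(tp, fn), 2))   # 0.11
```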
62. Consider a classification model that separates email into two categories: "spam" or "not spam."
If you raise the classification threshold, what will happen to precision?
a) Probably increase.
b) Probably decrease.
c) Definitely decrease.
d) Definitely increase.
Consider a classification model that separates email into two categories: "spam" or "not spam."
If you raise the classification threshold, what will happen to recall?
a) Always decrease or stay the same.
b) Always increase.
c) Always stay constant.
Consider two models—A and B—that each evaluate the same dataset. Which one of the following
statements is true?
a) If Model A has better precision than model B, then model A is better.
b) If model A has better recall than model B, then model A is better.
c) If model A has better precision and better recall than model B, then model A is probably
better.
63. In general, a model that outperforms another model on both precision and
recall is likely the better model. Obviously, we'll need to make sure that
comparison is being done at a precision / recall point that is useful in
practice for this to be meaningful. For example, suppose our spam detection
model needs to have at least 90% precision to be useful and avoid
unnecessary false alarms. In this case, comparing one model at {20%
precision, 99% recall} to another at {15% precision, 98% recall} is not
particularly instructive, as neither model meets the 90% precision
requirement. But with that caveat in mind, this is a good way to think about
comparing models when using precision and recall.
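64. F1 Score
Various metrics have been developed that rely on both precision and recall.
The F1 score is the harmonic average of precision and recall:
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$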
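A minimal sketch of the harmonic average, F1 = 2PR / (P + R), is given here. Returning 0 when both inputs are 0 is a convention chosen for this sketch to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic average of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention for this sketch: avoid division by zero
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 0.5))    # equal inputs: F1 equals them, 0.5
print(f1_score(0.5, 0.11))   # pulled towards the weaker of the two values
print(f1_score(1.0, 0.0))    # 0.0: one perfect value cannot offset a zero
```

Unlike the arithmetic mean, the harmonic average stays low unless both precision and recall are reasonably high.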
66. ROC and AUC
● An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve plots
two parameters:
○ True Positive Rate
○ False Positive Rate
● Sweep threshold and plot
○ TPR vs. FPR
○ Sensitivity vs. 1-Specificity
○ P(true|true) vs. P(true|false)
● AUC is scale-invariant. It measures how well predictions are ranked, rather than their
absolute values.
● AUC is classification-threshold-invariant. It measures the quality of the model's
predictions irrespective of what classification threshold is chosen.
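The threshold sweep can be sketched in plain Python. Each distinct score is used as a threshold, the (FPR, TPR) point is recorded, and the area under the resulting curve is computed with the trapezoid rule. The labels and scores are toy values made up for illustration, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Sweep the threshold over every distinct score; return (FPR, TPR) pairs."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the curve given by consecutive (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_points(y_true, scores))
print(auc(roc_points(y_true, scores)))                     # 0.75

# Scale invariance: multiplying every score by 10 keeps the ranking, so AUC is unchanged.
print(auc(roc_points(y_true, [s * 10 for s in scores])))   # 0.75
```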
69. Validation Strategy
● Split dataset into two groups (the holdout method)
○ Training set: used to train the classifier
○ Test set: used to estimate the error rate of the trained classifier
● The holdout method has two basic drawbacks
○ In problems where we have a sparse dataset we may not be able to afford the
“luxury” of setting aside a portion of the dataset for testing
○ Since it is a single train-and-test experiment, the holdout estimate of error
rate will be misleading if we happen to get an “unfortunate” split
● The limitations of the holdout can be overcome with a family of resampling
methods at the expense of higher computational cost
○ Cross Validation
■ Random Subsampling
■ K-Fold Cross-Validation
70. Random Subsampling
● Random Subsampling performs K data splits of the entire dataset
○ Each data split randomly selects a (fixed) number of examples without
replacement
○ For each data split we retrain the classifier from scratch with the training
examples and then estimate the error Ei with the test examples
● The true error estimate is obtained as the average of the separate estimates Ei
○ This estimate is significantly better than the holdout estimate
71. K-fold Cross Validation
● Create a K-fold partition of the dataset
○ For each of K experiments, use K-1 folds for training and a different fold
for testing (the original figure illustrates this procedure for K=4)
● K-Fold Cross validation is similar to Random Subsampling
○ The advantage of K-Fold Cross validation is that all the examples in the
dataset are eventually used for both training and testing
● As before, the true error is estimated as the average error rate on test
examples
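The K-fold procedure can be sketched with the standard library alone. The fold construction (shuffle, then deal indices round-robin) is one simple choice among several. The toy labels and the `majority_error` "classifier" are hypothetical stand-ins for a real model.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n, k, evaluate, seed=0):
    """Run K experiments: train on K-1 folds, test on the remaining fold.

    evaluate(train_idx, test_idx) must return an error rate; the mean over
    the K folds is returned as the error estimate.
    """
    folds = k_fold_indices(n, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(evaluate(train_idx, test_idx))
    return sum(errors) / k

# Toy labels, made up for illustration.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

def majority_error(train_idx, test_idx):
    """Hypothetical 'classifier': always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return sum(1 for i in test_idx if labels[i] != guess) / len(test_idx)

print(k_fold_indices(len(labels), 5))
print(cross_validate(len(labels), 5, majority_error))
```

Every index appears in exactly one test fold, so each example is used for both training and testing across the K experiments.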
72. Time Series Methodologies
Definition of Time Series: An ordered sequence of values of a variable at equally spaced time intervals.
Time series analysis has two main uses:
● Obtain an understanding of the underlying forces and structure that produced the observed data
● Fit a model and proceed to forecasting, monitoring or even feedback and feedforward control.
Time Series Analysis is used for many applications such as:
● Economic Forecasting
● Sales Forecasting
● Budgetary Analysis
● Stock Market Analysis
● Yield Projections
● Process and Quality Control
● Inventory Studies
● Workload Projections
● Utility Studies
● Census Analysis and many more
73. Time Series Models
- ARIMA Models
- Multivariate Models
- Holt Winters Exponential Smoothing
- We will just cover an overview
74. Stationary Data
- A common assumption in many time series techniques is that the data are stationary.
- A stationary process has the property that the mean, variance and autocorrelation structure do not
change over time. Stationarity can be defined in precise mathematical terms, but for our purpose we
mean a flat looking series, without trend, constant variance over time, a constant autocorrelation
structure over time and no periodic fluctuations (seasonality).
If the time series is not stationary, we can often transform it to stationarity with one of the following
techniques.
1. We can difference the data. That is, given the series $Z_t$, we create the new series
$Y_i = Z_i - Z_{i-1}$.
The differenced data will contain one less point than the original data. Although you can difference the
data more than once, one difference is usually sufficient.
2. If the data contain a trend, we can fit some type of curve to the data and then model the residuals from
that fit. Since the purpose of the fit is to simply remove long term trend, a simple fit, such as a straight
line, is typically used.
3. For non-constant variance, taking the logarithm or square root of the series may stabilize the variance.
For negative data, you can add a suitable constant to make all the data positive before applying the
transformation. This constant can then be subtracted from the model to obtain predicted (i.e., the fitted)
values and forecasts for future points.
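Differencing and the log transform are simple enough to sketch directly. The series below are made up for illustration, and the `shift` parameter stands in for the "suitable constant" mentioned for negative data.

```python
import math

def difference(series):
    """First difference: Y_i = Z_i - Z_(i-1); the result has one less point."""
    return [b - a for a, b in zip(series, series[1:])]

def log_transform(series, shift=0.0):
    """Natural log of each value after adding a constant that makes the data positive."""
    return [math.log(v + shift) for v in series]

# A series with a linear trend becomes flat after one difference.
trend = [3, 5, 7, 9, 11]
print(difference(trend))             # [2, 2, 2, 2]

# A series that doubles each step has growing differences, but after a
# log transform its differences are constant (each equals log 2).
growth = [1, 2, 4, 8, 16]
print(difference(growth))            # [1, 2, 4, 8]
print(difference(log_transform(growth)))
```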
76. ARIMA
- Autoregressive Integrated Moving Average Model, or ARIMA for short is a standard statistical
model for time series forecast and analysis.
- A standard notation is used of ARIMA(p,d,q) where the parameters are substituted with integer
values to quickly indicate the specific ARIMA model being used.
- The parameters of the ARIMA model are defined as follows:
- p: The number of lag observations included in the model, also called the lag order.
- d: The number of times that the raw observations are differenced, also called the degree of
differencing.
- q: The size of the moving average window, also called the order of moving average.
77. ARIMA Diagnostics
Two diagnostic plots can be used to help choose the p and q parameters of the ARMA or ARIMA. They are:
● Autocorrelation Function (ACF). The plot summarizes the correlation of an observation with lag values. The x-
axis shows the lag and the y-axis shows the correlation coefficient between -1 and 1 for negative and positive
correlation.
● Partial Autocorrelation Function (PACF). The plot summarizes the correlations for an observation with lag
values that is not accounted for by prior lagged observations.
Some useful patterns you may observe on these plots are:
● The model is AR if the ACF trails off after a lag and has a hard cut-off in the PACF after a lag. This lag is taken
as the value for p.
● The model is MA if the PACF trails off after a lag and has a hard cut-off in the ACF after the lag. This lag value is
taken as the value for q.
● The model is a mix of AR and MA if both the ACF and PACF trail off.
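The sample autocorrelation behind an ACF plot can be sketched as follows. This uses the common estimator that divides the lagged cross-products by the lag-0 sum of squares. The PACF needs an additional regression step and is not covered here. The alternating series is a toy example.

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag.

    Each value is the sum of lagged cross-products of deviations from the
    mean, divided by the lag-0 sum of squares, so acf[0] is always 1.
    Assumes the series is not constant.
    """
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    result = []
    for lag in range(max_lag + 1):
        num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
        result.append(num / denom)
    return result

# Toy series that flips sign every step: strong negative correlation at lag 1,
# positive at lag 2, and so on.
series = [1, -1, 1, -1, 1, -1]
print(acf(series, 3))
```

In practice the values for lags 1 and up are plotted against the lag, and the trail-off or cut-off patterns listed above are read from that plot.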
78. Handling Seasonality
- Seasonality is quite common in economic time series. It is less
common in engineering and scientific data.
- If seasonality is present, it must be incorporated into the time
series model. In this section, we discuss techniques for detecting
seasonality. We defer modeling of seasonality until later sections.
- Detecting seasonality:
- A run sequence plot will often show seasonality.
- A seasonal subseries plot is a specialized technique for showing seasonality.
- Multiple box plots can be used as an alternative to the seasonal subseries plot to detect seasonality.
- The autocorrelation plot can help identify seasonality.
80. Acknowledgements
- https://github.com/avannaldas/Loan-Defaulter-Prediction-Machine-Learning/
- https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc41.htm
- https://machinelearningmastery.com/gentle-introduction-box-jenkins-method-time-series-forecasting/
- ML course by Andrew Ng
- https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec21_slides.pdf
- https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec10.pdf
- https://www.cs.toronto.edu/~hinton/csc2515/notes/lec11boo.htm
- http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
- http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/