Higgs Boson Machine Learning Challenge
Group Project - CS4622

Team Members:
100112V - Edirisinghe E.A.S.D.
100132G - Fernando W.V.D.
100440A - Ranasinghe R.H.T.D.
100498G - Senaratne H. H.
100559V - Vithana Y. G. K.
100577A - Weerasinghe L.A.
Table of Contents
1. Introduction 
2. Background 
3. Approach Followed 
3.1 Preprocessing 
3.1.1 Understanding the nature of the given variables 
3.1.2 Handling missing values 
3.1.3 Converting Data Types 
3.1.4 Data Normalization 
3.1.5 Feature Selection and Deriving Features 
3.2 Training Techniques 
3.2.1 Random Forest Classifier 
3.2.2 Gradient Boost Classifier 
3.2.3 Neural Networks 
3.2.4 XGBoost Classifier 
4. Results and Discussion 
5. References
6. Appendix
1. Introduction 
This report describes the procedure followed by the team to solve the "Higgs Boson Machine
Learning Challenge" hosted on Kaggle.
The initial part of this report provides background on the problem, which is closely related to
particle physics. We then describe how we modeled and pre-processed the data, which machine
learning techniques and procedures we used to solve the problem, and what results we obtained with
each approach. Finally, we analyse and discuss the methods we followed and the outputs we obtained.
2. Background 
The discovery of the Higgs boson, an elementary particle of particle physics, was recently
claimed by the ATLAS and CMS experiments. The discovery was acknowledged by the 2013 Nobel
Prize in Physics, awarded to Francois Englert and Peter Higgs. The related experiments run at the
Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research) in Geneva,
Switzerland, which began operating in 2009 after about 20 years of design and construction.
The Higgs boson decays through several processes and produces other particles. In physics, a
channel is the term used to denote the decay of a particle into other specific particles. The ATLAS
experiment recently reported the first evidence of the Higgs boson decaying through the tau tau
channel. The signal observed for this decay is small and buried in background noise.
The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of
advanced machine learning methods to improve the discovery significance of the experiment by
classifying each event into the correct region, 'signal' or 'background'. That is, deciding whether the
results of a given event were produced by the tau tau decay of a Higgs boson (signal) or by other
background processes (background).
The training set consists of several primary and derived attributes related to this event
classification, along with signal/background labels and weights. The weights are related to the
normalization of signals and backgrounds. The test set contains the same variables as the training set
except for the labels and weights. The required solution should contain the fields EventID (a unique
identifier for each event), RankOrder (a permutation of the integers from 1 to the test set size) and
Class (either "b" or "s"). Higher ranks indicate more signal-like events and lower ranks indicate more
background-like events. Since the rank can be calculated from the weight values, the objective is to
find a function of the weights, or in simple terms to predict the weights for the test set after training a
machine learning model on the training set. Depending on the value of the weight it is possible to
predict the event's class, because two different ranges of weights fall into two different classes.
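To make the required output concrete, the following is a minimal sketch (the column names, the
helper function and the 0.5 threshold are illustrative assumptions, not the exact submission code) of
turning predicted signal scores into the fields described above.

    import pandas as pd

    def make_submission(event_ids, scores, threshold=0.5):
        # Higher scores are treated as more signal-like and therefore get higher ranks.
        sub = pd.DataFrame({"EventId": event_ids, "score": scores})
        sub["RankOrder"] = sub["score"].rank(method="first").astype(int)
        sub["Class"] = (sub["score"] > threshold).map({True: "s", False: "b"})
        return sub[["EventId", "RankOrder", "Class"]]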
Figure 1: Graphical representation of a Higgs boson decaying to two tau particles in the ATLAS detector
3. Approach Followed 
In this section we discuss how we preprocessed the training data in order to feed it into a machine
learning model, and which machine learning techniques we used for training.
3.1 Preprocessing 
Data preprocessing plays an important part in any machine learning challenge. In the Higgs Boson
Machine Learning Challenge we used several data preprocessing methods, which are described in
this section.
3.1.1 Understanding the nature of the given variables 
Before starting the preprocessing work, we tried to identify any directly visible relationships
between the classification and the variables. To do so, we represented the data graphically in order to
expose any information directly associated with the classification. The following figures (Figure 2 to
Figure 5) show how the classification is distributed with respect to the ranges of values of a few of the
variables.
Figure 2: Classification relative to the distribution of the variable Der_lep_eta_centrality
Figure 3: Classification relative to the distribution of the variable Weight 
Figure 4: Classification relative to the distribution of the variable DER_mass_MMC
Figure 5: Classification relative to the distribution of the variable PRI_lep_eta 
These visualizations showed that no variable is directly associated with the classification except
for the weight. From this we learned that if we predict the weight for the given test cases, we can
perform both the classification and the ranking at the same time.
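As an illustration of how such plots can be produced, here is a hedged sketch (the file name and the
plotting details are assumptions, not the exact scripts we used) that draws the distribution of one
variable split by the signal/background label.

    import pandas as pd
    import matplotlib.pyplot as plt

    train = pd.read_csv("training.csv")            # file name assumed
    for label in ["s", "b"]:
        values = train.loc[train["Label"] == label, "DER_mass_MMC"]
        plt.hist(values[values != -999], bins=100, alpha=0.5, label=label)
    plt.xlabel("DER_mass_MMC")
    plt.legend()
    plt.show()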
3.1.2 Handling missing values 
In the data given for the competition the missing values are stored as ­999. 
Exploring we 
discovered that there are lot of missing values in data. 
Figure 6: Variable statistics 
As can be seen in Figure 6, many columns such as DER_deltaeta_jet_jet and
DER_massdelta_jet_jet have -999 values for more than half of their entries (more than the median).
It was clear that dropping training instances containing missing values could not be used to handle
the missing values, because we also need to predict for test entries that contain such missing values.
So as a first approach we tried dropping the variables in which missing values are present. This did
not improve the results, due to the large number of missing values: after dropping those variables
there was not enough data to predict with, and important relationships and variables tended to
disappear for the sake of handling missing values. Therefore it is not a good approach for handling
missing values.
The next approach we considered was traditional imputation, but the results were not good. Here
we substituted missing values with the average value of the corresponding variable column, ignoring
the missing values in the calculation. The main reason for the lack of improvement is that the missing
values are "actually" missing: a value for that feature cannot exist in that particular training instance.
So the best way to handle the missing values is to interpret -999 as a special missing value and use
algorithms that treat -999 as a special category.
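A minimal sketch of the mean-imputation attempt described above (later abandoned), assuming the
data is held in a pandas DataFrame: column means are computed while treating -999 as missing.

    import numpy as np

    def impute_column_means(df, missing_value=-999):
        # Replace the sentinel with NaN so it is excluded from the mean, then fill.
        df = df.replace(missing_value, np.nan)
        return df.fillna(df.mean(numeric_only=True))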
3.1.3 Converting Data Types 
In order to apply the XGBoost and gradient boosting techniques, the value of the label had to be
numeric. So we converted the Label to 0/1 during preprocessing: 0 if the label equals "b" and 1 if the
label equals "s".
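A minimal sketch of this conversion, assuming the training data is loaded into a pandas DataFrame
(the file name is an assumption):

    import pandas as pd

    train = pd.read_csv("training.csv")                      # file name assumed
    train["Label"] = train["Label"].map({"b": 0, "s": 1})    # "b" -> 0, "s" -> 1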
3.1.4 Data Normalization 
As can be seen in Figure 6, the distribution of values varies widely between columns. For example,
the column DER_pt_h varies from 0 to 2835, while DER_met_phi_centrality varies only from -1.4
to +1.4. To guarantee stable convergence of weights and biases in our models we normalized all the
columns. We used min-max normalization, where each value in a column is mapped to a value
between 0 and 1.
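A sketch of the min-max normalization applied to the feature columns; in practice the -999 missing
values would need to be excluded or handled before this step.

    def min_max_normalize(df, columns):
        for col in columns:
            lo, hi = df[col].min(), df[col].max()
            df[col] = (df[col] - lo) / (hi - lo)   # maps each value into [0, 1]
        return df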
3.1.5 Feature Selection and Deriving Features 
Figure 7 shows the correlation between the label and the other features in the data set.
Figure 7: The correlation between the label and other features in the data set 
As can be seen in Figure 7, some variables such as PRI_tau_eta can be dropped when building
the model, since they are insignificant to the Label value. The other thing to notice in the diagram is
that no single variable can be considered significant to the Label value that we should predict, so
deriving new features was required.
We identified four features[1] that could be important to our model:
assymenj = (MET − MHT) / (MHT + MET)
dijet = sum of the two jet masses
deltaphi = jet1_phi − jet2_phi
deltaphimet = (jet1_phi + jet2_phi) / 2
The feature dijet is already included in the data sets as the variable DER_mass_jet_jet. We
derived the other three variables from the available data as follows. Since MHT (missing energy
calculated from the jets) was not readily available, we used a derived variable (estimatedMHT) which
is proportional to this quantity.
estimatedMHT = PRI_jet_all_pt − PRI_jet_leading_pt − PRI_jet_subleading_pt 
assymenj = (PRI_met − estimatedMHT ) / (PRI_met + estimatedMHT) 
deltaphi = PRI_jet_leading_phi − PRI_jet_subleading_phi 
deltaphimet = (PRI_jet_leading_phi + PRI_jet_subleading_phi) / 2 
We also identified, using a greedy approach, a derived variable that had a 0.2 correlation with the label:
Special = DER_mass_MMC × DER_pt_ratio_lep_tau / (DER_sum_pt + 0.0000001)
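A hedged sketch of these derivations, using the column names of the competition data set (the small
constant guards against division by zero, as in the Special feature above):

    def add_derived_features(df):
        est_mht = (df["PRI_jet_all_pt"] - df["PRI_jet_leading_pt"]
                   - df["PRI_jet_subleading_pt"])                       # estimatedMHT
        df["assymenj"] = (df["PRI_met"] - est_mht) / (df["PRI_met"] + est_mht)
        df["deltaphi"] = df["PRI_jet_leading_phi"] - df["PRI_jet_subleading_phi"]
        df["deltaphimet"] = (df["PRI_jet_leading_phi"] + df["PRI_jet_subleading_phi"]) / 2
        df["Special"] = (df["DER_mass_MMC"] * df["DER_pt_ratio_lep_tau"]
                         / (df["DER_sum_pt"] + 0.0000001))
        return df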
These newly added columns improved our score on the public leaderboard for submissions using
the XGBoost algorithm. The initial version of these new variables simply computed the values from
the relevant columns without checking whether any of the columns contained -999; in that case the
variable-creation algorithm took -999 as a valid value for the respective field and calculated the
result. We reasoned, however, that -999 is not a valid value, but only an indicator that the value is not
available.
With this in mind, we decided to filter out entries with invalid inputs: we changed the
variable-creation algorithm to output -999 whenever at least one of the inputs was invalid. Unfortunately,
the results did not improve as expected. From analysing the change and the results, we concluded that
the success rate decreased due to the elimination of diversity. To clarify this, consider the following
example entries.
EventID   Value 1   Value 2   New variable (neglecting -999)   New variable (considering -999)
1         -999      2.14      -996.86                          -999
15        1.52      -999      1000.52                          -999
122       -999      -999      0                                -999
For the above three entries, the variable that ignores the invalid inputs produces three different
values, and their range is directly associated with the particular combination of invalid inputs. With the
variable created with consideration of the invalid inputs, all three entries receive the same value, -999.
This clearly shows why the success rate decreased with the new variable: it eliminated the variability
of the previous variable, hiding information that is important for classification.
The new variables computed without considering the invalid inputs seemed to introduce new
measurements of the relations among the variables when taken as groups. Following this improvement,
we explored the possibility of creating more derived variables to capture collective relationships
among the primitive variables.
We came up with a few more columns by randomly combining primitive variables, with the
intention of improving the success rate by introducing variables that combine information from other
variables. But these additions decreased the success rate. We concluded that introducing variables with
a known relationship to the classification may improve the success rate, while others may decrease it
by introducing unimportant relationships.
3.2 Training Techniques 
3.2.1 Random Forest Classifier 
We used the random forest classifier for the Higgs boson challenge in the early stages. We used
the scikit-learn package for Python to develop the solution; RandomForestClassifier is found in the
sklearn.ensemble package.
The basic functionality of a random forest is as follows[2]. Instead of building a single
classification tree, it builds a number of classification trees. A new input is given to all of the trees,
and their answers are combined by a voting mechanism: the result from each tree is counted as a vote,
and the final answer is the class with the most votes.
When building the trees in the random forest, some guidelines are followed. If there are N cases
in the training set, then N cases are sampled (with replacement) to train each tree. At each node, m
variables are selected at random out of the total M input variables. The trees are grown without any
pruning.
One major feature of the random forest classifier is that it runs efficiently on large data sets and
can handle a large number of input variables. It handles missing values effectively and can maintain
accuracy when a large proportion of the data is missing. Furthermore, it can identify which variables
are most important and the relationships between variables, and it does not overfit to the inputs.
When training the trees in a random forest, about one third of the data is left out of each tree's
bootstrap sample; these out-of-bag data are used to obtain running unbiased error estimates and to
measure the importance of variables. The remaining data form the bootstrap sample used to train the
tree. The out-of-bag cases are run back down the trees to obtain classifications, and the class with the
most votes is taken; the resulting error is used as an error estimate for the random forest classifier.
Measuring the importance of variables is also an important feature of random forest
classification. This is done by running the out-of-bag data down each tree in the forest and counting
the number of votes for the correct class. The values of the variable being checked are then altered, the
data are run down the trees again, and the votes for the correct class are counted once more. Subtracting
the altered-input votes from the original votes and averaging over the forest gives an importance score
for the variable. If the number of variables in the data set is very large, the forest can be run once with
all variables and then again with only the most important ones.
Proximities are another important feature of the random forest classifier. They are formed by
creating an NxN matrix and filling it using all the data, including the training and out-of-bag data.
Since an NxN matrix is not feasible for large data sets, an NxT matrix is formed instead, where T is
the number of trees in the forest.
The random forest classifier has two methods for filling in missing values. The faster way is to
fill missing values with the median. The more accurate way is to fill them initially with rough estimates
and then run the forest to compute the proximities, which are used to refine the estimates.
The random forest method identifies outliers using the proximity values: entries in a class with
small proximities are identified as outliers.
The scikit-learn random forest classifier has several parameters that can be used to tune the
results[3]. The parameter n_estimators specifies the number of trees in the forest. max_depth specifies
the maximum depth of the trees; its default value is None, in which case nodes are expanded until all
leaves are pure. oob_score is a boolean parameter that specifies whether to use out-of-bag samples to
estimate the generalization accuracy.
It provides several methods for the prediction work. The fit method builds the forest from the
training set, and the predict method predicts results for test data. It also has a transform method that
can be used to reduce the input data matrix to the most important features.
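For reference, a minimal scikit-learn sketch of this setup (the file name and the dropped columns are
assumptions; n_estimators = 150 matches our best-scoring run):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("training.csv")                        # file name assumed
    X = train.drop(columns=["EventId", "Weight", "Label"])     # feature columns only
    y = train["Label"].map({"b": 0, "s": 1})

    clf = RandomForestClassifier(n_estimators=150, max_depth=None, oob_score=True)
    clf.fit(X, y)
    print(clf.oob_score_)                                      # out-of-bag accuracy estimate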
Initially we used the random forest classifier to predict the Label value as signal (s) or
background (b) without predicting the weight value; this way we could not obtain a rank value for the
test data. For the initial submission we also removed the derived features from the training data and
added them back later. We then made submissions where the -999 values were replaced by the column
averages, and also tried removing those columns from the training and test data sets. But the random
forest classifier did not give particularly good results with either method. The best score we obtained
with the random forest method was 2.90576 on the private leaderboard, with n_estimators set to 150.
When we then tried to estimate the weight value using the random forest classifier, it failed because it
required a huge amount of memory. So we decided to move to other available options to obtain better
results.
3.2.2 Gradient Boost Classifier 
Another classifier we tested in the initial stages was the gradient boosting classifier. Gradient
boosting algorithms use an ensemble of weak decision trees built to optimize a customizable loss
function. The trees are built in a staged manner using boosting. Gradient boosting can be used for
both regression and classification, can handle data of mixed types and is very robust to outliers.
We used a gradient boosted regression trees algorithm from the scikit-learn library in Python
for this problem. This model used all the features in the data set to train the classifier. To improve the
accuracy we used hyper-parameter tuning along with stratified cross-validation to select the best values
for the parameters, as sketched below.
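A hedged sketch of such a search, using modern scikit-learn import paths (the grid values are
illustrative rather than the parameters we finally selected; X and y are the preprocessed features and
0/1 labels from the earlier sketches):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV, StratifiedKFold

    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [3, 5],
        "learning_rate": [0.05, 0.1],
    }
    search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                          cv=StratifiedKFold(n_splits=5))
    search.fit(X, y)                        # X, y: preprocessed features and 0/1 labels
    print(search.best_params_, search.best_score_)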
We also tried multiple loss functions, including the default 'deviance' function as well as the
AMS metric used in this competition. Using the AMS function as the loss function improved our
results slightly. Even with all this effort we were unable to match the performance we obtained from
the XGBoost algorithm, so this approach was dropped.
3.2.3 Neural Networks 
Artificial neural networks provide a general, practical method for learning real-valued,
discrete-valued and vector-valued functions from examples. Algorithms such as backpropagation use
gradient descent to tune network parameters to best fit a training set of input-output pairs. Neural
network learning is robust to errors in the training data and has been successfully applied to problems
such as interpreting visual scenes, speech recognition, and learning robot control strategies.
We used the PyBrain[5] Python library to build a neural network trained with the
backpropagation algorithm (a minimal training sketch appears at the end of this subsection). While
training the neural network we faced a number of problems:
1. Number of hidden layers to be used 
Number of Hidden Layers   Result
none                      Only capable of representing linearly separable functions or decisions.
1                         Can approximate any function that contains a continuous mapping from
                          one finite space to another.
2                         Can represent an arbitrary decision boundary to arbitrary accuracy with
                          rational activation functions and can approximate any smooth mapping
                          to any accuracy.
The table above summarizes the knowledge we acquired from various research papers.
Unfortunately we were unable to find a specific method for determining the number of hidden layers,
and hence we tested a range of hidden layer counts from 2 to 50. We could not increase the number of
hidden layers further because of the huge amount of time taken by the training phase.
2. Number of neurons in each hidden layer
We were unable to find a specific formula to calculate the number of neurons in a particular
hidden layer, although we found several rule-of-thumb methods for choosing it, such as the following:
● The number of hidden neurons should be between the size of the input layer and the size 
of the output layer. 
● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of 
the output layer. 
● The number of hidden neurons should be less than twice the size of the input layer. 
We applied the above rules, and we also tried choosing the number of neurons in a hidden layer
based on a combination of the prime number series and the Fibonacci number series.
3. Neural network training time
Training the neural network took a lot of time. As a last resort we tried to use genetic
algorithms[6] and pruning algorithms[7][8] to optimize the neural network, but the results were not
satisfactory.
4. How to decide the cut-off mark for signal versus background noise
The output of the neural network was a floating point value between 0 and 1: values closer to 1
indicate signal and values closer to 0 indicate background noise. Using 10-fold cross validation, we
found that output values above 0.65 should be considered signal and values below 0.65 should be
considered background noise.
However, all the prediction results obtained through the neural network model performed 
poorly when compared to other models during the cross validation.
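For reference, a minimal PyBrain training sketch (the layer sizes and epoch count are illustrative
rather than our final configuration; X and y are assumed to be the normalized feature matrix and 0/1
labels from the preprocessing steps above):

    from pybrain.datasets import SupervisedDataSet
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.supervised.trainers import BackpropTrainer

    n_features = X.shape[1]
    ds = SupervisedDataSet(n_features, 1)
    for features, label in zip(X.values, y.values):
        ds.addSample(features, (label,))

    net = buildNetwork(n_features, 20, 10, 1)      # two hidden layers (sizes illustrative)
    trainer = BackpropTrainer(net, ds)
    for _ in range(50):
        trainer.train()                            # one full pass over the data set per call

    output = net.activate(X.values[0])[0]          # value in [0, 1]; > 0.65 treated as signal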
3.2.4 XGBoost Classifier 
We used the xgboost package[4] for the R language, which implements an extreme gradient
boosting classifier. Extreme gradient boosting is an efficient and scalable implementation of the
gradient boosting framework described earlier. The package includes an efficient linear model solver
and tree learning algorithms, and it can automatically run parallel computation with OpenMP, making
it more than 10 times faster than the gradient boosting implementation we used previously. XGBoost
supports various objective functions, including regression, classification and ranking. We used the
ranking objective to rank the probabilities of events being due to signal, as required for the
submissions. The two classes "s" and "b" were then separated by a threshold value, chosen carefully
after analysis, applied to the test entries sorted by their probability of being signal.
Unlike the gradient boosting classifier, the XGBoost classifier provides a special way of handling
missing values: it automatically learns the best direction to take when a value is missing. We fed -999
as the missing value to the XGBoost classifier, and by doing so we improved our scores on the public
leaderboard.
We then tuned the parameters. We used GridSearch from the scikit-learn package[9] to select a
better set of parameters than the defaults. With the following set of parameters we obtained good
results (see the sketch after the list). Since gradient boosting generalizes considerably better with lower
learning rates (heavier shrinkage), we reduced the default value of eta; and since lower learning rates
need more iterations, increasing the nrounds variable had a positive impact on our results. It is well
known that lower learning rates reduce overfitting.
eta = 0.05 
max_depth = 6 
silent = 1 
nthread = 16 
nround = 500
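The submissions themselves were produced with the R package, but the same setup can be sketched
with the Python xgboost API as below (the "rank:pairwise" objective string and the variable names are
assumptions; the parameter values are the ones listed above, with -999 passed as the missing-value
marker):

    import xgboost as xgb

    params = {
        "objective": "rank:pairwise",   # ranking objective (assumed equivalent of the R "rank" setting)
        "eta": 0.05,
        "max_depth": 6,
        "silent": 1,
        "nthread": 16,
    }
    dtrain = xgb.DMatrix(X, label=y, missing=-999)
    dtest = xgb.DMatrix(X_test, missing=-999)
    model = xgb.train(params, dtrain, num_boost_round=500)
    scores = model.predict(dtest)       # sorted scores are thresholded into "s" / "b"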
4. Results and Discussion 
As mentioned in the previous parts of this document, we tried multiple classification systems to
solve this challenge, with varying results:
1. Gradient Boosting 
2. Random Forests 
3. Neural Networks 
4. XGBoost 
We found that the XGBoost algorithm produced the best results for this problem. Using the
XGBoost algorithm as described above, we obtained a score of 3.64655, which gave us a rank of 437.
While we were able to obtain higher scores on the public leaderboard, those scores were
misleading about the overall predictive ability of our models. Our best submission on the private
leaderboard scored 3.64672 and was ranked 223; our reliance on the public leaderboard amounted to
overfitting, which caused us to drop in the private leaderboard, which determines the final positions.
The biggest issue with our process in this competition was the lack of good cross validation. We
relied too much on the public leaderboard to assess the quality of our models, to the point that we
could not avoid overfitting our predictions to the leaderboard. As the public leaderboard was computed
on only 18% of the test data, relying on it to gauge improvements led to overfitting. A valuable lesson
learned through this contest is the importance of maintaining good standards of cross validating our
predictions, which would have allowed us to perform much better.
One of the major challenges we faced during this competition was coming up with derived
features. Since we had no knowledge of high energy particle physics, we had to read several research
papers in that area in order to come up with the features described in Section 3.1.5, and in the process
we gained a considerable amount of knowledge in the field.
This competition was a valuable opportunity for us to learn important machine learning and data
mining concepts while contributing to a very important scientific cause. Through this competition we
gained a better understanding of the challenges in the field and of the methodologies used in practice
to overcome them.
5. References 
[1] http://www.lps.ens.fr/~laetitia/HIGGS.pdf
[2] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[3] http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[4] http://cran.r-project.org/web/packages/xgboost/index.html
[5] Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., ... & Schmidhuber, J. (2010). PyBrain. The Journal of Machine Learning Research, 11, 743-746.
[6] Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. Neural Networks, IEEE Transactions on, 1(2), 239-242.
[7] Leung, F. H. F., Lam, H. K., Ling, S. H., & Tam, P. K. S. (2003). Tuning of the structure and parameters of a neural network using an improved genetic algorithm. Neural Networks, IEEE Transactions on, 14(1), 79-88.
[8] http://www.pybrain.org/docs/api/optimization/optimization.html#population-based
[9] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
6. Appendix 
Appendix A: Public and private scores for Random Forest models and Logistic Regression models
Appendix B: Public and private scores for Gradient Boosting models
Appendix C: Public and private scores for some XGBoost models
Appendix D: Public and private scores for some Neural Network models

Contenu connexe

Tendances

STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftJonathan Fivelsdal
 
IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292HARDIK SINGH
 
2DFMT in the Range to & its Application with Some Function
2DFMT in the Range to       & its Application with Some Function2DFMT in the Range to       & its Application with Some Function
2DFMT in the Range to & its Application with Some FunctionIOSRJM
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ijsc
 
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...CSCJournals
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysisguest0edcaf
 
Data Mining using SAS
Data Mining using SASData Mining using SAS
Data Mining using SASTanu Puri
 
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...ijaia
 
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORY
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORYCATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORY
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORYijaia
 
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...ijfls
 
Hierarchical Clustering in Data Mining
Hierarchical Clustering in Data MiningHierarchical Clustering in Data Mining
Hierarchical Clustering in Data MiningYashraj Nigam
 
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...csandit
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7Birat Sharma
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classificationcsandit
 

Tendances (17)

STAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final DraftSTAT 897D Project 2 - Final Draft
STAT 897D Project 2 - Final Draft
 
IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292IJCSI-10-6-1-288-292
IJCSI-10-6-1-288-292
 
JEDM_RR_JF_Final
JEDM_RR_JF_FinalJEDM_RR_JF_Final
JEDM_RR_JF_Final
 
2DFMT in the Range to & its Application with Some Function
2DFMT in the Range to       & its Application with Some Function2DFMT in the Range to       & its Application with Some Function
2DFMT in the Range to & its Application with Some Function
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
ANALYTICAL FORMULATIONS FOR THE LEVEL BASED WEIGHTED AVERAGE VALUE OF DISCRET...
 
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
Data Mining using SAS
Data Mining using SASData Mining using SAS
Data Mining using SAS
 
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
A NEW PERSPECTIVE OF PARAMODULATION COMPLEXITY BY SOLVING 100 SLIDING BLOCK P...
 
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORY
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORYCATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORY
CATEGORY TREES – CLASSIFIERS THAT BRANCH ON CATEGORY
 
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...
INTERVAL TYPE-2 INTUITIONISTIC FUZZY LOGIC SYSTEM FOR TIME SERIES AND IDENTIF...
 
Hierarchical Clustering in Data Mining
Hierarchical Clustering in Data MiningHierarchical Clustering in Data Mining
Hierarchical Clustering in Data Mining
 
Fb35884889
Fb35884889Fb35884889
Fb35884889
 
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
 
Cluster spss week7
Cluster spss week7Cluster spss week7
Cluster spss week7
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classification
 

En vedette

Logistic regression with low event rate (rare events)
Logistic regression with low event rate (rare events)Logistic regression with low event rate (rare events)
Logistic regression with low event rate (rare events)Tejamoy Ghosh
 
System Configuration for UltraESB
System Configuration for UltraESBSystem Configuration for UltraESB
System Configuration for UltraESBAdroitLogic
 
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...MuleSoft
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandrySri Ambati
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning Shiraz316
 
IaaS - Infrastructure as a Service
IaaS - Infrastructure as a ServiceIaaS - Infrastructure as a Service
IaaS - Infrastructure as a ServiceRajind Ruparathna
 
Forecasting P2P Credit Risk based on Lending Club data
Forecasting P2P Credit Risk based on Lending Club dataForecasting P2P Credit Risk based on Lending Club data
Forecasting P2P Credit Risk based on Lending Club dataArchange Giscard DESTINE
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestHirak Sen Roy
 
Estimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishEstimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishArsalan Qadri
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsSalford Systems
 
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Magnify Analytic Solutions
 
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Sri Ambati
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Digital Businesses of the Future
Digital Businesses of the Future Digital Businesses of the Future
Digital Businesses of the Future MuleSoft
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMNYC Predictive Analytics
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeGilles Louppe
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval Venkata Reddy Konasani
 
GBM package in r
GBM package in rGBM package in r
GBM package in rmark_landry
 

En vedette (20)

Logistic regression with low event rate (rare events)
Logistic regression with low event rate (rare events)Logistic regression with low event rate (rare events)
Logistic regression with low event rate (rare events)
 
System Configuration for UltraESB
System Configuration for UltraESBSystem Configuration for UltraESB
System Configuration for UltraESB
 
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...
ADP: Driving Faster Customer Onboarding with MuleSoft - Michael Bevilacqua, V...
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark Landry
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning
 
IaaS - Infrastructure as a Service
IaaS - Infrastructure as a ServiceIaaS - Infrastructure as a Service
IaaS - Infrastructure as a Service
 
Forecasting P2P Credit Risk based on Lending Club data
Forecasting P2P Credit Risk based on Lending Club dataForecasting P2P Credit Risk based on Lending Club data
Forecasting P2P Credit Risk based on Lending Club data
 
Consumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random ForestConsumer Credit Scoring Using Logistic Regression and Random Forest
Consumer Credit Scoring Using Logistic Regression and Random Forest
 
Estimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit RishEstimation of the probability of default : Credit Rish
Estimation of the probability of default : Credit Rish
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForests
 
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
Logistic Modeling with Applications to Marketing and Credit Risk in the Autom...
 
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Digital Businesses of the Future
Digital Businesses of the Future Digital Businesses of the Future
Digital Businesses of the Future
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
Introduction to Modeling
Introduction to ModelingIntroduction to Modeling
Introduction to Modeling
 
Xgboost
XgboostXgboost
Xgboost
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
 
Model building in credit card and loan approval
Model building in credit card and loan approval Model building in credit card and loan approval
Model building in credit card and loan approval
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 

Similaire à Higgs Boson Machine Learning Challenge Group Project Report

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdfBeyaNasr1
 
Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objectsnexgentechnology
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertainnexgentech15
 
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTSNexgen Technology
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertainShakas Technologies
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityGon-soo Moon
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...cscpconf
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...csandit
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONijaia
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNINGMLReview
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 
SupportVectorRegression
SupportVectorRegressionSupportVectorRegression
SupportVectorRegressionDaniel K
 
Exploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems ProjectExploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems ProjectSurya Chandra
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...cscpconf
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind MapAshish Patel
 
Chapter5.pdf
Chapter5.pdfChapter5.pdf
Chapter5.pdfsravan66
 

Similaire à Higgs Boson Machine Learning Challenge Group Project Report (20)

Machine Learning.pdf
Machine Learning.pdfMachine Learning.pdf
Machine Learning.pdf
 
E1802023741
E1802023741E1802023741
E1802023741
 
Query Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain ObjectsQuery Aware Determinization of Uncertain Objects
Query Aware Determinization of Uncertain Objects
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
 
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
 
Query aware determinization of uncertain
Query aware determinization of uncertainQuery aware determinization of uncertain
Query aware determinization of uncertain
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Paper-Allstate-Claim-Severity
Paper-Allstate-Claim-SeverityPaper-Allstate-Claim-Severity
Paper-Allstate-Claim-Severity
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
 
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...
 
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATIONGENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION
 
report
reportreport
report
 
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
OPTIMIZATION AS A MODEL FOR FEW-SHOT LEARNING
 
debatrim_report (1)
debatrim_report (1)debatrim_report (1)
debatrim_report (1)
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 
SupportVectorRegression
SupportVectorRegressionSupportVectorRegression
SupportVectorRegression
 
Exploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems ProjectExploring Support Vector Regression - Signals and Systems Project
Exploring Support Vector Regression - Signals and Systems Project
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
 
Machine learning Mind Map
Machine learning Mind MapMachine learning Mind Map
Machine learning Mind Map
 
Chapter5.pdf
Chapter5.pdfChapter5.pdf
Chapter5.pdf
 

Dernier

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
Higgs Boson Machine Learning Challenge Group Project Report

The objective of the challenge is therefore to find a function of the weights, or in simple terms, to predict the weights for the test set after training a machine learning model on the training set. Based on the predicted weight it is then possible to assign each event to a class, since two distinct ranges of weights correspond to the two classes.

Figure 1: Graphical representation of a Higgs boson decaying to two tau particles in the ATLAS detector
3. Approach Followed

Under this section we discuss how we preprocessed the training data before feeding it to a machine learning model, and which machine learning techniques we used for training.

3.1 Preprocessing

Data preprocessing plays an important part in any machine learning challenge. In the Higgs Boson Machine Learning Challenge we used several preprocessing methods, which are described in this section.

3.1.1 Understanding the nature of the given variables

Before starting the preprocessing work, we tried to identify any directly visible relationships between the class label and the variables. To do so, we plotted the data graphically to expose any information directly associated with the classification. The following figures (Figure 2 to Figure 5) show how the classification behaves with respect to the value ranges of a few of the variables.

Figure 2: Classification relative to the distribution of the variable DER_lep_eta_centrality
Figure 3: Classification relative to the distribution of the variable Weight

Figure 4: Classification relative to the distribution of the variable DER_mass_MMC
Figure 5: Classification relative to the distribution of the variable PRI_lep_eta

Through these visualizations we found that no variable other than the weight is directly associated with the class label. From this we concluded that if we predict the weight for the given test events, we can perform both the classification and the ranking at the same time. A minimal plotting sketch of the kind used for these figures is given below.
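The class-conditional histograms above can be reproduced with pandas and matplotlib. The sketch below is illustrative only: it assumes the training CSV is named training.csv and uses the competition's standard column names (Label, DER_mass_MMC).

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the Kaggle training file (assumed to be training.csv with the standard columns).
    train = pd.read_csv("training.csv")

    # Compare the distribution of one variable for signal ("s") versus background ("b"),
    # ignoring the -999 placeholder used for missing values.
    col = "DER_mass_MMC"
    valid = train[train[col] != -999.0]
    plt.hist(valid.loc[valid["Label"] == "s", col], bins=100, alpha=0.5, label="signal")
    plt.hist(valid.loc[valid["Label"] == "b", col], bins=100, alpha=0.5, label="background")
    plt.xlabel(col)
    plt.ylabel("count")
    plt.legend()
    plt.show()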
3.1.2 Handling missing values

In the data given for the competition, missing values are stored as -999. On exploring the data we discovered that it contains a large number of missing values.

Figure 6: Variable statistics

As Figure 6 shows, many columns such as DER_deltaeta_jet_jet and DER_massdelta_jet_jet contain -999 for more than half of their values. It was clear that dropping training rows containing missing values was not an option, because we still need to predict for test entries that also contain missing values. So, as a first approach, we tried dropping the variables in which missing values appear. This did not improve the results because of the large number of missing values: after dropping those variables there was not enough data left to predict from, and important variables and relationships disappeared purely for the sake of handling missing values. It is therefore not a good approach.

The next approach we considered was traditional imputation, but the results were not good either. In this case we substituted each missing value with the average of the corresponding column, ignoring the -999 entries when computing the averages (an illustrative sketch of this step appears below).
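As an illustration only, counting the -999 placeholders and the mean-imputation variant we tried could look like the following pandas sketch; the file name and the way feature columns are selected are assumptions.

    import pandas as pd

    train = pd.read_csv("training.csv")
    features = [c for c in train.columns if c.startswith("DER_") or c.startswith("PRI_")]

    # Count how often the -999 placeholder appears in each feature column.
    missing_counts = (train[features] == -999.0).sum().sort_values(ascending=False)
    print(missing_counts.head())

    # Mean imputation: replace -999 with the per-column mean computed over valid entries only.
    imputed = train.copy()
    for col in features:
        valid_mean = train.loc[train[col] != -999.0, col].mean()
        imputed[col] = train[col].replace(-999.0, valid_mean)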
The main reason performance did not improve is that these are "actually" missing values: a value for that feature simply cannot exist for that particular training instance (for example, jet-related quantities when the event has no jets). The best way to handle them is therefore to interpret -999 as a special missing value and use algorithms that treat -999 as its own category.

3.1.3 Converting Data Types

To apply the XGBoost and gradient boosting techniques, the label value has to be numeric. We therefore converted the Label column to 0/1 during preprocessing, using 0 when the label equals "b" and 1 when it equals "s".

3.1.4 Data Normalization

As Figure 6 shows, the value distributions vary widely between columns. For example, DER_pt_h ranges from 0 to 2835, whereas DER_met_phi_centrality ranges only from -1.4 to +1.4. To guarantee stable convergence of the weights and biases in our models we normalized all the columns. We used min-max normalization, which maps every value in a column to a value between 0 and 1. A short sketch of this label conversion and normalization step is given below.
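A minimal sketch of the label conversion and min-max normalization described above, assuming a pandas DataFrame named train loaded from training.csv; the -999 placeholders are excluded from the min/max computation so they are not stretched into the [0, 1] range.

    import pandas as pd

    train = pd.read_csv("training.csv")

    # Convert the text label to a numeric target: background "b" -> 0, signal "s" -> 1.
    train["Label"] = (train["Label"] == "s").astype(int)

    # Min-max normalization of the feature columns, ignoring the -999 missing-value marker.
    features = [c for c in train.columns if c.startswith("DER_") or c.startswith("PRI_")]
    for col in features:
        valid = train.loc[train[col] != -999.0, col]
        lo, hi = valid.min(), valid.max()
        scaled = (train[col] - lo) / (hi - lo)
        # Keep -999 as-is so downstream code can still recognize missing entries.
        train[col] = scaled.where(train[col] != -999.0, -999.0)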
3.1.5 Feature Selection and Deriving Features

Figure 7 shows the correlation between the label and the other features in the data set.

Figure 7: The correlation between the label and other features in the data set

As Figure 7 shows, some variables such as PRI_tau_eta can be dropped when building the model, since they are insignificant with respect to the Label value. The diagram also shows that no single variable can be considered strongly significant for the Label value we have to predict, so deriving new features was required. We identified four features [1] that can be important to our model:

assymenj = (MET − MHT) / (MHT + MET)
dijet = sum of the two jet masses
deltaphi = jet1_phi − jet2_phi
deltaphimet = (jet1_phi + jet2_phi) / 2

The feature dijet is already included in the data sets as DER_mass_jet_jet. We derived the other three variables from the available data as follows. Since MHT (the missing energy calculated from the jets) was not readily available, we used a derived variable (estimatedMHT) that is proportional to this quantity:

estimatedMHT = PRI_jet_all_pt − PRI_jet_leading_pt − PRI_jet_subleading_pt
assymenj = (PRI_met − estimatedMHT) / (PRI_met + estimatedMHT)
deltaphi = PRI_jet_leading_phi − PRI_jet_subleading_phi
deltaphimet = (PRI_jet_leading_phi + PRI_jet_subleading_phi) / 2

We also identified, using a greedy approach, a variable that had a correlation of about 0.2 with the label:

Special = DER_mass_MMC × DER_pt_ratio_lep_tau / (DER_sum_pt + 0.0000001)

Adding these columns in the preprocessing stage improved our public leaderboard score for submissions using the XGBoost algorithm. A short illustrative sketch of this computation is given at the end of this subsection.

The initial version of these new variables simply computed the values from the relevant columns without checking whether any of the inputs were -999; in that case the variable-creation step took -999 as a valid value and calculated a result from it. Since -999 is not a valid value, but only an indicator that a value is unavailable, we decided to filter out entries with invalid inputs: we changed the variable-creation step to output -999 whenever at least one of the inputs was invalid. Unfortunately the results were not as expected. From an analysis of the change and the results, we concluded that the success rate decreased because diversity was eliminated. To clarify this, consider three example entries.
EventID   Value 1   Value 2   New variable neglecting -999   New variable considering -999
1         -999      2.14      -996.86                        -999
15        1.52      -999      1000.52                        -999
122       -999      -999      0                              -999

For these three entries, the variable that does not check for invalid inputs produces three different values, and the range of those values is directly tied to which inputs are invalid. In the variable created with consideration of the invalid inputs, all three entries share the same value, -999. This shows why the success rate decreased with the new variable: it eliminated the variability of the previous variable and thereby hid information that is important for classification. The variables built before considering the invalid inputs appear to have introduced new measurements of the relations among groups of variables.

With this improvement in mind, we explored the possibility of creating more derived variables to capture the collective relationships among the primitive variables. We created a few more columns by randomly combining primitive variables, to see whether new variables carrying combined information about other variables could improve the success rate. Instead, the results lowered the success rate. We therefore concluded that introducing variables with a known relationship to the classification may improve the success rate, while others may decrease it by introducing unimportant relationships.
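The derived columns above could be computed as in the following sketch. The helper name add_derived_features and the propagate_missing switch are ours, introduced only to mirror the two variants discussed in the text; the input column names are the standard challenge columns.

    import pandas as pd

    train = pd.read_csv("training.csv")

    def add_derived_features(df, propagate_missing=False):
        # MHT is not given directly, so use a quantity proportional to it.
        est_mht = df["PRI_jet_all_pt"] - df["PRI_jet_leading_pt"] - df["PRI_jet_subleading_pt"]
        df["assymenj"] = (df["PRI_met"] - est_mht) / (df["PRI_met"] + est_mht)
        df["deltaphi"] = df["PRI_jet_leading_phi"] - df["PRI_jet_subleading_phi"]
        df["deltaphimet"] = (df["PRI_jet_leading_phi"] + df["PRI_jet_subleading_phi"]) / 2
        df["Special"] = (df["DER_mass_MMC"] * df["DER_pt_ratio_lep_tau"]
                         / (df["DER_sum_pt"] + 1e-7))
        if propagate_missing:
            # Second variant: mark a derived value as missing when any of its inputs is -999.
            inputs = ["PRI_jet_all_pt", "PRI_jet_leading_pt", "PRI_jet_subleading_pt",
                      "PRI_met", "PRI_jet_leading_phi", "PRI_jet_subleading_phi",
                      "DER_mass_MMC", "DER_pt_ratio_lep_tau", "DER_sum_pt"]
            any_missing = (df[inputs] == -999.0).any(axis=1)
            for col in ["assymenj", "deltaphi", "deltaphimet", "Special"]:
                df.loc[any_missing, col] = -999.0
        return df

    train = add_derived_features(train)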
3.2 Training Techniques

3.2.1 Random Forest Classifier

We used a random forest classifier for the Higgs Boson challenge in the earlier stages, developing the solution with the scikit-learn package for Python; RandomForestClassifier lives in the sklearn.ensemble module. The basic functionality of a random forest is as follows [2]. Instead of building a single classification tree, it builds a number of trees. A new input is given to all of the trees, and the final answer is selected by a voting mechanism: the result from each tree counts as a vote, and the class with the most votes wins.

When building the trees in the random forest, a few guidelines are followed. If there are N cases in the training set, then N sample cases (drawn with replacement) are used to train each tree. At each node, m variables are selected at random from the total of M input variables. The trees are grown without any pruning.

A major feature of the random forest classifier is that it runs efficiently on large data sets and can handle a large number of input variables. It can handle missing values effectively and maintain accuracy when a large proportion of the data is missing. Furthermore, it can identify which variables are most important and the relationships between variables, and it does not easily overfit to the inputs.

When training the trees, about one third of the data is left out and used as out-of-bag data to obtain running unbiased error estimates and variable importances; the rest of the data is used as the bootstrap sample to train the trees. The out-of-bag data for each tree is pushed back through it to get a classification, and the class with the most out-of-bag votes is taken. This is used as an error estimate for the random forest classifier.
Measuring variable importance is another important feature of random forest classification. It is done by running the out-of-bag data through each tree in the forest and counting the votes for the correct class; the values of the variable under test are then changed (permuted), the out-of-bag data is run through the trees again, and the votes for the correct class are counted once more. Subtracting the permuted-input votes from the original votes and averaging over the forest gives a score for the importance of the variable. If the number of variables in the data set is very high, the forest can first be run on all variables and then run again using only the most important ones.

Proximities are also an important feature of the random forest classifier. They are formed by creating an N x N matrix over all the data, including the training and out-of-bag data. Since an N x N matrix is not feasible for large data sets, an N x T matrix can be formed instead, where T is the number of trees in the forest. The random forest classifier has two methods for filling missing values: the faster way is to fill missing values with the median, while the more accurate way is to fill them with rough estimates first and then run the forest to compute the proximities. Outliers are identified using the proximity values; entries in a class with small proximities are flagged as outliers.

In the random forest classifier of the scikit-learn package there are several parameters that can be used to tune the results [3]. The parameter n_estimators specifies the number of trees in the forest. max_depth specifies the maximum depth of the trees; its default value is None, in which case nodes are expanded until all leaves are pure. oob_score is a boolean parameter that specifies whether to use out-of-bag samples. Several methods are available for the prediction work: fit builds the forest from the training set, predict produces results for the test data, and transform can be used to reduce the input data matrix to the most important features. A minimal usage sketch is given below.
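A minimal sketch of the scikit-learn usage described above; the feature matrix X and label vector y are assumed to come from the preprocessed training data, and the parameter values are only illustrative.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("training.csv")
    features = [c for c in train.columns if c.startswith("DER_") or c.startswith("PRI_")]
    X = train[features].values
    y = (train["Label"] == "s").astype(int).values

    # n_estimators: number of trees; oob_score: estimate accuracy from out-of-bag samples.
    clf = RandomForestClassifier(n_estimators=150, max_depth=None, oob_score=True, n_jobs=-1)
    clf.fit(X, y)

    print("Out-of-bag score:", clf.oob_score_)
    print("Top feature importances:", sorted(zip(clf.feature_importances_, features), reverse=True)[:5])

    # Predict class probabilities for new events; proba[:, 1] is the estimated signal probability.
    proba = clf.predict_proba(X[:10])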
  • 15. method called transform which can be used to reduce the input data matrix to the most important features. Initially we tried Random forest classifier to predict the Label value of the data as Signal (s) or Background (b) without predicting the weight value. That way we were not able to have a rank value for the test data. And also for the initial submission we removed derived features from the training and then we added them back later. Then we made submissions with replacing the values with ­999 from the average value of the columns and also tried removing those columns from training and test data sets. But Random forest classifier did not gave much good results with either of those methods. The maximum we were able to score with random forest method was 2.90576 in the private score with the n_estimator value as 150. Then when we tried to estimate the weight value using the random forest classifier it failed because it took huge amount of memory. So we decided to move for other available options to have better results. 3.2.2 Gradient Boost Classifier Another classifier we tested in the initial states was the Gradient Boost Classifier. Gradient boosting algorithms use an ensemble of weak decision trees built to optimize a customizable loss function. Trees are built using boosting in a staged manner. Gradient boosting classifiers can be used for both regression and classification. Gradient boosting algorithms can handle data of mixed types and are very robust to outliers. We used a Gradient boosting regression trees algorithm from the Scikit­learn library in python for this problem. This model used all the features in the data set to train the classifier. To improve the accuracy we used hyper parameter tuning along with stratified cross validation to set the best values for the parameters. We also tried using multiple loss functions such as the default ‘deviance’ function as well as the AMS function used in this competition. Using the AMS function as the loss function improved our
3.2.3 Neural Networks

Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. Algorithms such as backpropagation use gradient descent to tune the network parameters to best fit a training set of input-output pairs. Neural network learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies. We used the PyBrain [5] Python library to build a neural network trained with the backpropagation algorithm. While training the neural network we faced a number of problems:

1. Number of hidden layers to be used

none: only capable of representing linearly separable functions or decisions.
1: can approximate any function that contains a continuous mapping from one finite space to another.
2: can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy.

The list above summarizes the knowledge we acquired by going through various research papers. Unfortunately we were unable to find a specific method to determine the number of hidden layers.
We therefore tested various numbers of hidden layers, ranging from 2 to 50; we could not go further because of the huge amount of time taken by the network training phase.

2. Number of neurons for each hidden layer

We were unable to find any specific formula to calculate the number of neurons in a particular hidden layer, although we found many rule-of-thumb methods for choosing it, such as the following:
● The number of hidden neurons should be between the size of the input layer and the size of the output layer.
● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
● The number of hidden neurons should be less than twice the size of the input layer.
We applied the above rules, and additionally tried to choose the number of neurons in a hidden layer based on a combination of the prime number series and the Fibonacci number series.

3. Neural network training time

Training the neural network took a lot of time. As a last resort we tried genetic algorithms [7][8] and pruning algorithms [6] to optimize the neural network, but the result was not satisfactory.

4. How to decide the cut-off mark for signal or background noise

The output of the neural network was a floating-point value between 0 and 1: values closer to 1 indicate signal and values closer to 0 indicate background noise. Using 10-fold cross-validation we found that values above 0.65 should be classified as signal and values below 0.65 as background noise. A minimal PyBrain sketch of this setup follows below.

However, all the predictions obtained through the neural network model performed poorly compared to the other models during cross-validation.
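The following is a rough PyBrain sketch of the setup described above, written from the library's documented API; the layer sizes, learning rate, and file name are illustrative assumptions rather than the team's actual configuration.

    import pandas as pd
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.structure.modules import SigmoidLayer
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    train = pd.read_csv("training.csv")
    features = [c for c in train.columns if c.startswith("DER_") or c.startswith("PRI_")]
    X = train[features].values
    y = (train["Label"] == "s").astype(float).values

    # One input neuron per feature, an illustrative hidden layer, and a sigmoid output in (0, 1).
    net = buildNetwork(len(features), 20, 1, outclass=SigmoidLayer)

    ds = SupervisedDataSet(len(features), 1)
    for xi, yi in zip(X, y):
        ds.addSample(xi, [yi])

    trainer = BackpropTrainer(net, ds, learningrate=0.01)
    for epoch in range(10):
        print("epoch", epoch, "training error:", trainer.train())

    # Apply the 0.65 cut-off chosen by cross-validation to label an event.
    output = net.activate(X[0])[0]
    label = "s" if output > 0.65 else "b"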
3.2.4 XGBoost Classifier

We used the xgboost package [4] for the R language, which implements an extreme gradient boosting classifier. Extreme gradient boosting is an efficient and scalable implementation of the gradient boosting framework described earlier. The package includes an efficient linear model solver and tree learning algorithms, and it can automatically perform parallel computation with OpenMP, making it more than 10 times faster than the gradient boosting implementation we used previously. XGBoost supports various objective functions, including regression, classification and ranking. We used the ranking objective in order to rank the probabilities of the events being due to signals, as required for the submissions. The two classes, "s" and "b", were then separated by a threshold, chosen carefully after analysis, applied to the test entries sorted by their probability of being signal.

Unlike the gradient boosting classifier, the XGBoost classifier provides a special way of handling missing values: it automatically learns the best direction to take in a tree split when a value is missing. We fed -999 to the XGBoost classifier as the missing-value marker, and by doing so we improved our public leaderboard scores. We then tuned the parameters, using GridSearch from the scikit-learn package [9] to select a better set of parameters than the defaults. With the following set of parameters we obtained good results. Since gradient boosting dramatically improves the model's generalization ability with lower learning rates (heavier shrinkage), we reduced the default value of eta; since lower learning rates need more iterations, increasing the nround variable had a positive impact on our results. It is a known fact that lower learning rates reduce overfitting.

eta = 0.05
max_depth = 6
silent = 1
nthread = 16
nround = 500

An equivalent sketch of this setup is given below.
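Since the code sketches in this report use Python, the following is a rough equivalent of the R setup above written with the Python xgboost binding; it is a sketch under that assumption, not the team's actual R code, and "rank:pairwise" is one plausible choice for the ranking objective mentioned above.

    import pandas as pd
    import xgboost as xgb

    train = pd.read_csv("training.csv")
    features = [c for c in train.columns if c.startswith("DER_") or c.startswith("PRI_")]
    X = train[features].values
    y = (train["Label"] == "s").astype(int).values

    # missing=-999.0 tells XGBoost to treat -999 as a missing value and learn the
    # best default split direction for it.
    dtrain = xgb.DMatrix(X, label=y, missing=-999.0)

    params = {
        "objective": "rank:pairwise",  # ranking objective; "binary:logitraw" is another common choice
        "eta": 0.05,
        "max_depth": 6,
        "silent": 1,
        "nthread": 16,
    }
    bst = xgb.train(params, dtrain, num_boost_round=500)

    # Scores for ranking; events are sorted by score and split into "s"/"b" at a tuned threshold.
    scores = bst.predict(dtrain)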
4. Results and Discussion

As mentioned in the previous parts of this document, we tried multiple classification systems to solve this challenge, with varying results:
1. Gradient Boosting
2. Random Forests
3. Neural Networks
4. XGBoost

We found that the XGBoost algorithm produced the best results for this problem. Using XGBoost as described above, we obtained a final score of 3.64655, which gave us a rank of 437. Although we achieved higher scores on the public leaderboard, they were misleading about the overall predictive ability of our models: our best public-leaderboard submission, which scored 3.64672 and was ranked 223 there, was a result of overfitting, and this caused us to drop on the private leaderboard, which determines the final positions.

The biggest issue with our process in this competition was therefore the lack of good cross-validation. We relied so heavily on the public leaderboard to assess the quality of our models that we were unable to avoid overfitting our predictions to it. Since the public leaderboard was computed from only 18% of the test data, relying on it to gauge improvements led to overfitting. A valuable lesson learned through this contest is the importance of maintaining good standards of cross-validation for our predictions, which would have allowed us to perform much better. A sketch of the evaluation metric we should have computed locally, the approximate median significance (AMS), is given below.
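For reference, the competition's evaluation metric, the approximate median significance (AMS), can be computed locally on a held-out fold as in the sketch below. Here s and b are the sums of weights of the true signal and true background events that the model selects as signal, and b_reg = 10 is the regularization term fixed by the challenge; the example weight sums are hypothetical.

    import math

    def ams(s, b, b_reg=10.0):
        # Approximate median significance as defined by the challenge:
        # AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))
        return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

    # Example with hypothetical weight sums from a validation fold.
    print(ams(s=600.0, b=40000.0))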
One of the major challenges we faced during this competition was coming up with derived features. Since we had no background in high-energy particle physics, we had to read a couple of research papers in that area in order to come up with the features mentioned in section 3.1.5, and in the process we gained a considerable amount of knowledge about the field. This competition was a valuable opportunity for us to learn important machine learning and data mining concepts while contributing to a very important scientific cause. Through it we gained a better understanding of the challenges in the field and of the methodologies practically used to overcome them.
5. References

[1] http://www.lps.ens.fr/~laetitia/HIGGS.pdf
[2] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[3] http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[4] http://cran.r-project.org/web/packages/xgboost/index.html
[5] Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., ... & Schmidhuber, J. (2010). PyBrain. The Journal of Machine Learning Research, 11, 743-746.
[6] Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239-242.
[7] Leung, F. H. F., Lam, H. K., Ling, S. H., & Tam, P. K. S. (2003). Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1), 79-88.
[8] http://www.pybrain.org/docs/api/optimization/optimization.html#population-based
[9] Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR, 12, 2825-2830.
6. Appendix

Appendix A: Public and private scores for Random Forest models and Logistic Regression models

Appendix B: Public and private scores for Gradient Boosting models

Appendix C: Public and private scores for some XGBoost models

Appendix D: Public and private scores for some Neural Network models