International Journal of Artificial Intelligence & Applications (IJAIA) Vol.10, No.5, September 2019
DOI: 10.5121/ijaia.2019.10504
A HYBRID LEARNING ALGORITHM IN AUTOMATED
TEXT CATEGORIZATION OF LEGACY DATA
Dali Wang¹, Ying Bai², and David Hamblin¹
¹Christopher Newport University, Newport News, VA, USA
²Johnson C. Smith University, Charlotte, NC, USA
ABSTRACT
The goal of this research is to develop an algorithm to automatically classify measurement types from
NASA's airborne measurement data archive. The product has to meet specific metrics in terms of accuracy,
robustness and usability, as the initial decision-tree based development showed limited applicability due
to its resource-intensive characteristics. We have developed an innovative solution that is much more
efficient while offering comparable performance. As in many industrial applications, the available data
are noisy and correlated, and a wide range of features is associated with the type of measurement to be
identified. The proposed algorithm uses a decision tree to select features and determine their weights. A
weighted Naive Bayes is used due to the presence of highly correlated inputs. The development has been
successfully deployed at an industrial scale, and the results show that it is well-balanced in terms of
performance and resource requirements.
KEYWORDS
classification, machine learning, atmospheric measurement
1. BACKGROUND
Beginning in the 1980s, NASA (National Aeronautics and Space Administration) introduced
tropospheric chemistry studies to gather information about the contents of the air. These studies
involve a data gathering process using airplanes equipped with on-board instruments, which take
multiple measurements over specific regions of the atmosphere. These atmospheric measurements are
complemented by other sources, such as satellites and ground stations. The data collection efforts,
called missions or campaigns, are supported by the federal government with the
goal of collecting data for scientific studies on climate and air quality issues. For the last two
decades, data collected have been compiled into a single format, ICARTT (International
Consortium for Atmospheric Research on Transport and Transformation) [1], which is designed to
allow for sharing across the airborne community, with other agencies, universities, or any other
group working with the data.
A typical ICARTT file contains, at a minimum, about 30 fields of metadata that describe the
atmospheric data gathered and the who/where/what aspects of the data gathering process, along with
additional comments supplied by the principal investigator that are pertinent to the data [1]. Over
the past few years, the federal government has launched the Open Data Initiative, part of which
aims at scaling up open data efforts in the climate sector. Accordingly, NASA has made it a
priority to open up its data archive to the general public in order to promote its usage in
scientific studies. As part of this initiative, NASA has introduced a centralized data archive
service, which allows others to quickly find certain information within the data archive by
filtering data according to specific metadata components [2].
It is commonly believed that the most important piece of information in an ICARTT data archive is the
type of measurement taken by an experiment. Examples include aerosol (such as the aerosol particle
absorption coefficient at blue wavelengths), trace gas (such as the ozone mixing ratio), and cloud
property (such as cloud surface area density). In order to uniquely define each
measurement, NASA has adopted a naming system called “common names”; each common name
is used for indexing a measurement with a uniquely-defined physical or chemical characteristic.
Unfortunately, the common name information is not directly presented in the ICARTT data file.
Instead, each measurement is given a “variable name”. Principal investigators choose each variable
name when compiling the ICARTT files, and, together with the mission-specific metadata, the
variable name is not easily searchable. Much of the atmospheric data associated with a given
variable may be equivalent to data gathered in different missions, conducted by separate
investigators, save for the distinguishing metadata. For example, the
gathering of formaldehyde (CH2O) in the troposphere is done in several missions across a number
of years, all with differing variable names and through different sources. However, in a centralized
data archive, all of these variables should be given the same common name to identify them as
CH2O data. This more general common name allows a user to easily access similar data across
missions by filtering based on this name.
The retrieval of a common name for each measurement variable was done by domain experts at
NASA. The task is rather complicated, as many fields in an ICARTT file contribute to the common
name of a measurement: variable name, variable description, instrument used, measurement
unit, etc. There is other information, although less intuitive, that may also contribute to the
naming of each measurement. One such example is the principal investigator's (PI) name, because
each person has his or her own tendencies in naming a variable. In addition, the data are noisy –
missing fields and typos are common – as there has been no rigorous vetting process for metadata
fields. As a result, there are only a few domain experts at NASA who can accurately identify a common
name by studying the ICARTT file under consideration. Given the vast amount of data
accumulated over the past 30 years, it is essential to automate the information retrieval process.
2. OVERVIEW OF SOLUTIONS
In its most common form, the common naming process is to choose a specific label (i.e., a
common name) for a measurement variable from a list of predefined labels, given the metadata
associated with this variable. It belongs to the category of label classification problems. Textual
similarity comparison using fuzzy logic is a common approach for deciding on matched and non-
matched records [3]. It presents an opportunity for common name matching; however, given the
variations in the type of metadata associated with each common name, and the large number of
common names (300+), such an approach is impractical to implement.
A forward-chaining expert system, which belongs to a category of artificial intelligence
algorithms, may provide another viable solution [4]. This approach was attempted by summarizing
the rules used by domain experts and synthesizing a knowledge base. It turned out not to be very
effective, as the rules are complex, non-inclusive, and sometimes contradictory. Also, whenever
new missions are conducted, the rules in the inference engine must be expanded - a tedious
task.
Solutions to similar problems include Artificial Neural Networks (ANNs), especially perceptron-
based networks; Support Vector Machines (SVMs); and instance-based learning, especially the k-Nearest
Neighbor (k-NN) algorithm. Generally, SVMs and ANNs tend to perform much better when
dealing with multi-dimensional and continuous-valued features [5]. However, this problem only
involves single-dimensional and discrete features. The k-NN algorithm is very sensitive to
irrelevant features and intolerant of noise. These characteristics make it unattractive for this problem,
as not all features are relevant to every common name, and the information in general is very noisy due
to errors and inconsistencies in ICARTT files.
Logic-based systems such as decision trees tend to perform better when dealing with
discrete/categorical features. One of the most useful characteristics of decision trees is their
comprehensibility. However, typical decision trees, such as C4.5, are unreliable for classifying the
airborne data: preliminary testing with these algorithms on mission data led to low overall accuracy,
due to their preference for binary and discrete features. Instead, a different type of decision
tree is considered. Classification and regression trees (CART) have been used in the statistics
community for over thirty years [6], and are well-proven methods for creating predictive models in
the form of a “tree”, where branches are features and the end of a branch - the leaf - is the chosen
class. Classification trees in particular have provided good results for datasets with categorical
classes and features [7], whereas regression trees deal with continuous values, unlike the atmospheric
dataset. Classification trees are constructed by partitioning features into good splits, using some
metric to determine the “best” split. Because they can work with datasets filled with categorical
information while seeking a categorical outcome, classification trees align with the common naming
process. Favorable preliminary results from implementing a classification tree model led to further
investigation into this algorithm.
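To make the preliminary experiment concrete, the sketch below fits a small CART classification tree on toy ICARTT-style metadata. The field values and common names are invented for illustration, and scikit-learn stands in for the R implementation actually used in this work; CART implementations typically require categorical features to be encoded numerically first.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Each row: (variable name, instrument, unit) -- all categorical metadata.
# Values are invented for illustration only.
X_raw = [
    ["O3_ppbv", "chemiluminescence", "ppbv"],
    ["Ozone", "chemiluminescence", "ppbv"],
    ["CH2O_LIF", "laser-induced fluorescence", "pptv"],
    ["HCHO", "laser-induced fluorescence", "pptv"],
]
y = ["Ozone", "Ozone", "CH2O", "CH2O"]  # common names (class labels)

# Encode the categorical metadata, then fit a CART classification tree.
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(tree.predict(enc.transform([["Ozone", "chemiluminescence", "ppbv"]])))
```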
We first implemented the CART, Boosted CART and Bagged CART algorithms as solutions for
the common naming problem [8]. The algorithms provide the accuracy and robustness that
we aimed for. However, there are two major roadblocks to deploying the
implementation. The first is the execution time of the algorithms: the running time ranges
from hours (for CART) to over 10 hours (for the Boosted and Bagged versions) using normal office
computing power. This might be acceptable if training were a one-time event. However, the
training is necessary whenever a new mission is added to the data archive, because newer
missions may require the insertion of new common names.
The time complexity of the CART-based algorithms is shown in Table 1, where the complexity is a
function of D (data space), L (label or class space), F (feature space), Vf (value space for each
feature), and M (number of trees).
Table 1. Time complexity of algorithms

Algorithm     CART                       Boosted CART                  Bagged CART
Complexity    O(|D|·|L|·|F|·2^|Vf|)      O(M·|D|·|L|·|F|·2^|Vf|)       O(M·|D|·|L|·|F|·2^|Vf|)
As shown by the table's entries, CART is limited by the speed at which the algorithm constructs
its classification tree, which can be bogged down by the introduction of large categorical features -
like principal investigator (PI) or instrument - with their many possible values. As it recursively
partitions each feature to create a split, CART has to check 2^(n-1) - 1 candidate splits, where n is
the number of possible values for a feature. The running time will only get worse in the future as more
missions are included, because the possible values of a feature (such as the list of PIs and
instruments) can only grow over time.
The second drawback of the CART-based algorithms is their memory-intensive nature. The
classifiers are usable for all of the datasets, except for the enhanced (boosted and bagged)
CART algorithms on the largest dataset - aerosol. Neither bagging nor boosting is able to execute
on the aerosol dataset: due to the nature of the algorithms, including the large number of
predictors and samples, the training models built exceed the memory limitations of 64-bit R. While
performing well for the other datasets with fewer samples and predictors, the inability to produce
results for aerosol shows the limited usability of the enhanced CART methods.
The limitations of the CART-based algorithms make them less desirable for our application. We
were therefore tasked with exploring other algorithms that provide the needed accuracy and
robustness while demonstrating time and memory efficiency.
3. PROPOSED SOLUTION
A statistics-based Bayesian approach is investigated in this research. The major advantages of
the Naive Bayes (NB) classifier are its linear time complexity and small storage space requirement;
both are highly desirable for this application. In addition, the algorithm is very transparent, as the
output is easily grasped by users through a probabilistic explanation [5][9].
The Naive Bayes algorithm assumes all features are independent, so its accuracy is affected by
the presence of feature correlation: if two features are fully redundant (i.e., 100%
correlated), the same evidence is effectively counted twice in the probability calculation for a class.
In the common naming process, some features are highly correlated. For example, a variable and the
instrument used to measure it are closely related.
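A small numerical sketch, using made-up probabilities, shows how a duplicated feature skews the posterior:

```python
# Made-up conditional probabilities of one feature value under two classes.
p_f_c1, p_f_c2 = 0.9, 0.5   # P(f | class 1), P(f | class 2)
prior = 0.5                  # equal class priors

def posterior_c1(n_copies):
    """Naive Bayes posterior of class 1 when the same feature is counted n times."""
    s1 = prior * p_f_c1 ** n_copies
    s2 = prior * p_f_c2 ** n_copies
    return s1 / (s1 + s2)

print(posterior_c1(1))  # ~0.643: the feature counted once
print(posterior_c1(2))  # ~0.764: a fully redundant copy exaggerates the posterior
```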
In general, there are two categories of approaches that can be used to reduce the impact of feature
correlation while still maintaining a single interpretable model. One is to use a pre-processing
algorithm to remove the correlation from the training data set [10]. Adding this pre-processing step
increases the complexity of the overall operation. In addition, the algorithms in this category have
complexity tied to the size of the feature space (which is fairly large in our case). Therefore, we
choose the second type of approach - treating features differently by using a weighting scheme - which
is illustrated in the next few sections.
3.1 Weighted Naive Bayes
The Naive Bayes algorithm is derived from Bayes' theorem, describing the probability of a label
(or class) based on prior knowledge. Under the independence assumption it is formulated as:

    P(cn | f1, ..., fn) ∝ P(cn) · ∏i P(fi | cn)

where cn is a common name and fi is a feature.
The first assumption of Naive Bayes is that all features are independent. The second assumption,
which follows from feature independence, is that all of the features associated with a label are
equally important. These assumptions, when violated, negatively affect the algorithm's accuracy and
make Naive Bayes less tolerant of redundant features. Numerous methods have been proposed to
improve the performance of the algorithm by relaxing the feature-independence assumption. A
comprehensive study by Wu [11] on 36 standard UCI data sets has shown that weighted Naive
Bayes provides higher classification accuracy than simple Naive Bayes.
Weighting Naive Bayes is accomplished with a single addition - a per-feature weight wi applied as
an exponent to the probability of each feature associated with a common name:

    P(cn | f1, ..., fn) ∝ P(cn) · ∏i P(fi | cn)^wi
For the purposes of common name matching, the probabilities are converted from linear to logarithmic
form for calculation; the goal is to find the label x (common name) with the highest probability:

    x = argmax over cnk of [ log P(cnk) + Σi wi · log P(fi | cnk) ]

where {cnk} is the set of all possible common names.
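A minimal sketch of this scoring step is given below, assuming the priors and per-feature conditional probabilities have already been estimated from training counts. The function and parameter names, and the floor probability used for unseen feature values, are illustrative and not taken from the released implementation.

```python
import math

def classify(features, priors, cond_prob, weights, floor=1e-9):
    """Weighted Naive Bayes scoring in log space.

    features:  list of feature values for one measurement variable
    priors:    {common_name: P(cn)}
    cond_prob: {common_name: {(feature_index, value): P(value | cn)}}
    weights:   {feature_index: w_i}, from the CART-based weighting step
    floor:     assumed floor probability for unseen feature values
    """
    best_name, best_score = None, -math.inf
    for cn, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(features):
            p = cond_prob[cn].get((i, value), floor)
            score += weights[i] * math.log(p)   # weighted log-likelihood term
        if score > best_score:
            best_name, best_score = cn, score
    return best_name, best_score
```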
3.2 Weight Calculation
We use a separate classifier to determine the importance of features. In this case, an original
CART decision tree [7] is utilized. In summary, a number of bootstrap samples (drawn randomly, with
replacement, from differing proportions of the training set) are assembled, and a classifier is
constructed from each sample over a given number of iterations. When a classifier is built, not all
features from the training set are used, and the features considered more important are placed higher
up - at smaller depths - within the tree. For each iteration of the classifier, feature weights are
calculated based upon the depth (d) at which each feature is found. The assumption is
that the more important a feature is, the closer it is to the root. The weight assigned to each feature
should be inversely related to the depth at which the feature is found. The weight for feature i from
the k-th iteration is therefore calculated as a decreasing function of its depth di(k), for example:

    wi(k) = 1 / di(k)
The process is highlighted in Figure 1. After all iterations, the average of each feature’s weight
throughout the process is taken as the final weight.
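The sketch below shows one way to realize this loop with scikit-learn's CART implementation. The inverse-depth weight, bootstrap fraction, and iteration count are illustrative assumptions standing in for the paper's exact settings, and the features are assumed to be numerically encoded already.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_feature_weights(X, y, n_iter=50, sample_frac=0.8, seed=0):
    """Average inverse-depth feature weights over bootstrap-trained CART trees."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    totals = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Random sample (with replacement) of the training data.
        idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
        t = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]).tree_
        # Record the shallowest depth at which each feature splits the tree.
        depths, stack = {}, [(0, 1)]      # (node id, depth); root split at depth 1
        while stack:
            node, depth = stack.pop()
            if t.children_left[node] != t.children_right[node]:   # internal node
                f = t.feature[node]
                depths[f] = min(depths.get(f, depth), depth)
                stack.append((t.children_left[node], depth + 1))
                stack.append((t.children_right[node], depth + 1))
        for f, d in depths.items():
            totals[f] += 1.0 / d          # weight inversely related to depth
    return totals / n_iter                # average weight over all iterations
```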
Fig. 1. CART Algorithm for Weight Calculation (repeat N times: random sample of training data →
decision tree building using CART → record the depth of features)

An example of such a tree constructed for J variables is shown in Figure 2.
Fig. 2. An Example Decision Tree
4. ALGORITHM DEVELOPMENT
There are two parts to the Hybrid Weighted Naive Bayes (HWNB) algorithm, as shown in Figure
3: the part that selects features and determines their weights, and the part that generates the common
name classification for a new data set. The first part, based on CART and written mainly in R, takes a
long time to run (hours on a well-configured office computer) due to its computational
complexity. However, this part only needs to be run occasionally (and is thus called "static"), because
the feature set rarely changes. It is re-run only if new common names are added or there are
significant changes in the feature set.
The second part, based on weighted NB and written mainly in Python, is the application users run
to generate the common name mapping for a new data set. It runs fast (minutes on an office
computer) and is thus suitable for day-to-day operation. The software release also comes with an
interactive user interface that allows user interaction before and during the algorithm's execution.
As a statistical learning algorithm, weighted Naive Bayes is able to provide multiple predictions
with a confidence level for each prediction. This feature is very desirable, as the user can
decide whether a domain expert's input is necessary based on the predictor's confidence level. The
development has another very desirable feature: it allows for retraining of the classifier on the fly
upon request from a user (and is thus called "dynamic"). When the algorithm misclassifies a
measurement, the input from a domain expert is memorized by the algorithm, and classifier
retraining can be started immediately. This ensures that future use of the algorithm is based
on an updated knowledge base, so the same misclassification is unlikely to happen again.
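Because Naive Bayes training reduces to counting, this on-the-fly retraining can be sketched as a simple count update. The class and method names below are illustrative rather than taken from the released software:

```python
from collections import defaultdict

class IncrementalNB:
    """Sketch of count-based Naive Bayes training with on-the-fly updates."""

    def __init__(self):
        self.class_counts = defaultdict(int)                          # N(cn)
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # N(cn, (i, value))

    def observe(self, features, common_name):
        """Add one labeled example, e.g. a domain expert's correction."""
        self.class_counts[common_name] += 1
        for i, value in enumerate(features):
            self.feature_counts[common_name][(i, value)] += 1

    def prob(self, common_name, i, value, alpha=1.0, n_values=100):
        """Laplace-smoothed P(value | cn) for feature i (smoothing scheme assumed)."""
        num = self.feature_counts[common_name][(i, value)] + alpha
        return num / (self.class_counts[common_name] + alpha * n_values)
```

An expert's correction is absorbed with a single observe() call, after which the updated counts immediately drive future predictions.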
Fig. 3. HWNB Algorithm
5. RESULTS
5.1 Complexity Comparison
In terms of computational complexity, Naive Bayes based algorithms are linear with respect to the
size of the feature, label, and data spaces. The decision-tree based CART algorithm is also linear with
respect to the size of the feature, label and data spaces; however, it is exponential with respect to the
space of feature values. This leads to a tremendous increase in its complexity, as most of the
features in our application are not binary by nature (e.g., the type of measurement instrument).
The complexity of the two categories of methods is summarized in the table below (D: data, L:
label, F: feature, Vf: value space of feature f).
Table 2. Complexity Comparison

Algorithm     NB                  CART
Complexity    O(|D|·|L|·|F|)      O(|D|·|L|·|F|·2^|Vf|)
5.2 Comparison with NB
The training set used for this research includes a combination of ten different missions spanning
the years 2000 to 2014. A summary of the data set, listed in alphabetical order, is given in Table 3.
These are the data sets that underwent the data ingestion process during this research work. The results
are determined using k-fold cross validation: all mission data are broken into k subsets, and testing is
done by leaving one subset out in turn. The performance of our software release using
HWNB is summarized in Table 4. For comparison, we also list the performance of the unweighted
Naive Bayes algorithm. The results are broken down by category of airborne measurement, in order
to determine how well each algorithm performs for different types of variables. An aggregate
measurement - the average accuracy weighted by the number of variables - is also shown.
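A sketch of this evaluation loop, assuming scikit-learn's KFold utility and any classifier exposing a fit/score interface (both stand-ins for the actual test harness):

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_accuracy(model, X, y, k=10, seed=0):
    """Average held-out accuracy over k folds, leaving one subset out per round."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```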
Table 3. Dataset by missions

Mission                   Year   Number of Files   Number of Unique Common Names
ARCPAC                    2008   27                41
ARCTAS                    2008   70                230
CalNex                    2010   31                193
DC3                       2012   105               276
DISCOVERAQ California     2014   29                100
DISCOVERAQ Maryland       2011   32                78
INTEX-A                   2004   81                172
INTEX-B                   2006   99                223
NEAQS - ITCT2004          2004   22                160
TexasAQS                  2000   30                200
Table 4. Comparison with Unweighted Naive Bayes

Category           Number of Variables   HWNB     Unweighted NB
Aerosol            1880                  0.9486   0.9353
Trace Gas          1182                  0.7804   0.7412
Cloud              549                   0.9683   0.9285
J                  284                   0.2623   0.1700
Nav/Meteor         384                   0.7853   0.7605
Radiation          680                   0.9970   0.9364
Weighted Average                         0.8653   0.8310
The results show that HWNB outperforms unweighted NB consistently in every data
category. The overall performance, measured by the variable-weighted average, also
demonstrates the improvement from the weighted approach. Furthermore, the two algorithms are
very similar in terms of their computational and memory complexity; the difference in execution time
between the two was not even noticeable in our testing.
5.3 Comparison with Decision-Tree Based Algorithms

The motivation of this research is to find a computationally efficient algorithm that achieves
results comparable to those obtained by more complex algorithms. In order to determine the performance
of the proposed HWNB, the algorithm is compared against the Classification and Regression
Trees (CART) algorithm. Furthermore, a Boosted CART algorithm, AdaBoost [12], and a Bagged
CART algorithm, Treebag [13], are also used.
Table 5. Comparison with Decision-Tree Based Algorithms

Category           Number of Variables   HWNB     CART     Boosting   Bagging
Aerosol            1880                  0.9486   0.9518   -          -
Trace Gas          1182                  0.7804   0.6604   0.5268     0.8199
Cloud              549                   0.9683   0.9413   0.8861     0.9715
J                  284                   0.2623   0.5560   0.8066     0.8817
Nav/Meteor         384                   0.7853   0.8994   0.9441     0.9669
Radiation          680                   0.9970   0.9946   0.9964     0.9990
Weighted Average                         0.8653   0.8603   0.7724*    0.9105*
The aggregate measurement shows that HWNB delivers very similar performance to the
computationally intensive decision-tree based CART. The results for aerosol
variables are not available for boosting and bagging, and thus results for aerosol are excluded
from their weighted average calculations (noted by an asterisk). Due to the nature of these algorithms,
including the large number of features and samples, the training models built exceed the memory
limitations of 64-bit R, which limits the ability to deploy the enhanced CART methods on the entire
training set.
6. CONCLUSION
The contribution of this work is the development of a hybrid machine learning algorithm that
achieves good performance with reasonable computational complexity. The proposed approach
harnesses the strengths of both algorithms: CART ranks features automatically and is robust to
noisy data, whereas Naive Bayes is very computationally efficient. The CART algorithms
developed prior to this research were very memory-demanding and computationally intensive: on
some large data sets they exceeded the memory limitations of the running environment, and they took
hours to run on a well-equipped office computer. Therefore, the objectives set for us were to achieve
specified performance metrics while making the implementation practical for day-to-day
operations.
Overall, an assessment of the Hybrid Weighted Naive Bayes algorithm with recent NASA
atmospheric measurements shows that the solution delivers robust results. It exceeds the
performance expectation in the presence of inconsistencies and inaccuracies among measurement
data. The software developed shows satisfactory results, runs quickly, and requires very limited
memory space. Although the process involves decision-tree based training using CART, it is
fundamentally different from a process that uses a decision-tree based algorithm as the sole
solution. Since the ICARTT format remains the same, the feature set will likely not change from
mission to mission. Therefore, the CART algorithm only needs to be run once (or occasionally) in
order to obtain features and their weights, which in turn are used by the weighted Naive Bayes in
all future applications. With its low storage requirement (megabytes versus gigabytes for CART)
and short execution time (minutes versus hours for CART on a typical office computer), the
Hybrid Weighted Naive Bayes algorithm presents a practical solution to the common name
classification problem.
REFERENCES
[1] A. Aknan, G. Chen, J. Crawford, and E. Williams, ICARTT File Format Standards V1.1,
International Consortium for Atmospheric Research on Transport and Transformation, 2013.
[2] Matthew Rutherford, Nathan Typanski, Dali Wang and Gao Chen, “Processing of ICARTT data files
using fuzzy matching and parser combinators”, 2014 International Conference on Artificial
Intelligence (ICAI'14), pp. 217-220, Las Vegas, July 2014.
[3] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti and Rajeev Motwani, “Robust and Efficient Fuzzy
Match for Online Data Cleaning”, Proceedings of the 2003 ACM SIGMOD International Conference
on Management of Data, San Diego, CA, June 2003, pp. 313-324.
[4] B. G. Buchanan and E. H. Shortliffe, “Rule-Based Expert Systems: The Mycin Experiments of the
Stanford Heuristic Programming Project”, Boston, MA, 1984.
[5] S. B. Kotsiantis, “Supervised Machine Learning: A Review of Classification Techniques”,
Informatica, Vol. 31, 2007, pp. 249-268.
[6] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees,
Wadsworth, Belmont, CA, 1984.
[7] W. Y. Loh, “Classification and regression trees”, WIREs Data Mining and Knowledge Discovery, 2011.
[8] David Hamblin, Dali Wang, Gao Chen and Ying Bai, “Investigation of Machine Learning
Algorithms for the Classification of Atmospheric Measurements”, Proceedings of the 2017
International Conference on Artificial Intelligence, pp. 115-119, Las Vegas, 2017.
[9] Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, “A Comparison of Prediction Accuracy, Complexity,
and Training Time of Thirty-Three Old and New Classification Algorithms”, Machine Learning, Vol.
40, 2000, pp. 203-228.
[10] Zijian Zheng and Geoffrey Webb, “Lazy learning of Bayesian rules”, Machine Learning, Vol. 41,
2000, pp. 53-84.
[11] Jia Wu and Zhihua Cai, “Attribute Weighting via Differential Evolution Algorithm for Attribute
Weighted Naive Bayes”, Journal of Computational Information Systems, Vol. 7, Issue 5, 2011, pp.
1672-1679.
[12] Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie, “Multi-class AdaBoost”, Statistics and Its
Interface, Vol. 2, 2009, pp. 349-360.
[13] E. Alfaro, M. Gamez, and N. Garcia, “adabag: An R Package for Classification with Boosting and
Bagging”, Journal of Statistical Software, Vol. 54, Issue 2, 2013.
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Piping Basic stress analysis by engineering
Piping Basic stress analysis by engineeringPiping Basic stress analysis by engineering
Piping Basic stress analysis by engineering
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 

A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA

  • 1. International Journal of Artificial Intelligence & Applications (IJAIA) Vol.10, No.5, September 2019 DOI: 10.5121/ijaia.2019.10504 39 A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA Dali Wang1 , Ying Bai2 , and David Hamblin1 1 Christopher Newport University, Newport News, VA, USA 2 Johnson C. Smith University, Charlotte, NC, USA ABSTRACT The goal of this research is to develop an algorithm to automatically classify measurement types from NASA’s airborne measurement data archive. The product has to meet specific metrics in term of accuracy, robustness and usability, as the initial decision-tree based development has shown limited applicability due to its resource intensive characteristics. We have developed an innovative solution that is much more efficient while offering comparable performance. Similar to many industrial applications, the data available are noisy and correlated; and there is a wide range of features that are associated with the type of measurement to be identified. The proposed algorithm uses a decision tree to select features and determine their weights.A weighted Naive Bayes is used due to the presence of highly correlated inputs. The development has been successfully deployed in an industrial scale, and the results show that the development is well-balanced in term of performance and resource requirements. KEYWORDS classification, machine learning, atmospheric measurement 1. BACKGROUND Beginning in the 1980s, NASA (National Aeronautics and Space Administration) introduced tropospheric chemistry studies to gather information about the contents of the air. These studies involve a data gathering process through the use of airplanes mounted with on-board instruments, which gather multiple measurements over specific regions of the air. These atmospheric measurements are complemented with other sources, such as satellites and ground stations. The data collection, called missions or campaigns, are supported by the federal government with the goal of collecting data for scientific studies on climate and air quality issues. For the last two decades, data collected have been compiled into a single format, ICARTT (International Consortium for Atmospheric Research on Transport and Transformation) [1], which is designed to allow for sharing across the airborne community, with other agencies, universities, or any other group working with the data. A typical ICARTT file contains, at the minimum, about 30 fields of metadata that describe the atmospheric data gathered, the who/where/what aspects of the data gathering process, along with additional comments supplied by the principal investigator that is pertinent to the data [1]. Over the past few years, the federal government has launched the Open Data Initiative,part of which aims at scaling up open data efforts in the climate sector. Accordingly, NASA has made it a priority to open up its data archive to the general public in order to promote their usage in scientific studies. As part of this initiative, NASA has introduced a centralized data archive service, which allows for others to quickly find certain information within the data archive by filtering data according to specific metadata components[2].
aerosol particle absorption coefficient at blue wavelengths), trace gas (such as ozone mixing ratio), and cloud property (such as cloud surface area density). To uniquely define each measurement, NASA has adopted a naming system called "common names"; each common name indexes a measurement with a uniquely defined physical or chemical characteristic.

Unfortunately, the common name is not directly present in the ICARTT data file. Instead, each measurement is given a "variable name". Principal investigators choose variable names when compiling ICARTT files, and a variable name, along with the unique metadata it carries, is not easily searchable. Although gathered in different missions by separate investigators, much of the atmospheric data associated with a given variable may be equivalent to data elsewhere, save for the distinguishing metadata. For example, formaldehyde (CH2O) in the troposphere was measured in several missions across a number of years, under differing variable names and with different instruments. In a centralized data archive, however, all of these variables should be given the same common name to identify them as CH2O data. This more general common name allows a user to easily access similar data across missions by filtering on the name.

The retrieval of a common name for each measurement variable was performed by domain experts at NASA. The task is complicated because many fields in an ICARTT file contribute to the common name of a measurement: variable name, variable description, instrument used, measurement unit, and so on. Other, less intuitive information may also contribute to the naming; one example is the principal investigator's (PI) name, because each person has his or her own tendencies in naming a variable. In addition, the data are noisy (missing fields and typos are common), as there has been no rigorous vetting process for metadata fields. As a result, only a few domain experts at NASA can accurately identify a common name by studying the ICARTT file under consideration. Given the vast amount of data accumulated over the past 30 years, it is essential to automate this information retrieval process.

2. OVERVIEW OF SOLUTIONS

In its most common form, the common naming process chooses a specific label (i.e., a common name) for a measurement variable from a list of predefined labels, given the metadata associated with that variable. It belongs to the category of label classification problems. Textual similarity comparison using fuzzy logic is a common approach to deciding on matched and non-matched records [3]. It presents an opportunity for common name matching; however, given the variations in the type of metadata associated with each common name, and the large number of common names (300+), it is impractical to implement such an approach here.
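As a rough illustration of such textual matching (a minimal sketch using Python's standard-library difflib, not the fuzzy-match method of [3]; the variable names below are hypothetical):

```python
# Illustrative only: score textual similarity between variable names with
# difflib. A real fuzzy matcher, as in [3], would also weigh tokens, units,
# and other metadata fields.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical variable names that all denote formaldehyde (CH2O).
candidates = ["CH2O_DFGAS", "Formaldehyde_MixingRatio", "CH2O_ppbv"]
query = "CH2O_CAMS"
for name in candidates:
    print(f"{name}: {similarity(query, name):.2f}")
```

With hundreds of common names and heterogeneous metadata per name, tuning similarity thresholds of this kind becomes unmanageable, which is why this route was not pursued.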
A forward-chaining expert system, which belongs to a category of artificial intelligence algorithms, may provide another viable solution [4]. This approach was attempted by summarizing the rules used by domain experts and synthesizing a knowledge base. It turned out not to be very effective, as the rules are complex, non-inclusive, and sometimes contradictory. Moreover, whenever new missions are conducted, the rules in the inference engine must be expanded, which is a tedious task.

Solutions to similar problems include Artificial Neural Networks (ANNs), especially perceptron-based networks; Support Vector Machines (SVMs); and instance-based learning, especially the k-Nearest Neighbor (k-NN) algorithm. Generally, SVMs and ANNs perform much better when dealing with multi-dimensional and continuous-valued features [5], whereas this problem involves only single-dimensional and discrete features. The k-NN algorithm is very sensitive to irrelevant features and intolerant of noise, which makes it unattractive for this problem: not all features are relevant to every common name, and the information in general is very noisy due to errors and inconsistencies in ICARTT files.
Logic-based systems such as decision trees tend to perform better when dealing with discrete or categorical features, and one of their most useful characteristics is comprehensibility. Typical decision trees, such as C4.5, proved unreliable for classifying the airborne data: preliminary testing with these algorithms on mission data yielded low overall accuracy, owing to their preference for binary and discrete features. Instead, a different type of decision tree is considered. Classification and regression trees (CART) have been used in the statistics community for over thirty years [6] and are well-proven methods for creating predictive models in the form of a "tree", in which branches are features and the end of a branch (the leaf) is the chosen class. Classification trees in particular have provided good results for datasets with categorical classes and features [7], while regression trees deal with continuous values, unlike the atmospheric dataset. Classification trees are constructed by recursively partitioning on features, using some metric to determine the "best" split. Because they can work with datasets full of categorical information while seeking a categorical outcome, classification trees align well with the common naming process. Favorable preliminary results from a classification tree model led to further investigation of this algorithm.

We first implemented the CART, Boosted CART, and Bagged CART algorithms as solutions to the common naming problem [8]. These algorithms provide the accuracy and robustness we aimed for. However, there are two major roadblocks to deploying the implementation. The first is execution time: the running time ranges from hours (for CART) to over 10 hours (for the Boosted and Bagged versions) using normal office computing power. This might be acceptable if training were a one-time event; however, retraining is necessary whenever a new mission is added to the data archive, because newer missions may require the insertion of new common names. The time complexity of the CART-based algorithms is shown in Table 1, where the complexity is a function of D (data space), L (label or class space), F (feature space), V_f (value space for each feature), and M (number of trees).

Table 1. Time complexity of algorithms

Algorithm     CART                       Boosted CART                 Bagged CART
Complexity    O(|D|·|L|·|F|·2^|V_f|)     O(M·|D|·|L|·|F|·2^|V_f|)     O(M·|D|·|L|·|F|·2^|V_f|)

As the table's entries show, CART is limited by the speed at which it constructs its classification tree, which is bogged down by the introduction of large categorical features, such as principal investigator (PI) or instrument, with their many possible values. As it recursively partitions each feature to create a split, CART has to check 2^(n-1) - 1 candidate splits, where n is the number of possible values of the feature. The running time will only get worse as more missions are included, because the value space of a feature (such as the list of PIs and instruments) can only grow over time.
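To make the 2^(n-1) - 1 figure concrete, the following sketch (with hypothetical PI values) enumerates the two-way partitions CART must score for a single categorical feature:

```python
# Enumerate the candidate binary splits of one categorical feature.
# For n distinct values there are 2**(n - 1) - 1 nontrivial partitions.
from itertools import combinations

def candidate_splits(values):
    """Yield each nontrivial two-way partition of a set of category values."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:  # skip the degenerate split with an empty side
                yield left, right

pis = {"PI_A", "PI_B", "PI_C", "PI_D"}  # hypothetical principal investigators
print(sum(1 for _ in candidate_splits(pis)))  # 2**(4 - 1) - 1 = 7
```

Each new PI or instrument value doubles the number of candidate splits for that feature, which is why the training time can only deteriorate as missions accumulate.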
The second drawback of the CART-based algorithms is their memory-intensive nature. The classifiers are usable for all of the datasets, except for the enhanced (Boosted and Bagged) CART algorithms on the largest dataset, aerosol. Neither bagging nor boosting is able to execute on the aerosol dataset: owing to the nature of these algorithms, including the large number of predictors and samples, the training models built exceed the memory limits of 64-bit R. While the enhanced CART methods perform well on the other datasets, which have fewer samples and predictors, their inability to produce results for aerosol shows their limited usability.
These limitations make the CART-based algorithms less desirable for our application. We were therefore tasked with exploring other algorithms that provide the accuracy and robustness needed while demonstrating time and memory efficiency.

3. PROPOSED SOLUTION

The statistics-based Bayesian approach is investigated in this research. The major advantages of the Naive Bayes (NB) classifier are its linear time complexity and small storage requirement, both highly desirable for this application. In addition, the algorithm is very transparent, as its output is easily grasped by users through a probabilistic explanation [5][9].

The Naive Bayes algorithm assumes all features are independent, so its accuracy suffers in the presence of feature correlation. In theory, if two features are redundant (i.e., 100% correlated), that feature is counted twice in the probability calculation of a class. In the common naming process, some features are highly correlated; for example, a variable and the instrument used to measure it are closely related.

In general, two categories of approaches can reduce the impact of feature correlation while still maintaining a single interpretable model. The first uses pre-processing algorithms to remove the correlation from the training data set [10]; this adds to the complexity of the overall operation, and the algorithms in this category have complexity tied to the size of the feature space (which is fairly large in our case). We therefore choose the second type of approach, treating features differently through a weighting scheme, as illustrated in the next few sections.

3.1 Weighted Naive Bayes

The Naive Bayes algorithm is derived from Bayes' theorem, which describes the probability of a label (or class) based on prior knowledge. Under the independence assumption it is formulated as

P(cn | f_1, ..., f_n) = P(cn) · ∏_i P(f_i | cn) / P(f_1, ..., f_n)

where cn is a common name and f_i is a feature. The first assumption of Naive Bayes is that all features are independent. The second assumption, which follows from feature independence, is that all of the features associated with a label are equally important. When violated, these assumptions degrade accuracy and make Naive Bayes less tolerant of redundant features. Numerous methods have been proposed to improve the performance of the algorithm by relaxing feature independence. A comprehensive study by Wu [11] on all 36 standard UCI data sets showed that weighted Naive Bayes provides higher classification accuracy than simple Naive Bayes. Weighting Naive Bayes requires only a single modification: the probability of each feature associated with a common name is raised to the power of that feature's weight:

P(cn | f_1, ..., f_n) ∝ P(cn) · ∏_i P(f_i | cn)^(w_i)

For the purposes of common name matching, the probabilities are converted from linear to logarithmic form for calculation, and the goal is to find the label x (common name) with the highest probability:

x = argmax over cn_k of [ log P(cn_k) + Σ_i w_i · log P(f_i | cn_k) ]

where {cn_k} is the set of all possible common names.
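A minimal sketch of this scoring rule, assuming the priors, conditional probability tables, and feature weights have already been estimated from training data (all names here are illustrative):

```python
import math

def predict(features, priors, cond_prob, weights, eps=1e-9):
    """Return the common name cn maximizing
    log P(cn) + sum_i w_i * log P(f_i | cn).

    priors:    {common_name: P(cn)}
    cond_prob: {(common_name, feature_index, feature_value): P(f_i | cn)}
    weights:   per-feature weights w_i from the CART stage
    eps:       floor for (cn, feature, value) combinations unseen in training
    """
    best_name, best_score = None, float("-inf")
    for cn, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(features):
            score += weights[i] * math.log(cond_prob.get((cn, i, value), eps))
        if score > best_score:
            best_name, best_score = cn, score
    return best_name
```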
3.2 Weight Calculation

We use a separate classifier to determine the importance of features; in this case, an original CART decision tree [7] is utilized. In summary, samples are drawn at random, with replacement, from differing proportions of the training set, and several classifiers are constructed from these samples over a given number of iterations. When a classifier is built, not all features from the training set are used, and some are considered more important than others, being placed nearer the top of the tree. For each iteration of the classifier, feature weights are calculated from the depth d at which each feature is found. The assumption is that the more important a feature is, the closer it sits to the root, so the weight assigned to a feature should be inversely related to its depth; for example, the weight for feature i in the k-th iteration can be taken as

w_i,k = 1 / d_i,k

where d_i,k is the depth at which feature i first appears in the k-th tree. The process is highlighted in Figure 1. After all iterations, the average of each feature's weights across the process is taken as the final weight.

Fig. 1. CART Algorithm for Weight Calculation (repeat N times: random sample of training data, decision tree building using CART, record the depth of features)
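A sketch of this weight-extraction step, assuming scikit-learn's CART implementation, label-encoded numeric features, and the illustrative 1/depth weight above (the paper's R implementation may differ):

```python
# Estimate feature weights from the depths at which CART places features,
# averaged over trees grown on bootstrap samples of the training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def feature_weights(X, y, n_iterations=25, sample_frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights = np.zeros((n_iterations, n_features))
    for k in range(n_iterations):
        idx = rng.choice(n_samples, size=int(sample_frac * n_samples),
                         replace=True)
        tree = DecisionTreeClassifier(random_state=k).fit(X[idx], y[idx])
        t = tree.tree_
        # Walk the tree, recording the shallowest depth of each feature
        # (leaves are marked with feature == -2 in scikit-learn).
        min_depth = np.full(n_features, np.inf)
        stack = [(0, 1)]  # (node id, depth); the root is at depth 1
        while stack:
            node, depth = stack.pop()
            f = t.feature[node]
            if f >= 0:
                min_depth[f] = min(min_depth[f], depth)
                stack.append((t.children_left[node], depth + 1))
                stack.append((t.children_right[node], depth + 1))
        # Features never used in a tree receive weight 0 for that iteration.
        weights[k] = np.where(np.isinf(min_depth), 0.0, 1.0 / min_depth)
    return weights.mean(axis=0)  # final weight: average over iterations
```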
An example of such a tree constructed for J variables is shown in Figure 2.

Fig. 2. An Example Decision Tree

4. ALGORITHM DEVELOPMENT

The Hybrid Weighted Naive Bayes (HWNB) algorithm has two parts, as shown in Figure 3: one that selects features and their weights, and one that generates the common name classification for a new data set. The first part, based on CART and written mainly in R, takes a long time to run (hours on a well-configured office computer) due to its computational complexity. However, this part only needs to be run occasionally (it is therefore called "static"), because the feature set rarely changes; it is re-run only when new common names are added or there are significant changes in the feature set. The second part, based on weighted NB and written mainly in Python, is the application used to generate the common name mapping for a new data set. It runs fast (minutes on an office computer) and is thus suitable for day-to-day operation.

The software release also comes with an interactive user interface that allows user interaction before and during the algorithm's execution. As a statistical learning algorithm, weighted Naive Bayes can provide multiple predictions with a confidence level for each; this is very desirable, as the user can decide whether a domain expert's input is necessary based on the predictor's confidence level. The development has another very desirable feature: it allows the classifier to be retrained on the fly at the user's request (it is therefore called "dynamic"). If the algorithm misclassifies a measurement, the input from a domain expert is memorized by the algorithm, and classifier retraining can start immediately. This ensures that future use of the algorithm is based on an updated knowledge base, so the same misclassification is unlikely to happen again; a sketch of this correction loop appears after Figure 3.

Fig. 3. HWNB Algorithm
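The dynamic retraining loop can be pictured as follows; this is a hedged sketch with illustrative class and method names, not the release's actual interface:

```python
from collections import Counter, defaultdict

class CommonNameClassifier:
    """Sketch of the dynamic part: Naive Bayes count tables that are
    rebuilt on the fly whenever a domain expert corrects a prediction."""

    def __init__(self, records, weights):
        self.records = list(records)  # (feature_tuple, common_name) pairs
        self.weights = weights        # CART-derived weights (static part)
        self._fit()

    def _fit(self):
        # Rebuilding the tables is linear in the data, so retraining is
        # fast enough to run interactively during a session.
        self.class_counts = Counter(cn for _, cn in self.records)
        self.value_counts = defaultdict(Counter)
        for features, cn in self.records:
            for i, value in enumerate(features):
                self.value_counts[(cn, i)][value] += 1

    def correct(self, features, true_common_name):
        """Memorize an expert correction and retrain immediately, so the
        same misclassification is unlikely to recur."""
        self.records.append((tuple(features), true_common_name))
        self._fit()
```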
5. RESULTS

5.1 Complexity Comparison

In terms of computational complexity, Naive Bayes based algorithms are linear with respect to the sizes of the feature, label, and data spaces. The decision-tree based CART algorithm is also linear with respect to the sizes of the feature, label, and data spaces; however, it is exponential with respect to the space of feature values. This leads to a tremendous increase in its complexity, as most of the features in our application are not binary by nature (e.g., the type of measurement instrument). The complexity of the two categories of methods is summarized in Table 2 (D: data, L: label, F: feature, V_f: value space of a feature).

Table 2. Complexity comparison

Algorithm     NB                 CART
Complexity    O(|D|·|L|·|F|)     O(|D|·|L|·|F|·2^|V_f|)

5.2 Comparison with NB

The training set used for this research combines ten different missions over a period of 10 years; a summary, in alphabetical order, is given in Table 3. These are the data sets that underwent the data ingestion process during this research. Results are determined using k-fold cross-validation: all mission data are broken into k subsets, and each subset is held out in turn for testing while the remainder is used for training (a sketch of this protocol follows Table 3).

The performance of our software release using HWNB is summarized in Table 4, alongside the performance of the unweighted Naive Bayes algorithm. Results are broken down by category of airborne measurement, to determine how well each algorithm performs for different types of variables. An aggregate measurement, the average accuracy weighted by the number of variables, is also shown.

Table 3. Dataset by missions

Mission                  Year    Number of Files    Number of Unique Common Names
ARCPAC                   2008    27                 41
ARCTAS                   2008    70                 230
CalNex                   2010    31                 193
DC3                      2012    105                276
DISCOVERAQ California    2014    29                 100
DISCOVERAQ Maryland      2011    32                 78
INTEX-A                  2004    81                 172
INTEX-B                  2006    99                 223
NEAQS - ITCT2004         2004    22                 160
TexasAQS                 2000    30                 200
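The evaluation protocol can be sketched as follows; the value of k and the train_and_predict hook (standing in for HWNB training plus prediction) are illustrative, as the paper does not fix them:

```python
# Hold each of k folds out in turn, train on the remainder, and average
# the per-fold accuracies.
import numpy as np

def k_fold_accuracy(X, y, train_and_predict, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = train_and_predict(X[train], y[train], X[test])
        scores.append(float(np.mean(y_pred == y[test])))
    return float(np.mean(scores))
```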
Table 4. Comparison with unweighted Naive Bayes

Category            Number of Variables    HWNB      Unweighted NB
Aerosol             1880                   0.9486    0.9353
Trace Gas           1182                   0.7804    0.7412
Cloud               549                    0.9683    0.9285
J                   284                    0.2623    0.1700
Nav/Meteor          384                    0.7853    0.7605
Radiation           680                    0.9970    0.9364
Weighted Average                           0.8653    0.8310

The results show that HWNB outperforms the unweighted NB consistently in every data category. The overall performance, measured by the variable-weighted average, also demonstrates the improvement from the weighted approach (this calculation is reproduced at the end of this section). Furthermore, the two algorithms are very similar in their computational and memory complexity; the difference in execution time between them was not even noticeable in our testing.

5.3 Comparison with Decision-Tree Based Algorithms

The motivation of this research is to find a computationally efficient algorithm that achieves results comparable to those of more complex algorithms. To assess the performance of the proposed HWNB, the algorithm is compared against the Classification and Regression Trees (CART) algorithm, as well as a Boosted CART algorithm, AdaBoost [12], and a Bagged CART algorithm, Treebag [13].

Table 5. Comparison with decision-tree based algorithms

Category            Number of Variables    HWNB      CART      Boosting    Bagging
Aerosol             1880                   0.9486    0.9518    -           -
Trace Gas           1182                   0.7804    0.6604    0.5268      0.8199
Cloud               549                    0.9683    0.9413    0.8861      0.9715
J                   284                    0.2623    0.5560    0.8066      0.8817
Nav/Meteor          384                    0.7853    0.8994    0.9441      0.9669
Radiation           680                    0.9970    0.9946    0.9964      0.9990
Weighted Average                           0.8653    0.8603    0.7724*     0.9105*

The aggregate measurement shows that HWNB delivers performance very similar to that of the computationally intensive decision-tree based CART. Results for aerosol variables are not available for boosting and bagging, and aerosol is thus excluded from their weighted-average calculations (noted by an asterisk). Due to the nature of those algorithms, including the large number of features and samples, the training models built exceed the memory limits of 64-bit R, which prevents deployment of the enhanced CART methods on the entire training set.
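As a quick arithmetic check, the weighted-average rows follow from the per-category accuracies and variable counts; a minimal sketch for the HWNB column of Table 4:

```python
# Reproduce Table 4's weighted average for HWNB: per-category accuracies
# weighted by the number of variables in each category.
counts = {"Aerosol": 1880, "Trace Gas": 1182, "Cloud": 549,
          "J": 284, "Nav/Meteor": 384, "Radiation": 680}
hwnb = {"Aerosol": 0.9486, "Trace Gas": 0.7804, "Cloud": 0.9683,
        "J": 0.2623, "Nav/Meteor": 0.7853, "Radiation": 0.9970}
total = sum(counts.values())
weighted_avg = sum(counts[c] * hwnb[c] for c in counts) / total
print(round(weighted_avg, 4))  # ~0.8654, matching Table 4 up to rounding
```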
6. CONCLUSION

The contribution of this work is a hybrid machine learning algorithm that achieves good performance with reasonable computational complexity. The proposed approach harnesses the strengths of both constituent algorithms: CART ranks features automatically and is robust to noisy data, whereas Naive Bayes is very computationally efficient. The CART algorithms developed prior to this research were memory demanding and computationally intensive: on some large data sets they exceeded the memory limits of the running environment, and they took hours to run on a well-equipped office computer. The objectives set for us were therefore to achieve specified performance metrics while making the implementation practical for day-to-day operations.

Overall, an assessment of the Hybrid Weighted Naive Bayes algorithm on recent NASA atmospheric measurements shows that the solution delivers robust results. It exceeds the performance expectation in the presence of inconsistencies and inaccuracies among the measurement data. The software developed shows satisfactory results, runs quickly, and requires very limited memory. Although the process involves decision-tree based training using CART, it is fundamentally different from a process that uses a decision-tree based algorithm as the sole solution. Since the ICARTT format remains the same, the feature set is unlikely to change from mission to mission; therefore, the CART algorithm only needs to be run once (or occasionally) to obtain the features and their weights, which are then used by the weighted Naive Bayes in all future applications. With its low storage requirement (megabytes, versus gigabytes for CART) and short execution time (minutes, versus hours for CART on a typical office computer), the Hybrid Weighted Naive Bayes algorithm presents a practical solution to the common name classification problem.

REFERENCES

[1] A. Aknan, G. Chen, J. Crawford, and E. Williams, ICARTT File Format Standards V1.1, International Consortium for Atmospheric Research on Transport and Transformation, 2013.
[2] Matthew Rutherford, Nathan Typanski, Dali Wang and Gao Chen, "Processing of ICARTT data files using fuzzy matching and parser combinators", 2014 International Conference on Artificial Intelligence (ICAI'14), Las Vegas, July 2014, pp. 217-220.
[3] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti and Rajeev Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning", Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, June 2003, pp. 313-324.
[4] B. G. Buchanan and E. H. Shortliffe, Rule Based Expert Systems: The Mycin Experiments of the Stanford Heuristic Programming Project, Boston, MA, 1984.
[5] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", Informatica, Vol. 31, 2007, pp. 249-268.
[6] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[7] W. Y. Loh, "Classification and regression trees", WIREs Data Mining and Knowledge Discovery, 2011.
[8] David Hamblin, Dali Wang, Gao Chen and Ying Bai, "Investigation of Machine Learning Algorithms for the Classification of Atmospheric Measurements", Proceedings of the 2017 International Conference on Artificial Intelligence, Las Vegas, 2017, pp. 115-119.
[9] Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, "A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms", Machine Learning, Vol. 40, 2000, pp. 203-228.
[10] Zijian Zheng and Geoffrey Webb, "Lazy learning of Bayesian rules", Machine Learning, Vol. 41, 2000, pp. 53-84.
[11] Jia Wu and Zhihua Cai, "Attribute Weighting via Differential Evolution Algorithm for Attribute Weighted Naive Bayes", Journal of Computational Information Systems, Vol. 7, Issue 5, 2011, pp. 1672-1679.
[12] Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie, "Multi-class AdaBoost", Statistics and Its Interface, Vol. 2, 2009, pp. 349-360.
[13] E. Alfaro, M. Gamez, and N. Garcia, "adabag: An R Package for Classification with Boosting and Bagging", Journal of Statistical Software, Vol. 54, Issue 2, 2013.