

Bayesian Classifiers Programmed in SQL

Carlos Ordonez and Sasi K. Pitchaimalai

The authors are with the Department of Computer Science, University of Houston, Houston, TX 77204. E-mail: {ordonez, yeskaym}@cs.uh.edu.

Abstract—The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with a horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.

Index Terms—Classification, K-means, query optimization.

Manuscript received 24 Apr. 2008; revised 12 Jan. 2009; accepted 4 May 2009; published online 14 May 2009. Recommended for acceptance by S. Wang. Digital Object Identifier no. 10.1109/TKDE.2009.127.

1 INTRODUCTION

Classification is a fundamental problem in machine learning and statistics [2]. Bayesian classifiers stand out for their robustness, interpretability, and accuracy [2]. They are deeply related to maximum likelihood estimation and discriminant analysis [2], highlighting their theoretical importance. In this work, we focus on Bayesian classifiers considering two variants: Naïve Bayes [2] and a Bayesian classifier based on class decomposition using clustering [7].

In this work, we integrate Bayesian classification algorithms into a DBMS. Such integration allows users to directly analyze data sets inside the DBMS and to exploit its extensive capabilities (storage management, querying, concurrency control, fault tolerance, and security). We use SQL queries as the programming mechanism, since SQL is the standard language of a DBMS. More importantly, using SQL eliminates the need to understand and modify the internal source code, which is a difficult task. Unfortunately, SQL has two important drawbacks: it has limitations for manipulating vectors and matrices, and it has more overhead than a systems language like C. Keeping those issues in mind, we study how to evaluate mathematical equations with several table layouts and optimized SQL queries.

Our contributions are the following: We present two efficient SQL implementations of Naïve Bayes for numeric and discrete attributes. We introduce a classification algorithm that builds one clustering model per class, which is a generalization of K-means [1], [4]. Our main contribution is a Bayesian classifier programmed in SQL, extending Naïve Bayes, which uses K-means to decompose each class into clusters. We generalize queries for clustering by adding a new problem dimension. That is, our novel queries combine three dimensions: attribute, cluster, and class subscripts. We identify Euclidean distance as the most time-consuming computation. Thus, we introduce several schemes to efficiently compute distance considering different storage layouts for the data set and classification model. We also develop query optimizations that may be applicable to other distance-based algorithms. A horizontal layout of the cluster centroids table and a denormalized table for sufficient statistics are essential to accelerate queries. Past research has shown that the main alternatives to integrate data mining algorithms without modifying DBMS source code are SQL queries and User-Defined Functions (UDFs), with UDFs being generally more efficient [5]. We show SQL queries are more efficient than UDFs for Bayesian classification. Finally, despite the fact that C++ is a significantly faster alternative, we provide evidence that exporting data sets is a performance bottleneck.

The article is organized as follows: Section 2 introduces definitions. Section 3 introduces two alternative Bayesian classifiers in SQL: Naïve Bayes and a Bayesian classifier based on K-means. Section 4 presents an experimental evaluation on accuracy, optimizations, and scalability. Related work is discussed in Section 5. Section 6 concludes the paper.

2 DEFINITIONS

We focus on computing classification models on a data set $X = \{x_1, \ldots, x_n\}$ with $d$ attributes $X_1, \ldots, X_d$, one discrete attribute $G$ (class or target), and $n$ records (points). We assume $G$ has $m = 2$ values. Data set $X$ represents a $d \times n$ matrix, where $x_i$ represents a column vector. We study two complementary models: 1) each class is approximated by a normal distribution or histogram and 2) fitting a mixture model with $k$ clusters on each class with K-means. We use subscripts $i, j, h, g$ as follows: $i = 1 \ldots n$, $j = 1 \ldots k$, $h = 1 \ldots d$, $g = 1 \ldots m$. The $T$ superscript indicates matrix transposition. Throughout the paper, we will use a small running example, where $d = 4$, $k = 3$ (for K-means), and $m = 2$ (binary).

3 BAYESIAN CLASSIFIERS PROGRAMMED IN SQL

We introduce two classifiers: Naïve Bayes (NB) and a Bayesian classifier based on K-means (BKM) that performs class decomposition. We consider two phases for both classifiers: 1) computing the model and 2) scoring a data set based on the model. Our optimizations do not alter model accuracy.

3.1 Naïve Bayes

We consider two versions of NB: one for numeric attributes and another for discrete attributes. Numeric NB will be improved with class decomposition.

NB assumes attributes are independent, and thus, the joint class conditional probability can be estimated as the product of the probabilities of each attribute [2]. We now discuss NB based on a multivariate Gaussian. NB has no input parameters. Each class is modeled as a single normal distribution with mean vector $C_g$ and a diagonal variance matrix $R_g$. Scoring assumes a model is available and that there exists a data set with the same attributes in order to predict class $G$. Variance computation based on sufficient statistics [5] in one pass can be numerically unstable when the variance is much smaller than large attribute values or when the data set has a mix of very large and very small numbers. For ill-conditioned data sets, the computed variance can be significantly different from the actual variance or even become negative. Therefore, the model is computed in two passes: a first pass to get the mean per class and a second one to compute the variance per class. The mean per class is given by $C_g = \sum_{x_i \in Y_g} x_i / N_g$, where $Y_g \subseteq X$ are the records in class $g$. The equation $R_g = \frac{1}{N_g} \sum_{x_i \in Y_g} (x_i - C_g)(x_i - C_g)^T$ gives a diagonal variance matrix $R_g$, which is numerically stable, but requires two passes over the data set.
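In explicit form, for a point $x$ with attribute values $x_1, \ldots, x_d$, the Gaussian model above corresponds to the standard Naïve Bayes scoring rule (written with logarithms, as used later for scoring):

$$p(x \mid g) = \prod_{h=1}^{d} \frac{1}{\sqrt{2\pi R_{gh}}}\,\exp\!\left(-\frac{(x_h - C_{gh})^2}{2R_{gh}}\right),
\qquad
\hat{g} = \arg\max_{g}\Big[\log \pi_g + \sum_{h=1}^{d} \log p(x_h \mid g)\Big],$$

where $C_{gh}$ and $R_{gh}$ denote the $h$th entries of $C_g$ and of the diagonal of $R_g$, and $\pi_g$ is the class prior.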
The SQL implementation for numeric NB follows the mean and variance equations introduced above.
We compute three aggregations grouping by g with two queries. The first query computes the mean $C_g$ of class $g$ with a sum(Xh)/count(*) aggregation and the class priors $\pi_g$ with a count(*) aggregation. The second query computes $R_g$ with a sum((Xh - $\mu_h$)^2) aggregation, where $\mu_h$ is the per-class mean of attribute $X_h$ obtained in the first pass. Note that the joint probability computation is not done in this phase.
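As an illustration for the running example (d = 4), the two passes can be written as follows; the input table X(i, X1, .., X4, g) and the output table names NB_C and NB_R are only illustrative.

INSERT INTO NB_C
SELECT g
  ,count(*) AS Ng                          /* class size; prior pi_g = Ng/n */
  ,sum(X1)/count(*) AS C1 ,sum(X2)/count(*) AS C2
  ,sum(X3)/count(*) AS C3 ,sum(X4)/count(*) AS C4
FROM X GROUP BY g;

INSERT INTO NB_R                           /* second pass: variance around the per-class mean */
SELECT X.g
  ,sum((X1-C1)**2)/count(*) AS R1 ,sum((X2-C2)**2)/count(*) AS R2
  ,sum((X3-C3)**2)/count(*) AS R3 ,sum((X4-C4)**2)/count(*) AS R4
FROM X, NB_C WHERE X.g = NB_C.g GROUP BY X.g;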
Scoring uses the Gaussian parameters as input to classify an input point to the most probable class, with one query in one pass over X. Each class probability is evaluated as a Gaussian. To avoid numerical issues when a variance is zero, the probability is set to 1 and the joint probability is computed with a sum of probability logarithms instead of a product of probabilities. A CASE statement pivots probabilities and avoids a max() aggregation. A final query determines the predicted class, which is the one with maximum probability, obtained with a CASE statement.
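A sketch of the scoring pass for m = 2 classes is given below, assuming the model has been pivoted into a one-row table NBH whose columns hold the priors (P1, P2), the means (C1_X1, .., C2_X4), and the variances (R1_X1, .., R2_X4); this layout and these names are illustrative. The constant term $-\frac{d}{2}\log(2\pi)$ is dropped since it does not change the maximum, and a zero variance would be guarded with an extra CASE as described above.

INSERT INTO XS
SELECT i
  ,ln(P1) - 0.5*( ln(R1_X1) + (X1-C1_X1)**2/R1_X1
                + ln(R1_X2) + (X2-C1_X2)**2/R1_X2
                + ln(R1_X3) + (X3-C1_X3)**2/R1_X3
                + ln(R1_X4) + (X4-C1_X4)**2/R1_X4 ) AS lp1
  ,ln(P2) - 0.5*( ln(R2_X1) + (X1-C2_X1)**2/R2_X1
                + ln(R2_X2) + (X2-C2_X2)**2/R2_X2
                + ln(R2_X3) + (X3-C2_X3)**2/R2_X3
                + ln(R2_X4) + (X4-C2_X4)**2/R2_X4 ) AS lp2
FROM X, NBH;

SELECT i ,CASE WHEN lp1 >= lp2 THEN 1 ELSE 2 END AS predicted_g   /* CASE instead of max() */
FROM XS;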
We now discuss NB for discrete attributes. For numeric NB, we used Gaussians because they work well for large data sets and because they are easy to manipulate mathematically. Discrete NB, in contrast, does not assume any specific probability density function (pdf). Assume $X_1, \ldots, X_d$ can be discrete or numeric. If an attribute $X_h$ is discrete (categorical), NB simply computes its histogram: probabilities are derived from counts per value divided by the corresponding number of points in each class. Otherwise, if the attribute $X_h$ is numeric, then binning is required. Binning requires two passes over the data set, much like numeric NB. In the first pass, bin boundaries are determined. On the second pass, one-dimensional frequency histograms are computed on each attribute. The bin boundaries (interval ranges) may impact the accuracy of NB due to skewed distributions or extreme values. Thus, we consider two techniques to bin attributes: 1) creating k uniform intervals between the minimum and maximum and 2) taking intervals around the mean based on multiples of the standard deviation, thus getting more evenly populated bins. We do not study other binning schemes such as quantiles (i.e., equidepth binning).

The implementation in SQL of discrete NB is straightforward. For discrete attributes, no preprocessing is required. For numeric attributes, the minimum, maximum, and mean can be determined in one pass in a single query. The variance for all numeric attributes is computed on a second pass to avoid numerical issues. Then, each attribute is discretized by finding the interval for each value. Once we have a binned version of X, we compute histograms on each attribute with SQL aggregations. Probabilities are obtained by dividing by the number of records in each class. Scoring requires determining the interval for each attribute value and retrieving its probability. Each class probability is also computed by adding logarithms. NB has an advantage over other classifiers: it can handle a data set with mixed attribute types (i.e., discrete and numerical).
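For one attribute of the running example, the histogram computation can be sketched as follows, assuming XB(i, X1, .., X4, g) holds the binned (or already discrete) attribute values; the histogram table NB_H1 for attribute X1 is an illustrative name.

INSERT INTO NB_H1
SELECT XB.g ,XB.X1 AS val
  ,count(*)*1.0/NG.Ng AS p               /* P(X1 = val | g) */
FROM XB
  ,(SELECT g, count(*) AS Ng FROM XB GROUP BY g) NG
WHERE XB.g = NG.g
GROUP BY XB.g, XB.X1, NG.Ng;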
3.2 Bayesian Classifier Based on K-Means

We now present BKM, a Bayesian classifier based on class decomposition obtained with the K-means algorithm. BKM is a generalization of NB, where NB has one cluster per class and the Bayesian classifier has k ≥ 1 clusters per class. For ease of exposition and elegance, we use the same k for each class, but we experimentally show this is a sound decision. We consider two major tasks: model computation and scoring a data set based on a computed model. The most time-consuming phase is building the model. Scoring is equivalent to an E step from K-means executed on each class.

3.2.1 Mixture Model and Decomposition Algorithm

We fit a mixture of k clusters to each class with K-means [4]. The output consists of the priors $\pi$ (same as in NB), mk weights (W), mk centroids (C), and mk variance matrices (R). We generalize NB notation with k clusters per class g using the notation g:j, meaning the given vector or matrix refers to the jth cluster of class g. Sums are defined over vectors (instead of variables). Class priors are analogous to NB: $\pi_g = N_g / n$.

We now introduce sufficient statistics in the generalized notation. Let $X_{g:j}$ represent partition j of the points in class g (as determined by K-means). Then L, the linear sum of points, is $L_{g:j} = \sum_{x_i \in X_{g:j}} x_i$, and the sum of "squared" points for cluster g:j becomes $Q_{g:j} = \sum_{x_i \in X_{g:j}} x_i x_i^T$. Based on L and Q, the Gaussian parameters per class g are:

$$C_{g:j} = \frac{1}{N_{g:j}} L_{g:j}, \qquad (1)$$

$$R_{g:j} = \frac{1}{N_{g:j}} Q_{g:j} - \frac{1}{(N_{g:j})^2} L_{g:j} (L_{g:j})^T. \qquad (2)$$

We generalize K-means to compute m models, fitting a mixture model to each class. K-means is initialized, and then it iterates until it converges on all classes. The algorithm is given below.

Initialization:

1. Get the global N, L, Q and $\mu$, $\sigma$.
2. Get k random points per class to initialize C.

While not all m models converge:

1. E step: get the k distances per g; find the nearest cluster j per g; update N, L, Q per class.
2. M step: update W, C, R from N, L, Q per class; compute model quality per g; monitor convergence.

For scoring, in a similar manner to NB, a point is assigned to the class with the highest probability. In this case, this means getting the most probable cluster based on the mk Gaussians. The probability is multiplied by the class prior by default (under the common Bayesian framework), but it can be ignored for imbalanced classification problems.

3.2.2 Bayesian Classifier Programmed in SQL

Based on the mathematical framework introduced in the previous section, we now study how to create an efficient implementation of BKM in SQL.

[TABLE 1: BKM Tables]

Table 1 provides a summary of all working tables used by BKM. In this case, the tables for NB are extended with the j subscript, and since there is a clustering algorithm behind them, we introduce tables for distance computation. Like NB, since computations require accessing all attributes/distances/probabilities for each point at one time, tables have a physical organization that stores all values for point i at the same address, and there is a primary index on i for efficient hash join computation. In this manner, we force the query optimizer to perform n I/Os instead of the actual number of rows. This storage and indexing optimization is essential for all large tables having n or more rows.
This is a summary of optimizations. There is a single set of tables for the classifier: all m models are updated on the same table scan; this eliminates the need to create multiple temporary tables. We introduce a fast distance computation mechanism based on a flattened version of the centroids; temporary tables have fewer rows. We use a CASE statement to get the closest cluster, avoiding the use of standard SQL aggregations; a join between two large tables is avoided. We delete points from classes whose model has converged; the algorithm works with smaller tables as models converge.

3.2.3 Model Computation

Table CH is initialized with k random points per class; this initialization requires building a stratified sample per class.
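One way to build this stratified sample and flatten it into CH for the running example (k = 3) is sketched below, drawing the points from the normalized table XH introduced next; the random ordering column rnd (a random value assumed to be available per row) and the ROW_NUMBER() ranking are illustrative choices.

INSERT INTO CH
SELECT g
  ,max(CASE WHEN rnk=1 THEN X1 END) AS C1_X1 ,max(CASE WHEN rnk=1 THEN X2 END) AS C1_X2
  ,max(CASE WHEN rnk=1 THEN X3 END) AS C1_X3 ,max(CASE WHEN rnk=1 THEN X4 END) AS C1_X4
  ,max(CASE WHEN rnk=2 THEN X1 END) AS C2_X1 ,max(CASE WHEN rnk=2 THEN X2 END) AS C2_X2
  ,max(CASE WHEN rnk=2 THEN X3 END) AS C2_X3 ,max(CASE WHEN rnk=2 THEN X4 END) AS C2_X4
  ,max(CASE WHEN rnk=3 THEN X1 END) AS C3_X1 ,max(CASE WHEN rnk=3 THEN X2 END) AS C3_X2
  ,max(CASE WHEN rnk=3 THEN X3 END) AS C3_X3 ,max(CASE WHEN rnk=3 THEN X4 END) AS C3_X4
FROM (SELECT g, X1, X2, X3, X4
            ,ROW_NUMBER() OVER (PARTITION BY g ORDER BY rnd) AS rnk
      FROM XH) T
WHERE rnk <= 3
GROUP BY g;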
The global variance is used to normalize X to follow a N(0, 1) pdf. The global mean and global variance are derived as in NB. In this manner, K-means computes Euclidean distance with all attributes on the same relative scale. The normalized data set is stored in table XH, which also has n rows. BKM uses XH thereafter at each iteration.
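A sketch of this normalization for the running example is shown below, assuming GM is a one-row table holding the global mean and standard deviation of each attribute (mu1, .., mu4, sigma1, .., sigma4); the table and column names are illustrative.

INSERT INTO XH
SELECT i ,g
  ,(X1-mu1)/sigma1 AS X1 ,(X2-mu2)/sigma2 AS X2
  ,(X3-mu3)/sigma3 AS X3 ,(X4-mu4)/sigma4 AS X4
FROM X, GM;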
We first discuss several approaches to compute distance for each class. Based on these approaches, we derive a highly efficient scheme to compute distance. The following statements assume the same k per class.

The simplest, but inefficient, way to compute distances is to create pivoted versions of X and C having one value per row. Such a table structure enables the usage of standard SQL aggregations (e.g., sum(), count()). Evaluating a distance query involves creating a temporary table with mdkn rows, which will be much larger than X. Since the vertical variant is expected to be the slowest, it is not analyzed in the experimental section. Instead, we use a hybrid distance computation in which we have k distances per row. We start by considering a pivoted version of X called XV. The corresponding SQL query computes k distances per class using an intermediate table having dn rows. We call this variant hybrid vertical.
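A sketch of this hybrid vertical computation is given below, assuming the pivoted table XV(i, g, h, val) and a centroid table CV(g, h, C1, C2, C3) that is vertical on the attribute subscript but horizontal on the cluster subscript; the CV name and layout are illustrative.

INSERT INTO XD
SELECT XV.i ,XV.g
  ,sum((val-C1)**2) AS d1
  ,sum((val-C2)**2) AS d2
  ,sum((val-C3)**2) AS d3
FROM XV, CV
WHERE XV.g = CV.g AND XV.h = CV.h
GROUP BY XV.i, XV.g;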
We consider a horizontal scheme to compute distances. All k distances per class are computed as SELECT terms. This produces a temporary table with mkn rows. Then, the temporary table is pivoted and aggregated to produce a table having mn rows but k columns. Such a layout enables efficient computation of the nearest cluster in a single table scan. The query optimizer decides which join algorithm to use and does not index any intermediate table. We call this approach nested horizontal. We introduce a modification to the nested horizontal approach in which we create a primary index on the temporary distance table in order to enable fast join computation. We call this scheme horizontal temporary.

We introduce a yet more efficient scheme to get distance using a flattened version of C, which we call CH, that has dk columns corresponding to all k centroids for one class. The SQL code pairs each attribute from X with the corresponding attribute from the cluster. Computation is efficient because CH has only m rows and the join computes a table with n rows (for building the model) or mn rows for scoring. We call this approach the horizontal approach. There exists yet another "completely horizontal" approach in which all mk distances are stored on the same row for each point and C has only one row but mdk columns (or m tables, with dk columns each). Table 2 shows the expected I/O cost.

[TABLE 2: Query Optimization: Distance Computation]

The following SQL statement computes the k distances for each point, corresponding to the gth model. This statement is also used for scoring, and therefore, it is convenient to include g.

INSERT INTO XD
SELECT i, XH.g
  ,(X1-C1_X1)**2 + .. + (X4-C1_X4)**2
  , ..
  ,(X1-C3_X1)**2 + .. + (X4-C3_X4)**2
FROM XH, CH WHERE XH.g = CH.g;

We also exploited UDFs, which represent an extensibility mechanism [8] provided by the DBMS. UDFs are programmed in the fast C language and they can take advantage of arrays and flow control statements (loops, if-then). Based on this layout, we developed a UDF which receives $C_g$ as a parameter and can internally manipulate it as a matrix. The UDF is called mn times and it returns the closest cluster per class. That is, the UDF computes distances and determines the closest cluster. In the experimental section, we will study which is the most efficient alternative between SQL and the UDF. A big drawback of UDFs is that they are dependent on the database system architecture. Therefore, they are not portable, and optimizations developed for one database system may not work well in another one.

At this point, we have computed k distances per class and we need to determine the closest cluster. There are two basic alternatives: pivoting distances and using standard SQL aggregations, or using CASE statements to determine the minimum distance. For the first alternative, XD must be pivoted into a bigger table. Then, the minimum distance is determined using the min() aggregation. The closest cluster is the subscript of the minimum distance, which is determined by joining XH. In the second alternative, we just need to compare every distance against the rest using a CASE statement. Since the second alternative does not use joins and is based on a single table scan, it is much faster than using a pivoted version of XD. The SQL to determine the closest cluster per class for our running example is given below. This statement can also be used for scoring.

INSERT INTO XN SELECT i, g
  ,CASE WHEN d1 <= d2 AND d1 <= d3 THEN 1
        WHEN d2 <= d3 THEN 2
        ELSE 3 END AS j
FROM XD;

Once we have determined the closest cluster for each point in each class, we need to update the sufficient statistics, which are just sums. This computation requires joining XH and XN and partitioning points by class and cluster number. Since table NLQ is denormalized, the jth vector and the jth matrix are computed together. The join computation is demanding since we are joining two tables with O(n) rows. For BKM, it is unlikely that the variance can have numerical issues, but it is feasible. In such a case, sufficient statistics are substituted by a two-pass computation like NB.

INSERT INTO NLQ SELECT XH.g, j
  ,sum(1.0) AS Ng
  ,sum(X1), .. ,sum(X4)         /* L */
  ,sum(X1**2), .. ,sum(X4**2)   /* Q */
FROM XH, XN WHERE XH.i = XN.i GROUP BY XH.g, j;

We now discuss the M step to update W, C, R. Computing WCR from NLQ is straightforward since both tables have the same structure and they are small. There is a Cartesian product between NLQ and the model quality table, but this latter table has only one row. Finally, to speed up computations, we delete points from XH for classes whose model has converged: the reduction in size is propagated to the nearest-cluster and sufficient statistics queries.
INSERT INTO WCR SELECT
  NLQ.g, NLQ.j
  ,Ng/MODEL.n                                /* pi */
  ,L1/Ng, .. ,L4/Ng                          /* C */
  ,Q1/Ng-(L1/Ng)**2, .. ,Q4/Ng-(L4/Ng)**2    /* R */
FROM NLQ, MODEL WHERE NLQ.g = MODEL.g;
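The deletion of points from converged classes mentioned above can be a single statement, assuming the MODEL table carries a per-class convergence flag (the column name converged is illustrative):

DELETE FROM XH
WHERE g IN (SELECT g FROM MODEL WHERE converged = 'Y');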
3.2.4 Scoring

We consider two alternatives: based on probability (the default) or on distance. Scoring is similar to the E step, but there are differences. First, the new data set must be normalized with the original variance used to build the model. Such variance should be similar to the variance of the new data set. Second, we need to compute mk probabilities (distances) instead of only k distances because we are trying to find the most probable (closest) cluster among all of them (this is a subtle, but important, difference in SQL code generation); thus, the join condition between XH and CH is eliminated. Third, once we have mk probabilities (distances), we need to pick the maximum (minimum) one, and then the point is assigned to the corresponding cluster. To have a more elegant implementation, we predict the class in two stages: we first determine the most probable cluster per class, and then compare the probabilities of such clusters. Column XH.g is not used for scoring purposes, but it is used to measure accuracy, when known.

4 EXPERIMENTAL EVALUATION

We analyze three major aspects: 1) classification accuracy, 2) query optimization, and 3) time complexity and speed. We compare the accuracy of NB, BKM, and decision trees (DTs).

4.1 Setup

We used the Teradata DBMS running on a server with a 3.2 GHz CPU, 2 GB of RAM, and a 750 GB disk.

Parameters were set as follows: We set $\epsilon = 0.001$ for K-means. The number of clusters per class was k = 4 (a setting that is experimentally justified). All query optimizations were turned on by default (they do not affect model accuracy). Experiments with DTs were performed using a data mining tool.

We used real data sets to test classification accuracy (from the UCI repository) and synthetic data sets to analyze speed (varying d and n). Real data sets include pima (d = 6, n = 768), spam (d = 7, n = 4,601), bscale (d = 4, n = 625), and wbcancer (d = 7, n = 569). Categorical attributes (≥ 3 values) were transformed into binary attributes.

4.2 Model Accuracy

In this section, we measure the accuracy of predictions when using Bayesian classification models. We used 5-fold cross validation. For each run, the data set was partitioned into a training set and a test set. The training set was used to compute the model, whereas the test set was used to independently measure accuracy. The training set size was 80 percent and the test set size was 20 percent. BKM ran until K-means converged on all classes. Decision trees used the CN5.0 algorithm, splitting nodes until they reached a minimum percentage of purity or became too small. Pruning was applied to reduce overfitting. The number of clusters for BKM was k = 4 by default.

The number of clusters (k) is the most important parameter to tune BKM accuracy. Table 3 shows accuracy behavior as k increases. A higher k produces more complex models. Intuitively, a higher k should achieve higher accuracy because each class can be better approximated with localized clusters. On the other hand, a higher k than necessary can lead to empty clusters. As can be seen, a lower k produces better models for spam and wbcancer, whereas a higher k produces higher accuracy for pima and bscale. Therefore, there is no ideal setting for k, but, in general, k ≤ 20 produces good results. Based on these results, we set k = 4 by default since it provides reasonable accuracy for all data sets.

[TABLE 3: Accuracy Varying k]

Table 4 compares the accuracy of the three models, including the overall accuracy as well as a breakdown per class. Accuracy per predicted class is important to understand issues with imbalanced classes and to detect subsets of highly similar points. In practice, one class is generally more important than the other one (asymmetric), depending on whether the classification task is to minimize false positives or false negatives. This is a summary of findings. First of all, considering global accuracy, BKM is better than NB on two data sets and tied on the other two data sets. On the other hand, compared to decision trees, it is better in one case and worse otherwise. NB follows the same pattern, but shows a wider gap. However, it is important to understand why BKM and NB perform worse than DT in some cases by looking at the accuracy breakdown. In this case, BKM generally does better than NB, except in two cases. Therefore, class decomposition does increase prediction accuracy. On the other hand, compared to decision trees, in all cases where BKM is less accurate than DT on global accuracy, BKM has higher accuracy on one class. In fact, on closer look, it can be seen that DT completely misses the prediction for one class on the bscale data set. That is, global accuracy for DT is misleading. We conclude that BKM performs better than NB and is competitive with DT.

[TABLE 4: Comparing Accuracy: NB, BKM, and DT]
4.3 Query Optimization

Table 5 provides detailed experiments on distance computation, the most demanding computation for BKM. Our best distance strategy is two orders of magnitude faster than the worst strategy and one order of magnitude faster than its closest rival. The explanation is that I/O is minimized to n operations and computations happen in main memory for each row through SQL arithmetic expressions. Note that a standard aggregation on the pivoted version of X (XV) is faster than the horizontal nested query variant.

[TABLE 5: Query Optimization: Distance Computation, n = 100k (Seconds)]

Table 6 compares SQL with UDFs to score the data set (computing distances and finding the nearest cluster per class). We exploit scalar UDFs [5]. Since finding the nearest cluster is straightforward in the UDF, this comparison considers both computations as one. This experiment favors the UDF since SQL requires accessing large tables in separate queries. As we can see, SQL (with arithmetic expressions) turned out to be faster than the UDF. This was interesting because both approaches used the same table as input and performed the same I/O reading a large table. We expected SQL to be slower because it required a join and both XH and XD were accessed. The explanation was that the UDF has overhead to pass each point and the model as parameters in each call.

[TABLE 6: Query Optimization: SQL versus UDF (Scoring, n = 1M) (Seconds)]

4.4 Speed and Time Complexity

We compare SQL and C++ running on the same computer. We also compare the time to export data with ODBC. C++ worked on flat files exported from the DBMS. We used binary files in order to get maximum performance in C++. Also, we shut down the DBMS when C++ was running. In short, we conducted a fair comparison. Table 7 compares SQL, C++, and ODBC varying n. Clearly, ODBC is a bottleneck. Overall, both languages scale linearly. We can see SQL performs better as n grows because DBMS overhead becomes less important. However, C++ is about four times faster.

[TABLE 7: Comparing SQL, C++, and ODBC with NB (d = 4) (Seconds)]

Fig. 1 shows BKM time complexity varying n and d with large data sets. Time is measured for one iteration. BKM is linear in n and in d, highlighting its scalability.

[Fig. 1: BKM Classifier: Time complexity (default d = 4, k = 4, n = 100k)]

5 RELATED WORK

The most widely used approach to integrate data mining algorithms into a DBMS is to modify the internal source code. Scalable K-means (SKM) [1] and O-cluster [3] are two examples of clustering algorithms internally integrated with a DBMS. A discrete Naive Bayes classifier has been internally integrated with the SQL Server DBMS. On the other hand, the two main mechanisms to integrate data mining algorithms without modifying the DBMS source code are SQL queries and UDFs. A discrete Naïve Bayes classifier programmed in SQL is introduced in [6]. We summarize the differences with ours. The data set is assumed to have discrete attributes: binning numeric attributes is not considered. It uses an inefficient large pivoted intermediate table, whereas our discrete NB model can directly work on a horizontal layout. Our proposal extends clustering algorithms in SQL [4] to perform classification, generalizing Naïve Bayes [2]. K-means clustering was programmed with SQL queries introducing three variants [4]: standard, optimized, and incremental. We generalized the optimized variant. Note that classification represents a significantly harder problem than clustering. User-Defined Functions are identified as an important extensibility mechanism to integrate data mining algorithms [5], [8]. ATLaS [8] extends SQL syntax with object-oriented constructs to define aggregate and table functions (with initialize, iterate, and terminate clauses), providing a user-friendly interface to the SQL standard. We point out several differences with our work. First, we propose to generate SQL code from a host language, thus achieving Turing-completeness in SQL (similar to embedded SQL). We showed the Bayesian classifier can be solved more efficiently with SQL queries than with UDFs. Even further, SQL code provides better portability, and ATLaS requires modifying the DBMS source code. Class decomposition with clustering is shown to improve NB accuracy [7]. The classifier can adapt to skewed distributions and overlapping subsets of points by building better local models. In [7], EM was the algorithm used to fit a mixture per class. Instead, we decided to use K-means because it is faster, simpler, better understood in database research, and has fewer numerical issues.

6 CONCLUSIONS

We presented two Bayesian classifiers programmed in SQL: the Naïve Bayes classifier (with discrete and numeric versions) and a generalization of Naïve Bayes (BKM), based on decomposing classes with K-means clustering. We studied two complementary aspects: increasing accuracy and generating efficient SQL code. We introduced query optimizations to generate fast SQL code. The best physical storage layout and primary index for large tables is based on the point subscript. Sufficient statistics are stored on denormalized tables. The Euclidean distance computation uses a flattened (horizontal) version of the cluster centroids matrix, which enables arithmetic expressions. The nearest cluster per class, required by K-means, is efficiently determined avoiding joins and aggregations. Experiments with real data sets compared NB, BKM, and decision trees. The numeric and discrete versions of NB had similar accuracy. BKM was more accurate than NB and similar to decision trees in global accuracy. However, BKM was more accurate when computing a breakdown of accuracy per class. A low number of clusters produced good results in most cases. We compared equivalent implementations of NB in SQL and C++ with large data sets: SQL was four times slower.
SQL queries were faster than UDFs to score, highlighting the importance of our optimizations. NB and BKM exhibited linear scalability in data set size and dimensionality.

There are many opportunities for future work. We want to derive incremental versions or sample-based methods to accelerate the Bayesian classifier. We want to improve our Bayesian classifier to produce more accurate models with skewed distributions, data sets with missing information, and subsets of points having significant overlap with each other, which are known issues for clustering algorithms. We are interested in combining dimensionality reduction techniques like PCA or factor analysis with Bayesian classifiers. UDFs need further study to accelerate computations and evaluate complex mathematical equations.

ACKNOWLEDGMENTS

This research work was partially supported by US National Science Foundation grants CCF 0937562 and IIS 0914861.

REFERENCES

[1] P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to Large Databases," Proc. ACM Knowledge Discovery and Data Mining (KDD) Conf., pp. 9-15, 1998.
[2] T. Hastie, R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning, first ed. Springer, 2001.
[3] B.L. Milenova and M.M. Campos, "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 290-297, 2002.
[4] C. Ordonez, "Integrating K-Means Clustering with a Relational DBMS Using SQL," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201, Feb. 2006.
[5] C. Ordonez, "Building Statistical Models and Scoring with UDFs," Proc. ACM SIGMOD, pp. 1005-1016, 2007.
[6] S. Thomas and M.M. Campos, SQL-Based Naive Bayes Model Building and Scoring, US Patent 7,051,037, US Patent and Trade Office, 2006.
[7] R. Vilalta and I. Rish, "A Decomposition of Classes via Clustering to Explain and Improve Naive Bayes," Proc. European Conf. Machine Learning (ECML), pp. 444-455, 2003.
[8] H. Wang, C. Zaniolo, and C.R. Luo, "ATLaS: A Small but Complete SQL Extension for Data Mining and Data Streams," Proc. Very Large Data Bases (VLDB) Conf., pp. 1113-1116, 2003.

 
Phish market protocol
Phish market protocolPhish market protocol
Phish market protocolingenioustech
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routingingenioustech
 
Online social network
Online social networkOnline social network
Online social networkingenioustech
 
On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recoveryingenioustech
 
It auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingIt auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingingenioustech
 

Plus de ingenioustech (20)

Supporting efficient and scalable multicasting
Supporting efficient and scalable multicastingSupporting efficient and scalable multicasting
Supporting efficient and scalable multicasting
 
Monitoring service systems from
Monitoring service systems fromMonitoring service systems from
Monitoring service systems from
 
Locally consistent concept factorization for
Locally consistent concept factorization forLocally consistent concept factorization for
Locally consistent concept factorization for
 
Measurement and diagnosis of address
Measurement and diagnosis of addressMeasurement and diagnosis of address
Measurement and diagnosis of address
 
Impact of le arrivals and departures on buffer
Impact of  le arrivals and departures on bufferImpact of  le arrivals and departures on buffer
Impact of le arrivals and departures on buffer
 
Exploiting dynamic resource allocation for
Exploiting dynamic resource allocation forExploiting dynamic resource allocation for
Exploiting dynamic resource allocation for
 
Efficient computation of range aggregates
Efficient computation of range aggregatesEfficient computation of range aggregates
Efficient computation of range aggregates
 
Dynamic measurement aware
Dynamic measurement awareDynamic measurement aware
Dynamic measurement aware
 
Design and evaluation of a proxy cache for
Design and evaluation of a proxy cache forDesign and evaluation of a proxy cache for
Design and evaluation of a proxy cache for
 
Throughput optimization in
Throughput optimization inThroughput optimization in
Throughput optimization in
 
Tcp
TcpTcp
Tcp
 
Privacy preserving
Privacy preservingPrivacy preserving
Privacy preserving
 
Phish market protocol
Phish market protocolPhish market protocol
Phish market protocol
 
Peering equilibrium multi path routing
Peering equilibrium multi path routingPeering equilibrium multi path routing
Peering equilibrium multi path routing
 
Peace
PeacePeace
Peace
 
Online social network
Online social networkOnline social network
Online social network
 
On the quality of service of crash recovery
On the quality of service of crash recoveryOn the quality of service of crash recovery
On the quality of service of crash recovery
 
Layered approach
Layered approachLayered approach
Layered approach
 
It auditing to assure a secure cloud computing
It auditing to assure a secure cloud computingIt auditing to assure a secure cloud computing
It auditing to assure a secure cloud computing
 
Intrution detection
Intrution detectionIntrution detection
Intrution detection
 

Dernier

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 

Dernier (20)

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 

Bayesian Classifiers Programmed in SQL

(class or target), and n records (points). We assume G has m = 2 values. Data set X represents a $d \times n$ matrix, where $x_i$ represents a column vector. We study two complementary models: 1) each class is approximated by a normal distribution or histogram, and 2) a mixture model with k clusters is fitted on each class with K-means. We use subscripts i, j, h, g as follows: i = 1,...,n; j = 1,...,k; h = 1,...,d; g = 1,...,m. The T superscript indicates matrix transposition. Throughout the paper, we use a small running example where d = 4, k = 3 (for K-means), and m = 2 (binary).

Classification is a fundamental problem in machine learning and statistics [2]. Bayesian classifiers stand out for their robustness, interpretability, and accuracy [2]. They are deeply related to maximum likelihood estimation and discriminant analysis [2], highlighting their theoretical importance. In this work, we focus on Bayesian classifiers considering two variants: Naïve Bayes [2] and a Bayesian classifier based on class decomposition using clustering [7].
In this work, we integrate Bayesian classification algorithms into a DBMS. Such integration allows users to directly analyze data sets inside the DBMS and to exploit its extensive capabilities (storage management, querying, concurrency control, fault tolerance, and security). We use SQL queries as the programming mechanism, since SQL is the standard language of a DBMS. More importantly, using SQL eliminates the need to understand and modify the internal DBMS source code, which is a difficult task. Unfortunately, SQL has two important drawbacks: it has limitations for manipulating vectors and matrices, and it has more overhead than a systems language like C. Keeping those issues in mind, we study how to evaluate mathematical equations with several table layouts and optimized SQL queries.

Our contributions are the following. We present two efficient SQL implementations of Naïve Bayes for numeric and discrete attributes. We introduce a classification algorithm that builds one clustering model per class, which is a generalization of K-means [1], [4]. Our main contribution is a Bayesian classifier programmed in SQL, extending Naïve Bayes, which uses K-means to decompose each class into clusters. We generalize queries for clustering by adding a new problem dimension; that is, our novel queries combine three dimensions: attribute, cluster, and class subscripts. We identify euclidean distance as the most time-consuming computation and introduce several schemes to compute distance efficiently, considering different storage layouts for the data set and classification model.

3 BAYESIAN CLASSIFIERS PROGRAMMED IN SQL

We introduce two classifiers: Naïve Bayes (NB) and a Bayesian classifier based on K-means (BKM) that performs class decomposition. We consider two phases for both classifiers: 1) computing the model and 2) scoring a data set based on the model. Our optimizations do not alter model accuracy.

3.1 Naïve Bayes

We consider two versions of NB: one for numeric attributes and another for discrete attributes. Numeric NB will be improved with class decomposition.

NB assumes attributes are independent, and thus, the joint class conditional probability can be estimated as the product of the probabilities of each attribute [2]. We now discuss NB based on a multivariate Gaussian. NB has no input parameters. Each class is modeled as a single normal distribution with mean vector $C_g$ and a diagonal variance matrix $R_g$. Scoring assumes a model is available and that there exists a data set with the same attributes, in order to predict class G. Variance computation based on sufficient statistics [5] in one pass can be numerically unstable when the variance is much smaller than large attribute values or when the data set has a mix of very large and very small numbers. For ill-conditioned data sets, the computed variance can be significantly different from the actual variance or even become negative. Therefore, the model is computed in two passes: a first pass to get the mean per class and a second one to compute the variance per class. The mean per class is given by $C_g = \frac{1}{N_g} \sum_{x_i \in Y_g} x_i$, where $Y_g \subseteq X$ are the records in class g. The equation $R_g = \frac{1}{N_g} \sum_{x_i \in Y_g} (x_i - C_g)(x_i - C_g)^T$ gives a diagonal variance matrix $R_g$, which is numerically stable but requires two passes over the data set.

The SQL implementation for numeric NB follows the mean and variance equations introduced above. We compute three aggregations grouping by g with two queries. The first query computes the mean $C_g$ of class g with a sum(Xh)/count(*) aggregation and the class priors $\pi_g$ with a count() aggregation. The second query computes $R_g$ with a sum((Xh - mu_h)**2) aggregation. Note that the joint probability computation is not done in this phase.
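As an illustration, the two passes could look as follows for the running example (d = 4). This is a minimal sketch, assuming an input table X(i, X1, ..., X4, g); the output table names NB_MEAN, NB_VAR and the column names M1..M4, Ng are illustrative, not taken from the paper.

  /* Pass 1: mean per class and class counts (priors are Ng/n) */
  INSERT INTO NB_MEAN
  SELECT g
        ,sum(X1)/count(*) AS M1 ,sum(X2)/count(*) AS M2
        ,sum(X3)/count(*) AS M3 ,sum(X4)/count(*) AS M4
        ,count(*) AS Ng
  FROM X
  GROUP BY g;

  /* Pass 2: diagonal variance per class, using the means from pass 1 */
  INSERT INTO NB_VAR
  SELECT X.g
        ,sum((X1-M1)**2)/count(*) ,sum((X2-M2)**2)/count(*)
        ,sum((X3-M3)**2)/count(*) ,sum((X4-M4)**2)/count(*)
  FROM X, NB_MEAN
  WHERE X.g = NB_MEAN.g
  GROUP BY X.g;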
Scoring uses the Gaussian parameters as input to classify an input point to the most probable class, with one query in one pass over X. Each class probability is evaluated as a Gaussian. To avoid numerical issues when a variance is zero, the corresponding probability is set to 1, and the joint probability is computed with a sum of probability logarithms instead of a product of probabilities. A CASE statement pivots probabilities and avoids a max() aggregation. A final query determines the predicted class, which is the one with maximum probability, obtained with a CASE statement.
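A sketch of the scoring queries for the running example (d = 4, m = 2) is shown below, assuming the Gaussian parameters were flattened into a one-row table NBH with one set of columns per class; the names NBH, P1, P2, M1_1..M4_2, V1_1..V4_2, XP, XPRED are illustrative, and the zero-variance guard described above is omitted for brevity.

  /* One scan of X: log-probability of each class per point */
  INSERT INTO XP
  SELECT i
    ,ln(P1) - 0.5*( (X1-M1_1)**2/V1_1 + (X2-M2_1)**2/V2_1
                   +(X3-M3_1)**2/V3_1 + (X4-M4_1)**2/V4_1
                   + ln(V1_1)+ln(V2_1)+ln(V3_1)+ln(V4_1) ) AS lp1
    ,ln(P2) - 0.5*( (X1-M1_2)**2/V1_2 + (X2-M2_2)**2/V2_2
                   +(X3-M3_2)**2/V3_2 + (X4-M4_2)**2/V4_2
                   + ln(V1_2)+ln(V2_2)+ln(V3_2)+ln(V4_2) ) AS lp2
  FROM X, NBH;

  /* Predicted class: the most probable one, obtained with a CASE statement */
  INSERT INTO XPRED
  SELECT i, CASE WHEN lp1 >= lp2 THEN 1 ELSE 2 END AS g
  FROM XP;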
We now discuss NB for discrete attributes. For numeric NB, we used Gaussians because they work well for large data sets and because they are easy to manipulate mathematically; that is, NB does not assume any specific probability density function (pdf). Assume X1,...,Xd can be discrete or numeric. If an attribute Xh is discrete (categorical), NB simply computes its histogram: probabilities are derived as counts per value divided by the corresponding number of points in each class. Otherwise, if the attribute Xh is numeric, then binning is required. Binning requires two passes over the data set, much like numeric NB. In the first pass, bin boundaries are determined. In the second pass, one-dimensional frequency histograms are computed on each attribute. The bin boundaries (interval ranges) may impact the accuracy of NB due to skewed distributions or extreme values. Thus, we consider two techniques to bin attributes: 1) creating k uniform intervals between the minimum and the maximum, and 2) taking intervals around the mean based on multiples of the standard deviation, thus getting more evenly populated bins. We do not study other binning schemes such as quantiles (i.e., equidepth binning).

The SQL implementation of discrete NB is straightforward. For discrete attributes, no preprocessing is required. For numeric attributes, the minimum, maximum, and mean can be determined in one pass with a single query. The variance of all numeric attributes is computed on a second pass to avoid numerical issues. Then, each attribute is discretized by finding the interval for each value. Once we have a binned version of X, we compute histograms on each attribute with SQL aggregations. Probabilities are obtained by dividing by the number of records in each class. Scoring requires determining the interval for each attribute value and retrieving its probability. Each class probability is again computed by adding logarithms. NB has an advantage over other classifiers: it can handle a data set with mixed attribute types (i.e., discrete and numerical).
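For one attribute, the histogram computation could be sketched as follows, assuming a binned (or already discrete) table XB(i, X1, ..., X4, g); the table names XB, H1 and the windowed normalization are illustrative choices, not the paper's exact code.

  /* Frequency histogram of attribute X1 per class, normalized into probabilities */
  INSERT INTO H1
  SELECT g
        ,X1 AS val
        ,count(*) AS cnt
        ,cast(count(*) AS FLOAT) / sum(count(*)) OVER (PARTITION BY g) AS prob
  FROM XB
  GROUP BY g, X1;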
3.2 Bayesian Classifier Based on K-Means

We now present BKM, a Bayesian classifier based on class decomposition obtained with the K-means algorithm. BKM is a generalization of NB, where NB has one cluster per class and the Bayesian classifier has k >= 1 clusters per class. For ease of exposition and elegance, we use the same k for each class, but we experimentally show this is a sound decision. We consider two major tasks: model computation and scoring a data set based on a computed model. The most time-consuming phase is building the model. Scoring is equivalent to an E step of K-means executed on each class.

3.2.1 Mixture Model and Decomposition Algorithm

We fit a mixture of k clusters to each class with K-means [4]. The output are the priors $\pi$ (same as NB), mk weights (W), mk centroids (C), and mk variance matrices (R). We generalize NB notation to k clusters per class g with the notation g:j, meaning the given vector or matrix refers to the jth cluster of class g. Sums are defined over vectors (instead of variables). Class priors are analogous to NB: $\pi_g = N_g / n$.

We now introduce sufficient statistics in the generalized notation. Let $X_{g:j}$ represent partition j of the points in class g (as determined by K-means). Then L, the linear sum of points, is $L_{g:j} = \sum_{x_i \in X_{g:j}} x_i$, and the sum of "squared" points for cluster g:j becomes $Q_{g:j} = \sum_{x_i \in X_{g:j}} x_i x_i^T$. Based on L and Q, the Gaussian parameters per class g are

$C_{g:j} = \frac{1}{N_{g:j}} L_{g:j}$,   (1)

$R_{g:j} = \frac{1}{N_{g:j}} Q_{g:j} - \frac{1}{(N_{g:j})^2} L_{g:j} (L_{g:j})^T$.   (2)

We generalize K-means to compute m models, fitting a mixture model to each class. K-means is initialized, and then it iterates until it converges on all classes. The algorithm is given below.

Initialization:
1. Get global N, L, Q and the global mean and variance.
2. Get k random points per class to initialize C.

While not all m models converge:
1. E step: compute the k distances per g; find the nearest cluster j per g; update N, L, Q per class.
2. M step: update W, C, R from N, L, Q per class; compute model quality per g; monitor convergence.

For scoring, in a similar manner to NB, a point is assigned to the class with highest probability. In this case, this means getting the most probable cluster based on the mk Gaussians. The probability is multiplied by the class prior by default (under the common Bayesian framework), but the prior can be ignored for imbalanced classification problems.

3.2.2 Bayesian Classifier Programmed in SQL

Based on the mathematical framework introduced in the previous section, we now study how to create an efficient implementation of BKM in SQL.

TABLE 1. BKM Tables

Table 1 provides a summary of all working tables used by BKM. The tables for NB are extended with the j subscript, and since there is a clustering algorithm behind, we introduce tables for distance computation. Like NB, since computations require accessing all attributes/distances/probabilities of each point at one time, tables have a physical organization that stores all values for point i at the same address, and there is a primary index on i for efficient hash join computation. In this manner, we force the query optimizer to perform n I/Os instead of as many I/Os as the actual number of rows. This storage and indexing optimization is essential for all large tables having n or more rows.

This is a summary of the optimizations. There is a single set of tables for the classifier: all m models are updated on the same table scan, which eliminates the need to create multiple temporary tables. We introduce a fast distance computation mechanism based on a flattened version of the centroids; temporary tables have fewer rows. We use a CASE statement to get the closest cluster, avoiding standard SQL aggregations; a join between two large tables is avoided. We delete points from classes whose model has converged; the algorithm works with smaller tables as models converge.
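To make the physical design concrete, two of the working tables could be declared as follows for the running example (d = 4, k = 3), with the point subscript i as the primary index discussed above. This is a sketch in Teradata-style syntax; the exact column sets are those of Table 1 (not reproduced here), so the declarations below are illustrative.

  /* Normalized data set: one row per point, all d attributes together */
  CREATE TABLE XH (
    i INTEGER NOT NULL
   ,g INTEGER
   ,X1 FLOAT, X2 FLOAT, X3 FLOAT, X4 FLOAT
  ) PRIMARY INDEX (i);

  /* Flattened centroids: one row per class, d*k centroid coordinates */
  CREATE TABLE CH (
    g INTEGER NOT NULL
   ,C1_X1 FLOAT, C1_X2 FLOAT, C1_X3 FLOAT, C1_X4 FLOAT
   ,C2_X1 FLOAT, C2_X2 FLOAT, C2_X3 FLOAT, C2_X4 FLOAT
   ,C3_X1 FLOAT, C3_X2 FLOAT, C3_X3 FLOAT, C3_X4 FLOAT
  ) PRIMARY INDEX (g);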
3.2.3 Model Computation

Table CH is initialized with k random points per class; this initialization requires building a stratified sample per class. The global variance is used to normalize X to follow a N(0,1) pdf. The global mean and global variance are derived as in NB. In this manner, K-means computes euclidean distances with all attributes on the same relative scale. The normalized data set is stored in table XH, which also has n rows. BKM uses XH thereafter at each iteration.

We first discuss several approaches to compute the distance for each class. Based on these approaches, we derive a highly efficient scheme to compute distance. The following statements assume the same k per class.

The simplest, but inefficient, way to compute distances is to create pivoted versions of X and C having one value per row. Such a table structure enables the use of standard SQL aggregations (e.g., sum(), count()). Evaluating a distance query then involves creating a temporary table with mdkn rows, which is much larger than X. Since this vertical variant is expected to be the slowest, it is not analyzed in the experimental section. Instead, we use a hybrid distance computation in which we have k distances per row. We start by considering a pivoted version of X called XV. The corresponding SQL query computes k distances per class using an intermediate table having dn rows. We call this variant hybrid vertical.

We also consider a horizontal scheme to compute distances. All k distances per class are computed as SELECT terms. This produces a temporary table with mkn rows. Then, the temporary table is pivoted and aggregated to produce a table having mn rows but k columns. Such a layout enables efficient computation of the nearest cluster in a single table scan. The query optimizer decides which join algorithm to use and does not index any intermediate table. We call this approach nested horizontal. We introduce a modification of the nested horizontal approach in which we create a primary index on the temporary distance table in order to enable fast join computation. We call this scheme horizontal temporary.

We introduce a yet more efficient scheme to get distances using a flattened version of C, which we call CH, that has dk columns corresponding to all k centroids of one class. The SQL code pairs each attribute of X with the corresponding attribute of the cluster. Computation is efficient because CH has only m rows and the join computes a table with n rows (for building the model) or mn rows (for scoring). We call this approach the horizontal approach. There exists yet another "completely horizontal" approach in which all mk distances are stored on the same row for each point and C has only one row but mdk columns (or m tables with dk columns each).

TABLE 2. Query Optimization: Distance Computation

Table 2 shows the expected I/O cost. The following SQL statement computes the k distances for each point corresponding to the gth model. This statement is also used for scoring, and therefore, it is convenient to include g.

  INSERT INTO XD SELECT i, XH.g
    ,(X1-C1_X1)**2 + .. + (X4-C1_X4)**2
    , ..
    ,(X1-C3_X1)**2 + .. + (X4-C3_X4)**2
  FROM XH, CH WHERE XH.g = CH.g;
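The flattened table CH used in the distance query can be produced from a vertical centroid layout by pivoting. A minimal sketch is shown below, assuming a table C(g, j, h, val) that holds one centroid coordinate per row; that vertical layout and the CASE-based pivot are illustrative, not the paper's code, and the output column order follows C1_X1 .. C3_X4.

  /* Pivot k*d centroid coordinates per class into one row of CH */
  INSERT INTO CH
  SELECT g
    ,max(CASE WHEN j=1 AND h=1 THEN val END), max(CASE WHEN j=1 AND h=2 THEN val END)
    ,max(CASE WHEN j=1 AND h=3 THEN val END), max(CASE WHEN j=1 AND h=4 THEN val END)
    ,max(CASE WHEN j=2 AND h=1 THEN val END), max(CASE WHEN j=2 AND h=2 THEN val END)
    ,max(CASE WHEN j=2 AND h=3 THEN val END), max(CASE WHEN j=2 AND h=4 THEN val END)
    ,max(CASE WHEN j=3 AND h=1 THEN val END), max(CASE WHEN j=3 AND h=2 THEN val END)
    ,max(CASE WHEN j=3 AND h=3 THEN val END), max(CASE WHEN j=3 AND h=4 THEN val END)
  FROM C
  GROUP BY g;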
We also exploited UDFs, which represent an extensibility mechanism [8] provided by the DBMS. UDFs are programmed in the fast C language, and they can take advantage of arrays and flow control statements (loops, if-then). Based on this layout, we developed a UDF which receives $C_g$ as a parameter and can internally manipulate it as a matrix. The UDF is called mn times, and it returns the closest cluster per class; that is, the UDF computes distances and determines the closest cluster. In the experimental section, we study which is the most efficient alternative between SQL and the UDF. A big drawback of UDFs is that they depend on the database system architecture. Therefore, they are not portable, and optimizations developed for one database system may not work well in another one.

At this point, we have computed k distances per class and we need to determine the closest cluster. There are two basic alternatives: pivoting distances and using standard SQL aggregations, or using CASE statements to determine the minimum distance. For the first alternative, XD must be pivoted into a bigger table. Then, the minimum distance is determined using the min() aggregation. The closest cluster is the subscript of the minimum distance, which is determined by joining with XH. In the second alternative, we just need to compare every distance against the rest using a CASE statement. Since the second alternative does not use joins and is based on a single table scan, it is much faster than using a pivoted version of XD. The SQL to determine the closest cluster per class for our running example is given below. This statement can also be used for scoring.

  INSERT INTO XN SELECT i, g
    ,CASE WHEN d1 <= d2 AND d1 <= d3 THEN 1
          WHEN d2 <= d3 THEN 2
          ELSE 3 END AS j
  FROM XD;

Once we have determined the closest cluster for each point in each class, we need to update the sufficient statistics, which are just sums. This computation requires joining XH and XN and partitioning points by class and cluster number. Since table NLQ is denormalized, the jth vector and the jth matrix are computed together. The join computation is demanding since we are joining two tables with O(n) rows. For BKM, it is unlikely that the variance has numerical issues, but it is feasible; in such a case, sufficient statistics are substituted by a two-pass computation like NB.

  INSERT INTO NLQ SELECT XH.g, j
    ,sum(1.0) AS Ng
    ,sum(X1), .. ,sum(X4)            /* L */
    ,sum(X1**2), .. ,sum(X4**2)      /* Q */
  FROM XH, XN WHERE XH.i = XN.i GROUP BY XH.g, j;

We now discuss the M step, which updates W, C, R. Computing WCR from NLQ is straightforward since both tables have the same structure and they are small. There is a Cartesian product between NLQ and the model quality table, but this latter table has only one row. Finally, to speed up computations, we delete points from XH for classes whose model has converged: the reduction in size is propagated to the nearest-cluster and sufficient statistics queries.

  INSERT INTO WCR SELECT NLQ.g, NLQ.j
    ,Ng/MODEL.n                                    /* pi */
    ,L1/Ng, .. ,L4/Ng                              /* C */
    ,Q1/Ng-(L1/Ng)**2, .. ,Q4/Ng-(L4/Ng)**2        /* R */
  FROM NLQ, MODEL WHERE NLQ.g = MODEL.g;
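The pruning step mentioned above, deleting the points of classes whose model has already converged, could be as simple as the statement below; the convergence flag on the MODEL table is an assumption made for illustration.

  /* Shrink the working set: drop points of classes that already converged */
  DELETE FROM XH
  WHERE g IN (SELECT g FROM MODEL WHERE converged = 1);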
3.2.4 Scoring

We consider two scoring alternatives: based on probability (the default) or on distance. Scoring is similar to the E step, but there are differences. First, the new data set must be normalized with the original variance used to build the model; this variance should be similar to the variance of the new data set. Second, we need to compute mk probabilities (distances) instead of only k distances, because we are trying to find the most probable (closest) cluster among all of them (a subtle but important difference in SQL code generation); thus, the join condition between XH and CH is eliminated. Third, once we have the mk probabilities (distances), we need to pick the maximum (minimum) one, and then the point is assigned to the corresponding cluster. To have a more elegant implementation, we predict the class in two stages: we first determine the most probable cluster per class, and then compare the probabilities of those clusters. Column XH.g is not used for scoring purposes, but it is used to measure accuracy when the class is known.
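The second stage of this two-stage prediction could be sketched as follows for m = 2, assuming the most probable cluster per class has already been materialized in a table XBEST(i, g, p) with one row per point and class; XBEST, p, and XPRED are illustrative names, not the paper's code.

  /* Compare the best cluster of each class and pick the most probable class */
  INSERT INTO XPRED
  SELECT i
        ,CASE WHEN max(CASE WHEN g = 1 THEN p END)
                >= max(CASE WHEN g = 2 THEN p END)
              THEN 1 ELSE 2 END AS predicted_g
  FROM XBEST
  GROUP BY i;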
4 EXPERIMENTAL EVALUATION

We analyze three major aspects: 1) classification accuracy, 2) query optimization, and 3) time complexity and speed. We compare the accuracy of NB, BKM, and decision trees (DTs).

4.1 Setup

We used the Teradata DBMS running on a server with a 3.2 GHz CPU, 2 GB of RAM, and a 750 GB disk. Parameters were set as follows. We set $\epsilon$ = 0.001 for K-means. The number of clusters per class was k = 4 (a setting experimentally justified below). All query optimizations were turned on by default (they do not affect model accuracy). Experiments with DTs were performed using a data mining tool.

We used real data sets (from the UCI repository) to test classification accuracy and synthetic data sets to analyze speed (varying d and n). The real data sets are pima (d = 6, n = 768), spam (d = 7, n = 4,601), bscale (d = 4, n = 625), and wbcancer (d = 7, n = 569). Categorical attributes (with three or more values) were transformed into binary attributes.

4.2 Model Accuracy

In this section, we measure the accuracy of predictions made with the Bayesian classification models. We used 5-fold cross validation. For each run, the data set was partitioned into a training set and a test set. The training set was used to compute the model, whereas the test set was used to independently measure accuracy. The training set size was 80 percent and the test set size was 20 percent. BKM ran until K-means converged on all classes. Decision trees used the C5.0 algorithm, splitting nodes until they reached a minimum percentage of purity or became too small; pruning was applied to reduce overfitting. The default number of clusters for BKM was k = 4.

TABLE 3. Accuracy Varying k

The number of clusters k is the most important parameter for tuning BKM accuracy. Table 3 shows accuracy as k increases. A higher k produces more complex models. Intuitively, a higher k should achieve higher accuracy because each class can be better approximated with localized clusters. On the other hand, a higher k than necessary can lead to empty clusters. As can be seen, a lower k produces better models for spam and wbcancer, whereas a higher k produces higher accuracy for pima and bscale. Therefore, there is no ideal setting for k but, in general, k <= 20 produces good results. Based on these results, we set k = 4 by default since it provides reasonable accuracy for all data sets.

TABLE 4. Comparing Accuracy: NB, BKM, and DT

Table 4 compares the accuracy of the three models, including the overall accuracy as well as a breakdown per class. Accuracy per predicted class is important to understand issues with imbalanced classes and to detect subsets of highly similar points. In practice, one class is generally more important than the other (asymmetric), depending on whether the classification task is to minimize false positives or false negatives. This is a summary of the findings. First, considering global accuracy, BKM is better than NB on two data sets and tied on the other two. Compared to decision trees, it is better in one case and worse otherwise; NB follows the same pattern, but with a wider gap. However, it is important to understand why BKM and NB perform worse than DT in some cases by looking at the accuracy breakdown. There, BKM generally does better than NB, except in two cases. Therefore, class decomposition does increase prediction accuracy. On the other hand, in all cases where BKM is less accurate than DT on global accuracy, BKM has higher accuracy on one class. In fact, on closer inspection, DT completely misses the prediction of one class on the bscale data set; that is, global accuracy for DT is misleading. We conclude that BKM performs better than NB and is competitive with DT.
4.3 Query Optimization

TABLE 5. Query Optimization: Distance Computation, n = 100k (Seconds)

Table 5 provides detailed experiments on distance computation, the most demanding computation for BKM. Our best distance strategy is two orders of magnitude faster than the worst strategy and one order of magnitude faster than its closest rival. The explanation is that I/O is minimized to n operations and computations happen in main memory for each row through SQL arithmetic expressions. Note that a standard aggregation on the pivoted version of X (XV) is faster than the nested horizontal query variant.

TABLE 6. Query Optimization: SQL versus UDF (Scoring, n = 1M) (Seconds)

Table 6 compares SQL with UDFs to score the data set (computing distances and finding the nearest cluster per class). We exploit scalar UDFs [5]. Since finding the nearest cluster is straightforward in the UDF, this comparison considers both computations as one. This experiment favors the UDF since SQL requires accessing large tables in separate queries. As we can see, SQL (with arithmetic expressions) turned out to be faster than the UDF. This was interesting because both approaches used the same table as input and performed the same I/O reading a large table. We expected SQL to be slower because it required a join and both XH and XD were accessed. The explanation is that the UDF has overhead to pass each point and the model as parameters in each call.

4.4 Speed and Time Complexity

We compare SQL and C++ running on the same computer. We also measure the time to export data with ODBC. C++ worked on flat files exported from the DBMS. We used binary files in order to get maximum performance in C++, and we shut down the DBMS when C++ was running; in short, we conducted a fair comparison.

TABLE 7. Comparing SQL, C++, and ODBC with NB (d = 4) (Seconds)
Table 7 compares SQL, C++, and ODBC varying n. Clearly, ODBC is a bottleneck. Overall, both languages scale linearly. SQL performs relatively better as n grows because the DBMS overhead becomes less important; however, C++ is about four times faster.

Fig. 1. BKM classifier: time complexity (default d = 4, k = 4, n = 100k).

Fig. 1 shows BKM time complexity varying n and d with large data sets. Time is measured for one iteration. BKM is linear in n and d, highlighting its scalability.

5 RELATED WORK

The most widely used approach to integrate data mining algorithms into a DBMS is to modify the internal source code. Scalable K-means (SKM) [1] and O-cluster [3] are two examples of clustering algorithms internally integrated with a DBMS, and a discrete Naive Bayes classifier has been internally integrated with the SQL Server DBMS. On the other hand, the two main mechanisms to integrate data mining algorithms without modifying the DBMS source code are SQL queries and UDFs. A discrete Naïve Bayes classifier programmed in SQL is introduced in [6]. We summarize the differences with ours: that data set is assumed to have discrete attributes (binning numeric attributes is not considered), and it uses an inefficient large pivoted intermediate table, whereas our discrete NB model can work directly on a horizontal layout. Our proposal extends clustering algorithms in SQL [4] to perform classification, generalizing Naïve Bayes [2]. K-means clustering was programmed with SQL queries introducing three variants [4]: standard, optimized, and incremental; we generalized the optimized variant. Note that classification represents a significantly harder problem than clustering. User-Defined Functions are identified as an important extensibility mechanism to integrate data mining algorithms [5], [8]. ATLaS [8] extends SQL syntax with object-oriented constructs to define aggregate and table functions (with initialize, iterate, and terminate clauses), providing a user-friendly interface to the SQL standard. We point out several differences with our work. First, we propose to generate SQL code from a host language, thus achieving Turing-completeness in SQL (similar to embedded SQL). We showed the Bayesian classifier can be solved more efficiently with SQL queries than with UDFs. Even further, SQL code provides better portability, and ATLaS requires modifying the DBMS source code. Class decomposition with clustering is shown to improve NB accuracy in [7]. The classifier can adapt to skewed distributions and overlapping subsets of points by building better local models. In [7], EM was the algorithm used to fit a mixture per class. Instead, we decided to use K-means because it is faster, simpler, better understood in database research, and has fewer numerical issues.

6 CONCLUSIONS

We presented two Bayesian classifiers programmed in SQL: the Naïve Bayes classifier (with discrete and numeric versions) and a generalization of Naïve Bayes (BKM), based on decomposing classes with K-means clustering. We studied two complementary aspects: increasing accuracy and generating efficient SQL code. We introduced query optimizations to generate fast SQL code. The best physical storage layout and primary index for large tables is based on the point subscript. Sufficient statistics are stored in denormalized tables. The euclidean distance computation uses a flattened (horizontal) version of the cluster centroids matrix, which enables arithmetic expressions. The nearest cluster per class, required by K-means, is efficiently determined avoiding joins and aggregations.

Experiments with real data sets compared NB, BKM, and decision trees. The numeric and discrete versions of NB had similar accuracy. BKM was more accurate than NB and similar to decision trees in global accuracy; however, BKM was more accurate when computing a breakdown of accuracy per class. A low number of clusters produced good results in most cases. We compared equivalent implementations of NB in SQL and C++ on large data sets: SQL was four times slower. SQL queries were faster than UDFs for scoring, highlighting the importance of our optimizations. NB and BKM exhibited linear scalability in data set size and dimensionality.
There are many opportunities for future work. We want to derive incremental versions or sample-based methods to accelerate the Bayesian classifier. We want to improve our Bayesian classifier to produce more accurate models for skewed distributions, data sets with missing information, and subsets of points with significant overlap, which are known issues for clustering algorithms. We are interested in combining dimensionality reduction techniques such as PCA or factor analysis with Bayesian classifiers. UDFs need further study to accelerate computations and to evaluate complex mathematical equations.

ACKNOWLEDGMENTS

This research work was partially supported by US National Science Foundation grants CCF 0937562 and IIS 0914861.

REFERENCES

[1] P. Bradley, U. Fayyad, and C. Reina, "Scaling Clustering Algorithms to Large Databases," Proc. ACM Knowledge Discovery and Data Mining (KDD) Conf., pp. 9-15, 1998.
[2] T. Hastie, R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning, first ed., Springer, 2001.
[3] B.L. Milenova and M.M. Campos, "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 290-297, 2002.
[4] C. Ordonez, "Integrating K-Means Clustering with a Relational DBMS Using SQL," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201, Feb. 2006.
[5] C. Ordonez, "Building Statistical Models and Scoring with UDFs," Proc. ACM SIGMOD, pp. 1005-1016, 2007.
[6] S. Thomas and M.M. Campos, SQL-Based Naive Bayes Model Building and Scoring, US Patent 7,051,037, US Patent and Trade Office, 2006.
[7] R. Vilalta and I. Rish, "A Decomposition of Classes via Clustering to Explain and Improve Naive Bayes," Proc. European Conf. Machine Learning (ECML), pp. 444-455, 2003.
[8] H. Wang, C. Zaniolo, and C.R. Luo, "ATLaS: A Small but Complete SQL Extension for Data Mining and Data Streams," Proc. Very Large Data Bases (VLDB) Conf., pp. 1113-1116, 2003.