Big Data Analytics Programming - Assignment 2
Eryk Kulikowski
December 7, 2014
Learning curves and interpretation
Zoo data
The Naive Bayes classifier assumes conditional independence between the attributes given the class. The zoo data set is a nice example of how, in practice, such an assumption does not hold. The attributes in the data are doubled (e.g., hair = true and hair = false attributes). Other dependencies can also be found, e.g., among the leg attributes only one can be true (an animal has either 0 legs, 2 legs, etc.). Figure 1 shows the effects of manipulating the data: unaltered data (figure 1a), removing the doubling of the attributes (figure 1b), removing the doubling of the attributes and the leg attributes (figure 1c), and removing the doubling of the attributes while doubling the leg attributes (figure 1d). The figure shows that manipulating the data (introducing and removing dependent attributes) influences the learning curve of the Naive Bayes classifier. Nevertheless, NB reaches good accuracy in all cases; this should hold in any situation where the concept can be learned by the NB classifier.
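The effect of the doubled attributes can be made concrete with a minimal sketch of NB counting for binary attributes (illustrative only, not the implementation used in these experiments): a duplicated attribute contributes its likelihood term twice, which squares that attribute's evidence in the probability product.

```java
import java.util.Arrays;

// Minimal Naive Bayes counter for binary attributes, with Laplace smoothing.
// This is an illustrative sketch, not the implementation used in the report.
// A duplicated attribute contributes its likelihood term twice, squaring
// that attribute's evidence in the product.
public class TinyNB {
    final long[] classCount;
    final long[][][] count; // [class][attribute][value 0 or 1]

    TinyNB(int classes, int attrs) {
        classCount = new long[classes];
        count = new long[classes][attrs][2];
    }

    // Update the counts with one labeled example.
    void update(int[] x, int y) {
        classCount[y]++;
        for (int a = 0; a < x.length; a++) count[y][a][x[a]]++;
    }

    // Log-probability score of class y for example x.
    double score(int[] x, int y) {
        long total = Arrays.stream(classCount).sum();
        double s = Math.log((classCount[y] + 1.0) / (total + classCount.length));
        for (int a = 0; a < x.length; a++)
            s += Math.log((count[y][a][x[a]] + 1.0) / (classCount[y] + 2.0));
        return s;
    }

    public static void main(String[] args) {
        TinyNB nb = new TinyNB(2, 1);
        for (int i = 0; i < 3; i++) nb.update(new int[]{1}, 0);
        for (int i = 0; i < 3; i++) nb.update(new int[]{0}, 1);
        // Class 0 is clearly preferred for x = [1].
        System.out.println(nb.score(new int[]{1}, 0) > nb.score(new int[]{1}, 1));
    }
}
```

Doubling every attribute doubles the log-odds gap between the classes, so the classifier becomes overconfident without any new evidence.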
The Very Fast Decision Tree classifier does not make such an assumption, but it is also influenced by the doubling of the attributes. Figure 1 does not illustrate this very well, since with the default parameters the algorithm does not attempt to split a node before seeing at least 200 examples. These parameters can be adapted to the small data set, as shown in figure 2, where the standard parameters (figure 2a) were tuned to fit the example data (figure 2b). Because of the doubling of the attributes, a split could never be reached without using the τ parameter. However, setting the τ and δ parameters low introduces the risk of overfitting, as shown in figure 2c. Figure 2d shows that even with tuned parameters, the VFDT classifier takes longer to learn the concept than the NB classifier. This is discussed in more detail in the next section in the context of larger (randomly generated) data sets.
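For reference, the split test behind the δ and τ parameters can be sketched as follows, using the Hoeffding bound from the VFDT paper (the numeric values in main are illustrative, not the ones used in the experiments). With doubled attributes the two best gains are identical, so only the τ tie-break can ever trigger a split:

```java
// Sketch of the Hoeffding-bound split test used by VFDT. The delta and tau
// values below are illustrative, not the settings used in the report.
public class HoeffdingBound {
    // epsilon = sqrt(R^2 * ln(1/delta) / (2n)), R = range of the heuristic.
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the runner-up by more than epsilon,
    // or when epsilon itself has shrunk below the tie-break threshold tau.
    static boolean shouldSplit(double gBest, double gSecond, double range,
                               double delta, double tau, long n) {
        double eps = epsilon(range, delta, n);
        return (gBest - gSecond > eps) || (eps < tau);
    }

    public static void main(String[] args) {
        // Identical gains (doubled attributes): no split while epsilon is
        // still large, a tie-break split once epsilon drops below tau.
        System.out.println(shouldSplit(0.30, 0.30, 1.0, 1e-7, 0.05, 200));   // false
        System.out.println(shouldSplit(0.30, 0.30, 1.0, 1e-7, 0.05, 10000)); // true
    }
}
```

Lowering δ and τ makes splits happen on less evidence, which is exactly the overfitting risk shown in figure 2c.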
Generated data
Figure 3 shows the results of applying the classifiers to randomly generated data, generated according to the corresponding concepts, i.e., an NB concept for the NB classifier and a VFDT concept for the VFDT classifier. The data was generated without noise. For the VFDT data, trees were generated with fraction 0.15 and depth 18 (see the VFDT paper), resulting in 104505 nodes. Based on that figure we can conclude that the NB classifier always reaches high accuracy earlier than the VFDT classifier. This can be explained by the fact that every example seen by NB is used to update the counts, while the VFDT needs double the number of examples of the previous level to reach the next level, as only one node can update its statistics with each example seen. This also explains the logarithmic shape of the learning curve of the VFDT classifier. The authors of the VFDT paper suggest reusing previously seen examples to speed up this process, if the available computational resources permit it. The figure also illustrates that increasing the number of attributes has almost no influence on the VFDT classifier (the generated concept has the same complexity of 104505 nodes, as the same fraction and depth parameters were used to generate the data) and only a limited influence (due to the fast convergence) on the NB classifier.
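The doubling argument can be made concrete with a back-of-the-envelope sketch: if each node must see roughly n_min examples before it splits (200 by default, as mentioned earlier) and every example updates exactly one leaf, then level ℓ of a complete binary tree needs about n_min · 2^ℓ examples. This ignores uneven routing of examples, so it is a lower-bound estimate, not the exact behavior of the implementation.

```java
// Back-of-the-envelope estimate of the examples needed to grow a complete
// binary tree to a given depth, when each node must see nMin examples
// before splitting and every example reaches exactly one current leaf.
public class ExamplesPerLevel {
    static long examplesNeeded(long nMin, int depth) {
        long total = 0;
        for (int level = 0; level <= depth; level++)
            total += nMin << level; // nMin * 2^level nodes at this level
        return total;
    }

    public static void main(String[] args) {
        System.out.println(examplesNeeded(200, 0));  // 200
        System.out.println(examplesNeeded(200, 10)); // 409400
    }
}
```

The per-level cost doubles while the tree depth only increments, which is why accuracy as a function of the number of examples looks logarithmic.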
Figure 1: Zoo data. (a) All data, (b) no double attributes, (c) no leg attributes, (d) double leg attributes.
Figure 2: VFDT parameter choice. (a) Standard, (b) fitting, (c) overfitting, (d) compared to Naive Bayes.
Figure 3: Influence of the number of examples and the number of attributes. (a) 100 attributes, 10,000 examples; (b) 100 attributes, 100,000 examples; (c) 1,000 attributes, 100,000 examples; (d) 100 attributes, 1,000,000 examples.
Furthermore, perfect accuracy of 100% remains unreachable. This is illustrated in figure 4, where only 16 attributes (15 attributes and a class attribute) and 1,000,000 training examples were used for the NB classifier. NB quickly reaches its maximum accuracy and then remains in a very narrow band. This can be explained by numerical error on the examples that lie very close to the threshold value. A similar effect is to be expected for the VFDT, as there is an error δ on the choice of the split attribute and an error on resolving tie situations (see also the VFDT paper).
Figure 4: NB with 16 attributes and 1,000,000 examples
Figure 5a illustrates the effect of noise in the data on the NB classifier. The data was generated with 20% noise, and the accuracy simply drops by 20%. Figure 5b illustrates the effect of 20% noise on the VFDT classifier. The situation is more complex: the noise decreases the achievable accuracy, but it can also slow down convergence, as the differences between the attributes become harder to detect. Nevertheless, the VFDT copes well with noisy data.
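The noise injection itself can be sketched as follows: each label is flipped to a uniformly drawn other class with probability p (an illustrative sketch; the generator used for the experiments may differ in details).

```java
import java.util.Random;

// Sketch of class-label noise injection: each label is replaced by a
// uniformly drawn *different* class with probability p. Illustrative only;
// the report's data generator may differ in details.
public class NoiseInjector {
    static int[] addNoise(int[] labels, int numClasses, double p, long seed) {
        Random rng = new Random(seed);
        int[] noisy = labels.clone();
        for (int i = 0; i < noisy.length; i++) {
            if (rng.nextDouble() < p) {
                // Shift by 1..numClasses-1 to guarantee a different class.
                int shift = 1 + rng.nextInt(numClasses - 1);
                noisy[i] = (noisy[i] + shift) % numClasses;
            }
        }
        return noisy;
    }

    public static void main(String[] args) {
        int[] labels = new int[10000]; // all class 0
        int[] noisy = addNoise(labels, 2, 0.20, 42L);
        int flips = 0;
        for (int i = 0; i < noisy.length; i++) if (noisy[i] != labels[i]) flips++;
        System.out.println("flip fraction: " + flips / 10000.0); // close to 0.20
    }
}
```

With two classes and p = 0.2, about 20% of the labels disagree with the true concept, which matches the observed 20% accuracy drop for NB.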
Figure 6 shows the effect of swapping the concepts: the NB classifies the VFDT-generated data, and the VFDT classifies the NB-generated data. The VFDT is clearly better in this test.
Figure 5: Influence of noise. (a) Naive Bayes with and without noise, (b) VFDT with and without noise.
Figure 6: Mismatching concepts
Experiments on efficiency
Java is often used on application servers, and online learners would therefore most likely require integration with this kind of environment. However, Java is not very fast. Where speed matters, JNI can be used. I thought this was a good opportunity to run benchmark tests comparing native code, Java and JNI. For the native code and JNI I chose Objective C. The installation of the components needed to run the tests can be hard, but on Linux server environments virtualization is very common, which simplifies the setup, as you need to do it only once. The tests use the clang compiler with dispatch, blocks and ARC, together with GNUStep components. The JIGS component from the GNUStep libraries proved to be very fast and easy to use. The integration of Java and Objective C is almost transparent: JIGS generates the needed wrappers based on the makefile and one configuration file. Because of the difficulty of installing GNUStep with ARC and dispatch, I have separated the tests of the first part from the tests of this part of the report. The efficiency tests are in the JNITest folder.
For the first tests, I wrote an NB implementation in Objective C without dispatch, in order to make it more comparable to the Java code, which seemed to be single threaded (however, as the tests have shown, some optimizations in the Java libraries and the JVM appear to use multi-threading, so I added dispatch in later tests). The tests shown in table 1 are the following (all with the same generated data set of 100 attributes, 100,000 training examples and 1,000 test examples, run with an increment of 1,000 in order to balance the updates and predictions):
• Linux J The Java NB implementation, as used in the first part of the report run on a Linux
(Ubuntu) environment.
• Linux N The Objective C code as described above, run on the same environment.
• Linux JNI The same Objective C code, run from Java through the JIGS wrapper, on the same
environment.
• Linux GNUStep Objective C code using the GNUStep library for splitting the strings for data
initialization, run on the same environment.
• Mac N Exactly the same code as Linux N, run on a Mac OS X machine (both the Mac and the Linux machine are Core i7 systems; however, the Linux machine is more recent and has an SSD drive).
• Mac cocoa Exactly the same code as Linux GNUStep (cocoa and GNUStep are compatible), run
on Mac.
• Mac J Exactly the same code as Linux J, run on Mac.
Test Linux J Linux N Linux JNI Linux GNUStep Mac N Mac cocoa Mac J
real 6.384s 1.510s 1.591s 23.086s 2.057s 3.944s 15.564s
user 3.412s 1.473s 1.556s 23.065s 1.988s 3.874s 6.952s
sys 3.776s 0.032s 0.053s 0.040s 0.052s 0.054s 9.317s
Table 1: Comparison of Java, native code and JNI
Based on the results shown in table 1, it can be concluded that JNI speeds up the Java code significantly. There is almost no difference between the native code and JNI. However, the execution time using the split selector from GNUStep is significantly higher than in all other tests. It is clear that Java is optimized for the Linux environment, as most servers run Linux, and that the GNUStep library lacks optimizations that the cocoa and Java libraries have. In order to investigate whether there is a difference between LLVM on Linux and on Mac, and to compare these results with Java, I have run the following tests (with the same data as above; all of the tests are run on the Linux environment, except for the last one, which is run on the Mac):
• J cData JNI is used to load the data and then the integers are copied to Java integer arrays,
the remaining code is the same as Linux J
• J cData nc Small optimization of the code above, where the arrayCopy is not used (see the Test
class). This makes the Java code very comparable to the ObjC code with the minimal overhead
of loading the data.
• ObjC strings The native code where GNUStep objects are used, but the split selector is not used (one string object is made for each integer, and that object is then put in an NSMutableArray).
• J strings The same code as ObjC strings, but implemented in Java (except that the data is
loaded with JNI)
• Native+ Native code optimized with dispatch and memcpy
• JNI+ The same code as Native+, but run through the JNI
• ObjC Mac The same code as ObjC strings, run on the Mac
Test J cData J cData nc ObjC strings J strings Native+ JNI+ ObjC Mac
real 5.142s 5.115s 12.551s 12.881s 1.152s 1.213s 17.401s
user 5.193s 5.174s 12.345s 32.612s 1.620s 1.740s 17.151s
sys 0.039s 0.031s 0.216s 1.275s 0.036s 0.071s 0.134s
Table 2: Effect of initializing many objects and putting them in array objects
The results in table 2 show that working with objects is faster in native code than in Java. Both environments, Mac and Linux, are very comparable in native code performance. However, Java uses multi-threading (see the user time) to speed up the process of initializing objects. This lowers the real computation time on an idle multi-core system; however, on a busy server it might prove less effective. Furthermore, using the far from optimal string-splitting code in the GNUStep environment gives better results than using the algorithm implemented in that library. This illustrates that certain algorithms should be avoided, but memory allocation, NSMutableArray, and many other features are well implemented and can be used without any problems, with performance comparable to the cocoa library on Mac OS X. Finally, using dispatch improves the performance of the native code and JNI, with minimal overhead for the JNI. The clang compiler also allows mixing C, Objective C, C++ and Objective C++ in one project, which allows replacing certain algorithms from GNUStep with better libraries available for these languages.
The remaining tests are the scalability tests. The NB and VFDT are the regular Java implementations of these algorithms, while JNI-NB is the JNI implementation of the NB algorithm (the same code as in the JNI+ test). The results are shown in tables 3 and 4. The data sets used are the following (the increment is always set to the size of the test set, in order to balance the updating and predicting):
• 1 100 attributes, 10,000 training examples, 100 test examples
• 2 100 attributes, 100,000 training examples, 1,000 test examples
• 3 1,000 attributes, 100,000 training examples, 1,000 test examples
• 4 100 attributes, 1,000,000 training examples, 10,000 test examples
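The interleaving of updates and predictions controlled by the increment can be sketched as follows (a toy majority-class learner stands in for NB and VFDT here; class and method names are illustrative, not the ones in the implementation):

```java
// Minimal interleaved train-then-test loop. Every `increment` training
// examples, the current model is scored on a held-out test set, which is
// how update and prediction work are balanced in the timing runs.
public class PrequentialEval {
    // A trivial majority-class learner stands in for NB/VFDT.
    static class Majority {
        long[] counts = new long[2];
        void update(int label) { counts[label]++; }
        int predict() { return counts[1] > counts[0] ? 1 : 0; }
    }

    // Train on `train`; after every `increment` examples, measure accuracy
    // on `test`. Returns the accuracy after the last evaluation.
    static double run(int[] train, int[] test, int increment) {
        Majority model = new Majority();
        double acc = 0;
        for (int i = 0; i < train.length; i++) {
            model.update(train[i]);
            if ((i + 1) % increment == 0) {
                int correct = 0;
                for (int t : test) if (model.predict() == t) correct++;
                acc = correct / (double) test.length;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] train = new int[1000]; // all class 0: majority class is 0
        int[] test = {0, 0, 0, 1};
        System.out.println(run(train, test, 100)); // 0.75
    }
}
```

Setting the increment to the test-set size means one prediction is made per training update, so neither phase dominates the measured time.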
Test NB 1 VFDT 1 JNI-NB 1 NB 2 VFDT 2 JNI-NB 2
real 0.938s 0.843s 0.258s 6.384s 7.037s 1.211s
user 0.715s 0.946s 0.269s 3.412s 4.421s 1.736s
sys 0.449s 0.417s 0.043s 3.776s 3.756s 0.063s
Table 3: Scalability tests of the implemented algorithms
Test NB 3 VFDT 3 JNI-NB 3 NB 4 VFDT 4 JNI-NB 4
real 1m0.855s 1m8.397s 11.091s 1m2.137s 1m7.628s 11.097s
user 28.549s 37.002s 16.238s 31.996s 36.424s 16.508s
sys 36.082s 34.728s 0.285s 36.325s 37.040s 0.260s
Table 4: Scalability tests of the implemented algorithms continued
As expected, the performance of NB and VFDT is very comparable, with NB being slightly faster than VFDT. Also, the effect of increasing the number of examples is similar to that of increasing the number of attributes for both algorithms. However, the JNI implementation is about six times faster than the regular Java implementation. JIGS is a high-quality library that can also be used in the opposite direction, calling Java code from Objective C code. However, in that direction no wrappers are generated, and a little extra implementation is required (I have tried this in a different context, where the weka library is used from Objective C). This is understandable, since JIGS is mainly designed to speed up Java code on Java application servers.