Big Data Analytics Programming - Assignment 2
Eryk Kulikowski
December 7, 2014
Learning curves and interpretation
Zoo data
The Naive Bayes classifier makes an assumption of conditional independence between the attributes given the class. The zoo data set is a nice example of how, in practice, such an assumption does not hold. The attributes in the data are doubled (e.g., hair = true and hair = false attributes). Other dependencies can also be found, e.g., for the leg attributes, only one can be true (an animal has either 0 legs, 2 legs, etc.). Figure 1 shows the effects of manipulating the data: unaltered data (figure 1a), removing the doubling of the attributes (figure 1b), removing the doubling of the attributes and the leg attributes (figure 1c), and removing the doubling of the attributes while doubling the leg attributes (figure 1d). The figure shows that manipulating the data (introducing and removing dependent attributes) influences the learning curve of the Naive Bayes classifier. Nevertheless, NB reaches good accuracy in all cases, and this should be true in all situations where the concept can be learned by the NB classifier.
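Why the doubled attributes matter can be read directly off the NB decision rule, given here in its textbook form (standard notation, not a quote from the implementation):

P(c \mid x_1, \ldots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)

With both hair = true and hair = false present as separate attributes, the same piece of evidence enters the product twice, which is exactly the kind of dependence the independence assumption rules out.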
The Very Fast Decision Tree classifier does not make such an assumption, but it is also influenced by the doubling of the attributes. Figure 1 does not illustrate this very well, since the default parameters for the algorithm do not attempt to split a node before seeing at least 200 examples. These parameters can be adapted to the small data set, as shown in figure 2, where the standard parameters (figure 2a) were tuned to fit the example data (figure 2b). Because of the doubling of the attributes, the split could never be reached without using the τ parameter. However, setting the τ and δ parameters low introduces the risk of overfitting, as shown in figure 2c. Figure 2d shows that even with tuned parameters, it takes longer for the VFDT classifier to learn the concept than for the NB classifier. This is discussed in more detail in the next section in the context of larger (randomly generated) data sets.
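The role of the δ and τ parameters can be made concrete with a minimal Java sketch of the Hoeffding-bound split test; the method and parameter names are illustrative and not taken from the actual implementation:

// Minimal sketch of the VFDT split decision at a leaf; names are illustrative.
public final class SplitTest {

    // Hoeffding bound: with probability 1 - delta, the observed difference in
    // gain is within eps of the true difference after n examples.
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute clearly beats the runner-up, or when the
    // two are so close that tau breaks the tie (as with doubled attributes,
    // whose gains never differ by more than eps).
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, double tau, long nSeen) {
        double eps = hoeffdingBound(range, delta, nSeen);
        return (bestGain - secondBestGain > eps) || (eps < tau);
    }
}

With two identical (doubled) attributes competing for a split, bestGain - secondBestGain stays essentially zero, so only the eps < tau branch can ever trigger the split, which matches the observation above.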
Generated data
Figure 3 shows the results of applying the classifiers to randomly generated data, generated according to the corresponding concepts, i.e., an NB concept for the NB classifier and a VFDT concept for the VFDT classifier. The data was generated without noise. For the VFDT data, trees were generated with fraction 0.15 and depth 18 (see the VFDT paper), resulting in 104505 nodes. Based on that figure we can conclude that the NB classifier always reaches high accuracy earlier than the VFDT classifier. This can be explained by the fact that each example seen by NB is used to update all of its counts, while the VFDT needs double the number of examples of the previous level to reach the next level, as only one node can update its statistics with each example seen. This also explains the logarithmic shape of the learning curve of the VFDT classifier. The authors of the VFDT paper suggest reusing previously seen examples to speed up this process, if the available computational resources permit it. The figure also illustrates that increasing the number of attributes has almost no influence on the VFDT classifier (the generated concept has the same complexity of 104505 nodes, as the same fraction and depth parameters were used to generate the data) and only limited influence (due to the fast convergence) on the NB classifier.
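The difference in learning speed follows directly from how NB maintains its statistics; a minimal Java sketch, assuming small-integer attribute values and integer class labels (names are illustrative, not the report's implementation):

// Minimal sketch of NB counting; names and data layout are illustrative.
public final class NaiveBayesCounts {
    private final long[] classCounts;       // [class]
    private final long[][][] attrCounts;    // [class][attribute][value]

    public NaiveBayesCounts(int numClasses, int numAttributes, int numValues) {
        classCounts = new long[numClasses];
        attrCounts = new long[numClasses][numAttributes][numValues];
    }

    // Every training example updates the statistics of *all* attributes at once,
    // which is why the NB learning curve rises much faster than the VFDT curve,
    // where each example contributes to only one node of the tree.
    public void update(int[] attributeValues, int label) {
        classCounts[label]++;
        for (int a = 0; a < attributeValues.length; a++) {
            attrCounts[label][a][attributeValues[a]]++;
        }
    }
}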
Figure 1: Zoo data. (a) All data; (b) no double attributes; (c) no leg attributes; (d) double leg attributes.
Figure 2: VFDT parameter choice. (a) Standard; (b) fitting; (c) overfitting; (d) compared to Naive Bayes.
Figure 3: Influence of the number of examples and attributes. (a) 100 attributes, 10,000 examples; (b) 100 attributes, 100,000 examples; (c) 1,000 attributes, 100,000 examples; (d) 100 attributes, 1,000,000 examples.
Furthermore, the perfect accuracy of 100% remains unreachable. This is illustrated in figure 4, where only 16 attributes were used (15 attributes and a class attribute) and 1,000,000 examples were used to train the NB classifier. NB quickly reaches its maximum accuracy and then stays within a very narrow band. This can be explained by the numerical error on examples that lie very close to the decision threshold. A similar effect is to be expected for the VFDT, as there is an error δ on the choice of the split attribute and an additional error on resolving tie situations (see also the VFDT paper).
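For reference, the bound from the VFDT paper that governs this δ error states that, with probability 1 − δ, the true mean of a random variable with range R (here, the difference in gain) lies within ε of the mean observed over n examples:

\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}

Each split can therefore be wrong with probability at most δ, and ties resolved with τ add a further, bounded, error; these small per-node errors are one reason the accuracy stays below 100%.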
Figure 4: NB with 16 attributes and 1,000,000 examples
Figure 5a illustrates the effect of noise on the data for the NB classifier. The data was generated with 20% noise, and the accuracy simply drops by 20%. Figure 5b illustrates the effect of 20% noise on the data for the VFDT classifier. This is a more complex situation: the noise decreases the achievable accuracy, but it can also slow down the convergence, as the differences between the attributes become harder to detect. Nevertheless, the VFDT copes well with noisy data.
Figure 6 shows the effect of swapping the concepts: the NB classifies the VFDT-generated data, and the VFDT classifies the NB-generated data. The VFDT is clearly better in this test.
Figure 5: Influence of noise. (a) Naive Bayes with and without noise; (b) VFDT with and without noise.
Figure 6: Mismatching concepts
Experiments on efficiency
Java is often used on application servers, and online learners would therefore most likely require integration with this kind of environment. However, Java is not very fast. Where speed matters, JNI can be used. I thought this was a good opportunity to run benchmark tests comparing native code, Java and JNI. For the native code and JNI I chose Objective C. The installation of the needed components to run the tests can be hard, but on Linux server environments virtualization is very common, which simplifies the setup, as it needs to be done only once. The tests use the Clang compiler with dispatching, blocks and ARC, together with GNUStep components. The JIGS component from the GNUStep libraries proved to be very fast and easy to use. The integration of Java and Objective C is almost transparent: JIGS generates the needed wrappers based on the makefile and one configuration file. Because of the difficulty of installing GNUStep with ARC and dispatching, I have separated the tests from the first part of the report and the tests from this part. The efficiency tests are in the JNITest folder.
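Conceptually, the wrapper JIGS generates is a Java class whose methods delegate to the Objective C code; a hand-written JNI-style sketch of the same idea follows (class, method and library names are hypothetical, not the generated wrapper):

// Hypothetical JNI-style wrapper, sketched by hand to illustrate what JIGS
// generates automatically from the makefile and the configuration file.
public class NativeNaiveBayes {
    static {
        // The Objective C implementation compiled as a shared library.
        System.loadLibrary("nbnative");
    }

    // Implemented on the native side; one call per training example.
    public native void update(int[] example, int label);

    // Returns the predicted class label for one example.
    public native int predict(int[] example);
}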
For the first tests, I wrote an NB implementation in Objective C without the dispatch, in order to make it more comparable to the Java code, which seemed to be single threaded (however, as the tests have shown, some optimizations in the Java libraries and the JVM appear to use multithreading, so I added the dispatch in later tests). The tests shown in table 1 are the following (all with the same generated data set of 100 attributes, 100000 training examples and 1000 test examples, run with an increment of 1000 in order to balance the updates and predictions; a sketch of this evaluation loop is given after the list):
• Linux J The Java NB implementation, as used in the first part of the report, run on a Linux (Ubuntu) environment.
• Linux N The Objective C code as described above, run on the same environment.
• Linux JNI The same Objective C code, run from Java through the JIGS wrapper, on the same
environment.
• Linux GNUStep Objective C code using the GNUStep library for splitting the strings for data
initialization, run on the same environment.
• Mac N Exactly the same code as Linux N, run on a Mac OS X machine (both the Mac and the Linux machine are Core i7 systems; however, the Linux machine is more recent and has an SSD drive).
• Mac cocoa Exactly the same code as Linux GNUStep (cocoa and GNUStep are compatible), run
on Mac.
• Mac J Exactly the same code as Linux J, run on Mac.
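A minimal Java sketch of that evaluation loop (interface and names are illustrative; the timings in table 1 come from the actual implementations, not from this sketch):

// Interleaved train/test loop: train on increments and evaluate on the test
// set after each increment, balancing update and prediction work.
interface OnlineClassifier {
    void update(int[] example, int label);
    int predict(int[] example);
}

class InterleavedEvaluation {
    static double run(OnlineClassifier model, int[][] train, int[] trainLabels,
                      int[][] test, int[] testLabels, int increment) {
        double accuracy = 0.0;
        for (int i = 0; i < train.length; i++) {
            model.update(train[i], trainLabels[i]);
            if ((i + 1) % increment == 0) {
                int correct = 0;
                for (int j = 0; j < test.length; j++) {
                    if (model.predict(test[j]) == testLabels[j]) correct++;
                }
                accuracy = (double) correct / test.length; // accuracy after i + 1 examples
            }
        }
        return accuracy;
    }
}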
Test Linux J Linux N Linux JNI Linux GNUStep Mac N Mac cocoa Mac J
real 6.384s 1.510s 1.591s 23.086s 2.057s 3.944s 15.564s
user 3.412s 1.473s 1.556s 23.065s 1.988s 3.874s 6.952s
sys 3.776s 0.032s 0.053s 0.040s 0.052s 0.054s 9.317s
Table 1: Comparison of Java, native code and JNI
Based on the results shown in table 1, it can be concluded that JNI speeds up the Java code significantly. There is almost no difference between the native code and JNI. However, the execution time using the split selector from GNUStep is significantly higher than in all other tests. It is clear that Java is optimized for the Linux environment, as most servers run Linux, and that the GNUStep library lacks optimizations that the Cocoa and Java libraries have. In order to investigate whether there is a difference between LLVM on Linux and Mac, and to compare these results with Java, I ran the following tests (with the same data as above; all of the tests are run on the Linux environment, except for the last test, which is run on the Mac):
• J cData JNI is used to load the data and then the integers are copied to Java integer arrays; the remaining code is the same as Linux J.
• J cData nc A small optimization of the code above, where the arrayCopy is not used (see the Test class). This makes the Java code very comparable to the ObjC code, with only the minimal overhead of loading the data (a sketch of this difference is given after the list).
• ObjC strings The native code where GNUStep objects are used, but the split selector is not used (one string object is made for each integer, and that object is then put in an NSMutableArray).
• J strings The same code as ObjC strings, but implemented in Java (except that the data is loaded with JNI).
• Native+ Native code optimized with dispatch and memcopy
• JNI+ The same code as Native+, but run through the JNI
• ObjC Mac The same code as ObjC strings, run on the Mac
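A sketch of the J cData versus J cData nc difference, assuming a hypothetical native loader that returns the data as Java int arrays (the actual Test class may organize this differently):

// Hypothetical illustration of the J cData vs. J cData nc variants; the
// native loader name and data layout are assumptions, not the actual code.
public class CDataExample {
    static { System.loadLibrary("nbnative"); }

    // Implemented on the Objective C side; returns one row of integers.
    public static native int[] loadRow(int index);

    // J cData: copy the integers returned through JNI into a pre-allocated array.
    static void copiedRow(int[] dest, int index) {
        int[] row = loadRow(index);
        System.arraycopy(row, 0, dest, 0, row.length);
    }

    // J cData nc: use the array returned through JNI directly, skipping the copy.
    static int[] directRow(int index) {
        return loadRow(index);
    }
}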
Test J cData J cData nc ObjC strings J strings Native+ JNI+ ObjC Mac
real 5.142s 5.115s 12.551s 12.881s 1.152s 1.213s 17.401s
user 5.193s 5.174s 12.345s 32.612s 1.620s 1.740s 17.151s
sys 0.039s 0.031s 0.216s 1.275s 0.036s 0.071s 0.134s
Table 2: Effect of initializing many objects and putting them in array objects
The results in table 2 show that working with objects is faster in native code than in Java. Both environments, Mac and Linux, are very comparable in native code performance. However, Java uses multithreading (see the user time) to speed up the process of initializing objects. This lowers the real computation time on an idle multi-core system; on a busy server, however, it might prove less effective. Furthermore, using the far from optimal string handling in the GNUStep environment gives better results than using the split algorithm implemented in that library. This illustrates that certain algorithms should be avoided, but memory allocation, NSMutableArray, and many other features are well implemented and can be used without any problems, with performance comparable to the Cocoa library on Mac OS X. Finally, using dispatch improves the performance of the native code and JNI, with minimal overhead for the JNI. The Clang compiler also allows mixing C, Objective C, C++ and Objective C++ in one project, which allows replacing certain algorithms from GNUStep with better libraries available for these languages.
The remaining tests are the scalability tests. NB and VFDT are the regular Java implementations of these algorithms, while JNI-NB is the JNI implementation of the NB algorithm (the same code as for the JNI+ test). The results are shown in tables 3 and 4. The data sets used are the following (the increment is always set to the size of the test set, in order to balance the updating and predicting):
• 1 100 attributes, 10000 training examples, 100 test examples
• 2 100 attributes, 100000 training examples, 1000 test examples
• 3 1000 attributes, 100000 training examples, 1000 test examples
• 4 100 attributes, 1000000 training examples, 10000 test examples
Test NB 1 VFDT 1 JNI-NB 1 NB 2 VFDT 2 JNI-NB 2
real 0.938s 0.843s 0.258s 6.384s 7.037s 1.211s
user 0.715s 0.946s 0.269s 3.412s 4.421s 1.736s
sys 0.449s 0.417s 0.043s 3.776s 3.756s 0.063s
Table 3: Scalability tests of the implemented algorithms
Test NB 3 VFDT 3 JNI-NB 3 NB 4 VFDT 4 JNI-NB 4
real 1m0.855s 1m8.397s 11.091s 1m2.137s 1m7.628s 11.097s
user 28.549s 37.002s 16.238s 31.996s 36.424s 16.508s
sys 36.082s 34.728s 0.285s 36.325s 37.040s 0.260s
Table 4: Scalability tests of the implemented algorithms continued
As expected, the performance of NB and VFDT is very comparable, with NB being slightly faster than VFDT. Also, the effect of increasing the number of examples is similar to that of increasing the number of attributes for both algorithms. However, the JNI implementation is about six times faster than the regular Java implementation. JIGS is a high-quality library that can also be used in the opposite direction, calling Java code from Objective C code. However, no wrappers are generated in that direction, and this requires a little extra implementation (I have tried this in a different context, where the Weka library is used from Objective C). This is understandable, since JIGS is mainly designed to speed up Java code on Java application servers.