Some numerical experiments

An experimental comparison of MSVMpack with other implementations of M-SVMs is provided here for illustrative purposes. The aim is not to conclude on what type of M-SVM model is better, but rather to give an idea of the efficiency of the different implementations. The following implementations are considered:

Note that both the Spider and SMSVM are mostly based on non-compiled code. In addition, these two implementations require to store the kernel matrix in memory, which makes them inapplicable to large data sets. BSVM constitutes an efficient alternative for the WW model type. However, to the best of our knowledge, SMSVM is the only implementation (beside MSVMpack) of the LLW M-SVM model described in [8].

The characteristics of the data sets used in the experiments are given in Table 4. The ImageSeg data set is taken from the UCI repository10. The USPS_500 data set is a subset of the USPS data provided with the MCSVM software. The Bio data set is provided with MSVMpack. The CB513_01 data set corresponds to the first split of the 5-fold cross validation procedure on the entire CB513 data set [3]. MSVMpack uses the original MNIST data11 with integer values in the range $ [0,255]$, while other implementations use scaled data downloaded from the LIBSVM homepage12. BSVM scaling tool is applied to the Letter data set downloaded from the UCI repository13 to obtain a normalized data set in floating-point data format.

All methods use the same kernel functions and hyperparameters as summarized in Table 4. In particular, for MCSVM, we used $ \beta = 1/C$. When unspecified, the kernel hyperparameters are set to MSVMpack defaults. Also, SMSVM requires $ \lambda = \log_2 (1/C)$ with larger values of $ C$ to reach a similar training error, so $ C=100000$ is used. Other low level parameters (such as the size of the working set) are kept to each software defaults. These experiments are performed on a computer with 2 Xeon processors (with 4 cores each) at 2.53 GHz and 32 GB of memory running Linux (64-bit).

The results are shown in Tables 5-6. Though no definitive conclusion can be drawn from this small set of experiments, the following is often observed in practice:


Table 4: Characteristics of the data sets.
Data set #classes #training #test data Training parameters
    examples examples dim. format  
ImageSeg 7 210 2100 19 double $ C=10$, RBF kernel, $ \sigma=1$
            data normalized
USPS_500 10 500 500 256 float $ C=10$, RBF kernel, $ \sigma=1$
Bio 4 12471 12471 68 bit $ C=0.2$, linear kernel
CB513_01 3 65210 18909 260 short $ C=0.4$, RBF kernel
CB513 3 84119 5-fold CV 260 short $ C=0.4$, RBF kernel
MNIST 10 60000 10000 784 short $ C=1$, RBF kernel, $ \sigma=1000$
          double data normalized, $ \sigma = 4.08$
Letter 26 16000 4000 16 float $ C=1$, RBF kernel, $ \sigma=1$
            data normalized
           


Table 5: Relative performance of different M-SVM implementations.
Data set M-SVM Software #SV Training Test Training Testing
  model     error (%) error (%) time time
ImageSeg WW Spider 113 5.24 11.05 11s 3.3s
    BSVM 98 2.38 9.81 0.1s 0.1s
    MSVMpack 192 0 10.33 1.2s 0.1s
  CS MCSVM 97 2.86 8.86 0.1s 0.1s
    MSVMpack 141 0 9.24 3.3s 0.1s
  LLW SMSVM 210 0.48 7.67 15s 0.1s
    MSVMpack 182 0 10.57 23s 0.1s
  MSVM$ ^2$ MSVMpack 199 0 10.48 4.5s 0.1s
0=`
height 1.5pt @axhline USPS_500
WW Spider 303 0.40 10.20 4m 19s 0.2s
    BSVM 170 0 10.00 0.2s 0.1s
    MSVMpack 385 0 10.40 2.5s 0.1s
  CS MCSVM 284 0 9.80 0.5s 0.3s
    MSVMpack 292 0 9.80 30s 0.1s
  LLW SMSVM 500 0 12.00 5m 58s 0.1s
    MSVMpack 494 0 11.40 1m 22s 0.1s
  MSVM$ ^2$ MSVMpack 500 0 12.00 22s 0.1s
0=`
height 1.5pt @axhline Bio
WW Spider out of memory    
    BSVM 2200 6.23 6.23 11s 4.6s
    MSVMpack 4301 6.23 6.23 4m 00s 0.5s
  CS MCSVM 1727 6.23 6.23 9s 4.2s
    MSVMpack 3647 6.23 6.23 8s 0.5s
  LLW SMSVM out of memory    
    MSVMpack 12467 6.23 6.23 2m 05s 1.1s
  MSVM$ ^2$ MSVMpack 12471 6.23 6.23 11m 44s 0.9s
0=`
height 1.5pt @axhline CB513_01
WW Spider out of memory    
    BSVM 39183 19.46 25.70 1h 41m 52s 9m 02s
    MSVMpack 42277 16.94 25.55 10m 31s 26s
  CS MCSVM 41401 17.07 25.45 4h 12m 35s 27m 11s
    MSVMpack 40198 16.93 25.41 4m 04s 32s
  LLW SMSVM out of memory    
    MSVMpack 54313 22.14 27.65 11m 46s 32s
  MSVM$ ^2$ MSVMpack 62027 14.31 25.32 1h 08m 54s 47s
0=`
height 1.5pt @axhline

MNIST

WW Spider out of memory    
    BSVM 13572 0.078 1.46 4h 03m 29s 2m 39s
  MSVMpack 14771 0.015 1.41 2h 50m 29s 15s
CS MCSVM 13805 0.038 2.76 1h 54m 26s 4m 09s
  MSVMpack 14408 0.032 1.44 49m 04s 15s
LLW SMSVM out of memory    
    MSVMpack 47906 0.282 1.57 4h 36m 45s 45s
  MSVM$ ^2$ MSVMpack 53773 0.027 1.51 10h 55m 10s 55s
0=`
height 1.5pt @axhline Letter
WW Spider out of memory    
    BSVM 7460 4.30 5.85 6m 20s 5s
    MSVMpack 7725 3.14 4.75 24m 38s 3s
CS MCSVM 6310 2.03 3.90 2m 38s 4s
    MSVMpack 7566 4.72 6.92 6m 54s 3s
  LLW SMSVM out of memory    
    MSVMpack 16000 16.90 18.80 3h 56m 08s 4s
  MSVM$ ^2$ MSVMpack 16000 5.08 7.28 (S) 48h 00m 00s 3s
(S): manually stopped before reaching the default optimization accuracy.


Table 6: Results of a 5-fold cross validation on the CB513 data set. The given times are sums over the 5 training and testing steps of the cross validation. MSVMpack 1.4 includes a parallelized cross validation procedure. The star superscript indicates that the software is run on a computer with 12 CPUs and 64 GB of memory.
M-SVM Software Cross validation Training Testing Total
model   error time time time
WW BSVM 23.96 % 9h 48m 40s 46m 29s 10h 34m 50s
  MSVMpack 1.0 23.72 % 1h 05m 11s 1m 51s 1h 07m 02s
MSVMpack 1.4$ ^*$ 23.63 % N/A N/A 21m 07s
CS MCSVM 23.55 % 22h 52m 10s 2h 08m 33s 25h 00m 43s
  MSVMpack 1.0 23.63 % 1h 00m 36s 2m 06s 1h 02m 42s
MSVMpack 1.4$ ^*$ 23.55 % N/A N/A 12m 05s
LLW MSVMpack 1.0 25.65 % 1h 14m 21s 2m 33s 1h 16m 54s
  MSVMpack 1.4$ ^*$ 25.52 % N/A N/A 30m 29s
MSVM$ ^2$ MSVMpack 1.0 23.47 % 6h 44m 50s 2m 49s 6h 47m 39s
  MSVMpack 1.4$ ^*$ 23.36 % N/A N/A 2h 55m 05s

The ability of MSVMpack to compensate for the lack of cache memory by extra computations in parallel is illustrated by Table 7 for the WW model type (which uses a random working set selection and thus typically needs to compute the entire kernel matrix). In particular, for the Bio data set, changing the size of the cache memory has little influence on the training time. Note that this is also due to the nature of the data which can be processed very efficiently by the linear kernel function for bit data format (see the dot_bit() function in kernel.c). For the CB513_01 data set, very large cache sizes allow the training time to be divided by 2 or 3. However, for smaller cache sizes (below 8 GB) the actual cache size has little influence on the training time.

Table 7: Effect of the cache memory size on the training time of MSVMpack for the WW model type.
Bio  
Cache memory size in MB 10 60 120 300 600 1200  
and in % of the kernel matrix $ < 1 \%$ $ 5 \%$ $ 10 \%$ $ 25 \%$ $ 50 \%$ $ 100 \%$  
Training time 5m 55s 6m 00s 6m 13s 4m 29s 3m 56s 4m 00s  
 
CB513_01  
Cache memory size in MB 1750 3500 8200 16500 24500 30000  
and in % of the kernel matrix $ 5 \%$ $ 10 \%$ $ 25 \%$ $ 50 \%$ $ 75 \%$ $ 92 \%$  
Training time 31m 49s 30m 40s 26m 06s 23m 00s 15m 52s 10m 31s  

lauer 2014-07-03