An experimental comparison of MSVMpack with other implementations of M-SVMs is provided here for illustrative purposes. The aim is not to draw conclusions about which type of M-SVM model is best, but rather to give an idea of the efficiency of the different implementations. The following implementations are considered: Spider, BSVM, MCSVM, SMSVM, and MSVMpack.
The characteristics of the data sets used in the experiments are given in Table 4. The ImageSeg data set is taken from the UCI repository. The USPS_500 data set is a subset of the USPS data provided with the MCSVM software. The Bio data set is provided with MSVMpack. The CB513_01 data set corresponds to the first split of the 5-fold cross validation procedure on the entire CB513 data set [3]. MSVMpack uses the original MNIST data with integer values in the range [0, 255], while other implementations use scaled data downloaded from the LIBSVM homepage. The BSVM scaling tool is applied to the Letter data set downloaded from the UCI repository to obtain a normalized data set in floating-point format.
All methods use the same kernel functions and hyperparameters, as summarized in Table 4. When unspecified, the kernel hyperparameters are set to the MSVMpack defaults. SMSVM requires a larger value of its regularization constant to reach a similar training error, so a larger value is used for it. Other low-level parameters (such as the size of the working set) are kept at each software's defaults. These experiments are performed on a computer with 2 Xeon processors (with 4 cores each) at 2.53 GHz and 32 GB of memory running Linux (64-bit).
The results are shown in Tables 5 and 6. Though no definitive conclusion can be drawn from this small set of experiments, some trends are often observed in practice.
Table 4: Characteristics of the data sets and training parameters.

| Data set | #classes | #training examples | #test examples | data dim. | data format | Training parameters |
|---|---|---|---|---|---|---|
| ImageSeg | 7 | 210 | 2100 | 19 | double | RBF kernel, data normalized |
| USPS_500 | 10 | 500 | 500 | 256 | float | RBF kernel |
| Bio | 4 | 12471 | 12471 | 68 | bit | linear kernel |
| CB513_01 | 3 | 65210 | 18909 | 260 | short | RBF kernel |
| CB513 | 3 | 84119 | 5-fold CV | 260 | short | RBF kernel |
| MNIST | 10 | 60000 | 10000 | 784 | short, double | RBF kernel, data normalized |
| Letter | 26 | 16000 | 4000 | 16 | float | RBF kernel, data normalized |
Table 5: Results of the different implementations on the test sets.

| Data set | M-SVM model | Software | #SV | Training error (%) | Test error (%) | Training time | Testing time |
|---|---|---|---|---|---|---|---|
| ImageSeg | WW | Spider | 113 | 5.24 | 11.05 | 11s | 3.3s |
| | | BSVM | 98 | 2.38 | 9.81 | 0.1s | 0.1s |
| | | MSVMpack | 192 | 0 | 10.33 | 1.2s | 0.1s |
| | CS | MCSVM | 97 | 2.86 | 8.86 | 0.1s | 0.1s |
| | | MSVMpack | 141 | 0 | 9.24 | 3.3s | 0.1s |
| | LLW | SMSVM | 210 | 0.48 | 7.67 | 15s | 0.1s |
| | | MSVMpack | 182 | 0 | 10.57 | 23s | 0.1s |
| | MSVM² | MSVMpack | 199 | 0 | 10.48 | 4.5s | 0.1s |
| USPS_500 | WW | Spider | 303 | 0.40 | 10.20 | 4m 19s | 0.2s |
| | | BSVM | 170 | 0 | 10.00 | 0.2s | 0.1s |
| | | MSVMpack | 385 | 0 | 10.40 | 2.5s | 0.1s |
| | CS | MCSVM | 284 | 0 | 9.80 | 0.5s | 0.3s |
| | | MSVMpack | 292 | 0 | 9.80 | 30s | 0.1s |
| | LLW | SMSVM | 500 | 0 | 12.00 | 5m 58s | 0.1s |
| | | MSVMpack | 494 | 0 | 11.40 | 1m 22s | 0.1s |
| | MSVM² | MSVMpack | 500 | 0 | 12.00 | 22s | 0.1s |
| Bio | WW | Spider | out of memory | | | | |
| | | BSVM | 2200 | 6.23 | 6.23 | 11s | 4.6s |
| | | MSVMpack | 4301 | 6.23 | 6.23 | 4m 00s | 0.5s |
| | CS | MCSVM | 1727 | 6.23 | 6.23 | 9s | 4.2s |
| | | MSVMpack | 3647 | 6.23 | 6.23 | 8s | 0.5s |
| | LLW | SMSVM | out of memory | | | | |
| | | MSVMpack | 12467 | 6.23 | 6.23 | 2m 05s | 1.1s |
| | MSVM² | MSVMpack | 12471 | 6.23 | 6.23 | 11m 44s | 0.9s |
| CB513_01 | WW | Spider | out of memory | | | | |
| | | BSVM | 39183 | 19.46 | 25.70 | 1h 41m 52s | 9m 02s |
| | | MSVMpack | 42277 | 16.94 | 25.55 | 10m 31s | 26s |
| | CS | MCSVM | 41401 | 17.07 | 25.45 | 4h 12m 35s | 27m 11s |
| | | MSVMpack | 40198 | 16.93 | 25.41 | 4m 04s | 32s |
| | LLW | SMSVM | out of memory | | | | |
| | | MSVMpack | 54313 | 22.14 | 27.65 | 11m 46s | 32s |
| | MSVM² | MSVMpack | 62027 | 14.31 | 25.32 | 1h 08m 54s | 47s |
| MNIST | WW | Spider | out of memory | | | | |
| | | BSVM | 13572 | 0.078 | 1.46 | 4h 03m 29s | 2m 39s |
| | | MSVMpack | 14771 | 0.015 | 1.41 | 2h 50m 29s | 15s |
| | CS | MCSVM | 13805 | 0.038 | 2.76 | 1h 54m 26s | 4m 09s |
| | | MSVMpack | 14408 | 0.032 | 1.44 | 49m 04s | 15s |
| | LLW | SMSVM | out of memory | | | | |
| | | MSVMpack | 47906 | 0.282 | 1.57 | 4h 36m 45s | 45s |
| | MSVM² | MSVMpack | 53773 | 0.027 | 1.51 | 10h 55m 10s | 55s |
| Letter | WW | Spider | out of memory | | | | |
| | | BSVM | 7460 | 4.30 | 5.85 | 6m 20s | 5s |
| | | MSVMpack | 7725 | 3.14 | 4.75 | 24m 38s | 3s |
| | CS | MCSVM | 6310 | 2.03 | 3.90 | 2m 38s | 4s |
| | | MSVMpack | 7566 | 4.72 | 6.92 | 6m 54s | 3s |
| | LLW | SMSVM | out of memory | | | | |
| | | MSVMpack | 16000 | 16.90 | 18.80 | 3h 56m 08s | 4s |
| | MSVM² | MSVMpack | 16000 | 5.08 | 7.28 | (S) 48h 00m 00s | 3s |
Table 6: Results of the 5-fold cross validation on the CB513 data set.

| M-SVM model | Software | Cross validation error | Training time | Testing time | Total time |
|---|---|---|---|---|---|
| WW | BSVM | 23.96 % | 9h 48m 40s | 46m 29s | 10h 34m 50s |
| | MSVMpack 1.0 | 23.72 % | 1h 05m 11s | 1m 51s | 1h 07m 02s |
| | MSVMpack 1.4 | 23.63 % | N/A | N/A | 21m 07s |
| CS | MCSVM | 23.55 % | 22h 52m 10s | 2h 08m 33s | 25h 00m 43s |
| | MSVMpack 1.0 | 23.63 % | 1h 00m 36s | 2m 06s | 1h 02m 42s |
| | MSVMpack 1.4 | 23.55 % | N/A | N/A | 12m 05s |
| LLW | MSVMpack 1.0 | 25.65 % | 1h 14m 21s | 2m 33s | 1h 16m 54s |
| | MSVMpack 1.4 | 25.52 % | N/A | N/A | 30m 29s |
| MSVM² | MSVMpack 1.0 | 23.47 % | 6h 44m 50s | 2m 49s | 6h 47m 39s |
| | MSVMpack 1.4 | 23.36 % | N/A | N/A | 2h 55m 05s |
The ability of MSVMpack to compensate for a lack of cache memory with additional parallel computation is illustrated in Table 7 for the WW model (which uses random working set selection and thus typically needs to compute the entire kernel matrix). In particular, for the Bio data set, changing the size of the cache memory has little influence on the training time. Note that this is also due to the nature of the data, which can be processed very efficiently by the linear kernel function for the bit data format (see the dot_bit() function in kernel.c). For the CB513_01 data set, very large cache sizes divide the training time by a factor of 2 to 3; for smaller cache sizes (below 8 GB), however, the actual cache size has little influence on the training time.
Table 7: Influence of the cache memory size on the training time (WW model).

Bio

| Cache memory size (MB) | 10 | 60 | 120 | 300 | 600 | 1200 |
|---|---|---|---|---|---|---|
| Training time | 5m 55s | 6m 00s | 6m 13s | 4m 29s | 3m 56s | 4m 00s |

CB513_01

| Cache memory size (MB) | 1750 | 3500 | 8200 | 16500 | 24500 | 30000 |
|---|---|---|---|---|---|---|
| Training time | 31m 49s | 30m 40s | 26m 06s | 23m 00s | 15m 52s | 10m 31s |