Developer's guide

This section describes some additional options for building MSVMpack and how to call the library from another C program.


make rules and options

The following rules and options are defined in the Makefile:

Note that, on Mac OS X, make must be replaced by make -f Makefile.osx. However, this is not required for the default rule, i.e., make can be used for a simple installation of the software.


Building MSVMpack on Windows

To build MSVMpack on Windows, follow the instructions below.

  1. Install Mingw64 by using the installer that is available in the MSVMpack1.5\Windows\Utils directory.
    (Make sure to select version 4.8.1, the x64 architecture, posix threads, and seh exceptions.)
  2. Add the "mingw64\bin" path to the Windows environment variable PATH.
    For example, PATH=...;C:\Program Files\mingw-builds\x64-4.8.1-posix-seh-rev5\mingw64\bin.
    How to set the path and environment variables in Windows:
    1. From the Desktop, right-click My Computer and click Properties.
    2. Click Advanced System Settings link in the left column.
    3. In the System Properties window click the Environment Variables button.
  3. Open a command prompt (MS-DOS window): Start → Run... → cmd
  4. Go to the right directory:
    cd DIR\Windows (where DIR is the directory containing the MSVMpack1.5 source files)
  5. Run makefile64.bat to compile trainmsvm.exe and predmsvm.exe
    (use makefile32.bat in a 32-bit Windows environment).

Be careful: the newly compiled programs are placed in the bin folder and overwrite the original programs trainmsvm.exe and predmsvm.exe. To recover the original programs, run setup32bits.bat or setup64bits.bat, located in the Utils\32bits and Utils\64bits folders, respectively. These scripts copy the original binaries from bin32 or bin64 into the bin folder.


Custom kernels

Custom kernels can be easily added to MSVMpack by editing the file
MSVMpack/src/custom_kernels.c


Simply follow the instructions in this file to implement your own kernel functions. Then rebuild by running
make
to include them in MSVMpack. Note that this modifies your local copy of MSVMpack and that MSVM models do not include explicit kernel functions. This means that models trained with your local copy of MSVMpack can only be used by MSVMpack copies that include the same custom kernel functions.
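
As an illustration of what a kernel function computes, here is a minimal standalone sketch of a rational quadratic kernel for double-precision data. The function name, its signature, and the convention that vector entries are stored at indices 1..dim are assumptions made for this example only; the prototype to implement is the one given in custom_kernels.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel (not part of MSVMpack):
   rational quadratic kernel k(x1, x2) = c / (||x1 - x2||^2 + c), with c > 0.
   x1 and x2 hold dim entries stored at indices 1..dim (index 0 is unused). */
double example_rational_quadratic(const double *x1, const double *x2,
                                  long dim, double c)
{
    double sq_dist = 0.0;
    long j;

    assert(x1 != NULL && x2 != NULL && dim >= 0 && c > 0.0);
    for (j = 1; j <= dim; j++) {
        double diff = x1[j] - x2[j];
        sq_dist += diff * diff;     /* accumulate squared Euclidean distance */
    }
    return c / (sq_dist + c);
}
```

The kernel value is 1 for identical points and decreases towards 0 as the points move apart; the single parameter c plays the role of a kernel parameter that would be passed through the model.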

Using the library

To use the MSVMpack library, you need to include the three main header files located in MSVMpack/include/ as:
#include "libMSVM.h"        // Generic structure and function declarations
#include "libevalMSVM.h"    // Evaluation functions (also used during training)
#include "libtrainMSVM.h"   // Training functions (not required for predictions only)
and link with the libmsvm.a static library located in MSVMpack/lib/, e.g.,
DIR= path_to_MSVMpack_directory
LIB=-lmsvm -lm -ldl -lpthread
gcc myProgram.c -I$(DIR)/include -I$(DIR)/lp_solve_5.5 -L$(DIR)/lib $(LIB)

A simple example

The following example code (also available in doc/example.c with the corresponding Makefile) shows how to train an M-SVM model and use it to classify a test set.

#include <stdio.h>          // printf
#include <stdlib.h>         // malloc, free, exit
#include "libMSVM.h"        // Generic structure and function declarations
#include "libtrainMSVM.h"   // Training functions (not required for predictions only)
#include "libevalMSVM.h"    // Evaluation functions (also used during training)

int main(void) {
    struct Model *model;                    // declare a model
    struct Data *training_set, *test_set;   // declare the data sets
    
    long status;                // for MSVM_train return code
    double accuracy = 0.98;     // desired accuracy level of 98%  
    long chunk_size = 4;        // size of the chunk used for training
    int cache_memory = 0;       // Amount of cache memory (0 = max)
    int nprocs = 0;             // Number of available CPUs (0 = all)
    long *labels = NULL;        // predicted labels (NULL so that free() is safe if training fails)
                
    model = MSVM_make_model(MSVM2); // Create an empty model with default parameters
    if(model == NULL) {
        printf("Error in model creation\n");
        exit(1);
    }
 
    // Change some parameters
    model->nature_kernel = RBF;    
    printf("Changed the kernel type in model to Gaussian RBF.\n");
    
    MSVM_model_set_C(1.0, 3, model);
    printf("Initialized the hyperparameters C to %1.2lf for all %ld classes.\n",model->C[1],model->Q);
    model->C[2] = 2.0;
    model->C[3] = 3.0;
    printf("Changed hyperparameters C_2 to %1.2lf and C_3 to %1.2lf.\n",model->C[2],model->C[3]);

    printf("Loading the data... \n");
    // The data format can be DATATYPE_DOUBLE, DATATYPE_FLOAT, DATATYPE_INT,
    // DATATYPE_SHORT, DATATYPE_BYTE, or DATATYPE_BIT (see enum Datatype)
    training_set = MSVM_make_dataset("myTrainingData",DATATYPE_DOUBLE); // load the training data
    test_set = MSVM_make_dataset("myTestData",DATATYPE_DOUBLE);         // load the test data
        
    printf("Calling MSVM_train()...\n");
    /* Train the model on the training_set
        with default initialization (first NULL)
        no periodic saving of alpha (second NULL)
        no log file (third NULL)
    */
    status = MSVM_train(model, training_set, chunk_size, accuracy, 
                         cache_memory, nprocs, NULL, NULL, NULL);
            
    if(status >= 0) {
        printf("Training done without error, will now classify the test set.\n");
        
        /* Allocate the array for predicted labels 
            (size should be (size of test set + 1) )
        */
        labels = (long *)malloc(sizeof(long) * (test_set->nb_data + 1));

        /* Classify the test set,
        	store the predicted labels in 'labels'
        	and print outputs on screen (because filename=NULL)
        */ 
        MSVM_classify_set(labels, test_set->X, test_set->y, test_set->nb_data, 
                           NULL, model, nprocs);
    }
    else
        printf("Error in training.\n");
        
    // Free the memory
    free(labels);
    MSVM_delete_model(model);    
    MSVM_delete_dataset(training_set);
    MSVM_delete_dataset(test_set);

    return 0;
}
For other example uses of the library, see src/trainMSVM.c.

Structures

The library provides the following structure (defined in libMSVM.h) to store an M-SVM model with all its parameters:
struct Model {
    float version;                            // MSVMpack model version
    enum MSVM_type {WW=0,CS=1,
                    LLW=2,MSVM2=3} type;      // Type of the M-SVM
    enum Algorithm {FrankWolfe=0,
                    Rosen=1} algorithm;       // optimization method
    long Q;                                   // number of classes
    char *training_set_name;                  // name of training set
    long nb_data;                             // number of SVs
    long dim_input;                           // dimension of a description x
    enum Datatype datatype;                   // type of data
    enum Kernel_type nature_kernel;           // type of kernel
    double *kernel_par;                       // kernel function parameters
    double *C;                                // soft-margin parameters
    double training_error;                    // training error rate
    double ratio;                  // optimization accuracy
    long iter;                     // optimization iterations
    int crossvalidation;           // #fold in cross validation (0 otherwise)
    
    double **alpha;                // (nb_data x Q) matrix of alpha
    double *partial_average_alpha;
    double sum_all_alpha;        
    double *b_SVM;                 // vector b
    void **X;                      // support vectors
    long *y;                       // labels of the SVs
    double **normalization;        // means and std used to normalize data
    double **W;                    // weights of the linear model
                                   // (for linear kernel only)
    // Format-specific data storage
    double **X_double;
    float **X_float;
    int **X_int;  
    short int **X_short;
    unsigned char **X_byte;   

	// Thread synchronization
	pthread_mutex_t mutex;                                   
};
where the Kernel_type enumeration is defined in kernel.h as
enum Kernel_type {
    // kernel functions for DOUBLE datatype
    LINEAR=1,    // Linear
    RBF=2,       // Gaussian RBF
    POLY_H=3,    // Homogeneous polynomial
    POLY=4,      // Non-homogeneous polynomial
    CUSTOM1=5,   // Custom kernels
    CUSTOM2=6,
    CUSTOM3=7,
    // kernels for FLOAT datatype
    LINEAR_FLOAT=11,
    RBF_FLOAT=12,	
    POLY_H_FLOAT=13,	
    POLY_FLOAT=14,
    CUSTOM1_FLOAT=15,	
    CUSTOM2_FLOAT=16,
    CUSTOM3_FLOAT=17,
    // kernels for INT datatype
    LINEAR_INT=21,
    ...
};

Another generic structure is used for data sets:

struct Data {
    char *name;    // name of the data set
    long nb_data;  // number of data
    long dim;      // dimension of the data
    void **X;      // matrix with the x_i as rows
    long *y;       // vector of labels y_i
    long Q;        // number of classes in data set
    double **X_double;
    float **X_float;
    int **X_int;
    short int **X_short;
    unsigned char **X_byte;
    enum Datatype datatype;
};
where Datatype is defined as enum Datatype {DATATYPE_DOUBLE=1, DATATYPE_FLOAT=2, DATATYPE_INT=3, DATATYPE_SHORT=4, DATATYPE_BYTE=5, DATATYPE_BIT=6}.

Function reference

All functions provided by MSVMpack have the prefix MSVM_. In particular, the following general purpose functions are defined in libMSVM.h:
/* Model handling functions */
struct Model *MSVM_make_model(enum MSVM_type type);
    Creates an empty model of a given type.
    
struct Model *MSVM_load_model(char *model_file);
    Loads a model from a file.
    
void MSVM_delete_model(struct Model *model);
    Deletes a model and frees the memory except for the data
    in model->X and model->y.

void MSVM_delete_model_with_data(struct Model *model);
    Deletes a model and frees all the memory including the SVs.

int MSVM_save_model(const struct Model *model, char *model_file);
    Saves a model to a .model file.

int MSVM_save_model_sparse(const struct Model *model, char *model_file);
    Saves a model to a .model file using the sparse format.

long MSVM_init_model(struct Model *model, char *com_file);
    Initializes the parameters of a model from a .com file.
    
void MSVM_model_set_C(double C, long Q, struct Model *model);
    Sets the values of C_k to C for k=1,..., Q.
    (this also sets the value of Q in the model)

/* Data set handling functions */
struct Data *MSVM_make_dataset(char *data_file, enum Datatype datatype);
    Loads a dataset in datatype format from a file 
    or creates an empty Data structure if data_file is NULL.

void MSVM_delete_dataset(struct Data *dataset);
    Deletes a dataset and frees the memory.

double MSVM_normalize_data(struct Data *dataset, struct Model *model);
    Normalizes the columns of X in a dataset and returns the difference 
    between the largest and smallest std before normalization.

The following training functions are defined in libtrainMSVM.h:

long MSVM_train(struct Model *model, struct Data *training_set, 
    long chunk_size, const double accuracy, int cache_memory, int nprocs,
    char *alpha0_file, char *model_tmp_file, char *log_file); 
    
    Trains an M-SVM model on the training_set with a given chunk_size 
    by using cache_memory MB of memory and nprocs CPUs
    until the desired accuracy is reached.

double **MSVM_train_cv(struct Model *model, struct Data *training_set, 
    int K, long chunk_size, const double accuracy, int cache_memory,
    int nprocs, char *log_file);

    Performs K-fold cross-validation and returns 
    a 2-by-(K+1) array of error_rates:
	
        error_rates[0] = [Overall training error rate, 
                                training error_rate on fold 1, 
                                ...,
                                training error_rate on fold K]

        error_rates[1] = [Cross-validation estimate of the error, 
                                test error_rate on fold 1, 
                                ...,
                                test error_rate on fold K]
							
        So the cv error estimate is in error_rates[1][0]. 
		

long MSVM_init_train_comfile(struct Model *model, char *com_file, 
    char *training_file, char *alpha0_file, char *alpha_file, char *log_file);
    
    Initializes the parameters and filenames used for training an M-SVM 
    from a .com file.
The source file libtrainMSVM.c also contains the internal code for handling the kernel cache (see section 4.3).

The following evaluation functions are provided in libevalMSVM.h:

long MSVM_classify(void *x, const struct Model *model, double *real_outputs);
    
    Computes the label of x WITHOUT normalizing the data.
     (use MSVM_classify_set() to include normalization)	
    If real_outputs is not NULL, also provide the Q real-valued outputs
    of the model in (real_outputs[1]...real_outputs[Q]).	
    
void MSVM_classify_set(long *labels, void **X, long *y, long m, char *outputs_file, 
    const struct Model *model, const int nprocs);

    Computes the predicted labels of an M-SVM for a set of m data points X
    in parallel over nprocs CPUs. 
    Also computes the test error and some statistics if y is not NULL.
    The resulting outputs are saved into outputs_file, or printed on screen 
    if outputs_file is NULL. 
    If labels is NULL, the predicted labels are not stored in memory. 
    Note: this function takes care of data normalization if needed 
     (X should not be normalized).  

double MSVM_eval(double *best_primal_upper_bound, double **gradient, double **H_alpha,
    double **H_tilde_alpha, struct Model *model, const int verbose, FILE *fp);
    
    Evaluates the ratio between the value of the dual objective function and 
    the upper bound on the optimum.

These training and evaluation functions are wrappers that call the proper function depending on the model type (WW, CS, LLW or MSVM2). The training and evaluation functions for a particular model type are implemented in separate files named libtrainMSVM_XX.c and libevalMSVM_XX.c, respectively, where XX stands for the model type.

The other functions included in the library are for internal use and should be called only by the functions described above.

Matrix format

The matrix format conforms to the one of [9]. A double-precision real matrix $X$ is declared as double **X, and its entry $(X)_{ij}$ is accessed as X[i][j].

This means that notations like X[(i-1)*N+j] cannot be used to access $(X)_{ij}$.
Note: the ranges for the subscripts $i$ and $j$ are $\{1,\dots,M\}$ and $\{1,\dots,N\}$, respectively, for a matrix of size $M\times N$ (do not use $i=0$ or $j=0$).
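
A minimal sketch of a matrix that follows this indexing convention is given below. It assumes that each row is a separately allocated array and that row 0 and column 0 are simply left unused. The function names are hypothetical and the allocation strategy is an assumption for illustration, not necessarily the one used inside MSVMpack.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate an M x N matrix accessed as X[i][j] for i in 1..M and j in 1..N.
   Row 0 and column 0 are allocated but unused; entries are zero-initialized.
   Returns NULL on allocation failure. */
double **alloc_matrix_1based(long M, long N)
{
    double **X;
    long i;

    assert(M >= 0 && N >= 0);
    X = calloc((size_t)M + 1, sizeof *X);
    if (X == NULL)
        return NULL;
    for (i = 1; i <= M; i++) {
        X[i] = calloc((size_t)N + 1, sizeof **X);
        if (X[i] == NULL) {
            while (--i >= 1)        /* release the rows allocated so far */
                free(X[i]);
            free(X);
            return NULL;
        }
    }
    return X;
}

/* Free a matrix obtained from alloc_matrix_1based(). */
void free_matrix_1based(double **X, long M)
{
    long i;

    if (X == NULL)
        return;
    for (i = 1; i <= M; i++)
        free(X[i]);
    free(X);
}
```

With this layout, X[i] is a pointer to the i-th row, so a single data point can be passed around as X[i] and its components read as X[i][1], ..., X[i][N].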


File formats

Here is the list of the different file formats described below and the corresponding naming conventions.

Data file format.

The library can load data sets from data files in the following format:
1200     --> number of data
4        --> dimension of the data
7.400000 2.800000 6.100000 1.900000 3.000000
7.900000 3.800000 6.400000 2.000000 3
...
3.400000 -2.800000 0.5600000 2.200000 2
1.300000 0.800000 5.100000 1.500000 1.000000
|<-----     vector x_i      ----->|  y_i
where the labels $y_i$ can be either positive integers (in an integer or floating-point data format) or omitted (for test data). If the labels are included, the number of classes in a data set is automatically set to $\max_i y_i$ when the file is read by MSVM_make_dataset().
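
To make the row layout concrete, the following sketch parses a single data row with an optional label. It is an independent illustration, not the parser used by MSVM_make_dataset(): the function name and its return convention are hypothetical, and the features are stored at indices 1..dim only to match the indexing convention of the Matrix format section.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse one data row of the form "x_1 ... x_dim [y]".
   Stores the features in x[1..dim] (x[0] is unused) and the label in *y.
   Returns 1 if a label was found, 0 if the row has no label (*y is left
   untouched), and -1 if the row holds fewer than dim values. */
int parse_data_row(const char *line, long dim, double *x, long *y)
{
    const char *p = line;
    char *end;
    double v;
    long j;

    assert(line != NULL && x != NULL && y != NULL);
    for (j = 1; j <= dim; j++) {
        v = strtod(p, &end);
        if (end == p)
            return -1;              /* nothing left to convert */
        x[j] = v;
        p = end;
    }
    v = strtod(p, &end);
    if (end == p)
        return 0;                   /* unlabeled row (test data) */
    *y = (long)v;                   /* accepts "3" as well as "3.000000" */
    return 1;
}
```

Reading the label through strtod() is what allows both the integer and the floating-point label notations shown in the sample file to be handled by the same code.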

Model file format (.model).

The M-SVM models are saved in the following format:
1.1                           --> MSVMpack model version
3                             --> type of the M-SVM
3                             --> number of classes Q
2                             --> type of kernel
1 2.500000                    --> kernel parameters (#par par1 par2...)
myTrainingData                --> training set filename
0.032000                      --> training error rate
120                           --> number of SVs
4                             --> dimension of the SVs (= dimension of the data)
10.000000 10.000000 10.000000 --> values of C_k (soft-margin parameters)
2.0000000000 1.2000000000 2.5000000000 0.4600000000     --> std for normalization
4.0000000000 3.2000000000 2.8000000000 1.0120000000     --> means for normalization
1.1635254839 -3.5845297998 2.4210043159                 --> bias vector b 
0.0000000000 6.1803244144 0.0000000000                  --> vector alpha for a SV
5.1000000000 3.5000000000 1.4000000000 0.2000000000 1   --> the corresponding SV 
... 
13.5696087594 31.9953451299 0.0000000000                --> vector alpha for a SV
6.3000000000 2.8000000000 5.1000000000 1.5000000000 3   --> the corresponding SV 
|<-----            vector x_i               ----->| y_i
When using MSVM_save_model(), the entire training set is saved as SVs in the model. The function MSVM_save_model_sparse() should be used to save only the true SVs. Note that the former method allows the model to be retrained (resume training from where it was stopped), whereas the sparse format does not allow this feature.

Output file format (.outputs).

The files of outputs assume the following format:
    1.510324    -1.058388    -0.451936    1
    1.511907    -0.600563    -0.911344    1
   ...
   -0.465904     0.924967    -0.459063    2
   -0.448488     0.856948    -0.408460    2
      |                           |       |
   h_1(x_i)        ...        h_Q(x_i)   predicted label for x_i
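
In the sample rows above, the predicted label is the index of the largest output. Assuming this decision rule (class with the largest h_k), a row of outputs can be checked with a sketch like the following, where the function name is hypothetical and the outputs are stored at indices 1..Q as for the real_outputs argument of MSVM_classify():

```c
#include <assert.h>
#include <stddef.h>

/* Return the index k in 1..Q of the largest output h[k] (h[0] is unused).
   Ties are resolved in favor of the smallest index. */
long argmax_output(const double *h, long Q)
{
    long k, best = 1;

    assert(h != NULL && Q >= 1);
    for (k = 2; k <= Q; k++)
        if (h[k] > h[best])
            best = k;
    return best;
}
```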

Log file format (.log).

The log file format is used to record information during the training process. It assumes the following form:
1000 236.011772 234.107674 256.439555 246.631361 
2000 240.308282 241.653037 251.815730 244.529922 
3000 241.525854 241.474467 247.055776 244.238342 
  |      |         |         |          |
#iter   dual       U1        U2         U3
where dual is the value of the dual objective function at iteration #iter, and the meaning (and existence) of U1, U2 and U3 depends on the model type. For all the M-SVMs except the M-SVM$^2$, U1 is the value of the upper bound on the optimum used to check the convergence, whereas U2 and U3 do not exist. On the other hand, for an M-SVM$^2$, U1 is the objective function value of the unconstrained primal problem directly estimated from the current $\alpha$ (not a valid upper bound), U2 is a cheap upper bound on the optimum obtained by projecting the estimated values of the slack variables onto the constraints, and U3 is the value of the optimized upper bound.
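
A log line can be turned into a convergence ratio with a sketch such as the one below. It assumes that the ratio of interest is the dual value divided by the last bound on the line (U1 when it is the only bound, U3 when three bounds are present). This is an interpretation of the description above rather than the computation performed by MSVM_eval(), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one log line "#iter dual U1 [U2 U3]" and return dual divided by the
   last bound on the line (U1 for three fields, U3 for five fields).
   The iteration count is stored in *iter.
   Returns -1.0 if the line does not have three or five numeric fields. */
double log_line_ratio(const char *line, long *iter)
{
    double dual, u[3];
    int n;

    assert(line != NULL && iter != NULL);
    n = sscanf(line, "%ld %lf %lf %lf %lf", iter, &dual, &u[0], &u[1], &u[2]);
    if (n == 3)
        return dual / u[0];
    if (n == 5)
        return dual / u[2];
    return -1.0;
}
```

Applied to the first sample line, this divides 236.011772 by 246.631361, which gives a ratio of roughly 0.957.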

Kernel parameter file format.

The parameters of the kernel function can be passed through a file (which is particularly useful for custom kernels using many parameters). The file can take two forms depending on the trainmsvm command line. With
trainmsvm myData.train -k 5 -P myKernel_parameters
the file myKernel_parameters must follow the format:
6 1.1052 2.00032 3.015 4.0 0.55 0.6
|   |<--- list of parameters --->|
|
number of parameters
If the number of parameters is explicitly given on the command line as in
trainmsvm myData.train -k 5 -P 6 myKernel_parameters
then the file myKernel_parameters becomes:
1.1052 2.00032 3.015 4.0 0.55 0.6
  |<--- list of parameters --->|
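
The first form of the file (parameter count followed by the values) can be read with a sketch like the following. The function name and the choice of storing the values at indices 1..n are assumptions made for this example; this is not the parser used by trainmsvm.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "n p_1 ... p_n" from text. On success, returns n and stores a newly
   allocated array in *par with the values at par[1..n] (par[0] is unused);
   the caller must free it. Returns -1 on malformed input or allocation
   failure, in which case *par is left untouched. */
long parse_kernel_parameters(const char *text, double **par)
{
    const char *p = text;
    char *end;
    double *values;
    long n, k;

    assert(text != NULL && par != NULL);
    n = strtol(p, &end, 10);
    if (end == p || n < 0)
        return -1;                  /* missing or invalid parameter count */
    p = end;
    values = malloc(((size_t)n + 1) * sizeof *values);
    if (values == NULL)
        return -1;
    values[0] = 0.0;
    for (k = 1; k <= n; k++) {
        values[k] = strtod(p, &end);
        if (end == p) {             /* fewer values than announced */
            free(values);
            return -1;
        }
        p = end;
    }
    *par = values;
    return n;
}
```

For the second form of the file, where the count is given on the command line, the same loop applies with n taken from the command line instead of the first field.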

lauer 2014-07-03