DistFSM

- Density-based data partitioning strategy to approximate large-scale subgraph mining
- Published in Information Systems, Volume 48, Pages 213-223. [Publisher's version]
Authors | Datasets | Results | Download

Authors:

- Sabeur Aridhi
  
  Tel: +39 04 61 28 39 92
  Email: sabeur.aridhi@gmail.com
  Web page: http://fc.isima.fr/~aridhi/
  DISI Department
  University of Trento
  Via Sommarive 9, 38123 Trento, Italy
- Laurent d'Orazio
  
  Tel: +33 4 73 40 78 32
  Email: laurent.dorazio@gmail.com
  Web page: http://www.isima.fr/~dorazio/
  
  LIMOS, UMR 6158
  Blaise Pascal University , BP 10125,
  63173, Clermont Ferrand, France
- Mondher Maddouri
  
  Email: maddourimondher@yahoo.fr
  
  Faculty of sciences of Tunis, LIPAH
  University of Tunis El Manar
  1060, Tunis, Tunisia
- Engelbert Mephu Nguifo
  
  Tel: +33 4 73 40 76 29
  Email: mephu@isima.fr
  Web page: http://www.isima.fr/~mephu/
  
  LIMOS, UMR 6158
  Blaise Pascal University, BP 10125,
  63173, Clermont Ferrand, France

Datasets:

General description

We have performed an extensive set of experiments to evaluate the effectiveness and efficiency of our approach on a number of different real and synthetic datasets.

Table. 1: Experimental data.

Dataset	Type	Number of graphs	Size on disk	Average size
DS1	Synthetic	20,000	18 MB	[50-100]
DS2	Synthetic	100,000	81 MB	[50-70]
DS3	Real	274,860	97 MB	[40-50]
DS4	Synthetic	500,000	402 MB	[60-70]
DS5	Synthetic	1,500,000	1.2 GB	[60-70]
DS6	Synthetic	100,000,000	69 GB	[20-100]

The synthetic datasets are generated by the synthetic data generator GraphGen provided by Kuramochi and Karypis [1].
The real dataset we tested is a chemical compound dataset.

[1] M. Kuramochi, G. Karypis. Frequent Subgraph Discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM '01. IEEE Computer Society, Washington, DC, USA, 2001, pp. 313-320.

Data sources

The real dataset we tested is available from the Developmental Therapeutics Program (DTP) web site at National Cancer Institute (NCI). http://dtp.nci.nih.gov/branches/npb/repository.html
GraphGen tool was used to generate our synthetic datasets. Download GraphGen

Results:

We implemented our approach in Perl language and we used Hadoop, (version 0.20.1), an open source version of MapReduce.
All the experiments of our approach were carried out using a local cluster with five nodes.
The processing nodes used in our tests are equipped with a Quad-Core AMD Opteron(TM) Processor 6234 2.40 GHz CPU and 4 GB of memory for each node.

Effect of the partitioning method on the rate of lost subgraphs

(a) DS1	(b) DS2


(c) DS3

Speedup and Scalability

(a) Cost of DGP and MRGP partitioning methods	(b) Effect of the number of buckets on the cost of the DGP method


(c) Effect of the number of workers on the runtime of our MapReduce-based framework

Chunk size and replication factor

(a) Effect of chunk size on the runtime of our MapReduce-based framework	(b) Effect of the number of copies of data on the runtime of our MapReduce-based framework

Download:

An implementation of our DGP (Density-based partitioning) method is available here.
A Hadoop implementation of our MapReduce-based framework for distributed frequent subgraph mining is available here.
The used subgraph extractors can be found here.
Instructions describing how to setup a Hadoop cluster on Ubuntu 14.04 can be found here. You can get a PDF copy from here.