In this experimental study, we compare the most popular Big Data frameworks. We divided the study into two parts: (1) batch mode processing and (2) stream mode processing. The study covers the performance and scalability of Spark, Hadoop, Flink, Samza, and Storm, the impact of several configuration parameters on their performance, and their resource usage. We chose the WordCount, KMeans, and PageRank workloads, with real and synthetic datasets, to evaluate the frameworks in batch mode, and an Extract, Transform, Load (ETL) process to evaluate them in stream mode.
All the experiments were performed on a real cluster called GALACTICA (https://galactica.isima.fr), hosted at the LIMOS laboratory of Clermont Auvergne University. The cluster is composed of 10 machines running Linux Ubuntu 16.04. Each machine is equipped with 4 CPUs, 8 GB of main memory, and 500 GB of local storage. For our tests, we used Hadoop 2.6.3, Flink 1.1.0, Spark 1.6.0, Samza 0.10, and Storm 0.9. Spark and Flink were deployed in stand-alone mode, while Hadoop and Storm were deployed with the YARN cluster manager and ZooKeeper, respectively. For all the evaluated frameworks, we first used the default values of their configuration parameters; we then varied some of these parameters in order to analyze their impact on performance.
We consider two scenarios according to the data processing mode. For each scenario, we measure the performance of the presented frameworks.
- Batch Mode: In the Batch Mode scenario, we evaluate Hadoop, Spark, and Flink while running our workloads. For the WordCount application, we used a set of tweets as the dataset. The tweets were collected with Apache Flume and stored in HDFS. The motivation behind using Apache Flume to collect the processed tweets is its easy integration into the Hadoop ecosystem (especially the HDFS system). Moreover, Apache Flume collects data in a distributed way and offers high data availability and fault tolerance. We collected 10 billion tweets and used them to form large tweet files with a size on disk varying from 250 MB to 100 GB. For the KMeans application, we generated synthetic datasets containing between 1,000 and 50 million learning examples. For the PageRank workload, we used seven real graph datasets with different numbers of nodes and edges. Table 1 gives more details on these datasets, which were downloaded from the Stanford Large Network Dataset Collection (SNAP) and formatted as plain text files in which each line represents a link between two nodes. A sketch of a possible Flume configuration for the collection step is given after Table 1.
Table 1: Real graph datasets used for the PageRank workload.

| Dataset | Number of nodes | Number of edges |
|---|---|---|
| G1 | 685 230 | 7 600 595 |
| G2 | 875 713 | 5 105 039 |
| G3 | 325 729 | 1 497 134 |
| G4 | 281 903 | 2 312 497 |
| G5 | 1 965 206 | 2 766 607 |
| G6 | 3 997 962 | 34 681 189 |
| G7 | 4 847 571 | 68 993 773 |
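As an illustration of the collection step, a Flume agent pairing the Twitter source with an HDFS sink might be configured as follows. This is a minimal sketch: the agent name, channel sizing, paths, and credentials are placeholders, not the exact configuration used in the study.

```properties
# flume.conf (sketch): Twitter source -> memory channel -> HDFS sink
agent.sources = twitter
agent.channels = mem
agent.sinks = hdfs

# Twitter source (credentials are placeholders)
agent.sources.twitter.type = org.apache.flume.source.twitter.TwitterSource
agent.sources.twitter.consumerKey = <consumer-key>
agent.sources.twitter.consumerSecret = <consumer-secret>
agent.sources.twitter.accessToken = <access-token>
agent.sources.twitter.accessTokenSecret = <access-token-secret>
agent.sources.twitter.channels = mem

# In-memory channel buffering events between source and sink
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# HDFS sink writing the collected tweets (placeholder NameNode address and path)
agent.sinks.hdfs.type = hdfs
agent.sinks.hdfs.hdfs.path = hdfs://namenode:9000/data/tweets
agent.sinks.hdfs.hdfs.fileType = DataStream
agent.sinks.hdfs.channel = mem
```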
- Stream Mode: In the Stream Mode scenario, we evaluate the real-time data processing capabilities of Storm, Flink, and Spark. The Stream Mode scenario is divided into three main steps. The first step is devoted to data storage: to do so, we collected 1 billion tweets from Twitter using Apache Flume and stored them in HDFS. In the second step, those data are sent to Kafka, a messaging server that guarantees fault tolerance during streaming and message persistence (a sketch of such a producer is given below). In the third step, the messages are consumed and processed by the studied frameworks.
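As a hedged illustration of the second step, a minimal Java producer that pushes tweet lines into a Kafka topic might look like the following. The broker address, topic name, and input handling are assumptions for the sketch, not the exact code used in the study.

```java
import java.util.Properties;
import java.util.Scanner;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TweetProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of the Kafka broker (placeholder host/port)
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In the study the tweets come from HDFS; here we read from stdin for brevity.
            Scanner in = new Scanner(System.in);
            while (in.hasNextLine()) {
                // Each tweet becomes one message on the (hypothetical) "tweets" topic.
                producer.send(new ProducerRecord<>("tweets", in.nextLine()));
            }
        }
    }
}
```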
To monitor resource usage according to the executed jobs, we implemented a custom monitoring tool, as shown in the figure below.
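The figure outlines the tool's architecture; its exact implementation is not reproduced here. As a rough sketch of the underlying idea, a per-node agent can periodically sample Linux counters such as /proc/stat and /proc/meminfo. The sampling period and output format below are assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class NodeMonitor {
    public static void main(String[] args) throws Exception {
        while (true) {
            // First line of /proc/stat holds the aggregate CPU jiffies counters.
            String cpu = Files.readAllLines(Paths.get("/proc/stat")).get(0);
            // First two lines of /proc/meminfo hold MemTotal and MemFree.
            List<String> mem = Files.readAllLines(Paths.get("/proc/meminfo"));
            System.out.println(System.currentTimeMillis()
                    + " | " + cpu
                    + " | " + mem.get(0) + " | " + mem.get(1));
            Thread.sleep(1000); // sample every second (assumed period)
        }
    }
}
```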
Here, we present snippets of the code used to implement the WordCount workload with the studied frameworks.
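The original snippets are shown as figures; as a representative example, a minimal WordCount in Spark's Java RDD API (as available in Spark 1.6) might look like the following. The input and output paths are placeholders, not those used in the study.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the tweet files stored in HDFS (placeholder path).
        JavaRDD<String> lines = sc.textFile("hdfs:///data/tweets");

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+"))) // split lines into words
                .mapToPair(word -> new Tuple2<>(word, 1))           // emit (word, 1) pairs
                .reduceByKey((a, b) -> a + b);                      // sum the counts per word

        counts.saveAsTextFile("hdfs:///output/wordcount");
        sc.stop();
    }
}
```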
This experiment aims to evaluate the impact of the data size on the processing time. We ran two sets of simulations according to the size of the data: (1) simulations with small datasets and (2) simulations with large datasets. The experiments are conducted with datasets whose size on disk varies from 250 MB to 2 GB for the first set and from 1 GB to 100 GB for the second.
In this experiment, we evaluate the studied frameworks in the case of iterative processing, using both the KMeans and PageRank workloads.

In the same context of iterative processing, we evaluated, through the KMeans and PageRank use cases, the impact of the number of iterations on the response time. The sketch below illustrates how such iterative jobs are typically structured.
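As a hedged illustration of the iterative pattern (not the exact code used in the study), a simplified PageRank in Spark's Java API repeats the same transformation for a configurable number of iterations, which is precisely the parameter varied in this experiment. Paths and the damping factor are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SimplePageRank {
    public static void main(String[] args) {
        int iterations = Integer.parseInt(args[0]); // the parameter varied in this experiment
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("PageRank"));

        // Each input line is "source target", as in the SNAP-formatted graph files.
        JavaPairRDD<String, Iterable<String>> links = sc.textFile("hdfs:///data/graph")
                .mapToPair(line -> {
                    String[] parts = line.split("\\s+");
                    return new Tuple2<>(parts[0], parts[1]);
                })
                .groupByKey()
                .cache(); // the link structure is reused at every iteration

        JavaPairRDD<String, Double> ranks = links.mapValues(v -> 1.0);

        for (int i = 0; i < iterations; i++) {
            // Distribute each node's rank evenly over its outgoing links,
            // then recombine the contributions with the damping factor.
            JavaPairRDD<String, Double> contribs = links.join(ranks).values()
                    .flatMapToPair(pair -> {
                        List<Tuple2<String, Double>> out = new ArrayList<>();
                        int size = 0;
                        for (String neighbor : pair._1()) size++;
                        for (String neighbor : pair._1()) out.add(new Tuple2<>(neighbor, pair._2() / size));
                        return out;
                    });
            ranks = contribs.reduceByKey((a, b) -> a + b)
                    .mapValues(sum -> 0.15 + 0.85 * sum);
        }

        ranks.saveAsTextFile("hdfs:///output/pagerank");
        sc.stop();
    }
}
```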
In this experiment, we show the impact of data partitioning on the studied frameworks. In our experimental setup, we used HDFS for storage. We varied the block size of our HDFS system and ran KMeans with 10 iterations on all the studied frameworks.
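For reference, the HDFS block size can be overridden per client when the data is loaded; a minimal sketch using the Hadoop FileSystem API is shown below. The NameNode address, paths, and block size value are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default block size (here 64 MB) for files written by this client.
        conf.set("dfs.blocksize", "67108864");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        fs.copyFromLocalFile(new Path("/local/kmeans-data.txt"),
                             new Path("/data/kmeans-data.txt"));
        fs.close();
    }
}
```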
In order to understand the impact of the configuration parameters on the performance and the quality of the results, we study in these experiments some configuration parameters, mainly related to the RAM and to the number of threads in each framework.
Apache Flink: To evaluate the impact of some configuration parameters on the performance of Flink, we first executed our workloads while varying the number of slots in each TaskManager. Then, we varied the amount of memory used by each TaskManager. A sketch of the corresponding settings is given below.
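In Flink 1.1, these two knobs correspond to the following entries in flink-conf.yaml (the values shown are illustrative, not those used in the study):

```yaml
# flink-conf.yaml (illustrative values)
taskmanager.numberOfTaskSlots: 4   # processing slots per TaskManager
taskmanager.heap.mb: 4096          # memory allocated to each TaskManager, in MB
```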
Apache Spark: The parallelism configuration in Spark requires defining the number of workers per machine. In addition, memory management is crucial, as we must configure the memory allocated to each worker. These two parameters are spark.worker.instances and spark.worker.memory, respectively.
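In stand-alone mode, these settings are commonly expressed through the corresponding environment variables in spark-env.sh (the values below are illustrative):

```bash
# spark-env.sh (illustrative values)
export SPARK_WORKER_INSTANCES=2   # number of worker processes per machine
export SPARK_WORKER_MEMORY=4g     # memory each worker can allocate to executors
```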
Apache Hadoop: To configure the number of slots on each node of a Hadoop cluster, we must set the two following parameters: (1) mapreduce.tasktracker.map.tasks.maximum and (2) mapreduce.tasktracker.reduce.tasks.maximum. These two parameters define the numbers of map and reduce functions that run simultaneously on each node of the cluster. Increasing them raises CPU utilization, which can improve the processing time.
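These parameters are set in mapred-site.xml; a minimal sketch with illustrative values:

```xml
<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- map tasks allowed to run concurrently per node -->
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>4</value> <!-- reduce tasks allowed to run concurrently per node -->
</property>
```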
Here, we measure the CPU, RAM, and disk I/O usage, as well as the bandwidth consumption, of the studied frameworks in the Batch Mode scenario.
The goal here is to compare the performance of the studied frameworks according to the number of messages processed within a period of time. In the first experiment, we send one tweet of 100 KB (on average) per message, and in the second experiment we duplicate this message five times (500 KB on average).
Here, we measure the CPU, RAM, and disk I/O usage, as well as the bandwidth consumption, of the studied frameworks in the Stream Mode scenario.