Installing Hadoop, Spark and Elasticsearch

on a Single-node/Multi-node cluster with Ubuntu 14.04


Sabeur Aridhi sabeur.aridhi@telecomnancy.eu
In this document we present tutorials for:
  1. Hadoop
  2. Spark
  3. ElasticSearch

Tutorial 1: Hadoop and HDFS

  1. Setup a Hadoop cluster
    Download VirtualBox 4.3.x from the following link https://www.virtualbox.org/wiki/Downloads and create a VM instance with the configuration below:
    Name: Hadoop
    System type: Ubuntu
    CPU: 2 cores
    RAM: 4 GB
    Disk: 15 GB
    Use NAT option to connect to the Internet from your virtual machine
  2. Download & Install Ubuntu in the VM instance
    Download Ubuntu 14.04 LTS (Desktop version) from this link, mount the ISO on the VM's CD drive and boot the system. During installation, set the machine name, user name and password to hadoop. When the installation is complete, turn off the VM and unmount the ISO.
  3. Install Guest Additions
    Use one of the following two options to install the Guest Additions for VirtualBox:
    Option A
    Guest Additions iso can be found in VirtualBox installation path, usually:

    /usr/share/virtualbox/
    Mount the VirtualBox Guest Additions ISO in the VM's CD drive before starting the VM, turn the system on, open a terminal, execute the following command and reboot:

    sh /media/hadoop/VBOXADDITIONS_4.3.34_104062/autorun.sh 

    Option B
    1. Start the virtual machine.
    2. In the VirtualBox application, open the "Devices" menu.
    3. Select "Insert Guest Additions CD image…".
    4. Follow the installation dialogues.
    5. Reboot.
  4. Install important packages and JAVA JDK
    Update the package repository:
    $ sudo apt-get update
    Install a few things:
    $ sudo apt-get install build-essential uuid-dev autoconf rsync
    $ sudo apt-get install aptitude 
    Search for java and install it:
    $ sudo aptitude search openjdk-7
    $ sudo apt-get install openjdk-7-jdk 
    The version installed should be 1.7.0, check with:
    $ javac -version 
    We get the path of our installed java with:
    $ update-java-alternatives -l 
    Edit $HOME/.bashrc and add the following at the end:
    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64 
    (or whatever path we got from update-java-alternatives)
    export PATH=$PATH:$JAVA_HOME/bin 
    Force the system to read the updates in .bashrc with:
    $ source .bashrc 
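    A quick sanity check that the new variables are picked up (assuming the JAVA_HOME path set above):
    $ echo $JAVA_HOME
    $ $JAVA_HOME/bin/javac -version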
  5. Python version and packages
    Start the Python shell and check the version (should be 2.7.x):
    $ python
    - You can quit Python with the command "exit()"
    - Install numpy and scipy:
    $ sudo apt-get install python-numpy
    $ sudo apt-get install python-scipy 
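    A quick check that both packages import correctly (assuming the installs above finished without errors):
    $ python -c "import numpy, scipy; print numpy.__version__, scipy.__version__"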
  6. SSH Setup
    Install the SSH server:
    $ sudo apt-get install openssh-server 
    Create RSA key:
    $ ssh-keygen -t rsa -P "" (just press enter when asked for filename)
    $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    $ chmod 0600 /home/hadoop/.ssh/authorized_keys 
    Check connectivity to local machine:
    $ ssh localhost
    $ logout 
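    For a multi-node cluster, the master must also reach every slave without a password. A minimal sketch, assuming a slave reachable under the hypothetical hostname hadoop-slave1 with the same hadoop user:
    $ ssh-copy-id hadoop@hadoop-slave1
    $ ssh hadoop@hadoop-slave1
    $ logout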
  7. Download Hadoop
    Download Hadoop 2.6.3:
    $ wget ftp://ftp.funet.fi/pub/mirrors/apache.org/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz
    Extract it
    $ cd Downloads
    $ tar -xzf hadoop-2.6.3.tar.gz
  8. Configure Hadoop and HDFS
    Edit $HOME/.bashrc:
    $ nano $HOME/.bashrc 
    and add the following lines at the end of the file:
    export HADOOP_HOME=$HOME/Downloads/hadoop-2.6.3/
    export PATH=$PATH:$HADOOP_HOME/bin
    Force the system to read the updates in .bashrc with:
    $ source $HOME/.bashrc
    Edit the $HADOOP_HOME/etc/hadoop/slaves file and add the line:
    hadoop 
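    For a multi-node cluster, the slaves file would instead list one worker per line, for example (hypothetical names of clones of the hadoop machine):
    hadoop
    hadoop-slave1
    hadoop-slave2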
    Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh: Find line "export JAVA_HOME=" and add the complete java path (same as in step 4)
    In file etc/hadoop/core-site.xml (in this and the following files, the <property> elements go inside the existing <configuration> element):
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    In file etc/hadoop/mapred-site.xml (if it does not exist, copy it from mapred-site.xml.template):
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    In file etc/hadoop/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    In file etc/hadoop/yarn-site.xml:
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    Shut down the machine.
    The basic configuration is done. The machine named hadoop will be the master node. We can clone this machine as many times as needed to build a multi-node cluster.
  9. Format HDFS
    In your master node (the hadoop machine) execute:
    $ cd $HADOOP_HOME
    $ bin/hdfs namenode -format
  10. Run Hadoop from master (hadoop)
    Start Hadoop:
    $ ./Downloads/hadoop-2.6.3/sbin/start-all.sh
    Check in the UI that Hadoop is up:
    Open a browser and enter "localhost:50070" in the address bar.
    The UI should list the cluster machines.
    Check also the Datanodes section: the disk capacity available on each machine should be shown.
    Alternative check to see what is running (can be executed on master and slaves):
     $ jps (shows all Hadoop-related daemons)
    Hadoop can be stopped by calling the stop-all.sh script in the same path as start-all.sh.
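    As a command-line alternative to the web UI, the HDFS report also lists the live datanodes and their capacity (assuming HADOOP_HOME is set as above):
    $ $HADOOP_HOME/bin/hdfs dfsadmin -report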
  11. Test Hadoop
    Please use the file bigtext.txt provided to you at the beginning of the JABD'16 session. Put the file in your home directory. You can also use other local files on your machine.
    Put the file bigtext.txt in HDFS using:
    $ ./Downloads/hadoop-2.6.3/bin/hadoop fs -mkdir /test1
    $ ./Downloads/hadoop-2.6.3/bin/hadoop fs -put ~/bigtext.txt /test1
    $ ./Downloads/hadoop-2.6.3/bin/hadoop fs -ls /test1
    We can also check in the Hadoop UI that the file is uploaded
    - Test a hadoop example:
    $ cd $HADOOP_HOME
    $ ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.3.jar wordcount /test1/bigtext.txt /test1/output
    
    The test should complete without errors.
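    To inspect the result, list the output directory and print the beginning of the result file (part-r-00000 is the usual MapReduce output file name; it may differ):
    $ ./bin/hadoop fs -ls /test1/output
    $ ./bin/hadoop fs -cat /test1/output/part-r-00000 | head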
    Now we can stop Hadoop:
    $ ./Downloads/hadoop-2.6.3/sbin/stop-all.sh

Tutorial 2: Installing Spark

  1. Download Spark
    Start a browser in your virtual machine and download Spark 1.6.0 from this link (Release: 1.6.0, Package type: pre-built for Hadoop 2.4 and later). Download the tar file (the FTP server is a fast option) and extract it:
    $ cd Downloads
    $ tar -xzf spark-1.6.0-bin-hadoop2.4.tgz
    Reduce logging verbosity as in the single-node guide (one possible way is sketched below).
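    One possible way to do this (a sketch, assuming the default Spark conf layout) is to copy the log4j template and lower the console log level from INFO to WARN:
    $ cd ~/Downloads/spark-1.6.0-bin-hadoop2.4/conf
    $ cp log4j.properties.template log4j.properties
    $ nano log4j.properties (change "log4j.rootCategory=INFO, console" to "log4j.rootCategory=WARN, console")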
  2. Configure Spark [only in case of a Multi-Node Cluster]
    In the case of a multi-node cluster, edit the slaves file for Spark:
    $ cd Downloads/spark-1.6.0-bin-hadoop2.4/conf
    $ cp slaves.template slaves
    $ nano slaves
    Add the names of your cluster machines to the file, one per line (in our case it is a single machine):
    machine 1
    machine 2
    ...
  3. Download and configure Hadoop and HDFS (Tutorial 1)
  4. Run Spark from master (spark)
    Start Spark:
    $ ./Downloads/spark-1.6.0-bin-hadoop2.4/sbin/start-all.sh
    Check in the UI that Spark is up: open a browser and enter "localhost:8080" in the address bar. The UI should list the cluster machines.
    - Start Hadoop:
    $ ./Downloads/hadoop-2.6.3/sbin/start-all.sh
    Check in the UI that Hadoop is up: open a browser and enter "localhost:50070" in the address bar. The UI should list the cluster machines. Check also the Datanodes section: the disk capacity available on each machine should be shown.
    - Alternative check to see what is running (can be executed on master and slaves):
    $ jps (shows all Spark- and Hadoop-related daemons)
    - Both platforms can be stopped by calling the stop-all.sh scripts in the same paths as the start-all.sh scripts.
  5. Test Spark
    Put some files in HDFS (see Tutorial 1).
    - Test Spark wordcount on the "bigtext.txt" file:
    $ cd ~/Downloads/spark-1.6.0-bin-hadoop2.4
    $ ./bin/spark-submit --name "test1" --master spark://hadoop:7077 examples/src/main/python/wordcount.py hdfs://hadoop:9000/test1/bigtext.txt 
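    As an additional quick check that jobs actually run on the cluster, the bundled Python Pi example can be submitted in the same way (same assumptions about paths and master URL as above):
    $ ./bin/spark-submit --name "pi" --master spark://hadoop:7077 examples/src/main/python/pi.py 10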
    Now we can stop Spark and Hadoop:
    $ ./Downloads/spark-1.6.0-bin-hadoop2.4/sbin/stop-all.sh
    $ ./Downloads/hadoop-2.6.3/sbin/stop-all.sh

Tutorial 3: Installing Elasticsearch

  1. Download Elasticsearch
    curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.tar.gz
    Extract it:
    tar -xvf elasticsearch-5.1.1.tar.gz
  2. Elasticsearch settings
    Two configuration files can be found under ES-HOME/config: elasticsearch.yml and logging.yml. Open elasticsearch.yml and edit some variables to customize the Elasticsearch server:
    cluster.name: mycluster
    node.name: node1
    Here, we specify the node name since we can install Elasticsearch on several nodes. We can set the property dynamically using:
    node.name: ${HOSTNAME}
    To enable Elasticsearch to support cluster mode, we need to set the list of nodes of our cluster, using either host names:
    discovery.zen.ping.unicast.hosts: ["host1", "host2"]
    or IP addresses:
    discovery.zen.ping.unicast.hosts: ["IP1", "IP2", "IP3"]
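    Putting these settings together, a minimal config/elasticsearch.yml for a small cluster could look like this (host1 and host2 are hypothetical machine names; network.host is only needed in a multi-node setup so that other nodes can reach this node):
    cluster.name: mycluster
    node.name: ${HOSTNAME}
    network.host: 0.0.0.0
    discovery.zen.ping.unicast.hosts: ["host1", "host2"]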
    Start the Elasticsearch cluster. To start Elasticsearch we use the following commands:
    cd ES-HOME
    ./bin/elasticsearch
    To check our Elasticsearch cluster we can send an HTTP request to port 9200:
    curl http://localhost:9200
    You should get a result like this:
    {
    	"name" : "node-1",
    	"cluster_name" : "my-cluster",
    	"version" : {
    		"number" : "2.3.3",
    		"build_hash" : "218bdf10790eef486ff2c41a3df5cfa32dadcfde",
    		"build_timestamp" : "2016-05-17T15:40:04Z",
    		"build_snapshot" : false,
    		"lucene_version" : "5.5.0"
    	},
    	"tagline" : "You Know, for Search"
    }
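    We can also check the health of the cluster (number of nodes, shard status) with the _cluster/health endpoint:
    curl 'http://localhost:9200/_cluster/health?pretty'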
          
  3. Use Elasticsearch
    Elasticsearch provides a RESTful web service to answer user CRUD requests (Create, Read, Update and Delete). For this, we use the curl command-line tool. We can add our first entry with the command:
    curl -X POST 'http://localhost:9200/database/table/id' -d '{"first name":"mohamed","last name":"tounsi" }'
    with:
    database: the index (the database name in RDBMS terms)
    table: the type (the table in RDBMS terms)
    id: the ID of the document
    Output:
    {"_index":"database","_type":"table","_id":"id","_version":1,"created":true}
    List all indices in Elasticsearch:
    curl 'localhost:9200/_cat/indices?v'
    Get the mapping of an index in Elasticsearch:
    curl -XGET 'http://localhost:9200/index/_mapping'

    Search in elasticsearch:

    To facilitate querying Elasticsearch we can use Sense, a Google Chrome extension; the requests below use its console syntax. Get the list of all documents in an index:
    GET /index/type/_search
    {
    	"query": {
    	"match_all": {}
    	}
    }
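    Without the Sense extension, the same search can be sent with curl by passing the JSON body to the _search endpoint:
    curl -XGET 'http://localhost:9200/index/type/_search?pretty' -d '{"query": {"match_all": {}}}'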
          
    Get the first two documents in an Elasticsearch index:
    GET /index/type/_search
    {
    	"query": {
    		"match_all": {}
    	},
    	"size": 2
    }
          
    Sort the results returned by a query
    GET /index/type/_search
    {
    	"query": {
    		"match_all": {}
    	},
    	"sort": {
    		"FIELD": {
    			"order": "desc"
    		}
    	}
    }
          
    Get a subset of fields:
    GET /index/type/_search
    {
    	"query": {
    		"match_all": {}
    	},
    "_source":["filed","filed"]
    }
          
    Count function:
    GET /index/type/_count
    {
    	"query": {
    		"match_all": {}
    	}
    }
    Query and filter in Elasticsearch
    In Elasticsearch, we have two types of queries:
    Query: applied mainly to text fields; each result is given a relevance score, because Elasticsearch ranks results by how well they match.
    Filter: applied to exact values such as numeric fields or ranges; a filter answers a yes/no question and does not compute a score.
    Example of Query :
    GET /index/type/_search
    {
    	"query": {
    		"bool": {
    			"must": [
    			{"match": {"field":"text"}},{"match": {"field":"text"}}
    			]
    		}
    	}
    }
    The same kind of query with "should" instead of "must" requires only one of the clauses to match:
    GET /bank/account/_search
    {
    	"query": {
    		"bool": {
    			"should": [
    				{ "match": { "field": "text" } },
    				{ "match": { "field": "text" } }
    			]
    		}
    	}
    }
    
    Example of Filter :
    GET /index/type/_search
    {
    	"query": {
    		"filtered": {
    			"filter": {
    				"range": {
    					"numeric field": {
    						"from": val1,
    						"to": val2
    					}
    				}
    			}
    		}
    	}
    }
    GET /index/type/_search
    {
    	"query": {
    		"filtered": {
    			"filter": {
    				"range": {
    					"numeric field": {
    						"gte": val 1,
    						"lss": val 2
    					}
    				}
    			}
    		}
    	}
    }
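    Note that the "filtered" query shown above works in Elasticsearch 2.x but was removed in 5.x; an equivalent form that works in both versions wraps the filter in a bool query:
    GET /index/type/_search
    {
    	"query": {
    		"bool": {
    			"filter": {
    				"range": {
    					"numeric field": {
    						"gte": val1,
    						"lte": val2
    					}
    				}
    			}
    		}
    	}
    }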