Download VirtualBox 4.3.x from the following link https://www.virtualbox.org/wiki/Downloads
and create a VM instance with the configuration below:
Name: Hadoop
System type: Ubuntu
CPU: 2 cores
RAM: 4 GB
Disk: 15 GB
Use the NAT option to connect to the Internet from your virtual machine.
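If you prefer the command line, a rough equivalent using VBoxManage would be the following (the values mirror the table above; attaching the disk and the installation ISO still has to be done separately, e.g. with VBoxManage storagectl/storageattach):
$ VBoxManage createvm --name Hadoop --ostype Ubuntu --register
$ VBoxManage modifyvm Hadoop --cpus 2 --memory 4096 --nic1 nat
$ VBoxManage createhd --filename Hadoop.vdi --size 15360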
Download & Install Ubuntu in the VM instance
Download Ubuntu 14.04 LTS (Desktop version) from this link, mount the ISO in the VM's CD drive and boot the system.
During installation set the machine name, user name and password to the value hadoop. When the installation is completed, turn off the VM and unmount the ISO.
Install Guest Additions
Use one of the following two options to install the Guest Additions for VirtualBox:
Option A
The Guest Additions ISO can be found in the VirtualBox installation path, usually:
/usr/share/virtualbox/
Mount the VirtualBox Guest Additions ISO in the VM's CD drive before starting the VM. Turn the system on, open a terminal, execute the following command, and reboot:
sh /media/hadoop/VBOXADDITIONS_4.3.34_104062/autorun.sh
Force the system to read the updates in .bashrc with:
$ source $HOME/.bashrc
Edit the $HADOOP_HOME/etc/hadoop/slaves file and add the line:
hadoop
Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
Find the line "export JAVA_HOME=" and set it to the complete Java path (same as in step 4).
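For example, with the OpenJDK 7 package on Ubuntu 14.04 the line might read as follows (the exact path is an assumption and depends on your Java installation):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64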
In the file $HADOOP_HOME/etc/hadoop/core-site.xml:
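The property values to add here are the ones from Tutorial 1; a minimal sketch, assuming the master hostname hadoop and the default HDFS port 9000, would be:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>
  </property>
</configuration>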
Shut down the machine. The basic configuration is done. The machine named hadoop will be the master node.
We can clone this machine as many times as needed to have a Multi Node cluster.
Format HDFS
On your master node (the hadoop machine) execute:
$ cd $HADOOP_HOME
$ bin/hadoop namenode -format
Run Hadoop from master (hadoop)
Start Hadoop:
$ ./Downloads/hadoop-2.6.3/sbin/start-all.sh
Check in the UI if Hadoop is up:
Open a browser and enter "localhost:50070" in the address bar.
The UI should present both machines
Check also the Datanodes section: the disk capacity available on each machine should be shown.
Alternative check to see what is running (can be executed on master and slaves):
$ jps (shows all Spark- and Hadoop-related daemons)
Both platforms can be stopped by calling the stop-all.sh scripts in the same paths as the start-all.sh scripts.
Test Hadoop
Please use the file bigtext.txt provided to you at the beginning of the JABD'16 session. Put the file in your home directory. You can also use other local files on your machine.
Put the file bigtext.txt in HDFS using:
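The put command itself was elided here; a typical invocation, assuming Hadoop lives under ~/Downloads/hadoop-2.6.3 as in the start-all.sh step above, would be:
$ ./Downloads/hadoop-2.6.3/bin/hdfs dfs -put ~/bigtext.txt /bigtext.txt
$ ./Downloads/hadoop-2.6.3/bin/hdfs dfs -ls /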
Download Spark
Start a browser in your virtual machine and download Spark 1.6.0 from this link:
Release: 1.6.0
Package type: Pre-built for Hadoop 2.4 and later
Download the tar file (the FTP server is a fast option)
Extract:
$ cd Downloads
$ tar -xzf spark-1.6.0-bin-hadoop2.4.tgz
Reduce verbosity as in the single-node guide.
Configure Spark [only in case of Multi Node Cluster]
In case of a multi-node cluster, edit the slaves file for Spark:
$ cd Downloads/spark-1.6.0-bin-hadoop2.4/conf
$ cp slaves.template slaves
$ nano slaves
Add the names of your cluster machines to the file, one name per line (in our case it is a single machine):
machine1
machine2
...
Run Spark
- Download and configure Hadoop and HDFS (Tutorial 1)
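- Start Spark. The start command is missing here; assuming Spark was extracted under ~/Downloads as above, the standard standalone launch script would be:
$ ./Downloads/spark-1.6.0-bin-hadoop2.4/sbin/start-all.sh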
- Check in the UI if Spark is up:
Open a browser and enter "localhost:8080" in the address bar.
The UI should present both machines.
- Start Hadoop:
$ ./Downloads/hadoop-2.6.3/sbin/start-all.sh
- Check in the UI if Hadoop is up:
Open a browser and enter "localhost:50070" in the address bar.
The UI should present both machines.
Check also the Datanodes section: the disk capacity available on each machine should be shown.
- Alternative check to see what is running (can be executed on master and slaves):
$ jps (shows all Spark- and Hadoop-related daemons)
- Both platforms can be stopped by calling the stop-all.sh scripts in the same paths as the start-all.sh scripts.
Test Spark
Put some files in the HDFS (see Tutorial 1)
- Test Spark wordcount on the "bigtext.txt" file:
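The exact commands were elided here; a minimal sketch in the Spark 1.6 shell, assuming bigtext.txt was put at the HDFS root as above and the HDFS master is hdfs://hadoop:9000, is:
$ ./Downloads/spark-1.6.0-bin-hadoop2.4/bin/spark-shell
scala> val counts = sc.textFile("hdfs://hadoop:9000/bigtext.txt").flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
scala> counts.take(10).foreach(println)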
Configure Elasticsearch
Two configuration files can be found under ES-HOME/conf: elasticsearch.yml and logging.yml. Open elasticsearch.yml and edit some variables to customize the Elasticsearch server:
cluster.name: mycluster
node.name: node1
Here, we specify the node name since we can install Elasticsearch on several nodes.
We can also set the property dynamically using:
node.name: ${HOSTNAME}
To enable Elasticsearch to support cluster mode we need to set the list of the nodes of our cluster, using either node names or IP addresses:
discovery.zen.ping.unicast.hosts: ["host1", "host2"]
or:
discovery.zen.ping.unicast.hosts: ["IP1", "IP2", "IP3"]
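Putting these together, a minimal elasticsearch.yml for a small cluster might look like this (the cluster name and host names are the placeholders used above):
cluster.name: mycluster
node.name: ${HOSTNAME}
discovery.zen.ping.unicast.hosts: ["host1", "host2"]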
Start the Elasticsearch cluster
To start our Elasticsearch cluster we use the following commands on each node:
cd ES-HOME
./bin/elasticsearch
To check our Elasticsearch cluster we can send an HTTP request to port 9200.
Elasticsearch provides a RESTful web service to answer user CRUD requests (Create, Read, Update and Delete). For this, we use the curl command-line tool.
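For example, a quick health check (the _cluster/health endpoint is part of the standard Elasticsearch REST API):
curl -X GET 'http://localhost:9200/_cluster/health?pretty'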
We can add our first entry with the command:
curl -X POST 'http://localhost:9200/database/table/id' -d '{"first name":"mohamed","last name":"tounsi" }'
with:
database: index (database name in an RDBMS)
table: type (table in an RDBMS)
id: ID of the document
Output:
Search function
GET /index/type/_search
{
"query": {
"match_all": {}
},
"_source":["filed","filed"]
}
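The same request body can also be sent with curl, e.g. against the index and type created above:
curl -X GET 'http://localhost:9200/database/table/_search?pretty' -d '{"query": {"match_all": {}}}'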
Count function
GET /index/type/_count
{
"query": {
"match_all": {}
}
}
(note that the _count endpoint takes only the query; the _source field list applies to _search)
Query and filter in Elasticsearch
In Elasticsearch we have two types of queries:
Query: applied to text fields; each result carries a relevance score, because Elasticsearch ranks results depending on their context.
Filter: applied to exact values such as numeric fields; results are matched without scoring.
Example of a query:
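A minimal sketch of a full-text query, reusing the document indexed earlier (the field name "first name" comes from the POST example above):
GET /database/table/_search
{
  "query": {
    "match": { "first name": "mohamed" }
  }
}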