@tariqmislam
Created March 28, 2012 19:58
Hadoop | HBase | Zookeeper | Sqoop - Installation
##########
# For verification, you can display the OS release.
##########
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.10
DISTRIB_CODENAME=oneiric
DISTRIB_DESCRIPTION="Ubuntu 11.10"
##########
# Download all of the packages you'll need. Hopefully,
# you have a fast download connection.
##########
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install curl
$ sudo apt-get install git
$ sudo apt-get install maven2
$ sudo apt-get install openssh-server openssh-client
$ sudo apt-get install openjdk-7-jdk
##########
# Switch to the new Java. On my system, it was
# the third option (marked '2' naturally)
##########
$ sudo update-alternatives --config java
##########
# Set the JAVA_HOME variable. I took the
# time to update my .bashrc script.
##########
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
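# A sketch for deriving JAVA_HOME instead of hard-coding it. It assumes
# the java binary chosen by update-alternatives resolves to either
# <jvm-dir>/jre/bin/java or <jvm-dir>/bin/java. The function name
# java_home_from_binary is made up for this example.

```shell
# Strip the trailing bin/java (and jre/, if present) from a java path.
java_home_from_binary() {
    local home="${1%/bin/java}"
    home="${home%/jre}"
    printf '%s\n' "$home"
}

# Only resolve the real binary when java is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    JAVA_HOME="$(java_home_from_binary "$(readlink -f "$(command -v java)")")"
    export JAVA_HOME
fi
```

# Check the result with: echo $JAVA_HOME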
##########
# Now we can download Cloudera's version of Hadoop. The
# first step is adding the repository. Note that oneiric
# is not explicitly supported as of 2011-Dec-20. So I am
# using the 'maverick' repository.
##########
# Create a repository list file. Add the two indented lines
# to the new file.
$ sudo vi /etc/apt/sources.list.d/cloudera.list
    deb http://archive.cloudera.com/debian maverick-cdh3 contrib
    deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
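# If you would rather not edit the file by hand, the same two lines can
# be written from a script. This is only a sketch; the helper name
# write_cloudera_list is made up for this example.

```shell
# Write the two Cloudera repository lines to the file given as $1.
write_cloudera_list() {
    cat > "$1" <<'EOF'
deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
EOF
}

# Example: generate the file in a temp location, then install it as root.
#   tmp="$(mktemp)" && write_cloudera_list "$tmp"
#   sudo install -m 644 "$tmp" /etc/apt/sources.list.d/cloudera.list
```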
# Add public key
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update
# Install all of the Hadoop components.
$ sudo apt-get install hadoop-0.20
$ sudo apt-get install hadoop-0.20-namenode
$ sudo apt-get install hadoop-0.20-datanode
$ sudo apt-get install hadoop-0.20-secondarynamenode
$ sudo apt-get install hadoop-0.20-jobtracker
$ sudo apt-get install hadoop-0.20-tasktracker
# Set some environment variables. I added these to my
# .bashrc file.
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
$ export HADOOP_HOME=/usr/lib/hadoop-0.20
$ cd $HADOOP_HOME/conf
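# To make the exports above survive a new shell, they can be appended to
# .bashrc. A minimal sketch that avoids duplicate lines when it is run
# more than once; append_once is a made-up helper name.

```shell
# Append a line to a file only if that exact line is not already there.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386' ~/.bashrc
#   append_once 'export HADOOP_HOME=/usr/lib/hadoop-0.20' ~/.bashrc
```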
# Create the hadoop temp directory. It should not
# be in the /tmp directory because that directory
# is cleared on each system restart, which happens
# a lot with virtual machines.
$ sudo mkdir /hadoop_tmp_dir
$ sudo chmod 777 /hadoop_tmp_dir
# Replace the existing file with the indented lines.
$ sudo vi core-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_tmp_dir</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
##########
# Notice that the dfs secondary http address is not
# the default in the XML below. I don't know what
# process was using the default, but I needed to
# change it to avoid the 'port already in use' message.
##########
# Replace the existing file with the indented lines.
$ sudo vi hdfs-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>dfs.secondary.http.address</name>
        <value>0.0.0.0:50090</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
      </property>
    </configuration>
# Replace the existing file with the indented lines.
$ sudo vi mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
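# Before formatting, it can help to read the values back out of the
# three files to catch typos. A rough sketch, assuming each <name> and
# <value> tag sits on its own line as in the files above. The helper
# get_hadoop_property is made up for this example and is not a Hadoop
# command.

```shell
# Print the <value> of the property named $2 in the site file $1.
get_hadoop_property() {
    awk -v name="$2" '
        /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }
        /<value>/ { v = $0; gsub(/.*<value>|<\/value>.*/, "", v)
                    if (n == name) print v }
    ' "$1"
}

# Example:
#   get_hadoop_property "$HADOOP_HOME/conf/core-site.xml" fs.default.name
#   get_hadoop_property "$HADOOP_HOME/conf/mapred-site.xml" mapred.job.tracker
```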
# format the hadoop filesystem
$ hadoop namenode -format
##########
# Time to set up password-less ssh to localhost
##########
$ cd ~
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# If you want to test that the ssh works, do this. Then exit.
$ ssh localhost
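# If ssh still asks for a password, loose permissions on ~/.ssh are a
# common cause, because sshd (with its StrictModes setting enabled) can
# refuse keys kept in files that other users can write. A small sketch;
# fix_ssh_permissions is a made-up helper name.

```shell
# Tighten permissions on an SSH directory (default: ~/.ssh).
fix_ssh_permissions() {
    local dir="${1:-$HOME/.ssh}"
    chmod 700 "$dir"
    chmod 600 "$dir/authorized_keys"
}

# Example: fix permissions, then confirm that login needs no password.
#   fix_ssh_permissions
#   ssh -o BatchMode=yes localhost true && echo "password-less ssh works"
```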
#######
#######
#######
#######
# REPEAT FOR EACH RESTART
#
# Since we are working inside a virtual machine, I found that
# some settings did not survive a shutdown or reboot. From this
# point on, repeat these commands for each instance startup.
#######
# hadoop was installed as root. Therefore we need to
# change the ownership so that your username can
# write. IF YOU ARE NOT USING 'ubuntu', CHANGE THE
# COMMAND ACCORDINGLY.
$ sudo chown -R ubuntu:ubuntu /usr/lib/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/run/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/log/hadoop-0.20
# Start hadoop. I remove the logs so that I can find errors
# faster when I iterate through configuration settings.
$ cd $HADOOP_HOME
$ rm -rf logs/*
$ bin/start-all.sh
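# One way to confirm that start-all.sh brought everything up is to look
# for the five daemons in the output of jps (shipped with the JDK). A
# sketch, assuming the daemons run as your own user so that jps can see
# them. missing_daemons is a made-up helper name.

```shell
# Read `jps` output on stdin and print each expected Hadoop daemon that is absent.
missing_daemons() {
    local listing daemon
    listing="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        printf '%s\n' "$listing" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || printf '%s\n' "$daemon"
    done
}

# Example: an empty result means all five daemons are running.
#   jps | missing_daemons
```

# Anything it prints is a daemon to look up in $HADOOP_HOME/logs.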
======================================
HBASE
======================================
For the sake of sanity:
$ sudo service hadoop-zookeeper-server stop
STOP ALL PROCESSES (Hadoop)
Then...
$ sudo apt-get install hadoop-hbase
$ echo "hdfs - nofile 32768" | sudo tee -a /etc/security/limits.conf
$ echo "hbase - nofile 32768" | sudo tee -a /etc/security/limits.conf
$ echo "session required pam_limits.so" | sudo tee -a /etc/pam.d/common-session
Double-check that $HADOOP_HOME/conf/hdfs-site.xml has the dfs.datanode.max.xcievers property set to 4096.
$ sudo apt-get install hadoop-hbase-master
- Edit $HBASE_HOME/conf/hbase-env.sh so that JAVA_HOME is set. For some reason, the HBase Master only looks at that file for JAVA_HOME. Setting JAVA_HOME in /etc/environment (or in .bashrc, for that matter) is not sufficient.
- Edit /etc/hosts so that the machine's hostname points to 127.0.0.1. Otherwise HMaster will be unable to connect properly. This may only be a problem on the VM and not when Ubuntu is the native OS.
$ sudo /etc/init.d/hadoop-hbase-master start
$ hbase shell
Test out the hbase shell and make sure you can create 'test', 'cf'
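The test can also be scripted by piping commands into the shell. This is a sketch of a short session that creates the table, writes and reads one cell, then cleans up. The row key, column and value are made-up sample data, and hbase_smoke_test_commands is a made-up helper name.

```shell
# Print a short HBase shell session: create, write, read, then clean up.
hbase_smoke_test_commands() {
    cat <<'EOF'
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'
EOF
}

# Example:
#   hbase_smoke_test_commands | hbase shell
```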
STOP ALL PROCESSES NOW... going into pseudo-distributed mode and tying in with Hadoop
Edit /etc/hbase/conf/hbase-site.xml and add:
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
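The hbase.rootdir value has to use the same host and port as fs.default.name in core-site.xml (hdfs://localhost:9000 in this setup). A small sketch of that check, taking both values as plain strings; rootdir_on_fs is a made-up helper name.

```shell
# Succeed only if the HBase root dir ($2) lives on the HDFS filesystem URI ($1).
rootdir_on_fs() {
    local fs="${1%/}" rootdir="$2"
    case "$rootdir" in
        "$fs"/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example:
#   rootdir_on_fs hdfs://localhost:9000 hdfs://localhost:9000/hbase && echo "ok"
```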
Now start up HDFS again and create the /hbase directory (with no security constraints [aka fs instead of dfs])
$ hadoop fs -mkdir /hbase
$ hadoop fs -chown hbase /hbase
=========================================
ZOOKEEPER
=========================================
$ sudo apt-get install hadoop-zookeeper-server
$ sudo /etc/init.d/hadoop-zookeeper-server start
Back to HBase...
Ensure HDFS / Zookeeper is running. Then...
$ sudo /etc/init.d/hadoop-hbase-master start
$ sudo apt-get install hadoop-hbase-regionserver
$ sudo /etc/init.d/hadoop-hbase-regionserver start
Navigate to http://localhost:60010 to ensure the region server is working with the master.
You should now be able to manage HBase tables in its shell, and see the results in HDFS.
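The master can take a little while to come up, so a scripted check of the web UI needs to retry. A generic sketch; retry is a made-up helper, and the attempt count and delay in the example are arbitrary.

```shell
# Run a command until it succeeds: up to $1 attempts, sleeping $2 seconds between tries.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while ! "$@"; do
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example: poll the HBase master web UI, giving up after 10 attempts.
#   retry 10 3 curl -sf -o /dev/null http://localhost:60010/
```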
================================================
SQOOP
================================================
Install Sqoop through Cloudera:
$ sudo apt-get install sqoop
For configuration and installation (JDBC Driver / SQL Server Connector):
After installing and configuring Sqoop, verify that the following environment variables are set on the machine where Sqoop is installed. They must be set for the SQL Server-Hadoop Connector to work correctly.
Environment Variable = Value to Assign
SQOOP_HOME = Absolute path to the Sqoop installation directory
SQOOP_CONF_DIR = $SQOOP_HOME/conf
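A quick way to check the variables before going further. This is only a sketch; missing_env is a made-up helper name.

```shell
# Print the name of every listed variable that is unset or empty.
missing_env() {
    local name value
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Example: no output means both variables are set.
#   missing_env SQOOP_HOME SQOOP_CONF_DIR
```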
Download and install the Microsoft JDBC Driver
Sqoop and the SQL Server-Hadoop Connector use JDBC to establish connections to remote RDBMS servers and therefore need the JDBC driver for SQL Server. To install this driver on a Linux node where Sqoop is already installed:
1. Visit http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21599 and download “sqljdbc_<version>_enu.tar.gz”.
2. Copy it to the machine where Sqoop is installed.
3. Unpack the tar file with the command: tar -zxvf sqljdbc_<version>_enu.tar.gz. This creates a directory “sqljdbc_3.0” in the current directory.
4. Copy the driver jar (sqljdbc_3.0/enu/sqljdbc4.jar) to the $SQOOP_HOME/lib directory on the machine where Sqoop is installed.
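The unpack-and-copy steps can be sketched as a small function. It searches the unpacked archive for sqljdbc4.jar rather than assuming the exact directory name, and install_jdbc_driver is a made-up helper name.

```shell
# Unpack the driver archive ($1) and copy sqljdbc4.jar into $2/lib.
install_jdbc_driver() {
    local tarball="$1" sqoop_home="$2" workdir jar rc
    workdir="$(mktemp -d)" || return 1
    tar -zxf "$tarball" -C "$workdir" || return 1
    jar="$(find "$workdir" -name sqljdbc4.jar | head -n 1)"
    [ -n "$jar" ] || { echo "sqljdbc4.jar not found in $tarball" >&2; return 1; }
    mkdir -p "$sqoop_home/lib" && cp "$jar" "$sqoop_home/lib/"
    rc=$?
    rm -rf "$workdir"
    return "$rc"
}

# Example (pass the archive you downloaded; writing to $SQOOP_HOME may need root):
#   install_jdbc_driver /path/to/downloaded-driver.tar.gz "$SQOOP_HOME"
```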
Download and Install SQL Server-Hadoop Connector
After all of the previous steps have completed, you are ready to download, install and configure the SQL Server-Hadoop Connector on the machine where Sqoop is installed. The SQL Server-Hadoop Connector is distributed as a compressed tar archive named sqoop-sqlserver-1.0.tar.gz. Download the tar archive from http://download.microsoft.com, and save it on the same machine where Sqoop is installed.
This archive is composed of the following files and directories:
File/Directory = Description
install.sh = Shell script that installs the SQL Server - Hadoop Connector files into the Sqoop directory structure
Microsoft SQL Server - Hadoop Connector User Guide.pdf = Contains instructions to deploy and execute SQL Server – Hadoop Connector.
lib/ = Contains the sqoop-sqlserver-1.0.jar file
conf/ = Contains the configuration files for SQL Server – Hadoop Connector.
THIRDPARTYNOTICES FOR HADOOP-BASED CONNECTORS.txt = Contains the third party notices.
SQL Server Connector for Apache Hadoop MSLT.pdf = EULA for the SQL Server Connector for Apache Hadoop.
To install SQL Server – Hadoop Connector:
1. Log in to the machine where Sqoop is installed as a user who has permission to install files.
2. Extract the archive with the command: “tar -zxvf sqoop-sqlserver-1.0.tar.gz”. This creates the “sqoop-sqlserver-1.0” directory in the current directory.
3. Change directory (cd) to “sqoop-sqlserver-1.0”.
4. Ensure that the MSSQL_CONNECTOR_HOME environment variable is set to the absolute path of the sqoop-sqlserver-1.0 directory.
5. Run the shell script install.sh with no additional arguments.
6. The installer copies the connector jar and configuration file into the existing Sqoop installation.
Example SQL Server Sqoop import statement:
$ bin/sqoop import --connect 'jdbc:sqlserver://<ip-address>;instanceName=<instance-name>;username=<user-name>;password=<password>;database=<database-name>' --query 'SELECT * FROM [Database].[prefix].[table-name] WHERE $CONDITIONS' --split-by <column-to-split-by> --target-dir <hdfs-target-directory>
For importing into HBase...
$ bin/sqoop import --connect 'jdbc:sqlserver://<ip-address>;instanceName=SQLExpress;username=<username>;password=<password>;database=<database>' --query 'SELECT * FROM [database].[prefix].[table] WHERE $CONDITIONS' --split-by <primary-key> --hbase-table <hbase-table> --column-family <column-family>
* Note that the table must be created with a column family in HBase before executing the above command.
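The connection string is easy to mistype, so here is a sketch that builds it from its parts in the same format as the commands above. sqlserver_jdbc_url is a made-up helper, and every value in the example (address, user, password, database, table, target directory) is a placeholder.

```shell
# Build the SQL Server JDBC URL passed to --connect.
# Arguments: host instance user password database
sqlserver_jdbc_url() {
    printf 'jdbc:sqlserver://%s;instanceName=%s;username=%s;password=%s;database=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Example:
#   url="$(sqlserver_jdbc_url 10.0.0.5 SQLExpress myuser mypassword mydb)"
#   bin/sqoop import --connect "$url" --table mytable --target-dir /user/ubuntu/mytable
```

Values that contain a semicolon would break this simple format. When using --query as in the commands above, keep the query in single quotes so that the shell passes $CONDITIONS through to Sqoop unchanged.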
To configure importing from Oracle:
- Download ojdbc6.jar and place in $SQOOP_HOME/lib
- Connection string format: sqoop --connect jdbc:oracle:thin:@//<address>:<port>/<instance-name>
(all other options such as --query apply)
- Another example: $ sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table DBFUNC1.R1_EVOLUTION --where 'rownum=1' --verbose -P
======================================================
JSON INTEGRATION
======================================================
JSON can be sent through the REST interface (Stargate). However, everything sent through the REST interface is encoded in base64, so the results of any GET must be decoded from base64, and the payload of any PUT must be encoded in base64 first.
See Gist: https://gist.github.com/2284007 for examples
* Remember - $ sudo hbase rest start
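For quick experiments, the encoding and decoding can be done with the base64 tool from coreutils. A minimal sketch; the helper names are made up, and 'row1' is just sample data.

```shell
# Encode and decode single values for use with the REST interface.
b64_encode() { printf '%s' "$1" | base64; }
b64_decode() { printf '%s' "$1" | base64 -d; }

# Example:
#   b64_encode row1        # prints cm93MQ==
#   b64_decode cm93MQ==    # prints row1
```

GNU base64 wraps long output at 76 columns. Add -w 0 to b64_encode if you need long values on a single line.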
======================================================
LILY
======================================================
- Install SOLR
- Install Lily cluster