##########
# For verification, you can display the OS release.
##########
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.10
DISTRIB_CODENAME=oneiric
DISTRIB_DESCRIPTION="Ubuntu 11.10"
##########
# Download all of the packages you'll need. Hopefully,
# you have a fast download connection.
##########
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install curl
$ sudo apt-get install git
$ sudo apt-get install maven2
$ sudo apt-get install openssh-server openssh-client
$ sudo apt-get install openjdk-7-jdk
##########
# Switch to the new Java. On my system, it was
# the third option (marked '2' naturally)
##########
$ sudo update-alternatives --config java
##########
# Set the JAVA_HOME variable. I took the
# time to update my .bashrc script.
##########
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
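# One way to make the export survive new shells is to append it to
# .bashrc, but only once. The sketch below is a suggestion, not what
# I originally ran; append_once is a hypothetical helper name.

```shell
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper, not part of any package.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches the whole line, -F treats the text literally
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

# Usage: append_once 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386' ~/.bashrc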
##########
# Now we can download Cloudera's version of Hadoop. The
# first step is adding the repository. Note that oneiric
# is not explicitly supported as of 2011-Dec-20. So I am
# using the 'maverick' repository.
##########
# Create a repository list file. Add the two indented lines
# to the new file.
$ sudo vi /etc/apt/sources.list.d/cloudera.list
    deb http://archive.cloudera.com/debian maverick-cdh3 contrib
    deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib
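# If you would rather not open an editor, the same two lines can be
# generated and piped into the file. This is a sketch;
# cloudera_list_lines is a hypothetical helper name, and the codename
# argument is whatever repository you chose above.

```shell
# Print the two apt source lines for a given Ubuntu codename.
# cloudera_list_lines is a hypothetical helper, not a Cloudera tool.
cloudera_list_lines() {
    codename=$1
    printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
    printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
}
```

# Usage: cloudera_list_lines maverick | sudo tee /etc/apt/sources.list.d/cloudera.list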
# Add public key
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update
# Install all of the Hadoop components.
$ sudo apt-get install hadoop-0.20
$ sudo apt-get install hadoop-0.20-namenode
$ sudo apt-get install hadoop-0.20-datanode
$ sudo apt-get install hadoop-0.20-secondarynamenode
$ sudo apt-get install hadoop-0.20-jobtracker
$ sudo apt-get install hadoop-0.20-tasktracker
# Set some environment variables. I added these to my
# .bashrc file.
$ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
$ export HADOOP_HOME=/usr/lib/hadoop-0.20
$ cd $HADOOP_HOME/conf
# Create the hadoop temp directory. It should not
# be in the /tmp directory because that directory
# is cleared on each system restart, which happens
# a lot with virtual machines.
$ sudo mkdir /hadoop_tmp_dir
$ sudo chmod 777 /hadoop_tmp_dir
# Replace the existing file with the indented lines.
$ sudo vi core-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_tmp_dir</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
##########
# Notice that the dfs secondary http address is not
# the default in the XML below. I don't know what
# process was using the default, but I needed to
# change it to avoid the 'port already in use' message.
##########
# Replace the existing file with the indented lines.
$ sudo vi hdfs-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>dfs.secondary.http.address</name>
        <value>0.0.0.0:50090</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
      </property>
    </configuration>
# Replace the existing file with the indented lines.
$ sudo vi mapred-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
    </configuration>
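# After hand-editing three XML files it is easy to mistype a value.
# The sketch below prints the value stored for one property name so
# you can eyeball it. site_property is a hypothetical helper, and it
# assumes each <name> and <value> sits on its own line, as in the
# files above. It is not a real XML parser.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# Assumes <name> and <value> are each on their own line.
site_property() {
    name=$1
    file=$2
    awk -v want="<name>$name</name>" '
        index($0, want) { found = 1; next }   # literal match on the name line
        found && /<value>/ {
            sub(/.*<value>/, "")              # strip everything up to the value
            sub(/<\/value>.*/, "")            # strip the closing tag
            print
            exit
        }
    ' "$file"
}
```

# Usage: site_property fs.default.name $HADOOP_HOME/conf/core-site.xml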
# Format the hadoop filesystem
$ hadoop namenode -format
##########
# Time to set up password-less ssh to localhost
##########
$ cd ~
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# If you want to test that the ssh works, do this. Then exit.
$ ssh localhost
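# If 'ssh localhost' still asks for a password, one common cause is
# that sshd (with its default StrictModes setting) ignores key files
# whose permissions are too open. I did not need this step myself, so
# treat the sketch below as an assumption; tighten_ssh_perms is a
# hypothetical helper name.

```shell
# Restrict an ssh directory and its authorized_keys file to the owner.
# Defaults to ~/.ssh when no directory is given.
tighten_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    chmod 700 "$dir"
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
}
```

# Usage: tighten_ssh_perms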
#######
#######
#######
#######
# REPEAT FOR EACH RESTART
#
# Since we are working inside a virtual machine, I found that
# some settings did not survive a shutdown or reboot. From this
# point on, repeat these commands for each instance startup.
#######
# Hadoop was installed as root. Therefore we need to
# change the ownership so that your username can
# write. IF YOU ARE NOT USING 'ubuntu', CHANGE THE
# COMMAND ACCORDINGLY.
$ sudo chown -R ubuntu:ubuntu /usr/lib/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/run/hadoop-0.20
$ sudo chown -R ubuntu:ubuntu /var/log/hadoop-0.20
# Start hadoop. I remove the logs so that I can find errors
# faster when I iterate through configuration settings.
$ cd $HADOOP_HOME
$ rm -rf logs/*
$ bin/start-all.sh
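# A quick way to see whether everything came up is the JDK's 'jps'
# tool, which lists running Java processes as "<pid> <class name>".
# The sketch below reads that listing and prints the daemons that are
# absent. missing_daemons is a hypothetical helper, and the five names
# are the ones I would expect from the packages installed above.

```shell
# Read `jps` output on stdin and print each expected Hadoop daemon
# that does not appear in it. Prints nothing when all five are running.
missing_daemons() {
    listing=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Exact match on the second column, so NameNode does not
        # accidentally match SecondaryNameNode.
        printf '%s\n' "$listing" \
            | awk -v want="$d" '$2 == want { ok = 1 } END { exit !ok }' \
            || printf '%s\n' "$d"
    done
}
```

# Usage: jps | missing_daemons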
======================================
HBASE
======================================
For the sake of sanity:
$ sudo service hadoop-zookeeper-server stop
STOP ALL PROCESSES (Hadoop)
Then...
$ sudo apt-get install hadoop-hbase
$ echo "hdfs - nofile 32768" | sudo tee -a /etc/security/limits.conf
$ echo "hbase - nofile 32768" | sudo tee -a /etc/security/limits.conf
$ echo "session required pam_limits.so" | sudo tee -a /etc/pam.d/common-session
Double check that $HADOOP_HOME/conf/hdfs-site.xml has the dfs.datanode.max.xcievers property set to 4096
$ sudo apt-get install hadoop-hbase-master
- Ensure that you edit $HBASE_HOME/conf/hbase-env.sh such that JAVA_HOME is set. For some reason, when installing the HBase Master, it only looks at that file for JAVA_HOME. Placing JAVA_HOME in /etc/environment is not sufficient (or .bashrc for that matter).
- Edit /etc/hosts such that the ubuntu user points to 127.0.0.1. Otherwise HMaster will be unable to connect properly. This may only be a problem on the VM and not when Ubuntu is the native OS.
$ sudo /etc/init.d/hadoop-hbase-master start
$ hbase shell
Test out the hbase shell and make sure you can create 'test', 'cf'
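For a slightly fuller check than the create alone, the sketch below prints a short series of HBase shell commands (create, put, scan, then clean up). hbase_smoke_test is a hypothetical helper name, the row key and value are made up for illustration, and piping commands into 'hbase shell' on stdin is an assumption about this HBase version; typing them interactively works the same way.

```shell
# Print HBase shell commands that create a table, write one cell,
# read it back, and drop the table again.
# Defaults match the 'test' table and 'cf' column family used above.
hbase_smoke_test() {
    table=${1:-test}
    family=${2:-cf}
    cat <<EOF
create '$table', '$family'
put '$table', 'row1', '$family:a', 'value1'
scan '$table'
disable '$table'
drop '$table'
exit
EOF
}
```

Usage: hbase_smoke_test | hbase shell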
STOP ALL PROCESSES NOW... going into pseudo-distributed mode and tying in with Hadoop
Edit /etc/hbase/conf/hbase-site.xml and add:
    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>
    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:9000/hbase</value>
    </property>
Now start up HDFS again and create the /hbase directory (with no security constraints [aka fs instead of dfs])
$ hadoop fs -mkdir /hbase
$ hadoop fs -chown hbase /hbase
=========================================
ZOOKEEPER
=========================================
$ sudo apt-get install hadoop-zookeeper-server
$ sudo /etc/init.d/hadoop-zookeeper-server start
Back to HBase...
Ensure HDFS and Zookeeper are running. Then...
$ sudo /etc/init.d/hadoop-hbase-master start
$ sudo apt-get install hadoop-hbase-regionserver
$ sudo /etc/init.d/hadoop-hbase-regionserver start
Navigate to http://localhost:60010 to ensure the region server is working with the master.
You should now be able to manage HBase tables in its shell, and see the results in HDFS.
================================================
SQOOP
================================================
Install Sqoop through Cloudera:
$ sudo apt-get install sqoop
For configuration and installation (JDBC Driver / SQL Server Connector):
After installing and configuring Sqoop, verify that the environment variables in the following table are set on the machine where Sqoop is installed. They must be set for the SQL Server-Hadoop Connector to work correctly.
Environment Variable = Value to Assign
SQOOP_HOME = Absolute path to the Sqoop installation directory
SQOOP_CONF_DIR = $SQOOP_HOME/conf
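A small check like the one below can confirm both variables before you go further. It is only a sketch; require_env is a hypothetical helper name, not something shipped with Sqoop or the connector.

```shell
# Return non-zero (and name the culprits on stderr) if any of the
# given environment variable names is unset or empty.
require_env() {
    status=0
    for var in "$@"; do
        # Indirect lookup of the variable whose name is in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            printf 'missing: %s\n' "$var" >&2
            status=1
        fi
    done
    return $status
}
```

Usage: require_env SQOOP_HOME SQOOP_CONF_DIR && echo "Sqoop environment looks set"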
Step 3: Download and install the Microsoft JDBC Driver
Sqoop and the SQL Server-Hadoop Connector use JDBC to establish connections to remote RDBMS servers and therefore need the JDBC driver for SQL Server. To install this driver on the Linux node where Sqoop is already installed:
Visit http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=21599 and download "sqljdbc_<version>_enu.tar.gz"
Copy it to the machine where Sqoop is installed
Unpack the tar file using the following command: tar -zxvf sqljdbc_<version>_enu.tar.gz. This will create a directory "sqljdbc_3.0" in the current directory
Copy the driver jar (sqljdbc_3.0/enu/sqljdbc4.jar) to the $SQOOP_HOME/lib directory on the machine where Sqoop is installed.
Download and Install SQL Server-Hadoop Connector
After all of the previous steps have completed, you are ready to download, install and configure the SQL Server-Hadoop Connector on the machine where Sqoop is installed. The SQL Server-Hadoop Connector is distributed as a compressed tar archive named sqoop-sqlserver-1.0.tar.gz. Download the tar archive from http://download.microsoft.com, and save the archive on the same machine where Sqoop is installed.
This archive is composed of the following files and directories:
File/Directory = Description
install.sh = Shell script that installs the SQL Server-Hadoop Connector files into the Sqoop directory structure
Microsoft SQL Server - Hadoop Connector User Guide.pdf = Contains instructions to deploy and execute the SQL Server-Hadoop Connector.
lib/ = Contains the sqoop-sqlserver-1.0.jar file
conf/ = Contains the configuration files for the SQL Server-Hadoop Connector.
THIRDPARTYNOTICES FOR HADOOP-BASED CONNECTORS.txt = Contains the third party notices.
SQL Server Connector for Apache Hadoop MSLT.pdf = EULA for the SQL Server Connector for Apache Hadoop.
To install the SQL Server-Hadoop Connector:
1. Log in to the machine where Sqoop is installed as a user who has permission to install files
2. Extract the archive with the command: "tar -zxvf sqoop-sqlserver-1.0.tar.gz". This will create a "sqoop-sqlserver-1.0" directory in the current directory
3. Change directory (cd) to "sqoop-sqlserver-1.0"
4. Ensure that the MSSQL_CONNECTOR_HOME environment variable is set to the absolute path of the sqoop-sqlserver-1.0 directory.
5. Run the shell script install.sh with no additional arguments.
6. The installer will copy the connector jar and configuration file into the existing Sqoop installation
Example SQL Server Sqoop import statement:
$ bin/sqoop import --connect 'jdbc:sqlserver://<ip-address>;instanceName=<instance-name>;username=<user-name>;password=<password>;database=<database-name>' --query 'SELECT * FROM [Database].[prefix].[table-name] WHERE $CONDITIONS' --split-by <column-to-split-by> --target-dir <hdfs-target-directory>
For importing into HBase...
$ bin/sqoop import --connect 'jdbc:sqlserver://<ip-address>;instanceName=SQLExpress;username=<username>;password=<password>;database=<database>' --query 'SELECT * FROM [database].[prefix].[table] WHERE $CONDITIONS' --split-by <primary-key> --hbase-table <hbase-table> --column-family <column-family>
* Note that the table must be created with a column family in HBase before executing the above command.
For configuration with importing from Oracle:
- Download ojdbc6.jar and place it in $SQOOP_HOME/lib
- Connection string format: sqoop --connect jdbc:oracle:thin:@//<address>:<port>/<instance-name>
(all other options such as --query apply)
- Another example: $ sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table DBFUNC1.R1_EVOLUTION --where 'rownum=1' --verbose -P
======================================================
JSON INTEGRATION
======================================================
JSON can be sent through the REST interface (Stargate). However, everything sent through the REST interface is encoded in base64, so encode values before any PUT and decode them after any GET.
See Gist: https://gist.github.com/2284007 for examples
* Remember to start the REST server first - $ sudo hbase rest start
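For the encoding itself, the standard 'base64' command is enough. The two helpers below are a sketch with hypothetical names (b64enc, b64dec); they only convert strings and do not talk to Stargate, so building and sending the JSON payload is still up to you.

```shell
# Base64-encode a string without a trailing newline in the input,
# and join any wrapped output lines into one.
b64enc() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a base64 string back to plain text.
b64dec() {
    printf '%s' "$1" | base64 --decode
}
```

Usage: b64enc 'row1' gives the text to place in a JSON body, and b64dec turns a value from a GET response back into plain text.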
======================================================
LILY
======================================================
- Install SOLR
- Install Lily cluster