The department's HPC platform offers users 25TB of storage space within the [Hadoop Distributed File System](http://www.aosabook.org/en/hdfs.html). This disk space is designed to store large datasets accessible to programs designed around the Map/Reduce pattern and running on the Hadoop platform.
If you are user fred, you can view the contents of your personal HDFS directory using the hadoop command:

hadoop fs -ls /users/fred
From the command line you can use HDFS much like any Linux file system. For example:
hadoop fs -cat /users/fred/my_big_file.txt | grep -i 'hello world'
Refer to the Hadoop File System Shell Guide to see all the commands available to you.
The Pydoop package has been installed on the HPC cluster and is currently available on the head node. It includes an HDFS API which allows Python programs to access the HDFS file system directly.
The HDFS API tutorial provides some simple examples of how to write Python programs using this API.
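As a rough illustration (not a substitute for the tutorial), a Python program using Pydoop's HDFS API might list a directory and search a text file along the lines of the sketch below. The paths and the 'hello world' search string are just the examples from above, and the exact call signatures (for example the "rt" text mode) may differ between Pydoop releases, so check the installed version's documentation.

```python
import pydoop.hdfs as hdfs

# List the contents of your HDFS home directory
# (replace 'fred' with your own username).
for path in hdfs.ls("/users/fred"):
    print(path)

# Read a text file stored on HDFS and search it,
# mirroring the grep example above.
f = hdfs.open("/users/fred/my_big_file.txt", "rt")
try:
    for line in f.read().splitlines():
        if "hello world" in line.lower():
            print(line)
finally:
    f.close()
```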
The HDFS namenode running on our head node offers a web interface that allows you to examine the status of the HDFS. Simply establish an SSH tunnel to this service when you log in:
ssh -L 50070:localhost:50070 [email protected]
Then point your web browser at http://localhost:50070
This capability has not been fully installed but will be available very soon.
- Your data on HDFS will not be backed up. Please ensure that you can recreate your datasets in the event that they are lost or corrupted. Having said that, HDFS keeps redundant copies of all data across three separate hosts, so data loss due to disk failure would be most uncommon.
- Our HDFS platform is currently in 'experimental' status. The platform and the data may not be available at all times.