HPCC information


Status Report (7 April 2011)

Warning: Although beach is once again stable, if you made use of the /data partition, please read the following message.

As many of you know, beach began to behave badly about two weeks ago.

We replaced numerous hardware components in the server, all to no avail. Beach continued to crash, sometimes within two hours of being brought back up.

In the midst of these crashes we noticed that /data was having issues, as well as /home: /home was continually rebuilding its mirror, and /data would drop into read-only mode with input/output errors. We reported this to SGI, and they insisted that the problems we had been having were software-related, not hardware-related. They also indicated that the xfs errors we had been seeing on those two partitions could be cleared up by running filesystem checks. We have never seen the xfs filesystem fail without an underlying hardware problem, and I said as much to SGI, but they insisted that software was the root of the problem.

We were able to run an xfs check on /home, but /data was damaged badly enough that we had to run an xfs repair. Unfortunately, the repair essentially unlinked the root directories under /data and threw everything into lost+found. There is no way to recover from that, and, as everyone knows, /data is not backed up.

Beach has been completely stable since the filesystem check and repair. We have stressed the system as much as possible without seeing any other errors. Although the xfs repair has made things usable, we do expect that there is still an underlying hardware problem. Please use beach normally, putting new data into /data, but please be aware that even though we've attempted to build in as much redundancy as possible, there is no substitute for keeping backups of your important data.

Now, there is really nothing to be done but move forward. There is still 18TB of data sitting in lost+found and we must do something about that. I would like anyone who can write off their data to come forward and let me know. I could then remove any files owned by that person from lost+found and it would make looking for any other files a lot more manageable. For those out there who really need something, or would like to look for their data, we can work on it. We must clean out lost+found - it is taking up over half of the usable disk space in /data.

If all of the above was too much detail, I can boil it down to this: if we do nothing, all that data will sit in lost+found forever. So I'm setting a deadline: on June 1 I will remove anything left in lost+found. I'd like anyone who can live with the loss of their data to come forward so I can remove whatever I can find of theirs immediately. If you'd like help looking for your files, I'm happy to provide it. I know that I'll be working with just about every user on this, and I'll do whatever I can to make this easier and free up /data for use. Beach is ready to be used; just think of /data as being empty unless and until we find files owned by you in lost+found. If you do want to look through your files, let me know your username and that you want me to move all files owned by you out of lost+found into /data/<your_username>, and we can get started.
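
For those who want a head start, the search boils down to an ownership scan over lost+found. Below is a minimal sketch of that kind of scan in Python, assuming the recovered files sit under /data/lost+found; the exact path, and whether your account can read it directly, may differ on beach.

# list_lost_files.py -- sketch of an ownership scan over lost+found.
# The /data/lost+found location is an assumption; adjust as needed.
import os
import pwd
import sys

def files_owned_by(username, top="/data/lost+found"):
    """Yield paths under top whose owner matches username."""
    uid = pwd.getpwnam(username).pw_uid
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid == uid:
                    yield path
            except OSError:
                # The file may have been moved or removed mid-scan.
                pass

if __name__ == "__main__":
    for path in files_owned_by(sys.argv[1]):
        print(path)

Invoked as, for example, "python list_lost_files.py jsmith", it prints any matching paths; in practice we can run the equivalent scan for you and move the results into /data/<your_username> as described above.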

If there are ANY questions about this, I am happy to answer them. Please direct all questions and concerns to trouble@colorado.edu and I'll respond to them as quickly as possible.

The CSDMS High Performance Computing Cluster (Code name: beach)

The CSDMS High Performance Computing Cluster (HPCC) gives CSDMS researchers access to a state-of-the-art HPC resource.

Use of the CSDMS HPCC is available free of charge to the CSDMS community! To get an account on our machine you will need to meet only a few requirements before you can sign up for a one-year guest account. That's it!

Attribution and Reporting of Results

When reporting results which were obtained on the CSDMS cluster, we request that the following language be used as an acknowledgement:

"We acknowledge computing time on the CU-CSDMS High-Performance Computing Cluster."

Also, please notify us of any tech reports, conference papers, journal articles, theses, or dissertations that contain results obtained on beach. Your assistance will help ensure that our online bibliography of results is as complete as possible. Citations should be sent to us.

Hardware


The CSDMS High Performance Computing Cluster is an SGI Altix XE 1300 that consists of 88 Altix XE320 compute nodes (for a total of 704 cores). The compute nodes are configured with two quad-core 3.0GHz E5472 (Harpertown) processors. 62 of the 88 nodes have 2 GB of memory per core, while the remaining nodes have 4 GB of memory per core. The cluster is controlled through an Altix XE250 head node. Internode communication is accomplished through either gigabit ethernet or over a non-blocking InfiniBand fabric.

Each compute node has 250 GB of local temporary storage. However, all nodes are able to access 36TB of RAID storage through NFS.

The CSDMS system will be tied in to Janus, a 153 Tflop Front Range HPCC that offers 1,368 compute nodes, each with two 2.8 GHz six-core Intel Westmere processors (16,416 cores in total), connected by a non-blocking QDR InfiniBand network.

Some benchmarks that we've run on beach:

Hardware Summary

Node                 Type                    Processors              Memory     Internal Storage
beach.colorado.edu   Head (Altix XE250)      2 Quad-Core Xeon [1]    16GB [2]   --
cl1n001 - cl1n056    Compute (Altix XE320)   2 Quad-Core Xeon [1]    16GB [2]   250GB SATA
cl1n057 - cl1n080    Compute (Altix XE320)   2 Quad-Core Xeon [1]    32GB [2]   250GB SATA
cl1n081 - cl1n088    Compute (Altix XE320)   2 Quad-Core Xeon [1]    16GB [2]   250GB SATA
  1. Processors are quad-core Intel Xeon E5472 (Harpertown):
    • Front Side Bus: 1600 MHz
    • L2 Cache: 12MB
  2. Memory is DDR2 800 MHz FBDIMM

Software


Below is a list of some of the software that we have installed on beach. Most packages are made available through the Environment Modules system, and the tables below give the module name to load for each package. If there is a particular software package that is not listed below and you would like to use it, please feel free to send us an email outlining what it is you need.

Compilers

Name Version Module Name Location
gcc 4.1 gcc/4.1 /usr
gcc 4.3 gcc/4.3 /usr/local/gcc
gfortran 4.1 gcc/4.1 /usr
gfortran 4.3 gcc/4.3 /usr/local/gcc
icc 11.0 intel /usr/local/intel
ifort 11.0 intel /usr/local/intel
mpich2 1.1 mpich2/1.1 /usr/local/mpich
mvapich2 1.5 mvapich2/1.5 /usr/local/mvapich2-1.5
openmpi 1.3 openmpi/1.3 /usr/local/openmpi

Languages

Name Version Module Name Location
Python[1] 2.4 python/2.4 /usr
Python[2] 2.6 python/2.6 /usr/local/python
Java 1.5 -- --
Java 1.6 -- --
perl 5.8.8 -- /usr
MATLAB 2008b matlab /usr/local/matlab
  1. Python 2.4 modules:
  2. Python 2.6 modules:

Libraries

Name Version Module Name Location
Udunits 1.12.9 udunits /usr/local/udunits
netcdf 4.0.1 netcdf /usr/local/netcdf
hdf5 1.8 hdf5 /usr/local/hdf5
libxml2 2.7.3 libxml2 /data/progs/lib/libxml2
glib-2.0 2.18.3 glib2 /usr/local/glib
petsc 3.0.0p3 petsc /usr/local/petsc
mct 2.6.0 mct /data/progs/mct/2.6.0-mpich2-intel

Tools

Name Version Module Name Location
cmake 2.6p2 cmake /usr/local/cmake
scons 1.2.0 scons /usr/local/scons
subversion 1.6.2 subversion /usr/local/subversion
torque 2.3.5 torque /opt/torque
Environment modules 3.2.6 -- /usr/local/modules
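
To show how the module names above fit into a typical run, here is a minimal sketch of a Python script that writes a Torque/PBS job script and submits it with qsub. The resource request, the modules loaded, and the program name (./my_model) are illustrative assumptions rather than a prescription; adjust them for your own jobs, and note that this assumes the module command is initialized in batch shells.

# submit_job.py -- sketch of generating and submitting a Torque/PBS job.
# The resource request, module choices, and program name (./my_model)
# are illustrative assumptions; adjust them for your own runs.
import subprocess

JOB_SCRIPT = """#!/bin/bash
#PBS -N my_model_run
#PBS -l nodes=1:ppn=8
#PBS -l walltime=02:00:00
#PBS -j oe

# Load a compiler and MPI stack from the tables above.
module load gcc/4.3
module load openmpi/1.3

cd $PBS_O_WORKDIR
mpirun -np 8 ./my_model
"""

def submit(script_text, path="my_model.pbs"):
    """Write the job script to disk and hand it to qsub."""
    f = open(path, "w")
    f.write(script_text)
    f.close()
    # On success, qsub prints the identifier of the queued job.
    subprocess.call(["qsub", path])

if __name__ == "__main__":
    submit(JOB_SCRIPT)

The same job script can also be written by hand and submitted directly with qsub my_model.pbs; the Python wrapper is only there to keep the example self-contained.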

Monitoring Usage of Beach

The CSDMS high performance computing cluster uses the Ganglia Monitoring System to provide real-time usage statistics. Note that although we constantly monitor each computational node of the cluster, Ganglia was designed with high performance computing in mind and the monitoring process itself will not negatively impact your job's execution time.

Take me to the stats!