HPCC information

[[Image:Alert-yellow.png | center | 50px | Out-of-date page ]]
This page is out of date.
Please see the [[HPC]] page for information on accessing and using ''blanca'', the CSDMS HPC.

= Important Message (7 April 2011) =
{{Alert Box|message=Although beach is once again stable, if you made use of the /data partition, please read the following message}}
As many of you know, beach began to behave badly about two weeks ago.
We replaced numerous hardware components in the server, all to no avail.
Beach continued to crash, sometimes within two hours of being brought
back up.
 
In the midst of these crashes we noticed that both /data and /home were
having issues: /home was continually rebuilding its mirror, and /data
would go into read-only mode with input/output errors.  We reported this
to SGI, and they insisted that the problems we'd been having were
software-related, not hardware-related.  They also indicated that the xfs
errors and issues we'd been seeing on those two partitions could be
cleared up by running file system checks.  I have to say that we have
never seen the xfs filesystem fail without an underlying hardware
problem, and I said as much to SGI, but they insisted that software was
the root of the problem.
 
We were able to run an xfs check on /home, but /data was damaged badly
enough that we had to run an xfs repair.  Unfortunately, the repair
essentially unlinked the root directories under /data and threw everything
into lost+found.  There is no way to recover from that, and, as everyone
knows, /data is not backed up.
 
Beach has been completely stable since the filesystem check and repair.
We have stressed the system as much as possible without seeing any other
errors.  Although the xfs repair has made things usable, we still suspect
an underlying hardware problem.  Please use beach normally, putting new
data into /data, but be aware that even though we've built in as much
redundancy as possible, there is no substitute for keeping backups of
your important data.
 
Now there is really nothing to be done but move forward.  There is still
18 TB of data sitting in lost+found, and we must do something about it.
I would like anyone who can write off their data to come forward and let
me know; I can then remove any files owned by that person from lost+found,
which will make searching for the remaining files much more manageable.
For those who really need something, or would like to look for their
data, we can work on it together.  We must clean out lost+found - it is
taking up over half of the usable disk space in /data.
 
If all of the above was too much detail, I can boil it down to this:
if we do nothing, all that data will sit in lost+found forever.  So I'm
setting a deadline - on June 1 I will remove anything left in lost+found.
I'd like anyone who can live with the loss of their data to come forward
so I can remove whatever I can find of theirs immediately.  If anyone
wants help looking for their files, I'd be happy to provide it.  I know
that I'll be working with just about every user on this, and I'm happy
to do whatever I can to make this easier and free up /data for use.
Beach is ready to be used; just think of /data as being empty unless
and until we find your files in lost+found.  If you do want to look
through your files, let me know your username and that you want me to
move all files owned by you out of lost+found into /data/<your_username>,
and we can get started.
 
If there are ANY questions about this, I am happy to answer them.  Please
direct all questions and concerns to trouble@colorado.edu and I'll respond
to them as quickly as possible.
 
= The CSDMS High Performance Computing Cluster (Code name: beach) =
 
The CSDMS High Performance Computing Cluster (HPCC) provides CSDMS researchers with access to a state-of-the-art HPC cluster.
 
Use of the CSDMS HPCC is available free of charge to the CSDMS community!  To get an account on our machine you will need to meet only a few [[HPCC_account_requirements|requirements]] before you can [[HPCC_account_request | sign up]] for a one-year guest account.  That's it!
 
== Attribution and Reporting of Results ==
When reporting results which were obtained on the CSDMS cluster, we request that the following language be used as an acknowledgement:
 
"We acknowledge computing time on the CU-CSDMS High-Performance Computing Cluster."
 
Also, please notify us of any tech reports, conference papers, journal articles, theses, or dissertations that contain results obtained on beach. Your assistance will help to ensure that our online bibliography of results is as complete as possible. Citations should be sent to [mailto:CSDMSsupport@colorado.edu us].
 
== Hardware ==
[[File:sgi_logo_hires.jpg | right | 250px ]]
 
The CSDMS High Performance Computing Cluster is an [http://www.sgi.com SGI] [http://www.sgi.com/products/servers/altix/xe Altix XE] 1300 that consists of 88 Altix XE320 compute nodes (for a total of 704 cores).  The compute nodes are configured with two quad-core 3.0GHz E5472 (Harpertown) processors.  62 of the 88 nodes have 2 GB of memory per core, while the remaining nodes have 4 GB of memory per core.  The cluster is controlled through an Altix XE250 head node.  Internode communication is accomplished through either gigabit ethernet or over a non-blocking [http://en.wikipedia.org/wiki/InfiniBand InfiniBand] fabric.
 
Each compute node has 250 GB of local temporary storage.  In addition, all nodes are able to access 36 TB of RAID storage through NFS.
 
The CSDMS system will be tied in to Janus, a 153-Tflops Front Range HPCC that offers 1,368 compute nodes, each with two 2.8 GHz six-core Intel Westmere processors (16,416 cores in total), connected by a non-blocking QDR InfiniBand network.
 
Some benchmarks that we've run on beach:
* The OSU [[ CSDMS_HPCC_OMB_benchmarks |micro-benchmarks]]
* [[CSDMS_HPCC_HPL_benchmarks| High-Performance Linpack Benchmark]]
 
=== Hardware Summary ===
{|
! align=left width=150 | Node
! align=left width=150 | Type
! align=left width=200 | Processors
! align=left width=100 | Memory
! align=left width=150 | Internal Storage
|-
| beach.colorado.edu
| Head (Altix XE250)
| 2 Quad-Core Xeon<ref name=proc_specs>
Processors are Quad-core Intel Xeon E5472 (Harpertown):
* Front Side Bus: 1600 MHz
* L2 Cache: 12MB
</ref>
| 16GB<ref name=mem_specs>
Memory is DDR2 800 MHz FBDIMM</ref>
| --
|-
| cl1n001 - cl1n056
| Compute (Altix XE320)
| 2 Quad-Core Xeon <ref name=proc_specs/>
| 16GB <ref name=mem_specs/>
| 250GB SATA
|-
| cl1n057 - cl1n080
| Compute (Altix XE320)
| 2 Quad-Core Xeon <ref name=proc_specs/>
| 32GB <ref name=mem_specs/>
| 250GB SATA
|-
| cl1n081 - cl1n088
| Compute (Altix XE320)
| 2 Quad-Core Xeon <ref name=proc_specs/>
| 16GB <ref name=mem_specs/>
| 250GB SATA
|}
 
<references />
 
== Software ==
[[Image:HPCC.png | 350px | right | The CSDMS HPCC]]
 
Below is a list of some of the software that we have installed on beach.  If there is a particular software package that is not listed below and you would like to use it, please feel free to send an email to [mailto:CSDMSsupport@colorado.edu us] outlining what you need.
 
=== Compilers ===
{|
! align=left width=100 |  Name
! align=left width=100 |  Version
! align=left width=100 | Module Name
! align=left | Location
|-
| [http://gcc.gnu.org/ gcc]
| 4.1
| gcc/4.1
| /usr
|-
| [http://gcc.gnu.org/ gcc]
| 4.3
| gcc/4.3
| /usr/local/gcc
|-
| [http://gcc.gnu.org/wiki/GFortran gfortran]
| 4.1
| gcc/4.1
| /usr
|-
| [http://gcc.gnu.org/wiki/GFortran gfortran]
| 4.3
| gcc/4.3
| /usr/local/gcc
|-
| icc
| 11.0
| intel
| /usr/local/intel
|-
| ifort
| 11.0
| intel
| /usr/local/intel
|-
| [http://www.mcs.anl.gov/research/projects/mpich2/ mpich2]
| 1.1
| mpich2/1.1
| /usr/local/mpich
|-
| [http://mvapich.cse.ohio-state.edu/ mvapich2]
| 1.5
| mvapich2/1.5
| /usr/local/mvapich2-1.5
|-
| [http://www.open-mpi.org/ openmpi]
| 1.3
| openmpi/1.3
| /usr/local/openmpi
|}
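
As a quick check that a compiler and one of the MPI stacks listed above work together, a minimal MPI program such as the sketch below can be built with the compiler wrapper supplied by the chosen MPI module (for example, <code>mpicc</code>).  The module names in the comments come from the table above and are meant only as an illustration; <code>module avail</code> will show what is currently installed on beach.

<syntaxhighlight lang="c">
/*
 * hello_mpi.c -- minimal MPI sanity check.
 *
 * Illustrative build and run (module names taken from the table above;
 * adjust to whatever `module avail` reports):
 *
 *   module load gcc/4.3 openmpi/1.3
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpirun -np 8 ./hello_mpi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* id of this process         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the MPI runtime down  */
    return 0;
}
</syntaxhighlight>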
 
=== Languages ===
{|
! align=left width=100 | Name
! align=left width=100 | Version
! align=left width=100 | Module Name
! align=left | Location
|-
| Python<ref>
Python 2.4 modules:
* [http://numpy.scipy.org/ numpy] 1.2.1
* [http://www.scipy.org/ scipy] 0.6.0
* [http://www.pythonware.com/products/pil Python Imaging Library (PIL)]
</ref>
| 2.4
| python/2.4
| /usr
|-
| Python<ref>
Python 2.6 modules:
* [http://numpy.scipy.org/ numpy] 1.3.0
* [http://www.scipy.org/ scipy] 0.7.1rc3
* [http://www.pyngl.ucar.edu/Nio.shtml PyNIO] 1.3.0b1
* [http://ipython.scipy.org iPython] 0.10
* [http://www.cython.org Cython] 0.11.3
</ref>
| 2.6
| python/2.6
| /usr/local/python
|-
| Java
| 1.5
| --
| --
|-
| Java
| 1.6
| --
| --
|-
| perl
| 5.8.8
| --
| /usr
|-
| [http://www.mathworks.com/ MATLAB]
| 2008b
| matlab
| /usr/local/matlab
|}
 
<references/>
 
=== Libraries ===
{|
! align=left width=100 |  Name
! align=left width=100 |  Version
! align=left width=100 | Module Name
! align=left | Location
|-
| [http://www.unidata.ucar.edu/software/udunits Udunits]
| 1.12.9
| udunits
| /usr/local/udunits
|-
| [http://www.unidata.ucar.edu/software/netcdf netcdf]
| 4.0.1
| netcdf
| /usr/local/netcdf
|-
| [http://www.hdfgroup.org/HDF5 hdf5]
| 1.8
| hdf5
| /usr/local/hdf5
|-
| [http://xmlsoft.org/index.html libxml2]
| 2.7.3
| libxml2
| /data/progs/lib/libxml2
|-
| [http://www.gtk.org/ glib-2.0]
| 2.18.3
| glib2
| /usr/local/glib
|-
| petsc
| 3.0.0p3
| petsc
| /usr/local/petsc
|-
| [http://www.mcs.anl.gov/research/projects/mct/ mct]
| 2.6.0
| mct
| /data/progs/mct/2.6.0-mpich2-intel
|}
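
As an illustration of using one of these libraries, the sketch below writes a small one-dimensional variable with the standard netCDF C interface provided by the netcdf module above.  The build line in the comment is an assumption; depending on how the module sets up your environment, you may need to add explicit <code>-I</code> and <code>-L</code> flags pointing at /usr/local/netcdf.

<syntaxhighlight lang="c">
/*
 * write_demo.c -- write one small 1-D variable with the netCDF C API.
 *
 * Illustrative build (flags may need adjusting for your environment):
 *
 *   module load gcc/4.3 netcdf
 *   gcc write_demo.c -o write_demo -lnetcdf
 */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

/* Print the library's error message and abort if a netCDF call fails. */
static void check(int status)
{
    if (status != NC_NOERR) {
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(status));
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    int ncid, dimid, varid;
    double vals[5] = {0.0, 1.0, 2.0, 3.0, 4.0};

    check(nc_create("demo.nc", NC_CLOBBER, &ncid));        /* create the file       */
    check(nc_def_dim(ncid, "x", 5, &dimid));               /* define a dimension    */
    check(nc_def_var(ncid, "elevation", NC_DOUBLE, 1,
                     &dimid, &varid));                     /* define a 1-D variable */
    check(nc_enddef(ncid));                                /* leave define mode     */
    check(nc_put_var_double(ncid, varid, vals));           /* write the data        */
    check(nc_close(ncid));                                 /* flush and close       */
    return 0;
}
</syntaxhighlight>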
 
=== Tools ===
{|
! align=left width=100 |  Name
! align=left width=100 |  Version
! align=left width=100 | Module Name
! align=left | Location
|-
| [http://www.cmake.org/ cmake]
| 2.6p2
| cmake
| /usr/local/cmake
|-
| [http://www.scons.org/ scons]
| 1.2.0
| scons
| /usr/local/scons
|-
| [http://subversion.tigris.org/ subversion]
| 1.6.2
| subversion
| /usr/local/subversion
|-
| [http://www.clusterresources.com/torquedocs21/ torque]
| 2.3.5
| torque
| /opt/torque
|-
| [http://modules.sourceforge.net/ Environment modules]
| 3.2.6
| --
| /usr/local/modules
|}
 
= Monitoring Usage of ''Beach'' =
The CSDMS high performance computing cluster uses the [http://ganglia.sourceforge.net/ Ganglia Monitoring System] to provide real-time usage statistics.  Although we constantly monitor each compute node of the cluster, Ganglia was designed with high performance computing in mind, and the monitoring process itself will not negatively impact your job's execution time.
 
[http://csdms.colorado.edu/ganglia Take me to the stats!]
