TestingExecutorBlanca

From CSDMS

Latest revision as of 12:12, 12 October 2019

wmt-testing executor on blanca

Instructions for installing and configuring a WMT executor on blanca.

--Mpiper (talk) 15:44, 28 February 2018 (MST)

Set build environment on blanca

Log in to summit.

ssh mapi8461@login.rc.colorado.edu

Load the correct Slurm module for blanca.

module load slurm/blanca

Log in to a compute node.

sinteractive

Build everything on the compute node.

Set install directory

The install directory for this executor is /work/csdms/wmt/_testing.

install_dir=/work/csdms/wmt/_testing
mkdir -p $install_dir

Make sure read and execute bits are set on this directory.

chmod 0775 $install_dir
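
To confirm that the mode took effect, the directory's octal permissions can be read back. This is a minimal sketch, assuming GNU coreutils `stat` (the `-c` format option); the helper name `dir_mode` is hypothetical and not part of WMT.

```shell
# Hypothetical helper: print the octal permission bits of a path.
# Assumes GNU coreutils stat, which supports the -c FORMAT option.
dir_mode() {
    stat -c '%a' "$1"
}
```

Calling `dir_mode "$install_dir"` after the `chmod` above should report `775`.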

Install Python

Install a Python distribution to be used locally by WMT. We like to use Miniconda.

cd $install_dir
curl https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -o miniconda.sh
bash ./miniconda.sh -f -b -p $(pwd)/conda
export PATH=$(pwd)/conda/bin:$PATH

If working with an existing Miniconda install, be sure to update everything before continuing:

conda update conda
conda update --all

Install the CSDMS software stack

Using the csdms-stack conda channel (the Bakery), install the CSDMS software stack, including several pre-built components, with the `csdms-stack` metapackage.

conda install csdms-stack -c csdms-stack -c defaults -c conda-forge

This metapackage currently includes

  • pymt
  • cca-tools
  • csdms-child
  • csdms-sedflux-3d
  • csdms-hydrotrend
  • csdms-permamodel-ku
  • csdms-permamodel-frostnumber
  • csdms-permamodel-kugeo
  • csdms-permamodel-frostnumbergeo
  • csdms-cruaktemp
  • csdms-brake
  • csdms-pydeltarcm

Alternatively, use the requirements file listed here.

Optionally install the `babelizer`, in case a component needs to be built from source.

conda install -c csdms-stack babelizer

Optionally install IPython for testing.

conda install ipython

Recall that when running IPython remotely, it's helpful to set

export MPLBACKEND=Agg

HDF5 and file locks

When testing the executor, I found that it couldn't write output to NetCDF4 files; this scary-looking message was written to stdout:

HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 47342242057600:
 #000: H5F.c line 491 in H5Fcreate(): unable to create file
   major: File accessibilty
   minor: Unable to open file
 #001: H5Fint.c line 1305 in H5F_open(): unable to lock the file
   major: File accessibilty
   minor: Unable to open file
 #002: H5FD.c line 1839 in H5FD_lock(): driver lock request failed
   major: Virtual File Layer
   minor: Can't update object
 #003: H5FDsec2.c line 940 in H5FD_sec2_lock(): unable to lock file, errno = 37, error message = 'No locks available'
   major: File accessibilty
   minor: Bad file ID accessed

On further inspection, I found that I could import the `netCDF4` Python package, but calling `Dataset` threw an exception.

When googling `H5F_open(): unable to lock the file`, I found an offhand reference (http://hdf-forum.184993.n3.nabble.com/h5fcreate-1-10-unable-to-lock-td4028902.html) by an HDF5 developer to issues with file locks on Lustre filesystems in the newly released HDF5 version 1.10. Interestingly, the report came from a janus user.
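
Frame #003 of the trace above reports `No locks available`. One way to probe whether a filesystem grants advisory locks at all, independently of HDF5, is the `flock` command. The sketch below assumes `flock` (from util-linux) is installed; the helper name `can_lock` is hypothetical. A successful probe does not guarantee that HDF5 will succeed, but a failure would be consistent with the error shown.

```shell
# Hypothetical helper: return 0 if an advisory lock can be taken on a
# scratch file inside the given directory, and 1 otherwise.
# Assumes the flock(1) command from util-linux is on the PATH.
can_lock() {
    probe="$1/.lock_probe.$$"
    # flock creates the probe file if needed, takes the lock without
    # blocking (-n), runs "true", and releases the lock on exit.
    if flock -n "$probe" true 2>/dev/null; then
        status=0
    else
        status=1
    fi
    rm -f "$probe"
    return "$status"
}
```

For example, `can_lock "$install_dir" && echo "locks available"` checks the install directory's filesystem.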

I thought that by rolling back the HDF5 version to 1.8.x, I might be able to work around this issue. However, `esmpy` depends on HDF5, so I couldn't do it directly. I found that rolling back the `netcdf-fortran` package by one build did the trick, because conda then downgrades `hdf5` to 1.8.18 along with it:

$ conda install netcdf-fortran=4.4.4=5 -c defaults -c conda-forge
<snip>
The following packages will be DOWNGRADED:

   esmf:           7.0.0-9              conda-forge --> 7.0.0-8              conda-forge
   hdf5:           1.10.1-h9caa474_1                --> 1.8.18-h6792536_1
   netcdf-fortran: 4.4.4-6              conda-forge --> 4.4.4-5              conda-forge

WMT can now write output to NetCDF4 on blanca.

Still true. --Mpiper (talk) 10:06, 7 September 2018 (MDT)

Install executor software

Load blanca's `git` module.

module load git

Install the `wmt-exe` package from source.

mkdir -p $install_dir/opt && cd $install_dir/opt
git clone https://github.com/csdms/wmt-exe
cd wmt-exe
python setup.py develop

Create a site configuration file that describes the executor and symlink it to the executor's etc/ directory.

work_dir="/rc_scratch/$USER/wmt/_testing"
python setup.py configure --wmt-prefix=$install_dir --launch-dir=$work_dir --exec-dir=$work_dir
#ln -s "$(realpath wmt.cfg)" $install_dir/conda/etc # "realpath" not installed on blanca :(
cd $install_dir/conda/etc
ln -s $install_dir/opt/wmt-exe/wmt.cfg
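
For the commented-out `realpath` line above, one possible substitute is `readlink -f`. This is a sketch under the assumption that GNU `readlink` is available on the system, which the original notes do not confirm for blanca; `abs_path` is a hypothetical helper name.

```shell
# Hypothetical stand-in for "realpath": print the absolute, symlink-free
# form of a path. Assumes GNU coreutils readlink (the -f option).
abs_path() {
    readlink -f "$1"
}
```

With this helper, the link could be created from the wmt-exe directory with `ln -s "$(abs_path wmt.cfg)" "$install_dir/conda/etc"`, without changing directories first.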

Check that `$USER` didn't get expanded in the file. Lines 10-11 should be:

exec_dir = /rc_scratch/$USER/wmt/_testing
launch_dir = /rc_scratch/$USER/wmt/_testing
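
The check described above can also be scripted. The following is a minimal sketch using `grep`; the helper name `check_user_unexpanded` is hypothetical, and it only tests that both settings still contain the literal string `$USER`, not that they sit on lines 10-11.

```shell
# Hypothetical helper: succeed only if the exec_dir and launch_dir settings
# in the given config file both contain the literal, unexpanded text $USER.
check_user_unexpanded() {
    cfg="$1"
    # In a basic regular expression, \$ matches a literal dollar sign.
    grep -q '^exec_dir *=.*\$USER' "$cfg" &&
        grep -q '^launch_dir *=.*\$USER' "$cfg"
}
```

Running `check_user_unexpanded "$install_dir/opt/wmt-exe/wmt.cfg" || echo "USER was expanded"` flags a file in which the shell substituted a username.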

Note that we're using /rc_scratch for the launch and execution directories instead of the default ~/.wmt. Also note that we needed an SbatchLauncher class for wmt-exe because blanca uses Slurm instead of Torque for job control.

Install and test CSDMS components

Each section below describes how to install and test a particular CSDMS component.

Currently installed components:

  1. BRaKE
  2. Child
  3. CRUAKTemp
  4. FrostNumberGeoModel
  5. FrostNumberModel
  6. Hydrotrend
  7. KuGeoModel
  8. KuModel
  9. PyDeltaRCM
  10. Sedflux3D