TestingExecutorBlanca: Difference between revisions
Latest revision as of 12:12, 12 October 2019
Instructions for installing and configuring a WMT executor on blanca.
--Mpiper (talk) 15:44, 28 February 2018 (MST)
Set build environment on blanca
Log in to summit.
ssh mapi8461@login.rc.colorado.edu
Load the correct Slurm module for blanca.
module load slurm/blanca
Log in to a compute node.
sinteractive
Build everything on the compute node.
Set install directory
The install directory for this executor is /work/csdms/wmt/_testing.
install_dir=/work/csdms/wmt/_testing
mkdir -p $install_dir
Make sure read and execute bits are set on this directory.
chmod 0775 $install_dir
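To confirm the mode took effect, the permission bits can be read back. This is a minimal sketch assuming GNU `stat` (as found on most Linux clusters); the fallback to a temporary directory is an assumption added only so the snippet runs anywhere, and is not part of the original instructions.

```shell
# Use the executor's install directory if it is set; otherwise fall back to
# a throwaway directory (an assumption, so the sketch is self-contained).
install_dir="${install_dir:-$(mktemp -d)}"
mkdir -p "$install_dir"
chmod 0775 "$install_dir"

# GNU stat prints the octal permission bits with -c '%a'.
perms=$(stat -c '%a' "$install_dir")
echo "mode of $install_dir: $perms"
```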
Install Python
Install a Python distribution to be used locally by WMT. We like to use Miniconda.
cd $install_dir
curl https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -o miniconda.sh
bash ./miniconda.sh -f -b -p $(pwd)/conda
export PATH=$(pwd)/conda/bin:$PATH
If working with an existing Miniconda install, be sure to update everything before continuing:
conda update conda
conda update --all
Install the CSDMS software stack
Using the csdms-stack conda channel (the Bakery), install the CSDMS software stack, including several pre-built components, with the `csdms-stack` metapackage.
conda install csdms-stack -c csdms-stack -c defaults -c conda-forge
This metapackage currently includes:
- pymt
- cca-tools
- csdms-child
- csdms-sedflux-3d
- csdms-hydrotrend
- csdms-permamodel-ku
- csdms-permamodel-frostnumber
- csdms-permamodel-kugeo
- csdms-permamodel-frostnumbergeo
- csdms-cruaktemp
- csdms-brake
- csdms-pydeltarcm
Alternatively, use the requirements file listed at https://github.com/mdpiper/wmt-executor-requirements-files.
Optionally install the `babelizer`, in case a component needs to be built from source.
conda install -c csdms-stack babelizer
Optionally install IPython for testing.
conda install ipython
Recall that when running IPython remotely, it's helpful to set
export MPLBACKEND=Agg
HDF5 and file locks
When testing the executor, I found that it couldn't write output to NetCDF4 files, and this scary-looking message was written to stdout:
HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 47342242057600:
  #000: H5F.c line 491 in H5Fcreate(): unable to create file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 1305 in H5F_open(): unable to lock the file
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 1839 in H5FD_lock(): driver lock request failed
    major: Virtual File Layer
    minor: Can't update object
  #003: H5FDsec2.c line 940 in H5FD_sec2_lock(): unable to lock file, errno = 37, error message = 'No locks available'
    major: File accessibilty
    minor: Bad file ID accessed
On further inspection, I found that I could import the `netCDF4` Python package, but calling `Dataset` threw an exception.
When googling `H5F_open(): unable to lock the file`, I found an offhand reference (http://hdf-forum.184993.n3.nabble.com/h5fcreate-1-10-unable-to-lock-td4028902.html) by an HDF5 developer to issues with file locks on Lustre filesystems in the newly released HDF5 version 1.10. Interestingly, the report came from a janus user.
I thought that by rolling back the HDF5 version to 1.8.x, I might be able to work around this issue. However, `esmpy` depends on HDF5, so I couldn't do it directly. I found that rolling back the `netcdf-fortran` package by one build did the trick:
$ conda install netcdf-fortran=4.4.4=5 -c defaults -c conda-forge
<snip>
The following packages will be DOWNGRADED:
esmf: 7.0.0-9 conda-forge --> 7.0.0-8 conda-forge
hdf5: 1.10.1-h9caa474_1 --> 1.8.18-h6792536_1
netcdf-fortran: 4.4.4-6 conda-forge --> 4.4.4-5 conda-forge
WMT can now write output to NetCDF4 on blanca.
Still true. --Mpiper (talk) 10:06, 7 September 2018 (MDT)
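Since the failure comes down to the filesystem refusing a lock request, a quick probe can show whether a given directory grants file locks at all. The sketch below was not part of the original debugging: it assumes the `flock` command from util-linux is available, and the function name `probe_locks` is made up for illustration. A success here does not guarantee that HDF5 will behave, but a failure would be consistent with the 'No locks available' error above.

```shell
# probe_locks DIR: try to take a non-blocking lock on a scratch file in DIR.
# Hypothetical helper; requires the util-linux `flock` command.
probe_locks() {
    dir="$1"
    probe=$(mktemp "$dir/lockprobe.XXXXXX") || return 2
    if flock -n "$probe" true; then
        status="locks available"
        rc=0
    else
        status="locks NOT available"
        rc=1
    fi
    rm -f "$probe"
    echo "$dir: $status"
    return $rc
}

# Example: probe a temporary directory (swap in the executor's work directory).
probe_locks "${TMPDIR:-/tmp}"
```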
Install executor software
Load blanca's `git` module.
module load git
Install the `wmt-exe` package from source.
mkdir -p $install_dir/opt && cd $install_dir/opt
git clone https://github.com/csdms/wmt-exe
cd wmt-exe
python setup.py develop
Create a site configuration file that describes the executor and symlink it to the executor's etc/ directory.
work_dir="/rc_scratch/$USER/wmt/_testing"
python setup.py configure --wmt-prefix=$install_dir --launch-dir=$work_dir --exec-dir=$work_dir
#ln -s "$(realpath wmt.cfg)" $install_dir/conda/etc # "realpath" not installed on blanca :(
cd $install_dir/conda/etc
ln -s $install_dir/opt/wmt-exe/wmt.cfg
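Where `realpath` is missing, an absolute path can be built from shell builtins and POSIX utilities alone. The helper below, with the made-up name `abspath`, is one possible stand-in. It resolves symlinks in the directory part only, which should be enough for the symlink step above, but it is not a full replacement for `realpath`.

```shell
# abspath FILE: print FILE's absolute path using only cd, pwd, dirname and
# basename. Hypothetical helper; the directory containing FILE must exist.
abspath() {
    ( cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")" )
}

# Usage sketch: link the config into the executor's etc/ directory.
# ln -s "$(abspath wmt.cfg)" "$install_dir/conda/etc"
```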
Check that `$USER` didn't get expanded in the file. Lines 10-11 should be:
exec_dir = /rc_scratch/$USER/wmt/_testing
launch_dir = /rc_scratch/$USER/wmt/_testing
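This check can also be scripted. The sketch below assumes the two settings start at the beginning of their lines in the configuration file, and uses a hypothetical function name, `check_user_literal`. It only greps for the literal string `$USER` on the `exec_dir` and `launch_dir` settings.

```shell
# check_user_literal CFG: succeed only if both exec_dir and launch_dir
# still contain the literal text $USER (that is, it was not expanded).
check_user_literal() {
    cfg="$1"
    for key in exec_dir launch_dir; do
        # -F treats '$USER' as a fixed string, not a regular expression.
        if ! grep "^$key" "$cfg" | grep -qF '$USER'; then
            echo "$key: \$USER was expanded or the key is missing" >&2
            return 1
        fi
    done
    echo "ok: \$USER is unexpanded in $cfg"
}

# Usage sketch:
# check_user_literal "$install_dir/opt/wmt-exe/wmt.cfg"
```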
Note that we're using /rc_scratch for the launch and execution directories instead of the default ~/.wmt. Also note that we needed an SbatchLauncher class (https://github.com/csdms/wmt-exe/pull/12) for wmt-exe because blanca uses Slurm (https://slurm.schedmd.com/overview.html) instead of Torque for job control.
Install and test CSDMS components
Each section below describes how to install and test a particular CSDMS component.
Currently installed components:
- BRaKE
- Child
- CRUAKTemp
- FrostNumberGeoModel
- FrostNumberModel
- Hydrotrend
- KuGeoModel
- KuModel
- PyDeltaRCM
- Sedflux3D