Update htcondor instructions #459
tristan-f-r
left a comment
[I'll have to restore my HTCondor access to follow this.]
agitter
left a comment
I'm testing the Snakemake long execution mode. The first time my jobs went on hold because I put my spras-v0.6.0.sif file in the htcondor/ directory instead of the root directory. That should have been obvious based on the comment in the .yaml file.
On the second attempt my jobs went on hold with
Transfer output files failure at execution point slot1_24@e2591.chtc.wisc.edu while sending files to access point ap2001. Details: 1 total failures: first failure: reading from file /var/lib/condor/execute/slot1/dir_3699332/scratch/output: (errno 2) No such file or directory
| log = logs/spras_$(Cluster)_$(Process).log | ||
| output = logs/spras_$(Cluster)_$(Process).out | ||
| error = logs/spras_$(Cluster)_$(Process).err | ||
| log = htcondor/logs/spras_$(Cluster)_$(Process).log |
Do we want one per cluster or one per cluster_process pair?
There was a problem hiding this comment.
I think one per cluster/process is the right way to go. In theory, one could still queue N>1.
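To illustrate why one log per cluster/process pair stays unambiguous when queueing N>1, here is a minimal Python sketch of how the `$(Cluster)` and `$(Process)` macros expand. The `expand_log_path` helper and the cluster ID are hypothetical; HTCondor performs this substitution itself at submit time.

```python
def expand_log_path(template: str, cluster: int, process: int) -> str:
    """Mimic HTCondor's substitution of $(Cluster) and $(Process) in a path."""
    return (
        template
        .replace("$(Cluster)", str(cluster))
        .replace("$(Process)", str(process))
    )


template = "htcondor/logs/spras_$(Cluster)_$(Process).log"

# Hypothetical cluster ID; queueing 3 jobs would create processes 0, 1 and 2.
paths = [expand_log_path(template, 1234, proc) for proc in range(3)]
for path in paths:
    print(path)
```

With a per-cluster name alone, all three processes would write to the same file.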
I converted this to a draft because these docs will depend on the explicit sif transfer PR, and I haven't yet tested everything here in that paradigm.
Also, apologies for the poor git etiquette in the last commit that rolled too many things into one diff (including running an
178a93e to fe7fbbc
…gging I was tired of hacking around wanting verbose logging in the HTCondor Snakemake executor, so I added some plumbing to pass Snakemake's '--verbose' flag through 'snakemake_long.py' to snakemake itself. Additionally, I added '--env-manager' so I could run things with my preferred mamba env instead of conda (which is too slow to rebuild).
The executor has matured quite a bit since these instructions were first drafted, and it's my hope that these changes remove a lot of the headache for running jobs. Now, you can edit config files in `config/` and use the `input/` directory directly. Workflows should be submitted directly from the repository root.
Co-authored-by: Tristan F.-R. <pub.tristanf@gmail.com>
fe7fbbc to ceea753
These came from testing Neha's real workflow in June 2026. Not totally sure how they all work (and whether additional environment variables will need to be added in the future), but they were key to getting custom sif images to unpack alongside the jobs.
This is a note for myself: one thing I should document in the htcondor rst is the need to pre-create apptainer images before launching workflows.
Add guidance to docs/htcondor.rst encouraging users to pre-build per-algorithm container images rather than pulling them at runtime, and steer them toward the proper place to build those images. Also add a warning against running `apptainer build` directly on a shared Access Point, pointing users to CHTC's guide for building images in an interactive job.
agitter
left a comment
During my testing, I triggered a Snakemake lock error by launching a long job, killing it, changing the config file, and relaunching. That may be a common error.
python3.11/site-packages/snakemake/persistence.py", line 211, in lock
raise snakemake.exceptions.LockException()
snakemake.exceptions.LockException: Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
Should we add it to troubleshooting?
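If this goes into troubleshooting, a small sketch of the recovery step might help. It assembles the `snakemake --configfile <path to config file> --unlock` command suggested later in this thread; the `build_unlock_command` helper is hypothetical, and `config/config.yaml` is only an example path.

```python
import shlex


def build_unlock_command(configfile: str) -> list[str]:
    """Assemble the Snakemake invocation that clears a stale directory lock."""
    return ["snakemake", "--configfile", configfile, "--unlock"]


# Example path only; use whichever config file the killed workflow was using.
cmd = build_unlock_command("config/config.yaml")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` from the repository root once the spras conda environment is active and no other Snakemake process is running in that directory.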
I also hit this error
$ cat htcondor/logs/merge_input/merge_input-5_7645955.err
ModuleNotFoundError in file "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8:
No module named 'spras.config.revision'
File "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8, in <module>
I'm guessing that means I need a newer version of the SPRAS sif image. However, we haven't released a SPRAS version recently. What version of the image are you testing with?
For the first bug, this requires the spras conda environment to be activated and then the command
For the second issue, you are right: this is because the version of SPRAS in the Docker image v0.6 isn't up to date with the current version of SPRAS. Justin has a Docker image you can pull from Docker Hub (I think it is jhiemstra/spras:update-htcondor-instructions-v2), or you will need to build an image of the updated version of SPRAS with Docker on your local machine, push the image to Docker Hub, and then use that image.
In general, you should either:
The key is that the repo you're using to submit from the AP should match what's in the image. There's a callout relatively early in the documentation covering this, but I'm open to edits since it seems like this is often missed:
ntalluri
left a comment
The updated documentation looks great, I added some suggestions on how to help users more.
I also remember we were changing the config.yaml file in spras_profile, and I wasn't sure if any of the commands needed to be added to the documentation.
| images: | ||
| omicsintegrator1: "images/omics-integrator-1_v2.sif" | ||
| pathlinker: "images/pathlinker_v2.sif" | ||
|
|
Could you add to this yaml block to show what you mean by the names having to match what is in the config?
| ... | |
| algorithms: | |
| - name: "pathlinker" | |
| include: true | |
| runs: | |
| run1: | |
| k: 10 | |
| - name: "omicsintegrator1" | |
| include: true | |
| runs: | |
| run1: | |
| b: 5 | |
| w: 0 | |
| d: 10 |
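The matching rule could also be checked mechanically. Below is a minimal Python sketch that flags included algorithms with no entry under `images`. It assumes the parsed config exposes `images` and `algorithms` side by side as in the suggestion above; the `missing_images` helper and the `examplealg` entry are hypothetical.

```python
def missing_images(config: dict) -> list[str]:
    """Return names of included algorithms that have no entry under `images`."""
    images = config.get("images", {})
    return [
        alg["name"]
        for alg in config.get("algorithms", [])
        if alg.get("include", False) and alg["name"] not in images
    ]


# Stand-in for the parsed YAML; layout is an assumption based on this thread.
config = {
    "images": {
        "omicsintegrator1": "images/omics-integrator-1_v2.sif",
        "pathlinker": "images/pathlinker_v2.sif",
    },
    "algorithms": [
        {"name": "pathlinker", "include": True},
        {"name": "omicsintegrator1", "include": True},
        {"name": "examplealg", "include": True},  # hypothetical, no image listed
    ],
}

print(missing_images(config))
```

Only `examplealg` is reported, because its name does not appear as a key under `images`.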
| Second, it requires an experimental executor for HTCondor that has been | ||
| forked from the upstream `HTCondor Snakemake executor | ||
| <https://github.com/htcondor/snakemake-executor-plugin-htcondor>`__. | ||
| #. Build/activate the SPRAS conda/mamba environment and ``pip install`` |
Could you add a note block with a link on how to set up the conda environment on an AP?
| (EP), you'll need to set up three things: | ||
|
|
||
| #. You'll need to modify ``htcondor/spras.sub`` to point at your | ||
| container image, along with any other configuration changes you want |
I got confused about what specific image this was after reading about all the image stuff above.
| container image, along with any other configuration changes you want | |
| SPRAS container image, along with any other configuration changes you want |
|
|
||
| #. You'll need to modify ``htcondor/spras.sub`` to point at your | ||
| container image, along with any other configuration changes you want | ||
| to make like choosing a logging directory or toggling OSPool |
Can you add a note on what toggling the OSPool submission looks like for a user in spras.sub?
.. note::

   In ``spras.sub``, uncomment ``+WantGlideIn = true`` and ``requirements = (HAS_SINGULARITY == True) && (Poolname =!= "CHTC")``.
There was a problem hiding this comment.
I thought the instructions in the .sub file comments were clear enough, but I'm not totally opposed to this.
| values needed by your workflow (defaults are fine in most cases). | ||
| #. Modify your SPRAS configuration file to set ``unpack_singularity: | ||
| true`` and ``containers.framework: singularity``. | ||
|
|
To deal with the --unlock bug
| #. Activate the spras conda environment and run the command ``snakemake --configfile <path to config file> --unlock`` | |
| .. note:: | |
| Whenever you change the config file, run ``snakemake --configfile <path to config file> --unlock`` before submitting jobs. Otherwise, the workflow will appear to complete immediately but is actually raising a ``snakemake.exceptions.LockException()``. | |
| scenario requires editing the SPRAS profile in | ||
| ``htcondor/spras_profile/config.yaml``. Make sure you specify the | ||
| correct container, and change any other config values needed by your | ||
| workflow (defaults are fine in most cases). |
| workflow (defaults are fine in most cases). | |
| workflow (defaults are fine in most cases). Memory and hardware requirements are also set here. To use a config file other than config/config.yaml, set the path next to the configfile: variable in this file. |
This is what my config.yaml looks like. I wasn't sure if we need to add the
... && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
stream_output: true
stream_error: true
parts to the documentation.
# Default configuration for the SPRAS/HTCondor executor profile. Each of these values
# can also be passed via command line flags, e.g. `--jobs 30 --executor htcondor`.
# NOTE: File paths in here should be relative to where you submit from, typically the
# root of the SPRAS repository
# 'jobs' specifies the maximum number of HTCondor jobs that can be in the queue at once.
jobs: 30
executor: htcondor
configfile: config/egfr.yaml
htcondor-jobdir: htcondor/logs
# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 2 times. After that, we assume there's
# a real error that requires user/admin intervention
retries: 2
# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
# picking up where it left off, and won't re-run steps that have already completed.
default-resources:
  job_wrapper: "htcondor/spras.sh"
  # If running in CHTC, this only works with apptainer images
  # Note requirement for quotes around the image name
  container_image: "test-htc.sif"
  universe: "container"
  # The value for request_disk should be large enough to accommodate the runtime container
  # image, any additional PRM container images, and your input data.
  request_disk: "16GB"
  request_memory: "12GB"
  retry_request_memory_increase: "RequestMemory + 4"
  retry_request_memory_max: "32GB"
  classad_WantGlideIn: true
  requirements: |
    '(HAS_SINGULARITY == True) && (Poolname =!= "CHTC") && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
  stream_output: true
  stream_error: true
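Since a misspelled key in this profile can be easy to miss, here is a minimal Python sketch that flags near-miss key names under `default-resources`. The `EXPECTED_KEYS` list is an assumption drawn from HTCondor submit command names and the profile in this thread, not an authoritative list of what the executor accepts, and `suspicious_keys` is a hypothetical helper.

```python
import difflib

# Assumed key names; extend this to match whatever your profile really uses.
EXPECTED_KEYS = sorted({
    "job_wrapper", "container_image", "universe", "request_disk",
    "request_memory", "retry_request_memory_increase",
    "retry_request_memory_max", "classad_WantGlideIn", "requirements",
    "stream_output", "stream_error",
})


def suspicious_keys(resources: dict) -> dict[str, str]:
    """Map each unexpected key to the closest expected key, if one is similar."""
    hints = {}
    for key in resources:
        if key in EXPECTED_KEYS:
            continue
        match = difflib.get_close_matches(key, EXPECTED_KEYS, n=1, cutoff=0.8)
        if match:
            hints[key] = match[0]
    return hints


# Hypothetical parsed `default-resources` mapping with one misspelled key.
resources = {"request_disk": "16GB", "stream_ouput": True, "stream_error": True}
print(suspicious_keys(resources))
```

Keys with no close match are ignored rather than reported, so custom entries do not trigger false alarms.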
| build images inside an interactive job on an Execution Point. If | ||
| you're working at CHTC, follow their guide for building Apptainer | ||
| images in an interactive job: | ||
| https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html |
| https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html | |
| https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html. Specifically, create the apptainer.sub file on the AP and run ``condor_submit -i apptainer.sub`` on the AP. |
| #. Instead of editing ``spras.sub`` to define the workflow, this | ||
| scenario requires editing the SPRAS profile in | ||
| ``htcondor/spras_profile/config.yaml``. Make sure you specify the | ||
| correct container, and change any other config values needed by your |
| correct container, and change any other config values needed by your | |
| correct SPRAS container image, and change any other config values needed by your |
|
|
||
| .. tip:: | ||
|
|
||
| It is best practice to make sure that the Snakefile you copy for your |
Potential update for this tip::
To avoid versioning issues, the Snakefile copied for your workflow in SPRAS should match the Snakefile baked into the SPRAS container image. When this workflow runs, the Snakefile you just copied will be used during remote execution instead of the Snakefile from the container. A mismatch between the repo version and the container can cause difficult-to-diagnose errors, including ModuleNotFoundError.
To keep these in sync, either:
- rebuild the SPRAS container locally, push it to Docker Hub, and use that image for submitting jobs, or
- check out the SPRAS repo at the release matching your container.
| universe: "container" | ||
| # The value for request_disk should be large enough to accommodate the runtime container | ||
| # image, any additional PRM container images, and your input data. | ||
| request_disk: "16GB" |
Do we need to add request_cpus = $(NUM_PROCS)?
| apptainer build images/omics-integrator-1_v2.sif docker://reedcompbio/omics-integrator-1:v2 | ||
| apptainer build images/pathlinker_v2.sif docker://reedcompbio/pathlinker:v2 |
People might run this directly on the AP if we don't remind them to use a build job again.
| - ✓ | ||
| - Convenience wrapper (in the repository root) around | ||
| ``snakemake_long.py``. | ||
|
|
The next section is what I found confusing. It gives instructions to create the .sif from the existing DockerHub image. That usually breaks. I recommend we remove it and only give instructions to build a new image from source.
This largely reformats the directory structure needed to run SPRAS workflows with HTCondor. In particular, it moves a lot of the helper code/submit files out of `docker-wrappers/SPRAS/` into a top-level `htcondor/` directory. I can do this now that the HTCondor executor has matured significantly, and can handle all the paths as they're configured in this diff.

To run a test SPRAS workflow, try following along with the instructions in `docs/htcondor.rst`. If anything is confusing, or you get hung up on any of the steps, let's discuss what I can do to make things more clear.