Update htcondor instructions#459

Open
jhiemstrawisc wants to merge 11 commits into
Reed-CompBio:mainfrom
jhiemstrawisc:update-htcondor-instructions

Conversation

@jhiemstrawisc

Collaborator

This largely reformats the directory structure needed to run SPRAS workflows with HTCondor. In particular, it moves a lot of the helper code/submit files out of docker-wrappers/SPRAS/ into a top-level htcondor/ directory. I can do this now that the HTCondor executor has matured significantly, and can handle all the paths as they're configured in this diff.

To run a test SPRAS workflow, try following along with the instructions in docs/htcondor.rst. If anything is confusing, or you get hung up on any of the steps, let's discuss what I can do to make things more clear.

@jhiemstrawisc jhiemstrawisc requested a review from agitter January 23, 2026 21:56
@read-the-docs-community

read-the-docs-community Bot commented Jan 23, 2026


Documentation build overview

📚 spras | 🛠️ Build #33063685 | 📁 Comparing d313d79 against latest (d990664)

  🔍 Preview build  

2 files changed
± htcondor.html
± fordevs/spras.html

@tristan-f-r tristan-f-r left a comment

Collaborator

[I'll have to restore my HTCondor access to follow this.]

Comment thread htcondor/spras_profile/config.yaml
Comment thread docs/htcondor.rst Outdated
Comment thread run_htcondor.sh Outdated

@tristan-f-r tristan-f-r left a comment

Collaborator

Some code nitpicks

Comment thread htcondor/snakemake_long.py Outdated
Comment thread htcondor/snakemake_long.py Outdated
@tristan-f-r tristan-f-r added the documentation Improvements or additions to documentation label Jan 24, 2026

@agitter agitter left a comment

Collaborator

I'm testing the Snakemake long execution mode. The first time my jobs went on hold because I put my spras-v0.6.0.sif file in the htcondor/ directory instead of the root directory. That should have been obvious based on the comment in the .yaml file.

On the second attempt my jobs went on hold with

Transfer output files failure at execution point slot1_24@e2591.chtc.wisc.edu while sending files to access point ap2001. Details: 1 total failures: first failure: reading from file /var/lib/condor/execute/slot1/dir_3699332/scratch/output: (errno 2) No such file or directory

Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread docs/htcondor.rst Outdated
Comment thread htcondor/spras.sub Outdated
Comment thread htcondor/spras.sub
log = logs/spras_$(Cluster)_$(Process).log
output = logs/spras_$(Cluster)_$(Process).out
error = logs/spras_$(Cluster)_$(Process).err
log = htcondor/logs/spras_$(Cluster)_$(Process).log

Collaborator

Do we want one per cluster or one per cluster_process pair?

Collaborator Author

I think one per cluster/process is the right way to go. In theory, one could still queue N>1.
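
To illustrate the difference being discussed, here is a minimal Python sketch (not part of the PR) of how the log name template expands when a submit file queues more than one job. The helper `expand_log_name` and the cluster ID 1234 are hypothetical; HTCondor performs this substitution itself at submit time.

```python
def expand_log_name(template: str, cluster: int, process: int) -> str:
    """Mimic HTCondor's $(Cluster)/$(Process) macro substitution, for illustration only."""
    return template.replace("$(Cluster)", str(cluster)).replace("$(Process)", str(process))


per_pair = "htcondor/logs/spras_$(Cluster)_$(Process).log"
per_cluster = "htcondor/logs/spras_$(Cluster).log"

# Hypothetical cluster ID 1234 submitted with `queue 3`, i.e. processes 0, 1, and 2.
pair_logs = [expand_log_name(per_pair, 1234, p) for p in range(3)]
cluster_logs = [expand_log_name(per_cluster, 1234, p) for p in range(3)]

print(pair_logs)               # three distinct files, one per job
print(len(set(cluster_logs)))  # 1: all three jobs would share a single file
```

With the per-pair template, queueing N>1 jobs keeps each job's log separate, which is the behavior argued for above.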

@jhiemstrawisc

Collaborator Author

I converted this to a draft because these docs will depend on the explicit sif transfer PR, and I haven't yet tested everything here in that paradigm.

@jhiemstrawisc

Collaborator Author

Also, apologies for the poor git etiquette in the last commit that rolled too many things into one diff (including running an rst formatter). I think some of the updates are sufficiently large that the whole file should more or less be re-assessed as a fresh document.

@jhiemstrawisc jhiemstrawisc force-pushed the update-htcondor-instructions branch from 178a93e to fe7fbbc Compare April 8, 2026 21:17
@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Apr 17, 2026
jhiemstrawisc and others added 6 commits June 8, 2026 10:25
…gging

I was tired of hacking around wanting verbose logging in the HTCondor
Snakemake executor, so I added some plumbing to pass Snakemake's
'--verbose' flag through 'snakemake_long.py' to snakemake itself.

Additionally, I added '--env-manager' so I could run things with my
preferred mamba env instead of conda (which is too slow to rebuild).
The executor has matured quite a bit since these instructions were
first drafted, and it's my hope that these changes remove a lot of
the headache for running jobs.

Now, you can edit config files in `config/` and use the `input/`
directory directly. Workflows should be submitted directly from the
repository root.
Co-authored-by: Tristan F.-R. <pub.tristanf@gmail.com>
@jhiemstrawisc jhiemstrawisc force-pushed the update-htcondor-instructions branch from fe7fbbc to ceea753 Compare June 8, 2026 15:46
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Jun 8, 2026
These came from testing Neha's real workflow in June 2026. Not totally
sure how they all work (and whether additional environment variables
will need to be added in the future), but they were key to getting
custom sif images to unpack alongside the jobs.
@jhiemstrawisc jhiemstrawisc marked this pull request as ready for review June 8, 2026 21:47
@jhiemstrawisc

Collaborator Author

This is a note for myself -- one thing I should document in the htcondor rst is the need to pre-create apptainer images before launching workflows.

@ntalluri ntalluri self-requested a review June 9, 2026 17:54
Add guidance to docs/htcondor.rst encouraging users to pre-build
per-algorithm container images rather than pulling them at runtime,
and steer them toward the proper place to build those images.

Also add a warning against running `apptainer build` directly on a
shared Access Point, pointing users to CHTC's guide for building
images in an interactive job.
@jhiemstrawisc jhiemstrawisc requested a review from agitter June 9, 2026 20:28

@agitter agitter left a comment

Collaborator

During my testing, I triggered a Snakemake lock error by launching a long job, killing it, changing the config file, and relaunching. That may be a common error.

python3.11/site-packages/snakemake/persistence.py", line 211, in lock
    raise snakemake.exceptions.LockException()
snakemake.exceptions.LockException: Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

LockException:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/agitter/spras
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

Should we add it to troubleshooting?

I also hit this error

$ cat htcondor/logs/merge_input/merge_input-5_7645955.err
ModuleNotFoundError in file "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8:
No module named 'spras.config.revision'
  File "/var/lib/condor/execute/slot1/dir_1046931/scratch/Snakefile", line 8, in <module>

I'm guessing that means I need a newer version of the SPRAS sif image. However, we haven't released a SPRAS version recently. What version of the image are you testing with?

@ntalluri

ntalluri commented Jun 12, 2026

Collaborator

@agitter

For the first bug, this requires the spras conda environment to be activated and then the command snakemake --configfile <path to config file> --unlock to be run anytime the config file of choice is updated. I was planning on commenting this as step five for the parallel jobs.

For the second issue, you are right: the version of SPRAS in the v0.6 Docker image isn't up to date with the current version of SPRAS. Justin has a Docker image you can pull from Docker Hub (I think it is jhiemstra/spras:update-htcondor-instructions-v2). Otherwise, you will need to build an image of the updated version of SPRAS with Docker on your local machine, push it to Docker Hub, and then use that image.

@jhiemstrawisc

Collaborator Author

For the second issue, you are right, this is because the version of SPRAS in the docker image v0.6 isn't up to date with the current version of SPRAS.

In general, you should either:

  • always rebuild the SPRAS container to match the version of the repo you're working with, OR
  • check out the repo at a specific release (e.g. git checkout 0.6.0) to match the container you want to use

The key is that the repo you're using to submit from the AP should match what's in the image.

There's a callout relatively early in the documentation covering this, but I'm open to edits, since it seems like this is often missed:

   It is best practice to make sure that the Snakefile you copy for your
   workflow is the same version as the Snakefile baked into your
   workflow's container image. When this workflow runs, the Snakefile
   you just copied will be used during remote execution instead of the
   Snakefile from the container. As a result, difficult-to-diagnose
   versioning issues may occur if the version of SPRAS in the remote
   container doesn't support the Snakefile on your current branch. The
   safest bet is always to create your own image so you always know
   what's inside of it.
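
As a rough illustration of the "repo should match the image" rule, the following Python sketch compares a version embedded in an image file name against a repository tag. It assumes the image is named like `spras-v0.6.0.sif` (as in the earlier test report) and that the tag string comes from something like `git describe --tags`; the helpers `image_version` and `versions_match` are hypothetical and not part of SPRAS.

```python
import re
from typing import Optional


def image_version(sif_name: str) -> Optional[str]:
    """Extract a version such as '0.6.0' from a name like 'spras-v0.6.0.sif'."""
    match = re.search(r"v(\d+\.\d+\.\d+)\.sif$", sif_name)
    return match.group(1) if match else None


def versions_match(sif_name: str, repo_tag: str) -> bool:
    """True only when the image name carries a version equal to the repo tag."""
    version = image_version(sif_name)
    # Accept tags written either as '0.6.0' or 'v0.6.0'.
    return version is not None and version == repo_tag.lstrip("v")


print(versions_match("spras-v0.6.0.sif", "0.6.0"))  # True
print(versions_match("spras-v0.6.0.sif", "0.5.0"))  # False
print(versions_match("test-htc.sif", "0.6.0"))      # False: no version in the name
```

A name-based check like this only catches mismatches when images follow a versioned naming convention; for custom images built from a branch, rebuilding remains the safer option described above.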

@ntalluri ntalluri left a comment

Collaborator

The updated documentation looks great, I added some suggestions on how to help users more.

I also remember we were changing the config.yaml file in spras_profile, and I wasn't sure if any of the commands needed to be added to the documentation.

Comment thread docs/htcondor.rst
images:
  omicsintegrator1: "images/omics-integrator-1_v2.sif"
  pathlinker: "images/pathlinker_v2.sif"

Collaborator

Could you add to this yaml block to show what you mean by the names having to match what is in the config?

Suggested change
...
algorithms:
  - name: "pathlinker"
    include: true
    runs:
      run1:
        k: 10
  - name: "omicsintegrator1"
    include: true
    runs:
      run1:
        b: 5
        w: 0
        d: 10
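
To make the name-matching requirement concrete, here is a small Python sketch that checks whether every included algorithm has an entry in the `images` mapping. The function `missing_images` is hypothetical (it is not part of SPRAS), and the dictionaries below simply mirror the YAML quoted in this thread after parsing.

```python
def missing_images(algorithms: list, images: dict) -> list:
    """Return names of included algorithms that have no pre-built image entry."""
    return [
        algorithm["name"]
        for algorithm in algorithms
        if algorithm.get("include") and algorithm["name"] not in images
    ]


# Mirrors the `images:` block from docs/htcondor.rst quoted above.
images = {
    "omicsintegrator1": "images/omics-integrator-1_v2.sif",
    "pathlinker": "images/pathlinker_v2.sif",
}

# Mirrors the suggested `algorithms:` block.
algorithms = [
    {"name": "pathlinker", "include": True, "runs": {"run1": {"k": 10}}},
    {"name": "omicsintegrator1", "include": True, "runs": {"run1": {"b": 5, "w": 0, "d": 10}}},
]

print(missing_images(algorithms, images))  # []
print(missing_images(algorithms, {"pathlinker": "images/pathlinker_v2.sif"}))  # ['omicsintegrator1']
```

The keys under `images` are compared against each algorithm's `name` exactly, so a spelling difference such as `omics-integrator-1` versus `omicsintegrator1` would show up as a missing image.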

Comment thread docs/htcondor.rst
Second, it requires an experimental executor for HTCondor that has been
forked from the upstream `HTCondor Snakemake executor
<https://github.com/htcondor/snakemake-executor-plugin-htcondor>`__.
#. Build/activate the SPRAS conda/mamba environment and ``pip install``

Collaborator

Could you add a note block with a link on how to set up the conda environment on an AP?

Comment thread docs/htcondor.rst
(EP), you'll need to set up three things:

#. You'll need to modify ``htcondor/spras.sub`` to point at your
container image, along with any other configuration changes you want

Collaborator

I got confused about what specific image this was after reading about all the image stuff above.

Suggested change
container image, along with any other configuration changes you want
SPRAS container image, along with any other configuration changes you want

Comment thread docs/htcondor.rst

#. You'll need to modify ``htcondor/spras.sub`` to point at your
container image, along with any other configuration changes you want
to make like choosing a logging directory or toggling OSPool

Collaborator

Can you add a note on what toggling the OSPool submission looks like for a user in spras.sub?

.. note::

In spras.sub uncomment +WantGlideIn = true and requirements = (HAS_SINGULARITY == True) && (Poolname =!= "CHTC")

Collaborator

I thought the instructions in the .sub file comments were clear enough, but I'm not totally opposed to this.

Comment thread docs/htcondor.rst
values needed by your workflow (defaults are fine in most cases).
#. Modify your SPRAS configuration file to set ``unpack_singularity:
true`` and ``containers.framework: singularity``.

Collaborator

To deal with the --unlock bug

Suggested change
#. Activate the spras conda environment and run the command ``snakemake --configfile <path to config file> --unlock``
.. note::
   Whenever you change the config file, run ``snakemake --configfile <path to config file> --unlock`` before submitting jobs. Otherwise, the workflow will appear to complete immediately but is actually raising a ``snakemake.exceptions.LockException()``.
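
For anyone scripting the unlock step discussed in this thread, a minimal Python sketch follows. It only assembles the command quoted above; the helper name `unlock_command` is hypothetical, and `config/egfr.yaml` is used as the example path because it appears in the profile shared in this review. It assumes the spras environment is already activated and that the script is run from the repository root.

```python
import shlex


def unlock_command(configfile: str) -> list:
    """Build the `snakemake --unlock` invocation for a given SPRAS config file."""
    return ["snakemake", "--configfile", configfile, "--unlock"]


cmd = unlock_command("config/egfr.yaml")
print(shlex.join(cmd))  # snakemake --configfile config/egfr.yaml --unlock

# To actually clear a stale lock, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Only run the unlock when no other Snakemake process is using the directory, as the LockException message itself warns.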

Comment thread docs/htcondor.rst
scenario requires editing the SPRAS profile in
``htcondor/spras_profile/config.yaml``. Make sure you specify the
correct container, and change any other config values needed by your
workflow (defaults are fine in most cases).

Collaborator

Suggested change
workflow (defaults are fine in most cases).
workflow (defaults are fine in most cases). Memory and hardware requirements are also set here. To use a config file other than config/config.yaml, set the path next to the configfile: variable in this file.

Collaborator

This is what my config.yaml looks like. I wasn't sure if we need to add the:

... && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)'
stream_output: true
stream_error: true

parts to the documentation

# Default configuration for the SPRAS/HTCondor executor profile. Each of these values
# can also be passed via command line flags, e.g. `--jobs 30 --executor htcondor`.

# NOTE: File paths in here should be relative to where you submit from, typically the
# root of the SPRAS repository

# 'jobs' specifies the maximum number of HTCondor jobs that can be in the queue at once.
jobs: 30
executor: htcondor
configfile: config/egfr.yaml
htcondor-jobdir: htcondor/logs

# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 2 times. After that, we assume there's
# a real error that requires user/admin intervention
retries: 2

# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
# picking up where it left off, and won't re-run steps that have already completed.
default-resources:
  job_wrapper: "htcondor/spras.sh"
  # If running in CHTC, this only works with apptainer images
  # Note requirement for quotes around the image name
  container_image: "test-htc.sif"
  universe: "container"
  # The value for request_disk should be large enough to accommodate the runtime container
  # image, any additional PRM container images, and your input data.
  request_disk: "16GB"
  request_memory: "12GB"
  retry_request_memory_increase: "RequestMemory + 4"
  retry_request_memory_max: "32GB"
  classad_WantGlideIn: true
  requirements: |
    '(HAS_SINGULARITY == True) && (Poolname =!= "CHTC") && versionGE(split(Target.CondorVersion)[1], "24.8.0") && (isenforcingdiskusage =!= true)' 
  stream_output: true
  stream_error: true

Comment thread docs/htcondor.rst
build images inside an interactive job on an Execution Point. If
you're working at CHTC, follow their guide for building Apptainer
images in an interactive job:
https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html

Collaborator

Suggested change
https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html
https://chtc.cs.wisc.edu/uw-research-computing/apptainer-htc.html. Specifically, create the apptainer.sub file on the AP and run ``condor_submit -i apptainer.sub`` on the AP.

Comment thread docs/htcondor.rst
#. Instead of editing ``spras.sub`` to define the workflow, this
scenario requires editing the SPRAS profile in
``htcondor/spras_profile/config.yaml``. Make sure you specify the
correct container, and change any other config values needed by your

Collaborator

Suggested change
correct container, and change any other config values needed by your
correct SPRAS container image, and change any other config values needed by your

Comment thread docs/htcondor.rst

.. tip::

It is best practice to make sure that the Snakefile you copy for your

Collaborator

Potential update for this tip::

To avoid versioning issues, the Snakefile copied for your workflow in SPRAS should match the Snakefile baked into the SPRAS container image. When this workflow runs, the Snakefile you just copied will be used during remote execution instead of the Snakefile from the container. A mismatch between the repo version and the container can cause difficult-to-diagnose errors, including ModuleNotFoundError.

To keep these in sync, either:

  • rebuild the SPRAS container locally, push it to Docker Hub, and use that image for submitting jobs, or
  • check out the SPRAS repo at the release matching your container

universe: "container"
# The value for request_disk should be large enough to accommodate the runtime container
# image, any additional PRM container images, and your input data.
request_disk: "16GB"

Collaborator

Do we need to add request_cpus = $(NUM_PROCS)?

Comment thread docs/htcondor.rst
Comment on lines +182 to +183
apptainer build images/omics-integrator-1_v2.sif docker://reedcompbio/omics-integrator-1:v2
apptainer build images/pathlinker_v2.sif docker://reedcompbio/pathlinker:v2

Collaborator

People might run this directly on the AP if we don't remind them to use a build job again.
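
One way to make that reminder harder to miss is a small guard around the build commands. The Python sketch below is only an illustration: `inside_condor_job` and `build_commands` are hypothetical helpers, and the guard relies on the assumption that HTCondor sets the `_CONDOR_SCRATCH_DIR` environment variable inside running jobs (including interactive ones) but not in a plain login shell on the AP.

```python
import os
from typing import Mapping


def inside_condor_job(env: Mapping[str, str] = os.environ) -> bool:
    """Heuristic: HTCondor jobs see _CONDOR_SCRATCH_DIR; an AP login shell does not."""
    return "_CONDOR_SCRATCH_DIR" in env


def build_commands(images: Mapping[str, str]) -> list:
    """Turn {sif_path: docker_source} into `apptainer build` argument lists."""
    return [["apptainer", "build", sif, source] for sif, source in images.items()]


# The two images from the documentation lines under review.
images = {
    "images/omics-integrator-1_v2.sif": "docker://reedcompbio/omics-integrator-1:v2",
    "images/pathlinker_v2.sif": "docker://reedcompbio/pathlinker:v2",
}

if inside_condor_job():
    for cmd in build_commands(images):
        print(" ".join(cmd))  # swap print for subprocess.run(cmd, check=True) to build
else:
    print("Not inside an HTCondor job; start an interactive build job before building images.")
```

Printing the commands first keeps the sketch harmless when run by accident; the actual build would still happen inside the interactive job that CHTC's guide describes.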

Comment thread docs/htcondor.rst
- ✓
- Convenience wrapper (in the repository root) around
``snakemake_long.py``.

Collaborator

The next section is what I found confusing. It gives instructions to create the .sif from the existing DockerHub image. That usually breaks. I recommend we remove it and only give instructions to build a new image from source.

Labels

documentation Improvements or additions to documentation

4 participants