kai

CLI for submitting and managing jobs on a KAI Scheduler cluster.

Getting started

Step 1 — Install kai

sh -c "$(curl -fsSL https://raw.githubusercontent.com/BrachioLab/kai/main/install.sh)"

You will be prompted for the configs repository and your lab namespace. The installer checks that your account exists in the repo before proceeding — if it doesn't, ask your lab manager to run kai add-user --name <you> first.

The installer also sets up automatic update checks on every login.

Then start a new shell (or run source ~/.bashrc / source ~/.zshrc) so kai is on your PATH.

Step 2 — Get your kubeconfig from the lab manager

Your lab manager will send you kai-kubeconfig-<you>.yaml via a secure channel (Slack DM, encrypted email, etc.). Keep this secret — treat it like a password.

Step 3 — Set up kai

kai setup kai-kubeconfig-<you>.yaml

This installs your kubeconfig and fetches your CLI config automatically from the configs repo.

That's it — you're ready to submit jobs.

Submitting jobs

# Run a script on 1 GPU (default lane: preemptible — see "Queues / lanes" below)
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 -- python train.py

# Guaranteed run on your lab's own nodes (non-preemptible, protected up to quota)
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 --queue priority -- python train.py

# Borrow any idle A6000 anywhere (preemptible; yields when an owner needs it)
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 --queue preemptible --gpu-type a6000 -- python sweep.py

# Interactive session (opens a shell inside the container)
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 --interactive

# Mount a local directory
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 -v /data/datasets:/data -- python train.py

# Mount your home directory at the same path (so ~ and relative paths line up)
kai submit --image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime --gpu 1 --mount-home -- python ~/proj/train.py

Email alerts (opt-in)

Get emailed about your job. All alerts are opt-in — nothing is sent unless you ask. Mail goes to <your-username>@upenn.edu by default.

Flag	Emails you when…
`--alert-finish`	the job completes or fails
`--alert-runtime DUR`	it's been running longer than `DUR` (e.g. `24h`, `90m`, `30s`)
`--alert-mem [PCT]`	a container's memory reaches `PCT`% of its limit (default `90`; OOM early-warning)
`--alert-email ADDR`	(modifier) send alerts to `ADDR` instead of `<username>@upenn.edu`

# email me when it finishes (or fails)
kai submit --image … --gpu 1 --alert-finish -- python train.py

# warn me if it runs past 12h or memory crosses 85% of the limit; send elsewhere
kai submit --image … --gpu 1 --memory 64Gi \
  --alert-runtime 12h --alert-mem 85 --alert-email me@upenn.edu -- python big.py

Each alert fires once per job. --alert-mem needs a memory limit on the job (--memory) and metrics-server in the cluster. Alerts are delivered by an in-cluster controller (see alerts/); nothing extra is needed on your side.

Identity & home inside the container

Jobs run as your UID/GID (not root) so files you write are owned by you. kai fills in sane defaults for that user so common tools don't trip over the image's missing /etc/passwd entry — you can override any of them with -e:

USER / LOGNAME = your username (stops getpass.getuser(), torch, pip, etc. from crashing with "uid not found").
HOME = /tmp (writable scratch), or your mounted home path when you pass --mount-home.

--mount-home bind-mounts your host home directory into the container at the same path and sets $HOME to it. Because the job runs as your UID, writes are owned by you — and on an NFS root_squash home, your UID is exactly the one allowed to write. Pass --root to run as root instead (these defaults are then skipped).

Queues / lanes

You pick a lane with --queue; that's the only knob — it decides both where your job is guaranteed and whether it can be evicted. kai prepends your lab's namespace automatically (--queue priority → <yourlab>-priority).

`--queue`	What you get
(omitted) / `preemptible`	Borrow — runs on any idle GPU, preemptible: reclaimed when an owner needs that node. The safe default; an accident can't block the cluster.
`priority`	Guaranteed — runs on your lab's own nodes, non-preemptible, protected up to your lab's quota. Use for runs that must not be evicted.

Pick hardware with --gpu-type (e.g. --gpu-type a6000), or a specific machine with --node <hostname>. There is no --priority flag — the lane sets it.

Job files

Instead of long flag lists, describe a job in YAML and submit it (CLI flags override file values):

kai submit -f job.yaml

# job.yaml — keys mirror the flags
name: train-run
image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
gpu: 1
queue: priority
memory: 64Gi
volumes: [ /data/datasets:/data ]
env: { WANDB_PROJECT: myproj }
ports: [ 8888 ]
command: python train.py --epochs 50

More submit options

Flag	Use
`--name`, `-n`	name the job (otherwise auto-generated)
`--cpu` / `--memory` (`--mem`)	resource requests, e.g. `--cpu 8 --memory 64Gi` (also sets the limit alert baseline)
`--env`, `-e KEY=VALUE`	set env vars (repeatable)
`--volume`, `-v HOST:CTR[:ro]`	mount host paths (repeatable)
`--workdir`, `-w PATH`	working directory in the container
`--large-shm`	bigger `/dev/shm` (needed by PyTorch dataloaders/NCCL)
`--port`, `-p` + `--forward`	declare a port and port-forward it to your machine (e.g. Jupyter/TensorBoard)
`--host-network`	use the node's network namespace
`--attach`	stream stdin/stdout (interactive run, not a shell)
`--backoff-limit N`	retries after preemption/eviction (default 10)
`--root`	run as root instead of your UID/GID

Managing jobs

kai list                  # show all your jobs and their status
kai pods                  # show the underlying pods (wide)
kai logs <job>            # print recent logs
kai logs <job> -f         # stream logs live
kai bash <job>            # open an interactive bash shell inside a running job
kai exec <job> -- <cmd>   # run a one-off command in the job's pod
kai describe <job>        # detailed job info and events
kai port-forward <job> 8888   # forward a port from a running job (alias: kai pf)
kai delete <job>          # cancel and remove a job

Cluster info

kai gpus                  # GPU used/free across all nodes
kai nodes                 # node health, GPU usage, conditions
kai quota                 # your queues' GPU quota/used/free (+ the nodes each is pinned to)
kai status                # all resources in your namespace
kai queue list            # all queues, quotas, and pinned nodes

Lab managers

# add a user to your lab (creates their RBAC + kubeconfig, pushes their CLI config)
kai add-user --name <user> [--role researcher|lab-manager]
kai add-user --name <user> --queue collaborators --allowed-queues <lab>-collaborators

kai push-config --name <user> [--queue <lane>] [--allowed-queues <q>]   # update a user's config (no cluster change)

kai queue create --name <lab>-<lane> --quota N --priority 100 --parent locust   # new queue leaf
kai queue delete <name>

Send the generated kai-kubeconfig-<user>.yaml to the user securely; they run kai setup <that-file>. (A lab manager can self-serve their own lab's members; cross-lab/admin setup needs cluster-admin — see the cluster's onboarding runbook.)

Updates

kai checks for updates automatically on every login. You will be prompted before anything is applied. To check manually:

kai update                # check for a config update
kai self-update           # check for a kai binary update

To apply without being prompted (e.g. in a script):

kai update --force
kai self-update --force

Troubleshooting

kai: command not found — run source ~/.bashrc (or ~/.zshrc) to pick up the PATH change from the installer, or start a new terminal.

error: namespace not set — you haven't run kai setup yet, or the config file wasn't found at ~/.kai/config.yaml.

error: unable to connect to cluster — your kubeconfig may be missing or expired. Ask your lab manager for a new one and re-run kai setup.

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
alerts		alerts
README.md		README.md
distribute-kubeconfigs.sh		distribute-kubeconfigs.sh
install.sh		install.sh
kai		kai

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

kai

Getting started

Step 1 — Install kai

Step 2 — Get your kubeconfig from the lab manager

Step 3 — Set up kai

Submitting jobs

Email alerts (opt-in)

Identity & home inside the container

Queues / lanes

Job files

More submit options

Managing jobs

Cluster info

Lab managers

Updates

Troubleshooting

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

kai

Getting started

Step 1 — Install kai

Step 2 — Get your kubeconfig from the lab manager

Step 3 — Set up kai

Submitting jobs

Email alerts (opt-in)

Identity & home inside the container

Queues / lanes

Job files

More submit options

Managing jobs

Cluster info

Lab managers

Updates

Troubleshooting

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages