Signal handling and sweep runs

This page describes how W&B Sweeps handle system signals and process exit codes, to help you run sweeps reliably in preemptible environments such as SLURM, EC2 Spot, or Google Cloud preemptible VMs. It explains how to interrupt runs cleanly from the keyboard and how to understand and predict run requeue behavior. For details about how runs are requeued when preempted, see Resume preemptible Sweeps runs.

Exit status and signals

W&B uses the training process exit status to decide whether a run is requeued and how run state is recorded. Exit code contract:

Exit code 0: The run is considered to have completed successfully and is not requeued.
Non-zero exit code: The run is treated as failed or preempted. When you use mark_preempting(), W&B requeues the run so another agent (or the same agent after restart) can resume it.

This applies whether the process exits from a signal handler, from an exception, or from an explicit sys.exit() call. The contract matters most in preemptible or cluster environments, where runs are routinely interrupted.

When the process exits due to a catchable signal (SIGINT, SIGTERM, SIGHUP), your handler can run, log to W&B, and then call sys.exit(1) (or another non-zero code). W&B records that exit code and the same requeue rules apply.

When the process is killed with SIGKILL (for example, by the operating system kernel), it cannot run exit hooks, so no final summary is written and the run may appear as crashed or killed. The agent still starts the next run.
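To make the exit statuses concrete, the following sketch shows what a parent process (such as an agent) can observe when a child exits in each of these ways. It uses only the Python standard library, does not involve W&B, and assumes a POSIX system. The helper name `exit_status_of` is made up for this illustration.

```python
import signal
import subprocess
import sys


def exit_status_of(child_source: str) -> int:
    """Run child_source in a fresh Python process and return its exit status."""
    return subprocess.run([sys.executable, "-c", child_source]).returncode


# Normal completion: the child exits with status 0.
clean = exit_status_of("import sys; sys.exit(0)")

# A SIGTERM handler that exits explicitly with a non-zero code.
handled = exit_status_of(
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))\n"
    "signal.raise_signal(signal.SIGTERM)\n"
    "time.sleep(1)  # the handler normally runs before this line is reached\n"
)

# SIGKILL: no handler runs, and subprocess reports the negated signal number.
killed = exit_status_of(
    "import os, signal\n"
    "os.kill(os.getpid(), signal.SIGKILL)\n"
)

print(clean, handled, killed)
```

Python's subprocess module reports death by signal as a negative number, while a shell typically reports the same event as 128 plus the signal number. Either way the status is non-zero.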

Catchable signals (`SIGINT`, `SIGTERM`, `SIGHUP`)

You can register custom signal handlers in your training script. When a catchable signal is delivered, your handler runs; metrics already sent to W&B are preserved, and the agent detects the process exit and starts the next run. Best practices:

Register handlers early (for example, before entering the main training loop).
In the handler, log to W&B if useful, perform cleanup (for example, save a checkpoint), then exit with a non-zero code so requeue behavior is correct when using mark_preempting().

The following example registers handlers for SIGINT and SIGTERM, logs the signal to W&B, and exits with code 1:

import signal
import sys
import wandb


def signal_handler(signum, frame):
    if wandb.run is not None:
        wandb.log({"signal_received": signum})
    sys.exit(1)


def train():
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    with wandb.init() as run:
        config = wandb.config
        for epoch in range(100):
            # Training step; wandb.log(...) as needed
            pass


if __name__ == "__main__":
    train()

`SIGKILL` (uncatchable)

SIGKILL cannot be caught or ignored. The process terminates immediately with no chance to run handlers or atexit callbacks. W&B cannot write a final summary for the run. The agent still recovers and continues the sweep, but run data for that run is incomplete. Use SIGKILL only as a last resort; prefer SIGTERM or SIGINT when you need graceful shutdown.
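The difference between a catchable signal and SIGKILL can be observed with a small standard-library experiment. This is a sketch that assumes a POSIX system and does not use W&B. The `CHILD` script and the `cleanup_ran_after` helper are hypothetical names, and the marker file stands in for any cleanup work such as writing a final summary.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child script: registers an atexit hook that creates a marker file, plus a
# SIGTERM handler that exits with code 1, then waits to be signaled.
CHILD = """
import atexit, signal, sys, time
marker = sys.argv[1]
atexit.register(lambda: open(marker, "w").close())
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))
print("ready", flush=True)
time.sleep(30)
"""


def cleanup_ran_after(signum: int) -> bool:
    """Send signum to a fresh child and report whether its atexit hook ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup_ran")
        with subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
        ) as child:
            child.stdout.readline()  # wait until the child has set up its handlers
            child.send_signal(signum)
            child.wait()
        return os.path.exists(marker)


print("SIGTERM cleanup ran:", cleanup_ran_after(signal.SIGTERM))
print("SIGKILL cleanup ran:", cleanup_ran_after(signal.SIGKILL))
```

With SIGTERM the handler calls sys.exit(1), so the atexit hook runs and the marker file appears. With SIGKILL the process ends before any Python code can run, so the marker is never written.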

Forwarding signals from agent to child

The sweep agent runs your training script as a child process. By default, when you interrupt the agent (for example, with Ctrl+C, or when a scheduler sends SIGTERM to the job), the child training process does not receive the signal, so the training script cannot run its handler or call mark_preempting(). This behavior is described in GitHub #3667. To let the child shut down gracefully and optionally mark the run as preempting, forward signals from the agent to the child:

CLI: Run the agent with the --forward-signals flag: wandb agent --forward-signals entity/project/sweep_ID
Python: Pass forward_signals=True to wandb.agent()

When the agent receives SIGINT or SIGTERM, it forwards the signal to the child so your training script’s handler can run, call mark_preempting() and wandb.finish(exit_code=1) if needed, and exit with a non-zero code. See the wandb agent CLI reference and the wandb.agent() Python reference for details.
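Conceptually, signal forwarding means the parent relays a signal to its child instead of acting on it alone. The sketch below illustrates the general pattern with the Python standard library on a POSIX system. It is not W&B's implementation, and `run_with_forwarding` is a hypothetical helper written for this example.

```python
import signal
import subprocess
import sys


def run_with_forwarding(cmd, signals=(signal.SIGINT, signal.SIGTERM)):
    """Run cmd as a child, relay the listed signals to it, and return its exit status."""
    child = None

    def forward(signum, frame):
        # Relay the signal so the child's own handler gets a chance to run.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handlers before starting the child so that no
    # signal can arrive in between. Must be called from the main thread.
    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        child = subprocess.Popen(cmd)
        return child.wait()
    finally:
        # Restore whatever handlers were in place before.
        for s, handler in previous.items():
            signal.signal(s, handler)
```

For example, `run_with_forwarding([sys.executable, "train.py"])` (where `train.py` is a placeholder script name) would return the child's exit status after the child's own SIGTERM handler has run.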

Preemptible clusters like `SLURM`

On preemption, the training process must receive the signal, mark the run as preempting, and exit with a non-zero code so the run is requeued. A new agent (or the same agent after the job is requeued) can then resume the run. Ensure the training process receives the signal:

When the scheduler signals the agent: Run the agent with wandb agent --forward-signals so that when the scheduler (or user) sends a signal to the agent, the agent forwards it to the child. The child’s handler can then call mark_preempting(), wandb.finish(exit_code=1), and sys.exit(1).
When the scheduler signals the launch script (not the agent directly): Have the launch script send the preemption signal directly to the training process. For example, the training script writes its process ID to a file; the launch script traps the cluster signal (for example SIGUSR1) and runs kill -SIGUSR1 $(cat $PID_FILE) so the training process’s handler runs.
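The PID-file approach can be sketched as follows, assuming a POSIX system. All function names here are hypothetical, and the handler only sets a flag as a stand-in for real preemption logic. `signal_pid_from_file` mirrors in Python what the launch script's kill command does.

```python
import os
import signal


def write_pid_file(path):
    """Training side: record this process's ID so a launch script can find it."""
    with open(path, "w") as f:
        f.write(str(os.getpid()))


def make_preempt_flag():
    """Return (handler, flag). The handler only records that the signal arrived."""
    flag = {"preempted": False}

    def handler(signum, frame):
        # Stand-in only: a real handler would checkpoint and exit non-zero.
        flag["preempted"] = True

    return handler, flag


def signal_pid_from_file(path, signum=signal.SIGUSR1):
    """Launch-script side: read the PID file and signal that process."""
    with open(path) as f:
        pid = int(f.read().strip())
    os.kill(pid, signum)
    return pid
```

In the training script you would call `signal.signal(signal.SIGUSR1, handler)` and `write_pid_file(...)` before the training loop starts, so the handler is already in place when the launch script sends the signal.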

In the training script: Register a handler for the signal your cluster uses (for example SIGTERM or SIGUSR1). In the handler, call wandb.run.mark_preempting() if a run is active, then wandb.finish(exit_code=1) and sys.exit(1) so the run is requeued. See Resume preemptible Sweeps runs for the requeue table and mark_preempting() usage.

Sweep state: Run wandb sweep entity/project/sweep_ID --resume before starting the agent so the sweep is in resume mode and will hand out requeued runs.

Multi-agent coordination: When many agents run at once (such as SLURM array jobs), they can race to claim the same preempted run. This is a known limitation. Stagger agent startup or use external coordination mechanisms like locks to help work around this potential issue.
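A handler along the lines described for the training script might look like the sketch below. The helper `install_preemption_handler` is hypothetical. It takes the run object as an argument and calls `run.finish(exit_code=1)`, used here as the per-run counterpart of wandb.finish(exit_code=1). Treat it as a starting point under those assumptions, not as the reference implementation.

```python
import signal
import sys


def install_preemption_handler(run, signum=signal.SIGTERM):
    """Register a handler that marks `run` as preempting and exits with code 1.

    `run` is expected to be an active W&B run (for example, the object
    returned by wandb.init()), or None if no run is active.
    """

    def handler(received_signum, frame):
        if run is not None:
            run.mark_preempting()    # tell W&B the run should be requeued
            run.finish(exit_code=1)  # close the run with a non-zero exit code
        sys.exit(1)                  # non-zero exit so the requeue rules apply

    signal.signal(signum, handler)
    return handler
```

In a training script, you would call this once after wandb.init() and before the main training loop, passing the signal your cluster uses, such as `install_preemption_handler(run, signal.SIGUSR1)`.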

`wandb sweep --cancel`

You cancel a sweep using the W&B API, not an OS signal. Run a command like wandb sweep --cancel entity/project/sweep_ID. The server tells the agent to exit, and the agent then terminates running child processes and stops. There can be a short delay (on the order of the agent’s API polling interval) before cancellation takes effect. Runs are terminated abruptly. The child processes have no chance to run user-defined signal handlers. Use --cancel when you want to stop the entire sweep and mark it cancelled. For graceful shutdown of the current run, send a signal to the run (or use --forward-signals and signal the agent). For graceful sweep completion, use wandb sweep --stop instead of --cancel. See Manage sweeps for pause, resume, stop, and cancel options.

Killing the agent vs the run

If you send a signal to the agent process (not the child training process), the agent may exit while the child continues running as an orphan. The orphan may keep printing to your terminal, and the shell may not show a new prompt until you press Enter. To confirm the agent has exited, use an OS command like ps -p <agent_pid> or pgrep -f "wandb agent" instead of relying on prompt appearance.
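If you prefer to check from Python, the following sketch performs roughly the same existence check as ps -p on a POSIX system. The helper name `pid_is_running` is made up for this example.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the target
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True
```

Note that a process that has exited but has not yet been reaped by its parent (a zombie) still counts as existing for this check.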
