Persistent Agent Presence on a Condo HTC Slice

Status & handoff (2026-06-29)

Branch feat/confined-sbatch-launch off main. Steps 1-4 of the restructure are DONE, committed, and green (454 tests). Not pushed. Two adversarial-review rounds (3 reviewers each) were run and folded: a design red-team before writing step 3/4, then a code red-team on the implemented diff (see the per-step notes below for the survivors). Remaining: step 5 (flip confined: true) + the live-cluster validation pass – both HUMAN-GATED (need the coder tunnel + OTP).

Code red-team survivors folded (af0627f scope, fixed in a follow-up): reuse-probe now keys on _confined_job_state (ANY non-empty squeue state = live, so a SUSPENDED/PREEMPTED job is not resubmitted-over) and refuses to srun-attach a still-PENDING job (it would block); sbatch runs check=False and raises a MirrorError with a scancel hint on a post-submit transport drop (was an uncaught CommandError masking a leak); RemoteSession.save is now atomic (temp + os.replace + fsync) so a mid-write failure can't truncate the file holding the job id (which would load blank -> resubmit -> leak); confined attach --node prints a note instead of silently dropping it; release's sibling-detach is confined-aware (defensive; unreachable today). Test hardening: unconfined attach pinned to byte-identity (catches a dropped srun prefix on the || new-session fallback), plus new tests for save-failure, reuse-detached / reuse-PENDING / reuse-not-ready, _resolve_remote_home, and env-threading.
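
The atomic-save fix above (temp + os.replace + fsync) can be sketched as follows. This is a minimal illustration, not RemoteSession.save itself; the helper name atomic_write_text is hypothetical, and the sketch does not fsync the containing directory.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so a reader sees the old file or the new one, never a partial."""
    path = Path(path)
    # The temp file sits in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap on POSIX
    except BaseException:
        # A failed write leaves the previous file untouched; drop the temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because the job id is only ever replaced by a complete file, a mid-write failure cannot produce the blank-load -> resubmit -> leak chain described above.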

Mechanism: cluster-validated. Spike on n0055.savio4 confirmed an sbatch batch task runs in the job cgroup, nproc=4 in the tmux pane, claude on PATH, persists. Required detail it surfaced: a per-mirror dedicated tmux socket (tmux -L sucoder-<mirror>) – a shared socket reuses another session's server (wrong cgroup) and dies.

Built + tested (gated by slurm.confined, default off -> existing targets byte-identical): config.py confined flag; mirror.py:_build_batch_script (the sbatch body: -L socket, env-carry, cd, -A -d, keeper); cli.py:_build_sbatch_command. Tests: tests/test_batch_script.py, tests/test_sbatch_command.py.

Remaining: the allocation/launch restructure (this section + "Red-team outcome" + "Must-keep fixes" below are the spec):

  1. [DONE] _build_executor: skip salloc for confined (the sbatch fuses allocate+launch). Confined now returns a login-node executor (is_compute_node=False, no compute-node proxy fields, NFS mirror root – never compute-node-local disk). Gated on remote.slurm.confined; non-confined targets byte-identical.
  2. [DONE] Extracted mirror.py:MirrorManager._build_remote_agent_cmd_str from launch_agent (the ; exec bash -l + prelude-externalize block) so the confined flow can reuse it. Because step 1 makes the confined executor a login-node executor, _externalize_prelude now writes the prelude over the login node (NFS), not a not-yet-existent compute node. Added a MirrorContext.confined property and a guard: launch_agent refuses a confined target (MirrorError) rather than silently running tmux on the login node – step 3 replaces the guard with the real submit+attach branch. Tests: test_mirror.py (_build_remote_agent_cmd_str, confined-guard), test_cli.py (_build_executor confined-skips-salloc + control). Ordering invariant: keep step 5 (flip confined: true) LAST – until step 3 lands, a confined target hits the guard, not a launch.
  3. [DONE] Confined collaborate via mirror.py:_launch_confined (and helpers), replacing the step-2 guard in launch_agent. Flow: reuse-probe -> build agent cmd (prelude to NFS, bash -lc wrap so the login env/PATH/nvm resolve) -> stage batch script to NFS (absolute path; sbatch won't expand $HOME) -> sbatch -> persist job id immediately (pre-poll, so a poll failure leaves the job RECORDED not leaked) -> bounded three-way RUNNING-poll -> persist node -> confirm tmux session -> attach (srun --overlap --pty tmux -L) for an interactive launch / return for a detached (renew) relaunch. A 3-reviewer design red-team hardened it: (a) the reuse-probe and poll distinguish a squeue failure (raise, never resubmit – a live job would orphan) from a job gone (resubmit); (b) defensive sbatch-id parse (<id>;<cluster> / MOTD lines); (c) three-way poll (RUNNING+node / PENDING-wait / empty=terminal) never persists the literal state word as a node; (d) target_name on MirrorManager, derived identically to _build_executor, so the job id lands in the same session file attach/release/renew read; (e) the batch script catches a new-session failure and emits a SUCODER: marker + exit 1 instead of silently COMPLETEing exit 0; (f) shared sanitized confined_tmux_target / confined_attach_command helpers so launch and attach use one session+socket name. _build_sbatch_command moved to mirror.py (re-exported by cli) to dodge the import cycle. Tests: test_mirror.py (parser, three-way poll, reuse-probe failure-vs-gone, submit/persist/attach, persist-before-poll, detached-no-attach, bash-lc, session-not-ready), test_batch_script.py (failure marker). Deferred (flagged): confined has no in-tmux 30/15/5-min deadline warning – _start_slurm_timer needs a compute-node ControlMaster confined never opens; renew.py (reads SLURM state directly) + --time backstop cover turnover, so this is a lost courtesy not a safety hole.
  4. [DONE] Gated attach/release/renew for confined. attach: a confined-only branch forces via_srun (direct SSH to the compute node escapes the cgroup), builds the command from confined_attach_command (srun --overlap --pty tmux -L <socket> attach-session, no || new-session fallback), sanitized session name; the unconfined attach_cmd stays byte-identical (regression-tested). release: unchanged – scancel <job_id> is name-independent and the sibling-detach branch is unreachable for confined (each sbatch is its own job, so holders_of_job is empty). renew: the sbatch relaunch goes through _launch_confined (step 3), which blocks until RUNNING (the poll), so the overlap-scancel no-gap invariant holds; _adopt_* is naturally skipped (confined bypasses _ensure_slurm_node); request_checkpoint writes the renew sentinel over the login node for confined (NFS $HOME; no compute-node SSH). Tests: test_cli.py (confined attach srun/-L/no-new-session; unconfined attach unchanged).
  5. Add confined: true to the carleton-htc stanza (inert until 1-4). Still the LAST step – and HUMAN-GATED: it needs a live-cluster validation pass (collaborate -> nproc=4 + job cgroup; detach/attach; release; ControlMaster-lapse survival) under the coder tunnel + OTP.
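
The defensive sbatch-id parse from step 3(b) might look like the sketch below. It assumes sbatch is invoked with --parsable (output <id> or <id>;<cluster>) but also tolerates the default "Submitted batch job <id>" line; the function name parse_sbatch_job_id is hypothetical and this is not the shipped parser.

```python
import re
from typing import Optional

# "--parsable" output: "<id>" or "<id>;<cluster>".
_PARSABLE = re.compile(r"^(\d+)(?:;\S*)?$")
# Default sbatch output, in case --parsable was not passed.
_VERBOSE = re.compile(r"^Submitted batch job (\d+)\b")


def parse_sbatch_job_id(output: str) -> Optional[str]:
    """Return the job id from sbatch stdout, or None if no line looks like one.

    Lines are scanned last-to-first because login banners (MOTD) arrive
    before the command's own output on a shared SSH channel.
    """
    for line in reversed(output.splitlines()):
        line = line.strip()
        match = _PARSABLE.match(line) or _VERBOSE.match(line)
        if match:
            return match.group(1)
    return None
```

Returning None rather than guessing lets the caller raise with a scancel hint instead of persisting a bogus id.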

Testing note (human's insight): all of the above drives the cluster through the login-node ControlMaster sucoder already maintains (same path as today's salloc/attach/srun). So a thin harness can submit the confined batch script via the tunnel and observe – the restructure is testable in pieces against the live cluster from the operator box, not only via a full collaborate.

Tunnel/auth for next session: the warm tunnels this session were ligon's; we want coder's. Savio auth uses one-time tokens, no stored SSH credentials, so there is nothing to pre-provision – next session just brings the tunnel up under coder and prompts for a token (PIN+OTP) when sucoder establishes the ControlMaster.

Confined launch on shared partitions (sbatch)

The problem (confirmed on n0052.savio4, 2026-06-29)

On a shared partition the agent must run inside its SLURM job cgroup or it is not confined to the cores it reserved. collaborate allocates with salloc --no-shell then reaches the node by direct SSH. Measured on Savio savio4_htc:

  • A direct-SSH session lands in /user.slice/.../session-cN.scope (a logind user scope) with nproc 56 (whole node) – NOT the job cgroup. This held with a single job on the node, so pam_slurm_adopt is not adopting here (absent or disabled on savio4_htc; /etc/pam.d/sshd is root-only, so we can't read which).
  • A step joined with srun --jobid=J --overlap --pty lands in the job cgroup: nproc 4. So core confinement is enforced (cpuset); the only problem is getting the agent into the cgroup.

On savio3 (exclusive, savio-node) this never showed – the whole node is yours.

Rejected: host-step (srun --overlap + keeper)

A backgrounded srun --overlap bash <keeper> on the login node was designed and partly built (_build_host_step_script, the launch_via_srun flag). A four-reviewer red-team rejected it:

  • The login-node srun is the step's shepherd; if the ControlMaster lapses, the login node reaps it, or the laptop drops, the step (cgroup + tmux) dies – worse persistence than direct-SSH tmux today.
  • The executor connects to the compute node (is_compute_node=True), so "run srun on the login node via the executor" was wrong – it would background srun inside the very user scope we're escaping.
  • Holding a step alive with a sleep-loop is non-idiomatic; the job should own tmux.

Chosen: sbatch wrapper

sbatch a script whose body runs as the job's main task – natively inside the job cgroup: no --overlap step, no keeper-babysitter step, no fragile login-node shepherd. slurmd manages it on the node, so it survives ControlMaster expiry, login-node reboot, and laptop sleep – more persistent than today's direct-SSH tmux.

Flow for a confined target (renamed from launch_via_srun):

  1. Build agent_cmd_str (prelude externalized) as today.
  2. Write the prelude file and the batch script to NFS $HOME/.cache/sucoder/ via the login/DTN (NFS -> node-independent).
  3. sbatch --partition .. --account .. --qos .. --cpus-per-task .. --mem .. --time .. -J sucoder-<mirror> -o <log> <script> -> job id.
  4. Poll squeue until RUNNING + node known (bounded; surface PENDING / queue waits instead of hanging).
  5. The batch script (in the job cgroup) runs:

    #!/bin/bash
    set -u
    SESSION="sucoder-<mirror>"
    cd <mirror_path>
    tmux new-session -A -d -s "$SESSION" "<agent_cmd_str>"
    while tmux has-session -t "$SESSION" 2>/dev/null; do sleep 15; done
    
  6. Poll until the tmux session exists (bounded; on failure surface the job log – the detached pane's output is NOT on the script's stdout).
  7. Attach: srun --jobid=J --overlap --pty tmux attach -t <name> from the login node (the existing attach --via-srun path; auto-selected for confined targets; drop the || new-session fallback so attach never spawns an unconfined orphan).
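
Step 3 of the flow above (assembling the sbatch command) can be sketched as follows. This is an illustration under assumptions, not _build_sbatch_command: the SlurmSlice dataclass, the function name, and the use of --parsable are hypothetical; the other flags are the ones the text names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlurmSlice:
    """Hypothetical container for the slurm: keys of a target."""
    partition: str
    account: str
    qos: str
    cpus_per_task: int
    mem: str
    time: str


def build_sbatch_argv(
    slc: SlurmSlice, mirror: str, log_path: str, script_path: str
) -> List[str]:
    """Build the sbatch argv for one confined mirror."""
    # sbatch does not expand $HOME or ~, so insist on absolute NFS paths.
    for label, value in (("log_path", log_path), ("script_path", script_path)):
        if not value.startswith("/"):
            raise ValueError(f"{label} must be absolute: {value!r}")
    return [
        "sbatch",
        "--parsable",      # assumption: machine-readable job id on stdout
        "--no-requeue",    # a requeued job would restart the agent unannounced
        "--partition", slc.partition,
        "--account", slc.account,
        "--qos", slc.qos,
        "--cpus-per-task", str(slc.cpus_per_task),
        "--mem", slc.mem,
        "--time", slc.time,
        "-J", f"sucoder-{mirror}",
        "-o", log_path,    # may contain %j, which sbatch expands to the job id
        script_path,
    ]
```

Returning a list rather than a string leaves quoting to the caller (for example shlex.join before sending it over the ControlMaster).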

Renew = sbatch a fresh job on turnover (simpler than the host-step relaunch; the renew loop's probe/scancel logic is unchanged). Release = scancel.

Open design decisions (RED-TEAM THESE)

  1. Agent-exit teardown. With ; exec bash -l the tmux session outlives a clean agent /exit, so the keeper loops forever and the job holds the (free but shared) condo pool until release. Options: (a) keep exec bash -l + a courtesy --time cap so the job auto-frees at walltime; (b) drop exec bash -l for confined targets so agent-exit -> session ends -> keeper exits -> job frees (loses post-exit reattach; re-collaborate re-sbatches); (c) idle timeout in the keeper. Lean (a)+(c).
  2. Deadline warnings. _start_slurm_timer runs a nohup on the compute node today; under sbatch the batch script can emit the 30/15/5-min warnings itself (one place, in-cgroup). Fold in or keep separate?
  3. Flag name. confined vs sbatch_launch vs shared_node (intent-named confined preferred; nothing is committed yet).
  4. Node selection / sharing. sbatch can target a node with --nodelist; the "adopt an existing job to share a node" feature does not apply (each sbatch is its own job). confined targets forgo node-sharing.
  5. Job stdout. sbatch -o ~/.cache/sucoder/job-<mirror>-%j.out; the batch script must echo a clear marker if new-session fails so launch failures are diagnosable.
  6. Queue wait. sbatch may queue (vs salloc's immediate-or-fail); collaborate must show "PENDING / waiting for the scheduler".

Must-keep fixes from the review (any in-cgroup launch)

  • tmux new-session -A -d (idempotent), not bare -d.
  • Sanitize the tmux session name (only the prelude filename is today).
  • Bound every poll; log the agent's stderr, not just the launcher's.
  • Drop the || tmux new-session attach fallback for confined targets.
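
A minimal sketch of the session-name sanitizing and the fallback-free attach command from the list above. The function names and the exact character whitelist are assumptions for illustration; the real helpers are confined_tmux_target / confined_attach_command.

```python
import re
from typing import List

# Conservative whitelist; "." and ":" are separators in tmux target syntax.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]+")


def tmux_target_for(mirror: str) -> str:
    """One name used for both the tmux session and its dedicated -L socket."""
    cleaned = _UNSAFE.sub("_", mirror).strip("_")
    if not cleaned:
        raise ValueError(f"mirror name has no usable characters: {mirror!r}")
    return f"sucoder-{cleaned}"


def attach_argv(job_id: str, mirror: str) -> List[str]:
    """srun into the job cgroup and attach; never create a session here."""
    if not job_id.isdigit():
        raise ValueError(f"job id must be numeric: {job_id!r}")
    target = tmux_target_for(mirror)
    return [
        "srun", f"--jobid={job_id}", "--overlap", "--pty",
        "tmux", "-L", target, "attach-session", "-t", target,
    ]
```

Deriving the socket and session name from one function is what keeps launch and attach from drifting apart.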

Red-team outcome (sbatch design, 2026-06-29)

Three-reviewer red-team; survivors folded in. Resolved decisions: keep _start_slurm_timer as-is (its tmux display-message reaches the human; batch stdout only hits the job log – do NOT fold warnings in); launch the agent via bash -lc so the login env (PATH/nvm) resolves; carry agent_launcher.env vars into the launch (sbatch drops them); --time as a hard backstop against squatting (idle-timeout deferred); flag name confined; skip _adopt_existing_allocation; pass --no-requeue.

Additional code-certain fixes: write the prelude + batch script to NFS via the login/DTN before sbatch (_externalize_prelude currently targets the not-yet-existent compute node); split launch_agent's remote block so confined = submit-then-srun --overlap attach while the non-confined path stays byte-identical (regression-test it); add a bounded RUNNING-poll that surfaces the queue reason and, on timeout, leaves the job queued with a "still PENDING" message rather than an empty-%N abort that leaks the job; persist the node before returning so the reuse-probe doesn't submit a second job; ensure the sbatch relaunch blocks until RUNNING so renew's overlap-scancel keeps its no-gap invariant; delete the dead host-step code (_build_host_step_script, the launch_via_srun flag).
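
The bounded RUNNING-poll described above can be sketched like this. It assumes the probe returns one squeue line in a state|node|reason layout (for example from squeue -h -j <id> -o '%T|%N|%r'); the function names and return tuples are hypothetical. A probe that raises is allowed to propagate, so a squeue failure is never mistaken for a finished job.

```python
import time
from typing import Callable, Optional, Tuple


def classify_squeue_line(line: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Map one 'STATE|NODE|REASON' line to (kind, node, reason)."""
    line = line.strip()
    if not line:
        return ("GONE", None, None)          # empty output: job left the queue
    state, _, rest = line.partition("|")
    node, _, reason = rest.partition("|")
    if state == "RUNNING" and node:
        return ("RUNNING", node, None)
    # PENDING, CONFIGURING, SUSPENDED, ...: the job is live but not usable yet.
    return ("WAITING", None, reason or None)


def wait_until_running(
    probe: Callable[[], str],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, Optional[str]]:
    """Return ('RUNNING', node), ('GONE', None) or ('STILL_PENDING', reason)."""
    deadline = clock() + timeout_s
    while True:
        kind, node, reason = classify_squeue_line(probe())
        if kind == "RUNNING":
            return ("RUNNING", node)
        if kind == "GONE":
            return ("GONE", None)
        if clock() >= deadline:
            # Leave the job queued; the caller reports the wait instead of aborting.
            return ("STILL_PENDING", reason)
        sleep(interval_s)
```

Because the node comes from its own field, the literal state word can never be persisted as a node name.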

Cluster-only assumptions – validate with the spike before building: (1) an actual sbatch job confines the tmux pane (nproc=4); (2) the agent starts in the non-interactive batch + detached-tmux context (PATH, no TTY client); (3) the tmux socket is reachable from direct SSH (so attach/release/timer needn't all go through srun --overlap); (4) persistence across detach. Spike: ~/.sucoder/sucoder-sbatch-spike.sh.

Spike result (n0055.savio4, 2026-06-29): ALL GREEN. pane_nproc=4, pane_cgroup -> job_*/step_batch, claude resolves on PATH (it lives in ~/.local/bin, so bash -lc is belt-and-suspenders, not required), and the job persisted. Required detail the spike surfaced: a per-mirror dedicated tmux socket (tmux -L sucoder-<mirror>). Without it tmux reuses an already-running shared server (e.g. the operator's live collaborate session, in a different cgroup) and the new session dies on contact – the v1/v2 spike failure. Folded into _build_batch_script (tmux 2.7 on the node; -A -d, -L, env-export, cd all confirmed via bash -n + the spike).

Cluster validation

[DONE 2026-06-29] End-to-end on the real confined collaborate path (job 35275011, JobName=sucoder-SuCoder, savio4_htc). The live agent (not the spike harness) reported, from inside its own session: nproc=4; cgroup .../job_35275011/step_batch/user/task_0 (a job cgroup, NOT user.slice); cpuset.cpus.effective=39,41,43,45 (pinned to its 4 reserved CPUs on a shared node); memory.max at the job cgroup = 17179869184 (16 GiB, inherited hierarchically – the leaf reads max); Account=co_carleton, QOS=carleton_htc4_normal, AllocTRES=cpu=4,mem=16G (our _build_sbatch_command args); claude + node v20.20.2 on PATH (the bash -lc login-env wrap works); attached inside $TMUX on the dedicated sucoder-SuCoder socket. Caveat surfaced: free (and other non-cgroup-aware tools) report the host's 251 GiB, not the 16 GiB ceiling – a memory-heavy step hits a silent cgroup OOM-kill, so the agent must treat 16 GiB as the cap (or bump mem: for hungry mirrors). nproc, by contrast, IS cpuset-aware (reported 4).
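
Since free reports the host total, an agent that wants the real ceiling has to read it from the cgroup tree. The sketch below assumes a cgroup v2 layout in which each directory may carry a memory.max file holding either a byte count or the word max; the function name is hypothetical and nothing like it is claimed to exist in sucoder.

```python
from pathlib import Path
from typing import Optional


def effective_memory_limit(leaf: Path, root: Path) -> Optional[int]:
    """Smallest numeric memory.max between leaf and root (inclusive), in bytes.

    A leaf that reads "max" is still bounded by any ancestor with a number,
    which is why the job-level 16 GiB applies to the task cgroup.
    """
    current = Path(leaf).resolve()
    root = Path(root).resolve()
    limits = []
    while True:
        limit_file = current / "memory.max"
        if limit_file.is_file():
            raw = limit_file.read_text().strip()
            if raw.isdigit():          # skip the literal "max"
                limits.append(int(raw))
        if current == root or current.parent == current:
            break
        current = current.parent
    return min(limits) if limits else None
```

On a node, leaf would typically be /sys/fs/cgroup joined with the path from /proc/self/cgroup, and root would be /sys/fs/cgroup; both are assumptions about the mount layout.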

Still to confirm live (needs a human at a terminal): detach + re-attach survival; release teardown; and the persistence win – let the ControlMaster lapse / drop the laptop and confirm the job + agent survive.

Motivation

The default Savio flow (-T savio-node) parks an exclusive, billed savio3 node for the agent's whole session. For an interactive agent that is idle most of the time (it blocks on API round-trips, not CPU), that is expensive on both axes the scrum-master-hpc skill names: private SUs drawn from fc_jevons, and the social cost of holding a whole node others could use.

The condo QOS carleton_htc4_normal on savio4_htc changes the calculus:

  • No wall-clock cap. MaxWall is empty; the partition MaxTime is UNLIMITED. A job is not killed for running long. (An FCA QOS like savio_normal caps at 72h.) Confirmed 2026-06-29: the partition's DefaultTime=NONE, so Slurm falls back to MaxTime=UNLIMITED – even omitting --time yields an unlimited job, so there is no silent default cap to fear. The 3-00:00:00 we set is therefore pure courtesy cadence, not a necessity; this is also why the "self-renewing job" idea was dropped (it pre-empts a walltime that is never enforced, and an in-job renewer dies with its node at the only real ceiling – a maintenance reboot, which only an external watcher can recover from).
  • Normal, non-preemptible priority. Unlike a co_* lowprio job on the general pool, a normal-QOS slice here is not evicted, so it can host a persistent presence, not just run batch.
  • Free. Usage is not deducted from an FCA allowance.
  • Per-core scheduling. Request only the cores you need; the rest of the node serves co-tenants.

So this partition is the one place that is simultaneously free, persistent, and compute-bearing – which a login node cannot be (login nodes forbid local compute and reap long processes).

What still bounds "indefinitely"

Three limits the scheduler does not show:

  1. Maintenance reboots (~monthly/quarterly) drain and reboot nodes; every job dies, unlimited-walltime included. This is the real ceiling.
  2. The shared pool. GrpTRES cpu=224, mem~2TB is shared across all co_carleton normal-priority users. An idle interactive hold consumes part of the group's pool and can block Carleton co-members. Etiquette limit.
  3. BRC "don't waste resources" stance. Idle interactive sessions are discouraged even when technically allowed.

Design implications: keep the standing slice thin; set a courtesy time and renew rather than squatting unbounded; release when idle.

Deliverable A: target configuration

Drop into ~/.sucoder/config.yaml. Uses only existing config keys (no parser change); it is the savio-htc shape re-pointed to the condo account.

targets:
  carleton-htc:
    gateway: hpc.brc.berkeley.edu
    transfer_host: dtn.brc.berkeley.edu
    mirror_root: ~/mirrors          # NFS $HOME -- survives node turnover
    control_persist: 36h
    slurm:
      partition: savio4_htc
      account: co_carleton
      qos: carleton_htc4_normal
      cpus_per_task: 4              # thin: fast to (re)schedule, light on the 224-core pool
      mem: 16G
      time: "24:00:00"             # courtesy cadence; NOT a hard cap (QOS has none)
      # local_disk: false          # keep work on NFS, not /local (orphaned on re-grab)
      system_prompt_extra: ~/.sucoder/prompts/carleton-htc.org

Knobs:

  • cpus_per_task/mem – keep small (2–4 cores). A thin ask re-grabs faster under contention and is the most defensible thing to be holding when a co-member queues. Escalate by dispatching additional short jobs, never by fattening the standing slice.
  • time – this is your re-grab cadence, not a limit (see "courtesy vs cadence" below). The in-node deadline warner now parses the D-HH:MM:SS day format correctly (Appendix), so a multi-day value (e.g. 3-00:00:00) is safe and cuts turnover frequency.

Open requirement: cgroup entry on a shared node

savio4_htc is shared, so a job must stay inside its step cgroup or it will step on co-tenants and escape its slice. collaborate currently does salloc --no-shell then direct SSH to the node (cli.py:559, cli.py:689). On a shared node, direct SSH may land outside the job cgroup. The correct entry is srun --jobid=<J> --overlap --pty (the path attach --via-srun already uses, cli.py:1974). Verify before trusting the htc target for real compute: allocate, enter, and check cat /proc/self/cgroup and nproc reflect the 4-core slice, not the whole node. If direct SSH escapes the cgroup, the executor's entry path for shared partitions must switch to srun --overlap.
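
The verification step above (does this process sit in the job cgroup?) can be automated with a small check on the contents of /proc/self/cgroup. The sketch assumes the job cgroup path contains a job_<id> component, as observed on savio4; the function name is hypothetical.

```python
import re
from typing import Optional

_JOB_COMPONENT = re.compile(r"/job_(\d+)(?:/|$)")


def job_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the SLURM job id found in /proc/self/cgroup text, else None.

    Each line is 'hierarchy-id:controllers:path'; only the path matters here.
    A logind user scope (user.slice/.../session-cN.scope) yields None.
    """
    for line in cgroup_text.splitlines():
        path = line.split(":", 2)[-1]
        match = _JOB_COMPONENT.search(path)
        if match:
            return match.group(1)
    return None
```

A launcher could compare the result with the expected job id and, on Linux, compare len(os.sched_getaffinity(0)) with cpus_per_task to confirm the core count as well.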

Deliverable B: auto-renew / respawn spec

Goal

A persistent presence survives the two forced-turnover events – courtesy time expiry and maintenance reboot – with minimal manual intervention and graceful context handoff.

Status (v1 implemented)

The controller ships as sucoder/renew.py (pure decision logic + loop, fully unit-tested in tests/test_renew.py) wired to the sucoder renew CLI command, with a detached relaunch path added to MirrorManager.launch_agent (tmux new-session -A -d). Implemented: SLURM-state polling with correct D-HH:MM:SS parsing, the HOLD/DRAINING/LOST decision machine (a failed probe never relaunches), checkpoint-sentinel signalling, exponential backoff, overlap re-allocation, and old-job scancel. Cluster-validate: the live re-allocate/relaunch/scancel path and the shared-partition cgroup entry (Deliverable A open requirement) have not been exercised on Savio. Deferred: max_idle idle-release, reading renew tunables from a slurm.renew config block (currently CLI flags), and the login-node-daemon controller variant (v2).

Principles

  1. Authoritative state from SLURM, not the watchdog. The renew controller polls squeue -o %L / sacct -o State, parsing D-HH:MM:SS correctly. It does not depend on the in-node slurm-deadline.warn warner regardless of the latter's health (now fixed; see Appendix) – the two channels stay independent.
  2. Files survive, process does not. The repo on NFS $HOME plus a pushed mirror are durable; the agent's conversation context dies on turnover. Every turnover is therefore bracketed by commit+push+handoff before and rehydrate-from-handoff after.
  3. Thin and courteous. Renew re-allocates the same thin slice; never escalate cores on renew.

State machine

  • HOLDING – job J running on node N, agent in tmux sucoder-<mirror>.
  • DRAINING (time-left < courtesy_drain_min, default 20m): signal the agent to checkpoint (commit, push, write handoff); wait for its "ready" marker or a timeout; allocate J' while J still holds (overlap, so the queue slot is never dropped under contention); relaunch the agent in J'; update session YAML; scancel J.
  • LOST (squeue empty / sacct terminal – reboot, crash): re-allocate J' with exponential backoff (a maintenance window may keep the partition down for hours); relaunch; the agent rehydrates from the last handoff plus git log.
  • RELEASED (sucoder release): controller exits; no renew.
  • IDLE-RELEASE (optional, etiquette): no agent activity for max_idle -> push+handoff+scancel, so an unused presence stops squatting the condo pool. Re-collaborate to wake.
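
The core of the state machine above reduces to a pure decision function. The sketch below is an assumption-laden illustration, not the logic in renew.py: the Action names, the signature, and the exact set of terminal states are hypothetical.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    HOLD = "hold"                # keep polling
    DRAIN = "drain"              # checkpoint, overlap-allocate, relaunch, scancel
    RELAUNCH = "relaunch"        # job is gone: re-allocate with backoff
    RETRY_PROBE = "retry_probe"  # probe failed: never relaunch on bad data


# Slurm states after which the job will not run again on its own.
_TERMINAL = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "NODE_FAIL", "OUT_OF_MEMORY", "BOOT_FAIL", "DEADLINE",
}


def decide(
    probe_ok: bool,
    state: Optional[str],
    minutes_left: Optional[int],
    courtesy_drain_min: int = 20,
) -> Action:
    if not probe_ok:
        return Action.RETRY_PROBE
    words = (state or "").split()
    if not words:
        return Action.RELAUNCH           # LOST: squeue empty
    # sacct may print e.g. "CANCELLED by 1234"; only the first word is the state.
    if words[0] in _TERMINAL:
        return Action.RELAUNCH
    if (
        words[0] == "RUNNING"
        and minutes_left is not None
        and minutes_left < courtesy_drain_min
    ):
        return Action.DRAIN
    return Action.HOLD
```

Keeping the probe-failure case separate from the job-gone case is what prevents a transient squeue error from orphaning a live job.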

Courtesy vs cadence (the core tension)

The QOS has no MaxWall, but a running job's TimeLimit cannot be raised by a non-operator, so "renew" means re-allocate, not extend. Therefore time directly sets turnover frequency, and every turnover costs a context reset (the agent process restarts). Pick time to balance:

  • shorter (24h) -> more polite, but daily context resets;
  • longer (3--7d) -> rare resets approaching the reboot cadence, at the cost of a longer idle hold.

A thin slice makes the longer choice defensible (small footprint) and makes re-grab fast when turnover does happen. Recommended: time: "3-00:00:00" with max_idle as the real courtesy guard.

Handoff / rehydration

  • Before turnover the agent writes a structured note to a fixed NFS path, e.g. ~/mirrors/<mirror>/.sucoder/handoff.org: current task, branch, last commit SHA, open questions, next action.
  • On relaunch, system_prompt_extra instructs the agent to first read the handoff + git log -5 + git status, summarize the resumed state back to the human, then continue. This turns a cold start into a one-message rehydration.

Signaling the agent to checkpoint

The watchdog's delivery is the weak link, so the controller writes a durable sentinel (~/.cache/sucoder/renew-requested) AND the prompt instructs the agent to poll it at natural checkpoints (between tool batches), not passively. Optional upgrade: tmux send-keys a notice into the agent pane (more reliable than a status-line flash, but fiddly to inject into an interactive TUI) – defer past v1.
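
The sentinel handshake can be sketched as a pair of small file operations. The sentinel name renew-requested comes from the text; the renew-ready marker name and all function names are hypothetical, and the sketch works on a local directory whereas the controller writes over the login node.

```python
import time
from pathlib import Path
from typing import Callable

SENTINEL_NAME = "renew-requested"  # named in the text
READY_NAME = "renew-ready"         # hypothetical name for the agent's "ready" marker


def request_checkpoint(cache_dir: Path) -> Path:
    """Controller side: ask the agent to checkpoint."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / READY_NAME).unlink(missing_ok=True)  # clear a stale answer
    sentinel = cache_dir / SENTINEL_NAME
    sentinel.write_text(f"{int(time.time())}\n")
    return sentinel


def checkpoint_requested(cache_dir: Path) -> bool:
    """Agent side: cheap check to run between tool batches."""
    return (Path(cache_dir) / SENTINEL_NAME).exists()


def mark_ready(cache_dir: Path) -> None:
    """Agent side: call after commit + push + handoff note."""
    cache_dir = Path(cache_dir)
    (cache_dir / READY_NAME).write_text("ready\n")
    (cache_dir / SENTINEL_NAME).unlink(missing_ok=True)


def wait_for_ready(
    cache_dir: Path,
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Controller side: bounded wait; False means proceed without the marker."""
    deadline = clock() + timeout_s
    while True:
        if (Path(cache_dir) / READY_NAME).exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The wait is bounded so an unresponsive agent delays turnover by at most timeout_s rather than blocking it.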

Where the controller runs

  • v1: operator-side loop – sucoder -T carleton-htc collaborate --renew (or sucoder renew <mirror>) reuses the warm ControlMaster, session YAML, and _ensure_slurm_node. Smallest delta to existing code. Caveat: dies if the operator box sleeps.
  • v2: login-node daemon – more autonomous (survives laptop sleep) but adds a login-node footprint and itself dies on login-node reboot. Revisit only if v1's sleep caveat bites.

Proposed CLI / config surface

slurm:
  renew:
    courtesy_drain_min: 20     # start DRAINING this many minutes before time
    overlap: true              # allocate J' before cancelling J
    max_idle: "12h"            # IDLE-RELEASE threshold (0 = never)
    backoff_max: "30m"         # LOST re-allocate backoff ceiling
  • sucoder -T carleton-htc collaborate --renew [--renew-detach]
  • sucoder -T carleton-htc renew <mirror> (attach loop to a live session)
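
How backoff_max could bound the LOST re-allocate loop is sketched below. The duration syntax ("30m", "12h", bare "0") is inferred from the proposed config; the parser, the generator, and the 60-second base delay are assumptions, not the shipped behaviour.

```python
import re
from typing import Iterator

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
_DURATION = re.compile(r"(\d+)([smhd])", re.ASCII)


def parse_duration(text: str) -> int:
    """Convert '30m' / '12h' / '0' to seconds; raise ValueError otherwise."""
    text = text.strip()
    if text == "0":
        return 0
    match = _DURATION.fullmatch(text)
    if not match:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def backoff_schedule(base_s: int, ceiling_s: int) -> Iterator[int]:
    """Yield delays that double from base_s and then stay at ceiling_s."""
    delay = base_s
    while True:
        yield min(delay, ceiling_s)
        delay = min(delay * 2, ceiling_s)
```

With a 60 s base and backoff_max of 30m, the delays run 60, 120, 240, 480, 960 and then hold at 1800 seconds, which keeps retrying through a multi-hour maintenance window without hammering the scheduler.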

Interaction with the existing watchdog

The renew controller supersedes the in-node timer's renewal role. Keep _start_slurm_timer only as a human courtesy (after fixing the Appendix bug and improving delivery), or retire it. Nothing in the renew path reads slurm-deadline.warn.

Open questions

  1. Shared-partition cgroup entry (Deliverable A "Open requirement") – must resolve before real compute on the slice.
  2. Reliable checkpoint signal – sentinel + prompt (v1) vs tmux send-keys (later).
  3. Controller home across operator sleep – operator-side (v1) vs login-node daemon (v2).
  4. Maintenance-window awareness – reactive LOST re-allocate (v1) vs pre-draining on a fed-in schedule (later).

Appendix: watchdog day-format bug

_start_slurm_timer (sucoder/cli.py:721) parses squeue -o %L time-left at cli.py:829--836:

IFS=: read -ra parts <<< "$left"
if [ ${#parts[@]} -eq 3 ]; then
    mins=$(( ${parts[0]#0}*60 + ${parts[1]#0} ))     # assumes HH:MM:SS
...

When >= 1 day remains, %L is D-HH:MM:SS. Splitting on : yields parts=("D-HH" "MM" "SS"), and $(( D-HH*60 + MM )) evaluates as the arithmetic D - HH*60 + MM, collapsing mins to a small or negative number (the day count itself for a value like 3-00:00:00). The timer then fires all three warnings in the first minutes, sets the WARN5/15/30 sentinels, and stays silent forever after – including at the true deadline. A 24:00:00 job mostly escapes (it drops below a day within seconds, before the first poll), which is why the symptom is silence rather than spurious noise; a multi-day time is fully broken.

Fixed (this change): the parser is extracted to a left_to_mins bash helper (module constant _SLURM_TIME_LEFT_TO_MINS_SH in sucoder/cli.py) that splits the optional D- day prefix on - before the : split, forces base-10 to dodge octal traps (08/09), and returns a large sentinel for non-numeric values (UNLIMITED/INVALID/empty) so no spurious warning fires. Covered by tests/test_slurm_timer.py (3-00:00:00, 23:59:00, 45:00, UNLIMITED, …). As separate hardening, each warning now sets a 15s per-session display-time so a full-screen agent TUI does not redraw over it before the human notices. The renew controller still reads SLURM state directly and does not depend on the warner.
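
The same parsing rules, rendered in Python for illustration. This is not the shipped bash helper; the function name, the sentinel value, and the choice to truncate seconds are assumptions.

```python
import re

NO_DEADLINE = 10**9  # large sentinel: "no warning should ever fire"

# Optional "D-" prefix, optional "HH:" field, then "MM:SS".
_TIME_LEFT = re.compile(r"(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+)", re.ASCII)


def time_left_to_minutes(left: str) -> int:
    """Convert squeue %L output to whole minutes remaining.

    Accepts D-HH:MM:SS, HH:MM:SS and MM:SS. Anything else (UNLIMITED,
    INVALID, empty) maps to NO_DEADLINE instead of a bogus small number.
    """
    match = _TIME_LEFT.fullmatch(left.strip())
    if not match:
        return NO_DEADLINE
    days, hours, minutes, _seconds = match.groups()
    # int() always parses base-10, so "08" and "09" are not octal traps here.
    return int(days or 0) * 1440 + int(hours or 0) * 60 + int(minutes)
```

For example, 3-00:00:00 maps to 4320 minutes and 45:00 to 45, while UNLIMITED maps to the sentinel and therefore never crosses a 30/15/5-minute threshold.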
