Persistent Agent Presence on a Condo HTC Slice
Status & handoff (2026-06-29)
Branch feat/confined-sbatch-launch off main. Steps 1-4 of the
restructure are DONE, committed, and green (454 tests). Not pushed.
Two adversarial-review rounds (3 reviewers each) were run and folded: a
design red-team before writing step 3/4, then a code red-team on the
implemented diff (see the per-step notes below for the survivors).
Remaining: step 5 (flip confined: true) + the live-cluster validation
pass – both HUMAN-GATED (need the coder tunnel + OTP).
Code red-team survivors folded (af0627f scope, fixed in a follow-up):
reuse-probe now keys on _confined_job_state (ANY non-empty squeue state
= live, so a SUSPENDED/PREEMPTED job is not resubmitted-over) and refuses
to srun-attach a still-PENDING job (it would block); sbatch runs
check=False and raises a MirrorError with a scancel hint on a
post-submit transport drop (was an uncaught CommandError masking a
leak); RemoteSession.save is now atomic (temp + os.replace +
fsync) so a mid-write failure can't truncate the file holding the job
id (which would load blank -> resubmit -> leak); confined attach --node
prints a note instead of silently dropping it; release's sibling-detach
is confined-aware (defensive; unreachable today). Test hardening:
unconfined attach pinned to byte-identity (catches a dropped srun
prefix on the || new-session fallback), plus new tests for
save-failure, reuse-detached / reuse-PENDING / reuse-not-ready,
_resolve_remote_home, and env-threading.
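The atomic-save fix above follows the usual write-to-temp-then-rename pattern. A minimal sketch, assuming a hypothetical atomic_write_text helper (the real RemoteSession.save is not shown in these notes and may differ in detail):

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text so a reader sees either the old file or the new one.

    Hypothetical helper for illustration; not the actual RemoteSession.save.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory (same filesystem),
    # otherwise os.replace could not swap it in with a single rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-session-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Any failure: drop the temp file and leave the old file untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A failure at any point before os.replace leaves the previous session file, and the job id in it, readable.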
Mechanism: cluster-validated. Spike on n0055.savio4 confirmed an
sbatch batch task runs in the job cgroup, nproc=4 in the tmux pane,
claude on PATH, persists. Required detail it surfaced: a per-mirror
dedicated tmux socket (tmux -L sucoder-<mirror>) – a shared socket
reuses another session's server (wrong cgroup) and dies.
Built + tested (gated by slurm.confined, default off -> existing
targets byte-identical): config.py confined flag;
mirror.py:_build_batch_script (the sbatch body: -L socket, env-carry,
cd, -A -d, keeper); cli.py:_build_sbatch_command. Tests:
tests/test_batch_script.py, tests/test_sbatch_command.py.
Remaining: the allocation/launch restructure (this section + "Red-team outcome" + "Must-keep fixes" below are the spec):
- [DONE] _build_executor: skip salloc for confined (the sbatch fuses allocate+launch). Confined now returns a login-node executor (is_compute_node=False, no compute-node proxy fields, NFS mirror root – never compute-node-local disk). Gated on remote.slurm.confined; non-confined targets byte-identical.
- [DONE] Extracted mirror.py:MirrorManager._build_remote_agent_cmd_str from launch_agent (the "; exec bash -l" + prelude-externalize block) so the confined flow can reuse it. Because step 1 makes the confined executor a login-node executor, _externalize_prelude now writes the prelude over the login node (NFS), not a not-yet-existent compute node. Added a MirrorContext.confined property and a guard: launch_agent refuses a confined target (MirrorError) rather than silently running tmux on the login node – step 3 replaces the guard with the real submit+attach branch. Tests: test_mirror.py (_build_remote_agent_cmd_str, confined-guard), test_cli.py (_build_executor confined-skips-salloc + control). Ordering invariant: keep step 5 (flip confined: true) LAST – until step 3 lands, a confined target hits the guard, not a launch.
- [DONE] Confined collaborate via mirror.py:_launch_confined (and helpers), replacing the step-2 guard in launch_agent. Flow: reuse-probe -> build agent cmd (prelude to NFS, bash -lc wrap so the login env/PATH/nvm resolve) -> stage batch script to NFS (absolute path; sbatch won't expand $HOME) -> sbatch -> persist job id immediately (pre-poll, so a poll failure leaves the job RECORDED not leaked) -> bounded three-way RUNNING-poll -> persist node -> confirm tmux session -> attach (srun --overlap --pty tmux -L) for an interactive launch / return for a detached (renew) relaunch. A 3-reviewer design red-team hardened it: (a) the reuse-probe and poll distinguish a squeue failure (raise, never resubmit – a live job would orphan) from a job gone (resubmit); (b) defensive sbatch-id parse (<id>;<cluster> / MOTD lines); (c) three-way poll (RUNNING+node / PENDING-wait / empty=terminal) never persists the literal state word as a node; (d) target_name on MirrorManager, derived identically to _build_executor, so the job id lands in the same session file attach/release/renew read; (e) the batch script catches a new-session failure and emits a SUCODER: marker + exit 1 instead of silently COMPLETEing exit 0; (f) shared sanitized confined_tmux_target / confined_attach_command helpers so launch and attach use one session+socket name. _build_sbatch_command moved to mirror.py (re-exported by cli) to dodge the import cycle. Tests: test_mirror.py (parser, three-way poll, reuse-probe failure-vs-gone, submit/persist/attach, persist-before-poll, detached-no-attach, bash-lc, session-not-ready), test_batch_script.py (failure marker). Deferred (flagged): confined has no in-tmux 30/15/5-min deadline warning – _start_slurm_timer needs a compute-node ControlMaster confined never opens; renew.py (reads SLURM state directly) + --time backstop cover turnover, so this is a lost courtesy not a safety hole.
- [DONE] Gated attach/release/renew for confined. attach: a confined-only branch forces via_srun (direct SSH to the compute node escapes the cgroup), builds the command from confined_attach_command (srun --overlap --pty tmux -L <socket> attach-session, no || new-session fallback), sanitized session name; the unconfined attach_cmd stays byte-identical (regression-tested). release: unchanged – scancel <job_id> is name-independent and the sibling-detach branch is unreachable for confined (each sbatch is its own job, so holders_of_job is empty). renew: the sbatch relaunch goes through _launch_confined (step 3), which blocks until RUNNING (the poll), so the overlap-scancel no-gap invariant holds; _adopt_* is naturally skipped (confined bypasses _ensure_slurm_node); request_checkpoint writes the renew sentinel over the login node for confined (NFS $HOME; no compute-node SSH). Tests: test_cli.py (confined attach srun/-L/no-new-session; unconfined attach unchanged).
- Add confined: true to the carleton-htc stanza (inert until 1-4). Still the LAST step – and HUMAN-GATED: it needs a live-cluster validation pass (collaborate -> nproc=4 + job cgroup; detach/attach; release; ControlMaster-lapse survival) under the coder tunnel + OTP.
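The defensive sbatch-id parse from point (b) of the confined collaborate step can be sketched as below. It assumes sbatch output is either the --parsable form (<id> or <id>;<cluster>) or the default "Submitted batch job <id>" line, possibly surrounded by MOTD noise from the login shell. The name parse_sbatch_job_id is hypothetical; the real parser lives in mirror.py and is not reproduced here.

```python
import re
from typing import Optional

# "35275011" or "35275011;brc" (sbatch --parsable)
_PARSABLE = re.compile(r"^(\d+)(?:;\S*)?$")
# "Submitted batch job 35275011" (sbatch default output)
_DEFAULT = re.compile(r"^Submitted batch job (\d+)\b")


def parse_sbatch_job_id(stdout: str) -> Optional[str]:
    """Return the first job id found in sbatch stdout, or None.

    Lines that match neither known shape (MOTD banners, warnings)
    are skipped instead of being mistaken for an id.
    """
    for raw in stdout.splitlines():
        line = raw.strip()
        match = _PARSABLE.match(line) or _DEFAULT.match(line)
        if match:
            return match.group(1)
    return None
```

Returning None rather than a best guess lets the caller raise instead of persisting a wrong id.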
Testing note (human's insight): all of the above drives the cluster
through the login-node ControlMaster sucoder already maintains (same
path as today's salloc/attach/srun). So a thin harness can submit
the confined batch script via the tunnel and observe – the restructure
is testable in pieces against the live cluster from the operator box,
not only via a full collaborate.
Tunnel/auth for next session: the warm tunnels this session were
ligon's; we want coder's. Savio auth uses one-time tokens, no
stored SSH credentials, so there is nothing to pre-provision – next
session just brings the tunnel up under coder and prompts for a
token (PIN+OTP) when sucoder establishes the ControlMaster.
Confined launch on shared partitions (sbatch)
The problem (confirmed on n0052.savio4, 2026-06-29)
On a shared partition the agent must run inside its SLURM job cgroup
or it is not confined to the cores it reserved. collaborate
allocates with salloc --no-shell then reaches the node by direct
SSH. Measured on Savio savio4_htc:
- A direct-SSH session lands in /user.slice/.../session-cN.scope (a logind user scope) with nproc 56 (whole node) – NOT the job cgroup. This held with a single job on the node, so pam_slurm_adopt is not adopting here (absent or disabled on savio4_htc; /etc/pam.d/sshd is root-only, so we can't read which).
- A step joined with srun --jobid=J --overlap --pty lands in the job cgroup: nproc 4.

So core confinement is enforced (cpuset); the only problem is getting the agent into the cgroup.
On savio3 (exclusive, savio-node) this never showed – the whole
node is yours.
Rejected: host-step (srun --overlap + keeper)
A backgrounded srun --overlap bash <keeper> on the login node was
designed and partly built (_build_host_step_script, the
launch_via_srun flag). A four-reviewer red-team rejected it:
- The login-node srun is the step's shepherd; if the ControlMaster lapses, the login node reaps it, or the laptop drops, the step (cgroup-tmux) dies – worse persistence than direct-SSH tmux today.
- The executor connects to the compute node (is_compute_node=True), so "run srun on the login node via the executor" was wrong – it would background srun inside the very user scope we're escaping.
- Holding a step alive with a sleep-loop is non-idiomatic; the job should own tmux.
Chosen: sbatch wrapper
sbatch a script whose body runs as the job's main task – natively
inside the job cgroup: no --overlap step, no keeper-babysitter step,
no fragile login-node shepherd. slurmd manages it on the node, so it
survives ControlMaster expiry, login-node reboot, and laptop sleep –
more persistent than today's direct-SSH tmux.
Flow for a confined target (renamed from launch_via_srun):
- Build agent_cmd_str (prelude externalized) as today.
- Write the prelude file and the batch script to NFS $HOME/.cache/sucoder/ via the login/DTN (NFS -> node-independent).
- sbatch --partition .. --account .. --qos .. --cpus-per-task .. --mem .. --time .. -J sucoder-<mirror> -o <log> <script> -> job id.
- Poll squeue until RUNNING + node known (bounded; surface PENDING / queue waits instead of hanging). The batch script (in the job cgroup) runs:

    #!/bin/bash
    set -u
    SESSION="sucoder-<mirror>"
    cd <mirror_path>
    tmux new-session -A -d -s "$SESSION" "<agent_cmd_str>"
    while tmux has-session -t "$SESSION" 2>/dev/null; do sleep 15; done

- Poll until the tmux session exists (bounded; on failure surface the job log – the detached pane's output is NOT on the script's stdout).
- Attach: srun --jobid=J --overlap --pty tmux attach -t <name> from the login node (the existing attach --via-srun path; auto-selected for confined targets; drop the || new-session fallback so attach never spawns an unconfined orphan).
Renew = sbatch a fresh job on turnover (simpler than the host-step
relaunch; the renew loop's probe/scancel logic is unchanged).
Release = scancel.
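The sbatch invocation in the flow above can be assembled as an argv list from the target's slurm config block. A minimal sketch: build_sbatch_argv is a hypothetical name (the real builder is _build_sbatch_command and may differ), --no-requeue comes from the red-team outcome, and --parsable is an assumption made here so the job id comes back in a machine-readable form.

```python
from typing import Dict, List

# config key -> sbatch flag, in the order used in the flow above
_FLAGS = (
    ("partition", "--partition"),
    ("account", "--account"),
    ("qos", "--qos"),
    ("cpus_per_task", "--cpus-per-task"),
    ("mem", "--mem"),
    ("time", "--time"),
)


def build_sbatch_argv(slurm: Dict[str, object], mirror: str,
                      log_path: str, script_path: str) -> List[str]:
    """Build the sbatch argv for a confined launch (illustrative only)."""
    argv = ["sbatch", "--parsable", "--no-requeue"]  # --parsable: assumption
    for key, flag in _FLAGS:
        value = slurm.get(key)
        if value is not None:  # unset keys fall back to cluster defaults
            argv.append(f"{flag}={value}")
    # log_path and script_path should already be absolute NFS paths.
    argv += ["-J", f"sucoder-{mirror}", "-o", log_path, script_path]
    return argv
```

Passing a list instead of a shell string sidesteps quoting issues locally; whatever runs it over SSH still has to quote each element for the remote shell.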
Open design decisions (RED-TEAM THESE)
- Agent-exit teardown. With "; exec bash -l" the tmux session outlives a clean agent /exit, so the keeper loops forever and the job holds the (free but shared) condo pool until release. Options: (a) keep exec bash -l + a courtesy --time cap so the job auto-frees at walltime; (b) drop exec bash -l for confined targets so agent-exit -> session ends -> keeper exits -> job frees (loses post-exit reattach; re-collaborate re-sbatches); (c) idle timeout in the keeper. Lean (a)+(c).
- Deadline warnings. _start_slurm_timer runs a nohup on the compute node today; under sbatch the batch script can emit the 30/15/5-min warnings itself (one place, in-cgroup). Fold in or keep separate?
- Flag name. confined vs sbatch_launch vs shared_node (intent-named confined preferred; nothing is committed yet).
- Node selection / sharing. sbatch can target a node with --nodelist; the "adopt an existing job to share a node" feature does not apply (each sbatch is its own job). confined targets forgo node-sharing.
- Job stdout. sbatch -o ~/.cache/sucoder/job-<mirror>-%j.out; the batch script must echo a clear marker if new-session fails so launch failures are diagnosable.
- Queue wait. sbatch may queue (vs salloc's immediate-or-fail); collaborate must show "PENDING / waiting for the scheduler".
Must-keep fixes from the review (any in-cgroup launch)
- tmux new-session -A -d (idempotent), not bare -d.
- Sanitize the tmux session name (only the prelude filename is today).
- Bound every poll; log the agent's stderr, not just the launcher's.
- Drop the || tmux new-session attach fallback for confined targets.
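Two of the fixes above (a sanitized session name, and an attach with no new-session fallback) can be sketched together. The helper names and the exact character whitelist are assumptions for illustration. Using one string as both the -L socket name and the session name follows the per-mirror socket detail reported by the spike, but the real confined_tmux_target / confined_attach_command helpers may differ.

```python
import re
from typing import List


def sanitize_tmux_name(mirror: str) -> str:
    """Map a mirror name to a tmux-safe session/socket name.

    Anything outside [A-Za-z0-9_-] becomes "_"; "." and ":" in particular
    are separators in tmux target syntax, so they must not survive.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", mirror)
    return f"sucoder-{cleaned}"


def confined_attach_argv(job_id: str, mirror: str) -> List[str]:
    """Build the in-cgroup attach command; deliberately no fallback."""
    name = sanitize_tmux_name(mirror)
    return [
        "srun", f"--jobid={job_id}", "--overlap", "--pty",
        "tmux", "-L", name,          # dedicated per-mirror socket
        "attach-session", "-t", name,
    ]
```

Because both launch and attach would call the same sanitizer, they cannot disagree about the name, and a missing session makes attach fail rather than create an unconfined one.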
Red-team outcome (sbatch design, 2026-06-29)
Three-reviewer red-team; survivors folded in. Resolved decisions:
keep _start_slurm_timer as-is (its tmux display-message reaches the
human; batch stdout only hits the job log – do NOT fold warnings in);
launch the agent via bash -lc so the login env (PATH/nvm) resolves;
carry agent_launcher.env vars into the launch (sbatch drops them);
--time as a hard backstop against squatting (idle-timeout deferred);
flag name confined; skip _adopt_existing_allocation; pass
--no-requeue.
Additional code-certain fixes: write the prelude + batch script to NFS
via the login/DTN before sbatch (_externalize_prelude currently
targets the not-yet-existent compute node); split launch_agent's
remote block so confined = submit-then-srun --overlap attach while the
non-confined path stays byte-identical (regression-test it); add a
bounded RUNNING-poll that surfaces the queue reason and, on timeout,
leaves the job queued with a "still PENDING" message rather than an
empty-%N abort that leaks the job; persist the node before returning
so the reuse-probe doesn't submit a second job; ensure the sbatch
relaunch blocks until RUNNING so renew's overlap-scancel keeps its
no-gap invariant; delete the dead host-step code
(_build_host_step_script, the launch_via_srun flag).
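The bounded RUNNING-poll and its failure-vs-gone distinction can be sketched as a pure classifier plus a small loop. This assumes the probe runs something like squeue -h -j <id> -o "%T %N" and hands back (returncode, stdout); the names classify_squeue, wait_until_running and SqueueError are hypothetical, and how a purged job id is reported on a given cluster is not handled here.

```python
import time
from typing import Callable, Optional, Tuple

Status = Tuple[str, Optional[str]]  # ("running" | "waiting" | "gone", node)


class SqueueError(RuntimeError):
    """squeue itself failed; the job's fate is unknown, so never resubmit."""


def classify_squeue(returncode: int, stdout: str) -> Status:
    if returncode != 0:
        raise SqueueError(f"squeue exited {returncode}")
    fields = stdout.split()
    if not fields:
        return ("gone", None)  # empty output: job is terminal
    state = fields[0]
    node = fields[1] if len(fields) > 1 else ""
    if state == "RUNNING" and node:
        return ("running", node)
    # PENDING, CONFIGURING, SUSPENDED, or RUNNING without a node yet:
    # the job is live, and the state word is never returned as a node.
    return ("waiting", None)


def wait_until_running(probe: Callable[[], Tuple[int, str]], attempts: int,
                       delay_s: float,
                       sleep: Callable[[float], None] = time.sleep) -> Status:
    """Poll at most `attempts` times; a final "waiting" means still queued."""
    status: Status = ("waiting", None)
    for attempt in range(attempts):
        status = classify_squeue(*probe())
        if status[0] != "waiting":
            return status
        if attempt + 1 < attempts:
            sleep(delay_s)
    return status
```

On a final "waiting" the caller would print a "still PENDING" message and leave the job queued; the job id was already persisted before the poll started.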
Cluster-only assumptions – validate with the spike before building:
(1) an actual sbatch job confines the tmux pane (nproc=4); (2) the
agent starts in the non-interactive batch + detached-tmux context
(PATH, no TTY client); (3) the tmux socket is reachable from direct SSH
(so attach/release/timer needn't all go through srun --overlap);
(4) persistence across detach. Spike: ~/.sucoder/sucoder-sbatch-spike.sh.
Spike result (n0055.savio4, 2026-06-29): ALL GREEN. pane_nproc=4,
pane_cgroup -> job_*/step_batch, claude resolves on PATH (it lives
in ~/.local/bin, so bash -lc is belt-and-suspenders, not required),
and the job persisted. Required detail the spike surfaced: a per-
mirror dedicated tmux socket (tmux -L sucoder-<mirror>). Without it
tmux reuses an already-running shared server (e.g. the operator's live
collaborate session, in a different cgroup) and the new session dies
on contact – the v1/v2 spike failure. Folded into
_build_batch_script (tmux 2.7 on the node; -A -d, -L, env-export,
cd all confirmed via bash -n + the spike).
Cluster validation
[DONE 2026-06-29] End-to-end on the real confined collaborate path
(job 35275011, JobName=sucoder-SuCoder, savio4_htc). The live agent
(not the spike harness) reported, from inside its own session:
nproc=4; cgroup .../job_35275011/step_batch/user/task_0 (a job cgroup,
NOT user.slice); cpuset.cpus.effective=39,41,43,45 (pinned to its 4
reserved CPUs on a shared node); memory.max at the job cgroup =
17179869184 (16 GiB, inherited hierarchically – the leaf reads max);
Account=co_carleton, QOS=carleton_htc4_normal, AllocTRES=cpu=4,mem=16G
(our _build_sbatch_command args); claude + node v20.20.2 on PATH (the
bash -lc login-env wrap works); attached inside $TMUX on the dedicated
sucoder-SuCoder socket. Caveat surfaced: free (and other
non-cgroup-aware tools) report the host's 251 GiB, not the 16 GiB
ceiling – a memory-heavy step hits a silent cgroup OOM-kill, so the agent
must treat 16 GiB as the cap (or bump mem: for hungry mirrors).
nproc, by contrast, IS cpuset-aware (reported 4).
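Because the leaf cgroup reads max while the ceiling sits on an ancestor, a tool that wants the real memory cap has to walk up the hierarchy. A minimal sketch for cgroup v2, assuming the memory.max file layout described above; effective_memory_max is a hypothetical helper, not part of sucoder.

```python
import os
from typing import Optional


def effective_memory_max(cgroup_root: str, cgroup_path: str) -> Optional[int]:
    """Return the tightest memory.max from the leaf up to the root.

    cgroup_root is normally /sys/fs/cgroup; cgroup_path is the path from
    /proc/self/cgroup (the part after "0::"). None means no numeric limit.
    """
    limit: Optional[int] = None
    current = cgroup_path.strip("/")
    while True:
        candidate = os.path.join(cgroup_root, current, "memory.max")
        try:
            with open(candidate, encoding="ascii") as handle:
                raw = handle.read().strip()
        except OSError:
            raw = "max"  # file missing at this level: treat as unlimited
        if raw != "max":
            value = int(raw)
            limit = value if limit is None else min(limit, value)
        if not current:
            break  # just inspected the root
        current = os.path.dirname(current)
    return limit
```

With the layout reported above this would return 17179869184 for the agent's task cgroup, where free reports the host total.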
Still to confirm live (needs a human at a terminal): detach + re-attach
survival; release teardown; and the persistence win – let the
ControlMaster lapse / drop the laptop and confirm the job + agent survive.
Motivation
The default Savio flow (-T savio-node) parks an exclusive, billed
savio3 node for the agent's whole session. For an interactive agent
that is idle most of the time (it blocks on API round-trips, not CPU),
that is expensive on both axes the scrum-master-hpc skill names:
private SUs drawn from fc_jevons, and the social cost of holding a
whole node others could use.
The condo QOS carleton_htc4_normal on savio4_htc changes the
calculus:
- No wall-clock cap. MaxWall is empty; the partition MaxTime is UNLIMITED. A job is not killed for running long. (An FCA QOS like savio_normal caps at 72h.) Confirmed 2026-06-29: the partition's DefaultTime=NONE, so Slurm falls back to MaxTime=UNLIMITED – even omitting --time yields an unlimited job, so there is no silent default cap to fear. The 3-00:00:00 we set is therefore pure courtesy cadence, not a necessity; this is also why the "self-renewing job" idea was dropped (it pre-empts a walltime that is never enforced, and an in-job renewer dies with its node at the only real ceiling – a maintenance reboot, which only an external watcher can recover from).
- Normal, non-preemptible priority. Unlike a co_* lowprio job on the general pool, a normal-QOS slice here is not evicted, so it can host a persistent presence, not just run batch.
- Free. Usage is not deducted from an FCA allowance.
- Per-core scheduling. Request only the cores you need; the rest of the node serves co-tenants.
So this partition is the one place that is simultaneously free, persistent, and compute-bearing – which a login node cannot be (login nodes forbid local compute and reap long processes).
What still bounds "indefinitely"
Three limits the scheduler does not show:
- Maintenance reboots (~monthly/quarterly) drain and reboot nodes; every job dies, unlimited-walltime included. This is the real ceiling.
- The shared pool. GrpTRES cpu=224, mem~2TB is shared across all co_carleton normal-priority users. An idle interactive hold consumes part of the group's pool and can block Carleton co-members. Etiquette limit.
- BRC "don't waste resources" stance. Idle interactive sessions are discouraged even when technically allowed.
Design implications: keep the standing slice thin; set a courtesy
time and renew rather than squatting unbounded; release when idle.
Deliverable A: target configuration
Drop into ~/.sucoder/config.yaml. Uses only existing config keys
(no parser change); it is the savio-htc shape re-pointed to the condo
account.
targets:
  carleton-htc:
    gateway: hpc.brc.berkeley.edu
    transfer_host: dtn.brc.berkeley.edu
    mirror_root: ~/mirrors          # NFS $HOME -- survives node turnover
    control_persist: 36h
    slurm:
      partition: savio4_htc
      account: co_carleton
      qos: carleton_htc4_normal
      cpus_per_task: 4              # thin: fast to (re)schedule, light on the 224-core pool
      mem: 16G
      time: "24:00:00"              # courtesy cadence; NOT a hard cap (QOS has none)
      # local_disk: false           # keep work on NFS, not /local (orphaned on re-grab)
    system_prompt_extra: ~/.sucoder/prompts/carleton-htc.org
Knobs:
- cpus_per_task/mem – keep small (2–4 cores). A thin ask re-grabs faster under contention and is the most defensible thing to be holding when a co-member queues. Escalate by dispatching additional short jobs, never by fattening the standing slice.
- time – this is your re-grab cadence, not a limit (see "courtesy vs cadence" below). The in-node deadline warner now parses the D-HH:MM:SS day format correctly (Appendix), so a multi-day value (e.g. 3-00:00:00) is safe and cuts turnover frequency.
Open requirement: cgroup entry on a shared node
savio4_htc is shared, so a job must stay inside its step cgroup or it
will step on co-tenants and escape its slice. collaborate currently
does salloc --no-shell then direct SSH to the node
(cli.py:559, cli.py:689). On a shared node, direct SSH may land
outside the job cgroup. The correct entry is
srun --jobid=<J> --overlap --pty (the path attach --via-srun
already uses, cli.py:1974). Verify before trusting the htc target
for real compute: allocate, enter, and check cat
/proc/self/cgroup and nproc reflect the 4-core slice, not the whole
node. If direct SSH escapes the cgroup, the executor's entry path for
shared partitions must switch to srun --overlap.
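The verification described above (read /proc/self/cgroup, compare the visible CPU count with the reserved slice) can be scripted. A minimal sketch, assuming job cgroups contain a job_<id> path component as observed on savio4_htc; looks_confined is a hypothetical helper, and the sample paths in the usage note are illustrative.

```python
import re
from typing import List

# A path component such as "job_35275011" marks a SLURM job cgroup.
_JOB_COMPONENT = re.compile(r"(?:^|/)job_\d+(?:/|$)")


def cgroup_paths(proc_cgroup_text: str) -> List[str]:
    """Extract the path column from /proc/self/cgroup (v1 or v2 lines)."""
    paths = []
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)  # hierarchy-id : controllers : path
        if len(parts) == 3:
            paths.append(parts[2])
    return paths


def looks_confined(proc_cgroup_text: str, visible_cpus: int,
                   reserved_cpus: int) -> bool:
    """True if the process sits in a job cgroup AND sees only its slice."""
    in_job = any(_JOB_COMPONENT.search(p)
                 for p in cgroup_paths(proc_cgroup_text))
    return in_job and visible_cpus <= reserved_cpus
```

On the node, the inputs would come from open("/proc/self/cgroup").read() and len(os.sched_getaffinity(0)), the same affinity mask nproc reports. A direct-SSH session in /user.slice/.../session-cN.scope seeing 56 CPUs fails both checks.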
Deliverable B: auto-renew / respawn spec
Goal
A persistent presence survives the two forced-turnover events –
courtesy time expiry and maintenance reboot – with minimal manual
intervention and graceful context handoff.
Status (v1 implemented)
The controller ships as sucoder/renew.py (pure decision logic + loop,
fully unit-tested in tests/test_renew.py) wired to the sucoder
renew CLI command, with a detached relaunch path added to
MirrorManager.launch_agent (tmux new-session -A -d). Implemented:
SLURM-state polling with correct D-HH:MM:SS parsing, the
HOLD/DRAINING/LOST decision machine (a failed probe never relaunches),
checkpoint-sentinel signalling, exponential backoff, overlap
re-allocation, and old-job scancel. Cluster-validate: the live
re-allocate/relaunch/scancel path and the shared-partition cgroup
entry (Deliverable A open requirement) have not been exercised on
Savio. Deferred: max_idle idle-release, reading renew tunables from
a slurm.renew config block (currently CLI flags), and the
login-node-daemon controller variant (v2).
Principles
- Authoritative state from SLURM, not the watchdog. The renew controller polls squeue -o %L / sacct -o State, parsing D-HH:MM:SS correctly. It does not depend on the in-node slurm-deadline.warn warner regardless of the latter's health (now fixed; see Appendix) – the two channels stay independent.
- Files survive, process does not. The repo on NFS $HOME plus a pushed mirror are durable; the agent's conversation context dies on turnover. Every turnover is therefore bracketed by commit+push+handoff before and rehydrate-from-handoff after.
- Thin and courteous. Renew re-allocates the same thin slice; never escalate cores on renew.
State machine
- HOLDING – job J running on node N, agent in tmux sucoder-<mirror>.
- DRAINING (time-left < courtesy_drain_min, default 20m): signal the agent to checkpoint (commit, push, write handoff); wait for its "ready" marker or a timeout; allocate J' while J still holds (overlap, so the queue slot is never dropped under contention); relaunch the agent in J'; update session YAML; scancel J.
- LOST (squeue empty / sacct terminal – reboot, crash): re-allocate J' with exponential backoff (a maintenance window may keep the partition down for hours); relaunch; the agent rehydrates from the last handoff plus git log.
- RELEASED (sucoder release): controller exits; no renew.
- IDLE-RELEASE (optional, etiquette): no agent activity for max_idle -> push+handoff+scancel, so an unused presence stops squatting the condo pool. Re-collaborate to wake.
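The core of the state machine above reduces to a small pure function over one probe result. This is a sketch under stated assumptions, not the code in sucoder/renew.py: the Decision enum, the decide signature, and the way a probe result is represented are all hypothetical, and RELEASED / IDLE-RELEASE are left out.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    HOLD = "hold"          # keep the current job, do nothing
    DRAINING = "draining"  # checkpoint, allocate J', relaunch, scancel J
    LOST = "lost"          # job is gone: re-allocate with backoff


def decide(probe_ok: bool, job_present: bool, minutes_left: Optional[int],
           courtesy_drain_min: int = 20) -> Decision:
    """Pick the next action from a single SLURM probe.

    probe_ok is False when squeue/sacct itself failed. In that case the
    job's fate is unknown, so the safe answer is HOLD: a failed probe
    must never trigger a relaunch.
    """
    if not probe_ok:
        return Decision.HOLD
    if not job_present:
        return Decision.LOST
    if minutes_left is not None and minutes_left < courtesy_drain_min:
        return Decision.DRAINING
    return Decision.HOLD
```

Keeping the decision free of I/O is what makes it unit-testable without a cluster; the loop around it owns the polling, backoff and scancel side effects.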
Courtesy vs cadence (the core tension)
The QOS has no MaxWall, but a running job's TimeLimit cannot be
raised by a non-operator, so "renew" means re-allocate, not extend.
Therefore time directly sets turnover frequency, and every turnover
costs a context reset (the agent process restarts). Pick time to
balance:
- shorter (24h) -> more polite, but daily context resets;
- longer (3–7d) -> rare resets approaching the reboot cadence, at the cost of a longer idle hold.
A thin slice makes the longer choice defensible (small footprint)
and makes re-grab fast when turnover does happen. Recommended:
time: "3-00:00:00" with max_idle as the real courtesy guard.
Handoff / rehydration
- Before turnover the agent writes a structured note to a fixed NFS path, e.g. ~/mirrors/<mirror>/.sucoder/handoff.org: current task, branch, last commit SHA, open questions, next action.
- On relaunch, system_prompt_extra instructs the agent to first read the handoff + git log -5 + git status, summarize the resumed state back to the human, then continue. This turns a cold start into a one-message rehydration.
Signaling the agent to checkpoint
The watchdog's delivery is the weak link, so the controller writes a
durable sentinel (~/.cache/sucoder/renew-requested) AND the prompt
instructs the agent to poll it at natural checkpoints (between tool
batches), not passively. Optional upgrade: tmux send-keys a notice
into the agent pane (more reliable than a status-line flash, but
fiddly to inject into an interactive TUI) – defer past v1.
Where the controller runs
- v1: operator-side loop – sucoder -T carleton-htc collaborate --renew (or sucoder renew <mirror>) reuses the warm ControlMaster, session YAML, and _ensure_slurm_node. Smallest delta to existing code. Caveat: dies if the operator box sleeps.
- v2: login-node daemon – more autonomous (survives laptop sleep) but adds a login-node footprint and itself dies on login-node reboot. Revisit only if v1's sleep caveat bites.
Proposed CLI / config surface
slurm:
  renew:
    courtesy_drain_min: 20   # start DRAINING this many minutes before time
    overlap: true            # allocate J' before cancelling J
    max_idle: "12h"          # IDLE-RELEASE threshold (0 = never)
    backoff_max: "30m"       # LOST re-allocate backoff ceiling

- sucoder -T carleton-htc collaborate --renew [--renew-detach]
- sucoder -T carleton-htc renew <mirror> (attach loop to a live session)
Interaction with the existing watchdog
The renew controller supersedes the in-node timer's renewal role.
Keep _start_slurm_timer only as a human courtesy (after fixing the
Appendix bug and improving delivery), or retire it. Nothing in the
renew path reads slurm-deadline.warn.
Open questions
- Shared-partition cgroup entry (Deliverable A "Open requirement") – must resolve before real compute on the slice.
- Reliable checkpoint signal – sentinel + prompt (v1) vs tmux send-keys (later).
- Controller home across operator sleep – operator-side (v1) vs login-node daemon (v2).
- Maintenance-window awareness – reactive LOST re-allocate (v1) vs pre-draining on a fed-in schedule (later).
Appendix: watchdog day-format bug
_start_slurm_timer (sucoder/cli.py:721) parses squeue -o %L
time-left at cli.py:829–836:

    IFS=: read -ra parts <<< "$left"
    if [ ${#parts[@]} -eq 3 ]; then
      mins=$(( ${parts[0]#0}*60 + ${parts[1]#0} ))   # assumes HH:MM:SS
    ...

When >= 1 day remains, %L is D-HH:MM:SS. Splitting on : yields
parts=("D-HH" "00" "00"), and $(( D-HH*60 + 00 )) evaluates as
arithmetic D - HH*60 + 00, collapsing mins to ~the day count. The
timer then fires all three warnings in the first minutes, sets the
WARN5/15/30 sentinels, and stays silent forever after – including at
the true deadline. A 24:00:00 job mostly escapes (it drops below a
day within seconds, before the first poll), which is why the symptom is
silence rather than spurious noise; a multi-day time is fully
broken.
Fixed (this change): the parser is extracted to a left_to_mins
bash helper (module constant _SLURM_TIME_LEFT_TO_MINS_SH in
sucoder/cli.py) that splits the optional D- day prefix on -
before the : split, forces base-10 to dodge octal traps (08/09),
and returns a large sentinel for non-numeric values
(UNLIMITED/INVALID/empty) so no spurious warning fires. Covered by
tests/test_slurm_timer.py (3-00:00:00, 23:59:00, 45:00,
UNLIMITED, …). As separate hardening, each warning now sets a 15s
per-session display-time so a full-screen agent TUI does not redraw
over it before the human notices. The renew controller still reads
SLURM state directly and does not depend on the warner.
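For reference, the parsing rule the fix describes can be restated in Python. This is not the bash left_to_mins helper itself, only a sketch of the same rule (split the day prefix on "-" first, then the clock on ":", fall back to a large sentinel for anything non-numeric). UNKNOWN_MINS and its value are assumptions.

```python
UNKNOWN_MINS = 10**9  # sentinel: far above any warning threshold


def left_to_mins(left: str) -> int:
    """Convert squeue %L output to whole minutes remaining.

    Accepts MM:SS, HH:MM:SS and D-HH:MM:SS. Anything else
    (UNLIMITED, INVALID, empty) yields UNKNOWN_MINS so no warning fires.
    """
    text = left.strip()
    days_text, dash, clock = text.rpartition("-")
    if not dash:
        days_text = "0"  # no day prefix present
    fields = clock.split(":")
    if len(fields) == 2 and not dash:
        fields = ["0"] + fields  # MM:SS -> 0:MM:SS
    numeric = all(f.isascii() and f.isdigit() for f in [days_text] + fields)
    if len(fields) != 3 or not numeric:
        return UNKNOWN_MINS
    # int() always parses base-10 here, so "08" and "09" are not octal traps.
    days, hours, minutes = int(days_text), int(fields[0]), int(fields[1])
    return days * 1440 + hours * 60 + minutes
```

For comparison, the old expression fed 3-00:00:00 computes 3 - 00*60 + 0 = 3 minutes, which is why every warning fired immediately; the rule above gives 4320.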