Remote & Connection FAQ
Table of Contents
- About this FAQ
- Concepts
- Tunnels and warm connections
- When should I set up a tunnel?
- How do I set up a persistent tunnel?
- How do I check whether my tunnels are live?
- How do I tear a tunnel down?
- How long does a tunnel stay warm? Do I have to re-authenticate?
- Do tunnels survive a laptop suspend / hibernate or a network change?
- Can I avoid the OTP for a while? (SSH certificates)
- Editing remote files from Emacs / TRAMP
- Sessions: starting, reattaching, recovering
- How do I start an agent session on savio-node?
- How do I reattach to a job on savio-node?
- Why does attach say "Mirror is not configured for remote execution"?
- Why did attach drop me into a bash shell instead of the agent?
- How do I find my running jobs and tmux sessions?
- A crashed agent took my session with it — can I recover the conversation?
- Cost and lifecycle
- Troubleshooting reference
- Command quick reference
- Appendix: clearing stale tunnels automatically
About this FAQ
This covers the day-to-day connection questions for running agents on a
remote cluster target through sucoder — setting up tunnels,
reattaching to sessions, editing remote files from Emacs, and keeping
compute costs under control.
Throughout, savio-node is used as the example target (a SLURM-backed
target on UC Berkeley's Savio cluster). Substitute your own target name
after -T. Commands that operate on a target (the tunnel group)
don't need a mirror or a particular directory; commands that operate on a
mirror (collaborate, attach, release) either take a mirror name or
auto-detect it from the git repo you're standing in.
Concepts
What are the "tunnels", and which ones cost money?
Reaching a compute node on Savio is a chain of SSH hops:
- gateway (e.g. hpc.brc.berkeley.edu): the public login gateway; authenticating here requires your PIN + OTP.
- login node (e.g. ln003.brc): reachable only through the gateway; it is not a name your laptop can resolve directly.
- DTN (dtn.brc.berkeley.edu): the data-transfer node, used for git transport.
- compute node (e.g. n0019.savio3): allocated by SLURM; this is the only hop that costs compute budget.
The gateway, login node, and DTN are free to hold open. A compute-node
allocation bills for its entire wall-clock lifetime (often a 24-hour
salloc), whether or not an agent is actually working on it.
What's the difference between tunnel and collaborate?
- sucoder -T savio-node tunnel up establishes and warms only the free hops (gateway/login/DTN). No compute node, no cost. Do this once and many later commands ride the warm connection without re-authenticating.
- sucoder -T savio-node collaborate allocates a compute node (cost) and launches the agent there.
Think of tunnel up as "open the door to the cluster" and collaborate
as "rent a machine and start working."
Tunnels and warm connections
When should I set up a tunnel?
Run tunnel up when you're about to do several remote operations in a
session (one or more collaborate/attach runs, or editing files over
TRAMP) and don't want to type your PIN + OTP each time. It pays for
itself after the second connection. It costs nothing to leave up, so
"first thing in the morning" is a reasonable habit.
You don't strictly need it: collaborate will establish its own
connection on demand. But without a warm gateway mux, each fresh
connection can prompt for the OTP again.
How do I set up a persistent tunnel?
sucoder -T savio-node tunnel up
This authenticates once (PIN + OTP), warms the gateway/login/DTN
ControlMasters, pins a login node, and writes a managed block of SSH host
aliases (savio-node-gw, savio-node-ln, savio-node-dtn) into
~/.ssh/config so that plain ssh and Emacs/TRAMP can reuse the same
sockets. Pass --no-config-edit to warm the sockets without touching
~/.ssh/config.
How do I check whether my tunnels are live?
sucoder -T savio-node tunnel status          # human-readable
sucoder -T savio-node tunnel status --json   # scriptable
This is read-only: it probes each ControlMaster socket and reports
ACTIVE/DEAD plus whether the ~/.ssh/config block is present. No
authentication, no network cost.
How do I tear a tunnel down?
sucoder -T savio-node tunnel down           # close the sockets
sucoder -T savio-node tunnel down --prune   # also remove the ssh_config block
How long does a tunnel stay warm? Do I have to re-authenticate?
While the ControlMaster is alive, collaborate, plain ssh, and TRAMP
reuse it with no prompt. The client-side idle lifetime is governed by
control_persist on the target (default 7d; the tunnel up output
shows the current value). You only re-authenticate (PIN + OTP) when the
gateway mux expires; because the login and DTN aliases jump through the
gateway, re-authenticating the gateway once re-warms everything.
Two independent timers are in play, and it helps to keep them straight:
- control_persist: idle lifetime, i.e. how long the mux lingers after the last client disconnects. This is the "stays open" lever.
- keepalive_interval x keepalive_count_max (the ServerAlive* pair, default 30s x 120 = 1h): the in-flight grace budget, i.e. how long an established connection may go unanswered before it is declared dead and torn down. This is not a lifetime knob; it is what lets a live link ride out a brief stall (see the suspend question below).
Caveat: both bound the client side only. The real ceiling is the
cluster's server-side session policy, which sucoder does not control —
so treat a long control_persist as best-effort, not a guarantee.
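The default budget quoted above is simply the product of the two keepalive settings. As a quick check of that arithmetic, here is a minimal sketch. keepalive_budget_seconds is a hypothetical helper name, not a sucoder command, and the mapping to the OpenSSH option names in the comments assumes sucoder passes the values through unchanged.

```sh
#!/bin/sh
# Hypothetical helper (not part of sucoder): the in-flight grace budget is
# keepalive_interval (seconds) multiplied by keepalive_count_max (probes).
# Assumed OpenSSH equivalents: ServerAliveInterval and ServerAliveCountMax.
keepalive_budget_seconds() {
    interval=$1
    count_max=$2
    echo $(( interval * count_max ))
}

# Defaults quoted in this FAQ: 30 s x 120 probes.
keepalive_budget_seconds 30 120   # prints 3600, i.e. one hour
```

This number says nothing about idle lifetime, which control_persist governs separately.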
Do tunnels survive a laptop suspend / hibernate or a network change?
Sometimes. A brief suspend on the same network can now survive; a network change never does.
The default keepalive budget (30s x 120 = 1h) is deliberately generous:
when you wake the laptop, the client tolerates up to an hour of missed
probes before declaring the link dead, so if the connection was still
valid underneath — the server hadn't reaped it and your IP hadn't
changed — it simply resumes with no prompt. Raise keepalive_count_max
on the target if you routinely nap longer than that.
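To size keepalive_count_max for a longer nap, divide the suspend length by the probe interval and round up. The following is a small sketch of that calculation. count_max_for_nap is a hypothetical helper, and it sizes only the client-side budget.

```sh
#!/bin/sh
# Hypothetical helper: smallest keepalive_count_max whose budget covers a
# suspend of nap_minutes, given keepalive_interval in seconds.
count_max_for_nap() {
    nap_minutes=$1
    interval=$2
    nap_seconds=$(( nap_minutes * 60 ))
    # Integer ceiling division: (a + b - 1) / b
    echo $(( (nap_seconds + interval - 1) / interval ))
}

count_max_for_nap 180 30   # a 3-hour nap at 30 s probes: prints 360
```

A larger value helps only in the case described above, where the connection is still valid underneath when the laptop wakes.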
But two things still break tunnels for good, and the second is the more common culprit:
- A long suspend lets the server (or a stateful firewall) drop the connection. Once the cluster sends a reset or the NAT mapping expires, the TCP connection under the ControlMaster socket is dead and no client-side budget can revive it.
- Changing networks changes your IP address. A TCP connection is
identified by the four-tuple (source IP, source port, destination IP,
destination port); when your source IP changes — roaming from home
WiFi to the office, a new DHCP lease, a VPN toggling — the existing
connections are dead by definition, and there is no way for standard
SSH to migrate them. This kills the tunnels even on a brief suspend,
or with no suspend at all if you just switch networks. (mosh roams
across IP changes for interactive shells, but ssh ControlMaster sockets do not.)
Either way, because the gateway requires PIN + OTP, the only way back is to re-authenticate once — unless you hold a still-valid SSH certificate, which re-warms the chain with no prompt (see Can I avoid the OTP for a while? below).
sucoder detects the dead-but-zombie sockets correctly: tunnel status
does an end-to-end probe (not just a local mux check), so it reports
DEAD rather than a misleading ACTIVE. Recovery is one command:
sucoder -T savio-node tunnel up       # one OTP, re-warms the whole chain
sucoder -T savio-node tunnel status   # confirm ACTIVE
Re-warm before opening TRAMP — otherwise TRAMP finds the stale login
socket, falls back to a fresh ProxyJump, and you get an awkward auth
prompt inside Emacs instead of a clean one on a real terminal. If you
suspend/resume or change networks frequently, a system hook can clear the
zombie sockets automatically on resume so nothing trips on them — see
the appendix for
ready-to-use Linux and macOS examples. (A hook can only clean up; it
can't re-authenticate for you — that OTP is yours to type.)
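The check-then-re-warm sequence can be wrapped so that it prompts only when needed. This is a sketch rather than part of sucoder: the function names are hypothetical, and it assumes only that the human-readable tunnel status output contains the literal words ACTIVE and DEAD, as described above. The exact layout of that output is not assumed.

```sh
#!/bin/sh
# Succeed (exit 0) when the status report suggests a re-warm is needed.
tunnels_need_rewarm() {
    target=$1
    report=$(sucoder -T "$target" tunnel status 2>&1)
    case $report in
        *DEAD*)   return 0 ;;  # at least one socket reported dead
        *ACTIVE*) return 1 ;;  # sockets reported, none dead
        *)        return 0 ;;  # unrecognized output: err toward re-warming
    esac
}

# Re-warm only when needed; tunnel up is interactive (PIN + OTP).
ensure_warm() {
    target=$1
    if tunnels_need_rewarm "$target"; then
        sucoder -T "$target" tunnel up
    fi
}
```

Running ensure_warm savio-node in a real terminal before starting Emacs keeps the authentication prompt out of TRAMP.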
Can I avoid the OTP for a while? (SSH certificates)
Yes — for up to 12 hours. The free gateway accepts a short-lived SSH
certificate in place of an interactive PIN + OTP. Mint one once and
every later ssh, tunnel up, and collaborate rides it with no prompt
— even across a suspend or a network change — until it expires. This
is the one way to make a dropped tunnel re-warm without typing an OTP.
The MSM CA hard-caps a cert's lifetime at 12h (confirmed against the server: it rejects anything longer), so one OTP buys at most a 12-hour passwordless window; after that you authenticate once more to mint a fresh cert.
Acquire one from your own terminal — so the OTP never lands in an agent transcript and there is no chat round-trip to blow the ~30s TOTP window:
scripts/brc-connect.sh
It prompts for your PIN (hidden) and a fresh OTP, then consumes them
immediately: it requests the cert, prints the granted validity, and runs a
passwordless login test. On SUCCESS the cert sits at
~/.ssh/ssh_certs/brc_cert and the gateway ControlMaster is already warm.
For a scripted path (PIN already in hand, e.g. from a secrets manager):
scripts/brc-cert.sh <PIN> <OTP> # or: BRC_PIN=.. scripts/brc-cert.sh <OTP>
Env overrides: BRC_USER (default ligon), BRC_LIFETIME (default 12h,
the ceiling), LRC_SCRIPTS (default ~/lrc-scripts, auto-cloned from
lbnl-science-it/lrc-scripts on first run).
Prerequisite: a brc-login stanza in ~/.ssh/config that points
IdentityFile at the cert and enables ControlMaster, so the passwordless
test — and every later connection — actually uses it. Check how much
of the window is left at any time:
ssh-keygen -L -f ~/.ssh/ssh_certs/brc_cert-cert.pub | grep -i valid
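If you want only the expiry time, for a prompt or a script, the Valid line can be parsed. The sketch below assumes the usual ssh-keygen -L layout, in which that line reads "Valid: from <start> to <end>". The function names are hypothetical, and the default path is the one used in the command above.

```sh
#!/bin/sh
# Extract the expiry timestamp from `ssh-keygen -L` output on stdin.
# Prints nothing for a certificate reported as valid forever.
parse_cert_expiry() {
    sed -n 's/^[[:space:]]*Valid: from .* to \(.*\)$/\1/p'
}

# Hypothetical convenience wrapper around the command shown above.
cert_expiry() {
    ssh-keygen -L -f "${1:-$HOME/.ssh/ssh_certs/brc_cert-cert.pub}" | parse_cert_expiry
}
```

Calling cert_expiry with no argument prints the end of the current window, which you can compare against the clock before starting a long session.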
Editing remote files from Emacs / TRAMP
How do I open a file on the login node in Emacs?
- Warm the tunnel once: sucoder -T savio-node tunnel up.
- Tell TRAMP to honor ~/.ssh/config instead of opening its own connection (in your Emacs init, once):
(setq tramp-use-connection-share nil)
On Emacs < 30.1 the variable is named tramp-use-ssh-controlmaster-options (now an obsolete alias); set it to nil either way.
- Open the file via the alias:
C-x C-f /ssh:savio-node-ln:/global/home/users/<you>/project/file.py
TRAMP runs ssh savio-node-ln, which jumps through the gateway (ProxyJump)
and attaches to the warm socket, so there is no password prompt.
Why does TRAMP (or ssh savio-node-ln) keep prompting for a password?
Two usual causes:
- Your ~/.ssh/config shadows the managed block. ssh uses the first value it finds for each keyword, so a Host * block with its own ControlPath that appears before the managed block wins, and ssh looks for a socket that doesn't exist, then falls back to a fresh, re-authenticating connection. Diagnose it:
sucoder -T savio-node tunnel doctor
The fix is to re-run tunnel up (which writes the block at the top of the file) or move the managed block above the offending Host *.
- TRAMP cached the old connection. Changing tramp-use-connection-share doesn't affect an already-open connection: M-x tramp-cleanup-all-connections, then reopen the file.
Isolate which layer is at fault by testing outside Emacs first:
ssh savio-node-ln true should return instantly and silently once the
tunnel is warm. If that prompts, it's the ssh config or the tunnel; if only
TRAMP prompts, it's TRAMP caching or the variable not taking effect.
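Both checks can be scripted. The sketch below relies on two standard OpenSSH features: ssh -G, which prints the configuration ssh would use for a host without connecting, and BatchMode=yes, which makes ssh fail rather than prompt. The function names are hypothetical.

```sh
#!/bin/sh
# Print the ControlPath ssh resolves for an alias after every Host/Match
# block in the ssh config has been applied (ssh -G prints lowercase keys).
effective_control_path() {
    ssh -G "$1" | sed -n 's/^controlpath //p'
}

# Succeed only if the alias connects with no interactive prompt.
# BatchMode=yes turns a would-be password or OTP prompt into a failure.
alias_is_warm() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}
```

If effective_control_path savio-node-ln prints a path outside the directory where sucoder keeps its sockets, an earlier block is shadowing the managed one. If alias_is_warm savio-node-ln succeeds but TRAMP still prompts, the problem is on the Emacs side.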
Sessions: starting, reattaching, recovering
How do I start an agent session on savio-node?
From inside your project's git repo:
sucoder -T savio-node collaborate
This allocates a compute node, mirrors the repo, and launches the agent
in a tmux session named sucoder-<mirror> on that node.
How do I reattach to a job on savio-node?
sucoder -T savio-node attach            # auto-detects the mirror from cwd
sucoder -T savio-node attach <mirror>   # or name it explicitly
attach verifies the SLURM job is still running, jumps through the
gateway and login node to the compute node, and reattaches to the
sucoder-<mirror> tmux.
If direct login-to-compute SSH is blocked, or the compute node wasn't recorded, use the recovery path that joins the allocation by job id:
sucoder -T savio-node attach <mirror> --via-srun
Why does attach say "Mirror is not configured for remote execution"?
The mirror name didn't resolve to a configured (or cwd-matching) mirror.
attach/release accept an explicit name only if it's configured or it
matches the git repo you're standing in. Easiest fix: run the command
from inside the repo directory, with or without the name.
Why did attach drop me into a bash shell instead of the agent?
The agent process inside the tmux exited. collaborate launches the
agent as <agent>; exec bash -l, so when the agent quits, the tmux window
stays alive at a shell rather than closing. You can re-run the agent
(e.g. claude --continue) right there, inspect state, or detach with
C-b d.
For a SLURM target, attach will not silently put you on a login-node
shell: if no live job is recorded, or the compute node is unknown and you
didn't pass --via-srun, it refuses with a clear message instead of
handing you an empty shell that looks like a session.
How do I find my running jobs and tmux sessions?
On the cluster (cheap once the tunnel is warm):
ssh savio-node-ln squeue --me   # your live allocations + nodes
ssh savio-node-ln tmux ls       # tmux sessions on the login node
# tmux on a specific compute node, joined via the job:
ssh savio-node-ln "srun --jobid=<JOB> --overlap tmux ls"
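For scripting, squeue can emit a compact, header-free listing. This is a sketch with hypothetical function names. It uses the standard squeue format codes %i (job id), %T (state), %N (node list), and %L (time left), and assumes the login-node alias written by tunnel up.

```sh
#!/bin/sh
# List "jobid state nodelist timeleft" for your jobs via the given ssh alias.
# -h suppresses the header line; --me restricts the listing to your own jobs.
my_jobs() {
    ssh "$1" "squeue --me -h -o '%i %T %N %L'"
}

# Filter my_jobs output on stdin down to the ids of RUNNING jobs.
running_job_ids() {
    awk '$2 == "RUNNING" { print $1 }'
}
```

For example, my_jobs savio-node-ln | running_job_ids prints one job id per line, each usable as the <JOB> value in the srun command above.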
A crashed agent took my session with it — can I recover the conversation?
Often yes. Claude Code persists the conversation transcript under
~/.claude/projects/<cwd-key>/<session-id>.jsonl. On a cluster where
$HOME is shared (NFS), that survives the compute node going away.
Find the session id:
ls -lt ~/.claude/projects/ | head          # locate the project dir
ls -lt ~/.claude/projects/<dir>/*.jsonl    # newest = the lost session
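Picking the newest transcript can also be scripted. This sketch assumes, as the path pattern above suggests, that the session id is the file name without its .jsonl suffix. newest_session_id is a hypothetical helper, and it expects file names without newlines.

```sh
#!/bin/sh
# Print the session id of the most recently modified transcript in a
# project directory; return 1 if the directory holds no .jsonl files.
newest_session_id() {
    project_dir=$1
    newest=$(ls -t "$project_dir"/*.jsonl 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    basename "$newest" .jsonl
}
```

Check that the printed id really is the lost session before resuming it, since any later session in the same project directory would sort first.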
Resume it as the launched agent (instead of a fresh one):
sucoder -T savio-node collaborate --agent-command 'claude --resume <session-id>'
A bare collaborate would start a new session; --agent-command overrides what gets launched while still doing the allocation and env setup.
Caveats: resume works cleanly only if the mirror was on shared storage
(so the working directory still exists); a mirror on compute-node
/local disk is gone with the node. An in-flight background workflow
cannot be auto-resumed after the host process dies — but completed
sub-results are saved under ~/.claude/projects/<key>/<session-id>/workflows/,
so the resumed agent can read them and re-dispatch only the unfinished
work.
Cost and lifecycle
Does the agent keep running if I disconnect?
Yes. Detaching from tmux (or losing your SSH connection) leaves the agent running and the SLURM allocation billing. This is deliberate — a dropped connection shouldn't kill your work — but it means you are responsible for releasing the allocation when you're done.
How do I free a compute allocation when I'm finished?
sucoder -T savio-node release # cancels the SLURM job, clears the session
I think I have orphaned allocations. How do I find and cancel them?
ssh savio-node-ln squeue --me # list everything running under your account
Before cancelling, make sure nothing live is on the node. An allocation
can host a running agent even when tmux ls shows nothing on the default
socket (e.g. a session you're driving from the Claude phone app via
/remote-control, which reaches the agent through Anthropic's servers,
not SSH). Check for the process first:
# \$USER is escaped so it expands on the cluster, not on your laptop
ssh savio-node-ln "srun --jobid=<JOB> --overlap ps -u \$USER -o pid,etime,cmd" \
  | grep -iE 'claude|node'
Cancel only the jobs that are genuinely idle:
ssh savio-node-ln scancel <JOB> [<JOB> ...]
Troubleshooting reference
"Session open refused by peer" / "Could not resolve hostname ln003.brc"
A transient SSH multiplexing hiccup: the shared connection refused a new
session and ssh fell back to dialing a jump-only hostname directly.
Recent sucoder routes that fallback through the gateway, so simply
re-running the command usually succeeds. If it persists, re-warm the
tunnel: sucoder -T savio-node tunnel up.
tunnel status shows ACTIVE but I still get a password prompt
Classic ~/.ssh/config shadowing — run sucoder -T savio-node tunnel
doctor. See "Why does TRAMP keep prompting" above.
What does tunnel doctor check?
sucoder -T savio-node tunnel doctor
- the managed ~/.ssh/config block is present;
- no earlier Host/Match block shadows the managed aliases' ControlPath/ControlMaster (the first-value-wins trap);
- the login-node alias still matches the pinned login node (catches the cluster reassigning your login node).
It exits non-zero if any problem is found, so you can use it as a pre-flight check.
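One way to use that exit status is to gate a billable command behind the check. This is a sketch only: with_preflight is a hypothetical wrapper, and it relies solely on the documented behavior that tunnel doctor exits non-zero when it finds a problem.

```sh
#!/bin/sh
# Run a sucoder subcommand only if `tunnel doctor` reports no problems.
# Usage: with_preflight <target> <subcommand> [args...]
with_preflight() {
    target=$1
    shift
    if ! sucoder -T "$target" tunnel doctor; then
        echo "tunnel doctor reported problems for $target; not running: $*" >&2
        return 1
    fi
    sucoder -T "$target" "$@"
}
```

For example, with_preflight savio-node collaborate refuses to allocate a compute node while the ssh configuration is still shadowed.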
Command quick reference
| Task | Command |
|---|---|
| Warm the free tunnels (auth once) | sucoder -T savio-node tunnel up |
| Mint a 12h passwordless cert (one OTP to mint) | scripts/brc-connect.sh |
| Check tunnel liveness | sucoder -T savio-node tunnel status |
| Diagnose ssh_config / pin issues | sucoder -T savio-node tunnel doctor |
| Close the tunnels | sucoder -T savio-node tunnel down [--prune] |
| Start an agent on a compute node | sucoder -T savio-node collaborate |
| Reattach to the agent session | sucoder -T savio-node attach |
| Reattach via the allocation (recovery) | sucoder -T savio-node attach --via-srun |
| Resume a crashed conversation | sucoder -T savio-node collaborate --agent-command 'claude --resume <id>' |
| Free a compute allocation | sucoder -T savio-node release |
| Edit remote files in Emacs | C-x C-f /ssh:savio-node-ln:/path (with tramp-use-connection-share nil) |
| List your jobs / cancel an orphan | ssh savio-node-ln squeue --me / ssh savio-node-ln scancel <JOB> |
Appendix: clearing stale tunnels automatically
After a suspend/resume or a network move, the ControlMaster sockets under
~/.sucoder/ssh/ are zombies (the local mux daemon survives, but its TCP
connection is dead). sucoder detects this and tunnel up re-establishes
cleanly, but a plain ssh or a TRAMP buffer opened before you re-warm
will trip on the dead socket and prompt awkwardly. The hooks below tear
the zombies down on resume so nothing trips on them.
They only clean up — they cannot re-authenticate (the gateway OTP is
yours to type). After resume you still run sucoder -T <target> tunnel up
once to re-warm.
All three use the same cleanup: for each socket, ask its master to exit
(a local operation — no network needed); if the master is already gone,
remove the stale socket file. The socket filename is the hostname, which
is also the destination ssh -O exit expects.
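Before installing a hook, it can be useful to see what the loop would touch. The sketch below is a dry-run variant that changes nothing. It swaps -O exit for the standard ssh -O check, which only asks whether a master process is listening on the socket. list_tunnel_sockets is a hypothetical name, and the default directory is the one named above.

```sh
#!/bin/sh
# Dry run of the cleanup loop: report each socket, remove nothing.
list_tunnel_sockets() {
    dir=${1:-"$HOME/.sucoder/ssh"}
    for sock in "$dir"/*.sock; do
        [ -S "$sock" ] || continue   # skip non-sockets and an unmatched glob
        host=$(basename "$sock" .sock)
        if ssh -o ControlPath="$sock" -O check "$host" 2>/dev/null; then
            echo "master running: $host ($sock)"
        else
            echo "no master, stale file: $host ($sock)"
        fi
    done
}
```

A "master running" line does not mean the tunnel is usable: after a suspend the local master can be alive while its TCP connection is dead, which is exactly the zombie case the hooks clear.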
Linux — systemd suspend/resume hook
Create /usr/lib/systemd/system-sleep/sucoder-tunnels (root-owned,
chmod +x). systemd runs it as root with $1 = pre/post and $2
= suspend/hibernate/...; we drop to your user on post to clean
that user's sockets. Edit USER_NAME.
#!/bin/sh
# /usr/lib/systemd/system-sleep/sucoder-tunnels
USER_NAME=ligon   # <-- your login name
case "$1" in
  post)
    runuser -l "$USER_NAME" -c '
      for sock in "$HOME"/.sucoder/ssh/*.sock; do
        [ -S "$sock" ] || continue
        host=$(basename "$sock" .sock)
        ssh -o ControlPath="$sock" -O exit "$host" 2>/dev/null || rm -f "$sock"
      done
    '
    ;;
esac
sudo install -m 0755 sucoder-tunnels /usr/lib/systemd/system-sleep/sucoder-tunnels
macOS — sleepwatcher wake hook
sleepwatcher runs ~/.wakeup on wake. Install it, create the script,
and start the daemon:
brew install sleepwatcher
#!/bin/sh
# ~/.wakeup (chmod +x)
for sock in "$HOME"/.sucoder/ssh/*.sock; do
  [ -S "$sock" ] || continue
  host=$(basename "$sock" .sock)
  ssh -o ControlPath="$sock" -O exit "$host" 2>/dev/null || rm -f "$sock"
done
chmod +x ~/.wakeup
brew services start sleepwatcher
Optional — Linux network-change hook (NetworkManager)
A suspend hook won't fire when you only switch networks (no suspend).
If that's your common case, add a NetworkManager dispatcher script that
cleans up when connectivity changes. Create
/etc/NetworkManager/dispatcher.d/90-sucoder-tunnels (root-owned,
chmod +x); it runs as root with $1 = interface and $2 = action.
#!/bin/sh
# /etc/NetworkManager/dispatcher.d/90-sucoder-tunnels
USER_NAME=ligon   # <-- your login name
case "$2" in
  up|connectivity-change)
    runuser -l "$USER_NAME" -c '
      for sock in "$HOME"/.sucoder/ssh/*.sock; do
        [ -S "$sock" ] || continue
        host=$(basename "$sock" .sock)
        ssh -o ControlPath="$sock" -O exit "$host" 2>/dev/null || rm -f "$sock"
      done
    '
    ;;
esac
This clears the tunnels on every connectivity change, including reconnecting to the same network — harmless (you re-warm on demand), but slightly eager. Skip it if you only ever lose tunnels across a suspend.