Surviving Slack Socket Mode Reconnects: Overlapping Sockets and Event Dedup

For about three weeks, our deploy bot would occasionally miss a button click. Not often — once or twice a day, always for a few seconds, never reproducible on demand. We blamed flaky Wi-Fi, then a slow handler, then gremlins. The real culprit was Socket Mode doing exactly what it’s documented to do: Slack periodically recycles the WebSocket connection, sends you a disconnect frame, and expects you to have already opened a replacement. Our client closed the old socket the instant the frame arrived and opened the new one afterward, and every event that landed in that gap fell on the floor.

This is a guide to making a Socket Mode ops bot survive Slack’s connection churn. The core ideas are simple once you see them: open the new socket before draining the old one, and dedup events because the transport is at-least-once, not exactly-once. Most Socket Mode tutorials stop before either of these matters, which is why so many bots have a quiet, intermittent event-loss bug nobody can pin down.

How Socket Mode actually reconnects

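When you call apps.connections.open, you get a WebSocket URL, connect, and receive a hello frame. Events then flow as envelopes, each of which you must acknowledge by its envelope_id within three seconds. The part people miss is what happens next: Slack doesn’t keep that socket open forever. It sends a disconnect frame, and two of its reason values mean a reconnect is due (a third, link_disabled, means Socket Mode was turned off for the app, so there is nothing to reconnect to):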

refresh_requested — routine recycling; Slack wants you to reconnect soon
warning — the socket is about to close
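To make the frame types above concrete, here is a minimal sketch of a dispatcher for raw Socket Mode frames. The function name routeFrame and the action labels it returns are illustrative assumptions, not part of any Slack SDK; the frame fields (type, reason, envelope_id, payload) follow the shapes described above. Treating every other disconnect reason as "do not reconnect" is also an assumption of this sketch.

```javascript
// Hypothetical helper: decide what to do with one raw Socket Mode frame.
// It returns a plain description of the action, so the logic can be
// exercised without a live WebSocket.
function routeFrame(raw) {
  const frame = JSON.parse(raw);

  if (frame.type === 'hello') {
    return { action: 'ready' };
  }

  if (frame.type === 'disconnect') {
    // Both reasons listed above end the same way: open a replacement socket.
    const reconnect =
      frame.reason === 'refresh_requested' || frame.reason === 'warning';
    return {
      action: reconnect ? 'open_replacement' : 'stop', // 'stop' is this sketch's choice
      reason: frame.reason,
    };
  }

  if (frame.envelope_id) {
    // Acknowledge by echoing the envelope_id back on the same socket.
    return {
      action: 'ack_and_handle',
      ack: JSON.stringify({ envelope_id: frame.envelope_id }),
      payload: frame.payload,
    };
  }

  return { action: 'ignore' };
}
```

Keeping the routing pure like this means the reconnect decision can be unit tested separately from the socket plumbing.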

The critical detail is the overlap window. When Slack signals a refresh, it does not immediately sever the connection. You’re meant to open a new socket while the old one is still delivering events, let both run briefly, and only then close the old one. If you reverse that order — close first, open second — you create a blind gap. The Bolt SDKs handle this overlap for you when configured correctly, but if you’ve rolled a custom client or disabled the built-in behavior, the gap is yours to own.

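// Conceptual: the right ordering.
// openNewSocket, waitUntilReady, and drainAndClose are placeholders.
socket.on('disconnect', async (frame) => {
  if (frame.reason === 'refresh_requested' || frame.reason === 'warning') {
    const next = await openNewSocket();   // open FIRST
    await waitUntilReady(next);           // wait for its hello frame
    drainAndClose(socket);                // close OLD only after next is live
    socket = next;                        // next needs this same handler attached
  }
});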
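The ordering above can be exercised without a network connection. What follows is a minimal sketch, assuming a FakeSocket built on Node's EventEmitter in place of a real WebSocket; FakeSocket, OverlapManager, and the event names 'hello', 'disconnect', and 'event' are all hypothetical names chosen for illustration, not an SDK API.

```javascript
const { EventEmitter } = require('node:events');

// Stand-in for a WebSocket. Tests drive it by emitting 'hello',
// 'disconnect', and 'event' by hand.
class FakeSocket extends EventEmitter {
  constructor(name) {
    super();
    this.name = name;
    this.closed = false;
  }
  close() {
    this.closed = true;
    this.removeAllListeners();
  }
}

// Hypothetical manager: keeps delivering from the old socket until the
// replacement has said hello, and only then retires the old one.
class OverlapManager {
  constructor(openSocket, onEvent) {
    this.openSocket = openSocket; // factory returning a not-yet-ready socket
    this.onEvent = onEvent;       // called for every event, from either socket
    this.active = null;
    this.pending = null;
    this.attach(this.openSocket());
  }

  attach(socket) {
    socket.on('event', (e) => this.onEvent(e, socket.name));
    socket.on('disconnect', (frame) => this.handleDisconnect(socket, frame));
    socket.once('hello', () => this.promote(socket));
  }

  promote(socket) {
    const old = this.active;
    this.active = socket;
    this.pending = null;
    if (old && old !== socket) old.close(); // close OLD only after NEW is live
  }

  handleDisconnect(socket, frame) {
    // Ignore frames from retired sockets or while a rollover is underway.
    if (socket !== this.active || this.pending) return;
    if (frame.reason === 'refresh_requested' || frame.reason === 'warning') {
      this.pending = this.openSocket(); // open FIRST; old socket stays attached
      this.attach(this.pending);
    }
  }
}
```

With real WebSockets, openSocket would request a fresh URL from apps.connections.open each time, and 'hello' would come from the parsed hello frame. Because both sockets feed the same onEvent callback during the overlap, this design depends on the dedup step described in the next section.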

If you’re choosing Socket Mode in the first place, the tradeoffs against a public HTTP endpoint are worth understanding. Socket Mode versus an HTTP request URL is a real architecture decision, and reconnect handling is one of the costs on the Socket Mode side of the ledger.

At-least-once means you will see duplicates

Here’s the consequence nobody warns you about: the same overlap that prevents lost events guarantees you’ll occasionally see duplicate events. During the window when both sockets are live, Slack may deliver an envelope on both. Separately, if your handler doesn’t ack within three seconds, Slack retries — another duplicate. Socket Mode is at-least-once delivery, full stop, and any handler with a side effect needs to be idempotent.

The fix is a short-TTL dedup store keyed on the event identity:

const seen = new Map(); // event_id -> expiry; use Redis if multi-process
const DEDUP_TTL_MS = 60_000;

function alreadyHandled(eventId) {
  const now = Date.now();
  for (const [id, exp] of seen) if (exp < now) seen.delete(id); // cheap sweep
  if (seen.has(eventId)) return true;
  seen.set(eventId, now + DEDUP_TTL_MS);
  return false;
}

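app.event('reaction_added', async ({ event, body }) => {
  // Bolt acks Events API envelopes itself; event listeners receive no ack().
  // body.event_id is Slack's identifier for this event.
  if (alreadyHandled(body.event_id)) return;
  await handleReaction(event); // duplicates seen within the TTL are skipped
});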

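A 60-second TTL covers the reconnect overlap and immediate retries. The HTTP Events API’s documented retry schedule stretches to five minutes after the original attempt, so if your Socket Mode app sees retries that late, lengthen the TTL to match. If you run multiple processes or multiple connections (Slack allows up to 10 concurrent Socket Mode connections per app), the in-memory map won’t do. Move the dedup store to Redis so all workers share it, or you’ll reintroduce duplicates across the fleet.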
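One way to keep the handler code identical whether the store is local or shared is to hide it behind a single claim operation. This is a minimal sketch under that assumption: MemoryClaimStore, claim, and dedupe are hypothetical names, and the in-memory class only stands in for a shared backend. A Redis-backed version could implement claim with one SET key value NX PX ttl command, which makes the check and the write a single atomic step across processes.

```javascript
// In-memory stand-in for a shared claim store. The clock is injectable
// so expiry can be tested without waiting.
class MemoryClaimStore {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.expiry = new Map(); // key -> expiry timestamp in ms
  }

  // Returns true if the caller is the first to claim `key` within the TTL.
  claim(key, ttlMs) {
    const t = this.now();
    const exp = this.expiry.get(key);
    if (exp !== undefined && exp > t) return false; // still claimed
    this.expiry.set(key, t + ttlMs);
    return true;
  }
}

// Hypothetical wrapper: run `handler` only for the first delivery of an id.
// Returns true if the handler ran, false if the delivery was a duplicate.
function dedupe(store, ttlMs, handler) {
  return (eventId, event) => {
    if (!store.claim(eventId, ttlMs)) return false;
    handler(event);
    return true;
  };
}
```

Note that this in-memory version overwrites expired entries but never sweeps them, so a long-running process would still want the periodic cleanup shown earlier; Redis expires keys on its own.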

Ack first, work later

The three-second ack rule deserves its own emphasis because violating it is how you manufacture duplicates. Acking is not “I finished the work” — it’s “I received the envelope.” Ack immediately, then do the work:

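app.action('approve_deploy', async ({ ack, body }) => { // placeholder action_id
  await ack();                          // within 3s, before any I/O
  enqueue(() => handleApproval(body));  // slow work runs detached
});
app.event('app_mention', async ({ event }) => {
  // No ack() here: Bolt acknowledges Events API envelopes for you.
  enqueue(() => respondTo(event));      // keep the listener fast anyway
});

Bolt passes ack() to listeners for interactive payloads (actions, commands, shortcuts, view submissions) and acknowledges Events API envelopes itself, so the ordering is yours to get wrong mainly in interactive handlers and custom clients. If you ack after a database write or a slow downstream call, you’ll eventually miss the deadline under load. For an event, Slack retries and you handle the same mention twice; for a button click, the user typically sees the interaction fail. The ack-first pattern and the dedup store work together: ack-first minimizes retries, and dedup catches the ones that slip through anyway.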
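For a custom client that writes acks to the socket itself, the same rule can be enforced structurally. The sketch below assumes a hypothetical helper, ackThenDefer, that takes the envelope, a send function for the socket, and the slow work as a callback; none of these names come from a Slack SDK.

```javascript
// Hypothetical raw-socket helper: send the ack synchronously, then run the
// slow work on a later turn of the event loop so it cannot delay the ack.
function ackThenDefer(envelope, send, work, onError = console.error) {
  send(JSON.stringify({ envelope_id: envelope.envelope_id })); // ack first

  setImmediate(() => {
    // Starting from Promise.resolve() routes synchronous throws from
    // work() into the same catch as rejected promises.
    Promise.resolve()
      .then(() => work(envelope.payload))
      .catch(onError); // a failed job must not take down the socket loop
  });
}
```

Because the ack is written before work is even scheduled, a slow or failing job can delay later work but never the acknowledgment.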

Letting AI scaffold the resilient client

I used an LLM to draft the reconnect-and-dedup wrapper, and it’s a good fit because the structure is well-known. A prompt like this gets you most of the way:

Write a Socket Mode wrapper that opens a replacement WebSocket on refresh_requested before closing the old one, acks every envelope within 3 seconds then processes async, and dedups events on a 60-second TTL keyed by event_ts.

What came back had the right shape. What it got wrong, and what a human has to catch, was the ordering on reconnect: the first draft closed the old socket before confirming the new one was ready, the exact bug that started this whole story. The AI drafts the boilerplate; you verify the one detail that actually provides the resilience. That’s the recurring pattern with this kind of infrastructure code, and it’s why a saved prompt library for ops scaffolding is worth keeping: you encode the gotchas once and stop rediscovering them.

A reconnect-resilience checklist

Open before close. New socket live before old socket drained. This is the whole ballgame.
Ack within 3 seconds, then process. Never inline slow work before the ack.
Dedup on a short TTL. Cover the overlap and retry windows; share the store across processes.
Handle both disconnect reasons. refresh_requested and warning both lead to reconnect.
Make handlers idempotent. Assume any event can arrive more than once.

Wrapping Up

Socket Mode is a pleasant way to run an ops bot without exposing a public endpoint, but it asks you to take responsibility for the connection lifecycle in return. The two non-negotiables are opening the replacement socket before draining the old one, and deduping events because delivery is at-least-once. Get the ordering right and the duplicates handled, and the intermittent “it missed my click” bug — the one you’ll otherwise chase for weeks — simply stops happening. Let AI scaffold the wrapper, but verify the reconnect ordering yourself, because that single detail is the one carrying all the weight.
