silent — a JEPA hunts you by sound

how silent actually works

JEPA = Joint Embedding Predictive Architecture, LeCun's framework for world models that predict in a compact latent space instead of generating pixels. Fast enough for real-time planning, small enough to train in hours on one GPU.

In silent the JEPA is the blind echolocating predator. Its only sensor is a 4-channel cardioid mel-spectrogram — 4 mics N/E/S/W at the predator's position, every tick. You're the teal dot. You move, you make footsteps, you leak audio if you stand still too long. The predator pings, listens to its own echoes plus your sounds, runs CEM in the JEPA's latent space, picks the move that closes the predicted distance to where it thinks you are. (Sister demo RELAY is the same loop but you and the JEPA are adversaries on the same task.)
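To make the sensor concrete, here is a minimal sketch of how four cardioid mics facing N/E/S/W would weight a single sound source by direction. It uses the textbook cardioid pattern, gain = 0.5 · (1 + cos θ). The coordinate convention (+x east, +y north) and the function name `cardioid_gains` are assumptions for illustration; the demo's actual audio renderer and its mel-spectrogram step are not shown.

```python
import math

# Assumed convention: +x is east, +y is north. Each mic points along one axis.
MIC_AXES = {"N": (0.0, 1.0), "E": (1.0, 0.0), "S": (0.0, -1.0), "W": (-1.0, 0.0)}


def cardioid_gains(listener_xy, source_xy):
    """Per-mic gain in [0, 1] for a source, using gain = 0.5 * (1 + cos(theta))."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        # No defined direction: give every mic the pattern's average gain.
        return {name: 0.5 for name in MIC_AXES}
    ux, uy = dx / dist, dy / dist  # unit vector from listener to source
    # cos(theta) is the dot product of the mic axis with the source direction.
    return {name: 0.5 * (1.0 + ux * ax + uy * ay) for name, (ax, ay) in MIC_AXES.items()}


gains = cardioid_gains((0.0, 0.0), (3.0, 0.0))  # source due east of the listener
```

A source due east is loudest in the E mic, silent in the W mic, and half-strength in N and S. That per-channel difference is the directional cue the encoder has to work with.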

the stack

LeWM paper byte-for-byte: 6M-param ViT-Tiny encoder trained from scratch (4-channel input conv for audio), 6-layer AR transformer predictor, two losses — next-embedding MSE + SIGReg. Plus a DexWM-style joint state head trained alongside the predictor at λ=10 — this is what lets the planner reason about real positions in the room instead of raw latent geometry.
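As a rough sketch of how the three terms could combine into one training objective: next-embedding MSE, plus SIGReg, plus the state-head loss weighted by λ=10. Assumptions, none confirmed by the text above: the state head is scored with MSE, the SIGReg term arrives as an already-computed scalar, and `sigreg_weight` is a hypothetical knob.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_next_emb, target_next_emb, sigreg_value,
               pred_state, true_state,
               state_weight=10.0,   # the lambda=10 on the joint state head
               sigreg_weight=1.0):  # hypothetical: the text gives no SIGReg weight
    """Weighted sum of predictor loss, embedding regularizer, and state-head loss."""
    predictor_term = mse(pred_next_emb, target_next_emb)
    state_term = mse(pred_state, true_state)  # assumed MSE on (predator_xy, player_xy)
    return predictor_term + sigreg_weight * sigreg_value + state_weight * state_term


loss = total_loss([1.0, 2.0], [1.0, 4.0], 0.5, [0.0, 0.0], [1.0, 1.0])
```

With the state term weighted ten times higher than the predictor term, errors in decoded position dominate the gradient, which fits the goal of a planner that reasons about real positions.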

Per tick: encode the last 3 audio observations, run CEM in latent space (16 candidate actions × 2 iterations of refit), decode each predicted next-embedding through the state head to get a (predator_xy, player_xy) pair, score by predicted predator-to-player distance, apply the best first action. Cost is measured in decoded state space, not raw latent distance: the planner cares whether the action shrinks the prey-vector, not whether two embeddings are close.
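The planning loop can be sketched roughly as follows. This is a one-step toy, not the demo's planner: `toy_predict` and `toy_state_head` are hypothetical stand-ins for the JEPA predictor and the state head (here the "latent" is simply the state itself), actions are assumed to be 2D thrusts clipped to [-1, 1], and the elite count of 4 is an assumption. The encoder step over the last 3 observations is omitted.

```python
import math
import random


def toy_predict(latent, action):
    """Hypothetical predictor stand-in: the action moves the predator, the player stays put."""
    px, py, qx, qy = latent
    return (px + action[0], py + action[1], qx, qy)


def toy_state_head(latent):
    """Hypothetical state-head stand-in: latent -> (predator_xy, player_xy)."""
    px, py, qx, qy = latent
    return (px, py), (qx, qy)


def cost(latent, action, predict, state_head):
    """Predicted predator-to-player distance, measured in decoded state space."""
    predator, player = state_head(predict(latent, action))
    return math.dist(predator, player)


def cem_plan(latent, predict=toy_predict, state_head=toy_state_head,
             n_candidates=16, n_iters=2, n_elites=4, rng=None):
    """Cross-entropy method: sample actions, keep the elites, refit, repeat."""
    rng = rng or random.Random(0)
    mean, std = [0.0, 0.0], [1.0, 1.0]
    best_action, best_cost = None, float("inf")
    for _ in range(n_iters):
        candidates = [
            tuple(max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std))
            for _ in range(n_candidates)
        ]
        scored = sorted((cost(latent, a, predict, state_head), a) for a in candidates)
        if scored[0][0] < best_cost:
            best_cost, best_action = scored[0]
        elites = [a for _, a in scored[:n_elites]]
        for d in range(2):  # refit the sampling distribution to the elites
            vals = [a[d] for a in elites]
            mean[d] = sum(vals) / n_elites
            var = sum((v - mean[d]) ** 2 for v in vals) / n_elites
            std[d] = max(math.sqrt(var), 1e-3)
    return best_action, best_cost


# Predator at the origin, player 5 units to the east.
action, predicted_distance = cem_plan((0.0, 0.0, 5.0, 0.0))
```

The scoring never compares embeddings directly. Every candidate is pushed through the predictor, decoded to positions, and ranked by distance, which is the "cost in decoded state space" choice described above.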

the beacon was a lie

For three weeks the model trained against a beacon — a constant hum at the exit door — as a self-localization anchor. Removing it (Phase 3E) without other changes collapsed the encoder to a training-spawn-region prior: predator drifted to the bottom-right corner every match. But when we removed the beacon AND randomized spawn points, the encoder was forced to learn self-localization from real audio geometry (footsteps, echoes). That's 3E ep30 — current ship, no beacon needed.

Lesson: probe R² (a linear regression from embedding to state) lied about gameplay quality on this project three times. It measures an unconditional encoder property — can you recover absolute coordinates? — but the planner cares about a conditional predictor property: does this thrust shrink the predicted prey-vector? The two correlate in the easy regime and decorrelate exactly where models differ. Behavior-side benchmarks are the only real ship gate.
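For reference, a probe R² of this kind can be computed as in the sketch below: fit ordinary least squares (with a bias term) from embedding to one state coordinate, then report the fraction of variance explained. This is a generic illustration under assumptions, not the project's evaluation code. It scores on the same data it fits, and it assumes the normal-equations matrix is invertible.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def probe_r2(embeddings, targets):
    """R^2 of a least-squares linear probe from embedding to one scalar target."""
    rows = [list(e) + [1.0] for e in embeddings]  # append a bias column
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    w = solve_linear(xtx, xty)  # assumes X^T X is invertible
    preds = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# A target that is linear in the embedding is fully recoverable by the probe.
emb = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
linear_r2 = probe_r2(emb, [3 * a - 2 * b + 1 for a, b in emb])
```

Nothing in this computation involves an action or a predicted next step. That is why a high score says little about whether the predictor ranks thrusts correctly.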

three variants live, pick one

slot	checkpoint	what it is
canonical · 3E ep30	silent_v1_3e_ep030	current ship. Beacon-free, joint-trained, randomized spawn data. Hunts cleanly across all rooms.
baseline · JEPA og	silent_BASELINE_ep010_joint	the pre-3E beacon-trained model. Useful as the "before" — feel how it's fooled when you stand still.
federation · 3F ep50	silent_v1_3f_ep020	continuation-trained from 3E ep30 on federation-pool data (gameplay-captured embeddings, not scripted episodes). See the federation section below.

this demo is also a federation node

Two pipelines, both browser/edge-side. Ingest: every match you play here contributes to the silent_v1 federation pool. The server piggybacks on the predator's encoder forward — the embedding it computes every planner tick to plan its next move is also captured (with the action chosen) and pushed to the federation hub. Zero extra compute: same forward pass serves both predator and tap.

Training: the right-rail 🧠 train round button runs one full training round in this very tab. Your browser pulls the latest predictor weights, fetches a training batch (audio embeddings pre-encoded by the hub), runs K steps of Adam on the predictor stack via TF.js, and uploads a signSGD-compressed delta. The hub aggregates deltas across clients and broadcasts the new weights. Training compute is 100% on your machine; our servers never touch a gradient.
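A minimal sketch of what a sign-compressed round could look like. Each client sends only the sign of each weight change, and the hub combines the votes. The hub's aggregation rule is not specified above, so the majority vote and fixed `step_size` used here are assumptions borrowed from the standard signSGD-with-majority-vote scheme. All function names are hypothetical, and the real client runs in TF.js rather than Python.

```python
def sign_compress(old_weights, new_weights):
    """Client side: keep only the sign of each weight change (-1, 0, or +1)."""
    return [(n > o) - (n < o) for o, n in zip(old_weights, new_weights)]


def majority_vote(client_signs):
    """Hub side (assumed rule): per-coordinate sign of the summed client votes."""
    totals = [sum(column) for column in zip(*client_signs)]
    return [(t > 0) - (t < 0) for t in totals]


def apply_update(weights, voted_signs, step_size=0.01):
    """Hub side: move each weight one fixed step in the voted direction."""
    return [w + step_size * s for w, s in zip(weights, voted_signs)]


old = [0.0, 1.0, -1.0]
client_results = [
    [0.1, 0.9, -1.0],   # weights after K local Adam steps on client A
    [0.2, 1.1, -1.0],   # client B
    [-0.1, 0.8, -0.5],  # client C
]
deltas = [sign_compress(old, new) for new in client_results]
new_global = apply_update(old, majority_vote(deltas))
```

Sending signs instead of float deltas cuts each upload to roughly one bit per weight, at the cost of discarding magnitudes. The fixed hub-side step size stands in for them.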

The pool feeds continuation training runs. The federation · 3F ep50 variant above is the first model in the project trained this way. Watch ingested_total + round tick up live at /federated/admin. RELAY runs the same architecture; pong + future JEPA games reuse the pattern.

credits & sources

LeWorldModel: arxiv 2603.19312 (Maes, Le Lidec, Scieur, LeCun, Balestriero, 2026). DexWM joint state head: arxiv 2512.13644. Source: github.com/SotoAlt/silent-deploy.