Pieter Delobelle | Steering Language Models

Abstract

Once a language model is trained, what can we actually do at generation time to make it safer, more controllable, or constraint‑respecting? This talk is a research‑level tour through steering vectors, sparse autoencoders, and neurosymbolic methods, with an honest take on which ones hold up under evaluation.

The talk is anchored in real failure cases (the Air Canada chatbot, the Character.AI lawsuits) and builds up through Anthropic‑style sparse autoencoders, in‑context vectors, AurA's expert suppression, and CtrlG's tractable constraint enforcement.

A common theme: the space of “test‑time control” methods ranges from soft steering (toxicity, tone) to hard lexical constraints, and no single technique covers the whole spectrum.

Outline

What the talk covers

01 Why test‑time control matters. Air Canada's misinformation chatbot, the Character.AI safety incidents, and the gap between “we trained it safe” and “it's safe in production”.
02 Mechanistic interpretability. The circuits view of a transformer and Anthropic's monosemanticity work on Claude's residual stream — what sparse autoencoders give us, and what they don't.
03 SAEs as steering vectors. Concepts in the decoder dictionary as directions to push activations along; Neuronpedia's LLM‑labelled features; and why finding genuinely monosemantic concepts is harder than it looks.
04 In‑context vectors. Liu et al. (2024): only ~3–10 examples needed, they don't have to be paired, PCA barely helps, all ICVs fit in VRAM — and the awkward case where steering emoji usage just refuses to work.
05 Evaluating steering. Vibe‑based evals on real prompts versus AxBench's detection‑and‑steering protocol — and the deflating finding that plain prompting often wins.
06 AurA — suppressing experts. The Whispering Experts approach to muting toxicity‑related experts at inference time, work done at Apple.
07 Neurosymbolic control. CtrlG and tractable distilled HMMs for hard lexical / token‑level constraints (Zhang et al., 2023/2024).
08 Where this is going. Steering, expert suppression, and symbolic constraints sit on a spectrum; the open question is whether a unified test‑time control framework is even the right goal.
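
Items 03 and 04 describe soft steering as pushing activations along a direction. A minimal sketch of that idea follows, assuming NumPy and a simple difference‑of‑means direction built from two unpaired example sets. This is an illustrative choice and not necessarily the construction used in the talk or in Liu et al.; the names `steering_vector`, `steer`, and `alpha` are hypothetical, and the activations are random stand‑ins rather than real model states.

```python
import numpy as np


def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Direction from 'undesired' to 'desired' behaviour.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim).
    The two sets may have different sizes, since only their means are used.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift every hidden state by alpha times the steering direction."""
    return hidden + alpha * vector


# Stand-in data: 5 "desired" and 7 "undesired" examples, hidden size 8.
rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 8)) + 1.0
neg = rng.normal(size=(7, 8))

v = steering_vector(pos, neg)

# Hidden states for 3 token positions at one layer (random placeholders).
h = rng.normal(size=(3, 8))
h_steered = steer(h, v, alpha=2.0)

# Steering raises the projection of every position onto the direction.
print((h_steered @ v) - (h @ v))
```

In a real model the addition would happen inside a forward hook on a chosen layer, and `alpha` is the knob that trades steering strength against fluency. The evaluation questions in item 05 are largely about how to pick that trade‑off and whether it beats plain prompting.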

The deck