Towards unified test-time controllable text generation.
Ghent University (UGent)
Abstract
Once a language model is trained, what can we actually do at generation time to make it safer, more controllable, or better at respecting constraints? This is a research‑level tour through steering vectors, sparse autoencoders, and neurosymbolic methods, with an honest take on which ones hold up under evaluation.
The tour is anchored in real failure cases (the Air Canada chatbot, the Character.AI lawsuits) and builds up through Anthropic‑style sparse autoencoders, in‑context vectors, AurA's expert suppression, and CtrlG's tractable constraint enforcement.
A common theme: the space of “test‑time control” methods ranges from soft steering (toxicity, tone) to hard lexical constraints, and no single technique covers the whole spectrum.
Outline