A framework for amortizing reasoning over repeated tasks
A language model \(\mathcal{M}\) induces a probabilistic higher-order function:
\[ \hat{F} \;:\; \Pi \;\longrightarrow\; \bigl(\Sigma^* \to \Delta(\Sigma^*)\bigr) \]
Applying \(\hat{F}\) to a prefix \(\pi \in \Pi\) yields a continuation function: \[ \hat{g}_\pi \;=\; \hat{F}(\pi) \;:\; \Sigma^* \to \Delta(\Sigma^*) \]
Given a continuation \(c \in \Sigma^*\) (chat history plus user request), the answer is sampled: \[ \hat{y} \;\sim\; \hat{g}_\pi(c) \] The hat propagates: \(\hat{y}\) inherits potential error from \(\hat{F}\).
For a fixed \(\pi\), the model is a black box consuming user turns and emitting sampled answers — correct with high but not certain probability.
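The types above can be made concrete with a small sketch. Everything here is a toy assumption: `F_hat` is a hypothetical stand-in for the model, a distribution in \(\Delta(\Sigma^*)\) is represented as a dictionary from strings to probabilities, and the 0.9/0.1 split is invented purely to show that the sampled answer can be wrong.

```python
import random
from typing import Callable, Dict

# A distribution over strings, represented as {string: probability}.
Dist = Dict[str, float]

def F_hat(prefix: str) -> Callable[[str], Dist]:
    """Toy stand-in for the model: fixing a prefix yields a continuation function."""
    def g_hat(c: str) -> Dist:
        if "uppercase" in prefix:
            # Hypothetical behaviour: usually upper-cases, sometimes echoes the input.
            dist = {c.upper(): 0.9}
            dist[c] = dist.get(c, 0.0) + 0.1  # merge mass if c is already upper-case
            return dist
        return {c: 1.0}
    return g_hat

def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one answer from the distribution."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

g = F_hat("You are an uppercase assistant.")   # g plays the role of g_hat_pi
y_hat = sample(g("hello"), random.Random(0))   # y_hat ~ g_hat_pi(c)
```

Fixing the prefix once and reusing `g` for every query mirrors the black-box view: each call is a fresh sample, so the same input may yield different answers.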
For each user query \(u_i\), invoke \(\hat{g}_\pi\):
\[ \hat{y}_i \;\sim\; \hat{g}_\pi(u_i), \qquad \text{cost } c_{\mathcal{M}} \text{ per query} \]
Total cost for \(N\) queries:
\[ C_{\mathrm{std}}(N) \;=\; N \cdot c_{\mathcal{M}} \]
Hinton et al. (2015) transfer knowledge from a large model \(\mathcal{M}_L\) into a small model \(\mathcal{M}_S\) by training on soft targets (the output distributions of \(\mathcal{M}_L\) rather than hard labels).
\[ \hat{F}_L \;\xrightarrow{\;\text{distill}\;}\; \hat{F}_S, \qquad |\mathcal{M}_S| \ll |\mathcal{M}_L| \]
The target representation is the same kind of thing as the source. What if we chose a different target?
Instead of another neural net, distill \(\hat{F}\)'s knowledge into discrete structures: programs, grammars, schemas.
\(\tilde{\cdot}\) denotes an artifact that is deterministic but initially unverified: it involves no randomness, but it may be semantically wrong if compiled from a flawed \(\hat{y}\).
Given a teaching example \(u_0\), the model produces a recipe \(\hat{r} \in \mathcal{R}\) — a structured description of the intended computation: \[ \hat{r} \;\sim\; \hat{g}_\pi(u_0), \qquad \hat{r} \in \mathcal{R} \subset \Sigma^* \]
A compiler — a total deterministic function — maps recipe to program: \[ \kappa : \mathcal{R} \;\longrightarrow\; (\Theta \to \mathcal{Y}), \qquad \tilde{p} = \kappa(\hat{r}) \]
\(\kappa\) is deterministic but \(\tilde{p}\) is only as correct as \(\hat{r}\) was. Also emitted: a grammar \(G\) that recognizes the task family and extracts \(\theta \in \Theta\).
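A minimal sketch of the compile step, under loudly stated assumptions: the recipe language here (a JSON object with an `input` name and a list of `steps`) and the operation table `OPS` are hypothetical, chosen only to show a deterministic \(\kappa\) turning a recipe string into a callable program. The text does not specify the actual recipe format.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical whitelist of operations a recipe may use.
OPS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda x, k: x + k,
    "mul": lambda x, k: x * k,
}

def kappa(recipe: str) -> Callable[[Dict[str, Any]], float]:
    """Deterministically compile a recipe string into a program Theta -> Y."""
    spec = json.loads(recipe)
    source = spec["input"]
    steps = [(step["op"], step["arg"]) for step in spec["steps"]]
    for op, _ in steps:
        if op not in OPS:
            # Strings outside the recipe language R are rejected at compile time.
            raise ValueError(f"unknown op: {op}")

    def program(theta: Dict[str, Any]) -> float:
        value = theta[source]
        for op, arg in steps:
            value = OPS[op](value, arg)
        return value

    return program

# r_hat would come from the model; here it is written by hand.
r_hat = ('{"input": "celsius", "steps": '
         '[{"op": "mul", "arg": 1.8}, {"op": "add", "arg": 32}]}')
p_tilde = kappa(r_hat)  # deterministic, but only as correct as r_hat
```

Compiling the same recipe twice always gives the same behaviour. If the model had emitted `"arg": 1.6` by mistake, `kappa` would compile it just as faithfully, which is the sense in which \(\tilde{p}\) is deterministic yet unverified.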
A task family is a triple \((\mathcal{T},\, \Theta,\, \phi)\):
A grammar \(G : \mathcal{T} \to \Theta\) (implemented as an NFA) recognizes the task family and extracts parameters at near-zero cost: \[ G(u) = \theta \in \Theta \quad \forall\, u \in \mathcal{T} \]
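As an illustration of recognition plus parameter extraction, the sketch below uses a regular expression for a single hypothetical task family ("convert N celsius to fahrenheit"). The pattern and the parameter name `celsius` are assumptions. Note also that Python's `re` module is a backtracking matcher rather than an NFA simulation, so this shows the interface of \(G\), not the implementation the text describes.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one task family; the named group is the parameter.
_PATTERN = re.compile(
    r"convert (?P<celsius>-?\d+(?:\.\d+)?) celsius to fahrenheit",
    re.IGNORECASE,
)

def G(u: str) -> Optional[Dict[str, float]]:
    """Return theta if u belongs to the task family, otherwise None."""
    match = _PATTERN.fullmatch(u.strip())
    if match is None:
        return None
    return {"celsius": float(match.group("celsius"))}
```

Returning `None` for non-members extends \(G\) from \(\mathcal{T}\) to all utterances, which is what a dispatcher needs in order to decide whether a deterministic branch applies.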
Let \(\{(\mathcal{T}_i, \Theta_i, G_i, \tilde{p}_i)\}_{i=1}^k\) be compiled task families. Define: \[ H_k(u) \;=\; \begin{cases} \tilde{p}_i\bigl(G_i(u)\bigr) & \exists\, i \leq k : u \in \mathcal{T}_i \\[4pt] \hat{g}_\pi(u) & \text{otherwise} \end{cases} \]
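The piecewise definition of \(H_k\) translates almost directly into a dispatcher. The sketch below is one possible reading, with assumptions labeled: each grammar returns `None` for non-members, the first matching family wins when families overlap (the definition does not say how ties are broken), and `llm_fallback` is a hypothetical stand-in for \(\hat{g}_\pi\).

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Theta = Dict[str, Any]
Grammar = Callable[[str], Optional[Theta]]   # G_i: returns None if u is not in T_i
Program = Callable[[Theta], Any]             # compiled program p_tilde_i

def make_handler(
    families: List[Tuple[Grammar, Program]],
    fallback: Callable[[str], Any],
) -> Callable[[str], Any]:
    """Build H_k from k compiled task families and a fallback."""
    def H(u: str) -> Any:
        for grammar, program in families:
            theta = grammar(u)
            if theta is not None:      # u is in T_i: deterministic branch
                return program(theta)
        return fallback(u)             # otherwise: full model
    return H

def parse_double(u: str) -> Optional[Theta]:
    """Toy grammar for the hypothetical family 'double <integer>'."""
    parts = u.split()
    if len(parts) == 2 and parts[0] == "double":
        try:
            return {"n": int(parts[1])}
        except ValueError:
            return None
    return None

def llm_fallback(u: str) -> str:
    """Stand-in for a sampled model answer."""
    return f"<model answer to {u!r}>"

H = make_handler([(parse_double, lambda theta: 2 * theta["n"])], llm_fallback)
```

Adding a new compiled family is an append to `families`; no existing branch changes.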
The first \(k\) branches are deterministic. The fallback is the full LLM. As \(k\) grows, deterministic coverage expands:
Let \(\mathcal{D}\) be a distribution over user utterances. Define coverage:
\[ \mathrm{cov}(H_k) \;=\; \Pr_{u \sim \mathcal{D}}\!\Bigl[u \in \textstyle\bigcup_{i=1}^k \mathcal{T}_i\Bigr] \]
Expected cost per query:
\[ \begin{aligned} \mathbb{E}\bigl[C(H_k)\bigr] &\;=\; \mathrm{cov}(H_k)\cdot c_{\mathrm{det}} \\ &\quad+\; \bigl(1 - \mathrm{cov}(H_k)\bigr)\cdot c_{\mathcal{M}} \end{aligned} \]
Since \(c_{\mathrm{det}} \approx 0 \ll c_{\mathcal{M}}\), as \(\mathrm{cov}(H_k) \nearrow 1\): \[ \mathbb{E}\bigl[C(H_k)\bigr] \;\longrightarrow\; c_{\mathrm{det}} \;\approx\; 0 \]
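To get a feel for the expected-cost formula, the sketch below evaluates it for a few coverage levels. The cost figures are invented for illustration only (arbitrary units, with the deterministic path assumed to be 1000 times cheaper than a model call).

```python
def expected_cost(coverage: float, c_det: float, c_model: float) -> float:
    """E[C(H_k)] = cov * c_det + (1 - cov) * c_model."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    return coverage * c_det + (1.0 - coverage) * c_model

# Hypothetical costs, not measurements.
C_DET, C_MODEL = 0.001, 1.0
costs = {cov: expected_cost(cov, C_DET, C_MODEL) for cov in (0.0, 0.5, 0.9, 1.0)}
```

The cost falls linearly in coverage, from \(c_{\mathcal{M}}\) at zero coverage to \(c_{\mathrm{det}}\) at full coverage, so every additional compiled family that captures real traffic pays off in proportion to its share of \(\mathcal{D}\).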
Each \(\tilde{p}_i\) is initially unverified. Let \(y^*(\theta)\) be the ground-truth outcome. Define the error rate at refinement step \(t\): \[ \varepsilon_i(t) \;=\; \Pr_{\theta \sim \Theta_i}\!\bigl[\tilde{p}_i^{(t)}(\theta) \neq y^*(\theta)\bigr] \]
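In practice \(\varepsilon_i(t)\) can only be estimated. One hedged way to do so, assuming a sample of parameters and access to a ground-truth oracle (both hypothetical here), is the empirical disagreement rate:

```python
from typing import Any, Callable, Iterable

def empirical_error_rate(
    program: Callable[[Any], Any],
    ground_truth: Callable[[Any], Any],
    thetas: Iterable[Any],
) -> float:
    """Fraction of sampled parameters on which the program disagrees with y*."""
    samples = list(thetas)
    if not samples:
        raise ValueError("need at least one sample")
    errors = sum(1 for theta in samples if program(theta) != ground_truth(theta))
    return errors / len(samples)

# Toy example: the compiled program forgot to handle negative inputs.
buggy_abs = lambda x: x          # stands in for p_tilde at step t
fixed_abs = lambda x: -x if x < 0 else x   # stands in for p_tilde at step t + 1
before = empirical_error_rate(buggy_abs, abs, range(-2, 3))
after = empirical_error_rate(fixed_abs, abs, range(-2, 3))
```

Because the program is deterministic, a failing \(\theta\) fails every time, so each reported error is a reproducible test case rather than a sampling fluke.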
When a user signals an error, refinement takes one of three paths:
A third path exploits accumulated user feedback directly. Partition inputs: let \(A \subset \mathcal{T}_i\) be accepted, \(R \subset \mathcal{T}_i\) rejected.
| | Neural Distillation | Discrete Distillation |
|---|---|---|
| Source | \(\hat{F}_L\) (large model) | \(\hat{F}\) (any model) |
| Target | Small neural net \(\hat{F}_S\) | Programs, grammars, schemas |
| Output | Probabilistic (\(\hat{\cdot}\)) | Deterministic (\(\tilde{\cdot}\)) |
| Coverage | Broad — generalizes | Targeted — task families |
| Transparency | Opaque (weights) | Readable (source code) |
| Error correction | Re-training | Conventional debugging |
| Cost per query | \(c_{\mathcal{M}_S} < c_{\mathcal{M}_L}\) | \(c_{\mathrm{det}} \approx 0\) |
| Refinement loop | Gradient descent | Engineering + user feedback |
| Artifact class | Parametric weights (opaque, fixed) | Weakest sufficient programs + grammars |
| Algebraic ops | None — retrain to change | Union, concat, complement — automatic |
\(\hat{F}\) is a probabilistic higher-order function: prompt prefix → continuation function → sampled (possibly incorrect) answer.
Discrete distillation compiles outputs of \(\hat{F}\) into programs \(\tilde{p}\) and grammars \(G\) — deterministic, transparent, refineable structures that approximate \(\hat{F}\) on specific task families.
The handler \(H_k\) is a piecewise function growing over time: deterministic branches multiply, cost collapses, correctness improves through ordinary software engineering.
We target the weakest sufficient artifact class — regular grammars (closed under union and complement) and bounded dataflow programs (composable, terminating, verifiable). Weakness is a feature: algebraic closure means splits, merges, and compositions are automatic. Turing-complete programs fit the framework, but weaker classes accelerate refinement and are secure by construction.
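The closure properties mentioned above are mechanical constructions. The sketch below shows complement and union on complete deterministic automata; it uses DFAs rather than NFAs as a simplifying assumption, because complement on a complete DFA is just a swap of accepting states. The two example languages are invented for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, Hashable, Tuple

State = Hashable

@dataclass(frozen=True)
class DFA:
    states: FrozenSet[State]
    alphabet: FrozenSet[str]
    delta: Dict[Tuple[State, str], State]   # must be total over states x alphabet
    start: State
    accepting: FrozenSet[State]

    def accepts(self, word: str) -> bool:
        q = self.start
        for ch in word:
            if ch not in self.alphabet:
                return False                 # words outside alphabet* are rejected
            q = self.delta[(q, ch)]
        return q in self.accepting

def complement(a: DFA) -> DFA:
    """Accept exactly the words over the alphabet that a rejects."""
    return DFA(a.states, a.alphabet, a.delta, a.start, a.states - a.accepting)

def union(a: DFA, b: DFA) -> DFA:
    """Product construction: accept if either automaton accepts."""
    if a.alphabet != b.alphabet:
        raise ValueError("alphabets must match")
    states = frozenset(product(a.states, b.states))
    delta = {((p, q), ch): (a.delta[(p, ch)], b.delta[(q, ch)])
             for (p, q) in states for ch in a.alphabet}
    accepting = frozenset((p, q) for (p, q) in states
                          if p in a.accepting or q in b.accepting)
    return DFA(states, a.alphabet, delta, (a.start, b.start), accepting)

AB = frozenset("ab")
even_a = DFA(frozenset({0, 1}), AB,
             {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
             0, frozenset({0}))
ends_b = DFA(frozenset({"n", "y"}), AB,
             {("n", "a"): "n", ("n", "b"): "y", ("y", "a"): "n", ("y", "b"): "y"},
             "n", frozenset({"y"}))
either = union(even_a, ends_b)
```

Neither construction requires inspecting what the languages mean, which is the practical content of the claim that splits and merges are automatic for this artifact class.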
Where neural distillation produces a smaller black box,
discrete distillation produces conventional software.