Discrete Distillation

Compiling LLM Knowledge into Deterministic Programs

A framework for amortizing reasoning over repeated tasks

The Model as a Higher-Order Function

A language model \(\mathcal{M}\) induces a probabilistic higher-order function:

\[ \hat{F} \;:\; \Pi \;\longrightarrow\; \bigl(\Sigma^* \to \Delta(\Sigma^*)\bigr) \]
  • \(\Pi\) — prompt prefixes: system prompts, instruction sets, persona
  • \(\Sigma^*\) — token sequences (inputs and outputs)
  • \(\Delta(\Sigma^*)\) — probability distributions over token sequences

Applying \(\hat{F}\) to a prefix \(\pi\) yields a continuation function: \[ \hat{g}_\pi \;=\; \hat{F}(\pi) \;:\; \Sigma^* \to \Delta(\Sigma^*) \]

The hat \(\hat{\cdot}\) marks functions that may produce semantically incorrect outputs — they are stochastic and not guaranteed correct.

Two-Stage Application

Given a context \(c \in \Sigma^*\) (chat history plus the user request), the answer is sampled from the continuation function: \[ \hat{y} \;\sim\; \hat{g}_\pi(c) \] The hat propagates: \(\hat{y}\) inherits potential error from \(\hat{F}\).

[Figure: prefix \(\pi\) → (higher-order) → continuation function \(\hat{g}_\pi\); input \(c\) (chat + user request) → \(\hat{y}\), which may be wrong.]

For a fixed \(\pi\), the model is a black box consuming user turns and emitting sampled answers — correct with high but not certain probability.
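The two-stage application can be mimicked with a closure. This is only a toy sketch: `F_hat`, `sample`, the word-counting task, and the probabilities 0.95 / 0.80 are all illustrative assumptions, and a two-outcome dictionary stands in for \(\Delta(\Sigma^*)\).

```python
import random
from typing import Callable, Dict

Distribution = Dict[str, float]  # toy stand-in for a distribution over token sequences


def F_hat(prefix: str) -> Callable[[str], Distribution]:
    """Toy higher-order function: a prompt prefix yields a continuation function."""

    def g_hat(context: str) -> Distribution:
        # Assumption for illustration: a "careful" prefix puts more mass on the right answer.
        p_correct = 0.95 if "careful" in prefix else 0.80
        n_words = len(context.split())
        return {str(n_words): p_correct, str(n_words + 1): 1.0 - p_correct}

    return g_hat


def sample(dist: Distribution, rng: random.Random) -> str:
    """Draw one answer y_hat from the distribution returned by g_hat."""
    outcomes, weights = zip(*dist.items())
    return rng.choices(outcomes, weights=weights, k=1)[0]


g = F_hat("You are a careful word counter.")   # stage 1: fix the prefix
dist = g("count these three words")            # stage 2: apply to the input c
y_hat = sample(dist, random.Random(0))         # sampled answer, possibly wrong
```

Fixing the prefix once and reusing `g` mirrors the black-box view: every call still returns a distribution, so the sampled answer carries the hat.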

Standard (Query-Time) Reasoning

For each user query \(u_i\), invoke \(\hat{g}_\pi\):

\[ \hat{y}_i \;\sim\; \hat{g}_\pi(u_i), \qquad \text{cost } c_{\mathcal{M}} \text{ per query} \]

Total cost for \(N\) queries:

\[ C_{\mathrm{std}}(N) \;=\; N \cdot c_{\mathcal{M}} \]
  • Appropriate for novel tasks — full generalization, no prior knowledge assumed
  • Cost scales linearly — no benefit from repetition or structure
  • Each \(\hat{y}_i\) is a fresh sample — output may vary even for identical \(u_i\)

Knowledge Distillation — Neural Target

Hinton et al. (2015): transfer knowledge from a large model \(\mathcal{M}_L\) into a small model \(\mathcal{M}_S\) by training on soft targets (output distributions of \(\mathcal{M}_L\) rather than hard labels).

\[ \hat{F}_L \;\xrightarrow{\;\text{distill}\;}\; \hat{F}_S, \qquad |\mathcal{M}_S| \ll |\mathcal{M}_L| \]
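As a reminder of what training on soft targets means, here is a minimal pure-Python sketch of a temperature-softened cross-entropy between teacher and student outputs. It is a simplification for illustration: the logits are made up, and the full recipe in Hinton et al. also mixes in a hard-label term, which is omitted here.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits: List[float],
                     student_logits: List[float],
                     temperature: float = 2.0) -> float:
    """Cross-entropy of the student against the teacher's softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


teacher = [4.0, 1.0, 0.5]                                # hypothetical teacher logits
loss_far = soft_target_loss(teacher, [0.0, 3.0, 0.0])    # student disagrees with teacher
loss_near = soft_target_loss(teacher, [3.9, 1.1, 0.4])   # student close to teacher
```

The loss shrinks as the student's distribution approaches the teacher's, but the result is still a set of weights.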
What stays the same after neural distillation:
  • Target is still a neural network — still \(\hat{\cdot}\), still probabilistic
  • Knowledge remains in weights — opaque
  • Errors corrected only by re-training
  • Broad generalization preserved — but so is the black-box character

The target representation is the same kind of thing as the source. What if we chose a different target?

Discrete Distillation — A Different Target

Instead of another neural net, distill \(\hat{F}\)'s knowledge into discrete structures: programs, grammars, schemas.

[Figure: \(\hat{F}\) → \(\tilde{p},\; G,\; \ldots\) (programs & grammars)]
  • Programs \(\tilde{p}: \Theta \to \mathcal{Y}\) — deterministic computations over typed parameters
  • Grammars \(G: \mathcal{T} \to \Theta\) — NFA recognizers that match utterances and extract parameters
  • Schemas, API bindings, structured plans, structured memory indexes, …

\(\tilde{\cdot}\) denotes deterministic but initially unverified — no randomness, but may be semantically wrong if compiled from a flawed \(\hat{y}\).

The Compilation Step

Given a teaching example \(u_0\), the model produces a recipe \(\hat{r} \in \mathcal{R}\) — a structured description of the intended computation: \[ \hat{r} \;\sim\; \hat{g}_\pi(u_0), \qquad \hat{r} \in \mathcal{R} \subset \Sigma^* \]

A compiler — a total deterministic function — maps recipe to program: \[ \kappa : \mathcal{R} \;\longrightarrow\; (\Theta \to \mathcal{Y}), \qquad \tilde{p} = \kappa(\hat{r}) \]

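[Figure: \(u_0\) → \(\hat{g}_\pi\) (LLM) → recipe → \(\kappa\) (compiler) → \(\tilde{p} : \Theta \to \mathcal{Y}\) (deterministic program). The LLM stage is probabilistic; the compiler stage is deterministic.]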

\(\kappa\) is deterministic but \(\tilde{p}\) is only as correct as \(\hat{r}\) was. Also emitted: a grammar \(G\) that recognizes the task family and extracts \(\theta \in \Theta\).
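One way to picture the compilation step is a recipe expressed as JSON and a small interpreter-style compiler. Everything here is an assumption for illustration: the recipe format, the op names (`filter`, `sort_desc`, `take`), the `CATALOG` data, and the function `kappa` are hypothetical. This toy version also raises on unknown ops, so it is total only over validated recipes.

```python
import json
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]

# Hypothetical catalog the compiled program runs against.
CATALOG = [
    {"title": "A", "genre": "jazz", "decade": 1990, "rating": 4.5},
    {"title": "B", "genre": "jazz", "decade": 1990, "rating": 4.9},
    {"title": "C", "genre": "rock", "decade": 1990, "rating": 5.0},
    {"title": "D", "genre": "jazz", "decade": 1980, "rating": 4.7},
]

# Hypothetical recipe r_hat, as an LLM might emit it for one teaching example.
RECIPE = json.dumps({"steps": [
    {"op": "filter", "field": "genre", "param": "genre"},
    {"op": "filter", "field": "decade", "param": "decade"},
    {"op": "sort_desc", "field": "rating"},
    {"op": "take", "param": "quantity"},
]})


def kappa(recipe_text: str) -> Callable[[Params], List[str]]:
    """Deterministically map a recipe to a program over typed parameters."""
    steps = json.loads(recipe_text)["steps"]

    def program(theta: Params) -> List[str]:
        rows = list(CATALOG)
        for step in steps:
            op = step["op"]
            if op == "filter":
                rows = [r for r in rows if r[step["field"]] == theta[step["param"]]]
            elif op == "sort_desc":
                rows = sorted(rows, key=lambda r: r[step["field"]], reverse=True)
            elif op == "take":
                rows = rows[: theta[step["param"]]]
            else:
                raise ValueError(f"unknown op: {op}")
        return [r["title"] for r in rows]

    return program


p_tilde = kappa(RECIPE)
result = p_tilde({"genre": "jazz", "decade": 1990, "quantity": 1})
```

The program is deterministic, yet a recipe that sorted ascending by mistake would compile just as cleanly. That is the sense in which \(\tilde{p}\) is only as correct as \(\hat{r}\).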

Task Families

A task family is a triple \((\mathcal{T},\, \Theta,\, \phi)\):

  • \(\mathcal{T} \subseteq \Sigma^*\) — all utterances expressing the same underlying task
  • \(\Theta\) — parameter space (e.g. genre × quantity × time period)
  • \(\phi : \Theta \to \mathcal{T}\) — maps parameters to representative utterances

A grammar \(G : \mathcal{T} \to \Theta\) (implemented as an NFA) recognizes the task family and extracts parameters at near-zero cost: \[ G(u) = \theta \in \Theta \quad \forall\, u \in \mathcal{T} \]

Correctness requirement on \(\tilde{p}\): \(\tilde{p}\) must parameterize, not specialize. A recipe that hardcodes one value (e.g. a URL for a single genre) produces \(\tilde{p}\) correct only at the teaching point \(\theta_0\). Correct distillation requires \(\tilde{p}(\theta) \approx_\varepsilon \hat{g}_\pi(\phi(\theta))\) for all \(\theta \in \Theta\).
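A sketch of parameter extraction and of the parameterize-versus-specialize distinction follows. A Python regular expression stands in for the NFA, and the utterance pattern, the `example.com` URLs, and both `url_*` functions are hypothetical.

```python
import re
from typing import Dict, Optional

# Regex stand-in for the grammar G: recognizes one task family and names its parameters.
PATTERN = re.compile(
    r"^(?:show|list|find)(?: me)? (?:the )?top (?P<quantity>\d+) "
    r"(?P<genre>[a-z]+) albums (?:of|from) the (?P<decade>\d{4})s$"
)


def G(utterance: str) -> Optional[Dict[str, object]]:
    """Return theta when the utterance belongs to the task family, else None."""
    m = PATTERN.match(utterance.strip().lower())
    if m is None:
        return None
    return {"genre": m["genre"],
            "quantity": int(m["quantity"]),
            "decade": int(m["decade"])}


def url_specialized(theta: Dict[str, object]) -> str:
    # Flawed: hardcodes the teaching value, so it is right only at theta_0 (genre = jazz).
    return "https://example.com/charts/jazz"


def url_parameterized(theta: Dict[str, object]) -> str:
    # Intended: the genre comes from theta, so the program covers the whole family.
    return f"https://example.com/charts/{theta['genre']}"


theta = G("Show me the top 5 rock albums from the 1980s")
```

Both URL builders agree on the jazz teaching example, so a single example cannot tell them apart; only a second point in \(\Theta\) exposes the specialized one.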

The Growing Piecewise Handler \(H_k\)

Let \(\{(\mathcal{T}_i, \Theta_i, G_i, \tilde{p}_i)\}_{i=1}^k\) be compiled task families. Define: \[ H_k(u) \;=\; \begin{cases} \tilde{p}_i\bigl(G_i(u)\bigr) & \exists\, i \leq k : u \in \mathcal{T}_i \\[4pt] \hat{g}_\pi(u) & \text{otherwise} \end{cases} \]

The first \(k\) branches are deterministic. The fallback is the full LLM. As \(k\) grows, deterministic coverage expands:

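[Figure: coverage of \(H_k\) as \(k\) grows. \(k = 1\): \(\mathcal{T}_1\) plus LLM fallback. \(k = 3\): \(\mathcal{T}_1, \mathcal{T}_2, \mathcal{T}_3\) plus LLM fallback. \(k \to \infty\): \(\mathcal{T}_1\) through \(\mathcal{T}_5\) plus LLM.]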
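The piecewise handler can be sketched as a dispatch loop over compiled families with an LLM fallback. The names `regex_grammar`, `make_handler`, and `fake_llm` are illustrative, regexes stand in for NFAs, and first-match-wins is an assumption for the case where families overlap.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Theta = Dict[str, str]
Grammar = Callable[[str], Optional[Theta]]  # G_i: None when u is outside T_i
Program = Callable[[Theta], str]            # deterministic program for family i


def regex_grammar(pattern: str) -> Grammar:
    """Regex stand-in for an NFA recognizer with parameter extraction."""
    compiled = re.compile(pattern)

    def grammar(u: str) -> Optional[Theta]:
        m = compiled.fullmatch(u.strip().lower())
        return m.groupdict() if m else None

    return grammar


def make_handler(families: List[Tuple[Grammar, Program]],
                 fallback: Callable[[str], str]) -> Callable[[str], Tuple[str, str]]:
    def handle(u: str) -> Tuple[str, str]:
        for grammar, program in families:  # first matching family wins (assumption)
            theta = grammar(u)
            if theta is not None:
                return program(theta), "deterministic"
        return fallback(u), "llm"          # otherwise fall back to the model

    return handle


def fake_llm(u: str) -> str:
    """Placeholder for a sampled model answer."""
    return f"<sampled answer to: {u}>"


families: List[Tuple[Grammar, Program]] = [
    (regex_grammar(r"set a timer for (?P<minutes>\d+) minutes"),
     lambda th: f"timer:{int(th['minutes']) * 60}s"),
]
H = make_handler(families, fake_llm)

before = H("Convert 10 km to miles")  # k = 1: falls through to the LLM
families.append(
    (regex_grammar(r"convert (?P<km>\d+) km to miles"),
     lambda th: f"{int(th['km']) * 0.621371:.1f} mi"))
after = H("Convert 10 km to miles")   # k = 2: now handled deterministically
```

Because the handler closes over the `families` list, appending a newly compiled family extends deterministic coverage without touching the dispatch code.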

Coverage and Expected Cost

Let \(\mathcal{D}\) be a distribution over user utterances. Define coverage:

\[ \mathrm{cov}(H_k) \;=\; \Pr_{u \sim \mathcal{D}}\!\Bigl[u \in \textstyle\bigcup_{i=1}^k \mathcal{T}_i\Bigr] \]

Expected cost per query:

\[ \begin{aligned} \mathbb{E}\bigl[C(H_k)\bigr] &\;=\; \mathrm{cov}(H_k)\cdot c_{\mathrm{det}} \\ &\quad+\; \bigl(1 - \mathrm{cov}(H_k)\bigr)\cdot c_{\mathcal{M}} \end{aligned} \]

Since \(c_{\mathrm{det}} \ll c_{\mathcal{M}}\), as \(\mathrm{cov}(H_k) \nearrow 1\): \[ \mathbb{E}\bigl[C(H_k)\bigr] \;\longrightarrow\; c_{\mathrm{det}} \;\approx\; 0 \]

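Break-even: compiling a task family costs one LLM call \(c_{\mathcal{M}}\) (the teaching example). Serving \(N\) later queries then costs \(c_{\mathcal{M}} + N\,c_{\mathrm{det}}\) instead of \(N\,c_{\mathcal{M}}\), so the break-even point is \(N^* = c_{\mathcal{M}} / (c_{\mathcal{M}} - c_{\mathrm{det}}) \approx 1\): essentially a single subsequent query recoups the compilation cost.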
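The cost formulas are easy to check numerically. The per-query costs below are made-up values chosen only to satisfy \(c_{\mathrm{det}} \ll c_{\mathcal{M}}\); the function names are illustrative.

```python
def expected_cost(coverage: float, c_det: float, c_llm: float) -> float:
    """E[C(H_k)] = cov * c_det + (1 - cov) * c_M."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    return coverage * c_det + (1.0 - coverage) * c_llm


def break_even_queries(c_llm: float, c_det: float) -> float:
    """N* solving c_M + N * c_det = N * c_M, i.e. one LLM call spent on compilation."""
    return c_llm / (c_llm - c_det)


C_LLM = 0.01      # hypothetical cost of one model call
C_DET = 0.00001   # hypothetical cost of one deterministic execution

cost_at_90 = expected_cost(0.9, C_DET, C_LLM)    # about 0.001009
cost_at_100 = expected_cost(1.0, C_DET, C_LLM)   # equals C_DET
n_star = break_even_queries(C_LLM, C_DET)        # about 1.001
```

With these numbers, 90% coverage already cuts the expected cost by roughly a factor of ten, and the residual cost at full coverage is \(c_{\mathrm{det}}\) rather than exactly zero.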

Correctness Refinement

Each \(\tilde{p}_i\) is initially unverified. Let \(y^*(\theta)\) be the ground-truth outcome. Define the error rate at refinement step \(t\): \[ \varepsilon_i(t) \;=\; \Pr_{\theta \sim \Theta_i}\!\bigl[\tilde{p}_i^{(t)}(\theta) \neq y^*(\theta)\bigr] \]

When a user signals an error, refinement takes one of three paths. The first two:

  1. Re-recording: regenerate the recipe through \(\hat{g}_\pi\) with corrected context
  2. Repair of \(\tilde{p}_i\) itself:
    • (a) direct engineering: unit tests, debugger, code review
    • (b) AI-assisted: ask a code-generation model to fix \(\tilde{p}_i\) given the failure case

Key asymmetry: \(\hat{g}_\pi\) is a black box, so you can only reprompt. \(\tilde{p}_i\) is conventional software: transparent, testable, refinable. Because each reported failure can be fixed and pinned with a regression test, the expectation is that \(\varepsilon_i(t) \to 0\) as \(t \to \infty\).
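Because \(\tilde{p}_i\) is ordinary code, \(\varepsilon_i(t)\) can be estimated empirically on labelled cases and tracked across repairs. The sketch below uses a hypothetical timer program whose first version ignores the `hours` parameter; the case list plays the role of a finite sample from \(\Theta_i\) with known \(y^*(\theta)\).

```python
from typing import Callable, Dict, List, Tuple

Theta = Dict[str, int]
Case = Tuple[Theta, int]  # (parameters, ground-truth outcome)


def error_rate(program: Callable[[Theta], int], cases: List[Case]) -> float:
    """Empirical estimate of epsilon_i(t) over a finite sample of parameters."""
    wrong = sum(1 for theta, expected in cases if program(theta) != expected)
    return wrong / len(cases)


def p_v0(theta: Theta) -> int:
    # Step t = 0: compiled from a flawed recipe that ignores hours.
    return theta["minutes"] * 60


def p_v1(theta: Theta) -> int:
    # Step t = 1: repaired after a user reported a wrong one-hour timer.
    return (theta["hours"] * 60 + theta["minutes"]) * 60


CASES: List[Case] = [
    ({"hours": 0, "minutes": 5}, 300),
    ({"hours": 1, "minutes": 0}, 3600),   # the reported failure, kept as a regression test
    ({"hours": 2, "minutes": 30}, 9000),
    ({"hours": 0, "minutes": 0}, 0),
]

eps_0 = error_rate(p_v0, CASES)  # 0.5: fails whenever hours > 0
eps_1 = error_rate(p_v1, CASES)  # 0.0 on this sample
```

A zero error rate on a finite sample is evidence, not proof, that the program is correct over all of \(\Theta_i\).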

Correctness Refinement — Data-Driven Splitting

A third path exploits accumulated user feedback directly. Partition inputs: let \(A \subset \mathcal{T}_i\) be accepted, \(R \subset \mathcal{T}_i\) rejected.

  3. Data-driven splitting:
    • (a) Grammar: pose \((A,\, R)\) to \(\hat{g}_\pi\) — "find a rule accepting \(A\) and rejecting \(R\)" — replacing \(G_i\) with discriminating rules \(G_i^+,\, G_i^-\)
    • (b) Program: filter \(\tilde{p}_i\)'s control flow by coverage over \(A\) vs \(R\) — automatically specializing two flow programs \(\tilde{p}_i^+,\, \tilde{p}_i^-\), one per partition
Program synthesis connection: path 3 is an instance of synthesis from examples. Grammar induction from \((A, R)\) recalls Angluin's L* and RPNI (Oncina & García 1992); flow specialization resembles SyGuS (Alur et al. 2013) and FlashFill / PROSE (Gulwani et al.). The novel element is that \(\hat{g}_\pi\), used as an oracle, generalizes to informal, natural-language domains, while closure of regular languages under complement makes \(G_i^- = \overline{G_i^+}\) automatic.
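The "automatic" complement can be illustrated with a complete DFA, where complementing is just flipping the accepting states. This sketch assumes the grammar has already been determinized (complementing an NFA directly is not this simple). The token alphabet and the rule "rejected utterances mention `live`" are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Tuple


@dataclass
class DFA:
    states: FrozenSet[str]
    alphabet: FrozenSet[str]
    delta: Dict[Tuple[str, str], str]  # total transition function (complete DFA)
    start: str
    accepting: FrozenSet[str]

    def accepts(self, tokens: Sequence[str]) -> bool:
        state = self.start
        for tok in tokens:
            state = self.delta[(state, tok)]
        return state in self.accepting

    def complement(self) -> "DFA":
        # For a complete DFA, swapping accepting and non-accepting states
        # yields a recognizer for exactly the complementary language.
        return DFA(self.states, self.alphabet, self.delta, self.start,
                   self.states - self.accepting)


ALPHABET = frozenset({"top", "live", "albums"})
STATES = frozenset({"no_live", "seen_live"})
delta: Dict[Tuple[str, str], str] = {}
for tok in ALPHABET:
    delta[("no_live", tok)] = "seen_live" if tok == "live" else "no_live"
    delta[("seen_live", tok)] = "seen_live"

# G_plus: discriminating rule for accepted inputs (token sequences without "live").
G_plus = DFA(STATES, ALPHABET, delta, "no_live", frozenset({"no_live"}))
G_minus = G_plus.complement()  # obtained for free from closure under complement

A = [("top", "albums"), ("albums",)]   # accepted by users
R = [("top", "live", "albums")]        # rejected by users
```

Only `G_plus` has to be found, for example by posing \((A, R)\) to the model; the rule for the rejected partition follows mechanically.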

Neural vs. Discrete Distillation

| | Neural Distillation | Discrete Distillation |
|---|---|---|
| Source | \(\hat{F}_L\) (large model) | \(\hat{F}\) (any model) |
| Target | Small neural net \(\hat{F}_S\) | Programs, grammars, schemas |
| Output | Probabilistic (\(\hat{\cdot}\)) | Deterministic (\(\tilde{\cdot}\)) |
| Coverage | Broad (generalizes) | Targeted (task families) |
| Transparency | Opaque (weights) | Readable (source code) |
| Error correction | Re-training | Conventional debugging |
| Cost per query | \(c_{\mathcal{M}_S} < c_{\mathcal{M}_L}\) | \(c_{\mathrm{det}} \approx 0\) |
| Refinement loop | Gradient descent | Engineering + user feedback |
| Artifact class | Parametric weights (opaque, fixed) | Weakest sufficient programs + grammars |
| Algebraic ops | None (retrain to change) | Union, concat, complement (automatic) |

Summary

\(\hat{F}\) is a probabilistic higher-order function: prompt prefix → continuation function → sampled (possibly incorrect) answer.

Discrete distillation compiles outputs of \(\hat{F}\) into programs \(\tilde{p}\) and grammars \(G\): deterministic, transparent, refinable structures that approximate \(\hat{F}\) on specific task families.

The handler \(H_k\) is a piecewise function growing over time: deterministic branches multiply, cost collapses, correctness improves through ordinary software engineering.

We target the weakest sufficient artifact class — regular grammars (closed under union and complement) and bounded dataflow programs (composable, terminating, verifiable). Weakness is a feature: algebraic closure means splits, merges, and compositions are automatic. Turing-complete programs fit the framework, but weaker classes accelerate refinement and are secure by construction.

Where neural distillation produces a smaller black box,
discrete distillation produces conventional software.