umma.dev

Distilling CS Research Papers: Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

Paper: “Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search” - Cruz-Benito, Cross, Kremer, Faro, June 2026. arxiv.org/abs/2606.02418

Quantum computers make a lot of errors

Unlike classical bits, which sit cleanly at 0 or 1, qubits are fragile. They decohere from environmental noise, gates introduce small rotations, and measurements disturb the state you’re trying to preserve. If you want to run any non-trivial quantum computation, you need to wrap your logical qubits in error-correcting codes. Ways of distributing one logical qubit across many physical qubits so that errors on the physical layer can be detected and corrected without collapsing what you’re computing.

Think of it like sending a message over a noisy channel. You repeat the message enough times and with enough structure that the receiver can figure out what you meant even if some bits got flipped in transit. Quantum error correction does the same thing, but the constraints are harder. You can’t just copy a qubit (the no-cloning theorem forbids it), and measuring a qubit to check it disturbs it. The field spent decades developing mathematical frameworks to get around this.

Not all codes are equal - and finding good ones is hard

A code’s quality is described by three numbers written as [[n, k, d]]: n physical qubits encode k logical qubits at distance d, where d determines how many errors you can tolerate before the code fails. Better codes give you higher k and d at lower n - more logical qubits, better error tolerance, less hardware overhead. The gap between a good code and a mediocre one can be the difference between a fault-tolerant quantum computer you can actually build and one that requires an implausible amount of hardware.

Bivariate bicycle (BB) codes are one of the most promising families right now. They belong to a class called quantum LDPC codes - their error-checking structure is sparse, which makes the decoding fast and practical. Google’s recent fault-tolerance experiments showed specific BB codes performing well on real hardware. Most of the space of possible BB codes has never been explored. The known high-performing ones were found by human experts working through the algebra by hand. This paper asks, “can you automate that search?”

The idea: let an LLM search the space

The answer is yes, and the mechanism is more interesting than running a bigger brute-force search. The authors use a language model to guide an evolutionary algorithm - not to solve algebraic equations or do coding theory, but to mutate Python programs that construct candidate codes.

A BB code is defined by a pair of mathematical polynomials. You can write a short Python program that specifies those polynomials and outputs the code. Different programs produce different codes. So the question “find a good code” becomes “find a program that produces a code with good parameters.”

This is where the LLM earns its place. It isn’t doing mathematics, it’s reading a program, understanding its structure and generating a plausible variant. Given a few high-performing parent programs as context, it produces mutations of, changing coefficients, tweaking parameters, trying a different construction approach. Most mutations produce nothing useful. Some produce something better. The good ones survive into the next generation and seed further mutations.

What the validation pipeline actually does

First, a fast linear algebra check computes the code’s parameters - how many logical qubits it encodes and at what rough distance. This is cheap and filters out most bad candidates immediately. Second, distance certification - actually proving that a code achieves a claimed distance d requires solving a combinatorial search problem, which is expensive, so the pipeline uses a mix of exact computation and mathematical bounding (MILP - a standard technique for optimisation problems with hard constraints) to certify or rule out candidates. Third, deduplication, many programs that look different actually produce the same code under the hood, so a graph-based equivalence check removes those before they pollute the catalogue.

The staging matters because you’re screening 200,000 candidates. You run the cheap checks first and only pay for the expensive ones on codes that have already passed.

The total campaign ran roughly 1,650 evolutionary iterations over 140 compute hours, and cost about $400 in LLM inference. That’s the cost of a few cloud instances for an afternoon.

What they found

465 distinct valid BB codes not previously catalogued. The headline result is a novel indecomposable [[288,16,12]] code. “Indecomposable” means the code can’t be split into two independent smaller codes - the whole structure works as one unit. This matters because indecomposable codes can achieve better efficiency ratios; you’re not just bundling two mediocre codes together and calling it one. The [[288,16,12]] parameters are competitive with the best known BB codes at that size, and it wasn’t in any existing catalogue.

The search also found a family of high-rate codes reaching k=50 logical qubits at distance d=8 - unusually high logical qubit count, useful when your primary constraint is qubit count rather than error tolerance.

Importantly, the search rediscovered known high-performers including the [[144,12,12]] “gross code” that appears in Google’s hardware experiments. This is a sanity check: if an automated search misses the codes you already know are good, the fitness landscape is being explored poorly.

Why the LLM approach actually works

The LLM is acting as a structured mutation operator - one that, unlike random perturbation, produces changes that are syntactically valid and make sense given the surrounding program. It doesn’t need to understand why one polynomial pair produces a better code than another. It just needs to understand Python programs well enough to generate variants that are plausible rather than noise. All the domain-specific knowledge lives in the validation pipeline.

This is the same architecture that Google’s FunSearch used to find new results in combinatorics: LLM-guided evolution over program spaces, with an external evaluator providing the fitness signal. The reason it works in both cases is that LLMs trained on code have strong intuitions about program structure and how to vary it meaningfully - and those intuitions are domain-general.

The limitation is a standard one for evolutionary algorithms. As the population converges on good solutions, mutations become less diverse and progress stalls. The authors handle this with diversity-preserving selection but acknowledge it caps how long the search runs productively before returns diminish.