Pre-registered empirical research
A pre-registered program of controlled cross-linguistic experiments on LLM training dynamics, with three lead deposits already in place and a child-scale extension submitted to BabyLM 2026 / EMNLP 2026 Budapest. Collaborator on the BabyLM paper and the EMNLP main-conference extension: David Beauchemin (Université Laval).
The Scaling Hypothesis Is Language-Contingent
Wasserman 2026. A pre-registered controlled ablation that trains identical 125-million-parameter transformers on matched English and French corpora drawn from C4. French achieves grammatical competence (100% on agreement probes) at 197 million tokens; English remains at chance through 3 billion tokens. That is a greater than fifteen-fold difference in emergence threshold, alongside a fifty-fold perplexity ratio at matched training steps. Cross-study comparison with established Pythia 125M scaling (Biderman et al. 2023) suggests French may be 50-100× more training-efficient than English on identical architectures. The result cannot be explained by architecture, compute, or random variation; it is explained by the morphological structure of French that the architecture exploits but did not create.
- Zenodo: 10.5281/zenodo.19423151
- OSF pre-registration: SJ48B
- Repository: github.com/adamzwasserman/fractal-language
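The greater-than-fifteen-fold figure is a lower bound rather than a point estimate, because English had not reached competence by the end of the 3-billion-token budget. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and do not come from the repository.

```python
# Token counts as reported in the summary above.
FRENCH_EMERGENCE_TOKENS = 197_000_000      # French reaches 100% on agreement probes
ENGLISH_TOKENS_AT_CHANCE = 3_000_000_000   # English is still at chance at this budget


def threshold_ratio_lower_bound(emerged_at: int, still_at_chance_at: int) -> float:
    """Lower bound on the ratio of emergence thresholds.

    The second language has not emerged by `still_at_chance_at` tokens, so its
    true threshold is at least that large and the true ratio is at least the
    value returned here.
    """
    if emerged_at <= 0:
        raise ValueError("emerged_at must be positive")
    return still_at_chance_at / emerged_at


ratio = threshold_ratio_lower_bound(FRENCH_EMERGENCE_TOKENS, ENGLISH_TOKENS_AT_CHANCE)
print(f"English/French emergence-threshold ratio >= {ratio:.1f}x")
```

If English were to reach competence at some larger budget, the ratio would only grow, which is why the text says "greater than".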
Companion deposits
Two companion preprints extend the line:
- English Considered Harmful
  Companion preprint to the language-contingent scaling line. Zenodo: 10.5281/zenodo.19443358.
- The 70% Rule
  Pre-registered work on the conditions under which explicit logical axioms in prompts produce qualitatively different outputs on reasoning benchmarks. OSF pre-registrations: 7Z49A, MZF79. Zenodo: 10.5281/zenodo.19423101.
Geometric loss functions and cross-linguistic training dynamics
A complementary line of work, joint with Edward Levin (VM4AI; not to be confused with Michael Levin's bioelectric program at Tufts), tests the same structural claim from a different angle. The study applies geometric topologies (Polytope, Sphere) as training-time loss-function regularizers and measures whether the resulting representations recover the structural advantages morphologically rich languages provide naturally. Pre-registrations in preparation.
Both outcomes are load-bearing for the project's central claim that human language has a discoverable underlying topology of meaning: a negative result reinforces that linguistic structure is irreducible to optimization mechanics; a positive result narrows the alternative-pathway space and strengthens the parsimony case for distributed cognition deposited into language over deep time.
- Repository: github.com/adamzwasserman/wasserman-levin-2026
Right Tool, Right Job: Why Training Language Matters More Than Training Data
Wasserman & Beauchemin, submitted to BabyLM 2026 / EMNLP 2026 Budapest and ACL Rolling Review.
Child-scale extension of the language-contingent scaling line. MÉTRON-FR, a 125M-parameter GPT-2-small trained on 92.5M words of French (a child-scale corpus matching the developmental input available to a six-year-old human), reaches 85.97% accuracy on the native Quebec-French QFrBLiMP minimal-pair benchmark (Beauchemin et al. 2025) at epoch 3, at a total compute cost of $65. The paper formalizes the Language-Only Hypothesis: grammatical competence in transformers is determined primarily by the structural properties of the training language rather than by data volume, parameter count, or architectural sophistication.
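For readers unfamiliar with minimal-pair benchmarks, the sketch below shows one common way such accuracy is computed: the model scores a grammatical sentence and its minimally different ungrammatical twin, and a pair counts as correct when the grammatical one receives the higher log-probability. This is an assumption about the general technique, not a description of the paper's evaluation code. The toy bigram scorer, the corpus, and the example pairs are hypothetical stand-ins for a trained model and for QFrBLiMP items.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Scorer = Callable[[str], float]  # sentence -> log-probability (higher = more likely)


def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]], score: Scorer) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence gets a strictly higher score."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("need at least one pair")
    correct = sum(1 for good, bad in pair_list if score(good) > score(bad))
    return correct / len(pair_list)


def make_toy_bigram_scorer(corpus: List[str], alpha: float = 0.1) -> Scorer:
    """Hypothetical stand-in for a language model: an add-alpha smoothed
    bigram model over whitespace tokens."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def score(sentence: str) -> float:
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigram_counts[(p, c)] + alpha) / (context_counts[p] + alpha * vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )

    return score


# Illustrative number-agreement pairs (not taken from QFrBLiMP).
toy_corpus = ["le chat dort", "les chats dorment", "le chien dort", "les chiens dorment"]
toy_pairs = [
    ("les chats dorment", "les chats dort"),
    ("le chien dort", "le chien dorment"),
]
toy_accuracy = minimal_pair_accuracy(toy_pairs, make_toy_bigram_scorer(toy_corpus))
print(f"toy minimal-pair accuracy: {toy_accuracy:.2%}")
```

With a real model, `score` would sum the token log-probabilities the model assigns to each sentence; the accuracy function itself stays the same.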
In plain language
A 125-million-parameter model is roughly one-thousandth the size of frontier industry models. 92.5 million words is roughly the linguistic input a six-year-old child has heard. $65 is the cost of an evening's dinner for two. With those resources, the model reaches 86% accuracy on a native-Quebec-French grammatical competence test, a result industry narratives say should require hundreds of billions of parameters and institutional-scale compute. What this disproves is the dominant industry assumption that AI capability emerges from scale alone. The model is a measurement instrument; what it measures is the structure approximately 100 billion humans deposited into language over deep time.
Repository: github.com/adamzwasserman/babylm.
The cross-lingual transfer signature
The companion paper documents a graded cross-lingual transfer signature: logical-relational task structures (RTE entailment +7.91pp, MNLI three-way entailment +4.93pp, BoolQ yes/no reading +3.67pp, MRPC paraphrase +2.45pp, WSC coreference +1.93pp) transfer cleanly across languages with light adaptation; discourse-level task structures (MultiRC, QQP, EWoK world-knowledge) do not transfer at all. The paper labels the two ends with the philosophical terms the gradient evokes: a Platonic end (language-independent structure that recovers cleanly) and a Wittgensteinian end (language-bound structure that does not).
A 2,400-year debate about the language-dependence of meaning becomes a measurement protocol with falsifiable per-task predictions. Both Plato (meaning is universal) and the late Wittgenstein (meaning is bound to specific language games) turn out to have been right within their proper domain, and the present project's contribution is to measure that domain rather than to declare a winner.
The architecture of grammar
A further pre-registered line turns the instrument on a second long-running debate: is grammatical structure acquired from a particular language, or supplied by an innate, language-general architecture? Four cross-linguistic experiments test it along a single axis (does the structure reside in the language or in the technology?), each registered with its falsifiers and a declared boundary.
- Learnability
  Thin the disambiguating evidence in a training corpus and measure whether competence falls with it (acquired) or holds regardless (innate). The only one of the four that manipulates the input directly.
- Universality
  Does a model trained on one language reproduce that language's own documented binding and island behavior, or impose an English-shaped default? Tested on languages whose grammars diverge from English (Mandarin, Japanese, Icelandic).
- One operation or a patchwork
  Is the active-to-passive relation a single structure-preserving operation that holds across languages, recovered by the same Procrustes rotation used in the BLI alignment, or an item- and language-specific patchwork?
- Stored or computed
  Does grammatical competence carry the fingerprints of stored structure (frequency effects on regular forms, productivity bounded at attested constructions) or of online computation (frequency-blind, freely generalizing)?
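The Learnability manipulation can be pictured as a corpus filter: flag the sentences that carry disambiguating evidence, keep only a chosen fraction of them, and leave everything else untouched. The sketch below is one hedged way to do that. The function name `thin_corpus`, the `is_disambiguating` predicate, and the toy marker are all hypothetical; how the registered study actually identifies disambiguating evidence is not stated here.

```python
import random
from typing import Callable, List, Sequence


def thin_corpus(
    sentences: Sequence[str],
    is_disambiguating: Callable[[str], bool],
    keep_fraction: float,
    seed: int = 0,
) -> List[str]:
    """Keep roughly `keep_fraction` of the flagged sentences, chosen at random,
    and every unflagged sentence. Original order is preserved."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so a thinning level is reproducible
    flagged = [i for i, s in enumerate(sentences) if is_disambiguating(s)]
    n_keep = round(len(flagged) * keep_fraction)  # rounded to the nearest integer
    kept = set(rng.sample(flagged, n_keep))
    flagged_set = set(flagged)
    return [s for i, s in enumerate(sentences) if i not in flagged_set or i in kept]


# Toy usage: pretend sentences ending in "[D]" carry the disambiguating evidence.
toy_sentences = ["s1 [D]", "s2", "s3 [D]", "s4", "s5 [D]", "s6 [D]", "s7"]
has_marker = lambda s: s.endswith("[D]")
for fraction in (1.0, 0.5, 0.0):
    thinned = thin_corpus(toy_sentences, has_marker, fraction)
    print(fraction, len(thinned))
```

One design question this sketch leaves open is corpus size: removing sentences shrinks the token count, so a controlled comparison may need to backfill with neutral sentences to keep the training budget matched.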
Together the four convert the nativist-versus-constructionist debate, and with it a central question of the 4E cognition program, from philosophical argument into a measurement protocol with falsifiable per-language predictions.
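The "one operation or a patchwork" question lends itself to a concrete check. If a single structure-preserving map relates active to passive representations, an orthogonal Procrustes fit on some items should carry over to held-out items with a small residual; a patchwork should not. The sketch below uses the standard SVD solution with NumPy on synthetic vectors. The data, dimensions, and the `relative_residual` helper are illustrative assumptions, not the study's pipeline.

```python
import numpy as np


def orthogonal_procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F.

    Standard solution: take the SVD of X.T @ Y = U S Vt and set W = U @ Vt.
    W may include a reflection as well as a rotation.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def relative_residual(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    """Frobenius-norm error of the mapped X against Y, relative to ||Y||."""
    return float(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))


rng = np.random.default_rng(0)
dim, n_items, n_train = 16, 200, 150

# Synthetic "single operation": every passive vector is the same orthogonal
# map applied to its active counterpart.
true_map, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
active = rng.standard_normal((n_items, dim))
passive_single = active @ true_map

# Synthetic "patchwork": passive vectors unrelated to the active ones.
passive_patchwork = rng.standard_normal((n_items, dim))

W_single = orthogonal_procrustes(active[:n_train], passive_single[:n_train])
W_patch = orthogonal_procrustes(active[:n_train], passive_patchwork[:n_train])

res_single = relative_residual(active[n_train:], passive_single[n_train:], W_single)
res_patch = relative_residual(active[n_train:], passive_patchwork[n_train:], W_patch)
print("held-out residual, single operation:", res_single)
print("held-out residual, patchwork:", res_patch)
```

The held-out split matters: an orthogonal fit can partly absorb item-specific noise on the items it was fitted to, so the residual on unseen items is the more informative number.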
The deepest finding
The whole research program rests on a recognition that the cross-linguistic ablation method itself proves: large language models are instruments of OBSERVATION, not instruments of GENERATION. Identical architecture, identical training procedure, and identical compute, applied to two different human languages, produce wildly different results. Capability cannot have come from scale alone if scale alone produces wildly different outcomes depending on what is being scaled. The capability must come from somewhere else, and the only candidate that survives the experiment is the structure encoded in the language itself. Reframed as observational instruments (like telescopes, microscopes, spectrometers), LLMs reveal structure already present in their training data rather than generating it. This also disarms the obvious objection, that bigger models plainly do more. A larger telescope sees more, and fainter, and farther, but no one imagines it brings those galaxies into being. Scale works the same way here: a bigger model resolves more of the structure already deposited in language; it does not create the intelligence it resolves. An instrument that reveals what is already there can map the topology of the language its data was drawn from.