# Language Benchmark Evals

MemeLingua now uses two eval layers:

1. Interpretability evals in `eval/cases.json`, which ask a model to decode MemeLingua after reading the generated primer.
2. Language benchmark coverage evals in `eval/language-benchmarks.json`, which compare the locked root system against standard linguistic concept lists.

## Sources

- Leipzig-Jakarta 100: Concepticon `Tadmor-2009-100`, a stable and borrowing-resistant vocabulary list.
- Swadesh 100: Concepticon `Swadesh-1971-100`, a traditional basic vocabulary list for lexicostatistics.
- ASJP / Holman 40: Concepticon `Holman-2008-40`, the 40-concept ASJP reference list.
- Dolgopolsky 15: Concepticon `Dolgopolsky-1964-15`, an ultra-stable vocabulary list.
- NSM semantic primes: the v20 Natural Semantic Metalanguage prime set.
- Leipzig Glossing Rules adaptation: a grammar/glossing presentation check, not a vocabulary list.

Benchmark data can be refreshed from Concepticon:

```bash
npm run eval:update-benchmarks
```

Coverage can be scored locally:

```bash
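# Score coverage against the benchmark lists.
npm run eval:language
# Same run with --write; everything after `--` is forwarded to the script, not to npm.
npm run eval:language -- --write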
```

Generated coverage reports are written under `eval/reports/` and are intentionally ignored by version control.

## Coverage Tiers

- `root`: direct single-root coverage.
- `composed`: covered by documented composition or a transparent root sequence.
- `literal`: covered by an international literal convention outside the emoji root inventory.
- `approximate`: partially covered, usually via a broader category plus label.
- `gap`: no current root or standard composition.
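The five tiers can be modeled as a closed set of labels and tallied per benchmark. The sketch below is illustrative only: `CoverageTier`, `TierCounts`, and `tallyTiers` are hypothetical names, not taken from the MemeLingua scorer, and the input is assumed to be one tier label per benchmark concept.

```typescript
// Hypothetical types for illustration; the real scorer may model tiers differently.
type CoverageTier = "root" | "composed" | "literal" | "approximate" | "gap";

type TierCounts = Record<CoverageTier, number>;

// Count how many concepts landed in each tier.
function tallyTiers(tiers: CoverageTier[]): TierCounts {
  const counts: TierCounts = {
    root: 0,
    composed: 0,
    literal: 0,
    approximate: 0,
    gap: 0,
  };
  for (const tier of tiers) {
    counts[tier] += 1;
  }
  return counts;
}

// Made-up input: five concepts, two of them direct roots.
const exampleCounts = tallyTiers(["root", "root", "composed", "approximate", "gap"]);
// exampleCounts is { root: 2, composed: 1, literal: 0, approximate: 1, gap: 1 }
```

Using a union type rather than free-form strings means a misspelled tier is caught at compile time instead of silently creating a sixth bucket.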

## Current Baseline Result

Latest run:

| Benchmark | Total | Root | Composed | Literal | Approximate | Gap | Strong Coverage |
|---|---:|---:|---:|---:|---:|---:|---:|
| Leipzig-Jakarta 100 | 100 | 44 | 31 | 1 | 23 | 1 | 76% |
| Swadesh 100 | 100 | 39 | 34 | 2 | 24 | 1 | 75% |
| ASJP / Holman 40 | 40 | 21 | 8 | 2 | 9 | 0 | 78% |
| Dolgopolsky 15 | 15 | 8 | 2 | 1 | 4 | 0 | 73% |
| NSM semantic primes | 65 | 42 | 20 | 2 | 1 | 0 | 98% |
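The table does not define Strong Coverage. The published figures are consistent with `(root + composed + literal) / total`, rounded to the nearest percent, so the sketch below assumes that definition. `BenchmarkRow`, `strongCoveragePercent`, and `tiersSumToTotal` are hypothetical names used for illustration, not part of the scoring script.

```typescript
// Assumed definition: strong coverage = (root + composed + literal) / total.
interface BenchmarkRow {
  name: string;
  total: number;
  root: number;
  composed: number;
  literal: number;
  approximate: number;
  gap: number;
}

// Rounded percentage of concepts covered by the three "strong" tiers.
function strongCoveragePercent(row: BenchmarkRow): number {
  const strong = row.root + row.composed + row.literal;
  // Multiply before dividing so exact halves (e.g. 77.5) are not lost to float error.
  return Math.round((strong * 100) / row.total);
}

// Sanity check: the five tier counts should add up to the benchmark size.
function tiersSumToTotal(row: BenchmarkRow): boolean {
  return row.root + row.composed + row.literal + row.approximate + row.gap === row.total;
}

// Rows copied from the baseline table above.
const baselineRows: BenchmarkRow[] = [
  { name: "Leipzig-Jakarta 100", total: 100, root: 44, composed: 31, literal: 1, approximate: 23, gap: 1 },
  { name: "Swadesh 100", total: 100, root: 39, composed: 34, literal: 2, approximate: 24, gap: 1 },
  { name: "ASJP / Holman 40", total: 40, root: 21, composed: 8, literal: 2, approximate: 9, gap: 0 },
  { name: "Dolgopolsky 15", total: 15, root: 8, composed: 2, literal: 1, approximate: 4, gap: 0 },
  { name: "NSM semantic primes", total: 65, root: 42, composed: 20, literal: 2, approximate: 1, gap: 0 },
];

const baselinePercents = baselineRows.map(strongCoveragePercent);
// baselinePercents is [76, 75, 78, 73, 98], matching the Strong Coverage column.
const baselineConsistent = baselineRows.every(tiersSumToTotal);
// baselineConsistent is true: every row's tier counts sum to its total.
```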

Repeated gaps:

- None after treating Arabic numerals as literal number tokens.

Remaining one-off gaps:

- `COLD`
- `SALT`
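The repeated versus one-off distinction can be sketched as a count of how many benchmarks list a concept as a gap: more than one makes it repeated, exactly one makes it one-off. The code below is a minimal illustration with hypothetical names (`GapsByBenchmark`, `splitGaps`). The assignment of `SALT` and `COLD` to specific benchmarks is an assumption inferred from the table's gap counts; the source does not state which list each gap comes from.

```typescript
// Map each benchmark name to the concepts scored as `gap` in it.
type GapsByBenchmark = Record<string, string[]>;

interface GapSplit {
  repeated: string[]; // gaps that appear in more than one benchmark
  oneOff: string[];   // gaps that appear in exactly one benchmark
}

function splitGaps(gaps: GapsByBenchmark): GapSplit {
  const benchmarkCount: Record<string, number> = {};
  for (const benchmark of Object.keys(gaps)) {
    // Deduplicate within a benchmark so a concept counts once per list.
    const unique = gaps[benchmark].filter((c, i, all) => all.indexOf(c) === i);
    for (const concept of unique) {
      benchmarkCount[concept] = (benchmarkCount[concept] || 0) + 1;
    }
  }
  const concepts = Object.keys(benchmarkCount).sort();
  return {
    repeated: concepts.filter((c) => benchmarkCount[c] > 1),
    oneOff: concepts.filter((c) => benchmarkCount[c] === 1),
  };
}

// ASSUMPTION: which benchmark each gap belongs to is illustrative only.
const assumedGaps: GapsByBenchmark = {
  "Leipzig-Jakarta 100": ["SALT"],
  "Swadesh 100": ["COLD"],
  "ASJP / Holman 40": [],
  "Dolgopolsky 15": [],
  "NSM semantic primes": [],
};

const gapSplit = splitGaps(assumedGaps);
// gapSplit.repeated is [] and gapSplit.oneOff is ["COLD", "SALT"]
```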

## Decision Implication

The v1.5 inventory is locked at 100 roots. Arabic numerals remain literal number tokens outside the emoji root inventory.

After the 100-root pass, there is no repeated hard gap across the benchmark suite. The remaining one-off gaps are weaker signals:

- `SALT` remains a taste/mineral case; `BITTER` and `SWEET` are covered by MOUTH-plus-quality compounds.
- `COLD` points to a missing temperature axis, but temperature can still be composed from feeling, fire, liquid, and negation when needed.

Because the inventory is locked, any future addition requires removing or replacing an existing root, so it needs stronger evidence than a single awkward benchmark case.