# MemeLingua Interpretability Eval Plan

## Goal

Test whether a model can interpret MemeLingua after receiving only a compact primer generated from the canonical docs. The eval is meant to expose missing roots, weak emoji choices, ambiguous relations, and grammar failures before adding new root slots.

## Current Lock

- Core roots: 100 single-emoji roots in `docs/vocabulary.md`.
- Compounds are not roots.
- Arabic numerals are literal number tokens, not roots.
- The eval harness builds its primer from `docs/vocabulary.md` and `docs/spec.md` so tests use the same source of truth as the docs.

## Eval Modes

- `strict-decode`: only decode literal roots, plain word order, line-break frames, relation roots, and listed compounds. This mode should not infer unstated real-world concepts.
- `natural-interpret`: provide the most plausible natural English reading while preserving ambiguity notes for anything inferred beyond the literal roots.

By default the runner executes all modes together. Pass `--mode` to run only one:

```bash
npm run eval:run -- --mode=strict-decode
npm run eval:run -- --mode=natural-interpret
```

## What We Test

1. Root recovery: the model maps one root to its English invariant.
2. Compound recovery: the model interprets multi-root compounds such as music, now, before, after, and polite address.
3. Relation recovery: the model distinguishes TO/FROM/WITH/IN/ON/OF/FOR/ABOUT.
4. Clause parsing: the model respects plain `SUBJECT PREDICATE COMPLEMENT` order, spaces, and line-break frames.
5. Minimal pairs: the model separates CAN/WANT/NEED/MAYBE/CHOOSE/AGREE and SAME/LIKE/OTHER.
6. Gap discovery: the model flags composition or uncertainty when no direct root exists.
7. Practical instruction decoding: the model interprets ordinary requests, warnings, transfers, perception statements, and time-marked actions.
8. Blind round-trip stability: the model encodes English into MemeLingua and decodes its own MemeLingua back to English.

## Current Results

Latest local runs written with `--write`:

| Eval | Report | Result |
|---|---|---:|
| Decode | `eval/reports/2026-06-20T05-19-43-900Z.md` | 110/110 pass |
| AMR/Smatch | `eval/reports/amr-smatch-2026-06-20T05-10-55-490Z.md` | MemeLingua 100% F1; Toki Pona 69.5% F1 |
| Round-trip | `eval/reports/roundtrip-2026-06-20T05-17-01-834Z.md` | 25/25 pass |

## Language Benchmark Coverage

The interpretability evals are paired with linguistic benchmark coverage checks:

- Leipzig-Jakarta 100
- Swadesh 100
- ASJP / Holman 40
- Dolgopolsky 15
- NSM semantic primes
- Leipzig Glossing Rules adaptation
- AMR/Smatch-style semantic graph preservation

See `docs/language-evals.md` for source notes, coverage tiers, and the current baseline result.

Update the benchmark data, then run the language, AMR, round-trip, and validation checks:

```bash
npm run eval:update-benchmarks
npm run eval:language -- --write
npm run eval:amr -- --write
npm run eval:roundtrip -- --write
npm run eval:validate
```

## Scoring

- `pass`: required concepts are present and forbidden concepts are absent.
- `review`: output is usable but misses a required concept, adds a misleading concept, or exposes ambiguity.
- `error`: API, parsing, or route failure.

The batch scorer matches whole concept words and accepts simple English-output variants: `move` matches `moving`, `do` matches `does`, and `fear` matches `fears`, but `to` does not accidentally match `tool`. This normalization applies only to scoring; MemeLingua roots themselves do not inflect.
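The matching rule above can be sketched roughly as follows. This is a minimal illustration, not the repo's scorer: the names `variants`, `hasConcept`, and `score` and the exact suffix rules are assumptions made for the example.

```typescript
// Hypothetical sketch of concept-word matching; not the repo's actual scorer.
type BatchStatus = "pass" | "review";

// Assumed variant rules for English output words.
// Deliberately crude: it can over-match (for example `to` would accept `toes`).
function variants(word: string): string[] {
  const stem = word.endsWith("e") ? word.slice(0, -1) : word;
  return [word, `${word}s`, `${word}es`, `${stem}ed`, `${stem}ing`];
}

// True when any variant appears as a whole word in the output.
function hasConcept(output: string, word: string): boolean {
  const tokens: string[] = output.toLowerCase().match(/[a-z']+/g) ?? [];
  const allowed = new Set(variants(word.toLowerCase()));
  return tokens.some((t) => allowed.has(t));
}

// `pass` only when every required concept is present and no forbidden one is.
function score(output: string, required: string[], forbidden: string[]): BatchStatus {
  const missing = required.filter((w) => !hasConcept(output, w));
  const leaked = forbidden.filter((w) => hasConcept(output, w));
  return missing.length === 0 && leaked.length === 0 ? "pass" : "review";
}
```

Comparing whole tokens rather than substrings is what keeps `to` from matching inside `tool`.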

For human review, use:

- `exact`: direct recovery of roles and relation.
- `adequate`: paraphrase preserves the core meaning.
- `gist`: broad intent only.
- `fail`: wrong relation, wrong actor, hallucinated root, or hidden ambiguity.

## API Route

Run the local API:

```bash
npm install
OPENAI_API_KEY=... npm run eval:server
```

Health check:

```bash
curl http://localhost:8787/health
```

Interpret one expression:

```bash
curl -s http://localhost:8787/api/eval/interpret \
  -H 'content-type: application/json' \
  -d '{"input":":raising_hand: :magnet: :hamburger:"}'
```

The route returns the raw model output plus a parsed JSON object when the model follows the output contract.
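As a rough illustration of that behavior, a handler could keep the raw text and attach a parsed object only when the text is valid JSON. The `InterpretResult` shape and the `toResult` helper are hypothetical; the actual response fields may differ.

```typescript
// Hypothetical response shape; field names are assumptions, not the route's real schema.
interface InterpretResult {
  raw: string;
  parsed: Record<string, unknown> | null;
}

// Keep the raw model output and attach a parsed object only when it is a JSON object.
function toResult(raw: string): InterpretResult {
  try {
    const value: unknown = JSON.parse(raw.trim());
    const isObject = typeof value === "object" && value !== null && !Array.isArray(value);
    return { raw, parsed: isObject ? (value as Record<string, unknown>) : null };
  } catch {
    // The model did not follow the output contract; callers still get the raw text.
    return { raw, parsed: null };
  }
}
```

Returning `null` instead of throwing lets a batch run continue and mark the case for review.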

## Batch Runner

Run the decode cases:

```bash
npm run eval:run
```

Write result JSONL and Markdown report files:

```bash
npm run eval:run -- --write
```

Cases live in `eval/cases.json`. Generated JSONL files go under `eval/results/` and generated Markdown reports go under `eval/reports/`; both directories are ignored.

Current decode case coverage:

- 110 total decode cases.
- 27 practical instruction cases.
- 29 minimal-pair/ambiguity cases.
- Root, compound, relation, grammar, selector, literal, gap, and short narrative cases.

The Markdown report includes:

- pass/review/error summary
- compact table of all cases
- expanded review items with parsed model output
- full parsed output for every case
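The summary line at the top of such a report could be computed along these lines. The `CaseResult` shape and the output wording are assumptions for illustration only.

```typescript
// Hypothetical result record; the real JSONL fields may differ.
type CaseStatus = "pass" | "review" | "error";

interface CaseResult {
  id: string;
  status: CaseStatus;
}

// Count results per status.
function summarize(results: CaseResult[]): Record<CaseStatus, number> {
  const counts: Record<CaseStatus, number> = { pass: 0, review: 0, error: 0 };
  for (const r of results) counts[r.status] += 1;
  return counts;
}

// Render a one-line summary in the "N/M pass" style used in the results table.
function summaryLine(results: CaseResult[]): string {
  const c = summarize(results);
  return `${c.pass}/${results.length} pass, ${c.review} review, ${c.error} error`;
}
```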

## AMR/Smatch Runner

Run:

```bash
npm run eval:amr
npm run eval:amr -- --write
```

Cases live in `eval/amr-smatch-cases.json`. The set now contains 41 semantic graph cases: the original comparison set plus 25 added cases for perception, speech, source/destination, topic, ownership, negation, condition, cause, feeling, love, sleep, change, connection, questions, liquid, drinking, and completion/celebration.
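For orientation, a Smatch-style F1 is the harmonic mean of precision and recall over matched graph triples. The sketch below is a simplification: real Smatch also searches for the best variable mapping between the two graphs, while this version assumes variable names are already aligned. The `Triple` type and `tripleF1` name are hypothetical.

```typescript
// A graph triple as [relation, source, target]; the layout is an assumption for this sketch.
type Triple = [string, string, string];

// F1 over exactly matching triples, assuming variables are already aligned.
function tripleF1(gold: Triple[], candidate: Triple[]): number {
  const key = (t: Triple): string => JSON.stringify(t);
  const goldKeys = new Set(gold.map(key));
  const candKeys = new Set(candidate.map(key));
  const matched = Array.from(candKeys).filter((k) => goldKeys.has(k)).length;
  if (matched === 0) return 0;
  const precision = matched / candKeys.size;
  const recall = matched / goldKeys.size;
  return (2 * precision * recall) / (precision + recall);
}
```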

## Round-Trip Runner

Run:

```bash
npm run eval:roundtrip
npm run eval:roundtrip -- --write
```

Cases live in `eval/roundtrip-cases.json`. Each case starts from an English prompt, asks the model to encode it into MemeLingua using only the primer, then asks the model to decode that generated MemeLingua back to English. The scorer checks whether key concepts survived the round trip.
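The encode-then-decode flow can be sketched as below. The model calls are injected as plain functions so the example stays self-contained; `ModelCall`, `roundTrip`, and `missingConcepts` are hypothetical names, and the concept check here is exact whole-word matching only.

```typescript
// A model call is injected so the sketch does not depend on any API client.
type ModelCall = (prompt: string) => Promise<string>;

interface RoundTripOutcome {
  encoded: string;
  decoded: string;
  missing: string[];
}

// Key concepts that do not appear as whole words in the decoded English.
function missingConcepts(decoded: string, keyConcepts: string[]): string[] {
  const words: string[] = decoded.toLowerCase().match(/[a-z']+/g) ?? [];
  const present = new Set(words);
  return keyConcepts.filter((c) => !present.has(c.toLowerCase()));
}

// English -> MemeLingua -> English, then check which key concepts survived.
async function roundTrip(
  english: string,
  keyConcepts: string[],
  encode: ModelCall,
  decode: ModelCall,
): Promise<RoundTripOutcome> {
  const encoded = await encode(english);
  const decoded = await decode(encoded);
  return { encoded, decoded, missing: missingConcepts(decoded, keyConcepts) };
}
```

An empty `missing` list would correspond to a passing round trip under this simplified check.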

## Local Data Validation

Run:

```bash
npm run eval:validate
```

This does not call the OpenAI API. It checks JSON structure, rejects duplicate case IDs, confirms that every shortcode used is a known MemeLingua shortcode, and verifies the 100-root vocabulary count and the 100-row vocabulary sheet.
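A minimal sketch of checks of this kind, run on in-memory data, is shown below. The `EvalCase` fields, the `validate` name, and the shortcode pattern are assumptions; the real validator reads the repo's JSON and docs files.

```typescript
// Hypothetical case shape; only the fields needed for these checks are modeled.
interface EvalCase {
  id: string;
  input: string;
}

// Returns a list of human-readable problems; an empty list means the data looks valid.
function validate(cases: EvalCase[], roots: string[], knownShortcodes: Set<string>): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) problems.push(`duplicate case id: ${c.id}`);
    seen.add(c.id);
    // Assumed shortcode form, e.g. `:raising_hand:`.
    const codes: string[] = c.input.match(/:[a-z0-9_+-]+:/g) ?? [];
    for (const code of codes) {
      if (!knownShortcodes.has(code)) problems.push(`unknown shortcode ${code} in ${c.id}`);
    }
  }
  if (roots.length !== 100) problems.push(`expected 100 roots, found ${roots.length}`);
  return problems;
}
```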

## Root Revision Rule

Do not add new roots because a single case is awkward. Add or replace roots only when repeated eval failures cluster around the same missing meaning or ambiguous relation.
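One way to make "repeated failures cluster" concrete is to count how many distinct cases report the same missing concept and only surface concepts above a threshold. The sketch below is illustrative: the `ReviewFinding` shape, the `revisionCandidates` name, and the default threshold of 3 are all assumptions, not project policy.

```typescript
// Hypothetical finding extracted from a review item.
interface ReviewFinding {
  caseId: string;
  missingConcept: string;
}

// Concepts reported missing in at least `minCases` distinct cases (threshold is an assumption).
function revisionCandidates(findings: ReviewFinding[], minCases = 3): string[] {
  const casesByConcept = new Map<string, Set<string>>();
  for (const f of findings) {
    const ids = casesByConcept.get(f.missingConcept) ?? new Set<string>();
    ids.add(f.caseId);
    casesByConcept.set(f.missingConcept, ids);
  }
  const candidates: string[] = [];
  casesByConcept.forEach((ids, concept) => {
    if (ids.size >= minCases) candidates.push(concept);
  });
  return candidates;
}
```

Counting distinct case IDs rather than raw findings keeps one awkward case from triggering a revision by itself.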