Evaluation

Plato supports an optional [evaluation] section for structured server-side evaluation. Evaluation runs after the trainer computes its regular test metric (for example, accuracy or perplexity) and records named benchmark metrics under the evaluation_ prefix in the runtime CSV.

Use this section when you want benchmark-style outputs such as IFEval, ARC, HellaSwag, PIQA, or Nanochat CORE instead of only a single scalar test metric.

When evaluation runs

Structured evaluation is triggered from the trainer's test flow, so it depends on server-side testing being enabled:

[server]
do_test = true

If [evaluation] is omitted, Plato only records the trainer's normal scalar metric.
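Putting the two requirements together, a minimal sketch might look like this (the lighteval value is one of the built-in evaluator types described below):

[server]
do_test = true        # without this, [evaluation] is never triggered

[evaluation]
type = "lighteval"    # evaluator backend (see "Common options" below)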

Common options

type

The evaluator backend to run.

Built-in values include:

  • lighteval for Hugging Face's Lighteval benchmark runner.
  • nanochat_core for Nanochat's CORE benchmark. Requires trainer.type = "nanochat". This evaluator is not registered in the general evaluator registry; it is wired internally by the nanochat trainer. Using it with any other trainer type produces no evaluation output and no error.

fail_on_error

Whether evaluator failures should abort the run.

Default value: false

When false, Plato logs the evaluator exception and continues without structured evaluation metrics. Set this to true when the evaluation itself is a required part of the experiment.
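For example, to make evaluator failures fatal instead of logged-and-skipped, a sketch might look like:

[evaluation]
type = "lighteval"
fail_on_error = true   # abort the run if the evaluator raises an exception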

Built-in evaluators

Evaluator      Install path              Primary output style                                     Typical use
lighteval      uv sync --extra llm_eval  Named benchmark metrics such as ifeval_avg and arc_avg  Server-side LLM evaluation
nanochat_core  uv sync --extra nanochat  core_metric                                             Nanochat benchmark runs; requires trainer.type = "nanochat"

Lighteval

Plato's Lighteval adapter wraps the lighteval package and normalizes its task outputs into CSV-friendly metrics.

Supported options

preset

Name of the built-in task preset.

Current built-in value:

  • smollm_round_fast

This preset runs:

  • ifeval
  • hellaswag
  • arc_easy
  • arc_challenge
  • piqa

primary_metric

The summary metric to treat as the evaluator's primary output.

For smollm_round_fast, the default is ifeval_avg.

backend

Lighteval execution backend.

Supported values in Plato's current integration include:

  • transformers
  • accelerate

transformers and accelerate currently resolve to the same safe server-side launcher path in Plato.

batch_size

Evaluation batch size passed to Lighteval.

Default value: 1

Plato intentionally defaults to 1 to avoid aggressive auto-probing on multi-GPU systems.

max_length

Optional maximum sequence length passed to the Lighteval transformers backend.

max_samples

Optional per-task sample cap.

Example: max_samples = 32 runs up to 32 examples for each configured task. Lighteval shuffles deterministically before truncating, so the subset is stable across runs.

Partial benchmark

When max_samples is set, benchmark numbers are partial and should not be compared directly with full-dataset leaderboard runs.

model_parallel

Whether Lighteval should shard the evaluated model across multiple GPUs.

Default value: false

dtype

Optional evaluation dtype override.

If omitted, Plato infers a sensible default from the trainer configuration:

  • trainer.bf16 = true → bfloat16
  • trainer.fp16 = true → float16

device

Device string for evaluation, such as cuda:0, cuda:1, or cpu.

If omitted, Plato uses Config.device().

show_progress

Whether to show the coarse-grained server-side Lighteval progress bar.

Default value: true

Reference example

The configuration configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml uses Lighteval like this:

[server]
do_test = true

[evaluation]
type = "lighteval"
preset = "smollm_round_fast"
primary_metric = "ifeval_avg"
backend = "transformers"
batch_size = 1
model_parallel = false
device = "cuda:0"
show_progress = true
max_samples = 32

Metrics exported to the CSV

Lighteval summary metrics are written as:

  • evaluation_ifeval_avg
  • evaluation_hellaswag
  • evaluation_arc_easy
  • evaluation_arc_challenge
  • evaluation_arc_avg
  • evaluation_piqa

Plato also exports detailed Lighteval task metrics as additional CSV columns when they are present, for example:

  • evaluation_ifeval_prompt_level_strict_acc
  • evaluation_ifeval_inst_level_loose_acc
  • evaluation_arc_easy_acc
  • evaluation_arc_challenge_acc_stderr
  • evaluation_hellaswag_em
  • evaluation_piqa_em

These columns are added to the CSV automatically the first time they appear.

Nanochat CORE

Nanochat's CORE benchmark is also available through [evaluation].

Supported options

bundle_dir

Optional directory containing the downloaded CORE evaluation bundle.

If omitted, Plato resolves the Nanochat base directory automatically and downloads the bundle when needed.

max_per_task

Optional cap on the number of examples per CORE task.

Default value: -1, which means use all available examples.

Example

Requires the nanochat trainer

nanochat_core is only wired up when trainer.type = "nanochat". The nanochat trainer creates the evaluator internally rather than looking it up in the registry. Setting [evaluation] type = "nanochat_core" with any other trainer type silently produces no evaluation output.

[trainer]
type = "nanochat"

[evaluation]
type = "nanochat_core"
max_per_task = 16

This evaluator exports core_metric, which can be listed in [results].types.
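A sketch of listing core_metric in [results].types might look like the following; the other type names here are illustrative placeholders, not a required set:

[results]
types = "round, accuracy, core_metric"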

Results logging

Structured evaluator metrics are written directly into the runtime CSV in result_path. The CSV is the authoritative log for evaluator outputs.

See Results for how evaluator columns are named and expanded at runtime.