Hero's Journey · rule-induction benchmark

Hero's Journey

Testing Complex Rule Induction with Text Games

Overview of the Hero's Journey benchmark: rule forms, task families, identifiability, evaluation, and results.

We introduce Hero's Journey, a benchmark for rule induction in goal-directed episodic tasks, where an agent must infer hidden rules from demonstrations and act on them through multi-step execution. Across eight tasks in two families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions, the agent must:

  • Induce the hidden rule connecting an entity's attributes to the requirement it imposes (e.g., which item to buy, or which actions to take and in what order).
  • Execute the inferred rule as an ordered sequence of dependent actions.

We evaluate models along two dimensions:

  • ECSR (efficiency-calibrated success rate): whether the model succeeds, and does so efficiently.
  • RV (rule verbalization): whether the model can explicitly articulate the rule it acted on.

We present our results in an interactive leaderboard. Use the Task Explorer to walk through real demonstration episodes and try a held-out entity yourself, or the Leaderboard to compare models across all eight tasks.

Citation

Please cite our paper if you find our work useful:

@misc{zheng2026herosjourneytestingcomplex,
      title={HERO'S JOURNEY: Testing Complex Rule Induction with Text Games},
      author={Anshun Asher Zheng and Kanishka Misra and David I. Beaver and Junyi Jessy Li},
      year={2026},
      eprint={2606.02556},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.02556},
}

The agent must infer a hidden requirement from demonstration episodes, then apply it to a held-out entity. Pick a task, step through the demos, then try the held-out episode yourself before revealing the answer. Some demos are distractors that follow no pattern; select Reveal pattern to see which ones they are, along with the rule. Toggle semantic ⇄ nonce names to see why world knowledge sometimes helps and sometimes cannot.
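To make the setup above concrete, here is a minimal Python sketch of inducing a rule from demos and applying it to a held-out entity. It is not the benchmark's code or data format: the attribute names, item names, and the `induce_rule` and `predict` helpers are all hypothetical, and it assumes the simplest possible rule shape (one attribute determines the requirement), which may not correspond to any of the four rule forms in the benchmark. Distractor demos are handled by picking the attribute whose value-to-requirement mapping explains the most demos.

```python
from collections import Counter, defaultdict

# Hypothetical demos: (entity attributes, requirement observed in the episode).
# The last demo is a distractor that does not follow the color-based pattern.
demos = [
    ({"color": "red", "size": "large"}, "sword"),
    ({"color": "red", "size": "small"}, "sword"),
    ({"color": "blue", "size": "large"}, "shield"),
    ({"color": "blue", "size": "small"}, "shield"),
    ({"color": "red", "size": "small"}, "shield"),
]


def induce_rule(demos):
    """Return (attribute, value -> requirement mapping, demos explained).

    Assumes a single attribute determines the requirement; for each
    attribute, take the majority requirement per value and keep the
    attribute that explains the most demos.
    """
    best = None
    attributes = sorted({name for attrs, _ in demos for name in attrs})
    for attr in attributes:
        votes = defaultdict(Counter)
        for attrs, requirement in demos:
            if attr in attrs:
                votes[attrs[attr]][requirement] += 1
        mapping = {value: c.most_common(1)[0][0] for value, c in votes.items()}
        explained = sum(
            1 for attrs, requirement in demos
            if mapping.get(attrs.get(attr)) == requirement
        )
        if best is None or explained > best[2]:
            best = (attr, mapping, explained)
    return best


def predict(rule, entity):
    """Apply an induced rule to a held-out entity (None if the value is unseen)."""
    attr, mapping, _ = rule
    return mapping.get(entity.get(attr))


rule = induce_rule(demos)
held_out = {"color": "blue", "size": "large"}
prediction = predict(rule, held_out)
```

In this toy case the color-based mapping explains four of the five demos while the size-based one explains three, so the sketch selects color and treats the remaining demo as a distractor. The multi-step execution half of the task, where the inferred requirement must be carried out as an ordered sequence of dependent actions, is not modeled here.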

Efficiency-calibrated success rate (ECSR) and rule-verbalization score (RV, 0–1) per model and task. Toggle models with the chips, switch the metric, and click a table header to sort.
