How We Used SkillOpt to Stress-Test Codex Skills


31 May 2026, 00:00 Z

TL;DR: We used SkillOpt as a starting point for improving Codex skills. The useful adaptation was simple: write synthetic edge cases, score the skill before changing it, make one narrow edit only when the miss is clear, then test on fresh, unseen cases. A post-edit score tells us only that the fix did not break the old case. A first-unseen score tells us more about whether the skill is starting to generalize.

1 The problem: skills can look fine until the edge case arrives

Codex skills are small instruction files and helper scripts that steer an agent through a task. They are useful because they make repeated work less dependent on memory. They are risky for the same reason: once a skill sounds confident, it can hide the missing branch that only appears in an awkward real session.

That is easy to see with concrete examples:

  • A diff-redaction skill can remove obvious API keys but accidentally destroy the file headers and hunk markers that make the diff useful.
  • An MDX blockquote skill can catch a plain list marker but miss the same marker when it is indented or nested inside repeated quote markers.
  • A session handoff skill can export the latest user turns but accidentally include injected context that should never be treated as user intent.
  • A git restore skill can say it restores all uncommitted work but leave staged index changes behind.

These are not exotic failures. They are the kinds of misses that show up after enough users, repos, and agent sessions touch the same skill.
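The diff-redaction miss above can be written down as a small synthetic edge case. The following is a minimal sketch in Python, not the skill's actual helper: `redact_diff`, `structure_preserved`, the `SECRET` pattern, and the prefix list are all hypothetical names chosen for illustration. Matching structural lines by prefix is a simplification, since a removed content line that itself begins with `-- ` would also look like a header.

```python
import re

# Hypothetical, deliberately narrow secret pattern; a real detector is broader.
SECRET = re.compile(r"(sk|ghp)_[A-Za-z0-9]{8,}")

# Lines that carry the diff's structure and must survive redaction untouched.
STRUCTURAL_PREFIXES = ("diff --git ", "index ", "--- ", "+++ ", "@@ ")


def redact_diff(diff_text: str) -> str:
    """Redact secrets on content lines only, leaving structural lines as-is."""
    out = []
    for line in diff_text.splitlines(keepends=True):
        if line.startswith(STRUCTURAL_PREFIXES):
            out.append(line)
        else:
            out.append(SECRET.sub("[REDACTED]", line))
    return "".join(out)


def structure_preserved(before: str, after: str) -> bool:
    """Edge-case check: headers and hunk markers are identical before and after."""
    def skeleton(text: str) -> list:
        return [l for l in text.splitlines() if l.startswith(STRUCTURAL_PREFIXES)]
    return skeleton(before) == skeleton(after)
```

A check like `structure_preserved` scores the property that matters (the diff is still usable) rather than only whether the key disappeared.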

The question we wanted to answer was practical:

  • Can we improve a skill without rewriting it from scratch?
  • Can we tell the difference between a real general improvement and a fix that only matches yesterday's fixture?
  • Can we keep the evidence small enough that future agents can inspect it?

2 What SkillOpt gave us

The SkillOpt paper treats an agent skill as something that can be optimized while the underlying model stays frozen. Its loop uses scored rollouts, bounded edits to a skill document, and held-out validation before accepting the edit.

The detail that mattered most for our work was not the exact optimizer setup. It was the discipline:

  • do not rewrite the skill freely
  • score behavior before editing
  • make bounded changes
  • accept changes only when held-out behavior improves
  • preserve rejected edits and failures as evidence
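The discipline in this list can be sketched as an acceptance gate. This is an illustrative Python sketch under stated assumptions, not the SkillOpt optimizer: the `MAX_CHANGED_LINES` bound, the `EditLog` structure, and the `score_fn(skill_text, cases)` signature are hypothetical.

```python
import difflib
from dataclasses import dataclass, field

MAX_CHANGED_LINES = 5  # hypothetical bound on how large one edit may be


@dataclass
class EditLog:
    """Keeps accepted and rejected edits so failures remain inspectable."""
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def changed_lines(old: str, new: str) -> int:
    """Count lines touched by an edit, using difflib's opcode ranges."""
    matcher = difflib.SequenceMatcher(None, old.splitlines(), new.splitlines())
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def consider_edit(skill, candidate, score_fn, held_out, log):
    """Return the skill text to keep; log the decision either way."""
    size = changed_lines(skill, candidate)
    if size > MAX_CHANGED_LINES:
        log.rejected.append({"reason": "edit too large", "changed": size})
        return skill
    before = score_fn(skill, held_out)      # score behavior before editing
    after = score_fn(candidate, held_out)   # score the bounded edit
    record = {"before": before, "after": after, "changed": size}
    if after > before:                      # accept only on held-out improvement
        log.accepted.append(record)
        return candidate
    log.rejected.append({**record, "reason": "no held-out improvement"})
    return skill
```

The gate never discards a rejected edit silently: both outcomes land in the log, which keeps the evidence trail small but complete.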

We adapted that into a workflow that works for Codex skills and helper scripts. The agent can run the experiment itself, but the scoring rule has to be explicit enough that the agent cannot quietly count a patched case as fresh evidence.
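One way to make that scoring rule explicit is to let a ledger, not the agent, decide how each result is labeled. The sketch below is a hypothetical Python illustration: the `Ledger` class and the case IDs are assumptions, and the labels simply follow the post-edit and first-unseen terms used in this post.

```python
from typing import Optional


class Ledger:
    """Labels each scored case by whether the skill has met it before."""

    def __init__(self) -> None:
        self.seen = set()     # case IDs that have already been scored once
        self.records = []     # (case_id, kind, passed) tuples, kept as evidence

    def record(self, case_id: str, passed: bool) -> str:
        # A case counts as first-unseen only the first time it is scored.
        kind = "post-edit" if case_id in self.seen else "first-unseen"
        self.seen.add(case_id)
        self.records.append((case_id, kind, passed))
        return kind

    def rate(self, kind: str) -> Optional[float]:
        """Pass rate for one label, or None if nothing carries that label."""
        results = [passed for _, k, passed in self.records if k == kind]
        return sum(results) / len(results) if results else None
```

Because the label is derived from the ledger's history, re-running a patched case can only ever add to the post-edit score. It cannot be reported as fresh evidence.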
