How We Used SkillOpt to Stress-Test Codex Skills

31 May 2026, 00:00 Z

TL;DR: We used SkillOpt as a starting point for improving Codex skills. The useful adaptation was simple: write synthetic edge cases, score the skill before changing it, make one narrow edit only when the miss is clear, then test on fresh unseen cases. A post-edit score tells us only that the fix handles the known case without breaking it. A first-unseen score, taken on cases the skill has never been edited against, tells us more about whether the skill is starting to generalize.

1 The problem: skills can look fine until the edge case arrives

Codex skills are small instruction files and helper scripts that steer an agent through a task. They are useful because they make repeated work less dependent on memory. They are risky for the same reason: once a skill sounds confident, it can hide the missing branch that only appears in an awkward real session.

That is easy to see with concrete examples:

A diff-redaction skill can remove obvious API keys but accidentally destroy the file headers and hunk markers that make the diff useful.
An MDX blockquote skill can catch a plain list marker but miss the same marker when it is indented or nested inside repeated quote markers.
A session handoff skill can export the latest user turns but accidentally include injected context that should never be treated as user intent.
A git restore skill can say it restores all uncommitted work but leave staged index changes behind.
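
To make the first of these failures concrete, here is a minimal Python sketch of a redaction pass that leaves diff structure alone. It is not the actual `redacted-diff` helper: the `SECRET_ASSIGNMENT` pattern, the `[REDACTED]` placeholder, and the `redact_diff` name are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: KEY=value or KEY: value where the key name looks secret-bearing.
SECRET_ASSIGNMENT = re.compile(
    r"(?i)\b([A-Z0-9_]*(?:API_KEY|SECRET|TOKEN|PASSWORD)[A-Z0-9_]*)(\s*[=:]\s*)(\S+)"
)

# Line prefixes that carry unified-diff structure and should pass through untouched.
STRUCTURAL_PREFIXES = ("diff --git ", "index ", "--- ", "+++ ", "@@ ")


def redact_diff(diff_text: str) -> str:
    """Redact secret-looking values while keeping file headers and hunk markers."""
    redacted = []
    for line in diff_text.splitlines():
        if line.startswith(STRUCTURAL_PREFIXES):
            # Keep structure verbatim so the diff stays readable and reviewable.
            redacted.append(line)
            continue
        # Keep the key name and separator, replace only the value.
        redacted.append(SECRET_ASSIGNMENT.sub(r"\1\2[REDACTED]", line))
    return "\n".join(redacted)


SAMPLE_DIFF = """diff --git a/.env b/.env
index 1111111..2222222 100644
--- a/.env
+++ b/.env
@@ -1,2 +1,2 @@
 APP_NAME=demo
-API_KEY=old-secret-value
+API_KEY=new-secret-value"""

REDACTED_SAMPLE = redact_diff(SAMPLE_DIFF)
```

The prefix check is deliberately crude. A removed line whose content itself starts with `-- ` would be mistaken for a file header, so a fuller version would need to track whether it is inside a hunk. That is exactly the kind of branch a synthetic edge case is meant to surface.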

These are not exotic failures. They are the kinds of misses that show up after enough users, repos, and agent sessions touch the same skill.

The question we wanted to answer was practical:

Can we improve a skill without rewriting it from scratch?
Can we tell the difference between a real general improvement and a fix that only matches yesterday's fixture?
Can we keep the evidence small enough that future agents can inspect it?

2 What SkillOpt gave us

The SkillOpt paper treats an agent skill as something that can be optimized while the underlying model stays frozen. Its loop uses scored rollouts, bounded edits to a skill document, and held-out validation before accepting the edit.

The detail that mattered most for our work was not the exact optimizer setup. It was the discipline:

do not rewrite the skill freely
score behavior before editing
make bounded changes
accept changes only when held-out behavior improves
preserve rejected edits and failures as evidence
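
The discipline above can be sketched as a small acceptance gate. This is a simplified illustration, not the optimizer from the SkillOpt paper: `SkillExperiment`, `EditRecord`, the line-count bound, and the idea of a single integer held-out score are all assumptions made to keep the example short.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def changed_lines(old: str, new: str) -> int:
    """Count added or removed lines between two versions of a skill document."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


@dataclass
class EditRecord:
    description: str
    baseline_score: int
    candidate_score: Optional[int]  # None when the edit was never scored
    accepted: bool
    reason: str


@dataclass
class SkillExperiment:
    skill: str
    score_held_out: Callable[[str], int]  # hypothetical scorer: cases passed
    max_changed_lines: int = 3            # assumed bound on edit size
    history: List[EditRecord] = field(default_factory=list)

    def propose(self, description: str, candidate: str) -> bool:
        # Score behavior before editing.
        baseline = self.score_held_out(self.skill)
        if changed_lines(self.skill, candidate) > self.max_changed_lines:
            # Free rewrites are rejected without being scored.
            record = EditRecord(description, baseline, None, False, "edit too large")
        else:
            after = self.score_held_out(candidate)
            accepted = after > baseline
            reason = "held-out improved" if accepted else "no held-out improvement"
            record = EditRecord(description, baseline, after, accepted, reason)
        # Rejected edits stay in the history as evidence.
        self.history.append(record)
        if record.accepted:
            self.skill = candidate
        return record.accepted
```

Every proposal lands in `history` whether or not it is accepted, so a later reader can see which edits were tried and why they were turned down.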

We adapted that into a workflow that works for Codex skills and helper scripts. The agent can run the experiment itself, but the scoring rule has to be explicit enough that the agent cannot quietly count a patched case as fresh evidence.
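
One way to make that rule explicit is a ledger that decides the label of a score from the case history rather than from the agent's description of the run. The sketch below is an assumption about how such a check could look; `ScoreLedger`, the `rerun` label, and the case IDs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ScoreLedger:
    """Tracks which case IDs have already been scored at least once."""
    seen_case_ids: Set[str] = field(default_factory=set)
    entries: List[dict] = field(default_factory=list)

    def record(self, batch_name: str, results: Dict[str, bool]) -> dict:
        # Any case scored before cannot count as fresh evidence again.
        reused = sorted(cid for cid in results if cid in self.seen_case_ids)
        entry = {
            "batch": batch_name,
            # "rerun" covers post-edit scores on cases the skill was patched against.
            "kind": "rerun" if reused else "first-unseen",
            "passed": sum(results.values()),
            "total": len(results),
            "reused_case_ids": reused,
        }
        self.seen_case_ids.update(results)
        self.entries.append(entry)
        return entry
```

With this in place, scoring a batch, editing the skill, and scoring the same batch again produces one `first-unseen` entry and one `rerun` entry. Only a batch made entirely of new case IDs can extend a first-unseen streak.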

Skill	What the test caught	Accepted edit theme	First-unseen streak
`redacted-diff`	Public metadata and secret-looking structure were easy to over-redact or under-redact.	Preserve useful diff structure and public metadata while still redacting secret-bearing values.	OOD23 12 / 12, OOD24 12 / 12, OOD25 11 / 12
`eclat-mdx-blockquote-safety`	OOD1 scored 8 / 12 because the skill missed indented and nested blockquote marker shapes.	Generalize audit regexes across indentation and repeated quote markers.	OOD2 12 / 12, OOD3 12 / 12, OOD4 12 / 12
`codex-session-handoff`	The helper could leak injected context or let older real turns re-enter after filtering.	Skip injected wrappers and apply the tail window before filtering.	OOD3 11 / 12, OOD4 12 / 12, OOD5 11 / 12
`restore-changes`	OOD1 baseline scored 10 / 12 because staged changes were not handled fully.	Add staged-aware preview and restore guidance.	OOD2 12 / 12, OOD3 12 / 12, OOD4 12 / 12
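
The `codex-session-handoff` row is the least obvious of the four, so here is one reading of why the order of windowing and filtering matters. The `Turn` type, its `injected` flag, and both function names are assumptions for illustration. The real helper's data model is not shown in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str
    text: str
    injected: bool = False  # hypothetical flag for injected context wrappers


def is_real_user_turn(turn: Turn) -> bool:
    return turn.role == "user" and not turn.injected


def tail_then_filter(turns: List[Turn], window: int) -> List[Turn]:
    """Take the last `window` turns first, then drop injected or non-user turns."""
    tail = turns[-window:] if window > 0 else []
    return [t for t in tail if is_real_user_turn(t)]


def filter_then_tail(turns: List[Turn], window: int) -> List[Turn]:
    """Order shown for contrast: filtering first lets older turns slide back in."""
    real = [t for t in turns if is_real_user_turn(t)]
    return real[-window:] if window > 0 else []


SESSION = [
    Turn("user", "old request"),
    Turn("user", "<injected context A>", injected=True),
    Turn("user", "<injected context B>", injected=True),
    Turn("user", "new request"),
]
```

With `window=2`, `tail_then_filter(SESSION, 2)` keeps only the new request, because the window is spent on the last two raw turns before the injected one is dropped. `filter_then_tail(SESSION, 2)` returns both the old and the new request: removing injected turns first frees a slot that an older real turn then fills.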
