Teaching AI Agents New Tricks Without Rewiring Their Brains

When you want an AI agent to get better at a task, the obvious answer is to retrain it. Update the weights, adjust the parameters, and run another training loop. But what if you could not touch the model at all? What if the model was someone else's, locked behind an API, and all you had was the ability to give it instructions?

That is the problem Microsoft researchers set out to solve with SkillOpt, a system published in May 2026. The idea is deceptively simple: instead of training the model, train the instructions you give it. And do it with the same rigor and discipline that makes neural network training reliable.

The Problem With How Agent Skills Are Made Today

AI agents usually come with what researchers call "skills", natural language documents that tell the agent how to behave in a given domain. Think of them like a procedural playbook: how to search for information, how to format outputs, how to handle errors, which tools to call, and when.

Right now, these skill documents are created in one of three ways. Either a human writes them by hand, an LLM generates them in a single shot, or the agent slowly revises them over time in a loose, uncontrolled way. None of these approaches works particularly well under pressure. Handwritten skills are brittle. One-shot-generated skills cannot correct themselves when they fail. Loosely self-revised skills drift in unpredictable directions.

The core issue is that none of these methods behaves like an actual optimizer. They do not systematically learn from failure. They do not test whether a change actually helped before committing to it. They do not have any concept of a learning rate or a validation set.

The Analogy That Drives SkillOpt

SkillOpt is built on a direct analogy to deep learning training but applied to text instead of model weights.

In standard neural network training, you have parameters that get updated based on gradients, a learning rate that controls how large each update is, and a validation set that tells you whether you are actually improving or just overfitting. SkillOpt maps each of those concepts onto a text-based skill document:

Deep Learning	SkillOpt
Model parameters	Skill document
Gradient direction	Edits derived from agent trajectories
Learning rate	Edit budget (how many changes per step)
Validation check	Held-out selection gate
Stable training settings	Batch size, schedule, gate

The "trainable object" in SkillOpt is a plain Markdown file called best_skill.md. Everything else, the frozen target model, the execution harness, and the benchmark evaluator, stays fixed. Only the skill document changes.

How the Loop Actually Works

Here is the optimization loop at a high level. A frozen target model runs a batch of tasks using the current skill document. A separate optimizer model reads through the trajectories of what the agent did, what succeeded, and what failed, and proposes a small set of structured edits to the skill. These edits are limited to three types: add a new rule, delete an existing rule, or replace one with something better.

The key constraint is the edit budget. At each step, only a small number of edits are allowed through. This is the "textual learning rate." It prevents the optimizer from rewriting the entire skill document at once, which would erase useful rules and make the optimization history meaningless.

Once a candidate skill is assembled, it gets evaluated on a held-out validation split. Only if it scores strictly higher than the current skill does it get accepted. Ties are rejected. Failed edits are not thrown away; they go into a "rejected-edit buffer" that tells the optimizer what not to try again later in the same training run.

At the end of each epoch, a slower update process looks across the full epoch, comparing the same tasks run under the previous skill and the current skill. It writes a protected "longitudinal guidance" block into the skill document that step-level edits cannot overwrite. This is the equivalent of a momentum term; it captures durable lessons that survive across multiple epochs.

After training, the output is a single best_skill.md file, typically between 300 and 2,000 tokens. The target model is never modified. Deploying the result means no weight updates and no additional model calls at inference time.

Running It Yourself

The repo is open source, and the setup is straightforward. After cloning and installing:

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

You configure your API credentials once:

cp .env.example .env
# fill in your endpoint and key, then:
source .env

For Azure OpenAI:

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-key"

For standard OpenAI endpoints:

export AZURE_OPENAI_ENDPOINT="https://api.openai.com/v1"
export AZURE_OPENAI_API_KEY="sk-..."
export AZURE_OPENAI_AUTH_MODE="openai_compatible"

For Anthropic:

export ANTHROPIC_API_KEY="sk-ant-..."

Training a skill on a benchmark looks like this:

python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

The optimizer model and the target model can be different. You can use a stronger model to train the skill and a cheaper or smaller one to actually run it at deployment. The training run produces a structured output directory:

outputs/<run_name>/
├── best_skill.md            # the artifact you actually deploy
├── history.json             # per-step training history
├── skills/skill_vXXXX.md   # skill snapshot per step
└── steps/step_XXXX/        # per-step patches and eval logs

If a run gets interrupted, re-running the same command picks up where it left off.

The paper also ships a set of pre-trained skill artifacts in the ckpt/ folder for GPT-5.5 across the six benchmarks, so you can evaluate the provided skills without running a full training loop:

python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill ckpt/searchqa/gpt5.5_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

What the Numbers Look Like

The paper evaluates SkillOpt across six benchmarks (question answering, spreadsheets, document reasoning, math, and embodied decision-making), seven target models ranging from GPT-5.5 down to small Qwen variants, and three execution environments: direct chat, the Codex CLI, and Claude Code.

The headline result is that SkillOpt is best or tied for best on all 52 evaluated combinations. On GPT-5.5 in direct chat, the average accuracy across six benchmarks goes from 58.8 with no skill to 82.3 with an optimized skill, a gain of 23.5 points. The gains are largest on procedural benchmarks: SpreadsheetBench goes from 41.8 to 80.7, OfficeQA from 33.1 to 72.1, and LiveMathematicianBench from 37.6 to 66.9.

Smaller models benefit the most in relative terms. GPT-5.4-nano nearly doubles its DocVQA score and roughly triples on ALFWorld when given a well-optimized skill. The paper interprets this as the skill supplying procedural knowledge that smaller models do not already have in their weights.

The gains also hold inside tool-backed execution loops. Under the Codex harness, SkillOpt lifts GPT-5.5 by 24.8 points on average over no skill. Under Claude Code, it lifts by 19.1 points.

Skills Transfer Across Models and Environments

One of the more practically useful findings is that optimized skill documents are not tied to the model or environment they were trained on.

A SpreadsheetBench skill trained on GPT-5.4 transferred to GPT-5.4-mini with a gain of 9.4 points and to GPT-5.4-nano with a gain of 3.0 points, without any further optimization. More striking: a skill trained inside the Codex execution environment transferred to Claude Code with a 59.7 point absolute gain over the Claude Code baseline, slightly exceeding the score of a skill trained directly inside Claude Code. The symmetric transfer from Claude Code to Codex added 43.6 points on top of the Codex no-skill baseline.

Cross-benchmark transfer is smaller but still positive. An OlympiadBench math skill applied to Omni-MATH gained 3.7 points on GPT-5.4, 1.8 on GPT-5.4-mini, and 1.3 on GPT-5.4-nano, with no additional training.

These results matter practically. A skill can be trained once, audited as a readable text file, and reused across different models and deployment targets without touching any weights.

What the Learned Rules Actually Look Like

One thing worth noting is how readable the final skill artifacts are. Each best_skill.md is between 300 and 2,000 tokens, and the gains come from only one to four accepted edits across the entire training run. The optimizer proposes many more edits per epoch, but most of them fail the validation gate and never make it into the deployed skill.

The paper shows one representative learned rule per benchmark. For the spreadsheet benchmark, the core rule learned was: inspect the actual workbook structure and formulas, then write out fully computed static values across the requested target range, rather than relying on Excel to recalculate. For math reasoning, the rule was: in strongest-statement multiple-choice questions, rank choices by theorem strength and prefer a justified stronger result over true but weaker corollaries. For the household navigation environment, the rule was: keep a visited-location ledger, diversify search after repeated same-type failures, and avoid revisiting the destination until actually holding the target object.

These are not instance-specific hacks. They are generalizable procedures that a thoughtful domain expert might write after spending a day with the benchmark, except they were discovered automatically and validated against held-out data.

What does this change for Teams Building on AI Agents

The practical implication is that domain adaptation for AI agents does not require model access. If you are building on top of a closed frontier model orThese running an open model but want to avoid the cost of repeated fine-tuning, optimizing a skill document is now a legitimate alternative with a principled training loop behind it.

The optimization cost is paid once. After that, it’s "skills." best_skill.md is just a text file that gets prepended to the agent's context. It adds no inference-time overhead, no extra model calls, and nothing that requires maintaining a separate system at runtime.

The system also supports adding new benchmarks and new model backends with relatively little code. A benchmark needs a dataloader, a rollout function, and a seed skill document. A backend needs a chat or execution adapter registered in the model router. The repo ships a WebUI dashboard for monitoring training runs if you want visibility into how the skill is evolving across epochs.

Resources

Github - Official Repository of SkillOpt
arXiv - Official arXiv page
Page - Official website

Teaching AI Agents New Tricks Without Rewiring Their Brains

The Problem With How Agent Skills Are Made Today

The Analogy That Drives SkillOpt

How the Loop Actually Works

Running It Yourself

What the Numbers Look Like

Skills Transfer Across Models and Environments

What the Learned Rules Actually Look Like

What does this change for Teams Building on AI Agents

Resources

Keep Reading

Get the Free Tech & AI Newsletter

Quick Links

Subscription

Socials