February 24, 2026

Have We Been Wrong All Along About .md Files?

The Research Says: It Depends on Who Wrote Them

Segev Sinay


Frontend Architect


I have a CLAUDE.md file in every project I work on. I wrote an entire article about why it is the most important file in your codebase. I have recommended it to every developer I consult with.

So when a new research paper dropped suggesting that these files might not actually help - and in some cases might make things worse - I paid attention.

The Paper That Started a Fire

A February 2025 paper titled "SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks" tested something we have all been assuming: do structured markdown instructions actually improve AI agent performance?

The researchers ran 7,308 execution trajectories across 86 tasks and 11 domains, testing curated human-written skills, self-generated skills, and a no-skills baseline. The findings challenge some deeply held assumptions in our community.

The headline numbers:

  • Human-curated skill files improved success rates by 16.2 percentage points overall
  • But the results were wildly inconsistent across domains - from +4.5pp in software engineering to +51.9pp in healthcare
  • Self-generated skills provided zero benefit on average - models cannot write their own effective instructions
  • 16 out of 84 tasks actually performed worse with skills attached

That last point deserves a pause. In nearly 20% of cases, giving the AI instructions made it perform worse than giving it nothing at all.

The Hacker News Debate

When this research hit Hacker News, the developer community split into predictable camps. The optimists pointed out that a 16.2-percentage-point improvement from a simple markdown file is massive. The skeptics pointed out that software engineering - our domain - saw the smallest improvement of any field tested.

But the most interesting insight came from developers sharing what actually works in practice.

One pattern kept emerging: the best AGENTS.md files are not written proactively. They are written reactively, after watching the agent fail at something specific. You observe the mistake, you document the non-obvious context that would prevent it, and you move on. This produces focused, high-signal instructions instead of comprehensive documentation that dilutes the important bits.

Why Self-Generated Instructions Fail

This is the finding that should concern every team using AI tools. When models generate their own instruction files - following official recommendations from tool vendors, no less - the result is consistently useless or harmful.

The paper's explanation resonates with what I see in real codebases: AI documents the obvious and misses the non-obvious. It will write "use TypeScript strict mode" and "follow component naming conventions" - things the model would do anyway from reading your code. But it will miss "our API returns dates as Unix timestamps, not ISO strings" or "the payments service has a 5-second timeout that affects the checkout flow."
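The timestamp quirk is a good concrete case of non-obvious context: without the rule, an agent will almost always reach for the wrong conversion. A hypothetical TypeScript sketch (the `UserResponse` shape is invented for illustration) of the bug that one line of documentation prevents:

```typescript
// Hypothetical response shape: the API returns dates as Unix
// timestamps in seconds, not ISO strings.
interface UserResponse {
  id: string;
  createdAt: number; // seconds since epoch
}

// Without the documented context, an agent tends to write
// `new Date(user.createdAt)` - which treats seconds as milliseconds
// and silently produces dates in January 1970.
function parseCreatedAt(user: UserResponse): Date {
  return new Date(user.createdAt * 1000); // seconds -> milliseconds
}
```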

The tribal knowledge, the hard-won lessons, the "is this intentional?" decisions - that is what belongs in these files. And that is exactly what AI cannot generate for itself.

What Actually Works: Lessons From the Data

Combining the paper's findings with what I have seen across dozens of frontend codebases, here is what separates useful .md files from harmful ones:

1. Less Is More - Dramatically

The research found that focused skill sets with 2-3 modules outperformed comprehensive documentation. This matches my experience. A 200-line CLAUDE.md that covers your five most important conventions outperforms a 2,000-line document that tries to cover everything.

Why? Context window pollution. When you give an agent a massive instruction file, the important rules get drowned in a sea of obvious guidelines. The agent treats "always use semicolons" with the same weight as "never call the legacy billing API directly because it will double-charge the customer."

2. Write Rules After Failures, Not Before

The most effective pattern I have found:

  1. Start with a minimal .md file - project overview, stack, key conventions
  2. Use the AI agent on real tasks
  3. When it makes a mistake that stems from missing context, add a rule
  4. When it makes a mistake it should have known to avoid, add a rule
  5. Never add rules for things it already does correctly

This produces a file that is 100% signal. Every line exists because there is a real failure it prevents.
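As a reference point, the minimal file in step 1 can be very small indeed. A sketch, with a hypothetical stack and conventions standing in for your own:

```markdown
# Project Context

Next.js app (App Router), TypeScript, Tailwind, Postgres via Prisma.

## Key conventions
- Server components by default; add "use client" only where required
- All data access goes through src/lib/db - never import Prisma directly

## Rules added after observed failures
(none yet - add one per real mistake)
```

Everything below the last heading gets earned, one observed failure at a time.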

3. Positive Framing Beats Negative

Multiple developers in the Hacker News discussion noted that models struggle with negative instructions. "Do not use inline styles" is less effective than "always use Tailwind utility classes for styling." The research supports this - positive, directive instructions consistently outperform prohibitions.

4. Domain Knowledge Over Coding Style

The paper showed the biggest improvements in domains where the AI lacks specialized knowledge (healthcare: +51.9pp). In software engineering, where models already have extensive training data, the improvement was modest (+4.5pp).

This tells us something important: do not waste your .md file on coding style. ESLint and Prettier handle that. Use it for the things the model genuinely cannot infer from your code:

  • Business logic constraints
  • Integration quirks and API behaviors
  • Performance requirements and budgets
  • Deployment and environment specifics
  • Historical decisions and their rationale

My Updated Approach

After digesting this research, I have revised how I structure CLAUDE.md files for my consulting clients:

Before (comprehensive approach):

# Project Guidelines
## Code Style
- Use TypeScript strict mode
- Use functional components
- Use arrow functions for handlers
- Name components with PascalCase
- Name hooks with use prefix
... (200 more style rules)

After (reactive, high-signal approach):

# Critical Context
- Payment webhooks arrive out of order - always check
  idempotency key before processing
- The /api/users endpoint returns paginated results even
  for single-user queries (legacy API, do not change)
- Bundle budget is 150kb first-load JS - check with
  'npx next-bundle-analyzer' before adding dependencies

The first approach documents things the model already knows. The second documents things it cannot possibly know without being told.
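To make the contrast concrete: the webhook rule above encodes behavior an agent cannot infer from reading the code. A minimal sketch of the check it prescribes (hypothetical names, with an in-memory set standing in for a durable store):

```typescript
// Hypothetical webhook handler illustrating the idempotency rule.
// A real service would persist processed keys in a durable store,
// not process memory.
const processedKeys = new Set<string>();

interface PaymentWebhook {
  idempotencyKey: string;
  amountCents: number;
}

function handleWebhook(event: PaymentWebhook): "processed" | "skipped" {
  // Webhooks can arrive out of order or be redelivered; the
  // idempotency key is checked before any side effects run.
  if (processedKeys.has(event.idempotencyKey)) {
    return "skipped";
  }
  processedKeys.add(event.idempotencyKey);
  // ...apply the payment here...
  return "processed";
}
```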

The Uncomfortable Question

Here is what this research really asks: are we writing .md files for the AI, or for ourselves?

I think many developers - myself included - have been using CLAUDE.md and AGENTS.md files partly as a way to feel in control. Writing rules feels productive. It feels like you are shaping the AI's behavior. And sometimes you are. But the data suggests that much of what we write is noise that either has no effect or actively degrades performance.

The 16.2 percentage point improvement for curated skills is real and significant. But it requires genuine curation - understanding what the model needs to know versus what it already knows. That distinction is harder to get right than most of us want to admit.

A Critical Caveat the Research Misses

Before you draw conclusions, there is something important this paper does not measure: the nature of the errors.

The study tracks binary success and failure. A task either passes or it does not. But anyone who has worked with AI coding agents knows that not all failures are equal. An agent that misnames a CSS class is a different problem than an agent that calls a production API endpoint with destructive side effects.

The real question is not just "does the agent succeed more often with a .md file?" It is: what kind of mistakes does it make without one?

If a CLAUDE.md file turns ten critical architectural errors into eight minor style inconsistencies, the success rate might look similar - but the actual impact on your codebase is night and day. A convention violation takes seconds to fix. A wrong data-fetching pattern embedded across fifteen components takes days.

AI agents are still error-prone. That is not changing anytime soon. The trade-off is not between perfection and imperfection - it is between expensive errors and cheap ones. And that is a dimension this research does not capture. Until someone studies the severity and category of failures with and without instruction files, we are only seeing half the picture.

My gut, from working with these tools daily: the .md file's biggest value is not making the agent succeed more. It is making the agent fail better - in ways that are easier to catch, cheaper to fix, and less likely to corrupt your architecture.

The Bottom Line

Should you use .md files with AI coding agents? Absolutely yes - but with a fundamentally different approach than most of us have been taking.

Write less. Write reactively. Write what the AI cannot infer. Skip everything it can learn from your code. And for the love of your context window, do not let the AI generate its own instructions.

The research is clear: a small, focused, human-curated instruction file makes AI agents meaningfully better. A large, comprehensive, auto-generated one makes them measurably worse. And the full picture - how these files change the severity of mistakes, not just the frequency - is a question the research has not answered yet.

We were not wrong about .md files. We were wrong about how to write them.

