Improving an AI coding assistant at scale requires more than benchmark runs and internal evaluations. Anthropic, according to a Business Insider investigation, has been quietly running a contractor program called Project Marlin, channeled through data-annotation vendor Snorkel AI, that enlists roughly 1,000 software engineers to judge which version of Claude Code produces better code. The contractors don't know which model they're evaluating. They just see two outputs and pick one.

The program is an example of how the fine-tuning pipeline behind a commercially deployed AI product actually works, stripped of the abstractions typically used to describe it. Reinforcement learning from human feedback, or RLHF, is the standard label. In practice, it means human evaluators make tens of thousands of discrete choices about quality, and those choices are aggregated into a training signal that shapes how the model behaves. Project Marlin is that process applied specifically to Claude Code, the coding tool that crossed $47 billion in annualized revenue in May 2026.
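To make the aggregation step concrete, here is a minimal sketch of how individual pairwise choices could be rolled up into a per-variant win rate. It is an illustration only, not Anthropic's or Snorkel's pipeline: the function name, the variant labels "v1" and "v2", and the sample judgments are all hypothetical, and production RLHF systems typically use such comparisons to train a reward model rather than computing a simple tally.

```python
from collections import Counter


def win_rates(judgments):
    """Aggregate pairwise judgments into a win rate per variant.

    judgments: iterable of (variant_a, variant_b, winner) tuples, where
    winner must be one of the two variants compared.
    Returns {variant: wins / comparisons it appeared in}.
    """
    wins = Counter()
    appearances = Counter()
    for variant_a, variant_b, winner in judgments:
        if winner not in (variant_a, variant_b):
            raise ValueError(f"winner {winner!r} was not part of the comparison")
        appearances[variant_a] += 1
        appearances[variant_b] += 1
        wins[winner] += 1
    return {variant: wins[variant] / appearances[variant] for variant in appearances}


# Hypothetical judgments: four evaluators compare two model variants.
judgments = [
    ("v1", "v2", "v2"),
    ("v1", "v2", "v2"),
    ("v1", "v2", "v1"),
    ("v2", "v1", "v2"),
]
rates = win_rates(judgments)
print(rates)
```

A plain win rate is the simplest possible summary. It ignores evaluator disagreement and prompt difficulty, which is one reason real pipelines tend to fit a learned model to the comparisons instead.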

How the Program Runs

Contractors working on Project Marlin are software engineers by background, not general-purpose annotators. Each task involves creating a prompt designed to test a specific coding scenario, examining two model-generated responses, and selecting the one that better demonstrates what a professional developer would produce. The blind comparison structure means evaluators can't adjust for which model version they're reviewing, removing one common source of evaluator bias.
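The blind comparison described above can be sketched as follows. This is an assumed design, not the actual Project Marlin tooling: the function names, the "Response A" and "Response B" labels, and the variant names are hypothetical. The idea is that the evaluator sees only neutral labels in a randomized order, while a separate key, kept away from the evaluator, maps each label back to the model version.

```python
import random


def make_blind_task(prompt, outputs_by_variant, rng):
    """Build a blind A/B task from exactly two variant outputs.

    Returns (task, key): task is what the evaluator sees; key maps the
    neutral labels back to variant names and is stored separately.
    """
    if len(outputs_by_variant) != 2:
        raise ValueError("a blind A/B task needs exactly two outputs")
    variants = list(outputs_by_variant)
    rng.shuffle(variants)  # randomize which variant appears first
    labels = ["Response A", "Response B"]
    key = dict(zip(labels, variants))
    responses = {label: outputs_by_variant[variant] for label, variant in key.items()}
    return {"prompt": prompt, "responses": responses}, key


def resolve_choice(key, chosen_label):
    """Translate the evaluator's label choice back to a variant name."""
    return key[chosen_label]


# Hypothetical usage with two made-up variant names.
outputs = {
    "candidate-old": "def add(a, b): return a + b",
    "candidate-new": "def add(left, right):\n    return left + right",
}
task, key = make_blind_task("Write an add function.", outputs, random.Random(0))
chosen_variant = resolve_choice(key, "Response A")
```

Randomizing the order per task matters as much as hiding the names: if one variant always appeared first, position bias could leak into the results.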

Compensation runs up to $280 per task. Each task takes roughly an hour, though some require additional back-and-forth with Snorkel's approval layer before they clear. At that rate, a contractor completing one task per hour can earn up to $280 per hour on qualifying submissions, well above standard annotation rates. The premium reflects the specialized engineering background required to make meaningful judgments about code quality, not just syntactic correctness.

Project Marlin by the Numbers

  • Contractor pool: ~1,000 software engineers
  • Pay per task: Up to $280
  • Task duration: ~1 hour typical
  • Data vendor: Snorkel AI
  • Evaluation method: Blind A/B comparison
  • Target product: Claude Code

What "Better Code" Means in Practice

One of the contractors who spoke to Business Insider described the goal as training Claude Code to produce simplified, easier-to-maintain code. That phrasing captures a real tension in AI-generated code. Models optimized purely on correctness tend to produce code that works but is dense: light on descriptive variable names, heavy on clever tricks, and hard for a subsequent engineer to modify without breaking something. Professional developers tend to value the opposite: clear naming, consistent structure, and code that is explicit about what it's doing rather than assuming the reader will reconstruct the logic.
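A small, invented illustration of that tension: the two functions below behave identically, but an evaluator applying the maintainability standard described above would likely prefer the second. Neither comes from Claude Code or from Project Marlin tasks; both are hypothetical examples written for this article.

```python
# Dense version: correct, but the name and the index trick hide the intent.
def f(xs):
    return [x for i, x in enumerate(xs) if not i or x != xs[i - 1]]


# Explicit version: same behavior, with the intent spelled out.
def remove_consecutive_duplicates(items):
    """Return items with runs of repeated adjacent values collapsed to one."""
    deduplicated = []
    for item in items:
        is_repeat = bool(deduplicated) and deduplicated[-1] == item
        if not is_repeat:
            deduplicated.append(item)
    return deduplicated


sample = [1, 1, 2, 2, 1]
dense_result = f(sample)
explicit_result = remove_consecutive_duplicates(sample)
```

Both versions would pass the same unit tests, which is why a test suite alone cannot tell them apart.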

That preference is difficult to capture in an automated benchmark, because passing tests and satisfying a linter say little about maintainability. Human evaluation, specifically evaluation by engineers with professional experience, is one of the few ways to generate signal on the qualities that matter for long-term code health. Project Marlin is Anthropic's attempt to inject that signal systematically into Claude Code's training.

"The project focuses on fine-tuning Claude Code's answers so that it could mimic what a professional developer could do." (Project Marlin contractor, via Business Insider)

Context: The Quality Postmortem and What Came After

Project Marlin doesn't exist in isolation. In April 2026, Anthropic published a postmortem on a quality regression in Claude Code that had drawn user complaints. The document was unusually candid for a product company, detailing how an internal model change had degraded coding quality in ways that automated tests didn't catch. The admission that quality regressions can slip through without human review was implicit in the analysis.

Project Marlin, regardless of when it was initiated, represents one operational response to that class of problem. A pool of experienced engineers conducting ongoing blind evaluations creates a signal that runs parallel to internal benchmarks and can catch qualitative degradation, such as code that is correct but unhelpful, that automated tests miss. Whether that signal is sufficient to prevent the next regression depends on how tightly the training loop connects contractor judgments to deployed model behavior.

Snorkel AI, the vendor coordinating the contractor program, builds platforms for data labeling and programmatic training-data generation. Its involvement suggests that Anthropic is treating Project Marlin as a structured, ongoing data-collection exercise rather than a one-time evaluation sweep. The company's tooling is designed for the kind of large-scale, iterative annotation work that a continuous fine-tuning pipeline requires.

A Broader Pattern in AI Development

Anthropic is not unique in running this kind of program. Contractor evaluation is standard practice across the major AI labs for exactly the reasons Project Marlin illustrates: human judgment on qualities like maintainability, tone, and professional appropriateness is hard to automate, and the gap between automated benchmarks and real-world user satisfaction is where most product quality problems hide.

What makes the Marlin disclosure noteworthy is the specificity. The scale (roughly 1,000 contractors), the rate (up to $280 per task), and the blind comparison methodology are not details that AI companies typically share proactively. Their emergence via an investigative report reflects both the growing interest in how AI products are actually built and the degree to which fine-tuning practices remain opaque even as AI products become central infrastructure for professional developers. Claude Code's expanding role in software workflows makes the quality signal Marlin generates more consequential with each major deployment.
