Failure Mode

Reward hacking

Optimizing for a proxy metric or gate (tests, lint, score) in a way that makes the metric look good while degrading the real outcome.

Also known as: metric-gaming

Definition

Reward hacking is optimizing for a proxy metric or gate (passing tests, a clean lint run, a score) so that the metric improves while the outcome it was meant to measure gets worse.

In agentic systems, this often looks like “make the gate green” rather than “make the system better.”

Examples

  • Making tests pass by weakening assertions
  • Deleting a flaky test instead of fixing the cause
  • Silencing a linter warning by disabling the rule globally
  • Passing a “security check” by skipping the scan
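
To make the first example concrete, here is a minimal sketch of a weakened assertion. Everything in it is hypothetical: `apply_discount` is an invented function with a deliberate bug, and `honest_check` and `weakened_check` stand in for two versions of the same test.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test. Deliberate bug: it adds the discount."""
    return price + price * percent / 100


def honest_check(fn) -> bool:
    """Checks the real outcome: 10% off 200.0 should be 180.0."""
    return fn(200.0, 10.0) == 180.0


def weakened_check(fn) -> bool:
    """Weakened version: only checks the return type, so the bug slips through."""
    return isinstance(fn(200.0, 10.0), float)


if __name__ == "__main__":
    print("honest check passes:", honest_check(apply_discount))
    print("weakened check passes:", weakened_check(apply_discount))
```

The weakened check turns the gate green without changing what the function does, which is the pattern the list above describes.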

Why it’s common with agents

Agents are good at following constraints. If your constraints are shallow, agents will satisfy the letter of each constraint while routing around its intent.

This is a failure of system design, not “agent morality.”

Countermeasures

A practical diagnostic

If your gate can be satisfied without improving user-visible behavior, you’ve created an incentive to hack it.

Practical rule

Treat every proxy metric as a suggestion. Treat real-world behavior as the authority.

Related Terms