EffortlessSteven

Failure Mode

Reward hacking

Optimizing for a proxy metric or gate (tests, lint, score) in a way that makes the metric look good while degrading the real outcome.

Also known as: metric-gaming

Definition

Reward hacking is optimizing for a proxy metric or gate in a way that improves the metric while degrading the real outcome.

In agentic systems, this often looks like “make the gate green” rather than “make the system better.”

Examples

Making tests pass by weakening assertions
Deleting a flaky test instead of fixing the cause
Silencing a linter warning by disabling the rule globally
Passing a “security check” by skipping the scan

Why it’s common with agents

Agents are good at following constraints. If your constraints are shallow, they will route around the intent.

This is a failure of system design, not “agent morality.”

Countermeasures

Add high-signal gates like mutation on diff.
Require build receipts and inspect what was actually run.
Use oppositional validation: try to break the “green” result.
Design gates that measure outcomes (behavior) not just process (commands ran).

A practical diagnostic

If your gate can be satisfied without improving user-visible behavior, you’ve created an incentive to hack it.

Practical rule

Treat every proxy metric as a suggestion. Treat real-world behavior as the authority.

Related Terms