Making GenAI code review actually useful
Last month I ran a single Claude reviewer agent against a roughly 1,000-line C++ change (yes, that's too big for a normal review workflow, but we are in the GenAI era and norms are different): some fairly important video pipeline updates, nothing exotic. The review came back with 14 findings. Three were real. The rest included a phantom race condition in code that runs on a single thread, two style nitpicks elevated to High severity, and a complaint about missing error handling on a function that already returns std::expected. The signal-to-noise ratio was bad enough that I almost closed the tab.
This is the dirty secret of GenAI-assisted code review. The models are good enough to spot real issues, sometimes ones a tired human would miss. But they also hallucinate problems with enough confidence to waste your time, and the false positives are not random. They cluster around the same blind spots every run because they come from the same weights, the same training distribution, the same biases baked into one model family.
I wrote about the false positive problem briefly in my earlier piece on development processes in the GenAI era [1], but I didn't have a concrete solution at the time. Now I do, and it's been running in my workflow for a few weeks. The short version: stop asking one agent. Ask three, and only keep what two of them agree on.
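The consensus step is simple enough to sketch. This is a minimal illustration, not my actual tooling: the `Finding` shape and the `run_reviewer` stub are hypothetical stand-ins for however you call your agents and normalize their output; the point is the quorum filter, which keeps only findings reported by at least two of the three independent reviewers.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """A normalized review finding. Frozen so it is hashable and
    identical findings from different reviewers compare equal."""
    file: str
    line: int
    category: str  # e.g. "race-condition", "error-handling"

def consensus(reviews: list[list[Finding]], quorum: int = 2) -> list[Finding]:
    """Keep findings reported by at least `quorum` independent reviewers."""
    counts: Counter[Finding] = Counter()
    for review in reviews:
        # Dedupe within a single reviewer so one agent can't vote twice.
        for finding in set(review):
            counts[finding] += 1
    return [f for f, votes in counts.items() if votes >= quorum]

# Simulated output from three reviewer runs (stand-ins for real agent calls).
real_bug = Finding("pipeline.cpp", 212, "use-after-move")
nitpick = Finding("pipeline.cpp", 90, "naming")
phantom = Finding("decoder.cpp", 44, "race-condition")

reviews = [
    [real_bug, nitpick],           # reviewer A
    [real_bug, phantom],           # reviewer B
    [real_bug, nitpick, nitpick],  # reviewer C (repeats itself; deduped)
]
kept = consensus(reviews)
# real_bug has 3 votes, nitpick has 2, phantom has 1: only the first two survive.
```

The key practical detail is normalization: two agents rarely phrase a finding identically, so you need to reduce each one to a comparable key (here file, line, and category) before votes can be counted. In practice I let a cheap model do that mapping rather than string-matching raw review text.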