In early 2023 we started experimenting with AI-assisted code review on the team: first a bot commenting on pull requests using GPT-3.5, then more sophisticated integrations, moving through Copilot for Pull Requests, CodeRabbit, and specialized tools like Qodo. Two years in, habits have settled and I have reasonably firm opinions on what works and what doesn’t.
This isn’t a product roundup or a “best tools 2025” list. It’s closer to the balanced account I would have liked to read when we started: where AI in code review adds real value, and where it just adds noise the team has to filter.
What we expected
The initial expectation was that AI would end up replacing the first pass of human review. The idea seemed reasonable: reviewers spend a lot of time pointing out style issues, obvious oversights, and trivial errors. If a tool could cover all of that automatically, human time could go to architecture and judgment.
What happened in practice is partly that and partly something quite different.
What has worked
AI is useful for detecting several very specific patterns:
First, mechanical oversights. Things like a TODO without a reference, an import added but never used, a variable declared and never read, a modified test that no longer runs. All of this shows up in traditional linters, but LLM-based tools have an advantage: they phrase the warning in natural language and with context, making it much harder to ignore than “unused import”. In large teams where linter output has become background noise, this helps.
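To make the category concrete, here is a toy sketch of one such mechanical check, unused imports, written from scratch (it is not how any of the tools named above actually work; real tools then wrap findings like this in a natural-language comment with context):

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Return names that are imported in `source` but never referenced."""
    tree = ast.parse(source)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)

snippet = "import os\nimport json\nprint(json.dumps({}))\n"
print(unused_imports(snippet))  # → ['os']
```

The interesting part isn’t the detection, which linters have done for decades, but that an LLM-based reviewer will say something like “`os` was added in this PR but nothing uses it; was it left over from an earlier approach?”, which is much harder to scroll past.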
Second, inconsistencies between code and comments or documentation. If you change a function but the docstring still says something else, the AI catches it quite reliably. Humans do too, but it’s the classic case where we skip the detail. The tool has infinite patience.
Third, useful summaries for the reviewer. An automatic summary of what the PR does, generated from the diff, is a reasonable entry point. It reduces friction for the reviewer unfamiliar with context, and if the summary doesn’t match what the PR actually does, that’s already a useful signal.
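A minimal sketch of how such a summary request can be assembled (the prompt wording, the model name, and the use of the OpenAI client are my assumptions, not anything a specific tool documents; any chat-completion backend works the same way):

```python
def summary_prompt(diff: str, max_chars: int = 8000) -> str:
    """Build a PR-summary prompt from a raw diff.

    Clipping keeps very large diffs inside the model's context window.
    Asking explicitly for "anything unrelated" is what turns a mismatched
    summary into a useful signal in itself.
    """
    clipped = diff[:max_chars]
    return (
        "Summarize this pull request for a reviewer with no prior context. "
        "List: (1) what it changes, (2) why, as far as the diff shows, and "
        "(3) anything that looks unrelated to the main change.\n\n"
        "DIFF:\n" + clipped
    )

def summarize_diff(diff: str, model: str = "gpt-4o-mini") -> str:
    # Hypothetical wiring; swap in whatever client and model you use.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": summary_prompt(diff)}],
    )
    return resp.choices[0].message.content
```

If the returned summary describes something the PR doesn’t do, that mismatch alone is worth a reviewer comment.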
Fourth, generating suggested tests. Tools that propose test cases for modified code deliver a modest but consistent benefit. They don’t replace human-designed tests, but they catch forgotten edge cases, especially in functions with many parameters.
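As an illustration of the kind of edge cases these tools tend to surface, consider a made-up pagination helper (the function and the suggested cases are mine, not output from any specific tool):

```python
def paginate(items, page=1, per_page=10):
    """Return one page of `items`; pages are 1-indexed."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]

# Edge cases a suggestion tool reliably proposes, and that
# human-written tests for a function like this often skip:
assert paginate([], page=1) == []                                # empty input
assert paginate(list(range(25)), page=3) == list(range(20, 25))  # partial last page
assert paginate(list(range(25)), page=4) == []                   # page past the end
try:
    paginate([1, 2], page=0)
except ValueError:
    pass  # invalid page rejected, not silently clamped
```

None of these are deep, but together they cover exactly the boundaries that get forgotten when a function has several interacting parameters.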
What hasn’t worked
On the flip side, there are several things we’ve stopped expecting from these tools.
Detecting subtle bugs. When a tool says “possible race condition” or “possible null dereference”, the false-positive rate is still very high. At first we triaged every one of these, but we soon found that roughly 80% were impossible in the real context (the function is only called from one place, the structure already guarantees the value isn’t null, and so on). Today we filter out these comments in most cases and only review them when the code is clearly risky.
Architectural judgment. “This function should be a class”, “this module should be elsewhere”, “you should use dependency injection here”. These comments are sometimes technically correct, but without project context they become permanent noise. The AI doesn’t know that the function has a deliberately narrow purpose, or that the module lives where it does for known historical reasons.
Deep security reviews. Tools detect obvious patterns (hardcoded credentials, SQL built by string concatenation, use of functions flagged as unsafe), and that’s valuable. But they don’t replace a real security audit, and we’ve seen cases where a tool approved a diff with a clear problem because the pattern didn’t match what it was trained to detect.
The pattern that emerged
After trying several configurations, the pattern that worked best in our team is this:
AI makes a first automatic pass when the PR opens. It leaves a summary of what changes, flags mechanical oversights, identifies untested paths, and flags any known risk patterns. That first pass arrives before any human reviewer.
The PR author processes that first pass: fixes oversights, answers the tool’s questions, dismisses false positives with a brief note. By the time the human reviewer arrives, the PR is cleaner of noise.
The human reviewer focuses on what matters: architecture, trade-off choices, consistency with the rest of the system, long-term readability. That conversation is now more productive because no energy is spent on mechanics.
An important detail: AI comments don’t have blocking authority in our process. They’re explicit suggestions. No one has to justify ignoring an automated comment. This was important at first because the false-positive rate would have frustrated the team. Over time the tool has improved and the flow has stabilized.
The decisions that made a difference
There are three concrete decisions I think mark the difference between teams that integrate AI well into code review and those that abandon it:
First, not requiring all automated comments to be resolved. If you do, the noise becomes unbearable within months. An automated comment that doesn’t contribute should simply be ignorable.
Second, picking one tool and using it consistently. Having three bots commenting on the same PR is counterproductive: they overlap, contradict each other, and raise cognitive cost for the human reviewer.
Third, periodically reviewing which kinds of comments add value and which don’t, and tuning the tool’s configuration. Almost all of them let you disable rule categories or raise confidence thresholds. Over time we’ve silenced several categories that produced noise and raised the filter on others.
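The category names, thresholds, and comment shape below are invented for illustration; each tool has its own configuration format. But the filtering policy we converged on has roughly this logic:

```python
from dataclasses import dataclass

@dataclass
class BotComment:
    category: str      # e.g. "unused-import", "possible-race-condition"
    confidence: float  # 0.0 - 1.0, as reported by the tool
    body: str

# Categories whose signal-to-noise ratio never justified the interruptions.
SILENCED = {"possible-race-condition", "architecture-suggestion"}
# Categories kept, but only at a raised confidence threshold.
MIN_CONFIDENCE = {"possible-null-deref": 0.9}
DEFAULT_MIN = 0.5

def keep(comment: BotComment) -> bool:
    """Decide whether a bot comment reaches the PR at all."""
    if comment.category in SILENCED:
        return False
    return comment.confidence >= MIN_CONFIDENCE.get(comment.category, DEFAULT_MIN)

comments = [
    BotComment("unused-import", 0.95, "`os` is imported but never used"),
    BotComment("possible-race-condition", 0.60, "shared counter updated without a lock"),
    BotComment("possible-null-deref", 0.70, "`user` may be None here"),
]
print([c.category for c in comments if keep(c)])  # → ['unused-import']
```

The point is that the policy lives with the team, gets revisited, and errs toward silence: a category earns its way back in only when its comments demonstrably change outcomes.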
Looking ahead
What I see coming over the next few months is a deepening of what already works. Tools will get better at the cases where they already add value (summaries, mechanical oversights, test suggestions), and will probably start integrating with debugging tools and incident history, so that a comment might say “this code is similar to what caused last quarter’s incident”.
What I don’t expect, and where optimistic announcements will keep overpromising, is AI replacing human review in architectural decisions or judgment calls. That part remains human work: probably better assisted, but not delegated.
For anyone thinking of adopting these tools today, my advice is pragmatic: start with a single tool, disable rule categories aggressively, don’t block merges on automated comments, and revisit every six months whether the tool contributes more than it costs to maintain. That modest approach is what has left us with a real improvement without wasting time on oversized promises.