I Rated My AI-Enabled Team. The Score Was 2.87.
Access to AI tools looked like progress. The score showed us where process and practice were still missing.
I rated my AI-enabled team. The average was 2.87 out of five.
That number is not a flex and not a punchline. It is a calibration.
I wrote a book on AI adoption. I have been shaping our tooling and setting expectations for more than two years. If anyone on the team should be able to wish proficiency into existence by proximity, it is me. The score still came back under three.
That tells you something useful about how capability actually forms.
Why "we have AI tools" is not the same as "we are good at this"
Most teams treat access as proof. Licenses, plugins, and a chat window in the IDE become the signal that adoption is done.
It does not work that way. The failure mode is quiet. People use AI like a faster autocomplete: finish the line, paste the snippet, accept the refactor, and move on without a verification habit.
Work still ships. Metrics still move. Nobody notices the drag until you measure something sharper than throughput.
Studies continue to show an uncomfortable pattern: developers with AI tools can end up slower when adoption stays shallow. The gap between having the tools and scoring well on them is a symptom. The cause is missing structure around when to use the tool, how to check output, and what done means when a model helped write the code.
The better model: proficiency is a system, not a perk
Proximity to expertise does not transfer skill. A leader who wrote the book is not a substitute for shared standards, repeated practice, and feedback loops.
Instead of thinking in tools, think in layers.
Layer one is hygiene: context packaged, prompts documented, and diffs reviewed like any other code. Layer two is judgment: knowing when generation is cheaper than thinking and when it is expensive. Layer three is integration: AI steps live inside real gates like tests, review, and release.
Chapter 1.2 in my book makes this argument directly: giving engineers AI without guidance creates enthusiasm, not reliability. Chapter 5 covers what to build after you accept that tools do not onboard themselves.
What I changed after seeing 2.87
I stopped assuming the stack would teach the team. The sequence below is what actually moved us forward.
- Name the rubric. We scored on a simple 1-5 scale with plain language for each band. If people cannot recognize themselves in the levels, the exercise is theater.
- Make it observable. A rating is useless if it is only vibes. We tied levels to behaviors: how you scope a task for the model, how you verify output, and how you recover when the first answer is wrong. A minimal scoring sketch follows this list.
- Protect setup time. The expensive part is often not writing code. It is environment setup, permissions, repository orientation, and creating a repeatable session baseline. Most teams get this wrong.
- Teach on real work. Weekly applied sessions on actual tickets beat slide decks about prompt tips. The goal is repetition on your stack and your constraints.
- Coach the tail. Anyone at level 3 or below gets focused support. This is not punishment. It is how you keep one performance bar instead of two teams inside one team.
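To make the rubric concrete, here is a minimal sketch in Python of how a team could track it. The band descriptions, names, and scores are hypothetical, not our actual rubric wording or data; the point is that each level maps to an observable behavior and that the summary numbers fall out of that mapping.

```python
from statistics import mean, stdev

# Hypothetical 1-5 bands, each phrased as an observable behavior.
RUBRIC = {
    1: "Pastes model output without reading the diff",
    2: "Reads the diff, but has no verification habit",
    3: "Scopes tasks for the model and checks output against tests",
    4: "Knows when generation is cheaper than thinking, and when it is not",
    5: "Runs AI steps inside real gates: tests, review, release",
}

# Made-up scores for illustration only.
scores = {"dev_a": 2, "dev_b": 3, "dev_c": 4, "dev_d": 2, "dev_e": 3}

avg = mean(scores.values())     # the headline number (2.80 for these made-up scores)
floor = min(scores.values())    # a higher floor means raising this
spread = stdev(scores.values()) # a narrower spread means shrinking this
tail = [name for name, s in scores.items() if s <= 3]  # who gets focused coaching

print(f"avg={avg:.2f} floor={floor} spread={spread:.2f} tail={tail}")
```

The useful outputs are the floor, the spread, and the tail, not the average alone; the average is the headline, the other three tell you where to coach.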
You can run a lighter version of this model. The point is not ceremony. The point is that capability compounds from repeated, specific practice tied to your delivery path.
What good looks like
Good is not everyone scoring five. Good is a narrower spread and a higher floor.
You see it in review quality: fewer blind spots and fewer mystery commits. You see it in incident response: engineers reach for checks instead of arguments. You see it in onboarding: new hires inherit patterns, not folklore.
The number I want next time is not vanity. It is evidence that our defaults match our risk.
Tools scale instantly. Habits scale slowly. If you only buy the first, expect a gap between the story and the score.
One question
If you ran the same rubric on your team tomorrow, where would the average land, and which level would describe most of your real pull requests?
For related field notes, browse the blog archive.