That is how long it took an agentic quality system, in a live demonstration at Honeywell HUG Americas in Phoenix this week, to move from detecting a particle count excursion to recommending action. Root cause documented in the quality system within 90 minutes. Part 11 records maintained throughout.
Honeywell invited me to speak at their first Life Sciences-specific HUG during the event's 50th anniversary, and the team built something rare: a room where vendors, practitioners, and consultants sat at the same table in passionate dialogue. Thank you to the Honeywell team who made it happen, to my fellow industry speakers from Sware, Axendia, McKinsey, and Salesforce, and to every practitioner who pulled me aside between sessions. Those hallway conversations carried as much weight as the keynotes.
Here is what I brought home.
A Honeywell leader asked the audience a question I have been circling for two years: is anyone accurately measuring the accuracy of the humans doing these tasks today?
If humans are the benchmark an AI system needs to beat, and we have not measured our human baseline, how do we know when we have achieved "better"?
If we do not fully understand where in our processes we are actually making decisions, then we have a decision quality problem before we have a technology problem.
2026 has put an end to the pilot-phase narrative. A complaint record processing agent is live in production at a major medical device manufacturer. They went straight to production because the agent self-reports when it is uncertain, which inverted the validation conversation: complaints are sampled, not individually reviewed, because confidence is declared record by record.
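To make the sampling idea concrete, here is a minimal sketch of how a declared confidence on each record could drive review selection. It is not the manufacturer's implementation. The function name, the 0.90 confidence floor, the 10% sample rate, and the rule that low-confidence records always go to a human are all assumptions for illustration.

```python
import random


def select_for_review(records, confidence_floor=0.90, sample_rate=0.10, seed=0):
    """Return the IDs of records a human should review.

    records: iterable of (record_id, declared_confidence) pairs.
    confidence_floor and sample_rate are illustrative values, not figures
    from the presentation.
    """
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced later
    review = []
    for record_id, confidence in records:
        if confidence < confidence_floor:
            # Assumption: the agent declared uncertainty, so always review.
            review.append(record_id)
        elif rng.random() < sample_rate:
            # Confident record: spot-check a random sample only.
            review.append(record_id)
    return review


# Hypothetical complaint records with agent-declared confidence.
complaints = [("C-1", 0.99), ("C-2", 0.55), ("C-3", 0.97), ("C-4", 0.80)]
to_review = select_for_review(complaints)
```

The fixed seed is a design choice: a reviewer can rerun the selection and get the same sample, which matters if the sampling plan itself has to be defended.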
The economics explain the urgency. A single complaint costs $50 to $150 to manage, consumes 35 to 40 minutes of direct handling, and takes roughly three hours from categorization through reportability determination. Run that math across the tens of thousands of complaints a large device manufacturer processes in a year and the handling cost alone runs into the millions, before regulatory exposure and patient safety enter the ledger. One presenter shared a deviation-to-closure cycle compressed from 29 days to 8. That is three fewer weeks that a known problem sits open.
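The arithmetic behind those figures is easy to check. The sketch below uses the per-complaint cost and handling time quoted above. The annual volume of 30,000 complaints is an assumed value, chosen only to represent "tens of thousands", and the function names are illustrative.

```python
def annual_handling_cost(complaints_per_year, cost_per_complaint):
    """Total yearly handling cost in dollars."""
    return complaints_per_year * cost_per_complaint


def cycle_compression(days_before, days_after):
    """Fractional reduction in cycle time."""
    return (days_before - days_after) / days_before


volume = 30_000  # assumption: stands in for "tens of thousands" of complaints

low_cost = annual_handling_cost(volume, 50)    # 1,500,000 dollars
high_cost = annual_handling_cost(volume, 150)  # 4,500,000 dollars

low_hours = volume * 35 / 60   # 17,500 hours of direct handling
high_hours = volume * 40 / 60  # 20,000 hours of direct handling

days_saved = 29 - 8                     # 21 days, i.e. three weeks
compression = cycle_compression(29, 8)  # about 0.724

print(f"Handling cost: ${low_cost:,} to ${high_cost:,} per year")
print(f"Direct handling: {low_hours:,.0f} to {high_hours:,.0f} hours per year")
print(f"Cycle compression: {compression:.0%} ({days_saved} days)")
```

At that assumed volume, handling cost alone lands between $1.5 million and $4.5 million a year, and the 29-to-8-day cycle works out to a compression of roughly 72%.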
Now ask the harder question. Were the decisions better?
A 72% cycle compression tells you the system moved faster. It tells you nothing about whether the root cause was right, whether the reportability call would survive an investigator's review, or whether the corrective action prevents recurrence. Faster closure with the same decision quality is an efficiency gain. Faster closure with worse decision quality is a liability accelerating. And here is the uncomfortable part: most organizations cannot tell you which one they got, because they never measured the accuracy of the human decisions they just automated. The economic case for AI in quality is already made. The decision quality case has to be made deliberately, with a documented baseline, declared reasoning, and outcomes tracked past closure. Speed is what the vendor demonstrates. Defensibility is what the regulator examines. Build for the second and the first comes with it.
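A documented baseline can start very small. The sketch below assumes you have a set of past records where the correct call was later adjudicated, for example by an expert panel or by the outcome after closure. It scores the human decisions and the agent decisions against that same reference. All data here is invented for illustration, and the labels and function name are hypothetical.

```python
def accuracy(decisions, reference):
    """Fraction of decisions that match the adjudicated reference."""
    if not reference or len(decisions) != len(reference):
        raise ValueError("decisions and reference must be non-empty and the same length")
    matches = sum(d == r for d, r in zip(decisions, reference))
    return matches / len(reference)


# Hypothetical reportability calls on the same five complaints.
reference = ["report", "no_report", "report", "no_report", "report"]     # adjudicated
human     = ["report", "no_report", "no_report", "no_report", "report"]  # original handlers
agent     = ["report", "no_report", "report", "report", "report"]        # agent output

human_baseline = accuracy(human, reference)  # 4 of 5 correct
agent_score = accuracy(agent, reference)     # 4 of 5 correct

print(f"Human baseline: {human_baseline:.0%}, agent: {agent_score:.0%}")
```

In this invented sample both score 80%, which would make faster closure a pure efficiency gain. Without the human number, the agent number has nothing to be compared against.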
And the build-versus-buy question flipped in a single year. 2025 was build everything internally. 2026 is buy, because the cost and the intellectual capital required no longer support internal builds for most organizations.
This is old news. Pharma has been battling the same question for years: what is all this data that we have collected? AI errors today are less about hallucination and more about misreading data the system was never prepared to understand.
The numbers behind that statement are uncomfortable. Roughly 70% of enterprise data sits silent, inaccessible or unused. The average life sciences company runs over 1,000 applications (paraphrased from the data discussed with the group). Preparing data for AI is becoming the validation work of this decade, and the organizations treating it as an IT chore will discover that too late.
One discipline stood out: deliver value in parallel with building the data foundation. The work itself reveals which data deserves the investment.
Annex 22 in its current draft permits only static AI for GxP purposes. A forum is planned for the end of June, and a revision and finalization will eventually follow. Organizations that are waiting for regulatory certainty before building decision discipline have the sequence backwards. The discipline is what makes you ready for whichever way the guidance lands.
Three patterns from the hallway conversations, every one of them a governance gap rather than a technology gap.
Procurement signs software deals before quality completes any serious review. Organizational change management and training go chronically unfunded; programs launch with no budgeted training teams, and the same gap repeats company after company. And vendor acquisitions, along with the entry of technology companies into core GxP applications, are creating contract and regulatory risk that quality teams were never trained to evaluate.
Practitioners are looking for software vendors who can help them reach the goals already set for them.
AI is becoming the efficiency layer of quality. That part is settled. What remains unsettled is governance: whether the decisions inside these systems are conscious, defensible, and continuous, and whether anyone can prove it under scrutiny.
There is a second thread I carried home from Phoenix, about what differentiates organizations once efficiency is table stakes. That argument is still forming, and I am saving it for a keynote later this month.
For now, the verdict from Phoenix is simple. The agents are in production. The human baseline is undocumented. Closing that gap is the most consequential quality decision on the table this year.
Thank you, Honeywell, for the stage and the company.
If your organization is benchmarking AI against a baseline you have never documented, that is the conversation to have first.👇👇👇