April 13, 2026
Shipping an AI Agent You Can Trust in Production
Field notes from building a Salesforce opportunity creation agent with a colleague. The model wasn't the hard part.
Earlier this year I started working with Ariel Aziernicki on a Salesforce opportunity creation agent for a client’s sales team. The brief was simple. Sales reps didn’t want to do data entry, the workflow was generating a meaningful percentage of bad records, and leadership wanted an agent that could take a natural language description of an opportunity and turn it into a clean Salesforce record on the rep’s behalf.
Here’s a recap of what the work actually involved: validation, evaluation, audit logging, integration coupling, and the handoff to a real user.
What we built
The agent runs on Vertex AI, using Gemini models with an 80k context window. It accepts a natural language description from a sales rep, parses out the fields Salesforce needs (account, contact, dates, amounts, stage), validates them, creates or links a primary contact if one isn’t already on the account, and writes the opportunity back to Salesforce. The rep reviews and confirms inside their existing workflow.
Ariel led most of the implementation. I worked alongside on the design, the prompt structure, the evaluation pipeline, and the production decisions.
Phase 1 was the unglamorous work
The first phase was about getting the agent to behave consistently across the kinds of inputs sales reps actually write.
A few things kept tripping us up. Date relativity (“end of the quarter,” “next Tuesday”) needed reinforced validation logic so the agent didn’t quietly resolve to the wrong date. Contact-and-company relationships were inconsistent in ways the LLM would smooth over instead of flag. Field validation on currency, stage values, and account references needed to be deterministic, not LLM-judged. Ariel built that scaffolding around the model so the agent could be wrong in known ways instead of confidently wrong in unpredictable ones.
That’s most of what Phase 1 was: building the boundary that turns a clever model into something a rep can actually use.
The model choice that saved money
We didn’t run everything on the most expensive model. Gemini Flash handled the bulk of the workflow, fast and cheap. We reserved heavier Gemini variants for the parts where the reasoning had to be sharper.
The discipline is asking, per use case, which model is right for the job. Cost optimization isn’t an afterthought. It’s a production decision you make at design time.
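The routing itself can be as simple as a lookup table. This sketch uses illustrative model and task names, not the client's actual configuration; the design choice that matters is defaulting to the cheap model and escalating only for tasks explicitly judged to need sharper reasoning.

```python
# Illustrative model identifiers; the real deployment used specific Gemini variants.
FLASH = "gemini-flash"  # fast, cheap: bulk extraction and formatting
PRO = "gemini-pro"      # heavier: ambiguous or multi-step reasoning

# Hypothetical task labels for this workflow.
ROUTES = {
    "extract_fields": FLASH,
    "format_record": FLASH,
    "disambiguate_contact": PRO,  # conflicting contact/company signals
    "resolve_ambiguous_date": PRO,
}

def pick_model(task: str) -> str:
    # Default to the cheap model; escalation is an explicit, reviewed decision.
    return ROUTES.get(task, FLASH)
```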
Feedback loops from day one
Every interaction got tagged with a session ID and logged to BigQuery. The Google ADK framework makes this part of the setup straightforward, with built-in support for tracing agent runs and persisting evaluation data without writing custom plumbing. That gave us a queryable history of what the agent was asked, what it returned, what got corrected, and where the patterns of failure clustered.
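A sketch of the shape of one logged interaction. The real pipeline persisted these through the ADK's tracing support into BigQuery; the field names here are illustrative stand-ins for that schema.

```python
import datetime
import json
import uuid

def make_log_row(session_id: str, user: str, request: str,
                 response: dict, corrected: bool) -> dict:
    """Build one per-interaction row: who asked what, what the agent
    returned, and whether the rep corrected it."""
    return {
        "session_id": session_id,
        "event_id": str(uuid.uuid4()),
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "request_text": request,
        "response": json.dumps(response),
        "was_corrected": corrected,  # flipped later if the rep edits the record
    }
```

Because every row carries the session ID, "where do failures cluster" becomes a GROUP BY rather than a guess.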
Weekly we’d pull a sample, look at the misses, and decide what to tune: tighten a validation rule, adjust the prompt, or note a class of input the agent shouldn’t be handling yet and route it to a human.
The point is that we never had to guess how the agent was performing. The data was there from day one because we designed it in.
Audit logging and impersonation
The biggest production blocker wasn’t the model. It was user impersonation and audit logging.
When the agent writes an opportunity to Salesforce on behalf of a rep, whose name is on the record? If it’s the agent’s service account, the rep loses ownership and the audit trail breaks. If it’s the rep’s account, you need to handle authentication and impersonation safely. Either way, every action the agent takes needs to be logged in a way the client can inspect later, because trust at the leadership layer requires being able to answer “what did the agent do, when, and on whose behalf?”
We solved it by impersonating the rep through proper authentication and writing every action to an audit table alongside the Salesforce write. That work took longer than the agent itself. It was also the thing that let us ship.
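A minimal sketch of that pairing: the Salesforce write happens under the rep's identity, and the audit row is recorded in the same step. `sf_client` and `audit_table` are hypothetical stand-ins for the real Salesforce and audit-store clients, not actual library APIs.

```python
import datetime

def create_opportunity(sf_client, audit_table, rep_id: str, fields: dict) -> str:
    """Create the opportunity as the rep and log the action to the audit
    table alongside the write, so both questions — who owns the record,
    and what did the agent do on whose behalf — stay answerable."""
    # Impersonation: the write runs under the rep's identity, so record
    # ownership and Salesforce's own audit trail stay intact.
    record_id = sf_client.create("Opportunity", fields, as_user=rep_id)
    audit_table.insert({
        "actor": "opportunity-agent",
        "on_behalf_of": rep_id,
        "action": "create_opportunity",
        "record_id": record_id,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "payload": fields,
    })
    return record_id
```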
Validation tests at the boundary
A pattern that kept proving itself: catch errors at the input boundary, not inside the model.
On a related workflow we were dealing with copy-paste errors on Lytics campaign IDs. Reps would paste an ID that was three characters short or had a trailing space, and downstream systems would silently misroute. We added a length validator and a format check at the input layer. The agent never had to reason about it. The bad input never made it past the door.
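The check itself is a few lines. The exact ID format is an assumption here (twelve alphanumerics is illustrative, not the real Lytics spec); what matters is that a strict length-and-format gate sits at the input layer.

```python
import re

# Assumed format for illustration: a fixed-length alphanumeric ID.
CAMPAIGN_ID_RE = re.compile(r"[A-Za-z0-9]{12}")

def validate_campaign_id(raw: str) -> str:
    """Normalize and check a pasted campaign ID before the agent sees it."""
    cleaned = raw.strip()  # trailing spaces are a common copy-paste artifact
    if not CAMPAIGN_ID_RE.fullmatch(cleaned):
        raise ValueError(f"campaign ID failed format check: {raw!r}")
    return cleaned
```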
The same principle applied across the Salesforce agent. Validate currency formats, date strings, account IDs, and required field combinations before the prompt ever runs. The model is for handling the genuinely ambiguous parts of natural language. Everything that can be checked deterministically should be.
Phase 2 and the integration coupling problem
Phase 2 was the quotes workflow, which depended on a Salesforce integration with a separate approval gate. Before either of us wrote a line of code, we did a UX walkthrough with someone on the client’s side who actually used the workflow.
That conversation changed the design. The integration didn’t expose the approval state in a way the agent could safely act on, which meant we needed to either route around it or wait for an approval before doing the next step. We landed on a design that respected the gate instead of trying to skip it.
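"Respecting the gate" reduces to a small state check the agent consults before acting. The states and step names below are illustrative, not the client's actual workflow; the design choice is that anything short of an explicit approval means the agent waits rather than guesses.

```python
from enum import Enum

class Approval(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

def next_step(approval: Approval) -> str:
    """Gate the quotes workflow on the integration's approval state."""
    if approval is Approval.APPROVED:
        return "generate_quote"
    if approval is Approval.REJECTED:
        return "notify_rep"
    return "wait"  # pending or unknown: do nothing rather than skip the gate
```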
The lesson is that integration work is where agent projects either ship or stall. The model is portable. The integration coupling is the thing that locks you into how someone else’s tool was designed. A walkthrough before coding saves weeks of refactoring later.
The champion rep pilot
We didn’t roll the agent out to twenty reps at once. We started with one. A champion who was willing to use it daily, give us feedback, and tolerate the friction of being first. That rep’s experience shaped the next version more than any internal review could have.
For adoption, we prioritized building relationships over just deploying. The handoff to one person who actually uses it is worth more than a formal launch to a team that doesn’t.
What I’d carry forward
Most of what made this work wasn’t technical. The model is a commodity now. What’s not commodity is the design discipline around it: where you put validation, how you log for evaluation, who the agent acts on behalf of, when you respect an integration’s constraints versus working around them, and who the first user is.
The other thing I’d carry forward is the collaboration. Working with Ariel made the agent better than either of us would have built alone. The best AI agent work I’ve been part of has been pair work, not solo work.
If you’re building an agent for production, the model is the easy part. The handoff, the audit trail, the boundary validation, the champion rep, and the colleague you’re building with are the project.