Evals First, Agents Second: A Field Guide for Enterprises
Why evaluation comes before the agent: how we measure what "good" looks like when it comes to AI for product intelligence, and how we meet your organization wherever you are in the transformation journey.
Before an enterprise can trust an AI agent to handle its customers’ product queries, it has to answer this question: how do you actually know the agent is any good?
In our experience this is a non-trivial problem, open-ended in scope and easy to underestimate. We see it play out with companies at every stage of their AI journey: teams that have little more than a pile of datasheets, and equally teams that have already drafted internal evaluations and lined up a dozen vendors to compare. In both cases the conversation keeps returning to the same question: how should “good” be measured?
It is not surprising that the industry is still figuring out how to measure “good” for product intelligence AI (or even document intelligence AI). Companies have spent decades building the muscle to hire, train, and evaluate human product experts through interviews, ramp plans, certifications, and performance reviews. The equivalent playbook for preparing and evaluating AI agents is much newer, and most organizations are still developing it.
Many companies reduce this to a simple “model choice” or “let’s deploy a RAG” problem and get disappointing results in production. It is better treated as a strategic problem that calls for clear business goals, content preparation, evaluation, AI tuning, and a maintenance approach. A holistic approach of this kind gives the semantic transformation a much better chance of not only surviving but thriving in production.
The sections that follow describe how we approach this work, the high-level techniques we use, and the challenges we see. We also include a self-assessment to help organizations identify where they currently stand.
AI for Electronics Products
Electronics is a content-heavy industry. A typical component manufacturer or distributor may hold product information in any number of systems and formats, including:
- Catalog pages designed for human browsing
- Datasheets that are PDFs with deeply nested tables and figures
- Application notes that combine prose with reference designs
- FAQ pages
- Institutional knowledge locked inside senior sales engineers and product managers
- A PIM with normalized SKU records
For an AI agent to be a truly credible product expert, it has to draw from all available product knowledge. This often benefits first from a transformation in how the company stores and exposes its product knowledge — moving from documents and people as the primary interface to a knowledge surface that an AI agent can search, compare, cite, and answer from on behalf of customers and internal users. But that is the subject of another blog post. Assuming content is well organized, the rest of the transformation has several moving parts.
- On the content side, source systems have to be inventoried, human-oriented material has to be converted into machine-reasonable representations (structured parameter records for datasheet tables, sectioned PDFs that preserve text/table/figure relationships, externalized institutional knowledge), and there has to be a sustainable process for keeping that content fresh.
- On the agent side, retrieval, reasoning, citation, and grounded answer composition all have to be designed for the domain. (See our posts on accuracy in technical sales and the agent search problem for more detail on the underlying pipeline.)
Around both sides sit the operating concerns: who owns the content, how new products are onboarded, how regulated or confidential information is handled, and how end users (customers, sales engineers, support) actually interact with the agent.
The depth of semantics involved in this domain is worth a closer look. A datasheet for a single MOSFET can list hundreds of electrical, thermal, and mechanical parameters, many of them conditional on operating point. Parts that look interchangeable in a parametric search can differ in subtle but important ways: package thermal resistance, switching behavior under specific load conditions, qualification status for automotive or industrial use. Sales engineers spend years learning to navigate these distinctions, and an AI agent has to learn them as well, from the content and signals the enterprise is able to provide.
Figure: illustrative datasheet excerpt for the XN‑100R6. This is not a real device and the values are invented; a production datasheet runs to hundreds of such parameters.
† Same part, different answers: the values in the excerpt hold only under the stated conditions. Drop VGS to 6 V or change the gate resistor and the numbers move. Two parts that match on a parametric search can still diverge on thermal path, switching behavior, and qualification status.
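One way to picture a "structured parameter record" that keeps a value tied to its test conditions is the minimal sketch below. It is an illustration under stated assumptions, not a description of any production schema: the `ParameterValue` class, the `lookup` helper, the condition keys (`vgs_v`, `tj_c`), and every number are hypothetical, and the values are invented for the fictional XN‑100R6.

```python
from dataclasses import dataclass, field


@dataclass
class ParameterValue:
    """One datasheet value plus the test conditions it was specified under."""
    name: str
    value: float
    unit: str
    conditions: dict = field(default_factory=dict)


def lookup(records, name, **conditions):
    """Return the records for `name` whose stated conditions match every requested condition."""
    return [
        r for r in records
        if r.name == name
        and all(r.conditions.get(key) == wanted for key, wanted in conditions.items())
    ]


# Invented values for the fictional XN-100R6; not taken from any real datasheet.
records = [
    ParameterValue("RDS(on)", 6.0, "mOhm", {"vgs_v": 10.0, "tj_c": 25}),
    ParameterValue("RDS(on)", 9.5, "mOhm", {"vgs_v": 6.0, "tj_c": 25}),
]

at_10v = lookup(records, "RDS(on)", vgs_v=10.0)
# No record exists for this condition, so the result is an empty list.
at_4v5 = lookup(records, "RDS(on)", vgs_v=4.5)
```

The design choice worth noting is that a lookup for an unspecified condition returns nothing rather than the nearest value, so a caller has to decide explicitly whether to decline, interpolate, or cite a different condition.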
Each of these moving parts — content, agent, operations, domain semantics — introduces choices, and each choice needs to be measured against the queries the enterprise actually wants the agent to answer well. That measurement is the role of the evaluation system, which the rest of this post focuses on.
Where are you starting from?
Before we talk about metrics and harnesses, it helps to locate where your organization is starting from. The questions below are the ones we actually walk through on a first call. They are not a test, and there is no passing score. Instead, your readiness becomes an input to how we engage. Wherever you land, our evaluations team picks up the work from there.
Here’s a questionnaire that can help you think through it.
A few related questions tend to come up naturally in the same conversation — whether you care more about answer depth or response latency, whether you want broad coverage or excellence on a focused set of high-value queries, and how you want the agent to behave when it is not confident. There are no universally right answers to these; they are simply choices we make explicit before the evaluation begins.
Evaluating with and without ground truth
To evaluate a query and our agent’s response we need a baseline:
- correct compared to what?
- complete in what respects?
- constrained in what way?
These baseline references are our evaluation ground truth: a set of assertions and expectations that collectively establish whether a response is correct, high quality, complete, and within bounds.
How much of this query and assertion dataset exists when we start shapes our engagement.
In practice, customers arrive at one of three starting points:
| Starting point | What they have | How we approach it |
|---|---|---|
| Queries and ground truth ready | Organized content in scope, known top queries, validated answers | Score directly against the customer-provided ground truth; spend our time on depth and edge cases |
| Partly ready | Rich but messy content, a rough query set, SMEs who can validate | Normalize by de-duplicating and wrangling content into a consolidated set, co-create the query set and ground truth, then evaluate against it |
| Greenfield | Sparse documentation, institutional knowledge, a need for guidance on how to proceed | Understand customer products, organize and ingest content, generate candidate queries and answers from the corpus, then validate with SMEs |
Most organizations are a blend, and the answers from the assessment above tell us the mix. Whichever the starting point, the evaluation runs in one of two modes.
With a query dataset
The organization has already curated the queries it wants to qualify agents against.
Queries come with ground truth responses
- We compare the agent’s response directly against provided or validated assertions.
- Scoring is objective and repeatable, so iteration is fast and regressions are easy to catch.
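A deliberately simplified sketch of scoring a response against ground-truth assertions follows. It assumes each assertion can be checked as a case-insensitive substring, which is far cruder than what a real evaluation would use (semantic matching or a calibrated judge); the `score_response` function and the sample response and assertions are hypothetical.

```python
def score_response(response: str, assertions: list) -> float:
    """Fraction of ground-truth assertions found in the response.

    Assumption: an assertion "holds" if it appears as a case-insensitive
    substring of the response. Real scoring would need something stronger.
    """
    if not assertions:
        raise ValueError("no assertions to score against")
    text = response.lower()
    hits = sum(1 for assertion in assertions if assertion.lower() in text)
    return hits / len(assertions)


# Hypothetical response and assertions, for illustration only.
response = "The XN-100R6 is rated for 100 V and ships in a TO-220 package."
assertions = ["100 V", "TO-220", "AEC-Q101"]
score = score_response(response, assertions)  # two of the three assertions are present
```

Because the check is deterministic, re-running it on the same response always yields the same score, which is what makes this mode repeatable and regressions easy to spot.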
Queries come without ground truth
- We use a combination of three signals:
  - Evaluating our answers against citation claims and faithfulness
  - Comparing our answers to answers from a council of frontier models plus web search
  - Reviews from our domain-aware customer success team
- We cross-reference the three signals and generate a composite score for the agent’s responses.
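To make the idea of a composite score concrete, here is a minimal sketch that combines the three signals as a weighted mean. The post does not say how the signals are actually combined, so the weighting scheme, the weight values, the signal names, and the handling of a missing signal are all assumptions made for illustration.

```python
# Hypothetical weights; the real combination rule is not described in this post.
WEIGHTS = {"faithfulness": 0.4, "model_council": 0.3, "human_review": 0.3}


def composite_score(signals: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean over the signals that are present (each expected in [0, 1]).

    A signal set to None (for example, no human review yet) is skipped and the
    remaining weights are renormalized.
    """
    present = {name: value for name, value in signals.items() if value is not None}
    if not present:
        raise ValueError("at least one signal is required")
    total_weight = sum(weights[name] for name in present)
    return sum(weights[name] * value for name, value in present.items()) / total_weight


full = composite_score({"faithfulness": 0.9, "model_council": 0.8, "human_review": 1.0})
partial = composite_score({"faithfulness": 0.9, "model_council": 0.8, "human_review": None})
```

Renormalizing over the available signals keeps scores comparable between queries that have been human-reviewed and those that have not, at the cost of hiding how much evidence sits behind each number; tracking the count of present signals alongside the score is one way to offset that.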
Without a query dataset
- We generate the evaluation set ourselves: we synthesize candidate queries from the corpus, draft answers, and route them to SMEs for sign-off.
- The catch: generated queries can inherit gaps or bias from the source content, and an LLM judge has to be calibrated against human judgment before its scores can be trusted.
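One common way to check an LLM judge against human judgment is to measure chance-corrected agreement on a shared sample, for example with Cohen's kappa. The sketch below is a generic illustration of that statistic, not a description of the calibration procedure used here; the labels are made up.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on eight answers, for illustration only.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(judge, human)
```

Raw percent agreement can look high simply because most answers pass; kappa discounts that, which is why it is a more cautious signal for deciding whether a judge's scores can stand in for expert review.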
The building blocks of an evaluation
Whatever the starting point and mode, every Rapidflare evaluation is assembled from the same set of components.
flowchart TB
A["Query Set"]
B["Ground Truth"]
C["Run Agents"]
D["Score:\nRetrieval + Answer Quality"]
E["Judge Calibration\n(LLM + Human)"]
F["Scoreboard"]
H["Quality Baseline"]
I["Launch Ready Scorecard"]
J["Ongoing Regression Tracking"]
A --> C
A -.->|"generate if absent"| B
B --> D
C --> D
D --> E --> F --> H
H -.->|"tuning"| C
H --> I --> J
classDef input fill:#f5f3ff,stroke:#7c3aed,color:#111827,stroke-width:1.4px
classDef step fill:#eef2ff,stroke:#2563eb,color:#111827,stroke-width:1.4px
classDef output fill:#ecfeff,stroke:#22d3ee,color:#111827,stroke-width:1.4px
classDef ongoing fill:#f0fdf4,stroke:#16a34a,color:#111827,stroke-width:1.4px
class A,B input
class C,D,E step
class F,H,I output
class J ongoing
- Query set construction. We assemble the queries the agent must handle well from multiple sources: real queries mined from support, search, CRM and email logs; queries synthesized from the corpus; and queries supplied by your SMEs.
- Ground-truth creation and validation. For queries without a confirmed answer, we draft candidates and run them through an SME sign-off loop until the reference set is trusted.
- Scoring. We score answer quality along several dimensions, such as correctness, relevance, faithfulness, and completeness.
- Judge calibration. Where an LLM does the scoring, we calibrate it against human reviewers so the automated scores track what your experts would say.
- Tuning. In our experience, no agent reaches 95% or higher quality without tuning that is specific to the domain and to the customer’s goals. This is the fun part of taking our initial agent from good to great! Working closely with SMEs, we collect qualitative notes about human or business preferences and adjust our agentic system to match those requirements.
- Regression tracking. Once the baseline above is achieved, we are in good shape to track regressions on an ongoing basis. Your content, our agent, and the underlying models all change over time. If meeting the quality bar once sounds involved, maintaining it over time is a daunting yet rewarding endeavor. Our original quality dataset is augmented over time with data from ongoing agent sessions. Using this real-world dataset, we re-run evaluations on a cadence to catch quality regressions before your users do.
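The regression-tracking step can be sketched as a comparison of per-query scores between a stored baseline and the latest run. This is a minimal illustration under assumptions: the `find_regressions` function, the 0.05 tolerance, and the query IDs and scores are all hypothetical.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {query_id: score drop} for queries whose score fell by more than `tolerance`.

    Queries missing from the current run are skipped here; a fuller harness
    would likely report them separately.
    """
    regressions = {}
    for query_id, baseline_score in baseline.items():
        current_score = current.get(query_id)
        if current_score is None:
            continue
        drop = baseline_score - current_score
        if drop > tolerance:
            regressions[query_id] = round(drop, 4)
    return regressions


# Made-up scores for illustration only.
baseline_run = {"q1": 0.90, "q2": 0.80, "q3": 1.00}
latest_run = {"q1": 0.92, "q2": 0.60, "q3": 0.97}
flagged = find_regressions(baseline_run, latest_run)
```

A tolerance band avoids flagging small score movements that come from judge noise rather than a real change in the content, the agent, or the underlying model.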
How we work with you through the process
We treat this evaluation and qualification work as a shared effort rather than a black box. Our customers appreciate this, as it is an opportunity to set mutually agreed goals, identify and work through challenges, and reach launch readiness based on quantitative, trusted experiments. Our rhythm looks like this:
- Kickoff. We start by aligning on where you are today and scoping the evaluation together, so everyone agrees on what we’re measuring and why before any work begins.
- Open by default. Nothing is hidden behind a curtain. You have visibility into the query set, the scoring rubric, the interim scoreboards, and example traces with their supporting citations throughout.
- Regular readouts. We review results together on a set cadence, and when it is unclear what counts as a “right” answer, we work through it with your subject matter experts rather than deciding unilaterally. Indeed, we’ve had many cases where SMEs felt that a query in the eval set was too convoluted to be realistic.
- Evidence, not assertions. Every scorecard traces back to the underlying evaluation experiments and datasets, so you can audit how a number was reached instead of taking our word for it.
Problems we tend to see
As we do these evaluations, we see a few interesting patterns come up frequently.
- The source is wrong, not the AI. Conflicting or stale content surfaces as “the AI is wrong” when the underlying document is the problem.
- Generating ground truth is harder than it appears. The domain is complex, and establishing ground truth is difficult. Everything is easy when queries have black-and-white answers, but real-world queries are far more nuanced and often have no single correct response. We’ve had cases where a veteran with multiple decades in the industry had trouble judging whether answer A was of higher quality than answer B.
- Expert-provided answers can be incorrect. We’ve had cases where the customer-provided answers were partially or wholly incorrect. This speaks to the complexity of the domain: unless something prompts an expert to revisit it, incorrect information or an outdated assumption can persist.
- Boiling the ocean. Trying to cover every query or every product area at once means the evaluation takes a long time to converge. We recommend starting simple and then expanding.
- Unspoken trade-offs. Answer quality and low latency are often at odds with each other. We have an explicit discussion about this at the beginning of the engagement so that expectations are set appropriately. Expectations are sometimes off because we have all been spoiled by Google Search, which returns near-instant answers that are shallow.
- Our AI can always be better. No two customers are the same, and we find that the last-mile value is often gained by carefully understanding the customer’s domain and needs and tuning our AI accordingly.
Wrapping up
We hope this post sheds some light on what readiness for agentic transformation looks like. It is a continuous spectrum and a good evaluation is what turns “where we are” into a plan you can actually run with.
A few things we keep coming back to on every engagement:
- The smoothest deployments start with a mutual assessment of where you are today. Developing that clarity makes the work ahead more predictable and more effective.
- Evaluation isn’t something we bolt on at the end (nor should you). For us, it’s the spine of the whole engagement.
- When we build the measurement system first, the agent has something concrete to be held to, and that quality bar keeps paying off over time.
Wherever you’re starting from, our evaluations team can pick it up from there. Talk to us about your starting point.
This post is the first in a series. We’ll follow up with posts that go deeper on our eval framework, how we generate and validate ground truth, how we set up the system for efficient regression testing, and what more complex evals look like when the job is product selection rather than straight Q&A. Stay tuned!
About the author
Engineering leader and technologist. Prior to Rapidflare, he spent ~11 years at Netflix working on developer platforms and experience at mission-critical scale, including leadership of teams in Platform and Edge (API) Engineering. He got his start in embedded systems at Xilinx — compilers, RTOS, SoC design tools, and TCP/IP stacks — and holds three academic publications and a US patent (US 8,447,957) in coprocessor interface architecture.