Evaluating Generative AI Models Using Gen AI Evaluation Service in Vertex AI

Evaluating the performance of Generative AI models is a critical step in ensuring their effectiveness and reliability. Google Vertex AI provides a comprehensive platform for evaluating these models, offering a range of tools and techniques to assess their capabilities. This article covers the key concepts and processes involved in evaluating Generative AI models on Vertex AI. We will discuss defining clear evaluation metrics, preparing suitable datasets, and leveraging the platform’s features to run evaluations and interpret the results.

Defining Evaluation Metrics

Before evaluating generative AI models, it’s crucial to establish clear evaluation goals and define appropriate metrics. These metrics should align with the specific tasks the AI models are designed for, such as summarizing articles or responding to customer inquiries.

Key Concepts in Metric Definition:

The Gen AI Evaluation Service in Vertex AI lets you evaluate any model with explainable metrics. To evaluate your application’s performance on a specific task, consider the criteria you want to measure and the metrics you would use to score them.

  • Criteria are the dimensions you want to evaluate, such as conciseness, relevance, correctness, or appropriate word choice.
  • Metrics are quantifiable scores that measure model output against the defined criteria.

Let’s say you’re building an app to summarize articles. How would you know if it’s doing a good job? Here are some things to think about:

  • Is it concise? You want your summaries short and sweet, not just a rehash of the whole article.
  • Is it relevant? The summary should hit the main points of the article, not go off on tangents.
  • Is it correct? You don’t want your summary to say something that’s just plain wrong.

How can you measure these things? Here are a few ideas:

  • Conciseness: Check the length of the summary compared to the original article. Shorter is usually better.
  • Relevance: See if the summary includes the most important ideas from the article.
  • Correctness: Look for any facts in the summary that don’t match the original article.
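As a purely illustrative sketch (not part of the Gen AI Evaluation Service), the first two ideas can be approximated with a few lines of Python; correctness usually needs a judge model or human review:

# Toy heuristics only; function names and thresholds are illustrative.
def conciseness_ratio(article: str, summary: str) -> float:
    """Summary length relative to the article; lower means more concise."""
    return len(summary.split()) / max(len(article.split()), 1)

def relevance_overlap(article: str, summary: str) -> float:
    """Fraction of summary words that also appear in the article (rough proxy)."""
    article_words = set(article.lower().split())
    summary_words = summary.lower().split()
    return sum(w in article_words for w in summary_words) / max(len(summary_words), 1)

article = (
    "Some might think that steel is the hardest material, or even titanium."
    " However, diamond is actually the hardest material."
)
summary = "Diamond is the hardest material."
print(conciseness_ratio(article, summary))   # ≈ 0.26
print(relevance_overlap(article, summary))   # 1.0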

The Gen AI Evaluation Service Metrics:

The Gen AI Evaluation Service offers two primary evaluation methods: Model-based and Computation-based.

  1. Model-based metrics: These metrics use a proprietary Google model as a judge. You can measure them pairwise or pointwise:

Pointwise metrics have the judge model assess the candidate model’s output against the evaluation criteria. For example, the score could range from 0 to 5, where 0 means the response does not fit the criteria at all and 5 means it fits them well.

Pairwise metrics let the judge model compare the responses of two models and pick the better one. This is often used when comparing a candidate model with the baseline model.

  2. Computation-based metrics: These metrics are computed using mathematical formulas to compare the model’s output against a ground truth or reference; a combined sketch of both metric types follows below.
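As a minimal sketch, assuming the vertexai.evaluation module of the Vertex AI SDK for Python (class names and prompt placeholders may differ slightly between SDK versions), one metric of each kind could be defined like this; the metric names and prompt wording are illustrative:

# Sketch only: model-based (pointwise and pairwise) and computation-based metrics.
from vertexai.evaluation import PairwiseMetric, PointwiseMetric

# Model-based, pointwise: the judge model scores each response on a 0-5 rubric.
pointwise_conciseness = PointwiseMetric(
    metric="conciseness",  # illustrative metric name
    metric_prompt_template=(
        "Rate the conciseness of the response on a scale from 0 to 5, where 0"
        " means the response does not fit the criterion at all and 5 means it"
        " fits it well.\n\nPrompt: {prompt}\nResponse: {response}\nScore:"
    ),
)

# Model-based, pairwise: the judge model picks the better of two responses.
pairwise_quality = PairwiseMetric(
    metric="pairwise_quality",  # illustrative metric name
    metric_prompt_template=(
        "Compare the two responses to the prompt and state which one is"
        " better.\n\nPrompt: {prompt}\n"
        "Baseline response: {baseline_model_response}\n"
        "Candidate response: {response}\n"
    ),
)

# Computation-based: reference-based formulas, requested by name.
# These require a reference (ground truth) column in the evaluation dataset.
computation_metrics = ["exact_match", "bleu", "rouge_l_sum"]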

Model-Based Metrics and Prompt Templates:

For model-based evaluation, we send a prompt to the judge model to generate the metric score based on specified criteria, score rubrics, and other instructions. The prompt template structures the evaluation process for the judge model. 
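As a sketch of what such a template can contain, the SDK provides a PointwiseMetricPromptTemplate helper that takes the criteria and rating rubric explicitly. The criterion descriptions and rubric levels below are illustrative assumptions, and exact field names may vary between SDK versions:

# Sketch only: a structured judge prompt built from criteria and a score rubric.
from vertexai.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

summary_quality = PointwiseMetric(
    metric="summary_quality",  # illustrative metric name
    metric_prompt_template=PointwiseMetricPromptTemplate(
        criteria={
            "conciseness": "The summary is short and does not simply restate the article.",
            "correctness": "The summary contains no facts that contradict the article.",
        },
        rating_rubric={
            "5": "The summary satisfies both criteria.",
            "3": "The summary partially satisfies the criteria.",
            "0": "The summary satisfies neither criterion.",
        },
    ),
)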

Preparing the Evaluation Dataset

Essential to this model-based evaluation process is the preparation of a quality evaluation dataset. The evaluation dataset typically includes:

  • Model responses: the outputs generated by the models being evaluated.
  • Input data: the data fed into the models to generate the responses.
  • Ground truth responses (optional): the correct or desired responses used as a benchmark for comparison.

For model-based metrics, the dataset requires the following information:

  • prompt: User input to the AI model (this is optional in some cases).
  • response: LLM inference response to be evaluated.
  • baseline_model_response (for pairwise metrics only): The baseline LLM response used for comparison in pairwise evaluations.

If you use the Gen AI Evaluation module that comes with the Vertex AI SDK for Python, the Gen AI evaluation service can automatically create the response and baseline_model_response using the model you picked. For other evaluation use cases, you may need to provide more information.

Depending on your use cases, you may also break down the input user prompt into granular pieces, such as instruction and context, and assemble them for inference by providing a prompt template. You can also provide the reference or ground truth information if needed:
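For instance, a summarization dataset could carry separate instruction and context columns, plus an optional reference column holding the ground truth, and assemble the final prompt with a simple template. The column values and template below are placeholders for illustration:

import pandas as pd

# Illustrative sketch: granular input columns plus optional ground truth.
summarization_df = pd.DataFrame(
    {
        "instruction": ["Summarize the following article in one sentence."] * 2,
        "context": ["<article text 1>", "<article text 2>"],
        "reference": ["<ground-truth summary 1>", "<ground-truth summary 2>"],
    }
)

# Assemble the prompt column from the granular pieces with a template.
prompt_template = "{instruction}\n\nArticle:\n{context}"
summarization_df["prompt"] = [
    prompt_template.format(instruction=i, context=c)
    for i, c in zip(summarization_df["instruction"], summarization_df["context"])
]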

Keep these tips in mind when setting up your dataset for evaluation.

  • Provide examples that represent the types of inputs your models will process in production.
  • You need at least one evaluation example in your dataset, but for the best results aim for around 100 examples; this gives you more reliable metrics and makes your results statistically significant.

Dataset Example:

A question-answering task evaluation dataset might look like this:

				
import pandas as pd

prompts = [
    # Example 1
    (
        "Based on the context provided, what is the hardest material? Context:"
        " Some might think that steel is the hardest material, or even"
        " titanium. However, diamond is actually the hardest material."
    ),
    # Example 2
    (
        "Based on the context provided, who directed The Godfather? Context:"
        " Mario Puzo and Francis Ford Coppola co-wrote the screenplay for The"
        " Godfather, and the latter directed it as well."
    ),
]

responses = [
    # Example 1 (candidate model)
    "Diamond is the hardest material. It is harder than steel or titanium.",
    # Example 2 (candidate model)
    "Francis Ford Coppola directed The Godfather.",
]

baseline_model_responses = [
    # Example 1 (baseline model)
    "Steel is the hardest material.",
    # Example 2 (baseline model)
    "John Smith.",
]

# Assemble the columns expected by the Gen AI Evaluation Service.
eval_dataset = pd.DataFrame(
    {
        "prompt": prompts,
        "response": responses,
        "baseline_model_response": baseline_model_responses,
    }
)

Running the Evaluation

Vertex AI provides a few different ways to evaluate your generative AI models. You can use the Vertex AI SDK for Python or the REST API, or try the pre-built evaluation pipelines for common tasks. If you need something more specific, you can customize the evaluation pipelines to fit your exact requirements.
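As a minimal sketch with the SDK, the question-answering dataset and the pairwise metric defined earlier can be evaluated in a few lines. The project ID, location, and experiment name are placeholders, and no model argument is passed because the responses are already in the dataset (bring-your-own-response evaluation):

# Sketch only: running an evaluation with the Vertex AI SDK for Python.
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

eval_task = EvalTask(
    dataset=eval_dataset,           # the pandas DataFrame built above
    metrics=[pairwise_quality],     # the PairwiseMetric defined earlier
    experiment="qa-eval-example",   # illustrative experiment name
)

# response and baseline_model_response are already in the dataset,
# so the service simply scores the provided responses.
eval_result = eval_task.evaluate()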

Viewing and Interpreting Results

After the evaluation pipeline completes, Vertex AI provides access to the results, allowing you to analyze and interpret model performance based on the chosen metrics.

Key Features:

  • Metric Scores: View the scores for each metric, allowing you to compare the performance of different models or model configurations.
  • Judge Model Outputs (for model-based metrics): Access the outputs generated by the judge model, providing insights into its evaluation reasoning and criteria fulfillment.
  • Visualization Tools: Utilize charts, graphs, and other visualizations to gain a deeper understanding of the evaluation results.

By analyzing these results, you can identify areas where models excel or require improvement and iteratively refine their performance.
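Continuing the sketch above, the result object returned by the SDK exposes aggregate scores and a per-example table; exact attribute names may differ slightly between SDK versions:

# Aggregate scores across the whole dataset, one entry per metric.
print(eval_result.summary_metrics)

# Per-example results as a pandas DataFrame: scores plus the judge model's
# explanations for model-based metrics.
per_example = eval_result.metrics_table
print(per_example.head())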
