AI Content Metrics Evaluation System - 3 - Process Improvements in the Production Phase

June 6, 2025

The first article in this series, "Process Improvements in the Training Phase", showed how to use the metrics evaluation system during training to compare different prompt combinations and find the best-performing prompt set. The second article, "Establishing Evaluation Metrics", covered how to build the evaluation metrics system and how to use an LLM Judge to assist with evaluation. This article explains how to use the same metrics evaluation system in production to dynamically select the best-performing AI response and apply it in real business scenarios.

Production Environment Application Process

In production, the core operations can be summarized as:

  1. Make multiple copies of the content to be summarized by AI (we used 10 copies)
  2. Call the LLM on each copy in turn to produce a summary
  3. Use the "AI Content Metrics Evaluation System" to analyze each summary and produce an evaluation result
  4. Select the summary with the highest score as the final result
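The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the production implementation: the function name `pick_best_summary` and the two callables `summarize` (one LLM call per copy) and `evaluate` (the metrics evaluation system, reduced to a single number) are hypothetical placeholders that the caller must supply.

```python
from typing import Callable, Optional


def pick_best_summary(
    copies: list[str],
    summarize: Callable[[str], str],        # hypothetical: wraps one LLM call
    evaluate: Callable[[str, str], float],  # hypothetical: wraps the metrics evaluation
) -> Optional[str]:
    """Summarize every copy, score each summary, return the top-scoring one."""
    best_summary: Optional[str] = None
    best_score = float("-inf")
    for source_text in copies:
        summary = summarize(source_text)        # step 2: call the LLM on this copy
        score = evaluate(source_text, summary)  # step 3: evaluate the summary
        if score > best_score:                  # step 4: keep the highest score
            best_summary, best_score = summary, score
    return best_summary  # None when no copies were given
```

Step 1 then amounts to something like `copies = [content] * 10` before the call. On a tie, this sketch keeps the first summary that reached the top score; a real system may prefer a different tie-break.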

The specific implementation of each stage above can be broken down into the following process:

Other Matters

In our scenario, the problem we currently care most about is a response "incorrectly matching the original text", so the "Original Text Citation Accuracy" metric has veto power:

  • If this metric is not 1 in the evaluation result of a given copy's AI response, that response is discarded

The following excerpt from the scoring logic implements this check:

# First check original text citation accuracy
citation_accuracy_data = metrics.get("citation_accuracy", {})
citation_accuracy_score = citation_accuracy_data.get("citation_accuracy", 0.0)

# If citation accuracy is below 1 (or missing), the overall score is set to 0
if citation_accuracy_score < 1.0:
    return 0.0
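
For readers who want something they can run, the veto can be placed inside a complete scoring function. The sketch below rests on assumptions: the name `overall_score` is made up, every metric is assumed to be stored as `metrics[name][name]` (mirroring the `citation_accuracy` lookup), and averaging the metric scores is just one possible way to combine them. The article does not say how the overall score is actually computed.

```python
def overall_score(metrics: dict) -> float:
    """Return 0.0 when citation accuracy is below 1, else the mean metric score."""
    # Veto check: a missing or imperfect citation accuracy zeroes the score.
    citation_accuracy_data = metrics.get("citation_accuracy", {})
    citation_accuracy_score = citation_accuracy_data.get("citation_accuracy", 0.0)
    if citation_accuracy_score < 1.0:
        return 0.0

    # Assumption: combine the metrics with a plain average of metrics[name][name].
    scores = [
        data[name]
        for name, data in metrics.items()
        if isinstance(data, dict) and isinstance(data.get(name), (int, float))
    ]
    return sum(scores) / len(scores)
```

Because a vetoed response scores 0, it can only be chosen in the highest-score selection step if every other copy was vetoed as well; a caller that must never return a vetoed summary should treat a best score of 0 as "no acceptable result".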