AI Content Metrics Evaluation System - 4 - Subsequent Optimization of the Metrics System

@jonaszhou|June 6, 2025

1. Learning from Others

Eugene Yan's article[1] introduces four characteristics that "a good summary" should possess[2]:

Relevant — The summary retains the key points and important details from the source text.
Concise — The summary is informative, does not repeat the same point multiple times, and is not unnecessarily verbose.
Coherent — The summary is well-structured and easy to understand, not just a jumble of compressed facts.
Faithful — The summary does not hallucinate information not supported by the source text.

Isaac Tham discusses how to use DeepEval to implement custom evaluation metrics for AI summarization tasks, tailored to his specific scenario[3].

DeepEval's approach is worth noting: it uses a second LLM, acting as an LLM judge[4], to evaluate the summary generated by the first LLM.

Let's first look at the two independent components underlying DeepEval's SummarizationMetric[5]: Coverage and Alignment. The way each of them is computed is instructive[6].

coverage_score: The "coverage score" measures whether the summary contains the necessary information from the original text.
alignment_score: The "alignment score" measures whether the information in the summary is consistent with the original text or contradicts it.

When calculating the "coverage score," the LLM judge uses the source text as the benchmark and checks whether the AI-generated summary contains the source text's information or conflicts with it (an answer of "no" indicates a conflict; "idk" indicates an omission).
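As a rough sketch of how the verdicts described above could be turned into a number, the Python below counts the share of "yes" answers. The function names and the sample verdicts are hypothetical, the LLM judge call itself is left out, and this is a simplified illustration rather than DeepEval's actual code.

```python
from collections import Counter


def _normalize(verdicts):
    # Verdicts are assumed to be the strings "yes", "no" or "idk".
    return [v.strip().lower() for v in verdicts]


def coverage_score(verdicts):
    """Share of source-derived questions that the summary answers with "yes"."""
    cleaned = _normalize(verdicts)
    if not cleaned:
        return 0.0
    return Counter(cleaned)["yes"] / len(cleaned)


def coverage_breakdown(verdicts):
    """Split the verdicts into covered, conflicting ("no") and omitted ("idk")."""
    counts = Counter(_normalize(verdicts))
    return {
        "covered": counts["yes"],
        "conflicting": counts["no"],
        "omitted": counts["idk"],
    }


# Hypothetical judge output for four questions generated from the source text.
sample_verdicts = ["yes", "yes", "no", "idk"]
print(coverage_score(sample_verdicts))      # 0.5
print(coverage_breakdown(sample_verdicts))
```

Keeping the breakdown next to the score makes it easier to tell whether a low score comes from conflicts or from omissions.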

When calculating the "alignment score," the LLM judge uses the AI summary as the benchmark (the opposite of the above) and checks to what extent the summary's content contradicts, or is not supported by, the source text (an answer of "no" indicates a conflict between the AI summary and the original text; "idk" indicates a hallucination in the AI summary).
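The alignment direction can be sketched in the same spirit. Here each claim taken from the summary carries a verdict, and only "yes" counts as aligned, so both contradictions and hallucinations lower the score. The names `alignment_report` and `overall_score` and the sample claims are hypothetical. Taking the minimum of the two component scores follows how the DeepEval documentation describes combining them; everything else is a simplified assumption.

```python
def alignment_report(claim_verdicts):
    """Score summary claims against the source text.

    claim_verdicts is a list of (claim, verdict) pairs, where verdict is
    assumed to be "yes", "no" (contradiction) or "idk" (hallucination).
    """
    cleaned = [(claim, verdict.strip().lower()) for claim, verdict in claim_verdicts]
    supported = sum(1 for _, verdict in cleaned if verdict == "yes")
    total = len(cleaned)
    return {
        "score": supported / total if total else 0.0,
        "contradicted": [claim for claim, verdict in cleaned if verdict == "no"],
        "unsupported": [claim for claim, verdict in cleaned if verdict == "idk"],
    }


def overall_score(coverage, alignment):
    # The weaker of the two components determines the overall result.
    return min(coverage, alignment)


# Hypothetical judge output for four claims extracted from the summary.
sample = [
    ("Revenue grew in Q2", "yes"),
    ("The CEO resigned", "no"),
    ("A new office opened in Berlin", "idk"),
    ("Headcount stayed flat", "yes"),
]
report = alignment_report(sample)
print(report["score"])                 # 0.5
print(overall_score(0.75, report["score"]))  # 0.5
```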

2. Insights for Our Scenario

Our scenario actually involves two types of tasks:

AI Summarization Task: The LLM generates key information summaries based on the source text.
AI Information Extraction Task: Based on the summarized information, the LLM extracts passages from the source text that support and corroborate the summary.

For the AI Summarization Task, we can directly adopt the evaluation methods described above. For the AI Information Extraction Task, however, two key aspects need to be evaluated:

The consistency between the extracted information and the summary, in other words, the extent to which the extracted information supports the summary.

This part can still draw upon the evaluation methods mentioned above.

Whether the extracted information remains faithful to the original text. Because our task explicitly requires the LLM to extract content from the original text, consistency with the original text is another crucial point.

Evaluating this aspect is simpler: it only requires comparing the content extracted by the LLM with the original text. For an implementation, see the article "AI Content Metrics Evaluation System - 2 - Establishing Metrics".
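The two aspects above can be sketched together in Python. This is a minimal illustration under stated assumptions, not the implementation from the earlier article: `support_rate`, `extraction_faithfulness`, and `substring_judge` are hypothetical names, and the judge here is a trivial stand-in for a real LLM judge so that the example runs on its own.

```python
import re
from typing import Callable, List


def support_rate(points: List[str], extracts: List[str],
                 judge: Callable[[str, List[str]], str]) -> float:
    """Aspect 1: share of summary points the judge finds supported by the extracts."""
    if not points:
        return 0.0
    supported = sum(1 for p in points if judge(p, extracts).strip().lower() == "yes")
    return supported / len(points)


def substring_judge(point: str, extracts: List[str]) -> str:
    # Stand-in for an LLM judge (assumption): "yes" only on a literal match.
    return "yes" if any(point.lower() in e.lower() for e in extracts) else "idk"


def _squash(text: str) -> str:
    # Collapse whitespace so that line breaks do not cause false mismatches.
    return re.sub(r"\s+", " ", text).strip()


def extraction_faithfulness(extracts: List[str], source: str) -> float:
    """Aspect 2: share of extracts that appear verbatim in the source text."""
    if not extracts:
        return 0.0
    haystack = _squash(source)
    hits = sum(1 for e in extracts if _squash(e) and _squash(e) in haystack)
    return hits / len(extracts)


source_text = "In Q2, revenue grew 10%\nyear over year. Costs were stable."
extracted = ["revenue grew 10% year over year", "Costs fell sharply."]
summary_points = ["revenue grew 10%", "costs fell"]

print(extraction_faithfulness(extracted, source_text))            # 0.5
print(support_rate(summary_points, extracted, substring_judge))   # 1.0
```

In this sample the second extract supports the summary point "costs fell" but does not appear in the source, which is why the two aspects need to be checked separately.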

1. ^ Eugene Yan: Evaluation & Hallucination Detection for Abstractive Summaries

2. ^ Original source: Kryscinski et al. (2019)

3. ^ Isaac Tham: How to Evaluate LLM Summarization

4. ^ Introduction to DeepEval's LLM Judge

5. ^ DeepEval's Introduction to SummarizationMetric

6. ^ Introduction to the Implementation Methods of Coverage and Alignment Metrics