AI Content Metrics Evaluation System - 2 - Establishing Evaluation Metrics

June 6, 2025

As mentioned in the previous article, we established an "AI Content Metrics Evaluation System" during the prompt training phase. The core of this system is a set of evaluation metrics and the way each metric is applied. This article introduces the metrics we built and shows how an LLM Judge assists with evaluation.

Task Objectives

In our scenario, the task consists of two parts:

  • AI Summarization Task: The LLM generates summaries of the key information in the source text.
  • AI Information Extraction Task: For each summarized point, the LLM extracts the corresponding original text from the source, so that the summary can be supported and verified.

After deploying the LLM, conducting business tests with real data, and tracking for a period, we found:

  • For the AI Summarization Task, the deployed LLM (primarily its prompts) performs well: the refined summaries meet our intuitive standards.
  • For the AI Information Extraction Task, however, the LLM still hallucinates. Specifically, when the source material contains visit records for multiple customers, it frequently attributes text to the wrong record (e.g., quoting a passage from customer A's visit record as if it came from customer B's).

Therefore, the current focus of the evaluation metrics is to address the aforementioned issue of "incorrectly matching the original text." (For evaluation metrics of the AI Summarization Task, you can refer to this article: AI Content Metrics Evaluation System - 4 - Further Optimizing the Metrics System).

Evaluation Metrics System & LLM Judge

For our AI summarization and information extraction scenario, we have established the following evaluation metrics system:

  1. Customer Identification Accuracy - Checks if the AI correctly identifies all customers
    • Precision, Recall, F1 Score
  2. Keyword Relevance - Checks if the extracted keywords are relevant to the original text
    • Keyword Frequency, Semantic Relevance Analysis
  3. Original Text Citation Accuracy - Checks the accuracy of original text citations in the details (Core Metric)
    • Citation Content Match Rate, Similarity Analysis
  4. Logical Consistency - Checks the logical consistency between the summary and the details
    • Content Consistency Check, Logical Relationship Verification
  5. LLM Semantic Quality Assessment - Evaluates the semantic quality of the AI summary using an LLM
    • Information Completeness, Semantic Accuracy, Expression Clarity, Professionalism
  6. LLM Sentiment Tendency Consistency - Checks if the AI summary maintains the original text's sentiment tendency
    • Original Sentiment Analysis, Summary Sentiment Analysis, Sentiment Consistency Score, Sentiment Mismatch Analysis
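
As a minimal sketch of the first metric, assuming customers are compared as sets of normalized names (the function name and the lowercase/strip normalization are illustrative assumptions, not the system's actual code), precision, recall, and F1 can be computed like this:

```python
def customer_identification_scores(expected, predicted):
    """Set-based precision, recall and F1 for identified customers.

    `expected` holds the customers present in the source data,
    `predicted` the customers the AI output mentions (hypothetical inputs).
    """
    expected_set = {name.strip().lower() for name in expected}
    predicted_set = {name.strip().lower() for name in predicted}

    true_positives = len(expected_set & predicted_set)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A missed customer lowers recall, while an invented customer lowers precision; F1 combines both into a single number.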

For items 5 and 6 above, we followed DeepEval's approach of using an LLM Judge to assess the semantic quality and sentiment tendency consistency of the AI summary. Our implementation of the LLM Judge, however, is relatively simple:

  • Have the LLM Judge compare the original text with the AI summary content.
  • Have the LLM Judge directly output scores for these two evaluation metrics.

    def _load_evaluation_prompts(self):
        """Load evaluation prompt templates"""
        return {
            "semantic_quality": """
            You are a professional AI content quality assessment expert. Please evaluate the semantic quality of the following AI summary.

            Original data:
            {original_data}

            AI summary:
            {ai_summary}

            Please evaluate the following dimensions (1-10 score):
            1. Information completeness: Does the AI summary cover the key information in the original data?
            2. Semantic accuracy: Is the AI summary semantically consistent with the original text?
            3. Expression clarity: Is the AI summary clear and easy to understand?
            4. Professionalism: Are appropriate financial/investment terms used?

            Please return the evaluation result strictly in the following JSON format:
            {{
                "info_completeness": {{
                    "score": score,
                    "analysis": "Analysis rationale"
                }},
                "semantic_accuracy": {{
                    "score": score,
                    "analysis": "Analysis rationale"
                }},
                "expression_clarity": {{
                    "score": score,
                    "analysis": "Analysis rationale"
                }},
                "professionalism": {{
                    "score": score,
                    "analysis": "Analysis rationale"
                }},
                "overall_score": total_score,
                "strengths": ["Strength 1", "Strength 2", ...],
                "weaknesses": ["Weakness 1", "Weakness 2", ...]
            }}
            """,

            "sentiment_consistency": """
            You are a professional sentiment analysis expert. Please analyze the consistency of emotional tendency between the original customer viewpoint and the AI summary.

            Original data:
            {original_data}

            AI summary:
            {ai_summary}

            Please evaluate the following aspects:
            1. Primary sentiment tendency of the original customer viewpoint (positive/negative/neutral)
            2. Sentiment tendency expressed in the AI summary (positive/negative/neutral)
            3. Whether sentiment intensity is accurately preserved
            4. Whether there is sentiment bias or misleading expression

            Please return the evaluation result strictly in the following JSON format:
            {{
                "original_sentiment": {{
                    "primary_sentiment": "positive/negative/neutral",
                    "sentiment_details": ["Detail 1", "Detail 2", ...]
                }},
                "summary_sentiment": {{
                    "primary_sentiment": "positive/negative/neutral",
                    "sentiment_details": ["Detail 1", "Detail 2", ...]
                }},
                "consistency_score": 1-10 score,
                "consistency_analysis": "Consistency analysis",
                "sentiment_mismatches": [
                    {{
                        "original": "Original text snippet",
                        "summary": "AI summary snippet",
                        "mismatch_type": "Sentiment bias type"
                    }},
                    ...
                ]
            }}
            """
        }
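
To show how such a template might be used, here is a small sketch that fills in the placeholders and parses the judge's reply. The helper names `build_judge_prompt` and `parse_judge_response` are hypothetical, and the call to the LLM itself is left out because it depends on the client in use:

```python
import json
import re


def build_judge_prompt(template, original_data, ai_summary):
    """Fill the {original_data} and {ai_summary} placeholders.

    str.format turns the doubled braces ({{ and }}) in the template
    into the literal braces of the JSON example.
    """
    return template.format(original_data=original_data, ai_summary=ai_summary)


def parse_judge_response(raw_response):
    """Return the JSON object found in the judge's reply, or None.

    Judges sometimes wrap the JSON in extra prose, so we take the span
    from the first '{' to the last '}' and try to parse it.
    """
    match = re.search(r"\{.*\}", raw_response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising lets the caller decide whether to retry the judge or record the sample as unscored.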

For cases of "incorrectly matching the original text," the implementation logic is not overly complex:
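
One way to sketch such a check, assuming the visit records are available as a mapping from customer name to record text (a hypothetical data layout, as are all function names below), is to test whether the cited text actually appears in the record of the customer it is attributed to:

```python
def normalize(text):
    """Collapse whitespace and lowercase so formatting differences do not matter."""
    return " ".join(text.split()).lower()


def customers_containing(quote, records_by_customer):
    """Return the customers whose visit record contains the quote verbatim."""
    needle = normalize(quote)
    if not needle:
        return []
    return [
        customer
        for customer, record in records_by_customer.items()
        if needle in normalize(record)
    ]


def is_citation_correct(claimed_customer, quote, records_by_customer):
    """True if the quote appears in the record of the customer it is attributed to."""
    return claimed_customer in customers_containing(quote, records_by_customer)
```

When the quote is found only in another customer's record, the result of `customers_containing` also shows which record the text was really taken from.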

We also recognize that LLMs inevitably produce hallucinations, so we set thresholds that tolerate AI paraphrasing of the original text. As long as most of the original text is matched within a reasonable range, the citation is considered a correct match:
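
A threshold of this kind could be sketched with Python's standard `difflib` module. The coverage measure, the minimum run length of 4 characters, and the 0.8 threshold below are all illustrative assumptions rather than the values used in our system:

```python
from difflib import SequenceMatcher


def quote_coverage(quote, record, min_block=4):
    """Fraction of the quote's characters found in the record.

    Only shared runs of at least `min_block` characters are counted,
    so scattered single-character matches do not inflate the score.
    """
    if not quote:
        return 0.0
    matcher = SequenceMatcher(None, quote, record, autojunk=False)
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block
    )
    return matched / len(quote)


def is_tolerated_match(quote, record, threshold=0.8):
    """Accept the citation if enough of the quote is covered by the record."""
    return quote_coverage(quote, record) >= threshold
```

With this measure, a citation that swaps a single word still passes, while text that shares only a few short fragments with the record is rejected.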