AI Content Metrics Evaluation System - 1 - Process Improvements in the Training Phase

@jonaszhou|June 6, 2025

In our business application scenarios, training large AI models essentially means training the prompt.

Task Background

In our current scenario of AI content summarization and information extraction, after formulating and optimizing the AI prompt, we overly rely on the trained prompt, directly deploy it online, and track its business application.

Current process:

After tracking for a period of time, more and more issues were discovered:

AI summary results are unstable: The summary results obtained during the testing phase seemed fine, but in actual operation over time, the quality of the summaries became unstable, with various unforeseen problems.

Fundamentally, there is a lack of effective metrics and mechanisms to evaluate "AI summary content," leading to unstable results. Therefore, we established the "AI Content Metrics Evaluation System," currently primarily applied during the training phase.

Process Improvement in the Training Phase

During the LLM training phase (using prompt engineering for training), the "AI Content Metrics Evaluation System" is added.

Each time, different prompts are used for training, with the expectation of obtaining prompt combinations that meet the requirements. (Prompt combinations include: system prompts & user prompts.)

Each training process:

Define the prompt combination for the current training session
Use this prompt combination, combined with a fixed training dataset, to batch call the AI interface and obtain response content
Evaluate the batch of AI summary results obtained using the "AI Content Metrics Evaluation System" to get the evaluation metrics for that prompt combination
Based on the evaluation metric results, determine whether the prompt combination still needs adjustment and repeat the training process

The training process can be represented graphically as:

In our scenario, the training prompt combination includes SYSTEM_PROMPT and USER_PROMPT.

USER_PROMPT is fixed as the processed visit records (many days of different visit records), which is the "fixed training dataset." Therefore, the focus of training is:

Passing in different USER_PROMPTs
Tuning the SYSTEM_PROMPT

Additionally, keep other parameters for calling the LLM unchanged:

TEMPERATURE = 0.3 # Set a lower temperature

# Use default values for other parameters

Next Implementation Priorities

Establish a mechanism for batch API calls.
- The current API calls are single calls, unable to handle batch processing. A batch calling mechanism needs to be established so that during the training phase, the training dataset can be used to obtain AI summary results in batches.
Establish an "AI Content Metrics Evaluation System".
- For the AI summary results obtained in batches, use the "AI Content Metrics Evaluation System" for evaluation to obtain evaluation metric results.