Exploring Different Metrics For Natural Language Processing & Generation

It’s challenging to evaluate the quality of a text, especially if you’re working on an NLP project and want to assess how well your model is performing based on hundreds or thousands of data points. So, if you can’t look at every single output of your model, how can you know whether you’re doing a good job?

In this blog post, I will teach you about natural language generation and how to evaluate AI-generated texts with commonly used metrics. You will learn to understand various metrics, how to use and calculate them (with code examples), and how you can choose the best metric for your specific use case.

What is Natural Language Generation?

Natural language generation (NLG) is a subfield of artificial intelligence (AI) that focuses on creating software systems that can generate human-like text. NLG systems convert data into readable text used for summaries, reports, and descriptions, among other things. They draw on natural language processing (NLP), machine learning, and knowledge representation.

Those systems typically take a set of data as input (for example, a structured input such as a dataset or table, a simple text, or even an image) and output a coherent and understandable sequence of text. NLG systems can handle various NLP tasks, including answering user questions in a chatbot, translating a sentence or document from one language to another, suggesting story ideas, and generating summaries of long texts.

Why do we need (computer-generated) Metrics for evaluating AI-generated texts?

Every new NLG system or model brings a new set of evaluation challenges. Automatic evaluation measures like the BLEU score are increasingly used to evaluate natural language generation systems; between 2012 and 2015, up to 60% of NLG research included computer-generated metrics. Automatic evaluation is popular because it is less expensive and takes less time to run than human evaluation, and it is needed for the rapid development of new models, algorithm benchmarking, and fine-tuning. As a result, automatic evaluation is often the only viable option.

On the other hand, using such metrics makes sense only if the metric correlates with human preferences. Various studies in NLG and related fields, such as dialogue systems, machine translation (MT), and image captioning, show that this is rarely the case. So what are the different metrics and how do I choose the right one?

3 Types of Natural Language Generation Metrics

Evaluation Metrics for NLG systems can be clustered into three categories:

Human Evaluation

Human evaluation is the gold standard when evaluating AI-generated texts. However, it is expensive to run, and the results are difficult to reproduce. Participants, whether novices or experts, are asked to rate or compare texts generated by various NLG systems, and their judgments can differ quite significantly. Because automated metrics still fail to replicate human decisions, many NLG researchers continue to include humans when evaluating their new models.

Untrained Evaluation Metrics

Untrained automatic evaluation metrics are the most commonly used metrics to monitor NLG systems — untrained means the metric is computed directly from the model’s output with a pre-defined algorithm, without any learned components. For example, word (or n-gram) overlap metrics are widely used. These metrics measure the degree of “matching” between machine-generated and human-authored (ground-truth) texts.
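To make the word-overlap idea concrete, here is a minimal, illustrative sketch (not any particular published metric) that scores what fraction of the generated words also appear in a reference text:

# Illustrative sketch of word-overlap scoring, not an official metric:
# what fraction of the generated text's words also appear in the reference?
def unigram_overlap(generated: str, reference: str) -> float:
    gen_tokens = generated.lower().split()
    ref_tokens = set(reference.lower().split())
    if not gen_tokens:
        return 0.0
    matches = sum(1 for token in gen_tokens if token in ref_tokens)
    return matches / len(gen_tokens)

print(unigram_overlap("the cat sat on the mat", "a cat sat on a mat"))
# 0.666... -> four of the six generated words occur in the reference

Real overlap metrics such as BLEU or ROUGE refine this basic idea with n-grams, count clipping, and length penalties.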

For most NLG tasks, it is critical to select an automatic metric that measures the aspects of the generated text that matter for the desired goal. Often, using multiple metrics is better because they capture different aspects of the system’s quality, e.g., one metric for grammatical correctness and another for content quality.

Machine-Learned Evaluation Metrics

Untrained approaches fall short when the generated output is worded very differently from the reference yet is still perfectly valid, e.g., when generating chatbot answers to human messages. One solution to this problem is to use embedding-based metrics, which measure semantic similarity rather than word overlap.

To accomplish this, we use pre-trained NLP models, sometimes fine-tuned on human judgment data, that can, for example, produce a score indicating how similar two sentences are. With such models, we can assess quality measures including factual correctness, naturalness, fluency, and coherence.
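As a quick illustration of the embedding idea (separate from the metrics discussed below), you can compare sentence embeddings with cosine similarity, for example with the sentence-transformers library; the model name used here is just one common, lightweight choice:

from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; all-MiniLM-L6-v2 is a small, widely used choice
model = SentenceTransformer('all-MiniLM-L6-v2')

# Two sentences that share meaning but little wording
embeddings = model.encode([
    'The weather is lovely today.',
    'It is a beautiful, sunny day.'
])

# Cosine similarity close to 1 means the embeddings point in a similar direction
print(util.cos_sim(embeddings[0], embeddings[1]))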

However, using these machine-learned metrics for evaluating AI-generated texts could lead to issues including “metric gaming” or “overfitting.” Also, they are not as easy to use as the untrained approaches.

Examples of Natural Language Generation Metrics

In the following paragraphs, we will discuss several untrained and machine-learned metrics and see how you can use them in your next project.

Untrained Evaluation Metrics

F-Score

The F-score is a measure of a model’s accuracy. It balances precision and recall by taking their harmonic mean: F1 = 2 * precision * recall / (precision + recall). The F1-score is the most common variant of the F-score, and we will now calculate it for a text-classification model that tries to tell whether a given sentence is positive or negative.

# Source: https://huggingface.co/datasets/amazon_polarity
import tqdm
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, pipeline

# Load the Amazon polarity dataset (binary sentiment labels)
dataset = load_dataset("amazon_polarity")

# Example entry:
# {'content': 'My lovely Pat has one of the GREAT voices [...]"',
#  'label': 1,
#  'title': 'Great CD'
# }

# Create a pre-trained classification pipeline
model_id = "BaxterAI/finetuning-sentiment-model-3000-samples"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline(
    task="text-classification",
    model=model_id,
    tokenizer=tokenizer
)

# Let's get predictions and true labels
# from the first 100 elements in the test set
true_labels = []
predictions = []
for i in tqdm.tqdm(range(0, 100)):
    example = dataset['test'][i]
    pred = classifier(example['content'])[0]
    predictions.append(1 if pred['label'] == 'LABEL_1' else 0)
    true_labels.append(example['label'])

# Now we can calculate the F1-Score
f1 = f1_score(true_labels, predictions)
print("F1 Score:", f1)
# F1 Score: 0.9221435793731042
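
To tie this back to the definition above, the F1-score is simply the harmonic mean of precision and recall, so you can reproduce the number by hand from the same predictions and true_labels lists (a small sanity check, assuming the snippet above has been run):

from sklearn.metrics import precision_score, recall_score

precision = precision_score(true_labels, predictions)
recall = recall_score(true_labels, predictions)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
print("F1 Score (manual):", f1_manual)
# Matches the f1_score value printed above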

BLEU

The Bilingual Evaluation Understudy (BLEU) score was one of the first ways to measure how similar two sentences are. To calculate it, you count how many of the words (or, more generally, n-grams) in the generated text also appear in the reference text and divide that count by the total number of words (or n-grams) in the generated text. The full metric combines these precision values for several n-gram lengths and adds a brevity penalty for candidates that are much shorter than the reference.
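
To build some intuition, here is a deliberately simplified sketch of the n-gram precision at the heart of BLEU; the real metric combines several n-gram orders, clips counts against multiple references, and adds the brevity penalty, so use the library version shown further below for actual evaluation:

from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    # Share of the candidate's n-grams that also appear in the reference,
    # with counts clipped to how often each n-gram occurs in the reference
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

print(ngram_precision('the cat is on the mat', 'there is a cat on the mat'))
# 0.4 -> two of the five candidate bigrams appear in the reference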

According to recent research, while BLEU is a good metric for machine translation (for which it was designed), it does not correlate well with human judgments on other tasks, such as image captioning or dialog response generation. Some studies found that generated text with perfect BLEU scores lacked coherence and had poor informational content. Let’s look at an example of comparing two sentences with the BLEU score.

# Source: https://huggingface.co/metrics/bleu
from datasets import load_metric

bleu_metric = load_metric('bleu')

reference = 'Each translation should be tokenized into a list of tokens.'
generated_text = 'Each translation should be tokenized into a list of tokens.'

# We will get a perfect BLEU score because the two sentences are identical
bleu_score = bleu_metric.compute(
    predictions=[generated_text.split(' ')],
    references=[[reference.split(' ')]]
)
print('bleu:', bleu_score['bleu'])
# bleu: 1.0

reference = 'Each translation should be tokenized into a list of tokens.'
generated_text = 'Each translation must be tokenized.'

# Here, we will get a low score because the two sentences
# differ significantly (although they mean the same)
bleu_score = bleu_metric.compute(
    predictions=[generated_text.split(' ')],
    references=[[reference.split(' ')]]
)
print('bleu:', bleu_score['bleu'])
# bleu: 0.0

BLEU always returns a number between 0 and 1. This number indicates how similar the candidate text is to the reference texts, with higher values indicating more similar texts. Few human translations will attain a score of 1, since that would mean the candidate is identical to one of the reference translations. As a result, achieving a score of 1 is not required for a good translation.

However, as we can see in the last example, the BLEU score is 0, although the meaning of the two texts is the same. To mitigate this problem, let’s look at some machine-learned metrics and how they perform on this example.

Machine-Learned Evaluation Metrics

Evaluation Using BERT

BERT is a transformer-based language model created by Google that excels at NLP tasks. One of the BERT-based evaluation methods is BERTScore. It matches words in candidate and reference sentences by comparing the cosine similarity of their pre-trained contextual embeddings. Let’s see how BERTScore performs on our last example:

# You will need at least 1.5 GB of disk space because the BERTScorer
# will download the roberta-large model
from bert_score import BERTScorer

scorer = BERTScorer(lang="en", rescale_with_baseline=True)

reference = 'Each translation should be tokenized into a list of tokens.'
generated_text = 'Each translation should be tokenized into a list of tokens.'

# In this case we have a perfect match and the expected result of 1.
precision, recall, F1 = scorer.score([reference], [generated_text])
precision.mean(), recall.mean(), F1.mean()
# (tensor(1.0000), tensor(1.0000), tensor(1.0000))

reference = 'Each translation should be tokenized into a list of tokens.'
generated_text = 'Each translation must be tokenized.'
precision, recall, F1 = scorer.score([reference], [generated_text])
precision.mean(), recall.mean(), F1.mean()
# (tensor(0.7073), tensor(0.8580), tensor(0.7820))

In the second example, you can see the advantage of BERTScore over BLEU. Whereas BLEU returned a score of 0, BERTScore returned an F1-score of 0.7820, which indicates a strong similarity between the two sentences.

GRUEN

The GRUEN score is based on four criteria for text evaluation:

  1. Grammaticality: A high grammaticality score means the system output is readable, fluent, and grammatically correct.
  2. Non-redundancy: Non-redundancy refers to having no unnecessary repetition (e.g., the repeated use of “Uncle Bob” when a pronoun “he” would suffice).
  3. Focus: A focused output should have related semantics between neighboring sentences.
  4. Structure and coherence: A well-structured and coherent output should contain well-organized sentences, where the sentence order is natural and easy to follow.

The nal linguistic quality score is a linear combination of the above four scores: GRUEN-Score = grammatically + non-redundancy + focus + structure and is on a scale of 0 to 1.

To calculate the GRUEN metric, several different algorithms and NLP models are combined. Let’s see how we can use GRUEN to evaluate AI-generated texts.

# This text was generated with GPT-2: https://transformer.huggingface.co/doc/gpt2-large
# `gruen` refers to the GRUEN implementation imported in the accompanying Colab notebook
text = 'Natural language generation (NLG) is a subfield of artificial intelligence (AI) that deals with the generation of text and other structured and structured data.'
gruen_results = gruen.get_gruen([text])
print("GRUEN: ", gruen_results[0])
# GRUEN:  0.8357442046334724

# Let's fix the errors and check our score
text = 'Natural language generation (NLG) is a subfield of artificial intelligence (AI) that deals with the generation of text.'
gruen_results = gruen.get_gruen([text])
print("GRUEN: ", gruen_results[0])
# GRUEN:  0.8365157705733511 - fixing the error yields a slightly better score, as it should

Now you know more about NLG, what kinds of metrics you can use to evaluate your AI-generated texts, and how you can implement them in your next project. You can find a Colab notebook with the source code here.


References: