Evaluation Metrics for the Shared Task
1. Lemmatization Task
The lemmatization task will be evaluated using the following metrics:
- Accuracy (Exact Match): The percentage of words for which the predicted lemma matches the gold-standard lemma exactly. This is the primary metric for overall system performance (a computation sketch follows this list).
- Error Analysis Categories: Systems will also be evaluated qualitatively by analyzing errors across categories such as:
  - Homographs: Words with the same form but different meanings or lemmas.
  - Rare Lemmas: Lemmas that occur infrequently in the dataset.
  - Morphological Complexity: Cases with challenging inflectional variations.
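A minimal sketch of how exact-match accuracy could be computed (the function name and example data are illustrative, not part of the official scorer):

```python
def lemma_accuracy(predicted, gold):
    """Exact-match accuracy: fraction of words whose predicted
    lemma is identical to the gold-standard lemma."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Example: 3 of 4 predictions match the gold lemmas exactly.
print(lemma_accuracy(["run", "be", "cats", "go"],
                     ["run", "be", "cat", "go"]))  # 0.75
```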
2. Token Prediction Task
The token prediction task will be evaluated using a combination of traditional accuracy metrics and perplexity to provide a comprehensive assessment of system performance.
Primary Metrics
- Top-1 Accuracy: The percentage of placeholders correctly filled with the exact word predicted as the top-ranked choice by the system.
- Top-3 Accuracy: The percentage of placeholders for which the correct word appears within the system's top 3 ranked predictions (a top-k sketch follows this list).
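Both accuracy variants are instances of top-k accuracy; a sketch under the assumption that systems submit a ranked candidate list per placeholder (names and data are illustrative):

```python
def top_k_accuracy(ranked_predictions, gold, k):
    """Fraction of placeholders whose gold word appears among the
    system's top-k ranked candidates (k=1 gives Top-1 accuracy)."""
    hits = sum(g in ranked[:k]
               for ranked, g in zip(ranked_predictions, gold))
    return hits / len(gold)

# Two placeholders: the first is a top-1 hit, the second only a top-3 hit.
preds = [["cat", "dog", "rat"], ["blue", "red", "green"]]
gold = ["cat", "green"]
print(top_k_accuracy(preds, gold, k=1))  # 0.5
print(top_k_accuracy(preds, gold, k=3))  # 1.0
```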
Secondary Metrics
- Perplexity: Measures how well a probabilistic model predicts the masked tokens by quantifying the uncertainty in its predictions. Lower perplexity indicates better performance, with the model assigning higher probabilities to the correct completions. This metric is critical for assessing probabilistic models' generalization across different contexts (see the sketch below).
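One standard way to compute perplexity over the masked positions is the exponential of the mean negative log-probability assigned to the correct tokens; a sketch with hypothetical log-probabilities:

```python
import math

def masked_perplexity(token_log_probs):
    """Perplexity over masked positions: exp of the mean negative
    log-probability the model assigns to each correct token.
    Lower is better."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Natural-log probabilities the model assigned to the correct
# completions at three masked positions (illustrative values).
log_probs = [math.log(0.5), math.log(0.25), math.log(0.1)]
print(masked_perplexity(log_probs))  # ~4.31
```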
We will normalize these metrics and combine them into a single overall score for this task; one possible combination scheme is sketched below.
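The exact normalization and weighting are not specified above, so the following is only an assumed scheme: min-max normalize perplexity (inverted, since lower is better) and average it with the accuracy metrics, which already lie in [0, 1].

```python
def combined_score(top1, top3, perplexity, ppl_min, ppl_max):
    """Illustrative combination (assumed, not the official formula):
    map perplexity into [0, 1] so lower values score higher, then
    take the unweighted mean with Top-1 and Top-3 accuracy."""
    ppl_norm = (ppl_max - perplexity) / (ppl_max - ppl_min)
    return (top1 + top3 + ppl_norm) / 3

# Hypothetical system scores; ppl_min/ppl_max would come from the
# range observed across all submissions.
print(combined_score(top1=0.42, top3=0.68, perplexity=12.0,
                     ppl_min=5.0, ppl_max=40.0))  # ~0.63
```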