The Dirty Little Secret of Constituency Parser Evaluation
The task of comparing constituency parsers is not a trivial one. Parsers vary in the types of mistakes they make, types of texts they are good at parsing, speed, and all kinds of interesting features and quirks within each implementation. We set out to understand what stands behind the vague F-measure numbers lurking around 90% and what kind of issues to expect from different parsers, regardless of their overall quality.
In this article, which is the first in our series about exploration of syntactic parsing, we describe approaches to parser evaluation and use them to analyze common parser fail cases. We notice a serious fallacy in all of the standard metrics, which results in an overly optimistic outcome of the evaluation. Thus, we propose several improved metrics suitable for different use cases. Their implementation is published as a little open source [project].
Parser Evaluation Metrics
To properly evaluate the quality of syntactic parsing, one needs to choose an appropriate metric - one that will provide the most relevant and representative information. Such metric, for example, should take into account that we may consider the tree structure to be more important than labelling, and errors closer to the top of the tree to be more destructive than errors closer to the leaves. To find a metric that will satisfy our needs, we tested several classic metrics and have come up with a couple of modified ones to deal with the issues that we have encountered.
Below is a summary of the known approaches.
The Leaf-Ancestor Evaluation is conducted by assigning a score to every word in a sentence. The lineage of each word in the output tree (i.e. the sequence of nodes one has to pass to get to the word, starting from the root) is compared to the lineage of the same word in the gold tree, and the score is calculated as the minimum edit distance. Precision is obtained by dividing the sum of the scores by the number of nodes in the lineages of the output tree. Recall - in the gold tree.
This metric takes bracketed representations of trees, aligns them according to the leaf nodes and counts the brackets that do not coincide, thus showing the nodes that differ in span. The average number of crossed brackets per sentence is also calculated.
Minimum Tree Edit Distance
The minimum edit distance between trees represents the number of transformations needed to turn the generated tree into the gold one. The average number of transformations per tree is also calculated. Transformations include adding, removing and renaming nodes.
This metric was taken from [Berkeley Parser Analyzer], also mentioned in Parser Showdown. Basically, it is also the minimum edit distance between trees, but it does not count wrong leaf tags, while counting node renaming as two operations.
According to Parseval, a node in a tree is considered to be correct if there is the same node in the corresponding gold tree. Similarity is defined by the indices of the substring that the node dominates and the label of the node. To calculate precision, the number of correct nodes generated by the parser is divided by the total number of nodes in the output tree. For recall - in the gold tree. This metric is probably the most balanced one of the classic metrics and the most widely used.
The problem of standard Parseval is that it counts nodes as the same regardless of the underlying structure they dominate. Consider the following trees:
(S NN CC NN CC NN) (S (NP NN) CC (NP NN CC NN)))
The Parseval metric will count the top
S correct and only the added
NPs incorrect, yielding precision of 6/8 and 100% recall. Arguably,
S is also parsed incorrectly, and, to reflect that, we
propose adding an additional condition for node equality: the identity
of spans of the direct children of the node. Such metric causes
precision to be 5/8 and recall - 5/6 for this case.
We decided to call the metric Parseval-split because it depends on how the projection of the node is split into phrases.
More variants of Parseval
Other variations of the Parseval/Parseval-split metrics are also possible. The questions are:
- whether to count the labels of the nodes: if we don’t, we get a measure of purely structural difference
- whether to count leaf nodes
Taking into account leaf nodes is, actually, a common fallacy of all parser evaluation metrics, which results in skewed (and, let’s say, overly optimistic) numbers. Leaf node labels are POS tags that are usually not the result of the parser’s work: the parser receives them from the tagger (and even if it does the job of the tagger, as well, we still shouldn’t care about that). Also, the tagger usually has better quality than the parser on its own (up to 95% F1), and the number of the leaf nodes is close to 50% of the total tree nodes count.
As you’ll see from the evaluation results below, not taking leaf nodes into account results in a 5-20 point decrease of the score!
Weighted variants of Parseval
Further improvements to Parseval may be gained by weighting parser mistakes in relation to their distance to the closest/farthest leaf in the projection of the erroneous node. The hypothesis is that the farther in the tree the mistake is, the more destructive it is.
There may be all kinds of variations as to in what way the weights should be assigned:
- find the longest path from the root to the leaf (
l) and use it for defining weights for each erroneous node, e.g. find the shortest/longest path from the erroneous node to the leaf and normalize it by dividing by
l- this gives us a linear dependency on the distance from the node. The result can then be adjusted by arbitrary coefficients to fall into the predefined interval
- find the shortest/longest path from the root to the leaf through the
erroneous node (
l) and use it to define the weight, e.g. find its position in the path starting from the leaf and divide it by
- instead of a linear function, use an exponent that will asymptotically approach some maximum error weight
We experimented with the last option, used coefficients in the range [0.5, 1.5], and the distance to the farthest leaf. Due to the nature of the exponent function, the penalty for the first few levels near the leaves increases rapidly, but then it remains almost the same for all nodes that are closer to the root - around the upper limit of the interval.
By the way, you can use our project [gr-parseval] to experiment with different versions of Parseval on your own.
Another way of evaluating parsers is proposed in [Berkeley Parser Analyzer]. The described tool generates counts of the occurrences of different classes of parser mistakes in the corpus.
It is a good approach to understand the types of parser errors, not just get some overall quality numbers.
Kummerfield et al. have provided a good basis for such exploration by describing the common parser fail cases in their Parser Showdown. We have slightly updated and expanded the provided example set.
Case 1: PP Attachment
He was named chief executive officer of Applied in 1986.
(TOP (S (NP (PRP He)) (VP (VBD was) (VP (VBN named) (NP (NP (JJ chief) (NN executive) (NN officer)) (PP (IN of) (NP (NP (NNP Applied)) (PP (IN in) (NP (CD 1986)))))))) (. .)))
(TOP (S (NP (PRP He)) (VP (VBD was) (VP (VBN named) (NP (NP (JJ chief) (NN executive) (NN officer)) (PP (IN of) (NP (NNP Applied)))) (PP (IN in) (NP (CD 1986))))) (. .)))
Description: An incorrect PP Attachment is probably the most common parsing mistake. A prepositional phrase can serve as an indirect object relating to the verb, as an adverbial modifier relating to the whole sentence, or as part of a larger noun phrase. It is complicated to differentiate between these use cases, and a lot of research is dedicated to this issue. Case 1 shows a prepositional phrase that plays the role of an adverbial modifier of time and refers to the predicate of the sentence. In the erroneous parse tree, the prepositional phrase happens to be a part of a noun phrase.
Case 2: NP Attachment
Fidelity Investments placed new ads in newspapers yesterday, and wrote another new ad appearing today.
(TOP (S (NP (NNP Fidelity) (NNPS Investments)) (VP (VP (VBD placed) (NP (JJ new) (NNS ads)) (PP (IN in) (NP (NNS newspapers))) (NP (NN yesterday))) (, ,) (CC and) (VP (VBD wrote) (NP (NP (DT another) (JJ new) (NN ad)) (VBG appearing)) (NP (NN today)))) (. .)))
(TOP (S (NP (NNP Fidelity) (NNPS Investments)) (VP (VP (VBD placed) (NP (JJ new) (NNS ads)) (PP (IN in) (NP (NNS newspapers))) (NP (NN yesterday))) (, ,) (CC and) (VP (VBD wrote) (NP (NP (DT another) (JJ new) (NN ad)) (VP (VBG appearing) (NP (NN today)))))) (. .)))
Description: Errors with NP Attachments are a little less common than those with PP attachments, but can be even trickier. The NP Attachment in case 2 acts as an adverbial modifier of time and refers to the action of ‘appearing’ of the ‘ad’ in question, not to the action of ‘wrote’, which obviously happened ‘yesterday’. Although the correct place of the modifier is obvious from the sense, it is not obvious from the syntactic structure, which makes it difficult to parse the sentence properly.
Case 3: Modifier Attachment
You did something so utterly stupid, you had to think about it ahead of time.
(TOP (S (S (NP (PRP You)) (VP (VBD did) (NP (NP (NN something)) (ADJP (RB so) (RB utterly) (JJ stupid))))) (, ,) (S (NP (PRP you)) (VP (VBD had) (S (VP (TO to) (VP (VB think) (PP (IN about) (NP (PRP it)))))) (ADVP (RB ahead) (PP (IN of) (NP (NN time)))))) (. .)))
(TOP (S (S (NP (PRP You)) (VP (VBD did) (NP (NP (NN something)) (ADJP (RB so) (RB utterly) (JJ stupid))))) (, ,) (S (NP (PRP you)) (VP (VBD had) (S (VP (TO to) (VP (VB think) (PP (IN about) (NP (PRP it))) (ADVP (RB ahead) (PP (IN of) (NP (NN time))))))))) (. .)))
Description: Modifier Attachments are frequently misplaced in multiple-clause sentences. The main problem is to decide which clause they refer to as they rarely contain any information that can be helpful. The above shown example demonstrates an adverbial modifier (‘ahead of time’) that should modify the verb ‘think’, not the verb ‘had’.
Case 4: Clause Attachment
The bill intends to restrict the RTC to Treasury borrowings only unless the agency receives specific congressional authorization.
(TOP (S (NP (DT The) (NN bill)) (VP (VBZ intends) (S (VP (TO to) (VP (VB restrict) (NP (DT the) (NNP RTC)) (PP (TO to) (NP (NNP Treasury) (NNS borrowings))) (ADVP (RB only)) (SBAR (IN unless) (S (NP (DT the) (NN agency)) (VP (VBZ receives) (NP (JJ specific) (JJ congressional) (NN authorization))))))))) (. .)))
(TOP (S (NP (DT The) (NN bill)) (VP (VBZ intends) (S (VP (TO to) (VP (VB restrict) (NP (DT the) (NNP RTC)) (PP (TO to) (NP (NNP Treasury) (NNS borrowings))) (ADVP (RB only))))) (SBAR (IN unless) (S (NP (DT the) (NN agency)) (VP (VBZ receives) (NP (JJ specific) (JJ congressional) (NN authorization)))))) (. .)))
Description: Another problem with multi-clause sentences is the incorrect position of subordinate clauses. It frequently occurs when the subordinate clause comes after the main clause. Case 4 shows a conditional clause referring to the fact that ‘the bill intends to …’, not the action of ‘restricting’.
Case 5: VP Attachment
He’s never chuckled, or giggled, or even smiled.
(TOP (S (NP (PRP He)) (VP (VP (VBZ 's) (ADVP (RB never)) (VP (VBD chuckled) (, ,) (CC or) (VBD giggled))) (, ,) (CC or) (VP (ADVP (RB even)) (VBD smiled))) (. .)))
(TOP (S (NP (PRP He)) (VP (VBZ 's) (ADVP (RB never)) (VP (VP (VBD chuckled)) (, ,) (CC or) (VP (VBD giggled)) (, ,) (CC or) (VP (ADVP (RB even)) (VBD smiled)))) (. .)))
Description: Coordinate verbal phrases, like the one in the example above, are often misparsed. This type of error falls into the VP Attachment category.
Case 6: Coordination and PP Attachment
A 16.3% drop for Mannesmann AG and Dresdner Bank AG’s 9.6% decline were especially problematic for their respective boards.
(TOP (S (NP (NP (DT A) (ADJP (CD 16.3) (NN %)) (NN drop)) (PP (IN for) (NP (NP (NNP Mannesmann) (NNP AG)) (CC and) (NP (NP (NNP Dresdner) (NNP Bank) (NNP AG) (POS 's)) (ADJP (CD 9.6) (NN %)) (NN decline))))) (VP (VBD were) (ADJP (RB especially) (JJ problematic)) (PP (IN for) (NP (NP (PRP$ their) (JJ respective) (NNS boards))))) (. .)))
(TOP (S (NP (NP (NP (DT A) (ADJP (CD 16.3) (NN %)) (NN drop)) (PP (IN for) (NP (NNP Mannesmann) (NNP AG)))) (CC and) (NP (NP (NNP Dresdner) (NNP Bank) (NNP AG) (POS 's)) (ADJP (CD 9.6) (NN %)) (NN decline))) (VP (VBD were) (ADJP (RB especially) (JJ problematic)) (PP (IN for) (NP (NP (PRP$ their) (JJ respective) (NNS boards))))) (. .)))
Description: Coordinate structures are dangerous and can appear within almost any syntactic component (S, NP, VP, etc.). Probably, the most complicated cases are those of coordinate noun phrases that include prepositional phrases. Case 6 is an excellent example of such a problem.
Case 7: NP Internal Structure
Secretary of State Baker in a foreign policy speech called for the reunification of Germany.
(TOP (S (NP (NP (NNP Secretary)) (PP (IN of) (NP (NNP State) (NNP Baker)))) (PP (IN in) (NP (DT a) (JJ foreign) (NN policy) (NN speech))) (VP (VBD spoke) (PP (IN about) (NP (NP (DT the) (NN reunification)) (PP (IN of) (NP (NNP Germany)))))) (. .)))
(TOP (S (NP (NAC (NNP Secretary) (PP (IN of) (NP (NNP State)))) (NNP Baker)) (PP (IN in) (NP (DT a) (JJ foreign) (NN policy) (NN speech))) (VP (VBD spoke) (PP (IN about) (NP (NP (DT the) (NN reunification)) (PP (IN of) (NP (NNP Germany)))))) (. .)))
Description: Errors in the NP Internal Structure are considered to be a separate type of parsing errors due to the variety of components an NP can contain, and thus the variety of possible errors. The case above demonstrates the noun ‘Baker’ that has a premodifier ‘Secretary of State’, yet the parser failed to build an appropriate structure.
Case 8: Unary Mistake
Following is a breakdown of major market activity:
(TOP (S (SINV (VP (VBG Following)) (VBZ is) (NP (NP (DT a) (NN breakdown)) (PP (IN of) (NP (JJ major) (NN market) (NN activity))))) (: :)))
(TOP (S (SINV (S (VP (VBG Following))) (VBZ is) (NP (NP (DT a) (NN breakdown)) (PP (IN of) (NP (JJ major) (NN market) (NN activity))))) (: :)))
Unary mistakes include missing nodes and extra nodes in parse trees. These mistakes are usually local and don’t ruin the general structure of a parse tree. Consider the missing ‘S’ in the case above.
Case 9: Different Label
Young man, are you aware of what you eat and drink?
(TOP (S (NP (JJ Young) (NN man)) (, ,) (VP (VBP are) (NP (NP (PRP you)) (ADJP (JJ aware) (PP (IN of) (SBAR (WHNP (WP what)) (S (NP (PRP you)) (VP (VB eat) (CC and) (VB drink)))))))) (. ?)))
(TOP (SQ (NP (JJ Young) (NN man)) (, ,) (VP (VBP are) (NP (NP (PRP you)) (ADJP (JJ aware) (PP (IN of) (SBAR (WHNP (WP what)) (S (NP (PRP you)) (VP (VB eat) (CC and) (VB drink)))))))) (. ?)))
Description: Different labels of the nodes are not destructive, but still may cause some minor errors during tree processing. Case 8 shows an example of an incorrectly labelled question.
Case 10: Single word phrase + Extra node (PP) + Different Label (IN)
Eight hard-nosed and fearless litigators who spent their hours chewing up people.
(TOP (NP (NP (CD Eight) (JJ hard-nosed) (CC and) (JJ fearless) (NNS litigators)) (SBAR (WHNP (WP who)) (S (VP (VBD spent) (NP (PRP$ their) (NNS hours)) (S (VP (VBG chewing) (PP (IN up) (NP (NNS people)))))))) (. .)))
(TOP (NP (NP (CD Eight) (JJ hard-nosed) (CC and) (JJ fearless) (NNS litigators)) (SBAR (WHNP (WP who)) (S (VP (VBD spent) (NP (PRP$ their) (NNS hours)) (S (VP (VBG chewing) (PRT (RP up)) (NP (NNS people))))))) (. .)))
Description: Single word phrase mistakes have to do with improper phrase labels due to incorrect POS tags. In the case above, the word ‘up’ is a particle, not a preposition, and thus makes up the phrase ‘PRT’.
Case 11: Badly parsed noun clusters
Unlike other prices interest rates are to be politically determined.
(TOP (S (PP (IN Unlike) (NP (JJ other) (NNS prices) (NN interest) (NNS rates))) (VP (VBP are) (S (VP (TO to) (VP (VB be) (ADVP (RB politically)) (VP (VBN determined)))))) (. .)))
(TOP (S (PP (IN Unlike) (NP (JJ other) (NNS prices))) (NP (NN interest) (NNS rates)) (VP (VBP are) (S (VP (TO to) (VP (VB be) (ADVP (RB politically)) (VP (VBN determined)))))) (. .)))
Description: Noun clusters in English are deceptive enough to trick parsers. The case above shows an incorrect noun phrase consisting of three nouns. The last two should have comprised a separate noun phrase, a subject for the sentence.
Evaluation of the metrics
Now, let’s compare how well the evaluation metrics that were described above fare for the listed fail cases.
Tables 1 and 2 show the results of evaluating each of the example cases with the help of the described metrics. The numbers for all types of Parseval and for the Leaf-Ancestor evaluation stand for f-measure. Transformations, Minimum Edit Distance and Cross-Bracketing are represented by absolute numbers.
Table 1. Parseval evaluation of error cases
Table 2. Other metrics
|Case no.||Transformations||Minimum Edit Distance||Cross-Bracketing||Leaf-Ancestor|
The tables show the peculiarities of each metric. According to all types of Parseval, Cross-bracketing and both metrics for minimum edit distance, cases 8 and 9 (a unary mistake and an incorrect label) contain the least serious mistakes. Still, the Leaf-Ancestor metric showed a different result. Although we consider a unary mistake to be rather insignificant, the mistake with the label gives the tree the lowest score. If we look at the case, we can see that the mislabelled node is close to the top, which means that it will be in every lineage used to count the accuracy of parsing. Different label is not the kind of parsing error that we consider to be serious, and this makes the Leaf-Ancestor metric irrelevant for defining the best parser.
All Parseval variations (except leafless) show rather similar results, but there are still some differences. Parseval-split gives lower accuracy as it takes into account both the label and the structure of the node. This variation is not very interesting for us because it partially duplicates what more “extreme” Parseval and Unlabelled Parseval-split show. For instance, case 9 (different label) gave 100% score for Unlabelled Parsevals.
Weighted variants of Parseval showed a bit different results. According to these metrics, for example, the incorrect PP Attachment in case 1 is a more serious error than the incorrect VP Attachment (case 5) or the Noun cluster error (case 11), which are the most destructive ones, according to the non-weighted metrics. Most of the cases contained erroneous nodes that were deep enough in the tree to get penalized. Thus, these cases received a lower score, compared to the classic Parseval. Still, cases 5, 7, 10, and 11 turned out to get a higher score, which can be explained by the fact that the errors are close to the leaves, do not ruin the whole tree and get a low weight, as a result.
The Minimum edit distance and the number of transformations provided by Berkeley Analyzer showed almost the same numbers. The difference is only that in case 9 the renaming of a node is counted as two points by Berkeley Analyzer and that its transformations do not count renaming leaf tags (case 10). Both versions of minimum edit distance also directly correlate with Parseval scores.
Cross-bracketing, on the other hand, produced slightly different results. This metric does not count labels at all, and thus the gap between the labelling errors and the errors that ruin the tree structure is bigger. That is why case 9 got 0 points. Cases 8 and 10 (unary mistakes and single word phrase) got only 2 points. The closer to the top the mistake is, the more brackets cross; thus, all the other cases got a higher mistake count.
Finally, we consider Leafless Parseval-split to be the most appropriate metric, yet it will always give a significantly lower score than the popular ones. Case 6 (Coordination error with PP Attachment) contained the third easiest error, according to Parseval-split, but was lowered to the fifth position by Leafless Parseval-split, which we consider to be correct.
To sum up, if what you care about regarding parser accuracy is how well it builds the sentence structure, you should take the commonly reported performance numbers with a huge grain of salt and maybe take a look at other evaluation metrics, like the ones that we propose.
Kummerfeld J. K., Hall D., Curran J. R. and Klein D. Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output // Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL’12). - 2012. - pp. 1048-1059.