The problem of quality assessment of NLP tools
May 16th, 2008
Hello, my name is Peter Mancini and I’ve been with Digital Reasoning for over 3 years. Lately my work has been focused on developing quality metrics for our NLP/UDA tools. Below is a somewhat loose discussion of the philosophy of quality assessment. It is not a thorough treatment but more a starting point for discussion. I am always interested in hearing comments about my work, and in this case I am particularly interested since it is new ground for me.
The difficulties with measuring the quality of any natural language system are great. The main source of difficulty is the subjective nature of language. Language is also living: a speaker or author can get creative and use words in ways that break semantic rules but still convey meaning. I believe that the vast majority (>90%) of written text breaks the rules of the standard language at least once per page, if not more. Further, look at spoken language captured in transcripts; people reconstruct sentences halfway through. Speakers will often not even finish sentences but trade sentence fragments back and forth, and both speakers will feel they understood the dialog correctly. A decent NLP system has to deal with this, since most of the input data will typically not be peer-reviewed white papers but text collected from many sources. I propose that the only useful measure of the quality of NLP tools is utility: how useful is the tool at helping the wielder perform their task? Note the difference between “utility” and “correctness.” Correctness is getting the standard language right; utility is leveraging the underlying meaning of the text for further purposes.
Utility vs. Correctness
Utility is the ability to use the output to perform a task. Correctness is the ability of the tool to make its output conform to a set of rules. However, as we said, language is living and the rules are often broken. Many times they are broken by accident, and other times they are broken deliberately to convey a special meaning. Another aspect is looking at how an entity might be used and possibly replacing its canonical definition with a more useful one.
Take this as an example. Your NLP processor is parsing the following text:
“The hospital is expected to be opened on 12/02/2008.”
In the above utterance we see at the end a string of numbers and slashes. Most systems would classify 12/02/2008 as one cardinal number. Other systems would classify it as 3 cardinal numbers with separators. Either way is correct. However, when the same date is written as follows:
“The hospital is expected to be opened on 12 February 2008.”
…then it becomes Cardinal Number, Proper Noun Singular and Cardinal Number. OK, correct, but is it useful to be correct here? How is this presentation substantially different from the prior one, other than cosmetically? In this case it would be more interesting from a programming perspective to treat February as a type of Cardinal Number. This gives the programmer more power to recognize dates no matter how they are presented. You run into problems with the months of April, May, June and August, but only because they are also names of people. There is always a trade-off. The point is that you can get more utility out of the system by identifying dates more easily with one approach than with another. The quality of the system therefore shouldn’t be based upon how correct it is but on how well the output can be leveraged by software.
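To make the date example concrete, here is a minimal sketch in Python of what treating a month name as a number buys the programmer. It is an illustration only, not how our tools work: the names `MONTH_NUMBER`, `NUMERIC_DATE`, `WRITTEN_DATE` and `find_dates` are hypothetical, and the numeric form is assumed to be day/month/year so that it agrees with the example sentence.

```python
import re
from datetime import date

# Treat each month name as the number it stands for.
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
MONTH_NUMBER = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

# Assumption: numeric dates are written day/month/year, as in the example.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
WRITTEN_DATE = re.compile(r"\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b")


def _make_date(year, month, day):
    """Return a date, or None when the numbers do not form a real date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def find_dates(text):
    """Return the dates found in text: numeric forms first, then written forms."""
    candidates = []
    for m in NUMERIC_DATE.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        candidates.append(_make_date(year, month, day))
    for m in WRITTEN_DATE.finditer(text):
        # A word counts as a month only when it sits between a day and a year,
        # which sidesteps most uses of April, May, June and August as names.
        month = MONTH_NUMBER.get(m.group(2).lower())
        if month is not None:
            candidates.append(_make_date(int(m.group(3)), month, int(m.group(1))))
    return [d for d in candidates if d is not None]


numeric = find_dates("The hospital is expected to be opened on 12/02/2008.")
written = find_dates("The hospital is expected to be opened on 12 February 2008.")
print(numeric == written)
```

Both presentations come out as the same value, so downstream software no longer cares which one the author happened to use.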
Soft vs. Hard Assessment
All quality assessments have some measure of subjectivity to them. When you look at entity extraction, for example, even with well defined concepts such as “location” you end up with ambiguities that are resolved more by opinion than objective determination.
Consider these examples:
- “We traveled to Boston.”
- “We were there in 1999.”
- “He was forced into exile.”
- “The accusation was all in his mind.”
In example 1 we have an unambiguous location. In example 2 we have a location in time. Time is the 4th dimension, but we can’t place the location geographically. In example 3 we have an abstract location. It might be possible to determine the extents of a physical location, but one could be in exile in a very abstract way that does not imply a change of location. In example 4 we have a further example of a highly abstract location. You can take this to further abstractions. “They ran in the Boston Marathon” might be a location to some, as it implies a well-established route. In various discussions of this last example some have told me they would prefer the system to designate the location as Boston, Massachusetts. As a longtime resident of the area I can assure you less than a mile of the race is actually within the city limits of Boston. Also, the word Boston here is actually a modifier for Marathon, so if you accept Boston Marathon then you also have to accept other places where Boston is a modifier, such as these: Boston Cream Pie, Boston Whaler, Boston College (actually located in Chestnut Hill), Boston Legal (only exists on TV), Douglas A-20 Boston (a WWII bomber), Boston Market (a chain of restaurants) and many others.
The key in situations like this is to set your rules and be consistent when scoring based upon them. One also has to keep the amount of subjective perspective in any analysis to the absolute minimum, and you do that by setting up reasonable constraints on the concept. Otherwise you have “open ended criteria” without a pre-determined expected outcome, which means that any analysis done with these parameters will always leave you victim to too many false positives being accepted as true.
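One way to picture scoring against a fixed rule is the sketch below. It uses ordinary precision and recall, which is my choice for illustration and not a method described in this post. The rule assumed here is "a location must be placeable on a map", and the extractor output is invented for the example; `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Score predicted entities against a gold set fixed before the run."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Gold answers under the assumed rule, as (example number, text) pairs.
# Only "Boston" in example 1 can be placed on a map.
gold = {(1, "Boston")}

# Invented output from an imaginary extractor that also accepts abstract places.
predicted = {(1, "Boston"), (3, "exile")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Because the gold set is written down before the run, "exile" counts as a false positive no matter how persuasive the argument for it might sound afterwards.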
Our biggest challenge today is coming up with a model for qualitative analysis of Association Networks (AN). These are terms that are associated with a given term. They can include synonyms, narrower terms, broader terms, attributes and a host of other things due to semantic interaction. Here we will first be looking to evaluate quality through utility. It would be hard to measure “correctness” for two reasons. The first is: what is the definition of correctness for an association network? The second is that it takes a lot of data to create a good association network, and thus building a reference set that is so perfectly understood that one can predict all of the output from the input is prohibitively expensive. This second issue is more a testament to my laziness than to the impossibility of measuring correctness, but there you have it. Looking at utility, we still have an issue with subjectivity. The key will be to minimize that subjectivity so that creative and irreproducible results do not plague the analysis.
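For readers who think better in code, an association network can be pictured roughly like this. The sketch is an assumption about shape only, not our implementation: the class `AssociationNetwork`, the relation labels and the sample terms are all made up for illustration.

```python
from collections import defaultdict

# Relation labels taken from the description above; a real network would
# have many more, so treat this set as illustrative.
RELATION_TYPES = {"synonym", "narrower", "broader", "attribute"}


class AssociationNetwork:
    """Maps a term to the terms associated with it, each tagged with a relation."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, term, related, relation):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation: {relation}")
        self._edges[term].add((related, relation))

    def associates(self, term, relation=None):
        """Return associated terms, optionally limited to one relation, sorted."""
        return sorted(
            related
            for related, rel in self._edges.get(term, ())
            if relation is None or rel == relation
        )


# Invented sample data.
network = AssociationNetwork()
network.add("hospital", "infirmary", "synonym")
network.add("hospital", "clinic", "narrower")
network.add("hospital", "building", "broader")
print(network.associates("hospital"))
```

Even in this toy form the correctness problem is visible: nothing in the structure says whether "clinic" belongs under "hospital", only whether having it there helps someone do their job.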
My next post will be about measuring the quality of Association Networks - what I’ve tried and what I think works.










