Measurement improves software development
July 10th, 2008
There are two possible outcomes:
if the result confirms the hypothesis, then you’ve made a measurement.
If the result is contrary to the hypothesis, then you’ve made a discovery.
Enrico Fermi

A couple of years ago we started a process of programming that was very different from anything I’ve seen in the 15 years or so that I’ve been at it. We were challenged to produce a geographical location service built upon our entity extraction technology. It was an interesting exercise, and one we had no experience with at the time. The object of the game is to read in text documents, discover location references, disambiguate them, look them up in a gazetteer and mark them up with their coordinates. The markup can be added either as a final section appended to the document or, the more difficult case, in-line.
So off we went. The very first attempts at measuring this were mine. I had taken a lot of statistics in college but never thought I’d really get to use it. I came up with my own measures, which turned out to be pretty close to recall and precision. Giving both numbers just didn’t fly with the management at the time. It was confusing. They wanted one number. After a little research I discovered recall, precision and the mysterious F1 (or F-Measure), which combines the two into a single score.
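For reference, the standard textbook definitions of the three measures are given below. The symbols TP, FP and FN (counts of true positives, false positives and false negatives) are notation assumed here; the post does not say which exact formulas the team finally settled on.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```

F1 is the harmonic mean of precision P and recall R, so it stays low unless both are reasonably high. As written it lies between 0 and 1; multiplying by 100 gives a score on the out-of-100 scale.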
In the case of this task we defined tokens as either relevant or irrelevant. If the token represented a PPL (populated place) then it was relevant; otherwise it was irrelevant. If a relevant item was marked up with the correct location it was a true positive. If it was not marked up, or was marked up with the wrong location, it was a false negative. If an irrelevant item was marked up it was a false positive. The debates raged over what to do when the system found a location but did not disambiguate it correctly, and over what to do when tokens were improperly co-located (as in, what if “Rio de Janeiro” came up as “de Janeiro” instead?). Ultimately we decided to keep it simple. Any error below the level of marking something right or wrong was deemed just a detail.
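A minimal sketch of that bookkeeping might look like the following. It assumes the standard convention: a populated place that is missed or given the wrong location counts as a false negative, and a markup on an irrelevant token counts as a false positive. The function name, the dictionary layout and the sample tokens and coordinates are all hypothetical, made up for illustration rather than taken from the actual system.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def score_markup(reference: Dict[str, Optional[Coord]],
                 predicted: Dict[str, Optional[Coord]]) -> Dict[str, int]:
    """Count outcomes per token.

    reference maps each token to its true coordinates, or None if the
    token is irrelevant (not a populated place). predicted holds only
    the tokens the system marked up.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for token, truth in reference.items():
        guess = predicted.get(token)
        if truth is not None:
            # Relevant token: right location is a hit, anything else a miss.
            if guess == truth:
                counts["tp"] += 1
            else:
                counts["fn"] += 1
        elif guess is not None:
            # Irrelevant token that was marked up anyway.
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical reference set and system output (illustrative values only).
reference = {
    "Paris": (48.86, 2.35),
    "visited": None,
    "Springfield": (39.80, -89.64),
    "Jordan": None,  # a person's name in this imagined document
    "Lima": (-12.05, -77.04),
}
predicted = {
    "Paris": (48.86, 2.35),          # correct
    "Springfield": (37.21, -93.29),  # wrong Springfield
    "Jordan": (31.95, 35.93),        # should not have been marked up
}

counts = score_markup(reference, predicted)
print(counts)
```

Comparing coordinates for exact equality keeps the sketch simple; a real harness would more likely compare gazetteer identifiers or allow a distance tolerance.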
It took a lot of measurements and a lot of debate, but we got it to work. This learning process produced a lot of healthy discussion, and when we finally decided which formulas were best, everyone could clearly see how to proceed.
The first day we calculated the f-measure of our geo-coordinate markup service, it came up an astoundingly low 37 out of 100. I went over the numbers several times. Management wasn’t happy. What was decided next ended up being a great model for future development. We were put in a conference room with our computers and a white board. We were told not to leave until the f-measure was above 80. The way the development worked, one person worked on the trained categories system and another did the application programming. I was doing measurements and creating reference sets. Three of us working towards one task, side by side.
We would discuss potential strategies and then run them through the test harness. Every strategy would impact recall and precision. Often this would show how the two are opposed: as one increases, the other decreases. What you are looking for is a trade that is not equal, such that the f-measure rises. You want the decrease to be smaller than the increase. While it seems obvious, most people don’t program that way. They come up with a bunch of ideas, implement them and just accept the measurements they get. In our case each change was tested individually. Yes, it was slow, but it separated the good ideas from the bad ones. We also, in this way, discovered other weaknesses, which we then fixed. If we had not been looking at this on a case-by-case basis we would have missed the subtle clues that helped us iron out the other parts of the system that were contributing to the final result.
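The accept-or-reject loop described here can be sketched roughly as follows. This is an illustration under assumptions, not the harness we actually used: the function names and all of the counts are hypothetical, and the f-measure is the standard F1 on a 0 to 1 scale.

```python
from typing import Dict, Tuple


def f_measure(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, guarding against 0/0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def keep_change(baseline: Dict[str, int], candidate: Dict[str, int]) -> bool:
    """Keep a strategy only if it raises F1 over the current baseline."""
    return f_measure(**candidate)[2] > f_measure(**baseline)[2]


# Hypothetical counts against a reference set of 100 relevant tokens.
baseline = {"tp": 40, "fp": 10, "fn": 60}
strategy_a = {"tp": 60, "fp": 25, "fn": 40}  # precision falls, recall rises more
strategy_b = {"tp": 42, "fp": 40, "fn": 58}  # small recall gain, precision collapses

for name, counts in [("A", strategy_a), ("B", strategy_b)]:
    p, r, f1 = f_measure(**counts)
    verdict = "keep" if keep_change(baseline, counts) else "reject"
    print(f"strategy {name}: P={p:.2f} R={r:.2f} F1={f1:.2f} -> {verdict}")
```

In this made-up example both strategies lower precision, but only strategy A gains enough recall to lift the f-measure, so it is the only one kept. Testing one change at a time is what makes that attribution possible.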
I believe that honestly measuring your tools’ accuracy is important not just for sales and customer reassurance but also for the whole development life cycle. Efforts are underway to have the unsupervised portion of the DRS system help get the Geo Reasoning system to an f-measure of 90 or above. Right now 75-80 is state of the art. Every point of f-measure gained beyond 80 is far more difficult to achieve than any of the ones before it. However, a learning system should be capable of this feat. More on that later.










