Precision and recall are discussed in Section 4.1.2 of “Corpus Linguistics: A Guide to the Methodology” (CLGM) (p. 111–116). Students frequently seem to have more difficulty understanding these concepts than I would have expected, so I looked around to see how other people have explained them. A fishing metaphor is often used, which I quite like, as it has the potential to explain many other aspects of corpus linguistics. So I decided to write my own version of such a metaphorical explanation, which I may or may not include in a future edition of CLGM.
The first part explains the concept of precision and recall itself and would be inserted toward the beginning of the section.
Think of yourself as a captain on a fishing vessel trawling a fishing ground for tuna. In this case, recall would be a measure of what proportion of the tuna in the fishing ground are actually caught, while precision would be a measure of what proportion of the animals that are caught are actually tuna.
Obviously, it is desirable to maximize recall: the more tuna you catch, the more profit you will make. The easiest way of maximizing recall is to use a large net that will catch anything in its path. This will catch most of the tuna in your fishing grounds, but it will also catch many other animals – including other fish, turtles and dolphins. In other words, while the method has a high recall, it has a very low precision. This means that a lot of additional labor has to be invested in sorting the catch afterwards (and, of course, it means getting in trouble for overfishing and for including protected species like dolphins as by-catch).
You can maximize precision, for example, by using fishing rods and bait that is specific to the diet of tuna and setting them at a depth preferred by tuna. This will ensure that almost all of the animals you catch are actually tuna, so that you will not have to invest any additional labor in sorting through the catch (also, you will not endanger dolphins, who are too smart to go after bait on a fishing line). However, the recall will be much lower than if you were to use a fishing net, as you can only set up so many fishing rods. Thus, you will miss many of the tuna in your fishing grounds.
As corpus linguists, we want to find a reasonable compromise between precision and recall – we want to catch as many instances of the phenomenon under investigation as possible, while keeping to a minimum the additional labor we have to invest in manually going through the results to remove the linguistic by-catch (luckily, we don’t have to worry about hurting endangered words and constructions in the process, as our corpus searches, unlike the foray of a tuna-fishing vessel, are non-destructive).
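To put the metaphor into numbers: both measures are simple ratios over the same catch. The following sketch is purely illustrative – all of the counts are invented, and in practice the total number of tuna in the fishing grounds is, of course, unknown.

```python
# Hypothetical catch from one trawl (all counts invented for illustration)
tuna_caught = 120            # tuna that ended up in the net
other_animals_caught = 380   # by-catch: other fish, turtles, dolphins
tuna_in_grounds = 200        # all tuna in the fishing grounds (normally unknown)

precision = tuna_caught / (tuna_caught + other_animals_caught)
recall = tuna_caught / tuna_in_grounds

print(f"precision: {precision:.2f}")  # 0.24 – most of the catch is not tuna
print(f"recall:    {recall:.2f}")     # 0.60 – but most of the tuna were caught
```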
The second part explains how to estimate the precision and recall of a particular corpus query and would occur later in the chapter.
But how do we determine the precision and recall of our query? Precision is easy to determine – you simply go through the results of your query and count the hits that correspond to the phenomenon you are looking for. You then divide this number by the total number of hits, giving you a decimal fraction (on a tuna-fishing vessel, you would count the number of tuna and express it as a fraction of the total number of animals you have caught).
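In code, this amounts to a single division. The sketch below assumes that we have already gone through the query results and marked each hit as a genuine instance of the phenomenon or as by-catch; the hit list and its annotation are invented for illustration.

```python
# Hypothetical, hand-annotated query results:
# True = genuine instance of the phenomenon, False = linguistic by-catch
hits = [True, True, False, True, False, True, True, False, True, True]

true_hits = sum(hits)              # hits that really are the phenomenon
precision = true_hits / len(hits)

print(f"{true_hits} of {len(hits)} hits are genuine: precision = {precision:.2f}")
# 7 of 10 hits are genuine: precision = 0.70
```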
In contrast, recall is impossible to determine directly. On a tuna vessel, it would be the number of tuna in your catch expressed as a fraction of all tuna in your fishing grounds. But how do you know how many tuna there are in your fishing grounds if you have not caught them? You don’t, obviously, but you can come up with an estimate: you trawl a representative part of the fishing grounds with a procedure that maximizes recall, determine the number of tuna in your catch, and then generalize this number to the entire fishing grounds. In commercial fishing, this is actually done: Every year, governmental agencies send out research vessels to representative parts of the fishing grounds, where they use nets to catch, categorize and count the animals. By comparing the area in which the research vessel cast its net to the entire area of the fishing grounds, they then estimate the total number of all animals. Imagine that the area fished by the research vessel is one square mile and the entire fishing grounds cover one thousand square miles: if they caught 15 tuna, this means that there are an estimated 15 * 1000 = 15 000 tuna in the fishing grounds.
In corpus linguistics, we can do the same thing: We take a representative part of our corpus and search it by a method that maximizes recall – typically, this means reading the files in their entirety and identifying all instances of the phenomenon manually. We then estimate the total number of occurrences of our phenomenon in the same way as the researchers on the research vessel: Imagine that the files we went through manually have a size of one hundred thousand words and the entire corpus has a size of one hundred million words: If our manually searched sample contains 15 cases of the phenomenon under investigation, this means that there are approximately 15 * 1000 = 15 000 cases in the entire corpus. We can now check how many instances of the phenomenon our query actually finds – that is, the genuine hits that remain after removing the by-catch – and express this number as a proportion of the estimated 15 000 cases.
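Again, the arithmetic itself is trivial once the manual work is done. The sketch below uses the sample and corpus sizes from the example above; the number of genuine hits returned by the query is an invented figure.

```python
# Sizes of the manually searched sample and of the whole corpus (in words)
sample_size = 100_000
corpus_size = 100_000_000

# Instances of the phenomenon found by reading the sample in its entirety
cases_in_sample = 15

# Extrapolate from the sample to the whole corpus: 15 * 1000 = 15 000
estimated_total = cases_in_sample * (corpus_size / sample_size)

# Genuine hits returned by our query on the whole corpus (invented figure)
true_hits_from_query = 12_000

recall = true_hits_from_query / estimated_total
print(f"estimated total cases: {estimated_total:.0f}")  # 15000
print(f"estimated recall:      {recall:.2f}")           # 0.80
```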
[Students were also confused by the fact that I illustrated the concept of recall using a parsed section of the ICE-GB (p. 114f) – if the corpus is parsed, they argued, why not just use that information? Thus, a future edition of CLGM will use a manually-searched unparsed corpus instead, but that is a topic for another post.]