python - How do I calculate the precision and recall of a POS tagger?

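I'm using several rule-based and statistical POS taggers to tag a corpus (about 5000 sentences) with parts of speech (POS). Below is a snippet of my test corpus, in which each word is separated from its POS tag by "/".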

No/RB ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./.
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/NNP as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/NNP plunged/VBD 190.58/CD points/NNS --/: most/JJS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NN ./.
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VBP 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/NN panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.

After tagging, the tagger's output for the same sentences looks like this:


No/DT ,/, it/PRP was/VBD n't/RB Black/NNP Monday/NNP ./. 
But/CC while/IN the/DT New/NNP York/NNP Stock/NNP Exchange/NNP did/VBD n't/RB fall/VB apart/RB Friday/VB as/IN the/DT Dow/NNP Jones/NNP Industrial/NNP Average/JJ plunged/VBN 190.58/CD points/NNS --/: most/RBS of/IN it/PRP in/IN the/DT final/JJ hour/NN --/: it/PRP barely/RB managed/VBD *-2/-NONE- to/TO stay/VB this/DT side/NN of/IN chaos/NNS ./.
Some/DT ``/`` circuit/NN breakers/NNS ''/'' installed/VBN */-NONE- after/IN the/DT October/NNP 1987/CD crash/NN failed/VBD their/PRP$ first/JJ test/NN ,/, traders/NNS say/VB 0/-NONE- *T*-1/-NONE- ,/, *-2/-NONE- unable/JJ *-3/-NONE- to/TO cool/VB the/DT selling/VBG panic/NN in/IN both/DT stocks/NNS and/CC futures/NNS ./.

I need to compute the tagging accuracy (tag-wise recall and precision), so I need to find the errors (if any) in each word/tag pair.




In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, which are items which were not labeled as belonging to the positive class but should have been).
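As a rough illustration of these two definitions, here is a minimal sketch in Python 3. The helper name `precision_recall` is made up for this example, and returning 0.0 when a denominator is zero is an assumption rather than a standard. The DT counts are read off the three sample sentences in the question: the tagger labels 9 words as DT, 8 of them correctly (the extra one is "No", which should be RB), and it misses no true DT.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one tag; 0.0 when a denominator is zero."""
    labeled = true_positives + false_positives  # everything the tagger gave this tag
    actual = true_positives + false_negatives   # everything that truly has this tag
    precision = true_positives / labeled if labeled else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# DT in the sample: 8 true positives, 1 false positive ("No"), 0 false negatives
dt_precision, dt_recall = precision_recall(8, 1, 0)
print(f"DT precision={dt_precision:.3f} recall={dt_recall:.3f}")
```

So for DT on this snippet, recall is perfect while precision is 8/9, because the only DT error is a word wrongly pulled into the class.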

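Note that since every word has exactly one tag, overall recall and precision scores are not meaningful for this task (they would both equal the accuracy measure). But it does make sense to ask for per-tag recall and precision; for example, you could find the recall and precision for the DT tag.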
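To see why the overall scores collapse into accuracy, here is a small sketch (Python 3) using the tags of the first sample sentence. Every wrong token adds exactly one false negative (for the true tag) and one false positive (for the guessed tag), so the summed counts give identical denominators. The variable names are only illustrative.

```python
from collections import Counter

# Tags of the first sample sentence: gold standard vs. tagger output
gold = ["RB", ",", "PRP", "VBD", "RB", "NNP", "NNP", "."]
pred = ["DT", ",", "PRP", "VBD", "RB", "NNP", "NNP", "."]

tp, fp, fn = Counter(), Counter(), Counter()
for g, p in zip(gold, pred):
    if g == p:
        tp[g] += 1
    else:
        fn[g] += 1  # the true tag was missed
        fp[p] += 1  # the guessed tag was wrongly assigned

total_tp = sum(tp.values())
total_fp = sum(fp.values())
total_fn = sum(fn.values())

accuracy = total_tp / len(gold)
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)
print(accuracy, micro_precision, micro_recall)
```

All three values come out the same (7 correct tokens out of 8), which is why only the per-tag breakdown carries extra information.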


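I haven't tested this code, and my Python is a bit rusty, but it should give you the idea. I'm assuming the files are already open and that the totals data structure is a dictionary of dictionaries: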

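from collections import defaultdict

# Assumes testFile (gold standard) and taggedFile (tagger output) are already open.
# totals maps each tag to its truePositives / falsePositives / falseNegatives counts.
totals = defaultdict(lambda: defaultdict(int))

for trueLine in testFile:
    taggedLine = taggedFile.readline()
    if not taggedLine:  # tagged file ended early
        raise ValueError('Error: files are out of sync.')
    trueLine = trueLine.split()  # tokenise by whitespace
    taggedLine = taggedLine.split()
    if len(trueLine) != len(taggedLine):
        raise ValueError('Error: files are out of sync.')
    for trueToken, taggedToken in zip(trueLine, taggedLine):
        # split on the last '/' so a word that itself contains '/' keeps its tag
        trueWord, trueTag = trueToken.rsplit('/', 1)
        taggedWord, guessedTag = taggedToken.rsplit('/', 1)
        if trueWord != taggedWord:  # the words should match
            raise ValueError('Error: files are out of sync.')
        if trueTag == guessedTag:
            totals[trueTag]['truePositives'] += 1
        else:
            # a wrong guess is a miss for the true tag and a false alarm for the guessed tag
            totals[trueTag]['falseNegatives'] += 1
            totals[guessedTag]['falsePositives'] += 1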
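
Once the counts are collected, the per-tag scores follow directly from the definitions. The sketch below (Python 3) assumes totals has the dictionary-of-dictionaries shape described above; the function name `per_tag_scores` and the choice to report 0.0 for an empty denominator are assumptions made for this example. The example counts for DT and RB are the ones obtained from the three sample sentences in the question.

```python
def per_tag_scores(totals):
    """Map each tag to a (precision, recall) pair computed from its counts."""
    scores = {}
    for tag, counts in totals.items():
        tp = counts.get('truePositives', 0)
        fp = counts.get('falsePositives', 0)
        fn = counts.get('falseNegatives', 0)
        precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 if the tag was never guessed
        recall = tp / (tp + fn) if tp + fn else 0.0     # 0.0 if the tag never occurs in the gold data
        scores[tag] = (precision, recall)
    return scores

# Counts for two tags, taken from the three sample sentences
example_totals = {
    'DT': {'truePositives': 8, 'falsePositives': 1, 'falseNegatives': 0},
    'RB': {'truePositives': 4, 'falsePositives': 0, 'falseNegatives': 1},
}

scores = per_tag_scores(example_totals)
for tag in sorted(scores):
    precision, recall = scores[tag]
    print(f"{tag}\tprecision={precision:.3f}\trecall={recall:.3f}")
```

The two rows mirror each other here because the single "No" error is a false negative for RB and a false positive for DT.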

