标记是文本处理的基本特征,我们将单词标记为语法分类。借助tokenization
和pos_tag
函数来为每个单词创建标签。
import nltk text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest") tagged_text=nltk.pos_tag(text) print(tagged_text)
执行上面示例代码,得到以下结果 -
[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), ('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('nest', 'JJS')]
标签说明
可以使用以下显示内置值的程序来描述每个标记的含义。
import nltk nltk.help.upenn_tagset('NN') nltk.help.upenn_tagset('IN') nltk.help.upenn_tagset('DT')
当运行上面的程序时,我们得到以下输出 -
NN: noun, common, singular or mass common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... IN: preposition or conjunction, subordinating astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ... DT: determiner all an another any both del each either every half la many much nary neither no some such that the them these this those
标记语料库
还可以标记语料库数据并查看该语料库中每个单词的标记结果。参考以下实现代码 -
import nltk from nltk.tokenize import sent_tokenize from nltk.corpus import gutenberg sample = gutenberg.raw("blake-poems.txt") tokenized = sent_tokenize(sample) for i in tokenized[:2]: words = nltk.word_tokenize(i) tagged = nltk.pos_tag(words) print(tagged)
执行上面示例代码,得到以下结果 -
[([', 'JJ'), (Poems', 'NNP'), (by', 'IN'), (William', 'NNP'), (Blake', 'NNP'), (1789', 'CD'), (]', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (AND', 'NNP'), (OF', 'NNP'), (EXPERIENCE', 'NNP'), (and', 'CC'), (THE', 'NNP'), (BOOK', 'NNP'), (of', 'IN'), (THEL', 'NNP'), (SONGS', 'NNP'), (OF', 'NNP'), (INNOCENCE', 'NNP'), (INTRODUCTION', 'NNP'), (Piping', 'VBG'), (down', 'RP'), (the', 'DT'), (valleys', 'NN'), (wild', 'JJ'), (,', ','), (Piping', 'NNP'), (songs', 'NNS'), (of', 'IN'), (pleasant', 'JJ'), (glee', 'NN'), (,', ','), (On', 'IN'), (a', 'DT'), (cloud', 'NN'), (I', 'PRP'), (saw', 'VBD'), (a', 'DT'), (child', 'NN'), (,', ','), (And', 'CC'), (he', 'PRP'), (laughing', 'VBG'), (said', 'VBD'), (to', 'TO'), (me', 'PRP'), (:', ':'), (``', '``'), (Pipe', 'VB'), (a', 'DT'), (song', 'NN'), (about', 'IN'), (a', 'DT'), (Lamb', 'NN'), (!', '.'), (u"''", "''")]