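Named entity recognition (NER) is the task of identifying information units in unstructured text: names (of people, organizations, and locations) as well as numeric expressions such as times, dates, and monetary amounts. The goal is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.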
1. Data
The data is a feature-engineered corpus annotated with BIO and POS tags.
Basic information about the entity types:
- geo - Geographical Entity
- org - Organization
- per - Person
- gpe - Geopolitical Entity
- tim - Time indicator
- art - Artifact
- eve - Event
- nat - Natural Phenomenon
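BIO (beginning-inside-outside) is a common format for tagging tokens.
- B - prefix before a tag, indicating that the token begins a chunk.
- I - prefix before a tag, indicating that the token is inside a chunk.
- O - indicates that the token does not belong to any chunk.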
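To make the B/I/O prefixes concrete, here is a minimal sketch that groups BIO-tagged tokens back into entity chunks. The helper name `bio_to_spans` is hypothetical (it is not part of the corpus or of any library used in this post); the tokens and tags are taken from one sentence of the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) chunks."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open chunk
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Indonesian", "President", "Susilo", "Bambang", "Yudhoyono",
          "is", "scheduled", "to", "visit", "Moscow", "."]
tags = ["B-gpe", "B-per", "I-per", "I-per", "I-per",
        "O", "O", "O", "O", "B-geo", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
# [('gpe', 'Indonesian'), ('per', 'President Susilo Bambang Yudhoyono'), ('geo', 'Moscow')]
```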
# import package
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
The full dataset is fairly large and cannot be loaded entirely into memory here, so we use only the first 10,000 records.
df = pd.read_csv('datasets/entity-annotated-corpus/ner_dataset.csv')
print(df.head())
print(df.isnull().sum())
# The full dataset does not fit in a single machine's memory, so we keep the first 10,000 records
# and use out-of-core learning algorithms to fetch and process the data efficiently.
df = df[:10000]
# Output
Sentence # Word POS Tag
0 Sentence: 1 Thousands NNS O
1 NaN of IN O
2 NaN demonstrators NNS O
3 NaN have VBP O
4 NaN marched VBN O
Sentence # 1000616
Word 0
POS 0
Tag 0
dtype: int64
2. Data preprocessing
The "Sentence #" column contains many NaN values, so we fill each NaN with the preceding value.
# Data preprocessing
# Forward-fill the NaN values in the "Sentence #" column with the preceding value.
df = df.ffill()
print(df.head())
# (457 2746 17): 457 sentences, 2746 unique words, 17 distinct tags
print(df['Sentence #'].nunique(), df.Word.nunique(), df.Tag.nunique())
# Output
Sentence # Word POS Tag
0 Sentence: 1 Thousands NNS O
1 Sentence: 1 of IN O
2 Sentence: 1 demonstrators NNS O
3 Sentence: 1 have VBP O
4 Sentence: 1 marched VBN O
457 2746 17
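We have 457 sentences containing 2,746 unique words, labeled with 17 distinct tags.
We use DictVectorizer to convert the text into vectors, then split the data into training and test sets.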
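As an illustration of what DictVectorizer does with string-valued columns, the sketch below vectorizes three hand-written records (toy data, not read from the corpus). Each distinct `key=value` pair becomes one column, so every row is a one-hot encoding of its word and POS tag.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records shaped like rows of the corpus (illustrative only)
records = [
    {"Word": "Thousands", "POS": "NNS"},
    {"Word": "of", "POS": "IN"},
    {"Word": "demonstrators", "POS": "NNS"},
]

vec = DictVectorizer(sparse=False)
matrix = vec.fit_transform(records)

# One column per distinct "key=value" pair, sorted by name
feature_names = list(vec.feature_names_)
print(feature_names)
print(matrix)
```

Because every distinct word gets its own column, the dense matrix grows quickly with vocabulary size, which is one reason the post works with a 10,000-row subset.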
# Convert the records to vectors with DictVectorizer, then split into training and test sets
X = df.drop('Tag', axis=1)
v = DictVectorizer(sparse=False)
# print(X[:2].to_dict('records'))
# X = v.fit_transform(X[:2].to_dict('records'))
# print(X)
# # Print the feature name for each dimension
# print('result array feature name:\n', v.get_feature_names_out())
X = v.fit_transform(X.to_dict('records'))
y = df.Tag.values
classes = np.unique(y)
classes = classes.tolist()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
print(X_train.shape, y_train.shape)
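# The tag "O" (outside) is by far the most common, so including it would make the results look better than they really are.
# We therefore drop "O" when computing the classification metrics.
new_classes = classes.copy()
new_classes.remove('O')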
print(new_classes)
# Output
(6700, 3242) (6700,)
['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim']
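The effect of the dominant "O" tag can be seen on a toy example. The sketch below uses hand-written labels (not taken from the corpus) and a trivial baseline that predicts "O" for every token: plain accuracy looks respectable, while the micro-averaged F1 restricted to the entity tags is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels: 8 of 10 tokens are outside any entity
y_true = ["O", "O", "O", "O", "O", "O", "O", "O", "B-per", "B-geo"]
# A baseline that never predicts an entity
y_pred = ["O"] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Restrict the metric to the entity tags, as done with new_classes above
entity_labels = sorted(set(y_true) - {"O"})
entity_f1 = f1_score(y_true, y_pred, labels=entity_labels,
                     average="micro", zero_division=0)

print(accuracy, entity_f1)
```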
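3. Out-of-core algorithms
We will try several out-of-core algorithms, which are designed for data too large to fit in a single machine's memory. In scikit-learn these estimators expose a partial_fit method.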
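The sketch below shows how partial_fit is typically fed one batch at a time. It uses synthetic data and an arbitrary batch size of 100 (both are assumptions for illustration, not values from this post). The full list of classes must be supplied on the first call, because a single batch may not contain every label.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset that would be streamed from disk
rng = np.random.RandomState(0)
X = rng.rand(600, 5)
y = (X[:, 0] > 0.5).astype(int)  # label depends only on the first feature
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
batch_size = 100  # illustrative choice
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call performs one pass over the given batch and updates the model
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X, y)
print(accuracy)
```

In the sections that follow, partial_fit is called once on the whole training set, which amounts to a single pass over the data. In a real out-of-core setting the loop above would read each batch from disk, for example with the chunksize argument of pandas.read_csv.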
3.1 Perceptron
"""
Perceptron
"""
per = Perceptron(verbose=1)
per.partial_fit(X_train, y_train, classes)
y_pred = per.predict(X_test)
# print(X_test[:10])
# print(y_pred[:10])
# print(y_test[:10])
# print(classes)
print(classification_report(y_pred=y_pred, y_true=y_test, labels=new_classes))
# Output
-- Epoch 1
Norm: 8.77, NNZs: 65, Bias: -3.000000, T: 6700, Avg. loss: 0.005224
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 5.48, NNZs: 27, Bias: -2.000000, T: 6700, Avg. loss: 0.002239
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 22.11, NNZs: 339, Bias: -5.000000, T: 6700, Avg. loss: 0.041343
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 26.00, NNZs: 389, Bias: -4.000000, T: 6700, Avg. loss: 0.040299
Total training time: 0.04 seconds.
-- Epoch 1
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Norm: 5.10, NNZs: 18, Bias: -2.000000, T: 6700, Avg. loss: 0.001343
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 19.57, NNZs: 269, Bias: -3.000000, T: 6700, Avg. loss: 0.028358
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 18.73, NNZs: 241, Bias: -5.000000, T: 6700, Avg. loss: 0.024776
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 17.83, NNZs: 166, Bias: -2.000000, T: 6700, Avg. loss: 0.012985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 6.63, NNZs: 41, Bias: -2.000000, T: 6700, Avg. loss: 0.002985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 5.92, NNZs: 32, Bias: -3.000000, T: 6700, Avg. loss: 0.001791
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 8.66, NNZs: 72, Bias: -3.000000, T: 6700, Avg. loss: 0.006567
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 8.06, NNZs: 50, Bias: -3.000000, T: 6700, Avg. loss: 0.003731
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 3.16, NNZs: 10, Bias: -2.000000, T: 6700, Avg. loss: 0.000149
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 17.58, NNZs: 225, Bias: -3.000000, T: 6700, Avg. loss: 0.024328
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 23.11, NNZs: 333, Bias: -4.000000, T: 6700, Avg. loss: 0.032985
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 5.92, NNZs: 35, Bias: -3.000000, T: 6700, Avg. loss: 0.002985
Total training time: 0.04 seconds.
-- Epoch 1
Norm: 23.13, NNZs: 331, Bias: 3.000000, T: 6700, Avg. loss: 0.033134
Total training time: 0.05 seconds.
precision recall f1-score support
B-art 0.24 0.44 0.31 9
B-eve 0.00 0.00 0.00 3
B-geo 0.87 0.19 0.31 69
B-gpe 0.59 0.71 0.64 102
B-nat 0.00 0.00 0.00 0
B-org 0.34 0.75 0.46 63
B-per 0.94 0.41 0.58 41
B-tim 0.59 0.94 0.73 52
I-art 0.33 0.20 0.25 10
I-eve 0.00 0.00 0.00 3
I-geo 0.00 0.00 0.00 11
I-gpe 1.00 0.17 0.29 6
I-nat 0.00 0.00 0.00 1
I-org 0.43 0.21 0.29 47
I-per 0.69 0.27 0.39 66
I-tim 0.00 0.00 0.00 4
micro avg 0.51 0.48 0.49 487
macro avg 0.38 0.27 0.26 487
weighted avg 0.59 0.48 0.46 487
3.2 SGD linear classifier
"""
Linear classifier trained with SGD
"""
sgd = SGDClassifier(verbose=0)
sgd.partial_fit(X_train, y_train, classes)
print(classification_report(y_pred = sgd.predict(X_test), y_true = y_test, labels = new_classes))
# Output
precision recall f1-score support
B-art 0.75 0.33 0.46 9
B-eve 0.00 0.00 0.00 3
B-geo 0.38 0.77 0.51 69
B-gpe 0.97 0.31 0.47 102
B-nat 0.00 0.00 0.00 0
B-org 0.83 0.30 0.44 63
B-per 0.27 0.54 0.36 41
B-tim 1.00 0.75 0.86 52
I-art 0.40 0.20 0.27 10
I-eve 0.25 0.33 0.29 3
I-geo 0.00 0.00 0.00 11
I-gpe 1.00 0.17 0.29 6
I-nat 0.00 0.00 0.00 1
I-org 0.70 0.30 0.42 47
I-per 0.49 0.48 0.49 66
I-tim 0.00 0.00 0.00 4
micro avg 0.50 0.45 0.47 487
macro avg 0.44 0.28 0.30 487
weighted avg 0.66 0.45 0.48 487
3.3 Naive Bayes classifier
"""
Naive Bayes classifier for multinomial models
"""
nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)
print(classification_report(y_pred = nb.predict(X_test), y_true = y_test, labels = new_classes))
# Output
precision recall f1-score support
B-art 0.11 0.33 0.16 9
B-eve 0.00 0.00 0.00 3
B-geo 0.62 0.49 0.55 69
B-gpe 0.64 0.73 0.68 102
B-nat 0.00 0.00 0.00 0
B-org 0.58 0.49 0.53 63
B-per 0.40 0.51 0.45 41
B-tim 0.67 0.85 0.75 52
I-art 0.29 0.20 0.24 10
I-eve 0.20 0.33 0.25 3
I-geo 0.25 0.36 0.30 11
I-gpe 0.25 0.50 0.33 6
I-nat 0.00 0.00 0.00 1
I-org 0.48 0.62 0.54 47
I-per 0.60 0.39 0.48 66
I-tim 0.06 0.25 0.10 4
micro avg 0.51 0.56 0.53 487
macro avg 0.32 0.38 0.33 487
weighted avg 0.55 0.56 0.54 487
3.4 Passive Aggressive classifier
"""
Passive Aggressive classifier
"""
pa = PassiveAggressiveClassifier(verbose=0)
pa.partial_fit(X_train, y_train, classes)
print(classification_report(y_pred = pa.predict(X_test), y_true = y_test, labels = new_classes))
# Output
precision recall f1-score support
B-art 0.00 0.00 0.00 9
B-eve 0.00 0.00 0.00 3
B-geo 0.25 0.94 0.39 69
B-gpe 0.95 0.57 0.71 102
B-nat 0.00 0.00 0.00 0
B-org 0.71 0.08 0.14 63
B-per 0.51 0.46 0.49 41
B-tim 0.92 0.87 0.89 52
I-art 0.17 0.10 0.12 10
I-eve 0.00 0.00 0.00 3
I-geo 0.00 0.00 0.00 11
I-gpe 0.20 0.17 0.18 6
I-nat 0.00 0.00 0.00 1
I-org 1.00 0.15 0.26 47
I-per 0.82 0.21 0.34 66
I-tim 0.00 0.00 0.00 4
micro avg 0.47 0.44 0.46 487
macro avg 0.35 0.22 0.22 487
weighted avg 0.68 0.44 0.43 487
None of the classifiers above performs well: their micro-averaged F1 scores range from 0.46 to 0.53.
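3.5 Conditional random fields
Conditional random fields (CRFs) are commonly used to label or parse sequential data, for example in natural language processing, where they are applied to tasks such as POS tagging and named entity recognition.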
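What sets a linear-chain CRF apart from the per-token classifiers above is that it scores a whole tag sequence, combining a score for each token's tag with a score for each transition between adjacent tags. The sketch below illustrates only the decoding step (Viterbi) with hand-picked scores; the label set, the numbers, and the function name `viterbi` are assumptions for illustration, and this is not how sklearn-crfsuite is implemented internally.

```python
labels = ["O", "B-per", "I-per"]

# emissions[t][j]: score of label j for token t (hand-picked toy values)
emissions = [
    [1.0, 0.8, 0.0],  # token 0: "O" looks slightly better than "B-per"
    [0.2, 0.1, 1.0],  # token 1: "I-per" looks best
]

# transitions[i][j]: score of moving from label i to label j.
# "O" -> "I-per" is heavily penalised because it is invalid in BIO.
transitions = [
    [0.0, 0.0, -5.0],  # from O
    [0.0, -1.0, 1.0],  # from B-per
    [0.0, 0.0, 0.5],   # from I-per
]


def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence."""
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(range(n_labels), key=lambda i: scores[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Choosing the best label for each token independently gives an invalid sequence
independent = [labels[row.index(max(row))] for row in emissions]
decoded = [labels[i] for i in viterbi(emissions, transitions)]
print(independent)  # ['O', 'I-per']
print(decoded)      # ['B-per', 'I-per']
```

Taken token by token, the first tag would be "O" and the second "I-per", which is not a valid BIO sequence. Once transition scores are included, the best-scoring sequence is "B-per", "I-per".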
"""
Conditional random field (CRF)
We use sklearn-crfsuite to train a CRF model for named entity recognition on our dataset.
"""
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from collections import Counter
# Retrieve sentences together with their POS and entity tags
class SentenceGetter(object):
    def __init__(self, dataset):
        self.n_sent = 1
        self.dataset = dataset
        self.empty = False
        # Turn each sentence group into a list of (word, POS, tag) tuples
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values,
                                                           s['POS'].values,
                                                           s['Tag'].values)]
        self.grouped = self.dataset.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once the sentences are exhausted
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
getter = SentenceGetter(df)
sentences = getter.sentences
# print(sentences[:1])
# Feature extraction
# We extract more features (word parts, simplified POS tags, lower/title/upper flags, features of
# neighboring words) and convert them to the sklearn-crfsuite format: each sentence becomes a list of dicts.
# The code below is taken from the sklearn-crfsuite documentation.
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    # Features of the current token
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        # Features of the previous token
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # Beginning of sentence
        features['BOS'] = True
    if i < len(sent)-1:
        # Features of the next token
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        # End of sentence
        features['EOS'] = True
    return features
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
# Split into training and test sets (by sentence)
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
print(len(X_train), X_train[0])
print(len(y_train), y_train[0])
# Train the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,  # L1 regularization coefficient
    c2=0.1,  # L2 regularization coefficient
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
# Evaluate
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_pred=y_pred, y_true=y_test, labels=new_classes))
# Output
306 [{'bias': 1.0, 'word.lower()': 'the', 'word[-3:]': 'The', 'word[-2:]': 'he', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', 'BOS': True, '+1:word.lower()': 'spokesman', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'spokesman', 'word[-3:]': 'man', 'word[-2:]': 'an', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'the', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'says', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBZ', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'says', 'word[-3:]': 'ays', 'word[-2:]': 'ys', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBZ', 'postag[:2]': 'VB', '-1:word.lower()': 'spokesman', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'a', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'DT', '+1:postag[:2]': 'DT'}, {'bias': 1.0, 'word.lower()': 'a', 'word[-3:]': 'a', 'word[-2:]': 'a', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', '-1:word.lower()': 'says', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBZ', '-1:postag[:2]': 'VB', '+1:word.lower()': 'formal', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'JJ', '+1:postag[:2]': 'JJ'}, {'bias': 1.0, 'word.lower()': 'formal', 'word[-3:]': 'mal', 'word[-2:]': 'al', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'JJ', 'postag[:2]': 'JJ', '-1:word.lower()': 'a', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'agreement', 
'+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'agreement', 'word[-3:]': 'ent', 'word[-2:]': 'nt', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'formal', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'JJ', '-1:postag[:2]': 'JJ', '+1:word.lower()': 'on', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'IN', '+1:postag[:2]': 'IN'}, {'bias': 1.0, 'word.lower()': 'on', 'word[-3:]': 'on', 'word[-2:]': 'on', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'IN', 'postag[:2]': 'IN', '-1:word.lower()': 'agreement', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'the', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'DT', '+1:postag[:2]': 'DT'}, {'bias': 1.0, 'word.lower()': 'the', 'word[-3:]': 'the', 'word[-2:]': 'he', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'DT', 'postag[:2]': 'DT', '-1:word.lower()': 'on', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'IN', '-1:postag[:2]': 'IN', '+1:word.lower()': 'project', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'NN', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'project', 'word[-3:]': 'ect', 'word[-2:]': 'ct', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'NN', 'postag[:2]': 'NN', '-1:word.lower()': 'the', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'DT', '-1:postag[:2]': 'DT', '+1:word.lower()': 'will', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'MD', '+1:postag[:2]': 'MD'}, {'bias': 1.0, 'word.lower()': 'will', 'word[-3:]': 'ill', 'word[-2:]': 'll', 'word.isupper()': False, 'word.istitle()': False, 
'word.isdigit()': False, 'postag': 'MD', 'postag[:2]': 'MD', '-1:word.lower()': 'project', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'NN', '-1:postag[:2]': 'NN', '+1:word.lower()': 'be', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VB', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'be', 'word[-3:]': 'be', 'word[-2:]': 'be', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VB', 'postag[:2]': 'VB', '-1:word.lower()': 'will', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'MD', '-1:postag[:2]': 'MD', '+1:word.lower()': 'signed', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBN', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'signed', 'word[-3:]': 'ned', 'word[-2:]': 'ed', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBN', 'postag[:2]': 'VB', '-1:word.lower()': 'be', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VB', '-1:postag[:2]': 'VB', '+1:word.lower()': 'in', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'IN', '+1:postag[:2]': 'IN'}, {'bias': 1.0, 'word.lower()': 'in', 'word[-3:]': 'in', 'word[-2:]': 'in', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'IN', 'postag[:2]': 'IN', '-1:word.lower()': 'signed', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBN', '-1:postag[:2]': 'VB', '+1:word.lower()': 'june', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'june', 'word[-3:]': 'une', 'word[-2:]': 'ne', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'in', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'IN', '-1:postag[:2]': 'IN', '+1:word.lower()': 'when', '+1:word.istitle()': False, 
'+1:word.isupper()': False, '+1:postag': 'WRB', '+1:postag[:2]': 'WR'}, {'bias': 1.0, 'word.lower()': 'when', 'word[-3:]': 'hen', 'word[-2:]': 'en', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'WRB', 'postag[:2]': 'WR', '-1:word.lower()': 'june', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'indonesian', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'JJ', '+1:postag[:2]': 'JJ'}, {'bias': 1.0, 'word.lower()': 'indonesian', 'word[-3:]': 'ian', 'word[-2:]': 'an', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'JJ', 'postag[:2]': 'JJ', '-1:word.lower()': 'when', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'WRB', '-1:postag[:2]': 'WR', '+1:word.lower()': 'president', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'president', 'word[-3:]': 'ent', 'word[-2:]': 'nt', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'indonesian', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'JJ', '-1:postag[:2]': 'JJ', '+1:word.lower()': 'susilo', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'susilo', 'word[-3:]': 'ilo', 'word[-2:]': 'lo', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'president', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'bambang', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'bambang', 'word[-3:]': 'ang', 'word[-2:]': 'ng', 'word.isupper()': False, 'word.istitle()': True, 
'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'susilo', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'yudhoyono', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'yudhoyono', 'word[-3:]': 'ono', 'word[-2:]': 'no', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'bambang', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'is', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBZ', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'is', 'word[-3:]': 'is', 'word[-2:]': 'is', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBZ', 'postag[:2]': 'VB', '-1:word.lower()': 'yudhoyono', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', '+1:word.lower()': 'scheduled', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VBN', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'scheduled', 'word[-3:]': 'led', 'word[-2:]': 'ed', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VBN', 'postag[:2]': 'VB', '-1:word.lower()': 'is', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBZ', '-1:postag[:2]': 'VB', '+1:word.lower()': 'to', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'TO', '+1:postag[:2]': 'TO'}, {'bias': 1.0, 'word.lower()': 'to', 'word[-3:]': 'to', 'word[-2:]': 'to', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'TO', 'postag[:2]': 'TO', '-1:word.lower()': 'scheduled', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VBN', '-1:postag[:2]': 'VB', '+1:word.lower()': 'visit', 
'+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': 'VB', '+1:postag[:2]': 'VB'}, {'bias': 1.0, 'word.lower()': 'visit', 'word[-3:]': 'sit', 'word[-2:]': 'it', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': 'VB', 'postag[:2]': 'VB', '-1:word.lower()': 'to', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'TO', '-1:postag[:2]': 'TO', '+1:word.lower()': 'moscow', '+1:word.istitle()': True, '+1:word.isupper()': False, '+1:postag': 'NNP', '+1:postag[:2]': 'NN'}, {'bias': 1.0, 'word.lower()': 'moscow', 'word[-3:]': 'cow', 'word[-2:]': 'ow', 'word.isupper()': False, 'word.istitle()': True, 'word.isdigit()': False, 'postag': 'NNP', 'postag[:2]': 'NN', '-1:word.lower()': 'visit', '-1:word.istitle()': False, '-1:word.isupper()': False, '-1:postag': 'VB', '-1:postag[:2]': 'VB', '+1:word.lower()': '.', '+1:word.istitle()': False, '+1:word.isupper()': False, '+1:postag': '.', '+1:postag[:2]': '.'}, {'bias': 1.0, 'word.lower()': '.', 'word[-3:]': '.', 'word[-2:]': '.', 'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False, 'postag': '.', 'postag[:2]': '.', '-1:word.lower()': 'moscow', '-1:word.istitle()': True, '-1:word.isupper()': False, '-1:postag': 'NNP', '-1:postag[:2]': 'NN', 'EOS': True}]
306 ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-tim', 'O', 'B-gpe', 'B-per', 'I-per', 'I-per', 'I-per', 'O', 'O', 'O', 'O', 'B-geo', 'O']
precision recall f1-score support
B-art 0.50 0.40 0.44 5
B-eve 0.00 0.00 0.00 2
B-geo 0.79 0.68 0.73 77
B-gpe 0.75 0.88 0.81 91
B-nat 0.00 0.00 0.00 2
B-org 0.77 0.68 0.72 53
B-per 0.85 0.92 0.88 61
B-tim 0.95 0.89 0.92 45
I-art 0.00 0.00 0.00 4
I-eve 0.00 0.00 0.00 1
I-geo 0.75 0.38 0.50 16
I-gpe 0.67 0.57 0.62 7
I-nat 0.00 0.00 0.00 2
I-org 0.74 0.70 0.72 50
I-per 0.87 0.97 0.92 75
I-tim 0.33 1.00 0.50 1
micro avg 0.80 0.78 0.79 492
macro avg 0.50 0.50 0.48 492
weighted avg 0.78 0.78 0.78 492