【scikit-learn翻译】TfidfVectorizer

sklearn.feature_extraction.text.TfidfVectorizer

class sklearn.feature_extraction.text.TfidfVectorizer(input=’content’, encoding=’utf-8’, decode_error=’strict’, strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer=’word’, stop_words=None, token_pattern=’(?u)\b\w\w+\b’, ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class ‘numpy.int64’>, norm=’l2’, use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
Read more in the User Guide.
将原始文档的集合转换为TF-IDF功能的矩阵。
相当于CountVectorizer,后跟TfidfTransformer。
在“ 用户指南”中阅读更多内容。

Parameters:

  • input : string {‘filename’, ‘file’, ‘content’}

If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.

Otherwise the input is expected to be the sequence strings or bytes items are expected to be analyzed directly.
如果是'filename',序列作为参数传递给拟合器,预计为文件名列表,这需要读取原始内容进行分析

如果是'file',序列中元素必须有一个”read“的方法(类似文件的对象),被调用作为获取内存中的字节数

否则,输入预计为序列串,或字节数据项都预计可直接进行分析。

  • encoding : string, ‘utf-8’ by default.

If bytes or files are given to analyze, this encoding is used to decode.
如果给出要解析的字节或文件,此编码将用于解码

  • decode_error : {‘strict’, ‘ignore’, ‘replace’}

Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
如果一个给出的字节序列包含的字符不是给定的编码,指示应该如何去做。默认情况下,它是'strict',这意味着的UnicodeDecodeError将会报错,其他值是'ignore'和'replace'

  • strip_accents : {‘ascii’, ‘unicode’, None}

Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have an direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
在预处理步骤中去除编码规则(accents),”ASCII码“是一种快速的方法,仅适用于有一个直接的ASCII字符映射,"unicode"是一个稍慢一些的方法,None(默认)什么都不做

  • analyzer : string, {‘word’, ‘char’} or callable

Whether the feature should be made of word or character n-grams.

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
定义特征为词(word)或n-gram字符,如果传递给它的调用被用于抽取未处理输入源文件的特征序列

  • preprocessor : callable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
当保留令牌和”n-gram“生成步骤时,覆盖预处理(字符串变换)的阶段

  • tokenizer : callable or None (default)

Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
当保留预处理和n-gram生成步骤时,覆盖字符串令牌步骤

  • ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
要提取的n-gram的n-values的下限和上限范围,在min_n <= n <= max_n区间的n的全部值

  • stop_words : string {‘english’}, list, or None (default)

If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value.

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
如果未english,用于英语内建的停用词列表

如果未list,该列表被假定为包含停用词,列表中的所有词都将从令牌中删除

如果None,不使用停用词。max_df可以被设置为范围[0.7, 1.0)的值,基于内部预料词频来自动检测和过滤停用词

  • lowercase : boolean, default True

Convert all characters to lowercase before tokenizing.
在令牌标记前转换所有的字符为小写

  • token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
正则表达式显示了”token“的构成,仅当analyzer == ‘word’时才被使用。两个或多个字母数字字符的正则表达式(标点符号完全被忽略,始终被视为一个标记分隔符)。

  • max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
当构建词汇表时,严格忽略高于给出阈值的文档频率的词条,语料指定的停用词。如果是浮点值,该参数代表文档的比例,如果是整型,代表绝对计数值。如果词汇表不为None,此参数被忽略。

  • min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
当构建词汇表时,严格忽略低于给出阈值的文档频率的词条,语料指定的停用词。如果是浮点值,该参数代表文档的比例,整型绝对计数值,如果词汇表不为None,此参数被忽略。

  • max_features : int or None, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.
如果不为None,构建一个词汇表,仅考虑max_features--按语料词频排序,如果词汇表不为None,这个参数被忽略

  • vocabulary : Mapping or iterable, optional

Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
也是一个映射(Map)(例如,字典),其中键是词条而值是在特征矩阵中索引,或词条中的迭代器。如果没有给出,词汇表被确定来自输入文件。在映射中索引不能有重复,并且不能在0到最大索引值之间有间断。

  • binary : boolean, default=False

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)
如果未True,所有非零计数被设置为1,这对于离散概率模型是有用的,建立二元事件模型,而不是整型计数

  • dtype : type, optional

Type of the matrix returned by fit_transform() or transform().
通过fit_transform()或transform()返回矩阵的类型

  • norm : ‘l1’, ‘l2’ or None, optional

Norm used to normalize term vectors. None for no normalization.
范数用于标准化词条向量。None为不归一化

  • use_idf : boolean, default=True

Enable inverse-document-frequency reweighting.
启动inverse-document-frequency重新计算权重

  • smooth_idf : boolean, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
通过加1到文档频率平滑idf权重,为防止除零,加入一个额外的文档

  • sublinear_tf : boolean, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
应用线性缩放TF,例如,使用1+log(tf)覆盖tf

Attributes:

  • vocabulary_ : dict

A mapping of terms to feature indices.

  • idf_ : array, shape = [n_features], or None

The learned idf vector (global term weights) when use_idf is set to True, None otherwise.

  • stop_words_ : set

Terms that were ignored because they either:

occurred in too many documents (max_df)
occurred in too few documents (min_df)
were cut off by feature selection (max_features).
This is only available if no vocabulary was given.

参考资料

  1. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
  2. https://blog.csdn.net/laobai1015/article/details/80451371
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 214,444评论 6 496
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 91,421评论 3 389
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 160,036评论 0 349
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 57,363评论 1 288
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 66,460评论 6 386
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,502评论 1 292
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,511评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,280评论 0 270
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,736评论 1 307
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,014评论 2 328
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,190评论 1 342
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,848评论 5 338
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,531评论 3 322
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,159评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,411评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,067评论 2 365
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,078评论 2 352

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,319评论 0 10
  • 笑来老师在得到有个专栏叫做《通往财富自由之路》,截止目前订阅用户178210人,我就是这其中那一个; Who 我今...
    刘冰杰阅读 204评论 0 0
  • F303申某某(Aglarn)第五次作业非暴力沟通训练营第三期 这次课程2个重要的点:区分观察和评论,区分感受和想...
    申某某Aglarn阅读 280评论 1 0
  • 关于爱情,有这么两个观点: 1、一份感情如果对了,一定是顺遂的; 2、真爱,总是来之不易。 这两个看似矛盾的观点,...
    颜卤煮阅读 1,051评论 1 51
  • ——5304-刘罗锅-清单主题营复盘 本来是不想参加清单主题营的,为什么呢? 001每天要通过阅读一本书来书写...
    小晓真儿阅读 245评论 2 3