tensorflow, keras: extracting text information with Tokenizer (NLP)

程序员文章站 2022-06-22 17:33:37

Tokenizer is a class in the keras.preprocessing.text package. Its full import path is:

tensorflow.keras.preprocessing.text.Tokenizer

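tensorflow and keras are known for, and often criticized for, their sprawling collection of packages. Tokenizer is a class commonly used during data preprocessing. Its job is:

to vectorize an entire text corpus.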

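Anyone who has done text processing in machine learning knows that a computer cannot work with the written form of a word directly. Instead, each word is converted into a numeric representation, using techniques such as one-hot encoding, integer encoding, and word embedding.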
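As a rough illustration of the simplest of these, one-hot encoding, here is a minimal sketch. The three-word vocabulary and the helper name one_hot are made up for this example and are not part of Keras.

```python
# Toy vocabulary; the words and their order are assumptions for illustration.
vocab = ["i", "love", "learning"]

# Integer encoding: each word is identified by its position in the vocabulary.
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

print(word_to_id["love"])  # 1
print(one_hot("love"))     # [0, 1, 0]
```

The vector is as long as the vocabulary, which is why one-hot encoding becomes expensive for large corpora.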

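Tokenizer processes a whole corpus by turning each text either into a sequence of integers or into a vector. Concretely: take the four characters of "我爱学习" ("I love learning") and assign each one an index, [0, 1, 2, 3]. Then "我爱学习" can be written as "0123", and "学习爱我" as "2310". (This numbering is only illustrative; the Keras Tokenizer assigns indices starting from 1 and reserves 0.)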
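The character-to-index idea can be sketched in a few lines of plain Python. This mirrors the 0-based toy numbering used in the example text, not the indices Keras would produce; char_to_id and encode are hypothetical names.

```python
text = "我爱学习"

# Give each new character the next free index, in order of first appearance.
char_to_id = {}
for ch in text:
    char_to_id.setdefault(ch, len(char_to_id))

def encode(s):
    """Map every character of s to its index."""
    return [char_to_id[ch] for ch in s]

print(encode("我爱学习"))  # [0, 1, 2, 3]
print(encode("学习爱我"))  # [2, 3, 1, 0]
```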

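Vectorization is a bit more involved than mapping to an index sequence: each Chinese character (or each word, for English) is converted into a corresponding vector, for example "我" becomes [0.2, 0.3, 0.4, ...]. The length and contents of the vector depend on the token and on the corpus as a whole; see TF-IDF or word embeddings for concrete schemes.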
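To make "depends on the corpus as a whole" more concrete, here is a small sketch of a textbook TF-IDF weighting that produces one vector per document. The two documents are invented, and the formula (term frequency times log(N / document frequency)) is the classroom version, which is an assumption here and not necessarily the exact weighting Keras applies in its 'tfidf' mode.

```python
import math

# Two tiny, already-tokenized documents, made up for illustration.
docs = [["i", "love", "learning"], ["i", "love", "tea"]]
vocab = sorted({word for doc in docs for word in doc})  # ['i', 'learning', 'love', 'tea']

def tfidf_vector(doc):
    """One weight per vocabulary word: tf * log(N / df)."""
    vec = []
    for word in vocab:
        tf = doc.count(word) / len(doc)          # how often the word occurs in this doc
        df = sum(1 for d in docs if word in d)   # how many docs contain the word
        vec.append(tf * math.log(len(docs) / df))
    return vec

vectors = [tfidf_vector(doc) for doc in docs]
print(vectors[0])
```

Words that appear in every document ("i", "love") get weight 0, while words unique to one document get a positive weight, so the same word can have different values in different corpora.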

Tokenizer is defined as follows (from the official Keras API documentation for Tokenizer):

tf.keras.preprocessing.text.Tokenizer(
    num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
    split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)

The parameters mean the following:

num_words: the maximum number of words to keep, based
    on word frequency. Only the most common `num_words-1` words will
    be kept.
filters: a string where each element is a character that will be
    filtered from the texts. The default is all punctuation, plus
    tabs and line breaks, minus the `'` character.
lower: boolean. Whether to convert the texts to lowercase.
split: str. Separator for word splitting.
char_level: if True, every character will be treated as a token.
oov_token: if given, it will be added to word_index and used to
    replace out-of-vocabulary words during texts_to_sequences calls.
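The interplay of filters, lower, and split can be approximated in plain Python. The sketch below is a simplified stand-in written for this article, not the Keras source; split_words is a hypothetical helper, and the defaults are copied from the signature above.

```python
# Default filter string from the Tokenizer signature: punctuation, tab, newline,
# but not the apostrophe.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, turn filtered characters into separators, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left behind by consecutive separators.
    return [word for word in text.translate(table).split(split) if word]

print(split_words("Hello, world! It's fine."))
# ['hello', 'world', "it's", 'fine']
```

Note that "it's" survives intact because the apostrophe is excluded from the default filters.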

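The Tokenizer class mainly provides the methods below, which convert between sequences, texts, and matrices. A sequence is a list of integer indices, like the "0123" for "我爱学习" above; a text is the string those indices stand for; and a matrix is a numeric array that represents the texts.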
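The three representations can be sketched without Keras. The code below is a simplified imitation under stated assumptions: words are split on whitespace only, more frequent words get smaller indices starting at 1, and the matrix uses a 'binary'-style encoding with column 0 left unused. The sample texts and the helper to_binary_matrix are invented for illustration.

```python
from collections import Counter

texts = ["I love learning", "learning loves me"]

# text -> vocabulary: most frequent word gets index 1; ties keep first-seen order.
counts = Counter(word for t in texts for word in t.lower().split())
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# text -> sequence
sequences = [[word_index[w] for w in t.lower().split()] for t in texts]

# sequence -> text
index_word = {i: w for w, i in word_index.items()}
restored = [" ".join(index_word[i] for i in seq) for seq in sequences]

def to_binary_matrix(seqs, width):
    """One row per sequence; a 1 in column i means index i occurs in that sequence."""
    matrix = [[0] * width for _ in seqs]
    for row, seq in zip(matrix, seqs):
        for i in seq:
            row[i] = 1
    return matrix

# sequence -> matrix
matrix = to_binary_matrix(sequences, len(word_index) + 1)

print(sequences)  # [[2, 3, 1], [1, 4, 5]]
print(restored)   # ['i love learning', 'learning loves me']
print(matrix)     # [[0, 1, 1, 1, 0, 0], [0, 1, 0, 0, 1, 1]]
```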

Methods of the Tokenizer class

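fit_on_sequences(sequences): updates the internal vocabulary from a list of sequences. The sequences argument is that list.

fit_on_texts(texts): updates the internal vocabulary from a list of texts. The texts argument is that list.

get_config(): returns the Tokenizer's configuration as a dictionary.

sequences_to_matrix(sequences, mode): converts a list of sequences into a numeric matrix. mode selects how the entries are computed and is one of 'binary', 'count', 'tfidf', or 'freq'.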

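sequences_to_texts(sequences): converts each sequence in the list into a text and returns the list of texts, one text per sequence.

sequences_to_texts_generator(sequences): the generator version of sequences_to_texts. It yields one text per sequence lazily instead of building the whole list at once.

texts_to_matrix, texts_to_sequences, texts_to_sequences_generator: the text-input counterparts of the three methods above. They take a list of texts and produce a matrix, a list of sequences, or a generator of sequences respectively.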

to_json(**kwargs): serializes the Tokenizer to a JSON string. A saved string can be loaded back into a Tokenizer with keras.preprocessing.text.tokenizer_from_json(json_string).
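Putting several of these methods together, a JSON round trip might look like the sketch below. It assumes a TensorFlow installation that still ships tf.keras.preprocessing.text (the class is marked deprecated in the TensorFlow docs); the function name tokenizer_json_round_trip and the "<OOV>" token are choices made for this example.

```python
def tokenizer_json_round_trip(texts):
    """Fit a Tokenizer, save it to JSON, reload it, and encode texts with both."""
    # Imported inside the function so that defining it does not require TensorFlow.
    from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json

    tokenizer = Tokenizer(num_words=None, oov_token="<OOV>")
    tokenizer.fit_on_texts(texts)

    # Serialize to a JSON string and rebuild an equivalent Tokenizer from it.
    restored = tokenizer_from_json(tokenizer.to_json())

    return tokenizer.texts_to_sequences(texts), restored.texts_to_sequences(texts)
```

Calling tokenizer_json_round_trip(["I love learning", "learning loves me"]) should return two identical lists of sequences, since the restored Tokenizer carries the same vocabulary as the original.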

Original article: https://blog.csdn.net/github_35807147/article/details/107214951
