A Survey of Vector Representation Techniques for Long Text

2021-12-28 12:55:15

What Is a Vector Representation of Long Text

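A vector representation of a long text (also called a high-dimensional space vector) is what you get when preprocessed long text is converted into a vector form that a computer can work with.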

Why Vector Representations of Long Text Are Needed

Natural language processing often requires measuring how similar two texts are. In the past, text similarity was computed mainly with the bag-of-words model, which has two obvious drawbacks: it ignores word order and it ignores semantics, and both matter. In the words of the paper: “However, the bag-of-words (BOW) has many disadvantages. The word order is lost, and thus different sentences can have exactly the same representation, as long as the same words are used. Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and high dimensionality. Bag-of-words and bag-of-n-grams have very little sense about the semantics of the words or more formally the distances between the words. This means that words ‘powerful’, ‘strong’ and ‘Paris’ are equally distant despite the fact that semantically, ‘powerful’ should be closer to ‘strong’ than ‘Paris’.”【1】

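Moreover, since a text lives in a high-dimensional semantic space, it is natural to represent it with a high-dimensional vector. Once text is abstracted and decomposed in this way, its similarity to other texts can be quantified mathematically. With a similarity measure in hand, we can cluster texts using partition-based K-means, density-based DBSCAN, or model-based probabilistic methods. We can also use text similarity to deduplicate large corpora during preprocessing, or to find names related to a given entity name (fuzzy matching).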
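As a minimal sketch of "quantifying similarity mathematically", the snippet below computes cosine similarity between text vectors. Cosine similarity is one common choice and is an assumption here; the article does not name a specific measure. The three-dimensional vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(a @ b / denom)

# Hypothetical text vectors, for illustration only.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]   # orthogonal to doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Pairwise scores like these are what a clustering or deduplication step would consume.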

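These techniques are widely applicable to search engines, machine translation, spam filtering, and search and categorization in online stores.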

Techniques for Vector Representation of Long Text

The main techniques currently used to represent long text include FastText, TextCNN, TextRNN, TextRNN+Attention, and paragraph vector prediction:

1.FastText

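This method simply averages the word vectors of a text to obtain the vector for a paragraph or document. It is very simple and does not take word order into account. Even so, as the FastText model in <Bag of Tricks for Efficient Text Classification> by Armand Joulin's team at Facebook shows, this linear approach is remarkably effective【2】.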

In the model, all words pass through a single hidden layer and are then averaged, giving a linear representation of the paragraph vector (“Figure 1 shows a simple model with 1 hidden layer. The first weight matrix can be seen as a look-up table over the words of a sentence. The word representations are averaged into a text representation, which is in turn fed to a linear classifier.”).
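The look-up, average, and linear classifier steps can be sketched in a few lines of NumPy. This is an illustrative toy with random, untrained weights, not the fastText library itself. The vocabulary, dimensions, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # toy vocabulary
embed_dim, num_classes = 8, 3

E = rng.normal(size=(len(vocab), embed_dim))    # look-up table (first weight matrix)
W = rng.normal(size=(embed_dim, num_classes))   # linear classifier weights
b = np.zeros(num_classes)

def text_vector(tokens):
    """Average the word vectors of the known tokens."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return np.zeros(embed_dim)
    return E[ids].mean(axis=0)

def predict_proba(tokens):
    """Feed the averaged vector to a linear layer followed by softmax."""
    logits = text_vector(tokens) @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

v1 = text_vector(["the", "cat", "sat"])
v2 = text_vector(["sat", "the", "cat"])   # same words, different order
probs = predict_proba(["the", "cat", "sat"])
print(np.allclose(v1, v2), probs)
```

Because averaging is order-independent, `v1` and `v2` are identical, which is exactly the word-order limitation discussed in this section.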

Note that the basic model ignores the effect of word order entirely. To compensate, the authors add n-gram features as a trick for capturing local order information (an n-gram is a run of several consecutive items; in the original, “an n-gram is a contiguous sequence of n items from a given sequence of text or speech”【3】). Compared with the plain single-hidden-layer model, this improves the accuracy of the resulting vectors (“we use bag of n-gram as additional features to capture some partial information about the local word order. This is very efficient in practice while achieving comparable results to methods that explicitly use the order”)【4】.
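A small sketch of how word n-grams restore some local order information. The helper names are hypothetical, and the hashing that real implementations typically apply to n-gram features is omitted.

```python
def word_ngrams(tokens, n):
    """All contiguous runs of n tokens, joined into single feature strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens, max_n=2):
    """Unigrams plus all n-grams up to max_n, as one feature list."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(word_ngrams(tokens, n))
    return feats

feats_a = bag_of_ngrams(["dog", "bites", "man"])
feats_b = bag_of_ngrams(["man", "bites", "dog"])
print(feats_a)
print(feats_b)
```

The two sentences share the same unigrams, but their bigram features differ ("dog bites" versus "man bites"), so an averaged representation over these features is no longer identical for both.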

2.TextCNN

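TextCNN applies a convolutional neural network to the paragraph. A CNN captures local information while reducing dimensionality, so it can pick up the order of words within local phrases. Chinese is written character by character and lacks the spaces that clearly separate words in English, so Chinese paragraphs are not well suited to using FastText directly. A CNN, by contrast, handles Chinese well.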

Take as an example the model in <Convolutional Neural Networks for Multimedia Sentiment Analysis> by Guoyong Cai and Binbin Xia of Guilin University of Electronic Technology.【5】

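In the model's architecture diagram, the first layer is the 7×5 sentence matrix at the far left. Each row is a word vector of dimension 5, analogous to the raw pixels of an image. Next comes a one-dimensional convolution layer with filter_size=(2,3,4), where each filter size has two output channels. The third layer is a 1-max pooling layer, which turns sentences of different lengths into a fixed-length representation. Finally, a fully connected softmax layer outputs the probability of each class.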
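The layer sequence described above (sentence matrix, convolutions with heights 2, 3, and 4 and two channels each, 1-max pooling, softmax) can be sketched as a forward pass in NumPy. Weights are random and untrained. The ReLU activation and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(sentence, kernel, bias):
    """Slide an (h x d) kernel down an (n x d) sentence matrix: n-h+1 activations."""
    h, n = kernel.shape[0], sentence.shape[0]
    out = np.array([np.sum(sentence[i:i + h] * kernel) + bias
                    for i in range(n - h + 1)])
    return np.maximum(out, 0.0)  # ReLU, an assumed activation

def textcnn_forward(sentence, kernels, biases, W_out, b_out):
    # 1-max pooling keeps one number per filter, whatever the sentence length.
    pooled = np.array([conv1d_relu(sentence, k, b).max()
                       for k, b in zip(kernels, biases)])
    logits = pooled @ W_out + b_out
    exp = np.exp(logits - logits.max())
    return pooled, exp / exp.sum()

embed_dim, num_classes = 5, 2
# Filter heights 2, 3, 4 with two output channels each: six filters in total.
kernels = [rng.normal(size=(h, embed_dim)) for h in (2, 3, 4) for _ in range(2)]
biases = [0.0] * len(kernels)
W_out = rng.normal(size=(len(kernels), num_classes))
b_out = np.zeros(num_classes)

short = rng.normal(size=(7, embed_dim))    # the 7 x 5 sentence matrix
longer = rng.normal(size=(12, embed_dim))  # a longer sentence
pooled_short, probs_short = textcnn_forward(short, kernels, biases, W_out, b_out)
pooled_long, probs_long = textcnn_forward(longer, kernels, biases, W_out, b_out)
print(pooled_short.shape, pooled_long.shape)
```

Both sentences yield a pooled vector of length 6, which is what allows a single fully connected layer to follow.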

3.TextRNN

A recurrent neural network (RNN) is “a class of artificial neural network where connections between units form a directed cycle. This allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.” By processing the text recurrently, this approach accounts for how paragraph structure shapes the contribution of each word to the paragraph's meaning (“in this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture”【6】). Because it takes this structure into account, it is described as considerably more accurate than the TextCNN above. It still has a number of shortcomings, however, and in practice its improved variant, TextRNN+Attention, is generally used. Since TextRNN+Attention subsumes TextRNN, TextRNN is not covered in detail here.
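To make the "internal memory" idea concrete, here is a minimal vanilla RNN encoder in NumPy. It is only a sketch with random, untrained weights; the tanh cell and all names are assumptions rather than the architecture of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 6
W_x = rng.normal(scale=0.5, size=(embed_dim, hidden_dim))   # input-to-hidden
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
b = np.zeros(hidden_dim)

def rnn_encode(seq):
    """Read word vectors one at a time; the final state summarizes the sequence."""
    h = np.zeros(hidden_dim)
    for x in seq:
        h = np.tanh(x @ W_x + h @ W_h + b)  # new state depends on the old state
    return h

words = rng.normal(size=(5, embed_dim))     # five hypothetical word vectors
h_forward = rnn_encode(words)
h_reversed = rnn_encode(words[::-1])        # same words, reversed order
print(np.allclose(h_forward, h_reversed))
```

Unlike the averaged representation, the encoding here changes when the word order changes.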

4.TextRNN+Attention

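TextRNN+Attention adds an attention mechanism on top of TextRNN. Attention is loosely modeled on human attention: instead of treating all input words equally, the model gives each one a weight reflecting how much it matters for the current output. Human attention fades over time, and in text processing the influence of early words on later ones likewise tends to weaken as more words are processed, so the meaning of a word is usually judged mainly from its nearby context, which matches how natural language is used. The weights themselves are learned rather than fixed by distance. For example, with the Chinese source “我 / 是 / 中国人” and the English target “I / am / Chinese”, producing “Chinese” clearly depends on “中国人” and has almost nothing to do with “我 / 是”. In the attention formula, αij is the contribution of the j-th Chinese word when translating the i-th English word, that is, the attention. Clearly, when translating “Chinese”, the attention value for “中国人” is very large.【7】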
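For reference, one widely used way to define the weights αij is the softmax form below. It is given as a sketch of the usual encoder-decoder formulation, not necessarily the exact formula in the source cited above. The symbols e_{ij} (alignment score), a (scoring function), s_{i-1} (previous decoder state), h_j (encoder state for source word j), and c_i (context vector) are assumptions introduced here.

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij} \, h_j
```

The weights for a fixed i are positive and sum to 1, so c_i is a weighted average of the source states, dominated by the words with the largest scores.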

Take as an example <Hierarchical Attention Networks for Document Classification> by Zichao Yang and colleagues at Carnegie Mellon University and Microsoft Research. “Hierarchical” refers to levels: the model follows the word, sentence, and document levels of a text, and the parts at each level carry different weights. Their method starts from the observation that not all words are equally relevant to the meaning of a document (“The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation”【8】). Their approach has two distinctive features: a hierarchical structure in which parts are weighted unequally, and the attention mechanism (“(i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the word and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation”).

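In their architecture, a recurrent neural network implements the hierarchy. Two gates, written Rt and Rz in the original (presumably the reset gate r_t and the update gate z_t of a gated recurrent unit), control the weights, and they are used to compute the hidden state with the formulas given in the paper.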
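The gated update can be sketched as a single GRU step in NumPy. This follows the standard GRU formulation, which is assumed here to be what the missing formulas showed. The parameter names in the dictionary are illustrative and the weights are random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update. p is a dict of weights; all names are illustrative."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much past state is used.
    h_cand = np.tanh(p["W_h"] @ x + r * (p["U_h"] @ h_prev) + p["b_h"])
    # The update gate interpolates between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand

def init_params(input_dim, hidden_dim, rng):
    p = {}
    for gate in ("z", "r", "h"):
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
params = init_params(input_dim=4, hidden_dim=3, rng=rng)
h = np.zeros(3)
for x in rng.normal(size=(5, 4)):   # five hypothetical word vectors
    h = gru_step(x, h, params)
print(h)
```

When the update gate is close to 0 the previous state passes through almost unchanged, which is how the gates control how much each word contributes.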

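The attention mechanism is implemented in several parts. Word-level attention is computed with the following formula:

[Formula: word-level attention, with the ranges of its variables]

Sentence-level attention is computed with a second formula:

[Formula: sentence-level attention, with the ranges of its variables]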
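As a hedged sketch of the two attention levels, the function below scores each hidden state against a learned context vector, normalizes the scores with softmax, and returns the weighted sum. It is applied once over the words of each sentence and once over the resulting sentence vectors. This follows the commonly described form of hierarchical attention; all weights are random, the names are assumptions, and the sentence-level recurrent encoder that normally sits between the two levels is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(H, W, b, u_ctx):
    """Weight the rows of H (T x d) by relevance; return (weights, weighted sum)."""
    U = np.tanh(H @ W + b)           # hidden representation of each row
    scores = U @ u_ctx               # similarity to the context vector
    exp = np.exp(scores - scores.max())
    alpha = exp / exp.sum()          # attention weights, summing to 1
    return alpha, alpha @ H

hidden_dim, attn_dim = 6, 4
# Separate (hypothetical) parameters for word level and sentence level.
W_w, b_w, u_w = rng.normal(size=(hidden_dim, attn_dim)), np.zeros(attn_dim), rng.normal(size=attn_dim)
W_s, b_s, u_s = rng.normal(size=(hidden_dim, attn_dim)), np.zeros(attn_dim), rng.normal(size=attn_dim)

# A document of three sentences with 5, 2, and 4 word states each.
doc = [rng.normal(size=(n, hidden_dim)) for n in (5, 2, 4)]

sentence_vecs = np.stack([attention_pool(H, W_w, b_w, u_w)[1] for H in doc])
alpha_sent, doc_vec = attention_pool(sentence_vecs, W_s, b_s, u_s)
print(alpha_sent, doc_vec.shape)
```

The sentence weights `alpha_sent` show which sentences dominate the document vector, which is the "unequal weighting" the section describes.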

5.Paragraph Vector Prediction

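This method differs from the previous ones: it is unsupervised learning based on a backpropagation (BP) neural network. Take as an example the paper <Distributed Representations of Sentences and Documents> by Quoc Le and Tomas Mikolov (Mikolov later became a researcher at Facebook AI Research).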

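In this method, a paragraph vector is first initialized at random. The paragraph vector and the context words are then used to predict a word in the sentence. Finally, the error between the predicted next word and the actual word is backpropagated through the network to adjust the paragraph vector. The paper's framework figure illustrates this: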

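The context and the paragraph vector are used to predict that the fourth word is “on”, and the prediction is compared with the actual word to obtain the feedback signal.【9】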

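In practice, fixed-length contexts are sampled from the paragraph with a sliding window, and the paragraph vector is shared across all context windows generated from that paragraph. With these as inputs, the prediction is computed as given in the paper, y = b + U h(w_{t-k}, ..., w_{t+k}; W), where h combines the paragraph vector with the context word vectors. The difference between the prediction and the word that actually occurs yields the adjustment to the paragraph vector. U and b are the parameters to be tuned. So although the paragraph vector is random at first, after training it can serve as a feature of the paragraph.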
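The training loop can be sketched in NumPy as follows: slide a window over the paragraph, average the paragraph vector with the context word vectors, predict the next word with a softmax over b + U·h, and backpropagate the error into U, b, and the paragraph vector. This is a simplified illustration under stated assumptions: word vectors are kept fixed, the context is the three preceding words, and a full softmax is used. All names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
word_id = {w: i for i, w in enumerate(vocab)}
dim, window, lr = 8, 3, 0.1

W = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors (fixed here)
params = {
    "d": rng.normal(scale=0.1, size=dim),           # paragraph vector, random init
    "U": rng.normal(scale=0.1, size=(dim, len(vocab))),
    "b": np.zeros(len(vocab)),
}

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def sliding_windows(ids, size):
    """Yield (context ids, target id) for every window position."""
    for t in range(size, len(ids)):
        yield ids[t - size:t], ids[t]

def train_epoch(ids):
    """One pass over all windows; returns the mean cross-entropy loss."""
    losses = []
    for context, target in sliding_windows(ids, window):
        n = len(context) + 1
        h = (params["d"] + W[context].sum(axis=0)) / n   # shared paragraph vector
        p = softmax(params["b"] + h @ params["U"])
        losses.append(-np.log(p[target]))
        err = p.copy()
        err[target] -= 1.0                  # gradient of the loss w.r.t. y
        grad_h = params["U"] @ err          # computed before U is updated
        params["U"] -= lr * np.outer(h, err)
        params["b"] -= lr * err
        params["d"] -= lr * grad_h / n      # adjust the paragraph vector
    return float(np.mean(losses))

ids = [word_id[w] for w in "the cat sat on the mat".split()]
d_before = params["d"].copy()
history = [train_epoch(ids) for _ in range(200)]
print(history[0], history[-1])
```

The first window pairs the context "the cat sat" with the target "on", matching the example in the text. After training, `params["d"]` is the learned feature vector for the paragraph.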

References:

【1】 Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.

【2】 Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. Facebook AI Research. Bag of Tricks for Efficient Text Classification. Quoted passage: ‘At the same time, simple linear models have also shown impressive performance while being very computationally efficient’

【3】 Wikipedia. https://en.wikipedia.org/wiki/N-gram

【4】 Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. Facebook AI Research. Bag of Tricks for Efficient Text Classification.

【5】 Guoyong Cai, Binbin Xia. Convolutional Neural Networks for Multimedia Sentiment Analysis.

【6】 Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. Carnegie Mellon University; Microsoft Research, Redmond. Hierarchical Attention Networks for Document Classification.

【7】 http://geek.csdn.net/news/detail/189196

【8】 Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, Eduard Hovy. Carnegie Mellon University; Microsoft Research, Redmond. Hierarchical Attention Networks for Document Classification.

【9】 Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.