From 9fac2fca7f6cb32b294e768f5e67d7bec3b2b7d0 Mon Sep 17 00:00:00 2001 From: ZHOUhuichi <96902323+ZHOUhuichi@users.noreply.github.com> Date: Wed, 14 Dec 2022 23:21:50 +0800 Subject: [PATCH 1/6] Add files via upload --- .../NLP\345\237\272\347\241\200.ipynb" | 687 ++++++++++++++++++ 1 file changed, 687 insertions(+) create mode 100644 "docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" diff --git "a/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" "b/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" new file mode 100644 index 000000000..ba5bcc1f9 --- /dev/null +++ "b/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" @@ -0,0 +1,687 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "119ec186", + "metadata": {}, + "source": [ + "# Word Embeddings (Concepts)" + ] + }, + { + "cell_type": "markdown", + "id": "f8e5639e", + "metadata": {}, + "source": [ + "### Before looking at what a word embedding is, consider this: how does a computer recognize human input?
\n", + "A computer represents its input as binary data, sequences of 0s and 1s, so human language has to be converted into a numeric form before a machine can process it.
\n", + "We first introduce **one-hot encoding**: for a given dimension, a one-hot vector has exactly one entry equal to 1 and all other entries equal to 0, for example the 5-dimensional vector [0,0,0,0,1]
\n", + "As an analogy, when we learn Chinese in kindergarten or primary school, we first learn characters and words, and they are stored somewhere in our brains.
\n", + "\n", + "
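Suppose a child has just learned four words: [我] [特别] [喜欢] [学习] (I, really, like, studying)
\n", + "The computer can then allocate a one-hot encoding of dimension 4 for this child's vocabulary.
\n", + "For Chinese text we first segment the sentence into words: 我 特别 喜欢 学习
\n", + "We can then assign 我->[1 0 0 0] 特别->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
\n", + "Given the sentence 我喜欢学习 (I like studying), summing the one-hot vectors of its words gives the sentence vector [1 0 1 1]

\n", + "This raises two problems:
\n", + "1. As the child's vocabulary grows to tens of thousands of words, the vectors built this way become very high-dimensional and sparse.
\n", + "2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly meaning can). Their one-hot vectors are mutually orthogonal, so the encoding carries no notion of similarity between words and cannot relate them to one another.
\n", + "For these reasons, one-hot encoding is not a good word embedding method.
\n", + "\n", + "Next, we introduce the **dense representation**.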
\n", + "\n", + "A dense representation has the same vector form as a one-hot encoding, but its entries are real numbers rather than a single 1 among 0s, for example [0.45,0.65,0.14,1.15,0.97]" + ] + }, + { + "cell_type": "markdown", + "id": "4db86da3", + "metadata": {}, + "source": [ + "# Bag of Words Representation" + ] + }, + { + "cell_type": "markdown", + "id": "44dc9252", + "metadata": {}, + "source": [ + "As the name suggests, a bag-of-words representation puts our vocabulary into a bag and takes words out of it whenever we need to represent a sentence. One way to build the vocabulary for the bag is shown below." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "823f8f2d", + "metadata": {}, + "outputs": [], + "source": [ + "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]  # our corpus\n", + "word_list = ' '.join(corpus).split()\n", + "word_list = list(sorted(set(word_list)))\n", + "word_dict = {w: i for i, w in enumerate(word_list)}\n", + "number_dict = {i: w for i, w in enumerate(word_list)}" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "8eaeb37d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'do': 0,\n", + " 'drinking': 1,\n", + " 'hate': 2,\n", + " 'i': 3,\n", + " 'like': 4,\n", + " 'love': 5,\n", + " 'nlp': 6,\n", + " 'playing': 7,\n", + " 'reading': 8}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word_dict" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "2bf380c8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 'do',\n", + " 1: 'drinking',\n", + " 2: 'hate',\n", + " 3: 'i',\n", + " 4: 'like',\n", + " 5: 'love',\n", + " 6: 'nlp',\n", + " 7: 'playing',\n", + " 8: 'reading'}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "number_dict" + ] + }, + { + "cell_type": "markdown", + "id": "90e0ef43", + "metadata": {}, + "source": [ + "From this vocabulary we can build a 9-dimensional one-hot encoding, as shown below (besides np.eye, it can also be built with scikit-learn)." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "9821ed2a", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + 
"voc_size = len(word_dict)\n", + "bow = []\n", + "for i,name in enumerate(word_dict):\n", + " bow.append(np.eye(voc_size)[word_dict[name]])" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "03f1f12f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 1., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 1., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 1., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 1., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 0., 0., 1.])]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bow" + ] + }, + { + "cell_type": "markdown", + "id": "086a5fd2", + "metadata": {}, + "source": [ + "# N-gram: A Statistical Language Model\n", + "An N-gram model is a statistical language model that uses the dependence between neighboring words to predict the next word. It models runs of n consecutive words in the training text and estimates how likely a word is given the words that precede it. For example, given the context “the cat sat on the”, an N-gram model might predict that the next word is “mat” or “hat”.\n", + "\n", + "The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the predictions may be inaccurate. If there is too little training text, the model may fail to capture the complexity of the language.  \n", + "\n", + "**Advantages of the N-gram model:**\n", + "\n", + "Simple to use: the idea behind the N-gram model is very simple, and it is easy to implement.  \n", + "Captures local dependencies: because it predicts the next word from a run of n consecutive words, it captures the dependence between nearby words.  \n", + "Trainable on existing corpora: an N-gram model can be trained on large existing corpora, such as Google's N-gram dataset, which can substantially improve its accuracy.  \n", + "\n", + "**Disadvantages of the N-gram model:**\n", + "\n", + "Unsuitable for small datasets: an N-gram model needs a large amount of training text, so it may not reach high accuracy on small or short-text datasets.  \n", + "Sensitive to noise and errors: the model is estimated from a corpus, so if the corpus contains many grammatical and spelling errors, the predictions may be inaccurate.  \n", + "Cannot capture long-range dependencies: an N-gram model assumes that a word depends only on the few words immediately before it, so it cannot capture relationships that span longer distances or more complex structure in the language." + ] + }, + { + "cell_type": "markdown", + "id": "1f5ad65b", + "metadata": {}, + "source": [ + "# NNLM: Feedforward Neural Network Language Model\n", + "Below, a feedforward neural network model is used to show how a **sliding window** is applied." + ] + }, + { + 
"cell_type": "code", + "execution_count": 17, + "id": "7bddfa77", + "metadata": {}, + "outputs": [], + "source": [ + "#导入必要的库\n", + "import numpy as np\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from tqdm import tqdm\n", + "from torch.autograd import Variable\n", + "dtype = torch.FloatTensor" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "29f23588", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['i',\n", + " 'like',\n", + " 'reading',\n", + " 'i',\n", + " 'love',\n", + " 'drinking',\n", + " 'i',\n", + " 'hate',\n", + " 'playing',\n", + " 'i',\n", + " 'do',\n", + " 'nlp']" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]\n", + "\n", + "word_list = ' '.join(corpus).split()\n", + "word_list" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "12b58886", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 1000 cost = 1.010682\n", + "epoch: 2000 cost = 0.695155\n", + "epoch: 3000 cost = 0.597085\n", + "epoch: 4000 cost = 0.531892\n", + "epoch: 5000 cost = 0.376044\n", + "epoch: 6000 cost = 0.118038\n", + "epoch: 7000 cost = 0.077081\n", + "epoch: 8000 cost = 0.053636\n", + "epoch: 9000 cost = 0.038089\n", + "epoch: 10000 cost = 0.027224\n", + "[['i', 'like'], ['i', 'love'], ['i', 'hate'], ['i', 'do']] -> ['studying', 'datawhale', 'playing', 'nlp']\n" + ] + } + ], + "source": [ + "#构建我们需要的语料库\n", + "corpus = [\"i like studying\", \"i love datawhale\", \"i hate playing\", \"i do nlp\"]\n", + "\n", + "word_list = ' '.join(corpus).split() #将语料库转化为一个个单词 ,如['i', 'like', 'reading', 'i', ...,'nlp']\n", + "word_list = list(sorted(set(word_list))) #用set去重后转化为链表\n", + "# print(word_list)\n", + "\n", + "word_dict = {w: i for i, w in enumerate(word_list)} #将词表转化为字典 
这边是词对应到index\n", + "number_dict = {i: w for i, w in enumerate(word_list)}#这边是index对应到词\n", + "# print(word_dict)\n", + "# print(number_dict)\n", + "\n", + "n_class = len(word_dict) #计算出我们词表的大小,用于后面词向量的构建\n", + "\n", + "m = 2 #词嵌入维度\n", + "n_step = 2 #滑动窗口的大小\n", + "n_hidden = 2 #隐藏层的维度为2\n", + "\n", + "\n", + "def make_batch(sentence): #由于语料库较小,我们象征性将训练集按照批次处理 \n", + " input_batch = []\n", + " target_batch = []\n", + "\n", + " for sen in sentence:\n", + " word = sen.split()\n", + " input = [word_dict[n] for n in word[:-1]]\n", + " target = word_dict[word[-1]]\n", + "\n", + " input_batch.append(input)\n", + " target_batch.append(target)\n", + "\n", + " return input_batch, target_batch\n", + "\n", + "\n", + "class NNLM(nn.Module): #搭建一个NNLM语言模型\n", + " def __init__(self):\n", + " super(NNLM, self).__init__()\n", + " self.embed = nn.Embedding(n_class, m)\n", + " self.W = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))\n", + " self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))\n", + "\n", + " self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))\n", + " self.b = nn.Parameter(torch.randn(n_class).type(dtype))\n", + "\n", + " def forward(self, x):\n", + " x = self.embed(x) # 4 x 2 x 2\n", + " x = x.view(-1, n_step * m)\n", + " tanh = torch.tanh(self.d + torch.mm(x, self.W)) # 4 x 2\n", + " output = self.b + torch.mm(tanh, self.U)\n", + " return output\n", + "\n", + "model = NNLM()\n", + "\n", + "criterion = nn.CrossEntropyLoss() #损失函数的设置\n", + "optimizer = optim.Adam(model.parameters(), lr=0.001) #优化器的设置\n", + "\n", + "input_batch, target_batch = make_batch(corpus) #训练集和标签值\n", + "input_batch = Variable(torch.LongTensor(input_batch))\n", + "target_batch = Variable(torch.LongTensor(target_batch))\n", + "\n", + "for epoch in range(10000): #训练过程\n", + " optimizer.zero_grad()\n", + "\n", + " output = model(input_batch) # input: 4 x 2\n", + "\n", + " loss = criterion(output, target_batch)\n", + "\n", + " if (epoch + 1) % 1000 == 0:\n", + " 
print('epoch:', '%04d' % (epoch + 1), 'cost = {:.6f}'.format(loss.item()))\n", + "\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + "predict = model(input_batch).data.max(1, keepdim=True)[1]  # model prediction\n", + "\n", + "print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in predict.squeeze()])" + ] + }, + { + "cell_type": "markdown", + "id": "93d8cd2f", + "metadata": {}, + "source": [ + "# Word2Vec: the Skip-gram and CBOW Models\n", + "The dense (distributed) representation mentioned earlier can be learned by training a Word2Vec model.\n", + "The skip-gram model predicts the surrounding context words from the center word.\n", + "The CBOW (continuous bag-of-words) model predicts the center word from the surrounding context words." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "066f68a0", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 10000 cost = 1.955088\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 20000 cost = 1.673096\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 30000 cost = 2.247422\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 40000 cost = 2.289902\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 50000 cost = 2.396217\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 61%|██████ | 60572/100000 [00:16<00:11, 
3450.47it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 60000 cost = 1.539688\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 70000 cost = 1.638879\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 80000 cost = 2.279797\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 90000 cost = 1.992100\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 100000 cost = 1.307715\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "打印\n" + ] + }, + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from torch.autograd import variable\n", + "import numpy as np\n", + "import torch\n", + "import matplotlib.pyplot as plt\n", + "from tqdm import tqdm\n", + "\n", + "dtype = torch.FloatTensor\n", + "#我们使用的语料库 \n", + "sentences = ['i like dog','i like cat','i like animal','dog is animal','cat is animal',\n", + " 'dog like meat','cat like meat','cat like fish','dog like meat','i like apple',\n", + " 'i hate apple','i like movie','i like read','dog like bark','dog like cat']\n", + "\n", + "\n", + "\n", + "word_sequence = ' '.join(sentences).split() #将语料库的每一句话的每一个词转化为列表 \n", + "#print(word_sequence)\n", + "\n", + "word_list = list(set(word_sequence)) #构建我们的词表 \n", + "#print(word_list)\n", + "\n", + "#word_voc = list(set(word_sequence)) \n", + "\n", + "#接下来对此表中的每一个词编号 这就用到了我们之前提到的one-hot编码 \n", + "\n", + "#词典 词对应着编号\n", + "word_dict = {w:i for i,w in enumerate(word_list)}\n", + "#print(word_dict)\n", + "#编号对应着词\n", + "index_dict = {i:w for w,i in enumerate(word_list)}\n", + "#print(index_dict)\n", + "\n", + "\n", + "batch_size = 2\n", + "voc_size = len(word_list)\n", + "\n", + "skip_grams = []\n", + "for i in range(1,len(word_sequence)-1,3):\n", + " target = word_dict[word_sequence[i]] #当前词对应的id\n", + " context = [word_dict[word_sequence[i-1]],word_dict[word_sequence[i+1]]] #两个上下文词对应的id\n", + "\n", + " for w in context:\n", + " skip_grams.append([target,w])\n", + "\n", + "embedding_size = 10 \n", + "\n", + "\n", + "class Word2Vec(nn.Module):\n", + " def __init__(self):\n", + " super(Word2Vec,self).__init__()\n", + " self.W1 = nn.Parameter(torch.rand(len(word_dict),embedding_size)).type(dtype) \n", + " #将词的one-hot编码对应到词向量中\n", + " self.W2 = nn.Parameter(torch.rand(embedding_size,voc_size)).type(dtype)\n", + " #将词向量 转化为 输出 \n", + " def forward(self,x):\n", + " hidden_layer = torch.matmul(x,self.W1)\n", + " output_layer = 
torch.matmul(hidden_layer,self.W2)\n", + " return output_layer\n", + "\n", + "\n", + "model = Word2Vec()\n", + "criterion = nn.CrossEntropyLoss()\n", + "optimizer = optim.Adam(model.parameters(),lr=1e-5)\n", + "\n", + "#print(len(skip_grams))\n", + "#训练函数\n", + "\n", + "def random_batch(data,size):\n", + " random_inputs = []\n", + " random_labels = []\n", + " random_index = np.random.choice(range(len(data)),size,replace=False)\n", + " \n", + " for i in random_index:\n", + " random_inputs.append(np.eye(voc_size)[data[i][0]]) #从一个单位矩阵生成one-hot表示\n", + " random_labels.append(data[i][1])\n", + " \n", + " return random_inputs,random_labels\n", + "\n", + "for epoch in tqdm(range(100000)):\n", + " input_batch,target_batch = random_batch(skip_grams,batch_size) # X -> y\n", + " input_batch = torch.Tensor(input_batch)\n", + " target_batch = torch.LongTensor(target_batch)\n", + "\n", + " optimizer.zero_grad()\n", + "\n", + " output = model(input_batch)\n", + "\n", + " loss = criterion(output,target_batch)\n", + " if((epoch+1)%10000==0):\n", + " print(\"epoch:\",\"%04d\" %(epoch+1),'cost =' ,'{:.6f}'.format(loss))\n", + "\n", + " loss.backward() \n", + " optimizer.step()\n", + "\n", + "for i , label in enumerate(word_list):\n", + " W1,_ = model.parameters()\n", + " x,y = float(W1[i][0]),float(W1[i][1])\n", + " plt.scatter(x,y)\n", + " plt.annotate(label,xy=(x,y),xytext=(5,2),textcoords='offset points',ha='right',va='bottom')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1edccf25", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "pytorch", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.9 (default, Aug 31 2020, 12:42:55) \n[GCC 
7.3.0]" + }, + "vscode": { + "interpreter": { + "hash": "7648c2b9d25760d0d65f53f9b9a34de48caa24d8265d64b0ff81e2f2641d528d" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From a399dea09f46145a5486fedbb41c200e0438ccd5 Mon Sep 17 00:00:00 2001 From: ZHOUhuichi <96902323+ZHOUhuichi@users.noreply.github.com> Date: Wed, 14 Dec 2022 23:33:47 +0800 Subject: [PATCH 2/6] =?UTF-8?q?Create=20=E5=91=A8=E8=BE=89=E6=B1=A0NLP?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../\345\221\250\350\276\211\346\261\240NLP" | 1 + 1 file changed, 1 insertion(+) create mode 100644 "docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP" diff --git "a/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP" "b/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP" new file mode 100644 index 000000000..8b1378917 --- /dev/null +++ "b/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP" @@ -0,0 +1 @@ + From f43674db2c361ef62a15d192419a40be9735bc50 Mon Sep 17 00:00:00 2001 From: ZHOUhuichi <96902323+ZHOUhuichi@users.noreply.github.com> Date: Wed, 14 Dec 2022 23:35:26 +0800 Subject: [PATCH 3/6] Add files via upload --- .../zhcNLP/NLP\345\237\272\347\241\200.ipynb" | 687 ++++++++++++++++++ .../zhcNLP/NLP\345\237\272\347\241\200.md" | 434 +++++++++++ .../zhcNLP/output_16_20.png" | Bin 0 -> 10308 bytes 3 files changed, 1121 insertions(+) create mode 100644 "docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.ipynb" create mode 100644 "docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md" create mode 100644 "docs/\347\254\254\345\215\201\347\253\240/zhcNLP/output_16_20.png" diff --git "a/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.ipynb" "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.ipynb" new file mode 100644 index 
000000000..ba5bcc1f9 --- /dev/null +++ "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.ipynb" @@ -0,0 +1,687 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "119ec186", + "metadata": {}, + "source": [ + "# 词嵌入(概念部分)" + ] + }, + { + "cell_type": "markdown", + "id": "f8e5639e", + "metadata": {}, + "source": [ + "###   在了解什么是词嵌入之前,我们可以思考一下计算机如何识别人类的输入?
\n", + " 计算机通过将输入信息解析为0和1这般的二进制编码,从而将人类语言转化为机器语言,进行理解。
\n", + " 我们先引入一个概念**one-hot编码**,也称为**独热编码**,在给定维度的情况下,一行向量有且仅有一个值为1,例如维度为5的向量[0,0,0,0,1]
\n", + " 例如,我们在幼儿园或小学学习汉语的时候,首先先识字和词,字和词就会保存在我们的大脑中的某处。
\n", + "\n", + "
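Suppose a child has just learned four words: [我] [特别] [喜欢] [学习] (I, really, like, studying)
\n", + "The computer can then allocate a one-hot encoding of dimension 4 for this child's vocabulary.
\n", + "For Chinese text we first segment the sentence into words: 我 特别 喜欢 学习
\n", + "We can then assign 我->[1 0 0 0] 特别->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
\n", + "Given the sentence 我喜欢学习 (I like studying), summing the one-hot vectors of its words gives the sentence vector [1 0 1 1]

\n", + "This raises two problems:
\n", + "1. As the child's vocabulary grows to tens of thousands of words, the vectors built this way become very high-dimensional and sparse.
\n", + "2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly meaning can). Their one-hot vectors are mutually orthogonal, so the encoding carries no notion of similarity between words and cannot relate them to one another.
\n", + "For these reasons, one-hot encoding is not a good word embedding method.
\n", + "\n", + "Next, we introduce the **dense representation**.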
\n", + "\n", + "A dense representation has the same vector form as a one-hot encoding, but its entries are real numbers rather than a single 1 among 0s, for example [0.45,0.65,0.14,1.15,0.97]" + ] + }, + { + "cell_type": "markdown", + "id": "4db86da3", + "metadata": {}, + "source": [ + "# Bag of Words Representation" + ] + }, + { + "cell_type": "markdown", + "id": "44dc9252", + "metadata": {}, + "source": [ + "As the name suggests, a bag-of-words representation puts our vocabulary into a bag and takes words out of it whenever we need to represent a sentence. One way to build the vocabulary for the bag is shown below." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "823f8f2d", + "metadata": {}, + "outputs": [], + "source": [ + "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]  # our corpus\n", + "word_list = ' '.join(corpus).split()\n", + "word_list = list(sorted(set(word_list)))\n", + "word_dict = {w: i for i, w in enumerate(word_list)}\n", + "number_dict = {i: w for i, w in enumerate(word_list)}" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "8eaeb37d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'do': 0,\n", + " 'drinking': 1,\n", + " 'hate': 2,\n", + " 'i': 3,\n", + " 'like': 4,\n", + " 'love': 5,\n", + " 'nlp': 6,\n", + " 'playing': 7,\n", + " 'reading': 8}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "word_dict" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "2bf380c8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 'do',\n", + " 1: 'drinking',\n", + " 2: 'hate',\n", + " 3: 'i',\n", + " 4: 'like',\n", + " 5: 'love',\n", + " 6: 'nlp',\n", + " 7: 'playing',\n", + " 8: 'reading'}" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "number_dict" + ] + }, + { + "cell_type": "markdown", + "id": "90e0ef43", + "metadata": {}, + "source": [ + "From this vocabulary we can build a 9-dimensional one-hot encoding, as shown below (besides np.eye, it can also be built with scikit-learn)." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "9821ed2a", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + 
"voc_size = len(word_dict)\n", + "bow = []\n", + "for i,name in enumerate(word_dict):\n", + " bow.append(np.eye(voc_size)[word_dict[name]])" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "03f1f12f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 1., 0., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 1., 0., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 1., 0., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 1., 0., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", + " array([0., 0., 0., 0., 0., 0., 0., 0., 1.])]" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bow" + ] + }, + { + "cell_type": "markdown", + "id": "086a5fd2", + "metadata": {}, + "source": [ + "# N-gram: A Statistical Language Model\n", + "An N-gram model is a statistical language model that uses the dependence between neighboring words to predict the next word. It models runs of n consecutive words in the training text and estimates how likely a word is given the words that precede it. For example, given the context “the cat sat on the”, an N-gram model might predict that the next word is “mat” or “hat”.\n", + "\n", + "The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the predictions may be inaccurate. If there is too little training text, the model may fail to capture the complexity of the language.  \n", + "\n", + "**Advantages of the N-gram model:**\n", + "\n", + "Simple to use: the idea behind the N-gram model is very simple, and it is easy to implement.  \n", + "Captures local dependencies: because it predicts the next word from a run of n consecutive words, it captures the dependence between nearby words.  \n", + "Trainable on existing corpora: an N-gram model can be trained on large existing corpora, such as Google's N-gram dataset, which can substantially improve its accuracy.  \n", + "\n", + "**Disadvantages of the N-gram model:**\n", + "\n", + "Unsuitable for small datasets: an N-gram model needs a large amount of training text, so it may not reach high accuracy on small or short-text datasets.  \n", + "Sensitive to noise and errors: the model is estimated from a corpus, so if the corpus contains many grammatical and spelling errors, the predictions may be inaccurate.  \n", + "Cannot capture long-range dependencies: an N-gram model assumes that a word depends only on the few words immediately before it, so it cannot capture relationships that span longer distances or more complex structure in the language." + ] + }, + { + "cell_type": "markdown", + "id": "1f5ad65b", + "metadata": {}, + "source": [ + "# NNLM: Feedforward Neural Network Language Model\n", + "Below, a feedforward neural network model is used to show how a **sliding window** is applied." + ] + }, + { + 
"cell_type": "code", + "execution_count": 17, + "id": "7bddfa77", + "metadata": {}, + "outputs": [], + "source": [ + "#导入必要的库\n", + "import numpy as np\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from tqdm import tqdm\n", + "from torch.autograd import Variable\n", + "dtype = torch.FloatTensor" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "29f23588", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['i',\n", + " 'like',\n", + " 'reading',\n", + " 'i',\n", + " 'love',\n", + " 'drinking',\n", + " 'i',\n", + " 'hate',\n", + " 'playing',\n", + " 'i',\n", + " 'do',\n", + " 'nlp']" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]\n", + "\n", + "word_list = ' '.join(corpus).split()\n", + "word_list" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "12b58886", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 1000 cost = 1.010682\n", + "epoch: 2000 cost = 0.695155\n", + "epoch: 3000 cost = 0.597085\n", + "epoch: 4000 cost = 0.531892\n", + "epoch: 5000 cost = 0.376044\n", + "epoch: 6000 cost = 0.118038\n", + "epoch: 7000 cost = 0.077081\n", + "epoch: 8000 cost = 0.053636\n", + "epoch: 9000 cost = 0.038089\n", + "epoch: 10000 cost = 0.027224\n", + "[['i', 'like'], ['i', 'love'], ['i', 'hate'], ['i', 'do']] -> ['studying', 'datawhale', 'playing', 'nlp']\n" + ] + } + ], + "source": [ + "#构建我们需要的语料库\n", + "corpus = [\"i like studying\", \"i love datawhale\", \"i hate playing\", \"i do nlp\"]\n", + "\n", + "word_list = ' '.join(corpus).split() #将语料库转化为一个个单词 ,如['i', 'like', 'reading', 'i', ...,'nlp']\n", + "word_list = list(sorted(set(word_list))) #用set去重后转化为链表\n", + "# print(word_list)\n", + "\n", + "word_dict = {w: i for i, w in enumerate(word_list)} #将词表转化为字典 
这边是词对应到index\n", + "number_dict = {i: w for i, w in enumerate(word_list)}#这边是index对应到词\n", + "# print(word_dict)\n", + "# print(number_dict)\n", + "\n", + "n_class = len(word_dict) #计算出我们词表的大小,用于后面词向量的构建\n", + "\n", + "m = 2 #词嵌入维度\n", + "n_step = 2 #滑动窗口的大小\n", + "n_hidden = 2 #隐藏层的维度为2\n", + "\n", + "\n", + "def make_batch(sentence): #由于语料库较小,我们象征性将训练集按照批次处理 \n", + " input_batch = []\n", + " target_batch = []\n", + "\n", + " for sen in sentence:\n", + " word = sen.split()\n", + " input = [word_dict[n] for n in word[:-1]]\n", + " target = word_dict[word[-1]]\n", + "\n", + " input_batch.append(input)\n", + " target_batch.append(target)\n", + "\n", + " return input_batch, target_batch\n", + "\n", + "\n", + "class NNLM(nn.Module): #搭建一个NNLM语言模型\n", + " def __init__(self):\n", + " super(NNLM, self).__init__()\n", + " self.embed = nn.Embedding(n_class, m)\n", + " self.W = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))\n", + " self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))\n", + "\n", + " self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))\n", + " self.b = nn.Parameter(torch.randn(n_class).type(dtype))\n", + "\n", + " def forward(self, x):\n", + " x = self.embed(x) # 4 x 2 x 2\n", + " x = x.view(-1, n_step * m)\n", + " tanh = torch.tanh(self.d + torch.mm(x, self.W)) # 4 x 2\n", + " output = self.b + torch.mm(tanh, self.U)\n", + " return output\n", + "\n", + "model = NNLM()\n", + "\n", + "criterion = nn.CrossEntropyLoss() #损失函数的设置\n", + "optimizer = optim.Adam(model.parameters(), lr=0.001) #优化器的设置\n", + "\n", + "input_batch, target_batch = make_batch(corpus) #训练集和标签值\n", + "input_batch = Variable(torch.LongTensor(input_batch))\n", + "target_batch = Variable(torch.LongTensor(target_batch))\n", + "\n", + "for epoch in range(10000): #训练过程\n", + " optimizer.zero_grad()\n", + "\n", + " output = model(input_batch) # input: 4 x 2\n", + "\n", + " loss = criterion(output, target_batch)\n", + "\n", + " if (epoch + 1) % 1000 == 0:\n", + " 
print('epoch:', '%04d' % (epoch + 1), 'cost = {:.6f}'.format(loss.item()))\n", + "\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + "predict = model(input_batch).data.max(1, keepdim=True)[1]  # model prediction\n", + "\n", + "print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in predict.squeeze()])" + ] + }, + { + "cell_type": "markdown", + "id": "93d8cd2f", + "metadata": {}, + "source": [ + "# Word2Vec: the Skip-gram and CBOW Models\n", + "The dense (distributed) representation mentioned earlier can be learned by training a Word2Vec model.\n", + "The skip-gram model predicts the surrounding context words from the center word.\n", + "The CBOW (continuous bag-of-words) model predicts the center word from the surrounding context words." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "066f68a0", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 10000 cost = 1.955088\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 20000 cost = 1.673096\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 30000 cost = 2.247422\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 40000 cost = 2.289902\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 50000 cost = 2.396217\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 61%|██████ | 60572/100000 [00:16<00:11, 
3450.47it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 60000 cost = 1.539688\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 70000 cost = 1.638879\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 80000 cost = 2.279797\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " 90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 90000 cost = 1.992100\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "epoch: 100000 cost = 1.307715\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "打印\n" + ] + }, + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from torch.autograd import variable\n", + "import numpy as np\n", + "import torch\n", + "import matplotlib.pyplot as plt\n", + "from tqdm import tqdm\n", + "\n", + "dtype = torch.FloatTensor\n", + "#我们使用的语料库 \n", + "sentences = ['i like dog','i like cat','i like animal','dog is animal','cat is animal',\n", + " 'dog like meat','cat like meat','cat like fish','dog like meat','i like apple',\n", + " 'i hate apple','i like movie','i like read','dog like bark','dog like cat']\n", + "\n", + "\n", + "\n", + "word_sequence = ' '.join(sentences).split() #将语料库的每一句话的每一个词转化为列表 \n", + "#print(word_sequence)\n", + "\n", + "word_list = list(set(word_sequence)) #构建我们的词表 \n", + "#print(word_list)\n", + "\n", + "#word_voc = list(set(word_sequence)) \n", + "\n", + "#接下来对此表中的每一个词编号 这就用到了我们之前提到的one-hot编码 \n", + "\n", + "#词典 词对应着编号\n", + "word_dict = {w:i for i,w in enumerate(word_list)}\n", + "#print(word_dict)\n", + "#编号对应着词\n", + "index_dict = {i:w for w,i in enumerate(word_list)}\n", + "#print(index_dict)\n", + "\n", + "\n", + "batch_size = 2\n", + "voc_size = len(word_list)\n", + "\n", + "skip_grams = []\n", + "for i in range(1,len(word_sequence)-1,3):\n", + " target = word_dict[word_sequence[i]] #当前词对应的id\n", + " context = [word_dict[word_sequence[i-1]],word_dict[word_sequence[i+1]]] #两个上下文词对应的id\n", + "\n", + " for w in context:\n", + " skip_grams.append([target,w])\n", + "\n", + "embedding_size = 10 \n", + "\n", + "\n", + "class Word2Vec(nn.Module):\n", + " def __init__(self):\n", + " super(Word2Vec,self).__init__()\n", + " self.W1 = nn.Parameter(torch.rand(len(word_dict),embedding_size)).type(dtype) \n", + " #将词的one-hot编码对应到词向量中\n", + " self.W2 = nn.Parameter(torch.rand(embedding_size,voc_size)).type(dtype)\n", + " #将词向量 转化为 输出 \n", + " def forward(self,x):\n", + " hidden_layer = torch.matmul(x,self.W1)\n", + " output_layer = 
torch.matmul(hidden_layer,self.W2)\n", + " return output_layer\n", + "\n", + "\n", + "model = Word2Vec()\n", + "criterion = nn.CrossEntropyLoss()\n", + "optimizer = optim.Adam(model.parameters(),lr=1e-5)\n", + "\n", + "#print(len(skip_grams))\n", + "#训练函数\n", + "\n", + "def random_batch(data,size):\n", + " random_inputs = []\n", + " random_labels = []\n", + " random_index = np.random.choice(range(len(data)),size,replace=False)\n", + " \n", + " for i in random_index:\n", + " random_inputs.append(np.eye(voc_size)[data[i][0]]) #从一个单位矩阵生成one-hot表示\n", + " random_labels.append(data[i][1])\n", + " \n", + " return random_inputs,random_labels\n", + "\n", + "for epoch in tqdm(range(100000)):\n", + " input_batch,target_batch = random_batch(skip_grams,batch_size) # X -> y\n", + " input_batch = torch.Tensor(input_batch)\n", + " target_batch = torch.LongTensor(target_batch)\n", + "\n", + " optimizer.zero_grad()\n", + "\n", + " output = model(input_batch)\n", + "\n", + " loss = criterion(output,target_batch)\n", + " if((epoch+1)%10000==0):\n", + " print(\"epoch:\",\"%04d\" %(epoch+1),'cost =' ,'{:.6f}'.format(loss))\n", + "\n", + " loss.backward() \n", + " optimizer.step()\n", + "\n", + "for i , label in enumerate(word_list):\n", + " W1,_ = model.parameters()\n", + " x,y = float(W1[i][0]),float(W1[i][1])\n", + " plt.scatter(x,y)\n", + " plt.annotate(label,xy=(x,y),xytext=(5,2),textcoords='offset points',ha='right',va='bottom')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1edccf25", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "pytorch", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.9 (default, Aug 31 2020, 12:42:55) \n[GCC 
7.3.0]"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "7648c2b9d25760d0d65f53f9b9a34de48caa24d8265d64b0ff81e2f2641d528d"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git "a/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md" "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md"
new file mode 100644
index 000000000..51843e99d
--- /dev/null
+++ "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md"
@@ -0,0 +1,434 @@
+# Word Embeddings (Concepts)
+
+###   Before looking at what a word embedding is, let us consider how a computer recognizes human input.
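+  A computer parses its input into binary codes of 0s and 1s, turning human language into a machine representation it can work with.
+  We first introduce **one-hot encoding**: for a given dimension, a one-hot vector has exactly one entry equal to 1 and all other entries equal to 0, for example the 5-dimensional vector [0,0,0,0,1].
+  For example, when we learn Chinese in kindergarten or primary school, we start by learning characters and words, which are then stored somewhere in our brains.
+
+Suppose a child has just learned four words --> [我] [特别] [喜欢] [学习] ("I", "especially", "like", "studying")
+  Our computer can then set up a 4-dimensional one-hot encoding for the child's vocabulary.
+  For Chinese, we first segment the text into words: 我 特别 喜欢 学习
+  We can then assign 我->[1 0 0 0] 特别->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
+  Given the sentence 我喜欢学习 ("I like studying"), the computer sums the one-hot vectors of its words and represents the sentence as [1 0 1 1]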
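The four-word example can be reproduced in a few lines. This is only a minimal sketch in plain Python: the helper names `one_hot` and `sentence_vector` are hypothetical, introduced here for illustration, and the sentence is assumed to be segmented into words already.

```python
# Vocabulary from the example; the index of each word is its position in this list.
vocab = ["我", "特别", "喜欢", "学习"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot vector of a single vocabulary word."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


def sentence_vector(words):
    """Sum the one-hot vectors of the (already segmented) words of a sentence."""
    total = [0] * len(vocab)
    for word in words:
        for i, value in enumerate(one_hot(word)):
            total[i] += value
    return total


print(one_hot("喜欢"))                         # [0, 0, 1, 0]
print(sentence_vector(["我", "喜欢", "学习"]))  # [1, 0, 1, 1]
```

The vector length equals the vocabulary size, so every new word adds one more dimension.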

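+  This raises a few issues:
+  1. As the child's vocabulary grows to thousands or tens of thousands of words, the vectors built this way become very high-dimensional and sparse.
+  2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly "can"). Under one-hot encoding their vectors are mutually orthogonal, so the encoding carries no notion of similarity and cannot relate them to each other.
+  For these reasons, we do not consider one-hot encoding a good word-embedding method.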
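The orthogonality problem can be checked numerically. The sketch below uses plain Python; the dense vectors are made-up numbers chosen only for illustration, not trained embeddings, and the helper names `dot` and `cosine` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# One-hot codes for three near-synonyms: every pair is orthogonal.
one_hot = {"能": [1, 0, 0], "会": [0, 1, 0], "可以": [0, 0, 1]}
print(cosine(one_hot["能"], one_hot["会"]))  # 0.0

# Illustrative dense vectors (assumed values): similar words can point in similar directions.
dense = {"能": [0.9, 0.1], "会": [0.8, 0.2], "学习": [0.1, 0.9]}
print(cosine(dense["能"], dense["会"]) > cosine(dense["能"], dense["学习"]))  # True
```

With one-hot codes the similarity of any two different words is 0, whatever their meaning, whereas a dense representation at least has room to encode that 能 is closer to 会 than to 学习.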
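+
+  Next, we introduce the **dense representation**.
+
+  A dense representation is a fixed-length vector just like a one-hot code, but its entries are arbitrary real values, for example [0.45,0.65,0.14,1.15,0.97]
+
+# Bag of Words Representation
+
+  As the name suggests, a bag-of-words representation puts our vocabulary into a bag; when we want to express something, we take words out of it. One way to build the bag is shown below.
+
+
+```python
+corpus = ["i like reading", "i love drinking", "i hate playing", "i do nlp"]  # our corpus
+word_list = ' '.join(corpus).split()
+word_list = list(sorted(set(word_list)))
+word_dict = {w: i for i, w in enumerate(word_list)}
+number_dict = {i: w for i, w in enumerate(word_list)}
+```
+
+
+```python
+word_dict
+```
+
+
+
+
+    {'do': 0,
+     'drinking': 1,
+     'hate': 2,
+     'i': 3,
+     'like': 4,
+     'love': 5,
+     'nlp': 6,
+     'playing': 7,
+     'reading': 8}
+
+
+
+
+```python
+number_dict
+```
+
+
+
+
+    {0: 'do',
+     1: 'drinking',
+     2: 'hate',
+     3: 'i',
+     4: 'like',
+     5: 'love',
+     6: 'nlp',
+     7: 'playing',
+     8: 'reading'}
+
+
+
+  Based on the mapping above, we can build 9-dimensional one-hot encodings as follows (besides np.eye, scikit-learn can also be used)
+
+
+```python
+import numpy as np
+voc_size = len(word_dict)
+bow = []
+for i,name in enumerate(word_dict):
+    bow.append(np.eye(voc_size)[word_dict[name]])  # row k of the identity matrix is the one-hot code of word k
+```
+
+
+```python
+bow
+```
+
+
+
+
+    [array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),
+     array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),
+     array([0., 0., 1., 0., 0., 0., 0., 0., 0.]),
+     array([0., 0., 0., 1., 0., 0., 0., 0., 0.]),
+     array([0., 0., 0., 0., 1., 0., 0., 0., 0.]),
+     array([0., 0., 0., 0., 0., 1., 0., 0., 0.]),
+     array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),
+     array([0., 0., 0., 0., 0., 0., 0., 1., 0.]),
+     array([0., 0., 0., 0., 0., 0., 0., 0., 1.])]
+
+
+
+# N-gram: A Statistical Language Model
+  An N-gram model is a language model that uses the correlation between neighbouring words to predict the next word. It models runs of n consecutive words in a text, so the next word is predicted from the words that immediately precede it. For example, if a text contains the sequence "the cat sat on", an N-gram model might predict that it continues with "the mat" or "a hat".
+
+  The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the predictions may be inaccurate. If the training text is small, the model may also fail to capture the complexity of the language.
+
+**Advantages of N-gram models:**
+
+Simple and easy to use: the concept is very simple and the implementation is straightforward.
+Captures local correlations: because the next word is predicted from a run of n consecutive words, the model captures correlations between nearby words.
+Can be trained on existing corpora: large existing corpora, such as Google's N-gram dataset, can be used for training, which can greatly improve accuracy.
+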
+**Disadvantages of N-gram models:**
+
+Unsuitable for small datasets: N-gram models need a large amount of text for training, so they may not reach high accuracy on small text datasets.
+Sensitive to noise and errors: the model is estimated from a corpus, so if the corpus contains many grammatical and spelling errors, the predictions may be inaccurate.
+Cannot capture long-range or complex dependencies: an N-gram model only conditions on a fixed window of preceding words, so dependencies that span beyond that window, and more complex non-linear relationships in language, are missed.
+
+# NNLM: Feed-Forward Neural Network Language Model
+  Below, a feed-forward neural network model is used to demonstrate a **sliding window** over the input words.
+
+
+```python
+# Import the required libraries
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from tqdm import tqdm
+from torch.autograd import Variable
+dtype = torch.FloatTensor
+```
+
+
+```python
+corpus = ["i like reading", "i love drinking", "i hate playing", "i do nlp"]
+
+word_list = ' '.join(corpus).split()
+word_list
+```
+
+
+
+
+    ['i',
+     'like',
+     'reading',
+     'i',
+     'love',
+     'drinking',
+     'i',
+     'hate',
+     'playing',
+     'i',
+     'do',
+     'nlp']
+
+
+
+
+```python
+# Build the corpus
+corpus = ["i like studying", "i love datawhale", "i hate playing", "i do nlp"]
+
+word_list = ' '.join(corpus).split()  # split the corpus into words, e.g. ['i', 'like', 'studying', 'i', ..., 'nlp']
+word_list = list(sorted(set(word_list)))  # deduplicate with set, then convert to a sorted list
+# print(word_list)
+
+word_dict = {w: i for i, w in enumerate(word_list)}  # word -> index
+number_dict = {i: w for i, w in enumerate(word_list)}  # index -> word
+# print(word_dict)
+# print(number_dict)
+
+n_class = len(word_dict)  # vocabulary size, used later to build the word vectors
+
+m = 2  # word embedding dimension
+n_step = 2  # size of the sliding window (number of context words)
+n_hidden = 2  # hidden layer dimension
+
+
+def make_batch(sentence):  # the corpus is tiny, so the whole training set is treated as one batch
+    input_batch = []
+    target_batch = []
+
+    for sen in sentence:
+        word = sen.split()
+        input = [word_dict[n] for n in word[:-1]]  # all words but the last are the input
+        target = word_dict[word[-1]]  # the last word is the prediction target
+
+        input_batch.append(input)
+        target_batch.append(target)
+
+    return input_batch, target_batch
+
+
+class NNLM(nn.Module):  # a minimal NNLM language model
+    def __init__(self):
+        super(NNLM, self).__init__()
+        self.embed = nn.Embedding(n_class, m)
+        self.W = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))
+        self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))
+
+        self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))
+        self.b = nn.Parameter(torch.randn(n_class).type(dtype))
+
+    def forward(self, x):
+        x = self.embed(x)  # 4 x 2 x 2
+        x = x.view(-1, n_step * m)
+        tanh = torch.tanh(self.d + torch.mm(x, self.W))  # 4 x 2
+        output = self.b + torch.mm(tanh, self.U)
+        return output
+
+model = NNLM()
+
+criterion = nn.CrossEntropyLoss()  # loss function
+optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer
+
+input_batch, target_batch = make_batch(corpus)  # training inputs and labels
+input_batch = Variable(torch.LongTensor(input_batch))
+target_batch = Variable(torch.LongTensor(target_batch))
+
+for epoch in range(10000):  # training loop
+    optimizer.zero_grad()
+
+    output = model(input_batch)  # input: 4 x 2
+
+    loss = criterion(output, target_batch)
+
+    if (epoch + 1) % 1000 == 0:
+        print('epoch:', '%04d' % (epoch + 1), 'cost = {:.6f}'.format(loss.item()))
+
+    loss.backward()
+    optimizer.step()
+
+predict = model(input_batch).data.max(1, keepdim=True)[1]  # model prediction
+
+print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in predict.squeeze()])
+```
+
+    epoch: 1000 cost = 1.010682
+    epoch: 2000 cost = 0.695155
+    epoch: 3000 cost = 0.597085
+    epoch: 4000 cost = 0.531892
+    epoch: 5000 cost = 0.376044
+    epoch: 6000 cost = 0.118038
+    epoch: 7000 cost = 0.077081
+    epoch: 8000 cost = 0.053636
+    epoch: 9000 cost = 0.038089
+    epoch: 10000 cost = 0.027224
+    [['i', 'like'], ['i', 'love'], ['i', 'hate'], ['i', 'do']] -> ['studying', 'datawhale', 'playing', 'nlp']
+
+
+# Word2Vec Model: Skip-gram and CBOW Variants
+  The dense vectors (distributed representations) mentioned earlier can be learned by training a Word2Vec model.
+  The skip-gram model predicts the surrounding words from the center word.
+  The CBOW (continuous bag-of-words) model predicts the center word from the surrounding words.
+
+
+```python
+import torch.nn as nn
+import torch.optim as optim
+import numpy as np
+import torch
+import matplotlib.pyplot as plt
+from tqdm import tqdm
+
+dtype = torch.FloatTensor
+# The corpus we use
+sentences = ['i like dog','i like cat','i like animal','dog is animal','cat is animal',
+             'dog like meat','cat like meat','cat like fish','dog like meat','i like apple',
+             'i hate apple','i like movie','i like read','dog like bark','dog like cat']
+
+
+
+word_sequence = ' '.join(sentences).split()  # flatten every word of every sentence into one list
+#print(word_sequence)
+
+word_list = list(set(word_sequence))  # build the vocabulary
+#print(word_list)
+
+#word_voc = list(set(word_sequence))
+
+# Next, assign an index to every word in the vocabulary; the indices are later turned into the one-hot codes introduced earlier
+
+# dictionary: word -> index
+word_dict = {w:i for i,w in enumerate(word_list)}
+#print(word_dict)
+# index -> word
+index_dict = {i:w for i,w in enumerate(word_list)}
+#print(index_dict)
+
+
+batch_size = 2
+voc_size = len(word_list)
+
+skip_grams = []
+# every sentence has exactly three words, so stepping by 3 from index 1 visits the middle word of each sentence
+for i in range(1,len(word_sequence)-1,3):
+    target = word_dict[word_sequence[i]]  # id of the current (center) word
+    context = [word_dict[word_sequence[i-1]],word_dict[word_sequence[i+1]]]  # ids of the two context words
+
+    for w in context:
+        skip_grams.append([target,w])
+
+embedding_size = 10
+
+
+class Word2Vec(nn.Module):
+    def __init__(self):
+        super(Word2Vec,self).__init__()
+        self.W1 = nn.Parameter(torch.rand(voc_size,embedding_size).type(dtype))
+        # maps the one-hot code of a word to its word vector
+        self.W2 = nn.Parameter(torch.rand(embedding_size,voc_size).type(dtype))
+        # maps the word vector to the output scores
+    def forward(self,x):
+        hidden_layer = torch.matmul(x,self.W1)
+        output_layer = torch.matmul(hidden_layer,self.W2)
+        return output_layer
+
+
+model = Word2Vec()
+criterion = nn.CrossEntropyLoss()
+optimizer = optim.Adam(model.parameters(),lr=1e-5)
+
+#print(len(skip_grams))
+# helper that samples a random training batch
+
+def random_batch(data,size):
+    random_inputs = []
+    random_labels = []
+    random_index = np.random.choice(range(len(data)),size,replace=False)
+
+    for i in random_index:
+        random_inputs.append(np.eye(voc_size)[data[i][0]])  # take a row of the identity matrix as the one-hot input
+        random_labels.append(data[i][1])
+
+    return random_inputs,random_labels
+
+for epoch in tqdm(range(100000)):
+    input_batch,target_batch = random_batch(skip_grams,batch_size)  # X -> y
+    input_batch = torch.Tensor(np.array(input_batch))
+    target_batch = torch.LongTensor(target_batch)
+
+    optimizer.zero_grad()
+
+    output = model(input_batch)
+
+    loss = criterion(output,target_batch)
+    if((epoch+1)%10000==0):
+        print("epoch:","%04d" %(epoch+1),'cost =' ,'{:.6f}'.format(loss.item()))
+
+    loss.backward()
+    optimizer.step()
+
+# plot the first two embedding dimensions of every word
+for i , label in enumerate(word_list):
+    W1,_ = model.parameters()
+    x,y = float(W1[i][0]),float(W1[i][1])
+    plt.scatter(x,y)
+    plt.annotate(label,xy=(x,y),xytext=(5,2),textcoords='offset points',ha='right',va='bottom')
+plt.show()
+```
+
+     11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]
+
+    epoch: 10000 cost = 1.955088
+
+
+     21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]
+
+    epoch: 20000 cost = 1.673096
+
+
+     30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]
+
+    epoch: 30000 cost = 2.247422
+
+
+     41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]
+
+    epoch: 40000 cost = 2.289902
+
+
+     50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]
+
+    epoch: 50000 cost = 2.396217
+
+
+     61%|██████ | 60572/100000 [00:16<00:11, 3450.47it/s]
+
+    epoch: 60000 cost = 1.539688
+
+
+     71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s]
+
+    epoch: 70000 cost = 1.638879
+
+
+     80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s]
+
+    epoch: 80000 cost = 2.279797
+
+
+     90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s]
+
+    epoch: 90000 cost = 1.992100
+
+
+    100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s]
+
+
+    epoch: 100000 cost = 1.307715
+
+
+
+
+![png](output_16_20.png)
+
+
+
+    打印
+
+
+
+
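How skip-gram training pairs are formed can be illustrated in isolation. The following is a minimal sketch in plain Python: `skip_gram_pairs` and its `window` parameter are hypothetical names introduced here. Unlike the loop in the code above, which only uses the middle word of each three-word sentence as the center, this sketch treats every token as a center word.

```python
def skip_gram_pairs(tokens, window=1):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # first neighbour index, clipped at the start
        hi = min(len(tokens), i + window + 1)  # one past the last neighbour index
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


print(skip_gram_pairs("i like dog".split()))
# [('i', 'like'), ('like', 'i'), ('like', 'dog'), ('dog', 'like')]
```

Each pair gives one training example: the center word is the input and the context word is the label, which is the role played by `[target, w]` in `skip_grams` above. For CBOW the roles are reversed, with the context words as input and the center word as the label.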
+
+
+
+```python
+
+```
diff --git "a/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/output_16_20.png" "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/output_16_20.png"
new file mode 100644
index 0000000000000000000000000000000000000000..56fa3da4e714085b37e5b0637ef20ecc8ad6edcd
GIT binary patch
literal 10308
[binary PNG data for output_16_20.png omitted]

Date: Wed, 14 Dec 2022 23:37:37 +0800
Subject: [PATCH 4/6] =?UTF-8?q?Delete=20=E5=91=A8=E8=BE=89=E6=B1=A0NLP?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../\345\221\250\350\276\211\346\261\240NLP" | 1 -
 1 file changed, 1 deletion(-)
 delete mode 100644 "docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP"

diff --git "a/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP" "b/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP"
deleted file mode 100644
index 8b1378917..000000000
--- "a/docs/\347\254\254\345\215\201\347\253\240/\345\221\250\350\276\211\346\261\240NLP"
+++ /dev/null
@@ -1 +0,0 @@
-

From d6f7b3f1adca8eff31e83ccc2c7675dbf8141129 Mon Sep 17 00:00:00 2001
From: Zhikang Niu <73390819+NoFish-528@users.noreply.github.com>
Date: Wed, 14 Dec 2022 23:41:57 +0800
Subject: [PATCH 5/6] =?UTF-8?q?=E5=88=A0=E9=99=A4=E7=A9=BA=E6=A0=BC?= =?UTF-8?q?=E7=AD=89=E7=BC=96=E8=BE=91=E9=97=AE=E9=A2=98?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding:
8bit

---
 .../zhcNLP/NLP\345\237\272\347\241\200.md" | 85 +++++--------------
 1 file changed, 20 insertions(+), 65 deletions(-)

diff --git "a/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md" "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md"
index 51843e99d..a2ccd861c 100644
--- "a/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md"
+++ "b/docs/\347\254\254\345\215\201\347\253\240/zhcNLP/NLP\345\237\272\347\241\200.md"
@@ -1,23 +1,23 @@
 # Word Embeddings (Concepts)
 
-###   Before looking at what a word embedding is, let us consider how a computer recognizes human input.
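-  A computer parses its input into binary codes of 0s and 1s, turning human language into a machine representation it can work with.
-  We first introduce **one-hot encoding**: for a given dimension, a one-hot vector has exactly one entry equal to 1 and all other entries equal to 0, for example the 5-dimensional vector [0,0,0,0,1].
-  For example, when we learn Chinese in kindergarten or primary school, we start by learning characters and words, which are then stored somewhere in our brains.
+### Before looking at what a word embedding is, let us consider how a computer recognizes human input.
+A computer parses its input into binary codes of 0s and 1s, turning human language into a machine representation it can work with.
+We first introduce **one-hot encoding**: for a given dimension, a one-hot vector has exactly one entry equal to 1 and all other entries equal to 0, for example the 5-dimensional vector [0,0,0,0,1].
+For example, when we learn Chinese in kindergarten or primary school, we start by learning characters and words, which are then stored somewhere in our brains.
 
 Suppose a child has just learned four words --> [我] [特别] [喜欢] [学习] ("I", "especially", "like", "studying")
-  Our computer can then set up a 4-dimensional one-hot encoding for the child's vocabulary.
-  For Chinese, we first segment the text into words: 我 特别 喜欢 学习
-  We can then assign 我->[1 0 0 0] 特别->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
-  Given the sentence 我喜欢学习 ("I like studying"), the computer sums the one-hot vectors of its words and represents the sentence as [1 0 1 1]

-  This raises a few issues:
-  1. As the child's vocabulary grows to thousands or tens of thousands of words, the vectors built this way become very high-dimensional and sparse.
-  2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly "can"). Under one-hot encoding their vectors are mutually orthogonal, so the encoding carries no notion of similarity and cannot relate them to each other.
-  For these reasons, we do not consider one-hot encoding a good word-embedding method.
+Our computer can then set up a 4-dimensional one-hot encoding for the child's vocabulary.
+For Chinese, we first segment the text into words: 我 特别 喜欢 学习
+We can then assign 我->[1 0 0 0] 特别->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
+Given the sentence 我喜欢学习 ("I like studying"), the computer sums the one-hot vectors of its words and represents the sentence as [1 0 1 1]

+This raises a few issues:
+1. As the child's vocabulary grows to thousands or tens of thousands of words, the vectors built this way become very high-dimensional and sparse.
+2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly "can"). Under one-hot encoding their vectors are mutually orthogonal, so the encoding carries no notion of similarity and cannot relate them to each other.
+For these reasons, we do not consider one-hot encoding a good word-embedding method.
 
-  Next, we introduce the **dense representation**.
+Next, we introduce the **dense representation**.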
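 
-  A dense representation is a fixed-length vector just like a one-hot code, but its entries are arbitrary real values, for example [0.45,0.65,0.14,1.15,0.97]
+A dense representation is a fixed-length vector just like a one-hot code, but its entries are arbitrary real values, for example [0.45,0.65,0.14,1.15,0.97]
 
 # Bag of Words Representation
 
@@ -72,7 +72,7 @@ number_dict
 
 
 
-  Based on the mapping above, we can build 9-dimensional one-hot encodings as follows (besides np.eye, scikit-learn can also be used)
+Based on the mapping above, we can build 9-dimensional one-hot encodings as follows (besides np.eye, scikit-learn can also be used)
 
 
 ```python
@@ -104,9 +104,9 @@ bow
 
 
 # N-gram: A Statistical Language Model
-  An N-gram model is a language model that uses the correlation between neighbouring words to predict the next word. It models runs of n consecutive words in a text, so the next word is predicted from the words that immediately precede it. For example, if a text contains the sequence "the cat sat on", an N-gram model might predict that it continues with "the mat" or "a hat".
+An N-gram model is a language model that uses the correlation between neighbouring words to predict the next word. It models runs of n consecutive words in a text, so the next word is predicted from the words that immediately precede it. For example, if a text contains the sequence "the cat sat on", an N-gram model might predict that it continues with "the mat" or "a hat".
 
-  The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the predictions may be inaccurate. If the training text is small, the model may also fail to capture the complexity of the language.
+The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the predictions may be inaccurate. If the training text is small, the model may also fail to capture the complexity of the language.
 
 **Advantages of N-gram models:**
 
@@ -255,9 +255,9 @@ print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in
 
 
 # Word2Vec Model: Skip-gram and CBOW Variants
-  The dense vectors (distributed representations) mentioned earlier can be learned by training a Word2Vec model.
-  The skip-gram model predicts the surrounding words from the center word.
-  The CBOW (continuous bag-of-words) model predicts the center word from the surrounding words.
+The dense vectors (distributed representations) mentioned earlier can be learned by training a Word2Vec model.
+The skip-gram model predicts the surrounding words from the center word.
+The CBOW (continuous bag-of-words) model predicts the center word from the surrounding words.
 
 
 ```python
@@ -363,72 +363,27 @@ for i , label in enumerate(word_list):
     plt.annotate(label,xy=(x,y),xytext=(5,2),textcoords='offset points',ha='right',va='bottom')
 plt.show()
 ```
-
      11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]
-
     epoch: 10000 cost = 1.955088
-
-
      21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]
-
     epoch: 20000 cost = 1.673096
-
-
      30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]
-
     epoch: 30000 cost = 2.247422
-
-
      41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]
-
     epoch: 40000 cost = 2.289902
-
-
      50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]
-
     epoch: 50000 cost = 2.396217
-
-
      61%|██████ | 60572/100000 [00:16<00:11, 3450.47it/s]
-
     epoch: 60000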
cost = 1.539688 - - 71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s] - epoch: 70000 cost = 1.638879 - - 80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s] - epoch: 80000 cost = 2.279797 - - 90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s] - epoch: 90000 cost = 1.992100 - - 100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s] - - epoch: 100000 cost = 1.307715 - - - ![png](output_16_20.png) - - - - 打印 - - -
- - -```python - -``` From 018a3b80b93b2ea90f4b6d6fcd0f37ad6a489a4b Mon Sep 17 00:00:00 2001 From: Zhikang Niu <73390819+NoFish-528@users.noreply.github.com> Date: Wed, 14 Dec 2022 23:47:59 +0800 Subject: [PATCH 6/6] =?UTF-8?q?Delete=20NLP=E5=9F=BA=E7=A1=80.ipynb?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../NLP\345\237\272\347\241\200.ipynb" | 687 ------------------ 1 file changed, 687 deletions(-) delete mode 100644 "docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" diff --git "a/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" "b/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" deleted file mode 100644 index ba5bcc1f9..000000000 --- "a/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" +++ /dev/null @@ -1,687 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "119ec186", - "metadata": {}, - "source": [ - "# 词嵌入(概念部分)" - ] - }, - { - "cell_type": "markdown", - "id": "f8e5639e", - "metadata": {}, - "source": [ - "###   在了解什么是词嵌入之前,我们可以思考一下计算机如何识别人类的输入?
\n", - " 计算机通过将输入信息解析为0和1这般的二进制编码,从而将人类语言转化为机器语言,进行理解。
\n", - " 我们先引入一个概念**one-hot编码**,也称为**独热编码**,在给定维度的情况下,一行向量有且仅有一个值为1,例如维度为5的向量[0,0,0,0,1]
\n", - " 例如,我们在幼儿园或小学学习汉语的时候,首先先识字和词,字和词就会保存在我们的大脑中的某处。
\n", - "\n", - "
一个小朋友刚学会了四个字和词-->[我] [特别] [喜欢] [学习]
\n", - " 我们的计算机就可以为小朋友开辟一个词向量维度为4的独热编码
\n", - " 对于中文 我们先进行分词 我 特别 喜欢 学习
\n", - " 那么我们就可以令 我->[1 0 0 0] 特别 ->[0 1 0 0] 喜欢->[0 0 1 0] 学习->[0 0 0 1]
\n", - " 现在给出一句话 我喜欢学习,那么计算机给出的词向量->[1 0 1 1]

\n", - " 我们可以思考几个问题:
\n", - " 1.如果小朋友词汇量越学越多,学到了成千上万个词之后,我们使用上述方法构建的词向量就会有非常大的维度,并且是一个稀疏向量。
\n", - " 2.在中文中 诸如 能 会 可以 这样同义词,我们如果使用独热编码,它们是正交的,缺乏词之间的相似性,很难把他们联系到一起。
\n", - " 因此我们认为独热编码不是一个很好的词嵌入方法。
\n", - "\n", - " 我们再来介绍一下 **稠密表示**
\n", - "\n", - " 稠密表示的格式如one-hot编码一致,但数值却不同,如 [0.45,0.65,0.14,1.15,0.97]" - ] - }, - { - "cell_type": "markdown", - "id": "4db86da3", - "metadata": {}, - "source": [ - "# Bag of Words词袋表示" - ] - }, - { - "cell_type": "markdown", - "id": "44dc9252", - "metadata": {}, - "source": [ - "  词袋表示顾名思义,我们往一个袋子中装入我们的词汇,构成一个词袋,当我们想表达的时候,我们将其取出,构建词袋的方法可以有如下形式。" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "823f8f2d", - "metadata": {}, - "outputs": [], - "source": [ - "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]#我们的语料库\n", - "word_list = ' '.join(corpus).split()\n", - "word_list = list(sorted(set(word_list)))\n", - "word_dict = {w: i for i, w in enumerate(word_list)}\n", - "number_dict = {i: w for i, w in enumerate(word_list)}" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "8eaeb37d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'do': 0,\n", - " 'drinking': 1,\n", - " 'hate': 2,\n", - " 'i': 3,\n", - " 'like': 4,\n", - " 'love': 5,\n", - " 'nlp': 6,\n", - " 'playing': 7,\n", - " 'reading': 8}" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "word_dict" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "2bf380c8", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{0: 'do',\n", - " 1: 'drinking',\n", - " 2: 'hate',\n", - " 3: 'i',\n", - " 4: 'like',\n", - " 5: 'love',\n", - " 6: 'nlp',\n", - " 7: 'playing',\n", - " 8: 'reading'}" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "number_dict" - ] - }, - { - "cell_type": "markdown", - "id": "90e0ef43", - "metadata": {}, - "source": [ - " 根据如上形式,我们可以构建一个维度为9的one-hot编码,如下(除了可以使用np.eye构建,也可以通过sklearn的库调用)" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "9821ed2a", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - 
"voc_size = len(word_dict)\n", - "bow = []\n", - "for i,name in enumerate(word_dict):\n", - " bow.append(np.eye(voc_size)[word_dict[name]])" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "03f1f12f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),\n", - " array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n", - " array([0., 0., 1., 0., 0., 0., 0., 0., 0.]),\n", - " array([0., 0., 0., 1., 0., 0., 0., 0., 0.]),\n", - " array([0., 0., 0., 0., 1., 0., 0., 0., 0.]),\n", - " array([0., 0., 0., 0., 0., 1., 0., 0., 0.]),\n", - " array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),\n", - " array([0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n", - " array([0., 0., 0., 0., 0., 0., 0., 0., 1.])]" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "bow" - ] - }, - { - "cell_type": "markdown", - "id": "086a5fd2", - "metadata": {}, - "source": [ - "# N-gram:基于统计的语言模型\n", - " N-gram 模型是一种自然语言处理模型,它利用了语言中词语之间的相关性来预测下一个出现的词语。N-gram 模型通过对一段文本中连续出现的 n 个词语进行建模,来预测文本中接下来出现的词语。比如,如果一个文本中包含连续出现的词语“the cat sat on”,那么 N-gram 模型可能会预测接下来的词语是“the mat”或“a hat”。\n", - "\n", - " N-gram 模型的精确性取决于用于训练模型的文本的质量和数量。如果用于训练模型的文本包含大量的语言纠错和拼写错误,那么模型的预测结果也可能不准确。此外,如果用于训练模型的文本量较少,那么模型也可能无法充分捕捉到语言中的复杂性。 \n", - "\n", - "**N-gram 模型的优点:**\n", - "\n", - "简单易用,N-gram 模型的概念非常简单,实现起来也很容易。 \n", - "能够捕捉到语言中的相关性,N-gram 模型通过考虑连续出现的 n 个词语来预测下一个词语,因此它能够捕捉到语言中词语之间的相关性。 \n", - "可以使用已有的语料库进行训练,N-gram 模型可以使用已有的大量语料库进行训练,例如 Google 的 N-gram 数据库,这样可以大大提高模型的准确性。 \n", - "\n", - "**N-gram 模型的缺点:**\n", - "\n", - "对于短文本数据集不适用,N-gram 模型需要大量的文本数据进行训练,因此对于短文本数据集可能无法达到较高的准确性。 \n", - "容易受到噪声和语言纠错的影响,N-gram 模型是基于语料库进行训练的,如果语料库中包含大量的语言纠错和拼写错误,那么模型的预测结果也可能不准确。 \n", - "无法捕捉到语言中的非线性关系,N-gram 模型假设语言中的关系是线性的,但事实上语言中可能存在复杂的非线性关系,N-gram 模型无法捕捉到这些关系。" - ] - }, - { - "cell_type": "markdown", - "id": "1f5ad65b", - "metadata": {}, - "source": [ - "# NNLM:前馈神经网络语言模型\n", - " 下面通过前馈神经网络模型来**展示滑动**窗口的使用" - ] - }, - { - 
"cell_type": "code", - "execution_count": 17, - "id": "7bddfa77", - "metadata": {}, - "outputs": [], - "source": [ - "#导入必要的库\n", - "import numpy as np\n", - "import torch\n", - "import torch.nn as nn\n", - "import torch.optim as optim\n", - "from tqdm import tqdm\n", - "from torch.autograd import Variable\n", - "dtype = torch.FloatTensor" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "29f23588", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['i',\n", - " 'like',\n", - " 'reading',\n", - " 'i',\n", - " 'love',\n", - " 'drinking',\n", - " 'i',\n", - " 'hate',\n", - " 'playing',\n", - " 'i',\n", - " 'do',\n", - " 'nlp']" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]\n", - "\n", - "word_list = ' '.join(corpus).split()\n", - "word_list" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "12b58886", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 1000 cost = 1.010682\n", - "epoch: 2000 cost = 0.695155\n", - "epoch: 3000 cost = 0.597085\n", - "epoch: 4000 cost = 0.531892\n", - "epoch: 5000 cost = 0.376044\n", - "epoch: 6000 cost = 0.118038\n", - "epoch: 7000 cost = 0.077081\n", - "epoch: 8000 cost = 0.053636\n", - "epoch: 9000 cost = 0.038089\n", - "epoch: 10000 cost = 0.027224\n", - "[['i', 'like'], ['i', 'love'], ['i', 'hate'], ['i', 'do']] -> ['studying', 'datawhale', 'playing', 'nlp']\n" - ] - } - ], - "source": [ - "#构建我们需要的语料库\n", - "corpus = [\"i like studying\", \"i love datawhale\", \"i hate playing\", \"i do nlp\"]\n", - "\n", - "word_list = ' '.join(corpus).split() #将语料库转化为一个个单词 ,如['i', 'like', 'reading', 'i', ...,'nlp']\n", - "word_list = list(sorted(set(word_list))) #用set去重后转化为链表\n", - "# print(word_list)\n", - "\n", - "word_dict = {w: i for i, w in enumerate(word_list)} #将词表转化为字典 
这边是词对应到index\n", - "number_dict = {i: w for i, w in enumerate(word_list)}#这边是index对应到词\n", - "# print(word_dict)\n", - "# print(number_dict)\n", - "\n", - "n_class = len(word_dict) #计算出我们词表的大小,用于后面词向量的构建\n", - "\n", - "m = 2 #词嵌入维度\n", - "n_step = 2 #滑动窗口的大小\n", - "n_hidden = 2 #隐藏层的维度为2\n", - "\n", - "\n", - "def make_batch(sentence): #由于语料库较小,我们象征性将训练集按照批次处理 \n", - " input_batch = []\n", - " target_batch = []\n", - "\n", - " for sen in sentence:\n", - " word = sen.split()\n", - " input = [word_dict[n] for n in word[:-1]]\n", - " target = word_dict[word[-1]]\n", - "\n", - " input_batch.append(input)\n", - " target_batch.append(target)\n", - "\n", - " return input_batch, target_batch\n", - "\n", - "\n", - "class NNLM(nn.Module): #搭建一个NNLM语言模型\n", - " def __init__(self):\n", - " super(NNLM, self).__init__()\n", - " self.embed = nn.Embedding(n_class, m)\n", - " self.W = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))\n", - " self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))\n", - "\n", - " self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))\n", - " self.b = nn.Parameter(torch.randn(n_class).type(dtype))\n", - "\n", - " def forward(self, x):\n", - " x = self.embed(x) # 4 x 2 x 2\n", - " x = x.view(-1, n_step * m)\n", - " tanh = torch.tanh(self.d + torch.mm(x, self.W)) # 4 x 2\n", - " output = self.b + torch.mm(tanh, self.U)\n", - " return output\n", - "\n", - "model = NNLM()\n", - "\n", - "criterion = nn.CrossEntropyLoss() #损失函数的设置\n", - "optimizer = optim.Adam(model.parameters(), lr=0.001) #优化器的设置\n", - "\n", - "input_batch, target_batch = make_batch(corpus) #训练集和标签值\n", - "input_batch = Variable(torch.LongTensor(input_batch))\n", - "target_batch = Variable(torch.LongTensor(target_batch))\n", - "\n", - "for epoch in range(10000): #训练过程\n", - " optimizer.zero_grad()\n", - "\n", - " output = model(input_batch) # input: 4 x 2\n", - "\n", - " loss = criterion(output, target_batch)\n", - "\n", - " if (epoch + 1) % 1000 == 0:\n", - " 
print('epoch:', '%04d' % (epoch + 1), 'cost = {:.6f}'.format(loss.item()))\n", - "\n", - " loss.backward()\n", - " optimizer.step()\n", - "\n", - "predict = model(input_batch).data.max(1, keepdim=True)[1] # prediction: index of the highest-scoring word\n", - "\n", - "print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in predict.squeeze()])" - ] - }, - { - "cell_type": "markdown", - "id": "93d8cd2f", - "metadata": {}, - "source": [ - "# Word2Vec: the Skip-gram and CBOW Modes\n", - " The dense vector representation (distributed representation) mentioned earlier can be learned by training a Word2Vec model.\n", - " The skip-gram model uses the center word to predict its surrounding words.\n", - " The CBOW (continuous bag-of-words) model uses the surrounding words to predict the center word." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "066f68a0", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 10000 cost = 1.955088\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 20000 cost = 1.673096\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 30000 cost = 2.247422\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 40000 cost = 2.289902\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 50000 cost = 2.396217\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 61%|██████ | 60572/100000 [00:16<00:11, 
3450.47it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 60000 cost = 1.539688\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 70000 cost = 1.638879\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 80000 cost = 2.279797\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 90000 cost = 1.992100\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s]\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "epoch: 100000 cost = 1.307715\n" - ] - }, - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "打印\n" - ] - }, - { - "data": { - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "import torch.nn as nn\n", - "import torch.optim as optim\n", - "from torch.autograd import variable\n", - "import numpy as np\n", - "import torch\n", - "import matplotlib.pyplot as plt\n", - "from tqdm import tqdm\n", - "\n", - "dtype = torch.FloatTensor\n", - "#我们使用的语料库 \n", - "sentences = ['i like dog','i like cat','i like animal','dog is animal','cat is animal',\n", - " 'dog like meat','cat like meat','cat like fish','dog like meat','i like apple',\n", - " 'i hate apple','i like movie','i like read','dog like bark','dog like cat']\n", - "\n", - "\n", - "\n", - "word_sequence = ' '.join(sentences).split() #将语料库的每一句话的每一个词转化为列表 \n", - "#print(word_sequence)\n", - "\n", - "word_list = list(set(word_sequence)) #构建我们的词表 \n", - "#print(word_list)\n", - "\n", - "#word_voc = list(set(word_sequence)) \n", - "\n", - "#接下来对此表中的每一个词编号 这就用到了我们之前提到的one-hot编码 \n", - "\n", - "#词典 词对应着编号\n", - "word_dict = {w:i for i,w in enumerate(word_list)}\n", - "#print(word_dict)\n", - "#编号对应着词\n", - "index_dict = {i:w for w,i in enumerate(word_list)}\n", - "#print(index_dict)\n", - "\n", - "\n", - "batch_size = 2\n", - "voc_size = len(word_list)\n", - "\n", - "skip_grams = []\n", - "for i in range(1,len(word_sequence)-1,3):\n", - " target = word_dict[word_sequence[i]] #当前词对应的id\n", - " context = [word_dict[word_sequence[i-1]],word_dict[word_sequence[i+1]]] #两个上下文词对应的id\n", - "\n", - " for w in context:\n", - " skip_grams.append([target,w])\n", - "\n", - "embedding_size = 10 \n", - "\n", - "\n", - "class Word2Vec(nn.Module):\n", - " def __init__(self):\n", - " super(Word2Vec,self).__init__()\n", - " self.W1 = nn.Parameter(torch.rand(len(word_dict),embedding_size)).type(dtype) \n", - " #将词的one-hot编码对应到词向量中\n", - " self.W2 = nn.Parameter(torch.rand(embedding_size,voc_size)).type(dtype)\n", - " #将词向量 转化为 输出 \n", - " def forward(self,x):\n", - " hidden_layer = torch.matmul(x,self.W1)\n", - " output_layer = 
torch.matmul(hidden_layer,self.W2)\n", - " return output_layer\n", - "\n", - "\n", - "model = Word2Vec()\n", - "criterion = nn.CrossEntropyLoss()\n", - "optimizer = optim.Adam(model.parameters(),lr=1e-5)\n", - "\n", - "#print(len(skip_grams))\n", - "# helper that samples a random training batch\n", - "\n", - "def random_batch(data,size):\n", - " random_inputs = []\n", - " random_labels = []\n", - " random_index = np.random.choice(range(len(data)),size,replace=False)\n", - " \n", - " for i in random_index:\n", - " random_inputs.append(np.eye(voc_size)[data[i][0]]) # take a row of the identity matrix as the one-hot vector\n", - " random_labels.append(data[i][1])\n", - " \n", - " return random_inputs,random_labels\n", - "\n", - "for epoch in tqdm(range(100000)):\n", - " input_batch,target_batch = random_batch(skip_grams,batch_size) # X -> y\n", - " input_batch = torch.Tensor(input_batch)\n", - " target_batch = torch.LongTensor(target_batch)\n", - "\n", - " optimizer.zero_grad()\n", - "\n", - " output = model(input_batch)\n", - "\n", - " loss = criterion(output,target_batch)\n", - " if((epoch+1)%10000==0):\n", - " print(\"epoch:\",\"%04d\" %(epoch+1),'cost =' ,'{:.6f}'.format(loss))\n", - "\n", - " loss.backward() \n", - " optimizer.step()\n", - "\n", - "for i , label in enumerate(word_list):\n", - " W1,_ = model.parameters()\n", - " x,y = float(W1[i][0]),float(W1[i][1]) # plot only the first two of the 10 embedding dimensions\n", - " plt.scatter(x,y)\n", - " plt.annotate(label,xy=(x,y),xytext=(5,2),textcoords='offset points',ha='right',va='bottom')\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1edccf25", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "pytorch", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.9 (default, Aug 31 2020, 12:42:55) \n[GCC 
7.3.0]" - }, - "vscode": { - "interpreter": { - "hash": "7648c2b9d25760d0d65f53f9b9a34de48caa24d8265d64b0ff81e2f2641d528d" - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}