Up until now, we have focused on defining networks consisting of a sequence input, a single hidden RNN layer, and an output layer. Despite having just one hidden layer between the input at any time step and the corresponding output, there is a sense in which these networks are deep. Inputs from the first time step can influence the outputs at the final time step $T$ (often hundreds or thousands of steps later). These inputs pass through $T$ applications of the recurrent layer before reaching the final output. However, we often also wish to retain the ability to express complex relationships between the inputs at a given time step and the outputs at that same time step. Thus we often construct RNNs that are deep not only in the direction of time but also in the direction from input to output. This is precisely the notion of depth that we have already encountered in our development of MLPs and deep CNNs.
The standard method for building this sort of deep RNN is strikingly simple: we stack the RNNs on top of each other. Given a sequence of length $T$, the first RNN produces a sequence of outputs, also of length $T$. These, in turn, constitute the inputs to the next RNN layer. In this short section, we illustrate this design pattern and present a simple example for how to code up such stacked RNNs. Below, in Fig. 10.3.1, we illustrate a deep RNN with $L$ hidden layers. Each hidden state operates on a sequential input and produces a sequential output. Moreover, any RNN cell (white box in Fig. 10.3.1) at each time step depends on both the same layer's value at the previous time step and the previous layer's value at the same time step.
Fig. 10.3.1 Architecture of a deep RNN.
Formally, suppose that we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$, number of inputs in each example: $d$) at time step $t$. At the same time step, let the hidden state of the $l^\textrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$) and the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\textrm{th}$ hidden layer that uses the activation function $\phi_l$ is calculated as follows:
$$\mathbf{H}_t^{(l)} = \phi_l\big(\mathbf{H}_t^{(l-1)} \mathbf{W}_{\textrm{xh}}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{\textrm{hh}}^{(l)} + \mathbf{b}_\textrm{h}^{(l)}\big), \tag{10.3.1}$$
where the weights $\mathbf{W}_{\textrm{xh}}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{\textrm{hh}}^{(l)} \in \mathbb{R}^{h \times h}$, together with the bias $\mathbf{b}_\textrm{h}^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\textrm{th}$ hidden layer.
In the end, the calculation of the output layer is only based on the hidden state of the final $L^\textrm{th}$ hidden layer:
$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{\textrm{hq}} + \mathbf{b}_\textrm{q}, \tag{10.3.2}$$
where the weight $\mathbf{W}_{\textrm{hq}} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_\textrm{q} \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.
Just as with MLPs, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters that we can tune. Common RNN layer widths ($h$) are in the range $(64, 2056)$, and common depths ($L$) are in the range $(1, 8)$. Moreover, we can easily get a deep gated RNN by replacing the hidden state computation in (10.3.1) with that from an LSTM or a GRU.
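To make the recursion above concrete, here is a minimal sketch that evaluates (10.3.1) and (10.3.2) for a single time step with plain PyTorch tensors. The sizes, the tanh activation, and the parameter names are illustrative assumptions, not the implementation developed later in this section; note that, unlike the simplified notation above, the first layer's input-to-hidden weight maps $d$ inputs to $h$ hidden units.

import torch

# Illustrative sizes: n examples, d inputs, h hidden units, q outputs, L layers
n, d, h, q, L = 4, 8, 16, 10, 2
phi = torch.tanh                                   # activation for every layer
X_t = torch.randn(n, d)                            # minibatch input at step t
H_prev = [torch.zeros(n, h) for _ in range(L)]     # H_{t-1}^{(l)} for each layer

# First layer maps d -> h; deeper layers map h -> h
W_xh = [torch.randn(d if l == 0 else h, h) * 0.01 for l in range(L)]
W_hh = [torch.randn(h, h) * 0.01 for l in range(L)]
b_h = [torch.zeros(1, h) for _ in range(L)]
W_hq, b_q = torch.randn(h, q) * 0.01, torch.zeros(1, q)

H = X_t                                            # H_t^{(0)} = X_t
for l in range(L):                                 # Eq. (10.3.1), layer by layer
    H = phi(H @ W_xh[l] + H_prev[l] @ W_hh[l] + b_h[l])
O_t = H @ W_hq + b_q                               # Eq. (10.3.2)
print(O_t.shape)                                   # torch.Size([4, 10])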
import torch
from torch import nn
from d2l import torch as d2l
from mxnet import np, npx
from mxnet.gluon import rnn
from d2l import mxnet as d2l

npx.set_np()
import jax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
import tensorflow as tf
from d2l import tensorflow as d2l
10.3.1. Implementation from Scratch
To implement a multilayer RNN from scratch, we can treat each layer as an RNNScratch instance with its own learnable parameters.
class StackedRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.rnns = nn.Sequential(*[d2l.RNNScratch(
            num_inputs if i==0 else num_hiddens, num_hiddens, sigma)
            for i in range(num_layers)])
class StackedRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.rnns = [d2l.RNNScratch(num_inputs if i==0 else num_hiddens,
                                    num_hiddens, sigma)
                     for i in range(num_layers)]
class StackedRNNScratch(d2l.Module):
    num_inputs: int
    num_hiddens: int
    num_layers: int
    sigma: float = 0.01

    def setup(self):
        self.rnns = [d2l.RNNScratch(self.num_inputs if i==0 else self.num_hiddens,
                                    self.num_hiddens, self.sigma)
                     for i in range(self.num_layers)]
The multilayer forward computation simply performs forward computation layer by layer.
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
        outputs = torch.stack(outputs, 0)
    return outputs, Hs
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
        outputs = np.stack(outputs, 0)
    return outputs, Hs
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
        outputs = jnp.stack(outputs, 0)
    return outputs, Hs
@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
        outputs = tf.stack(outputs, 0)
    return outputs, Hs
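As a quick sanity check (a hypothetical snippet, assuming the RNNScratch interface from the earlier scratch implementation, which takes inputs shaped (num_steps, batch_size, num_inputs)), we can push a dummy batch through the PyTorch version of the stacked model and inspect the shapes:

num_steps, batch_size, num_inputs = 5, 2, 28       # illustrative sizes
X = torch.randn(num_steps, batch_size, num_inputs)
stacked = StackedRNNScratch(num_inputs=num_inputs, num_hiddens=32, num_layers=2)
outputs, Hs = stacked(X)
print(outputs.shape)   # torch.Size([5, 2, 32]): one output per time step
print(len(Hs))         # 2: one final hidden state per layer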
As an example, we train this deep RNN model on The Time Machine dataset (the same as in Section 9.5). To keep things simple we set the number of layers to 2.
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                              num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
with d2l.try_gpu():
    rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
                                  num_hiddens=32, num_layers=2)
    model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1)
trainer.fit(model, data)
10.3.2. Concise Implementation
Fortunately many of the logistical details required to implement multiple layers of an RNN are readily available in high-level APIs. Our concise implementation will use such built-in functionalities. The code generalizes the one we used previously in Section 10.2, allowing specification of the number of layers explicitly rather than picking the default of a single layer.
class GRU(d2l.RNN):  #@save
    """The multi-layer GRU model."""
    def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers, dropout=dropout)
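As a quick, hypothetical sanity check of what the PyTorch nn.GRU block returns when stacked (the sizes here are illustrative and this snippet is not part of the training pipeline below): the output keeps one vector per time step from the top layer, while the returned state holds one entry per layer.

gru_check = nn.GRU(input_size=28, hidden_size=32, num_layers=2, dropout=0.2)
X = torch.randn(5, 2, 28)   # (num_steps, batch_size, num_inputs)
Y, H_n = gru_check(X)
print(Y.shape)              # torch.Size([5, 2, 32]): top-layer outputs per step
print(H_n.shape)            # torch.Size([2, 2, 32]): (num_layers, batch, hidden)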
class GRU(d2l.RNN):  #@save
    """The multi-layer GRU model."""
    def __init__(self, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = rnn.GRU(num_hiddens, num_layers, dropout=dropout)
Flax takes a minimalist approach to implementing RNNs: defining the number of layers in an RNN or combining an RNN with dropout is not available out of the box. Our concise implementation therefore uses the built-in functionality and adds num_layers and dropout support on top. The code generalizes the one we used previously in Section 10.2, allowing the number of layers to be specified explicitly rather than picking the default of a single layer.
class GRU(d2l.RNN):  #@save
    """The multi-layer GRU model."""
    num_hiddens: int
    num_layers: int
    dropout: float = 0

    @nn.compact
    def __call__(self, X, state=None, training=False):
        outputs = X
        new_state = []
        if state is None:
            batch_size = X.shape[1]
            state = [nn.GRUCell.initialize_carry(jax.random.PRNGKey(0),
                     (batch_size,), self.num_hiddens)] * self.num_layers

        GRU = nn.scan(nn.GRUCell, variable_broadcast="params",
                      in_axes=0, out_axes=0, split_rngs={"params": False})

        # Introduce a dropout layer after every GRU layer except the last.
        # Each layer consumes the (dropped-out) outputs of the previous one.
        for i in range(self.num_layers - 1):
            layer_i_state, outputs = GRU()(state[i], outputs)
            new_state.append(layer_i_state)
            outputs = nn.Dropout(self.dropout,
                                 deterministic=not training)(outputs)

        # Final GRU layer without dropout
        out_state, outputs = GRU()(state[-1], outputs)
        new_state.append(out_state)
        return outputs, jnp.array(new_state)
class GRU(d2l.RNN):  #@save
    """The multi-layer GRU model."""
    def __init__(self, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        gru_cells = [tf.keras.layers.GRUCell(num_hiddens, dropout=dropout)
                     for _ in range(num_layers)]
        self.rnn = tf.keras.layers.RNN(gru_cells, return_sequences=True,
                                       return_state=True, time_major=True)

    def forward(self, X, state=None):
        outputs, *state = self.rnn(X, state)
        return outputs, state
The architectural decisions, such as the choice of hyperparameters, are very similar to those of Section 10.2. We pick the same number of inputs and outputs as we have distinct tokens, i.e., vocab_size. The number of hidden units is still 32. The only difference is that we now select a nontrivial number of hidden layers by specifying the value of num_layers.
gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
model.predict('it has', 20, data.vocab, d2l.try_gpu())
'it has a small the time tr'
gru = GRU(num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
# Running takes > 1h (pending fix from MXNet)
# trainer.fit(model, data)
# model.predict('it has', 20, data.vocab, d2l.try_gpu())
gru = GRU(num_hiddens=32, num_layers=2)
model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
model.predict('it has', 20, data.vocab, trainer.state.params)
'it has wo mean the time tr'
gru = GRU(num_hiddens=32, num_layers=2)
with d2l.try_gpu():
    model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
model.predict('it has', 20, data.vocab)
'it has and the time travel'
10.3.3. Summary
In deep RNNs, the hidden state information is passed to the next time step of the current layer and to the current time step of the next layer. There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs. Conveniently, these models are all available as parts of the high-level APIs of deep learning frameworks. Initialization of the models requires care. Overall, deep RNNs require a considerable amount of work (such as tuning the learning rate and clipping gradients) to ensure proper convergence.
10.3.4. Exercises
1. Replace the GRU by an LSTM and compare the accuracy and training speed.
2. Increase the training data to include multiple books. How low can you go on the perplexity scale?
3. Would you want to combine sources from different authors when modeling text? Why is this a good idea? What could go wrong?