RNN from scratch | Building RNN model in Python

Introduction

Humans do not reset their understanding of language every time they hear a sentence. As we read an article, we grasp each word based on our understanding of the words that came before it. One of our defining characteristics is memory (or retention).

Can an algorithm replicate this? The first technique that comes to mind is a neural network (NN). But, sadly, traditional NNs cannot do this. Take an example of wanting to predict what comes next in a video. A traditional neural network will have a hard time generating accurate results.

That's where the concept of recurrent neural networks (RNNs) comes into play. RNNs have become extremely popular in the deep learning space, which makes learning them even more important. Some real-world applications of RNNs include:

  • Speech recognition
  • Machine translation
  • Music composition
  • Handwriting recognition
  • Grammar learning

In this post, we will first quickly review the core components of a typical RNN model. Then we will set up the problem statement, which we will finally solve by implementing an RNN model from scratch in Python.

We can always take advantage of high-level Python libraries to code an RNN. So why code it from scratch? I firmly believe the best way to learn and truly internalize a concept is to build it from scratch. And that's what I will show in this tutorial.

This post assumes a basic understanding of recurrent neural networks. In case you need a quick refresher or are looking to learn the basics of RNN, I recommend you read the posts below first:

Table of Contents

  • Flashback: a summary of recurrent neural network concepts
  • Sequence prediction using RNN
  • Building an RNN model using Python

Flashback: a summary of recurrent neural network concepts

Let's quickly recap the basics behind recurrent neural networks.

We will do this using an example of sequence data, say the stock prices of a particular company. A simple machine learning model, or an artificial neural network, can learn to predict the stock price based on a number of features, such as the volume of shares traded, the opening price, and so on. Apart from these, the price also depends on how the stock performed in previous weeks and days. For a trader, this historical data is actually a major deciding factor for making predictions.

In conventional feedforward neural networks, all test cases are considered independent. Can you see why that doesn't fit well when predicting stock prices? The NN model would not consider the previous stock price values. Not a great idea!

There is another concept we can rely on when dealing with time-sensitive data: recurrent neural networks (RNN).

A typical RNN looks like this:

This may seem intimidating at first. But once we unfold it over the time steps, things start to look a lot simpler:

Now it is easier for us to visualize how these networks consider the trend of stock prices. This helps us predict the price for day t. Here, each prediction at time t (h_t) depends on all the previous predictions and the information learned from them. Pretty straightforward, right?
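
To make this dependence concrete, here is a minimal sketch of that recurrence. The names U, W, and V anticipate the weight matrices we define later in this post; treat this as an illustrative sketch rather than the final model:

import numpy as np

def rnn_step(x_t, h_prev, U, W, V):
    # the new hidden state mixes the current input with the previous hidden state
    h_t = 1 / (1 + np.exp(-(np.dot(U, x_t) + np.dot(W, h_prev))))
    # the prediction at time t is a linear readout of the hidden state
    y_t = np.dot(V, h_t)
    return h_t, y_t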

RNNs can solve our sequence handling purpose to a great extent, but not quite.

Text is another good example of sequence data. Being able to predict which word or phrase comes after a given text could be a very useful asset. We want our models to write Shakespearean sonnets!

Now, RNNs are excellent when it comes to context that is short or small in nature. But to be able to build a story and remember it, our models must be able to understand the context behind the sequences, just like a human brain.

Sequence prediction using RNN

In this post, we will work on a sequence prediction problem using RNNs. One of the simplest tasks for this is sine wave prediction. The sequence contains a visible trend and is easy to solve using heuristics. This is what a sine wave looks like:

We will first design a recurrent neural network from scratch to solve this problem. Our RNN model should also generalize well so that we can apply it to other sequence problems.

We will formulate our problem this way: given a sequence of 50 numbers belonging to a sine wave, predict the 51st number in the series. Time to fire up your Jupyter notebook (or your IDE of choice)!

Building an RNN model using Python

Step 0: Data preparation

Ah, the inevitable first step in any data science project: prepare data before doing anything else.

How does our network model expect the data to be? It would accept a single sequence of length 50 as input. So the shape of the input data will be:

(number_of_records x length_of_sequence x types_of_sequences)

Here, types_of_sequences is 1, because we only have one type of sequence: the sine wave.

Also, the output will have only one value for each record: the 51st value of the input sequence. So its shape would be:

(number_of_records x types_of_sequences) #where types_of_sequences is 1

Let's dive into the code. First, import the necessary libraries:

%pylab inline

import math
import numpy as np
import matplotlib.pyplot as plt

To create the sine wave data, we will use the sine function from Python's math library:

sin_wave = np.array([math.sin(x) for x in np.arange(200)])

Visualizing the sine wave we just generated:

plt.plot(sin_wave[:50])

We will create the data now in the following code block:

X = []
Y = []

seq_len = 50
num_records = len(sin_wave) - seq_len

for i in range(num_records - 50):
    X.append(sin_wave[i:i+seq_len])
    Y.append(sin_wave[i+seq_len])
    
X = np.array(X)
X = np.expand_dims(X, axis=2)

Y = np.array(Y)
Y = np.expand_dims(Y, axis=1)

Print the shape of the data:
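
As a quick sketch (the original output is not reproduced here): with 200 sine-wave points, a seq_len of 50, and the loop above running for 100 iterations, the shapes work out as follows:

print(X.shape, Y.shape)   # (100, 50, 1) (100, 1)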

Note that we looped for (num_records – 50) because we want to set aside 50 records as our validation data. We can create this validation data now:

X_val = []
Y_val = []

for i in range(num_records - 50, num_records):
    X_val.append(sin_wave[i:i+seq_len])
    Y_val.append(sin_wave[i+seq_len])
    
X_val = np.array(X_val)
X_val = np.expand_dims(X_val, axis=2)

Y_val = np.array(Y_val)
Y_val = np.expand_dims(Y_val, axis=1)

Step 1: Create the architecture of our RNN model

Our next task is to define all the variables and functions we will use in the RNN model. Our model will take the input sequence, process it through a hidden layer of 100 units, and produce a single-valued output:

learning_rate = 0.0001   # step size for weight updates
nepoch = 25              # number of training epochs
T = 50                   # length of sequence
hidden_dim = 100         # size of the hidden layer
output_dim = 1           # a single output value per sequence

bptt_truncate = 5        # how many time steps to backpropagate through
min_clip_value = -10     # gradient clipping bounds
max_clip_value = 10

Then we will define the weights of the network:

U = np.random.uniform(0, 1, (hidden_dim, T))
W = np.random.uniform(0, 1, (hidden_dim, hidden_dim))
V = np.random.uniform(0, 1, (output_dim, hidden_dim))

Here,

  • U is the weights matrix for the weights between input and hidden layers
  • V is the weight matrix for weights between hidden and output layers
  • W is the weight matrix for shared weights in the RNN layer (hidden layer)
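
As a quick sanity check of the dimensions (this snippet is an addition, not part of the original walkthrough), with hidden_dim = 100, T = 50, and output_dim = 1 the weight shapes come out as:

print(U.shape, W.shape, V.shape)   # (100, 50) (100, 100) (1, 100)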

Finally, we will define the sigmoid activation function to be used in the hidden layer:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Step 2: Train the model

Now that we have defined our model, we can finally move on to training it on our sequence data. We can subdivide the training process into smaller steps, namely:

Step 2.1: Check the loss on training data
Step 2.1.1: Forward pass
Step 2.1.2: Calculate the error
Step 2.2: Check the loss on validation data
Step 2.2.1: Forward pass
Step 2.2.2: Calculate the error
Step 2.3: Start the actual training
Step 2.3.1: Forward pass
Step 2.3.2: Backpropagate the error
Step 2.3.3: Update the weights

We need to repeat these steps until convergence. If the model starts to overfit, stop! Or simply pre-define the number of epochs.

Step 2.1: Check the loss on training data

We will make a forward pass through our RNN model and calculate the squared error of the predictions for all records in order to obtain the loss value.

for epoch in range(nepoch):
    # check loss on train
    loss = 0.0
    
    # do a forward pass to get prediction
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]                    # get input, output values of each record
        prev_s = np.zeros((hidden_dim, 1))   # here, prev_s is the previous activation of the hidden layer, initialized to all zeroes
        for t in range(T):
            new_input = np.zeros(x.shape)    # we then do a forward pass for every timestep in the sequence
            new_input[t] = x[t]              # for this, we set a single input for that timestep
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        # calculate error 
        loss_per_record = (y - mulv)**2 / 2
        loss += loss_per_record
    loss = loss / float(y.shape[0])

Step 2.2: Check the loss on validation data

We will do the same to calculate the loss on the validation data (in the same loop):

    # check loss on val
    val_loss = 0.0
    for i in range(Y_val.shape[0]):
        x, y = X_val[i], Y_val[i]
        prev_s = np.zeros((hidden_dim, 1))
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        loss_per_record = (y - mulv)**2 / 2
        val_loss += loss_per_record
    val_loss = val_loss / float(y.shape[0])

    print('Epoch: ', epoch + 1, ', Loss: ', loss, ', Val Loss: ', val_loss)

You should get the following result:

Epoch:  1 , Loss:  [[101185.61756671]] , Val Loss:  [[50591.0340148]]
...
...

Step 2.3: Start the actual training

Now we will start the actual training of the network. Here, we first do a forward pass to calculate the errors and then a backward pass to calculate the gradients and update the weights. Let me show you these step by step so you can visualize how it works in your mind.

Step 2.3.1: Forward pass

In the forward pass:

  • We first multiply the input with the weights between the input and hidden layers.
  • Add this to the multiplication of the weights in the RNN layer. This is because we want to capture the knowledge of the previous time step.
  • Pass it through a sigmoid activation function.
  • Multiply this with the weights between the hidden and output layers.
  • At the output layer, we have a linear activation of the values, so we do not explicitly pass the value through an activation layer.
  • Save the state at the current layer, and also the state at the previous time step, in a dictionary.

Here is the code for doing a forward pass (note that it is a continuation of the previous loop):

    # train model
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]
    
        layers = []
        prev_s = np.zeros((hidden_dim, 1))
        dU = np.zeros(U.shape)
        dV = np.zeros(V.shape)
        dW = np.zeros(W.shape)
        
        dU_t = np.zeros(U.shape)
        dV_t = np.zeros(V.shape)
        dW_t = np.zeros(W.shape)
        
        dU_i = np.zeros(U.shape)
        dW_i = np.zeros(W.shape)
        
        # forward pass
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            layers.append({'s':s, 'prev_s':prev_s})
            prev_s = s

Step 2.3.2: Backpropagate the error

After the forward propagation step, we calculate the gradients at each layer and backpropagate the errors. We will use truncated backpropagation through time (TBPTT) instead of vanilla backpropagation. It may sound complex, but it's actually quite simple.

The core difference between BPTT and standard backprop is that the backpropagation step is carried out for all the time steps in the RNN layer. So, if our sequence length is 50, we will backpropagate through all the time steps prior to the current time step.

As you may have guessed, BPTT is very computationally expensive. So instead of backpropagating through all the previous time steps, we backpropagate up to x time steps to save computational power. Consider this conceptually similar to stochastic gradient descent, where we take a batch of data points instead of all the data points.
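
To make the truncation concrete, here is a tiny illustrative sketch (example_t is a hypothetical current time step, not a variable used in the model) of which time steps the inner loop in the code below visits when bptt_truncate is 5:

example_t = 10   # a hypothetical current time step
steps = list(range(example_t - 1, max(-1, example_t - bptt_truncate - 1), -1))
print(steps)     # [9, 8, 7, 6, 5] -> only the last 5 time steps are revisited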

Here is the code to propagate the errors backwards:

        # derivative of pred
        dmulv = (mulv - y)
        
        # backward pass
        for t in range(T):
            dV_t = np.dot(dmulv, np.transpose(layers[t]['s']))
            dsv = np.dot(np.transpose(V), dmulv)
            
            ds = dsv
            dadd = add * (1 - add) * ds
            
            dmulw = dadd * np.ones_like(mulw)

            dprev_s = np.dot(np.transpose(W), dmulw)


            for i in range(t-1, max(-1, t-bptt_truncate-1), -1):
                ds = dsv + dprev_s
                dadd = add * (1 - add) * ds

                dmulw = dadd * np.ones_like(mulw)
                dmulu = dadd * np.ones_like(mulu)

                dW_i = np.dot(W, layers[t]['prev_s'])
                dprev_s = np.dot(np.transpose(W), dmulw)

                new_input = np.zeros(x.shape)
                new_input[t] = x[t]
                dU_i = np.dot(U, new_input)
                dx = np.dot(np.transpose(U), dmulu)

                dU_t += dU_i
                dW_t += dW_i
                
            dV += dV_t
            dU += dU_t
            dW += dW_t

Step 2.3.3: Update the weights

Finally, we update the weights with the calculated gradients. One thing we need to keep in mind is that gradients tend to explode if we don't keep them in check. This is a fundamental issue in training neural networks, called the exploding gradient problem. So we have to clamp them to a range so that they don't explode. We can do it like this:

            if dU.max() > max_clip_value:
                dU[dU > max_clip_value] = max_clip_value
            if dV.max() > max_clip_value:
                dV[dV > max_clip_value] = max_clip_value
            if dW.max() > max_clip_value:
                dW[dW > max_clip_value] = max_clip_value
                
            
            if dU.min() < min_clip_value:
                dU[dU < min_clip_value] = min_clip_value
            if dV.min() < min_clip_value:
                dV[dV < min_clip_value] = min_clip_value
            if dW.min() < min_clip_value:
                dW[dW < min_clip_value] = min_clip_value
        
        # update
        U -= learning_rate * dU
        V -= learning_rate * dV
        W -= learning_rate * dW
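
As a side note, the same clipping can be written more compactly with numpy's np.clip. This is just an equivalent alternative to the element-wise checks above, not what the rest of the post uses:

dU = np.clip(dU, min_clip_value, max_clip_value)
dV = np.clip(dV, min_clip_value, max_clip_value)
dW = np.clip(dW, min_clip_value, max_clip_value)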

When training the previous model, we get this result:

Epoch:  1 , Loss:  [[101185.61756671]] , Val Loss:  [[50591.0340148]]
Epoch:  2 , Loss:  [[61205.46869629]] , Val Loss:  [[30601.34535365]]
Epoch:  3 , Loss:  [[31225.3198258]] , Val Loss:  [[15611.65669247]]
Epoch:  4 , Loss:  [[11245.17049551]] , Val Loss:  [[5621.96780111]]
Epoch:  5 , Loss:  [[1264.5157739]] , Val Loss:  [[632.02563908]]
Epoch:  6 , Loss:  [[20.15654115]] , Val Loss:  [[10.05477285]]
Epoch:  7 , Loss:  [[17.13622839]] , Val Loss:  [[8.55190426]]
Epoch:  8 , Loss:  [[17.38870495]] , Val Loss:  [[8.68196484]]
Epoch:  9 , Loss:  [[17.181681]] , Val Loss:  [[8.57837827]]
Epoch:  10 , Loss:  [[17.31275313]] , Val Loss:  [[8.64199652]]
Epoch:  11 , Loss:  [[17.12960034]] , Val Loss:  [[8.54768294]]
Epoch:  12 , Loss:  [[17.09020065]] , Val Loss:  [[8.52993502]]
Epoch:  13 , Loss:  [[17.17370113]] , Val Loss:  [[8.57517454]]
Epoch:  14 , Loss:  [[17.04906914]] , Val Loss:  [[8.50658127]]
Epoch:  15 , Loss:  [[16.96420184]] , Val Loss:  [[8.46794248]]
Epoch:  16 , Loss:  [[17.017519]] , Val Loss:  [[8.49241316]]
Epoch:  17 , Loss:  [[16.94199493]] , Val Loss:  [[8.45748739]]
Epoch:  18 , Loss:  [[16.99796892]] , Val Loss:  [[8.48242177]]
Epoch:  19 , Loss:  [[17.24817035]] , Val Loss:  [[8.6126231]]
Epoch:  20 , Loss:  [[17.00844599]] , Val Loss:  [[8.48682234]]
Epoch:  21 , Loss:  [[17.03943262]] , Val Loss:  [[8.50437328]]
Epoch:  22 , Loss:  [[17.01417255]] , Val Loss:  [[8.49409597]]
Epoch:  23 , Loss:  [[17.20918888]] , Val Loss:  [[8.5854792]]
Epoch:  24 , Loss:  [[16.92068017]] , Val Loss:  [[8.44794633]]
Epoch:  25 , Loss:  [[16.76856238]] , Val Loss:  [[8.37295808]]

Looking good! Time to get the predictions and plot them to get a visual idea of what we have designed.

Step 3: Get predictions

We will make a forward pass through the trained weights to obtain our predictions:

preds = []
for i in range(Y.shape[0]):
    x, y = X[i], Y[i]
    prev_s = np.zeros((hidden_dim, 1))
    # Forward pass
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s

    preds.append(mulv)
    
preds = np.array(preds)

Plotting these predictions along with the actual values:

plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y[:, 0], 'r')
plt.show()

This was on the training data. How do we know whether our model overfit? This is where the validation set, which we created earlier, comes into play:

preds = []
for i in range(Y_val.shape[0]):
    x, y = X_val[i], Y_val[i]
    prev_s = np.zeros((hidden_dim, 1))
    # For each time step...
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s

    preds.append(mulv)
    
preds = np.array(preds)

plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y_val[:, 0], 'r')
plt.show()

Not bad. The predictions look impressive. The RMSE score on the validation data is also respectable:

from sklearn.metrics import mean_squared_error

math.sqrt(mean_squared_error(Y_val[:, 0] * max_val, preds[:, 0, 0] * max_val))
0.127191931509431

Final notes

I can't stress enough how useful RNNs are when working with sequence data. I implore you all to take this learning and apply it to a dataset. Take an NLP problem and see if you can find a solution. You can always reach out to me in the comments section below if you have any questions.

In this post, we learned how to create a recurrent neural network model from scratch using just the numpy library. Of course, you can use a high-level library like Keras or Caffe, but it is essential to know the concept you are implementing.

Share your thoughts, questions and comments about this post below. Happy learning!
