Introduction
Humans do not reset their understanding of language every time they hear a sentence. As you read this post, you understand each word based on your understanding of the previous words. One of our defining characteristics is memory (or retention).
Can an algorithm replicate this? The first technique that comes to mind is a neural network (NN). But, sadly, traditional NNs cannot do this. Take the example of wanting to predict what comes next in a video: a traditional neural network will have a hard time generating accurate results.
That's where the concept of recurrent neural networks (RNNs) comes into play. RNNs have become extremely popular in the deep learning space, which makes learning them all the more worthwhile. Some real-world applications of RNNs include:
- Speech recognition
- Machine translation
- Music composition
- Handwriting recognition
- Grammar learning
In this post, we will first quickly review the core components of a typical RNN model. Then we will set up the problem statement, which we will finally solve by implementing an RNN model from scratch in Python.
We could always use high-level Python libraries to code an RNN. So why code it from scratch? I firmly believe that the best way to learn and truly internalize a concept is to build it from the ground up. And that's what I will show in this tutorial.
This post assumes a basic understanding of recurrent neural networks. In case you need a quick refresher or are looking to learn the basics of RNNs, I recommend reading the posts below first:
Table of Contents
- Flashback: a summary of recurrent neural network concepts
- Sequence prediction using RNN
- Building an RNN model using Python
Flashback: a summary of recurrent neural network concepts
Let's quickly recap the basics behind recurrent neural networks.
We will do this using an example of sequence data, say the stock price of a particular company. A simple machine learning model, or an artificial neural network, can learn to predict the stock price based on a number of features: the volume of shares, the opening price, and so on. Beyond these, the price also depends on how the stock performed in previous weeks and days. For a trader, this historical data is actually a major deciding factor in making predictions.
In conventional feedforward neural networks, all test cases are considered independent. Can you see why that doesn't fit well when predicting stock prices? The NN model would not consider the previous stock price values. Not a great idea!
There is another concept we can rely on when dealing with time-sensitive data: recurrent neural networks (RNN).
A typical RNN looks like this:
This may seem intimidating at first. But once we unroll it, things start to look a lot simpler:
Now it is easier for us to visualize how these networks consider the trend of stock prices before predicting today's prices. Here, every prediction at time t (h_t) depends on all previous predictions and the information learned from them. Pretty straightforward, right?
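In symbols, this is the recurrence that carries memory forward. With x_t the input at time step t, h_{t-1} the previous hidden state, and U, W and V the weight matrices we will also use in the code later:

h_t = sigmoid(U · x_t + W · h_{t-1})
y_t = V · h_t

Because U, W and V are shared across all time steps, every prediction is shaped by everything the network has seen before.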
RNNs can solve our purpose of handling sequences to a great extent, but not entirely.
Text is another good example of sequence data. Being able to predict which word or phrase comes after a given text could be a very useful asset. We want our models to write Shakespearean sonnets!
Now, RNNs are excellent when it comes to contexts that are short or small in nature. But to be able to build a story and remember it, our models must be able to understand the context behind the sequences, just like a human brain.
Sequence prediction using RNN
In this post, we will work on a sequence prediction problem using RNNs. One of the simplest tasks for this is sine wave prediction. The sequence contains a visible trend and is easy to solve with heuristics. This is what a sine wave looks like:
First we will design a recurrent neural network from scratch to solve this problem. Our RNN model should also generalize well so that we can apply it to other sequence problems.
We will formulate our problem this way: given a sequence of 50 numbers belonging to a sine wave, predict the 51st number of the series. Time to fire up your Jupyter notebook (or your IDE of choice)!
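Concretely, the first training example would look like this (a tiny illustration; sin_wave is the same series we generate in Step 0 below):

import math
import numpy as np

sin_wave = np.array([math.sin(x) for x in np.arange(200)])  # same series as in Step 0
x_example = sin_wave[0:50]   # the first 50 points form the input sequence
y_example = sin_wave[50]     # the 51st point is the value to predict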
Building an RNN model using Python
Step 0: data preparation
Ah, the inevitable first step in any data science project: preparing the data before doing anything else.
How does our network model expect the data? It would accept a single sequence of length 50 as input. So the shape of the input data will be:
(number_of_records x length_of_sequence x types_of_sequences)
Here, types_of_sequences is 1, because we have only one type of sequence: the sine wave.
Besides, the output will have only one value for each record: the 51st value in the input sequence. So its shape would be:
(number_of_records x types_of_sequences) #where types_of_sequences is 1
Let's dive into the code. First, import the necessary libraries:
%matplotlib inline
import math
import numpy as np
import matplotlib.pyplot as plt
To create a sine wave as data, we will use the sine function from Python's math library:
sin_wave = np.array([math.sin(x) for x in np.arange(200)])
Visualizing the sine wave we just generated:
plt.plot(sin_wave[:50])
We will create the data now in the following code block:
X = []
Y = []

seq_len = 50
num_records = len(sin_wave) - seq_len

for i in range(num_records - 50):
    X.append(sin_wave[i:i+seq_len])
    Y.append(sin_wave[i+seq_len])

X = np.array(X)
X = np.expand_dims(X, axis=2)

Y = np.array(Y)
Y = np.expand_dims(Y, axis=1)
Print the shape of the data:
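A minimal way to do that, given how X and Y were constructed above (200 samples, windows of length 50, and 50 records held back for validation):

print(X.shape, Y.shape)
# (100, 50, 1) (100, 1)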
Note that we looped for (num_records - 50) iterations because we want to set aside 50 records as our validation data. We can create this validation data now:
X_val = []
Y_val = []

for i in range(num_records - 50, num_records):
    X_val.append(sin_wave[i:i+seq_len])
    Y_val.append(sin_wave[i+seq_len])

X_val = np.array(X_val)
X_val = np.expand_dims(X_val, axis=2)

Y_val = np.array(Y_val)
Y_val = np.expand_dims(Y_val, axis=1)
Step 1: Create the architecture for our RNN model
Our next task is to define all the variables and functions we will use in the RNN model. Our model will take the input sequence, process it through a hidden layer of 100 units, and produce a single-value output:
learning_rate = 0.0001
nepoch = 25
T = 50                 # length of sequence
hidden_dim = 100
output_dim = 1

bptt_truncate = 5      # how far back to propagate gradients through time
min_clip_value = -10
max_clip_value = 10
Next, we will define the weights of the network:
U = np.random.uniform(0, 1, (hidden_dim, T))
W = np.random.uniform(0, 1, (hidden_dim, hidden_dim))
V = np.random.uniform(0, 1, (output_dim, hidden_dim))
Here,
- U is the weight matrix for the weights between the input and hidden layers
- V is the weight matrix for the weights between the hidden and output layers
- W is the weight matrix for the shared weights in the RNN layer (hidden layer); a quick shape check follows after this list
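A quick sanity check on those shapes, using the dimensions set above (hidden_dim = 100, T = 50, output_dim = 1):

print(U.shape, W.shape, V.shape)
# (100, 50) (100, 100) (1, 100)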
Finally, we will define the activation function, the sigmoid, to be used in the hidden layer:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
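One property worth noting, since we will lean on it in the backward pass: if s = sigmoid(x), its derivative is simply s * (1 - s). A quick numerical check (illustrative only, not part of the model code):

eps = 1e-6
x0 = 0.5
numeric = (sigmoid(x0 + eps) - sigmoid(x0 - eps)) / (2 * eps)
analytic = sigmoid(x0) * (1 - sigmoid(x0))
print(numeric, analytic)   # both come out to roughly 0.235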
Step 2: train the model
Now that we have defined our model, we can finally proceed to train it on our sequence data. We can subdivide the training procedure into smaller steps, namely:
Step 2.1: Check the loss on the training data
Step 2.1.1: Forward pass
Step 2.1.2: Calculate the error
Step 2.2: Check the loss on the validation data
Step 2.2.1: Forward pass
Step 2.2.2: Calculate the error
Step 2.3: Start the actual training
Step 2.3.1: Forward pass
Step 2.3.2: Backpropagate the error
Step 2.3.3: Update the weights
We need to repeat these steps until convergence. If the model starts to overfit, stop! Or simply preset the number of epochs.
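Here is what "stop when it starts to overfit" can look like in code. This is a hypothetical early-stopping sketch with made-up losses and illustrative names (val_history, patience); it is not part of the training loop we build below:

val_history = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8]   # made-up validation losses per epoch
best_val = float('inf')
patience, bad_epochs = 3, 0
for epoch, val_loss in enumerate(val_history):
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0       # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print('stopping early at epoch', epoch + 1)
            break                                # validation loss stopped improving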
Step 2.1: Check the loss on the training data
We will do a forward pass through our RNN model and calculate the squared error of the predictions for all records to obtain the loss value.
for epoch in range(nepoch):
    # check loss on the training data
    loss = 0.0

    # do a forward pass to get the prediction for each record
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]                    # input and output values of this record
        prev_s = np.zeros((hidden_dim, 1))   # previous hidden-layer activation, initialized to all zeros
        # do a forward pass for every timestep in the sequence
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]              # set a single input for this timestep
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        # calculate the error on the final prediction
        loss_per_record = (y - mulv)**2 / 2
        loss += loss_per_record
    loss = loss / float(Y.shape[0])
Step 2.2: Check the loss on the validation data
We will do the same to calculate the loss on the validation data (in the same loop):
    # check loss on the validation data
    val_loss = 0.0
    for i in range(Y_val.shape[0]):
        x, y = X_val[i], Y_val[i]
        prev_s = np.zeros((hidden_dim, 1))
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        loss_per_record = (y - mulv)**2 / 2
        val_loss += loss_per_record
    val_loss = val_loss / float(Y_val.shape[0])

    print('Epoch: ', epoch + 1, ', Loss: ', loss, ', Val Loss: ', val_loss)
You should get the following output:
Epoch: 1 , Loss: [[101185.61756671]] , Val Loss: [[50591.0340148]]
...
Step 2.3: Start the actual training
Now we will start the actual training of the network. Here we will first do a forward pass to calculate the errors, and then a backward pass to calculate the gradients and update the weights. Let me show you these step by step so you can visualize how it works in your mind.
Step 2.3.1: Forward pass
In the forward pass:
- First we multiply the input with the weights between the input and the hidden layers.
- Add this to the product of the weights in the RNN layer and the previous hidden state. This is because we want to capture the knowledge of the previous time step.
- Pass it through a sigmoid activation function.
- Multiply this with the weights between the hidden and output layers.
- At the output layer, we have a linear activation of the values, so we do not explicitly pass the value through an activation layer.
- Save the state at the current layer, along with the state from the previous time step, in a dictionary (see the notational summary after this list).
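Mapping those bullets onto the variable names used in the code below (a notational summary):

mulu = U · new_input      (input to hidden)
mulw = W · prev_s         (previous hidden state to hidden)
add  = mulu + mulw
s    = sigmoid(add)       (hidden activation)
mulv = V · s              (hidden to output, linear)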
Here is the code to perform a forward pass (note that it is a continuation of the previous loop):
    # train the model
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]

        layers = []
        prev_s = np.zeros((hidden_dim, 1))
        dU = np.zeros(U.shape)
        dV = np.zeros(V.shape)
        dW = np.zeros(W.shape)

        dU_t = np.zeros(U.shape)
        dV_t = np.zeros(V.shape)
        dW_t = np.zeros(W.shape)

        dU_i = np.zeros(U.shape)
        dW_i = np.zeros(W.shape)

        # forward pass
        for t in range(T):
            new_input = np.zeros(x.shape)   # set a single input for this timestep
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            layers.append({'s': s, 'prev_s': prev_s})
            prev_s = s
Step 2.3.2: Backpropagate the error
After the forward propagation step, we calculate the gradients at each layer and backpropagate the errors. We will use truncated backpropagation through time (TBPTT) instead of vanilla backpropagation. It may sound complex, but it's actually quite simple.
The central difference between BPTT and regular backprop is that the backpropagation step is performed for all the time steps in the RNN layer. So, if the length of our sequence is 50, we backpropagate through all the time steps prior to the current time step.
As you may have guessed, BPTT is very computationally expensive. So, instead of backpropagating through all previous time steps, we backpropagate only up to x time steps to save compute. For example, with bptt_truncate = 5, the gradient at time step 20 only flows back through steps 19 down to 15. Consider this conceptually similar to stochastic gradient descent, where we include a batch of data points instead of all the data points.
Here is the code to propagate the errors backwards:
        # derivative of the prediction
        dmulv = (mulv - y)

        # backward pass
        for t in range(T):
            dV_t = np.dot(dmulv, np.transpose(layers[t]['s']))
            dsv = np.dot(np.transpose(V), dmulv)

            ds = dsv
            # the derivative of sigmoid(add) is s * (1 - s)
            dadd = layers[t]['s'] * (1 - layers[t]['s']) * ds

            dmulw = dadd * np.ones_like(mulw)
            dprev_s = np.dot(np.transpose(W), dmulw)

            # truncated BPTT: walk back at most bptt_truncate time steps
            for j in range(t-1, max(-1, t-bptt_truncate-1), -1):
                ds = dsv + dprev_s
                dadd = layers[j]['s'] * (1 - layers[j]['s']) * ds

                dmulw = dadd * np.ones_like(mulw)
                dmulu = dadd * np.ones_like(mulu)

                dW_i = np.dot(dmulw, np.transpose(layers[j]['prev_s']))
                dprev_s = np.dot(np.transpose(W), dmulw)

                new_input = np.zeros(x.shape)
                new_input[j] = x[j]
                dU_i = np.dot(dmulu, np.transpose(new_input))

                dU_t += dU_i
                dW_t += dW_i

            dV += dV_t
            dU += dU_t
            dW += dW_t
Step 2.3.3: Update the weights
Finally, we update the weights with the calculated gradients. One thing we must pay attention to is that the gradients tend to explode if we don't keep them in check. This is a fundamental issue in neural network training, called the exploding gradient problem. We therefore have to clamp them to a range so they don't explode. We can do it like this:
        if dU.max() > max_clip_value:
            dU[dU > max_clip_value] = max_clip_value
        if dV.max() > max_clip_value:
            dV[dV > max_clip_value] = max_clip_value
        if dW.max() > max_clip_value:
            dW[dW > max_clip_value] = max_clip_value

        if dU.min() < min_clip_value:
            dU[dU < min_clip_value] = min_clip_value
        if dV.min() < min_clip_value:
            dV[dV < min_clip_value] = min_clip_value
        if dW.min() < min_clip_value:
            dW[dW < min_clip_value] = min_clip_value

        # update the weights
        U -= learning_rate * dU
        V -= learning_rate * dV
        W -= learning_rate * dW
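As an aside, NumPy's built-in np.clip performs the same element-wise clamping in one call. A self-contained example of the operation:

import numpy as np

g = np.array([-25.0, 3.0, 12.0])   # sample gradient values
print(np.clip(g, -10, 10))         # [-10.   3.  10.]

In the loop above, dU = np.clip(dU, min_clip_value, max_clip_value) (and likewise for dV and dW) would be a drop-in equivalent.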
On training the above model, we get this output:
Epoch: 1 , Loss: [[101185.61756671]] , Val Loss: [[50591.0340148]]
Epoch: 2 , Loss: [[61205.46869629]] , Val Loss: [[30601.34535365]]
Epoch: 3 , Loss: [[31225.3198258]] , Val Loss: [[15611.65669247]]
Epoch: 4 , Loss: [[11245.17049551]] , Val Loss: [[5621.96780111]]
Epoch: 5 , Loss: [[1264.5157739]] , Val Loss: [[632.02563908]]
Epoch: 6 , Loss: [[20.15654115]] , Val Loss: [[10.05477285]]
Epoch: 7 , Loss: [[17.13622839]] , Val Loss: [[8.55190426]]
Epoch: 8 , Loss: [[17.38870495]] , Val Loss: [[8.68196484]]
Epoch: 9 , Loss: [[17.181681]] , Val Loss: [[8.57837827]]
Epoch: 10 , Loss: [[17.31275313]] , Val Loss: [[8.64199652]]
Epoch: 11 , Loss: [[17.12960034]] , Val Loss: [[8.54768294]]
Epoch: 12 , Loss: [[17.09020065]] , Val Loss: [[8.52993502]]
Epoch: 13 , Loss: [[17.17370113]] , Val Loss: [[8.57517454]]
Epoch: 14 , Loss: [[17.04906914]] , Val Loss: [[8.50658127]]
Epoch: 15 , Loss: [[16.96420184]] , Val Loss: [[8.46794248]]
Epoch: 16 , Loss: [[17.017519]] , Val Loss: [[8.49241316]]
Epoch: 17 , Loss: [[16.94199493]] , Val Loss: [[8.45748739]]
Epoch: 18 , Loss: [[16.99796892]] , Val Loss: [[8.48242177]]
Epoch: 19 , Loss: [[17.24817035]] , Val Loss: [[8.6126231]]
Epoch: 20 , Loss: [[17.00844599]] , Val Loss: [[8.48682234]]
Epoch: 21 , Loss: [[17.03943262]] , Val Loss: [[8.50437328]]
Epoch: 22 , Loss: [[17.01417255]] , Val Loss: [[8.49409597]]
Epoch: 23 , Loss: [[17.20918888]] , Val Loss: [[8.5854792]]
Epoch: 24 , Loss: [[16.92068017]] , Val Loss: [[8.44794633]]
Epoch: 25 , Loss: [[16.76856238]] , Val Loss: [[8.37295808]]
Looking good! Time to get the predictions and plot them to get a visual idea of what we have designed.
Step 3: get predictions
We will make a forward pass through the trained weights to obtain our predictions:
preds = []
for i in range(Y.shape[0]):
    x, y = X[i], Y[i]
    prev_s = np.zeros((hidden_dim, 1))
    # forward pass through the trained weights
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s
    preds.append(mulv)

preds = np.array(preds)
Plotting these predictions along with the actual values:
plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y[:, 0], 'r')
plt.show()
This was on the training data. How do we know whether our model has overfit? This is where the validation set, which we created earlier, comes in:
preds = []
for i in range(Y_val.shape[0]):
    x, y = X_val[i], Y_val[i]
    prev_s = np.zeros((hidden_dim, 1))
    # for each time step...
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s
    preds.append(mulv)

preds = np.array(preds)

plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y_val[:, 0], 'r')
plt.show()
Not bad. The predictions look impressive. The RMSE score on the validation data is respectable as well:
from sklearn.metrics import mean_squared_error

max_val = 1  # the sine wave was never rescaled in this tutorial, so the scale factor is 1
math.sqrt(mean_squared_error(Y_val[:, 0] * max_val, preds[:, 0, 0] * max_val))
0.127191931509431
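As a side note, depending on your scikit-learn version, mean_squared_error can return the RMSE directly via its squared flag, making the explicit math.sqrt unnecessary:

rmse = mean_squared_error(Y_val[:, 0], preds[:, 0, 0], squared=False)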
Final notes
I can't stress enough how useful RNNs are when working with sequence data. I encourage everyone to take this learning and apply it to a dataset. Pick an NLP problem and see if you can find a solution. You can always reach me in the comments section below if you have any questions.
In this post, we learned how to create a recurrent neural network model from scratch using just the numpy library. Of course, you can use a high-level library like Keras or Caffe, but it is essential to know the concept you are implementing.
Share your thoughts, questions and comments about this post below. Happy learning!