What are convolutional recurrent neural networks: a short explanation


This article was published as part of the Data Science Blogathon.

Introduction

Hello! Today I will do my best to explain intuitively how convolutional recurrent neural networks (CRNN) work. When I first tried to learn how CRNNs work, I found the information scattered across several sites and at different levels of "depth", so I will try to explain them in a way that, by the end of this article, you will know exactly how they work and why they perform better in some categories of tasks than others.

In this article, I will assume you already know a little about how a simple neural network works. If you need a quick review of how they work, or even if you don't know how they work at all, I recommend watching the well-made explanatory videos linked at the end of the article. I will provide all the information I consider necessary to understand intuitively how a CRNN works.
In this article we will cover the following topics, so feel free to skip the ones you already know:

  1. What are convolutional neural networks, how do they work and why do we need them?
  2. What are recurrent neural networks, how do they work and why do we need them?
  3. What are convolutional recurrent neural networks and why do we need them? + handwritten text recognition example
  4. More reading and links

What are convolutional neural networks, how do they work and why do we need them?

The easiest question to answer is the last one: why do we need them? Let's take an example. Say we want to find out whether there is a cat or a dog in an image. To simplify the explanation, let's first think of a 3 × 3 image. In this picture, we have an important feature inside the blue rectangle (a dog's face, a letter or whatever the important feature may be).

[Image: a 3 × 3 example image with the important feature highlighted in a blue rectangle]

Let's see how a simple neural network would capture the importance of the pixels and the links between them.

[Image: a simple dense neural network fed with the flattened image for feature extraction]

As we can see, we need to "flatten" the image to feed it to a dense neural network. In doing so, we lose the spatial context: the relationship of the complete feature to the background, and of the pieces of the feature to one another. Imagine how difficult it will be for the neural network to learn that they are related. What's more, we will have a lot of weights to train, so we will need more data and, thus, more time to train the network.

So we can see multiple problems with this approach:

  • The spatial context is lost
  • Many more weights are needed for larger images
  • More weights mean more training time and more data (see the short sketch after this list)
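
To make the "more weights" point concrete, here is a small NumPy sketch with toy numbers of my own (not taken from the article's figures) that flattens an image and counts the parameters of a single dense layer:

```python
import numpy as np

# A tiny toy "image" that we flatten before feeding it to a dense layer.
image = np.arange(9).reshape(3, 3)   # 3x3 image
flattened = image.reshape(-1)        # shape (9,): the spatial layout is gone

# A single dense layer mapping the flattened image to 4 hidden units
# needs one weight per (input pixel, hidden unit) pair, plus biases.
n_inputs, n_hidden = flattened.size, 4
n_weights = n_inputs * n_hidden + n_hidden
print(n_weights)  # 40 parameters for a 3x3 image

# The same layer for a modest 224x224 RGB image:
print(224 * 224 * 3 * n_hidden + n_hidden)  # 602,116 parameters already
```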

If only there were another way... Wait! There is! This is where convolutional neural networks step in to save the day. Their main function is to extract relevant features from the input (an image, for instance) by using filters. These filters are initialized at random and then trained, just like the weights: they are modified by the neural network so that they extract and find the most relevant features.

OK, so far we have established that convolutional neural networks, which I will refer to as CNNs, use filters to extract features. But what exactly are filters and how do they work?

Filters are matrices containing different values that slide over the image (for instance) to analyze its features. If the filter is, for instance, of size 3 × 3, the feature it can detect will have a maximum size of 3 × 3 in the image; if it is of size 5 × 5, the feature it can detect will have a maximum size of 5 × 5, and so on. By "analyzing" a window of pixels, we mean the element-wise multiplication between the filter and the window it covers, followed by summing the results.

So, for instance, if we have a 6 × 6 image and a 3 × 3 filter, we can imagine the filter sliding over the image, and every time it lands on a new window the analysis takes place. This is represented in the image below, for the first two rows of the image only:

[Image: example of how convolution works, shown for the first two rows of the image]

Depending on what we need to extract, we can change the filter's step, or stride (both vertically and horizontally; in the example above, the filter takes a step of one in each direction).

After doing the element-wise multiplication and summing, the result becomes one pixel of the output image. So, after "analyzing" the first window, we get the first pixel of our output, and so on. In the case presented above, the final image will have a size of 4 × 4. To keep the output the same size as the input, we can apply the filters after padding the image (adding imaginary rows and columns around the borders), but the details are a discussion for another time.
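
As a rough sketch of what this sliding window does (assuming a single-channel image and a "valid" convolution with no padding, which is my simplification rather than anything prescribed by the article), a minimal NumPy implementation could look like this:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Minimal 'valid' convolution: slide the kernel over the image,
    multiply element-wise and sum, producing one output pixel per window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)  # element-wise product, then sum
    return out

image = np.random.rand(6, 6)    # the 6x6 image from the text
kernel = np.random.rand(3, 3)   # a 3x3 filter
print(conv2d(image, kernel).shape)            # (4, 4) with stride 1, no padding
print(conv2d(image, kernel, stride=3).shape)  # (2, 2) with a larger stride
```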

To see even better how convolution works, here are some examples of filters and the effect they have on the output image:

[Image: examples of convolution filters and their effect on the output image]

We can see how different filters detect and "extract" different features. The goal of training a convolutional neural network is to find the best filters, the ones that extract the features most relevant to our task.
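
To see this effect in code, here is a small sketch with a toy image of my own, using SciPy's convolve2d (my tooling choice, not something from the article). It applies a hand-made edge filter and a blur filter; in a CNN, the filter values would be learned instead of fixed by hand:

```python
import numpy as np
from scipy.signal import convolve2d

# A hypothetical 8x8 test image: dark on the left half, bright on the right.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Two classic hand-made filters (the kind a CNN would instead learn):
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)  # responds to vertical edges
box_blur = np.ones((3, 3)) / 9.0                     # averages a 3x3 neighbourhood

edges   = convolve2d(image, vertical_edge, mode="valid")
blurred = convolve2d(image, box_blur, mode="valid")

print(edges)    # large values only near the column where dark meets bright
print(blurred)  # a smooth transition instead of a hard jump
```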

So, to conclude the part about convolutional neural networks, we can summarize the information in 3 simple ideas:

  • What are they: convolutional neural networks are a type of neural network that uses the convolution operation (sliding a filter across an image) to extract relevant features.
  • Why do we need them: they work better than ordinary dense neural networks on data in which there is a strong correlation between, for instance, neighbouring pixels, because the spatial context is not lost.
  • How do they work: they use filters to extract features. Filters are matrices that "slide" over the image, and they are modified during training to extract the most relevant features (see the short PyTorch sketch after this list).
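
As a minimal sketch in PyTorch (a framework choice of mine, not something the article prescribes), this is what "filters that are trained like weights" looks like in practice:

```python
import torch
import torch.nn as nn

# A convolutional layer with 8 filters of size 3x3 over a 1-channel image.
# The filter values live in conv.weight and are updated during training,
# exactly like the weights of a dense layer.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)
print(conv.weight.shape)          # torch.Size([8, 1, 3, 3]) -- 8 learnable 3x3 filters
print(conv.weight.requires_grad)  # True: the optimizer will adjust them

image = torch.randn(1, 1, 6, 6)   # a batch of one 6x6 grayscale image
features = conv(image)
print(features.shape)             # torch.Size([1, 8, 4, 4]) -- one 4x4 map per filter
```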

What are recurrent neural networks, how do they work and why do we need them?

While convolutional neural networks help us extract relevant features from an image, recurrent neural networks help the network take into account information from the past when making predictions or analyzing data.

So if we have, for instance, the following sequence: {2, 4, 6}, and we want to predict what comes next, we can use a recurrent neural network, because at every step it takes into consideration what came before.

We can visualize a simple recurrent cell as shown in the following picture:

[Image: a recurrent neural network cell, folded (left) and unfolded over time steps (right)]

First, let's focus on the right side of the image. Here, xt is the input received at time step t. Following the same example, these could be the numbers from the sequence mentioned above: x0 = 2, x1 = 4, x2 = 6. To take into consideration what came before in time, the property that makes this a recurrent neural network, each cell receives information from the previous time step, represented in the image as v. Each cell has a so-called "state", which intuitively contains the information that is then sent to the next cell.

So, to recap: xt is the cell's input. The cell then decides what the important information is, taking into account the information from previous time steps received through v, and sends it to the next cell. In addition, we have the option of returning the information the cell considered important through the o in the image, the cell's output.

To represent the aforementioned process in a more compact way, we can "fold up" the cells, as represented on the left side of the image.
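
Here is a minimal NumPy sketch of such a recurrent cell, run over the {2, 4, 6} sequence from above. It is a plain "vanilla" RNN cell with untrained random weights, a toy setup of my own rather than anything specific from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: scalar inputs, a 4-dimensional hidden "state".
# The weights below are random stand-ins; in a real RNN they are trained.
W_x = rng.normal(size=(4, 1))   # input -> state
W_h = rng.normal(size=(4, 4))   # previous state -> state
W_o = rng.normal(size=(1, 4))   # state -> output
b   = np.zeros((4, 1))

h = np.zeros((4, 1))            # initial state: nothing seen yet
for x_t in [2.0, 4.0, 6.0]:     # the sequence from the text
    x = np.array([[x_t]])
    h = np.tanh(W_x @ x + W_h @ h + b)  # new state: current input + previous state
    o = W_o @ h                         # the cell's output at this time step
    print(x_t, o.ravel())
```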

We will not go into detail about the exact types of recurrent cells, since there are many options and explaining in detail how they work would take too long. If you are interested, I have left some links that I found very useful at the end of the article.

What are convolutional recurrent neural networks and why do we need them? + handwritten text recognition example

Now we have all the important information needed to understand how a convolutional recurrent network works.

Most of the time, the convolutional part analyzes the image and sends the important features it detects to the recurrent part. The recurrent part analyzes these features in order, taking the previous information into account, to figure out the important links between the features that influence the output.

To understand a little more about how a CRNN works on some tasks, let's take handwritten text recognition as an example.

Let's imagine that we have images that contain words, and we want to train the neural network to tell us which word is in each image.

First, we would like our neural network to be able to extract features that are important for different letters, such as the loops of "g" or "l", or the circles of "a" or "o". For that, we can use a convolutional neural network. As explained above, a CNN uses filters to extract the important features (we saw how different filters have different effects on the initial image). Of course, in practice these filters will detect more abstract features that we can't really interpret, but intuitively we can think of simpler features, like those mentioned above.

Then, we would like to analyze these features. Let's first look at why we can't decide which letter it is based solely on the letter's own features. In the image below, it is hard to tell whether the letter is an "a" or an "o" (as in the word "for") when we look at it in isolation.

[Image: a handwritten letter that could be read as either "a" or "o" depending on the surrounding letters]

The difference lies in the way the letter is linked to the other letters, so we would need information from previous places in the image to be able to determine the letter. Sounds familiar? This is where the RNN part comes in. It sequentially analyzes the information extracted by the CNN, where the input for each cell could be the features detected in a specific segment of the image, as depicted below with only 10 segments (fewer than we would use in real models):

[Image: a word image split into 10 segments, each segment feeding one step of the RNN]

We do not feed the RNN with the image itself (as the picture above might suggest), but with the features extracted from each "segment".

We could also note that processing the image forwards is just as important as processing it backwards, so we can add a layer of cells that processes the features in the other direction, taking both directions into account when calculating the output. We could even process the image vertically, depending on the task at hand.

Hooray! We finally have the image analyzed: the features extracted and analyzed in relation to each other. All we have to do now is add a layer that calculates the loss and an algorithm that decodes the output. For handwritten text recognition we may want to use CTC (Connectionist Temporal Classification), but that is an interesting topic in its own right, and I think it deserves another article.
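
To tie the pieces together, here is a rough, hedged sketch of what such a CRNN could look like in PyTorch. The layer sizes, the 32-pixel image height and the alphabet size are placeholder choices of mine, not taken from the article; the output is a per-column sequence of character scores that a CTC loss could consume:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """A rough CRNN sketch for word images (hypothetical sizes): CNN features ->
    column sequence -> bidirectional RNN -> per-column character scores for CTC."""
    def __init__(self, n_chars):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=128,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, n_chars + 1)  # +1 for the CTC "blank" symbol

    def forward(self, x):            # x: (batch, 1, 32, width)
        f = self.cnn(x)              # (batch, 64, 8, width/4) feature maps
        f = f.permute(0, 3, 1, 2)    # (batch, width/4, 64, 8): one step per image column
        f = f.flatten(2)             # (batch, width/4, 512) sequence of column features
        seq, _ = self.rnn(f)         # analyze columns left-to-right and right-to-left
        return self.fc(seq)          # (batch, width/4, n_chars + 1) character scores

model = TinyCRNN(n_chars=26)
scores = model(torch.randn(2, 1, 32, 128))  # two 32x128 word images
print(scores.shape)                         # torch.Size([2, 32, 27])
```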

Conclusions

In this article, we briefly discussed how convolutional recurrent neural networks work, how they extract and analyze features, and an example of how they could be used.

The convolutional neural network extracts the features by applying the relevant filters, and the recurrent neural network analyzes these features, taking into account the information received from previous time steps.

More reading and links:
