Advanced Deep Learning Architectures

Introduction

It is getting very difficult to keep up with recent advances in deep learning: hardly a day goes by without a new innovation or a new application. Most of these advances, however, are hidden inside the large number of research papers published on venues such as arXiv and Springer.


To keep ourselves updated, we created a small reading group to share what we learn internally at DataPeaker. One of those learnings that I would like to share with the community is this study of advanced architectures developed by the research community.

This article covers some of the recent advances in deep learning, along with code sketches for implementing them with the keras library. I have also provided links to the original papers, in case you are interested in reading them or want to refer to them.

To keep the article concise, I have only considered architectures that have been successful in the computer vision domain.

If you're interested, keep reading!

PS: This article assumes knowledge of neural networks and familiarity with keras. If you need to catch up on these topics, I highly recommend reading an introduction to them first.

Table of Contents

  • What do we understand by advanced architecture?
  • Types of computer vision tasks
  • List of deep learning architectures

What do we understand by advanced architecture?

Compared to a single traditional machine learning algorithm, deep learning encompasses a remarkably diverse set of models. This is due to the flexibility that neural networks provide when building a complete end-to-end model.

Neural networks can be compared to Lego blocks: you can build almost any simple or complex structure that your imagination allows.


We can define an advanced architecture as one with a proven track record as a successful model. This is mostly seen in challenges like ImageNet, where the task is to solve a problem, say image recognition, using the data provided. For those who don't know, ImageNet is the dataset used in the ILSVRC (ImageNet Large Scale Visual Recognition Challenge).

Also, as described below, each of these architectures has a nuance that differentiates it from the usual models, giving it an advantage when used to solve a problem. These architectures also fall into the category of "deep" models, so they are likely to perform better than their shallow counterparts.

Types of computer vision tasks

This article is mainly focused on computer vision, so it is natural to describe the range of computer vision tasks. Computer vision, as the name suggests, is about creating artificial models that can replicate the visual tasks performed by a human being. This essentially means that what we can see and perceive is a process that can be understood and implemented in an artificial system.

The main types of tasks into which computer vision can be classified are as follows:

  • Object recognition / classification – In object recognition, you are given a raw image and your task is to identify which class the image belongs to.
  • Classification + localization – If there is only one object in the image and your task is to find its location, the more specific term for this problem is the localization problem.
  • Object detection – In object detection, your task is to identify where the objects are in the image. These objects can be of the same class or of completely different classes.
  • Image segmentation – Image segmentation is a somewhat more sophisticated task, where the goal is to map each pixel to its correct class.


List of deep learning architectures

Now that we have understood what an advanced architecture is and have explored the tasks of computer vision, let's list the most important architectures and their descriptions:

1. AlexNet

AlexNet is the first deep architecture to be introduced by one of the pioneers of deep learning, Geoffrey Hinton, and his colleagues. It is a simple but powerful network architecture that helped pave the way for today's groundbreaking deep learning research.


Broken down, AlexNet looks like a simple architecture: convolutional and pooling layers stacked on top of each other, followed by fully connected layers at the top. The model itself is very simple and had already been conceptualized in the 1980s. What sets AlexNet apart are the scale at which it performs the task and its use of GPUs for training. In the 1980s, neural networks were trained on CPUs, whereas AlexNet sped up training roughly ten times simply by using GPUs.
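Below is a minimal Keras sketch of an AlexNet-style network. It is illustrative rather than a faithful reproduction: the original's local response normalization and its two-GPU split of the filters are omitted.

```python
# A minimal sketch of an AlexNet-style network in Keras (illustrative only).
from tensorflow.keras import layers, models

def alexnet_like(input_shape=(227, 227, 3), num_classes=1000):
    return models.Sequential([
        # Five convolutional layers interleaved with max pooling
        layers.Conv2D(96, 11, strides=4, activation='relu', input_shape=input_shape),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding='same', activation='relu'),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding='same', activation='relu'),
        layers.Conv2D(384, 3, padding='same', activation='relu'),
        layers.Conv2D(256, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(3, strides=2),
        # Fully connected head on top
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),
    ])

model = alexnet_like()
model.summary()
```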

Although a bit outdated today, AlexNet is still used as a starting point for applying deep neural networks to all kinds of tasks, whether computer vision or speech recognition.

2. VGG Net

The VGG network was introduced by researchers from the Visual Geometry Group at Oxford (hence the name VGG). This network is characterized by its pyramidal shape: the lower layers, closest to the image, are wide, while the upper layers are deep.


VGG consists of successive convolutional layers followed by pooling layers, the pooling layers being responsible for making the feature maps narrower. In their paper, the authors proposed multiple networks of this type, varying the depth of the architecture.


The advantages of VGG are:

  • It is a very good architecture for benchmarking on a particular task.
  • What's more, pretrained VGG networks are freely available on the Internet, so VGG is commonly used for various applications (see the sketch below).

On the other hand, its main disadvantage is that it is very slow to train from scratch: even on a decent GPU, it would take more than a week to converge.
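For example, a pretrained VGG16 can be loaded directly from keras.applications; freezing its convolutional base is a common starting point for transfer learning:

```python
# Loading a pretrained VGG16 from keras.applications.
from tensorflow.keras.applications import VGG16

# include_top=False drops the fully connected head, the usual
# starting point for transfer learning on a new task.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional base for feature extraction
base.summary()
```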

3. GoogleNet

GoogleNet (or the Inception network) is a class of architectures designed by researchers at Google. GoogleNet won ImageNet 2014, where it proved to be a powerful model.

In this architecture, in addition to going deeper (it contains 22 layers, compared to 19 for VGG), the researchers introduced a novel building block called the Inception module.


This is a drastic change from the sequential architectures we saw earlier: a single layer contains several types of "feature extractors" in parallel. This indirectly helps the network perform better, since during training it has many options to choose from when solving the task; it can either convolve the input or pool it directly.
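As a sketch, here is a naive Inception module written with the Keras functional API. The input shape and filter counts are illustrative, and the published module also adds 1x1 "bottleneck" convolutions before the larger filters to cut computation.

```python
# A naive Inception module: parallel 1x1, 3x3 and 5x5 convolutions plus a
# pooling branch, concatenated along the channel axis.
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(28, 28, 192))  # illustrative input shape

branch1 = layers.Conv2D(64, 1, padding='same', activation='relu')(inputs)
branch2 = layers.Conv2D(128, 3, padding='same', activation='relu')(inputs)
branch3 = layers.Conv2D(32, 5, padding='same', activation='relu')(inputs)
branch4 = layers.MaxPooling2D(3, strides=1, padding='same')(inputs)

# All branches keep the spatial size, so they can be concatenated
outputs = layers.concatenate([branch1, branch2, branch3, branch4])
model = Model(inputs, outputs)
model.summary()
```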


The final architecture stacks several of these Inception modules on top of each other. Even the training is slightly different in GoogleNet, since intermediate layers have auxiliary output layers of their own. This nuance helps the model converge faster, because the layers are trained both jointly and in parallel.

The advantages of GoogleNet are:

  • GoogleNet trains faster than VGG.
  • The size of a pretrained GoogleNet model is comparatively smaller than VGG: a VGG model can be more than 500 MB, while GoogleNet takes up only 96 MB.

GoogleNet does not have an immediate disadvantage per se, but further changes to the architecture have been proposed that make the model perform even better. One such change is the Xception network, in which the divergence limit of the Inception module (4 parallel branches in GoogleNet) is increased. Theoretically it can now be infinite (hence the name: "extreme Inception").

4. ResNet

ResNet is one of the monster architectures that truly defines how deep a deep learning architecture can be. Residual networks (ResNet for short) consist of several subsequent residual modules, which are the basic building blocks of the ResNet architecture.


In simple words, a residual module has two options: it can either perform a set of functions on the input, or skip this step altogether and pass the input through unchanged.
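A minimal Keras sketch of such a residual module might look as follows. It assumes the input already has `filters` channels; the general case adds a 1x1 convolution on the shortcut to match shapes.

```python
# A basic residual block: two convolutions plus an identity "skip" connection
# that adds the input back to the transformed output.
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x  # the skip path: the input is carried over unchanged
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([shortcut, y])  # the skip connection
    return layers.Activation('relu')(y)
```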

Now, similar to GoogleNet, these residual modules are stacked on top of each other to form a complete end-to-end network.


Some newer techniques that ResNet introduced include:

  • Use of standard SGD instead of a fancy adaptive learning technique, done in conjunction with a reasonable initialization function that keeps training intact.
  • Changes in input preprocessing, where the input is first divided into patches and then fed to the network.

The main advantage of ResNet is that hundreds, even thousands, of these residual layers can be used to build a network and then trained. This differs from the usual sequential networks, where performance improvements diminish as the number of layers increases.

5. ResNeXt

ResNeXt is said to be the current state of the art for object recognition. It builds on the concepts of Inception and ResNet to produce a new and improved architecture: inside each residual block, the transformation is split into a set of identical parallel branches (the "cardinality") whose outputs are aggregated.
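A rough Keras sketch of this split-transform-merge idea follows; the cardinality and branch widths are illustrative, not the published hyperparameters.

```python
# A ResNeXt-style block: `cardinality` identical bottleneck branches whose
# outputs are summed, plus the usual residual shortcut.
from tensorflow.keras import layers

def resnext_block(x, cardinality=8, branch_width=4, out_filters=64):
    branches = []
    for _ in range(cardinality):
        # Each branch is a small 1x1 -> 3x3 -> 1x1 bottleneck
        b = layers.Conv2D(branch_width, 1, padding='same', activation='relu')(x)
        b = layers.Conv2D(branch_width, 3, padding='same', activation='relu')(b)
        b = layers.Conv2D(out_filters, 1, padding='same')(b)
        branches.append(b)
    y = layers.add(branches)   # aggregate the transformed branches
    y = layers.add([x, y])     # residual shortcut (assumes matching shapes)
    return layers.Activation('relu')(y)
```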


6. RCNN (Region-based CNN)

The region-based CNN architecture is said to be the most influential of all the deep learning architectures that have been applied to the object detection problem. To solve the detection problem, RCNN tries to draw a bounding box over every object present in the image and then recognize which object each box contains. Roughly, the pipeline works as follows:

  • Generate category-independent region proposals (for example, with selective search).
  • Warp each proposed region to a fixed size and extract features from it with a CNN.
  • Classify each region and refine its bounding box.
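As a rough sketch of this pipeline, the snippet below uses selective search (from opencv-contrib-python) for the proposal step and a pretrained ResNet50 as a stand-in classifier; the original RCNN instead fine-tunes the CNN and classifies its features with per-class SVMs. The input file name is hypothetical.

```python
# A rough RCNN-style sketch: propose regions, then classify each with a CNN.
import cv2
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)

model = ResNet50(weights='imagenet')

img = cv2.imread('input.jpg')               # hypothetical input image
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # keras models expect RGB

# Selective search stands in for RCNN's category-independent proposal step
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
rects = ss.process()                        # proposals as (x, y, w, h)

for (x, y, w, h) in rects[:50]:             # classify the first few only
    crop = cv2.resize(rgb[y:y + h, x:x + w], (224, 224))
    batch = preprocess_input(np.expand_dims(crop.astype('float32'), 0))
    _, label, score = decode_predictions(model.predict(batch), top=1)[0][0]
    print((x, y, w, h), label, score)
```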

7. YOLO (you only look once)

YOLO is the current state-of-the-art, real-time, deep learning-based system for solving image detection problems. It first divides the image into a grid of defined bounding boxes and then runs a recognition algorithm in parallel over all of these boxes to identify which object class each belongs to. After identifying these classes, it cleverly merges the boxes to form an optimal bounding box around each object.
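To make the grid idea concrete, here is a toy numpy sketch of how such an output tensor could be read; the layout and threshold are illustrative, not the exact published format.

```python
# Toy decoding of a YOLO-style output grid: each of the S x S cells predicts
# B boxes (x, y, w, h, confidence) plus C class scores.
import numpy as np

S, B, C = 7, 2, 20                         # grid size, boxes per cell, classes
output = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output

boxes = []
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
            score = conf * class_probs.max()  # class-specific confidence
            if score > 0.5:                   # discard low-confidence boxes
                boxes.append((row, col, x, y, w, h,
                              class_probs.argmax(), score))

# The real system then merges surviving boxes with non-maximum suppression.
print(len(boxes), 'boxes above threshold')
```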


All of this is done in parallel, so YOLO can run in real time, processing up to 40 images per second.

Although its accuracy is lower than that of its RCNN counterpart, its real-time speed makes it feasible for day-to-day problems.


8. SqueezeNet

The SqueezeNet architecture is a powerful architecture that is extremely useful in low-bandwidth scenarios such as mobile platforms. It takes up only 4.9 MB of space, whereas Inception takes up ~100 MB! This drastic change comes from a specialized structure called the fire module.
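A minimal Keras sketch of a fire module is shown below; the squeeze and expand widths follow the defaults reported in the paper, but the snippet is illustrative rather than a full SqueezeNet.

```python
# A SqueezeNet fire module: a 1x1 "squeeze" layer that shrinks the number of
# channels, followed by an "expand" layer mixing 1x1 and 3x3 convolutions.
from tensorflow.keras import layers

def fire_module(x, squeeze=16, expand=64):
    s = layers.Conv2D(squeeze, 1, activation='relu')(x)                  # squeeze
    e1 = layers.Conv2D(expand, 1, activation='relu')(s)                  # expand 1x1
    e3 = layers.Conv2D(expand, 3, padding='same', activation='relu')(s)  # expand 3x3
    return layers.concatenate([e1, e3])
```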


The final SqueezeNet architecture is built by stacking several of these fire modules, with pooling layers interspersed.


9. SegNet

SegNet is a deep learning architecture applied to solve image segmentation problems. It consists of a sequence of processing layers (encoders) followed by a corresponding set of decoders that produce a pixel-wise classification.


A key feature of SegNet is that it preserves high-frequency detail in the segmented image, because the max-pooling indices of the encoder network are passed to the corresponding upsampling layers of the decoder network. In short, the information transfer is direct rather than convolved. SegNet is one of the best models to use for image segmentation problems.
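Here is a deliberately simplified encoder-decoder sketch in Keras. Note that real SegNet upsamples using the max-pooling indices saved by the encoder (e.g. via tf.nn.max_pool_with_argmax); plain UpSampling2D is used here only to keep the sketch short, and the class count is arbitrary.

```python
# A simplified encoder-decoder in the spirit of SegNet.
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(224, 224, 3))

# Encoder: convolutions followed by pooling
x = layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.MaxPooling2D(2)(x)

# Decoder: mirror of the encoder, upsampling back to the input resolution
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)

# Per-pixel classification over 21 classes (an arbitrary choice here)
outputs = layers.Conv2D(21, 1, activation='softmax')(x)
model = Model(inputs, outputs)
```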

10. GAN (Generative Adversarial Network)

GANs are a completely different class of neural network architectures, in which one neural network is used to generate a completely new image that is not present in the training dataset but is realistic enough to belong in it. I have covered how GANs work in a separate article; check it out if you are curious.
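As a minimal sketch, here are the two components of a GAN in Keras, with illustrative layer sizes for 28x28 grayscale images; the full training loop that alternates between the two models is omitted.

```python
# The two GAN components: a generator mapping random noise to an image, and
# a discriminator scoring images as real or fake.
from tensorflow.keras import layers, models

latent_dim = 100  # size of the random noise vector

generator = models.Sequential([
    layers.Dense(256, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(512, activation='relu'),
    layers.Dense(28 * 28, activation='tanh'),
    layers.Reshape((28, 28, 1)),
])

discriminator = models.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),
    layers.Dense(512, activation='relu'),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # probability the input is real
])

# Training alternates: the discriminator learns to separate real from
# generated images, while the generator learns to fool the discriminator.
```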


Final notes

In this article, I have covered an overview of the major deep learning architectures that you should be familiar with. If you have any questions about deep learning architectures, feel free to share them with me in the comments.

Learn, compete, hack and get hired!
