Crowd counting | Building a crowd counting model using Python

Contents

Introduction

Artificial intelligence and machine learning will be our biggest helper in the next decade!!

Today in the morning, I was reading an article that reported that an artificial intelligence system won against 20 lawyers and lawyers were really happy that artificial intelligence can take care of a repetitive part of their roles and help them work on complex issues. These lawyers were pleased that artificial intelligence enables them to perform more satisfying roles.

Today, I will share a similar example: how to count the number of people in a crowd using Deep learning and computer vision? But, before doing that, Let's develop a sense of how easy life is for a crowd counting scientist.

Act like a crowd counting scientist

Let us begin!

Can you help me count / estimate the number of people in this image who will attend this event?

crowd-at-a-stadium-in-johannesburg-south-africa-for-rugby-2741194

Ok, what about this one?

img_2-2634912

Source: ShanghaiTech dataset

You get the hang of it. At the end of this tutorial, we will create an algorithm for crowd counting with astonishing precision (compared to humans like you and me). Will you use such an assistant?

PD This article assumes that you have a basic understanding of how convolutional neural networks work. (CNN). You can refer to the post below for more information on this topic before proceeding.:

Table of Contents

  1. What is crowd counting?
  2. Why is crowd counting required?
  3. Understanding of different computer vision techniques for crowd counting
  4. CSRNet architecture and training methods
  5. Building Your Own Crowd Counting Model in Python

This article is very inspired by the article: CSRNet: Dilated convolutional neural networks to understand highly congested scenes.

What is crowd counting?

Crowd counting is a technique for counting or estimating the number of people in an image. Take a moment to analyze the following image:

img_2-2634912

Source: ShanghaiTech dataset

Can you give me an approximate number of how many people are in the box? Yes, including those present in the background. The most direct method is to manually count each person, but does that make practical sense? It's almost impossible when the crowd is so large!

The Crowd Scientists (Yes, That's a real job title!) They count the number of people in certain parts of an image and then extrapolate to arrive at an estimate. More commonly, we've had to rely on raw metrics to estimate this number for decades.

Surely there must be a better and more accurate approach.

Yes there are!

Although we still do not have algorithms that can give us the EXACT number, the majority computer vision Techniques can produce impressively accurate estimates. Let's first understand why crowd counting is important before diving into the algorithm behind it..

Why is crowd counting useful?

Let's understand the usefulness of crowd counting with an example. Imagine this: your company just finished hosting a big data science conference. During the event many different sessions were held.

You are asked to analyze and estimate the number of people who attended each session. This will help your team understand what kinds of sessions drew the largest crowds. (and which ones failed in that sense). This will shape next year's conference, So it is an important task!

linux_conference_2013_opening-1090863

There were hundreds of people at the event, Counting them manually will take days! That's where your skills as a data scientist come in.. He managed to get photos of the crowd at each shoot and create a computer vision model to do the rest!!

There are many other scenarios where crowd counting algorithms are changing the way industries work.:

  • Counting the number of people who attend a sporting event
  • Estimate how many people attended an inauguration or march (political demonstrations, perhaps)
  • Monitoring of high traffic areas
  • Assist with staffing and resource allocation.

Can you think of other use cases? Let me know in the comment section below!! We can connect and try to figure out how we can use crowd counting techniques on your stage..

Understanding of different computer vision techniques for crowd counting

In general terms, there are currently four methods we can use to count the number of people in a crowd:

1. Detection-based methods

Here, we use a moving window-like detector to identify people in an image and count how many there are. The methods used for detection require well-trained classifiers that can extract low-level characteristics. Although these methods work well for detecting faces, do not work well in crowded images, since most of the target objects are not clearly visible.

2. Regression-based methods

We were unable to extract low-level features with the above approach. Regression-based methods triumph here. First we cut out the patches from the image and then, for each patch, we extract the low-level characteristics.

3. Methods based on density estimation

First we create a density map for the objects. Later, the algorithm learns a linear mapping between the extracted features and their object density maps. We can also use random forest regression to learn nonlinear mapping.

4. CNN-based methods

Ah, good and reliable convolutional neural networks (CNN). Instead of looking at the patches of an image, we create an end-to-end regression method using CNN. This takes the whole image as input and directly outputs the people count. CNNs work great with regression or classification tasks, and they have also proven their worth in the generation of density maps.

CSRNet, a technique that we will implement in this article, implements deeper CNN to capture high-level features and generate high-quality density maps without expanding network complexity. Let's understand what CSRNet is before moving on to the coding section.

Understanding the CSRNet architecture and training method

CSRNet uses VGG-16 as an interface due to its high transfer learning capabilities. VGG output size is one fifth of original input size. CSRNet also uses dilated convolutional layers on the back.

But, What the hell are dilated convolutions? It's a fair question. Consider the following image:

screenshot-from-2019-02-01-16-49-21-2729482

The basic concept of using dilated convolutions is to enlarge the kernel without increasing the parameters. Then, if the rate of dilation is 1, we take the kernel and convert it into the whole image. While, if we increase the rate of dilation to 2, the core extends as shown in the picture above (follow the labels under each image). Can be an alternative to layer grouping.

Underlying mathematics (recommended, but optional)

I'm going to take a moment to explain how math works. Note that this is not mandatory to implement the algorithm in Python, but I recommend that you learn the underlying idea. This will come in handy when you need to adjust or modify your model..

Suppose we have an input x (m, n), a filter w (i, j) and the rate of dilation r. The output and (m, n) it will be:

screenshot-from-2019-02-01-16-56-14-3239944

We can generalize this equation using a kernel (k * k) with a rate of dilation r. The nucleus enlarges to:

([k + (k-1)*(r-1)] * [k + (k-1)*(r-1)])

So the basic truth has been generated for each image. Each person's head in a given image is blurred using a Gaussian kernel. All images are cropped at 9 patches and the size of each patch is a quarter of the original image size. With me until now?

The first 4 patches are divided into 4 rooms and the others 5 patches are cut out randomly. Finally, the mirror of each patch is taken to duplicate the training set.

That, in a nutshell, are the details of the architecture behind CSRNet. Then, we will see your training details, including evaluation metric used.

Stochastic gradient descent is used to train CSRNet as an end-to-end structure. During the training, the fixed learning rate is set to 1e-6. The loss function is taken as the Euclidean distance to measure the difference between the truth of the terrain and the estimated density map. This is represented as:

screenshot-from-2019-02-01-18-02-14-1319367

where N is the size of the training batch. The evaluation metric used in CSRNet is MAE and MSE, namely, mean absolute error and mean square error. These are given by:

screenshot-from-2019-02-01-18-03-40-8146105

Here, Ci is the estimated count:

screenshot-from-2019-02-01-18-05-22-3405494

L and W are the width of the predicted density map.

Our model will first predict the density map for a given image. The pixel value will be 0 if there is no person present. A certain preset value will be assigned if that pixel corresponds to a person. Then, calculating the total pixel values ​​corresponding to a person will give us the count of people in that image. Impressive, truth?

And now, ladies and gentlemen, It's time to finally build our own crowd counting model!!

Building your own crowd counting model

Ready with your laptop on?

We will implement CSRNet in the ShanghaiTech dataset. This contains 1198 annotated images from a combined total of 330,165 people. You can download the dataset from here.

Use the following code block to clone the CSRNet-pytorch repository. Contains all the code to create the dataset, train the model and validate the results:

git clone https://github.com/leeyeehoo/CSRNet-pytorch.git

Please install MIRACLES Y PyTorch before continuing. These are the backbone behind the code that we will use next.

Now, move the dataset to the repository you cloned earlier and unzip it. Later, we will need to create the basic truth values. the make_dataset.ipynb file is our savior. We only need to make minor changes to that notebook:

#setting the root to the Shanghai dataset you have downloaded
# change the root path as per your location of dataset
root="/home/pulkit/CSRNet-pytorch/"

Now, let's generate the basic real values ​​for the images in part_A and part_B:

Generating the density map for each image is a time step. So make a cup of coffee while the code runs.

Up to now, we have generated the basic truth values ​​for the images in part_A. We will do the same with the images part_B. But before that, Let's look at a sample image and plot your actual heat map of the ground.:

plt.imshow(Image.open(img_paths[0]))

screenshot-from-2019-02-01-19-00-11-6759532

Things are getting interesting!

gt_file = h5py.File(img_paths[0].replace('.jpg','.h5').replace('images','ground-truth'),'r')
groundtruth = np.asarray(gt_file['density'])
plt.imshow(groundtruth,cmap=CM.jet)

screenshot-from-2019-02-01-19-00-37-9563568

Let's count how many people are present in this image:

np.sum(groundtruth)

270.32568

In the same way, we will generate values ​​for part_B:

Now, we have the images, as well as their corresponding fundamental truth values. Time to train our model!

We will use the .json files available in the cloned directory. We just have to change the location of the images in the json files. To do this, open the .json file and replace current location with the location where your images are.

Please note that all this code is written in Python 2. Make the following changes if you are using any other version of Python:

  1. In model.py, change xrange on the line 18 a range
  2. Change the line 19 in model.py with: list (self.frontend.state_dict (). Items ())[i][1].data[:] = list (mod.state_dict (). items ())[i][1].data[:]
  3. In image.py, reemplace ground_truth con ground-truth

Did you make the changes? Now, open a new terminal window and type the following commands:

cd CSRNet-pytorch
python train.py part_A_train.json part_A_val.json 0 0

Again, sit down cause this will take some time. You can reduce the number of epochs in the train.py file to speed up the process. A good alternative option is to download the pre-trained weights. from here if you don't feel like waiting.

Finally, let's verify the performance of our model on invisible data. We will use the val.ipynb file to validate the results. Remember to change the path to the previously trained weights and images.

#defining the image path
img_paths = []
for path in path_sets:
    for img_path in glob.glob(os.path.join(path, '*.jpg')):
       img_paths.append(img_path)

model = CSRNet()

#defining the model
model = model.cuda()

#loading the trained weights
checkpoint = torch.load('part_A/0model_best.pth.tar')
model.load_state_dict(checkpoint['state_dict'])

Check the MAE (mean absolute error) in the test images to evaluate our model:

We have an MAE value of 75,69, which is pretty good. Now let's review the predictions in a single image:

screenshot-from-2019-02-05-19-01-15-1985138Wow, the original count was 382 and our model estimated that there was 384 people in the picture. That is a very impressive performance!!

Congratulations on building your own crowd counting model!!

Final notes

I encourage you to try this approach on different images and share your results in the comment section below.. Crowd counting has many diverse applications and is already being adopted by organizations and government agencies..

It's a useful skill to add to your portfolio. A large number of industries will be looking for data scientists who can work with crowd counting algorithms. Learn, experiment with it and give yourself the gift of deep learning!

Did you find this article useful? Feel free to leave me your suggestions and comments below, and I will be happy to communicate with you.

You should also check out the resources below to learn and explore the wonderful world of computer vision.:

Subscribe to our Newsletter

We will not send you SPAM mail. We hate it as much as you.