Face detection and recognition capable of beating humans

This article was published as part of the Data Science Blogathon

Creating a facial recognition system is considered a fairly easy task in the field of computer vision, but it is extremely difficult to build a pipeline that can predict faces against complex backgrounds, with multiple faces, different lighting conditions and different image scales. This blog describes how we created a model that can outperform humans in some cases. Our dataset consists of 3 classes (I cannot share the data due to confidentiality issues, but I'll show you what it looks like). Class 1 is Jesse Eisenberg (actor), class 2 is Mila Kunis (actress) and class 0 is anyone else. Our training data contains 80 images and our test data more than 1800 images.

[Image: test images and the faces extracted from them]

The picture above shows our test data and the faces extracted from those images; this data is extremely complex because of the multiple faces, complex backgrounds and many pixelated images. Our training data, in contrast, is extremely clean, as shown in the picture below. The distributions of the test and train data differ considerably, so we need a technique that generalizes well regardless of how few samples are available and how different the train and test data are.

[Image: sample training images]

The technique we are going to use for this task is, first, to generate face embeddings with a deep learning model and then to apply a simple classifier on top of them.

Using FACENET

To really push the limits of face detection, we will look at some cutting-edge methods. Modern face extraction techniques make use of deep convolutional networks. As we all know, the features created by modern deep learning frameworks are generally better than most hand-crafted features. We evaluated 4 deep learning models, namely FaceNet (Google), DeepFace (Facebook), VGGFace (Oxford) and OpenFace (CMU). Of these 4 models, FaceNet gave us the best result. In general, FaceNet offers better results than the other 3 models.

FaceNet is considered a state-of-the-art model developed by Google. It is based on the Inception architecture; explaining the full architecture of FaceNet is beyond the scope of this blog, but it is shown below. FaceNet uses Inception modules to reduce the number of trainable parameters. The model takes RGB images of 160 × 160 and generates an embedding of size 128 for each face. For this implementation, we will need a couple of additional functions. But before sending a face image to FaceNet, we need to extract the faces from the images.

[Image: FaceNet architecture]
import dlib
import cv2
import numpy as np

detector = dlib.cnn_face_detection_model_v1("../input/pretrained-models-faces/mmod_human_face_detector.dat")

def rect_to_bb(rect):
    # take a bounding box predicted by dlib and convert it
    # to the format (x, y, w, h) as we would normally do
    # with OpenCV
    x = rect.rect.left()
    y = rect.rect.top()
    w = rect.rect.right() - x
    h = rect.rect.bottom() - y

    # return a tuple of (x, y, w, h)
    return (x, y, w, h)

def dlib_corrected(data, data_type="train"):
    #We set the size of the image
    dim = (160, 160)
    data_images=[]
    #If we are processing training data we need to keep track of the labels
    if data_type=='train':
        data_labels=[]
    #Loop over all images
    for cnt in range(0,len(data)):
        image = data['img'][cnt]
        #The large images are resized
        if image.shape[0] > 1000 and image.shape[1] > 1000:
            image = cv2.resize(image, (1000,1000), interpolation = cv2.INTER_AREA)
        #The image is converted to grey-scales
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        #Detect the faces
        rects = detector(gray, 1)
        sub_images_data = []
        #Loop over all faces in the image
        for (i, rect) in enumerate(rects):
            #Convert the bounding box to (x, y, w, h)
            (x, y, w, h) = rect_to_bb(rect)
            #Here we copy and crop the face out of the image
            clone = image.copy()
            if(x>=0 and y>=0 and w>=0 and h>=0):
                crop_img = clone[y:y+h, x:x+w]
            else:
                crop_img = clone.copy()
            #We resize the face to the correct size
            rgbImg = cv2.resize(crop_img, dim, interpolation = cv2.INTER_AREA)
            #For training data we keep a single face per image; for test data we keep all detected faces
            if data_type == 'train':
                sub_images_data = rgbImg.copy()
            else:
                sub_images_data.append(rgbImg)
        #If no face is detected in the image we will add a NaN
        if(len(rects)==0):
            if data_type == 'train':
                sub_images_data = np.empty(dim + (3,))
                sub_images_data[:] = np.nan
            if data_type=='test':
                nan_images_data = np.empty(dim + (3,))
                nan_images_data[:] = np.nan
                sub_images_data.append(nan_images_data)
        #Here we add the image(s) to the list we will return
        data_images.append(sub_images_data)
        #And add the label to the list
        if data_type=='train':
            data_labels.append(data['class'][cnt])
    #Lastly we need to return the correct number of arrays
    if data_type=='train':
        return np.array(data_images), np.array(data_labels)
    else:
        return np.array(data_images)

Using DLIB

DLIB is a widely used library for detecting faces. In our experiments, we found that dlib produces better results than HAAR cascades, although we noted that some improvements could still be made:

  • If the boundaries of the face rectangle fall outside the image, we take the whole image instead of the face cutout. It is implemented as follows:
    • if(x>=0 and y>=0 and w>=0 and h>=0):
      • crop_img = clone[y:y+h, x:x+w]
    • else:
      • crop_img = clone.copy()
  • For test images, instead of saving one face per image, we save all detected faces for prediction.
  • Instead of a HOG-based detector, we can use a detector based on a CNN. Since these enhancements are designed to optimize use with FaceNet, we define a new, corrected face detection function (see the detector sketch below).
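A minimal sketch of the two dlib detector options (the HOG detector needs no weights file; the CNN detector uses the same mmod_human_face_detector.dat weights loaded in the code block above):

import dlib

# Classic HOG + linear SVM detector (fast, CPU-friendly)
hog_detector = dlib.get_frontal_face_detector()

# CNN-based detector (more robust, needs the pre-trained weights file)
cnn_detector = dlib.cnn_face_detection_model_v1(
    "../input/pretrained-models-faces/mmod_human_face_detector.dat")

# Both are called the same way, e.g. rects = cnn_detector(gray, 1).
# The CNN detector returns mmod_rectangle objects, which is why
# rect_to_bb above reads rect.rect instead of rect directly.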

The dlib_corrected code block above extracts the faces from the images; many images contain several faces, so we must put all those faces in a list. To extract the faces we use dlib.cnn_face_detection_model_v1; note that you should not feed it images of very large dimensions, otherwise you will get a dlib memory error. If an image has no face, NaN is stored in its place. This pre-processing is only required for the test data, since the training data is already clean, as can be seen in the pictures above. Once we are done getting the face embeddings from the training data, we get the face embeddings for the test data as well, but first the pre-processing above has to be applied to extract the faces from the test images. Let's now feed these face images to FaceNet.
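Before that, here is a minimal usage sketch of the extraction step described above (train and test are assumed to be pandas DataFrames with an 'img' column holding the decoded images and, for the training set, a 'class' column; the names newTrainX, svmtrainY and corrected_test_X match the variables used further below):

# One face per training image, together with its label
newTrainX, svmtrainY = dlib_corrected(train, data_type='train')
# Every detected face per test image (a list of faces per image)
corrected_test_X = dlib_corrected(test, data_type='test')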

from numpy import expand_dims
from keras.models import load_model

def get_embedding(model, face_pixels):
    # scale pixel values
    face_pixels = face_pixels.astype('float32')
    # standardize pixel values across channels (global)
    mean, std = face_pixels.mean(), face_pixels.std()
    face_pixels = (face_pixels - mean) / std
    # transform face into one sample
    samples = expand_dims(face_pixels, axis=0)
    # make prediction to get embedding
    yhat = model.predict(samples)
    return yhat[0]

model = load_model('../input/pretrained-models-faces/facenet_keras.h5')

svmtrainX = []
for index, face_pixels in enumerate(newTrainX):
    embedding = get_embedding(model, face_pixels)
    svmtrainX.append(embedding)
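Optionally, the list of embeddings can be converted to a NumPy array before fitting; svmtrainY below is the label array returned by dlib_corrected for the training data (see the usage sketch above):

svmtrainX = np.asarray(svmtrainX)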

After generating the embeddings for training and testing, we will use an SVM for classification. Why SVM, you may ask? From long experience, I have found that deep-learning features combined with an SVM can outperform any other method, even end-to-end deep learning methods, when the amount of data is small.

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

linear_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.01, probability=True))
linear_model.fit(svmtrainX, svmtrainY)
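The C and gamma values used above can be found with a small grid search over the pipeline (a minimal sketch; the parameter grid and cv=5 are illustrative assumptions, not the exact search we ran):

from sklearn.model_selection import GridSearchCV

# Step names come from make_pipeline: 'standardscaler' and 'svc'
param_grid = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__gamma': [0.001, 0.01, 0.1],
}
search = GridSearchCV(linear_model, param_grid, cv=5)
search.fit(svmtrainX, svmtrainY)
print(search.best_params_, search.best_score_)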

Once the SVM is trained, it is time to do some testing, but our test data has multiple faces per image stored in a list. So whenever either Jesse or Mila appears in a picture we ignore class 0, and when both Jesse and Mila are present in a picture we choose the one predicted with the higher probability.

predictions = []
for i in corrected_test_X:
    #Images with exactly one detected face: predict its class directly
    if(len(i)==1):
        embedding = get_embedding(model, i[0])
        tmp_output = linear_model.predict([embedding])
        predictions.append(tmp_output[0])
    else:
        #Images with several faces: predict every face and keep its probability
        tmp_sub_pred = []
        tmp_sub_prob = []
        for j in i:
            j = j.astype(int)
            embedding = get_embedding(model, j)
            tmp_output = linear_model.predict([embedding])
            tmp_sub_pred.append(tmp_output[0])
            tmp_output_prob = linear_model.predict_log_proba([embedding])
            tmp_sub_prob.append(np.max(tmp_output_prob[0]))

        #If both Jesse (1) and Mila (2) are predicted, keep the more confident one
        if 1 in tmp_sub_pred and 2 in tmp_sub_pred:
            index_1 = np.where(np.array(tmp_sub_pred)==1)[0][0]
            index_2 = np.where(np.array(tmp_sub_pred)==2)[0][0]
            if(tmp_sub_prob[index_1] > tmp_sub_prob[index_2]):
                predictions.append(1)
            else:
                predictions.append(2)
        #If neither Jesse nor Mila is predicted, the image is class 0
        elif 1 not in tmp_sub_pred and 2 not in tmp_sub_pred:
            predictions.append(0)
        elif 1 in tmp_sub_pred and 2 not in tmp_sub_pred:
            predictions.append(1)
        elif 1 not in tmp_sub_pred and 2 in tmp_sub_pred:
            predictions.append(2)
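To quantify the result, the predictions can be compared against the ground-truth labels of the test images (a minimal sketch; testY is a hypothetical array with the true class of every test image and is not defined in the code above):

from sklearn.metrics import accuracy_score

# testY: hypothetical ground-truth labels (0, 1 or 2), one per test image
print("Accuracy:", accuracy_score(testY, predictions))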

DISCUSSION

A few final remarks: this is a very small dataset, so the results can change dramatically even when adding or removing a few images. In our test we found that it fooled us several times: there were around 20 images in the test set that we labelled incorrectly ourselves but that were correctly predicted by our model. We confirmed the model's predictions by searching for those images on Google.

Deep neural networks can extract more meaningful features than classical machine learning models. However, the downside of these large networks is their need for a large amount of data. We dealt with this problem by using a pre-trained model, one trained on a much larger dataset so that it retains knowledge of how to encode face images, which we then used for our purposes in this challenge. In addition, fine tuning of the SVM really helped us push accuracy beyond 95%.

The media shown in this article is not the property of DataPeaker and is used at the author's discretion.
