FaceNet15 May 2017
Today, I’ll be reviewing “FaceNet: A Unified Embedding for Face Recognition and Clustering”1, a paper by Schroff et al. from Google, published in CVPR 2015.
This paper builds off of the previously reviewed one by demonstrating FaceNet, a model which improves the accuracy on the LFW dataset by 99.63% (over DeepFace’s 97.35%) by dispensing with the 3D modeling/alignment step and simply applying end-to-end training for an embedding space. Vectors in this space represents faces and the Euclidean distance between vectors (ideally) represent how different the faces are. They key here is robustness of this embedding space: as they say,
Once this embedding has been produced, then … the tasks become straight-forward: face verification simply involves thresholding the distnace between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
Deep neural networks learn representations which, especially at the top levels, correspond well with higher-level representations that form useful features by themselves, even when they are not explicitly trained to generate an embedding. This is precisely how previous approaches to facial recognition worked: they were trained to classify the faces of $K$ different people, and generalized beyond these $K$ people by picking an intermediary layer as an embedding.
However, the authors thought this was indirect; furthermore, such an intermediary layer typically has thousands of dimensions, which is inefficient. They proposed to optimize the embedding space itself, picking a small representation dimension (128).
Definition of triplet loss
This is done by defining the triplet loss. The intuition behind the triplet loss is simple: we want to minimize the distance between two faces belonging to the same person, and maximize the distance between two which belong to different identities. So we then just directly maximize the absolute difference between the two.
Formally, let $N$ be the number of face images we have. At iteration $i$, pick an anchor face $x_i^a$; then pick a positive example $x_i^p$, which belongs to the same person as the anchor face, and a negative example $x_i^n$, belonging to a different person. Then we want to minimize the triplet loss
for some hyperparameter $\alpha$ representing the margin between the two.
Picking the triplets
The choice of triplets chosen directly impacts the model’s speed and convergence. Ideally, we pick triplets difficult for the model, picking
which they call the hard positive and hard negative, respectively.
But this is infeasible to do so for the whole training set, and in fact “it might lead to poor training, since mislabelled and poorly imaged faces would dominate the hard positives and negatives”.
They generate triplets online: for each minibatch of training, they choose the triplets by selecting the positive and negative exemplars from within a mini-batch. (They also tried to do so offline, recomputing the argmin and argmax every some number of steps, but the experiments were inconclusive.)
They also used all of the positive pairs in a minibatch, instead of just the hard positives, while only selecting the hard negatives. This was empirically shown to lead to more stable training and slightly faster convergence. Sometimes they also selected semi-hard negatives, i.e. negatives within the margin $\alpha$ which separated them, especially at the beginning of training, in order to avoid a collapsed model (i.e. $f(x) = 0$ for all $x$).
For each minibatch, they selected 40 “exemplars” (faces) per identity, for a total of 1800 exemplars.
They use two types of architectures: one based off of Zeiler&Fergus2 (Z&F) and the other based off of GoogLeNet/Inception3. Z&F results in a larger model, with 22 layers, 140 million parameters, requiring about 1.6 billion FLOPS per image; the Inception models are smaller, with up to 26M parameters and 220M FLOPS per image.
They evaluate their model with the Labelled Faces in the Wild (LFW) dataset and the YouTube Faces Dataset (YTD), both standard for facial recognition. They evaluate on a hold-out test set (1M images). They also evaluate on a dataset manually verified to have very clean labels, of three personal photo collections (12K images total). On LFW, they achieve record-breaking $99.63% \pm$ standard error using an extra face alignment step; similarly, a record-breaking $95.12\pm 0.39$ on YTD over 93.2% state-of-the-art.
There was a direct correlation between number of FLOPS and accuracy, but the smaller Inception architectures did almost as well as the larger Z&F ones while reducing the number of parameters and FLOPS significantly. Embedding dimensionality was cross-validated and picked to be 128, which was statistically insignificant. They also demonstrate robustness (invariance) to image quality, including different image sizes, occlusion, lighting, pose, and age.
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. “Facenet: A unified embedding for face recognition and clustering.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. ↩
Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European conference on computer vision. Springer International Publishing, 2014. ↩
Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. ↩