Apple ML Research: Private On-Device Machine Learning to Recognize People in Photos

This research summary is based on the Apple ML team's article 'Recognizing People in Photos Through Private On-Device Machine Learning'.


People use Photos (on iOS, iPadOS, and macOS) to browse, search, and relive life's memories with their friends and family. Photos curates and organizes photographs, Live Photos, and videos using several machine learning algorithms that run privately on the device. A vital component of this goal is an algorithm that recognizes people based on their appearance.

Photos relies on person recognition in a variety of ways. A user can scroll to an image, tap the circle representing the person recognized in that image, and then pivot to browsing their library for photographs that include that person, as shown in Figure 1A. A user can go straight to the People Album, as shown in Figure 1B, to look through photographs and confirm that the right person is tagged in them. As illustrated in Figure 1C, a user may then manually add names to people in their images and find someone by typing that person's name into the search field. Photos also uses recognition data to build a private, on-device knowledge graph that identifies unique patterns in a user's library, such as important groups of people, frequent places, past trips, events, the last time a user took a photo of a specific person, and more. Memories draws on themes built around significant people in a user's life, such as "Together," as illustrated in Figure 1D.


Finding Friends and Family in Pictures

The process of recognizing people in a library is divided into two phases. The first phase gradually assembles a gallery of known people as the library grows. The second phase either attributes a new observation of a person to someone already in the gallery or reports the observation as unknown. The algorithms in both phases operate on feature vectors, also known as embeddings, that represent an observation of a person.

People's faces and upper bodies in a given image are identified first. When a subject is looking away from the camera, their face is usually obscured or simply not visible. To handle these cases, the upper bodies of people in the image are also considered, since they typically show consistent characteristics, such as clothing, within a given context. These consistent features can serve as significant cues for identifying a person across photographs taken just a few minutes apart.

A deep neural network is trained that accepts a whole image as input and outputs bounding boxes for the faces and upper bodies that are detected. Then, using a matching technique that considers bounding box area and position, as well as the intersection of face and upper body regions, face bounding boxes are associated with their respective upper bodies. The image’s face and upper body crops are fed into two different deep neural networks, each responsible for extracting the feature vectors, or embeddings, that represent them.
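The association step above can be sketched as a greedy overlap-based matcher. The box format and the exact matching rule below are illustrative assumptions; the article only states that bounding box area, position, and the intersection of face and upper-body regions are considered.

```python
# Pair each detected face box with the upper-body box that overlaps it most.
# Boxes are (x1, y1, x2, y2) tuples; this greedy rule is an assumption.

def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def match_faces_to_bodies(faces, bodies):
    """Greedily associate each face with the unclaimed body box it overlaps most."""
    pairs = {}
    taken = set()
    for fi, face in enumerate(faces):
        best, best_overlap = None, 0
        for bi, body in enumerate(bodies):
            if bi in taken:
                continue
            overlap = intersection_area(face, body)
            if overlap > best_overlap:
                best, best_overlap = bi, overlap
        if best is not None:
            pairs[fi] = best
            taken.add(best)
    return pairs
```

A face with no overlapping upper-body box is simply left unpaired, mirroring the case where only one of the two regions is visible.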

Building the Gallery

A gallery in a user's Photos library is a collection of frequently appearing people. To build the gallery without supervision, Photos uses clustering algorithms to construct groups, or clusters, of the face and upper-body feature vectors that correspond to the people found in the library. The clustering technique combines the face and upper-body embeddings of each observation to form these groups. This step is conservative because once it merges two observations, the association is permanent. The algorithm is tuned so that each first-pass cluster only groups very close matches, resulting in high precision but a large number of small clusters. As examples are added, each cluster is represented by the running average of its embeddings.
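The conservative clustering with running-average cluster representatives can be sketched as follows. The tight distance threshold and the greedy assignment strategy are illustrative assumptions, not Apple's actual algorithm.

```python
# Minimal sketch of conservative clustering: an observation joins a cluster
# only if it is very close to that cluster's running-average embedding;
# otherwise it starts a new cluster. Threshold value is an assumption.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

class Gallery:
    def __init__(self, threshold=0.2):
        self.threshold = threshold  # tight threshold -> high precision
        self.centroids = []         # running-average embedding per cluster
        self.counts = []

    def add(self, embedding):
        """Assign to the nearest cluster if very close, else start a new one."""
        best, best_d = None, self.threshold
        for i, c in enumerate(self.centroids):
            d = cosine_distance(embedding, c)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Update the running average of the matched cluster.
        n = self.counts[best]
        self.centroids[best] = [(c * n + e) / (n + 1)
                                for c, e in zip(self.centroids[best], embedding)]
        self.counts[best] = n + 1
        return best
```

Because merges are permanent, erring toward many small, precise clusters is safer than risking merging two different people.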

Assigning Identity

Matching a new observation against this gallery is the second phase of the person recognition problem. Inspired by "Learning Feature Representations with K-means," an encoding is chosen that is more expressive than the naive one-hot encoding of nearest-neighbor classification.
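One well-known encoding from that paper is the "triangle" soft assignment, where each centroid contributes a nonnegative activation rather than a single winner taking all. The sketch below shows that idea applied to gallery centroids; it is an illustration of the cited technique, not Apple's exact encoder.

```python
# "Triangle" soft encoding from Coates & Ng's "Learning Feature
# Representations with K-means": each centroid k gets the activation
# max(0, mean_distance - distance_k), so several nearby clusters can
# contribute instead of a brittle one-hot nearest-neighbor code.
import math

def soft_encode(embedding, centroids):
    """Return max(0, mu - d_k) per centroid, where mu is the mean distance."""
    dists = [math.dist(embedding, c) for c in centroids]
    mu = sum(dists) / len(dists)
    return [max(0.0, mu - d) for d in dists]
```

Centroids farther than the average distance are zeroed out, while closer ones receive graded support, which makes the downstream decision more robust than picking a single nearest neighbor.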

Filtering Unclear Faces

During overnight clustering, the processing pipeline outlined so far would assign every computed face and upper-body embedding to a cluster. However, not every detection corresponds to a real face or upper body, and not every face or upper body can be accurately represented by a neural network running on a mobile device. Over time, detections that are false positives or out-of-distribution would accumulate in the gallery, lowering recognition accuracy. An essential part of the processing pipeline is therefore to filter out observations whose face and upper-body embeddings are not well represented.
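A simple gate over observation quality illustrates the idea. The specific criteria below, a detector confidence floor and an expected embedding-norm range, are assumptions for illustration; the article does not detail how Photos scores representation quality.

```python
# Illustrative filter for unreliable observations before clustering.
# The confidence threshold and norm range are assumed values, standing
# in for whatever quality signals the real pipeline uses.

def keep_observation(confidence, embedding,
                     min_conf=0.8, norm_range=(0.5, 2.0)):
    """Keep only confident detections whose embedding norm looks in-distribution."""
    norm = sum(x * x for x in embedding) ** 0.5
    return confidence >= min_conf and norm_range[0] <= norm <= norm_range[1]
```

Observations that fail the gate are simply excluded from clustering, so a blurry or spurious detection cannot pollute a cluster's running average.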

Data Augmentation

Data augmentation can also yield significant improvements in model accuracy. To boost generalization, the input image is enriched during training with a random combination of several transformations. These transformations include pixel-level modifications such as color jitter or grayscale conversion, structural changes such as left-right flipping or distortion, Gaussian blur, random compression artifacts, and cutout regularization.
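The "random combination of transformations" pattern can be sketched with a tiny compose function. The toy image format (a nested list of RGB tuples) and the 50% selection probability are assumptions purely for illustration; real pipelines would use an image library.

```python
# Sketch of randomly composing augmentations. Two of the transformations
# named in the article (grayscale conversion, left-right flipping) are
# implemented on a toy image: a list of rows of (r, g, b) tuples.
import random

def grayscale(img):
    """Replace each pixel with its mean intensity (integer grayscale)."""
    return [[(sum(p) // 3,) * 3 for p in row] for row in img]

def hflip(img):
    """Mirror the image left-right."""
    return [row[::-1] for row in img]

def random_augment(img, transforms=(grayscale, hflip)):
    """Apply a random subset of the transforms, each with probability 0.5."""
    for t in transforms:
        if random.random() < 0.5:
            img = t(img)
    return img
```

Each training example thus sees a different random subset of transformations, which is what drives the generalization benefit.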

The Network Design

The fundamental challenge in designing the architecture is achieving the highest possible accuracy while running efficiently on-device with low latency and a small memory footprint. Every stage of the network involves trade-offs that require experimentation to balance accuracy against computational cost.

Training the Model

Training aims to create an embedding on the unit hypersphere that increases intra-class compactness and inter-class discrepancy. Before the network is trained, the embeddings are randomly scattered around the hypersphere, as shown in the lower section of Figure 6. As training progresses, embeddings representing the same person's face move closer together, while embeddings representing different people's faces move further apart. This approach is comparable to that described in ArcFace, a state-of-the-art face recognition method.
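The ArcFace idea can be illustrated with a single logit computation: embeddings and class weights are L2-normalized onto the unit hypersphere, and the target class has an additive angular margin applied before scaling. The scale and margin values below are the ones commonly published for ArcFace, used here as assumptions; this is a conceptual sketch, not Apple's training code.

```python
# Toy ArcFace-style logit: s * cos(theta + m) for the target class,
# s * cos(theta) otherwise, where theta is the angle between the
# normalized embedding and the normalized class weight vector.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def arcface_logit(embedding, class_weight, is_target, s=64.0, m=0.5):
    """Cosine logit with an additive angular margin on the target class."""
    e = l2_normalize(embedding)
    w = l2_normalize(class_weight)
    cos_t = sum(a * b for a, b in zip(e, w))
    cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding
    theta = math.acos(cos_t)
    if is_target:
        theta += m  # margin makes the target class harder to satisfy
    return s * math.cos(theta)
```

Penalizing the target angle during training forces same-person embeddings to cluster more tightly than the decision boundary strictly requires, which is what produces the compact, well-separated classes on the hypersphere.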

Performance on-Device

The end-to-end process runs entirely locally on the user's device, keeping the recognition processing private. On-device performance is therefore critical.


Annu is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kanpur. She is a coding enthusiast and has a keen interest in the scope of application of mathematics in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.
