A Team of researchers from NEC Laboratories, Palo Alto Research Center, Amazon, PARC and Stanford University are working together to solve the problem of realistically altering scene text in videos. Their main application behind this research is to create personalized content for marketing and promotional purposes. For example, replace a word on a store sign with a personalized name or message, as shown in the picture below.
Technically, several attempts have been made to automate text replacement in still images based on principles of deep style transfer. The research group is including this progress and their research to tackle the problem of text replacement in videos. Videotext replacement is not an easy task. It must meet the challenges faced in still images while also accounting for time and effects such as lighting changes, blur caused by camera motion or object movement.
One approach to solve video-test replacement could be to train an image-based text style transfer module on individual frames while incorporating temporal consistency constraints in the network loss. But with this approach, the network performing text style transfer will be additionally burdened with handling geometric and motion-induced effects encountered in the video.
Therefore, the research group took a very different approach. First, they extract text regions of interest (ROI) and train a Spatio-temporal transformer network (STTN) to Frontalize the ROIs so that they will be temporally consistent. Next, they scan the video and select a reference frame with high text quality which was measured in terms of text sharpness, size, and geometry.
The research team performed still image text replacement on the given frame using SRNet, a state-of-art method trained on video frames. Next, the new text is transferred onto other frames with a novel module called TPM (text propagation module) that considers changes in lighting and blur effects. As input, TPM takes the reference and current frame from the original video. It concludes with an image transformation between the pair with application towards the altered reference frame generated by SRNet. The important part is that the TPM takes the temporal consistency of an image into account when learning pairwise transformations.
The researchers named their framework associated with the above research approach as STRIVE (Scene Text Replacement In VidEos), and it is shown in the picture below.
Using the proposed approach, the researchers were able to show results on synthetic and challenging real videos with realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives. They also introduce new synthetic and real-world datasets with paired text objects. According to the research group, this may be the first attempt at deep video text replacement.