
TextStyleBrush Translates Text in Images While Emulating the Font

by Louis Bouchard, June 21st, 2021

Too Long; Didn't Read

Facebook AI's new model can translate or edit any text in an image into your language of choice. The model can ensure that the translated text follows the same font and style as the original image, and it can copy the style of a text from any picture using a single word as an example. The tool uses technology similar to deepfakes to change the words in a picture while keeping the same style as the original words. Learn more about it in the video below.


This new Facebook AI model can translate or edit any text in an image into your language of choice. Not only that, the model can ensure that the translated text follows the same font and style as the original image.

Learn more about it in the video.

Watch the video

References

►Read the full article: https://www.louisbouchard.ai/textstylebrush/

►Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal Hassner, Facebook AI (2021), "TextStyleBrush: Transfer of text aesthetics from a single example":
https://scontent.fymq3-1.fna.fbcdn.net/v/t39.8562-6/10000000_944085403038430_3779849959048683283_n.pdf?_nc_cat=108&ccb=1-3&_nc_sid=ae5e01&_nc_ohc=Jcq0m5jBvK8AX9p0hND&_nc_ht=scontent.fymq3-1.fna&oh=ab1cc3f244468ca196c76b81a299ffa1&oe=60EF2B81

►Dataset Facebook AI made:
https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset?fbclid=IwAR0pRAxhf8Vg-5H3fA0BEaRrMeD21HfoCJ-so8V0qmWK7Ub21dvy_jqgiVo

Video Transcript

00:00

Imagine you are on vacation in another country where you do not speak the language.

00:05

You want to try out a local restaurant, but their menu is in the language you don't speak.

00:09

I think this won't be too hard to imagine, as most of us have already faced this situation.

00:14

Whether it's menu items or directions, you can't understand what's written.

00:19

Well, in 2020, you would take out your phone and Google Translate what you see.

00:24

In 2021, you don't even need to open Google Translate anymore and type in what you

00:29

see one by one to translate it.

00:31

Instead, you can simply use this new model by Facebook AI to translate every text in

00:36

the image into your own language!

00:38

Of course, as you can see here, this is not the first application of this technology,

00:43

but even this is cool.

00:45

What is even cooler is that their translation tool actually uses technology similar to deep

00:50

fakes to change the words in an image following the same style as the original words!

00:55

It can copy the style of a text from any picture using a single word as an example!

01:01

Just like this...

01:02

This is amazing for photo-realistic language translation in augmented reality.

01:06

This is only the first paper trained on a new dataset they released for this task, and

01:12

it is already quite impressive!

01:14

This could be amazing for video games or movies as you will be able to translate the text

01:19

appearing on buildings, posters, signs, etc. super easily,

01:23

making the immersion even more personalized and convincing for everyone based on the chosen

01:29

language without having to manually photoshop each frame or completely remake scenes.

01:34

As you can see, it also works with handwriting, using just a single word.

01:39

Its ability to generalize from a single word example and copy its style is what makes this

01:43

new artificial intelligence model so impressive.

01:46

Indeed, it understands not only the typography and calligraphy of the text, but also the

01:50

scene in which it appears.

01:53

Whether it's on a curved poster or on different backgrounds.

01:56

Typical text-transfer models are trained in a supervised manner with one specific style

02:01

and use images with text segmentation.

02:03

Meaning that you need to know, for every pixel in the picture, whether it is text

02:08

or not, which is very costly and complicated to obtain.

02:11

Instead, they use a self-supervised training process where the style and the segmentation

02:15

of the text aren't given to the model during training.

02:18

Only the actual word content is given.

02:21

I said that they released a dataset for this model and that it was able to do that with

02:26

only one word.

02:27

This is because the model first learns a generalized way of accomplishing the task on this new

02:32

dataset with many examples during training.

02:35

This dataset contains approximately 9,000 images of text on different surfaces with

02:40

only the word annotations.
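
To make this concrete, here is a rough sketch of what word-only supervision looks like: each training example is simply an image of a word plus the string it contains, with no segmentation masks or font labels. The field names below are illustrative, not the dataset's actual schema.

```python
# Minimal sketch (not the official loader): each training sample is just a
# word-level image crop plus its transcription string. There are no pixel-level
# segmentation masks or font labels; the field names are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class WordSample:
    image_path: str   # crop of one word on its original surface
    word: str         # the only annotation: the word's content

dataset: List[WordSample] = [
    WordSample("imgur5k/crops/000001.png", "COFFEE"),
    WordSample("imgur5k/crops/000002.png", "Bonjour"),
]
# Styles (font, color, background) are never labeled; the model must infer them.
```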

02:42

Then, it uses the new word from the input image to learn its style in what we call a

02:47

"one-shot-transfer" manner.

02:49

This means that from only one image example containing the word to be changed, it will

02:54

automatically adjust the model to fit this exact style for any other words.
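
Here is a minimal sketch of what that one-shot transfer could look like at inference time, using hypothetical stand-in modules for the style encoder, content encoder, and generator; the real model is a convolutional StyleGAN2-based network, not these toy layers.

```python
# A minimal sketch of one-shot style transfer at inference time, assuming
# hypothetical pre-trained modules `style_encoder`, `content_encoder`, and
# `generator` (stand-ins for the trained networks, not the released code).
import torch
import torch.nn as nn

# Stand-in modules with made-up sizes, just to show the data flow.
style_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 256, 512))
content_encoder = nn.Embedding(num_embeddings=128, embedding_dim=512)  # per-character
generator = nn.Linear(512 * 2, 3 * 64 * 256)  # real model: StyleGAN2-like CNN

example_word_image = torch.rand(1, 3, 64, 256)            # one word photographed "in the wild"
new_text = torch.tensor([[ord(c) for c in "Bienvenue"]])  # word to render in that style

style = style_encoder(example_word_image)        # captures font, color, background
content = content_encoder(new_text).mean(dim=1)  # captures what should be written
new_word_image = generator(torch.cat([style, content], dim=-1)).view(1, 3, 64, 256)
```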

02:59

As you know, the goal here is to disentangle the content of a text appearing on an image

03:04

and then to use this text's style on new text and put it back on the image.

03:09

This process of disentangling the text from the actual image is learned in a self-supervised

03:14

manner, as we will see in a minute.

03:16

In short, we take an image as input and create a new image with only the text translated.

03:22

Doesn't it feel similar to the task of taking a picture of your face and changing only specific

03:26

features of it to match another style, like the video I just covered did on hairstyles?

03:31

If you remember, I said that it is very similar to how deepfakes work.

03:36

Which means: what could be better for doing this than StyleGAN2, one of the best models for generating

03:42

images from another image?

03:44

Now, let's get into how it can achieve this, which means the training process.

03:49

They train this model to measure its performance on these unlabeled images using a pre-trained

03:55

typeface classification network and a pre-trained text recognition network.

04:00

This is why it is learning in a self-supervised manner because it doesn't have access to labels

04:05

or ground truth about the input images directly.

04:08

This, coupled with a realism measure calculated on the generated image with the new text compared

04:14

to the input image, allows the model to be trained without the kind of supervision

04:18

where we tell it exactly what is in the image, aiming for photo-realistic and accurate text

04:21

results.

04:22

Both these networks will tell us how close the generated text is to what it is supposed

04:27

to be by first detecting the text in the image, which will be our ground truth, and then comparing

04:33

the new text with what we wanted to write and its font with the original image's text

04:39

font.

04:40

Using these two already-trained networks allows the StyleGAN-based image generator to be trained

04:45

on images without any prior labels.
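
Roughly, the training signals described above could be combined as in the sketch below. The names `recognizer`, `typeface_classifier`, and `discriminator` stand for the two pre-trained networks and the adversarial critic, and the loss weights are illustrative, not taken from the paper.

```python
# A rough sketch of how the training signals could be combined; names and
# weights are illustrative, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def training_loss(generated, source_style_image, target_word_ids,
                  recognizer, typeface_classifier, discriminator,
                  w_text=1.0, w_font=1.0, w_real=1.0):
    # 1) Content: the generated image must read as the target word.
    text_logits = recognizer(generated)                      # (batch, chars, vocab)
    text_loss = F.cross_entropy(text_logits.flatten(0, 1),
                                target_word_ids.flatten())
    # 2) Style: its typeface features should match the source word's.
    font_loss = F.mse_loss(typeface_classifier(generated),
                           typeface_classifier(source_style_image))
    # 3) Realism: an adversarial critic pushes the output to look photo-real.
    real_loss = F.softplus(-discriminator(generated)).mean()  # non-saturating GAN loss
    return w_text * text_loss + w_font * font_loss + w_real * real_loss
```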

04:48

Then, the model can be used at inference time, or in other words, in the real world, on any

04:53

image without the two other networks we discussed, only sending the image through the trained

04:59

StyleGAN-based network which generates the new image with the modified text.

05:05

It will achieve its translation by understanding the style of the text and the content separately.

05:10

Where the style is from the actual image, and the content is the identified string and

05:15

the string to be generated.

05:16

Here, the "understanding" process I just referred to is an encoder for each, here shown in green,

05:23

compressing the information into a general representation that should accurately capture what we really

05:28

want from this input.

05:30

Then, both of these encoded representations are sent into the StyleGAN-based image generator,

05:35

shown in blue, at different steps according to the details needed.

05:40

Meaning that the content and the style are both sent in at the first step, since the content is what needs to be translated.

05:44

Then, we will force the style in the generated image by iteratively feeding it into the network

05:50

at multiple steps with optimal proportions learned during training.

05:55

This allows the generator to control low to high-resolution details of the text appearance,

06:01

instead of being limited to low-resolution details if we only sent this style information

06:06

as inputs, as is typically done.
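
Here is a simplified sketch of that idea: the style vector is re-injected at every resolution block of the generator through a per-channel scale and shift, rather than being fed in only once at the input. The layer sizes are arbitrary, and this is not the paper's actual architecture.

```python
# Simplified sketch: the style vector modulates every resolution block of the
# generator instead of being provided only once at the input. Sizes are arbitrary.
import torch
import torch.nn as nn

class StyledBlock(nn.Module):
    def __init__(self, in_ch, out_ch, style_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.to_scale_shift = nn.Linear(style_dim, out_ch * 2)  # style -> modulation
        self.up = nn.Upsample(scale_factor=2)

    def forward(self, x, style):
        x = self.conv(self.up(x))
        scale, shift = self.to_scale_shift(style).chunk(2, dim=1)
        return x * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

class TinyStyledGenerator(nn.Module):
    def __init__(self, style_dim=512):
        super().__init__()
        self.blocks = nn.ModuleList([StyledBlock(256, 128, style_dim),
                                     StyledBlock(128, 64, style_dim),
                                     StyledBlock(64, 3, style_dim)])

    def forward(self, content_feat, style):
        x = content_feat              # low-resolution map encoding the new word
        for block in self.blocks:     # style re-applied at every scale,
            x = block(x, style)       # from coarse layout up to fine texture
        return x

gen = TinyStyledGenerator()
out = gen(torch.rand(1, 256, 8, 32), torch.rand(1, 512))  # -> (1, 3, 64, 256)
```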

06:08

Of course, there are more technical details in order to adapt everything and make it work,

06:13

but I will let you read their great paper linked in the description below if you would

06:17

like to learn more about the more technical side of how they achieved this.

06:22

I also wanted to mention that they openly shared some issues with complex scenes where

06:27

illumination or color changes caused problems, hurting realism, just like in other GAN-based

06:33

applications I previously covered, such as transferring your face into cartoons or changing the background

06:38

of an image.

06:39

It's crucial and super interesting to see the limitations as they will help to accelerate

06:44

research.

06:45

To end on a more positive note, this is only the first paper attacking this complex task

06:50

with this level of generalization, and it is already extremely impressive.

06:54

I cannot wait to see the next versions!

06:57

As always, thank you for watching, and many thanks to Rebekah Hoogenboom for the support

07:02

on Patreon!