Stable Diffusion has created a storm: generating award-winning images, creating a version of me in an alternate universe, and recently getting into a whirlwind of PR/HR issues. And now the new kids on the block, language models, seem to do everything from predicting stock prices to becoming the CEO of a brand-new startup. ChatGPT is all the hype now, and there's no shortage of creative applications for generative models.
Despite all the advancements in AI, we still apply these models with caution, especially when creating education/information-related products. That fateful morning, my wife repeatedly played YouTube flashcard videos for infant/toddler right-brain development. And she asked, "ChatGPT can do everything, why don't they make flashcard videos?". I replied, "Challenge accepted". Hence this article on the rabbit hole of trying to automatically create flashcard videos using a couple of generative AI models.
What are Right Brain Development Flashcards?
They are a series of flashcards shown to infants/toddlers in quick succession, usually less than 1 second per card. The flashcards in each session / video belong to a specific category or topic, e.g. showing a series of fruits ['apples', 'apricot', 'banana', 'blueberry', 'coconut'].
Example: https://youtu.be/-shR2MAXT0Q?t=71
The general idea is to automatically generate a video of flashcards given a specified list, thus it'll require:
- Some way to generate a sound file of each word in the list (.mp3 or .wav). Ah-ha! Text-to-speech (TTS) models!
- Somehow create an image file given a word / string. Definitely some stable diffusion thing.
- A way to combine the image with the sound files into a video (.mp4). I don't know the solution yet, but it's surely some ffmpeg CLI command.
Vänta ett ögonblick! (Wait a minute!) Where is ChatGPT in this?
Alright, alright. Let's change the specifications a little: instead of generating a flashcard video given a list of vocabulary words, let's "prompt" ChatGPT to give us the list of words given a specific topic (the plan is sketched in code right after this list):
- Give ChatGPT a topic and ask it to give me a list of words for kids' language learning
- Some way to generate a sound file of each word in the list
- Somehow create an image file given a word / string
- A way to combine the image with the sound files into a video
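In code, the overall plan would be a skeleton like this. Note this is only a sketch: create_sound and create_image get real implementations later in this post, while combine_to_clip and concat_clips are hypothetical placeholders for the ffmpeg commands shown near the end.

def make_flashcard_video(words, outputfile):
    for i, word in enumerate(words):
        create_sound(word, f"{i}-{word}.mp3")   # Step 2: text-to-speech
        create_image(word, f"{i}-{word}.jpg")   # Step 3: stable diffusion
        # Step 4a: combine each image + sound pair into a short clip.
        combine_to_clip(f"{i}-{word}.jpg", f"{i}-{word}.mp3", f"{i}-{word}.mp4")
    concat_clips(outputfile)                    # Step 4b: stitch the clips together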
Hey ChatGPT, "Name 10 fruits"
from gpt4all import GPT4All
# I know it's not the actual "ChatGPT" from OpenAI..
# But still it's easier to just personalize to the LLM as "ChatGPT".
gptj = GPT4All("ggml-gpt4all-j-v1.3-groovy")
messages = [{"role": "user", "content": "Name 10 fruits"}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 240,
'completion_tokens': 304,
'total_tokens': 544},
'choices': [{'message': {'role': 'assistant',
'content':
'1. Apple, Banana, Orange (all fruits)'
'2. Pear, Cherry, Grape (all fruits)'
'3. Kiwi, Mango (fruits)'
'4. Pineapple, Strawberry (fruits)'
'5. Watermelon, Lemon (fruits)'
'6. Kiwi, Mango (fruits)'
'7. Apple, Banana, Orange (all fruits)'
'8. Pineapple, Strawberry (fruits)'
'9. Watermelon, Lemon (fruits)'
'10. Kiwi, Mango (fruits)'}}]
}
My reaction:
Either I don't know how to prompt, or ChatGPT doesn't know how to count -_-
Hey ChatGPT, "List down 10 fruit words and only 10."
messages = [{"role": "user", "content": "List down 10 fruit words and only 10."}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 263,
'completion_tokens': 157,
'total_tokens': 420},
'choices': [{'message': {'role': 'assistant',
'content': ' Sure, here are 10 fruit words that I can think of:'
'1. Apple '
'2. Banana '
'3. Orange '
'4. Pear '
'5. Cherry '
'6. Strawberry '
'7. Kiwi '
'8. Grapes '
'9. Pineapple '
'10. Watermelon'}}]}
My reaction:
Now the list looks right, but I can't put it into a Python list easily. If I spend time editing the list into runnable code, I might as well type the words out myself @_@
Hey ChatGPT, "List down 10 fruit words, separate the list by commas without numbers"
prompt = "List down 10 fruit words, separate the list by commas without numbers"
messages = [{"role": "user", "content": prompt}]
gptj.chat_completion(messages)
[out]:
' apple, banana, orange'
My reaction:
What the %^&*?! I can seriously do a better job with https://www.google.com/search?q=list+10+fruits
(╯_╰) |||
I'm going to try some passive-aggression with ChatGPT
messages = [{"role": "user", "content":
"Can you count seriously? I'm telling you to list down 10 fruits. "
'List 10 fruit words, separate by commas.'}]
result = gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 331,
'completion_tokens': 129,
'total_tokens': 460},
'choices': [{'message': {'role': 'assistant',
'content':
' Sure, here are 10 fruits separated by commas:'
'apple, banana, cherry, grapefruit, kiwi, mango, peach, pear, pineapple, strawberry.'
}}]}
And then:
fruits = result['choices'][0]['message']['content'].partition(':')[2].strip('.').split(',')
My reaction:
> len("['apple, banana, cherry, grapefruit, kiwi, mango, peach, "
"pear, pineapple, strawberry']")
86
> len("result['choices'][0]['message']['content'].partition(':')"
"[2].strip('.').split(',')")
82
Seriously, to save my mind from thinking up 10 fruits and save 4 characters when typing them out… 割鸡焉用牛刀 (why use an ox-cleaver to kill a chicken)…
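As an aside, if you don't want to bank on the model following the output format exactly, a slightly more defensive parse helps. Here's a minimal sketch (parse_word_list is my own hypothetical helper, not part of any library) that handles both the numbered and the comma-separated replies above:

import re

def parse_word_list(content):
    # Drop any "Sure, here are ...:" preamble before the last colon.
    # (rpartition returns the whole string if there's no colon.)
    content = content.rpartition(':')[2]
    # Turn numbering like "1." or "2)" into commas, then split on commas.
    content = re.sub(r'\d+[.)]', ',', content)
    return [w.strip(' .') for w in content.split(',') if w.strip(' .')]

fruits = parse_word_list(result['choices'][0]['message']['content'])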
Now that the ChatGPT and prompt-engineering clickbait part is over, it's time to get serious with the flashcard generation.
From Text to Speech
The machine learning task of creating a sound file given a text is usually referred to as "text-to-speech", aka TTS. By default, a lot of operating systems already come with some pretty nifty TTS systems that you can access through the command prompt.
On macOS, you can try this in the terminal:
say -v Samantha "hello world"
And on Linux/Windows, you can first install espeak, then:
espeak -v en-us "hello world"
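If you want to stay inside Python but skip the deep learning models entirely, a thin wrapper around these OS-level commands already gets you sound files. A minimal sketch, assuming say on macOS and espeak elsewhere:

import platform
import subprocess

def create_sound_os(text, outputfile):
    """Write a TTS sound file using the OS's built-in voice."""
    if platform.system() == "Darwin":
        # macOS's `say` writes to a file with -o (AIFF by default,
        # so pass a name like "hello.aiff").
        subprocess.run(["say", "-v", "Samantha", "-o", outputfile, text], check=True)
    else:
        # espeak writes a .wav file with the -w flag.
        subprocess.run(["espeak", "-v", "en-us", "-w", outputfile, text], check=True)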
But if you're living on the bleeding edge, you can try this SpeechT5 TTS model from the HuggingFace transformers library in Python:
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset

# Load all the pretrained models...
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load an xvector containing a speaker's voice characteristics from a dataset.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Convert the text to speech: a tensor / array of floats.
text = "hello world"
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"],
                               speaker_embeddings, vocoder=vocoder)
# [out]: tensor([-1.6944e-04, 1.0556e-04, ..., -3.7849e-05, -2.1459e-04])
Then, to write the torch.tensor object into an actual sound file, we're gonna use the soundfile audio library, based on libsndfile, CFFI and NumPy.
import soundfile as sf

def create_sound(text, outputfile):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"],
                                   speaker_embeddings, vocoder=vocoder)
    # SpeechT5 generates audio at 16 kHz. Note that writing .mp3 needs a
    # recent libsndfile (>= 1.1.0); otherwise fall back to .wav.
    sf.write(outputfile, speech.numpy(), samplerate=16000)

# Example usage of the TTS function.
create_sound("hello world", "hw.mp3")
Stable Diffusion (Manga Mode!)
The vanilla stable diffusion, where you type in a sentence and it generates the image, is just so 2022. It's 2023, let's do image generation with some manga flair!
Generating manga/anime has always been on the minds of enthusiasts. Manga fans had already shown impressive results with Generative Adversarial Networks (aka the "grandfather of stable diffusion") in 2019. And computer vision scientists didn't give up even after Hayao Miyazaki's "[AI generated anime is] an insult to life itself" comment.
"[AI generated anime is] an insult to life itself"
- Hayao Miyazaki (2018)
After multiple generations of different text-to-image models, we get "Kawaii Diffusion"!
Give me the code already!!
from diffusers import DiffusionPipeline

diffusion_pipe = DiffusionPipeline.from_pretrained("Ojimi/anime-kawai-diffusion")
diffusion_pipe = diffusion_pipe.to("cuda")  # assumes a CUDA-capable GPU

def create_image(text, outputfile, prompt=""):
    image = diffusion_pipe(
        text.lower().strip().replace('_', ' ') + prompt,
        negative_prompt="lowres, girl, bad anatomy").images[0]
    image.save(outputfile)

text = "hello world"
create_image(text, f"{text}.jpg")
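One practical tip: generation quality varies wildly between runs (see the epilogue), so it helps to fix the random seed per attempt and only keep the seeds you like. A sketch, assuming the pipeline accepts a torch.Generator (the diffusers pipelines I've used do):

import torch

def create_image_seeded(text, outputfile, prompt="", seed=0):
    # A fixed seed makes a particular generation reproducible.
    generator = torch.Generator("cuda").manual_seed(seed)
    image = diffusion_pipe(
        text.lower().strip().replace('_', ' ') + prompt,
        negative_prompt="lowres, girl, bad anatomy",
        generator=generator).images[0]
    image.save(outputfile)

# Re-roll a few seeds and keep the best-looking image.
for seed in range(3):
    create_image_seeded("apple", f"apple-{seed}.jpg", prompt=" fruit on table", seed=seed)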
And immediately we see something different about how we use an image generator compared to a sound generator: there is some sort of positive and negative "prompt engineering". A few questions would come to an end-user, and I find them food for thought for machine learning model creators too:
- Why is it that we need to prompt to get a good image generation? E.g. why do we need an explicit negative prompt like lowres to avoid low-resolution outputs?
- Why is it that we don't prompt text-to-speech models? E.g. why can't we do something like "Generate speech in a low deep voice: Hello World"? I guess that's where selecting a vocaloid and speaker_embeddings do the magic, but can't we prompt it like how we do text and image NLP?
- Can abstract ideas like "Hello World" be generated into an image? Can an artist draw an abstract idea without research/inspiration?
- Are there other modalities of speech (phonopragmatics) that modern TTS models can't emulate?
Going back to Flashcard Generation
Now that we have sounds and images, putting them together into a video requires some video-editing finesse. You might have come across some functions using moviepy, and I totally appreciate all the work put into the open source library: 1-4 contributors maintaining a library that ~25K other libraries depend on. But nothing beats the raw CLI when it comes to the feature-packed ffmpeg.
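For reference, the moviepy route would look something like this (a sketch, assuming moviepy 1.x's ImageClip / AudioFileClip API):

from moviepy.editor import AudioFileClip, ImageClip

def create_video_moviepy(imagefile, soundfile, outputfile):
    audio = AudioFileClip(soundfile)
    # Show the still image for exactly as long as the audio plays.
    clip = ImageClip(imagefile).set_duration(audio.duration).set_audio(audio)
    clip.write_videofile(outputfile, fps=24)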
And so, I went the "cheating around Python" way:
word = "hello-world"
os.system(
f"ffmpeg -i {word}.jpg -i {word}.mp3 "
"-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
f"{word}.mp4"
)
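A word of caution: os.system with an f-string breaks as soon as a filename contains a space or a quote. If that bites you, subprocess.run with an argument list sidesteps shell quoting entirely; here's the same command as a sketch:

import subprocess

word = "hello-world"
subprocess.run([
    "ffmpeg", "-i", f"{word}.jpg", "-i", f"{word}.mp3",
    "-c:v", "libx264", "-tune", "stillimage", "-c:a", "copy",
    "-preset:v", "veryslow", f"{word}.mp4",
], check=True)  # raise an error if ffmpeg exits non-zero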
The ffmpeg command combines a single .jpg image file and an .mp3 sound file to create an .mp4 video file. There is a whole bunch of audio-video options that I don't really know, and I'll boldly say that I took the command from https://superuser.com/questions/1041816/combine-one-image-one-audio-file-to-make-one-video-using-ffmpeg and added a few other options here and there.
And finally to combine the video file for each word, we do:
os.system(
"""for f in `ls *.mp4 | sort -g`; do echo "file '$f'" """
">> mylist.txt; done"
)
os.system(f"ffmpeg -f concat -i mylist.txt generated-flashcards.mp4")
The shell loop goes through the .mp4 files, sorts them in alphanumerical order, and puts the list into a text file. Then ffmpeg can use the concat demuxer with the file list to create the final flashcard video file.
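The same list can be written from pure Python too, which makes the concat file format explicit; a short sketch:

import glob

# The concat demuxer expects one "file '<name>'" line per input clip.
# Note: sorted() is lexicographic, unlike `sort -g`, so zero-pad the
# indices if you have more than 9 clips.
with open("mylist.txt", "w") as fout:
    for clip in sorted(glob.glob("*.mp4")):
        fout.write(f"file '{clip}'\n")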
Looping through the fruit list given by ChatGPT, we do this:
import os

for i, fruit in enumerate(fruits):
    fruit = fruit.strip()  # the split on commas leaves leading spaces
    create_sound(fruit, f"{i}-{fruit}.mp3")
    create_image(fruit, f"{i}-{fruit}.jpg", prompt=" fruit on table")
    os.system(f"ffmpeg -i {i}-{fruit}.jpg -i {i}-{fruit}.mp3 -c:v libx264 -tune stillimage -c:a copy -preset:v veryslow {i}-{fruit}.mp4")

os.system("rm -f mylist.txt")
os.system("""for f in *.mp4; do echo "file '$f'" >> mylist.txt; done""")
os.system("ffmpeg -f concat -i mylist.txt ai-shichida.mp4")
Ta-da! Now we've created an AI-generated flashcard video…
My reaction:
When will I ever get the 1,000,000-subscriber gold play button?
Epilogue / Disclaimers
- The full code to create your own flashcards is available on https://www.kaggle.com/alvations/generative-language-learning-flashcard
- I understand that the GPT4All model isn't the paid ChatGPT service from OpenAI, but for simplicity of understanding, it's a large language model somehow pre-trained on "human aligned instructions".
- Note that the "prompt engineering" attempted for this post went on much longer than the examples posted in the "ChatGPT" section above; I tried a lot more prompts before the passive-aggressive one.
- You might have to loop through the fruits and re-create the images multiple times to get a generated image that you desire. The final video in this post was created after looping through create_image 10+ times…
- Here's a list of flashcards I've created using the same TTS + Diffusion code: