Stable Diffusion has created a storm: generating award-winning images, creating a version of me in an alternate universe, and recently getting into a whirlwind of PR/HR issues. And now the new kids on the block, language models, seem to do everything from predicting stock prices to becoming the CEO of a brand-new startup. ChatGPT is all the hype now, and there's no shortage of creative applications for generative models.
Despite all the advancements in AI, we still apply these models with caution, especially when creating education/information-related products. That fateful morning, my wife repeatedly played YouTube flashcard videos for infant/toddler right-brain development. And she asked, "ChatGPT can do everything, why don't they make flashcard videos?". I replied, "Challenge accepted". Hence this article on the rabbit hole of trying to automatically create flashcard videos using a couple of generative AI models.
What are Right Brain Development Flashcards?
They are a series of flashcards shown to infants/toddlers in quick succession, usually less than 1 second per card. The flashcards in each session / video belong to a specific category or topic, e.g. showing a series of fruits ['apples', 'apricot', 'banana', 'blueberry', 'coconut'].
Example: https://youtu.be/-shR2MAXT0Q?t=71
The general idea is to automatically generate a video of flashcards given a specified list, thus it'll require:
- Some way to generate a sound file of each word in the list (.mp3 or .wav). Ah-ha! Text-to-speech (TTS) models!
- Somehow create an image file given a word / string. Definitely some stable diffusion thing.
- A way to combine the image with the sound files into a video (.mp4). I don't know the solution yet, but it's surely some ffmpeg CLI command.
Vänta ett ögonblick! (Wait a minute!) Where is ChatGPT in this?
Alright, alright. Let's change the specifications a little: instead of generating a flashcard video given a list of vocabulary words, let's "prompt" ChatGPT to give us the list of words given a specific topic (the plan is sketched in code right after this list):
- Give ChatGPT a topic and ask it to give me a list of words for kids' language learning
- Some way to generate a sound file of each word in the list
- Somehow create an image file given a word / string
- A way to combine the image with the sound files into a video
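In code, the overall plan would be a skeleton like this. Note this is only a sketch: create_sound and create_image get real implementations later in this post, while combine_to_clip and concat_clips are hypothetical placeholders for the ffmpeg commands shown near the end.

def make_flashcard_video(words, outputfile):
    for i, word in enumerate(words):
        create_sound(word, f"{i}-{word}.mp3")   # Step 2: text-to-speech
        create_image(word, f"{i}-{word}.jpg")   # Step 3: stable diffusion
        # Step 4a: combine each image + sound pair into a short clip.
        combine_to_clip(f"{i}-{word}.jpg", f"{i}-{word}.mp3", f"{i}-{word}.mp4")
    concat_clips(outputfile)                    # Step 4b: stitch the clips together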
Hey ChatGPT, "Name 10 fruits"
from gpt4all import GPT4All
# I know it's not the actual "ChatGPT" from OpenAI..
# But still it's easier to just personalize to the LLM as "ChatGPT".
gptj = GPT4All("ggml-gpt4all-j-v1.3-groovy")
messages = [{"role": "user", "content": "Name 10 fruits"}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 240,
'completion_tokens': 304,
'total_tokens': 544},
'choices': [{'message': {'role': 'assistant',
'content':
'1. Apple, Banana, Orange (all fruits)'
'2. Pear, Cherry, Grape (all fruits)'
'3. Kiwi, Mango (fruits)'
'4. Pineapple, Strawberry (fruits)'
'5. Watermelon, Lemon (fruits)'
'6. Kiwi, Mango (fruits)'
'7. Apple, Banana, Orange (all fruits)'
'8. Pineapple, Strawberry (fruits)'
'9. Watermelon, Lemon (fruits)'
'10. Kiwi, Mango (fruits)'}}]
}
My reaction:
Either I don't know how to prompt, or ChatGPT doesn't know how to count -_-
Hey ChatGPT, "List down 10 fruit words and only 10."
messages = [{"role": "user", "content": "List down 10 fruit words and only 10."}]
gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 263,
'completion_tokens': 157,
'total_tokens': 420},
'choices': [{'message': {'role': 'assistant',
'content': ' Sure, here are 10 fruit words that I can think of:'
'1. Apple '
'2. Banana '
'3. Orange '
'4. Pear '
'5. Cherry '
'6. Strawberry '
'7. Kiwi '
'8. Grapes '
'9. Pineapple '
'10. Watermelon'}}]}
My reaction:
Now the list looks right, but I can't put it into a Python list easily. If I spend time editing the list into runnable code, I might as well type the words out myself @_@
Hey ChatGPT, "List down 10 fruit words, separate the list by commas without numbers"
prompt = "List down 10 fruit words, separate the list by commas without numbers"
messages = [{"role": "user", "content": prompt}]
gptj.chat_completion(messages)
[out]:
' apple, banana, orange'
My reaction:
What the %^&*?! I can seriously do a better job with https://www.google.com/search?q=list+10+fruits
(╯_╰) |||
I'm going to try some passive-aggression with ChatGPT
messages = [{"role": "user", "content":
"Can you count seriously? I'm telling you to list down 10 fruits. "
'List 10 fruit words, separate by commas.'}]
result = gptj.chat_completion(messages)
[out]:
{'model': 'ggml-gpt4all-j-v1.3-groovy',
'usage': {'prompt_tokens': 331,
'completion_tokens': 129,
'total_tokens': 460},
'choices': [{'message': {'role': 'assistant',
'content':
' Sure, here are 10 fruits separated by commas:'
'apple, banana, cherry, grapefruit, kiwi, mango, peach, pear, pineapple, strawberry.'
}}]}
And then:
fruits = result['choices'][0]['message']['content'].partition(':')[2].strip('.').split(',')
My reaction:
> len("['apple, banana, cherry, grapefruit, kiwi, mango, peach, "
"pear, pineapple, strawberry']")
86
> len("result['choices'][0]['message']['content'].partition(':')"
"[2].strip('.').split(',')")
82
Seriously, to save my mind from thinking up 10 fruits and save 4 characters when typing them out… 割鸡焉用牛刀 (why use an ox-cleaver to kill a chicken)…
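As an aside, if you don't want to bank on the model following the output format exactly, a slightly more defensive parse helps. Here's a minimal sketch (parse_word_list is my own hypothetical helper, not part of any library) that handles both the numbered and the comma-separated replies above:

import re

def parse_word_list(content):
    # Drop any "Sure, here are ...:" preamble before the last colon.
    # (rpartition returns the whole string if there's no colon.)
    content = content.rpartition(':')[2]
    # Turn numbering like "1." or "2)" into commas, then split on commas.
    content = re.sub(r'\d+[.)]', ',', content)
    return [w.strip(' .') for w in content.split(',') if w.strip(' .')]

fruits = parse_word_list(result['choices'][0]['message']['content'])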
Now that the ChatGPT and prompt-engineering clickbait part is over, it's time to get serious with the flashcard generation.
From Text to Speech
The machine learning task of creating a sound file given a text is usually referred to as "text-to-speech", aka TTS. By default, a lot of operating systems already come with some pretty nifty TTS systems that you can access through the command prompt.
On macOS, you can try this in the terminal:
say -v Samantha "hello world"
And on Linux/Windows, you can first install espeak, then:
espeak -v en-us "hello world"
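If you want to stay inside Python but skip the deep learning models entirely, a thin wrapper around these OS-level commands already gets you sound files. A minimal sketch, assuming say on macOS and espeak elsewhere:

import platform
import subprocess

def create_sound_os(text, outputfile):
    """Write a TTS sound file using the OS's built-in voice."""
    if platform.system() == "Darwin":
        # macOS's `say` writes to a file with -o (AIFF by default,
        # so pass a name like "hello.aiff").
        subprocess.run(["say", "-v", "Samantha", "-o", outputfile, text], check=True)
    else:
        # espeak writes a .wav file with the -w flag.
        subprocess.run(["espeak", "-v", "en-us", "-w", outputfile, text], check=True)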
But if you're living on the bleeding edge, you can try this SpeechT5 TTS model from the HuggingFace transformers library in Python:
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset

# Load all the pretrained models...
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Load an xvector containing a speaker's voice characteristics from a dataset.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Convert the text to speech: a tensor / array of floats.
text = "hello world"
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"],
                               speaker_embeddings, vocoder=vocoder)
# [out]: tensor([-1.6944e-04, 1.0556e-04, ..., -3.7849e-05, -2.1459e-04])
Then, to write the torch.tensor object into an actual sound file, we're gonna use the soundfile audio library, based on libsndfile, CFFI and NumPy.
import soundfile as sf

def create_sound(text, outputfile):
    inputs = processor(text=text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"],
                                   speaker_embeddings, vocoder=vocoder)
    # SpeechT5 generates audio at 16 kHz. Note that writing .mp3 needs a
    # recent libsndfile (>= 1.1.0); otherwise fall back to .wav.
    sf.write(outputfile, speech.numpy(), samplerate=16000)

# Example usage of the TTS function.
create_sound("hello world", "hw.mp3")
Stable Diffusion (Manga Mode!)
The vanilla stable diffusion, where you type in a sentence and it generates the image, is just so 2022. It's 2023, let's do image generation with some manga flair!
Generating manga/anime has always been on the minds of enthusiasts. Manga fans had already shown impressive results with Generative Adversarial Networks (aka the "grandfather of stable diffusion") in 2019. And computer vision scientists didn't give up even after Hayao Miyazaki's "[AI generated anime is] an insult to life itself" comment.
"[AI generated anime is] an insult to life itself"
- Hayao Miyazaki (2018)
After multiple generations of different text-to-image models, we get "Kawaii Diffusion"!
Give me the code already!!
from diffusers import DiffusionPipeline

diffusion_pipe = DiffusionPipeline.from_pretrained("Ojimi/anime-kawai-diffusion")
diffusion_pipe = diffusion_pipe.to("cuda")  # assumes a CUDA-capable GPU

def create_image(text, outputfile, prompt=""):
    image = diffusion_pipe(
        text.lower().strip().replace('_', ' ') + prompt,
        negative_prompt="lowres, girl, bad anatomy").images[0]
    image.save(outputfile)

text = "hello world"
create_image(text, f"{text}.jpg")
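One practical tip: generation quality varies wildly between runs (see the epilogue), so it helps to fix the random seed per attempt and only keep the seeds you like. A sketch, assuming the pipeline accepts a torch.Generator (the diffusers pipelines I've used do):

import torch

def create_image_seeded(text, outputfile, prompt="", seed=0):
    # A fixed seed makes a particular generation reproducible.
    generator = torch.Generator("cuda").manual_seed(seed)
    image = diffusion_pipe(
        text.lower().strip().replace('_', ' ') + prompt,
        negative_prompt="lowres, girl, bad anatomy",
        generator=generator).images[0]
    image.save(outputfile)

# Re-roll a few seeds and keep the best-looking image.
for seed in range(3):
    create_image_seeded("apple", f"apple-{seed}.jpg", prompt=" fruit on table", seed=seed)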
And immediately we see something different about how we use an image generator compared to a sound generator: there is some sort of positive and negative "prompt engineering". A few questions would come to an end-user, and I find them food for thought for machine learning model creators too:
- Why is it that we need to prompt to get a good image generation? E.g. why do we need an explicit negative prompt like lowres to avoid low-resolution outputs?
- Why is it that we don't prompt text-to-speech models? E.g. why can't we do something like "Generate speech in a low deep voice: Hello World"? I guess that's where selecting a vocaloid and speaker_embeddings do the magic, but can't we prompt it like how we do text and image NLP?
- Can abstract ideas like "Hello World" be generated into an image? Can an artist draw an abstract idea without research/inspiration?
- Are there other modalities of speech (phonopragmatics) that modern TTS models can't emulate?
Going back to Flashcard Generation
Now that we have sounds and images, putting them together into a video requires some video-editing finesse. You might have come across some functions using moviepy, and I totally appreciate all the work put into the open source library: 1-4 contributors maintaining a library that ~25K other libraries depend on. But nothing beats the raw CLI when it comes to the feature-packed ffmpeg.
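For reference, the moviepy route would look something like this (a sketch, assuming moviepy 1.x's ImageClip / AudioFileClip API):

from moviepy.editor import AudioFileClip, ImageClip

def create_video_moviepy(imagefile, soundfile, outputfile):
    audio = AudioFileClip(soundfile)
    # Show the still image for exactly as long as the audio plays.
    clip = ImageClip(imagefile).set_duration(audio.duration).set_audio(audio)
    clip.write_videofile(outputfile, fps=24)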
And so, I went the "cheating around Python" way:
word = "hello-world"
os.system(
f"ffmpeg -i {word}.jpg -i {word}.mp3 "
"-c:v libx264 -tune stillimage -c:a copy -preset:v veryslow "
f"{word}.mp4"
)
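A word of caution: os.system with an f-string breaks as soon as a filename contains a space or a quote. If that bites you, subprocess.run with an argument list sidesteps shell quoting entirely; here's the same command as a sketch:

import subprocess

word = "hello-world"
subprocess.run([
    "ffmpeg", "-i", f"{word}.jpg", "-i", f"{word}.mp3",
    "-c:v", "libx264", "-tune", "stillimage", "-c:a", "copy",
    "-preset:v", "veryslow", f"{word}.mp4",
], check=True)  # raise an error if ffmpeg exits non-zero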
The ffmpeg command combines a single .jpg image file and an .mp3 sound file to create an .mp4 video file. There is a whole bunch of audio-video options that I don't really know, and I'll boldly say that I took the command from https://superuser.com/questions/1041816/combine-one-image-one-audio-file-to-make-one-video-using-ffmpeg and added a few other options here and there.
And finally to combine the video file for each word, we do:
os.system(
"""for f in `ls *.mp4 | sort -g`; do echo "file '$f'" """
">> mylist.txt; done"
)
os.system(f"ffmpeg -f concat -i mylist.txt generated-flashcards.mp4")
The shell loop goes through the .mp4 files, sorts them in alphanumerical order, and puts the list into a text file. Then ffmpeg can use the concat demuxer with the file list to create the final flashcard video file.
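The same list can be written from pure Python too, which makes the concat file format explicit; a short sketch:

import glob

# The concat demuxer expects one "file '<name>'" line per input clip.
# Note: sorted() is lexicographic, unlike `sort -g`, so zero-pad the
# indices if you have more than 9 clips.
with open("mylist.txt", "w") as fout:
    for clip in sorted(glob.glob("*.mp4")):
        fout.write(f"file '{clip}'\n")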
Looping through the fruit list given by ChatGPT, we do this:
import os

for i, fruit in enumerate(fruits):
    fruit = fruit.strip()  # the split on commas leaves leading spaces
    create_sound(fruit, f"{i}-{fruit}.mp3")
    create_image(fruit, f"{i}-{fruit}.jpg", prompt=" fruit on table")
    os.system(f"ffmpeg -i {i}-{fruit}.jpg -i {i}-{fruit}.mp3 -c:v libx264 -tune stillimage -c:a copy -preset:v veryslow {i}-{fruit}.mp4")

os.system("rm -f mylist.txt")
os.system("""for f in *.mp4; do echo "file '$f'" >> mylist.txt; done""")
os.system("ffmpeg -f concat -i mylist.txt ai-shichida.mp4")
Ta-da! Now we've created an AI-generated flashcard video…
My reaction:
When will I ever get the 1,000,000-subscriber gold play button?
Epilogue / Disclaimers
- The full code to create your own flashcards is available on https://www.kaggle.com/alvations/generative-language-learning-flashcard
- I understand that the GPT4All model isn't the paid ChatGPT service from OpenAI, but for simplicity of understanding, it's a large language model somehow pre-trained on "human aligned instructions".
- Note that the "prompt engineering" attempted for this post went on much longer than the examples posted in the "ChatGPT" section above; I tried a lot more prompts before the passive-aggressive one.
- You might have to loop through the fruits and re-create the images multiple times to get a generated image that you desire. The final video in this post was created after looping through create_image 10+ times…
- Here's a list of flashcards I've created using the same TTS + Diffusion code: