Some time ago, I ran into an issue when 10 million messages with emoji were written to a MySQL table with utf8 encoding.
For those who don’t know: you should use the utf8mb4 encoding in MySQL to support emoji.
If it were a small table, I would simply alter it and convert the columns to utf8mb4. But the table contains hundreds of millions of rows across many shards, so it’s very hard to alter it in production without degrading the system.
So the quick fix was to stop messages with emoji from being inserted into the database at the back-end level. The back-end service is written in Go. In this article, I’ll explain a few ways to work with emojis in Go: the good way and the bad way.
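For context: in MySQL versions where utf8 is an alias for utf8mb3, a character is stored in at most three bytes, while most emoji take four bytes in UTF-8. The sketch below shows a pre-insert check built on that fact. The helper name fitsUTF8MB3 is hypothetical and not part of any library. Note that it is not an emoji detector: it flags any four-byte character, and it lets through three-byte emoji such as ☀, which such a column can store anyway.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fitsUTF8MB3 is a hypothetical helper: it reports whether every
// character of s encodes to at most three bytes in UTF-8, which is
// the per-character limit of MySQL's utf8mb3 character set.
func fitsUTF8MB3(s string) bool {
	for _, r := range s {
		if utf8.RuneLen(r) > 3 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(fitsUTF8MB3("Message without emoji")) // true
	fmt.Println(fitsUTF8MB3("Message with emoji 👏"))  // false: 👏 needs four bytes
	fmt.Println(fitsUTF8MB3("Sunny ☀"))               // true: ☀ needs three bytes
}
```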
Some libraries and forums suggest using regexp to find emojis in a text:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var emojiRx = regexp.MustCompile(`[\x{1F600}-\x{1F6FF}|[\x{2600}-\x{26FF}]`)
	fmt.Println(emojiRx.MatchString("Message with emoji 😆😛")) // true
}
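It looks like it works, but a couple of disadvantages rule this approach out. It misses every emoji outside the hard-coded ranges:
fmt.Println(emojiRx.MatchString("Message with emoji 👏")) // false
And the pattern itself is hard to read and maintain:
[\x{1F600}-\x{1F6FF}|[\x{2600}-\x{26FF}]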
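The pattern has one more subtle problem. As far as Go’s regexp syntax goes, the whole expression is a single character class, so the `|` and the second `[` inside it are treated as literal characters rather than as alternation or a nested class. A small check, assuming the pattern exactly as quoted above:

```go
package main

import (
	"fmt"
	"regexp"
)

// emojiRx is the pattern from the example above, unchanged.
var emojiRx = regexp.MustCompile(`[\x{1F600}-\x{1F6FF}|[\x{2600}-\x{26FF}]`)

func main() {
	// Neither string contains an emoji, yet both match because
	// '|' and '[' are members of the character class.
	fmt.Println(emojiRx.MatchString("a|b"))    // true
	fmt.Println(emojiRx.MatchString("arr[0]")) // true

	// Text without '|', '[' or a character in the listed ranges does not match.
	fmt.Println(emojiRx.MatchString("plain text")) // false
}
```

So a message with a pipe or a square bracket would be rejected as if it contained an emoji.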
The best way is to keep a list of all emojis and consult it whenever you need to detect an emoji in a text. That’s how the GoMoji library works.
First, let’s add the package to our project:
go get -u github.com/forPelevin/gomoji
Now it’s pretty simple to check whether a string contains emoji:
package main

import (
	"fmt"

	"github.com/forPelevin/gomoji"
)

func main() {
	fmt.Println(gomoji.ContainsEmoji("Message with emoji 👏")) // true
}
Let’s dive into the library internals and figure out how it works.
First, look at the ContainsEmoji function:
// ContainsEmoji checks whether given string contains emoji or not. It uses local emoji list as provider.
func ContainsEmoji(s string) bool {
	for _, r := range s {
		if _, ok := emojiMap[r]; ok {
			return true
		}
	}
	return false
}
We see that the lib iterates through the string’s runes and checks whether each one is an emoji via emojiMap. So the complexity of the function is O(N), where N is the number of runes.
But what is emojiMap? It’s a map of Emoji models keyed by their code point:
// Emoji is an entity that represents comprehensive emoji info.
type Emoji struct {
	Slug        string `json:"slug"`
	Character   string `json:"character"`
	UnicodeName string `json:"unicode_name"`
	CodePoint   string `json:"code_point"`
	Group       string `json:"group"`
	SubGroup    string `json:"sub_group"`
}
// Code generated by generator.go ; DO NOT EDIT.
package gomoji

var (
	emojiMap = map[int32]Emoji{
		42: {
			Slug:        "keycap",
			Character:   "*️⃣",
			UnicodeName: "keycap: *",
			CodePoint:   "002A FE0F 20E3",
			Group:       "symbols",
			SubGroup:    "keycap",
		},
		...
	}
)
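In the entry shown, the key 42 is the decimal value of the first code point in the CodePoint field (0x2A). The sketch below decodes such a CodePoint string into runes to make that relationship visible. The helper parseCodePoints is hypothetical and not part of GoMoji, and whether every key in the generated map follows the same rule is an assumption based on this single entry.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCodePoints is a hypothetical helper: it turns a space-separated
// list of hex code points, such as "002A FE0F 20E3", into runes.
func parseCodePoints(codePoint string) ([]rune, error) {
	fields := strings.Fields(codePoint)
	runes := make([]rune, 0, len(fields))
	for _, f := range fields {
		v, err := strconv.ParseInt(f, 16, 32)
		if err != nil {
			return nil, fmt.Errorf("parse code point %q: %w", f, err)
		}
		runes = append(runes, rune(v))
	}
	return runes, nil
}

func main() {
	runes, err := parseCodePoints("002A FE0F 20E3")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(runes)) // 3: the keycap emoji is a sequence of three code points
	fmt.Println(runes[0])   // 42: rune is an alias for int32, so it can index map[int32]Emoji
}
```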
So the pre-generated map gives us some advantages:
it contains all existing emojis
it returns an emoji by its code point in O(1), which is outstanding performance.
Here are some benchmarks:
BenchmarkContainsEmojiParallel-8    94079461     13.1 ns/op      0 B/op    0 allocs/op
BenchmarkContainsEmoji-8            23728635     49.8 ns/op      0 B/op    0 allocs/op
BenchmarkFindAllParallel-8          10220854      115 ns/op    288 B/op    2 allocs/op
BenchmarkFindAll-8                   4023626      294 ns/op    288 B/op    2 allocs/op
A reasonable question is where all the emoji data comes from.
The answer is in the generator.go file. It contains a CLI app that fetches all emojis from OpenAPI Emoji and saves them in the data.go file in the emojiMap map[int32]Emoji format. This allows the lib to keep its emojis up to date.
As software engineers, we encounter problems every day. The best solution isn’t always the first one that comes to mind or the one found on Stack Overflow.
So if you are stuck with emojis, consider using the simple and useful GoMoji lib. It can help you not only validate texts but also build great features in your chat app.