TL;DR, if youāre just here for the source code, you can download it from my GitHub repository. Or go ahead andĀ play the game online.
Remember the Microsoft Kinect? That bulky sensor bar which was onceĀ the worldās fastest-selling consumer electronics deviceĀ and later famously had to be unbundled from the Xbox One package?
It didnāt get much love from game developers, but the Kinect was a pretty decent piece of hardware. Head tracking and body pose estimation worked very well, and it could even detect simple hand gestures like āpinchā and āfistā. At an affordable retail price of around 99 USD, it quickly became a favorite inĀ hackerĀ andĀ makerĀ communities.
Sadly we havenāt seen a real successor to the 2014 āKinect 2ā model which was discontinued in 2018. The 2019 āKinect Azureā does not fill this gap for multiple reasons:
- It isĀ muchĀ more expensive. You can find a used Kinect 2 for around 50 USD while the Azure development kits currently sell for prices north of 1,000 USD. The Kinect 2 is a self-contained device that works out of the box.
- The Azure requires an additional Nvidia GPU (RTX 2060 or better) to work. Good luck finding one these days.
Luckily, with all the recent advances in machine learning and computer vision, there are now some great alternatives for adding Kinect-like features to your project.
Introducing TensorFlow.js
Developed by theĀ Google Brain Team, TensorFlow is a popular machine learning (ML) library for the Python programming language. TensorFlow.js (TFJS) is its companion library for JavaScript. Quoting theĀ official website:
TensorFlow.js is a library for machine learning in JavaScript. Develop ML models in JavaScript, and use ML directly in the browser or in Node.js.
TensorFlow.js is not just an amazing piece of software, but it also gives you access to aĀ growing library of machine learning models, ready to use with your project.
In this tutorial, I will show you how to use a TFJS based machine learning model to create a āRock, Paper, Scissorsā game with gesture controls. The final game will be running in your web browser, using just HTML and JavaScript.
Letās make a game!
The focus of this tutorial is on the hand gesture recognition part, not so much on game development. So to speed things up,Ā I have already prepared a simple game UI for you. šš
Still, to get a better idea of the game weāre building, letās create a simple game design document.
The game design document
When developing a game, usually the best way to start is to describe the gameplay by creating a game design document. There are many ways to do this, for example by drawing a storyboard.
For a simple game like āRock, paper, scissorsā, we will just verbally describe how the game should work:
- Game will load and show a āplease waitā message during initialization
- Player and opponent are shown next to each other
- The first round of the game is started by the press of a key or button
- A new round starts with a countdown, asking the player to get ready
- Player should show a hand gesture (rock, paper, or scissors)
- Once a valid gesture is registered, it is compared to the computer gesture (which is still hidden at this point)
- The result is evaluated according to the rules of āRock, paper, scissorsā
- Winner is awarded one point, score display gets updated
- Start next round (go to item #4)
With the UI out of the way, letās get right to the good stuff.
Building the hand gesture recognizer
When building a Rock, Paper, Scissors game, the key challenge is to recognize the three hand gestures āš¤ā inside a camera picture.
Before we look into the actual implementation of things, letās first take a step back and think about how a high-level process to detect hand gestures would look like:
- We are trying to determine hand gestures from a camera picture. Therefore, the first step is to detect whether a hand is actually inside the camera picture.
- If a hand is detected, we estimate the position of all finger joints to trace the hand skeleton.
The hand skeleton detector returns 21 key points (also called ālandmarksā): Four joints for each finger plus the wrist. This is our raw data which we will process further.
Working with the key points
The key points represent 2D coordinates, telling us the position of each skeleton point in the picture. This isnāt very useful to describe a hand gesture, as it is hard to compare two hand gestures based on the position of the joints. A hand can appear anywhere in the picture, it could be rotated, and people could be left- or right-handed.
Letās try to find a better representation by describing a hand gesture using natural language:
Take the āThumbs Upā gesture š as an example: It can be described as āAll four fingers fully curled and pointing to either the left or right. Thumb must not be curled and point upwardsā.
Curl and pointing directions are a much more concise way of describing a hand gesture. They are independent of the size and position of the hand in the camera picture, also both can easily be deduced from the raw 2D coordinates.
This brings us to the next steps in our detection process:
- Using the key points, we describe the curl and pointing direction for each detected finger.
- Finally, we compare this description to a list of known hand gestures and check which one is the best match.
Great, we figured out how to detect hand gestures ā At least in theory. Now letās see how TensorFlow.js can help us to implement it:
Picking a suitable machine learning model
As I mentioned in the introduction, TensorFlow.js gives you access to a library ofĀ many useful machine learning modelsĀ which you can immediately use within your application.
One of the models is called "HandPose"Ā and offers āHand pose detectionā. The description reads:
A palm detector and a hand-skeleton finger tracking model. It predicts 21 3D hand key points per detected hand.
Sounds like this model can already cover steps (1) and (2) of our detection process and extract the raw data we need. Awesome! Letās install it:
Install HandPose
First, we need to install the model itself:
npm i --save @tensorflow-models/handpose
Next, we install its TensorFlow.js dependencies:
npm i --save @tensorflow/tfjs-core
npm i --save @tensorflow/tfjs-converter
npm i --save @tensorflow/tfjs-backend-webgl
TensorFlow.js can use your computerās GPU for additional performance. Almost any GPU (Nvidia, AMD, Intel) works as long as it supports WebGL.Ā Yours most likely does, so make sure to install the WebGL backend to get a massive speed boost for free.
Processing the raw data
As I mentioned before, the raw data isnāt very useful for gesture detection. To work with the data, we need to transform it into ācurlā and āpointing directionā. Luckily there is the "Fingerpose" library which will do just that.
Install the Fingerpose library with the following command:
npm i --save fingerpose
Define the gestures
Fingerpose expects you to define a hand gesture by describingĀ the directionĀ andĀ curlĀ for each finger. Our game uses three distinct hand gestures, so we need to create oneĀ
GestureDescription
Ā for each gesture.Describe the rock gesture ā:
The rock gesture is just you making a fist:
- You bend your fingers into your palm, curling them under until the tip of each finger touches its corresponding base.
- Then you bend the thumb down so that it falls across the top halves of the index and middle fingers.
This code describes a ārockā gesture as:
- All fingers fully curled
- Thumb must be half curled or not curled at all
In case you wonder about the second condition: It is physically impossible for you to fully curl your thumb (unless you are Houdini). Also, some people, when making a fist will place their thumb next to their index finger, effectively stretching it out. So we tell Fingerpose thatĀ bothĀ are acceptable and equally valid.
Next, letās look at the āpaperā gestureĀ š:
No surprises here. To make a āpaperā gesture, you have to stretch out all of your fingers and your thumb.
Lastly, letās have a look at āscissorsāĀ āļø:
The āscissorsā gesture closely resembles a āvictoryā sign. Index and middle fingers are stretched out. Ring and pinky should be half or fully curled. We donāt care about the thumb, so we just omit it.
In case you are wondering about the pointing direction of each finger: Unlike a āThumbs upā gesture which has a completely different meaning when turned upside down, the gestures of our game do not change their meaning when mirrored or rotated. Therefore, the direction can be omitted to keep the descriptions simple.
Putting it all together
The implementation of the hand gesture recognizer consists of two parts:
- One-time Initialization of the model
- Process every video frame
Letās see some code for the initialization process:
The code above will first create a FingerposeĀ
GestureEstimator
Ā instance. Simply pass the list of known gestures to its constructor, and it is ready to be used.Afterward, the HandPose model will load and initialize. This may take some time as it will also download some files (the model weights) from the tfhub.dev website.
The last step is optional but will greatly improve the user experience. After you load the model, I recommend you āwarm upā the model by making one single prediction on a sample image. This is because the first prediction can take quite a bit of time, while subsequent predictions will usually be much faster. If you make the first (slow) prediction part of the initialization process, it will not slow down your gameplay later.
Processing a video frame:
Again, letās see some code first:
The code explained:
- We predict the 21 key points (landmarks) by calling theĀ
Ā function of the HandPose model, passing the HTML video element as the first argument. The second parameter indicates whether the source video is horizontally flipped.estimateHands
- If key points were predicted, we pass them to Fingerpose together with a minimum required match score (0 to 10, 10 = perfect match). The match score tells Fingerpose to ignore possible matches with a lower score.
- Depending on your minimum match score, Fingerpose could return an array with multiple possible matches. This code will pick the gesture with the highest match score (confidence).
The return value of this method will be the name of the gesture with the highest match score, or an empty string in case no gesture was detected.
Increase the detection stability
When you run the code above on a source video, you will notice that the predictions are occasionally unstable. In some frames, HandPose will detect āphantom handsā (false positive) or no hands at all (false negative). This can have an impact on gameplay.
One simple solution is to create a low-pass filter by combining detections from several consecutive frames to a single result. For example, we could wait for three consecutive frames to be classified as āRockā gesture before we emit the āRock detectedā event:
Prevent blocking the game UI
Running a machine learning model can be quite taxing on your CPU and GPU. While TensorFlow.js is incredibly fast (especially when running with the WebGL backend), it can still cause your game UI to become unresponsive. Especially when you run the model on each frame of a video stream.
Again, there is a simple workaround to prevent the UI from locking up. You can wrap theĀ
predictGesture
Ā function inside aĀ setTimeoutĀ
call with a timeout of zero seconds. Check outĀ this thread on StackOverflowĀ to learn more about this workaround.Below is some example code how to create a non-blocking detection loop:
Wrapping it up
With the code above, we have implemented a fast and stable gesture detector. Check out the full source code to learn how to integrate it in the final game.
This is the end of my tutorial. Feel free to use my source code as a base for your own game or application. If you have any comments, questions or suggestions please start a conversation in the comments.
Thank you very much for reading ā¤ļø