IVI is a video streaming service emphasizing recommendations and deep personalization. We offer thousands of movies and TV series to our users and, among other things, personalize titles’ posters.
For such personalization, one has to create multiple diverse posters for each title. While the best posters are hand-crafted by designers, the process is too time-consuming. That’s why we created a tool called Parker, which generates high-quality title previews and posters automatically.
Though this approach saves time, it also has a major flaw – Parker can produce posters with awkward-looking facial expressions, which makes the resulting images unusable. Not seeing an effective in-house solution and racing against the clock, we turned to crowdsourcing.
The software pulls facial images from the platform’s media catalog, and faces caught mid-frame can look quite goofy and awkward. What’s even worse, such facial shots often carry no recognizable emotion.
Sifting through these images by hand proved too labor-intensive and time-consuming for our moderators. This is why we decided to train a Machine Learning (ML) algorithm to do the work instead and hopefully come out ahead in the long run.
This isn’t, in and of itself, a super challenging task; in fact, simple binary classification will do just fine – 0 for a normal face and 1 for an awkward one. Easy enough. All that’s needed is a model to train and the actual labeled data. Finding a suitable model is not an issue, since dozens are readily available, from Xception to EfficientNet, and all of them can be used with Keras, a common tool for constructing neural networks.
The most common ML models available in Keras.
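For illustration, a classifier along these lines might look as follows in Keras. This is only a minimal sketch, assuming an Xception backbone with ImageNet weights and a single sigmoid output; the input size, dropout rate, and training settings are placeholder assumptions rather than our exact configuration.

```python
from tensorflow import keras

# Pretrained Xception backbone turned into a binary classifier:
# 0 = normal face, 1 = awkward-looking face.
backbone = keras.applications.Xception(
    weights="imagenet",        # start from ImageNet features
    include_top=False,         # drop the original 1000-class head
    pooling="avg",             # global average pooling -> one feature vector
    input_shape=(299, 299, 3),
)
backbone.trainable = False     # train only the new head at first

model = keras.Sequential([
    backbone,
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# train_ds and val_ds would be tf.data.Dataset objects built from the
# labeled face crops (hypothetical names):
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Swapping Xception for EfficientNet (or any other Keras application) only changes the backbone line; the rest of the setup stays the same.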
But the data is a different story altogether. Judging by several datasets used to classify human emotions, we worked out that we would need approximately 72,000 labeled images to train our algorithm reliably. So we extracted high-quality images of faces from 18 films, keeping no fewer than 10 images per actor (enough to capture a person’s range of expressions) and no more than 100 (so the algorithm wouldn’t get used to one face). The raw data was there. What about the labeling?
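Setting the labeling aside for a moment, the per-actor balancing itself is simple to script. Below is a minimal sketch assuming the face crops are already grouped by actor; the function and variable names are hypothetical.

```python
import random
from typing import Dict, List

MIN_PER_ACTOR = 10   # enough shots to cover a person's range of expressions
MAX_PER_ACTOR = 100  # cap so the model doesn't get used to a single face

def balance_actors(images_by_actor: Dict[str, List[str]],
                   seed: int = 42) -> List[str]:
    """Keep actors with at least MIN_PER_ACTOR images and
    randomly downsample anyone with more than MAX_PER_ACTOR."""
    rng = random.Random(seed)
    selected: List[str] = []
    for actor, paths in images_by_actor.items():
        if len(paths) < MIN_PER_ACTOR:
            continue  # too few examples of this face, skip the actor
        if len(paths) > MAX_PER_ACTOR:
            paths = rng.sample(paths, MAX_PER_ACTOR)
        selected.extend(paths)
    return selected
```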
We estimated that our in-house team would need around 1.5 months to complete the labeling, and it would cost us around $3,000. This is when Toloka stepped in: a crowdsourcing platform that combines the small individual efforts of crowd performers from around the globe with its vast IT infrastructure to label data at scale.
The labeling assignment would contain 5 key stages:
Decomposition
In a binary classification task, performers are expected to answer a simple yes-no question, in this case: “Is this a normal face or not?” The task included 2 phases: the first filtered out images unsuitable for labeling, and the second classified the remaining faces as normal or awkward-looking. Each phase, in turn, was split into 2 subphases.
Instructions
Two sets of instructions with examples were given to Tolokers (Toloka’s crowd performers), one for each phase of the task. In the first phase, an image was considered acceptable if it contained exactly one face with clearly visible facial features that covered more than half of the image; the viewer also had to be able to tell whether the eyes and mouth were open or closed. Images that did not meet these criteria, in most cases because they contained no face or more than one face, were labeled “unacceptable” and excluded from the 2nd phase.
In the second phase, which covered only the acceptable images, a face was deemed awkward-looking if it had a half-open mouth or half-open eyes that did not add up to a recognizable facial expression. Conversely, a face was deemed normal if the mouth and eyes weren’t half-open, or, if they were, a distinct expression could still be read.
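To make the two-phase logic concrete, the criteria above can be written down roughly as the rules below. The attribute names are hypothetical and exist only to mirror the instructions; in reality these judgments were made by Tolokers, not computed.

```python
from dataclasses import dataclass

@dataclass
class FaceShot:
    # Hypothetical attributes mirroring the labeling instructions.
    num_faces: int
    features_clearly_visible: bool
    face_covers_half_of_image: bool
    mouth_or_eyes_half_open: bool
    expression_recognizable: bool

def phase_one(shot: FaceShot) -> str:
    """Phase 1: is the image acceptable for further labeling?"""
    if (shot.num_faces == 1
            and shot.features_clearly_visible
            and shot.face_covers_half_of_image):
        return "acceptable"
    return "unacceptable"

def phase_two(shot: FaceShot) -> str:
    """Phase 2 (acceptable images only): normal vs. awkward-looking."""
    if shot.mouth_or_eyes_half_open and not shot.expression_recognizable:
        return "awkward-looking"
    return "normal"
```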
Task Interface
We decided to use Toloka’s ready-made image classification template for all phases and subphases of the task. Each page contained 8-10 images and 3 radio buttons, with an option to use the keyboard in addition to the mouse.
An example of the task’s interface on Toloka with a selection of 8 images that Tolokers had to label as “acceptable/unacceptable” in the 1st phase and “normal/awkward-looking” in the 2nd phase of the assignment.
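Preparing the upload for such a template mostly comes down to grouping the face crops into pages. Here is a minimal sketch, assuming a fixed page size of 9 images and an input field simply called “image”; both are assumptions rather than Toloka’s exact defaults.

```python
from typing import Dict, Iterable, List

# Radio-button options per phase, as described above.
PHASE_OPTIONS = {
    "phase_1": ["acceptable", "unacceptable"],
    "phase_2": ["normal", "awkward-looking"],
}

PAGE_SIZE = 9  # each page showed 8-10 images; 9 is an arbitrary middle value

def build_pages(image_urls: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Chunk the face crops into pages the way the template displays them."""
    tasks = [{"image": url} for url in image_urls]
    return [tasks[i:i + PAGE_SIZE] for i in range(0, len(tasks), PAGE_SIZE)]
```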
Quality control
To minimize errors and exclude bots or unscrupulous performers, 4 quality control mechanisms were put in place:
The interface provided hints and explained incorrect answers during training. Only Tolokers who passed the training session with 80% accuracy were allowed to complete the assignment.
During the assignment, 10% of the questions had answers already known to us – control tasks that we incorporated on purpose. Any Toloker whose error rate on these exceeded 25% was dropped (a minimal sketch of this check follows this list).
As is the case with most crowdsourcing tasks, each question was given to more than one performer, with the final label decided by majority voting. Every question was answered by 3 performers in the 1st phase (face/no face) and by 5 in the 2nd phase (normal/awkward-looking face).
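As promised above, here is a minimal sketch of the control-task check. The 25% threshold comes straight from the setup described here, while the data structures and function names are assumptions.

```python
from typing import Dict

CONTROL_MAX_ERROR_RATE = 0.25  # performers above this error rate were dropped

def should_drop(performer_answers: Dict[str, str],
                golden_answers: Dict[str, str]) -> bool:
    """Compare a performer's answers on control tasks against the known labels."""
    checked = [t for t in performer_answers if t in golden_answers]
    if not checked:
        return False  # no control tasks answered yet
    errors = sum(1 for t in checked
                 if performer_answers[t] != golden_answers[t])
    return errors / len(checked) > CONTROL_MAX_ERROR_RATE
```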
Assurance tools
In addition, 4 quality assurance methods were built into the assignment:
Language restrictions (only those with proven Russian skills could participate).
Duration (overly fast responses were flagged; see the sketch after this list).
Daily task limits (each Toloker could complete only a certain number of tasks in a row).
Captchas (Turing challenge-response tests were spread throughout).
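The duration and daily-limit checks can be sketched in a similar way; the thresholds below are placeholders for illustration, not the values we actually used.

```python
from typing import List

MIN_SECONDS_PER_PAGE = 8   # assumed floor for a plausible response time
MAX_PAGES_PER_DAY = 200    # assumed daily cap per Toloker

def flag_suspicious(page_durations_sec: List[float], pages_today: int) -> bool:
    """Flag a performer who answers implausibly fast or exceeds the daily limit."""
    too_fast = any(d < MIN_SECONDS_PER_PAGE for d in page_durations_sec)
    over_limit = pages_today > MAX_PAGES_PER_DAY
    return too_fast or over_limit
```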
Aggregation
Using majority voting over the overlapping answers, we aggregated the labeled data to get the final results. All in all, the assignment took 4 days – roughly 11 times faster than our in-house estimate of 1.5 months. The accuracy varied between 75% and 90%, approaching the latter figure towards the project’s completion.
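For reference, majority-vote aggregation over the overlapping answers boils down to a few lines; the dictionary layout here is an assumption.

```python
from collections import Counter
from typing import Dict, List

def aggregate_by_majority(answers: Dict[str, List[str]]) -> Dict[str, str]:
    """Collapse the 3 or 5 overlapping answers per image into one label."""
    return {image_id: Counter(labels).most_common(1)[0][0]
            for image_id, labels in answers.items()}

# Example: three Tolokers disagree on one image.
# aggregate_by_majority({"frame_042.jpg": ["normal", "awkward-looking", "normal"]})
# -> {"frame_042.jpg": "normal"}
```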
Having tried crowdsourcing for the first time, we were very pleased – so much so that we’re now preparing a new project: another improvement to Parker, this time teaching it to recognize and prioritize good-looking posters and previews over average ones.