
Life with ML Moderation Pt. 1 - Realities of ML Accuracies

Wed Jan 01 2020

Removing unsafe and anti-social content is a primary goal at CR. Doing this while still respecting people's desire to remain anonymous on the site poses significant technical and social challenges. In particular, since we do not require user registration, we must actively moderate the site to limit anti-social content. Originally, this was done entirely by human moderators. Obviously, managing and growing a human moderation force poses significant challenges in both scalability and cost. Thus, we have been augmenting this effort with machine learning (ML) content moderation for almost two years now, and we have learned a lot about the realities of these services in that time.

Service Accuracies

The naive use case for ML moderation is to take a snapshot of a video stream and check whether that image shows some form of anti-social behaviour (usually some form of nudity). While simplistic, this is not a bad place to start, as it allows for a simple and intuitive analysis of the results coming from the underlying ML moderation models.

Before going into that, we need to go over how ML-based moderation typically works. For this post, we'll focus on cloud-based ML moderation services. Most of these are essentially API endpoints that translate graphical input (images or video streams) into a collection of label-confidence pairs reflecting possible unsafe content in the submitted media. In this post, all confidences are rescaled to the range zero to one, with one reflecting the highest confidence in the associated label.
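
As a rough sketch of what such a response looks like, along with the rescaling we apply (the label names and the 0-100 scale here are placeholders for the general shape of these responses, not the actual Azure or Rekognition schemas):

```python
# Illustrative only: the label names and the 0-100 scale below mimic the
# general shape of a cloud moderation response; they are not the actual
# Azure or Rekognition schemas.
raw_result = {
    "Explicit Nudity": 87.3,
    "Suggestive": 42.1,
}

def normalise(confidences):
    """Rescale vendor confidences to the 0-1 range used throughout this post."""
    return {label: conf / 100.0 for label, conf in confidences.items()}

print(normalise(raw_result))  # {'Explicit Nudity': 0.873, 'Suggestive': 0.421}
```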

With that, we analysed the simplest possible model: set some confidence threshold, and if any adult label coming from the ML classifier has a confidence at or above that threshold, declare the image unsafe. For our initial analysis, we did a parameter scan of the confidence threshold against 10,000 images randomly pulled from production. We then had humans classify the same set of samples as safe or unsafe, i.e. whether the images showed explicit nudity. Using the human classification as the baseline, we then calculated false-positive rates (safe images incorrectly flagged as unsafe) and false-negative rates (unsafe images incorrectly passed as safe) for the ML moderation results as a function of confidence threshold:

While there are many ways to assess the quality of an ML model, the figure above does a good job of setting the stage for what our expectations of such a system should be. In particular, if we want less than 1% false positives (a reflection of the underlying business goal), the false-negative rate (for Azure) is unacceptably high: far too much unsafe content slips through. Conversely, lowering the threshold enough to catch that content would translate into a huge number of people being incorrectly banned. With Rekognition, we can only get down to about a 5% false-positive rate, with an even worse false-negative rate.
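
For concreteness, here is a minimal sketch of the threshold model and the rate calculation behind the figure; the adult label set and the `labelled_results` structure are illustrative stand-ins for our production data, not the actual service responses:

```python
ADULT_LABELS = {"Explicit Nudity", "Suggestive"}  # hypothetical label set

def flag_unsafe(confidences, threshold):
    """Flag an image if any adult label meets or exceeds the threshold."""
    return any(conf >= threshold
               for label, conf in confidences.items()
               if label in ADULT_LABELS)

def rates(labelled_results, threshold):
    """False-positive and false-negative rates at a given threshold.

    `labelled_results` is a list of (label->confidence dict, human_says_unsafe)
    pairs built from the human-classified sample set.
    """
    safe = [conf for conf, is_unsafe in labelled_results if not is_unsafe]
    unsafe = [conf for conf, is_unsafe in labelled_results if is_unsafe]
    fp_rate = sum(flag_unsafe(c, threshold) for c in safe) / len(safe)
    fn_rate = sum(not flag_unsafe(c, threshold) for c in unsafe) / len(unsafe)
    return fp_rate, fn_rate

# Parameter scan over confidence thresholds, mirroring the analysis above:
# for t in (i / 100 for i in range(101)):
#     print(t, rates(labelled_results, t))
```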

So, even with our simplistic analysis, we can already see that ML moderation from AWS or Microsoft will not achieve our business goals out-of-the-box.

Fixes and Correlations

So, the next question is: can we "fix" the above situation with a more sophisticated model than a simple threshold analysis? A canonical next step is to combine the moderation results of multiple samples from a single user to determine whether they have shown some form of adult content. The easiest way to do this is to translate confidences into probabilities (via the data underpinning Figure [ml_comp]) and then treat each moderation result as an independent observation of the user's behaviour. The end result is that we can calculate the probability of a user showing adult content at some point over $n$ samples as

$$ P_{\mathrm{unsafe}} = 1 - \prod_{i=1}^{n} p_i, $$

where $p_i$ is the probability of the $i$-th image analysed being safe, as determined by the ML moderation. To be a bit more precise, the above probability reflects the chance of at least one image in the collection of $n$ being unsafe, i.e. an unsafe user.

Practically, as $n$ increases, the chance of someone being identified as unsafe approaches certainty, to the point of inevitability. This is a by-product of the fact that no classification coming from an ML system is absolutely certain. Thus, our initial tests of this model were arbitrarily limited to the last ten samples in a user's history.
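
A minimal sketch of this combination rule, assuming the per-image safe probabilities have already been derived from the calibration data and applying the ten-sample cap mentioned above:

```python
from math import prod

def prob_user_unsafe(safe_probs, window=10):
    """Probability that at least one of the last `window` samples was unsafe,
    treating each moderation result as an independent observation.
    `safe_probs` holds the per-image probabilities p_i of being safe."""
    return 1.0 - prod(safe_probs[-window:])

# Even with a 99% per-image probability of being safe, the "at least one
# unsafe image" probability creeps towards certainty as the history grows:
print(prob_user_unsafe([0.99] * 10))               # ~0.10
print(prob_user_unsafe([0.99] * 200, window=200))  # ~0.87
```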

Even with this extension, we only saw marginal improvements over the results presented in the above figure. The obvious problem is that modelling the ML classification of different samples as independent probabilistic events is incorrect: if a user is mis-identified in one sample, there is a good chance they are mis-identified in many samples. Indeed, it is common for samples from one user to contain subsets of images that are very similar, so moderation results across such subsets will be correlated.

Similarly, we tried combining results from Azure and Rekognition to see if there was any improvement, thinking that two different ML moderation classifiers should mis-identify images fairly independently of one another. Unfortunately, we found they frequently mis-identified the same images in our sample set, which is not entirely surprising, as both ML models are probably trained on similar sets of images.

Without detailed calculations of the correlations between results, these models can only be a band-aid over the underlying moderation results, and complex models built around such correlations tend to be fragile as the social and environmental circumstances of the userbase change.
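
As a rough illustration of the kind of correlation check involved (the per-image "wrong" flags would come from comparing each service's verdict against the human labels; nothing here reflects our actual numbers):

```python
def error_overlap(azure_wrong, rekognition_wrong):
    """Compare how often the two services are wrong on the *same* images
    with what we would expect if their mistakes were independent."""
    n = len(azure_wrong)
    p_azure = sum(azure_wrong) / n
    p_rekognition = sum(rekognition_wrong) / n
    observed_both = sum(a and b for a, b in zip(azure_wrong, rekognition_wrong)) / n
    expected_if_independent = p_azure * p_rekognition
    return observed_both, expected_if_independent

# If observed_both is much larger than expected_if_independent, combining the
# two services buys far less than the independence assumption suggests.
```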

Sometimes the Truth Hurts

Our figure indicates an information-theoretic deficiency in the ML moderation results: they simply cannot give us the information we need to make accurate safe/unsafe content decisions at scale. Today, our ML moderation is supported by intelligent sampling and image pre-processing systems that have yielded significant boosts over the accuracies discussed above. These systems allow us to better shape the prior context of ML moderation and partially compensate for the lack of information content in the ML results themselves. Ultimately, though, this is only a stop-gap, and we are now starting to develop our own ML models to augment our human moderation capacity.