Collaborative filtering doesn't work for us
It’s our job to improve the user experience of Chatroulette. The site was created to facilitate meaningful connections and conversations between people. We don’t want to define what ‘meaningful’ means – that’s down to our users – but we can broadly assume that the longer two users speak, the better the time they’re having and the more successful the match.
So, since 2018, we’ve been experimenting with ways to match individuals who are likely to have longer conversations.
One technique we explored was collaborative filtering. This method is widely used in generating recommendations for users across a broad spectrum of markets – suggesting songs they might love, products they might want, or people they might know, for example.
Applied to the Chatroulette context, the rough idea is that if, say, Alice spoke to Bob for a long time, and Alice also spoke to Carol for a long time, then Bob and Carol are more likely than not to speak for a long time too.
We structured feasibility studies around simplified associative models and hypotheses to see if the approach warranted deeper investigation when compared to other techniques.
These studies were carried out by analysing the duration statistics of over 15 million Chatroulette conversations. These conversations took place between over 350 thousand unique users and represented roughly a week’s worth of activity on our site.
Let’s dive into the studies.
First Study: Binary Classifier
The vast majority of conversations on Chatroulette are short-lived. This reflects a common use case, in which someone rapidly flips through potential partners, hitting ‘Next’ until they find someone that sparks their interest. Then they’ll stop and try to strike up a conversation.
The real site dynamics are much more complex than this, but you can see how this common behaviour leads to a majority of short-lived conversations.
Our initial goal was to increase the occurrence of conversations lasting 30 seconds or more, which we defined to be non-trivial. So we were only interested in models that could help us predict when such non-trivial conversations would occur.
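In code, this cut-off is just a duration check. A minimal sketch (the constant and function names here are our own, not Chatroulette's actual code):

```python
# Threshold separating trivial from non-trivial conversations.
NON_TRIVIAL_SECONDS = 30

def is_non_trivial(duration_seconds: float) -> bool:
    """A conversation is non-trivial when it lasts 30 seconds or more."""
    return duration_seconds >= NON_TRIVIAL_SECONDS
```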
Our first study was constructed to see whether or not collaborative filtering could be used as a predictor for non-trivial conversations. We used a very basic associative model:
Simple Associative Model
If there exists a user C such that both user A and user B have had separate, non-trivial conversations with user C, then it's predicted that A and B will also have a non-trivial conversation. Otherwise, it's predicted that A and B will have a trivial conversation.
From here on, for brevity's sake, we'll call a pair of chained conversations across three unique individuals a 2-chain. Our model says that any 2-chain containing two non-trivial conversations implies the conversation linking the ends of the 2-chain should also be non-trivial.
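As a sketch of how this model could be implemented (the data layout and names are our assumptions, not Chatroulette's actual code), the prediction reduces to a common-neighbour check over a graph of non-trivial conversations:

```python
from collections import defaultdict

def build_graph(conversations):
    """Map each user to the set of partners they have had a
    non-trivial (>= 30 s) conversation with.
    conversations: iterable of (user_a, user_b, duration_seconds)."""
    graph = defaultdict(set)
    for user_a, user_b, duration in conversations:
        if duration >= 30:
            graph[user_a].add(user_b)
            graph[user_b].add(user_a)
    return graph

def predict_non_trivial(graph, a, b):
    """Simple associative model: predict a non-trivial conversation
    between a and b iff some third user c links them in a 2-chain
    of non-trivial conversations."""
    return len((graph[a] & graph[b]) - {a, b}) > 0
```

For example, if Alice has had non-trivial conversations with both Bob and Carol, the model predicts that Bob and Carol would also have one.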
To test this, we ran through our conversational data in chronological order as a sort of hindsight simulation: for each conversation between two users A and B, we first made a prediction from the 2-chains observed up to that point, and then compared it with the conversation's actual duration.
Unfortunately, the results showed a false-negative rate of 78%, i.e. most of the time the model failed to predict when a non-trivial conversation was about to occur.
This means that the data had a high occurrence of the following type of chronological sequence:
A had a trivial conversation with C, then C had a trivial conversation with B, then A had a non-trivial conversation with B
The model is significantly worse than a coin flip. Obviously, this is not good; and given that the majority of conversations on the site are trivial, inverting the model and using it as an anti-predictor would just lead to an unacceptably high false-positive rate.
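The hindsight simulation described above can be sketched as follows (our own reconstruction, with invented names; the real pipeline operated on millions of conversations):

```python
from collections import defaultdict

def hindsight_simulation(conversations):
    """Replay (user_a, user_b, duration) records in chronological order.
    Before revealing each conversation's duration, predict its class from
    the non-trivial conversations seen so far, then tally the outcome."""
    partners = defaultdict(set)  # user -> partners from non-trivial chats
    tallies = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, b, duration in conversations:  # assumed sorted by start time
        predicted = len((partners[a] & partners[b]) - {a, b}) > 0
        actual = duration >= 30
        tallies[("t" if predicted == actual else "f")
                + ("p" if predicted else "n")] += 1
        if actual:
            partners[a].add(b)
            partners[b].add(a)
    positives = tallies["fn"] + tallies["tp"]
    fn_rate = tallies["fn"] / positives if positives else 0.0
    return tallies, fn_rate
```

The false-negative rate here is the fraction of actually non-trivial conversations that the model failed to flag in advance.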
Second Study: Information in Conversational Chains
The results of the first study cast doubt on whether or not 2-chains could inform the prediction of a non-trivial conversation. Of course, we wouldn’t discard the entire concept based on such a simple analysis.
What the first study did show us, however, is that we needed to take a deeper look at whether or not 2-chains generally contained enough information to support the prediction of non-trivial conversations.
To this end, we performed another analysis in which we gathered, for every pair of users joined by at least one 2-chain, the pair's 2-chain value – taken here as the maximum, over all 2-chains joining the pair, of the average duration of the chain's two conversations – alongside the duration of the pair's direct conversation.
For these pairs, we analysed the distributions of the 2-chain values separately for pairs who did and did not have a non-trivial direct conversation. These two distributions are depicted in the figure below.
If we want to classify non-trivial conversations by thresholding the 2-chain value, we really don’t want these distributions overlapping in the graph. Unfortunately, we see a very strong overlap between both distributions, which means the 2-chain value is giving very similar information about individuals, regardless of whether or not they’ve had a non-trivial conversation.
Of course, this qualitative interpretation has a formal underpinning; but again, the point here is to get across the general intuition of the results.
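A sketch of how the 2-chain values behind such distributions could be computed, assuming the max-of-averages construction described above (names and data layout are ours):

```python
from collections import defaultdict

def two_chain_values(conversations):
    """For each pair of users joined by at least one 2-chain, return the
    maximum, over all joining 2-chains, of the average of the chain's two
    conversation durations.
    conversations: iterable of (user_a, user_b, duration_seconds)."""
    duration = {}                # frozenset({a, b}) -> duration (last wins)
    partners = defaultdict(set)  # user -> users they have spoken with
    for a, b, d in conversations:
        duration[frozenset((a, b))] = d
        partners[a].add(b)
        partners[b].add(a)

    values = {}
    users = list(partners)
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            middles = (partners[a] & partners[b]) - {a, b}
            if middles:
                values[frozenset((a, b))] = max(
                    (duration[frozenset((a, c))] + duration[frozenset((b, c))]) / 2
                    for c in middles
                )
    return values
```

One could then split these values by whether each pair's direct conversation cleared the 30-second bar and compare the two distributions.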
Third Study: Different Thresholds and 2-chain Constructions
In a final effort to salvage the collaborative filtering idea, we relaxed the definition of a non-trivial conversation and investigated whether or not some construction of a 2-chain duration could be used to classify conversations falling above or below some arbitrary threshold.
For this analysis we went beyond constructing the 2-chain value as the maximum average over the 2-chains joining two users, and considered various combinations of standard (arithmetic) and geometric averages of the 2-chain conversation durations. We ended up analysing several such 2-chain mappings.
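To make the idea concrete, here is an illustrative sketch of the kinds of per-chain scores and aggregations involved (this is not the exact set of mappings we tested):

```python
from math import sqrt

def arithmetic_mean(d1, d2):
    """Standard average of a 2-chain's two conversation durations."""
    return (d1 + d2) / 2

def geometric_mean(d1, d2):
    """Geometric average; penalises unbalanced chains, so one long and
    one very short conversation score lower than two moderate ones."""
    return sqrt(d1 * d2)

def two_chain_value(chains, per_chain=arithmetic_mean, aggregate=max):
    """Collapse all 2-chains joining a pair – each a (d1, d2) tuple of
    durations – into a single 2-chain value."""
    return aggregate(per_chain(d1, d2) for d1, d2 in chains)
```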
The results corresponding to each of these constructions are depicted in the figure below.
Contours have been overlaid on all the density plots. The qualitative characteristic we’d love to see in any of these plots is two well-defined islands that could be cleanly separated into opposite quadrants for some selection of horizontal and vertical lines.
In such a scenario, the horizontal line would represent a 2-chain threshold that would divide the population into pairs that had conversations below/above the threshold indicated by the vertical line. Of course, if those islands were joined by bridges or the like, that would still be workable.
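The quadrant test can be sketched as a simple grid search (hypothetical names; the actual analysis was done visually over the density plots):

```python
def quadrant_agreement(points, chain_threshold, duration_threshold):
    """points: (two_chain_value, direct_duration) pairs. Returns the
    fraction landing in the two 'agreeing' quadrants: high 2-chain value
    with a long conversation, or low value with a short one."""
    agree = sum(
        (v >= chain_threshold) == (d >= duration_threshold)
        for v, d in points
    )
    return agree / len(points)

def best_split(points, candidates):
    """Search candidate horizontal/vertical lines for the cleanest split."""
    return max(
        ((c, t, quadrant_agreement(points, c, t))
         for c in candidates for t in candidates),
        key=lambda ctq: ctq[2],
    )
```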
Unfortunately, none of the graphs show anything resembling this scenario. There’s a bit of an island for some 2-chain constructions out around the 200-second duration (note that we’re using log scales on the charts), but there’s no horizontal line that cleanly cleaves it from the main island near the origin.
If nothing else, these graphs clearly show that no matter what 2-chain value we use, it’s not going to provide us with a decent binary classifier for predicting conversation duration.
Again, this qualitative assessment was underpinned with a quantitative analysis, but you get the picture.
The Data Hath Spoken
The data have clearly shown us that collaborative filtering, despite proving a successful technique for many other products across a range of contexts, simply wasn’t a match for Chatroulette. It wasn’t going to get us anything that resembled intelligent matchmaking.
In another post I’ll speak about a further, more fundamental obstacle to the collaborative filtering approach; and in future posts, we’ll share approaches that have emerged as successes.