Collaborative filtering doesn't work for us

Wed Jan 01 2020

It’s our job to improve the user experience of Chatroulette. The site was created to facilitate meaningful connections and conversations between people. We don’t want to define what ‘meaningful’ means – that’s down to our users – but we can broadly assume that the longer two users speak, the better the time they’re having and the more successful the match.

So, since 2018, we’ve been experimenting with ways to match individuals who are likely to have longer conversations.

One technique we explored was collaborative filtering. This method is widely used in generating recommendations for users across a broad spectrum of markets – suggesting songs they might love, products they might want, or people they might know, for example.

In the Chatroulette context, the rough idea is that if, say, Alice spoke to Bob for a long time and Alice also spoke to Carol for a long time, then Bob and Carol are more likely than not to speak for a long time too.

We structured feasibility studies around simplified associative models and hypotheses to see if the approach warranted deeper investigation when compared to other techniques.

These studies were carried out by analysing the duration statistics of over 15 million Chatroulette conversations. These conversations took place between over 350 thousand unique users and represented roughly a week’s worth of activity on our site.

Let’s dive into the studies.

First Study: Binary Classifier

The vast majority of conversations on Chatroulette are short-lived. This reflects a common use case, in which someone rapidly flips through potential partners, hitting ‘Next’ until they find someone that sparks their interest. Then they’ll stop and try to strike up a conversation.

The real site dynamics are much more complex than this, but you can see how this common behaviour leads to a majority of short-lived conversations.

Our initial goal was to increase the occurrence of conversations lasting 30 seconds or more, which we defined to be non-trivial. So we were only interested in models that could help us predict when such non-trivial conversations would occur.

Our first study was constructed to see whether or not collaborative filtering could be used as a predictor for non-trivial conversations. We used a very basic associative model:

Simple Associative Model

If there exists a user C, such that both user A and user B have had separate, non-trivial conversations with user C, then it's predicted that A and B will also have a non-trivial conversation. Otherwise, it's predicted that A and B will have a trivial conversation.

From here on in, for brevity’s sake we'll call a pair of chained conversations across three unique individuals a 2-chain. Our model says that any 2-chain containing two non-trivial conversations implies the conversation linking the ends of the 2-chain should also be non-trivial.

To test this, we ran through our conversational data in chronological order as a sort of hindsight simulation. So, if we had a 2-chain where A talked to C and then C talked to B, we ran the model to predict the outcome of A talking with B, if that data was present in our records. (This was just a naive first-order analysis, but it was also a decent way to see if we were on the right track.)
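The replay above can be sketched roughly as follows. Note that the `simulate` function and its `(user_a, user_b, duration)` input schema are hypothetical simplifications for illustration, not our production pipeline:

```python
from collections import defaultdict

NON_TRIVIAL = 30  # seconds; the study's threshold for a non-trivial conversation

def simulate(conversations):
    """Replay conversations in chronological order, predicting each outcome
    from the 2-chains observed so far.

    `conversations` is a list of (user_a, user_b, duration) tuples,
    already sorted by start time.
    """
    # user -> set of users they've had a non-trivial conversation with
    partners = defaultdict(set)
    results = []  # (predicted_non_trivial, actual_non_trivial) per conversation
    for a, b, duration in conversations:
        # Model: predict non-trivial iff some shared user c has had separate
        # non-trivial conversations with both a and b (i.e. a 2-chain exists).
        predicted = bool(partners[a] & partners[b])
        actual = duration >= NON_TRIVIAL
        results.append((predicted, actual))
        if actual:
            partners[a].add(b)
            partners[b].add(a)
    return results
```

Running this over the week's data in order gives, for every conversation, the model's hindsight prediction alongside the real outcome.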

Unfortunately, the results showed a false-negative rate of 78%, i.e. most of the time the model failed to predict when a meaningful conversation was about to occur.

This means that the data had a high occurrence of the following type of chronological sequence:

  1. A had a trivial conversation with C, then
  2. C had a trivial conversation with B, then
  3. A had a non-trivial conversation with B

The model is significantly worse than a coin-flip. Obviously, this is not good; and given that the majority of conversations on the site are trivial, using our model as an anti-predictor would just lead to an unacceptably high false-positive rate.
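In confusion-matrix terms, that failure mode is a false negative: the chain looked trivial, so the model predicted a trivial conversation, but the direct conversation turned out to be non-trivial. A quick sketch of the metric, assuming a hypothetical `results` list of `(predicted, actual)` booleans like those produced by a hindsight replay:

```python
def false_negative_rate(results):
    """Fraction of actually non-trivial conversations that the model
    predicted to be trivial.

    `results` is a list of (predicted, actual) boolean pairs.
    """
    # Keep only the conversations that were actually non-trivial.
    predictions_for_actual_positives = [p for p, a in results if a]
    if not predictions_for_actual_positives:
        return 0.0
    # Of those, count how many the model called trivial (predicted False).
    misses = sum(1 for p in predictions_for_actual_positives if not p)
    return misses / len(predictions_for_actual_positives)
```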

Second Study: Information in Conversational Chains

The results of the first study cast doubt on whether or not 2-chains could inform the prediction of a non-trivial conversation. Of course, we wouldn’t discard the entire concept based on such a simple analysis.

What the first study did show us, however, is that we needed to take a deeper look at whether or not 2-chains generally contained enough information to support the prediction of non-trivial conversations.

To this end, we performed another analysis in which we gathered all pairs (denoted here by p) of individuals connected by a direct conversation and one or more 2-chains. To each of these pairs, we associated two values: the duration of their direct conversation, d(p), and the maximum average duration of all 2-chains joining them in our data:

  c(p) = max over (d₁, d₂) ∈ S(p) of (d₁ + d₂) / 2

where

  S(p) = the set of duration pairs (d₁, d₂) of the two conversations in each 2-chain joining the pair p

with each element of S(p) represented as a 2-component vector. Obviously, I'm being loose with the notation here. The point isn't to lay out pages of mathematical formalism, though I'm always down for that.
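A minimal sketch of this maximum-average computation, assuming a hypothetical `durations` map from unordered user pairs to direct conversation lengths:

```python
def max_avg_2chain(pair, durations):
    """Maximum average duration over all 2-chains joining a pair of users.

    `durations` maps frozenset({u, v}) -> direct conversation duration in
    seconds; a 2-chain joining (a, b) is a middle user m who spoke to both.
    Returns None if no 2-chain joins the pair.
    """
    a, b = pair
    users = {u for key in durations for u in key}
    chain_averages = []
    for m in users - {a, b}:
        d1 = durations.get(frozenset({a, m}))
        d2 = durations.get(frozenset({m, b}))
        if d1 is not None and d2 is not None:
            chain_averages.append((d1 + d2) / 2)
    return max(chain_averages) if chain_averages else None
```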

For these pairs, we analysed the distributions of the 2-chain values separately for individuals who did and did not have a trivial direct conversation. These two distributions are depicted in the figure below.

If we want to classify non-trivial conversations by thresholding the 2-chain value, we really don’t want these distributions overlapping in the graph. Unfortunately, we see a very strong overlap between both distributions, which means the 2-chain value is giving very similar information about individuals, regardless of whether or not they’ve had a non-trivial conversation.

Of course, this qualitative interpretation has a formal underpinning; but again, the point here is to get across the general intuition of the results.

Third Study: Different Thresholds and 2-chain Constructions

In a final effort to salvage the collaborative filtering idea, we relaxed the definition of a non-trivial conversation and investigated whether or not some construction of a 2-chain duration could be used to classify conversations falling above or below some arbitrary threshold.

For this analysis we went beyond constructing the 2-chain value as the maximum average of 2-chains joining users and considered various combinations of standard and geometric averages of 2-chain conversation durations, with the collection of geometric averages being denoted as:

  G(p) = { √(d₁ · d₂) : (d₁, d₂) a pair of conversation durations from a 2-chain joining the pair p }

We ended up analysing the following 2-chain mappings: the maximum and the mean of the standard averages, and the maximum and the mean of the geometric averages.
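Assuming each 2-chain contributes a pair of durations `(d1, d2)`, the standard and geometric averages, together with their max/mean aggregations, might be computed like this (`chain_stats` is a made-up helper for illustration):

```python
import math

def chain_stats(chain_durations):
    """Aggregate statistics over the 2-chains joining one pair of users.

    `chain_durations` is a non-empty list of (d1, d2) duration pairs,
    one per 2-chain. Returns the max and mean of the standard
    (arithmetic) averages and of the geometric averages.
    """
    arith = [(d1 + d2) / 2 for d1, d2 in chain_durations]
    geo = [math.sqrt(d1 * d2) for d1, d2 in chain_durations]
    return {
        "max_arith": max(arith),
        "mean_arith": sum(arith) / len(arith),
        "max_geo": max(geo),
        "mean_geo": sum(geo) / len(geo),
    }
```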

The results corresponding to each of these constructions are depicted in the figure below.

Contours have been overlaid on all the density plots. The qualitative characteristic we’d love to see in any of these plots is two well-defined islands that could be cleanly separated into opposite quadrants for some selection of horizontal and vertical lines.

In such a scenario, the horizontal line would represent a 2-chain threshold that would divide the population into pairs that had conversations below/above the threshold indicated by the vertical line. Of course, if those islands were joined by bridges or the like, that would still be workable.

Unfortunately, none of the graphs show anything resembling this scenario. There’s a bit of an island for some 2-chain constructions out around the 200 second duration (note that we’re using log scales on the charts), but there’s no horizontal line that cleanly cleaves this from the main island near the origin.

If nothing else, these graphs clearly show that no matter what 2-chain value we use, it’s not going to provide us with a decent binary classifier for predicting conversation duration.

Again, this qualitative assessment was underpinned with a quantitative analysis, but you get the picture.

The Data Hath Spoken

The data have clearly shown us that, despite having proven a successful technique for many other products across a range of contexts, collaborative filtering simply wasn’t a match for Chatroulette. It wasn’t going to get us anything that resembled intelligent matchmaking.

In another post I’ll speak about a further, more fundamental obstacle to the collaborative filtering approach; and in future posts, we’ll share approaches that have emerged as successes.