How to define a User? Pt. 1

Wed Jan 01 2020

We don’t require users to log in or register to use Chatroulette. That’s part of what makes the platform unique and exciting. Thanks to the feeling of anonymity, users are able to express themselves more freely and in a wider range of ways.

At the same time, the lack of log-in and registration also makes it difficult for us to create a strongly defined notion of a user identity. This is notable because our ability to track a user’s behaviour on the site beyond the current session is integral to our ability to limit anti-social behaviour within the community.

Indeed, some users take advantage of the fact that we don’t require log-in or ID to behave anti-socially. When investigating the severity of this problem, we found that it was non-trivial.

All this means we need to find a way, or ways, to define a user’s identity within the system and weave a history of their behaviour together, while retaining the anonymity and freedom that lends Chatroulette its magic.

So, how do you define a Chatroulette user?

Ants, Artillery and ID Degeneracy

Today, Chatroulette uses a combination of transient identifiers to reconstruct an internal representation of a user. Any given identifier associated with an individual is ephemeral and not sufficient to uniquely identify that individual on its own.

This creates an information-theoretic barrier that intrinsically helps protect people's identity on the site. It also means we need to reconstruct the logical user from these transient IDs at scale, in real time, across a wide variety of client platforms.

Assuming we can deal with the technical issues, the first question we have to ask is whether we’re “killing ants with field artillery” in our ID reconstruction. After all, it’s possible that one of our 'transient' identifiers isn't so transient and may be more than sufficient for our needs.

The figure below shows this isn’t the case:

I’ll come back to how we made this graph, but first let’s discuss what it’s really saying. Here we've plotted results for three ID types that we internally track: a Session ID which is stored as a cookie on the client; the IP address of the connecting session; and an ephemeral hash of a Face ID that only has meaning for 24 hours after its last use.

For every session, we run through our ID inference to associate any number of these IDs to that session, and Figure 1 shows us the Cumulative Distribution Function (CDF) for the degeneracies of each ID type across sessions (here we used a month's worth of data).

So, we see that the share of sessions associated with only one facial hash is about 38%; with one session ID, about 77%; and with one IP, around 92%. Following the curves, we can then read off the percentage of sessions that have at most 𝑥 identities of a given type associated with them, where 𝑥 is the abscissa of the graph.
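To make the reading of the curves concrete, here is a minimal sketch of how an empirical CDF of ID degeneracy could be computed for one ID type. The function name `degeneracy_cdf` and the toy counts are hypothetical and are not taken from our pipeline or data; the input is assumed to be a list holding, for each session, the number of distinct IDs of that type inferred for it.

```python
from collections import Counter


def degeneracy_cdf(counts):
    """Map each degeneracy k to the fraction of sessions with at most k IDs.

    `counts` holds one integer per session: the number of distinct IDs of a
    single type (e.g. IPs) associated with that session.
    """
    if not counts:
        return {}
    histogram = Counter(counts)
    total = len(counts)
    cdf = {}
    running = 0
    for k in sorted(histogram):
        running += histogram[k]
        cdf[k] = running / total
    return cdf


# Hypothetical toy data: distinct IPs inferred for each of ten sessions.
ip_counts = [1, 1, 1, 1, 1, 1, 1, 1, 2, 3]
cdf = degeneracy_cdf(ip_counts)

# Fraction of sessions that a single IP would NOT be enough to track.
not_covered = 1 - cdf[1]
```

In this toy example `cdf[1]` plays the role of the first point on a curve in the figure, and `1 - cdf[1]` is the share of sessions that one ID of that type would miss.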

The upshot is that, even in the absolute best-case scenario, about 8% of our population would not be well tracked by using only one of these IDs. This number is way too high to achieve our goals of minimising anti-social behaviour on the site.

Graphs, Entropy and the Human Condition

We've jumped the gun by showing the above figure. It demonstrates that ID degeneracy is something we really need to deal with; but to get there, we had to develop a way to find out which IDs could be associated to a session. So, let's get into that.

First, we had to come to grips with some ID types possibly resolving to more than one individual. The biggest question here was around using IP addresses, as we know many of our users are university students who will be issued IPs by large campus DHCP servers.

Fortunately, since session IDs are guaranteed (effectively) to be unique we could readily quantify the entropy of the distribution of sessions over IPs. This analysis told us we really didn't have to worry about IPs of different users on the site colliding, provided we didn't use an IP association for too long in our inference. Entropy also vindicated our other transient IDs in a similar fashion.
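As an illustration of the kind of calculation involved, the sketch below computes the Shannon entropy of how sessions are spread over IPs within a time window and normalises it by its maximum. This is an assumption about one reasonable way to set up the measurement, not a description of the exact analysis we ran; the function names and the sample pairs are hypothetical. A value near 1 means almost every session sits on its own IP (few collisions), while a value near 0 means sessions pile up on a handful of IPs.

```python
import math
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution given by raw counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def normalised_ip_entropy(session_ip_pairs):
    """Entropy of sessions over IPs divided by its maximum possible value.

    `session_ip_pairs` is an iterable of (session_id, ip) tuples observed in
    one time window. Duplicated pairs are counted once.
    """
    pairs = set(session_ip_pairs)
    if len(pairs) < 2:
        return 1.0  # zero or one observation: no collision is possible
    sessions_per_ip = Counter(ip for _, ip in pairs)
    entropy = shannon_entropy_bits(sessions_per_ip.values())
    return entropy / math.log2(len(pairs))


# Hypothetical windows (IPs taken from documentation address ranges).
no_collisions = [("s1", "192.0.2.1"), ("s2", "192.0.2.2"),
                 ("s3", "192.0.2.3"), ("s4", "192.0.2.4")]
some_collisions = [("s1", "192.0.2.1"), ("s2", "192.0.2.1"),
                   ("s3", "192.0.2.3"), ("s4", "192.0.2.4")]
all_collide = [("s1", "192.0.2.1"), ("s2", "192.0.2.1"),
               ("s3", "192.0.2.1"), ("s4", "192.0.2.1")]

spread = [normalised_ip_entropy(w)
          for w in (no_collisions, some_collisions, all_collide)]
```

Repeating the measurement for windows of increasing length is one way to see how long an IP association can be trusted before collisions start to matter.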

With that bit of analysis, it was clear that we needed to be primarily focused on dealing with users shedding their transient IDs between site engagements. Of course, people may jump IPs or flush the session ID cookie, etc. for completely benign reasons. These are not the people we’re concerned with building a user identity for.

We do want to build an identity for people shedding their transient IDs for the purposes of covering up their anti-social behaviour. Unfortunately, these users are typically adept at shedding IDs quickly, and rapidly re-engaging the site (e.g. spammers). So, our ID inference engine really had to be tailored to this use case first and foremost.

Aaaand... we had to be sure that our ID inference engine could work in real-time, across millions of sessions per day, across a wide variety of platforms.

To this end, we developed our ID reconstruction around a graph store that tracks all transient IDs that have been observed within the system. Very broadly, our ID inference can be conceptualised as an associative graph that evolves in time.

If one thinks of the graph nodes as particular transient IDs (ignoring type), a user is identified with a collection of nodes that are (or were) connected at some point in time. Nodes in this graph continuously blink out of existence, as IDs are set to disappear after a period of non-use. A user only disappears from the system if all ID nodes connected to them have died off.
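The conceptual model above can be sketched as a small in-memory graph. This is an illustrative toy rather than our graph store: the class `AssociationGraph`, the single shared `ttl`, and the integer timestamps are all assumptions made for the example. IDs seen together in a session are linked, a user is a connected component, and a component is dropped only once every node in it has gone unused for longer than the time-to-live.

```python
from collections import defaultdict


class AssociationGraph:
    """Toy associative graph of transient IDs with expiry (hypothetical)."""

    def __init__(self, ttl):
        self.ttl = ttl                  # allowed period of non-use
        self.last_seen = {}             # node -> time of last use
        self.edges = defaultdict(set)   # node -> neighbouring nodes

    def observe(self, ids, now):
        """Record that all `ids` were seen together in one session."""
        ids = list(ids)
        for node in ids:
            self.last_seen[node] = now
            self.edges[node].update(o for o in ids if o != node)

    def _alive(self, node, now):
        return now - self.last_seen[node] <= self.ttl

    def users(self, now):
        """Return components with a live node; prune fully dead ones."""
        visited, result = set(), []
        for start in list(self.last_seen):
            if start in visited:
                continue
            component, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(self.edges[node] - component)
            visited |= component
            if any(self._alive(n, now) for n in component):
                result.append(component)
            else:
                # Every ID tied to this user has died off: forget the user.
                for n in component:
                    del self.last_seen[n]
                    self.edges.pop(n, None)
        return result


graph = AssociationGraph(ttl=10)
graph.observe([("session", "s1"), ("ip", "192.0.2.1")], now=0)
graph.observe([("session", "s2"), ("ip", "192.0.2.1")], now=5)  # new cookie, same IP
graph.observe([("session", "s3"), ("ip", "198.51.100.7")], now=6)

users_at_8 = graph.users(now=8)     # two users; s1 and s2 are linked via the IP
users_at_100 = graph.users(now=100)  # everything has expired
```

Note that dead nodes are kept for as long as their component has a live member, which matches the "are (or were) connected" wording: an expired ID can still be the link that ties two live IDs to the same user.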

While the ID inference can be thought of as an associative graph spanning all ID types, we only implemented a graph store to perform initial model validation. In production, the graph store and operations are replaced with a stateful stream processor. With the conversion to a stream, we gained the concept of temporal ordering of ID associations.

In turn, this allowed us to exploit the idea that user identities can only merge (i.e. they never split), which is a reasonable assumption given that we’ve selected our IDs to be uniquely identifying. Engineering our algorithms and analytics around this broken symmetry (i.e. embracing a bit of entropy in our calculations) afforded a huge simplification and speedup in the computations supporting the ID inference.
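A standard data structure that fits a merge-only model is a disjoint-set (union-find) forest, sketched below. This is offered as an illustration of why "identities never split" simplifies things, not as a description of our production stream processor; the class name `IdentityMerger` and the sample IDs are hypothetical. Because sets are only ever unioned, each incoming association can be applied in arrival order with near-constant work and no graph traversal.

```python
class IdentityMerger:
    """Merge-only identity tracking with a union-find forest (hypothetical)."""

    def __init__(self):
        self.parent = {}  # node -> parent node; a root is its own parent

    def find(self, node):
        """Return the representative of the identity that `node` belongs to."""
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def observe(self, ids):
        """Apply one stream event: all `ids` were seen in the same session."""
        ids = list(ids)
        if not ids:
            return None
        root = self.find(ids[0])
        for other in ids[1:]:
            other_root = self.find(other)
            if other_root != root:
                self.parent[other_root] = root  # identities merge, never split
        return root


merger = IdentityMerger()
merger.observe(["session:s1", "ip:192.0.2.1"])
merger.observe(["session:s2", "ip:198.51.100.7"])
separate_before = merger.find("session:s1") != merger.find("session:s2")

# A later event links the second session to the first IP: two users become one.
merger.observe(["session:s2", "ip:192.0.2.1"])
merged_after = merger.find("session:s1") == merger.find("ip:198.51.100.7")
```

Expiry of unused IDs is left out here to keep the example short; in a real system it would have to be layered on top, for instance by tracking last-use times alongside the forest.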

This is really what allows us to not only perform real-time ID inference but also generate real-time analytics on top of those inferred user IDs. On top of that, we can also add or remove ID types from the system in situ without disrupting the existing ID inference calculations and results.

At the end of the day, this is the engine we used to generate our figure, which validates the necessity for ID inference in our application.

Hopefully, we soon won’t need the ID inferencing engine to support our efforts to moderate anti-social behaviour. Even in the case of a site utopia, the ID engine will still be a key part of our real-time analytics pipeline, providing the backbone for hundreds of telemetric signals and KPIs.

We Have a Solution

To sum up, today we have a sound way to infer a logical user within the system, without compromising the user’s anonymity external to the system.

This is a great first step. Now we can set about improving how it works, as well as putting it to a wider range of uses.