Unethical uses for public Twitter data

Adrian Short · 30 October 2014

The outcry about Samaritans Radar has highlighted a common, false and extremely dangerous argument: Your tweets (or other social media posts) are public and so you’ve only got yourself to blame if someone uses them in ways you don’t like. Consequently, if you don’t want to suffer harm you should watch what you say.

This is essentially victim blaming and a call for self-censorship. It’s disturbing that the Samaritans themselves use this argument in their supposed defence rather than taking a more critical view of their own responsibility to act ethically.

So what’s wrong with this argument? Obviously there are extreme cases where it’s true. We all have to take responsibility for ourselves not to do manifestly harmful things where the consequences are self-evident. Someone who tweets “Off on holiday for 2 weeks. Keys under the mat. 55 Acacia Avenue, Stockton” has to take a fair share of the responsibility when they come home to a crime scene, although we should also note that people’s recklessness or stupidity isn’t a licence to harm them.

But the bigger problem with things like public tweets is that no-one knows what information can be derived from them, either now or in the future. I write as a data analyst who’s done a fair bit of work with this kind of material. What follows are a few techniques that aren’t at all obvious to the average Twitter user. They go far beyond reading the surface text (or metadata) of an individual tweet. And these are just some of the techniques currently used to mine this data, ethically or unethically, legally or illegally. There is absolutely no defence against data analysis methods yet to be discovered. By definition, we cannot post tweets today and meaningfully consent to methods that don’t yet exist being applied to them later.

Bear in mind that all these techniques can be used to target specific users or to trawl the whole network for interesting data. They can be used in combination to strengthen each other. And perhaps most vitally, the results from these methods can be monitored to see how they change over time.

Sentiment analysis looks at text and tries to calculate its emotional content. Is this tweet happy or sad, angry or calm, friendly or hostile? Marketers typically use this to monitor public feeling about their brands. If they’ve just launched a new product, is the response generally positive or negative? While this involves analysing individuals’ data, knowledge is being sought about the brand name or hashtag, not those individuals themselves. But of course you can apply it to individuals too. Is a user generally happy or sad? Can we find abusive people? Can we take this list of job applicants and weed out the people with what we consider to be a consistently bad attitude? Of course we could. Would it be ethical? That would depend on a number of factors, not least the data subject’s consent. But a person applying for a job today very likely wouldn’t have realised that a series of tweets they wrote two years ago could mean the difference between being shortlisted for interview and being automatically rejected without a human being ever looking at their application. This is a long way from saying “Don’t post stupid stuff online because a future employer might see it.”
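
To make the mechanics concrete, here’s a minimal sketch of the crudest form of the technique: lexicon-based scoring. The word lists and example tweets are invented for illustration; real systems use far larger lexicons or trained models, but the principle of boiling a person’s words down to a single score is the same.

```python
# Toy lexicon-based sentiment scoring: count "positive" and "negative"
# words in each tweet and average the scores across a user's history.
# The word lists and tweets below are invented for illustration only.

POSITIVE = {"happy", "great", "love", "excited", "wonderful"}
NEGATIVE = {"sad", "awful", "hate", "angry", "miserable"}

def tweet_score(text):
    """Return a score in [-1, 1]: +1 if all scored words are positive, -1 if all negative."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def user_attitude(tweets):
    """Average sentiment across a user's tweet history."""
    scores = [tweet_score(t) for t in tweets]
    return sum(scores) / len(scores) if scores else 0.0

history = [
    "Another awful Monday, I hate this commute",
    "So happy with the new place, love it",
]
print(user_attitude(history))   # a single number standing in for a person
```

An employer screening applicants would simply run something like this over each applicant’s timeline and sort by the result – no human reads a word.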

Stylometry is a set of text analysis methods that tries to identify the style of a piece of writing and tag it with a unique “fingerprint”. One application of this is to try to identify the author of a piece of text. For example, there is ongoing investigation into the identity of the author of Shakespeare’s plays. We can create fingerprints for Shakespeare’s writing, do likewise for other contemporary writers and see whether any of them match. This could be used to settle academic arguments or legal ones: Who’s the legitimate author of a work? It can also be used for police investigations. If like me you’ve got a public blog with tens of thousands of words of your writing identified with your real name, it’s probably not a good idea to justify your terrorist attacks with a rambling manifesto. This is similar to how the Unabomber, Ted Kaczynski, was caught: after his writing was published anonymously in two newspapers his brother recognised the style and ideas and alerted police to him as a potential suspect. Apply this technique to public social media text and you could probably unmask the identities of large numbers of anonymous and pseudonymous users. Would it be ethical? In most cases, no. Someone posting under a pseudonym clearly doesn’t consent to having their legal identity revealed, and to do so publicly would only be justified in cases of serious wrongdoing. Doing it automatically for large numbers of social media users would cause massive social damage – turning previously safe spaces hostile – as well as the obvious personal damage to those identified.
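
Here’s a hedged sketch of one common stylometric approach: compare how often different authors use a fixed set of function words (“the”, “of”, “and”…), because those frequencies are hard to fake and survive changes of topic. The word list and sample texts are placeholders; serious work uses hundreds of features and measures such as Burrows’ Delta, but the shape of the method is this.

```python
# Sketch: function-word frequency "fingerprints" compared by cosine similarity.
# The word list and the sample texts are placeholders for illustration only.
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "for"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints (closer to 1.0 = more similar style)."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm = math.sqrt(sum(a * a for a in fp_a)) * math.sqrt(sum(b * b for b in fp_b))
    return dot / norm if norm else 0.0

known_blog = "text of a blog published under a real name goes here"
anon_posts = "text from an anonymous account suspected to be the same person goes here"
print(similarity(fingerprint(known_blog), fingerprint(anon_posts)))
```

Run that comparison between one known author and thousands of pseudonymous accounts and you have an automatic unmasking machine.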

Most of us are familiar with face recognition – automatically tagging photos with the identities of the people pictured. But consent is crucial here to ethical use. A system that tried to identify and tag every person in a public photo on social media and tie all the photos for each person (not account) together would leak huge amounts of unwanted and probably unexpected information about individuals, particularly if it tagged them with people’s legal identities rather than online pseudonyms. This isn’t just about the drunken party you attended at college. It’s about surfacing pictures of people on political demonstrations and in all kinds of legal yet sensitive contexts where the subject doesn’t (or didn’t) consent to or realise that their presence would be noted and published forever.
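
As an illustration, here’s a sketch of how photos could be tied together by face rather than by account. It assumes the open-source face_recognition library (my choice, not anything the platforms are known to run) and placeholder file names; any modern face-embedding toolkit works the same way: compute an embedding for each face, then group photos whose embeddings match.

```python
# Sketch: group public photos by the faces that appear in them, across accounts.
# Assumes the open-source face_recognition library; photo paths are placeholders.
import face_recognition

photos = ["party_2009.jpg", "protest_2013.jpg", "profile_pic.jpg"]  # placeholders

clusters = []  # list of (representative face encoding, [photo paths])
for path in photos:
    image = face_recognition.load_image_file(path)
    for encoding in face_recognition.face_encodings(image):
        for rep, paths in clusters:
            # True if this face matches one we've already seen in another photo
            if face_recognition.compare_faces([rep], encoding)[0]:
                paths.append(path)
                break
        else:
            clusters.append((encoding, [path]))

for i, (_, paths) in enumerate(clusters):
    print(f"person {i}: appears in {paths}")
```

Each cluster is one person’s photographic trail, regardless of who posted the pictures or under what account name.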

Similarly, there’s location analysis. Many tweets are published with the user’s location attached (though not all users realise they’re doing this). But it’s increasingly easy to automatically identify the location of a photograph or video by comparing it with other media of visually similar known locations. We can reasonably assume that this technology will be robust and widespread within a few years. The effect will be that most outdoor photographs can be automatically located, which will of course reveal the locations and activities of individuals both at the time and in the past whether they want that or not. This could reveal someone’s home or workplace, or where (or whether) they’re on holiday. For pseudonymous and anonymous users it could be used to find their legal identity. Apply this method to a large number of accounts and you can do social network analysis by the backdoor, tying together the people who are or were in the same place at the same time, regardless of whether they have any other obvious relationship online.
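
That “backdoor” social network analysis is simple to sketch once you already have location-tagged posts, whether from geotags or from automated photo localisation. The usernames, coordinates and bucket sizes below are invented for illustration: bin each post by place and time window, then report which users coincide.

```python
# Sketch: infer hidden relationships from co-presence.
# Each record is (username, latitude, longitude, timestamp); all values are invented.
from collections import defaultdict
from datetime import datetime
from itertools import combinations

posts = [
    ("@alice", 51.5007, -0.1246, datetime(2014, 10, 18, 14, 5)),
    ("@bob",   51.5008, -0.1247, datetime(2014, 10, 18, 14, 20)),
    ("@carol", 53.4084, -2.9916, datetime(2014, 10, 18, 14, 10)),
]

def cell(lat, lon, ts):
    """Bucket a post into a roughly 100 m grid square and a one-hour time window."""
    return (round(lat, 3), round(lon, 3), ts.strftime("%Y-%m-%d %H"))

buckets = defaultdict(set)
for user, lat, lon, ts in posts:
    buckets[cell(lat, lon, ts)].add(user)

for place_time, users in buckets.items():
    for a, b in combinations(sorted(users), 2):
        print(f"{a} and {b} were at {place_time} together")
```

Neither account needs to follow, mention or even know about the other for the link to appear.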

Social network analysis is a set of techniques for finding and analysing the relationships between people. These relationships could be explicit (people who choose to follow others on Twitter), weakly explicit (people who participate in Twitter conversations with each other) or implicit (people who post on similar topics but don’t mention each other). You can find the status of people within a group – are they a central connector between large numbers of other members or do they hover on the fringes of the group? Andy Baio’s analysis of #Gamergate provides a straightforward example. Social network analysis lets you find unusual group memberships – people who are in seemingly unrelated groups. Combined with time series analysis it lets you track individual group membership and group composition over time. Who’s joining, who’s leaving, who’s moving to the centre, who’s drifting to the edge. This works at a far higher level than individual tweets or a casual reading of someone’s profile and can be hugely revealing about someone’s relationships, interests and status.
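
A brief sketch of the mechanics, assuming the networkx library and an invented set of mention pairs: build a graph from who replies to or mentions whom, then rank users by betweenness centrality to separate the central connectors from the fringe members.

```python
# Sketch: rank users in a conversation network by how central they are.
# Edges mean "user A mentioned or replied to user B"; the pairs are invented.
import networkx as nx

mentions = [
    ("@alice", "@bob"), ("@alice", "@carol"), ("@bob", "@carol"),
    ("@dave", "@alice"), ("@eve", "@dave"),
]

G = nx.Graph()
G.add_edges_from(mentions)

# Betweenness centrality: how often a user sits on the shortest path between
# two others, i.e. how much of a "connector" they are within the group.
for user, score in sorted(nx.betweenness_centrality(G).items(),
                          key=lambda kv: kv[1], reverse=True):
    print(f"{user}: {score:.2f}")
```

Recompute the same scores every week and you are tracking who is moving to the centre of a group and who is drifting to its edge.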

Every tweet is posted with a timestamp (time and date) and these can be hugely and unexpectedly revealing when analysed in bulk. It’s easy to find trivial and hopefully obvious cases of people leaking info through timestamps: people off sick from work posting photos from the beach. But many applications are much less obvious. Timestamps can be used to analyse an individual or group’s activity over time. How often do they post? What times of the day do they post? Which days of the week do they post? Are there unusual gaps in their posting history? We can use these to infer people’s identities, their relationships and group memberships, their locations (tweet timestamps are strongly correlated with timezones). Large changes in tweet volumes could be used to infer things about people’s health, mood and of course death. Who works nine to five and who works night shifts? Who’s moved to another country?
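
Here is a sketch of what bulk timestamps alone give away, assuming nothing more than a list of posting times for one account (the times below are invented): a posting-hour histogram, which leaks timezone and work pattern, and the longest gap in activity.

```python
# Sketch: what a bare list of posting timestamps reveals on its own.
# The timestamps are invented; real analysis would use a user's full history.
from collections import Counter
from datetime import datetime

timestamps = [
    datetime(2014, 10, 1, 23, 15), datetime(2014, 10, 2, 0, 40),
    datetime(2014, 10, 2, 22, 55), datetime(2014, 10, 6, 23, 30),
]

# Posting-hour histogram: a cluster of late-night activity suggests a timezone,
# night shifts, insomnia, or simply when this person is free to talk.
hours = Counter(ts.hour for ts in timestamps)
for hour in sorted(hours):
    print(f"{hour:02d}:00  {'#' * hours[hour]}")

# Unusual gaps in an otherwise regular history can hint at illness, travel,
# a crisis, or an account being quietly abandoned.
spaced = sorted(timestamps)
gaps = [(b - a, a) for a, b in zip(spaced, spaced[1:])]
longest, started = max(gaps)
print(f"longest silence: {longest} starting {started}")
```

None of this needs the content of a single tweet; the metadata does all the work.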

This is the tip of the iceberg. Even if you’re a professional data analyst, you’ve got no way to know how any one of these techniques could be used, whether in good faith, recklessly or maliciously, to invade the privacy and damage the lives of people who have done nothing more than post to Twitter. I hope it’s clear that your tweets can reveal your legal identity, relationships, group memberships, interests, location, attitudes and health even where you haven’t explicitly or obviously volunteered that information. This can be, and of course is being, used to change people’s lives, very often for the worse. It can affect people’s job prospects, relationships, health and finances; it could cost people their liberty or even their lives. There is no meaningful way to consent to this, no way that any one person could comprehend the genuine risk from their social media exposure, either in the light of currently known techniques or of data analysis methods yet to be devised. Increasingly, opting out isn’t an option either. At best you lose the benefits of being part of social networks online. At worst, your absence flags you as an outsider or someone with something to hide.

The answer isn’t to assume that it’s ethical to use any data analysis technique on all social media data. People haven’t consented to things so far outside their knowledge or control. Nor is it to encourage self-censorship. Where self-censorship does happen, it falls hardest on already marginalised individuals and groups. We need to recognise that just because something can be done doesn’t mean that it should be done. The appropriate use of public social media data needs to be effectively regulated by law and by the policies of platforms such as Twitter. We need vigorous debate about where what Google’s Eric Schmidt calls the “creepy line” is drawn. And most of all we need to recognise that the enormous power of data mining comes with a great responsibility to use it ethically and to be responsive to the concerns of those affected. Blaming the victims, as the Samaritans have done, just adds insult to injury.

Related: Samaritans Radar must close