Analysing social media language: a legal and ethical conundrum

We live in an age where people’s daily social media posts and internet search queries are not wholly their own. Though most of us search without much thought for the invisible eyes of others, the value of such data has not gone untapped by advertisers or unnoticed by researchers, who have used it for projects from studying influenza outbreaks to predicting stock market behavior.

With all of this internal dialogue set free in the public sphere, some have argued that the more subjective correlations between social media language, human behavior, and our general well-being remain understudied. Dr. Lyle Ungar is part of a research team from the University of Pennsylvania (UPenn) that was inspired to leverage the plethora of social media expression and take an open-ended look at correlations between our language and behavior.

Using machine learning to analyze data and find patterns, the team sought to uncover potential links between what we ‘tweet and post’ each day and our daily behaviors, and even our physical and mental health. Their initial study, “Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach”, was published in 2013 and immediately caught the attention of the medical field. The team’s research, part of the University of Pennsylvania’s World Well-Being Project, is ongoing.

Yet by its very nature, the study approach raises questions about the legal and ethical ramifications of accessing and analyzing social media language. Is it legal to analyze social media posts without people’s knowledge? And if they are aware, how do they know the information is being kept ‘private’, or at least out of the hands of companies, governments, or other institutions that might use it for their own narrow objectives?

Initially, Ungar and his team used a combination of unobtrusive measures (those that include any data about human behavior that can be collected without the subjects’ knowledge) and obtrusive methods (surveys and interviews are examples) to collect data. The team analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests. The researchers were even able to get 2,000 people to grant access to their health records, allowing the team to perform a micro study on the potential relationships between language use on Facebook profiles and current or past medical conditions.

On the surface, the fact that these participants volunteered their personal information makes the data collection seem acceptable. Beyond Facebook, which has more stringent requirements for data access, the team also collected thousands of tweets from Twitter, an easier-to-access platform (tweets are open to everyone and the data is easy to manipulate) and one that allowed the researchers to freely map where tweets came from and where tweeters were located.

All this collecting was done in the name of research and of furthering social scientists’ understanding of the behavior and well-being of individuals and the greater populace; however, the sheer availability of the information and the relative ease of access mean that, with the right resources, any person or institution (we know most governments have the ability to keep a close watch) could do the same.

Maybe platforms like Twitter are fair game; people are aware that the information they post is available to everyone in the Tweetosphere, and we might assume tweeters should be cognizant of, and responsible for, anything that they put out on publicly visible platforms. But how many people actively consider the far-reaching impact of comments made on a whim or in the heat of the moment, especially on a platform that has seemingly become just another casual place to have a conversation?

Surprisingly, not much has been published about the legal and ethical ramifications of using social media data for research purposes; however, the potential considerations are wide and deep. A U.K.-based publication covered this topic briefly in a 2014 paper titled “Use of Social Media for Research and Analysis”. Both the organisations that “own” the data (like Facebook and Twitter) and researchers are largely ‘building the plane while flying it’ when it comes to handling grey areas.

For example, many social media organisations provide specific technical interfaces for accessing data, called “Application Programming Interfaces” (APIs), which allow the organisations to monitor access and set ‘quotas’ (on Twitter, researchers are only able to access about one percent of the material published on any given day).
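To give a rough sense of what a ‘quota’ means in practice, here is a minimal Python sketch of a fixed-window request limit. It is not how Twitter or any other platform actually implements its limits; the class name `FixedWindowQuota`, the limit of three requests, and the 60-second window are all assumptions made purely for illustration.

```python
class FixedWindowQuota:
    """Hypothetical quota: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = None  # time at which the current window opened
        self.used = 0             # requests already granted in this window

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is permitted."""
        # Open a fresh window on first use or once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False


# Illustrative numbers only: three requests allowed per 60-second window.
quota = FixedWindowQuota(limit=3, window_seconds=60)
decisions = [quota.allow(now=t) for t in (0, 1, 2, 3, 61)]
print(decisions)
```

In this toy run the fourth request is refused because the window's allowance is used up, while the request at 61 seconds succeeds because a new window has opened. A researcher working against a real API would see the same pattern from the other side: requests beyond the quota are simply turned away until the allowance resets.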

Smaller social sites, like Pinterest and Instagram, are not as likely to provide API access. In this case, those who wish to extract data from web pages often use a method called “scraping”, in which the user ‘instructs’ a computer to extract and download information. Scraping seems to fall into a grey area; while the content is free to access, there are usually copyright and intellectual property laws in place to protect this data (but whether anyone reads or abides by these codes is up for debate).
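As a simple illustration of what ‘instructing’ a computer to extract information can look like, the Python sketch below pulls the text of selected elements out of a page using the standard library’s `html.parser` module. The page here is a made-up string rather than a real site, and the `caption` class name and the `CaptionScraper` helper are assumptions for the example; a real scraper would also need to fetch pages over the network and respect each site’s terms of use.

```python
from html.parser import HTMLParser


class CaptionScraper(HTMLParser):
    """Collects the text of every <p class="caption"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_caption = False
        self.captions = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "p" and ("class", "caption") in attrs:
            self.in_caption = True
            self.captions.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_caption = False

    def handle_data(self, data):
        # Text may arrive in pieces, so append to the caption being built.
        if self.in_caption:
            self.captions[-1] += data


# Invented sample page standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <p class="caption">Sunset over the harbour</p>
  <p class="note">Posted 3 hours ago</p>
  <p class="caption">Coffee &amp; a good book</p>
</body></html>
"""

scraper = CaptionScraper()
scraper.feed(SAMPLE_PAGE)
scraper.close()
captions = scraper.captions
print(captions)
```

The point of the sketch is how little stands between publicly viewable content and a structured data set: a few dozen lines are enough to turn a page into a list of posts.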

For many researchers, the idea of informed consent for social media research, traditionally an ironclad component of any valid study, seems in many cases unreasonable, considering the number of people who could be involved in large studies spanning large regions. Though the 75,000 voluntary participants in the UPenn study seem like a big set, Dr. Ungar noted that the amount of usable extracted information, after human and machine-learning processing, turned out to be a relatively small data set.

At present, it seems many contemporary researchers ground their approach to social media research ethics in efforts to ensure that people whose data are used are not subjected to any directly related negative effects, similar to the steps taken when mitigating legal concerns involving data protection. But there don’t seem to be any black-and-white answers regarding the more abstract legal and ethical standing of social media data in research. There are undoubtedly great potential benefits, and likewise harms, to conducting such studies, but the issues are a reminder that maintaining a voice on the web is not done in a social bubble or vacuum. Whether our posts and tweets will be used predominantly for or against our own well-being is yet to be checked and balanced.
