Questions and Answers @ POPSTAR | Public Opinion and Sentiment Tracking, Analysis, and Research

Questions and answers.

Index

What is POPSTAR?
Who is the POPSTAR team?
Where does the public opinion data come from?
How do we estimate trends in public opinion?
What does Buzz mean?
Where does the news, blogs and Twitter data come from?
What does Sentiment mean?
How do you measure the polarity of text?
How do we measure sentiment regarding political leaders in Twitter?

What is POPSTAR?

POPSTAR (Public Opinion and Sentiment Tracking, Analysis, and Research) is a project that develops methods for the collection, measurement and aggregation of political and economic opinions voiced in micro-blogs (Twitter), in the blogosphere and in online news. The project also aims to compare these data with more conventional indicators of public opinion, such as those gathered through questionnaire surveys (polls). POPSTAR brings together researchers from the Instituto de Ciências Sociais da Universidade de Lisboa (ICS-ULisboa), from the Instituto de Engenharia de Sistemas e Computadores-ID Lisboa (INESC-ID), from the Faculdade de Engenharia da Universidade do Porto (FEUP), and from the Núcleo de Investigação em Políticas Económicas of the Universidade do Minho (NIPE-UM). POPSTAR is a project funded by Fundação para a Ciência e Tecnologia (PTDC/CPJ-CPO/116888/2010). The project’s website can be found here.

This site presents data resulting from the first prototype of the tools developed for the detection and analysis of trends of:

Mentions of Portuguese parties and political leaders on Twitter, the blogosphere and online news;
Sentiments conveyed through “tweets” regarding party and political leaders;
Vote intentions for the main political parties, as measured by polls;
Evaluation of the performance of said party leaders, as measured by polls.

The next prototypes will broaden sentiment analysis to other sources beyond Twitter, and will include economic phenomena.

Who is the POPSTAR team?

The POPSTAR team is formed by Pedro Magalhães, Carlos Soares, Luis Aguiar-Conraria, Mário J. Silva, Nina Wiesehomeier, Paula Carvalho, Silvio Amir, Pedro Saleiro, Miguel Maria Pereira, João Filgueiras and by designer Manuel Távora. Eduarda Mendes Rodrigues was also a member of the team in its early stages.

Where does the public opinion data come from?

The public opinion data we use are the results of surveys of the Portuguese voting population carried out since the last legislative election, in June 2011, published in the media and later deposited with the Regulatory Authority for the Media (ERC). More specifically, we use vote intentions (in %) for the five main political parties in national elections, and the results of questions on the evaluation of the performance of the main political parties and party leaders. In the case of vote intentions, when a published poll reports “undecided” respondents without redistributing them among the remaining valid options, we treat them as abstentions, which is equivalent to distributing them proportionally among the remaining options (PSD, PS, CDS-PP, CDU, BE and Other, Blank or Void votes). In the case of the performance evaluations of the main political leaders (party leaders and the president), the response options vary from polling company to polling company, resulting in distinct metrics. In the development and analysis of our estimates we take these differences into account, and we present the results on a scale ranging from 0 (minimum approval) to 20 (maximum approval).
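As an illustration of the proportional redistribution described above, the following minimal sketch drops the undecided share and rescales the remaining options so that they sum to 100%. The function name, the `"Undecided"` key and the poll figures are hypothetical and invented for this example; they are not taken from any actual poll or from the project's code.

```python
def redistribute_undecided(shares, undecided_key="Undecided"):
    """Drop the undecided share and rescale the remaining options to sum to 100."""
    decided = {k: v for k, v in shares.items() if k != undecided_key}
    total = sum(decided.values())
    if total <= 0:
        raise ValueError("no decided responses to rescale")
    return {k: 100.0 * v / total for k, v in decided.items()}


# Hypothetical poll, in % of all respondents (numbers invented for illustration).
poll = {
    "PSD": 30.0, "PS": 28.0, "CDS-PP": 8.0, "CDU": 8.0, "BE": 4.0,
    "Other/Blank/Void": 2.0, "Undecided": 20.0,
}
adjusted = redistribute_undecided(poll)
for option, share in adjusted.items():
    print(f"{option}: {share:.1f}%")
```

Because every remaining option is divided by the same total, the relative ordering and the ratios between parties are unchanged; only the scale moves.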

How do we estimate trends in public opinion?

The public opinion data we collect are then treated to filter out excessive noise and to better discern trends in public opinion. There are several econometric techniques available for this “smoothing” of data. We opted for the Kalman Filter. The principle underlying the Kalman Filter is quite simple. It considers that each variation observed in public opinion may be the result of two factors: (1) changes in public opinion itself, or (2) variations in the measurement of such opinion. Suppose, for instance, that a new poll shows a significant increase in vote intentions for a specific party. This increase may reflect a genuine rise in public support for that party, but it may also result from the sampling error inherent in any poll. If one considers that the data come from a high quality poll and that public opinion is volatile, it makes sense to give a large weight to that poll’s results. On the other hand, if one considers that public opinion is very stable, it is reasonable to give the same weight to every single poll. Under certain assumptions, it can be shown that the Kalman Filter weighs these two sources of uncertainty optimally, giving us, at every moment in time and given all information available, an optimal estimate of the state of public opinion.
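The weighting logic described above can be sketched with the simplest form of the filter, a one-dimensional "local level" model in which opinion follows a random walk and each poll is a noisy reading of it. This model choice, the function name and all numbers are assumptions made for illustration; the exact specification used in POPSTAR is not described here.

```python
def kalman_local_level(observations, obs_vars, state_var, init_mean, init_var):
    """Filter a series of polls under an assumed random-walk model of opinion.

    observations: poll results; obs_vars: sampling variance of each poll;
    state_var: assumed variance of true opinion change between polls.
    """
    mean, var = init_mean, init_var
    estimates = []
    for y, r in zip(observations, obs_vars):
        # Predict: true opinion may have drifted since the last poll.
        var = var + state_var
        # Update: the gain is the weight given to the new poll.
        gain = var / (var + r)
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        estimates.append(mean)
    return estimates


# Two hypothetical polls (30% then 34%), each with sampling variance 4.
polls, variances = [30.0, 34.0], [4.0, 4.0]
stable = kalman_local_level(polls, variances, state_var=0.0, init_mean=30.0, init_var=1e9)
volatile = kalman_local_level(polls, variances, state_var=100.0, init_mean=30.0, init_var=1e9)
print(stable[-1], volatile[-1])
```

With `state_var=0.0` (opinion assumed perfectly stable) both polls receive the same weight and the final estimate is close to their average; with a large `state_var` (opinion assumed volatile) the estimate moves almost all the way to the latest poll.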

In the measurement of public opinion, we use this filter in two distinct situations: to estimate vote intentions and to estimate politicians’ popularity ratings. In the case of vote intentions, this exercise is easier, because each polling company aims at measuring exactly the same thing and it is perfectly clear what is being measured. In the case of popularity ratings, different polling companies make use of different questions and weigh answers differently. Asking someone to assign a score from 0 to 20 to a leader (as some polling companies do) is different from simply asking whether they evaluate the performance of a leader positively or negatively, as other companies do. Thus, for leaders’ approval, we assume that there is a latent, unobserved variable (call it “popularity”) which influences the various indexes used by polling companies. By applying the Kalman Filter to the data from the different companies, it is possible to provide optimal estimates of the value of that unobserved popularity. This popularity index is converted to a 0-20 scale.

Public opinion data are updated in POPSTAR whenever new polls are published.

What does Buzz mean?

Buzz is the daily frequency with which political leaders are mentioned by Twitter users, bloggers and online news media. In the Buzz section we present two types of indicators. The first is the relative frequency with which the five party leaders are mentioned in each medium (Twitter, Blogs and News) on each day. This indicator is expressed, for each party leader, as a percentage of the total number of mentions of all party leaders. The second indicator is the absolute frequency of mentions, a simple count of mentions of each political leader, in this case also including the president. In the case of the Leftist Bloc (Bloco de Esquerda), starting on November 11, 2012, both indicators treat the two coordinators of the party (João Semedo and Catarina Martins) as a single entity.
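The relative indicator described above is a simple share calculation. The sketch below shows it for one medium on one day; the function name, the leader labels and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_buzz(mention_counts):
    """Convert absolute daily mention counts into percentages of all mentions."""
    total = sum(mention_counts.values())
    if total == 0:
        # No mentions at all that day: report 0% for everyone.
        return {name: 0.0 for name in mention_counts}
    return {name: 100.0 * n / total for name, n in mention_counts.items()}


# Hypothetical absolute counts for the five party leaders in one medium on one day.
counts = {"Leader A": 50, "Leader B": 30, "Leader C": 10, "Leader D": 6, "Leader E": 4}
shares = relative_buzz(counts)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The absolute indicator is simply the `counts` dictionary itself, extended to include the president.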

To estimate trends in Buzz, we again use the Kalman Filter, in this case with an upgrade that makes the user experience more interactive: users can choose the degree of smoothing for each estimated trend. There are three alternatives: a fairly reactive one, in which the trend is highly volatile, allowing close monitoring of day-by-day variations; a very smooth one, ideal for capturing long-term trends; and an intermediate option, displayed by default.

Buzz data are updated daily in POPSTAR.

Where does the news, blogs and Twitter data come from?

Data from the social media and online news are collected by the platform POPmine, developed by Faculdade de Engenharia da Universidade do Porto (FEUP) and the Labs Sapo UP. POPmine filters data from various social media and news sources mentioning specific entities, applies content classifiers (e.g. topic or sentiment), aggregates data (daily buzz, for instance) and makes it available through an API. Within the POPSTAR project, specific modules of sentiment classification (Opinionizer), aggregation and smoothing were integrated in the platform. POPmine collects data from three sources:

News: Data from online news are provided by the service Verbetes e Notícias from Labs Sapo. This service handles online news from over 60 Portuguese news sources and is able to recognize entities mentioned in the news.

Blogs: Blog posts are provided by the blog monitoring system from Labs Sapo, which covers all blogs in the domains sapo.pt and blogspot.pt (Blogger), as well as WordPress blogs written in Portuguese.

Twittosphere: “Tweets” are collected using the platform TwitterEcho, responsible for the compilation of messages from 100,000 Portuguese users of Twitter. “Tweets” are collected in real time and passed through a language classifier. POPmine only uses “tweets” written in Portuguese.

What does Sentiment mean?

In the context of this project, sentiment is any subjective expression (i.e. opinion) conveyed in textual documents about a particular topic, for instance a political leader or the state of the economy. In POPSTAR, and in this first prototype, the main objectives at this level are:

To determine the polarity of texts written by Twitter users regarding political and party leaders, identifying their positive, negative or neutral character.
To build, validate and analyze global sentiment indicators by Twitter users with respect to each party leader.

At a later stage, this analysis will be extended to other media (news, blogs). Part of the validation and analysis effort will be to compare these results with the work of human coders, and to compare the sentiment indicators with those arising from conventional methods of measuring public opinion, such as public opinion polls.

To estimate trends in sentiment, we again use the Kalman Filter, in this case with an upgrade that makes the user experience more interactive: users can choose the degree of smoothing for each estimated trend. There are three alternatives: a fairly reactive one, in which the trend is highly volatile, allowing close monitoring of day-by-day variations; a very smooth one, ideal for capturing long-term trends; and an intermediate option, displayed by default.

In POPSTAR these data are updated daily.

How do you measure the polarity of text?

The analysis of the polarity of a text, in this case of “tweets”, is made by Opinionizer, a sentiment analysis tool for Twitter messages, developed by the DMIR group (INESC-ID Lisboa). For each Twitter message that mentions at least one of the targets (parties and political leaders) being studied, this tool decides whether it is a positive, negative or neutral message. For this purpose, the message is converted into a mathematical representation that combines a number of features: the particular vocabulary used, the presence of words from a sentiment lexicon, and the presence of syntactic patterns typically used to express emotions. The classification algorithm is based upon two steps:

To automatically “learn” the relations between aspects that characterize the message and the sentiment it expresses via the analysis of a set of examples previously classified by human coders;
To use this information to infer the polarity of each message, considering all aspects.

The aspects considered are the following:

Vocabulary – the words that make up the “tweet”, weighted according to the probability of each word being used in positive, negative or neutral messages. Words with a high probability of appearing in all classes are ignored. This probability distribution is also estimated from the hand-coded classification examples.
Sentiment words – the number of words with positive or negative polarity in the message, according to SentiLex-PT, a lexicon of words that express feeling.
Syntactic patterns – the presence of markers typically used to express emotion or feeling, such as punctuation, emoticons and the informal language of web-based social networks (lol, hehehe, hahaha, etc.).
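To make the three kinds of aspects more concrete, the sketch below extracts a toy version of each from a single message. It is not the Opinionizer implementation: the function name, the four-word lexicon (a stand-in for SentiLex-PT, with entries chosen only for illustration) and the regular expression for emotion markers are all assumptions. The learning step, in which weights are estimated from hand-coded examples, is not shown.

```python
import re

# Toy stand-in for a sentiment lexicon such as SentiLex-PT (illustrative entries only).
TOY_LEXICON = {"bom": 1, "excelente": 1, "mau": -1, "vergonha": -1}

# Assumed markers of emotion: repeated punctuation, simple emoticons, informal laughter.
EMOTION_PATTERN = re.compile(
    r"(!{2,}|\?{2,}|:\)|:\(|\blol\b|\b(?:he){2,}\b|\b(?:ha){2,}\b)",
    re.IGNORECASE,
)


def extract_features(tweet):
    """Return vocabulary tokens, lexicon counts and emotion-marker count for a tweet."""
    tokens = re.findall(r"\w+", tweet.lower())
    positive = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) > 0)
    negative = sum(1 for t in tokens if TOY_LEXICON.get(t, 0) < 0)
    return {
        "tokens": tokens,                      # vocabulary aspect
        "positive_words": positive,            # sentiment-word aspect
        "negative_words": negative,
        "emotion_markers": len(EMOTION_PATTERN.findall(tweet)),  # syntactic aspect
    }


features = extract_features("Que vergonha!! lol hehehe")
print(features)
```

In a full system, features like these would be fed to a classifier trained on the examples previously labelled by human coders.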

How do we measure sentiment regarding political leaders in Twitter?

After identifying the polarity in each of the “tweets” that constitute the political communication in the Portuguese Twittosphere, there are several ways to quantify the overall sentiment regarding political leaders. We can, for instance, look at each target independently or in relative terms, compare positive with negative references or simply look at one side of the polarity, or look at daily, weekly or monthly data records.

In this first prototype we opted to present two separate indicators and their evolution across time, using in both cases the day as reference period. The first indicator is the logarithm of the ratio of positive to negative “tweets” for each political leader (party leaders and the president). In other words, a positive value means that the political leader under consideration received more positive than negative “tweets” that day, while a negative value means that the leader received more negative than positive “tweets”. In mathematical notation:
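Following the verbal definition above, the first indicator can be written as below. The symbols are notation assumed here: $P_{i,t}$ and $N_{i,t}$ denote the number of positive and negative “tweets” mentioning leader $i$ on day $t$. The base of the logarithm is not stated in the text; it does not affect the sign of the indicator.

```latex
\[
  \mathrm{logratio}_{i,t} \;=\; \log\!\left(\frac{P_{i,t}}{N_{i,t}}\right)
\]
```

The indicator is positive when $P_{i,t} > N_{i,t}$, negative when $P_{i,t} < N_{i,t}$, and zero when the two counts are equal. As written, it is undefined on days when either count is zero; how such days are handled is not described here.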

The second approach is to look only at the negative “tweets” (the vast majority of the “tweets” identified by our base classifier) and calculate their relative frequency for each leader. In this way it is possible to follow, day by day, which party leaders were, in relative terms, more or less subject to “tweets” with negative polarity. In mathematical notation:
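Following the verbal definition above, the second indicator can be written as below. The symbols are notation assumed here: $N_{i,t}$ denotes the number of negative “tweets” mentioning leader $i$ on day $t$, and the sum runs over all leaders $j$ being compared.

```latex
\[
  \mathrm{negshare}_{i,t} \;=\; \frac{N_{i,t}}{\sum_{j} N_{j,t}}
\]
```

On any day with at least one negative “tweet”, the shares of all leaders sum to 1 (or 100% when expressed as a percentage).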