Nov
07
Filed Under (2012 Election, Big Data, unstructured information) by L.Scagliarini on 07-11-2012

Like most U.S. voters, I was planning to spend last night waiting for the results of the presidential election. From Europe, the wait is even longer, as the first results start coming in around dawn. So while I was watching the countdown to the closing of the first poll, I started to wonder whether waiting four or five more hours was a good investment of my time (vs. actually sleeping).

At that time, I had just received a message from a friend forwarding a quote from Obama saying that he thought he had enough votes to win, while at the same moment David Plouffe was speaking openly about his confidence that Pennsylvania, Ohio and Virginia would remain blue states. This open confidence was obviously part of the last minutes of this never-ending campaign, but it was also an important signal that the statistical models available to them were showing very little uncertainty compared to what “we” were feeling.

While still debating whether it was time to go to bed, I went to the NYT on my iPad and read this post from the now famous FiveThirtyEight blog. As is typical of his very well-written blog, Nate Silver describes the data available, helps to explain and interpret it and, basically, tells us that in the end the election is not as undecided as we all thought. At that point I made up my mind and went upstairs.

I think that in addition to President Obama, the 2012 election had another big winner: “Big Data”. Polls were extremely precise throughout the campaign in identifying mood swings, trends, etc. and, at the end of the day, they were next to perfect in predicting the outcome. And this is only the beginning.

As I wrote in a previous post, I am very confident that the integration of unstructured information will improve the quality of human behavior prediction models, and I am very excited by the effect that unstructured information will have on the cost of feeding data to these systems. This will mean less doubt about who the next president will be come this time four years from now, and a greater ability to predict, with significant precision and very limited investment, the success of a product in the market, for example. To me, this is almost as exciting as waking up to the news of who I already knew was going to be the president for the next four years!

With the U.S. presidential election looming, it’s hard to avoid the talk of who’s ahead—everywhere you turn, there’s an article with the latest results of a new poll. Over the last 24 hours, I read two articles about predicting human behavior. David Brooks, poking fun at his ‘poll addiction’, supports the thesis that, while you can reach a certain level of predictability, essentially, human behavior is impossible to predict.

On the other end of the spectrum, I clicked over to an article that makes the case that what has been missing in building predictive models is the data. Now, the data is available in the form of social media content and will be progressively more available in the future. Problem solved!

When we talk about models for predicting human behavior, I think we have to avoid the radical approach. As the political system demonstrates, we have made huge progress in predicting behaviors and reactions of the electorate, where elections are often won by a small margin, or even hanging chads in some cases.

But the objective cannot be perfection. We do not expect this from most other models; we accept a margin of error. I believe that when we start including new data based on unstructured information, the margin of error in human behavior predictive models will not be eliminated completely, but it will shrink.

We will still have to wait for election night to know who the next president will be, but we will probably send out the party invitation the night before.

Although it’s been hard to resist reading the news about the previous night’s debates each morning, I have been relying on our analysis of the presidential debates for a first impression. As with email communication, experiencing the debates minus any visual or verbal context (and not even in sentence form initially) can leave much up to interpretation, and we’re left with the word choices, and the meanings of those words (which can imply feelings and context), to figure it out.

While much of the analysis here is straightforward, one interesting aspect of using semantic analysis is that it is able to distinguish the most important sentences and words in text, determined not by frequency, but by a complex algorithm that looks at the logical role, co-occurrences with related terms, etc. Using this, we can identify the most important words and concepts that are being conveyed, those that are central and critical to the overall text (see “Most Important Nouns” in the graphic below).
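To make the difference between frequency and co-occurrence concrete, here is a minimal sketch in Python. It is not Cogito's algorithm, which is not described here; it simply assumes that a term matters more when it appears alongside many different terms. The function name `rank_terms`, the tiny stop-word list and the sample text are all hypothetical, chosen for illustration only.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "we", "that", "for"}


def rank_terms(text):
    """Return two Counters: raw frequency and co-occurrence 'connectivity'.

    Connectivity is the number of distinct terms a word shares a sentence with.
    This is a stand-in for a real semantic engine, not a description of one.
    """
    frequency = Counter()
    neighbours = defaultdict(set)
    # Crude sentence split on ., ! and ? (good enough for an illustration).
    for sentence in re.split(r"[.!?]+", text):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOP_WORDS]
        frequency.update(tokens)
        # Link every pair of distinct terms that share this sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    connectivity = Counter({w: len(n) for w, n in neighbours.items()})
    return frequency, connectivity


sample = "Jobs jobs jobs. Taxes hurt families. Taxes fund schools. Taxes shape growth."
frequency, connectivity = rank_terms(sample)
print("By frequency:   ", frequency.most_common(2))
print("By connectivity:", connectivity.most_common(2))
```

In the sample, "jobs" and "taxes" appear equally often, but only "taxes" is linked to other terms, so a connectivity-based ranking puts it first. A real engine would also weigh the logical role of each word, which this sketch ignores.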

Some of the most interesting discoveries were the use of “Romney,” which was cited as the most important term used by President Obama in last night’s debate (Could this mean that he took a more aggressive and forceful approach to his opponent this time?), and Romney’s use of “I” over “we” (Obama was fairly equal in use of both).

Take a look at our newest infographic to see some of the other ‘curiosities’ that our analysis uncovered. Until the next debate……

Highlights from our semantic analysis of the language and word choices used by President Obama and Governor Romney in the first debate:

This week’s presidential debate is being analyzed across the web on a number of fronts, from a factual analysis of what was said, to the number of tweets it prompted. Instead, we used our Cogito semantic engine to analyze the transcript of the debate through a semantic and linguistic lens.
Cogito extracted the responses by question, breaking sentences down to their granular detail. This analysis allows us to look at the individual language elements to better understand what was said, as well as how the combined effect of word choice, sentence structure and sentence length might be interpreted by the audience.

Here is a sample of what we found:

  • Overall, President Obama spoke less (in number of words) but used longer sentences and a more complex sentence construction than Governor Romney, who used a simple sentence construction. Looking at the use of modal verbs, Romney made greater use of “can” and “will” while President Obama often emphasized the word “would.” While both used the verb “be” most often, the second verb in frequency denotes a sense of action in the case of Obama (“do”) and more passive action in the case of Romney (“have”).
  • While the main lemmas and words did not vary much between them (small businesses, America, costs, the private sector, economic growth), President Obama spoke more about the topics of health care, student loans and purchasing power. Romney, by contrast, spoke more about taxes in various forms (tax plan, income tax, tax rate, property tax, economy tax).
  • Using semantic analysis, the graphics below show the context in which each spoke about the concept of “tax” (the closer the words are to the center of the graphic, the more connected they are to the main idea, in this case, taxes). For Romney, “tax” is immediately connected to “raise” and “pay”; for Obama, “tax” is closely connected to “cut” and “family.”

 

President Obama and “taxes”

Governor Romney and “taxes”
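The surface measures mentioned in the points above (number of words, sentence length, modal verbs, and the words that surround “tax”) can be approximated with a few lines of Python. This is only a rough sketch under simple assumptions: sentences are split on punctuation, words are matched by plain string comparison rather than by lemma, and the names `style_profile` and `context_of`, the modal list and the sample sentences are all hypothetical. It does not reproduce how Cogito works.

```python
import re
from collections import Counter

# Assumed list of English modal verbs for this illustration.
MODALS = ("can", "could", "will", "would", "should", "may", "might", "must")


def split_sentences(text):
    """Split on ., ! and ? and drop empty fragments (a crude heuristic)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def style_profile(text):
    """Word count, sentence count, average sentence length and modal verb counts."""
    sents = [tokenize(s) for s in split_sentences(text)]
    words = [w for s in sents for w in s]
    return {
        "words": len(words),
        "sentences": len(sents),
        "avg_sentence_length": len(words) / len(sents) if sents else 0.0,
        "modals": Counter(w for w in words if w in MODALS),
    }


def context_of(text, target, window=2):
    """Count words within `window` tokens of any token starting with `target`.

    Prefix matching is a shortcut for lemmatization, so "tax" also matches
    "taxes" (and, incorrectly, unrelated words such as "taxi").
    """
    counts = Counter()
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok.startswith(target):
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                              if j != i)
    return counts


sample = "We will cut taxes for every family. I would not raise any tax."
print(style_profile(sample))
print(context_of(sample, "tax"))
```

Running the same two functions on each candidate's part of a transcript would give a side-by-side comparison of the kind described above, although the graphics rely on semantic relationships rather than on simple word distance.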

 

  • Using sentiment analysis on the top terms used by both candidates, we can glimpse the feelings and emotions potentially transmitted by each of them. Romney's terms are associated with more positive emotions, while Obama uses less passionate, more neutral language, which conveys a more negative sentiment overall.

Positive terms used by President Obama

 

Positive terms used by Governor Romney
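For readers curious about what scoring sentiment on top terms can look like in its simplest form, here is a toy lexicon-based sketch in Python. The lexicon, its scores and the function name `sentiment_score` are invented for illustration; they are not the terms or the method behind the graphics, and a real sentiment model handles negation, context and intensity, which this one does not.

```python
import re

# Hypothetical toy lexicon: +1 for positive terms, -1 for negative terms.
LEXICON = {
    "growth": 1, "opportunity": 1, "strong": 1, "jobs": 1,
    "deficit": -1, "crisis": -1, "burden": -1,
}


def sentiment_score(text):
    """Average lexicon score of the scored words; 0.0 when none are found.

    The result lies between -1.0 (all negative) and 1.0 (all positive).
    Words missing from the lexicon are ignored rather than counted as neutral.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scored = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scored) / len(scored) if scored else 0.0


print(sentiment_score("Strong growth creates jobs"))
print(sentiment_score("The deficit is a burden"))
```

Because unscored words are ignored, a speaker who uses mostly neutral vocabulary gets a score driven by only a handful of terms, which is one reason simple scores like this should be read with caution.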

 

Stay tuned for our analysis of the second debate on October 16.