The other day, I was talking to one of our clients, for whom we are designing a semantic search engine, and he made a comment that deserves some consideration.
According to him, it is essential that a search engine be very fast, like Google, which returns results in two to three tenths of a second on average. In fact, he believes that his search engine should be even faster, because Google filters who knows how many billions of pages while his intranet contains fewer than a million :). I tried to explain to him that speed is only one of many important aspects.
As in many other fields, Google has succeeded in transforming a technical (and very internal) aspect into a feature that users now consider important. Without a doubt, speed has become essential for Google as well as for other search engines. In fact, many common searches are no longer actually executed; their results are “preprogrammed” by the system, because this means cutting down on servers (thousands and thousands), electrical energy, bandwidth and more.
Establishing how important speed actually is for users is a complicated task: obviously, the less time it takes the better, but I ask myself if it wouldn’t be better to wait even 10 times longer (meaning only 2 or 3 seconds) in order to have better results.
In fact, based on the latest market research, 40% of Internet searches do not receive results, half of the searches have to be reworded in order to get better results and 46% of the search sessions are longer than 20 minutes. Given this situation, personally, I would be happy to wait 2 seconds longer if it means that I will find what I’m looking for more often and/or it reduces the search time of 20 minutes even by just 5 minutes.
Perhaps the problem lies within the fact that with the current technology, Google and other search engines do not know how to improve search results, even if they took 10 times longer. Therefore, they use a simple tactic as a backstop (a kind of unspoken agreement with the user, which is probably tolerable): I’ll give you answers quickly (and for free), but don’t expect too much quality-wise!
Last Wednesday, I visited a client to discuss semantic searches. He motioned for me to sit in the chair in front of his desk. Then, right off the bat, he asked, “Can you explain why, when I search for something in Google or Yahoo, sometimes the information I’m looking for is at the top of the list and other times it’s not there at all?”
His question sparked a very interesting and lively discussion about the Semantic Web, which made me think about how much ground has been covered, but also about how much confusion still exists with regard to this subject.
The first time people began to talk about the Semantic Web was in 2001. It was a new kind of Web, in which web pages, various files, images and the like, would contain precise information about the data they contained. In this way, the Web would become Semantic: no longer a source for manually-searched documents, but rather an instrument capable of immediate and automatic data interpretation. I remember thinking, “Fantastic”. Many people still think it is just that…something that is more closely related to fantasy than reality.
The Web contains an enormous amount of information which is not always accessible. The pages that make up the Web are not “semantically” linked. The lack of explanation about content meanings and links, along with the exponential growth of the data it contains, is the main cause of fluctuation in the degree of precision of search results.
In order to give meaning to web pages, each informational resource should be able to provide information about itself (this is called “metadata”, meaning data about data). Of course, all of this information needs to be expressed in a language which is suitable for computers. To do this, the most feasible hypothesis is to use a shared vocabulary, along with some XML-based formalisms (I won’t go into the details here; further research on this subject can be done on Wikipedia). Let’s just say that in this way we can obtain complete, objective, accurate data and therefore generate forms of analysis which are also exact. But who has the time to link each bit of information to its metadata? Usually, speed and simplicity are preferred, at the expense of precision and efficiency.
Many approaches have tried to free the Semantic Web from labels such as “interesting but (almost) impossible” and transform it into something “interesting but also useful and usable”. Some pioneers began to walk down the semantic road even before the theories about the Semantic Web were established. For example, Semantic Intelligence (SI) aims to improve precision and recall in the search process, making computers able to automatically “understand what we’re talking about”. If SI makes it possible to automatically understand what a text is talking about, then it is reasonable to think that it can also be used to create metadata for the Semantic Web. Today, we are way beyond the beta version frontier: Semantic Intelligence is a mature technology and is widespread in the business world.
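To make the idea of metadata a little more concrete, here is a minimal sketch in Python of a resource describing itself. It assumes RDF and Dublin Core play the role of the “shared vocabulary” (only one of several possible choices), and the URL, title and subject are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces of the RDF and Dublin Core vocabularies (assumed here as the
# "shared vocabulary"; other vocabularies would work the same way).
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe(url, title, subject):
    """Return a small RDF/XML record stating what the resource at `url` is about."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}subject").text = subject
    return ET.tostring(root, encoding="unicode")


# Hypothetical page: the subject tells a machine that this "jaguar" is an animal.
xml_text = describe("http://example.org/jaguar", "Jaguar", "animal")
print(xml_text)
```

Writing one such record is easy; the difficulty raised above is doing it by hand for every page on the Web.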
We may not be that far away from that “fantastic” Web which is able to understand whether a jaguar is an animal or a car. A Web in which you can search for information on pop music from the Sixties and receive pages containing the keywords music, pop, and Sixties (for example), but also those about the Beatles and the Beach Boys and maybe even some useful tidbits about the next Rolling Stones tour.
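The difference between a keyword search and the kind of search described above can be sketched in a few lines of Python. This is only a toy: the concept map and the pages are hand-written and hypothetical, whereas a real semantic engine would derive such relations from linguistic analysis rather than from a hard-coded table.

```python
# Toy concept map (hypothetical, hand-written for illustration only).
CONCEPT_MAP = {
    frozenset({"pop", "music", "sixties"}): {"beatles", "beach boys", "rolling stones"},
}

# Toy document collection (also invented).
PAGES = {
    "page_keywords": "A history of pop music in the sixties",
    "page_beatles": "The Beatles released Abbey Road in 1969",
    "page_tour": "Dates announced for the next Rolling Stones tour",
    "page_jaguar": "The jaguar is the largest cat in the Americas",
}


def keyword_search(query):
    """Return pages containing every word of the query."""
    terms = set(query.lower().split())
    return sorted(
        name for name, text in PAGES.items() if terms <= set(text.lower().split())
    )


def semantic_search(query):
    """Keyword hits plus pages mentioning concepts related to the query."""
    terms = set(query.lower().split())
    related = set()
    for concept, neighbours in CONCEPT_MAP.items():
        if concept <= terms:
            related |= neighbours
    hits = set(keyword_search(query))
    for name, text in PAGES.items():
        lowered = text.lower()
        if any(phrase in lowered for phrase in related):
            hits.add(name)
    return sorted(hits)


print(keyword_search("pop music sixties"))
print(semantic_search("pop music sixties"))
```

The keyword search only finds the page that literally contains all three words, while the expanded search also returns the Beatles page and the Rolling Stones tour page.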
The Office of Public Liaison in the new Obama administration is promising to listen to citizens as it considers policy direction and legislation, and otherwise to bring the people to Washington rather than bringing Washington to the people. The most concrete of these proposals is to allow a 5-day comment period by citizens via the internet before the President signs any legislation. Even now anyone can offer an opinion directly to the President here. You can contribute up to 500 characters. That is roughly 40 words.
The windows are open in the White House and a new breeze of openness and inclusiveness is blowing right in. This is certainly a change from the previous 8 years, when the White House was shut tight, the air inside growing staler by the day. But I wonder if the administration is prepared for the hurricane-force winds that could result?
If you ask for comments on pending legislation, how many comments will the White House get? There are some hints from around the blogosphere. Go to Technorati and ask for a count of the word “bailout” over the last 6 months. The chart below is what you get.
The peak of over 14,000 blog posts came around the passage of the first multi-billion-dollar bank bailout in the early Fall. The average around this spike looks to be roughly 6,000 posts per day. As the ARRA (American Recovery and Reinvestment Act) and the second half of the bank bailout money are debated and finalized, you can be sure the number will spike again. But let’s be conservative and assume that 1/3 of that average, about 2,000 bloggers per day, would like to comment directly to the White House on the ARRA over the promised 5-day period. That would be 10,000 comments that President Obama says he will consider before signing the legislation. The current estimate of US bloggers is 22.6 million, so 10,000 comments may only be a drop in the bucket.
Short of employing a small army of readers, how will Valerie Jarrett and her staff understand this “wisdom of the crowd” input? We do know that President Obama has hired some tech vets to lead this kind of effort.
Chief among these is a former Google product manager, Katie Jacobs Stanton, who will be the new President’s “director of citizen participation” come March. It is not just a coincidence that Ms. Stanton was in charge of Google Moderator.
A quick look at this tool reveals the ability for anyone to post a question (or, I suppose, a comment) and then have others vote for its importance relative to all the other questions posted. The questions posted around the Presidential debates give another estimate of what the White House might experience. The breakdown of topics, votes recorded, questions asked and citizens participating looks like the table below.
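The back-of-the-envelope estimate above can be written out explicitly. The sketch below only restates the figures quoted in this post (6,000 posts per day, one third of them commenting, a 5-day window, 22.6 million US bloggers); the variable names are mine.

```python
# Figures quoted in the text above.
posts_per_day = 6_000          # rough average around the "bailout" spike
comment_period_days = 5        # promised comment window
us_bloggers = 22_600_000       # current estimate of US bloggers

# Conservative assumption: one third of the daily average comments.
comments_per_day = posts_per_day // 3
total_comments = comments_per_day * comment_period_days

# How small is that compared with the whole US blogosphere?
share_of_bloggers_pct = 100 * total_comments / us_bloggers

print(f"{comments_per_day} comments per day, {total_comments} over the period")
print(f"{share_of_bloggers_pct:.3f}% of US bloggers")
```

Even under this conservative assumption, someone still has to read 10,000 comments in five days, and that figure represents well under a tenth of one percent of US bloggers.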
| Topic | Votes | Questions | People |
|---|---|---|---|
| Education | 6,926 | 96 | 1,183 |
| Health Care | 3,483 | 81 | 412 |
| Iraq War | 3,513 | 64 | 488 |
| Economy | 7,534 | 209 | 580 |
| Environment | 3,078 | 73 | 317 |
| Foreign Policy | 3,699 | 101 | 339 |
| TOTAL | 28,233 | 624 | 3,319 |
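As a rough cross-check of the Google Moderator figures, the following Python sketch recomputes the totals and works out how many votes each participant cast per topic. The numbers are copied from the table; the variable names and the votes-per-participant ratio are my own additions.

```python
# Rows copied from the table: topic -> (votes, questions, people)
MODERATOR = {
    "Education": (6926, 96, 1183),
    "Health Care": (3483, 81, 412),
    "Iraq War": (3513, 64, 488),
    "Economy": (7534, 209, 580),
    "Environment": (3078, 73, 317),
    "Foreign Policy": (3699, 101, 339),
}

# Column totals, which should match the TOTAL row of the table.
totals = tuple(sum(row[i] for row in MODERATOR.values()) for i in range(3))
print("Totals (votes, questions, people):", totals)

# Average number of votes cast by each participant, per topic.
votes_per_person = {
    topic: votes / people for topic, (votes, _, people) in MODERATOR.items()
}
for topic, ratio in sorted(votes_per_person.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic:15s} {ratio:5.1f} votes per participant")
```

The totals agree with the last row of the table (28,233 votes, 624 questions, 3,319 people).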
OK, here is the rub. No matter how you count what can be expected from citizens participating in the new administration, technology beyond posting and voting is going to be needed. It’s not clear on Google Moderator whether the categories were decided before the questions came in or after everyone posted. In any case, I took the top vote-getting comments from each of these categories and analyzed them again using our semantic technology to see what categories come out. I could find 90 categories in total across all those who commented. The top categories (more than 1% of the total) were the following:
That’s easily more than twice what Google Moderator can bucket things into. The point is that true participation means more than a simple tally. It should mean listening, really listening, to the context, the nuances, and the breadth of what citizens experience in their daily lives and what they expect from their government. Volume is only the first problem for citizen participation. The bigger issue is, as the intelligence community (which is familiar with these problems) puts it, finding dots, connecting dots and understanding dots.
I believe semantics to be a core technology that can not only process the volume of what the White House is about to experience but can also tease out the full picture of true citizen participation. It will not do President Obama any good to promise to listen to his most important constituency and later be accused of turning a deaf ear to the process. There is great promise in having the breadth and range of American opinion directly influence the highest office in the land. Everyone can see that technology is the key to extending our democratic reach to every living room and kitchen table in the land. The peril lies in not applying enough technology, or the right technology, with the result that many citizens feel they were not sufficiently heard. That would do democracy harm indeed.
The impact of the Internet, and then of the online social network phenomenon, on consumer buying behavior is a fact. These days, I cannot even imagine organizing a vacation or buying a piece of electronics (not to mention books, cars, real estate etc.) without first spending a significant amount of time reviewing online opinions from my peers, consumers or bloggers with recognized authority on the topic of interest. You can therefore imagine that monitoring and, when possible, trying to influence the opinions expressed in these sources should be a main priority for any company (at least in many sectors). So it is not surprising that the first comment I receive from the majority of marketing and product managers I speak to is, “Yes of course we know it is important and we are doing it.” However, if you try to understand what most of these companies are doing in reality, you will find out that the situation is quite different.
In any case, the point I want to make is not that traditional market research is useless. I think it has its rightful place in the mix of competitive intelligence initiatives any company has to undertake. My point is rather that it needs to be integrated with other sources, to take advantage of the wealth of information the explosion of the Internet has made available. Compared to traditional market research, online sentiment monitoring has the following advantages:
This established behavior is very resistant to change. When I introduce our product, Cogito Monitor, to decision makers inside enterprises and mid-size companies, I often get the same objections. They immediately focus all their attention on finding errors and noise in the sentiment level automatically identified by the system. They do so even though the product has proved, in many implementations, to provide very high precision, and even though the noise has no impact whatsoever on the reliability of the summary data provided (false positives are equally distributed among the different sentiment levels).
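Why evenly spread errors matter so little for the summary can be illustrated with a small calculation. The sketch below is a simplified model, not a description of how Cogito Monitor works: the sentiment shares and the 10% error rate are invented, and it assumes every misclassified item is equally likely to land in any of the other sentiment levels.

```python
def observed_shares(true_shares, error_rate):
    """Expected shares per sentiment level when a fraction `error_rate` of
    items is misclassified, spread evenly over the other levels."""
    k = len(true_shares)
    return {
        level: (1 - error_rate) * share + error_rate * (1 - share) / (k - 1)
        for level, share in true_shares.items()
    }


def ranking(shares):
    """Sentiment levels ordered from most to least frequent."""
    return sorted(shares, key=shares.get, reverse=True)


# Hypothetical "true" distribution of opinions and a hypothetical error rate.
true_shares = {"positive": 0.50, "neutral": 0.30, "negative": 0.20}
noisy = observed_shares(true_shares, error_rate=0.10)

for level in ranking(noisy):
    print(f"{level:9s} true {true_shares[level]:.3f}  observed {noisy[level]:.3f}")
```

In this toy model the shares move by a couple of percentage points at most and the ranking of the sentiment levels is unchanged, which is the sense in which the summary picture survives the noise.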
I could argue that traditional business intelligence and market research projects probably offer similar results in terms of reliability, and I am not saying that our product is perfect. What I really want to question is the rationality of the objection. To what are they comparing the results obtained by Cogito Monitor? If the mistakes, such as they are, are statistically irrelevant, why do they resist using this information as well, in conjunction with whatever other information they already have, to support their decision-making process? Instead of comparing the tool to what they actually have today, they seem to compare it to an ideal system or process providing 100% precision and recall. And when they resist adopting these tools, they are in effect choosing to sit like George W. Bush on 10 September 2001: they prefer to rely on data they are comfortable with, even though it gives an incomplete picture of what is actually happening in the marketplace, when they should instead be investing in resources able to interpret the signals of brewing storms, often still weak and confused, that are available on social media and can dramatically impact their business.