Many technology companies have managed to flood the market with their “magic” automatic solutions. They promise precision tools which are simple to use, have enticingly colorful dashboards and provide full coverage of “consumer sentiment” with minimal effort. Many of their web sites state that once top executives are equipped with the magic tool, the task of reading and interpreting the extracted data can be easily delegated to their subordinates.

As I am directly involved in this area through my company, I can’t seem to figure out if this claim is just an aggressive marketing campaign or if it is just a way to tell customers what they want to hear. Either way, it has created a situation where the only dialogue with potential customers consists of looking at the data contained in these reports and trying to find the errors in the extraction, thus casting doubt on whether the technology is actually ready for the market.

This vicious cycle needs to be stopped. It could be useful to remember the following guidelines to create a more productive climate and provide value both to businesses and to technology companies:

1) Online sentiment analysis is just another element of competitive intelligence and should be handled accordingly
2) Even the more established technologies (ERP, BI, CRM) are not perfect and do have a margin of acceptable error
3) A group of analysts should be employed to examine the data, extract the knowledge and scan the sources
4) Set priorities and be conservative: avoid incidents first
5) Learn how messages propagate

Feb
08

As I have written many times before, semantic technology is unique in that it is able to go beyond the limits of other types of technology and approach the automatic understanding of a text. It is not perfect, however, and it certainly has yet to reach its maximum potential.

I realize that this is not easy to understand for those who don’t work in the sector, especially because the many false promises out there tend to create unreasonable expectations, muddled ideas and market chaos. It might therefore be useful to use a common experience as an example: our own learning process.

Let’s start from the beginning, from the moment we human beings begin to talk, understand, learn, go to school and so on. We require at least 12-15 years to be able to read a newspaper and understand the most general articles, and this is thanks to the experience we developed while learning the meanings of words and experimenting with a great many phrase constructions. The learning process is lengthier still when we decide to tackle more technical terms or specific topics.

Learning takes time, and the same goes for a computer. It’s true that a computer can process in nanoseconds while we think in milliseconds, but it is also true that our method of learning uses a device (the brain) that no one has been able to fully understand and that is able to do things that not even the most powerful computer can imitate.

In summary, it doesn’t make sense to expect a computer to perfectly analyze and understand a biology text, for example, without first having learned all it can about that subject. There are no shortcuts or magic formulas: learning a language is difficult, and even automatic processes require time and labor.

When I present a company with our software solutions (which are based on a semantic technology that uses a rich and vast semantic network), I find myself in front of an audience who clearly understands the advantages of this approach. Yet the series of concerns and doubts they raise often clouds the decision-making process and causes an incorrect evaluation of the actual return on investment.

Whether they are raised by IT managers, KM workers or software developers, the concerns fall into two categories: the first, the costs related to the setup and maintenance of the semantic network and the second, the costs related to the infrastructure required to maintain a performance level able to satisfy operations.

There are many reasons behind these concerns, but two factors seem to stand out. On one hand, there are the very effective (and often inaccurate) communication activities carried out by the makers of systems based on keyword technology. They have almost succeeded in convincing the market that a complex problem such as information management can be solved with automatic shortcuts and that any alternative would be unaffordable. On the other hand, the majority of researchers in this sector are still skeptical about systems which are entirely semantic. This is mainly caused by their inability (at least up to now) to develop software which combines the advantages of increased text comprehension with the performance needed to meet the demands of the real world, which further strengthens the position of the competition.

In the past ten years, many successful projects have been developed using our semantic technology. Therefore, I think it would be useful to use real data from our everyday experiences to help clear up the misconceptions which often cause people to make irrational decisions.

Costs of development

To add a new language to Cogito, two man-years of software development and 8-10 man-years of linguistic development are needed in order to refine the semantic network. You can quickly estimate the cost of such resources (if you are in Silicon Valley, divide your estimated total by 2!) and immediately understand that the initial investment is considerable, yet affordable considering that the cost will be spread over all the implementations done over time.

Cogito’s standard semantic network permits a horizontal management of content, so that a significantly higher rate of precision and recall (compared to that of a static system) is obtained with no need for further elaboration. For vertical implementations, start-up costs will be necessary so that the standard semantic network can be enriched with knowledge from a specific domain (the number of added concepts usually does not exceed 5,000); a linguist usually needs 20-30 working days to complete this task.

For those who believe that “languages constantly change and adding new terms can be costly,” may I remind you that even the most dynamic languages, such as English, grow by no more than 100-200 new terms in common use and fewer than 1,000 non-idiomatic expressions per year (in the worst-case scenario, this could mean about 10 working days per year).

Those who criticize the complexity of managing a semantic network often refer to the complexity of managing lists of entities such as people, places, companies and organizations. Traditional systems are able to recognize an entity only if it is present in a list; this aspect is often erroneously confused with semantic network management. A good semantic engine is able to recognize an entity based on the semantic role it plays within a text, so it requires neither the creation nor the maintenance of lists. At the same time, it is also able to correctly recognize less frequent entities (which, for obvious reasons, would not have been inserted in a list).

Costs of infrastructure

Cogito can analyze more than 120 KB of text (roughly 40 pages) per second on a common single-processor server. This kind of speed, combined with its linear scalability and low cost, makes Cogito a practical solution even in situations in which large quantities (tens of millions) of documents must be analyzed.
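To make the throughput figure concrete, here is a rough back-of-the-envelope sketch. Only the 120 KB per second rate and the linear scalability come from the text; the workload (10 million documents of 3 KB each, about one page at the ratio quoted above) and the function name are assumptions chosen purely for illustration.

```python
# Throughput figure from the text: about 120 KB of text per second per server.
THROUGHPUT_KB_PER_SEC = 120

def processing_hours(num_docs, avg_doc_kb, servers=1):
    """Estimated wall-clock hours, assuming linear scaling across servers."""
    total_kb = num_docs * avg_doc_kb
    seconds = total_kb / (THROUGHPUT_KB_PER_SEC * servers)
    return seconds / 3600

# Hypothetical workload: 10 million documents of 3 KB each.
one_server = processing_hours(10_000_000, 3)
ten_servers = processing_hours(10_000_000, 3, servers=10)
print(f"1 server: {one_server:.1f} h, 10 servers: {ten_servers:.1f} h")
```

Under these assumed numbers a single server would need about three days, and ten servers well under a working day.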

The development and maintenance costs of a semantic network are considerably lower than what is commonly assumed; the improvements in terms of the ability to manage information (even when very complex) are obvious even to those who are not experts in this sector. I am convinced that when these aspects can be objectively analyzed (when myths and obsolete information are ignored), the number of companies which adopt real semantic solutions will increase.

May
18

Google keeps releasing new (little) functions and refining those already in existence. This is all fine and good, but perhaps it is time that it cleared up a few matters regarding what has been in place for quite some time. I find it interesting to highlight the problems commonly encountered during web searches (plus it’s fun to put King Google through the wringer :) ).

Theoretically, when we search in Google, we insert a couple of words, without quotation marks and without paying attention to the order. The system should apply a sort of AND between the two words (which then shifts to OR depending on the mysterious formulas applied). It seems, however, that this doesn’t actually occur and that the number of results is simply an approximation.

In fact, if I search for Angelina Jolie, Google tells me that there are approximately 47,100,000 results, while if I search for Jolie Angelina, for no apparent reason, the results are cut down to just 7,660,000. Perhaps this is due to the fact that most people use the first search method (first name followed by last name). But this still doesn’t explain why, if I search for Jolie Pitt, I get approximately 1,820,000 results, while if I search for Pitt Jolie, I get approximately 8,620,000. And that is not all: if I search for Angelina AND Jolie, the results decrease to 40,400,000, which is neither logical nor intuitive (although it is possible to imagine what Google is doing behind the curtains…), and if I search for “Angelina Jolie”, the results are similar to the very first search.

Similarly, the vagueness of the query suggestion assistant is baffling: I inserted the name “Brad Pitt” with the query suggestion assistant turned on, and I saw approximately 28,100,000 results. But, as I completed the query, the results became 23,400,000.

A suggestion for Google: seeing as almost no one goes beyond the first two pages of results, why not simplify and just write “more than 1,000 results” or “more than 10,000 results” in these cases? Or, as an alternative, they could copy Yahoo! and actually make the numbers match across the different search variants.

The impact of the Internet, and then of the online social network phenomenon, on consumer buying behavior is a fact. These days, I cannot even imagine organizing a vacation or buying a piece of electronics (not to mention books, cars, real estate etc.) without first spending a significant amount of time reviewing online opinions from my peers, other consumers or bloggers with recognized authority on the topic of interest. You can therefore imagine that monitoring and, when possible, trying to influence the opinions expressed in these sources should be a main priority for any company (at least in many sectors). So it is not surprising that the first comment I receive from the majority of marketing and product managers I speak to is, “Yes of course we know it is important and we are doing it.” However, if you try to understand what most of these companies are doing in reality, you will find out that the situation is quite different.

First of all, the budget allocated to the monitoring of online sources is a small fraction of the budget allocated to traditional business and marketing intelligence projects. Companies are spending significantly more money on tools to analyze internal structured data (sales, accounting, inventory etc.) and even more troubling, sentiment and online monitoring account for a very small fraction of the budget invested in traditional market research (that is, a set of pre-defined questions where the consumer has to choose one answer and, sometimes, gets to add a few words in free text.)
To clarify, companies worldwide prefer to use information that is:
  • expensive to gather (projects are very often in the range of hundreds of thousands of dollars);
  • biased (there is a lot of research proving that users do not really answer freely to these questions);
  • static, or in other words, that describes the situation at a specific moment in time and that is actually compiled and reported sometime later when the situation could be different.

In any case, the point I want to make is not that traditional market research is useless. I think it has a rightful place in the mix of competitive intelligence initiatives any company has to undertake. My point is rather that it needs to be integrated with online monitoring to take advantage of the wealth of information the explosion of the Internet has made available. Compared to traditional market research, online sentiment monitoring has the following advantages:

  • it’s relatively inexpensive (if done with technologies);
  • it’s less biased (and this bias tends to decrease as more of the masses go online);
  • it provides a dynamic, real time view on the market.

This established behavior is very resistant to change. When I introduce our product, Cogito Monitor, to decision makers inside enterprises and mid-size companies, I often get the same objections. They immediately focus all their attention on finding errors and noise in the sentiment level automatically identified by the system, even though the product has proved in many implementations to provide very high precision, and even though the noise has no impact whatsoever on the reliability of the summary data provided (false positive instances are equally distributed among the different sentiment levels).
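The point about equally distributed errors can be illustrated with a small sketch. It rests on a simplifying assumption that is not stated in the original: with some probability (the error rate), an item’s sentiment level is replaced by one picked uniformly at random among the levels. The shares and the 15% error rate below are invented for the example. Under that assumption the ranking of the levels in the summary is unchanged, and the true shares can be recovered if the error rate is known.

```python
def observed_shares(true_shares, error_rate):
    """Shares seen in the summary when errors are spread uniformly over levels."""
    k = len(true_shares)
    return {level: (1 - error_rate) * share + error_rate / k
            for level, share in true_shares.items()}

def corrected_shares(observed, error_rate):
    """Invert the uniform-noise model to estimate the true shares."""
    k = len(observed)
    return {level: (share - error_rate / k) / (1 - error_rate)
            for level, share in observed.items()}

# Hypothetical true distribution of sentiment levels.
true_shares = {"negative": 0.5, "neutral": 0.3, "positive": 0.2}

noisy = observed_shares(true_shares, error_rate=0.15)
ranking = sorted(noisy, key=noisy.get, reverse=True)
recovered = corrected_shares(noisy, error_rate=0.15)

print(ranking)
print(recovered)
```

Note that in this model the observed shares are pulled slightly toward an even split, so it is the ordering and the overall picture that survive, not the exact percentages.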

I could argue that traditional business intelligence and market research projects probably offer similar results in terms of reliability, and I am not saying that our product is perfect, but what I really want to question is the rationality of the objection. To what are they comparing the results obtained by Cogito Monitor? If the mistakes are statistically irrelevant, why are they reluctant to use this information as well, in conjunction with any other information they already have, to support their decision-making process? Instead of comparing the tool to what they actually have today, it seems they compare it to an ideal system or process providing 100% precision and recall. And when they resist adopting these tools, they actually choose to sit like George W. Bush on 10 September 2001: they prefer to rely on data they are comfortable with, but which is incomplete in describing what is actually happening in the marketplace, when they should instead be investing in resources able to interpret the signals, often still weak and confused, of the brewing storms visible on social media which can dramatically impact their business.

Feb
02
Filed Under (Myths and realities) by M.Varone on 02-02-2009

While surfing the web recently, I came across a chart published by TechCrunch.

You may remember Cuil, the dreamlike search engine that promised a revolution in Internet search by indexing a previously unimaginable number of web pages.
Cuil’s fiasco proves that, without innovative technologies and/or approaches, even the most advertised and best financed companies (even those founded by Google’s anointed, in this specific case) are destined to fail.
It seems everything turned out to be a marketing operation, until now without any real effect on the daily activities of Internet users (but this was obvious from the start). Fortunately, money and names are not enough to obtain success: we need innovative ideas and the ability to turn them into real and useful products in relatively short timeframes.
Sep
04

The secret to a successful automatic categorization project lies not in choosing a powerful enough technology, but in the methodology used to implement the project: if the methodology is the right one, then a powerful technology will indeed be indispensable to success, but if the methodology is wrong, no technology will be able to help.

The most important element is the initial analysis phase, during which it is necessary to describe the core of the issue in a clear, objective and replicable way.

 It is fundamental that the customer, typically an organization that needs to manage a considerable amount of knowledge (in general, various types of documents produced or acquired during work), explains to the supplier its real needs.

The supplier, of course, must commit to address such needs in the best possible way.

Described in this way, the situation is not so different from any other software development project. The difference in this case is that we must understand how to manage complex knowledge, which is never easy and cannot be improvised.

The first step is the most important one, and it requires a special effort from the customer, who needs to answer the following questions rationally:

  • Why do I need to categorize my documents?

  • Who are the people who best know the knowledge base I want to categorize?

  • If currently the categorization is performed manually, what is the detailed process flow?

  • Which categories are really important and relevant, and able to make the content more useful and valuable?

  • If the category tree is already available, are all the categories really necessary?

  • What are the most objective criteria that make a specific document belong to one category and not to another?

Even if the above questions are all simple ones, it is not so easy to quickly find the answers and this is where the experience of the supplier comes in, making the analysis phase a cooperation between customer and supplier.

First of all, the supplier needs to discuss the problem with the customer before offering a solution. Moreover, the expertise of the supplier must go beyond the strictly technical aspects of the technology: the customer is generally not a knowledge expert, and therefore cannot easily identify, right away, the basic categories (or knowledge domains) on which the success of the project depends.

If the analysis phase is performed correctly, the most important step for the success of the project is already done: this is, in fact, the only narrow path that leads to an effective system able to deliver advantages in terms of costs and value.

Aug
21
Filed Under (Myths and realities) by M.Varone on 21-08-2008

… or programs that “learn” how to categorize and programs that just categorize

From the Seventies onward, many researchers have been investing time and resources to develop algorithms able to analyze texts already categorized by hand, in order to extract, automatically (or better… magically), the knowledge required to categorize other texts of the same kind.

Basically, the idea was (or rather is, because no solution has been found yet) the following:

  • Take a list (or tree, often hierarchical) of the desired categories directly from the people who need a system for automatic categorization.

  • Receive from the same people a set of documents (tagged by hand) for each category, selected from the larger set of available texts.

  • Use the categorization tree and the set of documents to teach the program how to recognize the stylistic features of each category. This is pure magic ;) and it is normally referred to as training.
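The three steps above can be sketched in a few lines. This is only a toy illustration of what “training” means, using a generic word-frequency method (naive Bayes with add-one smoothing) and four invented example sentences; it is not the algorithm of any specific product or research system mentioned here, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count word occurrences per category from hand-tagged documents."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def categorize(text, word_counts, doc_counts):
    """Return the category with the highest naive Bayes log score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best

# Steps 1 and 2: the categories and a few hand-tagged examples (invented).
training_set = [
    ("sport", "the team won the match with a late goal"),
    ("sport", "the coach praised the players after the match"),
    ("politics", "the parliament voted on the new budget law"),
    ("politics", "the minister defended the government policy"),
]

# Step 3: "training", then categorizing a new text.
model = train(training_set)
result = categorize("the players scored a goal", *model)
print(result)
```

The sketch works on this tiny, clean example precisely because the new text reuses the vocabulary of the training sentences; it knows nothing about meaning, which is where the difficulties discussed in this post begin.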

This approach has produced one of the oldest and most persistent myths about Knowledge Management.

Although the solution soon proved to be inadequate, the will to accomplish this magic has been so persistent that even today the market insists it is possible to obtain a program, suitable for any field, that starting from a few examples can automatically perform a task that is often not even within the capacity of people.

The idea of such a system is understandable and desirable (maybe it’s the dream of everyone in the field of information management), but has created exaggerated expectations, absolutely unrealistic and even detrimental, because they interfere with the advance of the state of the art.

Systems of this kind DO NOT exist and, what’s more, as I often underline, there are no easy shortcuts for solving the complex problems related to the management of information.

Still, in the case of specific categorization the myth can come true and reality is often better than expected.

In fact, although the categorization of contents for personal use is still quite far from being economically feasible (it remains pricey, as the subjects are countless and tied to subjectivity), we can nevertheless observe that, for a few years now, it has been possible at the enterprise level to implement automatic categorization systems that are economical and effective, provided that all the parties (firm and supplier of technology, client and vendor, etc.) share clear goals and work together to avoid traps.

We will see how in the next post on this subject.

Jul
30
Filed Under (Myths and realities) by M.Varone on 30-07-2008
Categorization, a subject I have covered before, is a central activity in the effective management of the knowledge contained in texts (or, in technical terms, the so-called “unstructured information”) but it is shrouded in the most stubborn myths of the field of document processing.

But what is categorization?

The question is not trivial, because there are different ways to indicate this activity, which seems to have inherited the confused eclecticism typical of Knowledge Management. It goes under a large variety of labels such as “classification” and “clustering”, and some even go as far as using linguistic monstrosities such as “taxonomization”.

Personally I prefer “categorization”, because I believe it’s the term that best reflects the process behind the different names: distinguishing available information according to different categories to make searching easy and immediate.

Categorization is in most cases performed manually, and therefore tied to subjectivity, to individual choices depending on the way of thinking, on necessity etc., and also on the type of content (documents, emails, web sites, etc.)

There is no need to emphasize that, being a manual activity, categorizing presents two main problems: it requires a great amount of time to be performed, and it normally produces subjective definitions of categories that different users may find incoherent. Automatic applications were introduced in the development of information management technologies in order to solve these problems.

The first categorization systems were born immediately after the first attempts to implement search applications, but only with the recent explosion of information has the potential usefulness of automatic categorization become a major interest. We just need to consider the quantity of data available today on the web in comparison to a few years ago, our direct experience in managing documents on our PCs, or the phenomenon of email: less than 10 years have passed, and average users are no longer managing a few emails per week, but about 30 emails per day…

Typically, in the field of technologies for information processing (at least from the point of view of an insider), nearly all researchers have approached the problem with the fixed idea of finding an algorithm that, with little or no manual work, can categorize any content automatically and with very high quality.

This is how a pragmatic approach to the problem was replaced by the silver bullet race of automatic categorization: an imprudence that has caused excessive expectations and unsatisfactory results. In the next posts we will see how, when and why.

Jun
18
Filed Under (Myths and realities) by M.Varone on 18-06-2008

Working on categorization projects, we often face the fact that a perfect automatic categorization cannot exist: a certain degree of subjectivity (which can also vary in time) is always involved when we assign a category or a subject to a text.

The most common situation involves taxonomies including heterogeneous categories: for example, when categorizing newspaper articles customers tend to include in the taxonomy subjects such as sport and politics together with domains such as people or events.

But while categories like sport or politics are fairly objective and strictly related to the content of the text, people and events are cross-category elements, and it is therefore very difficult to manage them with an automatic system. In fact, they have no common topics, no recurring or typical concepts and no specific domains; their only shared feature is that of being focused on someone or something (a person or an event).

 

However, it is comparatively easy for the reader to agree that articles about Leonardo da Vinci, Gorbachev, Robin Hood or Joe DiMaggio should belong to a “people” category.

In general we should always keep in mind that some choices are quite easy for us, but can be extremely complicated for a program.

For example, we may need to categorize the review of a Second World War movie. For most readers, without even having to read the whole article, the first category will be “cinema”, as the subject is a movie. The program, instead, may think* about history, war or the military, and would not consider “cinema” a relevant topic.
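A deliberately naive sketch shows how this can happen. It assumes a program that simply counts category keywords; the keyword lists and the review text are invented for the illustration and do not describe how any real categorization engine works.

```python
from collections import Counter

# Hypothetical keyword lists, invented for this illustration.
CATEGORY_KEYWORDS = {
    "cinema": {"film", "director", "cast"},
    "war": {"war", "battle", "soldiers", "invasion", "army"},
}

def keyword_scores(text):
    """Count how many times each category's keywords occur in the text."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    words = Counter(cleaned.split())
    return {category: sum(words[k] for k in keywords)
            for category, keywords in CATEGORY_KEYWORDS.items()}

# An invented two-sentence review of a Second World War movie.
review = (
    "The film follows soldiers through the invasion and the final battle. "
    "The director shows the war as the army experienced it."
)

scores = keyword_scores(review)
top_category = max(scores, key=scores.get)
print(scores, top_category)
```

A reader sees at once that the text is a film review, while a count of this kind is dominated by the vocabulary of the film’s subject.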

Luckily, most categorization issues can actually be solved by an automatic system which, once configured properly, will be far more objective and reliable (because it will never get tired nor influenced by external factors) than a person, who remains nevertheless the only one of the two who is really intelligent.

* think… it’s only a manner of speaking :-)