… or programs that “learn” how to categorize and programs that just categorize
From the Seventies onward, many researchers have invested time and resources in developing algorithms that analyze texts already categorized by hand, in order to extract, automatically (or better… magically), the knowledge required to categorize other texts of the same kind.
Basically, the idea was (or rather is, because no solution has been found yet) the following:
• Take a list of the desired categories (or a tree, since categories are often hierarchical) directly from the people who need a system for automatic categorization.
• Receive from the same people a set of documents (tagged by hand) for each category, selected from the larger set of available texts.
• Use the categorization tree and the set of documents to teach the program how to recognize the distinguishing features of each category. This is pure magic, and it is normally referred to as training.
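To make the three steps concrete, here is a minimal sketch of what "training" can look like in practice. It uses a tiny naive Bayes classifier, which is my choice for illustration and not a method named in this post; the categories, the example texts, and the function names (`tokenize`, `train`, `categorize`) are all hypothetical. It is a toy under those assumptions, and it says nothing about how well the approach holds up on real documents.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per category from (category, text) pairs tagged by hand."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for category, text in examples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def categorize(model, text):
    """Return the category with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Prior: share of the training documents tagged with this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical hand-tagged documents for two made-up categories.
examples = [
    ("sports", "the team won the match with a late goal"),
    ("sports", "the striker scored twice in the final match"),
    ("finance", "the bank raised interest rates again"),
    ("finance", "shares fell as the market reacted to interest rates"),
]

model = train(examples)
print(categorize(model, "the team scored a goal"))
print(categorize(model, "the bank raised rates"))
```

With these four examples the two test sentences come out as "sports" and "finance". The sketch only counts words, so it works when categories differ in vocabulary and the examples are clean, which is exactly the easy case.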
This approach has produced one of the oldest and most persistent myths about Knowledge Management.
Although the solution soon proved to be inadequate, the will to accomplish this magic has been so persistent that even today the market insists on the possibility of obtaining a program, suitable for any field, that, starting from a few examples, can automatically perform a task that is often not even within the capacity of people.
The idea of such a system is understandable and desirable (maybe it’s the dream of everyone in the field of information management), but it has created exaggerated expectations that are absolutely unrealistic and even detrimental, because they interfere with the advance of the state of the art.
Systems of this kind DO NOT exist and, what’s more, as I often underline, there are no easy shortcuts for solving complex problems related to the management of information.
Still, in the case of specific categorization the myth can come true and reality is often better than expected.
In fact, although the categorization of contents for personal use is still quite far from being economically realizable (it remains pricey, as the subjects are countless and tied to subjectivity), for a few years now it has been possible at the enterprise level to implement automatic categorization systems that are economical and effective, provided that all the parties (firm and supplier of technology, client and vendor, etc.) share clear goals and work together to avoid traps.
We will see how in the next post on this subject.
Last week I was invited to speak at Google TechTalks.
Google TechTalks are designed to disseminate a wide spectrum of views on topics ranging across Current Affairs, Science, Engineering, Humanities, Business, Law, Entertainment, Medicine, and the Arts. My presentation focused on how our semantic platform can help advertisers publish targeted advertisements based on the actual meaning and sentiment of a page instead of keywords or the general topics covered by the site.
More than 80 “Googlers” attended from different offices. It was a very interesting exchange, considering how philosophically different Google’s approach is from Expert System’s.
After my presentation I had the opportunity to speak with several people. To my great surprise, for the first time I heard several people from Google say that “at the end of the day what matters is the bottom line. If, to improve our bottom line in some situations, we need to move away sometimes from our ‘all automatic’ approach, then… so be it.” They were not executives, but I think it was a sign.
Probably one of these situations is the area of contextual advertising, and Google, even with the most popular contextual advertising platform currently available (AdSense), strives to do better.
We think our semantic platform, even if still in beta, can significantly improve the quality of contextual advertisements, and we will be collecting data over the next couple of months with partners that are already working with us. The advantage we have is that our platform is mature, performant and already used in the real world. I know several other startups are tackling the issue. This will make for a very interesting time.
Below is my presentation at Google TechTalks:
http://www.youtube.com/watch?v=WGygU_D-qqY
Recently I happened to re-read A Study in Scarlet, the first Sherlock Holmes novel by Conan Doyle, and I found an interesting passage:
I consider that a man’s brain originally is like a little empty attic, and you have to stock it with such furniture as you choose. […] It is a mistake to think that that little room has elastic walls and can distend to any extent. Depend upon it there comes a time when for every addition of knowledge you forget something that you knew before. It is of the highest importance, therefore, not to have useless facts elbowing out the useful ones.
If we consider the explosion of information made available by new technologies, we realize that this excerpt is particularly relevant today, as we are literally besieged by useless information, and filtering our searches implies an increasingly unacceptable waste of time.
I hope we won’t have to wait another 119 years (A Study in Scarlet was published in 1887) to find an effective solution to the problem of information overload.