How many useful messages do we receive via email? The analysts say about 30 per day, with a cost of 1 or 2 hours of work to manage them.
In addition to the work necessary to clear the incoming mail, we need to consider the work necessary to retrieve the data later. In fact, we tend to keep everything (if only to keep a history), and so it’s… 30 messages today, 30 tomorrow…
Controlling the situation is not easy. The data keeps growing until reading everything becomes impossible, even with a good organization system of folders and sub-folders, because content cuts across categories, and contacts and useful details end up scattered everywhere. Where did I save that email from so-and-so, with whom I’m working on that project for so-and-so? Maybe in the folder with his name, or in that of the project, or of the customer…
The “search” function based on keywords (usually the only one available) can help us only if we know the right folder and can remember at least the author of the message and a specific (and not too common) word in the text. But in most cases we just remember a general idea, so we can only proceed by trial and error, often without results, or end up searching the Internet for what we already have!
Despite the complexity of the problem (information is often subjective rather than standard and objective), semantics can greatly improve these kinds of searches. One example is the possibility to automatically check more data by extending the search to all the related concepts and sub-concepts.
For example:
“I’m looking for the sales of competitor X”
with semantics I can retrieve not only messages containing “competitor X + sales”, but also those with “competitor X + billing”, “competitor X + turnover”
and also with:
“Product1 + Competitor + sales”, “Product2 + Competitor + revenues”, etc.
Everything as easy for the user as a keyword search.
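To make the idea concrete, here is a minimal sketch of this kind of concept expansion in Python. It is only an illustration: the `RELATED_TERMS` dictionary, the function names and the plain substring matching are all assumptions made for the example. A real semantic engine would derive related concepts from a semantic network and would analyze the text rather than match strings.

```python
from itertools import product

# Hypothetical, hand-written concept map (an assumption for this example).
# Each concept is expanded to the terms we treat as related to it.
RELATED_TERMS = {
    "sales": ["sales", "billing", "turnover", "revenues"],
    "competitor x": ["competitor x", "product1", "product2"],
}


def expand_query(concepts):
    """Return every combination of related terms for the given concepts."""
    alternatives = [RELATED_TERMS.get(c.lower(), [c.lower()]) for c in concepts]
    return [tuple(combo) for combo in product(*alternatives)]


def search(messages, concepts):
    """Return the messages that contain all terms of at least one combination."""
    combos = expand_query(concepts)
    hits = []
    for message in messages:
        text = message.lower()
        # Naive substring matching stands in for real linguistic analysis.
        if any(all(term in text for term in combo) for combo in combos):
            hits.append(message)
    return hits
```

The user still types only two concepts, for instance `search(inbox, ["competitor X", "sales"])`, and the expansion to “billing”, “turnover” or the product names happens behind the scenes.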
Working on categorization projects, we often face the fact that a perfect automatic categorization cannot exist: a certain degree of subjectivity (which can also vary in time) is always involved when we assign a category or a subject to a text.
The most common situation involves taxonomies including heterogeneous categories: for example, when categorizing newspaper articles customers tend to include in the taxonomy subjects such as sport and politics together with domains such as people or events.
But while categories like sport or politics are fairly objective and strictly related to the content of the text, people and events are cross-category elements, so it is very difficult to manage them with an automatic system. In fact, there are no common topics, no recurring or typical concepts and no specific domains; the only shared feature is that of being focused on someone or something (a person or event).
However, it is comparatively easy for the reader to agree that articles about Leonardo da Vinci, Gorbachev, Robin Hood or Joe DiMaggio should belong to a “people” category.
In general we should always keep in mind that some choices are quite easy for us, but can be extremely complicated for a program.
For example, we may need to categorize the review of a Second World War movie. For most readers, without even having to read the whole article, the first category will be “cinema”, as the subject is a movie. The program, instead, may think* about history, war or the military, and would not consider “cinema” a relevant topic.
Luckily, most categorization issues can actually be solved by an automatic system which, once configured properly, will be far more objective and reliable than a person (because it will never get tired or be influenced by external factors). The person nevertheless remains the only one of the two who is really intelligent.
* think… it’s only a manner of speaking
A long series of false notions on the Internet has created a macro-myth: you can find everything online, you just need to “know how to search”.
Instead, there’s nothing special to know. If we cannot find, for example, library books on the web, it’s not a matter of tricks: library books are simply not on the web. In fact, only a very small part of the knowledge that surrounds us is also online, and it is there not by magic but because someone has decided to make it available on the Web (and available does not mean “for free”, because it is not true that all the information on the web is free… this is another MYTH ;))
We also need to consider the impact of dynamic pages (even though all search engines have developed special crawlers to index as much content as possible, reclaiming it from the hidden part of the web), and the fact that search engines are able to index only a very small part of all the accessible data (no one can indicate an exact percentage, but I would be surprised if it were more than 4 or 5%). Therefore the content may actually be online, but the problem remains, because there is no special technique to retrieve what is not indexed.
But even leaving the hidden Web aside: when the interesting content is actually indexed, can we really find what we need in a very short time (and without effort… another myth)?
Without the right keywords, the answer is no, and we could search for an entire week and still not find anything.
The reality is that we still cannot take advantage of all that we have available.
Some say that it would be nice to have every original document on the Internet (like the library books we were talking about at the beginning of the post). But it would also be nice to be able to use the multitude of secondary information, which can provide very useful support, in particular because most of it is produced by people with different competences, points of view, sensibilities, etc.
The Semantic Technology Conference in San Jose is probably the most important in this sector.
I attended it for the second year in a row, and this year the event had more than 1,100 attendees. It is a very important moment to understand the maturity level of the so-called semantic technologies and, in general, to evaluate whether these technologies have started on their way to becoming mainstream. Below you will find some random thoughts from a non-technical guy on the trends and issues helping or preventing semantic technologies from becoming mainstream.
The language used by vendors and experts is still too technical to engage and to excite business people. However, I noticed that more presentations included practical demo sessions showing how users interact with the applications or the solutions presented. This is a first step but what should happen next is to have presentations with clear ROI analysis, which was still missing from most presentations at this year’s event. I believe that, as usual, this is a turning point for any technology to show its strategic relevance for enterprises.
This year we were finally able to see the first real working semantic web applications. It was impressive to see the expectations that platforms like Twine, Freebase or Powerset have generated in the community. I am a Twine user so I am not surprised to see this interest but it is still nice to see this phenomenon. It is still early to say if these applications will be successful and drive a lot of traffic. Initial users seem to have split opinions. I have a conflict of interest because we are suppliers of Twine and the developer of www.askwiki.com which directly competes with Powerset so I cannot express my opinion. However, we will all follow the efforts of these companies carefully because if they can deliver on the hype they have generated it will help to make Semantic Technologies pervasive.
The defense sector seems to be ahead of the enterprise and other government sectors in the adoption of, or at least interest in, Semantic Technologies. Many of the most important defense-related system integrators, vendors and agencies attended the event. It’s difficult to say whether this interest depends on the major wave of investment in the defense sector, which allows it a much broader scope in monitoring new technologies, or whether, as I believe, the issues facing the defense sector (especially in monitoring open sources) make semantic technologies a perfect fit. In any case, this interest is a great help to the industry.
Analysts at the major firms (like IDC and Gartner) seem not to have really caught up with the semantic wave. While most of these firms have started to cover semantic technologies in some shape or form, they don’t yet seem to be very engaged or comfortable with the topic. It came as no surprise that there were no analysts from these firms among the attendees of the event. I think it will be important for semantic technology companies to engage these firms in the future and present their case clearly if they want to find some advocates for a breakthrough in the business world.
There was a lot of talk about standards for the semantic web (OWL, RDF, etc.), as if simply having the standard makes a semantic web. People seem to forget that you need something to create applications that process the information and produce output that conforms to the standards. In order to become mainstream and be really usable in real-world applications, it is mandatory to have the tools to do the heavy lifting. This fact has always driven the development of our technology here at Expert System, and this is why we have developed such a solid set of tools.
We believe that only when application development and customization tools are readily available can the semantic web become a reality.
“Democracy is the worst form of government, except for all those other forms that have been tried from time to time.”
I’m fond of this quote by Winston Churchill, and I often use it in my meetings and presentations in this modified version:
“Semantic technology is the worst technology for processing unstructured information, except for all other technologies.”
Actually, a perfect technology for knowledge management does not exist because a definition of knowledge itself does not exist, at least not one so rich and complete to be shared by everyone.
We all have our idea of knowledge and we prefer to approach information our own way: if I could remember word by word all I know and had the time to read directly all the sources I have available, I would be 100% sure of being well-informed.
And I wouldn’t need any technology, either perfect or imperfect.
But the many reports available on the explosion of knowledge, from now to 2010 for example, are clear: by then we will have 988 billion gigabytes of digital information, at least according to projections based on 2006, which closed with 161 exabytes (billion gigabytes). Now, it is definitely true that the majority of this information consists of multimedia material, but most of the rest is plain text that requires the best possible technology to become, at least in part, usable and useful.
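To get a feeling for what these two figures imply, here is a small back-of-the-envelope calculation in Python. It is only a sketch: it assumes a constant compound growth rate between 2006 and 2010, which the reports themselves do not necessarily claim, and the function name is made up for the example.

```python
# Figures quoted above, in exabytes (billions of gigabytes).
VOLUME_2006 = 161
VOLUME_2010 = 988


def annual_growth_rate(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Assumption: the same growth rate applies in each of the four years.
rate = annual_growth_rate(VOLUME_2006, VOLUME_2010, 2010 - 2006)
print(f"Implied growth: about {rate:.0%} per year")
```

Under that assumption, the volume of digital information grows by more than half every year.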
In most people’s mind, Secret Services belong to a mysterious world, where technology is beyond belief and all instruments and tools are far from ordinary.
Yet drawing a parallel between the activities we carry out in our companies (but also in our leisure time) and those of a spy can be quite surprising.
In the field of information management, we have more in common with Intelligence than we can imagine:
• finding the most correct data and clues;
• sharing and spreading knowledge in the most effective way.
You may already know about “A-Space”, a social networking project promoted by the Intelligence Community with the goal of improving the quality of intelligence and promoting information sharing.
“A” stands for “Analysts” (that is to say secret agents) and “Space” echoes a famous Web 2.0 site: MySpace.
A-Space will take part in Intellipedia (a group of 3 top-secret Wikis, with a name based on the famous Wikipedia), whose members come from 16 different secret agencies.
Only a year and a half has passed since its creation, and Intellipedia already contains more than 29,000 articles, with an average of 114 new articles every day and more than 4,800 changes to existing material.
It’s interesting and promising to see how specific tools of Web 2.0 can contribute to the solution of problems connected to information sharing (also at a global level) in the field of intelligence. But the new thing here is the association of Intelligence with web sites such as Facebook and MySpace, very user friendly and, typically, used by young people.
“Logged In and Sharing Gossip, er, Intelligence”, New York Times.
“Spies and teenagers normally have little in common but that is about to change as America’s intelligence agencies prepare to launch “A-Space”, an internal communications tool modelled on the popular social networking sites, Facebook and MySpace”, Financial Times.
The problems of search and integrated management of information, caused mainly by technological limits, are the same for everyone: agents and secret services, companies and common people. But of course the needs and the objectives are different, and certainly also the risks and complexity of the situations to be faced.
Therefore, the approach to solve such problems is different, in terms of concreteness and speed.
I find it interesting (and funny) to observe how the market of unstructured information management is evolving so fast that we don’t even share a name or expression to define it.
If we consider structured information, we all agree that we’re talking about databases, data warehouses, data mining and, recently, business intelligence; but we don’t have anything similar for unstructured data. In fact, depending on circumstances, applications and points of view, all the following names can be and are actually used:
· search engine
· information retrieval
· information extraction
· clustering
· text mining
· ETL
· content management
· enterprise search technology
· content access tools
· semantic intelligence (we ourselves invented this one)
· information access technology
· categorization
· text analytics
It’s also interesting to note that even IDC and Gartner don’t agree with each other on the name of the field: Gartner refers to “information access technology”, while for IDC the right term is “content access tools”. Luckily, they have “access” in common.
Of course, I realize there are bigger problems in the world… but I would like to see a clearer and more shared approach to names and expressions. Considering how important the label “business intelligence” has been in establishing a group of technologies and solutions that were already available on the market, I think it is crucial to converge as soon as possible towards a common terminology.
“If Hewlett Packard knew what Hewlett Packard knows, we would be three times more profitable.”
It’s fun (and also a bit alarming) to realize how this statement made some years ago by Lew Platt (CEO of HP in the Nineties) is still current and more effective than most recent definitions of what Knowledge Management is supposed to be.
In the business world, the KM concept has undergone so many transformations, and it’s been associated with so many killer applications (content management, data-information management, e-learning, portals, content access via Intranet… ) that we can compare it to a phoenix: it seems to be dying, but then it re-emerges from its own ashes, mutates, and becomes powerful again.
Beyond names and definitions, the only certainty is the issue for which KM was created. We are flooded with enormous quantities of potentially interesting data that we are not able to use, and therefore controlling what we have and what we know is becoming more and more difficult.
The management of information (Knowledge Management, broadly speaking) has its own myths and legends and I think it will be interesting to discuss them in a new category of posts dedicated to what is just a myth and what is actually feasible.
I hope I will be able to write in a way that is both clear and engaging, but I can’t guarantee a regular publishing schedule (every now and then I also have to work), so I appreciate the patience of my readers.
The comprehension of a text performed by a linguistic analysis engine is only a simulation of some aspects of human comprehension, so the automatic process and the human process cannot be considered equivalent. No one knows yet how human comprehension really works (and it seems we are still far from knowing it), therefore we are not able to reproduce it in software, even in the absence of other obstacles.
When I started to think about COGITO® and automatic comprehension of texts, I selected some aspects of the wider process of human comprehension. Then, with the agreement of engineers, programmers, mathematicians and linguists, I focused on the following: