Have you tried out the machine translation tools I was talking about last Wednesday? Were you surprised by the results? I believe you were.
If you translated short and specific texts (I chose articles of about ten lines in Russian, a language in which I don’t know a single word), you may have obtained reasonably good results: in my case, the translation performed by Promt (a system based on shallow linguistic technology) was quite fair in general, as were the ones performed by the Google tools (statistical technology).
If you tried to translate full articles, you surely realised that the results are nearly always disappointing (sometimes hilarious): whichever system is used, the imperfections are plentiful, sentences have no logical structure, and in most cases the text makes no sense.
So, considering the results, why should we use these systems?
Because in some cases we may just need to get a general idea of the content – a general idea is better than nothing, in a world where only a very few people know more than two or three foreign languages (and often just one). But in most cases we actually need to know what we are reading, so we can’t go far with this approach.
If we want machine translation to become truly useful, we need to consider a new perspective and a different approach: semantics as the base technology, the foundation on which to work to find the missing piece of the puzzle and truly understand the general meaning of the text.
Is it possible to develop a system for machine translation able to overcome the many limitations of the existing systems, limits that in fact are preventing any practical application?
I believe so, but we need to abandon the idea of an easy and quick solution obtained by some statistical magic formula. Instead we need semantic comprehension of the content, a large quantity of conceptual information, and a great deal of work for each language to be managed.
To realize how disappointing the state of the art is, we just need to test the two main systems representing the current approaches:
Have you ever tried these systems? I will talk about them very soon, so you may want to test them a little, so that you can compare your impressions.
I would like to bring to your attention an interesting article, which confirms some of the things I’ve been writing about in the past few months (such as the post When Reality is Not What We Expected), about the differences between Internet and corporate searches. The article contains some remarkable and very realistic considerations by Google.
There is plenty of confusion these days about what web 3.0, sometimes called the Semantic Web, actually is.
I thought it would help to set the record straight.
Let’s start by understanding what web 1.0 and web 2.0 are, thereby setting the stage for web 3.0.
Here goes:
Web 1.0 is one producer and mass consumption. The original web gave the power of authorship to a few people who created content for the rest of us. Mass consumption of that content became possible with the advent of search engines like Google and Yahoo. Those engines work by indexing every word on every page. The more words on a page that match your search query, the higher the page appears on the result list. But even if only one word matches, the page is listed as part of the search results. This is why you get 30, 40 or 60 thousand results per search. Many of those pages are useless to you.
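The ranking rule just described can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not how any real search engine is built: the function names (`tokenize`, `keyword_search`) and the three sample pages are invented for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keyword_search(query, pages):
    """Rank pages by how many distinct query words they contain.

    A page is listed as soon as a single query word matches.
    """
    query_words = set(tokenize(query))
    scored = []
    for name, text in pages.items():
        matches = len(query_words & set(tokenize(text)))
        if matches > 0:
            scored.append((matches, name))
    # Highest match count first; ties broken alphabetically by page name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]

# Hypothetical pages, invented for this example.
pages = {
    "recipes": "Slow cooker recipes for a busy week",
    "travel": "A week of travel in Italy on a budget",
    "garden": "Planting tomatoes in a small garden",
}
results = keyword_search("budget travel week", pages)
print(results)
```

The "recipes" page is still listed because it shares the single word "week" with the query, even though it has nothing to do with travel.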
Also, this is keyword technology. It treats words like tokens. A series of letters in a certain order from a search query is matched to a series of letters in the same order on any page. The underlying meaning of the word, the context in which the word is used, and the word’s relationship to other words around it are not considered. Keyword technology treats words like a picture and not as part of a language.
Here is the definitive test. Take any webpage. Take every word on that page and mix them up until they make no sense. Feed the page back to Google or Yahoo and have them index it. They will serve the page up just like they did the original. Same words, same tokens – if they match, serve the page. Keyword technology does not care that the page makes no sense because the technology does not use sense as part of the index.
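The scrambling test can be tried on a toy index. The sketch below assumes the simplest possible keyword index, a set of the words on the page; real engines store much more (positions, links, and so on), so take it as an illustration of the principle only. The helper names are hypothetical.

```python
import random
import re

def index_terms(text):
    """Return the set of word tokens a toy keyword index would store."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def matches(query, text):
    """True if every query word appears somewhere on the page."""
    return index_terms(query) <= index_terms(text)

original = "Semantic technology reads a sentence and understands what it means"

# Mix up the words of the page; the fixed seed keeps the run repeatable.
words = original.split()
random.Random(0).shuffle(words)
scrambled = " ".join(words)

same_index = index_terms(original) == index_terms(scrambled)
both_match = (matches("semantic sentence", original)
              and matches("semantic sentence", scrambled))
print(scrambled)
print(same_index, both_match)
```

Both versions of the page produce the same index entries, so the toy index cannot tell the readable page from the scrambled one.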
Web 2.0 is mass production and mass consumption. The advent of blogs, chat rooms, and other instant and ubiquitous authoring tools and sites ready to accept the content has been a great democratization of the web. The power to express opinion, to add knowledge to humankind is a great advance forward. We all get to hear from each other, to learn from each other. This is a good thing.
Except … that keyword technology has not helped us to locate opinion or knowledge as intended by those authors.
Consider the following sentence: “I believe the government has done a good thing in bailing out the economy that is in such bad shape.” Keyword technology would match this sentence (web page) to any of the following queries:
good government
bad government
good economy
bad economy
Four different queries and four different needs – yet all four searchers get the same page and are left to do the work of reviewing whether the page really applies to their needs or not. Multiply that effort by the 30, 40 or 60 thousand search results and you have an untenable situation. So how are we going to get out of this mess?
Web 3.0 is mass production with pinpoint consumption. Semantics is the science of machine comprehension of text. It means the computer reads, understands and tags words, sentences, paragraphs and whole documents. With semantics, when we search we can tell the computer to fetch only concepts about “good government” or “bad economy”. In the above sentence, semantics would understand that the adjective good is connected to the noun government and that the adjective bad is connected to the noun economy. In other words, a semantic search for “good government” would ignore a sentence such as “I believe the government has done a bad thing in bailing out the economy that is in fundamentally good shape”.
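To make the contrast concrete, here is a toy sketch in Python. It is not a semantic engine: the word lists `TOPICS` and `OPINIONS` and the rule "attach each opinion adjective to the most recent topic noun before it" are assumptions I made up so that the two example sentences can be told apart. Real semantic analysis relies on full linguistic processing, not on a shortcut like this.

```python
import re

TOPICS = {"government", "economy"}   # assumed topic nouns
OPINIONS = {"good", "bad"}           # assumed opinion adjectives

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def keyword_match(query, sentence):
    """Keyword technology: every query word appears somewhere in the sentence."""
    return set(words(query)) <= set(words(sentence))

def opinion_pairs(sentence):
    """Toy stand-in for semantic analysis: attach each opinion adjective
    to the most recent topic noun that precedes it."""
    pairs = set()
    current_topic = None
    for word in words(sentence):
        if word in TOPICS:
            current_topic = word
        elif word in OPINIONS and current_topic is not None:
            pairs.add((word, current_topic))
    return pairs

def concept_match(query, sentence):
    """Match only if the query's (adjective, noun) pair was extracted."""
    adjective, noun = words(query)
    return (adjective, noun) in opinion_pairs(sentence)

first = ("I believe the government has done a good thing in bailing out "
         "the economy that is in such bad shape")
second = ("I believe the government has done a bad thing in bailing out "
          "the economy that is in fundamentally good shape")

for query in ["good government", "bad economy"]:
    print(query,
          keyword_match(query, first), keyword_match(query, second),
          concept_match(query, first), concept_match(query, second))
```

Keyword matching returns both sentences for both queries, while the pair-based matching returns only the first sentence for “good government” and “bad economy”.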
So here is the kicker. If web 1.0 was single production and mass consumption then web 3.0, the semantic web, is mass production and pinpoint consumption. Web 3.0 turns web 1.0 on its head. It allows me, the individual, to find, assemble and consume only those portions of the vast internet that help me with my current task. Web 3.0 works for me rather than me having to work the web to get anything useful from it.
I had the chance to meet the inventor of the web, Sir Tim Berners-Lee, just the other day at MIT. It is no accident he has reinvented the web as a semantic web. As the amount of available information grows ever larger, web 1.0 becomes less useful. One could argue it will eventually collapse under its own weight and that it is keyword technology that is killing it. Semantics is the driver of web 3.0 and will restore the productivity promise of a world of connected information, knowledge and intelligence once more.
In a recent web seminar that we participated in, organized by Project 10X, some 260 registered attendees submitted questions prior to the event. I semantically processed these questions (sometimes called “eating your own dog food” – imagine that!) looking for common themes and concerns.
In reviewing the outcome, here is what I found:
1. Case Studies and ROI. People learn best with storytelling and proof points embodied by Return on Investment. So it should be no surprise that this tops the list of questions and concerns. These stories help convince funders, provide guidance for technical planning, and show feasibility. Yet this also shows a level of understanding of the technology by the participants. In other words they are convinced of the basic value parameters of semantic technologies and have come to believe they can be deployed with good outcomes within their organizations but need help to find the right place to start, the expected timelines, and how to sell the capabilities and outcomes to upper management. At Expert System we have over 100 implementations in the last 3 years alone and can confirm this concern meets with our experience.
2. Technical Integration Points. Here attendees’ concerns are about how to make semantics live with or interact with existing applications, data sets, and search products. I sense the need to make existing products pay a bit longer for their sunk cost and not to tear things out wholesale and start over. The good news is that semantic technology is intended to play this exact role by providing new insight into information wherever it currently lives. Nine out of ten customers ask us for a SaaS implementation with a front-end user interface that already exists.
3. Semantic Networks. This is a real surprise to us, but pleasantly so. While our technology relies heavily on a semantic network, sometimes called an ontology, it is not always the case that other providers use this method to unlock the meaning of text. Some use statistical approaches, others heuristics and still others something called latent semantic processing. These other approaches tend to sound quite scientific but in reality are shortcuts that prove to be less than sufficient for industrial-strength precision and recall. Semantic networks are hard to produce and they take time. But the investment pays off. They become a knowledge representation of a domain of knowledge. When built thoroughly and properly, they can greatly increase the precision and recall of the processing. Many networks are specific to a branch of science or hold deep technical knowledge representations. Our semantic network, on the other hand, is of the common language, covering all topics, all words, all concepts and the connections between them. This means it can be applied to any domain.
4. W3C standards are confusing. When we read the comments, it’s clear there are too many acronyms and too many standards. More concerning, the standards themselves seem to be mistaken for the solution to semantics. It is as if many think the standards provide the inference, the storage, the modeling, the interpretation and more that are core to semantics. The reality is that standards are only a proposed common language for describing and exchanging the outcomes of semantic processing.
To sum up – the semantic web has come a long way in terms of showing value and laying down a base of understanding. But as with any new technology, there is more to do. All of us need to do better in terms of explaining, simplifying and educating up and down the organizational decision chain. Only when that is done will we be able to say “it’s baked”.
Where the categories mean the following:
Integration: How to embed or use semantics behind the scenes of existing applications.
Mobility: Get semantics to support mobile workers.
ROI Case Studies: Examples of successful, killer applications and their payback.
Semantic Nets: Semantic networks or ontologies, what they are, when to use them, how to maintain them.
Standards: W3C’s soup of acronyms and what they mean.
Timing: How fast will the technology and/or market progress?
Performance: Can semantics run with everything else and keep up?
Databases: How and when to use databases with semantics.
Automatic: Do semantic systems or tools learn on their own? What about maintenance and support?
Selling: How to make the case for funding to upper management.
NLP: How does semantics support natural language processing or computing?
During the internet bubble, among the many startups based on bizarre ideas, there was one in the US working on a sound project: developing solutions able to make explicit and available the large mass of tacit knowledge hidden in email messages exchanged within organizations.
In fact, if we think about it, the email traffic we handle at work on a daily basis is definitely a goldmine, because it contains, in a processable format, the tacit knowledge which is vital to businesses. However, when we need such knowledge, we often cannot retrieve it because, being tacit, it is unstructured or unorganized, and therefore remains hidden inside the email messages.
To understand the full potential of tacit knowledge, consider the difficulties that arise when a key person leaves a company and takes important knowledge assets with him (or her). Or consider the numerous times we know that the solution to a problem already sits inside an email message, but we can’t remember where to find it.
These examples prove how much can be saved, in terms of time and costs, by an application able to read all the email messages exchanged by a group, organize the contents, and make them accessible and usable in the future.
Developing generic solutions of this kind is extremely complex (as a matter of fact, the start-up mentioned earlier is now working on other developments).
But semantics can still have a key role, even if under present conditions it requires considerable customization and tuning.
This means that only big companies can invest in such solutions, and this is a pity, because small and medium businesses could also benefit from them: tacit knowledge hidden in email messages can carry significant, and often hidden, costs.
Actually, it’s a paradox: for the first time in history we are able to keep track of the business communications that used to be only vocal, but at the same time we cannot make them accessible and usable.
I doubt the problem will ever be solved completely, but I’m confident that, at least in part, it will be possible to build solutions that can find the gems in this goldmine of hidden and unused knowledge. In the next few years, this will be the biggest challenge for the developers of semantic technologies.
To sum up, this is a good demonstration that, when we talk about information, things are often not what they appear, and it’s always worth trying to understand what’s behind them.
What’s natural language?
I am often asked this question by customers.
Natural language is the everyday language (English, Italian, German, etc.) used to communicate at all levels, which in computational linguistics is contrasted with “formal language”, created expressly for a specific purpose. While natural language changes and evolves continuously thanks to neologisms, idioms, loanwords, slang, etc., formal language is closed and without exceptions: no semantic ambiguity, no homography, no homonymy, limited expressiveness.
If the analysis phase is performed correctly, the most important step for the success of the project is already done: this is, in fact, the only narrow path that leads to an effective system able to deliver advantages in terms of costs and value.
… or programs that “learn” how to categorize and programs that just categorize
From the Seventies onward, many researchers have been investing time and resources to develop algorithms able to analyze texts already categorized by hand, in order to extract, automatically (or better… magically), the knowledge required to categorize other texts of the same kind.
Basically, the idea was (or rather is, because no solution has been found yet) the following:
• Take a list (or tree, often hierarchical) of the desired categories directly from the people who need a system for automatic categorization.
• Receive from the same people a set of documents (tagged by hand) for each category, selected from the larger set of available texts.
• Use the categorization tree and the set of documents to teach the program how to recognize the stylistic features of each category. This is pure magic, and it is normally referred to as training.
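To show what “training” means in practice, here is a minimal sketch in Python of one classic technique of this family, a naive Bayes word-count classifier. The choice of technique, the two categories and the four sample sentences are my own assumptions for illustration; the original approaches are not tied to this particular algorithm. It works on toy data like this, which says nothing about how it copes with real categories and real documents.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count word frequencies per category from hand-tagged examples."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Pick the category with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical hand-tagged training set, invented for this example.
training_set = [
    ("finance", "The bank raised interest rates on loans"),
    ("finance", "Stock markets fell as investors sold shares"),
    ("sport", "The team won the match with a late goal"),
    ("sport", "The striker scored twice in the final match"),
]
model = train(training_set)
prediction = classify("Investors worry about interest rates", model)
print(prediction)
```

Note that the program only counts which words appear under which label. It has no idea what a bank or a goal is, which is exactly why the quality of the result depends entirely on the examples it is given.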
This approach has produced one of the oldest and most persistent myths about Knowledge Management.
Although the solution soon proved to be inadequate, the will to accomplish this magic has been so persistent that even today the market insists on the possibility of obtaining a program, suitable for any field, that, starting from a few examples, can automatically perform a task that is often beyond the capacity of people.
The idea of such a system is understandable and desirable (maybe it’s the dream of everyone in the field of information management), but it has created exaggerated expectations that are absolutely unrealistic and even detrimental, because they interfere with the advance of the state of the art.
Systems of this kind DO NOT exist and, what’s more, as I often underline, there are no easy shortcuts for the solution of complex problems related to the management of information.
Still, in the case of specific categorization the myth can come true and reality is often better than expected.
In fact, although the categorization of content for personal use is still quite far from being economically feasible (it remains pricey, as the subjects are countless and tied to subjectivity), for a few years now it has been possible at the enterprise level to implement automatic categorization systems that are economical and effective, provided that all the parties (firm and technology supplier, client and vendor, etc.) share clear goals and work together to avoid traps.
We will see how in the next post on this subject.