As I have written many times before, semantic technology is unique in that it is able to go beyond the limits of other types of technology and approach the automatic understanding of a text. It is not perfect, however, and it certainly has yet to reach its maximum potential.
I realize that this is not easy to understand for those who don’t work in the sector (especially because there are so many false promises out there, which tend to create unreasonable expectations, muddled ideas and market chaos). Therefore, it might be useful to use a common experience as an example: our own learning process.
Let’s start from the beginning: from the moment we (human beings) begin to talk, understand, learn, go to school, etc… We require at least 12-15 years to be able to read a newspaper and understand the most general articles and this is thanks to the experience we developed while learning the meanings of words and experimenting with a great deal of different phrase constructions. Consequently, the learning process is lengthier when we decide to tackle more technical terms or specific topics.
Learning takes time, and the same goes for a computer. It’s true that a computer can process in nanoseconds while we think in milliseconds, but it is also true that our method of learning uses a device (the brain) that no one has been able to fully understand and that is able to do things that not even the most powerful computer can imitate.
In summary, it doesn’t make sense to expect a computer to perfectly analyze and understand a biology text, for example, without first having learned all it can about that subject. There are no shortcuts or magic formulas: learning a language is difficult and even automatic processes require time and labor.
This week we announced the appointment of Julie Hartigan, Ph.D. as CTO of Federal Programs, and Rita Joseph as Vice President of Federal Programs. The expansion of our executive team here in North America is directly in line with our overall goals and vision for growth in the U.S.
Julie and Rita have the extensive experience to help us drive our federal program initiatives. And we’re all satisfied that in an era where government seeks to “connect the dots,” both of these seasoned veterans will bring expertise, guidance and our advanced, high speed, multilingual semantic processing to federal government agencies.
Wet morning in Santa Clara. People seem to be looking at the sky as if it were falling. We are not used to so much rain here.
There are not many people at the conference. The audience is an interesting mix of semantic geeks, marketing and product managers, and business people. Definitely a very heterogeneous crowd.
The most interesting presentations are by Scott Prevost of Microsoft Bing and Mark Greaves from Vulcan.
Scott Prevost comes from the Powerset acquisition by Microsoft and is now part of the Bing project.
“The Semantic Web? It is already here,” he says. What he really means is that the Bing project makes extensive use of semantic technology like the kind we offer at Expert System. His opinion is that semantics, which is already applied under the hood in all major search engines, is here to stay and will gradually evolve and make the user search experience better, most of the time without end users even realizing that they are using semantic technologies!
Bing applies semantics in a lot of different ways:
They semantically interpret the user’s requests. Example: “who mocked Sarah Palin” returns not only results with “Sarah Palin” and “mocked”, but also “parodied”, “impersonated”, etc. We at Expert System provide a similar functionality for the Enterprise market with Cogito Answers.
They classify the search results so that they can be filtered and navigated in a better way by the user – similar to what we can do with the Cogito Categorizer.
They try to leverage RDF information added by publishers to their pages, similar to the rich snippets in Google. This information can be added to a search result to make it more interesting to the user and improve the search experience. A classic example is the search result for a restaurant returning the Yelp web page with the average score and the number of reviews. We can help publishers produce these snippets automatically using our Cogito Discover technology.
They apply semantics to their advertising platform so that the advertisement campaigns can be based on concepts instead of keywords as they are today. We offer a similar solution with our Cogito Advertiser product.
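The first point in the list, semantic interpretation of the query, can be illustrated with a toy sketch. It is not how Bing or Cogito Answers work internally: the `SYNONYMS` table, the function names and the two sample documents are all hypothetical stand-ins, and a real engine would draw its variants from a full semantic network rather than a hand-written dictionary.

```python
# Toy synonym table standing in for a real semantic network (hypothetical data).
SYNONYMS = {
    "mocked": {"parodied", "impersonated"},
}


def expand_query(query: str) -> list[set[str]]:
    """Return one set of acceptable terms per query word."""
    return [{word} | SYNONYMS.get(word, set()) for word in query.lower().split()]


def matches(document: str, expanded: list[set[str]]) -> bool:
    """A document matches when every query word is covered by one of its variants."""
    tokens = set(document.lower().replace(",", " ").replace(".", " ").split())
    return all(tokens & variants for variants in expanded)


# Hypothetical mini-corpus.
docs = [
    "Tina Fey impersonated Sarah Palin on television.",
    "Sarah Palin gave a speech in Alaska.",
]

expanded = expand_query("mocked Sarah Palin")
hits = [d for d in docs if matches(d, expanded)]
```

A plain keyword match on “mocked” would find neither document; with the expansion, the first one is returned because “impersonated” counts as a variant of “mocked”.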
Another interesting speaker is Mark Greaves from Vulcan Technologies. One of the most interesting points that he talks about is the fact that a lot of data that used to live in databases around the world is now moving into the “Semantic Web”. The advantages are huge:
Linking the data: Think about relational databases and how you would link one piece of data in one database to another one (maybe belonging to a different organization). It may not be impossible, but it is at least very difficult. One basic advantage of the Semantic Web is that data can be linked in all sorts of ways. The OWL standard in particular provides the means to connect data in different “clouds” very easily.
“Organic growth” of the data: The Semantic Web also allows for “organic growth” of data. As opposed to relational databases, where you need to define a schema before you even start entering any data, the Semantic Web is designed to provide the flexibility to add and modify data in different formats at different points on the web. With open data there is usually also a community that maintains it and makes sure it is accurate.
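To make the linking idea concrete, here is a minimal sketch in plain Python, without any RDF library. Each dataset is a set of (subject, predicate, object) triples, and `owl:sameAs` (a real OWL property) ties an identifier in one dataset to its counterpart in another. The `ex:` and `reviews:` identifiers, the data and the `describe` helper are invented for illustration.

```python
# Two independently maintained datasets, each a set of (subject, predicate, object) triples.
# All identifiers below are hypothetical.
restaurants = {
    ("ex:luigis", "ex:locatedIn", "ex:santa_clara"),
    ("ex:luigis", "owl:sameAs", "reviews:r42"),
}
reviews = {
    ("reviews:r42", "reviews:averageScore", "4.5"),
}


def describe(subject, *graphs):
    """Collect facts about a subject, following owl:sameAs links across graphs."""
    merged = set().union(*graphs)
    aliases = {subject}
    frontier = [subject]
    while frontier:
        node = frontier.pop()
        for s, p, o in merged:
            if p == "owl:sameAs" and node in (s, o):
                other = o if s == node else s
                if other not in aliases:
                    aliases.add(other)
                    frontier.append(other)
    # Return every non-linking fact stated about any alias of the subject.
    return {(p, o) for s, p, o in merged if s in aliases and p != "owl:sameAs"}


facts = describe("ex:luigis", restaurants, reviews)
```

Neither dataset needed a shared schema defined up front: the review publisher can add new predicates at any time, and the single `owl:sameAs` triple is enough to bring them into the merged view.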
There are also some recurring themes at the conference that seem to be common in many of the talks:
- Mobile Internet: Internet on mobile devices presents some specific challenges. The environment is different (e.g. no big keyboard or mouse and a much smaller browser). The market is huge, and so are the opportunities. Search engines, social networks and content providers discuss how to use semantics to develop this new space.
- “Internet of Data”: the huge amount of Linked Open Data that is available for free today on the Internet represents a new and ever growing opportunity that can be leveraged by computer programs to help us humans in our daily tasks.
- Social Networking Interaction: this is a concept that seems to mean different things to different people. Some people talk about how social networks can be represented in a “semantic” way with RDF so that they can be used by semantic web applications. Other people talk about the way people in social networks contribute to publishing and maintaining data in the Linked Open Data Cloud, much as the Wikipedia community has developed the huge Wikipedia knowledge base in the last few years.
Bottom line is that the Semantic Web is already here and the ideas discussed at Web 3.0 are mostly about opportunities to leverage it in order to make our life better…
by Walter Pezzini, VP of Pre-Sales and Professional Services at Expert System
When I present a company with our software solutions (which are based on a semantic technology that uses a rich and vast semantic network), I find myself in front of an audience who clearly understands the advantages of this approach. Yet, the series of concerns and doubts they raise often clouds the decision-making process and causes an incorrect evaluation of the actual return on investment.
Whether they are raised by IT managers, KM workers or software developers, the concerns fall into two categories: the first, the costs related to the setup and maintenance of the semantic network and the second, the costs related to the infrastructure required to maintain a performance level able to satisfy operations.
There are many reasons behind these concerns, but two factors seem to stand out. On one hand, there are the excellent (and often incorrect) communication activities carried out by the makers of systems based on keyword technology. They have almost succeeded in convincing the market that a complex problem such as information management can be solved with automatic shortcuts and that any other alternative would be unaffordable. On the other hand, the majority of researchers in this sector are still skeptical about systems which are entirely semantic. This is mainly caused by their inability (at least up to now) to develop software which can combine the advantages of increased text comprehension with performance in order to meet the demands of the real world (thus further strengthening the position of the competition.)
In the past ten years, many successful projects have been developed using our semantic technology. Therefore, I think it would be useful to use real data from our everyday experiences to help clear up the misconceptions which often cause people to make irrational decisions.
Costs of development
To add a new language to Cogito, two man-years of software development and 8-10 man-years of linguistic development are needed in order to refine the semantic network. You can quickly estimate the cost of such resources (if you are in Silicon Valley, divide your estimated total by 2!) and immediately understand that the initial investment is considerable, yet affordable considering the cost will be spread over all the implementations that will be done over time.
Cogito’s standard semantic network permits a horizontal management of content so that a significantly higher rate of precision and recall (compared to that obtained from a static system) is obtained with no need for further elaboration. For vertical implementations, start-up costs will be necessary so that a standard semantic network can be enriched with knowledge from a specific domain (the number of added concepts usually does not exceed 5,000); usually 20-30 working days are needed for a linguist to complete this task.
For those who believe that “languages constantly change and adding new terms can be costly,” may I remind you that even the most dynamic languages, such as English, increase by no more than 100-200 new terms (of common use) and fewer than 1,000 non-idiomatic expressions per year (in the worst case scenario, this could mean about 10 working days per year).
Those who criticize the complexity of managing a semantic network often refer to the complexity of managing lists of entities such as people, places, companies, organizations, etc. Traditional systems are able to recognize an entity only if it is present in a list; this aspect is often erroneously confused with semantic network management. A good semantic engine is able to recognize an entity based on the semantic role it plays within a text, and therefore does not require the creation or maintenance of lists. At the same time, it is also able to correctly recognize less frequent entities (which, for obvious reasons, have not been inserted in the list).
Costs of infrastructure
Cogito can analyze more than 120KB of text (circa 40 pages of text) per second with a common single-processor server. This kind of speed, combined with its linear scalability and low cost, makes Cogito a practical solution even in situations in which large quantities (tens of millions) of documents must be analyzed.
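As a back-of-the-envelope check on these figures, the following sketch estimates how long a large batch would take. The 120 KB per second and 40 pages per second come from the text; the average of 2 pages per document, the corpus of 10 million documents and the function name are assumptions chosen only for illustration, and "linear scalability" is modeled simply as dividing the time by the number of servers.

```python
KB_PER_SECOND = 120       # throughput quoted in the text (single-processor server)
PAGES_PER_SECOND = 40     # the same figure expressed in pages
KB_PER_PAGE = KB_PER_SECOND / PAGES_PER_SECOND  # 3 KB per page, implied by the text


def processing_hours(num_docs, pages_per_doc, servers=1):
    """Estimated wall-clock hours, assuming throughput scales linearly with servers."""
    total_kb = num_docs * pages_per_doc * KB_PER_PAGE
    seconds = total_kb / (KB_PER_SECOND * servers)
    return seconds / 3600


# Assumed workload: 10 million documents averaging 2 pages each.
hours_one_server = processing_hours(10_000_000, pages_per_doc=2)
hours_four_servers = processing_hours(10_000_000, pages_per_doc=2, servers=4)
```

Under these assumptions a single server needs roughly 139 hours (a little under six days), and four servers bring that down to about 35 hours.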
The development and maintenance costs of a semantic network are considerably lower than what is commonly assumed; the improvements in terms of the ability to manage information (even when very complex) are obvious even to those who are not experts in this sector. I am convinced that when these aspects can be objectively analyzed (when myths and obsolete information are ignored), the number of companies which adopt real semantic solutions will increase.
I usually don’t talk much about the technical aspects of linguistics or semantics, but I would like to draw your attention to www.phrasedetectives.org . This website uses a game format to gather useful material for refining algorithms to resolve anaphoras and co-references.
Since this material could also be useful for us, and those who dedicate some time can also win prizes, I thought it would be nice to point it out.
Internet search engines have made some serious progress over the past few years, from the first successes of Altavista and Lycos to the unmatched power (given its superior results) of Google. However, in the past two or three years, even Google has reached a kind of plateau; significant innovations are fewer and fewer and the competition (Bing in particular) is closing in faster than ever before.
Keyword technology (integrated with a series of statistical elements such as PageRank) has the enormous advantage of being simple, easily applicable to many languages and very fast. It has all of the characteristics which were crucial during the Web’s beginnings (when investments and processing power were much lower), but which are not so important today. When applied to the Web, keyword technology took advantage of the free and voluntary labor hours of hundreds of millions of people: people who, by searching and clicking on one or more results, provide search engine makers with an enormous quantity of information each day. This kind of information is priceless and helps to re-organize search results in the best possible way (it could be looked at as the price users pay to use free services: with labor instead of money).
Nevertheless, the time has come to integrate this technology with something new. There is absolutely no need to throw away what was done in the past (in many cases, search results are already quite good). We just need to add on new technology to improve the currently problematic search results and make searching as simple as possible (especially for those who are unable to conduct an efficient search, but could easily formulate their question to a person).
We can’t be afraid to get our hands dirty. We need to get to the heart of the language and culture of every nation; up until now the approach has been very “sterile” and has stayed at a symbolic level, barely scratching the surface. In order to understand meanings, we need to go in-depth and understand that a text is composed of phrases, concepts, attributes and relations which need to be analyzed as a whole (even on a cultural level). Only then can we succeed in capturing the content’s most important aspects and be able to respond to users’ searches in a timely fashion.
Significant investments will be necessary (each language is complex and differs from others and is often indivisible from a nation’s culture), as well as more manual labor paired with today’s greater processing power. These features will greatly accelerate the ongoing process, which will bring about an Internet search engine market led by two players: Google and Yahoo!
Smaller entities, whether already in existence or just starting-up, can still make their contribution, but only for innovative technological aspects or in vertical market contexts. It will be near impossible for them to compete against these two giants on any other level. Semantic technology is still young and has much room to grow; we should not expect any miracles or major revolutions in the near future. The path to follow is long and tortuous, but in the end, the potential reward could be quite astounding.
The industry was in an uproar when Eric Schmidt stated that it will be necessary to switch from words to meanings, in order to better understand what users are asking and what is contained in indexed documents. It would be a considerable change in direction for the Mountain View giant, which has always maintained that keyword technology is more than sufficient to obtain the best results. In a way, it’s really nothing new. For some time now, in the world of the Semantic Web, a sort of integration of semantic technology (which is able to understand meanings) has been going on within one of the most popular Internet search engines. When Bing was launched, Microsoft itself claimed to use semantic elements, but without actually specifying the types of elements and the ways they would benefit searches. However, the very fact that the industry leader is talking about ‘understanding meanings’ makes it legitimate and draws a line between before and after: the era of widespread web-applied semantics has officially begun.
When you search for something on the Internet, you always know which search engine you are using (Yahoo!, Bing and Google are the most popular), but when you search for something at work, sometimes you have no idea where the information comes from. You really don’t know which system you’re using, you just limit yourself to typing your request in the search box provided and hope to get the answer you were looking for.
The strange thing about this is that searching for information is a key activity in every company. Still, this market has not yet been tapped by the multinational software giants because it seems like they just can’t get their act together. Autonomy, the leading producer of company search solutions, is practically unknown to non-specialized personnel. Oracle and IBM play a small role and Microsoft actually had to purchase the Norwegian company, Fast, in order to try to grow in this sector. Google has a good share of the market thanks to its brand name, but its product does not provide results that top the competition. Moreover, users are giving negative feedback (this goes for all of the key players) on result quality and search times. Thus, we have a complete picture of a situation which must be addressed if we want to try to beef up companies’ efficiency.
The most promising technology is semantic technology. Although it hasn’t yet reached its maximum potential, it is already able to better “understand” content and identify the most important concepts and relations. We must also take note of the fact that it is impossible to have totally automatic solutions which magically know how to program themselves (an idealistic goal). The search engines must be developed around the knowledge and terminology used within the companies; if done in the right way, the task won’t be too complex, but its value will be priceless.
Change must occur in the technology used to analyze the content in the various types and forms of company documents. All of the above-listed search engines still use the old keyword technology, which has been strengthened throughout the years by statistical elements. This technology has the advantage of being stable and easily adaptable to different languages, but it is also very limited because it cannot, in any way, understand the language or the actual context of a text.
Several companies, such as mine, already offer search solutions based on the model I’ve just described (in both technology and methodology) and the results are quite interesting. It is probably just a matter of time before the big names decide to move in the same direction.
There’s a new Question Answering system (it’s something different from a search engine, as I wrote in the past) available online for testing: True Knowledge (it just needs a quick registration.)
From my point of view, the best way to test a system of this kind is by posing real questions to it: questions whose answers we have already looked for in the past, using standard search engines or other sources (especially Wikipedia). In fact, testing questions invented on the spot is not very useful (if not useless), because we may tend to follow the examples provided (too easy) or ask weird questions which no one would actually ask in a normal situation.
For this purpose, I’ve been collecting a list of about fifty questions (which I update from time to time.) I know it’s not a very long list, but it’s carefully created, balanced and representative of both easy and difficult tasks to be carried out in this sector, from the point of view of an insider.
Even though I’m well aware of the huge complexity that must be faced in order to implement effective systems of this kind, I have to say that here the results are disappointing: only 3 out of the fifty questions obtain a correct answer (while the answer to a fourth question is not fully satisfying), a success rate of about 7%. As most questions are very simple (for example, “Who won the Nobel prize for Chemistry in 1999”), I was actually expecting something better (for the above-mentioned question Google already provides the correct answer in the first link of the results). It’s true that this is just a beta version; however, even small variations of the suggested questions seem to invalidate the process, and this makes me doubtful about the soundness of the approach and its applicability in real situations.
When the system replies, it seems like magic indeed, but this happens so rarely that the magic disappears and what remains is the distinct sensation of a nice experiment but actually a useless tool. The effort is remarkable anyway, and I will surely test the system again in a few months, but at this stage of development I have to say that, unfortunately, the tool is not yet able to save us time (and as for the future, we will wait and see).
When I previously wrote about Wolfram Alpha, I had suggested that this kind of system could become more than just a competitor of Google (or other types of general search engines); it could be an excellent complement to meet specific and precise needs. Therefore, I was not surprised when I heard the news that Microsoft had sealed a deal with Wolfram to integrate Bing and Wolfram Alpha for searches/questions belonging to a subcategory of topics for which Alpha has good coverage.
The terms of the accord are still unknown, but I really believe it will be beneficial for both parties. After its initial launch, general interest in Wolfram Alpha greatly diminished (due also to the excessively high expectations which were created but could not be met…). An accord of this type could be just the thing to create a new buzz and new interest. Surely, it won’t help Bing make any improvements in quality, but for Microsoft, this agreement could help keep the hype about Bing and its ecosystem alive.