Jun
30
Filed Under (Semantic Intelligence) by M.Varone on 30-06-2009

I’ve already written many times about automatic categorization, but it is a complex topic with many different aspects. Although it may seem simple to the general public, I think it’s worth discussing once again (and again in the future).

This time I would like to focus on the categorization of contents dealing with generic and horizontal subjects, i.e. journalistic categories such as news, sports, economics, politics and so on. For those, like us, who have developed categorization software for years, the abundance of non-institutional content on the Web (mainly blogs and similar) offers more opportunities than in the past to apply our applications successfully. In theory, it’s a winning solution not only for those who develop the technology, but also for those who provide the content because categorization rules need only minor customization and as a result, improving content with quality information becomes quick and effective.

Nevertheless, two relevant aspects must be considered with close attention, in order to avoid problems during implementation.

The first aspect is implicit in the personal and subjective nature of such information sources. When writing blogs, authors quite often (if not always) mix posts on their favorite subject or field of expertise (cinema, sports, technology…) with other more intimate and personal posts, which do not necessarily have a specific subject. When we try to categorize this kind of content using standard systems developed for the well-focused articles of periodicals and newspapers, we tend to obtain background noise. To minimize such noise, we need to be aware of the problem and use semantic technology in an expert way. When we do, the quality of the final result is usually quite good and can provide added value to users.

The second aspect is the average length of this content. Quite often a post is short and does not exceed 500-600 characters, making it difficult to obtain enough reliable information to select the right category. The readers of a blog already know the subject because they have read previous posts, and therefore do not need further context to find the main subject. For a program, the task is definitely more complicated, because very often it does not analyse the posts of a blog one after the other, or all from the same information source, but receives them in a random order, or one by one. To manage this aspect correctly, we need to accept some compromises and modify the system progressively until we reach a good balance.
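One possible form of such a compromise is to let the system abstain when a post offers too little evidence, and to borrow context from earlier posts of the same blog when it is available. The sketch below is only an illustration under stated assumptions: the keyword lists, the `MIN_HITS` threshold and the `categorize` function are hypothetical, and simple keyword counting stands in for real semantic analysis. The 600-character cutoff comes from the figure mentioned above.

```python
import re

# Hypothetical keyword lists; a real system would use far richer semantic rules.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team", "league", "coach"},
    "politics": {"election", "parliament", "minister", "vote", "senate"},
    "technology": {"software", "smartphone", "browser", "startup", "server"},
}

MIN_CHARS = 600  # posts shorter than this are treated as "short" (figure from the text)
MIN_HITS = 2     # assumed: minimum keyword matches needed to assign a category


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def categorize(post, blog_context=""):
    """Return a category name, or None when the evidence is too weak."""
    text = post
    # Short posts borrow context from earlier posts of the same blog, if any.
    if len(post) < MIN_CHARS and blog_context:
        text = post + " " + blog_context
    tokens = tokenize(text)
    scores = {
        category: sum(1 for token in tokens if token in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Abstaining is the compromise: no label is better than a noisy one.
    return best if scores[best] >= MIN_HITS else None
```

Called on a short post alone, `categorize("What a goal last night!")` abstains, while passing earlier posts of the same blog as `blog_context` can supply enough evidence to assign a category.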

For these kinds of projects, technology is very important, but equally important is the expertise of those who have worked for years in this field: as they say in Naples, no one is born learned.

Jun
22
Filed Under (Semantic Intelligence) by L.Scagliarini on 22-06-2009

After a presentation from the New York Times, the Semantic Technology Conference of 2009 came to a close. Evan Sandhaus, the newspaper’s semantic expert, talked about how, since the beginning of the twentieth century, the Times has made historic efforts to use and add value to its immense quantity of content, and about how the current crisis in the publishing industry calls for more decisive and significant strategies. During the presentation, Evan officially announced the NYT’s participation in the Linked Data project. This will make the newspaper’s massive index easily accessible through the Internet and linkable to other content or applications. This is an important step because it opens the doors to a new generation of Internet applications which could revolutionize our lives.

Another important highlight was the active participation of the three main search engines: Google, Bing and Yahoo. Google received applause for the announcement that it will be adopting new standards which will lead the company on a slow journey towards more intelligent search methods. This is quite a change from last year, when Marissa Mayer, vice president of Search Product and User Experience, stated that Google wasn’t interested in semantic search.

Another significant element emerged on the final day: the growing efforts of the American government to increase, over the coming years, the availability of its own data (Operation Transparency) and make it accessible even through intelligent technology.

The Semantic Technology Conference was also a great success for Expert System. In our third consecutive year of participation, we were involved in five workshops and panels in which the explosive potential of two areas of great interest was examined: semantic applications for mobile networks and online advertising.

To sum up these few days, I can say that the fifth edition of the Semantic Technology Conference was a success. The participants totaled 1,170, a 16% increase from last year. Specifically, there was an increase in participation in the business area, which accounted for almost 20% of total participation. Concrete solutions and activities were presented and, given the economic crisis, much attention was paid to the success stories of cost-saving projects for strategic activities, such as customer care and competitive intelligence.

And that’s all, folks… I’m off to dine in a restaurant on the Californian coast with a great ocean view (for a couple of days a year it’s not completely shrouded in fog).

Jun
19
Filed Under (Semantic Intelligence) by L.Scagliarini on 19-06-2009

In an event where new buzzwords like “cloud”, “linked data” and “shared ontologies” prevail, the moment has finally come for a rising star (which some think is already on the brink of falling): semantic search.

Yesterday, Carla Thompson (Guidewire Group) conducted a “Semantic Search” panel in which representatives from search giants took part: Scott Prevost from Bing’s Powerset division, Andrew Tomkins from Yahoo, and the renowned Peter Norvig from Google. Also participating were Riza Berkan, CEO of Hakia, William Tunstall-Pedoe from True Knowledge and Tomasz Imielinski of the revived (for the third or fourth time now) Ask.com.

Here is a nutshell version of some of the highlights from the Question & Answer session:

Carla Thompson: What makes your engine different from the others?

Ask.com: the ability to give answers (extracted from structured and non-structured sources) to questions formulated in natural language. The audience found this statement rather amusing because just a couple of years ago, Ask.com had repositioned itself as a traditional keyword engine, in an attempt to make people forget about the shoddy performance of its previous Q&A engine released in the late nineties.
Powerset/Bing: the ability to understand the user’s intent and to organize the pages based on this intent.
Google: the ability to provide a set of information which is complete, accurate and fast.
Hakia: the ability to define a ranking based on credibility and not on the popularity of the sources.
True Knowledge: the ability to provide high-quality answers to direct questions by extracting such answers from structured (databases) and non-structured (texts) sources.
Yahoo: the URL’s ability to make semantic sense, see SearchMonkey, which allows results to be seen in a different way.

Carla: Why focus on semantic search when the public doesn’t seem to demand it (nor understand what it is)?

Ask.com: normally, a winning product doesn’t originate from public demand, but from the ability of a company to anticipate needs that the average user can’t sense yet. Increasing the precision of answers to specific questions, such as we are doing, has a value for the user. The user will recognize this value once he/she has experimented with research of this kind.
Google: why innovate if something works? Because people don’t realize what can be achieved, so they are happy with what they have. However, when they see new features which can improve their experience, they don’t hesitate to welcome them with enthusiasm.
Powerset/Bing: the ability to answer questions and present complex information in a clear and precise way is valuable to users. Our outlook hasn’t changed after we were bought by Microsoft. We still continue to focus on in-depth content analysis and provide precise answers to users’ questions thanks to disambiguation, which objectively, we are the only ones to offer.

I do not believe that this feature is currently present in Bing.

Hakia: we aren’t the only ones offering this, but we do everything from scratch, including an ontology which is independent of language and noticeably improves performance in terms of precision, something that every user needs.
Yahoo: what Ask said was right. We are already evolving to anticipate the competition. Today, thanks to semantics, we display information in a different way and the users’ response is decisively positive.

Carla: Is it possible to objectively compare two search engines?

Ask.com: yes, and there are simple ways to do this. For example, a search engine should allow the user to search for “the top 10 songs of all-time”, or for the same phrase with “ten” instead of “10”, and return the same results (but this doesn’t always happen). Likewise, entering “Tom Cruise” and “Tom Cruise actor” should give the same results, but instead the results are generally worse when information is added. We at Ask invest in ensuring that a search engine can meet these standards.
Hakia: it is possible to perform objective tests, even if, in the end it is the user who chooses one search engine over another.
Powerset/Bing: there are ways to conduct tests. In terms of level of semanticity, my colleague from Ask is correct, but it is important to add other criteria as well, for example the capacity to extract relations.
Yahoo: it is very difficult. Our research shows that users are very conservative and tend to reject small changes. We think that changing the presentation of the results based on the search objective is very important and must be a criterion for comparison.
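The kind of test Ask.com describes can be sketched in a few lines: run two phrasings of the same query and measure how much the top results overlap. This is only an illustrative sketch, not how any of the panelists' engines are actually evaluated. The `jaccard` and `consistency` helpers, the `toy_search` stand-in and its tiny index are all hypothetical, and Jaccard overlap is just one possible measure.

```python
def jaccard(results_a, results_b):
    """Overlap between two result lists, from 0.0 (disjoint) to 1.0 (same set)."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consistency(search, query, variant, top_k=10):
    """Compare the top_k results a search function returns for two phrasings."""
    return jaccard(search(query)[:top_k], search(variant)[:top_k])


# Toy stand-in for a real engine (hypothetical): it normalizes "10" to "ten"
# and then looks the query up in a tiny hard-coded index.
TOY_INDEX = {
    "top ten songs of all-time": ["url1", "url2", "url3"],
    "tom cruise": ["url4", "url5"],
    "tom cruise actor": ["url4", "url6"],
}


def toy_search(query):
    normalized = query.lower().replace("10", "ten")
    return TOY_INDEX.get(normalized, [])
```

With this toy engine, the “10” and “ten” variants score a perfect overlap, while “Tom Cruise” and “Tom Cruise actor” share only one of three distinct results, which is the kind of gap the test is meant to expose.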

Carla: Which is better: traditional or question & answer search engines?

True Knowledge: in terms of input, there shouldn’t be any difference in the results if the user inserts a question in natural language or in keywords. The excellence of a search engine is its ability to provide precise answers.
Google: cut-and-dried answers are not always best. Everything depends on the question. For example, if someone is interested in knowing how meditation can improve their life, receiving a series of informative content is probably better than a cut-and-dried answer.
Hakia: offering the opportunity to ask questions is important. A search engine must be able to understand the question and continue to improve in situations in which the initial results are insufficient.
Yahoo: the notion of supporting the user is important, but continuously trying to provide an answer isn’t enough to improve the search. What is needed are new types of interfaces.
Powerset/Bing: yes, question & answer engines are better, but only for mobile applications or for certain types of requests.

Here is the last question that Carla Thompson wasn’t able to ask due to time restrictions: Is semantic search more important for the business market or for the consumer market?

In the market of business search engines, semantic search identifies concepts and relations (who does what); therefore, it is more important there. Businesses and so-called knowledge workers already know this. Although semantic search has become a bit less popular in the consumer market, it still takes the prize as an important aspect of the business search engine industry.

Jun
19
Filed Under (Books & News Related) by M.Varone on 19-06-2009

The buzz surrounding the world of search engines certainly hasn’t died down in the past few weeks. In fact, the people of WolframAlpha have released a package of updates, which according to them is significant, although there really doesn’t seem to be anything beneficial for users at the moment. With this new entry and Microsoft’s recently released Bing, Google may have been feeling left out. Not wanting to share the spotlight, they have announced a series of developments which are not particularly significant, but nonetheless, have served to make journalists write about Google, too.

Google released Google Squared (coincidentally, during the launch of Bing ;-)). For the time being, it’s just an application available from Google Labs, but based on the emphasis placed on its presentation, it seems that this is a service in which they are investing significant sums of money. What Google Squared actually is may be unknown even to those at Google, and I certainly won’t be the one to unravel the mystery! Of course, I did try it out and I shared my impressions and opinions with my colleagues.

The general idea they seem to want to convey is that it is possible to transform unstructured data (meaning information not yet organized in a database) into easily accessible, structured knowledge (basically, the Holy Grail of the Semantic Web). Unfortunately, the outcome is underdeveloped and often quite useless (it seems that they have accelerated its release for marketing reasons). It is true that in some cases, when 4 or 5 similar elements are entered into the system, it is able to provide tables with specific attributes. But it is also true that when testing things less similar to the examples, the results are unpredictable. It also seems that Google Squared retrieves the reasonable answers mostly from Wikipedia, a data source which is already partially structured, which casts doubt on the possibility of applying this approach to content which is truly general.

From a technical point of view, interesting and important developments can be noted, but it is impossible to tell whether they will be useful in the coming months or years. At the moment, it seems to be more of a research project than something that could actually turn out to be a real product.

Jun
18
Filed Under (Semantic Intelligence) by L.Scagliarini on 18-06-2009

Tom Tague’s presentation on Open Calais kicked off the day on June 16th at the Semantic Technology Conference 2009 and inspired me to try to outline what seem to be the emerging trends (including risks and opportunities) in the semantic technology market, its current state and some possible developments.

First off, in order to bring about the widespread use of semantic technology, the instruments for development and integration need to be available. The adoption of semantic technology by companies won’t come about just because it’s cool or because it’s discussed on Twitter. It will happen only when the technology, or the applications developed from it, demonstrate a clear-cut way to save on, improve on and speed up businesses’ strategic activities. This means that those who propose a change toward semantic instruments can’t just market its practicality; instead, they have to make its potential value for businesses stand out.
For social networking applications, a strategy has to be devised to create solutions that can generate income. Whether or not this will happen in the near future remains a mystery. The ability of semantic technologies to prove themselves and be decisive enough to take this leap will be yet another element influencing their success. Even if it seems unlikely to happen quickly, it would be nice to see the results from the first so-called semantic social networks…

Advertising remains a word which is taboo for every technical expert in the industry, but it is also the key to helping content providers generate greater revenue and profit. Semantic technology’s validity in the improvement of efficiency and productivity of advertising should really be taken into consideration. Personally, I believe that it is the most immediate factor which could contribute to an explosive growth in the industry.

The development of semantic search engines which can directly compete with Google seems unlikely. If we focus more on the vertical sectors however, the opportunities are colossal and at this point, it is clear that resources and the will to realize semantic engines are abundant.

Content providers and news aggregators can use semantics to extract more value from available information (greater possibilities for access and utility). The opportunity to aggregate, classify and cross-reference content using semiautomatic mechanisms gives aggregators, such as the Huffington Post or Techmeme, remarkable potential advantages.

Jun
17
Filed Under (Semantic Intelligence) by L.Scagliarini on 17-06-2009

The Semantic Technology Conference, currently in its fifth edition, has become one of the most important annual events in Silicon Valley. It started as an appointment mainly for techies and researchers, but in just a few years, it has transformed itself into a concrete opportunity for the entrepreneurs of the IT industry to show the market that semantic technologies can generate value.

We are very happy to be a part of SemTech again this year, and are honored to have been chosen as speakers in various work sessions. Upon my arrival at the Fairmont Hotel, I met Mills Davis, most likely the industry’s most expert analyst and author of the annual Semantic Wave Report: a true opera omnia with over 700 pages of analysis of trends and markets related to semantic technology. Mills is enthusiastic about SemTech’s success and is very optimistic about the industry. He believes that it is quite evident that the World Wide Web is becoming semantic, namely an advanced version of the current Web, where the information contained in a page is much more than a URL address, a title, an author, a date or some keywords used by traditional search engines in order to find a page. Things are abuzz in the business world as well. In fact, just as last year, the first day of the Semantic Technology Conference (http://www.semantic-conference.com/) was dedicated to single-topic technical sessions. The sessions are aimed, on one hand, at educating so-called beginners who are interested in understanding how semantic technology can be useful to their organization, and on the other hand, at presenting a highly-qualified audience with the latest developments from the top university and company research labs.

I believe that the worldwide distribution of semantic technologies can be made possible by illustrating their concrete value for businesses and individuals, so I have mostly concentrated on the presentations related to business. The most interesting presentation led me to endure nearly two hours of arctic cold in the conference halls of the Fairmont Hotel (evidently, the economic recession didn’t hit hard enough to make American hotels ease up on the air-conditioning; they are all set to the temperature of the North Sea in February). It was about how semantic technology is transforming traditional business activities and competitive intelligence.

The outlook presented by various speakers, in particular by Daniela Barbosa of Dow Jones, is that the incredible amount of information available today must be seen as a unique opportunity to get better acquainted with the market, thereby reducing the risks that businesses are exposed to. For this opportunity to be taken advantage of, and not become a problem, it must be handled with the right resources and technologies.

Well-developed semantic technology which processes text allows today’s user to efficiently extract and order data available outside of the organization (the well-known open information sources) and to create very advanced intelligence models to manage different business activities. For example, a scenario was presented about an executive search company which was able to develop a model with the help of semantic technology. The model began with the analysis of articles and notices of corporate events and public announcements. It was then able to forecast which top managers were most likely to step down from their posts. The accurate performance of this forecasting model allows the business to stay one step ahead of its competitors, by enabling it to contact and make offers to the listed managers before the news becomes public knowledge. Another, more common scenario involved businesses within the financial sector, which were able to create predictive models. By conducting an automatic analysis of news streams, they attempted to anticipate (apparently with good precision) future buyouts and mergers, and subsequently used that information as a guide for investment strategies. These types of analysis, clearly essential in both sectors, have always been conducted. What is different today is that the quantity of available information and the possibilities offered by semantic technology which comprehends text allow us to reach levels of sophistication and precision which were unthinkable when everything was done manually by employees.

It should also be emphasized that it is too simple to think that semantic technology alone will solve all of our problems. The biggest obstacle for these models is, in fact, the assumption that because the past has always evolved in a certain manner, the future will continue to evolve in this same manner. Therefore, to make a difference, it is important to have the best technology for automatic text comprehension (to have access to the greatest possible quantity of data), but it is also necessary to take into consideration the opinions of analysts and business experts about unexpected occurrences.

We have now entered into the heart of the conference and the growing interest in semantic technology is quite obvious: for the first time, all of the giants from Google to Yahoo to Oracle, etc. will take part in presentations and panels.

Jun
05
Filed Under (Semantic Intelligence) by L.Scagliarini on 05-06-2009

I have been observing with interest a new trend that is finally emerging among vertical search engines. In order to improve their users’ experience and differentiate themselves from the field crowded with players offering similar services and features, search sites like Mobissimo, LastMinute.com, Dot Homes or service sites like Presdo.com have launched a new feature: a simple one box search to allow consumers to enter in one box all their requirements about a vacation, a home, an auto, etc.

I believe that this is an extremely powerful feature for different reasons.

  • The most obvious is that it can streamline and simplify a search process that unfortunately is still pretty cumbersome. For example, making a reservation for a trip still requires several steps, including navigating and entering information in several boxes, then sorting and filtering the results in “predefined” ways that are limited and possibly irrelevant for the user, in order to find the best option. Referring to travel websites, when I go to London I generally like to make reservations in hotels in the Earl’s Court area (I like the restaurant selection in the area) with breakfast not included but with free internet access. On a traditional site like Expedia, this means 4-5 boxes plus 2 filters, and often the information on whether internet is free is not available as a filter criterion.
  • In addition, an effective one box search could significantly extend the usability of the site to mobile platforms. Even considering that websites are definitely easier to access on new generation smartphones (like the iPhone), the experience is far from ideal, and it does not really address the constraints of people who access the site while on the go, such as lack of time or less than ideal conditions for using a site. An effective one box search would allow users to express their needs as easily as sending an SMS or writing a Twitter update, returning only those entries that really address the user’s needs in order to help him/her jump immediately to the purchase.
  • Lastly, the technology that enables the effectiveness of one box search could also enable the expansion of the search and filtering criteria for the user by mining directly from the free text description of a hotel, the extra elements that could be relevant for the user (see my free internet example above) but that are not part of the standard set of criteria made available on the web site. For example, the extraction of elements, like the availability of a playground or the size or the quality of the illumination of a parking lot, or any other element that could make the difference for many potential customers, could be made available together with the standard search and filtering criteria.
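To make the idea more concrete, here is a minimal sketch of how a free-text request like my London example could be turned into structured filters. It is an assumption-laden illustration, not the approach used by any of the sites mentioned: the `parse_request` function and its vocabularies are hypothetical, and simple phrase matching stands in for the semantic analysis a real one box search would need.

```python
import re

# Hypothetical vocabularies; a production system would rely on semantic
# analysis rather than fixed phrase lists.
KNOWN_AREAS = ["earl's court", "soho", "camden"]
AMENITY_PHRASES = {
    "free_internet": ["free internet", "free wifi", "free wi-fi"],
    "playground": ["playground"],
    "parking": ["parking"],
}


def parse_request(text):
    """Turn a free-text request into a dict of structured search filters."""
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    filters = {"area": None, "breakfast": None, "amenities": []}

    for area in KNOWN_AREAS:
        if area in lowered:
            filters["area"] = area
            break

    # Check the negative form first so "breakfast not included" is not
    # misread as a request for breakfast.
    if re.search(r"\b(no|without) breakfast\b|breakfast not included", lowered):
        filters["breakfast"] = False
    elif "breakfast" in lowered:
        filters["breakfast"] = True

    for amenity, phrases in AMENITY_PHRASES.items():
        if any(phrase in lowered for phrase in phrases):
            filters["amenities"].append(amenity)
    return filters
```

A request such as "Hotel in Earl's Court, breakfast not included, with free internet" would then map to an area filter, a breakfast flag set to false and a free internet amenity, including the criterion that the standard filters of a traditional site often lack.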

The success of the one box search feature is strictly linked to how well the system performs in terms of understanding the request without requiring the user to follow a rigid syntax. From this point of view, the sites I have tested are still falling short of what, in my opinion, is the minimum requirement to significantly drive up adoption. For example, Mobissimo seems to have a hard time recognizing New York as a starting location for an air trip (it gave me a warning message saying there were too many locations starting with New) but works perfectly if I use LGA or JFK. LastMinute.com has implemented a mechanism that leads the user into using a syntax, making the search more cumbersome than it should be, and Dot Homes was not able to disambiguate between Belmont as a road and Belmont as a city.

This is another practical example of where solid semantic technologies can make a difference. Publishers should not ignore the fact that this is not a trivial issue, and that cutting corners by implementing internally developed tools or leveraging standard non-semantic technologies is a workaround that doesn’t pay off, because such tools can’t address the complex issues or deliver the real value that one box search can provide.