Many technology companies have managed to flood the market with their “magic” automatic solutions. They promise precision tools which are simple to use, have enticingly colorful dashboards and provide full coverage of “consumer sentiment” with minimal effort. Many of their websites state that once top executives are equipped with their magic tool, the task of reading and interpreting the extracted data can be easily delegated to their subordinates.

As I am directly involved with this aspect at my company, I can’t seem to figure out if this claim is just an aggressive marketing campaign or if it is just a way to tell customers what they want to hear. Either way, it has created a situation where the only dialogue with potential customers is based on looking at the data contained in these reports and trying to find the errors in the extraction, thus casting doubt on whether the technology is actually ready for the market.

This vicious cycle needs to be stopped. It could be useful to remember the following guidelines to create a more productive climate and provide value both to businesses and to technology companies:

1) Online sentiment analysis is just another element of competitive intelligence and should be handled accordingly
2) Even the more established technologies (ERP, BI, CRM) are not perfect and do have a margin of acceptable error
3) A group of analysts should be employed to examine the data, extract the knowledge and scan the sources
4) Set priorities and be conservative: avoid incidents first
5) Learn how messages propagate

Mar
25
Filed Under (Semantic Intelligence) by M.Varone on 25-03-2010

The current economic crisis is creating new possibilities for semantic technology. Businesses need to cut costs (which is usually done by downsizing personnel) without making any cutbacks in the quality of their services. No one can afford to let quality go downhill, because once it does, the competition is more than ready to pounce on the opportunity.

The fact that there is growing interest in semantic solutions (like automatic categorization and text mining) is obviously a positive factor (for us, at least :-), but the crisis has also complicated the situation. Companies express interest in implementing innovative solutions; however, their present-day budgets are quite limited, especially when it comes to purchasing software licenses. This creates a kind of catch-22 situation because investing in innovative technology is often the key to improving efficiency within a company (when correctly implemented).

It is very important to thoroughly weigh the pros and cons of this situation. Cost may seem to be a problem at first (because it may exceed the allotted budget), but in reality it is quickly absorbed thanks to the level of quality which is gained (quality which was not compromised by the need to reduce costs). This type of approach often makes it possible to find a solution which benefits both sides (customer and supplier) and guarantees an outcome which lives up to expectations. Of course, flexibility is required on both ends and supplier experience is essential to minimize implementation risks.

If a budget is truly minimal, it is usually best not to play around with money. Money should never be wasted, especially if it’s scarce. Sometimes, though, it is possible to scale back the final objectives, focus only on the fundamental aspects, and still obtain positive results.

Sooner or later this crisis is bound to end! Those who invested wisely during these hard times will have an advantage over those who chose not to take any risks.

Feb
24
Filed Under (Books & News Related, Semantic Intelligence) by B.Aker on 24-02-2010

In 1968 Arthur C. Clarke introduced us to his computer named HAL. He had us believing all we needed to do was talk to HAL. HAL would listen, understand and do what we wanted. Until HAL, that is, developed an evil soul and did nasty things to humans. The evil soul is pure fiction but HAL is not.

2010 is the year we get to meet the real HAL.  He may still be a child but he is growing up fast thanks to four trends in computing that have coalesced and are now ready to explode.  These trends are The Cloud, The Pipe, The UI and The API. A depiction is below.

The Cloud is elastic computing power.  It is more than renting a server from a service provider.  It means automatic, on-demand scalability onto as many servers as are needed to accomplish a task or take care of a sudden flood of customer needs.  The Cloud gives any size organization the appearance and performance of Google-sized computing.

The Pipe is everywhere, all the time, high speed internet connections. Typical wired internet speeds today are over 6MB per second and wireless connections are quickly catching up with that – 3G and soon 4G deployments are common.  The biggest trend in mobile devices is smart phones.  These are devices that do more than route phone calls, but also manage email, calendars, music, applications and the entire internet.  But of course the processing power to do these things is not all on the device.  Instead it’s up in the cloud.

The UI or User Interface is smart. Speech to text and semantic technologies combine to allow for the appearance of intelligence.  Computers or mobile phones spoken to in natural language understand and then locate, calculate, connect, tally, and display the answer to queries rather than simply list resources for you.  Try Nuance, Vlingo or Google Mobile for speech to text accuracy.  Try us at Expert System for semantic processing accuracy.

The API or Application Programming Interface means really useful applications. APIs package the first three trends so that creative types can quickly make applications for specific tasks, domains or verticals, and make lots of them. Look how many iPhone / iPod touch applications have been built in the last 2 years alone. Many have been built by individuals and not large corporations.

These four trends create a virtuous cycle.  They combine to bring a sudden higher platform of computing.  One that engages the imagination, has enormous productivity, improves processes and creates new value out of existing information resources.

No, you can’t really see or touch HAL. But be assured he is there, working in the background, growing, learning and getting smarter every day. He is ready to serve you. Just ask.

In science we have tackled great problems. Only a few years ago we mapped the human genome. Imagine unlocking the code of what makes us human. More recently, scientists have been studying how proteins operate. Or, more precisely, how they fold. It is in the folding that we learn what a protein is intended for and what job it is supposed to do. Once we unlock this we will know how diseases form, replicate and, most importantly, how to beat them… all of them.

So what does the information science of semantics have to do with proteins?  Semantics fold too.  That’s what.

Scientists studying proteins that fold are discovering their most important and elemental attributes.
The same is true with semantics. Boil a sentence down to its most elemental parts and you get what is called a triple – that is, a subject, a predicate and an object. So consider the sentence below:

“John works in the White House”.

Subject: Who or what does the sentence describe? Obviously, that would be “John”.
Predicate:  What is the property that describes or connects the subject to the rest of the sentence?  That would be the verb “works”.
Object:  What is the value of the property?  That would be “White House”.
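
As a rough illustration, the breakdown above can be written down as a small data structure. This is only a minimal Python sketch: the `Triple` name and its field names are made up for the example and do not come from any particular semantic engine.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement (illustrative names only)."""
    subject: str
    predicate: str
    obj: str


# "John works in the White House", broken down by hand as in the text.
john = Triple(subject="John", predicate="works", obj="White House")

print(john.subject, "-", john.predicate, "-", john.obj)
```

Nothing here extracts the triple automatically; the point is only that, once extracted, a sentence becomes a record with three named roles.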

So that example is pretty easy. What about a longer sentence? Something like this:

“John, a favorite of President Obama from his days in Chicago,
now works as public liaison in the White House”.

Now the job is tougher.   It is clear John is still the subject of the sentence.  It might be tempting to assign “favorite” as the predicate since it connects John to President Obama.  But the commas indicate to us that this is really a clausal description of John and not the central action of the sentence.  So we are left with “works” as the predicate.  But what does “works” connect to?  Is it “public liaison” or “White House”?  The stronger connection is “public liaison” since this describes the kind of work John does.  The White House is just the location of that work so it is nothing more than a qualifier.

When we learned to read as a child we were taught to reason through these example sentences pretty much like I just described.  Of course you don’t think about it very deeply – the understanding of the sentence, the essence of it comes naturally:  John – works – public liaison.  The rest just colors these most important facts.

Semantics is the information science of establishing meaning over text without human intervention – and this includes establishing the triple of any sentence. This is also what is called the Semantic Web or Web 3.0. From a diagram perspective this basic notion is sometimes represented like this:

You will note this diagram looks much like cells or proteins linked together. There is a reason for that. Like proteins that fold and match up along their common edges in order to do their work, semantic triples link up too. Switching to a protein example now, let’s consider these two sentences:
1.    Protein X adds two molecules of zinc to the cell for each molecule of oxygen.
2.    Protein Y adds one molecule of copper to the cell for each molecule of iron.

Our diagram now looks like the following:

So what happened?  Each sentence has its own triple.  But they have a common predicate of “adds”.  So we can diagram two subjects and two objects but with a common predicate.

Just like proteins that fold and combine to make something new, we have done the same here in the science of semantics. Because we boiled the sentences down to triples and stored them in a place that can be queried, we can ask for all predicates that match “add(s)”.
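
The idea of a queryable store of triples can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions: the triples are typed in by hand, the objects are shortened to the molecule name, and `by_predicate` is a hypothetical helper rather than the API of any real triple store.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two example sentences, reduced to triples by hand.
store: List[Triple] = [
    Triple("Protein X", "adds", "zinc"),
    Triple("Protein Y", "adds", "copper"),
]


def by_predicate(triples: List[Triple], predicate: str) -> List[Triple]:
    """Return every triple whose predicate matches exactly."""
    return [t for t in triples if t.predicate == predicate]


matches = by_predicate(store, "adds")
for t in matches:
    print(t.subject, "->", t.obj)
```

Both sentences come back because they share the predicate, which is the common node in the diagram described above.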

Why is this important? It gives scientists, researchers, business professionals and citizens a chance to tap into and glean true meaning from their documents, email or the web. This is far different from a Google-like keyword match. The word “add(s)” certainly matched, but it was the word’s role that also matched.
But what if the author of sentence (1) did not use the word “adds” but instead used the word “increased”? A keyword match would fail here. But semantics can also understand that “add” and “increase” are related, and so the query would result in the same scientific discovery of proteins that add/increase molecules.
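
The difference between a keyword match and a match on meaning can be shown with a small Python sketch. The `CONCEPT_OF` table below is a hand-made, hypothetical stand-in for what a real semantic network would provide; it is an assumption made for the example, not how any specific product works.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-made synonym table: each verb form maps to one concept.
CONCEPT_OF: Dict[str, str] = {
    "add": "ADD", "adds": "ADD", "added": "ADD",
    "increase": "ADD", "increases": "ADD", "increased": "ADD",
}

triples: List[Tuple[str, str, str]] = [
    ("Protein X", "increased", "zinc"),  # sentence (1), reworded
    ("Protein Y", "adds", "copper"),     # sentence (2)
]


def by_concept(items: List[Tuple[str, str, str]], verb: str) -> List[Tuple[str, str, str]]:
    """Match on the concept behind the predicate, not on its spelling."""
    wanted = CONCEPT_OF.get(verb.lower())
    return [t for t in items if wanted and CONCEPT_OF.get(t[1].lower()) == wanted]


keyword_hits = [t for t in triples if t[1] == "adds"]  # plain string comparison
concept_hits = by_concept(triples, "adds")             # comparison by concept

print(len(keyword_hits), len(concept_hits))  # prints: 1 2
```

The keyword comparison misses the reworded sentence, while the concept comparison finds both.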

Now let’s change sentence (2) from Protein Y to Protein X.  A more restrictive query on a store of triples where you would ask for both subject and predicate matches would result in a diagram like below.

Again why is this important?  Because now a scientist can rely on the smarts built into such a search index to deliver all the Protein X’s that add/increase [some kind of] molecule to a cell.  The interesting thing for the scientist will be to group and sort the kind of molecules that will be added to the cell.
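
The more restrictive query, followed by the grouping and sorting step, might look like the Python sketch below. It is a toy example: the `query` helper is hypothetical, and the third row is invented purely to show that non-matching triples are filtered out.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Sentence (2) now also describes Protein X.
store: List[Triple] = [
    Triple("Protein X", "adds", "zinc"),
    Triple("Protein X", "adds", "copper"),
    Triple("Protein Z", "removes", "iron"),  # made-up extra row to show filtering
]


def query(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None) -> List[Triple]:
    """Keep the triples that match every field that was supplied."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
    ]


# Ask for both subject and predicate, then sort the molecules that come back.
molecules = sorted(t.obj for t in query(store, subject="Protein X", predicate="adds"))
print(molecules)  # ['copper', 'zinc']
```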

This is real discovery in science.  It is semantics that get language out of the way.  It is semantics that build in smarts to a system so the scientist can find, analyze and create new cures for diseases that have yet to be worked on effectively.  So… semantics and folding proteins do have a lot in common – more than you thought.

Feb
08

As I have written many times before, semantic technology is unique in that it is able to go beyond the limits of other types of technology and approach the automatic understanding of a text. It is not perfect, however, and it certainly has yet to reach its maximum potential.

I realize that it’s not that easy for those who don’t work in the sector to understand (especially due to the fact that there are so many false promises out there, which tend to create unreasonable expectations, muddled ideas and market chaos). Therefore, it might be useful to use a common experience as an example, such as: our learning process.

Let’s start from the beginning, from the moment we (human beings) begin to talk, understand, learn, go to school, and so on. We require at least 12-15 years to be able to read a newspaper and understand the most general articles, and this is thanks to the experience we developed while learning the meanings of words and experimenting with a great many different phrase constructions. The learning process is even lengthier when we decide to tackle more technical terms or specific topics.

Learning takes time, and the same goes for a computer. It’s true that a computer can process in nanoseconds while we think in milliseconds, but it is also true that our method of learning uses a device (the brain) that no one has been able to fully understand and that is able to do things that not even the most powerful computer can imitate.

In summary, it doesn’t make sense to expect that a computer be able to perfectly analyze and understand a biology text, for example, without first having learned all it can about that subject. There are no shortcuts nor magic formulas: learning a language is difficult and even automatic processes require time and labor.

When I present a company with our software solutions (which are based on a semantic technology that uses a rich and vast semantic network), I find myself in front of an audience who clearly understands the advantages of this approach.  Yet, the series of concerns and doubts they raise often clouds the decision-making process and causes an incorrect evaluation of the actual return on investment.

Whether they are raised by IT managers, KM workers or software developers, the concerns fall into two categories: the first, the costs related to the setup and maintenance of the semantic network and the second, the costs related to the infrastructure required to maintain a performance level able to satisfy operations.

There are many reasons behind these concerns, but two factors seem to stand out. On one hand, there are the very effective (though often inaccurate) communication activities carried out by the makers of systems based on keyword technology. They have almost succeeded in convincing the market that a complex problem such as information management can be solved with automatic shortcuts and that any other alternative would be unaffordable. On the other hand, the majority of researchers in this sector are still skeptical about systems which are entirely semantic. This is mainly caused by their inability (at least up to now) to develop software which can combine the advantages of increased text comprehension with the performance needed to meet the demands of the real world (thus further strengthening the position of the competition).

In the past ten years, many successful projects have been developed using our semantic technology. Therefore, I think it would be useful to use real data from our everyday experiences to help clear up the misconceptions which often cause people to make irrational decisions.

Costs of development

To add a new language to Cogito, two man-years of software development and 8-10 man-years of linguistic development are needed in order to refine the semantic network. You can quickly estimate the cost of such resources (if you are in Silicon Valley, divide your estimated total by 2!) and immediately understand that the initial investment is considerable, yet affordable considering that the cost will be spread over all the implementations that will be done over time.

Cogito’s standard semantic network permits a horizontal management of content so that a significantly higher rate of precision and recall (compared to that obtained from a static system) is obtained with no need for further elaboration. For vertical implementations, start-up costs will be necessary so that a standard semantic network can be enriched with knowledge from a specific domain (the number of added concepts usually does not exceed 5,000); usually 20-30 working days are needed for a linguist to complete this task.

For those who believe that “languages constantly change and adding new terms can be costly,” may I remind you that even the most dynamic languages, such as English, increase by no more than 100-200 new terms (of common use) and fewer than 1,000 non-idiomatic expressions per year (in the worst case scenario, this could mean about 10 working days per year).

Those who criticize the complexity of managing a semantic network often refer to the complexity of managing lists of entities such as people, places, companies, organizations, etc. Traditional systems are able to recognize an entity only if it is present in a list; this aspect is often erroneously confused with semantic network management. A good semantic engine is able to recognize an entity based on the semantic role it plays within a text; therefore it requires neither the creation nor the maintenance of lists. At the same time, it is also able to correctly recognize less frequent entities (which, for obvious reasons, would not have been inserted in a list).

Costs of infrastructure

Cogito can analyze more than 120KB of text (circa 40 pages of text) per second with a common single-processor server. This kind of speed, combined with its linear scalability and low cost, makes Cogito a  practical solution even in situations in which large quantities (tens of millions) of documents must be analyzed.
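
To get a feel for what these figures mean at scale, here is a back-of-the-envelope calculation in Python. Only the 120 KB and 40 pages per second come from the paragraph above; the corpus size, the average document length, the number of servers and the reading of "KB" as 1,000 bytes are all assumptions chosen for illustration.

```python
# Figures quoted in the text.
BYTES_PER_SECOND = 120 * 1000   # "more than 120KB of text" per second (KB taken as 1,000 bytes)
PAGES_PER_SECOND = 40           # "circa 40 pages of text" per second

# Assumptions made only for this estimate.
DOCUMENTS = 10_000_000          # low end of "tens of millions"
PAGES_PER_DOCUMENT = 2          # assumed average document length
SERVERS = 4                     # assumed number of identical servers


def days_to_analyze(documents: int, pages_per_doc: float, servers: int = 1) -> float:
    """Estimate wall-clock days, assuming throughput scales linearly with servers."""
    seconds = documents * pages_per_doc / (PAGES_PER_SECOND * servers)
    return seconds / 86_400  # seconds in a day


print(BYTES_PER_SECOND / PAGES_PER_SECOND)  # bytes per page implied by the two figures
print(round(days_to_analyze(DOCUMENTS, PAGES_PER_DOCUMENT), 1))
print(round(days_to_analyze(DOCUMENTS, PAGES_PER_DOCUMENT, SERVERS), 1))
```

Under these assumptions a single server would need a little under six days, and four servers about a day and a half; different document lengths would change the result proportionally.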

The development and maintenance costs of a semantic network are considerably lower than what is commonly assumed; the improvements in terms of the ability to manage information (even when very complex) are obvious even to those who are not experts in this sector. I am convinced that when these aspects can be objectively analyzed (when myths and obsolete information are ignored), the number of companies which adopt real semantic solutions will increase.

I’ve written an article which explores what is happening on the Web, and Alt Search Engines published it early this week.

Jun
30
Filed Under (Semantic Intelligence) by M.Varone on 30-06-2009

I’ve already written many times about automatic categorization, but it’s such a complex topic with plenty of different aspects, and although it may seem simple to the general public, I think that it’s worth discussing once again (and again in the future.)

This time I would like to focus on the categorization of contents dealing with generic and horizontal subjects, i.e. journalistic categories such as news, sports, economics, politics and so on. For those, like us, who have developed categorization software for years, the abundance of non-institutional content on the Web (mainly blogs and similar) offers more opportunities than in the past to apply our applications successfully. In theory, it’s a winning solution not only for those who develop the technology, but also for those who provide the content because categorization rules need only minor customization and as a result, improving content with quality information becomes quick and effective.

Nevertheless, two relevant aspects must be considered with close attention, in order to avoid problems during implementation.

The first aspect is somehow implicit in the personal and subjective nature of such information sources. In fact, when writing blogs, authors quite often (if not always) mix posts on their favorite subject or field of expertise (cinema, sports, technology…) with other more intimate and personal posts, which do not necessarily have a specific subject. When we try to categorize this kind of content using standard systems developed for the well-focused articles of periodicals and newspapers, we tend to obtain background noise. In order to minimize such noise, we need to be aware of the problem, and use semantic technology in an expert way: this way, the level of the final result is usually quite good, and can provide an added value to users.

The second aspect is the average length of these contents. In fact, quite often the post is short and does not exceed 500-600 characters, making it quite difficult to obtain enough reliable information to select the right category. The readers of a blog already know the subject because they have read previous posts and therefore do not need further context to find the main subject. Yet, for a program the task is definitely more complicated because very often a program does not analyse the posts of the blog one after the other, or from the same information source, but receives them in a random order, or one by one. In order to manage this aspect correctly, we need to accept some compromises and modify the system progressively, in order to reach a good balance.

For these kinds of projects, technology is very important but equally important is the expertise of those who have worked for years in this field: like they say in Naples, no one is born learned.

Jun
22
Filed Under (Semantic Intelligence) by L.Scagliarini on 22-06-2009

After a presentation from the New York Times, the Semantic Technology Conference of 2009 came to a close. Evan Sandhaus, the newspaper’s semantic expert, talked about how, from the beginning of the twentieth century, the Times has made historic efforts to use and value the immense quantity of available content, and about how now is a period of crisis for the publishing industry, so strategies must be more decisive and significant. During the presentation, Evan officially announced the NYT’s participation in the Linked Data project. This will make the newspaper’s massive index easily accessible through the Internet and linkable to other content or applications. This is an important step because it opens the doors to a new generation of Internet applications which could revolutionize our lives.

Another important highlight was the active participation of the three main search engines: Google, Bing and Yahoo. Google received applause for the announcement that it will be adopting new standards which will lead the company on a slow journey towards more intelligent search methods. This is quite a change from last year when Marissa Mayer, vice president of Search Products and User Experience, stated that Google wasn’t interested in semantic search.

Another significant element emerged on the final day:  the growing efforts of the American government to increase, over the coming years, the availability of its own data (Operation Transparency) and make it accessible even through intelligent technology.

The Semantic Technology Conference was also a great success for Expert System. In our third consecutive year of participation, we were involved in five workshops and panels in which the explosive potential of two areas of great interest were examined: semantic applications for mobile networks and online advertising.

To sum up these few days, I can say that the fifth edition of the Semantic Technology Conference was a success. The participants totaled 1,170, a 16% increase from last year. Specifically, there was an increase in participation in the business area, which was almost 20% of total participation. Concrete solutions and activities were presented and, given the economic crisis, much attention was paid to the success stories of cost-saving projects for strategic activities, such as customer care and competitive intelligence.

And that’s all, folks…I’m off to dine in a restaurant on the Californian coast with a great Oceanside view (for a couple days a year it’s not completely shrouded in fog).

Jun
19
Filed Under (Semantic Intelligence) by L.Scagliarini on 19-06-2009

In an event where new buzzwords like “cloud”, “linked data” and “shared ontologies” prevail, the moment has finally come for a rising star (which some think is already on the brink of falling): semantic search.

Yesterday, Carla Thompson (Guidewire Group) conducted a “Semantic Search” panel in which representatives from search giants took part: Scott Prevost from Bing’s Powerset division, Andrew Tomkins from Yahoo, and the renowned Peter Norvig from Google. Also participating were Riza Berkan, CEO of Hakia, William Tunstall-Pedoe from True Knowledge and Tomasz Imielinski of the revived (for the third or fourth time now) Ask.com.

Here is a nutshell version of some of the highlights from the Question & Answer session:

Carla Thompson: What makes your engine different from the others?

Ask.com: the ability to give answers (extracted from structured and non-structured sources) to questions formulated in natural language. This statement was found to be rather amusing to the audience because just a couple years ago, Ask.com had repositioned itself as a traditional keyword engine, in an attempt to make people forget about the shoddy performance of its previous Q&A engine released in the late nineties.
Powerset/Bing: the ability to understand the user’s intent and to organize the pages based on this intent.
Google: the ability to provide a set of information which is complete, accurate and fast.
Hakia: the ability to define a ranking based on credibility and not on the popularity of the sources.
True Knowledge: the ability to provide high-quality answers to direct questions by extracting such answers from structured (databases) and non-structured (texts) sources.
Yahoo: the URL’s ability to make semantic sense, see SearchMonkey, which allows results to be seen in a different way.

Carla: Why focus on semantic search when the public doesn’t seem to demand it (nor understand what it is)?

Ask.com: normally, a winning product doesn’t originate from public demand, but from the ability of a company to anticipate needs that the average user can’t sense yet. Increasing the precision of answers to specific questions, such as we are doing, has a value for the user. The user will recognize this value once he/she has tried search of this kind.
Google: why innovate if something works? Because people don’t realize what can be achieved, so they are happy with what they have. However, when they see new features which can improve their experience, they don’t hesitate to welcome them with enthusiasm.
Powerset/Bing: the ability to answer questions and present complex information in a clear and precise way is valuable to users. Our outlook hasn’t changed after we were bought by Microsoft. We still continue to focus on in-depth content analysis and provide precise answers to users’ questions thanks to disambiguation, which objectively, we are the only ones to offer.

I do not believe that this feature is currently present in Bing.

Hakia: we aren’t the only ones offering this, but we do everything from scratch, including an ontology which is language-independent and noticeably improves performance in terms of precision, something that every user needs.
Yahoo: what Ask said was right. We are already evolving to anticipate the competition. Today, thanks to semantics, we display information in a different way and the users’ response is decisively positive.

Carla: Is it possible to objectively compare two search engines?

Ask.com: yes, and there are simple ways to do this. For example, a search engine should allow the user to search “the top 10 songs of all-time” or the same phrase with “ten” instead of “10”, and have the results come out the same (but they don’t always). Or, entering “Tom Cruise” and “Tom Cruise actor” should give the same results, but instead, the results in general are worse with the addition of information. We at Ask invest in ensuring that a search engine can meet these standards.
Hakia: it is possible to perform objective tests, even if, in the end it is the user who chooses one search engine over another.
Powerset/Bing: there are ways to conduct tests. In terms of the level of semantics, my colleague from Ask is correct, but it is important to add other criteria as well, for example the capacity to extract relations.
Yahoo: it is very difficult. Our research shows that users are very conservative and tend to reject small changes. We think that changing the presentation of the results based on the search objective is very important and must be a criterion for comparison.
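
The kind of consistency check Ask.com describes (the same query written with “10” and with “ten” should return the same results) could be scored along the lines of the Python sketch below. Jaccard overlap is just one simple measure chosen for the example, not something the panel proposed, and the result lists are invented placeholders rather than output from a real engine.

```python
from typing import Sequence


def overlap(results_a: Sequence[str], results_b: Sequence[str]) -> float:
    """Jaccard overlap of two result lists: 1.0 means the same set of results."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0  # two empty result lists are trivially identical
    return len(a & b) / len(a | b)


# Made-up result lists standing in for a real engine's output.
results_digits = ["site-a", "site-b", "site-c", "site-d"]  # "the top 10 songs of all-time"
results_words = ["site-a", "site-b", "site-e", "site-f"]   # "the top ten songs of all-time"

score = overlap(results_digits, results_words)
print(f"{score:.2f}")  # prints: 0.33
```

A score well below 1.0 for two queries that mean the same thing would flag exactly the weakness described above. Note that this measure ignores ranking order, which a fuller comparison would probably want to take into account.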

Carla: Which is better: traditional or question & answer search engines?

True Knowledge: in terms of input, there shouldn’t be any difference in the results if the user inserts a question in natural language or in keywords. The excellence of a search engine is its ability to provide precise answers.
Google: cut-and-dried answers are not always best. Everything depends on the question. For example, if someone is interested in knowing how meditation can improve their life, receiving a series of informative content is probably better than a cut-and-dried answer.
Hakia: offering the opportunity to ask questions is important. A search engine must be able to understand the question and continue to improve in situations in which the initial results are insufficient.
Yahoo: the notion of supporting the user is important, but continuously trying to provide an answer isn’t enough to improve the search. What is needed are new types of interfaces.
Powerset/Bing: yes, question & answer engines are better, but only for mobile applications or for certain types of requests.

Here is the last question that Carla Thompson wasn’t able to ask due to time restrictions: Is semantic search more important for the business market or for the consumer market?

In the business search engine market, semantic search identifies concepts and relations (who does what); therefore, it is more important there. Businesses and so-called knowledge workers already know this. Although semantic search has become a bit less popular in the consumer market, it still takes the prize as an important aspect of the business search engine industry.