Phonetic Algorithm: Bryan, Brian, Briane, Bryne, or … what was his name again?

Let the spelling loose …

What do Callie and Kelly have in common (except for the double ‘l’ in the middle)? What about “no” and “know”, or “Caesar’s” and “scissors”, and what about “message” and “massage”? You definitely got it: Callie and Kelly, “no” and “know”, and “Caesar’s” and “scissors” sound alike but are spelled quite differently. “Message” and “massage”, on the other hand, differ by only one vowel (“e” vs “a”), but their pronunciation is not at all the same.

It’s a well-known fact that in many languages orthography does not determine the pronunciation of words. English is a classic example: George Bernard Shaw is often credited with “ghoti” as an alternative spelling of “fish”. While phonology reflects the current state of a language, orthography may lag centuries behind. English is notorious for this phenomenon, but it is not the only one. Swedish, French and Portuguese, among others, all have their orthography/pronunciation discrepancies.

Phonetic Algorithms

So how do we represent things that sound similar but are spelled differently? It’s not trivial, but for most cases it is not impossible either. Soundex is probably the first algorithm to tackle this problem. It is an example of the so-called phonetic algorithms, which attempt to give the same encoding to strings that are pronounced in a similar fashion. Soundex was designed for English only and has its limits. Double Metaphone (DM) is one of the possible replacements and is relatively successful. Published by Lawrence Philips in 2000 as a successor to his Metaphone algorithm from 1990, it not only deals with native English names but also takes proper care of the foreign names so omnipresent in the language. What is more, it can output two possible encodings for a given name, hence the “Double” in the name of the algorithm: an anglicised and a native (be that Slavic, Germanic, Greek, Spanish, etc.) version.

By relying on DM one can encode all four names in the title of this post as “PRN”. The name George will get two encodings, JRJ and KRK, the second reflecting a possible German pronunciation of the name. And a name of Polish origin, like Adamowicz, would also get two encodings, ATMTS and ATMFX, depending on whether you pronounce the “cz” as the English “ch” in “church” or “ts” in “hats”.

The original implementation by Lawrence Philips limited the encoding of a string to 4 characters. However, in most subsequent implementations of the algorithm this limit is parameterized or simply omitted.

Apache Commons Codec has an implementation of DM, among others (Soundex, Metaphone, RefinedSoundex, ColognePhonetic and Caverphone, to name just a few), and here is a tiny example with it:

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class DM {

        public static void main(String[] args) {
            String s = "Adamowicz";
            DoubleMetaphone dm = new DoubleMetaphone();

            // Default encoding length is 4! Let's make it 10.
            dm.setMaxCodeLen(10);

            // Remember, DM can output 2 possible encodings:
            // the primary one, and the alternate one when the second argument is true.
            System.out.println("Alternative 1: " + dm.doubleMetaphone(s)
                    + "\nAlternative 2: " + dm.doubleMetaphone(s, true));
        }
    }

The above code will print out:

Alternative 1: ATMTS

Alternative 2: ATMFX

It is also relatively straightforward to do phonetic search with Solr. You just need to add phonetic analysis, for example the DoubleMetaphoneFilterFactory that ships with Solr, to the analyzer of the field type used for names in your schema.xml.

Enhancements

While DM performs quite well at first sight, it has its limitations. It still originated from the English language, and although it aims to tackle a variety of non-native borrowings, most of the rules are English-centric. Suppose you work on any of the Scandinavian languages (Swedish, Danish, Norwegian, Icelandic) and one of the names you want to encode is “Örjan”. You will find that “Orjan” and “Örjan” get different encodings: ARJN and RJN respectively. Why is that? One look under the hood (the implementation in DoubleMetaphone.java) will give you the answer:

private static final String VOWELS = "AEIOUY";

So the Scandinavian vowels “ö”, “ä”, “å”, “ø” and “æ” are not present. If we just add these (in upper case, since the encoder upper-cases its input), then compile and use the new version of the DM implementation, we get the desired output: ARJN for both “Örjan” and “Orjan”.
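If patching the library is not an option, a different route is to fold the Scandinavian letters to plain vowels before the name ever reaches the encoder. This is only a sketch of that alternative, not the change to VOWELS described above: the class name, the method name and the mapping itself are assumptions made up for illustration, and the right mapping depends on your data.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: folds Scandinavian vowels to plain ASCII vowels before encoding. */
final class ScandinavianFolder {

    // Assumed mapping, written as Unicode escapes so the source file stays ASCII-only.
    // Whether A-with-ring should fold to A or O depends on your data.
    private static final Map<Character, Character> FOLD = Map.of(
            '\u00D6', 'O',   // O with diaeresis
            '\u00D8', 'O',   // O with stroke
            '\u00C4', 'A',   // A with diaeresis
            '\u00C5', 'A',   // A with ring
            '\u00C6', 'A');  // AE ligature

    /** Upper-cases the name and replaces every mapped letter with its ASCII stand-in. */
    static String fold(String name) {
        StringBuilder out = new StringBuilder(name.length());
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            char folded = FOLD.getOrDefault(c, c);
            out.append(folded);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00D6rjan")); // prints ORJAN
        System.out.println(fold("Orjan"));      // prints ORJAN
    }
}
```

Both spellings now reach the encoder as the same string, so they necessarily get the same code. The price is that the fold is lossy: distinctions such as “ø” versus “o” are gone before any phonetic rule has a chance to look at them.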

Finally, if you don’t want to use DM, or it is really not suitable for your task, you can still use the same principles and create your own encoder, for example by relying on regular expressions. Suppose you have a list of bogus product names which are just (mis)spelling variations of some well-known names, and you want to search for the original name and get back all the ludicrous variants. Here is one, albeit very naïve, way to do it. Given the following names:

CupHoulder

CappHolder

KeepHolder

MacKleena

MackCliiner

MacqQleanAR

Ma’cKcle’an’ar

and with a bunch of regular expressions you can easily encode them as “cphldR” and “mclnR”:

    String[] ar = new String[]{"CupHoulder", "CappHolder", "KeepHolder",
            "MacKleena", "MackCliiner", "MacqQleanAR", "Ma'cKcle'an'ar"};

    for (String a : ar) {
        a = a.toLowerCase();
        a = a.replaceAll("[ae]r?$", "R");    // final -a, -e, -ar, -er => R
        a = a.replaceAll("[aeoiuy']", "");   // drop vowels and apostrophes
        a = a.replaceAll("pp+", "p");        // collapse repeated p
        a = a.replaceAll("q|k", "c");        // q and k sound like c
        a = a.replaceAll("cc+", "c");        // collapse repeated c
        System.out.println(a);               // prints cphldR or mclnR
    }

You can now easily find all the ludicrous spellings of “CupHolder” and “MacCleaner”.
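To actually look names up, the regular expressions can be wrapped in a function and the resulting code used as a key. The sketch below does just that; the class and method names are made up for illustration, and the substitutions are the same naïve ones as in the loop, applied in the same order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical index that groups spelling variants under a shared naive code. */
final class NaiveNameIndex {

    /** Applies the naive substitutions and returns the resulting code. */
    static String encode(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        s = s.replaceAll("[ae]r?$", "R");
        s = s.replaceAll("[aeoiuy']", "");
        s = s.replaceAll("pp+", "p");
        s = s.replaceAll("q|k", "c");
        s = s.replaceAll("cc+", "c");
        return s;
    }

    /** Groups the given names by their code, keeping insertion order. */
    static Map<String, List<String>> index(List<String> names) {
        Map<String, List<String>> byCode = new LinkedHashMap<>();
        for (String name : names) {
            byCode.computeIfAbsent(encode(name), k -> new ArrayList<>()).add(name);
        }
        return byCode;
    }

    public static void main(String[] args) {
        Map<String, List<String>> idx = index(List.of(
                "CupHoulder", "CappHolder", "KeepHolder", "MacKleena", "MackCliiner"));
        // The query is encoded the same way and looked up by its code.
        System.out.println(idx.get(encode("CupHolder")));
        // prints [CupHoulder, CappHolder, KeepHolder]
    }
}
```

The important design point is that the query and the indexed names go through exactly the same encoder; any rule applied on one side only would make the codes drift apart.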

I hope this blogpost gave you some ideas of how you can use phonetic algorithms and their principles in order to better discover names and entities that sound alike but are spelled unlike. At Findwise we have done a number of enhancements to DM in order to make it work better with Swedish, Danish and Norwegian.

References

You can learn more about Double Metaphone from the following article by the creator of the algorithm:
http://drdobbs.com/cpp/184401251?pgno=2

A German phonetic algorithm is the Kölner Phonetik:
http://de.wikipedia.org/wiki/Kölner_Phonetik

And SfinxBis is a phonetic algorithm based on Soundex and is Swedish specific:
http://www.swami.se/projekt/sfinxbis.68.html

Search and Content Quality – Ways of Improving Your Intranet

If you have 6 minutes to spare, I would recommend that you watch this interview with Gabriel Olsson from Tetra Pak. In recent years Tetra Pak has been working strategically on turning their intranet into something truly end-user-centric. Tetra Pak has also put effort into search and content quality.

By actually asking the employees what they expect to find and what sort of information would make their everyday work (tasks) more efficient, Tetra Pak has managed to create a navigation structure based on facts reflecting these needs. The method used is Gerry McGovern’s task-based Customer Carewords… and the result? The ones that scream the loudest are not the most important; the needs of the employees are.

Gabriel also talks about the importance of following up on search with key matches and synonyms. This, together with content quality initiatives, helps create a solid foundation for search, the simple reasons being:

Use metadata to filter search results (note, not a Tetra Pak picture)

  • If the quality of the information is good (clear headings, good metadata, frequent keywords), the information found through search will be good as well. If you have a lot of old content and duplicates, this will be just as visible, making it hard for the users to determine what is of high quality and trustworthy. Good quality will also make it possible to group and categorize information.
  • Synonyms make it easy to adjust the corporate language to the one used by the employees. Let people search for “report” when they want to find a “bulletin”. A simple synonym list based on search statistics will help users find what they want without thinking about how to phrase the query. The synonyms can be used in the background (without the users’ knowledge) or as “did you mean” suggestions:

    Synonyms used for “Did you mean” functionality (note, not a Tetra Pak picture)

  • Key matches (also referred to as sponsored links, best bets or editor’s picks) are used to manually force the first hit in the search result list to refer to a specific page or document. By following up on search statistics and knowing what sort of information is most frequently asked for, it is easy to adjust the search result list. However, this takes time and effort to follow up.
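The key matches from the last bullet can be sketched roughly as follows. This is an illustration only, not how Tetra Pak or any particular search engine does it: the class name, the table of pinned pages and the URL in it are all made up, and a real installation would normally configure this in the search platform rather than in code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical key match table: an editor pins one page per frequent query. */
final class KeyMatches {

    // Made-up example entry; in practice this table is maintained from search statistics.
    static final Map<String, String> PINNED = Map.of("bulletin", "/news/latest-bulletin");

    /** Puts the pinned page first, if there is one, and removes its duplicate further down. */
    static List<String> apply(String query, List<String> organicResults) {
        String pinned = PINNED.get(query.trim().toLowerCase(Locale.ROOT));
        if (pinned == null) {
            return organicResults; // no editor's pick for this query
        }
        List<String> results = new ArrayList<>();
        results.add(pinned);
        for (String hit : organicResults) {
            if (!hit.equals(pinned)) {
                results.add(hit);
            }
        }
        return results;
    }
}
```

Calling `KeyMatches.apply("Bulletin", hits)` moves the pinned page to the top whatever the ranking says, which is exactly why the table needs the follow-up mentioned above: a stale entry keeps forcing an outdated page to first place.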

Tetra Pak is not alone when it comes to adjusting their intranet to true end-user needs. During the spring there will be a number of conferences where our customers will be sharing experiences from their initiatives, among others Ability Partner and the recently completed IntraTeam.

Apart from this, our own breakfast seminars are, as always, announced on our homepage and on Twitter. Looking forward to seeing you!

To Crawl or Not to Crawl in Enterprise Search

With an enterprise search engine, there are basically two ways of getting content into the index: using a web crawler or a connector. Both methods have their advantages and disadvantages. In this post I’ll try to pinpoint the differences between the two methods.

Web crawler

Most systems of today have a web interface. Be it your time reporting system, intranet or document management system, you’ll probably access it with your web browser. Because of this, it’s very easy to use a web crawler to index this content as well.

The web crawler indexes the pages by starting at one page. From there, it follows all outbound links and indexes those pages. From those pages, it again follows all links, and so on. This process continues until all links on a web site have been followed and the pages have been indexed. The crawler thus uses the same technique as a human: visiting a page and clicking the links.
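The crawl loop described above is essentially a breadth-first traversal of the link graph. Here is a minimal sketch of it, under the assumption that the links are already available as an in-memory map from page to outbound links; a real crawler would fetch and parse the pages over HTTP instead, and the class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical crawl loop over an in-memory link graph (no network access). */
final class CrawlSketch {

    /** Visits every page reachable from the start page exactly once, breadth-first. */
    static List<String> crawl(String start, Map<String, List<String>> outboundLinks) {
        List<String> indexed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(start);
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            indexed.add(page); // a real crawler would fetch and index the page here
            for (String link : outboundLinks.getOrDefault(page, List.of())) {
                if (seen.add(link)) { // skip pages that are already queued or visited
                    frontier.add(link);
                }
            }
        }
        return indexed;
    }
}
```

The set of seen pages is what makes the process terminate: without it, two pages linking to each other (which every navigation menu guarantees) would keep the crawler going forever.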

Most enterprise search engines are bundled with a web crawler, so it’s usually very easy to get started. Just enter a start page and within minutes you’ll have searchable content in your index. No extra installation or license fee is required. For some sources this may also be the only option, e.g. if you’re indexing external sources that your company has no control over.

The main disadvantage, though, is that web pages are designed for humans, not crawlers. This means that there is a lot of extra information for presentation purposes, such as navigation menus, sticky information messages, headers, footers and so on. All of this makes for a more pleasant experience for the user and makes it easier to navigate the page. The crawler, on the other hand, has no use for this information when retrieving pages. It actually reduces the information quality in the index. For example, a navigation menu is displayed on every page, so the crawler will index the navigation content for all pages. If you have a navigation item called “Customers” and a user searches for “customers”, he or she will get a hit on ALL pages in the index.

There are ways to get around this, but they require either altering the produced HTML or making adjustments in the search engine. Also, if the design of the site changes, you have to make these adjustments again.

Connector

Even though the majority of systems have a web interface, the content is stored in a data source of some format. It might be a database, a structured file system, etc. By using a connector, you connect either to the underlying data source or to the system directly through its API.

Using a connector, the search engine does not get any presentation information but only the pure content, making the information quality in the index better. The connector can also retrieve all metadata associated with the information which further increases the quality. Often, you’ll also have more fine-grained control over what will be indexed with a connector than a web crawler.

However, using a connector requires more configuration. It might also cost some extra money to buy one for your system, and require additional hardware. Once set up, though, it is likely to produce more relevant results than a web crawler.

The bottom line is that it’s a trade-off between quality and cost, like most decisions in life :)

Do You Know Something I Don’t? The Art of Benchmarking Enterprise Search

During the autumn we have been trying to keep our customers and others up to date with the search world by hosting breakfast seminars. By benchmarking enterprise search, sharing experiences and discussing with others, the participants have taken giant leaps in understanding what true value search can deliver. The same goes for sharing experiences between companies, where you often find yourself struggling with the same problems, regardless of business or company size.

We have been discussing how enterprise search can help intranets, extranets, web sites and support centers to capitalize on their knowledge. These are some of the things that have been discussed with regard to benchmarking enterprise search:

Business Cases

  • How can search help companies save 100 million SEK/year?
  • How do you count return on investment (ROI) for search?

Search Functionality

How and why should you work with:

  • Key Matches to promote certain content (similar to Google’s sponsored links on the web)
  • Synonyms (to make sure that the end users’ language corresponds to the corporate language without having to change the information)
  • Query completion and suggestion to give the user an overview of what other people have been searching for when they start to type (similar to Apple’s web site search).

End User Experience

  • How can different interfaces serve different information needs and user-groups?
  • How does your user interface serve your end-users?

Information Quality

  • Do taxonomies and folksonomies help us find information faster?
  • Can search be used to improve the quality of your content?

During the spring we will continue to hold seminars, keeping you up to date. If you’re not on our mailing list, please look at our Findability Events and register for our events.

On Wednesday and Thursday this week we will be attending an Ability conference to discuss search. Hope to see you there!

Enterprise Search 2.0?

While visiting Enterprise Search Summit in San Jose I realized that enabling Enterprise 2.0 within enterprise search is the hottest trend at the moment. Is it Enterprise Search 2.0?

Andrew McAfee, who coined the term Enterprise 2.0 and has released a book on the subject, spoke about how to use altruism to develop the enterprise. People are wired to help, and if we stop obsessing about the risks and lower the bar for how people can help each other, it is possible to make this work within a corporate environment.

He also spoke about process control and workflow control: how much do we really need? Make it easy to correct mistakes instead of making it hard to make them. With regard to innovation, he pointed out that we need to question credentialism and build communities that people want to join. To leverage the intelligence within the enterprise we should explore and experiment with collective intelligence, such as prediction markets and open peer review processes. All in all, make it easy for people to interconnect.

The benefits: very high improvement in access to knowledge and internal experts, satisfaction, increased innovation and customer satisfaction.

I also recommend reading the PricewaterhouseCoopers Technology Forecast, Summer 2008, to get a good overview of the available tools and technologies.

So how does this impact enterprise search? Search can be made the facilitator for Enterprise 2.0. Of course it is possible to index and make all blogs, wikis, tweets (Yammer), online communities and social networks searchable, but that is only one way to make this new environment more findable. If someone tweets or blogs about information, we should use that to influence the search results and ranking. We could also track user behavior on a site to make certain information more visible based on implicitly expressed interests.

Findwise releases Open Pipeline Plugins

Findwise is proud to announce that we have now released our first publicly available plugins for the Open Pipeline crawling and document processing framework. A list of all available plugins can be found on the Open Pipeline Plugins page, and the ones Findwise has created can be downloaded from our Findwise Open Pipeline Plugins page.

OpenPipeline is open source software for crawling, parsing, analyzing and routing documents. It ties together otherwise incomplete solutions for enterprise search and document processing. OpenPipeline provides a common architecture for connectors to data sources, file filters, text analyzers and modules to distribute documents across a network. It includes a job scheduler and a full UI with a point-and-click interface.

Findwise has been using this framework in a number of customer projects with great success. It fits particularly well together with Apache Solr, not only because it is open source but, most importantly, because it fills a gap in Solr’s functionality: an easy-to-use framework for developing document processors and connectors. However, we are not using it for Solr only; a number of plugins for the Google Search Appliance have also been made, and we have started investigating how Open Pipeline can be integrated with the IBM Omnifind search engine as well.

The best thing with this framework is that it is very flexible and customizable but still easy to use AND, maybe most importantly for me as a developer, easy to work with and develop against. It has a simple yet powerful enough API to handle all that you need. And because it is an open source framework any shortcomings and limitations that we find along the way can be investigated in detail and a better solution can be proposed to the Open Pipeline team for inclusion in future releases.

We have in fact already contributed a great deal to the development of the project by using it, testing it, reporting bugs and suggesting improvements on their forums. And the response from the team has been very good: some of our suggested improvements have already been included and some are on the way in the new 0.8 version. We are also in the process of deepening the collaboration further by signing a contributor agreement so that we will eventually be able to contribute code as well.

So how do our customers benefit from this?

First, it lets us develop and deliver search and index solutions of better quality more quickly to our customers. This is because more developers can work with the same framework as a base, so the overall code base is used more, tested more and is thus of better quality. We also have the possibility to reuse good, well-tested components, so that several customers can share the costs of development and thus get a better service or product for less money, which is always a good thing of course!

Six Simple Steps to Superior Search

Do you have your search application up and running but it still doesn’t quite seem to do the trick? Here are six simple steps to boost the search experience.

Avoid the Garbage In, Garbage Out Syndrome

Fact 1: A search application is only as good as the content it makes findable.

If you have a news search service that only provides yesterday’s news, the search bit does not add any value to your offering.

If your Intranet search service provides access to a catalog of employee competencies, but this catalog does not cover all co-workers or contain updated contact details, then search is not the means it should be to help users get in touch with the right people.

If your search service gives access to a lot of different versions of the same document and there is no metadata available to single out which copy is the official one, then users might end up spending unnecessary time reviewing irrelevant search results. And still you cannot rule out the risk that they end up using old or even flawed versions of documents.

The key learning here is that there is no plug and play when it comes to accurate and well thought out information access. Sure, you can make everything findable by default. But you will annoy your users while doing so unless you take a moment and review your data.

Focus on Frequent Queries

Fact 2: Users tend to search for the same things over and over again.

It is not unusual that 20 % of the full query volume is made up of less than 1 % of all query strings. In other words, people tend to use search for a rather fixed set of simple information access tasks over and over again. Typical tasks include finding the front page of a site or application on the Intranet, finding the lunch menu at the company canteen or finding the telephone number to the company helpdesk.

In other words, you would be well advised to make sure your search application works for these highly frequent (often naïve) information access tasks. An efficient way of doing so is to keep an analytic eye on the log file of your search application and take appropriate action on frequent queries that do not return any results whatsoever or return weird or unexpected results.

The key learning here is that you should focus on providing relevant results for frequent queries. This is the least expensive way to get boosted benefit from your search application.
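The log analysis described in this section can be sketched as a simple counting job. The example assumes a hypothetical log format where each entry holds the query string and the number of hits it returned; real search platforms log this in their own formats, so the `Entry` class and the method name are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical report over a query log: which frequent queries return nothing? */
final class QueryLogReport {

    /** One entry of the assumed log format: the query string and its hit count. */
    static final class Entry {
        final String query;
        final int hits;

        Entry(String query, int hits) {
            this.query = query;
            this.hits = hits;
        }
    }

    /** Returns the queries that returned no results, most frequent first. */
    static List<String> frequentZeroHitQueries(List<Entry> log, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (Entry e : log) {
            if (e.hits == 0) {
                // Treat "Lunch Menu " and "lunch menu" as the same query.
                counts.merge(e.query.trim().toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, sorted.size()); i++) {
            result.add(sorted.get(i).getKey());
        }
        return result;
    }
}
```

The top of this list is where a key match, a synonym or a new piece of content gives the most value for the least effort. Queries that do return results, but weird ones, will not show up here and still need a human eye.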

Make the Information People Often Need Searchable

Fact 3: Users do not know what information is available through search.

Users often believe that a search application gives them access to information that really isn’t available through search. Say your users are frequently searching for “lunch menu”, “canteen” and “today’s lunch”. What do you do if you do not have the menu available at all on your Intranet or Web site?

In the best of worlds, you will make frequently requested information available through search. In other words, you would add the lunch menu to your site and make it searchable. If that is not an option, you might consider informing your users that the lunch menu (or some other popular information people tend to request) is not available in the search application, and provide them with a hard-coded link to the canteen contractor or some other related service as a so-called “best bet” (or sponsored link, as in Google web search).

The key learning here is to monitor what users frequently search for and make sure the search application can tackle user expectations properly.

Adapt to the User’s Language

Fact 4: Users do not know your company jargon.

People describe things using different words. Users are regularly searching for terms which are synonymous with, but not the same as, the terms used in the content being searched. Say your users are frequently looking for a “travel expense form” on your Intranet search service, but the term used in your official company jargon is “travel expenses template”. In cases like this you can build a glossary of synonyms, mapping the common language terms people tend to search for to official company terms, in order to satisfy your users’ frequent information needs better without having to deviate from company terminology. Another way of handling the problem is to provide hand-crafted best bets (or sponsored links, as in Google web search) that are triggered by certain common search terms.

Furthermore, research suggests that Intranet searches often contain company-specific abbreviations. A study of the query log of a search installation at one of Findwise’s customers showed that abbreviations (query strings consisting of two, three or four letters) accounted for as much as 18 % of all queries. In other words, it might be worthwhile for the search application to add the spelled-out form to a query for a frequently used abbreviation. Users searching for “cp” on the Intranet would, for example, in effect see the results of the query “cp OR collaboration portal”.
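The abbreviation handling in the “cp” example can be sketched as a small query rewrite step in front of the search engine. The glossary below is made up apart from the “cp” entry taken from the example, the class name is hypothetical, and the exact OR syntax depends on your search platform.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical query rewriter that adds the spelled-out form to known abbreviations. */
final class AbbreviationExpander {

    // Assumed glossary; in practice it would be built from the query log.
    static final Map<String, String> GLOSSARY = Map.of("cp", "collaboration portal");

    /** Returns "abbr OR long form" for known abbreviations, the query unchanged otherwise. */
    static String expand(String query) {
        String key = query.trim().toLowerCase(Locale.ROOT);
        String longForm = GLOSSARY.get(key);
        return longForm == null ? query : key + " OR " + longForm;
    }

    public static void main(String[] args) {
        System.out.println(expand("cp"));    // prints cp OR collaboration portal
        System.out.println(expand("lunch")); // prints lunch
    }
}
```

Keeping the original abbreviation in the rewritten query matters: documents that only ever say “cp” should still be found.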

The lesson to learn here is that you should use your query log to learn the terminology the users are using and adapt the search application accordingly, not the other way around!

Help Users With Spelling

Fact 5: Users do not know how to spell.

Users make spelling mistakes, lots of them. Research suggests that 10–25 % of all queries sent to a search engine contain spelling mistakes. So turn on spellchecking in your search platform if you haven’t already! And while you are at it, make sure your search platform can handle queries containing inflected forms (e.g. “menu”, “menus”, “menu’s”, “menus’”). Those are your quick wins to boost the search experience.
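To give an idea of what spellchecking does under the hood, here is a minimal sketch based on edit distance (Levenshtein distance) against a list of known terms. This is one common technique and not necessarily what your search platform uses; the class name, the vocabulary and the distance threshold are assumptions for illustration.

```java
import java.util.Collection;
import java.util.Optional;

/** Hypothetical "did you mean" helper based on Levenshtein edit distance. */
final class SpellSuggester {

    /** Number of single-character insertions, deletions and substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Returns the closest known term within maxDistance edits, if there is one. */
    static Optional<String> suggest(String query, Collection<String> vocabulary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        for (String term : vocabulary) {
            int d = editDistance(query, term);
            if (d < bestDistance) {
                best = term;
                bestDistance = d;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With a vocabulary taken from the index, `SpellSuggester.suggest("calender", terms, 2)` would propose “calendar” if that term is in the list. Comparing against every term is fine for a glossary, but far too slow for a full index, which is one reason to rely on the spellchecker built into the search platform.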

Keep Your Search Solution Up-To-Date

Fact 6: Your search application requires maintenance.

Information sources change, and so should your search application. There is a fairly widespread misconception that a search application will maintain itself once you’ve got it up and running. The truth is that you need to monitor and maintain your search solution like any other business-critical IT application.

A real-life example is a fairly large enterprise that decided to perform a total makeover of its internal communication process, shifting focus from the old Intranet, which was built on a web content management system, in favor of a more “Enterprise 2.0 approach” using a collaboration platform for active projects and daily communication and a document management system for closed projects and archived information.

The shift had many advantages, but it was a disaster for the Enterprise Search application that was only monitoring the old Intranet being phased out. Employees looking for information using the search tool would in other words only find outdated information.

The lesson to learn here is that the fairly large investment in efficient Findability requires maintenance in order for the search application to meet the requirements posed on it now and in the future.

References

100 Most Often Mispelled Misspelled Words in English – http://www.yourdictionary.com/library/misspelled.html

Definition of “sponsored link” – http://encyclopedia2.thefreedictionary.com/Sponsored+link