Speaking about Search as a Service @ PROMISE Technology Transfer day, want to meet up?

Tomorrow morning I leave Gothenburg to attend the PROMISE Technology Transfer day @ CeBIT 2013 in Hanover, Germany.

The event is a workshop introducing its participants to methodologies for the systematic evaluation and monitoring of search engines, and for discussing future trends and requirements for the next generation of information access systems. In other words, it is right up our alley at Findwise.

As Director of Research at Findwise I will speak about Search as a Service. If you are at the event or just nearby I would be happy to meet up and have a chat. I will be around from Tuesday March 5 until Thursday March 7. Feel free to email me at henrik.strindberg@findwise.com or give me a call at +46709443905.

Hope to see you there!

SLTC 2012 in retrospect – two cutting-edge components

The 4th Swedish Language Technology Conference (SLTC) was held in Lund on 24-26 October 2012.
It is a biennial event organized by prominent research centres in Sweden.
The conference is, therefore, an excellent venue to exchange ideas with Swedish researchers in the field of Natural Language Processing (NLP), as well as to present one's own research and keep up to date with the state of the art in most areas of Text Analytics (TA).

This year Findwise participated in two tracks – in a workshop and in the main conference.
As the area of Search Analytics (SA) is very important to us, we decided to be proactive and sent an application to organize a workshop on the topic of “Exploratory Query Log Analysis” in connection with the main conference. The application was granted and the workshop was very successful. It gathered researchers who work in the area of SA from very different perspectives – from utilizing deep Machine Learning to discover users’ intent, to looking at query logs as a totally new genre. I will do a follow-up on that in another post. All the contributions to the workshop will also be uploaded to our research page.

As for the main conference, we had two papers accepted for presentation. The first one dealt with the topic of document summarization – both single and multidocument summarization
(http://www.slideshare.net/findwise/extractive-document-summarization-an-unsupervised-approach).
The second paper was about detecting Named Entities in Swedish
(http://www.slideshare.net/findwise/identification-of-entities-in-swedish).

These two papers presented de facto state-of-the-art results for Swedish both when it comes to document summarization and Named Entity Recognition (NER). As for the former task, there is no standard corpus for evaluating summarization systems, there are few previous results, and only a few other systems exist, which made it infeasible to compare our own system with others. Thus, we have contributed two things to the research in document summarization – a Swedish corpus based on featured Wikipedia articles to be used for evaluation, and a system based on unsupervised Machine Learning which, by relying on domain boosting, achieves state-of-the-art results for English and Swedish. Our system can be further improved by relying on our enhanced NER and Coreference resolution modules.

As for the NER paper, our Entity recognition system for Swedish achieves 74.0% F-score, which is 4% higher than another study presented simultaneously at SLTC (http://www.ling.su.se/english/nlp/tools/stagger). Both systems were evaluated on the same corpus, which is considered a de facto standard for evaluation of different NLP resources for Swedish. The unlabelled score (i.e. no fine-grained division of classes but just entity vs non-entity) of our system achieved 91.3% F-score (93.1% Precision and 89.6% Recall). When identifying people, the Findwise NER system achieves 78.1% Precision and 90.5% Recall (83.9% F-score).
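For readers who want to check how precision, recall and F-score relate, here is a minimal sketch. It assumes that the F-score quoted above is the balanced F1 measure (the harmonic mean of precision and recall), which the post does not state explicitly; the function name is ours and is not part of the Findwise NER system.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall.

    Assumption: the scores quoted in the post are F1 values.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as quoted in the paragraph above.
unlabelled = f_score(93.1, 89.6)
people = f_score(78.1, 90.5)

print(f"unlabelled entities: {unlabelled:.1f}")
print(f"people: {people:.1f}")
```

With the rounded figures from the text, the unlabelled score comes out at about 91.3, and the score for people at about 83.8, marginally below the quoted 83.9, presumably because the published precision and recall are themselves rounded.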

So, what did we take home from the conference? We were really happy to see that the tools we develop for our customers are not mediocre but of very high quality and represent the state of the art in Swedish NLP. We actively share our results and our corpora for research purposes. Findwise showed keen interest in cooperating with other researchers in developing better tools and systems in the area of NLP and Text Analytics. And this I think is a huge bonus to all our current and prospective customers – we actively follow the current trends in the research community and cooperate with researchers, and our products incorporate the latest findings in the field, which lets us offer both high quality and cutting-edge technology.

As we continuously improve our products, we have also released a Polish NER and some work has been initiated on Danish and Norwegian ones. More NLP components will soon be available for demo and testing on our research page.

Presentation: Enterprise Search and Findability in 2013

This was presented on 8 November at the J. Boye 2012 Conference in Aarhus, Denmark, by Kristian Norling.

Presentation Summary

There is a lot of talk about social, big data, cloud, digital workplace and semantic web. But what about search, is there anything interesting happening within enterprise search and findability? Or is enterprise search dead?

In the spring of 2012, we conducted a global survey on Enterprise Search and Findability. The resulting report, based on the answers from the survey, tells us what the leading practitioners are doing and gives guidance for what you can do to make your organisation’s enterprise search and findability better in 2013.

This presentation will give you a sneak peek into the near future and trends of enterprise search, based on data from the survey and on what the leaders who are satisfied with their search solutions do.

Topics on Enterprise Search

  • Help me! Content overload!
  • The importance of context
  • Digging for gold with search analytics
  • What has trust to do with enterprise search?
  • Social search? Are you serious?
  • Oh, and that mobile thing

The Enterprise Search and Findability Report 2012 is ready

No strategy, no budget, no resources. This is the common scenario for enterprise search and findability in many organisations today. Still, enterprise search is considered a critical success factor in 75% of the organisations that responded to the global survey that ran from March to May this year.

The Enterprise Search and Findability Report 2012 is now ready for download.

The Enterprise Search and Findability Report 2012 shows that 60% of the respondents expressed that it is very/moderately hard to find the right information. Only 11% stated that it is fairly easy to search for information and as few as 3% consider it very easy to find the desired information. This shows that there still is a large untapped potential for any organisation to get great value from investing in enterprise search. For a relatively small investment, preferably in personnel, it is possible to make search a lot better. The survey also reveals that organisations that are very satisfied with their search have a (larger) budget, more resources and systematically work with analysing search.

Figure. What is your primary goal for utilising search technology in your organisation?

The primary goal for using search is to accelerate retrieval of known information sources, 91%, and to improve the re-use of content (information/knowledge), 72%. This indicates that search within organisations is often used as a discovery tool for what is already known. Looking over the next three years, as many as 77% think that the amount of information in the organisation will increase. This means that every year it will be even more important to be able to find the right information, and that means enterprise search is still very much needed, as stated in the following great presentations (on video): Why Business Success Depends on Enterprise Search (by Martin White of Intranet Focus) and The Enterprise Search Market – What should be on your radar? (by Alan Pelz-Sharpe of 451 Research).

Download the full report.

Video and results from the Enterprise Search and Findability Survey

More than 200 people, primarily from Europe and North America, have responded to the Enterprise Search and Findability survey, providing a unique insight into how search is currently being managed, or rather is not being managed, in the best interest of the organisation. However, to get a deeper understanding of how search is used and managed on a regular basis in an enterprise context, search vendors and integrators have been excluded from this report, resulting in 170 unique responses from 28 countries globally.

A few findings

The survey has shown that the majority of the respondents find it difficult to find relevant information within the organization. To be more precise, 59.5% of the respondents expressed that it is very/moderately hard to find the right information. Only 11.2% stated that it is fairly easy to search for information and as few as 2.8% consider it very easy to find the desired information. The ease of finding the right information clearly has a connection with the size of the organization. When looking at organizations with fewer than 1000 employees, one can see that 30.9% of the respondents feel that it is moderately/very hard to find the right information, while the corresponding percentage for organizations with 1001 or more employees is 77.3%.

The larger the organisation, the more information, and the more cumbersome it is to search and use the right information at the right time.

Video

Kristian Norling presents the findings from the global survey on Enterprise Search implementations. Presented at Enterprise Search Summit #ESS12 in New York, Enterprise Search Europe #ESEU in London, #IKS Workshop in Salzburg and Findability Day 2012 #findday12 in Stockholm. The slides are available here.

Presentation: Results from the Enterprise Search and Findability Survey

This is a mashup of the presentations made at Enterprise Search Summit in New York, US on the 15th of May 2012 and at Enterprise Search Europe in London on the 30th of May 2012.

Global results are presented with numbers only; results from Europe and North America are clearly stated as such.

If you are interested in participating in the survey next year, please sign up. All sign-ups will receive this year's report.

Architecture of Search Systems and Measuring the Search Effectiveness

Lecture given on the 19th of April 2012 at the Warsaw University of Technology. This is the 9th lecture in the regular course for master's-level studies, “Introduction to text mining”.

A look at European Conference on Information Retrieval (ECIR) 2012

The 34th European Conference on Information Retrieval was held 1-5 April 2012, in the lovely but crowded city of Barcelona, Spain. The core conference attracted over 100 attendees, with a total of 35 accepted full papers, 28 posters, and 7 demos being presented. As opposed to the previous year, which had 2 parallel sessions, this year’s conference ran as a single session. The accepted papers covered a diverse range of topics, and were divided into query representation, blog and online-community search, semi-structured retrieval, applications, evaluation, retrieval models, classification, categorisation and clustering, image and video retrieval, and systems efficiency.

The best paper award went to Guido Zuccon, Leif Azzopardi, Dell Zhang and Jun Wang for their work entitled “Top-k Retrieval using Facility Location Analysis”, presented by Leif Azzopardi during the retrieval models session. The authors propose using facility location analysis, taken from the discipline of operations research, to address the top-k retrieval problem of finding “the optimal set of k documents from a number of relevant documents given the user’s query”.

Meanwhile, “Predicting IMDB Movie Ratings using Social Media” by Andrei Oghina, Mathias Breuss, Manos Tsagkias and Maarten de Rijke won the best poster award. With a different goal from the best paper, the authors of the poster experiment with a prediction model for rating movies using a set of qualitative and quantitative features extracted from the streams of two social media channels, YouTube and Twitter. Their findings show that the highest predictive performance is obtained by combining features from both channels, and they propose including other social media channels as future work.

Workshop Days

The conference was preceded by a full day of workshops and tutorials running in parallel. I attended two workshops: Information Retrieval Over Query Sessions (SIR) during the morning and Task-Based and Aggregated Search (TBAS) in the afternoon. The second workshop ended with an interactive discussion. A third, full-day workshop was Searching 4 Fun!.

Industry Day

The last day was the Industry Day. It featured only 2 papers, plus 5 oral contributions, and drew around 50 attendees. A strong focus of the talks given at the industry day was on opinion mining: four of the six participating companies/institutions presented work on sentiment analysis and opinion mining from social media streams. Jussi Karlgren, from Gavagai, argued that companies can use sentiment analysis from social media, for example, to find reviews or comments made about their product or service, analyse their market position, and predict price movements. Rianne Kaptein, from Oxyme, backed this up by adding that businesses are interested in what consumers say about their brand, products or campaigns on social media streams. Furthermore, Hugo Zaragoza from Websays identified two basic needs inside a company: a need for help in reading so that someone can act, and a need for help in explaining so that it can convince. A very interesting topic indeed, and research in this direction will advance as companies become more aware of the business gains from opinion mining of social media.

Overall, ECIR 2012 was a very inspiring conference. It also seemed a very friendly conference, offering many opportunities to network with fellow attendees. Despite that, several participants said that the number of attendees at this year’s conference had decreased in comparison with previous years. The workshops and the core conference gave me the impression that it has a strong focus on young researchers, as many of the accepted contributions had a student as first author and presenter at the conference. The fact that there was only one session running at a time was a good decision in my opinion, as the attendees were not forced to miss presentations. Nevertheless, the workshops and tutorials were running in parallel, and although the proceedings of the workshops will be made freely available, I still feel that I missed something that day. The industry day was very exciting, offering the opportunity to share ideas between academia and industry. However, there were not so many presentations, and the topics were not as diverse. I propose that next year Findwise should be among the speakers at the Industry track!

ECIR 2013 will be held in Moscow, Russia, on 24-28 March. See you there!

Semantic Search Engine – What is the Meaning?

The shortest dictionary definition of semantics is: the study of meaning. A more complex explanation of the term would describe a relationship that maps words, terms and written expressions to a common-sense understanding of objects and phenomena in the real world. It is worth mentioning that objects, phenomena and the relationships between them are language independent. This means that the same semantic network of concepts can map to multiple languages, which is useful in automatic translation or cross-lingual search.

The approach

In the proposed approach, semantics will be modeled as a defined ontology, making it possible for the web to “understand” and satisfy the requests and intents of people and machines to use the web content. The ontology is a model that encapsulates knowledge from a specific domain and consists of a hierarchical structure of classes (a taxonomy) that represent concepts of things, phenomena, activities etc. Each concept has a set of attributes that map that particular concept to the words and phrases that represent it in written language (as shown at the top of the figure below). Moreover, the proposed ontology model will have horizontal relationships between concepts, e.g. linguistic relationships (synonymy, homonymy etc.) or domain-specific relationships (medicine, law, military, biological, chemical etc.). An ontology model defined in this way will be called a Semantic Map and will be used in the proposed search engine. An example fragment of an enriched ontology of beverages is shown in the figure below. The ontology is enriched so that the concepts can be easily identified in text using attributes such as the representation of the concept in written text.

Semantic Map

The Semantic Map is an ontology that is used for bidirectional mapping of textual representations of concepts into a space of their meanings and associations. In this manner, it becomes possible to transform user queries into concepts, ideas and intent that can be matched with an indexed set of similar concepts (and their relationships) derived from documents, which are then returned in the form of a result set. Moreover, users will be able to refine and describe their intents using visualized facets of the concept taxonomy, concept attributes and horizontal (domain) relationships. The search module will also be able to discover users’ intents based on the history of queries and other relevant factors, e.g. ontological axioms and restrictions. A potentially interesting approach would retrieve additional information about the specific user profile from publicly available sources such as social portals like Facebook, blog sites etc., as well as from the user’s own bookmarks and similar private resources, enabling deeper intent discovery.
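To make the idea of a bidirectional mapping more concrete, here is a minimal sketch of how such a Semantic Map could be represented. It is only an illustration under our own assumptions: the class names, field names and the tiny beverage ontology (including the "alternative_to" relationship) are hypothetical and not taken from an actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Concept:
    """One node of the (hypothetical) Semantic Map."""
    concept_id: str
    labels: Dict[str, List[str]]          # language code -> surface forms
    parent: Optional[str] = None          # vertical (taxonomy) link
    related: Dict[str, List[str]] = field(default_factory=dict)  # horizontal links


class SemanticMap:
    def __init__(self, concepts: List[Concept]) -> None:
        self.concepts = {c.concept_id: c for c in concepts}
        # Reverse index: surface form -> concept ids, for the text -> meaning direction.
        self.label_index: Dict[str, Set[str]] = {}
        for concept in concepts:
            for forms in concept.labels.values():
                for form in forms:
                    self.label_index.setdefault(form.lower(), set()).add(concept.concept_id)

    def concepts_for(self, text: str) -> Set[str]:
        """Text -> meaning: which concepts can this word or phrase denote?"""
        return self.label_index.get(text.lower(), set())

    def labels_for(self, concept_id: str, language: str) -> List[str]:
        """Meaning -> text: how is the concept written in a given language?"""
        return self.concepts[concept_id].labels.get(language, [])

    def ancestors(self, concept_id: str) -> List[str]:
        """Walk up the taxonomy from a concept to the root."""
        chain: List[str] = []
        parent = self.concepts[concept_id].parent
        while parent is not None:
            chain.append(parent)
            parent = self.concepts[parent].parent
        return chain


# Hypothetical fragment of a beverage ontology with English and Swedish labels.
semantic_map = SemanticMap([
    Concept("beverage", {"en": ["beverage", "drink"], "sv": ["dryck"]}),
    Concept("coffee", {"en": ["coffee"], "sv": ["kaffe"]}, parent="beverage"),
    Concept("tea", {"en": ["tea"], "sv": ["te"]}, parent="beverage",
            related={"alternative_to": ["coffee"]}),
    Concept("espresso", {"en": ["espresso"], "sv": ["espresso"]}, parent="coffee"),
])

print(semantic_map.concepts_for("Kaffe"))
print(semantic_map.labels_for("coffee", "en"))
print(semantic_map.ancestors("espresso"))
```

Because the concepts themselves are language independent, the Swedish query term "kaffe" and the English term "coffee" resolve to the same concept, which is what makes cross-lingual search possible in this model.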

Semantic Search Map

Semantic Search Engine

The search engine will be composed of the following components:

  • Connector – This module will be responsible for acquiring data from external repositories and passing it to the search engine. The purpose of the connector is also to extract text and relevant metadata from files and external systems and pass them to further processing components.
  • Parser – This module will be responsible for text processing, including activities like tokenization (breaking text into lexemes – words or phrases), lemmatization (normalization of grammatical forms), exclusion of stop-words, and paragraph and sentence boundary detection. The result of the parsing stage is structured text with additional annotations that is passed to the semantic Tagger.
  • Tagger – This module is responsible for adding semantic information to each lexeme extracted from the processed text. Technically, this means attaching to each lexeme the identifiers of the relevant concepts stored in the Semantic Map. Moreover, phrases consisting of several words are identified and disambiguation is performed based on derived contexts. Consider the example illustrated in the figure.
  • Indexer – This module is responsible for taking all the processed information, transforming it and storing it in the search index. This module will be enriched with methods of semantic indexing using the ontology (Semantic Map) and language tools.
  • Search index – The central storage of processed documents (document repository), structured to manage the full text of the documents, their metadata and all relevant semantic information (document index). The structure is optimized for search performance and accuracy.
  • Search – This module is responsible for running queries against the search index and retrieving relevant results. The search algorithms will be enriched to use user intents (in compliance with data privacy requirements) and the prepared Semantic Map to match semantic information stored in the search index.
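The components listed above could be wired together roughly as in the following sketch. It is a heavily simplified illustration, not the proposed engine: the stop-word list, the label dictionary and all function and class names are our own assumptions, the Connector is reduced to passing in plain strings, lemmatization is approximated by lowercasing, and multi-word phrases and disambiguation are left out entirely.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical stop-word list used by the Parser.
STOP_WORDS = {"a", "an", "the", "of", "and", "with", "for"}

# Hypothetical fragment of a Semantic Map: surface form -> concept identifier.
LABEL_TO_CONCEPT = {
    "coffee": "concept:coffee",
    "java": "concept:coffee",   # ambiguous in reality; disambiguation is omitted here
    "tea": "concept:tea",
    "cuppa": "concept:tea",
}


def parse(text: str) -> List[str]:
    """Parser: tokenize, normalize case and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def tag(tokens: List[str]) -> Set[str]:
    """Tagger: attach concept identifiers to the tokens that have one."""
    return {LABEL_TO_CONCEPT[token] for token in tokens if token in LABEL_TO_CONCEPT}


class SearchIndex:
    """Indexer, search index and Search module rolled into one small class."""

    def __init__(self) -> None:
        # Inverted index: concept identifier -> ids of documents mentioning it.
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def index(self, doc_id: str, text: str) -> None:
        for concept in tag(parse(text)):
            self.postings[concept].add(doc_id)

    def search(self, query: str) -> List[str]:
        # The query goes through the same Parser and Tagger as the documents,
        # so matching happens on concepts rather than on literal words.
        hits: Set[str] = set()
        for concept in tag(parse(query)):
            hits |= self.postings.get(concept, set())
        return sorted(hits)


index = SearchIndex()
index.index("d1", "A cup of java with breakfast")
index.index("d2", "Afternoon tea and scones")

print(index.search("coffee"))
print(index.search("cuppa"))
```

The point of the sketch is that a query for "coffee" finds the first document even though the word itself never appears in it, because both are mapped to the same concept before matching.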

What do you think? Please let us know by writing a comment.

Searching for Zebras: Doing More with Less

There is a very controversial and highly cited 2006 British Medical Journal (BMJ) article called “Googling for a diagnosis – use of Google as a diagnostic aid: internet based study” which concludes that, for difficult medical diagnostic cases, it is often useful to use Google Search as a tool for finding a diagnosis. Difficult medical cases are often represented by rare diseases, which are diseases with a very low prevalence.

The authors use 26 diagnostic cases published in the New England Journal of Medicine (NEJM) in order to compile a short list of symptoms describing each patient case, and use those keywords as queries for Google. The authors, blinded to the correct disease (a rare disease in 85% of the cases), select the most ‘prominent’ diagnosis that fits each case. In 58% of the cases they succeed in finding the correct diagnosis.

Several other articles also point to Google as a tool often used by clinicians when searching for medical diagnoses.

But is that so convenient, is that enough, or can this process be easily improved? Indeed, two major advantages for Google are the clinicians’ familiarity with it, and its fresh and extensive index. But how would a vertical search engine with focused and curated content compare to Google when given the task of finding the correct diagnosis for a difficult case?

Well, take an open-source search engine such as Indri, index around 30,000 freely available medical articles describing rare or genetic diseases, use an off-the-shelf retrieval model, and there you have Zebra. In medicine, the term “zebra” is slang for a surprising diagnosis. In comparison with a search on Google, which often returns results that point to unverified content from blogs or content aggregators, the documents in this vertical search engine are crawled from 10 web resources that contain only rare and genetic disease articles and are mostly maintained by medical professionals or patient organizations.

Evaluated on a set of 56 queries extracted in a similar manner to the one described above, Zebra easily beats Google. Zebra finds the correct diagnosis in the top 20 results in 68% of the cases, while Google succeeds in 32% of them. And this is only the performance of Zebra with the baseline relevance model. Imagine how much more could be done (for example, displaying results as a network of diseases, clustering or even ranking by diseases, or automatic extraction and translation of electronic health record data).
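The measure used in this comparison, the share of queries for which the correct diagnosis appears among the top 20 results, is straightforward to compute. Below is a minimal sketch; the function name, the data layout and the toy data are our own assumptions and do not come from the Zebra evaluation itself, which may also have matched diagnoses in a more lenient way than exact string equality.

```python
from typing import Dict, List


def success_at_k(ranked_results: Dict[str, List[str]],
                 correct_diagnosis: Dict[str, str],
                 k: int = 20) -> float:
    """Fraction of queries whose correct diagnosis appears in the top k results."""
    hits = sum(
        1
        for query_id, diagnosis in correct_diagnosis.items()
        if diagnosis in ranked_results.get(query_id, [])[:k]
    )
    return hits / len(correct_diagnosis)


# Toy data with made-up identifiers: ranked diagnoses returned per query.
ranked = {
    "q1": ["disease_a", "disease_b"],
    "q2": ["disease_c"],
    "q3": [],
}
correct = {"q1": "disease_b", "q2": "disease_x", "q3": "disease_y"}

print(success_at_k(ranked, correct, k=20))
print(success_at_k(ranked, correct, k=1))
```

Running the same function over the ranked lists of two engines for the same set of queries gives directly comparable numbers, which is how a figure such as 68% versus 32% can be read.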