SLTC 2012 in retrospect – two cutting-edge components

The 4th Swedish Language Technology Conference (SLTC) was held in Lund on 24-26 October 2012.
It is a biennial event organized by prominent research centres in Sweden.
The conference is, therefore, an excellent venue to exchange ideas with Swedish researchers in the field of Natural Language Processing (NLP), as well as present own research and be updated of the state-of-the-art in most of the areas of Text Analytics (TA).

This year Findwise participated in two tracks – in a workshop and in the main conference.
As the area of Search Analytics (SA) is very important to us, we decided to be proactive and sent an application to organize a workshop on the topic of “Exploratory Query Log Analysis” in connection with the main conference. The application was granted and the workshop was very successful. It gathered researchers who work in the area of SA from very different perspective – from utilizing deep Machine Learning to discover users’ intent,  to looking at query logs as a totally new genre. I will do a follow-up on that in another post. All the contributions to the workshop will also be uploaded on our research page.

As for the main conference, we had two papers accepted for presentation. The first one dealt with the topic of document summarization – both single and multidocument summarization
(http://www.slideshare.net/findwise/extractive-document-summarization-an-unsupervised-approach).
The second paper was about detecting Named Enities in Swedish
(http://www.slideshare.net/findwise/identification-of-entities-in-swedish).

These two papers presented de facto state-of-the-art results for Swedish both when it comes to document summarization and Named Entity Recognition (NER). As for the former task, there is neither a standard corpus for evaluation of summarization systems, nor many previous results and just few other systems which made it unfeasible to compare our own system with. Thus, we have contributed two things to the research in document summarization – a Swedish corpus based on featured Wikipedia articles to be used for evaluation and a system based on unsupervised Machine Learning, which by relying on domain boosting achieves state-of-the-art results for English and Swedish. Our system can be further improved by relying on our enhanced NER and Coreference resolution modules.

As for the NER paper, our Entity recognition system for Swedish achieves 74.0% F-score, which is 4% higher than another study presented simultaneously at SLTC (http://www.ling.su.se/english/nlp/tools/stagger). Both systems were evaluated on the same corpus, which is considered a de facto standard for evaluation of different NLP resources for Swedish. The unlabelled score (i.e. no fine-grained division of classes but just entity vs non-entity) of our system achieved 91.3% F-score (93.1% Precision and 89.6% Recall). When identifying people, the Findwise NER system achieves 78.1% Precision and 90.5% Recall (83.9% F-score).

So, what did we take home from the conference? We were really happy to see that the tools we develop for our customers are not something mediocre but rather something that is of very high quality and is the state-of-the-art in Swedish NLP. We actively share our results and our corpora for research perposes. Findwise showed keen interest in cooperating with other researchers in developing better tools and systems in the area of NLP and Text Analytics. And this I think is a huge bonus to all our current and prospective customers – we actively follow the current trends in the research community and cooperate with researchers, and our products do incorporate the latest findings in the field, which make us leverage both high quality and cutting-edge technology.

As we continuously improve our products, we have also released a Polish NER and some work has been initiated on Danish and Norwegian ones. More NLP components will be soon available for demo and testing on our research page.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>