The Findability blog

the enterprise search and findability blog by Findwise


Tag Archives: Uppsala University

Apache Nutch Making Use of Open Pipeline

Posted on November 11, 2010 by Anders Rask

For the last couple of months I’ve been working on a project for Uppsala University. The project’s goal is to improve findability on the university web site. The solution we are building is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well, and it also gives us a page rank for each page that we can use for relevance tuning. Besides the web information crawled by Nutch, the search application will also be used to search for people and organizational information that we index from another source. I thought I would share some details on how we are using Nutch.

We have made two extensions to Nutch. The first is a parser plug-in that runs Open Pipeline embedded in it. This extension was important because it gives us better control over the information we index to Solr and lets us reuse our existing Open Pipeline components. The main stages of the pipeline are the following:

  1. Extract the encoding of a web page
  2. Extract all links from a web page
  3. Extract all headings (hx) from a web page
  4. Remove all tags that don’t contain complete sentences on a web page
  5. Extract text and metadata from different types of documents with Tika
  6. Do some metadata mapping and cleaning
  7. Populate facets according to metadata and/or URL
  8. Do static URL ranking
  9. Replace certain common titles with the largest heading of the web page
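
To make the idea of chained stages more concrete, here is a minimal sketch in Python of how stages 3, 7, 8 and 9 could fit together. It is not the Open Pipeline API and not the project's actual code: the stage functions, the `GENERIC_TITLES` set, the `FACET_BY_PATH_PREFIX` mapping and the depth-based rank formula are all assumptions made for illustration, and "largest heading" is interpreted here as the heading with the lowest level number.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadingExtractor(HTMLParser):
    """Collects (level, text) pairs for h1..h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buf = []

    def handle_data(self, data):
        if self._level is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace inside the heading text.
            text = " ".join("".join(self._buf).split())
            if text:
                self.headings.append((self._level, text))
            self._level = None


# Hypothetical configuration, invented for this sketch.
GENERIC_TITLES = {"untitled", "home", "start"}
FACET_BY_PATH_PREFIX = {"/research": "Research", "/education": "Education"}


def extract_headings(doc):
    parser = HeadingExtractor()
    parser.feed(doc["html"])
    parser.close()
    doc["headings"] = parser.headings
    return doc


def populate_facets(doc):
    path = urlparse(doc["url"]).path
    doc["facets"] = [name for prefix, name in FACET_BY_PATH_PREFIX.items()
                     if path.startswith(prefix)]
    return doc


def static_url_rank(doc):
    # Assumption: pages closer to the site root get a higher static rank.
    depth = len([seg for seg in urlparse(doc["url"]).path.split("/") if seg])
    doc["url_rank"] = 1.0 / (1 + depth)
    return doc


def replace_generic_title(doc):
    title = doc.get("title", "").strip().lower()
    if title in GENERIC_TITLES and doc.get("headings"):
        # min() picks the first heading with the lowest level (h1 before h2).
        doc["title"] = min(doc["headings"], key=lambda h: h[0])[1]
    return doc


STAGES = [extract_headings, populate_facets, static_url_rank, replace_generic_title]


def run_pipeline(doc, stages=STAGES):
    """Passes the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Each stage takes a document and returns it, so stages can be reordered or reused independently, which is the property that made embedding a pipeline in the parser attractive for us.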

The other extension we made to Nutch is an indexing filter that makes sure all our metadata fields are indexed to Solr.
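
Conceptually, such a filter copies a chosen set of metadata fields from the parse result onto the document that is sent to the index. The sketch below shows that idea in plain Python. It does not use the Nutch indexing filter interface, and the field names in `INDEXED_METADATA_FIELDS` are hypothetical.

```python
# Hypothetical list of metadata fields that should reach the index.
INDEXED_METADATA_FIELDS = ("headings", "facets", "url_rank")


def add_metadata_fields(index_doc, parse_metadata, fields=INDEXED_METADATA_FIELDS):
    """Returns a copy of index_doc with the selected metadata fields added.

    Fields that are missing or empty in parse_metadata are skipped, and
    metadata keys that are not listed in `fields` are never copied.
    """
    out = dict(index_doc)
    for name in fields:
        value = parse_metadata.get(name)
        if value is None or value == "" or value == []:
            continue
        # Copy lists so later changes to the metadata do not leak into the index doc.
        out[name] = list(value) if isinstance(value, (list, tuple)) else value
    return out
```

Keeping the field list explicit means that a new pipeline stage can emit extra metadata without it reaching Solr until the field is deliberately added to the list.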

So far so good. Fetching, parsing and indexing now work well. Currently our largest challenge is tuning all the different relevance parameters we have, as well as harmonizing the relevance of web information with that of people and organizational information. I will have to get back to you on how that went!

Posted in Data Processing, Findability, Lucene, Open source, Solr | Tagged Apache HTTP Server, Apache Software Foundation, Apache Solr, Cross-platform software, Doug Cutting, findability, internet search engines, Knowledge representation, Lucene, Metadata, Nutch, Open Pipeline, search application, university web site, Uppsala University, web crawler, web information
