The Findability blog

the enterprise search and findability blog by Findwise

Tag Archives: Metadata

Enterprise search case study: Vårdaktörsportalen makes reliable information easy to find for health professionals

Posted on December 11, 2012 by Henrik Jacobsson

Vårdaktörsportalen (VAP) is a portal for health care providers made by Västra Götalandsregionen (VGR). This portal makes information from a number of reliable and authorised sources findable and accessible for the people who need it in their daily work, from doctors and nurses to medical secretaries, primarily located in the region of Västra Götaland, Sweden. The site and most of its features and information are also accessible (in Swedish) to anyone through http://vap.vgregion.se. The first version of the site went live in November 2012, and Findwise had a big role in the creation of this search-centric site. The main source of information for VAP is the regional guidelines found in a document repository within VGR, but some external sources are also included. These include trustworthy authorities like

  • Socialstyrelsen
  • Läkemedelsverket
  • SBU
  • TLV
  • 1177 - the public health care information site for citizens

This search solution is built around the open source search engine Apache Solr and our common tools for processing and indexing. For this site we have also implemented a rather unique metadata enhancement service that automatically extracts keywords from each document to be indexed and attaches them as metadata. The keyword extraction is based on information from the medical term database SweMeSH. More information (in Swedish) can be found on Google Code. We also add synonyms of the keywords to increase recall, making it easier to find documents regardless of which synonym is used.
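As an illustration of the idea only, dictionary-based keyword extraction with synonyms could look roughly like the sketch below. The term list, the field names and the function `extract_keywords` are hypothetical stand-ins; this is not the actual VAP service, and a real term database such as SweMeSH is far larger and hierarchical.

```python
import re

# Hypothetical mini term list standing in for a real term database.
# Each preferred term maps to synonyms that should lead to the same keyword.
TERMS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
}

def extract_keywords(text, terms=TERMS):
    """Return metadata fields for a document: matched keywords plus their synonyms."""
    lowered = text.lower()
    keywords, synonyms = [], []
    for preferred, alternatives in terms.items():
        variants = [preferred] + alternatives
        # Match whole words only, so a term is not found inside a longer word.
        if any(re.search(r"\b" + re.escape(v) + r"\b", lowered) for v in variants):
            keywords.append(preferred)
            synonyms.extend(alternatives)
    return {"keywords": keywords, "keyword_synonyms": synonyms}

doc = "Guidelines for patients with high blood pressure."
fields = extract_keywords(doc)
```

Indexing both fields means a query for either the preferred term or a synonym can match the document, which is where the gain in recall comes from.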

The metadata enhancement service was included because the quality of metadata on the external sites was not very good. VGR will work with the above-mentioned authorities to try to make them understand that they would benefit from improving their data. The source 1177 stands out with very good metadata and overall good quality texts.

We have conducted a user study to see how well this first version satisfies user demands. The study showed that some local sources are missing, but the feedback on the idea and the graphical design was generally positive.

Findwise is looking forward to continuing to work on VAP with VGR in the future to make it an even better tool.

Related links:

  • http://vap.vgregion.se
  • http://webbfunktion.com/grafisk-form-pa-vardaktorsportalen/
  • http://vardaktorsportalen.se/
Posted in Content refinement, Data Processing, Enterprise Search, Findability, Information quality, Knowledge management, Open source, Solr | Tagged Apache Solr, Data management, Information retrieval, Metadata, open-source search engine, search centric site, search engine, search solution, Sweden, Västra Götaland, Web portal | Leave a reply

People, Topics and Information Flow Key for Findability

Posted on August 20, 2012 by Kristian Norling

Understanding and utilizing the context of both people and topic (subject) is the future of enterprise search and findability. As we have seen over the last few years, the amount of information that is created within organisations and elsewhere is growing exponentially. This makes it harder, day by day, to find the information that is relevant at any given moment. By organizing information based on topic, by using text analytics, better metadata, adding user tagging, sentiment analysis etc. it is possible to make findability better. A few examples are mentioned in this blog post series on information flow from 2010. The whole point of findability boils down to improving the information flow and access at any given time.

Example of information flow from the intranet of Region Västra Götaland.

In order to make sense of any arbitrary information, we as humans usually need the help of someone familiar with the topic to understand it. By addressing both the challenge of finding people with the right knowledge and the challenge of finding the right information, we can contextually make the information more relevant and easier to find.

For example, by doing search analytics and looking at usage patterns in general, or by looking at how people with the same usage (search) patterns go about finding information, we can give better suggestions. Also, recommendations of information produced or liked by people who are like you have a better chance of being relevant to you. By using Social Network Analysis, we should be able to find patterns in what information is in demand and how the information flows. The analysis can of course also be used to find the supernodes, meaning the people through which information and connections flow. For example, email is an under-utilized source of information flow, knowledge, context and social network analysis.

On the 28th of August, at the World Café in Oslo, Kristian Norling will talk about findability and collaboration, with a focus on people and topic centric solutions. Examples from Region Västra Götaland and other projects will be presented.

Posted in Business, Findability, Findwise, Future development, Presentation, Search, Uncategorized | Tagged Enterprise Search, findability, Information, Knowledge, Knowledge representation, Kristian Norling, Metadata, Oslo, topic centric solutions, World Café | 2 Replies

Data and Search Going Big?

Posted on April 25, 2012 by Martin Johansson

A few enterprise search specialists from Findwise recently attended the Scandinavian Developer Conference 2012. One of the tracks was Big Data, which is very much related to search. It had some interesting talks about how to handle large amounts of data in an efficient way. Special thanks to Theo Hultberg, Jim Webber and Tim Berglund!

The theme was that you should choose a storage system which is well suited for the task. This may seem like an obvious point, but for a long time this was simply ignored; I’m talking about the era of relational databases. Don’t get me wrong, sometimes a relational database is the very best for the job, but in many cases it isn’t.

Data is jagged by nature, i.e. not all objects have the same properties. This is why we shouldn’t force them to fit into a square table; instead, everything should be denormalized! The application accessing the data will be aware of the information structure and will handle it accordingly. This will also avoid expensive assembly operations (such as joins) to get the data in the format we want when retrieving it. Why should you split up your data if you are going to assemble it over and over again? Also remember that disk space is cheap, so pre-compute as much as possible. The design of a Big Data system should be governed by how the data will be retrieved.
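A toy sketch of the difference, using plain Python dictionaries rather than any particular database (all names and data here are made up for illustration):

```python
# Normalized form: two "tables" that must be joined at read time.
authors = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
posts = [
    {"id": 10, "author_id": 1, "title": "Jagged data"},
    {"id": 11, "author_id": 2, "title": "Cheap disks", "tags": ["storage"]},
]

def read_with_join(post_id):
    """Assemble the view on every read (the expensive path)."""
    post = next(p for p in posts if p["id"] == post_id)
    return {**post, "author_name": authors[post["author_id"]]["name"]}

# Denormalized form: pre-compute the joined documents once, at write time.
# The documents are jagged: "tags" exists only on some of them.
denormalized = {p["id"]: read_with_join(p["id"]) for p in posts}

def read_denormalized(post_id):
    """A single key lookup; no assembly needed at read time."""
    return denormalized[post_id]
```

The trade-off is that the author name is now stored once per post, so a change to it has to be propagated to every copy. That cost is accepted in exchange for cheap reads.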

Another step away from the relational databases is the relaxation of some of the ACID properties: Atomicity, Consistency, Isolation and Durability. Again, this is along the lines of choosing the components best suited for the system. Decide which properties are a must have and which are not so important.

Relaxing the ACID properties, such as consistency, can give great performance gains. The NoSQL database Cassandra is eventually consistent and its write performance scales linearly up to 288 nodes (and probably even higher) which gives a write performance of over 1 million writes per second!

However, relaxation of these properties is not a new concept in the world of search engines. When indexing a document, it will usually take a number of seconds before it is searchable. This is called eventual consistency, i.e. the state of the search engine will be brought from one valid state to another, within a sufficiently long period of time. Do we really need documents that were just submitted to the search engine to be searchable instantly? Most likely, no. Isolation is another property that is not crucial to a search engine. Since a document in an index doesn’t have any explicit relations to any other documents in the same index, there isn’t a great need for isolation. If two writes for the same document are submitted at the same time, there is probably something wrong in another part of the system.

So what does all this mean for search? There is an interesting challenge in storing jagged data in large amounts and then making good use of it. To search in vast amounts of jagged data, you need a lot of query-time field mappings (to make relevant data searchable) … or do you? There is also the issue of retaining a good relevancy model, which is absolutely vital to a search engine. How do you measure the relevance of arbitrary metadata and then weigh it all together? Maybe we need to think in new ways about relevance altogether?

Whoever can solve these problems in a good way, with a minimum amount of manual labor, is a name we’ll be hearing a lot in the future.

Posted in Big Data, Conference, Search, Search Watch, Technology | Tagged ACID, Atomicity, big data, conference, data, Data management, Database, Database management systems, Database theory, Databases, Durability, Enterprise Search, enterprise search specialists, Isolation, Jim Webber, Linearizability, Metadata, relational database, scandev, search, search engine, search engines, Theo Hultberg, Tim Berglund, Transaction processing | 1 Reply

Automated Testing of Enterprise Search Solutions

Posted on March 8, 2012 by Mickel Gronroos

Quality assuring an enterprise search solution is challenging, yet important. The challenge is to be able to do continuous follow-up of the quality of the solution during implementation but also after release, when the solution is in production and operated by an operations team. Testing is important, but it is also costly – unless it can be automated.

So what kind of testing is specific for a search application? And what of that can be automated?

The whole idea of Enterprise Search is to provide the right information to the right people at the right time. The information made findable is normally stored in many different information systems and the information in these systems is constantly changing. In the end, every enterprise search solution operates in a context where the requirements of the end-users and the available content change on a daily basis. In other words, assuring the quality of enterprise search is about assuring the quality of the information and the way that information is accessed by and delivered to the end-users.

During our engagements over the years, we have set routines and developed tools for automated testing of enterprise search. What we specifically want to track in an automated fashion is:

  • Completeness
  • Freshness
  • Access restrictions
  • Metadata quality
  • Performance
  • Relevance

Allow me to take a few moments and describe what this means.

Completeness testing

Completeness tests aim to make sure that the search index is complete – that all information objects (such as web pages and documents) that are supposed to be searchable are really searchable. In addition, completeness testing provides proof that the correct parts of the information objects are indexed for retrieval, e.g. all pages in a multi-page document, as well as titles and other searchable metadata. It is also important to monitor that information that should not be searchable is indeed not indexed, e.g. headers and footers of web pages.
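A minimal sketch of such a check, assuming you can list the document ids that are supposed to be searchable (from the source systems) and the ids actually present in the index; how those lists are obtained depends on the systems involved and is not shown here:

```python
def completeness_report(expected_ids, indexed_ids):
    """Compare what should be searchable with what the index actually holds."""
    expected, indexed = set(expected_ids), set(indexed_ids)
    return {
        "missing": sorted(expected - indexed),     # should be searchable but is not
        "unexpected": sorted(indexed - expected),  # searchable but should not be
    }

report = completeness_report(
    expected_ids=["doc-1", "doc-2", "doc-3"],
    indexed_ids=["doc-1", "doc-3", "doc-9"],
)
```

The same comparison can be repeated at field level, for instance by checking that the title and the last page of a known multi-page document are retrievable.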

Freshness testing

Freshness tests aim to make sure that the search index is up to date, i.e. new content that has been added to a source (such as a document management system) becomes searchable, deleted content is removed automatically from the search index and updated content is updated in the search index – all in due time.
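One way to sketch a freshness check, assuming both the source system and the search index can report a timestamp per document. The function and parameter names are hypothetical, and "in due time" is represented by a `max_lag` you choose yourself:

```python
from datetime import datetime, timedelta

def stale_documents(source_modified, index_modified, now, max_lag):
    """Return (stale, orphans): ids lagging behind the source, and ids
    still in the index although they are gone from the source."""
    stale = []
    for doc_id, changed_at in source_modified.items():
        indexed_at = index_modified.get(doc_id)
        up_to_date = indexed_at is not None and indexed_at >= changed_at
        # A change only counts as a failure once the agreed window has passed.
        if not up_to_date and now - changed_at > max_lag:
            stale.append(doc_id)
    # Deleted content must disappear from the index as well.
    # (This sketch flags it immediately, without a grace period.)
    orphans = [d for d in index_modified if d not in source_modified]
    return sorted(stale), sorted(orphans)

now = datetime(2012, 3, 8, 12, 0)
source = {"a": now - timedelta(hours=5), "b": now - timedelta(minutes=10)}
index = {
    "a": now - timedelta(hours=8),   # indexed before the latest change
    "b": now - timedelta(hours=1),   # recent change, still within the window
    "c": now - timedelta(days=1),    # no longer exists in the source
}
stale, orphans = stale_documents(source, index, now, max_lag=timedelta(hours=1))
```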

Testing access restrictions

If an enterprise search solution provides access to access-controlled information, it is of the utmost importance to be able to prove that security is never compromised. Testing access restrictions aims to do precisely that. What one needs to monitor is that existing document-level security works, i.e. that people who should have access to an information object really have access and that people who shouldn’t have access don’t. The tricky part is to monitor that a change in access privileges, for instance in Active Directory or in the access restrictions (the ACL) for a particular document, is handled in the search index as well, in due time.
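Such tests can be expressed as a table of cases stating who must and who must not see a given document. The sketch below runs the cases against a stand-in search function; the in-memory ACL table and `fake_search` are hypothetical placeholders for a real search client:

```python
def visible_to(user_groups, doc_acl):
    """A document is visible if the user shares at least one group with its ACL."""
    return bool(set(user_groups) & set(doc_acl))

def check_access(search, cases):
    """Run (user, groups, doc_id, should_see) cases and collect the failures."""
    failures = []
    for user, groups, doc_id, should_see in cases:
        hits = search(user_groups=groups)
        if (doc_id in hits) != should_see:
            failures.append((user, doc_id))
    return failures

# Hypothetical in-memory index: document id -> groups allowed to read it.
INDEX_ACL = {"salary.xlsx": ["hr"], "lunch-menu.html": ["hr", "staff"]}

def fake_search(user_groups):
    return [d for d, acl in INDEX_ACL.items() if visible_to(user_groups, acl)]

cases = [
    ("anna", ["hr"], "salary.xlsx", True),
    ("bob", ["staff"], "salary.xlsx", False),
    ("bob", ["staff"], "lunch-menu.html", True),
]
failures = check_access(fake_search, cases)
```

Negative cases (the second one here) matter at least as much as positive ones, since a leak is worse than a missing hit. To cover the tricky part, the same cases can be re-run after a privilege change in the source system.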

Testing metadata quality

Each information object in the search index contains a set of fields containing metadata and text, e.g. a title, the text body, an author, a timestamp containing last modification date, information on file format, a keywords field and many more.

In an enterprise search setting, many different information models implemented in the source systems need to be harmonized into one common domain model (schema/index profile/information model) in the search index. This means information regarding a creator of an information object in one system and a publisher of an information object in another system can be stored in a common author metadata field in the search index in a common, defined format such as Firstname Lastname regardless of formatting in the source system. Unless you have a common model in the index, you can’t provide features like cross-system filtering with facets.

So how do you track that the metadata in the search index stays in good shape? This is the aim of metadata testing. The test cases provided for metadata testing need to check that the metadata in the search index conforms to the defined domain model and formatting even when the underlying content changes in the source systems.
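As a rough sketch, a metadata test can validate each indexed document against the rules of the domain model. The three fields and their formats below are assumptions chosen for illustration (including the Firstname Lastname rule, simplified to two words of letters):

```python
import re

# Hypothetical domain model: required fields and a format rule for each.
REQUIRED_FIELDS = {
    "title": re.compile(r"\S"),                         # must not be blank
    "author": re.compile(r"^[^\W\d_]+ [^\W\d_]+$"),     # Firstname Lastname
    "last_modified": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # YYYY-MM-DD
}

def metadata_violations(doc):
    """Return a list of human-readable problems for one indexed document."""
    problems = []
    for field, pattern in REQUIRED_FIELDS.items():
        value = doc.get(field)
        if not value:
            problems.append(f"{field}: missing")
        elif not pattern.search(value):
            problems.append(f"{field}: bad format")
    return problems

good = {"title": "Annual report", "author": "Henrik Jacobsson", "last_modified": "2012-12-11"}
bad = {"title": "Report", "author": "Jacobsson, H."}
```

Running this over a sample of documents from each source system after every crawl shows quickly when a source has started delivering data in a different shape.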

Performance testing

Performance testing is probably the easiest type of test you can create and run. In the end you will have a threshold or pain limit in milliseconds within which a query in the enterprise search solution will be required to provide an answer, even at peak times with high query loads. Normally you will also be monitoring issues like RAM and processor capacity usage of the software components of your solution to be able to generate automatic alerts to the maintenance team if the hardware is under too much pressure.
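A minimal sketch of a latency check against such a pain limit. Judging the 95th percentile instead of the average is an assumption made here, not something prescribed above, and `run_query` stands for whatever call executes a query in your solution:

```python
import math
import time

def measure_latencies_ms(run_query, queries):
    """Time each query and return the latencies in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def within_pain_limit(latencies_ms, limit_ms, pct=95):
    """True if the chosen percentile of the latencies stays within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms

# Stand-in query function; replace with a call to the real search endpoint.
sample = measure_latencies_ms(lambda q: None, ["terawatt hours", "lunch menu"])
```

To say anything about peak times, the queries need to be replayed concurrently at a realistic rate, which this single-threaded sketch does not attempt.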

Relevance testing

Quality assuring the relevance model of an enterprise search solution is tricky, largely because relevance in a result set is to some extent subjective. However, when implementing search, one does need to set a relevance model that presupposes a set of business rules for what type of content is to be deemed more important than other content. For example, when making documents in a document management system searchable, a typical business rule would be that documents tagged with Status=Approved must always be deemed more important than documents with any other status (such as Preliminary or Deprecated). Another typical rule is that a document for which a query term can be found in the title or in the keywords metadata field is most likely more important than documents where the query term is found elsewhere in the text body.

What it all boils down to is the definition of the business rules for relevance. Once you have defined the rules that govern how the results are to be ranked, you can also create test cases, i.e. associate query terms with information objects that must be returned as top results given these terms.
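Such test cases can be written as a mapping from query terms to the documents that must appear among the top results. In the sketch below, `toy_search` is a made-up ranker with arbitrary weights that mimics the two example rules (title match and Status=Approved); in practice you would call the real search solution instead:

```python
DOCS = [
    {"id": "d1", "title": "Vacation policy", "body": "rules", "status": "Approved"},
    {"id": "d2", "title": "Draft notes", "body": "vacation policy ideas", "status": "Preliminary"},
    {"id": "d3", "title": "Vacation policy", "body": "old rules", "status": "Deprecated"},
]

def toy_search(query):
    """Rank matching documents: title hits beat body hits, Approved beats all."""
    q = query.lower()
    scored = []
    for doc in DOCS:
        score = 0.0
        if q in doc["title"].lower():
            score += 2.0
        if q in doc["body"].lower():
            score += 1.0
        if score and doc["status"] == "Approved":
            score += 5.0
        if score:
            scored.append((score, doc["id"]))
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: -s[0])]

def relevance_failures(search, test_cases, top_n=3):
    """Return (query, missing_ids) for every case whose expected documents
    are not all among the top_n results."""
    failures = []
    for query, expected_ids in test_cases.items():
        top = search(query)[:top_n]
        missing = [d for d in expected_ids if d not in top]
        if missing:
            failures.append((query, missing))
    return failures

ranking = toy_search("vacation policy")
```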

Automating it all

Once you have defined your test cases for all the above-mentioned types of tests in a test plan, you are ready to automate, i.e. enter the test plan into a test automation framework. The beauty of it all is that you can automate regression testing during the implementation phase of an enterprise search solution, i.e. continuously test that new development does not break parts of the solution that previously worked as intended. This is particularly important if you add new information sources to your enterprise search solution, when there is a high risk that the relevance model that worked fine yesterday all of a sudden gets out of order. In addition, after the release of the enterprise search solution, the test automation framework will assist the operations team in monitoring that the solution behaves as expected even after the implementation team has left the building. All in all this leads to continuously good quality of the solution while lowering the costs for monitoring.

Posted in Development, Enterprise Search, Governance, Search, Testing | Tagged author, common author, Concept Search, content management systems, Document Management System, Enterprise content management, Enterprise Search, enterprise search setting, enterprise search solution, Index, Information retrieval, information systems, Information technology management, Metadata, RAM, Relevance, search application, search index, search index stays, Searching, software components | Leave a reply

Content Choreography?

Posted on October 27, 2011 by Christopher Wallstrom

Is getting the right content to the right users and customers a priority for you and your organisation? Do you drown in too much information? With some insight into how to manage content your answer is probably “Yes!”.

Today we have loads of channels to choose from: e-mails, internet/intranets, Yammer feeds, blogs and different collaboration platforms and social media services. Some content is more beneficial in one channel and other content in another channel. But how do you make sure the right information reaches the right users, in the right channels?

Content Choreography aims to handle all that: content, strategy, format and delivery.

We need to tailor the user/customer experience in order to achieve good Findability. How? Taxonomy, Metadata and Search!
Taxonomy to ensure that we speak the same language, metadata to classify the content to fulfill a certain task or objective and search to deliver it to the right channel.

Need more information about Content Choreography?
Join us in our joint seminar with KnowIT, Nov 22nd: Future Choreography of Content Management, where Seth Earley – CEO at Earley & Associates – will speak about Content Choreography – The Art of Dynamic Web Content. Seth Earley has more than 20 years of experience in the field and is a very eloquent and interesting speaker. He will share his thoughts and ideas gathered from a number of large customers worldwide.

More information and registration can be found here.

Posted in Information Architecture, Information management, Strategy, Uncategorized | Tagged CEO, Content, content management, content management systems, Data management, Early Associates, eloquent and interesting speaker, findability, Information science, internet/intranets, Intranet, Knowledge representation, Metadata, Seth Earley, social media services, Web Content, web design, Yammer | Leave a reply

Collaborative, Social and Adaptive Relevance in Enterprise Search

Posted on June 30, 2011 by Mickel Gronroos

Providing spot-on results with good relevance in Enterprise Search solutions is one of the hardest tasks when working with findability. Sure, it is doable to work out a generic model for ranking results based on the organization’s most common requirements on findability in conjunction with available metadata of the information made findable. But is it enough?

The burning question is: How can you ensure that the generic relevance model does not get outdated once the Findability solution has been in use for a month, half a year, a year and the implementation crew is long gone?

Findwise recently released a large Enterprise Findability solution at a customer in the electrical power industry in Sweden. In the project we identified personalized and adaptive relevance as two key requirements for the findability solution to provide real, future-proof value-in-use to a large set of people with fundamentally different roles within the company. This blog post will focus on the latter requirement, adaptiveness: How can we make sure that an Enterprise Findability solution returns search results that become better and better as the solution is used?

Let user behavior improve the behavior of the search tool

The Enterprise Findability solution rolled out at the power company contains two features that, put together, build the foundation of a continuously improving relevance model:

  1. A feature that promotes popular content given a query term – “social relevance”
  2. A feature that continuously changes the relevance model by boosting the relevance of popular documents – “adaptive relevance”

Social relevance

Inspired by e-commerce actors on the web, the delivered Enterprise Findability solution uses the logged behavior of its users to promote popular content. When an end-user searches for, e.g. “terawatt hours”, the solution by default offers search results ranked and sorted according to the generic relevance model. This is what any search tool would do. But this solution also uses search logs to promote popular content just as e-commerce sites have been doing for years – “Other people searching for ‘terawatt hours’ viewed ‘Current power production’ (intranet page), ‘Definition of terms in the electrical power industry’ (PDF document)” etc.

By combining the intel of the search logs (where the end-user behavior of an Enterprise Findability solution is constantly collected) and the best bets (editorially provided “sponsored links”) with the regular search result, end-users are presented with a rich set of information answering their original question from different angles. And the best part of it is that the social relevance feature constantly improves as the tool is used. People get better results as time goes by.
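The "other people searching for this viewed that" part can be sketched as a simple count over the search log. The log format assumed here, (query, clicked document) pairs, and the function names are illustrative only and say nothing about how the delivered solution stores its logs:

```python
from collections import Counter, defaultdict

def build_also_viewed(search_log):
    """Count, per normalized query, how often each document was clicked."""
    clicks = defaultdict(Counter)
    for query, doc in search_log:
        clicks[query.strip().lower()][doc] += 1
    return clicks

def also_viewed(clicks, query, limit=2):
    """Return the most clicked documents for a query, most popular first."""
    counter = clicks.get(query.strip().lower(), Counter())
    return [doc for doc, _ in counter.most_common(limit)]

log = [
    ("terawatt hours", "Current power production"),
    ("Terawatt hours", "Current power production"),
    ("terawatt hours", "Definition of terms in the electrical power industry"),
    ("lunch", "Weekly lunch menu"),
]
clicks = build_also_viewed(log)
suggestions = also_viewed(clicks, "terawatt hours")
```

Because the counts are rebuilt from the log, the suggestions improve on their own as more searches are logged.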

Adaptive relevance

In addition to the social relevance feature, the vast amount of real search behavior compiled in the search logs is used for improving the generic relevance model as well. The solution tracks changes in popularity of content and adapts the document-level scores of documents and web pages in the search index accordingly. If a document is accessed often through the search tool, the document will be deemed “more important” and start climbing towards top positions in the search result. And if a previously popular document becomes less popular as time goes by, the document’s impact on the relevance model is decreased. In the end, content that has great importance for a limited amount of time (such as news items and weekly lunch menus) will first peak and then dip in the search index. The search index and the generic relevance model attached to it will stay fresh.
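One possible way to get this peak-and-dip behavior is a time-decayed click count used as a document-level boost. The exponential decay, the half-life of seven days and the logarithmic damping are assumptions made for this sketch; the post does not say which formula the delivered solution uses:

```python
import math

def popularity_boost(click_ages_days, half_life_days=7.0):
    """Document-level boost from clicks, where each click loses half of its
    weight every half_life_days."""
    decayed = sum(0.5 ** (age / half_life_days) for age in click_ages_days)
    # Dampen with a logarithm so very popular documents do not
    # completely drown out textual relevance.
    return math.log1p(decayed)

fresh_news = popularity_boost([0, 0, 1, 1, 2])     # clicked a lot this week
old_news = popularity_boost([30, 30, 31, 31, 32])  # same clicks, a month ago
```

With the same number of clicks, `fresh_news` is clearly larger than `old_news`, so a lunch menu or news item rises while it is being used and sinks again once the clicks stop.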

From generic to personalized search experience

This blog post has pinpointed a couple of solutions for a continuously-improving, generic relevance model in an Enterprise Findability solution. Obviously, generic models are generic, i.e. good enough for the many, not perfect for the few. There are great ways to address personalization solving many of the role-based challenges of Enterprise Findability, but let’s leave that to another, future blog post. Stay tuned!

Posted in Business, Findability | Tagged e-commerce actors, e-commerce sites, Enterprise Search, findability, findability solution, Index, Information retrieval, Metadata, PDF, Personalization, personalized search experience, real search behavior, regular search result, Relevance, search index, search logs, search result, search tool, Sweden, Web search engine | Leave a reply

European Conference on Information Retrieval (ECIR) 2011 in retrospect

Posted on April 27, 2011 by Svetoslav Marinov

The European Conference on Information Retrieval (ECIR) 2011 took place in Dublin last week, 18-21 April. In this blog post I will try to highlight some of the papers and talks from the conference that caught my attention and back this up with what other attendees said about them.

First, I was intrigued by the session on evaluation for IR and especially the topic of Crowdsourcing. In my opinion, the paper A Methodology for Evaluating Aggregated Search Results, which also got the prize for best student paper, was among the most pedagogically presented ones. It deals with the task of incorporating search results from a number of different sources, called verticals, into Web search results. By using a small number of human judgements for a given query, the authors present a way to evaluate any possible permutation of verticals in the result presentation. I think that this methodology should be adopted in the world of Enterprise search, since it is exactly there that we crawl, index and present information from a number of different sources – Web, databases, fileshares, etc. The prerequisites are really minimal and low cost but the return value, the user experience, seems quite high.

Amazon Mechanical Turk, or the Artificial Artificial Intelligence, which is the marketplace for Crowdsourcing, provides a way to perform evaluation, relevance assessment or any other task that requires human judgements for a ridiculously small sum of money. Leaving aside ethical issues, two papers in the conference presented ways to utilize this service for some IR tasks.

Evgeniy Gabrilovich from Yahoo! Research, who won the Karen Sparck Jones award for 2010, gave a very interesting keynote talk on Computational Advertising. Up to now, it had never struck me how hard advertising in Information Retrieval systems actually is. I liked one of his points on the future of Ads – by using product feeds, one can automatically create product descriptions via Text Summarization and Natural Language Generation and index them, thus avoiding bid words.

Another interesting and very pedagogically presented paper was about the gensim package by Radim Řehůřek. I definitely think we can use it in some of our projects. In general, text categorization and IR for social networks were the dominant tracks. In one of the social networks tracks, Oscar Täckström presented a neat way of discovering fine-grained sentiment where some coarse-grained supervision is available. It really made me want to try it for any of our customers where sentiment analysis is required.

Thorsten Joachims, the last of the keynote speakers, gave a very inspiring talk on The Value of User Feedback. He put forward the idea of designing retrieval systems for feedback. Instead of just looking at the clicklogs post factum, one can think of a system which uses the click feedback to learn, thus creating a better ranker for a given query and a given user need. In a single session, we can use click feedback to disambiguate the query and deliver results on the run which are of immediate benefit to the users.

Unfortunately, I have probably missed other interesting presentations, but with two parallel sessions and several workshops there was a limit to what I could devour. What surprised me, though, was that there were very few papers by the industry. We do try to solve exactly the same problems and tackle the same issues as academia. We, at Findwise, have constantly flagged the huge benefit of good, relevant Metadata for the task of achieving better search performance, which was also touched upon in the paper “Topic Classification in Social Media using Metadata from Hyperlinked Objects”.

It was really great to visit Dublin and attend ECIR 2011. It was an inspiring conference and I do believe that at the next ECIR we, from Findwise, can be on the podium, sharing our knowledge and hands-on experience on Enterprise search and IR.

Sláinte!

Posted in Enterprise Search, Findwise, Internet search, Relevancy, Research, Search, Technology, User Experience | Tagged Amazon, Artificial Artificial Intelligence, Document classification, Dublin, European Conference on Information Retrieval, Evgeniy Gabrilovich, hard advertising, Information retrieval, Information science, Metadata, Oscar Täckström, retrieval systems, Science, search performace, search results, social media, social network, Storage, Thorsten Joachims, Web search results, Yahoo | 1 Reply

Open Source Tools for Text Analytics

Posted on March 21, 2011 by Daniel Ling

Recently, both clients of Findwise and the Enterprise Search community in general have been showing increasing interest in text analytics in order to get a higher business value out of their (often large) volumes of unstructured information.

Text Analytics merges techniques from linguistics, computer science, machine learning and statistics, and many of the central algorithms in this field are publicly available as open source tools and packages with easily accessible APIs. While many customers of commercial Enterprise Search solutions, such as Autonomy, IBM Omnifind, Microsoft FAST ESP, etc., have long benefitted from some sort of Text Analytics (e.g. Entity Extraction, Keyword Extraction and document summarization), the open source components have now come a long way in providing alternative, free of charge solutions with similar performance and feature set.

As every modern enterprise search architecture today has some kind of document processing that is extensible by additional stages or APIs (for example the Open Pipeline with Solr or the pipeline that comes with Microsoft FAST) – the opportunity for plugging new text analytics stages into existing search implementations is open and ready for new innovation.

Among the most popular applications of text analytics that have emerged lately are customized entity extraction, sentiment analysis and document classification – each with a set of open source alternatives (such as Balie, OpenNLP and GATE) readily available for customization and implementation to your document processing.

Regardless of your industry domain, these techniques open up for a wide variety of new ways to interpret the content and discover new trends from your unstructured textual data – be it through sentiment analysis to support the decision making process, trend analysis or relevance model of search, or entity extraction in order to navigate your content by entities (such as company name or person), the enhancement of your texts by meta-data tagging or finding similar and related content.

How are you taking advantage of modern text analytics?

Posted in Data Processing, Open Pipeline, Open source, Search | Tagged Analytics, Apache Solr, Artificial intelligence, central algorithms, charge solutions, Computational linguistics, Data analysis, Data management, document processing, enterprise search architecture, Findwise, IBM, machine learning, Metadata, Microsoft, Named entity recognition, Natural Language Processing, Open Pipeline, Open source tools, Science, search implementations, Text analytics, Text mining | Leave a reply

If a Piece of Content is Never Read, Does it Really Exist?

Posted on December 10, 2010 by Mattias Brunnert

Since ancient times, information technology has developed from carvings in rock and wood to cell phones and Facebook. Still, the basic purpose remains the same: to facilitate communication between people separated by space and time. Therefore one can measure the success of any information tool along two axes: how easy it is to create information and how easy it is to consume it. Being a Findability expert, I spend a large part of my life focusing on the latter. Therefore it troubles me that so many organizations wait so long when they are introducing new content management systems before looking at search. If I had a nickel for every time I heard “we are currently busy with building our new intranet/web page/collaboration tool and will look at search when the project is finished” I would definitely have had a few quarters by now.

I like to say that I am in the information marketing business. What I mean by that is that Findability is all about marketing information so that the consumers, your employees, can find the piece of information they need. And just as an industrialist would not construct a factory before doing a marketing plan, you should not build a new information repository without thinking about how the content created in that repository will reach its target audience. When marketing information, search is one of your most important channels.

While an enterprise search solution can definitely smooth out imperfections in information structure and quality using intelligent algorithms, it really shines when you spend a little time thinking about how to make it easier for the search engine to deliver relevant results in a user-friendly way. Some questions you can ask yourself are:

  • How can we make tagging so convenient that we have good metadata for presenting and filtering results using facets? Many search solutions have automated tagging functionality to take load off users.
  • How can we use search as an integration platform to pull in content from other sources instead of making costly one-time integrations?
  • How will the new information repository fit into an existing search solution? For example, are we changing the metadata model, and how should the documents be ranked compared to other sources?
  • Should we migrate content from an old system to the new one, or just freeze information creation in the old one and have a search box that lets the user find information from both?
  • Can we use search to avoid creating duplicate information by encouraging users to make searches before typing new content or even doing implicit searches while the user is typing?

So does a piece of content that no one ever reads exist? Well, in terms of bits on a disk in a data center, yes, but in terms of business value, definitely not. Designing your information repository for Findability will yield great returns in improved efficiency and user satisfaction.

Posted in Findability, Governance, Search, Strategy | Tagged cell phones, content management, content management systems, Facebook, findability, Information retrieval, Information science, Information technology, intranet/web page/collaboration tool, Knowledge representation, Metadata, quality using intelligent algorithms, search box, search engine, search solution, search solutions, Tag, web design, Web search engine | 3 Replies

Apache Nutch Making Use of Open Pipeline

Posted on November 11, 2010 by Anders Rask

During the last couple of months I’ve been working on a project for Uppsala University. The project’s goal is to improve findability on the university web site. The solution that we are working on is based on Apache Nutch 1.1 in conjunction with Apache Solr 1.4. Nutch provides us with a robust web crawler that scales very well and also gives us a page rank for each page that we can use for relevance tuning. Besides the web information crawled by Nutch, the search application will also be used to search people and organizational information that we index from another source. I thought that I would share some details on how we are using Nutch.

We have made two extensions to Nutch. One is a parser plug-in that runs Open Pipeline embedded in it. This was an important extension, both to get better control of the information that we index to Solr and to be able to reuse our different Open Pipeline components. The main stages of the pipeline are the following:

  1. Extract the encoding of a web page
  2. Extract all links from a web page
  3. Extract all headings (hx) from a web page
  4. Remove all tags on a web page that don’t contain complete sentences
  5. Extract text and metadata from different types of documents with Tika
  6. Do some metadata mapping and cleaning
  7. Populate facets according to metadata and/or URL
  8. Do static URL ranking
  9. Replace certain common titles with the largest heading of the web page

The other extension we made to Nutch is an indexing filter that makes sure all our metadata fields are indexed to Solr.
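To make a few of the pipeline stages listed above more concrete, here is a minimal sketch of stages 3, 7 and 9 (heading extraction, URL-based facets and title replacement). It is written in plain Python for illustration only and is not the Open Pipeline API or our actual components. The names `GENERIC_TITLES`, `FACET_BY_PATH_PREFIX`, `HeadingExtractor` and `process` are hypothetical, as are the example titles and URL prefixes, and "largest heading" is assumed to mean the heading with the lowest level number.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical values for illustration; the real lists are not given in the post.
GENERIC_TITLES = {"", "untitled", "home", "start"}
FACET_BY_PATH_PREFIX = {"/research": "Research", "/education": "Education"}


class HeadingExtractor(HTMLParser):
    """Collects the <title> text and every h1-h6 heading as (level, text)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        is_heading = len(tag) == 2 and tag[0] == "h" and tag[1] in "123456"
        if tag == "title" or is_heading:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace so headings compare cleanly.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif text:
            self.headings.append((int(tag[1]), text))
        self._current = None


def process(url, html):
    """Returns a dict of fields that could be sent on to the index."""
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()

    # Stage 9: replace a generic title with the largest heading.
    # Assumption: "largest" means lowest level number, first one wins on ties.
    title = parser.title
    if title.lower() in GENERIC_TITLES and parser.headings:
        title = min(parser.headings, key=lambda h: h[0])[1]

    # Stage 7: populate facets from the URL path (prefix rules are made up).
    path = urlparse(url).path
    facets = [name for prefix, name in FACET_BY_PATH_PREFIX.items()
              if path.startswith(prefix)]

    return {
        "url": url,
        "title": title,
        "headings": [text for _, text in parser.headings],  # stage 3
        "facets": facets,
    }
```

Calling `process("http://www.example.org/research/groups", page_html)` on a page whose title is just "Home" would swap in the page's top-level heading as the title and tag the document with the "Research" facet. In a real setup each stage would be its own reusable component, which is the point of running a pipeline inside the parser plug-in.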

So far so good. The fetching, parsing and indexing work well now, and currently our largest challenge is tuning all the different relevance parameters we have, as well as harmonizing the relevance of web information with that of people and organizational information. I will have to get back to you on how that went!

Posted in Data Processing, Findability, Lucene, Open source, Solr | Tagged Apache HTTP Server, Apache Software Foundation, Apache Solr, Cross-platform software, Doug Cutting, findability, internet search engines, Knowledge representation, Lucene, Metadata, Nutch, Open Pipeline, search application, university web site, Uppsala University, web crawler, web information | 1 Reply
