Using log4j in Tomcat and Solr and How to Make a Customized File Appender

This article shows how to use log4j with both Tomcat and Solr. It also walks through the steps for writing your own customized log4j appender and using it in Tomcat and Solr.

Default Tomcat log mechanism

By default, Tomcat uses a customized version of the Java logging API. The configuration is located at ${tomcat_home}/conf/logging.properties. It follows the standard Java logging configuration syntax, plus one special tweak (handler names prefixed with a number) for identifying the logs of different web apps.

An example is below:

handlers = 1catalina.org.apache.juli.FileHandler, 2localhost.org.apache.juli.FileHandler, 3manager.org.apache.juli.FileHandler, 4host-manager.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

.handlers = 1catalina.org.apache.juli.FileHandler, java.util.logging.ConsoleHandler

1catalina.org.apache.juli.FileHandler.level = FINE

1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

1catalina.org.apache.juli.FileHandler.prefix = catalina.

2localhost.org.apache.juli.FileHandler.level = FINE

2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs

2localhost.org.apache.juli.FileHandler.prefix = localhost.

Default Solr log mechanism

Solr uses SLF4J for logging, which is a wrapper (facade) over other logging mechanisms. By default, Solr's SLF4J wraps the Java logging API, which means that the code looks as if it were logging through a log4j-style API while Java logging does the actual work underneath. It uses Tomcat's logging.properties as its configuration file. If you want to define your own, place it at ${tomcat_home}/webapps/solr/WEB-INF/classes/logging.properties.

Switching to Log4j

Log4j is a very popular logging framework, which I believe is mostly due to its simplicity in both configuration and usage. It has richer logging features than java logging and it is not difficult to make an extension.

Log4j for tomcat

  1. Rename/remove ${tomcat_home}/conf/logging.properties
  2. Add log4j.properties in ${tomcat_home}/lib
  3. Add log4j-xxx.jar in ${tomcat_home}/lib
  4. Download tomcat-juli-adapters.jar from extras and put it into ${tomcat_home}/lib
  5. Download tomcat-juli.jar from extras and replace the original version in ${tomcat_home}/bin

(The extras are additional jar files for special Tomcat installations. They can be found in the bin folder of a Tomcat download location, e.g. http://archive.apache.org/dist/tomcat/tomcat-6/v6.0.33/bin/extras/)

Log4j for solr

  1. Add log4j.properties in ${tomcat_home}/webapps/solr/WEB-INF/classes/ (create classes folder if not present)
  2. Replace slf4j-jdkxx-xxx.jar with slf4j-log4jxx-xxx.jar in ${tomcat_home}/webapps/solr/WEB-INF/lib (which switches the underlying implementation from Java logging to log4j)
  3. Add log4jxxx.jar to ${tomcat_home}/webapps/solr/WEB-INF/lib

Make our own log4j file appender

Log4j has two commonly used file appenders:

DailyRollingFileAppender – rolls over at a certain time interval

RollingFileAppender – rolls over at a certain size limit

I also found a nice customized file appender online: CustodianDailyRollingFileAppender.

I needed a file appender that rolls over at a certain time interval (each day), moves earlier logs to a backup folder and zips them, and removes logs older than a certain number of days. CustodianDailyRollingFileAppender already has the rollover feature, so I decided to start by making a copy of this class.

Parameters

Besides the default parameters in DailyRollingFileAppender, I need two more:

Outdir – backup directory

maxDaysToKeep – the number of days to keep the log file

You only need to define these 2 parameters in the new class, and add get/set methods for them (no constructor involved). The rest will be handled by log4j framework.
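As a rough illustration of why getters and setters are enough: log4j 1.x configures an appender by calling JavaBean setters whose names match the keys in log4j.properties. The sketch below imitates that idea with plain reflection. It is not log4j's own code, the class name BackupAppenderSettings is hypothetical, and the default values are assumptions. In the real appender, the two fields and their accessors would live in the copied CustodianDailyRollingFileAppender class.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for the custom appender; it only holds the two
// extra parameters so that the example runs without log4j on the classpath.
class BackupAppenderSettings {
    private String outdir = "backup"; // backup directory (assumed default)
    private int maxDaysToKeep = 7;    // days to keep logs (assumed default)

    public String getOutdir() { return outdir; }
    public void setOutdir(String outdir) { this.outdir = outdir; }

    public int getMaxDaysToKeep() { return maxDaysToKeep; }
    public void setMaxDaysToKeep(int maxDaysToKeep) { this.maxDaysToKeep = maxDaysToKeep; }

    // Simplified imitation of setter-based configuration: turn a property key
    // such as "MaxDaysToKeep" into "setMaxDaysToKeep" and call that setter.
    static void apply(Object bean, String key, String value) {
        String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
        for (Method m : bean.getClass().getMethods()) {
            if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                Class<?> type = m.getParameterTypes()[0];
                // Only String and int parameters are handled in this sketch.
                Object arg = (type == int.class) ? Integer.valueOf(value) : value;
                try {
                    m.invoke(bean, arg);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Could not set " + key, e);
                }
                return;
            }
        }
        throw new IllegalArgumentException("No setter for property " + key);
    }
}
```

Calling apply(settings, "MaxDaysToKeep", "10") has the same effect as a configuration line ending in .MaxDaysToKeep=10, which is why no constructor arguments are needed.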

Logging entry point

When a log event arrives, the subAppend(…) function is called. Inside it, the call super.subAppend(event); does the actual log writing. So before that call, we can add the backup and clean-up mechanism.

Clean up old log

Use a file filter to find all log files whose names start with the log filename, and delete those older than maxDaysToKeep days.
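The clean-up step could look roughly like the sketch below. The class and method names (OldLogCleaner, isExpired, deleteOldLogs) are made up for illustration, and using the last-modified time to judge a file's age is an assumption; the original appender may decide this differently.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.concurrent.TimeUnit;

class OldLogCleaner {
    // True when the file's last-modified time is more than maxDaysToKeep days ago.
    static boolean isExpired(long lastModifiedMillis, long nowMillis, int maxDaysToKeep) {
        return nowMillis - lastModifiedMillis > TimeUnit.DAYS.toMillis(maxDaysToKeep);
    }

    // Deletes expired files in dir whose names start with baseName.
    // Returns the number of files that were actually deleted.
    static int deleteOldLogs(File dir, final String baseName, final int maxDaysToKeep) {
        final long now = System.currentTimeMillis();
        File[] expired = dir.listFiles(new FileFilter() {
            @Override
            public boolean accept(File f) {
                return f.isFile()
                        && f.getName().startsWith(baseName)
                        && isExpired(f.lastModified(), now, maxDaysToKeep);
            }
        });
        if (expired == null) {
            return 0; // directory does not exist or cannot be read
        }
        int deleted = 0;
        for (File f : expired) {
            if (f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

The active log file also matches the name prefix, but it has just been written to, so the age check leaves it alone.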

Backup log

Make a separate thread that zips the log file and deletes the original afterwards. (I found CyclicBarrier very easy to use for waiting on a thread to complete its task, and a separate thread is preferable for avoiding file lock and access problems.) Start the thread at the point where the current log file needs to be rolled over to backup.
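A minimal sketch of the backup step, using only the JDK. The names LogBackup, zip and backupAndWait are hypothetical, and appending ".zip" to the rolled file name is an assumption. A CyclicBarrier with two parties lets the caller wait until the worker thread has finished zipping and deleting.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class LogBackup {
    // Writes source into a zip file at target, as a single entry.
    static void zip(File source, File target) throws IOException {
        try (InputStream in = new FileInputStream(source);
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream(target))) {
            out.putNextEntry(new ZipEntry(source.getName()));
            byte[] buffer = new byte[8192];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
            out.closeEntry();
        }
    }

    // Zips rolledLog into outDir on a worker thread, deletes the original on
    // success, and blocks until the worker has finished. Returns the zip file.
    static File backupAndWait(final File rolledLog, File outDir) {
        outDir.mkdirs();
        final File zipFile = new File(outDir, rolledLog.getName() + ".zip");
        final CyclicBarrier barrier = new CyclicBarrier(2); // worker + caller
        final IOException[] failure = new IOException[1];
        Thread worker = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    zip(rolledLog, zipFile);
                    if (!rolledLog.delete()) {
                        throw new IOException("Could not delete " + rolledLog);
                    }
                } catch (IOException e) {
                    failure[0] = e;
                } finally {
                    try {
                        barrier.await(); // signal the caller that we are done
                    } catch (InterruptedException | BrokenBarrierException ignored) {
                        // nothing left to do in the worker
                    }
                }
            }
        }, "log-backup");
        worker.start();
        try {
            barrier.await(); // wait for the worker to reach the barrier
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for backup", e);
        } catch (BrokenBarrierException e) {
            throw new IllegalStateException("Backup thread did not complete", e);
        }
        if (failure[0] != null) {
            throw new UncheckedIOException(failure[0]);
        }
        return zipFile;
    }
}
```

In the appender, a call like backupAndWait(rolledFile, new File(outdir)) would sit at the rollover point, just before the new log file is opened.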

Deploy the customized file appender

Let’s say we build a new jar called log4jxxappender.jar. We can deploy the appender by copying the jar file to ${tomcat_home}/lib and to ${tomcat_home}/webapps/solr/WEB-INF/lib.

Example configuration for Solr:

log4j.rootLogger=INFO, solrlog

log4j.appender.solrlog=com.findwise.xx.log4j.fileappender.YyRollingFileAppender

log4j.appender.solrlog.File=${catalina.home}/logs/solr.log

log4j.appender.solrlog.Append=true

log4j.appender.solrlog.Encoding=UTF-8

log4j.appender.solrlog.DatePattern='.'yyyy-MM-dd

log4j.appender.solrlog.MaxDaysToKeep=10

log4j.appender.solrlog.Outdir=${catalina.base}/logs/backup

log4j.appender.solrlog.layout=org.apache.log4j.PatternLayout

log4j.appender.solrlog.layout.ConversionPattern = %d [%t] %-5p %c - %m%n

Solr.war

The last thing to remember about Solr is to zip the deployment folder ${tomcat_home}/webapps/solr and rename the zip file from solr.zip to solr.war. Now you should have a log4j-enabled solr.war file with your customized file appender.

How to Index and Search XML Content in Solr

Indexing XML Content

In solr, there is an xml update request handler which can be used to update xml formatted data.

For example,

<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>

However, when a field itself should contain XML-formatted data, the XML update handler fails to import it. The handler parses the import data with an XML parser and reads the direct child text under the ‘field’ node, which is empty if the field’s direct child is an XML tag.

What we can do is to use json update handler. For example:

[
  {
    "id" : "MyTestDocument",
    "title" : "<root p=\"cc\">test \\ node</root>"
  }
]

There are two things to notice:

  1. Both " and \ characters should be escaped
  2. The xml content should be kept as a single line
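Both rules can be applied in code before the JSON file is written. The helper below is a sketch; the class name JsonXmlField is made up, the field names id and title are taken from the example above, and replacing line breaks with spaces is one assumed way of keeping the XML on a single line.

```java
class JsonXmlField {
    // Escapes text for use inside a JSON string literal and folds line
    // breaks into spaces so that the XML stays on a single line.
    static String escape(String xml) {
        StringBuilder sb = new StringBuilder(xml.length() + 16);
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                case '\n':
                case '\r':
                    sb.append(' ');
                    break;
                case '\t':
                    sb.append("\\t");
                    break;
                default:
                    if (c < 0x20) {
                        // remaining control characters as JSON unicode escapes
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds a one-document JSON array in the shape used by the example above.
    static String toJsonDoc(String id, String xml) {
        return "[{\"id\":\"" + escape(id) + "\",\"title\":\"" + escape(xml) + "\"}]";
    }
}
```

A JSON library would do the escaping as well; the manual version only makes the two rules visible.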

JSON import data can be loaded into Solr with the curl command:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Or, by using solrj:

// Send a JSON file to the JSON update handler (SolrJ 3.x style API).
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file); // file is the java.io.File holding the JSON documents
NamedList<Object> result = server.request(csureq);
// A status of 0 in the response header means the request succeeded.
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
Integer status = (Integer) responseHeader.get("status");

Stripping out xml tags in Schema definition

When querying xml content, we most likely will not be interested in xml tags. So we need to strip out xml tags before indexing the xml text. We can do that by applying HTMLStripCharFilter to the xml content.

            <analyzer type="index">
                ...
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>
            <analyzer type="query">
                ...
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>

Search XML Content

XML content search does not differ much from text content search. However, searching for XML attributes requires a special tweak.

The HTMLStripCharFilter mentioned earlier filters out all XML tags, including their attributes. To index attributes, we need a way to make HTMLStripCharFilter keep the attribute text.

For example, if the original XML content is the following:

<sample attr="key_o2_4">find it </sample>

then after applying HTMLStripCharFilter we want to have:

key_o2_4    find it

One way to do this is to add auxiliary XML processing instructions to the original XML content, such as:

<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>

Then apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition.

<analyzer type="index">
...
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;\?solr ([A-Za-z0-9_-]*)\?&gt;" replacement="       $1  " maxBlockChars="10000000"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
...
</analyzer>

This replaces <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces, which keeps the original offsets.

With this technique, we can do a search on attr attribute and get a hit.
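The offset-preserving replacement can be tried out in plain Java before touching the schema. The sketch below uses java.util.regex with the literal question marks escaped and a character class that also admits lower-case letters; both are assumptions made so that the sample key key_o2_4 matches. The class name AttributeMarker is hypothetical.

```java
import java.util.regex.Pattern;

class AttributeMarker {
    // The marker pattern, written without XML entity encoding.
    static final Pattern MARKER = Pattern.compile("<\\?solr ([A-Za-z0-9_-]*)\\?>");

    // Replaces each "<?solr KEY?>" with 7 spaces + KEY + 2 spaces, which has
    // exactly the same length as the marker it replaces.
    static String expand(String xml) {
        return MARKER.matcher(xml).replaceAll("       $1  ");
    }
}
```

Because "<?solr " is 7 characters long and "?>" is 2, the text after the marker stays at the same character position, so highlighting offsets still line up with the original content.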

ExternalFileField in Solr

Sometimes we want to update the document values in one indexed field more often than in other fields. A good solution to this is to use the field type ExternalFileField. The ExternalFileField gets its values from an external file instead of the index. Such a file can easily be changed, and the field is updated after a commit. Hence no documents need to be re-indexed. A field that has ExternalFileField as its type is not searchable. The field may currently only be used as a ValueSource in a FunctionQuery.

The external file contains keys and values:

key1=value1
key2=value2

The keys don’t need to be unique.

The name of the external file must be external_<fieldname> or external_<fieldname>.* and must be placed in the index directory.
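Generating the file is simple enough to automate. The sketch below writes key=value lines to external_<fieldname> in a given directory. The class name ExternalFileWriter and the temporary file name are made up for illustration, and writing to a temporary file first is a design choice of this sketch, not something Solr requires.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Map;

class ExternalFileWriter {
    // Renders one "key=value" line per entry, in the map's iteration order.
    static String render(Map<String, Float> values) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : values.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Writes external_<fieldName> into indexDir. The content goes to a
    // temporary file first and is then moved into place. The temporary name
    // deliberately does not start with "external_".
    static File write(File indexDir, String fieldName, Map<String, Float> values)
            throws IOException {
        File target = new File(indexDir, "external_" + fieldName);
        File tmp = new File(indexDir, fieldName + ".external.tmp");
        Files.write(tmp.toPath(), render(values).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp.toPath(), target.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

A nightly job could call write(indexDir, "popularity", visitCounts), where popularity is a hypothetical field name, and then send a commit to Solr so that the new values are picked up.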

A new field type of the class ExternalFileField, and a field of that type, must be added to schema.xml.

<fieldType name="file" class="solr.ExternalFileField"
           keyField="keyField" defVal="1" indexed="false"
           stored="false" valType="float" />

<field name="<fieldname>" type="file" />

keyField is the field that contains the keys and <fieldname> contains the values from the external file.

valType defines the value type of the field.

At Findwise we have used this method for a customer where we wanted to show the most visited pages higher up in the search result. These statistics are changing daily for a lot of pages and we don’t want to re-index all these pages every day.

Development Techniques for Solr: Structure First or Structure Last?

I’d like to share two different development techniques for Solr that I commonly use when setting up an Apache Solr project. To explain them, I’ll start by introducing the way I used to work. (The wrong way ;) )

Development Techniques for Solr: The Structure First

Since I work as an enterprise search consultant I come across a lot of different data sources. All of these data sources have at least some structure, some more than others.

My objective as a backend developer was then to first of all figure out how the data source was structured and then design a Solr schema that fit the requirements, both technical and business.

The problem with this was of course that the requirements were quite fuzzy until I actually figured out how the data was structured and even more importantly what the data quality was.

In many cases I would spend a lot of time on extracting a date from the source, converting that to an ISO 8601 date format (Supported by Solr), updating the schema with that field and then finally reindexing. Only to learn that the date was either not required or had too poor data quality to be used.

My point being that I spent a lot of time designing a schema (and connector) for a source which I, and most others, knew almost nothing about.

Development Techniques for Solr: The Structure Last

Ok so what’s the supposed “right way” of doing this?

In Solr there is a concept called dynamic fields. It allows you to map fields whose names match a certain pattern to a specific type. In the example Solr schema you can find the following section:

<!-- uncomment the following to ignore any fields that don't already match an existing
     field name or dynamic field, rather than reporting them as an error.
     alternately, change the type="ignored" to some other type e.g. "text" if you want
     unknown fields indexed and/or stored by default -->
<!--dynamicField name="*" type="ignored" multiValued="true" /-->

When uncommented, the section above drops any fields that are not explicitly declared in the schema. But what I usually do to start with is the complete opposite: I map all fields to a string type.

<dynamicField name="*" type="string" multiValued="true" indexed="true" stored="true"/>

I start with a minimalist schema that only has an id field and the above stated dynamic field.

With this schema it doesn’t matter what I do, everything is mapped to a string field, exactly as it is entered.

This allows me to focus on getting the data into Solr without caring about what to name the fields or what properties they should have and, most importantly, without even having to declare them at all.

Instead I can focus on getting the data out of the source system and then into Solr. When that’s done I can use Solr’s schema browser to see which fields are high quality, contain a lot of text or are suited to be used as facets, and use this information to help out in the requirements process.

The Structure Last Technique lets you be more pragmatic about your requirements.

Solr Processing Pipeline

Hi again Internet,

For once I have had time to do some thinking. Why is there no powerful data processing layer between the Lucene Connector Framework and Solr? I’ve been looking into the Apache Commons Processing Pipeline. It seems like a likely candidate to do some cool stuff. Look at the diagram below.

A schematic drawing of a Solr Pipeline concept.

What I’m thinking of is to make a transparent Solr processing pipeline that speaks the Solr REST protocol on each end. This means that you would be able to use SolrJ or any other API to communicate with the Pipeline.

Has anyone attempted this before? If you’re interested in chatting about the pipeline drop me a mail or just grab me at Eurocon in Prague this year.

Solr – the Sunny Side of Search

When I started working for Findwise two years ago, Apache Solr was one of those no-name search platforms. We could barely get our customers to consider Solr even after proving that the platform would be a perfect match for their business needs. As time passed and the financial crisis hit the world, a few of our customers started considering Solr, but then usually for the reason that it was “free” – not for the functionality of the platform.

Things have changed. More and more companies now offer support and training for Solr. It seems that the platform is gaining momentum on the enterprise market. In fact, I was just in Oslo, Norway to become a certified Lucid Imagination training partner, as the need for training is growing rapidly, even up here in the snow-covered Nordics.

Today we even have customers approaching us asking questions about how, and not if, they should use Solr. I wouldn’t have imagined that two years ago …

Could this be the year that Solr goes head to head with the large enterprise search platforms? And where will we be in another two years? I wish I knew.

Faceted Search by LinkedIn

My RSS feeds have been buzzing about the LinkedIn faceted search since it was first released from beta in December. So why is the new search at LinkedIn so interesting that people are almost constantly discussing it? I think it’s partly because LinkedIn is a site that is used by most professionals and searching for people is core functionality on LinkedIn. But the search interface on LinkedIn is also a very good example of faceted search.

I decided to have a closer look into their search. The first thing I realized was just how many different kinds of searches there are on LinkedIn. Not only the obvious people search but also, job, news, forum, group, company, address book, answers and reference search. LinkedIn has managed to integrate search so that it’s the natural way of finding information on the site. People search is the most prominent search functionality but not the only one.

I’ve seen several different people search implementations and they often have a tendency to work more or less like phone books. If you know the name you type it and get the number. And if you’re lucky you can also get the name if you only have the number. There is seldom any way to search for people with a certain competence or from a geographic area. LinkedIn sets a good example of how searching for people could and should work.

LinkedIn has taken careful consideration of their users: what information they are looking for, how they want it presented and how they need to filter searches in order to find the right people. The details that I personally like are the possibility to search within filters for matching options (I worked on a similar solution last year) and how different filters are displayed (or at least in a different order) depending on what query the user types. If you want to know more about how the faceted search at LinkedIn was designed, check out the blog post by Sara Alpern.

But LinkedIn is not only interesting because of the good search experience. It’s also interesting from a technical perspective. The LinkedIn search is built on open source, so they have developed everything themselves. For those of you interested in the technology behind the new LinkedIn search, I recommend “LinkedIn search a look beneath the hood” by Daniel Tunkelang, where he links to a presentation by John Wang, search architect at LinkedIn.