How to Index and Search XML Content in Solr

Indexing XML Content

Solr provides an XML update request handler that accepts documents formatted as XML.

For example,

<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>

However, when a field itself must contain XML-formatted data, the XML update handler fails to import it. The handler parses the import data with an XML parser and reads the text directly under the 'field' node, which is empty when the field's direct child is an XML tag.

What we can do instead is use the JSON update handler. For example:

[
  {
    "id" : "MyTestDocument",
    "title" : "<root p=\"cc\">test \\ node</root>"
  }
]

There are two things to notice:

  1. Both " and \ characters must be escaped
  2. The XML content must be kept on a single line
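
The two rules above can be sketched as a small helper. This is a minimal illustration, not part of Solr or SolrJ: the class name XmlJsonField and the method toJsonString are hypothetical, and collapsing line breaks into a single space is just one assumed way to keep the value on one line.

```java
final class XmlJsonField {

    // Hypothetical helper: wraps raw XML in a JSON string literal.
    static String toJsonString(String xml) {
        StringBuilder out = new StringBuilder(xml.length() + 16);
        out.append('"');
        for (int i = 0; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (c == '\\') {
                out.append("\\\\");            // rule 1: escape the backslash
            } else if (c == '"') {
                out.append("\\\"");            // rule 1: escape the double quote
            } else if (c == '\r' || c == '\n') {
                // rule 2: keep the value on one line (assumption: a run of
                // line breaks becomes one space unless a space is already there)
                if (out.charAt(out.length() - 1) != ' ') {
                    out.append(' ');
                }
            } else if (c < 0x20) {
                // other control characters are not allowed raw in JSON strings
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        out.append('"');
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<root p=\"cc\">test \\ node</root>";
        System.out.println("[ { \"id\" : \"MyTestDocument\", \"title\" : "
                + toJsonString(xml) + " } ]");
    }
}
```

In practice a JSON library would normally do this escaping; the hand-written loop is only meant to make the two rules visible.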

JSON import data can be loaded into Solr with the curl command:

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

Or, by using solrj:

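import java.io.File;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

// serverpath is the Solr base URL; file is the JSON import file
CommonsHttpSolrServer server = new CommonsHttpSolrServer(serverpath);
server.setMaxRetries(1);
// Send the file to the JSON update handler
ContentStreamUpdateRequest csureq = new ContentStreamUpdateRequest("/update/json");
csureq.addFile(file);
NamedList<Object> result = server.request(csureq);
NamedList<Object> responseHeader = (NamedList<Object>) result.get("responseHeader");
// A status of 0 indicates success
Integer status = (Integer) responseHeader.get("status");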

Stripping Out XML Tags in the Schema Definition

When querying XML content, we are usually not interested in the XML tags themselves, so we need to strip them out before indexing the XML text. We can do that by applying HTMLStripCharFilter to the XML content.

            <analyzer type="index">
                ...
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>
            <analyzer type="query">
                ...
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                ...
            </analyzer>

Search XML Content

Searching XML content does not differ much from searching plain text. However, searching on XML attributes requires a special tweak.

The HTMLStripCharFilter mentioned earlier filters out all XML tags, including their attributes. To index attributes, we need a way to make the attribute text survive HTMLStripCharFilter.

For example, suppose the original XML content is:

<sample attr="key_o2_4">find it </sample>

After applying HTMLStripCharFilter, we want to have:

key_o2_4    find it

One way to do this is to add auxiliary XML processing instructions to the original XML content, such as:

<sample attr="key_o2_4"><?solr key_o2_4?>find it</sample>

Then apply solr.PatternReplaceCharFilterFactory to it, as shown in the following schema fieldType definition.

<analyzer type="index">
...
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&lt;\?solr ([A-Za-z0-9_-]*)\?&gt;" replacement="       $1  " maxBlockChars="10000000"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
...
</analyzer>

This replaces <?solr key_o2_4?> with 7 leading spaces + key_o2_4 + 2 trailing spaces, which keeps the original character offsets: the 7 spaces stand in for "<?solr " and the 2 spaces for "?>".

With this technique, we can do a search on attr attribute and get a hit.
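
The offset-preserving replacement can be checked outside Solr with plain java.util.regex. This is only a sketch of the idea, not Solr's char filter itself: the class name SolrInstructionReplace and the method expose are hypothetical, and the character class is assumed to include lowercase letters so that the sample key key_o2_4 matches.

```java
import java.util.regex.Pattern;

final class SolrInstructionReplace {

    // Matches <?solr KEY?>; the literal '?' characters are escaped.
    // Assumption: lowercase letters are allowed in the key.
    static final Pattern SOLR_PI =
            Pattern.compile("<\\?solr ([A-Za-z0-9_-]*)\\?>");

    // 7 spaces replace "<?solr " and 2 spaces replace "?>",
    // so the output has the same length as the input.
    static final String REPLACEMENT = "       $1  ";

    static String expose(String xml) {
        return SOLR_PI.matcher(xml).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String in = "<sample attr=\"key_o2_4\"><?solr key_o2_4?>find it</sample>";
        String out = expose(in);
        System.out.println(out);
        System.out.println("same length: " + (in.length() == out.length()));
    }
}
```

Because the key stays at the same character position, anything that relies on offsets into the original text (highlighting, for example) should still line up after the replacement.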

2 thoughts on “How to Index and Search XML Content in Solr”

    • Here is what I just tried:
      1. Download the Solr 4.2 binary.
      2. Go to solr/example and run "java -jar start.jar" to start Solr.
      3. Open a browser and check that you can access http://localhost:8983/solr
      4. Open Cygwin (if you use MS-DOS, you will need a different curl format, perhaps with double quotes) and run the curl command copied from wiki.apache.org/solr/UpdateJSON/. The command is: curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'
      5. You will get a response back:
      % Total % Received % Xferd Average Speed Time Time Time Current
      Dload Upload Total Spent Left Speed
      100 1243 0 44 100 1199 60 1635 --:--:-- --:--:-- --:--:-- 1830
      {"responseHeader":{"status":0,"QTime":575}}
      6. Enter http://localhost:8983/solr/#/collection1/query in the browser; you should see 4 results.
