Extract the content from the documents and make it searchable with solr 4

solr can be easily extended to handle binary files and extract the information from them. Apache Tika library is used for the file analysis.

This can be set up in solr by using Extracting Request Handler that is already set up in solrconfig.xml. All we need to do is to add extra libraries. Once you have your solr set up, copy:

contrib/extraction/lib/* (from the downloaded solr package) into /var/lib/tomcat6/webapps/solr/WEB-INF/lib
solr/apache-solr-4.0.0-BETA/dist/apache-solr-cell-4.0.0-BETA.jar into /var/lib/tomcat6/webapps/solr/WEB-INF/lib

Restart tomcat and index sample document (I’m using test.pdf that I have in the same, current directory). Handler is available at update/extract:

curl "http://localhost:8080/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@test.pdf"

I have provided an unique id for the document by passing literal.id=doc1 option, to index second document I’d use:

curl "http://localhost:8080/solr/update/extract?literal.id=doc2&commit=true" -F "myfile=@test.pdf"

That’s all – here is the result of executing “*” query:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">4</int><lst name="params"><str name="q">*</str><str name="wt">xml</str></lst></lst><result name="response" numFound="2" start="0"><doc><str name="id">doc1</str>
<str name="author">Tomasz Muras</str>
<str name="author_s">Tomasz Muras</str><arr name="content_type">
<str>application/pdf</str></arr>
<arr name="content">
<str>Test Test Test</str></arr>
<long name="_version_">1413927282391646208</long></doc>
<doc><str name="id">doc2</str>
<str name="author">Tomasz Muras</str>
<str name="author_s">Tomasz Muras</str>
<arr name="content_type"><str>application/pdf</str></arr>
<arr name="content"><str>Test Test Test</str></arr>
<long name="_version_">1413934423130243072</long></doc>
</result>
</response>