Solr can easily be extended to handle binary files and extract information from them. The Apache Tika library is used for the file analysis.
This is done in Solr through the Extracting Request Handler, which is already configured in solrconfig.xml. All we need to do is add some extra libraries. Once you have Solr set up, copy:
contrib/extraction/lib/* (from the downloaded solr package) into /var/lib/tomcat6/webapps/solr/WEB-INF/lib
solr/apache-solr-4.0.0-BETA/dist/apache-solr-cell-4.0.0-BETA.jar into /var/lib/tomcat6/webapps/solr/WEB-INF/lib
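The two copy steps above can also be scripted. This is only a sketch: the helper name `copy_jars` is made up for illustration, and the paths you pass in depend on where you unpacked the Solr package and where Tomcat deployed the webapp.

```python
import glob
import os
import shutil


def copy_jars(patterns, dest):
    """Copy every regular file matching the glob patterns into dest.

    Returns the sorted list of copied file names.
    """
    copied = []
    for pattern in patterns:
        for src in glob.glob(pattern):
            if os.path.isfile(src):  # skip sub-directories
                shutil.copy2(src, dest)
                copied.append(os.path.basename(src))
    return sorted(copied)
```

With the paths from this post, the call would be `copy_jars(["contrib/extraction/lib/*", "solr/apache-solr-4.0.0-BETA/dist/apache-solr-cell-4.0.0-BETA.jar"], "/var/lib/tomcat6/webapps/solr/WEB-INF/lib")`, run with enough permissions to write into the Tomcat directory.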
Restart Tomcat and index a sample document (I’m using test.pdf, which sits in my current directory). The handler is available at update/extract:
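As a minimal sketch, the indexing request could look like this in Python. It assumes Solr is reachable at http://localhost:8080/solr (Tomcat's default port, which may differ on your machine) and sends the raw file as the request body with a guessed content type. The names `extract_url` and `index_file` are mine, not part of Solr.

```python
import mimetypes
import urllib.parse
import urllib.request

# Assumption: Tomcat listens on its default port 8080 and the webapp is "solr".
SOLR_URL = "http://localhost:8080/solr"


def extract_url(doc_id, commit=True, base=SOLR_URL):
    """Build the update/extract URL, passing the document id as literal.id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"  # make the document searchable right away
    return base + "/update/extract?" + urllib.parse.urlencode(params)


def index_file(path, doc_id):
    """POST the raw file to the extracting handler and return the response text."""
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        extract_url(doc_id), data=body, headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `index_file("test.pdf", "doc1")` from the directory that holds test.pdf sends the document to Solr under the id doc1.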
I provided a unique id for the document by passing the literal.id=doc1 option. To index a second document I’d use:
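The second request differs only in the value of literal.id. Reusing an id would replace the document indexed earlier, so every file needs its own. The sketch below gives each file its own id; the doc1, doc2, ... numbering scheme, the file name second.pdf and the base URL are assumptions for illustration.

```python
import urllib.parse

# Assumption: same Solr location as before (Tomcat default port 8080).
SOLR_URL = "http://localhost:8080/solr"


def extract_urls(paths, prefix="doc", start=1):
    """Map each file path to an update/extract URL with its own literal.id."""
    urls = {}
    for number, path in enumerate(paths, start=start):
        query = urllib.parse.urlencode(
            {"literal.id": f"{prefix}{number}", "commit": "true"}
        )
        urls[path] = f"{SOLR_URL}/update/extract?{query}"
    return urls
```

For example, `extract_urls(["test.pdf", "second.pdf"])` yields one URL ending in `literal.id=doc1&commit=true` and one ending in `literal.id=doc2&commit=true`.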
That’s all. Here is the result of executing the “*” query:
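To run such a query from a script, something along these lines should work. It is a sketch under a few assumptions: the same base URL as before, the standard select handler, JSON output via wt=json, and the usual match-all syntax `*:*` as the default query. The function names are mine.

```python
import json
import urllib.parse
import urllib.request

# Assumption: same Solr location as before (Tomcat default port 8080).
SOLR_URL = "http://localhost:8080/solr"


def select_url(query="*:*", rows=10, base=SOLR_URL):
    """Build a select URL that asks for JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "/select?" + urllib.parse.urlencode(params)


def search(query="*:*"):
    """Run the query and return the list of matching documents."""
    with urllib.request.urlopen(select_url(query)) as response:
        return json.load(response)["response"]["docs"]
```

Each entry returned by `search()` is a dictionary of the stored fields for one indexed document, including the id passed in literal.id.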