Apache Tika

Apache Tika extracts the text from documents in many formats. It wraps a number of other libraries that were developed to parse specific document formats (such as Apache POI). You can download Apache Tika from:

http://tika.apache.org/download.html

However, Apache Tika version 0.8 has a bug when parsing PDFs: the extracted text has no line breaks except at the end of each page. This has recently been fixed in the source repository, and a 0.81 release may come out with the fix. Until then, I recommend building the latest version of Apache Tika from SVN (or using Apache Tika 0.7, although that older version has different issues).

http://tika.apache.org/source-repository.html

The tika-app jar file may be executed to convert a document to text. For example:

java -jar tika-app.jar --text FILE_TO_CONVERT
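
The same command can also be launched from a Java program and its output captured as a string. The following is only a minimal sketch, not part of Tika: the class name TikaTextCommand and its methods are hypothetical, the jar path and input file are supplied by the caller, and it assumes Java 9 or later and that tika-app writes the text to standard output in the platform default encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper (not part of Tika) that shells out to the tika-app jar.
class TikaTextCommand {

    // Builds the command line shown above: java -jar <jar> --text <file>
    static List<String> build(String jarPath, String inputPath) {
        return List.of("java", "-jar", jarPath, "--text", inputPath);
    }

    // Runs the command and returns whatever it wrote to standard output.
    static String extractText(String jarPath, String inputPath)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(build(jarPath, inputPath));
        // Pass the child's error stream through so it does not mix with the text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        String text;
        try (InputStream out = process.getInputStream()) {
            // Assumption: the output uses the platform default encoding.
            text = new String(out.readAllBytes(), Charset.defaultCharset());
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java TikaTextCommand TIKA_APP_JAR FILE_TO_CONVERT");
            return;
        }
        System.out.print(extractText(args[0], args[1]));
    }
}
```

Starting a new JVM for every document is slow, so this approach mainly suits occasional conversions or quick scripts.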

The tika-app jar can also be linked against to convert documents programmatically. The tika-app itself is an excellent example of how to do this (although a bit complicated). The file to look at is "TikaCLI.java".