.----.
____ __\\\\\\__
\___'--" .-.
/___>------rtika '0'
/____,--. \----B
"".____ ___-"
// / /
]/
Apache Tika is similar to the Babel fish in Douglas Adam’s book, “The Hitchhikers’ Guide to the Galaxy” (C. Mattmann and Zitting 2011, 3). The Babel fish translates any natural language to any other. While Apache Tika does not yet translate natural languages, it starts to tame the tower of babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows a computer to extract text and objects from Microsoft Word.
The world of digital file formats is like a place where each community has their own language. Academic, business, government, and online communities use anywhere from a few file types to thousands. Unfortunately, attempts to unify groups around a single format are often fruitless (C. A. Mattmann 2013).
This plethora of document formats has become a common concern. Tika is a common library to address this issue. Starting in Apache Nutch in 2005, Tika became its own project in 2007 and then a component of other Apache projects including Lucene, Jackrabbit, Mahout, and Solr (C. Mattmann and Zitting 2011, 17).
With the increased volume of data in digital archives, and terabyte
sized data becoming common, Tika’s design goals include keeping
complexity at bay, low memory consumption, and fast processing (C. Mattmann and Zitting 2011, 18). The
rtika
package is an interface to Apache Tika that leverages
Tika’s batch processor module to parse many documents fairly
efficiently. Therefore, I recommend using batches whenever possible.
Video, sound and images are important, and yet much meaningful data remains numeric or textual. Tika can parse many formats and extract alpha-numeric characters, along with a few characters to control the arrangement of text, like line breaks.
I recommend an analyst start with a directory on the computer and get
a vector of paths to each file using base::list.files()
.
The commented code below has a recipe. Here, I use test files that are
included with the package.
library('rtika')
library('magrittr')
# Code to get ALL the files in my_path:
# my_path <- "~"
# batch <- file.path(my_path,
# list.files(path = my_path,
# recursive = TRUE))
# pipe the batch into tika_text()
# to get plain text
# test files
batch <- c(
system.file("extdata", "jsonlite.pdf", package = "rtika"),
system.file("extdata", "curl.pdf", package = "rtika"),
system.file("extdata", "table.docx", package = "rtika"),
system.file("extdata", "xml2.pdf", package = "rtika"),
system.file("extdata", "R-FAQ.html", package = "rtika"),
system.file("extdata", "calculator.jpg", package = "rtika"),
system.file("extdata", "tika.apache.org.zip", package = "rtika")
)
text <-
batch %>%
tika_text()
# normal syntax also works:
# text <- tika_text(batch)
The output is a R character vector of the same length and order as the input files.
In the example above, there are several seconds of overhead to start up the Tika batch processor and then process the output. The most costly file was the first one. Large batches are parsed more quickly. For example, when parsing thousands of 1-5 page Word documents, I’ve measured 1/100th of a second per document on average.
Occasionally, files are not parsable and the returned value for the
file will be NA
. The reasons include corrupt files, disk
input/output issues, empty files, password protection, a unhandled
format, the document structure is broken, or the document has an
unexpected variation.
These issues should be rare. Tika works well on most documents, but if an archive is very large there may be a small percentage of unparsable files, and you might want to handle those.
Plain text is easy to search using base::grep()
.
With plain text, a variety of interesting analyses are possible,
ranging from word counting to constructing matrices for deep learning.
Much of this text processing is handled easily with the well documented
tidytext
package (Silge and Robinson
2017). Among other things, it handles tokenization and creating
term-document matrices.
A general suggestion is to use tika_fetch()
when
downloading files from the Internet, to preserve the server Content-Type
information in a file extension.
Tika’s Content-Type detection is improved with file extensions (Tika
also relies on other features such as Magic bytes, which are unique
control bytes in the file header). The tika_fetch()
function tries to preserves Content-Type information from the download
server by finding the matching extension in Tika’s database.
download_directory <- tempfile('rtika_')
dir.create(download_directory)
urls <- c('https://tika.apache.org/',
'https://cran.rstudio.com/web/packages/keras/keras.pdf')
downloaded <-
urls %>%
tika_fetch(download_directory)
# it will add the appropriate file extension to the downloads
downloaded
This tika_fetch()
function is used internally by the
tika()
functions when processing URLs. By using
tika_fetch()
explicitly with a specified directory, you can
also save the files and return to them later.
Large jobs are possible with rtika
. However, with
hundreds of thousands of documents, the R object returned by the
tika()
functions can be too big for RAM. In such cases, it
is good to use the computer’s disk more, since running out of RAM slows
the computer.
I suggest changing two parameters in any of the tika()
parsers. First, set return = FALSE
to prevent returning a
big R character vector of text. Second, specify an existing directory on
the file system using output_dir
, pointing to where the
processed files will be saved. The files can be dealt with in smaller
batches later on.
Another option is to increase the number of threads, setting
threads
to something like the number of processors minus
one.
# create a directory not already in use.
my_directory <-
tempfile('rtika_')
dir.create(my_directory)
# pipe the batch to tika_text()
batch %>%
tika_text(threads = 4,
return = FALSE,
output_dir = my_directory)
# list all the file locations
processed_files <- file.path(
normalizePath(my_directory),
list.files(path = my_directory,
recursive = TRUE)
)
The location of each file in output_dir
follows a
convention from the Apache Tika batch processor: the full path to each
file mirrors the original file’s path, only within the
output_dir
.
Note that tika_text()
produces .txt
files,
tika_xml()
produces .xml
files,
tika_html()
produces .html
files, and both
tika_json()
and tika_json_text()
produce
.json
files.
Plain text falls short for some purposes. For example, pagination
might be important for selecting a particular page in a PDF. The Tika
authors chose HTML as a universal format because it offers semantic
elements that are common or familiar. For example, the hyperlink is
represented in HTML as the anchor element <a>
with
the attribute href
. The HTML in Tika preserves this
metadata:
library('xml2')
# get XHTML text
html <-
batch %>%
tika_html() %>%
lapply(xml2::read_html)
# parse links from documents
links <-
html %>%
lapply(xml2::xml_find_all, '//a') %>%
lapply(xml2::xml_attr, 'href')
sample(links[[1]],10)
Each type of file has different information preserved by Tika’s internal parsers. The particular aspects vary. Some notes:
<div class="page">
.<a>
with the attribute href
.<table>
element. The rvest
package has a
function to get tables of data with
rvest::html_table()
.Note that tika_html()
and tika_xml()
both
produce the same strict form of HTML called XHTML, and either works
essentially the same for all the documents I’ve tried.
The tika_html()
and tika_xml()
functions
are focused on extracting strict, structured HTML as XHTML. In addition,
metadata can be accessed in the meta
tags of the XHTML.
Common metadata fields include Content-Type
,
Content-Length
, Creation-Date
, and
Content-Encoding
.
Metadata can also accessed with tika_json()
and
tika_json_text()
. Consider all that can be found from a
single image:
library('jsonlite')
# batch <- system.file("extdata", "calculator.jpg", package = "rtika")
# a list of data.frames
metadata <-
batch %>%
tika_json() %>%
lapply(jsonlite::fromJSON)
# look at metadata for an image
str(metadata[[6]])
In addition, each specific format can have its own specialized metadata fields. For example, photos sometimes store latitude and longitude:
Some types of documents can have multiple objects within them. For
example, a .gzip
file may contain many other files. The
tika_json()
and tika_json_text()
functions
have a special ability that others do not. They will recurse into a
container and examine each file within. The Tika authors call the format
jsonRecursive
for this reason.
In the following example, I created a compressed archive of the
Apache Tika homepage, using the command line programs wget
and zip
. The small archive includes the HTML page, its
images, and required files.
# wget gets a webpage and other files.
# sys::exec_wait('wget', c('--page-requisites', 'https://tika.apache.org/'))
# Put it all into a .zip file
# sys::exec_wait('zip', c('-r', 'tika.apache.org.zip' ,'tika.apache.org'))
batch <- system.file("extdata", "tika.apache.org.zip", package = "rtika")
# a list of data.frames
metadata <-
batch %>%
tika_json() %>%
lapply(jsonlite::fromJSON)
# The structure is very long. See it on your own with: str(metadata)
Here are some of the main metadata fields of the recursive
json
output:
# the 'X-TIKA:embedded_resource_path' field
embedded_resource_path <-
metadata %>%
lapply(function(x){ x$'X-TIKA:embedded_resource_path' })
embedded_resource_path
The X-TIKA:embedded_resource_path
field tells you where
in the document hierarchy each object resides. The first item in the
character vector is the root, which is the container itself. The other
items are embedded one layer down, as indicated by the forward slash
/
. In the context of the
X-TIKA:embedded_resource_path
field, paths are not
literally directory paths like in a file system. In reality, the image
icon_info_sml.gif
is within a folder called
images
. Rather, the number of forward slashes indicates the
level of recursion within the document. One slash /
reveals
a first set of embedded documents. Additional slashes /
indicate that the parser has recursed into an embedded document within
an embedded document.
The Content-Type
metadata reveals the first item is the
container and has the type application/zip
. The items after
that are deeper and include web formats such as
application/xhtml+xml
, image/png
, and
text/css
.
The X-TIKA:content
field includes the XHTML rendition of
an object. It is possible to extract plain text in the
X-TIKA:content
field by calling
tika_json_text()
instead. That is the only difference
between tika_json()
and tika_json_text()
.
It may be surprising to learn that Word documents are containers (at
least the modern .docx
variety are). By parsing them with
tika_json()
or tika_json_text()
, the various
images and embedded objects can be analyzed. However, there is an added
complexity, because each document may produce a long vector of
Content-Types
for each embedded file, instead of a single
Content-Type
for the container like tika_xml()
and tika_html()
.
Out of the box, rtika
uses all the available Tika
Detectors and Parsers and runs with sensible defaults. For most, this
will work well.
In future versions, Tika uses a configuration file to customize
parsing. This config file option is on hold in rtika
,
because Tika’s batch module is still new and the config file format will
likely change and be backward incompatible. Please stay tuned.
There is also room for improvement with the document formats common
in the R community, especially Latex and Markdown. Tika currently reads
and writes these formats just fine, captures metadata and recognizes the
MIME type when downloading with tika_fetch()
. However, Tika
does not have parsers to fully understand the Latex or Markdown document
structure, render it to XHTML, and extract the plain text while ignoring
markup. For these cases, Pandoc will be more useful (See: https://pandoc.org/demos.html ).
You may find these resources useful: