A Digital Babel Fish

                    .----.      
           ____    __\\\\\\__                 
           \___'--"          .-.          
           /___>------rtika  '0'          
          /____,--.        \----B      
                  "".____  ___-"    
                  //    / /                   
                        ]/

Apache Tika is similar to the Babel fish in Douglas Adams’s book, “The Hitchhiker’s Guide to the Galaxy” (C. Mattmann and Zitting 2011, 3). The Babel fish translates any natural language to any other. While Apache Tika does not yet translate natural languages, it starts to tame the Tower of Babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows a computer to extract text and objects from Microsoft Word.

The world of digital file formats is like a place where each community has its own language. Academic, business, government, and online communities use anywhere from a few file types to thousands. Unfortunately, attempts to unify groups around a single format are often fruitless (C. A. Mattmann 2013).

This plethora of document formats has become a common concern, and Tika is a shared library that addresses it. Tika started within Apache Nutch in 2005, became its own project in 2007, and then became a component of other Apache projects including Lucene, Jackrabbit, Mahout, and Solr (C. Mattmann and Zitting 2011, 17).

With the increased volume of data in digital archives, and terabyte-sized data becoming common, Tika’s design goals include keeping complexity at bay, low memory consumption, and fast processing (C. Mattmann and Zitting 2011, 18). The rtika package is an interface to Apache Tika that leverages Tika’s batch processor module to parse many documents fairly efficiently. Therefore, I recommend using batches whenever possible.

Extract Plain Text

Video, sound, and images are important, and yet much meaningful data remains numeric or textual. Tika can parse many formats and extract alphanumeric characters, along with a few characters that control the arrangement of text, like line breaks.

I recommend an analyst start with a directory on the computer and get a vector of paths to each file using base::list.files(). The commented code below has a recipe. Here, I use test files that are included with the package.


library('rtika')
library('magrittr')

# Code to get ALL the files in my_path:

# my_path <- "~"
# batch <- file.path(my_path,
#                 list.files(path = my_path,
#                 recursive = TRUE))

# pipe the batch into tika_text() 
# to get plain text

# test files
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

text <-  
    batch %>%
    tika_text() 

# normal syntax also works:
# text <- tika_text(batch)

The output is an R character vector of the same length and order as the input files.

In the example above, there are several seconds of overhead to start up the Tika batch processor and then process the output. The most costly file was the first one. Large batches are parsed more quickly. For example, when parsing thousands of 1-5 page Word documents, I’ve measured 1/100th of a second per document on average.

Occasionally, a file is not parsable and the returned value for that file will be NA. The reasons include corrupt files, disk input/output issues, empty files, password protection, an unhandled format, a broken document structure, or an unexpected variation in the document.

These issues should be rare. Tika works well on most documents, but if an archive is very large there may be a small percentage of unparsable files, and you might want to handle those.

# Find which files had an issue
# Handle them if needed
batch[which(is.na(text))]
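
As a minimal sketch of handling those files, the parsed and unparsed results can be separated with is.na(). The vectors example_batch and example_text below are hypothetical stand-ins for batch and the output of tika_text(), so that the snippet runs without any documents on disk.

```r
# Stand-in inputs (hypothetical): in practice these would be
# 'batch' and the character vector returned by tika_text(batch).
example_batch <- c("a.pdf", "b.docx", "c.zip")
example_text  <- c("parsed text", NA, "more parsed text")

# TRUE where the parser returned NA for a file
is_failed <- is.na(example_text)

# paths of the files that could not be parsed
failed <- example_batch[is_failed]

# keep only the parsed text, named by its source file
parsed <- example_text[!is_failed]
names(parsed) <- example_batch[!is_failed]

failed
parsed
```

Keeping the failed paths in their own vector makes it straightforward to retry them later or inspect them by hand.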

Plain text is easy to search using base::grep().

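# number of documents in the batch
length(text)

# keep only the documents that contain ' is '
search <-
    text[grep(pattern = ' is ', x = text)]

# number of documents that matched
length(search)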

With plain text, a variety of interesting analyses are possible, ranging from word counting to constructing matrices for deep learning. Much of this text processing is handled easily with the well documented tidytext package (Silge and Robinson 2017). Among other things, it handles tokenization and creating term-document matrices.
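
For a quick look before reaching for tidytext, word counting can be sketched in base R alone. This is not the tidytext approach; it is a rough stand-in that splits on anything that is not a letter. The string example_text is a hypothetical substitute for one element of the text vector.

```r
# Stand-in for one parsed document (hypothetical text)
example_text <- "Tika extracts text. Tika also extracts metadata from text files."

# split on runs of non-letters and lower-case the pieces
words <- tolower(unlist(strsplit(example_text, "[^[:alpha:]]+")))

# drop any empty strings left by the split
words <- words[nzchar(words)]

# frequency table, most common words first
counts <- sort(table(words), decreasing = TRUE)

head(counts, 3)
```

This crude tokenizer ignores numbers and treats hyphenated words as separate tokens, which is one reason a dedicated package is preferable for real analyses.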

Preserve Content-Type when Downloading

A general suggestion is to use tika_fetch() when downloading files from the Internet, to preserve the server Content-Type information in a file extension.

Tika’s Content-Type detection is improved with file extensions (Tika also relies on other features such as magic bytes, which are unique control bytes in the file header). The tika_fetch() function tries to preserve Content-Type information from the download server by finding the matching extension in Tika’s database.

download_directory <- tempfile('rtika_')

dir.create(download_directory)

urls <- c('https://tika.apache.org/',
          'https://cran.rstudio.com/web/packages/keras/keras.pdf')

downloaded <- 
    urls %>% 
    tika_fetch(download_directory)

# it will add the appropriate file extension to the downloads
downloaded

This tika_fetch() function is used internally by the tika() functions when processing URLs. By using tika_fetch() explicitly with a specified directory, you can also save the files and return to them later.

Settings for Big Datasets

Large jobs are possible with rtika. However, with hundreds of thousands of documents, the R object returned by the tika() functions can be too big for RAM. In such cases, it is good to use the computer’s disk more, since running out of RAM slows the computer.

I suggest changing two parameters in any of the tika() parsers. First, set return = FALSE to prevent returning a big R character vector of text. Second, specify an existing directory on the file system using output_dir, pointing to where the processed files will be saved. The files can be dealt with in smaller batches later on.

Another option is to increase the number of threads, setting threads to something like the number of processors minus one.
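
One way to pick that number is sketched below, assuming the base parallel package's detectCores() reports the processors on the machine. It can return NA on some platforms, so the sketch falls back to a single thread. The name n_threads is illustrative and not part of rtika.

```r
# number of processors reported by R (may be NA on some platforms)
n_cores <- parallel::detectCores()

# leave one processor free, but never go below one thread
n_threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

n_threads
```

The result could then be supplied as the threads argument in place of a hard-coded value.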

# create a directory not already in use.
my_directory <-
   tempfile('rtika_')
                  
dir.create(my_directory)

# pipe the batch to tika_text()
batch %>%
tika_text(threads = 4,
          return = FALSE,
          output_dir = my_directory) 

# list all the file locations 
processed_files <- file.path(
                normalizePath(my_directory),
                list.files(path = my_directory,
                recursive = TRUE)
                )

The location of each file in output_dir follows a convention from the Apache Tika batch processor: the full path to each file mirrors the original file’s path, only within the output_dir.

processed_files

Note that tika_text() produces .txt files, tika_xml() produces .xml files, tika_html() produces .html files, and both tika_json() and tika_json_text() produce .json files.
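
Dealing with the saved files in smaller batches can be sketched in base R as follows. The directory, the five small .txt files, and the chunk size of 2 are all hypothetical stand-ins for a real output_dir, chosen so that the snippet runs on its own.

```r
# Stand-in for an output_dir filled by tika_text() (hypothetical files)
example_dir <- tempfile("rtika_example_")
dir.create(example_dir)
for (i in 1:5) {
  writeLines(paste("document", i),
             file.path(example_dir, sprintf("doc%d.txt", i)))
}

# full paths to every processed file
files <- list.files(example_dir, recursive = TRUE, full.names = TRUE)

# split the paths into chunks of at most 'chunk_size' files
chunk_size <- 2
chunks <- split(files, ceiling(seq_along(files) / chunk_size))

# read one chunk at a time and keep only a small summary of each file,
# so the full text of the archive is never in RAM at once
chars_per_file <- unlist(lapply(chunks, function(chunk) {
  texts <- vapply(chunk, function(f) {
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1))
  nchar(texts)
}), use.names = FALSE)

length(chunks)
chars_per_file
```

In a real job, the summary step would be replaced by whatever analysis is needed, with only its result kept between chunks.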

Get a Structured XHTML Rendition

Plain text falls short for some purposes. For example, pagination might be important for selecting a particular page in a PDF. The Tika authors chose HTML as a universal format because it offers semantic elements that are common or familiar. For example, the hyperlink is represented in HTML as the anchor element <a> with the attribute href. The HTML in Tika preserves this metadata:

library('xml2')

# get XHTML text
html <- 
    batch %>%
    tika_html() %>%
    lapply(xml2::read_html)

# parse links from documents
links <-
    html %>%
    lapply(xml2::xml_find_all, '//a') %>%
    lapply(xml2::xml_attr, 'href')

sample(links[[1]],10)

Each type of file has different information preserved by Tika’s internal parsers. The particular aspects vary. Some notes:

PDF files retain pagination, with each page starting with the XHTML element <div class="page">.
PDFs retain hyperlinks in the anchor element <a> with the attribute href.
Word and Excel documents retain tabular data as a <table> element. The rvest package has a function to get tables of data with rvest::html_table().
Multiple Excel sheets are preserved as multiple XHTML tables. Ragged tables, where rows have differing numbers of cells, are not supported.

Note that tika_html() and tika_xml() both produce the same strict form of HTML called XHTML, and either works essentially the same for all the documents I’ve tried.
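
Selecting a particular page from a PDF rendition can be sketched with xml2, using the page element described above. The XHTML string below is a hand-written stand-in, not actual Tika output, and real renditions contain much more markup.

```r
library('xml2')

# Stand-in for the XHTML of a two-page PDF (hypothetical content)
example_xhtml <- paste0(
  '<html><body>',
  '<div class="page"><p>First page text.</p></div>',
  '<div class="page"><p>Second page text.</p></div>',
  '</body></html>'
)

doc <- xml2::read_html(example_xhtml)

# every page of the document, in order
pages <- xml2::xml_find_all(doc, '//div[@class="page"]')

# plain text of each page, with surrounding whitespace trimmed
page_text <- xml2::xml_text(pages, trim = TRUE)

# the text of the second page only
page_text[2]
```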

Access Metadata in the XHTML

The tika_html() and tika_xml() functions are focused on extracting strict, structured HTML as XHTML. In addition, metadata can be accessed in the meta tags of the XHTML. Common metadata fields include Content-Type, Content-Length, Creation-Date, and Content-Encoding.

# Content-Type
html %>%
lapply(xml2::xml_find_first, '//meta[@name="Content-Type"]') %>%
lapply(xml2::xml_attr, 'content') %>%
unlist()

# Creation-Date
html %>%
lapply(xml2::xml_find_first, '//meta[@name="Creation-Date"]') %>%
lapply(xml2::xml_attr, 'content') %>%
unlist()

Get Metadata in JSON Format

Metadata can also be accessed with tika_json() and tika_json_text(). Consider all that can be found from a single image:

library('jsonlite')
# batch <- system.file("extdata", "calculator.jpg", package = "rtika")

# a list of data.frames
metadata <-
    batch %>% 
    tika_json() %>%
    lapply(jsonlite::fromJSON)

# look at metadata for an image
str(metadata[[6]])

In addition, each specific format can have its own specialized metadata fields. For example, photos sometimes store latitude and longitude:

metadata[[6]]$'geo:lat'
metadata[[6]]$'geo:long'

Get Metadata from “Container” Documents

Some types of documents can have multiple objects within them. For example, a .zip file may contain many other files. The tika_json() and tika_json_text() functions have a special ability that the others do not: they recurse into a container and examine each file within. The Tika authors call the format jsonRecursive for this reason.

In the following example, I created a compressed archive of the Apache Tika homepage, using the command line programs wget and zip. The small archive includes the HTML page, its images, and required files.

# wget gets a webpage and other files. 
# sys::exec_wait('wget', c('--page-requisites', 'https://tika.apache.org/'))
# Put it all into a .zip file 
# sys::exec_wait('zip', c('-r', 'tika.apache.org.zip' ,'tika.apache.org'))
batch <- system.file("extdata", "tika.apache.org.zip", package = "rtika")

# a list of data.frames
metadata <-
    batch %>% 
    tika_json() %>%
    lapply(jsonlite::fromJSON)

# The structure is very long. See it on your own with: str(metadata)

Here are some of the main metadata fields of the recursive json output:

# the 'X-TIKA:embedded_resource_path' field
embedded_resource_path <- 
    metadata %>%
    lapply(function(x){ x$'X-TIKA:embedded_resource_path' }) 

embedded_resource_path

The X-TIKA:embedded_resource_path field tells you where in the document hierarchy each object resides. The first item in the character vector is the root, which is the container itself. The other items are embedded one layer down, as indicated by the forward slash /. In the context of the X-TIKA:embedded_resource_path field, paths are not literal directory paths like those in a file system. For example, the image icon_info_sml.gif is actually stored within a folder called images, yet that folder does not appear in its path. Instead, the number of forward slashes indicates the level of recursion within the document. One slash / reveals a first set of embedded documents. Additional slashes / indicate that the parser has recursed into an embedded document within an embedded document.
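
The recursion level can be computed by counting slashes, as in the base R sketch below. The vector example_paths is a hypothetical stand-in for one element of embedded_resource_path, and it assumes the root container's entry is NA. The file report.docx and its image are invented to illustrate a second level.

```r
# Stand-in for one document's embedded resource paths (hypothetical values);
# the first entry represents the root container and is assumed to be NA
example_paths <- c(NA,
                   "/index.html",
                   "/icon_info_sml.gif",
                   "/report.docx/image1.png")

# count the forward slashes in each path
slashes <- nchar(example_paths) -
  nchar(gsub("/", "", example_paths, fixed = TRUE))

# the root has no path, so treat it as depth 0
depth <- ifelse(is.na(example_paths), 0L, slashes)

depth
```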

content_type <-
    metadata %>%
    lapply(function(x){ x$'Content-Type' }) 

content_type

The Content-Type metadata reveals the first item is the container and has the type application/zip. The items after that are deeper and include web formats such as application/xhtml+xml, image/png, and text/css.

# the 'X-TIKA:content' field
content <- 
    metadata %>%
    lapply(function(x){ x$'X-TIKA:content' })

str(content)

The X-TIKA:content field includes the XHTML rendition of an object. It is possible to extract plain text in the X-TIKA:content field by calling tika_json_text() instead. That is the only difference between tika_json() and tika_json_text().

It may be surprising to learn that Word documents are containers (at least the modern .docx variety are). By parsing them with tika_json() or tika_json_text(), the various images and embedded objects can be analyzed. However, there is an added complexity: each document may produce a long vector of Content-Types, one for each embedded file, instead of the single Content-Type for the container that tika_xml() and tika_html() give.
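
One way to cope with that longer vector is to tabulate it, as sketched below. The object example_metadata is a hypothetical stand-in for the list of data.frames produced by parsing the JSON output, and its values are invented for illustration.

```r
# Stand-in for parsed JSON metadata of one .docx file (hypothetical values):
# the first row is the container, the rest are embedded objects
example_metadata <- list(
  data.frame(
    `Content-Type` = c(
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "image/png",
      "image/png",
      "image/jpeg"
    ),
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
)

# the container's own type is the first Content-Type of each document
container_type <- vapply(example_metadata,
                         function(x) x$'Content-Type'[1],
                         character(1))

# count how often each type occurs across all embedded objects
embedded_types <- unlist(lapply(example_metadata,
                                function(x) x$'Content-Type'[-1]))
type_counts <- table(embedded_types)

container_type
type_counts
```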

Extending rtika

Out of the box, rtika uses all the available Tika Detectors and Parsers and runs with sensible defaults. For most, this will work well.

Tika can be customized with a configuration file, and future versions of rtika may support this. The config file option is on hold in rtika for now, because Tika’s batch module is still new and the config file format will likely change in backward-incompatible ways. Please stay tuned.

There is also room for improvement with the document formats common in the R community, especially LaTeX and Markdown. Tika currently reads and writes these formats just fine, captures metadata, and recognizes the MIME type when downloading with tika_fetch(). However, Tika does not have parsers to fully understand the LaTeX or Markdown document structure, render it to XHTML, and extract the plain text while ignoring markup. For these cases, Pandoc will be more useful (see https://pandoc.org/demos.html).

You may find these resources useful:

Current Tika issues and progress can be seen here: https://issues.apache.org/jira/projects/TIKA
The Tika Wiki is here: https://cwiki.apache.org/confluence/display/tika/
Tika source code: https://github.com/apache/tika

References

Mattmann, Chris A. 2013. “Computing: A Vision for Data Science.” Nature 493 (7433): 473.

Mattmann, Chris, and Jukka Zitting. 2011. Tika in Action. Manning Publications Co. https://www.manning.com/books/tika-in-action.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc. https://www.tidytextmining.com/.

Introduction to rtika