Tika and Tesseract in ContraxSuite

So far in this series on open source, we’ve defined contract analytics, discussed the evolution of the underlying open source philosophy, and mentioned some of the most important open source platforms, as well as some ongoing problems in open source.

News broke on Monday that Microsoft has acquired GitHub. This announcement is an unmistakable sign that the utility of open source tools continues to grow. The timing couldn’t be better for us to delve into open source once more.

We designed ContraxSuite to process and analyze high volumes of legal documents. The platform utilizes optical character recognition (OCR) to extract metadata and information, but many of the documents analyzed are still in legacy formats (read: paper). OCR handles these paper documents, as well as scanned PDF and TIFF images. How do we do this?

Tika

Tika is the front-line open source text extraction tool for ContraxSuite. Tika identifies and extracts content, extracts metadata, and identifies the various languages found in documents; for scanned documents and images, it hands OCR off to an engine such as Tesseract. It’s one of the best tools of its kind, used by organizations as far-ranging as Goldman Sachs, the FICO credit scoring service, and researchers at NASA and other academic institutions. It’s also the favored tool for journalists analyzing leaked documents in foreign languages.

For ContraxSuite, we need a fast, reliable way to extract text from a variety of document formats, including scanned PDFs and image files (such as TIFF and JPG). The more documents in a set, the more strategies ContraxSuite can deploy to process them. Once a document set reaches a certain volume, text extraction has to run in a cluster. The workflow for such clusters is as follows: we run Apache Tika servers on cluster nodes, and then load-balance the various text extraction API requests among those nodes. A quick diagram may help explain:

[Diagram: Tika server operation, with text extraction API requests load-balanced across a cluster of Tika servers]
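The idea of spreading extraction requests over several Tika servers can be sketched in a few lines of Python. This is only an illustration of round-robin balancing, not how ContraxSuite or Docker Swarm actually implements it; the node URLs and the function name are hypothetical placeholders.

```python
from itertools import cycle

# Hypothetical Tika server endpoints, one per cluster node (assumed names).
TIKA_NODES = [
    "http://node1:9998",
    "http://node2:9998",
    "http://node3:9998",
]


def assign_round_robin(documents, nodes):
    """Pair each document with a node, cycling through the nodes in order."""
    node_cycle = cycle(nodes)
    return [(doc, next(node_cycle)) for doc in documents]


if __name__ == "__main__":
    docs = ["a.pdf", "b.tiff", "c.jpg", "d.pdf"]
    for doc, node in assign_round_robin(docs, TIKA_NODES):
        print(f"{doc} -> {node}/tika")
```

With three nodes and four documents, the fourth document wraps around to the first node again.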

Docker Hub hosts several different Apache Tika images, but the best ones all use the Tesseract OCR engine (now in version 4.0).

Tesseract

The OCR engine Tesseract can recognize over one hundred languages, and can output in many of the most common text-based file types. Originally developed at Hewlett-Packard and sponsored by Google since 2006, Tesseract is one of the best open source OCR tools out there.

For ContraxSuite, we created a custom Docker image of Apache Tika coupled with the latest version of Tesseract, which uses LSTM neural networks. Currently, our custom Docker image has support for English, Italian, French, Spanish, German, and Russian.

Running LexPredict’s Custom Docker Image

To pull the image from Docker Hub, enter the following command:

docker pull lexpredict/tika-server

To run the Tika server with the default configuration and publish the Tika port on the host machine:

docker run -p 9998:9998 -it lexpredict/tika-server

Output should look like this:

[Screenshot: output of pulling and running the Tika server Docker image]

You can open http://localhost:9998/ in a browser window to see the available Tika APIs. Opening http://localhost:9998/parsers/details will show you the details of the configured parsers.
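Once the server is up, text extraction itself is a single HTTP call. The sketch below uses only the Python standard library and assumes the server is reachable at http://localhost:9998 as published above; it sends a file to Tika server's `/tika` endpoint with a PUT request and asks for plain text back. The helper names and the file name in the usage comment are made up for illustration.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # assumes the port published above


def build_extract_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request for the /tika endpoint, asking for plain text."""
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send a local file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_extract_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example (needs a running Tika server and a local file; the name is hypothetical):
#     print(extract_text("contract.pdf"))
```

Keeping the request construction separate from the network call makes it easy to point the same code at a different host, for example a cluster entry point instead of localhost.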

Disabling OCR

Tika’s OCR support via Tesseract is important for running ContraxSuite, but one drawback of OCR is that extraction is slow and consumes a lot of memory. For some use cases, such as document sets that already contain a text layer, OCR may not be needed. In these situations, we can disable Tesseract OCR.

If you need to disable Tesseract for any reason, or need to re-configure some other features of Tika, you can override Tika’s configuration file with your own custom file. This config file needs to be mounted into the Tika container at /tika-config.xml. An example /tika-config.xml with OCR disabled is below:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
      <parser class="org.apache.tika.parser.DefaultParser">
          <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      </parser>
  </parsers>
</properties>
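A config like the one above can be sanity-checked before it is mounted. The following sketch, using only Python's standard library, parses the XML and reports whether the Tesseract parser is excluded. The function name is ours, and the check only looks for a matching `parser-exclude` element; it does not validate the rest of the file.

```python
import xml.etree.ElementTree as ET

TESSERACT_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"


def ocr_disabled(config_xml: bytes) -> bool:
    """Return True if any <parser-exclude> element names the Tesseract parser."""
    root = ET.fromstring(config_xml)
    return any(
        exclude.get("class") == TESSERACT_PARSER
        for exclude in root.iter("parser-exclude")
    )


# Example (assumes tika-config.xml is in the current directory):
#     with open("tika-config.xml", "rb") as f:
#         print(ocr_disabled(f.read()))
```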

Run the Tika container with this config file mounted, using the following command:

docker run -it -p 9998:9998 -v /home/user/tika-config.xml:/tika-config.xml lexpredict/tika-server

Running this command should produce output similar to below:

[Screenshot: Disabling OCR in Tika]

Running Tika Cluster in Docker Swarm

Let’s say you want to run the Tika cluster in Docker Swarm. You should start with a configured Docker Swarm cluster, and have worker machines connected. Next, we deploy Tika with the following docker-compose.yml file. The Tika configuration file (tika-config.xml) should be in the same directory as docker-compose.yml.


version: "3.3"
services:
  tika:
    image: lexpredict/tika-server:latest
    ports:
      - 9998:9998
    configs:
      - source: tika_config_3
        target: /tika-config.xml
    networks:
      - net
    deploy:
      replicas: 3

networks:
  net:

configs:
  tika_config_3:
    file: ./tika-config.xml

To deploy Tika to Docker Swarm, run the following command:

docker stack deploy --compose-file docker-compose.yml tika-cluster

[Screenshot: Deploying Tika in Docker Swarm]

Docker Swarm will distribute three replicas of the Tika server among the available cluster nodes and publish port 9998 through its ingress routing mesh (https://docs.docker.com/engine/swarm/ingress/). Connections to port 9998 will be automatically load-balanced among the running Tika containers on cluster nodes.

At present, Docker Swarm doesn't support updating a config in place when it is specified in a docker-compose file. As a workaround, whenever you change tika-config.xml, you also need to change the config name in the docker-compose file (tika_config_3 → tika_config_4 and so on), and then re-deploy the stack or update the service.
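Renaming the config by hand is easy to forget, so the bump can be scripted. The sketch below is one possible approach rather than an official Docker feature: it increments the numeric suffix of every name that starts with `tika_config_` in the compose file text. The function name and the default prefix are our own assumptions, taken from the example file above.

```python
import re


def bump_config_name(compose_text: str, prefix: str = "tika_config_") -> str:
    """Increment the numeric suffix of every '<prefix><number>' in the text."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    return pattern.sub(
        lambda match: f"{prefix}{int(match.group(1)) + 1}",
        compose_text,
    )


# Example (rewrites docker-compose.yml in place; try it on a copy first):
#     with open("docker-compose.yml") as f:
#         updated = bump_config_name(f.read())
#     with open("docker-compose.yml", "w") as f:
#         f.write(updated)
```

Because the file name tika-config.xml uses a hyphen rather than an underscore, the `file:` entry is left untouched and only the config names change.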
