So far in this series on open source, we’ve defined contract analytics, discussed the evolution of the underlying open source philosophy, and mentioned some of the most important open source platforms, as well as some ongoing problems in open source.
News broke on Monday that Microsoft has acquired GitHub. This announcement is an unmistakable sign that the utility of open source tools continues to grow. The timing couldn’t be better for us to delve deeply once more into open source at a high level.
We designed ContraxSuite to process and analyze high volumes of legal documents. Many of those documents still arrive in legacy formats (read: paper), so the platform relies on optical character recognition (OCR) to extract text and metadata from paper documents, as well as from scanned PDF and TIFF images. How do we do this?
Apache Tika is the front-line open source text extraction tool for ContraxSuite. Tika identifies and extracts content, extracts metadata, and identifies the various languages found in documents; for scanned documents and images, it hands the OCR work to an engine such as Tesseract. It’s one of the best tools of its kind, used by organizations as far-ranging as Goldman Sachs, the FICO credit scoring service, and researchers at NASA and other academic institutions. It’s also the favored tool for journalists analyzing leaked documents in foreign languages.
For ContraxSuite, we need a fast, reliable way to extract text from a variety of document formats, including scanned PDFs and image files (such as TIFF and JPG). The more documents in a set, the more strategies ContraxSuite can deploy to process them. Once a document set reaches a certain volume, text extraction has to run in a cluster. The workflow for such clusters is as follows: we run Apache Tika servers on cluster nodes, and then load-balance the text extraction API requests among those nodes. A quick diagram may help explain:
Docker Hub has several different Apache Tika images, but the best ones all use the Tesseract OCR engine (now in version 4.0).
The Tesseract OCR engine can recognize over one hundred languages, and can output many of the most common text-based file types. Originally developed at Hewlett-Packard and later sponsored by Google, Tesseract is one of the best open source OCR tools out there.
For ContraxSuite, we created a custom Docker image of Apache Tika coupled with the latest version of Tesseract, which uses LSTM neural networks. Currently, our custom Docker image has support for English, Italian, French, Spanish, German, and Russian.
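Tesseract identifies its language models by three-letter codes, and Tika's server accepts a per-request `X-Tika-OCRLanguage` header to choose among them. The sketch below maps the six languages listed above to their standard Tesseract codes and builds that header. Note the assumptions: that the image installs the models under exactly these names is our guess, and `ocr_language_header` is a hypothetical helper, not part of Tika or ContraxSuite.

```python
# Standard Tesseract language codes for the six languages named above.
# Assumption: the Docker image ships the trained models under these names.
TESSERACT_CODES = {
    "english": "eng",
    "italian": "ita",
    "french": "fra",
    "spanish": "spa",
    "german": "deu",
    "russian": "rus",
}


def ocr_language_header(*languages):
    """Build the request header that tells Tika which OCR models to use.

    Several languages are joined with "+", which is Tesseract's own syntax
    for combining language models in one pass.
    """
    if not languages:
        raise ValueError("at least one language is required")
    codes = [TESSERACT_CODES[name.lower()] for name in languages]
    return {"X-Tika-OCRLanguage": "+".join(codes)}
```

For a contract that mixes English and German, `ocr_language_header("English", "German")` yields a header that can be merged into any request sent to the Tika server.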
Running LexPredict’s Custom Docker Image
To pull the image from Docker Hub, enter the following command:
docker pull lexpredict/tika-server
To run the Tika server with the default configuration and publish the Tika port on the host machine:
docker run -p 9998:9998 -it lexpredict/tika-server
Output should look like this:
You can open http://localhost:9998/ in a browser window to see the available Tika APIs. Opening http://localhost:9998/parsers/details will show you the details of the configured parsers.
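With the server running, extracting text comes down to a single HTTP call: a `PUT` to the `/tika` endpoint with the raw file as the body and `Accept: text/plain`. Here is a minimal sketch using only the Python standard library. The helper names, the five-minute timeout, and the assumption that the response is UTF-8 text are ours, not something the image prescribes.

```python
import urllib.request

# Matches the port published by the `docker run` command above.
TIKA_URL = "http://localhost:9998"


def build_extract_request(document_bytes, base_url=TIKA_URL):
    """Prepare a PUT /tika request that asks for plain-text output."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, base_url=TIKA_URL, timeout=300):
    """Send one file to the Tika server and return the extracted text.

    The generous timeout is an assumption: OCR on a long scanned PDF
    can take minutes rather than seconds.
    """
    with open(path, "rb") as handle:
        request = build_extract_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the server returns UTF-8 encoded plain text.
        return response.read().decode("utf-8")
```

Calling `extract_text("contract.pdf")` (a hypothetical file name) should return the document body as a string, with OCR applied automatically when the file is a scan or an image.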
Tika’s OCR support via Tesseract is important for running ContraxSuite, but one drawback of OCR is that extraction moves slowly and consumes a lot of memory. For some use cases, OCR may not be needed. In these situations, we can disable Tesseract OCR.
If you need to disable Tesseract for any reason, or need to re-configure some other features of Tika, you can override Tika’s configuration file with your own custom file. This config file needs to be mounted into the Tika container at /tika-config.xml. An example /tika-config.xml with OCR disabled is below:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
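Before mounting a custom config, it can be worth checking that the file is well-formed and actually excludes the parser you meant to exclude, since a typo in the class name would leave OCR switched on. The following is a small sketch with the standard library's XML parser; `excluded_parsers` and `ocr_disabled` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

TESSERACT_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"


def excluded_parsers(config_xml):
    """Return the class names listed in <parser-exclude> elements.

    `config_xml` is the raw bytes of a tika-config.xml file. Malformed
    XML raises xml.etree.ElementTree.ParseError.
    """
    root = ET.fromstring(config_xml)
    return {node.get("class") for node in root.iter("parser-exclude")}


def ocr_disabled(config_xml):
    """True when the config excludes the Tesseract OCR parser."""
    return TESSERACT_PARSER in excluded_parsers(config_xml)
```

Reading the file with `open("/home/user/tika-config.xml", "rb").read()` and passing the result to `ocr_disabled` gives a quick yes or no before the container is started.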
Run the Tika container with this config file using the following command:
docker run -it -p 9998:9998 -v /home/user/tika-config.xml:/tika-config.xml lexpredict/tika-server
Running this command should produce output similar to below:
Running Tika Cluster in Docker Swarm
Let’s say you want to run the Tika cluster in Docker Swarm. You should start with a configured Docker Swarm cluster that has worker machines connected. Next, we deploy Tika with the docker-compose.yml file below. The Tika configuration file (tika-config.xml) should be in the same directory as docker-compose.yml:
version: "3.3"
services:
  tika:
    image: lexpredict/tika-server:latest
    ports:
      - 9998:9998
    configs:
      - source: tika_config_3
        target: /tika-config.xml
    networks:
      - net
    deploy:
      replicas: 3
networks:
  net:
configs:
  tika_config_3:
    file: ./tika-config.xml
To deploy Tika to Docker Swarm, input the following:
docker stack deploy --compose-file docker-compose.yml tika-cluster
Docker Swarm will distribute three replicas of the Tika server among the available cluster nodes and publish port 9998 on every node through its ingress routing mesh (https://docs.docker.com/engine/swarm/ingress/). Connections to port 9998 on any node will be automatically load-balanced among the running Tika containers.
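Because Swarm balances per connection, a client only benefits from the three replicas if it keeps several requests in flight at once. Here is one way to sketch that fan-out with a thread pool. `extract_one` stands for any function that sends a single file to port 9998 and returns its text; both that callable and the default of six workers (twice the replica count) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract_one, max_workers=6):
    """Run extract_one(path) for every path in parallel.

    Returns a dict mapping each path to its extracted text, or to the
    exception raised for that path, so one bad file does not stop the batch.
    """
    paths = list(paths)

    def safe_extract(path):
        try:
            return extract_one(path)
        except Exception as error:  # report the failure for this file only
            return error

    # Threads are enough here: the work is waiting on HTTP responses,
    # while the CPU-heavy OCR happens inside the Tika containers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_extract, paths))
    return dict(zip(paths, results))
```

The right worker count depends on how much memory each Tika container has; more in-flight OCR requests than the replicas can handle will only queue up inside the servers.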
At present, Docker Swarm doesn't support updating the contents of an existing config from a docker-compose file. As a workaround, if you ever change tika-config.xml, you also need to change the config name in the docker-compose file (tika_config_3 → tika_config_4 and so on) and then re-deploy the stack or update the service.
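Bumping the number by hand is easy to forget. One alternative, sketched below, is to derive the config name from a hash of the file's contents so that the name changes exactly when the file does. This is our own convention rather than anything Docker requires; `config_name` and `rename_config` are hypothetical helpers, and the text substitution assumes the compose file only uses the `tika_config_` prefix for this one config.

```python
import hashlib
import re


def config_name(config_bytes, prefix="tika_config"):
    """Derive a config name that changes whenever the file content changes."""
    digest = hashlib.sha256(config_bytes).hexdigest()[:12]
    return f"{prefix}_{digest}"


def rename_config(compose_text, new_name, prefix="tika_config"):
    """Replace every <prefix>_<suffix> reference in the compose file text.

    The file name tika-config.xml is untouched because it uses a hyphen,
    not an underscore.
    """
    pattern = re.compile(rf"\b{re.escape(prefix)}_\w+")
    return pattern.sub(new_name, compose_text)
```

A deploy script could read tika-config.xml, compute the name, rewrite docker-compose.yml with `rename_config`, and then run the same `docker stack deploy` command as before.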