At LexPredict we’ve been hard at work on improving ContraxSuite. This past Saturday, June 2nd, marked the release of ContraxSuite 1.1.0.
When we first decided to make ContraxSuite open source, we hoped our project and its components would help the community collectively innovate and improve. We continue to be impressed by the response of the open source community. We look forward to gaining new insights from the developer community, as we continue to build and improve ContraxSuite for our clients.
The release on June 2nd marks our eleventh open source release for ContraxSuite. It includes updates to the Apache Tika text extractor and its OCR engine, improvements to document type detection and the logging system, better error handling for Celery tasks, and improvements to the UI.
New Features and Changelog
The 1.1.0 release focused on various improvements to ContraxSuite. A detailed changelog is below:
- Improved text extraction and detection of document types
  - Document type is now detected from the document’s contents
  - Apache Tika is now the default text extractor
  - A custom Apache Tika Docker image has been created and published
    - This Docker image contains the latest version of Tika, 1.18, along with version 4.0 of the Tesseract OCR engine
    - The image allows external Tika configuration, making it usable in Docker Swarm clusters
- ContraxSuite logging has been switched to Filebeat
  - Django, Celery, and database logs are first written to files in JSON format
  - A separate Filebeat Docker container reads these files asynchronously and pushes the records to Elasticsearch
  - The logging system is now decoupled from the Python modules, so it will not hang or slow down the application if Elasticsearch has problems
  - Internal Nginx logs are now sent to Elasticsearch, and the standard Filebeat Kibana dashboards display Nginx access and error data
  - Logs are written to Elasticsearch indexes that contain dates in their names, and old log indexes are deleted by Curator
- ContraxSuite’s logging routines for asynchronous Celery tasks have been refactored
  - Task logs are no longer stored in the database. Instead, Elasticsearch is now the primary source of log data, and task logs in the UI are loaded from it
  - Users can now search task logs in Kibana by document name or ID
- Metricbeat now tracks metrics for Docker containers in ContraxSuite clusters
  - Standard Metricbeat dashboards are now available in Kibana, allowing easy tracking of CPU and memory usage, availability, and status of the different ContraxSuite components
  - Metrics are written to Elasticsearch indexes that contain dates in their names, and Curator deletes old indexes
- Improved project cleanup method to delete all related objects, and added “total cleanup” method and UI
- Fixed “purge_task” Celery task to handle GroupResults
- Added the non-admin role of “Project Creator” for users with full access to everything except the admin interface and admin tasks
- Included the “set_site” management command in the deployment script
- Improved Celery task progress calculation
- Added logging for Celery subtasks
- Updated the task list view for sorting/filtering using calculated fields
- Improved exception handling for documents in a clustering project
- Improved error handling for memory errors in the training document field model
- Added the user name to the response cookies and JSON response of the login REST API
- Fixed a broken API for password resetting
- After a password change, the UI now redirects to the user’s detail page
- Several bug fixes related to the annotator API
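
To make the logging changes above more concrete, here is a minimal Python sketch of the general pattern: the application only appends JSON lines to a local file, a separate shipper can forward them later, and date-suffixed index names make it easy to pick out old indexes for deletion. This is not ContraxSuite’s actual code. The names `JsonLineFormatter`, `build_json_file_logger`, and `expired_indexes`, the `document_id` field, and the `filebeat-` index prefix are all assumptions made for illustration.

```python
import json
import logging
import os
import tempfile
from datetime import date, datetime, timedelta, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field, so logs could be searched by document ID.
        doc_id = getattr(record, "document_id", None)
        if doc_id is not None:
            payload["document_id"] = doc_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_json_file_logger(name, path):
    """Return a logger that only appends JSON lines to a local file.

    Nothing here talks to Elasticsearch, so an outage on that side
    cannot block or slow down the code that is doing the logging.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


def expired_indexes(index_names, today, keep_days, prefix="filebeat-"):
    """Select indexes named prefix + YYYY.MM.DD that are older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            index_day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # Skip names that do not end in a parseable date.
        if index_day < cutoff:
            expired.append(name)
    return expired


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "celery-tasks.log.json")
    log = build_json_file_logger("demo.tasks", log_path)
    log.info("Task started", extra={"document_id": 42})
    with open(log_path, encoding="utf-8") as fh:
        print(fh.read().strip())
    names = ["filebeat-2018.05.01", "filebeat-2018.06.01"]
    print(expired_indexes(names, date(2018, 6, 2), keep_days=7))
```

Writing to a file first keeps the application’s only logging dependency on the local disk. Shipping and retention then become the job of separate processes, which is the role the changelog assigns to Filebeat and Curator.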
For more information about the release, as well as a complete issue-level change log for each release of ContraxSuite, you can visit our documentation release notes and change log.
Keep The Conversation Going
Do you still have questions about how ContraxSuite works, or how you can get started using it? Are you a developer or organization interested in contributing? Either way, we’re all ears. Please reach out to us at firstname.lastname@example.org so we can talk more.