In June of 2018, Mike, Dan, and Eric published a research paper on a new software tool for SEC research: OpenEDGAR. The paper is available for download at arXiv. Below, we summarize much of the paper’s contents.
The SEC’s EDGAR database contains terabytes of documents and data, including press releases, annual corporate filings, executive employment agreements, and investment company holdings. While EDGAR has existed for over twenty years, it’s been difficult for scholars to conduct or reproduce research based on EDGAR data. Researchers often spend considerable time and money developing and redeveloping code to retrieve and parse EDGAR data, because there is no common framework to build on.
OpenEDGAR changes the way people interact with the EDGAR system. OpenEDGAR is an open source Python framework that allows researchers and developers working with SEC data to share the costs and benefits of common core functionality. In the same way that open source has contributed to the development of natural language processing (NLP) and machine learning (ML) resources, OpenEDGAR empowers researchers to find and develop answers to their questions.
As we’ve discussed in the past, there are many positive aspects to open source software, chief among them increased access and lowered costs. OpenEDGAR uses standard open source licenses, such as the MIT and Apache licenses and the GPL family. It uses mature code bases with a high level of documentation and large, active developer communities. It is also simple to scale and to set up infrastructure for, and it supports languages other than English.
OpenEDGAR provides an open source Python framework that can work with EDGAR data at scale. To accomplish this, we use high-quality open source packages that enable researchers to tackle problems of all sizes:
- Object Storage: OpenEDGAR was designed to use Amazon Simple Storage Service (S3) and other compatible storage engines like OpenStack Swift that can manage large volumes of data.
- Relational Database: OpenEDGAR uses traditional relational database technology to manage index and metadata records. Users can interact with this data either through the Django ORM or through SQL directly. OpenEDGAR uses PostgreSQL by default, but Django also supports other databases such as MySQL, Oracle, and SQLite.
- Distributed Task and Message Queues: OpenEDGAR is designed to be able to distribute work across one or more servers utilizing Celery, integrated with Django. OpenEDGAR uses RabbitMQ as its message broker, but Celery supports other brokers as well.
- Content Extraction: To normalize content across file types and document types, OpenEDGAR employs Apache Tika and Tesseract to extract metadata and text from EDGAR documents.
- Interactive Data Science Platform: OpenEDGAR supports the Jupyter interactive computing platform. This allows researchers to develop code in Python or R, execute in ecosystems like Apache Spark, examine figures and results, and publish source code either publicly or privately.
OpenEDGAR is built on the Django application framework. Many of the template files created by Django are not unique to OpenEDGAR, but we do distribute them in the repository. The components that LexPredict built on top of Django include the following:
- Data Model: OpenEDGAR structures all the key metadata provided by the SEC EDGAR database. This metadata includes the company identifier (Central Index Key, or CIK), CompanyInfo, FilingIndex, Filing, and FilingDocument values, and SearchQuery objects.
- Clients: OpenEDGAR provides two client APIs. The first uses the Boto library and is designed for Amazon S3 and other compatible engines like OpenStack Swift. The second uses the Requests library to access EDGAR itself and retrieve file and directory contents, indexes by type or year, and company metadata.
- Parsers: The EDGAR system was designed during the bygone heyday of the SGML data format. OpenEDGAR’s Index Parser, Filing Parser, and Filing Document Parser are flexible enough to parse the older SGML tags still found in many SEC filings.
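To make the index-parsing step concrete, here is a minimal sketch that uses only the Python standard library. It is not OpenEDGAR's own parser or data model: the `IndexEntry` record, the function names, and the sample rows are all hypothetical. It assumes the pipe-delimited layout of EDGAR's quarterly `master.idx` files (CIK, company name, form type, date filed, file path).

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator

EDGAR_ARCHIVE_ROOT = "https://www.sec.gov/Archives/"


@dataclass(frozen=True)
class IndexEntry:
    """One filing row from a quarterly index (hypothetical record type)."""
    cik: int
    company_name: str
    form_type: str
    date_filed: date
    path: str


def quarterly_index_url(year: int, quarter: int) -> str:
    """Build the URL of the master index for one quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be between 1 and 4")
    return f"{EDGAR_ARCHIVE_ROOT}edgar/full-index/{year}/QTR{quarter}/master.idx"


def filing_url(entry: IndexEntry) -> str:
    """Build the download URL for the filing an index row points to."""
    return EDGAR_ARCHIVE_ROOT + entry.path


def parse_master_index(text: str) -> Iterator[IndexEntry]:
    """Yield one IndexEntry per data row, skipping header and separator lines."""
    for line in text.splitlines():
        parts = line.strip().split("|")
        # Data rows have exactly five fields and a numeric CIK; headers do not.
        if len(parts) != 5 or not parts[0].isdigit():
            continue
        cik, name, form_type, filed, path = parts
        yield IndexEntry(int(cik), name, form_type, date.fromisoformat(filed), path)


# Made-up sample rows in the master.idx layout (not real filings).
SAMPLE_INDEX = """\
Description:           Master Index of EDGAR Dissemination Feed
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1234567|EXAMPLE CORP|8-K|2018-06-01|edgar/data/1234567/0001234567-18-000001.txt
7654321|SAMPLE HOLDINGS INC|10-K|2018-06-15|edgar/data/7654321/0007654321-18-000002.txt
"""

if __name__ == "__main__":
    for entry in parse_master_index(SAMPLE_INDEX):
        print(entry.form_type, entry.company_name, filing_url(entry))
```

Keeping the parser separate from any network or storage code means the same function can be run against index text fetched from EDGAR or read back from object storage.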
The data model, clients, and parsers provide the building blocks for constructing research databases from EDGAR. OpenEDGAR also provides standard processes for common tasks that many research projects will find useful:
- Populate database objects for company metadata, both initially and incrementally
- Download Filing index files, both initially and incrementally
- Download Filing files, both initially and incrementally
- Populate database objects from FilingIndex and Filings
- Incrementally update existing database objects, FilingIndex and Filings
- Extract text content from Filing documents, both initially and incrementally
- Search Filing documents for term references
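Two of the processes above, incremental download and term search, can be sketched in a few lines. This is an illustration under stated assumptions, not OpenEDGAR's implementation: `pending_downloads` and `count_term_references` are hypothetical helper names, storage is modeled as a plain set of paths, and terms are assumed to be ordinary words or phrases.

```python
import re
from typing import Dict, Iterable, List, Set


def pending_downloads(index_paths: Iterable[str], stored_paths: Set[str]) -> List[str]:
    """Return index paths not yet in storage, in order, without duplicates."""
    seen = set(stored_paths)
    pending = []
    for path in index_paths:
        if path not in seen:
            seen.add(path)  # also guards against repeats within the index
            pending.append(path)
    return pending


def count_term_references(text: str, terms: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive, whole-word occurrences of each term in the text."""
    counts = {}
    for term in terms:
        # re.escape keeps any punctuation inside the term literal.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    already_stored = {"edgar/data/1/a.txt"}
    listed = ["edgar/data/1/a.txt", "edgar/data/2/b.txt", "edgar/data/2/b.txt"]
    print(pending_downloads(listed, already_stored))

    text = "The merger agreement was signed. The Merger closes in July."
    print(count_term_references(text, ["merger", "agreement"]))
```

An incremental run only needs the set of paths already stored, so repeating it after a new index arrives fetches just the new filings.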
Now let’s take a look at a quick example of how these features can be brought to bear in a relatively common task.
OpenEDGAR Example Usage
Researchers developing natural language and machine learning models often find themselves in need of a large corpus of documents for training purposes. These researchers may want to develop word frequency or word embedding models. These types of models encode the text of one or multiple documents into vectors that can be shown as a feature space of related words and/or concepts. Not only can word embedding models trained with sufficient data find groups of word stems or lemmas, but they can also capture synonyms and related concepts without the need for human intervention in the initial training. The code for such a retrieval mechanism is below.
With this code, researchers can quickly retrieve a sample of 1,000 press releases to build a word2vec word embedding model. Once a model like this is properly trained, it can be used to produce vector representations of new text, and it can also be queried to produce synonyms or related concepts.
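As a rough illustration of the sampling and preparation step, the sketch below draws a fixed-size random sample from a collection of document texts and turns it into tokenized sentences. It is not the OpenEDGAR listing: the function names are hypothetical, the documents are assumed to be plain text that has already been extracted, and the sentence splitting is deliberately naive. The output, a list of token lists, is the input shape that word2vec implementations such as gensim's `Word2Vec` accept.

```python
import random
import re
from typing import List, Sequence

# Lowercase alphabetic tokens, optionally with an internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def sample_documents(documents: Sequence[str], sample_size: int = 1000,
                     seed: int = 0) -> List[str]:
    """Return a reproducible random sample of at most sample_size documents."""
    if len(documents) <= sample_size:
        return list(documents)
    return random.Random(seed).sample(list(documents), sample_size)


def to_sentences(text: str) -> List[List[str]]:
    """Split text into sentences and each sentence into lowercase tokens."""
    sentences = []
    for raw in SENTENCE_SPLIT_RE.split(text):
        tokens = TOKEN_RE.findall(raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


def build_corpus(documents: Sequence[str], sample_size: int = 1000,
                 seed: int = 0) -> List[List[str]]:
    """Sample documents and flatten them into one list of tokenized sentences."""
    corpus = []
    for doc in sample_documents(documents, sample_size, seed):
        corpus.extend(to_sentences(doc))
    return corpus


if __name__ == "__main__":
    docs = ["Revenue rose sharply. Costs fell.", "The board declared a dividend."]
    for sentence in build_corpus(docs, sample_size=2):
        print(sentence)
```

Fixing the random seed keeps the sample reproducible, which matters when a trained model has to be rebuilt or compared across runs.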
For researchers who need pre-trained word embedding models for legal or regulatory text, our LexNLP package includes a number of pre-trained word2vec and doc2vec models.
The SEC’s EDGAR database contains ideal content for training and using natural language processing and machine learning tools. It represents a vast treasure trove of raw data that researchers are looking to more effectively harness every day. The OpenEDGAR software framework has the tools to empower researchers and developers who work with SEC data. Read the paper on arXiv.