In June 2018, we published a paper describing our latest software tool for natural language processing in the legal field. Read our description below to learn more about LexNLP. The paper is available to read at this link.
Natural Language Processing (NLP) in a Nutshell
Researchers and developers have been working on natural language processing (NLP) and machine learning packages for over twenty years. NLP tools power all sorts of products, from search engines and chatbots to customer service systems and basic document analysis. Odds are, you’ll interact with an NLP technology at some point today. One way to understand how NLP works is to consider the machine learning concepts at its core. Well-known NLP technologies like Apple’s Siri and Amazon Alexa are programmed to understand simple queries. As they learn to answer those simple queries correctly, that data is used to train them on more complex queries built from common elements of speech.
A simple Google search is another common example of NLP usage. When you type a question into Google’s search bar, that question gets parsed by NLP and transformed from a sentence a human understands into a machine-readable query that the search engine can interpret and respond to.
NLP’s iterative acquisition of English proficiency is not too dissimilar from the way people acquire the language from an early age. Common, easy-to-use words and phrases are learned first, and then more vocabulary, syntax, and grammar develop through trial and error over a person’s first decade or two. NLP tools use machine learning the same way, with developers feeding new data into the system and assessing its progress through trial and error.
The NLTK Toolkit for NLP
You may not have heard of the Natural Language Toolkit (NLTK), but you’ve almost certainly used an app or software package that relies on this NLP library. NLTK is a Python-based NLP tool that excels at general-purpose text; it’s the sort of toolkit behind the average everyday app. First built at the turn of the 21st century, NLTK is used to find and extract text and to parse that text according to rules prescribed by developers. As of this writing, NLTK is taught at several universities in the US and around the world.
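The extraction and parsing described above begins with tokenization: splitting raw text into words and punctuation. The regex sketch below is a toy illustration of the idea only; NLTK’s actual tokenizers are considerably more sophisticated, and the sample sentence is invented for the example.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (toy sketch)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The court granted the motion, citing Rule 12(b)(6).")
# tokens begins: ['The', 'court', 'granted', 'the', 'motion', ',', ...]
```

Note how even a crude tokenizer must decide what to do with legal citation forms like “12(b)(6)”, which a general-purpose rule set may split differently than a legal-aware one would.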
NLP software like NLTK works wonders for general text analysis, but leaves much to be desired in more specialized domains. Several NLP tools have already been tailor-made for the medical field, for example. These toolkits take NLP libraries like NLTK and train their algorithms further, in order to extract and analyze language specific to the field. Such tools might add terms describing diseases, the names of various microbes, or obscure pharmaceutical words that most people seldom or never use, but that medical professionals use as often as everyday vocabulary.
LexNLP and Law
Researchers and developers have produced NLP tools for medicine and other fields, but until now there has been no comparable tool for the law. The legal field is integral to society, employs millions of people worldwide, and – just like medicine – has its own distinct, intricate lexicon. Yet it has seen little development of specialized NLP tooling.
Mike, Dan, and Eric are filling this gap with LexNLP. LexNLP is designed to provide legal professionals with tools and data to work with real legal and regulatory text, including statutes, regulations, court opinions, briefs, contracts, and other legal work product.
LexNLP is a Python-based toolkit just like NLTK, and actually operates on an NLTK foundation. In terms of the ability to find, extract, and analyze text, think of NLTK as an undergraduate with a bachelor’s degree in English, while LexNLP is a law school graduate with several years of experience at a firm. Just like an undergraduate who goes to law school, LexNLP takes NLTK’s natural language libraries and uses machine learning to add terms, grammatical structures, and other elements specific to legal and regulatory documents.
LexNLP is an open source Python package, and is one of the tools that powers LexPredict’s ContraxSuite platform. LexNLP was trained using several different databases of legal material, including the SEC’s EDGAR database.
For an example of how LexNLP specializes in the particulars of legal texts, one can look to the way legal documents handle numbers. Oftentimes, numbers in legal documents are written out as words rather than digits. LexNLP can handle these elements of a document, making sure that “onethousand fifty seven”, “one thousand fifty seven”, “one thousand and fifty seven”, and “one thousand fiftyseven” all show up as “1057”.
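A minimal sketch of how spelled-out numbers can be parsed is shown below. This is an illustration of the idea, not LexNLP’s actual implementation: it handles the spaced and hyphenated forms only, while LexNLP also normalizes fused variants like “onethousand”.

```python
# Lookup tables for spelled-out number words.
UNITS = {w: i for i, w in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven",
     "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
     "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"])}
TENS = {w: 10 * i for i, w in enumerate(
    ["", "", "twenty", "thirty", "forty", "fifty",
     "sixty", "seventy", "eighty", "ninety"]) if w}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def words_to_number(text):
    """Convert phrases like 'one thousand fifty seven' to 1057."""
    total = current = 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue                      # "one thousand and fifty seven"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100                # "three hundred" -> 300
        elif word in SCALES:
            total += current * SCALES[word]
            current = 0
    return total + current
```

With this sketch, “one thousand fifty seven” and “one thousand and fifty seven” both resolve to 1057.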
What Else Can LexNLP Do?
The example above is just a taste of the methods LexNLP uses to process legal and regulatory text into a machine-readable format:
- Stopwords: Stopwords are short words like “the” or “of” that appear in nearly every English document. Legal and regulatory texts, however, often use more specialized stopwords, some of which come from other languages (most notably Latin). LexNLP can detect and parse these rarer, domain-specific stopwords.
- Collocations: Like stopwords, we use collocations every day. Any word that frequently pairs with another word or words is called a collocation; there’s one in the previous sentence (“pairs with”). Much like stopwords, legal and regulatory texts often contain collocations unique to the legal field, which more generalized NLP toolkits rarely learn. LexNLP, however, is trained to spot them.
- Segmentation: LexNLP is trained to find and distinguish titles, headings, sub-headings, sections, and paragraphs from context. These models can be retrained and customized by users.
- Tokens, Stems, and Lemmas: Segmentation at the level of individual words and characters. LexNLP uses the Treebank tokenizer, Snowball stemmer, and WordNet lemmatizer. LexNLP can also be customized to function with the Stanford NLP toolkit.
- Parts of Speech: LexNLP can find verbs, adjectives, adverbs, and nouns.
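Two of the steps above, stopword removal and collocation detection, can be sketched in plain Python. This is an illustrative toy, not LexNLP’s implementation: the stopword list is a tiny invented subset (mixing common English words with Latin terms of the kind found in legal text), and the sample sentence is made up for the example.

```python
from collections import Counter

# Tiny illustrative stopword set: common English words plus a few
# Latin terms that recur in legal writing.
LEGAL_STOPWORDS = {"the", "of", "and", "to", "a", "inter", "alia", "et", "seq"}

def remove_stopwords(tokens):
    """Drop stopwords, case-insensitively."""
    return [t for t in tokens if t.lower() not in LEGAL_STOPWORDS]

def top_bigrams(tokens, n=1):
    """Count adjacent word pairs and return the n most frequent."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)

tokens = ("the force majeure clause excuses performance if a force majeure "
          "event occurs and the force majeure notice is timely").split()

content = remove_stopwords(tokens)
# The pair ("force", "majeure") surfaces as the dominant collocation.
```

A real system would rank candidate pairs by a statistical association measure rather than a raw count, but the principle is the same: word pairs that co-occur far more often than chance are flagged as collocations.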
LexNLP can also find and extract key text and structural information elements. This includes addresses, amounts, citations, conditional statements (e.g. “at least”, “within”), dates, definitions (e.g. “shall mean”), distances, durations, money/currency, percentages, ratios, regulations, trademarks, and URLs. The package can also extract named entities, such as companies, countries, NGOs, and other geopolitical entities. All of these text features can be transformed into data for training both supervised and unsupervised models.
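As a rough illustration of this kind of extraction, here is a regex sketch for two of the listed elements, money amounts and percentages. The patterns and sample text are invented for the example; LexNLP’s real extractors handle far more formats and edge cases (written-out amounts, multiple currencies, ranges, and so on).

```python
import re

# Illustrative patterns only: digit-based dollar amounts and percentages.
MONEY_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")

text = ("The fee shall not exceed $1,500.00, "
        "plus interest at 4.5% per annum.")

money = MONEY_RE.findall(text)      # ['$1,500.00']
percents = PERCENT_RE.findall(text) # ['4.5%']
```

Extracted spans like these are exactly the kind of structured features that can then feed a supervised or unsupervised model downstream.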