key: cord-0190266-f9oovo0i
authors: Spangher, Alexander; May, Jonathan
title: StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning
date: 2021-04-20
journal: nan
DOI: nan
sha: d210c07f4e80e7bb89ee116f3bd6e2cee503b4c3
doc_id: 190266
cord_uid: f9oovo0i

In this work, we create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text. Our system is built primarily with journalists and legal interpreters in mind, and we focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government. Our system exposes a corpus we collect of 6,000 state-level laws that pertain to the U.S. census, using 25 scrapers we built to crawl state law websites, which we release. We also build a novel, flexible annotation framework that can handle span-tagging and relation tagging on an arbitrary input text document and be embedded simply into any webpage. This framework allows journalists and researchers to add to our annotation database by correcting and tagging new data.

Since at least 1958, AI practitioners have explored how to analyze legal documents (i.e., laws, court opinions and regulations) to yield greater insight into legal decision-making (Mehl, 1958). A number of systems seek to help citizens understand law (Dale, 2019) by answering questions, 1 generating documents, 2 or helping users file motions (Gibbs, 2016).

However, researchers trying to compare and contrast large bodies of law, such as journalists or academics, have virtually no open-source resources. An informal survey of legal journalists we conducted before starting this project 3 yielded several insights: (1) there are no openly accessible, free, online sources to perform full-text searches on all state law; (2) it is hard to track entities across laws; and (3) it is hard to know when a law applies. Our web application aims to address all three of these shortcomings.

1 https://www.chatbotsecommerce.com/nrf-launches-parker-first-australian-privacy-law-chatbot/

2 https://legal.thomsonreuters.com.au/products/contract-express/, https://turbotax.intuit.com/

"...in counties having a metropolitan form of government and in counties having a population of not less than three hundred thirty-five thousand (335,000) nor more than three hundred thirty-six thousand (336,000), according to the 1990 federal census or any subsequent federal census, the magistrate or magistrates shall be selected and appointed by and serve at the pleasure of the trial court judge..."

Figure 1: Paragraph from a sample law, Tennessee § 36-5-402, referencing a bureaucratic process impacted by population counts determined by the upcoming federal census. The colored blocks represent the following concepts, which we will define in the text: probe, test, subject, consequence, object. Our web application aggregates these span tags across state-level laws, helping civil citizens and journalists more easily discern the impacts of a U.S. Census undercount.

In this work, we present a web application, 4 an annotation framework, and a set of web-scrapers designed to meet these challenges. This system is the fruit of forthcoming theoretical work in which we propose a discourse schema for analyzing law (illustrated in Figure 1), an annotated dataset, and a set of models. At the core, our framework seeks to answer the following key questions:

(1) When does this law apply? (2) Who gains what powers? (3) Who gains what restrictions?

This paper makes three key contributions:

1. Searching and consuming model output:

We present a web-app to expose and help users navigate through a discourse framework for law that we develop, annotate and train models on. We release a set of 25 robust Dockerized web-scrapers to collect U.S. state law wholesale or by keyword. These scrapers are designed to overcome uncivil attempts to block law-scraping, attempts that hinder research. We collect and release a database of more than 6,000 state-level laws using these scrapers.

Our particular subject area in this work, state-level laws pertaining to the 2020 U.S. Census, is relevant for several reasons: (1) The 2020 U.S. Census has faced massive challenges and certain populations might be undercounted (Naylor, 2020; Mervis, 2019; Berry-James et al., 2020). (2) Very little research exists exploring the effects of an undercount on state-level processes. 5 (3) Journalists, our primary users, can provide useful feedback for ongoing work on discourse-schema development.

We outline our discourse schema and modeling in Section 2. We next discuss our dataset collection process, including the web-scrapers we release for gathering public-domain U.S. state law text (Section 3). In Section 4 we describe our lightweight and modular span and relation annotation interface which we used to collect data. Next, in Section 5, we describe our web-app, where we surface our model's output to journalists and engage volunteers to improve our annotations. Finally, we discuss an ongoing use-case to illustrate how one might use our app in Section 6.

We briefly introduce the key components of our discourse schema, the dataset we collect, and the BERT-CRF-based modeling we perform.

The five principal discourse elements that we identify in our schema are: SUBJECT, CONSEQUENCE, OBJECT, PROBE and TEST. PROBES, SUBJECTS and OBJECTS are entities while TESTS and CONSEQUENCES are verb phrases. We describe each in turn.

The first three elements in our schema, SUBJECT, CONSEQUENCE, and OBJECT, uncover how law dictates first-degree interactions between entities. 6 As such, a SUBJECT is an entity directly gaining a power or a restriction under a law. The CONSEQUENCE is the specific power or restriction placed on the SUBJECT. The OBJECT is the entity being affected by the SUBJECT's gain in powers or restrictions. 7 OBJECTS are not always present, 8 and SUBJECTS might be expressed passively. 9

The last two elements in our schema, TEST and PROBE, indicate when laws apply. A TEST is an explicit condition applied to an entity to determine when a SUBJECT-CONSEQUENCE-OBJECT relation holds. A PROBE is an entity that is tested but not part of a legally-mediated relationship. 10 A TEST need not only apply to a PROBE; it can apply to a SUBJECT, OBJECT, or even a CONSEQUENCE.
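The schema above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the `Label`, `Span` and `make_span` names are hypothetical, the shortened paragraph is adapted from Figure 1, and the two labels assigned to it are only one illustrative reading of the definitions.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    SUBJECT = "SUBJECT"
    CONSEQUENCE = "CONSEQUENCE"
    OBJECT = "OBJECT"
    PROBE = "PROBE"
    TEST = "TEST"


# Per the schema: PROBES, SUBJECTS and OBJECTS are entities,
# while TESTS and CONSEQUENCES are verb phrases.
ENTITY_LABELS = {Label.SUBJECT, Label.OBJECT, Label.PROBE}


@dataclass(frozen=True)
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: Label

    def text(self, paragraph: str) -> str:
        return paragraph[self.start:self.end]

    @property
    def is_entity(self) -> bool:
        return self.label in ENTITY_LABELS


def make_span(paragraph: str, phrase: str, label: Label) -> Span:
    """Locate the first occurrence of `phrase` and wrap it as a labeled span."""
    start = paragraph.index(phrase)
    return Span(start, start + len(phrase), label)


# Shortened paragraph adapted from Figure 1; the labels are an assumption.
paragraph = (
    "in counties having a population of not less than 335,000 "
    "nor more than 336,000, the magistrate shall be selected"
)
probe = make_span(paragraph, "counties", Label.PROBE)
test = make_span(
    paragraph,
    "having a population of not less than 335,000 nor more than 336,000",
    Label.TEST,
)
```

Storing character offsets rather than copied strings keeps each span tied to its source paragraph, which is convenient when spans are later aggregated across laws.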

Our dataset collection is described in Section 3. From our full dataset, we sample a set of law paragraphs to annotate. We build an annotation framework, described in Section 4, and enlist two expert annotators. As of this writing, we have 573 law paragraphs annotated according to the span-level schema described in Section 2.1.

We test a class of models aimed at performing span-tagging, and use our top-performing model, BERT-CRF (based on Li et al. (2019); Spangher et al. (2021)), in the current production pipeline. Our model accuracy improves with more data, and a primary purpose of our web-app is to collect more annotations from users (Section 4). We provide our predicted outputs in the database as well.
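Span-tagging models of this kind typically emit one tag per token, which must then be grouped into spans. The sketch below assumes a BIO tagging scheme (an assumption; the text does not state which encoding the BERT-CRF uses) and shows how token-level tags could be collapsed into labeled token ranges. The function name `bio_to_spans` is hypothetical.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse BIO tags into (start, end_exclusive, label) token spans."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # Open a span on "B-", or tolerate a stray "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


example_tags = ["B-PROBE", "B-TEST", "I-TEST", "O", "B-SUBJECT"]
example_spans = bio_to_spans(example_tags)
```

Treating a stray `I-` tag as the start of a new span is a lenient design choice; a stricter decoder could discard such tags instead.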

Our collected dataset comprises more than 100,000 active state-level laws in the United States, including roughly 6,000 laws that reference Census population counts. We gather these laws from a public-domain law website, Justia, 11 and directly from state websites. 12 In total, we build parsers for 24 official state websites and one for Justia.

11 https://www.justia.com/

12 Some of the laws provided by Justia, such as those for Colorado, contain data in unhelpful forms, such as PDF files (see https://law.justia.com/codes/colorado/2019/), so we extract directly from state websites in these cases.

State law is always public domain; 13 yet in practice, it is often inaccessible for bulk downloads and web scraping. For instance, many websites license LexisNexis, a for-profit company, as the official provider for their state codes. 14 Although these websites are publicly and freely accessible (as opposed to other LexisNexis-hosted resources, which require memberships to access), they employ a range of mechanisms (e.g. timeouts, dynamically-generated URLs, cookie-based access) that make them difficult to scrape. 15 To circumvent these blocking mechanisms, our scrapers are robust and, in many cases, mimic human web-browsing behavior in order to provide full access to public-domain law. We develop a generalized scraper for LexisNexis Public Access websites that uses scrapy 16 and selenium-webdriver. 17 In order to scrape Justia, we launch three Google Compute Engine (GCE) instances for a total of 60 compute hours. We will open-source our scraping routines as well as the Docker images we created to perform these scrapes. Our scrapers are flexible and can download either full codes or all codes related to a query. Our Docker containers contain the necessary Python libraries (e.g. scrapy) and binary libraries (selenium-webdriver and gsutil).
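One common way to cope with timeouts while pacing requests like a human reader is to combine randomized pauses with retries. The sketch below illustrates that general idea only; it is not the released scraper code. The `fetch_with_retries` function, its parameters, and the injected `fetch` callable are all hypothetical, and a real scraper would pass in a scrapy or selenium-based fetcher.

```python
import random
import time
from typing import Callable, Optional


def human_like_delay(rng: random.Random, low: float = 1.0, high: float = 4.0) -> float:
    """Pick a randomized pause (in seconds) between page requests."""
    return rng.uniform(low, high)


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    """Call `fetch(url)`, pausing before each attempt and backing off on timeouts."""
    rng = rng or random.Random()
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        sleep(human_like_delay(rng))   # pace requests irregularly
        try:
            return fetch(url)
        except TimeoutError as err:
            last_error = err
            sleep(2 ** attempt)        # exponential backoff: 2s, 4s, 8s, ...
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts") from last_error
```

Injecting `fetch` and `sleep` keeps the retry logic independent of any particular HTTP or browser-automation library and makes it easy to exercise without network access.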

We wanted a lightweight, modular, JavaScript-based annotation framework that could handle span annotation and relation assignment. We wanted it to be easily adaptable for multiple use-cases: (1) to serve as a standalone web-app to be distributed to undergraduate helpers and volunteers, (2) to be integrated into a larger web-app to help site visitors annotate new laws and correct problematic annotations they see, and (3) to be compiled to Amazon Mechanical Turk (AMT) HTMLQuestions. 18 Many web-based, NLP-focused annotation tools exist (78, by Neves and Ševa (2021)'s count). We surveyed options including brat (Stenetorp et al., 2012), GATE (Bontcheva et al., 2013), YEDDA (Yang et al., 2017) and WebAnno (Yimam et al., 2013), and found that none of them met our requirements. Although all these options were web-based, none were flexible enough to be integrated easily into larger websites. Further, none were able to automatically generate AMT tasks.

Not finding an appropriate solution, we designed a simple and modularized annotation framework in 600 lines of jQuery, JavaScript and HTML, with a Datastore backend. 19 Our annotation framework supports span annotation and relation tagging.

The annotation interface itself, shown in Figure 2, is powered by a stateful page object, called PageHandler, that is instantiated with several parameters (page_height, buttons, relations) and handles all of the page interactions. The PageHandler is placed directly in the HTML page containing the text to be annotated, so any service that can render text can automatically become an annotation service. In our case, we built Jinja templates to render our HTML, since our server is coded in Python-Flask. We additionally provide a helper function that, with input data, can compile our Jinja templates as static, fully-functional AMT HTMLQuestions.
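To make the compile step concrete, the sketch below renders a page that instantiates a PageHandler and wraps it in the HTMLQuestion XML envelope that AMT expects. It is an illustration under several assumptions: the standard-library `string.Template` stands in for Jinja, the JavaScript constructor call is a guess at how PageHandler might be invoked, and `render_page` and `to_html_question` are hypothetical names. The schema URL should be checked against the current AMT documentation.

```python
import html
import json
from string import Template

# Stand-in for a Jinja template; the PageHandler call is hypothetical.
PAGE = Template("""<!DOCTYPE html>
<html>
<body>
  <div id="annotation-text">$paragraph</div>
  <script>
    // Assumed constructor shape; the real PageHandler may differ.
    var handler = new PageHandler($config);
  </script>
</body>
</html>""")

# XML envelope for an AMT HTMLQuestion.
HTML_QUESTION = Template(
    '<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/'
    'AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">\n'
    "  <HTMLContent><![CDATA[\n$html\n]]></HTMLContent>\n"
    "  <FrameHeight>$frame_height</FrameHeight>\n"
    "</HTMLQuestion>"
)


def render_page(paragraph, page_height, buttons, relations):
    """Render the annotation page for one paragraph of law text."""
    config = json.dumps(
        {"page_height": page_height, "buttons": buttons, "relations": relations}
    )
    return PAGE.substitute(paragraph=html.escape(paragraph), config=config)


def to_html_question(page_html, frame_height=600):
    """Wrap a rendered page as a static HTMLQuestion document."""
    if "]]>" in page_html:
        raise ValueError("page HTML must not contain the CDATA terminator ']]>'")
    return HTML_QUESTION.substitute(html=page_html, frame_height=frame_height)
```

A call such as `to_html_question(render_page(text, 600, ["SUBJECT", "TEST"], []))` would then yield a single static string per task that can be submitted to AMT.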

Figure 2: Our lightweight annotation interface allows users to identify spans by highlighting text, assigning labels to spans using the dropdown, and assigning relations by right-clicking to draw lines. We track user-identity using Google sign-in. (While our interface can collect relational labels, we do not study them in this work.)

18 An HTMLQuestion is an HTML page that is submitted to AMT, https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_HTMLQuestionArticle.html. AMT handles the traffic split and data collection.

19 Google Datastore is a NoSQL, scalable JSON store, which is suitable for our use case. https://cloud.google.com/datastore

We use a Datastore backend to track progress towards annotation tasks, as shown in Figure 4. We code data entries (the equivalent of MySQL tables) to track helper statistics (helper_summ), how many tasks are left to assign (incomp_tasks), and how many annotations have been completed (comp_annot). We track these statistics to ensure that we can obtain multiple annotations for each task, and that no helper sees the same task more than once. We perform one GET request at the beginning of each user session to collect user statistics and then use client-side cookies throughout the session to minimize the number of requests we send to the back-end. We use a NoSQL database because such databases are low-latency and designed for streaming, and we use Datastore because our web-app is hosted on Google App Engine. We include our Datastore management back-end as part of the annotation package. To use our tool with other NoSQL providers, 20 a port is necessary.
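The two bookkeeping guarantees described above (several annotations per task, and no helper seeing a task twice) can be sketched with an in-memory stand-in for the Datastore entries. The `TaskAssigner` class and its methods are hypothetical; only the names `helper_summ`, `incomp_tasks` and `comp_annot` come from the text, and they are used here purely as labels in comments.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class TaskAssigner:
    """In-memory stand-in for the task-tracking entries kept in Datastore."""

    def __init__(self, task_ids: Iterable[str], annotations_per_task: int = 2):
        self.task_ids: List[str] = list(task_ids)
        self.annotations_per_task = annotations_per_task
        # Analogue of comp_annot: task id -> helpers who completed it.
        self.completed: Dict[str, Set[str]] = defaultdict(set)

    def incomplete_tasks(self) -> List[str]:
        """Analogue of incomp_tasks: tasks still needing annotations."""
        return [
            t for t in self.task_ids
            if len(self.completed[t]) < self.annotations_per_task
        ]

    def next_task(self, helper_id: str) -> Optional[str]:
        """Return a task this helper has not yet seen, or None if none remain."""
        for task_id in self.incomplete_tasks():
            if helper_id not in self.completed[task_id]:
                return task_id
        return None

    def record(self, helper_id: str, task_id: str) -> None:
        self.completed[task_id].add(helper_id)

    def helper_summary(self, helper_id: str) -> int:
        """Analogue of helper_summ: number of tasks completed by one helper."""
        return sum(helper_id in helpers for helpers in self.completed.values())
```

Keying completions by task makes both checks a simple set lookup; a production version would persist these sets rather than hold them in process memory.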

The annotation interface is designed for modularity first and foremost. We release the annotation code as part of this framework. In the future, the annotation component of this project will be abstracted and distributed in its own stand-alone JavaScript package for the benefit of researchers needing a similar application.

20 e.g. Amazon DynamoDB, https://aws.amazon.com/dynamodb/

We design a web interface that serves three principal use cases: (1) enabling full-text search on our database, (2) exposing users to our discourse schema by extracting spans across laws and (3) allowing users to both correct or update existing annotations and provide new ones. Our website has two principal flows: Flow 1: searching laws by keyword, and Flow 2: grouping laws by span, shown in Figure 3.

In Flow 1, users can use a query box to perform full-text and faceted search on laws and then click on returned results to read the full text of the law. ElasticSearch powers both of these endpoints, as shown in Figure 4. This flow is useful when journalists want to explore a specific term or concept irrespective of its discourse role, or simply familiarize themselves with the corpus.
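A full-text and faceted search of this kind can be expressed as an Elasticsearch request body. The sketch below only builds the JSON body; it does not contact a server. The field names `text` and `state`, the aggregation name, and the `build_search_body` function are assumptions for illustration and may not match the deployed index.

```python
import json
from typing import Optional


def build_search_body(query_text: str, state: Optional[str] = None, size: int = 10) -> dict:
    """Build a query-DSL body: keyword search plus an optional state facet."""
    body = {
        "size": size,
        "query": {
            "bool": {
                # Full-text match; query_string understands AND/OR operators.
                "must": [
                    {"query_string": {"query": query_text, "default_field": "text"}}
                ],
                # Facet restrictions go in the non-scoring filter clause.
                "filter": [],
            }
        },
        # Per-state counts to populate the facet sidebar.
        "aggs": {"laws_per_state": {"terms": {"field": "state"}}},
    }
    if state is not None:
        body["query"]["bool"]["filter"].append({"term": {"state": state}})
    return body


body = build_search_body("alcohol OR liquor OR beverage", state="Tennessee")
payload = json.dumps(body)  # what would be sent to the search endpoint
```

Placing the facet in `filter` rather than `must` keeps it from affecting relevance scores, which is the usual choice for exact-match restrictions.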

In Flow 2, users can view aggregate counts of different discourse elements, by type, across the corpus. This helps to summarize the corpus from a functional standpoint, as described in Section 2. Users navigate this flow by clicking on one of five buttons to see the counts of each of the five principal discourse spans, then clicking on any of the returned span results to view all laws with this span. MySQL serves both of these endpoints (and provides additional metrics, such as a map in the about.html page, not shown here).
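The two Flow 2 endpoints amount to a grouped count followed by a lookup. The sketch below uses the standard-library `sqlite3` module as a stand-in for MySQL; the column names, the toy rows, and both function names are hypothetical, and only the table name `span_summ` is taken from the paper's description of the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE span_summ (law_id TEXT, span_type TEXT, span_text TEXT)")
# Toy rows for illustration only; they are not drawn from the real corpus.
conn.executemany(
    "INSERT INTO span_summ VALUES (?, ?, ?)",
    [
        ("law-1", "PROBE", "counties"),
        ("law-2", "PROBE", "counties"),
        ("law-2", "PROBE", "municipalities"),
        ("law-3", "SUBJECT", "trial court judge"),
    ],
)


def count_spans(conn, span_type):
    """First click: number of laws containing each span of the given type."""
    return conn.execute(
        "SELECT span_text, COUNT(DISTINCT law_id) AS n_laws "
        "FROM span_summ WHERE span_type = ? "
        "GROUP BY span_text ORDER BY n_laws DESC, span_text",
        (span_type,),
    ).fetchall()


def laws_with_span(conn, span_type, span_text):
    """Second click: all laws in which the chosen span occurs."""
    rows = conn.execute(
        "SELECT DISTINCT law_id FROM span_summ "
        "WHERE span_type = ? AND span_text = ? ORDER BY law_id",
        (span_type, span_text),
    ).fetchall()
    return [row[0] for row in rows]
```

Counting distinct `law_id` values avoids inflating the totals when the same span appears more than once within a single law.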

In both flows, visitors can access our annotation framework, described in Section 4. From Flow 1, they can click search results to tag a specific paragraph, and from Flow 2 they can click to correct an annotated paragraph. Additionally, they can annotate a randomly selected paragraph by clicking "Help Us Tag."

We describe two example articles that are currently being explored by users of our system.

In the first example, journalists hypothesized that the allocation of new liquor licenses might be population-based. To explore this, they used Flow 1; they searched for the term "alcohol OR liquor OR beverage" in the search interface and discovered that the interface returned 270 laws. They reached out to us and, offline, we analyzed the breakdown by state. 21 We found that the states most likely to base liquor licenses on population counts were Tennessee, New York and Illinois. They then asked us to extract all TESTS from these laws. We found that mid-size cities would be the most likely to be impacted by a 5% or 10% undercount in population. The journalists identified key cities and sought sources in these areas. This work is ongoing.

Figure 4: Our database setup. ElasticSearch contains the full text of laws for Flow 1. MySQL contains tables for summary statistics, rendering annotation HTML, and collecting annotation data and model output (BERT-CRF feeds into MySQL span_summ). Datastore handles task assignment for volunteers.

In another example, journalists explored Flow 2. They noticed that some TESTS are based on explicit population thresholds (e.g., Figure 1) and that some of these thresholds were very narrow. They reached out to us. We compiled several keyword filters and regular expressions to extract specific population thresholds. We found that, in Tennessee in particular, over 40% of all Census-related laws imposed narrow population tests of fewer than 500 people and 10% imposed tests of fewer than 100 people. This raised questions: what is the purpose of these narrowly targeted laws? Were they trying to target specific counties without mentioning them by name? The journalists are now investigating further by tracking down the authors of these laws.
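A regular expression of the kind mentioned above might look like the following sketch. It handles only the "not less than ... (N) nor more than ... (M)" phrasing seen in Figure 1 and is not the authors' actual filter set; the pattern and the `population_ranges` function are illustrative assumptions.

```python
import re
from typing import List, Tuple

# Matches "not less than <words> (335,000) nor more than <words> (336,000)".
THRESHOLD = re.compile(
    r"not less than[^()]*\((\d[\d,]*)\)[^()]*nor more than[^()]*\((\d[\d,]*)\)",
    re.IGNORECASE,
)


def population_ranges(text: str) -> List[Tuple[int, int, int]]:
    """Return (lower, upper, width) for each population window found in text."""
    ranges = []
    for low, high in THRESHOLD.findall(text):
        lower = int(low.replace(",", ""))
        upper = int(high.replace(",", ""))
        ranges.append((lower, upper, upper - lower))
    return ranges


# Excerpt from the Figure 1 paragraph (Tennessee § 36-5-402).
figure_1_excerpt = (
    "in counties having a population of not less than three hundred "
    "thirty-five thousand (335,000) nor more than three hundred thirty-six "
    "thousand (336,000), according to the 1990 federal census"
)
windows = population_ranges(figure_1_excerpt)  # [(335000, 336000, 1000)]
```

Computing the width of each window makes it straightforward to filter for the narrow tests discussed above, for instance by keeping only ranges whose width falls below a chosen cutoff.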

Although the field of AI-driven legal aids is multifaceted and growing (Kauffman and Soares, 2020), free and open-source frameworks remain few (Morris, 2019; Dale, 2019; Vergottini, 2011). Our discourse-driven web application, designed for legal exploratory analysis, is one of the few AI-powered, free applications that exist, and the first to open-source tools for legal document collection.

For-profit legal inquiry systems, as mentioned above, are numerous. Bloomberg Law, 22 Westlaw, 23 LexisNexis 24 and Wolters Kluwer 25 are the four main services for legal research (Dale, 2019), which provide subscription-based, Google-style searches. CaseText 26 and Ravel 27 were two upstart case-text search engines (although both have now been acquired); CaseText offered crowdsourced annotations and Ravel linked cases together to create visual maps of important cases (Lee et al., 2015). We similarly provide a way of collecting user annotations, and a novel way of linking together cases, although ours is linguistically grounded in discourse theory.

Various discourse schemas have been developed to understand law texts, including deontological logic-based schemas (Wyner and Peters, 2011; Zeni et al., 2015), and subject matter-specific schemas (Espejo-Garcia et al., 2019). Ours is the first discourse-based approach to take steps towards a big-data approach by setting up a framework for the ingestion of crowdsourced annotations.

Finally, outside of the legal domain, other areas have experienced a growth in academically-oriented systems for human-in-the-loop inquiry. The COVID-19 pandemic has produced a burst in NLP-driven corpora-collection (Wang et al., 2020), demonstrations (Sohrab et al., 2020; Hope et al., 2020; Spangher et al., 2020) and workshops (Verspoor et al., 2020b,a).

Such concerted effort in the NLP domain to expose open resources and build free tools for subject matter experts is an inspiring guide for how researchers can contribute to wider inquiries. We hope such acknowledgement of our ability as NLP researchers to empower others becomes commonplace for other applications as well, forming a common alliance between academics, civil-minded journalists and other researchers and end-users.

In this work, we have presented three open-source components: (1) a web-app highlighting a novel discourse schema and its application to U.S. Census-related state law; (2) a flexible and modular annotation framework that can be seamlessly embedded into web-apps to allow visitors to contribute and update annotations; and (3) a set of 25 web-scrapers to help researchers gather public-domain legal text. Our concrete goal is to facilitate journalistic exploration into laws possibly affected by the 2020 U.S. Census, an area in which we already have multiple ongoing projects. Our longer-term goal is to collect feedback and data, and improve our database and machine learning systems. We hope that such efforts can continue to push Legaltech into a more open and accessible domain, and make it easier to understand the laws governing our society.

There were several possible ethical considerations we encountered during this research which we wish to address.

Dataset Creation: The creation of our dataset involved scraping numerous websites, including state websites, state-licensed LexisNexis pages and https://www.justia.com. In the third case, Justia, we did not violate any terms of service. In fact, Justia's robots.txt file 28 is the most permissive possible, giving unlimited license to any crawler. It is generally accepted that robots.txt files are implied licenses of access, 29 and we did not disregard Justia's file before scraping.
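Checking a robots.txt policy before crawling can be done with Python's standard `urllib.robotparser` module. The sketch below is illustrative: the two policy strings are made-up examples (they are not copies of Justia's actual file), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An empty Disallow rule permits every path for every crawler.
PERMISSIVE = "User-agent: *\nDisallow:\n"
# A contrasting example that blocks one directory.
RESTRICTIVE = "User-agent: *\nDisallow: /private/\n"

open_site = is_allowed(PERMISSIVE, "law-scraper", "https://example.org/codes/tennessee/")
blocked = is_allowed(RESTRICTIVE, "law-scraper", "https://example.org/private/page")
```

In a live crawler, the same parser can load the file directly with `set_url` and `read` instead of being handed the text.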

Content derived from the first two categories, state law websites and official, state-licensed websites like LexisNexis, is, by law, public domain (Wolfe, 2019; MacWright, 2013). Web-scraping the public domain is neither illegal nor unethical (Mehta, 2021). As we did in the body of the paper, we again emphatically criticize attempts by providers to make web-scraping difficult, and we went to lengths to overcome this.

Dataset Annotation: All parties involved in annotating our dataset received valid compensation. We relied entirely on expert researchers to collect our annotations. This included the authors of this paper. All the researchers who provided annotations for us were affiliated with our institution and compensated appropriately by our institution (we leave the determination of "appropriate" for our institution to define).

Although we describe accommodating AMT tasks in the body of the paper, thus far, we have not used any annotations made by Turkers on AMT or by journalists/researchers using our site. If we do, we will ensure there are no ethical issues by securing university IRB approval or exemption, as deemed fit by the IRB. For the Turkers, we will calculate a payment that equals, on average, $15 an hour. For the journalists/researchers, we will have exchanged something of value (the use of our web-app) for the annotation.

Website Usage: Our website has significant accessibility limitations for the seeing-impaired and for non-English speakers. We have not addressed them in this current version, but are mindful and actively searching for options to expand accessibility.

28 Found here https://www.justia.com/robots.txt. Such files govern the site-owners' standards for scraping and crawling.

29 https://stackoverflow.com/questions/999056/ethics-of-robots-txt

There are two ways in which seeing-impaired users might suffer. First, blind users will not be able to read any of the site without external tools, as we have not recorded or built in any native audio scripts, keyboard shortcuts or voice-activated commands. Secondly, part of our website, Flow 2, introduces users to our discourse schema by introducing them to . The web-app does not attempt to natively include accessibility options for seeing-impaired users.

Our website focuses on U.S.-based laws and contains only English-language text. We do not attempt, in this version, to perform translations. Our plan in the present iteration of this work was to work with U.S.-based journalists studying U.S.-based law. We have not yet undertaken a study to compare how well our discourse schema would apply to non-U.S. law, be it common or civil (Dainow, 1966). However, if this approach proves useful for journalists and researchers, we will certainly seek to undertake this.

Civil rights, social equity, and census 2020

Gate teamware: a web-based, collaborative text annotation framework

The civil law and the common law: some points of comparison

Law and word order: NLP in legal tech

End-to-end sequence labeling via deep learning for automatic extraction of agricultural regulations

Artificial intelligence approach to legal reasoning

Chatbot lawyer overturns 160,000 parking tickets in London and New York. The Guardian

SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search

AI in legal services: new trends in AI-enabled legal services

A new era: Integrating today's next gen research tools ravel and casetext in the law school classroom

Discourse tagging for scientific evidence extraction

State law is public domain. what's public domain? macwright

Automation in the legal world. National Physical Laboratory

US court says scraping a site without permission isn't illegal

Census citizenship question is dropped, but challenges linger

Making mischief with opensource legal tech

Counting an invisible class of citizens: The lgbt population and the us census

An extensive review of tools for manual annotation of documents

Counting for dollars 2020: the role of the decennial census in the geographic distribution of federal funds

BENNERD: A neural named entity linking system for COVID-19

Sz-rung Shiang, and Lingjia Deng. 2021. Multitask learning for class-imbalanced discourse classification

Enabling low-resource transfer learning across COVID-19 corpora by combining event-extraction and co-training

brat: a web-based tool for NLP-assisted text annotation

To go open source or not?

2020a. Proceedings of the 1st Workshop on NLP for COVID-19

2020b. Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020. Association for Computational Linguistics

CORD-19: The COVID-19 open research dataset

U.S. high court to rule on scope of copyright for legal codes

On rule extraction from regulations

Yedda: A lightweight collaborative text span annotation tool

Webanno: A flexible, web-based and visually supported system for distributed annotations

Gaiust: supporting the extraction of rights and obligations for regulatory compliance