key: cord-0075786-sez7pt1h
authors: Wei, Bin; Kuang, Kun; Sun, Changlong; Feng, Jun; Zhang, Yating; Zhu, Xinli; Zhou, Jianghong; Zhai, Yinsheng; Wu, Fei
title: A full-process intelligent trial system for smart court
date: 2022-03-18
journal: Front Inform Technol Electron Eng
DOI: 10.1631/fitee.2100041
sha: d9ce0c7af0744b3f769a12ff212adbcf59695f24
doc_id: 75786
cord_uid: sez7pt1h

In constructing a smart court, to provide intelligent assistance for achieving more efficient, fair, and explainable trial proceedings, we propose a full-process intelligent trial system (FITS). In the proposed FITS, we introduce essential tasks for constructing a smart court, including information extraction, evidence classification, question generation, dialogue summarization, judgment prediction, and judgment document generation. Specifically, the preliminary work involves extracting elements from legal texts to assist the judge in identifying the gist of the case efficiently. With the extracted attributes, we can justify each piece of evidence’s validity by establishing its consistency across all evidence. During the trial process, we design an automatic questioning robot to assist the judge in presiding over the trial. It consists of a finite state machine representing procedural questioning and a deep learning model for generating factual questions by encoding the context of utterances in a court debate. Furthermore, FITS summarizes, in real time and under a multi-task learning framework, the controversy focuses that arise from a court debate, and generates a summarized trial transcript in the dialogue inspectional summarization (DIS) module. To support the judge in making a decision, we adopt first-order logic to express legal knowledge and embed it in deep neural networks (DNNs) to predict judgments. Finally, we propose an attentional and counterfactual natural language generation (AC-NLG) method to generate the court’s judgment.

During the COVID-19 pandemic, online trials based on the intelligent trial system have become ubiquitous. The smart court relies on Internet courts to move offline litigation activities online. Online trials reduce the movement of personnel and keep trials in working order. The smart court has successfully implemented full-service online processing and built a comprehensive, multi-functional, and intensive online litigation platform, which has alleviated urgent judicial needs. The Supreme People's Court promptly issued the "Notice on Strengthening and Standardizing Online Litigation during the COVID-19 Prevention and Control Period," which set out a comprehensive deployment of online litigation so that courts could conduct proceedings through the smart court. The smart court has formulated clear regulations for judicial tasks, such as online court hearings, electronic service, identity authentication, and material submission, and has provided full judicial services and guarantees for promoting and regulating online litigation. According to statistics for the COVID-19 period (from February 3 to November 4, 2020), the people's courts at four levels filed 6.501 million cases online, held 778 000 online court sessions, conducted 3.23 million online mediations, and completed 18.15 million electronic services.

To make the smart court operate efficiently and improve trial efficiency in simple cases, Zhejiang Higher People's Court, Zhejiang University, and the Alibaba Group have jointly developed a full-process intelligent trial system (FITS), which provides strong technical support for constructing a smart court for the Zhejiang Provincial People's Court. FITS has played an essential role in financial lending and private lending cases: it moves the trial procedures of the court to the network platform, supports judicial trials in a highly informative manner, and assists judges in making judicial decisions. As shown in Fig. 1, the intelligent trial system implements the following judicial tasks: (1) extracting essential information from the legal text (indictment, lending contract, court debate transcript, etc.) to help the judge promptly grasp the key case information; (2) summarizing the controversy focuses from the court debate transcript recorded during the trial; (3) verifying the authenticity, legality, and relevance of the evidence; (4) recommending candidate questions to the judges to assist in the necessary trial procedures and discover facts related to the case; (5) retrieving the most similar cases from the historical data, and leveraging the knowledge of legal experts to predict case facts and help judges make judicial decisions; (6) generating a judgment document with complete structure, complete elements, and rigorous logic after confirming the facts of the case and applying laws and regulations.

Zhejiang University and the Alibaba Group have conducted much research on the above judicial tasks. Zhao et al. (2018) proposed a named entity recognition model based on the BiLSTM-CRF architecture, with two novel techniques of multi-task data selection and constrained decoding. Liu XJ et al. (2018) introduced a graph convolution based model to combine textual and visual information presented in visually rich documents (VRDs). Zhou et al. (2019) studied a novel research task of legal dispute judgment (LDJ) prediction for e-commerce transactions, which connects two isolated domains, e-commerce data mining and legal intelligence. Duan et al. (2019) introduced a delicately designed multi-role and multi-focus utterance representation technique and provided an end-to-end solution specializing in controversy focus based debate summarization (CFDS) via joint learning. Wang et al. (2020) investigated dialogue context representation learning with various types of unsupervised pretraining tasks, where the training objectives were given naturally according to the nature of the utterance and the structure of multi-role conversation. Wu et al. (2020) proposed a novel attentional and counterfactual natural language generation (AC-NLG) method, in which counterfactual decoders were employed to eliminate the confounding bias in data and generate judgment-discriminative court views by incorporating a synergistic judgment predictive model. Ji et al. (2020) proposed a novel network architecture, cross copy networks (CCNs), for content generation by simultaneously exploring the logical structure of the current dialogue context and similar dialogue instances.

Fig. 1 Overview of the full-process intelligent trial system (FITS) (ASR: automatic speech recognition; OCR: optical character recognition; NLP: natural language processing)

FITS is designed to follow the trial process and to emulate the way in which a judge makes judicial decisions. We adopt a combination of a knowledge-guided method and a big data driven method. The knowledge-guided method simulates judges by using knowledge and logical reasoning to make judgments. The big data driven method simulates judges by making judgments based on the principle of "treating like cases alike." Most of the technologies in the papers cited above directly serve FITS. Many new technologies were born in developing this system, and their original purpose was to perform the judicial tasks in trial practice. FITS applies these technologies to reengineer the existing case trial process and promote the intelligence of all nodes of the judicial process. In practice, FITS also provides judges and parties with intelligent assisting services at each node of the case trial procedure. Based on these works, we will show the operation process of the intelligent trial system. To summarize, we make the following contributions:

1. We are the first to propose an FITS that serves primary phases of the trial procedure in the smart court.

2. We convert central judicial tasks of the trial procedure into corresponding natural language processing (NLP) problems, and adopt a combination of knowledge-based models and data-driven models.

3. Based on our FITS, we have developed an artificial intelligence (AI) judge assistant robot called Xiaozhi (micro intelligence) and achieved satisfactory results that have already assisted several courts in Zhejiang Province in financial lending cases and private lending cases.

The rest of this paper is organized as follows: In Section 2, we introduce a BiLSTM-CRF neural architecture and use it for legal text (indictments, judgment documents, etc.) information extraction. In Section 3, we justify the validity of evidence based on historical data and logical knowledge graphs. In Section 4, we propose an automatic questioning system to help judges ask procedural and factual questions. In Section 5, we summarize the focuses of the dispute during a trial by employing a multitask learning framework called CFDS and propose a framework of dialogue inspectional summarization (DIS). In Section 6, we combine first-order logic and deep neural networks to discover the facts of the case. In Section 7, we propose the AC-NLG method to generate the court's judgment-discriminative view. In Section 8, we introduce the results achieved by FITS in the application to smart court. Section 9 discusses related research work and the last section concludes this paper.

Information extraction (IE) aims to extract structured information from unstructured documents. It has been explored extensively due to its significant role in NLP. Legal information extraction includes the extraction of legal ontology, legal relations, and legal named entities. Earlier research studied the extraction of legal case information (Jackson et al., 2003), and combined information retrieval and machine learning to extract the correlation between current cases and precedent texts using support vector machine (SVM) and other algorithms. A neural network based transfer learning approach (Elnaggar et al., 2018) has been trained for linking named entities in legal documents. Recently, the popular neural structure for IE, BiLSTM-CRF (Lample et al., 2016), has shown excellent performance on numerous sequence-labeling tasks with high robustness and low computational complexity. We have collected more than 70 million judgment documents to build the corpus, including more than 360 000 court records and more than 100 000 evidence samples.

Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a type of recurrent neural network (RNN) architecture that, in conjunction with an appropriate gradient-based learning algorithm, addresses the vanishing/exploding gradient problem of learning long-term dependencies by introducing a memory cell with self-connections that stores the temporal state of the network. Although numerous LSTM variants have been described, we employ the version proposed by Google (Sak et al., 2014). LSTM takes as input a sequence of vectors $x=(x_1, x_2, \ldots, x_n)$ and returns another sequence $y=(y_1, y_2, \ldots, y_n)$; the network can then be calculated using the following equations iteratively:

where $W$ is the weight matrix, $b$ is the bias vector, $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$, and $c$ are, respectively, the input gate, forget gate, output gate, and cell activation vectors, all of which are of the same size as the cell output activation vector $m$, and $\odot$ is the element-wise product of the vectors.
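The recurrences that this symbol list refers to are not reproduced above. As a hedged reconstruction, the formulation of Sak et al. (2014), whose notation the symbol list follows, is usually written as below. The cell input and output activations $g$ and $h$ (typically tanh), the network output activation $\phi$, and the peephole weights $W_{ic}$, $W_{fc}$, $W_{oc}$ are taken from that paper and are assumptions with respect to the present text.

```latex
\begin{align}
i_t &= \sigma\left(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g\left(W_{cx} x_t + W_{cm} m_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o\right) \\
m_t &= o_t \odot h\left(c_t\right) \\
y_t &= \phi\left(W_{ym} m_t + b_y\right)
\end{align}
```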

The LSTM model takes past information into account, but ignores future information, because conventional RNNs are only able to make use of the previous context. Bidirectional LSTM (BiLSTM) can better exploit context in forward and backward directions. BiLSTM (Graves and Schmidhuber, 2005) combines bidirectional RNNs (BRNNs) with LSTM. BRNNs present each training sequence forward and backward to two separate recurrent nets by processing the data in both directions with two separate hidden layers that are fed forward to the same output layer. The hidden state of BiLSTM at time $t$ generates the forward hidden sequence $\overrightarrow{h}_t$ and the backward hidden sequence $\overleftarrow{h}_t$. A popular probabilistic method for structured prediction, conditional random fields (CRFs), is widely applied to segmenting and labeling sequence data. The advantage of CRFs is that they avoid a fundamental limitation of maximum entropy Markov models (MEMMs) based on directed graphical models (Lafferty et al., 2001). We describe the definition of a general CRF (Sutton and McCallum, 2007) based on a general factor graph. Let $G$ be a factor graph over $X$ and $Y$. Then $(X, Y)$ is a conditional random field if for any value $x$ of $X$, the distribution $p(y \mid x)$ factorizes according to $G$. If $F = \{\Psi_a\}$ is the set of factors in $G$, then the conditional distribution for a CRF has the form

where $A$ is the number of factors in the collection, both feature functions $f_{ak}$ and weights $\theta_{ak}$ are indexed by factor index $a$ to emphasize that each factor has its own set of weights, and $Z(x)$ is a normalization factor over all state sequences for sequence $x$.
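The conditional distribution that these symbols describe is not reproduced above. A hedged reconstruction, following the general CRF definition of Sutton and McCallum and using the symbols just listed, is given below. The per-factor feature count $K(a)$ and the factor arguments $x_a$, $y_a$ (the variables touched by factor $a$) are notation assumed here, not taken from the present text.

```latex
\begin{equation}
p(y \mid x) = \frac{1}{Z(x)} \prod_{a=1}^{A}
  \exp\left\{ \sum_{k=1}^{K(a)} \theta_{ak}\, f_{ak}(x_a, y_a) \right\},
\qquad
Z(x) = \sum_{y'} \prod_{a=1}^{A}
  \exp\left\{ \sum_{k=1}^{K(a)} \theta_{ak}\, f_{ak}(x_a, y'_a) \right\}
\end{equation}
```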

BiLSTM-CRF is a widely adopted neural architecture for sequence labeling problems, including entity recognition. It is a hierarchical model, and the architecture is illustrated in Fig. 2. The network can effectively obtain two-way input features through the BiLSTM layer and sentence-level tags through the CRF layer. Note that the CRF layer has a state transition matrix as a parameter, and we can effectively use past and future tags to predict the current tag. The first layer of the model maps words to their embeddings. $X=(x_1, x_2, \ldots, x_n)$ is a sentence composed of $n$ words in a sequence, regarded as input to a BiLSTM layer. In the second layer, word embeddings are encoded and the output is $h = (h_1, h_2, \ldots, h_n)$. We record the features extracted from the linear layer as matrix $P = (p_1, p_2, \ldots, p_n)$, in which the element $p_{ij}$ corresponds to the score of the $j$th tag of the $i$th word in a sentence. We introduce a tagging transition matrix $T$, where $T_{ij}$ represents the score of transition from tag $i$ to tag $j$ in successive words. The score of the sentence $X$ along with a sequence of predictions $Y=(y_1, y_2, \ldots, y_n)$ is then given by the sum of transition scores and network scores:

A softmax for all tag sequences obtains the normalized probability:

where $Y_X$ represents all possible tag sequences for a sentence $X$. The model is trained by maximizing the log-probability of the correct tag sequence (Lample et al., 2016). From this, BiLSTM-CRF obtains the sequence of output tags. In decoding the prediction, we seek the optimal path with the maximum score, given by $y^* = \arg\max_{\tilde{y} \in Y_X} \mathrm{score}(X, \tilde{y})$.
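The scoring, normalization, and decoding steps described above can be illustrated with a minimal pure-Python sketch. It assumes the emission matrix `P` and the transition matrix `T` are already available as nested lists, omits the start and end tags used by Lample et al. (2016), and normalizes by brute force, which is feasible only for toy inputs. All numbers are made up for illustration.

```python
from itertools import product
from math import exp, log


def sequence_score(P, T, tags):
    """Sum of network scores P[i][y_i] and transition scores T[y_i][y_{i+1}]."""
    score = sum(P[i][t] for i, t in enumerate(tags))
    score += sum(T[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(P, T):
    """Return the highest-scoring tag sequence and its score."""
    n, k = len(P), len(P[0])
    best = list(P[0])   # best[j]: best score of a partial path ending in tag j
    back = []           # back[i-1][j]: previous tag on that best path
    for i in range(1, n):
        prev_best, best, pointers = best, [], []
        for j in range(k):
            cand = [prev_best[a] + T[a][j] for a in range(k)]
            a_star = max(range(k), key=cand.__getitem__)
            best.append(cand[a_star] + P[i][j])
            pointers.append(a_star)
        back.append(pointers)
    last = max(range(k), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def log_probability(P, T, tags):
    """log p(tags | X): score minus the log-sum-exp over all tag sequences."""
    k = len(P[0])
    all_paths = product(range(k), repeat=len(P))
    log_z = log(sum(exp(sequence_score(P, T, y)) for y in all_paths))
    return sequence_score(P, T, tags) - log_z


# Toy example: 3 words, 3 tags (illustrative values only).
P = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.0]]
T = [[0.0, 1.0, -1.0], [0.0, 0.5, 0.2], [0.3, -2.0, 0.0]]
best_path, best_score = viterbi(P, T)
```

In a trained model the brute-force sum in `log_probability` would be replaced by the forward algorithm; it is kept here only to make the softmax over tag sequences explicit.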

Domain adaptation maps the labeled source domain and the target domain, which have different data distributions, to the same feature space (embedding manifold). BiLSTM-CRF is combined with domain adaptation to explore external datasets (Zhao et al., 2018), as illustrated in Fig. 3, in which the fully connected layer maps the distributed feature representation to the sample label space. The CRF features can be computed separately, i.e., $\phi_T(x) = G_T \cdot h$ and $\phi_S(x) = G_S \cdot h$ for the target and source datasets, respectively. The loss functions for $p(y \mid x; \theta_T)$ and $p(y \mid x; \theta_S)$ are optimized in alternating order.

BiLSTM-CRF has been widely used in named entity recognition (Lample et al., 2016; Liu XJ et al., 2018) and information extraction (Yang ZL et al., 2017; Zhao et al., 2018) in the legal domain. FITS applies it to financial lending cases and private lending cases. Taking the financial lending case (Zhao et al., 2018) as an example, the coverage of the extraction includes 45 types of documents (loan contract, loan extension contract, guarantee contract, mortgage contract, credit contract, pledge contract, pledge registration certificate, joint repayment commitment documents, loan vouchers, guarantor industrial and commercial registration materials, etc.), involving about 550 kinds of elements (plaintiff, defendant, defendant's ID card, litigation claims, facts and reasons, loan amount, loan contract number, signing date, the content of the indictment, etc.). On average, there are at least seven elements (fields) to be extracted from each document.

The BiLSTM-CRF model first maps each input character to a word vector that is pre-trained on a large corpus (usually based on word2vec, GloVe, BERT, or other language models). Then the model uses BiLSTM to encode the word vector sequence and obtains the BiLSTM word encoding after concatenation. The BiLSTM word encoding is used as input to the top CRF layer to obtain the final beginning, inside, and outside (BIO) tagging result, from which the extracted information is obtained. For the example in Fig. 2, the information "joint and several liabilities" in the input will be marked and extracted. Meanwhile, many original materials are obtained through optical character recognition (OCR) or automatic speech recognition (ASR). Missing information and noise exist in the recognition process, so we use regularization rules to extract some particular information fields as a supplement.
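As an illustration of how BIO tags turn into extracted fields, the sketch below groups tagged tokens into labeled spans. The English tokens and the label names `PARTY` and `LIABILITY` are hypothetical; the source does not list its tag inventory.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans."""
    spans = []
    label, words = None, []
    # A trailing ("", "O") sentinel closes a span that runs to the end.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(token)          # continue the current span
            continue
        if label is not None:
            spans.append((label, " ".join(words)))
        # "B-x" starts a span; a stray "I-x" is treated leniently as a start.
        label, words = (name, [token]) if prefix in ("B", "I") else (None, [])
    return spans


tokens = ["the", "guarantor", "bears", "joint", "and", "several", "liabilities"]
tags = ["O", "B-PARTY", "O", "B-LIABILITY", "I-LIABILITY", "I-LIABILITY", "I-LIABILITY"]
fields = bio_to_spans(tokens, tags)
```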

In practice, we divide all information into two categories: general fields and specific fields. General fields refer to fields that are included in every case, such as party information. Specific fields are fields unique to a type of case, such as the date of contract signing for financial loan cases. For any case, the general fields are extracted by a common model shared by all cases, and the specific fields are extracted by the proprietary model for that type of case. In other words, each legal case text is processed by two models, one for each category of fields.

To avoid supervised learning that requires a large amount of data annotation, we also adopt the transfer learning method. We use the annotated data of one case reason to improve the information extraction ability for another case reason through transfer learning. The diagram of the transfer learning model for a "financial lending case" and a "private lending case" is shown in Fig. 3. The model adds a domain-specific fully connected layer (FCL) between the BiLSTM layer and the CRF sequence output layer, thereby enhancing the model's transfer learning ability.
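The idea of a shared encoder with one fully connected layer per domain can be sketched as follows, mirroring the relation $\phi(x) = G \cdot h$ given earlier. The matrices, the hidden vector, and the domain keys are made-up placeholders; training and the alternating optimization are not shown.

```python
def matvec(G, h):
    """Plain matrix-vector product for nested lists."""
    return [sum(g * x for g, x in zip(row, h)) for row in G]


# Hypothetical domain-specific projection matrices (3 tags x hidden size 2).
DOMAIN_HEADS = {
    "financial_lending": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "private_lending": [[0.0, 1.0], [1.0, 0.0], [0.2, 0.8]],
}


def crf_features(h, domain):
    """Shared encoding h, projected by the layer that belongs to `domain`."""
    return matvec(DOMAIN_HEADS[domain], h)


h = [2.0, 4.0]  # stand-in for one shared BiLSTM hidden state
phi_financial = crf_features(h, "financial_lending")
phi_private = crf_features(h, "private_lending")
```

Because `h` comes from the same encoder in both calls, annotations from one case type shape the representation used by the other, while each CRF layer still sees scores from its own projection.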

In the trial, evidence analysis plays an essential role in determining the facts of the case. The primary task is to classify the evidence, which aims to divide each piece of evidence into different categories, and its purpose is to study the characteristics of different types of evidence and its application rules. The evidence materials discussed here are texts or images (for example, evidence in private lending cases includes loan agreements, guarantee conditions, payment delivery, repayment conditions, etc.). The second task of evidence analysis is to justify each piece of evidence's authenticity, legality, and relevance. These three aspects determine whether the evidence is probative.

We classify different types of evidence through multi-modal analysis. The preliminary work of evidence classification is to extract text evidence from the original evidence materials through OCR technology. We then use the NLP engine to understand the text content and extract the semantic features at the text level. For the parts of the evidence materials from which OCR cannot identify or accurately extract useful information, we introduce visual feature recognition to improve the effect of evidence recognition. Finally, the text features and visual features are merged to classify the evidence. For simplicity, we mainly introduce here the classification of the evidence after it is extracted as text.

We propose a classifier by representing the evidence in a vector. Specifically, we employ the BiLSTM model introduced in the previous section to build a classifier to perform evidence classification. We apply a hierarchical attention network (Yang ZC et al., 2016) for evidence classification. The model constructs a hierarchical structure of "word-sentence-evidence text" and has two levels of attention mechanisms, applied at the word level and the sentence level. We borrow the idea that the model uses the attention mechanism twice under the hierarchical structure. We embed evidence in a vector representation by first using word vectors to represent sentence vectors and then using sentence vectors to represent evidence vectors.

We first encode words by embedding the words in vectors through a matrix $W$, and then use the BiLSTM model to obtain annotations of words by summarizing information from both directions. Afterward we obtain an annotation for a given word $w$ by concatenating the hidden states, $h = [\overrightarrow{h}; \overleftarrow{h}]$, which summarizes the information of the whole sentence centered around $w$. Then we introduce the attention mechanism to extract words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector. We have $u_w = \tanh(W_w h + b_w)$ as a hidden representation of $h$ and obtain a normalized importance weight $\alpha$ through a softmax function. The sentence vector is then the weighted sum of the word annotations based on these weights.

After we have the sentence vectors, we obtain a vector of the evidence in a similar way. We again use BiLSTM to encode the sentences, apply the attention mechanism with a sentence-level context vector $u_s = \tanh(W_s h + b_s)$, and then have $v = \sum_i \alpha_i h_i$, which is the evidence vector that summarizes all the information of the sentences in the evidence text. The evidence vector $v$ is a high-level representation of the evidence and can be used as features for evidence classification:
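The classification formula itself is not reproduced here. In the hierarchical attention network of Yang ZC et al. (2016), it is a softmax layer over $v$. Under that assumption, the pure-Python sketch below shows one attention pooling step followed by the softmax classifier. The learned context vector `context`, the classifier weights `W_c` and `b_c`, and all numeric values are illustrative placeholders, and the BiLSTM annotations `H` are given rather than computed.

```python
from math import exp, tanh


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, x, b):
    """Compute W x + b for nested-list W and list vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def attention_pool(H, W, b, context):
    """Weighted sum of annotations H with weights softmax(tanh(W h + b) . context)."""
    scores = []
    for h in H:
        u = [tanh(z) for z in affine(W, h, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alpha = softmax(scores)
    pooled = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return pooled, alpha


# Three sentence annotations of size 2 (stand-ins for BiLSTM outputs).
H = [[0.1, 0.2], [1.5, -0.3], [0.0, 0.9]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
context = [1.0, 0.5]
v, alpha = attention_pool(H, W, b, context)

# Softmax classifier over three hypothetical evidence categories.
W_c = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
b_c = [0.0, 0.0, 0.0]
class_probs = softmax(affine(W_c, v, b_c))
```

The same `attention_pool` function applies at the word level (words to sentence vector) and at the sentence level (sentences to evidence vector), which is the double use of attention described above.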

An overview of evidence classification is shown in Fig. 4. Evidence analysis also contributes to the formation of the evidence chain, which can visually show the case fact structure. This helps the judge sort out the details of the case and grasp the trial's progress. Evidence confirmation ensures that every piece of evidence in the evidence chain is legal and credible. Evidence classification automatically identifies different types of evidence and provides structured input for the components of the evidence chain.

The justification of evidence is the prerequisite of legal reasoning and fact-finding. The attributes of evidence are reflected in three aspects: (1) authenticity of the evidence, including authenticity of the form and authenticity of the content; (2) legality of the evidence, including legality of the source and legality of the state; (3) relevance of the evidence, that is, whether the evidence is related to the facts to be proved. Two novel methods are proposed to characterize these three attributes.

First, we evaluate the authenticity and the legality of evidence based on the analysis of historical data. In practice, it is not appropriate to determine the attributes of evidence from the legal text itself. The judge determines the authenticity and legality of evidence depending on the state of the evidence and the procedure of obtaining the evidence. The technical proposal is to mine massive evidence materials from real cases and then to calculate the prior probabilities of certain types of materials. On this basis, we build a knowledge base composed of different kinds of evidence with prior probability. According to the relevant evidence in the historical data, we evaluate the attributes by adopting the Bayesian theory to assess the probability that the evidence is real or legally obtained.
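A minimal sketch of the Bayesian step, assuming the knowledge base stores one prior per evidence type and that an observed feature of the evidence (for instance, how it was obtained) has known likelihoods. The type names, priors, and likelihoods below are hypothetical and not taken from the system.

```python
def posterior_authentic(prior, p_obs_given_authentic, p_obs_given_not):
    """Bayes' rule: P(authentic | observation)."""
    numerator = prior * p_obs_given_authentic
    evidence = numerator + (1.0 - prior) * p_obs_given_not
    return numerator / evidence


# Hypothetical knowledge base: prior probability of authenticity per evidence type,
# as would be estimated from historical cases.
PRIORS = {"loan_contract": 0.95, "handwritten_iou": 0.80}

# Hypothetical likelihoods of observing a notarized copy given authentic / not authentic.
posterior = posterior_authentic(PRIORS["handwritten_iou"], 0.9, 0.3)
```

With these made-up numbers the observation raises the estimate above the prior, because it is three times as likely for authentic evidence as for inauthentic evidence.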

Second, we evaluate the relevance of evidence by analyzing the relationship between evidence and relevant knowledge. We adopt a logical knowledge graph based reasoning method to automatically determine the relevance of evidence. For example, for the "financial lending case," we sort out the correlations between various types of evidence based on the judge's experience and form a logical map for reviewing relevance. For all relevant evidence materials, if there is a direct or indirect relevance between the elements of any two sets of evidence, we consider the relevance of the evidence to be valid. We apply a logical graph $G = \langle E, R \rangle$ to represent the relevance of evidence, where $E$ is a set of nodes representing the types of evidence, and $R$ is a set of links representing the relationships between two pieces of evidence.
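One way to read "direct or indirect relevance" is reachability in the graph. The sketch below assumes undirected links and uses breadth-first search. The evidence types and links are hypothetical examples rather than the judges' actual review map.

```python
from collections import deque


def build_graph(links):
    """Adjacency sets for an undirected graph given as (node, node) pairs."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def is_relevant(graph, source, target):
    """True if target is reachable from source through direct or indirect links."""
    if source not in graph or target not in graph:
        return False
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


# Hypothetical relevance links between evidence types.
LINKS = [
    ("loan_contract", "loan_voucher"),
    ("loan_voucher", "repayment_record"),
    ("guarantee_contract", "loan_contract"),
]
G = build_graph(LINKS)
```

For example, `is_relevant(G, "guarantee_contract", "repayment_record")` follows two intermediate nodes, while an evidence type that never appears in `LINKS` is reported as not relevant.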

During the trial process, we design an automatic questioning robot to assist the judge in presiding over the trial. The trial is a particular multi-agent dialogue situation. The participants include the judge, the plaintiff, and the defendant. The judge is the trial organizer, while the plaintiff and the defendant ask questions to understand the facts. They also need to maintain order in the court trial and promote the trial process. The automatic questioning system for the judge contains multiple modules: First, the judge's original speech is converted into text with ASR, and then the text is transformed into the context and state of the questioning system with semantic understanding. Second, a module for question management (QM) is constructed and the candidate questions are generated within this module. Finally, automatic questioning is realized with a text-to-speech (TTS) technique that transforms the text into speech. According to the question's content, we divide the judge's questions into two categories: procedural questioning and factual questioning.

Procedural questioning refers mainly to relatively fixed questions used by judges to organize and promote court trials, such as "identity information of the plaintiff and the defendant" and "the plaintiff and the defendant read the indictment and the defense." Procedural questioning is closely tied to the steps of the trial procedure and therefore has strong regularity. The procedural questioning system focuses on asking these questions automatically during the trial procedure. The following is a sequence diagram (Fig. 5) of an automatic questioning system, where fact stands for a node of factual questioning, while procedure stands for a node of procedural questioning. Factual questioning is inserted into the process of procedural questioning, and multiple fact nodes can be inserted. It can be seen that an essential function of procedural questioning is state management.

Fig. 5 Sequence diagram of the automatic questioning system (recoverable labels: Output, Input, Fact, state)

The process of procedural questioning can be defined as a natural language generation problem, and the solutions include rule-based methods and abstract generation methods. The rule-based approach has the advantages of accuracy and practicability, but it requires a large number of custom rules. The abstract generation method currently has technical bottlenecks; the generated text usually contains incomplete, repeated, or faulty utterances. The automatic questioning system proposes a scheme that combines a finite state machine (FSM) and an affair map. The finite state machine is responsible for state management, and the affair map is responsible for selecting subsequent actions and can also flexibly configure templates for downstream text generation.
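A minimal sketch of the state-management idea, assuming a fixed linear procedure in which each state carries one question template and factual questions may be inserted between procedural steps. The state names, the transition table, and most of the wording are hypothetical. For brevity, a single table stands in for both the finite state machine and the affair map.

```python
# Hypothetical table: state -> (next state, question template).
PROCEDURE = {
    "open_session": ("verify_identity", "The court is now in session."),
    "verify_identity": (
        "read_indictment",
        "Please state the identity information of the plaintiff and the defendant.",
    ),
    "read_indictment": ("read_defense", "The plaintiff shall read the indictment."),
    "read_defense": ("closed", "The defendant shall read the defense."),
}


class QuestioningFSM:
    """Tracks the procedural state and lets factual questions be inserted."""

    def __init__(self, start="open_session"):
        self.state = start
        self.pending_facts = []

    def insert_fact_question(self, question):
        """Queue a factual question to be asked before the next procedural step."""
        self.pending_facts.append(question)

    def next_question(self):
        """Return the next utterance, or None when the procedure is finished."""
        if self.pending_facts:
            return self.pending_facts.pop(0)   # fact node: state is unchanged
        if self.state == "closed":
            return None
        next_state, template = PROCEDURE[self.state]
        self.state = next_state                # procedure node: advance the state
        return template
```

Answering a queued factual question does not advance `state`, which is how several fact nodes can sit between two procedure nodes.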

The judge's factual questioning is aimed mainly at the factual elements of the plaintiff's and defendant's petitions and defenses and also refers to the factual questions that the judge has asked before. Factual questioning is treated as a text generation problem. We obtain the factual questions raised by the judge from the historical dialogues of trials, using joint learning of classification and retrieval. Therefore, we first need to define dialogue in the trial and then give an encoder that carefully represents the hierarchical information in the dialogue context.

Let $D = \{U_1, U_2, \ldots, U_n\}$ denote a dialogue containing $n$ utterances, where each utterance $U_i$ is composed of a sequence of words (namely sentence) $S_i$, which is the text content of $U_i$. We employ BiLSTM to encode the semantics of the utterance. BiLSTM has been widely recognized for encoding the utterance's semantics while maintaining its syntax (Wang et al., 2020). We use BiLSTM to learn a feature representation of dialogue by masking and recovering its unit elements, such as evidence and laws in the legal domain for trial dialogue.

In the utterance layer, the input source is a set of dialogue information obtained by transcribing the judge's factual questions, denoted as a sequence $\{\mathrm{utterance}_1, \mathrm{utterance}_2, \ldots, \mathrm{utterance}_n\}$, and each utterance is composed of the questions asked by the judge. It contains $L$ utterances, where each utterance $U_i$ is composed of a sequence of $l$ words (namely sentence) $S_i = \{w_{i1}, w_{i2}, \ldots, w_{il}\}$ and the associated role (the judge) $r_i$. We employ BiLSTM to encode the semantics of the utterance. Note that the judge's role information should be embedded in the utterance. We concatenate the judge's role information with each word in the sentence so that the same word can be projected into different dimensional spaces. The representation of BiLSTM is obtained by concatenating its left and right context representations.

To strengthen the relevance between words in an utterance, the attention mechanism is employed to obtain $U_i$, which can be interpreted as a local representation of an utterance:

where $Q_u$ are learnable parameters.

In the dialogue layer, to represent the global context in the dialogue, we use BiLSTM again to encode the dependencies between utterances and obtain a global representation of each utterance, which is expressed as $U_i$.

where $\mathrm{dim}_h$ refers to the dimensionality of the hidden state $h$. We next perform word segmentation on the judge's utterance in the dialogue and compute a word vector representation for each word segment to obtain $X = \{x_1, x_2, \ldots, x_n\}$, and then employ BiLSTM and other neural network units to encode $X$ and conduct automatic feature selection. Because the judge's questioning in the dialogue contains many utterances, this generates a new vector sequence $V_J = \{v_1, v_2, \ldots, v_n\}$. We further use the attention mechanism to perform a secondary representation of $V_J$. These neural network units can enhance information interaction between different levels of dialogue. After the hierarchical representation, we obtain a mapping from $V_J$ to $V_J^h = \{h_1, h_2, \ldots, h_n\}$, where $v$ and $h$ have a one-to-one correspondence.

Because the judge's factual questions are related to the case's facts in the dialogue between the plaintiff and the defendant, it is also necessary to segment the plaintiff's litigation request and the text of the defendant's defense. We first compute the word vector for each word segment to obtain $Y = \{y_1, y_2, \ldots, y_n\}$. We then employ the attention mechanism to encode $Y$ to form an encoding vector $V_W$ for each combination. The function of $V_W$ is to encode the information of the plaintiff's request and the defendant's defense in the encoded text. We combine the element $y$ in $V_W$ and the element $h$ in $V_J^h$ one by one according to the serial number, and the combination result is recorded as $V_J^h = \{h^t_1, h^t_2, \ldots, h^t_n\}$. The new representation contains the prosecution and defense information of the plaintiff and the defendant as well as information about the judge's questions in the dialogue.

We employ a classification task to recommend the most likely question categories. We first predefine a number of question categories. Under each question category, there are several standard question templates. For example, "recovery of debt" and "the spouses' joint debt" belong to different question categories. When recommending questions to the judge, the system takes the indictment and pleading, as well as the historical questions raised by the judge, as input, and returns the top K most likely question categories according to the steps described above. Finally, for each of the top K question categories, it returns the standard question template with the highest probability. An example of factual questioning is shown in Fig. 6.
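The final selection step can be sketched as follows, assuming the classifier has already produced a probability per question category and per template. The two category names come from the text; the third category, all templates, and all probabilities are hypothetical.

```python
def recommend_questions(category_probs, template_probs, k=2):
    """For the k most probable categories, return (category, best template)."""
    top_categories = sorted(category_probs, key=category_probs.get, reverse=True)[:k]
    return [
        (category, max(template_probs[category], key=template_probs[category].get))
        for category in top_categories
    ]


# Hypothetical classifier outputs.
category_probs = {
    "recovery of debt": 0.55,
    "the spouses' joint debt": 0.30,
    "interest agreement": 0.15,
}
template_probs = {
    "recovery of debt": {
        "Has the plaintiff demanded repayment from the defendant?": 0.7,
        "When was the last demand for repayment made?": 0.3,
    },
    "the spouses' joint debt": {
        "Was the loan used for the couple's shared living expenses?": 0.6,
        "Did the spouse sign the loan contract?": 0.4,
    },
    "interest agreement": {
        "Did the parties agree on an interest rate?": 1.0,
    },
}
recommendations = recommend_questions(category_probs, template_probs, k=2)
```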

Trial summarization consists of two tasks. The first task is to summarize the court debate transcript during the trial stage. The other task is to summarize the controversial focuses of the dialogue in the trial. Summarization-based algorithms have enabled a broad spectrum of applications, such as auto-abbreviated news and retrieval outcomes (Gerani et al., 2014), to assist users in consuming lengthy documents effectively. Thanks to the development of ASR techniques, dialogue summarization (Goo and Chen, 2018; Liu CY et al., 2019) has also attracted much attention in recent years, with exemplar applications like the judicial trial, customer service, and meeting summarization. Unlike a plain document, a multi-role dialogue is more complicated due to the interactions among various parties. Enhanced representation of the atomic components (e.g., utterance and role) of the dialogue is a prerequisite for optimizing summary generation.

During the trial process, the judge needs to discover the common focus of the dispute between the plaintiff and the defendant in the debate and identify how the two sides defend and refute the other party's arguments. The summary of the dialogue during the trial is vital in helping the judge grasp the critical information in the dialogue between the two parties. This information includes both useful facts that appear during the dialogue (for example, private lending cases include the names of the parties, loan amounts, repayment records, etc.) and the focal point of the case (for example, a fact that both parties have repeatedly defended and questioned). The judge finally completes the case trial by analyzing the focus of the dialogue between the two parties and combining the judgment logic.

We have realized the automatic generation model of court trial abstracts in the intelligent trial system, mainly the automatic abstracts of dispute focuses. This task includes (1) extracting dialogue fragments related to the dispute focus in the dialogue and (2) classifying the dispute focus corresponding to each dialogue. Through the generation and processing of the court trial summary, the judge can obtain important dispute fragments in the court trial dialogue, to understand and deal with the court trial more efficiently.

The Alibaba Group proposed a multi-task learning framework called CFDS (Duan et al., 2019; Wang et al., 2020) to summarize the focus of court disputes, which includes mainly the following parts: (1) Using a sequence encoder, we model the text of the trial, semantic information of dispute focus, the role related to utterances, and the node sequence in the corresponding legal knowledge graph, and obtain the vector representation of context information through an attention mechanism. (2) According to the different dispute focuses, the focus classifier takes the category of the dispute focus involved in each utterance as the target, and obtains the label of the dispute focus.

(3) For the court record summary extraction task, the objective of the summary extraction classifier is whether each utterance is extracted.

We adopt a multi-task learning strategy including the following parts: (1) the prediction of the controversy focus, (2) the highlighted sentence, and (3) the recognition of sentence elements. To distinguish between different roles in the dialogue, such as judge, plaintiff, and defendant, we use different embeddings to represent different roles. We apply word embedding to express an utterance in the dialogue through a convolutional neural network and pooling mechanism, and then use a CNN with an attention mechanism to express the entire dialogue. The process of trial summarization is shown in Fig. 7 .

Fig. 7 The process of trial summarization

5.1.1 Controversy focus assignment

The first task is to assign a controversy focus to each utterance. Different debate dialogues may have various controversy focuses, and the judge concludes each controversy focus according to the content of debate dialogue D. Because the number of controversy focuses varies in different debate dialogues and each controversy focus differs in semantics and syntax, we can hardly cope with this task using text classification. We calculate the relevance between utterance u i and each controversy focus f m in F with respect to debate dialogue D.

To do so, we need to compute the embedding of each controversy focus. As both controversy focuses and sentences in the debate are natural language, we use the BiLSTM encoder to obtain the controversy focus embedding f m . In addition, not every utterance u i is assigned a controversy focus. Some utterances do not belong to any controversy focus and they can be regarded as irrelevant content, namely noise. Thus, a category Noise is created for every debate dialogue and a dense vector is used to represent it. Then we calculate the attention score α f ij of utterance u i with f j :

Controversy focus with the highest normalized score α f ij is the controversy focus assigned to u i .
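The assignment step can be sketched as follows. The sketch assumes that utterances and controversy focuses are already encoded as vectors and that relevance is scored by a dot product followed by softmax normalization. The scoring function is an assumption made here for illustration and may differ from the one used in CFDS. The Noise category is appended as one extra candidate vector.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def assign_focus(
    utterance: List[float],
    focus_embeddings: List[List[float]],
    noise_embedding: List[float],
) -> Tuple[int, List[float]]:
    """Return (index, normalized scores); index == len(focus_embeddings) means Noise."""
    candidates = focus_embeddings + [noise_embedding]
    # Assumed relevance score: dot product between utterance and focus vectors.
    scores = [dot(utterance, f) for f in candidates]
    alpha = softmax(scores)
    # The focus with the highest normalized score is assigned to the utterance.
    best = max(range(len(alpha)), key=alpha.__getitem__)
    return best, alpha


# Toy vectors (invented): two controversy focuses plus the Noise vector.
focuses = [[2.0, 0.0], [0.0, 2.0]]
noise = [0.1, 0.1]
index, alpha = assign_focus([1.0, 0.0], focuses, noise)
print(index, [round(a, 3) for a in alpha])
```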

The second task aims to extract the crucial utterances from the debate dialogue about the different controversy focuses and to form multiple summarizations. The utterance extractor considers two aspects: utterance content and controversy focuses. To enhance utterance representation learning, we employ the normalized controversy focus distribution as the input to this task:

Then F i and u i are concatenated and fed into the fully connected layers as follows:

where W fc 1 and W fc 2 are two weight matrices and o i ∈ [0, 1] is the output of the utterance extractor, which indicates the probability of extracting utterance u i .
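A minimal sketch of the utterance extractor is given below. It assumes a ReLU between the two fully connected layers and a sigmoid on the output so that o i lies in [0, 1]. Both activation choices are assumptions, and bias terms are omitted because the text names only the two weight matrices. All numeric values are invented.

```python
import math
from typing import List


def matvec(W: List[List[float]], x: List[float]) -> List[float]:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def extraction_probability(
    F_i: List[float],
    u_i: List[float],
    W_fc1: List[List[float]],
    W_fc2: List[List[float]],
) -> float:
    """Probability o_i of extracting utterance u_i."""
    x = list(F_i) + list(u_i)                          # concatenation [F_i; u_i]
    hidden = [max(0.0, h) for h in matvec(W_fc1, x)]   # first layer, ReLU assumed
    logit = matvec(W_fc2, hidden)[0]                   # second layer, one output unit
    return 1.0 / (1.0 + math.exp(-logit))              # sigmoid keeps o_i in [0, 1]


# Toy inputs: a distribution over three focuses and a 2-d utterance vector.
F_i = [0.7, 0.2, 0.1]
u_i = [0.5, -0.5]
W_fc1 = [[1.0, 0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0, 1.0]]
W_fc2 = [[1.5, -1.0]]
print(extraction_probability(F_i, u_i, W_fc1, W_fc2))
```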

In the court debate scenario, the judge summarizes the case narrative based on facts recognized from the court debate during the trial and relies on the evidence or materials submitted by the litigants. We particularly propose a framework of DIS, which includes four parts: (1) For the text of the trial transcript, the multi-role dialogue encoder can hierarchically and serially model the semantics of the court trial transcript, and obtain the vectorized representations of the word level, speech level, and dialogue level, respectively. (2) The decoder uses the attention mechanism and the replication mechanism to generate the sequence results identified by the court.

(3) The target fact element regularizer classifies the relevance of fact elements, and the element level in the generated text should be consistent with the content of the court trial. (4) The missing fact entity discriminator uses the classification of missing fact entities to predict the inconsistency between the decoder state representation and the dialogue encoding representation in fact entity classification.

We design a hierarchical dialogue encoder involving role information to accommodate extended context and multiple turns among the multiple roles. Rather than directly aligning the input dialogue and its summary, within the generation framework, we propose two additional tasks in the manner of joint learning: expectant factual aspect regularization (EFAR) estimates the factual aspects to be contained in the summary so that the model emphasizes the factual coverage of logical reasoning, and missing factual entity discrimination (MFED) predicts the missing aspects, thereby flagging the factual gap between the input and the output. Specifically, the DIS framework is shown in Fig. 8.

We propose an inspectional decoder for generating summaries. The inspectional decoder generates the summary via a pointing mechanism, while the expectant factual aspect regularizer ensures factual consistency from the aspect level.

From the perspective of bionics, humans tend to write a draft before focusing on factual aspects. We treat the inspectional decoder as a drafter, whose states need to be further regularized by the aspect-aware module.

With the pointing mechanism integrated, the decoder can directly copy tokens from dialogue, making the generated summary more accurate and relevant in factual details.

When writing formal documents like the legal verdict, people always carefully review their drafts to ensure that there are no inconsistencies in the expected aspects. Inspired by this process, we propose an expectant factual aspect regularizer to verify the aspect level's consistency.

For each aspect e i , we use the aspect encoder to obtain its semantic embedding a i . The encoder Enc A is single-layer bidirectional LSTM to represent the aspect description text:

We then produce a weighted sum of the decoder hidden states, known as the aspect-aware decoder state s a :

where K is the number of factual aspects and the score function uses additive attention:

$$\mathrm{score}(a_i, s_t) = v^{\top} \tanh\big(\mathrm{linear}(a_i, s_t)\big).$$

Finally, we feed s a into a three-layer classifier to predict the expectant aspects:

where F a is the notation of linear layers and y a ∈ R K indicates the related probability of K aspects.
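The additive attention above can be sketched for a single aspect as follows. The sketch assumes that the linear layer acts on the concatenation of a i and s t, and that the scores are softmax-normalized over the decoder steps before the weighted sum is taken. How the per-aspect results are aggregated over the K aspects and passed to the classifier is not reproduced here. The function names and toy values are hypothetical.

```python
import math
from typing import List, Tuple


def additive_score(a: List[float], s: List[float], W: List[List[float]], v: List[float]) -> float:
    """v^T tanh(W [a; s]); applying the linear layer to the concatenation is an assumption."""
    x = list(a) + list(s)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def aspect_aware_state(
    a: List[float],
    decoder_states: List[List[float]],
    W: List[List[float]],
    v: List[float],
) -> Tuple[List[float], List[float]]:
    """Attention-weighted sum of decoder hidden states for one aspect embedding."""
    scores = [additive_score(a, s, W, v) for s in decoder_states]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over decoder steps
    dim = len(decoder_states[0])
    state = [sum(w * s[d] for w, s in zip(weights, decoder_states)) for d in range(dim)]
    return state, weights


# Toy example: one 2-d aspect embedding and two 2-d decoder states.
a = [1.0, 0.0]
states = [[1.0, 0.0], [0.0, 1.0]]
W = [[0.5, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, -1.0]]
v = [1.0, 1.0]
state, weights = aspect_aware_state(a, states, W, v)
print(state, weights)
```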

There are always factual inconsistencies between the dialogue and reference summary. In the Seq2Seq framework, inconsistencies mislead the decoder to generate incorrect factual details. The missing factual entity discriminator tries to detect the inconsistencies, thus mitigating the problem. Motivated by this observation, we design the discriminator to classify whether the factual entity is missing in the conversation. In real applications, human summarizers can refer to the predictions to complete generated text based on additional information. Intuitively, we view inconsistency as the factual divergence between source and target content, using the bilinear layer as the classifier.

Legal judgment prediction (LJP) is one of the most attractive research topics in the field of legal AI (Xiao et al., 2018; Chao et al., 2019; Zhong et al., 2020a, 2020b). LJP aims to predict legal judgment based on a legal text including the description of the case facts. Most previous works treated LJP as a text classification task and generally adopted DNN-based methods to solve it. Zhong et al. (2018) and Yang WM et al. (2019) used multi-task learning to capture the dependencies among subtasks by considering their topological order. Zhong et al. (2020b) applied a question-answering task to improve the interpretability of LJP through reinforcement learning. Luo et al. (2017) formulated legal documents as a knowledge basis and used attention mechanisms to aggregate representations of relevant legal texts to support judgment prediction.

We combine DNNs with a symbolic legal knowledge module, in which legal knowledge is expressed as a set of first-order logic (FOL) rules. The application of FOL to represent domain knowledge has already demonstrated its effectiveness on many other tasks, including visual relation prediction (Xie et al., 2019), natural language inference (Li et al., 2019), and semantic role labeling (Li et al., 2020). Representing legal knowledge as FOL rules makes judgment prediction more interpretable and provides models with an inductive bias, which reduces the dependency on neural networks.

The proposed model unifies the gradient-based deep learning module with the non-differentiable symbolic knowledge module via probabilistic logic. Specifically, we build a deep learning module based on a co-attention mechanism, which benefits the information interaction between fact descriptions and claims. Afterward, the output of the deep learning module, i.e., the predicted probability distribution for judgments, is fed into the symbolic module.

Before presenting how to integrate legal knowledge into DNNs, we briefly introduce FOL to express legal knowledge. To preserve the advantages of gradient-based end-to-end training schema, we convert the Boolean operations of FOL into probabilistic logic, denoted in the continuous real-valued space.

Specifically, we associate the variable X in preconditions with corresponding neural outputs x. Then, Lukasiewicz T-norm and T-conorm (Klement et al., 2000) are used to relax the logic rules to a softened version based on the associated outputs of the deep learning module. A set of mapping functions Γ (·) is defined to map the discrete outputs of FOL into continuous real values as follows:

1. Γ (X i ) = x i with X i denoting a variable in FOL and x i as the associated output of neural networks.

2. Γ (

In designing qualified mapping functions, when the precondition holds, the mapping function should generate a predefined maximum positive score to lift the original score produced by neural networks. The mapping functions should also reveal the semantics of propositional connectives. For example, the conjunctive precondition's mapping score becomes zero if even only one of the conjuncts is false. For a disjunctive precondition, the mapping score becomes zero when all the disjuncts are false. Moreover, the mapping score will increase as the number of disjuncts increases.

In addition to the functions listed above, two mapping functions are used for negated predicates. One of them is for negated predicates in preconditions, e.g., ¬X i . The softened output of ¬X i is denoted as 1 − x i . The other is for negated consequent ¬Y , designated as −y i to reduce neural networks' original outputs.
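The properties described above are satisfied by the standard n-ary Lukasiewicz relaxations, which the following sketch implements. We assume here, without claiming that these are the exact formulas of the paper, that conjunction is relaxed as max(0, Σ x i − n + 1) and disjunction as min(1, Σ x i). The function names are hypothetical.

```python
from typing import List


def gamma_and(xs: List[float]) -> float:
    """Lukasiewicz T-norm for n conjuncts: max(0, sum(x) - n + 1)."""
    return max(0.0, sum(xs) - len(xs) + 1.0)


def gamma_or(xs: List[float]) -> float:
    """Lukasiewicz T-conorm for n disjuncts: min(1, sum(x))."""
    return min(1.0, sum(xs))


def gamma_not(x: float) -> float:
    """Negated predicate in a precondition: 1 - x."""
    return 1.0 - x


# A conjunction collapses to zero as soon as one conjunct is false.
print(gamma_and([1.0, 0.0]))
# A disjunction is zero only when all disjuncts are false, and its score
# grows as more disjuncts contribute, up to the cap of 1.
print(gamma_or([0.0, 0.0]), gamma_or([0.25, 0.5]), gamma_or([0.75, 0.5]))
print(gamma_not(0.25))
```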

We investigate compiling three specific types of legal knowledge into FOL rules, which are frequently referred to by legal experts in private loan cases.

The first legal logic rule comes from article 28 of the Supreme People's Court's Provisions on Several Issues Concerning the Application of Law in the Trial of Private Loan Cases (http://www.court.gov.cn/fabuxiangqing-15146.html). In short, it is stated that the law shall not support the interest rate agreed by the lender and the borrower exceeding four times the quoted interest rate on the one-year loan market when the contract was established. We formulate this legal knowledge as the following FOL rule K 1 :

where X TIR is a variable that indicates if the current claim is for interest. X RIO indicates if the claimed interest rate exceeds four times the quoted interest rate on the one-year loan market. This rule reflects the decrease in the illegitimate interest rate.

The second legal logic rule comes from article 29 of the same law. In short, it is stated that if neither the interest rate during the loan period nor the overdue interest rate has been agreed upon, the people's court shall support the unpaid interest from the date of overdue repayment. We formulate this legal knowledge as the following FOL rule K 2 :

where X RIA indicates if the borrower and the lender have made an agreement on the interest rate, and X DIL indicates if the date of overdue repayment is legitimate.

In private loan law cases, the plaintiff often proposes multiple claims and the judgments on these claims are not independent. For example, when a plaintiff proposes two claims, one is for the principal and the other is for the interest. If the judge does not support the principal claim, then the interest claim should not be supported either. Such prior knowledge should be injected into the deep learning module as well. Another example showing the dependency among multiple claims is that the losing party shall bear the litigation costs. The third FOL rule, K 3 , is formulated as

where X TIC indicates if the current claim is for litigation fees or not. This rule will affect those claims for litigation costs.

We first build a co-attention network as our base model, which can enrich the representations by exchanging information between fact descriptions and claims. Formally, we provide an abstract denotation of the co-attention network as follows:

Here, the encoder and layers are deep neural networks. σ and W are the activation function and model parameters, respectively. Note that the softmax outputs of co-attention networks will be input into the logic module and adjusted accordingly. As shown in Fig. 9 , the proposed model consists of a deep learning module based on co-attention networks and a symbolic legal knowledge module. We first input fact descriptions and multiple claims in the co-attention network to obtain contextual representations for both fact descriptions and claims. The predicted probability distribution of the deep learning module is then re-weighted by first-order logic rules in the symbolic module. The logic rules represent professional legal knowledge, which is essential for making correct judgments. The co-attention model can fuse the claim representations and fact descriptions to create implicit reasoning. However, the related legal knowledge used by legal experts (e.g., lawyers or judges) can hardly be learned by the co-attention network. For example, the rule that a private loan interest rate that exceeds 2% per month is not protected by law may not always be followed by the neural networks. Thus, it is crucial to explicitly inject such declarative legal knowledge into neural networks, so they can make interpretable judgment predictions.

Before introducing substantial legal knowledge related to our private loan scenario, we first show how to inject symbolic FOL rules into the deep learning module using the above mapping functions Γ (·). In short, the core idea of this legal knowledge injection is to re-weight the output y of co-attention networks as introduced in the previous subsection so that when the facts in the text satisfy conditions in the legal knowledge, the associated value of y increases. Otherwise, the value of y decreases.

Specifically, given the softmax outputs y of Eq. (15) and an FOL rule X → Y , the FOL rule and DNNs are combined by regulating the outputs of the deep learning module as follows:

where ρ is a hyper-parameter which denotes the importance of each rule. Through Eq. (16), we can directly regulate the deep learning module's outputs.
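The re-weighting idea can be illustrated with a minimal sketch. It assumes one plausible form of the adjustment: the probability of the consequent class is scaled by exp(ρ Γ(X)), or by exp(−ρ Γ(X)) for a negated consequent, and the distribution is then renormalized. This form, the function names, and the example rule shape X TIR ∧ X RIO → ¬Y (our reading of the description of K 1) are assumptions rather than the paper's stated formulas.

```python
import math
from typing import List


def lukasiewicz_and(xs: List[float]) -> float:
    """Softened conjunction of the precondition variables."""
    return max(0.0, sum(xs) - len(xs) + 1.0)


def apply_rule(
    y: List[float],
    target: int,
    precondition: float,
    rho: float,
    negated: bool = False,
) -> List[float]:
    """Raise (or lower, if the consequent is negated) y[target] and renormalize."""
    shift = -rho * precondition if negated else rho * precondition
    scaled = [p * math.exp(shift) if k == target else p for k, p in enumerate(y)]
    total = sum(scaled)
    return [p / total for p in scaled]


# Toy case: y = [support, non-support] for an interest claim whose rate
# appears to exceed the legal cap. All numbers are invented.
y = [0.6, 0.4]
x_tir, x_rio = 0.9, 0.95
precondition = lukasiewicz_and([x_tir, x_rio])
y_adjusted = apply_rule(y, target=0, precondition=precondition, rho=2.0, negated=True)
print([round(p, 3) for p in y_adjusted])
```

A larger ρ gives the rule more influence over the neural prediction, whereas a precondition score of zero leaves the distribution unchanged.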

Given a set of training samples, the model is trained by maximizing the following objective function:

Judgment document generation is based mainly on the judge's view, which is often regarded as a "court view" in the judgment document (Ye et al., 2018), and its content includes mostly the determination of the case facts and the matching of laws and regulations. Therefore, the core task of judgment document generation is the generation of the court's view. Details about the proposed algorithms and experimental results on court's view generation can be found in our previous conference paper published in EMNLP 2020 (Wu et al., 2020).

Due to the popularity of machine learning, especially NLP techniques, many legal assistant systems have been proposed to improve the effectiveness and efficiency of the legal system from different aspects. The court's view can be regarded as the interpretation of the sentence in a case. As an important portion of the verdict, the court's opinion is difficult to generate due to the logical reasoning required in the content. Therefore, the generation of the court's view is regarded as one of the most critical functions in a legal assistant system. The court's view consists of two main parts, the judgment and the rationales, where the judgment responds to the plaintiff's claims in civil cases or charges in criminal cases, and the rationales are summarized from the fact description to derive and explain the judgment.

In this work, we focus on the problem of automatically generating the court's view in civil cases by injecting the plaintiff's claim and fact description (Fig. 10 ). In such a context, generating the court's view can be formulated as a text-to-text NLG problem, where the input is the plaintiff's claim and the fact description. The output is the corresponding court view, which contains the judgment and the rationales. Because the claims are various, for simplification, the judgment of a civil case is defined as supported if all its requests are accepted and nonsupported otherwise.

The plaintiff A claimed that the defendant B should return the loan of $29 500 [principal claim] and the corresponding interest [interest claim].

After the hearing, the court held the facts as follows: defendant B borrowed $29 500 from plaintiff A, and agreed to return it after one month. After the loan expired, the defendant failed to return it [fact].

The court concluded that the loan relationship between plaintiff A and defendant B is valid. The defendant failed to return the money on time [rationale]. Therefore, the plaintiff's claim on principal was supported [acceptance] according to law. The court did not support the plaintiff's claim on interest [rejection] because the evidence was insufficient [rationale].

Fig. 10 An example of the court's view from a legal document (Wu et al., 2020)

Although classical NLG models have been applied to many text-generation tasks, such techniques cannot be applied directly to generating the court's view for the following reasons: (1) The "no claim, no trial" principle exists in civil legal systems; the judgment is the response to the claims declared by the plaintiff, and its rationales summarize the corresponding facts. (2) The distribution of judgment results in civil cases is very imbalanced. Such an imbalance of judgment would blind the model's training by focusing on the supported cases while ignoring the non-supported cases, leading to incorrect judgment generation of the court's view.

To address these challenges, we propose the AC-NLG method by jointly optimizing a claim-aware encoder, a pair of counterfactual decoders to generate judgment-discriminative court views (both supportive and non-supportive), and a synergistic judgment predictive model. Comprehensive experiments show the effectiveness of our method under both quantitative and qualitative evaluation metrics.

Causal inference (Pearl, 2009; Kuang et al., 2020) is a powerful statistical modeling tool for explanatory analysis that removes the confounding bias in data. That bias might create a spurious correlation or confounding effect among variables. Recently, many methods have been proposed to remove the confounding bias in the literature of causal inference, including do-operation based on a structural causal model (Pearl, 2009) and counterfactual outcome prediction based on a potential outcome framework (Imbens and Rubin, 2015). With do-operation, a backdoor adjustment (Pearl et al., 2016) has been proposed for data debiasing. In this study, we sketch the causal structure model of our problem, as shown in Fig. 11, and adopt the backdoor adjustment for confounding bias reduction.

Fig. 11 Confounding bias from the data generation mechanism (Wu et al., 2020)

In this subsection we introduce the effect of mechanism confounding bias on the generation of the court's view and propose a backdoor-inspired method to eliminate that bias. Then, we describe our AC-NLG model in detail. Fig. 12 shows the overall framework.

As shown in Fig. 11, u refers to the unobserved mechanism (i.e., plaintiffs sue when they have a high probability of being supported) that causes the judgment in dataset D(J) to be imbalanced. D(J) → I denotes that the imbalanced data D(J) has a causal effect on the representation of input I (i.e., plaintiff's claim and fact description), and D(J) → V denotes that D(J) has a causal effect on the representation of court's view V . Such imbalance in D(J) leads to a confounding bias: the representations of I and V tend to be supportive, which blinds the conventional training on P (V |I), and current sequence-to-sequence models struggle to solve this problem. For a particular case, given the input I = (c, f ), and using the Bayes rule, we would train the model to generate the court's view V as follows:

The backdoor adjustment creates a do-operation on I, which promotes the posterior probability from passive observation to active intervention. The backdoor adjustment addresses the confounding bias by computing the interventional posterior P (V |do(I)) and controlling the confounder as

Because the backdoor adjustment helps cut the dependence between D(J) and I, we can eliminate the confounding bias from the data generation mechanism and learn an interventional model for debiased court's view generation.

As shown in Fig. 12 , to optimize Eq. (19), we use a pair of counterfactual decoders to learn the likelihood P (V |I, j) for each j. At inference, we propose to use a predictor to approximate P (j). Note that our implementation on backdoor-adjustment can be easily applied for multi-valued confounding with multiple counterfactual decoders.
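The difference between the observational and the interventional posterior can be sketched numerically. The sketch assumes the standard backdoor formula P(V|do(I)) = Σ j P(V|I, j) P(j) and, for comparison, the observational decomposition P(V|I) = Σ j P(V|I, j) P(j|I). The function name and all probabilities are invented for illustration.

```python
from typing import Dict


def mix(likelihood: Dict[int, float], weights: Dict[int, float]) -> float:
    """Compute sum_j P(V | I, j) * w(j) for a discrete judgment j."""
    return sum(likelihood[j] * weights[j] for j in likelihood)


# P(V | I, j): likelihood of one candidate court's view under each judgment,
# as would be given by the support (j = 1) and non-support (j = 0) decoders.
likelihood = {1: 0.8, 0: 0.2}

# Observational weights P(j | I): skewed toward support by the imbalanced data.
posterior_j = {1: 0.9, 0: 0.1}
# Interventional weights P(j): no longer depend on the biased input representation.
prior_j = {1: 0.5, 0: 0.5}

p_observational = mix(likelihood, posterior_j)   # P(V | I)
p_interventional = mix(likelihood, prior_j)      # P(V | do(I))
print(p_observational, p_interventional)
```

With more than two judgment values, the same sum simply runs over more decoders, which matches the remark on multi-valued confounding.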

Our model is conducted in a multi-task learning manner that consists of a shared encoder, a predictor, and a pair of counterfactual decoders. Note that the predictor and the decoders take the output of the encoder as input.

1. Claim-aware encoder

Intuitively, the plaintiff's claim c and the fact description f are sequences of words. The encoder first transforms the words into embeddings. Then the embedding sequences are fed to BiLSTM, producing two sequences of hidden states h c and h f corresponding to the plaintiff's claim and the fact description, respectively.

After that, we use a claim-aware attention mechanism to fuse h c and h f . For each hidden state h f i in h f , e i k is its attention weight on h c k , and the attention distribution q i is calculated as follows:

where v, W c , W f , b attn are learnable parameters. The attention distribution can be regarded as the importance of each word in the plaintiff's claim. Next, the new representation of the fact description is produced as follows:

After feeding to another BiLSTM layer, we obtain the claim-aware representation of fact h.

Fig. 12 Architecture of the attentional and counterfactual natural language generation (AC-NLG) method (Wu et al., 2020)

2. Judgment predictor

Given the claim-aware representation of fact h, the judgment predictor produces the probability of support P sup through a fully connected layer and a sigmoid operation. The prediction result j is obtained as follows:

where 1 means support and 0 means non-support.
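The claim-aware attention and the judgment predictor can be sketched as follows. The sketch assumes the usual additive form e = v^T tanh(W c h c + W f h f + b attn) followed by softmax, which is suggested by the listed parameters, and a decision threshold of 0.5 on P sup. Both are assumptions, as are the function names and toy values.

```python
import math
from typing import List, Tuple


def matvec(W: List[List[float]], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def claim_attention(
    h_f_i: List[float],
    h_c: List[List[float]],
    W_c: List[List[float]],
    W_f: List[List[float]],
    b_attn: List[float],
    v: List[float],
) -> List[float]:
    """Attention distribution of one fact hidden state over all claim hidden states."""
    wf = matvec(W_f, h_f_i)
    energies = []
    for h_c_k in h_c:
        wc = matvec(W_c, h_c_k)
        hidden = [math.tanh(c + f + b) for c, f, b in zip(wc, wf, b_attn)]
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    return [e / z for e in exps]


def predict_judgment(h: List[float], w: List[float], b: float, threshold: float = 0.5) -> Tuple[int, float]:
    """Fully connected layer plus sigmoid; returns (j, P_sup) with 1 = support."""
    p_sup = 1.0 / (1.0 + math.exp(-(sum(wi * hi for wi, hi in zip(w, h)) + b)))
    return (1 if p_sup > threshold else 0), p_sup


# Toy example with 2-d hidden states and identity projection matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
q = claim_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 0.0]], identity, identity, [0.0, 0.0], [1.0, 1.0])
print(q)
print(predict_judgment([1.0, -0.5], [2.0, 1.0], 0.0))
```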

3. Counterfactual decoders

To eliminate the effect of data bias, here we use a pair of counterfactual decoders, one for supported cases and the other for non-supported cases. The two decoders have the same structure but aim to generate the court's view with different judgments. We name them counterfactual decoders because every time only one of the two generated court views is correct. Still, we apply the attention mechanism. At each step t, given the encoder's output h and the decoder state s t , the attention distribution a t is calculated in the same way as q i in Eq. (21), but with different parameters. The context vector h * t is then a weighted sum of h:

The context vector h * t , which can be regarded as a representation of the input for this step, is concatenated with the decoder state s t and fed to linear layers to produce the vocabulary distribution p vocab :

where V , V′ , b, and b′ are all learnable parameters. Then we add a generation probability to solve the out-of-vocabulary (OOV) problem. Given the context h * t , the decoder state s t , and the decoder's input (the word embedding of the previous word) x t , the generation probability P gen can be calculated:

where w h * , w s , w x , and b ptr are learnable, and σ is the sigmoid function. The final probability for a word w at time step t is obtained:
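The way a generation probability resolves the OOV problem can be sketched with the widely used pointer-generator mixture, P(w) = P gen p vocab(w) + (1 − P gen) Σ a t i over source positions i holding w. We assume this form for illustration; the function name and toy values are hypothetical.

```python
from typing import Dict, List


def final_distribution(
    p_gen: float,
    p_vocab: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix the vocabulary distribution with attention mass copied from the source."""
    # Generation part: scale every in-vocabulary word by P_gen.
    final = {w: p_gen * p for w, p in p_vocab.items()}
    # Copy part: add (1 - P_gen) * attention weight to each source token,
    # which gives OOV tokens such as amounts a non-zero probability.
    for weight, token in zip(attention, source_tokens):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


p_vocab = {"loan": 0.5, "court": 0.5}      # toy vocabulary distribution
attention = [0.75, 0.25]                   # attention over two source tokens
source_tokens = ["loan", "29500"]          # "29500" is out of vocabulary
dist = final_distribution(0.5, p_vocab, attention, source_tokens)
print(dist)
```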

How the two decoders are made to differ from each other is described in the training part.

For the predictor, we use cross-entropy as the loss function:

where ĵ is the real judgment.

For the decoders, the previous word in training is the word in the real court's view, and the loss for time step t is the negative log-likelihood of the target word w * t :

and the overall generation loss is

where T is the length of the real court's view.

Because we aim to make the two decoders generate two different court views, we use a mask operation when calculating the loss of each decoder. The exact loss for the support decoder is

The loss for the non-support decoder L nsup is obtained in the opposite way. Thus, the total loss is

where we set λ to 0.1 in our model.
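The masked training objective can be sketched as follows. The sketch assumes that the generation loss is the mean negative log-likelihood over the T target words, that the mask is the real judgment ĵ for the support decoder and 1 − ĵ for the non-support decoder, and that λ weights the predictor loss in the total. These choices are our assumptions for illustration and are not taken from the paper's equations.

```python
import math
from typing import List


def generation_loss(target_probs: List[float]) -> float:
    """Mean negative log-likelihood of the target words over T steps."""
    return sum(-math.log(p) for p in target_probs) / len(target_probs)


def total_loss(
    j_true: int,
    p_sup: float,
    sup_target_probs: List[float],
    nsup_target_probs: List[float],
    lam: float = 0.1,
) -> float:
    """Masked decoder losses plus a weighted cross-entropy loss for the predictor."""
    l_pred = -(j_true * math.log(p_sup) + (1 - j_true) * math.log(1.0 - p_sup))
    # Mask: each decoder is trained only on cases with its own judgment.
    l_sup = j_true * generation_loss(sup_target_probs)
    l_nsup = (1 - j_true) * generation_loss(nsup_target_probs)
    return l_sup + l_nsup + lam * l_pred


# For a supported case (j_true = 1), the non-support decoder's probabilities
# do not affect the loss, so its parameters receive no gradient from this case.
print(total_loss(1, 0.9, [0.5, 0.5], [0.1, 0.1]))
print(total_loss(1, 0.9, [0.5, 0.5], [0.9, 0.9]))
```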

To investigate the effectiveness of FITS, we conducted experiments on a real private loan dataset. We developed an AI-judge assistant, named Xiaozhi, based on FITS. We also applied FITS in real courts and achieved satisfactory results.

Due to the page limitation, here we show only the comparison results of judgment prediction, which is the most important task of a smart trial. We compare our method with other deep learning baselines on the collected private loan dataset and discuss the role that legal knowledge plays in its performance.

We collected a total of 61 611 private loan law cases. Each instance in the dataset consists of a factual description and the plaintiff's multiple claims. We will release all the experiment data to motivate other scholars to investigate this problem further. Macro F1 and Micro F1 (Mac.F1 and Mic.F1 for short) were adopted as the primary metrics for algorithm evaluation. We denoted the co-attentionbased method as CoATT+LK, which means we injected legal knowledge into neural networks.
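For readers less familiar with the two primary metrics, the following sketch shows how Macro F1 and Micro F1 differ on a single-label task. Macro F1 averages the per-class F1 scores, so rare classes count as much as frequent ones, whereas Micro F1 pools the counts over all classes. The label names and predictions are invented and are not taken from the dataset.

```python
from collections import Counter
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Return (Macro F1, Micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical labels for four claims.
y_true = ["support", "support", "support", "reject"]
y_pred = ["support", "support", "reject", "reject"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(round(macro, 4), round(micro, 4))
```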

We evaluated our model and the baselines on the private loan dataset. In addition to Mac.F1 and Mic.F1, we used macro-precision (Mac.P) and macro-recall (Mac.R) to evaluate the methods. The performance on the test set is summarized in Table 1 . We can draw the following conclusions from the results: First, the performance of the deep learning based methods, e.g., TextCNN, BiLSTM+ATT, and HARNN, significantly exceeded the traditional machine learning method TF-IDF+SVM, which shows the success of applying neural networks for LJP. Second, LSTM-based methods gave better results than the CNN-based approach, demonstrating the advantages of extracting contextual features using LSTM. Third, BERT outperformed all the deep learning based methods, which shows the pre-trained language model's strong representation abilities, even for the legal domain.

Finally, the co-attention model gave a 4.8% absolute increase in performance (the average of Mac.F1 and Mic.F1) compared with BERT, which leads to two conclusions. First, directly applying pre-trained models to specific domains still has room for improvement. Second, it verifies our assumption that the bi-directional attention flows of information between facts and claims help locate crucial facts. Most importantly, injecting legal knowledge into co-attention networks gave another 1% absolute increase compared with the co-attention model and achieved the best results among all methods.

The full-process smart trial system has played an important role in the construction of the smart court in Zhejiang Province. We developed a substantive AI-judge assistant robot, called Xiaozhi, based on FITS, which has already assisted seven Zhejiang Provincial courts in financial lending cases and private lending cases. Xiaozhi moved the full procedural trial mode from the experimental stage to application practice. As a judge's assistant, Xiaozhi demonstrates the advantages of AI in the judicial field. FITS can understand legal documents, extract case information, justify evidence, and record the parties' speeches. It assists the judge in automatically questioning, promoting the trial process independently, summarizing the focus of disputes, predicting the outcome of the judgment, and generating judgment documents. If the judge's judgment deviates from a similar case, the system will also remind the judge of risks.

Compared with the traditional court, FITS has allowed realization of a new "human-machine integration" mode of intelligent trial in real applications. The litigation procedures in China consist of four phases: (1) In the trial preparation phase, Xiaozhi can push the pre-trial report to the judge and analyze the report's elements. (2) In the investigation stage, Xiaozhi synchronously conducts semantic recognition and text conversion, automatically helps the judge with questioning, and justifies the validity of evidence.

(3) In the debate stage, Xiaozhi can convert the dialogue between the parties into text in real time, and summarize the dispute's focus from the dialogue and extract its elements. (4) In the judgment stage, Xiaozhi helps predict the outcome of the case and generate judgment documents in real time, which enables the judge to pronounce judgment in court after review and confirmation.
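The four phases described above can be read as a fixed sequence in which each phase triggers a set of assistant tasks. The sketch below is only an illustration of that ordering under our own assumptions: the names `Phase`, `ASSISTANT_TASKS`, `next_phase`, and `run_trial` are hypothetical, and the task strings paraphrase the text rather than reflect the deployed system's interfaces.

```python
from enum import Enum

class Phase(Enum):
    """The four litigation phases, in procedural order."""
    PREPARATION = 1
    INVESTIGATION = 2
    DEBATE = 3
    JUDGMENT = 4

# Hypothetical mapping from phase to the assistant tasks named in the text.
ASSISTANT_TASKS = {
    Phase.PREPARATION: ["push pre-trial report", "analyze report elements"],
    Phase.INVESTIGATION: [
        "convert speech to text",
        "assist with questioning",
        "justify evidence validity",
    ],
    Phase.DEBATE: [
        "transcribe dialogue in real time",
        "summarize dispute focus",
        "extract elements",
    ],
    Phase.JUDGMENT: ["predict case outcome", "generate judgment document"],
}

def next_phase(phase):
    """Return the following phase, or None after the judgment phase."""
    members = list(Phase)
    idx = members.index(phase)
    return members[idx + 1] if idx + 1 < len(members) else None

def run_trial():
    """Walk the phases in order and collect (phase name, task) pairs."""
    log = []
    phase = Phase.PREPARATION
    while phase is not None:
        for task in ASSISTANT_TASKS[phase]:
            log.append((phase.name, task))
        phase = next_phase(phase)
    return log

log = run_trial()
```

Keeping the phase order in one place makes it straightforward to insert the judge's review step before any output of the judgment phase is released.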

FITS breaks through geographical limitations and avoids the inefficiency of traditional courts. It has launched "networking," "digitization," and "intelligence" in the smart court. The application of FITS has achieved satisfactory results: (1) In the automatic questioning task, the accuracy rate of procedural questioning can reach 96%, and the hit rate for factual questioning can reach 70%. (2) In the high-frequency private lending and financial borrowing cases, the summaries of court trial records can reach 90%, the accuracy rate of generating dispute focuses can reach 70%, and the factor prediction accuracy rate can reach 80%. (3) The accuracy of evidence determination is 92% for financial loan cases and 95% for private lending cases, and the accuracy of evidence classification can reach 90%. (4) FITS predicts the trial's outcome by combining the legal knowledge graph and big data analysis, with an accuracy rate of 96%. (5) With the help of our system, the rate of sentence pronouncement in court can be improved from 40% (traditional judge system) to 90%, and the trial time can be shortened from 2-3 h (traditional judge system) to 20-30 min. Moreover, the average number of trial days for initial financial loan cases has been shortened from 98 in 2017 to 66 in 2020, and no case has been revised or remanded for retrial.

This paper attempts to cover the primary process of adjudication; the essential stages of a trial pipeline include making judgments and writing judgment documents. Text classification and legal prediction technologies are often used to assist in these tasks, and both have a long history in AI and law research. Legal text classification is the fundamental technology of our work. Dahbur and Muscarello (2003) gave a classification system for serial criminal patterns. Ashley and Brüninghaus (2009) proposed the SMILE+IBP model to automatically classify textual facts in terms of a set of classification concepts that capture stereotypical fact patterns. Passage-based text summarization was used to investigate how to categorize text excerpts from Italian normative texts (Kanapala et al., 2019). Liu CL and Chen (2019) applied machine learning methods, including gradient boosting, multilayer perceptrons, and deep learning methods with LSTM units, to extract the gist of Chinese judgments of the supreme court.

Concerning legal prediction, remarkable results have been achieved (Arditi et al., 1998). In the early stages, machine learning methods, such as argument-based machine learning (Možina et al., 2005), were applied to the legal domain. Machine learning has also been applied to predict decisions of the European Court of Human Rights (Aletras et al., 2016; Medvedeva et al., 2020). A time-evolving random forest classifier was designed to predict the behavior of the Supreme Court of the United States (Katz et al., 2017). Recently, Chao et al. (2019) improved the interpretability of charge prediction systems and improved automatic legal document generation from the fact description. They further proposed an interpretable model for charge prediction for criminal cases using a dynamic rationale attention mechanism (Ye et al., 2018). Hu et al. (2020) studied the problem of identifying the principals and accessories from the fact description with multiple defendants in a criminal case.

This paper presents a full-process intelligent trial system. The technical route mainly combines knowledge-based models with data-centric models. The knowledge expression and reasoning method formalizes the judge's legal knowledge and implements logical reasoning according to the judge's logical rules. Big data driven technology realizes the tasks of classification, summarization, and prediction through big data analysis of massive legal texts. Several deep learning models are proposed for legal information extraction, evidence justification, trial summarization, outcome prediction, and judgment document generation.

Note that the application of FITS has not been extended to criminal cases. Application to criminal cases calls for great caution because the standard of judicial proof in criminal cases is "beyond a reasonable doubt," whereas the prediction results of the intelligent system cannot be guaranteed to be 100% correct. The predictive model contains machine learning algorithms that are uninterpretable or have "black box" problems, which means that the process from data input to result output is nontransparent. Therefore, any use of FITS in criminal trials must proceed very cautiously.

The system explores the in-depth application of big data, modern logic, and AI in the full trial process. The AI trial system also has shortcomings. Although the existing technologies are good at handling simple cases (such as financial lending and private lending cases), in complex cases the determination of the facts and the application of laws are inseparable from the experience of the judge, especially where ethics and morality are concerned. It is difficult for AI to accurately predict the outcome of complex cases while taking these empirical factors into account. Therefore, we need to formulate the AI trial system in a human-machine interaction mode and enable judges to provide real-time feedback on algorithm results.

Bin WEI, Kun KUANG, Changlong SUN, and Jun FENG discussed the organization of this paper from different aspects, including the views of both law and computer science. Bin WEI drafted mainly Sections 1, 3, 4, and 10. Kun KUANG drafted mainly Sections 6 and 7. Changlong SUN drafted mainly Sections 2 and 9 and provided judicial big data and technical models for experiments in Section 8. Jun FENG drafted mainly Section 5 and conducted the experiments in Section 8. Fei WU, Xinli ZHU, and Jianghong ZHOU guided the research. All authors revised and finalized the paper.

Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective

Predicting the outcome of construction litigation using neural networks

Automatically classifying case texts and predicting outcomes

Interpretable charge prediction for criminal cases with dynamic rationale attention

Classification system for serial criminal patterns

Legal summarization for multi-role debate dialogue via controversy focus mining and multi-task learning

Deep learning for named-entity linking with transfer learning for legal documents

Abstractive summarization of product reviews using discourse structure

Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts. IEEE Spoken Language Technology Workshop

Framewise phoneme classification with bidirectional LSTM and other neural network architectures

Long short-term memory

Identifying principals and accessories in a complex case based on the comprehension of fact description

Causal Inference for Statistics, Social, and Biomedical Sciences: an Introduction

Information extraction from case law and retrieval of prior cases

Cross copy network for dialogue generation

Passage-based text summarization for legal information retrieval

A general approach for predicting the behavior of the Supreme Court of the United States

Triangular Norms

Conditional random fields: probabilistic models for segmenting and labeling sequence data

Neural architectures for named entity recognition

A logic-driven framework for consistency of neural models

Structured tuning for semantic role labeling

Extracting the gist of Chinese judgments of the supreme court

Automatic dialogue summary generation for customer service

Graph convolution for multimodal information extraction from visually rich documents

Learning to predict charges for criminal cases with legal basis

Using machine learning to predict decisions of the European Court of Human Rights

Argument based machine learning applied to law

Causality: Models, Reasoning, and Inference (2nd Ed.)

Causal Inference in Statistics: a Primer

Long short-term memory recurrent neural network architectures for large scale acoustic modeling

An introduction to conditional random fields for relational learning

Masking orchestration: multi-task pretraining for multi-role dialogue representation learning

De-biased court's view generation with causality

CAIL2018: a large-scale legal dataset for judgment prediction

Embedding symbolic knowledge into deep networks

Legal judgment prediction via multi-perspective bi-feedback network

Hierarchical attention networks for document classification

Transfer learning for sequence tagging with hierarchical recurrent networks

Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions

Improve neural entity recognition via multi-task data selection and constrained decoding

Legal judgment prediction via topological learning

How does NLP benefit legal system: a summary of legal artificial intelligence

Iteratively questioning and answering for interpretable legal judgment prediction

Legal intelligence for e-commerce: multi-task learning by leveraging multiview dispute representation

We thank all members of the FITS project team, especially the natural language processing team. In particular, we would like to thank Xiaozhong LIU, Lin YUAN,