key: cord-0483471-885cjrqa authors: Lee, Eunji; Kim, Sundong; Kim, Sihyun; Park, Sungwon; Cha, Meeyoung; Jung, Soyeon; Yang, Suyoung; Choi, Yeonsoo; Ji, Sungdae; Song, Minsoo; Kim, Heeja title: Classification of Goods Using Text Descriptions With Sentences Retrieval date: 2021-11-02 journal: nan DOI: nan sha: 184d24a344a68856703a44306a12983e352a52e9 doc_id: 483471 cord_uid: 885cjrqa

The task of assigning and validating internationally accepted commodity codes (HS codes) for traded goods is one of the critical functions of the customs office. This decision is crucial for importers and exporters, as it determines the tariff rate. However, much like court decisions made by judges, the task can be non-trivial even for experienced customs officers. This paper proposes a deep learning model to assist with this challenging HS code classification task. Together with the Korea Customs Service, we built a decision model based on KoELECTRA that suggests the most likely heading and subheading (i.e., the first four and six digits) of the HS code. Evaluation on 129,084 past cases shows that the top-3 suggestions made by our model have an accuracy of 95.5% in classifying 265 subheadings. This promising result implies that algorithms may substantially reduce the time and effort customs officers spend on the HS code classification task.

According to the World Customs Organization (WCO), the number of import and export declarations worldwide reached 500 million as of 2020. Events like COVID-19 have led to a surge in cross-national imports of e-commerce goods; Korea, for instance, marked 63.5 million such imports in 2020, a 48% increase over the previous year [18]. As global transactions increase and traded products diversify, managing the standard for categorizing numerous products, i.e., the Harmonized Commodity Description and Coding System (HS), becomes crucial.
HS is an international standard for classifying goods; from live animals to electronic devices, each product is classified under one of 5,387 subheadings (6-digit HS codes) defined by international convention [27]. This code determines critical trade decisions such as the tariff rate and import and export requirements. HS code classification is non-trivial and requires a high degree of expertise. Securing tariffs is vital for fiscal income in many countries: the share of tax revenue secured through the customs office is nearly 20% worldwide, and it exceeds 40% in West African countries.¹ In addition, tariff rates directly affect the price of goods and thus their global competitiveness. Therefore, importers and exporters pay special attention to the product declaration. Customs authorities scrutinize the HS codes of declared goods and have them corrected if needed. Simple errors can be corrected by amending the declaration or sending a request for correction; if customs administrations recognize evidence of smuggling or an intentionally false declaration for tax evasion, importers are punished under the customs act. Classifying a product is complex because human experts' interpretations are not always consistent, which can lead to international disputes when customs authorities disagree with one another or with companies. For example, when smartwatches were first released, tariffs varied by importing country due to the absence of a classification standard: tariff rates for wireless communication devices are 0%, but 4-10% for watches. The resulting dispute was finally resolved at the WCO HS Committee in 2014, which classified the smartwatch as a wireless communication device, allowing the manufacturer to save about $13 million a year [21].
Accordingly, the customs administration operates a pre-examination system that allows import and export companies to request a customs review of their items before formal declaration. There are about 6,000 applications for pre-examination every year in Korea. As the complexity of goods has increased, the processing time has grown from 20.4 days to 25.8 days since 2018. The main reason is the detailed review process, since the HS code and the corresponding tax rate can differ even for similar-looking items. For example, the tariff rate for televisions (HS 8528.59) is 8%, but 0% for PC monitors (HS 8528.52). HS codes are determined by reviewing the description submitted by applicants and relevant past cases. Experts refer to the Harmonized Commodity Description and Coding System Explanatory Notes (HS manual) [5] for standard code descriptions and the General Rules for the Interpretation of the Nomenclature (GRI) [25] for decision-making criteria. In this paper, we present a novel HS classification model that reflects how experts work. First, the model suggests 4-digit HS codes (headings) from product descriptions using pre-trained language models. Then, it retrieves the key sentences from the HS manual that are most related to the product. Next, the model suggests 6-digit HS codes (subheadings) using the product description and the retrieved sentences. This workflow mirrors the order in which experts apply the GRI when performing HS classification. The retrieved sentences act as supporting facts that make the decision convincing to importers and exporters. We classified headings and subheadings of recently examined electrical equipment (i.e., Chapter 85), which is known to be difficult due to product complexity. Our model outperformed the winning solution of a product classification challenge in the e-commerce sector [16]. Moreover, we demonstrated our model at the Korea Customs Service on eleven undisclosed decision cases (top-3 accuracy: 0.82).
Last, we discuss how to advance our model in terms of interpretability and contextual logic understanding.

All items that go through customs are assigned a Harmonized System (HS) code, an internationally standardized system of names and numbers that classifies traded products to determine tariffs. Being an internationally recognized standard, the first six digits of the HS code (HS6) are the same for all countries; for further classification, countries append additional digits in their respective HS code systems. HS6 comprises three components: 1. Chapter, the first two digits, spans 96 categories from 01 to 99; for example, Chapter 85 indicates electrical machinery and equipment and parts thereof. 2. Heading, the first four digits, groups goods with similar characteristics within a chapter; for instance, heading 8528 represents monitors and projectors but excludes television reception apparatus. 3. Subheading, the first six digits, groups goods within a heading; for example, subheading 8528.71 includes items in 8528 not designed to incorporate a video display or screen.

Recent studies utilize machine learning approaches to predict HS codes from text descriptions of the declared goods. These approaches include k-nearest neighbors, SVM, AdaBoost [6], and neural networks [4]. To capture semantic information from the text, state-of-the-art studies use neural machine translation [1] and other transformer-based algorithms [12]. Other studies exploited hierarchical relationships between HS codes and word co-occurrence via background nets [11]. Related work on understanding short texts and classifying them in a large hierarchy using class taxonomies [19], metadata [30], and hyperbolic embeddings [3] can also be applied to HS prediction. However, most approaches focus on classification itself and provide no explanation. Sentence retrieval is commonly used for question answering (QA) tasks [20, 22].
Retrieved sentences become supporting facts and provide a detailed explanation of the answer. State-of-the-art approaches to identifying supporting facts use self-attention [23] and bi-attention [17] over paragraphs; this is possible because most QA datasets with paragraphs have annotated evidence sentences [9, 29]. In the unsupervised setting, approaches such as TF-IDF [15] and alignment-based methods [10] are used to find supporting sentences. Unsupervised sentence retrieval can increase both interpretability and answer-finding performance [7].

The study is conducted on goods belonging to electrical equipment (Chapter 85). Classifying these goods is getting trickier as electrical products become multi-functional and complex, so they do not easily fit into the existing Harmonized System [14]. Because of these difficulties, goods belonging to Chapter 85 received the most classification requests (17.1%, as of 2020). According to the Customs Valuation and Classification Institute, it takes 37.2 days to resolve these classification requests, much longer than the average time required for other categories (25.9 days). Chapter 85 contains 46 headings and 265 subheadings in total. We utilized datasets from the customs law information portal (CLIP) of the Korea Customs Service.

Table 1. An example decision case including item description and corresponding HS code.
Item description: Photovoltaic cell panel, silicon (Si) embedded in plastic (EVA) and assembled with a layer of glass and fiberglass and an upper layer of "Tedlar EVA", with an aluminum frame, which converts sunlight into electricity. The cells are of polycrystalline type, with a maximum power of 135 W. Each panel has 36 cells connected in series, and the open-circuit voltage is 22.1 V. It incorporates "bypass" protection diodes in the junction box and cables. It has no other devices that make the power directly usable. Dimensions 1008 x 992 x 35 mm and a weight of 13.5 kg.
HS code: 8541.40-9000

Experts decide the HS code according to the Harmonized Commodity Description and Coding System Explanatory Notes (HS manual), a common set of laws worldwide. The HS manual explains each code in detail at the section, chapter, and heading level. We utilized the heading-level manual to provide supporting facts for the model's decisions. The proposed framework takes an item description as input, and its final goal is to predict the subheading (HS6) of the given item by referring to the HS manual. It also provides intermediate outputs: candidate headings and subheadings, prior cases, and key sentences from the HS manual. Figure 1 illustrates our model, which is divided into three stages.

A. Heading Prediction. $D = \{D_1, \cdots, D_N\}$ is a collection of decision cases, where each case $D_i \in D$ is a pair of an item description $x_i$ and its one-hot encoded heading label $y_i$. After translating all item descriptions into Korean, we used KoELECTRA [13] as a description encoder $e_\theta$ to map a sequence of words $x_i$ into the embedding space $\mathbb{R}^d$. The item embedding $e_\theta(x_i)$ goes through the classification head, and prediction is done by minimizing the loss $L$ between the true probability $y_i$ and the predicted probability $\hat{y}_i = e_\theta(x_i) \cdot W$, where $W \in \mathbb{R}^{d \times \dim(y_i)}$ is a trainable weight matrix of the classification head. Since the problem is multi-class classification, the categorical cross-entropy loss $H$ is used.

After predicting the heading, the model extracts key sentences from the HS manual to justify this decision. Key sentences are iteratively retrieved by calculating a similarity score between the item description $x_i$ and the heading-level HS manual $M$. This method is called Alignment Information Retrieval (AIR), and the process terminates when the retrieved sentences cover all keywords of the input description or no new keywords are discovered in $M$ [28]. Here, $d_m$ and $m_k$ denote the $m$-th and $k$-th sentences of $x_i$ and $M$, respectively.
The cosine similarity (sim) is computed over GloVe embeddings of the two inputs, and idf is the inverse document frequency. We chose at most seven sentences from the heading-level HS manual and created the sentence set $S_i = \{m_k\}$ with the highest alignment scores. Using the key sentences and the item description, the model predicts subheading candidates and retrieves prior cases belonging to each subheading.

Figure 1. Using the predicted heading, Stage 2 retrieves key sentences from the heading-level HS manual to support the decision. In Stage 3, the item description and key sentences are used to predict the most relevant subheadings of the product. The final outputs include heading and subheading candidates with key sentences and similar cases for reference.

After concatenating the item description $x_i$ and the key sentences $S_i$, we use a KoELECTRA encoder $e_\phi$ to generate the embedding $e_\phi([x_i, S_i])$. As in heading prediction, the embedding goes through a classification head, and prediction is done by minimizing the categorical cross-entropy loss between the true probability $y^s_i$ and the predicted probability $\hat{y}^s_i = e_\phi([x_i, S_i]) \cdot W^s$, where $W^s \in \mathbb{R}^{d \times \dim(y^s_i)}$ is a trainable weight matrix of the classification head. After training, the model determines the top-$k$ subheading candidates from the classification head. If the subheading of a given item $i$ is predicted as $y_p$, similar cases are drawn from $D_p \subset D$, the set of cases whose subheading label is $y_p$. Similar cases are iteratively chosen by measuring the cosine similarity between item embeddings.

This section tests the feasibility of the proposed HS classification model in terms of classification performance and retrieved-sentence quality. Classification performance is measured by the subheading (HS6) accuracy, our final output, and the heading (HS4) accuracy, an intermediate output. The quality of the retrieved sentences is examined by comparing our results with documents written by experts.
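As a concrete illustration of Stage 2's retrieval, the cited AIR method scores a manual sentence by an idf-weighted sum of each description term's best cosine match within that sentence. The toy sketch below shows only that scoring logic; the random vectors stand in for GloVe embeddings, and the vocabulary and idf values are illustrative, not taken from the paper:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def alignment_score(query_tokens, sentence_tokens, emb, idf):
    """idf-weighted sum of each query token's best cosine match in the sentence."""
    return sum(
        idf.get(q, 1.0) * max(cosine(emb[q], emb[s]) for s in sentence_tokens)
        for q in query_tokens
    )

rng = np.random.default_rng(1)
vocab = ["solar", "panel", "cell", "glass", "camera"]
emb = {w: rng.normal(size=50) for w in vocab}       # stand-ins for GloVe vectors
idf = {"solar": 2.3, "panel": 1.7, "cell": 1.1}     # illustrative idf values

description = ["solar", "panel", "cell"]            # tokenized item description
manual_sentences = [["solar", "cell"], ["camera", "glass"]]  # tokenized manual sentences

# Rank manual sentences by their alignment with the description;
# the top-scoring sentences would form the key-sentence set S_i.
ranked = sorted(manual_sentences,
                key=lambda s: alignment_score(description, s, emb, idf),
                reverse=True)
```

In the full AIR procedure, this ranking step is repeated, removing already-covered keywords until coverage stops improving.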
For the experiments, we retained the 126,000 cases, out of 129,084, whose product HS codes were maintained. Given that decisions may change over time, we used the last three months of data (1,466 international and 186 Korean cases) for evaluation. The three months of data before the test period were used as the validation set (1,733 international and 102 Korean cases) for hyperparameter tuning. We evaluated heading and subheading classification performance by measuring top-k accuracy with k = 1, 3, 5. In the retrieved-sentence case study, we measured recall and precision to evaluate the quality of the supporting facts. The KoELECTRA models and classification heads for heading and subheading prediction are trained for 50 epochs and evaluated at the point of highest validation accuracy. The embedding size of the KoELECTRA encoders is set to 768. Key sentences for each item are required to train the subheading prediction model, so we prepared them beforehand from the answer heading's HS manual. At evaluation time, key sentences are retrieved from the predicted heading's HS manual. Training the KoELECTRA model takes 40 hours, and data preparation for the sentence retrieval model takes 50 hours on an NVIDIA TITAN Xp; inference and retrieval take less than 30 seconds. We compared the proposed model with two baselines: a word-matching model and an LSTM-based model. The word-matching model computes the word matching rate between the item description and the heading-level HS manual and chooses the heading with the highest rate. The LSTM-based model [16] is the winning model of a competition [2] whose setting is similar to our problem: predicting the detailed category of e-commerce products from their descriptions. It uses LSTM networks to obtain embeddings from tokenized input texts. Table 3 shows the top-k accuracy of the baselines and our model.
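The top-k accuracy metric used above simply checks whether the true heading or subheading appears among the model's k highest-scoring candidates. A small sketch with toy scores (the helper function and the numbers are ours, shown only to make the metric concrete):

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of items whose true class is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]  # indices sorted descending
    return float(np.mean([label in row for row, label in zip(topk, labels)]))

# Toy prediction scores over three classes for three items
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = [1, 2, 1]

top1 = top_k_accuracy(scores, labels, 1)   # 1/3: only the first item is ranked first
top2 = top_k_accuracy(scores, labels, 2)   # 2/3: the third item's label is ranked second
```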
Two variants of our model are tested: one using the retrieved sentences from the second stage and one without them.

2) Retrieved Key Sentences. Given an item description, experts present supporting reasons for their final decision: they quote sentences from the HS manual and provide evidence with detailed explanations. We compared the sentences quoted by experts with the key sentences retrieved by our model. As shown in Table 4, the key sentences (supporting facts) retrieved by our model broadly match the experts' quotations.

Table 4. Comparison between the actual reasons for decision and supporting facts found by our algorithm.

Figure 2 shows the model's final outputs. First, the model provides three heading candidates with their prediction scores. To calibrate the scores, we applied temperature scaling to the model's softmax outputs [8]. For each candidate, one to seven key sentences are retrieved from the HS manual. The retrieved sentences explain the model's prediction and reduce the reviewing scope for customs officers. Next, the model provides three subheading candidates, and with each subheading it provides similar prior cases for reference.

One of our goals is to add interpretability to the HS classification model. To that end, we provide a confidence score for each HS code candidate. The score gives additional information for judging whether a candidate is valid. Although the scores are tuned by temperature scaling, the range of the top-k confidence scores varied considerably across input items; careful calibration is required before customs officers can rely on this value in the decision-making process. Another way to increase interpretability is to visualize the parts of the item description that relate to each subheading candidate. Customs officers could then concentrate on the highlighted parts and decide whether the second and third candidates merit review.
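Temperature scaling, the calibration method cited above [8], divides the logits by a scalar temperature fitted on validation data before the softmax; a temperature above 1 softens overconfident scores without changing the ranking. A minimal sketch with illustrative logits and an arbitrary, unfitted temperature:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def calibrated_confidence(logits, temperature):
    """Temperature scaling: divide logits by T before the softmax (T > 1 softens)."""
    return softmax(np.asarray(logits, dtype=float) / temperature)

logits = [4.0, 1.5, 0.5]             # illustrative heading logits, not from the model
raw = softmax(logits)
scaled = calibrated_confidence(logits, temperature=2.0)

# Scaling preserves the ranking but shrinks the top score toward uniform
assert raw.argmax() == scaled.argmax()
assert scaled.max() < raw.max()
```

In practice the temperature is chosen to minimize negative log-likelihood on held-out validation data rather than set by hand.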
In addition, key sentences should relate to the subheading characteristics, so that the final form of the model output resembles reports written by experts. An organized document that explains the relation between the prediction, the description, and the HS manual will reduce the effort required for HS code classification. Customs experts decide the HS code by following the General Rules for the Interpretation of the Nomenclature (GRI), much as judges decide based on the law [26]. Deep learning models, on the other hand, solve classification problems by finding common patterns in previous cases. As a result, past examples are the primary determinant of an AI model's decision, unlike human experts, who decide according to rules and manuals. Since the HS code and its manual undergo revision every five years, previous cases are not always good references for recent ones. Therefore, it is essential to incorporate the GRIs and the HS manual to build a credible model that utilizes contextual information in training, based on deep linguistic understanding [24]. This study introduces a framework to predict Harmonized System codes in customs. Using the product description and the HS manual, it predicts headings and subheadings and provides supporting facts and related cases to facilitate the decision process by human experts. We expect our work to contribute in several ways. Use of this framework by declarants will improve initial declaration quality, thereby reducing customs officials' workloads. Internally, the framework can assist customs officials in carrying out their duties, increasing their work efficiency and supporting employee training. The situation of competing HS codes is particularly problematic for declarants and customs officials. Our model presents the competing HS codes of the target product together with its rationale, so it has great significance as an auxiliary means of product classification.
E-commerce platforms require a systematic classification system to effectively expose and recommend products to users, but standards often differ across product-providing companies. Platforms build and utilize hierarchical classification algorithms to maintain consistent categorization of hundreds of millions of products. Our work can be used to advance those algorithms and facilitate their management.

Table 4 (excerpt). Quoted by experts: "... and video camera recorders. 2. TELEVISION CAMERAS, DIGITAL ... This group covers cameras that capture images ... digital cameras and video camera recorders." Supporting facts found by our model: "1. PARTS", "2. TELEVISION CAMERAS".

REFERENCES
Exploring machine learning models to predict harmonized system code
Product categorization competition in Daum Shopping
Hyperbolic interaction model for hierarchical multi-label classification
Neural machine translation for harmonized system codes prediction
Harmonized commodity description and coding system explanatory notes
Auto-categorization of HS code using background net approach
A simple yet strong pipeline for HotpotQA
On calibration of modern neural networks
QASC: A dataset for question answering via sentence composition
Bridging the gap: Incorporating a semantic similarity measure for effectively mapping PubMed queries to documents
Customs classification for cross-border e-commerce based on text-image adaptive convolutional neural network
Classifying short text for the harmonized system with convolutional neural networks
KoELECTRA: Pretrained ELECTRA model for Korean
A study on the customs classification fallacy of certain ITA goods
Using TF-IDF to determine word relevance in document queries
Lime robot: 1st place solution of the product categorization competition in Daum Shopping
Bidirectional attention flow for machine comprehension
TaxoClass: Hierarchical multi-label text classification using only class names
Identifying supporting facts for multi-hop question answering with document graph networks
Smartwatch is a communication device
Evidence sentence extraction for machine reading comprehension
Gated self-matching networks for reading comprehension and question answering
Teach me to explain: A review of datasets for explainable NLP
General rules for the interpretation of the harmonized system
The challenges of artificial judicial decision making for liberal democracy
World Customs Organization: HS compendium - The harmonized system, a universal language for international trade
Unsupervised alignment-based iterative evidence retrieval for multi-hop question answering
HotpotQA: A dataset for diverse, explainable multi-hop question answering
MATCH: Metadata-aware text classification in a large hierarchy

ACKNOWLEDGMENT
This work was supported by the Institute for Basic Science (IBS-R029-C2, IBS-R029-Y4). We thank numerous officers from the Korea Customs Service for their insightful discussions.