work_23qh3ethwvhxtazlivq7izfvwa ---- International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 48 Research on IPv4, IPv6 and IPV9 Address Representation YURY Halavachou Department of the International Relations Belarusian State University of Transport Republic of Belarus 34, Kirova street, Gomel, 246653 Republic of Belarus e-mail: oms@bsut.by Wang Yubian Department of Railway Transportation Control Belarusian State University of Transport 34, Kirova street, Gomel, 246653 Republic of Belarus e-mail: alika_wang@mail.ru Abstract—The new generation network architecture (IPV9) is designed to solve a series of problems such as the shortage of address space and the danger of information security. IPv4 addresses have a length of 32 bits and a theoretically expressible address space of 232, while IPv6 addresses extend to 128 bits and theoretically an address space of 2128. In this paper, by studying IPv4, IPv6 address structure focuses on the new generation of network IPV9 address representation method. This method adopts the address coding method of the variable-length and variable-position, ranging from 16 bits to 2048 bits. Moreover, it adopts the mechanism of verification before communication, and relies on the method of assigning addresses to the computers on the Internet with full character codes. It redefines the address structure of the future Internet and provides new solutions for the Internet of things and the Internet of everything. Keywords-IPv4; IPv6; IPV9; Address Structure I. NETWORK ADDRESS An interconnected network is made up of LAN with interconnected nodes, also known as hosts or routers. Each device has a physical address connected to a network with a MAC layer address and a logical Internet address. Because a network address can be logically assigned to any network device, it is also called a logical address. Internet addresses are assigned by the Internet Corporation for Assigned Names and Numbers. The association appoints three local organizations - INTERNIC, RIPENCC and APNIC - to carry out location assignments in North America, Europe and the Asia Pacific region. The purpose of this uniform allocation is to ensure that network addresses are globally unique. II. ADDRESS SPACE FOR IPV4 The entire Internet is a single and abstract network. An IP address is a worldwide unique 32-bit identifier assigned to each interface of every host (or router) on the Internet. The structure of IP addresses makes it easy to address them on the Internet. A. The base header of IPv4 The base first format design of IPv4 is shown in Figure 1. DOI: 10.21307/ijanmc-2019-047 International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 49 Figure 1. IP datagram format Figure 1 shows. The first line of the above section indicates the bits occupied by each field in the header format. The whole header format is divided into fixed part (20 bytes in total) and variable part. The variable part is to increase the function of IP datagram, but the variable header length of IP datagram also increases the overhead of each router to process datagram. The following explains the role of the fields in the base IPv4 header. 1) Version. IP Version. 2) Header Length (HL). It can represent a maximum decimal value of 15 and the most commonly used header length of 20 bytes (header length of 0101). 3) Differentiated services. It is used to get better service. 4) Total Length. It refers to the length of the sum of the radical and the data. 5) Identification. 
It holds the value of the counter that accumulates the number of datagram. 6) Flag. It is a total of 3 bits, the lowest bit (More Fragment) means if there is still fragmentation, the middle bit (Don't Fragment) means if there is still fragmentation. 7) Fragment Offset. It represents the relative position of a slice in the original grouping after the longer grouping is fragmented. 8) Time To Live. It represents the lifetime of the datagram in the network. 9) Protocol. It indicates which protocol is used for the data carried by the datagram. 10) Head Checksum. As the datagram passes through each router, the router calculates the sum again. B. Classified IP addresses Classification of IP address is the most base addressing method, the core of which is to divide the IP address into several fixed classes, each of which is composed of two fixed-length fields: network-id and host-id. The first field indicates the network to which the host or router is connected, and the network number must be unique. The second field identifies the host or router, and a host number must be unique within the range indicated by the network number it is on. Thus, the uniqueness of an IP address is ensured. This two-level IP address can be recorded as: IP address: : ={< network number >, < host number >}, where ": : =" means "defined as". Figure 2 below shows the network number field and host number field of various IP addresses: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 50 Figure 2. Network number field and host number field in IP address Figure 2 shows: The network number field of address of class A, B and C (the field is pink in the figure) is 1, 2 and 3 word length respectively, while the category bit of 1-3 bits in the front of the network number field is specified as 0, 10 and 110 respectively. The host number fields of class A, B, and C addresses are 3, 2, and 1 word long, respectively. Class D addresses (the first four bits are 1110) are used for multicast (one-to-many communication). Class E addresses (the first four bits are 1111) are reserved for later use. A dotted decimal notation is presented to improve the readability of IP addresses when it is 32-bit binary code. In IP addresses, every eight bits is represented in decimal, with a dot inserted between the digits. Figure 3 illustrates the convenience of this method. Figure 3. Illustrates the decimal system C. Improvement of base addressing method Because the classified IP address has defects, the IP address addressing method also goes through the following two historical stages. 1) Subnet partitioning Subnet division mainly includes two contents, one is to make the IP address from two to three levels, improve the utilization of IP address space, improve International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 51 network performance and enhance the flexibility of IP address; The second is the use of subnet mask, subnet mask AND IP address bitwise "AND" operation (AND) to get the network address, so as to facilitate the datagram sent to the destination network. a) Subnet idea  The subnet is still represented as a network. 
 Borrow some bits from the host number of the network as the subnet number, and the two-level IP address becomes the three-level IP address within a certain range, which can be expressed as: IP address: : ={< network number >,< subnet number >,< host number >}  The IP datagram can be sent to the router according to the destination network number, and then the router can find the destination subnet according to the network number and subnet number, and deliver the IP datagram to the destination host. b) Subnet mask A subnet mask, also known as a network mask or address mask, is a 32-bit address that consists of a string of one’s followed by a string of zeros. It is used to indicate which bits are the subnet and host that identify an IP address. The following example illustrates the role of subnet masks: [Example] the known IP address is 132.32.63.23, and the subnet mask is 255.255.224.0.Try to find the network address. [Answer]The subnet mask is 11111111 11111111 11100000 00000000, because the first two bytes of the mask are all 1, so the first two bytes of the network address can be written as 132.32.The fourth byte of the subnet mask is all 0, so the fourth byte of the network address is 0.It can be seen that this question only needs to calculate the third byte in the address, and we can easily obtain the network address by using binary representation of the third byte of IP address and subnet mask, as shown in figure 4 below: Figure 4. Solving process of network address 2) Classless Inter-Domain Routing (constitute super-netting) The main characteristics of Classless Inter-Domain Routing (CIDR) are as follows: a) CIDR eliminates the traditional concept of classified address and subnet division. CIDR divides the IP address into a network-prefix and a host number, denoted by: IP address: : ={< network prefix >, < host number >} CIDR also uses slash notation. It is to add "/" after the IP address, and write the number of network prefix after the slash, for example: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 52 128.85.36.17/19 = 10000000, 01010101, 00100100, 00010001 b) CIDR address block CIDR combines the same network prefix with consecutive IP addresses to form a "CIDR address block", which can be specified by the smallest address in the address block and the number of digits in the network prefix. For example: 128.85.36.17/19 in the address block: The minimum address is 128.85.32.0/19=10000000 01010101 00100000 00000000 The maximum address is 128.85.63.255/19=10000000 01010101 00111111 So the above address can be recorded as 128.85.32.0/20, referred to as "/20 address block" for short. The routing table can use a CIDR address block containing multiple addresses to query the destination network. This aggregation of addresses is known as routing aggregation and is also known as composition supernettingting. III. IPV6 ADDRESS SPACE IPv6 is the sixth version of the Internet protocol. IPv6 USES 128-bit addresses (2128 bits), which is about 3.4 x 1038 addresses, but IPv6 addresses up to 128 bits in length does not say how many addresses there are per square meter of the earth. Rather, IPv6 addresses were designed to be large in size, with the aim of further subdividing them into layered routing domains that reflect the topology of the modern Internet. Using a 128-bit address space provides multiple levels of hierarchy and flexibility in designing hierarchical addressing and routing, which is lacking in the current ipv4-based Internet. 
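Before moving further into IPv6, the IPv4 subnet-mask and CIDR arithmetic worked through above can be checked with a few lines of Python. This is only an illustrative sketch using the standard ipaddress module, reproducing the worked example (132.32.63.23 with mask 255.255.224.0) and the 128.85.36.17/19 address block.

```python
import ipaddress

# Worked example above: AND the mask 255.255.224.0 with 132.32.63.23.
iface = ipaddress.IPv4Interface("132.32.63.23/255.255.224.0")
print(iface.network.network_address)   # 132.32.32.0 -- the network address found in Figure 4

# CIDR address block above: 128.85.36.17/19.
block = ipaddress.IPv4Interface("128.85.36.17/19").network
print(block.network_address)           # 128.85.32.0   (smallest address in the block)
print(block.broadcast_address)         # 128.85.63.255 (largest address in the block)
print(block.num_addresses)             # 8192 = 2**(32 - 19) addresses in the block
```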
IPv6 addresses consist of global routing prefixes, subnet ids, and interface ids. Where the global routing prefix is used to specify a site, the subnet ID is used to specify a link within the site, and the interface ID is used to specify an interface on the link. A. Base IPv6 headers IPv6 datagram with multiple optional extension headers is shown in figure 5, and IPv6 base headers are shown in figure 6. Figure 5. IPv6 datagram with multiple optional extension headers Figure 6. Basic IPv6 header with a length of 40 bytes International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 53 As shown in figure 6, the first line of the figure indicates the bit occupied by each field in the header format. Compared to IPv4, IPv6 fixed the base header with 40 bytes, eliminated many unnecessary fields, and reduced the number of private segments in the header to 8 (although the header length was doubled). The following explains the function of each field in the IPv6 basic header: 1) Version. It specifies the version of the protocol. 2) Traffic Class. It distinguishes between different IPv6 datagram categories or priorities. 3) Flow Label. It is a new mechanism for IPv6 to support pre-allocation of resources. 4) Payload Length. It specifies the number of bytes in an IPv6 datagram other than the base header. 5) Next Head. It is equivalent to the IPv4 protocol field or optional field. 6) Hop Limit. It is used to prevent datagram from being in the network indefinitely. B. IPv6 address representation method 1) Colon hexadecimal form Preferred form “n:n:n:n:n:n:n:n”.Each n represents a 16-bit value and is hexadecimal, separated by a colon. For example: “3FFE:FFFF:7654:FEDA:1245:BA98:3210:4562”. 2) Compressed form Writing a long string of zeros can be simplified using a compressed form, where a single contiguous sequence of zeros is represented by a double colon, “: : ”.This symbol can only appear once in an address. For example, the local link unicast address FE80:0:0:0:0:0:0:10 is shortened as“FE80::/10”, and the multicast address FFDE:0:0:0:0:0:0:101 is shortened as “FFED::101”.Loop address 0:0:0:0:0:0:0:1compression form “::1”. An unspecified address 0:0:0:0:0:0:0:0 is shortened as “::”. 3) Mixed form This form combines IPv4 and IPv6 addresses. In this case, the address format is “n:n:n:n:n:n:d.d.d.d”. Where each “n” represents the 16-bit value of the IPv6 address and is represented in hexadecimal, and each “d” represents the 8-bit value of the IPv4 address and is represented in decimal. C. Transition from IPv4 to IPv6 The transition from IPv4 to IPv6 can only be done incrementally, because the number of routers using IPv4 across the Internet is so large that it is impractical to set a cut-off point to upgrade the system. There is also backward compatibility when installing a new IPv6 system.IPv6 system must be able to complete the IPv4 system to receive, forward IP datagram and routing selection. Here are three strategies for transitioning to IPV6: 1) Dual stack Prior to the full transition to IPv6, there were stacks of IPv4 and IPv6 on some hosts or routers. Dual stack hosts or routers can communicate with both IPv4 and IPv6 systems. 2) Tunneling The point of this technique is that the IPv6 datagram is disguised as an IPv4 datagram, and the entire IPv6 datagram becomes the data portion of the IPv4 datagram. 
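A purely schematic sketch of this tunneling step may help; it does not build real packet headers, and both helper names are hypothetical. The point is only that the complete IPv6 datagram rides as the data portion of an outer IPv4 packet between the two tunnel endpoints and is unwrapped on the far side.

```python
def tunnel_encapsulate(ipv6_datagram: bytes, outer_ipv4_header: bytes) -> bytes:
    """Entering the tunnel: wrap the whole IPv6 datagram as the data
    portion of an IPv4 packet so that IPv4-only routers can forward it."""
    return outer_ipv4_header + ipv6_datagram

def tunnel_decapsulate(ipv4_packet: bytes, ipv4_header_len: int = 20) -> bytes:
    """Leaving the tunnel: strip the outer IPv4 header and hand the original
    IPv6 datagram back to the host's IPv6 protocol stack."""
    return ipv4_packet[ipv4_header_len:]
```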
This allows unimpeded access to the IPv4 network and, upon leaving the IPv4 network, transfers the data portion of the IPv4 datagram to the host’s IPv6 protocol stack.IPv6 datagram is submitted to the IPv6 protocol stack to complete the communication. 3) Network address conversion/protocol conversion technology Network Address Translation/Protocol Translation technology NAT-PT (Network Address Translation - Protocol Translation) is combined with SIIT Protocol Translation, dynamic Address Translation (NAT) under traditional IPv4 and appropriate application layer gateway (ALG).It enables communication between International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 54 hosts with only IPv6 installed and most applications with only IPv4 machines installed. IV. THE RESEARCH STATUS OF IPV4 AND IPV6 A. Current status of IPv4 Due to the allocation of IPv4 addresses adopts the principle of "first come, first served, distributed according to needs", the uneven distribution makes the address allocation has a huge loophole, which makes many countries and regions have insufficient IP address resources. With the development of Internet, especially the explosive growth of scale, some inherent defects of IPv4 are gradually exposed, mainly focusing on address exhaustion, rapid expansion of routing table to the bottleneck, security and service quality is difficult to guarantee, and serious waste of IPv4 address structure. The design of IPv4 protocol does not consider the real-time transmission of audio stream and video stream.IPv4 does not provide encryption and authentication mechanisms, so the secure transmission of confidential data resources cannot be guaranteed. B. Current status of IPv6 Due to the limitations of the technology era, there are many defects in the design idea of the address structure of IPv6.The richness of the IPv6128 bit address length makes it more than just a matter of extending the address. Instead of following the principle of transparency between different protocol layers, IP addresses, which should belong to the protocol of the network layer, are mixed with physical layer addresses and application layer, resulting in a series of fatal consequences. V. IPV9 ADDRESS CODE IPV9 not only expands the length of IP address, but also makes the network support more address levels. In addition, the method of address coding method of the variable-length and variable-position is adopted, and the parsing link is cancelled. The formal text representation method used by human is directly converted into machine language, which actually reduces the overhead of network. A. IPV9 header format IPV9 header format design is shown in figure 7. Figure 7. Header format of IPV9 Figure 7 design explanation is as follows: 1) Version. It has a length of four bits. Internet protocol version number, for IPV9, this field must be 9. 2) Traffic Class. It has a length of 8 bits. The three bits high are used to specify the length of the address, and the value is 0 to 7, which is the power of 2.The address length is 1Byte ~ 128Byte.The default value is 256 bits, where 0 is 16 bits, 1 is 32 bits, 2 is 64 bits, 3 is 128 bits, 4 is 256 bits, 5 is 512 bits, 6 is 1024 bits, International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 55 and 7 is 2048 bits. 
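As a small sketch of this length encoding (the mapping below follows the text's description of IPV9 rather than a published standard, and the function name is ours), the high three bits of the Traffic Class field select one of eight address lengths, 16 x 2^code bits:

```python
def ipv9_address_length_bits(code: int) -> int:
    """Map the high 3 bits of the IPV9 Traffic Class field to an address
    length in bits: 0 -> 16, 1 -> 32, 2 -> 64, ..., 7 -> 2048."""
    if not 0 <= code <= 7:
        raise ValueError("address-length code is a 3-bit value (0-7)")
    return 16 << code            # 16 * 2**code bits

# The default 256-bit address corresponds to code 4.
assert ipv9_address_length_bits(4) == 256
assert [ipv9_address_length_bits(c) for c in range(8)] == [16, 32, 64, 128, 256, 512, 1024, 2048]
```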
The last five bits specify the communication class and authentication for the source and destination addresses.0 through 15 are the priority values, where 0 through 5 are used to specify the priority class for the traffic.6 through 7 are used to specify a communication method for authentication before communication, which is used by the packet sender for traffic control and whether authentication of source and destination addresses is required.8 through 14 are used to specify absolute traffic that will not fall back when congestion is encountered.15 for virtual circuits.16 and 17 respectively assign audio and video, called absolute value, to ensure the uninterrupted transmission of audio and video. The other values are reserved for future use. 3) Flow Label. It is 20 bits long and is used to identify packages that belong to the same business flow. 4) Payload Length. It has a length of 16 bits, including the net payload of bytes, which is the number of bytes contained in the packet behind the IPV9 header. 5) Next Header. Its length is 8 bits, and this field indicates the protocol type in the field following the IPV9 header. 6) Hop Limit. Its length is 8 bits, and this field is subtracted by one each time a node forwards a packet. 7) Source address. Its length is 8 bit ~ 2048 bit; specify IPV9 packet sender address, using variable length and location method. 8) Destination address. Its length is 8 bit ~ 2048 bit, and the destination address of IPV9 packet is specified. 9) Time. It is used to control the lifetime of the address in the header. 10) Identification code. It identifies the authenticity of the address in the header. B. Text representation of IPV9 addresses This paper has developed a unified method to represent IPV9 address, including "bracket decimal", "curly braces decimal" and "parentheses bracket". 1) Bracket decimal The bracket decimal can be expressed in the following two ways: Method 1: use "[]" when the length is 2048 bits. Where, the parentheses are expressed in decimal notation, and the length can be written in indefinite length. Method 2: length 256 able address in the form of representation is "y[y] [y] [y] [y] [y] [y]", where each y represents the address as a 32 bit part, and used the decimal representation. Because 232 = 4294967296. Each "y" represents a 32 bits portion of the address and is represented in decimal. The difference in decimal number of each of the range is 0 to 9, such as the first digit from left the range is 0 ~ 4, so you don't have the phenomenon of overflow. 2) Curly braces decimal This method divides the 256-bit address into four 64-bit decimal Numbers represented by curly braces separating them. The representation is "Z}Z}Z}Z", where each Z represents a 64-bit portion of the address and is represented in decimal. It's exactly the same as Y, but it's also compatible with Y, so you can mix the two. This approach makes it very convenient for IPv4 addresses to be compatible in IPV9.Such as: z}z}z}z; z}z}y]y]y]y; z}z}y]y]y]d.d.d.d; z}z}z}y]d.d.d.d; z}z}z}y]J.J.J.J; 3) Bracketed notation International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 56 Since IPV9 has an address length of 256 bits, whether you use four or eight segments, there are still many bits in each segment. For example: ...]00000000000000000000000000110100]... ...]01011111111111111111111111111111]... The above situation input cumbersome, prone to errors. For convenience, the parenthesis notation -- (K/L) is introduced. 
Where "K" means 0 or 1 and "L" means the number of 0 or 1.In this way, the above two examples can be abbreviated as: ...](0/26) of 110100]... ...]010 (1/29)]... 4) Text representation of address prefixes The IPV9 address scheme is similar to the CIDR (unclassified addressing) scheme of IPv4 in that the address prefix is used to represent the network hierarchy. On the representation of IPV9 address prefix, a representation similar to CIDR is adopted, whose form is as follows: IPV9 address/address prefixes length For example, Address prefix 1706[0[0[0[210[22[0 of 200 bits can be expressed as: 1706[0[0[0[210[22[0[0/200 5) IPV9 address type c) Pure IPV9 address The form is Y[Y[Y[Y[Y[Y[Y[Y Each “Y” represents a decimal integer from 0 to 232 =4294967296. d) IPV9 addresses compatible with IPv4 The form is: Y[Y[Y[Y[Y[Y[Y[D.D.D.D Each “Y” represents a decimal integer from 0 to 232 =4294967296. “D” represents a decimal integer between 0 and 255 from the original IPv4. e) IPV9 addresses compatible with Ipv6 The form is: Y[Y[Y[Y[X:X:X:X:X:X:X:X Each “Y” represents a decimal integer from 0 to 232 =4294967296.The “X” represents a hexadecimal number that originally Ipv6 ranged from 0000 to FFFF. f) Special compatibility address In order to guarantee the research results of IPv4 and Ipv6, IPV9 has designed some compatible addresses. The new compatibility address design idea is in this part of the address with the appropriate prefix form. In order to make their representation more convenient and ensure accuracy, the following abbreviations were introduced: y[y[y[y[x:x:x:x:x:x:d.d.d.d Where, each y represents the address as 32 bits, represented by decimal; Each “x” represents the original Ipv6 address of 16 bits, in hexadecimal; Each “d” represents the original IPv4 address of 8 bits, in decimal notation. g) [] full decimal address In order to facilitate the application of logistics code and full decimal address. Category number 5 is recommended. In the power of 10 to the power of 512, fixed length positioning method is adopted according to application needs. h) IPV9 address for transitional period IPV9 is compatible with IPv4 and IPv6 technical protocols for the Internet, but IPv4 and IPv6 technical protocols are not compatible with IPV9 in reverse. C. IPv4 and IPv6 transition to IPV9 In order to solve the IPv4 flat to IPV9 transition, special design IPV9 transition address. Transitioning IPv4 to a 232 address in the IPV9 address allows a small change to the current system to complete the transition. In IPV9, there is a section of J.J.J.J address, where each “J” represents a decimal number from 0 to 28 (0~255), where the preceding [7] can be omitted in the middle of the local address, that is, local users (or designated users) can use J.J.J.J directly, which is different from the original IPv4 D.D.D.D. At the same time, this part of the user in order to smooth the International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 57 transition to full decimal can be allocated at the same time decimal. In order to improve the software and hardware in the future, there is no need to re-address, such as [7]741852963 can be written into [7]44.55.199.35 can be directly used in a local IP network to write 44.55.199.35, so that the original terminal can be used. Interim IPV9 address system can be modified to the original IPv4 system. 
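The transitional example above, in which [7]741852963 is rewritten as [7]44.55.199.35, is simply a regrouping of one 32-bit value into four decimal octets. A minimal sketch of that conversion (the function names are ours, not part of any IPV9 specification):

```python
def to_jjjj(value: int) -> str:
    """Write a 32-bit transitional value as J.J.J.J, each J in 0-255."""
    assert 0 <= value < 2 ** 32
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

def from_jjjj(text: str) -> int:
    """Inverse conversion from J.J.J.J back to a single decimal value."""
    parts = [int(p) for p in text.split(".")]
    assert len(parts) == 4 and all(0 <= p <= 255 for p in parts)
    return (parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]

print(to_jjjj(741852963))         # 44.55.199.35, matching the example in the text
print(from_jjjj("44.55.199.35"))  # 741852963
```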
The IPv4 header is also used, but the version number is 9 to distinguish the original IPv4.However, users may use the original terminal equipment within the territory. IPV9 Header TCP/UDP Header data IPv4 Header IPV9 Header TCP/UDP Header data IPV9 Header TCP/UDP Header data Raw IPV9 datagrams The IPV9 header encapsulates the IPv4 header in the tunnel IPV9 datagrams restored through the tunnel IPV9 Host-1 IPv4/IPV9 Tunnel ROuter-R1 IPv4 Router-1 IPv4 Router-2 IPv4/IPV9 Tunnel ROuter-R2 IPV9 Host-2 Internet[IPv4] Figure 8. IPV9 is IPv4 compatible Figure 8 above means that it is possible to build the IPV9 backbone, provide application services and gradually upgrade the backbone network to IPV9 without affecting or modifying the existing terminal IPv4 applications.IPV9 inherited and transplanted most of the application functions on the existing IPv4 Internet, and successfully solved the development problem of IPV9 online application functions. Most of the existing Internet application functions can be copied to the IPV9 network, and began to enter the practical stage. At the same time, the application of IPV9 will continue to innovate and develop. D. Support IPV9 device working mode In the decimal network working group of scientific research, the current IPV9 support devices are ipv9-100m WIFI router, ipv9-1000m WI-FI router, ipv9-10000m router, ipv9-100g router, ipv9-linux client and ipv9-windows client. The IPV9 router network interface types include ordinary Ethernet interface, 4to9 interface (convert IPv4 packets into IPV9 packets according to custom mapping rules), 9to4 interface (convert IPV9 packets into IPv4 packets according to custom mapping rules) and sit interface (realize IPV9 data packets to be transmitted in the current IPv4 network. Implement 9over4, where IPV9 data over is the data portion of the IPv4 packet. The following takes IPV9 100/1000m WIFI router as an example to explain its working mode VPN. Under the VPN mode, most configuration of the router is completed by the background server, which is divided into IPv4 mode and IPV9 mode. In IPv4 mode, the router runs the NAT module, and the client (IPv4) accesses the Internet network in the same way as other IPv4 routers. When the client accesses the server in IPV9 backbone network, the VPN server will communicate with it. Although the pure IPV9 client is not supported in this mode, the communication between the client of WIFI router A and the client of WIF router B is supported. The data flow diagram is shown in figure 9 below: International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 58 (a) one-way access between IPv4 of VPN tunnel and Beijing backbone node (b) IPv4 reciprocal visits within the VPN tunnel Figure 9. (a) one-way access between IPv4 of VPN tunnel and Beijing backbone node; (b) IPv4 reciprocal visits within the VPN tunnel International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 59 In IPV9 mode, clients (IPv4) access the Internet network in the same way as other IPv4 routers. This mode supports the pure IPV9 client, but does not support the communication between the client of WIFI router A and the client of WIF router B, as shown in data flow figure 10. (a) IPV9 exchange visits between VPN tunnel and Beijing backbone node (b) IPV9 mutual visits within the VPN tunnel Figure 10. 
(a) IPV9 exchange visits between VPN tunnel and Beijing backbone node; (b) IPV9 mutual visits within the VPN tunnel International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 60 To sum up, IPV9 inherits most functions on the existing Internet. In the protection of IPv4 and IPv6 research results, address expansion, security verification and other operations. This makes IPV9 more competitive in the development of the Internet, and its functions will continue to develop with the development of technology. VI. SUMMARIZES Although the use of NAT (" network address translation "), CIDR (" classless inter-domain routing ") and other technologies can alleviate the IPv4 crisis to some extent. However, this does not fundamentally solve the problem, and at the same time, it will bring about new problems in cost, service quality, safety and other aspects, but create greater challenges. But the new generation network layer protocol IPv6 itself also has the corresponding question, causes it not to have the Omni-directional. In this situation, a new network will come into being, which not only represents the progress of people's technology, but also symbolizes people's dedication to new technology. This paper mainly designs and researches the new network address coding, compares the IPv4 and IPv6 address coding, and proposes a new address coding. This method solves the problem of address exhaustion thoroughly, and puts forward the theory of verification before communication, which solves the problems of current network address exhaustion and information security. It also describes the ipv9-compatible IPv4 working mode, which guarantees the existing research results, provides some new design ideas for new network addresses, and promotes the development of new network addresses. REFERENCES [1] RFC - Internet Standard. Internet Protocol, DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION, RFC 791, 1981.09. [2] S. Deering, R. Hinden, Network Working Group.Internet Protocol, Version 6 (IPv6)-Specification, RFC-1883, 1995.12. [3] M. Crawford. Network Working Group.Transmission of IPv6 Packets over Ethernet Networks.RFC-2464, 1998.12. [4] J. Onions, Network Working Group.A Historical Perspective on the usage of IP version 9. RFC1606. 1994.04. [5] V. Cerf, Network Working Group. A VIEW FROM THE 21ST CENTURY, RFC1607. 1994.04. [6] Information technology-FutureNetwork-Problem statement and requirement-Part 5: Security, ISO/IEC DTR 29181-5,2014,12. [7] Wang Wenfeng, Xie Jianping,etc.Product and servicedigital identification format for information procession.SJ/T11603-2016, 2016.06. [8] S. Deering, R. Hinden, Internet Protocol, Version 6 (IPv6)-Specification, Network Working Group. RFC-1883, 1995.12. work_27d5tr326feydgcx32t4nhl424 ---- Transactions of the Association for Computational Linguistics, 1 (2013) 89–98. Action Editor: Noah Smith. Submitted 12/2012; Published 5/2013. c©2013 Association for Computational Linguistics. A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization Jiwei Li School of Computer Science Carnegie Mellon University bdlijiwei@gmail.com Sujian Li Laboratory of Computational Linguistics Peking University lisujian@pku.edu.cn Abstract Supervised learning methods and LDA based topic model have been successfully applied in the field of multi-document summarization. 
In this paper, we propose a novel supervised approach that can incorporate rich sentence features into Bayesian topic models in a principled way, thus taking advantage of both topic models and feature-based supervised learning methods. Experimental results on DUC2007, TAC2008 and TAC2009 demonstrate the effectiveness of our approach. 1 Introduction Query-focused multi-document summarization (Nenkova et al., 2006; Wan et al., 2007; Ouyang et al., 2010) can help users grasp the main idea of a document collection. In query-focused summarization, a specific topic description, such as a query, which expresses the most important topic information, is given together with the document collection, and a summary is generated according to that topic. Supervised models have been widely used in summarization (Li et al., 2009; Shen et al., 2007; Ouyang et al., 2010). Supervised models usually regard summarization as a classification or regression problem and use various sentence features to build a classifier from labeled negative and positive samples. However, existing supervised approaches seldom exploit the intrinsic structure among sentences. This disadvantage usually gives rise to serious problems such as unbalanced content and low recall in summaries. Recently, LDA-based (Blei et al., 2003) Bayesian topic models have been widely applied in multi-document summarization because Bayesian approaches can offer clear and rigorous probabilistic interpretations for summaries (Daume and Marcu, 2006; Haghighi and Vanderwende, 2009; Jin et al., 2010; Mason and Charniak, 2011; Delort and Alfonseca, 2012). Existing Bayesian approaches label sentences or words with topics, and sentences that are closely related to the query or that generalize the documents well are selected into summaries. However, the LDA topic model suffers from the intrinsic disadvantage that it only uses word frequency for topic modeling and cannot use other useful text features such as position or word order (Zhu and Xing, 2010). For example, the first sentence in a document may be more important for the summary, since it is more likely to give a global generalization of the document. It is hard for the LDA model to consider such information, so useful information is lost. It naturally follows that we may improve summarization performance by making full use of both useful text features and the latent semantic structure provided by the LDA topic model. One related work is from Celikyilmaz and Hakkani-Tur (2010). They built a hierarchical topic model called Hybhsum based on LDA for topic discovery and assumed this model can produce appropriate scores for sentence evaluation. The scores are then used for tuning the weights of various features that are helpful for summary generation. Their work made a good step toward combining topic models with feature-based supervised learning. However, it is unclear whether a topic model based only on word frequency is good enough to generate an appropriate sentence score for regression. In fact, how to incorporate features into the LDA topic model has been an open problem. Supervised topic models such as sLDA (Blei and McAuliffe, 2007) give us some inspiration. In sLDA, each document is associated with a labeled feature, and sLDA can integrate such a feature into LDA for topic modeling in a principled way.
With reference to the work of supervised LDA models, in this paper, we propose a novel sentence feature based Bayesian model S-sLDA for multi- document summarization. Our approach can natu- rally combine feature based supervised methods and topic models. The most important and challeng- ing problem in our model is the tuning of feature weights. To solve this problem, we transform the problem of finding optimum feature weights into an optimization algorithm and learn these weights in a supervised way. A set of experiments are con- ducted based on the benchmark data of DUC2007, TAC2008 and TAC2009, and experimental results show the effectiveness of our model. The rest of the paper is organized as follows. Sec- tion 2 describes some background and related works. Section 3 describes our details of S-sLDA model. Section 4 demonstrates details of our approaches, including learning, inference and summary gener- ation. Section 5 provides experiments results and Section 6 concludes the paper. 2 Related Work A variety of approaches have been proposed for query-focused multi-document summarizations such as unsupervised (semi-supervised) approaches, supervised approaches, and Bayesian approaches. Unsupervised (semi-supervised) approaches such as Lexrank (Erkan and Radex, 2004), manifold (Wan et al., 2007) treat summarization as a graph- based ranking problem. The relatedness between the query and each sentence is achieved by impos- ing querys influence on each sentence along with the propagation of graph. Most supervised ap- proaches regard summarization task as a sentence level two class classification problem. Supervised machine learning methods such as Support Vector Machine(SVM) (Li, et al., 2009), Maximum En- tropy (Osborne, 2002) , Conditional Random Field (Shen et al., 2007) and regression models (Ouyang et al., 2010) have been adopted to leverage the rich sentence features for summarization. Recently, Bayesian topic models have shown their power in summarization for its clear probabilistic interpretation. Daume and Marcu (2006) proposed Bayesum model for sentence extraction based on query expansion concept in information retrieval. Haghighi and Vanderwende (2009) proposed topic- sum and hiersum which use a LDA-like topic model and assign each sentence a distribution over back- ground topic, doc-specific topic and content topics. Celikyilmaz and Hakkani-Tur (2010) made a good step in combining topic model with supervised fea- ture based regression for sentence scoring in sum- marization. In their model, the score of training sentences are firstly got through a novel hierarchi- cal topic model. Then a featured based support vec- tor regression (SVR) is used for sentence score pre- diction. The problem of Celikyilmaz and Hakkani- Turs model is that topic model and feature based re- gression are two separate processes and the score of training sentences may be biased because their topic model only consider word frequency and fail to con- sider other important features. Supervised feature based topic models have been proposed in recent years to incorporate different kinds of features into LDA model. Blei (2007) proposed sLDA for doc- ument response pairs and Daniel et al. (2009) pro- posed Labeled LDA by defining a one to one corre- spondence between latent topic and user tags. Zhu and Xing (2010) proposed conditional topic random field (CTRF) which addresses feature and indepen- dent limitation in LDA. 
3 Model description 3.1 LDA and sLDA The hierarchical Bayesian LDA (Blei et al., 2003) models the probability of a corpus on hidden topics as shown in Figure 1(a). Let K be the number of topics , M be the number of documents in the cor- pus and V be vocabulary size. The topic distribution of each document θm is drawn from a prior Dirichlet distribution Dir(α), and each document word wmn is sampled from a topic-word distribution φz spec- ified by a drawn from the topic-document distribu- tion θm. β is a K×M dimensional matrix and each βk is a distribution over the V terms. The generat- ing procedure of LDA is illustrated in Figure 2. θm is a mixture proportion over topics of document m and zmn is a K dimensional variable that presents the topic assignment distribution of different words. Supervised LDA (sLDA) (Blei and McAuliffe 2007) is a document feature based model and intro- 90 Figure 1: Graphical models for (a) LDA model and (b) sLDA model. 1. Draw a document proportion vector θm|α ∼ Dir(α) 2. For each word in m (a)draw topic assignment zmn|θ ∼ Multi(θzmn ) (b)draw word wmn|zmn,β ∼ Multi(βzmn ) Figure 2: Generation process for LDA duces a response variable to each document for topic discovering, as shown in Figure 1(b). In the gener- ative procedure of sLDA, the document pairwise la- bel is draw from y|−→zm,η,δ2 ∼ p(y|−→zm,η,δ2), where−→zm = 1N ∑N n=1 zm,n. 3.2 Problem Formulation Here we firstly give a standard formulation of the task. Let K be the number of topics, V be the vo- cabulary size and M be the number of documents. Each document Dm is represented with a collection of sentence Dm = {Ss}s=Nms=1 where Nm denotes the number of sentences in mth document. Each sentence is represented with a collection of words {wmsn}n=Nmsn=1 where Nms denotes the number of words in current sentence. −−→ Yms denotes the feature vector of current sentence and we assume that these features are independent. 3.3 S-sLDA zms is the hidden variable indicating the topic of current sentence. In S-sLDA, we make an assump- tion that words in the same sentence are generated from the same topic which was proposed by Gruber (2007). zmsn denotes the topic assignment of cur- rent word. According to our assumption, zmsn = Figure 3: Graph model for S-sLDA model 1. Draw a document proportion vector θm|α ∼ Dir(α) 2. For each sentence in m (a)draw topic assignment zms|θ ∼ Multi(θzmn ) (b)draw feature vector −−→ Yms|zms,η ∼ p( −−→ Yms|zms,η) (c)for each word wmsn in current sentence draw wmsn|zms,β ∼ Multi(βzms ) Figure 4: generation process for S-sLDA zms for any n ∈ [1,Nms]. The generative approach of S-sLDA is shown in Figure 3 and Figure 4. We can see that the generative process involves not only the words within current sentence, but also a series of sentence features. The mixture weights over fea- tures in S-sLDA are defined with a generalized lin- ear model (GLM). p( −−→ Yms|zms,η) = exp(zTmsη) −−→ Yms ∑ zms exp(zTmsη) −−→ Yms (1) Here we assume that each sentence has T features and −−→ Yms is a T × 1 dimensional vector. η is a K × T weight matrix of each feature upon topics, which largely controls the feature generation proce- dure. Unlike s-LDA where η is a latent variable esti- mated from the maximum likelihood estimation al- gorithm, in S-sLDA the value of η is trained through a supervised algorithm which will be illustrated in detail in Section 3. 
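Since z_ms is a one-hot topic indicator, Eqn. (1) is simply a softmax over the K topics of the linear scores eta_k . Y_ms. The following NumPy sketch evaluates it; the toy values of eta and Y are ours and only illustrate the shapes involved.

```python
import numpy as np

def feature_likelihood(eta: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Evaluate Eqn. (1): p(Y_ms | z_ms = k, eta) for every topic k, where
    eta is the K x T feature-weight matrix and y is the T-dimensional
    feature vector of one sentence."""
    scores = eta @ y                 # eta_k . Y_ms for each topic k
    scores -= scores.max()           # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()         # normalize over the K topic assignments

# Toy example with K = 3 topics and T = 2 sentence features.
eta = np.array([[0.5, 1.0],
                [0.0, 0.2],
                [-0.3, 0.1]])
y = np.array([1.0, 0.4])
print(feature_likelihood(eta, y))    # a probability distribution over the 3 topics
```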
3.4 Posterior Inference and Estimation Given a document and labels for each sentence, the posterior distribution of the latent variables is: p(θ,z1:N|w1:N,Y,α,β1:K,η) = ∏ m p(θm|α) ∏ s[p(zms|θm)p( −−→ Yms|zms,η) ∏ n p(wmsn|zmsn,βzmsn ]∫ dθp(θm|α) ∑ z ∏ s[p(zms|θm)p( −−→ Yms|zms,η) ∏ n p(wmsn|βzmsn )] (2) Eqn. (2) cannot be efficiently computed. By applying the Jensens inequality, we obtain a lower bound of the log likelihood of document p(θ,z1:N|w1:N, −−→ Yms,α,β1:K,η) ≥ L, where L = ∑ ms E[logP(zms|θ)] + ∑ ms E[logP( −−→ Yms|zms,η)]+ ∑ m E[logP (θ|α)] + ∑ msn E[logP(wmsn|zms,β)] + H(q) (3) 91 where H(q) = −E[logq] and it is the entropy of variational distribution q is defined as q(θ,z|γ,φ) = ∏ mk q(θm|γ) ∏ sn q(zmsn|φms) (4) here γ a K-dimensional Dirichlet parameter vector and multinomial parameters. The first, third and forth terms of Eqn. (3) are identical to the corre- sponding terms for unsupervised LDA (Blei et al., 2003). The second term is the expectation of log probability of features given the latent topic assign- ments. E[logP( −−→ Yms|zms,η)] = E(zms) Tη −−→ Yms − log ∑ zms exp(zTmsη −−→ Yms) (5) where E(zms)T is a 1 × K dimensional vector [φmsk] k=K k=1 . The Bayes estimation for S-sLDA model can be got via a variational EM algorithm. In EM procedure, the lower bound is firstly minimized with respect to γ and φ, and then minimized with α and β by fixing γ and φ. E-step: The updating of Dirichlet parameter γ is identical to that of unsupervised LDA, and does not involve feature vector −−→ Yms. γnewm ← α + ∑ s∈m φs (6) φnewsk ∝ exp{E[logθm|γ] + Nms∑ n=1 E[log(wmsn|β1:K)]+ T∑ t=1 ηktYst} = exp[Ψ(γmk) − Ψ( K∑ k=1 γmk) + T∑ t=1 ηktYst] (7) where Ψ(·) denotes the log Γ function. ms denotes the document that current sentence comes from and Yst denotes the tth feature of sentence s. M-step: The M-step for updating β is the same as the pro- cedure in unsupervised LDA, where the probability of a word generated from a topic is proportional to the number of times this word assigned to the topic. βnewkw = M∑ m=1 Nm∑ s=1 Nms∑ n=1 1(wmsn = w)φ k ms (8) 4 Our Approach 4.1 Learning In this subsection, we describe how we learn the fea- ture weight η in a supervised way. The learning pro- cess of η is a supervised algorithm combined with variational inference of S-sLDA. Given a topic de- scription Q1 and a collection of training sentences S from related documents, human assessors assign a score v(v = −2,−1, 0, 1, 1) to each sentence in S. The score is an integer between −2 (the least desired summary sentences) and +2 (the most desired sum- mary sentences), and score 0 denotes neutral atti- tude. Ov = {ov1,ov2, ...,vvk}(v = −2,−1, 0, 1, 2) is the set containing sentences with score v. Let φQk denote the probability that query is generated from topic k. Since query does not belong to any docu- ment, we use the following strategy to leverage φQk φQk = ∏ w∈Q βkw· 1 M M∑ m=1 exp[Ψ(γmk)−Ψ( K∑ k=1 γmk)] (9) In Equ.(9), ∏ w∈Q βkw denotes the probability that all terms in query are generated from topic k and 1 M ∑M m=1 exp[Ψ(γmk)−Ψ( ∑K k=1 γmk)] can be seen as the average probability that all documents in the corpus are talking about topic k. Eqn. (9) is based on the assumption that query topic is relevant to the main topic discussed by the document corpus. This is a reasonable assumption and most previous LDA summarization models are based on similar as- sumptions. 
Next, we define φOv,k for sentence set Ov, which can be interpreted as the probability that all sen- tences in collection Ov are generated from topic k. φOv,k = 1 |Ov| ∑ s∈Ov φsk,k ∈ [1,K],v ∈ [−2, 2] (10) |Ov| denotes the number of sentences in set Ov. In- spired by the idea that desired summary sentences would be more semantically related with the query, we transform problem of finding optimum η to the following optimization problem: minηL(η) = v=2∑ v=−2 v ·KL(Ov||Q); T∑ t=1 ηkt = 1 (11) 1We select multiple queries and their related sentences for training 92 where KL(Ov||Q) is the Kullback-Leibler diver- gence between the topic and sentence set Ov as shown in Eqn.(12). KL(Ov||Q) = K∑ k=1 φOvklog φOvk φQk (12) In Eqn. (11), we can see that O2, which contain de- sirable sentences, would be given the largest penalty for its KL divergence from Query. The case is just opposite for undesired set. Our idea is to incorporate the minimization pro- cess of Eqn.(11) into variational inference process of S-sLDA model. Here we perform gradient based optimization method to minimize Eqn.(11). Firstly, we derive the gradient of L(η) with respect to η. ∂L(η) ηxy = v=2∑ v=−2 v · ∂KL(Qv||Q) ∂ηxy (13) ∂KL(Qv||Q) ∂ηxy = K∑ k=1 1 |Qv| (1 + log ∑ s∈Qv |Qv| ) ∑ s∈Qv ∂φsk ∂ηxy − K∑ k=1 1 |Qv| ∑ s∈Qv ∂Qsk ηxy − K∑ k=1 1 Qv ∑ s∈Qvφsk φQk ∂φsk ∂ηxy (14) For simplification, we regard β and γ as constant during updating process of η, so ∂φQk ∂ηxy = 0.2 We can further get first derivative for each labeled sentence. ∂φsk ηxy ∝    Ysyexp[Ψ(γmsi) − Ψ( K∑ k=1 γmsk) + T∑ t=1 ηktYsy] × ∏ w∈s βkw if k = x 0 if k 6= x (15) 4.2 Feature Space Lots of features have been proven to be useful for summarization (Louis et al., 2010). Here we dis- cuss several types of features which are adopted in S-sLDA model. The feature values are either binary or normalized to the interval [0,1]. The following features are used in S-sLDA: Cosine Similarity with query: Cosine similarity is based on the tf-idf value of terms. 2This is reasonable because the influence of γ and β have been embodied in φ during each iteration. Local Inner-document Degree Order: Local Inner document Degree Order is a binary feature which indicates whether Inner-document Degree (IDD) of sentence s is the largest among its neighbors. IDD means the edge number between s and other sen- tences in the same document. Document Specific Word: 1 if a sentence contains document specific word, 0 otherwise. Average Unigram Probability (Nenkova and Van- derwende, 2005; Celikyilmaz and Hakkani-Tur 2010): As for sentence s, p(s) = ∑ w∈s 1 |s|pD(w), where pD(w) is the observed unigram probability in document collection. In addition, we also use the commonly used fea- tures including sentence position, paragraph po- sition, sentence length and sentence bigram fre- quency. E-step initialize φ0sk := 1/K for all i and s. initialize γmi := αmi + N)m/K for all i. initialize ηkt = 0 for all k and t. while not convergence for m = 1 : M update γt+1m according to Eqn.(6) for s = 1 : Nm for k = 1 : K update φt+1sk according to Eqn.(7) normalize the sum of φt+1sk to 1. Minimize L(η) according to Eqn.(11)-(15). M-step: update β according to Eqn.(8) Figure 5: Learning process of η in S-sLDA 4.3 Sentence Selection Strategy Next we explain our sentence selection strategy. Ac- cording to our intuition that the desired summary should have a small KL divergence with query, we propose a function to score a set of sentences Sum. 
We use a decreasing logistic function ζ(x) = 1/(1+ ex) to refine the score to the range of (0,1). Score(Sum) = ζ(KL(sum||Q)) (16) Let Sum? denote the optimum update summary. We can get Sum? by maximizing the scoring function. Sum? = arg max Sum∈S&&words(Sum)≤L Score(Sum) (17) 93 1. Learning: Given labeled set Ov, learn the feature weight vector η using algorithm in Figure 5. 2. Given new data set and η, use algorithm in section 3.3 for inference. (The only difference between this step and step (1) is that in this step we do not need minimize L(η). 3. Select sentences for summarization from algo- rithm in Figure 6. Figure 6: Summarization Generation by S-sLDA. A greedy algorithm is applied by adding sentence one by one to obtain Sum?. We use G to denote the sentence set containing selected sentences. The algorithm first initializes G to Φ and X to SU. Dur- ing each iteration, we select one sentence from X which maximize Score(sm ∪G). To avoid topic re- dundancy in the summary, we also revise the MMR strategy (Goldstein et al., 1999; Ouyang et al., 2007) in the process of sentence selection. For each sm, we compute the semantic similarity between sm and each sentence st in set Y in Eqn.(18). cos−sem(sm,st) = ∑ k φsmkφstk√∑ k φ 2 smk √∑ k φ 2 stk (18) We need to assure that the value of semantic similar- ity between two sentences is less than Thsem. The whole procedure for summarization using S-sLDA model is illustrated in Figure 6. Thsem is set to 0.5 in the experiments. 5 Experiments 5.1 Experiments Set-up The query-focused multi-document summarization task defined in DUC3(Document Understanding Conference) and TAC4(Text Analysis Conference) evaluations requires generating a concise and well organized summary for a collection of related news documents according to a given query which de- scribes the users information need. The query usually consists of a title and one or more narra- tive/question sentences. The system-generated sum- maries for DUC and TAC are respectively limited to 3http://duc.nist.gov/. 4http://www.nist.gov/tac/. 250 words and 100 words. Our experiment data is composed of DUC 2007, TAC5 2008 and TAC 2009 data which have 45, 48 and 44 collections respec- tively. In our experiments, DUC 2007 data is used as training data and TAC (2008-2009) data is used as the test data. Stop-words in both documents and queries are removed using a stop-word list of 598 words, and the remaining words are stemmed by Porter Stem- mer6. As for the automatic evaluation of summa- rization, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures, including ROUGE- 1, ROUGE-2, and ROUGE-SU47 and their corre- sponding 95% confidence intervals, are used to eval- uate the performance of the summaries. In order to obtain a more comprehensive measure of summary quality, we also conduct manual evaluation on TAC data with reference to (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2011; Delort and Alfonseca, 2011). 5.2 Comparison with other Bayesian models In this subsection, we compare our model with the following Bayesian baselines: KL-sum: It is developed by Haghighi and Vanderwende (Lin et al., 2006) by using a KL- divergence based sentence selection strategy. KL(Ps||Qd) = ∑ w P(w)log P(w) Q(w) (19) where Ps is the unigram distribution of candidate summary and Qd denotes the unigram distribution of document collection. Sentences with higher ranking score is selected into the summary. 
HierSum: A LDA based approach proposed by Haghighi and Vanderwende (2009), where unigram distribution is calculated from LDA topic model in Equ.(14). Hybhsum: A supervised approach developed by Celikyilmaz and Hakkani-Tur (2010). For fair comparison, baselines use the same pro- precessing methods with our model and all sum- 5Here, we only use the docset-A data in TAC, since TAC data is composed of docset-A and docset-B data, and the docset- B data is mainly for the update summarization task. 6http://tartarus.org/ martin/PorterStemmer/. 7Jackknife scoring for ROUGE is used in order to compare with the human summaries. 94 maries are truncated to the same length of 100 words. From Table 1 and Table 2, we can Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3724 0.1030 0.1342 approach (0.3660-0.3788) (0.0999-0.1061) (0.1290-0.1394) Hybhsum 0.3703 0.1007 0.1314 (0.3600-0.3806) (0.0952-0.1059) (0.1241-0.1387) HierSum 0.3613 0.0948 0.1278 (0.3374-0.3752) (0.0899-0.0998) (0.1197-0.1359) KLsum 0.3504 0.0917 0.1234 (0.3411-0.3597) (0.0842-0.0992) (0.1155-0.1315) StandLDA 0.3368 0.0797 0.1156 (0.3252-0.3386) (0.0758-0.0836) (0.1072-0.1240) Table 1: Comparison of Bayesian models on TAC2008 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3903 0.1223 0.1488 approach (0.3819-0.3987) (0.1167-0.1279) (0.1446-0.1530) Hybhsum 0.3824 0.1173 0.1436 (0.3686-0.3952) (0.1132-0.1214) (0.1358-0.1514) HierSum 0.3706 0.1088 0.1386 (0.3624-0.3788) (0.0950-0.1144) (0.1312-0.1464) KLsum 0.3619 0.0972 0.1299 (0.3510-0.3728) (0.0917-0.1047) (0.1213-0.1385) StandLDA 0.3552 0.0847 0.1214 (0.3447-0.3657) (0.0813-0.0881) (0.1141-0.1286) Table 2: Comparison of Bayesian models on TAC2009 see that among all the Bayesian baselines, Hybh- sum achieves the best result. This further illus- trates the advantages of combining topic model with supervised method. In Table 1, we can see that our S-sLDA model performs better than Hybhsum and the improvements are 3.4% and 3.7% with re- spect to ROUGE-2 and ROUGE-SU4 on TAC2008 data. The comparison can be extended to TAC2009 data as shown in Table 2: the performance of S- sLDA is above Hybhsum by 4.3% in ROUGE-2 and 5.1% in ROUGE-SU4. It is worth explaining that these achievements are significant, because in the TAC2008 evaluation, the performance of the top ranking systems are very close, i.e. the best system is only 4.2% above the 4th best system on ROUGE- 2 and 1.2% on ROUGE-SU4. 5.3 Comparison with other baselines. In this subsection, we compare our model with some widely used models in summarization. Manifold: It is the one-layer graph based semi- supervised summarization approach developed by Wan et al.(2008). The graph is constructed only con- sidering sentence relations using tf-idf and neglects topic information. LexRank: Graph based summarization approach (Erkan and Radev, 2004), which is a revised version of famous web ranking algorithm PageRank. It is an unsupervised ranking algorithms compared with Manifold. SVM: A supervised method - Support Vector Ma- chine (SVM) (Vapnik 1995) which uses the same features as our approach. MEAD: A centroid based summary algorithm by Radev et al. (2004). Cluster centroids in MEAD consists of words which are central not only to one article in a cluster, but to all the articles. Similarity is measure using tf-idf. 
At the same time, we also present the top three participating systems with regard to ROUGE-2 on TAC2008 and TAC2009 for comparison, denoted as (denoted as SysRank 1st, 2nd and 3rd)(Gillick et al., 2008; Zhang et al., 2008; Gillick et al., 2009; Varma et al., 2009). The ROUGE scores of the top TAC system are directly provided by the TAC evaluation. From Table 3 and Table 4, we can see that our approach outperforms the baselines in terms of ROUGE metrics consistently. When compared with the standard supervised method SVM, the relative improvements over the ROUGE-1, ROUGE-2 and ROUGE-SU4 scores are 4.3%, 13.1%, 8.3% respec- tively on TAC2008 and 7.2%, 14.9%, 14.3% on TAC2009. Our model is not as good as top par- ticipating systems on TAC2008 and TAC2009. But considering the fact that our model neither uses sen- tence compression algorithm nor leverage domain knowledge bases like Wikipedia or training data, such small difference in ROUGE scores is reason- able. 5.4 Manual Evaluations In order to obtain a more accurate measure of sum- mary quality for our S-sLDA model and Hybhsum, we performed a simple user study concerning the following aspects: (1) Overall quality: Which sum- mary is better overall? (2) Focus: Which summary contains less irrelevant content? (3)Responsiveness: Which summary is more responsive to the query. (4) Non-Redundancy: Which summary is less re- dundant? 8 judges who specialize in NLP partic- ipated in the blind evaluation task. Evaluators are presented with two summaries generated by S-sLDA 95 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3724 0.1030 0.1342 approach (0.3660-0.3788) (0.0999-0.1061) (0.1290-0.1394) SysRank 1st 0.3742 0.1039 0.1364 (0.3639-0.3845) (0.0974-0.1104) (0.1285-0.1443) SysRank 2nd 0.3717 0.0990 0.1326 (0.3610-0.3824 (0.0944-0.1038) (0.1269-0.1385) SysRank 3rd 0.3710 0.0977 0.1329 (0.3550-0.3849) (0.0920-0.1034) (0.1267-0.1391) PageRank 0.3597 0.0879 0.1221 (0.3499-0.3695) (0.0809-0.0950) (0.1173-0.1269) Manifold 0.3621 0.0931 0.1243 (0.3506-0.3736) (0.0868-0.0994) (0.1206-0.1280) SVM 0.3588 0.0921 0.1258 (0.3489-0.3687) (0.0882-0.0960) (0.1204-0.1302) MEAD 0.3558 0.0917 0.1226 (0.3489-0.3627) (0.0882-0.0952) (0.1174-0.1278) Table 3: Comparison with baselines on TAC2008 Methods ROUGE-1 ROUGE-2 ROUGE-SU4 Our 0.3903 0.1223 0.1488 approach (0.3819-0.3987) (0.1167-0.1279) (0.1446-0.1530) SysRank 1st 0.3917 0.1218 0.1505 (0.3778-0.4057) (0.1122-0.1314) (0.1414-0.1596) SysRank 2nd 0.3914 0.1212 0.1513 (0.3808-0.4020) (0.1147-0.1277) (0.1455-0.1571) SysRank 3rd 0.3851 0.1084 0.1447 (0.3762-0.3932) (0.1025-0.1144) (0.1398-0.1496) PageRank 0.3616 0.0849 0.1249 (0.3532-0.3700) (0.0802-0.0896) (0.1221-0.1277) Manifold 0.3713 0.1014 0.1342 (0.3586-0.3841) (0.0950-0.1178) (0.1299-0.1385) SVM 0.3649 0.1028 0.1319 (0.3536-0.3762) (0.0957-0.1099) (0.1258-0.1380) MEAD 0.3601 0.1001 0.1287 (0.3536-0.3666) (0.0953-0.1049) (0.1228-0.1346) Table 4: Comparison with baselines on TAC2009 and Hybhsum, as well as the four questions above. Then they need to answer which summary is better (tie). We randomly select 20 document collections from TAC 2008 data and randomly assign two sum- maries for each collection to three different evalua- tors to judge which model is better in each aspect. As we can see from Table 5, the two models al- most tie with respect to Non-redundancy, mainly because both models have used appropriate MMR strategies. 
But as for Overall quality, Focus and Responsiveness, the S-sLDA model outperforms Hybhsum, based on a t-test at the 95% confidence level.
Table 5: Results of the manual evaluation (number of wins over the 60 pairwise judgments).
Overall quality: Our approach 37, Hybhsum 14, Tie 9
Focus: Our approach 32, Hybhsum 18, Tie 10
Responsiveness: Our approach 33, Hybhsum 13, Tie 14
Non-redundancy: Our approach 13, Hybhsum 11, Tie 36
Table 6 shows the example summaries generated respectively by the two models for document collection D0803A-A in TAC2008, whose query is "Describe the coal mine accidents in China and actions taken". From Table 6, we can see that each sentence in these two summaries is somewhat related to topics of coal mines in China. We also observe that the summary in Table 6(a) is better than that in Table 6(b), tending to select shorter sentences and provide more information. This is because, in the S-sLDA model, topic modeling is determined simultaneously by various features including terms and other ones such as sentence length, sentence position and so on, which can contribute to summary quality. As we can see, in Table 6(b), sentences (3) and (5) provide some unimportant information such as "somebody said", though they contain some words which are related to topics about coal mines.
Table 6: Example summary text generated by systems (a) S-sLDA and (b) Hybhsum (D0803A-A, TAC2008).
(a) (1) China to close at least 4,000 coal mines this year: official. (2) By Oct. 10 this year there had been 43 coal mine accidents that killed 10 or more people. (3) Officials had stakes in coal mines. (4) All the coal mines will be closed down this year. (5) In the first eight months, the death toll of coal mine accidents rose 8.5 percent last year. (6) The government has issued a series of regulations and measures to improve the country's coal mine safety situation. (7) The mining safety technology and equipments have been sold to countries. (8) More than 6,000 miners died in accidents in China.
(b) (1) In the first eight months, the death toll of coal mine accidents across China rose 8.5 percent from the same period last year. (2) China will close down a number of ill-operated coal mines at the end of this month, said a work safety official here Monday. (3) Li Yizhong, director of the National Bureau of Production Safety Supervision and Administration, has said the collusion between mine owners and officials is to be condemned. (4) From January to September this year, 4,228 people were killed in 2,337 coal mine accidents. (5) Chen said officials who refused to register their stakes in coal mines within the required time
6 Conclusion
In this paper, we propose a novel supervised approach based on a revised supervised topic model for query-focused multi-document summarization. Our approach naturally combines a Bayesian topic model with a supervised method and enjoys the advantages of both models. Experiments on benchmark data demonstrate the good performance of our model.
Acknowledgments
This research work has been supported by NSFC grants (No.90920011 and No.61273278), the National Key Technology R&D Program (No.2011BAH1B0403), and the National High Technology R&D Program (No.2012AA011101). We also thank the three anonymous reviewers for their helpful comments. Corresponding author: Sujian Li.
References
David Blei and Jon McAuliffe. 2007. Supervised topic models. In Neural Information Processing Systems.
David Blei, Andrew Ng and Michael Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research, pages 993-1022.
Charles Broyden. 1965. A class of methods for solving nonlinear simultaneous equations. Math. Comp., volume 19, pages 577-593.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering doc- uments and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A Hy- brid hierarchical model for multi-document summa- rization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. page: 815-825 Jade Goldstein, Mark Kantrowitz, Vibhu Mittal and Jaime Carbonell. 1999. Summarizing Text Docu- ments: Sentence Selection and Evaluation Metrics. In Proceedings of the 22nd annual international ACM SI- GIR conference on Research and development in infor- mation retrieval, page: 121-128. Amit Grubber, Micheal Rosen-zvi and Yair Weiss. 2007. Hidden Topic Markov Model. In Artificial Intelligence and Statistics. Hal Daume and Daniel Marcu H. 2006. Bayesian Query- Focused Summarization. In Proceedings of the 21st International Conference on Computational Linguis- tics and the 44th annual meeting of the Association for Computational Linguistics, page 305-312. Gune Erkan and Dragomir Radev. 2004. Lexrank: graph- based lexical centrality as salience in text summariza- tion. In J. Artif. Intell. Res. (JAIR), page 457-479. Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, The ICSI Summarization System at TAC, TAC 2008. Dan Gillick, Benoit Favre, and Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu, Shasha Xie. The ICSI/UTD Summarization System at TAC 2009. TAC 2009 Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chap- ter of the Association for Computational Linguistics, pages 362370. Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. The summarization systems at tac 2010. In Proceedings of the third Text Analysis Conference, TAC-2010. Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha and Yong Yu. 2009. Enhancing diversity, coverage and bal- ance for summarization through structure learning. In Proceedings of the 18th international conference on World wide web, page 71-80. Chin-Yew Lin, Guihong Gao, Jianfeng Gao and Jian-Yun Nie. 2006. An information-theoretic approach to au- tomatic evaluation of summaries. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the As- sociation of Computational Linguistics, page:462-470. Annie Louis, Aravind Joshi, Ani Nenkova. 2010. Dis- course indicators for content selection in summariza- tion. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page:147-156. Tengfei Ma, Xiaojun Wan. 2010. Multi-document sum- marization using minimum distortion, in Proceedings of International Conference of Data Mining. page 354363. Rebecca Mason and Eugene Charniak. 2011. Extractive multi-document summaries should explicitly not con- tain document-specific content. In proceedings of ACL HLT, page:49-54. Ani Nenkova and Lucy Vanderwende. The impact of fre- quency on summarization. In Tech. Report MSR-TR- 2005-101, Microsoft Research, Redwood, Washing- ton, 2005. Ani Nenkova, Lucy Vanderwende and Kathleen McKe- own. 2006. A compositional context sensitive multi- document summarizer: exploring the factors that inu- ence summarization. In Proceedings of the 29th an- nual International ACM SIGIR Conference on Re- 97 search and Development in Information Retrieval, page 573-580. 
Miles Osborne. 2002. Using maximum entropy for sen- tence extraction. In Proceedings of the ACL-02 Work- shop on Automatic Summarization, Volume 4 page:1- 8. Jahna Otterbacher, Gunes Erkan and Dragomir Radev. 2005. Using random walks for question-focused sen- tence retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, page 915-922 You Ouyang, Wenjie Li, Sujian Li and Qin Lua. 2011. Applying regression models to query-focused multi- document summarization. In Information Processing and Management, page 227-237. You Ouyang, Sujian. Li, and Wenjie. Li. 2007, Develop- ing learning strategies for topic-based summarization. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge manage- ment, page: 7986. Daniel Ramage, David Hall, Ramesh Nallapati and Christopher Manning. 2009. Labeled LDA: A super- vised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Vol 1, page 248-256. Dou She, Jian-Tao Sun, Hua Li, Qiang Yang and Zheng Chen. 2007. Document summarization using conditional random elds. In Proceedings of Inter- national Joint Conference on Artificial Intelligence, page: 28622867. V. Varma, V. Bharat, S. Kovelamudi, P. Bysani, S. GSK, K. Kumar N, K. Reddy, N. Maganti , IIIT Hyderabad at TAC 2009. TAC2009 Xiaojun Wan and Jianwu Yang. 2008. Multi-document Summarization using cluster-based link analysis. In Proceedings of the 31st annual international ACM SI- GIR conference on Research and development in in- formation retrieval, page: 299-306. Xiaojun Wan, Jianwu Yang and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi- document summarization. In Proceedings of In- ternational Joint Conference on Artificial Intelligence, page 2903-2908. Furu Wei, Wenjie Li, Qin Lu and Yanxiang He. 2008. Ex- ploiting Query-Sensitive Similarity for Graph-Based Query-Oriented Summarization. In Proceedings of the 31st annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 283-290. Jin Zhang, Xueqi Cheng, Hongbo Xu, Xiaolei Wang, Yil- ing Zeng. ICTCAS’s ICTGrasper at TAC 2008: Sum- marizing Dynamic Information with Signature Terms Based Content Filtering, TAC 2008. Dengzhong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet and Bernhard Schlkopf. 2003. Ranking on Data Manifolds. In Proceedings of the Conference on Advances in Neural Information Processing Systems, page 169-176. Jun Zhu and Eric Xing. 2010. Conditional Topic Random Fields. In Proceedings of the 27th International Con- ference on Machine Learning. Xiaojin Zhu, Zoubin Ghahramani and John Laf- ferty. 2003. Semi-supervised Learning using Gaussian Fields and Harmonic Functions. In Proceedings of In- ternational Conference of Machine Learning, page: 912-919. 98 work_235n7dcftbgpplqreihoa3x6uy ---- Shift-Reduce Constituent Parsing with Neural Lookahead Features Jiangming Liu and Yue Zhang Singapore University of Technology and Design, 8 Somapah Road, Singapore, 487372 {jiangming liu, yue zhang}@sutd.edu.sg Abstract Transition-based models can be fast and accu- rate for constituent parsing. Compared with chart-based models, they leverage richer fea- tures by extracting history information from a parser stack, which consists of a sequence of non-local constituents. 
On the other hand, during incremental parsing, constituent information on the right-hand side of the current word is not utilized, which is a relative weakness of shift-reduce parsing. To address this limitation, we leverage a fast neural model to extract lookahead features. In particular, we build a bidirectional LSTM model, which leverages full sentence information to predict the hierarchy of constituents that each word starts and ends. The results are then passed to a strong transition-based constituent parser as lookahead features. The resulting parser gives 1.3% absolute improvement on WSJ and 2.3% on CTB compared to the baseline, giving the highest reported accuracies for fully-supervised parsing.
1 Introduction
Transition-based constituent parsers are fast and accurate, performing incremental parsing using a sequence of state transitions in linear time. Pioneering models rely on a classifier to make local decisions, searching greedily for local transitions to build a parse tree (Sagae and Lavie, 2005). Zhu et al. (2013) use a beam search framework, which preserves the linear time complexity of greedy search, while alleviating the disadvantage of error propagation. The model gives state-of-the-art accuracies at a speed of 89 sentences per second on the standard WSJ benchmark (Marcus et al., 1993).
Zhu et al. (2013) exploit rich features by extracting history information from a parser stack, which consists of a sequence of non-local constituents. However, due to the incremental nature of shift-reduce parsing, the right-hand side constituents of the current word cannot be used to guide the action at each step. Such lookahead features (Tsuruoka et al., 2011) correspond to the outside scores in chart parsing (Goodman, 1998), which have been effective for obtaining improved accuracies.
To leverage such information for improving shift-reduce parsing, we propose a novel neural model to predict the constituent hierarchy related to each word before parsing. Our idea is inspired by the work of Roark and Hollingshead (2009) and Zhang et al. (2010b), which shows that shallow syntactic information gathered over the word sequence can be utilized for pruning chart parsers, improving chart parsing speed without sacrificing accuracy. For example, Roark and Hollingshead (2009) predict constituent boundary information on words as a preprocessing step, and use such information to prune the chart. Since such information is much lighter-weight compared to full parsing, it can be predicted relatively accurately using sequence labellers.
Different from Roark and Hollingshead (2009), we collect lookahead constituent information for shift-reduce parsing, rather than pruning information for chart parsing. Our main concern is improving the accuracy rather than improving the speed. Accordingly, our model should predict the constituent hierarchy for each word rather than simple boundary information. For example, in Figure 1(a), the constituent hierarchy that the word "The" starts is "S → NP", and the constituent hierarchy that the word "table" ends is "S → VP → NP → PP → NP".
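To make these per-word constituent hierarchies concrete, the following minimal Python sketch (illustrative only, not the parser's released code; the tuple encoding and function names are assumptions) reads the s-type and e-type hierarchies off a small bracketed tree.

```python
# Illustrative sketch: for every word, collect the hierarchy of constituents it
# starts (s-type) and ends (e-type), ordered top-down as in Figure 1.
# POS pre-terminals such as DT are skipped, matching the paper's examples.

def hierarchies(tree):
    s_type, e_type, words = {}, {}, []

    def walk(node):
        label, children = node[0], node[1:]
        if len(children) == 1 and isinstance(children[0], str):
            # pre-terminal (a POS tag over one word): record the word, skip the label
            idx = len(words)
            words.append(children[0])
            return idx, idx
        spans = [walk(child) for child in children]
        start, end = spans[0][0], spans[-1][1]
        # a constituent is started by its first word and ended by its last word
        s_type.setdefault(start, []).append(label)
        e_type.setdefault(end, []).append(label)
        return start, end

    walk(tree)
    # labels were collected bottom-up while unwinding the recursion; reverse them
    chain = lambda labels: " -> ".join(reversed(labels)) if labels else "Ø"
    return [(w, chain(s_type.get(i, [])), chain(e_type.get(i, [])))
            for i, w in enumerate(words)]

tree = ("S",
        ("NP", ("DT", "The"), ("NNS", "students")),
        ("VP", ("VB", "like"), ("NP", ("DT", "this"), ("NN", "book"))))
for word, s, e in hierarchies(tree):
    print(f"{word:10s} starts: {s:12s} ends: {e}")
# "The" starts "S -> NP"; "book" ends "S -> VP -> NP", as in the paper's notation.
```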
NP VB DT NN NP VP S (a) (b) DT NNS The students like this book ADJP JJ past CC and JJ present NP PP NP DT NN the table IN on Word s-type e-type The [s: S NP] [e: Ø ] past [s: ADJP] [e: Ø ] and [s: Ø] [e: Ø ] present [s: Ø] [e: ADJP ] students [s: Ø] [e: NP ] like [s: VP] [e: Ø ] this [s: NP NP] [e: Ø ] book [s: Ø] [e: NP ] on [s: PP] [e: Ø ] the [s: NP] [e: Ø ] table [s: Ø [e: S VP NP PP NP ] Figure 1: Example constituent hierarchies for the sentence “The past and present students like this book on the table”. (a) parse tree; (b) constituent hierarchies on words. For each word, we predict both the constituent hier- archy it starts and the constituent hierarchy it ends, using them as lookahead features. The task is challenging. First, it is significantly more difficult compared to simple sequence la- belling, since two sequences of constituent hierar- chies must be predicted for each word in the input sequence. Second, for high accuracies, global fea- tures from the full sentence are necessary since con- stituent hierarchies contain rich structural informa- tion. Third, to retain high speed for shift-reduce parsing, lookahead feature prediction must be exe- cuted efficiently. It is highly difficult to build such a model using manual discrete features and structured search. Fortunately, sequential recurrent neural networks (RNNs) are remarkably effective models to encode the full input sentence. We leverage RNNs for build- ing our constituent hierarchy predictor. In particular, an LSTM (Hochreiter and Schmidhuber, 1997) is used to learn global features automatically from the input words. For each word, a second LSTM is then used to generate the constituent hierarchies greed- ily using features from the hidden layer of the first LSTM, in the same way a neural language model de- coder generates output sentences for machine trans- lation (Bahdanau et al., 2015). The resulting model solves all three challenges raised above. For fully- supervised learning, we learn word embeddings as part of the model parameters. In the standard WSJ (Marcus et al., 1993) and CTB 5.1 tests (Xue et al., 2005), our parser gives 1.3 F1 and 2.3 F1 improvement, respectively, over the Initial State [φ, 0,false, 0] Final State [S,n,true,m : 2n <= m <= 4n] Induction Rules: SHIFT [S,i,false,k] [S|w,i+1,false,k+1] REDUCE-L/R-X [S|s1s0,i,false,k] [S|X,i,false,k+1] UNARY-X [S|s0,i,false,k] [S|X,i,false,k+1] FINISH [S,n,false,k] [S,n,true,k+1] IDLE [S,n,true,k] [S,n,true,k+1] Figure 2: Deduction system for the baseline shift- reduce parsing process. baseline of Zhu et al. (2013), resulting in a accuracy of 91.7 F1 for English and 85.5 F1 for Chinese, which are the best for fully-supervised models in the literature. We release our code, based on ZPar (Zhang and Clark, 2011; Zhu et al., 2013), at https://github.com/SUTDNLP/LookAheadConparser. 2 Baseline System We adopt the parser of Zhu et al. (2013) for a base- line, which is based on the shift-reduce process of Sagae and Lavie (2005) and the beam search strat- egy of Zhang and Clark (2011) with global percep- tron training. 46 2.1 The Shift-Reduce System Shift-reduce parsers process an input sentence in- crementally from left to right. A stack is used to maintain partial phrase-structures, while the incom- ing words are ordered in a buffer. At each step, a transition action is applied to consume an input word or construct a new phrase-structure. The set of tran- sition actions are • SHIFT: pop the front word off the buffer, and push it onto the stack. 
• REDUCE-L/R-X: pop the top two constituents off the stack (L/R means that the head is the left constituent or the right constituent, respec- tively), combine them into a new constituent with label X, and push the new constituent onto the stack. • UNARY-X: pop the top constituent off the stack, raise it to a new constituent X, and push the new constituent onto the stack. • FINISH: pop the root node off the stack and end parsing. • IDLE: no-effect action on a completed state without changing items on the stack or buffer, used to ensure that the same number of actions are in each item in beam search (Zhu et al., 2013). The deduction system for the process is shown in Figure 2, where a state is represented as [stack, buffer front index, completion mark, action index], and n is the number of words in the input. For ex- ample, given the sentence “They like apples”, the action sequence “SHIFT, SHIFT, SHIFT, REDUCE- L-VP, REDUCE-R-S” gives its syntax “(S They (VP like apples) )”. 2.2 Search and Training Beam-search is used for decoding with the k best state items at each step being kept in the agenda. During initialization, the agenda contains only the initial state [φ, 0,false, 0]. At each step, each state in the agenda is popped and expanded by apply- ing all valid transition actions, and the top k re- sulting states are put back onto the agenda (Zhu et al., 2013). The process repeats until the agenda is Description Templates UNIGRAM s0tc,s0wc,s1tc,s1wc,s2tc s2wc,s3tc,s3wc,q0wt,q1wt q2wt,q3wt,s0lwc,s0rwc s0uwc,s1lwc,s1rwc,s1uwc BIGRAM s0ws1w,s0ws1c,s0cs1w,s0cs1c s0wq0w,s0wq0t,s0cq0w,s0cq0t q0wq1w,q0wq1t,q0tq1w,q0tq1t s1wq0w,s1wq0t,s1cq0w,s1cq0t TRIGRAM s0cs1cs2c,s0ws1cs2c,s0cs1wq0t s0cs1cs2w,s0cs1cq0t,s0ws1cq0t s0cs1wq0t,s0cs1cq0w Extended s0llwc,s0lrwc,s0luwc s0rlwc,s0rrwc,s0ruwc s0ulwc,s0urwc,s0uuwc s1llwc,s1lrwc,s1luwc s1rlwc,s1rrwc,s1ruwc Table 1: Baseline feature templates, where si rep- resents the ith item on the top of the stack and qi denotes the ith item in the front of the buffer. The symbol w denotes the lexical head of an item; the symbol c denotes the constituent label of an item; the symbol t is the POS of a lexical head; u denotes unary child; sill denotes the left child of si’s left child. empty, and the best completed state is taken as out- put. The score of a state is the total score of the transi- tion actions that have been applied to build it: C(α) = N∑ i=1 Φ(αi) ·~θ (1) Here Φ(αi) represents the feature vector for the ith action αi in the state item α. N is the total number of actions in α. The model parameter vector ~θ is trained online using the averaged perceptron algorithm with the early-update strategy (Collins and Roark, 2004). 2.3 Baseline Features Our baseline features are taken from Zhu et al. (2013). As shown in Table 1, they include the UN- IGRAM, BIGRAM, TRIGRAM features of Zhang and Clark (2009) and the extended features of Zhu et al. (2013). 47 Templates s0gs,s0ge,s1gs,s1ge q0gs,q0ge,q1gs,q1ge Table 2: Lookahead feature templates, where si rep- resents the ith item on the top of the stack and qi de- notes the ith item in the front end of the buffer. The symbol gs and ge denote the next level constituent in the s-type hierarchy and e-type hierarchy, respec- tively. 3 Global Lookahead Features The baseline features suffer two limitations, as men- tioned in the introduction. First, they are relatively local to the state, considering only the neighbouring nodes of s0 (top of stack) and q0 (front of buffer). 
Second, they do not consider lookahead information beyond s3, or the syntactic structure of the buffer and sequence. We use an LSTM to capture full sen- tential information in linear time, representing such global information that is fed into the baseline parser as a constituent hierarchy for each word. Lookahead features are extracted from the constituent hierarchy to provide top-down guidance for bottom-up pars- ing. 3.1 Constituent Hierarchy In a constituency tree, each word can start or end a constituent hierarchy. As shown in Figure 1, the word “The” starts a constituent hierarchy “S → NP”. In particular, it starts a constituent S in the top level, dominating a constituent NP. The word “table” ends a constituent hierarchy “S → VP → NP → PP → NP”. In particular, it ends a constituent hierarchy, with a constituent S on the top level, dominating a VP (starting from the word “like”), and then an NP (starting from the noun phrase “this book”), and then a PP (starting from the word “in”), and finally an NP (starting from the word “the”). The extraction of constituent hierarchies for each word is based on un- binarized grammars, reflecting the unbinarized trees that the word starts or ends. The constituent hier- archy is empty (denoted as φ) if the corresponding word does not start or end a constituent. The con- stituent hierarchies are added into the shift-reduce parser as soft features (section 3.2). Formally, a constituent hierarchy is defined as [type : c1 → c2 → ... → cz], where c is a constituent label (e.g. NP), “→” repre- sents the top-down hierarchy, and type can be s or e, denoting that the current word starts or ends the con- stituent hierarchy, respectively, as shown in Figure 1. Compared with full parsing, the constituent hier- archies associated with each word have no forced structural dependencies between each other, and therefore can be modelled more easily, for each word individually. Being soft lookahead features rather than hard constraints, inter-dependencies are not crucial for the main parser. 3.2 Lookahead Features The lookahead feature templates are defined in Table 2. In order to ensure parsing efficiency, only simple feature templates are taken into consideration. The lookahead features of a state are instantiated for the top two items on the stack (i.e., s0 and s1) and buffer (i.e., q0 and q1). The new function Φ′ is defined to output the lookahead features vector. The scoring of a state in our model is based on Formula (1) but with a new term Φ′(αi) · ~θ′: C′(α) = N∑ i=1 Φ(αi) ·~θ + Φ′(αi) · ~θ′ For each word, the lookahead feature represents the next level constituent in the top-down hierarchy, which can guide bottom-up parsing. For example, Figure 3 shows two intermediate states during parsing. In Figure 3(a), the s-type and e-type lookahead features of s1 (i.e., the word “The” are extracted from the constituent hierarchy in the bottom level, namely NP and NULL, respec- tively. On the other hand, in Figure 3(b), the s-type lookahead feature of s1 is extracted from the s-type constituent hierarchy of same word “The”, but it is S based on current hierarchical level. The e-type lookahead feature, on the other hand, is extracted from the e-type constituent hierarchy of end word “students” of the VP constituent, which is NULL in the next level. Lookahead features for items on the buffer are extracted in the same way. The lookahead features are useful for guiding shift-reduce decisions given the current state. 
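As a rough illustration of how the extended score C′(α) combines the two feature sets, the sketch below scores a derivation with a single sparse weight map over baseline and lookahead features; the feature strings and state encoding are simplified placeholders rather than the parser's actual feature extractor or templates.

```python
# Rough sketch of the extended state score C'(alpha): for every applied action,
# both baseline features (Table 1 templates) and lookahead features (Table 2
# templates) fire, and their perceptron weights are summed.
from collections import defaultdict

weights = defaultdict(float)          # holds both theta and theta' entries

def baseline_features(state, action):
    # placeholder for the Table 1 templates (s0tc, s0wc, q0wt, ...)
    return [f"base:{atom}:{action}" for atom in state["base_atoms"]]

def lookahead_features(state, action):
    # e.g. s0gs / s0ge / q0gs / q0ge: next-level label of the s-/e-type
    # constituent hierarchy of the top stack and buffer items
    return [f"look:{name}={label}:{action}"
            for name, label in sorted(state["lookahead_atoms"].items())]

def state_score(derivation):
    # derivation: list of (state, action) pairs that built the state item
    return sum(weights[f]
               for state, action in derivation
               for f in baseline_features(state, action) + lookahead_features(state, action))

# toy usage: a single SHIFT step whose lookahead feature has a learned weight
weights["look:s0gs=ADJP:SHIFT"] = 0.7
derivation = [({"base_atoms": ["s0wc=The|DT"],
                "lookahead_atoms": {"s0gs": "ADJP", "s0ge": "NULL"}}, "SHIFT")]
print(state_score(derivation))        # 0.7
```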
For 48 stack buffer DT The JJ past CC and JJ present s0s1 S NP Ø ADJP Ø s0gs s0ge=nulls1gs s1ge=null Ø Ø Ø ADJP q0 q1 q0gs=null q0ge=null q1gs=null q1ge NP VB DT NN DT NNS s0s1 q0 q1 The students like this book S NP VP VP Ø NP NPØ Ø s1ge=null s0gs s0ge=null q0gs q0ge=null q1gs=null stack buffer q1ges1gs ADJP past and present (a) (b) incorrect Constituent hierarchy Look-ahead features Configuration Figure 3: Two intermediate states for parsing on the sentence “The past and present students like this book on the table”. Each item on the stack or buffer has two constituent hierarchies: s-type (left) and e-type (right), respectively, in the corresponding box. Note that the e-type constituent hierarchy of the word “students” is incorrectly predicted, yet used as soft constraints (i.e., features) in our model. example, given the intermediate state in Figure 3(a), s0 has a s-type lookahead feature ADJP, and q1 in the buffer has e-type lookahead feature ADJP. This indicates that the two items are likely reduced into the same constituent. Further, s0 cannot end a con- stituent because of the empty e-type constituent hi- erarchy. As a result, the final shift-reduce parser will assign a higher score to the SHIFT decision. 4 Constituent Hierarchy Prediction We propose a novel neural model for constituent hi- erarchy prediction. Inspired by the encoder-decoder framework for neural machine translation (Bah- danau et al., 2015; Cho et al., 2014), we use an LSTM to capture full sentence features, and another LSTM to generate the constituent hierarchies for each word. Compared with a CRF-based sequence labelling model (Roark and Hollingshead, 2009), the proposed model has three advantages. First, the global features can be automatically represented. Second, it can avoid the exponentially large num- ber of labels if constituent hierarchies are treated as unique labels. Third, the model size is relatively small, and does not have a large effect on the final parser model. As shown in Figure 4, the neural network con- sists of three main layers, namely the input layer, the encoder layer and the decoder layer. The input layer represents each word using its characters and token information; the encoder hidden layer uses a bidirectional recurrent neural network structure to learn global features from the sentence; and the de- coder layer predicts constituent hierarchies accord- ing to the encoder layer features, by using the atten- tion mechanism (Bahdanau et al., 2015) to compute the contribution of each hidden unit of the encoder. 4.1 Input Layer The input layer generates a dense vector representa- tion of each input word. We use character embed- dings to alleviate OOV problems in word embed- dings (Ballesteros et al., 2015; Santos and Zadrozny, 2014; Kim et al., 2016), concatenating character- embeddings of a word with its word embedding. Formally, the input representation xi of the word wi is computed by: xi = [xwi ; ci att] ci att = ∑ j αijc ′ ij, where xwi is a word embedding vector of the word wi according to a embedding lookup table, ci att is a character embedding form of the word wi, cij is the embedding of the jth character in wi, c′ij is the character window representation centered at cij, and αij is the contribution of the c′ij to ci att, which is computed by: αij = e f(xwi,c ′ ij) ∑ k e f(xwi,c ′ ik ) f is a non-linear transformation function. 
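A small numpy sketch of this input representation is given below; the dimensions, the bilinear-plus-tanh form chosen for f, and the variable names are illustrative assumptions, since the paper only states that f is a non-linear transformation.

```python
# Sketch: a word embedding concatenated with an attention-pooled character part,
# i.e. x_i = [x_wi ; c_i_att] with c_i_att = sum_j alpha_ij * c'_ij.
import numpy as np

def input_representation(word_vec, char_window_vecs, W):
    # word_vec: (d_w,); char_window_vecs: (m, d_c), one row per character window c'_ij
    scores = np.tanh(char_window_vecs @ W @ word_vec)   # stands in for f(x_wi, c'_ij)
    alpha = np.exp(scores) / np.exp(scores).sum()       # attention weights alpha_ij
    char_att = alpha @ char_window_vecs                 # c_i_att = sum_j alpha_ij c'_ij
    return np.concatenate([word_vec, char_att])         # x_i = [x_wi ; c_i_att]

rng = np.random.default_rng(0)
d_w, d_c, m = 50, 60, 7                                  # illustrative sizes
x = input_representation(rng.normal(size=d_w),
                         rng.normal(size=(m, d_c)),
                         0.1 * rng.normal(size=(d_c, d_w)))
print(x.shape)                                           # (110,)
```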
Figure 4: Structure of the constituent hierarchy prediction model. →h_i denotes the left-to-right encoder hidden units; ←h_i denotes the right-to-left encoder hidden units; s denotes the decoder hidden state vector; and y_ij is the jth label of the word w_i.
4.2 Encoder Layer
The encoder first uses a window strategy to represent input nodes with their corresponding local context nodes. Formally, a word window representation takes the form x'_i = [x_{i−win}; ...; x_i; ...; x_{i+win}].
Second, the encoder scans the input sentence and generates hidden units for each input word using a recurrent neural network (RNN), which represents features of the word from the global sequence. Formally, given the windowed input nodes x'_1, x'_2, ..., x'_n for the sentence w_1, w_2, ..., w_n, the RNN layer calculates a hidden node sequence h_1, h_2, ..., h_n.
Long Short-Term Memory (LSTM) mitigates the vanishing gradient problem in RNN training by introducing gates (i.e., input i, forget f and output o) and a cell memory vector c. We use the variation of Graves and Schmidhuber (2008). Formally, the values in the LSTM hidden layers are computed as follows:
i_i = σ(W1 x'_i + W2 h_{i−1} + W3 ⊙ c_{i−1} + b1)
f_i = 1 − i_i
c̃_i = tanh(W4 x'_i + W5 h_{i−1} + b2)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ c̃_i
o_i = σ(W6 x'_i + W7 h_{i−1} + W8 ⊙ c_i + b3)
h_i = o_i ⊙ tanh(c_i),
where ⊙ is pair-wise multiplication. Further, in order to collect features for x_i from both x'_1, ..., x'_{i−1} and x'_{i+1}, ..., x'_n, we use a bidirectional variation (Schuster and Paliwal, 1997; Graves et al., 2013). As shown in Figure 4, the hidden units are generated by concatenating the corresponding hidden layers of a left-to-right LSTM →h_i and a right-to-left LSTM ←h_i, where ↔h_i = [→h_i; ←h_i] for each word w_i.
4.3 Decoder Layer
The decoder hidden layer uses two different LSTMs to generate the s-type and e-type sequences of constituent labels from each encoder hidden output, respectively, as shown in Figure 4. Each constituent hierarchy is generated bottom-up recurrently. In particular, a sequence of state vectors is generated recurrently, with each state yielding an output constituent label. The process starts with a zero state vector and ends when a NULL constituent is generated. The recurrent state transition process is achieved using an LSTM model with the hidden vectors of the encoder layer being used for context features.
Formally, for word w_i, the value of the jth state unit s_ij of the LSTM is computed by:
s_ij = f(s_{i,j−1}, a_ij, ↔h_i), [1]
where the context a_ij is computed by:
a_ij = Σ_k β_ijk ↔h_k
β_ijk = exp(f(s_{i,j−1}, ↔h_k)) / Σ_{k'} exp(f(s_{i,j−1}, ↔h_{k'}))
Here ↔h_k refers to the encoder hidden vector for w_k. The weights of contribution β_ijk are computed using the attention mechanism (Bahdanau et al., 2015). The constituent labels are generated from each state unit s_ij, where each constituent label y_ij is the output of a SOFTMAX function,
p(y_ij = l) = exp(s_ij^T W_l) / Σ_k exp(s_ij^T W_k)
y_ij = l denotes that the jth label of the ith word is l (l ∈ L).
[1] Here, different from typical MT models (Bahdanau et al., 2015), the chain is predicted sequentially in a feed-forward way with no feedback of the prediction made. We found that this fast alternative gives similar results.
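The following numpy sketch illustrates the decoding loop described above; the `step` function stands in for the decoder LSTM cell and a plain dot product stands in for the learned attention score f, under the illustrative assumption that state and encoder vectors share one dimension d.

```python
# Sketch of decoding one constituent hierarchy: start from a zero state, attend
# over the encoder vectors, update the state, emit the most probable label, and
# stop when NULL is generated. Shapes and the toy `step` are illustrative.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_hierarchy(H, h_i, step, W_out, labels, max_depth=10):
    # H: (n, d) encoder vectors; h_i: (d,) vector of the current word
    # W_out: (num_labels, d); labels[k] is the label string for output index k
    s = np.zeros(H.shape[1])                      # the process starts from a zero state
    chain = []
    for _ in range(max_depth):
        beta = softmax(H @ s)                     # attention weights beta_ijk
        context = beta @ H                        # a_ij = sum_k beta_ijk * h_k
        s = step(s, context, h_i)                 # s_ij = f(s_i,j-1, a_ij, h_i)
        label = labels[int(np.argmax(W_out @ s))] # greedy softmax output y_ij
        if label == "NULL":
            break
        chain.append(label)                       # labels come out bottom-up
    return chain

# toy usage with a random stand-in for the decoder LSTM cell
rng = np.random.default_rng(1)
step = lambda s, a, h: np.tanh(s + a + h)
H = rng.normal(size=(5, 8))
print(predict_hierarchy(H, H[2], step, rng.normal(size=(4, 8)),
                        ["NULL", "NP", "VP", "S"]))
```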
As shown in Figure 4, the SOFTMAX functions are applied to the state units of the decoder, gener- ating hierarchical labels bottom-up, until the default label NULL is predicted. 4.4 Training We use two separate models to assign the s-type and e-type labels, respectively. For training each con- stituent hierarchy predictor, we minimize the follow- ing training objective: L(θ) = − T∑ i Zi∑ j log pijo + λ 2 ||θ||2, where T is the length of the sentence, Zi is the depth of the constituent hierarchy of the word wi, and pijo stands for p(yij = o), which is given by the SOFT- MAX function, and o is the gold label. We apply back-propagation, using momentum stochastic gradient descent (Sutskever et al., 2013) with a learning rate of η = 0.01 for optimization and regularization parameter λ = 10−6. 5 Experiments 5.1 Experiment Settings Our English data are taken from the Wall Street Jour- nal (WSJ) sections of the Penn Treebank (Marcus et al., 1993). We use sections 2-21 for training, section 24 for system development, and section 23 for final performance evaluation. Our Chinese data are taken from the version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We use articles 001- 270 and 440-1151 for training, articles 301-325 for sys- tem development, and articles 271-300 for final per- formance evaluation. For both English and Chinese hyper-parameters value Word embedding size 50 Word window size 2 Character embedding size 30 Character window size 2 LSTM hidden layer size 100 Character hidden layer size 60 Table 3: Hyper-parameter settings s-type e-type parser 1-layer 93.39 81.50 90.43 2-layer 93.76 83.37 90.72 3-layer 93.84 83.42 90.80 Table 4: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. n-layer denotes an LSTM model with n hidden layers. data, we adopt ZPar2 for POS tagging, and use ten- fold jackknifing to assign POS tags automatically to the training data. In addition, we use ten-fold jack- knifing to assign constituent hierarchies automati- cally to the training data for training the parser using the constituent hierarchy predictor. We use F1 score to evaluate constituent hierarchy prediction. For example, if the prediction is “S → S → VP → NP” and the gold is “S → NP → NP”, the evaluation process matches the two hierarchies bottom-up. The precision is 2/4 = 0.5, the recall is 2/3 = 0.66 and the F1 score is 0.57. A label is counted as correct if and only if it occurs at the cor- rect position. We use EVALB to evaluate parsing performance, including labelled precision (LP ), labelled recall (LR), and bracketing F1.3 5.2 Model Settings For training the constituent hierarchy prediction model, gold constituent labels are derived from la- belled constituency trees in the training data. The hyper-parameters are chosen according to develop- ment tests, and the values are shown in Table 3. For the shift-reduce constituency parser, we set the beam size to 16 for both training and decoding, which achieves a good tradeoff between efficiency 2https://github.com/SUTDNLP/ZPar 3http://nlp.cs.nyu.edu/evalb 51 s-type e-type parser all 93.76 83.37 90.72 all w/o wins 93.62 83.34 90.58 all w/o chars 93.51 83.21 90.33 all w/o chars & wins 93.12 82.36 89.18 Table 5: Performance of the constituent hierarchy predictor and the corresponding parser on the WSJ dev dataset. all denotes the proposed model with- out ablation. wins denotes input windows. chars denotes character-based attention. and accuracy (Zhu et al., 2013). 
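Before turning to the results, the hierarchy-level F1 described in Section 5.1 can be sketched as follows; the function is an illustrative reconstruction, not the official evaluation script, and it reproduces the worked example above (precision 0.5, recall 0.66, F1 0.57).

```python
# Sketch of hierarchy-level F1: labels are matched position by position from
# the bottom of the predicted and gold hierarchies (both written top-down).
def hierarchy_prf(pred, gold):
    p = list(reversed(pred.split(" -> ")))   # bottom-up order
    g = list(reversed(gold.split(" -> ")))
    correct = sum(1 for a, b in zip(p, g) if a == b)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

print(hierarchy_prf("S -> S -> VP -> NP", "S -> NP -> NP"))
# (0.5, 0.666..., 0.571...), matching the example in Section 5.1
```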
The optimal train- ing iteration number is determined on the develop- ment sets. 5.3 Results of Constituent Hierarchy Prediction Table 4 shows the results of constituent hierarchy prediction, where word and character embeddings are randomly initialized, and fine-tuned during train- ing. The third column shows the development pars- ing accuracies when the labels are used for looka- head features. As Table 4 shows, when the number of hidden layers increases, both s-type and e-type constituent hierarchy prediction improve. The accu- racy of e-type prediction is relatively lower due to right-branching in the treebank, which makes e-type hierarchies longer than s-type hierarchies. In addi- tion, a 3-layer LSTM does not give significant im- provements compared to a 2-layer LSTM. For better tradeoff between efficiency and accuracy, we choose the 2-layer LSTM as our constituent hierarchy pre- dictor. Table 5 shows ablation results for constituent hi- erarchy prediction given by different reduced ar- chitectures, which include an architecture without character embeddings and an architecture with nei- ther character embeddings nor input windows. We find that the original architecture achieves the high- est performance on constituent hierarchy prediction, compared to the two baselines. The baseline only without character embeddings has relatively small influence on constituent hierarchy prediction. On the other hand, the baseline only without input word windows has relatively smaller influence on con- stituent hierarchy prediction. Nevertheless, both of these two ablation architectures lead to lower pars- Parser LR LP F1 Fully-supervised Ratnaparkhi (1997) 86.3 87.5 86.9 Charniak (2000) 89.5 89.9 89.5 Collins (2003) 88.1 88.3 88.2 Sagae and Lavie (2005)† 86.1 86.0 86.0 Sagae and Lavie (2006)† 87.8 88.1 87.9 Petrov and Klein (2007) 90.1 90.2 90.1 Carreras et al. (2008) 90.7 91.4 91.1 Shindo et al. (2012) N/A N/A 91.1 Zhu et al. (2013)† 90.2 90.7 90.4 Socher et al. (2013)* N/A N/A 90.4 Vinyals et al. (2015)* N/A N/A 88.3 Cross and Huang (2016)*† N/A N/A 91.3 Dyer et al. (2016)*† N/A N/A 91.2 This work 91.3 92.1 91.7 Ensemble Shindo et al. (2012) N/A N/A 92.4 Vinyals et al. (2015)* N/A N/A 90.5 Rerank Charniak and Johnson (2005) 91.2 91.8 91.5 Huang (2008) 92.2 91.2 91.7 Dyer et al. (2016)*† N/A N/A 93.3 Semi-supervised McClosky et al. (2006) 92.1 92.5 92.3 Huang and Harper (2009) 91.1 91.6 91.3 Huang et al. (2010) 91.4 91.8 91.6 Zhu et al. (2013)† 91.1 91.5 91.3 Durrett and Klein (2015)* N/A N/A 91.1 Table 6: Comparison of related work on the WSJ test set. * denotes neural parsing; † denotes methods using a shift-reduce framework. ing accuracies. The baseline removing both the character embeddings and the input word windows has a relatively low F-score. 5.4 Final Results For English, we compare the final results with previous related work on the WSJ test sets. As shown in Table 64, our model achieves 1.3% F1 improvement compared to the baseline parser with fully-supervised learning (Zhu et al., 2013). Our model outperforms the state-of-the-art fully- supervised system (Carreras et al., 2008; Shindo et al., 2012) by 0.6% F1. In addition, our fully- supervised model also catches up with many state- of-the-art semi-supervised models (Zhu et al., 2013; 4We treat the methods as semi-supervised if they use pre- trained word embeddings, word clusters (e.g., Brown clusters) or extra resources. 
52 Parser LR LP F1 Fully-supervised Charniak (2000) 79.6 82.1 80.8 Bikel (2004) 79.3 82.0 80.6 Petrov and Klein (2007) 81.9 84.8 83.3 Zhu et al. (2013)† 82.1 84.3 83.2 Wang et al. (2015)‡ N/A N/A 83.2 Dyer et al. (2016)*† N/A N/A 84.6 This work 85.2 85.9 85.5 Rerank Charniak and Johnson (2005) 80.8 83.8 82.3 Dyer et al. (2016)*† N/A N/A 86.9 Semi-supervised Zhu et al. (2013)† 84.4 86.8 85.6 Wang and Xue (2014)‡ N/A N/A 86.3 Wang et al. (2015)‡ N/A N/A 86.6 Table 7: Comparison of related work on the CTB5.1 test set. * denotes neural parsing; † denotes methods using a shift-reduce framework; ‡ denotes joint POS tagging and parsing. Huang and Harper, 2009; Huang et al., 2010; Dur- rett and Klein, 2015) by achieving 91.7% F1 on WSJ test set. The size of our model is much smaller than the semi-supervised model of Zhu et al. (2013), which contains rich features from a large automat- ically parsed corpus. In contrast, our model is about the same in size compared to the baseline parser. We carry out Chinese experiments with the same models, and compare the final results with previous related work on the CTB test set. As shown in Table 7, our model achieves 2.3% F1 improvement com- pared to the state-of-the-art baseline system with fully-supervised learning (Zhu et al., 2013), which is by far the best result in the literature. In addition, our fully-supervised model is also comparable to many state-of-the-art semi-supervised models (Zhu et al., 2013; Wang and Xue, 2014; Wang et al., 2015; Dyer et al., 2016) by achieving 85.5% F1 on the CTB test set. Wang and Xue (2014) and Wang et al. (2015) do joint POS tagging and parsing. 5.5 Comparison of Speed Table 8 shows the running times of various parsers on test sets on a Intel 2.2 GHz processor with 16G memory. Our parsers are much faster than the re- lated parser with the same shift-reduce framework (Sagae and Lavie, 2005; Sagae and Lavie, 2006). Compared to the baseline parser, our parser gives Parser #Sent/Second Ratnaparkhi (1997) Unk Collins (2003) 3.5 Charniak (2000) 5.7 Sagae and Lavie (2005) 3.7 Sagae and Lavie (2006) 2.2 Petrov and Klein (2007) 6.2 Carreras et al. (2008) Unk Zhu et al. (2013) 89.5 This work 79.2 Table 8: Comparison of running times on the test set, where the time for loading models is excluded. The running times of related parsers are taken from Zhu et al. (2013). significant improvement on accuracies (90.4% to 91.7% F1) at the speed of 79.2 sentences per sec- ond5, in contrast to 89.5 sentences per second on the standard WSJ benchmark. 6 Error Analysis We conduct error analysis by measuring parsing ac- curacies against: different phrase types, constituents of different span lengths, and different sentence lengths. 6.1 Phrase Type Table 9 shows the accuracies of the baseline and the final parsers with lookahead features on 9 common phrase types. As the results show, while the parser with lookahead features achieves improvements on all of the frequent phrase types, there are relatively higher improvements on VP, S, SBAR and WHNP. The constituent hierarchy predictor has relatively better performance on s-type labels for the con- stituents VP, WHNP and PP, which are prone to errors by the baseline system. The constituent hi- erarchy can give guidance to the constituent parser for tackling the issue. Compared to the s-type con- stituent hierarchy, the e-type constituent hierarchy 5The constituent hierarchy prediction is excluded, which processes an average of 150 sentences per second on a single CPU. 
The cost of this step is far less than the cost of parsing, and can be essentially eliminated by pipelining the constituent hierarchy prediction and the shift-reduce decoder, by launching the constituent hierarchy predictor first, and then starting pars- ing in parallel as soon as the lookahead output is available for the first sentence, since the lookahead will outpace the parsing from that point forward. 53 2 4 6 8 10 12 14 85 90 95 span length F 1 S co re (% ) baseline lookahead Figure 5: Comparison with the baseline on spans of different lengths. is relatively more difficult to predict, particularly for the constituents with long spans such as VP, S and SBAR. Despite this, the e-type constituent hi- erarchies with relatively low accuracies also benefit prediction of constituents with long spans. 6.2 Span Length Figure 5 shows the F1-scores of the two parsers on constituents with different span lengths. As the re- sults show, lookahead features are helpful on both large spans and small spans, and the performance gap between the two parsers is larger as the size of span increases. This reflects the usefulness of long- range information captured by the constituent hier- archy predictor and lookahead features. 6.3 Sentence Length Figure 6 shows the F1-scores of the two parsers on sentences of different lengths. As the results show, the parser with lookahead features outperforms the baseline system on both short sentences and long sentences. Also, the performance gap between the two parsers is larger as the length of sentence in- creases. The constituent hierarchy predictors generate hi- erarchical constituents for each input word using global information. For longer sentences, the pre- dictors yield deeper constituent hierarchies, offer- ing corresponding lookahead features. As a result, compared to the baseline parser, the performance of the parser with lookahead features decreases more slowly as the length of the sentences increases. 10 20 30 40 50 50+ 85 90 95 F 1 sc or e (% ) baseline lookahead Figure 6: Comparison with the baseline on sen- tences of different lengths. Sentences with length [0, 10) fall in the bin 10. 7 Related Work Our lookahead features are similar in spirit to the pruners of Roark and Hollingshead (2009) and Zhang et al. (2010b), which infer the maximum length of constituents that a particular word can start or end. However, our method is different in three main ways. First, rather than using a CRF with sparse local word window features, a neural network is used for dense global features on the sentence. Second, not only the size of constituents but also the constituent hierarchy is identified for each word. Third, the results are added into a transition-based parser as soft features, rather then being used as hard constraints to a chart parser. Our concept of constituent hierarchies is simi- lar to supertags in the sense that both are shallow parses. For lexicalized grammars such as Combi- natory Categorial Grammar (CCG), Tree-Adjoining Grammar (TAG) and Head-Driven Phrase Structure Grammar (HPSG), each word in the input sentence is assigned one or more supertags, which are used to identify the syntactic role of the word to constrain parsing (Clark, 2002; Clark and Curran, 2004; Car- reras et al., 2008; Ninomiya et al., 2006; Dridan et al., 2008; Faleńska et al., 2015). For a lexical- ized grammar, supertagging can benefit the parsing in both accuracy and efficiency by offering almost- parsing information. In particular, Carreras et al. 
(2008) used the concept of spine for TAG (Schabes, 1992; Vijay-Shanker and Joshi, 1988), which is sim- ilar to our constituent hierarchy. However, there are three differences. First, the spine is defined to de- scribe the main syntactic tree structure with a series 54 NP VP S PP SBAR ADVP ADJP WHNP QP baseline 92.06 90.63 90.28 87.93 86.93 84.83 74.12 95.03 89.32 with lookahead feature 93.10 92.45 91.78 88.84 88.59 85.64 74.50 96.18 89.63 improvement +1.04 +1.82 +1.50 +0.91 +1.66 +0.81 +0.38 +1.15 +0.31 constituent hierarchy s-type 95.18 97.51 93.37 98.01 92.14 88.94 79.88 96.18 91.70 e-type 91.98 76.82 80.72 84.80 66.82 85.01 71.16 95.13 91.02 Table 9: Comparison between the parsers with lookahead features on different phrases types, with the corresponding constituent hierarchy predictor performances. of unary projections, while constituent hierarchy is defined to describe how words can start or end hi- erarchical constituents (it can be empty if the word cannot start or end constituents). Second, spines are extracted from gold trees and used to prune the search space of parsing as hard constraints. In con- trast, we use constituent hierarchies as soft features. Third, Carreras et al. (2008) use spines to prune chart parsing, while we use constituent hierarchies to improve a linear shift-reduce parser. For lexicalized grammars, supertags can benefit parsing significantly since they contain rich syntac- tic information as almost parsing (Bangalore and Joshi, 1999). Recently, there has been a line of work on better supertagging. Zhang et al. (2010a) proposed efficient methods to obtain supertags for HPSG parsing using dependency information. Xu et al. (2015) and Vaswani et al. (2016) leverage recur- sive neural networks for supertagging for CCG pars- ing. In contrast, our models predict the constituent hierarchy instead of a single supertag for each word in the input sentence. Our constituent hierarchy predictor is also related to sequence-to-sequence learning (Sutskever et al., 2014), which has been successfully used in neural machine translation (Bahdanau et al., 2015). The neural model encodes the source-side sentence into dense vectors, and then uses them to generate target- side word by word. There has also been work that di- rectly applies sequence-to-sequence models for con- stituent parsing, which generates constituent trees given raw sentences (Vinyals et al., 2015; Luong et al., 2015). Compared to Vinyals et al. (2015), who predict a full parse tree from input, our predictors tackle a much simpler task, by predicting the con- stituent hierarchies of each word separately. In ad- dition, the outputs of the predictors are used for soft lookahead features in bottom-up parsing, rather than being taken as output structures directly. By integrating a neural constituent hierarchy pre- dictor, our parser is related to neural network mod- els for parsing, which has given competitive accura- cies for both constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Watanabe and Sumita, 2015) and dependency parsing (Chen and Manning, 2014; Zhou et al., 2015; Dyer et al., 2015). In par- ticular, our parser is more closely related to neu- ral models that integrate discrete manual features (Socher et al., 2013; Durrett and Klein, 2015). Socher et al. (2013) use neural features to rerank a sparse baseline parser; Durrett and Klein directly in- tegrate sparse features into neural layers in a chart parser. 
In contrast, we integrate neural information into sparse features in the form of lookahead fea- tures. There has also been work on lookahead features for parsing. Tsuruoka et al. (2011) run a baseline parser for a few future steps, and use the output ac- tions to guide the current action. In contrast to their model, our model leverages full sentential informa- tion, yet is significantly faster. Previous work investigated more efficient parsing without loss of accuracy, which is required by real time applications, such as web parsing. Zhang et al. (2010b) introduced a chart pruner to accelerate a CCG parser. Kummerfeld et al. (2010) proposed a self-training method focusing on increasing the speed of a CCG parser rather than its accuracy. 8 Conclusion We proposed a novel constituent hierarchy predic- tor based on recurrent neural networks, aiming to capture global sentential information. The resulting constituent hierarchies are fed to a baseline shift- reduce parser as lookahead features, addressing lim- itations of shift-reduce parsers in not leveraging 55 right-hand side syntax for local decisions, yet main- taining the same model size and speed. The resulting fully-supervised parser outperforms the state-of-the- art baseline parser by achieving 91.7% F1 on stan- dard WSJ evaluation and 85.5% F1 on standard CTB evaluation. Acknowledgments We thank the anonymous reviewers for their detailed and constructive comments, and the co-Editor-in- Chief Lillian Lee for her extremely detailed copy editing. This work is supported by T2MOE 201301 of Singapore Ministry of Education. Yue Zhang is the corresponding author. References Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2015. Neural machine translation by jointly learning to align and translate. ICLR. Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In EMNLP, pages 349–359. Srinivas Bangalore and Aravind K. Joshi. 1999. Su- pertagging: An approach to almost parsing. Compu- tational Linguistics, 25(2):237–265, June. Daniel M. Bikel. 2004. On the parameter space of gener- ative lexicalized statistical parsing models. PhD The- sis, University of Pennsylvania. Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16, Morristown, NJ, USA. Association for Computational Linguistics. Eugene Charniak and Mark Johnson. 2005. Coarse-to- fine n-best parsing and MaxEnt discriminative rerank- ing. In ACL. Eugene Charniak. 2000. A maximum-entropy-inspired parser. In ANLP, pages 132–139. Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, Stroudsburg, PA, USA. As- sociation for Computational Linguistics. Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase represen- tations using RNN encoder-decoder for statistical ma- chine translation. In EMNLP, pages 1724–1734. Stephen Clark and James R. Curran. 2004. The impor- tance of supertagging for wide-coverage CCG pars- ing. In COLING, pages 282–288, Morristown, NJ, USA, August. University of Edinburgh, Association for Computational Linguistics. Stephen Clark. 2002. Supertagging for combinatory cat- egorial grammar. 
In Proceedings of the Sixth Inter- national Workshop on Tree Adjoining Grammar and Related Frameworks, pages 101–106, Universita di Venezia. Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In ACL, Mor- ristown, NJ, USA. Association for Computational Lin- guistics. Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguis- tics, 29(4):589–637. James Cross and Liang Huang. 2016. Span-based con- stituency parsing with a structure-label system and provably optimal dynamic oracles. In EMNLP. Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised gram- mars. In ACL. Greg Durrett and Dan Klein. 2015. Neural CRF parsing. In ACL, pages 302–312. Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition- based dependency parsing with stack long short-term memory. In ACL-IJCNLP, pages 334–343. Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In NAACL, pages 199–209. Agnieszka Faleńska, Anders Björkelund, Özlem Çetinoğlu, and Wolfgang Seeker. 2015. Stacking or supertagging for dependency parsing – what’s the difference? In Proceedings of the 14th International Conference on Parsing Technologies. Joshua Goodman. 1998. Parsing inside-out. PhD thesis, Harvard University. Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional recur- rent neural networks. In NIPS, pages 545–552. Alex Graves, Navdeep Jaitly, and Abdel-rahman Mo- hamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 273–278. IEEE. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735– 1780, November. Zhongqiang Huang and Mary P. Harper. 2009. Self- training PCFG grammars with latent annotations across languages. In EMNLP, pages 832–841. 56 Zhongqiang Huang, Mary P. Harper, and Slav Petrov. 2010. Self-training with products of latent variable grammars. In EMNLP, pages 12–22. Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL, pages 586– 594. Yoon Kim, Yacine Jernite, David Sontag, and Alexan- der M. Rush. 2016. Character-aware neural language models. In AAAI. Jonathan K. Kummerfeld, Jessika Roesner, Tim Daw- born, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adap- tation. In ACL, pages 345–355. University of Cam- bridge, Association for Computational Linguistics, July. Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task se- quence to sequence learning. ICLR. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated cor- pus of English: The Penn treebank. Computational Linguistics, 19(2):313–330. David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In HLT- NAACL, pages 152–159, Morristown, NJ, USA. As- sociation for Computational Linguistics. Takashi Ninomiya, Takuya Matsuzaki, Yoshimasa Tsu- ruoka, Yusuke Miyao, and Jun’ichi Tsujii. 2006. Ex- tremely lexicalized models for accurate and fast HPSG parsing. In EMNLP, pages 155–163. University of Manchester, Association for Computational Linguis- tics, July. Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, pages 404– 411. Adwait Ratnaparkhi. 
1997. A linear observed time sta- tistical parser based on maximum entropy models. In EMNLP. Brian Roark and Kristy Hollingshead. 2009. Linear complexity context-free parsing pipelines via chart constraints. In HLT-NAACL, pages 647–655. Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear runtime complexity. In IWPT, pages 125–132, Morristown, NJ, USA. Association for Com- putational Linguistics. Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In HLT-NAACL, pages 129–132, Mor- ristown, NJ, USA. Association for Computational Lin- guistics. Cicero D. Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tag- ging. In ICML, pages 1818–1826. Yves Schabes. 1992. Stochastic tree-adjoining gram- mars. In Proceedings of the workshop on Speech and Natural Language, pages 140–145. Association for Computational Linguistics. Mike Schuster and Kuldip K. Paliwal. 1997. Bidirec- tional recurrent neural networks. Signal Processing, IEEE transaction, 45(11):2673–2681. Hiroyuki Shindo, Yusuke Miyao, Akinori Fujino, and Masaaki Nagata. 2012. Bayesian symbol-refined tree substitution grammars for syntactic parsing. In ACL, pages 440–448. Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL, pages 455–465. Ilya Sutskever, James Martens, George E. Dahl, and Ge- offrey E. Hinton. 2013. On the importance of ini- tialization and momentum in deep learning. In ICML, pages 1139–1147. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112. Yoshimasa Tsuruoka, Yusuke Miyao, and Jun’ichi Kazama. 2011. Learning with lookahead: Can history-based models rival globally optimized models? In CoNLL, pages 238–246. Ashish Vaswani, Yonatan Bisk, and Kenji Sagae. 2016. Supertagging with LSTMs. In NAACL. K. Vijay-Shanker and Aravind K. Joshi. 1988. A study of tree adjoining grammars. Citeseer. Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2015. Gram- mar as a foreign language. In NIPS, pages 2773–2781. Zhiguo Wang and Nianwen Xue. 2014. Joint POS tag- ging and transition-based constituent parsing in Chi- nese with non-local features. In ACL, pages 733–742, Stroudsburg, PA, USA. Association for Computational Linguistics. Zhiguo Wang, Haitao Mi, and Nianwen Xue. 2015. Feature optimization for constituent parsing via neu- ral networks. In ACL-IJCNLP, pages 1138–1147, Stroudsburg, PA, USA. Association for Computational Linguistics. Taro Watanabe and Eiichiro Sumita. 2015. Transition- based neural constituent parsing. In ACL, pages 1169– 1179. Wenduan Xu, Michael Auli, and Stephen Clark. 2015. CCG supertagging with a recurrent neural network. In ACL-IJCNLP, pages 250–255, Stroudsburg, PA, USA. Association for Computational Linguistics. Naiwen Xue, Fei Xia, Fu-dong Chiou, and Martha Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Lan- guage Engineering, 11(2):207–238. 57 Yue Zhang and Stephen Clark. 2009. Transition-based parsing of the Chinese treebank using a global discrim- inative model. In ICPT, pages 162–171, Morristown, NJ, USA. Association for Computational Linguistics. Yue Zhang and Stephen Clark. 2011. Syntactic process- ing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151. Yaozhong Zhang, Takuya Matsuzaki, and Jun’ichi Tsu- jii. 2010a. 
work_22tktt3z4fczfipon3agmpyoeq ----
Semantic micro-contributions with decentralized nanopublication services
Tobias Kuhn1, Ruben Taelman2, Vincent Emonet3, Haris Antonatos4, Stian Soiland-Reyes5,6 and Michel Dumontier3
1 Department of Computer Science, VU Amsterdam, Amsterdam, Netherlands 2 IDLab, Ghent University, Ghent, Belgium 3 Institute of Data Science, Maastricht University, Maastricht, Netherlands 4 SciFY, Athens, Greece 5 Informatics Institute, University of Amsterdam, Amsterdam, Netherlands 6 Department of Computer Science, The University of Manchester, Manchester, UK
ABSTRACT While the publication of Linked Data has become increasingly common, the process tends to be a relatively complicated and heavy-weight one. Linked Data is typically published by centralized entities in the form of larger dataset releases, which has the downside that there is a central bottleneck in the form of the organization or individual responsible for the releases. Moreover, certain kinds of data entries, in particular those with subjective or original content, currently do not fit into any existing dataset and are therefore more difficult to publish. To address these problems, we present here an approach to use nanopublications and a decentralized network of services to allow users to directly publish small Linked Data statements through a simple and user-friendly interface, called Nanobench, powered by semantic templates that are themselves published as nanopublications. The published nanopublications are cryptographically verifiable and can be queried through a redundant and decentralized network of services, based on the grlc API generator and a new quad extension of Triple Pattern Fragments. We show here that these two kinds of services are complementary and together allow us to query nanopublications in a reliable and efficient manner. We also show that Nanobench makes it indeed very easy for users to publish Linked Data statements, even for those who have no prior experience in Linked Data publishing.
Subjects Human-Computer Interaction, Digital Libraries, World Wide Web and Web Science
Keywords Nanopublications, Semantic Web, Linked data, Semantic publishing
INTRODUCTION Linked Data has achieved remarkable adoption (Bizer, Heath & Berners-Lee, 2011; Schmachtenberg, Bizer & Paulheim, 2014), but its publication has remained a complicated issue. The most popular methods for publishing Linked Data include subject pages (Berners-Lee, 2009), SPARQL endpoints (Feigenbaum et al., 2013), and data dumps. The latter are essentially just RDF files on the web.
Such files are not regularly indexed on a global scale by any of the existing search engines and therefore often lack discoverability, but they are the only option that does not require the setup of a web server for users wanting to publish Linked Data on their own. While one of the fundamental ideas behind the web is that anyone should be able to express themselves, Linked Data publishing is therefore mostly done by large centralized entities such as DBpedia (Auer et al., 2007) and Wikidata (Vrandečić & Krötzsch, 2014). Even such community-driven datasets have clear guidelines on what kind of data may be added and typically do not allow for subjective or original content, such as personal opinions or new scientific findings that have otherwise not yet been published. It is therefore difficult for web users to publish their own personal pieces of Linked Data in a manner that the published data can be easily discovered, queried, and aggregated. To solve these shortcomings, we propose here a complementary approach to allow for what we call semantic micro-contributions. In contrast to the existing Linked Data publishing paradigms, semantic micro-contributions allow individual web users to easily and independently publish small snippets of Linked Data. We show below how such semantic micro-contributions can be achieved with nanopublications and semantic templates, and how we can make such a system redundant and reliable with a decentralized network of services. We will explain below how this approach differs from other decentralization approaches that have been proposed in the context of Linked Data publishing (including Solid and Blockchain-based approaches). Concretely, we investigate here the research question of how we can build upon the existing nanopublication publishing ecosystem to provide query services and intuitive user interfaces that allow for quick and easy publishing of small Linked Data contributions in a decentralized fashion. Our concrete contributions are: 1. a concrete scheme of how nanopublications can be digitally signed and thereby reliably linked to user identities, 2. two complementary sets of nanopublication query services building upon extensions of existing Linked Data technologies, one based on the grlc API generator and the other one in the form of an extension of Triple Pattern Fragments called Quad Pattern Fragments (QPF), 3. a user interface connecting to these services that allows for simple nanopublication publishing based on the new concept of nanopublication templates, and 4. positive evaluation results on the above-mentioned query services and user interface.
Below, we outline the relevant background, introduce the details of our approach, present and discuss the design and results of two evaluations, and outline future work. BACKGROUND Before we introduce our approach, we give here the relevant background in terms of our own previous work, and other related research on the topics of the use of semantic technologies for scientific publishing, Linked Data APIs, and decentralization. Under the label of semantic publishing (Shotton, 2009), a number of approaches have been presented to align research and its outcomes with Linked Data in order to better organize, aggregate, and interpret scientific findings and science as a whole. We have previously argued that these Linked Data representations should ideally come directly from the authors (i.e., the researchers), should cover not just metadata properties but the content of the scientific findings themselves, and should become the main publication object instead of papers with narrative text, in what we called genuine semantic publishing Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 2/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ (Kuhn & Dumontier, 2017). Nanopublications (Mons et al., 2011) are one of the most prominent proposals to implement this. They are small independent pieces of Linked Data that encapsulate atomic statements in the form of a few RDF triples (this part is called the assertion graph) together with formal provenance information (the provenance graph, e.g., pointing to the study that the assertion was derived from) and metadata (the publication info graph, e.g., by whom and when was the nanopublication created). While the original nanopublication proposal focused on assertions with domain statements (such as expressing a link between a gene and a disease), we subsequently suggested to broaden their scope and to use them also to express bibliographic and other meta-level information, statements about other nanopublications, vocabulary definitions, and generally any kind of small and coherent snippet of Linked Data (Kuhn et al., 2013). In order to make nanopublications verifiable and to enforce their immutability, we then showed how cryptographic hash values can be calculated on their content and included in their identifiers in the form of trusty URIs (Kuhn & Dumontier, 2015). Based on this, we established a decentralized and open server network, through which anybody can reliably publish and retrieve nanopublications (Kuhn et al., 2016), and we introduced index nanopublications, which allow for assigning nanopublications to versions of larger datasets (Kuhn et al., 2017). The work to be presented below is a continuation of this research line, adding query services and an intuitive publishing interface as components to this ecosystem. Our general approach is partly related to semantic wikis, for example, Ghidini et al. (2009), Baumeister, Reutelshoefer & Puppe (2011) and Kuhn (2008). They combine the ideas of the Semantic Web with the wiki concept, and therefore allow for quick and easy editing of semantic data. They focus on the collaborative process of consensus finding and its result in the form of a single coherent formal knowledge base, and as such, they focus less on individual contributions as the unit of reference. In terms of Linked Data APIs, SPARQL endpoints (Feigenbaum et al., 2013) are probably the most well-known example and they are often used for providing queryable access to RDF datasets. 
In practice, such endpoints often suffer from availability problems (Buil-Aranda et al., 2013), due to their public nature and the uncontrolled complexity of SPARQL queries. The Linked Data Fragments (LDF) framework (Verborgh et al., 2016) was initiated as an attempt to investigate alternative RDF query interfaces, where the total query effort can be distributed between server and client. Triple Pattern Fragments (TPF) (Verborgh et al., 2016), for example, heavily reduce the expressivity of queries that can be evaluated by a server, so clients that want answers to more complex SPARQL queries need to take up part of the execution load themselves. Through client-side query engines, such as Comunica (Taelman et al., 2018), complex SPARQL queries can be split into multiple triple pattern queries that can be executed separately by a TPF server and then joined to create the full result on the client-side. Another approach to address the problems of full SPARQL endpoints is grlc (Meroño-Peñuela & Hoekstra, 2016), a tool that automatically generates APIs from SPARQL templates. By providing a small number of possible API operations instead of SPARQL’s virtually unlimited query possibilities, grlc makes Linked Data access easier and better manageable on both, the client and server Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 3/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ side. A further noteworthy technology is the Linked Data Platform (LDP) (Speicher, Arwe & Malhotra, 2015) to manage and provide Linked Data. In order to establish connections between producers and consumers of Linked Data, subscription and notification protocols such as WebSub (https://www.w3.org/TR/websub/) and provenance pingbacks (https://www.w3.org/TR/prov-aq/#provenance-pingback) have been proposed. The approaches above mostly assume requests are targeted towards a central server. This centralization comes with the downsides that such a server forms a single point of failure, that we need to trust in the authority that runs it, and that it is difficult to scale. To address these problems, a number of more decentralized approaches have been proposed. LDF interfaces such as TPF, as introduced above, can in fact also be used in a more distributed fashion, as fragments can be published across different servers (Delva et al., 2019). Distributed approaches to semantically annotate web pages like https://schema.org/ (Guha, Brickley & Macbeth, 2016) have moreover shown strong adoption. Another example is Solid (Mansour et al., 2016), where users have their own personal Linked Data pod, in which they can store their own data and thereby are in full control of who can access it. Solid thereby targets personal and potentially confidential data, with a focus on access control and minimizing data duplication. The Solid ecosystem has been applied in a number of use cases, such as collaboration within decentralized construction projects (Werbrouck et al., 2020), and decentralization of citizen data within governments (Buyle et al., 2019). Such approaches where data is distributed but not replicated, however, often lead to major difficulties when queries need to be executed over such a federation of data sources (Taelman, Steyskal & Kirrane, 2020). This stands in contrast to decentralized approaches where data is not only distributed but also replicated, which typically target open and public data and have an emphasis on scalability and reliability. 
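To make the client-side query splitting over TPF-style interfaces mentioned above a bit more tangible, the following is a deliberately simplified sketch. It is not how Comunica is implemented; the endpoint URL, the parameter names, and the absence of paging are simplifying assumptions, and the control metadata contained in real fragment responses is simply filtered out by predicate.

# Much-simplified sketch of client-side joining over a TPF-style interface:
# each triple pattern is fetched separately from the server, and the join is
# computed on the client. Endpoint URL and parameter names are assumptions;
# real engines (e.g., Comunica) additionally handle paging, the declared
# hypermedia controls, and join ordering.
import requests
from rdflib import Graph

FRAGMENT_API = "https://example.org/fragments"  # hypothetical TPF endpoint
FOAF = "http://xmlns.com/foaf/0.1/"

def fetch_pattern(s=None, p=None, o=None):
    """Return the triples matching one pattern as tuples of strings."""
    params = {k: v for k, v in
              {"subject": s, "predicate": p, "object": o}.items() if v}
    resp = requests.get(FRAGMENT_API, params=params,
                        headers={"Accept": "text/turtle"})
    g = Graph()
    g.parse(data=resp.text, format="turtle")
    # Keep only the data triples; fragment metadata uses other predicates.
    return [(str(a), str(b), str(c)) for a, b, c in g
            if p is None or str(b) == p]

# Two-pattern query: ?person foaf:knows ?friend . ?friend foaf:name ?name
for person, _, friend in fetch_pattern(p=FOAF + "knows"):
    for _, _, name in fetch_pattern(s=friend, p=FOAF + "name"):
        print(person, "knows", friend, "named", name)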
Blockchain-based solutions fall into the latter category, for which a whole range exists of possible approaches to integrate Linked Data (Third & Domingue, 2017). A core trade-off of all blockchain-based approaches is to either sacrifice some degree of decentralization with permissioned blockchains or to pay the price of the expensive mining process. For applications that do not crucially depend on a fixed and agreed-upon order of events, as cryptocurrencies do for their transaction ledger, the costs of Blockchain-based solutions in fact often do not seem to offset their benefits. Our approach to be presented below also falls into this second category of decentralization approaches with replicated data sources, but does not entail the costs of Blockchain-based approaches. APPROACH The approach to be presented here, as shown in Fig. 1, is based on our work on nanopublications and the ecosystem for publishing them, as introduced above. The core of this new approach is to allow end-users to directly publish Linked Data snippets in the form of nanopublications with our existing decentralized nanopublication publishing network through an interface powered by semantic templates, which are themselves published as nanopublications. Below we explain how users can establish their identity by announcing their public key, and how they can then sign, publish, and update their Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 4/23 https://www.w3.org/TR/websub/ https://www.w3.org/TR/prov-aq/#provenance-pingback https://schema.org/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ nanopublications. Then we describe our extension of Triple Pattern Fragments to support quads and thereby nanopublications. Next, we show how we defined two complementary sets of services on top of the existing nanopublication network to query and access the published data in a redundant and reliable way. Finally, we explain how these components together with semantic templates allowed us to build a flexible and intuitive end-user application called Nanobench. Identities and updates Nanopublications typically specify their creator in the publication info graph, but because anybody can publish anything they want through the existing open nanopublication network, there is no guarantee that this creator link is accurate. For that reason, we propose here a method to add a digital signature to the publication graph. With our approach, users have to first introduce their identifier and public key before they can publish their own nanopublications. This introduction is itself published as a signed nanopublication declaring the link between the personal identifier (such as an ORCID identifier) and the public key in its assertion graph, as shown by this example: sub:assertion { sub:keyDeclaration npx:declaredBy orcid:0000-0001-2345-6789 ; npx:hasAlgorithm "RSA"; npx:hasPublicKey "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQK…" . } Below, we will come back to the question of how we can ensure that this user is indeed in control of the stated ORCID identifier. Once an identity is established in this way, the respective user can publish nanopublications such as the one shown in Fig. 2, where the personal identifier and the public key are mentioned in the publication info graph (yellow) together with the digital signature that is calculated with the respective private key on the entire nanopublication, excluding only the npx:hasSignature triple and the hash code of the trusty URI. 
The trusty URI (here represented with the prefix this: ) is Figure 1 The architecture of our overall approach. Full-size DOI: 10.7717/peerj-cs.387/fig-1 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 5/23 http://dx.doi.org/10.7717/peerj-cs.387/fig-1 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ calculated as a last step, which therefore also covers the signature. This makes the nanopublication including its signature verifiable and immutable. Immutability is a desirable property to ensure stable and reliable linking, but for practical purposes it has to come with a mechanism to declare updates and mark obsolete entries. With our approach, new versions of a nanopublication can be declared with the npx:supersedes property in the publication info graph of the nanopublication containing the update, for example: sub:pubinfo { this: npx:supersedes . … } In order to declare a nanopublication obsolete without an update, the npx:retracts property can be used in the assertion graph of a separate retraction nanopublication, for example: sub:assertion { orcid:0000-0001-2345-6789 npx:retracts . } Figure 2 Example nanopublication in TriG notation that was published with Nanobench. Full-size DOI: 10.7717/peerj-cs.387/fig-2 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 6/23 http://dx.doi.org/10.7717/peerj-cs.387/fig-2 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Of course, updated versions and retractions should only be considered valid if authorized by the author of the original nanopublication. For the scope of this work, we only consider them valid if the retraction or update is signed with the same key pair, but more flexible solutions are possible in the future. The elements introduced so far allow us to cryptographically verify that given nanopublications were published by the same user who introduced herself in her introduction nanopublication, but they still allow anybody to claim any ORCID identifier (or other kind of identifier). To add this missing link, users can add the link of their introduction nanopublication to their ORCID profile under “Websites & Social Links”, which proves that they have control of that account. This link is represented with foaf:page when the user identifier is resolved with a HTTP GET request asking for an RDF representation via content negotiation. This is thereby a general method that can work on any URL scheme and identification mechanism providing dereferenceable user identifiers, but for simplicity we will restrict our discussion here to ORCID identifiers. Quad pattern fragments Nanopublications, as can be seen in Fig. 2, are represented as four named RDF graphs. Triple Pattern Fragments, however, as their names indicates, only support triples and not quads (which include the graph information), and TPF is therefore insufficient for querying nanopublications. For this reason, we introduce an extension of TPF to support quads, called Quad Pattern Fragments (QPF) (https://linkeddatafragments.org/ specification/quad-pattern-fragments/). In order to allow querying over QPF, its HTTP responses include metadata that declaratively describe the controls via which querying is possible. These controls are defined in a similar way as for TPF using the Hydra Core vocabulary (Lanthaler & Gütl, 2013), and allows intelligent query engines to detect and use them. Below, an example of these controls is shown: @prefix rdf: . @prefix hydra: . @prefix void: . @prefix sd: . 
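As a rough illustration of the signing step described above, consider the following Python sketch. It is only a conceptual approximation under simplifying assumptions: the actual tooling (e.g., nanopub-java) applies the specific trusty URI normalization to the quads before hashing and signing, whereas this sketch just sorts a list of quads, signs the resulting byte string with RSA and SHA-256, and Base64-encodes the result as it would appear in the npx:hasSignature triple.

# Conceptual sketch of signing a nanopublication's content (simplified; the
# real normalization is the one defined for trusty URIs, not a plain sort).
import base64
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

def sign_quads(quads, private_key):
    """Sign a list of (subject, predicate, object, graph) string tuples."""
    payload = "\n".join(" ".join(q) for q in sorted(quads)).encode("utf-8")
    signature = private_key.sign(payload, padding.PKCS1v15(), hashes.SHA256())
    return base64.b64encode(signature).decode("ascii")  # -> npx:hasSignature

# Example with a freshly generated RSA key pair and a single assertion triple:
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
quads = [("orcid:0000-0001-2345-6789", "foaf:knows",
          "orcid:0000-0002-1825-0097", "sub:assertion")]
print(sign_quads(quads, key))

Verification works analogously with the public key that the user declared in their introduction nanopublication.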
a void:Dataset, hydra:Collection; void:subset ; sd:defaultGraph ; hydra:search _:pattern. _:pattern hydra:template "https://example.org/{?s,p,o,g}"; hydra:variableRepresentation hydra:ExplicitRepresentation; hydra:mapping _:subject, _:predicate, _:object, _:graph. _:subject hydra:variable "s"; hydra:property rdf:subject. _:predicate hydra:variable "p"; hydra:property rdf:predicate. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 7/23 https://linkeddatafragments.org/specification/quad-pattern-fragments/ https://linkeddatafragments.org/specification/quad-pattern-fragments/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ _:object hydra:variable "o"; hydra:property rdf:object. _:graph hydra:variable "g"; hydra:property sd:graph. The control above indicates that the QPF API accepts four URL parameters, corresponding to the four elements of a quad. For example, a query to this API for the pattern ?s npx:retracts ?o sub:assertion would result in an HTTP request for the URL https://example.org/?p=npx:retracts&g=sub:assertion1. Just like with TPF, intelligent clients can be built that can handle more complex queries (such as SPARQL queries) over QPF APIs. This requires these clients to split up a SPARQL query into multiple quad patterns, which can be resolved by the API, after which they can be joined by the client to form a complete query result. QPF has been designed to be backwards-compatible with TPF. This means that clients that implement support for TPF APIs, but do not understand the notion of QPF, will be able to recognize the API as TPF, and execute triple pattern queries against it. Due to the declaratively described QPF and TPF controls, clients such as the Comunica engine can recognize and make use of both variants next to each other. A live version of a QPF API can be found at https://ldf.nanopubs.knows.idlab.ugent.be/np, which is one of six instances of this service in our network2. Nanopublication services Nanopublications can be reliably and redundantly published by uploading them to the existing nanopublication server network (Kuhn et al., 2016), which at the time of writing consists of eleven severs in five countries and storing more than 10 million nanopublications (http://purl.org/nanopub/monitor). This network implements a basic publishing layer where nanopublications can be looked up by their trusty URI, but no querying features are provided. In order to allow for querying of the nanopublications’ content, we present here our implementation of a new service layer built on top of the existing publication layer. While we are using a triple store with SPARQL under the hood, we do not provide a full-blown SPARQL endpoint to users in order to address the above-mentioned problems of availability and scalability. For our nanopublication service layer, we employ a mix of two kinds of services that are more restricted than SPARQL but also more scalable. The first kind of service is based on LDF via our QPF API, as introduced above, and allows only for simple queries at the level of individual RDF statements but does not impose further restrictions. The second one is based on the grlc API generator (Meroño-Peñuela & Hoekstra, 2016), which optionally comes with the Tapas HTML interface (Lisena et al., 2019) and which can be used to execute complex queries but is restricted to a small number of predefined patterns. 
The LDF-based services reduce the complexity and load on the server by only allowing for very simple queries to be asked to the server, and delegate the responsibility of orchestrating them to answer more complex questions to the client. The grlc-based 1 For simplicity, URLs for p and g are prefixed, whereas they will be expanded in practise. 2 A live example of a QPF client that can query over this API can be found at http://query.linkeddatafragments.org/ #datasources=https%3A%2F%2Fldf. nanopubs.knows.idlab.ugent.be%2Fnp. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 8/23 https://example.org/?p=npx:retracts&g=sub:assertion https://ldf.nanopubs.knows.idlab.ugent.be/np http://purl.org/nanopub/monitor http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://query.linkeddatafragments.org/#datasources=https%3A%2F%2Fldf.nanopubs.knows.idlab.ugent.be%2Fnp http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ services reduce the complexity and load by allowing only for queries that are based on a small number of SPARQL templates that are hand-crafted for optimized performance. These two kinds of services are thereby designed to be complementary, with grlc being restricted but faster and LDF being more powerful but slower. The grlc-based services provide general API operations that are based on 14 SPARQL templates: � find_nanopubs returns all nanopublication identifiers in undefined order (paginated in groups of 1,000) possibly restricted by the year, month, or day of creation; � find_nanopubs_with_pattern additionally allows for specifying the subject, predicate, and/or object of a triple in the nanopublication as a filter, and to restrict the occurrence of that triple to the assertion, provenance, or publication info graph; � find_nanopubs_with_uri similarly allows for filtering by a URI irrespective of its triple position; � find_nanopubs_with_text supports full-text search on the literals in the nanopublication (using non-standard SPARQL features available in Virtuoso and GraphDB); � for each of the four find_nanopubs_* templates mentioned above, there is also a find_signed_nanopubs_* version that only returns nanopublications that have a valid signature and that allows for filtering by public key; � get_all_indexes returns all nanopublication indexes (i.e., sets of nanopublications); � get_all_users returns all users who announced a public key via an introduction nanopublication; � get_backlinks returns all identifiers of nanopublications that directly point to a given nanopublication; � get_deep_backlinks does the same thing but includes deep links through chains of nanopublications linking to the given one; � get_latest_version returns the latest version of a given nanopublication signed by the same public key by following npx:supersedes backlinks; � get_nanopub_count returns the number of nanopublications, possibly restricted by year, month, or day of creation. The full SPARQL templates can be found in the Supplemental Material (see below). These API calls provide a general set of queries based on which applications with more complex behavior can be built. We will introduce Nanobench as an example of such an application below. In order to answer some of the above queries, auxiliary data structures have to be created while loading new nanopublications. 
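As a small illustration of how an application could consume one of these operations, the following sketch issues a parameterized GET request to the find_nanopubs_with_pattern operation. The base URL, the parameter names, and the result variable are hypothetical placeholders (the actual names are determined by the deployed grlc instances and the SPARQL templates in the Supplemental Material); the point is only the general calling pattern of a grlc-generated API.

# Hypothetical sketch of calling a grlc-generated API operation; base URL,
# parameter names, and the result variable "np" are assumptions made for
# illustration, not the documented interface of the deployed services.
import requests

GRLC_API = "https://example.org/api"  # placeholder base URL of one instance

def find_nanopubs_with_pattern(pred=None, obj=None, page=1):
    """Return nanopublication URIs that contain a matching triple."""
    params = {"page": page}
    if pred:
        params["pred"] = pred  # assumed name of the predicate parameter
    if obj:
        params["obj"] = obj    # assumed name of the object parameter
    resp = requests.get(GRLC_API + "/find_nanopubs_with_pattern",
                        params=params,
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    return [row["np"]["value"] for row in resp.json()["results"]["bindings"]]

# Example: the first page of nanopublications containing a foaf:knows triple.
for np_uri in find_nanopubs_with_pattern(pred="http://xmlns.com/foaf/0.1/knows"):
    print(np_uri)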
Most importantly, digital signatures cannot be checked in SPARQL directly, as this involves translating the triples of a nanopublication into a normalized serialization and then calculating a cryptographic hash function on it, which goes beyond SPARQL’s capabilities. Other aspects like deep backlinks are complicated because it is not sufficient to check whether a link is present, but we also need Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 9/23 http://dx.doi.org/10.7717/peerj-cs.387#supplemental-information http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ to check that the respective triple is located in the linking nanopublication (as a triple linking two nanopublications could itself be located in a third nanopublication). In order to solve these problems, additional triples in two administrative graphs are generated when new nanopublications are loaded. Concretely, the following triples are added for each nanopublication (placeholders in capitals): npa:graph { npa:hasHeadGraph ; dct:created "DATETIME"^^xsd:dateTime ; npa:creationDay ; npa:creationMonth ; npa:creationYear ; npa:hasValidSignatureForPublicKey "PUBLICKEY" . } npa:networkGraph { REFERENCED-NPURIS… . npa:refersToNanopub REFERENCED-NPURIS… . } The first triple of the npa:graph links the nanopublication identifier to its head graph, where the links to its assertion, provenance, and publication info graphs can be found. The second one contains the creation date in a normalized form. Number three to five allow for efficient filtering by day, month, and year, respectively (we use URIs instead of literals because this happens to be much faster for filtering under Virtuoso). The final triple in the npa:graph links the nanopublication to its public key if the signature was found to be valid. In the npa:networkGraph, all instances of linking to another nanopublication with the linking nanopublication URI in subject position are added (e.g., with npx: supersedes). In the cases where another nanopublication is linked but not with the pattern of the linking nanopublication in subject position (e.g., as with npx:retracts), npa:refersToNanopub is used as predicate to link the two nanopublications. We set up a network of six servers in five different countries each providing both of the introduced services (LDF-based and grlc-based). They are notified about new nanopublications by the servers of the existing publishing network, which are otherwise running independently. The services connect to a local instance of a Virtuoso triple store (https://virtuoso.openlinksw.com/), into which all nanopublications are loaded via a connector module. This connector module also creates the additional triples in the administrative graphs as explained above. While the restriction to predefined templates with grlc significantly improves the scalability of the system as compared to unrestricted SPARQL, further measures will be needed in the future if the number of nanopublications keeps growing to new orders of magnitude. The services presented here are designed in such a way that such measures are possible with minimal changes to the API. The 14 query templates of the grlc services can be distributed to different servers, for example, such that a single server would only be Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 10/23 https://virtuoso.openlinksw.com/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ responsible for one of the 14 kinds of queries. 
This server could then use an optimized data structure for exactly that kind of query and would only need to hold a fraction of the data. The find_ queries could moreover be further compartmentalized based on publication date, for example each server instance just covering a single year. The LDF-based services could be distributed in a similar fashion, for example based on the predicate namespace. Nanobench client and templates To demonstrate and evaluate our approach, we next implemented a client application that runs on the user’s local computer, can be accessed through their web browser, and connects to the above decentralized network of services. The code can be found online (https://github.com/peta-pico/nanobench) and Fig. 3 shows a screenshot. In the “search” part of the interface, users are provided with a simple search interface that connects to the grlc API operations find_nanopubs_with_uri (if a URI is entered in the search field) or find_nanopubs_with_text (otherwise). In the “others” part, other users’ latest nanopublications can be seen in a feed-like manner, similar to Twitter feeds. In order for users to publish their own nanopublications and thereby create their own feed, they have to first set up their profile. Nanobench provides close guidance through this process, which involves the declaration of the user’s ORCID identifier, the creation of an RSA key pair, and the publication of an introduction nanopublication that links the public key to the ORCID identifier. The last step of linking the new introduction nanopublication from the user’s ORCID profile is not strictly necessary for the user to start publishing nanopublications and is therefore marked as optional. Once the user profile is completed, a list of templates is shown in the “publish” part of the interface. Templates are published as nanopublications as well, and so this list can be populated via a call to the find_signed_nanopubs_with_pattern operation of the grlc-based services. Currently, the list includes templates for free-text commenting on a URL, expressing a foaf:knows relation to another person, declaring that the user has read a given paper, expressing a gene–disease association, retracting a nanopublication, describing a datasets with a SPARQL endpoint, and publishing an arbitrary RDF triple. Figure 3 A screenshot of the Nanobench application with a publication form. Full-size DOI: 10.7717/peerj-cs.387/fig-3 Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 11/23 https://github.com/peta-pico/nanobench http://dx.doi.org/10.7717/peerj-cs.387/fig-3 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ After selecting a template, a form is automatically generated that allows the user to fill in information according to that template, as shown in Fig. 3. Templates describe the kind of statements users can publish and also provide additional information on how the input form should be presented to the user. This is an example of a template (the same one that is shown in Fig. 3), defined in the assertion graph of a nanopublication: sub:assertion { sub:assertion a nt:AssertionTemplate ; rdfs:label "Expressing that you know somebody" ; nt:hasStatement sub:st1 . sub:st1 a rdf:Statement ; rdf:subject nt:CREATOR ; rdf:predicate foaf:knows ; rdf:object sub:person . foaf:knows rdfs:label "know" . sub:person a nt:UriPlaceholder ; rdfs:label "ORCID identifier of the person you know" ; nt:hasPrefix "https://orcid.org/" ; nt:hasRegex "[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]" . 
} In a template nanopublication, the assertion graph is classified as an AssertionTemplate (in the namespace https://w3id.org/np/o/ntemplate/) and given a human readable label with rdfs:label. Moreover, it is linked to the statement templates (i.e., triples in the nanopublications to be published) via hasStatement. The above example has just one such statement template, but more complex templates involve several of them. These templates then use regular RDF reification to point to their subjects, predicates, and objects. In the case of multiple statements, their order in the form can be defined with statementOrder and some of them can be marked as optional by classifying them as OptionalStatement. rdfs:label can be used on all the elements to define how they should be labeled in the form interface, and the special URI CREATOR is mapped to the identifier of the user applying the template. Importantly, the URIs in subject, predicate, or object position of the template statements can be declared placeholders with the class UriPlaceholder, and similarly for literals with LiteralPlaceholder. Such placeholders are represented as input elements, such as text fields or drop-down menus, in the form that is generated from the template. Currently supported more specific placeholder types include TrustyUriPlaceholder, which requires a trusty URI (such as a nanopublication URI), and RestrictedChoicePlaceholder, which leads to a drop-down menu with the possible options defined by the property possibleValue. For URI placeholders, prefixes can be defined with hasPrefix and regex restrictions with hasRegex, as can be seen in the example above. Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 12/23 https://w3id.org/np/o/ntemplate/ http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Once the user filled in a form that was generated from a template and clicks on “Publish”, Nanobench creates the assertion graph of a new nanopublication by following the template and replacing all the placeholders with the user’s input. For the provenance graph, only a simple prov:wasAttributedTo link to the user’s identifier is currently added (we are working on extending the coverage of templates to the provenance and publication info graphs). In the publication info graph, Nanobench adds a timestamp, specifies the user as the creator of the nanopublication, and adds a wasCreatedFromTemplate link that points to the underlying template nanopublication. Then, Nanobench adds a digital signature element to the publication info graph with a signature made from the user’s local private key, transforms the whole nanopublication into its final state with a trusty URI, and finally publishes it to the server network with a simple HTTP POST request. Within a few minutes or less, it then appears in the user’s feed. Nanobench currently makes use of the redundancy of the nanopublication services in a very simple way: For each query, it randomly selects two grlc service instances and sends the same query to both. It then processes the result as soon as it gets the first answer and discards the second, thereby increasing the chance of success and lowering the average waiting time. More sophisticated versions of this protocol are of course easily imaginable and will be investigated in future work. PERFORMANCE EVALUATION In order to evaluate our approach, we introduce here a performance evaluation that we ran on the network of nanopublication services. 
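To make the placeholder mechanics concrete, here is a small illustrative sketch of how a single form value could be validated and expanded according to the foaf:knows template above. It is a simplification under stated assumptions (Nanobench itself operates on the RDF template directly, and the helper below is not part of its code base), but it shows the roles of nt:hasPrefix, nt:hasRegex, and the special CREATOR URI.

# Simplified, hypothetical sketch of instantiating the foaf:knows template:
# the form input is checked against the declared regex (nt:hasRegex),
# expanded with the declared prefix (nt:hasPrefix), and substituted into the
# statement together with the publishing user's identifier (nt:CREATOR).
import re

KNOWS_TEMPLATE = {
    "predicate": "http://xmlns.com/foaf/0.1/knows",
    "object_placeholder": {
        "prefix": "https://orcid.org/",
        "regex": r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]",
    },
}

def instantiate(creator_uri, form_value):
    ph = KNOWS_TEMPLATE["object_placeholder"]
    if not re.fullmatch(ph["regex"], form_value):
        raise ValueError("input does not match the placeholder's regex")
    return (creator_uri,                      # nt:CREATOR -> publishing user
            KNOWS_TEMPLATE["predicate"],
            ph["prefix"] + form_value)        # prefix + validated input

print(instantiate("https://orcid.org/0000-0001-2345-6789",
                  "0000-0002-1825-0097"))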
In the next section we will then look into whether these services are useful to potential end users with a usability evaluation on Nanobench. Performance evaluation design For this performance evaluation we wanted to find out how well the two types of services— LDF-based and grlc-based—perform in our network of services, how they compare, and to what extent they are really complementary. For this purpose, we defined a set of concrete queries that we can then submit to both services. We started with the 14 query templates of the grlc-based service, and instantiated each of them with a simple set of parameters to make 14 concrete executable queries. As parameter values, we chose generic yet realistically useful examples that return non-trivial answer sets for the kind of nanopublications that the current templates describe: (1) find_nanopubs restricted to the month 2020-02; (2) find_nanopubs_with_pattern with the predicate value set to foaf:knows; (3) find_nanopubs_with_text on the free-text keyword “john”; (4) find_nanopubs_with_uri to search for nanopublications mentioning a given ORCID identifier; (5–8) of the form find_signed_nanopubs_* are given the same parameters as (1–4); (9) get_all_indexes and (10) get_all_users do not need parameters; (11) get_backlinks and (12) get_deep_backlinks are given the URI of a specific nanopublication, which has a substantial number of backlinks; (13) get_latest_version is given the URI of the first version of a template Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 13/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ nanopublication that has afterwards been updated four times; and (14) get_nanopub_count is, like (1), restricted to the month 2020-02. We can run these queries via the grlc-powered API but we can also use an LDF engine like Comunica to run them against our LDF-based services. The latter comes with some caveats, as the free text queries of find_nanopubs_with_text and find_signed_nanopubs_with_text depend on implementation-dependent non-standard extensions of SPARQL that do not work with LDF-style query execution. Moreover, Comunica currently lacks support for complex property paths, which are needed for get_deep_backlinks and get_latest_version. Queries (3), (7), (12), and (13) can therefore only be run on the grlc-based services but not on the LDF-based ones. However, the power of the LDF-based services is of course that they can (potentially) run arbitrary SPARQL queries (with some restrictions, as mentioned above). To demonstrate and test this ability, we created another query (15) that in a simple way combines the outputs of two of the currently available templates. Specifically, it checks for a given user (below abbreviated as me:) who he has declared to know via the foaf:knows template, and then searches for papers these people declared to have read via a different template. Thereby, query (15) returns a list of all papers that friends of the user me: have read: select ?person ?paper where { me: foaf:knows ?person . ?person pc:hasRead ?paper . } This query can be considered a quick-and-dirty solution for exploration purposes, as it misses a number of checks. It does not check that both triples are in the assertion graphs of signed nanopublications, that the first is signed with the public key corresponding to the user in subject position, and that neither of the nanopublications is superseded or retracted. We therefore define query (16) that includes all these checks. 
This query is more complicated, and we show here for illustration just the SPARQL fragment of the part necessary to check that the second nanopublication ?np2 with public key ?pubkey2 was not retracted: filter not exists { graph npa:graph { ?retraction npa:hasHeadGraph ?rh . ?retraction npa:hasValidSignatureForPublicKey ?pubkey2 . } graph ?rh { ?retraction np:hasAssertion ?ra . } graph ?ra { ?somebody npx:retracts ?np2 . } } The inconvenience of writing such rather complicated queries can be addressed by future versions of the services, which could include predefined options to restrict the query to the assertion graphs and to up-to-date content. The full set of used Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 14/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ queries and further details can be found in the Supplemental Material online (DOI 10.5281/zenodo.3994068). To evaluate the performance of the nanopublication services, we accessed them in a clearly defined setting from a number of different locations from personal computers via home networks, by running the 16 queries specified above on all service instances of both kinds. For that, we created a Docker image that accesses the grlc-based services with simple HTTP requests via curl and the LDF-based ones with the Comunica (https://github.com/comunica/comunica) engine 1.12.1. The results as well as the execution time of all the calls are recorded, which is then used to evaluate the performance. For both kinds of services, the timeout is set to 60 s. Performance evaluation results We ran the Dockerized evaluation process described above at five places in four different countries. Each of them ran all of the compatible queries on each of the six existing service instance for both of the two kinds. For each query we therefore have 30 outcomes for grlc and another 30 outcomes for LDF. These outcomes fall into the general categories of timeout, error, and full result. In the case of the LDF-based services, timeout and error outcomes can come with partial results. Figure 4 shows a summary of these overall outcomes. With grlc, 96% of the calls succeeded and only 4% resulted in an error (mostly due to downtime of one particular service). With LDF, 73% fully succeeded, 21% reached the timeout, and 6% gave an error. The latter two sometimes gave partial results: overall 6% reached a timeout while still giving partial results, and overall 3% gave an error with a partial result. For LDF, these types of outcomes are not evenly distributed. Two queries— find_nanopubs_with_uri (4) and get_all_indexes (9)—never fully succeeded, but the former sometimes gave partial results. For the remaining queries, however, these LDF calls returned at least a partial result in 97% of the cases. Except for query (1) in addition to the above mentioned (4) and (9), the full result was always received from at least one of the servers in LDF mode. For grlc, this was the case for all queries. 
Figure 4 Overall outcomes per query and kind of service, executed from five locations. (Bar chart showing, per query and for both grlc and LDF, the ratio of query executions ending in a full result, a timeout with or without partial result, or an error with or without partial result.)
A client checking multiple servers would therefore have eventually received the full result. For query (1) in LDF mode, this was true for 4 cases out of 5. Next, we can look at the time performance. Table 1 shows the average execution times per query and service type, including only the calls that returned a full result. The successful queries to the grlc services took on average from 0.21 to 6.46 s. For the LDF services, these numbers range from 1.53 to 35.26 s (but they can be a bit misleading as they ignore the fact that the LDF services repeatedly hit the time limit of 60 s). For the queries that could successfully be run on both kinds of services, LDF is on average 7.18 to 86.50 times slower than grlc. Importantly, the queries that do not follow a predefined pattern (15) and (16) gave the full result with LDF in 97% of the cases and ran reasonably fast. The quick-and-dirty version (15) required on average 2.30 s, whereas the thorough one (16) completed on average after 10.07 s.

Table 1 Average execution times of the successful query executions in seconds.
   Query                               grlc    LDF     LDF/grlc
1  find_nanopubs                       1.02    35.26   34.48
2  find_nanopubs_with_pattern          0.55    6.69    12.20
3  find_nanopubs_with_text             6.46
4  find_nanopubs_with_uri              0.78
5  find_signed_nanopubs                0.49    20.77   42.05
6  find_signed_nanopubs_with_pattern   0.73    9.57    13.04
7  find_signed_nanopubs_with_text      1.54
8  find_signed_nanopubs_with_uri       0.34    29.53   86.50
9  get_all_indexes                     3.52
10 get_all_users                       0.65    31.09   47.71
11 get_backlinks                       0.21    1.53    7.18
12 get_deep_backlinks                  0.68
13 get_latest_version                  0.71
14 get_nanopub_count                   0.23    6.54    28.29
15 papers                                      2.30
16 papers_x                                    10.07

USABILITY EVALUATION Now that we know that the services perform reasonably well, we wanted to find out whether this general approach and our specific Nanobench tool indeed make it easy for users who might not be experts in Linked Data to publish their own small data entries.
Usability evaluation design We wanted to test the usability of Nanobench in a real setting, where users actually publish nanopublications. For that we wrote detailed instructions on how to install and use Nanobench and its publication feature, which includes downloading the latest Nanobench release, running it locally, accessing Nanobench through their web browser, completing the Nanobench profile, accessing the list of templates, and finally filling in and submitting
Sci., DOI 10.7717/peerj-cs.387 16/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ the publication form generated from a chosen template. Through mailing lists, social media, and personal contacts, we tried to convince as many people as possible to try out Nanobench and to publish some nanopublications on their own. Next, we created an anonymous usability questionnaire, consisting of the ten standard questions of the widely used System Usability Scale (SUS) (Brooke, 1996). We added to that the questions “Have you published RDF/Linked Data before?” and “Have you digitally signed RDF/Linked Data before?”, and as a follow up to each of them—if the answer was “yes”—whether Nanobench was harder or easier to use for publishing and signing Linked Data, respectively, compared to how they previously did it. The responses were on a 5-point Likert scale from 1 (Nanobench was harder) to 5 (Nanobench was easier). We sent this questionnaire to all the Nanobench users who published at least one nanopublication (not counting their introduction nanopublication), excluding the co-authors of this paper and their close relatives. Further details, including instructions and questionnaire, can be found in the supplemental material online (DOI 10.5281/zenodo.3994066). Usability evaluation results Overall, 42 users registered in the decentralized system by publishing an introduction nanopublication. A total of 29 of them (69%) also linked this introduction nanopublication from their ORCID accounts, which was a step that was marked as optional. Collectively, they published 81 nanopublications, not counting their introduction nanopublications, via the use of seven distinct templates. After applying the exclusion criteria defined above, we arrived at a set of 29 users to whom we sent the anonymous usability questionnaire (this set of users is overlapping but different from the set of 29 users mentioned just above). After sending up to two reminders, we received responses from all of them. On the question of whether they had published Linked Data before, 21 respondents (72%) said they did. 20 of them (95%) reported that Nanobench was easier to use compared to how they previously published Linked Data, with the remaining one being indifferent (score of 3). The average was 4.5 on the 5-point Likert scale. Of the 21 respondents, only three (14%) stated that they had previously digitally signed Linked Data. All three of them found Nanobench easier, giving two times a 5 and once a 4 as responses (average 4.7). Table 2 shows the results of the SUS questions. Overall, our system achieved a SUS score of 77.76, which is clearly above the average score reported in the literature (70.14) and is roughly in the middle between “good” and “excellent” on an adjective scale (Bangor, Kortum & Miller, 2008). Interestingly, if we only consider the eight respondents who stated they had never published Linked Data before, this value is even better at 85.94, clearly in the “excellent” range. The participants were moreover given the possibility to provide further feedback in a free-text field. We received a variety of comments for further improvement, but except for the point that the required local installation was somewhat inconvenient, no point was mentioned more than once. The other comments concerned the search page being confusing (this part of the interface was indeed not the focus of the study), the lack of Kuhn et al. (2021), PeerJ Comput. 
support for batch publishing of multiple similar nanopublications, the lack of integrated ORCID lookup, the relatively small number of general-purpose templates, the lack of RDF prefix recognition, the fact that not all lengthy URIs are masked with readable labels in the user interface, and the fact that the confirmation checkbox did not mention the possibility of retraction. A further comment was that a command-line interface would have been preferred in the particular context of the given participant. Such a command-line interface actually exists (as part of the nanopub-java library; Kuhn, 2016) but was not the focus of this study.

Table 2 SUS usability evaluation results. The five response-count columns run from the worst to the best response (for the odd questions these are the answers 1 to 5, for the even questions the answers 5 to 1), followed by the resulting score.
SUS question                                                                                    worst  ->  best   Score
1: I think that I would like to use this system frequently                                       0   3   9  13   4   65.52
2: I found the system unnecessarily complex                                                      0   0   3  11  15   85.34
3: I thought the system was easy to use                                                          0   1   1  13  14   84.48
4: I think that I would need the support of a technical person to be able to use this system     1   2   5   7  14   76.72
5: I found the various functions in this system were well integrated                             0   1   7  14   7   73.28
6: I thought there was too much inconsistency in this system                                     0   1   2  15  11   81.03
7: I would imagine that most people would learn to use this system very quickly                  0   3   6  14   6   69.83
8: I found the system very cumbersome to use                                                     0   0   1  17  11   83.62
9: I felt very confident using the system                                                        0   1   6  15   7   74.14
10: I needed to learn a lot of things before I could get going with this system                  0   1   4   8  16   83.62
Total                                                                                            1  13  44 127 105   77.76

DISCUSSION AND CONCLUSION The results of the performance study described above confirm that the tested kinds of queries can be efficiently answered by at least one of the two types of services, and that these two service types are indeed complementary. The grlc services run reliably and fast on the types of queries they are designed for. The LDF services can run most of these kinds of queries too, albeit in a much slower fashion, and they are reasonably fast for simple kinds of unrestricted queries. The results of the usability study indicate that our Nanobench client application connecting to these services is indeed easily and efficiently usable, even for users with no prior experience in Linked Data publishing. In future work, we are planning to improve a number of aspects of the involved tools and methods. For example, our approach does not yet exploit the full potential of replication in our decentralized setting. Existing work has shown that a client-side algorithm can enable effective load-balancing over TPF servers (Minier et al., 2018), and we plan to extend this work to QPF. As another example, our otherwise decentralized approach currently uses centralized ORCID identifiers. We are therefore investigating decentralized forms of authentication, such as WebID-OIDC (https://github.com/solid/webid-oidc-spec) or an approach similar to the web of trust (Caronni, 2000), where public keys are found based on personal trust relationships that could themselves be published as nanopublications.
Such a web of trust could then also allow users in the future to find trustworthy services. This could include meta services whose task is to monitor and test other kinds of services, so clients could make an informed decision on which service instances to rely on. This is currently difficult, as there is no guarantee that all services are well-behaved and return complete and correct results. Clients could already now deal with this by taking random samples of nanopublications from the publishing servers and check whether the query services correctly return them, but this is quite resource intensive. Another issue that needs to be taken care of in future work is identity management when private keys are compromised, lost, or simply replaced as a measure of precaution. For that, we envisage that introduction nanopublications are extended so users can also list old public keys. On top of that, we are going to need a method for users to re-claim old nanopublications they signed with an old key that has since been compromised by a third party (possibly by linking to them with an index nanopublication signed with a new key). This will also require modifications in how we deal with retracted and superseded nanopublications, as they might then be signed with a different key. This is not trivial but can be dealt with within our framework, as opposed to Blockchain-based solutions where identity is inseparably linked to private key access. Currently, users need to install Nanobench locally to ensure secure private key access and proper decentralization, but a more flexible and more powerful handling of private keys as explained above will also allow us to provide login-based public Nanobench instances with their own sets of private keys, which in turn can significantly increase the ease of use of our approach. More work is also needed on the templating features to also cover the provenance and publication info graphs. We also plan to more closely align our templating vocabulary with existing RDF shape standards. Moreover, we are working on making our templating approach more general and more powerful, by adding repeatable statement patterns among other features, such that we can express, for example, templates of templates and thereby allow users to create and publish their own templates directly via Nanobench. The tools and applications we described above in a sense just scratch the surface of what can become possible with our general approach in the nearer future, from Linked Data publications of the latest scientific findings, to formally organized argumentation and automated real-time aggregations. We believe that our approach of semantic micro-contributions could in fact be the starting point of bringing Linked Data publishing to the masses. ADDITIONAL INFORMATION AND DECLARATIONS Funding Ruben Taelman is a postdoctoral fellow of the Research Foundation — Flanders (FWO) (1274521N). Support for Vincent Emonet and Michel Dumontier was provided by the Biomedical Data Translator project funded by National Institutes of Health (No. OT2TR003434-01). Stian Soiland-Reyes was funded by BioExcel-2 (European Kuhn et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.387 19/23 http://dx.doi.org/10.7717/peerj-cs.387 https://peerj.com/computer-science/ Commission H2020-INFRAEDI-02-2018-823830). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 
Grant Disclosures
The following grant information was disclosed by the authors:
Research Foundation - Flanders (FWO): 1274521N.
National Institutes of Health: OT2TR003434-01.
BioExcel-2 (European Commission): H2020-INFRAEDI-02-2018-823830.

Competing Interests
Haris Antonatos is employed by SciFY. The authors declare that they have no competing interests.

Author Contributions
• Tobias Kuhn conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Ruben Taelman performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Vincent Emonet performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Haris Antonatos performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Stian Soiland-Reyes performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.
• Michel Dumontier performed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Supplemental data for the performance evaluation is available at Zenodo: Tobias Kuhn, & Vincent Emonet. (2020, August 21). peta-pico/nanopub-services-eval 1.0 (Version 1.0). Zenodo. DOI 10.5281/zenodo.3994068.
Supplemental data for the usability evaluation is available at Zenodo: Tobias Kuhn. (2020, August 21). peta-pico/nanobench-usability-eval 1.0 (Version 1.0). Zenodo. DOI 10.5281/zenodo.3994066.
The code for Nanobench (release nanobench-1.7) is available at Zenodo: Tobias Kuhn, & Vincent Emonet. (2020, November 26). peta-pico/nanobench: nanobench-1.7 (Version nanobench-1.7). Zenodo. DOI 10.5281/zenodo.4292171.
The code for the nanopublication services (release nanopub-services-1.0) is also available at Zenodo: Tobias Kuhn. (2020, November 26). peta-pico/nanopub-services: nanopub-services-1.0 (Version nanopub-services-1.0). Zenodo. DOI 10.5281/zenodo.4291594.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.387#supplemental-information.

REFERENCES
Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. 2007. DBpedia: A nucleus for a web of open data. In: The Semantic Web. Springer DOI 10.1007/978-3-540-76298-0_52.
Bangor A, Kortum PT, Miller JT. 2008. An empirical evaluation of the system usability scale. International Journal of Human-Computer Interaction 24(6):574–594 DOI 10.1080/10447310802205776.
Baumeister J, Reutelshoefer J, Puppe F. 2011. KnowWE: a semantic wiki for knowledge engineering. Applied Intelligence 35(3):323–344 DOI 10.1007/s10489-010-0224-5.
Berners-Lee T. 2009. Linked data. Available at https://www.w3.org/DesignIssues/LinkedData.html.
Bizer C, Heath T, Berners-Lee T. 2011. Linked data: the story so far. In: Semantic Services, Interoperability and Web Applications: Emerging Concepts. IGI Global DOI 10.4018/978-1-60960-593-3.ch008.
Brooke J. 1996. SUS: a quick and dirty usability scale. In: Jordan PW, Thomas B, Weerdmeester BA, McClelland IL, eds. Usability Evaluation in Industry. Milton Park: Taylor & Francis, 189–194.
Buil-Aranda C, Hogan A, Umbrich J, Vandenbussche P-Y. 2013. SPARQL web-querying infrastructure: ready for action? In: The Semantic Web – ISWC 2013. Springer DOI 10.1007/978-3-642-41338-4_18.
Buyle R, Taelman R, Mostaert K, Joris G, Mannens E, Verborgh R, Berners-Lee T. 2019. Streamlining governmental processes by putting citizens in control of their personal data. In: International Conference on Electronic Governance and Open Society: Challenges in Eurasia. Springer, 346–359 DOI 10.1007/978-3-030-39296-3_26.
Caronni G. 2000. Walking the web of trust. In: WET ICE 2000. Piscataway: IEEE DOI 10.1109/ENABL.2000.883720.
Delva H, Rojas Melendez JA, Colpaert P, Verborgh R. 2019. Decentralized publication and consumption of transfer footpaths. First International Workshop on Semantics for Transport 2447:1–7.
Feigenbaum L, Todd Williams G, Grant Clark K, Torres E. 2013. SPARQL 1.1 protocol. Rec., W3C. Available at https://www.w3.org/TR/2013/RECsparql11-protocol-20130321/.
Ghidini C, Kump B, Lindstaedt S, Mahbub N, Pammer V, Rospocher M, Serafini L. 2009. MoKi: the enterprise modelling wiki. In: European Semantic Web Conference. Springer, 831–835 DOI 10.1007/978-3-642-02121-3_65.
Guha RV, Brickley D, Macbeth S. 2016. Schema.org: evolution of structured data on the web. Communications of the ACM 59(2):44–51 DOI 10.1145/2844544.
Kuhn T. 2008. AceWiki: A Natural and Expressive Semantic Wiki. In: Proceedings of Semantic Web User Interaction at CHI 2008: Exploring HCI Challenges. CEUR Workshop Proceedings.
Kuhn T. 2016. Nanopub-java: a Java library for nanopublications. In: Linked Science: Proceedings of the 5th Workshop on Linked Science 2015 - Best Practices and the Road Ahead (LISC 2015). Vol. 1572. CEUR Workshop Proceedings, 19–25.
Kuhn T, Barbano PE, Nagy ML, Krauthammer M. 2013. Broadening the scope of nanopublications. In: Extended Semantic Web Conference. Springer DOI 10.1007/978-3-642-38288-8_33.
Kuhn T, Chichester C, Krauthammer M, Queralt-Rosinach N, Verborgh R, Giannakopoulos G, Ngomo A-CN, Viglianti R, Dumontier M. 2016. Decentralized provenance-aware publishing with nanopublications. PeerJ Computer Science 2(1):e78 DOI 10.7717/peerj-cs.78.
Kuhn T, Dumontier M. 2015. Making digital artifacts on the web verifiable and reliable. IEEE Transactions on Knowledge and Data Engineering 27(9):2390–2400 DOI 10.1109/TKDE.2015.2419657.
Kuhn T, Dumontier M. 2017. Genuine semantic publishing. Data Science 1(1–2):139–154 DOI 10.3233/DS-170010.
Kuhn T, Willighagen E, Evelo C, Queralt-Rosinach N, Centeno E, Furlong LI. 2017. Reliable granular references to changing linked data. In: International Semantic Web Conference. Springer DOI 10.1007/978-3-319-68288-4_26.
Lanthaler M, Gütl C. 2013. Hydra: A vocabulary for hypermedia-driven web APIs. In: LDOW2013. Rio de Janeiro, Brazil, 996.
Lisena P, Meroño-Peñuela A, Kuhn T, Troncy R. 2019. Easy web API development with SPARQL transformer. In: International Semantic Web Conference. Cham: Springer DOI 10.1007/978-3-030-30796-7_28.
Mansour E, Sambra AV, Hawke S, Zereba M, Capadisli S, Ghanem A, Aboulnaga A, Berners-Lee T. 2016. A demonstration of the Solid platform for social web applications. In: 25th International Conference Companion on World Wide Web. Montréal, Québec, Canada, 223–226 DOI 10.1145/2872518.2890529.
Meroño-Peñuela A, Hoekstra R. 2016. grlc makes GitHub taste like Linked Data APIs. In: European Semantic Web Conference. Springer DOI 10.1007/978-3-319-47602-5_48.
Minier T, Skaf-Molli H, Molli P, Vidal M-E. 2018. Intelligent clients for replicated triple pattern fragments. In: European Semantic Web Conference. Cham: Springer DOI 10.1007/978-3-319-93417-4_26.
Mons B, Van Haagen H, Chichester C, Den Dunnen JT, Van Ommen G, Van Mulligen E, Singh B, Hooft R, Roos M, Hammond J, Kiesel B, Giardine B, Velterop J, Groth P, Schultes E. 2011. The value of data. Nature Genetics 43(4):281–283 DOI 10.1038/ng0411-281.
Schmachtenberg M, Bizer C, Paulheim H. 2014. Adoption of the linked data best practices in different topical domains. In: International Semantic Web Conference. Cham: Springer, 245–260 DOI 10.1007/978-3-319-11964-9_16.
Shotton D. 2009. Semantic publishing: the coming revolution in scientific journal publishing. Learned Publishing 22(2):85–94 DOI 10.1087/2009202.
Speicher S, Arwe J, Malhotra A. 2015. Linked Data Platform 1.0. W3C Recommendation. Available at https://www.w3.org/TR/ldp/ (accessed 26 February 2015).
Taelman R, Steyskal S, Kirrane S. 2020. Towards querying in decentralized environments with privacy-preserving aggregation. In: Proceedings of the 4th Workshop on Storing, Querying, and Benchmarking the Web of Data.
Taelman R, Van Herwegen J, Vander Sande M, Verborgh R. 2018. Comunica: a modular SPARQL query engine for the web. In: 17th International Semantic Web Conference DOI 10.1007/978-3-030-00668-6_15.
Third A, Domingue J. 2017. Linkchains: exploring the space of decentralised trustworthy linked data. In: Workshop on Decentralizing the Semantic Web 2017. CEUR-WS.
Verborgh R, Vander Sande M, Hartig O, Van Herwegen J, De Vocht L, De Meester B, Haesendonck G, Colpaert P. 2016. Triple Pattern Fragments: a low-cost knowledge graph interface for the Web. Journal of Web Semantics 37–38:184–206 DOI 10.1016/j.websem.2016.03.003.
Vrandečić D, Krötzsch M. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85 DOI 10.1145/2629489.
Werbrouck J, Taelman R, Verborgh R, Pauwels P, Beetz J, Mannens E. 2020. Pattern-based access control in a decentralised collaboration environment. In: Proceedings of the 8th Linked Data in Architecture and Construction Workshop. CEUR-WS.
work_2dimm6xerfcsnc7w5ji6qhdb7y ----

2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018)

Design and Implementation of Music Recommendation System Based on Hadoop

Zhao Yufeng
School of Computer Science and Engineering, Xi'an University of Technology, Shaanxi, Xi'an, China
e-mail: zyfzy99@163.com

Li Xinwei
School of Computer Science and Engineering, Xi'an University of Technology, Shaanxi, Xi'an, China
e-mail: 604013466@qq.com

Abstract—To address the information overload that music services face against a big-data background, this paper studies the design of a distributed music recommendation system based on Hadoop. The proposed algorithm is built on the MapReduce distributed computing framework, which offers high scalability and performance and can be applied efficiently to the computation and analysis of offline data. The music recommendation system designed in this paper also includes a client, a server interface, a database and ETL operations, forming a complete recommendation pipeline from the user-facing client through the server to the data computation.
To improve the accuracy of the recommendation algorithm, this paper introduces the k-means clustering algorithm to enhance user-based collaborative filtering. The experimental results show that the accuracy of the proposed algorithm improves significantly after the introduction of k-means.

Keywords-Music Recommendation; K-means Clustering; Collaborative Filtering; Recommendation Algorithm; Hadoop

I. INTRODUCTION

With the development of the mobile Internet, the amount of data generated by mobile apps has increased rapidly in recent years. On July 27, 2017, Trustdata, a well-known mobile big-data monitoring platform in China, released the "Analysis Report of China Mobile Internet Development in the First Half of 2017" [1]. According to the report, mobile music is a high-frequency application whose user numbers grew steadily in the first half of 2017, with a peak DAU (daily active users) of nearly 150 million. Taking NetEase Cloud Music as an example, since its client app was launched in April 2013 it has reached 300 million users and 400 million song lists. Such a huge amount of data makes traditional single-server storage and processing increasingly inadequate. Moreover, it is more and more difficult for users to find the songs they like in such a mass of data: only a few songs are true favorites, and finding them one by one is very laborious. In the past, when users wanted to listen to certain songs they searched for them with a search engine, but this only finds songs they already know; many songs that users do not know, but would probably like very much, are never heard. If a system pushes suitable songs to users directly, users spend less time searching and their stickiness to the system increases. For these reasons, this paper uses Hadoop, a big-data computing and storage framework, to store and process the data, solving the song-discovery problem and also improving user activity and stickiness.

The Hadoop-based recommendation system used in this paper has the following important implications in today's Internet context:
1) It effectively alleviates "information overload" by exploring the relationship between users and songs and providing users with content of interest;
2) Hadoop's parallel computing and distributed storage have good scalability and can effectively handle massive data storage and computation;
3) For the enterprise, a system with recommendation functionality enhances the user experience and increases user activity and stickiness.

II. ALGORITHM BASIS

A. Traditional user-based collaborative filtering recommendation algorithm

The traditional user-based collaborative filtering algorithm finds the users whose interests are most similar to those of the target user and then recommends songs that the target user has not yet heard but these similar users have. Specifically, a user-similarity measure is used to find the most similar users for each user; the listening records of these similar users are aggregated, the songs the target user has not heard are extracted from them, and each such song is scored according to the similarities of the users who listened to it; the songs are then ranked by this score and recommended to the target user [2]. The algorithm proceeds as follows:
1) Construct the user-song matrix.
Each row vector represents a user and each column a song. A matrix entry indicates whether the user has heard the song: 0 means not heard and 1 means heard, so it is a 0-1 matrix.
2) Generate the nearest-neighbor set. Based on the user-song matrix, a similarity measure is used to compute the similarity between users and thus find the set of users closest to the target user. Formula (1) [2] computes the similarity of two users:

w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1 + |N(i)|)}}{\sqrt{|N(u)| \, |N(v)|}}    (1)

where N(u) denotes the set of songs user u has listened to and N(i) the set of users who have listened to song i.
3) Produce recommendations. The recommendation value of a song that a similar user has heard but the target user has not is determined by that user's similarity to the target user. When several similar users have heard the same song, its recommendation value is the sum of their similarities. Songs not yet heard by the target user are then sorted by recommendation value, and songs with high values are pushed to the target user first. Formula (2) [2] computes user u's predicted preference for song i:

p(u, i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv} \, r_{vi}    (2)

where S(u, K) is the set of the K users most similar to u and r_{vi} is user v's rating of song i (1 in the 0-1 matrix).

B. Introducing the k-means algorithm to optimize the traditional recommendation algorithm

Clustering is an unsupervised learning method that groups data with similar attributes without manual labeling: data within the same group are similar to each other, while data in different groups differ. The improved collaborative filtering algorithm in this paper operates on groups of highly similar users, which gives it a better recommendation effect; the similarity computation therefore directly determines the clustering quality and, in turn, the final recommendation result. The principle is as follows. Suppose there is a set of m users, denoted U = (U1, U2, U3, ..., Um), where each user Ux has n attributes, denoted Cx = (Cx1, Cx2, Cx3, ..., Cxn); clustering compares the users attribute by attribute over the set U and divides them into groups of similar users [3]. The core idea of k-means is to divide the users into k groups: each group has a cluster center, the distance of every data point to each center is computed, and the point is assigned to the group whose center is nearest.

C. Improvements to the k-means clustering algorithm

1) Removal of free points [3]. Among all data points, free points are those far away from all other points; their presence shifts the center of the cluster they are assigned to and degrades the clustering. Free points are removed as follows. Let the total number of users be m; the number of user pairs (paths between users) is given by formula (3):

L = \frac{m(m-1)}{2}    (3)

The total distance over all user pairs is then:

D = \frac{1}{2} \sum_{i=1}^{m} \sum_{j \neq i} gap(C_i, C_j)    (4)

gap(C_i, C_j) = (C_{i1} - C_{j1})^2 + (C_{i2} - C_{j2})^2 + \cdots + (C_{in} - C_{jn})^2    (5)

In formula (5), C_{i1}, C_{j1}, ..., C_{in}, C_{jn} are the n attributes of users C_i and C_j.
The average distance is obtained from L and D by formula (6):

EMV = \frac{D}{L}    (6)

EMV is the average of all pairwise user distances. For each user U_i, compute its distance L_u = gap(U_i, U_j) to every other user U_j. If all of these distances satisfy L_u >= EMV, the user is classified as a free point and placed in a separate category. Because the free points are few, they cannot form a useful category for collaborative filtering on their own, so each free point is finally assigned to the cluster whose center is nearest to it.

2) The random selection of initial centers also affects the clustering result and can trap the algorithm in a local optimum. To address this, this paper adopts an improved clustering scheme: bisecting (dichotomous) k-means [4]. The idea is to first treat all points as one cluster and split it in two (k-means with k = 2), and then repeatedly select the cluster whose split most reduces the clustering cost function and split it in two again, until the number of clusters equals k. The cluster cost function is defined as the within-cluster sum of squared errors, as shown in formula (7); the cluster with the largest squared-error sum, whose points lie farthest from their center, is the one that is split again.

E_i = \sum_{p \in C_i} \| p - c_i \|^2    (7)

where C_i is the i-th cluster and c_i its center.

III. RECOMMENDATION SYSTEM DESIGN

This section describes the implementation and testing of the entire system and the frameworks behind each of its features. It first introduces the top-level design, then analyzes the overall framework of the system, the technologies required and the overall process, and finally evaluates the recommendation results in terms of precision and recall.

A. Recommendation process

The k-means-clustering-based collaborative filtering algorithm proposed in the previous section consists of two main stages: first, k-means is used to cluster the users; second, user-based collaborative filtering is run to generate the recommendation results.

Figure 1. Distributed recommendation algorithm flow (user song records and common song tags feed user clustering, followed by collaborative filtering, user similarity computation and generation of the recommendation results)

The parallelized flow of the algorithm is shown in Figure 1. The clustering stage consists of three steps. The first step creates the user-tag model: a MapReduce job combines the user log table with the list of common song tags, using the tag file as a cache file; for every song in a user's listening record, the counts at the corresponding positions of the user's tag vector are incremented, producing a user-tag matrix. The second step uses the k-means algorithm to compute the cluster centers of the user-tag matrix; after several iterations, a relatively stable center for each cluster is obtained. The third step reads the center-point file as a cache and classifies the users by computing which center each user is closest to and placing the user in that cluster. User-based collaborative filtering then produces recommendations for the users within each cluster. This stage is itself divided into several steps, including counting how often each song has been listened to, counting how many songs each user has listened to, computing user similarities, and generating the recommendation results.
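The following is a minimal single-machine sketch of the core computations described above: the free-point filter of formulas (3)-(6) and the user-based collaborative filtering of formulas (1)-(2). It only illustrates the logic that the MapReduce jobs of Section IV parallelize; the in-memory data layout, the function names and the use of squared coordinate differences as the distance are assumptions made for the sketch.

import math
from collections import defaultdict
from itertools import combinations

def remove_free_points(vectors):
    # Free-point filter, formulas (3)-(6).  vectors: dict user -> tag-count vector.
    users = list(vectors)
    pairs = len(users) * (len(users) - 1) / 2           # L, formula (3)
    def gap(a, b):                                      # formula (5); squared
        return sum((x - y) ** 2 for x, y in zip(a, b))  # differences are assumed
    total = sum(gap(vectors[u], vectors[v])
                for u, v in combinations(users, 2))     # D, formula (4)
    emv = total / pairs                                 # formula (6)
    free = [u for u in users
            if all(gap(vectors[u], vectors[v]) >= emv for v in users if v != u)]
    return [u for u in users if u not in free], free

def user_similarity(listened):
    # Formula (1).  listened: dict user -> set of song ids.  Co-listened songs
    # are down-weighted by popularity, then normalised by sqrt(|N(u)||N(v)|).
    song_users = defaultdict(set)
    for u, songs in listened.items():
        for s in songs:
            song_users[s].add(u)
    acc = defaultdict(float)
    for s, us in song_users.items():
        w = 1.0 / math.log(1 + len(us))
        for u, v in combinations(sorted(us), 2):
            acc[(u, v)] += w
    sim = defaultdict(dict)
    for (u, v), c in acc.items():
        value = c / math.sqrt(len(listened[u]) * len(listened[v]))
        sim[u][v] = sim[v][u] = value
    return sim

def recommend(u, listened, sim, k=10, top_n=5):
    # Formula (2): score unheard songs by the summed similarity of the K most
    # similar users who listened to them (r_vi = 1 in the 0-1 matrix).
    scores = defaultdict(float)
    neighbours = sorted(sim.get(u, {}).items(), key=lambda x: -x[1])[:k]
    for v, w_uv in neighbours:
        for song in listened[v] - listened[u]:
            scores[song] += w_uv
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]

In the full system, remove_free_points and the k-means step are applied to the user-tag matrix first, and user_similarity and recommend are then run inside each resulting cluster.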
B. Recommendation system architecture design

The recommendation system is divided into a recommendation algorithm layer, a server side and a client. The algorithm layer uses a Hadoop cluster for distributed recommendation, the server side is developed with Java Servlets and a MySQL database, and the client side uses Android for display [5]. Log collection is performed with the ETL tool Sqoop. The overall framework is shown in Figure 2. As the figure shows, the system consists of four major parts: the top-level client, the server, the database and the Hadoop layer. The client uses the Android system for display, the server is developed with Java EE, and the database is MySQL. MySQL is the intermediate data link between the client and the recommendation system: data is stored in the database so that the front-end server and Hadoop are decoupled. Data transfer between the database and Hadoop is handled by Sqoop, a Hadoop-based tool for bulk data transfer between relational databases and HDFS; it requires Hadoop to run and also speeds up the transfer. The collected data is stored in HDFS. The Hadoop layer has two parts: HDFS for data storage and MapReduce for distributed computation. MapReduce reads the data from HDFS, performs the computation, and writes the results back to HDFS. Sqoop then reads the results from HDFS and transfers them to MySQL; the client requests the server at irregular intervals, the server reads data from the database and returns it to the client, and the user's operations on songs are fed back to the server and saved to the database [6].

Figure 2. Recommendation system architecture (Android client, web service, MySQL, Sqoop, HDFS and MapReduce)

C. Recommendation system function module analysis

Since the system recommends different songs to different users, users must log in with their own user names, so the system must also provide user registration. To listen to different songs, users must be able to search for songs and add tags to them. Because the system does not actually play songs, it also needs a collection or favorites feature [7]. The server side provides interfaces for these requirements, and the database contains corresponding tables to store the content. The resulting function diagram of the system is shown in Figure 3. The system functionality is decomposed into four modules. The client offers user registration, login, song search, song tagging and play; "play" actually refers to collecting or liking a song, since copyright issues prevent real playback. These functions correspond to server-side interfaces that provide the data. The data is read from the database, which is designed with three tables: a user table used for registration and login; a table recording users' listening to songs, which holds the largest amount of data; and a song table storing each song's basic information, including tag information, to support searching songs and adding tags [8]. An illustrative schema for these tables is sketched after Figure 3.

Figure 3. Recommendation system function block diagram (client functions: register, login, search, add tag, play; server interfaces: registration, login, search, add; MySQL tables: user table, listening record table, song list; Hadoop: k-means clustering and collaborative filtering; Sqoop for transfer)
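As a concrete illustration of the three tables just described, the sketch below creates them with SQLite (the deployed system uses MySQL; SQLite merely keeps the example self-contained and runnable). All column names are assumptions, since the paper does not list the exact fields.

import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for the MySQL instance
conn.executescript("""
CREATE TABLE user (                       -- registration and login
    user_id     INTEGER PRIMARY KEY,
    username    TEXT UNIQUE NOT NULL,
    password    TEXT NOT NULL
);
CREATE TABLE song (                       -- basic song information, incl. tags
    song_id     INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    tags        TEXT                      -- comma-separated tag list
);
CREATE TABLE listen_record (              -- largest table: one row per play/favorite
    user_id     INTEGER REFERENCES user(user_id),
    song_id     INTEGER REFERENCES song(song_id),
    listened_at TEXT                      -- timestamp of the interaction
);
""")
conn.execute("INSERT INTO user (username, password) VALUES (?, ?)", ("demo_user", "secret"))
conn.commit()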
IV. DESIGN OF THE HADOOP-BASED RECOMMENDATION SYSTEM

The parallelization is based on the k-means clustering algorithm introduced in the previous section and on the user-based collaborative filtering algorithm, both implemented on the distributed system. The distributed recommendation algorithm is divided into two modules, a distributed k-means clustering algorithm and a distributed collaborative filtering algorithm, which are finally connected to form the complete distributed recommendation workflow [9].

A. Parallel design of k-means clustering

The input of the k-means algorithm is a matrix; since users are being clustered, a user-tag matrix is first formed by data preprocessing. The user listening records are large files and must be processed in parallel. The listening records are loaded into HDFS and each record is extended with the song's tag, producing records of the form (user id, song id, listening time, tag). A single MapReduce step then builds the user-tag matrix in parallel, and the matrix is clustered. The parallel clustering algorithm is designed as follows: the first step scans all original points and randomly selects k points as the initial cluster centers; the second step computes the distance of every point to each center and assigns each point to the cluster of its closest center; the third step repeats the second step until the termination condition is met [10]; the fourth step processes all user points and assigns each user to its cluster. The architecture of the algorithm is shown in Figure 4.

Figure 4. Distributed k-means clustering algorithm architecture (data sources userlog.txt and songtags.txt in HDFS; MapReduce jobs: UTMatrixMapper/UTMatrixReduce build the user-tag matrix, KMeansMapper/KMeansCombiner/KMeansReducer with KMeansDriver iterate the cluster centers, and KMeansClusterMapper assigns users to clusters)

B. Parallel design of the user-based collaborative filtering recommendation algorithm

Following formulas (1) and (2) in the first section and the rules of Hadoop distributed design, the recommendation algorithm is organized into the following steps: the first step counts how many songs each user listens to; the second step counts the number of times each song is heard; the third step computes the similarity of every pair of users, where only pairs of users who have heard at least one common song need to be computed; the fourth step computes each user's recommendation value for each song to form a recommendation list; the final step sums the recommendation values a user has for the same song. These steps are implemented with MapReduce, and the complete workflow requires several map and reduce phases. Figure 5 shows the MapReduce architecture of the user-based collaborative filtering algorithm; a streaming-style sketch of one of the counting steps follows the figure caption.

Figure 5. Distributed architecture based on the user collaborative filtering algorithm (MapReduce jobs: UserListenCountMapper/Reduce, SongListenedCountMapper/Reduce, UserSimilarityMapper/Reduce, UserCommendSimMapper, UserCommendLogMapper, UserCommendReducer and UserSongValueMapper/Reducer, producing the user-song interest values and the final recommendation results from userlog.txt in HDFS)
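To make the MapReduce design concrete, the sketch below reformulates one step, counting how many times each song has been listened to, in Hadoop Streaming style with Python. The paper's own jobs are Java MapReduce classes (e.g., SongListenedCountMapper/Reduce in Figure 5); the record layout (user id, song id, listening time, tag) follows Section IV-A, while tab-separated fields and the exact command line are assumptions of this sketch.

#!/usr/bin/env python3
# Hadoop Streaming sketch of the "times each song was listened to" step.
# Typical invocation (assumed): hadoop jar hadoop-streaming.jar -input userlog.txt
#   -output counts -mapper "count_listens.py map" -reducer "count_listens.py reduce"
import sys
from itertools import groupby

def run_mapper(stream):
    for line in stream:
        fields = line.rstrip("\n").split("\t")     # (user id, song id, time, tag)
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")               # key = song id, value = 1

def run_reducer(stream):
    # Hadoop sorts mapper output by key, so equal song ids arrive grouped.
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for song_id, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{song_id}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        run_reducer(sys.stdin)
    else:
        run_mapper(sys.stdin)

The other counting and similarity steps follow the same key-value pattern, with the pairwise similarity job emitting (user pair, partial weight) records that a reducer sums into formula (1).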
V. EXPERIMENTS AND EVALUATION

On the same data set, repeated experiments were run on the traditional user-based collaborative filtering algorithm, the recommendation algorithm with K-means clustering, and the recommendation algorithm with the improved clustering, and stable results were selected for analysis [11].

A. Experimental environment and data set

The environment comprises the Hadoop cluster, Sqoop, the development environment and the web server. The cluster is built with VMware virtual machines and uses three nodes, one master and two slaves.
• Host configuration, hardware environment: CPU Intel i5-4590, quad-core, 3.30 GHz; RAM 8 GB;
• Software environment: OS CentOS 7, Java environment JDK 1.8, server Tomcat 8.0, Hadoop 2.7.4;
• Development environment: Eclipse, Windows 10, Hadoop plugin;
• Music data used in this experiment come from the network and include more than 50,000 users, more than 1,700,000 user operations, and 730 tags.

B. Results display

Figure 6 shows a screenshot of the recommendation results on an Android phone.

Figure 6. Android mobile music recommendation results

C. Evaluation indices

Precision and recall [12] are used as evaluation indices. Each user's song records are sorted by date in descending order, with the top 80% used as the training set and the remaining 20% as the test set for the evaluation.

Precision = (number of correct recommendations / number of recommendations) × 100%    (8)

Recall = (number of correct recommendations / number of songs the user likes) × 100%    (9)

Formulas (8) and (9) define precision and recall, respectively; a small helper that computes them is shown at the end of this section.

D. Line charts comparing the three algorithms

After repeated experiments, the change of precision with the K value for the three algorithms is shown in the line chart of Figure 7. As Figure 7 shows, the precision after introducing the K-means clustering algorithm is better than that of the traditional collaborative filtering algorithm; when K is 4 the precision increases by about 0.65%, which is the best classification. After the clustering algorithm is improved, the precision is nearly 0.15% higher than the unimproved version when K is 5.

Figure 8 shows the change of recall for the three algorithms. The recall of the K-means clustering algorithm is better than that of the traditional collaborative filtering algorithm, and when K is 4 the recall increases by about 0.65%. When K is 5, the recall increases by nearly 0.15% after the clustering algorithm is improved; but as K increases further, the improved clustering falls below the unimproved clustering algorithm. Since the number of users is fixed, removing the free points affects every cluster; some clusters end up with very few users, their recommendations become inaccurate, and the overall effect of the recommendation algorithm suffers.

Figure 7. The change of precision with K value

Figure 8. The change of recall with K value
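Formulas (8) and (9) can be computed directly from a recommendation list and the held-out test listens; the following minimal helper (names are illustrative) shows the calculation.

def precision_recall(recommended, liked):
    # Formulas (8) and (9): precision and recall of one user's recommendation
    # list, both expressed as percentages.
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)
    precision = 100.0 * hits / len(recommended) if recommended else 0.0
    recall = 100.0 * hits / len(liked) if liked else 0.0
    return precision, recall

# Example: 2 of 5 recommended songs appear in the user's held-out test listens.
print(precision_recall(["s1", "s2", "s3", "s4", "s5"], ["s2", "s4", "s9", "s10"]))
# -> (40.0, 50.0)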
VI. CONCLUDING REMARKS

This paper presents the design and implementation of a Hadoop-based music recommendation system and improves the traditional user-based collaborative filtering algorithm, raising its precision by clustering users with song tags. Hadoop, a scalable and high-performance distributed computing platform, provides a reference for the design of a music recommendation system against a big-data background. The k-means clustering and collaborative filtering recommendation algorithm is designed on the MapReduce distributed framework, which offers some reference value for the distributed design of recommendation algorithms.

REFERENCES
[1] Trustdata. China Mobile Internet Development Analysis Report for the First Half of 2017 [EB/OL]. (2017-07-30). http://itrustdata.com/#service
[2] Xiang Liang. Recommendation System Practice [M]. Beijing: People's Posts and Telecommunications Press, 2012: 1-3; Zhang Xin-sheng, Zhang Hai-ying, Mao Qian. Hadoop [EB/OL]. (2016-12-19). https://baike.baidu.com/item/Hadoop/3526507
[3] Wu Hongchen, Wang Xinjun, Cheng Yong, Peng Zhaohui. Advanced Recommendation Based on Collaborative Filtering and Partition Clustering [J]. Computer Research and Development, 2011, 48(S3): 205-212.
[4] Zheng Jie. Machine Learning Algorithm Principles and Programming Practice [M]. Beijing: Electronic Industry Press, 2015: 141.
[5] Weston J, Bengio S, Hamel P, et al. Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces [J]. arXiv: Learning, 2011.
[6] Van den Oord A, Dieleman S, Schrauwen B. Deep content-based music recommendation [C]. Neural Information Processing Systems, 2013: 2643-2651.
[7] Su J, Chang W, Tseng V S, et al. Personalized Music Recommendation by Mining Social Media Tags [J]. Procedia Computer Science, 2013: 303-312.
[8] Davidson J, Liebald B, Liu J, et al. The YouTube video recommendation system [C]. Conference on Recommender Systems, 2010: 293-296.
[9] Feng Ya-li, Jiang Jie, Tian Feng. Research on the combined recommendation algorithm based on item and user [J]. Information Technology, 2017, (10): 69-73.
[10] Chang Xiao-yu, Yu Zheng-sheng. Point-of-interest Recommendation Algorithm Introducing Time Attenuation Item [J]. Journal of Hangzhou Dianzi University (Natural Sciences), 2016, 36(03): 42-46. DOI: 10.13954/j.cnki.hdu.2016.03.009
[11] Zhao Z, Shang M. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop [C]. Knowledge Discovery and Data Mining, 2010: 478-481.
[12] Chen Yaxi. Music recommendation system and related technologies [J]. Computer Engineering and Applications, 2012, 48(18): 9-16+47.

work_2dw3vuoprfdbdnrfirx5jj7jme ----

Modi, A., Titov, I., Demberg, V., Sayeed, A. & Pinkal, M. (2017). Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction. Transactions of the Association for Computational Linguistics, 5, 31–44. https://www.transacl.org/ojs/index.php/tacl/article/view/968
Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction

Ashutosh Modi 1,3   Ivan Titov 2,4   Vera Demberg 1,3   Asad Sayeed 1,3   Manfred Pinkal 1,3
1 {ashutosh,vera,asayeed,pinkal}@coli.uni-saarland.de   2 titov@uva.nl
3 Universität des Saarlandes, Germany   4 ILLC, University of Amsterdam, the Netherlands

Abstract

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

1 Introduction

Being able to anticipate upcoming content is a core property of human language processing (Kutas et al., 2011; Kuperberg and Jaeger, 2016) that has received a lot of attention in the psycholinguistic literature in recent years. Expectations about upcoming words help humans comprehend language in noisy settings and deal with ungrammatical input. In this paper, we use a computational model to address the question of how different layers of knowledge (linguistic knowledge as well as common-sense knowledge) influence human anticipation.

Here we focus our attention on semantic predictions of discourse referents for upcoming noun phrases. This task is particularly interesting because it allows us to separate the semantic task of anticipating an intended referent and the processing of the actual surface form. For example, in the context of I ordered a medium sirloin steak with fries. Later, the waiter brought . . . , there is a strong expectation of a specific discourse referent, i.e., the referent introduced by the object NP of the preceding sentence, while the possible referring expression could be either the steak I had ordered, the steak, our food, or it. Existing models of human prediction are usually formulated using the information-theoretic concept of surprisal. In recent work, however, surprisal is usually not computed for DRs, which represent the relevant semantic unit, but for the surface form of the referring expressions, even though there is an increasing amount of literature suggesting that human expectations at different levels of representation have separable effects on prediction and, as a consequence, that the modelling of only one level (the linguistic surface form) is insufficient (Kuperberg and Jaeger, 2016; Kuperberg, 2016; Zarcone et al., 2016).
The present model addresses this shortcoming by explicitly modelling and representing common-sense knowledge and conceptually separating the semantic (discourse referent) and the surface level (referring expression) expectations.

Our discourse referent prediction task is related to the NLP task of coreference resolution, but it substantially differs from that task in the following ways: 1) we use only the incrementally available left context, while coreference resolution uses the full text; 2) coreference resolution tries to identify the DR for a given target NP in context, while we look at the expectations of DRs based only on the context before the target NP is seen.

The distinction between referent prediction and prediction of referring expressions also allows us to study a closely related question in natural language generation: the choice of a type of referring expression based on the predictability of the DR that is intended by the speaker. This part of our work is inspired by a referent guessing experiment by Tily and Piantadosi (2009), who showed that highly predictable referents were more likely to be realized with a pronoun than unpredictable referents, which were more likely to be realized using a full NP. The effect they observe is consistent with a Gricean point of view, or the principle of uniform information density (see Section 5.1). However, Tily and Piantadosi do not provide a computational model for estimating referent predictability. Also, they do not include selectional preference or common-sense knowledge effects in their analysis.

We believe that script knowledge, i.e., common-sense knowledge about everyday event sequences, represents a good starting point for modelling conversational anticipation. This type of common-sense knowledge includes temporal structure which is particularly relevant for anticipation in continuous language processing. Furthermore, our approach can build on progress that has been made in recent years in methods for acquiring large-scale script knowledge; see Section 1.1. Our hypothesis is that script knowledge may be a significant factor in human anticipation of discourse referents. Explicitly modelling this knowledge will thus allow us to produce more human-like predictions.

Script knowledge enables our model to generate anticipations about discourse referents that have already been mentioned in the text, as well as anticipations about textually new discourse referents which have been activated due to script knowledge. By modelling event sequences and event participants, our model captures many more long-range dependencies than normal language models are able to. As an example, consider the following two alternative text passages:

We got seated, and had to wait for 20 minutes. Then, the waiter brought the ...
We ordered, and had to wait for 20 minutes. Then, the waiter brought the ...

Preferred candidate referents for the object position of the waiter brought the ... are instances of the food, menu, or bill participant types. In the context of the alternative preceding sentences, there is a strong expectation of instances of a menu and a food participant, respectively.
This paper represents foundational research investigating human language processing. However, it also has the potential for application in assistant technology and embodied agents. The goal is to achieve human-level language comprehension in realistic settings, and in particular to achieve robustness in the face of errors or noise. Explicitly modelling expectations that are driven by common-sense knowledge is an important step in this direction.

In order to be able to investigate the influence of script knowledge on discourse referent expectations, we use a corpus that contains frequent reference to script knowledge, and provides annotations for coreference information, script events and participants (Section 2). In Section 3, we present a large-scale experiment for empirically assessing human expectations on upcoming referents, which allows us to quantify at what points in a text humans have very clear anticipations vs. when they do not. Our goal is to model human expectations, even if they turn out to be incorrect in a specific instance. The experiment was conducted via Mechanical Turk and follows the methodology of Tily and Piantadosi (2009). In Section 4, we describe our computational model that represents script knowledge. The model is trained on the gold standard annotations of the corpus, because we assume that human comprehenders usually will have an analysis of the preceding discourse which closely corresponds to the gold standard. We compare the prediction accuracy of this model to human predictions, as well as to two baseline models in Section 4.3. One of them uses only structural linguistic features for predicting referents; the other uses general script-independent selectional preference features. In Section 5, we test whether surprisal (as estimated from human guesses vs. computational models) can predict the type of referring expression used in the original texts in the corpus (pronoun vs. full referring expression). This experiment also has wider implications with respect to the on-going discussion of whether the referring expression choice is dependent on predictability, as predicted by the uniform information density hypothesis.

(I)(1)P bather [decided]E wash to take a (bath)(2)P bath yesterday afternoon after working out . Once (I)(1)P bather got back home , (I)(1)P bather [walked]E enter bathroom to (my)(1)P bather (bathroom)(3)P bathroom and first quickly scrubbed the (bathroom tub)(4)P bathtub by [turning on]E turn water on the (water)(5)P water and rinsing (it)(4)P bathtub clean with a rag . After (I)(1)P bather finished , (I)(1)P bather [plugged]E close drain the (tub)(4)P bathtub and began [filling]E fill water (it)(4)P bathtub with warm (water)(5)P water set at about 98 (degrees)(6)P temperature .

Figure 1: An excerpt from a story in the InScript corpus. The referring expressions are in parentheses, and the corresponding discourse referent label is given by the superscript. Referring expressions of the same discourse referent have the same color and superscript number. Script-relevant events are in square brackets and colored in orange. Event type is indicated by the corresponding subscript.

The contributions of this paper consist of:
• a large dataset of human expectations, in a variety of texts related to every-day activities.
• an implementation of the conceptual distinction between the semantic level of referent prediction and the type of a referring expression.
• a computational model which significantly improves modelling of human anticipations.
• showing that script knowledge is a significant factor in human expectations.
• testing the hypothesis of Tily and Piantadosi that the choice of the type of referring expression (pronoun or full NP) depends on the predictability of the referent.

1.1 Scripts

Scripts represent knowledge about typical event sequences (Schank and Abelson, 1977), for example the sequence of events happening when eating at a restaurant. Script knowledge thereby includes events like order, bring and eat as well as participants of those events, e.g., menu, waiter, food, guest. Existing methods for acquiring script knowledge are based on extracting narrative chains from text (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015; Modi, 2016; Ahrendt and Demberg, 2016) or by eliciting script knowledge via crowdsourcing on Mechanical Turk (Regneri et al., 2010; Frermann et al., 2014; Modi and Titov, 2014).

Modelling anticipated events and participants is motivated by evidence showing that event representations in humans contain information not only about the current event, but also about previous and future states, that is, humans generate anticipations about event sequences during normal language comprehension (Schütz-Bosbach and Prinz, 2007). Script knowledge representations have been shown to be useful in NLP applications for ambiguity resolution during reference resolution (Rahman and Ng, 2012).

2 Data: The InScript Corpus

Ordinary texts, including narratives, encode script structure in a way that is too complex and too implicit at the same time to enable a systematic study of script-based expectation. They contain interleaved references to many different scripts, and they usually refer to single scripts in a point-wise fashion only, relying on the ability of the reader to infer the full event chain using their background knowledge. We use the InScript corpus (Modi et al., 2016) to study the predictive effect of script knowledge. InScript is a crowdsourced corpus of simple narrative texts. Participants were asked to write about a specific activity (e.g., a restaurant visit, a bus ride, or a grocery shopping event) which they personally experienced, and they were instructed to tell the story as if explaining the activity to a child. This resulted in stories that are centered around a specific scenario and that explicitly mention mundane details. Thus, they generally realize longer event chains associated with a single script, which makes them particularly appropriate to our purpose.

The InScript corpus is labelled with event-type, participant-type, and coreference information. Full verbs are labeled with event type information, heads of all noun phrases with participant types, using scenario-specific lists of event types (such as enter bathroom, close drain and fill water for the "taking a bath" scenario) and participant types (such as bather, water and bathtub). On average, each template offers a choice of 20 event types and 18 participant types.
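To make the annotation layers concrete, the labels from the Figure 1 excerpt can be held in a small in-memory structure like the one below. The record layout is only an illustrative assumption and not the corpus' actual release format.

# Illustrative view of the InScript labels for the opening of the Figure 1 story;
# the dictionary layout is an assumption, not the corpus' actual file format.
story = {
    "scenario": "taking a bath",
    "mentions": [   # referring expressions with discourse referent id and participant type
        {"text": "I",        "dr": 1, "participant": "bather"},
        {"text": "bath",     "dr": 2, "participant": "bath"},
        {"text": "bathroom", "dr": 3, "participant": "bathroom"},
    ],
    "events": [     # script-relevant verbs with their event types
        {"text": "decided", "event": "wash"},
        {"text": "walked",  "event": "enter_bathroom"},
    ],
}

# Coreference chains fall out of the shared discourse referent ids:
chains = {}
for mention in story["mentions"]:
    chains.setdefault(mention["dr"], []).append(mention["text"])
print(chains)   # {1: ['I'], 2: ['bath'], 3: ['bathroom']}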
(I)(1) decided to take a (bath)(2) yesterday afternoon after working out . Once (I)(1) got back home , (I)(1) walked to (my)(1) (bathroom)(3) and first quickly scrubbed the (bathroom tub)(4) by turning on the (water)(5) and rinsing (it)(4) clean with a rag . After (I)(1) finished , (I)(1) plugged XXXXXX

Figure 2: An illustration of the Mechanical Turk experiment for the referent cloze task. Workers are supposed to guess the upcoming referent (indicated by XXXXXX above). They can either choose from the previously activated referents, or they can write something new.

Figure 3: Response of workers corresponding to the story in Fig. 2 (number of workers per option: 14 for DR_4 (P_bathtub), 5 for a new DR "the drain", 1 for DR_1 (P_bather)). Workers guessed two already activated discourse referents (DR), DR_4 and DR_1. Some of the workers also chose the "new" option and wrote different lexical variants of "bathtub drain", a new DR corresponding to the participant type "the drain".

The InScript corpus consists of 910 stories addressing 10 scenarios (about 90 stories per scenario). The corpus has 200,000 words, 12,000 verb instances with event labels, and 44,000 head nouns with participant instances. Modi et al. (2016) report an inter-annotator agreement of 0.64 for event types and 0.77 for participant types (Fleiss' kappa). We use gold-standard event- and participant-type annotation to study the influence of script knowledge on the expectation of discourse referents. In addition, InScript provides coreference annotation, which makes it possible to keep track of the mentioned discourse referents at each point in the story. We use this information in the computational model of DR prediction and in the DR guessing experiment described in the next section. An example of an annotated InScript story is shown in Figure 1.

3 Referent Cloze Task

We use the InScript corpus to develop computational models for the prediction of discourse referents (DRs) and to evaluate their prediction accuracy. This can be done by testing how often our models manage to reproduce the original discourse referent (cf. also the "narrative cloze" task by Chambers and Jurafsky (2008), which tests whether a verb together with a role can be correctly guessed by a model). However, we do not only want to predict the "correct" DRs in a text but also to model human expectation of DRs in context. To empirically assess human expectation, we created an additional database of crowdsourced human predictions of discourse referents in context using Amazon Mechanical Turk. The design of our experiment closely resembles the guessing game of Tily and Piantadosi (2009) but extends it in a substantial way.

Workers had to read stories of the InScript corpus (1) and guess upcoming participants: for each target NP, workers were shown the story up to this NP excluding the NP itself, and they were asked to guess the next person or object most likely to be referred to. In case they decided in favour of a discourse referent already mentioned, they had to choose among the available discourse referents by clicking an NP in the preceding text, i.e., some noun with a specific, coreference-indicating color; see Figure 2. Otherwise, they would click the "New" button, and would in turn be asked to give a short description of the new person or object they expected to be mentioned. The percentage of guesses that agree with the actually referred entity was taken as a basis for estimating the surprisal. The experiment was done for all stories of the test set: 182 stories (20%) of the InScript corpus, evenly taken from all scenarios. Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only in order to ensure that a minimum of context information was available. To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and "New" in 32% of cases.

(1) The corpus is available at: http://www.sfb1102.uni-saarland.de/?page_id=2582
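The paper states only that the share of matching guesses is the basis of the surprisal estimate. As a hedged illustration, the sketch below turns the 20 guesses collected for one target NP into such an estimate, using add-one smoothing over the candidate set as one simple (assumed) choice; the counts are those reported for the example in Figures 2 and 3.

import math

def referent_surprisal(guess_counts, gold_dr, n_candidates):
    # Empirical probability of the discourse referent that actually occurs,
    # smoothed with add-one over the candidate set (an assumption; the paper
    # does not spell out its exact estimator), converted to surprisal in bits.
    total = sum(guess_counts.values())
    p = (guess_counts.get(gold_dr, 0) + 1) / (total + n_candidates)
    return -math.log2(p)

# 14 of the 20 workers guessed DR_4, 5 proposed a new "drain" referent and
# 1 guessed DR_1 (the distribution reported for the Figure 2/3 example):
guesses = {"DR_4": 14, "new:drain": 5, "DR_1": 1}
print(referent_surprisal(guesses, "DR_4", n_candidates=10))   # = 1.0 bit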
Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only in order to ensure that a minimum of context information was available. To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and “New” in 32% of cases.

¹The corpus is available at: http://www.sfb1102.uni-saarland.de/?page_id=2582

Our leading hypothesis is that script knowledge substantially influences human expectation of discourse referents. The guessing experiment provides a basis to estimate human expectation of already mentioned DRs (the number of clicks on the respective NPs in text). However, we expect that script knowledge has a particularly strong influence in the case of first mentions. Once a script is evoked in a text, we assume that the full script structure, including all participants, is activated and available to the reader.

Tily and Piantadosi (2009) are interested in second mentions only and therefore do not make use of the worker-generated noun phrases classified as “New”. To study the effect of activated but not explicitly mentioned participants, we carried out a subsequent annotation step on the worker-generated noun phrases classified as “New”. We presented annotators with these noun phrases in their contexts (with co-referring NPs marked by color, as in the M-Turk experiment) and, in addition, displayed all participant types of the relevant script (i.e., the script associated with the text in the InScript corpus). Annotators did not see the “correct” target NP. We asked annotators to either (1) select the participant type instantiated by the NP (if any), (2) label the NP as unrelated to the script, or (3) link the NP to an overt antecedent in the text, in the case that the NP is actually a second mention that had been erroneously labeled as new by the worker. Option (1) provides a basis for a fine-grained estimation of first-mention DRs. Option (3), which we added when we noticed the considerable number of overlooked antecedents, serves as correction of the results of the M-Turk experiment. Out of the 22K annotated “New” cases, 39% were identified as second mentions, 55% were linked to a participant type, and 6% were classified as really novel.

4 Referent Prediction Model

In this section, we describe the model we use to predict upcoming discourse referents (DRs).

4.1 Model

Our model should not only assign probabilities to DRs already explicitly introduced in the preceding text fragment (e.g., “bath” or “bathroom” for the cloze task in Figure 2) but also reserve some probability mass for ‘new’ DRs, i.e., DRs activated via the script context or completely novel ones not belonging to the script. In principle, different variants of the activation mechanism must be distinguished. For many participant types, a single participant belonging to a specific semantic class is expected (referred to with the bathtub or the soap). In contrast, the “towel” participant type may activate a set of objects, elements of which then can be referred to with a towel or another towel.
The “bath means” participant type may even activate a group of DRs belonging to different semantic classes (e.g., bubble bath and salts). Since it is not feasible to enumerate all potential participants, for ‘new’ DRs we only predict their participant type (“bath means” in our example). In other words, the number of categories in our model is equal to the number of previously introduced DRs plus the number of participant types of the script plus 1, reserved for a new DR not corresponding to any script participant (e.g., cellphone). In what follows, we slightly abuse the terminology and refer to all these categories as discourse referents.

Unlike standard co-reference models, which predict co-reference chains relying on the entire document, our model is incremental, that is, when predicting a discourse referent d^{(t)} at a given position t, it can look only in the history h^{(t)} (i.e., the preceding part of the document), excluding the referring expression (RE) for the predicted DR. We also assume that past REs are correctly resolved and assigned to correct participant types (PTs). Typical NLP applications use automatic coreference resolution systems, but since we want to model human behavior, this might be inappropriate, since an automated system would underestimate human performance. This may be a strong assumption, but for reasons explained above, we use gold standard past REs.

We use the following log-linear model (“softmax regression”):

p(d^{(t)} = d \mid h^{(t)}) = \frac{\exp(\mathbf{w}^\top \mathbf{f}(d, h^{(t)}))}{\sum_{d'} \exp(\mathbf{w}^\top \mathbf{f}(d', h^{(t)}))},

where f is the feature function we will discuss in the following subsection, w are model parameters, and the summation in the denominator is over the set of categories described above.

Some of the features included in f are a function of the predicate syntactically governing the unobservable target RE (corresponding to the DR being predicted). However, in our incremental setting, the predicate is not available in the history h^{(t)} for subject NPs. In this case, we use an additional probabilistic model, which estimates the probability of the predicate v given the context h^{(t)}, and marginalize out its predictions:

p(d^{(t)} = d \mid h^{(t)}) = \sum_{v} p(v \mid h^{(t)}) \frac{\exp(\mathbf{w}^\top \mathbf{f}(d, h^{(t)}, v))}{\sum_{d'} \exp(\mathbf{w}^\top \mathbf{f}(d', h^{(t)}, v))}

The predicate probabilities p(v | h^{(t)}) are computed based on the sequence of preceding predicates (i.e., ignoring any other words) using the recurrent neural network language model estimated on our training set.² The expression f(d, h^{(t)}, v) denotes the feature function computed for the referent d, given the history composed of h^{(t)} and the predicate v.

²We used the RNNLM toolkit (Mikolov et al., 2011; Mikolov et al., 2010) with default settings.

Table 1: Summary of feature types

  Feature                    Type
  Recency                    Shallow Linguistic
  Frequency                  Shallow Linguistic
  Grammatical function       Shallow Linguistic
  Previous subject           Shallow Linguistic
  Previous object            Shallow Linguistic
  Previous RE type           Shallow Linguistic
  Selectional preferences    Linguistic
  Participant type fit       Script
  Predicate schemas          Script

4.2 Features

Our features encode properties of a DR as well as characterize its compatibility with the context. We face two challenges when designing our features. First, although the sizes of our datasets are respectable from the script annotation perspective, they are too small to learn a richly parameterized model. For many of our features, we address this challenge by using external word embeddings³ and associate parameters with some simple similarity measures computed using these embeddings. Consequently, there are only a few dozen parameters which need to be estimated from scenario-specific data. Second, in order to test our hypothesis that script information is beneficial for the DR prediction task, we need to disentangle the influence of script information from general linguistic knowledge. We address this by carefully splitting the features apart, even if it prevents us from modeling some interplay between the sources of information. We will describe both classes of features below; also see a summary in Table 1.

³We use 300-dimensional word embeddings estimated on Wikipedia with the skip-gram model of Mikolov et al. (2013): https://code.google.com/p/word2vec/

4.2.1 Shallow Linguistic Features

These features are based on Tily and Piantadosi (2009). In addition, we consider a selectional preference feature.

Recency feature. This feature captures the distance l_t(d) between the position t and the last occurrence of the candidate DR d. As a distance measure, we use the number of sentences from the last mention and exponentiate this number to make the dependence more extreme; only very recent DRs will receive a noticeable weight: \exp(-l_t(d)). This feature is set to 0 for new DRs.

Frequency. The frequency feature indicates the number of times the candidate discourse referent d has been mentioned so far. We do not perform any bucketing.

Grammatical function. This feature encodes the dependency relation assigned to the head word of the last mention of the DR or a special none label if the DR is new.

Previous subject indicator. This binary feature indicates whether the candidate DR d is coreferential with the subject of the previous verbal predicate.

Previous object indicator. The same but for the object position.

Previous RE type. This three-valued feature indicates whether the previous mention of the candidate DR d is a pronoun, a non-pronominal noun phrase, or has never been observed before.

4.2.2 Selectional Preferences Feature

The selectional preference feature captures how well the candidate DR d fits a given syntactic position r of a given verbal predicate v. It is computed as the cosine similarity \mathrm{sim}_{\cos}(x_d, x_{v,r}) of a vector-space representation of the DR, x_d, and a structured vector-space representation of the predicate, x_{v,r}. The similarities are calculated using a Distributional Memory approach similar to that of Baroni and Lenci (2010). Their structured vector space representation has been shown to work well on tasks that evaluate correlation with human thematic fit estimates (Baroni and Lenci, 2010; Baroni et al., 2014; Sayeed et al., 2016) and is thus suited to our task.

The representation x_d is computed as an average of head word representations of all the previous mentions of DR d, where the word vectors are obtained from the TypeDM model of Baroni and Lenci (2010). This is a count-based, third-order co-occurrence tensor whose indices are a word w_0, a second word w_1, and a complex syntactic relation r, which is used as a stand-in for a semantic link. The values for each (w_0, r, w_1) cell of the tensor are the local mutual information (LMI) estimates obtained from a dependency-parsed combination of large corpora (ukWaC, BNC, and Wikipedia). Our procedure has some differences with that of Baroni and Lenci.
For example, for estimating the fit of an alternative new DR (in other words, x_d based on no previous mentions), we use an average over head words of all REs in the training set, a “null referent.” x_{v,r} is calculated as the average of the top 20 (by LMI) r-fillers for v in TypeDM; in other words, the prototypical instrument of rub may be represented by summing vectors like towel, soap, eraser, coin, and so on. If the predicate has not yet been encountered (as for subject positions), scores for all scenario-relevant verbs are emitted for marginalization.

4.2.3 Script Features

In this section, we describe features which rely on script information. Our goal will be to show that such common-sense information is beneficial in performing DR prediction. We consider only two script features.

Participant type fit

This feature characterizes how well the participant type (PT) of the candidate DR d fits a specific syntactic role r of the governing predicate v; it can be regarded as a generalization of the selectional preference feature to participant types and also its specialisation to the considered scenario. Given the candidate DR d, its participant type p, and the syntactic relation r, we collect all the predicates in the training set which have the participant type p in the position r. The embedding of the DR x_{p,r} is given by the average embedding of these predicates. The feature is computed as the dot product of x_{p,r} and the word embedding of the predicate v.

[Figure 4: An example of the referent cloze task. Similar to the Mechanical Turk experiment (Figure 2), our referent prediction model is asked to guess the upcoming DR. Story: “(I)(1) decided to take a (bath)(2) yesterday afternoon after working out. (I)(1) was getting ready to go out and needed to get cleaned before (I)(1) went so (I)(1) decided to take a (bath)(2). (I)(1) filled the (bathtub)(3) with warm (water)(4) and added some (bubble bath)(5). (I)(1) got undressed and stepped into the (water)(4). (I)(1) grabbed the (soap)(5) and rubbed it on (my)(1) (body)(7) and rinsed XXXXXX”]

Predicate schemas

The following feature captures a specific aspect of knowledge about prototypical sequences of events. This knowledge is called predicate schemas in the recent co-reference modeling work of Peng et al. (2015). In predicate schemas, the goal is to model pairs of events such that if a DR d participated in the first event (in a specific role), it is likely to participate in the second event (again, in a specific role). For example, in the restaurant scenario, if one observes a phrase John ordered, one is likely to see John waited somewhere later in the document. Specific arguments are not that important (whether it is John or some other DR); what is important is that the argument is reused across the predicates. This would correspond to the rule X-subject-of-order → X-subject-of-eat.⁴ Unlike the previous work, our dataset is small, so we cannot induce these rules directly as there will be very few rules, and the model would not generalize to new data well enough. Instead, we again encode this intuition using similarities in the real-valued embedding space.

⁴In this work, we limit ourselves to rules where the syntactic function is the same on both sides of the rule. In other words, we can, in principle, encode the pattern X pushed Y → X apologized but not the pattern X pushed Y → Y cried.

Table 2: Summary of model features

  Model       Feature types                                                      Features
  Base        Shallow Linguistic Features                                        Recency, Frequency, Grammatical function, Previous subject, Previous object
  Linguistic  Shallow Linguistic Features + Linguistic Feature                   Recency, Frequency, Grammatical function, Previous subject, Previous object + Selectional Preferences
  Script      Shallow Linguistic Features + Linguistic Feature + Script Features Recency, Frequency, Grammatical function, Previous subject, Previous object + Selectional Preferences + Participant type fit, Predicate schemas

Recall that our goal is to compute a feature φ(d, h^{(t)}) indicating how likely a potential DR d is to follow, given the history h^{(t)}. For example, imagine that the model is asked to predict the DR marked by XXXXXX in Figure 4. Predicate-schema rules can only yield previously introduced DRs, so the score φ(d, h^{(t)}) = 0 for any new DR d. Let us use “soap” as an example of a previously introduced DR and see how the feature is computed. In order to choose which inference rules can be applied to yield “soap”, we can inspect Figure 4. There are only two preceding predicates which have DR “soap” as their object (rubbed and grabbed), resulting in two potential rules X-object-of-grabbed → X-object-of-rinsed and X-object-of-rubbed → X-object-of-rinsed. We define the score φ(d, h^{(t)}) as the average of the rule scores. More formally, we can write

\phi(d, h^{(t)}) = \frac{1}{|N(d, h^{(t)})|} \sum_{(u,v,r) \in N(d, h^{(t)})} \psi(u, v, r),   (1)

where ψ(u, v, r) is the score for a rule X-r-of-u → X-r-of-v, N(d, h^{(t)}) is the set of applicable rules, and |N(d, h^{(t)})| denotes its cardinality.⁵ We define φ(d, h^{(t)}) as 0 when the set of applicable rules is empty (i.e. |N(d, h^{(t)})| = 0).

⁵In all our experiments, rather than considering all potential predicates in the history to instantiate rules, we take into account only 2 preceding verbs. In other words, u and v can be interleaved by at most one verb and |N(d, h^{(t)})| is in {0, 1, 2}.

We define the scoring function ψ(u, v, r) as a linear function of a joint embedding x_{u,v} of verbs u and v: \psi(u, v, r) = \alpha_r^\top x_{u,v}. The two remaining questions are (1) how to define the joint embeddings x_{u,v}, and (2) how to estimate the parameter vector α_r. The joint embedding of two predicates, x_{u,v}, can, in principle, be any composition function of embeddings of u and v, for example their sum or component-wise product. Inspired by Bordes et al. (2013), we use the difference between the word embeddings:

\psi(u, v, r) = \alpha_r^\top (x_u - x_v),

where x_u and x_v are external embeddings of the corresponding verbs. Encoding the succession relation as translation in the embedding space has one desirable property: the scoring function will be largely agnostic to the morphological form of the predicates. For example, the difference between the embeddings of rinsed and rubbed is very similar to that of rinse and rub (Botha and Blunsom, 2014), so the corresponding rules will receive similar scores. Now, we can rewrite equation (1) as

\phi(d, h^{(t)}) = \alpha_{r(h^{(t)})}^\top \frac{\sum_{(u,v,r) \in N(d, h^{(t)})} (x_u - x_v)}{|N(d, h^{(t)})|}   (2)

where r(h^{(t)}) denotes the syntactic function corresponding to the DR being predicted (object in our example). As for the parameter vector α_r, there are again a number of potential ways how it can be estimated. For example, one can train a discriminative classifier to estimate the parameters.
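The following sketch (in Python, with made-up toy embeddings standing in for the external word2vec vectors) illustrates how the predicate-schema score of equations (1)–(2) could be computed for the “soap” example: each applicable rule contributes α_r·(x_u − x_v), and the scores are averaged. The vector values and variable names are illustrative assumptions; how α_r itself is obtained is addressed next.

```python
import numpy as np

def predicate_schema_feature(alpha_r, rules, emb):
    """phi(d, h) for one candidate DR: the average rule score
    psi(u, v, r) = alpha_r . (x_u - x_v) over the applicable rules
    N(d, h), i.e. pairs (u, v) where a preceding predicate u had the
    candidate DR in role r and v is the upcoming predicate.
    Returns 0 when no rule applies (e.g. for new DRs)."""
    if not rules:
        return 0.0
    scores = [alpha_r @ (emb[u] - emb[v]) for u, v in rules]
    return float(np.mean(scores))

# Toy 4-dimensional "embeddings" (values invented for illustration).
emb = {
    "grabbed": np.array([0.9, 0.1, 0.0, 0.2]),
    "rubbed":  np.array([0.8, 0.2, 0.1, 0.1]),
    "rinsed":  np.array([0.7, 0.3, 0.1, 0.0]),
}
alpha_obj = np.array([0.5, -0.2, 0.1, 0.3])   # assumed; its estimation is discussed below

# Rules applicable to the DR "soap" in the Figure 4 example:
# X-object-of-grabbed -> X-object-of-rinsed and
# X-object-of-rubbed  -> X-object-of-rinsed.
rules_soap = [("grabbed", "rinsed"), ("rubbed", "rinsed")]
print(predicate_schema_feature(alpha_obj, rules_soap, emb))
```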
However, we opted for a simpler approach—we set it equal to the empirical estimate of the expected feature vector x_{u,v} on the training set:⁶

\alpha_r = \frac{1}{D_r} \sum_{l,t} \delta_r(r(h^{(l,t)})) \sum_{(u,v,r') \in N(d^{(l,t)}, h^{(l,t)})} (x_u - x_v),   (3)

where l refers to a document in the training set, t is (as before) a position in the document, h^{(l,t)} and d^{(l,t)} are the history and the correct DR for this position, respectively. The term δ_r(r') is the Kronecker delta which equals 1 if r = r' and 0 otherwise. D_r is the total number of rules for the syntactic function r in the training set:

D_r = \sum_{l,t} \delta_r(r(h^{(l,t)})) \times |N(d^{(l,t)}, h^{(l,t)})|.

⁶This essentially corresponds to using the Naive Bayes model with the simplistic assumption that the score differences are normally distributed with spherical covariance matrices.

Table 3: Accuracies (in %) and perplexities for different models and scenarios. The script model substantially outperforms linguistic and base models (with p < 0.001, significance tested with McNemar’s test (Everitt, 1992)). As expected, the human prediction model outperforms the script model (with p < 0.001, significance tested by McNemar’s test).

  Scenario                        Human Model         Script Model        Linguistic Model    Tily Model
                                  Accuracy  Perplex.  Accuracy  Perplex.  Accuracy  Perplex.  Accuracy  Perplex.
  Grocery Shopping                74.80     2.13      68.17     3.16      53.85     6.54      32.89     24.48
  Repairing a flat bicycle tyre   78.34     2.72      62.09     3.89      51.26     6.38      29.24     19.08
  Riding a public bus             72.19     2.28      64.57     3.67      52.65     6.34      32.78     23.39
  Getting a haircut               71.06     2.45      58.82     3.79      42.82     7.11      28.70     15.40
  Planting a tree                 71.86     2.46      59.32     4.25      47.80     7.31      28.14     24.28
  Borrowing book from library     77.49     1.93      64.07     3.55      43.29     8.40      33.33     20.26
  Taking Bath                     81.29     1.84      67.42     3.14      61.29     4.33      43.23     16.33
  Going on a train                70.79     2.39      58.73     4.20      47.62     7.68      30.16     35.11
  Baking a cake                   76.43     2.16      61.79     5.11      46.40     9.16      24.07     23.67
  Flying in an airplane           62.04     3.08      61.31     4.01      48.18     7.27      30.90     30.18
  Average                         73.63     2.34      62.63     3.88      49.52     7.05      31.34     23.22

Table 4: Accuracies from ablation experiments.

  Model                                     Accuracy  Perplexity
  Linguistic Model                          49.52     7.05
  Linguistic Model + Predicate Schemas      55.44     5.88
  Linguistic Model + Participant type fit   58.88     4.29
  Full Script Model (both features)         62.63     3.88

Let us illustrate the computation with an example. Imagine that our training set consists of the document in Figure 1, and the trained model is used to predict the upcoming DR in our referent cloze example (Figure 4). The training document includes the pair X-object-of-scrubbed → X-object-of-rinsing, so the corresponding term (x_scrubbed − x_rinsing) participates in the summation (3) for α_obj. As we rely on external embeddings, which encode semantic similarities between lexical items, the dot product of this term and (x_rubbed − x_rinsed) will be high.⁷ Consequently, φ(d, h^{(t)}) is expected to be positive for d = “soap”, thus predicting “soap” as the likely forthcoming DR. Unfortunately, there are other terms (x_u − x_v) both in expression (3) for α_obj and in expression (2) for φ(d, h^{(t)}). These terms may be
This may suggest that our feature will be too contaminated with noise to be informative for making predictions. However, recall that inde- pendent random vectors in high dimensions are al- most orthogonal, and, assuming they are bounded, their dot products are close to zero. Consequently, the products of the relevant (“non-random”) terms, in our example (xscrubbed - xrinsing) and (xrubbed - xrinsed), are likely to overcome the (“random”) noise. As we will see in the ablation studies, the predicate- schema feature is indeed predictive of a DR and con- tributes to the performance of the full model. 4.3 Experiments We would like to test whether our model can pro- duce accurate predictions and whether the model’s guesses correlate well with human predictions for the referent cloze task. In order to be able to evaluate the effect of script knowledge on referent predictability, we compare three models: our full Script model uses all of the features introduced in section 4.2; the Linguistic model relies only on the ‘linguistic features’ but not the script-specific ones; and the Base model includes all the shallow linguistic features. The Base model differs from the linguistic model in that it does not model selectional preferences. Table 2 summarizes features used in different models. The data set was randomly divided into training (70%), development (10%, 91 stories from 10 sce- 39 narios), and test (20%, 182 stories from 10 scenar- ios) sets. The feature weights were learned using L-BFGS (Byrd et al., 1995) to optimize the log- likelihood. Evaluation against original referents. We calcu- lated the percentage of correct DR predictions. See Table 3 for the averages across 10 scenarios. We can see that the task appears hard for humans: their average performance reaches only 73% accuracy. As expected, the Base model is the weakest system (the accuracy of 31%). Modeling selectional pref- erences yields an extra 18% in accuracy (Linguis- tic model). The key finding is that incorporation of script knowledge increases the accuracy by further 13%, although still far behind human performance (62% vs. 73%). Besides accuracy, we use perplex- ity, which we computed not only for all our models but also for human predictions. This was possible as each task was solved by multiple humans. We used unsmoothed normalized guess frequencies as the probabilities. As we can see from Table 3, the perplexity scores are consistent with the accuracies: the script model again outperforms other methods, and, as expected, all the models are weaker than hu- mans. As we used two sets of script features, capturing different aspects of script knowledge, we performed extra ablation studies (Table 4). The experiments confirm that both feature sets were beneficial. Evaluation against human expectations. In the previous subsection, we demonstrated that the in- corporation of selectional preferences and, perhaps more interestingly, the integration of automatically acquired script knowledge lead to improved accu- racy in predicting discourse referents. Now we turn to another question raised in the introduction: does incorporation of this knowledge make our predic- tions more human-like? In other words, are we able to accurately estimate human expectations? This in- cludes not only being sufficiently accurate but also making the same kind of incorrect predictions. In this evaluation, we therefore use human guesses collected during the referent cloze task as our target. 
We then calculate the relative accuracy of each computational model. As can be seen in Figure 5, the Script model, at approx. 53% accuracy, is a lot more accurate in predicting human guesses than the Linguistic model and the Base model.

[Figure 5: Average relative accuracies of different models w.r.t. human predictions (Script: 52.9%, Linguistic: 38.4%, Base: 34.52%).]

[Figure 6: Average Jensen-Shannon divergence between human predictions and models (Script: 0.5, Linguistic: 0.57, Base: 0.66).]

We can also observe that the margin between the Script model and the Linguistic model is a lot larger in this evaluation than between the Base model and the Linguistic model. This indicates that the model which has access to script knowledge is much more similar to human prediction behavior in terms of top guesses than the script-agnostic models.

Now we would like to assess if our predictions are similar as distributions rather than only yielding similar top predictions. In order to compare the distributions, we use the Jensen-Shannon divergence (JSD), a symmetrized version of the Kullback-Leibler divergence. Intuitively, JSD measures the distance between two probability distributions. A smaller JSD value is indicative of more similar distributions. Figure 6 shows that the probability distributions resulting from the Script model are more similar to human predictions than those of the Linguistic and Base models.

In these experiments, we have shown that script knowledge improves predictions of upcoming referents and that the script model is the best among our models in approximating human referent predictions.

5 Referring Expression Type Prediction Model (RE Model)

Using the referent prediction models, we next attempt to replicate Tily and Piantadosi’s findings that
Here we focus on the distinction be- tween pronouns and full noun phrases. Our data also contains a small percentage (ca. 1%) of proper names (like “John”). Due to this small class size and earlier findings that proper nouns behave much like pronouns (Tily and Piantadosi, 2009), we com- bined pronouns and proper names into a single class of short encodings. For the referring expression type prediction task, we estimate the surprisal of the referent from each of our computational models from Section 4 as well as the human cloze task. The surprisal of an upcoming discourse referent d(t) based on the previous context h(t) is thereby estimated as: S(d(t)) = − log p(d(t) | h(t)) In order to determine whether referent predictability has an effect on referring expression type over and above other factors that are known to affect the choice of referring expression, we train a logistic regression model with referring expression type as a response variable and discourse referent predictabil- ity as well as a large set of other linguistic factors (based on Tily and Piantadosi, 2009) as explanatory variables. The model is defined as follows: p(n(t) = n|d(t),h(t)) = exp(v T g(n,dt,h(t)))∑ n′ exp(v T g(n′,dt,h(t))) , where d(t) and h(t) are defined as before, g is the feature function, and v is the vector of model pa- rameters. The summation in the denominator is over NP types (full NP vs. pronoun/proper noun). 5.3 RE Model Experiments We ran four different logistic regression models. These models all contained exactly the same set of linguistic predictors but differed in the estimates used for referent type surprisal and residual entropy. One logistic regression model used surprisal esti- mates based on the human referent cloze task, while the three other models used estimates based on the three computational models (Base, Linguistic and Script). For our experiment, we are interested in the choice of referring expression type for those occur- rences of references, where a “real choice” is possi- ble. We therefore exclude for our analysis reported below all first mentions as well as all first and second person pronouns (because there is no optionality in how to refer to first or second person). This subset contains 1345 data points. 5.4 Results The results of all four logistic regression models are shown in Table 5. We first take a look at the results for the linguistic features. While there is a bit of variability in terms of the exact coefficient es- timates between the models (this is simply due to small correlations between these predictors and the predictors for surprisal), the effect of all of these features is largely consistent across models. For in- stance, the positive coefficients for the recency fea- ture means that when a previous mention happened 41 Estimate Std. Error Pr(>| z |) Human Script Linguistic Base Human Script Linguistic Base Human Script Linguistic Base (Intercept) -3.4 -3.418 -3.245 -3.061 0.244 0.279 0.321 0.791 <2e-16 *** <2e-16 *** <2e-16 *** 0.00011 *** recency 1.322 1.322 1.324 1.322 0.095 0.095 0.096 0.097 <2e-16 *** <2e-16 *** <2e-16 *** <2e-16 *** frequency 0.097 0.103 0.112 0.114 0.098 0.097 0.098 0.102 0.317 0.289 0.251 0.262 pastObj 0.407 0.396 0.423 0.395 0.293 0.294 0.295 0.3 0.165 0.178 0.151 0.189 pastSubj -0.967 -0.973 -0.909 -0.926 0.559 0.564 0.562 0.565 0.0838 . 0.0846 . 
0.106 0.101 pastExpPronoun 1.603 1.619 1.616 1.602 0.21 0.207 0.208 0.245 2.19e-14 *** 5.48e-15 *** 7.59e-15 *** 6.11e-11 *** depTypeSubj 2.939 2.942 2.656 2.417 0.299 0.347 0.429 1.113 <2e-16 *** <2e-16 *** 5.68e-10 *** 0.02994 * depTypeObj 1.199 1.227 0.977 0.705 0.248 0.306 0.389 1.109 1.35e-06 *** 6.05e-05 *** 0.0119 * 0.525 surprisal -0.04 -0.006 0.002 -0.131 0.099 0.097 0.117 0.387 0.684 0.951 0.988 0.735 residualEntropy -0.009 0.023 -0.141 -0.128 0.088 0.128 0.168 0.258 0.916 0.859 0.401 0.619 Table 5: Coefficients obtained from regression analysis for different models. Two NP types considered: full NP and Pronoun/ProperNoun, with base class full NP. Significance: ‘***’ < 0.001, ‘**’ < 0.01, ‘*’ < 0.05, and ‘.’ < 0.1. very recently, the referring expression is more likely to be a pronoun (and not a full NP). The coefficients for the surprisal estimates of the different models are, however, not significantly dif- ferent from zero. Model comparison shows that they do not improve model fit. We also used the esti- mated models to predict referring expression type on new data and again found that surprisal estimates from the models did not improve prediction accu- racy. This effect even holds for our human cloze data. Hence, it cannot be interpreted as a problem with the models—even human predictability esti- mates are, for this dataset, not predictive of referring expression type. We also calculated regression models for the full dataset including first and second person pronouns as well as first mentions (3346 data points). The re- sults for the full dataset are fully consistent with the findings shown in Table 5: there was no significant effect of surprisal on referring expression type. This result contrasts with the findings by Tily and Piantadosi (2009), who reported a significant effect of surprisal on RE type for their data. In order to replicate their settings as closely as possible, we also included residualEntropy as a predictor in our model (see last predictor in Table 5); however, this did not change the results. 6 Discussion and Future Work Our study on incrementally predicting discourse referents showed that script knowledge is a highly important factor in determining human discourse ex- pectations. Crucially, the computational modelling approach allowed us to tease apart the different fac- tors that affect human prediction as we cannot ma- nipulate this in humans directly (by asking them to “switch off” their common-sense knowledge). By modelling common-sense knowledge in terms of event sequences and event participants, our model captures many more long-range dependencies than normal language models. The script knowledge is automatically induced by our model from crowd- sourced scenario-specific text collections. In a second study, we set out to test the hypoth- esis that uniform information density affects refer- ring expression type. This question is highly con- troversial in the literature: while Tily and Piantadosi (2009) find a significant effect of surprisal on refer- ring expression type in a corpus study very similar to ours, other studies that use a more tightly con- trolled experimental approach have not found an ef- fect of predictability on RE type (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014). The present study, while replicating exactly the setting of T&P in terms of features and analysis, did not find support for a UID effect on RE type. 
The difference in results between T&P 2009 and our results could be due to the different corpora and text sorts that were used; specifically, we would expect that larger predictability effects might be ob- servable at script boundaries, rather than within a script, as is the case in our stories. A next step in moving our participant predic- tion model towards NLP applications would be to replicate our modelling results on automatic text- to-script mapping instead of gold-standard data as done here (in order to approximate human level of processing). Furthermore, we aim to move to more complex text types that include reference to several scripts. We plan to consider the recently published ROC Stories corpus (Mostafazadeh et al., 2016), a large crowdsourced collection of topically unre- stricted short and simple narratives, as a basis for these next steps in our research. 42 Acknowledgments We thank the editors and the anonymous review- ers for their insightful suggestions. We would like to thank Florian Pusse for helping with the Ama- zon Mechanical Turk experiment. We would also like to thank Simon Ostermann and Tatjana Anikina for helping with the InScript corpus. This research was partially supported by the German Research Foundation (DFG) as part of SFB 1102 ‘Informa- tion Density and Linguistic Encoding’, European Research Council (ERC) as part of ERC Starting Grant BroadSem (#678254), the Dutch National Sci- ence Foundation as part of NWO VIDI 639.022.518, and the DFG once again as part of the MMCI Cluster of Excellence (EXC 284). References Simon Ahrendt and Vera Demberg. 2016. Improving event prediction by representing script participants. In Proceedings of NAACL-HLT. Jennifer E. Arnold. 2001. The effect of thematic roles on pronoun use and frequency of reference continuation. Discourse Processes, 31(2):137–162. Marco Baroni and Alessandro Lenci. 2010. Distribu- tional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673– 721. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic compari- son of context-counting vs. context-predicting seman- tic vectors. In Proceedings of ACL. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Trans- lating embeddings for modeling multi-relational data. In Proceedings of NIPS. Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML. Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208. Nathanael Chambers and Daniel Jurafsky. 2008. Unsu- pervised learning of narrative event chains. In Pro- ceedings of ACL. Nathanael Chambers and Dan Jurafsky. 2009. Unsuper- vised learning of narrative schemas and their partici- pants. In Proceedings of ACL. Brian S. Everitt. 1992. The analysis of contingency ta- bles. CRC Press. Lea Frermann, Ivan Titov, and Manfred Pinkal. 2014. A hierarchical Bayesian model for unsupervised induc- tion of script knowledge. In Proceedings of EACL. Kumiko Fukumura and Roger P. G. van Gompel. 2010. Choosing anaphoric expressions: Do people take into account likelihood of reference? Journal of Memory and Language, 62(1):52–66. T. Florian Jaeger, Esteban Buz, Eva M. Fernandez, and Helen S. Cairns. 2016. Signal reduction and linguis- tic encoding. Handbook of psycholinguistics. Wiley- Blackwell. T. 
Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cog- nitive psychology, 61(1):23–62. Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of EACL. Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, cognition and neuroscience, 31(1):32–59. Gina R. Kuperberg. 2016. Separate streams or proba- bilistic inference? What the N400 can tell us about the comprehension of events. Language, Cognition and Neuroscience, 31(5):602–616. Marta Kutas, Katherine A. DeLong, and Nathaniel J. Smith. 2011. A look around at what lies ahead: Pre- diction and predictability in language processing. Pre- dictions in the brain: Using our past to generate a fu- ture. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cer- nockỳ, and Sanjeev Khudanpur. 2010. Recurrent neu- ral network based language model. In Proceedings of Interspeech. Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM-recurrent neural network language modeling toolkit. In Pro- ceedings of the 2011 ASRU Workshop. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor- rado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS. Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. Proceedings of CoNLL. Ashutosh Modi, Tatjana Anikina, Simon Ostermann, and Manfred Pinkal. 2016. Inscript: Narrative texts anno- tated with script information. Proceedings of LREC. Ashutosh Modi. 2016. Event embeddings for semantic script modeling. Proceedings of CoNLL. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of com- monsense stories. Proceedings of NAACL. 43 Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL. Karl Pichotta and Raymond J Mooney. 2014. Statistical script learning with multi-argument events. Proceed- ings of EACL. Altaf Rahman and Vincent Ng. 2012. Resolving com- plex cases of definite pronouns: the Winograd schema challenge. In Proceedings of EMNLP. Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of ACL. Hannah Rohde and Andrew Kehler. 2014. Grammati- cal and information-structural influences on pronoun production. Language, Cognition and Neuroscience, 29(8):912–927. Rachel Rudinger, Vera Demberg, Ashutosh Modi, Ben- jamin Van Durme, and Manfred Pinkal. 2015. Learn- ing to predict script events from domain-specific text. Proceedings of the International Conference on Lexi- cal and Computational Semantics (*SEM 2015). Asad Sayeed, Clayton Greenberg, and Vera Demberg. 2016. Thematic fit evaluation: an aspect of selectional preferences. In Proceedings of the Workshop on Eval- uating Vector Space Representations for NLP (RepE- val2016). Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland. Simone Schütz-Bosbach and Wolfgang Prinz. 2007. Prospective coding in event representation. Cognitive processing, 8(2):93–102. Rosemary J. Stevenson, Rosalind A. Crawley, and David Kleinman. 1994. Thematic roles, focus and the rep- resentation of events. 
Language and Cognitive Pro- cesses, 9(4):519–548. Harry Tily and Steven Piantadosi. 2009. Refer effi- ciently: Use less informative expressions for more pre- dictable meanings. In Proceedings of the workshop on the production of referring expressions: Bridging the gap between computational and empirical approaches to reference. Alessandra Zarcone, Marten van Schijndel, Jorrig Vo- gels, and Vera Demberg. 2016. Salience and atten- tion in surprisal-based accounts of language process- ing. Frontiers in Psychology, 7:844. 44 work_2edq4sbs6readbovfgzcxl5ojy ---- International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 15 Radix-8 Design Alternatives of Fast Two Operands Interleaved Multiplication with Enhanced Architecture With FPGA implementation & synthesize of 64-bit Wallace Tree CSA based Radix-8 Booth Multiplier Mohammad M. Asad King Faisal University, Department of Electrical Engineering, Ahsa 31982, Saudi Arabia e-mail: asadmosab@gmail.com Ibrahim Marouf King Faisal University, Department of Electrical Engineering, Ahsa 31982, Saudi Arabia e-mail: i.marouf@outlook.com Qasem Abu Al-Haija Department of Computer Information and Systems Engineering Tennessee State University, Nashville, USA e-mail: qabualha@my.tnstate.edu Abstract—In this paper, we proposed different comparable reconfigurable hardware implementations for the radix-8 fast two operands multiplier coprocessor using Karatsuba method and Booth recording method by employing carry save (CSA) and kogge stone adders (KSA) on Wallace tree organization. The proposed designs utilized family with target chip device along with simulation package. Also, the proposed designs were synthesized and benchmarked in terms of the maximum operational frequency, the total path delay, the total design area and the total thermal power dissipation. The experimental results revealed that the best multiplication architecture was belonging to Wallace Tree CSA based Radix- 8 Booth multiplier ( ) which recorded: critical path delay of , maximum operational frequency of , hardware design area (number of logic elements) of , and total thermal power dissipation estimated as . Consequently, method can be efficiently employed to enhance the speed of computation for many multiplication based applications such embedded system designs for public key cryptography. Keywords-Cryptography; Computer Arithmetic; FPGA Design; Hardware Synthesis; Kogge-Stone Adder (KSA); Radix- 8 Booth Recording; Karatsuba Multiplier; Wallace Tree I. INTRODUCTION Recently, the vast promotion in the field of information and communication technology (ICT) such as grid and fog computing has increased the inclination of having secret data sharing over the existing non- secure communication networks. This encouraged the researches to propose different solutions to ensure the safe access and store of private and sensitive data by employing different cryptographic algorithms especially the public key algorithms [1] which proved robust security resistance against most of the attacks and security halls. Public key cryptography is significantly based on the use of number theory and digital arithmetic algorithms. Indeed, wide range of public key cryptographic systems were developed and embedded using hardware modules due to its better performance and security. 
This increased the demand on the embedded and System-on-Chip (SoC) [2] technologies employing several computer-aided design (CAD) tools along with configurable hardware processing units such as field programmable gate arrays (FPGA) and application specific integrated circuits (ASIC). Therefore, a considerable number of embedded coprocessor designs were used to replace software-based (i.e. programming-based) solutions of different applications such as image processors, cryptographic processors, digital filters, low-power applications such as [3], and others. The major part of designing such processors significantly encompasses the use of computer arithmetic techniques in the underlying layers of processing.

Computer arithmetic [4], or digital arithmetic, is the science that combines mathematics with computer engineering and deals with representing integers and real values in digital systems and with efficient algorithms for manipulating such numbers by means of hardware circuitry and software routines. Arithmetic operations on pairs of numbers x and y include addition (x + y), subtraction (x – y), multiplication (x × y), and division (x / y). Subtraction and division can be viewed as operations that undo the effects of addition and multiplication, respectively. Multiplication is considered a core operation that affects the performance of any embedded system. Therefore, the use of fast multiplier units will result in enhancements in the overall performance of the system.

Recently, several solutions were proposed for multiplication algorithms, while only a few of them were efficient [5]. A multiplication algorithm [6] is a method to find the product of two numbers, i.e. x × y. Multiplication is an essential building block for several digital processors as it requires a considerable amount of processing time and hardware resources. Depending on the size of the numbers, different algorithms are in use. The elementary-school grade algorithm multiplies each number digit by digit, producing partial sums, with a complexity of O(n²) [5]. For larger numbers, more efficient algorithms are needed. For example, if the integers to be multiplied are 1k bits long, the schoolbook method requires on the order of n² ≈ 10⁶ single-digit multiplications. However, more efficient and practical multiplication algorithms will be discussed in the following subsections.

In this paper, we report on several fast alternative designs for the Radix-8 based multiplier unit including: Radix-8 CSA Based Booth Multiplier; CSA Based Radix-8 Booth, Wallace Tree Karatsuba Multiplier; CSA Based Radix-8 Booth, KSA Based Karatsuba Multiplier; CSA Based Radix-8 Booth, With Comparator Karatsuba Multiplier; Sequential 64-Bit CSA Based Radix-8 Booth Multiplier; and 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The remainder of this paper is organized as follows: Section 2 discusses the core components of efficient multiplier design, Section 3 provides the proposed design alternatives of the Radix-8 based multiplier, Section 4 presents the synthesizing results and analysis, and, finally, Section 5 concludes the paper.

II. CORE DESIGN COMPONENTS-REVIEW

Two-operands multiplication is a substantial arithmetic operation since it plays a major role in the design of many embedded and digital signal processors [7]. Therefore, the efficient design and implementation of a fast multiplier unit is on demand.
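To make the quadratic cost of the elementary-school method concrete, the following short Python sketch multiplies two numbers given as digit lists; the nested loops perform one single-digit multiplication per digit pair, i.e. n² of them for two n-digit operands. The digit-list representation (least-significant digit first) is only an illustrative assumption.

```python
def schoolbook_multiply(x_digits, y_digits, base=10):
    """Grade-school multiplication of two numbers given as digit lists
    (least-significant digit first). Every digit of x is multiplied by
    every digit of y, so an n-digit by n-digit product costs n*n
    single-digit multiplications -- the O(n^2) behaviour noted above."""
    result = [0] * (len(x_digits) + len(y_digits))
    for i, xd in enumerate(x_digits):
        carry = 0
        for j, yd in enumerate(y_digits):
            tmp = result[i + j] + xd * yd + carry
            result[i + j] = tmp % base
            carry = tmp // base
        result[i + len(y_digits)] += carry
    return result

# 123 * 456 = 56088 (digits stored least-significant first)
print(schoolbook_multiply([3, 2, 1], [6, 5, 4]))  # [8, 8, 0, 6, 5, 0]
```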
In this paper, we propose a competitive reconfigurable multiplier design using scalable and efficient modules. Thus, the following subsections review the core design components for the proposed multiplier implementation unit.

Figure 1. Carry save Adder: (a) Top View Design (b) Internal Architecture

A. Carry Save Adder (CSA)

CSA [4] is a fast redundant adder with a constant carry path delay regardless of the number of operands' bits. It produces the result as two vectors: a sum vector (or partial sum) and a carry vector (or partial carry). The advantage of CSA is that the speed is constant regardless of the number of bits. However, its area increases linearly with the number of bits. The top view of the CSA unit along with its internal logic design architecture are provided in Fig. 1.

In this work, we have implemented the CSA adder using VHDL code for different bit sizes ranging from 8 bits through 64 bits [8]. The synthesis results of total delay in nanoseconds (ns) and area in Logic Elements (LEs) were analyzed and reported in [8], and they are illustrated in Fig. 2. These results were generated using the software package [9], simulated for the target device model [10], and they highly conform to the theoretical evaluation of CSA operation since the delay time is almost equal for all bit sizes. However, the area almost doubles as the bit width doubles. Also, the timing estimation of CSA was generated via the Time Analyzer tool provided in the package. Accordingly, the critical path delay is the data arrival time, while the data delay is only 2.866 ns, which provides a frequency of approximately 349 MHz. Finally, to verify the performance of CSA, we have compared it with the well-known Carry Lookahead Adder (CLA) in terms of area and delay. CLA is a carry propagation adder (CPA) with a logarithmic relation between the carry propagation delay and the number of bits in the operands.

Figure 2. Delay-Area analysis of CSA vs CLA implementations (8–64 bit)

The simulation results of both CSA and CLA, provided in Fig. 2, show that CSA is superior in both area and speed. It has an almost constant time delay and relatively less area than CLA, whereas the CLA time delay increases as the number of bits increases, though not as much as the area does.

B. Kogge-Stone Adder (KSA)

KSA is a fast two-operands parallel prefix adder (PPA) [11] that executes addition in a parallelized manner. PPAs are just like CLA but with an enhancement in the carry propagation stage (called the middle stage). There are five different variations of PPAs, namely: Ladner-Fischer Adder (LFA), Brent-Kung Adder (BKA), Kogge-Stone Adder (KSA), Han-Carlson Adder (HCA), and Sklansky Adder (SkA). These adders differ by the tree structure design to optimize certain aspects such as performance, power, area size, and fan in/out.

To verify the performance of all PPAs, we have implemented them on FPGA, and the experimental results [6] showed that KSA utilizes a larger area to achieve higher performance compared with the other PPAs. Thus, we decided to consider KSA as our basic carry propagation adder (CPA) to finalize the redundant results and to build up many other units that need a conventional adder. In short, the simulation results of [6] showed that KSA leads the other adders as it has the smallest time delay with only 4.504 ns. This result is very useful and conforms to the theoretical modeling of KSA, which has the least number of logic levels.
Aligning Partial Products. E. Magnitude Comparator The magnitude (or digital) comparator is a hardware electronic device that takes two numbers as input in binary form and determines whether one number is greater than, less than or equal to the other number. Like that in binary addition, the efficient comparator can be implemented using G (generate) and P (propagate) signal for comparison. Basically, the comparator involves two 2-bits: & can be realized by: 1 1 1 1 0 0 ( ).( ) Big B A B A B A B   (1) 1 1 0 0 EQ ( ).( )A B A B   (2) For AB, “BBig, EQ” is “0,0”. Where BBig is defined as output A less than B (A_LT_B).Comparing Eq. (1) and (2) with carry signal (3): ( ). . out in in C AB A B C G P C     (3) Where A & B are binary inputs Cin is carry input, Cout is carry output, and G & P are generate & propagate signals, respectively. Now, after comparing equations (1) & (3), we got: 1 1 1 G A B , 1 1 1 ( )EQ A B  , 0 0in C A B (4) Cin can be considered as G0. For this, encoding equation is given as: [ ] [ ] [ ]i i i G A B (5) [ ] [ ] [ ] ( ) i i i EQ A B  (6) Substituting the two values from equations (5) & (6) in (1) & (2) results in: [2 j 1:2 j] [2 j 1] [2 j 1] [2 ] . Big j B G EQ G      (7) [2 j 1:2 ] [2 j 1] [2 j] . j EQ EQ EQ    (8) & signals can be further combined to form group & signals. For instance, for 64-bit comparator, & can be computed as: 6362 [63:0] 63 0 1 . Big k m k m k B G G EQ              (9) 63 [63:0] 0 m m EQ EQ   (10) Fig 7. Shows the complete design of an 8-bit comparator as an example of this techniques where: i= 0…7, j = 0…3. III. PROPOSED MULTIPLIER DESIGN ALTERNATIVES Fundamentally, multiplication operation (along with fast addition) is a significant unit in almost all cryptographic coprocessors. For instance, in the design of SSC Crypto-processor[15], the multiplication primarily used to compute the square parameter the public key ( and the modulus ( . Also, in the design of RSA Crypto-processor, the multiplier is used to compute the modulus ( and the Euler function [16]. One more example, is the need for fast multipliers at several computation stages of ECC cryptosystem [17]. Indeed, wide range of methods have been proposed to address the efficient design of fast two operands arithmetic International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 20 multiplier. In this paper, we have spent an extensive time to design an efficient multiplier by trying several variations of different multiplier design specifications. The first design was the implementation of Radix-8 Booth Encoding Multiplier. Then, we tried many variations to employ this multiplier with different design methods. In the next subsections, we provide six design alternatives of the proposed multiplier to come up with the most cost-effective multiplier design. We finally report on the final implemented design. Figure 7. The complete design of8- Bit Comparatorincluding Pre- Encoding circuit and Comp circuit A. Radix-8 CSA Based Booth Multiplier Unlike Binary radix booth encoder, Radix-8 booth encodes each group of three bits as shown in table 1. The encoding technique uses shift operation to produce 2A and 4A while 3A is equal to 2A+A. The logic diagram of implementing CSA based Radix-8 booth multiplier is shown in Fig. 8. The use of CSA provides very powerful performance with limited area cost. The partial products for radix-2 is n (where n is the number of operand bits). 
The number of partial products for radix-2 is n (where n is the number of operand bits); for radix-8, however, it is only n/3.
TABLE I. RADIX-8 BOOTH ENCODING.
Inputs (bits of the M-bit multiplier) | Partial product PPRi
0 0 0 0 | 0
0 0 0 1 | A
0 0 1 0 | A
0 0 1 1 | 2A
0 1 0 0 | 2A
0 1 0 1 | 3A
0 1 1 0 | 3A
0 1 1 1 | 4A
1 0 0 0 | -4A
1 0 0 1 | -3A
1 0 1 0 | -3A
1 0 1 1 | -2A
1 1 0 0 | -2A
1 1 0 1 | -A
1 1 1 0 | -A
1 1 1 1 | 0
As can be seen from Fig. 8, the multiplier accepts two 32-bit operands and stores one of them in a shift register, from which the bit groups used in the encoding are selected, whereas the other operand is processed by the Booth encoder. The output of the encoding stage is added via the sequential CSA adder and the result is provided in a redundant representation (vector sum and vector carry).
Figure 8. Design of the Radix-8 Booth 32-bit multiplier
Figure 9. State machine diagram for the 32-bit Booth multiplier.
Fig. 9 illustrates the FSM diagram of the 32-bit Booth multiplier. It starts with Reset_State, where all signals and variables are cleared (i.e. reset). The next state is Mul_Gen, where the encoding occurs. After that, the generated vector is added to the previous results in the CSA state. Fourth, the results are stored in Store_State and the machine moves back to the Mul_Gen state in a loop until all the bits have been selected and encoded. Finally, the output results are provided in the Output state. Note that in radix-8 encoding the number of generated partial product vectors is computed by dividing the number of bits by 3, since three bits are selected and used for each encoding step.
B. CSA Based Radix-8 Booth, Wallace Tree Karatsuba Multiplier
In this method, we combine the benefits of the bit reduction of radix-8 Booth with the parallelism of the CSA based Wallace tree as well as the pipelining process of Karatsuba multiplication. Thus, this design achieved minimum path delay and minimized area (i.e. the best performance). However, redundancy in this design produced one critical problem regarding the middle carry at the edges of blocks that affects the results. Fig. 10 illustrates the flow diagram for this design. Here, we first designed a 64-bit Karatsuba multiplier using a 32-bit CSA based radix-8 Booth multiplier for the partial product calculation (since we are implementing a 64-bit multiplier, the block size m was chosen to be 32 bits, i.e. half size). First, the two entered operands are divided into halves. Next, the halves are fed into the Booth multiplier to compute the partial products given by the Karatsuba formula. Since the results are redundant and there are five partial products according to the Karatsuba alignment, 10 partial product vectors are generated. In the final stage, a CSA based Wallace tree was implemented to add the resulting partial products. The final result is represented redundantly as a vector sum and a vector carry. This design achieves minimum path delay with limited area. However, redundancy in this design produces one critical problem that affects the results. As a rule of thumb, if we multiply two numbers (i.e. p and q), the width of the result grows to the sum of the operand widths. However, this is not the case when using redundant systems: the result is stored as two vectors, and adding the two vectors to obtain the conventional product might produce one extra carry bit. This additional bit brings up a new problem in the preliminary design.
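Before turning to the fixes discussed next, a small numerical sketch of this extra-bit effect may help. The two vectors below are contrived stand-ins for the redundant (vector sum, vector carry) pair: the pair only has to equal the product modulo 2^64, for example when negative Booth multiples travel through the 64-bit datapath in two's complement, so their plain sum can spill into a 65th bit even though the product itself fits in 64 bits.

```python
MASK64 = (1 << 64) - 1

# a 32x32-bit multiplication, so the true product fits in 64 bits
a, x = 0x9234_5678, 0x7ABC_DEF1
true_product = a * x

# contrived (vector sum, vector carry) pair: equal to the product only modulo 2^64,
# e.g. because a negative Booth multiple is kept in 64-bit two's complement
vec_sum = (true_product + 3 * a) & MASK64
vec_carry = (-3 * a) & MASK64

converted = vec_sum + vec_carry              # final carry-propagate addition
print(hex(converted))                        # 65 bits: one extra 'mid' carry appears
assert converted != true_product
assert converted & MASK64 == true_product    # discarding bit 64 recovers the product
```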
Now, this problem can be solved by discarding the last carry when converting back to the conventional representation. However, in the Karatsuba algorithm the numbers are split into 32-bit halves (the original size is 64 bits). The result must be 128 bits, but in the Karatsuba case it is formed by 10 partial product vectors of 64 bits, shifted in such a way that adding those vectors results in 128 bits. Thus, discarding all the generated carries when converting back to the conventional system leads to an error, since only the carry generated by adding the two vectors that correspond to the same variable (i.e. the same partial product) needs to be discarded. The other generated carries must be considered. Fig. 11 demonstrates this problem graphically.
Figure 10. Design of the 64-bit CSA Based Radix-8 Booth, Wallace Tree Karatsuba multiplier.
Figure 11. Graphical approach to demonstrate the carry error (the mid-carry problem); here we have two cases: Case I - ps1 + pc1 might result in a carry, result = 65 bits (wrong), the carry must be discarded; Case II - ps1 + ps2 might result in a carry, result = 65 bits (correct), the carry must be considered.
Eventually, the mid-carry problem was solved either by using the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier or by using the 64-bit CSA Based Radix-8 Booth, with Comparator Karatsuba multiplier. However, both solutions added more overhead to the design cost; therefore, this solution was excluded. Both solutions are discussed in the following subsections.
1) CSA Based Radix-8 Booth, KSA Based Karatsuba Multiplier.
Since the carry to be eliminated is the one generated by the Booth multiplier, a first thought is to exchange the CSA adder with a KSA adder in order to convert the two vectors back into one 64-bit number and discard any generated carry. All the 8 vectors are reduced into five 64-bit vectors in parallel. This stage helps to eliminate the false carry without the need for any further examination. KSA is a fast adder, thus this design maintains its high performance while utilizing more logic elements. The logic diagram of the design is shown in Fig. 12.
2) CSA Based Radix-8 Booth, with Comparator Karatsuba Multiplier.
Another notable design option that can solve the mid-carry problem is to use a 64-bit comparator to test whether the two vectors will generate a carry; if yes, a correction step is performed before the 10 vectors are input to the CSA tree. After the Booth multiplication stage, the vector sum and vector carry that may produce a carry error are connected to the inputs of the 64-bit comparator unit, and a correction is performed if needed. Finally, all vectors are added using the CSA tree. The complete solution is depicted in Fig. 13.
Figure 12. Design of the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier.
Figure 13. Karatsuba multiplication based on CSA and comparator.
Note that the 64-bit comparator can be built with 8 stages in total, recording a total delay of 13 gate levels and an area of 317 gates (like the design of the 8-bit comparator discussed in Section 2.5). To predict whether the carry will be generated or not, we need to generate 64-bit G (generate) and K (kill) vectors. Thus, there are three cases which might happen, as follows:
- Case I: when the most significant bit position is a propagate state (G = K = 0), the carry is propagated. Here we need to determine the first non-propagate state below it. If that state is a kill state, then the vector does not need any correction. But if it is a generate state, then we need to subtract one from the highest bit (MSB) of either vector to prevent the carry from propagating out.
- Case II: when the most significant bit position is a kill state, no correction is needed.
- Case III: when the most significant bit position is a generate state, a correction is needed. If this happens at the highest bit (MSB) itself, it needs to subtract two ones; but if it occurs below a run of propagate states, then this is Case I.
To determine the first state, we have used a comparator to compare the two vectors, with the comparator results read as follows:
- G > K: a generate state happened first, or it is the first state after the propagation run.
- G < K: a kill state happened first, or it is the first state after the propagation run.
- G = K: all states are propagate states, and no correction is needed because we do not have an input carry.
3) Comparisons between Design II & Design III
We investigated both proposed design alternatives of Karatsuba based multiplication theoretically in terms of critical path delay (in gate delay units) and the area of the multiplier (how many gates are used in the implementation). The results are shown in Table 2 below.
TABLE II. COMPARISON BETWEEN DESIGN II & DESIGN III.
Design solution | Delay (gate delays) | Delay optimization | Area (# of gates) | Area optimization
Solution I: using KSA adder | 23 | +15% | 6130 | -
Solution II: using Comparator unit | 27 | - | 3712 | +50%
C. Sequential 64-Bit CSA Based Radix-8 Booth Multiplier
This design is accomplished by expanding the 32-bit Booth multiplier to 64 bits. The two modules (i.e. the 64-bit and the 32-bit Booth) differ only in the number of generated partial products. Since radix-8 is used, 22 partial products are generated in the new module instead of 11, while the other logic components remain the same. Fig. 14 shows the logic diagram of the new 64-bit implementation. This design was implemented and simulated on an Altera FPGA kit, recording a path delay of 10.701 ns for one loop; since the program runs 22 times (i.e. 22 partial products), the total path delay is 235.422 ns. Also, this multiplier requires 3330 logic elements (LEs).
Figure 14. Design of the CSA based Radix-8 Booth 64-bit multiplier.
IV. SYNTHESIZE RESULTS AND ANALYSIS
To speed up the performance of the sequential 64-bit CSA Based Radix-8 Booth Multiplier, we parallelized the addition of the partial products produced in the same level by using a Wallace CSA tree instead of the sequential CSA, to exploit the maximum possible parallelism between the partial products, gain in speed and enhance the design performance.
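The sketch below illustrates, in Python, the kind of carry-save tree reduction referred to here: a list of addends is repeatedly compressed three at a time into sum/carry pairs until only two vectors remain, which a single fast adder (such as the KSA) then combines. It is an algorithmic illustration only; operand widths and the final addition are simplified with respect to the FPGA design.

```python
def csa(a, b, c):
    """Carry-save (3:2) compression: three addends in, a (sum, carry) pair out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_reduce(operands):
    """Reduce a list of addends to two vectors with layers of 3:2 compressors."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        take = len(ops) - len(ops) % 3          # how many addends form groups of three
        for i in range(0, take, 3):
            s, c = csa(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s, c]
        nxt += ops[take:]                       # 0, 1 or 2 leftovers pass through
        ops = nxt
    return ops                                  # two vectors remain

partial_products = [17, 4096, 33, 900, 12345, 7, 65536, 2, 999, 314]  # 10 addends
s_vec, c_vec = wallace_reduce(partial_products)
assert s_vec + c_vec == sum(partial_products)   # a carry-propagate adder (e.g. KSA) finishes
```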
With that, we end up implementing a 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The block diagram of the proposed design is shown in Fig. 15(a). The comparison with the other design alternatives showed that the Wallace Tree CSA Based Radix-8 Booth Multiplier (WCBM) decreased the total delay and increased the operational speed of the multiplication operation. Also, the design was modified to increase the frequency by dividing the program into three main states. The top view of our implemented WCBM unit is given in Fig. 15(b). The WCBM unit is triggered by the CLK signal along with an enable line. The generated number can be obtained from the output port lines "sum", which is 128 bits wide. Besides, the unit encompasses three control input signals (enable, reset, clk) and two control output signals (Ack and Ready). Moreover, the finite state machine (FSM) diagram for the implemented WCBM is shown in Fig. 15(c). The FSM consists of three main phases: partial product generation (initially, 22 partial products are generated by using radix-8 Booth encoding), the Wallace tree phase (these partial products are added by using a 7-level CSA based Wallace tree) and the KSA phase (because the result is redundant, a KSA is used in the last phase to convert it to the conventional result). Finally, Fig. 16 illustrates a sample numerical example of the proposed WCBM generated with the Quartus II simulation tool.
Figure 15. (a) Design architecture of WCBM (b) Top level diagram of WCBM (c) FSM diagram for WCBM.
Figure 16. Sample run example of the WCBM processing two 64-bit numbers
The proposed multiplier implementation has been synthesized using an Altera Cyclone EP4CGX-22CF19C7 FPGA kit to analyze several design factors such as the design area, the total delay of the multiplication unit and the thermal power consumption of the FPGA implementation. We have evaluated the performance of the 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM) module for different data path sizes. The timing analysis of the critical clock cycle for the implemented WCBM is illustrated in Fig. 17. It can be seen from the graph that the critical path delay is 14.103 ns, of which 3.094 ns is clock delay and 11.009 ns is data delay. This gives a maximum frequency for the circuit of 90.83 MHz. In addition, the area of the design recorded a constant number of logic elements (i.e. 14249 LEs), with a total thermal power dissipation, estimated using the PowerPlay Power Analyzer tool of the Quartus II software, of 217.56 mW.
Figure 17. Waveform sample of the proposed WCBM data delay
V. CONCLUSIONS AND REMARKS
Multiplication is a core operation that dominates the performance of several public-key cryptographic algorithms such as RSA and SSC. In this paper, we have thoroughly discussed several design alternatives for a radix-8 based multiplier unit by employing the Karatsuba method and the Booth recoding method with carry-save and Kogge-Stone adders on a Wallace tree organization. The proposed designs were evaluated in terms of many aspects including: maximum frequency and critical path delay, design area, and the total FPGA power consumption.
The proposed hardware cryptosystem design was carried out using Altera Cyclone FPGA design technology along with the help of Altera CAD packages such as Quartus II and ModelSim 10.1. To sum up, we have successfully implemented and synthesized the Wallace Tree CSA Based Radix-8 Booth Multiplier (WCBM) module via the target FPGA technology for 64 bits. The synthesizer results showed attractive results in terms of several design factors that can improve the computation performance for many multiplication-based applications.
REFERENCES
[1] A.J. Menezes, P.C. Van Oorschot and S.A. Vanstone. (1996). "Handbook of Applied Cryptography", CRC Press, Boca Raton, Florida.
[2] K. Javeed, X. Wang and M. Wang, 'Serial and Parallel Interleaved Modular Multipliers on FPGA Platform', IEEE 25th International Conference on Field Programmable Logic and Applications (FPL), 2015. https://doi.org/10.1109/FPL.2015.7293986
[3] D. J. Greaves, System on Chip Design and Modelling, University of Cambridge, Computer Laboratory, Lecture Notes, 2011. http://www.cl.cam.ac.uk/teaching/1011/SysOnChip/socdam-notes1011.pdf
[4] M. D. Ercegovac and T. Lang, (2004) 'Digital Arithmetic', Morgan Kaufmann Publishers, Elsevier, vol. 1, pp. 51-136. http://www.sciencedirect.com/science/book/9781558607989
[5] Qasem Abu Al-Haija, Sharifah M. S. Ahmad, "Fast Radix-2 Sequential Multiplier Using Kintex-7 FPGA Chip Family", The Open Cybernetics & Systemics Journal, Bentham Open, Vol. 12, 2018.
[6] Mohammed Mosab Asad, Ibrahim Marouf, Qasem Abu Al-Haija, "Review Of Fast Multiplication Algorithms For Embedded Systems Design", International Journal Of Scientific & Technology Research, Volume 6, Issue 08, 2017.
[7] Heath, Steve (2003). Embedded systems design. EDN series for design engineers (2 ed.). Newnes. p. 2. ISBN 978-0-7506-5546-0. An embedded system is a microprocessor based system that is built to control a function or a range of functions.
[8] I. Marouf, M. M. Asad, A. Bakhuraibah and Q. A. Al-Haija, "Cost analysis study of variable parallel prefix adders using Altera Cyclone IV FPGA kit," 2017 International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, 2017, pp. 1-4. doi: 10.1109/ICECTA.2017.8252011
[9] Altera Co., "Introduction to Quartus II Software: Ver 10.0", Intel Quartus II MNL-01055-1.0, 2012.
[10] Altera Corporation, "Cyclone IV Device Handbook", Vol. 1, CYIV-5V1-2.2, https://www.altera.com/, 2012.
[11] S. Butchibabu, S. Kishore Bab (2014). Design and Implementation of Efficient Parallel Prefix Adders on FPGA, International Journal of Engineering Research & Technology, Vol. 3, Issue No. 7.
[12] B. Parhami, (1999), "Computer Arithmetic: Algorithms and Hardware Designs", Oxford University Press, Oxford.
[13] D. Purohit, H. Joshi, (2014), 'Comparative Study and Analysis of Fast Multipliers', International Journal of Engineering and Technical Research (IJETR), Vol. 2, No. 7, 2014.
[14] A. Karatsuba and Y. Ofman, (1963) 'Multiplication of Multidigit Numbers on Automata', Soviet Physics, Doklady, pp. 595-596.
https://www.researchgate.net/publication/234346907_Multiplication_ of_Multidigit_Numbers_on_Automata [15] Qasem Abu Al-Haija, Mohamad M.Asad, Ibrahim Marouf,"A Systematic Expository Review of Schmidt-Samoa Cryptosystem", International Journal of Mathematical Sciences and Computing(IJMSC), Vol.4, No.2, pp.12-21, 2018.DOI: 10.5815/ijmsc.2018.02.02 [16] Qasem Abu Al-Haija, Mahmoud Smadi, Monther Al-Ja’fari, Abdullah Al-Shua’ibi, "Efficient FPGA implementation of RSA coprocessor using scalable modules", Procedia Computer Science, Elsevier, Vol 34, 2014. [17] Qasem Abu Al-Haija, Mohammad Alkhatib, Azmi B Jaafar, "Choices on Designing GF(P) Elliptic Curve Coprocessor Benefiting from Mapping Homogeneous Curves in Parallel Multiplications", International Journal on Computer Science and Engineering, Engg Journals Publications, Vol. 3, No. 2, 2011. work_2ha3mubt3zcffhnqhokzofd3ai ---- Enhancing discovery in spatial data infrastructures using a search engine Enhancing discovery in spatial data infrastructures using a search engine Paolo Corti1, Athanasios Tom Kralidis2 and Benjamin Lewis1 1 Center for Geographic Analysis, Harvard University, Cambridge, MA, USA 2 Open Source Geospatial Foundation, Beaverton, OR, USA ABSTRACT A spatial data infrastructure (SDI) is a framework of geospatial data, metadata, users and tools intended to provide an efficient and flexible way to use spatial information. One of the key software components of an SDI is the catalogue service which is needed to discover, query and manage the metadata. Catalogue services in an SDI are typically based on the Open Geospatial Consortium (OGC) Catalogue Service for the Web (CSW) standard which defines common interfaces for accessing the metadata information. A search engine is a software system capable of supporting fast and reliable search, which may use ‘any means necessary’ to get users to the resources they need quickly and efficiently. These techniques may include full text search, natural language processing, weighted results, fuzzy tolerance results, faceting, hit highlighting, recommendations and many others. In this paper we present an example of a search engine being added to an SDI to improve search against large collections of geospatial datasets. The Centre for Geographic Analysis (CGA) at Harvard University re-engineered the search component of its public domain SDI (Harvard WorldMap) which is based on the GeoNode platform. A search engine was added to the SDI stack to enhance the CSW catalogue discovery abilities. It is now possible to discover spatial datasets from metadata by using the standard search operations of the catalogue and to take advantage of the new abilities of the search engine, to return relevant and reliable content to SDI users. Subjects Human-Computer Interaction, Spatial and Geographic Information Systems Keywords Data discovery, Catalogue Service for the Web, Metadata, WorldMap, Geoportal, Search engine, Spatial Data Infrastructure, pycsw, Solr, GeoNode INTRODUCTION A spatial data infrastructure (SDI) typically stores a large collection of metadata. While the Open Geospatial Consortium (OGC) recommends the use of the catalogue service for the web (CSW) standard to query these metadata, several important benefits can be obtained by pairing the CSW with a search engine platform within the SDI software stack. SDI, interoperability, and standards An SDI is a framework of geospatial data, metadata, users and tools which provides a mechanism for publishing and updating geospatial information. 
An SDI provides the architectural underpinnings for the discovery, evaluation and use of geospatial information (Nebert, 2004; Goodchild, Fu & Rich, 2007; Masó, Pons & Zabala, 2012). SDIs are typically distributed in nature, and connected by disparate computing platforms and client/server design patterns. A critical principle of an SDI is interoperability, which can be defined as the ability of a system or of components in a system to provide information sharing and inter-application cooperative process control through a mutual understanding of request and response mechanisms embodied in standards. Standards (formal, de facto, community) provide three primary benefits for geospatial information: (a) portability: use and reuse of information and applications, (b) interoperability: multiple system information exchange and (c) maintainability: long term updating and effective use of a resource (Groot & McLaughlin, 2000). The OGC standards baseline has traditionally provided core standards definitions to major SDI activities. Along with other standards bodies (IETF, ISO, OASIS) and de facto/community efforts (Open Source Geospatial Foundation (OSGeo), etc.), OGC standards provide broadly accepted, mature specifications, profiles and best practices (Kralidis, 2009).
Metadata search in an SDI and CSW
An SDI can contain a large number of geospatial datasets which may grow in number over time. The difficulty of finding a needle in such a haystack means a more effective metadata search mechanism is called for. Metadata is data about data: it describes the content, quality, condition and other characteristics of data in order to ease the search for and understanding of data (Nogueras-Iso, Zarazaga-Soria & Muro-Medrano, 2005). Metadata standards define a way to provide homogeneous information about the identification, the extent, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution and other properties of digital geographic data and services (ISO 19115-1: 2014, 2014). Ease of data discovery is a critical measure of the effectiveness of an SDI. The OGC CSW standard specifies the interfaces and bindings, as well as a framework for defining the application profiles, required to publish and access digital catalogues of metadata for geospatial data and services (Open Geospatial Consortium, 2016; Nebert, Whiteside & Vretanos, 2005; Rajabifard, Kalantari & Binns, 2009). Based on the Dublin Core metadata information model, CSW supports broad interoperability around discovering geospatial data and services spatially, non-spatially, temporally, and via keywords or free text.
CSW supports application profiles which allow information communities to constrain and/or extend the CSW specification to satisfy specific discovery requirements and to realize tighter coupling and integration of geospatial data and services. The CSW ISO Application Profile is an example of a standard for geospatial data search which follows the ISO geospatial metadata standards.
CSW catalogue within the SDI architecture
In a typical SDI architecture the following components can be identified:
- GIS clients: desktop GIS tools or web based viewers.
- Spatial data server: returns geospatial data to map clients in a range of formats.
- Cache data server: returns cached tiles to map clients to improve performance.
- Processing server: responsible for the processing of the geospatial datasets.
- Spatial repository: a combination of a spatial database and file system, where the geospatial data is stored.
- Catalogue server: used by map clients to query the metadata of the spatial datasets to support discovery.
Desktop GIS clients generally access the SDI data directly from the spatial repository or the file system. When the user has appropriate permissions, it is possible from these clients to perform advanced operations, which are generally faster than when performed over OGC web standards. Web based viewers access the SDI data served by the spatial data server using a number of OGC web standards over HTTP, typically WMS/WMTS/WMS-C when the data only needs to be rendered, or WFS/WCS when access to the native information is needed, for vector or coverage datasets respectively. WFS-T can be used for editing vector datasets. Web viewers can run GIS SDI processes by using the WPS standard exposed by the processing server. All of these OGC standards can be used by desktop GIS clients as well. The spatial repository is generally a combination of an RDBMS with a spatial extension and the file system where data not held in the database are stored. The catalogue, based on the CSW standard, lets users discover data and services in an SDI. CSW is a standard for exposing a catalogue of geospatial entities over the HTTP request/response cycle. In an SDI or portal, CSW endpoints are provided by a CSW catalogue. Popular open source implementations of CSW catalogues include (but are not limited to) pycsw (http://pycsw.org/), GeoNetwork (https://geonetwork-opensource.org/), deegree (https://www.deegree.org/) and Esri Geoportal Server (https://www.esri.com/en-us/arcgis/products/geoportal-server/overview). A CSW catalogue implements a number of operations which are accessible via HTTP. Some of these operations are optional:
- GetCapabilities retrieves service metadata from the server.
- DescribeRecord allows a client to discover the data model of a specific catalogue information model.
- GetRecords searches for records using a series of criteria, which can be spatial, aspatial, logical or comparative.
- GetRecordById retrieves metadata for one record (layer) of the catalogue by its id.
- GetDomain (optional) retrieves runtime information about the range of values of a metadata record element or request parameter.
- Harvest (optional) creates or updates metadata with a request to the server to 'pull' metadata from some endpoint.
- Transaction (optional) creates or edits metadata with a request to the server.
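As an illustration of how a client exercises these operations, the short Python sketch below uses the OWSLib client library to send a GetRecords request with a full-text constraint to a CSW endpoint. The endpoint URL and the search term are placeholders, and error handling is omitted; this is a minimal sketch rather than part of the architecture described in this paper.

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# hypothetical CSW endpoint (e.g. a pycsw instance); replace with a real catalogue URL
csw = CatalogueServiceWeb('https://example.org/csw')

# full-text constraint on the csw:AnyText queryable
query = PropertyIsLike('csw:AnyText', '%landuse%')
csw.getrecords2(constraints=[query], maxrecords=10, esn='summary')

# print the identifiers and titles of the matching metadata records
for rec_id, rec in csw.records.items():
    print(rec_id, rec.title)
```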
Need for a search engine within an SDI Search workflow and user experience are a vital part of modern web-based applications. Numerous types of web application, such as Content Management Systems (CMS), wikis, data delivery frameworks, all can benefit from improved data discovery. Same applies Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 3/15 http://pycsw.org/ https://geonetwork-opensource.org/ https://www.deegree.org/ https://www.esri.com/en-us/arcgis/products/geoportal-server/overview https://www.esri.com/en-us/arcgis/products/geoportal-server/overview http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ to SDI. Furthermore, in the Big Data era, more powerful mechanisms are needed to return relevant content to the users from very large collections of data (Tsinaraki & Schade, 2016). In the last few years, content-driven platforms have delegated the task of search optimization to specific frameworks known as search engines. Rather than implementing a custom search logic, these platforms now often add a search engine in the stack to improve search. Apache Solr (http://lucene.apache.org/solr/) and Elasticsearch (https:// www.elastic.co/), two popular open source search engine web platforms, both based on Apache Lucene (https://lucene.apache.org/), are commonly used in typical web application stacks to support complex search criteria, faceting, results highlighting, query spell-check, relevance tuning and more (Smiley et al., 2015). As for CMS’s, SDI search can dramatically benefit from such platforms as well. How a search engine works Typically the way a search engine works can be split into two distinct phases: indexing and searching. During the indexing phase, all of the documents (metadata, in the SDI context) that must be searched are scanned, and a list of search terms (an index) is built. For each search term, the index keeps track of the identifiers of the documents that contain the search term. During the searching phase only the index is looked at, and a list of the documents containing the given search term is quickly returned to the client. This indexed approach makes a search engine extremely fast in outputting results. On top of this, a search engine provides many other useful search related features, improving dramatically the experience of users. Improvements in an SDI with a search engine There are numerous opportunities to enhance the functionality of the CSW specification and subsequent server implementations by specifying standard search engine functionality as enhancements to the standard. A search engine is extremely fast and scalable: by building and maintaining its indexed structure of the content, it can return results much faster and scale much better than a traditional CSW based on a relational database. While a CSW can search metadata with a full text approach, with a search engine it is possible to extend the full text search with features such as language stemming, thesaurus and synonyms, hit highlighting, wild-card matches and other ‘fuzzy’ matching techniques. Another key advantage is that search engines can provide relevancy scores for likely matches, allowing for much finer tuning of search results. CSW does not easily emit facets or facet counts as part of search results. Search engine facets however, can be based on numerous classification schemes, such as named geography, date and time extent, keywords, etc. 
and can be used to enable interactive feedback mechanisms which help users define and refine their searches effectively. BACKGROUND Harvard WorldMap (http://worldmap.harvard.edu/) is an open source SDI and Geospatial Content Management System (GeoCMS) platform developed by the Centre for Geographic Analysis (CGA) to lower the barrier for scholars who wish to explore, visualize, edit and publish geospatial information (Guan et al., 2012). Registered users are Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 4/15 http://lucene.apache.org/solr/ https://www.elastic.co/ https://www.elastic.co/ https://lucene.apache.org/ http://worldmap.harvard.edu/ http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ able to upload geospatial content, in the form of vector or raster datasets (layers), and combine them with existing layers to create maps. Existing layers can be layers uploaded by other users and layers provided by external map services. WorldMap is a web application built on top of the GeoNode open source mapping platform (http://geonode.org/), and since 2010 has been used by more than 20,000 registered users to upload about 30,000 layers and to create some 5,000 web maps. Users can also access about 90,000 layers from remote map services based on OGC standards and Esri REST protocols. WorldMap is based on the following components, all open source and designed around OGC standards (Fig. 1): � A JavaScript web GIS client, GeoExplorer (http://suite.boundlessgeo.com/docs/latest/), based on OpenLayers (https://openlayers.org/) and ExtJS (https://www.sencha.com/ products/extjs/). Figure 1 The WorldMap SDI architecture. Full-size DOI: 10.7717/peerj-cs.152/fig-1 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 5/15 http://geonode.org/ http://suite.boundlessgeo.com/docs/latest/ https://openlayers.org/ https://www.sencha.com/products/extjs/ https://www.sencha.com/products/extjs/ http://dx.doi.org/10.7717/peerj-cs.152/fig-1 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ � A spatial data server based on GeoServer (http://geoserver.org/). � A cache data server based on GeoWebCache (http://geowebcache.org/). � A spatial database implemented with PostgreSQL (https://www.postgresql.org/) and PostGIS (https://postgis.net/). � A catalogue based on pycsw or GeoNetwork. � Aweb application, developed with Django (https://www.djangoproject.com/), a Python web framework, which orchestrates all of the previous components. WorldMap allows users to build maps using its internal catalogue of layers (local layers) combined with layers from external map services (remote layers), for a total of about 120,000 layers. WorldMap users can have trouble finding useful and reliable layers given the large number of them; a system was needed to enable fast, scalable search capable of returning the most reliable and useful layers within a large and heterogeneous collection. RESULTS AND DISCUSSION In 2015 CGA started the design and development of Hypermap Registry (Hypermap) (https://github.com/cga-harvard/Hypermap-Registry) to improve search for WorldMap users. Hypermap is an application that manages OGC web services (such as WMS, WMTS, CSW Capabilities service metadata) as well as Esri RESTendpoints. In addition it supports map service discovery (Chen et al., 2011), crawling (Bone et al., 2016; Li, Yang & Yang, 2010), harvesting and uptime statistics gathering for services and layers. 
One of the main purposes of Hypermap is to bring enhanced search engine capabilities into an SDI architecture. As it can be seen from the following Fig. 2, search engine documents, based on a provided schema, must be kept in synchrony with layer metadata stored in the GeoNode RDBMS. Hypermap is responsible for ensuring that the WorldMap search engine, based on Apache Solr, and the WorldMap catalogue RDBMS, based on PostgreSQL, are kept in sync. For example, when a WorldMap user updates the metadata information for one layer from the WorldMap metadata editing interface, that information is updated in the WorldMap pycsw backend, which is based on the RDBMS. As soon as this happens, a synchronization task is sent from Hypermap to the task queue. The task will be processed by the task queue, and all of the metadata information for the layer will be synced to the corresponding search engine document. Thanks to this synchronization mechanism, WorldMap users can search the existing layers metadata using a search engine rather than the OGC catalogue, enabling more flexible searches which filter on keywords, source, layer type, map extent and date range (Corti & Lewis, 2017). The results are returned by the search engine which returns a JSON response, and tabular in addition to spatial views (based on spatial facets) are returned to the browser (Fig. 2). WorldMap improvements with the search engine By pairing the CSW catalogue with a search engine, the metadata search in the WorldMap SDI yields several major benefits. Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 6/15 http://geoserver.org/ http://geowebcache.org/ https://www.postgresql.org/ https://postgis.net/ https://www.djangoproject.com/ https://github.com/cga-harvard/Hypermap-Registry http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Fast results By having the metadata content indexed in a search engine, metadata are returned very rapidly to the client. Because of its indexed documents nature, a search engine is much faster to return results when compared it with a relational database. Therefore, using a search engine in WorldMap search client makes things much faster than using a CSW catalogue based on a RDBMS. Scalability From a software engineering perspective, search engines are highly scalable and replicable, thanks to their shardable architecture. Such systems are capable of providing interactive query access to collections of spatio-temporal objects containing billions of features (Kakkar & Lewis, 2017; Kakkar et al., 2017). Clean API Query to the search engine API tends to be much simpler than XML queries to the CSW catalogue, specially when crafting advanced search requests (spatial, non-spatial, temporal, etc.). Same for output: JSON output from search engine API provides a more compact representation of search results enabling better performance and making the output more readable (Figs. 3 and 4). Figure 2 Metadata RDBMS to search engine synchronization in Harvard WorldMap. Full-size DOI: 10.7717/peerj-cs.152/fig-2 Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 7/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-2 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Synonyms, text stemming Crucially, search engines are good at handling the ambiguities of natural languages, thanks to stop words (words filtered out during the processing of text), stemming (ability to detect words derived from a common root), synonyms detection and controlled vocabularies such as thesauri and taxonomies. It is possible to do phrase searches and proximity searches (search for a phrase containing two different words separated by a specified number of words). Because of features like these, keyword queries using the Hypermap search engine endpoint typically returns more results than an equivalent Figure 3 CSW request and response. Full-size DOI: 10.7717/peerj-cs.152/fig-3 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 8/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-3 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Figure 4 Search engine request and response. Full-size DOI: 10.7717/peerj-cs.152/fig-4 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 9/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-4 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ query using the Hypermap CSW. For example doing a full text search for the keyword ‘library’ returns more results from the search engine because it includes variations and synonyms of the original term like ‘libraries,’ ‘bibliotheca,’ ‘repository,’ ‘repositories’ in the returned results. Figure 5 Facets generate counts for metadata categories and geographic regions in a GeoCMS. Full-size DOI: 10.7717/peerj-cs.152/fig-5 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 10/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-5 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Relevancy Results can be ranked, providing a way to return results to users with the more relevant ones closer to the top. This is very useful to detect the most significative metadata for a given query. Weights can be assigned by specifying boosts (weighted factors) for each field. Facets Another important search engine feature useful for searching the WorldMap metadata catalogue is faceted search. Faceting is the arrangement of search results in categories based on indexed terms. This capability makes it possible, for example to provide an immediate indication of the number of times that common keywords are contained in different metadata documents. A typical use case is with metadata categories, keywords and regions. Thanks to facets, the user interface of an SDI catalogue can display counts for documents by category, keyword or region (Fig. 5). Search engines can also support temporal and spatial faceting, two features that are extremely useful for browsing large collections of geospatial metadata. Temporal faceting can display the number of metadata documents by date range as a kind of histogram. Spatial faceting can provide a spatial surface representing the distribution of layers or features across an area of interest. In Fig. 6, a heatmap is generated by spatial faceting which shows the distribution of layers in the WorldMap SDI for a given geographic region (Fig. 6). Figure 6 Spatial faceting enables heatmaps showing the distribution of the SDI layers in the space. Full-size DOI: 10.7717/peerj-cs.152/fig-6 Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 11/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-6 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Other features In addition, it is possible to use regular expressions, wildcard search and fuzzy search to provide results for a given term and its common variations. It is also possible to support boolean queries: a user is able to search results using terms and boolean operators such as AND, OR, NOT and hit highlighting can provide immediate search term suggestions to the user searching a text string in metadata. CONCLUSION While the CSW 3.0.0 standard provides improvements to address mass market search/ discovery, the benefits of search engine implementations combined with broad interoperability of the CSW standard presents a great opportunity to enhance the CSW Figure 7 pycsw interaction with the search engine using an application profile and using a basic profile (when pycsw will provide direct support for the search engine). Full-size DOI: 10.7717/peerj-cs.152/fig-7 Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 12/15 http://dx.doi.org/10.7717/peerj-cs.152/fig-7 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ standard. The authors hope that such an approach eventually becomes formalized as a CSW Application Profile or Best Practice in order to achieve maximum benefit and adoption in SDI activities. This will allow CSW implementations to make better use of search engine methodologies for improving the user search experience in SDI workflows. In addition, pycsw is planning for dedicated Elasticsearch/Solr support as part of a future release to enable the use of search engines as backend stores to the CSW standard. This is a different approach from using an Application Profile or Best Practice, as it directly interacts with data in the search engine rather than in the RDBMS (Fig. 7). The authors would like to share this work with the OGC CSW community in support of the evolution of the CSW specification. Given recent developments on the OGC WFS 3.0 standard (RESTful design patterns, JSON, etc.), there is an opportunity for CSW to evolve in alignment with WFS 3.0 in support of the principles of the W3C Spatial Data on the Web Best Practices (Group, 2017) in a manner similar to the work presented in this paper. ACKNOWLEDGEMENTS The authors thank all the contributors to the Hypermap and GeoNode platform source code, particularly: Wayner Barrios, Matt Bertrand, Simone Dalmasso, Alessio Fabiani, Jorge Martı́nez Gómez, Wendy Guan, Jeffrey Johnson, Devika Kakkar, Jude Mwenda, Ariel Núñez, Luis Pallares, David Smiley, Charles Thao, Angelos Tzotsos, Mingda Zhang. ADDITIONAL INFORMATION AND DECLARATIONS Funding This work is partially funded by the U.S. National Endowment for the Humanities, Digital Humanities Implementation Grant #HK5009113 and the U.S. National Science Foundation Industry-University Cooperative Research Centers Program (IUCRC) grant for the Spatiotemporal Thinking, Computing, and Applications Center (STC) #1338914, and by Harvard University. Grant administration was supported by Harvard’s Institute for Quantitative Social Science. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: U.S. 
National Endowment for the Humanities, Digital Humanities Implementation: #HK5009113. U.S. National Science Foundation Industry-University Cooperative Research Centers Program (IUCRC). Spatiotemporal Thinking, Computing, and Applications Center (STC): #1338914. Harvard University. Harvard’s Institute for Quantitative Social Science. Corti et al. (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.152 13/15 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Competing Interests The authors declare that they have no competing interests. Author Contributions � Paolo Corti conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. � Athanasios Tom Kralidis performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. � Benjamin Lewis performed the experiments, analyzed the data, contributed reagents/ materials/analysis tools, performed the computation work, authored or reviewed drafts of the paper, approved the final draft. Data Availability The following information was supplied regarding data availability: Hypermap Registry: https://github.com/cga-harvard/Hypermap-Registry REFERENCES Bone C, Ager A, Bunzel K, Tierney L. 2016. A geospatial search engine for discovering multi- format geospatial data across the web. International Journal of Digital Earth 9(1):47–62 DOI 10.1080/17538947.2014.966164. Chen N, Chen Z, Hu C, Di L. 2011. A capability matching and ontology reasoning method for high precision OGC web service discovery. International Journal of Digital Earth 4(6):449–470 DOI 10.1080/17538947.2011.553688. Corti P, Lewis B. 2017. Making temporal search more central in spatial data infrastructures. In: ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. Germany: Copernicus Publications, 93–95. Goodchild MF, Fu P, Rich P. 2007. Sharing geographic information: an assessment of the geospatial one-stop. Annals of the Association of American Geographers 97(2):250–266 DOI 10.1111/j.1467-8306.2007.00534.x. Groot R, McLaughlin JD. 2000. Geospatial Data Infrastructure: Concepts, Cases, and Good Practice. Oxford: Oxford University Press. Group OWW. 2017. Spatial data on the web best practices. Available at https://www.w3.org/TR/ sdw-bp/ (accessed 12 March 2018). Guan WW, Bol PK, Lewis BG, Bertrand M, Berman ML, Blossom JC. 2012. Worldmap—a geospatial framework for collaborative research. Annals of GIS 18(2):121–134 DOI 10.1080/19475683.2012.668559. ISO 19115-1: 2014. 2014. Geographic Information—Metadata—Part 1: Fundamentals. Geneva: International Standards Organisation. Kakkar D, Lewis B. 2017. Building a billion spatio-temporal object search and visualization platform. In: ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. Germany: Copernicus Publications, 97–100. Corti et al. (2018), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.152 14/15 https://github.com/cga-harvard/Hypermap-Registry http://dx.doi.org/10.1080/17538947.2014.966164 http://dx.doi.org/10.1080/17538947.2011.553688 http://dx.doi.org/10.1111/j.1467-8306.2007.00534.x https://www.w3.org/TR/sdw-bp/ https://www.w3.org/TR/sdw-bp/ http://dx.doi.org/10.1080/19475683.2012.668559 http://dx.doi.org/10.7717/peerj-cs.152 https://peerj.com/computer-science/ Kakkar D, Lewis B, Smiley D, Nunez A. 2017. The billion object platform (bop): a system to lower barriers to support big, streaming, spatio-temporal data sources. Free and Open Source Software for Geospatial (FOSS4G) Conference Proceedings 17:15 DOI 10.7275/R5ST7N0G. Kralidis AT. 2009. Geospatial web services: the evolution of geospatial data infrastructure. In: The Geospatial Web. London: Springer, 223–228. Li W, Yang C, Yang C. 2010. An active crawler for discovering geospatial web services and their distribution pattern—a case study of OGC web map service. International Journal of Geographical Information Science 24(8):1127–1147 DOI 10.1080/13658810903514172. Masó J, Pons X, Zabala A. 2012. Tuning the second-generation SDI: theoretical aspects and real use cases. International Journal of Geographical Information Science 26(6):983–1014 DOI 10.1080/13658816.2011.620570. Nebert DD. 2004. Developing Spatial Data Infrastructures: the SDI cookbook. Global Spatial Data Infrastructure (GSDI) Association. Available at http://gsdiassociation.org/images/publications/ cookbooks/SDI_Cookbook_GSDI_2004_ver2.pdf. Nebert D, Whiteside A, Vretanos P. 2005. OGC catalogue services specification. Open Geospatial Consortium Inc. Available at https://portal.opengeospatial.org/files/?artifact_id=20555. Nogueras-Iso J, Zarazaga-Soria FJ, Muro-Medrano PR. 2005. Geographic Information Metadata for Spatial Data Infrastructures: Resources, Interoperability and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg. Open Geospatial Consortium. 2016. Catalogue service. Available at http://www.opengeospatial. org/standards/cat/ (accessed 12 March 2018). Rajabifard A, Kalantari M, Binns A. 2009. SDI and metadata entry and updating tools. In: SDI Convergence. Available at https://minerva-access.unimelb.edu.au/bitstream/handle/ 11343/26084/115448_SDIandMetadataEntryandUpdatingtool.pdf. Smiley D, Pugh E, Parisa K, Mitchell M. 2015. Apache Solr Enterprise Search Server. Birmingham: Packt Publishing Ltd. Tsinaraki C, Schade S. 2016. Big data—a step change for SDI? International Journal 11:9–19. Corti et al. (2018), PeerJ Comput. 
work_2iy7olcp6bgspd4hfeinespmbm ---- None work_2lytvedz4zhc3ne5onhq7issre ----
International Journal of Advanced Network, Monitoring and Controls Volume 04, No.04, 2019 DOI: 10.21307/ijanmc-2019-074 74
On the RFID Information Query Technology Based on IPV9
Li Guiping, The School of Computer Science and Engineering, Xi'an Technological University, Xi'an, China, E-mail: 15693685@qq.com
Xue Lei, Shandong University of Science and Technology, 223 Daizong Street, Tai'an City, Shandong Province, 271000, E-mail: Shirleyxue06@163.com
Abstract—Since the coding format of RF labels is inconsistent with the protocol format of the information server's network, a design scheme for a network architecture is proposed to achieve connectivity between the decimal network based on IPV9 and the Internet based on IPV4/IPV6. In addition, two ways of querying information and achieving connectivity, based on D-ONS and on direct routing within the decimal network, are devised by using expert modules.
The experimental results show that both approaches can provide an efficient RFID information query service.

Keywords-IPV9; RFID; Domain Transformation; Information Query

I. INTRODUCTION
With the development of radio frequency technology, product-related information can be obtained quickly by readers if RF tags are assigned to each product. Recently, people have begun to adopt the IPV6 and even IPV9 protocol formats for RF tag coding, since the IPV4 address space has been exhausted. If the server storing the product-related information sits in an IPV4 or IPV6 network, the problem of querying RF information across networks has to be solved. This paper studies the RF information query process for interconnecting the decimal network with the Internet, and the domain name conversion rules obtained by the decimal network query server.

II. IPV9
IPV9, short for the decimal network and the new generation of Internet, is the result of China's independent innovation in core technologies: the IPV9 protocol, IPV9 addressing, the digital domain name system and other core technologies are adopted with original and independent intellectual property rights. Fully digital text is used to represent the IP address, and the address space is larger than that of IPV4 or IPV6. The 1st to 41st levels of the address space are expressed as binary 256 bits, while the 42nd level is expressed as decimal 256 bits.

III. RFID
Radio Frequency Identification (RFID) is a non-contact automatic identification technology. It communicates with other objects using reflected power. It can automatically identify target objects and obtain relevant data through radio frequency signals, which makes it possible to track items and exchange data quickly, and the identification requires no human participation. A typical RFID system consists of an electronic tag, a reader (including an antenna) and an application system. Electronic tags are the data carrier of an RFID system; each consists of a label antenna and a label chip. A tag can receive the reader's electromagnetic field modulation signal and return a response signal, so that the label identification code and memory data can be read or written. The reader is used to receive host commands and send the data stored in the tag back to the host in a wired or wireless way. It contains a controller and an antenna; if the reading distance is long, the antenna stands alone. The terminal computer of the application system interacting with the RFID system transmits the work instructions issued by the MIS application system. It coordinates the electronic tags and readers through middleware, processes all data collected by the RFID system, and carries out calculation, storage and data transmission. The process can be described as in Fig. 1 and Fig. 2.

Figure 1. The query process of information based on RFID (RFID reader/writer, RFID antenna, RFID electronic tag and the computer control network exchanging timing, data and energy).

Figure 2. Electronic tags.

The operating principle of an RFID system is that, when an item with an electronic tag enters the radiation range of the reader antenna, the tag receives the radio frequency signal emitted by the reader. A passive tag sends the data stored in the tag chip using the energy generated by the induced current, while an active electronic tag can send the data stored in the tag chip actively. Generally, readers are equipped with middleware that can read the data, decode it, carry out simple data processing directly, and then send the result to the application system.
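To make this read-decode-forward role of the reader middleware concrete, the following minimal Python sketch models a tag read event and a middleware layer that drops duplicate reads before forwarding them to the application system. The event fields, class names and the de-duplication step are illustrative assumptions, not part of any RFID standard or of the architecture described in this paper.

# Illustrative sketch only: a reader middleware that forwards de-duplicated
# tag reads to the application system, as described in Section III.
from dataclasses import dataclass
import time

@dataclass
class TagRead:
    tag_id: str        # identifier stored in the tag chip (e.g., an IPV9-coded identifier)
    reader_id: str     # which reader/antenna produced the read
    timestamp: float   # time of the read event

class ReaderMiddleware:
    """Hypothetical middleware: filters duplicate reads and forwards the rest."""

    def __init__(self, dedup_window_s: float = 2.0):
        self.dedup_window_s = dedup_window_s
        self._last_seen = {}  # tag_id -> timestamp of the last forwarded read

    def on_read(self, read: TagRead, forward):
        """Forward a read unless the same tag was forwarded very recently."""
        last = self._last_seen.get(read.tag_id)
        if last is not None and read.timestamp - last < self.dedup_window_s:
            return  # duplicate read within the window; drop it
        self._last_seen[read.tag_id] = read.timestamp
        forward(read)

if __name__ == "__main__":
    mw = ReaderMiddleware()
    app_log = []
    for _ in range(3):  # three raw reads of the same tag in quick succession
        mw.on_read(TagRead("86.21.4.586", "reader-01", time.time()), app_log.append)
    print(len(app_log), "read(s) forwarded to the application system")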
The application system judges the legitimacy of electronic tags according to logic operations, and carries out the corresponding processing and control for different settings, thus realizing the basic functions of an RFID system.

IV. NETWORK ARCHITECTURE BASED ON IPV9 RFID INFORMATION QUERY TECHNOLOGY
RFID information query technology provides the function of querying product-related information through an RFID tag. The information related to RFID tags is stored on an information server and is generally maintained by the manufacturer of the product. In view of the actual use of the Internet, it is necessary to design a network architecture for interconnection between the decimal network and the Internet that meets certain conditions; on this basis, the RFID tag information query service is implemented. The overall design scheme is as follows.

A. Overall design of the network architecture
The architecture of the RFID tag information query service on the Internet is based on the IPV4 and IPV6 protocols: routing adopts IPV4 and IPV6, and resource positioning is completed by the SNS and PSNS servers. The architecture of the electronic tag information location query service on the decimal network is based on the IPV9 protocol and includes the following two ways.
(1) Using routing to locate the information server directly. The route uses the IPV9 protocol without a DNS resolver.
(2) Adopting the parsing service of the application layer with a U-code Resolution Server. D-ONS uses host domain name resolution to provide IPV4, IPV6 and IPV9 addresses, while using the IPV4, IPV6 and IPV9 protocols as routing protocols; resource positioning is done by D-ONS.
The network architecture of the RFID tag information query service comprises the decimal network information query service system and the Internet information query service system. Specifically, the decimal network architecture includes middleware, an information server, a D-ONS server, a DDNS server and an IPV9 direct router. The Internet architecture includes an SNS server, a PSNS server, an information server, a .cn root DNS server and a DDNS server; the .cn root DNS server connects digital domain names and English domain names through domain name resolution forwarding.

B. The key module of the network architecture: the expert module
The expert module is the middleware used between the decimal network and the Internet to realize the interconnection between the two; the data exchange format between them is XML. It includes the following interfaces (a minimal sketch of such a module follows this list):
- RF information query from the decimal network to the Internet discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the storage address or URI of the product and service information.
- RF information query from the Internet to the decimal network discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the storage address or URI of the product and service information.
- RF information query from the decimal network to the Internet discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the product and service information itself.
- RF information query from the Internet to the decimal network discovery architecture: product and service digital identifiers are submitted, and the discovery architecture is asked, through the expert module, to return the product and service information itself.
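As announced above, here is a minimal Python sketch of an expert module exposing the four interfaces just listed, with XML as the exchange format. The class name, method names and the XML layout are assumptions made for illustration; the paper specifies the roles of the interfaces, not their programming form.

# Illustrative sketch only: the four expert-module interfaces listed above,
# exchanging XML requests between the decimal network and the Internet.
import xml.etree.ElementTree as ET

class ExpertModule:
    """Hypothetical middleware bridging the decimal network and the Internet."""

    def _request(self, network: str, identifier: str, want: str) -> str:
        # Build the XML request exchanged between the two systems.
        req = ET.Element("query", network=network, want=want)
        ET.SubElement(req, "identifier").text = identifier
        return ET.tostring(req, encoding="unicode")

    # Decimal network asks the Internet discovery architecture for an address/URI.
    def decimal_to_internet_address(self, identifier: str) -> str:
        return self._request("internet", identifier, "address")

    # Internet asks the decimal network discovery architecture for an address/URI.
    def internet_to_decimal_address(self, identifier: str) -> str:
        return self._request("decimal", identifier, "address")

    # Decimal network asks the Internet discovery architecture for the information itself.
    def decimal_to_internet_info(self, identifier: str) -> str:
        return self._request("internet", identifier, "info")

    # Internet asks the decimal network discovery architecture for the information itself.
    def internet_to_decimal_info(self, identifier: str) -> str:
        return self._request("decimal", identifier, "info")

if __name__ == "__main__":
    print(ExpertModule().decimal_to_internet_address("86.21.4.586"))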
V. RESEARCH ON THE INFORMATION QUERY PROCESS
Based on the above network architecture, the RFID tag information query service based on IPV9 can be implemented in two ways: a D-ONS-based exchange of queries between the decimal network and the Internet, and a direct-routing mode of mutual access between the decimal network information query service system and the Internet information query service system.

A. Exchange query process between the D-ONS-based decimal network and the Internet information query service system
When the related product information is stored on an Internet information server and the label coding format is the IPV9 format, the query process by which the D-ONS-based decimal network accesses the Internet mainly involves the following key modules: the decimal network query server, the expert module, the Internet service middleware, the information server, and the SNS and PSNS servers. The access relationships between these modules are shown in Figure 3.

Figure 3. The process of accessing the Internet from a D-ONS-based decimal network (label reader and query server on the decimal network side; expert module, service middleware, SNS server, PSNS server and information server on the Internet side).

1) The access process can be described as follows (a sketch of the flow follows these steps):
a) RFID readers read the IPV9 identifiers and the product and service identifiers from the electronic tags.
b) The read identifiers and the product and service identifiers are submitted to the query server in the decimal network.
c) The decimal network query server calls the Internet interface of the expert module to access the Internet.
d) The Internet interface of the expert module accesses the middleware of the Internet architecture with the standard identifiers and the product and service identifiers.
e) The service middleware converts the standard identifier into the domain name format and sends it to the SNS server to request the resolution service.
f) The SNS server returns the conversion rules for the PSID domain name, in the form of regular expressions, to the service middleware.
g) The service middleware issues a query request to the PSNS server based on the PSID domain name.
h) The PSNS server returns the NAPTR record containing the address of the product and service information service or of the PSDS.
i) The service middleware returns the results to the expert module, whose decimal network interface returns the NAPTR records with the product and service information service address or the PSDS address to the query server.
j) The query server issues the request and finally obtains the product information returned by the information server.
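The following Python sketch writes the D-ONS-based flow above (steps a to j) as a pipeline of calls. Every function name, the rewrite-rule syntax and the record contents are hypothetical; the paper defines the roles of the components, not their programming interfaces.

# Illustrative sketch only: the D-ONS-based decimal-network-to-Internet query
# flow, with the SNS step returning a regular-expression rewrite rule.
import re

def sns_lookup(standard_domain: str) -> str:
    """Step f: the SNS server returns a domain-conversion rule as a regular expression."""
    # Assumed rule: rewrite "<product>.<service>.example" into a PSID domain.
    return r"^(?P<product>[^.]+)\.(?P<service>[^.]+)\.example$ -> \g<service>.psid.example"

def apply_rule(rule: str, domain: str) -> str:
    """Apply a 'pattern -> replacement' rewrite rule to a domain name."""
    pattern, replacement = [part.strip() for part in rule.split("->")]
    return re.sub(pattern, replacement, domain)

def psns_lookup(psid_domain: str) -> dict:
    """Steps g-h: the PSNS server returns a NAPTR-like record with the server address."""
    return {"type": "NAPTR", "service": psid_domain, "target": "info-server.example"}

def query_product_info(ipv9_identifier: str, product_domain: str) -> str:
    # Steps a-d: reader -> decimal query server -> expert module -> Internet middleware.
    rule = sns_lookup(product_domain)                # step f
    psid_domain = apply_rule(rule, product_domain)   # middleware applies the rule
    record = psns_lookup(psid_domain)                # steps g-h
    # Steps i-j: the address travels back through the expert module, and the
    # query server finally fetches the product information from that address.
    return f"fetched info for {ipv9_identifier} from {record['target']}"

if __name__ == "__main__":
    print(query_product_info("86.21.4.586", "widget.acme.example"))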
When the product-related information is stored on an information server in the decimal network and the label encoding format is the IPV4 or IPV6 format, the D-ONS-based decimal network is accessed from the Internet. The access process involves these key modules: the Internet query server, the expert module, the D-ONS server and the information server. The access process is shown in Figure 4.

Figure 4. The process of the Internet accessing the decimal network based on D-ONS (label reader and query server on the Internet side; expert module, D-ONS resolution server, domain resolution server and information server on the decimal network side).

2) The access process can be described as follows:
a) RF readers read the product and service identifiers from electronic tags encoded in the IPV4 or IPV6 format.
b) The identifier and the product and service identifiers are submitted to the query server.
c) The query server calls the decimal network interface of the expert module between the Internet and the decimal network.
d) The decimal network interface of the expert module sends a request to the D-ONS server of the decimal network architecture to look up, from the identifier and the product and service identifiers, the domain name of the server where the product information is stored.
e) The D-ONS server receives the request and returns the product and service domain name.
f) The information server of the decimal network is queried for the product information.
g) The query server returns the product information.

B. Exchange query process between the direct-routing decimal network information query service system and the Internet information query service system
The processes of mutual access between the direct-routing decimal network information query service system and the Internet information query service system are shown in Figures 5 and 6. They involve the query server, the expert module, the middleware, the information server, the SNS server and the PSNS server. The interconnection of IPV4, IPV6 and IPV9 is realized through protocol conversion: a protocol conversion server is set up, and all data packets are converted into the specified protocols to satisfy data communication between the different protocols.

Figure 5. The access process of a direct-routing decimal network to the Internet (label reader and query server on the decimal network side; expert module with protocol conversion between IPV9 and IPV4/IPV6; middleware, SNS server, PSNS server and information server on the Internet side).

1) The process of searching for product information that is stored on the Internet while the tag encoding format is IPV9 can be described as follows (a sketch of the protocol-conversion step follows these steps):
a) The RF reader reads the IPV9 identifier and the product and service identifiers from the electronic tag.
b) The product and service identifiers are submitted to the query server on the decimal network.
c) The query server calls the Internet interface of the expert module.
d) The Internet interface of the expert module accesses the middleware of the Internet architecture with the product and service identifiers.
e) The Internet middleware delivers the product and service domain names to the SNS server.
f) The SNS server returns the intermediate results to the middleware.
g) The middleware returns the results to the expert module.
h) The expert module requests the product information from the information server according to the address information obtained.
i) The expert module returns the product information to the query server of the decimal network, completing the product information query.
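To illustrate the protocol-conversion step in the direct-routing flow above, here is a toy Python sketch that rewrites a query "packet" so it can cross from the IPV9 side to the IPV4/IPV6 side. The packet layout and the address mapping are invented for the example; the paper only states that the conversion server translates addresses, messages and headers between the protocols.

# Illustrative sketch only: a toy protocol-conversion step for direct routing.

def convert_packet(packet: dict, target_protocol: str, address_map: dict) -> dict:
    """Rewrite a packet's protocol and addresses so it can cross networks."""
    converted = dict(packet)
    converted["protocol"] = target_protocol
    # Translate source and destination addresses using the mapping table kept
    # by the (hypothetical) protocol conversion server.
    converted["src"] = address_map.get(packet["src"], packet["src"])
    converted["dst"] = address_map.get(packet["dst"], packet["dst"])
    return converted

if __name__ == "__main__":
    # Hypothetical mapping between IPV9-style identifiers and IPV4 hosts.
    address_map = {"86.21.4.586": "203.0.113.10", "info.server.example": "203.0.113.20"}
    ipv9_query = {
        "protocol": "IPV9",
        "src": "86.21.4.586",
        "dst": "info.server.example",
        "payload": "GET product-info?id=12345",
    }
    print(convert_packet(ipv9_query, "IPV4", address_map))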
In the direct-routing mode, if the product's RF tags are encoded with the IPV4 or IPV6 protocol and the product-related information is stored on a server in the decimal network, the query process involves the IPV9 router, the information server, the expert module and the query server.

Figure 6. The access process from the Internet to the decimal network under direct routing (label reader and query server on the Internet side; expert module with protocol conversion between IPV4/IPV6 and IPV9; IPV9 direct-routing server and information server on the decimal network side).

2) The access process can be described as follows:
a) RF readers read the IPV4 or IPV6 identifiers and the product and service identifiers from the electronic tags.
b) The product and service identifiers are submitted to the query server over the Internet.
c) The query server calls the decimal network interface of the expert module.
d) The decimal network interface of the expert module accesses the IPV9 router of the decimal network architecture with the product and service identifiers.
e) The IPV9 router converts the product and service digital identifiers to the IP address of the product information server.
f) The IPV9 router accesses the information server and requests the product information.
g) The IPV9 router returns the product information to the Internet user via the expert module.

In the above process, the expert module between the two network systems translates and converts the data formats of the two systems; the protocol conversion module can translate the IP address, the message and the header.

VI. CONCLUSION
This paper proposes two kinds of information query exchange services between the decimal network and the Internet, to solve the problem that the encoding format of radio frequency tags is not uniform with the network protocol format of the product information server. Experimental results show that both methods can provide an efficient RF information query service.
More ties than we thought

Dan Hirsch1, Ingemar Markström2, Meredith L. Patterson1, Anders Sandberg3 and Mikael Vejdemo-Johansson2,4,5
1 Upstanding Hackers Inc.
2 KTH Royal Institute of Technology, Stockholm, Sweden
3 Oxford University, UK
4 Jožef Štefan Institute, Ljubljana, Slovenia
5 Institute for Mathematics and its Applications, Minneapolis, USA

Submitted 13 February 2015. Accepted 6 May 2015. Published 27 May 2015.
Corresponding author: Mikael Vejdemo-Johansson, mvj@kth.se. Academic editor: Anne Bergeron.
DOI 10.7717/peerj-cs.2. Copyright 2015 Hirsch et al. Distributed under Creative Commons CC-BY 4.0. OPEN ACCESS.

ABSTRACT
We extend the existing enumeration of neck tie-knots to include tie-knots with a textured front, tied with the narrow end of a tie. These tie-knots have gained popularity in recent years, based on reconstructions of a costume detail from The Matrix Reloaded, and are explicitly ruled out in the enumeration by Fink & Mao (2000). We show that the relaxed tie-knot description language that comprehensively describes these extended tie-knot classes is context free. It has a regular sub-language that covers all the knots that originally inspired the work. From the full language, we enumerate 266,682 distinct tie-knots that seem tie-able with a normal neck-tie. Out of these 266,682, we also enumerate 24,882 tie-knots that belong to the regular sub-language.

Subjects: Algorithms and Analysis of Algorithms, Computational Linguistics, Theory and Formal Methods
Keywords: Necktie knots, Formal language, Automata, Chomsky hierarchy, Generating functions

INTRODUCTION
There are several different ways to tie a necktie (Fig. 1). Classically, knots such as the four-in-hand, the half windsor and the full windsor have commonly been taught to new tie-wearers. In a sequence of papers and a book, Fink & Mao (2001), Fink & Mao (2000) and Fink & Mao (1999) defined a formal language for describing tie-knots, encoding the topology and geometry of the knot tying process into the formal language, and then used this language to enumerate all tie-knots that could reasonably be tied with a normal-sized necktie. The enumeration of Fink and Mao crucially depends on dictating a particular finishing sequence for tie-knots: a finishing sequence that forces the front of the knot—the façade—to be a flat stretch of fabric. With this assumption in place, Fink and Mao produce a list of 85 distinct tie-knots, and determine several novel knots that extend the previously commonly known list of tie-knots.
In recent years, however, interest has been growing for a new approach to tie-knots. In The matrix reloaded (Wachowski et al., 2003), the character of "The Merovingian" has a sequence of particularly fancy tie-knots. Attempts by fans of the movie to recreate the tie-knots from the Merovingian have led to a collection of new tie-knot inventions, all of which rely on tying the tie with the thin end of the tie—the thin blade.
Doing this allows for a knot with textures or stylings of the front of the knot, producing symmetric and pleasing patterns.

Figure 1 Some specific tie-knot examples. Top row from left: the Trinity (L-110.4), the Eldredge (L-373.2) and the Balthus (C-63.0, the largest knot listed by Fink and Mao). Bottom row, randomly drawn knots. From left: L-81.0, L-625.0, R-353.0.

Knorr (2010) gives the history of the development of novel tie-knots. It starts out in 2003 when the edeity knot is published as a PDF tutorial. Over the subsequent 7 years, more and more enthusiasts involve themselves, publish new approximations of the Merovingian tie-knot as PDF files or YouTube videos. By 2009, the new tie-knots are featured on the website Lifehacker and go viral.
In this paper, we present a radical simplification of the formal language proposed by Fink and Mao, together with an analysis of the asymptotic complexity class of the tie-knots language. We produce a novel enumeration of necktie-knots tied with the thin blade, and compare it to the results of Fink and Mao.

Formal languages
The work in this paper relies heavily on the language of formal languages, as used in theoretical computer science and in mathematical linguistics. For a comprehensive reference, we recommend the textbook by Sipser (2006).
Recall that given a finite set L called an alphabet, the set of all sequences of any length of items drawn (with replacement) from L is denoted by L∗. A formal language on the alphabet L is some subset A of L∗. The complexity of the automaton required to determine whether a sequence is an element of A places A in one of several complexity classes. Languages that are described by finite state automata are regular; languages that require a pushdown automaton are context free; languages that require a linear bounded automaton are context sensitive; and languages that require a full Turing machine to determine are called recursively enumerable. This sequence builds an increasing hierarchy of expressibility and computational complexity for syntactic rules for strings of some arbitrary sort of tokens.

Figure 2 Left/Center/Right. The parts of a necktie, and the division of the wearer's torso with the regions (Left, Center, Right) and the winding directions (Turnwise, Widdershins) marked out for reference.

One way to describe a language is to give a grammar—a set of production rules that decompose some form of abstract tokens into sequences of abstract or concrete tokens, ending with a sequence of elements in some alphabet. The standard notation for such grammars is the Backus–Naur form, which uses ::= to denote the production rules and ⟨some name⟩ to denote the abstract tokens. Further common symbols are ∗, the Kleene star, that denotes an arbitrary number of repetitions of the previous token (or group in brackets), and |, denoting a choice of one of the adjoining options.
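To make the Backus–Naur notation concrete before it is used below, here is a minimal Python sketch that expands strings from a toy BNF-style grammar by repeatedly rewriting nonterminals. The grammar and all names are illustrative only; this is not the tie-knot grammar developed later in the paper.

# Illustrative sketch only: random expansion of a toy BNF-style grammar.
# <word> ::= <letter> | <letter> <word>      (one or more letters)
# <letter> ::= T | W
import random

toy_grammar = {
    "<word>": [["<letter>"], ["<letter>", "<word>"]],
    "<letter>": [["T"], ["W"]],
}

def expand(symbol: str, grammar: dict, rng: random.Random) -> str:
    """Expand a symbol: nonterminals are rewritten by a randomly chosen rule."""
    if symbol not in grammar:          # terminal symbol: emit it as-is
        return symbol
    production = rng.choice(grammar[symbol])
    return "".join(expand(part, grammar, rng) for part in production)

if __name__ == "__main__":
    rng = random.Random(0)
    print([expand("<word>", toy_grammar, rng) for _ in range(5)])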
THE ANATOMY OF A NECKTIE In the following, we will often refer to various parts and constructions with a necktie. We call the ends of a necktie blades, and distinguish between the broad blade and the thin blade1—see Fig. 2 for these names. The tie-knot can be divided up into a body, consisting of 1 There are neckties without a width difference between the ends. We ignore this distinction for this paper. all the twists and turns that are not directly visible in the final knot, and a façade, consisting of the parts of the tie actually visible in the end. In Fig. 3 we demonstrate this distinction. The body builds up the overall shape of the tie-knot, while the façade gives texture to the front of the knot. The enumeration of Fink and Mao only considers knots with trivial façades, while these later inventions all consider more interesting façades. As a knot is in place around a wearer, the Y-shape of the tie divides the torso into 3 regions: Left, Center and Right—as shown to the right in Fig. 2. A tie-knot has to be tied by winding and tucking one of the two blades around the other: if both blades are active, then the tie can no longer be adjusted in place for a comfortable fit. We shall refer to the blade used in tying the knot as the leading blade or the active blade. Each time the active blade is moved across the tie-knot—in front or in back—we call the part of the tie laid on top of the knot a bow. A LANGUAGE FOR TIE-KNOTS Fink & Mao (2000) observe that once the first crossing has been made, the wrapping sequence of a classical tie-knot is completely decided by the sequence of regions into which the broad blade is moved. Adorning the region specifications with a direction—is the tie moving away from the wearer or towards the wearer—they establish a formal alphabet for describing tie-knots with 7 symbols. We reproduce their construction here, using U for the Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 3/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Figure 3 Different examples of tie knots. Left, a 4-in-hand; middle, a double windsor; right a trinity. The 4-in-hand and double windsor share the flat façade but have different bodies producing different shapes. The trinity has a completely different façade, produced by a different wind and tuck pattern. move to tuck the blade U nder the tie itself.2 The notation proposed by Fink & Mao (2000) 2 Fink and Mao used T for Tuck. interprets repetitions U k of U as tucking the blade k bows under the top. It turns out that the complexity analysis is far simpler if we instead write U k for tucking the blade under the bow that was produced 2k windings ago. This produces a language on the alphabet: {L⊗,L⊙,C⊗,C⊙,R⊗,R⊙,U} They then introduce relations and restrictions on these symbols: T ie1 No region (L,C,R) shall repeat: after an L only C or R are valid next regions. U moves do not influence this. T ie2 No direction (⊙—out of the paper, ⊗—in towards the paper) shall repeat: after an outwards move, the next one must go inwards. U moves do not influence this. T ie3 Tucks (U ) are valid after an outward move. T ie4 A tie-knot can end only on one of C⊗,C⊙ or U . In fact, almost all classical knots end on U .3 3 The exemption here being the Onassis style knot, favored by the eponymous shipping magnate, where after a classical knot the broad blade is brought up with a C⊙ move to fall in front of the knot, hiding the knot completely. T ie5 A k-fold tuck U k is only valid after at least 2k preceding moves. 
Fink & Mao (2000) do not pay much attention to the conditions on k-fold tucks, since these show up in their enumeration as stylistic variations, exclusively at the end of a knot. This collection of rules allow us to drastically shrink the tie language, both in alphabet and axioms. Fink & Mao are careful to annotate whether tie-knot moves go outwards or inwards at any given point. We note that the inwards/outwards distinction follows as a direct consequence of axioms T ie2, T ie3 and T ie4. Since non-tuck moves must alternate between inwards and outwards, and the last non-tuck move must be outwards, the orientation of any sequence of moves follows by backtracking from the end of the string. Hence, when faced with a non-annotated string like RCLCRCLCRCLRURCLU Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 4/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 we can immediately trace from the tail of the knot string: the last move before the final tuck must be outwards, so that L must be a L⊙. So it must be preceded by R⊙C⊗. Tracing backwards, we can specify the entire string above to R⊗C⊙L⊗C⊙R⊗C⊙L⊗C⊙R⊗C⊙L⊗R⊙UC⊗R⊙C⊗L⊙U Next, the axiom T ie1 means that a sequence will not contain either of LU∗L,CU∗C,RU∗R as subsequences.4 Hence, the listing of regions is less important than the direction of 4 Recall that the Kleene star F∗ is used to denote sequences of 0 or more repetitions of the string F. transition: any valid transition is going to go either clockwise or counterclockwise.5 5 Say, as seen on the mirror image. Changing this convention does not change the count, as long as the change is consequently done. Writing T for clockwise6 and W for counterclockwise,7 we can give a strongly reduced 6 T for Turnwise. 7 W for Widdershins. tie language on the alphabet T, W, U. To completely determine a tie-knot, the sequence needs a starting state: an annotation on whether the first crossing of a tie-knot goes across to the right or to the left. In such a sequence, a U instruction must be followed by either T or W dictating which direction the winding continues after the tuck, unless it is the last move of the tie: in this case, the blade is assumed to continue straight ahead—down in front for most broad-blade tie-knots, tucked in under the collar for most thin-blade knots. The position of the leading blade after a sequence of W/T windings is a direct result of #W − #T(mod 3). This observation allows us to gain control over several conditions determining whether a distribution of U symbols over a sequence of W/T produces a physically viable tie-knot. Theorem 1. A position in a winding sequence is valid for a k-fold tuck if the sub-sequence of the last 2k W or T symbols is such that either 1. starts with W and satisfies #W − #T = 2 (mod 3) 2. starts with T and satisfies #T − #W = 2 (mod 3). Proof. The initial symbol produces the bow under which the tuck will go. If the initial symbol goes, say, from R to L, then the tuck move needs to come from C in order to go under the bow. In general, a tuck needs to come from the one region not involved in the covering bow. Every other bow goes in front of the knot, and the others go behind the knot. Hence, there are 2k − 1 additional winding symbols until the active blade returns to the right side of the knot. During these 2k − 1 symbols, we need to transition one more step around the sequence of regions. The transitions W and T are generator and inverse for the cyclic group of order 3, concluding the proof. 
� It is worth noticing here that a particular point along a winding can be simultaneously valid for both a k-fold and an m-fold tuck for k ≠ m. One example would be in the winding string TWTT: ending with TT, it is a valid site for a 1-fold tuck producing TWTTU, and since TWTT starts with T and has 2 more T than W, it is also a valid site for a 2-fold tuck producing TWTTUU. We will revisit this example below, in ‘Recursive tucks.’ We may notice that with the usual physical constraints on a tie—where we have experimentally established that broad blade ties tend to be bounded by 9 moves, and thin blade ties by 15 moves, we can expect that no meaningful tuck deeper than 7 will ever Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 5/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 be relevant; 4 for the broad blade ties. The bound of 4 is achieved in the enumeration by Fink & Mao (1999). In our enumeration, we will for the sake of comfort focus on ties up to 13 moves. LANGUAGE COMPLEXITY In this section, we examine the complexity features of the tie-knot language. Due to the constraints we have already observed on the cardinality of W and T, we will define a grammar for this language. We will write this grammar with a Backus–Naur form. Although in practice it is only possible to realise finite strings in the tie-knot language due to the physical properties of fabric, we assume an arbitrarily long (but finite), infinitely thin tie. Single-depth tucks The classical Fink and Mao system has a regular grammar, given by ⟨tie⟩ ::= L⟨L⟩ ⟨lastR⟩ ::= L⟨lastL⟩ | C⟨lastC⟩ | LCU ⟨lastL⟩ ::= R⟨lastR⟩ | C⟨lastC⟩ | RCU ⟨lastC⟩ ::= L⟨lastL⟩ | R⟨lastR⟩ We use the symbol ⟨lastR⟩ to denote the rule that describes what can happen when the last move seen was an R. Hence, at any step in the grammar, some tie knot symbol is emitted, and the grammar continues from the state that symbol was the last symbol emitted. The above grammar works well if the only tucks to appear are at the end. For intermediate tucks, and to avoid tucks to be placed at the back of the knot (obeying T ie3), we would need to keep track of winding parity: tucks are only valid an even number of winding steps from the end. We can describe this with a regular grammar. For the full tie-knot language, the grammar will end up context-free, as we will see in ‘Recursive tucks.’ ⟨tie⟩ ::= ⟨prefix⟩(⟨pair⟩ | ⟨tuck⟩) ∗ ⟨tuck⟩ ⟨prefix⟩ ::= T | W | ϵ ⟨pair⟩ ::= (T|W)(T|W) ⟨tuck⟩ ::= TTU | WWU The distribution of T and W varies by type of knot: for classical knots, #W − #T = 2 (mod 3); for modern knots that tuck to the right, #W − #T = 1 (mod 3); and for modern knots that tuck to the left, #W − #T = 0 (mod 3). This grammar does not discriminate between these three sub-classes. In order to track these sub-classes, the RLC-notation is easier. In order to rebuild this grammar to one based on the RLC-notation, note that from L a T takes us to C and a W takes us to R. So from a ⟨lastT⟩ residing at L, we have the options: W to Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 6/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 R, T to C, or TU to C. In particular, there is a ⟨lastT⟩ at L if we arrived from R. Hence, the TU option can be seen as being a TTU option executing from the preceding R state. There is thus, at any given position in the tie sequence, the options of proceeding with a T or a W, or to proceed with one of TTU or WWU. 
In the latter two cases, we can also accept the string. Starting at L, these options take us—in order—to C, to R, to CRU and to RCU respectively. This observation extends by symmetry to all stages, giving the grammar below. ⟨lastR⟩ ::= LR⟨lastR⟩ | CR⟨lastR⟩ | LC⟨lastC⟩ | CL⟨lastL⟩ | LCU[⟨lastC⟩] | CLU[⟨lastL⟩] ⟨lastL⟩ ::= RL⟨lastL⟩ | CL⟨lastL⟩ | RC⟨lastC⟩ | CR⟨lastR⟩ | RCU[⟨lastC⟩] | CRU[⟨lastR⟩] ⟨lastC⟩ ::= LC⟨lastC⟩ | RC⟨lastC⟩ | LR⟨lastR⟩ | RL⟨lastL⟩ | LRU[⟨lastR⟩] | RLU[⟨lastL⟩] ⟨tie⟩ ::= L(⟨lastL⟩ | R⟨lastR⟩ | C⟨lastC⟩) By excluding some the exit rules, this allows us to enumerate novel tie-knots with a specific ending direction, which will be of interest later on. Recursive tucks We can write a context-free grammar for the arbitrary depth tuck tie-knots. ⟨tie⟩ ::= ⟨prefix⟩(⟨pair⟩ | ⟨tuck⟩) ∗ ⟨tuck⟩ ⟨prefix⟩ ::= T | W | ϵ ⟨pair⟩ ::= (T|W)(T|W) ⟨tuck⟩ ::= ⟨ttuck2⟩ | ⟨wtuck2⟩ ⟨ttuck2⟩ ::= TT⟨w0⟩U | TW⟨w1⟩U ⟨wtuck2⟩ ::= WW⟨w0⟩U | WT⟨w2⟩U ⟨w0⟩ ::= WW⟨w1⟩U | WT⟨w0⟩U | TW⟨w0⟩U|TT⟨w2⟩U | ⟨ttuck2⟩’⟨w2⟩U | ⟨wtuck2⟩’⟨w1⟩U | ϵ ⟨w1⟩ ::= WW⟨w2⟩U | WT⟨w1⟩U | TW⟨w1⟩U|TT⟨w0⟩U | ⟨ttuck2⟩’⟨w0⟩U | ⟨wtuck2⟩’⟨w2⟩U ⟨w2⟩ ::= WW⟨w0⟩U | WT⟨w2⟩U | TW⟨w2⟩U|TT⟨w1⟩U | ⟨ttuck2⟩’⟨w1⟩U | ⟨wtuck2⟩’⟨w0⟩U Note that the validity of a tuck depends only on the count of T and W in the entire sequence comprising the tuck, and not the validity of any tucks recursively embedded into it. For instance, TWTT is a valid depth-2-tuckable sequence, as is its embedded depth-1-tuckable sequence TT. However, TTWT is also a valid depth-2-tuckable sequence, even though WT is not a valid depth-1-tuckable sequence. We introduce the symbol ’ to delineate different tucks that may come in immediate sequence, such as happens in the tie knot TWTTU’UU. Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 7/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Classification of the tie-knot language If we limit our attention to only the single-depth tie-knots described in ‘Single-depth tucks,’ then the grammar is regular, proving that this tie language is a regular language and can be described by a finite automaton. In particular this implies that the tie-knot language proposed by Fink & Mao (1999) is regular. In fact, an automaton accepting these tie-knots is given by: After the prefix, execution originates at the middle node, but has to go outside and return before the machine will accept input. This maintains the even length conditions required by T ie3. As for the deeper tucked language in ‘Recursive tucks’, the grammar we gave shows it to be at most context-free. Whether it is exactly context-free requires us to exclude the existence of a regular grammar. Theorem 2. The deeper tucked language is context-free. Proof. Our grammar in ‘Recursive tucks’ already shows that the language for deeper tucked tie-knots is either regular or context-free: it produces tie-knot strings with only single non-terminal symbols to the left of each production rule. It remains to show that this language cannot be regular. To do this, we use the pumping lemma for regular languages. Recall that the pumping lemma states that for every regular language there is a constant p such that for any word w of length at least p, there is a decomposition w = xyz such that |xy| ≤ p, |y| ≥ 1 and xyiz is a valid string for all i > 0. Since the reverse of any regular language is also regular, the pumping lemma has an alternative statement that requires |yz| ≤ p instead. We shall be using this next. Suppose there is such a p. 
Consider the tie-knot TTW6q−2U3q for some q > p/3. Any decomposition such that |yz| ≤ p will be such that y and z consist of only U symbols. In particular y consists of only U symbols. Hence, for sufficiently large values of i, there are too few preceding T/W-symbols to admit that tuck depth. Hence the language is not regular. � Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 8/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 ENUMERATION We can cut down the enumeration work by using some apparent symmetries. Without loss of generality, we can assume that a tie-knot starts by putting the active blade in region R: any knot starting in the region L is the mirror image of a knot that starts in R and swaps all W to T and vice versa. Generating functions Generating functions have proven a powerful method for enumerative combinatorics. One very good overview of the field is provided by the textbooks by Stanley (1997) and Stanley (1999). Their relevance to formal languages is based on a paper by Chomsky & Schützenberger (1959) that studied context-free grammars using formal power series. More details will appear in the (yet unpublished) Handbook AutoMathA (Gruber, Lee & Shallit, 2012). A generating function for a series an of numbers is a formal power series A(z) = ∞ j=0 ajz j such that the coefficient of the degree k term is precisely ak. Where ak and bk are counts of “things of size k” of type a or b respectively, the sum of the corresponding generating functions is the count of “things of size k” across both categories. If gluing some thing of type a with size j to some thing of type b with size k produces a thing of size j + k, then the product of the generating functions measures the counts of things you get by gluing things together between the two types. For our necktie-knot grammars, the sizes are the winding lengths of the ties, and it is clearly the case that adding a new symbol extends the size (thus is a multiplication action), and taking either one or another rule extends the items considered (thus is an additive action). The Maple8 package combstruct has built-in functions for computing a generating 8 Maple is a trademark of Waterloo Maple Inc. The computations of generating functions in this paper were performed by using Maple. function from a grammar specification. Using this, and the grammars we state in ‘Single-depth tucks,’ we are able to compute generating functions for both the winding counts and the necktie counts for both Fink and Mao’s setting and the single-depth tuck setting. • The generating function for Fink and Mao necktie-knots is z3 (1 + z)(1 − 2z) = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + O(z10). • The generating function for single tuck necktie-knots is 2z3(2z + 1) 1 − 6z2 = 2z3 + 4z4 + 12z5 + 24z6 + 72z7 + 144z8 + 432z9 + 864z10 + 2,592z11 + 5,184z12 + 15,552z13 + O(z14). • By removing final states from the BNF grammar, we can compute corresponding generating functions for each of the final tuck destinations. For an R-final tuck, we remove all final states except for CRU and LRU, making the non-terminal symbol mandatory for all other tuck sequences. For L, we remove all Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 9/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 but CLU and RLU. For C, we remove all but RCU and LCU. This results in the following generating functions for R-final, L-final and C-final sequences, respectively. 
z3(2z3 − 2z2 + z + 1) 1 − 6z2 = z3 + z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14) 2z4(2z2 − 2z − 1) 1 − 6z2 = 2z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14) z3(2z3 − 2z2 + z + 1) 1 − 6z2 = z3 + z4 + 4z5 + 8z6 + 24z7 + 48z8 + 144z9 + 288z10 + 864z11 + 1,728z12 + 5,184z13 + O(z14). • Removing the references to the tuck move, we recover generating functions for the number of windings available for each tie length. We give these for R-final, L-final and C-final respectively. Summed up, these simply enumerate all possible T/W-strings of the corresponding lengths, and so run through powers of 2. z3 1 − z − 2z2 = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + 85z10 + 171z11 + 341z12 + 683z13 + O(z14) 2z4 (1 − 2z)(1 + z) = 2z4 + 2z5 + 6z6 + 10z7 + 22z8 + 42z9 + 86z10 + 170z11 + 342z12 + 682z13 + O(z14) z3 1 − z − 2z2 = z3 + z4 + 3z5 + 5z6 + 11z7 + 21z8 + 43z9 + 85z10 + 171z11 + 341z12 + 683z13 + O(z14). • For the full grammar of arbitrary depth knots, we set w to be a root of (8z6 − 4z4)ζ 3 + (−8z6 + 18z4 − 7z2)ζ 2 + (−16z6 + 14z4 − 6z2 + 2)ζ − 12z4 + 9z2 − 2 = 0 solved for ζ . Then the generating function for this grammar is: − 1 8z4 − 11z2 + 3  64w2z7 − 128wz7 + 32w2z6 − 64wz6 − 48z5w2 + 216z5w − 24w2z4 − 96z5 + 108wz4 + 8w2z3 − 48z4 − 110wz3 + 4w2z2 + 82z3 − 55z2w + 41z2 + 16zw − 16z + 8w − 8  = 2z2 + 4z3 + 20z4 + 40z5 + 192z6 + 384z7 + 1,896z8 + 3,792z9 + 19,320z10 + 38,640z11 + 202,392z12 + 404,784z13 + 2,169,784z14 + O(z15). Tables of counts For ease of reading, we extract the results from the generating functions above to more easy-to-reference tables here. Winding length throughout refers to the number of R/L/C symbols occurring, and thus is 1 larger than the W/T count. The cases enumerated by Fink & Mao (2000) are Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 10/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 Winding length 3 4 5 6 7 8 9 Total # tie-knots 1 1 3 5 11 21 43 85 A knot with the thick blade active will cover up the entire knot with each new bow. As such, all thick blade active tie-knots will fall within the classification by Fink & Mao (2000). The modern case, thus, deals with thin blade active knots. As evidenced by the Trinity and the Eldredge knots, thin blade knots have a wider range of interesting façades and of interesting tuck patterns. For thick blade knots, it was enough to assume that the tuck happens last, and from the C region, the thin blade knots have a far wider variety. The case remains that unless the last move is a tuck—or possibly finishes in the C region—the knot will unravel from gravity. We can thus expect this to be a valid require- ment for the enumeration. There are often more valid tuck sites than the final position in a knot, and the tuck need no longer come from the C region: R and L are at least as valid. 
The computations in ‘Generating functions’ establish Winding length 3 4 5 6 7 8 9 10 11 12 13 Total # left windings 0 2 2 6 10 22 42 86 170 342 682 1,364 # right windings 1 1 3 5 11 21 43 85 171 341 683 1,365 # center windings 1 1 3 5 11 21 43 85 171 341 683 1,365 # left knots 0 2 4 8 24 48 144 288 864 1,728 5,184 8,294 # right knots 1 1 4 8 24 48 144 288 864 1,728 5,184 8,294 # center knots 1 1 4 8 24 48 144 288 864 1,728 5,184 8,294 # single tuck knots 2 4 12 24 72 144 432 864 2,592 4,146 15,552 24,882 total # knots 2 4 20 40 192 384 1,896 3,792 19,320 38,640 202,392 266,682 The first point where the singly tucked knots and the full range of knots deviate is at the knots with winding length 4; there are 12 singly tucked knots, and 8 knots that allow for a double tuck, namely: TTTTU TTWWU TWTTU TWWWU WTTTU WTWWU WWTTU WWWWU TTUTTU TTUWWU WWUTTU WWUWWU TTTWUU TTWTUU TWTTUU TWTTU’UU WTWWUU WTWWU’UU WWTWUU WWWTUU The reason for the similarity between the right and the center counts is that the winding sequences can be mirrored. Left-directed knots are different since the direction corresponds to the starting direction. Hence, a winding sequence for a center tuck can be mirrored to a winding sequence for a right tuck. In the preprint version of this paper, we claimed the total count of knots using only single-depth tucks to be 177,147. During the revision of the paper, we have discovered two errors in this claim: Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 11/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 1. There is an off-by-one error in this count. 2. This count was done for tie-knots that allow tucks that are hidden behind the knot. Adding this extra space to the generating grammar produces the generating function 2z3 + 6z4 + 18z5 + 54z6 + 162z7 + 486z8 + 1,458z9 + 4,374z10 + 13,122z11 + 39,366z12 + 118,098z13 + O(z14) with a total of 177,146 tie-knots with up to 13 moves. AESTHETICS Fink & Mao (2000) propose several measures to quantify the aesthetic qualities of a necktie-knot; notably symmetry and balance, corresponding to the quantities #R − #L and the number of transitions from a streak of W to a streak of T or vice versa. By considering the popular thin-blade neck tie-knots: the Eldredge and the Trinity, as described in Krasny (2012a) and Krasny (2012b), we can immediately note that balance no longer seems to be as important for the look of a tie-knot as is the shape of its façade. Symmetry still plays an important role in knots, and is easy to calculate using the CLR notation for tie-knots. Knot TW-string CLR-string Balance Symmetry Eldredge TTTWWTTUTTWWU LCRLRCRLUCRCLU 3 0 Trinity TWWWTTTUTTU LCLRCRLCURLU 2 1 We do not in this paper attempt to optimize any numeric measures of aesthetics, as this would require us to have a formal and quantifiable measure of the knot façades. This seems difficult with our currently available tools. CONCLUSION In this paper, we have extended the enumeration methods originally used by Fink & Mao (2000) to provide a larger enumeration of necktie-knots, including those knots tied with the thin blade of a necktie to produce ornate patterns in the knot façade. We have found 4,094 winding patterns that take up to 13 moves to tie and are anchored by a final single depth tuck, and thus are reasonable candidates for use with a normal necktie. We chose the number of moves by examining popular thin-blade tie-knots—the Eldredge tie-knot uses 12 moves—and by experimentation with our own neckties. 
Most of these winding patterns allow several possible tuck patterns, and thus the 4,094 winding patterns give rise to 24,882 singly tucked tie-knots. We have further shown that in the limit, the language describing neck tie-knots is context free, with a regular sub-language describing these 24,882 knots. These counts, as well as the stated generating functions, are dependent on the correctness of the combstruct package in Maple, and the correctness of our encoding of these grammars as Maple code. We have checked the small counts and generated strings Hirsch et al. (2015), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.2 12/15 https://peerj.com/computer-science/ http://dx.doi.org/10.7717/peerj-cs.2 for each of the grammars against experiments with a necktie and with the results by Fink and Mao and our own catalogue. Questions that remain open include: • Find a way to algorithmically divide a knot description string into a body/façade distinction. • Using such a distinction, classify all possible knot façades with reasonably short necktie lengths. We have created a web-site that samples tie-knots from knots with at most 12 moves and displays tying instructions: http://tieknots.johanssons.org. The entire website has also been deposited with Figshare (Vejdemo-Johansson, 2015). All the code we have used, as well as a table with assigned names for the 2,046 winding patterns for up to 12 moves are provided as Supplemental Information to this paper. Winding pattern names start with R, L or C depending on the direction of the final tuck, and then an index number within this direction. We suggest augmenting this with the bit-pattern describing which internal tucks have been added—so that e.g., the Eldredge would be L-373.4 (including only the 3rd potential tuck from the start) and the Trinity would be L-110.2 (including only the 2nd potential tuck). Thus, any single-depth tuck can be concisely addressed. ACKNOWLEDGEMENTS We would like to thank the reviewers, whose comments have gone a long way to make this a far better paper, and who have caught several errors that marred not only the presentation but also the content of this paper. Reviewer 1 suggested a significant simplification of the full grammar in ‘Recursive tucks,’ which made the last generating function at all computable in reasonable time and memory. Reviewer 2 suggested we look into generating functions as a method for enumerations. As can be seen in ‘Generating functions,’ this suggestion has vastly improved both the power and ease of most of the results and calculations we provide in the paper. For these suggestions in particular and all other suggestions in general we are thankful to both reviewers. ADDITIONAL INFORMATION AND DECLARATIONS Funding MVJ was partially supported for this work by the 7th Framework Programme through the project Toposys (FP7-ICT-318493-STREP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: 7th Framework Programme: FP7-ICT-318493-STREP. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 13/15 https://peerj.com/computer-science/ http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://tieknots.johanssons.org http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2 Competing Interests DH and MLP are employees of Upstanding Hackers Inc. Author Contributions • Dan Hirsch and Anders Sandberg analyzed the data, wrote the paper, reviewed drafts of the paper. • Ingemar Markström analyzed the data, performed the computation work, reviewed drafts of the paper. • Meredith L. Patterson analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. • Mikael Vejdemo-Johansson analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. Data Deposition The following information was supplied regarding the deposition of related data: Figshare: http://dx.doi.org/10.6084/m9.figshare.130013. Supplemental Information Supplemental information for this article can be found online at http://dx.doi.org/ 10.7717/peerj-cs.2#supplemental-information. REFERENCES Chomsky N, Schützenberger M P. 1959. The algebraic theory of context-free languages. Studies in Logic and the Foundations of Mathematics 26:118–161. Fink T, Mao Y. 1999. Designing tie knots by random walks. Nature 398(6722):31–32 DOI 10.1038/17938. Fink T, Mao Y. 2000. Tie knots, random walks and topology. Physica A: Statistical Mechanics and its Applications 276(1):109–121 DOI 10.1016/S0378-4371(99)00226-5. Fink T, Mao Y. 2001. The 85 ways to tie a tie. London: Fourth Estate. Gruber H, Lee J, Shallit J. 2012. Enumerating regular expressions and their languages. ArXiv preprint. arXiv:1204.4982. Knorr A. 2010. Eldredge reloaded. http://xirdalium.net. [Blog Post] Available at http://xirdalium. net/2010/06/20/eldredge-reloaded/ (accessed 26 December 2012). Krasny A. 2012a. Eldredge tie knot—how to tie a eldredge necktie knot. http://agreeordie.com. [Blog Post] Available at http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge- knot (accessed 26 December 2012). Krasny A. 2012b. Trinity tie knot—how to tie a trinity necktie knot. http://agreeordie.com. [Blog Post] Available at http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott (accessed 26 December 2012). Sipser M. 2006. Introduction to the theory of computation. Vol. 2. Boston: Thomson Course Technology. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 14/15 https://peerj.com/computer-science/ http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.6084/m9.figshare.130013 http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information 
http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.7717/peerj-cs.2#supplemental-information http://dx.doi.org/10.1038/17938 http://dx.doi.org/10.1016/S0378-4371(99)00226-5 http://arxiv.org/abs/1204.4982 http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ 
http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://xirdalium.net/2010/06/20/eldredge-reloaded/ http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot 
http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com/blog/musings/545-how-to-tie-a-necktie-eldredge-knot http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com 
http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott 
http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://agreeordie.com/blog/musings/553-how-to-tie-a-necktie-trinity-knott http://dx.doi.org/10.7717/peerj-cs.2 Stanley RP. 1997. Enumerative combinatorics, Cambridge studies in advanced mathematics 49, vol. 1. Cambridge: Cambridge University Press. Stanley RP. 1999. Enumerative combinatorics, Cambridge studies in advanced mathematics 62, vol. 2. Cambridge: Cambridge University Press. Vejdemo-Johansson M. 2015. Random tie knots webpage. Available at http://dx.doi.org/10.6084/m9. figshare.1300138 (accessed February 2015). Wachowski A, Wachowski L, Silver J, Reeves K, Fishburne L, Moss C, Weaving H, Smith J, Foster G, Perrineau H et al. 2003. The Matrix Reloaded [Film]. USA: Warner Bros. Pictures. Hirsch et al. (2015), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.2 15/15 https://peerj.com/computer-science/ http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.6084/m9.figshare.1300138 http://dx.doi.org/10.7717/peerj-cs.2 More ties than we thought Introduction Formal languages The Anatomy of a Necktie A Language for Tie-knots Language Complexity Single-depth tucks Recursive tucks Classification of the tie-knot language Enumeration Generating functions Tables of counts Aesthetics Conclusion Acknowledgements References work_2mp36ujpvjcgdc3qzaxexknpvi ---- International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 DOI: 10.21307/ijanmc-2020-026 36 A Method to Access a Decimal Network (IPV9) Resource Guangzhou Liu, Fuya Yu Xi'an Decimal Network Technology Co. LTD Xi'an V9 Network Research Institute Co. LTD Email: 5571200@qq.com Abstract—Network security is highly valued by world leaders. The current Internet technology core is IPv4, IPv6, completely controlled by the United States. On December 14, 2017, the US Federal Communications Commission (FCC) formally abolished the net neutrality law. At that time, the Internet took on an obvious political color and posed a serious threat to Internet applications in various countries. China's economy is already highly dependent on the Internet, and if the network is disrupted, the whole country will suffer heavy losses. 
The decimal Network Standard working Group of The Ministry of Industry and Information Technology of China and The Decimal Network Information Technology Co., LTD of Shanghai have been researching on the future network for more than 20 years. Developed a complete set of decimal network framework system, completed the future network series research and development with China's independent intellectual property rights, and built the second Internet network system besides the United States. The technology has been fully tested in many places and achieved good results, truly achieving the goal of "autonomy, safety, high speed and compatibility". This paper will introduce the method of accessing decimal network resources in the current network environment. Keywords-Decimal Network; CHN; Domain Name; Network Resources Decimal network is a complete independent intellectual property rights based overall decimal digital code, the establishment of 2 256 times of cyberspace sovereignty. It includes 13 root domain name servers from the parent root, the primary root, and the zero-trust security mechanism for communication after verification. Compatible with current Internet systems, it has a future Internet architecture that overlaps geographical location and IP address space. Most Internet applications today are based on IPv4 environments. In the context of the existing Internet network, the IPV9 .chn domain name network can be accessed by setting up the existing computer or terminal. Most current computer browsers and mobile browsers support access. For example, Firefox, Google Chrome, Microsoft Edge, 360 speed browser and so on are common on computers. Safari and Baidu browser commonly used on mobile phones need to set the network DNS and point to the IPV9 DNS server before using the browser to open the website. The addresses are: 202.170.218.93 and 61.244.5.162. Once set up, you can access the resources of the decimal network in the current Internet environment. Before visiting, a few typical IPV9 sites are recommended, as shown in Table 1. Here are the steps to accessing the .C web site on your PC and mobile phone. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 37 TABLE I. TYPICAL CHN DOMAIN NAME WEBSITES Website domain name Web resources Resource management Resources to address http://www.v9.chn .chn portal website Decimal Network Standard Working Group Shanghai http://em777.chn Decimal technology introduction website Shanghai Decimal Network Information Technology Co. LTD Shanghai http://www.xav9.chn Xi 'an Decimal System portal Xi 'an Decimal Network Technology Co. LTD Xi 'an http://www.xa.chn V9 Research Institute portal Xi 'an Weijiu Research Institute Co. LTD Xi 'an http://www.hqq.chn/ The red Flag Canal craftsman Xi 'an Decimal Network Technology Co. LTD Xi 'an http://www.zjsjz.chn Zhejiang Decimal System portal website Zhejiang Decimal Network Co. LTD Hangzhou http://www.zjbdth.chn Beidou day draw Beidou Tianhua Information Technology Co. LTD Hangzhou I. COMPUTER ACCESS. CHN WEBSITE SETTINGS Introduce with Windows10 system settings (PC). 1) First click the "Network" icon on the desktop and select the "Properties" option. The interface appears as shown in Figure 1. Figure 1. Network and share Center setup interface International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 38 2) Click the "Connection: Ethernet" option in the network and Sharing Center setting interface. The interface appears as shown in Figure 2. 
Figure 2. Ethernet status interface 3) In the Ethernet status interface, click the "Properties" button. The dialog box appears as shown in Figure 3. Figure 3. Ethernet property interface 4) In the Ethernet property interface, double-click the option "Internet Protocol Version 4 (TCP/IPv4)". The dialog box appears as shown in Figure 4. Setting the preferred DNS and alternate DNS and finished setup. Figure 4. Internet Protocol version 4 (TCP/IPv4) properties 5) Open a browser. Firefox or Google Chrome is recommended. Enter http://www.hqq.chn in the browser address bar to access the IPV9 site, as shown in Figure 5. II. MOBILE ACCESS .CHN WEBSITE At present, there are many types of mobile phones, but the setting method is similar. Android mobile phone can download the plug-in (download address: https://www.dtgty.com/HomeSearch) by flow direct access. But in most cases, access to .chn resources will be more convenient over local Wi-Fi. It can also be accessed through mobile hotspots, with the same Settings as Wi-Fi and mobile hotspots. Take Huawei (Android system) mobile phone and iPhone (iOS system) mobile phone as an example to introduce the setting method of mobile DNS. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 39 A. Huawei Mobile Phone setting The phone type is HUAWEI Mate 20, Android 10 and EMUI 10.1.0. 1) Click "Settings" on the desktop of the mobile phone to display the setting interface, as shown in Figure 6. Figure 5. Access the IPV9 site Figure 6. Mobile phone Setting Interface Figure 7. Wireless connection setting interface International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 40 2) Click "Wireless LAN" in the interface, and the interface appears as shown in Figure 7. 3) Press on the connected network name for a while, and additional menu options appear, as shown in Figure 8. Click "Modify Network" menu, the interface of network parameter setting appears, and select "Display Advanced Options", as shown in Figure 9. Select the "Static" option, as shown in Figure 10. Figure 8. Modification of network Interface Figure 9. Parameter setting interface 4) Modify DNS according to the parameters in the figure. After modification, click "Save" button to complete the setting. Figure 10. Modification of network Interface Figure 11. Parameter setting interface 5) Return to the main interface of the mobile phone and enter http://www.xand.chn in the browser International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 41 (Firefox or Google Chrome) to browse the overseas study service website for testing, as shown in Figure 11. The rest are Xiaomi phones, Vivo phones and so on. You can access IPV9 network resources by simply setting the DNS Settings for the connection network. B. iPhone parameter setting Mobile phone model: iPhone XR, system: IOS13.5. 1) Click "Settings" on the desktop of the mobile phone to appear the setting interface. Click "Wireless LAN" in the interface. The interface appears as shown in Figure 12. 2) Click the icon on the right of the connected WLAN, and the network setting interface appears, as shown in Figure 13. Figure 12. Interface of wireless LAN Figure Figure 13. Interface of wireless connection parameters Figure 14. DNS Setting Interface 3) In the setting interface, select "Configure DNS" and the DNS setting interface appears, as shown in Figure 14. Select the Add Server option and enter the DNS address shown in the figure. 
Click the "Save" command in the upper right corner of the interface to complete the setup. 4) Open the browser. Enter http://www.xav9.chn in the address bar to open the main interface of Xi 'an Future Network, as shown in Figure 15. III. METHOD OF ACCESSING IPV9 WEBSITE WITH CHINESE DOMAIN NAME In addition to accessing network resources through character domain names, the decimal network system can also use Chinese domain names to access, in the format: http:// Chinese.*****, but before access to the following Settings. Take the Firefox browser, for example. 1) Open the Firefox browser and click the menu button in the upper right corner to open the browser Settings menu, as shown in Figure 16. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 42 Figure 15. Xi 'an Future Network main interface Figure 16. Firefox menu Settings screen International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 43 2) Click the "Options" command, drag the right scroll bar to the bottom of the page, and network Settings appear, as shown in Figure 17. Figure 17. Firefox menu options screen 3) Click the "Settings" button in network Settings, and the "Connection Settings" dialog box appears, as shown in Figure 18. In the Configure Proxy Server to Access the Internet option, select Do not Use proxy Server (Y), and then select Enable HTTPS DNS at the bottom of the screen. Finally enter https://doh.zsw9.cn/dns.query in the "custom" edit box. 4) After setting, click "OK" button to complete setting. Enter the Chinese domain name "China Micro Nine Research Institute" into the Firefox browser to access Chinese website resources. This is shown in Figure 19. To facilitate test access, several typical IPV9 sites are recommended, as shown in Table 2. International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 44 Figure 18. Firefox connection Settings screen Figure 19. Website of Xi 'an V9 Research Institute International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 45 TABLE II. TYPICAL CHINESE DOMAIN NAME WEBSITES Character of the domain name Web resources Chinese domain name Resource management http://www.ijanmc.chn New online international journals http:// in China. New network and detection control Xi’an Technological University http://www.iccnea.chn ICCNEA International Conference Website http:// in China. The international conference on Xi’an Technological University http://www.xa.chn .chn portal website http:// in China. Micro Nine Research Institute Xi 'an Decimal Network Company http://www.xav9.chn Xi 'an Decimal System portal http:// in China. Xi 'an Future Network Portal Xi 'an Decimal Network Company http://www.xand.chn Xi 'an NORTON Study Abroad website http:// in China. Xi 'an NORTON Study Abroad Xi 'an Decimal Network Company http://www.hqq.chn The red Flag Canal craftsman http:// in China. The red Flag Canal craftsman Xi 'an Decimal Network Company http://www.xazn.chn The website of Zhengnuo Conference Company The website of Zhengnuo Conference Company Xi 'an Decimal Network Company In addition to accessing network resources through character domain names and Chinese characters, the decimal address can also be used to access resources. A website corresponds to a decimal address. At the same time, we can also realize a decimal address corresponding to multiple network resources in the way of subdirectory structure. 
Since decimal address access is bound to the computer in the background, setup is cumbersome, and only a presentation interface is provided here, as shown in Figure 20. Figure 20. Red Flag Canal Craftsman website International Journal of Advanced Network, Monitoring and Controls Volume 05, No.03, 2020 46 At present the decimal network is in the experimental application stage, although the network resources are less, but the original resources running on the Internet can be completely translated to the decimal network system. With the introduction of national policy, the decimal network of resources will be more and more. The decimal network application of China's independent intellectual property rights is bound to enter thousands of households. IV. CONCLUSION This paper introduces the method of using browser to access decimal network resources through personal computer terminal or personal mobile phone under the current Internet environment. A simple DNS setup is required to point to the decimal server to complete resource access. The setup is very simple, which lays the foundation for a wide range of network applications. REFERENCE [1] Xie Jianping. A method for assigning addresses to networked computers using full decimal algorithm, Chinese patent No. : ZL00135182.6, 2004.2.6. [2] Xie Jianping. A method for assigning addresses to networked computers using a full decimal algorithm, U.S. Patent No. :US: 8082365, [4] RFC - Internet Standard. Internet Protocol, DARPA INTERNET PROGRAM PROTOCOL SPECIFICATION, RFC 791, 1981.09. [3] S. Deering, R. Hinden, Network Working Group. Internet Protocol, Version 6 (IPv6)-Specification, RFC-1883, 1995.12. [4] M. Crawford. Network Working Group. Transmission of IPv6 Packets over Ethernet Networks. RFC-2464, 1998.12. [5] J. Onions, Network Working Group. A Historical Perspective on the usage of IP version 9. RFC1606. 1994.04. [6] V. Cerf, Network Working Group. A VIEW FROM THE 21ST CENTURY, RFC1607. 1994.04. work_2p32le2xefgk7mdiaxsaftmnmu ---- Aalborg Universitet Sustainable computational science the ReScience initiative Rougier, Nicolas; Hinsen, Konrad; Alexandre, Frédéric; Arildsen, Thomas; Barba, Lorena; Benureau, Fabien; Brown, C. Titus; de Buyl, Pierre; Caglayan, Ozan; Davison, Andrew; Delsuc, Marc André; Detorakis, Georgios; Diem, Alexandra; Drix, Damien; Enel, Pierre; Girard, Benoît; Guest, Olivia; Hall, Matt; Henriques, Rafael; Hinaut, Xavier; Jaron, Kamil; Khamassi, Mehdi; Klein, Almar; Manninen, Tiina; Marchesi, Pietro; McGlinn, Dan; Metzner, Christoph; Petchey, Owen; Ekkehard Plesser, Hans; Poisot, Timothée; Ram, Karthik; Ram, Yoav; Roesch, Etienne; Rossant, Cyrille; Rostami, Vahid; Shifman, Aaron; Stachelek, Joseph; Stimberg, Marcel; Stollmeyer, Frank; Vaggi, Federico; Viejo, Guillaume; Vitay, Julien; Vostinar, Anya; Yurchak, Roman; Zito, Tiziano Published in: PeerJ DOI (link to publication from Publisher): 10.7717/peerj-cs.142 Creative Commons License CC BY 4.0 Publication date: 2017 Document Version Publisher's PDF, also known as Version of record Link to publication from Aalborg University Citation for published version (APA): Rougier, N., Hinsen, K., Alexandre, F., Arildsen, T., Barba, L., Benureau, F., Brown, C. T., de Buyl, P., Caglayan, O., Davison, A., Delsuc, M. A., Detorakis, G., Diem, A., Drix, D., Enel, P., Girard, B., Guest, O., Hall, M., Henriques, R., ... Zito, T. (2017). Sustainable computational science: the ReScience initiative. PeerJ, 3(142e). 
https://doi.org/10.7717/peerj-cs.142 https://doi.org/10.7717/peerj-cs.142 https://vbn.aau.dk/en/publications/df049490-73ef-44b2-835f-2fc6da4c5ceb https://doi.org/10.7717/peerj-cs.142 Submitted 5 October 2017 Accepted 15 November 2017 Published 18 December 2017 Corresponding author Nicolas P. Rougier, Nicolas.Rougier@inria.fr Academic editor Feng Xia Additional Information and Declarations can be found on page 14 DOI 10.7717/peerj-cs.142 Copyright 2017 Rougier et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS Sustainable computational science: the ReScience initiative Nicolas P. Rougier1, Konrad Hinsen2, Frédéric Alexandre1, Thomas Arildsen3, Lorena A. Barba4, Fabien C.Y. Benureau1, C. Titus Brown5, Pierre de Buyl6, Ozan Caglayan7, Andrew P. Davison8, Marc-André Delsuc9, Georgios Detorakis10, Alexandra K. Diem11, Damien Drix12, Pierre Enel13, Benoît Girard14, Olivia Guest15, Matt G. Hall16, Rafael N. Henriques17, Xavier Hinaut1, Kamil S. Jaron18, Mehdi Khamassi14, Almar Klein19, Tiina Manninen20, Pietro Marchesi21, Daniel McGlinn22, Christoph Metzner23, Owen Petchey24, Hans Ekkehard Plesser25, Timothée Poisot26, Karthik Ram27, Yoav Ram28, Etienne Roesch29, Cyrille Rossant30, Vahid Rostami31, Aaron Shifman32, Joseph Stachelek33, Marcel Stimberg34, Frank Stollmeier35, Federico Vaggi36, Guillaume Viejo14, Julien Vitay37, Anya E. Vostinar38, Roman Yurchak39 and Tiziano Zito40 1 INRIA Bordeaux Sud-Ouest, Talence, France 2 Centre de Biophysique Moléculaire UPR4301, CNRS, Orléans, France 3 Department of Electronic Systems, Technical Faculty of IT and Design, Aalborg University, Aalborg, Denmark 4 Department of Mechanical and Aerospace Engineering, The George Washington University, Washington, D.C., USA 5 Department of Population Health and Reproduction, University of California Davis, Davis, CA, USA 6 Instituut voor Theoretische Fysica, KU Leuven, Leuven, Belgium 7 Laboratoire d’Informatique (LIUM), Le Mans University, Le Mans, France 8 UNIC FRE 3693, CNRS, Gif-sur-Yvette, France 9 Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France 10 Department of Cognitive Sciences, University of California Irvine, Irvine, CA, USA 11 Computational Engineering and Design, University of Southampton, Southampton, United Kingdom 12 Humboldt Universität zu Berlin, Berlin, Germany 13 Department of Neuroscience, Mount Sinai School of Medicine, New York, NY, USA 14 Institute of Intelligent Systems and Robotics, Sorbonne Universités - UPMC Univ Paris 06 - CNRS, Paris, France 15 Experimental Psychology, University College London, London, Greater London, United Kingdom 16 UCL Great Ormond St Institute of Child Health, London, United Kingdom 17 Champalimaud Centre for the Unknown, Champalimaud Neuroscience Program, Lisbon, Portugal 18 Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland 19 Independent scholar, Enschede, The Netherlands 20 BioMediTech Institute and Faculty of Biomedical Sciences and Engineering, Tampere University of Technology, Tampere, Finland 21 Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands 22 Department of Biology, College of Charleston, Charleston, SC, USA 23 Centre for Computer Science and Informatics Research, University of Hertfordshire, Hatfield, United Kingdom 24 Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland 25 Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway 26 Département de Sciences 
Biologiques, Université de Montréal, Montréal, QC, Canada 27 Berkeley Institute for Data Science, University of California, Berkeley, CA, USA 28 Department of Biology, Stanford University, Stanford, CA, USA 29 Centre for Integrative Neuroscience, University of Reading, Reading, United Kingdom 30 Institute of Neurology, University College London, London, United Kingdom How to cite this article Rougier et al. (2017), Sustainable computational science: the ReScience initiative. PeerJ Comput. Sci. 3:e142; DOI 10.7717/peerj-cs.142 https://peerj.com mailto:Nicolas.Rougier@inria.fr https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.142 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://dx.doi.org/10.7717/peerj-cs.142 31 Institute of Neuroscience & Medicine, Juelich Forschungszentrum, Jülich, Germany 32 Department of Biology, University of Ottawa, Ottawa, Ontario, Canada 33 Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, USA 34 Sorbonne Universités/UPMC Univ Paris 06/INSERM/CNRS/Institut de la Vision, Paris, France 35 Max Planck Institute for Dynamics and Self-Organization, Göttingen, Lower Saxony, Germany 36 Amazon, Seattle, WA, USA 37 Department of Computer Science, Chemnitz University of Technology, Chemnitz, Saxony, Germany 38 Department of Computer Science, Grinnell College, Grinnell, IA, USA 39 Symerio, Palaiseau, France 40 Neural Information Processing Group, Eberhard Karls Universität Tübingen, Tübingen, Germany ABSTRACT Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; however, computational science lags behind. In the best case, authors may provide their source code as a compressed archive and they may feel confident their research is reproducible. But this is not exactly true. James Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer-reviews. Existing journals have been slow to adapt: source codes are rarely requested and are hardly ever actually executed to check that they produce the results advertised in the article. ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from other traditional scientific journals. ReScience resides on GitHub where each new implementation of a computational study is made available together with comments, explanations, and software tests. Subjects Data Science, Digital Libraries, Scientific Computing and Simulation, Social Computing Keywords Computational science, Open science, Publication, Reproducible, Replicable, Sustainable, GitHub, Open peer-review INTRODUCTION There is a replication crisis in Science (Baker, 2016; Munafò et al., 2017). This crisis has been highlighted in fields as diverse as medicine (Ioannidis, 2005), psychology (Open Science Collaboration, 2015), the political sciences (Janz, 2015), and recently in the biomedical sciences (Iqbal et al., 2016). 
The reasons behind such non-replicability are as diverse as the domains in which it occurs. In medicine, factors such as study power and bias, the number of other studies on the same question, and importantly, the ratio of true to no relationships among the all relationships probed have been highlighted as important causes (Ioannidis, 2005). In psychology, non-replicability has been blamed on spurious p-values (p-hacking), while in the biomedical sciences (Iqbal et al., 2016), a lack of access to full datasets and Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 2/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 detailed protocols for both clinical and non-clinical biomedical investigation is seen as a critical factor. The same remarks were recently issued for chemistry (Coudert, 2017). Surprisingly, the computational sciences (in the broad sense) and computer sciences (in the strict sense) are no exception (Donoho et al., 2009; Manninen, Havela & Linne, 2017) despite the fact they rely on code and data rather than on experimental observations, which should make them immune to the aforementioned problems. When Colberg and colleagues (2016) decided to measure the extent of the problem precisely, they investigated the availability of code and data as well as the extent to which this code would actually build with reasonable effort. The results were dramatic: of the 515 (out of 613) potentially reproducible papers targeted by the study, the authors managed to ultimately run only 102 (less than 20%). These low numbers only reflect the authors’ success at running the code. They did not check for correctness of the code (i.e., does the code actually implement what is advertised in the paper), nor the reproducibility of the results (does each run lead to the same results as in the paper). One example of this problem can be found in Topalidou et al. (2015), in which the authors tried to replicate results obtained from a computational neuroscience model. Source code was not available, neither as supplementary material to the paper nor in a public repository. When the replicators obtained the source code after contacting the corresponding author, they found that it could not be compiled and would be difficult to reuse for other purposes. Confronted with this problem, a small but growing number of journals and publishers have reacted by adopting explicit policies for data and software. Examples can be seen in the PLOS instructions on Materials and Software Sharing and on Data Availability, and in the recent announcement by eLife on forking (creating a linked copy of) software used in eLife papers to GitHub. Such policies help to ensure access to code and data in a well-defined format (Perkel, 2016) but this will not guarantee reproducibility nor correctness. At the educational and methodological levels, things have started to change with a growing literature on best practices for making computations reproducible (Sandve et al., 2013; Crook, Davison & Plesser, 2013; Wilson et al., 2014; Halchenko & Hanke, 2015; Janz, 2015; Hinsen, 2015). Related initiatives such as Software and Data Carpentry (Wilson, 2016) are of note since their goal is to make scientists more productive, and their work more reliable, by teaching them basic computing skills. Such best practices could be applied to already published research codebases as well, provided the original authors are willing to take on the challenge of re-implementing their software for the sake of better science. 
Unfortunately, this is unlikely since the incentives for doing such time-consuming work are low or nonexistent. Furthermore, if the original authors made mistakes in their original implementation, it seems likely that they will reproduce their mistakes in any re-implementation. REPLICATION AND REPRODUCTION While recognition of the replication crisis as a problem for scientific research has increased over time, unfortunately no common terminology has emerged so far. One reason for the diverse use of terms is that each field of research has its own specific technical and Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 3/17 https://peerj.com http://journals.plos.org/plosone/s/materials-and-software-sharing http://journals.plos.org/plosone/s/data-availability https://elifesciences.org/elife-news/inside-elife-forking-software-used-elife-papers-github http://dx.doi.org/10.7717/peerj-cs.142 social obstacles on the road to publishing results and findings that can be verified by other scientists. Here we briefly summarize the obstacles that arise from the use of computers and software in scientific research, and introduce the terminology we will use in the rest of this article. We note, however, that there is some disagreement about this particular choice of terminology even among the authors of this article. Reproducing the result of a computation means running the same software on the same input data and obtaining the same results. The goal of a reproduction attempt is to verify that the computational protocol leading to the results has been recorded correctly. Performing computations reproducibly can be seen as a form of provenance tracking, the software being a detailed record of all data processing steps. In theory, computation is a deterministic process and exact reproduction should therefore be trivial. In reality, it is very difficult to achieve because of the complexity of today’s software stacks and the tediousness of recording all interactions between a scientist and a computer (although a number of recent tools have attempted to automate such recording, e.g., Guo & Engler, 2011; Davison, 2012; Murta et al., 2015). Mesnard and Barba explain (Mesnard & Barba, 2017) how difficult it can be to reproduce a two-year-old computation even though all possible precautions were taken at the time to ensure reproducibility. The most frequent obstacles are the loss of parts of the software or input data, lack of a computing environment that is sufficiently similar to the one used initially, and insufficient instructions for making the software work. An obstacle specific to numerical computations is the use of floating-point arithmetic, whose rules are subject to slightly different interpretations by different compilers and runtime support systems. A large variety of research practices and support tools have been developed recently to facilitate reproducible computations. For a collection of recipes that have proven useful, see Kitzes, Turek & Deniz (2017). Publishing a reproducible computational result implies publishing all the software and all the input data, or references to previously published software and data, along with the traditional article describing the work. An obvious added value is the availability of the software and data, which helps readers to gain a better understanding of the work, and can be re-used in other research projects. 
In addition, reproducibly published results are more trustworthy, because many common mistakes in working with computers can be excluded: mistyping parameter values or input file names, updating the software but forgetting to mention the changes in the description of the method, planning to use one version of some software but actually using a different one, etc. Strictly speaking, reproducibility is defined in the context of identical computational environments. However, useful scientific software is expected to be robust with respect to certain changes in this environment. A computer program that produces different results when compiled using different compilers, or run on two different computers, would be considered suspect by most practitioners, even if it were demonstrably correct in one specific environment. Ultimately it is not the software that is of interest for science, but the models and methods that it implements. The software is merely a vehicle to perform computations based on these models and methods. If results depend on hard-to-control Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 4/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 implementation details of the software, their relation to the underlying models and methods becomes unclear and unreliable. Replicating a published result means writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough to be considered equivalent. What exactly ‘‘similar enough’’ means strongly depends on the kind of computation being performed, and can only be judged by an expert in the field. The main obstacle to replicability is an incomplete or imprecise description of the models and methods. Replicability is a much stronger quality indicator than reproducibility. In fact, reproducibility merely guarantees that all the ingredients of a computation are well documented. It does not imply that any of them are correct and/or appropriate for implementing the models and methods that were meant to be applied, nor that the descriptions of these models and methods are correct and clear. A successful replication shows that two teams have produced independent implementations that generate equivalent results, which makes serious mistakes in either implementation unlikely. Moreover, it shows that the second team was able to understand the description provided by the first team. Replication can be attempted for both reproducible and non-reproducible results. However, when an attempt to replicate non-reproducible work fails, yielding results too different to be considered equivalent, it can be very difficult to identify the cause of the disagreement. Reproducibility guarantees the existence of a precise and complete description of the models and methods being applied in the original work, in the form of software source code, which can be analyzed during the investigation of any discrepancies. The holy grail of computational science is therefore a reproducible replication of reproducible original work. THE RESCIENCE INITIATIVE Performing a replication is a daunting task that is traditionally not well rewarded. Nevertheless, some people are willing to replicate computational research. The motivations for doing so are very diverse (see Box 1). Students may want to familiarize themselves with a specific scientific domain, and acquire relevant practical experience by replicating important published work. 
Senior researchers may critically need a specific piece of code for a research project and therefore re-implement a published computational method. If these people write a brand new open source implementation of already published research, it is likely that this new implementation will be of interest for other people as well, including the original authors. The question is where to publish such a replication. To the best of our knowledge, no major journal accepts replications in computational science for publication. This was the main motivation for the creation of the ReScience journal (https://rescience.github.io) by Konrad Hinsen and Nicolas P. Rougier in September 2015. Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 5/17 https://peerj.com https://rescience.github.io http://dx.doi.org/10.7717/peerj-cs.142 Box 1. Authors having published in Rescience explain their motivation. (Stachelek, 2016) I was motivated to replicate the results of the original paper because I feel that working through code supplements to blog posts has really helped me learn the process of scientific analysis. I could have published my replication as a blog post but I wanted the exposure and permanency that goes along with journal articles. This was my first experience with formal replication. I think the review was useful because it forced me to consider how the replication would be used by people other than my- self. I have not yet experienced any new interactions following publication. However, I did notify the author of the original implementation about the replication’s publi- cation. I think this may lead to future correspondence. The original author suggested that he would consider submitting his own replications to ReScience in the future. (Topalidou & Rougier, 2015) Our initial motivation and the main reason for replicating the model is that we needed it in order to collaborate with our neurobiologist colleagues. When we arrived in our new lab, the model had just been published (2013) but the original author had left the lab a few months before our arrival. There was no public repository nor version control, and the paper describing the model was incomplete and partly inaccurate. We managed to get our hands on the original sources (6,000 lines of Delphi) only to realize we could not compile them. It took us three months to replicate it using 250 lines of Python. But at this time, there was no place to publish this kind of replication to share the new code with colleagues. Since then, we have refined the model and made new predictions that have been confirmed. Our initial replication effort really gave the model a second life. (Viejo, Girard & Khamassi, 2016) Replicating previous work is a relatively routine task every time we want to build a new model: either because we want to build on this previous work, or because we want to compare our new model to it. We also give replication tasks to M.Sc. students every year, as projects. In all these cases, we are confronted with incomplete or inaccurate model descriptions, as well as with the impossibility to obtain the original results. Contacting the original authors sometimes solves the problem, but not so often (because of the dog ate my hard drive syndrome). We thus accumulate knowledge, internal to the lab, about which model works and which doesn’t, and how a given model has to be parameterized to really work. Without any place to publish it, this knowledge is wasted. 
Publishing it in ReScience, opening the discussion publicly, will be a progress for all of us. ReScience is an openly-peer-reviewed journal that targets computational research and encourages the explicit replication of already published research. In order to provide the largest possible benefit to the scientific community, replications are required to be reproducible and open-source. In two years of existence, 17 articles have been published and 4 are currently under review (#20, #39, #41, #43). The editorial board covers a wide range of computational sciences (see http://rescience.github.io/board/) and more than 70 volunteers have registered to be reviewers. The scientific domains of published work Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 6/17 https://peerj.com https://github.com/ReScience/ReScience-submission/pull/20 https://github.com/ReScience/ReScience-submission/pull/27 https://github.com/ReScience/ReScience-submission/pull/30 https://github.com/ReScience/ReScience-submission/pull/30 http://rescience.github.io/board/ http://dx.doi.org/10.7717/peerj-cs.142 are computational neuroscience, neuroimaging, computational ecology and computer graphics, with a majority in computational neuroscience. The most popular programming languages are Python and R. The review process takes about 100 days on average and involves about 50 comments. There is a strong bias towards successful replication (100%); experience has taught us that researchers are reluctant to publish failed replications, even when they can prove that the original work is wrong. For young researchers, there is a social/professional risk in publishing articles that show results from a senior researcher to be wrong. Until we implement a certified anonymized submission process, this strong bias will most likely remain. One of the specificities of the ReScience journal is a publishing chain that is radically different from any other traditional scientific journal, since ReScience lives on GitHub, a platform originally designed for collaborative software development. A ReScience submission is treated very similarly to a contribution to an Open Source software project. One of the consequences is that the whole process, from submission via reviewing to publication, is open for anyone to see and even comment on. Each submission is considered by a member of the editorial board, who may decide to reject the submission if it does not respect the formal publication criteria of ReScience. A submission must contain • a precise reference to the work being replicated, • an explanation of why the authors think they have replicated the paper (same figures, same graphics, same behavior, etc.) or why they have failed, • a description of any difficulties encountered during the replication, • open-source code that produces the replication results, • an explanation of this code for human readers. A complete submission therefore consists of both computer code and an accompanying article, which are sent to ReScience in the form of a pull request (the process used on GitHub to submit a proposed modification to a software project). Partial replications that cover only some of the results in the original work are acceptable, but must be justified. If the submission respects these criteria, the editor assigns it to two reviewers for further evaluation and tests. 
The reviewers evaluate the code and the accompanying material in continuous interaction with the authors through the discussion section until both reviewers consider the work acceptable for publication. The goal of the review is thus to help the authors meet the ReScience quality standards through discussion. Since ReScience targets replication of already published work, the criteria of importance or novelty applied by most traditional journals are irrelevant. For a successful submission (i.e., partial or full replication) to be accepted, both reviewers must consider it reproducible and a valid replication of the original work. As we explained earlier, this means that the reviewers • are able to run the proposed implementation on their computers, • obtain the same results as indicated in the accompanying paper, Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 7/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 • consider these results sufficiently close to the ones reported in the original paper being replicated. For a failure to replicate submission to be accepted, we require extra steps to be taken. In addition to scrutiny of the submission by reviewers and editors, we will try to contact the authors of the original research, and issue a challenge to the community to spot and report errors in the new implementation. If no errors are found, the submission will be accepted and the original research will be declared non-replicable. Since independent implementation is a major feature of replication work, ReScience does not allow authors to submit replications of their own research, nor the research of close collaborators. Moreover, replication work should be based exclusively on the originally published paper, although exceptions are admitted if properly documented in the replication article. Mistakes in the implementation of computational models and methods are often due to biases that authors invariably have, consciously or not. Such biases will inevitably carry over to a replication. Perhaps even more importantly, cross-fertilization is generally useful in research, and trying to replicate the work of one’s peers might pave the way for a future collaboration, or may give rise to new ideas as a result of the replication effort. LESSONS LEARNED Although ReScience is still a young project, the submissions handled so far already provide valuable experience concerning the reproducibility and replicability of computational work in scientific research. Short-term and long-term reproducibility While some of the reasons for non-reproducibility are specific to each scientific domain, our experience has shown that there are also some common issues that can be identified. Missing code and/or data, undocumented dependencies, and inaccurate or imprecise description appear to be characteristic of much non-reproducible work. Moreover, these problems are not always easy to detect even for attentive reviewers, as we discovered when some articles published in ReScience turned out to be difficult to reproduce for someone else for exactly the reasons listed above. ReScience reviewers are scientists working in the same domain as the submitting authors, because familiarity with the field is a condition for judging if a replication is successful. But this also means that our reviewers share a significant common background with the authors, and that background often includes the software packages and programming languages adopted by their community. 
In particular, if both authors and reviewers have essential libraries of their community installed on their computers, they may not notice that these libraries are actually dependencies of the submitted code. While solutions to this problem evidently exist (ReScience could, for example, request that authors make their software work on a standard computational environment supplied in the form of a virtual machine), they represent an additional effort to authors and therefore discourage them from submitting replication work to ReScience. Moreover, the evaluation of de-facto reproducibility (‘‘works on my machine’’) Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 8/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 by reviewers is useful as well, because it tests the robustness of the code under small variations in the computational environments that are inevitable in real life. Our goal is to develop a set of recommendations for authors that represent a workable compromise between reproducibility, robustness, and implementation effort. These recommendations will evolve over time, and we hope that with improving technology we will ultimately reach full reproducibility over a few decades. Another issue with reproducibility is that with today’s computing technology, long- term reproducibility can only be achieved by imposing drastic constraints on languages and libraries that are not compatible with the requirements of research computing. This problem is nicely illustrated by Mesnard & Barba (2017) whose authors report trying to reproduce their own work performed two years earlier. Even though Barba’s group is committed to reproducible research practices, they did not escape the many problems one can face when trying to re-run a piece of code. As a consequence, code that is written for ReScience today will likely cease to be functional at some point in the future. The long-term value of a ReScience publication lies not just in the actual code but also in the accompanying article. The combination of the original article and the replication article provide a complete and consistent description of the original work, as evidenced by the fact that replication was possible. Even 5, 10, or 20 years later, a competent scientist should be able to replicate the work again thanks to these two articles. Of course, the new code can also help, but the true long-term value of a replication is the accompanying article. Open reviewing The well-known weaknesses of the traditional anonymous peer-reviewing system used by most scientific journals have motivated many experiments with alternative reviewing processes. The variant adopted by ReScience is similar to the ones used by F1000Research or PeerJ, but is even more radically open: anyone can look at ReScience submissions and at the complete reviewing process, starting from the assignment of an editor and the invitation of reviewers. Moreover, anyone with a GitHub account can intervene by commenting. Such interventions could even be anonymous because a GitHub account is not required to advertise a real name or any other identifying element. ReScience does currently require all authors, editors, and reviewers to provide real names (which however are not verified in any way), but there are valid reasons to allow anonymity for authors and reviewers, in particular to allow junior scientists to criticize the work of senior colleagues without fear of retribution, and we envisage exploring such options in the future. 
Our experience with this open reviewing system is very positive so far. The exchanges between reviewers and authors are constructive and courteous, without exception. They are more similar in style to a coffee-table discussion than to the judgement/defence style that dominates traditional anonymous reviewing. Once reviewers have been invited and have accepted the task, the editors’ main role is to ensure that the review moves forward, by gently reminding everyone to reply within reasonable delays. In addition, the editors occasionally answer questions by authors and reviewers about the ReScience publishing process. The possibility to involve participants beyond the traditional group of authors, editors, and reviewers is particularly interesting in the case of ReScience, because it can be helpful Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 9/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 to solicit input from the authors of the original study that is being replicated. For example, in one recent case (#28), a reviewer suggested asking the author of the original work for permission to re-use an image. The author intervened in the review and granted permission. Publishing on the GitHub platform GitHub is a commercial platform for collaborative software development based on the popular version control system git. It offers unlimited free use to public projects, defined as projects whose contents are accessible to everyone. All ReScience activities are organized around a few such Open Source projects hosted by GitHub. This is an unusual choice for a scientific journal, the only other journal hosted on GitHub being The Journal of Open Source Software (Smith et al., 2017). In this section, we discuss the advantages and problems resulting from this choice, considering both technical and social issues. There are clear differences between platforms for software development, such as GitHub, and platforms for scientific publishing, such as HighWire. The latter tend to be expensive commercial products developed for the needs of large commercial publishers, although the market is beginning to diversify with products such as Episciences. More importantly, to the best of our knowledge, no existing scientific publishing platform supports the submission and review of code, which is an essential part of every ReScience article. For this reason, the only option for ReScience was to adopt a software development platform and develop a set of procedures that make it usable for scientific publishing. Our experience shows that the GitHub platform provides excellent support for the reviewing process, which is not surprising given that the review of a scientific article containing code is not fundamentally different from the review of code with accompanying documentation. One potential issue for other journals envisaging adoption of this platform is the necessity that submitting authors have a basic knowledge of the version control system Git and of the techniques of collaborative software development. Given the code-centric nature of ReScience, this has not been a major problem for us, and the minor issues have been resolved by our editors providing technical assistance to authors. It is of course possible that potential authors are completely discouraged from submitting to ReScience by their lack of the required technical competence, but so far nobody has provided feedback suggesting that this is a problem. 
The main inconvenience of the GitHub platform is its almost complete lack of support for the publishing steps, once a submission has successfully passed the reviewing process. At this point, the submission consists of an article text in Markdown format plus a set of code and data files in a git repository. The desired archival form is an article in PDF format plus a permanent archive of the submitted code and data, with a Digital Object Identifier (DOI) providing a permanent reference. The Zenodo platform allows straightforward archiving of snapshots of a repository hosted on GitHub, and issues a DOI for the archive. This leaves the task of producing a PDF version of the article, which is currently handled by the managing editor of the submission, in order to ease the technical burden on our authors. A minor inconvenience of the GitHub platform is its implementation of code reviews. It is designed for reviewing contributions to a collaborative project. The contributor submits new code and modifications to existing code in the form of a ‘‘pull request’’, which other Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 10/17 https://peerj.com https://github.com/ReScience/ReScience-submission/pull/28 http://github.com/ https://git-scm.com/ http://home.highwire.org/ https://www.episciences.org/ https://zenodo.org/ http://dx.doi.org/10.7717/peerj-cs.142 project members can then comment on. In the course of the exchanges, the contributor can update the code and request further comments. Once everybody is satisfied, the contribution is ‘‘merged’’ into the main project. In the case of ReScience, the collaborative project is the whole journal, and each article submission is a contribution proposed as a pull request. This is, however, not a very intuitive representation of how a journal works. It would be more natural to have a separate repository for each article, an arrangement that would also facilitate the final publishing steps. However, GitHub does not allow code review on a new repository, only on contributions to an already existing one. Relying on a free-use offer on a commercial platform poses some additional problems for scientific publishing. GitHub can change its conditions at any time, and could in principle delete or modify ReScience contents at any time without prior notice. Moreover, in the case of technical problems rendering ReScience contents temporarily or permanently inaccessible, the ReScience community has no legal claims for compensation because there is no contract that would imply any obligations for GitHub. It would clearly be imprudent to count on GitHub for long-term preservation of ReScience content, which is why we deposit accepted articles on Zenodo, a platform designed for archiving scientific information and funded by research organizations as an element of public research infrastructure. The use of free services provided by GitHub and Zenodo was clearly important to get ReScience started. The incentives for the publication of replication work being low, and its importance being recognized only slowly in the scientific community, funding ReScience through either author page charges or grants would have created further obstacles to its success. A less obvious advantage of not having to organize funding is that ReScience can exist without being backed by any legal entity that would manage its budget. 
This makes it possible to maintain a community spirit focused on shared scientific objectives, with nobody in a position to influence ReScience by explicit or implicit threats of reducing future funding. OUTLOOK Based on our experience with the ReScience initiative, we can engage in informed speculation about possible future evolutions in scientific publishing, in particular concerning replication work. We will not discuss minor technical advances such as a better toolchain for producing PDF articles, but concentrate on long-term improvements in the technology of electronic publishing and, most of all, in the attitude of the scientific community towards the publication, preservation, and verification of computer-aided research. A fundamental technical issue is the difficulty of archiving or accurately describing the software environments in which computational scientists perform their work. A publication should be accompanied by both a human-readable description of this environment and an executable binary form. The human-readable description allows an inspection of the versions of all software packages that were used, for example to check for the impact of bugs that become known only after a study was published. The executable version enables other scientists to re-run the analyses and inspect intermediate results. Ideally, the Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 11/17 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.142 human-readable description would permit rebuilding the executable version, in the same way that software source code permits rebuilding executable binaries. This approach is pursued for example by the package manager Guix (Courtés & Wurmus, 2015). A more limited but still useful implementation of the same idea exists in the form of the conda package manager (Anaconda Inc., 2017), which uses a so-called environment file to describe and reconstruct environments. The main limitation compared to Guix is that the packages that make up a conda environment are themselves not reproducible. For example, a conda environment file does not state which compiler versions were used to build a package. Containerization, as implemented e.g., by Docker (Docker Inc., 2017) is currently much discussed, but provides only the executable version without a human-readable description. Moreover, the long-term stability of the container file format remains to be evaluated. History has shown that long-term stability in computing technology is achieved only by technology for which it is a design priority, as in the case of the Java Virtual Machine (Lindholm & Yellin, 1999). Docker, on the contrary, is promoted as a deployment technology with no visible ambition towards archiving of computational environments. Today’s electronic publishing platforms for scientific research still show their origins in paper-based publishing. Except for the replacement of printed paper by a printable PDF file, not much has changed. Although it is increasingly realized that software and data should be integral parts of most scientific publications today, they are at best relegated to the status of ‘‘supplementary material’’, and systematically excluded from the peer review process. In fact, to the best of our knowledge, ReScience is the only scientific journal that aims to verify the correctness of scientific software. As our experience has shown, it is far easier to graft publication onto a software development platform than to integrate software reviewing into a publishing platform. 
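As a concrete (and purely hypothetical) illustration of the human-readable environment description discussed above, a conda environment file for a replication might look as follows; the package names and version pins are invented for this example and are not taken from any ReScience submission:

name: replication-env      # hypothetical environment name
channels:
  - conda-forge
dependencies:
  - python=3.6             # interpreter version pinned for the replication
  - numpy=1.13
  - matplotlib=2.0

Running conda env create -f environment.yml rebuilds an equivalent environment, although, as noted above, the packages it pulls in are themselves not bit-for-bit reproducible.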
Furthermore, tools that will allow for the automated validation of computational models and the automated verification of correctness are being actively developed in the community (see, for example, SciUnit or OSB-model-validation). An integration of such frameworks, which would greatly enhance the verification and validation process, seems feasible for the existing software development platforms. A logical next step is to fully embrace the technology designed for software development, which far better takes into account the specificity of electronic information processing than today's scientific publishing systems. In addition to the proper handling of code, such an approach offers further advantages. Perhaps the most important one is a shift of focus from the paper as a mostly isolated and finished piece of work to scientific progress as a collection of incremental and highly interdependent steps. The Software Heritage project, whose aim is to create a permanent public archive of all publicly available software source code, adopts exactly this point of view for the preservation of software. As our experience with ReScience has shown, integrating the narrative of a scientific article into a framework designed for software development is not difficult at all. Publishing and archiving scientific research in Software Heritage would offer several advantages. The intrinsic identifiers that provide access to the contents of the archive permit unambiguous and permanent references to ongoing projects as well as to snapshots at a specific time, and to whole projects as well as to the individual files that are part of them. Such references hold the promise for better
reuse of scientific information, for better reproducibility of computations, and for fairer attribution of credit to scientists who contribute to research infrastructure. One immediate and legitimate question is to wonder to what extent a replication could be performed prior to the publication of the original article. This would strongly reinforce a claim because a successful and independent replication would be available right from the start. As illustrated in Fig. 1, this would require group A to contact group B and send them a draft of their original work (the one that would normally be submitted to a journal) such that group B could perform a replication and confirm or refute the results. In case of confirmation, a certified article could later be published with both groups as authors (each group being identified according to their respective roles). However, if the replication fails and the original work cannot be fixed, this would prevent publication. This model would improve the quality of computational research and also considerably slow down the rapid pace of publication we are observing today. Unfortunately, such a scenario seems highly improbable today. The pressure to publish is so strong and the incentive for doing replication so low that it would most probably prevent such collaborative work. However, we hope that the current replication crisis will lead to a change in attitude, with an emphasis on the quality rather than the quantity of scientific output, with CoScience becoming the gold-standard approach to quality assurance.

Figure 1. (A) The ReScience publication chain starts from an original research article by authors A, published in a journal, in conference proceedings, or as a preprint. This article constitutes the base material for authors B, who attempt to replicate the work based on its description. Success or failure to replicate is not a criterion for acceptance or rejection, even though failure to replicate requires more precaution to ensure this is not a misunderstanding or a bug in the new code. After review, the replication is published, and feedback is given to the original authors (and editors) to inform them that the work has been replicated (or not). (B) The CoScience proposal would require the replication to happen before the actual publication. In case of failure, nothing will be published. In case of success, the publication will be endorsed by authors A and authors B with identified roles and will be certified as reproducible because it has been replicated by an independent group. (Full-size DOI: 10.7717/peerjcs.142/fig-1)

ADDITIONAL INFORMATION AND DECLARATIONS
Funding
The authors received no funding for this work.
Competing Interests
Federico Vaggi is an employee of Amazon, Inc., Roman Yurchak is an employee of Symerio, and C. Titus Brown and Nicolas P. Rougier are Academic Editors for PeerJ.
Author Contributions
• Nicolas P. Rougier wrote the paper, prepared figures and/or tables, reviewed drafts of the paper, co-founder, editor, author.
• Konrad Hinsen wrote the paper, reviewed drafts of the paper, co-founder, editor.
• Frédéric Alexandre, Alexandra K. Diem, Rafael N. Henriques, Owen Petchey, Frank Stollmeier and Guillaume Viejo reviewed drafts of the paper, author.
• Thomas Arildsen, Pierre de Buyl and Olivia Guest wrote the paper, reviewed drafts of the paper, editor.
• Lorena A. Barba, C. Titus Brown, Timothée Poisot, Karthik Ram and Tiziano Zito reviewed drafts of the paper, editor.
• Fabien C.Y. Benureau, Ozan Caglayan, Andrew P. Davison, Marc-André Delsuc and Etienne Roesch wrote the paper, reviewed drafts of the paper, reviewer.
• Georgios Detorakis, Mehdi Khamassi, Aaron Shifman and Julien Vitay reviewed drafts of the paper, reviewer, author.
• Damien Drix, Pierre Enel, Matt G. Hall, Xavier Hinaut, Kamil S. Jaron, Almar Klein, Tiina Manninen, Pietro Marchesi, Daniel McGlinn, Hans Ekkehard Plesser, Yoav Ram, Cyrille Rossant, Marcel Stimberg, Federico Vaggi, Anya E. Vostinar and Roman Yurchak reviewed drafts of the paper, reviewer.
• Benoît Girard wrote the paper, reviewed drafts of the paper, editor, reviewer, author.
• Christoph Metzner wrote the paper, reviewed drafts of the paper, reviewer, author.
• Vahid Rostami and Joseph Stachelek wrote the paper, reviewed drafts of the paper, author.
Data Availability The following information was supplied regarding data availability: ReScience journal: https://zenodo.org/communities/rescience/. REFERENCES Anaconda Inc. 2017. Conda. Available at https://conda.io/ . Baker M. 2016. 1, 500 scientists lift the lid on reproducibility. Nature 533(7604):452–454 DOI 10.1038/533452a. Colberg C, Proebsting TA. 2016. Repeatability in computer systems research. Communi- cations of the ACM 59(3):62–69 DOI 10.1145/2812803. Coudert F-X. 2017. Reproducible research in computational chemistry of materials. Chemistry of Materials 29(7):2615–2617 DOI 10.1021/acs.chemmater.7b00799. Courtès L, Wurmus R. 2015. Reproducible and user-controlled software environments in HPC with Guix. In: Hunold S, Costan A, Giménez D, Iosup A, Ricci L, Requena MEG, Scarano V, Varbanescu AL, Scott SL, Lankes S, Weidendorfer J, Alexander M, eds. Euro-Par 2015: parallel processing workshops. Lecture notes in computer science, vol. 9523. Cham: Springer. Crook SM, Davison AP, Plesser HE. 2013. 20 years of computational neuroscience. In: Bower MJ, ed. Chap. Learning from the past: approaches for reproducibility in computational neuroscience. New York: Springer New York, 73–102. Davison AP. 2012. Automated capture of experiment context for easier reproducibility in computational research. Computing in Science and Engineering 14:48–56 DOI 10.1109/MCSE.2012.41. Docker Inc. 2017. Docker. Available at https://www.docker.com/ . Donoho DL, Maleki A, Rahman IU, Shahram M, Stodden V. 2009. Reproducible research in computational harmonic analysis. Computing in Science Engineering 11(1):8–18 DOI 10.1109/MCSE.2009.15. Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 15/17 https://peerj.com https://zenodo.org/communities/rescience/ https://conda.io/ http://dx.doi.org/10.1038/533452a http://dx.doi.org/10.1145/2812803 http://dx.doi.org/10.1021/acs.chemmater.7b00799 http://dx.doi.org/10.1109/MCSE.2012.41 https://www.docker.com/ http://dx.doi.org/10.1109/MCSE.2009.15 http://dx.doi.org/10.7717/peerj-cs.142 Guo PJ, Engler D. 2011. CDE: using system call interposition to automatically create portable software packages. In: Proceedings of the 2011 USENIX annual technical conference, USENIX’11. Portland: USENIX Association. Available at http://dl.acm. org/citation.cfm?id=2002181.2002202. Halchenko YO, Hanke M. 2015. Four aspects to make science open ‘‘by design’’ and not as an after-thought. GigaScience 4(1) DOI 10.1186/s13742-015-0072-7. Hinsen K. 2015. Writing software specifications. Computing in Science & Engineering 17(3):54–61 DOI 10.1109/mcse.2015.64. Ioannidis JPA. 2005. Why most published research findings are false. PLOS Medicine 2(8):e124 DOI 10.1371/journal.pmed.0020124. Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JPA. 2016. Reproducible research practices and transparency across the biomedical literature. PLOS Biology 14(1):e1002333 DOI 10.1371/journal.pbio.1002333. Janz N. 2015. Bringing the gold standard into the class room: replication in university teaching. International Studies Perspectives Epub ahead of print Mar 9 2015 DOI 10.1111/insp.12104. Kitzes J, Turek D, Deniz F (eds.) 2017. The practice of reproducible research: case studies and lessons from the data-intensive sciences. Oakland: University of California Press. Lindholm T, Yellin F. 1999. Java virtual machine specification. Second Edition. Boston: Addison-Wesley Longman Publishing Co., Inc. Manninen T, Havela R, Linne M-L. 2017. 
Reproducibility and comparability of com- putational models for astrocyte calcium excitability. Frontiers in Neuroinformatics 11:11 DOI 10.3389/fninf.2017.00011. Mesnard O, Barba LA. 2017. Reproducible and replicable CFD: it’s harder than you think. IEEE/AIP Computing in Science and Engineering 19(4):44–55 DOI 10.1109/mcse.2017.3151254. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Du Sert NP, Simon- sohn U, Wagenmakers E-J, Ware JJ, Ioannidis JPA. 2017. A manifesto for repro- ducible science. Nature Human Behaviour 1(1):0021 DOI 10.1038/s41562-016-0021. Murta L, Braganholo V, Chirigati F, Koop D, Freire J. 2015. noWorkflow: capturing and analyzing provenance of scripts. In: Provenance and annotation of data and processes. Lecture notes in computer science, vol. 8628. Berlin: Springer International Publishing, 71–83. Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349(6251):aac4716–aac4716 DOI 10.1126/science.aac4716. Perkel J. 2016. Democratic databases: science on GitHub. Nature 538(7623):127–128 DOI 10.1038/538127a. Sandve GK, Nekrutenko A, Taylor J, Hovig E. 2013. Ten simple rules for repro- ducible computational research. PLOS Compututational Biology 9(10):e1003285 DOI 10.1371/journal.pcbi.1003285. Smith AM, Niemeyer KE, Katz DS, Barba LA, Githinji G, Gymrek M, Huff KD, Madan CR, Cabunoc Mayes A, Moerman KM, Prins P, Ram K, Rokem A, Teal TK, Valls Rougier et al. (2017), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.142 16/17 https://peerj.com http://dl.acm.org/citation.cfm?id=2002181.2002202 http://dl.acm.org/citation.cfm?id=2002181.2002202 http://dx.doi.org/10.1186/s13742-015-0072-7 http://dx.doi.org/10.1109/mcse.2015.64 http://dx.doi.org/10.1371/journal.pmed.0020124 http://dx.doi.org/10.1371/journal.pbio.1002333 http://dx.doi.org/10.1111/insp.12104 http://dx.doi.org/10.3389/fninf.2017.00011 http://dx.doi.org/10.1109/mcse.2017.3151254 http://dx.doi.org/10.1038/s41562-016-0021 http://dx.doi.org/10.1126/science.aac4716 http://dx.doi.org/10.1038/538127a http://dx.doi.org/10.1371/journal.pcbi.1003285 http://dx.doi.org/10.7717/peerj-cs.142 Guimera R, Vanderplas JT. 2017. Journal of Open Source Software (JOSS): design and first-year review. ArXiv preprint. arXiv:1707.02264. Stachelek J. 2016. [Re] least-cost modelling on irregular landscape graphs. ReScience 2(1) DOI 10.5281/zenodo.45852. Topalidou M, Leblois A, Boraud T, Rougier NP. 2015. A long journey into repro- ducible computational neuroscience. Frontiers in Computational Neuroscience 9:30 DOI 10.3389/fncom.2015.00030. Topalidou M, Rougier NP. 2015. [Re] interaction between cognitive and motor cortico- basal ganglia loops during decision making: a computational study. ReScience 1(1) DOI 10.5281/zenodo.47146. Viejo G, Girard B, Khamassi M. 2016. [Re] speed/accuracy trade-off between the habitual and the goal-directed process. ReScience 2(1) DOI 10.5281/zenodo.27944. Wilson G. 2016. Software carpentry: lessons learned. F1000Research 3:62 DOI 10.12688/f1000research.3-62.v2. Wilson G, Aruliah DA, Brown CT, Hong NPC, Davis M, Guy RT, Haddock SHD, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P. 2014. Best practices for scientific computing. PLOS Biology 12(1):e1001745 DOI 10.1371/journal.pbio.1001745. Rougier et al. (2017), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.142 17/17

work_2pr3au7mozfx3bfmyg4bc4yo7u ---- MKL-GRNI: A parallel multiple kernel learning approach for supervised inference of large-scale gene regulatory networks
Nisar Wani1 and Khalid Raza2
1 Govt. Degree College Baramulla, Jammu & Kashmir, India
2 Department of Computer Science, Jamia Millia Islamia, New Delhi, India
ABSTRACT
High throughput multi-omics data generation coupled with heterogeneous genomic data fusion are defining new ways to build computational inference models. These models are scalable and can support very large genome sizes, with the added advantage of exploiting additional biological knowledge from the integration framework. However, the limitation of such an arrangement is the huge computational cost involved when learning from very large datasets in a sequential execution environment. To overcome this issue, we present a multiple kernel learning (MKL) based gene regulatory network (GRN) inference approach wherein multiple heterogeneous datasets are fused using the MKL paradigm. We formulate the GRN learning problem as a supervised classification problem, whereby genes regulated by a specific transcription factor are separated from other non-regulated genes. A parallel execution architecture is devised to learn a large-scale GRN by decomposing the initial classification problem into a number of subproblems that run as multiple processes on a multi-processor machine. We evaluate the approach in terms of increased speedup and inference potential using genomic data from Escherichia coli, Saccharomyces cerevisiae and Homo sapiens. The results thus obtained demonstrate that the proposed method exhibits better classification accuracy and enhanced speedup compared to other state-of-the-art methods while learning large-scale GRNs from multiple and heterogeneous datasets.
Subjects: Bioinformatics, Computational Biology, Data Mining and Machine Learning
Keywords: Gene regulatory networks, GRN inference, large-scale GRN, Systems biology, Network biology
Submitted 19 October 2020; accepted 29 December 2020; published 28 January 2021. Corresponding author: Khalid Raza, kraza@jmi.ac.in. Academic editor: Othman Soufan. DOI 10.7717/peerj-cs.363. Copyright 2021 Wani and Raza, distributed under a Creative Commons CC-BY 4.0 license.
INTRODUCTION
The problem of understanding gene interactions and their influence through network inference and analysis is of great significance in systems biology (Albert, 2007). The aim of this inference process is to establish relationships between genes and construct a network topology based on the evidence provided by different data types. Among various network inference studies, gene regulatory network inference (GRNI) has remained of particular interest to researchers, with extensive scientific literature generated in this domain. Gene regulatory networks (GRNs) are biological networks where genes serve as nodes and the edges connecting them serve as regulatory relations (Lee et al., 2002; Raza & Alam, 2016). Standard methods for GRN inference such as RELNET
(Butte & Kohane, 1999), ARACNE (Margolin et al., 2006), CLR (Faith et al., 2007), SIRENE (Mordelet & Vert, 2008) and GENIE3 (Huynh-Thu et al., 2010) mostly use transcriptomic data for GRN inference. Among these methods, our approach is modeled along the same principle as SIRENE. SIRENE is a general method to infer unknown regulatory relationships between known transcription factors (TFs) and all the genes of an organism. It uses a vector of gene expression data and a list of known regulatory relationships between known TFs and their target genes. However, integration of this data with other genomic data types such as protein–protein interaction (PPI), methylation expression, sequence similarity and phylogenetic profiles has drastically improved GRN inference (Hecker et al., 2009). A comprehensive list of state-of-the-art data integration techniques for GRN inference has been reviewed in Wani & Raza (2019a). In this article, we aim to integrate gene expression, methylation expression and TF-DNA interaction data using the advanced multiple kernel learning (MKL) library provided by the Shogun machine learning toolbox (Sonnenburg et al., 2010) and design an algorithm to infer gene regulatory networks (GRNs). Besides, we also integrate PPI data and other data such as gene ontology information as sources of prior knowledge to enhance the accuracy of network inference. The problem of network inference is modeled as a binary classification problem whereby a gene being regulated by a given TF is treated as a positive label and negative otherwise. To infer a large-scale network, the MKL model needs to be trained for each TF with a set of known regulations for the whole genome. Given N TFs, we need to train N different classification models individually and then combine the results from these models for a complete network inference task. As the number of TFs increases, the number of classification models also increases, creating resource deficiencies and long execution times for the inference algorithm. The proposed approach attempts to provide a solution to this problem by distributing these classification models to different processors of a multi-processor hardware platform using a parallel processing library from Python. The results from these models are stored in a shared queue object which is later used for network inference. A detailed description of the model is contained in the Methods section.
RELATED LITERATURE
An early attempt to learn and classify gene function from integrated datasets using kernel methods was carried out by Pavlidis et al. (2002). They trained a support vector machine (SVM) for gene function classification with a heterogeneous kernel derived from a combination of two different types of data (e.g., gene expression and phylogenetic profiles).
Since SVM does not learn from multiple kernel matrices simultaneously, they proposed three different ways to fuse two datasets and referred to these fusion methods as (i) early integration, (ii) intermediate integration and (iii) late integration. In early integration, feature vectors from heterogeneous data types are concatenated to build a single, longer vector for a given set of genes. This extended dataset is then transformed into a kernel matrix using an appropriate kernel function and serves as an input to the SVM model, from which biological inferences can be drawn. In the case of intermediate integration, the two datasets are first transformed into their respective kernel matrices; subsequently, these kernel matrices are added together to yield an integrated kernel for SVM training. For late integration, the authors trained the SVM models individually using the heterogeneous datasets; the probability scores, which act as discriminant values obtained from the separate SVM models, are then added together for gene function prediction. In fact, kernel-based methods as effective integration techniques were first proposed by Lanckriet et al. (2004), wherein a 1-norm soft margin SVM is trained for a classification problem separating membrane proteins from ribosomal proteins. They combined heterogeneous biological datasets such as PPI, amino acid sequences and gene expression data characterizing different proteins by transforming them into multiple positive semidefinite kernel matrices using different kernel functions. Their findings reveal an improved classifier performance when all datasets are integrated as a unit compared to testing the classifier on individual datasets. In an earlier study (Lanckriet et al., 2003) on function prediction for baker's yeast proteins, they trained an SVM classifier with multiple datasets of different types and achieved an improved performance over a classifier trained using a single data type. In yet another study on network inference using kernel data integration (Yamanishi, Vert & Kanehisa, 2004), the authors fused four different datasets, namely gene expression data, protein interaction data, protein localization data and data from phylogenetic profiles. These datasets are transformed into different kernel matrices. Datasets comprising gene expression, protein localization and phylogenetic profiles were kernelized using Gaussian, polynomial and linear kernel functions, while graph datasets were kernelized using a diffusion kernel (Kondor & Lafferty, 2002). This study compared both unsupervised and supervised inference methods on single and integrated datasets. To assess the accuracy of the methods, the inferred networks are compared with a gold standard protein network. Contrary to the unsupervised approaches, the supervised approach seems to make interesting predictions and captures most of the information from the gold standard. They observed that data from transcriptomic and phylogenetic profiles seem to contribute an equal quantum of information, followed by noisy PPI and localization data. Applying a supervised approach to integrated datasets seems to produce the overall best results, therefore highlighting the importance of guided network inference from integrated prior biological knowledge.
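To make the early versus intermediate integration schemes described above concrete, the following minimal Python sketch (an illustration written for this text, not code from any of the cited studies) builds a Gaussian kernel either on concatenated feature vectors or on each data type separately and then sums the kernels; the toy matrices expr and phylo are hypothetical stand-ins for expression and phylogenetic profiles.

import numpy as np

def rbf_kernel(X, gamma=0.1):
    # Gaussian (RBF) kernel from pairwise squared Euclidean distances
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 200))    # 50 genes x 200 expression features (toy data)
phylo = rng.normal(size=(50, 80))    # 50 genes x 80 phylogenetic features (toy data)

# Early integration: concatenate the feature vectors, then build one kernel
K_early = rbf_kernel(np.hstack([expr, phylo]))

# Intermediate integration: one kernel per data type, then add the kernels
K_intermediate = rbf_kernel(expr) + rbf_kernel(phylo)

print(K_early.shape, K_intermediate.shape)    # both (50, 50)

Both constructions yield valid positive semi-definite kernels of the same size and can be passed to any kernel classifier; late integration would instead train one SVM per kernel and combine their discriminant scores.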
In another study, Ben-Hur & Noble (2005) applied kernel methods to PPI studies and proposed a pairwise kernel between two pairs of proteins in order to construct a similarity matrix. This pairwise kernel is based on three sequence kernels: a spectrum kernel, a motif kernel, and a Pfam kernel. They further extended this experiment to explore the effect of adding kernels from non-sequence data, such as gene ontology annotations, homology scores and the mutual clustering coefficient (MCC) derived from protein interactions computed in each cross-validation fold. Integrating these non-sequence features with the pairwise kernel resulted in better performance than any method by itself. Another integration and supervised learning method that uses MKL is Feature Selection Multiple Kernel Learning (FSMKL), proposed by Seoane et al. (2013). The feature selection is performed on a variable number of features per kernel, separating feature sets from each data type with greater relevance to the given problem. The selection criterion uses statistical scoring, ranking features that are statistically aligned with the class labels, together with biological insights, whereby genes that are present in a specific pathway are chosen. They integrate gene expression, copy number variation and other genomic data from KEGG pathways. These data are transformed into their base kernels and integrated using the MKL framework into a combined kernel. The prior biological knowledge in the form of pathway information serves as the central criterion for FSMKL to cluster samples. The authors claim that FSMKL performance is comparable to the other state-of-the-art breast cancer prognosis methods from the DREAM challenge. Speicher & Pfeifer (2015) adopted an unsupervised approach to discover cancer subtypes from an integrated kernel using MKL. The proposed method, called Regularized MKL Locality Preserving Projections (rMKL-LPP), integrates multi-omics data such as gene expression, DNA methylation and miRNA expression profiles of multiple cancer types from TCGA (Tomczak, Czerwińska & Wiznerowicz, 2015). This regularized version extends the dimensionality reduction variant of the MKL technique (MKL-DR) proposed by Yan et al. (2007). The regularization term allows different types of kernels to be used during the optimization process and also avoids overfitting. They cluster the samples by applying k-means on the distance summation of each sample's k-nearest neighbors obtained by applying Locality Preserving Projections (LPP). Many approaches have also been proposed for parameter estimation of such large-scale and integrated models. Besides cross-validation, grid search and randomised parameter optimization methods, Remli et al. (2019) proposed a cooperative enhanced scatter search for parameter estimation of high-dimensional biological models. Their proposed method is executed in a parallel environment and can be faster than other methods in providing accurate estimates of model parameters. The multiple kernel learning approach has also been applied to the domain of drug-target interaction network inference and drug bioactivity prediction. For drug-target interaction prediction, Nascimento, Prudêncio & Costa (2016) proposed a new MKL-based algorithm that selects and combines kernels automatically on a bipartite drug-protein prediction problem.
Their proposed method extends the Kronecker regularized least squares approach (KronRLS) (Van Laarhoven, Nabuurs & Marchiori, 2011) to fit in an MKL setting. The method uses L2 regularization to produce a non-sparse combination of base kernels. The proposed method can cope with large drug vs. target interaction matrices, does not require sub-sampling of the drug-target network, and is also able to combine and select relevant kernels. They performed a comparative analysis of their proposed method with top performers from single and integrative kernel approaches and demonstrated the competitiveness of KronRLS-MKL in all the evaluated scenarios. Similarly, for drug bioactivity prediction, Cichonska et al. (2018) proposed a pairwise MKL method in order to address the scalability issues in handling massive pairwise kernel matrices, in terms of both the computational complexity and the memory demands of such prediction problems. The proposed method has been successfully applied to drug bioactivity inference problems and provides a general approach to other pairwise MKL spaces. Since MKL is applied to solve large-scale learning problems, various efforts have been undertaken to devise schemes whereby the MKL algorithm can be run in a multiprocessor and distributed computational environment. The authors in Chen & Fan (2014) proposed a parallel multiple kernel learning (PMKL) approach using a hybrid alternating direction method of multipliers (H-ADMM). The proposed method makes the local processors co-ordinate with each other to achieve the global solution. The results of their experiments demonstrated that PMKL displays fast execution times and higher classification accuracies. Another important study addressing the scalability and computational requirements of large-scale learning has been carried out by Alioscha-Perez, Oveneke & Sahli (2019). They proposed SVRG-MKL, an MKL solution with inherent scalability properties that can combine multiple descriptors involving millions of samples. They conducted extensive experimental validation of their proposed method on several benchmarking datasets, confirming a higher accuracy and a significant speedup for SVRG-MKL. In one of our recent works, we proposed a data fusion and inference model, called iMTF-GRN, based on non-negative matrix tri-factorization that integrates diverse types of biological data (Wani & Raza, 2019b). The advantage of our proposed parallel MKL-GRNI approach is that it is simple to implement and does not need complex coding to distribute multiple classification problems in a multiprocessor environment. Our method employs shared queue objects for distributing inputs and collecting outputs from multiple processors, compared to PMKL (Chen & Fan, 2014), where multiple processors are explicitly made to co-ordinate using the hybrid alternating direction method of multipliers (H-ADMM), introducing complexity and an added computational overhead. Also, we chose the basic addition operation to fuse multiple kernels, compared to the KronRLS-MKL (Cichonska et al., 2018) method, where the fusion of multiple kernels is achieved by performing a Kronecker product operation that requires calculating the inverse of individual kernels, hence a computational overhead compared to a basic arithmetic operation.
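The design choice just described—additive kernel fusion versus Kronecker-product fusion—can be illustrated with a small sketch written for this text (not taken from any of the cited methods); it only contrasts the sizes of the fused matrices, using random positive semi-definite matrices as stand-ins for data-derived kernels.

import numpy as np

rng = np.random.default_rng(1)
n = 30    # number of genes, kept small so the Kronecker product stays manageable

def random_psd(n):
    # A @ A.T is symmetric positive semi-definite, standing in for a real kernel
    A = rng.normal(size=(n, n))
    return A @ A.T

K_expr, K_meth = random_psd(n), random_psd(n)

K_sum = K_expr + K_meth            # additive fusion: result stays n x n
K_kron = np.kron(K_expr, K_meth)   # Kronecker-style fusion: result is n^2 x n^2

print(K_sum.shape, K_kron.shape)   # (30, 30) (900, 900)

For genome-scale problems, this quadratic-versus-quartic growth in matrix size is what makes the simple sum attractive here.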
Also, for the MKL implementation we used the Shogun toolbox, which is a highly optimized, stable and efficient tool developed in C++ by Sonnenburg et al. (2010), making it a suitable candidate for computing-intensive and large-scale learning problems.
MATERIALS AND METHODS
The proposed method adopts a supervised approach to learn new interactions between a TF and the whole genome of an organism. The algorithm operates on multiple datasets that characterize the genes of an organism. Since we are adopting an integrated approach, datasets such as gene expression, known TF-gene regulations, PPI and DNA-methylation data can be combined using the MKL approach. All these datasets are carefully chosen owing to their role in gene regulation. The TF-gene interaction data serves a dual purpose: it supplies the algorithm with prior knowledge about the regulatory relationships, and, for each TF, the known target gene list also forms the labels for the MKL classifier. For each TF, a set of known gene targets serves as positive examples. For negative examples, we divide our input into two subsets; the MKL classifier is trained using the positive examples, for which no prediction is needed, and the other subset contains the negative examples. We perform 10-fold cross-validation using the same scheme and obtain discriminant values for all the genes with no prior regulation knowledge for this TF. This whole procedure is repeated for all the TFs. The idea here is to identify the set of genes whose expression profiles match those of the positive examples, even though the classifier is supplied with some false negative examples in the training set. A graphical overview of this architecture is depicted in Fig. 1.
The problem of GRN inference from integrated datasets through supervised learning using MKL is not a trivial task, and its complexity rises manifold when considering GRN inference for organisms with large genome sizes. In this scenario, model training and testing become TF specific. Therefore, the inference problem is decomposed into a set of classification subproblems corresponding to the total number of TFs present in the input gene-TF interaction matrix. A sequential approach to such a problem scenario would require running each subproblem one after the other in a loop; however, as we increase the number of TFs, the execution time of the algorithm also increases. To overcome such problems, we devise a strategy of parallel execution for the algorithm wherein multiple subproblems run simultaneously across different processors of a multi-processor hardware platform, as explained in Algorithm 1. Outputs generated by each model in the form of confidence scores (the probability that a given TF regulates a gene) are stored in a shared queue object. Once all the subproblems finish their execution, the shared object is iterated to collect the results generated by all the models in order to build a single output matrix. In case the number of TFs is greater than the number of available processors, they are split into multiple groups and dispatched to each processor such that all the processors receive an equal number of classification models. A minimal sketch of this scheme is given below, and the full procedure is summarized in Algorithm 1.
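The following Python sketch illustrates the per-TF job distribution and the shared queue described above; it is an illustrative outline, not the released MKL-GRNI code, and the worker function train_one_tf (which here ignores the labels and returns dummy scores) is a hypothetical placeholder for the actual Shogun-based MKL training step.

import multiprocessing as mp
import numpy as np

def train_one_tf(tf_name, K, labels, queue):
    # Hypothetical worker: a real version would fit an MKL/SVM model for this TF
    # on the combined kernel K using its regulation labels; here we only emit
    # dummy decision scores of the correct length.
    scores = np.random.default_rng().normal(size=K.shape[0])
    queue.put((tf_name, scores))

if __name__ == "__main__":
    n_genes = 1000
    K = np.eye(n_genes)                        # stand-in for the combined kernel
    tf_list = ["TF%d" % i for i in range(32)]  # e.g., 32 transcription factors
    rng = np.random.default_rng(0)
    R = {tf: rng.choice([-1, 1], size=n_genes) for tf in tf_list}  # label columns

    queue = mp.Queue()
    jobs = []
    for tf in tf_list:                         # one process per TF subproblem
        p = mp.Process(target=train_one_tf, args=(tf, K, R[tf], queue))
        p.start()
        jobs.append(p)

    results = dict(queue.get() for _ in jobs)  # drain the shared queue
    for p in jobs:
        p.join()

    DS = np.column_stack([results[tf] for tf in tf_list])  # genes x TFs scores
    print(DS.shape)                            # (1000, 32)

A production version would typically cap the number of worker processes at the CPU count (for example with multiprocessing.Pool) rather than spawning one process per TF.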
Figure 1. Application architecture of MKL-GRNI: (A) combined kernel; (B) decomposed regulation matrices; (C) parallel distribution and model building; (D) model execution; (E) writing results to a shared object. (Full-size DOI: 10.7717/peerj-cs.363/fig-1)

Algorithm 1: MKL-GRNI, a parallel approach for supervised inference of large-scale gene regulatory networks.
Input: k datasets D1, D2, …, Dk
Input: regulation binary matrix R for classification labels
Output: a matrix of decision scores DS for TF-gene interactions
begin
  Transform D1, D2, …, Dk into kernels K1, K2, …, Kk using appropriate kernel functions
  Fuse the k kernels as K = K1 + K2 + … + Kk
  Define the MKL parameters params (C, norm, epsilon)
  /* Distribute the source TFs among multiple CPUs */
  foreach cpu in the CPU list do in parallel
    foreach TF in the source TF list assigned to this cpu do
      /* Set MKL parameters and data */
      mkl.kernel ← K
      mkl.labels ← R[:, TF]
      mkl.parameters ← params
      /* Obtain decision scores for the MKL algorithm between each TF and all genes in the genome */
      DS_TF ← ApplyMKL()
      put DS_TF in queue Q
    end
  end
  foreach q in Q do
    DS[:, TF(q)] ← q.val
  end
end

Kernel methods for genomic data fusion
Kernel methods represent a mathematical framework which embeds data points (genes, proteins, drugs, etc.) from an input space I into a feature space F by employing a kernel function. Genomic datasets, viz. mRNA expression levels from RNA-seq, DNA methylation profiles and the TF-gene regulation matrix obtained from different databases, comprise heterogeneous datasets that can be fused using kernel methods and serve as the building blocks for inference of gene regulatory networks. As a modular and generic approach to pattern analysis, kernel methods can operate on very high dimensional data in feature space by performing an inner product on the input data using a kernel function (Shawe-Taylor & Cristianini, 2004). An algorithm is devised that can work with such data and learn patterns. Such an algorithm is more generic as it operates on any data type that can be kernelized. These kernels are data specific, such as Gaussian, polynomial and sigmoid kernels for vectorial data, diffusion kernels for graph data, and string kernels for different types of sequence data. The kernel part being data specific creates a flexible and modular approach to combining multiple modules to obtain complex learning systems. A graphical depiction of this fusion technique is shown in Fig. 2.

Figure 2. Genomic data fusion by combining kernel matrices from multiple kernels into a single combined kernel. (Full-size DOI: 10.7717/peerj-cs.363/fig-2)

The choice of different kernel functions for transforming datasets into their respective kernel matrices is made after a thorough analysis of the literature in the field of kernel methods and MKL methods.
MKL model
Multiple kernel learning is based on integrating many features of objects such as genes, proteins, drugs, etc., via their kernel matrices and represents a paradigm shift from
machine learning models that use single object features (Sonnenburg et al., 2006). This combined information from multiple kernel matrices is provided as an input to the MKL algorithm to perform classification/regression tasks on unseen data. The information represented by the kernel matrices can be combined by applying basic algebraic operations, such as addition, multiplication and exponentiation, such that the positive semi-definiteness of the candidate kernels is preserved in the final kernel matrix. The resultant kernel can be defined by the following equations, using k1 and k2 as candidate kernel matrices and ϕ1(x) and ϕ2(x) as their corresponding embeddings in the feature space:
K = k1 + k2    (1)
with the new induced embedding
ϕ(x) = (ϕ1(x), ϕ2(x))    (2)
Given a kernel set K = {k1, k2, …, km}, an affine combination of m parametrized kernels can be formed as:
K = Σ_{i=1}^{m} μi ki    (3)
subject to the constraint that the weights μi are positive, that is, μi ≥ 0, i = 1, …, m. With these kernel matrices as input, a statistical classifier such as SVM separates the two classes using a linear discriminant by inducing a margin in the feature space. To find this discriminant, an optimization problem known as a quadratic program (QP) needs to be solved. QP belongs to a class of convex optimization problems, which are easily solvable. The Shogun toolbox solves this MKL optimization problem using semidefinite programming (SDP), first implemented for MKL learning by Lanckriet et al. (2004). Based on this margin, we classify SVM algorithms into hard, 1-norm soft and 2-norm soft margin SVMs. Here we use the 1-norm soft margin SVM and SDP for MKL optimization and classification from heterogeneous datasets, as explained in our earlier work on MKL for biomedical image analysis (Wani & Raza, 2018). A detailed literature on SVM algorithms is covered in Scholkopf & Smola (2001).
Datasets
To test the parallel MKL algorithm on multiple datasets, we downloaded gene expression data of Escherichia coli and Saccharomyces cerevisiae from the DREAM5 network inference challenge (Marbach et al., 2012), along with their gold standard networks, and human breast cancer transcriptomic data from TCGA. Some prominent features of these data are shown in Table 1. Because the MKL paradigm provides the platform to fuse heterogeneous datasets, we downloaded PPI data for both E. coli and S. cerevisiae from the STRING database (Szklarczyk et al., 2011). The PPI data is supplied as prior biological knowledge to the algorithm in order to improve its inference accuracy, as MKL can learn from multiple datasets. To supplement the human transcriptome with additional biological knowledge, we download DNA methylation expression data for all the genes in the transcriptome from the TCGA Broad Institute data portal (https://gdac.broadinstitute.org/). The regulation data (i.e., known interactions between genes and TFs) for E. coli and S. cerevisiae were extracted from the gold standard network provided in the DREAM dataset; however, for GRN inference in humans, the regulation data has been collected from a number of databases that store TF-gene interaction data derived from ChIP-seq and ChIP-ChIP experiments. We collected a list of 66 TFs from the ENCODE data portal (https://www.
Datasets
To test the parallel MKL algorithm on multiple datasets, we downloaded gene expression data of Escherichia coli and Saccharomyces cerevisiae from the DREAM5 network inference challenge (Marbach et al., 2012), along with their gold standard networks, and human breast cancer transcriptomic data from TCGA. Some prominent features of these data are shown in Table 1. Because the MKL paradigm provides the platform to fuse heterogeneous datasets, we also downloaded PPI data for both E. coli and S. cerevisiae from the STRING database (Szklarczyk et al., 2011). The PPI data is supplied as prior biological knowledge to the algorithm in order to improve its inference accuracy, as MKL can learn from multiple datasets. To supplement the human transcriptome with additional biological knowledge, we downloaded DNA methylation expression data for all the genes in the transcriptome from the TCGA Broad Institute data portal (https://gdac.broadinstitute.org/). The regulation data (i.e., known interactions between genes and TFs) for E. coli and S. cerevisiae were extracted from the gold standard network provided in the DREAM dataset; for GRN inference in humans, however, the regulation data was collected from a number of databases that store TF-gene interaction data derived from ChIP-seq and ChIP-ChIP experiments. We collected a list of 66 TFs from the ENCODE data portal (https://www.encodeproject.org/) for which ChIP-seq experiments were carried out on MCF7 breast cancer cell lines across different experimental labs. The targets of these TFs were extracted from the ENCODE (ENCODE Project Consortium, 2004), TRED (Jiang et al., 2007) and TRRUST (Han et al., 2015) databases.

Table 1. Dataset description of different organisms for supervised GRN inference.
Organism       Genes   Samples  Transcription factors  Known regulations  Known targets
E. coli         4,297      805                    140              1,979            953
S. cerevisiae   5,657      536                    120              4,000          2,721
Homo sapiens   19,201    1,212                     66             73,052         12,028

Hardware and software requirements
The hardware platform used in this study is an IBM System X3650 M4 server with an Intel Xeon processor having 24 cores and a primary memory of 32 GB, extendable to 64 GB. The system supports a 64-bit memory addressing scheme with 3.2 GHz/1066 MHz Intel Xeon processors, a 1066 MHz front-side bus (FSB) and 4 MB of L2 cache (each processor is dual core and comes with 2 × 2 MB (4 MB) of L2 cache). The system also supports hyper-threading for more efficient program execution. In order to exploit the multi-core and multithreading features of this hardware, we used the multiprocessing Python package to dispatch different sub-problems across multiple cores of the computing system. The distribution of the different learning sub-problems among the cores of a multi-core machine is demonstrated in Fig. 1. For the fusion of multiple datasets we use the MKL approach, whereby the different datasets are first converted into similarity matrices (kernels) and then joined to generate a final integrated matrix for learning TF-gene targets. We use the MKL Python library provided by the Shogun machine learning toolbox to implement the proposed algorithm.

RESULTS
All the genomic datasets are transformed into their respective kernel matrices by using an appropriate kernel function. For example, datasets such as gene expression and DNA methylation expression data are transformed using a Gaussian radial basis function. The PPI data is converted into a diffusion kernel, K = e^(βH), where H = A − D is the negative Laplacian derived from the adjacency matrix A and degree matrix D of the PPI graph. The TF-target gene regulation data is organized as a binary matrix of labels (i.e., 1 and −1) with genes in rows and TFs in columns. The number of rows corresponds to the genome size of the organism and the number of columns corresponds to the total number of TFs being used for GRN inference. The elements of each column with value 1 signify that a gene gi is regulated by TFj, and −1 otherwise. Such an organization of the regulation data allows us to use each column of the matrix as the label vector for an individual classification problem in a supervised learning environment. We perform two sets of experiments with our proposed approach in order to evaluate the scalability and the inference potential of supervised learning from heterogeneous datasets using the MKL paradigm. Our first experiment records the execution times required to learn from varying genome and sample sizes on single and multi-processor architectures, given a set of TFs. Our second experiment focuses on the evaluation of the inference potential of this approach on different genome and sample sizes.
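As a concrete reference for the kernel transformations described above, the sketch below builds a Gaussian RBF kernel for expression-like data and a diffusion kernel K = e^(βH) with H = A − D for a PPI graph, using NumPy/SciPy. The gamma and beta values, and the toy data, are arbitrary illustrative choices rather than the parameters used in the study.

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import pdist, squareform

def rbf_kernel(X, gamma=0.1):
    """Gaussian RBF kernel between rows of X (e.g., genes x samples expression data)."""
    sq_dists = squareform(pdist(X, "sqeuclidean"))
    return np.exp(-gamma * sq_dists)

def diffusion_kernel(adjacency, beta=0.5):
    """Diffusion kernel K = expm(beta * H) with H = A - D (negative graph Laplacian)."""
    degree = np.diag(adjacency.sum(axis=1))
    H = adjacency - degree
    return expm(beta * H)

# Toy usage: expression profiles for 4 genes and a 4-node PPI graph.
expr = np.random.rand(4, 10)
ppi = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
K_expr = rbf_kernel(expr)
K_ppi = diffusion_kernel(ppi)
```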
Since our problem of GRN inference is complex, the experiment aims to evaluate the parallel nature of the MKL algorithm by decomposing the supervised inference of GRNs for multiple TFs into a number of subproblems and distributing them to multiple processors for parallel execution. Varying the genome and sample sizes in these experiments evaluates how efficiently MKL-based models scale to large genomes, where most of the GRN models developed to date do not perform optimally, as reported in Marbach et al. (2012). The proposed method is implemented in Python, and the code along with the data is available at https://github.com/waninisar/MKL-GRNI. To assess the performance of the parallel MKL-GRNI on the different genomes characterized by the datasets in Table 1, we execute the algorithm and embed the required code for the evaluation metrics. Once the algorithm completes its execution run, all the essential metrics are recorded for further analysis. The metrics are computed to evaluate the capacity of our approach in terms of reduced computational cost and enhanced inference accuracy when dealing with complex and large-scale inference tasks. Initially the algorithm is run in sequential mode for all the organisms for a set of 32 TFs, and later in parallel mode on 8 and 16 CPUs. Performance metrics for all the datasets are plotted in Fig. 3. A brief description of these performance metrics is given below:

SPEEDUP
We calculate speedup as a measure of the relative performance of executing our algorithm in sequential and parallel processing environments. The speedup is calculated as:

S(j) = T(1) / T(j)    (4)

where S(j) is the speedup on j processors, T(1) is the time the program takes on a single processor and T(j) is the time the program takes on j processors.

EFFICIENCY
Efficiency is defined as the ratio of speedup to the number of processing elements (j CPUs in our case). It measures the utilization of the computation resources for a fraction of time. Ideally, in a parallel system, speedup is equal to j and efficiency is equal to 1. However, in practice, speedup is less than j and efficiency lies between zero and one, depending on the effectiveness with which the processing elements are utilized. We calculate efficiency E(j) on j processors as:

E(j) = S(j) / j    (5)

REDUNDANCY
Redundancy is computed as the ratio between the number of operations executed in parallel and sequential modes. It measures the required increase in the number of computations when the algorithm is run on multiple processors.

R(j) = O(j) / O(1)    (6)

QUALITY
Quality measures the relevance of using parallel computation and is defined as the ratio of the product of speedup and efficiency to redundancy.

Q(j) = S(j) × E(j) / R(j)    (7)

Figure 3: Performance metrics for the parallel MKL-GRNI algorithm: (A) Speedup, (B) Efficiency, (C) Redundancy, (D) Quality.
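The four metrics defined in Eqs. (4)-(7) are straightforward to compute once the sequential and parallel timings and operation counts are recorded; a minimal sketch (the timing values in the example are made up for illustration):

```python
def speedup(t_seq, t_par):
    """Eq. (4): S(j) = T(1) / T(j)."""
    return t_seq / t_par

def efficiency(s_j, j):
    """Eq. (5): E(j) = S(j) / j."""
    return s_j / j

def redundancy(ops_par, ops_seq):
    """Eq. (6): R(j) = O(j) / O(1)."""
    return ops_par / ops_seq

def quality(s_j, e_j, r_j):
    """Eq. (7): Q(j) = S(j) * E(j) / R(j)."""
    return s_j * e_j / r_j

# Example: a run that takes 120 s on 1 CPU and 20 s on 8 CPUs.
s = speedup(120.0, 20.0)                       # 6.0
e = efficiency(s, 8)                           # 0.75
q = quality(s, e, redundancy(1.1e9, 1.0e9))    # relevance of the parallel run
```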
It is evident from Fig. 3 that there is a marked increase in the speedup as we move from sequential (single CPU) to parallel execution (i.e., 8 and 16 CPUs). For an E. coli genome with a sample size of 500 and 32 TFs used for inference, the algorithm shows a sharp speedup as we move from sequential execution to parallel execution on 8 processors; however, when the number of processors is increased to 16, there is only a marginal increase in speedup for E. coli. On the other hand, there is a considerable increase in speedup recorded for 8 and 16 processors on the larger genomes, such as S. cerevisiae and Homo sapiens, suggesting an increase in the capacity of the parallel algorithm to reduce the execution times. Regarding resource utilization, the efficiency metric shows a considerable drop in the utilization of compute resources for all three datasets, because only a section of the algorithm runs in parallel. This can be inferred from the computed redundancy for sequential and parallel executions. The redundancy plot shows only a slight increase in the computational cost incurred when running our problem in parallel, thereby suggesting little computational overhead as we switch from sequential to parallel mode of execution. To evaluate the relevance of parallel execution to our problem, we calculate the quality metric for all three datasets. From the barplots we can observe that parallel algorithms are less relevant when applied to smaller genomes, as is evident in the case of E. coli, but there is a steady improvement in the quality metric as we move from S. cerevisiae to Homo sapiens, with a high relevance indicator when the yeast dataset is run on 8 processors and the human dataset on 16 processors. These improvements in the speedup and quality metrics when running the algorithm in parallel provide us with a framework to work with more complex datasets and organisms with large genome sizes to infer very large-scale GRNs using a supervised approach.

To assess the inference potential of this supervised method, we compare the proposed approach with other methods that infer gene interactions from single and integrated datasets. Initially, we applied MKL-GRNI to the DREAM5 E. coli data and performed a 10-fold cross-validation to make sure that the model is trained on all the known regulations. At each cross-validation step, important performance metrics such as precision, recall and F1 score are recorded and then averaged over the whole cross-validation procedure. We then compared our network inference method with inference methods that predict TF-target gene regulations, such as CLR (Faith et al., 2007) and SIRENE (Mordelet & Vert, 2008). The results are recorded in Table 2. After running all the inference procedures, we observed that the average precision, recall and F1 metrics generated by MKL-GRNI are considerably higher than those generated by the other comparable methods. The improvement with MKL-GRNI can be attributed to the additional biological knowledge, in the form of protein-protein interactions between E. coli genes, that aids the inference process. To test the proposed method on integrated data, we perform a 10-fold cross-validation procedure on the input data. In this experiment, the known target genes of each organism, as given in Table 1, are split into training and test sets. The model is trained on the features from the training set, the network inference is performed on the genes in the test set, and the evaluation metrics, such as precision, recall and F1 scores, are recorded for each iteration and averaged across cross-validation runs.
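A schematic of the per-TF cross-validation loop described above, using scikit-learn utilities for the folds and metrics; the SVC with a precomputed kernel stands in for the Shogun MKL classifier actually used and is only meant to illustrate the procedure.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.svm import SVC  # stand-in for the Shogun MKL classifier

def cross_validate_tf(K, labels, n_splits=10):
    """10-fold CV for one TF: K is a precomputed (combined) kernel, labels are +1/-1."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train, test in folds.split(np.zeros((len(labels), 1)), labels):
        clf = SVC(kernel="precomputed", C=1.0)
        clf.fit(K[np.ix_(train, train)], labels[train])        # train-vs-train kernel
        pred = clf.predict(K[np.ix_(test, train)])              # test-vs-train kernel
        scores.append((precision_score(labels[test], pred),
                       recall_score(labels[test], pred),
                       f1_score(labels[test], pred)))
    return np.mean(scores, axis=0)  # average precision, recall, F1 over the folds
```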
Table 3 summarizes these metrics for varying genome and sample sizes for the human breast cancer dataset, and Table 4 contains results for all three genomes. It is evident from these results that the MKL-GRNI algorithm scales well to larger genome sizes. These metrics highlight the learning and inference potential of MKL. Looking at Table 3, we observe an average recall of 80% and an average precision of 58%, with an average F1 measure of 65%, for a genome size of 5,000 and a sample size of 100, and an increase in these metrics as we increase the sample size to 500 and 1,000, respectively. However, as we start increasing the size of the genome, these metrics show a gradual decline for smaller sample sizes and again a marginal increase as we increase the sample size for a fixed genome size. Although there is no direct rule for determining the number of samples corresponding to the size of the genome in omics studies, the improvements in precision, recall and F1 measures suggest an improvement in the learning and inference potential of the MKL algorithm with an increase in the number of samples. The tabulated metrics for all three genomes in Table 4 also show a considerable decline in the evaluation metrics as we move from smaller to larger genomes, suggesting a decrease in the inference potential of the algorithm for larger datasets. The possible decline in the performance metrics can be attributed to the increase in genome size as we move from simple prokaryotic to more complex eukaryotic genomes. This increase in genome size relative to the sample size leads to the curse of dimensionality, making it difficult to learn properly from skewed datasets.

Table 2. Average precision, recall and F1 measures for various inference methods.
Method     Average precision  Average recall  Average F1 score
CLR                    0.275            0.55              0.36
SIRENE                 0.445            0.73              0.55
MKL-GRNI               0.46             0.97              0.62

Table 3. Precision, recall and F1 measures recorded for different combinations of genome and sample sizes for the breast cancer data.
No. of genes  No. of samples  Average recall  Average precision  Average F1 measure
       5,000             100          0.8005             0.5817              0.6582
       5,000             500          0.8005             0.6169              0.6848
       5,000           1,000          0.8354             0.6347              0.6968
      10,000             100          0.7350             0.4406              0.5509
      10,000             500          0.7660             0.4537              0.5699
      10,000           1,000          0.7860             0.4937              0.6065
      19,201             100          0.7499             0.3746              0.4996
      19,201             500          0.7444             0.3893              0.5112
      19,201           1,000          0.7499             0.4246              0.5422

Table 4. Precision, recall and F1 measures averaged across cross-validation runs for complete genomes.
Organism       No. of genes  No. of samples  Avg. precision  Avg. recall  Avg. F1 measure
E. coli               4,297             802            0.46         0.97             0.62
S. cerevisiae         5,657             536            0.42         0.84             0.56
Homo sapiens         19,201           1,012            0.37         0.73             0.49

We also compare our MKL-GRNI with the recently developed integrative random forest for gene regulatory network inference (iRafNet) (Petralia et al., 2015). We select the DREAM5 datasets of E. coli and S. cerevisiae and integrate the PPI and gene expression data from both datasets. For MKL, we build Gaussian and diffusion kernels from the expression and PPI data. For iRafNet, the expression data serves as the main data and the PPI data is used as support data. Sampling weights are then derived from the PPI data by building a diffusion kernel as K = e^H, where H is a graph Laplacian for the PPI data.
Sampling weights from K are derived as WPPI(i, j) = K(i, j), that is, the element K(i, j). The sampling weights thus obtained are then integrated with the main dataset (i.e., the gene expression data). Putative regulatory links are then predicted using importance scores generated with the iRafNet R package. The AUC and AUPR scores obtained using iRafNet and MKL-GRNI are listed in Table 5. The AUC and AUPR scores of MKL-GRNI are comparable to those of iRafNet for both datasets. However, iRafNet reports a lower AUC and a higher AUPR score compared to MKL-GRNI when run on the E. coli data. Once we move towards a larger genome size, these scores start dropping marginally for both the iRafNet and MKL-GRNI approaches. The slightly higher AUC scores in the case of MKL-GRNI can be attributed, to some extent, to the skewed class label distribution, wherein negative labels far outnumber the positive ones because of the limited known regulations. This class imbalance leads to higher predictive accuracy (AUC) but lower precision-recall scores (AUPR). On the other hand, regression-based GRN inference techniques have been reported to perform well for smaller genomes, with GENIE3 (Huynh-Thu et al., 2010) being a star performer in the DREAM5 network inference challenges. The higher AUPR generated by iRafNet in the case of E. coli can be attributed to the way potential regulators are sampled using prior information from the sampling weights (PPI), therefore decreasing false positives and increasing precision and recall. But for larger genomes (i.e., yeast in our case) the performance of both approaches begins to fall, as reported by Mordelet & Vert (2008). The present implementation of iRafNet does not provide the ability to run the random forest algorithm in parallel. Therefore, using iRafNet for GRNI of larger genomes can incur a huge computational cost by running thousands of decision trees in sequential mode.

Table 5. AUC and AUPR scores for E. coli and S. cerevisiae using iRafNet and MKL-GRNI.
Datasets        iRafNet AUC  iRafNet AUPR  MKL-GRNI AUC  MKL-GRNI AUPR
E. coli               0.901         0.552         0.925           0.44
S. cerevisiae         0.833         0.39          0.89            0.42

Since our main motive in this study is to parallelize the inference algorithm for large-scale GRNI, the higher speedup and higher quality provided by running MKL-GRNI in parallel can be used as a trade-off for a slightly lower AUPR compared to iRafNet run in sequential mode with marginally higher AUPR scores.

DISCUSSION AND CONCLUSION
Here we present a scalable and parallel approach to GRN inference using MKL as the integration and supervised learning framework. The algorithm has been implemented in Python using the Python interface to MKL provided by the Shogun machine learning toolbox (Sonnenburg et al., 2010). The ability of kernel methods in pattern discovery and learning from genomic data fusion of multi-omics data using MKL has already been demonstrated in a number of inference studies. Our focus here is to explore the scalability option for large-scale GRN inference in a supervised machine learning setting, besides assessing the inference potential across different genomes. The approach undertaken can be considered a parallel extension to SIRENE (Mordelet & Vert, 2008). Although SIRENE performs better than other unsupervised and information-theoretic inference methods, as reported by Mordelet & Vert (2008),
it lacks the ability to learn from heterogeneous genomic datasets that can provide essential and complementary information for GRN inference. Another limitation is the sequential execution of the TF-specific classification problems, which incurs a huge cost in terms of execution time as we move from E. coli genomes to the more complex and larger genomes of mice and humans. Therefore, to facilitate very large-scale GRN inference using a supervised learning approach, we use the concept of decomposing the initial problem of learning a GRN into many subproblems, where each subproblem aims to infer a GRN for a specific TF. Our algorithm distributes all such learning problems to different processors on a multi-processor hardware platform and dispatches them for simultaneous execution, thereby reducing the execution time of the inference process substantially. The results from each execution are written to a shared queue object; once all the child processes complete their execution, the queue object is iterated to build a single output matrix for genome-scale GRN inference. We also assess the inference potential of our MKL-based parallel GRN inference approach by computing the essential evaluation metrics for machine learning based approaches. A survey of the scientific literature on GRN inference methods shows that the results obtained by our approach are comparable to other state-of-the-art methods in this domain, and in some cases better than inference methods that employ only gene expression data (e.g., CLR, ARACNE, SIRENE, etc.). A drawback of our approach is that only TFs with known targets can be used to train the inference model. Also, the performance of the algorithm tends to decrease if the model training is carried out using TFs with few known targets, leading to a bias in favor of TFs with many known neighbors (i.e., hubs), and the model is less likely to predict new associations for TFs with very few neighbors. Besides, we are not able to identify new TFs among the newly learned interactions, nor can the model predict whether a given gene is upregulated or downregulated by a particular TF. Therefore, additional work is needed to improve the efficiency of the parallel algorithm and the inference potential of MKL-GRNI. In our current implementation, we integrate only two datasets for GRNI, leaving scope to use more omics sources that can be integrated for improved performance of the inference model. Also, the MKL framework provides a mechanism to weigh the contribution of individual datasets, which can be used to select informative datasets for integration. Further, we do not identify TFs from the predicted target genes; this can be considered in a future extension of this work. Besides, novel techniques to choose negative examples for training our parallel MKL-GRNI model can be incorporated to decrease the number of false positives and improve the overall precision/recall scores for genomes of higher organisms.
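For illustration, the queue-based decomposition described above can be sketched with Python's multiprocessing module as follows; `train_and_score` is a placeholder for the per-TF MKL/SVM training step and, like the other names here, is not taken from the released code.

```python
import multiprocessing as mp
import numpy as np

def train_and_score(K, y):
    # Placeholder for the per-TF MKL/SVM step; returns one score per gene.
    return np.random.rand(K.shape[0])

def infer_tf(tf_index, K, labels, out_queue):
    """Train one TF-specific classifier and push its predicted target scores to the queue."""
    scores = train_and_score(K, labels[:, tf_index])
    out_queue.put((tf_index, scores))

def parallel_grn(K, labels):
    """Distribute per-TF subproblems across processes and assemble the output matrix."""
    queue = mp.Queue()
    procs = [mp.Process(target=infer_tf, args=(t, K, labels, queue))
             for t in range(labels.shape[1])]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()
    output = np.zeros((K.shape[0], labels.shape[1]))
    for tf_index, scores in results:
        output[:, tf_index] = scores          # one column per TF-specific GRN
    return output

# On platforms that spawn processes, call parallel_grn() under
# an `if __name__ == "__main__":` guard.
```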
ADDITIONAL INFORMATION AND DECLARATIONS

Funding
Nisar Wani is supported by a Teacher Fellowship of the University Grants Commission, Ministry of Human Resources Development, Govt. of India vide letter No. F.B No. 27-(TF-45)/2015 under the Faculty Development Programme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
University Grants Commission, Ministry of Human Resources Development, Govt. of India: F.B No. 27-(TF-45)/2015.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
- Nisar Wani conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
- Khalid Raza conceived and designed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
The code is available at GitHub: https://github.com/waninisar/MKL-GRNI.

REFERENCES
Albert R. 2007. Network inference, analysis, and modeling in systems biology. Plant Cell 19(11):3327–3338 DOI 10.1105/tpc.107.054700.
Alioscha-Perez M, Oveneke MC, Sahli H. 2019. Svrg-mkl: a fast and scalable multiple kernel learning solution for features combination in multi-class classification problems. IEEE Transactions on Neural Networks and Learning Systems 31(5):1710–1723 DOI 10.1109/TNNLS.2019.2922123.
Ben-Hur A, Noble WS. 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21(Suppl. 1):i38–i46 DOI 10.1093/bioinformatics/bti1016.
Butte AJ, Kohane IS. 1999. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Biocomputing 2000. Singapore: World Scientific, 418–429.
Chen Z-Y, Fan Z-P. 2014. Parallel multiple kernel learning: a hybrid alternating direction method of multipliers. Knowledge and Information Systems 40(3):673–696 DOI 10.1007/s10115-013-0655-5.
Cichonska A, Pahikkala T, Szedmak S, Julkunen H, Airola A, Heinonen M, Aittokallio T, Rousu J. 2018. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 34(13):i509–i518 DOI 10.1093/bioinformatics/bty277.
ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306(5696):636–640 DOI 10.1126/science.1105136.
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology 5(1):e8 DOI 10.1371/journal.pbio.0050008.
Han H, Shim H, Shin D, Shim JE, Ko Y, Shin J, Kim H, Cho A, Kim E, Lee T, Kim H, Kim K, Yang S, Bae D, Yun A, Kim S, Kim CY, Cho HJ, Kang B, Shin S, Lee I. 2015. TRRUST: a reference database of human transcriptional regulatory interactions. Scientific Reports 5(1):11432 DOI 10.1038/srep11432.
Hecker M, Lambeck S, Toepfer S, Van Someren E, Guthke R. 2009. Gene regulatory network inference: data integration in dynamic models: a review. Biosystems 96(1):86–103 DOI 10.1016/j.biosystems.2008.12.004.
Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. 2010. Inferring regulatory networks from expression data using tree-based methods. PLOS ONE 5(9):e12776 DOI 10.1371/journal.pone.0012776.
Jiang C, Xuan Z, Zhao F, Zhang MQ. 2007. Tred: a transcriptional regulatory element database, new entries and other development.
Nucleic Acids Research 35(Suppl. 1):D137–D140 DOI 10.1093/nar/gkl1041. Kondor RI, Lafferty J. 2002. Diffusion kernels on graphs and other discrete structures. Proceedings of the 19th International Conference on Machine Learning 2002:315–322. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2003. Kernel-based data fusion and its application to protein function prediction in yeast. In: Biocomputing 2004. Singapore: World Scientific, 300–311. Lanckriet GR, De Bie T, Cristianini N, Jordan MI, Noble WS. 2004. A statistical framework for genomic data fusion. Bioinformatics 20(16):2626–2635 DOI 10.1093/bioinformatics/bth294. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne J-B, Volkert TL, Fraenkel E, Gifford DK, Young RA. 2002. Transcriptional regulatory networks in saccharomyces cerevisiae. Science 298(5594):799–804 DOI 10.1126/science.1075090. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Consortium D, Kellis M, Collins JJ, Stolovitzky G. 2012. Wisdom of crowds for robust gene network inference. Nature Methods 9(8):796. Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 18/20 http://dx.doi.org/10.1109/TNNLS.2019.2922123 http://dx.doi.org/10.1093/bioinformatics/bti1016 http://dx.doi.org/10.1007/s10115-013-0655-5 http://dx.doi.org/10.1093/bioinformatics/bty277 http://dx.doi.org/10.1126/science.1105136 http://dx.doi.org/10.1371/journal.pbio.0050008 http://dx.doi.org/10.1038/srep11432 http://dx.doi.org/10.1016/j.biosystems.2008.12.004 http://dx.doi.org/10.1371/journal.pone.0012776 http://dx.doi.org/10.1093/nar/gkl1041 http://dx.doi.org/10.1093/bioinformatics/bth294 http://dx.doi.org/10.1126/science.1075090 http://dx.doi.org/10.7717/peerj-cs.363 https://peerj.com/computer-science/ Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. 2006. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(1):S7 DOI 10.1186/1471-2105-7-S1-S7. Mordelet F, Vert J-P. 2008. SIRENE: supervised inference of regulatory networks. Bioinformatics 24(16):i76–i82 DOI 10.1093/bioinformatics/btn273. Nascimento AC, Prudêncio RB, Costa IG. 2016. A multiple kernel learning algorithm for drug- target interaction prediction. BMC Bioinformatics 17(1):46 DOI 10.1186/s12859-016-0890-3. Pavlidis P, Weston J, Cai J, Noble WS. 2002. Learning gene functional classifications from multiple data types. Journal of Computational Biology 9(2):401–411 DOI 10.1089/10665270252935539. Petralia F, Wang P, Yang J, Tu Z. 2015. Integrative random forest for gene regulatory network inference. Bioinformatics 31(12):i197–i205 DOI 10.1093/bioinformatics/btv268. Raza K, Alam M. 2016. Recurrent neural network based hybrid model for reconstructing gene regulatory network. Computational Biology and Chemistry 64:322–334 DOI 10.1016/j.compbiolchem.2016.08.002. Remli MA, Mohamad MS, Deris S, Samah AA, Omatu S, Corchado JM. 2019. Cooperative enhanced scatter search with opposition-based learning schemes for parameter estimation in high dimensional kinetic models of biological systems. Expert Systems with Applications 116:131–146 DOI 10.1016/j.eswa.2018.09.020. Scholkopf B, Smola AJ. 2001. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press. Seoane JA, Day IN, Gaunt TR, Campbell C. 2013. 
A pathway-based data integration framework for prediction of disease progression. Bioinformatics 30(6):838–845 DOI 10.1093/bioinformatics/btt610. Shawe-Taylor J, Cristianini N. 2004. Kernel methods for pattern analysis. Cambridge: Cambridge University Press. Sonnenburg S, Henschel S, Widmer C, Behr J, Zien A, Bona Fd, Binder A, Gehl C, Franc V. 2010. The shogun machine learning toolbox. Journal of Machine Learning Research 11:1799–1802. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research 7:1531–1565. Speicher NK, Pfeifer N. 2015. Integrating different data types by regularized unsupervised multiple kernel learning with application to cancer subtype discovery. Bioinformatics 31(12):i268–i275 DOI 10.1093/bioinformatics/btv244. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, Mering C. 2011. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research 39(Suppl. 1):D561–D568 DOI 10.1093/nar/gkq973. Tomczak K, Czerwińska P, Wiznerowicz M. 2015. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology 19(1A):A68. Van Laarhoven T, Nabuurs SB, Marchiori E. 2011. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27(21):3036–3043 DOI 10.1093/bioinformatics/btr500. Wani N, Raza K. 2018. Multiple kernel-learning approach for medical image analysis. In: Soft Computing Based Medical Image Analysis. Amsterdam: Elsevier, 31–47. Wani and Raza (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.363 19/20 http://dx.doi.org/10.1186/1471-2105-7-S1-S7 http://dx.doi.org/10.1093/bioinformatics/btn273 http://dx.doi.org/10.1186/s12859-016-0890-3 http://dx.doi.org/10.1089/10665270252935539 http://dx.doi.org/10.1093/bioinformatics/btv268 http://dx.doi.org/10.1016/j.compbiolchem.2016.08.002 http://dx.doi.org/10.1016/j.eswa.2018.09.020 http://dx.doi.org/10.1093/bioinformatics/btt610 http://dx.doi.org/10.1093/bioinformatics/btv244 http://dx.doi.org/10.1093/nar/gkq973 http://dx.doi.org/10.1093/bioinformatics/btr500 http://dx.doi.org/10.7717/peerj-cs.363 https://peerj.com/computer-science/ Wani N, Raza K. 2019a. Integrative approaches to reconstruct regulatory networks from multi-omics data: a review of state-of-the-art methods. Computational Biology and Chemistry 83:107120 DOI 10.1016/j.compbiolchem.2019.107120. Wani N, Raza K. 2019b. iMTF-GRN: integrative matrix tri-factorization for inference of gene regulatory networks. IEEE Access 7:126154–126163 DOI 10.1109/ACCESS.2019.2936794. Yamanishi Y, Vert J-P, Kanehisa M. 2004. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 20(Suppl. 1):i363–i370 DOI 10.1093/bioinformatics/bth910. Yan S, Xu D, Zhang B, Zhang H-J, Yang Q, Lin S. 2007. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1):40–51 DOI 10.1109/TPAMI.2007.250598. Wani and Raza (2021), PeerJ Comput. 
work_2qq6wxmizbfornkrysxkp6vpqu ---- From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations Egoitz Laparra∗ Dongfang Xu∗ Steven Bethard School of Information University of Arizona Tucson, AZ {laparra,dongfangxu9,bethard}@email.arizona.edu Abstract This paper presents the first model for time normalization trained on the SCATE corpus. In the SCATE schema, time expressions are annotated as a semantic composition of time entities. This novel schema favors machine learning approaches, as it can be viewed as a semantic parsing task. In this work, we propose a character-level multi-output neural network that outperforms previous state-of-the-art built on the TimeML schema. To compare predictions of systems that follow both SCATE and TimeML, we present a new scoring metric for time intervals. We also apply this new metric to carry out a comparative analysis of the annotations of both schemes in the same corpus. 1 Introduction Time normalization is the task of translating natural language expressions of time to computer-readable forms. For example, the expression three days ago could be normalized to the formal representation 2017-08-28 in the ISO-8601 standard. As time normalization allows entities and events to be placed along a timeline, it is a crucial step for many information extraction tasks.
Since the first shared tasks on time normalization (Verhagen et al., 2007), interest in the problem and the variety of applications have been growing. For example, Lin et al. (2015) use normal- ized timestamps from electronic medical records to contribute to patient monitoring and detect potential causes of disease. Vossen et al. (2016) identify multi- lingual occurrences of the same events in the news ∗These two authors contributed equally. by, among other steps, normalizing time-expressions in different languages with the same ISO standard. Fischer and Strötgen (2015) extract and normalize time-expressions from a large corpus of German fic- tion as the starting point of a deep study on trends and patterns of the use of dates in literature. A key consideration for time normalization sys- tems is what formal representation the time expres- sions should be normalized to. The most popular scheme for annotating normalized time expressions is ISO-TimeML (Pustejovsky et al., 2003a; Puste- jovsky et al., 2010), but it is unable to represent several important types of time expressions (e.g., a bounded set of intervals, like Saturdays since March 6) and it is not easily amenable to machine learning (the rule-based HeidelTime (Strötgen et al., 2013) still yields state-of-the-art performance). Bethard and Parker (2016) proposed an alternate scheme, Se- mantically Compositional Annotation of Time Ex- pressions (SCATE), in which times are annotated as compositional time entities (Figure 1), and suggested that this should be more amenable to machine learn- ing. However, while they constructed an annotated corpus, they did not train any automatic models on it. We present the first machine-learning models trained on the SCATE corpus of time normalizations. We make several contributions in the process: • We introduce a new evaluation metric for time normalization that can compare normalized times from different annotation schemes by mea- suring overlap of intervals on the timeline. • We use the new metric to compare SCATE and TimeML annotations on the same corpus, and confirm that SCATE covers a wider variety of 343 Transactions of the Association for Computational Linguistics, vol. 6, pp. 343–356, 2018. Action Editor: Mona Diab. Submission batch: 10/2017; Revision batch: 1/2018; Published 5/2018. c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. THIS INTERVAL REPEATING-INTERVALS DAY-OF-WEEK TYPE=SATURDAY Saturdays BETWEEN START-INTERVAL END-INTERVAL=DOC-TIME since LAST INTERVAL=DOC-TIME REPEATING-INTERVAL MONTH-OF-YEAR TYPE=MARCH SUB-INTERVAL March DAY-OF-MONTH VALUE=6 6 Figure 1: Annotation of the expression Saturdays since March 6 following the SCATE schema. time expressions. • We develop a recurrent neural network for learn- ing SCATE-style time normalization, and show that our model outperforms the state-of-the-art HeidelTime (Strötgen et al., 2013). • We show that our character-based multi-output neural network architecture outperforms both word-based and single-output models. 2 Background ISO-TimeML (Pustejovsky et al., 2003a; Pustejovsky et al., 2010) is the most popular scheme for annotat- ing time expressions. It annotates time expressions as phrases, and assigns an ISO 8601 normalization (e.g., 1990-08-15T13:37 or PT24H) as the VALUE at- tribute of the normalized form. 
ISO-TimeML is used in several corpora, including the TimeBank (Puste- jovsky et al., 2003b), WikiWars (Mazur and Dale, 2010), TimeN (Llorens et al., 2012), and the Temp- Eval shared tasks (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013). However, the ISO-TimeML schema has a few drawbacks. First, times that align to more than a single calendar unit (day, week, month, etc.), such as Saturdays since March 6 (where multiple Satur- days are involved), cannot be described in the ISO 8601 format since they do not correspond to any pre- fix of YYYY-MM-DDTHH:MM:SS. Second, each time expression receives a single VALUE, regardless of the word span, the compositional semantics of the expression are not represented. For example, in the expressions since last week and since March 6, the semantics of since are identical – find the inter- val between the anchor time (last week or March 6) and now. But ISO-TimeML would have to annotate these two phrases independently, with no way to in- dicate the shared portion of their semantics. These drawbacks of ISO-TimeML, especially the lack of compositionality, make the development of machine learning models difficult. Thus, most prior work has taken a rule-based approach, looking up each token of a time expression in a normalization lexicon and then mapping this sequence of lexical entries to the normalized form (Strötgen and Gertz, 2013; Bethard, 2013; Lee et al., 2014; Strötgen and Gertz, 2015). As an alternative to TimeML, and inspired by pre- vious works, Schilder (2004) and Han and Lavie (2004), Bethard and Parker (2016) proposed Seman- tically Compositional Annotation of Time Expres- sions (SCATE). In the SCATE schema, each time expression is annotated in terms of compositional time entity over intervals on the timeline. An ex- ample is shown in Figure 1, with every annotation corresponding to a formally defined time entity. For instance, the annotation on top of since corresponds to a BETWEEN operator that identifies an interval starting at the most recent March 6 and ending at the document creation time (DCT). The BETWEEN operator is formally defined as: BETWEEN([t1, t2): INTERVAL, [t3, t4): INTERVAL): INTERVAL = [t2, t3). The SCATE schema can represent a wide variety of time expressions, and provides a formal definition of the semantics of each annotation. Unlike TimeML, SCATE uses a graph structure to capture composi- tional semantics and can represent time expressions that are not expressed with contiguous phrases. The schema also has the advantage that it can be viewed as a semantic parsing task and, consequently, is more 344 suitable for machine-learning approaches. However, Bethard and Parker (2016) present only a corpus; they do not present any models for semantic parsing. 3 An interval-based evaluation metric for normalized times Before attempting to construct machine-learned mod- els from the SCATE corpus, we were interested in evaluating Bethard and Parker (2016)’s claim that the SCATE schema is able to represent a wider vari- ety of time expressions than TimeML. To do so, we propose a new evaluation metric to compare time nor- malizations annotated in both the ISO 8601 format of TimeML and the time entity format of SCATE. This new evaluation interprets normalized annotations as intervals along the timeline and measures the overlap of the intervals. TimeML TIMEX3 (time expression) annotations are converted to intervals following ISO 8601 se- mantics of their VALUE attribute. 
So, for example, 1989-03-05 is converted to the interval [1989-03- 05T00:00:00, 1989-03-06T00:00:00), that is, the 24- hour period starting at the first second of the day on 1989-03-05 and ending just before the first second of the day on 1989-03-06. SCATE annotations are con- verted to intervals following the formal semantics of each entity, using the library provided by Bethard and Parker (2016). So, for example, Next(Year(1985), SimplePeriod(YEARS, 3)), the 3 years following 1985, is converted to [1986-01-01T00:00, 1989-01- 01T00:00). Note that there may be more than one interval associated with a single annotation, as in the Saturdays since March 6 example. Once all anno- tations have been converted into intervals along the timeline, we can measure how much the intervals of different annotations overlap. Given two sets of intervals, we define the interval precision, Pint, as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the first set. Interval recall, Rint is defined as the total length of the intervals in common between the two sets, divided by the total length of the intervals in the second set. Formally: IS ⋂ IH = {i∩ j : i ∈ IS ∧ j ∈ IH} Pint(IS, IH) = ∑ i∈COMPACT(IS ⋂ IH) |i| ∑ i∈IS |i| Rint(IS, IH) = ∑ i∈COMPACT(IS ⋂ IH) |i| ∑ i∈∪IH |i| where IS and IH are sets of intervals, i∩j is possibly the empty interval in common between the intervals i and j, |i| is the length of the interval i, and COMPACT takes a set of intervals and merges any overlapping intervals. Given two sets of annotations (e.g., one each from two time normalization systems), we define the over- all precision, P , as the average of interval precisions where each annotation from the first set is paired with all annotations that textually overlap it in the second set. Overall recall is defined as the average of interval recalls where each annotation from the second set is paired with all annotations that textually overlap it in the first set. Formally: OIa(B) = ⋃ b∈B:OVERLAPS(a,b) INTERVALS(b) P(S, H) = 1 |S| ∑ s∈S Pint(INTERVALS(s), OIs(H)) R(S, H) = 1 |H| ∑ h∈H Rint(INTERVALS(h), OIh(S)) where S and H are sets of annotations, INTERVALS(x) gives the time intervals associ- ated with the annotation x, and OVERLAPS(a, b) decide whether the annotations a and b share at least one character of text in common. It is important to note that these metrics can be applied only to time expressions that yield bounded intervals. Time expressions that refer to intervals with undefined boundaries are out of the scope, like in “it takes just a minute” or “I work every Saturday”. 4 Data analysis 4.1 TimeML vs. SCATE Both TimeML and SCATE annotations are available on a subset of the TempEval 2013 corpus (UzZaman et al., 2013) that contains a collection of news articles from different sources, such as Wall Street Journal, 345 AQUAINT TimeBank Test Documents 10 68 20 Sentences 251 1429 339 TimeML timex3 61 499 158 SCATE entities 333 1810 461 SCATE time exp. 114 715 209 SCATE bounded 67 403 93 Table 1: Number of documents, TimeML TIMEX3 an- notations and SCATE annotations for the subset of the TempEval 2013 corpus annotated with both schemas. AQUAINT TimeBank P R F1 P R F1 Body text 92.2 92.2 92.2 82.4 83.0 82.7 All text 92.2 67.1 77.7 82.4 71.2 76.4 Table 2: Comparison of TimeML and SCATE annotations. New York Times, Cable News Network, and Voices of America. Table 1 shows the statistics of the data. 
Documents from the AQUAINT and TimeBank form the training and development dataset. The SCATE corpus contains 2604 time entities (individual com- ponents of a time expression, such as every, month, last, Monday, etc.) annotated in the train+dev set (i.e. AQUAINT+TimeBank). These entities compose a total of 1038 time expressions (every month, last Monday, etc.) of which 580 yield bounded intervals, i.e. intervals with a specified start and ending (last Monday is bounded, while every month is not). We apply the interval-based evaluation metric in- troduced in Section 3 to the AQUAINT and Time- Bank datasets, treating the TimeML annotations as the system (S) annotator and the SCATE annotations as the human (H) annotator. Table 2 shows that the SCATE annotations cover different time intervals than the TimeML annotations. In the first row, we see that TimeML has a recall of only 92% of the time in- tervals identified by SCATE in the AQUAINT corpus and of only 83% in the TimeBank corpus. We manu- ally analyzed all places where TimeML and SCATE annotations differed and found that the SCATE inter- pretation was always the correct one. For example, a common case where TimeML and SCATE annotations overlap, but are not identical, is time expressions preceded by a preposition like “since”. The TimeML annotation for “Since 1985” (with a DCT of 1998-03-01T14:11) only covers the year, “1985”, resulting in the time interval [1985- 01-01T00:00,1986-01-01T00:00). The SCATE an- notation represents the full expression and, conse- quently, produces the correct time interval [1986-01- 01T00:00,1998-03-01T14:11). Another common case of disagreement is where TimeML failed to compose all pieces of a complex expression. The TimeML annotation for “10:35 a.m. (0735 GMT) Friday” annotates two separate inter- vals, the time and the day (and ignores “0735 GMT” entirely). The SCATE annotation recognizes this as a description of a single time interval, [1998-08- 07T10:35, 1998-08-07T10:36). TimeML and SCATE annotations also differ in how references to particular past periods are inter- preted. For example, TimeML assumes that “last year” and “a year ago” have identical semantics, re- ferring to the most recent calendar year, e.g., if the DCT is 1998-03-04, then they both refer to the inter- val [1997-01-01T00:00,1998-01-01T00:00). SCATE has the same semantics for “last year”, but recog- nizes that “a year ago” has different semantics: a period centered at one year prior to the DCT. Under SCATE, “a year ago” refers to the interval [1996-09- 03T00:00,1997-09-03T00:00). Beyond these differences in interpretation, we also observed that, while the SCATE corpus annotates time expressions anywhere in the document (includ- ing in metadata), the TimeBank TIMEX3 annotations are restricted to the main text of the documents. The second row of Table 2 shows the evaluation when comparing overall text in the document, not just the body text. Unsurprisingly, TimeML has a lower re- call of the time intervals from the SCATE annotations under this evaluation. 4.2 Types of SCATE annotations Studying the training and development portion of the dataset, we noticed that the SCATE annotations can be usefully divided into three categories: non- operators, explicit operators, and implicit operators. We define non-operators as NUMBERs, PERIODs (e.g., three months), explicit intervals (e.g., YEARs like 1989), and repeating intervals (DAY-OF-WEEKs like Friday, MONTH-OF-YEARs like January, etc.). 
Non-operators are basically atomic; they can be in- 346 Non-Op Exp-Op Imp-Op Total 1497 305 219 2021 74% 15% 11% 100% Table 3: Distribution of time entity annotations in AQUAINT+TimeBank. terpreted without having to refer to other annotations. Operators are not atomic; they can only be interpreted with respect to other annotations they link to. For example, the THIS operator in Figure 1 can only be interpreted by first interpreting the DAY-OF-WEEK non-operator and the BETWEEN operator that it links to. We split operators into two types: explicit and implicit. We define an operator as explicit if it does not overlap with any other annotation. This occurs, for example, when the time connective since evokes the BETWEEN operator in Figure 1. An operator is considered to be implicit if it overlaps with an- other annotation. This occurs, for example, with the LAST operator in Figure 1, where March implies last March, but there is no explicit signal in the text, and it must be inferred from context. We study how these annotation groups distribute in the AQUAINT and TimeBank documents. Table 3 shows that non-operators are much more frequent than operators (both explicit and implicit). 5 Models We decompose the normalization of time expressions into two subtasks: a) time entity identification which detects the spans of characters that belong to each time expression and labels them with their corre- sponding time entity; and b) time entity composition that links relevant entities together while respecting the entity type constraints imposed by the SCATE schema. These two tasks are run sequentially using the output of the former as input to the latter. Once identification and composition steps are completed we can use the final product, i.e. semantic composi- tional of time entities, to feed the SCATE interpreter1 and encode time intervals. 5.1 Time entity identification Time entity identification is a type of sequence tag- ging task where each piece of a time expression is 1https://github.com/clulab/timenorm assigned a label that identifies the time entity that it evokes. We express such labels using the BIO tagging system, where B stands for the beginning of an annotation, I for the inside, and O for outside any annotation. Differing somewhat from standard sequence tagging tasks, the SCATE schema allows multiple annotations over the same span of text (e.g., “Saturdays” in Figure 1 is both a DAY-OF-WEEK and a THIS), so entity identification models must be able to handle such multi-label classification. 5.1.1 Neural architectures Recurrent neural networks (RNN) are the state- of-the-art on sequence tagging tasks (Lample et al., 2016a; Graves et al., 2013; Plank et al., 2016) thanks to their ability to maintain a memory of the sequence as they read it and make predictions conditioned on long distance features, so we also adopt them here. We introduce three RNN architectures that share a similar internal structure, but differ in how they repre- sent the output. They convert the input into features that feed an embedding layer. The embedded feature vectors are then fed into two stacked bidirectional Gated Recurrent Units (GRUs), and the second GRU followed by an activation function, outputs one BIO tag for each input. We select GRU for our models as they can outperform another popular recurrent unit LSTM (Long Short Term Memory), in terms of pa- rameter updates and convergence in CPU time with the same number of parameters (Chung et al., 2014). 
Our 1-Sigmoid model (Figure 2) approaches the task as a multi-label classification problem, with a set of sigmoids for each output that allow zero or more BIO labels to be predicted simultaneously. This is the standard way of encoding multi-label classification problems for neural networks, but in our experiments, we found that these models perform poorly since they can overproduce labels for each input, e.g., 03 could be labeled with both DAY-OF-MONTH and MONTH- OF-YEAR at the same time. Our 2-Softmax model (Figure 3) splits the out- put space of labels into two sets: non-operators and operators (as defined in Section 4.2). It is very un- likely that any piece of text will be annotated with more than one non-operator or with more than one operator,2 though it is common for text to be anno- 2In the training data, only 4 of 1217 non-operators overlap with another non-operator, and only 6 of 406 operators overlap 347 Input Feature Embed Bi-GRU Bi-GRU Sigmoid Output Non-Operators and Operators M M Lu NNP B-LAST B-MONTH a a Ll NNP I-LAST I-MONTH y y Ll NNP I-LAST I-MONTH Zs ∅ ∅ 2 2 Nd CD B-DAY 5 5 Nd CD I-DAY . . . . . . . . . . . . . . . . . . Figure 2: Architecture of the 1-Sigmoid model. The input is May 25. In SCATE-style annotation, May is a MONTH- OF-YEAR (a non-operator), with an implicit LAST (an operator) over the same span, and 25 is a DAY-OF-MONTH. At the feature layer, M is an uppercase letter (Lu), a and y are lowercase letters (Ll), space is a separator (Zs), and May is a proper noun (NNP). Input Feature Embed Bi-GRU Bi-GRU Softmax Output Non-Operators Operators M M Lu NNP B-MONTH B-LAST a a Ll NNP I-MONTH I-LAST y y Ll NNP I-MONTH I-LAST Zs ∅ O O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 3: Architecture of the 2-Softmax model. The input is May. The SCATE annotations and features are the same as in Figure 2. tated with one non-operator and one operator (see Figure 1). As a result, we can use two softmaxes, one for non-operators and one for operators, and the 2- Softmax model thus can produce 0, 1, or 2 labels per input. We share input and embedding layers, but as- sociate a separate set of stacked Bi-GRUs with each output category, as shown in Figure 3.3 Our 3-Softmax further splits operators into explicit operators and implicit operators (again, as defined with another operator. For example, a NYT said in an editorial on Saturday, April 25, Saturday is labeled as [DAY-OF-WEEK, LAST, INTERSECTION] where the last two labels are operators. 3In preliminary experiments, we tried sharing GRU layers as well, but this generally resulted in worse performance. in Section 4.2). We expect this to help the model since the learning task is very different for these two cases: with explicit operators, the model just has to memorize which phrases evoke which operators, while with implicit operators, the model has to learn to infer an operator from context (verb tense, etc.). We use three softmaxes, one each for non-operators, explicit operators, and implicit operators, and, as with 2-Softmax, we share input and embedding layers, but associate a separate set of stacked Bi-GRUs with each output category. The model looks similar to Figure 3, but with three output groups instead of two. We feed three features as input to the RNNs: Text: The input word itself for the word-by-word 348 model, or a the single input character for the character-by-character model. 
Unicode character categories: The category of each character as defined by the Unicode standard (see http://unicode.org/notes/tn36/). This encodes information like the presence of uppercase (Lu) or lowercase (Ll) letters, punctuation (Po), digits (Nd), etc. For the word-by-word model, we concatenate the character categories of all characters in the word (e.g., May becomes LuLlLl).

Part-of-speech: The part-of-speech as determined by the Stanford POS tagger (Toutanova et al., 2003). We expect this to be useful for, e.g., finding verb tense to help distinguish between implicit LAST and NEXT operators. For the character-by-character model, we repeat the word-level part-of-speech tag for each character in the word, and characters with no part-of-speech (e.g., spaces) get no tag.

5.1.2 Input: words vs. characters

Identifying SCATE-style time entities is a sequence tagging task, similar to named entity recognition (NER), so we take inspiration from recent work in neural architectures for NER. The first neural NER models followed the prior (non-neural) work in approaching NER as a word classification problem, applying architectures such as sliding-window feedforward neural networks (Qi et al., 2009), convolutional neural networks (CNNs) with conditional random field (CRF) layers (Collobert et al., 2011), and LSTMs with CRF layers and hand-crafted features (Huang et al., 2015). More recently, character-level neural networks have also been proposed for NER, including several which combine a CNN or LSTM for learning character-based representations of words with an LSTM or LSTM-CRF for word-by-word labeling (Chiu and Nichols, 2016; Lample et al., 2016b; Ma and Hovy, 2016), as well as character-by-character sequence-to-sequence networks (Gillick et al., 2016; Kuru et al., 2016).

Based on these works, we consider two forms of input processing for our RNNs: word-by-word vs. character-by-character. Several aspects of the time normalization problem make the character-based approach especially appealing. First, many time phrases involve numbers that must be interpreted semantically (e.g., a good model should learn that months cannot be a number higher than 12), and digit-by-digit processing of numbers allows such interpretations, while treating each number as a word would result in a sparse, intractable learning problem. Second, word-based models assume that we know how to tokenize the text into words, but texts at times present challenging formats such as overnight, where over evokes a LAST operator and night is a PART-OF-DAY. Finally, character-based models can ameliorate out-of-vocabulary (OOV) words, which are a common problem when training on sparse datasets. (Hybrid word-character models, such as the LSTM-CNNs-CRF (Ma and Hovy, 2016), can address this last problem, but not the previous two.)

For our word-based model, we apply the NLTK tokenizer (Bird et al., 2009) to each sentence. We further tokenize with the regular expression "\d+|[^\d\W]+|\S" to break apart alpha-numeric expressions like 1620EDT. However, the tokenizer is unable to break apart expressions such as 19980206 and overnight. For our character-based model, no tokenization is applied and every character (including whitespace characters) is fed as input.

5.2 Time entity composition

Once the entities of the time expressions are identified, they must be composed in order to obtain their semantic interpretation.
This step of the analysis consists of two parts: linking the entities that make up a time expression together, and completing the entities' properties with the proper values. For both cases, we set a simple set of rules that follow the constraints imposed by the SCATE schema (https://github.com/bethard/anafora-annotations/blob/master/.schema/timenorm-schema.xml).

5.2.1 Time entity linking

Algorithm 1 shows the process followed to obtain the links between the time entities. First, we define an empty stack that will store the entities belonging to the same time expression. Then, we iterate over the list of entities of a document sorted by their starting character offsets (SORTBYSTART). For each of these entities (entity1) and for each entity in the stack (entity2), we check if the guidelines specify a possible link (LINKISVALID) between the types of entity1 and entity2. If such a link is possible, and it has not already been filled by another annotation, we greedily make the link (CREATELINK). When the distance in the number of characters between the entity and the end of the stack is bigger than 10, we assume that the entities do not belong to the same time expression, and thus we empty the stack. (The distance threshold was selected based on the performance on the development dataset.)

Algorithm 1: Linking time entities
    stack = ∅
    for entity1 in SORTBYSTART(entities) do
        if START(entity1) - END(stack) > 10 then
            stack = ∅
        end if
        for entity2 in stack do
            if LINKISVALID(entity1, entity2) then
                CREATELINK(entity1, entity2)
            end if
        end for
        PUSH(stack, entity1)
    end for

For example, our time entity identification model gets the YEAR, MONTH-OF-YEAR and DAY-OF-MONTH for the time expression 1992-12-23. Our time entity composition algorithm then iterates over these entities. At the beginning the stack is empty, so it just pushes the entity 1992 (YEAR) onto the stack. For the entity 12 (MONTH-OF-YEAR) it checks if the guidelines define a possible link between this entity type and the one currently in the stack (YEAR). In this case, the guidelines establish that a YEAR can have a SUB-INTERVAL link to a SEASON-OF-YEAR, a MONTH-OF-YEAR or a WEEK-OF-YEAR. Thus, the algorithm creates a SUB-INTERVAL link between 1992 and 12. The entity 12 is then pushed onto the stack. This process is repeated for the entity 23 (DAY-OF-MONTH), checking if there is a possible link to the entities in the stack (1992, 12). The guidelines define a possible SUB-INTERVAL link between MONTH-OF-YEAR and DAY-OF-MONTH, so a link is created here as well. Now, suppose that the following time entity in the list is several words ahead of 23, so the character distance between both entities is larger than 10. If that is the case, the stack is emptied and the process starts again to compose a new time expression.

5.2.2 Property completion

The last step is to associate each time entity of a time expression with a set of properties that include information needed for its interpretation. Our system decides the value of these properties as follows:

TYPE: The SCATE schema defines that some entities can only have specific values. For example, a SEASON-OF-YEAR can only be SPRING, SUMMER, FALL or WINTER, a MONTH-OF-YEAR can only be JANUARY, FEBRUARY, MARCH, etc. To complete this property we take the text span of the time entity and normalize it to the values accepted in the schema. For example, if the span of a MONTH-OF-YEAR entity was the numeric value 01 we would normalize it to JANUARY, if its span was Sep.
we would normalize it to SEPTEMBER, and so on.

VALUE: This property contains the value of a numerical entity, like DAY-OF-MONTH or HOUR-OF-DAY. To complete it, we just take the text span of the entity and convert it to an integer. If it is written in words instead of digits (e.g., nineteen instead of 19), we apply a simple grammar (https://github.com/ghewgill/text2num) to convert it to an integer.

SEMANTICS: In news-style texts, it is common that expressions like last Friday, when the DCT is a Friday, refer to the same day as the DCT instead of the previous occurrence (as it would in more standard usage of last). SCATE indicates this with the SEMANTICS property, where the value INTERVAL-INCLUDED indicates that the current interval is included when calculating the last or next occurrence. For the rest of the cases the value INTERVAL-NOT-INCLUDED is used. In our system, when a LAST operator is found, if it is linked to a DAY-OF-WEEK (e.g. Friday) that matches the DCT, we set the value of this property to INTERVAL-INCLUDED.

INTERVAL-TYPE: Operators like NEXT or LAST need an interval as reference in order to be interpreted. Normally, this reference is the DCT. For example, next week refers to the week following the DCT, and in such a case the value of the property INTERVAL-TYPE for the operator NEXT would be DOCTIME. However, sometimes the operator is linked to an interval that serves as reference by itself, for example, "by the year 2000". In these cases the value of the INTERVAL-TYPE is LINK. Our system sets the value of this property to LINK if the operator is linked to a YEAR and DOCTIME otherwise. This is a very coarse heuristic; finding the proper anchor for a time expression is a challenging open problem for which future research is needed.

5.3 Automatically generated training data

Every document in the dataset starts with a document creation time. These time expressions are quite particular: they occur in isolation and not within the context of a sentence, and they always yield a bounded interval. Thus their identification is a critical factor in an interval-based evaluation metric. However, document times appear in many different formats: "Monday, July-24, 2017", "07/24/17 09:52 AM", "08-15-17 1337 PM", etc. Many of these formats are not covered in the training data, which is drawn from a small number of news sources, each of which uses only a single format. We therefore designed a time generator to randomly generate an extra 800 isolated training examples for a wide variety of such expression formats. The generator covers 33 different formats (we use the common formats available in office suites, specifically LibreOffice), which include variants covering abbreviation, with/without delimiters, mixture of digits and strings, and different sequences of time units.

6 Experiments

We train and evaluate our models on the SCATE corpus described in Section 4. As a development dataset, 14 documents are taken as a random stratified sample from the TempEval 2013 (TimeBank + AQUAINT) portion shown in Table 1, including broadcast news documents (1 ABC, 1 CNN, 1 PRI, 1 VOA) and newswire documents (5 AP, 1 NYT, 4 WSJ). We use the interval-based evaluation metric described in Section 3, but also report more traditional information extraction metrics (precision, recall, and F1) for the time entity identification and composition steps.
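Before turning to the metrics and results, the stack-based linking procedure of Algorithm 1 (Section 5.2.1 above) can also be written as a short, runnable Python routine. This is only a sketch under assumptions: link_is_valid stands in for the SCATE schema lookup, entities are assumed to carry start/end character offsets, and the check that a link slot is not already filled is omitted; the authors' actual implementation may differ.

    def link_entities(entities, link_is_valid, max_gap=10):
        """Greedy stack-based linking of time entities, in the spirit of Algorithm 1.

        entities      : objects with .start and .end character offsets
        link_is_valid : callable(e1, e2) -> bool, standing in for the schema lookup
        max_gap       : character distance beyond which a new time expression starts
        """
        links = []
        stack = []
        for entity1 in sorted(entities, key=lambda e: e.start):
            # Start a new expression if the gap to the entities on the stack is too large.
            if stack and entity1.start - max(e.end for e in stack) > max_gap:
                stack = []
            for entity2 in stack:
                if link_is_valid(entity1, entity2):
                    links.append((entity2, entity1))  # e.g. a SUB-INTERVAL link
            stack.append(entity1)
        return links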
Let S be the set of items predicted by the system and H be the set of items produced by the humans. Precision (P), recall (R), and F1 are defined as:

  P(S, H) = |S ∩ H| / |S|
  R(S, H) = |S ∩ H| / |H|
  F1(S, H) = 2 · P(S, H) · R(S, H) / (P(S, H) + R(S, H))

For these calculations, each item is an annotation, and one annotation is considered equal to another if it has the same character span (offsets), type, and properties (with the definition applying recursively for properties that point to other annotations).

To make the experiments with different neural architectures comparable, we tuned the parameters of all models to achieve the best performance on the development data. Due to space constraints, we only list here the hyper-parameters for our best Char 3-Softmax: the embedding sizes of the character-level text, word-level text, POS tag, and Unicode character category features are 128, 300, 32 and 64, respectively. To avoid overfitting, we used dropout with probabilities 0.25, 0.15 and 0.15 for the 3 features, respectively; the sizes of the first and second layer GRU units are set to 256 and 150. We trained the model with RMSProp optimization on mini-batches of size 120, and followed standard recommendations to leave the optimizer hyperparameter settings at their default values. Each model is trained for at most 800 epochs; the longest training time, for the Char 3-Softmax model, is around 22 hours using 2x NVIDIA Kepler K20X GPUs.

6.1 Model selection

We compare the different time entity identification models described in Section 5.1, training them on the training data and evaluating them on the development data. Among the epochs of each model, we select the epoch based on the output(s) which the model is good at predicting, because basing selection on its weaknesses would yield unstable results in our preliminary experiments. For example, for 3-Softmax models, our selections rely on the performances of non-operators and implicit operators. Table 4 shows the results of the development phase.

Table 4: Precision (P), recall (R), and F1 for the different neural network architectures on time entity identification on the development data.
  Model                  P      R      F1
  Word 1-Sigmoid        60.2   52.0   55.8
  Char 1-Sigmoid        54.0   59.0   56.4
  Word 2-Softmax        58.7   63.9   61.2
  Char 2-Softmax        74.8   72.4   73.6
  Word 3-Softmax        68.3   64.9   66.6
  Char 3-Softmax        88.2   76.1   81.7
  Char 3-Softmax extra  80.6   73.4   76.8

First, we find that the character-based models outperform the word-based models. (We briefly explored using pre-trained word embeddings to try to improve the performance of the Word 1-Sigmoid model, but it yielded a performance that was still worse than the character-based model, so we didn't explore it further.) For example, the best character-based model achieves an F1 of 81.7 (Char 3-Softmax), which is significantly better than the best word-based model, achieving an F1 of only 66.6 (p=0; we used a paired bootstrap resampling significance test). Second, we find that Softmax models outperform Sigmoid models. For example, the Char 3-Softmax model achieves an F1 of 81.7, significantly better than the 56.4 F1 of the Char 1-Sigmoid model (p=0). Third, for both character- and word-based models, we find that 3-Softmax significantly outperforms 2-Softmax: the Char 3-Softmax F1 of 81.7 is better than the Char 2-Softmax F1 of 73.6 (p=0) and the Word 3-Softmax F1 of 66.6 is better than the Word 2-Softmax F1 of 61.2 (p=0.0254).
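As a side note, the annotation-level precision, recall, and F1 defined at the start of this section are straightforward to compute; the helper below is only an illustration over hashable annotation representations (e.g. tuples of span, type and properties), not the scorer used in the paper.

    def precision_recall_f1(system, human):
        """Set-based P, R, F1 over hashable annotation representations."""
        s, h = set(system), set(human)
        tp = len(s & h)                              # items predicted by the system and present in the gold set
        p = tp / len(s) if s else 0.0
        r = tp / len(h) if h else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "d"}))  # (0.666..., 0.666..., 0.666...)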
Additionally, we find that all models are better at identifying non-operators than operators, and that the explicit operators are the hardest to solve. For example, the Char 3-Softmax model gets 92.4 F1 for non-operators, 36.1 F1 for explicit operators and 79.1 F1 for implicit operators. Finally, we also train the best model, Char 3-Softmax, using the generated annotations described in Section 5.3 and achieve 76.8 F1 (Char 3-Softmax extra), i.e., the model performs better without the extra data (p=0). This is probably a result of overfitting due to the small variety of time formats in the training and development data.

From this analysis on the development set, we select two variants of the Char 3-Softmax architecture for evaluation on the test set: Char 3-Softmax and Char 3-Softmax extra. These models were then coupled with the rule-based linking system described in Section 5.2 to produce a complete SCATE-style parsing system.

6.2 Model evaluation

We evaluate both Char 3-Softmax and Char 3-Softmax extra on the test set for the identification and composition tasks. Table 5 shows the results.

Table 5: Results on the test set for the time entity identification (Ident) and time entity composition (Comp) steps. For the former we report the performance for each entity set: non-operators (Non-Op), explicit operators (Exp-Op) and implicit operators (Imp-Op).
             Char 3-Softmax          Char 3-Softmax extra
             P      R      F1        P      R      F1
  Non-Op    79.2   63.2   70.3      87.4   63.2   73.4
  Exp-Op    52.6   36.6   43.2      39.8   38.7   39.3
  Imp-Op    53.3   47.1   50.0      65.4   50.0   56.7
  Ident     70.0   54.5   61.3      69.4   55.3   61.5
  Comp      59.7   46.5   52.3      57.7   46.0   51.2

On the identification task, Char 3-Softmax extra is no worse than using the original dataset, with an overall F1 of 61.5 vs. 61.3 (p=0.5899), and using extra generated data the model is better at predicting non-operators and implicit operators with higher precisions (p=0.0096), which is the key to producing correct bounded time intervals.

To compare our approach with the state of the art, we run HeidelTime on the test documents and make use of the metric described in Section 3. This way, we can compare the intervals produced by both systems no matter the annotation schema. Table 6 shows that our model with additional randomly generated training data outperforms HeidelTime in terms of precision, with a significant difference of 12.6 percentage points (p=0.011), while HeidelTime obtains a non-significant better performance in terms of recall (p=0.1826). Overall, our model gets 3.3 more percentage points than HeidelTime in terms of F1 (p=0.2485).

Table 6: Precision (P), recall (R), and F1 of our models on the test data producing bounded time intervals. For comparison, we include the results obtained by HeidelTime.
  Model                  P      R      F1
  HeidelTime            70.9   76.8   73.7
  Char 3-Softmax        73.8   62.4   67.6
  Char 3-Softmax extra  82.7   71.0   76.4

Notice that, although the model trained without extra annotations is better in time entity composition (see Table 5), it performs much worse at producing final intervals. This is caused by the fact that this model fails to identify the non-operators that compound dates in unseen formats (see Section 5.3).

Table 7: Precision (P), recall (R), and F1 on bounded intervals on the TimeML/SCATE perfectly overlapping test data.
  Model                  P      R      F1
  HeidelTime            70.7   80.2   75.1
  Char 3-Softmax        74.3   64.2   68.9
  Char 3-Softmax extra  83.3   74.1   78.4

However, evaluating HeidelTime on the SCATE annotations may not be totally fair.
HeidelTime was developed following the TimeML schema and, as we show in Section 4, SCATE covers a wider set of time expressions. For this reason, we perform an additional evaluation. First, we compare the annotations in the test set using our interval-based metric, similar to the comparison reported in Table 2, and select those cases where TimeML and SCATE match perfectly. Then, we remove the rest of the cases from the test set. Consequently, we also remove the predictions given by the systems, both ours and HeidelTime, for those instances. Finally, we run the interval scorer using the new configuration. As can be seen in Table 7, all the models improve their performance. However, our model still performs better when it is trained with the extra annotations.

The SCATE interpreter that encodes the time intervals needs the compositional graph of a time expression to have all its elements correct. Thus, failing in the identification of any entity of a time expression results in totally uninterpretable graphs. For example, in the expression next year, if our model identifies year as a PERIOD instead of an INTERVAL, it cannot be linked to next because that violates the SCATE schema. The model can also fail in the recognition of some time entities, like summer in the expression last summer. These identification errors are caused mainly by the sparse training data. As graphs containing these errors produce unsolvable logical formulae, the interpreter cannot produce intervals and hence the recall decreases. Within those intervals that are ultimately generated, the most common mistake is to confuse the LAST and NEXT operators, and the result is an incorrectly placed interval even with correctly identified non-operators. For example, if an October with an implicit NEXT operator is instead given a LAST operator, instead of referring to [2013-10-01T00:00, 2013-11-01T00:00), it will refer to [2012-10-01T00:00, 2012-11-01T00:00). Missing implicit operators is also the main source of errors for HeidelTime, which fails with complex compositional graphs. For example, that January day in 2011 is annotated by HeidelTime as two different intervals, corresponding respectively to January and 2011. As a consequence, HeidelTime predicts not one but two incorrect intervals, affecting its precision.

7 Discussion

As for the time entity identification task, the performance differences between the development and test datasets could be attributed to the annotation distributions of the datasets. For example, there are 10 Season-Of-Year annotations in the test set while there are no such annotations in the development dataset; the relative frequencies of the annotations Minute-Of-Hour, Hour-Of-Day, Two-Digit-Year and Time-Zone in the test set are much lower, and our models are good at predicting such annotations. Explicit operators are very lexically dependent, e.g. LAST corresponds to one word from the set {last, latest, previously, recently, past, over, recent, earlier, the past, before}, and the majority of them appear once or twice in the training and development sets.

Our experiments verify the advantages of character-based models in predicting SCATE annotations, which is in agreement with our explanations in Section 5.1.2: word-based models tend to fail to distinguish numbers from digit-based time expressions. It is difficult for word-based models to catch some patterns of time expressions, such as 24th and 25th, August and Aug., etc., while character-based models are robust to such variance.
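The character-level inputs that give these models their robustness are cheap to compute; the sketch below shows how the per-character text and Unicode-category features from Section 5.1.1 can be derived with the Python standard library alone. It is illustrative only; the optional POS lookup is assumed to come from an external tagger.

    import unicodedata

    def char_features(text, pos_tags=None):
        """Per-character features: the character itself and its Unicode category
        (Lu, Ll, Nd, Po, Zs, ...). pos_tags, if given, maps each character index
        to the POS tag of its word; characters outside any word get None."""
        feats = []
        for i, ch in enumerate(text):
            feats.append({
                "char": ch,
                "unicode_cat": unicodedata.category(ch),
                "pos": pos_tags[i] if pos_tags else None,
            })
        return feats

    for f in char_features("May 25"):
        print(f["char"], f["unicode_cat"])
    # M Lu / a Ll / y Ll / (space) Zs / 2 Nd / 5 Nd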
We ran an experiment to see whether these benefits were unique to compositional annotations like those of SCATE, or apply more generally to simply recognizing time expressions. We used the TimeML annotations from AQUAINT and TimeBank (see Table 1) to train two multi-class classifiers to identify TIMEX3 annotations. The models were similar to our Char 3-Softmax and Word 3-Softmax models, using the same parameter settings, but with a single softmax output layer to predict the four types of TIMEX3: DATE, TIME, DURATION, and SET. As shown in Table 8, on the test set the word-based model significantly outperforms the character-based model in terms of both time expressions (p=0.0428) and the subset of time expressions that contain digits (p=0.0007).

Table 8: Precision (P), recall (R), and F1 for character-based and word-based models in predicting TimeML TIMEX3 annotations on the TempEval 2013 test set. TIMEX3-Digits is the subset of annotations that contain digits.
          TIMEX3                TIMEX3-Digits
          P      R      F1      P      R      F1
  Char   70.2   62.7   66.2    73.8   71.4   72.6
  Word   81.3   69.0   74.7    86.2   79.4   82.6

These results suggest that the reason character-based models are more successful on the SCATE annotations is that SCATE breaks time expressions down into meaningful sub-components. For example, TimeML would simply call Monday, 1992-05-04 a DATE, and call 15:00:00 GMT Saturday a TIME. SCATE would identify four and five, respectively, different types of semantic entities in these expressions, and each SCATE entity would be either all letters or all digits. In TimeML, the model is faced with difficult learning tasks, e.g., that sometimes a weekday name is part of a DATE and sometimes it is part of a TIME, while in SCATE, a weekday name is always a DAY-OF-WEEK.

On the other hand, running the entity composition step with gold entity identification achieves 72.6 in terms of F1. One of the main causes of errors in this step is the heuristic to complete the INTERVAL-TYPE property. As we explain in Section 5.2, we implement a too coarse set of rules for this case. Another source of errors is the distance of 10 characters we use to decide if time entities belong to the same time expression. This condition prevents the creation of some links; for example, the expression "Later" at the beginning of a sentence typically refers to another time interval in a previous sentence, so the distance between them is much longer.

8 Conclusion

We have presented the first model for time normalization trained on SCATE-style annotations. The model outperforms the rule-based state of the art, proving that describing time expressions in terms of compositional time entities is suitable for machine learning approaches. This broadens the research in time normalization beyond the more restricted TimeML schema. We have shown that a character-based neural network architecture has advantages for the task over a word-based system, and that a multi-output network performs better than producing a single output. Furthermore, we have defined a new interval-based evaluation metric that allows us to perform a comparison between annotations based on both the SCATE and TimeML schemas, and found that SCATE provides a wider variety of time expressions. Finally, we have seen that the sparse training set available induces model overfitting and that the largest number of errors are committed in those cases that appear less frequently in the annotations. This is more significant in the case of explicit operators because they are very dependent on the lexicon.
Improving performance on these cases is our main goal for future work. Accord- ing to the results presented in this work, it seems that a solution would be to obtain a wider training set, so a promising research line is to extend our approach to automatically generate new annotations. 9 Software The code for the SCATE-style time normalization models introduced in this paper is available at https://github.com/clulab/timenorm. 10 Acknowledgements We thank the anonymous reviewers as well as the action editor, Mona Diab, for helpful comments on an earlier draft of this paper. The work was funded by the THYME project (R01LM010090) from the National Library Of Medicine, and used computing resources supported by the National Science Founda- tion under Grant No. 1228509. The content is solely the responsibility of the authors and does not nec- essarily represent the official views of the National Library Of Medicine, National Institutes of Health, or National Science Foundation. References [Bethard and Parker2016] Steven Bethard and Jonathan Parker. 2016. A semantically compositional anno- tation scheme for time normalization. In Proceedings 354 of the Tenth International Conference on Language Re- sources and Evaluation (LREC 2016), Paris, France, 5. European Language Resources Association (ELRA). [Bethard2013] Steven Bethard. 2013. A synchronous con- text free grammar for time normalization. In Proceed- ings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 821–826, Seattle, Washington, USA, 10. Association for Computational Linguistics. [Bird et al.2009] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. [Chiu and Nichols2016] Jason P. C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistic, 4:357–370. [Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empiri- cal evaluation of gated recurrent neural networks on se- quence modeling. arXiv preprint arXiv:1412.3555v1. [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (al- most) from scratch. The Journal of Machine Learning Research, 12:2493–2537, November. [Fischer and Strötgen2015] Frank Fischer and Jannik Strötgen. 2015. When Does (German) Literature Take Place? On the Analysis of Temporal Expressions in Large Corpora. In Proceedings of DH 2015: Annual Conference of the Alliance of Digital Humanities Orga- nizations, volume 6, Sydney, Australia. [Gillick et al.2016] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilin- gual language processing from bytes. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, NAACL HLT 2016, The 2016 Conference of the North Ameri- can Chapter of the Association for Computational Lin- guistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1296–1306. The Association for Computational Linguistics. [Graves et al.2013] Alex Graves, Abdel-rahman Mo- hamed, and Geoffrey Hinton. 2013. Speech recog- nition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE. [Han and Lavie2004] Benjamin Han and Alon Lavie. 2004. A framework for resolution of time in natural language. 3(1):11–32, March. 
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991. [Kuru et al.2016] Onur Kuru, Ozan Arkan Can, and Deniz Yuret. 2016. Charner: Character-level named entity recognition. In COLING 2016, 26th International Con- ference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 911–921. [Lample et al.2016a] Guillaume Lample, Miguel Balles- teros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016a. Neural architectures for named entity recognition. In Proceedings of the 2016 Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Compu- tational Linguistics. [Lample et al.2016b] Guillaume Lample, Miguel Balles- teros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016b. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Lan- guage Technologies, San Diego California, USA, June 12-17, 2016, pages 260–270. [Lee et al.2014] Kenton Lee, Yoav Artzi, Jesse Dodge, and Luke Zettlemoyer. 2014. Context-dependent semantic parsing for time expressions. In Proceedings of the 52nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 1437–1447, Baltimore, Maryland, 6. Association for Computational Linguistics. [Lin et al.2015] Chen Lin, Elizabeth W. Karlson, Dmitriy Dligach, Monica P. Ramirez, Timothy A. Miller, Huan Mo, Natalie S. Braggs, Andrew Cagan, Vivian S. Gainer, Joshua C. Denny, and Guergana K. Savova. 2015. Automatic identification of methotrexate- induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. Jour- nal of the American Medical Informatics Association, 22(e1):e151–e161. [Llorens et al.2012] Hector Llorens, Leon Derczynski, Robert J. Gaizauskas, and Estela Saquete. 2012. TIMEN: An Open Temporal Expression Normalisa- tion Resource. In Language Resources and Evaluation Conference, pages 3044–3051. European Language Re- sources Association (ELRA). [Ma and Hovy2016] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- tics (ACL 2016), volume 1. Association for Computa- tional Linguistics. [Mazur and Dale2010] Pawet Mazur and Robert Dale. 2010. Wikiwars: A new corpus for research on tempo- ral expressions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 913–922, Stroudsburg, PA, USA. Association for Computational Linguistics. 355 [Plank et al.2016] Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tag- ging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- tics (Volume 2: Short Papers), pages 412–418, Berlin, Germany, August. Association for Computational Lin- guistics. [Pustejovsky et al.2003a] James Pustejovsky, José Castaño, Robert Ingria, Roser Saurı́, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003a. TimeML: Robust Specification of Event and Temporal Expressions in Text. In IWCS-5, Fifth International Workshop on Computational Semantics. 
[Pustejovsky et al.2003b] James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003b. The TimeBank corpus. In Proceedings of Corpus Linguistics 2003, Lancaster. [Pustejovsky et al.2010] James Pustejovsky, Kiyong Lee, Harry Bunt, and Laurent Romary. 2010. ISO-TimeML: An International Standard for Semantic Annotation. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC’10), Val- letta, Malta. European Language Resources Associa- tion (ELRA). [Qi et al.2009] Yanjun Qi, Koray Kavukcuoglu, Ronan Collobert, Jason Weston, and Pavel P. Kuksa. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM conference on Information and knowledge management, ACM, pages 1737–1740. [Schilder2004] Frank Schilder. 2004. Extracting meaning from temporal nouns and temporal prepositions. ACM Transactions on Asian Language Information Process- ing (TALIP) - Special Issue on Temporal Information Processing, 3(1):33–50, March. [Strötgen and Gertz2013] Jannik Strötgen and Michael Gertz. 2013. Multilingual and cross-domain tem- poral tagging. Language Resources and Evaluation, 47(2):269–298. [Strötgen and Gertz2015] Jannik Strötgen and Michael Gertz. 2015. A baseline temporal tagger for all lan- guages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 541–547, Lisbon, Portugal, September. Associa- tion for Computational Linguistics. [Strötgen et al.2013] Jannik Strötgen, Julian Zell, and Michael Gertz. 2013. Heideltime: Tuning English and developing Spanish resources for TempEval-3. In Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval ’13, pages 15–19. Asso- ciation for Computational Linguistics. [Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic de- pendency network. In Proceedings of the 2003 Confer- ence of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics. [UzZaman et al.2013] Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 1–9, At- lanta, Georgia, USA, 6. Association for Computational Linguistics. [Verhagen et al.2007] Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pages 75–80, Prague, Czech Republic. [Verhagen et al.2010] Marc Verhagen, Roser Sauri, Tom- maso Caselli, and James Pustejovsky. 2010. SemEval- 2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57–62, Uppsala, Sweden, 7. Association for Computa- tional Linguistics. 
[Vossen et al.2016] Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, Marco Rospocher, and Roxane Segers. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Special Issue Knowledge-Based Systems, Elsevier. 356 work_2sjrdf2ktjborjpbns3i4rort4 ---- 基于CamShift的视频跟踪算法改进及实现 International Journal of Advanced Network, Monitoring and Controls Volume 03, No.04, 2018 97 Improvement and Realization of CamShift Algorithm Based MotionImage Tracking Wang Yubian* Department of Railway Transportation Control Belarusian State University of Transport, 34, Kirova street, Gomel,246653, Republic of Belarus *is the communication author. e-mail: alika_wang@mail.ru Yuri Shebzukhov Department of the International Relations Belarusian State University of Transport, Republic of Belarus 34, Kirova street, Gomel, 246653, Republic of Belarus e-mail: oms@bsut.by Abstract—The detection and tracking technology of moving object image is one of the key technologies of computer vision, and is widely used in many fields such as industry, transportation, and military. The detection and tracing of moving object in the motion scenes based on the UAV platform is a technical difficulty in this field. In practical applications, the characteristics of complicated environment, small target object and moving platform require higher real-time performance and reliability of the algorithm. If it is possible to add some other features of the target to the tracking process, it will be possible to improve the shortcomings of Camshift itself. Based on the Camshift tracking algorithm, this paper integrates SURF feature detection into the algorithm, which greatly improves the tracking accuracy of the target object, and also has better real-time performance which can achieve better tracking performance of the object. Keywords-Motion Image; CamShift Algorithm; SURF; Feature Detection I. INTRODUCTION Video tracking is the key technology for motion target object detection in dynamic sceneson UAV platform. It can be realized by two methods: one is based on target recognition technology, of which the core concept is frame-by-frame recognition algorithm for motion video to identify the target object and determine the target matching. The other is based on the detection technology of the moving object, of which the core concept is the active detection of the moving object, and the position of the moving object is determined in accordance with the detection result to realize the tracking.This method can achieve the tracking of any moving object without the need of complicated priori information for detection, such ascharacteristics of object shape, object sizes. However, the tracking effect of various tracking algorithms also depends on the background migration of the object, the unpredictable tracking path, the unpredictable target motion path and mode, the scene switch, the target object movement does not have an analyzable pattern, the change of camera model, the camera shift, and the change of illumination condition. And the causes of the changes in the color and shape of the moving object are very different [1]. 
The current mainstream motion object tracking methods with good tracking performance mainly include feature-based tracking methods, region-based tracking methods, model-based tracking methods, motion estimation-based tracking methods, and contour-based tracking methods. The detection and tracking algorithms of conventional motion objects are only suitable for scenes with a static or almost static background, and are not suitable for the detection and tracking of moving objects in UAV video. Therefore, the digital sequence information of the video images acquired by the drone should meet real-time, accuracy, robustness and other requirements [2].

DOI: 10.21307/ijanmc-2019-028

Current research shows that the target object tracking method based on the Camshift algorithm can meet the requirements of drone video target tracking. The CamShift algorithm performs target object recognition and tracking by analyzing the Hue component information of the target region in the HSV color space. The target has low deformation sensitivity, and the algorithm has good real-time performance, little computation and low complexity; therefore it has been extensively studied. In drone video target tracking, the Camshift algorithm [3], compared to other tracking methods, has both advantages and disadvantages. Under conditions such as a background color similar to that of the target object, or a complicated scene, the Camshift algorithm may produce tracking errors or fail to track, because the algorithm relies on the color information of the moving object. If other auxiliary features of the object can be obtained during the search process and the input conditions of the algorithm are specified, it is possible to make up for the problems caused by Camshift in such scenes. Because the SURF algorithm has the advantage of good object recognition, implementing Speeded Up Robust Features (SURF) in the Camshift algorithm will greatly improve the tracking accuracy and the tracking reliability of the object. The improved method preserves the good real-time performance of the tracking, greatly improves the tracking accuracy of the object by the UAV, and eventually achieves a better tracking effect for the moving object.

II. THE CHARACTERISTICS OF SURF

The tracking of the characteristics of the target object is based on the tracking of points of interest, which is often used in engineering applications and works well. The difficulty of this method lies in the selection and extraction of features [4]. The selected features should fully cover the key features of the target object in different scene settings, and such features should also be convenient to extract. In general, if the number of sampling points is insufficient in the process of extracting features, it is easy to lose track of the object and the tracking effect deteriorates. On the contrary, the calculation amount and complexity will be greatly increased, and the practical application cannot be satisfied. Although the Harris corner detection method is a traditional point of interest detection method, due to its fixed scale characteristics it becomes difficult to determine the change of position of the target object between image frames when the target object is deformed or changes in size. Prof. David G.
Lowe of the University of British Columbia in Canada first proposed the Scale Invariant Feature Transform (SIFT) [5]. The SIFT algorithm queries feature key points by performing feature detection within a constructed scale space. The orientation of each point of interest is selected according to the gradient of its neighborhood, so that the feature within a scale does not change with orientation. However, the algorithm has high computational complexity, places high requirements on hardware devices, and suffers from poor real-time performance. On the basis of this algorithm, Yan Ke et al. introduced Principal Components Analysis (PCA) into the SIFT system and proposed the PCA-SIFT algorithm, which markedly improved on the matching inefficiency of the common SIFT algorithm. However, this method can lead to the failure of feature extraction in later stages as well as a deterioration of the distinctiveness of the features. On this basis, a point of interest algorithm based on SURF was proposed [6]. The extracted features are used as the key local features of the images. The speed of the calculation is improved by the integral image method. The points of interest obtained by the Haar wavelet transform are then used to obtain the main orientation and the feature vector [7]. Finally, the Euclidean metric is calculated to verify the matching effect between the images. The SURF feature is invariant in the presence of changes in brightness, translation, rotation, and scale. In addition, the method does not show any negative effect on its robustness under noise interference, even if the viewing angle changes. This method not only realizes accurate feature recognition, but also reduces the computational complexity, which greatly improves the efficiency of the method in use, and it has broad application.

III. THE SURF ALGORITHM

A. Constructing the Hessian matrix

SURF relies on an approximation of the determinant images of the Hessian matrix. The Hessian matrix is the core of the SURF algorithm [8]. First, the Hessian matrix of each pixel is calculated according to the equation, and then the Hessian matrix discriminant is used to determine whether the point is an extremum. Gaussian filtering is used to construct the Gaussian pyramid [9]. Compared with the Gaussian pyramid construction process of the SIFT algorithm, the speed of the SURF algorithm is improved. In the SIFT algorithm, the image space occupancy of each set is different: the previous set of images is downsampled by 1/4 to obtain the next scale, while images within the same set have the same size and differ only in the scale σ used. In addition, in the blurring process the Gaussian template size is always constant and only the scale σ changes. For the SURF algorithm, the image size remains the same; only the Gaussian blur template size and the scale σ need to be changed.

B. Preliminary determination of points of interest using non-maximum suppression

Each pixel processed by the Hessian matrix is compared with the 26 points in its three-dimensional neighborhood. If it is the extremum among these points, it is selected as a preliminary point of interest for the next step [10]. The detection process uses a filter whose size corresponds to the scale of the image. In this paper, a 3×3 filter is used as an example for the detection analysis.
The candidate point of interest is compared with the remaining 8 points in its own scale layer and with the 9 pixel points in each of the two adjacent scale layers above and below it, which completes the comparison against 26 points in the three-dimensional neighborhood.

C. Precisely locating the extremum

Sub-pixel points of interest are obtained by three-dimensional linear interpolation, and points whose values are less than a certain threshold are removed. Raising this threshold increases the required extremum strength, so the number of detected points of interest is reduced; finally, only a few strongly characteristic points are detected and the amount of work is reduced [11].

D. Selecting the main orientation of the point of interest

In order to ensure rotational invariance, SURF does not compute a gradient histogram; instead, the Haar wavelet responses around the region of the point of interest are computed [12]. That is, centering on the point of interest, within a circular neighborhood of radius 6s (where s is the scale of the point of interest) and with a Haar wavelet side length of 4s, the Haar wavelet responses of all points within a 60-degree fan are computed in both the x- and y-directions. The responses are summed, and a Gaussian weight coefficient is assigned to the Haar wavelet responses so as to increase the contribution of responses close to the point of interest and suppress the contribution of responses far from it. The responses within the 60-degree window are then added, eventually yielding a new vector, and the whole circular region is traversed. The main orientation of the point of interest is defined by the orientation of the longest such vector [13].

E. Constructing the descriptor of the SURF point of interest [14]

A square region is extracted around the point of interest. The size of the window is 20s, and the orientation of the region is the main orientation detected in the previous step. The region is then divided into 16 sub-regions, in each of which the Haar wavelet responses in the x- and y-directions of 25 pixels are summed, where the x- and y-directions are taken relative to the main orientation. The Haar wavelet response consists of the sum in the horizontal direction, the sum of the absolute values in the horizontal direction, the sum in the vertical direction, and the sum of the absolute values in the vertical direction.

IV. EXTRACT AND MATCH THE POINTS OF INTEREST

(1) Select a frame from the drone tracking video and extract the points of interest using the SURF detection method, as shown in Figure 1.

Figure 1. Extract points of interest

(2) Match the target region. After selecting the target region, extract the target regions in two adjacent frames and match the points of interest of the two frames. In Figure 2, it can be seen that 6 points of interest are successfully matched.

Figure 2. Matching the points of interest

V. VERIFICATION

After manually selecting the target window, a feedback mechanism is used to calculate the color similarity between the Camshift tracking window and the initial window, and the feature similarity between the SURF tracking window and the initial window. Large displacements are suppressed, and the displacement weight is assigned dynamically to the two tracking algorithms in accordance with the Bhattacharyya distance [15]. The Camshift tracking algorithm is preferred when the motion tracking is stable; otherwise, the SURF tracking method [16] is preferred.
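As an illustration of how such a combination can be wired up with off-the-shelf tools, the sketch below uses OpenCV: CamShift runs on the hue back-projection of the selected window, and the Bhattacharyya distance between the current and initial hue histograms serves as the stability check that decides when to fall back to feature matching. This is only a schematic reading of the verification step, not the authors' code; the threshold value is a placeholder, and SURF itself requires the opencv-contrib build (cv2.xfeatures2d), so ORB or another detector may have to stand in for it.

    import cv2

    def hue_histogram(frame_bgr, window):
        x, y, w, h = window
        roi = frame_bgr[y:y + h, x:x + w]
        hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
        return cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    def track(video_path, init_window, bhatta_threshold=0.6):
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        init_hist = hue_histogram(frame, init_window)
        window = init_window
        term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            backproj = cv2.calcBackProject([hsv], [0], init_hist, [0, 180], 1)
            rot_box, window = cv2.CamShift(backproj, window, term)
            # Stability check: Bhattacharyya distance between current and initial hue histograms.
            dist = cv2.compareHist(init_hist, hue_histogram(frame, window),
                                   cv2.HISTCMP_BHATTACHARYYA)
            if dist > bhatta_threshold:
                # Tracking looks unstable: a feature-based step (e.g. SURF/ORB matching
                # against the initial window) would be used here to re-locate the target.
                pass
        cap.release()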
The experiments show that the tracking method performs well and can solve the problem of tracking interference caused by background changes and object similarity. A picture of the tracked object is taken every 15 seconds; examples are shown in Figures 3, 4, 5, and 6.

Figure 3. Tracking image 1

Figure 4. Tracking image 2

Figure 5. Tracking image 3

Figure 6. Tracking image 4

As shown in Figures 3, 4, 5, and 6 above, the drone achieves good results in tracking the moving object.

VI. CONCLUSION

Based on the classic Camshift tracking algorithm and focusing on the deficiencies of the algorithm in the motion object tracking of drone video, this paper proposes the combination of the Camshift algorithm and SURF feature detection to realize the tracking of a moving object in UAV video. The experimental results show that the proposed method can effectively track and locate the target object against a more complex aerial photography background. The experiment achieved good results, basically realizing the tracking of the moving object; the tracking speed is fast, so the real-time performance is satisfactory and the time consumption is small. However, in practical applications, the environment of object tracking is more complicated and diverse. Further study of this work can be carried out in the following respects. This paper only studied the tracking of a single object and does not involve the tracking of multiple similar objects; multi-object tracking has practical significance in video surveillance, intelligent traffic detection, air formation, and geographic monitoring, so it is very necessary to further study the tracking of multiple objects. The object tracking system in this paper also needs to be improved, including optimization of the logic system and the addition of parallel processing to improve the tracking efficiency of the system. Finally, the Camshift algorithm in this paper is still based on a color histogram, which is sensitive to changes in illuminance and to changes in the color of objects; when the resolution of the camera is not high and the ambient light is insufficient, the tracking effect is not good, and therefore further study should focus on tracking physical features that are insensitive to illumination.

REFERENCES
[1] Liu Yanli, Tang Xianqi, Chen Yuedong. Application research of moving target tracking algorithm based on improved Camshift [J]. Journal of Anhui Polytechnic University, 2012, 27(2): 74-77.
[2] Xiong Tan, Xuchu Yu, Jingzheng Liu, Weijie Huang. Object Fast Tracking Based on Unmanned Aerial Vehicle Video [C]// Proceedings of PACIIA, IEEE Press, 2010: 244-248.
[3] C. Harris, M. J. Stephens. A Combined Corner and Edge Detector [C]. Proc. of the 4th Alvey Vision Conf, 2013: 147-152.
[4] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints [J]. International Journal of Computer Vision, 2014, 60(2): 91-110.
[5] C. Harris, M. J. Stephens. A Combined Corner and Edge Detector [C]. Proc. of the 4th Alvey Vision Conf, 2015: 147-152.
[6] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints [J]. International Journal of Computer Vision, 2014, 60(2): 91-110.
[7] Liu Yawei. Review of target detection and tracking methods in UAV aerial photography video [J]. Airborne Missile, 2016, (9): 53-56.
[8] Yan K, Sukthankar R. A more distinctive representation for local image descriptors [C]// Proceedings of CVPR, Los Alamitos: IEEE Press, 2014: 268-235.
[9] Bay H, Ess A, Tuytelaars T. Speeded up robust features (SURF) [J]. Computer Vision and Image Understanding, 2007, 110(3): 346-359.
[10] Leutenegger S, Chli M, Siegwart R. BRISK: binary robust invariant scalable keypoints [C]// Proceedings of ICCV, IEEE Press, 2013: 326-329.
[11] Cui Zhe. Image feature point extraction and matching based on SIFT algorithm [D]. Xi'an: Xidian University, 2016: 38-46.
[12] Yu Huai, Yang Wen. A fast feature extraction and matching algorithm for UAV aerial images [J]. Journal of Electronics and Information Technology, 2016, 38(3): 509-516.
[13] Wang Jianxiong. Research on key technologies of low altitude photogrammetry of unmanned airship and practice of large scale map formation [D]. Xi'an: Chang'an University, 2011: 36-48.
[14] Li Yifei. Research on PID control in four-rotor aircraft [J]. Technology and Market, 2016, 07: 90-91.
[15] Li Xiang, Wang Yongjun, Li Zhi. Misalignment error and correction of attitude system vector sensor [J]. Journal of Sensor Technology, 2017, 02: 266-271.
[16] Wang Donghua, Yue Dawei. Design and implementation of large remote sensing image correction effect detection system [J]. Computer Programming Skills and Maintenance, 2015, 12: 118-120.
work_2sn25vxvsrbzrjcp6ps7nehcte ----
Combining Minimally-supervised Methods for Arabic Named Entity Recognition
Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio
School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
{mjaltha, udo, poesio}@essex.ac.uk

Abstract

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques.
We trained a semi-supervised NER classifier and another one using distant learning techniques, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.

Transactions of the Association for Computational Linguistics, vol. 3, pp. 243–255, 2015. Action Editor: Ryan McDonald. Submission batch: 1/2015; Revision batch 4/2015; Published 5/2015. © 2015 Association for Computational Linguistics. Distributed under a CC-BY-NC-SA 4.0 license.

1 Introduction

Supervised learning techniques are very effective and widely used to solve many NLP problems, including NER (Sekine and others, 1998; Benajiba et al., 2007a; Darwish, 2013). The main disadvantage of supervised techniques, however, is the need for a large annotated corpus. Although a considerable amount of annotated data is available for many languages, including Arabic (Zaghouani, 2014), changing the domain or expanding the set of classes always requires domain-specific experts and new annotated data, both of which demand time and effort. Therefore, much of the current research on NER focuses on approaches that require minimal human intervention to export the named entity (NE) classifiers to new domains and to expand NE classes (Nadeau, 2007; Nothman et al., 2013).

Semi-supervised (Abney, 2010) and distant learning approaches (Mintz et al., 2009; Nothman et al., 2013) are alternatives to supervised methods that do not require manually annotated data. These approaches have proved to be effective and easily adaptable to new NE types. However, the performance of such methods tends to be lower than that achieved with supervised methods (Althobaiti et al., 2013; Nadeau, 2007; Nothman et al., 2013).

We propose combining these two minimally supervised methods in order to exploit their respective strengths and thereby obtain better results. Semi-supervised learning tends to be more precise than distant learning, which in turn leads to higher recall than semi-supervised learning. In this work, we use various classifier combination schemes to combine the minimal supervision methods. Most previous studies have examined classifier combination schemes to combine multiple supervised-learning systems (Florian et al., 2003; Saha and Ekbal, 2013), but this research is the first to combine minimal supervision approaches. In addition, we report our results from testing the recently proposed Independent Bayesian Classifier Combination (IBCC) scheme (Kim and Ghahramani, 2012; Levenberg et al., 2014) and comparing it with traditional voting methods for ensemble combination.

2 Background

2.1 Arabic NER

A lot of research has been devoted to Arabic NER over the past ten years. Much of the initial work employed hand-written rule-based techniques (Mesfar, 2007; Shaalan and Raza, 2009; Elsebai et al., 2009). More recent approaches to Arabic NER are based on supervised learning techniques. The most common supervised learning techniques investigated for Arabic NER are Maximum Entropy (ME) (Benajiba et al., 2007b), Support Vector Machines (SVMs) (Benajiba et al., 2008), and Conditional Random Fields (CRFs) (Benajiba and Rosso, 2008; Abdul-Hamid and Darwish, 2010).
Darwish (2013) presented cross-lingual features for NER that make use of the linguistic properties and knowledge bases of another language. In his study, English capitalisation features and an English knowledge base (DBpedia) were exploited as dis- criminative features for Arabic NER. A large Ma- chine Translation (MT) phrase table and Wikipedia cross-lingual links were used for translation between Arabic and English. The results showed an overall F-score of 84.3% with an improvement of 5.5% over a strong baseline system on a standard dataset (the ANERcorp set collected by Benajiba et al. (2007a)). Abdallah et al. (2012) proposed a hybrid NER system for Arabic that integrates a rule-based sys- tem with a decision tree classifier. Their inte- grated approach increased the F-score by between 8% and 14% when compared to the original rule based system and the pure machine learning tech- nique. Oudah and Shaalan (2012) also developed hybrid Arabic NER systems that integrate a rule- based approach with three different supervised tech- niques: decision trees, SVMs, and logistic regres- sion. Their best hybrid system outperforms state-of- the-art Arabic NER systems (Benajiba and Rosso, 2008; Abdallah et al., 2012) on standard test sets. 2.2 Minimal Supervision and NER Much current research seeks adequate alternatives to expensive corpus annotation that address the limita- tions of supervised learning methods: the need for substantial human intervention and the limited num- ber of NE classes that can be handled by the system. Semi-supervised techniques and distant learning are examples of methods that require minimal supervi- sion. Semi-supervised learning (SSL) (Abney, 2010) has been used for various NLP tasks, including NER (Nadeau, 2007). ‘Bootstrapping’ is the most com- mon semi-supervised technique. Bootstrapping in- volves a small degree of supervision, such as a set of seeds, to initiate the learning process (Nadeau and Sekine, 2007). An early study that introduced mutual bootstrapping and proved highly influential is (Riloff and Jones, 1999). They presented an al- gorithm that begins with a set of seed examples of a particular entity type. Then, all contexts found around these seeds in a large corpus are compiled, ranked, and used to find new examples. Pasca et al. (2006) used the same bootstrapping technique as Riloff and Jones (1999), but applied the technique to very large corpora and managed to generate one million facts with a precision rate of about 88%. Ab- delRahman et al. (2010) proposed to integrate boot- strapping semi-supervised pattern recognition and a Conditional Random Fields (CRFs) classifier. They used semi-supervised pattern recognition in order to generate patterns that were then used as features in the CRFs classifier. Distant learning (DL) is another popular paradigm that avoids the high cost of supervision. It depends on the use of external knowledge (e.g., encyclopedias such as Wikipedia, unlabelled large corpora, or external semantic repositories) to increase the performance of the classifier, or to automatically create new resources for use in the learning process (Mintz et al., 2009; Nguyen and Moschitti, 2011). Nothman et al. (2013) automatically created massive, multilingual training annotations for NER by exploiting the text and in- ternal structure of Wikipedia. They first categorised Wikipedia articles into a specific set of named entity types across nine languages: Dutch, English, French, German, Italian, Polish, Portuguese, Rus- 244 sian, and Spanish. 
Then, Wikipedia’s links were transformed into named entity annotations based on the NE types of the target articles. Following this approach, millions of words were annotated in the aforementioned nine languages. Their method for automatically deriving corpora from Wikipedia outperformed the methods proposed by Richman and Schone (2008) and Mika et al. (2008) when testing the Wikipedia-trained models on CONLL shared task data and other gold-standard corpora. Alotaibi and Lee (2013) presented a methodology to automatically build two NE-annotated sets from Arabic Wikipedia. The corpora were built by transforming links into NE annotations according to the NE type of the target articles. POS-tagging, morphological analysis, and linked NE phrases were used to detect other mentions of NEs that appear without links in text. Their Wikipedia-trained model performed well when tested on various newswire test sets, but it did not surpass the performance of the supervised classifier that is trained and tested on data sets drawn from the same domain. 2.3 Classifier Combination and NER We are not aware of any previous work combin- ing minimally supervised methods for NER task in Arabic or any other natural language, but there are many studies that have examined classifier com- bination schemes to combine various supervised- learning systems. Florian et al. (2003) presented the best system at the NER CoNLL 2003 task, with an F-score value equal to 88.76%. They used a combination of four diverse NE classifiers: the transformation-based learning classifier, a Hidden Markov Model classifier (HMM), a robust risk min- imization classifier based on a regularized winnow method (Zhang et al., 2002), and a ME classifier. The features they used included tokens, POS and chunk tags, affixes, gazetteers, and the output of two other NE classifiers trained on richer datasets. Their methods for combining the results of the four NE classifiers improved the overall performance by 17- 21% when compared with the best performing clas- sifier. Saha and Ekbal (2013) studied classifier combi- nation techniques for various NER models under single and multi-objective optimisation frameworks. They used seven diverse classifiers - naive Bayes, decision tree, memory based learner, HMM, ME, CRFs, and SVMs - to build a number of voting mod- els based on identified text features that are selected mostly without domain knowledge. The combina- tion methods used were binary and real vote-based ensembles. They reported that the proposed multi- objective optimisation classifier ensemble with real voting outperforms the individual classifiers, the three baseline ensembles, and the corresponding sin- gle objective classifier ensemble. 3 Two Minimally Supervised NER Classifiers Two main minimally supervised approaches have been used for NER: semi-supervised learning (Al- thobaiti et al., 2013) and distant supervision (Noth- man et al., 2013). We developed state-of-the-art classifiers of both types that will be used as base classifiers in this paper. Our implementations of these classifiers are explained in Section 3.1 and Section 3.2. 3.1 Semi-supervised Learning As previously mentioned, the most common SSL technique is bootstrapping, which only requires a set of seeds to initiate the learning process. We used an algorithm adapted from Althobaiti et al. (2013) and contains three components, as shown in Figure 1. Pattern Induction Instance Extraction Instance Ranking/Selection Seed Instances Figure 1: The Three Components of SSL System. 
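To make the pipeline of Figure 1 concrete, the sketch below shows one plausible shape of the bootstrapping loop. The helper functions (induce_patterns, match_pattern) and data structures are hypothetical placeholders rather than the authors' code; the ranking criterion (pattern variety instead of raw frequency) and the top-m selection rule follow the description given in the next paragraphs.

```python
# Minimal sketch of the three components in Figure 1 (illustrative only).
def bootstrap_entities(seeds, corpus, iterations=10):
    """Grow a list of named entities of one type from a handful of seeds."""
    instances = set(seeds)
    m = len(seeds)
    for _ in range(iterations):
        # Pattern induction: contexts observed around the current instances.
        patterns = induce_patterns(instances, corpus)        # hypothetical helper
        # Instance extraction: apply each pattern and remember which distinct
        # patterns extracted each candidate entity.
        extracted_by = {}                                    # candidate -> set of patterns
        for p in patterns:
            for candidate in match_pattern(p, corpus):       # hypothetical helper
                extracted_by.setdefault(candidate, set()).add(p)
        # Ranking/selection: rank by pattern variety and keep only the top m+1
        # candidates as the seed set for the next iteration.
        ranked = sorted(extracted_by, key=lambda c: len(extracted_by[c]), reverse=True)
        m += 1
        instances = set(ranked[:m])
    return instances
```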
The algorithm begins with a list of a few exam- ples of a given NE type (e.g., ‘London’ and ‘Paris’ can be used as seed examples for location entities) and learns patterns (P) that are used to find more ex- amples (candidate NEs). These examples are even- tually sorted and used again as seed examples for the next iteration. Our algorithm does not use plain frequencies since absolute frequency does not always produce good examples. This is because bad examples will be extracted by one pattern, however unwantedly, as many times as the bad examples appear in the text in relatively similar contexts. Meanwhile, good exam- 245 ples are best extracted using more than one pattern, since they occur in a wider variety of contexts in the text. Instead, our algorithm ranks candidate NEs ac- cording to the number of different patterns that are used to extract them, since pattern variety is a better cue to semantics than absolute frequency (Baroni et al., 2010). After sorting the examples according to the num- ber of distinct patterns, all examples but the top m are discarded, where m is set to the number of ex- amples from the previous iteration, plus one. These m examples will be used in the next iteration, and so on. For example, if we start the algorithm with 20 seed instances, the following iteration will start with 21, and the next one will start with 22, and so on. This procedure is necessary in order to carefully include examples from one iteration to another and to ensure that bad instances are not passed on to the next iteration. The same procedure was applied by (Althobaiti et al., 2013). 3.2 Distant Learning For distant learning we follow the state of the art ap- proach to exploit Wikipedia for Arabic NER, as in (Althobaiti et al., 2014). Our distant learning sys- tem exploits many of Wikipedia’s features, such as anchor texts, redirects, and inter-language links, in order to automatically develop an Arabic NE anno- tated corpus, which is used later to train a state-of- the-art supervised classifier. The three steps of this approach are: 1. Classify Wikipedia articles into a set of NE types. 2. Annotate the Wikipedia text as follows: • Identify and label matching text in the title and the first sentence of each article. • Label linked phrases in the text according to the NE type of the target article. • Compile a list of alternative titles for articles and filter out ambiguous ones. • Identify and label matching phrases in the list and the Wikipedia text. 3. Filter sentences to prevent noisy sentences from being included in the corpus. We briefly explain these steps in the following sec- tions. 3.2.1 Classifying Wikipedia Articles The Wikipedia articles in the dataset need to be classified into the set of named entity types in the classification scheme. We conduct an experiment that uses simple bag-of-words features extracted from different portions of the Wikipedia document and metadata such as categories, the infobox ta- ble, and tokens from the article title and first sen- tence of the document. To improve the accuracy of document classification, tokens are distinguished based on their location in the document. There- fore, categories and infobox features are marked with suffixes to differentiate them from tokens ex- tracted from the article’s body text (Tardif et al., 2009). The feature set is represented by Term Frequency-Inverse Document Frequency (TF-IDF). 
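As an illustration of the document-classification step just described, the sketch below builds a TF-IDF bag-of-words representation in which category and infobox tokens carry a marking suffix, and feeds it to a linear classifier. The article fields, the suffix names, and the choice of a linear SVM are illustrative assumptions, not details taken from the paper.

```python
# Sketch: TF-IDF classification of Wikipedia articles into NE types.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

def article_to_text(article):
    # `article` is assumed to be a dict with these fields (illustrative only).
    body = " ".join([article["title"], article["first_sentence"], article["body"]])
    cats = " ".join(tok + "_CAT" for tok in article["category_tokens"])
    info = " ".join(tok + "_INFO" for tok in article["infobox_tokens"])
    return " ".join([body, cats, info])

def train_article_classifier(articles, labels):
    """labels: person / location / organisation / miscellaneous / other."""
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit([article_to_text(a) for a in articles], labels)
    return clf
```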
In order to develop a Wikipedia document classifier to categorise Wikipedia documents into CoNLL NE types, namely person, location, organisation, mis- cellaneous, or other, we use a set of 4,000 manually classified Wikipedia articles that are available free online (Alotaibi and Lee, 2012). 80% of the 4,000 hand-classified Wikipedia articles are used for train- ing, and 20% for evaluation. The Wikipedia docu- ment classifier that we train performs well, achiev- ing an F-score of 90%. The classifier is then used to classify all Wikipedia articles. At the end of this stage, we obtain a list of pairs containing each Wikipedia article and its NE Type in preparation for the next stage: developing the NE-tagged training corpus. 3.2.2 The Annotation Process To begin the Annotation Process we identify matching terms in the article title and the first sen- tence and then tag the matching phrases with the NE-type of the article. The system adopts partial matching where all corresponding words in the ti- tle and the first sentence should first be identified. Then, the system annotates them and all words in between (Althobaiti et al., 2014). The next step is to transform the links between Wikipedia articles into NE annotations according to the NE-type of the link target. Wikipedia also contains a fair amount of NEs without links. We follow the technique proposed by Nothman et al. (2013), which suggests inferring additional links using the aliases for each article. 246 Thus, we compile a list of alternative titles, in- cluding anchor texts and NE redirects (i.e., the linked phrases and redirected pages that refer to NE articles). It is necessary to filter the list, however, to remove noisy alternative titles, which usually appear due to (a) one-word meaningful named entities that are ambiguous when taken out of context and (b) multi-word alternative titles that contain apposition words (e.g., ‘President’, ‘Vice Minister’). To this end we use the filtering algorithm proposed by Althobaiti et al. (2014) (see Algorithm 1). In this algorithm a capitalisation probability measure for Arabic is introduced. This involves finding the English gloss for each one-word alternative name and then computing its probability of being capitalised in the English Wikipedia. In order to find the English gloss for Arabic words, Wikipedia Arabic-to-English cross-lingual links are exploited. In case the English gloss for the Arabic word could not be found using inter-language links, an online translator is used. Before translating the Arabic word, a light stemmer is used to remove prefixes and conjunctions in order to acquire the translation of the word itself without its associated affixes. The capitalisation probability is computed as follows Pr[EN] = f(EN)isCapitalised f(EN)isCapitalised+f(EN)notCapitalised where EN is the English gloss of the alterna- tive name; f(EN)isCapitalised is the number of times the English gloss EN is capitalised in the English Wikipedia; and f(EN)notCapitalised is the number of times the English gloss EN is not capitalised in the English Wikipedia. By specifying a capitalisation threshold constraint, ambiguous one-word titles are prevented from being included in the list of alternative titles. The capitalisation threshold is set to 0.75 as suggested in (Althobaiti et al., 2014). The multi-word alternative name is also omitted if any of its words belong to the list of apposition words. 3.2.3 Building The Corpus The last stage is to incorporate sentences into the final corpus. 
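One detail worth pinning down before looking at the resulting corpus is the capitalisation-probability filter used in the annotation step above. The sketch below assumes precomputed capitalised/uncapitalised counts from the English Wikipedia and a translation helper; both are hypothetical stand-ins for the resources described in the text.

```python
# Sketch of the capitalisation-probability filter for one-word alternative titles:
# Pr[EN] = f_cap(EN) / (f_cap(EN) + f_nocap(EN)); keep the title if Pr[EN] > 0.75.
def capitalisation_probability(english_gloss, cap_counts, nocap_counts):
    cap = cap_counts.get(english_gloss, 0)
    nocap = nocap_counts.get(english_gloss, 0)
    if cap + nocap == 0:
        return 0.0                 # unseen gloss: treat as unreliable
    return cap / (cap + nocap)

def keep_one_word_title(arabic_title, translate, cap_counts, nocap_counts,
                        threshold=0.75):
    # `translate` stands in for the inter-language-link lookup, falling back to
    # an online translator after light stemming, as described above.
    gloss = translate(arabic_title)
    return capitalisation_probability(gloss, cap_counts, nocap_counts) > threshold
```

The corpus that results from the full annotation and filtering pipeline is described next.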
We refer to this dataset as the Wikipedia-derived corpus (WDC). It contains 165,119 sentences of around 6 million tokens. Our model was then trained on the WDC corpus. In this Algorithm 1: Filtering Alternative Names Input: A set L = {l1, l2, . . . , ln} of all alternative names of Wikipedia articles Output: A set RL = {rl1, rl2, . . . , rln} of reliable alternative names 1 for i ← 1 to n do 2 T ← split li into tokens 3 if (T.size() >= 2) then /* All tokens of T do not belong to apposition list */ 4 if (! containAppositiveWord(T)) then 5 add li to the set RL 6 else 7 lightstem ← findLightStem(li) 8 englishgloss ← translate(lightstem) /* Compute Capitalisation Probability for English gloss */ 9 capprob ← compCapProb(englishgloss) 10 if (capprob > 0.75) then 11 add li to the set RL paper we refer to this model as the DL classifier. The WDC dataset is available online1. We also plan to make the models available to the research community. 4 Classifier Combination 4.1 The Case for Classifier Combination In what follows we use SSL to refer to our semi- supervised classifier (see Section 3.1) and DL to re- fer to our distant learning classifier (see Section 3.2). Table 1 shows the results of both classifiers when tested on the ANERcorp test set (see Section 5 for details about the dataset). NEs Classifiers Precision Recall Fβ=1 PER SSL 85.91 51.10 64.08 DL 80.01 45.11 57.69 LOC SSL 87.91 62.48 73.04 DL 75.21 67.14 70.95 ORG SSL 84.27 40.30 54.52 DL 74.10 57.02 64.45 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 1: The results of SSL and DL classifiers on the ANERcorp test set. As is apparent in Table 1, the SSL classifier tends to be more precise at the expense of recall. The dis- 1 https://sites.google.com/site/mahajalthobaiti/resources 247 https://sites.google.com/site/mahajalthobaiti/resources tant learning technique is lower in precision than the semi-supervised learning technique, but higher in re- call. Generally, preference is given to the distant su- pervision classifier in terms of F-score. The classifiers have different strengths. Our semi- supervised algorithm iterates between pattern ex- traction and candidate NEs extraction and selection. Only the candidate NEs that the classifier is most confident of are added at each iteration, which re- sults in the high precision. The SSL classifier per- forms better than distant learning in detecting NEs that appear in reliable/regular patterns. These pat- terns are usually learned easily during the training phase, either because they contain important NE in- dicators2 or because they are supported by many re- liable candidate NEs. For example, the SSL classi- fier has a high probability to successfully detect AÓ AK. ð @ “Obama” and È A g à A ̄ �� ñË “Louis van Gaal” as per- son names in the following sentences: • . . . AJ K A¢� QK. P ð QK ø YË @ AÓ AK. ð @ �� KQË @ h �Qå� “President Obama said on a visit to Britain ...” • . . . à @ YJ ��K A KñK Q��� �� � AÓ H. P YÓ È A g à A ̄ �� ñË È A�̄ “Louis van Gaal the manager of Manchester United said that ...” The patterns extracted from such sentences in the newswire domain are learned easily during the train- ing phase, as they contain good NE indicators like �� KQË @ “president” and H. P YÓ “manager”. Our distant learning method relies on Wikipedia structure and links to automatically create NE anno- tated data. 
It also depends on Wikipedia features, such as inter-language links and redirects, to handle the rich morphology of Arabic without the need to perform excessive pre-processing steps (e.g., POS- tagging, deep morphological analysis), which has a slight negative effect on the precision of the DL clas- sifier. The recall, however, of the DL classifier is high, covering as many NEs as possible in all pos- sible domains. Therefore, the DL classifier is better than the SSL classifier in detecting NEs that appear in ambiguous contexts (they can be used for differ- ent NE types) and with no obvious clues (NE indi- cators). For example, detecting ø P @Q� ̄ “Ferrari” and AJ »ñ K“Nokia” as organization names in the following sentences: 2 Also known as trigger words which help in identifying NEs within text • . . . ø P @Q� ̄ Ð Qk ø YË @ ,ñ JK P �� K A� úΫ ñ� �ñË @ Ð Y �®�K “Alonso got ahead of the Renault driver who prevented Ferrari from ... ” • �é �® ®�Ë @ Ð AÖ �ß @ à C« @ áÓ Ð ñK YªK. AJ »ñ K H. A¢ k Z Ag. “Nokia’s speech came a day after the comple- tion of the deal” The strengths and weaknesses of the SSL and DL classifiers indicates that a classifier ensemble could perform better than its individual components. 4.2 Classifier Combination Methods Classifier combination methods are suitable when we need to make the best use of the predictions of multiple classifiers to enable higher accuracy classi- fications. Dietterich (2000a) reviews many methods for constructing ensembles and explains why clas- sifier combination techniques can often gain better performance than any base classifier. Tulyakov et al. (2008) introduce various categories of classifier combinations according to different criteria includ- ing the type of the classifier’s output and the level at which the combinations operate. Several empir- ical and theoretical studies have been conducted to compare ensemble methods such as boosting, ran- domisation, and bagging techniques (Maclin and Opitz, 1997; Dietterich, 2000b; Bauer and Kohavi, 1999). Ghahramani and Kim (2003) explore a gen- eral framework for a Bayesian model combination that explicitly models the relationship between each classifier’s output and the unknown true label. As such, multiclass Bayesian Classifier Combination (BCC) models are developed to combine predictions of multiple classifiers. Their proposed method for BCC in the machine learning context is derived di- rectly from the method proposed in (Haitovsky et al., 2002) for modelling disagreement between human assessors, which in turn is an extension of (Dawid and Skene, 1979). Similar studies for modelling data annotation using a variety of methods are presented in (Carpenter, 2008; Cohn and Specia, 2013). Simp- son et al. (2013) present a variant of BCC in which they consider the use of a principled approximate Bayesian method, variational Bayes (VB), as an in- ference technique instead of using Gibbs Sampling. They also alter the model so as to use point values for hyper-parameters, instead of placing exponential hyper-priors over them. 248 The following sections detail the combination methods used in this paper to combine the minimally supervised classifiers for Arabic NER. 4.2.1 Voting Voting is the most common method in classifier combination because of its simplicity and accept- able results (Van Halteren et al., 2001; Van Erp et al., 2002). Each classifier is allowed to vote for the class of its choice. 
It is common to take the majority vote, where each base classifier is given one vote and the class with the highest number of votes is chosen. In the case of a tie, when two or more classes receive the same number of votes, a random selection is taken from among the winning classes. It is useful, however, if base classifiers are distinguished by their quality. For this purpose, weights are used to encode the importance of each base classifier (Van Erp et al., 2002). Equal voting assumes that all classifiers have the same quality (Van Halteren et al., 2001). Weighted voting, on the other hand, gives more weight to classifiers of better quality. So, each classifier is weighted according to its overall precision, or its precision and recall on the class it suggests. Formally, given K classifiers, a widely used combination scheme is the linear interpolation of the classifiers' class probability distributions:

P(C \mid S_1^K(w)) = \sum_{k=1}^{K} P_k(C \mid S_k(w)) \cdot \lambda_k(w)

where P_k(C \mid S_k(w)) is an estimation of the probability that the correct classification is C given S_k(w), the class for the word w as suggested by classifier k, and \lambda_k(w) is the weight that specifies the importance given to each classifier k in the combination. P_k(C \mid S_k(w)) is computed as

P_k(C \mid S_k(w)) = \begin{cases} 1, & \text{if } S_k(w) = C \\ 0, & \text{otherwise} \end{cases}

For equal voting, each classifier should have the same weight (e.g., \lambda_k(w) = 1/K). In the case of weighted voting, the weight associated with each classifier can be computed from its precision and/or recall, as illustrated above. 4.2.2 Independent Bayesian Classifier Combination (IBCC) Using a Bayesian approach to classifier combination (BCC) provides a mathematical combination framework in which many classifiers, with various distributions and training features, can be combined to provide more accurate information. This framework explicitly models the relationship between each classifier's output and the unknown true label (Levenberg et al., 2014). This section describes the Bayesian approach to classifier combination adopted in this paper, which, like the work of Levenberg et al. (2014), is based on Simpson et al.'s (2013) simplification of the Ghahramani and Kim (2003) model. For the i-th data point, the true label t_i is assumed to be generated by a multinomial distribution with parameter \delta: p(t_i = j \mid \delta) = \delta_j, which models the class proportions. True labels may take values t_i = 1 \dots J, where J is the number of true classes. It is also assumed that there are K base classifiers. The outputs of the classifiers are assumed to be discrete with values l = 1 \dots L, where L is the number of possible outputs. The output c_i^{(k)} of classifier k is assumed to be generated by a multinomial distribution with parameters \pi_j^{(k)}:

p(c_i^{(k)} = l \mid t_i = j, \pi_j^{(k)}) = \pi_{j,l}^{(k)}

where \pi^{(k)} is the confusion matrix for classifier k, which quantifies the decision-making abilities of each base classifier. As in Simpson et al.'s (2013) study, we assume that the parameters \pi_j^{(k)} and \delta have Dirichlet prior distributions with hyper-parameters \alpha_{0,j}^{(k)} = [\alpha_{0,j1}^{(k)}, \alpha_{0,j2}^{(k)}, \dots, \alpha_{0,jL}^{(k)}] and \nu = [\nu_{0,1}, \nu_{0,2}, \dots, \nu_{0,J}], respectively. Given the observed class labels and based on the above prior, the joint distribution over all variables for the IBCC model is

p(\delta, \Pi, t, c \mid A_0, \nu) = \prod_{i=1}^{I} \Big\{ \delta_{t_i} \prod_{k=1}^{K} \pi_{t_i, c_i^{(k)}}^{(k)} \Big\} \, p(\delta \mid \nu) \, p(\Pi \mid A_0),

where \Pi = \{\pi_j^{(k)} \mid j = 1 \dots J, k = 1 \dots K\} and A_0 = \{\alpha_{0,j}^{(k)} \mid j = 1 \dots J, k = 1 \dots K\}.
The conditional probability of a test data point t_i being assigned class j is given by

p(t_i = j) = \frac{\rho_{ij}}{\sum_{y=1}^{J} \rho_{iy}}, \qquad \text{where } \rho_{ij} = \delta_j \prod_{k=1}^{K} \pi_{j, c_i^{(k)}}^{(k)}.

In our implementation we used point values for A_0, as in (Simpson et al., 2013). The values of the hyper-parameters A_0 offered a natural method to include any prior knowledge. Thus, they can be regarded as pseudo-counts of prior observations, and they can be chosen to represent any prior level of uncertainty in the confusion matrices, \Pi. Our inference technique for the unknown variables (\delta, \pi, and t) was Gibbs sampling, as in (Ghahramani and Kim, 2003; Simpson et al., 2013). Figure 2 shows the directed graphical model for IBCC. The c_i^{(k)} represent observed values, circular nodes are variables with distributions, and square nodes are variables instantiated with point values.

Figure 2: The directed graph of IBCC (plates over classifiers k = 1, 2, ..., K and data points i = 1, 2, ..., I; hyper-parameters \nu_0 and \alpha_0^{(k)}, variables \delta, \pi^{(k)}, t_i, and observed outputs c_i^{(k)}).

5 Data In this section, we describe the two datasets we used: • Validation set (also known as a development set), composed of NEWS + BBCNEWS: 90% of this dataset is used to estimate the weight of each base classifier and 10% is used to perform error analysis. • Test set (ANERcorp test set): this dataset is used to evaluate the different classifier combination methods. The validation set is composed of two datasets: NEWS and BBCNEWS. The NEWS set contains around 15k tokens collected by Darwish (2013) from the RSS feed of the Arabic (Egypt) version of news.google.com from October 2012. We created the BBCNEWS corpus by collecting a representative sample of news from the BBC in May 2014. It contains around 3k tokens and covers different types of news such as politics, economics, and entertainment. The ANERcorp test set makes up 20% of the whole ANERcorp set. The ANERcorp set is a newswire corpus built and manually tagged especially for the Arabic NER task by Benajiba et al. (2007a) and contains around 150k tokens. This test set is commonly used in the Arabic NER literature to evaluate supervised classifiers (Benajiba and Rosso, 2008; Abdul-Hamid and Darwish, 2010; Abdallah et al., 2012; Oudah and Shaalan, 2012) and minimally-supervised classifiers (Alotaibi and Lee, 2013; Althobaiti et al., 2013; Althobaiti et al., 2014), which allows us to review the performance of the combined classifiers and compare it to the performance of each base classifier. 6 Experimental Analysis 6.1 Experimental Setup In the IBCC model, the validation data was used as known t_i to ground the estimates of the model parameters. The hyper-parameters were set as \alpha_j^{(k)} = 1 and \nu_j = 1 (Kim and Ghahramani, 2012; Levenberg et al., 2014). The initial values for the random variables were set as follows: (a) the class proportion \delta was initialised to the result of counting t_i, and (b) the confusion matrix \pi was initialised to the result of counting t_i and the output of each classifier c^{(k)}. Gibbs sampling was run well past stability (i.e., 1000 iterations). Stability was actually reached in approximately 100 iterations. All parameters required in the voting methods were specified using the validation set. We examined two different voting methods: equal voting and weighted voting. In the case of equal voting, each classifier was given an equal weight, 1/K, where K was the number of classifiers to be combined. In weighted voting, total precision was used in order to give preference to classifiers with good quality.
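To make the setup above concrete, here are two short sketches: first the voting combination of Section 4.2.1, and then a simplified Gibbs sampler for the IBCC model. Both are illustrative implementations under the stated assumptions (symmetric Dirichlet priors with α = 1 and ν = 1, known labels on the validation portion), not the authors' code.

```python
# Sketch of equal/weighted voting for a single word.
# `predictions`: one class label per base classifier;
# `weights`: the lambda_k values (e.g., [1/K]*K for equal voting, or
# per-classifier precision for weighted voting).
from collections import defaultdict
import random

def vote(predictions, weights):
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w            # P_k(C|S_k(w)) is 1 for the predicted class
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return random.choice(winners)     # random tie-break, as described above
```

For IBCC, the sampler below alternates between drawing the class proportions δ, the confusion matrices π(k), and the unknown true labels t_i, with validation tokens held fixed at their known labels; a sketch only.

```python
# Simplified Gibbs sampler for IBCC (sketch).
# C: (I, K) integer array of base-classifier outputs in {0..L-1};
# known: dict {i: true_label} for validation points; J: number of true classes.
import numpy as np

def ibcc_gibbs(C, known, J, n_iter=1000, burn_in=100, alpha0=1.0, nu0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    I, K = C.shape
    L = int(C.max()) + 1
    # Initialise true labels: known where available, random elsewhere.
    t = np.array([known.get(i, rng.integers(J)) for i in range(I)])
    post = np.zeros((I, J))
    for it in range(n_iter):
        # delta ~ Dirichlet(nu0 + class counts of the current labels).
        delta = rng.dirichlet(nu0 + np.bincount(t, minlength=J))
        # pi[k, j] ~ Dirichlet(alpha0 + confusion counts for classifier k, class j).
        pi = np.empty((K, J, L))
        for k in range(K):
            for j in range(J):
                counts = np.bincount(C[t == j, k], minlength=L)
                pi[k, j] = rng.dirichlet(alpha0 + counts)
        # Resample the unknown true labels.
        for i in range(I):
            if i in known:
                continue
            rho = delta * np.prod(pi[np.arange(K), :, C[i]], axis=0)
            rho /= rho.sum()
            t[i] = rng.choice(J, p=rho)
        if it >= burn_in:                 # accumulate posterior label samples
            post[np.arange(I), t] += 1
    return post / post.sum(axis=1, keepdims=True)
```

Averaging the sampled labels after burn-in gives a per-token class posterior from which the final tag can be taken.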
250 6.2 Results and Discussion 6.2.1 A Simple Baseline Combined Classifier A proposed combined classifier simply and straightforwardly makes decisions based on the agreed decisions of the base classifiers, namely the SSL classifier and DL classifier. That is, if the base classifiers agree on the NE type of a certain word, then it is annotated by an agreed NE type. In the case of disagreement, the word is considered not named entity. Table 2 shows the results of this combined classifier, which is considered a baseline in this pa- per. Precision Recall Fβ=1 Person 97.31 24.69 39.39 Location 98.35 40.01 56.88 Organisation 97.38 33.2 49.52 Overall 97.68 32.63 48.92 Table 2: The results of the baseline The results of the combined classifier shows very high precision, which indicates that both base clas- sifiers are mostly accurate. The base classifiers also commit different errors that are evident in the low recall. The accuracy and diversity of the single clas- sifiers are the main conditions for a combined clas- sifier to have better accuracy than any of its com- ponents (Dietterich, 2000a). Therefore, in the next section we take into consideration various classifier combination methods in order to aggregate the best decisions of SSL and DL classifiers, and to improve overall performance. 6.2.2 Combined Classifiers: Classifier Combination Methods The SSL and DL classifiers are trained with two different algorithms using different training data. The SSL classifier is trained on ANERcorp training data, while the DL classifier is trained on a corpus automatically derived from Arabic Wikipedia, as ex- plained in Section 3.1 and 3.2. We combine the SSL and DL classifiers using the three classifier combination methods, namely equal voting, weighted voting, and IBCC. Table 3 shows the results of these classifier combination methods. The IBCC scheme outperforms all voting techniques and base classifiers in terms of F-score. Regard- ing precision, voting techniques show the highest scores. However, the high precision is accompanied by a reduction in recall for both voting methods. The IBCC combination method also has relatively high precision compared to the precision of base classi- fiers. Much better recall is registered for IBCC, but it is still low. NEs Combination Methods Precision Recall Fβ=1 PER Equal Voting 79.99 41.88 54.97 Weighted Voting 80.15 44.24 57.01 IBCC 77.87 63.86 70.17 LOC Equal Voting 86.87 30.66 45.32 Weighted Voting 87.48 30.23 44.93 IBCC 81.52 59.86 69.03 ORG Equal Voting 97.01 29.97 45.79 Weighted Voting 98.11 30.98 47.09 IBCC 95.44 34.31 50.47 Overall Equal Voting 87.96 34.17 49.22 Weighted Voting 88.58 35.15 50.33 IBCC 84.94 52.68 65.03 NEs Base Classifiers Precision Recall Fβ=1 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 3: The performances of various combination meth- ods. 6.2.3 Combined Classifiers: Restriction of the Combination Process An error analysis of the validation set shows that 10.01% of the NEs were correctly detected by the semi-supervised classifier, but considered not NEs by the distant learning classifier. At the same time, the distant learning classifier managed to correctly detect 25.44% of the NEs that were considered not NEs by the semi-supervised classifier. We also no- ticed that false positive rates, i.e. the possibility of considering a word NE when it is actually not NE, are very low (0.66% and 2.45% for the semi- supervised and distant learning classifiers respec- tively). 
These low false positive rates and the high percentage of the NEs that are detected and missed by the two classifiers in a mutually exclusive way can be exploited to obtain better results, more specif- ically, to increase recall without negatively affect- ing precision. Therefore, we restricted the combi- 251 nation process to only include situations where the base classifiers agree or disagree on the NE type of a certain word. The combination process is ignored in cases where the base classifiers only disagree on detecting NEs. For example, if the base classifiers disagree on whether a certain word is an NE or not, the word is automatically considered an NE. Figure 3 provides some examples that illustrate the restric- tions we applied to the combination process. The annotations in the examples are based on the CoNLL 2003 annotation guidelines (Chinchor et al., 1999). Predictions of SSL Classifier Predictions of DL Classifier B-PER B-LOC O B-LOC B-ORG O B-PER B-PER Apply Combination Method B-LOC B-ORG Apply Combination Method Figure 3: Examples of restricting the combination pro- cess. Restricting the combination process in this way increases recall without negatively affecting the pre- cision, as seen in Table 4. The increase in recall makes the overall F-score for all combination meth- ods higher than those of base classifiers. This way of using the IBCC model results in a performance level that is superior to all of the individual clas- sifiers and other voting-based combined classifiers. Therefore, the IBCC model leads to a 12% increase in the performance of the best base classifier, while voting methods increase the performance by around 7% - 10%. These results highlight the role of re- stricting the combination, which affects the perfor- mance of combination methods and gives more con- trol over how and when the predictions of base clas- sifiers should be combined. 6.2.4 Comparing Combined Classifiers: Statistical Significance of Results We tested whether the difference in performance between the three classifier combination methods - equal voting, weighted voting, and IBCC - is sig- nificant using two different statistical tests over the results of these combination methods on an ANER- corp test set. The alpha level of 0.01 was used as a significance criterion for all statistical tests. First, We ran a non-parametric sign test. The small p- value (p � 0.01) for each pair of the three combina- NEs Combination Methods Precision Recall Fβ=1 PER Equal Voting 74.46 61.88 67.59 Weighted Voting 77.77 63.50 69.91 IBCC 77.88 64.56 70.60 LOC Equal Voting 74.04 71.36 72.68 Weighted Voting 74.05 73.70 73.86 IBCC 76.20 75.91 76.05 ORG Equal Voting 76.01 63.97 69.47 Weighted Voting 76.30 66.60 71.12 IBCC 78.91 66.65 72.26 Overall Equal Voting 74.84 65.74 69.99 Weighted Voting 76.04 67.93 71.76 IBCC 77.66 69.04 73.10 NEs Base Classifiers Precision Recall Fβ=1 Overall SSL 86.03 51.29 64.27 DL 76.44 56.42 64.92 Table 4: The performances of various combination meth- ods when restricting the combination process. tion methods, as seen in Table 5, suggests that these methods are significantly different. The only com- parison where no significance was found is equal voting vs. weighted voting, when we used them to combine the data without any restrictions (p = 0.3394). 
Combination Methods (Without Restriction) Equal Voting Weighted Voting IBCC Equal Voting Weighted Voting 0.3394 IBCC <2.2E-16 <2.2E-16 Combination Methods (With Restriction) Equal Voting Weighted Voting IBCC Equal Voting Weighted Voting 1.78E-07 IBCC <2.2E-16 1.97E-06 Table 5: The sign test results (exact p values) for the pair- wise comparisons of the combination methods. Second, we used a bootstrap sampling (Efron and Tibshirani, 1994), which is becoming the de facto standards in NLP (Søgaard et al., 2014). Table 6 compares each pair of the three combination meth- ods using a bootstrap sampling over documents with 10,000 replicates. It shows the p-values and confi- dence intervals of the difference between means. 252 Combination With Restriction Combination Methods Comparison p-value [95% CI] Weighted Voting, Equal Voting 0.000 [0.270, 0.600] IBCC, Equal Voting 0.000 [0.539, 0.896] IBCC, Weighted Voting 0.000 [0.157, 0.426] Combination Without Restriction Combination Methods Comparison p-value [95% CI] Weighted Voting, Equal Voting 0.508 [-0.365, 0.349] IBCC, Equal Voting 0.000 [4.800, 6.122] IBCC, Weighted Voting 0.000 [4.783, 6.130] Table 6: The bootstrap test results (p-values and CI) for the pairwise comparisons of the combination methods. The differences in performance between almost all the three methods of combination are highly sig- nificant. The one exception is the comparison be- tween equal voting and weighted voting, when they are used as a combination method without restric- tion, which shows a non-significant difference (p- value = 0.508, CI = -0.365 to 0.349). Generally, the IBCC scheme performs signifi- cantly better than voting-based combination meth- ods whether we impose restrictions on the combina- tion process or not, as can be seen in Table 3 and Table 4. 7 Conclusion Major advances over the past decade have occurred in Arabic NER with regard to utilising various su- pervised systems, exploring different features, and producing manually annotated corpora that mostly cover the standard set of NE types. More effort and time for additional manual annotations are re- quired when expanding the set of NE types, or ex- porting NE classifiers to new domains. This has mo- tivated research in minimally supervised methods, such as semi-supervised learning and distant learn- ing, but the performance of such methods is lower than that achieved by supervised methods. How- ever, semi-supervised methods and distant learning tend to have different strengths, which suggests that better results may be obtained by combining these methods. Therefore, we trained two classifiers based on distant learning and semi-supervision techniques, and then combined them using a variety of classifier combination schemes. Our main contributions in- clude the following: • We presented a novel approach to Arabic NER using a combination of semi-supervised learning and distant supervision. • We used the Independent Bayesian Classifier Combination (IBCC) scheme for NER, and com- pared it to traditional voting methods. • We introduced the classifier combination restric- tion as a means of controlling how and when the predictions of base classifiers should be com- bined. This research demonstrated that combining the two minimal supervision approaches using various clas- sifier combination methods leads to better results for NER. 
The use of IBCC improves the performance by 8 percentage points over the best base classi- fier, whereas the improvement in the performance when using voting methods is only 4 to 6 percent- age points. Although all combination methods re- sult in an accurate classification, the IBCC model achieves better recall than other traditional combi- nation methods. Our experiments also showed how restricting the combination process can increase the recall ability of all the combination methods without negatively affecting the precision. The approach we proposed in this paper can be easily adapted to new NE types and different do- mains without the need for human intervention. In addition, there are many ways to restrict the combi- nation process according to the applications’ prefer- ences, either producing high accuracy or recall. For example, we may obtain a highly accurate combined classifier if we do not combine the predictions of all base classifiers for a certain word and automatically consider it not NE when one of the base classifier considers this word not NE. References Sherief Abdallah, Khaled Shaalan, and Muhammad Shoaib. 2012. Integrating rule-based system with classification for arabic named entity recognition. In Computational Linguistics and Intelligent Text Pro- cessing, pages 311–322. Springer. Samir AbdelRahman, Mohamed Elarnaoty, Marwa Magdy, and Aly Fahmy. 2010. Integrated machine 253 learning techniques for arabic named entity recogni- tion. IJCSI, 7:27–36. Ahmed Abdul-Hamid and Kareem Darwish. 2010. Sim- plified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop, pages 110–115. Association for Computational Lin- guistics. Steven Abney. 2010. Semisupervised learning for com- putational linguistics. CRC Press. Fahd Alotaibi and Mark Lee. 2012. Mapping Ara- bic Wikipedia into the named entities taxonomy. In Proceedings of COLING 2012: Posters, pages 43–52, Mumbai, India, December. The COLING 2012 Orga- nizing Committee. Fahd Alotaibi and Mark Lee. 2013. Automatically De- veloping a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia. In IJCNLP. Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio. 2013. A semi-supervised learning approach to arabic named entity recognition. In Proceedings of the Inter- national Conference Recent Advances in Natural Lan- guage Processing RANLP 2013, pages 32–40, Hissar, Bulgaria, September. INCOMA Ltd. Shoumen, BUL- GARIA. Maha Althobaiti, Udo Kruschwitz, and Massimo Poesio. 2014. Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia. In Proceedings of the Student Research Workshop at the 14th Confer- ence of the European Chapter of the Association for Computational Linguistics (EACL), pages 106–115, Gothenburg. Marco Baroni, Brian Murphy, Eduard Barbu, and Mas- simo Poesio. 2010. Strudel: A Corpus-Based Seman- tic Model Based on Properties and Types. Cognitive Science, 34(2):222–254. Eric Bauer and Ron Kohavi. 1999. An empirical comparison of voting classification algorithms: Bag- ging, boosting, and variants. Machine learning, 36(1- 2):105–139. Yassine Benajiba and Paolo Rosso. 2008. Arabic named entity recognition using conditional random fields. In Proc. of Workshop on HLT & NLP within the Arabic World, LREC, volume 8, pages 143–153. Yassine Benajiba, Paolo Rosso, and José Miguel Benedı́ruiz. 2007a. Anersys: An Arabic Named En- tity Recognition System based on Maximum Entropy. 
In Computational Linguistics and Intelligent Text Pro- cessing, pages 143–153. Springer. Yassine Benajiba, Paolo Rosso, and José Miguel Benedı́ruiz. 2007b. Anersys: An arabic named en- tity recognition system based on maximum entropy. In Computational Linguistics and Intelligent Text Pro- cessing, pages 143–153. Springer. Yassine Benajiba, Mona Diab, Paolo Rosso, et al. 2008. Arabic named entity recognition: An svm-based ap- proach. In Proceedings of 2008 Arab International Conference on Information Technology (ACIT), pages 16–18. Bob Carpenter. 2008. Multilevel bayesian models of categorical data annotation. Unpublished manuscript. Available online at http://lingpipe-blog.com/lingpipe- white-papers/, last accessed 15-March-2015. Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson. 1999. 1999 Named Entity Recognition Task Definition. MITRE and SAIC. Trevor Cohn and Lucia Specia. 2013. Modelling anno- tator bias with multi-task gaussian processes: An ap- plication to machine translation quality estimation. In ACL, pages 32–42. Kareem Darwish. 2013. Named Entity Recognition us- ing Cross-lingual Resources: Arabic as an Example. In ACL, pages 1558–1567. Alexander Philip Dawid and Allan M Skene. 1979. Max- imum likelihood estimation of observer error-rates us- ing the em algorithm. Applied statistics, pages 20–28. Thomas G. Dietterich. 2000a. Ensemble methods in ma- chine learning. In Multiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1– 15. Springer Berlin Heidelberg. Thomas G Dietterich. 2000b. An experimental compar- ison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine learning, 40(2):139–157. Bradley Efron and Robert J Tibshirani. 1994. An intro- duction to the bootstrap. CRC press. Ali Elsebai, Farid Meziane, and Fatma Zohra Belkredim. 2009. A rule based persons names arabic extraction system. Communications of the IBIMA, 11(6):53–59. Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through clas- sifier combination. In Proceedings of the seventh con- ference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 168–171. Association for Com- putational Linguistics. Zoubin Ghahramani and Hyun-Chul Kim. 2003. Bayesian classifier combination. Technical report, University College London. Y Haitovsky, A Smith, and Y Liu. 2002. Modelling dis- agreements among and within raters assessments from the bayesian point of view. In Draft. Presented at the Valencia meeting 2002. Hyun-Chul Kim and Zoubin Ghahramani. 2012. Bayesian classifier combination. In International con- ference on artificial intelligence and statistics, pages 619–627. 254 Abby Levenberg, Stephen Pulman, Karo Moilanen, Ed- win Simpson, and Stephen Roberts. 2014. Predict- ing economic indicators from web text using sentiment composition. International Journal of Computer and Communication Engineering, 3(2):109–115. Richard Maclin and David Opitz. 1997. An empiri- cal evaluation of bagging and boosting. AAAI/IAAI, 1997:546–551. Slim Mesfar. 2007. Named entity recognition for ara- bic using syntactic grammars. In Natural Language Processing and Information Systems, pages 305–316. Springer. Peter Mika, Massimiliano Ciaramita, Hugo Zaragoza, and Jordi Atserias. 2008. Learning to Tag and Tag- ging to Learn: A Case Study on Wikipedia. vol- ume 23, pages 26–33. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction with- out labeled data. 
In Proceedings of the Joint Confer- ence of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Lan- guage Processing of the AFNLP: Volume 2 - Volume 2, ACL ’09, pages 1003–1011, Stroudsburg, PA, USA. Association for Computational Linguistics. David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisti- cae Investigationes, 30(1):3–26. David Nadeau. 2007. Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision. Truc-Vien T. Nguyen and Alessandro Moschitti. 2011. End-to-end relation extraction using distant supervi- sion from external semantic repositories. In Pro- ceedings of the 49th Annual Meeting of the Associa- tion for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, pages 277–282, Stroudsburg, PA, USA. Association for Computational Linguistics. Joel Nothman, Nicky Ringland, Will Radford, Tara Mur- phy, and James R Curran. 2013. Learning multilin- gual Named Entity Recognition from Wikipedia. Ar- tificial Intelligence, 194:151–175. Mai Oudah and Khaled F Shaalan. 2012. A pipeline ara- bic named entity recognition using a hybrid approach. In COLING, pages 2159–2176. Marius Pasca, Dekang Lin, Jeffrey Bigham, Andrei Lif- chits, and Alpa Jain. 2006. Organizing and searching the world wide web of facts-step one: the one-million fact extraction challenge. In AAAI, volume 6, pages 1400–1405. Alexander E Richman and Patrick Schone. 2008. Mining Wiki Resources for Multilingual Named Entity Recog- nition. In ACL, pages 1–9. Ellen Riloff and Rosie Jones. 1999. Learning dictionar- ies for information extraction by multi-level bootstrap- ping. In AAAI, pages 474–479. Sriparna Saha and Asif Ekbal. 2013. Combining mul- tiple classifiers using vote based classifier ensemble technique for named entity recognition. Data & Knowledge Engineering, 85:15–39. Satoshi Sekine et al. 1998. NYU: Description of the Japanese NE system used for MET-2. In Proceed- ings of the Seventh Message Understanding Confer- ence (MUC-7), volume 17. Khaled Shaalan and Hafsa Raza. 2009. Nera: Named entity recognition for arabic. Journal of the Ameri- can Society for Information Science and Technology, 60(8):1652–1663. Edwin Simpson, Stephen Roberts, Ioannis Psorakis, and Arfon Smith. 2013. Dynamic bayesian combination of multiple imperfect classifiers. In Decision Making and Imperfection, pages 1–35. Springer. Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martinez. 2014. Whats in a p-value in nlp? In Proceedings of the eighteenth conference on computational natural language learning (CONLL14), pages 1–10. Sam Tardif, James R. Curran, and Tara Murphy. 2009. Improved Text Categorisation for Wikipedia Named Entities. In Proceedings of the Australasian Language Technology Association Workshop, pages 104–108. Sergey Tulyakov, Stefan Jaeger, Venu Govindaraju, and David Doermann. 2008. Review of classifier combi- nation methods. In Machine Learning in Document Analysis and Recognition, pages 361–386. Springer. Merijn Van Erp, Louis Vuurpijl, and Lambert Schomaker. 2002. An overview and comparison of voting methods for pattern recognition. In Eighth International Work- shop on Frontiers in Handwriting Recognition, pages 195–200. IEEE. Hans Van Halteren, Walter Daelemans, and Jakub Za- vrel. 2001. Improving accuracy in word class tagging through the combination of machine learning systems. Computational linguistics, 27(2):199–229. 
Wajdi Zaghouani. 2014. Critical Survey of the Freely Available Arabic Corpora. In Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools, pages 1–8, Reykjavik, Iceland. Tong Zhang, Fred Damerau, and David Johnson. 2002. Text chunking based on a generalization of winnow. The Journal of Machine Learning Research, 2:615–637.

work_2tflktf2hvbkja6rltueqe3w4q ---- 2018 International Conference on Sensor Network and Computer Engineering (ICSNCE 2018) Research on Vehicle Detection Method Based on Background Modeling Zhichao Lian School of Computer Science and Engineering Xi'an Technological University Xi'an 710032, China e-mail: 965941167@qq.com Zhongsheng Wang School of Computer Science and Engineering Xi'an Technological University Xi'an 710032, China e-mail: wangzhongsheng@xatu.edu.cn

Abstract—This paper mainly studies the background difference method in the field of intelligent traffic. It proposes a background modeling method based on frame difference and compares it with the statistical average background model and the Gaussian distribution background modeling method; the vehicle contour is then obtained by a morphological method. Finally, experiments were carried out on four normal road traffic surveillance videos, and the effective detection rate of the method used in this paper reaches 93.75%, which gives it a certain degree of practical applicability. The algorithm still needs to be tested further under more complex weather and road conditions.

Keywords-Vehicle Detection; Background Modeling; Inter-frame Difference; Morphological Method

I. INTRODUCTION In practical applications, a vehicle detection method with fast response, high accuracy, and good adaptability is a key part of intelligent traffic detection and management. Detecting moving objects in video is affected by many complicated conditions. Three commonly used methods for detecting moving vehicles are the difference method, the optical flow method, and the background difference method. The difference method includes the inter-frame difference method and the time difference method. The inter-frame difference method is fast, its algorithm is simple, and it can be used in scenes with high real-time requirements. The time difference method is suitable for dynamically changing scenes but does not completely segment moving objects. The optical flow method performs poorly in terms of real-time use and practicality, and it is difficult for it to meet the requirements of real-time detection of moving vehicles. The background difference method gives good results in both speed and detection quality when the camera is relatively stable. The key questions for the background difference method are how to set up the background and how to update it dynamically in real time. This article uses the background difference method.

II. COMMONLY USED BACKGROUND MODELING UPDATE MODELS In monitoring applications, the background difference method needs to establish a background reference frame. Establishing an accurate and robust background model is the key to the system; the accuracy of this reference frame directly affects the output. The commonly used background models are the statistical average method and the Gaussian distribution background model.

A. Statistical average method background model The statistical average method, also called the mean method, is essentially a statistical filtering idea.
Over a period of time, the collected images are added together and the average value is taken as the reference background model. That is, the gray-level average of N frames in the image sequence is used as the estimate of the background image, in order to weaken the interference of moving objects with the background. The specific calculation is shown in formula (1):

Avg_k = \frac{1}{N}(f_k + f_{k-1} + \dots + f_{k-N+1})    (1)

where Avg_k is the background model established when the system acquires frame k, N is the number of frames being averaged, and f_k, f_{k-1}, ..., f_{k-N+1} are the consecutive frames of the sequence. The statistical average method is simple and fast, but it easily causes noise to accumulate and mix in. The method is more suitable for scenes with a small number of continuously moving objects, where the background is visible most of the time. When there are many moving objects, especially slowly moving ones, the estimated background deviates considerably.

B. Gaussian Distribution Background Model The Gaussian distribution background model was first proposed by N. Friedman et al. and is divided into background models with a single Gaussian distribution and with a mixture of Gaussian distributions. The single Gaussian model regards the change in the gray value of each pixel of the background image as a Gaussian random process and establishes a Gaussian model for each pixel, which is maintained by continuously updating the Gaussian background model. The mixed Gaussian model uses K (typically 3–5) Gaussian components to characterize the features of each pixel in the image. After a new image is acquired, the mixed Gaussian model is updated, and each pixel of the current image is matched against its Gaussian mixture model to determine whether it belongs to the background or the foreground. This section focuses on the mixed Gaussian background modeling method. This modeling method represents the background using statistical information, such as the probability density of a large number of sample values collected over a long time, and uses a statistical difference test (such as the 3σ principle) to judge each target pixel. This method can model complex dynamic backgrounds, at the cost of a large amount of computation. Suppose that any pixel (x, y) in the background obeys a model composed of K Gaussian distributions, as shown in formula (2):

P(I_{x,y}) = \sum_{j=1}^{K} \omega_{x,y,j} \cdot \eta(I_{x,y}, \mu_{x,y,j}, \sigma_{x,y,j})    (2)

where \eta(I_{x,y}, \mu_{x,y,j}, \sigma_{x,y,j}) is the j-th Gaussian probability density with mean \mu_{x,y,j} and variance \sigma_{x,y,j}, and \omega_{x,y,j} is the weight of the j-th Gaussian distribution. The pixel value observed at the current moment is compared with the current K Gaussian distributions in descending order of weight to obtain the best match. If there is no match, the pixel is a foreground point; otherwise it is a background point. The Gaussian distribution background model requires a large amount of calculation, stores many parameters, and takes a long time, which is not conducive to practical application.
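For reference, both of the classical models reviewed in this section have compact implementations. The sketch below shows a running mean over the last N frames in the spirit of formula (1), and uses OpenCV's built-in MOG2 background subtractor for the mixed Gaussian model; the parameter values shown are illustrative defaults, not the paper's settings.

```python
# Sketch of the two classical background models discussed above.
import numpy as np
import cv2

def average_background(frames):
    """Formula (1): the mean of the last N grayscale frames as the background."""
    stack = np.stack([f.astype(np.float32) for f in frames])
    return stack.mean(axis=0).astype(np.uint8)

# Mixed Gaussian background model (a few Gaussian components per pixel),
# using OpenCV's MOG2 implementation.
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                          detectShadows=False)
# For each new frame: fg_mask = mog2.apply(frame); bg = mog2.getBackgroundImage()
```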
III. IMPROVED BACKGROUND MODELING METHOD This paper proposes an adaptive background update model based on the inter-frame difference method. The method uses the background of the current frame and the background of the previous frame in the video sequence to perform a weighted average and so update the background. The specific update rules are given in formulas (3) and (4):

Diff(x, y, t) = | I(x, y, t) − B(x, y, t) |    (3)

BOM(x, y, t) = { 1, if Diff(x, y, t) > T_h;  0, otherwise }    (4)

where I(x, y, t) and B(x, y, t) are the current frame containing the moving object at time t and the updated current background image, and T_h is a threshold, taken as the gray level to the right of the maximum peak of the difference-image histogram at which the histogram falls to 1/10 of that peak. Equation (5) gives the motion template Stencil(x, y, t) at time t, obtained from two adjacent spaced images; it is used as a mask to determine which pixels of the current frame are used to update the current background. Formula (6) gives the instantaneous background, and the background is then updated using a weighted average of the instantaneous background and the current background, as shown in equation (7):

Stencil(x, y, t) = BOM(x, y, t) & BOM(x, y, t−1)    (5)

B_temp(x, y, t) = { B(x, y, t−1), if Stencil(x, y, t) = 0;  I(x, y, t), if Stencil(x, y, t) = 1 }    (6)

B(x, y, t) = α · B_temp(x, y, t) + (1 − α) · B(x, y, t−1)    (7)

Here α is the update coefficient; its value is positively correlated with the update speed. The larger α is, the faster the update, so that changes in external lighting can be captured in time and the current background stays closer to the external conditions of the current frame. The smaller α is, the slower the update rate, and the acquired current background will show some deviation. After the background image is extracted, the current motion area is segmented using the background difference method. Using the threshold parameter in expression (8), the image is binarized and segmented to obtain the foreground binary image:

D(x, y, t) = { 1, if Diff(x, y, t) > threshold;  0, otherwise }    (8)

The threshold in this formula should be selected carefully, based on the experimental results, so as to filter out the residual background.
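The update rules in formulas (3)–(8) translate almost line for line into array operations. The sketch below is one possible NumPy rendering for grayscale frames; the default parameter values are the ones quoted for this algorithm in the experiments (Th = 15, α = 0.2, threshold = 12), and the function names are illustrative.

```python
# Sketch of the proposed frame-difference-based background update (formulas 3-8).
import numpy as np

def update_background(frame, prev_bom, background, Th=15, alpha=0.2):
    """One update step; `background` is B(x,y,t-1) and `prev_bom` is BOM(x,y,t-1)."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))    # (3)
    bom = (diff > Th).astype(np.uint8)                                     # (4)
    stencil = bom & prev_bom                                               # (5)
    # (6): instantaneous background: current pixel where Stencil = 1,
    # previous background where Stencil = 0.
    b_temp = np.where(stencil == 1, frame, background)
    new_bg = (alpha * b_temp + (1 - alpha) * background).astype(np.uint8)  # (7)
    return new_bg, bom

def segment_foreground(frame, background, threshold=12):
    """Formula (8): binarized foreground mask from the background difference."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```

In a full pipeline, the BOM mask of each processed frame is carried forward so that the Stencil of equation (5) can be formed from two spaced frames.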
[Figure 1. Comparison of experimental results with the three methods. Panel (a) shows the original image frame; columns (b), (c) and (d) correspond to the statistical average method, the mixed Gaussian method and this article's algorithm, and rows (1)–(5) show the resulting background frame, the background/frame difference, the binary operation, the open operation and the close operation, respectively.]

It can be seen that, in the background frame extraction process, the background extracted by the statistical average method [Fig. 1 (b-1)] is blurred and affected by the shake of the video camera, and its effect is the worst. The parameters of the mixed Gaussian model are set as follows: the number of pixel models is 5, the initial variance is 30, and the learning rate of the model weights is α = 0.005, T = 0.7. The resulting background [Fig. 1 (c-1)] is clearer, but the background extracted at the lower left corner of the screen is still somewhat blurred. The corresponding parameter settings of this algorithm are: Th = 15, α = 0.2, threshold = 12. The background [Fig. 1 (d-1)] obtained by the method proposed in this paper is of good quality and closer to the real background. Finally, the result images are obtained after difference, binary processing and morphological processing. The statistical average method [Fig. 1 (b-5)] is noisy, and the extraction result is not clear. The result obtained by the Gaussian method [Fig. 1 (c-5)] works well, but there is still noise. The result obtained by this algorithm [Fig. 1 (d-5)] has the best effect: the noise is very small, the extracted vehicle is clearer and the connectivity is better.

B. Comparison of Algorithm Performance
The final conclusion of the comparison experiment is drawn not only from the experimental results but also from the performance side. Table I compares the performance of the three methods, including the time used for the entire detection process, the memory footprint and the CPU usage. It can be seen that, compared with the other two methods, the algorithm proposed in this paper consumes less time, uses less memory and has a lower CPU occupancy rate.

TABLE I. COMPARISON OF PERFORMANCE OF THREE MODELING METHODS
Background model          Time-consuming (seconds)   Average memory size (MB)   Average CPU usage
Statistical average       149                        169                        65%
Gaussian distribution     40                         99                         35%
Method of this article    25                         60                         32%

V. CONCLUSION
This paper focuses on the vehicle detection method based on background difference in the intelligent traffic field and proposes a background modeling method based on an adaptive inter-frame difference method. Experiments were designed to compare the proposed method with the commonly used averaging method and the Gaussian distribution model method by comparing the background images obtained by the three modeling methods. At the same time, the running performance of the algorithm is analyzed in terms of time consumption, memory occupancy and CPU occupancy, which verifies that the vehicle detection algorithm proposed in this paper can extract the background more accurately.
The morphological method is used to process the differential binary image to eliminate noise, fill in voids and so on, completing the detection step. Experiments prove the effectiveness and real-time performance of the algorithm for video-based moving vehicle detection. However, the performance of this algorithm under complex weather conditions and complex road conditions needs to be further improved.

work_2timpskoaza73ntj374tkxmaty ---- Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective
Submitted 11 May 2016. Accepted 23 September 2016. Published 24 October 2016. Corresponding author: Nikolaos Aletras, nikos.aletras@gmail.com. Academic editor: Lexing Xie. Additional Information and Declarations can be found on page 16. DOI 10.7717/peerj-cs.93. Copyright 2016 Aletras et al. Distributed under Creative Commons CC-BY 4.0. OPEN ACCESS.

Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective
Nikolaos Aletras1,2, Dimitrios Tsarapatsanis3, Daniel Preoţiuc-Pietro4,5 and Vasileios Lampos2
1 Amazon.com, Cambridge, United Kingdom
2 Department of Computer Science, University College London, University of London, London, United Kingdom
3 School of Law, University of Sheffield, Sheffield, United Kingdom
4 Positive Psychology Center, University of Pennsylvania, Philadelphia, United States
5 Computer & Information Science, University of Pennsylvania, Philadelphia, United States

ABSTRACT
Recent advances in Natural Language Processing and Machine Learning provide us with the tools to build predictive models that can be used to unveil patterns driving judicial decisions. This can be useful, for both lawyers and judges, as an assisting tool to rapidly identify cases and extract patterns which lead to certain decisions. This paper presents the first systematic study on predicting the outcome of cases tried by the European Court of Human Rights based solely on textual content. We formulate a binary classification task where the input of our classifiers is the textual content extracted from a case and the target output is the actual judgment as to whether there has been a violation of an article of the Convention on Human Rights. Textual information is represented using contiguous word sequences, i.e., N-grams, and topics. Our models can predict the court's decisions with a strong accuracy (79% on average). Our empirical analysis indicates that the formal facts of a case are the most important predictive factor. This is consistent with the theory of legal realism suggesting that judicial decision-making is significantly affected by the stimulus of the facts. We also observe that the topical content of a case is another important feature in this classification task and explore this relationship further by conducting a qualitative analysis.

Subjects: Artificial Intelligence, Computational Linguistics, Data Mining and Machine Learning, Data Science, Natural Language and Speech
Keywords: Natural Language Processing, Text Mining, Legal Science, Machine Learning, Artificial Intelligence, Judicial decisions

INTRODUCTION
In his prescient work on investigating the potential use of information technology in the legal domain, Lawlor surmised that computers would one day become able to analyse and predict the outcomes of judicial decisions (Lawlor, 1963). According to Lawlor, reliable prediction of the activity of judges would depend on a scientific understanding of the ways that the law and the facts impact on the relevant decision-makers, i.e., the judges. More than fifty years later, the advances in Natural Language Processing (NLP) and Machine Learning (ML) provide us with the tools to automatically analyse legal materials, so as to build successful predictive models of judicial outcomes.
2:e93; DOI 10.7717/peerj-cs.93 1An amicus curiae (friend of the court) is a person or organisation that offers testimony before the Court in the context of a particular case without being a formal party to the proceedings. In this paper, our particular focus is on the automatic analysis of cases of the European Court of Human Rights (ECtHR or Court). The ECtHR is an international court that rules on individual or, much more rarely, State applications alleging violations by some State Party of the civil and political rights set out in the European Convention on Human Rights (ECHR or Convention). Our task is to predict whether a particular Article of the Convention has been violated, given textual evidence extracted from a case, which comprises of specific parts pertaining to the facts, the relevant applicable law and the arguments presented by the parties involved. Our main hypotheses are that (1) the textual content, and (2) the different parts of a case are important factors that influence the outcome reached by the Court. These hypotheses are corroborated by the results. Our work lends some initial plausibility to a text-based approach with regard to ex ante prediction of ECtHR outcomes on the assumption that the text extracted from published judgments of the Court bears a sufficient number of similarities with, and can therefore stand as a (crude) proxy for, applications lodged with the Court as well as for briefs submitted by parties in pending cases. We submit, though, that full acceptance of that reasonable assumption necessitates more empirical corroboration. Be that as it may, our more general aim is to work under this assumption, thus placing our work within the larger context of ongoing empirical research in the theory of adjudication about the determinants of judicial decision-making. Accordingly, in the discussion we highlight ways in which automatically predicting the outcomes of ECtHR cases could potentially provide insights on whether judges follow a so-called legal model (Grey, 1983) of decision making or their behavior conforms to the legal realists’ theorization (Leiter, 2007), according to which judges primarily decide cases by responding to the stimulus of the facts of the case. We define the problem of the ECtHR case prediction as a binary classification task. We utilise textual features, i.e., N-grams and topics, to train Support Vector Machine (SVM) classifiers (Vapnik, 1998). We apply a linear kernel function that facilitates the interpretation of models in a straightforward manner. Our models can reliably predict ECtHR decisions with high accuracy, i.e., 79% on average. Results indicate that the ‘facts’ section of a case best predicts the actual court’s decision, which is more consistent with legal realists’ insights about judicial decision-making. We also observe that the topical content of a case is an important indicator whether there is a violation of a given Article of the Convention or not. Previous work on predicting judicial decisions, representing disciplinary backgrounds in political science and economics, has largely focused on the analysis and prediction of judges’ votes given non textual information, such as the nature and the gravity of the crime or the preferred policy position of each judge (Kort, 1957; Nagel, 1963; Keown, 1980; Segal, 1984; Popple, 1996; Lauderdale & Clark, 2012). 
More recent research shows that information from texts authored by amici curiae1 improves models for predicting the votes of the US Supreme Court judges (Sim, Routledge & Smith, 2015). Also, a text mining approach utilises sources of metadata about judge’s votes to estimate the degree to which those votes are about common issues (Lauderdale & Clark, 2014). Accordingly, this paper presents the first systematic study on predicting the decision outcome of cases tried at a major international court by mining the available textual information. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 2/19 2ECHtR provisional annual report for the year 2015: http://www.echr.coe.int/ Documents/Annual_report_2015_ENG. pdf. 3HUDOC ECHR Database: http://hudoc. echr.coe.int/. 4Nonetheless, not all cases that pass this first admissibility stage are decided in the same way. While the individual judge’s decision on admissibility is final and does not comprise the obligation to provide reasons, a Committee deciding a case may, by unanimous vote, declare the application admissible and render a judgment on its merits, if the legal issue raised by the application is covered by well-established case-law by the Court. Overall, we believe that building a text-based predictive system of judicial decisions can offer lawyers and judges a useful assisting tool. The system may be used to rapidly identify cases and extract patterns that correlate with certain outcomes. It can also be used to develop prior indicators for diagnosing potential violations of specific Articles in lodged applications and eventually prioritise the decision process on cases where violation seems very likely. This may improve the significant delay imposed by the Court and encourage more applications by individuals who may have been discouraged by the expected time delays. MATERIALS AND METHODS European Court of Human Rights The ECtHR is an international court set up in 1959 by the ECHR. The court has jurisdiction to rule on the applications of individuals or sovereign states alleging violations of the civil and political rights set out in the Convention. The ECHR is an international treaty for the protection of civil and political liberties in European democracies committed to the rule of law. The treaty was initially drafted in 1950 by the ten states which had created the Council of Europe in the previous year. Membership in the Council entails becoming party to the Convention and all new members are expected to ratify the ECHR at the earliest opportunity. The Convention itself entered into force in 1953. Since 1949, the Council of Europe and thus the Convention have expanded significantly to embrace forty-seven states in total, with a combined population of nearly 800 million. Since 1998, the Court has sat as a full-time court and individuals can apply to it directly, if they can argue that they have voiced their human rights grievance by exhausting all effective remedies available to them in their domestic legal systems before national courts. Case processing by the court The vast majority of applications lodged with the Court are made by individuals. Applications are first assessed at a prejudicial stage on the basis of a list of admissibility criteria. The criteria pertain to a number of procedural rules, chief amongst which is the one on the exhaustion of effective domestic remedies. 
If the case passes this first stage, it can either be allocated to a single judge, who may declare the application inadmissible and strike it out of the Court’s list of cases, or be allocated to a Committee or a Chamber. A large number of the applications, according to the court’s statistics fail this first admissibility stage. Thus, to take a representative example, according to the Court’s provisional annual report for the year 2015,2 900 applications were declared inadmissible or struck out of the list by Chambers, approximately 4,100 by Committees and some 78,700 by single judges. To these correspond, for the same year, 891 judgments on the merits. Moreover, cases held inadmissible or struck out are not reported, which entails that a text-based predictive analysis of them is impossible. It is important to keep this point in mind, since our analysis was solely performed on cases retrievable through the electronic database of the court, HUDOC.3 The cases analysed are thus the ones that have already passed the first admissibility stage,4 with the consequence that the Court decided on these cases’ merits under one of its formations. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 3/19 5Rules of ECtHR, http://www.echr.coe.int/ Documents/Rules_Court_ENG.pdf. Main premise Our main premise is that published judgments can be used to test the possibility of a text-based analysis for ex ante predictions of outcomes on the assumption that there is enough similarity between (at least) certain chunks of the text of published judgments and applications lodged with the Court and/or briefs submitted by parties with respect to pending cases. Predictive tasks were based on the text of published judgments rather than lodged applications or briefs simply because we did not have access to the relevant data set. We thus used published judgments as proxies for the material to which we do not have access. This point should be borne in mind when approaching our results. At the very least, our work can be read in the following hypothetical way: if there is enough similarity between the chunks of text of published judgments that we analyzed and that of lodged applications and briefs, then our approach can be fruitfully used to predict outcomes with these other kinds of texts. Case structure The judgments of the Court have a distinctive structure, which makes them particularly suitable for a text-based analysis. According to Rule 74 of the Rules of the Court,5 a judgment contains (among other things) an account of the procedure followed on the national level, the facts of the case, a summary of the submissions of the parties, which comprise their main legal arguments, the reasons in point of law articulated by the Court and the operative provisions. Judgments are clearly divided into different sections covering these contents, which allows straightforward standardisation of the text and consequently renders possible text-based analysis. More specifically, the sections analysed in this paper are the following: • Procedure: This section contains the procedure followed before the Court, from the lodging of the individual application until the judgment was handed down. • The facts: This section comprises all material which is not considered as belonging to points of law, i.e., legal arguments. 
It is important to stress that the facts in the above sense do not just refer to actions and events that happened in the past as these have been formulated by the Court, giving rise to an alleged violation of a Convention article. The ‘Facts’ section is divided in the following subsections: – The circumstances of the case: This subsection has to do with the factual background of the case and the procedure (typically) followed before domestic courts before the application was lodged by the Court. This is the part that contains materials relevant to the individual applicant’s story in its dealings with the respondent state’s authorities. It comprises a recounting of all actions and events that have allegedly given rise to a violation of the ECHR. With respect to this subsection, a number of crucial clarifications and caveats should be stressed. To begin with, the text of the ‘Circumstances’ subsection has been formulated by the Court itself. As a result, it should not always be understood as a neutral mirroring of the factual background of the case. The choices made by the Court when it comes to formulations of the facts incorporate implicit or explicit judgments to the effect that some facts are more Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 4/19 relevant than others. This leaves open the possibility that the formulations used by the Court may be tailor-made to fit a specific preferred outcome. We openly acknowledge this possibility, but we believe that there are several ways in which it is mitigated. First, the ECtHR has limited fact-finding powers and, in the vast majority of cases, it defers, when summarizing the factual background of a case, to the judgments of domestic courts that have already heard and dismissed the applicants’ ECHR-related complaint (Leach, Paraskeva & Uelac, 2010; Leach, 2013). While domestic courts do not necessarily hear complaints on the same legal issues as the ECtHR does, by virtue of the incorporation of the Convention by all States Parties (Helfer, 2008), they typically have powers to issue judgments on ECHR-related issues. Domestic judgments may also reflect assumptions about the relevance of various events, but they also provide formulations of the facts that have been validated by more than one decision-maker. Second, the Court cannot openly acknowledge any kind of bias on its part. This means that, on their face, summaries of facts found in the ‘Circumstances’ section have to be at least framed in as neutral and impartial a way as possible. As a result, for example, clear displays of impartiality, such as failing to mention certain crucial events, seem rather improbable. Third, a cursory examination of many ECtHR cases indicates that, in the vast majority of cases, parties do not seem to dispute the facts themselves, as contained in the ‘Circumstances’ subsection, but only their legal significance (i.e., whether a violation took place or not, given those facts). As a result, the ‘Circumstances’ subsection contains formulations on which, in the vast majority of cases, disputing parties agree. Last, we hasten to add that the above three kinds of considerations do not logically entail that other forms of non-outright or indirect bias in the formulation of facts are impossible. 
However, they suggest that, in the absence of access to other kinds of textual data, such as lodged applications and briefs, the ‘Circumstances’ subsection can reasonably perform the function of a (sometimes crude) proxy for a textual representation of the factual background of a case. – Relevant law: This subsection of the judgment contains all legal provisions other than the articles of the Convention that can be relevant to deciding the case. These are mostly provisions of domestic law, but the Court also frequently invokes other pertinent international or European treaties and materials. • The law: The law section considers the merits of the case, through the use of legal argument. Depending on the number of issues raised by each application, the section is further divided into subsections that examine individually each alleged violation of some Convention article (see below). However, the Court in most cases refrains from examining all such alleged violations in detail. Insofar as the same claims can be made by invoking more than one article of the Convention, the Court frequently decides only those that are central to the arguments made. Moreover, the Court frequently refrains from deciding on an alleged violation of an article, if it overlaps sufficiently with some other violation it has already decided on. – Alleged violation of article x: Each subsection of the judgment examining alleged violations in depth is divided into two sub-sections. The first one contains the Parties’ Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 5/19 6The data set is publicly available for download from https://figshare.com/s/ 6f7d9e7c375ff0822564. Figure 1 Procedure. This section contains the procedure followed before the Court, from the lodging of the individual application until the judgment was handed down. Submissions. The second one comprises the arguments made by the Court itself on the Merits. ∗ Parties’ submissions: The Parties’ Submissions typically summarise the main arguments made by the applicant and the respondent state. Since in the vast majority of cases the material facts are taken for granted, having been authoritatively established by domestic courts, this part has almost exclusively to do with the legal arguments used by the parties. ∗ Merits: This subsection provides the legal reasons that purport to justify the specific outcome reached by the Court. Typically, the Court places its reasoning within a wider set of rules, principles and doctrines that have already been established in its past case-law and attempts to ground the decision by reference to these. It is to be expected, then, that this subsection refers almost exclusively to legal arguments, sometimes mingled with bits of factual information repeated from previous parts. • Operative provisions: This is the section where the Court announces the outcome of the case, which is a decision to the effect that a violation of some Convention article either did or did not take place. Sometimes it is coupled with a decision on the division of legal costs and, much more rarely, with an indication of interim measures, under article 39 of the ECHR. Figures 1–4, show extracts of different sections from the Case of ‘‘Velcheva v. Bulgaria’’ (http://hudoc.echr.coe.int/sites/eng/pages/search.aspx?i=001-155099) following the structure described above. Data We create a data set6 consisting of cases related to Articles 3, 6, and 8 of the Convention. We focus on these three articles for two main reasons. 
First, these articles provided the most data we could automatically scrape. Second, it is of crucial importance that there should be a sufficient number of cases available, in order to test the models. Cases from the selected articles fulfilled both criteria. Table 1 shows the Convention right that each article protects and the number of cases in our data set. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 6/19 Figure 2 The facts. This section comprises all material which is not considered as belonging to points of law, i.e., legal arguments. Figure 3 The law. The law section is focused on considering the merits of the case, through the use of le- gal argument. Figure 4 Operative provisions. This is the section where the Court announces the outcome of the case, which is a decision to the effect that a violation of some Convention article either did or did not take place. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 7/19 Table 1 Articles of the Convention and number of cases in the data set. Article numbers, Convention right that each article protects and the number of cases in our data set. Article Human Right Cases 3 Prohibits torture and inhuman and degrading treatment 250 6 Protects the right to a fair trial 80 8 Provides a right to respect for one’s ‘‘private and family life, his home and his correspondence’’ 254 For each article, we first retrieve all the cases available in HUDOC. Then, we keep only those that are in English and parse them following the case structure presented above. We then select an equal number of violation and non-violation cases for each particular article of the Convention. To achieve a balanced number of violation/non-violation cases, we first count the number of cases available in each class. Then, we choose all the cases in the smaller class and randomly select an equal number of cases from the larger class. This results to a total of 250, 80 and 254 cases for Articles 3, 6 and 8, respectively. Finally, we extract the text under each part of the case by using regular expressions, making sure that any sections on operative provisions of the Court are excluded. In this way, we ensure that the models do not use information pertaining to the outcome of the case. We also preprocess the text by lower-casing and removing stop words (i.e., frequent words that do not carry significant semantic information) using the list provided by NLTK (https://raw.githubusercontent.com/nltk/nltk_data/ghpages/packages/corpora/ stopwords.zip). Description of textual features We derive textual features from the text extracted from each section (or subsection) of each case. These are either N-gram features, i.e., contiguous word sequences, or word clusters, i.e., abstract semantic topics. • N-gram features: The Bag-of-Words (BOW) model (Salton, Wong & Yang, 1975; Salton & McGill, 1986) is a popular semantic representation of text used in NLP and Information Retrieval. In a BOW model, a document (or any text) is represented as the bag (multiset) of its words (unigrams) or N-grams without taking into account grammar, syntax and word order. That results to a vector space representation where documents are represented as m-dimensional variables over a set of m N-grams. N-gram features have been shown to be effective in various supervised learning tasks (Bamman, Eisenstein & Schnoebelen, 2014; Lampos & Cristianini, 2012). For each set of cases in our data set, we compute the top-2000 most frequent N-grams where N ∈ {1,2,3,4}. 
Each feature represents the normalized frequency of a particular N-gram in a case or a section of a case. This can be considered as a feature matrix, C ∈ Rc×m, where c is the number of the cases and m = 2,000. We extract N-gram features for the Procedure (Procedure), Circumstances (Circumstances), Facts (Facts), Relevant Law (Relevant Law), Law (Law) and the Full case (Full) respectively. Note that the representations of the Facts is obtained by taking the mean vector of Circumstances and Relevant Law. In a similar Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 8/19 way, the representation of the Full case is computed by taking the mean vector of all of its sub-parts. • Topics: We create topics for each article by clustering together N-grams that are semantically similar by leveraging the distributional hypothesis suggesting that similar words appear in similar contexts. We thus use the C feature matrix (see above), which is a distributional representation (Turney & Pantel, 2010) of the N-grams given the case as the context; each column vector of the matrix represents an N-gram. Using this vector representation of words, we compute N-gram similarity using the cosine metric and create an N-gram by N-gram similarity matrix. We finally apply spectral clustering (von Luxburg, 2007)—which performs graph partitioning on the similarity matrix—to obtain 30 clusters of N-grams. For Articles 6 and 8, we use the Article 3 data for selecting the number of clusters T , where T = {10,20,...,100}, while for Article 3 we use Article 8. Given that the obtained topics are hard clusters, an N-gram can only be part of a single topic. A representation of a cluster is derived by looking at the most frequent N-grams it contains. The main advantages of using topics (sets of N-grams) instead of single N-grams is that it reduces the dimensionality of the feature space, which is essential for feature selection, it limits overfitting to training data (Lampos et al., 2014; Preoţiuc-Pietro, Lampos & Aletras, 2015; Preoţiuc-Pietro et al., 2015) and also provides a more concise semantic representation. Classification model The problem of predicting the decisions of the ECtHR is defined as a binary classification task. Our goal is to predict if, in the context of a particular case, there is a violation or non-violation in relation to a specific Article of the Convention. For that purpose, we use each set of textual features, i.e., N-grams and topics, to train Support Vector Machine (SVM) classifiers (Vapnik, 1998). An SVM is a machine learning algorithm that has shown particularly good results in text classification, especially using small data sets (Joachims, 2002; Wang & Manning, 2012). We employ a linear kernel since that allows us to identify important features that are indicative of each class by looking at the weight learned for each feature (Chang & Lin, 2008). We label all the violation cases as +1, while no violation is denoted by −1. Therefore, features assigned with positive weights are more indicative of violation, while features with negative weights are more indicative of no violation. The models are trained and tested by applying a stratified 10-fold cross validation, which uses a held-out 10% of the data at each stage to measure predictive performance. The linear SVM has a regularisation parameter of the error term C, which is tuned using grid-search. For Articles 6 and 8, we use the Article 3 data for parameter tuning, while for Article 3 we use Article 8. 
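To make the experimental setup concrete, the following sketch shows one way to reproduce the N-gram feature extraction and classification steps described above using scikit-learn. It is an illustrative approximation, not the authors' released code: the input format (a list of section texts with ±1 labels) is an assumption, the topic-cluster features are omitted for brevity, and nested grid search over C is used as a stand-in for the paper's cross-Article tuning.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def ngram_features(texts):
    """Top-2000 most frequent N-grams (N = 1..4), as normalized frequencies per case."""
    vectorizer = CountVectorizer(ngram_range=(1, 4), max_features=2000, lowercase=True)
    counts = vectorizer.fit_transform(texts).astype(np.float64)
    row_sums = np.asarray(counts.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0
    return counts.multiply(1.0 / row_sums[:, None]).tocsr(), vectorizer

def evaluate(texts, labels):
    """Mean 10-fold cross-validated accuracy of a linear SVM on N-gram features.

    texts  -- list of strings, e.g. the 'Circumstances' section of each case
    labels -- array of +1 (violation) / -1 (no violation)
    """
    X, _ = ngram_features(texts)
    # The paper tunes C on the data of a different Article; a nested grid search
    # over C is used here instead as a simple substitute.
    clf = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, np.asarray(labels), cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()
```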
RESULTS AND DISCUSSION
Predictive accuracy
We compute the predictive performance of both sets of features on the classification of the ECtHR cases. Performance is computed as the mean accuracy obtained by 10-fold cross-validation. Accuracy is computed as follows:

Accuracy = (TV + TNV) / (V + NV)    (1)

where TV and TNV are the numbers of cases correctly classified as a violation or a non-violation of an Article of the Convention, respectively, and V and NV represent the total numbers of cases where there is a violation or no violation, respectively.

Table 2. Accuracy of the different feature types across Articles: accuracy of predicting violation/non-violation of cases on 10-fold cross-validation using an SVM with a linear kernel. Parentheses contain the standard deviation from the mean. The accuracy of a random guess is .50.

Feature Type                 Article 3    Article 6    Article 8    Average
N-grams: Full                .70 (.10)    .82 (.11)    .72 (.05)    .75
N-grams: Procedure           .67 (.09)    .81 (.13)    .71 (.06)    .73
N-grams: Circumstances       .68 (.07)    .82 (.14)    .77 (.08)    .76
N-grams: Relevant law        .68 (.13)    .78 (.08)    .72 (.11)    .73
N-grams: Facts               .70 (.09)    .80 (.14)    .68 (.10)    .73
N-grams: Law                 .56 (.09)    .68 (.15)    .62 (.05)    .62
Topics                       .78 (.09)    .81 (.12)    .76 (.09)    .78
Topics and circumstances     .75 (.10)    .84 (.11)    .78 (.06)    .79

Table 2 shows the accuracy of each set of features across articles using a linear SVM. The rightmost column also shows the mean accuracy across the three articles. In general, both N-gram and topic features achieve good predictive performance. Our main observation is that both language use and topicality are important factors that appear to stand as reliable proxies of judicial decisions. Therefore, we take a further look into the models by attempting to interpret the differences in accuracy. We observe that 'Circumstances' is the best subsection to predict the decisions for cases in Articles 6 and 8, with a performance of .82 and .77 respectively. In Article 3, we obtain better predictive accuracy (.70) using the text extracted from the full case ('Full'), while the performance of 'Circumstances' is almost comparable (.68). We should again note here that the 'Circumstances' subsection contains information regarding the factual background of the case, as this has been formulated by the Court. The subsection therefore refers to the actions and events which triggered the case and gave rise to a claim made by an individual to the effect that the ECHR was violated by some state. On the other hand, 'Full', which is a mixture of information contained in all of the sections of a case, surprisingly fails to improve over using only the 'Circumstances' subsection. This entails that the factual background contained in the 'Circumstances' is the most important textual part of the case when it comes to predicting the Court's decision. The other sections and subsections that refer to the facts of a case, namely 'Procedure', 'Relevant Law' and 'Facts', achieve somewhat lower performance (.73 cf. .76), although they remain consistently above chance. Recall, at this point, that the 'Procedure' subsection consists only of general details about the applicant, such as the applicant's name or country of origin, and the procedure followed before domestic courts.
Sci., DOI 10.7717/peerj-cs.93 10/19 On the other hand, the ‘Law’ subsection, which refers either to the legal arguments used by the parties or to the legal reasons provided by the Court itself on the merits of a case consistently obtains the lowest performance (.62). One important reason for this poor performance is that a large number of cases does not include a ‘Law’ subsection, i.e., 162, 52 and 146 for Articles 3, 6 and 8 respectively. That happens in cases that the Court deems inadmissible, concluding to a judgment of non-violation. We also observe that the predictive accuracy is high for all the Articles when using the ‘Topics’ as features, i.e., .78, .81 and .76 for Articles 3, 6 and 8 respectively. ‘Topics’ obtain the best performance in Article 3 and performance comparable to ‘Circumstances’ in Articles 6 and 8. ‘Topics’ form a more abstract way of representing the information contained in each case and capture a more general gist of the cases. Combining the two best performing sets of features (‘Circumstances’ and ‘Topics’) we achieve the best average classification performance (.79). The combination also yields slightly better performance for Articles 6 and 8 while performance marginally drops for Article 3. That is .75, .84 and .78 for Articles 3, 6 and 8 respectively. Discussion The consistently more robust predictive accuracy of the ‘Circumstances’ subsection suggests a strong correlation between the facts of a case, as these are formulated by the Court in this subsection, and the decisions made by judges. The relatively lower predictive accuracy of the ‘Law’ subsection could also be an indicator of the fact that legal reasons and arguments of a case have a weaker correlation with decisions made by the Court. However, this last remark should be seriously mitigated since, as we have already observed, many inadmissibility cases do not contain a separate ‘Law’ subsection. Legal formalism and realism These results could be understood as providing some evidence for judicial decision-making approaches according to which judges are primarily responsive to non-legal, rather than to legal, reasons when they decide appellate cases. Without going into details with respect to a particularly complicated debate that is out of the scope of this paper, we may here simplify by observing that since the beginning of the 20th century, there has been a major contention between two opposing ways of making sense of judicial decision-making: legal formalism and legal realism (Posner, 1986; Tamanaha, 2009; Leiter, 2010). Very roughly, legal formalists have provided a legal model of judicial decision-making, claiming that the law is rationally determinate: judges either decide cases deductively, by subsuming facts under formal legal rules or use more complex legal reasoning than deduction whenever legal rules are insufficient to warrant a particular outcome (Pound, 1908; Kennedy, 1973; Grey, 1983; Pildes, 1999). On the other hand, legal realists have criticized formalist models, insisting that judges primarily decide appellate cases by responding to the stimulus of the facts of the case, rather than on the basis of legal rules or doctrine, which are in many occasions rationally indeterminate (Llewellyn, 1996; Schauer, 1998; Baum, 2009; Leiter, 2007; Miles & Sunstein, 2008). Extensive empirical research on the decision-making processes of various supreme and international courts, and especially the US Supreme Court, has indicated rather consistently Aletras etal (2016), PeerJ Comput. 
Sci., DOI 10.7717/peerj-cs.93 11/19 that pure legal models, especially deductive ones, are false as an empirical matter when it comes to cases decided by courts further up the hierarchy. As a result, it is suggested that the best way to explain past decisions of such courts and to predict future ones is by placing emphasis on other kinds of empirical variables that affect judges (Baum, 2009; Schauer, 1998). For example, early legal realists had attempted to classify cases in terms of regularities that can help predict outcomes, in a way that did not reflect standard legal doctrine (Llewellyn, 1996). Likewise, the attitudinal model for the US Supreme Court claims that the best predictors of its decisions are the policy preferences of the Justices and not legal doctrinal arguments (Segal & Spaeth, 2002). In general, and notwithstanding the simplified snapshot of a very complex debate that we just presented, our results could be understood as lending some support to the basic legal realist intuition according to which judges are primarily responsive to non-legal, rather than to legal, reasons when they decide hard cases. In particular, if we accept that the ‘Circumstances’ subsection, with all the caveats we have already voiced, is a (crude) proxy for non-legal facts and the ‘Law’ subsection is a (crude) proxy for legal reasons and arguments, the predictive superiority of the ‘Circumstances’ subsection seems to cohere with extant legal realist treatments of judicial decision-making. However, not more should be read into this than our results allow. First, as we have already stressed at several occasions, the ‘Circumstances’ subsection is not a neutral statement of the facts of the case and we have only assumed the similarity of that subsection with analogous sections found in lodged applications and briefs. Second, it is important to underline that the results should also take into account the so-called selection effect (Priest & Klein, 1984) that pertains to cases judged by the ECtHR as an international court. Given that the largest percentage of applications never reaches the Chamber or, still less, the Grand Chamber, and that cases have already been tried at the national level, it could very well be the case that the set of ECtHR decisions on the merits primarily refers to cases in which the class of legal reasons, defined in a formal sense, is already considered as indeterminate by competent interpreters. This could help explain why judges primarily react to the facts of the case, rather than to legal arguments. Thus, further text-based analysis is needed in order to determine whether the results could generalise to other courts, especially to domestic courts deciding ECHR claims that are placed lower within the domestic judicial hierarchy. Third, our discussion of the realism/formalism debate is overtly simplified and does not imply that the results could not be interpreted in a sophisticated formalist way. Still, our work coheres well with a bulk of other empirical approaches in the legal realist vein. Topic analysis The topics further exemplify this line of interpretation and provide proof of the usefulness of the NLP approach. The linear kernel of the SVM model can be used to examine which topics are most important for inferring whether an article of the Convention has been violated or not by looking at their weights w. Tables 3– 5 present the six topics for the most positive and negative SVM weights for the articles 3, 6 and 8, respectively. 
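Given a trained linear model, the ranking of topics (or N-grams) by weight that underlies Tables 3–5 can be reproduced along the following lines. This is an illustrative sketch assuming a fitted scikit-learn LinearSVC called `model` and a list `feature_names` aligned with its feature columns (the vectorizer's vocabulary or the topic labels); it is not the authors' own analysis code.

```python
import numpy as np

def most_predictive(model, feature_names, k=6):
    """Return the k features with the most positive (violation) and most
    negative (no violation) weights of a fitted linear SVM."""
    weights = model.coef_.ravel()                 # one weight per feature/topic
    order = np.argsort(weights)
    top_violation = [(feature_names[i], weights[i]) for i in order[::-1][:k]]
    top_no_violation = [(feature_names[i], weights[i]) for i in order[:k]]
    return top_violation, top_no_violation
```

Because the classes are labelled +1 for violation and −1 for non-violation, positive weights point towards violation and negative weights towards non-violation, matching the sign convention used in the tables.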
Topics identify in a sufficiently robust manner patterns of fact scenarios that correspond to well-established trends in the Court’s case law. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 12/19 7Note that all the cases used as examples in this section are taken from the data set we used to perform the experiments. Table 3 The most predictive topics for Article 3 decisions. Most predictive topics for Article 3, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 4 Positive State Obligations injury, protection, ordered, damage, civil, caused, failed, claim, course, connection, region, effective, quashed, claimed, suffered, suspended, carry, compensation, pecuniary, ukraine 13.50 10 Detention conditions prison, detainee, visit, well, regard, cpt, access, food, situation, problem, remained, living, support, visited, establishment, standard, admissibility merit, overcrowding, contact, good 11.70 3 Treatment by state officials police, officer, treatment, police officer, July, ill, force, evidence, ill treatment, arrest, allegation, police station, subjected, arrested, brought, subsequently, allegedly, ten, treated, beaten 10.20 Top-5 No Violation 8 Prior Violation of Article 2 june, statement, three, dated, car, area, jurisdiction, gendarmerie, perpetrator, scene, June applicant, killing, prepared, bullet, wall, weapon, kidnapping, dated June, report dated, stopped −12.40 19 Issues of Proof witness, asked, told, incident, brother, heard, submission, arrived, identity, hand, killed, called, involved, started, entered, find, policeman, returned, father, explained −15.20 13 Sentencing sentence, year, life, circumstance, imprisonment, release, set, president, administration, sentenced, term, constitutional, federal, appealed, twenty, convicted, continued, regime, subject, responsible −17.40 First, topic 13 in Table 3 has to do with whether long prison sentences and other detention measures can amount to inhuman and degrading treatment under Article 3. That is correctly identified as typically not giving rise to a violation (European Court of Human Rights, 2015). For example, cases7 such as Kafkaris v. Cyprus ([GC] no. 21906/04, ECHR 2008-I), Hutchinson v. UK (no. 57592/08 of 3 February 2015) and Enea v. Italy ([GC], no. 74912/01, ECHR 2009-IV) were identified as exemplifications of this trend. Likewise, topic 28 in Table 5 has to do with whether certain choices with regard to the social policy of states can amount to a violation of Article 8. That was correctly identified as typically not giving rise to a violation, in line with the Court’s tendency to acknowledge a large margin of appreciation to states in this area (Greer, 2000). In this vein, cases such as Aune v. Norway (no. 52502/07 of 28 October 2010) and Ball v. Andorra (Application no. 40628/10 of 11 December 2012) are examples of cases where topic 28 is dominant. Similar observations apply, among other things, to topics 23, 24 and 27. That includes issues with the enforcement of domestic judgments giving rise to a violation of Article 6 (Kiestra, 2014). Some representative cases are Velskaya v. Russia, of 5 October 2006 and Aleksandrova v. Russia of 6 December 2007. Topic 7 in Table 4 is related to lower standard of review when property rights are at play (Tsarapatsanis, 2015). 
A representative Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 13/19 Table 4 The most predictive topics for Article 6 decisions. Most predictive topics for Article 6, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 27 Enforcement of domestic judgments and reasonable time appeal, enforcement, damage, instance, dismissed, established, brought, enforcement proceeding, execution, limit, court appeal, instance court, caused, time limit, individual, responsible, receipt, court decision, copy, employee 11.70 23 Enforcement of domestic judgments and reasonable time court, applicant, article, judgment, case, law, proceeding, application, government, convention, time, article convention, January, human, lodged, domestic, February, September, relevant, represented 9.15 24 Enforcement of domestic judgments and reasonable time party, final, respect, set, interest, alleged, general, violation, entitled, complained, obligation, read, fair, final judgment, violation article, served, applicant complained, summons, convention article, fine 6.78 Top-5 No violation 10 Criminal limb defendant, detention, witness, cell, counsel, condition, defence, court upheld, charged, serious, regional court upheld, pre, remand, inmate, pre trial, extended, detained, temporary, defence counsel, metre −5.71 3 Criminal limb procedure, judge, fact, federal, justice, reason, charge, point, criminal procedure, code criminal, code criminal procedure, result, pursuant, article code, lay, procedural, point law, indictment, lay judge, argued, appeal point law −7.01 7 Property rights and claims by companies compensation, company, property, examined, cassation, rejected, declared, owner, deputy, tula, returned, duly, enterprise, moscow, foreign, appears, control, violated, absence, transferred −9.08 case here is Oao Plodovaya Kompaniya v. Russia of 7 June 2007. Consequently, the topics identify independently well-established trends in the case law without recourse to expert legal/doctrinal analysis. The above observations require to be understood in a more mitigated way with respect to a (small) number of topics. For instance, most representative cases for topic 8 in Table 3 were not particularly informative. This is because these were cases involving a person’s death, in which claims of violations of Article 3 (inhuman and degrading treatment) were only subsidiary: this means that the claims were mainly about Article 2, which protects the right to life. In these cases, the absence of a violation, even if correctly identified, is more of a technical issue on the part of the Court, which concentrates its attention on Article 2 and rarely, if ever, moves on to consider independently a violation of Article 3. This is exemplified by cases such as Buldan v. Turkey of 20 April 2004 and Nuray Şen v. Turkey of 30 March 2004, which were, again, correctly identified. On the other hand, cases have been misclassified mainly because their textual information is similar to cases in the opposite class. We observed a number of cases where there is a Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 14/19 Table 5 The most predictive topics for Article 8 decisions. Most predictive topics for Article 8, represented by the 20 most frequent words, listed in order of their SVM weight. Topic labels are manually added. 
Positive weights (w) denote more predictive topics for violation and negative weights for no violation. Topic Label Words w Top-5 Violation 30 Death and military action son, body, result, russian, department, prosecutor office, death, group, relative, head, described, military, criminal investigation, burial, district prosecutor, men, deceased, town, attack, died 15.70 1 Unlawful limitation clauses health moral, law democratic, law democratic society, disorder crime, prevention disorder, prevention disorder crime, economic well, protection health, interest national, interest national security, public authority exercise, interference public authority exercise, national security public, exercise law democratic, public authority exercise law, authority exercise law democratic, exercise law, authority exercise law, exercise law democratic society, crime protection 12.20 26 Judicial procedure second, instance, second applicant, victim, municipal, violence, authorised, address, municipal court, relevant provision, behaviour, register, appear, maintenance, instance court, defence, procedural, decide, court decided, quashed 9.51 Top-5 No violation 25 Discretion of state authorities service, obligation, data, duty, review, high, system, test, concern, building, agreed, professional, positive, threat, carry, van, accepted, step, clear, panel −7.89 28 Social policy contact, social, care, expert, opinion, living, welfare, county, physical, psychological, agreement, divorce, restriction, support, live, dismissed applicant, prior, remained, court considered, expressed −12.30 4 Migration cases national, year, country, residence, minister, permit, requirement, netherlands, alien, board, claimed, stay, contrary, objection, spouse, residence permit, close, deputy, deportation, brother −13.50 violation having a very similar feature vector to cases that there is no violation and vice versa. CONCLUSIONS We presented the first systematic study on predicting judicial decisions of the European Court of Human Rights using only the textual information extracted from relevant sections of ECtHR judgments. We framed this task as a binary classification problem, where the training data consists of textual features extracted from given cases and the output is the actual decision made by the judges. Apart from the strong predictive performance that our statistical NLP framework achieved, we have reported on a number of qualitative patterns that could potentially drive judicial decisions. More specifically, we observed that the information regarding the Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 15/19 factual background of the case as this is formulated by the Court in the relevant subsection of its judgments is the most important part obtaining on average the strongest predictive performance of the Court’s decision outcome. We suggested that, even if understood only as a crude proxy and with all the caveats that we have highlighted, the rather robust correlation between the outcomes of cases and the text corresponding to fact patterns contained in the relevant subsections coheres well with other empirical work on judicial decision-making in hard cases and backs basic legal realist intuitions. Finally, we believe that our study opens up avenues for future work, using different kinds of data (e.g., texts of individual applications, briefs submitted by parties or domestic judgments) coming from various sources (e.g., the European Court of Human Rights, national authorities, law firms). 
However, data access issues pose a significant barrier for scientists to work on such kinds of legal data. Large repositories like HUDOC, which are easily and freely accessible, are only case law databases. Access to other kinds of data, especially lodged applications and briefs, would enable further research in the intersection of legal science and artificial intelligence. ADDITIONAL INFORMATION AND DECLARATIONS Funding DPP received funding from Templeton Religion Trust (https://www.templeton.org) grant number: TRT-0048. VL received funding from Engineering and Physical Sciences Research Council (http://www.epsrc.ac.uk) grant number: EP/K031953/1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: Templeton Religion Trust: TRT-0048. Engineering and Physical Sciences Research Council: EP/K031953/1. Competing Interests Nikolaos Aletras is an employee of Amazon.com, Cambridge, UK, but work was completed while at University College London. Author Contributions • Nikolaos Aletras and Vasileios Lampos conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. • Dimitrios Tsarapatsanis conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 16/19 • Daniel Preoţiuc-Pietro conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper. Data Availability The following information was supplied regarding data availability: ECHR dataset: https://figshare.com/s/6f7d9e7c375ff0822564. REFERENCES Bamman D, Eisenstein J, Schnoebelen T. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2):135–160 DOI 10.1111/josl.12080. Baum L. 2009. The puzzle of judicial behavior. University of Michigan Press. Chang Y-W, Lin C-J. 2008. Feature ranking using linear SVM. In: WCCI causation and prediction challenge, 53–64. European Court of Human Rights. 2015. Factsheet on life imprisonment. Strasbourg: European Court of Human Rights. Available at http://www.echr.coe.int/Documents/ FS_Life_sentences_ENG.pdf . Greer SC. 2000. The margin of appreciation: interpretation and discretion under the European Convention on Human Rights, vol. 17. Council of Europe. Grey TC. 1983. Langdell’s orthodoxy. University of Pittsburgh Law Review 45:1–949. Helfer LR. 2008. Redesigning the European Court of Human Rights: embeddedness as a deep structural principle of the European human rights regime. European Journal of International Law 19(1):125–159 DOI 10.1093/ejil/chn004. Joachims T. 2002. Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers. Kennedy D. 1973. Legal formality. The Journal of Legal Studies 2(2):351–398 DOI 10.1086/467502. Keown R. 1980. Mathematical models for legal prediction. Computer/LJ 2:829. Kiestra LR. 2014. The impact of the European Convention on Human Rights on private international law. Springer. Kort F. 1957. 
Predicting Supreme Court decisions mathematically: a quantitative analysis of the ‘‘right to counsel’’ cases. American Political Science Review 51(01):1–12 DOI 10.2307/1951767. Lampos V, Aletras N, Preoţiuc-Pietro D, Cohn T. 2014. Predicting and characterising user impact on Twitter. In: Proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics, 405–413. Lampos V, Cristianini N. 2012. Nowcasting events from the social web with statistical learning. ACM Transactions on Intelligent Systems and Technology 3(4):72:1–72:22. Lauderdale BE, Clark TS. 2012. The Supreme Court’s many median justices. American Political Science Review 106(04):847–866 DOI 10.1017/S0003055412000469. Lauderdale BE, Clark TS. 2014. Scaling politically meaningful dimensions using texts and votes. American Journal of Political Science 58(3):754–771 DOI 10.1111/ajps.12085. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 17/19 Lawlor RC. 1963. What computers can do: analysis and prediction of judicial decisions. American Bar Association Journal 49:337–344. Leach P. 2013. Taking a case to the European Court of Human Rights. Oxford: Oxford University Press. Leach P, Paraskeva C, Uelac G. 2010. Human rights fact-finding. The European Court of Human Rights at a crossroads. Netherlands Quarterly of Human Rights 28(1):41–77. Leiter B. 2007. Naturalizing Jurisprudence: essays on American legal realism and naturalism in legal philosophy. Oxford: Oxford University Press. Leiter B. 2010. Legal formalism and legal realism: what is the issue? Legal Theory 16(2):111–133 DOI 10.1017/S1352325210000121. Llewellyn KN. 1996. The common law tradition: deciding appeals. William S. Hein & Co., Inc.. Miles TJ, Sunstein CR. 2008. The new legal realism. The University of Chicago Law Review 75(2):831–851. Nagel SS. 1963. Applying correlation analysis to case prediction. Texas Law Review 42:1006. Pildes RH. 1999. Forms of formalism. The University of Chicago Law Review 66(3):607–621 DOI 10.2307/1600419. Popple J. 1996. A pragmatic legal expert system. Applied Legal Philosophy Series, Dart- mouth (Ashgate), Aldershot. Posner RA. 1986. Legal formalism, legal realism, and the interpretation of statutes and the constitution. Case Western Reserve Law Review 37:179–217. Pound R. 1908. Mechanical jurisprudence. Columbia Law Review 8(8):605–623. Preoţiuc-Pietro D, Lampos V, Aletras N. 2015. An analysis of the user occupational class through Twitter content. In: Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing (Volume 1: Long Papers). 1754–1764. Preoţiuc-Pietro D, Volkova S, Lampos V, Bachrach Y, Aletras N. 2015. Studying user income through language, behaviour and affect in social media. PLoS ONE 10(9):1–17 DOI 10.1371/journal.pone.0138717. Priest GL, Klein B. 1984. The selection of disputes for litigation. The Journal of Legal Studies 13(1):1–55 DOI 10.1086/467732. Salton G, McGill MJ. 1986. Introduction to modern information retrieval. New York: McGraw-Hill, Inc. Salton G, Wong A, Yang C-S. 1975. A vector space model for automatic indexing. Communications of the ACM 18(11):613–620 DOI 10.1145/361219.361220. Schauer F. 1998. Prediction and particularity. Boston University Law Review 78:773. Segal JA. 1984. Predicting Supreme Court cases probabilistically: the search and seizure cases, 1962–1981. American Political Science Review 78(04):891–900 DOI 10.2307/1955796. 
Segal JA, Spaeth HJ. 2002. The Supreme Court and the attitudinal model revisited. Cambridge: Cambridge University Press. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 18/19 Sim Y, Routledge BR, Smith NA. 2015. The utility of text: the case of Amicus briefs and the Supreme Court. In: Twenty-Ninth AAAI conference on artificial intelligence. Tamanaha BZ. 2009. Beyond the formalist-realist divide: the role of politics in judging. Princeton: Princeton University Press. Tsarapatsanis D. 2015. The margin of appreciation doctrine: a low-level institutional view. Legal Studies 35(4):675–697 DOI 10.1111/lest.12089. Turney PD, Pantel P. 2010. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37:141–188. Vapnik VN. 1998. Statistical learning theory. New York: Wiley. von Luxburg U. 2007. A tutorial on spectral clustering. Statistics and Computing 17(4):395–416. Wang S, Manning CD. 2012. Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics: short papers-Volume 2. 90–94. Aletras etal (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.93 19/19 work_2tzdxx7pkvbbbb62pwqubu6nfe ---- Universal Word Segmentation: Implementation and Interpretation Yan Shao, Christian Hardmeier, Joakim Nivre Department of Linguistics and Philology, Uppsala University {yan.shao, christian.hardmeier, joakim.nivre}@lingfil.uu.se Abstract Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of lan- guages with different writing systems and ty- pological characteristics. Additionally, we in- vestigate the correlations between various ty- pological factors and word segmentation ac- curacy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Uni- versal Dependencies datasets. Our model ob- tains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and He- brew, when compared to previous work. 1 Introduction Word segmentation is the initial step for most higher level natural language processing tasks, such as part-of-speech tagging (POS), parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string. Word segmentation can be very challenging, es- pecially for languages without explicit word bound- ary delimiters, such as Chinese, Japanese and Viet- namese. Even for space-delimited languages like English or Russian, relying on white space alone generally does not result in adequate segmentation as at least punctuation should usually be separated from the attached words. For some languages, the space-delimited units in the surface form are too coarse-grained and therefore often further analysed, as in the cases of Arabic and Hebrew. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably with no or minimum language-specific adaptations. 
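To make the point about whitespace concrete, the toy example below (our own illustration, not taken from the paper) shows that splitting on spaces leaves punctuation attached to the neighbouring word, while a naive character-class rule over-segments contractions that UD instead analyses as multiword tokens.

```python
import re

text = "However, this isn't segmentation."

print(text.split())
# ['However,', 'this', "isn't", 'segmentation.']   # punctuation stays attached

print(re.findall(r"\w+|[^\w\s]", text))
# ['However', ',', 'this', 'isn', "'", 't', 'segmentation', '.']
# whereas UD treats "isn't" as a multiword token yielding the words "is" and "n't"
```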
Word segmentation standards vary substantially with different definitions of the concept of a word. In this paper, we will follow the teminologies of Universal Dependencies (UD), where words are de- fined as basic syntactic units that do not always coincide with phonological or orthographic words. Some orthographic tokens, known in UD as mul- tiword tokens, therefore need to be broken into smaller units that cannot always be obtained by split- ting the input character sequence.1 To perform word segmentation in the UD frame- work, neither rule-based tokenisers that rely on white space nor the naive character-level sequence tagging model proposed previously (Xue, 2003) are ideal. In this paper, we present an enriched sequence labelling model for universal word segmentation. It is capable of segmenting languages in very diverse written forms. Furthermore, it simultaneously iden- tifies the multiword tokens defined by the UD frame- work that cannot be resolved simply by splitting 1Note that this notion of multiword token has nothing to do with the notion of multiword expression (MWE) as discussed, for example, in Sag et al. (2002). 421 Transactions of the Association for Computational Linguistics, vol. 6, pp. 421–435, 2018. Action Editor: Sebastian Padó . Submission batch: 3/2018; Revision batch: 6/2018; Published 7/2018. c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. the input character sequence. We adapt a regular sequence tagging model, namely the bidirectional recurrent neural networks with conditional random fields (CRF) (Lafferty et al., 2001) interface as the fundamental framework (BiRNN-CRF) (Huang et al., 2015) for word segmentation. The main contributions of this work include: 1. We propose a sequence tagging model for word segmentation, both for general purposes (mere splitting) and full UD processing (splitting plus occasional transduction). 2. We investigate the correlation between segmen- tation accuracy and properties of languages and writing systems, which is helpful in interpret- ing the gaps between segmentation accuracies across different languages as well as selecting language-specific settings for the model. 3. Our segmentation system achieves state-of-the- art accuracy on the UD datasets and improves on previous work (Straka and Straková, 2017) especially for the most challenging languages. 4. We provide an open source implementation.2 2 Word Segmentation in UD The UD scheme for cross-linguistically consistent morphosyntactic annotation defines words as syn- tactic units that have a unique part-of-speech tag and enter into syntactic relations with other words (Nivre et al., 2016). For languages that use whitespace as boundary markers, there is often a mismatch be- tween orthographic words, called tokens in the UD terminology, and syntactic words. Typical examples are clitics, like Spanish dámelo = da me lo (1 token, 3 words), and contractions, like French du = de le (1 token, 2 words). Tokens that need to split into multiple words are called multiword tokens and can be further subdivided into those that can be handled by simple segmentation, like English cannot = can not, and those that require a more complex transduc- tion, like French du = de le. We call the latter non- segmental multiword tokens. In addition to multi- word tokens, the UD scheme also allows multitoken words, that is, words consisting of multiple tokens, such as numerical expressions like 20 000. 
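For illustration, a non-segmental multiword token such as French du = de le is represented in a CoNLL-U file roughly as follows: a range ID covers the surface token and the syntactic words follow on their own lines. The fragment below is simplified (most columns are omitted) and is not copied from a particular treebank.

```
1-2	du	_	_
1	de	de	ADP
2	le	le	DET
```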
2 https://github.com/yanshao9798/segmenter 3 Word Segmentation and Typological Factors We begin with the analysis of the difficulty of word segmentation. Word segmentation is fundamen- tally more difficult for languages like Chinese and Japanese because there are no explicit word bound- ary markers in the surface form (Xue, 2003). For Vietnamese, the space-segmented units are syllables that roughly correspond to Chinese characters rather than words. To characterise the challenges of word segmentation posed by different languages, we will examine several factors that vary depending on lan- guage and writing system. We will refer to these as typological factors although most of them are only indirectly related to the traditional notion of linguis- tic typology and depend more on writing system. • Character Set Size (CS) is the number of unique characters, which is related to how in- formative the characters are to word segmen- tation. Each character contains relatively more information if the character set size is larger. • Lexicon Size (LS) is the number of unique word forms in a dataset, which indicates how many unique word forms have to be identified by the segmentation system. Lexicon size in- creases as the dataset grows in size. • Average Word Length (AL) is calculated by dividing the total character count by the word count. It is negatively correlated with the den- sity of word boundaries. If the average word length is smaller, there are more word bound- aries to be predicted. • Segmentation Frequency (SF) denotes how likely it is that space-delimited units are fur- ther segmented. It is calculated by dividing the word count by the space-segment count. Lan- guages like Chinese and Japanese have much higher segmentation frequencies than space- delimited languages. • Multiword Token Portion (MP) is the per- centage of multiword tokens that are non- segmental. • Multiword Token Set Size (MS) is the number of unique non-segmental multiword tokens. The last two factors are specific to the UD scheme but can have a significant impact on word segmenta- tion accuracy. 422 Figure 1: K-Means clustering (K = 6) of the UD languages. PCA is applied for dimensionality reduction. CS LS AL SF MP MS 0.058 0.938 0.101 -0.043 -0.060 -0.028 Table 1: Pearson product-moment correlation coeffi- cients between dataset size and the statistical factors. All the languages in the UD dataset are charac- terised and grouped by the typological factors in Fig- ure 1. We standardise the statistics x of the proposed factors on the UD datasets with the arithmetic mean µ and the standard deviation σ as x−µ σ . We use them as features and apply K-Means clustering (K = 6) to group the languages. Principal component anal- ysis (PCA) (Abdi and Williams, 2010) is used for dimensionality reduction and visualisation. The majority of the languages in UD are space- delimited with few or no multiword tokens and they are grouped at the bottom left of Figure 1. They are statistically similar from the perspective of word segmentation. The Semitic languages Arabic and Hebrew with rich non-segmental multiword tokens are positioned at the top. In addition, languages with large character sets and high segmentation fre- quencies, such as Chinese, Japanese and Vietnamese are clustered together. Korean is distanced from the other space-delimited languages as it contains white-space delimiters but has a comparatively large character set. 
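The six factors and the clustering described above can be reproduced from a word-segmented corpus with a short script. The sketch below is our own illustration, not the authors' code: the input format, the variable names and the reading of MP as the share of tokens that are non-segmental multiword tokens are assumptions. The last function shows the standardisation, K-Means (K = 6) and PCA step with scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def typological_factors(sentences):
    """sentences: list of sentences, each a list of (token, words) pairs, where
    `token` is a space-delimited surface unit and `words` are the syntactic
    words it yields. A token counts as a non-segmental multiword token when its
    words cannot be recovered by simply splitting its characters."""
    chars, lexicon, mwt_types = set(), set(), set()
    n_words = n_chars = n_tokens = n_mwt = 0
    for sentence in sentences:
        for token, words in sentence:
            n_tokens += 1
            chars.update(token)
            for w in words:
                lexicon.add(w)
                n_words += 1
                n_chars += len(w)
            if len(words) > 1 and "".join(words) != token.replace(" ", ""):
                n_mwt += 1
                mwt_types.add(token)
    return [len(chars),            # CS: character set size
            len(lexicon),          # LS: lexicon size
            n_chars / n_words,     # AL: average word length
            n_words / n_tokens,    # SF: segmentation frequency
            n_mwt / n_tokens,      # MP: non-segmental multiword token portion
            len(mwt_types)]        # MS: unique non-segmental multiword tokens

def cluster_languages(factor_matrix, k=6):
    """Standardise the factors as (x - mu) / sigma, cluster the datasets with
    K-Means and project them to two dimensions with PCA for visualisation."""
    x_std = StandardScaler().fit_transform(factor_matrix)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(x_std)
    coords = PCA(n_components=2).fit_transform(x_std)
    return labels, coords
```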
Overall, the x-axis of Figure 1 is pri- marily related to character set size and segmentation Language CS LS AL SF MP MS Czech 140 125,342 4.83 1.26 0.0018 9 Czech-CAC 93 66,256 5.06 1.20 0.0022 12 Czech-CLIT 96 2,774 5.30 1.14 0.0005 1 English 108 19,672 4.06 1.24 0.0 0 English-LinES 82 7,436 4.01 1.22 0.0 0 English-ParTUT 94 5,532 4.50 1.22 0.0002 6 Finnish 244 49,210 6.49 1.28 0.0 0 Finnish-FTB 95 39,717 5.94 1.14 0.0 0 French 298 42,250 4.33 1.27 0.0281 9 French-ParTUT 96 3,364 4.53 1.27 0.0344 4 French-Sequota 108 8,452 4.48 1.29 0.0277 7 Latin 57 6,927 5.05 1.28 0.0 0 Latin-ITTB 42 12,526 5.06 1.24 0.0 0 Portuguese 114 26,653 4.15 1.32 0.0746 710 Portuguese-BR 186 29,906 4.11 1.29 0.0683 35 Russian 189 25,708 5.21 1.26 0.0 0 Russian-SynTagRus 157 107,890 5.12 1.30 0.0 0 Slovenian 99 29,390 4.63 1.23 0.0 0 Slovenian-SST 40 4,534 4.29 1.12 0.0 0 Swedish 86 12,911 4.98 1.20 0.0 0 Swedish-LinES 86 9,659 4.50 1.19 0.0 0 Table 2: Different UD datasets in same languages and the statistical factors. frequency, while the y-axis is mostly associated with multiword tokens. Dataset sizes for different languages in UD vary substantially. Table 1 shows the correlation coef- ficients between the dataset size in sentence num- ber and the six typological factors. Apart from the lexicon size, all the other factors, including multi- word token set size, have no strong correlations with dataset size. From Table 2, we can see that the 423 Char. On considère qu’environ 50 000 Allemands du Wartheland ont péri pendant la période. Tags BEXBIIIIIIIEXBIEBIIIIIEXBIIIIEXBIIIIIIIEXBEXBIIIIIIIIEXBIEXBIIEXBIIIIIEXBEXBIIIIIES Figure 2: Tags employed for word segmentation. 50 000 is a multitoken word, while qu’environ and du are multiword tokens that should be processed differently. factors, except for lexicon size, are relatively sta- ble across different UD treebanks for the same lan- guage, which indicates that they do capture proper- ties of these languages, although some variation in- evitably occurs due to corpus properties like genre. In this paper, we thoroughly investigate the corre- lations between the proposed statistical factors and segmentation accuracy. Moreover, we aim to find specific settings that can be applied to improve seg- mentation accuracy for each language group. 4 Sequence Tagging Model Word segmentation can be modelled as a character- level sequence labelling task (Xue, 2003; Chen et al., 2015). Characters as basic input units are passed into a sequence labelling model and a sequence of tags that are associated with word boundaries are predicted. In this section, we introduce the boundary tags adopted in this paper. Theoretically, binary classification is sufficient to indicate whether a character is the end of a word for segmentation. In practice, more fine-grained tagsets result in higher segmentation accuracy (Zhao et al., 2006). Following the work of Shao et al. (2017), we employ a baseline tagset consisting of four tags: B, I, E, and S, to indicate a character positioned at the beginning (B), inside (I), or at the end (E) of a word, or occurring as a single-character word (S). The baseline tagset can be applied to word seg- mentation of Chinese and Japanese without further modification. For languages with space-delimiters, we add an extra tag X to mark the characters, mostly spaces, that do not belong to any words/tokens. As illustrated in Figure 2, the regular spaces are marked with X while the space in a multitoken word like 50 000 is disambiguated with I. 
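Generating these tags from segmented training data is straightforward. The sketch below is our own illustration and assumes that the gold words and the positions of raw-text spaces are known; it reproduces the scheme of Figure 2, with X for spaces between tokens and I for spaces inside multitoken words such as 50 000.

```python
def boundary_tags(words, space_after):
    """words: gold words in order; space_after[i] is True if a raw-text space
    follows words[i]. Spaces inside a multitoken word (e.g. '50 000') belong to
    the word itself and therefore end up tagged I rather than X."""
    tags = []
    for word, space in zip(words, space_after):
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
        if space:
            tags.append("X")
    return tags

# The tail of the Figure 2 example:
# boundary_tags(["la", "période", "."], [True, False, False])
# -> ['B', 'E', 'X', 'B', 'I', 'I', 'I', 'I', 'I', 'E', 'S']
```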
To enable the model to simultaneously identify non-segmental multiword tokens for languages like Spanish and Arabic in the UD framework, we ex- tend the tagset by adding four tags B, I, E, S that correspond to B, I, E, S to mark corresponding Tags Applied Languages Baseline Tags B, I, E, S Chinese, Japanese, ... Boundary X Russian, Hindi, ... Transduction B, I, E, S Spanish, Arabic, ... Joint Sent. Seg. T, U All languages Table 3: Tag set for universal word segmentation. positions in non-segmental multiword tokens and to indicate their occurrences. As shown in Figure 2, the multiword token qu’environ is split into qu’ and environ and therefore the corresponding tags are BIEBIIIIIE. This contrasts with du, which should be transduced into de and le. Moreover, the extra tags disambiguate whether the multiword to- kens should be split or transduced according to the context. For instance, AÜØð (wamimma) in Arabic is occasionally split into ð (wa) and AÜØ (mimma) but more frequently transduced into ð (wa), áÓ (min) and AÓ (ma) . The corresponding tags are SBIE and BIIE, respectively. The transduction of the identi- fied multiword tokens will be described in detail in the following section. The complete tagset is summarised in Table 3. The proposed sequence model can easily be ex- tended to perform joint sentence segmentation by adding two more tags to mark the last character of a sentence (de Lhoneux et al., 2017). T is used if the character is a single-character word and U otherwise. T and U can be used together with B, I, E, S, X for general segmentation, or with B, I, E, S additionally for full UD processing. Joint sentence segmentation is not addressed any further in this paper. 5 Neural Networks for Segmentation 5.1 Main network The main network for regular segmentation as well as non-segmental multiword token identification is an adaptation of the BiRNN-CRF model (Huang et al., 2015) (see Figure 3). The input characters can be represented as con- 424 夏 天 太 热 (too) (hot)(summer) character representations GRU GRU GRU GRU forward RNN GRU GRU GRU GRU backward RNN B E S S CRF layer 太 热夏天output Figure 3: The BiRNN-CRF model for segmentation. The dashed arrows indicate that dropout is applied. ventional character embeddings. Alternatively, we employ the concatenated 3-gram model introduced by Shao et al. (2017). In this representation (Fig- ure 4), the pivot character in a given context is rep- resented as the concatenation of the character vec- tor representation along with the local bigram and trigram vectors. The concatenated n-grams encode rich local information as the same character has dif- ferent yet closely related vector representations in different contexts. For each n-gram order, we use a single vector to represent the terms that appear only once in the training set while training. These vectors are later used as the representations for unknown characters and n-grams in the development and test sets. All the embedding vectors are initialised ran- domly. The character vectors are passed to the forward and backward recurrent layers. Gated recurrent units (GRU) (Cho et al., 2014) are employed as the basic recurrent cell to capture long term dependencies and sentence-level information. Dropout (Srivastava et al., 2014) is applied to both the inputs and the out- puts of the bidirectional recurrent layers. A first- order chain CRF layer is added on top of the recur- 夏 天 太 热 (too) (hot)(summer) Vi,i Vi−1,i Vi−1,i+1 n-gram character representation V3 Figure 4: Concatenated 3-gram model. 
The third character is the pivot character in the given context. rent layers to incorporate transition information be- tween consecutive tags, which ensures that the op- timal sequence of tags over the entire sentence is obtained. The optimal sequence can be computed efficiently via the Viterbi algorithm. 5.2 Transduction The non-segmental multiword tokens identified by the main network are transduced into correspond- ing components in an additional step. Based on the statistics of the multiword tokens to be trans- duced on the entire UD training sets, 98.3% only have one possible transduction, which indicates that the main ambiguity of non-segmental multiword to- kens comes with identification, not transduction. We therefore transduce the identified non-segmental multiword tokens in a context-free fashion. For mul- tiword tokens with two or more valid transductions, we only adopt the most frequent one. In most languages that have multiword tokens, the number of unique non-segmental multiword to- kens is rather limited, such as in Spanish, French and Italian. For these languages, we build dictionar- ies from the training data to look up the multiword tokens. However, in some languages like Arabic and Hebrew, multiword tokens are very productive and therefore cannot be well covered by dictionar- ies generated from training data. Some of the avail- able external dictionary resources with larger cover- age, for instance the MILA lexicon (Itai and Wint- ner, 2008), do not follow the UD standards. In this paper, we propose a generalising ap- proach to processing non-segmental multiword to- kens. If there are more than 200 unique multi- word tokens in the training set for a language, we 425 Character embedding size 50 GRU/LSTM state size 200 Optimiser Adagrad Initial learning rate (main) 0.1 Decay rate 0.05 Gradient Clipping 5.0 Initial learning rate (encoder-decoder) 0.3 Dropout rate 0.5 Batch size 10 Table 4: Hyper-parameters for segmentation. train an attention-based encoder-decoder (Bahdanau et al., 2015) equipped with shared long-short term memory cells (LSTM) (Hochreiter and Schmidhu- ber, 1997). At test time, identified non-segmental multiword tokens are first queried in the dictionary. If not found, the segmented components are gen- erated with the encoder-decoder as character-level transduction. Overall, we utilise rich context to identify non-segmental multiword tokens, and then apply a combination of dictionary and sequence-to- sequence encoder-decoder to transduce them. 5.3 Implementation Our universal word segmenter is implemented us- ing the TensorFlow library (Abadi et al., 2016). Sentences with similar lengths are grouped into the same bucket and padded to the same length. We construct sub-computational graphs for each bucket so that sentences of different lengths are processed more efficiently. Table 4 shows the hyper-parameters adopted for the neural networks. We use one set of parame- ters for all the experiments as we aim for a sim- ple universal model, although fine-tuning the hyper- parameters on individual languages might result in additional improvements. The encoder-decoder is trained prior to the main network. The weights of the neural networks, including the embeddings, are initialised using the scheme introduced in Glo- rot and Bengio (2010). The network is trained us- ing back-propagation. All the random embeddings are fine-tuned during training by back-propagating gradients. Adagrad (Duchi et al., 2011) with mini- batches is employed for optimization. 
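Putting Sections 5.1–5.3 together, the sketch below illustrates the main network and the CRF decoding step. It is written in PyTorch purely for illustration (the paper's implementation uses TensorFlow), it omits the CRF training loss and the encoder-decoder transducer, and all class and argument names are our own; the default sizes follow Table 4.

```python
import torch
import torch.nn as nn

class BiGRUSegmenter(nn.Module):
    """Concatenated 3-gram embeddings -> bidirectional GRU -> per-character
    emission scores. A CRF layer adds an (n_tags x n_tags) transition matrix
    trained with a sentence-level loss; decoding is sketched below."""
    def __init__(self, n_uni, n_bi, n_tri, n_tags, emb=50, hidden=200, p_drop=0.5):
        super().__init__()
        self.uni = nn.Embedding(n_uni, emb)   # index 0 reserved for rare/unknown items
        self.bi = nn.Embedding(n_bi, emb)
        self.tri = nn.Embedding(n_tri, emb)
        self.drop = nn.Dropout(p_drop)
        self.rnn = nn.GRU(3 * emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, uni_ids, bi_ids, tri_ids):          # each: (batch, seq_len)
        x = torch.cat([self.uni(uni_ids), self.bi(bi_ids), self.tri(tri_ids)], dim=-1)
        h, _ = self.rnn(self.drop(x))
        return self.out(self.drop(h))                      # (batch, seq_len, n_tags)

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags); transitions[i, j]: score of moving from tag
    i to tag j. Returns the highest-scoring tag sequence (Viterbi recursion)."""
    seq_len, _ = emissions.shape
    score, backpointers = emissions[0].clone(), []
    for t in range(1, seq_len):
        cand = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = cand.max(dim=0)
        backpointers.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(backpointers):
        best.append(int(idx[best[-1]]))
    return best[::-1]
```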
The initial learning rate η0 is updated with a decay rate ρ. The encoder-decoder is trained with the unique non-segmental multiword tokens extracted from the training set, with 5% of the instances held out for validation. The model is trained for 50 epochs, and the proportion of outputs that exactly match the references is used for selecting the weights. For the main network, word-level F1-score is used to measure the performance of the model after each epoch on the development set. The network is trained for 30 epochs and the weights of the best epoch are selected. To increase efficiency and reduce memory demand both for training and decoding, we truncate sentences longer than 300 characters. At decoding time, the truncated sentences are reassembled at the recorded cut-off points in a post-processing step.

6 Experiments

6.1 Datasets and Evaluation

Datasets from Universal Dependencies 2.0 (Nivre et al., 2016) are used for all the word segmentation experiments (we employ the version that was used in the CoNLL 2017 shared task on UD parsing). In total, there are 81 datasets in 49 languages that vary substantially in size. The training sets are available in 45 languages. We follow the standard splits of the datasets. If no development set is available, 10% of the training set is held out for that purpose.

We adopt word-level precision, recall and F1-score as the evaluation metrics. The candidate and the reference word sequences in our experiments may not share the same underlying characters due to the transduction of non-segmental multiword tokens. The alignment between the candidate words and the references therefore becomes unclear, which makes it difficult to compute the associated scores. To resolve this issue, we use the longest common subsequence algorithm to align the candidate and the reference words. The matched words are compared and the evaluation scores are computed accordingly:

R = |c ∩ r| / |r|        (1)
P = |c ∩ r| / |c|        (2)
F = 2 · R · P / (R + P)  (3)

where c and r denote the sequences of candidate words and reference words, |c| and |r| are their lengths, and |c ∩ r| is the number of candidate words that are aligned to reference words by the longest common subsequence algorithm. The word-level evaluation metrics adopted in this paper are different from the boundary-based alternatives (Palmer and Burger, 1997). We adapt the evaluation script from the CoNLL 2017 shared task (Zeman et al., 2017) to calculate the scores. In the following experiments, we only report the F1-score.

In the following sections, we thoroughly investigate correlations between several language-specific characteristics and segmentation accuracy. All the experimental results in Section 6.2 are obtained on the development sets. The test sets are reserved for final evaluation, reported in Section 6.3.

6.2 Language-Specific Characteristics

6.2.1 Word-Internal Spaces

For Vietnamese and other languages with similar historical backgrounds, such as Zhuang and Hmongic languages (Zhou, 1991), the space-delimited syllables containing no punctuation are never segmented but joined into words with word-internal spaces instead. The space-delimited units can therefore be applied as the basic elements for tag prediction if we pre-split punctuation.

Basic Unit             F1-score    Training Time (s)
Latin Character        82.79       572
Space-delimited Unit   87.62       218

Table 5: Different segmentation units employed for word segmentation on Vietnamese. Concatenated 3-gram is not used.
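Returning to the evaluation metrics of Section 6.1, the LCS alignment and the scores in Eqs. (1)–(3) can be computed as in the following sketch (our own illustration of the procedure described there, not the adapted CoNLL 2017 script).

```python
def lcs_matches(candidate, reference):
    """Number of candidate words aligned to reference words by the longest
    common subsequence, computed with standard dynamic programming."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def word_prf(candidate, reference):
    """Word-level precision, recall and F1-score over aligned words."""
    matched = lcs_matches(candidate, reference)
    p = matched / len(candidate) if candidate else 0.0
    r = matched / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```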
Word seg- mentation for these languages thus becomes practi- cally the same as for Chinese and Japanese. Table 5 shows that a substantial improvement can be achieved if we use space-delimited syllables as the basic elements for word segmentation for Viet- namese. It also drastically increases both training and decoding speed as the sequence of tags to be predicted becomes much shorter. 6.2.2 Character Representation We apply regular character embeddings and con- catenated 3-gram vectors introduced in Section 5.1 to the input characters and test their performances 1 2 3 4 5 6 7 8 9 10 0.8 0.9 1 N/300 F 1- S co re Arabic Catalan Chinese English Japanese Spanish Figure 5: Segmentation results with unigram char- acter embeddings (dashed) and concatenated 3-gram vectors for character representations with different numbers of training instances N. respectively. First, the experiments are extensively conducted on all the languages with the full train- ing sets. The results show that the concatenated 3-gram model is substantially better than the regu- lar character embeddings on Chinese, Japanese and Vietnamese, but notably worse on Spanish and Cata- lan. For all the other languages, the differences are marginal. To gain more insights, we select six languages, namely Arabic, Catalan, Chinese, Japanese, English and Spanish for more detailed analysis via learn- ing curve experiments. The training sets are grad- ually extended by 300 sentences at a time. The results are shown in Figure 5. Regardless of the amounts of training data and the other typological factors, concatenated 3-grams are better on Chinese and Japanese and worse on Spanish and Catalan. We expect the concatenated 3-gram representation to outperform simple character embeddings on all languages with a large character set but no space de- limiters. Since adopting the concatenated 3-gram model drastically enlarges the embedding space, in the following experiments, including the final testing phase, concatenated 3-grams are only applied to Chinese, Japanese and Vietnamese. 427 1 2 3 4 5 6 7 8 9 10 0.6 0.7 0.8 0.9 1 N/300 F 1- S co re Arabic Chinese English Korean Russian Spanish Figure 6: Segmentation results with (dashed) and without space delimiters with different numbers of training instances N. 6.2.3 Space Delimiters Chinese and Japanese are not delimited by spaces. Additionally, continuous writing without spaces (scriptio continua) is evidenced in most Classical Greek and Latin manuscripts. We perform two sets of learning curve experiments to investigate the im- pact of white space on word segmentation. In the first set, we keep the datasets in their original forms. In the second set, we omit all white space. The ex- perimental results are presented in Figure 6. In general, there are huge discrepancies between the accuracies with and without spaces, showing that white space acts crucially as a word boundary in- dicator. Retaining the original forms of the space- delimited languages, very high accuracies can be achieved even with small amounts of training data as the model quickly learns that space is a reliable word boundary indicator. Moreover, we obtain rel- atively lower scores on space-delimited languages when space is ignored than Chinese using compara- ble amounts of training data, which shows that Chi- nese characters are more informative to word bound- ary prediction, due to the large character set size. 6.2.4 Non-Segmental Multiword Tokens The concept of multiword tokens is specific to UD. 
To explore how the non-segmental multiword tokens, as opposed to pure segmentation, influence 1 2 3 4 5 6 7 8 9 10 0.8 0.9 1 N/300 F 1- S co re Arabic French Hebrew Italian Portuguese Spanish Figure 7: Segmentation results with and without (dashed) processing non-segmental multiword to- kens with different training instances N. Language Data size Evaluation Scores Training Validation ACC MFS Arabic 3,500 184 77.84 82.64 Hebrew 2,995 157 84.81 92.35 Table 6: Accuracy of the seq2seq transducer on Ara- bic and Hebrew. segmentation accuracy, we conduct relevant experi- ments on selected languages. Similarly to the previ- ous section, two sets of learning curve experiments are performed. In the second set, all the multiword tokens that require transduction are regarded as sin- gle words without being processed. The results are presented in Figure 7. Word segmentation with full UD processing is no- tably more challenging for Arabic and Hebrew. Ta- ble 6 shows the evaluation of the encoder-decoder as the transducer for non-segmental multiword tokens on Arabic and Hebrew. The evaluation metrics ACC and MF-score (MFS) are adapted from the metrics used for machine transliteration evaluation (Li et al., 2009). ACC is exact match and MFS is based on edit distance. The transducer yields relatively higher scores on Hebrew while it is more challenging to process Arabic. In addition, different approaches to transducing the non-segmental multiword tokens are evaluated in Table 7. In the condition None, the identified non- 428 None Dictionary Transducer Mix Arabic 94.11 96.74 96.54 97.27 Hebrew 87.17 91.33 88.46 91.85 Table 7: Segmentation accuracies on Arabic and Hebrew with different ways of transducing non- segmental multiword tokens. segmental multiword tokens remain unprocessed. In Dictionary, they are mapped via the dictionary de- rived from training data if found in the dictionary. In Transducer, they are all transduced by the attention- based encoder-decoder. In Mix, in addition to utilis- ing the mapping dictionary, the non-segmental terms not found in the dictionary are transduced with the encoder-decoder. The results show that when the encoder-decoder is applied alone, it is worse than only using the dictionaries, but additional improve- ments can be obtained by combining both of them. The accuracy differences associated with non- segmental multiword tokens are nonetheless marginal on the other languages as shown in Figure 7. Regardless of their frequent occurrences, mul- tiword tokens are easy to process in general when the set of unique non-segmental multiword tokens is small. 6.2.5 Correlations with Accuracy We investigate the correlations between the pro- posed typological factors in Section 3 and segmen- tation accuracy using linear regression with Huber loss (Huber, 1964). The factors are used in addition to training set size as the features to predict the seg- mentation accuracies in F1-score. To collect more data samples, apart from experimenting with the full training data for each set, we also use smaller sets of 500, 1,000 and 2,000 training instances to train the models respectively if the training set is large enough. The features are standardised with the arith- metic mean and the standard deviation before fitting the linear regression model. The correlation coefficients of the linear regres- sion model are presented in Figure 8. We can see that segmentation frequency and multiword token set size are negatively correlated with segmentation accuracy. 
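The regression described in Section 6.2.5 can be reproduced along the following lines (a sketch using scikit-learn; the feature matrix X, with one row of raw factor values plus training set size per run, and the F1-score vector y are assumed to be prepared beforehand).

```python
from sklearn.linear_model import HuberRegressor
from sklearn.preprocessing import StandardScaler

FACTORS = ["TS", "CS", "LS", "AL", "SF", "MP", "MS"]

def factor_correlations(X, y):
    """Fit a Huber-loss linear regression of F1-scores on the standardised
    factors and return the coefficients, as visualised in Figure 8."""
    X_std = StandardScaler().fit_transform(X)
    model = HuberRegressor().fit(X_std, y)
    return dict(zip(FACTORS, model.coef_))
```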
Overall, the UD datasets are strongly bi- ased towards space-delimited languages. Training set size is therefore not a strong factor as high accu- TS CS LS AL SF MP MS −1 0 1 ·10−2 Figure 8: Correlation coefficients between segmen- tation accuracy and the typological factors in the lin- ear regression model. The factors are training set size (TS), character set size (CS), lexicon size (LS), average word length (AL), segmentation frequency (SF), multitoken word portion (MP) and multitoken word size (MS). racies can be obtained with small amounts of train- ing data, which is consistent with the results of all the learning curve experiments. The other typolog- ical factors such as average word length and lexi- con size are less relevant to segmentation accuracy. Referring back to Figure 1, segmentation frequency and multiword token set size as the most influen- tial factors, are also the primary principal compo- nents that categorise the UD languages into different groups. 6.2.6 Language-Specific Settings Our model obtains competitive results with only a minimal number of straightforward language- specific settings. Based on the previous analysis of segmentation accuracy and typological factors, re- ferring back to Figure 1, we apply the following settings, targeting on specific language groups, to the segmentation system on the final test sets. The language-specific settings can be applied to new lan- guages beyond the UD datasets based on an analysis of the typological factors. 1. For languages with word-internal spaces like Vietnamese, we first separate punctuation and then use space-delimited syllables for bound- 429 Space NLTK UDPipe This Paper 80.86 95.64 99.47 99.45 Table 8: Average evaluation scores on UD lan- guages, excluding Chinese, Japanese, Vietnamese, Arabic and Hebrew. ary prediction. 2. For languages with large character sets and no space delimiters, like Chinese and Japanese, we use concatenated 3-gram representations. 3. For languages with more than 200 unique non- segmental multiword tokens, like Arabic and Hebrew, we use the encoder-decoder model for transduction. 4. For other languages, the universal model is suf- ficient without any specific adaptation. 6.3 Final Results We compare our segmentation model to UDPipe (Straka and Straková, 2017) on the test sets. UDPipe contains word segmentation, POS tagging, morpho- logical analysis and dependency parsing models in a pipeline. The word segmentation model in UD- Pipe is also based on RNN with GRU. For efficiency, UDPipe has a smaller character embedding size and no CRF interface. It also relies heavily on white- space and uses specific configurations for languages in which word-internal spaces are allowed. Auto- matically generated suffix rules are applied jointly with a dictionary query to handle multiword tokens. Moreover, UDPipe uses language-specific hyper- parameters for Chinese and Japanese. We employ UDPipe 1.2 with the publicly avail- able UD 2.0 models.4 The presegmented option is enabled as we assume the input text to be preseg- mented into sentences so that only word segmen- tation is evaluated. In addition, the CoNLL shared task involved some test sets for which no specific training data were available. 
This included a number of parallel test sets of known languages, for which we apply the models trained on the standard tree- banks, as well as four surprise languages, namely Buryat, Kurmanji, North Sami and Upper Sorbian, for which we use the small annotated data samples provided in addition to the test sets by the shared 4http://hdl.handle.net/11234/1-2364 task to build models and evaluation on those lan- guages. The main evaluation results are shown in Table 9. We also report the Macro Average F1-scores. The scores of the surprise languages are excluded and presented separately as no corresponding UDPipe models are available. Our system obtains higher segmentation accuracy overall. It achieves substantially better accuracies on languages that are challenging to segment, namely Chinese, Japanese, Vietnamese, Arabic and Hebrew. The two systems yield very similar scores, when these languages are excluded as shown in Table 8, in which the two systems are also compared with two rule-based baselines, a simple space-based to- keniser and the tokenisation model for English in NLTK (Loper and Bird, 2002). The NLTK model obtains relatively high accuracy while the space- based baseline substantially underperforms, which indicates that relying on white space alone is insuffi- cient for word segmentation in general. On the ma- jority of the space-delimited languages without pro- ductive non-segmental multiword tokens, both UD- Pipe and our segmentation system yield near-perfect scores in Table 9. In general, referring back to Fig- ure 1, languages that are clustered at the bottom-left corner are relatively trivial to segment. The evaluation scores are notably lower on Semitic languages as well as languages without word delimiters. Nonetheless, our system obtains substantially higher scores on the languages that are more challenging to process. For Chinese, Japanese and Vietnamese, our sys- tem benefits substantially from the concatenated 3-gram character representation, which has been demonstrated in Section 6.2.2. Besides, we em- ploy a more fine-grained tagset with CRF loss in- stead of the binary tags adopted in UDPipe. As presented in Zhao et al. (2006), more fine-grained tagging schemes outperform binary tags, which is supported by the experimental results on morpheme segmentation reported in Ruokolainen et al. (2013). We further investigate the merits of the fine- grained tags over the binary tags as well as the ef- fectiveness of the CRF interface by the experiments presented in Table 10 with the variances of our seg- mentation system. The fine-grained tags denote the boundary tags introduced in Table 3. 
The binary 430 Dataset UDPipe This Paper Dataset UDPipe This Paper Dataset UDPipe This Paper Ancient Greek 99.98 99.96 Ancient Greek-PROIEL 99.99 100.0 Arabic 93.77 97.16 Arabic-PUD 90.92 95.93 Basque 99.97 100.0 Bulgarian 99.96 99.93 Catalan 99.98 99.80 Chinese 90.47 93.82 Croatian 99.88 99.95 Czech 99.94 99.97 Czech-CAC 99.96 99.93 Czech-CLTT 99.58 99.64 Czech-PUD 99.34 99.62 Danish 99.83 100.0 Dutch 99.84 99.92 Dutch-LassySmall 99.91 99.96 English 99.05 99.13 English-LinES 99.90 99.95 English-PUD 99.69 99.71 English-ParTUT 99.60 99.51 Estonian 99.90 99.88 Finnish 99.57 99.74 Finnish-FTB 99.95 99.99 Finnish-PUD 99.64 99.39 French 98.81 99.39 French-PUD 98.84 97.23 French-ParTUT 98.97 99.32 French-Sequoia 99.11 99.48 Galician 99.94 99.97 Galician-TreeGal 98.66 98.07 German 99.58 99.64 German-PUD 97.94 97.74 Gothic 100.0 100.0 Greek 99.94 99.86 Hebrew 85.16 91.01 Hindi 100.0 100.0 Hindi-PUD 98.26 98.82 Hungarian 99.79 99.93 Indonesian 100.0 100.0 Irish 99.38 99.85 Italian 99.83 99.54 Italian-PUD 99.21 98.78 Japanese 92.03 93.77 Japanese-PUD 93.67 94.17 Kazakh 94.17 94.21 Korean 99.73 99.95 Latin 99.99 100.0 Latin-ITTB 99.94 100.0 Latin-PROIEL 99.90 100.0 Latvian 99.16 99.56 Norwegian-Bokmaal 99.83 99.89 Norwegian-Nynorsk 99.91 99.97 Old Church Slavonic 100.0 100.0 Persian 99.65 99.62 Polish 99.90 99.93 Portuguese 99.59 99.10 Portuguese-BR 99.85 99.52 Portuguese-PUD 99.40 98.98 Romanian 99.68 99.74 Russian 99.66 99.96 Russian-PUD 97.09 97.28 Russian-SynTagRus 99.64 99.65 Slovak 100.0 99.98 Slovenian 99.93 100.0 Slovenian-SST 99.91 100.0 Spanish 99.75 99.85 Spanish-AnCora 99.94 99.93 Spanish-PUD 99.44 99.39 Swedish 99.79 99.97 Swedish-LinES 99.93 99.98 Swedish-PUD 98.36 99.26 Turkish 98.09 97.85 Turkish-PUD 96.99 96.68 Ukrainian 99.81 99.76 Urdu 100.0 100.0 Uyghur 99.85 97.86 Vietnamese 85.53 87.79 Average 98.63 98.90 Table 9: Evaluation results on the UD test sets in F1-scores. The datasets are represented in the correspond- ing treebank codes. PUD suffix indicates the parallel test data. Two shades of green/red are used for better visualisation, with brighter colours for larger differences. Green represents that our system is better than UDPipe and red is used otherwise. BT BT+CRF FT FT+CRF Chinese 90.54 90.66 90.73 91.28 Japanese 91.54 91.64 91.88 91.94 Vietnamese 87.63 87.95 87.61 87.75 Arabic 94.47 96.74 94.73 97.16 Hebrew 85.34 90.74 85.53 91.98 Table 10: Comparison between the binary tags (BT) and the fine-grained tags (FT) as well as the effec- tiveness of the CRF interface on the development sets. tags include two basic tags B, I plus the correspond- ing tags B, I for non-segmental multiword tokens. White space is marked as I instead of X. The con- catenated 3-grams are not applied. In general, the experimental results confirm that the fine-grained tags are more beneficial except for Vietnamese. The fine-grained tagset contains more structured posi- tional information that can be exploited by the word segmentation model. Additionally, the CRF in- terface leads to notable improvements, especially Arabic French German Hebrew UDPipe 79.34 98.91 94.21 71.87 Our model 91.35 97.50 94.21 86.17 Table 11: Percentages of the correctly processed multiword tokens on the development sets. for Arabic and Hebrew. The combination of the fine-grained tags with the CRF interface achieves substantial improvements over the basic binary tag model that is analogous to UDPipe. 
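For reference, the binary scheme used in this comparison can be derived deterministically from the fine-grained tags. The mapping below reflects our reading of the description above (single-character words collapse to B, while word-internal positions and spaces collapse to I) and is not taken from the paper's code.

```python
def to_binary(tags):
    """Collapse the fine-grained boundary tags to the binary scheme of Table 10.
    The corresponding tags for non-segmental multiword tokens are collapsed in
    the same way (not shown here)."""
    mapping = {"B": "B", "S": "B", "I": "I", "E": "I", "X": "I"}
    return [mapping[t] for t in tags]
```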
For Arabic and Hebrew, apart from greatly bene- fiting from the fine-grained tagset and the CRF inter- face, our model is better at handling non-segmental multiword tokens (Table 11). The attention-based encoder-decoder as the transducer is much more powerful in processing the non-segmental multi- word tokens that are not covered by the dictionary than the suffix rules for analysing multiword tokens in UDPipe. UDPipe obtains higher scores on a few datasets. Our model overfits the small training data of Uyghur 431 Segmentation UDPipe parser Dozat et al. (2017) Accuracy UAS LAS UAS LAS UDPipe This Paper UDPipe This Paper UDPipe This Paper UDPipe This Paper UDPipe This Paper Arabic 93.77 97.16 72.34 78.22 66.41 71.79 77.52 83.55 72.89 78.42 Chinese 90.47 93.82 63.20 67.91 59.07 63.31 71.24 76.33 68.20 73.04 Hebrew 85.16 91.01 62.14 71.18 57.82 66.59 67.61 76.39 64.02 72.37 Japanese 92.03 93.77 78.08 81.77 76.73 80.83 80.21 83.79 79.44 82.99 Vietnamese 85.53 87.79 47.72 50.87 43.10 46.03 50.28 53.78 45.54 48.86 Table 12: Extrinsic evaluations with dependency parsing on the test sets. The parsing accuracies are reported in unlabelled attachment score (UAS) and labelled attachment score (LAS). Space NLTK Sample Transfer Buryat 71.99 97.99 88.07 97.99 (Russian) Kurmanji 78.97 97.37 93.37 96.71 (Spanish) North Sami 79.07 99.20 92.82 99.81 (German) Upper Sorbian 72.35 94.60 93.34 93.66 (Spanish) Table 13: Evaluation on the surprise languages. as it yields 100.0 F1-score on the development set. For a few parallel test sets, there are punctuation marks not found in the training data that cannot be correctly analysed by our system as it is fully data- driven without any heuristic rules for unknown char- acters. The evaluation results on the surprise languages are presented in Table 13. In addition to the seg- mentation models proposed in this paper, we present the evaluation scores of a space-based tokeniser as well as the NLTK model for English. As shown by the previous learning curve experiments in Sec- tion 6.2, very high accuracies can be obtained on the space-delimited languages with only small amounts of training data. However, in case of extreme data sparseness (less than 20 training sentences), such as for the four surprise languages in Table 13 and Kazakh in Table 9, the segmentation results are dras- tically lower even though the surprise languages are all space-delimited. For the surprise languages, we find that applying segmentation models trained on a different language with more training data yields better results than re- lying on the small annotated samples of the target language. Considering that the segmentation model is fully character-based, we simply select the model of the language that shares the most characters with the surprise language as its segmentation model. No annotated data of the surprise language are used for model selection. As shown in Table 13, the transfer approach achieves comparable segmentation accu- racies to NLTK. For space-delimited languages with insufficient training data, it may be beneficial to em- ploy a well-designed rule-based word segmenter as NLTK occasionally outperforms the data-driven ap- proach. As a form of extrinsic evaluation, we test the seg- menter in a dependency parsing setup on the datasets where we obtained substantial improvements over UDPipe. We present results for the transition-based parsing model in UDPipe 1.2 and for the graph- based parser by Dozat et al. (2017). The experimen- tal results are shown in Table 12. 
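The character-overlap heuristic used above to pick a transfer model for the surprise languages can be sketched as follows (our own illustration; the data structures are assumptions).

```python
def pick_transfer_model(surprise_text, training_charsets):
    """training_charsets: {language: set of characters in its training data}.
    Return the language whose character inventory overlaps most with the raw
    text of the surprise language, as a proxy for script similarity."""
    target = set(surprise_text)
    return max(training_charsets,
               key=lambda lang: len(target & training_charsets[lang]))
```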
We can see that word segmentation accuracy has a great impact on parsing accuracy as the segmentation errors propa- gate. Having a more accurate word segmentation model is very beneficial for achieving higher pars- ing accuracy. 7 Related Work The BiRNN-CRF model is proposed by Huang et al. (2015) and has been applied to a number of se- quence labelling tasks, such as part-of-speech tag- ging, chunking and named entity recognition. Our universal word segmenter is a major exten- sion of the joint word segmentation and POS tagging system described by Shao et al. (2017). The origi- nal model is specifically developed for Chinese and only applicable to Chinese and Japanese. Apart from being language-independent, the proposed model in this paper employs an extended tagset and a comple- mentary sequence transduction component to fully process non-segmental multiword tokens that are present in a substantial amount of languages, such as Arabic and Hebrew in particular. It is a gener- alised segmentation and transduction framework. Our universal model is compared with the 432 This Paper Shao Che Björkelund Chinese 93.82 95.21 91.19 92.81 Japanese 93.77 94.79 92.95 91.68 Arabic 97.16 – 93.71 95.53 Hebrew 91.01 – 85.16 91.37 Table 14: Comparison between the universal model and the language-specific models. language-specific model of Shao et al. (2017) in Ta- ble 14. With pretrained character embeddings, en- semble decoding and joint POS tags prediction as introduced in Shao et al. (2017), considerable im- provements over the universal model presented in this paper can be obtained. However, the joint POS tagging system is difficult to generalise as single characters in space-delimited languages are usually not informative for POS tagging. Additionally, com- pared to Chinese, sentences in space-delimited lan- guages have a much greater number of characters on average. Combining the POS tags with segmenta- tion tags drastically enlarges the search space and therefore the model becomes extremely inefficient both for training and tagging. The joint POS tag- ging model is nonetheless applicable to Japanese and Vietnamese. Monroe et al. (2014) present a data-driven word segmentation system for Arabic based on a sequence labelling framework. An extended tagset is designed for Arabic-specific orthographic rules and applied together with hand-crafted features in a CRF frame- work. It obtains 98.23 F1-score on newswire Ara- bic Treebank,5 97.61 on Broadcast News Treebank,6 and 92.10 on the Egyptian Arabic dataset.7 For He- brew, Goldberg and Elhadad (2013) perform word segmentation jointly with syntactic disambiguation using lattice parsing. Each lattice arc corresponds to a word and its corresponding POS tag, and a path through the lattice corresponds to a specific word segmentation and POS tagging of the sentence. The proposed model is evaluated on the Hebrew Tree- bank (Guthmann et al., 2009). The joint word seg- mentation and parsing F1-score (76.95) is reported and compared against the parsing score (85.70) with gold word segmentation. The evaluation scores re- 5LDC2010T13, LDC2011T09, LDC2010T08 6LDC2012T07 7LDC2012E93,98,89,99,107,125, LDC2013E12,21 ported in both Monroe et al. (2014) and Goldberg and Elhadad (2013) are not directly comparable to the evaluation scores on Arabic and Hebrew in this paper, as they are obtained on different datasets. For universal word segmentation, apart from UD- Pipe described in Section 6.3, there are several systems that are developed for specific language groups. 
Che et al. (2017) build a similar Bi-LSTM word segmentation model targeting languages with- out space delimiters like Chinese and Japanese. The proposed model incorporates rich statistics-based features gathered from large-scale unlabelled data, such as character unigram embeddings, character bigram embeddings and the point-wise mutual in- formation of adjacent characters. Björkelund et al. (2017) use a CRF-based tagger for multiword token rich languages like Arabic and Hebrew. A predicted Levenshtein edit script is employed to transform the multiword tokens into their components. The evalu- ation scores on a selected set of languages reported in Che et al. (2017) and Björkelund et al. (2017) are included in Table 14 as well. More et al. (2018) adapt existing morphologi- cal analysers for Arabic, Hebrew and Turkish and present ambiguous word segmentation possibilities for these languages in a lattice format (CoNLL- UL) that is compatible with UD. The CoNLL-UL datasets can be applied as external resources for pro- cessing non-segmental multiword tokens.8 8 Conclusion We propose a sequence tagging model and apply it to universal word segmentation. BiRNN-CRF is adopted as the fundamental segmentation frame- work that is complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens. We propose six typological fac- tors to characterise the difficulty of word segmen- tation cross different languages. The experimental results show that segmentation accuracy is primarily correlated with segmentation frequency as well as the set of non-segmental multiword tokens. Using whitespace as delimiters is crucial to word segmen- tation, even if the correlation between orthographic tokens and words is not perfect. For space-delimited 8CoNLL-UL is not evaluated in our experiments as it is very recent work. 433 languages, very high accuracy can be obtained even with relatively small training sets, while more train- ing data is required for high segmentation accuracy for languages without spaces. Based on the analy- sis, we apply a minimal number of language-specific settings to substantially improve the segmentation accuracy for languages that are fundamentally more difficult to process. The segmenter is extensively evaluated on the UD datasets in various languages and compared with UDPipe. Apart from obtaining nearly perfect segmentation on most of the space-delimited lan- guages, our system achieves high accuracies on lan- guages without space delimiters such as Chinese and Japanese as well as Semitic languages with abundant multiword tokens like Arabic and Hebrew. Acknowledgments We acknowledge the computational resources pro- vided by CSC in Helsinki and Sigma2 in Oslo through NeIC-NLPL (www.nlpl.eu). This work is supported by the Chinese Scholarship Council (CSC) (No. 201407930015). We would like to thank the TACL editors and reviewers for their valuable feedback. References Martı́n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI), pages 265–283. Hervé Abdi and Lynne J Williams. 2010. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4):433–459. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- gio. 2015. 
Neural machine translation by jointly learning to align and translate. In International Con- ference on Learning Representations. Anders Björkelund, Agnieszka Falenska, Xiang Yu, and Jonas Kuhn. 2017. IMS at the CoNLL 2017 UD shared task: CRFs and perceptrons meet neural net- works. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 40–51. Wanxiang Che, Jiang Guo, Yuxuan Wang, Bo Zheng, Huaipeng Zhao, Yang Liu, Dechuan Teng, and Ting Liu. 2017. The HIT-SCIR system for end-to-end pars- ing of universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 52–62. Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term mem- ory neural networks for Chinese word segmentation. In Conference on Empirical Methods in Natural Lan- guage Processing, pages 1197–1206. Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bah- danau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder ap- proaches. arXiv preprint arXiv:1409.1259. Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to Universal De- pendencies – look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies., pages 207–217. Timothy Dozat, Peng Qi, and Christopher D. Man- ning. 2017. Stanford’s graph-based neural depen- dency parser at the CoNLL 2017 shared task. In Pro- ceedings of the CoNLL 2017 Shared Task: Multilin- gual Parsing from Raw Text to Universal Dependen- cies, pages 20–30, Vancouver, Canada, August. Asso- ciation for Computational Linguistics. John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural net- works. In International Conference on Artificial Intel- ligence and Statistics, pages 249–256. Yoav Goldberg and Michael Elhadad. 2013. Word seg- mentation, unknown-word resolution, and morpholog- ical agreement in a Hebrew parsing system. Computa- tional Linguistics, 39(1):121–160, March. Noemie Guthmann, Yuval Krymolowski, Adi Milea, and Yoad Winter. 2009. Automatic annotation of mor- phosyntactic dependencies in a modern Hebrew. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories, pages 1–12. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735– 1780. Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidi- rectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991. Peter J Huber. 1964. Robust estimation of a location pa- rameter. The annals of mathematical statistics, pages 73–101. 434 Alon Itai and Shuly Wintner. 2008. Language resources for Hebrew. Language Resources and Evaluation, 42(1):75–98. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilis- tic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Con- ference on Machine Learning, ICML ’01, pages 282– 289. Haizhou Li, A Kumaran, Vladimir Pervouchine, and Min Zhang. 2009. Report of NEWS 2009 machine translit- eration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 1–18. 
Edward Loper and Steven Bird. 2002. NLTK: The nat- ural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computa- tional linguistics-Volume 1, pages 63–70. Association for Computational Linguistics. Will Monroe, Spence Green, and Christopher D Man- ning. 2014. Word segmentation of informal arabic with domain adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 206–211. Amir More, Özlem Çetinoğlu, Çağrı Çöltekin, Nizar Habash, Benoı̂t Sagot, Djamé Seddah, Dima Taji, and Reut Tsarfaty. 2018. CoNLL-UL: Universal morpho- logical lattices for Universal Dependency parsing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation. Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Na- talia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 1659–1666. David Palmer and John Burger. 1997. Chinese word seg- mentation and information retrieval. In AAAI Spring Symposium on Cross-Language Text and Speech Re- trieval, pages 175–178. Teemu Ruokolainen, Oskar Kohonen, Sami Virpioja, and Mikko Kurimo. 2013. Supervised morphological seg- mentation in a low-resource learning setting using con- ditional random fields. In Proceedings of the Sev- enteenth Conference on Computational Natural Lan- guage Learning, pages 29–37, Sofia, Bulgaria. Asso- ciation for Computational Linguistics. Ivan A Sag, Timothy Baldwin, Francis Bond, Ann Copes- take, and Dan Flickinger. 2002. Multiword expres- sions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Com- putational Linguistics, pages 1–15. Springer. Yan Shao, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2017. Character-based joint segmenta- tion and POS tagging for Chinese using bidirectional RNN-CRF. In Proceedings the 8th International Joint Conference on Natural Language Processing, pages 173–183. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Re- search, 15(1):1929–1958. Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UD- Pipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal De- pendencies, pages 88–99. Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, pages 29–48. Daniel Zeman, Martin Popel, Milan Straka, Jan Ha- jic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, Fran- cis Tyers, Elena Badmaeva, Memduh Gokirmak, Anna Nedoluzhko, Silvie Cinkova, Jan Hajic jr., Jaroslava Hlavacova, Václava Kettnerová, Zdenka Uresova, Jenna Kanerva, Stina Ojala, Anna Missilä, Christopher D. 
Manning, Sebastian Schuster, Siva Reddy, Dima Taji, Nizar Habash, Herman Leung, Marie-Catherine de Marneffe, Manuela Sanguinetti, Maria Simi, Hiroshi Kanayama, Valeria dePaiva, Kira Droganova, Héctor Martínez Alonso, Çağrı Çöltekin, Umut Sulubacak, Hans Uszkoreit, Vivien Macketanz, Aljoscha Burchardt, Kim Harris, Katrin Marheinecke, Georg Rehm, Tolga Kayadelen, Mohammed Attia, Ali Elkahky, Zhuoran Yu, Emily Pitler, Saran Lertpradit, Michael Mandl, Jesse Kirchner, Hector Fernandez Alcalde, Jana Strnadová, Esha Banerjee, Ruli Manurung, Antonio Stella, Atsuko Shimada, Sookyoung Kwak, Gustavo Mendonca, Tatiana Lando, Rattima Nitisaroj, and Josie Li. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19. Hai Zhao, Chang-Ning Huang, Mu Li, and Bao-Liang Lu. 2006. Effective tag set selection in Chinese word segmentation via conditional random field modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 87–94. Youguang Zhou. 1991. The family of Chinese character-type scripts. Sino-Platonic Papers, 28.

© 2020 Authors. This work is licensed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/).
CONNECTIONS Issue 1 | Vol. 40 | Article | DOI: 10.21307/connections-2019.012
The contingent effect of work roles on brokerage in professional organizations
Anssi Smedlund1,* and Emily W. Choi2
1 Finnish Institute of Occupational Health, Helsinki, Uusimaa.
2 Naveen Jindal School of Management, The University of Texas at Dallas, Texas, TX.
*E-mail: anssi.smedlund@ttl.fi

Abstract
In this paper, we consider whether brokerage in an intra-organizational communication network and type of work role interact to predict individual performance in a professional organization. The independent–interdependent nature of work roles is considered a key factor in structural contingency theory, but is yet to be studied in relation to brokerage. We propose that a brokerage position has a joint effect on performance along with work role in a study of the organization-wide communication network in an architectural firm with 65 employees. Our analysis suggests an association between brokerage and role-prescribed performance for individuals in both interdependent and independent types of work roles. Our findings also suggest that for interdependent roles requiring broad, organization-wide collaboration and communication with others, brokerage is positively associated with the performance prescribed by the role, but for independent roles, wherein collaboration and communication are somewhat limited by the formal role, brokerage has far less of an effect. Our findings contribute to brokerage theory by comparing how brokerage affects performance in two distinct work roles, illustrating how the benefits of brokerage seem more restricted to those in interdependent work roles. The contribution of this paper is to suggest the independent–interdependent nature of work role as a boundary condition for brokerage.

Keywords: Network theory, Brokerage, Formal organization, Informal organization, Work role, Interdependency

Brokerage theory is probably one of the most influential lines of thought in network theory.
Since Burt's (1992) seminal book, brokerage has been studied and associated with numerous organizational advantages for an individual, such as higher salary and faster promotion, based on the benefits of having better access to information and greater control over other actors than their more socially constrained colleagues (Burt, 2004). While brokerage generally provides benefits, some studies have found that under context-specific conditions, there may not always be positive effects (e.g. Barnes et al., 2016; Burt, 1998; Fleming et al., 2007). Therefore, as the empirical results are mixed, the contingencies and boundary conditions for brokerage merit further research. An important, but generally overlooked boundary condition in structural network analysis is the independent–interdependent nature of work roles, even though interdependency has been widely used as a moderator in general management studies and is at the core of structural contingency theory (Thompson, 1967). Interdependence regulates how much individuals can communicate and collaborate with others to perform their work effectively (Cummings, 1978; Wang et al., 2019), and is one of the most important factors influencing team performance in organizations (Langfred, 2005; Saavedra et al., 1993). This empirical study considered how the benefits of brokerage are contingent on the independent–interdependent nature of the work role, that is, how the informal organization, operationalized as the intra-organizational communication network structure, corresponds with the formal work role for performance. The novelty and value of this approach is that, although the formal and informal organizations have each been linked to performance outcomes, their effects on each other have seldom been linked in the literature (McEvily et al., 2014). Following the logic of structural contingency theory to test the formal work role's effect as a boundary condition for brokerage, our paper explores whether work role moderates the effect of brokerage on performance in professional organizations. Drawing from the nature of work at our case architectural firm, our hypothesis starts from the premise that work in a professional organization is generally anchored in either independent or interdependent tasks, and work roles are formed accordingly. Both independent and interdependent roles prompt role-prescribed performance expectations due to the division of labor: professionals typically work on intellectually demanding projects, requiring them to focus their energy on the highly demanding operative work, simultaneously creating demand for interdependent roles to manage the projects and the supporting organizations (cf. Etzioni, 1964; Weber, 1982). Thus, the formal work role not only limits interdependence for the professionals, but also requires managers to adopt the interdependent role and to communicate and collaborate broadly across the organization. In this paper, we hypothesize that when brokerage and role-prescribed performance are aligned, individuals perform best.
Our data are derived from a communication network study of an architectural firm of 93 employees, of which 65 were classified as working mainly in either independent (n = 31) or interdependent (n = 34) roles. We chose to study this specific architectural firm as a typical professional organization because the firm had clearly distinguished work roles for the independent professional architects and the interdependent managers (the organization's work role structures and innovation activities were studied extensively in a two-year research project with 13 theme-based interviews and several workshops; the results are reported in a PhD thesis, Tuominen, 2013). In our analysis, we examined the moderating effect of work role on the association between brokerage and role-specific individual performance. Role-specific performance involved an objective measure of either billable hours for the professional architects (i.e. independent work role) or a peer evaluation score of managers as promoters of ideas (i.e. interdependent work role). Our study speaks to the underlying assumptions about the effects of brokerage. Despite a few negative reports in the literature (e.g. Barnes et al., 2016; Fleming et al., 2007), the general understanding about brokerage is that it benefits the individual in most circumstances. In this regard, our study contributes to brokerage theory by pointing out an important contingency of work role. In practical terms, our findings imply that independent professionals, such as architects, do not seem to benefit much from networking and bridge building, since these are less related to their role-prescribed performance. In the context of management theory, our study contributes to a better understanding of the interplay of formal and informal organizations. These two topics have historically remained separate and unconnected (McEvily et al., 2014) but have spurred a number of integrative studies (Biancani et al., 2014; Kleinbaum et al., 2013; Soda and Zaheer, 2012; Srivastava, 2015). By treating work role as a moderator in molding the association between brokerage and performance, we address the gap in the literature on the topic by extending structural network analysis with contingencies (Adler and Kwon, 2002; Carnabuci and Oszegi, 2015; Cross and Cummings, 2004; Hansen, 1999; Mehra et al., 2001). By doing this, we increase the explanatory power of structural analysis (Lincoln, 1990; cf. Lincoln and Miller, 1979), resulting in increased knowledge of how informal network position is associated with role-prescribed performance.

Contingent effect of work role on the relationship between brokerage and performance
The basic tenet of our study is that there are, on the whole, fundamental differences in the communication and cooperation requirements between independent and interdependent work roles. According to classical management theory, work roles outline a kind of bureaucratic boundary for social relationships that individuals can and should adhere to and engage in within their organization – when an individual is assigned a certain role, then the communication network becomes somewhat inherited and defined by the role (Hansen and Haas, 2001; Lincoln and Miller, 1979; McEvily et al., 2014; Merton, 1939; Weber, 1982).
Over time, individuals develop informal networks largely corresponding the role-prescribed relationships (Lincoln and Miller, 1979; Padgett and Ansell, 1993), but the networks reach beyond the formal bureaucratic boundaries as individuals communicate freely across the organization (Krackhardt, 1994). In addition to communication, formal division of labor and corresponding roles also affect expected performance. Previous research has noted that a work role defines what types of activities an individual performs, prompts normative expectations in an organization, and sets the standard for how performance is evaluated (Biddle, 1986; Katz and Kahn, 1978; Welbourne et al., 1998). In the most extreme cases, performance that is not prescribed by the work role is prohibited and only the type of performance established for the role is rewarded (Pfeffer and Salancik, 1975). Conceivably because performance expectations are strongly determined by the work role, a notable body of research studies specifically considers work role performance, and the conditions to manage and maximize it (Griffin et al., 2007; Leroy et al., 2015). Work roles having a contingent effect on the relationship between brokerage and performance can be analyzed using structural contingency theory combined with the conception of organization as a socio-technical system. As a socio-technical system, professional organization is a combination of social, interpersonal communication networks, and technical roles specified by the formal division of labor, wherein the formal aspects interact with the social aspects of performance (Cummings, 1978). From this perspective, work role is derived from technology and corresponds with Thompson’s (1967) pooled task interdependence for independent work and reciprocal task interdependence for interdependent work. In the former category, rules and standard procedures provide enough coordination for the individuals and teams to work independently toward a common goal, and in the latter category, the coordination mechanism involves a mutual adjustment, as the work is performed together to produce the output. Specifically, the independent– interdependent nature of work has been a key focus of research related to team performance (Cummings, 1978; Langfred, 2005; Wang et al., 2019). In these studies, interdependence is built-in to the work the team performs, and then treated as a moderator of aspects such as group autonomy, collective efficacy, group potency, organizational citizenship or diversity for several different types of outcomes (Bachrach et al., 2006; Langfred, 2000, 2005; Stajkovic et al., 2009; Wang et al., 2019). Notable in the results of these studies is the support for the mechanisms derived from Thompson’s (1967) theory that demonstrate that the need for communication and cooperation increases along with an increase in the task interdependency, complexity of goals and feedback (Saavedra et al., 1993). In professional organizations, these dimensions become increasingly complex amid higher positions in formal hierarchy simply because managers tend to have increasingly broader job descriptions than their subordinates and participate in a larger number of overlapping projects of various kinds. Typically, managers are experienced professionals in their field, and they perform some of the client project work in addition to fulfilling the expectations toward sales, organizational development and coordinating activities in their departments or other work units (e.g. Etzioni, 1964). 
A manager's goals are in this respect defined from both above and below their hierarchical position, and they receive feedback for their work from several others aside from their immediate colleagues. In contrast, professionals are technical specialists, and performing their job well generally requires spending more time at their desks working on specific projects, thus having inherently higher independence incorporated in their work roles, even if their projects may require combining several individuals' work. Table 1 summarizes how professionals and managers differ in terms of interdependency based on the dimensions identified by Saavedra et al. (1993). A brokerage position in a communication network of interdependent work roles can provide a major boost to effective communication and cooperation. Studies show that brokerage provides the best position to coordinate work across other areas of a work communication network (Burt, 1992; Granovetter, 1973) and increases the ability to convey ideas across the organization to the distant individuals in the network (Reagans and McEvily, 2003). Brokerage also increases the chances that an individual's activities are considered and engaged with by others and, concomitantly, judged valuable (Burt, 2004). In general, brokerage means less structural constraint, more diversity, and weaker ties (Aral and Van Alstyne, 2011), and allows individuals to benefit from non-redundant sources of knowledge (Hansen, 1999). The more interdependent the work role is, the greater the need for brokerage in a professional organization. Our hypothesis evaluates how brokerage in the communication network and independent–interdependent work roles interact with each other:

H1. Work role moderates the relationship between brokerage and role-prescribed performance such that the contribution of brokerage is stronger when the work role is interdependent compared to independent.

Methods
Data
We tested our hypotheses using data collected in an architectural firm during a two-year study. We collected questionnaire and timesheet data from employees who participated in client projects, resided in the same open office, and were employed during the first and second years of the study (n = 65). To control for common method variance and develop a causal argument on the network positions and performance, the data on dependent variables were collected in the second year of the study from time sheets and from an additional online survey. In total, there were 93 employees at the start of the study; the remaining 28 employees worked at other physical locations, left the company or belonged to administrative staff (e.g. information system administration and payroll). There were five formal roles: professionals, project managers, senior project managers, and managing partners. The professional architects were coded as independent roles (n = 31) and all manager roles were coded as interdependent roles (n = 34). The professionals performed different aspects of design and drawings, and managers attended to sales, project management, and development. Based on 13 interviews about work roles and innovative activities at the case company reported by Tuominen (2013), the professionals were clearly a distinct group from the managers and were allowed to focus mainly on their solitary architectural design work.
Conversely, managers were given broad responsibilities in managing work units and engaged in development and innovation. The case firm invested heavily in innovation and development and, just before the beginning of our data collection, promoted several individuals previously working as professional architects to project managers. Both work roles required talent and extensive training in architectural design, but they differed in communication patterns: the managers had to communicate across the firm to participate in and supervise several development projects. A total of 33% of the sample were women, and 83% had a master's degree in architecture, which is the minimal required training for certified architects. The remaining 17% had a bachelor's degree or vocational degree in a related design field. The average tenure was 9.25 years (SD = 6.83) for managers and 5.17 years (SD = 4.89) for professionals.

Table 1. Differences between independent and interdependent work roles in a professional organization.
Typical formal role. Independent: Professional. Interdependent: Manager.
Task interdependency. Independent: Client projects of several sequential and parallel tasks to be worked on alone and coordinated within the project team. Interdependent: Supervision over work units, selling, negotiating, and participating in several client projects; member of business development and strategy development teams.
Goal interdependency. Independent: Client project provides clear goals for each individual and for the compiled output of the project. Interdependent: Several goals coming from projects, firm, and clients.
Feedback interdependency. Independent: Individuals receive feedback from colleagues working on the same project; collective feedback provided by superior and client during and after the project. Interdependent: Feedback from subordinates, from clients and from top management; feedback from several projects.
Requirements for collaboration and coordination. Independent: Lower. Interdependent: Higher.

Measures
Dependent variables
We used billable hours from clients as a dependent variable of role-prescribed performance for the independent work roles. This was constructed based on time sheets, where the employees had allocated their working time to a variety of categories. We chose billable hours from the client category as a performance measure of independent work roles because the firm aimed at maximizing it and it was directly linked to annual profit. We calculated a monthly mean of the number of billable hours to generate a uniform variable describing individuals' average performance through the year. Monthly mean billable hours were 94.76 (SD = 38.04) for interdependent roles and 114.16 (SD = 22.27) for independent roles. The variable was normalized with a second power transformation to adjust its skew. For the variable describing role-prescribed performance for interdependent work roles, we chose promoting of new ideas. Following the survey examples from Wasserman and Faust (1994), the variable was constructed from a questionnaire in which the respondent was asked to name five individuals in the firm who promote new ideas. Each nomination received one point, and points were summed, resulting in a count variable of interdependent work roles' performance. This procedure was chosen because it provides a single component measure of a person's perceived competence and ability to put forth actions in the organization that will eventually lead to innovation (March, 1991). This measure also corresponds with the current understanding of creativity that highlights the generation of both novel and useful ideas (Amabile, 1996) and provides a measure to identify those individuals who are both coming up with ideas and promoting them. The variable was normalized with a square root transformation to adjust its skew.
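As an illustration of the two dependent variables described above, the following sketch shows how they could be assembled with pandas from a timesheet table and a nomination table. The column names (employee_id, month, category, hours, nominee_id) are hypothetical and not taken from the paper; only the transformations mirror the second-power and square-root adjustments mentioned in the text.

```python
# Illustrative sketch only (not the authors' code): assembling the two role-prescribed
# performance variables from hypothetical timesheet and nomination tables.
import numpy as np
import pandas as pd

def billable_hours_dv(timesheet: pd.DataFrame) -> pd.Series:
    """Monthly mean of client-billable hours per employee, squared to adjust skew."""
    client = timesheet[timesheet["category"] == "client"]
    monthly = client.groupby(["employee_id", "month"])["hours"].sum()
    mean_monthly = monthly.groupby(level="employee_id").mean()
    return mean_monthly ** 2  # second power transformation, as in the text above

def idea_promotion_dv(nominations: pd.DataFrame, employee_ids) -> pd.Series:
    """Count of 'promotes new ideas' nominations per employee, square-root transformed."""
    counts = nominations["nominee_id"].value_counts()
    counts = counts.reindex(employee_ids, fill_value=0)  # keep zero-nomination employees
    return np.sqrt(counts)
```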
Independent variables
The network data consisted of information on self-reported social ties in three types of work-related communication, collected through an online sociometric survey instrument in the first year of the study. Preliminary interviews consistently identified three types of informal, work-related interaction among employees that we distinguished in our survey: communication about (i) the day-to-day architectural design work, (ii) innovative new ideas, and (iii) business development. The network data were obtained from a free-choice survey with two-way directed questions, wherein the giving-information-to and getting-information-from components of communication were asked separately (Wasserman and Faust, 1994). Thus, three network survey question pairs were used: one for communication about the day-to-day architectural design work, one for innovative new ideas, and one for business development. The wording of the questions was tailored to reflect the conditions of the company based on the interviews and was checked with one of the supervisors before publishing the survey online. A one-sentence example was given for all three types of communication. Communication about the day-to-day architectural design work was defined as relating to the recurring output delivered to clients within the realm of the respondent's line of expertise. Communication about innovative new ideas was defined as work-related ideas that the respondent was not aware of anyone else proposing previously. Business development communication was defined as communication about improvements in pre-existing products or services, or in internal company processes or personal work practices. The response rate was 90% for the questions about communication in day-to-day architectural design work and business development tasks and 84% for communicating innovative new ideas. In the online survey, the network questions were presented after a filtering question wherein the employees had defined their own acquaintances from a roster of all employee names. The small organization size permitted a full roster method, which rules out recall bias, thus increasing the reliability of the network measures (Marsden, 2011). Separating the giving and getting components of communication further increases psychometric reliability by allowing confirmation of each social tie (Krackhardt, 1990). The frequency scale for communication was set to the choices of (4) daily, (3) weekly, (2) once a month, (1) less than once a month, or (0) not at all. We transposed the getting-information-from component in each of the network question pairs and joined the resulting two networks, using the value of the giving-information component as the communication frequency, resulting in confirmed communication ties between individuals. Before generating the brokerage measures, we combined the three networks by summing up the frequencies and then dichotomizing at the mean frequency (MIN = 1, MAX = 12, MEAN = 3.411, SD = 2.47).
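The paper computed its network measures with Ucinet VI and a Stata package (described in the next subsection). The sketch below is a rough NetworkX analogue, under the assumption that the giving and getting reports are available as square adjacency matrices of reported frequencies; it is meant only to illustrate the tie-confirmation, aggregation, dichotomization and brokerage steps, not to reproduce the original computations.

```python
# Rough NetworkX analogue of the steps described above and of the brokerage measures
# introduced in the next subsection. Assumption: the paper used Ucinet VI and Stata,
# not this code; 'giving' and 'getting' are adjacency matrices of reported frequencies.
import networkx as nx
import numpy as np

def confirmed_network(giving, getting):
    """Keep tie i -> j only if i reports giving to j and j reports getting from i;
    the giving frequency is used as the tie value."""
    return np.where((giving > 0) & (getting.T > 0), giving, 0)

def combined_dichotomized(networks):
    """Sum the three confirmed networks, dichotomize at the mean of the non-zero
    frequencies, and return an undirected graph."""
    total = sum(networks)
    threshold = total[total > 0].mean()
    adjacency = (total >= threshold).astype(int)
    return nx.from_numpy_array(np.maximum(adjacency, adjacency.T))

def brokerage_measures(graph):
    """1 - Burt's constraint (higher = more brokerage opportunities) and betweenness."""
    constraint = nx.constraint(graph)
    return {
        "inv_constraint": {n: 1.0 - c for n, c in constraint.items()},
        "betweenness": nx.betweenness_centrality(graph, normalized=False),
    }
```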
Brokerage
Our first brokerage measure was the additive inverse of Burt's constraint (Burt, 1992). First, we generated Burt's constraint with the Ucinet VI structural holes routine, limiting the measure to consider only an individual's contacts' ties and using both outgoing and incoming ties. Then we generated our brokerage measure by calculating 1 minus constraint, following recent network studies (Carnabuci and Oszegi, 2015; Soda et al., 2019). Thus, the higher the resulting brokerage measure, the more brokering opportunities the individual has. In other words, our measure indicates how an individual's communication is concentrated in non-redundant contacts in groups of connected colleagues, because less constrained actors are connected to more groups of others (Burt, 1992). In our analysis, the higher the additive inverse of Burt's constraint, the better the opportunities for brokerage the individual has. As the second brokerage measure, we used betweenness centrality (Freeman, 1977), generated with the Stata "nwcommands" package. We added betweenness centrality to the measures because it has been frequently used as an additional brokerage measure (e.g. Fang et al., 2015).

Independent work role
We created a dummy variable to distinguish between independent and interdependent roles. All individuals in any of the manager roles (n = 34) were coded as interdependent (0) and all individuals in professional architect roles (n = 31) were coded as independent (1).

Control variables
We asked the human resource manager of the company to provide us with demographic data on the employees. From those data, we created control variables for organizational tenure, gender, and education to be used in our models because they were found to be significant in earlier studies of network positions and various outcome variables (Reagans and McEvily, 2003; Reagans and Zuckerman, 2001). Language skills and age were also considered in evaluating the modeling strategy, but these variables did not increase the explanatory power of the models and were dropped. Individuals were very homogeneous in terms of language skills, and age was highly correlated with tenure. There were six divisions in the firm specializing in certain types of architectural projects, for example, office buildings or shopping malls. We checked the intraclass correlation (ICC) between the units to determine whether unit affiliation is a considerable source of variance in performance and did not find justification for hierarchical models.

Results
Table 2 presents bivariate correlations and descriptive statistics of the variables. Dependent variables and work role are numbered 1 to 3, followed by control variables and brokerage measures.

Table 2. Means, standard deviations, and correlations.
1 Billable hours: mean 103.88, SD 32.89.
2 Promoting new ideas: mean 4.18, SD 6.15; correlation with (1) −0.56**.
3 Independent work role: mean 0.47, SD 0.50; correlations with (1) 0.30*, (2) −0.41**.
4 Tenure: mean 7.37, SD 6.31; correlations with (1) −0.07, (2) 0.16, (3) −0.32**.
5 Female: mean 0.33, SD 0.48; correlations with (1) 0.13, (2) −0.18, (3) 0.11, (4) −0.13.
6 Master's degree: mean 0.83, SD 0.38; correlations with (1) −0.06, (2) 0.20, (3) −0.39**, (4) 0.08, (5) −0.20.
7 Inverse of Burt's constraint: mean 0.06, SD 0.06; correlations with (1) −0.19, (2) 0.41**, (3) −0.35**, (4) 0.25*, (5) −0.41**, (6) 0.06.
8 Betweenness centrality: mean 77.29, SD 89.72; correlations with (1) −0.32**, (2) 0.77**, (3) −0.27*, (4) 0.12, (5) −0.25*, (6) 0.11, (7) 0.60**.
Notes: *p < 0.05; **p < 0.01.
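A descriptive table of this kind can be produced directly from the analysis data set. The sketch below is illustrative only; the variable names are hypothetical and the paper does not state which software produced Table 2.

```python
# Illustrative sketch (hypothetical variable names): computing the descriptive statistics
# and bivariate Pearson correlations of the kind reported in Table 2 with pandas.
import pandas as pd

VARS = ["billable_hours", "promoting_new_ideas", "independent_role",
        "tenure", "female", "masters_degree", "inv_constraint", "betweenness"]

def descriptives_and_correlations(df: pd.DataFrame):
    summary = df[VARS].agg(["mean", "std"]).T        # means and standard deviations
    correlations = df[VARS].corr(method="pearson")   # bivariate correlation matrix
    return summary, correlations
```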
We found a positive correlation between both brokerage measures and idea promotion. The correlation of billable hours with betweenness centrality is significantly negative, while its correlation with the inverse of Burt's constraint is not significant. The independent role (1 = independent, 0 = interdependent) correlates positively with the number of billable hours and negatively with idea promotion, which supports the assumption of distinct output expectations between the work roles. Having a master's degree and tenure correlate negatively with the independent work role, indicating that those in interdependent work roles have higher education and higher tenure than those in independent roles. Both brokerage measures are positively intercorrelated, as expected. We z-standardized all independent variables to facilitate better interpretation of the moderation effect, as suggested by Dawson (2014). Tables 3 and 4 present the results of the regression analyses testing the association between the inverse of Burt's constraint, betweenness centrality, promoting new ideas, and billable hours. As postestimation of the models showed heteroscedasticity of the residuals caused by slight non-normality of the transformed dependent variables, we used robust standard errors to control for this, as suggested by Antonakis and Dietz (2011). OLS regression was chosen because it has been considered a valid modeling strategy when network measures are included as independent variables (e.g. Reagans and McEvily, 2003; Srivastava, 2015). However, the network measures violate the independence of observations, which is one of the key assumptions of OLS regression, resulting in underestimated standard errors and over-rejection of hypotheses (Srivastava, 2015). To correct this, we chose a procedure suggested by Borgatti et al. (2018) and compared the results of our conventional OLS models with those obtained from Ucinet VI node-level regression, which uses OLS regression to generate the coefficients but a permutation technique for the p-values. As both modeling techniques are presented side by side in Tables 3 and 4, it can be observed that the permutation technique generally results in higher t-values for those coefficients that are statistically significant, providing additional support for our results. Our hypothesis about the work role's boundary effect on brokerage means that brokerage is associated with higher work role-prescribed performance if the role is interdependent. In other words, as employees in interdependent work roles are expected to engage in promoting new ideas in the organization, they benefit from brokerage. To test this aspect of the hypothesis, we first examined the main effects of the inverse of Burt's constraint and betweenness centrality and then their interactions with independent versus interdependent work role.

Table 3. Results of conventional and node-level OLS regression analysis for promoting new ideas (t-values in parentheses). Columns: Model 1 (conventional OLS), Model 2 (permutation OLS), Model 3 (conventional OLS), Model 4 (permutation OLS).
Independent work role: −0.54 (−1.62); −0.54 (−1.79); −0.82 (−3.08)**; −0.82 (−3.25)**.
Tenure: 0.06 (0.45); 0.06 (0.50); 0.10 (0.79); 0.10 (0.91).
Female: 0.34 (1.57); 0.34 (1.17); 0.22 (1.20); 0.22 (0.92).
Master's degree: 0.15 (0.60); 0.15 (0.41); 0.11 (0.52); 0.11 (0.35).
Inverse of Burt's constraint (Models 1 and 2): 2.10 (3.47)**; 2.10 (4.97)**.
Independent × Inv. of Burt's constraint (Models 1 and 2): −1.78 (−2.81)**; −1.78 (−3.76)**.
Betweenness centrality (Models 3 and 4): 0.86 (5.92)**; 0.86 (7.22)**.
Independent × Betweenness (Models 3 and 4): −0.33 (−1.25); −0.33 (−1.12).
Constant: 1.20 (3.09)**; 1.20 (na); 1.65 (5.80)**; 1.65 (na).
R2: 0.48; 0.48; 0.62; 0.62. n: 65; 65; 65; 65.
Note: **p < 0.01.

Table 4. Results of OLS and node-level regression analysis for billable hours (t-values in parentheses). Columns: Model 5 (conventional OLS), Model 6 (permutation OLS), Model 7 (conventional OLS), Model 8 (permutation OLS).
Independent work role: 0.43 (1.48); 0.43 (1.44); 0.62 (2.07)*; 0.62 (2.07)*.
Tenure: 0.10 (0.69); 0.10 (0.72); 0.09 (0.65); 0.09 (0.73).
Female: 0.09 (0.37); 0.09 (0.34); 0.10 (0.41); 0.10 (0.46).
Master's degree: 0.31 (1.05); 0.31 (0.89); 0.29 (0.92); 0.29 (0.80).
Inverse of Burt's constraint (Models 5 and 6): −0.91 (−2.55)*; −0.91 (−2.20)*.
Independent × Inv. of Burt's constraint (Models 5 and 6): 1.21 (3.06)**; 1.21 (2.60)*.
Betweenness centrality (Models 7 and 8): −0.25 (−1.67); −0.25 (−1.83).
Independent × Betweenness (Models 7 and 8): 0.32 (1.12); 0.32 (0.94).
Constant: −0.34 (−0.86); −0.34 (na); −0.53 (−1.32); −0.53 (na).
R2: 0.18; 0.18; 0.14; 0.14. n: 65; 65; 65; 65.
Notes: *p < 0.05; **p < 0.01.

According to the main effects of the brokerage measures in Models 1 to 4 in Table 3, brokerage is associated with higher scores for promoting new ideas. When examining the significant interaction effect of the work role in Models 1 and 2 in Table 3, it is evident that employees in interdependent roles benefit from brokerage more than those in independent roles for promoting new ideas. For example, in Models 1 and 2, the positive effect of the inverse of Burt's constraint for interdependent work roles is 2.10, and for independent work roles the effect is 2.08 − 1.78 = 0.30. Further, according to our hypothesis, for independent work roles brokerage should have less effect on role-prescribed performance than for interdependent roles. In Table 4, brokerage is modeled with billable hours, which is the role-prescribed performance measure for independent work roles. The main effects of Models 5 and 6 in Table 4 indicate that brokerage is negatively associated with billable hours. The interaction effect of the work role in Model 5 shows that the negative effect of the inverse of Burt's constraint for interdependent work roles is −0.91 and for independent work roles the effect is −0.90 − 1.21 = −2.11. This shows that higher brokerage is associated with lower role-prescribed performance for independent roles. According to the main effects of the models presented in Tables 3 and 4, brokerage seems to be associated with higher performance in idea promotion and lower performance in billable hours, regardless of work role. However, in order to distinguish the work role-specific effects, further examination is needed. For this purpose, we examined the interactions by studying the simple slopes, which is a standard procedure for probing interaction patterns (Dawson, 2014). After generating the significances of the simple slopes for the interactions of the models with the inverse of Burt's constraint and betweenness centrality for both promoting new ideas and billable hours (Models 1, 3, 5, and 7 in Tables 3 and 4) with Stata's "margins" procedure (Table A1), we confirmed the hypothesis. For promoting new ideas, the coefficients of the simple slopes were statistically significant and positive for interdependent roles with both brokerage measures. Betweenness centrality was positive and significant also for independent roles, suggesting that brokerage is also associated with independent professionals promoting their ideas. This was the case for the independent professionals in our study who were not expected to promote new ideas, which was evident because 12 professionals out of 31 received zero nominations as promoters of new ideas.
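The moderation models above were estimated in Stata and Ucinet VI. The following sketch shows how an analogous conventional OLS with an interaction term, robust standard errors, and simple slopes could be set up with statsmodels in Python; it is an assumption-laden illustration (hypothetical variable names, no node-level permutation p-values), not a reproduction of the reported models.

```python
# Approximate statsmodels sketch of a moderated OLS with robust standard errors and
# simple slopes. Assumptions: hypothetical variable names; the paper's analyses were
# run in Stata and Ucinet VI, and permutation-based p-values are not reproduced here.
import pandas as pd
import statsmodels.formula.api as smf

def moderated_ols(df: pd.DataFrame, dv: str, broker: str):
    """OLS of a role-prescribed performance DV on a z-standardized brokerage measure,
    the work-role dummy, their interaction, and controls, with HC3 robust errors."""
    formula = f"{dv} ~ {broker} * independent_role + tenure + female + masters_degree"
    return smf.ols(formula, data=df).fit(cov_type="HC3")

def simple_slopes(result, broker: str):
    """Slope of the brokerage measure at each value of the work-role dummy."""
    interaction = f"{broker}:independent_role"
    return {
        "interdependent (role = 0)": result.params[broker],
        "independent (role = 1)": result.params[broker] + result.params.get(interaction, 0.0),
    }

# Usage sketch:
# fit = moderated_ols(df, dv="promoting_new_ideas", broker="inv_constraint_z")
# print(fit.summary()); print(simple_slopes(fit, "inv_constraint_z"))
```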
Notably, for interdependent roles, brokerage is associated with lower number of billable hours. Discussion Our study adds knowledge on the relation of brokerage to performance and improves the empirical understanding of how formal organization is related to the informal. In our case organization, our finding is that brokerage is associated with higher role- prescribed performance for those in interdependent roles, but not for those in independent roles. Therefore, our findings show that work role is a contingency, a boundary condition for brokerage. As brokers are bridging structurally distinct groups (Adler and Kwon, 2000; Burt, 1992, 1997; Reagans and McEvily, 2008), brokerage correlates with managerial performance in our empirical setting but does not have an association with independent professional’s performance measured with the amount of billable hours. Theoretical contributions By presenting work role as a boundary condition for brokerage, this paper makes several contributions to theory. First, the study complements earlier studies on the interplay of formal and informal organization (Biancani et al., 2014; Kleinbaum et al., 2013; Soda and Zaheer, 2012; Srivastava, 2015). Our results show that formal and informal structures reinforce each other, as proposed by McEvily et al. (2014) as one interaction mechanism between the formal and informal structures. Second, the study complements the contingency perspective on network theory. The network theory’s structuralist argument suggests a direct causal link from brokerage to performance (e.g. Emirbayer and Goodwin, 1994; Mayhew, 1980), traditionally giving less attention to contingencies. This has probably led to underrepresentation of network studies taking moderators such as work roles into account, only with few exceptions (Ahuja et al., 2003; Brass, 1981; Burt, 1998; Ibarra and Andrews, 1993), and only quite recently, the individual attributes as contingencies have been consistently included in network studies (Landis, 2016). Therefore, this study contributes to most brokerage literature that seems to imply that brokerage position benefits the broker, and sometimes but not always the network, all the time under all circumstances. Our study provides empirical evidence that suggests that it is in the role of managers to broker relations and communication among the horizontally and vertically differentiated units and employees for which they have responsibility. Our study suggests that formal work role not only greatly influences the performance targets, but also limits the advantage of brokerage to the behavior prescribed by the work role only for interdependent work roles. The strength of our study is in its organization- wide approach. We obtained network data from the entire population of employees in the firm with a particularly detailed survey questionnaire backed up with qualitative interviews. We separately surveyed giving-and-getting types of informal work-related communication ties enabling improved accuracy in examining brokerage. This so-called whole-network approach increases the validity of the brokerage measures used in the models (Borgatti et al., 2018). The second strength of our study is that it measured the role-prescribed performance with objective performance data: independent work role’s performance with billable hours from the time sheets and interdependent role’s performance with peer evaluations of idea promotion. 
By doing so, we complement the studies that have connected organization-wide networks and work performance (Brass, 1981; Carboni and Ehrlich, 2013; Cross and Cummings, 2004; Mehra et al., 2001; Sparrowe et al., 2001). Limitations and future research Despite its contributions, this study has several limitations providing motivation for further research. The first one concerns the case study character of our study. Our data were gathered from one firm, which limits the generalizability of the results. Yet, we were able to collect detailed survey, timesheet, and demographic data about the individuals working in the firm, resulting in organization-wide, bounded network data with the dependent variables that were meaningful proxies for performance. Confirming with the interviews and reviewing the self-reported job descriptions of professionals and supervisors, we concluded that the architect office seems like so many professional organizations, where work requires both high talent and extensive training, and where managers have professional backgrounds. The architects are regulated by a national regulatory agency with certification exams, and most of the individuals we studied were certified architects, thus the professional’s work in the firm was similar compared to firms in the same industry. The firm was well established in its market, and the employee 56 The contingent effect of work roles on brokerage in professional organizations turnover was relatively low providing prerequisites for established communication network structures and divisions of labor between the work roles. The second limitation is the reverse causality caused by common method variance, which is the usual limitation discussed in survey-based network studies (e.g. Carboni and Ehrlich, 2013; Sparrowe et al., 2001). We addressed common method variance by constructing our network variables from the first year of study and used the dependent variables from the second year. According to the assumptions of the structuralist approach of network theory, we assumed the causality of brokerage predicting performance in our research design. Our approach speaks to this causality, but as the communication network structures may take time to develop and become rigid, we are still left with a concern of reverse causality in which performance leads to structural advantage to some extent. This may be the case with the employees in interdependent roles, since brokerage was, as expected, associated with a higher idea promoter score, and promoters have a tendency to become central individuals (Obstfeld, 2005; Scott, 2000), making our idea promoter DV actually a measure of prestige. Nevertheless, becoming prestigious in a professional organization arguably requires brokerage between others, so we are certain to have captured the right phenomenon with our measure of idea promotion. The third limitation is related to alternative explanations on the mechanisms of why the nature of work role moderates brokerage. Our argumentation developed around Thompson’s (1967) idea of more independent roles (e.g. professional architects in our case) requiring less collaboration and coordination, thus benefiting less from brokerage is in line with previous research. However, differences in legitimacy between professional architects and managers would provide an alternative explanation for our hypothesis in our data. 
For example, Burt (1998) shows that women do not benefit from brokerage unless they have a more senior mentor as a sponsor, and argues that this effect is common for all low-status individuals in an organization (Burt and Merluzzi, 2014). High-status versus low- status distinction is not entirely unrelated to the interdependent–independent distinction in our paper as the managers, on average, in our case firm have higher tenure and education levels than professional architects. However, nothing in our interviews and discussions in the company signaled to us about a possible legitimacy problem in the company. The fourth limitation is related to our performance measures. Billable hours as a measure of pro fessional’s performance is uniform across all indi viduals, but the idea promotion score merits further examination. Superior evaluations have been the most commonly used across previous network studies, despite variation across superiors (Teigland and Wasko, 2009). Our peer evaluation method’s strongpoint is that it rules out the variance between different supervisors evaluating their subordinates. We considered peer evaluation meaningful, because the size of the firm was rather small, and everyone knew each other since they shared the same open office space. Future research could extend the findings of this paper in numerous directions. One direction comes from the contribution of this paper suggesting the independent–interdependent work role as a boundary condition for brokerage. As brokerage theory has been applied to a wide range of work contexts, which might be argued to vary in terms of interdependence, the interdependence aspect has not been at the core of their research design implying that it should be equally well applicable to both. Moreover, most of the empirical evidence of benefits of brokerage up until now has been done exclusively with managers, therefore coming from the work that is fundamentally interdependent (e.g. chain managers or investment bankers). Further research would be needed to complement brokerage theory with work role point of view to clarify this specific boundary condition. Further research could also investigate more how status differences and legitimacy issues between individuals act as boundary conditions. For theorizing this stream of research, brokerage theory could benefit from hypotheses of status differences coming from evolutionary psychology and behavioral economics. Management theory’s formal–informal aspects present another future research direction. An innovative approach would be to study the co- existence and effectiveness of formal and informal structures with operationalizing formal structures not only as role hierarchies but also as workflow networks derived from project data and control for clearly work-related communication between superiors and employees. As most professional organizations are not as stratified as architectural firms, participating in the same project would serve as a proxy for formal structure. Novel data gathering methods about informal social structure could also be used. Since work-related communication is increasingly taking place digitally, communication data can be gathered from databases in addition to self-administered surveys. 
By analyzing the content of communication by text mining; for example, examining the content of e-mails individuals send each other, and dividing the content between formal and informal communication, would shed light on a multiplicity of relationships, 57 CONNECTIONS efficiency and innovativeness on a large scale, and answer the question as to how these network structures are associated with each other. Managerial implications In addition to the theoretical contributions, our study has implications for managers of professional organizations. According to the extant understanding in the managerial practice, successful organizations are both highly efficient in what they do and capable of adapting to changes. Typically, in professional organizations, professionals work primarily on tasks requiring specialized skills and competence, and managers work primarily in project management, sales, and offering development. Executives of professional organizations, at least in the most artistically and intellectually demanding kind, such as architecture, should therefore proceed with caution with the ideas of flattening formal hierarchies and divisions of labor in their organizations, in order to sustain simultaneous managerial capacity and professional performance. The finding that brokerage affords limited advantage to independent professionals suggests that, contrary to common belief, such people maybe should not invest a great deal of their time in networking and bridge building if that is not what their professional roles require. An informal organization in a professional organization can thus be seen as a mixture of independent professionals and interdependent managers. A successful firm balancing efficiency and adaptation is one that provides room for both independent and interdependent work roles and considers that not everyone should behave as brokers. Conclusion In this paper, we examined how work role moderates the advantages of brokerage for role-prescribed performance. Our findings suggest that the advantage is contingent upon the work role and brokerage facilitates role-prescribed performance for individuals in interdependent roles but not for those in independent roles. References Adler, P. and Kwon, S.-W. 2000. “Social capital: the good, the bad, and the ugly”, In Lesser, E. (Ed.), Know­ ledge and Social Capital. Foundations and Applications. Butterworth-Heineman, Boston, MA, pp. 89–119. Adler, P. S. and Kwon, S.-W. W. 2002. Social capital: prospects for a new concept. Academy of Management Review 27: 17–40. Ahuja, M. K., Galletta, D. F. and Carley, K. M. 2003. Individual centrality and performance in virtual R&D groups: an empirical study. Management Science 49: 21–38. Amabile, T. M. 1996. Creativity in Context, vol. xviii Boulder, CO, 317 pp. Antonakis, J. and Dietz, J. 2011. Looking for validity or testing it? The perils of stepwise regression, extreme- scores analysis, heteroscedasticity, and measurement error. Personality and Individual Differences 50: 409–415. Aral, S. and Van Alstyne, M. 2011. The diversity- bandwidth trade-off. American Journal of Sociology 117: 90–171. Bachrach, D. G., Powell, B. C., Collins, B. J. and Richey, R. G. 2006. Effects of task interdependence on the relationship between helping behavior and group per- formance. Journal of Applied Psychology 91: 1396–1405. Barnes, M., Kalberg, K., Pan, M. and Leung, P. 2016. When is brokerage negatively associated with economic benefits? Ethnic diversity, competition, and common-pool resources. 
Social Networks 45: 55–65. Biancani, S., McFarland, D. A. and Dahlander, L. 2014. The Semiformal Organization. Organization Science 25: 1306–1324. Biddle, B. 1986. Recent developments in role theory. Annual Review of Sociology 12: 67–92. Borgatti, S. P., Everett, M. G. and Johnson, J. C. 2018. Analyzing Social Networks, Sage, London. Brass, D. J. 1981. Structural relationships, job charac- teristics, and worker satisfaction and per formance. Administrative Science Quarterly 26: 331–348. Burt, R. S. 1992. Structural Holes: The Social Structure of Competition, vol. 58, Harvard University Press, Cambridge MA, available at: http://books. google.com/books?id=E6v0cVy8hVIC Burt, R. S. 1997. The contingent value of social capital. Administrative Science Quarterly 42: 339–365. Burt, R. S. 1998. The gender of social capital. Rationality and Society 10: 5–46. Burt, R. S. 2004. Structural holes and good ideas. American Journal of Sociology 110: 349–399. Burt, R. S. and Merluzzi, J. 2014. Embedded brokerage: hubs versus locals. Research in the Sociology of Organizations 40: 161–77. Carboni, I. and Ehrlich, K. 2013. The effect of relational and team characteristics on individual performance: a social network perspective. Human Resource Management 52: 511–535. Carnabuci, G. and Oszegi, D. I. 2015. Social networks, cognitive style, and innovative performance: a contingency perspective. Academy of Management Journal 58: 881–905. Cross, R. and Cummings, J. N. 2004. Tie and network correlates of individual performance in 58 The contingent effect of work roles on brokerage in professional organizations knowledge-intensive work. Academy of Management Journal 47: 928–937. Cummings, T. G. 1978. Self-regulating work groups: a socio-technical synthesis. Academy of Management Review 3: 625–634. Dawson, J. F. 2014. Moderation in management research: what, why, when, and how. Journal of Business and Psychology 29: 1–19. Emirbayer, M. and Goodwin, J. 1994. Network analysis , culture, and the problem of agency. American Journal of Sociology 99: 1411–1454. Etzioni, A. 1964. Modern Organizations Prentice- Hall, Englewood Cliffs, NJ. Fang, R., Landis, B., Zhang, Z., Anderson, M. H., Shaw, J. D. and Kilduff, M. 2015. Outcomes in organizations integrating personality and social networks: a meta-analysis of personality, network position, and work outcomes in organizations. Organization Science 26: 1243–1260. Fleming, L., Mingo, S. and Chen, D. 2007. Collaborative brokerage, generative creativity, and creative success. Administrative Science Quarterly 52: 443–475. Freeman, L. C. 1977. A set of measures of centrality based on betweenness. Sociometry 40: 35–40. Granovetter, M. S. 1973. The strength of weak ties. American Journal of Sociology 78: 1360–1380. Griffin, M. A., Neal, A. and Parker, S. K. 2007. A new model of work role performance: positive behavior in uncertain and interdependent contexts. Academy of Management Journal 50: 327–347. Hansen, M. T. 1999. The search-transfer problem: the role of weak ties in sharing knowledge across organization subunits. Administrative Science Quarterly 44: 82–111. Hansen, M. T. and Haas, M. R. 2001. Competing for attention in knowledge markets: electronic document dissemination in a management consulting company. Administrative Science Quarterly 46: 1–28. Ibarra, H. and Andrews, S. B. 1993. Power, social influence, and sense making: effects of network centrality and proximity on employee perceptions. Administrative Science Quarterly 38: 277–303. Katz, D. and Kahn, R. L. 1978. 
The Social Psychology of Organizations 2nd ed., Wiley, New York, NY. Kleinbaum, A. M., Stuart, T. E. and Tushman, M. L. 2013. Discretion within constraint: homophily and structure in a formal organization. Organization Science 24: 1316–1357. Krackhardt, D. 1990. Assessing the political landscape: structure, cognition, and power in organizations. Admini­ strative Science Quarterly 35: 342–369. Krackhardt, D. J. 1994. “Graph theoretical dimensions of informal organizations”, In Carley, K. M. and Prietula, M. J. (Eds), Computational Organization Theory, L. Erlbaum Associates, Hillsdale, NJ, pp. xvii, 318 pp. Landis, B. 2016. Personality and social networks in organizations: a review and future directions. Journal of Organizational Behavior 37: S107–S121. Langfred, C. W. 2000. Work-group design and autonomy: a field study of the interaction between task interdependence and group autonomy. Small Group Research 31: 54–70. Langfred, C. W. 2005. Autonomy and performance in teams: the multilevel moderating effect of task interdependence. Journal of Management 31: 513–529. Leroy, H., Anseel, F., Gardner, W. L. and Sels, L. 2015. Authentic leadership, authentic followership, basic need satisfaction, and work role performance: a cross- level study. Journal of Management 41: 1677–1697. Lincoln, J. R. 1990. Social structures: a network approach. Administrative Science Quarterly 35: 748–752. Lincoln, J. R. and Miller, J. 1979. Work and friendship ties in organizations: a comparative analysis of relational networks. Administrative Science Quarterly 24: 181–199. March, J. G. 1991. Exploration and exploitation in organizational learning. Organization Science 2: 71–87. Marsden, P. (2011), “Survey methods for network data”, In Scott, J. and Carrington, P. J. (Eds), The Sage Handbook of Social Network Analysis, Sage Publications, London, pp. 370–388. Mayhew, B. H. 1980. Structuralism Versus Indi- vidualism: part 1, Shadowboxing in the Dark. Social Forces 59: 335–375. McEvily, B., Soda, G. and Tortoriello, M. 2014. More formally: rediscovering the missing link between formal organization and informal social structure. The Academy of Management Annals 8: 299–345. Mehra, A., Kilduff, M. and Brass, D. J. 2001. The social networks of high and low self-monitors: implications for workplace performance. Administrative Science Quarterly 46: 121–146. Merton, R. 1939. Bureaucratic structure and personality. Social Forces 18: 560–568. Obstfeld, D. 2005. Social Networks, the tertius iungens orientation, and involvement in innovation. Administrative Science Quarterly 50: 100–130. Padgett, J. F. and Ansell, C. K. 1993. Robust action and the rise of the Medici, 1400-1434. American Journal of Sociology 98: 1259–1319. Pfeffer, J. and Salancik, G. R. 1975. Determinants of supervisory behavior: a role set analysis. Human Relations 28: 139–154. Reagans, R. and McEvily, B. 2003. Network structure and knowledge transfer: the effects of cohesion and range. Administrative Science Quarterly 48: 240–267. Reagans, R. and McEvily, B. 2008. Contradictory or compatible? Reconsidering the “Trade-Off” between brokerage and closure on knowledge sharing. Network Strategy 25: 275–313. Reagans, R. and Zuckerman, E. W. 2001. Networks, diversity, and productivity: the social capital of corporate R&D teams. Organization Science 12: 502–517. Saavedra, R., Earley, P. C. and Van Dyne, L. 1993. Complex interdependence in task-performing groups. Journal of Applied Psychology 78: 61–72. 59 CONNECTIONS Scott, J. 2000. 
Appendix

Table A1. Simple slopes of the Models 1, 3, 5, and 7 (delta method).

Outcome / network measure            Independent work role   dy/dx      SE        z       P > |z|   95% conf. interval
Promoting new ideas
  Inv. of Burt's constraint          0                        2.09692    0.60510    3.47   0.001     0.88567,  3.30816
                                     1                        0.31451    0.19970    1.57   0.121    −0.08523,  0.71427
  Betweenness centrality             0                        0.86236    0.14562    5.92   0.000     0.57087,  1.15386
                                     1                        0.53269    0.22076    2.41   0.019     0.09077,  0.97460
Billable hours
  Inv. of Burt's constraint          0                       −0.90929    0.35692   −2.55   0.014    −1.62375, −0.19484
                                     1                        0.30558    0.18013    1.70   0.095    −0.05499,  0.66157
  Betweenness centrality             0                       −0.25708    0.15388   −1.67   0.100    −0.56511,  0.05093
                                     1                        0.06624    0.24507    0.27   0.788    −0.42432,  0.55680

Note: Statistically significant slopes italicized.

work_33nbguho3jgb5kkpigcjwnzm64 ----

Submitted 3 November 2015
Accepted 21 June 2016
Published 8 August 2016

Corresponding author: Konstantin Kozlov, kozlov_kn@spbstu.ru, mackoel@gmail.com
Academic editor: Sandra Gesing
Additional Information and Declarations can be found on page 16
DOI 10.7717/peerj-cs.74
Copyright 2016 Kozlov et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS

A software for parameter optimization with Differential Evolution Entirely Parallel method

Konstantin Kozlov1, Alexander M. Samsonov1,2 and Maria Samsonova1
1 Mathematical Biology and Bioinformatics Lab, IAMM, Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russia
2 Ioffe Institute, Saint Petersburg, Russia

ABSTRACT

Summary.
The Differential Evolution Entirely Parallel (DEEP) package is software for finding unknown real and integer parameters in dynamical models of biological processes by minimizing one or even several objective functions that measure the deviation of the model solution from data. Numerical solutions provided by the most efficient global optimization methods are often problem-specific and cannot be easily adapted to other tasks. In contrast, DEEP allows a user to describe both the mathematical model and the objective function in any programming language, such as R, Octave, or Python. Being implemented in C, DEEP performs as well as the top three methods from the CEC-2014 (Competition on Evolutionary Computation) benchmark and has been successfully applied to several biological problems.

Availability. The DEEP method is open source and free software distributed under the terms of the GPL licence version 3. The sources are available at http://deepmethod.sourceforge.net/ and binary packages for Fedora GNU/Linux are provided for the RPM package manager at https://build.opensuse.org/project/repositories/home:mackoel:compbio.

Subjects: Computational Biology, Distributed and Parallel Computing, Optimization Theory and Computation
Keywords: Differential Evolution, Parameter optimization, Mathematical modeling, Parallelization, Bioinformatics, Open source software

INTRODUCTION

The estimation of dynamical model parameters that minimize the discrepancy between the model solution and a set of observed data is among the most important and widely studied problems in applied mathematics, and is known as an inverse problem of mathematical modeling (Mendes & Kell, 1998; Moles, Mendes & Banga, 2003). Numerical strategies for solving an inverse problem usually involve optimization methods. Many global and local, stochastic and deterministic optimization techniques, including nature-inspired ones, have been developed and implemented in a wide range of free, open source and commercial software packages.

Mathematical modeling, one of the primary tools of computational systems biology, provides new insights into the mechanisms that control biological systems. It is attractive to experimentalists because carefully selected models have predictive ability. Researchers benefit from the ability of a model to predict in silico the consequences of a biological experiment that was not used for training. The properties of a model are determined by the structure of its mathematical description and the values of the unknown constants and control parameters that represent the coefficients of the underlying biochemical reactions. These unknowns are to be found as the best-suited solution to an inverse problem of mathematical modeling, i.e., by fitting the model output to experimental observations.
The parameter set is to be reliable, and different types of data are to be considered. The development of reliable and easy-to-use algorithms and programs for the solution of the inverse problem remains a challenging task due to the diversity and high computational complexity of biomedical applications, as well as the necessity to treat large sets of heterogeneous data.

In systems biology the commonly used global optimization algorithm is parallel Simulated Annealing (SA) (Chu, Deng & Reinitz, 1999). This method requires considerable CPU time, but it is capable of eventually finding the global extremum and runs efficiently in parallel computations. However, the wide range of methods called Genetic Algorithms (GA) was developed later and successfully applied to biological problems (Spirov & Kazansky, 2002). Modern evolutionary algorithms such as Evolution Strategies (ESs) or Differential Evolution (DE) can outperform other methods in the estimation of parameters of several biological models (Fomekong-Nanfack, Kaandorp & Blom, 2007; Fomekong-Nanfack, 2009; Suleimenov, 2013).

The general challenge in the efficient implementation of global optimization methods is that they depend on problem-specific assumptions and thus cannot be easily adapted to other problems. For example, in SA both the final result and the computational time depend on the so-called cooling schedule; the success of GA optimization strongly depends on the selected mutation, recombination and selection rules; and evolutionary algorithms rely heavily on the algorithmic parameters that define the model of evolution.

Currently, many metaheuristic approaches exist for parameter estimation in biology. For example, enhanced Scatter Search (Egea, Martí & Banga, 2010), implemented in the MEIGOR (Metaheuristics for systems biology and bioinformatics global optimization) package for the R statistical language, was reported to outperform state-of-the-art methods (Egea et al., 2014). This method can provide high quality solutions for integer and real parameters; however, it is computationally expensive.

We developed DEEP, a software package that implements the Differential Evolution Entirely Parallel (DEEP) method introduced recently (Kozlov & Samsonov, 2011). The rationale behind the design of this programme was to provide open source software with performance comparable to competitive packages, as well as to allow a user to implement both the mathematical model and the comparison of the solution with experimental data in any software package or programming language, such as R, Octave, Python or others.

PROBLEM STATEMENT

The DEEP method was developed to solve the inverse problem of mathematical modeling. For a given mathematical model with parameters $q \in R^K$, where $K$ is the number of parameters, and observable data $Y$, we seek the vector $\hat{q}$:

$\hat{q} = \arg\min F(q, Y)$   (1)

where $F$ is a measure of the deviation of the model prediction from the observable data. Additional constraints may be imposed:

$h_j(q) = 0, \quad j = 1, \ldots, N_H$   (2)

$g_m(q) \le 0, \quad m = 1, \ldots, N_G$   (3)

$q^L_k \le q_k \le q^U_k, \quad k = 1, \ldots, K$   (4)

... $a$, $\beta = \dfrac{|a - b| - 1}{\Psi - 2}$. Then the offspring is created according to the formula $v_{b,j+1} = c_1 + (c_2 - c_1) \circ r$, where $r = \{r_k\}$, $r_k = U(0,1)$, $k = 1, \ldots, K$, are random numbers uniformly distributed between 0 and 1.
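To make the problem statement above concrete, here is a minimal Python sketch, not part of DEEP, showing one common way to fold the constraints of Eqs. (2)-(4) into a single scalar score that a Differential Evolution-style optimizer could minimize; the toy exponential model, the synthetic data and the penalty weight `w` are assumptions made purely for illustration (DEEP itself treats the individual criteria through its configurable aggregation and selection rule rather than requiring such a penalty form).

```python
# A minimal sketch (not part of DEEP) of the optimization problem stated in
# Eqs. (1)-(4): minimize F(q, Y) subject to equality constraints h_j(q) = 0,
# inequality constraints g_m(q) <= 0 and box constraints qL <= q <= qU.
# The toy model, data and penalty weight below are illustrative assumptions.
import numpy as np

def model(q, t):
    # Hypothetical model: simple exponential decay parameterized by q = (A, k).
    A, k = q
    return A * np.exp(-k * t)

def F(q, t, Y):
    # Eq. (1): deviation of the model solution from the observed data Y.
    return np.sum((model(q, t) - Y) ** 2)

def penalized_objective(q, t, Y, qL, qU, h_list=(), g_list=(), w=1e3):
    # One common way to fold Eqs. (2)-(4) into a single scalar score.
    q = np.asarray(q, dtype=float)
    score = F(q, t, Y)
    score += w * sum(h(q) ** 2 for h in h_list)               # h_j(q) = 0
    score += w * sum(max(g(q), 0.0) ** 2 for g in g_list)     # g_m(q) <= 0
    score += w * np.sum(np.clip(qL - q, 0, None) ** 2)        # q >= qL
    score += w * np.sum(np.clip(q - qU, 0, None) ** 2)        # q <= qU
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 5, 20)
    Y = 2.0 * np.exp(-0.7 * t) + 0.01 * rng.normal(size=t.size)  # synthetic data
    qL, qU = np.array([0.0, 0.0]), np.array([10.0, 5.0])
    print(penalized_objective([2.0, 0.7], t, Y, qL, qU))
```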
Selection rule

Algorithm 3. SELECTION

proc select (individual) = {
    if (F < the value of the parent) then
        Accept offspring
    else
        for all criteria Fi, hj, gm as f do
            if (f < the value of the parent) then
                Generate the random number U.
                if (U < control parameter for this criterion) then
                    Accept offspring
                end if
            end if
        end for
    end if
}

In order to increase the robustness of the procedure we have implemented the following selection rule for DE, described in detail in Kozlov et al. (2013) (see the Algorithm 3 insertion). Briefly, several different objective functions are used to decide whether an offspring will be selected for the new generation. Firstly, the main objective function is checked. The offspring replaces its parent if the value of this function for the offspring's set of parameters is less than that for the parental one. In the opposite case the additional objective functions are considered. The offspring replaces its parent if the value of any other objective function is better and a randomly selected value is less than the predefined parameter for this function.

Preserving population diversity

The original DE algorithm was highly dependent on internal parameters, as reported by other authors; see, for example, Gaemperle, Mueller & Koumoutsakos (2002). An efficient adaptive scheme for the selection of the internal parameters $S_k$ and $p_k$ based on the control of the population diversity was proposed in Zaharie (2002). Let us consider the variation of parameter $k$ in the current generation:

$\mathrm{var}_k = \frac{1}{NP} \sum_{i=1}^{NP} \left( q_{ik} - \frac{1}{NP} \sum_{l=1}^{NP} q_{lk} \right)^2$

where $k = 1, \ldots, n$. For the next generation the scaling constant is calculated by

$S_k = \begin{cases} \sqrt{\dfrac{NP(\rho_k - 1) + p_k(2 - p_k)}{2\, NP\, p_k}}, & NP(\rho_k - 1) + p_k(2 - p_k) \ge 0 \\ S_{\inf}, & NP(\rho_k - 1) + p_k(2 - p_k) < 0 \end{cases}$

or alternatively the crossover probability is adopted as

$p_k = \begin{cases} -(NP\, S_k^2 - 1) + \sqrt{(NP\, S_k^2 - 1)^2 - NP(1 - \rho_k)}, & \rho_k \ge 1 \\ p_{\inf}, & \rho_k < 1 \end{cases}$

where $S_{\inf} = 1/\sqrt{NP}$, $p_{\inf} = 0$, $\rho_k = \gamma \left( \mathrm{var}^{\mathrm{previous}}_k / \mathrm{var}_k \right)$, and $\gamma$ is a new constant that controls the decrease of the variability of parameters in the course of the iteration process.

Mixed integer-real problems

DE operates on floating point parameters, while many real problems contain integer parameters, e.g., indices of some kind. Two algorithms for parameter conversion from real to integer are implemented in the DEEP method, as described in Kozlov et al. (2013). The first method rounds off a real value to the nearest integer number. The second procedure consists of the following steps:
• The values are sorted in ascending order.
• The index of the parameter in the floating point array becomes the value of the parameter in the integer array.

Parallelization of objective function calculation

Algorithm 4. OBJECTIVE FUNCTION

proc objfunc (population) = {
    Create a Pool of a specified number of worker threads.
    Create an Asynchronous Queue of tasks Q in the Pool.
    for all individuals in population as x do
        Push x to queue Q.
    end for
    Wait for all worker threads in the Pool to finish.
}

proc Worker Thread (parameters) = {
    1. Transform parameters from real to integer as needed.
    2. Save parameters into a temporary file of the specified format.
    3. Call the specified program and supply the temporary file to it.
    4. Capture the output of the program.
    5. Split the output with the specified delimiters into a list of values.
    6. Assign the values in the specified order to Fi, hj, gm, ∀i, j, m.
    7. Return Worker Thread to Pool.
}
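The worker-pool logic of Algorithm 4 can be illustrated with a short Python sketch. This is not DEEP's code (DEEP implements the pool in C on top of GLib); the command name `model.py`, the whitespace-separated parameter file and the whitespace-delimited output assumed below stand in for the formats that are configurable in DEEP.

```python
# A minimal Python sketch of the worker-pool idea in Algorithm 4.  The external
# command, file format and output delimiters are illustrative assumptions; in
# DEEP they are taken from the configuration file.
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

COMMAND = ["python", "model.py"]   # hypothetical user-supplied program

def evaluate_one(params):
    # Steps 2-6 of the Worker Thread: write parameters to a temporary file,
    # call the user program, capture its output and split it into values.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(" ".join(repr(p) for p in params))
        path = tmp.name
    out = subprocess.run(COMMAND + [path], capture_output=True, text=True, check=True)
    values = [float(v) for v in out.stdout.split()]
    return values            # assigned in order to F_i, h_j, g_m

def evaluate_population(population, n_threads=4):
    # All evaluations of the current generation are pushed to the pool, and the
    # caller waits for every worker to finish before the next iteration starts.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(evaluate_one, population))
```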
DEEP can be effectively parallelized due to the independent evaluation of each population member. Various models for the parallelization of evolutionary algorithms have been developed, such as master-slave, island, cellular or hybrid versions (Tasoulis et al., 2004). The approach implemented in DEEP (see the Algorithm 4 insertion) utilizes the multicore architecture of modern CPUs and employs a pool of worker threads with an asynchronous queue of tasks to evaluate the individual solutions in parallel. The calculation of the objective function for each trial vector, using the command supplied by the user, is pushed to the asynchronous queue and starts as soon as there is an available thread in the pool. Such an approach is similar to the "guided" schedule in OpenMP but gives us more flexibility and control. The command output is automatically recognized according to the specified format. All threads started in the current iteration are to be finished before the next one starts.

IMPLEMENTATION

DEEP is implemented in the C programming language as a console application and employs interfaces from the GLIB project (https://developer.gnome.org/glib/), e.g., the Thread Pool API. The architecture allows a user to utilize any programming language or computer system, such as R, Octave or Python, to implement both the mathematical model and the comparison of the solution with experimental data.

Control parameters

All the control parameters are specified in a single input file as key-value pairs in INI format supplied to the DEEP executable on the command line. The control parameters are arranged into three groups described below.

The Mathematical Model section specifies the number of parameters, the lower and upper parameter bounds, as well as the software necessary to run a model. A possibility is provided to indicate parameters that are to be kept unchanged.

The Objective Function section defines the aggregation methods for constraints and multiple objectives. The type of function, i.e., main or additional objective, equality or inequality constraint, is denoted by a special keyword. Ranks and weights are to be given here.

The Method Settings section allows the user to tune the settings, namely, the population size, the stopping criterion, the offspring generation strategy, the number of the oldest individuals to be substituted in the next generation Ψ, the maximal number of working threads and the seed for the random number generator. Two options for offspring generation are provided, namely the selection of the best individual or "trigonometric mutation." The stopping criterion can limit the convergence rate, the absolute or relative value of the objective function, the number of generations or the wall clock time. The initial population is by default generated randomly within the limits given; however, it is also possible to define one initial guess and generate the individuals in a specified vicinity of it.

Programming interfaces

The DEEP method can be used as a static or dynamic shared object and embedded in another software package. Application programming interfaces (APIs) can be used to connect with existing code implementing the mathematical model and objective function. This approach is often preferred in academic and industrial applications where the high-level modeling system language is not sufficient or the computation time should be reduced.
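As an illustration of this external-program convention, the sketch below shows what a minimal user-side model/objective script could look like. It is hypothetical rather than shipped with DEEP: the parameter-file layout, the output delimiter, the toy model and the INI keys quoted in the comments are assumptions, and the authoritative names and formats are those defined in DEEP's documentation and configuration file.

```python
#!/usr/bin/env python3
# Sketch of a user-side objective program of the kind DEEP can call.  The
# layout assumed here -- whitespace-separated parameters in, one line of
# criterion values out -- is only an illustration of the convention described
# above, not DEEP's fixed format.
#
# A hypothetical fragment of the INI-style control file might look like:
#   [method]
#   population_size = 200
#   max_generations = 1499
#   threads = 4
# (the actual key names are defined by DEEP's documentation, not here).
import sys
import numpy as np

def main():
    params = np.loadtxt(sys.argv[1]).ravel()        # parameters written by DEEP
    t = np.linspace(0.0, 5.0, 20)
    data = 2.0 * np.exp(-0.7 * t)                   # stand-in "experimental" data
    model = params[0] * np.exp(-params[1] * t)
    rss = float(np.sum((model - data) ** 2))        # main objective F
    box_violation = float(np.sum(np.clip(-params, 0.0, None)))  # g(q) <= 0 example
    # One value per criterion, separated by the configured delimiter (spaces here).
    print(rss, box_violation)

if __name__ == "__main__":
    main()
```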
RESULTS

Method testing on benchmark functions

To evaluate the performance of DEEP we used three simple multimodal test functions of dimension $D = 30$ from the Competition on Real Parameter Single Objective Optimization 2014 (CEC-2014) test suite (Liang, Qu & Suganthan, 2014), namely:

Shifted and Rotated Griewank's Function:
$H(x) = h\left( M\left( \frac{600(x - o_H)}{100} \right) \right) + 700; \quad h(x) = \sum_{i=1}^{D} \frac{x_i^2}{4000} - \prod_{i=1}^{D} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1$

Shifted Rastrigin's Function:
$R(x) = r\left( \frac{5.12(x - o_r)}{100} \right) + 800; \quad r(x) = \sum_{i=1}^{D} \left( x_i^2 - 10\cos(2\pi x_i) + 10 \right)$

Shifted Schwefel's Function:
$S(x) = s\left( \frac{1000(x - o_s)}{100} \right) + 1000; \quad s(x) = 418.9829 \times D - \sum_{i=1}^{D} g(z_i(x_i))$, where $z_i = x_i + 4.209687462275036 \times 10^2$, and

$g(z_i) = \begin{cases} z_i \sin(|z_i|^{1/2}), & |z_i| < 500, \\ \left(500 - \operatorname{mod}(z_i, 500)\right) \sin\left( \sqrt{|500 - \operatorname{mod}(z_i, 500)|} \right) - \dfrac{(z_i - 500)^2}{1000\,D}, & z_i > 500, \\ \left(\operatorname{mod}(|z_i|, 500) - 500\right) \sin\left( \sqrt{|\operatorname{mod}(|z_i|, 500) - 500|} \right) - \dfrac{(z_i + 500)^2}{1000\,D}, & z_i < -500, \end{cases}$

and the global optimum is shifted to $o_i = [o_{i1}, o_{i2}, \ldots, o_{iD}]^T$ and rotated using the rotation matrix $M_i$.

For each function 51 runs were performed with identical settings and with a random initial population. The maximal allowed number of functional evaluations was set to $3 \times 10^5$. Other DEEP settings were $NP = 200$, $G_{max} = 1{,}499$ and $\Psi = 40$. The measured error was the difference between the known optimal value and the obtained solution. Following the methodology described in Tanabe & Fukunaga (2014), we used the Wilcoxon rank-sum test with significance level p < 0.05 to compare the evaluation results for 51 runs with the results of the top three methods from CEC-2014 (Liang, Qu & Suganthan, 2014) taken from the CEC-2014 report:
1. Covariance Matrix Learning and Searching Preference (CMLP) (Chen et al., 2014),
2. Success-History Based Parameter Adaptation for Differential Evolution (L-SHADE) (Tanabe & Fukunaga, 2014),
3. United Multi-Operator Evolutionary Algorithms (UMOEAs) (Elsayed et al., 2014).

The number of benchmark functions, out of the three tested, on which DEEP performed better (+), worse (−), or not significantly different (≈) is presented in Table 1. DEEP demonstrated the same or better performance.

Table 1. The results of the statistical comparison of DEEP with the top three methods from CEC-2014 on 3 functions. The symbols +, −, ≈ indicate that DEEP performed significantly better (+), significantly worse (−), or not significantly different (≈) compared to another algorithm using the Wilcoxon rank-sum test (p < 0.05). All results are based on 51 runs.

DEEP vs        CMLP    L-SHADE    UMOEAs
− (worse)       0         0          0
≈ (no sig.)     3         3          1
+ (better)      0         0          2

The method test on a reduced model of gene regulation

To demonstrate how DEEP works in applications, we developed a realistic benchmark problem based on a real biological model of the gap gene regulatory network (Kozlov et al., 2015b). The model provides a dynamical description of the gap gene regulatory system, using detailed DNA-based information, as well as spatial TF concentration data at varying time points. The gap gene regulatory network controls segment determination in the early Drosophila embryo (Akam, 1987; Jaeger, 2011; Surkova et al., 2008). The state variables of this model are the concentrations of mRNAs and proteins encoded by the four gap genes hb, Kr, gt, and kni.
The model implements the thermodynamic approach in the form proposed in He et al. (2010) to calculate the expression of a target gene at the RNA level. This level is proportional to the gene activation level, also called the promoter occupancy, and is determined by the concentrations of eight transcription factors Hb, Kr, Gt, Kni, Bcd, Tll, Cad and Hkb:

$E^a_i(t) = \frac{Z^a_{ON,i}(t)}{Z^a_{ON,i}(t) + Z^a_{OFF,i}(t)}$   (7)

where $Z^a_{ON,i}(t)$ and $Z^a_{OFF,i}(t)$ are the statistical weights of the enhancer with the basal transcriptional complex bound and unbound, respectively. Two sets of reaction–diffusion differential equations for the mRNA $u^a_i(t)$ and protein $v^a_i(t)$ concentrations describe the dynamics of the system (Reinitz & Sharp, 1995; Jaeger et al., 2004; Kozlov et al., 2012):

$du^a_i/dt = R^a_u E^a_i(t) + D^a_u(n)\left[ (u^a_{i-1} - u^a_i) + (u^a_{i+1} - u^a_i) \right] - \lambda^a_u u^a_i,$   (8)

$dv^a_i/dt = R^a_v u^a_i(t - \tau^a_v) + D^a_v(n)\left[ (v^a_{i-1} - v^a_i) + (v^a_{i+1} - v^a_i) \right] - \lambda^a_v v^a_i,$   (9)

where $n$ is the cleavage cycle number, $R^a_v$ and $R^a_u$ are the maximum synthesis rates, $D^a_v$ and $D^a_u$ (to smooth the resulting model output) are the diffusion coefficients, and $\lambda^a_v$ and $\lambda^a_u$ are the decay rates for the protein and mRNA of gene $a$. The model spans the time period of cleavage cycles 13 and 14A (c13 and c14, resp.) and the interval of the A-P axis from 35% to 92% (58 nuclei) of embryo length. The number of nuclei along the A-P axis is doubled when going from c13 to c14. The model is fitted to data on gap protein concentrations from the FlyEx database (Pisarev et al., 2008) and mRNA concentrations from SuperFly (Cicin-Sain et al., 2015).

To fit the model we used the residual sum of squared differences between the model output and data, and we used the weighted Pattern Generation Potential proposed in Samee & Sinha (2013) as the second objective function:

$RSS(x,y) = \sum_{\forall g,n,t:\, \exists y^g_n(t)} \left( x^g_n(t) - y^g_n(t) \right)^2 \qquad wPGP(x,y) = \frac{1 + \left( \mathrm{penalty}(x,y) - \mathrm{reward}(x,y) \right)}{2}$

where $g$, $n$ and $t$ are gene, nucleus and time point, respectively, and

$\mathrm{reward}(x,y) = \frac{\sum_i y_i \ast \min(y_i, x_i)}{\sum_i y_i \ast y_i} \qquad \mathrm{penalty}(x,y) = \frac{\sum_i (y_{max} - y_i) \ast \max(x_i - y_i, 0)}{\sum_i (y_{max} - y_i) \ast (y_{max} - y_i)}$

where $x_i$ and $y_i$ are respectively the predicted and experimentally observed expression in nucleus $i$, and $y_{max}$ is the maximum level of experimentally observed expression. Consequently, the combined objective function is defined by:

$F(q,Y) = 2 \times 10^{-7} \ast RSS(v(q),V) + 1.5 \times 10^{-7} \ast RSS(u(q),U) + wPGP(v(q),V) + 0.6 \ast wPGP(u(q),U) + 10^{-8} \ast \mathrm{Penalty}(q),$   (10)

where $Y = \{V, U\}$ contains the data for $u$ and $v$, the function $\mathrm{Penalty}$ limits the growth of regulatory parameters, and the weights were obtained experimentally.

We simplified the original computationally expensive model (Kozlov et al., 2015b) to use it as a benchmark in our calculations as follows. Firstly, we reduced the number of nuclei from 58 to 10 and considered only one target gene with the DNA sequence from kni. Consequently, the number of parameters was reduced to 34, two of which are of integer type. Biologically feasible box constraints in the form (4) are imposed for 28 parameters. Next, we fitted this reduced model to the coarsened data and used the obtained solution and model parameters as the synthetic data for the benchmark. Thus, the exact parameters of the benchmark optimization problem are known. To compare DEEP and MEIGOR (Egea et al., 2014) we ran both methods in the same conditions and recorded the final value of the objective function (10), the final parameters and the number of functional evaluations.
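For concreteness, the following NumPy sketch spells out these fitting criteria as they are written above. It is an illustration rather than the DEEP or model code: the NaN-based handling of missing data points, the random example profiles, and the normalization of the penalty term by the sum of squared (y_max − y_i) weights (chosen to mirror the reward term) are assumptions.

```python
# A small NumPy sketch of the fitting criteria used above (RSS, reward, penalty,
# wPGP and the combined objective of Eq. (10)).  Example inputs are assumptions.
import numpy as np

def rss(x, y):
    # Residual sum of squares over points where data exist (NaN marks no data).
    mask = ~np.isnan(y)
    return float(np.sum((x[mask] - y[mask]) ** 2))

def wpgp(x, y):
    y_max = np.nanmax(y)
    reward = np.nansum(y * np.minimum(y, x)) / np.nansum(y * y)
    penalty = (np.nansum((y_max - y) * np.maximum(x - y, 0.0))
               / np.nansum((y_max - y) * (y_max - y)))
    return (1.0 + penalty - reward) / 2.0

def combined_objective(v_model, V, u_model, U, regulatory_penalty):
    # Eq. (10) with the experimentally chosen weights quoted in the text.
    return (2e-7 * rss(v_model, V) + 1.5e-7 * rss(u_model, U)
            + wpgp(v_model, V) + 0.6 * wpgp(u_model, U)
            + 1e-8 * regulatory_penalty)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    V = np.abs(rng.normal(100, 30, size=58))     # protein "data" (illustrative)
    v_model = V + rng.normal(0, 5, size=58)      # model output (illustrative)
    U = np.abs(rng.normal(80, 20, size=58))
    u_model = U + rng.normal(0, 5, size=58)
    print(combined_objective(v_model, V, u_model, U, regulatory_penalty=0.0))
```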
We considered those solutions for which the final functional value is less than 0.005 that corresponds to parameters close to exact known values. The Welch two sample t-test demonstrated that DEEP used less objective function evaluations than MEIGOR with p<0.005 (see Fig. 1). Real applications DEEP software was successfully applied to explain the dramatic decrease in gap gene expression in early Drosophila embryo caused by a null mutation in Kr gene. Figure 2A Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 12/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 Figure 1 Comparison of number of objective function evaluations for DEEP and MEIGOR on reduced model of gene regulation. DEEP used less objective function evaluations than MEIGOR with p < 0.005 according to Welch two sample t-test. presents the topology of regulatory network inferred by fitting the dynamical model with 44 parameters of gap gene expression to the wild type and Kr mutant data simultaneously (Kozlov et al., 2012). Other DEEP applications include different problems described in Ivanisenko et al. (2014); Nuriddinov et al. (2013). Recently, DEEP was used in the online ancestry prediction tool reAdmix that can identify the biogeographic origins of highly mixed individuals (Kozlov et al., 2015a). reAdmix is available at http://chcb.saban-chla.usc.edu/reAdmix/. Two applications are discussed below in details. Subgenomic Hepatitis C virus replicon replication model The hepatitis C virus (HCV) causes hazardous liver diseases leading frequently to cirrhosis and hepatocellular carcinoma. No effective anti-HCV therapy is available up to date. Design of the effective anti-HCV medicine is a challenging task due to the ability of the hepatitis C virus to rapidly acquire drug resistance. The cells containing HCV subgenomic replicon are widely used for experimental studies of the HCV genome replication mechanisms and the in vitro testing of the tentative medicine. HCV NS3/4A protease is essential for viral replication and therefore it has been one of the most attractive targets for development of specific antiviral agents for HCV. We used the new algorithm and software package to determine 18 parameters (kinetic reaction constants) of the mathematical model of the subgenomic Hepatitis C virus Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 13/20 https://peerj.com http://chcb.saban-chla.usc.edu/reAdmix/ http://dx.doi.org/10.7717/peerj-cs.74 Figure 2 Gene regulatory network, arrows and T-ended curves indicate activation and repressive inter- actions respectively, dotted lines show interactions present in wild type only (A). Regulatory weights of in- dividual transcription factor binding sites (B). Evolution of three objective functions during parameter fit- ting (C). See text for details. (HCV) replicon replication in Huh-7 cells in the presence of the HCV NS3 protease inhibitor, see Ivanisenko et al. (2013). The experimental data include kinetic curves of the viral RNA suppression at various inhibitor concentrations of the VX-950 and BILN-2061 inhibitors (Lin et al., 2004; Lin et al., 2006). We seek for the set of parameters that minimizes three criteria. The main criterion (RSS) is the residual sum of squared differences between the model output and data. Additional criteria 2 (F2) and 3 (F3) penalize the deviation of the time to steady state and the number of viral vesicles at the steady state, respectively. 
The combined criterion was defined as follows: Fcombined=RSS+0.1·F2+0.1·F3 (11) where the weights were obtained experimentally. The dependence of the best value of the combined criterion (11) in population of individuals on the generation number for 10 runs is plotted in Fig. 3A. The objective function is to be evaluated once for each member of the generation, the size of which was set to 200. The plot of the criteria in the close vicinity of the optimal values of the two parameters from the set is shown in Figs. 3B and 3C. Despite of the fact that the criteria do not take a minimal values in one and the same point, the algorithm produces reliable approximation of the optimum. The comparison of the model output and experimental dependencies of the viral RNA suppression rate on inhibitor concentration is shown in Figs. 3D and 3E. It is worth to note that, the model correctly reproduces experimental kinetics of the viral RNA suppression. The predictive power of the model was estimated using the experimental data on dependencies of the viral RNA suppression rate on the increasing concentration of the SCH-503034 (Malcolm et al., 2006) and ITMN-191 (Seiwert et al., 2008) inhibitors. These Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 14/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 Figure 3 (A) The combined criterion (11) vs. the generation number for 10 runs. 200 function eval- uations were performed by the minimization procedure for each generation. (B, C) The criteria graphs are shown in the close vicinity of the optimal values of the four parameters. The values of the parameters found by the algorithm are denoted as x and y. (D, E) The viral RNA suppression in the presence of the NS3 protease inhibitors in different concentrations. The dependence of the viral RNA suppression on the increasing concentration of BILN-2061 (D) and VX-950 (E) inhibitors is shown for the third day post- treatment. A solid line is used to show model output and points correspond to the experimental data (Lin et al., 2004; Lin et al., 2006). (F, G) The predicted kinetics and the suppression rate of the viral RNA in comparison with data not used for parameter estimation. The dependencies of the suppression rate of the viral RNA on the increasing concentration of the SCH-503034 (F) and ITMN-2061 (G) inhibitors (Mal- colm et al., 2006; Seiwert et al., 2008). data were not used for parameter estimation. As it can be seen in Figs. 3F and 3G, the model correctly reproduces experimental observations and thus can be used for in silico studies. Sequence-based model of the gap gene regulatory network Recently, DEEP method was successfully applied to recover 68 parameters of the DNA sequence-based model (7)–(8) of regulatory network of 4 gap genes—hb, Kr, gt, and kni— and 8 transcription factors: Hb, Kr, Gt, Kni, Bcd, Tll, Cad and Hkb (Kozlov et al., 2015b). The trained model provides a tool to estimate the importance of each TF binding site for the model output (see Fig. 2B). We showed that functionally important sites are not exclusively located in cis-regulatory elements and that sites with low regulatory weight are important for the model output (Kozlov et al., 2014). The evolution of the three objective functions during one optimization run is shown in Fig. 2C. Note that the wPGP and the Penalty functions do not decline monotonically and simultaneously. 
In a few first steps these functions reach their maximal values while RSS falls sharply, that corresponds to the adaptation of the control parameters of the algorithm and substitution of old parameter sets with good ones. Then wPGP starts to decay, and Penalty fluctuates at high level, while RSS decays approximately at the same rate as wPGP. As Penalty depends only on regulatory parameters, its behaviour at this stage illustrates that it disallows the process to be trapped in some local minimum with extreme values of parameters. During the second half of the optimization process, Penalty Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 15/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 reaches its final low level and stays at it almost constant till convergence while the RSS and wPGP exhibit a modest growth and then converge. This illustrates the ability of DEEP to balance several objective functions. The model output at this stage is not changed much as indicated by RSS though the absolute values of regulatory parameters are fine tuned. CONCLUSIONS The parallelization of objective function calculation implemented in DEEP method considerably reduces the computational time. Several members of the current generation are evaluated in parallel, which in our experience with Sequence-based Model of the Gap Gene Regulatory Network, resulted in 24 times speedup on 24 core computational node (Intel Xeon 5670, Joint Supercomputer Center of the Russian Academy of Sciences, Moscow). The calculation of 24 objective functions in parallel threads took approximately the same 20 s as one sequential job, and the optimization runs were able to converge in 14 h after approximately 60,000 functional evaluations. To sum up, we elaborated both the method and the software, which demonstrated high performance on test functions and biological problems of finding parameters in dynamic models of biological processes by minimizing one or even several objective functions that measure the deviation of model solution from data. ACKNOWLEDGEMENTS We are thankful to the Joint Supercomputer Center of the Russian Academy of Sciences, Moscow, for provided computational resources. ADDITIONAL INFORMATION AND DECLARATIONS Funding The implementation and testing was supported by RSF grant no. 14-14-00302, the method development was supported by RFBR grant 14-01-00334 and the Programme ‘‘5-100-2020’’ by the Russian Ministry of Science and Education. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Grant Disclosures The following grant information was disclosed by the authors: RSF: 14-14-00302. RFBR: 14-01-00334. Russian Ministry of Science and Education: 5-100-2020. Competing Interests The authors declare there are no competing interests. Author Contributions • Konstantin Kozlov conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 16/20 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.74 • Alexander M. Samsonov conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. • Maria Samsonova conceived and designed the experiments, wrote the paper, reviewed drafts of the paper. 
Data Availability The following information was supplied regarding data availability: SourceForge: http://deepmethod.sourceforge.net/ openSUSE: https://build.opensuse.org/project/repositories/home:mackoel:compbio. REFERENCES Akam M. 1987. The molecular basis for metameric pattern in the Drosophila embryo. Development 101:1–22. Chen L, Liu H-L, Zheng Z, Xie S. 2014. A evolutionary algorithm based on covariance matrix learning and searching preference for solving CEC 2014 benchmark prob- lems. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 2672–2677. Chu KW, Deng Y, Reinitz J. 1999. Parallel simulated annealing by mixing of states. The Journal of Computational Physics 148:646–662 DOI 10.1006/jcph.1998.6134. Cicin-Sain D, Pulido AH, Crombach A, Wotton KR, Jiménez-Guri E, Taly J-F, Roma G, Jaeger J. 2015. SuperFly: a comparative database for quantified spatio- temporal gene expression patterns in early dipteran embryos. Nucleic Acids Research 43(D1):D751–D755 DOI 10.1093/nar/gku1142. Egea JA, Henriques D, Cokelaer T, Villaverde AF, MacNamara A, Danciu D-P, Banga JR, Saez-Rodriguez J. 2014. MEIGO: an open-source software suite based on metaheuristics for global optimization in systems biology and bioinformatics. BMC Bioinformatics 15(1):1–9 DOI 10.1186/1471-2105-15-1. Egea JA, Martí R, Banga JR. 2010. An evolutionary method for complex-process optimization. Computers & Operations Research 37(2):315–324 DOI 10.1016/j.cor.2009.05.003. Elsayed SM, Sarker RA, Essam DL, Hamza NM. 2014. Testing united multi-operator evolutionary algorithms on the CEC-2014 real-parameter numerical optimization. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 1650–1657. Fan H-Y, Lampinen J. 2003. A trigonometric mutation operation to differential evolu- tion. Journal of Global Optimization 27:105–129 DOI 10.1023/A:1024653025686. Fomekong-Nanfack Y. 2009. Genetic Regulatory Networks Inference: modeling, parameters estimation and model validation. PhD Thesis, University of Amsterdam. Fomekong-Nanfack Y, Kaandorp J, Blom J. 2007. Efficient parameter estimation for spatio-temporal models of pattern formation: case study of Drosophila melanogaster. Bioinformatics 23(24):3356–3363 DOI 10.1093/bioinformatics/btm433. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 17/20 https://peerj.com http://deepmethod.sourceforge.net/ https://build.opensuse.org/project/repositories/home:mackoel:compbio http://dx.doi.org/10.1006/jcph.1998.6134 http://dx.doi.org/10.1093/nar/gku1142 http://dx.doi.org/10.1186/1471-2105-15-1 http://dx.doi.org/10.1016/j.cor.2009.05.003 http://dx.doi.org/10.1023/A:1024653025686 http://dx.doi.org/10.1093/bioinformatics/btm433 http://dx.doi.org/10.7717/peerj-cs.74 Gaemperle R, Mueller SD, Koumoutsakos P. 2002. A parameter study for differential evolution. In: Grmela A, Mastorakis NE, eds. Advances in intelligent systems, fuzzy systems, evolutionary computation. WSEAS Press, 293–298. He X, Samee MAH, Blatti C, Sinha S. 2010. Thermodynamics-based models of tran- scriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Computational Biology 6(9):e1000935 DOI 10.1371/journal.pcbi.1000935. Ivanisenko N, Mishchenko E, Akberdin I, Demenkov P, Likhoshvai V, Kozlov K, Todorov D, Samsonova M, Samsonov A, Kolchanov N, Ivanisenko V. 2013. 
Replication of the Subgenomic Hepatitis C virus replicon in the presence of the NS3 protease inhibitors: a stochastic model. Biophysics 58(5):592–606 DOI 10.1134/S0006350913050059. Ivanisenko NV, Mishchenko EL, Akberdin IR, Demenkov PS, Likhoshvai VA, Kozlov KN, Todorov DI, Gursky VV, Samsonova MG, Samsonov AM, Clausznitzer D, Kaderali L, Kolchanov NA, Ivanisenko VA. 2014. A new stochastic model for Subgenomic Hepatitis C virus replication considers drug resistant mutants. PLoS ONE 9(3):e91502 DOI 10.1371/journal.pone.0091502. Jaeger J. 2011. The gap gene network. Cellular and Molecular Life Sciences 68:243–274 DOI 10.1007/s00018-010-0536-y. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, Kozlov KN, Manu, Myasnikova E, Vanario-Alonso CE, Samsonova M, Sharp DH, Reinitz J. 2004. Dynamic control of positional information in the early Drosophila embryo. Nature 430:368–371 DOI 10.1038/nature02678. Kozlov K, Chebotarev D, Hassan M, Triska M, Triska P, Flegontov P, Tatarinova T. 2015a. Differential evolution approach to detect recent admixture. BMC Genomics 16(Suppl 8):Article S9 DOI 10.1101/015446. Kozlov K, Gursky VV, Kulakovskiy IV, Dymova A, Samsonova M. 2015b. Analysis of functional importance of binding sites in the drosophila gap gene network model. BMC Genomics 16(13):1–16 DOI 10.1186/1471-2164-16-S13-S7. Kozlov K, Gursky V, Kulakovskiy I, Samsonova M. 2014. Sequence-based model of gap gene regulatory network. BMC Genomics 15(Suppl 12):Article S6. Kozlov K, Ivanisenko N, Ivanisenko V, Kolchanov N, Samsonova M, Samsonov AM. 2013. Enhanced differential evolution entirely parallel method for biomedical applications. In: Malyshkin V, ed. Lecture notes in computer science, vol. 7979. New York: Springer, 409–416. Kozlov K, Samsonov A. 2011. DEEP—differential evolution entirely parallel method for gene regulatory networks. Journal of Supercomputing 57:172–178 DOI 10.1007/s11227-010-0390-6. Kozlov K, Surkova S, Myasnikova E, Reinitz J, Samsonova M. 2012. Modeling of gap gene expression in Drosophila Kruppel mutants. PLoS Computational Biology 8(8):e1002635 DOI 10.1371/journal.pcbi.1002635. Liang JJ, Qu BY, Suganthan PN. 2014. Problem definitions and evaluation criteria for the CEC 2014 special session and competition on single objective real-parameter Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 18/20 https://peerj.com http://dx.doi.org/10.1371/journal.pcbi.1000935 http://dx.doi.org/10.1134/S0006350913050059 http://dx.doi.org/10.1371/journal.pone.0091502 http://dx.doi.org/10.1007/s00018-010-0536-y http://dx.doi.org/10.1038/nature02678 http://dx.doi.org/10.1101/015446 http://dx.doi.org/10.1186/1471-2164-16-S13-S7 http://dx.doi.org/10.1007/s11227-010-0390-6 http://dx.doi.org/10.1371/journal.pcbi.1002635 http://dx.doi.org/10.7717/peerj-cs.74 numerical optimization. Technical Report 201311. Singapore: Computational Intelligence Laboratory, Zhengzhou University, Zhengzhou China And Technical Report, Nanyang Technological University. Lin C, Lin K, Luong YP, Rao BG, Wei YY, Brennan DL, Fulghum JR, Hsiao HM, Ma S, Maxwell JP, Cottrell KM, Perni RB, Gates CA, Kwong AD. 2004. In vitro resistance studies of hepatitis C virus serine protease inhibitors, VX-950 and BILN 2061: structural analysis indicates different resistance mechanisms. Journal of Biological Chemistry 279(17):17508–17514 DOI 10.1074/jbc.M313020200. Lin K, Perni RB, Kwong AD, Lin C. 2006. VX-950, a novel hepatitis C virus (HCV) NS3- 4A protease inhibitor, exhibits potent antiviral activities in HCv replicon cells. 
An- timicrobial Agents and Chemotherapy 50(5):1813–1822 DOI 10.1128/AAC.50.5.1813-1822.2006. Malcolm BA, Liu R, Lahser F, Agrawal S, Belanger B, Butkiewicz N, Chase R, Gheyas F, Hart A, Hesk D, Ingravallo P, Jiang C, Kong R, Lu J, Pichardo J, Prongay A, Skelton A, Tong X, Venkatraman S, Xia E, Girijavallabhan V, Njoroge FG. 2006. SCH 503034, a mechanism-based inhibitor of hepatitis C virus NS3 protease, suppresses polyprotein maturation and enhances the antiviral activity of alpha in- terferon in replicon cells. Antimicrobial Agents and Chemotherapy 50(3):1013–1020 DOI 10.1128/AAC.50.3.1013-1020.2006. Mendes P, Kell DB. 1998. Non-linear optimization of biochemical pathways: applica- tions to metabolic engineering and parameter estimation. Bioinformatics 14:869–883 DOI 10.1093/bioinformatics/14.10.869. Moles CG, Mendes P, Banga JR. 2003. Parameter estimation in biochemical pathways: comparison of global optimization methods. Genome Research 13:2467–2474 DOI 10.1101/gr.1262503. Nuriddinov M, Kazantsev F, Rozanov A, Kozlov K, Peltek S, Akberdin I, Kolchanov N. 2013. Mathematical modeling of ethanol and lactic acid biosynthesis by theromphilic geobacillus bacteria. Russian Journal of Genetics: Applied Research 17(4/1):686–704. Pisarev A, Poustelnikova E, Samsonova M, Reinitz J. 2008. FlyEx, the quantitative atlas on segmentation gene expression at cellular resolution. Nucleic Acids Research 37:D560–D566 DOI 10.1093/nar/gkn717. Reinitz J, Sharp DH. 1995. Mechanism of eve stripe formation. Mechanisms of Develop- ment 49:133–158 DOI 10.1016/0925-4773(94)00310-J. Samee MAH, Sinha S. 2013. Evaluating thermodynamic models of enhancer activity on cellular resolution gene expression data. Methods 62:79–90 DOI 10.1016/j.ymeth.2013.03.005. Seiwert SD, Andrews SW, Jiang Y, Serebryany V, Tan H, Kossen K, Rajagopalan RPT, Misialek S, Stevens SK, Stoycheva A, Hong J, Lim SR, Qin X, Rieger R, Condroski KR, Zhang H, Do MG, Lemieux C, Hingorani GP, Hartley DP, Josey JA, Pan L, Beigelman L, Blatt LM. 2008. Preclinical characteristics of the HCV NS3/4A protease inhibitor ITMN-191 (R7227). Antimicrobial Agents and Chemotherapy 52(12):4432–4441 DOI 10.1128/AAC.00699-08. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 19/20 https://peerj.com http://dx.doi.org/10.1074/jbc.M313020200 http://dx.doi.org/10.1128/AAC.50.5.1813-1822.2006 http://dx.doi.org/10.1128/AAC.50.3.1013-1020.2006 http://dx.doi.org/10.1093/bioinformatics/14.10.869 http://dx.doi.org/10.1101/gr.1262503 http://dx.doi.org/10.1093/nar/gkn717 http://dx.doi.org/10.1016/0925-4773(94)00310-J http://dx.doi.org/10.1016/j.ymeth.2013.03.005 http://dx.doi.org/10.1128/AAC.00699-08 http://dx.doi.org/10.7717/peerj-cs.74 Spirov AV, Kazansky AB. 2002. Jumping genes-mutators can raise efficacy of evolu- tionary search. In: Proceedings of the genetic and evolutionary computation conference GECCO2002. San Francisco: Morgan Kaufmann Publishers Inc. Storn R, Price K. 1995. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. Technical Report TR-95-012. Berkeley: ICSI. Suleimenov Y. 2013. Global parameter estimation for thermodynamic models of transcriptional regulation. Methods 62:99–108 DOI 10.1016/j.ymeth.2013.05.012. Surkova S, Kosman D, Kozlov K, Manu, Myasnikova E, Samsonova A, Spirov A, Vanario-Alonso CE, Samsonova M, Reinitz J. 2008. Characterization of the Drosophila segment determination morphome. Developmental Biology 313(2):844–862 DOI 10.1016/j.ydbio.2007.10.037. Tanabe R, Fukunaga AS. 2014. 
Improving the search performance of shade by using linear population size reduction. In: CEC 2014 special session and competition on single objective real-parameter numerical optimization, vol. 3. Piscataway: IEEE, 1658–1665. Tasoulis D, Pavlidis N, Plagianakos V, Vrahatis M. 2004. Parallel differential evolution. In: Congress on evolutionary computation (CEC 2004), vol. 2. Piscataway: IEEE, 2023–2029. Zaharie D. 2002. Parameter adaptation in differential evolution by controlling the population diversity. In: Petcu D, ed. Proceedigs of the 4th international workshop on symbolic and numeric algorithms for scientific computing. Timisoara, Romania, 385–397. Kozlov et al. (2016), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.74 20/20 https://peerj.com http://dx.doi.org/10.1016/j.ymeth.2013.05.012 http://dx.doi.org/10.1016/j.ydbio.2007.10.037 http://dx.doi.org/10.7717/peerj-cs.74 work_35n7f6rilnfzdevpa7noi3pla4 ---- Submitted 6 August 2020 Accepted 8 October 2020 Published 7 December 2020 Corresponding author Chengbin Peng, pengcheng- bin@nbu.edu.cn Academic editor Faizal Khan Additional Information and Declarations can be found on page 18 DOI 10.7717/peerj-cs.311 Copyright 2020 Fan et al. Distributed under Creative Commons CC-BY 4.0 OPEN ACCESS A deep learning-based ensemble method for helmet-wearing detection Zheming Fan1, Chengbin Peng1,2, Licun Dai1, Feng Cao1, Jianyu Qi1 and Wenyi Hua1 1 College of Information Science and Engineering, Ningbo University, Ningbo, China 2 Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, Ningbo, China ABSTRACT Recently, object detection methods have developed rapidly and have been widely used in many areas. In many scenarios, helmet wearing detection is very useful, because people are required to wear helmets to protect their safety when they work in construction sites or cycle in the streets. However, for the problem of helmet wearing detection in complex scenes such as construction sites and workshops, the detection accuracy of current approaches still needs to be improved. In this work, we analyze the mechanism and performance of several detection algorithms and identify two feasible base algorithms that have complementary advantages. We use one base algorithm to detect relatively large heads and helmets. Also, we use the other base algorithm to detect relatively small heads, and we add another convolutional neural network to detect whether there is a helmet above each head. Then, we integrate these two base algorithms with an ensemble method. In this method, we first propose an approach to merge information of heads and helmets from the base algorithms, and then propose a linear function to estimate the confidence score of the identified heads and helmets. Experiments on a benchmark data set show that, our approach increases the precision and recall for base algorithms, and the mean Average Precision of our approach is 0.93, which is better than many other approaches. With GPU acceleration, our approach can achieve real-time processing on contemporary computers, which is useful in practice. Subjects Computer Vision, Data Mining and Machine Learning, Social Computing Keywords Ensemble method, Deep learning, Helmet-wearing detection, Face detection INTRODUCTION Helmets can play a vital role in protecting people. For example, many severe accidents in production and work sites and roads have been related to violations of wearing helmets. Some personnel may lack safety awareness in a working site and often do not or forget to wear helmets. 
On the road, craniocerebral injury is the leading cause of serious injury to cyclists in road traffic (World Health Organization, 2006). However, wearing a helmet reduces the risk of head injury of motorcycle riders by 69% (Liu et al., 2008), and wearing a helmet reduces the risk of head injury for cyclists by 63%–88% (Thompson, Rivara & Thompson, 1999). Monitoring helmet-wearing manually can have many limitations, as people can be fatigue and costly. Reducing manual monitoring while ensuring that relevant personnel wearing helmets all the time in the working area has become an urgent problem. How to cite this article Fan Z, Peng C, Dai L, Cao F, Qi J, Hua W. 2020. A deep learning-based ensemble method for helmet-wearing de- tection. PeerJ Comput. Sci. 6:e311 http://doi.org/10.7717/peerj-cs.311 https://peerj.com/computer-science mailto:pengchengbin@nbu.edu.cn mailto:pengchengbin@nbu.edu.cn https://peerj.com/academic-boards/editors/ https://peerj.com/academic-boards/editors/ http://dx.doi.org/10.7717/peerj-cs.311 http://creativecommons.org/licenses/by/4.0/ http://creativecommons.org/licenses/by/4.0/ http://doi.org/10.7717/peerj-cs.311 Image recognition technology can reduce the workforce and material expenditures, and can significantly protect workers in many areas. Developments of computer vision algorithms and hardware (Feng et al., 2019) have paved the road for the application in helmet detection. Deep neural networks have gained much attention in image classification (Krizhevsky, Sutskever & Hinton, 2017), object recognition (Donahue et al., 2013), and image segmentation (Garcia-Garcia et al., 2017). Previous computer vision algorithms for helmet detection are usually used in relatively simple scenes. For helmet detection, Rubaiyat et al, (2016) used a histogram of oriented gradient and a support vector machine to locate persons, then used a hough transform to detect helmet for the construction worker. Li et al. (2017b) identified helmets by background subtraction. Li et al. (2017a) used ViBe background modeling algorithm and human body classification framework C4 to select people and heads, and then identified whether people wore helmets through color space transformation and color feature recognition. However, such approaches are typically not suitable for complex scenes and dynamic backgrounds, such as construction sites, workshops, and streets. Choudhury, Aggarwal & Tomar (2020) and Long, Cui & Zheng (2019) use single shot object detector algorithm to detect helmets. Siebert & Lin (2020) used RetinaNet which uses a multi-scale feature pyramid and focal loss to address the general limitation of one-stage detectors in accuracy, it works well in certain situations but its performance is highly scene dependent and influenced by light. Bo et al. (2019) use the You Only Look Once (YOLO) algorithm to accurately detect helmet wear in images with an average of four targets. However, most of these approaches are not suitable for both small and large helmets at the same time. In this work, we propose a framework to integrate two complementary deep learning algorithms to improve the ability of helmet-wearing detection in complex scenes. Our approach is able to identify regular-size and tiny-size objects at the same time for helmet- wearing detection, and can be used for detection in complex scenes. This framework can outperform traditional approaches on benchmark data. RELATED WORK The starting point of CNN is the neurocognitive machine model (Fukushima & Miyake, 1982). 
At this time, the convolution structure has appeared. The classic LeNet (LeCun et al., 1998) was proposed in 1998. However, CNN’s edge began to be overshadowed by models such as SVM (support vector machine) later. With the introduction of ReLU (Rectified Linear Units), dropout, and historic opportunities brought by GPU and big data, CNN ushered in a breakthrough in 2012: AlexNet (Krizhevsky, Sutskever & Hinton, 2017). In the following years, CNN showed explosive development, and various CNN models emerged. CNN has gradually gained the favor of scholars due to its advantages of not having to manually design features when extracting image features (Shi, Chen & Yang, 2019). Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 2/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 Many recent object detection approaches are based on RCNN (Region-based Convolutional Neural Networks) algorithms and YOLO algorithms (Redmon et al., 2016). RCNN is an improved algorithm based on CNN. Girshick et al. propose an epoch-making RCNN algorithm (Girshick et al., 2014) in the field of object detection. The central idea is to use a search selective method to extract some borders from the image. Then the size of the area divided by the border is normalized to the convolutional neural network input size, and then the SVM is used to identify the target. The bounding box of the target is obtained through a linear regression model. It brought deep learning and CNN to people’s sight. However, there are disadvantages such as cumbersome training steps, slow test training speed, and large space occupation. In order to improve the training and testing speed of RCNN, Fast RCNN algorithm (Girshick, 2015) was developed. It uses fewer layers while adding an ROI pooling layer to adjust the convolution area, and using softmax instead of the original SVM for classification. Compared with RCNN, Fast RCNN has improved training and testing speed. However, because the selective search method is also used to extract the borders of the region of interest, the speed of this algorithm is still not ideal for working with large data sets. Later, Faster RCNN (Ren et al., 2015) integrates feature extraction, proposal extraction, bounding box regression, classification, etc. into a network. The overall performance is far superior to CNN, and at the same time, it runs nearly much faster than CNN. Thus, Faster RCNN is commonly used in many applications. The Faster RCNN performs well for relatively large objects, but when detecting small faces or helmets, there will be a large false negative rate. Tiny Face has made certain optimizations for small face detection. It mainly optimizes face detection from three aspects: the role of scale invariance, image resolution, and contextual reasoning. Scale invariance is a fundamental property of almost all current recognition and object detection systems, but from a practical point of view, the same scale is not applicable to a sensor with a limited resolution: the difference in incentives between a 300px face and a 3px face is undeniable (Hu & Ramanan, 2017). Ramanan et al. conducted an in-depth analysis of the role of scale invariance, image resolution, and contextual reasoning. Compared with mainstream technology at the time, the error rate can be significantly reduced (Hu & Ramanan, 2017). Boosting algorithm was initially proposed as a polynomial-time algorithm, and the effectiveness has been experimentally and theoretically proved (Schapire, 1990). Afterward, Freund et al. 
improved the Boosting algorithm to obtain the Adaboost algorithm (Freund & Schapire, 1997). The principle of the algorithm is to filter out the weights from the trained weak classifiers by adjusting the sample weights and weak classifier weights. The weak classifiers with the smallest coefficients are combined into a final robust classifier. Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 3/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 In this work, in order to identify a variety of heads and helmets in complex scenes, we propose a framework to use incorporate multiple complementary deep learning algorithms to improve the joint performance. MATERIALS & METHODS Method To address the helmet-wearing detection problem, we compare several object detection methods, such as the naive Bayes classifier, SVM, and Artificial Neural Networks classifier. Naive Bayes usually needs independent distributive premises. SVM is difficult to training for various scenes. In the case of a complex scene and huge training data, artificial neural networks are expected to have better accuracy and reliability, so we propose to use artificial neural networks, especially, convolutional neural networks, to solve this issue. To address the disadvantages raised by long-range cameras, we further improve the performance by integrating multiple complementary deep neural network models. Base algorithms Faster RCNN for detecting faces and helmet-wearing. After images are fed, Faster RCNN firstly extracts image feature maps through a group of basic conv+relu+pooling layer. Next, RPN (Region Proposal Networks) will set a large number of anchors on the scale of the original image, and randomly select 128 positive anchors and 128 negative anchors from all anchors for binary training, and use these anchors and a softmax function to initially extract positive anchors as the candidate area. At this time, the candidate regions are not accurate and require bounding boxes. For a given image I, we use A to represent the ground-truth anchors. We use AF and cF to represent the identified bounding boxes and helmet-wearing confidence scores, respectively, computed by the Faster-RCNN algorithm. If we use F to represent the algorithm, WF to represent the weight of the network, this approach can be written as follows. AF,cF =F(I,WF) (1) If we consider AF =F(I,WF)[0] and cF =F(I,WF)[1], we can use Loss(F(I,WF)[0],F(I,WF)[1],A) (2) to represent the loss function (Fukushima & Miyake, 1982) when to minimize differences between the detected anchors and the ground-truth. Thus, when we train this model, the optimization is as follows. W∗F =argminWFLoss(F(I,WF)[0],F(I,WF)[1],A) (3) Tiny Face for detecting faces. The overall idea of Tiny Face is similar to RPN in Faster RCNN, which is a one-stage detection method. The difference is that some scale specific design and multi-scale feature fusion are added, supplemented by image pyramid so that Fan et al. (2020), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.311 4/21 https://peerj.com http://dx.doi.org/10.7717/peerj-cs.311 the final detection effect for small faces is better. The training data were changed by three scales, with one-third of the probability respectively and sent to the network for training. Multiple scales could be selected to improve the accuracy rate in the prediction. 
For a given image I, we can also use $A_T$ and $c_T$ to represent the identified bounding boxes and confidence scores computed by the Tiny Face algorithm, so if we use T to represent the Tiny Face algorithm and $W_T$ to represent the corresponding weights, we have

$A_T, c_T = T(I, W_T)$    (4)

However, Tiny Face can only be used to determine whether a detection target contains a human face and cannot directly distinguish whether the target is wearing a helmet. Thus, we propose to use a CNN to overcome this disadvantage.

CNN for detecting helmet-wearing. For anchors determined by Tiny Face, we can use a CNN to detect helmets above the face. We enlarge the face area detected by Tiny Face and feed it into the CNN model for prediction. The confidence scores indicating whether there is a helmet above the face can be computed by the CNN algorithm as

$c_C = C(A_T, I, W_C)$,    (5)

where C is a function representing the forward propagation of the CNN. Here, C is a composition of two convolution layers and one fully connected layer. The loss function again aims to minimize the difference between the detected helmets and the ground truth,

$Loss(A_T, C(A_T, I, W_C), A)$.    (6)

Ensemble model detecting high and low resolution helmets

For the two lists of face anchors $A_F$ and $A_T$ detected by the base algorithms above, we merge them with the following strategy. We first initialize an empty anchor list $A_S$ and two score vectors $c_{SF}$ and $c_{SC}$. For the i-th anchor in $A_F$ and the corresponding score in $c_F$, namely $A_F[i]$ and $c_F[i]$, we first insert them into $A_S$ and $c_{SF}$ respectively. Then $A_F[i]$ is compared with all the anchors in $A_T$. If some anchors in $A_T$ have more than 60% overlapping area with $A_F[i]$, we remove these anchors from $A_T$ and remove the corresponding entries from $c_C$; we take the mean value of the removed entries of $c_C$ and insert it into $c_{SC}$. If no overlapping anchors in $A_T$ are found, we insert zero into $c_{SC}$. After all the anchors in $A_F$ are processed, the remaining anchors in $A_T$, the remaining confidence values in $c_C$, and a zero vector of the same length are inserted into $A_S$, $c_{SC}$, and $c_{SF}$, respectively. At last, we compute the covering area of each anchor in $A_S$ and store these areas in $\delta$. The merge process can also be described in pseudocode (a sketch is given below).

After the data preparation, many ensemble learning methods can be used for model integration. In this work, we consider a basic ensemble model defined as

$S(c_{SF}, c_{SC}, \delta, \alpha) = \sum_i \alpha_i h_i(c_{SF}, c_{SC}, \delta)$    (7)

where $\alpha$ is the model parameter, $\delta$ is a vector containing the areas of the corresponding anchors, and $h_i()$ is a classifier. We choose decision trees with a maximum depth of two in the experiment, and i ranges from 0 to 1000. The variable $\delta$ is used here because the two base algorithms are good at identifying relatively large and small objects respectively, and adding the covering areas of the anchors helps improve the accuracy. Thus, in the ensemble method, $A_S$ is the anchor list, and $c_S = S(c_{SF}, c_{SC}, \delta, \alpha)$ contains the corresponding confidence values about helmet-wearing. To train this model, we merge the anchor set $A_S$ and the ground-truth set A in a similar manner as when merging $A_F$ and $A_T$, and we use $\hat{c}_{SF}$, $\hat{c}_{SC}$ and $\hat{c}$ to represent the corresponding variables after merging. Zeros are filled in if the corresponding anchor does not exist before merging.
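The merge strategy above can be sketched in Python as follows. This is a reconstruction from the prose description (the 60% overlap rule, averaging of removed Tiny Face scores, and zero padding), not the authors' original pseudocode; overlap() is a hypothetical helper returning the fraction of overlapping area between two boxes.

def merge_anchors(A_F, c_F, A_T, c_C, overlap):
    """Merge Faster RCNN anchors (A_F, c_F) with Tiny Face anchors A_T and their
    CNN helmet scores c_C, producing A_S, c_SF, c_SC and the area vector delta."""
    A_T, c_C = list(A_T), list(c_C)          # work on copies that can shrink
    A_S, c_SF, c_SC = [], [], []
    for box_f, score_f in zip(A_F, c_F):
        A_S.append(box_f)
        c_SF.append(score_f)
        # Tiny Face anchors overlapping this Faster RCNN anchor by more than 60%.
        hits = [k for k, box_t in enumerate(A_T) if overlap(box_f, box_t) > 0.6]
        if hits:
            c_SC.append(sum(c_C[k] for k in hits) / len(hits))   # mean of removed scores
            for k in sorted(hits, reverse=True):                 # drop matched anchors
                del A_T[k]
                del c_C[k]
        else:
            c_SC.append(0.0)                                     # no overlap: pad with zero
    # Remaining Tiny Face anchors are appended with zero Faster RCNN scores.
    A_S.extend(A_T)
    c_SC.extend(c_C)
    c_SF.extend([0.0] * len(A_T))
    # delta holds the covering area of each merged anchor, assuming (x1, y1, x2, y2) boxes.
    delta = [(x2 - x1) * (y2 - y1) for (x1, y1, x2, y2) in A_S]
    return A_S, c_SF, c_SC, delta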
Then, the loss between the identified anchors in $A_S$ and the ground-truth anchors A is

$E(\delta, \alpha, \hat{c}_{SF}, \hat{c}_{SC}, \hat{c}) = \sum_{i=0}^{n} \big( S(\hat{c}_{SF}[i], \hat{c}_{SC}[i], \delta, \alpha) - \hat{c}[i] \big)^2$    (8)

where n is the total number of anchors after merging. The optimal value of $\alpha$ can be computed by minimizing this error,

$\alpha^* = \arg\min_{\alpha} E(\alpha, \hat{c}_{SF}, \hat{c}_{SC}, \hat{c})$.    (9)

The whole process can be described by the pseudocode of Algo. 2.

Experiments

In order to evaluate the performance of our framework, we use five criteria:

$TPR = m/n$    (10)
$FPR = l/k$    (11)
$RE = m/N$    (12)
$FNR = 1 - RE$    (13)
$PRE = m/(m+l)$    (14)

where TPR is the true positive rate, FPR is the false positive rate, FNR is the false negative rate, RE is the recall rate, PRE is the precision rate, m is the number of correct predictions by the models under the current threshold, n is the number of parts of the model detection result that are identical to the ground truth, l is the number of false predictions by the models under the current threshold, k is the number of parts of the model detection result that differ from the ground truth, and N is the number of targets that actually exist.

Figure 1. Faster RCNN detecting big faces.

To evaluate our approach, we take the publicly available benchmark data set (Safety Helmet Wearing-Dataset), containing images from construction sites, roads, workshops, and classrooms. The data set consists of a total of 7,581 images. We use five-fold cross-validation for the experiments. We randomly divide all the images into five parts; the training set, validation set, and testing set contain 3/5, 1/5, and 1/5 of the total images, respectively.

Preliminary analysis

The detection results of Faster RCNN for faces are shown in Figs. 1 and 2. From these two figures, we can see that Faster RCNN is suitable for detecting large objects but misses small ones. The detection results of Tiny Face are shown in Fig. 3. From this result, we can see that Tiny Face is good at finding small faces. To compare the differences between the two models, we used Faster RCNN and Tiny Face to test 1,000 images from the data set and counted the number of faces of different sizes detected by the two models. Figure 4 is the histogram of the real data, and Fig. 5 is the histogram of face sizes detected by Faster RCNN. Taking the number of pixels (px2) as the area measurement, a face with an area smaller than 500 px2 is defined as a small face, and a face larger than 500 px2 is defined as a large face. Because of the large area span (the smallest face is only 90 px2, while the largest face can reach 2,000,000 px2), and in order to prevent the histograms from crowding together, only faces with an area less than 2,000 px2 are shown in the figures.

Figure 2. Faster RCNN detecting small faces.
Figure 3. Tiny Face detecting small faces.
Figure 4. Histogram of real data.
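For reference, the five criteria in Eqs. (10)-(14) can be computed with a few lines of Python; the counts m, n, l, k, and N follow the definitions given above.

def evaluation_criteria(m, n, l, k, N):
    """Compute TPR, FPR, RE (recall), FNR and PRE (precision) as in Eqs. (10)-(14)."""
    TPR = m / n          # true positive rate
    FPR = l / k          # false positive rate
    RE = m / N           # recall rate
    FNR = 1 - RE         # false negative rate
    PRE = m / (m + l)    # precision rate
    return {"TPR": TPR, "FPR": FPR, "RE": RE, "FNR": FNR, "PRE": PRE}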
Figure 5. Histogram of face sizes detected by Faster RCNN.

According to the statistics, there are actually 1,568 big faces and 83 small faces. The initial Faster RCNN model can detect 1,468 big faces and 37 small faces. Under the assumption that the labels are correct, the false negative rate for big faces is 5.2%, and that for small faces is 55.5%. Obviously, the Faster RCNN model has lower accuracy for small faces. We then performed the same statistics for Tiny Face and obtained the histogram of its detection results in Fig. 6. Tiny Face can detect 1,306 large faces and 44 small faces. Its false negative rate for large faces is 16.8%, which is 11.6% higher than Faster RCNN, and its false negative rate for small faces is 47.0%, which is 8.5% lower than Faster RCNN.

Figure 6. Histogram of face sizes detected by Tiny Face.

Although these are only preliminary models (neither the models nor the amount of training has been tuned to improve accuracy), it is not difficult to see from the current data that the detection capabilities of the Faster RCNN and Tiny Face models have different focuses. Faster RCNN has a great advantage when detecting large faces, while Tiny Face's ability to detect small faces is better than Faster RCNN's. In other words, Faster RCNN has a higher true positive rate for detecting large faces and Tiny Face has a higher true positive rate for detecting small faces. The overall effect can be better if we combine the two methods.

Accuracy of base algorithms for helmet detection

Accuracy of $F(I, W_F^*)$. In this part, we evaluate the accuracy of $F(I, W_F^*)$ alone in Algo. 2. Theoretically, the more training steps the model has, the better, but in order to prevent overfitting we still need to observe the accuracy of the model under different numbers of training steps. In the beginning, we selected some images from the training data set to evaluate the model. We trained for 5,000 steps and used the model to test images from the training set, but the effect was obviously not very satisfactory. Because $F(I, W_F^*)$ is based on Faster RCNN, which has high precision but easily misses small faces, the quality of the model can be preliminarily judged by the number of detected targets; we then gradually increased the number of training steps. When the number of training steps reaches 20,000, the number of detected targets in the detection results of 1,000 test-set images stays at about 1,300. As the number of training steps increases, the number of detected targets increases slightly. When the number reaches 60,000 steps, the number of detected targets is 1,523; at this point the precision rate of the model is 87.3% and the recall rate is 85.9%. When the number of training steps reaches 70,000, the number of detected targets is close to 1,700; at this point the precision rate of the model is 81.2% and the recall rate is 86.3%.
We find that although the recall rate increases slightly, the precision rate drops considerably, so we chose the model with 60,000 training steps as the final model. See Table 1 for the accuracy of $F(I, W_F^*)$ under different numbers of training steps.

Table 1. Relationship between training steps and accuracy.

Steps     Precision rate   Recall rate
5,000     80.0%            72.4%
20,000    84.0%            82.0%
40,000    86.1%            85.1%
60,000    87.3%            85.9%
70,000    81.2%            86.3%

Regarding the scoring threshold, it is 0.5 by default, which means that when the score is lower than 0.5 the result is discarded. We successively set the threshold to 0.3, 0.4, 0.5, 0.6, and 0.7, and tested on the validation data to choose the value that works best. We found that when the threshold is 0.6, the precision rate of the test result is 87.3% and the recall rate is 85.9%, which is better than the other thresholds. After comprehensive consideration, we keep 0.6 as the threshold for the ensemble. The ROC curve on the training set is shown in Fig. 7. When training this model, in order to distinguish whether an individual is wearing a helmet, we use two labels: people wearing a helmet and people not wearing a helmet. This makes the final trained model distinguish more accurately whether the target wears a helmet.

Figure 7. ROC with respect to A.

Accuracy of $T(I, W_T^*)$. In this part, we consider the accuracy of $T(I, W_T^*)$ alone in Algo. 2. It is basically a trained Tiny Face model. We lowered the scoring threshold of Tiny Face to 0.5, requiring the Tiny Face model only to determine the location of small faces; it does not need to return an accurate score value. The precision rate of face detection was 85.6%, and the recall rate was 69.4%.

Accuracy of $C(A_T, I, W_C^*)$. In this part, we consider the accuracy of $C(A_T, I, W_C^*)$ alone in Algo. 2. The function $C(A_T, I, W_C^*)$ is basically a CNN model, which requires only one target per image, so we selected over 2,000 images from the training set, cropped the targets according to the corresponding anchor labels, and obtained 20,000 images with only one target in each image. We selected 18,000 images as training data for the CNN and the other 2,000 images as a validation set to measure the accuracy of the CNN; the cropped images are divided into two sets, people wearing helmets and people not wearing helmets. In addition, we rotated some images to obtain richer training samples. With cross-validation, we chose to use four pairs of convolution and pooling layers, in which the convolution kernels of the first and second convolution layers have size [5,5] and the convolution kernels of the third and fourth convolution layers have size [3,3]. The precision rate of the final two-class CNN reached 90.3% when tested on the CNN validation set. The ROC curve on the training set is shown in Fig. 8.

Figure 8. ROC with respect to A_T and c_C.

Accuracy of the ensemble method $S(c_{SF}, c_{SC}, \alpha^*)$. The areas under the ROC curves of Faster RCNN and of $(A_T, c_C)$ are 0.86 and 0.83, respectively. The ensemble method can further improve the accuracy of the final result.
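One way to realize the ensemble of Eq. (7), a weighted sum of depth-two decision trees fitted under the squared error of Eqs. (8) and (9), is gradient boosting over the per-anchor features (c_SF, c_SC, delta). The scikit-learn sketch below is only one plausible instantiation under these stated choices, not necessarily the authors' implementation; X holds the merged per-anchor features and y the merged ground-truth labels (the vector written as c-hat in Eq. (8)).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_ensemble(c_SF, c_SC, delta, y):
    """Fit S(c_SF, c_SC, delta, alpha) as a sum of ~1000 depth-2 regression trees;
    the default squared-error loss matches Eq. (8)."""
    X = np.column_stack([c_SF, c_SC, delta])
    model = GradientBoostingRegressor(n_estimators=1000, max_depth=2)
    return model.fit(X, y)

def ensemble_scores(model, c_SF, c_SC, delta):
    # c_S = S(c_SF, c_SC, delta, alpha), the final helmet-wearing confidences.
    return model.predict(np.column_stack([c_SF, c_SC, delta]))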
Among the data, $c_F$ and $c_C$ are the results from the two base methods, respectively, and the area is the size of the target frame. Obviously, $c_F$ and $c_C$ can be used as the feature values of the ensemble method. We tested the trained model, and the area under its ROC curve is larger, reaching 0.90. The ROC curve on the training set is shown in Fig. 9. The ROC curve of the ensemble method clearly has the largest coverage area, which shows that the ensemble method is effective in our model.

Figure 9. ROC with respect to A_S and c_S.

Comparison of different algorithms. In this part, we demonstrate the effectiveness of our ensemble framework by combining Faster RCNN and Tiny Face+CNN, using the ROC curve and the PR curve. The ROC and PR curves are calculated from the testing results through 5-fold cross-validation, as shown in Figs. 10 and 11. From Figs. 10 and 11, we can see that the combination obtained with our framework (in green) is better than the single algorithms (in black and orange). Our framework also attains the largest area under the ROC curve (0.83) in Fig. 10 and the largest area under the PR curve (0.93), namely the mAP score. This means our framework works best on average over all possible threshold choices. Tables 2 and 3 reveal a similar phenomenon when a reasonable threshold is chosen: with a well-chosen threshold, our framework works better than the others in terms of TPR, FPR, FNR, precision, and recall.

Figure 10. Comparison with ROC.
Figure 11. Comparison with PR.

Table 2. Comparison with TPR, FPR, FNR.

Algorithm                        True positive rate   False positive rate   False negative rate
Faster RCNN                      74.7%                43.1%                 72.7%
TinyFace + CNN                   73.8%                25.8%                 51.9%
Faster RCNN + Tiny Face + CNN    75.6%                18.3%                 42.5%

Table 3. Comparison with precision and recall.

Algorithm                        Precision   Recall
Faster RCNN                      85.4%       27.3%
Tiny Face + CNN                  91.5%       48.1%
Faster RCNN + Tiny Face + CNN    92.5%       57.5%

Figure 12. Comparison with ROC for integrating Mobilenet and Tiny Face.

Our framework can also be used to integrate other complementary deep learning methods to improve their performance. As an example, we use our framework to combine Mobilenet and TinyFace+CNN, and compare the integrated results with the single algorithms. The performance is shown in Figs. 12 and 13. Similar to the previous case, the algorithm performance is generally improved. Our framework also works well when a specific threshold is chosen, as shown in Tables 4 and 5.
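The areas under the ROC and PR curves used in these comparisons can be computed directly from per-anchor confidence scores and binary ground-truth labels. The following scikit-learn sketch is illustrative only; the variable names are ours.

from sklearn.metrics import roc_auc_score, average_precision_score

def compare_algorithms(score_lists, y_true):
    """Report ROC AUC and PR AUC (average precision, the mAP-style score quoted
    in the text) for each algorithm's confidence scores on the same test anchors."""
    results = {}
    for name, scores in score_lists.items():
        results[name] = {
            "roc_auc": roc_auc_score(y_true, scores),
            "pr_auc": average_precision_score(y_true, scores),
        }
    return results

# Example (hypothetical score vectors):
# compare_algorithms({"Faster RCNN": c_F, "TinyFace+CNN": c_TC, "Ensemble": c_S}, y_true)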
Figure 13. Comparison with PR for integrating Mobilenet and Tiny Face.

Table 4. Comparison with TPR, FPR, FNR for integrating Mobilenet and Tiny Face.

Algorithm                       True positive rate   False positive rate   False negative rate
Mobilenet                       74.3%                17.4%                 32.0%
TinyFace + CNN                  73.3%                25.2%                 52.5%
Mobilenet + Tiny Face + CNN     80.0%                17.2%                 35.2%

Table 5. Comparison with precision and recall for integrating Mobilenet and Tiny Face.

Algorithm                       Precision   Recall
Mobilenet                       92.0%       69.4%
Tiny Face + CNN                 91.9%       47.7%
Mobilenet + Tiny Face + CNN     94.7%       77.7%

Through these experiments, we find that the integrated framework for two complementary models can improve the performance of the single algorithms by increasing the true positive rate, the precision rate, and the recall rate, while reducing the false positive rate and the false negative rate.

DISCUSSION

The detection accuracy of a single model is usually not satisfactory, so we use an ensemble method to integrate models and obtain better results. Considering the complementary behaviors of different algorithms, using an ensemble method for integration can effectively improve the accuracy of the detection results. For example, in our experiments the Tiny Face model with a CNN can be used to overcome the shortcomings that the Faster RCNN model has when detecting small faces. Although the proportion of small faces in the test set of this experiment is not very large, the miss rate is still one percentage point lower than that of a single model. On a test set with a large proportion of small faces, the detection accuracy of the integrated model can be improved further.

CONCLUSION

When the detection accuracy of a single deep learning model cannot meet the demands of helmet-wearing detection, we can integrate a complementary model with it to obtain better results. In addition, our framework can make single algorithms more robust to data sets from different scenarios, because it can utilize the advantages of the complementary algorithms. By analyzing a variety of object detection models, we find that it is difficult for many models to achieve high precision for helmet-wearing detection in different scenarios. Therefore, we carefully select two complementary base models and add additional modules to make them suitable for helmet-wearing detection. We ensemble the base models and build a more powerful helmet-wearing detection algorithm to further improve the detection capability. Our approach can be accelerated by GPUs and deployed on distributed computers to reduce processing time, and thus can be useful in real-world scenarios. In the future, the model can also be extended by integrating additional features or models and upgraded to mixed neural network models.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work was supported by the National Natural Science Foundation of China (NO. 61802372), the Natural Science Foundation of Zhejiang Province (NO. LGG20F020011), the Ningbo Science and Technology Innovation Project (NO. 2018B10080), and the Qianjiang Talent Plan (NO. QJD1702031). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
National Natural Science Foundation of China: 61802372.
Natural Science Foundation of Zhejiang Province: LGG20F020011.
Ningbo Science and Technology Innovation Project: 2018B10080.
Qianjiang Talent Plan: QJD1702031.

Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Zheming Fan and Chengbin Peng conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Licun Dai conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.
• Feng Cao, Jianyu Qi and Wenyi Hua performed the experiments, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Code is available as a Supplemental File. The data set is available at GitHub: https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.311#supplemental-information.

REFERENCES
Anonymous. 2020. Safety helmet wearing-dataset. Available at https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset (accessed on 3 August 2020).
Bo Y, Huan Q, Huan X, Rong Z, Hongbin L, Kebin M, Weizhong Z, Lei Z. 2019. Helmet detection under the power construction scene based on image analysis. In: 2019 IEEE 7th international conference on computer science and network technology (ICCSNT). Piscataway: IEEE, 67–71.
Choudhury T, Aggarwal A, Tomar R. 2020. A deep learning approach to helmet detection for road safety. Journal of Scientific and Industrial Research 79(06):509–512.
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T. 2013. A deep convolutional activation feature for generic visual recognition. Berkeley: UC Berkeley & ICSI.
Feng X, Jiang Y, Yang X, Du M, Li X. 2019. Computer vision algorithms and hardware implementations: a survey. Integration 69:309–320 DOI 10.1016/j.vlsi.2019.07.005.
Freund Y, Schapire RE. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139 DOI 10.1006/jcss.1997.1504.
Fukushima K, Miyake S. 1982. Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In: Competition and cooperation in neural nets. Berlin, Heidelberg: Springer, 267–285.
Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V, Garcia-Rodriguez J. 2017. A review on deep learning techniques applied to semantic segmentation. ArXiv preprint. arXiv:1704.06857.
Girshick R. 2015. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. Piscataway: IEEE, 1440–1448.
Girshick R, Donahue J, Darrell T, Malik J. 2014.
Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 580–587.
Hu P, Ramanan D. 2017. Finding tiny faces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 951–959.
Krizhevsky A, Sutskever I, Hinton GE. 2017. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6):84–90.
LeCun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324 DOI 10.1109/5.726791.
Li K, Zhao X, Bian J, Tan M. 2017a. Automatic safety helmet wearing detection. In: 2017 IEEE 7th annual international conference on CYBER technology in automation, control, and intelligent systems (CYBER). Piscataway: IEEE, 617–622.
Li J, Liu H, Wang T, Jiang M, Wang S, Li K, Zhao X. 2017b. Safety helmet wearing detection based on image processing and machine learning. In: 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI). Piscataway: IEEE, 201–205.
Liu BC, Ivers R, Norton R, Boufous S, Blows S, Lo SK. 2008. Helmets for preventing injury in motorcycle riders. Cochrane Database of Systematic Reviews 1:CD004333 DOI 10.1002/14651858.CD004333.pub3.
Liu XH, Ye XN. 2014. Skin color detection and hu moments in helmet recognition research. Journal of East China University of Science and Technology 3:365–370.
Long X, Cui W, Zheng Z. 2019. Safety helmet wearing detection based on deep learning. In: 2019 IEEE 3rd information technology, networking, electronic and automation control conference (ITNEC). Piscataway: IEEE, 2495–2499.
Redmon J, Divvala S, Girshick R, Farhadi A. 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. Piscataway: IEEE, 779–788.
Ren S, He K, Girshick R, Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. 91–99.
Rubaiyat AH, Toma TT, Kalantari-Khandani M, Rahman SA, Chen L, Ye Y, Pan CS. 2016. Automatic detection of helmet uses for construction safety. In: 2016 IEEE/WIC/ACM International Conference on Web Intelligence Workshops (WIW). Piscataway: IEEE, 135–142.
Schapire RE. 1990. The strength of weak learnability. Machine Learning 5(2):197–227.
Shi H, Chen X, Yang Y. 2019. Safety helmet wearing detection method of improved YOLOv3. Computer Engineering and Applications 55:213–220 DOI 10.3778/j.issn.1002-8331.1811-0389.
Siebert FW, Lin H. 2020. Detecting motorcycle helmet use with deep learning. Accident Analysis & Prevention 134:105319 DOI 10.1016/j.aap.2019.105319.
Thompson DC, Rivara F, Thompson R. 1999. Helmets for preventing head and facial injuries in bicyclists. Cochrane Database of Systematic Reviews 1992(2):CD001855 DOI 10.1002/14651858.CD001855.
World Health Organization. 2006. Helmets: a road safety manual for decision-makers and practitioners. Geneva: World Health Organization.
Yunbo LIU, Huang H. 2015. Research on monitoring of workers' helmet wearing at the construction site. Electronic Science and Technology 28(4):69–72.
work_35tjy7sldnhtrdrp4eqtj2nuw4 ---- International Journal of Advanced Network, Monitoring and Controls Volume 05, No.01, 2020 DOI: 10.21307/ijanmc-2020-003

Research on Enterprise Application Integration Platform Based on SOA Architecture

Liu Pingping
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, 710021, China
E-mail: 1341369601@qq.com

Lu Jiaxing
School of Computer Science and Engineering
Xi'an Technological University
Xi'an, 710021, China
E-mail: 1721653661@qq.com

Abstract—The tobacco industry was one of the earliest industries in China to pursue informatization, and it now faces a number of problems: a lack of overall planning, a wide range of application systems with a low degree of integration of information resources, and a serious "information island" problem. To establish an efficient and flexible way of exchanging information within the enterprise, an enterprise application integration platform based on the SOA architecture is proposed. The platform takes the ESB as its core, transforms enterprise information integration into a new approach that conforms to the SOA architecture, and introduces the idea of basic data management, so as to optimize the information resources of the enterprise as a whole.

Keywords-SOA Framework; Information Interaction; ESB; Information Integration

I. INTRODUCTION

SOA (Service Oriented Architecture) has been widely used in the IT industry in the 21st century. SOA is an architecture, not a technology or a method; it can also be regarded as an idea. In China, many enterprises have begun to build enterprise integration platforms based on SOA, for example Kingdee Apusic SOA. FMQM Exotica, developed by IBM's Almaden laboratory, is a distributed workflow management system based on persistent message queues, which can save workflow execution information through the persistent message queue and make all nodes in the execution process completely independent.

SOA has three significant advantages: loose coupling, coarse granularity, and location and protocol transparency. Comprehensive loose coupling is achieved through the encapsulation of services; loose coupling reduces the dependency between services, so that the flexibility of each service is improved and it is not forced to change because other services change, which greatly improves the reusability of services. Coarse granularity means that the interface of a service defined in SOA is close to the actual user operation. Location and protocol transparency means that when accessing a service defined by SOA, you do not need to know the specific location and transport protocol of the service; even if the location and transport protocol of the service change, the client that invokes the service does not need to change.

Based on an investigation and analysis of the problems of information islands, high coupling, and poor integration extensibility among the systems of the BJ cigarette factory, this paper proposes an enterprise application integration platform design scheme based on the SOA architecture that suits the actual situation of the enterprise and solves the current problems.
Therefore, based on research into the practical application of an SOA-based enterprise application integration platform in cigarette enterprises, on the SOA architecture and ESB technology, and on an analysis of the actual information integration problems of cigarette manufacturing enterprises, this paper puts forward a design scheme for an enterprise application integration platform adapted to the actual situation of the enterprise.

II. THE CURRENT SITUATION

The tobacco industry started information construction early in China, and at present its level of informatization is generally high. The BJ cigarette factory, as the main cigarette manufacturing enterprise in Shaanxi Province, has accumulated many application systems after years of information construction, covering all aspects of the factory from production to management. The main information systems include the manufacturing execution system (MES), the enterprise resource planning (ERP) system, the logistics system, the workshop data acquisition system, the centralized control system of the cut-tobacco (silk making) workshop, the power and energy management system, the human resource management system, the enterprise card system, and so on.

As the number of application systems grows, the problems are not only the disunity of basic data but also the complexity of system integration. The traditional integration method generally adopts a point-to-point mode, in which each pair of systems needs a dedicated channel to achieve integration. As shown in Figure 1, the integration of N application systems generates N * (N-1) integration channels, with high complexity, and whenever a new application system needs to be integrated the complexity rises sharply.

Figure 1. Integration complexity

In summary, the current informatization problems of the BJ cigarette factory mainly include the following aspects:

1) There are isolated information islands: some application systems are in an information-closed state due to the lack of external integration means.

2) The basic data of the enterprise is scattered across different application systems and has to be maintained separately, so it is difficult to ensure the unity of the basic data of the whole enterprise, that is, a single authoritative source for each data item. The lack of a unified basic data coding system makes information interaction difficult.

3) Basic data depends on a business system and is highly coupled to it. At present, the basic data of the enterprise mainly depends on the ERP system, yet the purpose of basic data is to provide the most fundamental data for all application systems of the whole enterprise, so relying on a single application system causes unnecessary impact on the other users of the basic data.

4) Integration scalability is poor. At present, the information systems of the whole enterprise adopt the point-to-point integration mode. If a new application system wants to join the integration system, it needs the cooperation of every existing application system, and upgrading or transforming an existing application system also involves a large number of external interface changes; hence the poor scalability.

5) The lack of management and monitoring of the data interaction process makes it difficult to find and deal with problems in the data interaction process in time. Some data have high timeliness requirements.
If such data cannot be communicated in time, the actual business is significantly affected. Therefore, effective management and monitoring measures are needed for the data interaction process.

6) Point-to-point integration aggravates the network burden: much of the exchanged data is repeated, yet it cannot be reused, resulting in a waste of resources.

Analyzing and addressing the enterprise's current problems comes down, at its core, to establishing a reasonable and efficient way of integrating information. In recent years, with the continuous development of information integration technology and the formulation of a series of standards and specifications, a new solution has gradually attracted attention: enterprise application integration (EAI) based on Service Oriented Architecture (SOA). It regards each application system in the enterprise as a service unit of the SOA architecture and establishes an enterprise application integration platform to realize information integration between the application systems. In this enterprise application integration platform, an enterprise service bus (ESB) is needed to provide standardized services. The enterprise service bus is the service operation support platform in the SOA architecture, and the services encapsulated by the other application systems run on this service bus; as shown in Figure 2, its establishment can effectively improve the enterprise's current disordered, mesh-like integration mode. Secondly, we need to establish a data exchange management platform to manage all services running on the enterprise service bus and to monitor the data interaction process in the integration. Finally, we need to establish a basic data management platform, as a service provider in the SOA architecture, to provide basic data management functions for the other application systems. The basic data management platform integrates the basic data of the other application systems and manages it uniformly, so the other systems no longer need to maintain it separately.

Figure 2. Schematic diagram of optimized enterprise integration channel

III. DESIGN AND IMPLEMENTATION

Because IBM WMB (WebSphere Message Broker) is used as the enterprise service bus, IBM DB2 is used as the database and IBM WAS (WebSphere Application Server) is used as the application server for better overall stability. According to the requirements analysis, the data exchange platform, acting as the enterprise service bus, provides a unified entry service, WS?MB. After another application system calls this service, the ESB parses and routes the incoming message and finds the corresponding registered business-processing web service to call. The data exchange management platform is responsible for the management and monitoring of the IBM WMB enterprise service bus. In the data exchange management platform, it is necessary to provide management functions such as registering, modifying, disabling, and reusing the business-processing services. At the same time, the data exchange management platform should also log the data that is sent, so as to complete the monitoring of the data exchange process.

Figure 3. Technical framework of the platform

IV. DESIGN OF THE ENTERPRISE SERVICE BUS DATA EXCHANGE PLATFORM

As the core module of the enterprise application integration platform, the data exchange platform undertakes the important work of message transmission.
Figure 4 shows the basic data exchange process. The data exchange platform publishes a unified entry service, WS?MB, as a web service. The service caller first calls this service and sends the call request to the data exchange platform in the form of an XML message. The data exchange platform analyzes the message content, finds the actual service to call, and forwards the message to the actual service provider; the service provider returns the processed result to the data exchange platform in the form of an XML message, and the data exchange platform then returns it to the original caller.

Figure 4. Flow chart of data exchange

The request message sent by the service caller should contain complete routing information, so how should the routing information be defined? From the data exchange process it can be seen that the three elements of data sender, data receiver, and the service to be called uniquely determine a data flow, so the routing information should also contain these three elements. The unified format of the service call request XML message is defined as follows:

<ID></ID>                // message ID or serial number
<name></name>            // message description
<source></source>        // data source
<target></target>        // data destination
<sername></sername>      // called service ID
<msgtype></msgtype>      // type of message (0: normal, 1: request, 2: answer)
<rtcode></rtcode>        // return value of the corresponding request (1: success, 0: failure)
<rtdesc></rtdesc>        // return description of the corresponding request
<backup1></backup1>      // reserved (standby) information
<date>XXXX/XX/XX XX:XX:XX</date>   // message sending time
<Table>
  <Row> ... </Row>
</Table>                 // the main body of the data being sent

Here Table represents a data table and Row represents a specific data record. A Table can be nested inside another Table to represent master-detail (parent and child table) data.
In the XML definition, the head part describes the basic information of the data, and the three attributes source, target, and sername are the most important routing information. Through these three attributes the service that the data needs to call can be uniquely determined, that is, the consumer of the service, the provider of the service, and the name of the service. These attributes are registered through the service management module; after the unified entry service receives the XML data, it calls the corresponding service to deliver the data according to these three attributes and the service registration information in the management module.

V. IMPLEMENTATION

The main functions of the data exchange management platform are service management and monitoring of the data exchange process. Figure 5 shows the main interface of the data exchange management platform. The frequency of data exchange can be calculated from the logs, and the volume of data received and sent by each system accessing the platform can be displayed intuitively.

Figure 5. Main interface of data exchange management platform

a) The service governance function is realized by registering and managing the web services published by each application system. As shown in Figure 6, the contents to be registered include: sequence number, system name, interface name, enabling tag, source, target, interface service name, WebService URL, namespace, calling method input object, input parameter name, output parameter name, calling method output object, extended input parameter, extended input parameter value, authentication information, WebService technology, remarks, and so on (for confidentiality reasons, the figure is not complete).

b) Figure 7 shows the implementation interface of the authority management function of the basic data management platform. The maintenance of basic data is usually carried out by the personnel in charge of the specific business, and different business personnel are usually responsible for different data. The authority management module can configure the add, modify, delete, and query permissions for various kinds of basic data according to different roles, and can also configure whether specific attributes are visible. The RBAC (role-based access control) model is adopted.

Figure 6. Data operation module

First, different roles are configured in role management; then the permissions of each role are configured through the role-function relationship; finally, the roles of different users are configured through the role-user or user-role relationship. A user can have multiple roles, and a role can be held by multiple users at the same time.

Figure 8 shows the implementation of the data synchronization module of the basic data management platform. By customizing the interface content and configuring different sending interfaces for different systems, one can configure whether each attribute column of the basic data is sent, the name of the sent column, and so on, and the sent content can also be filtered and grouped through SQL statements.

Figure 7. Authority management
Figure 8. Definition of interface service

The other application systems publish the services that receive basic data and register them in the data exchange management platform.
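As an illustration of how the unified entry service might dispatch on the three routing attributes, the following Python sketch parses a request message in the format defined in Section IV and looks up the registered target service. The registry structure, example entries, and function names are hypothetical and are not part of the platform described in the paper, which is implemented on IBM WMB/WAS.

import xml.etree.ElementTree as ET

# Hypothetical service registry keyed by the three routing attributes
# (source system, target system, service name), mirroring the registration
# contents managed by the data exchange management platform.
SERVICE_REGISTRY = {
    ("MES", "ERP", "SyncMaterial"): "http://example.local/erp/SyncMaterialService",
}

def route_message(xml_text):
    """Parse a request message and return the endpoint of the registered
    business-processing web service it should be forwarded to.
    A single root element wrapping the fields (e.g. <msg>...</msg>) is assumed."""
    msg = ET.fromstring(xml_text)
    source = msg.findtext("source")
    target = msg.findtext("target")
    sername = msg.findtext("sername")
    endpoint = SERVICE_REGISTRY.get((source, target, sername))
    if endpoint is None:
        raise LookupError("No service registered for %s" % str((source, target, sername)))
    return endpoint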
After the interface is configured in the basic data management platform, whenever maintenance of new basic data is completed, the platform automatically sends the data to the corresponding systems through the data exchange platform according to the interface configuration. All application systems adopt this mode, and the basic data is thus kept unified.

VI. CONCLUSION

In this paper, after studying the application status of the SOA architecture in the BJ cigarette enterprise, an enterprise integrated information system based on the SOA architecture is proposed to solve the enterprise's current problems of information islands, high coupling between systems, and poor integration scalability. The basic data management platform manages and synchronizes the basic data of the whole enterprise in a centralized way, so as to solve the application system integration problems caused by data inconsistency.

work_37wft6eehnazxaxgenjumsuamm ---- Parallel Algorithms for Unsupervised Tagging

Sujith Ravi
Google
Mountain View, CA 94043
sravi@google.com

Sergei Vassilivitskii
Google
Mountain View, CA 94043
sergeiv@google.com

Vibhor Rastogi∗
Twitter
San Francisco, CA
vibhor.rastogi@gmail.com

∗ The research described herein was conducted while the author was working at Google.

Abstract

We propose a new method for unsupervised tagging that finds minimal models which are then further improved by Expectation Maximization training. In contrast to previous approaches that rely on manually specified and multi-step heuristics for model minimization, our approach is a simple greedy approximation algorithm DMLC (DISTRIBUTED-MINIMUM-LABEL-COVER) that solves this objective in a single step. We extend the method and show how to efficiently parallelize the algorithm on modern parallel computing platforms while preserving approximation guarantees.
We demonstrate the power of the new algorithm by evaluating on various sequence labeling tasks: Part-of-Speech tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, and supertagging, a complex sequence labeling task, where the grammar size alone can grow to millions of entries. Our results show that for all of these settings, our method achieves state-of-the-art scalable performance that yields high quality tagging outputs.

1 Introduction

Supervised sequence labeling with large labeled training datasets is considered a solved problem. For instance, state of the art systems obtain tagging accuracies over 97% for part-of-speech (POS) tagging on the English Penn Treebank. However, learning accurate taggers without labeled data remains a challenge. The accuracies quickly drop when faced with data from a different domain, language, or when there is very little labeled information available for training (Banko and Moore, 2004).

Recently, there has been an increasing amount of research tackling this problem using unsupervised methods. A popular approach is to learn from POS-tag dictionaries (Merialdo, 1994), where we are given a raw word sequence and a dictionary of legal tags for each word type. Learning from POS-tag dictionaries is still challenging. Complete word-tag dictionaries may not always be available for use and in every setting. When they are available, the dictionaries are often noisy, resulting in high tagging ambiguity. Furthermore, when applying taggers in new domains or different datasets, we may encounter new words that are missing from the dictionary. There have been some efforts to learn POS taggers from incomplete dictionaries by extending the dictionary to include these words using some heuristics (Toutanova and Johnson, 2008) or using other methods such as type-supervision (Garrette and Baldridge, 2012).

In this work, we tackle the problem of unsupervised sequence labeling using tag dictionaries. The first reported work on this problem was on POS tagging from Merialdo (1994). The approach involved training a standard Hidden Markov Model (HMM) using the Expectation Maximization (EM) algorithm (Dempster et al., 1977), though EM does not perform well on this task (Johnson, 2007). More recent methods have yielded better performance than EM (see (Ravi and Knight, 2009) for an overview).

One interesting line of research introduced by Ravi and Knight (2009) explores the idea of performing model minimization followed by EM training to learn taggers. Their idea is closely related to the classic Minimum Description Length principle for model selection (Barron et al., 1998). They (1) formulate an objective function to find the smallest model that explains the text (model minimization step), and then, (2) fit the minimized model to the data (EM step). For POS tagging, this method (Ravi and Knight, 2009) yields the best performance to date; 91.6% tagging accuracy on a standard test dataset from the English Penn Treebank. The original work from (Ravi and Knight, 2009) uses an integer linear programming (ILP) formulation to find minimal models, an approach which does not scale to large datasets. Ravi et al. (2010b) introduced a two-step greedy approximation to the original objective function (called the MIN-GREEDY algorithm) that runs much faster while maintaining the high tagging performance.
Garrette and Baldridge (2012) showed how to use several heuristics to further improve this algorithm (for instance, better choice of tag bigrams when breaking ties) and stack other techniques on top, such as careful initialization of HMM emission models, which results in further performance gains. Their method also works under incomplete dictionary scenarios and can be applied to certain low-resource scenarios (Garrette and Baldridge, 2013) by combining model minimization with supervised training.

In this work, we propose a new scalable algorithm for performing model minimization for this task. By making an assumption on the structure of the solution, we prove that a variant of the greedy set cover algorithm always finds an approximately optimal label set. This is in contrast to previous methods that employ heuristic approaches with no guarantee on the quality of the solution. In addition, we do not have to rely on ad hoc tie-breaking procedures or careful initializations for unknown words. Finally, not only is the proposed method approximately optimal, it is also easy to distribute, allowing it to easily scale to very large datasets. We show empirically that our method, combined with an EM training step, outperforms existing state of the art systems.

1.1 Our Contributions

• We present a new method, DISTRIBUTED MINIMUM LABEL COVER, DMLC, for model minimization that uses a fast, greedy algorithm with formal approximation guarantees to the quality of the solution.

• We show how to efficiently parallelize the algorithm while preserving approximation guarantees. In contrast, existing minimization approaches cannot match the new distributed algorithm when scaling from thousands to millions or even billions of tokens.

• We show that our method easily scales to both large data and grammar sizes, and does not require the corpus or label set to fit into memory. This allows us to tackle complex tagging tasks, where the tagset consists of several thousand labels, which results in more than one million entries in the grammar.

• We demonstrate the power of the new method by evaluating under several different scenarios: POS tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, as well as a complex sequence labeling task of supertagging. Our results show that for all these settings, our method achieves state-of-the-art performance yielding high quality taggings.

2 Related Work

Recently, there has been an increasing amount of research tackling this problem from multiple directions. Some efforts have focused on inducing POS tag clusters without any tags (Christodoulopoulos et al., 2010; Reichart et al., 2010; Moon et al., 2010), but evaluating such systems proves difficult since it is not straightforward to map the cluster labels onto gold standard tags. A more popular approach is to learn from POS-tag dictionaries (Merialdo, 1994; Ravi and Knight, 2009), incomplete dictionaries (Hasan and Ng, 2009; Garrette and Baldridge, 2012) and human-constructed dictionaries (Goldberg et al., 2008).

Another direction that has been explored in the past includes bootstrapping taggers for a new language based on information acquired from other languages (Das and Petrov, 2011) or limited annotation resources (Garrette and Baldridge, 2013). Additional work focused on building supervised taggers for noisy domains such as Twitter (Gimpel et al., 2011).
While most of the relevant work in this area centers on POS tagging, there has been some work done on building taggers for more complex sequence labeling tasks such as supertagging (Ravi et al., 2010a).

Other related work includes alternative methods for learning sparse models via priors in Bayesian inference (Goldwater and Griffiths, 2007) and posterior regularization (Ganchev et al., 2010). But these methods only encourage sparsity and do not explicitly seek to minimize the model size, which is the objective function used in this work. Moreover, taggers learned using model minimization have been shown to produce state-of-the-art results for the problems discussed here.

3 Model

Following Ravi and Knight (2009), we formulate the problem as that of label selection on the sentence graph. Formally, we are given a set of sequences, $S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ is a sequence of words, $S_i = w_{i1}, w_{i2}, \ldots, w_{i,|S_i|}$. With each word $w_{ij}$ we associate a set of possible tags $T_{ij}$. We will denote by m the total number of (possibly duplicate) words (tokens) in the corpus.

Additionally, we define two special words $w_0$ and $w_\infty$ with special tags start and end, and consider the modified sequences $S'_i = w_0, S_i, w_\infty$. To simplify notation, we will refer to $w_\infty = w_{|S_i|+1}$.

The sequence label problem asks us to select a valid tag $t_{ij} \in T_{ij}$ for each word $w_{ij}$ in the input to minimize a specific objective function. We will refer to a tag pair $(t_{i,j-1}, t_{ij})$ as a label. Our aim is to minimize the number of distinct labels used to cover the full input. Formally, given a sequence $S'_i$ and a tag $t_{ij}$ for each word $w_{ij}$ in $S'_i$, let the induced set of labels for sequence $S'_i$ be

$L_i = \bigcup_{j=1}^{|S'_i|} \{(t_{i,j-1}, t_{ij})\}$.

The total number of distinct labels used over all sequences is then

$\phi = \left| \bigcup_i L_i \right| = \left| \bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\} \right|$.

Note that the order of the tokens in the label makes a difference, as {(NN, VP)} and {(VP, NN)} are two distinct labels.

Now we can define the problem formally, following (Ravi and Knight, 2009).

Problem 1 (Minimum Label Cover). Given a set S of sequences of words, where each word $w_{ij}$ has a set of valid tags $T_{ij}$, the problem is to find a valid tag assignment $t_{ij} \in T_{ij}$ for each word that minimizes the number of distinct labels or tag pairs over all sequences, $\phi = |\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}|$.

The problem is closely related to the classical Set Cover problem and is also NP-complete. To reduce Set Cover to the label selection problem, map each element i of the Set Cover instance to a single word sentence $S_i = w_{i1}$, and let the valid tags $T_{i1}$ contain the names of the sets that contain element i. Consider a solution to the label selection problem; every sentence $S_i$ is covered by two labels $(w_0, k_i)$ and $(k_i, w_\infty)$, for some $k_i \in T_{i1}$, which corresponds to an element i being covered by set $k_i$ in the Set Cover instance. Thus any valid solution to the label selection problem leads to a feasible solution to the Set Cover problem ($\{k_1, k_2, \ldots\}$) of exactly half the size.

Finally, we will use $\{\{\ldots\}\}$ notation to denote a multiset of elements, i.e. a set where an element may appear multiple times.

4 Algorithm

In this section, we describe the DISTRIBUTED-MINIMUM-LABEL-COVER, DMLC, algorithm for approximately solving the minimum label cover problem. We describe the algorithm in a centralized setting, and defer the distributed implementation to Section 5. Before describing the algorithm, we briefly explain the relationship of the minimum label cover problem to set cover.
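To make the objective $\phi$ in Problem 1 concrete, the small Python function below counts the distinct labels induced by a candidate tag assignment; each input sequence is the list of chosen tags for $S'_i$, already padded with the special start and end tags. This is an illustration of the definition only, not part of the DMLC algorithm.

def count_labels(tagged_sequences):
    """phi = number of distinct (t_{i,j-1}, t_{ij}) pairs over all sequences.
    Each element of tagged_sequences is the chosen tag sequence for S'_i,
    i.e. it already starts with 'start' and ends with 'end'."""
    labels = set()
    for tags in tagged_sequences:
        labels.update(zip(tags, tags[1:]))   # ordered tag bigrams
    return len(labels)

# Example:
# count_labels([["start", "DT", "NN", "VB", "end"],
#               ["start", "DT", "NN", "end"]])   # -> 5 distinct labels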
4.1 Modification of Set Cover

As we pointed out earlier, the minimum label cover problem is at least as hard as the Set Cover problem. An additional challenge comes from the fact that labels are tags for a pair of words, and hence are related. For example, if we label a word pair (wi,j−1, wij) as (NN, VP), then the label for the next word pair (wij, wi,j+1) has to be of the form (VP, *), i.e., it has to start with VP. Previous work (Ravi et al., 2010a; Ravi et al., 2010b) recognized this challenge and employed two-phase heuristic approaches. Eschewing heuristics, we will show that with one natural assumption, even with this extra set of constraints, the standard greedy algorithm for this problem results in a solution with a provable approximation ratio of O(log m). In practice, however, the algorithm performs far better than the worst-case ratio, and similar to the work of (Gomes et al., 2006), we find that the greedy approach selects a cover approximately 11% worse than the optimum solution.

Algorithm 1: MLC Algorithm
1: Input: A set of sequences S with each word wij having possible tags Tij.
2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels.
3: Let M be the multiset of all possible labels generated by choosing each possible tag t ∈ Tij:
$$M = \bigcup_i \bigcup_{j=1}^{|S_i|+1} \bigcup_{\substack{t' \in T_{i,j-1} \\ t \in T_{ij}}} \{\{(t', t)\}\} \qquad (1)$$
4: Let L = ∅ be the set of selected labels.
5: repeat
6:   Select the most frequent label not yet selected: (t′, t) = arg max(s′,s)∉L |M ∩ (s′, s)|.
7:   For each bigram (wi,j−1, wij) where t′ ∈ Ti,j−1 and t ∈ Tij, tentatively assign t′ to wi,j−1 and t to wij. Add (t′, t) to L.
8:   If a word gets two assignments, select one at random with equal probability.
9:   If a bigram (wij, wi,j+1) is consistent with the assignments in (t, t′), fix the tentative assignments, and set Ti,j−1 = {t′} and Tij = {t}. Recompute M, the multiset of possible labels, with the updated Ti,j−1 and Tij.
10: until there are no unassigned words

Algorithm 2: DMLC Implementation
1: Input: A set of sequences S with each word wij having possible tags Tij.
2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels.
3: (Graph Creation) Initialize each vertex vij with the set of possible tags Tij and its neighbors vi,j+1 and vi,j−1.
4: repeat
5:   (Message Passing) Each vertex vij sends its possible tags Tij to its forward neighbor vi,j+1.
6:   (Counter Update) Each vertex receives the tags Ti,j−1 and adds all possible labels {(s, s′) | s ∈ Ti,j−1, s′ ∈ Tij} to a global counter (M).
7:   (MaxLabel Selection) Each vertex queries the global counter M to find the maximum label (t, t′).
8:   (Tentative Assignment) Each vertex vij selects a tag tentatively as follows: if one of the tags t, t′ is in the feasible set Tij, it tentatively selects the tag.
9:   (Random Assignment) If both are feasible, it selects one at random. The vertex communicates its assignment to its neighbors.
10:  (Confirmed Assignment) Each vertex receives the tentative assignment from its neighbors. If together with its neighbors it can match the selected label, the assignment is finalized. If the assigned tag is t, then the vertex vij sets the valid tag set Tij to {t}.
11: until no unassigned vertices exist.

4.2 MLC Algorithm

We present in Algorithm 1 our MINIMUM LABEL COVER algorithm to approximately solve the minimum label cover problem. The algorithm is simple, efficient, and easy to distribute.
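For readers who prefer running code to pseudocode, the sketch below is a simplified single-machine rendering of the greedy loop of Algorithm 1, written by us for illustration. It collapses Steps 7–9 (tentative assignments are fixed immediately and the random tie-breaking is omitted), so it conveys the flow of the method rather than reproducing the exact procedure.

```python
from collections import Counter

def mlc_greedy(tag_sets):
    """Greedy label-cover sketch. tag_sets[i][j] is the set of feasible tags for
    word j of sentence i; boundaries get the pseudo-tags 'start' and 'end'."""
    feasible = [[{"start"}] + [set(t) for t in sent] + [{"end"}] for sent in tag_sets]
    chosen = []                                    # selected labels, in order
    while True:
        counts = Counter()                         # multiset M of possible labels
        undecided = False
        for sent in feasible:
            for j in range(1, len(sent)):
                if len(sent[j - 1]) > 1 or len(sent[j]) > 1:
                    undecided = True
                for prev in sent[j - 1]:
                    for cur in sent[j]:
                        if (prev, cur) not in chosen:
                            counts[(prev, cur)] += 1
        if not undecided or not counts:
            break
        label = counts.most_common(1)[0][0]        # greedy step: most frequent label
        chosen.append(label)
        prev, cur = label
        for sent in feasible:                      # fix every bigram the label covers
            for j in range(1, len(sent)):
                if prev in sent[j - 1] and cur in sent[j]:
                    sent[j - 1], sent[j] = {prev}, {cur}
    return feasible, chosen

# Toy run: two short sentences with ambiguous tag sets (invented for illustration).
demo = [[{"DT"}, {"NN", "VB"}, {"VB", "NN"}],
        [{"DT"}, {"NN"}, {"VB", "NN"}]]
print(mlc_greedy(demo)[1])
```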
The algorithm chooses labels one at a time, select- ing a label that covers as many words as possible in every iteration. For this, it generates and maintains a multi-set of all possible labels M (Step 3). The multi-set contains an occurrence of each valid label, for example, if wi,j−1 has two possible valid tags NN and VP, and wij has one possible valid tag VP, then M will contain two labels, namely (NN, VP) and (VP, VP). Since M is a multi-set it will contain duplicates, e.g. the label (NN, VP) will appear for each adjacent pair of words that have NN and VP as valid tags, respectively. In each iteration, the algorithm picks a label with the most number of occurrences in M and adds it to the set of chosen labels (Step 6). Intuitively, this is a greedy step to select a label that covers the most number of word pairs. Once the algorithm picks a label (t′, t), it tries to assign as many words to tags t or t′ as possible (Step 7). A word can be assigned t′ if t′ is a valid tag for it, and t a valid tag for the next word in sequence. Similarly, a word can be assigned t, if t is a valid tag for it, and t′ a valid tag for the previous word. Some words can get both assignments, in which case we choose one tentatively at random (Step 8). If a word’s tentative random tag, say t, is consistent with the choices of its adjacent words (say t′ from the previous word), then the tentative choice is fixed as a permanent one. Whenever a tag is selected, the set of valid tags Tij for the word is reduced to a sin- gleton {t}. Once the set of valid tags Tij changes, the multi-set M of all possible labels also changes, as seen from Eq 1. The multi-set is then recom- puted (Step 9) and the iterations repeated until all of words have been tagged. We can show that under a natural assumption this simple algorithm is approximately optimal. Assumption 1 (c-feasibility). Let c ≥ 1 be any num- ber, and k be the size of the optimal solution to the original problem. In each iteration, the MLC algo- rithm fixes the tags for some words. We say that the algorithm is c-feasible, if after each iteration there exists some solution to the remaining problem, con- sistent with the chosen tags, with size at most ck . The assumption encodes the fact that a single bad greedy choice is not going to destroy the overall structure of the solution, and a nearly optimal so- lution remains. We note that this assumption of c- feasibility is not only sufficient, as we will formally show, but is also necessary. Indeed, without any as- sumptions, once the algorithm fixes the tag for some words, an optimal label may no longer be consis- tent with the chosen tags, and it is not hard to find contrived examples where the size of the optimal so- lution doubles after each iteration of MLC. Since the underlying problem is NP-complete, it is computationally hard to give direct evidence ver- ifying the assumption on natural language inputs. However, on small examples we are able to show that the greedy algorithm is within a small constant factor of the optimum, specifically it is within 11% of the optimum model size for the POS tagging problem using the standard 24k dataset (Ravi and Knight, 2009). Combined with the fact that the final method outperforms state of the art approaches, this leads us to conclude that the structural assumption is well justified. Lemma 1. Under the assumption of c-feasibility, the MLC algorithm achieves a O(c log m) approx- imation to the minimum label cover problem, where m = ∑ i |Si| is the total number of tokens. Proof. 
To prove the Lemma we will define an objective function φ̄, counting the number of unlabeled word pairs, as a function of possible labels, and show that φ̄ decreases by a factor of (1 − O(1/ck)) at every iteration.

To define φ̄, we first define φ, the number of labeled word pairs. Consider a particular set of labels, L = {L1, L2, ..., Lk}, where each label is a pair (ti, tj). Call {tij} a valid assignment of tokens if for each wij we have tij ∈ Tij. Then the score of L under an assignment t, which we denote by φt, is the number of bigram labels that appear in L. Formally,

$$\phi_t(L) = \Bigl|\bigcup_{i,j} \{\{(t_{i,j-1}, t_{ij})\}\} \cap L\Bigr|.$$

Finally, we define φ(L) to be the best such assignment, φ(L) = maxt φt(L), and φ̄(L) = m − φ(L) the number of uncovered labels.

Consider the label selected by the algorithm in every step. By the c-feasibility assumption, there exists some solution having ck labels. Thus, some label from that solution covers at least a 1/ck fraction of the remaining words. The selected label (t, t′) maximizes the intersection with the remaining feasible labels. The conflict resolution step ensures that in expectation the realized benefit is at least a half of the maximum, thereby reducing φ̄ by at least a (1 − 1/2ck) fraction. Therefore, after O(kc log m) operations all of the labels are covered.

4.3 Fitting the Model Using EM

Once the greedy algorithm terminates and returns a minimized grammar of tag bigrams, we follow the approach of Ravi and Knight (2009) and fit the minimized model to the data using the alternating EM strategy.

In this step, we run an alternating optimization procedure iteratively in phases. In each phase, we initialize (and prune away) parameters within the two HMM components (transition or emission model) using the output from the previous phase. We initialize this procedure by restricting the transition parameters to only those tag bigrams selected in the model minimization step. We train in conjunction with the original emission model using the EM algorithm, which prunes away some of the emission parameters. In the next phase, we alternate the initialization by choosing the pruned emission model along with the original transition model (with the full set of tag bigrams) and retrain using EM. The alternating EM iterations are terminated when the change in the size of the observed grammar (i.e., the number of unique bigrams in the tagging output) is ≤ 5%.¹ We refer to our entire approach using greedy minimization followed by EM training as DMLC + EM.

¹For more details on the alternating EM strategy and how initialization with minimized models improves EM performance in alternating iterations, refer to (Ravi and Knight, 2009).

5 Distributed Implementation

The DMLC algorithm is directly suited towards parallelization across many machines. We turn to Pregel (Malewicz et al., 2010), and its open source version Giraph (Apa, 2013). In these systems the computation proceeds in rounds. In every round, every machine does some local processing and then sends arbitrary messages to other machines. Semantically, we think of the communication graph as fixed, and in each round each vertex performs some local computation and then sends messages to its neighbors. This mode of parallel programming directs the programmers to "Think like a vertex." The specific systems like Pregel and Giraph build infrastructure that ensures that the overall system is fault tolerant, efficient, and fast.
In addition, they provide implementation of commonly used dis- tributed data structures, such as, for example global counters. The programmer’s job is simply to specify the code that each vertex will run at every round. We implemented the DMLC algorithm in Pregel. The implementation is straightforward and given in Algorithm 2. The multi-set M of Algorithm 1 is represented as a global counter in Algorithm 2. The message passing (Step 3) and counter update (Step 4) steps update this global counter and hence per- form the role of Step 3 of Algorithm 1. Step 5 se- lects the label with largest count, which is equivalent to the greedy label picking step 6 of Algorithm 1. Fi- nally steps 6, 7, and 8 update the tag assignment of each vertex performing the roles of steps 7, 8, and 9, respectively, of Algorithm 1. 5.1 Speeding up the Algorithm The implementation described above directly copies the sequential algorithm. Here we describe addi- tional steps we took to further improve the parallel running times. Singleton Sets: As the parallel algorithm pro- ceeds, the set of feasible sets associated with a node slowly decreases. At some point there is only one tag that a node can take on, however this tag is rare, and so it takes a while for it to be selected using the greedy strategy. Nevertheless, if a node and one of its neighbors have only a single tag left, then it is safe to assign the unique label 2. Modifying the Graph: As is often the case, the bottleneck in parallel computations is the commu- nication. To reduce the amount of communication we reduce the graph on the fly, removing nodes and edges once they no longer play a role in the compu- tation. This simple modification decreases the com- munication time in later rounds as the total size of the problem shrinks. 6 Experiments and Results In this Section, we describe the experimental setup for various tasks, settings and compare empirical performance of our method against several existing 2We must judiciously initialize the global counter to take care of this assignment, but this is easily accomplished. baselines. The performance results for all systems (on all tasks) are measured in terms of tagging accu- racy, i.e. % of tokens from the test corpus that were labeled correctly by the system. 6.1 Part-of-Speech Tagging Task 6.1.1 Tagging Using a Complete Dictionary Data: We use a standard test set (consisting of 24,115 word tokens from the Penn Treebank) for the POS tagging task. The tagset consists of 45 dis- tinct tag labels and the dictionary contains 57,388 word/tag pairs derived from the entire Penn Tree- bank. Per-token ambiguity for the test data is about 1.5 tags/token. In addition to the standard 24k dataset, we also train and test on larger data sets— 973k tokens from the Penn Treebank, 3M tokens from PTB+Europarl (Koehn, 2005) data. Methods: We evaluate and compare performance for POS tagging using four different methods that employ the model minimization idea combined with EM training: • EM: Training a bigram HMM model using EM algorithm (Merialdo, 1994). • ILP + EM: Minimizing grammar size using integer linear programming, followed by EM training (Ravi and Knight, 2009). • MIN-GREEDY + EM: Minimizing grammar size using the two-step greedy method (Ravi et al., 2010b). • DMLC + EM: This work. Results: Table 1 shows the results for POS tag- ging on English Penn Treebank data. 
On the smaller test datasets, all of the model minimization strategies (methods 2, 3, 4) tend to perform equally well, yielding state-of-the-art results and a large improvement over standard EM. When training (and testing) on larger corpora sizes, DMLC yields the best reported performance on this task to date. A major advantage of the new method is that it can easily scale to large corpora sizes, and the distributed nature of the algorithm still permits fast, efficient optimization of the global objective function. So, unlike the earlier methods (such as MIN-GREEDY), it is fast enough to run on several millions of tokens to yield additional performance gains (shown in the last column).

Speedups: We also observe a significant speedup when using the parallelized version of the DMLC algorithm. Performing model minimization on the 24k tokens dataset takes 55 seconds on a single machine, whereas parallelization permits model minimization to be feasible even on large datasets. Fig 1 shows the running time for DMLC when run on a cluster of 100 machines. We vary the input data size from 1M word tokens to about 8M word tokens, while holding the resources constant. Both the algorithm and its distributed implementation in DMLC are linear time operations, as evident from the plot. In fact, for comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes, showing even better than linear speedup. The reason for this is that the distributed version has a constant overhead for initialization, independent of the data size, while the running time of the rest of the implementation is linear in the data size. Thus, as the data size becomes larger, the constant overhead becomes less significant, and the distributed implementation appears to complete slightly faster as the data size increases.

Figure 1: Runtime vs. data size (measured in # of word tokens) on 100 machines. For comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes showing a better than linear speedup.

Table 1: Results for unsupervised part-of-speech tagging on the English Penn Treebank dataset. Tagging accuracies for different methods are shown on multiple datasets. te shows the size (number of tokens) of the test data, tr represents the size of the raw text used to perform model minimization.

Method                                     Tagging accuracy (%)
                                           te=24k    te=973k
                                           tr=24k    tr=973k   tr=3.7M
1. EM                                      81.7      82.3      -
2. ILP + EM (Ravi and Knight, 2009)        91.6      -         -
3. MIN-GREEDY + EM (Ravi et al., 2010b)    91.6      87.1      -
4. DMLC + EM (this work)                   91.4      87.5      87.8

6.1.2 Tagging Using Incomplete Dictionaries

We also evaluate our approach for POS tagging under other resource-constrained scenarios. Obtaining a complete dictionary is often difficult, especially for new domains. To verify the utility of our method when the input dictionary is incomplete, we evaluate against standard datasets used in previous work (Garrette and Baldridge, 2012) and compare against the previous best reported performance for the same task.
In all the experiments (described here and in subsequent sections), we use the fol- lowing terminology—raw data refers to unlabeled text used by different methods (for model minimiza- tion or other unsupervised training procedures such as EM), dictionary consists of word/tag entries that are legal, and test refers to data over which tagging evaluation is performed. English Data: For English POS tagging with in- complete dictionary, we evaluate on the Penn Tree- bank (Marcus et al., 1993) data. Following (Garrette and Baldridge, 2012), we extracted a word-tag dic- tionary from sections 00-15 (751,059 tokens) con- sisting of 39,087 word types, 45,331 word/tag en- tries, a per-type ambiguity of 1.16 yielding a per- token ambiguity of 2.21 on the raw corpus (treating unknown words as having all 45 possible tags). As in their setup, we then use the first 47,996 tokens of section 16 as raw data and perform final evalua- tion on the sections 22-24. We use the raw corpus along with the unlabeled test data to perform model minimization and EM training. Unknown words are allowed to have all possible tags in both these pro- cedures. Italian Data: The minimization strategy pre- sented here is a general-purpose method that does not require any specific tuning and works for other languages as well. To demonstrate this, we also per- form evaluation on a different language (Italian) us- ing the TUT corpus (Bosco et al., 2000). Follow- ing (Garrette and Baldridge, 2012), we use the same data splits as their setting. We take the first half of each of the five sections to build the word-tag dic- tionary, the next quarter as raw data and the last quarter as test data. The dictionary was constructed from 41,000 tokens comprised of 7,814 word types, 8,370 word/tag pairs, per-type ambiguity of 1.07 and a per-token ambiguity of 1.41 on the raw data. The raw data consisted of 18,574 tokens and the test con- tained 18,763 tokens. We use the unlabeled corpus from the raw and test data to perform model mini- mization followed by unsupervised EM training. Other Languages: In order to test the effective- ness of our method in other non-English settings, we also report the performance of our method on sev- eral other Indo-European languages using treebank data from CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). The corpus statistics for the five languages (Danish, Greek, Italian, Portuguese and Spanish) are listed below. For each language, we construct a dictionary from the raw training data. The unlabeled corpus from the raw training and test data is used to perform model minimization fol- lowed by unsupervised EM training. As before, un- known words are allowed to have all possible tags. We report the final tagging performance on the test data and compare it to baseline EM. Garrette and Baldridge (2012) treat unknown words (words that appear in the raw text but are missing from the dictionary) in a special manner and use several heuristics to perform better initialization for such words (for example, the probability that an unknown word is associated with a particular tag is conditioned on the openness of the tag). They also use an auto-supervision technique to smooth counts learnt from EM onto new words encountered dur- ing testing. In contrast, we do not apply any such technique for unknown words and allow them to be mapped uniformly to all possible tags in the dictio- nary. 
For this particular set of experiments, the only difference from the Garrette and Baldridge (2012) setup is that we include unlabeled text from the test data (but without any dictionary tag labels or special heuristics) to our existing word tokens from raw text for performing model minimization. This is a standard practice used in unsupervised training scenarios (for example, Bayesian inference methods) and in general for scalable techniques where the goal is to perform inference on the same data for which one wishes to produce some structured prediction.

Language       Train (tokens)   Dict (entries)   Test (tokens)
DANISH         94386            18797            5852
GREEK          65419            12894            4804
ITALIAN        71199            14934            5096
PORTUGUESE     206678           30053            5867
SPANISH        89334            17176            5694

Results: Table 2 (column 2) compares previously reported results against our approach for English. We observe that our method obtains a huge improvement over standard EM and gets comparable results to the previous best reported scores for the same task from (Garrette and Baldridge, 2012). It is encouraging to note that the new system achieves this performance without using any of the carefully-chosen heuristics employed by the previous method. However, we do note that some of these techniques can be easily combined with our method to produce further improvements.

Table 2 (column 3) also shows results on Italian POS tagging. We observe that our method achieves significant improvements in tagging accuracy over all the baseline systems, including the previous best system (+2.9%). This demonstrates that the method generalizes well to other languages and produces consistent tagging improvements over existing methods for the same task.

Results for POS tagging on CoNLL data in five different languages are displayed in Figure 2. Note that the proportion of raw data in test versus train (from the standard CoNLL shared tasks) is much smaller compared to the earlier experimental settings. In general, we observe that adding more raw data for EM training improves the tagging quality (the same trend observed earlier in Table 1: column 2 versus column 3). Despite this, DMLC + EM still achieves significant improvements over the baseline EM system on multiple languages (as shown in Figure 2). An additional advantage of the new method is that it can easily scale to larger corpora and it produces a much more compact grammar that can be efficiently incorporated for EM training.

Figure 2: Part-of-Speech tagging accuracy for different languages (Danish, Greek, Italian, Portuguese and Spanish) on CoNLL data using incomplete dictionaries, comparing EM and DMLC + EM.

6.1.3 Tagging for Low-Resource Languages

Learning part-of-speech taggers for severely low-resource languages (e.g., Malagasy) is very challenging. In addition to scarce (token-supervised) labeled resources, the tag dictionaries available for training taggers are tiny compared to other languages such as English. Garrette and Baldridge (2013) combine various supervised and semi-supervised learning algorithms into a common POS tagger training pipeline to address some of these challenges. They also report tagging accuracy improvements on low-resource languages when using the combined system over any single algorithm.
Their system has four main parts, in order: (1) tag dictionary expansion using a label propagation algorithm, (2) weighted model minimization, (3) Expectation Maximization (EM) training of HMMs using auto-supervision, (4) MaxEnt Markov Model (MEMM) training. The entire procedure results in a trained tagger model that can then be applied to tag any raw data.³ Step 2 in this procedure involves a weighted version of model minimization, which uses the multi-step greedy approach from Ravi et al. (2010b) enhanced with additional heuristics that use tag weights learnt via label propagation (in Step 1) within the minimization process.

³For more details, refer to (Garrette and Baldridge, 2013).

We replace the model minimization procedure in their Step 2 with our method (DMLC + EM) and directly compare this new system with their approach in terms of tagging accuracy. Note that for all other steps in the pipeline we follow the same procedure (and run the same code) as Garrette and Baldridge (2013), including the same smoothing procedure for EM initialization in Step 3.

Data: We use the exact same setup as Garrette and Baldridge (2013) and run experiments on Malagasy, an Austronesian language spoken in Madagascar. We use the publicly available data⁴: 100k raw tokens for training, a word-tag dictionary acquired with 4 hours of human annotation effort (used for type-supervision), and a held-out test dataset (5341 tokens). We provide the unlabeled corpus from the raw training data along with the word-tag dictionary as input to model minimization and evaluate on the test corpus. We run multiple experiments for different (incomplete) dictionary scenarios: (a) small = 2773 word/tag pairs, (b) tiny = 329 word/tag pairs.

⁴github.com/dhgarrette/low-resource-pos-tagging-2013

Results: Table 3 shows results on Malagasy data comparing a system that employs (unweighted) DMLC against the existing state-of-the-art system that incorporates a multi-step weighted model minimization combined with additional heuristics. We observe that switching to the new model minimization procedure alone yields a significant improvement in tagging accuracy under both dictionary scenarios. It is encouraging that a better minimization procedure also leads to higher tagging quality on the unknown word tokens (column 4 in the table), even when the input dictionary is tiny.

Table 2: Part-of-Speech tagging accuracy using PTB sections 00-15 and TUT to build the tag dictionary. For comparison, we also include the results for the previously reported state-of-the-art system (method 3) for the same task.

Method                                                                    Tagging accuracy (%)
                                                                          English (PTB 00-15)   Italian (TUT)
1. Random                                                                 63.53                 62.81
2. EM                                                                     69.20                 60.70
3. Type-supervision + HMM initialization (Garrette and Baldridge, 2012)   88.52                 72.86
4. DMLC + EM (this work)                                                  88.11                 75.79

Table 3: Part-of-Speech tagging accuracy for a low-resource language (Malagasy) on All/Known/Unknown tokens in the test data. Tagging performance is shown for multiple experiments using different (incomplete) dictionary sizes: (a) small, (b) tiny (shown in parentheses). The new method (row 2) significantly outperforms the existing method with p < 0.01 for the small dictionary and p < 0.05 for the tiny dictionary.

Method                                                      Tagging accuracy (%)
                                                            Total         Known         Unknown
Low-resource tagging using (Garrette and Baldridge, 2013)   80.7 (70.2)   87.6 (90.3)   66.1 (45.1)
Low-resource tagging using DMLC + EM (this work)            81.1 (70.8)   87.9 (90.3)   66.7 (46.5)
6.2 Supertagging

Compared to POS tagging, a more challenging task is learning supertaggers for lexicalized grammar formalisms such as Combinatory Categorial Grammar (CCG) (Steedman, 2000). For example, CCGbank (Hockenmaier and Steedman, 2007) contains 1241 distinct supertags (lexical categories) and the most ambiguous word has 126 supertags. This provides a much more challenging starting point for the semi-supervised methods typically applied to the task. Yet, this is an important task since creating grammars and resources for CCG parsers for new domains and languages is highly labor- and knowledge-intensive.

As described earlier, our approach scales easily to large datasets as well as label sizes. To evaluate it on the supertagging task, we use the same dataset from (Ravi et al., 2010a) and compare against their baseline method that uses a modified (two-step) version of the ILP formulation for model minimization.

Table 4: Results for unsupervised supertagging with a dictionary. Here, we report the total accuracy as well as accuracy on just the ambiguous tokens (i.e., tokens which have more than one tagging possibility). *The baseline method 2 requires several pre-processing steps in order to run feasibly for this task (described in Section 6.2). In contrast, the new approach (DMLC) runs fast and also permits efficient parallelization.

Method                               Supertagging accuracy (%)
                                     Ambiguous     Total
1. EM                                38.7          45.6
2. ILP* + EM (Ravi et al., 2010a)    52.1          57.3
3. DMLC + EM (this work)             55.9          59.3

Data: We use the CCGbank data for this experiment. This data was created by semi-automatically converting the Penn Treebank to CCG derivations (Hockenmaier and Steedman, 2007). We use the standard splits of the data used in semi-supervised tagging experiments (Banko and Moore, 2004): sections 0-18 for training (i.e., to construct the word-tag dictionary), and sections 22-24 for test.

Results: Table 4 compares the results for two baseline systems: standard EM (method 1), and a previously reported system using model minimization (method 2) for the same task. We observe that DMLC produces better taggings than either of these and yields a significant improvement in accuracy (+2% overall, +3.8% on ambiguous tokens). Note that it is not feasible to run the ILP-based baseline (method 2 in the table) directly since it is very slow in practice, so Ravi et al. (2010a) use a set of pre-processing steps to prune the original grammar size (unique tag pairs) from >1M to several thousand entries, followed by a modified two-step ILP minimization strategy. This is required to permit their model minimization step to be run in a feasible manner. On the other hand, the new approach DMLC (method 3) scales better even when the data/label sizes are large, hence it can be run with the full data using the original model minimization formulation (rather than a two-step heuristic).

Ravi et al. (2010a) also report further improvements using an alternative approach involving an ILP-based weighted minimization procedure. In Section 7 we briefly discuss how the DMLC method can be extended to this setting and combined with other similar methods.

7 Discussion and Conclusion

We present a fast, efficient model minimization algorithm for unsupervised tagging that improves upon previous two-step heuristics. We show that under a fairly natural assumption of c-feasibility the solution obtained by our minimization algorithm is O(c log m)-approximate to the optimal.
Although in the case of two-step heuristics, the first step guar- antees an O(log m)-approximation, the second step, which is required to get a consistent solution, can introduce many additional labels resulting in a so- lution arbitrarily away from the optimal. Our one step approach ensures consistency at each step of the algorithm, while the c-feasibility assumption means that the solution does not diverge too much from the optimal in each iteration. In addition to proving approximation guarantees for the new algorithm, we show that it is paralleliz- able, allowing us to easily scale to larger datasets than previously explored. Our results show that the algorithm achieves state-of-the-art performance, outperforming existing methods on several differ- ent tasks (both POS tagging and supertagging) and works well even with incomplete dictionaries and extremely low-resource languages like Malagasy. For future work, it would be interesting to apply a weighted version of the DMLC algorithm where la- bels (i.e., tag pairs) can have different weight distri- butions instead of uniform weights. Our algorithm can be extended to allow an input weight distribu- tion to be specified for minimization. In order to initialize the weights we could use existing strate- gies such as grammar-informed initialization (Ravi et al., 2010a) or output distributions learnt via other methods such as label propagation (Garrette and Baldridge, 2013). References 2013. Apache giraph. http://giraph.apache. org/. Michele Banko and Robert C. Moore. 2004. Part-of- speech tagging in context. In Proceedings of COLING, pages 556–561. Andrew R Barron, Jorma Rissanen, and Bin Yu. 1998. The Minimum Description Length Principle in Cod- ing and Modeling. IEEE Transactions of Information Theory, 44(6):2743–2760. Cristina Bosco, Vincenzo Lombardo, Daniela Vassallo, and Leonardo Lesmo. 2000. Building a Treebank for Italian: a data-driven annotation schema. In Proceed- ings of the Second International Conference on Lan- guage Resources and Evaluation LREC-2000, pages 99–105. Sabine Buchholz and Erwin Marsi. 2006. Conll-x shared task on multilingual dependency parsing. In Proceed- ings of CoNLL, pages 149–164. Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2010. Two decades of unsupervised POS induction: How far have we come? In Proceed- ings of the Conference on Empirical Methods in Natu- ral Language Processing (EMNLP), pages 575–584. Dipanjan Das and Slav Petrov. 2011. Unsupervised part- of-speech tagging with bilingual graph-based projec- tions. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Hu- man Language Technologies - Volume 1, pages 600– 609. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Se- ries B, 39(1):1–38. Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for struc- tured latent variable models. Journal of Machine Learning Research, 11:2001–2049. Dan Garrette and Jason Baldridge. 2012. Type- supervised Hidden Markov Models for part-of-speech tagging with incomplete tag dictionaries. In Proceed- ings of the Conference on Empirical Methods in Nat- ural Language Processing and Computational Natu- ral Language Learning (EMNLP-CoNLL), pages 821– 831. Dan Garrette and Jason Baldridge. 2013. Learning a part-of-speech tagger from two hours of annotation. 
In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, pages 138–147. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: annotation, features, and experiments. In Pro- ceedings of the 49th Annual Meeting of the Associa- tion for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 42–47. Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL, pages 746–754. Sharon Goldwater and Thomas L. Griffiths. 2007. A fully Bayesian approach to unsupervised part-of- speech tagging. In ACL. Fernando C. Gomes, Cludio N. Meneses, Panos M. Pardalos, and Gerardo Valdisio R. Viana. 2006. Ex- perimental analysis of approximation algorithms for the vertex cover and set covering problems. Kazi Saidul Hasan and Vincent Ng. 2009. Weakly super- vised part-of-speech tagging for morphologically-rich, resource-scarce languages. In Proceedings of the 12th Conference on the European Chapter of the Associa- tion for Computational Linguistics, pages 363–371. Julia Hockenmaier and Mark Steedman. 2007. CCG- bank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Compu- tational Linguistics, 33(3):355–396. Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In Proceedings of the Joint Conference on Empirical Methods in Natural Language Process- ing and Computational Natural Language Learning (EMNLP-CoNLL), pages 296–305. Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Machine Transla- tion Summit X, pages 79–86. Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grze- gorz Czajkowski. 2010. Pregel: a system for large- scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Manage- ment of data, pages 135–146. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated cor- pus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171. Taesun Moon, Katrin Erk, and Jason Baldridge. 2010. Crouching Dirichlet, Hidden Markov Model: Unsu- pervised POS tagging with context local tag genera- tion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 196– 206. Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDon- ald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 915–932. Sujith Ravi and Kevin Knight. 2009. Minimized models for unsupervised part-of-speech tagging. In Proceed- ings of the Joint Conferenceof the 47th Annual Meet- ing of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natu- ral Language Processing (ACL-IJCNLP), pages 504– 512. Sujith Ravi, Jason Baldridge, and Kevin Knight. 2010a. Minimized models and grammar-informed initializa- tion for supertagging with highly ambiguous lexicons. 
In Proceedings of the 48th Annual Meeting of the As- sociation for Computational Linguistics (ACL), pages 495–503. Sujith Ravi, Ashish Vaswani, Kevin Knight, and David Chiang. 2010b. Fast, greedy model minimization for unsupervised tagging. In Proceedings of the 23rd In- ternational Conference on Computational Linguistics (COLING), pages 940–948. Roi Reichart, Raanan Fattal, and Ari Rappoport. 2010. Improved unsupervised POS induction using intrinsic clustering quality and a Zipfian constraint. In Proceed- ings of the Fourteenth Conference on Computational Natural Language Learning, pages 57–66. Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA, USA. Kristina Toutanova and Mark Johnson. 2008. A Bayesian LDA-based model for semi-supervised part- of-speech tagging. In Advances in Neural Information Processing Systems (NIPS), pages 1521–1528. work_3bm3dechy5didpwd3zolcdcwti ---- Analysis of historical road accident data supporting autonomous vehicle control strategies Analysis of historical road accident data supporting autonomous vehicle control strategies Sándor Szénási1,2 1 Faculty of Economics and Informatics, J. Selye University, Komárno, Slovakia 2 John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary ABSTRACT It is expected that most accidents occurring due to human mistakes will be eliminated by autonomous vehicles. Their control is based on real-time data obtained from the various sensors, processed by sophisticated algorithms and the operation of actuators. However, it is worth noting that this process flow cannot handle unexpected accident situations like a child running out in front of the vehicle or an unexpectedly slippery road surface. A comprehensive analysis of historical accident data can help to forecast these situations. For example, it is possible to localize areas of the public road network, where the number of accidents related to careless pedestrians or bad road surface conditions is significantly higher than expected. This information can help the control of the autonomous vehicle to prepare for dangerous situations long before the real-time sensors provide any related information. This manuscript presents a data-mining method working on the already existing road accident database records to find the black spots of the road network. As a next step, a further statistical approach is used to find the significant risk factors of these zones, which result can be built into the controlling strategy of self- driven cars to prepare them for these situations to decrease the probability of the potential further incidents. The evaluation part of this paper shows that the robustness of the proposed method is similar to the already existing black spot searching algorithms. However, it provides additional information about the main accident patterns. Subjects Autonomous Systems, Data Mining and Machine Learning, Spatial and Geographic Information Systems Keywords Data mining, DBSCAN, Road accident, Statistics, Autonomous vehicle, Road safety INTRODUCTION Human drivers have many disadvantages compared to autonomous vehicles (slower reaction time, inattentiveness, variable physical condition) (Kertesz & Felde, 2020). Nevertheless, they can often perform better (Chatterjee et al., 2002) in some unexpected situations like a child running out in front of the vehicle. 
This is because, beyond the information gained in real time, they may have specific knowledge about a given location (linked to the previous example, the human driver may know that there is a playground without a fence near the road; therefore, the appearance of a child is not unexpected). Drivers also have some incomplete but useful historical knowledge about accidents, and they can build this information into their driving behavior. If they know that there were several pedestrian collisions somewhere, they will decrease their speed and try to be more attentive without triggering real-time signals. Thanks to this behavior, they can prepare for and avoid some types of accidents, which would not be possible without this historical data.

We propose the following consecutive steps to integrate historical data into the control algorithm for autonomous devices:

1. Localize accident black spots in an already existing accident database, using statistical or data-mining methods;
2. Determine the common reasons for these accidents with statistical analysis or pattern matching;
3. Specify the necessary preventive steps to decrease the probability of further accidents.

This article mainly focuses on the first two steps because the third one largely depends on the limits and equipment of the self-driven car. For example, in the case of dangerous areas, is it possible to increase the power of the lights to make the car more visible? Or, in the case of a large chance of pedestrian accidents, is it possible to increase the volume of the artificial engine sound to avoid careless road crossing? Can the car change the suspension settings to prepare for potentially dangerous road sections? The scope of this paper is the development of the theoretical background to support these preliminary protection activities.

The appropriate preliminary actions may significantly decrease the number and severity of road accidents. For example, Carsten & Tate (2005) present a model for the relationship between changes in vehicle speed and the number of accidents that occur. It is visible from this model (based on the national injury database of Great Britain to predict the effects of speed on road accidents) that for each 1 km/h change in mean speed, the best-estimated change of accident risk is 3%.
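To give a feel for the magnitude of this relationship, the snippet below applies the reported figure in a simple way. It is only an illustration: the multiplicative (compounding) interpretation of the 3% per km/h estimate, the normalized baseline, and the function name relative_risk are our assumptions, not the exact functional form of the Carsten & Tate (2005) model.

```python
# Illustrative only: rough effect of a speed change on accident risk,
# assuming the reported ~3% change per 1 km/h compounds multiplicatively.
BASE_RISK = 1.0          # normalized risk at the current mean speed (assumption)
PER_KMH_CHANGE = 0.03    # best-estimated change per 1 km/h (Carsten & Tate, 2005)

def relative_risk(delta_speed_kmh: float) -> float:
    """Relative accident risk after changing the mean speed by delta km/h."""
    return BASE_RISK * (1.0 + PER_KMH_CHANGE) ** delta_speed_kmh

# A 5 km/h reduction near a suspected pedestrian black spot:
print(round(relative_risk(-5.0), 3))  # ~0.863, i.e. roughly a 14% risk reduction
```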
Accordingly, it is worth making assumptions about the dangerous areas and adapting the control of the autonomous cars to these predictions.

BACKGROUND

Black spot identification

Black spot management (identification, analysis, and treatment of black spots in the public road network) is one of the most important tasks of road safety engineers. The identification of these extremely hazardous sections is the first step to prevent further accidents or to decrease their seriousness. It is a heavily researched area, and there are several theoretical methods for this purpose.

Although black spot management has a long tradition in traffic engineering, interestingly, there is no generally accepted definition of road accident black spots (also known as hot spots); the official definition varies by country. It follows that the method used to find these hazardous locations also varies by country. For example, by the definition of the Hungarian government, outside built-up areas black spots are defined as road sections no longer than 100 meters where the number of accidents during the last three years is at least 3. According to this, road safety engineers use simple threshold-based methods (for example, the traditional sliding window technique) to find these areas. Switzerland uses a significantly different definition, as black spots are sections of the road network (or intersections) where the number of accidents is "well above" the number of accidents at comparable sites. The key difference is the term "comparable sites", because these advanced comparative methods do not try to classify each road segment by itself but try to compare it to similar areas.

There are some general attributes of accident black spots to overcome the conceptual confusion. These are usually well-defined sections or intersections of the public road network, where road accidents are historically concentrated (Elvik, 2008; Delorme & Lassarre, 2014; Murray, White & Ison, 2012; Montella et al., 2013; Hegyi, Borsos & Koren, 2017). Nowadays, road accidents are monitored by governments and all data about accidents are stored in large, reliable and partially public databases (without any personal information about the participants). Much data about the road network is also available (road layout, speed limits, traffic signs, etc.). As a result, road safety engineers can use several procedures from various fields (statistics, data mining, pattern recognition) to localize accident black spots in these databases.

It is a common assumption that the number of accidents is significantly higher at these locations compared to other sections of the road network. However, this alone is neither a necessary nor a sufficient condition. The variation of the average yearly accident count of road sections is relatively high compared to the number of accidents. Because of this, the regression-to-the-mean effect can distort the historical data. A given section with more accidents than average is not necessarily an accident black spot. The converse is also true, as there may be true black spots with relatively few accidents in a given year. Even though this deficiency has been theoretically proven, most black spot identification methods are based on the accident numbers of the last few years, simply because this is the best place to start a detailed analysis.
Nevertheless, it is always worth keeping in mind that these locations are just black spot candidates, and further examination is needed to make the right decision concerning them. The best way to do this is via a detailed scene investigation, but it is very expensive and time-consuming. Another theoretical approach can be the analysis of accident data to find some irregular patterns and identify one or more risk factors causing these accidents. Without these, it is possible that the higher frequency of accidents is purely coincidental at a given location and time.

To localize potential accident black spots, the most traditional procedure is the sliding window method (Lee & Lee, 2013; Elvik, 2008; Geurts et al., 2006). The input parameters of the process are the section length and a threshold value. The method is based on the following:

1. Divide the selected road into small, uniform-sized sections;
2. Count the number of accidents that have occurred in the last few years for each section;
3. Flag the segments where this number is higher than a given threshold as potential black spots.

There are many variants of the traditional sliding window method (Anderson, 2009; Szénási & Jankó, 2007). A potential alternative is to use a variable window length. One of its advantages is that it is unnecessary to set the appropriate parameter; it is sufficient to give a minimal and a maximal value. The method can try several window lengths to find the largest black spots possible. Due to this modification, it can find small local black spots and larger ones too. The traditional sliding window method uses non-overlapping segments, but it is also possible to slide the window with smaller steps than the window size. This leads to a more sensitive method, which can find more black spot candidates. However, it is also necessary to manage the overlapping black spots (considering these as one big cluster, or as multiple distinct ones). It is worth mentioning that the method has some additional advantages: it has very low computational demand (compared to the alternatives) and is based only on the road accident database.

The sliding window method is one of the first widely used procedures; therefore, it is based on the traditional road number + section number positioning system (for example, the accident location is Road 66, 12+450 kilometer+meter). This traditional positioning system was the only real alternative in the past. However, in the last decades, the spread of GPS technology has made it possible to collect the spatial coordinates of accidents. This step has several benefits (faster and more accurate localization) but also requires the rethinking of the already existing methods. It is possible to extend the sliding window method to a two-dimensional procedure, but this is not widely used. It is better to seek out more applicable methods fitting the spatial data given by the GPS coordinates. From this field, Kernel Density Estimation (KDE) methods are among the most popular spatial data analysis techniques (Bíl, Andrášik & Janoška, 2013; Flahaut et al., 2003; Anderson, 2009; Yu et al., 2014; Toran & Moridpour, 2015). These have been employed in many research projects to analyze road accidents. KDE methods have the advantages of simple implementation and easy understanding.
These also have the benefit of naturally handling the noise in the data (caused by the inaccuracy of GPS devices). In general, KDE is used as an estimation of the probability density function of a random variable. From the safety experts' point of view, the result of the KDE method is the accident density estimation at a given reference point. The procedure has several parameters, like the search radius distance from the reference point (bandwidth or kernel size) and the kernel function.

Several researchers recommend the use of empirical Bayesian methods combining the benefits of the predicted and historical accident frequencies. These models usually analyze the distributions of the already existing historical data from several aspects, and give predictions about the expected accident state. In the Empirical Bayesian method, the existing historical accident count and the expected accident count predicted by the model are added using different weights (Ghadi & Török, 2019). Because of this, this process requires an accurate accident prediction model.

Another group of already available methods is based on clustering techniques. These procedures are from the field of data mining, where clustering is one of the widely used unsupervised learning methods. In this context, a cluster is a group of items which are similar to each other and differ from items outside the cluster. Using this concept in the field of black spot searching, accidents with similar attributes (where the properties can be the location and/or other risk factors) can be considered as one cluster. Most studies use the basic K-means clustering method (Mauro, De Luca & Dell'Acqua, 2013), but there are also some fuzzy-based C-means solutions.

As already mentioned, the results of the proposed methods are just a set of black spot candidates. Further analysis is needed to make a final, valid decision as to whether it is a real accident black spot or not, and whether or not it requires any action. This is the point where our research turns away from traditional road safety management work (identification and elimination of black spots). Based on the collected clusters, road safety engineers must select the black spot candidates having the largest safety potential, which is based on the prediction of the effect of the best available preventive action (the cost of the local improvement activity compared to the expected benefits in the number and severity of further accidents).

From the perspective of autonomous car control, the role of this safety potential is essential. The self-driven car has no options to solve road safety problems. The only important information is the existence of accident black spots and the potential safety mechanisms which may help to avoid further crashes. As a second difference, from the road safety engineers' point of view, it is not necessary that the accidents of a given black spot have common characteristics. The hot spot definition of this paper assumes that accidents of a given cluster have similar attributes, because this pattern will be the basis of the preventive actions.

The localization of accident black spot candidates is a heavily researched area and there are several fully-automated methods to find these. Nevertheless, the further automatic pattern analysis of these is not as well developed.
This phase usually needs a great deal of manual work by human road safety experts (they must travel to the scene and investigate the environment to support their decisions about recommended actions). This process can be supported by some general rules, but it is mostly done manually, using the pattern matching capability of the human mind. To fully automate it, it is necessary to make this method applicable to self-driven cars. According to this objective, this paper focuses on helping autonomous vehicles to take the appropriate preventive actions to avoid accidents:

• Localize black spot candidates using the historical accident database;
• Make assumptions about the common risk factors and patterns of these accidents;
• According to these preliminary results, the autonomous device will know where the dangerous areas are and what preventive actions to take.

Automated accident prevention

Autonomous vehicles will have several ways to avoid accidents, and this is therefore a hot, widely researched topic. Nevertheless, most papers deal with options existing only in the far future, when autonomous devices will be part of a densely connected network without any human interference. Real-world implementations are far from this point, but some technologies already exist, although they are not closely related to autonomous vehicles. Currently implemented accident prevention systems are built into traditional cars as braking assistants, etc. However, it is worth considering these because such methods will be the predecessors of the future techniques applicable to self-driven vehicles.

The two main classes of accident prevention systems are passive and active methods. Passive systems send notifications to the driver about their warnings but do not perform any active operations. In contrast, active methods have the right to perform interventions (braking, steering, etc.) to avoid accidents. It seems obvious that these prevention systems have a large positive impact on accident prevention, and it has already been proven by Jermakian (2011) that passive methods have significant benefits: more than one million vehicle crashes are prevented in the USA each year. As Harper et al. proved (Harper, Hendrickson & Samaras, 2016), the cost-benefit ratio of these systems is also positive.

Brake assist systems are among the most researched active systems, where the potential benefits are the lower risk of injury and the less serious injuries of pedestrians (Rosén et al., 2010). Current forward-looking crash avoidance systems usually continuously scan the space in front of the vehicle using various devices (camera, radar, LIDAR, etc.). If any of these detects an unexpected vehicle or pedestrian, the brake assistant system takes the appropriate (preliminary) actions, which can be the enforcement of the braking system or direct autonomous emergency braking. Bálint, Fagerlind & Kullgren (2013) presented very promising results with a test-based methodology for the assessment of braking and pre-crash warning systems. These typically use only the real-time information given by the vehicle sensors, without any knowledge extracted from historical accident data.

Run-time crash prediction models are also related to the topic of this paper. Hossain et al. (2019) presented a comprehensive comparison and review of existing real-time crash prediction models.
The basic assumption of these systems is that the probability of a crash situation within a short time window can be predicted from the current environmental parameters measured by the sensors. Therefore, most of the existing methods use only the acquired sensor data to make real-time decisions about potential crash situations; their authors do not use the already existing accident databases as an input to fine-tune the system's predictions.

The work of Lenard, Badea-Romero & Danton (2014) is closer to the research presented in this paper. They analyzed common accident scenarios to support the development of autonomous emergency braking protocols. Using a hierarchical ascending clustering method on two British accident databases filtered by some previously defined conditions (only urban pedestrian accidents that occurred in daylight and in fine weather), they presented the attributes of the most common accident scenarios. Their paper defines the major accident scenarios and classifies all existing pedestrian accidents into one of these categories. The results of this research would be useful in the training phase of a self-driven vehicle to introduce all possible scenarios to the algorithm. The objective of Nitsche et al. (2017) is similar: they propose a novel data analysis method to detect pre-crash situations at various (T- and four-legged) intersections. The purpose of this work is also to support the safety tests of autonomous devices. They clustered accident data into several distinct partitions with the well-known k-medoids procedure. Based on these clusters, an association rule mining algorithm was applied to each cluster to specify the driving scenarios. The input was a crash database from the UK (containing one thousand junction crashes). The result of the paper is thirteen crash clusters describing the main pre-accident situations.

MATERIALS AND METHODS
Black spot candidate localization
Density-based spatial clustering of applications with noise
For the black spot candidate localization step, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm was used. It is not widely used in the field of road safety engineering; however, it is one of the most efficient density-based clustering methods in data mining. The main objective of density-based clustering is the following: the density of elements within a cluster must be significantly higher than between separate clusters. This principle distinguishes two classes of elements: items inside a cluster and outliers (elements outside of any cluster).

In the road safety task, the elements are the accidents on the public road network. These are identified by spatial GPS coordinates and have several additional attributes (time, accident nature, etc.). The general DBSCAN method needs a definition of the distance between two elements. In the case of road accidents, the Euclidean distance between the two GPS coordinates was used (black spots are usually spread over a small area; therefore, this is a good estimate of the real road-network distances). The DBSCAN method requires two additional parameters:

- ε: a radius-type variable (meters);
- MinPts: the lower limit for the number of accidents in a cluster (accidents).
The main definitions of the DBSCAN algorithm are as follows:

- the ε environment of a given element x is the space within the ε radius of x;
- x is an internal element if the ε environment of x contains at least MinPts elements (including x itself);
- x is directly densely reachable from y if x is in the ε environment of y and y is an internal element;
- x is densely reachable from y if it is accessible from y through a chain of directly densely reachable elements;
- all points that are not densely reachable from any internal element are outliers;
- if x is an internal element, then it forms a cluster together with all elements densely reachable from x.

The objective of the process is to find clusters of accidents in the public road network in which all elements are densely connected and no further expansion is possible. The steps to achieve this are as follows (a minimal implementation sketch is given at the end of this subsection):

1. Select one internal element from the accident database as the starting point. This will be the first point of the cluster.
2. Extend the cluster recursively with all directly densely reachable elements from any point of the cluster.
3. If it is not possible to extend the cluster with additional points, the cluster can be considered final (it contains all items densely reachable from the starting point). If this cluster meets the prerequisites for a black spot candidate, it is stored in the result set.
4. Repeat steps 1–3 for all internal elements of the database.

The result of this procedure is a set of black spot candidates. The prerequisites checked in step 3 can be one or more of the following:

- the number of accidents should exceed a given threshold;
- the accident density of the given area should exceed a given threshold.

The proposed method has several advantages over the traditional methods. Unlike the sliding window algorithm, which analyzes only the accidents of a given road section, DBSCAN is a spatial algorithm that handles all accidents of the database together. This difference is substantial in the case of junctions, where accidents of the same junction may be assigned to different road numbers. It can be especially critical in built-up areas and traffic roundabouts, where the number of connected roads is high.
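The following is a minimal, self-contained sketch of the clustering step described above; it is an illustration, not the author's implementation. It assumes planar (projected) coordinates in meters, so ε can be given directly as a radius, and the naive O(n²) neighbour search is sufficient for county-sized datasets but would need a spatial index for larger ones.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Assign a cluster label to each point; outliers keep the label None.

    points  - list of (x, y) tuples in a planar (projected) coordinate system
    eps     - neighbourhood radius in meters
    min_pts - minimum number of points (including the point itself) that makes
              a point an internal (core) element
    """
    labels = [None] * len(points)
    cluster_id = 0

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    for i in range(len(points)):
        if labels[i] is not None or len(neighbours(i)) < min_pts:
            continue                      # already clustered, or not a core point
        labels[i] = cluster_id            # start a new cluster from the core point i
        queue = neighbours(i)
        while queue:                      # expand with densely reachable elements
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster_id
                j_neigh = neighbours(j)
                if len(j_neigh) >= min_pts:   # expand further only through core points
                    queue.extend(j_neigh)
        cluster_id += 1
    return labels
```

With the parameters reported later in the paper (ε = 100 m), a call such as `dbscan(points, 100.0, 5)` on projected accident coordinates yields the raw clusters; the filtering by accident count and density is sketched separately after the parameter description.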
Determination of accident density
One of the benefits of the traditional sliding window method is that it is easy to interpret for human experts. The number of accidents in a given road section is a very informative number. It is also easy to calculate derived values, such as the accident density, which is the number of accidents divided by the length of the road section. This divisor is often extended with the traffic volume or the length of the time period.

In the case of spatial black spot localization techniques, the definition of road accident density is more complex. These methods are not based on road sections, so division by the section length is not applicable. Instead, it is necessary to calculate the area of the black spot and use it as the divisor. This article proposes a novel method to calculate the area of the region spanned by the black spot accidents: it finds the smallest convex boundary polygon containing all accidents of a given cluster. The density of the black spot is then the number of accidents divided by the area of this polygon. The area is calculated by Gauss' area (shoelace) formula, Eq. (1):

a(C) = \frac{1}{2}\left|\sum_{i=1}^{n-1} x_i y_{i+1} + x_n y_1 - \sum_{i=1}^{n-1} x_{i+1} y_i - x_1 y_n\right|
     = \frac{1}{2}\left|x_1 y_2 + x_2 y_3 + \dots + x_{n-1} y_n + x_n y_1 - x_2 y_1 - x_3 y_2 - \dots - x_n y_{n-1} - x_1 y_n\right|   (1)

where
- a(C): the area of the C polygon (cluster);
- n: the number of vertices of the polygon;
- (x_i, y_i): the two-dimensional coordinates of the i-th vertex of the C polygon (i ∈ {1, 2, …, n}).

If the number of accidents is less than three, the proposed area concept is not applicable. However, clusters with one or two accidents are usually not considered black spot candidates, so this is not a real limitation. In the case of clusters with more than two accidents, the accident density is calculated as Eq. (2):

\rho(C) = \frac{|C|}{a(C)}   (2)

where
- ρ(C): the accident density of the C cluster;
- |C|: the number of accidents in the C cluster.

The formula requires the sequence of corner coordinates of the polygon in a given order (in this case, a clockwise direction). The DBSCAN algorithm builds a cluster incrementally from a starting point, and its result is a set of accidents. Consequently, an additional step is needed to obtain the corner points in the appropriate order. It is possible to do this after the DBSCAN finishes, but it is also possible to extend the DBSCAN method with the following steps:

- In the case of the first (P1) and second (P2) items, the concept of a "polygon" cannot be interpreted. Hence, these are automatically marked as corner points of the polygon.
- With the third point (P3), the items already form a polygon. The P3 point must be on the right side of the vector P1P2, which can be checked using the sign of the cross product, to ensure the clockwise direction required by the Gauss formula. If this is not the case, it is necessary to swap P1 and P2. After that step, P1, P2 and P3 are the corner points of the polygon in a clockwise direction.
- For every additional point (P4, P5, …, Pn), it must be checked whether the new point is inside the current convex boundary polygon. This can be done by checking whether the new point (Pnew) is on the right side of every boundary vector. If this is true for each vector, the point is inside the polygon (or on its border); therefore, it is not necessary to modify the shape. If the new point is on the left side of any boundary vector, then it is outside the convex boundary polygon, and there must be a sequence of one or more consecutive vectors breaking the rule. Let k and l be the indices of the first and last vectors of this sequence. It is then possible to substitute the P_{k-1}, P_k, P_{k+1}, …, P_{l-1}, P_l, P_{l+1} part of the boundary vertex list with P_{k-1}, Pnew, P_{l+1}. Because of the convexity of the original polygon, the P_{k-1}, Pnew, P_{l+1} triangle contains all the P_k, P_{k+1}, …, P_{l-1}, P_l points, and the transformation also preserves the convexity of the new polygon and the clockwise direction of the corner points.

Three figures illustrating this process have been attached to the article in the Supplemental File "DBSCAN images". Using this method, it is possible to calculate the black spot area and the accident density of a given cluster.
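As a concrete illustration of Eqs. (1)–(2), the short sketch below computes the polygon area with the shoelace formula and derives the accident density. It assumes the cluster's convex-hull vertices are already available in order (for example from the incremental construction above, or from a ready-made routine such as scipy.spatial.ConvexHull); the example values are made up.

```python
def polygon_area(vertices):
    """Gauss' (shoelace) area formula, Eq. (1); vertices in clockwise or
    counter-clockwise order, coordinates in meters."""
    n = len(vertices)
    acc = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]     # wrap around to close the polygon
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0


def accident_density(vertices, accident_count):
    """Eq. (2): number of accidents divided by the polygon area (accidents/m^2)."""
    area = polygon_area(vertices)
    return accident_count / area if area > 0 else float("inf")


# Example: a 100 m x 50 m rectangular cluster containing 8 accidents.
hull = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]
print(polygon_area(hull))           # 5000.0
print(accident_density(hull, 8))    # 0.0016
```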
Analysis of black spot candidates
The result of the various black spot localization algorithms (sliding window, clustering, etc.) is a list of potential hot spots. However, having some accidents in a cluster does not mean that the hazard of accidents is significantly higher there. It is generally accepted by researchers that the number of accidents in a given area (section) of the road network follows a Poisson distribution. A special feature of road accident distributions is that the number of accidents is relatively low (compared to the size of the road network) and the variance is high. Therefore, the volatility of the accident number is very high, which means that a cluster where the number of accidents is above the average is not necessarily a hot spot. The list given by the previous methods needs further examination to find the really hazardous sites.

At this point, the methodology of this paper differs significantly from the work of road safety engineers. Their objective is to find hazardous sites and take the appropriate actions to decrease the probability of further accidents. They must select the sites with the largest safety potential, where the most cost-effective actions can be taken to decrease the number and severity of accidents. This is a very complex procedure based on the data of historical accidents, the expected number of accidents, the environmental conditions, and the costs and expected benefits of the different safety actions. Contrary to this, the objective of a self-driven car is not the elimination of road safety problems. As an ordinary participant in traffic, it has no chance to make the road network better. Nevertheless, as a passive participant, it should be able to localize the problematic areas, analyze them, and take the necessary preliminary steps to avoid further accidents.

Another difference between the methods of these fields is that, from the perspective of road safety engineers, it is not necessary that the accidents of a given black spot have any special patterns or common characteristics. For the self-driven car, the localization of high-risk areas where the number of accidents is significantly higher than expected is not enough, because this fact alone does not help to take the appropriate preliminary steps. This is the reason why this paper focuses on the identification of accident reasons. The result of this further investigation can be one of the following:

- If it is not possible to identify any unexpected pattern in the accident attributes, then the cluster cannot be considered an accident black spot. The high number of accidents is just a coincidence, and there are no suggestions to avoid further crashes.
- In contrast, if there is a special pattern in the accident attributes, then this cluster has the potential to decrease the probability of further crashes. The reasons for such similar accidents can be related to the road network, weather, lighting conditions or human errors (drivers and pedestrians).

In the second case, the knowledge of this special pattern (the common reasons for accidents in the same cluster) can be essential. It is presumable that accidents caused by the car itself can be avoided. For example, if it is visible from the accident database that the number of accidents caused by a slippery road is significantly higher than expected in a given area, the self-driven car should decrease its speed or change its trajectory to reduce the probability of this event.
However, it is also worth noting that preliminary actions can be very useful for decreasing the probability of accidents caused by other drivers or pedestrians. For example, if the historical accident data shows that the number of accidents caused by pedestrians is higher than expected, then the self-driven car can proactively try to decrease this negative potential by using some type of visual or auditory warning or by decreasing its speed.

Deducing the environmental reasons for accidents
Accident databases usually contain a certain taxonomy of accident types. These are usually structured classes of specific events and reasons, and scene investigators must classify each accident into one of these categories, which is very important statistical information. This approach has several limitations, because it is rare that the occurrence of an accident originates from one specific reason. Usually, multiple reasons, forming a complex structure, cause an accident. For example, the investigator may code the accident as a "catching-up accident", but this does not give any information about why the accident occurred. It is also typical that most of the accidents in the Hungarian road network are attributed to an "incorrect choice of speed". However, it is obvious that not just the speeding itself was the triggering reason for these accidents; there should be other factors as well (although it is unarguable that speeding increases the effects of other factors and makes certain accidents unavoidable).

Based on these experiences, this paper does not try to assign all accidents to mutually exclusive accident reason classes. On the contrary, the proposed method defines several potential accident reasons, which are not mutually exclusive. These factors can be complementary and can have different weights and roles in the occurrence of the accident. Only the reasons with potential preventive operations are discussed, because these carry valuable information for the self-driven car.

The proposed method is based on the following consecutive steps:

1. All known accidents are analyzed with respect to all possible accident reasons, and a score value is assigned to each accident showing how much the accident is affected by a given factor.
2. The distribution of these score values is approximated by the examination of all known accidents.
3. Based on the result of the previously presented DBSCAN algorithm, the distribution of these score values is also calculated for each black spot candidate.
4. The distributions for all accidents and for a given black spot are compared. If the distribution of a given factor differs significantly (in the positive direction), the cluster is marked as a hazardous area for the given factor.

The independent accident reason factors, such as "slippery road", "bad visibility" or "careless pedestrians", are denoted by R1, R2, …, RN, where N is their number. As discussed previously, these reasons are not stored directly in the database but can be inferred from the general attributes of the accidents. A scoring table is used for this purpose: the weights of the i-th accident factor (1 ≤ i ≤ N) are stored as W^i, where W^i_{attr=value} is the score of the R_i accident reason when the attr attribute equals value. Accordingly, the cumulative score of the R_i reason for accident x is given by Eq. (3):
S_i(x) = \sum_{\forall attr \in A(x)} W^i_{attr = x.attr}   (3)

where S_i(x) is the score value of the R_i reason for accident x, x.y denotes the value of the specific attribute y of accident x, and A(x) contains all the available known attributes of x.

It is also possible to calculate the same value not just for a single accident but for all accidents of a black spot candidate. The H_i(C) set contains the S_i(x) score values of all accidents x in the C cluster, as shown in Eq. (4):

H_i(C) = \{ S_i(x) \mid x \in C \}   (4)
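The scoring of Eqs. (3)–(4) can be written down in a few lines. The sketch below is illustrative only: the attribute names and weight values are placeholders, not the real database codes or the weights of the Supplemental File.

```python
# Hypothetical weight table for one factor (R1, slippery road). Keys are
# (attribute, coded value) pairs; missing pairs implicitly score 0.
W1 = {
    ("roadsrf", 4): 1.0,   # road surface: "oily, slippery"
    ("roadsrf", 3): 0.6,   # road surface: "snowy" (illustrative weight)
    ("wthr", 6): 0.3,      # weather: snowing
    ("accnat", 31): 0.2,   # accident nature: slipping, carving, overturning
}

def score(accident, weights):
    """Eq. (3): sum the weights of all (attribute, value) pairs of one accident."""
    return sum(weights.get((attr, value), 0.0) for attr, value in accident.items())

def cluster_scores(cluster, weights):
    """Eq. (4): the score values H_i(C) of all accidents in a cluster."""
    return [score(x, weights) for x in cluster]

# Example accident record (attribute -> coded value).
x = {"roadsrf": 4, "wthr": 6, "accnat": 31, "driver_age": 42}
print(score(x, W1))                              # 1.0 + 0.3 + 0.2 = 1.5
print(cluster_scores([x, {"roadsrf": 1}], W1))   # [1.5, 0.0]
```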
Distribution of accident scores
As a further step, it is necessary to determine whether there is any significant reason proving that the C set is a real hot spot. For a well-established decision, it is necessary to analyze all the accidents in the database to determine the main characteristics of the distributions of all R reasons. Based on these results, it is possible to compare the distribution of the H_i(C) values of the examined hot spot candidate C with the reference values Ĥ_i computed for the whole accident database D for a given reason R_i, Eq. (5):

\hat{H}_i = \{ S_i(x) \mid x \in D \}   (5)

If the distributions of H_i(C) and Ĥ_i are the same, it can be assumed that the R_i reason has no significant role in the accumulation of accidents. Otherwise, if these distributions differ in the sense that the R_i score values are higher in H_i(C) than in Ĥ_i, there may be some causal relationship between them.

Hypothesis tests can show whether the mean value of a given accident reason score (R_i) in a given cluster is higher than the same mean for all accidents in the database. The alternative hypothesis states that the mean score of the cluster minus the mean score of the whole population is greater than zero, Eq. (7); the null hypothesis covers all other possible outcomes, Eq. (6):

H_0: \mu_C - \mu_D \le 0   (6)
H_1: \mu_C - \mu_D > 0   (7)

where
- μ_C is the mean score value of the black spot candidate;
- μ_D is the mean score value of all accidents in the database (the full population).

This article proposes the application of Welch's t-test, which is a two-sample location test used to test the hypothesis that the means of two populations are equal (like the popular Student's t-test, but Welch's test is more reliable when the sample sizes are significantly different and the variances are also unequal). Welch's test assumes that both populations have normal distributions. Nevertheless, in the case of moderately large samples and a one-tailed test, t-tests are relatively robust to moderate violations of the normality assumption. In this case, the populations are large enough (the full population contains thousands of accidents, and black spots also contain several accidents), and the one-tailed test is the appropriate method because we are looking for clusters where the mean is significantly higher than in the entire population. Ahad & Yahaya (2014) show that Welch's test can cause Type I errors when the variances of the two populations differ and the distributions are non-normal. In this case, the variances are similar, and Type I errors are acceptable (some identified black spot candidates may not be real black spots). According to Welch's method, the t statistic is given by Eq. (8):

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{v_1}{n_1} + \frac{v_2}{n_2}}}   (8)

where
- x̄_1 is the mean of the first sample;
- x̄_2 is the mean of the second sample;
- v_1 is the variance of the first sample;
- v_2 is the variance of the second sample;
- n_1 is the size of the first sample;
- n_2 is the size of the second sample.

The degrees of freedom (ν) are calculated by Eq. (9):

\nu = \frac{\left(\frac{v_1}{n_1} + \frac{v_2}{n_2}\right)^2}{\frac{v_1^2}{n_1^2 (n_1 - 1)} + \frac{v_2^2}{n_2^2 (n_2 - 1)}}   (9)

Based on the previously calculated t and ν values, the t-distribution can be used to determine the probability P. The one-tailed test is applied because it answers the question of whether the mean of the cluster is significantly higher than the mean of the entire population. Based on P and a previously defined level of significance (α), it is possible to reject the null hypothesis or not. In the case of rejection, it can be assumed that the examined accident reason is related to the accidents as one of the possible causal factors. If the null hypothesis cannot be rejected, there is no evidence for this.
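A small sketch of the one-tailed Welch test used here, implementing Eqs. (8)–(9) directly and using SciPy only for the t-distribution tail probability; this is an illustration of the test, not the author's code, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance
from scipy.stats import t as t_dist

def welch_one_tailed(cluster_scores, population_scores):
    """Return (t, dof, p) for the alternative H1: mean(cluster) > mean(population)."""
    n1, n2 = len(cluster_scores), len(population_scores)
    m1, m2 = mean(cluster_scores), mean(population_scores)
    v1, v2 = variance(cluster_scores), variance(population_scores)  # sample variances
    t_stat = (m1 - m2) / sqrt(v1 / n1 + v2 / n2)                    # Eq. (8)
    dof = (v1 / n1 + v2 / n2) ** 2 / (
        v1 ** 2 / (n1 ** 2 * (n1 - 1)) + v2 ** 2 / (n2 ** 2 * (n2 - 1))
    )                                                               # Eq. (9)
    p_value = t_dist.sf(t_stat, dof)    # one-tailed: P(T > t)
    return t_stat, dof, p_value

# Reject the null hypothesis at alpha = 0.05 when p_value < 0.05.
t_stat, dof, p = welch_one_tailed([0.9, 1.2, 0.7, 1.0, 0.8], [0.2] * 50 + [0.6] * 10)
print(round(t_stat, 3), round(dof, 1), round(p, 4))
```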
Scoring factors
The practical evaluation presented in this paper focuses on one specific accident reason (N = 1): the slippery road condition factor (R1). The accident database used contains more than two hundred fields, in four categories:

- general accident attributes (date and time, location, nature, etc.);
- general environmental attributes (weather conditions, visibility, etc.);
- data about the participants (vehicle or pedestrian, speed, direction, etc.);
- data about the injured persons (age of the injured person, etc.).

Weighting tables have been developed to estimate the effect of a given accident reason factor on the occurrence of the accident. Focusing on the slippery road condition factor, the following three types of accident properties can be distinguished:

- Some fields directly contain information about the examined factor. In this case, the "Road surface" property (abbreviated as roadsrf) of an accident has an option "4 - oily, slippery". This is taken as the basis for the further weights; the score value of this attribute is 1.0 (W^1_{roadsrf=4} = 1.0), showing that the accident is highly affected by the slippery road condition factor. It is worth noting that it is not efficient to make a binary decision about the examined factor based on this value alone, because there are other values ("3 - snowy", "5 - other staining") with similar effects. This is reflected in the weight values.
- In some cases, there are no such direct fields, but it is possible to deduce information about a given factor from the already existing data. For example, in the case of the slippery road condition factor, the weather conditions (the wthr property in the database) can help this process. In these cases, the score values assigned to the different weather condition cases give an estimate of how much the given factor affected the occurrence of the accident. In the case of snowing ("6 - snowy"), it is higher (W^1_{wthr=4} = 0.3) than for ideal conditions like "1 - sunny" (W^1_{wthr=1} = 0). It is also considered that in the case of the accident nature "31 - slipping, carving, overturning on the road", the slippery road factor influenced the outcome (W^1_{accnat=32} = 0.2).
- The last group contains the fields without any relation to the examined factor. For example, fields like "Age of the driver" do not affect the results. The weights for all values of these fields are consequently zero.

The Supplemental File "Scoring tables" contains the weight values used for the affected fields. The weight values are based on a comprehensive literature review from the fields of road safety and road friction measurement (Wallman & Åström, 2001; Andersson et al., 2007; Sokolovskij, 2010; Colomb, Duthon & Laukkanen, 2017). However, some of the values are affected by the subjective judgement of the authors; further research is needed to determine the most efficient weights.

RESULTS
Accident database
This paper uses the official road accident database of Hungary, where data on accidents with personal injury are collected by the police. After some conversion and corrections, this dataset is handled by the Central Statistics Department. The completeness of the database is ensured by legislation: participants of public road accidents with personal injury are obliged to report them to the police. A police officer starts the data collection on the spot by recording the most relevant data about the location and the main attributes of the accident (participants, casualties, etc.). After 30 days, it is possible to refine the final injury level of all participants. After that finalization step, the Central Statistics Department collects and rechecks all records. Road safety engineers and researchers can use this database for their work.

The evaluation part of this paper is based on the accidents of this database from 1 January 2011 to 31 December 2018. It contains 128,767 accidents with personal injury, classified into three categories: fatal, serious and slight. There are no accidents in the database without personal injury. Because of the high number of accidents and the high computational demand of the clustering algorithm, this paper deals with two counties of Hungary: the accidents of "Győr-Moson-Sopron" county were used to find the optimal parameters of the algorithm, and "Heves" county was used as a control dataset.

DBSCAN clustering
The input database for the clustering was the set of personal injury accidents of a given county of Hungary ("Győr-Moson-Sopron" county). This experiment was performed twice, on two consecutive time intervals, to measure the robustness of the method. The examined t interval contains the accidents that occurred between 1 January 2011 and 31 December 2014, and the t̂ validation interval was 1 January 2015–1 December 2018. The number of accidents was 3,256 in the t interval (the D set contains these accidents) and 3,011 in the t̂ interval (the D̂ set contains these accidents).

In the hot spot search phase, the following DBSCAN parameters were used:

- ε value: 100 m;
- minimum accident count: five accidents;
- minimum accident density: 0.0001 accident/m².

The result of this raw DBSCAN clustering was 165 black spot candidates in the t interval and 152 black spot candidates in the t̂ interval.
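Putting the parameters above together with the earlier sketches, the candidate filtering could look like the following (again only a sketch; `clusters` would be the output of the DBSCAN sketch and `areas` the matching polygon areas from the shoelace formula):

```python
EPS = 100.0            # meters (DBSCAN radius)
MIN_ACCIDENTS = 5      # minimum accident count per candidate
MIN_DENSITY = 0.0001   # accidents per square meter

def black_spot_candidates(clusters, areas):
    """Keep only clusters that satisfy both prerequisites of the search."""
    candidates = []
    for cluster, area in zip(clusters, areas):
        if len(cluster) < MIN_ACCIDENTS:
            continue
        if area <= 0 or len(cluster) / area < MIN_DENSITY:
            continue
        candidates.append(cluster)
    return candidates
```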
Statistical test
Unlike traditional black spot searching methods, the next step is not the calculation of some safety potential index, but the determination of the different accident reason factors using the scoring method presented above. Considering the R1 slippery road condition factor, the S1(x) value is calculated for all x accidents. Most of these are not related to a slippery road surface, so their S1 value is 0. As a prerequisite of the Welch test, a population of S1(y) values is generated, where y runs over all accidents in the database. The main parameters of this sample are:

- number of items (n1): 3,256;
- mean (x̄1): 0.2438;
- variance (v1): 0.1115.

It is possible to calculate these values for every black spot candidate as well, iterating over all of them. Based on the comparison of the whole population and the black spot candidates, the Welch test was applied to obtain the statistical result values. According to the Welch test, the Student distribution with these parameters and the given level of significance (α = 0.05) can be used to decide whether to reject the null hypothesis. Table 1 shows the black spot candidates of the t interval where the null hypothesis was rejected because the mean of the R1 score for the given black spot candidate was significantly higher than the expected average.

Table 1: Accident black spots where the null hypothesis was rejected.
#  Location                    Count  Mean  Variance  Prob.
1  LAT 47.6301 / LON 16.7333   8      0.75  0.0857    0.000878
2  LAT 47.5956 / LON 17.5872   11     0.55  0.0887    0.003629
3  LAT 47.3866 / LON 17.8659   5      1.12  0.2820    0.010502
4  LAT 47.5708 / LON 17.5790   6      0.56  0.1307    0.040157

It can be assumed that these black spots are affected by the examined R1 factor. Figure 1 shows the environment and the accidents of the first black spot from this list. As is visible in the satellite image, it is part of a long, straight road; consequently, there is no apparent reason for an autonomous car to decrease its speed.

[Figure 1: Road accidents of the black spot located at LAT 47.6301 / LON 16.7333. Map data ©2021 Google; satellite images ©2021 CNES/Airbus, Geoimage Austria, Maxar Technologies. Full-size DOI: 10.7717/peerj-cs.399/fig-1]

From the historical database, Table 2 contains detailed information about the accidents of this black spot.

Table 2: Accidents of the black spot located at LAT 47.6301 / LON 16.7333.
Time               Latitude  Longitude  Outcome  Surface  Weather   Accident nature
2011.02.03 16:05   47.6302   16.7327    Light    Wet      Sunny     Track leaving
2011.05.06 17:35   47.6298   16.7340    Hard     Normal   Sunny     Track leaving
2011.06.26 10:24   47.6300   16.7338    Light    Wet      Rainy     Track leaving
2011.06.26 10:28   47.6300   16.7334    Hard     Wet      Rainy     Track leaving
2011.07.21 9:10    47.6302   16.7330    Hard     Wet      Rainy     Track leaving
2013.06.24 17:50   47.6298   16.7340    Light    Wet      Overcast  Frontal crash
2014.01.09 12:45   47.6301   16.7330    Light    Wet      Sunny     Slipping, carving
2014.01.20 10:45   47.6303   16.7325    Hard     Wet      Overcast  Track leaving

As is visible, a high number of accidents is affected by one or more slippery-road-related attributes. This pattern significantly differs from the expectations; hence, there should be some environmental issue at this location. The examination and elimination of these causes is the task of road safety experts (Orosz et al., 2015). Nevertheless, until then, it is worth taking preventive steps to decrease the chance of further accidents. The autonomous vehicle should adapt its control to this situation (speed reduction, using a safer trajectory, etc.).

DISCUSSION
There is no generally accepted method for the evaluation of black spots, because there is no exact definition for them. Based on real-world accident data, there is no list of real black spots, so the widely accepted confusion-table-based methods are not usable here (assigning the clusters into true-positive, false-positive, true-negative and false-negative classes and calculating the common measures like accuracy, recall, etc.). Therefore, it is necessary to evaluate the results based on the general characteristics of these locations.

The accident density of black spots is significantly higher than the average; however, this is only a necessary condition of validity, not a sufficient one. Because of the high volatility of accident numbers, the regression-to-the-mean effect can distort the results. It is a well-known statistical phenomenon that roads with a high number of road accidents in a particular period are likely to have fewer in the consecutive period, simply because of random fluctuations in crash numbers. In the case of real black spots, the high number of accidents is permanent. Thus, a good evaluation technique is to check the number of accidents of the consecutive validation time interval inside the clusters identified in the t interval. There are specific tests for this purpose, introduced by Cheng & Washington (2005) and used by various articles (Montella, 2010): site consistency tests, method consistency tests, and the total rank differences test. Since these were developed for black spot searching methods based on road intervals, it was necessary to adapt them to spatial coordinates and black spot regions. The input series for all tests are the results of the previous black spot identification process:

- C_i is the i-th cluster identified in the D database (1 ≤ i ≤ n, where n is the number of black spots identified in the t interval);
- Ĉ_i is the i-th cluster identified in the D̂ database (1 ≤ i ≤ n̂, where n̂ is the number of black spots identified in the t̂ interval).

Site consistency test
This test assumes that any site identified as a black spot in the t time period should also reveal high risk in the subsequent t̂ time period. Let π(C) be the convex boundary polygon of the C cluster given by the algorithm presented above, and let Π be the union of these regions identified in the t time period, Eq. (10):

\Pi = \bigcup_{i=1}^{n} \pi(C_i)   (10)

As the next step, we collect all accidents of the consecutive t̂ time period that fall inside the clusters identified in the prior t time period. The T1 value is the number of these accidents divided by the summarized area of these clusters; thus, it is the accident density of these clusters in the consecutive time period, Eq. (11):

T_1 = \frac{\left|\{ x \in \hat{D} \mid x\ \text{inside}\ \Pi \}\right|}{\sum_{i=1}^{n} a(C_i)}   (11)
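To illustrate the T1 computation, the sketch below checks point-in-polygon membership against the clockwise convex hulls of the period-t clusters; the polygons and their areas are assumed to come from the earlier sketches, and this is an illustrative reimplementation rather than the original code.

```python
def inside_convex(point, vertices):
    """True if `point` lies inside or on the border of a clockwise convex polygon."""
    px, py = point
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # For clockwise vertices the interior lies to the right of every edge,
        # so a positive cross product means the point is outside.
        if (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1) > 0:
            return False
    return True

def site_consistency_T1(cluster_polygons, cluster_areas, accidents_t_hat):
    """Eq. (11): later-period accidents falling inside the earlier clusters,
    divided by the summed area of those clusters."""
    hits = sum(
        1 for a in accidents_t_hat
        if any(inside_convex(a, poly) for poly in cluster_polygons)
    )
    return hits / sum(cluster_areas)
```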
Accident reason factor consistency test
As this paper goes further by revealing the accident reason factors, it is also worth checking whether the accidents of the t̂ time period that fall inside the region identified in the t time period have the same attributes. This leads to the introduction of the T′1 value, which is the average score of these accidents, Eq. (12):

T'_1 = \frac{\sum_{x \in \hat{D},\ x\ \text{inside}\ \Pi} S_1(x)}{\left|\{ x \in \hat{D} \mid x\ \text{inside}\ \Pi \}\right|}   (12)

Method consistency test
It is also assumed that a black spot area identified in the t time period will also be identified as a black spot in the consecutive t̂ time period. A given black spot searching method can be considered consistent if the number of black spots identified in both periods is large, while the number of black spots identified in only one of the examined periods is small. The method consistency can be calculated with Eq. (13):

T_2 = \frac{\left|\{C_1, C_2, \dots, C_n\} \cap \{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_{\hat{n}}\}\right|}{\left|\{C_1, C_2, \dots, C_n\} \,\triangle\, \{\hat{C}_1, \hat{C}_2, \dots, \hat{C}_{\hat{n}}\}\right|}   (13)

where T2 is the ratio of the number of clusters existing in both search results and the number of clusters found only by the search in the t or only in the t̂ time period (△ stands for the symmetric difference of sets). A pair of clusters from the t and t̂ periods is considered identical if the distance between them is less than 300 m.

Rank difference test
The rank difference test is based on the black spots identified in both the t and t̂ periods. The black spots of both periods are sorted by accident density, and the rank difference test measures the difference between the positions of the same cluster in the two lists. The smaller the value, the more consistent the examined method is, because the ordering of the clusters is similar; large values show that the examined method was able to identify the same black spots in both intervals but with different relative severities. Let O and Ô be the sequences of black spots identified in both periods (both sequences contain the items of the {C_1, C_2, …, C_n} ∩ {Ĉ_1, Ĉ_2, …, Ĉ_n̂} set), ordered by accident density in the t time period (O) and in the t̂ time period (Ô). Obviously |O| = |Ô|. The rank difference T3 of the examined method is given by Eq. (14):

T_3 = \frac{\sum_{c \in O} \left| \mathrm{Rank}(c, O) - \mathrm{Rank}(c, \hat{O}) \right|}{|O|}   (14)

where Rank(x, Y) is the rank of the x black spot in the Y sequence.
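The two consistency measures can be sketched as follows. The centroid-based greedy matching with the 300 m threshold is a simplification of the cluster-identity rule described above, so treat it as an assumption of this illustration rather than the exact procedure of the paper.

```python
from math import hypot

def matched_pairs(centroids_t, centroids_t_hat, max_dist=300.0):
    """Greedily pair clusters of the two periods whose centroids are within max_dist meters."""
    pairs, used = [], set()
    for i, (xi, yi) in enumerate(centroids_t):
        for j, (xj, yj) in enumerate(centroids_t_hat):
            if j not in used and hypot(xi - xj, yi - yj) <= max_dist:
                pairs.append((i, j))
                used.add(j)
                break
    return pairs

def method_consistency_T2(n_t, n_t_hat, n_matched):
    """Eq. (13): matched clusters over clusters found in only one period."""
    only_one_period = (n_t - n_matched) + (n_t_hat - n_matched)
    return n_matched / only_one_period if only_one_period else float("inf")

def rank_difference_T3(densities_t, densities_t_hat, pairs):
    """Eq. (14): mean rank difference of the matched clusters ordered by density."""
    order_t = sorted((i for i, _ in pairs), key=lambda i: -densities_t[i])
    order_h = sorted((j for _, j in pairs), key=lambda j: -densities_t_hat[j])
    rank_t = {c: r for r, c in enumerate(order_t)}
    rank_h = {c: r for r, c in enumerate(order_h)}
    return sum(abs(rank_t[i] - rank_h[j]) for i, j in pairs) / len(pairs)
```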
Evaluation results
First, the proposed method was compared to the traditional Sliding Window (SW) method using a dynamic window length. The minimal window length parameter was 250 m, the minimal accident number was 5, and the minimal accident density was 0.01 accidents/m. As a further step, the novel method was also compared to the raw DBSCAN-based clustering (without the accident factor scoring); its parameters were the same as presented above. The proposed method appears in the comparison under the name DARF (DBSCAN with Accident Reason Factor determination).

Table 3 shows the overall results for "Győr-Moson-Sopron" county.

Table 3: Results of the comparison of the SW, DBSCAN and DARF methods based on the slippery road condition. Precision is the ratio of the number of confirmed black spots (identified in both intervals) to the number of all black spots (identified in at least one of the intervals). Results are based on the personal injury accidents that occurred in "Győr-Moson-Sopron" county.
Value                             SW       DBSCAN   DARF
BS identified in both t and t̂     67       129      4
BS identified in t but not in t̂   8        36       2
BS identified in t̂ but not in t   20       23       0
Precision                         41.36%   40.69%   40.00%
T1 test result (accidents/m)      0.0094   0.0435   0.0447
T′1 test result                   0.2159   0.1922   0.6200
T2 test result                    0.5447   0.5223   0.5000
T3 test result                    3.8765   5.9054   0.2000

As visible, the number of black spots recognized by the DARF method is significantly lower than that of its alternatives. This was expected, because the SW and DBSCAN methods list all clusters where the accident density is higher than a given threshold, whereas the DARF method reports only the black spots affected by the R1 accident factor. The difference between SW and DBSCAN is also significant and is caused by the fact that SW uses road name + road section positioning, which is not available in built-up areas. In comparison, the DBSCAN method is based on GPS coordinates and can also find the black spots of municipal roads (which is one advantage of this approach). The T1 result is similar for the DBSCAN and DARF methods and significantly lower for SW. The T2 results are almost the same for all algorithms. The third general metric shows that the proposed method performs very well on the rank difference test; however, it is worth noting that the number of black spots is significantly lower in this case, which can be an advantage.

The T′1 metric shows the real strength of the proposed method. As expected, the black spots identified by SW and DBSCAN contain a mixture of various accidents; consequently, the average of their R1 scores is close to the mean of the population (0.2159 and 0.1922, compared to 0.2438). Contrary to this, the average score of the accidents of the t̂ time interval that fall inside the clusters located using the data of the t interval is 0.62, which is significantly higher than the average.

These results confirm that the proposed method has very similar characteristics to the already existing methods. The slightly lower T2 value shows that, as a raw black spot searching algorithm, it is not as robust as the alternatives. Nonetheless, the T′1 result shows that it is satisfactory for our purpose: it can localize areas where the expected number of accidents with the given accident reason is significantly higher than the average.

Table 4 shows the same values for another county ("Heves"), used as a control dataset to check the robustness of the method.

Table 4: Results of the comparison of the SW, DBSCAN and DARF methods based on the slippery road condition. Precision is defined as in Table 3. Results are based on the personal injury accidents that occurred in "Heves" county.
Value                             SW       DBSCAN   DARF
BS identified in both t and t̂     25       38       4
BS identified in t but not in t̂   9        12       0
BS identified in t̂ but not in t   16       26       3
Precision                         33.33%   33.33%   36.36%
T1 test result (accidents/m)      0.0074   0.0323   0.0732
T′1 test result                   0.2148   0.2286   0.5778
T2 test result                    0.3333   0.3333   0.4000
T3 test result                    1.8667   1.4912   0.0000

As visible, the main characteristics of the results are very similar. In this case, the T1 and T3 results are better than those of the alternatives; the T′1 value is slightly lower, but still significantly higher than the population average.

CONCLUSIONS
This work presents a novel, fully automated method for updating autonomous vehicles concerning potential road risk factors. The method is based on the DBSCAN data-mining algorithm, which can localize black spot candidates where the number of accidents is greater than expected. It has several advantages over the traditional sliding window method, especially in built-up areas and for accidents that occurred at junctions. Beyond the traditional road safety engineering work, an additional processing step was also introduced, making assumptions about the main accident reasons. All possible reasons (slippery road, pedestrian issues, etc.)
should be checked one by one, assigning score values to all accidents. The proposed method considers the distribution of these score values for the full population (all accidents of the given county) and for each black spot candidate. Using hypothesis tests (the one-tailed Welch test), it is possible to select the clusters in which the mean of the score values is significantly higher than the expected value (calculated by statistical methods based on the entire accident database). These can be considered black spots affected by the given factor. The output of this process is a list of risky locations on the public road network together with a prediction of the accident reasons. These results can be the basis of further research suggesting automatic preventive steps to autonomous vehicles. This dataset can be useful in the route planning phase (trying to avoid black spots) and in the traveling phase (taking preventive steps when approaching dangerous locations) (Alonso et al., 2016). This knowledge would decrease the number and seriousness of public road accidents.

As a limitation, it is worth noting that the proposed method can produce false positive alarms. Fortunately, these results are used by autonomous vehicles; therefore, the consequences are usually minor inconveniences (decreasing the speed, etc.), in contrast to traditional road safety investigations, where a manual revision is essential. It is also worth noting that our method is based only on local historical data, which leads to the problems typical of traditional statistical black spot searching methods (high variation compared to the expected value). It would be worth developing a hybrid method based on the Empirical Bayes method, which achieves superior control of random variation.

The next step of this research project will be the development of these preventive steps. The previously acquired information should be built into the control of the self-driven vehicle to fine-tune its strategy of movement and avoid all predictable risky situations. For example, if the presented method predicts a high probability of pedestrian accidents, the car could increase the engine sound volume; in the case of a high chance of frontal accidents, it is worth increasing the power of the headlights; and obviously, decreasing the speed near any of the dangerous locations may decrease the seriousness of most accidents. Building an expert system to give similar advice based on the historical data should be the next step of this project. Another direction of further development is to make the method more sensitive to real-time environmental conditions.
For example, if the autonomous car has to plan a route at night in wet weather, then it should pay more attention to historical accidents that occurred under similar conditions. This also confirms that simple and fully automatic algorithms are needed for this purpose, so that fast recalculations are possible. As another further development, an Artificial Intelligence based approach could be used to extend the database and address the problems raised by the limitations of the dataset.

ACKNOWLEDGEMENTS
The authors would like to thank Domokos Jankó for his support and novel ideas about the topic. Rest in peace, our friend.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
The research presented in this paper was carried out as part of the EFOP-3.6.2-16-2017-00016 project in the framework of the New Széchenyi Plan. The completion of this project is funded by the European Union and co-financed by the European Social Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
New Széchenyi Plan: EFOP-3.6.2-16-2017-00016.
European Union and European Social Fund.

Competing Interests
Sándor Szénási is an Academic Editor for PeerJ.

Author Contributions
Sándor Szénási conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
Data and code are available in the Supplemental Files.

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/peerj-cs.399#supplemental-information.

REFERENCES
Ahad NA, Yahaya SSS. 2014. Sensitivity analysis of Welch's t-test. AIP Conference Proceedings 1605(February 2015):888–893.
Alonso F, Alonso M, Esteban C, Useche SA. 2016. Knowledge of the concepts of black spot, grey spot and high accident concentration sections among drivers. Science Publishing Group 1(4):39–46.
Anderson TK. 2009. Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis & Prevention 41(3):359–364 DOI 10.1016/j.aap.2008.12.014.
Andersson M, Bruzelius F, Casselgren J, Gäfvert M, Hjort M, Hultén J, Håbring F, Klomp M, Olsson G, Sjödahl M, Svendenius J, Woxneryd S, Wälivaara B. 2007. Road friction estimation. IVSS project report. Available at https://research.chalmers.se/en/publication/101026.
Bálint A, Fagerlind H, Kullgren A. 2013. A test-based method for the assessment of pre-crash warning and braking systems. Accident Analysis & Prevention 59:192–199 DOI 10.1016/j.aap.2013.05.021.
Bíl M, Andrášik R, Janoška Z. 2013. Identification of hazardous road locations of traffic accidents by means of kernel density estimation and cluster significance evaluation. Accident Analysis & Prevention 55(3):265–273 DOI 10.1016/j.aap.2013.03.003.
Carsten OM, Tate FN. 2005. Intelligent speed adaptation: accident savings and cost-benefit analysis. Accident Analysis & Prevention 37(3):407–416 DOI 10.1016/j.aap.2004.02.007.
Chatterjee K, Hounsell NB, Firmin PE, Bonsall PW. 2002. Driver response to variable message sign information in London.
Transportation Research Part C: Emerging Technologies 10(2):149–169 DOI 10.1016/S0968-090X(01)00008-0.
Cheng W, Washington SP. 2005. Experimental evaluation of hotspot identification methods. Accident Analysis & Prevention 37(5):870–881 DOI 10.1016/j.aap.2005.04.015.
Colomb M, Duthon P, Laukkanen S. 2017. Characteristics of adverse weather conditions. In: DENSE. Brussels: CER.
Delorme R, Lassarre S. 2014. A new theory of complexity for safety research—the case of the long-lasting gap in road safety outcomes between France and Great Britain. Safety Science 70:488–503 DOI 10.1016/j.ssci.2014.06.015.
Elvik R. 2008. A survey of operational definitions of hazardous road locations in some European countries. Accident Analysis & Prevention 40(6):1830–1835 DOI 10.1016/j.aap.2008.08.001.
Flahaut B, Mouchart M, Martin ES, Thomas I. 2003. The local spatial autocorrelation and the kernel method for identifying black zones. Accident Analysis & Prevention 35(6):991–1004 DOI 10.1016/S0001-4575(02)00107-0.
Geurts K, Wets G, Brijs T, Vanhoof K, Karlis D. 2006. Ranking and selecting dangerous crash locations: correcting for the number of passengers and Bayesian ranking plots. Journal of Safety Research 37(1):83–91 DOI 10.1016/j.jsr.2005.10.020.
Ghadi M, Török Á. 2019. A comparative analysis of black spot identification methods and road accident segmentation methods. Accident Analysis & Prevention 128:1–7 DOI 10.1016/j.aap.2019.03.002.
Harper CD, Hendrickson CT, Samaras C. 2016. Cost and benefit estimates of partially-automated vehicle collision avoidance technologies. Accident Analysis & Prevention 95:104–115 DOI 10.1016/j.aap.2016.06.017.
Hegyi P, Borsos A, Koren C. 2017. Searching possible accident black spot locations with accident analysis and GIS software based on GPS coordinates. Pollack Periodica 12(3):129–140 DOI 10.1556/606.2017.12.3.12.
Hossain M, Abdel-Aty M, Quddus MA, Muromachi Y, Sadeek SN. 2019. Real-time crash prediction models: state-of-the-art, design pathways and ubiquitous requirements. Accident Analysis & Prevention 124:66–84 DOI 10.1016/j.aap.2018.12.022.
Jermakian JS. 2011. Crash avoidance potential of four passenger vehicle technologies. Accident Analysis & Prevention 43(3):732–740 DOI 10.1016/j.aap.2010.10.020.
Kertesz G, Felde I. 2020. One-shot re-identification using image projections in deep triplet convolutional network. In: SOSE 2020—IEEE 15th International Conference of System of Systems Engineering, Proceedings. Piscataway: IEEE, 597–601.
Lee S, Lee Y. 2013. Calculation method for sliding-window length: a traffic accident frequency case study. Eastern Asia Society for Transportation Studies 9:1–13.
Lenard J, Badea-Romero A, Danton R. 2014. Typical pedestrian accident scenarios for the development of autonomous emergency braking test protocols.
Accident Analysis & Prevention 73(4):73–80 DOI 10.1016/j.aap.2014.08.012.
Mauro R, De Luca M, Dell'Acqua G. 2013. Using a k-means clustering algorithm to examine patterns of vehicle crashes in before-after analysis. Modern Applied Science 7(10):11–19.
Montella A. 2010. A comparative analysis of hotspot identification methods. Accident Analysis & Prevention 42(2):571–581 DOI 10.1016/j.aap.2009.09.025.
Montella A, Andreassen D, Tarko AP, Turner S, Mauriello F, Imbriani LL, Romero MA. 2013. Crash databases in Australasia, the European Union, and the United States. Transportation Research Record: Journal of the Transportation Research Board 2386(1):128–136 DOI 10.3141/2386-15.
Murray W, White J, Ison S. 2012. Work-related road safety: a case study of Roche Australia. Safety Science 50(1):129–137 DOI 10.1016/j.ssci.2011.07.012.
Nitsche P, Thomas P, Stuetz R, Welsh R. 2017. Pre-crash scenarios at road junctions: a clustering method for car crash data. Accident Analysis & Prevention 107:137–151 DOI 10.1016/j.aap.2017.07.011.
Orosz G, Mocsári T, Borsos A, Koren C. 2015. Evaluation of low-cost safety measures on the Hungarian national road network. In: Proceedings of the XXVth World Road Congress. Seoul: World Road Association, 1–11.
Rosén E, Källhammer JE, Eriksson D, Nentwich M, Fredriksson R, Smith K. 2010. Pedestrian injury mitigation by autonomous braking. Accident Analysis & Prevention 42(6):1949–1957 DOI 10.1016/j.aap.2010.05.018.
Sokolovskij E. 2010. Automobile braking and traction characteristics on the different road surfaces. Transport 22(4):275–278 DOI 10.3846/16484142.2007.9638141.
Szénási S, Jankó D. 2007. Internet-based decision-support system in the field of traffic safety on public road networks. In: 6th European Transport Conference. Budapest, 131–136.
Toran A, Moridpour S. 2015. Identifying crash black spots in Melbourne road network using kernel density estimation in GIS. In: Road Safety and Simulation.
Wallman C-G, Åström H. 2001. Friction measurement methods and the correlation between road friction and traffic safety: a literature review. Available at https://books.google.hu/books?id=VL9BHQAACAAJ.
Yu H, Liu P, Chen J, Wang H. 2014. Comparative analysis of the spatial analysis methods for hotspot identification. Accident Analysis & Prevention 66(2083):80–88 DOI 10.1016/j.aap.2014.01.017.
work_3ecslqollvat3jyruwpjy26lou ---- Modeling Past and Future for Neural Machine Translation

Zaixiang Zheng* (Nanjing University, zhengzx@nlp.nju.edu.cn), Hao Zhou* (Toutiao AI Lab, zhouhao.nlp@bytedance.com), Shujian Huang (Nanjing University, huangsj@nlp.nju.edu.cn), Lili Mou (University of Waterloo, doublepower.mou@gmail.com), Xinyu Dai (Nanjing University, dxy@nlp.nju.edu.cn), Jiajun Chen (Nanjing University, chenjj@nlp.nju.edu.cn), Zhaopeng Tu (Tencent AI Lab, zptu@tencent.com)

Abstract

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated PAST contents and untranslated FUTURE contents, which are modeled by two additional recurrent layers. The PAST and FUTURE contents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with knowledge of the translated and untranslated contents. Experimental results show that the proposed approach significantly improves the performance in Chinese-English, German-English, and English-German translation tasks.
Specifically, the proposed model outperforms the conventional coverage model in terms of both translation quality and alignment error rate.†

1 Introduction

Neural machine translation (NMT) generally adopts an encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), where the encoder summarizes the source sentence into a source context vector, and the decoder generates the target sentence word by word based on the given source. During translation, the decoder implicitly serves several functionalities at the same time:

1. Building a language model over the target sentence for translation fluency (LM).
2. Acquiring the most relevant source-side information to generate the current target word (PRESENT).
3. Maintaining what parts of the source have been translated (PAST) and what parts have not (FUTURE).

* Equal contributions.
† Our code can be downloaded from https://github.com/zhengzx-nlp/past-and-future-nmt.

However, it may be difficult for a single recurrent neural network (RNN) decoder to accomplish these functionalities simultaneously. A recent successful extension of NMT models is the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which makes a soft selection over source words and yields an attentive vector representing the source parts most relevant to the current decoding state. In this sense, the attention mechanism separates the PRESENT functionality from the decoder RNN, achieving significant performance improvement.

In addition to PRESENT, we address the importance of modeling PAST and FUTURE contents in machine translation. The PAST contents indicate translated information, whereas the FUTURE contents indicate untranslated information; both are crucial to NMT models, especially for avoiding under-translation and over-translation (Tu et al., 2016). Ideally, PAST grows and FUTURE declines during the translation process. However, it may be difficult for a single RNN to model these processes explicitly.

In this paper, we propose a novel neural machine translation system that explicitly models PAST and FUTURE contents with two additional RNN layers. The RNN modeling the PAST contents (called the PAST layer) starts from scratch and accumulates the information that is being translated at each decoding step (i.e., the PRESENT information yielded by attention). The RNN modeling the FUTURE contents (called the FUTURE layer) begins with a holistic source summarization and subtracts the PRESENT information at each step. The two processes are guided by proposed auxiliary objectives. Intuitively, the RNN state of the PAST layer corresponds to the source contents that have been translated up to a particular step, and the RNN state of the FUTURE layer corresponds to the source contents that remain untranslated. At each decoding step, PAST and FUTURE together provide a full summarization of the source information. We then feed the PAST and FUTURE information to both the attention model and the decoder states. In this way, our proposed mechanism not only provides coverage information for the attention model, but also gives a holistic view of the source information at each step.
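To make this mechanism concrete, the sketch below (PyTorch) shows one decoding step in which separate PAST and FUTURE recurrent layers are updated with the attentive PRESENT vector and fed back into both the attention scorer and the decoder state. It is a minimal illustration under our own assumptions, not the authors' released code: the module and parameter names (PastFutureDecoderStep, attn_score, and so on) are invented for this example, plain GRU cells stand in for the paper's specialized subtraction-style FUTURE updates, and the auxiliary training objectives are omitted.

```python
import torch
import torch.nn as nn


class PastFutureDecoderStep(nn.Module):
    """One decoding step with explicit PAST and FUTURE recurrent layers (illustrative only)."""

    def __init__(self, emb_size, hidden_size, annotation_size, vocab_size):
        super().__init__()
        # Attention scores are conditioned on the decoder, PAST and FUTURE states.
        self.attn_score = nn.Linear(annotation_size + 3 * hidden_size, 1)
        self.past_cell = nn.GRUCell(annotation_size, hidden_size)    # accumulates translated content
        self.future_cell = nn.GRUCell(annotation_size, hidden_size)  # discounts translated content
        self.decoder_cell = nn.GRUCell(emb_size + annotation_size + 2 * hidden_size, hidden_size)
        self.readout = nn.Linear(hidden_size, vocab_size)

    def forward(self, y_prev_emb, s_prev, past_prev, future_prev, annotations):
        # annotations: (batch, src_len, annotation_size), the encoder states h_1 .. h_I
        batch, src_len, _ = annotations.size()

        # Attention: the query includes PAST and FUTURE, so the model knows
        # what has already been translated and what still remains.
        query = torch.cat([s_prev, past_prev, future_prev], dim=-1)        # (batch, 3*hidden)
        query = query.unsqueeze(1).expand(batch, src_len, query.size(-1))
        scores = self.attn_score(torch.cat([annotations, query], dim=-1))  # (batch, src_len, 1)
        alpha = torch.softmax(scores, dim=1)
        c_t = (alpha * annotations).sum(dim=1)                             # PRESENT source content

        # PAST grows by the content just translated; FUTURE is updated to account for it.
        past_t = self.past_cell(c_t, past_prev)
        future_t = self.future_cell(c_t, future_prev)

        # The decoder state sees PRESENT, PAST and FUTURE information.
        rnn_input = torch.cat([y_prev_emb, c_t, past_t, future_t], dim=-1)
        s_t = self.decoder_cell(rnn_input, s_prev)
        logits = self.readout(s_t)                                         # scores over target vocabulary
        return logits, s_t, past_t, future_t
```

In a full decoding loop, past_prev would start from a zero vector and future_prev from a summary of the source annotations (for example, their mean), so that PAST grows and FUTURE shrinks as translation proceeds.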
We conducted experiments on Chinese-English, German-English, and English-German benchmarks. Experiments show that the proposed mechanism yields BLEU improvements of 2.7, 1.7, and 1.1 points on the three tasks, respectively. In addition, it obtains an alignment error rate of 35.90%, significantly lower than both the baseline (39.73%) and the coverage model of Tu et al. (2016) (38.73%). We observe that in traditional attention-based NMT, most errors occur due to over- and under-translation, which is probably because the decoder RNN fails to keep track of what has been translated and what has not. Our model can alleviate such problems by explicitly modeling PAST and FUTURE contents.

2 Motivation

In this section, we first introduce the standard attention-based NMT, and then motivate our model by several empirical findings.

The attention mechanism, proposed in Bahdanau et al. (2015), yields a dynamic source context vector for the translation at a particular decoding step, modeling the PRESENT information described in Section 1. This process is illustrated in Figure 1.

[Figure 1: Architecture of attention-based NMT.]

Formally, let $x = \{x_1, \ldots, x_I\}$ be a given input sentence. The encoder RNN, generally implemented as a bi-directional RNN (Schuster and Paliwal, 1997), transforms the sentence into a sequence of annotations, with $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ being the annotation of $x_i$ ($\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ refer to the RNN hidden states in the forward and backward directions). Based on the source annotations, another decoder RNN generates the translation by predicting a target word $y_t$ at each time step $t$: P(yt|y