doi:10.1016/j.eswa.2007.03.012 Available online at www.sciencedirect.com www.elsevier.com/locate/eswa Expert Systems with Applications 34 (2008) 2290–2297 Expert Systems with Applications Structure clustering for Chinese patent documents Su-Hsien Huang a,d,*, Hao-Ren Ke b, Wei-Pang Yang c a Institute of Computer Science and Engineering, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, Taiwan, ROC b Library and Institute of Information Management, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, Taiwan, ROC c Department of Information Management, National Don Hwa University, 1, Section 2, Da Hsueh Road, Shou-Feng, Hualien, Taiwan, ROC d Department of Information Management, Minghsin University of Science and Technology, 1, Hsin Hsin Road, Hsin Feng, Hsinchu, Taiwan, ROC Abstract This paper aims to cluster Chinese patent documents with the structures. Both the explicit and implicit structures are analyzed to represent by the proposed structure expression. Accordingly, an unsupervised clustering algorithm called structured self-organizing map (SOM) is adopted to cluster Chinese patent documents with both similar content and structure. Structured SOM clusters the similar content of each sub-part structure, and then propagates the similarity to upper level ones. Experimental result showed the maps size and number of patents are proportional to the computing time, which implies the width and depth of structure affects the performance of structured SOM. Structured clustering of patents is helpful in many applications. In the lawsuit of copyright, companies are easy to find claim conflict in the existent patents to contradict the accusation. Moreover, decision-maker of a company can be advised to avoid hot- spot aspects of patents, which can save a lot of R&D effort. � 2007 Elsevier Ltd. All rights reserved. Keywords: Structure clustering; Chinese patent; Structure expression; Metadata 1. Introduction 1.1. Background Digital libraries provide various integrated services to coordinate information over Internet. However, the dis- tributed and heterogeneous data complicate the integration when it involves several issues, such as format, content, semantics, etc. The discrepancy and redundancy confuses users in readability. To provide a unified view, data cluster- ing with an overview by integrating similar data has received considerable attention in recent researches. Conventional data clustering adopts feature vector to represent data. It lacks some aspects of consideration to identify similarity. For example, chemical compounds with the same molecules but different structures are not the same. Moreover, two trademarks with the same compo- 0957-4174/$ - see front matter � 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.03.012 * Corresponding author. E-mail address: sshuang@cis.nctu.edu.tw (S.-H. Huang). nents but different placements cannot be identified similar, too. These two examples account for the structure is a key-factor to influence the clustering in particular domains. Although conventional structure clustering focuses on visual pattern query, chemical structure identification and syntactic parsing tree, rare mentioned in documents. This is owing to the structure in documents is hard to obtain. Interestingly, the development in rhetorical structure the- ory (RST) facilitates the extraction of structure. RST was proposed in the 1980s and has been successfully applied to documents summarization, automatic layout, and so on. RST categorizes the characteristics of phrases and con- structs a closure tree to represent structure. Therefore, applying RST to represent documents structure becomes possible and practical. Several kinds of documents, like patent and electronic thesis, contain both structure and content information. To cluster these documents, only the same content in the same structure can be identified similar. The tort of copy- right is a good illustration to require conflict in both claim mailto:sshuang@cis.nctu.edu.tw S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 2291 and claim structure. The structure of documents can be categorized into two types. Explicit structure represents the existent attributes of documents, like ‘‘subject’’, ‘‘abstract’’, ‘‘publisher’’ and ‘‘classification’’ in patent doc- uments. On the other hands, the structure analyzed from content is implicit structure. The implicit structure can be extracted by analyzing the writing style and the well- established convention. For example, in the claim field of patents, ‘‘the . . . of claim 1’’, ‘‘comprising’’, ‘‘wherein’’, ‘‘having’’, ‘‘consist of’’, etc. construct the implicit structure. With both explicit and implicit structures, document struc- ture can be constructed and applied in structure clustering. This study tries to apply Chinese patent documents in structure clustering. Patent structure has been mentioned in recent literatures. Shinmori, Okumura, Marukawa, and Iwayama (2003) provides a rhetorical method to cate- gorize Japanese patent structure into three styles – process sequence style, element enumeration style and Jepson-like style. Each claim in the patent is given a weight to evaluate the relationships among different claims. Fujii, Iwayama, and Kando (2006) extends Shinmori’s work and adds delimiter to punctuate claim into components. Mase, Matsubayashi, Ogawa, Iwayama, and Oshio (2004) analyze the patent structure into premise, description and target parts. Keywords in different part can be re-weighted to enhance query precision (Iwayama, Fujii, Kando, & Marukawa, 2006). Clustering in patent is a real-time application. Real-time means the time constraint of clustering algorithm is tight when users require the result on-line. Unfortunately, struc- ture clustering is more complicate than conventional one, where the computing time is subject to the structure. Hence, the first requirement of structure clustering is effi- ciency. Additionally, real-time also implies there is no training data in advance. Therefore, unsupervised algo- rithm is suggested when there is unnecessary to use training data in the process. Unsupervised clustering integrates sim- ilar data in definite circles, and adjusts the parameters to obtain result. Kohonen proposes an unsupervised neural network self-organizing map (SOM) and receives excellent performance (Kohonen, 1998). SOM provides unsuper- vised neural network clustering and maps high dimension data (usually two) into a low-dimension map. In SOM, clo- ser nodes in the map imply shorter distance in real data. SOM applies in many domains, like bio-structure cluster- ing, graph structure clustering and audio-pattern cluster- ing, etc. The multi-dimension representation of SOM facilitates the readability which is suitable for patent clustering. 1.2. Literatures of early studies It is noteworthy that conventional SOM cannot deal with structure data. In 1998, a general framework is pre- sented to deal with structure by neural network (Frasconi, Gori, & Sperduti, 1998). This framework constructs direc- ted ordered acyclic graphs (DOAG) for structure data and recurrently proceeds data by following DOAG sequence. This framework further motivates the upcoming researches. SOMSD (Sperduti, 2001) applies self-organiz- ing map (SOM) to manipulate structured data. Each neu- ron in SOMSD is equipped with two weights: one for previous structure and one for two-time previous structure. Structure similarity is estimated by concatenating current similarity and these two structure similarity. RSOM (Voegtlin, 2002) recursively calculates SOM by adding cur- rent similarity and previous structure. Hammer, Micheli, and Sperduti (2002) propose a general framework to further extend SOMSD. Vesanto and Alhoniemi (2000) cluster each sub-structure part in single SOM and use hier- archical k-means cluster to investigate similar structure. Smith and Ng (2003) and Roussinov and Chen (2001) derive user navigation patterns as structure to cluster web page navigation pattern. Similar navigation patterns (trea- ted as structure graph) are clustered together by SOM. Xu, Chang, and Paplinski (2005) use on-line expansion to enlarge SOM size into multiple layers. Other of structure clustering can be found in literatures (Hagenbuchner, Sperduti, & Tsoi, 2004; Hammer & Jain, 2004; Rossi, Conan-Guez, & Golli, 2004; Sperduti & Starita, 1997; Strickert & Hammer, 2003). 1.3. Purpose of this study This study is intended to apply Chinese patent docu- ments in structure clustering. A patent structure is analyzed first and provides structure expression to represent it. Hav- ing the structure established, self-organizing map (SOM) is adopted. Several modification of SOM is undertaken to process structure, including input, output and training algorithm. In brief, three major objectives involved in this study are: (1) Construct expression model to represent the patent structure. (2) Develop structure clustering algorithm to process structure patent. (3) Evaluate the structure clustering algorithm. This paper is organized as follows. Chapter 2 analyzes the structure of Chinese patent documents. Chapter 3 pro- poses structured SOM to cluster patents. Chapter 4 imple- ments structured SOM and a series of experiments are conducted in Chapter 5. Chapter 6 draws the conclusions. 2. Structure analysis As shown in Fig. 2, the preprocessing of structure clus- tering is to analyze the structure. The received documents are first represented by structure expression, and then determine the input sequence and maximum branch num- ber (MBN). To formally describe the structure, two formal constructors are defined. Given a Structure S, S can be represented by the following two constructors: 2292 S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 (1) Tuple constructor (TC): A tuple constructor TC of Structure S is an ordered list constructed by the union of single-value attributes. For example, the set of attributes ci, c1 Æ occurence = 1, c2 Æ occurence = 1, . . .,cn Æ occurence = 1, are represented as TCs = {c1, c2,. . .,cn}s. The subscript s is the name of TC. (2) Set constructor (SC): A set constructor SC of Struc- ture S is a multi-value type with the same occurrence larger than one. For example, the set of attributes ci, c1 Æ occurence = i, c2 Æ occurence = i,. . .,cn Æ occu- rence = i, are represented as hc1; c2; . . . ; cni i s. The sub- script s represents the name of SC and i represents the maximum occurrence. 2.1. Explicit structure A structure document contains two types of structure. The first one is explicit structure, which is obtained from the schema of structure documents (like data metadata). Explicit structure is applicable in documents with regular schema. To take a simple example of electronic thesis, the attributes ‘‘subject’’, ‘‘abstract’’, ‘‘chapter’’, ‘‘section’’, and ‘‘paragraph’’ contain explicit structure. The structure expressions above example can be expressed as follows: Subject; Abstract; Contenth in1paragraph D En2 section � �n3 chapter ( ) Book where n1, n2 and n3 stand for the maximum occurrence of the attributes. Applying explicit structure in structure clustering is meaningful when the documents contain regular schema and massive content in each attribute. For example, elec- tronic thesis and XML-based news are suitable for explicit Fig. 1. Structure of structure clustering. In this study, Chinese patent docu- ments contain most information in ‘‘Claim’’ field. There- fore, implicit structure is introduced in the next section. 2.2. Implicit structure The structure analyzed from content is implicit structure. The implicit structure can be extracted by analyzing the writing style and the well-established convention. In the ‘‘Claim’’ field of Chinese patent documents, two types are categorized: • Composition style: As in ‘‘ ’’ (comprising), ‘‘ ’’ (includes), the set of element is described. These key- words are used in method to imply the composition of the inventions. This type of style represents a set of ele- ments contains in the claim structure. • Pre-condition style: As in ‘‘ ’’ (as claimed in . . .), ‘‘ ’’ (wherein), these descriptions imply the state- ments has followed a list of composition. This type of style represents a list of compositions construct the claim structure. Comparing with other-linguistic patents, there are sub- stantial difficulties to obtain several relationships in Chi- nese patent. For example, in Shinmori’s method, the process sequence style (means the sequence of processes, like ‘‘does’’, ‘‘and does’’) is ambiguous in Chinese gram- mar. It’s always followed by ‘‘ ’’ where is the same meaning with the article ‘‘one’’. These two types of style can derive the implicit structure of patents. For the presence of these keywords, a hierarchy relationship can be constructed. For example, Fig. 1 illus- trates an example of Chinese patent and represents the implicit structure analyzed by these two styles. To give each Chinese patent. S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 2293 segmented claim an ID, naming mechanism is defined for each claim as follows: Patent ID-Claim IDlevel1- � � �Claim IDleveli The structure can be formally described by the structure expression, which can be more comprehensive for structure clustering. The composition style means the occurrence happens in the following statements. By following the structure expression in previous section, it can be expressed as hii, where i stands for the structure level. The pre-condi- tion style implies an attribute contains in the following statement. By following the structure expression, it can be expressed as {}i, where i stands for the structure level. For the example of Fig. 1, the structure expression can be expressed as follows: fClaima;fClaimbglevel2; Claimcglevel1 � �3 patent 2.3. Determine input sequence After the structure is analyzed, the input of each sub- part structure is determined by following directed acyclic graph (DAG). A depth-first search is adapted in the naviga- tion of structure tree. For the formal expression of both explicit and implicit structure, the input of structure follows two principles: (1) By order of the attributes in tuple constructor. (2) Iteratively expand set constructor to i times. For example, the input sequence of explicit structure Subject; Abstract; Contenth in1paragraph D En2 section � �n3 chapter ( ) Book Fig. 2. Structu is Subject ! Abstract ! Chapter1 ! Section1 ! Paragraph1 !���! Paragraphn3 ! Section2 !���! Paragraphn2 !���! Chaptern1 ! Sectionn2 ! Paragraphn1 For the example of Fig. 1, the input sequence to structure clustering is: P ! 1 ! 6 ! 11 ! 2 ! 3 ! 5 ! 4 ! 7 ! 8 ! 10 ! 9 ! 12 ! 13 ! 15 ! 14 2.4. Determine maximum branch number The next step is to determine the maximum branch num- ber (MBN) of tuple and set constructor. The maximum branch number (MBN) represents the maximum number of branches in patent structure. In the structure expression of both explicit and implicit, MBN is the maximum num- ber of TC attributes and SC occurrence. For example in Fig. 1, the MBN is 3 when the largest branch and occur- rence is less than 3. 3. Structured SOM Self-organizing map (SOM) (Kohonen, 1998) provides unsupervised neural network clustering and maps high- dimension data into a low-dimension map (usually two). There are five steps to train SOM (in Fig. 2): (1) Initialize weight vectors of output map as the same number features with input document vector. (2) Present input documents in order. red SOM. 2294 S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 (3) Compute the distance between the input document and all nodes in the map and select the closest node as the winner. (4) Update the weights of the winner node and its neighbors. (5) Repeat steps (3)–(4) to other documents and iterate all inputs until convergence. Label the regions of the final map to represent the clustering result. SOM with structure receives considerable attention in recent years (Rossi et al., 2004; Strickert & Hammer, 2003; Voegtlin, 2002). The structured SOM in this study refers to Hagenbuchner’s self-organizing map clustering algorithm in structured data (Hagenbuchner et al., 2004). Hagenbuchner’s approach applied structured SOM on the images identification. The main concept is to calculate each sub-structure respectively, and concatenate the result to the upper structure. Notably, the images in Hagenbuch- ner’s method have simple structure and regular compo- nents. Experiment result has shown that structured SOM can cluster similar images together and distinguish the structure difference in the map. This paper applies struc- tured SOM in Chinese patents and tries to identify similar patent with different structure. The process of structured SOM is described in Fig. 2. 3.1. Training for explicit structure Applying SOM in Chinese patent documents requires modification of input/output vectors and algorithm. The explicit structure can be represented by structure expres- sion shown in previous section. The next step is to deter- mine the maximum branch number (MBN) of tuple and set constructor in explicit structure. The shortage of nodes less than MBN requires additional Null nodes. For the sim- plicity, assume the MBN = 3 in the example of explicit structure (n1 = 3, n2 = 3, n1 = 1), the structure expression can be rewrite as: Subject;Abstract; Content;NULL;NULLh in1paragraph D En2 section ; �� NULL;NULLin1chapter � Book The output nodes of SOM also need to be modified as the same number fields as input vector. For example, the output node of structured SOM in previous example is d x;y ¼ðV d x;y ;ðx1; y1Þ;ðx1; y1Þ;ðx1; y1ÞÞ The structure expression is assigned into SOM input vector by directed acyclic graph (DAG) sequence, for example, given a three node input vector, the input vector are rewrite as following: d book ¼ðV book;ðxsubject;ysubjectÞ;ðxabstract;yabstractÞ;ðxchapter;ychapterÞÞ Additionally, the distance calculation is updated as following: d ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðV Node0 � V bookÞ 2 q þjV Node1 � V subjectjþ jV Node2 � V abstractjþ jV Node3 � V chapterj where |VNodei � VNodej| represents the distance between these two vectors. The modification of distance formula means the cascaded calculation to all connected nodes. The adapting of structured SOM is updated as following: wd x;yðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node1 � V subjectj wd x1;y1ðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node2 � V abstractj wd x2;y2ðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node3 � V chapterj 3.2. Training for implicit structure For the implicit structure, the dimension of the vector is also determined by the MBN in all input patents. Assume the MBN = 3, the representation in previous example is: ðV Patent; Claim1; Claim6; Claim11Þ ðV Node1; Claim2; Claim3; Claim5Þ ðV Node6; Claim7; Claim8; Claim10Þ ðV Node11; Claim12; Claim13; Claim15Þ The output nodes of SOM also need to be modified as the same number fields as input vector. For example, the output node of structured SOM in previous example is d x;y ¼ðV d x;y ;ðx1; y1Þ;ðx1; y1Þ;ðx1; y1ÞÞ Additionally, the distance calculation is updated as follow- ing formula: d ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ðV Node0 � V PatentÞ 2 q þjV Node1 � V Claim1jþ jV Node2 � V Claim6jþ jV Node3 � V Claim11j where |VNodei � VNodej| represents the distance between these two vectors. The modification of distance formula means the cascaded calculation to all connected nodes. The adapting of structured SOM is also updated as following: wd x;yðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node1 � V Claim1j wd x1;y1ðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node2 � V Claim6j wd x2;y2ðt þ 1Þ¼ wd x;yðtÞþ gðtÞ� jV Node3 � V Claim11j 4. Implementation Fig. 3 displays the structured SOM system. The system contains six major parts: (1) SOM system: Choose standard SOM and structured SOM. (2) Parameter setting: Set the map dimension, learning rate, adaptation neighborhood and iteration to train. (3) Function selection: Select open file, close result, train- ing, save result, resize map and quit training. Parameter Setting Function Selection Map Information System Message SOM Map SOM System Fig. 3. Implementation of structured SOM. S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 2295 (4) Map information: Display the containing patents in the SOM grid. (5) System message: Show the current iteration and pro- cessing message. (6) SOM map: Illustrate the clustering result. Fig. 4 demonstrates an example to apply structured SOM in Chinese patent documents. Initially, 10 Chinese patent documents are selected from database. These docu- ments are sent to standard SOM to obtain preliminary clustering. In the clustering, totally four clusters are labeled, e-commerce, binary-tree, identification and data- E-Commerse 1) Standard SOM 2) Structured SOM 3) Claim Clustering 3 Claim Conflict 21 Structure Clustering Fig. 4. Patent clustering. base. The interesting is paid in the five patents of e-com- merce cluster. There are three observations: (1) Commerce system is subjected. (2) Client–server architecture is addressed. (3) Encrypt system is applied. Hereafter, these patents are sent to structured SOM. In the structured SOM, these five patents are separated into three clusters, which represent they are different in struc- ture. For cluster 1, the patent primarily describes the con- Table 2 Time vs. documents (s) Document number 5 10 20 Standard SOM 8 29 118 Structured SOM 1727 2927 5713 (5 · 5 map, 50 iterations, MBN = 9). Table 3 Times vs. MBN (s) MBN 3 5 6 9 Structured SOM 7 10 53 295 (5 · 5 map, 10 documents, 5 iterations). 2296 S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 struction of commerce system and is written in a very flat style. On the contrary, cluster 2 mainly focuses on encrypt system and is written in a very deep style. These two clus- ters are distinguished with cluster 3 not only the different subjects but the different patent structure. Notably, three documents marked as cluster 3 clusters together and might have high risk to conflict in the claims. Therefore, these claims are segmented by following the naming mechanism and cluster by SOM again to obtain conflict clusters. In the third step of Fig. 4, claim conflicts have observed in three areas: (1) Wire and wireless. (2) Client-side and server-side service. (3) Digital authorization and public key encrypt. These three patent conflicts provide good advisement for patent inventors to avoid involvement in this area. Struc- tured SOM is also helpful in many patent applications. In the lawsuit of copyright, companies are easy to find claim conflict in the existent patents, which is easy to con- tradict the accusation. Moreover, decision-maker of a com- pany can be advised to avoid hot-spot aspects of patents from structured SOM, which can save a lot of R&D effort. 5. Experiments A series of experiments are conducted to examine the performance of structured SOM. The data set comes from the claim attribute of Chinese patent documents, to exam- ine the implicit structure of patents. The experimental plat- form is Intel Celeron 1.5 GHz CPU, 512 MB RAM, Microsoft 2000 OS. The structured SOM was implemented by Java. In Table 1, the experiment was conducted to examine the performance in different map size. The parameters were set to 10 documents in 50 iterations with MBN = 9. The computing time of both standard and structured SOM is positive proportional to the map size. However, structured SOM is more time-consuming than standard one. In Table 2, the experiment was conducted to examine the performance in different document numbers. The parameters were set to 5 · 5 map in 50 iterations with MBN = 9. The computing times of both standard and structured SOM are positive proportional to the document numbers. The experiment corresponding to Table 1 implies the map size and documents are important factors to influ- ence the computing time. However, the time complexity between standard and structured SOM is extremely large, Table 1 Time vs. map size (s) Map size 2 · 2 3 · 3 5 · 5 10 · 10 Standard SOM 1 6 29 114 Structured SOM 227 355 2927 11,932 (10 documents, 50 iterations, MBN = 9). too. One explanation for the phenomenon is the document structure complicates the clustering. The deeper the struc- ture is, the longer the computing time. The experimental result of Table 3 further introduced another factor to influence computing time. With 10 docu- ments in five iteration of 5 · 5 map, structured SOM is nearly exponential proportional to the MBN. This is the most important factor account for the complexity of struc- tured SOM because both input vector and output map are modified to fit the dimension of MBN. These results lead to the conclusion that the width (MBN) and the depth of the structure are two key-factors for structured SOM performance. 6. Conclusions and future work Increasing Chinese patent documents require the clus- tering in the application. To cluster precisely, patents with similar content but different structure should be distin- guished. However, conventional clustering lacks of struc- ture consideration cannot identify similar patents with different structure, and rare research mentioned in text domain. In this paper, a clustering algorithm called struc- tured SOM is proposed to clusters structure patents by considering the content and structure information simulta- neously. Structured SOM requires the analysis of patent structure in advance. Two types of structure expression, explicit and implicit structure, are provided. Structured SOM modifies the input and output vectors for structure documents and iteratively training by adapting each sub- part of structure. An example to cluster Chinese patent documents has successfully implemented. Claim conflict appeared in the analysis provides advisements for patent inventors and decision-makers of business. In the lawsuit of copyright, companies are easy to find claim conflict in the existent pat- ents, which is easy to contradict the accusation. Moreover, decision-maker of a company can be advised to avoid hot- spot aspects of patents from structured SOM, which can save a lot of R&D effort. Experiments are conducted to examine the performance of structured SOM. Conclusions are given that both map S.-H. Huang et al. / Expert Systems with Applications 34 (2008) 2290–2297 2297 size and documents are positive proportional to the computing time. This implies two factors are important: the width (MBN) and depth of structure. MBN is nearly exponential drop-off the performance of structured SOM. In some practical application, prediction is adapted to esti- mate node distance to reduce computing time but sacrifice neglectful accuracy. Structure clustering is rarely mentioned in the text domain because the text with structure (like data metadata) has small amount of content in the attributes (explicit structure). With the development of digitalization, some structure documents, like electronic thesis and on-line news, can be applied in structured SOM. The future direc- tion for this study might be to discover more applications for structure clustering. It is also lots of work to improve the efficiency of structured SOM. Acknowledgements This research was supported by the Software Technol- ogy for Advanced Network Application project of Institute for Information Industry in 2004 and sponsored by MOEA, ROC. References Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. Fujii, A., Iwayama, A., & Kando, N. (2006). Test collections for patent retrieval and patent classification in the fifth NTCIR workshop. In Proceedings of the fifth international conference on language resources and evaluation (pp. 671–674). Hagenbuchner, M., Sperduti, A., & Tsoi, A. C. (2004). A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks, 14(3), 491–505. Hammer, B., & Jain, B. J. (2004). Neural methods for non-standard data. In european symposium at artificial neural networks 2004 (pp. 281–292). Hammer, B., Micheli, A., & Sperduti, A. (2002). A general framework for unsupervised processing of structured data. In European symposium on artificial neural networks (ESANN’2002) (pp. 395–400). Iwayama, M., Fujii, A., Kando, N., & Marukawa, Y. (2006). Evaluating patent retrieval in the third NTCIR workshop. Information Processing and Management, 42(1), 207–221. Kohonen, T. (1998). The self-organizing map. Neurocomputing, 21, 1–6. Mase, H., Matsubayashi, T., Ogawa, Y., Iwayama, M., & Oshio, T. (2004). Two-stage patent retrieval method considering claim structure. In NTCIR workshop 4 meeting. Rossi, F., Conan-Guez, B., & Golli, A. E. (2004). Clustering functional data with the SOM algorithm. In Proceedings of european symposium on artificial neural networks 2004 (ESANN’04) (pp. 305–312). Roussinov, D. G., & Chen, H. (2001). Information navigation on the web clustering and summarizing query results. Information Processing and Management, 37, 789–816. Shinmori, A., Okumura, M., Marukawa, Y., & Iwayama, M. (2003). Patent claim processing for readability – Structure analysis and term explanation. In ACL-2003 workshop on patent corpus processing. Sapporo: Association for Computational Linguistics. Smith, A., & Ng, A. (2003). Web page clustering using a self-organizing map of user navigation patterns. Decision Support Systems, 35, 245–256. Sperduti, A. (2001). Neural networks for adaptive processing of structured data. Lecture Notes in Computer Science, 2130, 5–12. Sperduti, A., & Starita, A. (1997). Supervised neural networks for classification of structures. IEEE Transaction on Neural Networks, 8(3), 714–735. Strickert, M., & Hammer, B. (2003). Unsupervised recursive sequence processing. In Proceedings of european symposium on artificial neural networks 2003 (ESANN 2003) (pp. 27–32). Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586–600. Voegtlin, T. (2002). Recursive self-organizing maps. Neural Networks, 15, 979–991. Xu, P., Chang, C. H., & Paplinski, A. (2005). Self-organizing topological tree for online quantization and data clustering. IEEE Transactions on Systems, Man, and Cybernetics, 35(3), 515–526. Structure clustering for Chinese patent documents Introduction Background Literatures of early studies Purpose of this study Structure analysis Explicit structure Implicit structure Determine input sequence Determine maximum branch number Structured SOM Training for explicit structure Training for implicit structure Implementation Experiments Conclusions and future work Acknowledgements References