ARTICLE 

Exploring Final Project Trends Utilizing Nuclear 
Knowledge Taxonomy 
An Approach Using Text Mining 
Faizhal Arif Santosa 

 
INFORMATION TECHNOLOGY AND LIBRARIES | MARCH 2023  
https://doi.org/10.6017/ital.v42i1.15603 

Faizhal Arif Santosa (faizhalarif@gmail.com) is Academic Librarian, Polytechnic Institute of 
Nuclear Technology, National Research and Innovation Agency. © 2022. 

ABSTRACT 

The National Nuclear Energy Agency of Indonesia (BATAN) taxonomy is a nuclear competence field 
organized into six categories. The Polytechnic Institute of Nuclear Technology, as an institution of 
nuclear education, faces a challenge in organizing student publications according to the fields in the 
BATAN taxonomy, especially in the library. The goal of this research is to determine the most efficient 
automatic document classification model using text mining to categorize student final project 
documents in Indonesian and monitor the development of the nuclear field in each category. The kNN 
algorithm is used to classify documents and identify the best model by comparing Cosine Similarity, 
Correlation Similarity, and Dice Similarity, along with vector creation binary term occurrence and 
TF-IDF. A total of 99 documents labeled as reference data were obtained from the BATAN repository, 
and 536 unlabeled final project documents were prepared for prediction. In this study, several text 
mining approaches such as stem, stop words filter, n-grams, and filter by length were utilized. The 
number of k is 4, with Cosine-binary being the best model with an accuracy value of 97 percent, and 
kNN works optimally when working with binary term occurrence in Indonesian language documents 
when compared to TF-IDF. Engineering of Nuclear Devices and Facilities is the most popular field 
among students, while Management is the least preferred. However, Isotopes and Radiation are the 
most prominent fields in Nuclear Technochemistry. Text mining can assist librarians in grouping 
documents based on specific criteria. There is also the possibility of observing the evolution of each 
existing category based on the increase of documents and the application of similar methods in 
various circumstances. Because of the curriculum and courses given, the growth of each discipline of 
nuclear science in the study program is different and varied. 

INTRODUCTION 

The National Nuclear Energy Agency of Indonesia (BATAN), now known as the Research 
Organization for Nuclear Energy (ORTN)—National Research and Innovation Agency (BRIN), in 
2018 issued a decision regarding BATAN’s six competencies: Isotopes and Radiation (IR), Nuclear 
Fuel Cycle and Advanced Materials (NFCAM), Engineering of Nuclear Devices and Facilities 
(ENDF), Nuclear Reactor (NR), Nuclear and Radiation Safety and Security (NRSS), and 
Management (Mgt). These areas of focus are also known as BATAN’s knowledge taxonomy, which 
is used to support Nuclear Knowledge Management (NKM) and the grouping of explicit knowledge 
in repositories.1  

The Polytechnic Institute of Nuclear Technology (PINT), which is under the auspices of BATAN 
and is now in one of the directorates of BRIN, can also utilize BATAN’s knowledge taxonomy to 
classify students’ final assignments. Every year the PINT Library accepts final assignments from 

mailto:faizhalarif@gmail.com


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 2 
SANTOSA 

students who have graduated from three study programs, namely Nuclear Technochemistry, 
Electronics Instrumentation, and Electromechanics. Over the past six years (2017 to 2022), 563 
final assignments in Indonesian were collected and needed to be classified into the BATAN’s 
knowledge taxonomy in order to see the document growth of each existing competency. However, 
it is quite time consuming for librarians to assign individual documents to the most appropriate 
taxonomy term. It is also possible to involve experts to determine the right group, which results in 
increased working time to complete a document. This obstacle arises because librarians do not 
have in-depth and detailed knowledge of the nuclear field so it is feared that grouping errors will 
occur. 

In this study, the author tried to classify the collection of final project documents owned by the 
PINT Library based on BATAN’s knowledge taxonomy. The author used text mining tools, 
choosing the k-nearest neighbors (kNN) algorithm for this study. Similar research also leads to 
trying to focus on automatic document classification of certain subjects,2 which in this case is the 
subject of nuclear engineering. The hope is that users will find it easier to explore knowledge 
according to their area of interest through taxonomy grouping based on explicit knowledge,3 in 
this case, PINT students’ final project documents. Finding the trend of research conducted by 
students on each subject is also one of the goals of this research. 

LITERATURE REVIEW 

Text Mining in Libraries 
The increasing number of publications currently makes it a challenge to classify and find out the 
growth and trends of a topic. Document classification is one of the jobs that is quite time 
consuming so document classification automation by utilizing text mining is very necessary.4  The 
application and utilization of text mining itself is very broad. Several studies have demonstrated 
the usefulness of text mining in libraries.  

Pong et al. from City University of Hong Kong conducted research to facilitate the classification 
process using machine learning.5 This study aimed to streamline document categorization 
utilizing automatic document classification by using a system called the web-based automatic 
document classification system (WADCS) and claimed to be the pioneer of a comprehensive study 
of automatic document classification on a classification that is already popular in the world, 
namely the Library of Congress Classification (LCC) utilizing kNN and naive Bayes (NB). This 
research indicates that the machine-learning algorithm they used can be applied by the library for 
document classification. 

Wagstaff and Liu utilized text mining to perform automatic classification to help make decisions to 
select candidate documents for weeding.6 This study used data from Wesleyan University from 
2011 to 2014 to predict which documents were eligible for weeding and which will be stored. Five 
classifier models, namely kNN, naive Bayes, decision tree, random forest, and support vector 
machines (SVM), were used to compare their performance. While this process may not replace 
librarians, this study can help librarians make better decisions and reduce their workload 
significantly. 

Lamba and Madhusudhan applied the use of text mining to extract important topics which were 
published in the DESIDOC Journal of Library and Information Technology over a period of 38 
years.7 The Latent Dirichlet Allocation (LDA) method used in this study is able to find topics from 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 3 
SANTOSA 

within a collection of documents so that they can see how these topics develop over time. Because 
LDA is an algorithm for looking at topics from a group of words that appear together, the authors 
suggest that this study be expanded by utilizing articles that have been labeled using supervised 
classification.  

kNN Classifier 
Various studies try to find answers to the most appropriate method of grouping the collection of 
documents. The kNN and SVM algorithms were used as comparative methods in the document 
classification study.8 However, there is no definite standard for the methods used in text mining.9 
Choosing the right technique in each phase of document classification can improve the 
performance of the text classifier, so, experts generally make adjustments to existing methods to 
get better results.10 

Kim and Choi compared kNN, Maximum Entropy Model (MEM), and SVM to classify Japanese 
patent documents by focusing on the structure of patents.11 Instead of comparing the entire text, 
specific components named semantic elements, such as purpose, background, and application 
fields, are compared from the training document. These semantically grouped components are the 
basis for patent categorization. In addition, the strategy used is the existence of cross -references 
from two semantic fields that are useful for determining the intentions of the patent writer s who 
are still unsure or hidden. This strategy works well on kNN compared to MEM and SVM where 
SVM doesn’t do very well when handling large data sets. However, research conducted by Alhaj et 
al. on Arabic documents showed that SVM can outperform kNN by implementing a stemming 
strategy.12 Meanwhile, through the approach to the relationship between unstructured text 
documents, the study conducted by Mona et al. was able to increase the performance of kNN 
combined with TF-IDF by 5 percent.13 

The kNN algorithm is one of the popular classifiers that categorizes new data based on the concept 
of similarity from the amount of data (determined by the specified “k” value) around it.14 This 
method is believed to be able to group documents effectively because it is not limited to the 
number of vector sizes.15 Wagstaff and Liu noted that one of the weaknesses of kNN is the long 
processing time when faced with large datasets, but kNN as a classifier is easy to apply.16 In terms 
of measurement, previous experiments showed that kNN was not suitable when used with 
Euclidean distance.17 Generally, similarity measures such as Cosine, Jaccard, and Dice were used in 
the kNN classifier.18  

One of the problems in text classification is the number of attributes or dimensions so that many 
irrelevant attributes in the data set cause the classifier’s performance to not run optimally.19 For 
this reason, it is necessary to have a technique to increase effectiveness and reduce dimensions 
that are too large through the selection of features or terms,20 such as within-document TF, 
weighting with one of the popular methods, namely TF-IDF (which sees how important a word is 
in a collection of corpus),21 and binary representation which looks at the absence and presence of 
a concept in a document22 by converting it to 0 and 1.23 

Aims of the Study 

University libraries have a vital role in managing internal publications to support the education 
ecosystem. In connection with the role of the PINT to support NKM and nuclear development, it is 
necessary to apply technology to help provide advice on certain classes of documents. In addition, 
in order to see scientific developments, generally experts conduct bibliometric studies which are 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 4 
SANTOSA 

limited to the title and abstract fields. Text mining provides an opportunity to dig deeper. Instead 
of just the title and abstract, this study used the full text of the final project collection. The trend of 
a subject will be seen from the growth and percentage of existing documents. So, the objectives of 
this study are to 

• explore the best kNN model to be applied to classify the final project; 
• know the development of nuclear subjects based on BATAN’s knowledge taxonomy; and 
• know the development of nuclear subjects from each study program at the PINT. 

METHODS 

A total of 99 documents were taken from the BATAN repository and manually labeled as 
reference data. This study was conducted using RapidMiner Studio software. The first document 
processing method is to convert all words into lower case and divide the text into a collection of 
tokens. Filters on tokens are also applied based on the length of the token. In this case, the author 
applied a minimum of 3 characters and a maximum of 25 characters. Stop words were also applied 
to eliminate short words (e.g., “and,” “the,” and “are”), thereby reducing the vector size. English 
and Indonesian stop words were used for this study to overcome the use of English in the abstract 
section and Indonesian as the document language. The collection of words from Haryalesmana 

was chosen to be the stop words for Indonesian.24 The stemming technique is applied to reduce 
dimensions that are useful for improving the function of the classification system 25 by changing 
word forms into basic word,26 e.g., water, waters, watered, and watering into water. This analysis 
applies Wicaksana data to Indonesian stemming.27 

Some words cannot be separated from other words because they form a meaning, e.g., 
nondestructive testing, biological radiation effects, structural chemical analysis, and water -cooled 
reactors. To overcome this case, the use of n-grams can help identify compound words that have a 
meaning so that the words are not reduced.28 N-grams will record a number of “n” words that 
follow the previous word.29 To accommodate these words, in this study, three words were 
assigned to n-grams. 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 5 
SANTOSA 

Figure 1. Nuclear taxonomy classification framework. 

 
Vector creation in this study used TF-IDF and binary term occurrence and then compared them to 
determine the best performance. In the kNN method, it is necessary to determine the value of “k” 
manually, so a value of 2–10 was chosen by activating a weighted vote which is useful for weighing 
the contributions of neighbors in the vicinity. Weight voting indicates the use of multiple voting 
methods by assigning a weight to each neighbor depending on their distance from the unknown 
item.30 The types of measurement chosen to get maximum results were Numerical Measure and 
tested Cosine Similarity, Correlation Similarity, and Dice Similarity. Meanwhile, to measure 
performance, the author used cross validation with a number of folds of 10. Then, using this set of 
procedures, documents from the BATAN repository are classified. The procedure that achieves the 
highest level of accuracy is then submitted as a model. This model was applied to 563 final project 
documents that have not been labeled so that each document has a label according to BATAN’s 
knowledge taxonomy. 

RESULTS 

The experiment was carried out 54 times to determine the best kNN performance from the 
proposed approach, namely Cosine-binary, Correlation-binary, Dice-binary, Cosine–TF-IDF, 
Correlation–TF-IDF, and Dice–TF-IDF utilizing Cross validation. Cosine was still the most accurate 
in the TF-IDF vector creation process, with an accuracy of 81.89 percent on seven neighbors, and 
Dice reaches the lowest point when used on four neighbors. In contrast to Correlation and Dice, 
Cosine can perform well when creating binary vectors. Cosine on four neighbors had the best 
performance, with a 97 percent accuracy rate. The lowest accuracy occurred when the number of 
selected neighbors was two and the overall numerical measure had decreased in neighbors more 
than nine.  

The classification model for unlabeled documents was determined to be the Cosine-binary method 
with four neighbors. The experiment found that this method did not successfully group three 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 6 
SANTOSA 

documents (for details of the Confusion Matrix, see appendix A). Even though document 7 ought to 
be on NFCAM, but with a lower score of 0.49921, it was predicted on the NRSS with a confidence 
value of 0.50079. Documents 86 and 93, which were supposed to be about ENDF, were unable to 
be foreseen. Document 93 was predicted on the NRSS with a confidence value of 0.50126 and 
document 86 was predicted on the NR with a value of 0.49936.  

Figure 2. A comparison of the accuracy levels in the kNN method. 

 
This study utilized 563 unlabeled documents that were divided into six years. There were 34 
fewer documents in 2021 than there were in 2020, a significant drop from the previous year (see 
table 1). The number of documents then climbed again in 2022, reaching 98. RapidMiner’s labeling 
process ran into issues when it got to the process document stage. To improve memory 
performance, the documents were split into three runs (2017–2018, 2019–2020, and 2021–2022) 
because the memory was not sufficient to execute a set of commands on docu ment processing. 

The results of the previous set of procedures were then exported as tabular data for further study. 
Every year, the evolution of each nuclear subject can be seen in the final project report (see fig. 3). 
During the test period, 282 documents (50.09%) of the total extant papers had an ENDF study, 
followed by IR with 95 documents (16.87%) and NFCAM with 69 documents (12.26%). While 
there were very little changes between NR and NRSS, NR contains 47 papers (8.35%) connected 
while NRSS had 45 documents (7.99%). Mgt was the subject with the fewest documents, with a 
total of 25 (4.44%) from 2017 to 2022. 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 7 
SANTOSA 

Table 1. The PINT’s final project documents growth from 2017 to 2022 

Study Program 2017 2018 2019 2020 2021 2022 Grand 
total 

Electromechanics 35 34 43 35 24 41 212 

Electronics 
Instrumentation 

27 34 38 38 22 28 187 

Nuclear 
Technochemistry 

31 31 26 27 20 29 164 

Grand total 93 99 107 100 66 98 563 

 
See appendix B for more information on the confidence value of each predicted document. 

Of the 212 final project reports in the Electromechanics study program 63.68 percent (135 
documents) were projected to be on the ENDF subject, followed by 17.92 percent (38 documents) 
on NFCAM, NRSS with 8.96 percent (19 documents), and NR 5.19 percent (11 documents). 
Meanwhile, IR had the fewest papers predicted, with 2.83 percent (6 documents) while Mgt had 
1.42 percent (3 documents) predicted. Every year, ENDF was the most predicted subject in this 
study (see fig. 4). 

Figure 3. Nuclear subject development by percentage each year. 

 
INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 8 
SANTOSA 

Figure 4. Nuclear subject development in Electromechanics by % each year.  

 
The final project report on Instrumentation Electronics, which included 187 papers, was 
successfully predicted into five subjects. ENDF was projected to contain 141 documents (75.40%), 
NRSS was likely to contain 24 documents (12.83%), and NR was predicted to contain 14 
documents (7.49%). Furthermore, only 7 documents (3.74%) on Mgt and 1 document (0.53%) on 
IR were predicted. NFCAM, on the other hand, is not mentioned in any of the Electronics 
Instrumentation publications (see fig. 5). 

Final processing was performed on a collection of Nuclear Technochemistry documents. One 
hundred sixty-four documents are predicted at IR of 53.66 percent (88 documents), NFCAM of 
18.90 percent (31 documents), NR of 13.41 percent (22 documents), Mgt of 9.15 percent (15 
documents), ENDF of 3.66 percent (6 documents), and the remaining 1.22 percent (2 documents) 
were predicted on the NRSS. Subjects that were popular each year vary (see fig. 6) when 
compared to Electromechanics and Instrumentation Electronics, where ENDF was the most 
popular topic in these two study programs. 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 9 
SANTOSA 

Figure 5. Nuclear subject development in Electronics Instrumentation by % each year. 

 
INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 10 
SANTOSA 

Figure 6. Nuclear subject development in Nuclear Technochemistry by % each year. 

 
DISCUSSION 

The study found that implementing kNN with Cosine Similarity in association with vector 
construction=Binary and k=4 resulted in the highest accuracy results of 97 percent. In general, 
this strategy outperformed in every class examined, and it can only be balanced on one occasion, 
notably at k=9 by utilizing Correlation Similarity. When compared to the use of TF-IDF, the results 
likewise indicated that binary term occurrence always functioned well. TF-IDF was only able to 
achieve its highest accuracy of 81.89 percent when k was 7 using Correlation Similarity. Cosine 
similarity also seemed to work efficiently on every vector creation, both when using binary and 
TF-IDF (in classes numbering 2, 5, and 10 the use of TF-IDF was not optimal), compared to 
numerical measures of Correlation Similarity and Dice Similarity. Cosine similarity evaluates the 
similarity of documents, and a high similarity score indicates that the documents are quite 
similar.31 

Nuclear Field Growth 
In general, aside from the ENDF field, which is steady and increasing, other subjects endure annual 
changes in development. For the past six years, ENDF has been the most popular subject among 
students. The ENDF reached the highest percentage rate in 2022, with 59 documents predicted on 
this subject. Students preferred engineering final project reports on mechanics and structures, 
electromechanics, control systems, nuclear instrumentation, or nuclear facility process 
technology. Research conducted by Wang et al. also suggests that the current popular topic of 
research on nuclear power is modeling and simulation.32 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 11 
SANTOSA 

The ENDF document’s average confidence value was 0.6499916, with a median value of 
0.7490455. The two documents with the lowest confidence in the ENDF were document numbers 
233 and 597. Document 233 had a confidence value of 0.25105 and was predicted in the other 
three subject areas (NRSS, NR, Mgt) with close values. Likewise, the 597 documents predicted in 
the ENDF with a confidence value of 0.25156 were higher than the NRSS, NFCAM, and IR subjects, 
but with a not too significant difference. Both of these documents can be investigated further and 
directly evaluated by the librarian in order to obtain a more precise field. The majority of the final 
project reports projected in the ENDF have confidence levels around 0.50, and some even higher 
at 0.75. This study also reveals that 11 documents in the ENDF category have a confidence value of 
1. With lower NRSS confidence values, 239 ENDF documents connected to the NRSS field. This 
relationship demonstrates a good tendency among students conducting nuclear engineering 
related to the NRSS discipline. 

Though it differs significantly from ENDF, IR is becoming a prominent field. The final project 
report for IR was developed in 2017–2018, but it shrank again from 2019 to 2021, then increased 
in 2022. In comparison to other fields, IR has the highest minimal confidence score of 0.4 987, with 
many documents lying within the 0.5 and 0.75 range. Meanwhile, the confidence value for 26 
documents predicted by IR is 1. The NFCAM subject area is a prediction that appears frequently in 
IR predictions but has a lower level of confidence. There are 54 documents indicating the 
existence of research that involves isotopes and radiation in nuclear materials, nuclear 
excavations, radioactive waste, structures, or advanced materials. 

NFCAM is inversely proportional to the conditions that occur in ENDF. After increasing in 2019, 
this subject faced a reversal over the next three years, with only two documents classified in this 
subject through 2022. Students are still uncommonly interested in nuclear minerals, nuclear fuel, 
radioactive waste, structural materials, and advanced materials. Six projected documents in this 
field have confidence levels of 1, while many more have confidence levels between 0.50 and 0.75. 
The IR field is also expected to appear alongside the NFCAM field publications. 

There were also ups and downs in NR and NRSS. Twenty-five of the 47 documents identified on 
the NR were also predicted with a lower value in the NRSS field. This demonstrates that students 
explored the relationship between the subject of reactor research and safety and security in 
various documents. Meanwhile, only eight of the 46 NRSS papers are unrelated to the ENDF field. 
This demonstrates that students who study nuclear safety and security tend to perform 
engineering to address situations involving nuclear safety and security. Documents in these two 
fields are usually concentrated in the 0.5 confidence value range in both NR and NRSS. 

Mgt is one of the least studied topics among students. Human resources, organization, 
management, program planning, auditing, quality systems, informatics utilization, or cooperation 
are more commonly associated with the Mgt field. The Mgt increased in 2020, although it became 
the field with the fewest documents on earlier occasions (2017 to 2019 and 2021 to 2022). In 
terms of confidence value, 21 Mgt documents have a value greater than 0.5, with eight documents 
worth 1. With 10 documents, the ENDF is the most often discussed study area with Mgt. 

Progression in Each Study Program 

Even if they are still within the purview of nuclear science, the growth of the nuclear field in each 
study program differs depending on the curriculum. Students are influenced by knowledge, and 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 12 
SANTOSA 

more specifically the process of learning and comprehending (whether theoretical or more 
practical).33 

ENDF is still the most popular field in Electromechanics and Electronics Instrumentation study 
programs. These two study programs offer courses in ENDF topic areas such as mechanical, civil 
and architectural, electromechanical, electrical, control systems, and radiation detection for 
nuclear devices. Furthermore, the Electronics Instrumentation study program offers courses on 
nuclear electronics, signal processing techniques, and practical work on interface and data 
acquisition techniques, all of which are part of the ENDF nuclear instrumentation group. 

Apart from ENDF, the fields of NFCAM and NRSS have been present in Electromechanics for a 
period of six years. While Mgt is currently a less appealing topic, there have been no final project 
reports relating to Mgt in the most recent three years. 

In Electronics Instrumentation, the absence of a field occurs in NFCAM. The findings of the 
predictions demonstrate that none of the documents predicted on NFCAM were proper. 
Meanwhile, only 10 documents that intersect with NFCAM which have lower confidence in the 
range of values from 0.247 to 0.251. Nuclear minerals, nuclear fuel, structural materials and 
advanced materials, and radioactive waste were not studied in depth in this study program, 
illustrating why NFCAM is not predicted in instrumentation electronics. 

In contrast to other study programs, IR is the most predictable field in the final project report in 
Nuclear Technochemistry. In this investigation, Nuclear Technochemistry owns 88 of the 95 
documents examined. This study program includes IR specializations such as the use of isotopes 
and radiation in agriculture, health, and industry. Radioisotope production becomes another 
discipline that specializes in the creation of isotopes and radiation sources, which explains why IR  
is so popular among Nuclear Technochemistry students. The NFCAM field was not present in 
2022, despite the fact that it had been the topic of several students’ studies throughout the 
preceding five years. While the ENDF and Mgt fields have only been present in the last three years, 
there were no predictable papers in the previous three years. 

CONCLUSION 

The trend of research activities carried out by students from one study program to the next 
appears to vary although they are both within the scope of the nuclear field. For example, the field 
of ENDF is quite popular among Electromechanics and Electronics Instrumentation students but 
not for Nuclear Technochemistry students because ENDF only appeared three years ago and the 
number of documents is still modest. However, ENDF deserves to be a field that needs attention. 
Nuclear Technochemistry students with radiochemistry learning experiences demonstrate that 
the IR field is linear and interesting to them. Due to a paucity of publications, the low proportion in 
certain categories, e.g., Mgt, shows a potential to further investigate this field. 

This study demonstrates an opportunity to use text mining to assist librarians in performing 
automatic document classification based on specific subjects. The best model in this study is 
produced by combining kNN with Cosine similarity and binary term occurrence. The model used 
can help improve the quality of decisions made to accurately and efficiently categorize documents. 
To determine a more specific classification, pay close attention to documents that have a low level 
of confidence and intersect with other issues. This study is limited to the kNN method and 


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 13 
SANTOSA 

documents from the BATAN repository, as well as final project documents for PINT students. 
Large-scale testing can be conducted, for instance, in the International Atomic Energy Agency ’s 
(IAEA) nuclear repository known as the International Nuclear Information System (INIS) 
Repository, or in other databases with the complexity of categorizing documents throughout many 
languages. 

DATA ACCESSIBILITY 

Datasets and data analysis code for RapidMiner have been uploaded to the RIN Dataverse: 
https://hdl.handle.net/20.500.12690/RIN/ASRGVO. 

Data visualization can be accessed through Tableau Public: 
https://public.tableau.com/app/profile/faizhal.arif/viz/FinalProjectTrendsUtilizingNuclearKnow
ledgeTaxonomy/Story1 

  
https://hdl.handle.net/20.500.12690/RIN/ASRGVO
https://public.tableau.com/app/profile/faizhal.arif/viz/FinalProjectTrendsUtilizingNuclearKnowledgeTaxonomy/Story1
https://public.tableau.com/app/profile/faizhal.arif/viz/FinalProjectTrendsUtilizingNuclearKnowledgeTaxonomy/Story1


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 14 
SANTOSA 

APPENDIX A: CONFUSION MATRIX OF 10-FOLD CROSS VALIDATION 

Accuracy: 97.00% +/- 4.83% (micro average: 96.97%) 

 
True 

NFCAM 
True IR 

True 
NRSS 

True 
Mgt 

True 
NR 

True 
ENDF 

Class 
precision 

Pred. NFCAM 13 0 0 0 0 0 100.00% 

Pred. IR 0 18 0 0 0 0 100.00% 

Pred. NRSS 1 0 20 0 0 1 90.91% 

Pred. Mgt 0 0 0 19 0 0 100.00% 

Pred. NR 0 0 0 0 13 1 92.86% 

Pred. ENDF 0 0 0 0 0 13 100.00% 

Class recall 92.86% 100.00% 100.00% 100.00% 100.00% 86.67%  

 
INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 15 
SANTOSA 

APPENDIX B: THE CONFIDENCE VALUE OF EACH FIELD 
E

N
D

F
 

IR
 

M
g
t 

 
INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 16 
SANTOSA 

N
F

C
A

M
 

N
R

 
N
R

S
S

 
INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 17 
SANTOSA 

ENDNOTES 
 

1 Budi Prasetyo and Anggiana Rohandi Yusuf, “Pengelolaan Pengetahuan Eksplisit Berbasis 
Teknologi Informasi di BATAN,” in Prosiding Seminar Nasional SDM teknologi Nuklir (Seminar 
Nasional SDM Teknologi Nuklir, Yogyakarta: Sekolah Tinggi Teknologi Nuklir, 2018), 126–32, 
https://inis.iaea.org/collection/NCLCollectionStore/_Public/50/062/50062856.pdf?r=1 . 

2 Joanna Yi-Hang Pong et al., “A Comparative Study of Two Automatic Document Classification 
Methods in a Library Setting,” Journal of Information Science 34, no. 2 (April 2008): 213–30, 
https://doi.org/10.1177/0165551507082592. 

3 Prasetyo and Yusuf, “Pengelolaan Pengetahuan Eksplisit.” 

4 Jae-Ho Kim and Key-Sun Choi, “Patent Document Categorization Based on Semantic Structural 
Information,” Information Processing & Management 43, no. 5 (September 2007): 1200–15, 
https://doi.org/10.1016/j.ipm.2007.02.002; Pong et al., “A Comparative Study”; Khusbu 
Thakur and Vinit Kumar, “Application of Text Mining Techniques on Scholarly Research 
Articles: Methods and Tools,” New Review of Academic Librarianship (May 12, 2021): 1–25, 
https://doi.org/10.1080/13614533.2021.1918190. 

5 Pong et al., “A Comparative Study.” 

6 Kiri L. Wagstaff and Geoffrey Z. Liu, “Automated Classification to Improve the Efficiency of 
Weeding Library Collections,” The Journal of Academic Librarianship 44, no. 2 (March 2018): 
238–47, https://doi.org/10.1016/j.acalib.2018.02.001. 

7 Manika Lamba and Margam Madhusudhan, “Mapping of Topics in DESIDOC Journal of Library 
and Information Technology, India: A Study,” Scientometrics 120, no. 2 (August 2019): 477–
505, https://doi.org/10.1007/s11192-019-03137-5. 

8 Fábio Figueiredo et al., “Word Co-Occurrence Features for Text Classification,” Information 
Systems 36, no. 5 (July 2011): 843–58, https://doi.org/10.1016/j.is.2011.02.002; Yen-Hsien 
Lee et al., “Use of a Domain-Specific Ontology to Support Automated Document Categorization 
at the Concept Level: Method Development and Evaluation,” Expert Systems with Applications 
174 (July 2021): 114681, https://doi.org/10.1016/j.eswa.2021.114681; Yousif A. Alhaj et al., 
“A Study of the Effects of Stemming Strategies on Arabic Document Classification,” IEEE Access 
7 (2019): 32664–71, https://doi.org/10.1109/ACCESS.2019.2903331. 

9 David Antons et al., “The Application of Text Mining Methods in Innovation Research: Current 
State, Evolution Patterns, and Development Priorities,” R&D Management 50, no. 3 (June 
2020): 329–51, https://doi.org/10.1111/radm.12408; Muhammad Arshad et al., “Next 
Generation Data Analytics: Text Mining in Library Practice and Research,” Library Philosophy 
and Practice (2020): 1–12. 

10 Mowafy Mona, Rezk Amira, and Hazem M. El-Bakry, “An Efficient Classification Model for 
Unstructured Text Document,” American Journal of Computer Science and Information 
Technology 06, no. 01 (2018), https://doi.org/10.21767/2349-3917.100016. 

 
https://inis.iaea.org/collection/NCLCollectionStore/_Public/50/062/50062856.pdf?r=1
https://doi.org/10.1177/0165551507082592
https://doi.org/10.1016/j.ipm.2007.02.002
https://doi.org/10.1080/13614533.2021.1918190
https://doi.org/10.1016/j.acalib.2018.02.001
https://doi.org/10.1007/s11192-019-03137-5
https://doi.org/10.1016/j.is.2011.02.002
https://doi.org/10.1016/j.eswa.2021.114681
https://doi.org/10.1109/ACCESS.2019.2903331
https://doi.org/10.1111/radm.12408
https://doi.org/10.21767/2349-3917.100016


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 18 
SANTOSA 

 
11 Kim and Choi, “Patent Document Categorization.” 

12 Alhaj et al., “A Study of the Effects of Stemming Strategies.”  

13 Mona, Amira, and El-Bakry, “An Efficient Classification Model.” 

14 Thakur and Kumar, “Application of Text Mining Techniques.” 

15 Kim and Choi, “Patent Document Categorization.” 

16 Wagstaff and Liu, “Automated Classification.” 

17 Najat Ali, Daniel Neagu, and Paul Trundle, “Evaluation of K-Nearest Neighbour Classifier 
Performance for Heterogeneous Data Sets,” SN Applied Sciences 1, no. 12 (December 2019): 
1559, https://doi.org/10.1007/s42452-019-1356-9. 

18 Roiss Alhutaish and Nazlia Omar, “Arabic Text Classification Using K-Nearest Neighbour 
Algorithm,” The International Arab Journal of Information Technology 12, no. 2 (2015): 190–95. 

19 Mona, Amira, and El-Bakry, “An Efficient Classification Model.” 

20 Guozhong Feng et al., “A Probabilistic Model Derived Term Weighting Scheme for Text 
Classification,” Pattern Recognition Letters 110 (July 2018): 23–29, 
https://doi.org/10.1016/j.patrec.2018.03.003. 

21 Snezhana Sulova et al., “Using Text Mining to Classify Research Papers,” in 17th International 
Multidisciplinary Scientific GeoConference SGEM 2017, vol. 17, International Multidisciplinary 
Scientific GeoConference-SGEM (17th International Multidisciplinary Scientific GeoConference 
SGEM, Sofia: Surveying Geology & Mining Ecology Management (SGEM), 2017), 647 –54, 
https://doi.org/10.5593/sgem2017/21/S07.083. 

22 Lee et al., “Use of a Domain-Specific Ontology.” 

23 Man Lan et al., “Supervised and Traditional Term Weighting Methods for Automatic Text 
Categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence 31, no. 4 (April 
2009): 721–35, https://doi.org/10.1109/TPAMI.2008.110. 

24 Devid Haryalesmana, “Masdevid/ID-Stop words,” 2019, https://github.com/masdevid/ID-Stop 
words. 

25 Alhaj et al., “A Study of the Effects of Stemming Strategies.” 

26 Pong et al., “A Comparative Study.” 

27 Ananta Pandu Wicaksana, “Nolimitid/Nolimit-Kamus,” 2015, 
https://github.com/nolimitid/nolimit-kamus. 

28 Antons et al., “The Application of Text Mining Methods.” 
 

https://doi.org/10.1007/s42452-019-1356-9
https://doi.org/10.1016/j.patrec.2018.03.003
https://doi.org/10.5593/sgem2017/21/S07.083
https://doi.org/10.1109/TPAMI.2008.110
https://github.com/nolimitid/nolimit-kamus


INFORMATION TECHNOLOGY AND LIBRARIES MARCH 2023 

EXPLORING FINAL PROJECT TRENDS UTILIZING NUCLEAR KNOWLEDGE TAXONOMY 19 
SANTOSA 

 
29 Kanish Shah et al., “A Comparative Analysis of Logistic Regression, Random Forest and KNN 
Models for the Text Classification,” Augmented Human Research 5, no. 1 (December 2020): 12, 
https://doi.org/10.1007/s41133-020-00032-0. 

30 Judit Tamas and Zsolt Toth, “Classification-Based Symbolic Indoor Positioning over the Miskolc 
IIS Data-Set,” Journal of Location Based Services 12, no. 1 (January 2, 2018): 2–18, 
https://doi.org/10.1080/17489725.2018.1455992. 

31 Hanan Aljuaid et al., “Important Citation Identification Using Sentiment Analysis of In -Text 
Citations,” Telematics and Informatics 56 (January 2021): 101492, 
https://doi.org/10.1016/j.tele.2020.101492. 

32 Qiang Wang, Rongrong Li, and Gang He, “Research Status of Nuclear Power: A Review,” 
Renewable and Sustainable Energy Reviews 90 (July 2018): 90–96, 
https://doi.org/10.1016/j.rser.2018.03.044. 

33 Ronald Barnett, “Knowing and Becoming in the Higher Education Curriculum,” Studies in Higher 
Education 34, no. 4 (June 2009): 429–40, https://doi.org/10.1080/03075070902771978. 

https://doi.org/10.1007/s41133-020-00032-0
https://doi.org/10.1080/17489725.2018.1455992
https://doi.org/10.1016/j.tele.2020.101492
https://doi.org/10.1016/j.rser.2018.03.044
https://doi.org/10.1080/03075070902771978

	Abstract
	Introduction
	Literature Review
	Text Mining in Libraries
	kNN Classifier
	Aims of the Study

	Methods
	Results
	Discussion
	Nuclear Field Growth
	Progression in Each Study Program

	Conclusion
	Data Accessibility
	Appendix A: Confusion Matrix of 10-Fold Cross Validation
	Appendix B: The Confidence Value of Each Field
	Endnotes