key: cord-0075243-o7thmezh
authors: Gorbachev, I. E.; Kriulin, A. A.; Latypov, I. T.
title: Methodology of Mediametric Information Analysis with the Use of Machine Learning Algorithms
date: 2022-03-01
journal: Aut
DOI: 10.3103/s0146411621080125
sha: e5a1c92d1da71e2c6c10bdb5d7e7bf4296f18a3a
doc_id: 75243
cord_uid: o7thmezh

The qualitative analysis of information messages and the assessment of publications on the Internet are becoming more urgent than ever. A large number of materials are published on the Internet on various events in the world; the nature of these publications can affect the political and social life of society. In order to ensure the safety of the population of the Russian Federation and meet the requirements of regulatory documents, a methodology for mediametric information analysis with the use of machine learning algorithms is proposed. Based on the results of research in this area, the main approaches to mediametric information analysis are determined. An approach is proposed for determining the sentiment of publications using the Word2Vec model and machine learning algorithms for natural language processing. A methodology is formulated that takes into account the technical features of a publication source and the existing methods of mediametric information analysis. On the basis of real information publications, the results of the implementation of the methodology of mediametric analysis and determination of the sentiment of messages are presented.

In the context of the expanding spectrum of forms of confrontation in the political, economic, informational, diplomatic, and other spheres, traditional views on armed military and local conflicts are changing.
As noted in the Doctrine of Information Security of the Russian Federation [1], approved by the decree of the President of the Russian Federation of December 5, 2016: "The scale of the use of information and psychological influence by special services of individual states, aimed at destabilizing the internal political and social situation, is expanding… The informational influence on the population of Russia is increasing…". In order to counter information aggression against the Russian Federation, the mediametric assessment of information messages in communications of all scales is of particular importance.

The relevance of mediametric research of information in the global information network (GIN) Internet stems from the contradiction between the need for an adequate assessment of publications for making informed decisions and the imperfection of methods for quantitatively assessing the various qualitative properties of an information message and its source. The analysis of the works of scientists of the Russian Institute for Strategic Studies [2] and of Russian media researchers [3, 4] shows that a technical groundwork has been created for quantitatively assessing information publications: economic aspects of the work of the media have been considered, a number of indicators of the political coloration of news messages have been proposed, and proposals for their application have been developed. However, there is no systematic approach to the mediametrics of information for ensuring the security of the Russian Federation. Assessment is carried out on separate indicators, which allows analyzing information messages only from one side, outside the framework of a systematic approach, and a complex mediametric criterion has not been defined or justified.
Single-type approaches are used to assess information from various communication environments without taking into account the technical features of their functioning. The scientific novelty of the research results is determined by a new approach to the mediametric assessment of information messages using modern methods of data mining and by a developed methodology that takes into account the existing methods of mediametric analysis. The practical significance of the research lies in the development of a comprehensive methodology for the quantitative assessment of information publications in the GIN Internet, which can be used to provide psychological protection from information influences on the population of Russia.

The historical definition of mediametry presumably formed in the middle of the 20th century and, by the beginning of the 1990s, had been reduced to the practical activity of studying and collecting quantitative and qualitative data on media channels and audiences in order to study the effectiveness of the media [5]. With the current development of modern communication media, the concept of mediametry has expanded significantly and now lies at the intersection of disciplines such as political science, sociology, economics, history, and applied mathematics. For each of these disciplines, the concept of mediametry reflects its own specifics within the framework of solving specific applied problems. As the author of the study [3] points out, in the field of media consumption and mediametry there is no uniform terminology and there are no standards that take into account the dissemination of information using innovative technologies and services. There are also no quantitative indicators of media consumption, no official data on the total volume of and time spent on individual media resources, and no information on the distribution of the audience across different platforms, devices, and consumption methods.
In this regard, the analysis of the application of existing mediametric methods is carried out based on the characteristics of various applied disciplines. Modern media economics, as one of the varieties of mediametric knowledge of information processes, distinguishes a number of special indicators [4]:

(1) Coverage of a unique audience: quantitative characteristics of a particular information carrier in terms of the number of contacts with recipients it can provide. For printed publications, this is the circulation and the number of readers per copy; for radio or TV channels, it is the number of listeners or viewers in a given period of time.

(2) Technical coverage of the audience: quantitative indicators of the technical capabilities of information media in different regions. The study of the audience is also accompanied by demographic research (age composition of the audience, education, income level, etc.) and psychographic research (lifestyle, values, opinions, etc.).

(3) Relevance to the target group: an indicator that takes into account the percentage of the client audience for a particular communication channel in the total audience size from the coverage indicator.

(4) Rating: the ratio of the number of people watching, listening to, or reading certain information content to the total audience that has the appropriate technical capabilities.

(5) Cumulative rating: the sum of the ratings of all presentations of a particular content, expressed as a percentage of the audience.

More specific media-economic indicators are considered in detail in [4]. It should be noted that the above indicators have a number of disadvantages due to the technical features of specific information sources: counters of Internet resources record not real people but various device identifiers (IP address, browser, etc.) of GIN Internet users.
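The rating and cumulative-rating indicators above reduce to simple arithmetic over audience counts. The following is a minimal sketch; all audience figures are hypothetical and are used only to illustrate the calculation.

```python
# Sketch of the rating and cumulative-rating indicators described above.
# All audience figures below are illustrative assumptions.

def rating(consumers: int, technical_audience: int) -> float:
    """Rating: the share of the technically reachable audience that
    actually watched, listened to, or read the content."""
    return consumers / technical_audience

def cumulative_rating(ratings: list) -> float:
    """Cumulative rating: the sum of the ratings of all presentations
    of the content, expressed as a percentage of the audience."""
    return sum(ratings) * 100

# A channel technically reaches 2,000,000 households; 300,000 watched.
r = rating(300_000, 2_000_000)
print(f"rating: {r:.2%}")  # rating: 15.00%

# Three airings of the same material with ratings of 15%, 10%, and 5%.
print(f"cumulative rating: {cumulative_rating([0.15, 0.10, 0.05]):.0f}%")
```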
Accordingly, the quantitative audience will grow when one user has several devices, which distorts the real assessment.

Within the framework of political mediametry, two main characteristics are used [5]: the intensity of event coverage and the aggressiveness of the coverage. The coverage intensity is defined as the average number of significant materials that appeared in the mass media of a particular country during a calendar month. The degree of aggressiveness is determined using the aggressiveness index, which is calculated as the ratio of the number of negative materials to the number of neutral publications. With the help of the indicators of political mediametry, it is possible to determine some characteristics of an information attack, for example, the moment of its beginning, its intensity, and the degree of pressure.

From the analysis of the applicability of existing mediametric methods, it follows that, firstly, there is no official regulatory framework for the quantitative and qualitative assessment of information messages, which significantly complicates research in the field of mediametric analysis. Secondly, the economic methods of mediametrics are quite specific and are used to assess the profitability of the media. Thirdly, the indicators of political mediametry make it possible to determine some quantitative indicators of information messages, but they are quite subjective and are mainly determined by an expert assessment of the sentiment of a publication.

Information on the Internet and in the media is big data, and, in order to assess it qualitatively and quantitatively, it is necessary to use data science as the core of modern information and analytical systems. As part of the approach to the quantitative assessment of information, it is proposed to systematize the most popular methods of data analysis for their further application to specific events.
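The two political-mediametry indicators just defined can be sketched directly from their definitions. The monthly counts below are hypothetical.

```python
# Minimal sketch of the two political-mediametry indicators described
# above. All counts are hypothetical.

def coverage_intensity(monthly_counts: list) -> float:
    """Average number of significant materials per calendar month."""
    return sum(monthly_counts) / len(monthly_counts)

def aggressiveness_index(negative: int, neutral: int) -> float:
    """Number of negative materials divided by the number of neutral
    publications."""
    if neutral == 0:
        raise ValueError("undefined when there are no neutral publications")
    return negative / neutral

# Three months of coverage with 120, 95, and 140 significant materials.
print(coverage_intensity([120, 95, 140]))  # ≈ 118.33

# 40 negative publications against 160 neutral ones.
print(aggressiveness_index(40, 160))  # 0.25
```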
(1) Determination of the publication sentiment: positive, negative, or neutral. Solving this problem makes it possible to assess the opinion of the media regarding various events, the reactions of famous people to an event, and so on. The purpose of sentiment analysis using machine learning is to determine the properties of text information and find opinions and emotions in the text. Depending on the nature of the tasks being solved, sentiment analysis comes down to determining the topic of a publication and the author's opinion and position. The use of machine learning methods is motivated by the large number of studies that show its effectiveness in information security problems [6-9].

To determine the sentiment using machine learning algorithms, the text of publications is represented as numeric vectors. There are two main ways to do this. The first method, bag-of-words, involves compiling one large dictionary of the words that occur in the text and assigning an index to each word. The original words in the text are replaced with the corresponding indices, and the resulting vectors are compared using metrics (Euclidean distance, Chebyshev distance, etc.). The second way is to use the word2vec model, whose task is to maximize the proximity of the vectors of words that occur near each other and minimize the proximity of the vectors of words that do not occur together. Formally, such a model can be represented as follows (the equation was garbled in extraction; it is reconstructed here as the standard skip-gram softmax, which matches the definitions given):

p(w_c | w_v) = exp(v_{w_c}^T v_{w_v}) / Σ_{w_{ci}} exp(v_{w_{ci}}^T v_{w_v}),   (1)

where w_v is the target word, w_c is a context word, and the sum in the denominator runs over all context words w_{ci} in the document. The resulting numeric word vectors serve as a data sample for machine learning algorithms. The use of machine learning algorithms allows determining the sentiment of a publication, that is, classifying it into one of three groups: positive, negative, or neutral. To solve the classification problem, it is necessary to form a training sample of publications that contains positive, negative, and neutral texts in equal proportions.
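The first vectorization method above, bag-of-words with a metric comparison, can be sketched in a few lines. The example texts are hypothetical.

```python
import math

# Minimal bag-of-words sketch of the first vectorization method above:
# build one shared vocabulary, represent each text as a word-count
# vector, and compare the vectors with the Euclidean metric.

def build_vocab(texts):
    """Assign a stable index to every word seen across all texts."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Count-vector of a text over the shared vocabulary."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

texts = ["great news today", "terrible awful news", "great news again"]
vocab = build_vocab(texts)
vectors = [vectorize(t, vocab) for t in texts]

# Texts sharing more words end up closer in the vector space.
print(euclidean(vectors[0], vectors[1]))  # 2.0
print(euclidean(vectors[0], vectors[2]))  # ≈ 1.414
```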
In the absence of a labeled training sample, it is advisable to start the analysis of publications with a clustering problem, following the same principles as for the classification problem.

(2) Use of directed graphs in the analysis of social networks [10]. This method is based on centrality indices, which characterize the most important vertices of a graph. Applications using this metric can be used to identify influencers in a social network; Google's PageRank algorithm for ranking web pages by their links is based on the same idea.

(3) Use of a word cloud to visualize the words in a publication [10]. This method belongs to natural language processing (NLP) techniques and boils down to placing words on a chart with sizes proportional to their frequencies.

Most of the problems solved by NLP can be fully used for mediametric information analysis when making decisions on specific events. The disadvantages of this approach include the need for a high level of knowledge of the subject area of the assessed publications in order to create high-quality (representative) training samples. Also, with a significant increase in the volume of processed texts, the requirements for computing power grow proportionally.

Taking into account the results of the analysis of the application of mediametry methods, as well as the approach to the quantitative assessment of information using machine learning algorithms, a methodology has been developed for assessing text publications. The general scheme of the methodology is shown in Fig. 1. The methodology consists of eight main stages. At the first stage, all publication methods are determined for each source.
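The centrality idea from item (2) above can be illustrated with a pure-Python power-iteration sketch of PageRank over a tiny hypothetical repost graph (this is an illustrative reimplementation, not the method used in the paper; edges point from a reposting account to the original source).

```python
# Hedged sketch of the centrality index behind PageRank, on a small
# hypothetical repost graph. Pure-Python power iteration, no libraries.

def pagerank(edges, n, damping=0.85, iterations=50):
    """edges: list of (src, dst) pairs over nodes 0..n-1.
    Every node is assumed to have at least one outgoing edge."""
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Teleport term plus rank mass passed along each edge.
        new = [(1.0 - damping) / n] * n
        for src, dst in edges:
            new[dst] += damping * rank[src] / out_degree[src]
        rank = new
    return rank

# Accounts 1, 2, and 3 all repost account 0; account 0 reposts account 1.
edges = [(1, 0), (2, 0), (3, 0), (0, 1)]
rank = pagerank(edges, 4)
print(max(range(4), key=lambda i: rank[i]))  # 0, the likely influencer
```

The node with the most incoming rank mass, account 0 here, is the candidate influencer; dedicated tools (Gephi, networkx) would be used on real social-network data.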
It is worth noting that publications on a topic from one source may differ structurally, technically, and substantively depending on the information platform (microblog, website, page on a social network, etc.); therefore, the methods of posting publications are grouped by web content. This step is also required to prepare automated content requests. Not every website provides a programming interface, so requests may have to be made through standard protocols; some websites also offer RSS feeds describing their news streams. Major social media and microblogging services provide application programming interfaces (APIs) for accessing content.

After identifying all the publication methods, the next step is to determine the time range of the event of interest. Using the time range, one can track the dynamics of publications, calculate intensity intervals, and build and analyze a time series of an event. To perform the stages of the methodology, the following information messages were taken:

- using the Twitter API for the Python programming language, a sample of US President Donald Trump's tweets from 2010 to March 2020 was obtained; the sample contains 41122 unique records;
- from Kaggle, a platform for research, data processing, and machine learning, a sample of open data on the topic of the coronavirus pandemic from January 2020 to March 2020 was obtained; the sample contains 48421 unique records [11];
- from the developer service GitHub [12], a sample of H. Clinton's mail messages during her tenure as US Secretary of State was obtained; the sample contains 7945 unique records.

At the third stage, the calculation of the main media-economic indicators for the source of publications is performed. Formalized methods for calculating media-economic indicators are presented in [4]. Media-economic indicators may reflect the desire of a source to increase its profitability by attracting an additional audience to high-profile events.
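Building the time series of an event from a time range reduces to bucketing publication timestamps by calendar month. A minimal sketch, with hypothetical dates:

```python
from collections import Counter
from datetime import datetime

# Sketch of the second stage: bucket publication timestamps by calendar
# month to build the time series of an event. The dates are hypothetical.

timestamps = [
    "2020-01-15", "2020-01-20", "2020-02-01",
    "2020-02-02", "2020-02-28", "2020-03-10",
]
monthly = Counter(
    datetime.strptime(t, "%Y-%m-%d").strftime("%Y-%m") for t in timestamps
)
series = sorted(monthly.items())
print(series)  # [('2020-01', 2), ('2020-02', 3), ('2020-03', 1)]

# Average coverage intensity over the observed range:
print(sum(monthly.values()) / len(monthly))  # 2.0
```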
Text data received from requests to web pages, as a rule, come in hypertext format, which significantly reduces their quality. For this reason, at the fourth stage, preprocessing of text publications is performed, which includes:

- removal of HTML tags, punctuation, symbols, and other service data; the most popular automated software implementation for hypertext processing is the Beautiful Soup HTML and XML parsing library;
- removal of common stop words such as prepositions, conjunctions, particles, and the like; stop-word removal is accomplished using the Natural Language Toolkit (NLTK), a Python library for symbolic and statistical natural language processing.

At the fifth stage, the main statistical characteristics of the samples of text publications are calculated: the average, maximum, and minimum number of words, the standard deviation, and other indicators.

At the sixth stage, graphical visualization of statistical indicators, dependencies, and characteristics of the text data is carried out. For plotting two-dimensional or three-dimensional graphs, the Matplotlib library for the Python programming language is used. An example of using visualization tools from the Matplotlib package on the sample of open data on the coronavirus pandemic is shown in Fig. 2. An example of constructing a word cloud for the tweets of US President D. Trump is shown in Fig. 3. If the sample of publications spans a long period of time, it is advisable to divide the word clouds into several intervals. Figure 3 shows that the words in D. Trump's publications at different periods of his political career differ significantly.

At the seventh stage, network graphs for the publications are constructed, which allows visualizing the process of disseminating information from various sources. To construct network graphs, it is most efficient to use special programs such as Gephi. The final stage of the methodology is to determine the sentiment of the text publications.
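The fourth and fifth stages can be sketched as follows. To keep the example dependency-free, HTML tags are stripped with a regular expression and a small hand-rolled stop-word list is assumed instead of NLTK's; the documents are hypothetical. A real pipeline would use Beautiful Soup and NLTK as described above.

```python
import re
import statistics

# Sketch of the preprocessing (stage 4) and basic statistics (stage 5).
# STOP_WORDS is an assumed, abbreviated list, not NLTK's real corpus.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def preprocess(html: str) -> list:
    text = re.sub(r"<[^>]+>", " ", html)   # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # strip punctuation and symbols
    return [w for w in text.lower().split() if w not in STOP_WORDS]

docs = [
    "<p>The spread of the virus is <b>slowing</b> in the region.</p>",
    "<div>New cases reported on Monday.</div>",
]
tokens = [preprocess(d) for d in docs]
lengths = [len(t) for t in tokens]

print(tokens[0])  # ['spread', 'virus', 'slowing', 'region']
# Maximum, minimum, and average number of words per publication:
print(max(lengths), min(lengths), statistics.mean(lengths))
```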
Using the statistical indicator TF-IDF [10], which assesses the importance of a word in the context of a document, a diagram of the words most frequently used by D. Trump and H. Clinton has been constructed (Fig. 4). The term frequency TF (2) is the ratio of the number of occurrences of a word to the total number of words in one message:

tf(t, d) = n_t / Σ_k n_k,   (2)

where t is a word in document d, n_t is the number of occurrences of the word t in d, and Σ_k n_k is the total number of words in the document. The inverse document frequency IDF (3) is the inverse of the frequency with which the word occurs across all messages (the formula was garbled in extraction and is reconstructed here in its standard form, consistent with the definitions given):

idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|),   (3)

where |D| is the total number of messages and |{d ∈ D : t ∈ d}| is the number of messages in which t occurs.

The result of applying the TF-IDF word frequency index to the samples of D. Trump's tweets and H. Clinton's mail messages is illustrated in Fig. 4. Based on the statements of D. Trump and H. Clinton shown in the figure, their priorities in a certain period of time can be assessed; as can be seen, they can differ significantly.

At this stage, we also used the word2vec model based on distributional semantics. The input of the model was the text data from the sample of D. Trump's tweets; at the output, the model returned a weighted vector for each word. Using the random forest algorithm, all publications from D. Trump's Twitter account were classified into three categories: positive, negative, and neutral. Figure 5 shows the result of classifying D. Trump's publications on Twitter by sentiment, before and after the presidential election, with the random forest algorithm and the word2vec model. To analyze the sentiment of H. Clinton's mail messages using the word2vec model, a Bayesian classifier was used; the result is shown in Fig. 6. As can be seen in Fig. 6, the most negative mail messages concerned Iraq and the UK.
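Formulas (2) and (3) can be combined into a short pure-Python TF-IDF sketch; the messages below are hypothetical.

```python
import math

# Pure-Python sketch of formulas (2) and (3): tf is the in-document
# word frequency, idf the log of the inverse share of documents that
# contain the word, and tf-idf their product. Messages are hypothetical.

def tf(word, doc):
    """Formula (2): occurrences of the word over the document length."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Formula (3): log of total messages over messages containing the
    word (assumes the word occurs in at least one message)."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

docs = [
    ["jobs", "jobs", "economy", "great"],
    ["economy", "policy", "reform"],
    ["jobs", "policy"],
]
word = "jobs"
score = tf(word, docs[0]) * idf(word, docs)  # 0.5 * log(3/2)
print(round(score, 4))  # 0.2027
```

Words that are frequent in one message but rare across the sample get the highest scores, which is what makes TF-IDF suitable for picking out each author's characteristic vocabulary.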
CONCLUSIONS

The analysis, systematization, and improvement of the scientific and methodological apparatus of mediametric information assessment in the GIN Internet in order to determine the nature of information have been carried out. The analysis showed that mediametric research of information messages is carried out separately for individual applied fields; there is no comprehensive systematic approach to the quantitative assessment of publications.

An approach to the quantitative assessment of information in the GIN Internet based on data mining has been proposed. The approach involves the use of modern machine learning algorithms, which automate the classification of publications in the media by sentiment. The disadvantages of this approach are the mathematical complexity of determining the sentiment of messages, as well as the need for representative labeled training samples.

The proposed approach made it possible to develop a comprehensive methodology for assessing information publications in the GIN Internet, which takes into account the existing methods of mediametric analysis and also uses methods of quantitative analysis based on machine learning algorithms. The methodology allows assessing text messages from sources that differ in technical characteristics, as well as determining the sentiment level of a message or source.

Thus, the developed complex methodology for assessing information publications in the GIN Internet makes it possible to quantitatively determine the main statistical indicators of a source of messages and take into account the economic indicators of resources. The technique also allows calculating the sentiment level of text publications using machine learning algorithms when analyzing the nature of information and its source.
REFERENCES

- On Enactment of Doctrine of Information Security of the Russian Federation
- Zarubezhnye SMI i bezopasnost' Rossii (Political Mediametry: Foreign Mass Media and Security of Russia)
- The studies of audience and media consumption in the digital environment: Methodological and practical problems
- Mediaekonomika zarubezhnykh stran. Uchebnoe posobie (Economics of Media in Foreign Countries: Handbook)
- Slovar'-spravochnik po materialam pressy i literatury 90-kh godov XX veka (Dictionary and Reference Book on Materials of Mass Media and Literature of the 1990s)
- Using graph theory for cloud system security modeling
- Applying deep learning techniques for Android malware detection
- The use of an artificial neural network to detect automatically managed accounts in social networks
- Information security evaluation for Android mobile operating system
- Data Science from Scratch: First Principles with Python
- Data sample on corona virus pandemic in the world

The authors declare that they have no conflicts of interest.