key: cord-0035943-e0ggcdv4
authors: Bader, Brett W.; Puretskiy, Andrey A.; Berry, Michael W.
title: Scenario Discovery Using Nonnegative Tensor Factorization
date: 2008
journal: Progress in Pattern Recognition, Image Analysis and Applications
DOI: 10.1007/978-3-540-85920-8_96
sha: a014c7cedf0955b68cc56e98e751d623858f863c
doc_id: 35943
cord_uid: e0ggcdv4

In the relatively new field of visual analytics there is a great need for automated approaches to both verify and discover the intentions and schemes of primary actors through time. Data mining and knowledge discovery play critical roles in facilitating the ability to extract meaningful information from large and complex textual-based (digital) collections. In this study, we develop a mathematical strategy based on nonnegative tensor factorization (NTF) to extract and sequence important activities and specific events from sources such as news articles. The ability to automatically reconstruct a plot or confirm involvement in a questionable activity is greatly facilitated by our approach. As a variant of the PARAFAC multidimensional data model, we apply our NTF algorithm to the terrorism-based scenarios of the VAST 2007 Contest data set to demonstrate how term-by-entity associations can be used for scenario/plot discovery and evaluation.

Visual analytics is the science of analytical reasoning supported by the highly interactive visual interface. Tools for visual analytics are designed to synthesize information; derive insight from large, dynamic, and often conflicting data; and facilitate the mining of such data for both expected and unexpected associations that can be both verified and easily communicated. The discipline of visual analytics is very interdisciplinary and extends well beyond traditional scientific and information visualization to include statistics, mathematics, knowledge representation, management and discovery technologies, cognitive and perceptual sciences, and decision sciences. In this study, we use nonnegative tensor factorization (NTF) techniques to extract term-by-entity associations from textual data describing the activities and events associated with a possible terrorism-based plot. As shown in [1] , nonnegative tensor factorization based on the well-known PARAFAC [2] model for multidimensional data can be effective in extracting important topics of discussions from media such as electronic mail. How such tensor factorizations would fare for (automatically) exposing the major (or sub-major) plots of a scenario hidden in the details of news stories, blog entries, and similar textual data was the question we sought to address in this work. The fictitious terrorist activities created by Whiting et al. for the VAST 2007 Contest [3] provide an excellent example of textual data that is needed to address that fundamental question. We will first describe the scenario provided by this data set in Section 2 followed by a discussion of our methods of analysis using NTF in Section 3. Details of how the data was processed are provided in Section 4, and our analysis and observations using NTF are summarized in Section 5. Concluding remarks and a brief discussion of future work are given in Section 6.

The VAST 2007 Contest [3] , a participation category of the IEEE VAST 2007 Symposium, was designed to promote the development of benchmark data sets and metrics for visual analytics as well as to establish a forum to advance visual analytics evaluation methods. The data set associated with the contest consisted of news stories and blog entries, along with background information and some multimedia materials (small maps and data tables). Participants in the contest were asked to investigate a major law enforcement/counter-terrorism scenario, form their hypotheses, and collect supporting evidence. Tasks that each team/entry was expected to address included i) processing the text and multimedia information to identify entities of interest (e.g., people, places and activities), ii) depicting this information visually using interactive visualizations and other tools to aid in the analysis of the information; iii) answering specific contest questions based on the analysis; and iv) producing a video demonstration of their system showing how they arrived at those answers. In our study, we demonstrate the use of nonnegative tensor factorization for (automated) knowledge discovery that would facilitate the first three of the tasks mentioned above.

The scenario depicted in the VAST 2007 Contest involved an emergency related to wildlife law enforcement occurring in the fall of 2004. Endangered species issues and ecoterrorism played key roles in the underlying terrorist activity. The heterogeneous data used to describe the situation included text, images, and some statistics. The activities of certain animal rights groups, such as the People for the Ethical Treatments of Animals (PETA) and Earth Liberation Front (ELF), are involved with the plot but are not the primary or interesting parties for investigation.

Entries (or answers) submitted to the VAST 2007 Contest were judged according to the correctness of the answers to the questions and the evidence provided. Points were awarded for correct answers and subtracted for incorrect answers. A more subjective assessment of the quality of the displays, interactions and support for the analytical process was also provided. Such an assessment was based on the visuals and description of the analytic process (including the video).

Participants were required to answer the questions (who, what and where) on an answer form, and for each response they were required to identify the most relevant documents or other materials from the data set as evidence. In a debriefing section of the answer form, contestants must describe the plot(s) and subplots(s) and how people, motivations, activities and locations are part of the plot, their relationship, and any uncertainties or information gaps that exist. This debriefing is judged on both accuracy and clarity. A short list of the questions each entry was required to answer are provided below: -(Who) Who are the players engaging in questionable activities in the plot(s)?

When appropriate, specify the organization they are associated with. -(When/What) What events occurred during this time frame that are most relevant to the plot(s)? -(Where) What locations are most relevant to the plot(s)?

In the next section, we describe how linear algebra/NTF and subsequent MATLAB software can be used as decision support tools to answer the questions above and thereby provide a more scalable and time-efficient solution to the scenario analysis needs of security experts.

Before presenting an overview of the algorithm and software needed to implement our nonnegative tensor factorization (NTF) model, we formally define all the important notations associated with tensor mathematics.

We use the following notation. We denote scalars by lowercase letters (a), vectors by boldface lowercase letters (a), matrices by boldface uppercase letters (A), and n-way arrays or tensors by boldface script letters (X).

We denote the ith element of vector a by a i . We denote the jth column of matrix A by a j and element (i, j) by a ij . We denote element (i, j, k) of a thirdorder tensor X by x ijk .

The symbol • denotes the outer (tensor) product, which is defined for two column vectors as

The symbol denotes the Khatri-Rao product (columnwise Kronecker) [4] of two matrices with the same number of columns. Given two matrices A, B ∈ R m×n , the Khatri-Rao product is defined as

where the symbol ⊗ denotes the Kronecker product. It is helpful in some cases to vectorize and/or matricize a tensor by simply rearranging the elements of X into a vector or matrix, respectively. Although different notations exist, we use the notation in [4] . For a three-way array X of size m × n × p, the notation X (m×np) represents a matrix of size m × np in which the index over n runs the fastest over the columns and p the slowest. Other permutations, such as X (p×nm) , are possible by changing the row index and the fast/slow column indices.

The norm of a tensor, X , is the same as the Frobenius norm of the matricized array, i.e.,

In 1970, Harshman [2] proposed a tensor decomposition called Parallel Factors (PARAFAC), which is also known as Canonical Decomposition (CANDE-COMP), as developed by Carroll and Chang [5] in the same year. For a comprehensive review of tensor decompositions and their applications, see Kolda and Bader [6] . Given a third-order tensor X of size m × n × p and a desired approximation rank r, the PARAFAC model approximates X as a sum of r rank-1 tensors formed by the outer product of three vectors. It is convenient to group each set of r vectors together as matrices A, B, and C, which we call factor matrices. We may express the PARAFAC model in a number of ways using different notations. Here are three equivalent representations for a third-order tensor X:

The PARAFAC model may be extended to N -way data. Without loss of generality, we normalize all columns of the factor matrices to unit length and store the accumulated weight in a vector λ:

We also impose a constraint on the final solution such that λ 1 ≥ λ 2 ≥ · · · ≥ λ r . Because this normalization and reordering may be performed in a postprocessing step, we describe an algorithm for fitting the model without λ.

A common approach to fitting the PARAFAC model to data is an alternating least squares (ALS) algorithm [2, 7, 8] , where one cycles over all factor matrices and performs a least-squares update for one factor matrix while holding all the others constant.

Our analysis deals with a variant of the PARAFAC model that constrains the factor matrices to be nonnegative. Often this is called nonnegative tensor factorization (NTF) or sometimes just nonnegative PARAFAC.

The goal of NTF is to find the best fitting nonnegative matrices A ∈ R m×r + , B ∈ R n×r + , and C ∈ R p×r + in the PARAFAC model, Equation (1), that fit the data in X. This corresponds to the following minimization problem:

To compute the NTF, we solve a series of nonnegative matrix factorization (NMF) subproblems min A∈R m×r

Each of these matrix systems is treated as an NMF problem and solved in succession using the multiplicative update introduced by Lee and Seung [9] . This procedure was first used by Welling and Weber [10] . We have adapted their algorithm and incorporated the addition of for stability as was done in [11, 12] . For example, to solve for A we compute

Here, is a small number like 10 −9 that adds stability to the calculation and guards against introducing a negative number from numerical underflow. An algorithm for NTF was written in MATLAB, using sparse extensions of the Tensor Toolbox [13, 14] . The approximation rank r > 0 of NTF corresponds to the number of associations in the corpus that we are interested in finding. Each triad {a j , b j , c j }, for j = 1, . . . , r, defines scores for a set of terms, entities, and time for a particular association in the corpus of news stories, as shown in Figure 1 . The value λ j (after normalization) indicates the weight of the association for triad j. Each column of C records the strength of each association over time. 

In testing our NTF algorithm from Section 3.2, we processed 1, 455 text files corresponding to news stories, email messages or blog posts from the VAST 2007 Contest data set [3] . In addition to the plain text versions of these files, the data includes corresponding tagged files. The five SGML-based SGML tags are date, person, location, organization and money. A sample extract from one of the tagged news stories is provided below:

Activists performed a skit of the <ENAMEX TYPE="PERSON">Executive Yuan</ENAMEX> failing to supervise its subordinates to prevent avian flu. "The hygiene required when butchering chickens could not stressed more at this point, where we are exposed to the potential spread of avian flu," said <ENAMEX TYPE="PERSON">Chen Yu-min</ENAMEX> (??????), <ENAMEX TYPE="ORGANIZATION"><ENAMEX TYPE="ORGANIZATION">director of the <ENAMEX TYPE="ORGANIZATION"> Environment and Animal Society</ENAMEX></ENAMEX> of Taiwan </ENAMEX>.

Each file within this data set has a unique time stamp associated with it. The date tag was used to help extract the time stamp from each file, but was otherwise ignored. The four remaining tags all represent entities of interest and were used to create a single comma-delimited file that later served as input for the NTF model. Two separate dictionary files 1 were used in the process, one for terms and one for entities. The dictionaries allowed the words to be replaced by numeric IDs. A Python script using the built-in SGML Parser class and the dictionary files described above was used to generate a comma-delimited file of this format:

Date, Entity Type, Entity ID, Term ID, Term Frequency Count 2 A sample extract from the resulting file is given below: 

In order to answer the questions posed in Section 2.2 for scenario analysis, we considered term-by-entity associations in the news stories over monthly time intervals, which corresponds to a sparse array X of size 12, 121 × 7141 × 15 with 1,142,077 nonzeros. We scaled the nonzero term counts according to a weighted frequency:

where l ijk is the local weight for term i co-occurring with entity j in a news story during month k, t i is the global weight for term i, e j is the global weight for entity j, and w j is a time normalization factor. Let f ijk be the frequency (raw count) of term i appearing in the same news story as entity j during month k. The specific formulas of the scaled tensor are listed here:

The global weights are adapted from the well-known log-entropy weighting scheme [16] used on term-by-document matrices. The log local weight scales the raw term frequencies to diminish the importance of high frequency terms. The entropy global weight helps to discriminate important terms and entities from frequently occurring terms and entities. The time normalization helps to correct imbalances in the number news stories over each time period.

For the VAST 2007 Contest data (see Section 2), we computed a 25-component (r = 25) nonnegative decomposition of the term-entity-month array X. One iteration took about 25 seconds on a 3GHz Pentium Xeon desktop computer with 2GB of RAM. Most runs required about 20 iterations to satisfy a tolerance of 10 −4 in the relative change of fit. We chose the best minimizer from among ten runs starting from random initializations. The relative norm of the difference for the one chosen was 0.8881.

Processing the data with NTF resulted in twenty-five groups consisting of interrelated entities and terms. Fifteen entities of various types and thirty-five terms describe each group. The following excerpt from the NTF output demonstrates the layout of Group 20 (Environmental and animal rights issues in China), see Table 1 : For each of the twenty-five groups, a relevance score was computed for every story in the data set. Entities were given twice the weight of terms for the purpose of this computation. In order to improve the chances of smaller-length documents being judged relevant, Lnu scaling [16] was applied. If Y = [y ij ] is the term-by-document matrix where by each element y ij defines the relevancy of term (or entity) i to document (or story) j, then we define

.

The slope s is set to 0.2; p,the average number of unique words in a document within the data set was determined to be 158; and u, the number of unique words in document or story j was calculated a priori. The function χ is defined as

so that the overall relevance score R for each document (story) j using the terms and entities (indexed by i) of the given tensor group is simply defined by

For relevant stories, in addition to the overall score, a score was also computed for every individual sentence. For each story, the three top-scoring sentences were then printed out as a brief summary. This step significantly improved the analysis process by decreasing the amount of time necessary to determine a story's subject and whether it was in fact relevant to a given tensor group. In addition to the top three sentences, each summary also contains the overall story score and a reference to the file containing the full story. The latter was used to extract events and activities from the story. The events and activities were listed in a spreadsheet in chronological order. The scoring and summarization of two different news stories against the Group 20 (see Table 1 ) is illustrated in Figure 2 . Chinese government's actions related to environmental/animal rights 6

Disease: potential dangers of genetic modification related to food-borne illnesses; animal-borne diseases and their transfer to humans; effects of diet on health 9

Exotic animal and drug trafficking 10 Meat alternatives and their benefits for consumer health; animal treatment standards 11 Spread of Animal diseases to humans 13 Animal-rights groups responsible for attacks on pet stores and supermarkets;

People An empirically determined threshold (0.0088) on R was used to create a set of relevant stories. In some cases, no further processing was necessary. In other cases, where many clearly irrelevant documents were included in the set, a slightly higher secondary threshold value was applied. Table 2 shows the reduction in news stories associated with a given NTF-generated Group as the threshold on R is increased.

The extraction of events and activities was a somewhat subjective process, involving occasional ambiguity between events and activities. In general, a discrete, one-time action was typically classified as an event, while a continuous action was classified as an activity.

Recalling the nature of the scenario defining the VAST 2007 Contest data set (see Section 2.1), we note that Groups 6, 9, 15 , and 18 from Table 1 expose both events and activities of two important terrorist plots: cocaine trafficking via tropical fish trade and spread of monkeypox by infected chinchillas. Examining the peak nonnegative components of each column of the C factor of our NTF model (see Section 3.2) we can sequence the Groups of Table 1 through time and show (see Figure 3 ) the progression of the events and activities through the timespan of the entire data set. In Figure 4 we highlight the C-factor spikes of the groups associated with the dominant terrorist plots of the VAST 2007 Contest [3] . Tables 3 and 4 in the Appendix list the dominant terms and entities  of Groups 9 and 15 from Table 1 

We have demonstrated how a nonnegative tensor factorization (NTF) model can be used to generate useful term-by-entity associations from textual-based data (news stories, blog entries, etc.). Using the VAST 2007 Contest data set and its fictitious terrorist plots, we have generated interpretable tensor factors that can be used to define and track events and activities critical to those plots. These preliminary results are quite promising and suggest that NTF may become an effective tool for the emerging discipline of visual analytics. Further research in the effects of scaling/weighting of the attributes defining the initial tensor (or data array) is needed along with the incorporation of the underlying NTF model into a visual interface more suitable for human steering and querying. "I urge your readers to not use Global Ways...low survival rates, poor packaging, poor responsiveness. We will never use them again... the packaging on my last shipment of Tigrinus Catfish was abominable. Shipping bags were covered in some noxious substance that caused our handlers hands to go numb and eventually need emergency medical treatment. You can imagine the condition of the fish themselves. Less than 10% survived their trip." 2 "In response to these comments, Global Ways spokesman blamed an inexperienced packer in South America for problems in a very few shipments. We are working with these customers and have guaranteed them satisfaction. A problem occurred in a very small number of the 5000 shipments we handled last quarter." 3 "The Fish and Wildlife Service (FWS) is warning ornamental fish merchants not to handle any shipments of tropical catfish from South America that may have come into the United States through Miami, as the packaging bags may be contaminated with a substance that can cause serious illness. The source of the toxin is unknown, but it is not directly related to the imported fish themselves."

Discussion tracking in Enron email using PARAFAC

Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multi-modal factor analysis. UCLA working papers in phonetics

Multi-way analysis: applications in the chemical sciences

Analysis of individual differences in multidimensional scaling via an N-way generalization of 'Eckart-Young' decomposition

Tensor decompositions and applications. SIAM Review

Recent developments in CANDE-COMP/PARAFAC algorithms: A critical review

A comparison of algorithms for fitting the PARAFAC model

Learning the parts of objects by non-negative matrix factorization

Positive tensor factorization

Email surveillance using nonnegative matrix factorization

Email Surveillance Using Nonnegative Matrix Factorization

Efficient MATLAB computations with sparse and factored tensors

Matlab tensor toolbox

GTP (General Text Parser) Software for Text Mining

Understanding Search Engines: Mathematical Modeling and Text Retrieval

The dominant terms and entities (based on nonnegative components of factors A and B from Equation (1)) for Groups 9 and 15 (see Table 1 ) along with the top-scoring news stories (with summaries) using thresholds on the R measure defined by Equation (3). "The disease, which is not usually fatal to humans, has previously been discovered in Wisconsin, Illinois and Indiana in mid-May. Most of those affected had contact with prairie dogs -a type of wild rodent which lives in burrows on the western US plains, and sold as pets." 2 "Seven people in the Los Angeles area have seriously ill from monkeypox -a rare smallpox-like disease, US health officials say. More than 50 possible cases are being investigated." 3 "But as the monkeypox outbreak shows, there are risks in owning exotic pets. Now that two people have died from a new, more potent strain, the mishmash of regulations leaves loopholes that have allowed sick animals to travel, health officials say. Chinchillas, believed to be the cause of the recent outbreak and on the endangered species lists, are part of that concern." 4 "Cesar Gil, an animal rights activist and chinchilla breeder in the LA area, is currently being sought in connection with the monkeypox outbreak. He has also been a suspect in radical animal rights activities.Officials believe Gil has fled the country."