Supervised learning on heterogeneous, attributed entities interacting over time

Amine Laghaout

2020-07-22

Abstract

Most physical or social phenomena can be represented by ontologies where the constituent entities interact in various ways with each other and with their environment. Furthermore, those entities are likely heterogeneous and attributed with features that evolve dynamically in time as a response to their successive interactions. In order to apply machine learning on such entities, e.g., for classification purposes, one therefore needs to integrate the interactions into the feature engineering in a systematic way. This proposal shows how, to this end, the current state of graph machine learning remains inadequate and needs to be augmented with a comprehensive feature engineering paradigm in space and time.

1 Introduction

In the industry, and even in the scientific literature, supervised learning has overwhelmingly been applied to data sets where the training examples are (i) independent of each other, and (ii) homogeneous, i.e., the subjects of the classification or regression are instances of the same entity type such that each column in the design matrix has a consistent interpretation and format across all rows. In other words, the training examples used for the estimation refer to entities that are (i) non-interacting and of (ii) the same type. This assumption, or rather approximation, implies that the entities are decoupled from their environment and that their design matrix is self-contained. In the real world, however, one cannot make such an assumption since any given entity to be estimated is likely part of a broader ontology which intertwines its properties with those of other entities. One can attempt to handcraft the relationships of the ontology into the design matrix of each entity type, but such a feature engineering exercise is demanding in human expertise, prone to the introduction of biases, and more importantly, cannot generalize to arbitrary problem domains.

Significant progress in overcoming this limitation has been afforded by the recent rise to prominence of graph machine learning (GML) [Wu et al., 2020, Zhang et al., 2018b]. Applications of GML to evidently graph-based ontologies such as social networks [Al-Eidi et al., 2020] or cybersecurity [Liu et al., 2018, Sun et al., 2019] have quickly flourished, and developer tools such as StellarGraph [Data61, 2018] are reaching maturity. Despite all these advances, a thoroughly comprehensive and generalizable approach has yet to be devised for the estimation of heterogeneous, attributed nodes that interact in time. This latter scenario is indeed the most ubiquitous in the real world since, at the most fundamental level, all phenomena can be reduced to physical interactions between indivisible entities. Some attempts in this direction, such as spatio-temporal graphs [Zhang et al., 2018a], do take into account the space-time aspect of the problem but cannot accommodate the heterogeneity of the graph. Conversely, the few algorithms that can handle heterogeneity, such as GraphSAGE [Hamilton et al., 2017] and its derivative HinSAGE, cannot model the time-evolution of both the node attributes and the edges. More crucially, the main shortcoming of GML lies in the fact that interactions between entities are constrained to bipartite relationships.
This makes it inapplicable to problems where the interaction can, in principle, behave as a stochastic black box involving an arbitrary number of vertices with no particular directionality.

This proposal aims to resolve the above problem by outlining a reference architecture for the estimation of heterogeneous, attributed entities that interact with their environment, and with each other, over time. In order to accommodate arbitrary interactions, it shall do away with GML's attempts to force-fit graph topologies onto the ontology of the problem domain. Instead, it shall let interactions be modeled as learnable modules that can be recycled for any entity instances they involve. Because a parallel can be drawn between interactions and the notion of hyperedges, one can describe the proposal herein as a revised, tentative blueprint for hypergraph machine learning [Zhou et al., 2006, Jiang et al., 2019]. The building blocks of the problem are formally defined in §2; §3 presents the architecture and spells out the learning problem in terms of the building blocks; and §4 goes over the open questions and blind spots that may require further investigation.

An intuition for the power and ubiquity of the present proposal is best elicited by a real-world use case. Due to its global impact and media exposure, a relatable application is the modeling of the spread of a pandemic, such as COVID-19. The goal of the model is to determine whether an entity is infected by, or acts as a vector for, the disease. Here, entities could be biological (e.g., human beings, pets, wild animals) or inanimate (e.g., objects such as doorknobs or handrails, or venues such as markets or fitness clubs). One can see that each of these entities has attributes, i.e., features, that are intrinsic to it. These can be the age or genetic makeup for a biological entity, or the capacity or level of sanitation for a physical venue. Note how these intrinsic features are most often dynamic and hence require a temporal treatment. Extrinsic features, on the other hand, arise from the interactions that the entities have had with one another, conditioned on various environmental parameters. As entities interact, their extrinsic features, which are also dynamic, are to be updated so as to reflect the changing likelihood that any given entity carries the virus.

2 Building blocks

2.1 Entities

Let $\varepsilon_k^{(j)}$ be the $k$-th instance of an entity of type $j$. The set $\mathcal{E}^{(j)}$ of entities of type $j$ adds up to the overall, heterogeneous set $\mathcal{E}$ of all entities, i.e.,

$$\mathcal{E} = \bigcup_{j=1}^{E} \mathcal{E}^{(j)}, \tag{1}$$

where $E$ is the number of entity types. Each entity $\varepsilon_k^{(j)}$ is represented algebraically by a vector of features $\vec{d}_k^{(j)}$ whose interpretation and dimensionality are fixed by the entity type (cf. §2.3).

2.2 Interactions

Entities are not static in the sense that they interact with their environment, thereby updating their features upon those interactions. The most trivial interaction involves a single entity $\varepsilon$ whose labels or attribute features are re-written by the environment independently of other entities. A more interesting case happens when the environment also involves one or more other entities $\varepsilon' \neq \varepsilon$ of potentially different types. These other entities $\varepsilon'$ both influence, and are in turn influenced by, the presence of $\varepsilon$. Such interactions thus induce correlations (i.e., dependencies) or perturbations (i.e., noise) among the features of the entities at play.
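Before these notions are formalized below, a minimal sketch shows how such entities and time-stamped interactions could be carried as plain data structures in the pandemic use case. All class and field names are illustrative assumptions rather than part of the proposal.

```python
# A minimal sketch of the building blocks of Sec. 2, assuming the pandemic
# use case. All class and field names are illustrative, not prescribed by
# the proposal.
from dataclasses import dataclass, field


@dataclass
class Entity:
    """The k-th instance of an entity of type j, with its data vector
    split into target (belief), intrinsic, and extrinsic features."""
    entity_type: str          # j, e.g. "person" or "venue"
    index: int                # k
    targets: list[float]      # b_k^(j): seed beliefs / labels
    intrinsic: list[float]    # f_k^(j)(t): latest intrinsic update
    extrinsic: dict[str, list[float]] = field(default_factory=dict)
    # extrinsic maps each interaction type i to the latent chi_k^(j,i)(t).


@dataclass
class Interaction:
    """The l-th instance of an interaction of type i, cf. Eq. (2)."""
    interaction_type: str            # i, e.g. "handshake"
    participants: dict[str, Entity]  # xi_l^(i): role -> entity instance
    environment: list[float]         # tau_l^(i): environmental parameters
    timestamp: float                 # t


# Example: two people shaking hands at time t = 3.0.
alice = Entity("person", 0, targets=[0.1], intrinsic=[34.0])
bob = Entity("person", 1, targets=[0.7], intrinsic=[61.0])
handshake = Interaction(
    "handshake", {"initiator": alice, "recipient": bob},
    environment=[0.2],  # e.g. local infection rate
    timestamp=3.0,
)
```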
Let us now formalize these interactions by denoting the $l$-th instance of an interaction of type $i$ as a function of the set $\xi_l^{(i)} \subset \mathcal{E}$ of entities involved (strictly speaking, $\xi_l^{(i)}$ is a dictionary, or associative array, of key-value pairs where the key specifies the role in the interaction and the value specifies the entity instance), the vector $\vec{\tau}_l^{(i)}$ of environmental parameters that modulate the interaction, and the timestamp $t$ of when the interaction occurred:

$$\chi_l^{(i)} = \chi^{(i)}\left(\xi_l^{(i)},\, \vec{\tau}_l^{(i)},\, t\right). \tag{2}$$

Note that a given interaction type $i$ predetermines the structure of both $\xi_l^{(i)}$ and $\vec{\tau}_l^{(i)}$. One can think of an interaction type as a set of co-occurring relationships within the broader ontology of the problem domain. An interaction is instantiated, i.e., subscripted with some index $l$ as in Eq. (2), once it is time-stamped and associated with a particular set of entity instances $\xi_l^{(i)}$. Just as for entities, interactions can be grouped into sets according to their types, thereby adding up to the overall set $\mathcal{X}$ of $I$ possible interaction types:

$$\mathcal{X} = \bigcup_{i=1}^{I} \mathcal{X}^{(i)}. \tag{3}$$

Notice how $\chi_l^{(i)}$ is the multipartite generalization of the attributed bipartite edge $\xi = \{\varepsilon, \varepsilon'\}$ commonly known from "run-of-the-mill" graph theory. The definition of an interaction in Eq. (2) is thus more akin to a hyperedge that spans $\xi_l^{(i)}$, is attributed with $\vec{\tau}_l^{(i)}$, and is time-stamped at $t$.

2.3 Data vectors

Let the knowledge about an entity $\varepsilon_k^{(j)}$ at time $t$ be encoded by its data vector (or, more generally, tensor) $\vec{d}_k^{(j)}(t)$, which is the concatenation of three vectors, namely the target features, the intrinsic features, and the extrinsic features. (Concatenation shall be symbolized mathematically as the direct sum operator $\oplus$.) Notice that the extrinsic features are themselves a concatenation of as many interaction types as an entity of type $j$ is involved in. This is developed further in §2.3.3.

2.3.1 Target features

The target features $\vec{b}_k^{(j)}$ are the scores or labels which drive supervised learning. (One shall refer to scores and labels interchangeably, thus leaving the freedom to the particular use case to decide whether it is a matter of regression or classification, respectively.) One can assume that one begins with a body of labeled entities for which the targets are well-defined and serve as ground truths or, more realistically, as seed beliefs, hence the $b$-notation for beliefs, for the estimation of the entities.

2.3.2 Intrinsic features

The intrinsic features $\hat{f}_k^{(j)}$ are any features which can be completely decoupled from the presence of other entities $\varepsilon' \neq \varepsilon_k^{(j)}$. In its rawest form, $\hat{f}_k^{(j)}$ is an aggregation through time of all the sequential updates $\vec{f}_k^{(j)}$; $\hat{f}_k^{(j)}$ can instead be an abstract encoding that collapses the history of intrinsic feature updates onto a fixed, lower-dimensional space. This time-collapse, or aggregation, operation $M_f^{(j)}$ can be written as

$$\hat{f}_k^{(j)}(t) = M_f^{(j)}\left(\left\{\left(\vec{f}_k^{(j)}(t'),\, t'\right) \,\middle|\, t - \Delta T \le t' \le t\right\}\right), \tag{5}$$

where $\Delta T$ is the lookback period from the current time $t$ and should ideally span all the way back to the creation time of $\varepsilon_k^{(j)}$ (because of implementational constraints, however, only the most recent interval of history can be stored in memory, so $\Delta T$ will most likely be a finite time window), and where the timestamp $t'$ is needed to serve as an attention parameter (more recent events typically deserve more attention than older ones). One can thus re-write Eq. (5) more compactly as

$$\hat{f}_k^{(j)}(t) = M_f^{(j)}\left(\overleftarrow{f}_k^{(j)}(t)\right), \tag{6}$$

where

$$\overleftarrow{v}(t) \equiv \left\{\left(\vec{v}(t'),\, t'\right) \,\middle|\, t - \Delta T \le t' \le t\right\} \tag{7}$$

represents the time-stamped history of any variable $\vec{v}$ from $t - \Delta T$ to $t$. In terms of deep learning architecture, $M_f^{(j)}$ can be implemented by a recurrent neural network or a transformer [Vaswani et al., 2017]. However, a (pseudo-)Markovian simplification, denoted $\hat{M}_f^{(j)}$, could make use of only the current update and the previous aggregation, i.e.,

$$\hat{f}_k^{(j)}(t) = \hat{M}_f^{(j)}\left(\vec{f}_k^{(j)}(t),\, \hat{f}_k^{(j)}(t-1)\right). \tag{8}$$

(The viability of this simplification is subject to experimentation.)
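To make Eq. (8) concrete, the sketch below realizes the Markovian aggregator $\hat{M}_f^{(j)}$ as a gated recurrent cell whose hidden state plays the role of the running aggregation $\hat{f}_k^{(j)}$. It assumes PyTorch; the dimensions are illustrative, and nothing here is prescribed by the architecture itself.

```python
# A minimal sketch of the Markovian time-aggregation of Eq. (8), assuming
# PyTorch. One aggregator is learned per entity type j and shared across
# all instances k of that type; dimensions are illustrative.
import torch
from torch import nn

feature_dim, latent_dim = 8, 16

# M_f_hat: a gated recurrent cell whose hidden state is the aggregation.
M_f_hat = nn.GRUCell(input_size=feature_dim, hidden_size=latent_dim)

f_update = torch.randn(1, feature_dim)   # current update f_k^(j)(t)
f_hat_prev = torch.zeros(1, latent_dim)  # previous aggregation at t-1

# Eq. (8): fold the current update into the previous aggregation.
f_hat = M_f_hat(f_update, f_hat_prev)

# The full, non-Markovian M_f^(j) of Eq. (6) would instead consume the
# whole time-stamped history, e.g., with a transformer encoder.
```

The timestamps of Eq. (5) could be appended to each update so that the cell can learn the attention decay over the history by itself.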
2.3.3 Extrinsic features

The extrinsic features $\hat{\chi}_k^{(j)}$ are those that depend on the data vectors of other entities in the context of an interaction. $\hat{\chi}_k^{(j,i)}(t)$ is thus a latent representation which summarizes the sequence of interactions of type $i$ which $\varepsilon_k^{(j)}$ has been involved in up to and including time step $t$. In a manner similar to Eqs. (5, 6, 7), this time-aggregation can be expressed by

$$\hat{\chi}_k^{(j,i)}(t) = M_\chi^{(j,i)}\left(\overleftarrow{\chi}_k^{(j,i)}(t)\right), \tag{9}$$

where $\chi_k^{(j,i)}(t')$ is the latent representation imparted to $\varepsilon_k^{(j)}$ by an interaction of type $i$ which took place at time $t'$. The process by which an interaction $\chi_l^{(i)}$ generates a latent representation $\chi_k^{(j,i)}$ of itself to each of its participating entities $\varepsilon_k^{(j)}$ is exemplified in Fig. 1 for the tri-partite case. This process shall be referred to as space-aggregation in the sense that, unlike $M_\chi^{(j,i)}$, which aggregates histories through time, $\chi_l^{(i)}$ aggregates the features of neighbouring entities as per the topology of the hypergraph that links them (in space). One can therefore consider $\chi_l^{(i)}$ as a black box for any conceivable technique from graph machine learning [Wu et al., 2020] or even traditional belief propagation [Yedidia et al., 2003]. Formally, space-aggregation at time step $t$ shall be expressed as the mapping

$$\chi^{(i)}: \left(\left\{\vec{d}_{k'}^{(j')}(t-1)\,\middle|\,\varepsilon_{k'}^{(j')} \in \xi_l^{(i)}\right\},\, \vec{\tau}_l^{(i)},\, t\right) \mapsto \left\{\chi_{k'}^{(j',i)}(t)\,\middle|\,\varepsilon_{k'}^{(j')} \in \xi_l^{(i)}\right\}, \tag{10}$$

the overall extrinsic feature vector of §2.3 being the concatenation over all interaction types,

$$\hat{\chi}_k^{(j)}(t) = \bigoplus_{i=1}^{I} \hat{\chi}_k^{(j,i)}(t). \tag{11}$$

Finally, going back to the time-aggregation $M_\chi^{(j,i)}$ of extrinsic features, one can consider the Markovian assumption

$$\hat{\chi}_k^{(j,i)}(t) = \hat{M}_\chi^{(j,i)}\left(\chi_k^{(j,i)}(t),\, \hat{\chi}_k^{(j,i)}(t-1)\right). \tag{12}$$

3 Supervised learning

3.1 Circuit diagram of the estimation

Figure 2 brings together the building blocks that were presented above. It depicts, on a discretized timeline, the flow of information that culminates at time $t$ with the training on, or the serving of, a target belief for entity $\varepsilon_k^{(j)}$. Figure 3 shows the same scenario as Fig. 2 under the Markovian assumptions of Eqs. (8) and (12). (For the sake of simplicity, only one interaction instance is shown.) As indicated by the small diagonal arrows, the environment can at any time step do any one of three operations on the entity, namely

• update its target belief,
• update its intrinsic features, or
• involve it in one or more interactions, thereby updating its extrinsic features.

The pseudocode for systematically processing these three potential updates in an online fashion is shown in Alg. 1; a sketch of such a loop follows below.
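The following framework-free sketch illustrates such an online dispatch under the Markovian assumptions of Eqs. (8) and (12). The event schema and the stand-in aggregators (simple exponential moving averages) are illustrative assumptions, not the proposal's prescription.

```python
# A minimal, framework-free sketch of an online update loop in the spirit
# of Alg. 1, under the Markovian assumptions of Eqs. (8) and (12).
state = {
    "b": None,       # target belief b_k^(j)
    "f_hat": 0.0,    # aggregated intrinsic features, Eq. (8)
    "chi_hat": {},   # aggregated extrinsic features per type i, Eq. (12)
}

# Stand-in Markovian aggregators: exponential moving averages.
def m_f_hat(update, prev):
    return 0.5 * update + 0.5 * prev

def m_chi_hat(latent, prev):
    return 0.5 * latent + 0.5 * (prev if prev is not None else 0.0)

events = [
    {"kind": "target", "t": 1, "payload": 0.7},
    {"kind": "intrinsic", "t": 2, "payload": 1.3},
    {"kind": "interaction", "t": 3, "type": "handshake", "latent": 0.9},
]

for event in sorted(events, key=lambda e: e["t"]):
    if event["kind"] == "target":
        state["b"] = event["payload"]  # the environment overwrites b
    elif event["kind"] == "intrinsic":
        state["f_hat"] = m_f_hat(event["payload"], state["f_hat"])
    elif event["kind"] == "interaction":
        i = event["type"]
        # The latent would come from the space-aggregation of Eq. (10);
        # here it is precomputed for brevity.
        state["chi_hat"][i] = m_chi_hat(event["latent"],
                                        state["chi_hat"].get(i))
    # Update kinds missing at a given time step are simply carried over.
```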
3.2 The learning problem

As stated in the introduction, the aim of this article is to outline a reference architecture for supervised learning on the entities. Taking entity $\varepsilon_k^{(j)}$ as an example, the goal is to learn a function $M^{(j)}$ which maps its intrinsic and extrinsic features onto its target belief at time $t$:

$$M^{(j)}: \left(\hat{f}_k^{(j)}(t),\, \hat{\chi}_k^{(j)}(t)\right) \mapsto \vec{b}_k^{(j)}(t). \tag{13}$$

In practice, $M^{(j)}$ can only achieve an approximation $\vec{\beta}_k^{(j)}$ of the actual target $\vec{b}_k^{(j)}$,

$$\vec{\beta}_k^{(j)}(t) = M^{(j)}\left(\hat{f}_k^{(j)}(t),\, \hat{\chi}_k^{(j)}(t)\right), \tag{14}$$

such that the learning problem consists of the minimization

$$\min_{M^{(j)}} \sum_{k=1}^{K^{(j)}} D\left(\vec{\beta}_k^{(j)}(t),\, \vec{b}_k^{(j)}(t)\right), \tag{15}$$

where $K^{(j)}$ is the number of entities of type $j$ and $D$ is an arbitrary distance metric which can double as a loss function. A full expansion of $M^{(j)}$ in terms of the aggregator operations in time, Eqs. (6, 9), and space, Eq. (10), yields the estimated belief at time $t$

$$\vec{\beta}_k^{(j)}(t) = M^{(j)}\left(\hat{f}_k^{(j)}(t),\, \bigoplus_{i=1}^{I}\hat{\chi}_k^{(j,i)}(t)\right) \tag{16}$$

$$= M^{(j)}\Biggl(\underbrace{M_f^{(j)}\left(\overleftarrow{f}_k^{(j)}(t)\right)}_{\text{time-aggregation, Eq. (6)}},\ \bigoplus_{i=1}^{I} \underbrace{M_\chi^{(j,i)}\left(\overleftarrow{\chi}_k^{(j,i)}(t)\right)}_{\text{time-aggregation, Eq. (9)}}\Biggr), \tag{17}$$

where the latents $\chi_k^{(j,i)}(t')$ entering the histories are produced by the space-aggregation of Eq. (10). Or, if one were to apply the Markovian assumption on Eq. (17),

$$\vec{\beta}_k^{(j)}(t) = M^{(j)}\left(\hat{M}_f^{(j)}\left(\vec{f}_k^{(j)}(t),\, \hat{f}_k^{(j)}(t-1)\right),\ \bigoplus_{i=1}^{I} \hat{M}_\chi^{(j,i)}\left(\chi_k^{(j,i)}(t),\, \hat{\chi}_k^{(j,i)}(t-1)\right)\right). \tag{18}$$

One can thus see that the optimization problem of Eq. (15) is not limited to the parameters and hyper-parameters of $M^{(j)}$ but also extends to those of the aggregators $M_f^{(j)}$, $M_\chi^{(j,i)}$, and $\chi^{(i)}$, such that the global optimum is given by

$$\left(M^{(j)},\, M_f^{(j)},\, \{M_\chi^{(j,i)}\}_i,\, \{\chi^{(i)}\}_i\right)^{\star} = \arg\min \sum_{k=1}^{K^{(j)}} D\left(\vec{\beta}_k^{(j)}(t),\, \vec{b}_k^{(j)}(t)\right). \tag{19}$$

Notice how, unlike most of the literature on graph machine learning, the data aggregators are not dependent on any particular instances of entities and interactions but only on their types $j$ and $i$, respectively. A minimal sketch of this joint, end-to-end optimization follows below.
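As an illustration of Eq. (19), the sketch below wires toy stand-ins for the four learnable components, namely the estimator $M^{(j)}$, the time-aggregators $\hat{M}_f^{(j)}$ and $\hat{M}_\chi^{(j,i)}$, and the space-aggregator $\chi^{(i)}$, into a single optimizer, so that the gradients of the loss $D$ reach all of them at once. It assumes PyTorch; every module, dimension, and name is an illustrative assumption, not the proposal's prescribed implementation.

```python
# A minimal sketch of the joint optimization of Eq. (19), assuming PyTorch
# and a toy bipartite interaction. M, M_f, M_chi, and chi_head stand in
# for M^(j), M_f_hat^(j), M_chi_hat^(j,i), and chi^(i), respectively.
import torch
from torch import nn

f_dim, chi_dim, latent = 8, 16, 16

M_f = nn.GRUCell(f_dim, latent)           # time-aggregation, Eq. (8)
M_chi = nn.GRUCell(chi_dim, latent)       # time-aggregation, Eq. (12)
chi_head = nn.Linear(2 * f_dim, chi_dim)  # space-aggregation, Eq. (10)
M = nn.Linear(2 * latent, 1)              # estimator M^(j), Eq. (14)

params = [*M_f.parameters(), *M_chi.parameters(),
          *chi_head.parameters(), *M.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()          # the distance metric D

# One toy training step for a single labeled entity.
f_update = torch.randn(1, f_dim)          # intrinsic update f_k^(j)(t)
neighbour = torch.randn(1, f_dim)         # an interaction partner
b = torch.ones(1, 1)                      # ground-truth belief b_k^(j)

f_hat = M_f(f_update, torch.zeros(1, latent))
chi_latent = chi_head(torch.cat([f_update, neighbour], dim=-1))
chi_hat = M_chi(chi_latent, torch.zeros(1, latent))
beta = M(torch.cat([f_hat, chi_hat], dim=-1))  # estimated belief, Eq. (18)

loss = loss_fn(beta, b)                   # Eq. (15)
optimizer.zero_grad()
loss.backward()                           # gradients flow through all
optimizer.step()                          # four aggregators at once
```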
Some remarks are in order regarding the design of Fig. 2, especially about the aggregation processes. First, there is no time-aggregation for the targets. This is motivated by the fact that the aggregators should focus on inferring the current target $\vec{b}_k^{(j)}(t)$ instead of trying to reproduce its history. Except for feeding in the latest target $\vec{b}_k^{(j)}(t-1)$ to neighbouring entities via the space-aggregator, the targets of any given entity should not leak into its feature space, to avoid the risk of overfitting. Another reason is that, unlike pure time-series problems (e.g., stock prediction) where the target's history is itself the main feature, the present problem much more heavily entangles the entities with their environment such that, at any time step $t$, a target can be abruptly overwritten, in complete disregard for any historical continuity. Note that the target approximations $\vec{\beta}_k^{(j)}$ are not reused anywhere in the aggregation either, so as to ensure that no error gets inadvertently amplified by a feedback loop.

Note that the above choices to exclude the targets from (most) aggregations are not founded on an absolute rationale. They are merely precautions against overfitting and error amplification. One may very well devise regularization mechanisms that will alleviate these concerns and efficiently incorporate target histories as meaningful features in themselves.

A final observation is that not all three updates, i.e., of targets, intrinsic features, and extrinsic features, systematically happen at every time step. Whenever an update is "missing", one shall not replace it with a null value, but simply carry over the last update together with its timestamp. Here again, this is not an absolute requirement, as one could adopt alternative approaches where missing values are represented by dedicated values. Such design choices are among those that require experimentation with a particular use case (cf. §4).

4 Open questions

In the broader context of ontologies, be they social, man-made, or natural, applications of machine learning have mostly been cross-sectional in that they deal with the estimation of a specific entity type only. This work attempts to further the reach of machine learning to arbitrary ontologies where the entities are attributed, dynamic, heterogeneous, and, more importantly, interacting with each other in ways that do not necessarily fit traditional graph topologies made up of bipartite edges. Estimation, i.e., classification or regression, is therefore not constrained to any given entity type anymore but can be applied across all heterogeneous entities. Examples of such ontologies are illustrated in Fig. 4 (e.g., interactions where several people share the same object, an animal is sold at a market, or two people shake hands).

The reference architecture presented herein is intentionally kept as high-level as possible, thereby allowing for the modular implementation of the four aggregators in a way that is agnostic as to their inner components, be they neural networks or any other technique. Given any particular problem domain, further investigation is therefore needed as to the particular design of the aggregators, in particular when it comes to such issues as

• the initialization of the latent representations $\hat{f}_k^{(j)}$ and $\hat{\chi}_k^{(j,i)}$.

Acknowledgements

This work was supported by the Innovation Fund Denmark. The author would like to thank Egon Kidmose for valuable feedback on the manuscript.

References

[Al-Eidi et al., 2020] Time-ordered bipartite graph for spatiotemporal social network analysis.
[Data61, 2018] StellarGraph machine learning library. CSIRO's Data61.
[Hamilton et al., 2017] Inductive representation learning on large graphs.
[Jiang et al., 2019] Dynamic hypergraph neural networks.
[Liu et al., 2018] Heterogeneous graph neural networks for malicious account detection.
[Sun et al., 2019] HinDom: A robust malicious domain detection system based on heterogeneous information network with transductive classification.
[Vaswani et al., 2017] Attention is all you need.
[Wu et al., 2020] A comprehensive survey on graph neural networks.
[Yedidia et al., 2003] Understanding belief propagation and its generalizations.
[Zhang et al., 2018a] GaAN: Gated attention networks for learning on large and spatiotemporal graphs.
[Zhang et al., 2018b] Deep learning on graphs: A survey.
[Zhou et al., 2006] Learning with hypergraphs: Clustering, classification, and embedding.