key: cord-0612156-qshkjyjz authors: Barua, Hrishav Bakul; Mg, Theint Haythi; Pramanick, Pradip; Sarkar, Chayan title: Detecting socially interacting groups using f-formation: A survey of taxonomy, methods, datasets, applications, challenges, and future research directions date: 2021-08-13 journal: nan DOI: nan sha: 434653f2d224e375bb14ea58400fca935549233d doc_id: 612156 cord_uid: qshkjyjz Robots in our daily surroundings are increasing day by day. Their usability and acceptability largely depend on their explicit and implicit interaction capability with fellow human beings. As a result, social behavior is one of the most sought-after qualities that a robot can possess. However, there is no specific aspect and/or feature that defines socially acceptable behavior and it largely depends on the situation, application, and society. In this article, we investigate one such social behavior for collocated robots. Imagine a group of people is interacting with each other and we want to join the group. We as human beings do it in a socially acceptable manner, i.e., within the group, we do position ourselves in such a way that we can participate in the group activity without disturbing/obstructing anybody. To possess such a quality, first, a robot needs to determine the formation of the group and then determine a position for itself, which we humans do implicitly. The theory of f-formation can be utilized for this purpose. As the types of formations can be very diverse, detecting the social groups is not a trivial task. In this article, we provide a comprehensive survey of the existing work on social interaction and group detection using f-formation for robotics and other applications. We also put forward a novel holistic survey framework combining all the possible concerns and modules relevant to this problem. We define taxonomies based on methods, camera views, datasets, detection capabilities and scale, evaluation approaches, and application areas. We discuss certain open challenges and limitations in current literature along with possible future research directions based on this framework. In particular, we discuss the existing methods/techniques and their relative merits and demerits, applications, and provide a set of unsolved but relevant problems in this domain. Uniqueness of the survey. To the best of our knowledge, this survey is the first of its kind in this subject area. Our survey puts forward the idea of social groups with the perspective of f-formation with some comprehensive details. We also discuss the optimum joining position for a robot to enable human-robot interaction, after successful detection of the formation using computer vision techniques. Additionally, we propose a holistic framework to signify the various concern areas in the detection and prediction of social groups. Various taxonomies, regarding camera view of the environment for collecting scenes for detection, datasets for training machine/deep learning (ML/DL) models, detection capability, and scale and evaluation methods are discussed. We discuss and categorize all the detection methods, particularly rule-based and machine learning-based. Furthermore, we also deliberate the application areas of such detection and recognition giving primary focus to robotics. We also detailed the challenges, limitations, and future research directions in this area. Organization of the survey. This survey article is organized into the following sections. 
Section 2 puts forward a comprehensive idea about the social spaces involved in group interaction. Questions like the meaning of f-formation, the types of f-formation, and the evolution of f-formation from one type to another when a new member joins a group are answered, along with pictorial depictions for the readers' understanding. In Section 3, we propose a generic and holistic framework for group and interaction detection using formations, which also becomes the basis for categorizing the literature in the survey according to the various concern areas and modules. Then we present a year-wise compilation of research performed in this domain with analysis in Section 4. Section 5 discusses the various input methods for detection, such as cameras and other sensors, and also focuses on the various camera views and positions. Section 6 summarizes the methods, techniques, and algorithms for detection, covering both rule-based static Artificial Intelligence (AI) methods and learning-based (data-driven) methods. Then we talk about detection capabilities and scale in Section 7; in this section, we also briefly discuss the various datasets available for training and testing purposes. Section 8 presents the various evaluation strategies and methodologies from the perspective of algorithmic computational complexity and of application areas like robotics and vision. The various application areas are stated in Section 9. Finally, we discuss the limitations and challenges in existing state-of-the-art literature and methods, as well as propose some future research directions and prospects in each of the modules (of the survey framework), in Section 10. We conclude the survey in Section 11. In the comparison with other surveys, ✗ signifies no treatment in the paper, ✓ signifies that some mention exists, and a separate mark denotes comprehensive treatment of the concern area. The concern areas compared are: (1) Comprehensive f-formation list and tutorial on social spaces, (2) Camera views and sensors, (3) Datasets, (4) Detection capability/scale, (5) Evaluation methodology, (6) Feature selection, (7) Rule-based AI methods/techniques, (8) Machine learning based AI methods/techniques, (9) Applications, (10) Limitations, challenges and future directions, (11) Generic survey framework for group and interaction detection. In this section, we describe the theory of f-formation, how the theory can be leveraged to study groups of interacting people, and how a robot can utilize this to imitate social behavior while interacting with a group of people. Facing formation (f-formation) happens when two or more people sustain a spatial and orientational relationship and have equal, direct, and exclusive access to the space between them [61]. Fig. 1 depicts such a social space where a group of people are interacting. An f-formation is the proper organization of three social spaces: O-space, P-space, and R-space [116]. They are arranged as three concentric circles. The O-space, the innermost circle, is a convex empty space that is normally surrounded by the people in the group, and the participants generally look inward into the O-space. The P-space, the second circle, is a narrow band in which the active participants stand. The R-space, the outermost circle, is the space where an inactive participant (listener) or an outsider who is not part of the conversation stands.
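To make the O-space/P-space/R-space picture concrete, the following minimal Python sketch approximates the O-space centre from the participants' floor positions and body orientations by letting each person "vote" for a point a fixed stride in front of them, the same intuition behind the Hough-voting detectors discussed in Section 6. The stride and tolerance values and the helper names are illustrative assumptions, not parameters taken from any surveyed method.

```python
import math

def o_space_center(people, stride=0.75):
    """Approximate the O-space centre of a candidate group.

    people : list of (x, y, theta) tuples -- 2D floor position in metres
             and body orientation in radians (0 = facing +x).
    stride : assumed distance (m) from a person to the O-space centre.
    Returns the centroid of the points each person "votes" for.
    """
    votes = [(x + stride * math.cos(t), y + stride * math.sin(t))
             for x, y, t in people]
    cx = sum(v[0] for v in votes) / len(votes)
    cy = sum(v[1] for v in votes) / len(votes)
    return cx, cy

def is_plausible_f_formation(people, stride=0.75, tol=0.6):
    """Accept the group if every vote lands close to the shared centre."""
    cx, cy = o_space_center(people, stride)
    return all(math.hypot(x + stride * math.cos(t) - cx,
                          y + stride * math.sin(t) - cy) <= tol
               for x, y, t in people)

# Two people facing each other ~1.5 m apart form a vis-a-vis arrangement.
pair = [(0.0, 0.0, 0.0), (1.5, 0.0, math.pi)]
print(o_space_center(pair), is_plausible_f_formation(pair))
```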
Although both the theory of f-formation and appropriate methods to detect them have been well analyzed in the literature, a comprehensive list of all the possible f-formations during different kinds of interactions is yet to be brought out. The most common ones are side-by-side, vis-a-vis, L-shaped, and triangular, defined for groups of two to three persons. Some others include circular, square, rectangular, and semi-circular, which are more flexible and can contain a varying number of persons. We list and categorize a complete collection of all the known f-formations below (see Fig. 2 for a pictorial representation). (a) Side-by-side: The side-by-side formation is formed when two people stand close to each other facing in the same direction; both may face right, left, or the center. A minimum of 2 people is required for such a formation [61]. (b) Vis-a-vis or face-to-face: This formation comes into existence when two people are facing each other. Only 2 people are required for such a formation [61]. (c) L-shape: The L-shape is formed when two people face each other perpendicularly and are situated on the two ends of the letter "L" -- one person facing the center and the other facing right or left [61]. (d) Reserved L-shaped: This is formed when two people are in the position of an L-shape, but they are facing in different directions [72]. (e) Wide V-shaped: Two people face in the same direction, as in side-by-side, but tilt their bodies slightly to face each other a little. A minimum of 2 people is required for this formation [72]. (f) Spooning: This formation has two people, with one person facing forward and the other looking over from behind in the same direction [72]. (g) Z-shaped: This is formed when two people are standing side-by-side but facing in opposite directions [72]. (h) Line formation: In this formation, all stand side-by-side in a straight line, and a minimum of 2 people is required [135]. (i) Column formation: In this formation, all stand one behind the other in a straight line, and a minimum of 2 people is required [136]. people is required [137]. (k) Side-by-side with one headliner: In this formation, one person stands in the front and the others stand side-by-side at the back. A minimum of 3 people is required and they all face in the same direction [40]. (l) Side-by-side with outsider: In this formation, one participant occupies an outer position of the side-by-side formation, in the R-space, and usually does not play an active role in the conversation. A minimum of 3 people is required [79]. (m) V-shaped: In this formation, all people stand in a V-shaped fashion and face the same direction. A minimum of 3 people is required [138]. (n) Horseshoe: The group of people stands in the shape of a "U", and a minimum of 5 people is required [116]. (o) Semi-circular: The semi-circular formation is one where three or more people focus on the same task while interacting with each other [96]. (p) Semi-circular with one leader in the middle: In this formation, people stand in a semi-circular shape and there is one person in the center who faces the group of people in the semi-circle. A minimum of 4 people is required [79]. (q) Square (Infantry square): Four people stand in a square-shaped fashion [139]. (r) Triangle: As the name suggests, three people stand in a triangular shape in this formation [96]. (s) Circle: As the name suggests, a group of people stands in a circular shape in this formation [94].
(t) Circular arrangement with outsiders: In this formation, some people stand in a circular fashion and one or two additional people stand at the back of the circular formation [40]. (u) Geese formation: In this formation, there are two or more people where one person is leading the path and the others are following that person, but they may or may not be looking in the same direction. A minimum of 2 people is required [42]. (v) Lone wolves: This is not really a formation (yet); there is only one person, ready to be joined by others before an interaction [42]. As social robotics is one of the most important application areas of group and interaction detection using computer vision, we put forward a list of positions where a robot can join a formation (here, an f-formation) after successfully detecting it. Joining a group, however, requires a socially aware [28] or human-aware [74] navigation protocol embedded into the robot. In other words, the robot should imitate human-like natural behavior while approaching a group (considering the correct direction and angle) for interaction and discussion, without causing any discomfort to the existing members of the group. This part of the story is out of the scope of our survey, and we limit our work to the detection and prediction of groups and interactions only. But, as it seems necessary to at least briefly mention this side of the coin, we put forward the possible joining locations and the natural joining path or approach direction/angle (robot trajectory) in this section (also discussed briefly in Sections 8 and 10). Researchers can think of presenting and publishing a systematic survey on the navigation and joining aspects of a robot/autonomous agent after successful detection/prediction of the group interaction and f-formation. Table 2 summarizes a list of formations with the number of people and, correspondingly, the new formations after a person/robot joins; a pictorial summary of the same is presented in Fig. 3 (a minimal geometric sketch of this joining step is also given below). This survey aims to facilitate the concerned researchers with a comprehensive overview of this domain of group and interaction detection. The idea of group/interaction detection is not new and has been around for more than a decade. Researchers are trying to design and develop new methods, techniques, algorithms, and architectures for various application areas ranging from computer vision and robotics to social environment analysis. The problem of group and/or interaction detection is a non-trivial problem of computer vision. The existing research approaches follow both classical AI algorithms, like rule-based methods and geometric reasoning, and neural network-based methods. Moreover, learning paradigms like supervised, semi-supervised, and unsupervised learning are also used. Proper categorization of these methods is necessary for future research directions. We have proposed a holistic framework that corresponds to the concern areas of f-formation research and can also be considered a generic architecture for a typical group/interaction detection task using f-formation. Fig. 4 puts forward a possible framework with the different modules of such a detection task.
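Many of the two-person arrangements listed above (side-by-side, vis-a-vis, L-shape, Z-shaped) differ mainly in the relative body orientations of the pair, so a rough classifier can be written directly from the definitions. The sketch below is a hedged illustration: the 45-degree tolerance and the order of the checks are assumptions, and real systems tune such thresholds per dataset.

```python
import math

def _wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def classify_pair(p1, p2, tol=math.radians(45)):
    """Label a two-person arrangement from the relative body orientations.

    p1, p2 : (x, y, theta) with theta the facing direction in radians.
    tol    : assumed angular tolerance (illustrative only).
    """
    d = math.atan2(p2[1] - p1[1], p2[0] - p1[0])   # direction from p1 to p2
    a1 = abs(_wrap(p1[2] - d))                      # how directly p1 faces p2
    a2 = abs(_wrap(p2[2] - (d + math.pi)))          # how directly p2 faces p1
    diff = abs(_wrap(p1[2] - p2[2]))                # relative orientation

    if diff < tol:
        return "side-by-side"
    if a1 < tol and a2 < tol:
        return "vis-a-vis"
    if abs(diff - math.pi / 2) < tol:
        return "L-shape"
    if abs(diff - math.pi) < tol:
        return "Z-shaped (facing away) or vis-a-vis"
    return "unknown"

print(classify_pair((0, 0, 0), (1.5, 0, math.pi)))             # facing each other
print(classify_pair((0, 0, math.pi / 2), (1.0, 0, math.pi / 2)))  # same direction
```

For the joining step summarized in Table 2 and Fig. 3, the following sketch proposes a joining pose for a robot on a ring around the estimated O-space centre, in the widest unoccupied arc and facing the centre. The ring radius and the centroid-based centre estimate are simplifying stand-ins for a detected P-space, not a reproduction of any specific method from the literature.

```python
import math

def joining_pose(members, ring_radius=1.0):
    """Suggest where a robot could join a detected formation.

    members     : list of (x, y) positions of the current participants.
    ring_radius : assumed P-space radius around the O-space centre (m).
    Returns (x, y, heading): a point in the widest angular gap on the
    ring, with the robot heading towards the O-space centre.
    """
    cx = sum(p[0] for p in members) / len(members)
    cy = sum(p[1] for p in members) / len(members)

    # Angles of the existing members around the centre, sorted.
    angles = sorted(math.atan2(y - cy, x - cx) for x, y in members)

    # Find the widest empty arc between consecutive members (wrap-around).
    gaps = []
    for i, a in enumerate(angles):
        nxt = angles[(i + 1) % len(angles)]
        width = (nxt - a) % (2 * math.pi)
        gaps.append((width, a + width / 2))
    width, mid = max(gaps)

    jx = cx + ring_radius * math.cos(mid)
    jy = cy + ring_radius * math.sin(mid)
    heading = math.atan2(cy - jy, cx - jx)   # face the O-space centre
    return jx, jy, heading

# An L-shaped pair: the suggested slot closes the arrangement naturally.
print(joining_pose([(0.0, 1.0), (1.0, 0.0)]))
```

In practice, as noted in Sections 8 and 10, the stopping distance, approach angle, and the change of formation type after joining also have to be handled by the navigation layer; this sketch only covers the final slot selection.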
The various concern areas of this domain can be characterized by -sensors used, camera view/position for capturing the group interaction, datasets used for training/testing the method in case of learning-based approaches (indoor or outdoor), feature selection method & criteria, detection capabilities (static/dynamic scenes) & scale (single or multi-group scenario), evaluation methodology (efficiency/accuracy and/or simulation study and human experience study), and application areas. The mentioned modules are used as the basis of categorization of the literature in our survey and are attended to in the upcoming sections one by one (as mentioned in Fig. 4 ). Finally, we conclude the survey by discussing the limitations, challenges, and future directions/prospects (Section 10) in each of the concerned modules. In 1990, Kendon [70] proposed the f-formation theory for group interaction by participating people on the basis of proxemics behavior. A computer system to detect human proxemics behavior was first studied by Hall almost six decades ago [55, 56] . This section is a survey on the literature collected on f-formation, using static and learning-based AI approaches. Fig. 5 shows the various specified distance ranges for different designated interaction types on the basis of intimacy level between the participating people. The distance ranges specified in green colored boxes are relevant to group/interaction and f-formation detection perspective. The blue-colored boxes signify distance ranges that are not generally seen in any So, there is a transition from traditional AI-based methods to machine learning and data-driven techniques like almost any domain of AI. Table 3 can be referred to for the complete list of references (year-wise). The table also consists of the keywords (methods, focus areas, and technologies) for most of the references for a better perception of the readers. Human proxemics using distance ranges as stated by Hall [55] . The green comment boxes signify distance ranges typically used for f-formation and group interaction. Geometric reasoning on sonar data [12] . Wizard-of-Oz (WoZ) study on spatial distances related to f-formations [61] . Clustering trajectories tracked by static laser range finders [67] , trajectory classification by SVM [112] . 2 2010 Probabilistic generative model on IR tracking data [54] , WoZ study of robot's body movement [75] , SVM classification using kinematic features [144] . Analysis of different f-formations for information seeking [79] , Hough-transform based voting [37] , graph clustering [60] , a study on transitions between f-formations on interaction cues [82] , a computational model of interaction space for virtual humans extending f-formation theory [88] , a study of physical distancing from a robot [85] , utilizing geometric properties of a simulated environment [86] , a study to relate f-formations with conversation initiation [117] , Gaussian clustering on camera-tracked trajectories [38] . 9 2012 Application of f-formations in collaborative cooking [91] , Kinect-based tracking with rules [78] , WoZ study on social interaction in nursing facilities [71] , a study of robot gaze behaviors in group conversations [143] , velocity models (while walking) [84] , SVM with motion features [108] , Hidden Markov Model (HMM) [49] . 
Spatial geometric analysis on Kinect data [44, 46] , analysis of f-formation in blended reality [41] , a comparison of [37] and [60] [114] , exemplar based approach [77] , multi-scale detection [115] , Bag-of-Visual-Words (BoVW) based classifier [126] , Inter-Relation Pattern Matrix [23] , HMM classifiers [81] , O-space based path planning [52] . Hough Voting (HVFF), Graph-cuts (GCFF) [100] , game theory based approach [128] , correlation clustering algorithm [11] , reasoning on proximity and visual orientation data [42] , effects of cultural differences [65] , HMM to classify accelerometer data [59] , iterative augmentation algorithm [31] , adaptive weights learning methods [102] , estimating lower-body pose from head pose and facial orientation [142] , search-based method [45] , study on group-approaching behavior [69] , spatial activity analysis in a multiplayer game [66] . Robust Tracking Algorithm using TLD [9] , GCFF based approach [116] , Correlation Clustering algorithm [10] , multimodal data fusion [8] , spatial analysis in collaborative cooking [90] , GIZ (Group Interaction Zone) detection method [30] , study on influencing formations by a tour guide robot [68] , joint inference of pose and f-formations [121] , participation state model [118] , SALSA dataset for evaluating social behavior [7] , multi-level tracking based algorithm [131] , Structural SVM (SSVM) using Dynamic Time Warping (DTW) loss [119] , Long-Short Term Memory (LSTM) network [2] , influence of approach behavior on comfort [15] . 14 2016 F-formation applied to mobile collaborative activities [125] , subjective annotations of fformation [145] , game-theoretic clustering [129] , study of display angles in museum [62] , mobile co-location analysis using f-formation [113] , proxemics analysis algorithm [104] , review of human group detection approaches [123] , LSTM based detection in ego-view [3] . Haar cascade face detector based algorithm [73, 94] , weakly-supervised learning [127] , temporal segmentation of social activities [35] , omnidirectional mobility in f-formations [141] , review of multimodal social scene analysis [6] , 3D group motion prediction from video [64] , survey on social navigation of robots [28] , a study on robot's approaching behavior [16] , heuristic calculation of robot's stopping distance [109] , a study on human perception of robot's gaze [130] , computational models of spatial orientation in VR [97] . Optical-flow based algorithm in ego-view [105] , meta-classifier learning using accelerometer data [50] , human-friendly approach planner [111] , discussion on improved teleoperation using f-formation [92] , effect of spatial arrangement in conversation workload [80] , study of f-formation dynamics in a vast area [40] . Study on teleoperators following f-formations [96] , analysis on conversational unit prediction using f-formation [103] , empirical comparison of data-driven approaches [57] , LSTM networks applied on multimodal data [107] , robot's optimal pose estimation in social groups [95] , review of robot and human group interaction [140] , Staged Social Behavior Learning (SSBL) [47] , Euclidean distance based calculation after 2D pose estimation [93] , Robot-Centric Group Estimation Model (RoboGEM) [124] . Difference in spatial group configurations between physically and virtually present agents [58] , Conditional Random Field (CRF) with SVM for jointly detecting group membership, fformation and approach angle [18] . 
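Hall's proxemic distance ranges (Fig. 5) are often the first, cheapest cue in the pipelines compiled above: pairs of people standing farther apart than the social zone are unlikely to be part of the same f-formation. The sketch below labels pairwise distances with approximate, commonly quoted band boundaries; the exact metre values and the candidate_pairs filtering rule are assumptions for illustration, not figures taken from a specific surveyed paper.

```python
from itertools import combinations
import math

# Approximate proxemic bands (metres) commonly attributed to Hall;
# treat these as illustrative -- Fig. 5 gives the ranges used in the survey.
ZONES = [(0.45, "intimate"), (1.2, "personal"), (3.6, "social"),
         (float("inf"), "public")]

def proxemic_zone(distance):
    """Map an inter-personal distance to a proxemic zone label."""
    for upper, label in ZONES:
        if distance <= upper:
            return label
    return "public"

def candidate_pairs(people, keep=("intimate", "personal", "social")):
    """Keep only pairs close enough to plausibly be interacting.

    people : dict of name -> (x, y) floor positions in metres.
    """
    pairs = []
    for (a, pa), (b, pb) in combinations(people.items(), 2):
        d = math.hypot(pa[0] - pb[0], pa[1] - pb[1])
        if proxemic_zone(d) in keep:
            pairs.append((a, b, round(d, 2)))
    return pairs

people = {"A": (0, 0), "B": (1.0, 0.3), "C": (6.0, 5.0)}
print(candidate_pairs(people))   # A-B survives; C is too far from both
```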
This section summarizes the input methods in the group/interaction detection framework (Fig. 4), i.e., cameras and other sensors such as depth, laser, audio, and RFID (radio-frequency identification) sensors. These are chosen based on the application areas and the working environment. There are two different types of camera positioning used -- an ego-vision/ego-view (ego-centric) camera for robotics, and exo-vision/exo-view (exo-centric) or global-view cameras (fixed on walls and ceilings) in indoor or outdoor environments (see Fig. 6). Cameras are used for drone surveillance, robotic vision, and scene monitoring. In these cases, we work with the ego/exo views of the scene to detect group interactions. Ego-centric view. Ego-centric refers to the first-person perspective; for example, images or videos captured by a wearable camera or a robot camera. The captured data is focused on the part of the scene where the target objects are placed. In [105], the robot's camera is used for capturing scenes, which is also referred to as a robot-centric view. In [45], the authors use first-person view cameras for estimating the location and orientation of the people in a group. In [2], the authors use a low temporal resolution wearable camera for capturing images of groups. Exo-centric view. The exo-centric view is concerned with the third-person perspective or the top view; for example, images or videos captured by surveillance/monitoring cameras. There can be one or many social interaction groups in a scene that can be captured simultaneously from the top view. In [102], the authors use 4 cameras for detecting groups at a large scale. The method also detects changes in the target groups when they move closer to or further from the cameras. In [60], experiments are done by capturing video with a camera from approximately 15 meters overhead. In [38], the images are captured using a fisheye camera mounted 7 meters above the floor. Sensors play a vital role in finding the relative distances of the people in a group, which helps in accurate prediction of the type of f-formation. Researchers have used different types of sensors, such as depth sensors, laser sensors, audio or speech sensors, RFID, and UWB sensors, in the literature. There are some cases where both cameras and other types of sensors are used simultaneously for detection. In [130], the authors use UWB (ultra-wideband) localization beacons, a Kinect, and an audio sensor for detecting people and other entities, and RGB cameras for monitoring. The data for scenes are captured in the form of images and/or videos depending on the method that uses the input for scene detection. Some instances of WiFi-based tracking [51] of humans are also visible in the literature. Table 4 gives a categorization of the surveyed literature on the basis of camera views and sensors. The table specifies the number of cameras as well as the various cameras and sensors used in each cited paper. There are many f-formation detection methods proposed in the literature. In this article, we broadly categorize these methods into two classes -- (a) rule-based methods (fixed rules, assumptions, and geometric reasoning), like conventional image processing and vision techniques, and (b) learning-based methods (data-driven approaches). With the big data revolution in full bloom, learning-based methods have come to prominence in the recent past. Multimedia and visual analytics [99] from big data remains a lucrative tool for the future of this domain.
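For the exo-centric (overhead) cameras discussed above, such as the fisheye camera mounted 7 meters above the floor in [38], image-plane detections are usually mapped to metric floor coordinates before any distance- or orientation-based reasoning. A minimal OpenCV sketch of that step follows; the calibration point correspondences and the to_floor helper are placeholders, not values or code from the surveyed systems.

```python
import numpy as np
import cv2

# Four reference points on the floor, located once in the overhead image
# (pixels) and measured in the room (metres).  All values are placeholders.
img_pts = np.array([[102, 88], [968, 75], [990, 702], [95, 715]], np.float32)
floor_pts = np.array([[0, 0], [8.0, 0], [8.0, 6.0], [0, 6.0]], np.float32)

H, _ = cv2.findHomography(img_pts, floor_pts)

def to_floor(pixel_xy):
    """Map a detected person's image position (e.g. the bottom midpoint of
    the bounding box) to metric floor coordinates."""
    pt = np.array([[pixel_xy]], dtype=np.float32)      # shape (1, 1, 2)
    return cv2.perspectiveTransform(pt, H)[0, 0]

# Pairwise distances computed on these floor coordinates feed the
# proxemics- and O-space-based reasoning discussed in later sections.
print(to_floor((540, 400)))
```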
Fig. 6. Different camera views of the same group/interaction in an indoor environment. On the left, a robot has an ego-view camera; on the right, an exo-view or global-view camera is fixed on a wall. These images are produced in the Webots robotic simulator [39]. In group discussions, people stand in positions where the conversation can happen effectively. Kendon [70] proposed a formal structure of group proxemics among the interacting people in a formation (described in Section 2). In [100], a Hough-voting strategy is used for finding the O-space. In [115], the authors use the Hough-voting approach with a two-step algorithm -- 1) fixed-cardinality group detection, and 2) group merging. Using these two steps, they detect the type of f-formation. In [46], the experiment uses a heat map-based method for recognizing human activity and a best-view camera selection method. In [100], graph-cuts for f-formation (GCFF) is used for detecting f-formations in static images by clustering graphs with graph-cuts algorithms. Yasuharu Den [40] notes that formations also depend on the social organization and environment, and explains formations with outsiders, where people stand based on their position. In [96], there are three constraint-based formations, namely triangle, rectangle, and semi-circular formations. The authors use a game-theoretic model over the position and orientation information of people to detect groups in the scene. For checking the formation, they use an algorithm proposed by Vascon et al. [128, 129] that generates the 2D frustum of the position and orientation of each person in the group. In [94], the authors use the Haar cascade face detector algorithm to detect the faces and eyes of people. Based on the face and eye detection, the method decides how many frontal, right, and/or left faces are present and then decides the formation. In [73], the Haar cascade classifier is used with a quadrant methodology; the paper differentiates a person's facing direction by looking at where the eye is located and in which quadrant. In [46], the authors use a new method to find the dominant sets and then compare it with modularity cut, but this method is applicable only when everyone is standing. In [103], the method uses speaking turns to indicate the existence of distinct conversation floors and estimates the presence of voice, but it cannot detect silent (inactive) participants. In [107], proximity and acceleration data are used, and pairwise representations are fed to an LSTM (long short-term memory) network to identify the presence of interaction and the roles of participants in the interaction. However, using a fixed threshold for identifying speakers can lead to mislabeling in some instances.
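As an illustration of the Haar-cascade recipe used in [94] and [73], the sketch below counts frontal and profile faces with the pre-trained cascades shipped with OpenCV; the counts give a coarse cue about facing directions in the scene. The specific cascades, detection parameters, the placeholder image path, and the flipped-image trick for right-facing profiles are assumptions about the general approach, not the authors' exact pipelines.

```python
import cv2

# Pre-trained cascades shipped with OpenCV (paths via cv2.data.haarcascades).
frontal = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_profileface.xml")

def face_orientation_counts(image_path):
    """Count frontal and profile faces in one image.

    The profile cascade is trained mainly on left-facing profiles, so the
    image is also scanned flipped to catch right-facing ones -- a common
    trick, assumed here rather than taken from the surveyed papers.
    """
    img = cv2.imread(image_path)                 # image_path is a placeholder
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    n_frontal = len(frontal.detectMultiScale(gray, 1.1, 5))
    n_left = len(profile.detectMultiScale(gray, 1.1, 5))
    n_right = len(profile.detectMultiScale(cv2.flip(gray, 1), 1.1, 5))
    return {"frontal": n_frontal, "left": n_left, "right": n_right}

# Mostly frontal faces hint at a vis-a-vis / semi-circular arrangement facing
# the camera; profile faces hint at side-by-side or L-shaped views.
print(face_orientation_counts("scene.jpg"))
```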
In [10] , structural SVM Manuscript submitted to ACM [9] , multi [90] , multi [45] , multi [142] , 1 [31] , [69] , multi [42] , 1 [11] , 1 [51] , depth camera, RGB camera [49] , an omni-directional camera [143] , multi [91] , [88] , 3 [75] , 4 [54] , [57] , robot camera [61] , 2 [12] , [10] , 2 [67] , 1 [18] Exo-centric (Global view) [Section 5.1] • Social scene monitoring • Covid-19 social distancing monitoring • Human interaction detection and analysis 1 [96] , multi [50] , multi [92] , 1 [16] , [141] , multi [6] , multi [64] , 1 [28] , 4 [130] , multi [127] , 8 [109] , 1 [62] , multi [123] , multi [129] , 4 [104] , 2 [145] , multi [119] , [131] , multi [7] , multi [121] , [116] , 1 [30] , multi [116] , multi [8] , multi [90] , [66] , 1 [48] , 4 [102] , a single monocular camera [128] , 3 overhead fish-eye camera used for training classifier [59] , multi [142] , multi [45] Audio, sociometric badges, Blind sensor, prime sensor, WiFi based tracking, laser based tracking, depth sensor , band radios, touch receptors, RFID sensors, smart phones, UWB becon [7] , [42] , Kinect depth sensor [44] , [51] , [49] , [82] , [125] , speakers [103] , wearable sensors [107] , [28] , [109] , ulta wide-band localization beacons ( UWB), Kinect [130] , [67] , [112] , [113] , RFID tag [62] , [74] , [44] , [84] , [88] , Asus Xtion Pro sensor [93] , ZED sensors [105] , single worn accelerometer [50] , Kinect sensor [111] , Microsoft Speech SDK [97] , speaker, Asus Xtion Pro live RGB-D sensor [16] , Kinect [64] , motion tracker [109] , sociometric badges [6] , RGB-D sensor [141] , tablets [125] , tablets [113] , mobile sensors [123] , microphone, infrared(IR) beam and detector, bluetooth detector, accelerometer [7] , touch sensor [88] , range sensor [143] , laser sensors [84] , Wi-fi based tracking, Laser-based tracking [51] , PrimeSensor, Microsoft Kinect, microphone [81] , RFID sensors [41] , blind sensor, location beacon [42] , single worn accelerometer [59] , [80] , gaze animation controller [97] , [117] , grid world environment [86] , ethnography method [79] Others relevant literatures - [55] , [65] , [52] , [41] , [70] , [72] , [40] (support vector machine) is used for learning how to treat the distance and pose information, and correlation clustering algorithm is used to predict group compositions. Furthermore, TLD (Tracking learning detection) tracker is used for blur detection for ego-vision images. But the trackers cannot perform detection when the target is moving out of the camera field of view. In [105] , the method uses ego-centric pedestrian detection. The pedestrian detector generates bounding boxes. It uses optical flow for estimating motion between consecutive image frames. For detecting groups, they used joint pedestrian proximity and motion estimation. In [145] , the method detects the group with a group detector first then uses the trained classifier to differentiate the people involved in the group. Some researchers use pedestrians, vision-based algorithms, pose-estimation algorithm to detect [127] groups. In the article [127] , authors use body-pose for handling f-formation detection and finding the joint estimation of f-formation and target's head and body orientation. They also use multiple occlusion-adaptive classifiers. There are many more methods scientists use but each of them has its own strengths and weaknesses. Fig. 8 Manuscript submitted to ACM 14 Barua et al. We categorize methods as rule-based that include pre-defined rules, geometric assumptions, and reasoning. 
Rule-based methods are designed around well-known social behaviors and geometric properties and are often intuitive. In the absence of any learning paradigm, the algorithms are purely based on a static set of rules that are assumed to hold for a particular group situation (see Fig. 9). In the following, we list the most popular rule-based methods that report a decent accuracy in detecting human groups. Voting based approach (2013). This approach is used for detecting and localizing groups by finding matches based on exemplars. The authors in [77] suggest that this method works on agents, so it is very flexible for different multi-agent scenarios. The results show that this method is effective for groups of up to four agents. The results are evaluated with people only, without robots. The computational complexity of this method is low, hence it is real-time in nature, and its accuracy is very good. Graph cuts for f-formation (GCFF) (2015) [116]. As noted in the overview above, this method detects f-formations in static images by clustering graphs with graph-cuts algorithms. Head and body pose estimation (HBPE) (2015) [121]. This method uses a joint learning framework for estimating head and body orientations, which in turn are used for estimating f-formations. This method is evaluated with people in a scene without any robots. For evaluation, the authors use the mean angular error for head and body pose estimation (HBPE) and the F1-score for f-formation estimation (FFE). This method is compared with the Hough Voting for f-formation (HVFF) method; though the results are more or less similar, this method is slightly more accurate and has a higher F1-score. GROUP (2015) [131]. The GROUP algorithm detects f-formations based on the lower-body orientation distributions of people in the scene and gives a set of free-standing conversational groups at each time step. Firstly, it analyzes the maximum description length (MDL) parameter; the higher the MDL parameter, the larger the radius for grouping people together. This method can also detect non-interacting people as outliers. It is evaluated with people only, without any robots in the scene. The computational complexity of this method has been compared with state-of-the-art methods. Approach Planner (2018) [111]. The Approach Planner (AP) enables a robot to navigate/plan based on the natural approaching behavior of humans toward a target person. This method can replicate human behavior/tendencies when approaching. The evaluation is based on parameters derived from skeletal information. Game-theoretic model (2019) [96]. The approach computes a 2D frustum for each virtual agent and robot given their position and orientation, and then computes an affinity matrix. The method is evaluated both quantitatively and qualitatively. It is efficient in serving teleoperated robots that follow f-formations while joining groups automatically. The method also takes care of the fact that the formation is modified when new people/robots join the old group. The evaluation is made in a simulation environment. Machine learning-based methods are generally data-driven models where different algorithms are explored by researchers. Deep learning methods are generally treated as a special case of machine learning. The primary learning paradigms used are supervised, unsupervised, semi-supervised, and reinforcement learning. A generic system that uses a machine learning algorithm to detect f-formations is shown in Fig. 10.
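Several of the data-driven methods listed below (for example the LSTM-based approaches [2, 3, 107]) feed short temporal sequences of pairwise features to a recurrent network that decides whether the pair is interacting. A minimal PyTorch sketch of that idea follows; the feature count, hidden size, and classification head are illustrative choices, not the architectures reported in those papers.

```python
import torch
import torch.nn as nn

class PairInteractionLSTM(nn.Module):
    """Classify a short track of pairwise features (distance, relative
    orientation, ...) as 'interacting' vs 'not interacting'.
    Sizes are illustrative, not taken from any surveyed paper."""
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):                 # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)          # last hidden state summarises track
        return self.head(h[-1])           # logits for the two classes

model = PairInteractionLSTM()
# A batch of 8 tracks, 30 frames each, 4 pairwise features per frame.
logits = model(torch.randn(8, 30, 4))
print(logits.shape)                       # torch.Size([8, 2])
```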
IR tracking method with SVM (2010) [54]. With the help of IR tracking, social interactions can be classified as either existing or non-existing by using geometric social signals. The authors train and test many classifiers, such as Support Vector Machines (SVM), Gaussian Mixture Models (GMM), and the Naive Bayes classifier. IR tracking with an SVM classifier has been shown to achieve better accuracy than the other classifiers. Graph-based clustering method (2011) [60], (2013) [126]. In [60], the authors use the "socially motivated estimate of focus orientation" (SMEFO) feature to estimate body orientation, which in turn is used to estimate f-formations. This method has been compared with a modularity cut method. The evaluation is done on the basis of computational complexity. The limitation of this approach shows up in scenarios where people are moving within the group and/or people are joining/leaving the group. In [126], the authors build a graph representation from the 3D trajectories of people and their head poses. Using a graph-clustering algorithm, they discover social interaction groups, and they use a Support Vector Machine (SVM) classifier for learning and classifying the group activities. The evaluation shows that it is better than the previous methods. A human experience study is also performed with robotic scenarios. This approach not only recognizes or detects a particular group activity but also predicts a direct link between each pair of persons in that group. Novel framework (2013) [23]. This approach uses the Subjective View Frustum (SVF) as the main feature, which encodes the visual field of a person in a 3D environment, and the Inter-Relation Pattern Matrix (IRPM) as a tool for evaluation. For the tracking part, a Hybrid Joint-Separable (HJS) filter is used; the tracker gives the position of the head and feet of each person. Computational-result-based evaluation against the other counterparts is done in terms of accuracy/efficiency. GIZ detection (2015) [30]. This method detects groups based on proxemics. The Group Interaction Energy (GIE) feature, Attraction and Repulsion Features (ARF), the Granger Causality Test (GCT), and Additional Features (AF) are proposed in this method, and tests are also conducted by combining these features. This method allows people to be connected loosely. The evaluation is done on the basis of computational accuracy and efficiency. 3D skeleton reconstruction using patch trajectory (2017) [64]. This algorithm works in two stages. First, it takes images from different views as input and produces 3D body skeletal proposals for people using 2D pose detection. Second, it refines those using a 3D patch trajectory stream and provides temporally stable 3D skeletons. The authors evaluate the method quantitatively and qualitatively, yielding an accuracy of 99%. The limitation of this method lies in its dependency on 2D pose detection and in its computational time complexity. Learning methods for HPE (head pose estimation) and BPE (body pose estimation) (2017) [127]. This method uses a joint learning framework for estimating the head and body orientations of targets and the f-formations of conversational groups. Method based on pedestrian motion estimation (2018) [105]. This method works in three parts -- ego-centric pedestrian detection, pedestrian motion estimation, and group detection using joint motion and proximity estimation. The pedestrian detector produces bounding boxes (BB) with two features -- the position of the pedestrian and the size of the BB. Optical flow is used for motion estimation.
Then, joint pedestrian proximity and motion estimation are used for detecting groups while considering the depth data. The evaluation is done in terms of a real-life human experience study using robots and humans. Method based on multi-person 2D pose estimation (2019) [93]. This method first estimates the positions of the human skeletal characteristic points in the image plane and calculates the Euclidean distances between those points. Then the Part Affinity Fields (PAFs) feature is applied to find the distance based on the Euclidean distance. A curve-fitting approach is used for validation purposes. No prior information about camera parameters or scene features is needed. The evaluation is performed on the basis of a human experience study in real-life scenes using robots and humans. Bagged tree (2019) [57]. The proposed algorithm works in three steps -- dataset deconstruction, pairwise classification, and reconstruction. The authors evaluate this algorithm with three ML classification models -- weighted KNN, bagged trees, and logistic regression -- where the bagged tree model achieves better results in pairwise accuracy, precision, recall, and F1-score. However, this method still needs to be trained on larger datasets and also on richer features. The evaluation is done with a human experience study with robots. RoboGEM (2019) [124]. RoboGEM is an unsupervised algorithm that detects groups from an ego-centric view. This method works using three main modules -- a pedestrian detection module "P", a pedestrian motion estimation module "V", and a group detection module "G". In the first module, an off-the-shelf pedestrian detector (YOLO) is used that provides bounding boxes for each person in the image. In the second module, V is estimated using optical flow. In the last module, human group detection is performed using joint motion and proximity estimation. The authors compared this method with existing approaches using Intersection-over-Union (IoU), false positives per image (FPPI), and depth threshold metrics. The evaluation is done with a human experience study with robots. Table 5 lists the surveyed papers on the basis of rule-based and learning-based AI approaches. From the algorithmic trends, it is evident that learning-based approaches are slightly more predominant in recent years. However, both kinds of methods have been equally explored over the years and in the recent past. Learning-based methods tend to be more accurate than their rule-based counterparts; examples of such methods are [60], [81], [64], [105], [57], and [124]. Table 6 lists the accuracy and efficiency of the different approaches. This signifies the detection quality of the methods and techniques. From the survey, it can be established that unsatisfactory accuracy is seen more often in rule-based approaches than in learning-based models. As expected, the main reason in general for low accuracy lies in the inability of the methods to accurately detect dynamic groups as well as multiple groups in the scene. One interesting observation is that accuracy is largely impacted by the camera view as well; in particular, low accuracy can be seen with exo-vision input methods and datasets. Similarly, real-timeliness is another issue with methods dealing with dynamic and multiple groups. There is no prominent impact of camera views, datasets, and method types in this case.
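The deconstruction / pairwise-classification / reconstruction pipeline described for [57] can be sketched end to end with scikit-learn: annotated scenes are broken into labelled pairs, a bagged-tree classifier predicts which pairs belong to the same group, and the positively classified pairs are merged back into groups. The features, the toy training data, and the classifier settings below are assumptions for illustration only, not the authors' implementation.

```python
import math
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def pair_features(p, q):
    """Pairwise features for two people given as (x, y, theta):
    distance, absolute relative orientation, and how directly each
    person faces the other (an illustrative feature choice)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    to_q = math.atan2(dy, dx)
    rel = math.atan2(math.sin(p[2] - q[2]), math.cos(p[2] - q[2]))
    return [math.hypot(dx, dy), abs(rel),
            math.cos(p[2] - to_q), math.cos(q[2] - (to_q + math.pi))]

# Step 1: deconstruction -- annotated scenes become labelled pairs (toy data).
X = np.array([pair_features((0, 0, 0.0), (1.2, 0, math.pi)),   # same group
              pair_features((0, 1, -1.6), (0.9, 1, -1.6)),     # same group
              pair_features((0, 0, 0.0), (5.0, 4.0, 1.0)),     # different
              pair_features((2, 2, 3.1), (6.0, 2, 0.2))])      # different
y = np.array([1, 1, 0, 0])

# Step 2: pairwise classification with bagged decision trees.
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X, y)

# Step 3: reconstruction -- merge positively classified pairs into groups
# using a small union-find over the people in the scene.
def reconstruct_groups(people):
    n = len(people)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if clf.predict([pair_features(people[i], people[j])])[0] == 1:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

scene = [(0, 0, 0.0), (1.1, 0.1, math.pi), (7, 7, 0.5)]
print(reconstruct_groups(scene))   # e.g. [[0, 1], [2]]
```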
Readers may refer to the online Appendix which contains the detailed comparison of methods and techniques under rule-based static AI approaches in Table 1 and machine learning-based approaches in Table 2 . Table 5 . Classification based on approach/method for group and f-formation detection. References Classical Rule based AI methods [Fixed model based learning and prediction method based on certain geometric assumptions and/or reasoning (Section 6.1).] Approach behavior [112] , sociologically principled method [37] , proposed model [117] , The Compensation (or Equilibrium) Model, The Reciprocity Model, The Attraction-Mediation Model, The Attraction-Transformation Model [85] , rapid ethnography method [79] , digital ethnography [91] , GroupTogether system [78] , museum guide robot system [143] , extended f-formation system [46] , Multi-scale Hough voting approach [115] , HFF (Hough for f-Formations), DSFF (Dominant-sets for f-Formations), [114] , PolySocial Reality (PoSR)-F-FORMATION [41] , two-Gaussian mixture model,O-space model [52] , Wifi based tracking, Laser-based tracking, Vision-based tracking [51] , heat map based f-formation representation [44] , [128] , group tracking and behavior recognition [48] , search-based method [45] , Estimating positions and the orientations of lower bodies) [142] , Kendon's diagramming practice [90] , GROUP [131] , Graph-Cuts for f-formation (GCFF) [116] , [68] , head, body pose estimation (HBPE) [121] , Link Method, Interpersonal Synchrony Method [104] , Frustum of attention modeling [129] , f-formation as dominant set model [145] , HRI motion planning system [141] , footing behavior models (Spatial-Reorientation Model, Eye-Gaze Model) [97] , MC-HBPE(matrix completion for head and body pose estimation) method [6] , [109] , Haar cascade face detector algorithm [73] , Haar cascade face detector algorithm [94] , [130] , Approaching method [111] , Measuring Workload Method [80] , [95] , [96] , [103] , f-formation as dominant set model [146] Machine Learning based AI methods [Data-driven models for learning and prediction using Supervised, Semisupervised, Unsupervised and Reinforcement learning (ML/DL) or any such techniques (Section 6.2).] 
[12] ,IR tracking techniques [54] , SVM classifier [144] , GRID WORLD SCENARIO [86] , graphbased clustering method [60] , [38] , [29] , Hidden Markov Model(HMM) [49] , proposed method with o-space and without o-sapce(SVM) [108] , Region-based approach with level set method [71] , IRPM(Inter-Relation Pattern Matrix) [23] , graph-based clustering algorithm [126] , voting based approach [77] , Hidden Markov Models (HMMs) [81] , SVM [31] , Transfer Learning approaches [102] , method with Hidden Markov Model( [59] , head pose estimation technique [11] , [9] , Matrix Completion for Head and Body Pose Estimation (MC-HBPE) [8] , GIZ detection [30] , [7] , Supervised Correlation Clustering (CC) through Structured Learning [119] , Long-Short Term Memory (LSTM), Hough-Voting (HVFF) [2] , Long Short Term Memory (LSTM) [3] , 3D skeleton reconstruction using patch trajectory [64] , Human aware motion planner [28] , [35] , Learning Methods for HPE and BPE [127] , GAMUT(Group bAsed Meta-classifier learning Using local neighborhood Training) [50] , Group detection method [105] , [57] , Long Short Term Memory (LSTM) network [107] , RoboGEM (Robot-Centric Group Estimation Model) [124] , Multi-Person 2D pose estimation [93] , Staged Social Behavior Learning (SSBL) [47] , multiclass SVM classifier [18] Other studies Wizard-of-Oz [61] , Wizard-of-Oz paradigm [15] , Wizard-of-Oz [75] --F-formation as dominant set model (2018) [146] good (71% F-measure) real time (2019) [95] (DT) --Machine Learning based AI methods (Section 6.2) IR tracking technique with SVM (2010) [54] good (77.81%) real time Extraversion, neuroticism (SVM) (2010) [144] (66%, 75%) real time Graph-based clustering method (2011) [60] excellent (all singletons precision 95%, limitation: only for standing) real time to be considered. People may sway or move their bodies occasionally too. Apart from these, methods also need to consider a single group or multiple existing groups in a scene. Outliers to one group can be a part of another group or can be noise at the global level. Fig. 11 depicts a taxonomy of group detection in interaction scenarios in real-life cases. Detection (Section 7) Detection capability (Section 7.1) Detection scale (Section 7.2) Single group Multiple groups Methods need to attend to both static and dynamic groups in interactions and formations. Here, we do categorization of this aspect. Static group scene. A static scene means people in the scene are not moving. The people interacting in a group or formation do not change groups or new people do not join a group while interaction is in progress. The people within the group do not sway or change head/body pose and orientation that can affect f-formation detection. In such cases, it is easier to detect groups and formations. No temporal aspect in the scene is to be considered and the method can work on a single image. In [31] and [93] , single image is used from a single egocentric camera for detection. Mostly indoor scenes like conferences, group discussions, coffee breaks, and meetings, such static groups can be found. Dynamic group scene. In the case of a dynamic scene, people tend to move in groups, also referred group dynamics. New people can join a group and/or existing people can leave a group. Also, some people participating in an interaction may temporarily change their head/body pose and orientation a bit; this necessarily does not mean that the formation has changed. 
In such cases, it becomes very difficult for an algorithm to detect the group or formation in the interaction scenario. As a result, the methods need to consider the temporal information of the scene utilizing a sequence of images over a window. A sequence of image/image stream is taken over a particular period of time. In [38] , the video data is used which is 10 frames per second for detecting dynamic groups and interactions. In [29] , the surveillance videos are used for experiment purposes. Similarly, [115] utilizes video feed from a cocktail party [133] for its experiments, which is one frame in 5 seconds. Fig. 12 depicts the dynamic scene and group scenario. The EGO-GROUP dataset [4] has a video of an indoor laboratory setup. The video consists of 395 image frames. The specialty of this video is that the people in the scene are not static in one position and they change position/orientation and location with time. On the right-hand side of the figure, we put forward four instances of the image sequence where four different types of groups/interactions and formations are visible for the same four people in the scene. This type of dynamism should be handled by the detection methods with efficiency considering temporal aspects of the scenes. In outdoor scenes such as waiting rooms, stations, airports, restaurants, theatres, and lobbies, dynamic groups are mostly encountered. Table 7 summarizes the references into two detection capability types found in the literature. Since we need different methods depending on how many groups are there in a captured image or video, the detection scale plays an important role. Table 7 . Classification based on group/interaction and formation detection capability. References Static scene detection [93] , [50] , [92] , [109] , [145] , [129] , [104] , [116] , [44] , [114] , [100] , [128] , [31] , [102] , [69] , [47] Dynamic scene detection [124] , [58] , [103] , [107] , [105] , [50] , [111] , [92] , [80] , [40] , [130] , [16] , [64] , [6] , [141] , [35] , [94] , [127] , [73] , [125] , [62] , [113] , [104] , [3] , [9] , [8] , [90] , [68] , [116] , [121] , [118] , [7] , [131] , [119] , [2] , [54] , [75] , [144] , [60] , [82] , [85] , [117] , [38] , [79] , [61] , [12] , [29] , [91] , [78] , [71] , [143] , [84] , [108] , [49] , [77] , [115] , [95] , [51] , [126] , [23] , [46] , [41] , [128] , [11] , [59] , [48] , [142] , [45] , [67] , [112] , [66] , [140] , [30] , [15] , [37] , [18] , [146] Single group detection. When a sensor/camera detects only one interacting group in the scene, the work is easily done. The stream of images sequence can have multiple groups as well. But all the methods do not have the capability to detect multiple groups simultaneously. In some cases, single group detection is useful when a robot needs to detect a single group of interest in a scene or environment and join the group for interaction/discussion. The datasets used for this kind of detection are mostly captured indoor (for example office and panoptic studio [33] ) or outdoor (mostly private datasets). Other publicly available datasets are BEHAVE [89] and YouTube videos which can also be used for such purposes. In the case of ego-view camera-based detection methods, single group detection is the primary focus. Multiple group detection. When there is more than one interacting group or formation in a scene, the detection methods need special attention. 
Sometimes there is only one interacting group in the scene along with some additional people who are not actively involved in the interaction. Those cases can also be considered under the same umbrella and are quite challenging too. This kind of detection is useful for finding how many groups are there or finding a particular group in a diverse scene in surveillance/monitoring applications. There exists some datasets comprising of such scenarioscoffee break dataset [36] , EGO-GROUP [4] , SALSA [134] , cocktail party [133] , GDet [19] , Synthetic [43] , Idiap Poster Data [63] , and FriendsMeet2 (GM2) [20, 22] . Beyond these, some researchers have used their own (private) datasets. In [38] and [115] , the authors experimented with such datasets where there is more than one group in the scene (the party data). Similarly, in [29] , the surveillance videos are used as data where there can be more than one group in the captured video. Table 8 classifies the literature on the basis of group/interaction detection scale. The multiple group detection scenario normally comes in exo-view based methods. Fig. 14 depicts three scenarios from a renowned dataset, EGO-GROUP [4] . Fig. 14a shows a single triangular formation with one outlier in an indoor environment. Fig. 14b depicts a viz-a-viz formation in an outdoor situation. Fig. 14c shows two groups, one triangular and one L-shaped formation in an indoor situation. Ego-vision scenes of groups and interactions are seen in 20 datasets, whereas 38 datasets have exo-vision or global view images of groups, 1 dataset has both of them, and camera-view is not known for 11 datasets. Fig. 13 gives a comprehensive idea about the taxonomy of datasets (training/testing) generally used in group/interaction and formation detection tasks. Dataset (Table 9 Table 8 . Classification based on group/interaction and formation detection scale. References Single group detection [58] , [47] , [57] , [111] , [92] , [80] , [40] , [97] , [130] , [109] , [16] , [64] , [35] , [94] , [73] , [125] , [62] , [113] , [3] , [90] , [68] , [116] , [118] , [131] , [2] , [75] , [144] , [82] , [88] , [85] , [86] , [117] , [79] , [61] , [12] , [91] , [78] , [143] , [84] , [108] , [49] , [44] , [74] , [46] , [41] , [100] , [42] , [65] , [45] , [69] , [67] , [112] , [66] , [140] , [15] , [95] , [18] Multi-group detection [124] , [93] , [96] , [103] , [107] , [105] , [50] , [6] , [127] , [145] , [141] , [129] , [104] , [9] , [8] , [116] , [121] , [7] , [119] , [54] , [60] , [29] , [71] , [114] , [77] , [115] , [126] , [23] , [81] , [128] , [11] , [59] , [31] , [102] , [48] , [142] , [30] , [38] , [92] , [37] , [146] The most important part of the formation or interaction detection framework (Fig. 4) is the evaluation methodologies. The conventional methods to compare methods and techniques in such vision tasks are accuracy and efficiency. The accuracy defines how accurately a method detects/predicts or recognizes an f-formation. The efficiency parameter relates to the real-timeliness aspect of the method. Apart from these the papers in the surveyed literature also speak about simulation-based evaluation and human experience study-based evaluations (for robotic applications specifically). Fig. 15 shows the simple taxonomy of evaluation methods for various group/interaction and formation detection methods or algorithms. Evaluation methods (Section 8) For human interaction analysis/ scene monitoring etc. Simulation based evaluation. This type of evaluation is conducted using simulation tools. 
The simulators have different features and the ability to simulate the real world in complex environments. A range of simulators are used in the surveyed literature -- Gazebo [96], RoboDK [106], and Webots [39]. Nowadays, researchers are also focusing on using Virtual Reality (VR) or Augmented Reality (AR) technologies for evaluation purposes. The evaluations are performed mainly to assess the perception of a virtual robot or an autonomous agent. The first question to be answered is how well a simulated robot (in a simulated environment) can perceive a group of simulated people involved in an interaction. Secondly, after detection, does the simulated robot join the group naturally without discomforting the simulated people (see Section 2.3 and Fig. 3)? Parameters like the stopping distance for the robot, its orientation and pose based on the perceived group pose/orientation, and the angle of approach depending on the group's angle and position should be considered. Extensive discussion of these factors after group/interaction detection by a robot or autonomous agent is out of the scope of this survey. Human experience study based evaluation. This type of evaluation is based on testing the detection methods using ego-vision robots or in a real scenario with human participants as evaluators. A questionnaire is provided to the human participants to rate the quality of the method being used by the robot in real scenarios. Questions and parameters similar to those of simulation-based evaluation can be considered in this case as well, but with a real robot perceiving human groups (who are also the evaluators) and interactions. In real-life scenarios, the groups are not static and tend to move when a member joins or leaves the group. Accordingly, the robot or autonomous agent must detect the changes in group formation, orientation, and pose to re-adjust itself in a natural and more human-like manner without causing any discomfort to other humans. Accuracy/Efficiency evaluation without using robots or simulators. This kind of evaluation is based on the accuracy or efficiency aspects of the methods, which are not tested in a real environment or by using robots. Here, the focus is mainly on evaluating the computational aspects of the methods/algorithms without evaluating their usability in real-life applications like robotics. However, applications like human behavior/interaction analysis, scene monitoring, and surveillance depend entirely on such evaluation. Table 10 classifies the surveyed papers on the basis of the evaluation strategy adopted. It also shows the descriptions/names of simulators in the simulation-based category. Table 10. Classification based on evaluation methods and strategy.
References Simulation based evaluation (robotic simulators/virtual environment) 2D grid environment simulated in Greenfoot [86] , simulated the process of deformation of contours using P-spaces represented by Contours of the Level Set Method [71] , [45] , Robot Operating System (ROS) implementation of PedSim [104] , a simulated avatar embodied confederate [97] , Gazebo [96] , a simulator using Unity 3D game engine [47] Human experience study based evaluation (with real robots) [12] , [61] , [75] , [82] , [85] , [117] , [143] , [49] , [51] , [126] , [81] , [15] , [141] , [94] , [28] , [109] , [73] , [130] , [111] , [105] , [80] , [93] , [95] , [57] , [124] , [18] Accuracy/Efficiency evaluation (without robot, only computation) [54] , [144] , [60] , [37] , [38] , [79] , [29] , [91] , [78] , [71] , [108] , [46] , [115] , [114] , [23] , [77] , [44] , [45] , [128] , [11] , [59] , [31] , [102] , [48] , [142] , [45] , [8] , [90] , [116] , [121] , [30] , [131] , [129] , [104] , [3] , [145] , [35] , [127] , [6] , [64] , [50] , [107] , [103] , [146] Group or interaction detection has seen vast applications in many areas of computer vision. Specifically speaking, with the emergence of robotics and AI, this domain has realized its true potential. In this paper, we categorize the application landscape into two broad areas: robotic applications and other vision applications. Further, these have been broken down into five groups as summarized in Table 11 . The robot vision implies the applications where the robot's camera is placed in an ego-centric view for finding the groups only, but there is no purpose of initiation of interaction with a human. In Human-robot interaction, f-formation detection is used to detect the group in order to participate in the interaction with Manuscript submitted to ACM 26 Barua et al. fellow human beings autonomously. In telepresence, a remote person uses the robot to interact with a group of people. In such a scenario, the semi-autonomous robot can detect the group and join them while the remote human operator can control the robot to adjust its positioning. Scene monitoring is useful for analyzing indoor or outdoor scenes with people interacting and forming groups and f-formations for various activities. On the other hand, human behavior and interaction analysis refer to the behavior between humans and how they are interacting based on the situation. Furthermore, visual analytics in big data has empowered the domain beyond imagination. People are trying to use these technologies in various aspects of life. In the current scenario of the Covid-19 pandemic, we can utilize this technology in monitoring social distancing in human groups and interactions as well. As already mentioned, telepresence robotics can be utilized by doctors/nurses and other medical staff to attend to patients in remote locations without physically being present. Application area (Section 9) Other vision application Indoor/Outdoor scene monitoring Human behaviour and interaction analysis Fig. 16 . Taxonomy for application areas for group/interaction and f-formation detection. Table 11 . Classification based on targeted application areas. 
Fig. 16. Taxonomy of application areas for group/interaction and f-formation detection, covering robotic applications and other vision applications such as indoor/outdoor scene monitoring and human behaviour and interaction analysis.
Table 11. Classification based on targeted application areas.
Drone/Robotic vision: [58], [64], [6], [127], [125], [145], [129], [113], [104], [8], [116], [121], [131], [119], [2], [75], [60], [29], [78], [114], [77], [115], [23], [46], [41], [100], [128], [11], [31], [102], [48], [123], [30], [37], [18], [146]
Human-robot interaction: [124], [93], [47], [57], [105], [50], [111], [92], [80], [130], [109], [16], [141], [35], [62], [68], [118], [75], [82], [88], [85], [86], [117], [61], [12], [143], [84], [49], [74], [51], [81], [52], [65], [142], [69], [67], [112], [66], [140], [15]
Telepresence/Teleoperation technologies: [96], [92], [94], [73]
Indoor/outdoor scene monitoring and surveillance: [58], [144], [97], [75], [82], [51], [99]
Human behaviour and interaction analysis: [103], [107], [40], [127], [104], [3], [9], [90], [7], [54], [144], [82], [38], [79], [61], [91], [71], [108], [44], [126], [42], [59], [45], [123]
Covid-19 and social distancing: scope of future research
The survey is organized around a generic framework of concern areas for group/interaction detection using the theory of f-formation (see Section 3 and Fig. 4). It addresses the various identified modules and concern areas, namely camera view and availability of other sensor data, datasets, feature selection, methods/techniques, detection capabilities/scale, evaluation methodologies, and application areas.
• The existing methods show an almost equal share of fixed rule-based (Fig. 9) and learning-based (Fig. 10) approaches (Tables 5 and 6; Tables 1 and 2 in the online Appendix). Researchers need to orient their work towards data-driven approaches using deep learning and reinforcement learning paradigms for handling complex situations. Meta-learning can also be explored on large-scale combined datasets. Complex detection scenarios may also be addressed using big data and visual analytics [99]. Apart from that, representing the data as a graph can resolve many performance issues in terms of accuracy and efficiency. Graph neural networks (GNNs), such as graph convolutional networks (GCNs), are also potential candidates for building appropriate models. Combinations of recurrent neural networks (RNNs), convolutional neural networks (CNNs), and/or graph recurrent networks (GRNNs) can also be explored to identify more accurate and promising detection models.
• Problems such as dynamism in groups (people joining or leaving dynamically, or changing position and orientation within the group) and occlusion of people pose serious challenges and limitations for the current state-of-the-art methods in terms of accuracy and efficiency. Researchers can consider devising rules based on reasoning and geometry to detect application-specific groups and interactions. A combination of rules and geometry-based reasoning along with data-driven models can also be explored to improve detection quality (a simplified sketch of such pairwise geometric reasoning is given after this list). Beyond detecting the group and its formation alone, methods should also detect the orientation and pose of the group itself (see [18]). This can provide a good approach direction and angle (natural and human-like) for robots joining the group.
• The major challenge with the datasets is their availability. Creating good-quality, large-scale vision datasets (for training and testing) is a mammoth task in itself, but it has its own research/academic merit. Only 20% of the surveyed datasets are publicly available.
• Detection capabilities need attention with respect to dynamic scenes (Fig. 12) as well as multiple groups (Fig. 14). The literature covers most aspects of detection (Tables 7 and 8). However, more research attention is required for occlusion, background clutter, and varying lighting conditions. Researchers can use reinforcement learning and deep learning models for these problems. Appropriate datasets also need to be prepared at a larger scale.
• Evaluation of the methods remains a challenge in the current literature (Table 10). Mostly, computational evaluation has been performed in terms of accuracy and efficiency (Table 6). But for a problem like group/interaction detection, human experience studies and/or simulation-based studies are important to establish the effectiveness of a method in applications such as robotics, telepresence, and social surveillance (see Section 8, Fig. 15, and Table 11). Researchers need to orient their studies in this respect as well. Moreover, most methods yield good accuracy, but achieving real-time performance while maintaining good accuracy remains a concern. Researchers can consider designing lightweight models for real-time detection of groups and interactions in dynamic scenes.
• Applications of this domain are widely seen in robotics, surveillance, human behavior analysis, and telepresence technologies (Table 11 and Fig. 16). The same technology can also be applied to Covid-19 related tasks such as monitoring compliance with social distancing norms.
• We have also discussed two types of camera views, ego- and exo-views (Fig. 6 and Table 4). Ego vision is used predominantly for robotics-related applications. Methods using ego-view cameras as input are fewer than those using exo-view cameras, mainly because of the scarcity of public ego-view datasets for training models. Researchers can also direct their efforts towards detection models built on hybrid systems of camera views and sensors. Visual and other forms of input can be combined for better detection and prediction. Various combinations of camera views and positions can be experimented with for better scene capture and for creating robust datasets for learning models.
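As referenced in the second bullet above, the following is a simplified sketch of rule- and geometry-based pairwise reasoning: two people are linked when they stand within a proxemic threshold and both roughly face the midpoint between them (a crude proxy for a shared O-space), and groups are then read off as connected components of the resulting graph. The distance and facing thresholds, and all names, are our own illustrative assumptions rather than values or code from the surveyed methods.

```python
import math

def interacting(p, q, max_dist=1.5, max_facing_deg=60.0):
    """p, q: (x, y, body_orientation_rad). True if the pair plausibly interacts."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    if math.hypot(dx, dy) > max_dist:
        return False

    def facing_error(person, tx, ty):
        # Absolute angular difference between body orientation and the
        # direction toward the target point, in degrees.
        target = math.atan2(ty - person[1], tx - person[0])
        diff = (target - person[2] + math.pi) % (2 * math.pi) - math.pi
        return abs(math.degrees(diff))

    mx, my = (p[0] + q[0]) / 2, (p[1] + q[1]) / 2
    return (facing_error(p, mx, my) <= max_facing_deg
            and facing_error(q, mx, my) <= max_facing_deg)

def find_groups(people):
    """Groups as connected components of the pairwise 'interacting' graph."""
    n, groups, seen = len(people), [], set()
    adj = {i: [j for j in range(n) if j != i and interacting(people[i], people[j])]
           for i in range(n)}
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        groups.append(comp)
    return groups

# Two people facing each other form a group; a third person looking away
# and standing farther off remains alone.
people = [(0.0, 0.0, 0.0), (1.0, 0.0, math.pi), (3.0, 3.0, 0.0)]
print(find_groups(people))
```

A hybrid system could use such geometric links as weak labels or priors and let a learned model refine them for crowded, occluded, or dynamic scenes.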
(Figure: pictorial summary of the modules of this survey framework, including Methods & Techniques and Datasets.)
With the emergence of computer vision, robotics, and multimedia analytics, computational systems and autonomous agents are expected to show increasingly human-like behavior and capability as artificial intelligence progresses. One of the most important problems in this domain remains group/interaction detection and prediction using f-formation. Although some research has been conducted over the last decade, much more progress is still envisioned. This survey generalizes the problem of group/interaction detection through a framework that also serves as its organizing theme. The article presents a comprehensive view of all the concern areas of this framework, including definitions of the various f-formations, input camera views and sensors, datasets, feature selection, algorithms, detection capability and scale, quality of detection, evaluation methodologies, and application areas. The article also discusses the limitations, challenges, and future scope of research in this domain.
REFERENCES
Human activity analysis: A review
Towards social interaction detection in egocentric photo-streams
With whom do I interact? Detecting social interactions in egocentric photo-streams
AImageLab datasets
AImageLab datasets
Multimodal analysis of free-standing conversational groups
Salsa: A novel dataset for multimodal group behavior analysis
Analyzing free-standing conversational groups: A multimodal approach
Understanding social relationships in egocentric vision
Understanding Social Relationships in Egocentric Vision. Pattern Recogn
From ego to nos-vision: Detecting social relationships in first-person views
Navigation for human-robot interaction tasks
Monocular 3D pose estimation and tracking by detection
2021. Collective Activity Dataset
Group vs. individual comfort when a robot approaches
How should a robot approach two people
Towards an Integrated Approach to Crowd Analysis and Crowd Synthesis: a Case Study and First Results
Let me join you! Real-time F-formation recognition by a socially aware robot
Friends Meet Dataset
Decentralized particle filter for joint individual-group tracking
Social interactions by visual focus of attention in a three-dimensional environment
Guiding Visual Surveillance by Tracking Human Attention
Unsupervised learning of a scene-specific coarse gaze estimator
Social behavior recognition in continuous video
The MatchNMingle dataset
Recent trends in social aware robot navigation: A survey
We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video
Group activity recognition with group interaction zone based on relative distance between human objects
Discovering groups of people in images
Discovering Groups of People in Images
CMU. 2021. CMU Panoptic Dataset
UoL 3D Social Interaction Dataset
Automatic detection of human interactions from rgb-d data for social activity classification
CoffeeeBreak dataset
Social interaction discovery by statistical analysis of F-formations
Towards computational proxemics: Inferring social relations from interpersonal distances
Webots-Open Source Robot Simulator
F-formation and social context: how spatial orientation of participants' bodies is organized in the vast field
Social F-formation in blended reality
Automatic detection of social behavior of museum visitor pairs
Social interaction detection using a multi-sensor approach
Recovering social interaction spatial structure from multiple first-person views
Temporal encoded F-formation system for social interaction detection
Learning Socially Appropriate Robot Approaching Behavior Toward Groups using Deep Reinforcement Learning
Group tracking and behavior recognition in long video surveillance sequences
Social behavior recognition using body posture and head pose for human-robot interaction
Detecting conversing groups using social dynamics from wearable acceleration: Group size awareness
The network robot system: enabling social human-robot interaction in public spaces
Social path planning: Generic human-robot interaction framework for robotic navigation tasks
Estimating Face orientation from Robust Detection of Salient Facial Structures. FG Net Workshop on Visual Observation of Deictic Gestures
Detecting social situations from interaction geometry
A system for the notation of proxemic behavior
The Hidden Dimension. Anchor Books
Recognizing F-Formations in the Open World
Comparing F-Formations Between Humans and On-Screen Agents
Detecting conversing groups with a single worn accelerometer
Detecting f-formations as dominant sets
Investigating Spatial Relationships in Human-Robot Interaction
Effects of the display angle on social behaviors of the people around the display: A field study at a museum
Idiap Research Institute. 2021. Idiap Poster Data
Panoptic studio: A massively multiview system for social interaction capture
Cultural differences in how an engagement-seeking robot should approach a group of people
Spatial play effects in a tangible game with an f-formation of multiple players
Abstracting people's trajectories for social robots to proactively approach customers
How Can a Tour Guide Robot's Orientation Influence Visitors' Orientation and Formations
Robot etiquette: How to approach a pair of people
Conducting interaction: Patterns of behavior in focused encounters
Detection of social interaction from observation of daily living environments
Human-computer Interaction-INTERACT 2013: 14th IFIP TC 13 International Conference
Towards a Method to Detect F-formations in Real-Time to Enable Social Robots to Join Groups
Human-aware robot navigation: A survey
Reconfiguring spatial formation arrangement by robot body orientation
Crowds by Example
Finding group interactions in social clutter
Cross-device interaction via micro-mobility and f-formations
Using F-formations to analyse spatial patterns of interaction in physical environments
Where Should Robots Talk? Spatial Arrangement Study from a Participant Workload Perspective
Automated proxemic feature extraction and behavior recognition: Applications in human-robot interaction
An experimental design for studying proxemic behavior in human-robot interaction
MMLAB. 2021. Multimedia Signal Processing and Understanding Lab
How do people walk side-by-side? Using a computational model of human behavior for a social robot
Human-robot proxemics: physical and psychological distancing in human-robot interaction
Towards modelling spatial cognition for intelligent agents
From body space to interaction space: modeling spatial cooperation for virtual humans
BEHAVE: Computer-assisted prescreening of video streams for unusual activities
Connecting in the kitchen: an empirical study of physical interactions while cooking together at home
Cooking together: a digital ethnography
Join the Group Formations using Social Cues in Social Robots
A Novel Method for Estimating Distances from a Robot to Humans Using Egocentric RGB Camera
Estimating F-Formations for Mobile Robotic Telepresence
Estimating Optimal Placement for a Robot in Social Group Interaction
F-Formations for Social Interaction in Simulation Using Virtual Agents and Mobile Robotic Telepresence Systems
Who, me? How virtual agents can shape conversational footing in virtual reality
You'll never walk alone: Modeling social behavior for multi-target tracking
Multimedia big data analytics: A survey
F-formation discovery in static images
An Adaptation Framework for Head-Pose Classification in Dynamic Multi-view Scenarios
Exploring transfer learning approaches for head pose classification from multi-view surveillance images
Towards automatic estimation of conversation floors within F-formations
Modeling the dynamics of individual behaviors for group detection in crowds using low-level features
Robot-Centric Human Group Detection
Simulate Robot Applications
Detecting F-formations & Roles in Crowded Social Scenes with Wearables: Combining Proxemics Dynamics using LSTMs
Real time detection of social interactions in surveillance video
Stopping distance for a robot approaching two conversating persons
Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities
Replicating natural approaching behavior of humans for improving robot's approach toward two persons during a conversation
How to approach humans? Strategies for social robots to initiate interaction
F-formations and collaboration dynamics study for designing mobile collocation
Group detection in still images by F-formation modeling: A comparative study
Multi-scale F-formation discovery for group detection
F-formation detection: Individuating free-standing conversational groups in images
Spatial formation model for initiating conversation
Measuring communication participation to initiate conversation in human-robot interaction
Socially constrained structural learning for groups detection in crowd
Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT
Jointly estimating interactions and head, body pose of interactors from distant social scenes
Perceiving the person and their interactions with the others for social robotics-a review
Robot perception of human groups in the real world: State of the art
Robot-centric perception of human groups
It's Not How You Stand, It's How You Move: F-formations and Collaboration Dynamics in a Mobile Learning Game
Social Cues in Group Formation and Local Interactions for Collective Activity Analysis
Joint Estimation of Human Pose and Conversational Groups from Social Scenes
A game-theoretic probabilistic approach for detecting conversational groups
Detecting conversational groups in images and sequences: A robust game-theoretic approach
Towards robot autonomy in group conversations: Understanding the effects of body orientation and gaze
Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation
Computational Vision and Geometry Lab (CVGL) at Stanford. 2021. Discovering Groups of People in Images
Group Human-Robot Interaction: A Review
A study on the social acceptance of a robot in a multi-human interaction using an F-formation based motion model
Recognizing conversation groups in an open space by estimating placement of lower bodies
Establishment of spatial formation by a mobile guide robot
Space speaks: towards socially and personality aware visual surveillance
Beyond F-Formations: Determining Social Involvement in Free Standing Conversing Groups from Static Images
On Social Involvement in Mingling Scenarios: Detecting Associates of F-Formations in Still Images