Evaluating the Impact of Tiled User-Adaptive Real-Time Point Cloud Streaming on VR Remote Communication
Shishir Subramanyam, Irene Viola, Jack Jansen, Evangelos Alexiou, Alan Hanjalic, Pablo Cesar
Date: 2022-05-10 (arXiv:2205.04906v1 [cs.MM])

Abstract: Remote communication has rapidly become a part of everyday life in both professional and personal contexts. However, popular video conferencing applications present limitations in terms of quality of communication, immersion and social meaning. VR remote communication applications offer a greater sense of co-presence and mutual sensing of emotions between remote users. Previous research on these applications has shown that realistic point cloud user reconstructions offer better immersion and communication as compared to synthetic user avatars. However, photorealistic point clouds require a large volume of data per frame and are challenging to transmit over bandwidth-limited networks. Recent research has demonstrated significant improvements to perceived quality by optimizing the usage of bandwidth based on the position and orientation of the user's viewport with user-adaptive streaming. In this work, we developed a real-time VR communication application with an adaptation engine that features tiled user-adaptive streaming based on user behaviour. The application also supports traditional network adaptive streaming. The contribution of this work is to evaluate the impact of tiled user-adaptive streaming on quality of communication, visual quality, system performance and task completion in a functional live VR remote communication system. We perform a subjective evaluation with 33 users to compare the different streaming conditions with a neck exercise training task. As a baseline, we use uncompressed streaming requiring ca. 300Mbps; our solution achieves similar visual quality with tiled adaptive streaming at 14Mbps. We also demonstrate statistically significant gains to the quality of interaction and improvements to system performance and CPU consumption with tiled adaptive streaming as compared to the more traditional network adaptive streaming.

Remote communication and collaboration have rapidly become a necessity in a globalized and connected world. Video conferencing applications have become ubiquitous in everyday life in both professional and personal environments. Notwithstanding their popularity, it is estimated that travel for the purpose of in-person communication is responsible for roughly eight percent of US energy consumption [35]. With recent events like the Covid-19 global pandemic, there is an increased need for applications that can deliver a greater sense of co-presence and mutual sensing of emotions in remote communication. Current video conferencing solutions have clear limitations in this regard [7, 26, 31, 55]. Immersive Virtual Reality (VR) applications offer an increased sense of presence and immersion. These applications have emerged as a promising alternative for remote communication and telepresence [6, 9, 18, 24, 32, 36, 52, 54]. They allow users to employ both verbal and non-verbal communication in a shared virtual space. In such applications, users can be embodied in the virtual space either using avatars or real-time photorealistic 3D reconstructions, typically captured using depth sensors.
Previous work in the field has demonstrated that realistic user reconstructions improve immersion and communication [23, 33] as compared to avatars. Among the different 3D formats, point clouds have emerged as a popular representation for user reconstructions, as they are relatively easy to acquire in real time using consumer depth sensors [9, 18]. This format represents the object's geometry as an unorganized collection of surface point coordinates, with additional attributes such as color provided at each point location. Point clouds are generally resilient to noise and do not incur the additional computational overhead of triangulating mesh faces, making them suitable for real-time applications. Owing to their unorganized nature, they are also easy to partition into non-overlapping segments. However, photorealistic point clouds present a large volume of data per frame that requires real-time compression to transmit over bandwidth-limited networks [29, 51]. To alleviate this requirement, recent research has looked into adapting the point cloud stream to the user's viewport location and orientation in order to optimize how the available bandwidth is spent. This is done by prioritizing objects or surfaces facing the viewer and reducing the bandwidth wasted on surfaces or objects that are occluded or outside the viewport [8, 14, 37, 47, 49, 51]. However, this approach to adaptive streaming has not yet been evaluated for live communication with real-time point cloud reconstructions. In this work, we implement and evaluate a two-user social VR system with real-time adaptive streaming of user reconstructions, shown in figure 1. We set out to assess the impact of tiled adaptive streaming on the quality of communication, visual quality and subjective task-related performance. We constructed a social VR pipeline extending the system proposed in [18] with network adaptive and tiled adaptive streaming. Two confederate users were recruited and trained to play the role of a trainer in every experiment session, while 33 users (16 females, 17 males) were recruited to play the role of trainee. The participants were asked to learn and perform three neck exercises during the session. The contribution of this work is to evaluate and compare tiled adaptive streaming (TA) with traditional network adaptive streaming (NA) and baseline uncompressed streaming in a functional live VR remote communication system. We propose and employ a novel evaluation methodology using a training task to perform the assessment. We address the following four research questions for VR remote communication:
• R1: How does tiled user-adaptive point cloud streaming impact quality of interaction/quality of communication (QoI)?
• R2: How does tiled user-adaptive point cloud streaming impact the experience of performing a training task?
• R3: How does tiled user-adaptive point cloud streaming impact the perceived quality of the remote user reconstruction?
• R4: What is the computational overhead of using tiled adaptive point cloud streaming?
From our results, we observe statistically significant improvements to QoI (R1). We observe no statistically significant change to task experience (R2). We observe significant improvements to visual quality (R3); at 14Mbps we observe visual quality similar to uncompressed point clouds streamed at ca. 300Mbps. We also see a reduction in CPU utilization and an improvement to playback performance (R4).
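To make the representation described above concrete, the following is a minimal sketch (not the paper's implementation; array names and shapes are illustrative assumptions) of a point cloud frame held as parallel arrays, and of the index-based tiling that its unorganized nature permits:

```python
import numpy as np

# A point cloud frame: an unorganized set of surface points with per-point
# color attributes, ca. 130K points for the live captures in this paper.
num_points = 130_000
positions = np.zeros((num_points, 3), dtype=np.float32)  # x, y, z coordinates
colors = np.zeros((num_points, 3), dtype=np.uint8)       # r, g, b per point

# Because points are unorganized, partitioning into non-overlapping segments
# is a simple index split, e.g. by the sensor that contributed each point:
contributing_sensor = np.zeros(num_points, dtype=np.int8)
tiles = [np.flatnonzero(contributing_sensor == s) for s in range(3)]
```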
We validated the communication system and checked that the training task provides coherent results in the evaluation. Prior work in this space includes VRComm, a web-based social VR communication system using photo-realistic user reconstructions that was evaluated using both simulations and subjective studies. Jansen et al. [18] proposed a pipeline for volumetric videoconferencing using low latency DASH with photo-realistic point cloud user reconstructions. In this work, we extend this system design with network adaptive (NA) and tiled user-adaptive (TA) streaming. We transmit tiled point cloud user reconstructions at fixed target bitrates to assess the experience without the influence of a volatile network. Point cloud compression has received significant research attention in recent years with the launch of two new MPEG compression standards [44]. The V-PCC standard codec for dynamic point clouds projects point cloud geometry and attributes onto separate 2D patches that are then packed into video tracks along with an occupancy map. These video tracks are then encoded using legacy video codecs, making this approach suitable for relatively dense and uniform distributions of points. The G-PCC standard codec uses an octree space partitioning structure to code geometry, optionally combined with an additional surface reconstruction step using the TriSoup approach [38]. G-PCC also includes several modules for attribute coding; the lowest complexity coder uses the Region Adaptive Hierarchical Transform (RAHT) [4]. This codec is targeted at irregular, sparse distributions of points, making it suitable for live captured point clouds. However, both codecs have high encoding complexity, making them unsuitable for real-time communication. At the start of the MPEG standardization activity, an anchor codec proposed by Mekuria et al. [32] was introduced. This codec uses octree occupancy to code geometry, and scans attributes to map them to a 2D grid, maximizing correlations amongst co-located points, before encoding them with legacy JPEG image compression. This approach offers low encode and low decode complexity, making it suitable for real-time, framerate-sensitive applications such as VR remote communication. In this work, we use the anchor codec to encode point cloud tiles at multiple quality levels in real time before streaming. Streaming. Initial works on adaptive streaming of point clouds used entire point cloud objects as the basic unit of bandwidth allocation in scenes containing multiple point clouds. The authors of [13, 14] propose DASH-PC for dynamic, adaptive, view-aware point cloud streaming. They propose three spatial subsampling techniques to create multiple representations of point cloud objects in a scene. The density of each object representation is used by the client for bitrate allocation based on human visual acuity. Van der Hooft et al. [51] propose PCC-DASH, a standards-compliant means for HTTP adaptive streaming. They present three heuristics based on the user's viewport and distance to the object to allocate bitrate to different objects in the scene. Different ranking metrics and bitrate allocation heuristics had to be selected for different scenes and user navigation paths. Another approach used in previous work is to split each point cloud object into tiles that are then used as the unit of bandwidth allocation. Park et al. [37] define a utility per tile based on the user's proximity, point cloud surface quality and display device resolution.
To account for interactions, they propose a window-based design for the Client Buffer Manager with greedy utility maximization. This type of rasterization and pixel-occupancy-based approach is not suitable for our use case, as computing it at every frame is computationally expensive. He et al. [11] propose view-dependent streaming over hybrid networks. Each point cloud frame is projected onto the six faces of a bounding cube, with a color and a depth video created per face. The videos are transmitted using digital broadcasting. The user can request videos that correspond to particular faces of the cube in high quality from the edge node of a bidirectional broadband network, reconstructing the point cloud from the downloaded depth and color videos at the receiver end. This approach requires a redundant extra reconstruction step at each receiver. We instead perform the reconstruction on the sender side in order to generate a self-view to embody the user, and transmit the reconstruction to all receivers. Li et al. [27] propose a joint communication and computational resource allocation framework to stream and decode pre-recorded point clouds. They also propose a QoE metric to guide tile selection based on the user's viewport, distance to the tile and available quality levels. Lee et al. [25] propose GROOT, a real-time streaming system that reduces decoding overhead by dividing the point cloud into cells defined by the leaf nodes of an octree represented in a parallel-decodable tree. Han et al. [10] propose ViVo using a similar approach, and employ machine learning models to predict viewport movement. Liu et al. [28] follow a similar approach; they include an uncompressed base layer and use fuzzy-logic-based quality selection. This type of approach, which uses the leaf nodes of the octree as an enhancement layer, is currently not suitable for real-time systems, as it adds an extra surface orientation estimation step that introduces additional delays in the pipeline. Subramanyam et al. [49] build on the ideas presented in PCC-DASH [51] to tile point cloud content using low complexity surface estimation suitable for framerate-sensitive real-time applications. They performed an objective quality evaluation using prerecorded navigation paths and image distortion metrics. In this work, we build on their approach: we create tiles based on surfaces visible to multiple depth sensors and estimate their orientation using the transformation matrix associated with each sensor. Quality assessment for remote communication is usually conducted using subjective user studies that are either passive or active. Passive tests involve asking users to rate prerecorded clips of content. This approach to evaluation is more suited for standardized testing of codecs with offline content and has limited ecological validity for remote communication [15, 43]. Active tests involve multiple remote participants being placed in an interactive live communication system. The International Telecommunication Union published recommendations to define evaluation methods for quantifying the impact of terminal and communication link performance on point-to-point audiovisual communication [16]. The recommendation contains sample tasks such as name guessing, story comparison, picture comparison, object description and building blocks. Schmitt et al. [43] use the building blocks task to develop and evaluate personalized quality of experience metrics for multiparty video conferencing at varying bitrates. Smith et al.
[46] compare face-to-face communication with embodied and unembodied remote VR communication. They propose a task involving negotiating apartment layout and furniture placement based on blueprints. Li et al. [26] compare face-to-face, videoconferencing and Facebook Spaces VR communication in the context of a photo-sharing task. They found that Facebook Spaces is able to closely approximate face-to-face photo sharing. In general, these methods have been used to compare VR remote communication with other technologies and with face-to-face communication. The tasks proposed either focus on the audio quality or rely on external objects for the evaluation. In this work, we focus on evaluating adaptive streaming within VR remote communication. We define a new visually focused training task where participants are taught a neck exercise by a trainer and are asked to perform the exercise in order to complete the task. In order to evaluate communication, several questionnaires have been proposed in the literature. Toet et al. [50] propose the holistic mediated social communication (H-MSC) framework and associated questionnaire to evaluate the experience of both spatial presence and social presence. The framework is general enough to be used for any mediated social communication system. Slater et al. [45] and Witmer et al. [53] have proposed two popular questionnaires aimed at measuring presence in virtual environments. Kangas et al. [20] present a pragmatic task-related questionnaire that they use to evaluate VR interaction techniques in a rigid object manipulation task. Li et al. [26] propose a social VR questionnaire that evaluates Quality of Interaction/Communication (QoI), Presence/Immersion and Social Meaning. In this work, we combine the QoI part of this questionnaire with visual quality questions from [16] and task-related questions from [20]. The goal was to define a task that could be used to evaluate QoI, visual quality and task experience in VR remote communication. We needed to facilitate a conversation with a fixed general outline and with a focus on the visual aspect of communication. We considered the tasks presented in ITU-T P.920 [16], the building blocks task presented in [43], as well as the photo-sharing task presented in [26]. We found that these tasks are either heavily dependent on audio quality (story comparison, name guessing and object description) or require external objects that must either be live-captured along with the user or digitally represented in the virtual world with appropriate interaction tools (picture comparison, building blocks, photo sharing). These objects could occlude and distract from the visual quality of the user reconstruction. In order to assess the impact of tiled adaptive point cloud streaming, we chose to keep the audio quality consistent across sessions and manipulated the quality of the point cloud representation of the participants. Based on pre-trials with seven colleagues, we selected a training task where participants were asked to first learn and then perform a neck exercise. This allowed for a fixed general outline of the conversation and focused attention on the visual representation of the remote participant. The neck exercises were found not to induce severe motion sickness, as validated in section 4, and provided coherent results. In our experiment design, we use a confederate user as the trainer. In this approach, all participants (trainees) in the same condition are always paired with the same trainer.
This allows us to focus on the individual as the unit of study, and we mitigate social context as a confounding variable. In addition, we isolate the basic elements of communication by attempting to hold constant the behavior of one conversation partner. In order to adhere to the recommendations on using confederate users [22], we used an asymmetric communication channel. The trainers are always shown the same quality point clouds of the trainee (octree depth 9) in order to ensure that the confederate users are as naive as possible to the current experimental condition. In addition, the trainers were not briefed on the hypotheses associated with each experimental condition. Confederate behavior could be scripted, as these users were always the initiators and addressers: they provided the instructions on how to perform the neck exercise. We then evaluate the quality ratings only from the point of view of the trainees, in line with the recommendations in ITU-T P.1301 [15]. 3.2.1 System. The overall architecture and dataflow of the real-time VR remote communication system is shown in figure 1 with baseline, TA and NA streaming. This is an extension of the system presented in [18]. Apart from the point cloud delivery pipeline described here, the actual implementation also contains an audio delivery pipeline and a module for session management. The capture module interfaces with three Azure Kinect depth sensors, as shown in figure 1. The sensors are calibrated in advance to generate the transformation matrix, with intrinsics to combine color and depth as well as extrinsics to bring all sensors into a common coordinate system. The color and depth images from the sensors are then transformed and fused in order to reconstruct the point cloud. We set a target capture framerate of 15 fps based on pre-tests with colleagues, as this was the maximum achievable framerate at an acceptable baseline quality. The generated point cloud is first sent directly to the renderer in order to generate a self-view to embody the participant. In baseline uncompressed streaming, the point cloud is serialized and sent directly to the writer to forward to the receiver. For NA, the point cloud is sent to the encoder, where it is compressed to three quality levels and sent to the writer to prepare the adaptation set with the associated encoded sizes. For TA, the point cloud is split into tiles based on the contributing sensor. Each tile carries an orientation vector, derived from the transformation matrix of the sensor, and the centroid of its bounding box. The tiles are then fed to the compression module, which launches encoders in parallel for each tile and quality level. In this way, we create an adaptation set with multiple representations for each tile. In addition, we prepare a tile metadata structure that contains information on the number of tiles, metadata for each tile, the available quality levels and the associated encoded sizes. At the receiver, for baseline uncompressed streaming, the point cloud is sent directly to the renderer. For NA, the adaptation engine selects the highest possible quality within the available bandwidth budget; this is then decoded and sent to the renderer. We apply the bitrate budget per frame based on the target capture framerate of 15 frames per second (fps).
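Before turning to the TA receiver path, here is a minimal sketch of the tile metadata and per-frame budget just described. The paper specifies the contents of the metadata (per-tile orientation and centroid, quality levels, encoded sizes) but not its exact layout, so the field names below are assumptions:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TileRepresentation:
    quality_level: int   # e.g. the octree depth used to encode this tile
    encoded_bytes: int   # encoded size of this representation, used for budgeting

@dataclass
class TileMetadata:
    orientation: Tuple[float, float, float]  # forward vector of the contributing sensor
    centroid: Tuple[float, float, float]     # centroid of the tile's bounding box
    representations: List[TileRepresentation]

def per_frame_budget_bits(target_bitrate_bps: float, capture_fps: float = 15.0) -> float:
    """Bitrate budget applied per frame, derived from the target capture framerate."""
    return target_bitrate_bps / capture_fps

# e.g. 14 Mbps at 15 fps leaves roughly 933 kbit (ca. 117 kB) per frame for all tiles
budget = per_frame_budget_bits(14e6)
```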
For TA, the adaptation engine utilizes both the tile metadata from the sender and the receiver's interactions in the system, in terms of viewport position and orientation, to select an appropriate representation for each tile within the available bandwidth budget. The tiles are then decoded in parallel and sent to the synchronizer. The synchronizer module was developed to play back tiled point cloud sequences with tiles of varying sizes and quality in a consistent manner. The primary goal of the synchronizer is to play back tiles of the same frame together, with a secondary goal of playing back frames at the right time to match the received frame rate. The point clouds are then sent to the renderer. Finally, the renderer stores the point locations and colors in a vertex buffer and draws procedural geometry on the GPU. Points are rendered as camera-facing quads with a fixed offset based on the selected quality level. We use a modified version of the approach proposed in [49] in order to create tiles. The point cloud is split into tiles based on the contributing sensor. We use the forward vector of the contributing sensor to estimate the orientation of the tile surface. We also compute the centroid of the bounding box of the tile. Each tile $i$ thus has an orientation vector $\vec{n}_i$ and a bounding box centroid $c_i$. The current viewport of the user is defined by a position $p_v$ and an orientation vector $\vec{v}$. The utility of each tile is calculated as:

$$u_i = \begin{cases} |\vec{n}_i \cdot \vec{v}|, & \text{if tile } i \text{ is one of the two tiles closest to } p_v, \\ -\,\vec{n}_i \cdot \vec{v}, & \text{otherwise.} \end{cases}$$

We use the absolute value of the dot product to identify surfaces directly facing the user. In addition, to account for the position of the user, we ensure that the two tiles closest to the user always have a positive utility. In the next step, the calculated utility is used to allocate the bandwidth budget to each tile. In this work, we use the allocation strategies presented in [48, 51]. Based on pre-tests with colleagues, we found that uniform bitrate allocation achieved a higher median score and a lower spread of scores. We use the utility to rank the available tiles; the quality of each tile is then increased one step at a time in order of this ranking. The primary components determining the point cloud quality are the capture sensors, the codec configurations (the adaptation set), the target bitrate, the streaming condition and the display device used. In order to evaluate adaptive streaming, we vary the streaming condition and target bitrate. The other factors do not change dynamically over a session and are uninteresting for streaming optimization; they are held constant across all participants. In addition, the audio quality is maintained at the same level across all experimental sessions, using the Ogg container format and the Speex codec with an ultra-wide-band sampling rate. In order to conduct a pre-test and set the experimental conditions, we used the dataset published by Reimat et al. [42], sub-sampled to three cameras (1, 5, 6), as this most closely resembles our capture setup. Based on pre-tests on encode time and captured point count, we use three codec configurations of the MPEG anchor codec, shown in table 1. All codec configurations were encoded at JPEG QP 75. We also set two target bitrates, exclusively for the remote user point cloud reconstructions, at 7Mbps and 14Mbps, based on the approximate bandwidth requirement for full point clouds using the two highest quality levels from the pre-tests. We selected three neck exercises that were taught to all participants.
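A minimal sketch of this utility computation and greedy quality allocation follows, assuming tile orientations and centroids like those in the hypothetical TileMetadata sketch above; the system's actual code is not reproduced in the paper, so names and details are illustrative:

```python
import numpy as np

def tile_utilities(orientations, centroids, view_pos, view_dir):
    """Score tiles by how directly they face the viewer; the two tiles nearest
    the viewport position always receive a positive (absolute-value) utility."""
    orientations = np.asarray(orientations, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    view_dir = np.asarray(view_dir, dtype=float)
    dots = orientations @ view_dir                      # n_i . v for every tile
    utilities = -dots                                   # back-facing tiles score low
    nearest_two = np.argsort(np.linalg.norm(centroids - view_pos, axis=1))[:2]
    utilities[nearest_two] = np.abs(dots[nearest_two])  # keep the closest tiles positive
    return utilities

def allocate_quality(utilities, rep_sizes_bits, budget_bits):
    """Greedy allocation: raise tile quality one step at a time in utility order.
    rep_sizes_bits[t][q] is the encoded size of tile t at quality level q."""
    levels = [0] * len(utilities)                       # start all tiles at lowest quality
    spent = sum(sizes[0] for sizes in rep_sizes_bits)   # assume the lowest level is always sent
    order = np.argsort(utilities)[::-1]                 # highest utility first
    upgraded = True
    while upgraded:
        upgraded = False
        for t in order:
            q = levels[t]
            if q + 1 < len(rep_sizes_bits[t]):
                extra = rep_sizes_bits[t][q + 1] - rep_sizes_bits[t][q]
                if spent + extra <= budget_bits:
                    levels[t] = q + 1
                    spent += extra
                    upgraded = True
    return levels
```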
In the experiment, we evaluate three streaming conditions: baseline uncompressed, NA and TA. With three neck exercises and three streaming conditions, we used a Graeco-Latin square design to randomize and counterbalance the different levels of each variable (see the sketch after this paragraph). In order to avoid fatigue, we separated the target bitrates so that each participant only took part in three sessions at a fixed target bitrate. We recruited two confederate users to play the role of trainer, one for each target bitrate. We recruited 16 and 17 participants for the two target bitrates of 7Mbps and 14Mbps, respectively. Upon arrival, participants were led to the experiment room and were briefed about the purpose of the study, after which they were asked for written consent for data gathering. Participants were asked to provide some background information about themselves and to take an Ishihara test [3] for color perception. Participants were then asked to fill in the simulator sickness questionnaire before starting, and again after each experiment session. We then conducted a brief training session in which participants were shown the highest and lowest available quality of the remote user point cloud to serve as an anchor. Participants were then taught how to use the HMD controllers to teleport and navigate the virtual space, and were informed that during the experiment they were free to move about the virtual space. Participants then entered the first session. In each session there was a brief introduction by the trainer, a training stage where the trainer demonstrated the exercise technique, and finally a performance stage where participants were asked to repeat the exercise three times in order to complete the session. Each session took 2 to 3 minutes to complete. Participants were asked to fill in a questionnaire to report their experience after each session. Participants completed the experiment in ca. 30 minutes. Participants took part in the study in a separate room from the trainer. Each room had a workstation, an HTC Vive Pro Eye HMD (with controllers and base stations) and three Azure Kinect depth sensors. The configuration of the setup used by participants is shown in table 1. Figure 2 shows the setup used to capture each user and the resulting point cloud. The two labs were connected using a dedicated gigabit Ethernet connection in order to control the connection quality for the duration of the study. Each of the Azure Kinect cameras was set to use the NFOV unbinned depth mode with a resolution of 640x576 and the lowest supported color resolution of 1280x720. This was done because the color image is later mapped to the depth image in order to reconstruct the point cloud, similar to the method proposed in [42]. During each experiment session, system resource consumption was measured using the Resources Consumption Metrics (RCM) tool [34]. The conversation audio was recorded (audio-only capture) using the Open Broadcaster Software tool. After each condition, participants filled in a questionnaire about the experience they just had. The first six questions were related to QoI, from [26], with a 5-point Likert scale, to address R1. The next four questions were about the visual quality of the point cloud representation, taken from [16], with a 5-point ACR scale, to address R3. The last five questions were about task-related experience, from [20], with a 7-point Likert scale, to address R2. In addition, on the trainee node we record system resource consumption and log playback performance to address R4.
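As an illustration of the counterbalancing, a 3x3 Graeco-Latin square pairs streaming conditions and neck exercises so that each condition and each exercise appears once per row and column, and each (condition, exercise) pair occurs exactly once. The exact square used in the study is not specified, so the construction below is an assumption:

```python
conditions = ["baseline", "NA", "TA"]                    # Latin symbols
exercises = ["exercise 1", "exercise 2", "exercise 3"]   # Greek symbols

# Two mutually orthogonal 3x3 Latin squares: cell (r, s) assigns participant
# group r, session s a (condition, exercise) pair; every pair occurs once.
square = [
    [(conditions[(r + s) % 3], exercises[(2 * r + s) % 3]) for s in range(3)]
    for r in range(3)
]

for r, row in enumerate(square):
    print(f"group {r}: {row}")
```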
We evaluate system performance based on resource consumption, framerate and latency. The machine used for the evaluation is described in table 1. Resource consumption was logged at the trainee node for each experiment session using the RCM measurement tool [34]. This tool is a Windows application that captures CPU, GPU and memory usage for a given process at 1-second intervals. The results are shown in table 2. As expected, we see higher memory consumption for large uncompressed point clouds. At 7Mbps we observe a ca. 10% reduction in CPU usage, on account of the additional parallelization in encoding and decoding in TA as compared to NA, with similar GPU and memory utilization. At 14Mbps we observe similar results, with an 18% reduction in CPU usage. In addition, our VR application logs the capture-to-render latency and framerate while accounting for clock synchronization between the trainee and trainer nodes. The latency results are shown in figure 3. The self-view latency describes the time needed for reconstruction and rendering only; the self-view is rendered with a median latency of 75ms across all experiment sessions. We observe the largest range of latency for baseline uncompressed streaming, with the largest point clouds requiring ca. 300Mbps to transmit and render. At 7Mbps we observe similar latency across the two streaming conditions. However, at 14Mbps we observe a 74ms increase in median latency for NA. This is caused by the larger encode and decode times required for the highest quality point clouds in our adaptation set. In the case of TA, we gain some performance due to the parallel encoding and decoding of tiles and the generally smaller point clouds decoded at the receiver. The application runs at a near-consistent 90 fps with motion reprojection. The point cloud target capture framerate is set to 15 fps based on the capability of the system. On the receiver end, we observe a drop in median framerate to 10.4 fps for NA at 14Mbps, caused by the encode and decode times required for the highest quality full point clouds in our adaptation set. For the remaining streaming conditions, we generally observe similar performance of ca. 15 fps point cloud playback, with a larger range for uncompressed point clouds, as shown in figure 3. To summarize, addressing R4, we observe significant gains in playback performance (framerate and latency) by employing TA, with lower CPU usage as compared to NA. In this section of the questionnaire we included the QoI questions from [26]. These questions are meant to assess four types of experience: (1) feeling understood, (2) engaging in conversations, (3) sensing the other's emotions and (4) feeling comfortable in the environment. The overall QoI scores are obtained by adding up the scores for each of the six questions. We split the analysis by target bitrate (7Mbps and 14Mbps). For the 7Mbps case, a Shapiro-Wilk normality test on the entirety of the scores indicates that they do not follow a normal distribution (W = 0.9163, p = 0.003). We therefore use non-parametric statistical tools to perform an exploratory data analysis and check whether statistical differences can be found amongst the different streaming conditions. To compare the QoI across the different streaming conditions, we first conduct a Friedman test to check if any groups exist with statistically significant differences (χ² = 12.04, p = 0.0024). We then conduct a Wilcoxon signed-rank test with Bonferroni correction.
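This non-parametric pipeline (Shapiro-Wilk, then Friedman as omnibus test, then pairwise Wilcoxon with Bonferroni correction) recurs for each questionnaire section below. A minimal sketch with scipy, assuming hypothetical per-condition score arrays aligned by participant:

```python
import numpy as np
from scipy import stats

# Hypothetical data: one overall QoI score (sum of six 5-point items, range 6-30)
# per participant per condition; row i is the same participant in all conditions.
baseline, na, ta = np.random.default_rng(0).integers(6, 31, size=(3, 16))

# 1. Shapiro-Wilk: are the pooled scores normally distributed?
W, p_normal = stats.shapiro(np.concatenate([baseline, na, ta]))

# 2. Friedman: omnibus test for any difference between the paired conditions.
chi2, p_omnibus = stats.friedmanchisquare(baseline, na, ta)

# 3. Pairwise Wilcoxon signed-rank tests, Bonferroni-corrected for 3 comparisons.
pairs = {"TA vs NA": (ta, na), "base vs NA": (baseline, na), "base vs TA": (baseline, ta)}
for name, (a, b) in pairs.items():
    stat, p = stats.wilcoxon(a, b)
    print(name, "corrected p =", min(1.0, p * len(pairs)))
```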
The results are shown in table 3. We observe statistically significant differences in two of the comparisons. TA outperforms NA with a medium effect size (r = 0.39505). As expected, baseline uncompressed streaming outperforms NA with a large effect size (r = 0.467). On the other hand, we do not observe statistically significant differences between TA and baseline uncompressed streaming, indicating similar performance in terms of QoI. For the 14Mbps case, we observe statistically significant differences in all comparisons. TA outperforms NA with a medium effect size (r = 0.3780). Baseline uncompressed streaming outperforms NA with a large effect size (r = 0.5710) and outperforms TA with a medium effect size (r = 0.3710). In general, addressing R1, we observe that TA leads to statistically significant gains in QoI with respect to NA at both bitrates. In order to validate the three exercises we used, we checked whether they led to different QoI scores; we found no statistically significant differences using the Friedman test (χ² = 3.16, p = 0.206) at 7Mbps and (χ² = 2.28, p = 0.3198) at 14Mbps. In order to assess the visual quality, we include a question about the visual quality of the trainer's point cloud representation. Participants were asked to indicate the quality on a scale from 1 to 5 (1-Bad, 2-Poor, 3-Fair, 4-Good and 5-Excellent). We analyzed these scores separately for each target bitrate. For the 7Mbps case, a Shapiro-Wilk normality test on the scores indicates that they do not follow a normal distribution (W = 0.8106, p < 0.001). To compare the remote user visual quality across the different streaming conditions, we first conduct a Friedman test to check if any groups exist with statistically significant differences (χ² = 9.8, p = 0.0074). We then conduct a Wilcoxon signed-rank test with Bonferroni correction. The results are shown in table 5. We observe statistically significant differences in two pairwise comparisons. TA outperforms NA with a medium effect size (r = 0.4570). Baseline uncompressed streaming outperforms NA with a medium effect size (r = 0.4520). We do not observe statistically significant differences between TA and baseline, indicating similar performance in terms of visual quality. For the 14Mbps case, we observe statistically significant differences in two of the pairwise comparisons. TA outperforms NA with a medium effect size (r = 0.4830). Baseline uncompressed streaming outperforms NA with a large effect size (r = 0.5390). We do not observe statistically significant differences between TA and baseline uncompressed streaming, indicating similar performance in terms of visual quality at 14Mbps. The distribution of the scores is shown in figure 4. Addressing R3, we observe significant gains in perceived visual quality by employing TA over NA; at 14Mbps we even observe the same median score as baseline uncompressed streaming, which required ca. 300Mbps. For the task-related experience questions, we used the questionnaire presented in [20]. The questions are meant to assess the participants' confidence in using the system, and whether the system was natural and easy to use while performing the task. We compute an overall score by adding up the five questions from this section. We split the analyses by target bitrate. A Shapiro-Wilk normality test on the scores revealed that they are not normally distributed: (W = 0.9019, p = 0.0013) for 7Mbps and (W = 0.9653, p = 0.014) for 14Mbps.
We then checked whether there were any groups with statistically significant differences using the Friedman test (χ² = 2.1, p = 0.3504) at 7Mbps and (χ² = 2.56, p = 0.278) at 14Mbps. We found no statistically significant differences amongst the different streaming conditions at either target bitrate. This indicates that participants were able to adapt their behavior to compensate for changes in the point cloud quality, and were able to complete the task within the same time regardless. We observe no gains in task experience by employing TA (R2). This can be explained by the relative simplicity of completing our training task as compared to the tasks used in other works [16, 26, 43]. Simulator Sickness Questionnaire. We calculate the total severity of cybersickness based on the SSQ questionnaire for all 33 participants across the three experiment sessions. We observe low post-exposure total severity scores for cybersickness, as defined in [1, 21]. In our pre-trials we observed a higher median score and a lower spread of scores with uniform tile allocation. This differs from the results in [49] with a prerecorded dataset, where hybrid tile allocation was shown to yield a higher perceived quality. The point clouds used in that study were captured offline; they were dense and voxelized, with ca. 1 million points per frame, and the adaptation set comprised 30 quality levels. In our study, we used live captured real-time point clouds with ca. 130K points and an adaptation set of 3 quality levels defined by octree depths. Further real-world studies need to be conducted to better understand the impact of these optimization techniques and the acceptable quality differences amongst visible tiles for real-time reconstructions. In general, tiled adaptive streaming techniques have received significant research attention for omnidirectional videos [5, 40, 56] and point clouds [12, 30, 39, 41]. However, further study is required for live real-time human point cloud reconstructions. Although the tiles are independently decodable, their quality cannot be optimized in isolation based only on the available bandwidth and viewport. Some participants reported that seeing artifacts in the body extremities and in the face of the reconstruction was unpleasant. Further study into body part segmentation and quality perception is required to optimize tiling and tile selection strategies for humans engaging in conversation, as compared to prerecorded content, objects and scenes. There is a need for new standardized tasks that can be used to evaluate VR remote communication. The existing ITU recommendations are insufficient to handle the novel interaction techniques and immersive content inherent to VR communication. In this work, we utilized a neck exercise training task, as it was visually focused and the interaction was repeatable with a confederate trainer. Further study into other use cases and scenarios is required to evaluate emerging VR communication systems. To this end, ITU-T has recently launched a new activity [17] to develop assessment methods for extended reality meetings. In our study, participants were trained on how to navigate the virtual space with a controller-based teleport. During the session, we did not force participants to move, in order to keep the interaction more natural. Future studies on VR remote communication should account for this trade-off.
Movement within the scene is important for evaluating the visual quality of adaptation from different view angles, but scripting or forcing these movements tends to break the flow of the interaction, making it difficult to evaluate the quality of communication. In this paper, we presented a VR remote communication system with tiled adaptive real-time point cloud streaming using commodity hardware. We presented an evaluation framework and a training task to evaluate the impact of adaptive streaming on QoI, visual quality and task-related experience. At 14Mbps, our system was able to achieve visual quality similar to uncompressed streaming at ca. 300Mbps. We also demonstrated statistically significant improvements to QoI as compared to traditional network adaptive streaming, as well as improvements to playback performance, with a 10% to 18% reduction in CPU usage.

REFERENCES
• On the Usage of the Simulator Sickness Questionnaire for Virtual Reality Research
• PC-MCU: Point Cloud Multipoint Control Unit for Multi-User Holoconferencing Systems
• The Ishihara Test for Color Blindness
• Compression of 3D Point Clouds Using a Region-Adaptive Hierarchical Transform
• Forget Video Conferencing - Host Your Next Meeting in VR
• Interactive Volumetric Video from the Cloud
• VRComm: An End-to-End Web System for Real-Time Photorealistic Social VR Communication
• ViVo: Visibility-Aware Mobile Volumetric Video Streaming
• View-Dependent Streaming of Dynamic Point Cloud over Hybrid Networks
• From Capturing to Rendering: Volumetric Media Delivery with Six Degrees of Freedom
• Adaptive Rate Allocation for View-Aware Point-Cloud Streaming
• Dynamic Adaptive Point Cloud Streaming
• ITU-T P.1301: Subjective Quality Evaluation of Audio and Audiovisual Multiparty Telemeetings
• ITU-T P.920: Interactive Test Methods for Audiovisual Communications
• QoE Assessment of eXtended Reality (XR) Meetings, International Telecommunication Union
• A Pipeline for Multiparty Volumetric Video Conferencing: Transmission of Point Clouds over Low Latency DASH
• RoomAlive: Magical Experiences Enabled by Scalable, Adaptive Projector-Camera Units
• Trade-Off between Task Accuracy, Task Completion Time and Naturalness for Direct Object Manipulation in Virtual Reality
• Simulator Sickness Questionnaire: An Enhanced Method for Quantifying Simulator Sickness
• Language in Dialogue: When Confederates Might Be Hazardous to Your Data
• The Effect of Avatar Realism in Immersive Social Virtual Realities
• Project Starline: A High-Fidelity Telepresence System
• GROOT: A Real-Time Streaming System of High-Fidelity Volumetric Videos
• Measuring and Understanding Photo Sharing Experiences in Social Virtual Reality
• Joint Communication and Computational Resource Allocation for QoE-Driven Point Cloud Video Streaming
• Fuzzy Logic-Based Adaptive Point Cloud Video Streaming
• Point Cloud Video Streaming: Challenges and Solutions
• Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue
• Design, Implementation and Evaluation of a Point Cloud Codec for Tele-Immersive Video
• Objective and Subjective Quality Assessment of Geometry Compression of Reconstructed 3D Humans in a 3D Virtual Room
• Open-Source Software Tools for Measuring Resources Consumption and DASH Metrics
• Facsimile Appearance to Create Energy Savings (FACES)
• Holoportation: Virtual 3D Teleportation in Real-Time
• Rate-Utility Optimized Streaming of Volumetric Media for Augmented Reality
• Dynamic Polygon Cloud Compression
• Dynamic Adaptive Streaming for Augmented Reality Applications
• An HTTP/2-Based Adaptive Streaming Framework for 360° Virtual Reality Videos
• Toward Practical Volumetric Video Streaming on Commodity Smartphones
• CWIPC-SXR: Point Cloud Dynamic Human Dataset for Social XR
• Towards Individual QoE for Multiparty Videoconferencing
• Emerging MPEG Standards for Point Cloud Compression
• Depth of Presence in Virtual Environments
• Communication Behavior in Embodied Virtual Reality
• Split Rendering for Mixed Reality: Interactive Volumetric Video in Action
• Comparing the Quality of Highly Realistic Digital Humans in 3DoF and 6DoF: A Volumetric Video Case Study
• User Centered Adaptive Streaming of Dynamic Point Clouds with Low Complexity Tiling
• Holistic Framework for Quality Assessment of Mediated Social Communication
• Towards 6DoF HTTP Adaptive Streaming Through Point Cloud Compression
• Projected Augmented Reality with the RoomAlive Toolkit
• Measuring Presence in Virtual Environments: A Presence Questionnaire
• Enabling Multi-Party 3D Tele-Immersive Environments with ViewCast
• A Design Space for Social Presence in VR
• Scalable 360° Video Stream Delivery: Challenges, Solutions, and Opportunities