key: cord-0485672-eqz9rdrc authors: Fujihashi, Takuya; Koike-Akino, Toshiaki; Chen, Siheng; Watanabe, Takashi title: Wireless 3D Point Cloud Delivery Using Deep Graph Neural Networks date: 2020-06-17 journal: nan DOI: nan sha: 9a484d5f6086962c99473d886b2661fa77f08209 doc_id: 485672 cord_uid: eqz9rdrc

In typical point cloud delivery, a sender uses octree-based digital video compression to send three-dimensional (3D) points and color attributes over band-limited links. However, digital-based schemes suffer from the cliff effect, where the 3D reconstruction quality becomes a step function of the wireless channel quality. To prevent the cliff effect under channel quality fluctuation, we have proposed a soft point cloud delivery scheme called HoloCast. Although HoloCast realizes graceful quality improvement according to the wireless channel quality, it requires large communication overheads. In this paper, we propose a novel scheme for soft point cloud delivery that simultaneously realizes better quality and lower communication overheads. The proposed scheme introduces an end-to-end deep learning framework based on graph neural networks (GNN) to reconstruct high-quality point clouds from their distorted observations under wireless fading channels. We demonstrate that the proposed GNN-based scheme can reconstruct clean 3D point clouds with low overheads by removing fading and noise effects.

Three-dimensional (3D) holographic displays [1, 2] have emerged as attractive interface techniques for reconstructing 3D perceptual scenes that provide full parallax and depth information for human eyes. 3D holographic displays can be used in many applications: entertainment, virtual training, and medical imaging. Specifically, such 3D holographic visualizations will play a more important role in the post-Coronavirus (COVID-19) society because 3D data can realize high presence in remote conferencing [3] and healthcare [4].
For example, holographic data of doctors and medical imaging provide more interactive verbal guidance in tele-surgery [5]. The point cloud [6] is one of the data formats used to represent 3D scenes/objects on a holographic display [7]. We focus on wireless point cloud delivery systems, which send 3D point data to a remote display over wireless links to reproduce the corresponding 3D scenes/objects.

In contrast to conventional two-dimensional (2D) images, the 3D points in point cloud data are massive, non-ordered, and non-uniformly distributed in space. One of the major issues in point cloud delivery is how to compress and send such numerous, irregularly structured 3D points while preserving the original 3D scenes/objects. For example, when the number of 3D points is 800,000, the amount of traffic without any compression will be approximately 38 Mbits [8]. Large traffic causes low 3D reconstruction quality in point cloud delivery over wireless links with limited data rates.

[Fig. 1: (a) Holographic display [12]; (b) 3D modeling [13].]

For point cloud compression over wireless links, conventional schemes typically rely on digital encoding such as the point cloud library (PCL) [9, 10]. Specifically, a sender decomposes the point cloud into multiple 3D point sets by using octree decomposition [11], quantizes them, and applies entropy coding to generate the compressed bitstream. Here, the compression rate of the bitstream is adaptively selected according to the link capacity of the wireless channels. The compressed bitstream is then transmitted over the channels by using a channel coding and digital modulation scheme.

A successful high-quality delivery of point clouds over wireless links can realize high presence in video applications such as virtual reality and augmented reality on wireless devices, as shown in Fig. 1. However, the conventional schemes of point cloud delivery suffer from the following problems due to wireless channel unreliability and the nonlinear operations of data compression.
First, the encoded bitstream is highly vulnerable to bit errors [14]. When the channel signal-to-noise ratio (SNR) falls below a certain threshold, catastrophic errors that occur in the bitstream during communication can make point cloud restoration impossible. This phenomenon is called the cliff effect [15]. Second, the reconstruction quality does not improve even when the wireless channel quality improves, unless adaptive rate control of source and channel coding is performed in real time according to the rapidly fading channels. This is called the leveling effect. Third, quantization is a lossy process and its distortion cannot be recovered at the receiver. Finally, voxel-domain point cloud encoders [9, 10] have limited coding efficiency since they do not yield good energy compaction. Although conventional transform techniques, such as the discrete cosine transform (DCT), can be used even for point cloud data, they do not fully exploit the underlying irregular geometry of the 3D points.

To solve the above-mentioned issues, we have proposed HoloCast [16, 17] to realize graceful improvement of the 3D reconstruction quality as the wireless channel quality improves. Fig. 2(a) shows an overview of HoloCast. The key ideas of HoloCast are 1) skipping digital operations, i.e., quantization and entropy/channel coding, analogous to SoftCast [18], and 2) introducing graph signal processing (GSP) [19] to achieve better energy compaction. Specifically, HoloCast regards the 3D points as vertices in a graph, applies the graph Fourier transform (GFT) [20] to exploit the correlations between adjacent graph signals, and directly sends the GFT coefficients by using near-analog modulation [18]. However, the GFT-based coding in HoloCast needs a large communication overhead for decoding even with overhead reduction techniques [17]. Specifically, a sender needs to transmit the eigenvectors of the graph Laplacian matrix as metadata.
The main objective of our study is to extend HoloCast by introducing a new framework known as graph neural networks (GNN) [21] to realize high quality and low overheads. A GNN is a model for graph representation learning that allows the irregular geometric structure of graph data to be analyzed. We focus on end-to-end (E2E) deep learning, i.e., GNN-based autoencoders (GAE) [22]-[25], to encode 3D point clouds into a compressed representation. One benefit of the GAE is that it allows the graph signal to be reconstructed from a limited number of latent variables without requiring additional metadata. Fig. 2(b) shows an overview of the proposed scheme, where the GNN-based encoder compresses 3D points into latent variables, and the compressed variables are then directly mapped to transmission signals without relying on digital modulation schemes. The latent variables, which are distorted through wireless fading channels, are fed into another neural network decoder to reconstruct clean 3D points.

Our contribution is three-fold: 1) we verify that the proposed GNN-based point cloud delivery realizes better 3D reconstruction quality than the conventional HoloCast over fading channels, 2) we confirm that the proposed GNN-based encoder can reduce the amount of communication overhead by one order of magnitude, and 3) we demonstrate that adaptive channel precoding brings further quality improvement by means of the diversity gain of the rapidly fading channels.

Fig. 3 shows the proposed E2E point cloud encoder and decoder, which are designed to prevent the cliff/leveling effects in 3D scene reconstruction, to gracefully improve the reconstruction quality along with the channel quality, and to reduce the amount of overhead.

Encoder: The encoder part first regards the 3D points as a graph signal using a weighted and undirected graph $G = (V, E, W)$, where $V$ and $E$ are the vertex and edge sets of $G$, respectively.
$W$ is an adjacency matrix having positive edge weights, and the $(i, j)$th entry $W_{i,j}$ represents the weight of the edge connecting vertices $i$ and $j$. We consider the attributes of the point cloud, i.e., the 3D coordinates $p = [x, y, z]^T \in \mathbb{R}^{3 \times N}$, as signals that reside on the vertices of the graph ($N$ is the number of vertices). For simplicity, we consider a K-nearest-neighbor graph, where each 3D point connects to its $K$ closest 3D points. Here, we use a binary adjacency matrix whose entries are either 1 or 0 to indicate connectivity. The encoder maps the 3D coordinate attributes $p$ to $m$-dimensional and $L$-channel real-valued latent variables $z \in \mathbb{R}^{m \times L}$ by means of an encoding function $f_\theta$. The encoding function $f_\theta$ is parameterized using graph convolutional neural networks (GCNN) with weights $\theta$. The encoder consists of a series of graph convolutions, each followed by a leaky rectified linear unit (ReLU) activation function, Top-K pooling [26], and a normalization layer. The graph convolution layers extract the graph signal features, and the nonlinear activation function allows the network to learn a nonlinear mapping from the source signal to the coded signal. The Top-K pooling layer keeps the $K$ largest values in each channel to retain important features. The output of the last graph convolution layer is normalized such that $\|z\|^2 = mLP$, where $P$ denotes the average transmission power.

Wireless Link: The coded variables $z$ are sent over the communication channel by directly mapping them to in-phase and quadrature (I-Q) symbols $x$ for analog wireless transmission. The wireless channel, denoted by $\eta$, introduces stochastic distortion to the transmission symbols. To optimize the proposed scheme under wireless communications, the channel transfer function $\eta$ must be incorporated into the E2E GAE. We adopt a channel model based on Rayleigh fading as a reasonable representation of wireless communication systems.
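The graph construction and power normalization described for the encoder can be illustrated with a minimal, standard-library-only sketch. The function names `knn_adjacency` and `normalize_power` are hypothetical, the symmetrization step is an assumption made here to keep the graph undirected, and the latent variables are treated as a flat list of $mL$ real values; the actual implementation in the paper relies on PyTorch Geometric.

```python
import math
import random

def knn_adjacency(points, k):
    """Binary adjacency matrix of a K-nearest-neighbor graph over 3D points."""
    n = len(points)
    W = [[0] * n for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distance from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            # Assumption: symmetrize so that the graph is undirected.
            W[i][j] = 1
            W[j][i] = 1
    return W

def normalize_power(z, P):
    """Scale latent values z so that sum(z_i^2) equals len(z) * P."""
    energy = sum(v * v for v in z)
    scale = math.sqrt(len(z) * P / energy)
    return [v * scale for v in z]

random.seed(0)
pts = [(random.random(), random.random(), random.random()) for _ in range(8)]
W = knn_adjacency(pts, k=3)
z = normalize_power([random.gauss(0, 1) for _ in range(12)], P=1.0)
```

After normalization the total energy of `z` equals the number of latent values times `P`, which matches the average power constraint stated for the encoder output.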
In Rayleigh fading channels, each analog-modulated symbol at the receiver can be modeled as follows:

$$y_i = h_i x_i + n_i,$$

where $y_i$ is the $i$th received symbol, $x_i$ is the $i$th transmission symbol, $h_i$ is the $i$th multiplicative fading coefficient, and $n_i$ is additive white Gaussian noise (AWGN) with an average noise variance of $\sigma^2$. The fading coefficients in Rayleigh fading channels follow a zero-mean complex Gaussian distribution, i.e., $h_i \sim \mathcal{CN}(0, 1)$, where $\sim$ means "distributed as" and $\mathcal{CN}(a, b)$ is a complex Gaussian distribution with a mean of $a$ and a variance of $b$.

To reduce the impact of fading effects, we consider two equalization techniques, pre-equalization at the sender and post-equalization at the receiver, for the channel transfer function $\eta$. Pre-equalization can be realized at the sender side by sending the pre-equalized transmission symbol $x_i$ to the receiver as $x_i = w_i z_i$, where $w_i$ is a pre-equalizer weight. Although there are many variants of pre-equalizers, we assume a simple pre-equalization:

$$w_i = \frac{h_i^*}{|h_i|},$$

where $*$ denotes the conjugate operation. In this case, the channel transfer function will be:

$$\eta_{\mathrm{preeq}}(z_i) = |h_i| z_i + n_i.$$

Post-equalization can be realized at the receiver side by taking the inverse operation of the fading attenuation. Specifically, the receiver performs post-equalization such that $\hat{y}_i = y_i / h_i$, given the estimated fading coefficient $h_i$. In this case, the channel transfer function will be:

$$\eta_{\mathrm{posteq}}(z_i) = z_i + n_i / h_i.$$

In addition to pre-/post-equalization, we also consider a precoding method that sorts the latent variables $z$ according to the fading level $|h_i|$ in descending order. Such sorting may help the GNN optimize the latent variables to achieve diversity gain.

Decoder: Upon receipt of the distorted latent variables, the decoder uses a decoding function $g_\phi$, based on a multilayer perceptron (MLP), for 3D point cloud reconstruction.
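The two channel transfer functions of the Wireless Link part, together with the fading-based sorting used for precoding, can be checked numerically with a small sketch. It is not the authors' implementation: all function names are hypothetical, the pre-equalizer weight is assumed to be $w_i = h_i^* / |h_i|$ (the choice consistent with $\eta_{\mathrm{preeq}}(z_i) = |h_i| z_i + n_i$), perfect channel knowledge is assumed, and returning the indices of the channel uses ordered by $|h_i|$ is only one reading of the sorting step.

```python
import math
import random

def rayleigh_coeff(rng):
    """Draw h ~ CN(0, 1): real and imaginary parts each have variance 1/2."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0, s), rng.gauss(0, s))

def awgn(rng, sigma2):
    """Draw n ~ CN(0, sigma2)."""
    s = math.sqrt(sigma2 / 2)
    return complex(rng.gauss(0, s), rng.gauss(0, s))

def channel_preeq(z, h, n):
    """Pre-equalization: send x = w * z with w = conj(h) / |h|, receive h * x + n."""
    w = h.conjugate() / abs(h)
    return h * (w * z) + n

def channel_posteq(z, h, n):
    """Post-equalization: receive y = h * z + n, then divide by h."""
    return (h * z + n) / h

def sort_by_fading(h_list):
    """Indices of the channel uses sorted by |h| in descending order."""
    return sorted(range(len(h_list)), key=lambda i: abs(h_list[i]), reverse=True)

rng = random.Random(1)
z = complex(0.3, -0.7)           # one latent value mapped to an I-Q symbol
h = rayleigh_coeff(rng)
n = awgn(rng, sigma2=0.01)
y_pre = channel_preeq(z, h, n)   # equals |h| * z + n up to rounding
y_post = channel_posteq(z, h, n) # equals z + n / h up to rounding

fades = [rayleigh_coeff(rng) for _ in range(6)]
order = sort_by_fading(fades)
```

The sketch makes the trade-off visible: pre-equalization leaves the noise untouched but scales the signal by $|h_i|$, whereas post-equalization restores the signal but amplifies the noise by $1/h_i$ when the fade is deep.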
The decoder consists of a series of fully-connected layers and leaky ReLU activations with a parameter set $\phi$. The MLP decoder maps the distorted latent variables $\hat{z}$ into an estimate $\hat{p}$ of the 3D coordinates. The last layer uses a hyperbolic tangent (tanh) activation function.

Loss Function: The proposed GNN-based encoding and decoding functions are trained to minimize a loss function:

$$(\theta^*, \phi^*) = \arg\min_{\theta, \phi} \; \mathbb{E}_{\Pr(p, \hat{p})}\left[ d(p, \hat{p}) \right],$$

where $\mathbb{E}[\cdot]$ is an expectation, $d(p, \hat{p})$ is a defined distortion function between the original and reconstructed 3D coordinate attributes, and $\Pr(p, \hat{p})$ is the joint probability distribution of the original and reconstructed 3D coordinate attributes. Since the true distribution of the input attributes is often unknown, the expected distortion is also unknown. To learn better weights for the minimization of the expected distortion in Rayleigh fading channels, all potential distortions due to channel fading and additive noise are synthetically analyzed by the proposed scheme in the off-line learning phase. We use the adaptive moment estimation (Adam) optimizer for weight learning with an initial learning rate of 0.005, a batch size of 10, and first- and second-moment decay rates (momentum and momentum2) of 0.9 and 0.999, respectively, for 500 epochs.

Datasets: We use the ShapeNet benchmark dataset [27] for experiments. ShapeNet contains more than 50,000 unique 3D models from 55 categories. In our experiments, we select point clouds of the "Airplane" category. We sample 2,115 point clouds for training and 234 point clouds for testing. The training data are used for learning the network weights, while the testing data are used for comparison in terms of 3D reconstruction and visual quality.

Quality Metric: We use the augmented Chamfer distance [24] as the distortion function for the 3D coordinate attributes. The augmented Chamfer distance $d_{\mathrm{CH}}(S, \hat{S})$ is defined as

$$d_{\mathrm{CH}}(S, \hat{S}) = \max\left\{ \frac{1}{|S|} \sum_{p \in S} \min_{\hat{p} \in \hat{S}} \|p - \hat{p}\|_2, \; \frac{1}{|\hat{S}|} \sum_{\hat{p} \in \hat{S}} \min_{p \in S} \|p - \hat{p}\|_2 \right\},$$

where $S$ is the input point set and $\hat{S}$ is the reconstructed point set.
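As a minimal sketch of this metric, assume the usual form of the augmented Chamfer distance: the larger of the two mean nearest-neighbor Euclidean distances, one taken from the input set to the reconstructed set and one in the opposite direction. The function name `augmented_chamfer` and the brute-force search are illustrative choices made here, not part of the paper.

```python
import math

def augmented_chamfer(S, S_hat):
    """Max of the two directed mean nearest-neighbor distances between point sets."""
    def mean_nearest(A, B):
        # Average, over points a in A, of the distance to the closest point in B.
        return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)
    return max(mean_nearest(S, S_hat), mean_nearest(S_hat, S))

S = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
S_hat = [(0.0, 0.0, 0.0)]
d = augmented_chamfer(S, S_hat)  # 0.5: one original point has no close match
```

In this toy case the reconstruction covers only one of the two original points, so the direction from the input set to the reconstructed set dominates, while the opposite direction is zero. The brute-force search costs on the order of $|S| \cdot |\hat{S}|$ distance evaluations, so a spatial index would be preferable for full-size point clouds.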
The term $\min_{\hat{p} \in \hat{S}} \|p - \hat{p}\|_2$ enforces that any 3D coordinate $p$ in the original point cloud has a matching 3D point $\hat{p}$ in the reconstructed point cloud, and the term $\min_{p \in S} \|p - \hat{p}\|_2$ enforces the matching vice versa. The max operation enforces that the distance from $S$ to $\hat{S}$ and the distance vice versa have to be small simultaneously.

Wireless Environment: We consider Rayleigh fading channels with an additive noise $n_i$ for realistic wireless environments. The additive noise $n_i$ follows a white Gaussian distribution with a variance of $\sigma^2$, i.e., $n_i \sim \mathcal{CN}(0, \sigma^2)$.

GAE Architecture: We use PyTorch Geometric (PyG) [28] for the implementation of our GAE architecture. The encoder repeats, three times, a series of GCNConv [29] with between 12 and 48 output channels, a leaky ReLU activation function, and Top-K pooling with a graph pooling ratio between 0.5 and 0.9. The output of the last Top-K pooling layer is followed by a normalization layer, which enforces the average power constraint. The decoder repeats, three times, a series of a fully-connected layer and leaky ReLU to reconstruct the 3D coordinate attributes from the latent variables distorted by the channel transfer function. Here, the output channels of the first and second fully-connected layers are the same as the output channels of GCNConv, while the number of output channels of the last fully-connected layer is 3.

We first discuss the impact of the proposed GNN-based coding on the amount of communication overhead. Fig. 4 shows the 3D reconstruction quality over Rayleigh fading channels as a function of the communication overhead at a wireless channel SNR of 20 dB. Here, the communication overhead is the total number of transmission symbols in near-analog modulation (plus the GFT basis matrix for HoloCast). We compare four schemes: HoloCast [16], HoloCast with Givens rotation [17], SoftCast [18], and the proposed schemes.
HoloCast uses octree decomposition and applies the GFT to the graph signals in each octree block, converting them into the spectral domain by using the eigenvectors of the random-walk graph Laplacian matrix. HoloCast sends the eigenvectors of varying octree block sizes without compression. HoloCast with Givens rotation uses a quantization bit depth $b$ between 2 and 12 to compress the eigenvectors for overhead reduction. Here, the octree block size is fixed to 300. SoftCast applies DCT-based decorrelation to the 3D coordinate attributes and directly maps the coefficients onto the I-Q components. The proposed scheme uses GNN-based encoding and decoding for overhead reduction. Here, the proposed scheme uses precoding with the post-equalization channel transfer function.

We can see that the proposed scheme achieves a significant overhead reduction at the same 3D reconstruction quality compared with the conventional HoloCast and HoloCast with Givens rotation. For example, the Chamfer distance of the proposed scheme is 0.011 at a communication overhead of $9.0 \times 10^4$ symbols. On the other hand, the Chamfer distance of HoloCast with Givens rotation is 0.010 at a communication overhead of $1.1 \times 10^6$ symbols. In this case, the proposed scheme achieves a 92.0% overhead reduction at almost the same 3D reconstruction quality. The conventional SoftCast has a limited 3D reconstruction quality irrespective of the communication overhead. Although the communication overhead in SoftCast is smaller than that in the proposed scheme, SoftCast needs an additional communication overhead for power scaling [15]. A detailed analysis of the communication overhead for power scaling is left as future work.

We now discuss the effect of wireless channel quality on the reconstructed point cloud quality. We consider two HoloCast schemes with bit depths of 5 and 12 in Givens rotation at an octree decomposition size of 300. Fig. 5 shows the 3D reconstruction quality of the reference schemes over Rayleigh fading channels as a function of wireless channel SNR. We observe the following results:
• The proposed scheme yields the best 3D reconstruction quality in low wireless SNR regimes.
• Although the conventional HoloCast schemes realize better 3D reconstruction quality in high wireless SNR regimes, the required overhead is more than 10 times larger than that of the proposed method.
• The 3D reconstruction quality of SoftCast is lower than that of the proposed scheme irrespective of the wireless channel SNR.

In this section, we evaluate the effects of precoding and the channel transfer functions, i.e., post-equalization and pre-equalization, on the 3D reconstruction quality of the proposed scheme. Fig. 6 shows the 3D reconstruction quality of the proposed schemes over Rayleigh fading channels as a function of wireless channel SNR for the channel transfer functions $\eta_{\mathrm{preeq}}$ and $\eta_{\mathrm{posteq}}$ with and without precoding. The evaluation results are summarized as follows:
• Precoding performs well in high wireless SNR regimes since it may achieve a higher diversity gain.
• Pre-equalization works well in lower SNR regimes, whereas post-equalization does well at higher SNRs.
• As a consequence, pre-equalization without precoding yields the best 3D reconstruction quality in low SNR regimes below 10 dB.
• Accordingly, post-equalization with precoding becomes the best one in high SNR regimes above 10 dB.

We finally compare some examples of visual snapshots for SoftCast, HoloCast, and the proposed schemes over Rayleigh fading channels in Figs. 7(a) through (g) at a channel SNR of 20 dB. Here, the point cloud is one sample selected from the test data in the ShapeNet database. Although each proposed scheme may reconstruct the 3D shape of the aircraft, the proposed scheme with precoding may realize a clearer reconstruction than the proposed scheme without precoding.
In particular, SoftCast shows obvious degradation compared with the other schemes. Nevertheless, the 3D shape of the aircraft tail still remains noisy even with the proposed methods. Note that we focused on a simplified GNN method compared to state-of-the-art techniques such as graph inception networks (GIN) [24] and FoldingNet [23] in order to demonstrate an initial proof-of-concept study of GNN-based 3D point cloud delivery. Extensions to further improve the 3D reconstruction quality will be considered in follow-up work. To the best of the authors' knowledge, this paper is the very first study exploiting GNN methods for wireless 3D point cloud delivery.

We have proposed a novel scheme of soft point cloud delivery for future wireless streaming of holographic and 3D data. Specifically, the proposed scheme integrates GNN-based point cloud coding and near-analog modulation to simultaneously achieve: 1) prevention of the cliff effect, 2) prevention of the leveling effect, 3) high energy compaction, and 4) low communication overhead. In addition, the proposed E2E design of the GAE scheme accounts for random distortion due to fading channels through the use of pre-/post-equalization and precoding techniques. Evaluation results demonstrated that the proposed scheme achieves a good trade-off between 3D reconstruction quality and communication overhead compared with the conventional SoftCast and HoloCast. A more rigorous analysis with GIN and FoldingNet will follow as future work.

ACKNOWLEDGMENT
T. Fujihashi was partly supported by JSPS KAKENHI Grant Number 17K12672.
Holographic three-dimensional telepresence using large-area photorefractive polymer
Ultrahigh-definition dynamic 3D holographic display by active control of volume speckle fields
Communication, interactivity, and satisfaction in B2B relationships
Telementoring and telesurgery for minimally invasive procedures
Overview of the MPEG activity on point cloud compression
Fast computer-generated hologram generation method for three-dimensional point cloud model
Point cloud compression in MPEG
Real-time compression of point cloud streams
3D is here: Point cloud library (PCL)
Octree-based point-cloud compression
Global holographic display market 2018 size, share, outlook and forecast to 2023
Blk360 datasets
Video transmission over lossy wireless networks: A cross-layer perspective
High-quality soft video delivery with GMRF-based overhead reduction
HoloCast: Graph signal processing for graceful point cloud delivery
Overhead reduction in graph-based point cloud delivery
A cross-layer design for scalable mobile video
Graph signal processing: Overview, challenges, and applications
Point cloud attribute compression with graph transform
On Node Features for Graph Neural Networks
Learning Representations and Generative Models for 3D Point Clouds
Foldingnet: Point cloud auto-encoder via deep grid deformation
Deep unsupervised learning of 3d point clouds via graph topology inference and filtering
PCT: Large-scale 3d point cloud representations via graph inception networks with applications to autonomous driving
Graph U-Nets
ShapeNet: An Information-Rich 3D Model Repository
Fast Graph Representation Learning with PyTorch Geometric
Semi-Supervised Classification with Graph Convolutional Networks