Wireless Semantic Communications for Video Conferencing
Peiwen Jiang, Chao-Kai Wen, Shi Jin, Geoffrey Ye Li
Date: 2022-04-16

Abstract: Video conferencing has become a popular mode of meeting even though it consumes considerable communication resources. Conventional video compression causes resolution reduction under limited bandwidth. Semantic video conferencing maintains high resolution by transmitting a few keypoints to represent motions, because the background is almost static and the speakers do not change often. However, the impact of transmission errors on keypoints has received little study. In this paper, we first establish a basic semantic video conferencing (SVC) network, which dramatically reduces transmission resources while losing only detailed expressions. Transmission errors in the SVC merely change the facial expression, whereas those in conventional methods destroy pixels directly. However, conventional error detectors, such as the cyclic redundancy check (CRC), cannot reflect the degree of expression change. To overcome this issue, we develop an incremental redundancy hybrid automatic repeat-request (IR-HARQ) framework for varying channels (SVC-HARQ) that incorporates a novel semantic error detector. SVC-HARQ is flexible in bit consumption and achieves good performance. In addition, SVC-CSI is designed for channel state information (CSI) feedback to allocate the keypoint transmission and enhance performance dramatically. Simulation shows that the proposed wireless semantic communication system can significantly improve transmission efficiency.

This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no longer be accessible.

A basic network for semantic video conferencing, called SVC, is established within a three-level framework. First, we investigate the entire transmission process and analyze the difference between semantic and conventional errors. Then, we add acknowledgement (ACK) feedback to the SVC; it is a widely used technique in conventional wireless communications to ensure successful transmission. An incremental redundancy hybrid automatic repeat request (IR-HARQ) framework for the SVC, called SVC-HARQ, is proposed to guarantee the quality of the received frames under adverse channels. Then, the transmitter learns to allocate information of different importance according to the signal-to-noise ratios (SNRs) of different subchannels with the help of CSI; this scheme is called SVC-CSI. The major contributions of this work are summarized as follows: 1) Establishing the SVC framework. State-of-the-art technology for restoring an image from several keypoints achieves a huge compression ratio. Thus, we exploit this technology to cope with channel distortion. Whereas the huge compression ratio (only a small number of keypoints) causes the transmitted image to lose detailed expressions, the simulation results show that transmission errors in the physical channel may change the locations of the keypoints and lead to inaccurate expressions. Nevertheless, these errors may be visually acceptable after semantic processing. In contrast, errors in conventional methods usually destroy pixels directly. 2) Combining with the HARQ scheme. To guarantee the feasibility of the SVC under varying channels, an IR-HARQ feedback framework for the SVC, called SVC-HARQ, is developed.
Compared with conventional bit error detection using the cyclic redundancy check (CRC), a semantic error detector is used to decide whether the received frame requires an incremental transmission. The semantic error detector exploits the fluency of the video to check the received frame. The simulation results demonstrate that inaccurate keypoints usually reduce fluency. The proposed SVC-HARQ can adapt to different bit error rates (BERs) and requires fewer transmitted bits than the competing methods. 3) Exploiting CSI. The CSI is exploited so that transmitted information of different importance can be allocated automatically to different subchannels; this scheme is called SVC-CSI. The SVC-CSI learns to allocate more information to subchannels with high SNRs than to those with low SNRs. Because the performance of the SVC-CSI degrades when the testing channel environment differs from the training environment, an extra incremental transmission is trained without employing CSI; the resulting scheme is robust to varying channels and is called SVC-CSI-HARQ. The rest of this paper is organized as follows. Section II introduces the system model and the related methods, including conventional IR-HARQ and adaptive modulation. Then, we describe the proposed networks in Section III. In Section IV, we demonstrate the superiority of the proposed networks in terms of semantic metrics and the required bits. Finally, Section V concludes this paper. In this section, we first describe the existing frameworks for semantic networks and then introduce some important techniques in wireless communications that can potentially help semantic transmission. Finally, we discuss the challenges of applying semantic transmission over wireless communication systems. To transmit source information, such as a picture p, the semantic transmitter first extracts its meaning.
The semantic extractor plays a similar role to the source encoder in conventional communication systems and is denoted as S(p; W_S), where W_S is the parameter set for semantic extraction. Then, the channel encoder, C(·), can be designed separately or jointly with the semantic extractor, and the encoded symbols are generated for channel transmission. The whole encoding process can be expressed as s = C(S(p; W_S); W_C), where W_C denotes the parameter set of the semantic channel encoder. The transmitted picture can be recovered at the receiver by p̂ = S^{-1}(C^{-1}(ŝ)), where S^{-1}(·) and C^{-1}(·) represent the semantic source decoder and channel decoder, respectively. As indicated in [1], semantic processing and transmission in semantic communications are remarkably different from the conventional ones. Meanwhile, the local and shared knowledge in semantic systems plays a major role. Semantic knowledge can be exploited implicitly or explicitly, as summarized in the following: 1) Implicit semantic knowledge. In these designs [14]-[17], the local and shared knowledge is implicitly contained in the trainable parameters, and the transceivers are usually trained in an end-to-end manner. The impact of physical channels is also learned implicitly. These methods automatically extract semantic features and cope with the distortion and interference of the physical channels. However, the trained parameters are difficult to adjust when the transmit source or physical environment changes. 2) Explicit semantic knowledge. In some specific tasks, the semantic knowledge is shared explicitly. For example, the semantic network in [22] shares the most important features of the image so that the received image can be classified better than with conventional methods. In [21], a photo of the speaker is shared because the appearance of the speaker does not change much during a speech.
The explicit shared knowledge can be adjusted according to changes in the source information, such as replacing the photo for the next speaker. Apart from the semantic knowledge, the existing methods have not considered adjusting their settings under different channel environments. Therefore, these semantic methods cannot adapt to physical channel variation. In this section, we introduce two key techniques in wireless communications for coping with changing environments, which can be exploited in semantic system design. In modern communication systems with HARQ, corrupted packets are retransmitted. IR-HARQ can balance the requirements of transmission resources and accuracy and is a popular option. To establish an IR-HARQ system, we need a channel encoder and an error detector. If the semantic symbols, k, are protected by a conventional channel encoder, C(·), then the coded symbols can be expressed as s = C(k). As in [23], the coded symbol vector can be divided into s_1 and s_2 with s = [s_1, s_2], where s_1 corresponds to the coded semantic symbols with a high coding rate, and s_2 represents the incremental symbols. The high code rate symbols, s_1, are transmitted first. Denote ŝ_1 as the received symbols corresponding to s_1. The recovered semantic symbols can then be expressed as k̂ = C^{-1}(ŝ_1). The conventional CRC error detector is widely used in HARQ systems; the extra parity bits are coded from p and transmitted at the very beginning. With the extra parity bits at the receiver, b_CRC, the ACK information can be generated by ACK = Det_CRC(k̂, b_CRC), where Det_CRC(·) denotes the error detection process. The feedback signal is ACK = 1 when no error is found; otherwise, ACK = 0. The incremental symbols, s_2, need to be transmitted to decrease the code rate if errors are found (ACK = 0). The received coded symbols are then combined and decoded again as k̂ = C^{-1}([ŝ_1, ŝ_2]). If the decoded result still has errors, then retransmission starts, and the above process is repeated.
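The IR-HARQ control flow described above can be sketched as follows. This is a toy byte-level model: the plain retransmission and the CRC over the payload are illustrative stand-ins (assumptions) for the coded symbol split s = [s_1, s_2] and the soft combining a real channel decoder would perform.

```python
import zlib

def det_crc(payload: bytes, parity: int) -> int:
    """Det_CRC: ACK = 1 when no error is found, otherwise ACK = 0."""
    return 1 if zlib.crc32(payload) == parity else 0

def harq_loop(payload: bytes, channel, max_rounds: int = 3):
    """Transmit until the CRC check passes or the round budget is exhausted.
    Returns (decoded payload or None, rounds used)."""
    parity = zlib.crc32(payload)            # parity bits sent at the very beginning
    for rnd in range(1, max_rounds + 1):
        received = channel(payload)         # round 1 carries s1; later rounds add redundancy
        if det_crc(received, parity) == 1:  # ACK = 1 -> stop
            return received, rnd
    return None, max_rounds                 # still corrupted after all rounds

def flaky_channel():
    """A channel that corrupts the first pass and is clean afterwards."""
    state = {"uses": 0}
    def ch(x: bytes) -> bytes:
        state["uses"] += 1
        if state["uses"] == 1:
            return bytes([x[0] ^ 0xFF]) + x[1:]  # flip the bits of the first byte
        return x
    return ch
```

On a channel that fails once, the loop stops after the second round with ACK = 1, mirroring the single incremental transmission of IR-HARQ.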
This IR-HARQ method can deal with channels that vary across time slots. In addition, the diversity of channel conditions at different frequencies within the same time slot can be exploited. If orthogonal frequency division multiplexing (OFDM) is used, then the overall channel bandwidth can be divided into L parallel flat-fading subchannels with different SNRs [24]. For OFDM systems, the subchannel gains differ, whereas the noise powers of the subchannels are the same. The receive SNR of the l-th subchannel can then be expressed as SNR_l = |h_l|^2 |s_l|^2 / σ^2, where h_l and s_l are the frequency response and transmit symbol of the l-th subchannel, respectively, and σ^2 is the noise power at the receiver. Once σ and [h_1, ..., h_L] are available at the transmitter, the modulation of s_l can be adapted to cope with the changing gains of the subchannels. Although combining semantic networks with conventional link adaptation methods is a natural idea, the novel mechanism of semantic transmission brings design challenges at the technical level. As a deep and hard-to-interpret network, the semantic-based transceiver must have guaranteed performance under varying physical environments. III. TRANSCEIVER DESIGN FOR SEMANTIC VIDEO CONFERENCING In this section, we introduce novel architectures for semantic video conferencing that exploit conventional strategies in wireless communications. We start with a basic network as a semantic source encoder. Then, a novel error detector is proposed to generate an ACK feedback. The basic network is expanded into the HARQ mode to cope with channels that vary across time slots. Finally, the CSI of each subchannel is fed back to the semantic transmitter for adaptive modulation. Restoring a specific face in an image from a few keypoints has been studied in [20], [21]. In these methods, the keypoints contain the changing information of facial expression and motion.
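Returning to the OFDM model a few sentences above, the per-subchannel SNR is a one-line computation; the gain and noise values below are made-up examples, not values from the paper.

```python
import numpy as np

def subchannel_snr(h, s, sigma2):
    """Receive SNR of each of the L parallel flat-fading subchannels:
    |h_l|^2 |s_l|^2 / sigma^2, with equal noise power sigma^2 on every subchannel."""
    h = np.asarray(h, dtype=complex)
    s = np.asarray(s, dtype=complex)
    return (np.abs(h) ** 2) * (np.abs(s) ** 2) / sigma2

# Unequal gains with unit-power symbols: the SNRs differ across subchannels,
# which is exactly the diversity an adaptive scheme can exploit.
gains = np.array([1.0, 0.5 + 0.5j, 0.1])   # frequency responses h_l (illustrative)
symbols = np.ones(3)                        # |s_l|^2 = 1
snrs = subchannel_snr(gains, symbols, sigma2=0.01)
```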
Other information, such as appearance features, does not change during a speech and can be shared with the receiver in advance. Moreover, as presented in [21], the keypoints can be compressed and encoded to improve transmission efficiency. The above methods dramatically reduce the required transmission resources. However, the existing methods only focus on the source-coding framework and ignore the impact of varying wireless channels. A complete semantic video conferencing framework is shown in Fig. 1, where simple dense layers are introduced as the channel coding module. Fig. 1(a) shows the semantic video conferencing framework, called SVC. The whole framework consists of three levels similar to [1]: the effectiveness, semantic, and technical levels. The effectiveness level delivers the motion and expression of the speaker. The conventional goal is to minimize the difference between the transmitted and recovered frames. At the semantic level, a photo of the speaker is shared in advance given that the speaker does not change remarkably during the speech. Usually, the first frame of the video is shared with the receiver for convenience, although a photo with a distinct face is beneficial for generating a good image at the receiver. The keypoint detector extracts the movement of the face in the current frame, and these keypoints are transmitted at the technical level. Based on the received keypoints and the shared photo, the semantic part of the receiver reconstructs the frame. The networks at the technical level are trained to cope with the distortion and interference of the physical channels. From the above description, the SVC has three subnets: a keypoint detector and a generator at the semantic level, and an encoder-decoder at the technical level, as shown in Fig. 1(b). The keypoint detector extracts the n keypoint coordinates, k_i ∈ R^{2×n}, from the i-th frame as k_i = KD(p_i; W_KD), where W_KD denotes the set of trainable parameters of the keypoint detector.
Specifically, the first frame, p_0, with its keypoints k_0, is shared with the receiver. The keypoint detector consists of convolutional neural networks (CNNs) similar to [25]. The input image of size (256, 256, 3) is first downsampled to (64, 64, 3) by anti-alias interpolation to reduce the complexity of the keypoint detector. Then, the image is processed by an hourglass network [26] with three blocks. Each block has a 3 × 3 convolution with a ReLU activation function, batch normalization, and 2 × 2 average pooling. The network has 1024 maximum channels and 32 output channels. After the hourglass network, a 7 × 7 convolution converts the CNN output from (64, 64, 32) into (64, 64, n), thereby producing n 64 × 64 grids. The softmax activation is applied to choose the grid point with the largest output value. The selected n grid points are normalized to [−1, 1]. The encoder-decoder consists of dense and quantization/dequantization layers. For the encoder, the n keypoints of the i-th frame, k_i (expressed as n coordinates), are treated as 2n real numbers and processed by three dense layers, f_en(·), with 512, 256, and m neurons, where m is the number of transmit symbols. The first two layers use the ReLU activation function, and the last one uses the Sigmoid activation function. Then, a two-bit quantization Q(·) is applied to generate the 2m transmitted bits, b_i. The whole process is expressed as b_i = Q(f_en(k_i; W_en)), where W_en is the set of trainable parameters of the dense layers. The dequantizer, Q^{-1}(·), at the receiver side of the technical level is the inverse of Q(·) and recovers m real numbers from the received bits, b̂_i. The three dense layers, f_de(·; W_de), have 512, 256, and 2n neurons, where the first two layers use the ReLU activation function and the last one uses Tanh to restore the n keypoints, k̂_i. This process can be expressed as k̂_i = f_de(Q^{-1}(b̂_i); W_de), where W_de is the set of trainable parameters of the dense layers.
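The paper does not spell out the codebook of the two-bit quantizer Q(·). A minimal sketch, assuming a uniform four-level quantizer over the Sigmoid output range [0, 1], shows how m encoder outputs become 2m bits and back:

```python
import numpy as np

def quantize_2bit(x):
    """Q(.): map Sigmoid outputs in [0, 1] to 2-bit level indices {0, 1, 2, 3}."""
    return np.clip(np.round(np.asarray(x) * 3), 0, 3).astype(int)

def levels_to_bits(levels):
    """Serialize each 2-bit level into its two bits (m symbols -> 2m bits)."""
    levels = np.asarray(levels)
    return np.stack([(levels >> 1) & 1, levels & 1], axis=-1).reshape(-1)

def bits_to_levels(bits):
    """Reassemble bit pairs into level indices at the receiver."""
    b = np.asarray(bits).reshape(-1, 2)
    return (b[:, 0] << 1) | b[:, 1]

def dequantize_2bit(levels):
    """Q^{-1}(.): reconstruct the real value of each quantization level."""
    return np.asarray(levels) / 3.0
```

The round trip is exact on the level indices; the reconstruction error lives entirely in the initial rounding, which is what the straight-through trick during training has to work around.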
The derivative of the quantization layer is replaced by that of the expectation in the backward pass because the gradient is truncated by the quantization [27]. The generator reconstructs the current frame from the shared image, p_0, with its keypoints, k_0, and the received keypoints of the i-th frame, k̂_i. This process is denoted as G(·; W_G), where W_G denotes the set of trainable parameters. Therefore, the i-th frame can be recovered by p̂_i = G(p_0, k_0, k̂_i; W_G), where the architecture of the generator is similar to that in [20] but without the Jacobian matrix. The loss function consists of a perceptual loss [28] based on a pretrained CNN, called VGG-19 [29], a patch-level discriminator loss [30], and an equivariance loss [20], denoted as L_P(·), L_D(·), and L_E(·), respectively. As a result, the overall loss function is L(p_i, p̂_i) = L_P(p_i, p̂_i) + L_D(p_i, p̂_i) + L_E(p_i, p̂_i). Because the trainable parameters at the technical level are much fewer than those of the other parts but still important, we add a mean-squared-error (MSE) loss, L_MSE(k_i, k̂_i) = ||k_i − k̂_i||^2, to train the technical level. The training process is divided into three steps. At the beginning, the technical level is ignored, and the parameters of the keypoint detector and the generator are trained by L(p_i, p̂_i). Then, the parameters of the technical level are trained by L_MSE to restore k_i under the impact of physical channel distortion. Finally, all trainable parameters of the SVC are fine-tuned in an end-to-end manner by L(p_i, p̂_i). The proposed SVC is a combination of video synthesis and the theoretic three-level semantic transmission in wireless communications. This basic framework is established and trained to study the impact of replacing video transmission with semantic keypoint transmission. The performance of semantic transmission can be improved further by introducing the ACK feedback of wireless networks, as shown in the following section. HARQ can cope with time-varying channels in wireless communications.
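The three-step training schedule above can be summarized in a small helper; the parameter-set names follow the notation in the text, while the function itself is only a bookkeeping sketch of which sets each stage updates.

```python
def trainable_sets(stage: int):
    """Which parameter sets are updated in each of the three training steps."""
    if stage == 1:                     # semantic level only; technical level ignored
        return {"W_KD", "W_G"}
    if stage == 2:                     # technical level under channel distortion, via L_MSE
        return {"W_en", "W_de"}
    return {"W_KD", "W_G", "W_en", "W_de"}   # end-to-end fine-tuning
```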
Retransmission and transmitting incremental symbols are flexible under changing channels with ACK feedback. Thus, we develop a novel semantic video conferencing framework with HARQ, called SVC-HARQ, to improve semantic transmission. As shown in Fig. 2, the receiver feeds an ACK signal back to the transmitter after the first transmission. The first transmission is the same as in Fig. 1, and the trained parameters can be used directly. The first transmitted bit vector, b_{1,i}, can be expressed as b_{1,i} = Q(f_en(KD(p_i; W_{1,KD}); W_{1,en})), where W_{1,KD} is the set of trainable parameters of the keypoint detector, and W_{1,en} is the set of trainable parameters of the encoder. Then, the recovered frame is p̂_{1,i} = G(p_0, k_0, f_de(Q^{-1}(b̂_{1,i})); W_{1,G}), where W_{1,G} is the set of parameters of the generator for the first transmission, and b̂_{1,i} represents the received bit sequence of the first transmission. The reconstructed frame, p̂_{1,i}, is then evaluated by a semantic detector. If the detector finds that p̂_{1,i} is unacceptable, then ACK = 0 is fed back to the transmitter, and an incremental transmission is triggered. The incremental bit sequence is transmitted to correct the errors. Different from the first transmission, the incremental transmission only concentrates on the fallible keypoints under poor channel conditions. Thus, the incremental transmission also needs to be trained and has different trainable parameters, namely, W_{2,KD}, W_{2,en}, and W_{2,G}, for the keypoint detector, encoder, and generator, respectively. The incremental transmitted bit sequence is b_{2,i} = Q(f_en(KD(p_i; W_{2,KD}); W_{2,en})), and the recovered frame is p̂_{2,i} = G(p_0, k_0, f_de(Q^{-1}([b̂_{1,i}, b̂_{2,i}])); W_{2,G}), where b̂_{2,i} = h(Q(f_en(KD(p_i; W_{2,KD}); W_{2,en}))), and h(·) indicates that the transmitted bits suffer random errors due to channel distortion. The above description indicates that the semantic detector is the key module of the SVC-HARQ because the detector directly decides whether an incremental transmission or retransmission is required.
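The receiver-side decision logic of SVC-HARQ reduces to the following sketch, where `first_pass`, `incremental_pass`, and `detector` are hypothetical callables standing in for the trained first-transmission pipeline, the incremental pipeline, and the semantic detector.

```python
def svc_harq_receive(first_pass, incremental_pass, detector, threshold=0.5):
    """SVC-HARQ receiver logic: keep the first reconstruction if the semantic
    detector accepts it (ACK = 1); otherwise trigger the incremental pass (ACK = 0).
    Returns (final frame, ACK)."""
    frame = first_pass()                    # reconstruction from b_1
    if detector(frame) >= threshold:        # semantic detector, not a CRC
        return frame, 1
    return incremental_pass(), 0            # reconstruction from [b_1, b_2]
```

Only the detector distinguishes this from conventional HARQ: the accept/reject decision is made on the reconstructed frame, not on the bit stream.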
The conventional error detector of HARQ systems, the CRC, is unsuitable for the SVC-HARQ because some subtle errors in the received frames are acceptable to the conferees. We use an image quality assessment method [31] to evaluate whether the received frame is acceptable. This quality assessment network can be obtained by transfer-learning a VGG-19 based classifier, as shown on the left of Fig. 2(b). The VGG-19 based quality detector consists of VGG-19 and one dense layer with a Sigmoid activation function that outputs a frame quality indicator. The received frame is labeled 1 for acceptable quality and 0 for unacceptable quality. The loss function is cross-entropy. With a trained detector, Det_VGG(·), the ACK feedback can be expressed as ACK = Det_VGG(p̂_{1,i}). In fact, errors in the received keypoints do not decrease the image quality directly; instead, they change the facial expression. The generator can reconstruct an acceptable face image even if the keypoints have some errors because the general appearance is obtained from the shared image. The erroneous keypoints only change the current expression and make the video less fluent. To detect these changes, we propose a novel fluency detector, shown on the right of Fig. 2(b). The detector needs to distinguish inappropriate expressions. We use a keypoint detector to extract the keypoints of p̂_{1,i}. Then, we calculate the distance between the keypoints of p̂_{1,i} and p̂_{1,i−1}. A large distance means that the expression has changed suddenly and the transmitted keypoints have some errors. The whole detection process can be expressed as ACK = f_Det(||KD(p̂_{1,i}) − KD(p̂_{1,i−1})||), where f_Det(·) is a dense layer with a one-neuron output and Sigmoid activation function. This detector is trained with cross-entropy as the loss function on reconstructed frames collected from the SVC under different channels.
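A minimal version of the fluency check replaces the trained one-neuron classifier f_Det with a fixed distance threshold; 0.05 is an assumed value, consistent with the detected-AKD discussion in the results, not the learned decision boundary.

```python
import numpy as np

def detected_akd(kp_cur, kp_prev):
    """Mean Euclidean distance between matching keypoints of adjacent frames."""
    diff = np.asarray(kp_cur) - np.asarray(kp_prev)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

def fluency_ack(kp_cur, kp_prev, threshold=0.05):
    """ACK = 1 for a smooth expression change, 0 for a sudden jump
    (a thresholded stand-in for the trained classifier f_Det)."""
    return 1 if detected_akd(kp_cur, kp_prev) < threshold else 0
```

A small, consistent drift between frames passes, while a sudden keypoint jump, the signature of a transmission error, triggers ACK = 0.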
The average keypoint distance (AKD) calculated by a pretrained facial landmark detector [32] is used for labeling: the output of the detector is labeled 1 if its AKD is smaller than five and 0 otherwise. After training, the fluency detector is used in the same manner as Eq. (24). The above SVC methods exploit no further CSI. However, the noise power of the subchannels can be obtained by the receiver. For example, frequency-selective channels can be divided into different subchannels with different SNRs. We assume that the CSI of all subchannels is estimated by the receiver and shared with the transmitter. These channel conditions are exploited by the encoder-decoder at the technical level, which helps protect the most important keypoints. The accurate CSI of each subchannel cannot be obtained in practice, and feeding back the entire set of CSI values also requires resources. Thus, the receiver sorts the subchannels by their channel conditions and feeds this order back to the transmitter. This method simplifies the design of the encoder-decoder at the technical level and reduces the feedback cost. Compared with the original SVC, the version with CSI feedback (SVC-CSI) only needs an additional sort module SN(·), as shown in Fig. 3(a). The output of the sort module is denoted as b_i^CSI, whose elements, representing subchannel gains, are in decreasing order. Then, b_i^CSI is sent to the receiver. At the receiver, the received b̂_i^CSI is restored by SN^{-1}(·). Because the other parts are the same as in the SVC and the sort module of the SVC-CSI has no impact on the gradient, the SVC-CSI has the same training strategy as the SVC. The above methods encode the keypoints into bit sequences, which can be easily applied to conventional wireless communication systems, such as an OFDM system with quadrature amplitude modulation (QAM).
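The sort module SN(·) described above and its inverse amount to a permutation by subchannel gain. A sketch, under the assumption that the transmit sequence is ordered most-important first:

```python
import numpy as np

def sn_sort(symbols, subchannel_gains):
    """SN(.): place symbols so the earlier (more important) ones ride on the
    strongest subchannels; returns the permuted symbols and the permutation."""
    order = np.argsort(subchannel_gains)[::-1]   # subchannel indices, strongest first
    out = np.empty_like(symbols)
    out[order] = symbols                         # symbol j -> j-th best subchannel
    return out, order

def sn_unsort(received, order):
    """SN^{-1}(.): undo the permutation at the receiver."""
    return received[order]
```

Only the permutation (equivalently, the gain ordering) needs to be fed back, which is the feedback-cost saving the text points out; the permutation itself passes gradients through unchanged.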
Furthermore, a joint design with the modulation module that encodes the keypoints directly into constellation points can further improve the performance. Thus, we also investigate the benefit of CSI feedback when encoding keypoints into constellation points. In the left structure of Fig. 3(b), we directly replace the quantization in Fig. 3(a) with a dense layer and Tanh activation function. Its output has m real symbols, which form m/2 constellation points. These points are also rearranged according to the CSI feedback. This constellation method is called full-resolution because the learned constellation points can appear anywhere in the constellation. However, the full-resolution constellation is extremely complex for practical systems due to their finite precision. Thus, the constellation points need to be limited. We combine two bits into one real symbol, similar to 16QAM. Each 2-bit vector is coded by two shared trainable parameters, α and β, yielding s_{i,j} = α b_{i,2j} + β b_{i,2j+1}, where the 2m bits in b_i are first modulated into the real symbols, s_{i,j}, which have only four possible values, i.e., the constellation points can appear in only 16 locations. The m-symbol vector s_i is also divided among the L subchannels and multiplied by different transmit powers, ρ = [ρ_1, ..., ρ_L]. Then, s_i is rearranged into s_i^CSI according to the CSI feedback and sent to the receiver. The training process of these two constellation methods is the same as that of the SVC. Specifically, this method is called quantized-resolution and only introduces L + 2 parameters: α, β, and ρ. In general, with CSI feedback, some bits/symbols are always transmitted under better channel conditions than others. Thus, the networks can learn to transmit important keypoints on the subchannels with high SNRs. In this section, we present the numerical results of the different frameworks and discuss the pros and cons of semantic-based video conferencing.
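One natural reading of the quantized-resolution mapping above (an assumption, since the exact parameterization is not spelled out) is an antipodal two-bit combination per real dimension; with (α, β) = (2, 1) it reproduces the per-dimension 16QAM alphabet {−3, −1, 1, 3}.

```python
import numpy as np

def quantized_resolution_map(bits, alpha, beta):
    """Map each 2-bit group (b_2j, b_2j+1) to the real amplitude
    alpha*(2*b_2j - 1) + beta*(2*b_2j+1 - 1): four levels per real dimension,
    so a pair of real symbols gives one of 16 constellation points."""
    b = np.asarray(bits).reshape(-1, 2)
    return alpha * (2 * b[:, 0] - 1) + beta * (2 * b[:, 1] - 1)
```

Because α and β are trainable (and each subchannel additionally carries a power ρ_l), the network can deform this 16QAM-like grid, which is what distinguishes the learned modulation from fixed 16QAM in the results.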
We also compare their bit consumption (required number of bits) with that of competing schemes. Training settings. The VoxCeleb dataset [33], which contains a large number of face videos of speakers, is used. Metrics. Three metrics are used to evaluate the results: 1) Average keypoint distance (AKD). We use a pretrained facial landmark detector [32] to evaluate the errors in our transmission. This pretrained detector extracts keypoints from the received and transmitted frames, and their average distance is computed. The AKD metric represents the motion and the changing expression of the face. 2) Structural similarity index measure (SSIM). SSIM evaluates the structural similarity among patches of the input images [36]. SSIM is more robust than PSNR and is widely used as an image metric. 3) Perceptual loss. Perceptual loss is commonly used as a regularization method when training a network in computer vision. By calculating the sum of the MSEs between the estimated and the true image at different layers of a pretrained network, such as VGG, the perceptual loss represents the similarity of the features. Here, we choose the perceptual loss metric proposed in [37]. The results are listed in Table I. Because the H264 frame can be considered a lower-resolution version of the original frame, its structural information is preserved and the locations of the detected keypoints are unchanged. Thus, H264 has better SSIM and AKD metrics than the SVC. The conventional H264+RS can perfectly restore the transmitted information when the errors are fewer than its correction capability. Thus, the performance of H264+RS is unchanged when BER = 0 ~ 0.02 and decreases sharply when BER = 0.025 ~ 0.027. The SVC methods always perform better in terms of the three metrics when BER > 0.027 because the semantic transmission can still repair the semantic errors under high BER. However, the training environment is important for the performance of the SVC.
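For reference, the SSIM statistic [36] used as the second metric above can be sketched by computing it over a whole image as a single window; the standard metric averages the same statistic over local patches, and the constants follow the usual (0.01, 0.03) convention for 8-bit images.

```python
import numpy as np

def global_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM over the whole image as one window: luminance, contrast, and
    structure terms combined into a single score in (-1, 1], 1 for identical inputs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```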
The SVC (BER = 0) is more suitable for low BERs, and its performance becomes worse than that of the SVC (BER = 0.05) when BER > 0.02. This means that the two SVC models implicitly allocate different transmit resources for coping with errors. In general, the SVC can save transmission resources for high-resolution video conferencing because it only transmits keypoints and does not need to compress pixel information as H264 does. Moreover, the SVC is superior under extremely high BERs. However, the training BER affects the performance of the SVC, similar to selecting the code rate of channel coding. Thus, the IR-HARQ framework for the SVC is proposed and tested in the following section. Before discussing the performance of the SVC-HARQ, we first examine the difference between semantic and bit errors, as shown in Fig. 6. The errors in the SVC are difficult to detect in isolation; thus, the VGG-based quality detector always achieves an acceptance ratio higher than 96%. The SVC can protect the visual quality even under adverse environments because the main appearance features are shared in advance. Thus, the VGG-based quality detector is not sufficiently effective as a semantic detector. Apart from the quality of the received frame, the received video should be fluent. Meanwhile, the performance on the current frame cannot be measured directly because the true current frame is unknown to the receiver in practice. Thus, the keypoints are detected again at the receiver by the trained keypoint detector, and the average distances between the detected keypoints of adjacent frames, called the detected AKD, are related to the video fluency. As shown in Fig. 7(b), the detected AKDs of most frames are lower than 0.05 when no bit error exists and increase with BER. Compared with the 32-bit CRC code used in conventional HARQ systems, the fluency detector helps guarantee the quality of the video without any extra parity code.
Finally, the whole SVC-HARQ framework is tested, as shown in Fig. 8 and Fig. 9(a). To visualize the impact of the CSI feedback, we replace the three dense layers at the transmitter of the technical level with one dense layer, whose input consists of the keypoints (20 real numbers) and whose output consists of 80 symbols (320 bits with 4-bit quantization). In Fig. 9(b), the absolute values of the trained multiplicative weights of this dense layer are shown as a gray-scale picture; only the weights in the right picture are trained with CSI feedback. Thus, symbols 0 ~ 10 in the right picture are usually transmitted over better channels than those in the left one due to the CSI feedback. The absolute values in the circled region of the right picture are larger than those of the left picture. This finding means that the transmitter learns to place more information where the channel conditions are better. The SVC-CSI performs better than the SVC when the BER is higher because most information is transmitted over the first several subchannels, which have lower noise power. The constellation points of the SVC-CSI (full-resolution) are spread around, as shown in Fig. 10(a). Meanwhile, this method learns to transmit more information on better subchannels; thus, the transmit power decreases as the channel condition worsens, as shown in Fig. 10(c). The SVC-CSI (quantized-resolution) uses the same modulation on all subchannels, and its constellation points, shown in Fig. 10(b), are similar to 16-QAM. To cope with the noise, the transmit power of the SVC-CSI (quantized-resolution) becomes larger as the channel condition worsens. The methods in Fig. 11 have the same transmit resources. The SVC-CSI (full-resolution) always has the best performance, but its complexity is impractical. The SVC-CSI (quantized-resolution) learns a modulation different from 16-QAM. This method performs worse than 16-QAM when SNR ≥ 8 dB but better than 16-QAM when the SNR is low.
Therefore, the trained modulation of the SVC-CSI (quantized-resolution) is suitable for poor channel conditions but cannot perfectly reconstruct the frame when the SNR is high. Introducing HARQ can improve the performance of the SVC-CSI under mismatched channels; this method is called SVC-CSI-HARQ. In this method, the second transmission is trained without CSI feedback and is thus robust to varying environments. This strategy guarantees that the SVC-CSI-HARQ under a mismatched environment is no worse than the SVC-HARQ. Meanwhile, the SVC-CSI-HARQ performs better than the SVC-HARQ under mismatched channels when 0.02 ≤ BER ≤ 0.1 because the mismatch of the power allocation across subchannels is slight when the subchannel gain is large. In addition, the SVC-CSI-HARQ shows its superiority under matched channels. Specifically, the SVC-HARQ is slightly better than the methods with CSI feedback when BER = 0 because the CSI feedback is ineffective under noiseless channels; in that case, the CSI feedback may mislead the SVC into transmitting less information on the last subchannels. We have also considered the impact of feedback in the SVC and designed an IR-HARQ framework, called SVC-HARQ, with ACK feedback. Erroneous keypoints change the expression and produce nonsmooth adjacent frames. To detect semantic errors, we have developed a semantic detector, including a quality classifier and fluency detection. The SVC-HARQ is flexible: it can combine the strengths of networks trained under different BERs and always reaches good performance. The CSI feedback can enhance the performance further. The transmitted symbols or bits are sorted by the SNRs of the subchannels; this scheme is called SVC-CSI. The SVC-CSI learns to allocate more information to the subchannels with higher gains and performs better than the SVC without CSI feedback. However, the robustness of the SVC-CSI decreases because the channel model is exploited during training.
The combination of CSI and ACK feedback can balance the performance, bit consumption, and robustness.

REFERENCES
[1] Towards a theory of semantic communication.
[2] Semantic communications: Principles and challenges.
[3] Rethinking modern communication from semantic coding to semantic communication.
[4] Learning semantics: An opportunity for effective 6G communications.
[5] Power of deep learning for channel estimation and signal detection in OFDM systems.
[6] Deep learning-based end-to-end wireless communication systems with conditional GANs as unknown channels.
[7] An introduction to deep learning for the physical layer.
[8] Deep learning in physical layer communications.
[9] Model-driven deep learning for physical layer communications.
[10] Joint source-channel coding for deep-space image transmission using rateless codes.
[11] DeepJSCC-f: Deep joint source-channel coding of images with feedback.
[12] Deep joint source-channel coding for wireless image transmission.
[13] Deep learning-constructed joint transmission-recognition for internet of things.
[14] Joint source-channel coding for video communications.
[15] Semantic communication systems for speech transmission.
[16] Deep learning for joint source-channel coding of text.
[17] Deep learning enabled semantic communication systems.
[18] A lite distributed semantic communication system for internet of things.
[19] Task-oriented image transmission for scene classification in unmanned aerial systems.
[20] First order motion model for image animation.
[21] One-shot free-view neural talking-head synthesis for video conferencing.
[22] Semantic communications with AI tasks.
[23] Application of Reed-Solomon codes with erasure decoding to type-II hybrid ARQ transmission.
[24] A link adaptation scheme optimized for wireless JPEG 2000 transmission over realistic MIMO systems.
[25] Animating arbitrary objects via deep motion transfer.
[26] Stacked hourglass networks for human pose estimation.
[27] Lossy image compression with compressive autoencoders.
[28] Perceptual losses for real-time style transfer and super-resolution.
[29] Very deep convolutional networks for large-scale image recognition.
[30] Few-shot video-to-video synthesis.
[31] Blind image quality assessment using a deep bilinear convolutional neural network.
[32] How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks).
[33] VoxCeleb: a large-scale speaker identification dataset.
[34] Adam: A method for stochastic optimization.
[35] Type-II hybrid-ARQ protocols using punctured MDS codes.
[36] Image quality metrics: PSNR vs. SSIM.
[37] The unreasonable effectiveness of deep features as a perceptual metric.