Research on Multicamera Photography Image Art in BERT Motion Based on Deep Learning Mode
Zhao Zhao, Mingyang Song, Hongyue Tang
Computational Intelligence and Neuroscience, April 27, 2022. DOI: 10.1155/2022/2819269

In order to improve the artistic expression effect of photographic images, this article combines a deep learning model to conduct multicamera photographic image art research in BERT motion. It analyzes the external parameter errors introduced during calibration and uses a checkerboard in the common field of view to calibrate the spatial coordinates of the board corners in multiple camera coordinate systems. The spatial coordinates of corresponding points are then matched to each other to solve for the rotation and translation matrices of the transformation. Finally, the LM algorithm is used to optimize the camera calibration parameters, and a deep learning algorithm performs the image processing. The experimental results show that the proposed method of multicamera photography image art in BERT motion based on the deep learning mode can effectively improve the expressive effect of image art.

In people's daily life, photography, as one of the main tools of image dissemination, has become a ubiquitous way to record and discover different visual possibilities. However, photography and photographic art are two different concepts, and not all photography is art. Photography gained recognition in the field of culture and art relatively recently. In the 1970s, a large number of art festivals, periodicals, and galleries were launched in Western countries. Soon afterwards, some colleges and universities set up professional photography colleges and photography departments. Academic research on the history of photography also deepened, a large number of photographic works were included in art collections, and more and more artists began to create photography. In short, the practice and dissemination of photography are no longer confined to a narrow field of practice but have entered the palace of art and culture. At the same time, people's concept of photography is constantly being updated. When photography was invented, it was used only as a service tool; today, it is increasingly discussed and appreciated as artwork in itself. As a result, the public's attitude towards the practicality of photography has changed, attention has shifted to the perceptual and rational nature of images, and the production, dissemination, and circulation of photography, as well as its form, value, and use, have changed accordingly. Some people simply regard the art of photography as craftsmanship based purely on the principles of optics, chemistry, and mechanics. To put it more frankly, they believe that photography is merely a performance art of taking pictures, which leads them to ignore the lofty status of photography and its outstanding contribution to the development history of human society.
In addition, the importance of photographic practice is generally ignored by some people, who pay attention only to theoretical knowledge and believe that it is enough to master classic theories such as visual aesthetics, cultural studies, and image expression in photographic art, without engaging in photographic practice. These understandings of photographic art are one-sided and not objective. Photographic art is a comprehensive artistic behavior. It is inclusive and has different connotations in different situations: sometimes it pays more attention to the technical level, and sometimes it focuses more on the expression of culture and emotion. Today's photographic art is specifically a type of modern plastic art, and the camera is its creative tool. Taking the photographer's creative concept as the basic starting point, the photographer uses the camera to capture people or things in the real world, applies certain modern processing methods to artistically process the photographed subjects, and finally completes the creation of a photographic work of art. Through works showing the living conditions of human beings in contemporary society, the author's thoughts and feelings can be expressed at the same time. Aesthetic features are more of an attribute, the presentation of the work's own aesthetic character. Photography, as one of many art categories, has aesthetic characteristics similar to other aesthetic activities; that is, photographic aesthetics shares common ground with other aesthetic activities. In addition, photographic art also has its own characteristics; it belongs to a branch of visual plastic art. This article combines the deep learning model to study multicamera photographic image art in BERT motion in order to improve the performance of photographic image art.

From the perspective of its development, image text description can be divided into three stages: template-based methods, retrieval-based methods, and deep learning-based methods [1]. Before deep learning methods were proposed, most image description methods were template-based or retrieval-based. The template-based image text description method mainly annotates the image content and is based on image annotation technology [2]. Template-based methods rely on visual perception of the relationships between image objects and components and describe images using representations of subject, predicate, environment, and preposition collocations [3]. Reference [4] used the image context's subject, object, and their relationship to describe the image, used a neighbor similarity algorithm to calculate the matching degree between adjacent tuples, and finally calculated a score proportional to the matching degree. Reference [5] proposed a Conditional Random Field (CRF) algorithm, the central idea of which is to generate text descriptions for predicted text labels according to template matching rules. Reference [6] improved the template in the image text description task and used a hidden Markov model to fill the template with sentences. Reference [7] applied syntactic analysis to the image text description task, used the VDR (Visual Dependency Representation) method to represent the object relationships contained in the image as a dependency graph, represented the image as a VDR, and then traversed the VDR, fully considering the VDR
syntax tree relationships to fill in the gaps in sentences. Template-based image text description methods may be grammatically correct, but the output descriptions are highly template-dependent, have poor generalization performance, and generate text that lacks diversity. Because of these limitations, this method is no longer used for the image text description task [8]. Finding the relationship between matching text and images is the main purpose of the retrieval-based image description generation task. Retrieval-based image description includes vision-based retrieval and multimodal-based retrieval methods [9]. The retrieval method based on visual space obtains textual information from the features of similar regions in the image. In the image retrieval dataset established by [10], each image is described with appropriate words. Reference [11] proposed a large dataset that includes attribute annotations of objects, which can be used to train attribute classifiers and predict object attributes, improving the quality of image text descriptions. The retrieval method based on multimodal space performs a multimodal representation of all images and text sentences in the training corpus. For the image to be tested, retrieval is performed in the multimodal space after the image and the text are jointly mapped: first, a set of images similar to the test image is obtained, and then the text description for the test image is derived from the descriptions of the similar images. In [12], the authors proposed learning a multimodal space representation, using a kernel function to extract high-dimensional image features, using a ranking algorithm over these features in the jointly represented image-text space to find candidate texts for the set of similar images, and finally screening the candidate texts with a sorting algorithm to obtain the text description corresponding to the image. Reference [13] applied neural networks, with their stronger expressive ability, to the field of image text description, making the descriptions generated in the multimodal space more accurate and of higher quality. The retrieval-based method makes full use of the dataset, but the generated image text description largely depends on it: when the gap between the target image and the training dataset is large, the quality of the generated description suffers, and the method can only output human-annotated sentences that already exist in the dataset. Retrieval-based image description methods have good expressiveness, transferability, and practicality, and some excellent results have been obtained, but the method still has strong dependencies. The produced text description relies heavily on the training corpus, the complexity is high, and the results are seriously affected by human intervention, which makes the generated sentences simple and ineffective. To generate descriptions with richer semantic information, researchers continue to explore new text description methods [14]. With the spread of deep learning, researchers have proposed new methods based on it; the advanced and commonly used approach is the end-to-end model.
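As a concrete illustration of the end-to-end idea that the next paragraph elaborates, the following is a minimal, hypothetical sketch of a CNN encoder paired with an LSTM decoder in PyTorch. The ResNet-18 backbone, layer sizes, and dummy shapes are assumptions chosen for brevity, not the exact networks used in the cited papers (which use, e.g., Inception V3 or region-based CNNs).

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN encoder: maps an image to a fixed-length feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        backbone = models.resnet18(weights=None)  # illustrative stand-in backbone
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        x = self.features(images).flatten(1)  # (batch, 512)
        return self.fc(x)                     # (batch, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder: generates a word sequence conditioned on image features."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)  # per-step vocabulary logits

# Hypothetical usage with dummy shapes.
encoder, decoder = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 15))
logits = decoder(encoder(images), captions)  # (4, 16, 10000)
```

In practice the encoder is pretrained on a large image classification dataset and the decoder is trained to maximize the likelihood of reference captions; attention variants feed a weighted image feature into the LSTM at each step instead of only at the start.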
On the one hand, deep convolutional neural networks can be used to model object features in images; on the other hand, recurrent neural networks can be used to build language models for text [15]. Reference [16] proposed a deep semantic alignment model that matches images and text descriptions by aligning them. On the image side, a region-based convolutional neural network is used for pretraining, and the image features are mapped into the word-vector feature space so that the two kinds of features can be matched; on the language-model side, descriptions are generated with recurrent networks. The NIC model based on an English dataset, proposed in [17], uses the Inception V3 network to extract image features and an LSTM to generate descriptions in the language generation stage; this method plays an important role in the image description task. Reference [18] proposed hard and soft attention mechanisms, combined the attention mechanism with the LSTM network to obtain the image information at each step, and improved the expressive ability of image description.

Camera calibration has two important functions. One is to construct the relationship between image plane coordinates and camera space coordinates, which is explained in detail in the calibration principle and the principle of binocular stereo imaging. The other is to construct the rotation and translation relationship between one camera coordinate system and the others. As shown in Figure 1, there are two coordinate systems in space, (o-x-y-z) and (O-X-Y-Z), with o and P being their respective origins. There is a straight line between the two origins, and the same line forms different vectors in the two coordinate systems. The points o and P are both (0, 0, 0) in their own coordinates; the vector from o to P is $(x_1, y_1, z_1)$, which can also be interpreted as the coordinate of P in the (o-x-y-z) system. In the same way, the vector from P to o is $(x_2, y_2, z_2)$, i.e., the coordinate of o in the (O-X-Y-Z) system is $(x_2, y_2, z_2)$. Since the corresponding axes of the two coordinate systems point in different directions, these two vectors are not simply opposite vectors, but the length of the line segment is unchanged. Therefore, the magnitudes of the two vectors are equal, as shown in the following formula:

$$\sqrt{x_1^2 + y_1^2 + z_1^2} = \sqrt{x_2^2 + y_2^2 + z_2^2}. \quad (1)$$

Two coordinate systems whose origins do not coincide cannot be compared by rotation alone. Therefore, we first translate the (o-x-y-z) coordinate system along the vector $(x_1, y_1, z_1)$ until its origin coincides with the origin of the (O-X-Y-Z) system. As shown in Figure 2, to make the two coordinate systems coincide completely, the o-x-y-z coordinates are rotated around the origin, with the counterclockwise direction taken as the direction of increasing rotation angle. The process is divided into the following three steps. The first step keeps the z-axis stationary and rotates the x-axis and y-axis counterclockwise around the z-axis by an angle α.
At this time, the x-axis reaches the position of the N-axis, and a rotation $R_z(\alpha)$ is generated, whose matrix is shown in the following formula:

$$R_z(\alpha) = \begin{pmatrix} \cos\alpha & \sin\alpha & 0 \\ -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}. \quad (2)$$

In the second step, the x-axis (now the N-axis) is kept stationary, and the y-axis and z-axis rotate counterclockwise around it by an angle β. The z-axis then coincides with the Z-axis, and a rotation $R_x(\beta)$ is generated, whose matrix is shown in the following formula:

$$R_x(\beta) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\beta & \sin\beta \\ 0 & -\sin\beta & \cos\beta \end{pmatrix}. \quad (3)$$

The third step again keeps the z-axis stationary and rotates the x-axis and y-axis counterclockwise around it by an angle γ. The x-axis then coincides with the X-axis, the y-axis coincides with the Y-axis, and a rotation $R_z(\gamma)$ is generated, whose matrix is shown in the following formula:

$$R_z(\gamma) = \begin{pmatrix} \cos\gamma & \sin\gamma & 0 \\ -\sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix}. \quad (4)$$

The whole process consists of three successive rotations, generating a rotation matrix R, where R is the product of $R_z(\gamma)$, $R_x(\beta)$, and $R_z(\alpha)$:

$$R = R_z(\gamma)\, R_x(\beta)\, R_z(\alpha). \quad (5)$$

The two vectors $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ in the two camera coordinate systems can be converted into each other under the action of the rotation matrix R, as shown in formula (6):

$$(x_2, y_2, z_2)^T = R\,(x_1, y_1, z_1)^T. \quad (6)$$

It can be seen that the rotation-translation matrix is invertible, and the inverse of the rotation matrix is its own transpose:

$$R^{-1} = R^T. \quad (7)$$

When multiple cameras are calibrated, because target positioning has finite accuracy and the global error is minimized, as shown in formula (8) there is a very small error in the external parameters, and the same point corresponds to different coordinates in the two camera coordinate systems. The coordinate of this point in one camera coordinate system is $(x_m, y_m, z_m)$; the other camera coordinate system is converted into this one through the calibrated rotation and translation matrix, giving the corresponding coordinate $(x_n, y_n, z_n)$. Because it is the same point in real space, the two coordinates should coincide in the unified coordinate system; if they do not, there must be an error in the rotation and translation matrix, and this error needs to be analyzed. If the origins of the camera coordinate systems coincide after the transformation, then, since the relative positions of points in the same coordinate system do not change under rotation, the distances of the two points from the origin should be equal. The point coordinates can then be regarded as vectors whose moduli are equal, as shown in the following formula:

$$\sqrt{x_m^2 + y_m^2 + z_m^2} = \sqrt{x_n^2 + y_n^2 + z_n^2}. \quad (9)$$

When the origins coincide, there is no need to retranslate the camera coordinate system, and only the parameters of the rotation matrix need to be computed accurately; finally, the translation vector is transformed once using (6) to reduce the error. There is thus a very small angular rotation between the two coordinate systems, and the camera coordinate system to be aligned is rotated by three angles in turn, following the rotation scheme of Figure 2. We assume that the vectors $(x_m, y_m, z_m)$ and $(x_n, y_n, z_n)$ form a small angle θ. The relationship between the included angle θ and the two vectors is shown in the following formula:

$$\cos\theta = \frac{x_m x_n + y_m y_n + z_m z_n}{x_m^2 + y_m^2 + z_m^2}. \quad (10)$$

When the included angle θ is very small and approaches 0, θ and sin θ are approximately equal, and their geometric meaning is the ratio of the distance between the two points to the vector modulus.
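The rotation construction above can be checked numerically. The following is a small sketch, assuming NumPy and the Z-X-Z rotation order derived here; the angle values and the test point are arbitrary.

```python
import numpy as np

def rot_z(a):
    """Counterclockwise axis rotation about z, in the form of formulas (2)/(4)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(b):
    """Counterclockwise axis rotation about x, in the form of formula (3)."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])

alpha, beta, gamma = 0.10, 0.05, -0.08           # arbitrary small angles (radians)
R = rot_z(gamma) @ rot_x(beta) @ rot_z(alpha)     # formula (5)

# Formula (7): a rotation matrix's inverse is its transpose.
assert np.allclose(R.T @ R, np.eye(3))

# Angle between a point and its slightly rotated image, as in the error analysis:
# for small theta, theta ~ sin(theta) ~ |p - q| / |p| (chord over modulus).
p = np.array([1.0, 2.0, 3.0])
q = R @ p
theta = np.arccos(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
print(theta, np.linalg.norm(p - q) / np.linalg.norm(p))
```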
Because the camera's viewing angle is limited, the cameras are arranged relatively far away in order to obtain a wider field of view, and the modulus of the camera coordinate vectors increases accordingly. Moreover, the common field of view formed by binocular imaging is further compressed, and the measurement area is divided into multiple segments, so converting measurement points from other camera coordinate systems into the global coordinate system can introduce large errors. To avoid such errors, the external calibration parameters must be further optimized. Another case arises when the lengths of the two vectors $(x_m, y_m, z_m)$ and $(x_n, y_n, z_n)$ are not equal. Then the relationship between the included angle θ and the two vectors is shown in the following formula:

$$\cos\theta = \frac{x_m x_n + y_m y_n + z_m z_n}{\sqrt{x_m^2 + y_m^2 + z_m^2}\,\sqrt{x_n^2 + y_n^2 + z_n^2}}. \quad (11)$$

In this case, the camera coordinate system error is caused not only by the rotation angle but also by the translation vector error. The allowable range of this error should take into account both the difference between the lengths of the two vectors and the distance between the two measurement points. In general, optimizing the overall camera parameters during measurement rarely causes a large change in the translation vector; that is, the distance between the origins of the two coordinate systems will not be very large. Therefore, the vector length difference will not exceed the measurement spread. The conversion relationship between two coordinate systems is calculated from the geometric relationship between the coordinate systems of the cameras when constructing a large field of view in a multicamera system. Global calibration calibrates the overall measurement system to obtain the conversion relationships between all camera coordinate systems and the reference coordinate system. Taking a two-station measurement system composed of four cameras as an example, the multicamera global calibration process is described with reference to Figure 3. From the definition of each coordinate system, the transformation relationships between the camera coordinate systems of the measurement system are as follows. The transformation between the A camera coordinate system and the B camera coordinate system of measurement system 1 is

$$(x_B, y_B, z_B)^T = R_{BA}\,(x_A, y_A, z_A)^T + T_{BA},$$

and the conversion of the C camera coordinate system of measurement system 2 to the D camera coordinate system is

$$(x_D, y_D, z_D)^T = R_{DC}\,(x_C, y_C, z_C)^T + T_{DC}.$$

In the above, $R_{BA}$ and $T_{BA}$, $R_{DC}$ and $T_{DC}$ can be solved directly by the camera calibration principle, and the R and T matrices give the inverse transformation between the two coordinate systems through formulas (6) and (7). In order to realize the coordinate conversion between the two measurement systems, the rotation matrix $R_{CA}$ and translation vector $T_{CA}$ between the A camera coordinate system and the C camera coordinate system are solved directly by calibration. They can also be solved indirectly through the rotation matrix $R_{CB}$ and translation vector $T_{CB}$ from the C and B camera calibrations, or by constructing the rotation matrix $R_{DA}$ and translation vector $T_{DA}$ from the D and A camera calibration parameters.
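The indirect (chained) solution for $R_{CA}$ and $T_{CA}$ can be expressed compactly with 4x4 homogeneous transforms. The sketch below is illustrative only, assuming NumPy; the extrinsic values are placeholders, not calibration results.

```python
import numpy as np

def to_homogeneous(R, T):
    """Pack a rotation matrix R and translation vector T into a 4x4 transform."""
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = T
    return H

# Hypothetical calibrated extrinsics (identity rotations, arbitrary offsets).
R_BA, T_BA = np.eye(3), np.array([0.5, 0.0, 0.0])   # A -> B (station 1)
R_CB, T_CB = np.eye(3), np.array([0.0, 0.3, 0.0])   # B -> C (link between stations)

# Composing A -> B -> C gives the indirect solution for A -> C described above.
H_CA = to_homogeneous(R_CB, T_CB) @ to_homogeneous(R_BA, T_BA)
R_CA, T_CA = H_CA[:3, :3], H_CA[:3, 3]

# Inverse transform via formula (7): R^-1 = R^T, and the translation becomes -R^T T.
H_AC = np.eye(4)
H_AC[:3, :3] = R_CA.T
H_AC[:3, 3] = -R_CA.T @ T_CA
assert np.allclose(H_AC @ H_CA, np.eye(4))
```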
The same target points exist in the common field of view, and the relationship between multiple target points $P_A(x_{Ai}, y_{Ai}, z_{Ai})$ in the A camera coordinate system and the corresponding target points $P_C(x_{Ci}, y_{Ci}, z_{Ci})$ in the C camera coordinate system, where $i = 1, 2, 3, \ldots$, is constructed as

$$(x_{Ci}, y_{Ci}, z_{Ci})^T = R_{CA}\,(x_{Ai}, y_{Ai}, z_{Ai})^T + T_{CA}.$$

The parameters of the transformation matrix and translation vector that satisfy all the target points are then solved. If A is a matrix, $A^T$ denotes its transpose; similarly, $\|A\|_2$ and $\|A\|_\infty$ denote the spectral norm and row-sum norm of A, respectively. The singular values of A are computed, the largest of which is the spectral norm of A; the sums of the absolute values of the elements in each row of A are computed, the largest of which is the row-sum norm of A. Let f be a functional relationship that maps a parameter vector $p \in \mathbb{R}^m$ to an estimated measurement vector $\hat{x} = f(p)$. Given an initial parameter estimate $p_0$ and a measurement vector x, we wish to find the vector $p^+$ that best satisfies the functional relationship f, that is, that minimizes the squared distance $\varepsilon^T \varepsilon$, where $\varepsilon = x - \hat{x}$. The basis of the LM algorithm is a linear approximation of f in the neighborhood of p: for a small $\|\delta_p\|$, the Taylor series expansion gives

$$f(p + \delta_p) \approx f(p) + J\,\delta_p,$$

where J is the Jacobian matrix $\partial f(p)/\partial p$. Like all nonlinear optimization methods, LM is iterative: starting from a chosen point $p_0$, the method generates a series of vectors $p_1, p_2, p_3, \ldots$ that converge to a local minimizer $p^+$ of the functional relation f. At each step, it is necessary to find a suitable value $\delta_p$ that minimizes $\|x - f(p + \delta_p)\| \approx \|x - f(p) - J\delta_p\| = \|\varepsilon - J\delta_p\|$, so finding $\delta_p$ is a linear least-squares problem. The minimum is reached when $J\delta_p - \varepsilon$ is orthogonal to the column space of J, which gives $J^T(J\delta_p - \varepsilon) = 0$, and the resulting $\delta_p$ is the solution of the normal equations

$$J^T J\,\delta_p = J^T \varepsilon.$$

The matrix $J^T J$ on the left side of the equation is an approximation to the Hessian matrix, that is, to the matrix of second derivatives. The LM method actually solves a slight variation of this equation, called the augmented normal equations,

$$N\,\delta_p = J^T \varepsilon,$$

where the off-diagonal elements of N are the same as the corresponding elements of $J^T J$ and its diagonal elements are $N_{ii} = \mu + [J^T J]_{ii}$ with $\mu > 0$. The process of changing the diagonal elements of $J^T J$ is called damping, and μ is the damping factor. In the iterative process, if the new parameter vector $p + \delta_p$, with $\delta_p$ computed from the equation above, reduces the error ε, the step brings the parameter vector closer to the optimal solution, and the iteration is repeated with a reduced damping value. If the new parameter vector $p + \delta_p$ makes the error ε larger or leaves it unchanged, the damping value is increased, the augmented normal equations are solved again, and the algorithm iterates until a value $\delta_p$ that reduces the error is found. The solution is repeated for different damping factors μ until an acceptable update $p + \delta_p$ is found; this update process corresponds to one iteration of the LM algorithm. In the LM algorithm, the damping factor μ is adjusted during each iteration to ensure that the error can be reduced.
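Before turning to the termination criteria, the iteration just described can be summarized in a short sketch. This is a minimal illustrative implementation of the damped normal-equation update, not the exact optimizer used by the authors; the toy exponential model, the damping schedule, and all constants are assumptions.

```python
import numpy as np

def levenberg_marquardt(f, jac, p0, x, mu=1e-3, k_max=100, tol=1e-10):
    """Minimal LM sketch: solve the augmented normal equations
    (J^T J + mu*I) delta = J^T eps, accept the step only if it reduces the error."""
    p = p0.copy()
    for _ in range(k_max):
        eps_vec = x - f(p)                       # current residual
        J = jac(p)
        N = J.T @ J + mu * np.eye(p.size)        # damped (augmented) normal matrix
        delta = np.linalg.solve(N, J.T @ eps_vec)
        new_res = x - f(p + delta)
        if new_res @ new_res < eps_vec @ eps_vec:
            p, mu = p + delta, mu * 0.5          # accept step, relax damping
        else:
            mu *= 10.0                           # reject step, increase damping
        if np.linalg.norm(delta) < tol:          # step-size termination criterion
            break
    return p

# Toy usage: fit y = a * exp(b * t) to noiseless synthetic data.
t = np.linspace(0.0, 1.0, 20)
f = lambda p: p[0] * np.exp(p[1] * t)
jac = lambda p: np.column_stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)])
p_hat = levenberg_marquardt(f, jac, np.array([0.5, 0.5]), f(np.array([2.0, -1.3])))
print(p_hat)  # approaches [2.0, -1.3]
```

Note that this sketch uses $N = J^T J + \mu I$, the simplest additive damping described above; a full implementation would also re-solve within one iteration for increasing μ until a descent step is found.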
If the damping factor is set to a large value, the matrix N in the augmented normal equations is nearly diagonal, and the LM update step $\delta_p$ is close to the direction of steepest descent; the magnitude of $\delta_p$ also decreases in this case. Damping additionally handles the case of a rank-deficient Jacobian, in which $J^T J$ is singular. The LM algorithm terminates when at least one of the following conditions is met: (1) the magnitude of the gradient $J^T \varepsilon$ on the right side of the normal equations drops below a threshold $\varepsilon_1$; (2) the relative change in the step $\delta_p$ falls below a second threshold $\varepsilon_2$; (3) the error $\varepsilon^T \varepsilon$ falls below a third threshold $\varepsilon_3$; (4) the maximum number of iterations $k_{max}$ is reached. If a covariance matrix $\Sigma_x$ for the measurement vector x is available, $\Sigma_x^{-1}$ and the norm $\varepsilon^T \Sigma_x^{-1} \varepsilon$ can be incorporated into the LM algorithm by minimizing this weighted norm instead of $\varepsilon^T \varepsilon$; a weighted least-squares problem, defined by the correspondingly weighted normal equations, is then solved.

4. Research on Multicamera Photography Image Art in BERT Motion Based on the Deep Learning Model

The image sentiment analysis model with sample selection and image content generation based on BERT features includes four parts: "image content generation," "text feature extraction," "sample selection based on BERT features," and "image sentiment analysis." The image is processed through deep learning, as shown in Figure 4; the specific process of image content generation is shown in Figure 5. Figure 6 shows the flowchart of the binocular stereo vision measurement procedure. The binocular cameras are synchronously triggered to capture a frame each and transfer it to computer memory through the USB 3.0 interface. The program then processes the image data through a series of tasks such as marker point detection and matching of identically named image points, solves the pixel coordinates of the marker centers, and performs 3D reconstruction to obtain the coordinates of the markers in physical space and output the data. According to the actual measurement needs, a CCD camera bracket, a fixed platform, lighting equipment, and other auxiliaries are added to the system. The connection diagram of the main part of the system is shown in Figure 7. The camera coordinate system and the imaging plane coordinate system are established as shown in Figure 8, and the world coordinate system in this section is placed on the target plane containing low-rank textures. For convenience of description, two definitions are made first. When the camera imaging plane coordinate system has no rotation relative to the $X_w O_w Y_w$ plane of the world coordinate system, the image captured at this time is called a facing image. When there is some unknown rotation and translation between the camera imaging plane coordinate system and the $X_w O_w Y_w$ plane, the image captured at this time is called an obliquely captured image. Obviously, compared with the original low-rank texture image, the facing image has no deformation, only scaling, and still retains the low-rank characteristic, whereas an image shot at an angle no longer retains the low-rank property because of projection distortion. On the basis of the above research, the system model proposed in this article is verified, the expressive effect and image art of the multicamera images are evaluated, and the results shown in Tables 1 and 2 are obtained.
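For concreteness, the "text feature extraction" part of the pipeline described above can be sketched with the HuggingFace transformers library. The checkpoint name, the use of the [CLS] hidden state as a sentence vector, and cosine similarity as the sample-selection signal are all assumptions, since the paper does not specify these details.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the paper does not name the exact BERT variant it uses.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_features(sentences):
    """Encode sentences into fixed-length vectors via the [CLS] hidden state."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :]  # (batch, 768) [CLS] embeddings

feats = bert_features(["generated caption for image one",
                       "generated caption for image two"])
# Cosine similarity between feature vectors could drive the sample-selection step.
sim = torch.nn.functional.cosine_similarity(feats[0], feats[1], dim=0)
print(feats.shape, float(sim))
```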
From the above research, it can be seen that the proposed method of multicamera photography image art in BERT motion based on the deep learning model can effectively improve the expressive effect of image art. Photography is a visual art, but its expression must be carried out through form, and different forms bring completely different visual experiences. Among the many forms of expression, order, as one of the visual art effects, cannot be underestimated. "Ordering" as a guideline in graphic design can make a design more organized and normative; similarly, for photography, ordering gives a photographic work a different formal meaning and a strong sense of design order. The formal expression of order in photography has many aspects, such as symmetry and balance, repetition, and gradual change. These forms of order are often used in graphic art design and can give the designed picture visual experiences of different orders. This article combines the deep learning model to conduct multicamera photographic image art research in BERT motion. The experimental results show that the proposed research method can effectively improve the expressive effect of image art.

Data Availability: The labeled datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest: The authors declare no conflicts of interest.

References
[1] Efficient convolutional neural networks for depth-based multi-person pose estimation.
[2] Multi-person pose estimation using bounding box constraint and LSTM.
[3] Fast and accurate whole-body pose estimation in the wild and its applications.
[4] Body part extraction and pose estimation method in rowing videos.
[5] Multi-person hierarchical 3D pose estimation in natural videos.
[6] Realtime multi-person 2D pose estimation.
[7] An evaluation of pose estimation in video of traditional martial arts presentation.
[8] Deep probabilistic human pose estimation.
[9] Multipath affinage stacked-hourglass networks for human pose estimation.
[10] Portable 3D human pose estimation for human-human interaction using a chest-mounted fisheye camera.
[11] VNect.
[12] Human pose estimation in video via structured space learning and halfway temporal evaluation.
[13] Multiple human 3D pose estimation from multiview images.
[14] Hierarchical contextual refinement networks for human pose estimation.
[15] A multi-stage convolution machine with scaling and dilation for human pose estimation.
[16] Rescue method based on V2X communication and human pose estimation.
[17] Action recognition using deep convolutional neural networks and compressed spatio-temporal pose encodings.
[18] DTCoach: your digital twin coach on the edge during COVID-19 and beyond.