title: An Artificial Intelligence of Things-Based Picking Algorithm for Online Shop in the Society 5.0's Context authors: Muslikhin, Muslikhin; Horng, Jenq-Ruey; Yang, Szu-Yueh; Wang, Ming-Shyan; Awaluddin, Baiti-Ahmad date: 2021-04-16 journal: Sensors (Basel) DOI: 10.3390/s21082813 In this study, an Artificial Intelligence of Things (AIoT)-based automated picking system is proposed for the development of an online shop and its automated shipping services. Speed and convenience are two key points in Industry 4.0 and Society 5.0. In the context of online shopping, speed and convenience can be provided by integrating e-commerce platforms with AIoT systems and robots that follow consumers' needs. Therefore, the proposed system serves consumers through AIoT, while a robotic manipulator replaces the human picking task. To prove this idea, we implemented a modified YOLO (You Only Look Once) algorithm as a detection and localization tool for items purchased by consumers. The modified YOLOv2, operated in a data-driven mode, is used for taking goods from unstructured shop shelves. Experiments show that the system meets expectations for efficiency, speed, and convenience in the context of Society 5.0. It is quite challenging to ensure that the massive implementation of artificial intelligence (AI) in online shopping takes effect. Currently, most AI-dominated systems are deployed in large-scale industries such as Amazon, Walmart, BMW, and Mercedes-Benz [1] [2] [3]. Separately, Industry 4.0, which includes techniques of AI, the Internet of Things (IoT), robots, and online shopping, has shown tremendous benefits. This is especially true during the COVID-19 pandemic, when physical contact is largely avoided.
In other words, Industry 4.0 content should be accelerated to meet the needs of people who tend to prioritize comfort and safety. Since Society 5.0 was introduced in Japan, IoT has become the backbone of the related systems [4]. Together with the rise of AI in the last decade, the combination of IoT and AI gives rise to the terminology Artificial Intelligence of Things (AIoT), whose basic principle is a device that thinks like a human with the support of an internet network [5] [6] [7]. Meanwhile, Society 5.0 takes maximum benefit from AIoT in daily life [8,9]. Many realities are in line with the concept of Society 5.0, such as shopping. Shopping itself has undergone several transformations, for instance online shopping, order delivery using applications, electronic payments, limited shopping served by machines (vending machines), and online shopping facilitated by AIoT. Unfortunately, the integration of AIoT into online shopping has not been widely reported. Some AIoT approaches target other applications [10]; two studies [11,12] have introduced AIoT for drug screening and hair health diagnostics. Interface (REST API) reply time and internet data traffic density are obstacles for such systems [34]. This work focuses on image processing and on AIoT combined with an industrial robot that picks exactly the products purchased by consumers. In this paper, the pick-up accuracy and the speed of the retrieval service in a stacked shop-shelf environment are the focal points. Online shopping services for consumers and for the offline shop side are proposed and completed. The following details are given: the REST API is operated to get a reply from the online shop that depends on the last transaction, and a selective data-driven mode fed by "data/last_transaction" data for YOLOv2 is proposed.
The shelf-collision obstacle for manipulators in shop shelves is also weighed. This problem is solved by a modified selective YOLOv2 technique that classifies the shelf edges as a forbidden point cloud so that every shelf edge is avoided. We target specific robotic manipulator conditions; the AIoT-based picking algorithm is implemented and evaluated, providing a reference for eye-in-hand manipulator systems concerning Society 5.0 in terms of comfort and safety. In this paper, Section 2 addresses the system design. Section 3 introduces the online shopping platform up to the point where the system sends the last transaction to the offline shop, and Section 4 describes how the robot with the AIoT-based picking algorithm executes it. Experimental results are shown in Section 5. Finally, Section 6 concludes the work and gives suggestions for potential future work. Our system consists of two parts: online shop platforms and offline shops equipped with manipulator robots. The online shop provides the usual functions so that consumers can shop via an app or website. Before being able to shop, potential customers must register first, providing name, mobile number, address, city, and zip code. After buying and completing payment, consumers are asked to upload proof of payment. On the online shop, an admin does not need to verify it manually because the system is set to a data-driven mode. On the offline shop side, the manipulator robot with an eye-in-hand structure receives a REST API request-reply shortly after the administrator validates the payment. One data packet is sent to the client's PC and the robot grasps the purchased products. The architecture of the proposed AIoT is shown in Figure 1. Instance segmentation in the YOLOv2 network is used to identify products stacked on a shelf; the recognized product then undergoes a safe-operation check in XYZ coordinates and a collision-side detection algorithm with deep learning to obtain an accurate and secure final position.
The proposed algorithm involves identifying the product as well as picking around the obstacle edges of the shop-shelf environment in the 2D drawing. In Figure 1, the green blocks refer to the selective localization of products, the orange blocks correlate with the online shop platform, and the white blocks represent the deep learning process. These two goals coordinate with each other to complete the product-picking mission so that the gripper avoids collision; hence, the procedures associated with both objectives are shaded in cyan. The picking algorithm is explained and determined in Section 4. A website-based online shop has been developed at https://indoaltantis.com (accessed on 15 October 2020). This online shop includes three kinds of users: customers, staff, and administrator. To provide an app version, we converted it to an app form for the Android and iOS platforms. Only minimal features are implemented in the online shop because the focus of development is the AIoT picking algorithm. These features include displaying offers, payments, verification, and stock updates. As previously mentioned, potential customers must register in the shop to be served by the system. This online shop uses the Laravel PHP framework with a MySQLi database and has adopted the transport layer security (TLS) protocol as its standard security. The characteristics of the products sold in our shop vary widely in size, weight, packaging type, and packaging form, so every object requires a different treatment; for example, taking instant noodles is different from taking sardine cans. Details of the products are presented in Table 1. In addition to viewing the purchased products in the proposed Android-based mobile app, a customer's records stored on the cloud-based management platform can also be visited through a website, as shown in Figure 2.
The context of Society 5.0 [34] is specifically a way to achieve the Sustainable Development Goals (SDGs), and it is composed of several cutting-edge technologies. These include 17 basic techniques such as IoT, big data, AI, drones, robots, 3D printing, sensors, sharing on demand, mobile, edge, cloud, 5G, public key infrastructure (PKI), virtual reality (VR), augmented reality (AR), and mixed reality (MR). They can be grouped into two parts, information and cyber-physical system (execution), like a coin with two sides: accuracy and speed. Accuracy is closely related to precision; in testing there are four possibilities: accurate-precise, accurate-imprecise, inaccurate-precise, and inaccurate-imprecise. In our system, accuracy covers recognition, localization, and retrieval by robots, whereas speed refers to the duration of the REST API reply, the speed at which the purchased product is recognized, and the overall retrieval by the system [35,36]. The key to implementing Society 5.0 lies in Science, Technology, and Innovation (STI), creating new value through Cyber-Physical Systems (CPS), and multidisciplinary collaboration, as we apply in this study [35]. The results of the accuracy and speed evaluation are discussed in Section 5. The website and app versions of the online shop have been developed, but data communication with the offline shop is also necessary. In this process, we take advantage of the REST API, a communication architecture using the HTTP protocol for data exchange [31]. Meanwhile, the web service API works at the operating-system level to help applications communicate with the base layer and with each other according to protocols. The reply to the REST API request on our web server contains explicit details of the purchase of goods. The raw data for item purchases include 27 parameters to be parsed according to the minimum requirements for robot tasks in offline shops.
The primary data contain the buyer name, the product name, the purchased quantity, the full address, the zip code, and the telephone number. The REST API mechanism is shown in Figure 3. Broadly, Figure 3 reveals that the data flow from the customer to the server is forwarded to the client robot. In fact, the client robot is not directly connected to the internet by a wired or wireless connection; a middleware controller connects it to an internet-facing computer. The server side is designed as lightly as possible, with a page size < 5 MB and a reply speed < 5.3 s. This requires an entity-relationship diagram (ERD), as shown in Figure 4. Eight entities in the database, staff, consumer, product, order, payment, ordered product, robot, and session, are connected. With the session containing the system IP address, the design supports observing the locations of robots/offline shops on a distribution map. CNN, RCNN, Fast RCNN, and Faster RCNN are trained to perform bounding-box classification and regression simultaneously [23]. In practice, YOLO runs much faster than Faster RCNN due to its simpler architecture; the YOLO architecture is inspired by GoogLeNet, see Figure 5. As explained in the introduction, various detection techniques are valid, reliable, and fast, and recognition results are generally followed up with grasping or other commands [16] [17] [18] 32,37]. Moreover, the estimation of the purchased product's position in 3D space is expected to be as valid as possible. Although it is one of the most popular recognition methods, YOLOv3 [20,38] not only is computationally heavy but also needs a GPU with high computing ability for the Compute Unified Device Architecture (CUDA). Because this development is to be implemented in small and medium enterprise (SME) shops, we decided to utilize the onboard GPU and YOLOv2, not YOLOv3.
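The parsing step described above, which reduces the 27-parameter REST API reply to the six fields the robot needs, can be sketched as follows. This is an illustrative sketch only: the field names and the sample payload are hypothetical, since the paper does not publish the reply schema.

```python
import json

# Hypothetical last-transaction reply; the real reply carries 27 parameters,
# of which only six are needed for the offline-shop robot task.
REPLY = '''{
  "buyer_name": "A. Customer",
  "product_name": "Apple Yogurt",
  "quantity": 2,
  "address": "Jl. Example No. 1",
  "zip_code": "55281",
  "phone": "+62-812-0000-0000",
  "other": "remaining parameters omitted"
}'''

def parse_last_transaction(raw: str) -> dict:
    """Keep only the fields the robot task requires (names assumed)."""
    data = json.loads(raw)
    keys = ("buyer_name", "product_name", "quantity",
            "address", "zip_code", "phone")
    return {k: data[k] for k in keys}

task = parse_last_transaction(REPLY)
```

In the actual system this dictionary would then be forwarded over the middleware controller to the robot client.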
Even though we use YOLOv2, we modify it to be faster and more accurate than the original version, applying a data-driven mode for the selected detector. In short, with only specific anchors used, there is an opportunity to carry out the recognition process twice, so that the product depth on the shop shelves can be validated before the robot reaches for it. To recognize purchased products, YOLOv2 needs a training stage as in common deep learning. In the training process, image datasets are labeled with the MATLAB Image Labeler App. There are nine classes of RoIs (Regions of Interest), namely ABC Ketchup, British Milk Tea, Gao Vermicelli, Instant Cup Noodle, Apple Yogurt, Soto Noodle, Lageo Wafers, Master Sardines, and Tai Lemon Tea, generated from around 1500 images. The YOLOv2 feature extraction layer is most effective when the output feature width and height are between 8 and 16 times smaller than those of the input image; this amount of downsampling is a trade-off between spatial resolution and feature quality. Note that selecting the optimal feature extraction layer requires empirical evaluation, which means labeling requires accuracy. To identify object classes in an image, YOLOv2 uses anchor boxes and predicts three attributes for each anchor box: (1) an intersection-over-union (IoU) objectness score, (2) anchor box offsets that refine the anchor box location, and (3) the class probability that predicts the class label assigned to the anchor box. The YOLOv2ReorgLayer function in Figure 5 generates a YOLOv2ReorgLayer object that represents the reorganization layer for the YOLOv2 object detector network. By stacking adjacent features into separate channels, the reorganization layer rearranges the high-resolution feature maps from a lower layer. The output of the reorganization layer is fed to the depth concatenation layer.
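The IoU score mentioned in attribute (1) above measures how well a predicted box matches a ground-truth or anchor box. A minimal sketch, assuming the MATLAB-style [x, y, width, height] box convention used elsewhere in the paper:

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, w, h] boxes (pixel units)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # overlap extent along each axis, clamped at zero when boxes are disjoint
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

With this definition, two identical boxes give IoU = 1.0, and the TP criterion of Section 5 (IoU > 0.5) can be checked directly.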
This feed bypasses many convolutional processes so that, as previously mentioned, the output is very fast. The depth concatenation layer concatenates the reorganized high-resolution features from a higher layer with the low-resolution features. The purchased products are the target objects, and the system does not detect non-targets, as in our previous work [17]; we believe that the system is robust enough to recognize targets. The selective detector strategy is the key to avoiding true negatives (TN) and false positives (FP) in the system, so that machine vision can make precise decisions during manipulation and the detection time is shortened. For instance, the detection results show bounding boxes with a confidence level for each item in the shelf environment; a detailed discussion of the shelf environment is given in Section 4. Several segmented targets were generated for the purchased products, in which one segment depicted a detected target through deep learning with the modified selective YOLOv2 detector. The bounding boxes were used to calculate centroid points and depths of the consumer product frame P in the camera frame C. Our depth-estimation workflow is shown in Figure 6. The depths were extracted from the double-checked bounding boxes, and the centroids were verified via SURF with a disparity map. Moreover, using the intrinsic camera parameters, the depths were converted from the target P-frame to the stereo-camera-like C-frame. The configuration of the stereo-camera-like setup is based on stereo vision, shown in Figure 7. A mono camera set on a parallel line x is fixed in the workspace, with O as the optical center. The baseline b is constructed by shifting this camera with focal length f along the x line. The projections of a given reference point P(X_p, Y_p, Z_p) are p_1(x_1, y_1) and p_2(x_2, y_2) in image plane 1 and image plane 2, respectively.
In perspective projection, the image coordinates of P in the two image planes are shown in Figure 7 to simplify the calculation [39]. In Figure 7, we assume only one camera for defining the object; after the camera shifts along the b line, both images exhibit parallax, and the Y axis is perpendicular to the page. From the theory of similar triangles, Equations (1) and (2) are obtained,

x_1 = f X_p / Z_p, (1)
x_2 = f (X_p - b) / Z_p, (2)

and Equation (3) shows the depth Z of point P,

Z = f b / d, (3)

where x_1 and x_2 are the pixel locations on the 2D images. The disparity d is the difference of the x coordinates in image 1 and image 2,

d = x_1 - x_2. (4)

After Z has been obtained, we can respectively obtain the X and Y coordinates of point P using the equations in (5),

X = x_1 Z / f, Y = y_1 Z / f, (5)

where X, Y, and Z are the actual positions in 3D space. It is assumed that the products sold in the offline shop are within the gripper's reach. For this reason, the centroid position of each detection result needs to be found. The easiest technique is to calculate the centroids from the bounding box, expressed in (6) below (following the MATLAB [x, y, width, height] convention):

B_box = [a_{n,1} a_{n,2} a_{n,3} a_{n,4}]. (6)

The bounding box matrix B_box has four columns a_{[1,...,4]} and the number of rows depends on the number of detected products a_{[n,4]} on each shelf. So we can find the centroid (X_cen, Y_cen) from Equations (7) and (8),

X_cen = a_{n,1} + a_{n,3} / 2, (7)
Y_cen = a_{n,2} + a_{n,4} / 2. (8)

From Equation (8), the centroid can be calculated and becomes the reference point for the gripper to pick the target. The centroid point in this condition is still in the 2D image, so the Z value must be added. The Z value is obtained from Equation (5) and verified by Equations (7)-(10). In the traditional Speeded-Up Robust Features (SURF) algorithm, a box filter is used to approximate the convolution of the 2D images [40,41]. The Gaussian second-order derivative can simplify the calculation and improve efficiency [23]. The SURF detector uses the Hessian matrix because of its good performance in computation time and accuracy.
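The triangulation and centroid computations above can be sketched directly, as a minimal illustration of Equations (3)-(8) (not the authors' MATLAB implementation):

```python
def stereo_point(x1, y1, x2, f, b):
    """Recover (X, Y, Z) of a point P from its two projections.
    x1, x2: pixel x-locations in image 1 and image 2; y1: pixel y in image 1;
    f: focal length in pixels; b: baseline. Uses d = x1 - x2 and Z = f*b/d."""
    d = x1 - x2                       # Equation (4)
    Z = f * b / d                     # Equation (3)
    return x1 * Z / f, y1 * Z / f, Z  # Equation (5)

def centroid(box):
    """Centroid of a [x, y, w, h] bounding box row, Equations (7)-(8)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0
```

For example, with f = 500 px, b = 50 mm, and projections x1 = 100, x2 = 50, the disparity is 50 px and the recovered depth is 500 mm.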
Given a point X = (x, y) in an image I, the Hessian matrix H(x, σ) at x at scale σ is defined as usual, where L_xx(x, σ) is the convolution of the Gaussian second-order derivative ∂²/∂x² g(σ) with the image I at point x, and L_xy(x, σ) and L_yy(x, σ) have similar definitions. The 9 × 9 box filters in Figure 8 are approximations of a Gaussian with scale σ = 1.2 and represent the lowest scale for computing the blob response maps; they are denoted by D_xx, D_yy, and D_xy. For computational efficiency, the weights applied to the rectangular regions are kept simple, as in Equation (10),

det(H_approx) = D_xx D_yy - (0.9 D_xy)². (10)

Figure 8. Left to right: the 9 × 9 box (discretized and cropped) Gaussian second-order partial derivatives in the y-direction. The grey regions are equal to zero.

To balance the expression for the Hessian's determinant, a relative weight w = 0.9 is applied to the filter responses. With the detector shown in Figure 8, the image to be examined must first be converted to grayscale. The matching points were increased to 100 points to boost precision, in the expectation that the mean precision would increase with a low standard deviation. SURF alone, however, was not accurate enough, so another technique, namely the disparity map, is needed to verify the matching points. The disparity map D(x, y) represents the displacement of corresponding pixels between the left and right images. However, locating corresponding pixels is difficult in reality; some variables may cause problems even in non-occluded pixels, such as non-textured homogeneous regions, repeated texture, and camera noise [42] [43] [44]. The disparity is calculated by block matching (BM) for all pixels, and the validity of the disparity is defined as follows. In Figure 9b, block matching is applied to the disparity map.
The disparities between the left image and the right image are derived from Equations (11) and (12), where ε_d^{R→L}(x, y) is the normalized block-matching error at horizontal disparity d, W is the window of the block matching, and D_max is the maximum disparity within the permissible limit. The disparity from the right image f_r to the left image frame f_l is then obtained to check the observed disparity, where u and v are the pixel indices in the xy camera image plane, respectively. The minimum matching error (MME) then measures how close the pixel values in the left image at (x, y) are to those in the right image at the disparity-shifted position (x + d, y); the MME is given in Equation (14). Figure 9 shows the sequence of the detection and localization process. In Figure 9a, the image looks shaded because it is a composite view of the left and right images with a baseline of 50 mm. Figure 9b shows the disparity-map result for Figure 9a obtained by applying the block-matching Equations (11)-(14). When pairs of images are captured, each image is separately used as YOLOv2 input for detection, and the detection results are shown in Figure 9c. Before the type of product and its position are finally known in Figure 9e, Figure 9d acts as a triangulator between the stereo-camera-like views in Equation (5). Due to the limited reach of the robotic manipulator, the store shelves need to be visually divided in half, left and right. In Figure 2, the robotic manipulator looks for the target on the left side first; if it does not find the purchased product, it changes to the right side. The dimensions of the half shelf are 622 mm × 785 mm × 200 mm, so those of the whole shelf are 1244 mm × 785 mm × 200 mm. Scanning the purchased products on the left and right depends on the purchase data; the REST API reply becomes the trigger for this modified YOLOv2 architecture.
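The block-matching search of Equations (11)-(14) can be sketched as follows. This is a simplified sum-of-absolute-differences version on plain pixel lists (the paper's exact normalization is not reproduced here), returning the disparity that minimizes the matching error:

```python
def block_match_row(left, right, x, y, win, d_max):
    """For pixel (x, y) of the left image, return the disparity d in
    [0, d_max] minimizing the block-matching error against the right image.
    left/right are 2D lists of grayscale values; win is the half-window size."""
    def cost(d):
        # sum of absolute differences over the (2*win+1)^2 window,
        # comparing left(x+dx, y+dy) with right(x+dx-d, y+dy)
        s = 0
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                s += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
        return s
    return min(range(d_max + 1), key=cost)  # MME: argmin of the error
```

For example, a bright feature shifted two pixels between the views yields disparity 2, which Equation (3) then converts to depth.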
If the purchased product has been found by the modified YOLOv2 on the left shelf, the gripper immediately moves to hold it. If it is not found on the left shelf, the gripper shifts to the right shelf and keeps detecting until the gripping process occurs. Algorithm 1 illustrates this: if the purchased product Pr is in the first half of the shelf and the depth in the camera world Z_C is known, a matching process is carried out so that the depth in the robot world Z_R can be determined.
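The left-then-right search of Algorithm 1 can be sketched as below. The `detect` callable is a hypothetical stand-in for the modified selective YOLOv2 detector and is assumed to return (confidence, bounding_box) pairs:

```python
def find_product(detect, product):
    """Sketch of Algorithm 1's shelf search: scan the left half shelf first,
    then the right; return the side and the highest-confidence hit, or None."""
    for side in ("left", "right"):
        hits = detect(side, product)
        if hits:
            best = max(hits, key=lambda h: h[0])  # highest confidence first
            return side, best
    return None

# A stub detector for illustration: the product sits on the right half only.
stub = lambda side, product: (
    [(0.91, (120, 80, 60, 90))] if side == "right" else [])
```

In the real system each `detect` call would also trigger the depth-matching step that maps the camera-world depth Z_C to the robot-world depth Z_R.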
Algorithm 1. Picking a product in the half-shelf system frame; input: the REST API data-driven reply.

The product dimensions reach up to 230 mm × 120 mm × 62 mm with a maximum weight of 532 g, while the packaging types include bottles, pouches, cups, cans, and carton boxes. We do not consider the target orientation because the XY target position is always upright relative to the end-effector in this study. The mono camera mounted on the end-effector perceives the product on the shop shelves in 2D coordinates. The coordinates of the purchased product in the camera frame C need to be converted into those in the end-effector frame E. Figure 10 shows the relationship among the frames, where P is the purchased-product frame and R is the robot base frame. Because a half-shelf system is adopted, two transformations are required, one each for the left and right shelves.

Figure 10. Eye-in-hand robot manipulator frames in an offline shop.

The transformation begins from the target in the shelf coordinate frame P to the camera coordinate frame C, then from C to the end-effector frame E, and continues to frame R as the robot base. The transformations between frames down to the links of the MELFA RV-3SD robotic manipulator are included in the robot controller. Let ^R P be the location of the purchased product Pr with respect to the R-frame and ^C P be its location in the C-frame. The transformation of the target coordinates from the P-frame to the R-frame can be expressed as in (15).
where ^R_P T is the transformation from the P-frame to the R-frame, i.e., from the target frame to the robot base frame. The ^C_B T depicted in Figure 10 can be obtained from camera calibration, while ^B_R T is known; based on the parameters of ^C_B T and ^B_R T, ^C_R T can be obtained. An offset of 280 mm is added when calculating the initial pose in the gripper frame, based on the Robotiq three-finger offset configuration, to ensure that the camera aims at the center of the products at the initial pose. The robotic manipulator is designed to move from O_1 to O_2 on the baseline as the initial position, perform the picking process, and place to the home position (movh ← gpc) until the gripped product is released into a box. The offline shop with a robotic manipulator needs to be aware of its shelf environment to prevent the manipulator and gripper from knocking over products and shelves. The robotic system also must verify the manipulation plan and accurately grasp the purchased products so that possible damage is suppressed [45]. After the selector receives the driven data, it immediately invokes the suitably selected YOLOv2 detector. The position of the products on the shelf is always upright, except for instant noodles and vermicelli; therefore, the target orientation is not considered in this study. Besides, we do not take the various forms of product packaging into account because the gripper is adapted to the target. Products on the shelf are not affected by the arrangement and combination of products on the same shelf; for example, a shelf containing two bottles of British Milk Tea, one yogurt, and one ABC Soy Sweet Ketchup is acceptable to the system as long as there is tolerance for the gripper. Uniform products on a shelf within one camera frame allow multi-detection and overlapping to occur. Although a random combination causes multi-detection, it is possible to suppress overlapping.
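The frame chaining described above composes homogeneous transforms from the product frame up to the robot base. A minimal sketch, using pure-translation 4 × 4 transforms with illustrative values (the real chain also carries rotations and the calibrated camera extrinsics):

```python
def matmul4(a, b):
    """4x4 homogeneous matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    """Pure-translation homogeneous transform (rotation omitted for brevity)."""
    return [[1, 0, 0, x],
            [0, 1, 0, y],
            [0, 0, 1, z],
            [0, 0, 0, 1]]

# Illustrative chain: product located in the camera frame, camera/gripper
# offset (the 280 mm Robotiq offset) on the end-effector, end-effector pose
# in the robot base frame. All numeric values here are assumptions.
T_RE = translation(0, 0, 400)     # end-effector in robot base frame
T_EC = translation(0, 0, 280)     # camera/gripper offset on end-effector
T_CP = translation(50, -30, 500)  # product in camera frame
T_RP = matmul4(matmul4(T_RE, T_EC), T_CP)  # product in robot base frame
```

The product's base-frame position is read from the last column of T_RP; in practice this is exactly the quantity the MELFA controller needs for the pick pose.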
We deliberately display all detection results, even though they could be filtered directly to the number of products purchased by the consumer. For instance, if a consumer bought two ABC Soy Sweet Ketchups, the detection results could show only the first and second highest confidence values. Displaying all detection results instead lets the robot keep a safe grasping position, and the safe position tends to avoid the edge of the shelf to prevent collisions. Figure 11 depicts the possible detections of the purchased products. Figure 11a shows a clear and safe position for picking inline products with the gripper. If the products are too tightly positioned or not inline, the bounding boxes may overlap, as shown in Figure 11b. Overlapping occurs due to slightly inaccurate or imprecise detection, but within the accepted limit, >0.85. In addition, overlapping can occur naturally because the products are physically arranged back to back (not inline), as shown in Figure 11c. In Figure 11d, an overlapping bounding box has occurred. Figure 11b-d occur because the products are close together and/or differ in depth, and Equations (6)-(8) are used to find the centroid of the bounding box, which is ultimately used as the localization reference. In Figure 11e, where the two product targets are the same and in a safe position for pick and place but have different confidence scores, the one with the higher value is grasped first using Algorithms 1 and 2 (stated in Section 4.2.3). To identify purchased products in the safe shelf positions shown in Figure 11, it is necessary to determine the distance between products based on their bounding boxes. Thereafter, the limited range of the gripper, 72 mm or 120 mm, is important for making a decision. The gap between one product and another can be so squashed that the gripper fingers cannot enter to grasp.
The worst risk is that if the gap between the products is too narrow, the gripper fingers will hit the target or a neighboring product. If the gripper hits the target, the purchased product will at least be shifted from its original position and the gripper may fail to grasp. As an illustration, if five yogurts in a row are detected as targets, each with its own confidence value, while the consumer bought only one item, then four yogurts must be eliminated. For the separation process to run well, the multi-detected results need to be sorted by P(n) = Σ_{i=0}^{n-1}(n - i) so that the maximum confidence level P(n)[max] is obtained. However, since certain conditions with overlapped purchased products caused the sorted output to exceed the safe gripper limit, we found that this approach was not always acceptable, as shown in Figure 11c. Empirically, the sorting method does not apply to overlapped purchased products, and applying it there may be rated as a non-optimal condition.
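The elimination step described above, keeping only as many top-confidence detections as the purchased quantity, can be sketched as:

```python
def select_purchased(detections, quantity):
    """Keep only the `quantity` highest-confidence detections of the
    purchased product; `detections` holds (confidence, bounding_box) pairs."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    return ranked[:quantity]

# Five detected yogurts while the consumer bought one: four are eliminated.
dets = [(0.88, "b1"), (0.97, "b2"), (0.91, "b3"), (0.86, "b4"), (0.93, "b5")]
```

As the text notes, this confidence-only ranking is not sufficient for overlapped products, which is why the SURF-disparity verification and Algorithm 2 follow.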
Contrary to the first method, purchased products that overlap by P(n)[max] ∪ P_{i=n} with the highest confidence level tend to make the grasping decision ambiguous. For instance, suppose there are five yogurts on a shelf, three in the front line and two behind, with the highest confidence level on one of the two behind. Under such conditions, verification is made using the SURF-disparity map and handled accurately. To solve this overlapping problem, the centroid of the bounding box, marked (+), is compared to the disparity map. However, comparing two nearly identical images can fail, so SURF was applied and was able to identify overlapped purchased products. With mixed products P_k^n, one shelf can contain two or more types of products, as shown in Figure 11c. The position of Master Sardines next to British Milk Tea allows the bounding boxes to overlap just as with the same product types shown in Figure 11b. The advantage of using the selected YOLOv2 detector is that it eliminates true-negative results caused by the erroneous detection of other products that are not compatible with the detector. Shelf space must be considered to resolve overlapping bounding-box positions with mixed products: one shelf cell covering an area of 414 mm × 261 mm × 200 mm contains two types of products, leaving only 207 mm for one product type.
For this reason, it is impossible to rely on the highest confidence value alone as the benchmark for what the gripper picks, because the highest confidence may belong to a product close to the shelf edge. The method considers the two centroids, and the position of the rightmost or leftmost centroid is avoided by the system, as written in Algorithm 2. In short, the system tends to choose the inner bounding box by B(box[a_{[1+i,1]}, ..., a_{[1+i,4]}]) P(n) rather than simply the highest score. Although the weakness of multi-detection is resolved for overlapping cases, this method also has a disadvantage with collapsed products: a collapsed product can still be recognized by the system but cannot be grasped, because the gripper finger has a limited range. Furthermore, the performance of the two stratified methods is double-checked by the conducting rules. Not only do multi-detection and overlapping have potential problems, but mixed products can also cause overlapping problems, which are resolved by Algorithm 2. Finally, by double-checking, the system can solve the localization problems through these methods. In this study, we carried out a scenario in which the system was run for SMEs as much as possible, reducing the cost of the devices while still paying attention to system performance. We hope that this system can be adopted cheaply with the minimum specifications shown in Table 2. The reach of the manipulator robot is limited to a radius of 642 mm, and it is positioned fixedly on the table. It is necessary to arrange the shelf in both number and position as described in Section 3.5.2. This study is also demonstrated in a video available at https://youtu.be/WD1mIL7o8X8 (accessed on 28 November 2020). As defined in Equation (16), the detection results are evaluated for confidence, accuracy, precision, recall, performance (F1), average precision (AP), average µ, standard deviation σ, and time.
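The edge-avoidance rule attributed to Algorithm 2 can be sketched roughly as below. This is a hedged reconstruction from the prose only (the full algorithm is not reproduced in this section); the candidate format (confidence, centroid_x) is an assumption.

```python
def pick_inner_box(candidates):
    """candidates: list of (confidence, centroid_x) for one product type.
    With three or more candidates, the leftmost and rightmost centroids
    (the shelf-edge positions) are avoided and the best inner candidate
    is chosen; otherwise the highest-confidence candidate is returned."""
    if len(candidates) < 3:
        return max(candidates, key=lambda c: c[0])
    by_position = sorted(candidates, key=lambda c: c[1])
    inner = by_position[1:-1]  # drop the edge-most boxes
    return max(inner, key=lambda c: c[0])
```

Note how the sketch prefers an inner box even when an edge box has the highest score, which is the trade-off the paragraph above describes.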
The numbers of true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) results were counted over roughly 500 recorded images in total to test the entire detection evaluation process. To compute the precision, recall, F1 score, and AP, a confidence threshold of 0.85 was set [46]. A detection counts as TP when the IoU value is >0.5 and there are no double bounding boxes; the opposite condition is called FP. The TN criterion is that there are no double bounding boxes and the IoU value is <0.5; the rest is FN, when the detection is wrong and the IoU value is <0.5, or there is no bounding box at all. The results can be seen in Table 3, where the bold entries are the highest values for each parameter. In Table 4, we compare the performance of the original YOLOv2 detector and our modified YOLOv2 detector with varying numbers of combinations. We grasp the purchased product gpc ← P(n)[max] on the shelf models using 88 grasp attempts in our setting, to evaluate the effective performance of the method through data from the consumer purchase process. Table 4 shows the evaluation of the two methods, where each method is evaluated under two conditions, for a total of four combinations with 8-10 grasp attempts each based on online purchases. In each trial, the product positions on the shelf are changed randomly, including between the right and left shelves. The success rate of the grasping effort increases slightly, from 0.807 for the original YOLOv2 to 0.835 after we modified the YOLOv2 detector. These results are informative for understanding the performance of the modified selective YOLOv2 detector. The data-driven mode plays an important role in this technique. First, raw data from the REST API reply become the key to determining the selection process and then calling the detector type after data parsing.
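The TP/FP criteria above hinge on the IoU between a predicted and a ground-truth box, and the reported F1 follows the standard definition. A minimal sketch (the (x1, y1, x2, y2) box format is an assumption for illustration):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    """F1 from raw TP/FP/FN counts, as used for the detector comparison."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A detection with iou(...) > 0.5 and no duplicate box would be tallied as TP under the criteria stated above; the counts then feed f1_score directly.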
Second, the relatively short response time of the IoT system (µ < 0.401 s) gives the AI the opportunity to double-check the same frame so that localization can be improved. The grasp failures were due to the gripper: we could not control the finger opening, and most failures occurred when the product width was close to the gripper finger opening width. The gripper we use is adaptive to shape but restricted to two choices, straight-open (72 mm) and wide-open (120 mm), which limits the size of the products we can hold. About 80% of grasps are successful, even though the localization results are very accurate. On analysis, the narrow straight-open gripper can push the purchased product in when grasping a target, so that its XYZ position changes. As an illustration, product B (British Milk Tea) has a width of 64 mm while the straight-open gripper is only 72 mm, meaning that there is only a 4 mm tolerance on each of the left and right sides, as shown in Figure 9e. A vacuum gripper would be an alternative, but it is weak at holding products with uneven surfaces, easily shifted positions, and plastic bags, such as products A (Indomie Soto Instant Noodle), C (Golden Apple Yogurt), E (ABC Soy Sweet Ketchup), H (ABC Cup Baso Noodle), and I (Bun Gao Vermicelli). The AIoT shop evaluation integrates three parts: AI, IoT, and the online shop. From each of these parts, several indicators are derived. The metrics evaluated in the AI segment include precision, accuracy, recall, and performance (F1). The IoT assessment follows Asghari et al. [47-49]; its indicators include security, response time, cost, and consumption of energy (CoE). The online shop, in both its app and website versions, is evaluated online with the Website Grader Tool, which includes four indicators: performance, search engine optimization (SEO), mobility, and security.
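The opening-mode choice implied by the British Milk Tea example can be sketched as a simple rule. The per-side clearance constant is taken from the 64 mm product in a 72 mm opening described above; treating it as the minimum acceptable margin is our assumption.

```python
PER_SIDE_CLEARANCE_MM = 4  # from the British Milk Tea example (64 mm in a 72 mm opening)

def choose_opening(product_width_mm, per_side_clearance=PER_SIDE_CLEARANCE_MM):
    """Return the smallest gripper opening (72 or 120 mm) that leaves at
    least the per-side clearance on both sides of the product, or None
    when neither opening fits."""
    for opening_mm in (72, 120):
        if opening_mm - product_width_mm >= 2 * per_side_clearance:
            return opening_mm
    return None
```

A None result corresponds to the failure mode discussed above, where the product width is too close to (or beyond) the finger opening width and the grasp pushes the product out of position.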
Broadly, this AIoT shop evaluation touches on three terms of Society 5.0: efficiency, speed, and convenience. For this reason, Figure 12 depicts the correlation of each aspect evaluated in the context of Society 5.0. Figure 12 also shows that the AI system built with the modified YOLOv2 works very well, with a score of >0.95; the IoT is moderate at µ > 0.82, superior in terms of cost but somewhat risky in terms of security. The online shop likewise has a high score on the mobile side but, in contrast, only 0.5 on the security indicator. The overall evaluation of the AIoT shop in terms of convenience, speed, and efficiency reaches a score of 0.860. The AI testing was taken from the data in Table 2, while the online shop test was obtained from the Website Grader Tool: the online shop page size was a light 1.7 MB out of 3 MB, with a page load speed of 4.2 s against a 5.3 s maximum limit. On the security side, we have used HTTPS/SSL, and the tool suggests updating the JavaScript libraries. Finally, the IoT evaluation is based on [47] at low cost; in 2020, the cost of the IoT devices was only USD 0.38, since this system does not require other sensors and just connects the 6-DOF robotic manipulator to the internet. This study set out to develop a picking algorithm using AIoT and to implement it in convenience shops, following the trend of Society 5.0. The picking algorithm is based on YOLOv2, modified into a parallel detector through a data-driven mode. A robotic manipulator in an offline shop is connected to the internet for picking off the shelf. To improve localization accuracy, we use a stereo camera, and every purchased product is tested twice and verified using a SURF disparity map. The test results show that the picking algorithm with AIoT correctly picks the purchased products with a success rate of 0.835 in a setting such as a convenience shop.
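How the three parts combine into the single 0.860 figure is not spelled out in this section; a plausible minimal sketch, assuming equal weighting of the AI, IoT, and online-shop indicator averages:

```python
def overall_score(ai_scores, iot_scores, shop_scores):
    """Each argument is a list of indicator scores in [0, 1] for one of
    the three evaluation parts (AI, IoT, online shop). The part averages
    are averaged again, with assumed equal weights, into one overall score."""
    parts = [sum(s) / len(s) for s in (ai_scores, iot_scores, shop_scores)]
    return sum(parts) / len(parts)
```

Under this equal-weight assumption, strong AI scores (>0.95) can offset the weaker 0.5 security indicator of the online shop, which is consistent with the overall score landing at 0.860 despite the security caveats noted above.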
From the indicator of time per transaction, the service speed is at a rate of 0.792. The results of the developed picking algorithm apply, to a limited extent, to rack-to-box picking. Our next study will involve an automated guided vehicle (AGV) for delivery from offline stores to consumers, adopting several localization strategies.

References
Autonomous Shopping Systems: Identifying and Overcoming Barriers to Consumer Adoption
Automatic recognition of clothes pattern and motifs empowering online fashion shopping
A Large-Scale Deep Architecture for Personalized Grocery Basket Recommendations
Technologies of Production with Society 5.0
Information Technology as the Basis for Transformation into a Digital Society and Industry 5.0
Toward an Efficient Search Method to Capture the Future MOT Curriculum Based on the Society 5.0
Science, technology and innovation ecosystem transformation toward society 5.0
The future is Society 5.0
Mobile based healthcare management using artificial intelligence
An experimental setup of multi-intelligent control system (MICS) of water management using the Internet of Things (IoT)
A Deep Learning-Based Intelligent Medicine Recognition System for Chronic Patients
ScalpEye: A Deep Learning-Based Scalp Hair Inspection and Diagnosis System for Scalp Health
A Wearable Smart-Glasses-Based Drug Pill Recognition System Using Deep Learning for Visually Impaired Chronic Patients
DeepCrash: A Deep Learning-Based Internet of Vehicles System for Head-On and Single-Vehicle Accident Detection with Emergency Notification
Enabling Intelligent Environment by the Design of Emotionally Aware Virtual Assistant: A Case of Smart Campus
Imagenet classification with deep convolutional neural networks
Towards Real-Time Object Detection with Region Proposal Networks
You only look once: Unified, real-time object detection
Fast and Efficient Non-Contact Ball Detector for Picking Robots
Tender Tea Shoots Recognition and Positioning for Picking Robot Using Improved YOLO-V3 Model
Grasping Point Detection of Randomly Placed Fruit Cluster Using Adaptive Morphology Segmentation and Principal Component Classification of Multiple Features
Object Classification Using CNN-Based Fusion of Vision and LIDAR in Autonomous Vehicle Environment
Object Localization and Depth Estimation for Eye-in-Hand Manipulator Using Mono Camera
Fruit Localization and Environment Perception for Strawberry Harvesting Robots
Inferring 3D Shapes of Unknown Rigid Objects in Clutter through Inverse Physics Reasoning
Using Manipulation Primitives for Object Sorting in Cluttered Environments
Predicting online shopping behaviour from clickstream data using deep learning
Enhanced IoT-Aware Online Shopping System
Bringing online shopping experience to offline retail through augmented reality and RFID
When Time Matters: Predictive Mission Planning in Cyber-Physical Scenarios
Design Patterns and Extensibility of REST API for Networking Applications
Stereo Vision-Based Object Recognition and Manipulation by Regions with Convolutional Neural Network
Cyber-Physical System Based Production Monitoring for Tapioca Starch Production
Society 5.0: For Human Security and Well-Being
CPS-Based Smart Control Model for Shopfloor Material Handling
Developing Social Networks Mashups: An Overview of REST-Based APIs
A Three-Dimensional Adaptive PSO-Based Packing Algorithm for an IoT-Based Automated e-Fulfillment Packaging System
Cucumber Fruits Detection in Greenhouses Based on Instance Segmentation
Eye to hand calibration using ANFIS for stereo vision-based object manipulation system
Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst.
Speeded up Robust Features for vision-based underwater motion estimation and SLAM: Comparison with correlation-based techniques
Sequential block-based disparity map estimation algorithm for stereoscopic image coding. Signal Process.
Multicriteria Meta-Heuristics for AGV Dispatching Control Based on Computational Intelligence
Estimation of disparity maps through an evolutionary algorithm and global image features as descriptors
Visual Object Recognition and Pose Estimation Based on a Deep Semantic Segmentation Network
Branch localization method based on the skeleton feature extraction and stereo matching for apple harvesting robot
Internet of Things applications: A systematic review
A Review of Usability and Security Evaluation Model of Ecommerce Website
Evaluating ecommerce websites cognitive efficiency: An integrative framework based on data envelopment analysis

Acknowledgments: Thanks to Helen Grace Gonzales for proofreading the article.
Conflicts of Interest: The authors declare no conflict of interest.