When the robot grasps objects in the current scene, the state of the scene after the grasp depends on the current state of the scene and the action taken by the manipulator. The grasping process can therefore be described as a Markov decision process (MDP) [25]. An MDP consists of five elements: the state set \(S\), the action set \(A\), the state transition probability matrix \(P\), the reward set \(R\), and the discount factor \(\gamma\). The grasping process of the manipulator can be described as follows: in state \(s_t\), the manipulator takes action \(a_t\) according to the policy \(\pi\); the scene then moves to the new state \(s_{t+1}\) with probability \(P(s_{t+1} \mid s_t, a_t)\), and the manipulator receives the reward \(r_t\). The goal of reinforcement learning for the manipulator is to find an optimal policy \(\pi^{*}\) under which the grasping actions maximize the cumulative return \(G_t\), where \(\gamma\) is the discount factor, which reduces the weight of rewards that lie further in the future:

$$\begin{aligned} G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{T} \gamma^{k} R_{t+1+k} \end{aligned} \tag{1}$$
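
As a small numerical illustration of Eq. (1), the discounted return of a finite reward sequence can be computed as below. This is a minimal sketch: the function name `discounted_return` and the example values are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma**k * rewards[k], where rewards[k] stands for R_{t+1+k}."""
    g = 0.0
    # Accumulate from the last reward backwards: G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: one reward of 0.5 followed by one reward of 1.0.
print(discounted_return([0.5, 1.0], gamma=0.5))  # 0.5 + 0.5 * 1.0 = 1.0
```

With a smaller \(\gamma\), the second reward contributes less to \(G_t\), which is the down-weighting of later rewards described above.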

According to the Markov decision process, reinforcement learning must store the mapping between states and actions, which is called the policy. This mapping is typically stored in a table, but for large-scale problems such a table can hardly be maintained or held in memory. Therefore, this paper adopts a value-based deep reinforcement learning method, the deep Q-network (DQN) [26, 27]. DQN takes the feature vector \(\phi(s)\) of the input state \(s\) and, through the network, outputs the state-action value \(Q(s, a)\) for each action. Because DQN approximates the value function, learning becomes a regression problem, known as function fitting:

$$\begin{aligned} Q(s, a; \theta) \approx Q^{*}(s, a) \end{aligned} \tag{2}$$

where \(\theta\) denotes the parameters of the model. To accelerate the convergence of the network, we design two convolutional neural networks: the current value network \(Q(s, a; \theta)\) and the target value network \(Q(s, a; \theta^{-})\). The parameters of the current value network are updated at every iteration. The target value network is used to compute the target Q value throughout training, but its parameters are held fixed for a certain number of iteration steps; after that, the parameters of the current value network are copied to it. The optimization target of the model is therefore:

$$\begin{aligned} C = R + \gamma \max_{a'} Q(s', a'; \theta^{-}) \end{aligned} \tag{3}$$

Therefore, the network parameters can be updated by minimizing the squared error between the output of the current value network and the target value:

$$\begin{aligned} L(\theta) = E\left[\left(R + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right] \end{aligned} \tag{4}$$
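
To make Eqs. (3) and (4) concrete, the following sketch evaluates the target value and the mean squared error on a small batch of plain numbers. The function names, the batch layout, and the `terminal` flag (which drops the bootstrap term at the end of an episode) are assumptions made for illustration; the paper's networks are replaced here by precomputed Q values.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """C = R + gamma * max_a' Q(s', a'; theta^-), as in Eq. (3).

    next_q_values: Q values of the next state from the target network.
    terminal: assumption for this sketch; when True, no bootstrap term is added.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(batch, gamma):
    """Sample mean of (C - Q(s, a; theta))**2 over a batch, estimating Eq. (4).

    batch: list of (q_sa, reward, next_q_values, terminal) tuples.
    """
    errors = [
        (td_target(reward, next_q, gamma, terminal) - q_sa) ** 2
        for (q_sa, reward, next_q, terminal) in batch
    ]
    return sum(errors) / len(errors)


# Hypothetical batch of two transitions.
batch = [
    (0.5, 1.0, [0.0, 0.0], True),    # target 1.0, error 0.5
    (0.25, 0.0, [0.5, 1.0], False),  # target 0.5, error 0.25
]
print(squared_td_loss(batch, gamma=0.5))  # (0.25 + 0.0625) / 2 = 0.15625
```

In a full implementation the gradient of this loss would flow only through \(Q(s, a; \theta)\), since \(\theta^{-}\) is held fixed between copies.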

A. Description of the state set

In this paper, we use RGB-D cameras to obtain the point cloud of the objects. The point cloud cannot be fed directly into the DQN, so the RGB top view of the point cloud of the current scene, taken before each action, is used as the state \(s_t\) and as the input of the DQN. For object grasping, however, a camera mounted directly above the grasping area and pointing vertically downward would interfere with the grasping operation. The cameras are therefore placed outside the grasping area at an oblique installation angle. A top view obtained from a single viewpoint has missing information, as shown in Fig. 1, so this paper uses two RGB-D cameras that capture the scene from two viewpoints to ensure complete object information. The resulting RGB top view has a resolution of 224×224 pixels.
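
One possible way to build such a top view is sketched below. It assumes that the points from both cameras have already been transformed into a common workspace frame, uses an orthographic projection in which the highest point wins each pixel, and takes hypothetical workspace bounds; none of these details are specified in the text.

```python
def top_view(points, x_range, y_range, size=224):
    """Project colored 3-D points onto a size x size RGB top view.

    points: iterable of (x, y, z, (r, g, b)) in a common workspace frame.
    x_range, y_range: (min, max) workspace bounds in metres (hypothetical values).
    For every pixel, the color of the highest point falling into it is kept.
    """
    image = [[(0, 0, 0)] * size for _ in range(size)]
    height = [[float("-inf")] * size for _ in range(size)]
    x_min, x_max = x_range
    y_min, y_max = y_range
    for x, y, z, rgb in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # the point lies outside the workspace
        col = int((x - x_min) / (x_max - x_min) * size)
        row = int((y - y_min) / (y_max - y_min) * size)
        if z > height[row][col]:
            height[row][col] = z
            image[row][col] = rgb
    return image, height


# Points from two cameras are simply concatenated before projection.
camera_1 = [(0.0, 0.0, 0.10, (255, 0, 0))]
camera_2 = [(0.0, 0.0, 0.20, (0, 255, 0))]
image, height = top_view(camera_1 + camera_2, (0.0, 0.448), (0.0, 0.448))
print(image[0][0], height[0][0])  # (0, 255, 0) 0.2
```

Merging both point sets before projecting lets regions hidden from one camera be filled by the other, which is the motivation for the two-camera setup.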

B. Description of the action space

To adapt to complex grasping scenes, this paper introduces a pushing action that assists grasping, because pushing can disturb the original arrangement of the objects and make them easier for the manipulator to grasp. Two pieces of information are needed when the manipulator executes a grasping or pushing action: the target position \((x, y, z)\) to which the gripper must move, and the posture of the gripper. The target position \((x, y)\) is learned by the DQN, and the height \(z\) is obtained from the depth map. For the posture of the end effector, this paper adopts vertical downward grasping, so the posture has only one rotation angle \(\theta\) about the z axis; when pushing, the manipulator moves along a direction given by an angle \(\alpha\).

When the manipulator pushes or grasps objects, the end rotation angle and the push angle do not need to be precise. We therefore discretize these angles by dividing 360° into 16 equal parts, so that each multiple of 22.5° defines one grasping or pushing direction. The action space can then be described as follows:

Pushing action: taking the \((x, y)\) coordinates learned by the DQN as the starting position, the end of the manipulator pushes the object in one of the 16 defined directions. The pushing distance is half the length of the longest side of the objects, which is fixed at 5 cm in this paper.

Grasping action: taking the \((x, y)\) coordinates learned by the DQN as the target position of the gripper center, the gripper at the end of the manipulator is rotated to one of the 16 directions above to grasp the object.
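
The discretized action space can be sketched as follows. The convention that direction index 0 points along the positive x axis and that angles increase counterclockwise is an assumption for this example, as are the function names; only the 16 directions and the 5 cm push distance come from the text.

```python
import math

NUM_DIRECTIONS = 16
STEP_DEG = 360.0 / NUM_DIRECTIONS  # 22.5 degrees per direction
PUSH_DISTANCE_M = 0.05             # fixed 5 cm push distance from the text


def direction_angle(index):
    """Angle in degrees of one of the 16 discrete directions."""
    if not 0 <= index < NUM_DIRECTIONS:
        raise ValueError("direction index must be in 0..15")
    return index * STEP_DEG


def push_endpoint(x, y, index, distance=PUSH_DISTANCE_M):
    """End point of a push that starts at (x, y) and follows direction `index`.

    Assumes index 0 points along +x and angles grow counterclockwise.
    """
    angle = math.radians(direction_angle(index))
    return x + distance * math.cos(angle), y + distance * math.sin(angle)


print(direction_angle(4))            # 90.0
print(push_endpoint(0.0, 0.0, 0))    # (0.05, 0.0)
```

For a grasp, the same `direction_angle` value would be used as the gripper rotation about the z axis instead of a push direction.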

As the description of the state set shows, the state uses a top view, whereas the visual calibration is performed on the oblique camera view as usual. Before the manipulator executes a push or a grasp, the conversion between the two images must therefore be determined; it is obtained from the viewpoint transformation.

C. Value network design

Because the input is an image, we model the Q-function as two fully convolutional networks (FCNs), as proposed in [28]: one selects the push action and the other selects the grasp action. The two networks have the same structure: a DenseNet-121 [29] backbone, followed by two 1×1 convolutional layers, each with batch normalization and a ReLU activation, and finally bilinear upsampling. Each network takes as input the RGB top view of the point cloud built from the two scene views.

To obtain the appropriate push or grasp angle, we rotate the RGB top view 15 times and feed every orientation to the FCNs, which gives 16 orientations including the original. That is, each FCN receives 16×2 images as input and outputs heatmaps of the same size and number as the input images. Each pixel of a heatmap is an independent Q-value prediction. By sorting the Q values of the pixels in the output heatmaps of the two networks, we obtain the largest Q value of the push network and of the grasp network, and the larger of the two is taken as the final output of the network (Fig. 2). The selected position and direction are given by the pixel with the maximum Q value and by the rotation angle of the heatmap that contains it, and the selected action (push or grasp) is given by the network that produced that heatmap.
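
The selection step described above can be sketched with plain nested lists standing in for the network outputs. The function names, the tiny 2×2 heatmaps, and the tie-breaking rule (a tie goes to the grasp network) are assumptions for illustration only.

```python
def best_pixel(heatmaps):
    """Return (q, rotation_index, row, col) of the largest Q value.

    heatmaps: list of 2-D lists, one heatmap per rotation of the top view.
    """
    best = None
    for k, heatmap in enumerate(heatmaps):
        for r, row in enumerate(heatmap):
            for c, q in enumerate(row):
                if best is None or q > best[0]:
                    best = (q, k, r, c)
    return best


def select_action(push_maps, grasp_maps, step_deg=22.5):
    """Pick push or grasp according to which network holds the larger maximum Q."""
    push = best_pixel(push_maps)
    grasp = best_pixel(grasp_maps)
    # Assumption: on a tie the grasp is preferred.
    name, (q, k, row, col) = ("push", push) if push[0] > grasp[0] else ("grasp", grasp)
    return {"action": name, "q": q, "angle_deg": k * step_deg, "pixel": (row, col)}


# Hypothetical outputs: 16 rotations of a 2x2 heatmap per network.
push_maps = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(16)]
grasp_maps = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(16)]
push_maps[3][1][0] = 0.9
grasp_maps[5][0][1] = 0.7
print(select_action(push_maps, grasp_maps))
# {'action': 'push', 'q': 0.9, 'angle_deg': 67.5, 'pixel': (1, 0)}
```

The returned pixel is expressed in the top view and still has to be converted to a workspace position before the manipulator can act on it.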

D. Reward settings

The traditional reward assigns R = 1 for a successful grasp, and R = 0.5 if the total change between the height maps exceeds a certain threshold \(\tau\). Such a reward is too sparse, which leads to slow convergence or even non-convergence during model training. To solve this problem, this paper proposes a piecewise reward strategy in place of the single reward. The piecewise reward function is defined as follows:

$$\begin{aligned} R = \left\{ \begin{array}{rcl} 1 & & \text{grasp succeeded} \\ -1 & & \text{grasp failed} \\ 0.3 & & \text{pixel change } \tau\ (10\% \sim 24\%) \\ 0.5 & & \text{pixel change } \tau\ (24\% \sim 40\%) \\ 0.7 & & \text{pixel change } \tau\ (40\% \sim 100\%) \\ -0.1 & & \text{otherwise} \end{array} \right. \end{aligned} \tag{5}$$

where \(R\) is the reward for the action chosen by the deep Q-network. If the grasp succeeds, the reward R = 1 is given; if the grasp fails, R = −1 is given. If the pixel change rate \(\tau\) of the scene after pushing is between 10% and 24%, R = 0.3 is given; if it is between 24% and 40%, R = 0.5 is given; if it is between 40% and 100%, R = 0.7 is given; otherwise, R = −0.1 is given.
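
Eq. (5) translates directly into a small lookup, sketched below. The text does not say which side of each interval is inclusive, so the choice made here (lower bound inclusive) is an assumption, as are the function names and the representation of \(\tau\) as a fraction between 0 and 1.

```python
def grasp_reward(success):
    """Reward for a grasp attempt: 1 on success, -1 on failure."""
    return 1.0 if success else -1.0


def push_reward(change_ratio):
    """Reward for a push, given the pixel change rate tau as a fraction in [0, 1].

    Assumption: each interval includes its lower bound.
    """
    if 0.40 <= change_ratio <= 1.0:
        return 0.7
    if 0.24 <= change_ratio < 0.40:
        return 0.5
    if 0.10 <= change_ratio < 0.24:
        return 0.3
    return -0.1  # the push changed the scene too little


for tau in (0.05, 0.15, 0.30, 0.60):
    print(tau, push_reward(tau))
```

Compared with a single threshold, the graded values give a push that rearranges more of the scene a larger reward, so the reward signal is less sparse.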

### Consent to participate

All authors confirm that they are involved in this study.

### Consent to publication

The authors confirm that the manuscript has been read and approved by all named authors.
