Deep deterministic policy gradient (DDPG) has proved effective in optimizing particle swarm optimization (PSO), but whether DDPG can also optimize multi-objective discrete particle swarm optimization (MODPSO) remains to be determined. The present work probes this question. Experiments showed that DDPG not only markedly improves the convergence speed of MODPSO but also helps it escape the local optima that MODPSO is prone to. These findings are of significance for both the theoretical study and the application of MODPSO.

Particle swarm optimization (PSO), a swarm intelligence algorithm, was proposed by Bai et al. [

The static hyperparameter configuration has proved to be an important factor constraining the performance of PSO, particularly its convergence speed and its tendency to settle in local optima. Therefore, Lu et al. [

The hyperparameter configuration of MODPSO, like that of PSO, is static. For example, the value of the positive acceleration constant [

Following this logic, this paper explores whether DDPG can improve the performance of MODPSO. The basic idea is that DDPG initially generates random hyperparameters for MODPSO; after adopting these hyperparameters, MODPSO returns a positive or negative reward to DDPG; and DDPG then generates new hyperparameters for MODPSO according to that reward. The steps of MODPSO are presented in Section 2, the framework of DDPG in Section 3, and the steps of deep deterministic policy gradient multi-objective discrete particle swarm optimization (DDPGMODPSO), the extension of MODPSO developed here, in Section 4. Experimental results are presented in Section 5, and the conclusion in Section 6.
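The interaction loop just described can be sketched in a few lines. This is a hypothetical illustration of the control flow only: `actor_propose`, `run_modpso_epoch`, the reward magnitudes, and the hyperparameter range are all placeholder assumptions, not the paper's actual networks or values.

```python
import random

def actor_propose(state):
    # In the full method a policy network maps the State to (c1, c2);
    # here we return random values in a plausible range as a stand-in.
    return random.uniform(0.5, 2.5), random.uniform(0.5, 2.5)

def run_modpso_epoch(c1, c2, best_fitness):
    # Placeholder: one MODPSO epoch returns a (possibly improved) best fitness.
    return max(best_fitness, random.random())

def interaction_loop(epochs=5):
    state, best = [0.0] * 6, 0.0
    rewards = []
    for _ in range(epochs):
        c1, c2 = actor_propose(state)              # DDPG generates hyperparameters
        new_best = run_modpso_epoch(c1, c2, best)  # MODPSO adopts them
        reward = 1.0 if new_best > best else -1.0  # MODPSO grades the choice
        rewards.append(reward)
        best = new_best  # DDPG would now update its networks from the reward
    return rewards

rewards = interaction_loop()
```

In the real method the reward feeds a DDPG update step; here it is merely collected to show the feedback direction.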

To compare DDPGMODPSO and MODPSO, the fitness function proposed by Sun et al. [

A materialized view is an important concept in databases and data warehouses; the multi-objective function above aims to reduce both the cost and the query time of materialized views.

The MODPSO process designed by Sun et al. [

Input: Discrete solution space;

positive acceleration constants c_1 and c_2;

number of particles;

number of training epochs.

Output: Optimal solution.

Step 1: Initialize the particle swarm, determine particle speed and position randomly, and deal with the illegal position.

Step 2: Calculate the individual fitness values.

Step 3: If an individual fitness value is greater than the individual optimal fitness value, update the individual optimal fitness value as follows.

Step 4: If an individual optimal fitness value is greater than the global optimal fitness value, update the global optimal fitness value as follows.

Step 5: Update velocity for each particle.
where the random numbers satisfy r_1 ∼ U(0,1) and r_2 ∼ U(0,1).

Step 6: Update particle position as follows.

Deal with the illegal position.

Step 7: Judge whether the training is terminated. If it is terminated, output the result; otherwise, go to Step 2.

It should be noted that an illegal position is a position that does not belong to the discrete solution space; such a position is repaired by replacing it with the nearest position in the discrete solution space.
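The repair rule just described can be sketched as follows for a one-dimensional solution space; the space itself is a made-up list of legal coordinates used only for illustration.

```python
def repair(position, solution_space):
    """Replace an illegal position by the nearest legal one (1-D case)."""
    return min(solution_space, key=lambda legal: abs(legal - position))

space = [10, 25, 40, 80, 120]   # hypothetical discrete solution space
print(repair(33, space))         # 33 is illegal; nearest legal point is 40
print(repair(5, space))          # clamped to the smallest legal point, 10
```

For multi-dimensional positions the same idea applies with a Euclidean (or other) distance in place of the absolute difference.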

DDPG was proposed by Zhao et al. [ Its interaction cycle comprises the State (s_t and s_{t+1}), the Action (a_t), and the Reward (r_t).

The components of the DDPG are specified as follows.

Environment

The actor network and critic network obtain the optimal hyperparameters c_1 and c_2 by continuously interacting with MODPSO and then apply c_1 and c_2 to MODPSO, so MODPSO serves as the Environment here.

Agent

Since the actor network and the critic network are the only elements that interact with the Environment, they serve as the Agent here.

Action

The Actions are c_1 and c_2.

State

The State has six dimensions: the first five represent the recent changes in the average fitness value of MODPSO, and the last represents the current number of training epochs of MODPSO. The calculation method is as follows.
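A minimal sketch of this six-dimensional State, under our own naming assumptions (the paper's exact formula is not reproduced here): the first five entries are the differences between consecutive average fitness values, and the sixth is the current epoch normalized by the maximum epoch count T_max.

```python
def build_state(avg_fitness_history, t, t_max):
    """Six-dim State: five recent average-fitness changes plus t / t_max."""
    # differences between consecutive average fitness values, last five kept
    diffs = [b - a for a, b in zip(avg_fitness_history,
                                   avg_fitness_history[1:])][-5:]
    diffs = [0.0] * (5 - len(diffs)) + diffs   # pad early epochs with zeros
    return diffs + [t / t_max]

history = [0.2, 0.3, 0.35, 0.5, 0.55, 0.6]   # toy average fitness per epoch
state = build_state(history, t=6, t_max=100)
```

Zero-padding in the early epochs keeps the State dimension fixed before five fitness changes are available.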

Reward

After the Agent takes an Action, the State of the Environment changes, and the Environment gives the Agent a Reward according to that Action. The Reward depends on whether the global optimal fitness value has improved. The calculation method is as follows.
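A sketch of this Reward rule: the sign of the Reward depends on whether the global optimal fitness value improved after applying the Action. The +1/-1 magnitudes are illustrative assumptions, not necessarily the paper's exact values.

```python
def reward(prev_global_best, new_global_best):
    """Positive Reward when the global optimal fitness value improves."""
    return 1.0 if new_global_best > prev_global_best else -1.0

print(reward(0.80, 0.85))   # improvement -> 1.0
print(reward(0.80, 0.80))   # no change   -> -1.0
```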

Inspired by deep deterministic policy gradient particle swarm optimization (DDPGPSO) proposed by Lu et al. [

The DDPGMODPSO process designed in this work is as follows.

Input: Discrete solution space;

the number of particles;

number of training epochs for MODPSO, T_max;

number of training epochs for DDPG, E_max.

Output: Optimal solution.

Step 1: Randomly initialize the parameters of the current policy network

Step 2: Randomly initialize the parameters of target policy network

Step 3: Initialize experience replay pool

Step 4: Execute Step 1 to Step 4 of the MODPSO algorithm.

Step 5: While episode is less than E_max:

Step 6: Initialize a random process to explore Action;

Step 7: Receive the initial observation s_1 of the Environment (the particle swarm).

Step 8: While t is less than T_max:

Step 9: Choose Action according to the current strategy and explore noise. The specific process is as follows.

Step 10: Execute Action a_t, obtain Reward r_t, and observe the new State s_{t+1};

Step 11: Save the transition (s_t, a_t, r_t, s_{t+1}) to the experience replay pool;

Step 12: Randomly sample a mini-batch of transitions from the experience replay pool;
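Steps 11 and 12 together form standard experience replay; a minimal sketch with a bounded pool follows. The capacity, batch size, and dummy transition contents are illustrative choices.

```python
import random
from collections import deque

pool = deque(maxlen=1000)   # experience replay pool; oldest entries drop out

# store 50 dummy transitions (s_t, a_t, r_t, s_{t+1})
for t in range(50):
    pool.append((f"s{t}", f"a{t}", float(t), f"s{t + 1}"))

batch = random.sample(pool, 8)   # uniform random mini-batch for training
```

Sampling uniformly from the pool decorrelates consecutive transitions, which stabilizes the Q-network update in Step 13.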

Step 13: Update the current Q-network by minimizing the loss function, which is expressed as follows.

Step 14: Use the sampled gradient to update the actor network as follows.

Step 15: Update target network.
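Step 15 is commonly realized in DDPG as a soft update, where the target parameters slowly track the current network: theta_target <- tau * theta + (1 - tau) * theta_target. This is a sketch of that rule; the value tau = 0.005 and the toy parameter vectors are illustrative assumptions.

```python
def soft_update(current, target, tau=0.005):
    """Blend current-network parameters into the target network."""
    return [tau * c + (1.0 - tau) * t for c, t in zip(current, target)]

theta = [1.0, 2.0]          # current network parameters (toy values)
theta_target = [0.0, 0.0]   # target network parameters
theta_target = soft_update(theta, theta_target)
```

A small tau keeps the target network nearly fixed between updates, which stabilizes the bootstrapped targets used in the Q-network loss.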

Step 16: End while.

Step 17: episode++.

Step 18: Save the weights of the trained actor network;

Step 19: Execute Step 1 to Step 4 of the MODPSO algorithm;

Step 20: While t is less than T_max:

Step 21: Calculate the current State and update the global optimal fitness value.

Step 22: Generate c_1 and c_2 from the current policy network;

Step 23: Execute Step 5 to Step 6 of the MODPSO algorithm.

Step 24: End while.
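The deployment phase (Steps 19 to 24) can be sketched as follows: after training, the frozen actor network generates (c_1, c_2) from the current State at every MODPSO epoch. `trained_actor` and the rolled-forward State are stand-ins for the saved network and the real State computation.

```python
def trained_actor(state):
    # Placeholder policy: the real actor maps the State through the trained
    # network; here we return fixed, plausible acceleration constants.
    return 1.8, 1.6

def deploy(t_max=3):
    history = []
    state = [0.0] * 6
    for t in range(1, t_max + 1):
        c1, c2 = trained_actor(state)      # Step 22: actor proposes c1, c2
        history.append((t, c1, c2))        # Step 23 would run MODPSO Steps 5-6
        state = state[1:] + [t / t_max]    # roll the State forward
    return history

runs = deploy()
```

Unlike the training phase, no Reward is computed here; the actor's weights stay fixed while MODPSO consumes its hyperparameter suggestions.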

The discrete solution space (the experimental data) in this paper is the same as Sun et al. [

Parameters | Comment
---|---
Storage | Storage space occupied by base tables and materialized views
Time | Time corresponds one-to-one with Storage, so it participates only in the calculation of the fitness value

Parameters | Values
---|---
Size of discrete solution space | 312
Number of particles | 10
Particle coordinates | Storage
Training times | 100

Layer name | Output dimension | Input
---|---|---
Input | 6 | —
L0 layer | 400 | Input
L1 layer | 300 | L0 layer
Output | 2 | L1 layer

Layer name | Output dimension | Input
---|---|---
Input 1 | 6 | —
Input 2 | 2 | —
Stitching layer | 8 | Input 1 and Input 2
L0 layer | 400 | Stitching layer
L1 layer | 300 | L0 layer
Output | 1 | L1 layer
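The actor table above implies a 6-400-300-2 fully connected forward pass from the State to (c_1, c_2). A minimal pure-Python sketch of those shapes follows; the random weights and ReLU activations are illustrative assumptions, since the tables specify dimensions but not activations.

```python
import random

def dense(x, n_out):
    # Fully connected layer with fresh random weights (illustration only).
    return [sum(w * xi for w, xi in zip(
                [random.gauss(0, 0.1) for _ in x], x))
            for _ in range(n_out)]

def relu(v):
    return [max(0.0, x) for x in v]

state = [random.random() for _ in range(6)]   # 6-dim State (Input row)
h0 = relu(dense(state, 400))                  # L0 layer
h1 = relu(dense(h0, 300))                     # L1 layer
action = dense(h1, 2)                         # Output row: (c1, c2)
```

The critic differs only in its front end: the 6-dim State and 2-dim Action are concatenated into the 8-dim stitching layer before the same 400-300 stack, ending in a single Q-value.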

Ten sets of experiments were carried out on MODPSO and DDPGMODPSO separately; the experimental results are shown in the table below.

Experiment id | MODPSO: optimum found | MODPSO: steps | DDPGMODPSO: optimum found | DDPGMODPSO: steps
---|---|---|---|---
1 | No | None | No | None
2 | Yes | 54 | No | None
3 | Yes | 58 | Yes | 57
4 | No | None | Yes | 76
5 | No | None | Yes | 9
6 | No | None | Yes | 13
7 | No | None | Yes | 20
8 | No | None | Yes | 20
9 | No | None | No | None
10 | No | None | Yes | 90
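The table above can be summarized in code: success counts and average steps over the ten experiments, with `None` marking runs where the global optimum was not found. The step lists below transcribe the table directly.

```python
modpso     = [None, 54, 58, None, None, None, None, None, None, None]
ddpgmodpso = [None, None, 57, 76, 9, 13, 20, 20, None, 90]

def summarise(steps):
    """Return (number of successful runs, average steps among successes)."""
    found = [s for s in steps if s is not None]
    return len(found), sum(found) / len(found)

m_found, m_avg = summarise(modpso)       # MODPSO: 2 successes, 56.0 steps
d_found, d_avg = summarise(ddpgmodpso)   # DDPGMODPSO: 7 successes, ~40.7 steps
```

So DDPGMODPSO found the global optimum in 7 of 10 runs versus 2 of 10 for MODPSO, and needed fewer steps on average when it succeeded.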

As shown in

As shown in

As shown in

In the present work, we explored whether and how DDPG can improve the performance of MODPSO. Experiments revealed that DDPGMODPSO outperformed MODPSO in discovering the global optimal fitness value and converged faster. It is thus verified that DDPG can significantly improve both the global-optimum discovery capability and the convergence speed of MODPSO.