DDPG's network structure

Tricks used in DDPG training

The complete DDPG training loop

# 4. Model Experiments

First install the environment package in editable mode:

```shell
pip install -e .
```

replay_buffer.py: defines two kinds of replay buffer, an ordinary uniform-sampling buffer and a prioritized-sampling buffer.
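The ordinary buffer can be sketched in a few lines. This is a hypothetical minimal version for illustration, not the repository's actual replay_buffer.py:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer (illustrative sketch)."""
    def __init__(self, capacity):
        # deque with maxlen drops the oldest experience automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # uniform sampling without replacement
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```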

segment_tree.py: used only with the prioritized replay buffer. It defines a tree structure that samples experiences according to their priority.
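The idea behind such a tree is a sum tree: leaves hold priorities, each internal node holds the sum of its children, so sampling proportional to priority takes one root-to-leaf walk. A hypothetical minimal sketch (not the repository's segment_tree.py):

```python
import random

class SumTree:
    """Sum tree sketch: sample a leaf index with probability
    proportional to its priority in O(log n)."""
    def __init__(self, capacity):
        self.capacity = capacity
        # 1-indexed binary-heap layout; leaves live at [capacity, 2*capacity)
        self.tree = [0.0] * (2 * capacity)

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:  # propagate the new sum up to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self):
        # draw a mass in [0, total) and walk down toward the leaf it lands on
        s = random.uniform(0, self.tree[1])
        pos = 1
        while pos < self.capacity:
            left = 2 * pos
            if s <= self.tree[left]:
                pos = left
            else:
                s -= self.tree[left]
                pos = left + 1
        return pos - self.capacity
```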

DDPG Actor implementation

The Actor takes a concrete state as input and, after two fully connected layers, outputs the chosen action.

```python
def actor_network(name):
    with tf.variable_scope(name) as scope:
        x = state_input
        x = tf.layers.dense(x, 64)
        if self.layer_norm:
            x = tc.layers.layer_norm(x, center=True, scale=True)
        x = tf.nn.relu(x)
        x = tf.layers.dense(x, 64)
        if self.layer_norm:
            x = tc.layers.layer_norm(x, center=True, scale=True)
        x = tf.nn.relu(x)
        # final layer initialized with small weights so initial actions stay near zero
        x = tf.layers.dense(x, self.nb_actions,
                            kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
        x = tf.nn.tanh(x)  # squash actions into [-1, 1]
    return x
```

DDPG Critic implementation

The Critic's input is the state together with the current actions of all agents:

```python
def critic_network(name, action_input, reuse=False):
    with tf.variable_scope(name) as scope:
        if reuse:
            scope.reuse_variables()
        x = state_input
        x = tf.layers.dense(x, 64)
        if self.layer_norm:
            x = tc.layers.layer_norm(x, center=True, scale=True)
        x = tf.nn.relu(x)
        # the action vector is concatenated in after the first hidden layer
        x = tf.concat([x, action_input], axis=-1)
        x = tf.layers.dense(x, 64)
        if self.layer_norm:
            x = tc.layers.layer_norm(x, center=True, scale=True)
        x = tf.nn.relu(x)
        x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
    return x
```

The Actor is trained to maximize the Q value, while the Critic is trained to minimize the gap between the estimated Q value and the target Q value:

```python
self.actor_optimizer = tf.train.AdamOptimizer(1e-4)
self.critic_optimizer = tf.train.AdamOptimizer(1e-3)
# maximize Q by minimizing the negative mean Q value
self.actor_loss = -tf.reduce_mean(
    critic_network(name + '_critic',
                   action_input=tf.concat([self.action_output, other_action_input], axis=1),
                   reuse=True))
self.actor_train = self.actor_optimizer.minimize(self.actor_loss)
self.target_Q = tf.placeholder(shape=[None, 1], dtype=tf.float32)
self.critic_loss = tf.reduce_mean(tf.square(self.target_Q - self.critic_output))
self.critic_train = self.critic_optimizer.minimize(self.critic_loss)
```
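The target_Q placeholder is typically fed with the Bellman target r + γ·Q'(s', a') computed from the target networks, with no bootstrapping on terminal transitions. A hedged NumPy sketch of that computation (compute_target_q is a hypothetical helper, not part of the repository):

```python
import numpy as np

def compute_target_q(rewards, next_q_values, dones, gamma=0.99):
    """Build the batch of values fed into the target_Q placeholder.
    next_q_values are assumed to come from the target critic evaluated
    at the target actor's action for the next state."""
    rewards = np.asarray(rewards, dtype=np.float32).reshape(-1, 1)
    next_q_values = np.asarray(next_q_values, dtype=np.float32).reshape(-1, 1)
    dones = np.asarray(dones, dtype=np.float32).reshape(-1, 1)
    # terminal transitions (done == 1) contribute only their reward
    return rewards + gamma * (1.0 - dones) * next_q_values
```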

```python
# actions of the three learning agents
agent1_action, agent2_action, agent3_action = get_agents_action(o_n, sess, noise_rate=0.2)
a = [[0, i[0][0], 0, i[0][1], 0] for i in [agent1_action, agent2_action, agent3_action]]
# the green ball moves randomly
a.append([0, np.random.rand() * 2 - 1, 0, np.random.rand() * 2 - 1, 0])
o_n_next, r_n, d_n, i_n = env.step(a)
```

```python
agent1_memory.add(np.vstack([o_n[0], o_n[1], o_n[2]]),
                  np.vstack([agent1_action[0], agent2_action[0], agent3_action[0]]),
                  r_n[0], np.vstack([o_n_next[0], o_n_next[1], o_n_next[2]]), False)
agent2_memory.add(np.vstack([o_n[1], o_n[2], o_n[0]]),
                  np.vstack([agent2_action[0], agent3_action[0], agent1_action[0]]),
                  r_n[1], np.vstack([o_n_next[1], o_n_next[2], o_n_next[0]]), False)
agent3_memory.add(np.vstack([o_n[2], o_n[0], o_n[1]]),
                  np.vstack([agent3_action[0], agent1_action[0], agent2_action[0]]),
                  r_n[2], np.vstack([o_n_next[2], o_n_next[0], o_n_next[1]]), False)
```

```python
train_agent(agent1_ddpg, agent1_ddpg_target, agent1_memory,
            agent1_actor_target_update, agent1_critic_target_update, sess,
            [agent2_ddpg_target, agent3_ddpg_target])
train_agent(agent2_ddpg, agent2_ddpg_target, agent2_memory,
            agent2_actor_target_update, agent2_critic_target_update, sess,
            [agent3_ddpg_target, agent1_ddpg_target])
train_agent(agent3_ddpg, agent3_ddpg_target, agent3_memory,
            agent3_actor_target_update, agent3_critic_target_update, sess,
            [agent1_ddpg_target, agent2_ddpg_target])
```
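The `agentN_actor_target_update` and `agentN_critic_target_update` ops passed in above are, in standard DDPG, soft target updates: θ' ← τθ + (1 − τ)θ'. A minimal NumPy sketch of that rule (soft_update is a hypothetical helper, shown only to illustrate what those ops do):

```python
import numpy as np

def soft_update(target_params, source_params, tau=0.01):
    """Blend each target parameter slowly toward its source parameter:
    theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

With a small τ the target networks change slowly, which stabilizes the Q targets used by the Critic.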

# References

1. https://blog.csdn.net/qiusuoxiaozi/article/details/79066612

2. https://arxiv.org/pdf/1706.02275.pdf