```python
def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one full game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # once the game ends we get the reward and update the value of every visited state
        player1.backup()
        player2.backup()
        # re-initialize for the next game
        judger.reset()
    # save value(state); it is used later to choose actions
    player1.save_policy()
    player2.save_policy()

# The game is a zero-sum game. If both players play an optimal strategy, every game ends in a tie.
# So we test whether the AI can guarantee at least a tie when it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        # play against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")
```
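The `backup()` call above is where the learning happens: after a game ends, the final reward is propagated back through every visited state with a temporal-difference update. A minimal sketch of that update, assuming a value table keyed by state hashes and a step size `alpha` (the names and structure here are simplified stand-ins, not the actual `Player` implementation):

```python
# Hedged sketch of the TD(0) value backup that Player.backup() performs.
# `estimations`, the state hashes, and `alpha` are illustrative assumptions.

def td_backup(estimations, states, alpha=0.1):
    """Propagate the value of each successor state back one step.

    estimations: dict mapping state hash -> estimated value
    states: state hashes visited during one game, in order; the last
            entry is the terminal state whose value encodes the reward.
    """
    for i in reversed(range(len(states) - 1)):
        state, next_state = states[i], states[i + 1]
        # V(s) <- V(s) + alpha * (V(s') - V(s))
        td_error = estimations[next_state] - estimations[state]
        estimations[state] += alpha * td_error
    return estimations

# Toy example: three visited states, terminal state valued 1.0 (a win).
values = {'s0': 0.5, 's1': 0.5, 's2': 1.0}
td_backup(values, ['s0', 's1', 's2'])
print(values['s1'])  # 0.55 -- pulled toward the terminal value 1.0
```

Iterating backward matters: updating `s1` first lets the terminal reward reach `s0` within the same game, rather than after many games.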

hill-a/stable-baselines

AlphaZero

```shell
# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# installation: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# train the model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25
2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]```

```shell
# DQN agent vs tabular Q-learning agents trained on tic-tac-toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py
```
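The "tabular" side of that comparison learns action values with the standard Q-learning update. A self-contained sketch of that rule (the function name and toy numbers are illustrative, not OpenSpiel's implementation):

```python
# Hedged sketch of the tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
# Names and values are made up for illustration.
from collections import defaultdict

def q_update(q, state, action, reward, next_q_values, alpha=0.1, gamma=1.0):
    """Apply one Q-learning backup and return the new Q(s, a)."""
    # Value of the best successor action; 0 for a terminal state.
    best_next = max(next_q_values) if next_q_values else 0.0
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)
# One terminal transition with reward +1 (a win); no successor actions.
print(q_update(q, 's', 'a', 1.0, []))  # 0.1
```

Unlike the value-table method above, Q-learning stores a value per (state, action) pair, so it needs no model of the game to pick a greedy move.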

AlphaZero likewise applies to two-player games other than Go.
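What makes AlphaZero game-agnostic is its search: MCTS selects each child with the PUCT rule, trading off the learned value Q(s, a) against the network prior P(s, a). A minimal sketch of that selection step, with made-up numbers (this is not OpenSpiel's code; `c_puct` and the tuples are illustrative assumptions):

```python
import math

# Hedged sketch of AlphaZero's PUCT child selection:
#   argmax_a [ Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)) ]
# where N(s) is the parent visit count and N(s,a) the child's.

def puct_select(children, c_puct=1.25):
    """children: list of (q_value, prior, visit_count); returns the index to expand."""
    total_visits = sum(n for _, _, n in children)
    scores = [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for q, p, n in children
    ]
    return scores.index(max(scores))

# A rarely visited move with a high network prior can outrank a
# heavily visited move with a better current Q estimate.
children = [(0.6, 0.1, 50), (0.2, 0.7, 2)]
print(puct_select(children))  # 1
```

Because only the game's legal-move and terminal-reward interface enters this loop, the same search and training procedure transfers from Go to tic-tac-toe, chess, or shogi unchanged.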