## 3. Solving the MDP with Fitted Q Iteration

The action-value function of a policy $\pi$ is the expected cumulative reward after time $t_s$, given the current state and action:

$$Q_\pi(s,a) = E\left[\sum_{t>t_s} r_t \mid s,a\right]$$

The greedy policy then simply takes the action with the highest value:

$$\pi(s) = \underset{a}{\operatorname{argmax}}\; Q_\pi(s,a)$$
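As a toy illustration of the greedy rule (the Q-values below are made-up numbers, not outputs of the model built later in this article):

```python
import numpy as np

# hypothetical action values Q(s, a) for one fixed state and three actions
actions = [1, 2, 3]
q_values = np.array([0.2, 0.9, 0.4])

# the greedy policy deterministically picks the highest-valued action
greedy_action = actions[int(np.argmax(q_values))]
```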

The idea of the FQI algorithm is to reduce the problem of learning the action-value function to a supervised learning problem. This is done by applying a supervised learning algorithm iteratively: we first train a model to predict the immediate reward from the state and action features, then train a second model to predict the return over the next two time steps, and so on.

The input is a set of observed transitions:

$$T = \{(s, a, r, s')\}$$

The FQI algorithm can be described as follows:

1. Initialize $Q_0(s,a) = 0$.
2. At iteration $i$, build a training set from all transitions in $T$, with feature vectors $x_j = (s_j, a_j)$ and regression targets $y_j = r_j + \gamma\, \underset{a}{\max}\, Q_{i-1}(s'_j, a)$.
3. Fit a regression model $Q_i$ to this training set, and repeat.

Here the state $s_i$ and action $a_i$ are vectors of state and action features, the reward $r$ is a scalar, and $\gamma$ is a discount factor with $0 \le \gamma \le 1$.
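The iterative scheme can be sketched on a toy two-state chain. This example is illustrative only and not part of the article's test bed; a decision tree stands in for the supervised learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

gamma = 0.9
actions = [0, 1]

# toy deterministic MDP: action 1 moves to state 1, where action 1 earns
# reward 1; action 0 always returns to state 0 with no reward
transitions = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]

Q = lambda sa: [0.0]                      # Q_0 is identically zero
for _ in range(50):                       # FQI iterations
    X, Y = [], []
    for s, a, r, s_next in transitions:
        v_next = max(Q([[s_next, b]])[0] for b in actions)
        X.append([s, a])
        Y.append(r + gamma * v_next)      # bootstrapped regression target
    Q = DecisionTreeRegressor().fit(X, Y).predict

# the fitted Q now prefers action 1 in both states,
# and Q(1, 1) approaches r / (1 - gamma) = 10
```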

## 4. Setting Up the Test Bed Environment

```python
import numpy as np

events = [
    0,  # no action
    1,  # visit
    2   # purchase
]
offers = [
    1,  # advertisement
    2,  # small discount
    3   # large discount
]
demographic = [
    0,  # young customers
    1   # old customers
]

k = 100    # number of time steps
m = 3      # number of offers (campaigns)
n = 1000   # number of customers

def multinomial_int(p):
    # draw one sample from a categorical distribution and return its index
    return np.argmax(np.random.multinomial(1, p))

def generate_profiles(n, k, m):
    p_offers = [1 / m] * m                                  # offer probabilities
    t_offers = np.linspace(0, k, m + 2).tolist()[1:-1]      # offer campaign times
    t_offer_jit = 5                                         # offer time jitter, std dev

    P = np.zeros((n, k))                                    # matrix of events
    F = np.zeros((n, k))                                    # offer history
    D = np.zeros((n, 1))                                    # demographic features
    for u in range(0, n):
        D[u, 0] = np.random.binomial(1, 0.5)                # demographic features

        # determine m time points to issue offers for customer u
        offer_times_u = np.rint(t_offer_jit * np.random.randn(len(t_offers)) + t_offers)

        for t in range(0, k):                               # simulate a trajectory
            if t in offer_times_u:
                F[u, t] = multinomial_int(p_offers) + 1     # issue an offer at time t

            event = multinomial_int(get_event_pr(D[u], F[u]))  # generate an event at time t
            P[u, t] = event

    return P, F, D
```

- $P$ — the $n \times k$ matrix of customer events
- $F$ — the $n \times k$ matrix of offers issued to customers
- $D$ — the $n \times 1$ matrix of customer demographic features

```python
def offer_seq(f):
    # sequence of non-zero offer IDs in the offer history
    return np.extract(f > 0, f)

def get_event_pr(d, f):
    f_ids = offer_seq(f)                       # sequence of offer IDs received by the customer
    f_ids = np.concatenate((f_ids, np.zeros(3 - len(f_ids))))  # pad with zeros to length 3

    # offer #1 followed by offer #3 stimulates purchases
    if((f_ids[0] == 1 and f_ids[1] == 3) or
       (f_ids[1] == 1 and f_ids[2] == 3) or
       (f_ids[0] == 1 and f_ids[2] == 3)):
        p_events = [0.70, 0.08, 0.22]          # higher probability of purchase
    else:
        p_events = [0.90, 0.08, 0.02]          # default behavior

    if(np.random.binomial(1, 0.1) > 0):        # add some noise: 10% of the time,
        p_events = [0.70, 0.08, 0.22]          # behave as if stimulated

    return p_events
```

```python
P, F, D = generate_profiles(n, k, m)       # training set
Pt, Ft, Dt = generate_profiles(n, k, m)    # test set
visualize_profiles(P)
visualize_profiles(F)
```
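The `visualize_profiles` helper is not defined in the listing; a minimal sketch of what it presumably does, rendering the event or offer matrices as heat maps with one row per customer:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")        # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

def visualize_profiles(M, n_profiles=20):
    # plot the first few customer timelines as a heat map:
    # one row per customer, one column per time step
    plt.figure(figsize=(10, 4))
    plt.imshow(M[:n_profiles], aspect="auto", interpolation="nearest")
    plt.xlabel("time step")
    plt.ylabel("customer")
    plt.colorbar()
    plt.show()
```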

## 5. Estimating the Action-Value Function

```python
# p, f, d - rows of matrices P, F, and D that correspond to a given customer
# t_start, t_end - time interval
def state_features(p, f, d, t_start, t_end):
    p_frame = p[0:t_end]
    f_frame = f[0:t_end]
    return np.array([
        d[0],                  # demographic features
        count(p_frame, 1),     # visits
        index(f_frame, 1, k),  # first time offer #1 was issued
        index(f_frame, 2, k),  # first time offer #2 was issued
        index(f_frame, 3, k)   # first time offer #3 was issued
    ])

def frame_reward(p, t_start, t_end):
    return count(p[t_start:t_end], 2)   # number of purchases in the time frame
```
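The `count` and `index` helpers used above are not defined in the listing; a minimal sketch consistent with how they are called (count the occurrences of a value, and find the first time index of a value, defaulting to `k` when it never occurs):

```python
import numpy as np

def count(x, val):
    # number of occurrences of val in the array
    return int(np.count_nonzero(np.asarray(x) == val))

def index(x, val, default):
    # position of the first occurrence of val, or default if absent
    hits = np.flatnonzero(np.asarray(x) == val)
    return int(hits[0]) if hits.size > 0 else default
```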

```python
# prepare_trajectories (not shown) slices each customer's timeline into
# (state, action, reward, next state) transitions, m + 1 per customer
T = prepare_trajectories(P, F, D)       # training set
Tt = prepare_trajectories(Pt, Ft, Dt)   # test set
```

```python
from sklearn.ensemble import RandomForestRegressor

# find the optimal action under a greedy policy and the corresponding state value
def best_action(Q, state, actions):
    v_best = 0
    a_best = 0
    for a in actions:
        v = Q([np.append(state, a)])[0]
        if(v > v_best):
            v_best = v
            a_best = a
    return a_best, v_best

def Q_0(sa):
    return [1]

Q = Q_0
iterations = 6
for i in range(iterations):                  # FQI iterations
    X = []
    Y = []
    for sample in T.reshape((n * (m + 1), m + 1)):
        x = np.append(sample[0], sample[1])  # feature vector is a state-action pair

        a_best, v_best = best_action(Q, sample[3], offers)
        y = sample[2] + v_best               # target: reward + next-state value (γ = 1 here)

        X.append(x)
        Y.append(y)

    regr = RandomForestRegressor(max_depth=4, random_state=0, n_estimators=10)
    regr.fit(X, np.ravel(Y))
    Q = regr.predict
```

## 6. Evaluating the Next-Best-Action Policy

The FQI algorithm gives us a way to estimate the action-value function from historical data collected under some previous policy. We then need to convert the action-value function into a next-best-action policy, which can be done in several ways. We have already seen that one option is to always take the action with the highest expected value, assigning zero probability to every other action in a given state. This yields a greedy policy that focuses exclusively on exploiting what is currently known and never explores alternative (seemingly suboptimal) actions that could improve the policy. A standard remedy is an ε-greedy policy that takes a random non-optimal action with small probability:

$$\pi_\epsilon(s,a) = \begin{cases} 1-\epsilon, & \text{if } a = \underset{a}{\operatorname{argmax}}\; Q_\pi(s,a) \\ \epsilon/(m-1), & \text{otherwise} \end{cases}$$

where $m$ is the total number of actions allowed in the state. This leaves more room for exploration and policy improvement. In a dynamic online environment, the policy can be updated repeatedly, and the parameter $\epsilon$ can itself be adjusted dynamically based on the volatility of the observed feedback.
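The evaluation code below calls `make_epsilon_greedy_policy`, which is not shown in the listing. A sketch consistent with the formula above follows; here the fitted `Q` and the action set are passed explicitly, whereas the article's one-argument version presumably closes over them:

```python
import numpy as np

def make_epsilon_greedy_policy(Q, actions, epsilon):
    m = len(actions)
    def policy(state, action):
        # pi(s, a): probability of taking `action` in `state`
        values = [Q([np.append(state, a)])[0] for a in actions]
        a_best = actions[int(np.argmax(values))]
        if action == a_best:
            return 1 - epsilon
        return epsilon / (m - 1)
    return policy
```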

To evaluate a new target policy $\pi$ on trajectories collected under the behavioral policy $\pi_0$, each observed trajectory return can be reweighted by the likelihood ratio of the two policies:

$$\widehat{R}(\pi) = R(\pi_0) \prod_{t} \frac{\pi(s_t,a_t)}{\pi_0(s_t,a_t)}$$

where $R(\pi_0)$ is the observed return of the trajectory. Averaging this estimate over a set of trajectories, we can assess the overall expected performance of the policy. This approach is known as the importance sampling estimator.

```python
def evaluate_policy_return(T, behavioral_policy, target_policy):
    returns = []
    for trajectory in T:
        importance_weight = 1
        trajectory_return = 0
        for transition in trajectory:
            state, action, reward = transition[0:3]
            action_prob_b = behavioral_policy(state, action)
            action_prob_t = target_policy(state, action)

            importance_weight *= (action_prob_t / action_prob_b)
            trajectory_return += reward

        returns.append(trajectory_return * importance_weight)

    return np.mean(returns)
```
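The `behavioral_policy` argument is also undefined in the listing. Since the simulator issues each offer uniformly at random (`p_offers = [1 / m] * m`), the matching behavioral policy is simply a constant:

```python
m = 3  # number of offers, as in the test bed

def behavioral_policy(state, action):
    # the data-collection policy picked each of the m offers with equal probability
    return 1 / m
```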

```python
import matplotlib.pyplot as plt

policy_values = []
eps = np.linspace(0, 2/3, num=10)
for e in eps:
    policy = make_epsilon_greedy_policy(e)
    policy_values.append(evaluate_policy_return(T, behavioral_policy, policy))

plt.plot(eps, policy_values)
plt.show()
```

## 7. Conclusion

G. Theocharous, P. Thomas, and M. Ghavamzadeh, Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees, 2015

D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai, Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising, 2018

D. Ernst, L. Wehenkel, and P. Geurts, Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005

M. Riedmiller, Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005

V. Mnih et al., Human-level Control Through Deep Reinforcement Learning, 2015