Press "Enter" to skip to content

A Programming Example for a Markov Decision Process

Previously we introduced the theory and mathematical derivation of the Markov Decision Process (MDP), as well as its application in Human-Robot Interaction. Today we'll walk through a program to get a feel for the implementation details. Interested readers can use this example as a starting point for applying the program to a robot!

 

A Quick Review of MDP Basics

 

Before diving into the problem, let's quickly review MDPs. First, the definition of an MDP:
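For reference, an MDP is usually written as the tuple

$$\langle S, A, T, R \rangle, \qquad T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad R: S \times A \rightarrow \mathbb{R}$$

where $S$ is the set of states, $A$ the set of actions, $T$ the transition probabilities, and $R$ the reward function.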

 

 

Then, assuming we already know how much reward is obtained in each state, we can use the value iteration algorithm to compute which action should be taken in each state to obtain the highest reward (that is, to obtain the optimal policy):
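The core of value iteration is the Bellman backup (written here in its generic discounted form; the finite-horizon version used in the code below simply repeats the backup for $T$ steps without a discount factor):

$$V_{k+1}(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \, V_k(s') \Big]$$

and the optimal policy takes the maximizing action in each state.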

 

 

If this doesn't quite make sense, that's fine; you can go back and read the post on the theory and mathematical derivation of the Markov Decision Process (MDP).

 

The Scenario for the Program

 

The program we'll discuss today assumes we want to implement the following scenario:

 

You have a robot that is supposed to help you clear a table. On the table there are a plastic bottle (bottle) and a glass (glass). The robot should pick up both the bottle and the glass, but you are not yet sure whether it can complete the clean-up task successfully.

 

As you observe the robot, you may come to trust it more and more, or less and less; and when you don't trust it, you will want to step in and intervene in its behavior.

 

Connecting to MDP Notation

 

First, we need to clearly define the states, actions, transition matrix, and reward, since these are the required elements of any MDP problem.

 

Next, because our problem involves whether the human trusts the robot, the state is extended from just the world state (the status of the bottle and the glass) to also include a human state. Likewise, actions are split into human actions and robot actions.

 

Let's define the states and actions clearly:

 

 

The human state can be either trusting or not trusting.

 

The world state is determined by whether the bottle and the glass are on the table, in the robot's hand, or in the human's hand (each object has three possibilities, so there are 3*3 = 9 world states).

 

 

Next, let's define the transitions clearly:

 

 

The probability that the human intervenes when the robot chooses to pick up the bottle or the glass.

From the table below you can see that if the human trusts the robot, the probability of intervening when the robot goes for the bottle is 0.1; if the human trusts the robot and it goes for the glass, the probability is 0.2; and so on.

There is a small error in this figure: the rightmost column of the first row should be $P(a^H_t = \mathrm{Intervene} \mid s^H_t, a^R_t)$.
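The table image itself is not reproduced here, but its values can be read directly from the `PROB_TRUST_INTERVENE` matrix in the code listing below:

% P(human intervenes | human state, robot action)
% rows: {no_trust, trust}, columns: {pick_bottle, pick_glass}
PROB_TRUST_INTERVENE = [0.3 0.8;    % no_trust: 0.3 for the bottle, 0.8 for the glass
    0.1 0.2];                       % trust:    0.1 for the bottle, 0.2 for the glass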

 

How the world state changes once the human has intervened (or has not intervened).

 

How the human state changes.

 

 

With these probabilities, we can compute the transition matrix (in the code below it is an 18 x 2 x 18 array: current state x robot action x next state).

 

Finally, we need to define the reward:
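The reward table from the original figure is not shown here, but reading it off the code below, the structure is roughly:

$$R(s, a^R) = \begin{cases} +10 & \text{both objects have been picked up by the robot (goal state)} \\ -1000 & a^R \text{ targets an object that is no longer on the table (infeasible action)} \\ +5 & \text{the glass is already in the robot's hand and the bottle is still on the table} \\ 0 & \text{otherwise} \end{cases}$$

(When several cases apply, the one listed first wins; this matches the final values produced by the code below.)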

 

 

Examining the current model: some derivation is still missing

 

At this point you might be a bit confused: why does this feel different from the MDP we learned before? What we learned previously was that there is only one agent, the robot, taking actions, and because the outcomes of those actions are non-deterministic, we need an MDP to model the problem.

 

But with the definitions above, we would draw a graph like this:

 

The human state ($s^H_t$) and the robot action ($a^R_t$) determine the human action ($a^H_t$)
$s^H_t$, $a^R_t$, and $a^H_t$ determine the human state at the next time step $t+1$
…and so on

Compare this with the MDP as we originally knew it:

 

 

The two look quite different.

 

Deriving the Model into an MDP

 

Here comes the clever part: we can define the state space jointly from the human state and the world state

 

$$S = S^H \times S^W$$

 

So in this figure, the human state space and the world state space can be merged (the red nodes merged together, and the blue nodes merged together):
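Concretely, the code listing below flattens the two components into a single state index with `ss = (sh-1)*num_world_states + sw`. A minimal sketch of that convention (variable names follow the listing; the listing recovers `sh` with a slightly different but, for this problem size, equivalent formula):

num_world_states = 9;   % 3 bottle positions x 3 glass positions
num_human_states = 2;   % {no_trust, trust}

sh = 2; sw = 5;                               % e.g. trust, fifth world state
ss = (sh-1)*num_world_states + sw;            % combined index -> 14

sh_back = floor((ss-1)/num_world_states) + 1; % recover the human state index -> 2
sw_back = ss - (sh_back-1)*num_world_states;  % recover the world state index -> 5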

 

 

Then, we can treat $a^H_t$ as the source of uncertainty inside the MDP. Because the robot cannot control the human's behavior, it doesn't know whether, if it reaches for the bottle, it will actually end up in the "bottle successfully picked" state, and this is exactly what matches the original spirit of an MDP.

 

There is still some mathematical derivation after this, but personally I find it a bit too detailed; interested readers can look at pages 5-6 of this note. If marginalization of probabilities doesn't ring a bell, you can read my earlier post Why do we need marginalization in probability?
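The key step is essentially marginalizing the human action out of the transition; roughly (assuming, as in the model above, that the next world state and the next human state are conditionally independent given the current state and both actions):

$$P(s_{t+1} \mid s_t, a^R_t) = \sum_{a^H_t} P(a^H_t \mid s^H_t, a^R_t) \, P(s^W_{t+1} \mid s^W_t, a^R_t, a^H_t) \, P(s^H_{t+1} \mid s^H_t, a^R_t, a^H_t)$$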

 

The Code

 

At a high level, the code still follows the usual sequence:

 

 

    1. Define the states
    2. Define some basic transition probabilities (for computing the transition matrix)
    3. Compute the transition matrix
    4. Define the reward
    5. Run value iteration to compute the optimal policy

 

 

And now, the code:

 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% CSCI 699: Computational Human-Robot Interaction %%%%%
%%%%% Fall 2018, University of Southern California    %%%%%
%%%%% Author: Stefanos Nikolaidis, [email protected]   %%%%%
%%%%% Commented by: Po-Jen Lai, [email protected]      %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
clear all
close all
 
%% Define the states
 
human_states_str = {'no_trust', 'trust'};
 
Label.BOTTLE = 1;
Label.GLASS = 2;
 
Label.ON_TABLE = 1;
Label.PICKED_BY_ROBOT = 2;
Label.PICKED_BY_HUMAN = 3;
 
Label.NO_TRUST = 1;
Label.TRUST = 2;
 
objstate_str = {'on_table','picked_robot','picked_human'};
ractions_str = {'pick_bottle','pick_glass'};
 
%% Define some basic transition probabilities (for computing the transition matrix)
 
% probability that trust increases; rows: {no_trust, trust}, columns: {bottle, glass}
PROB_TRUST_INCREASE = [0.8 0.9;
    0.0 0.0];
 
% probability that the human intervenes; rows: {no_trust, trust}, columns: {bottle, glass}
PROB_TRUST_INTERVENE = [0.3 0.8;
    0.1 0.2];
 
counter = 1;
for ii = 1:3 %for each possible bottle state
    for jj = 1:3 %for each possible glass state
        world_states(counter,:) = [ii,jj];
        counter = counter + 1;
    end
end
num_human_states = length(human_states_str);
num_world_states = counter -1;
num_trust_states = 2;
num_states = num_world_states*num_trust_states;
num_ractions = length(ractions_str);
 
%% Compute the transition matrix
 
Trans = zeros(num_states,num_ractions,num_states);
 
for sh = 1:num_human_states
    for sw = 1:num_world_states
        for ra = 1:num_ractions
            world_state = world_states(sw,:);
            ss = (sh-1)*num_world_states + sw;
            
            %picked by robot
            new_world_state = world_state;
            new_world_state(ra) =Label.PICKED_BY_ROBOT;
            nsw = findWorldState(new_world_state,world_states);
            
            nsh = sh; %trust stays the same
            nss = (nsh-1)*num_world_states + nsw;
            Trans(ss,ra,nss) = (1-PROB_TRUST_INTERVENE(sh,ra))*(1-PROB_TRUST_INCREASE(sh,ra));
            
            if sh == 1 %trust can increase only if low
                nsh = sh+1;
                nss = (nsh-1)*num_world_states + nsw;
                Trans(ss,ra,nss) = (1-PROB_TRUST_INTERVENE(sh,ra))*PROB_TRUST_INCREASE(sh,ra);
            end
            
            %picked by human
            new_world_state = world_state;
            new_world_state(ra) =Label.PICKED_BY_HUMAN;
            nsw = findWorldState(new_world_state,world_states);
            
            nsh = sh; %trust stays the same
            nss = (nsh-1)*num_world_states + nsw;
            Trans(ss,ra,nss) = PROB_TRUST_INTERVENE(sh,ra);
        end
    end
end
 
disp(' ')
disp('Transition Matrix')
for ss = 1:num_states
    for ra = 1:num_ractions
        sh = floor(ss/(num_world_states+1)) + 1;
        sw = ss - (sh-1)*num_world_states;
        nIndices = find(Trans(ss,ra,:)>0);
        %do not worry about invalid actions
        if (world_states(sw,ra) == Label.ON_TABLE)
            str = strcat('if~', human_states_str{sh} , ' and bottle is~' , objstate_str(world_states(sw,1)) , ' and glass is~' , objstate_str(world_states(sw,2)) , ' and robot does~' , ractions_str{ra},':');
            disp(str);
            for nn = 1:length(nIndices)
                nss = nIndices(nn);
                
                nsh = floor(nss/(num_world_states+1)) + 1;
                nsw = nss - (nsh-1)*num_world_states;
                str = strcat('then the prob of ~', human_states_str{nsh} , ' and bottle is~' , objstate_str(world_states(nsw,1)) , ' and glass is~' , objstate_str(world_states(nsw,2)),':',num2str(Trans(ss,ra,nss)));
                disp(str);
            end
        end
    end
end
 
%% Define the reward
 
%reward function
Rew = zeros(num_states,num_ractions);
for ss = 1:num_states
    for ra = 1:num_ractions
        sh = floor(ss/(num_world_states+1)) + 1;
        sw = ss - (sh-1)*num_world_states;
         
         %bonus if the robot has picked the glass while the bottle is still on the table (i.e. it started with the glass)
         if (world_states(sw,Label.GLASS)==Label.PICKED_BY_ROBOT)&& (world_states(sw,Label.BOTTLE)==Label.ON_TABLE)
           Rew(ss,ra) = 5;
         end
          
        %if we are not in a final state
        if (world_states(sw,1)==Label.ON_TABLE) || (world_states(sw,2)==Label.ON_TABLE)
            if world_states(sw,ra)~= Label.ON_TABLE %penalize infeasible actions
                Rew(ss,ra) = -1000;
            end
        end
    end
end
 
%add positive reward for goal.
goal = findWorldState([Label.PICKED_BY_ROBOT, Label.PICKED_BY_ROBOT],world_states);
for ra = 1:num_ractions
    for sh = 1:2
        ss = (sh-1)*num_world_states + goal;
        Rew(ss,ra) = 10;
    end
end
 
%print reward function
disp(' ')
disp('reward function')
for ss = 1:num_states
    for ra = 1:num_ractions
        sh = floor(ss/(num_world_states+1)) + 1;
        sw = ss - (sh-1)*num_world_states;
        str = strcat('if~', human_states_str{sh} , ' and bottle is~' , objstate_str(world_states(sw,1)) , ' and glass is~' , objstate_str(world_states(sw,2)) , 'and robot action is~',ractions_str{ra}, ' then reward is: ',num2str(Rew(ss,ra)));
        disp(str);
    end
end
 
%% Run value iteration to compute the optimal policy
 
%value iteration
T = 3;
V = zeros(num_states,1);
policy = zeros(num_states,T);
new_V = zeros(num_states,1);
Q = zeros(num_states, num_ractions);
for tt = T:-1:1
    for ss = 1:num_states
         
        %leftover debug hook (has no effect on the algorithm)
        if ss == 12
            debug = 1;
        end
        sh = floor(ss/(num_world_states+1)) + 1;
        sw = ss - (sh-1)*num_world_states;
        %check if terminal state (both objects already picked up)
        if ((world_states(sw,1)~=Label.ON_TABLE) && (world_states(sw,2)~=Label.ON_TABLE))
            new_V(ss) = Rew(ss,1);
            policy(ss,tt) = 1;
            continue;
        end
         
         
        maxV = -1e6;
        maxIndx = -1;
        for ra = 1:num_ractions
            res = Rew(ss,ra);
            for nss = 1:num_states
                res = res +  Trans(ss,ra,nss)*V(nss);
            end
            Q(ss,ra) = res;
            if res > maxV
                maxV = res;
                maxIndx = ra;
            end
        end
        new_V(ss) = maxV;
        policy(ss,tt) = maxIndx;
    end
    V = new_V;
end
 
disp(' ')
disp('policy')
%print policy
for tt = 1
    tt
    for ss = 1:num_states
        sh = floor(ss/(num_world_states+1)) + 1;
        sw = ss - (sh-1)*num_world_states;
        if ((world_states(sw,1) == Label.ON_TABLE)||(world_states(sw,2) == Label.ON_TABLE)) %we care only about feasible states
            str = strcat('if~', human_states_str{sh} , ' and bottle is~' , objstate_str(world_states(sw,1)) , ' and glass is~' , objstate_str(world_states(sw,2)) , 'then robot does: ',ractions_str{policy(ss,tt)});
            disp(str);
        end
    end
end
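
One caveat: the listing calls a helper findWorldState(world_state, world_states) that is not included above. It presumably just returns the row index of a given world state; a minimal sketch of such a helper (my assumption, not the original author's implementation), saved as findWorldState.m next to the script, could be:

function idx = findWorldState(world_state, world_states)
    % Return the row index of world_state in the world_states table (hypothetical helper).
    idx = find(ismember(world_states, world_state, 'rows'), 1);
end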

 

Summary

 

Today we gave a quick introduction to writing an MDP program. We didn't explain every single line of code, but it should already be helpful for readers who want to understand it and apply it themselves. If you have questions, feel free to leave a comment below!

 

Original post: https://blog.techbridge.cc/2018/12/22/intro-to-mdp-program/
