Estimating Q(s,s') with Deep Deterministic Dynamics Gradients. (arXiv:2002.09505v1 [cs.LG])

In this paper, we introduce a novel form of value function, $Q(s, s')$, that
expresses the utility of transitioning from a state $s$ to a neighboring state
$s'$ and then acting optimally thereafter. In order to derive an optimal
policy, we develop a forward dynamics model that learns to make next-state
predictions that maximize this value. This formulation decouples actions from
values while still learning off-policy. We highlight the benefits of this
approach in terms of value function transfer, learning within redundant action
spaces, and learning off-policy from state observations generated by
sub-optimal or completely random policies. Code and videos are available at
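
As a rough illustration of the idea, the sketch below learns a tabular $Q(s, s')$ from purely random transitions. It is a toy under stated assumptions, not the method's implementation: the 1-D chain environment, the `neighbors` helper, the reward of 1 on entering the last state, and all hyperparameters are hypothetical. The update target $r + \gamma \max_{s''} Q(s', s'')$ is one reading of "transitioning from $s$ to $s'$ and then acting optimally thereafter", and a plain argmax over neighboring states stands in for the learned forward dynamics model.

```python
import random


def neighbors(s, n_states):
    """States reachable from s in one step on the hypothetical chain."""
    return sorted({max(s - 1, 0), min(s + 1, n_states - 1)})


def tabular_qss(n_states=5, gamma=0.9, alpha=0.5, steps=5000, seed=0):
    """Learn Q(s, s') from random transitions; no action enters the table."""
    rng = random.Random(seed)
    goal = n_states - 1  # terminal state (assumption)
    q = {(s, s2): 0.0 for s in range(n_states) for s2 in neighbors(s, n_states)}
    for _ in range(steps):
        s = rng.randrange(goal)                  # sample a non-terminal state
        s2 = rng.choice(neighbors(s, n_states))  # random behaviour, so off-policy
        r = 1.0 if s2 == goal else 0.0
        if s2 == goal:
            target = r
        else:
            # Bootstrap from the best transition available out of s2.
            target = r + gamma * max(q[(s2, s3)] for s3 in neighbors(s2, n_states))
        q[(s, s2)] += alpha * (target - q[(s, s2)])
    return q


def greedy_next_state(q, s, n_states=5):
    """Pick the neighboring state with the highest learned value."""
    return max(neighbors(s, n_states), key=lambda s2: q[(s, s2)])


q = tabular_qss()
plan = [greedy_next_state(q, s) for s in range(4)]
```

Here `plan` lists, for each non-terminal state, the next state the learned values prefer. Turning a preferred next state into an action would need a separate component, such as an inverse dynamics model, which this sketch leaves out.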
