
[Repost] [Computer Science] [2016] Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs

Posted 2019-12-19 18:28 | Category: Research Notes | Source: Repost

This is the doctoral dissertation of John Schulman at the University of California, Berkeley, 103 pages in total.

 


 

This thesis is mostly focused on reinforcement learning, which is viewed as an optimization problem: maximize the expected total reward with respect to the parameters of the policy. The first part of the thesis is concerned with making policy gradient methods more sample-efficient and reliable, especially when used with expressive nonlinear function approximators such as neural networks. Chapter 3 considers how to ensure that policy updates lead to monotonic improvement, and how to optimally update a policy given a batch of sampled trajectories. After providing a theoretical analysis, we propose a practical method called trust region policy optimization (TRPO), which performs well on two challenging tasks: simulated robotic locomotion, and playing Atari games using screen images as input. Chapter 4 looks at improving sample complexity of policy gradient methods in a way that is complementary to TRPO: reducing the variance of policy gradient estimates using a state-value function. Using this method, we obtain state-of-the-art results for learning locomotion controllers for simulated 3D robots. Reinforcement learning can be viewed as a special case of optimizing an expectation, and similar optimization problems arise in other areas of machine learning; for example, in variational inference, and when using architectures that include mechanisms for memory and attention. Chapter 5 provides a unifying view of these problems, with a general calculus for obtaining gradient estimators of objectives that involve a mixture of sampled random variables and differentiable operations. This unifying view motivates applying algorithms from reinforcement learning to other prediction and probabilistic modeling problems.
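
To make Chapter 3 concrete: in the notation standard for this line of work (the symbols below are the conventional ones, not copied from the thesis), TRPO maximizes an importance-sampled surrogate for the expected advantage, subject to a trust-region bound on the average KL divergence between the old policy and the new one:

```latex
\begin{align*}
\max_{\theta}\quad
  & \mathbb{E}_{s,\,a \sim \pi_{\theta_{\mathrm{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\,
           A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right] \\
\text{subject to}\quad
  & \mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}
    \left[ D_{\mathrm{KL}}\bigl( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)
           \,\|\, \pi_{\theta}(\cdot \mid s) \bigr) \right] \le \delta
\end{align*}
```

The KL constraint is what underwrites the monotonic-improvement analysis: it keeps the new policy close enough to the old one that the sampled surrogate remains a trustworthy local model of the true objective.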
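
Chapter 4's variance-reduction idea (published alongside the thesis as generalized advantage estimation, GAE) also admits a short sketch. The Python below is an illustrative reconstruction, not code from the thesis: `gamma` is the discount factor and `lam` trades bias against variance (`lam=1` gives Monte Carlo returns minus a baseline, `lam=0` the one-step TD residual).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: an exponentially weighted sum
    of TD residuals, using a state-value function as the baseline.

    rewards: shape (T,)   -- r_0 .. r_{T-1} from one trajectory
    values:  shape (T+1,) -- V(s_0) .. V(s_T); last entry bootstraps the tail
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```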
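
Chapter 5's "general calculus" combines two building blocks for differentiating an expectation: the ordinary pathwise gradient through deterministic, differentiable nodes, and the score-function (likelihood-ratio) estimator at sampled nodes, where the objective need not be differentiable at all. Below is a minimal sketch of the score-function case; the Gaussian node and black-box loss are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def score_function_gradient(mu, f, sigma=1.0, n_samples=10_000):
    """Estimate d/d(mu) of E_{x ~ N(mu, sigma^2)}[f(x)] without
    differentiating f, via grad E[f] = E[f(x) * grad log p(x; mu)].

    For a Gaussian, grad_mu log p(x; mu) = (x - mu) / sigma**2.
    This is the estimator a stochastic computation graph attaches to
    each stochastic node; f can be a black box (e.g. a reward).
    """
    x = np.random.normal(mu, sigma, size=n_samples)  # sample the stochastic node
    score = (x - mu) / sigma**2                      # gradient of the log-density
    return np.mean(f(x) * score)                     # Monte Carlo estimate

# Sanity check against a closed form: for f(x) = x**2,
# E[x^2] = mu^2 + sigma^2, so the true gradient is 2 * mu = 3.0 here.
print(score_function_gradient(mu=1.5, f=lambda x: x ** 2))  # approx 3.0
```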

 

 

Introduction

Background

Trust Region Policy Optimization

Generalized Advantage Estimation

Stochastic Computation Graphs

Conclusion





https://blog.sciencenet.cn/blog-69686-1210723.html
