6. Rewards for learning deformable-object manipulation
• Entropy-regularized reinforcement learning (Deep Dynamic Policy Programming); see the sketch below
• Learning without using a simulator
Tsurumine, Y., Cui, Y., Uchibe, E., and Matsubara, T. (2017). Deep dynamic policy programming for robot control
with raw images. In Proc. of IROS.
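Deep Dynamic Policy Programming builds on the tabular DPP recursion, which replaces the hard max of value iteration with an entropy-regularized soft value. Below is a minimal tabular sketch of that recursion; the function names and the parameters eta and gamma are illustrative choices, and the cited work instead trains a deep network on raw images.

```python
import numpy as np

def soft_value(psi_s, eta):
    """Entropy-regularized state value:
    (1/eta) * log sum_a exp(eta * psi_s[a]), computed stably."""
    q = eta * np.asarray(psi_s)
    m = q.max(axis=-1)
    return (m + np.log(np.exp(q - m[..., None]).sum(axis=-1))) / eta

def dpp_sweep(psi, transitions, eta=1.0, gamma=0.95):
    """One sweep of the DPP-style update
        psi(s, a) <- psi(s, a) - V(s) + r + gamma * V(s')
    over observed (s, a, r, s') transitions; psi is an (S, A) array."""
    new_psi = psi.copy()
    for s, a, r, s_next in transitions:
        new_psi[s, a] = psi[s, a] - soft_value(psi[s], eta) \
            + r + gamma * soft_value(psi[s_next], eta)
    return new_psi

# Tiny usage example on a 2-state, 2-action problem (made-up numbers).
psi = np.zeros((2, 2))
transitions = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]
for _ in range(50):
    psi = dpp_sweep(psi, transitions, eta=5.0)
print(psi)
```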
7. The case of shirt folding
Preparing a practical reward is difficult
Tsurumine, Y., Cui, Y., Uchibe, E., and Matsubara, T. (2019). Deep reinforcement learning with smooth policy
update: Application to robotic cloth manipulation. Robotics and Autonomous Systems, 112: 72-83.
13. Problems with behavioral cloning
• The expert's and the learner's state-action distributions differ (covariate shift)
• Errors accumulate as the learner keeps acting, so it drifts away from the expert's distribution (see the toy simulation below)
– There is no mechanism for returning to the original distribution
Ross, S. & Bagnell, J.A. (2010). Efficient Reductions for Imitation Learning. In Proc. of AISTATS, 9:661–668.
Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., & Peters, J. (2018). An Algorithmic Perspective on
Imitation Learning. Foundations and Trends in Robotics 7, no. 1–2, 1–179.
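The quadratic growth of this compounding error over the horizon is the core result of Ross & Bagnell (2010). The toy Python simulation below illustrates it with made-up parameters eps (per-step error rate) and horizon: one mistake pushes the learner off the expert's distribution, and without a recovery mechanism every later step is also off-distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def off_distribution_fraction(horizon=100, eps=0.05, n_runs=10_000):
    """Toy model of compounding error in behavioral cloning.

    The expert stays in state 0 forever.  The cloned policy matches the
    expert with probability 1 - eps per step, but once it slips into an
    off-distribution state it has no training data there and never
    recovers.  Returns the average fraction of off-distribution steps."""
    off_steps = 0
    for _ in range(n_runs):
        off = False
        for _ in range(horizon):
            if not off and rng.random() < eps:
                off = True       # a single mistake leaves the expert's distribution
            off_steps += off     # every subsequent step is also off-distribution
    return off_steps / (n_runs * horizon)

# Even though each individual decision is 95% accurate, roughly 80% of all
# visited states end up off-distribution over a 100-step horizon.
print(off_distribution_fraction())
```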
14. Generative Adversarial Network (GAN)
• A model that generates data through the competition between a generator and a discriminator (a minimal training sketch follows the citation below)
https://deephunt.in/the-gan-zoo-79597dc8c347
[Figure: generator G(z) and discriminator D(x)]
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014).
Generative Adversarial Nets. NeurIPS 27, 2672–2680.
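To make the two-player objective concrete, here is a minimal, self-contained PyTorch sketch on 1-D toy data. The network sizes, learning rates, and toy distribution are illustrative assumptions, not taken from the slides or from Goodfellow et al. (2014); the generator uses the standard non-saturating loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # generator G(z)
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # discriminator D(x), outputs logits

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # "real" data: N(2, 0.5^2), a made-up target
    fake = G(torch.randn(64, 8))            # samples from the generator

    # Discriminator step: classify real as 1, generated as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The generator's output mean should drift toward the target mean of 2.0.
print(G(torch.randn(1000, 8)).mean().item())
```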