I presented a summary of the Agent57 paper at our RL paper review study group. Agent57 improves on NGU (Never Give Up) in several ways and is the first RL algorithm to surpass the human benchmark on all 57 Atari games. I hope many people find it helpful.
Agent57: Outperforming the Atari Human Benchmark, Badia, A. P. et al., 2020

Introduction
• Arcade Learning Environment (ALE)
• An interface to a diverse set of Atari 2600 game environments designed to be engaging and challenging for human players
• The Atari 2600 games are well suited for evaluating general competency in AI agents for three main reasons:
• 1) Varied enough to claim generality
• 2) Each interesting enough to be representative of settings that might be faced in practice
• 3) Each created by an independent party to be free of experimenter's bias
• Previous achievements of deep RL in the ALE:
• DQN (Mnih et al., 2015): 100% HNS on 23 games
• Rainbow (M. Hessel et al., 2017): 100% HNS on 53 games
• R2D2 (Kapturowski et al., 2018): 100% HNS on 52 games
• MuZero (Schrittwieser et al., 2019): 100% HNS on 51 games
→ Despite all efforts, no single RL algorithm has been able to achieve over 100% HNS on all 57 Atari games with one set of hyperparameters.
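The "100% HNS" figures above refer to the human normalized score, which maps a random policy to 0% and the human baseline to 100%. A minimal sketch (the formula is the standard one; the example scores are made up):

```python
# Human normalized score (HNS). 0.0 corresponds to a random policy,
# 1.0 (i.e. 100%) to the human baseline. Example values are hypothetical.
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """HNS = (agent - random) / (human - random)."""
    return (agent - random) / (human - random)

# An agent scoring 150 where random scores 0 and a human scores 100:
print(human_normalized_score(150.0, 0.0, 100.0))  # 1.5, i.e. 150% HNS
```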
• How to measure Artificial General Intelligence?
• Atari-57 5th percentile performance
• Challenging issues of deep RL in the ALE
• 1) Long-term credit assignment
• This problem is particularly hard when rewards are delayed and credit needs to be assigned over long sequences of actions.
• The game of Skiing is a canonical example due to its peculiar reward structure.
• The goal of the game is to run downhill through all gates as fast as possible.
• A penalty of five seconds is given for each missed gate.
• The reward, given only at the end, is proportional to the time elapsed.
• Long-term credit assignment is needed to understand why an action taken early in the game (e.g. missing a gate) has a negative impact on the obtained reward.
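The delayed reward structure described above can be illustrated with a toy sketch (an intentionally simplified model of the slide's description, not the actual Atari emulator; the function name and example values are hypothetical):

```python
# Toy model of Skiing's reward structure: every step yields 0 reward, and
# only the final step returns the (negative) total elapsed time, including
# a 5-second penalty per missed gate. The single end-of-episode signal is
# what makes credit assignment over early actions so hard.
def skiing_episode_rewards(step_times: list[float], missed_gates: int,
                           penalty_per_gate: float = 5.0) -> list[float]:
    rewards = [0.0] * (len(step_times) - 1)  # no signal until the end
    total_time = sum(step_times) + penalty_per_gate * missed_gates
    rewards.append(-total_time)  # reward proportional to negative elapsed time
    return rewards

# A 4-step run with one missed gate: the only informative reward arrives
# at the very last step.
print(skiing_episode_rewards([1.0, 1.0, 1.0, 1.0], missed_gates=1))
# [0.0, 0.0, 0.0, -9.0]
```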
• 2) Exploration in large, high-dimensional state spaces
• Games like Private Eye, Montezuma's Revenge, Pitfall! or Venture are widely considered hard exploration games, as hundreds of actions may be required before a first positive reward is seen.
• In order to succeed, the agents need to keep exploring the environment despite the apparent impossibility of finding positive rewards.
• Exploration algorithms in deep RL generally fall into three categories:
• Randomized value functions
• Unsupervised policy learning
• Intrinsic motivation: NGU
• Other work combines handcrafted features, domain-specific knowledge or privileged pre-training to side-step the exploration problem.
• In summary, our contributions are as follows:
• 1) A new parameterization of the state-action value function that decomposes the contributions of the intrinsic and extrinsic rewards. As a result, we significantly increase the training stability over a large range of intrinsic reward scales.
• 2) A meta-controller: an adaptive mechanism to select which of the policies (parameterized by exploration rate and discount factor) to prioritize throughout the training process. This allows the agent to control the exploration/exploitation trade-off by dedicating more resources to one or the other.
• 3) Finally, we demonstrate for the first time performance that is above the human
baseline across all Atari 57 games. As part of these experiments, we also find that
simply re-tuning the backprop-through-time window to twice the previously
published window for R2D2 led to superior long-term credit assignment (e.g., in
Solaris) while still maintaining or improving overall performance on the remaining
games.
Background: NGU
• Our work builds on top of the Never Give Up (NGU) agent,
which combines two ideas:
• Curiosity-driven exploration
• Distributed deep RL agents, in particular R2D2
• NGU computes an intrinsic reward in order to encourage
exploration. This reward is defined by combining
• Per-episode novelty
• Life-long novelty
• Per-episode novelty, r_t^episodic, rapidly vanishes over the course of an episode; it is computed by comparing observations to the contents of an episodic memory.
• Life-long novelty, α_t, slowly vanishes throughout training; it is computed using a parametric model.
• With this, the intrinsic reward r_t^i is defined as follows:

r_t^i = r_t^episodic · min(max(α_t, 1), L)

where L = 5 is a chosen maximum reward scaling.
• This leverages the long-term novelty provided by α_t, while r_t^episodic continues to encourage the agent to explore within an episode.
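In code, this combination is a one-liner. The sketch below is ours (function and argument names are not from the paper), but it follows the formula above exactly:

```python
# Hedged sketch of NGU's intrinsic-reward combination; names are illustrative.
L = 5.0  # maximum reward scaling, as stated above

def intrinsic_reward(r_episodic: float, alpha_t: float, max_scale: float = L) -> float:
    """r_t^i = r_t^episodic * min(max(alpha_t, 1), L).

    The life-long novelty alpha_t only amplifies the episodic novelty when it
    exceeds 1, and its effect is capped at max_scale.
    """
    return r_episodic * min(max(alpha_t, 1.0), max_scale)
```

Note that when α_t ≤ 1 the multiplier is exactly 1, so the episodic novelty passes through unchanged; large α_t values are clipped at L = 5.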
• At time t, NGU adds N different scales of the same intrinsic reward, β_j·r_t^i (β_j ∈ ℝ⁺, j ∈ {0, …, N−1}), to the extrinsic reward provided by the environment, r_t^e, to form N potential total rewards:

r_{j,t} = r_t^e + β_j·r_t^i

• Consequently, NGU aims to learn the N associated optimal state-action value functions Q*_{r_j}, one for each reward function r_{j,t}.
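A small sketch of forming the N total rewards (the β schedule below is illustrative, not the paper's; β_0 = 0 recovers the purely extrinsic reward):

```python
import numpy as np

# Illustrative exploration-rate schedule; NGU's actual beta values differ.
N = 4
betas = np.linspace(0.0, 0.3, N)  # beta_0 = 0 gives the purely extrinsic member

def total_rewards(r_e: float, r_i: float) -> np.ndarray:
    """Return the N rewards r_{j,t} = r_t^e + beta_j * r_t^i as a vector."""
    return r_e + betas * r_i
```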
• The exploration rates 𝛽𝑗 are parameters that control the degree
of exploration.
• Higher values will encourage exploratory policies.
• Smaller values will encourage exploitative policies.
• Additionally, for purposes of learning long-term credit assignment, each Q*_{r_j} has its own associated discount factor γ_j.
• Since the intrinsic reward is typically much denser than the extrinsic reward, the pairs {(β_j, γ_j)}_{j=0}^{N−1} are chosen so as to allow long-term horizons (high γ_j) for exploitative policies (small β_j) and short horizons (low γ_j) for exploratory policies (high β_j).
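One illustrative way to realize this pairing, with β_j increasing and γ_j decreasing across the family (the paper's actual schedules differ):

```python
# Illustrative (beta_j, gamma_j) family: exploitative members (small beta)
# get long horizons (gamma near 1); exploratory members (large beta) get
# shorter horizons. All numeric values here are assumptions, not the paper's.
N = 8
pairs = [
    (j / (N - 1) * 0.3,                        # beta_j grows with j
     0.997 - j / (N - 1) * (0.997 - 0.99))     # gamma_j shrinks with j
    for j in range(N)
]
```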
• To learn the state-action value functions Q*_{r_j}, NGU trains a recurrent neural network Q(x, a, j; θ), where j is a one-hot vector indexing one of N implied MDPs (in particular the pair (β_j, γ_j)), x is the current observation, a is an action, and θ are the parameters of the network (including the recurrent state).
• In practice, NGU can be unstable and fail to learn an appropriate approximation of Q*_{r_j} for all the state-action value functions in the family, even in simple environments.
• This is especially the case when the scale and sparseness of r_t^e and r_t^i differ, or when one reward is noisier than the other.
• We conjecture that learning a common state-action value function for a mix of rewards is difficult when the rewards are very different in nature.
• Our agent is a deep distributed RL agent, in the lineage of
R2D2 and NGU.
• It decouples the data collection and the learning processes by
having many actors feed data to a central prioritized replay
buffer.
• A learner can then sample training data from this buffer.
• More precisely, the replay buffer contains sequences of transitions, which are removed regularly in FIFO order.
• These sequences come from actor processes that interact with independent copies of the environment, and they are prioritized based on temporal-difference errors.
• The priorities are initialized by the actors and updated by the
learner with the updated state-action value function
𝑄 𝑥, 𝑎, 𝑗; 𝜃 .
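The actor/learner replay interaction described above can be sketched as follows. This is a toy proportional-priority scheme (R2D2's actual prioritization mixes the max and mean absolute TD errors over a sequence); the class and method names are ours:

```python
import random
from collections import deque

class PrioritizedReplay:
    """Toy sketch of a prioritized FIFO sequence buffer (names illustrative)."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # oldest sequences fall out (FIFO)

    def add(self, sequence, priority: float):
        # Priorities are initialized by the actor that produced the sequence.
        self.buffer.append([sequence, priority])

    def sample(self):
        # Sample a sequence with probability proportional to its priority.
        total = sum(p for _, p in self.buffer)
        r, acc = random.uniform(0, total), 0.0
        for idx, (seq, p) in enumerate(self.buffer):
            acc += p
            if r <= acc:
                return idx, seq
        return len(self.buffer) - 1, self.buffer[-1][0]

    def update_priority(self, idx: int, priority: float):
        # The learner refreshes priorities with TD errors from the updated Q.
        self.buffer[idx][1] = priority
```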
• According to those priorities, the learner samples sequences of
transitions from the replay buffer to construct an RL loss.
• Then, it updates the parameters of the neural network
𝑄 𝑥, 𝑎, 𝑗; 𝜃 by minimizing the RL loss to approximate the
optimal state-action value function.
• Finally, each actor shares the same network architecture as the
learner but with different weights.
• We denote by θ_l the parameters of the l-th actor. The learner weights θ are sent to each actor frequently, allowing it to update its own weights θ_l.
• Each actor uses a different value ε_l to follow an ε_l-greedy policy based on the current estimate of the state-action value function Q(x, a, j; θ_l).
• In particular, at the beginning of each episode, each NGU actor uniformly selects a pair (β_j, γ_j).
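A minimal sketch of one actor's per-episode setup under these rules (illustrative names; the real actor conditions a recurrent network on the one-hot index j rather than receiving precomputed Q-values):

```python
import random
import numpy as np

def start_episode(n_policies: int) -> int:
    """Uniformly select the index j of a (beta_j, gamma_j) pair."""
    return random.randrange(n_policies)

def epsilon_greedy(q_values: np.ndarray, eps_l: float) -> int:
    """Act epsilon_l-greedily w.r.t. per-action values Q(x, ., j; theta_l)."""
    if random.random() < eps_l:
        return random.randrange(len(q_values))  # explore: random action
    return int(np.argmax(q_values))             # exploit: greedy action
```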
Improvements to NGU
• State-Action Value Function Parameterization

Q(x, a, j; θ) = Q(x, a, j; θ^e) + β_j · Q(x, a, j; θ^i)

• Q(x, a, j; θ^e): extrinsic component
• Q(x, a, j; θ^i): intrinsic component
• Weights: θ^e and θ^i, trained separately with identical architectures, and θ = θ^e ∪ θ^i
• Rewards: r^e and r^i, respectively, but both components share the same target policy π(x) = argmax_{a∈𝒜} Q(x, a, j; θ)
• We show (in Appendix B of the paper) that optimizing these separate state-action values is equivalent to optimizing the original single state-action value function with reward r^e + β_j·r^i.
• Even though the theoretical objective being optimized is the same, the parameterization is different: we use two different neural networks to approximate each of these state-action values.
• Adaptive Exploration over a Family of Policies
• The core idea of NGU is to jointly train a family of policies with different
degrees of exploratory behavior using a single network architecture.
• In this way, training these exploratory policies plays the role of a set of
auxiliary tasks that can help train the shared architecture even in the absence
of extrinsic rewards.
• A major limitation of this approach is that all policies are trained equally,
regardless of their contribution to the learning progress.
• We propose to incorporate a meta-controller that can
adaptively select which policies to use both at training and
evaluation time. This carries two important consequences.
• Meta-controller: adaptive policy selection
• 1) By selecting which policies to prioritize during training, we can allocate
more of the capacity of the network to better represent the state-action
value function of the policies that are most relevant for the task at hand.
• Note that this is likely to change throughout the training process, naturally
building a curriculum to facilitate training.
• Policies are represented by pairs of exploration rate and discount factor,
(𝛽𝑗, 𝛾𝑗), which determine the discounted cumulative rewards to maximize.
• It is natural to expect policies with higher 𝛽𝑗 and lower 𝛾𝑗 to make more
progress early in training, while the opposite would be expected as training
progresses.
• 2) This mechanism also provides a natural way of choosing the best policy in
the family to use at evaluation time.
• Considering a wide range of values of γ_j with β_j ≈ 0 provides a way of automatically adjusting the discount factor on a per-task basis.
• This significantly increases the generality of the approach.
• We propose to implement the meta-controller using a nonstationary multi-armed bandit algorithm running independently on each actor.
• The reason for this choice, as opposed to a global meta-controller, is that each actor follows a different ε_l-greedy policy, which may alter the choice of the optimal arm.
• The multi-armed bandit problem is a problem in which a fixed
limited set of resources must be allocated between competing
(alternative) choices in a way that maximizes their expected
gain, when each choice's properties are only partially known at
the time of allocation, and may become better understood as
time passes or by allocating resources to the choice.
• Nonstationary multi-armed bandit algorithm
• Each arm j of the N-armed bandit is linked to a policy in the family and corresponds to a pair (β_j, γ_j). At the beginning of the k-th episode, the meta-controller chooses an arm J_k (a random variable), determining which policy will be executed.
• The l-th actor then acts ε_l-greedily with respect to the corresponding state-action value function, Q(x, a, J_k; θ_l), for the whole episode.
• The undiscounted extrinsic episode return, denoted R_k^e(J_k), is used as the reward signal to train the meta-controller's multi-armed bandit algorithm.
• The reward signal R_k^e(J_k) is non-stationary, as the agent changes throughout training. Thus, a classical bandit algorithm such as Upper Confidence Bound (UCB) will not be able to adapt to changes in the reward over time.
• Therefore, we employ a simplified sliding-window UCB (SW-UCB) with ε_UCB-greedy exploration.
• With probability 1 − ε_UCB, the algorithm runs a slight modification of classic UCB on a sliding window of size τ; with probability ε_UCB, it selects a random arm.
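A minimal sketch of such a sliding-window UCB with ε-greedy arm selection follows. The window size, exploration bonus weight, and all names are illustrative assumptions, not the paper's exact configuration:

```python
import math
import random
from collections import deque

class SlidingWindowUCB:
    """Toy SW-UCB with epsilon-greedy exploration (illustrative parameters)."""

    def __init__(self, n_arms: int, tau: int = 160, eps: float = 0.5,
                 bonus: float = 1.0):
        self.n_arms, self.eps, self.bonus = n_arms, eps, bonus
        self.history = deque(maxlen=tau)  # (arm, reward) pairs in the window

    def select(self) -> int:
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, r in self.history:
            counts[arm] += 1
            sums[arm] += r
        # Play every arm at least once within the current window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        # With probability eps, pick a random arm; otherwise run UCB on window.
        if random.random() < self.eps:
            return random.randrange(self.n_arms)
        t = len(self.history)
        ucb = [sums[a] / counts[a]
               + self.bonus * math.sqrt(math.log(t) / counts[a])
               for a in range(self.n_arms)]
        return max(range(self.n_arms), key=ucb.__getitem__)

    def update(self, arm: int, reward: float):
        # Append the episode return; old episodes fall out of the window,
        # which is what lets the bandit track a non-stationary reward.
        self.history.append((arm, reward))
```

Each actor would run one such bandit, calling `select()` at the start of every episode to pick a (β_j, γ_j) pair and `update()` with the undiscounted extrinsic return at the end.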
Conclusions
• We present the first deep reinforcement learning agent with
performance above the human benchmark on all 57 Atari
games.
• The agent is able to balance the learning of the different skills required to perform well on such a diverse set of games:
• Exploration
• Exploitation
• Long-term credit assignment
• To do that, we propose simple improvements to an existing
agent, Never Give Up, which has good performance on hard-
exploration games, but in itself does not have strong overall
performance across all 57 games.
• 1) Using a different parameterization of the state-action value function
• 2) Using a meta-controller to dynamically adapt the novelty preference and
discount
• 3) Using a longer backprop-through-time window for learning with the Retrace algorithm