Slides presented at the ML meetup Victoria 2020 about the AWS open-source project "Amazon SageMaker for Battlesnake AI": https://github.com/awslabs/sagemaker-battlesnake-ai
4. Battlesnake reinforcement learning starter pack
[Diagram: agent-environment loop]
1. One-click deploy module: one-click deployment of an existing model
2. Reinforcement learning module: training your own Battlesnake reinforcement learning model
3. Heuristics module: build custom rules on top of an existing model
12. Reinforcement learning module
Training routine for the module (deep Q learning)
for epi in range(episodes):
    state = env.reset()
    while agent.is_alive():
        prob = random.random()
        if prob < eps:    # explore: take a random action
            action = agent.get_random_action()
        else:             # exploit: take the model's best action
            action = agent.get_next_best_action(state)
        next_state, reward = env.step(action)
        memory.append(next_state, state, action, reward)
        state = next_state
    agent.learn(memory)
[Diagram: the agent sends actions to the environment; the environment returns the state and rewards]
13. Reinforcement learning module
Training routine for the module (deep Q learning)
for epi in range(episodes):
    state = env.reset()
    while agents.agents_alive() > 1:
        actions = []
        for agent in agents:
            prob = random.random()
            if prob < eps:    # explore: take a random action
                action = agent.get_random_action()
            else:             # exploit: take the model's best action
                action = agent.get_next_best_action(state)
            actions.append(action)
        next_state, rewards = env.step(actions)
        memory.append(next_state, state, actions, rewards)
        state = next_state
    for agent in agents:
        agent.learn(memory)
[Diagram: multiple agents send actions to a shared environment; the environment returns the state and rewards]
24. Reinforcement learning module
Learning
New predicted total expected reward ≈ neural network(new reward, previous predicted total expected reward)
[Diagram: predicted total expected rewards at t = 0; at t = 1, after receiving a new reward of 1, the network produces new predicted total expected rewards]
29. Reinforcement learning module
Rewards design
• Surviving another turn
• Eating food
• Starving
• Winning the game
• Losing the game
• Hitting a wall/snake/yourself
• Performing a forbidden move
• Eating another snake
• Forcing another snake to hit your body
36. Battlesnake reinforcement learning starter pack
[Diagram: agent-environment loop]
1. One-click deploy module: one-click deployment of an existing model
2. Reinforcement learning module: training your own Battlesnake reinforcement learning model
3. Heuristics module: build custom rules on top of an existing model
37. Custom rules with the heuristic module
Provides a starting point for you to build upon
[Diagram: positions of snakes and food → trained AI snake model → direction of movement (reward?)]
38. Custom rules with the heuristic module
[Same diagram as slide 37]
42. Battlesnake reinforcement learning starter pack
[Diagram: agent-environment loop]
1. One-click deploy module: one-click deployment of an existing model
2. Reinforcement learning module: training your own Battlesnake reinforcement learning model
3. Heuristics module: build custom rules on top of an existing model
43. One-click deployment with SageMaker
• After training your own model
• After writing your custom rules
• Using the existing pretrained snake
Hello everyone, my name is Jonathan Chung and I’m an applied scientist at AWS.
My colleague here is Xavier and he’s a solutions architect at AWS.
Since you are in this talk, I presume everyone knows about Battlesnake, but I’ll give a brief description anyway.
Those of you who are old enough might remember the Snake game on their phones.
The snake moves around.
When you hit the wall, you die.
When you hit yourself, you also die.
When you eat some food, you get longer.
The aim of the game is to stay alive as long as possible.
Battlesnake is an online version of the traditional Snake game where multiple snakes compete and the winner is the snake that survives the longest.
The main differences in the gameplay are that:
If your snake hits another snake’s head, the shorter snake dies, and
Every snake starts with 100 health, and every move you take costs one health. Eating food replenishes your health, and if your health falls to 0, you die.
We built a Battlesnake starter pack that can be used by all types of developers.
Firstly, the one-click deploy module will build a snake for you, deploy it on the cloud, and provide you with a URL. We want to demonstrate how easy it is, so here’s a quick demo of how to get a snake.
Let’s go back to this module. If you are an AI or reinforcement learning enthusiast, you want to learn how to train your own reinforcement learning algorithm, or you are just too lazy to write your own and want a starting point, you can use the reinforcement learning module. This module will let you train a snake and then automatically deploy it after it has trained.
Let’s say you don’t want to train a snake, but you want to make use of an existing model while adding your own flair to it. Then you can use the heuristics module, where you can write custom rules on top of existing models that override the commands of the AI.
I’ll talk a bit about the reinforcement learning module and briefly how reinforcement learning works.
I’m not an expert at this field so please feel free to stop me any time.
In reinforcement learning, you have an agent that interacts with an environment.
Specifically, the agent provides actions, which change the state of the environment, and the environment provides rewards for the actions the agent took.
The aim of the agent is to maximise the rewards.
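In code, this interaction is just a loop. Here is a minimal sketch using a Gym-style interface, where env and agent are placeholders for any concrete implementation:

# Minimal agent-environment interaction loop (Gym-style interface).
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = agent.act(state)                      # the agent chooses an action
    state, reward, done, info = env.step(action)   # the environment reacts
    total_reward += reward                         # this is what the agent maximises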
Let me give you some examples.
In a self driving car.
The car is the agent.
The agent can, for example, accelerate, decelerate, turn left, turn right, etc., and these could be the possible actions.
The environment is the road and the agent views this in the form of images of the road.
The reward is the number of kilometres the car has driven.
For example, suppose the car sees that the road is curving to the right.
Then, to maximise the number of kilometres driven, the agent (which is the car) will turn right.
Now let’s say the environment shows a stop sign. You might think the agent should just ignore it and keep going, because the reward is to maximise the number of kilometres driven. But reinforcement learning is trained to maximise future rewards, and stopping will maximise the future rewards rather than the immediate reward.
There are many more examples; take AlphaGo, for example.
The agent decides the position of the piece in the next move.
The environment is the position of all the pieces on the board.
The reward is simple: win or lose.
Can we guess what the Battlesnake one will be?
Ok so what actions can the snake take?
What will the state be?
How about the reward?
Actually, I didn’t write it down because there are many different choices. I’ll get into that later.
But Battlesnake is not just one snake.
This is called a multi-agent reinforcement learning problem, where there are multiple agents each providing their own actions.
Examples of multi-agent reinforcement learning problems include StarCraft, or AlphaStar, which controls each one of its units separately.
Essentially, the configuration of the problem is the same.
Each agent provides actions which interact with the environment, and each agent receives rewards.
So let me explain the pseudocode of how reinforcement learning works in the single-agent case.
In this module, we developed an implementation of deep Q-learning.
It works like a simulation: you make the agent take actions given a state, then you record the action and what happened.
Specifically, you define the total number of games you want to simulate in episodes.
Then you get the initial state and play the game until the agent (which is the snake) dies.
At each time step, the agent will take an action.
At the start, the agent will simply choose a random action (I'll explain this condition later).
Then you apply the action to the environment, and you get the resulting state and reward.
So you keep repeating these steps while storing the next_state, state, action, and reward into memory.
When you have enough simulated results, you set the agent to learn from the memory; I’ll explain how later.
Once the agent starts to learn how to play, you give it more chances to provide actions with the else statement here.
These actions will be what the agent thinks are the best actions to take, given the state.
So as your agent gets better, you get more simulated results.
For example, in Battlesnake, this way you can get simulation results for when the snakes are larger, or when there is a scarcity of food, etc.
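One common way to implement this shift from random exploration to model-driven actions is to decay eps over the course of training. A sketch (the schedule and its constants are illustrative assumptions, not necessarily what the starter pack uses):

# Linearly decay eps from mostly-random to mostly-greedy.
eps_start, eps_end, decay_episodes = 1.0, 0.05, 10_000

def epsilon(episode):
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Inside the training loop above: eps = epsilon(epi)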
Next, I’ll explain how the training routine accommodates multiple agents.
Firstly, the training loop checks whether there are at least two snakes alive; if only one snake is left, that snake has won.
Next, instead of performing one action, each agent will perform its own action.
The remaining steps are similar to the single-agent case.
Let me first explain the environment.
We modelled the rules of the Battlesnake engine, which include how the snakes move, eat, grow, and die, based on OpenAI Gym. And this is a representation of the snakes.
The environment takes in the actions, then moves the snakes accordingly.
Afterwards, the rewards and the states are emitted.
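A minimal sketch of what such a Gym environment looks like (the class and the attribute values here are placeholders for illustration, not the starter pack’s actual implementation):

import gym
import numpy as np

class BattlesnakeEnv(gym.Env):
    # Skeleton of an environment implementing the Battlesnake rules.
    def __init__(self, num_agents=3, board_size=7):
        self.num_agents = num_agents
        self.action_space = gym.spaces.Discrete(4)   # up, down, left, right
        # One 3-channel image per agent: food / own snake / other snakes;
        # +2 on each side for the border (see the state representation below).
        self.observation_space = gym.spaces.Box(
            low=-1.0, high=5.0,
            shape=(3, board_size + 2, board_size + 2), dtype=np.float32)

    def reset(self):
        # Place the snakes and food, return the initial state.
        ...

    def step(self, actions):
        # Apply one action per snake, resolve moves, food, and collisions,
        # then return (next_state, rewards, done, info).
        ...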
Next, I’ll explain the state.
Let’s say we have a very small board, and here is the food. We also have three snakes: one long orange one and two short ones.
So we represented the state with an image.
Each agent represents one snake.
So if we have three snakes, then we will need three agents.
The agents are fed in specific images. For example, the agent representing the orange snake will be fed an image like this.
Similar to an RGB image with three channels: in the first channel, we provide the information about the food.
The second channel provides information about the orange snake, which is the snake that the agent is representing.
The third channel has information about all the other snakes.
We also added a border of -1s to indicate the wall. We found that it’s easier for the algorithm to avoid the walls this way.
Also, we put a 5 to represent the head and 1s to represent the body, because the head is more important.
Similarly for the green snake.
And for the blue snake.
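As a sketch, building that per-agent image could look like this in NumPy (the helper below and its coordinate conventions are assumptions for illustration, not the starter pack’s actual code):

import numpy as np

HEAD, BODY = 5.0, 1.0  # the head gets a larger value because it matters more

def make_state(board_size, food, own_snake, other_snakes):
    # food: list of (x, y); own_snake and each other snake: list of (x, y),
    # head first. Coordinates are 0-indexed board positions (assumed layout).
    # +2 in each dimension for the border of -1s marking the walls.
    state = np.zeros((3, board_size + 2, board_size + 2), dtype=np.float32)
    state[:, 0, :] = state[:, -1, :] = -1.0
    state[:, :, 0] = state[:, :, -1] = -1.0

    for x, y in food:                        # channel 0: food
        state[0, y + 1, x + 1] = 1.0
    for i, (x, y) in enumerate(own_snake):   # channel 1: this agent's snake
        state[1, y + 1, x + 1] = HEAD if i == 0 else BODY
    for snake in other_snakes:               # channel 2: all the other snakes
        for i, (x, y) in enumerate(snake):
            state[2, y + 1, x + 1] = HEAD if i == 0 else BODY
    return state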
So we used a very simple reward function: every time the snake survives another turn, it is given another reward.
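A sketch of how that survival bonus could be combined with the other events from the rewards design slide (the numeric values and event names are illustrative assumptions, not the starter pack’s actual values):

# Hypothetical reward table for the events listed on the rewards design
# slide; the actual values are a design choice and were not specified.
REWARDS = {
    "survived_turn":       1.0,   # surviving another turn
    "ate_food":            2.0,
    "starved":           -10.0,
    "won_game":           20.0,
    "lost_game":         -20.0,
    "hit_obstacle":      -20.0,   # wall, another snake, or yourself
    "forbidden_move":     -5.0,
    "ate_another_snake":   5.0,   # won a head-to-head collision
    "forced_collision":    5.0,   # another snake ran into your body
}

def compute_reward(events):
    # Sum the rewards for every event that occurred this turn.
    return sum(REWARDS[e] for e in events)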
Let’s talk about the agent now. The agent takes the state and figures out which direction to move in.
Note that the reward here is only used during the learning process.
So how does it figure out which direction to take?
We use a neural network to learn this behaviour.
The input is an image and the output is of size 4, representing up, down, left, and right.
Given the image representation of the environment, the neural network in the methodology presented is trained to predict the total expected reward of each move.
Basically, you want to take the action that gives you the highest total expected reward.
For example, given this snake, the expected reward of moving up is 0, because it’ll die immediately. Therefore, the neural network should learn not to choose this action.
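A minimal sketch of such a network in MXNet Gluon (the talk mentions MXNet later; the layer sizes here are illustrative assumptions, not the starter pack’s exact architecture):

import mxnet as mx
from mxnet.gluon import nn

# Small convolutional Q-network: a 3-channel board image goes in,
# 4 Q-values come out (up, down, left, right).
net = nn.Sequential()
net.add(nn.Conv2D(channels=32, kernel_size=3, activation='relu'),
        nn.Conv2D(channels=64, kernel_size=3, activation='relu'),
        nn.Flatten(),
        nn.Dense(128, activation='relu'),
        nn.Dense(4))   # predicted total expected reward per action
net.initialize(mx.init.Xavier())

# e.g. a 9x9 input for a 7x7 board plus the -1 border:
q_values = net(mx.nd.zeros((1, 3, 9, 9)))   # shape (1, 4)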
So how does the neural network learn?
Suppose you run the neural network once, like just now, and you get the predicted total expected rewards here.
Suppose that you went right and you got an actual reward.
Then you run the next step with the new state, and you get another set of predicted total expected rewards.
So what is the relationship between the predicted total expected rewards at t = 0 and the ones at t = 1?
Let me try to explain it this way.
At t = 0, the total expected rewards cover everything into the future.
At t = 1, it’s a bit different, because you took one action and you actually know what the reward was.
You know that by going right, you got a reward of 1.
Formally, the total expected reward is typically denoted as Q.
This is the component I described just now: the difference between the Q predicted now and the Q predicted at the next step is related to the reward you actually received.
The gamma term here is called the discount factor. This is a number between 0 and 1 which accounts for a kind of opportunity cost, where rewards further in the future are worth less.
A common strategy for neural networks to learn is to incrementally update the weights of the network.
The alpha term here is the learning rate, which determines how much to incrementally alter the weights of the network.
Since you take the actions with the maximum expected rewards, the network will slowly learn to take the actions that maximise the reward.
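Putting those pieces together, the update can be written out concretely. Here is a tabular Q-learning sketch (in deep Q-learning the same target value becomes the regression label for the neural network; the numbers are made up for illustration):

# Standard Q-learning update (tabular sketch):
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
alpha, gamma = 0.1, 0.9           # learning rate and discount factor

q_t0 = {"up": 0.0, "down": 0.3, "left": 0.2, "right": 0.5}   # predictions at t = 0
q_t1 = {"up": 0.4, "down": 0.1, "left": 0.0, "right": 0.2}   # predictions at t = 1
reward = 1.0                      # the reward received for going right

target = reward + gamma * max(q_t1.values())       # 1 + 0.9 * 0.4 = 1.36
q_t0["right"] += alpha * (target - q_t0["right"])  # 0.5 + 0.1 * 0.86 = 0.586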
So that’s the basics of reinforcement learning.
Feel free to ask me questions about it, or I can guide you to some more material.
What I did was a very simple application of a deep Q-network. There are many opportunities to improve the model I presented.
Firstly, the method of representing the state could change.
As you can imagine, this method requires a fixed-size map.
One possible method is to create a snake-centric representation, which means your snake is only provided with information about what its head is close to.
Other possible representations include encoding the snakes and food only as coordinates.
The neural network we used was also very simple.
An attention-like method was used to incorporate the snake health, ID, and turn count into a convolutional neural network to predict the actions.
But I believe the most progress could be made with the rewards design.
The gym provides functionality to try to maximise or minimise these rewards, but we really didn’t investigate it too much.
Let me go into a bit of technical detail about the module. We built it with Apache MXNet, and the solution was built with Amazon SageMaker.
SageMaker provides methods to build, train and tune, and deploy the models.
We know that not everyone has a 12 GB GPU at home to train an AI bot.
So SageMaker allows you to train your own snake directly in the cloud.
This way you can get your own snake model.
As you can imagine, there are many different parameters that you need to decide on.
For example, how the network is designed, such as how deep it is.
The best values for the learning parameters, such as the discount factor and the learning rate, could be investigated.
You could also test different methods of representing the state and the rewards.
For example, say you want to investigate whether including a border in the state representation helps.
In one run you include the border, and in the second run you remove it.
Then you compare the snakes to see which one works better.
For example, using the optimization module, we found that the -1 borders actually make the agents significantly better
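With the SageMaker Python SDK, that kind of comparison can be scripted as a hyperparameter tuning job. A sketch under assumed names (the estimator, hyperparameter names, and metric regex are illustrative, not the starter pack’s actual configuration):

from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             CategoricalParameter)

# `estimator` is assumed to be a SageMaker estimator wrapping the training
# script; the hyperparameter and metric names below are made up.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="episode_reward_mean",
    metric_definitions=[{"Name": "episode_reward_mean",
                         "Regex": "episode_reward_mean: ([-0-9.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "gamma": ContinuousParameter(0.8, 0.999),
        "use_border": CategoricalParameter(["true", "false"]),
    },
    max_jobs=20,
    max_parallel_jobs=2,
)
tuner.fit()   # launches the training jobs so you can compare the snakes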
Once your model has been trained, you can deploy the model to the cloud
My colleague Xavier will give you more details about this later
So, for developers who don’t necessarily want to train a new snake but want to make use of this environment and the deployment methods, you can use the heuristics module.
Also, we know that you don’t have days and days to train an AI for every single situation out there, so you can build custom rules to override the commands of the AI.
Let’s say you are the pink snake in this situation.
The AI tells you to go left, which is fine in this situation,
but you know that if the blue snake just continues up, you are dead meat.
So you can override the AI and go right instead; that way, you won’t die.
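In code, such a rule is just a function that intercepts the move suggested by the model. A minimal sketch with hypothetical helper names (you would write these checks yourself):

def choose_move(model_action, state):
    # Return the model's move unless a custom rule vetoes it.
    if loses_head_to_head(model_action, state):   # hypothetical check, e.g. the blue-snake trap
        return safest_alternative(state)          # hypothetical fallback, e.g. go right instead
    return model_action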
Furthermore, we believe that we can streamline the development process this way.
The usual suggested method is to set up the game engine on your own computer, write your code, then upload it to the cloud.
So we decided to make use of the gym we built, which simulates the Battlesnake engine, for you to develop your rules.
After you are satisfied with your rules, the code will be automatically packaged and uploaded into the cloud.
The heuristics module also provides a situation simulation component.
For example, say you want your snake to be in this exact configuration and to see what it’ll do.
You can define this in the gym and then try it out.
Finally, Xavier will describe the deployment process
In fact, our solution supports one-click deployment after you train your own model, after you write your custom rules, or even if you just want to use our pretrained snake.
The purpose of this is so that you can focus on developing your snake rather than worrying about how to deploy it as a web server.