This blog post covers the first part of my internship at the Intelligent and Autonomous Systems group at CWI, Amsterdam, under the supervision of Dr Hendrik Baier.
Problem Statement:
We wanted to find a good way to train neural networks to play board games using Evolution Strategies. Because of COVID-19 I did my internship remotely and didn't have access to the CWI servers. Hence, we scaled the problem down to solving a simple board game such as Connect Four.
C51 is one of my favourite RL algorithms because of its unique approach: it learns the distribution of returns rather than only their expectation. I first heard of this approach in this YouTube video (AI Prism: Deep RL Bootcamp). If you have basic knowledge of DQN and policy gradient algorithms, the complete course is worth watching. C51 also has modified versions, such as QR-DQN and IQN. C51 is easier to visualise than QR-DQN and IQN, so it is worth understanding C51 and the distributional perspective of RL before moving on to them. I have implemented C51 and QR-DQN, and this blog should help readers understand Distributional RL. I have also written down some of my observations. You can find my code here.
Distributional RL
The idea behind learning the distribution of future returns, instead of just the expected value, is to let the model capture the intrinsic randomness of the returns (stochastic dynamics, stochastic rewards). Consider a person who bought a lottery ticket, along with 1000 other people. 10 lucky ticket holders will each win 1M dollars, so that person has roughly a 1 in 100 chance of getting 1M dollars and roughly a 99 in 100 chance of getting 0 dollars. The expected amount is therefore about 10,000 dollars. Traditional value-based Deep RL algorithms learn exactly this expected return, but in reality nobody ever receives 10,000 dollars: the outcome is either 0 or 1M. In other words, the expectation alone gives a misleading picture of the possible returns. Returns can be multimodal, as in the lottery scenario, and the resulting high variance makes convergence difficult.
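To make the lottery numbers concrete, here is a minimal Python sketch that writes the return as a two-outcome distribution and compares it with its expected value. The variable names and the sample size are my own illustrative choices, and the figures (1001 ticket holders, 10 winners, a 1M dollar prize) are simply read off the example above. It is not part of the C51 implementation.

```python
import random

# Assumed numbers, read off the lottery example:
# 1001 ticket holders, 10 winners, a prize of 1,000,000 dollars each.
n_players = 1001
n_winners = 10
prize = 1_000_000.0

# The return distribution has only two possible outcomes.
outcomes = [0.0, prize]
p_win = n_winners / n_players
probs = [1.0 - p_win, p_win]

# What a traditional value-based method would learn: the expectation.
expected_return = sum(o * p for o, p in zip(outcomes, probs))

# Standard deviation of the return, to show how spread out it is.
variance = sum(p * (o - expected_return) ** 2 for o, p in zip(outcomes, probs))
std_return = variance ** 0.5

# Draw sample returns: every sample is either 0 or the full prize,
# never the expected value itself.
rng = random.Random(0)  # fixed seed, arbitrary choice
samples = rng.choices(outcomes, weights=probs, k=10_000)

print(f"expected return: {expected_return:,.2f}")
print(f"standard deviation: {std_return:,.2f}")
print(f"distinct sampled returns: {sorted(set(samples))}")
```

The expected return comes out at roughly 9,990 dollars, yet no sampled return ever equals it, and the standard deviation is about ten times larger than the mean. A learner that tracks only the mean loses this two-peaked shape, while a distributional learner keeps the probability of each outcome.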