Reinforcement Learning with Python by Vihar Kurama

Reinforcement learning is a class of machine learning in which an agent learns how to behave in an environment by performing actions and observing the results. In this article, you’ll learn how to understand and design a reinforcement learning problem and solve it in Python.

Recently we’ve been seeing computers play games against humans, either as bots in multiplayer games or as opponents in one-on-one games such as Dota 2, PUBG and Mario. DeepMind (a research company) made history when its AlphaGo program defeated the South Korean Go world champion in 2016. If you’re an avid gamer, you have probably heard about the Dota 2 OpenAI Five matches, where machines played against humans and defeated some of the world’s top Dota 2 players in a few matches (if you are interested in this, here is the complete analysis of the algorithm and the game played by the machine).

The latest version of OpenAI Five taking Roshan.(src)


So here’s the central question: why do we need reinforcement learning? Is it only used for games, or can it be applied to real-world scenarios and problems? If you are learning about reinforcement learning for the first time, the range of answers may surprise you. It’s one of the most widely used and fastest-growing technologies in the field of Artificial Intelligence.

Here are a few applications that may motivate you to build reinforcement learning systems:

  1. Self Driving Cars
  2. Gaming
  3. Robotics
  4. Recommendation Systems
  5. Advertising and Marketing

A Brief Review and Origins of Reinforcement Learning

So, where did Reinforcement Learning come from when we already have a good number of Machine Learning and Deep Learning techniques at hand? “It’s invented by Rich Sutton and Andrew Barto, Rich’s Ph.D. thesis advisor.” It took shape in the 1980s but was considered archaic at the time. Rich nevertheless believed in its promise and that it would eventually be recognized.

Reinforcement Learning supports automation by learning from the environment it is placed in. Machine Learning and Deep Learning support automation as well, although with a different strategy. So, why Reinforcement Learning?

It’s very much like the natural learning process, in which the model receives feedback on whether it has performed well or not. Deep Learning and Machine Learning are learning processes as well, but they mostly focus on finding patterns in existing data. Reinforcement Learning, on the other hand, learns by trial and error and eventually arrives at the right actions or the global optimum. A significant additional advantage of Reinforcement Learning is that we need not provide the whole training data as in Supervised Learning. Instead, a few chunks of it would suffice.

Understanding Reinforcement Learning

Imagine you are teaching your cat new tricks. Unfortunately, cats don’t understand our language, so we can’t tell them what we want them to do. Instead, we set up a situation, and the cat tries to respond in many different ways. If the cat’s response is the desired one, we reward it with milk. Now guess what: the next time the cat is exposed to the same situation, it executes a similar action with even more enthusiasm, expecting more food. This is learning from positive responses. If the cat is instead met with negative responses such as angry faces, it does not tend to learn from them.

Reinforcement Learning works in a similar way: we give the machine a few inputs and actions, and then reward it based on the output. Reward maximisation is our end goal. Now let’s see how we interpret the problem above as a Reinforcement Learning problem.

  • The cat will be the “agent” that is exposed to the “environment”.
  • The environment is the house or play area, depending on what you are teaching the cat.
  • The situations the cat encounters are called “states”; for example, your cat crawling under the bed or running can each be interpreted as a state.
  • The agent reacts by performing actions that move it from one “state” to another.
  • After the change in states, we give the agent either a “reward” or a “penalty” depending on the action that is performed.
  • The “policy” is the strategy for choosing actions that lead to better outcomes.

Now that we have understood what Reinforcement Learning is, let’s look more closely at how it works and how it can solve problems that Supervised or Unsupervised Learning can’t. Here’s a fun fact: the Google search engine is optimised using Reinforcement Learning algorithms.

Getting familiar with Reinforcement Learning Terminology

The agent and the environment play the essential roles in a reinforcement learning algorithm. The environment is the world that the agent lives in. The agent perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return. Before we write our first reinforcement learning algorithm, we need to understand the following terminology.

  1. States: A state is a complete description of the world; it doesn’t hide any piece of information that is present in the world. It can be a position, a constant or a dynamic quantity. We mostly record states in arrays, matrices or higher-order tensors.
  2. Action: The available actions depend on the environment; different environments allow different actions for the agent. The set of valid actions for an agent is called the action space. It is usually finite.
  3. Environment: This is the place where the agent lives and with which it interacts. Different types of environments call for different rewards, policies, etc.
  4. Reward and Return: The reward function R must be tracked at all times in reinforcement learning. It plays a vital role in tuning and optimizing the algorithm and in deciding when to stop training. It depends on the current state of the world, the action just taken, and the next state of the world. The return is the cumulative reward the agent collects.
  5. Policies: A policy is a rule used by an agent to choose its next action; policies are also called the agent’s brain.
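As a small illustration of the difference between a reward and the return, the sketch below adds up a list of per-step rewards. The helper name `discounted_return` and the discount factor `gamma` are assumptions introduced here for illustration; with `gamma = 1.0` the return is simply the plain sum of rewards.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum rewards, weighting the reward at step t by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Three steps costing -1 each, then a +20 reward for finishing the task.
episode_rewards = [-1, -1, -1, 20]
print(discounted_return(episode_rewards))             # plain sum: 17.0
print(discounted_return(episode_rewards, gamma=0.5))  # later rewards count less
```

A smaller `gamma` makes the agent value immediate rewards more than distant ones.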

Now that we have seen the reinforcement learning terminology, let’s solve a problem using reinforcement learning algorithms. Before that, we need to understand how to design a problem and map this terminology onto it.

Solving the Taxi Problem


Let’s say we have a training area for our taxi where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B). Before that, we need to understand and set up the environment, which is where Python comes in. If you are learning Python from scratch, I would recommend this article.

You can set up the Taxi-Problem environment using OpenAI’s Gym, which is one of the most used libraries for solving reinforcement learning problems. Before using it, we need to install Gym on your machine. To do that, you can use the Python package installer, pip. Below is the command to install it.

pip install gym

Now let’s see how our environment is going to render. All the models and interfaces for this problem are already configured in Gym under the name Taxi-v2, and the environment can be rendered to the console with env.render.

“There are 4 locations (labelled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.” (Source: )

This will be the rendered output on your console:

Taxi V2 ENV
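To make the grid layout concrete, here is a plain-Python sketch that draws the same 5x5 parking lot as text. The names `MAP` and `render` are illustrative stand-ins and not part of Gym; the sketch assumes the usual Taxi layout, where ':' is an open border between cells and '|' is a wall, and it marks the taxi with a plain 'T' instead of the coloured highlight Gym uses.

```python
# Text layout of the 5x5 grid: ':' is an open border, '|' is a wall.
MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]


def render(taxi_row, taxi_col):
    """Return the grid as text with 'T' marking the taxi's cell."""
    lines = [list(line) for line in MAP]
    # Cell (row, col) sits at text line row + 1, character 2 * col + 1.
    lines[taxi_row + 1][2 * taxi_col + 1] = "T"
    return "\n".join("".join(line) for line in lines)


print(render(3, 1))  # taxi in row 3, column 1
```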


Perfect. env is the core of OpenAI Gym: it is the unified environment interface. The following env methods will be quite helpful to us:

env.reset: Resets the environment and returns a random initial state.
env.step(action): Steps the environment by one timestep.
env.render: Renders one frame of the environment (helpful in visualizing the environment).

env.step(action) returns the following variables:

  • observation: Observations of the environment.
  • reward: Whether your action was beneficial or not.
  • done: Indicates whether we have successfully picked up and dropped off a passenger, which marks the end of one episode.
  • info: Additional info such as performance and latency for debugging purposes.
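The reset/step contract described above can be tried out without installing anything. The sketch below uses a hypothetical `CorridorEnv` class, written here only as a stand-in that mimics the Gym-style interface; it is not part of Gym, and unlike Taxi it always starts in the same state.

```python
import random


class CorridorEnv:
    """Hypothetical stand-in with Gym-style reset/step; not part of Gym."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Put the agent back at the left end and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Move left (0) or right (1); return (observation, reward, done, info)."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 20 if done else -1
        return self.position, reward, done, {}


env = CorridorEnv()
rng = random.Random(0)
observation = env.reset()
done = False
total_reward = 0
while not done:
    action = rng.choice([0, 1])  # random action, in the spirit of action_space.sample()
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(observation, total_reward)
```

The loop has the same shape you would use with a real Gym environment: reset once, then call step repeatedly until done is true.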

Now that we have seen the environment, let’s understand the problem more deeply. The taxi is the only car in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice that the current location of our taxi is coordinate (3, 1).

In the environment, there are four possible locations where you can pick up or drop off passengers: R, G, Y, B, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates, if you interpret the rendered environment above as a coordinate grid.

When we also account for one (1) additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to come to a total number of states for our taxi environment; there are four (4) destinations and five (4 + 1) passenger locations. So, our taxi environment has 5×5×5×4=500 total possible states. The agent encounters one of the 500 states, and it takes action. The action in our case can be to move in a direction or decide to pick up/drop off a passenger.
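The 5×5×5×4 count can be made concrete by packing the four components into a single state number. The sketch below assumes the ordering (taxi row, taxi column, passenger location, destination), which is how Gym’s Taxi environment is understood to number its states; the helper names `encode` and `decode` are illustrative.

```python
def encode(taxi_row, taxi_col, passenger, destination):
    """Pack the four state components into one integer in range(500)."""
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger    # 0-3: waiting at R, G, Y, B; 4: in the taxi
    index = index * 4 + destination  # 0-3: R, G, Y, B
    return index


def decode(index):
    """Invert encode(), returning (taxi_row, taxi_col, passenger, destination)."""
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger, destination


print(encode(3, 1, 2, 0))  # taxi at (3, 1), passenger at Y, destination R: 328
print(decode(333))         # (3, 1, 3, 1): same cell, passenger at B, destination G
```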

In other words, we have six possible actions: pickup, dropoff, north, east, south, west (the four directions are the moves by which the taxi travels).

This is the action space: the set of all the actions that our agent can take in a given state.

You’ll notice in the illustration above that the taxi cannot perform certain actions in certain states due to walls. In the environment’s code, every wall hit simply incurs a -1 penalty and the taxi doesn’t move. This just racks up penalties, causing the taxi to consider going around the wall.
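A minimal sketch of this wall rule, assuming the usual Taxi grid drawn as text where ':' is an open border and '|' is a wall. The `MAP` constant and the `move` helper are illustrative names, not Gym’s own code.

```python
MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]


def move(row, col, action):
    """Apply a movement action (0 south, 1 north, 2 east, 3 west).

    Returns the new (row, col). A blocked move leaves the taxi in place,
    so it only costs the usual -1 step penalty without making progress.
    """
    if action == 0:
        row = min(row + 1, 4)
    elif action == 1:
        row = max(row - 1, 0)
    elif action == 2 and MAP[row + 1][2 * col + 2] == ":":
        col += 1
    elif action == 3 and MAP[row + 1][2 * col] == ":":
        col -= 1
    return row, col


print(move(3, 1, 3))  # wall to the west: stays at (3, 1)
print(move(3, 1, 2))  # open to the east: moves to (3, 2)
```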

Reward Table: When the Taxi environment is created, an initial reward table called P is also created. We can think of it as a matrix with the states as rows and the actions as columns, i.e. a states × actions matrix.

Since every state is in this matrix, we can see the default reward values assigned to our illustration’s state:

>>> import gym
>>> env = gym.make("Taxi-v2").env
>>> env.P[333]
{0: [(1.0, 433, -1, False)],
 1: [(1.0, 233, -1, False)],
 2: [(1.0, 353, -1, False)],
 3: [(1.0, 333, -1, False)],
 4: [(1.0, 333, -10, False)],
 5: [(1.0, 333, -10, False)]}

This dictionary has a structure {action: [(probability, nextstate, reward, done)]}.

  • The keys 0–5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform at our current state in the illustration.
  • done is used to tell us when we have successfully dropped off a passenger in the right location.
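To show how such an entry is read, the sketch below copies the transition entries printed above into a plain dictionary and unpacks each tuple. The names `transitions` and `ACTION_NAMES` are illustrative.

```python
# Entries as printed above: {action: [(probability, next_state, reward, done)]}.
transitions = {
    0: [(1.0, 433, -1, False)],
    1: [(1.0, 233, -1, False)],
    2: [(1.0, 353, -1, False)],
    3: [(1.0, 333, -1, False)],
    4: [(1.0, 333, -10, False)],
    5: [(1.0, 333, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        name = ACTION_NAMES[action]
        print(f"{name:>7}: next_state={next_state} reward={reward} done={done}")

# Actions whose immediate reward is the -10 penalty are illegal in this state.
illegal = [ACTION_NAMES[a] for a, outcomes in transitions.items()
           if outcomes[0][2] == -10]
print(illegal)  # ['pickup', 'dropoff']
```

Every probability here is 1.0, so each action leads to exactly one next state.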

To attempt the problem without any reinforcement learning, we can simply keep sampling actions at random until the goal state is reached. Reaching the goal yields the maximum reward of 20, every ordinary step costs 1 point, and each step that returns -10, the minimum reward, is counted as a penalty.

Now let’s code this problem without reinforcement learning.

Since we have our P table for default rewards in each state, we can try to have our taxi navigate just using that.

We’ll create a loop that runs until one passenger reaches the destination (one episode), or in other words, until the received reward is 20. The env.action_space.sample() method selects one random action from the set of all possible actions.
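A minimal sketch of such a random agent is shown here in plain Python, so that it runs without Gym. The `step` function is a simplified stand-in for the Taxi rules written for this example: it assumes no interior walls and treats every drop-off away from the destination as illegal. The seed and the start state are arbitrary choices.

```python
import random

LOCS = [(0, 0), (0, 4), (4, 0), (4, 3)]  # R, G, Y, B as (row, col)


def step(state, action):
    """Simplified taxi rules without walls: return (next_state, reward, done)."""
    row, col, passenger, destination = state
    reward, done = -1, False
    if action == 0:    # south
        row = min(row + 1, 4)
    elif action == 1:  # north
        row = max(row - 1, 0)
    elif action == 2:  # east
        col = min(col + 1, 4)
    elif action == 3:  # west
        col = max(col - 1, 0)
    elif action == 4:  # pickup: legal only on the waiting passenger's cell
        if passenger < 4 and (row, col) == LOCS[passenger]:
            passenger = 4  # 4 means "inside the taxi"
        else:
            reward = -10
    else:              # dropoff: legal only at the destination with a passenger
        if passenger == 4 and (row, col) == LOCS[destination]:
            passenger, reward, done = destination, 20, True
        else:
            reward = -10
    return (row, col, passenger, destination), reward, done


rng = random.Random(42)
state = (3, 1, 2, 0)  # taxi at (3, 1), passenger at Y, destination R
epochs, penalties, reward = 0, 0, 0
while reward != 20:
    action = rng.randrange(6)  # stands in for env.action_space.sample()
    state, reward, done = step(state, action)
    if reward == -10:
        penalties += 1
    epochs += 1

print(f"Timesteps taken: {epochs}")
print(f"Penalties incurred: {penalties}")
```

The shortest possible episode from this start state takes 8 steps, so the gap between that and the printed timestep count shows how wasteful random sampling is.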

Let’s see what happens:


credits: OpenAI


Our problem is solved, but the solution isn’t optimized and this approach doesn’t work reliably. We need a properly interacting agent so that the number of iterations the algorithm takes is far smaller. This is where the Q-Learning algorithm comes in; let’s see how it is implemented in the next section.

Introduction to Q-Learning

Q-learning is one of the most used and most basic reinforcement learning algorithms. It uses the environment’s rewards to learn, over time, the best action to take in a given state. In the implementation above, we have our reward table “P” from which the agent learns. Using the reward table, the agent chooses the next action, observes whether it was beneficial, and then updates a value called the Q-value. The table of these values is called the Q-table, and each entry maps to a (state, action) combination. The better the Q-values, the better the rewards we can expect.

For example, if the taxi is faced with a state that includes a passenger at its current location, it is highly likely that the Q-value for pickup is higher when compared to other actions, like dropoff or north.

How do we initialize the Q-values, and how do we calculate them? Q-values are initialized to arbitrary constants, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$

where s is the current state, a is the action taken, r is the reward received and s' is the next state.

Here alpha (α) and gamma (γ) are the parameters of the Q-learning algorithm. Alpha is the learning rate and gamma is the discount factor; both range between 0 and 1, and may be equal to 1. Gamma can be zero while alpha cannot, as the values must be updated with some learning rate. Alpha plays the same role here as the learning rate in supervised learning. Gamma determines how much importance we give to future rewards.
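As a quick numeric check of how alpha and gamma enter the update, the sketch below applies one step of the standard Q-learning rule: new value = (1 - alpha) × old value + alpha × (reward + gamma × best Q-value in the next state). The helper name `q_update` and the values alpha = 0.1 and gamma = 0.6 are assumptions chosen for illustration.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Blend the old estimate with the newly observed target."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


# An ordinary step from a fresh (all-zero) table nudges the value down.
print(q_update(old_value=0.0, reward=-1, next_max=0.0))  # -0.1
# A successful drop-off nudges it up by alpha times the reward.
print(q_update(old_value=0.0, reward=20, next_max=0.0))  # 2.0
```

With gamma set to 0 the `next_max` term vanishes, so the agent would only ever learn from immediate rewards.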

Below is the algorithm in brief:

  • Step 1: Initialize the Q-table with all zeros (or with Q-values set to arbitrary constants).
  • Step 2: Let the agent react to the environment and explore the actions. For each change in state, select any one among all possible actions for the current state (S).
  • Step 3: Travel to the next state (S’) as a result of that action (a).
  • Step 4: For all possible actions from the state (S’), select the one with the highest Q-value.
  • Step 5: Update the Q-table values using the equation.
  • Step 6: Make the next state the current state.
  • Step 7: If the goal state is reached, end the episode and repeat the process.

Q-Learning in Python
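The steps listed in the previous section can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the Gym-based implementation: the `step` function is a simplified stand-in for the Taxi rules (no interior walls, every drop-off away from the destination is illegal), the epsilon-greedy action choice is one common way to balance exploring and exploiting, and the values of `alpha`, `gamma`, `epsilon`, the episode count and the 200-step cap are illustrative.

```python
import random

LOCS = [(0, 0), (0, 4), (4, 0), (4, 3)]  # R, G, Y, B as (row, col)


def encode(row, col, passenger, destination):
    """Map a state tuple to its row index (0-499) in the Q-table."""
    return ((row * 5 + col) * 5 + passenger) * 4 + destination


def step(state, action):
    """Simplified taxi rules without walls: return (next_state, reward, done)."""
    row, col, passenger, destination = state
    reward, done = -1, False
    if action == 0:    # south
        row = min(row + 1, 4)
    elif action == 1:  # north
        row = max(row - 1, 0)
    elif action == 2:  # east
        col = min(col + 1, 4)
    elif action == 3:  # west
        col = max(col - 1, 0)
    elif action == 4:  # pickup
        if passenger < 4 and (row, col) == LOCS[passenger]:
            passenger = 4  # 4 means "inside the taxi"
        else:
            reward = -10
    else:              # dropoff
        if passenger == 4 and (row, col) == LOCS[destination]:
            passenger, reward, done = destination, 20, True
        else:
            reward = -10
    return (row, col, passenger, destination), reward, done


alpha, gamma, epsilon = 0.1, 0.6, 0.1  # illustrative values, not tuned
rng = random.Random(0)
q_table = [[0.0] * 6 for _ in range(500)]  # Step 1: all zeros
episode_returns = []

for episode in range(3000):
    passenger, destination = rng.sample(range(4), 2)  # two distinct locations
    state = (rng.randrange(5), rng.randrange(5), passenger, destination)
    total, done, steps = 0, False, 0
    while not done and steps < 200:  # cap episode length to bound runtime
        q_row = q_table[encode(*state)]
        if rng.random() < epsilon:
            action = rng.randrange(6)         # Step 2: explore a random action
        else:
            action = q_row.index(max(q_row))  # or exploit the best known one
        next_state, reward, done = step(state, action)  # Step 3
        next_max = max(q_table[encode(*next_state)])    # Step 4
        target = reward + gamma * next_max
        q_row[action] = (1 - alpha) * q_row[action] + alpha * target  # Step 5
        state = next_state                              # Step 6
        total += reward
        steps += 1
    episode_returns.append(total)  # Step 7: episode over, start another

early = sum(episode_returns[:100]) / 100
late = sum(episode_returns[-100:]) / 100
print(f"Average return, first 100 episodes: {early:.1f}")
print(f"Average return, last 100 episodes: {late:.1f}")
```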

Perfect, now all your values will be stored in the variable q_table.

That’s all: your model is trained, and the taxi can now pick up and drop off passengers more accurately. With this, you should understand reinforcement learning and be able to code up a new problem.

More Reinforcement Techniques:

  • MDPs and Bellman Equations
  • Dynamic Programming: Model-Based RL, Policy Iteration and Value Iteration
  • Deep Q Learning
  • Policy Gradient Methods

Code for this article can be found at

Thanks for reading. This article is authored by Vihar Kurama and Samhita Alla.

Stay tuned for more articles, and also check out the cool articles written by Samhita Alla.

References: OpenAI, Playing Atari with Deep Reinforcement Learning, SkyMind, LearnDataSci.

March 21, 2019

© HAKIN9 MEDIA SP. Z O.O. SP. K. 2013