Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

from __future__ import print_function import os, sys, time, datetime, json, random import numpy as np from keras.models import Sequential from keras.layers.core import Dense, Activation from

from __future__ import print_function    import os, sys, time, datetime, json, random    import numpy as np    from keras.models import Sequential    from keras.layers.core import Dense, Activation    from keras.optimizers import SGD , Adam, RMSprop    from keras.layers.advanced_activations import PReLU    import matplotlib.pyplot as plt    from TreasureMaze import TreasureMaze    from GameExperience import GameExperience    %matplotlib inline    ---------------------------------------------------------------------------    ModuleNotFoundError                       Traceback (most recent call last)     in           7 from keras.layers.advanced_activations import PReLU          8 import matplotlib.pyplot as plt    ----> 9 from TreasureMaze import TreasureMaze         10 from GameExperience import GameExperience         11 get_ipython().run_line_magic('matplotlib', 'inline')        ModuleNotFoundError: No module named 'TreasureMaze'    The following code block contains an 8x8 matrix that will be used as a maze object:        maze = np.array([        [ 1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.],        [ 1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.],        [ 1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.],        [ 1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.],        [ 1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.],        [ 1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.],        [ 1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.],        [ 1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.]    ])    This helper function allows a visual representation of the maze object:        def show(qmaze):        plt.grid('on')        nrows, ncols = qmaze.maze.shape        ax = plt.gca()        ax.set_xticks(np.arange(0.5, nrows, 1))        ax.set_yticks(np.arange(0.5, ncols, 1))        ax.set_xticklabels([])        ax.set_yticklabels([])        canvas = np.copy(qmaze.maze)        for row,col in qmaze.visited:            canvas[row,col] = 0.6        pirate_row, pirate_col, _ = qmaze.state        canvas[pirate_row, pirate_col] = 0.3   # pirate cell        canvas[nrows-1, ncols-1] = 0.9 # treasure cell        img = plt.imshow(canvas, interpolation='none', cmap='gray')        return img    The pirate agent can move in four directions: left, right, up, and down.        While the agent primarily learns by experience through exploitation, often, the agent can choose to explore the environment to find previously undiscovered paths. This is called "exploration" and is defined by epsilon. This value is typically a lower value such as 0.1, which means for every ten attempts, the agent will attempt to learn by experience nine times and will randomly explore a new path one time. You are encouraged to try various values for the exploration factor and see how the algorithm performs.        LEFT = 0    UP = 1    RIGHT = 2    DOWN = 3            # Exploration factor    epsilon = 0.1        # Actions dictionary    actions_dict = {        LEFT: 'left',        UP: 'up',        RIGHT: 'right',        DOWN: 'down',    }        num_actions = len(actions_dict)    The sample code block and output below show creating a maze object and performing one action (DOWN), which returns the reward. The resulting updated environment is visualized.        qmaze = TreasureMaze(maze)    canvas, reward, game_over = qmaze.act(DOWN)    print("reward=", reward)    show(qmaze)    This function simulates a full game based on the provided trained model. The other parameters include the TreasureMaze object and the starting position of the pirate.        def play_game(model, qmaze, pirate_cell):        qmaze.reset(pirate_cell)        envstate = qmaze.observe()        while True:            prev_envstate = envstate            # get next action            q = model.predict(prev_envstate)            action = np.argmax(q[0])                # apply action, get rewards and new state            envstate, reward, game_status = qmaze.act(action)            if game_status == 'win':                return True            elif game_status == 'lose':                return False    This function helps you to determine whether the pirate can win any game at all. If your maze is not well designed, the pirate may not win any game at all. In this case, your training would not yield any result. The provided maze in this notebook ensures that there is a path to win and you can run this method to check.        def completion_check(model, qmaze):        for cell in qmaze.free_cells:            if not qmaze.valid_actions(cell):                return False            if not play_game(model, qmaze, cell):                return False        return True    The code you have been given in this block will build the neural network model. Review the code and note the number of layers, as well as the activation, optimizer, and loss functions that are used to train the model.        def build_model(maze):        model = Sequential()        model.add(Dense(maze.size, input_shape=(maze.size,)))        model.add(PReLU())        model.add(Dense(maze.size))        model.add(PReLU())        model.add(Dense(num_actions))        model.compile(optimizer='adam', loss='mse')        return model    #TODO: Complete the Q-Training Algorithm Code Block    This is your deep Q-learning implementation. The goal of your deep Q-learning implementation is to find the best possible navigation sequence that results in reaching the treasure cell while maximizing the reward. In your implementation, you need to determine the optimal number of epochs to achieve a 100% win rate.        You will need to complete the section starting with #pseudocode. The pseudocode has been included for you.        def qtrain(model, maze, **opt):            # exploration factor        global epsilon             # number of epochs        n_epoch = opt.get('n_epoch', 15000)            # maximum memory to store episodes        max_memory = opt.get('max_memory', 1000)            # maximum data size for training        data_size = opt.get('data_size', 50)            # start time        start_time = datetime.datetime.now()            # Construct environment/game from numpy array: maze (see above)        qmaze = TreasureMaze(maze)            # Initialize experience replay object        experience = GameExperience(model, max_memory=max_memory)                win_history = []   # history of win/lose game        hsize = qmaze.maze.size//2   # history window size        win_rate = 0.0                # pseudocode:        # For each epoch for i in range(n_epoch):            print('\n\n\n')            print('This is the start of the next epoch...')            print('\n')                    agent_cell = np.random.randint(0, high=7 size=2)        #Reset the random statring postion             qmaze.reset([0,0])            envstate = qmaze.observe()            loss = 0            n_episodes = 0                         # while game is not over            while qmaze.game_status() == 'not_over':game with agent at the random starting postion                previous_envstate = envstate                valid_actions = qmaze.valid_actions()                                # Get next action                if np.random.rand() < epsilon:                    action = random.choice(valid_actions)                else:                    action = np.argmax(experience.predict(envstate))                                # Take the action                envstate, reward, game_status = qmaze.act(action)                # Increase episode count because taking an action makes an episode                n_episodes += 1                # Remember the episode                episode = [previous_envstate, action, reward, envstate, game_status]                experience.remember(episode)                                                    inputs,targets = experience.get_data()                            history = model.fit(inputs, targets, epochs=8, batch_size=24, verbose=0)                loss = model.evaluate(inputs, targets)                                # if pirate agent won game                if episode [4] == 'win':                    win_history.append(1)                    win_rate = sum(win_history) / len(win_history)                    break                # if pirate agent lost the game                elif episode[4] == 'lose':                    win_history.append(0)                    win_rate = sum(win_history) / len(win_history)                    break                # else the game is not over                        if win_rate > epsilon:                print('win rate is larger than epsilon!!!!')                if completion_check(model, qmaze) == True:                    print('completion_check() passes!!')            #Print the epoch, loss, episodes, win count, and win rate for each epoch            dt = datetime.datetime.now() - start_time            t = format_time(dt.total_seconds())            template = "Epoch: {:03d}/{:d} | Loss: {:.4f} | Episodes: {:d} | Win count: {:d} | Win rate: {:.3f} | time: {}"            print(template.format(epoch, n_epoch-1, loss, n_episodes, sum(win_history), win_rate, t))            # We simply check if training has exhausted all free cells and if in all            # cases the agent won.            if win_rate > 0.9 : epsilon = 0.05            if sum(win_history[-hsize:]) == hsize and completion_check(model, qmaze):                print("Reached 100%% win rate at epoch: %d" % (epoch,))                break                        # Determine the total time for training        dt = datetime.datetime.now() - start_time        seconds = dt.total_seconds()        t = format_time(seconds)            print("n_epoch: %d, max_mem: %d, data: %d, time: %s" % (epoch, max_memory, data_size, t))        return seconds        # This is a small utility for printing readable time strings:    def format_time(seconds):        if seconds < 400:            s = float(seconds)            return "%.1f seconds" % (s,)        elif seconds < 4000:            m = seconds / 60.0            return "%.2f minutes" % (m,)        else:            h = seconds / 3600.0            return "%.2f hours" % (h,)      File "", line 30        print('\n\n\n')        ^    IndentationError: unexpected indent    Test Your Model    Now we will start testing the deep Q-learning implementation. To begin, select Cell, then Run All from the menu bar. This will run your notebook. As it runs, you should see output begin to appear beneath the next few cells. The code below creates an instance of TreasureMaze.        qmaze = TreasureMaze(maze)    show(qmaze)    In the next code block, you will build your model and train it using deep Q-learning. Note: This step takes several minutes to fully run.        model = build_model(maze)    qtrain(model, maze, epochs=1000, max_memory=8*maze.size, data_size=32)    This cell will check to see if the model passes the completion check. Note: This could take several minutes.        completion_check(model, qmaze)    show(qmaze)    This cell will test your model for one game. It will start the pirate at the top-left corner and run play_game. The agent should find a path from the starting position to the target (treasure). The treasure is located in the bottom-right corner.        pirate_start = (0, 0)    play_game(model, qmaze, pirate_start)    show(qmaze)
image

Instructions for my part

Q-Training Algorithm   You have been given a skeleton implementation in the TreasureHuntGame Jupyter Notebook. Your task is to implement deep-Q learning. The goal of your deep Q-learning implementation is to find the best possible navigation sequence that results in reaching the treasure cell while maximizing the reward. In your implementation, you need to determine the optimal number of epochs to achieve a 100% win rate.

 

You will need to complete the section starting with #pseudocode. The pseudocode has been included for you.

def qtrain(model, maze, **opt):

    # exploration factor
    global epsilon

    # number of epochs
    n_epoch = opt.get('n_epoch', 15000)

    # maximum memory to store episodes
    max_memory = opt.get('max_memory', 1000)

    # maximum data size for training
    data_size = opt.get('data_size', 50)

    # start time
    start_time = datetime.datetime.now()

    # Construct environment/game from numpy array: maze (see above)
    qmaze = TreasureMaze(maze)

    # Initialize experience replay object
    experience = GameExperience(model, max_memory=max_memory)
   
    win_history = []   # history of win/lose game
    hsize = qmaze.maze.size//2   # history window size
    win_rate = 0.0
   
    # pseudocode:
    # For each epoch for i in range(n_epoch):
        print('\n\n\n')
        print('This is the start of the next epoch...')
        print('\n')
       
    agent_cell = np.random.randint(0, high=7 size=2)
    #Reset the random statring postion
        qmaze.reset([0,0])
        envstate = qmaze.observe()
        loss = 0
        n_episodes = 0
       
         # while game is not over
        while qmaze.game_status() == 'not_over':game with agent at the random starting postion
            previous_envstate = envstate
            valid_actions = qmaze.valid_actions()
           
            # Get next action
            if np.random.rand() < epsilon:
                action = random.choice(valid_actions)
            else:
                action = np.argmax(experience.predict(envstate))
           
            # Take the action
            envstate, reward, game_status = qmaze.act(action)
            # Increase episode count because taking an action makes an episode
            n_episodes += 1
            # Remember the episode
            episode = [previous_envstate, action, reward, envstate, game_status]
            experience.remember(episode)                
               
            inputs,targets = experience.get_data()            
            history = model.fit(inputs, targets, epochs=8, batch_size=24, verbose=0)
            loss = model.evaluate(inputs, targets)
           
            # if pirate agent won game
            if episode [4] == 'win':
                win_history.append(1)
                win_rate = sum(win_history) / len(win_history)
                break
            # if pirate agent lost the game
            elif episode[4] == 'lose':
                win_history.append(0)
                win_rate = sum(win_history) / len(win_history)
                break
            # else the game is not over
       
        if win_rate > epsilon:
            print('win rate is larger than epsilon!!!!')
            if completion_check(model, qmaze) == True:
                print('completion_check() passes!!')

    #Print the epoch, loss, episodes, win count, and win rate for each epoch
        dt = datetime.datetime.now() - start_time
        t = format_time(dt.total_seconds())
        template = "Epoch: {:03d}/{:d} | Loss: {:.4f} | Episodes: {:d} | Win count: {:d} | Win rate: {:.3f} | time: {}"
        print(template.format(epoch, n_epoch-1, loss, n_episodes, sum(win_history), win_rate, t))
        # We simply check if training has exhausted all free cells and if in all
        # cases the agent won.
        if win_rate > 0.9 : epsilon = 0.05
        if sum(win_history[-hsize:]) == hsize and completion_check(model, qmaze):
            print("Reached 100%% win rate at epoch: %d" % (epoch,))
            break
   
   
    # Determine the total time for training
    dt = datetime.datetime.now() - start_time
    seconds = dt.total_seconds()
    t = format_time(seconds)

    print("n_epoch: %d, max_mem: %d, data: %d, time: %s" % (epoch, max_memory, data_size, t))
    return seconds

# This is a small utility for printing readable time strings:
def format_time(seconds):
    if seconds < 400:
        s = float(seconds)
        return "%.1f seconds" % (s,)
    elif seconds < 4000:
        m = seconds / 60.0
        return "%.2f minutes" % (m,)
    else:
        h = seconds / 3600.0
        return "%.2f hours" % (h,)

Agent Specifications You will use the Python programming language for this project, as well as the TensorFlow and Keras libraries. These have been pre-installed in the Virtual Lab (Apporto). The environment for your agent has already been designed as a maze (8x8 matrix), containing free (1), occupied (0), and target (1 at the bottom right) cells, as below: [1., 0., 1., 1., 1., 1., 1., 1.], [1., 0., 1., 1., 1., O., 1., 1.], [1., 1., 1., 1., O., 1., O., 1.], [1., 1., 1., 0., 1., 1., 1., 1.], [1., 1., 0., 1., 1., 1., 1., 1.], [1., 1., 1., 0., 1., O., O., 0.], [1., 1., 1., 0., 1., 1., 1., 1.], [1., 1., 1., 1., O., 1., 1., 1.] Your agent (pirate) should start at the top left. The agent can move in four directions: left, right, up, and down. The agent rewards vary from -1 point to 1 point. When the agent reaches the target, the reward will be 1 point. Moving to an occupied cell will result in a penalty of -0.75 points. Attempting to move outside the matrix boundary will result in a penalty of -0.8 points. Moving from a cell to an adjacent cell will result in a penalty of -0.04 points, primarily to avoid the agent wandering within the maze. A negative threshold has been defined for you in order to reduce training time, avoid infinite loops, and avoid unnecessary wandering.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Fundamentals Of Management

Authors: Ricky Griffin

10th Edition

0357517342, 978-0357517345

More Books

Students also viewed these Programming questions