Assignment: Reinforcement Learning and Deep Learning

Contents
- Part 1: Q-learning (Snake)
  - Provided Snake Environment
  - Q-learning Agent
  - Debug Convenience
- Part 2: Deep Learning (MNIST Fashion)
  - Background
  - Neural Network
  - Implementation Details
  - Testing
  - What to Report
  - Warning/Hints
- Deliverables
- Report Checklist

Part 1: Q-learning (Snake)

Snake is a famous video game that originated with the 1976 arcade game Blockade. The player uses up, down, left, and right to control the snake, which grows in length when it eats food, with the snake's body and the walls around the environment being the primary obstacles. In this assignment, you will train AI agents using reinforcement learning to play a simple version of Snake. You will implement a TD version of the Q-learning algorithm.

(Image from Wikipedia)

Provided Snake Environment

(Figure: the provided Snake environment)

In this assignment, the size of the entire game board is 560x560. The green rectangle is the snake agent and the red rectangle is the food. The snake head is marked with a thicker border for easier recognition. Food is generated randomly on the board once the initial food is eaten. Each side of the wall (filled with blue) has a thickness of 40. The snake head, each body segment, and the food all have the same size of 40x40, and the snake moves at a speed of 40 per frame. In this setup, the area the snake agent can move in is 480x480 and can be treated as a 12x12 grid. Every time the snake eats a food, its score increases by 1 and its body grows by one segment.

Before implementing the Q-learning algorithm, we must first define Snake as a Markov Decision Process (MDP). Note that in Q-learning, the state variables do not need to represent the whole board; they only need to carry enough information for the agent to make decisions. (So once you get the environment state, you need to convert it to the state space defined below; a small sketch of this conversion follows the MDP definition.) Also, the smaller the state space, the more quickly the agent will be able to explore it all.

State: a tuple (adjoining_wall_x, adjoining_wall_y, food_dir_x, food_dir_y, adjoining_body_top, adjoining_body_bottom, adjoining_body_left, adjoining_body_right).

[adjoining_wall_x, adjoining_wall_y] indicates whether there is a wall next to the snake head. It has 9 states:
- adjoining_wall_x: 0 (no adjoining wall on the x axis), 1 (wall to the left of the snake head), 2 (wall to the right of the snake head)
- adjoining_wall_y: 0 (no adjoining wall on the y axis), 1 (wall above the snake head), 2 (wall below the snake head)
(Note that [0, 0] is also the case when the snake runs out of the 480x480 board.)

[food_dir_x, food_dir_y] gives the direction of the food relative to the snake head. It has 9 states:
- food_dir_x: 0 (same x coordinate), 1 (food to the left of the snake head), 2 (food to the right of the snake head)
- food_dir_y: 0 (same y coordinate), 1 (food above the snake head), 2 (food below the snake head)

[adjoining_body_top, adjoining_body_bottom, adjoining_body_left, adjoining_body_right] checks whether there is a snake body segment in each square adjoining the snake head. Each of these is a binary indicator:
- adjoining_body_top: 1 (the adjoining top square contains snake body), 0 (otherwise)
- adjoining_body_bottom: 1 (the adjoining bottom square contains snake body), 0 (otherwise)
- adjoining_body_left: 1 (the adjoining left square contains snake body), 0 (otherwise)
- adjoining_body_right: 1 (the adjoining right square contains snake body), 0 (otherwise)

Actions: your agent's actions are chosen from the set {up, down, left, right}.

Rewards:
- +1 when your action results in getting the food (the snake head position is the same as the food position).
- -1 when the snake dies: the snake head hits a wall or one of its body segments, or the head tries to move toward its adjacent body segment (moving backwards).
- -0.1 otherwise (the snake neither dies nor gets food).
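To make this discretization concrete, here is a minimal sketch (not part of the provided code) of how the raw environment state could be mapped to the tuple above. The helper name discretize_state and the exact pixel arithmetic (head coordinates taken as the top-left corner of a 40x40 cell, playable cells starting at 40 and ending at 480) are assumptions for illustration; the provided utils.py defines the actual discretization constants.

    # Hypothetical helper (illustrative only): converts the raw environment
    # state [snake_head_x, snake_head_y, snake_body, food_x, food_y] into the
    # discretized state tuple defined above.
    def discretize_state(env_state):
        snake_head_x, snake_head_y, snake_body, food_x, food_y = env_state

        # Assumed geometry: walls occupy the outer 40 pixels, so the leftmost
        # playable cell starts at x = 40 and the rightmost at x = 480.
        if snake_head_x == 40:
            adjoining_wall_x = 1        # wall immediately to the left
        elif snake_head_x == 480:
            adjoining_wall_x = 2        # wall immediately to the right
        else:
            adjoining_wall_x = 0

        if snake_head_y == 40:
            adjoining_wall_y = 1        # wall immediately above
        elif snake_head_y == 480:
            adjoining_wall_y = 2        # wall immediately below
        else:
            adjoining_wall_y = 0

        # Direction of the food relative to the head (y grows downward).
        food_dir_x = 0 if food_x == snake_head_x else (1 if food_x < snake_head_x else 2)
        food_dir_y = 0 if food_y == snake_head_y else (1 if food_y < snake_head_y else 2)

        # Body segments in the four squares adjoining the head; snake_body is
        # assumed to be a list of [x, y] cell coordinates.
        body = set(map(tuple, snake_body))
        adjoining_body_top = int((snake_head_x, snake_head_y - 40) in body)
        adjoining_body_bottom = int((snake_head_x, snake_head_y + 40) in body)
        adjoining_body_left = int((snake_head_x - 40, snake_head_y) in body)
        adjoining_body_right = int((snake_head_x + 40, snake_head_y) in body)

        return (adjoining_wall_x, adjoining_wall_y, food_dir_x, food_dir_y,
                adjoining_body_top, adjoining_body_bottom,
                adjoining_body_left, adjoining_body_right)

This tuple can then be used directly as an index into a Q-table of shape (3, 3, 3, 3, 2, 2, 2, 2, 4), with the last dimension indexing the four actions.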
Q-learning Agent

(Figure: a trained agent playing)

In this part of the assignment, you will create a snake agent that learns how to get as much food as possible without dying. To do this, you must use Q-learning. Implement the TD Q-learning algorithm and train it on the MDP outlined above:

    Q(s, a) ← Q(s, a) + α (R(s) + γ max_{a′} Q(s′, a′) − Q(s, a))

Also, use the exploration policy mentioned in class, with R+ = 1: during training, an action whose visit count N(s, a) is below the threshold Ne is treated as having value R+, and otherwise its current Q-value is used.

During training, your agent needs to update the Q-table first (this step is skipped when the initial state and action are None), get the next action using the above exploration policy, and then update the N-table with that action. If the game is over, that is, when the dead variable becomes true, you only need to update your Q-table and reset the game. During testing, your agent only needs to return the best action according to the Q-table. (A sketch of one training step appears after the tips below.) Train it for as long as you deem necessary, tracking the average number of points your agent earns. Your average over 1000 test games should be at least 20. For grading purposes, please submit code with the above exploration policy, state configuration, and reward model. We will initialize your agent class with different parameters (Ne, C, gamma), initialize the environment with different initial snake and food positions, and compare the resulting Q-table at the point when the first food is eaten during training (see snake_main.py for Q-table generation details).

Once you have this working, you will need to adjust the learning rate α (how about a fixed learning rate, or a different C value?), the discount factor γ, and the settings you use to trade off exploration vs. exploitation.

In your report, please include the values of α, γ, and any parameters for the exploration settings you used, and discuss how you obtained these values. What changes happen in the game when you adjust any of these variables? How many games does your agent need to simulate before it learns an optimal policy? After your Q-learning seems to have converged to a good policy, run your algorithm on a large number of test games (≥1000) and report the average number of points.

In addition to discussing these things, try adjusting the state configuration defined above. If you think it would be beneficial, you may also change the reward model to provide more informative feedback to the agent. Try to find modifications that allow the agent to learn a better policy than the one you found before. In your report, describe the changes you made and the new number of points the agent was able to earn. What effect did this have on the time it takes to train your agent? Include any other interesting observations.

Tips
- Initially, all Q-value estimates should be 0.
- The learning rate should decay as C/(C + N(s, a)), where N(s, a) is the number of times you have seen the given state-action pair.
- When adjusting state configurations, try to keep the number of states as small as possible to make training easier. If the state space is too large, the snake may get stuck in an infinite loop.
- In a reasonable implementation, you should see your average points increase within seconds.
- You can run python snake_main.py --human to play the game yourself.
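The following is a minimal sketch of one training step under the rules above, assuming the Q-table and N-table are numpy arrays indexed by the eight state variables plus the action, with the learning-rate decay C/(C + N(s, a)) and the optimistic exploration value R+ = 1. The names self.Q, self.N, self.Ne, self.C, self.gamma, update_q, and choose_action are illustrative, not a required interface.

    import numpy as np

    # Illustrative only: TD update for the previous (state, action) pair.
    def update_q(self, s_prev, a_prev, reward, s_curr):
        # Learning rate decays as C / (C + N(s, a)).
        alpha = self.C / (self.C + self.N[s_prev + (a_prev,)])
        best_next = np.max(self.Q[s_curr])
        self.Q[s_prev + (a_prev,)] += alpha * (
            reward + self.gamma * best_next - self.Q[s_prev + (a_prev,)])

    # Illustrative only: exploration-function action selection.
    def choose_action(self, s_curr):
        # Under-explored actions are treated as worth R+ = 1; ties are broken
        # with priority right > left > down > up (action indices 3 > 2 > 1 > 0).
        best_action, best_value = None, -np.inf
        for action in (3, 2, 1, 0):
            if self.N[s_curr + (action,)] < self.Ne:
                value = 1.0                        # optimistic value R+
            else:
                value = self.Q[s_curr + (action,)]
            if value > best_value:                 # strict '>' preserves the priority order
                best_action, best_value = action, value
        self.N[s_curr + (best_action,)] += 1       # update the visit count
        return best_action

When the game ends (dead is True), only the Q-update would be applied with the death reward before resetting the stored previous state and action.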
Debug Convenience

For debugging convenience, we provide three example Q-tables. Each Q-table is generated exactly after the snake eats its first food during training, i.e., the first time the snake reaches exactly 1 point; see how the Q-table is generated and saved during training in snake_main.py. For example, you can run diff checkpoint.npy checkpoint1.npy to see whether there is a difference. The only difference among these three debug examples is the setting of the parameters (initial position of the snake head and food, Ne, C, and gamma).

Notice that for passing the autograder, if the scores of actions from the exploration function are equal, the priority should be right > left > down > up.

- [Debug Example 1] snake_head_x=200, snake_head_y=200, food_x=80, food_y=80, Ne=40, C=40, gamma=0.7: checkpoint1.npy
- [Debug Example 2] snake_head_x=200, snake_head_y=200, food_x=80, food_y=80, Ne=20, C=60, gamma=0.5: checkpoint2.npy
- [Debug Example 3] snake_head_x=80, snake_head_y=80, food_x=200, food_y=200, Ne=40, C=40, gamma=0.7: checkpoint3.npy

Note that for one part of the autograder, we will run your training process with different parameter settings and compare the Q-table generated exactly when the snake first reaches 1 point in training against ours. Making sure you can pass these debug examples will help you a lot in passing this part of the autograder.

In addition, for the other part of the autograder, we will test your performance using your q_agent.npy and agent.py. An average of over 20 points on 1000 test games should obtain full credit for this part.

Part 2: Deep Learning (MNIST Fashion)

Created by Austin Bae (modified from Ryley Higa)

Background

By now you should be familiar with the MNIST dataset from MP3. In MP3, we trained a model on this dataset using linear classifiers such as Naive Bayes and Perceptron. This time around, we will implement a fully connected neural network from scratch on the same dataset we used in MP3.

Your task is to build a 4-layer neural network with 256 hidden nodes per layer, except the last layer, which should have 10 nodes (the number of classes). You are going to use a method called minibatch gradient descent, which runs for a given number of iterations (epochs) and does the following per epoch (a rough sketch follows this list):
1. Shuffle the training data.
2. Split the data into batches (use a batch size of 200).
3. For each batch (subset of data): feed the batch into the 4-layer neural network, compute the loss, and update the weights.
4. Observe the total loss and go to the next iteration.
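As a rough illustration of this per-epoch loop (not the required minibatch_gd signature described later), the structure might look like the sketch below; step_fn is a caller-supplied stand-in for one pass over a batch (forward through the 4-layer network, cross-entropy loss, backward, weight update).

    import numpy as np

    # Illustrative epoch loop for minibatch gradient descent.  step_fn is a
    # hypothetical callable that trains on one batch and returns its loss.
    def run_epochs(x_train, y_train, epochs, step_fn, batch_size=200):
        n = len(x_train)
        losses = []
        for _ in range(epochs):
            # 1. Shuffle the training data.
            order = np.random.permutation(n)
            x_shuf, y_shuf = x_train[order], y_train[order]

            # 2-3. Split into batches of 200 and train on each one.
            total_loss = 0.0
            for start in range(0, n, batch_size):
                x_batch = x_shuf[start:start + batch_size]
                y_batch = y_shuf[start:start + batch_size]
                total_loss += step_fn(x_batch, y_batch)

            # 4. Record the total loss for this epoch and continue.
            losses.append(total_loss)
        return losses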
The Neural Network

(Figures: neural network architecture; a one-layer neural network)

For this assignment, you will build a 4-layer, fully connected neural network. You can think of a fully connected neural network as a series of interconnected layers, each of which is basically a group of nodes (see the architecture diagram). You can think of each layer as the perceptron model you implemented in MP3, except that instead of one perceptron you have multiple perceptrons whose outputs feed into the next layer as inputs.

(Figure: a single neural network layer)

This is what each layer in a neural network looks like. Note that there are two directions: forward and backward propagation. Forward propagation uses both the inputs (Ain) and the weights (W, b) specific to that layer to compute the output (Aout) passed to the next layer. Backward propagation computes the gradient (derivative) of the loss function with respect to the weights at each layer.

Inside each propagation direction there are two functions. In general, the affine transformation computes the affine output (Z) from the input and the weights (a matrix multiply plus bias). A nonlinear activation function then transforms that affine output. There are numerous activation functions, but for this assignment we will use ReLU (Rectified Linear Units), which is simply R(x) = max(0, x). The backward functions compute the gradients of the inputs with respect to the loss.

Then, at the last layer, the output is the classification computed by your neural network. You then have to calculate the loss, which represents how well your network classified the data. The model's job is to minimize this loss: the better it does, the lower the loss. Many different loss functions exist, but we will use cross-entropy for this assignment.

Implementation Details

You are to implement these 8 functions inside neural_network.py. You will only have to edit and submit this file. Do not import any non-standard Python libraries other than numpy, or you will get a 0 on the autograder. (A rough numpy sketch of the first five helper functions appears after their descriptions.)

affine_forward(A, W, b)
- Inputs: A (data with size n,d), W (weights with size d,d), b (bias with size d)
- Outputs: Z (affine output with size n,d), cache (tuple of the original inputs)

affine_backward(dZ, cache)
- Inputs: dZ (gradient of Z), cache (from the forward operation)
- Outputs: dA, dW, db (gradients with respect to the loss)

relu_forward(Z)
- Inputs: Z (affine output with size n,d)
- R(Z) = max(0, Z): sets all negative values in the matrix to 0
- Outputs: A (ReLU output with size n,d), cache object (Z)

relu_backward(dA, cache)
- Inputs: dA (gradient of A), cache
- dZ_ij = 0 if Z_ij ≤ 0, otherwise dZ_ij = dA_ij. Basically, if an entry of Z was zeroed out by the ReLU, then the corresponding entry of dZ should also be zeroed out, since it should not contribute to the gradient at that location.
- Outputs: dZ (gradient of Z)

cross_entropy(F, y)
- Inputs: F (logits with size n, num_classes), y (actual class labels of the data, with size n)
- F_ik is the score for classifying row i as class k, and F_iyi is the score for classifying row i as its actual class y_i. So if the actual label for row i is 7, then F_iyi = F_i7. The 1{j = y_i} term in the gradient calculation is an indicator function that equals 1 when the condition holds and 0 otherwise.
- Outputs: loss, dF (gradient of the logits)
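Below is a rough numpy sketch of what these five helpers could look like given the shapes described above. It follows the standard affine/ReLU/softmax-cross-entropy math rather than any provided reference solution, so verify it against nn_test.py before relying on it.

    import numpy as np

    def affine_forward(A, W, b):
        Z = A.dot(W) + b              # (n, d) times (d, d') plus bias
        cache = (A, W, b)
        return Z, cache

    def affine_backward(dZ, cache):
        A, W, b = cache
        dA = dZ.dot(W.T)              # gradient w.r.t. the layer input
        dW = A.T.dot(dZ)              # gradient w.r.t. the weights
        db = dZ.sum(axis=0)           # gradient w.r.t. the bias
        return dA, dW, db

    def relu_forward(Z):
        A = np.maximum(0, Z)          # zero out all negative entries
        return A, Z

    def relu_backward(dA, cache):
        Z = cache
        return dA * (Z > 0)           # pass gradients only where Z was positive

    def cross_entropy(F, y):
        n = F.shape[0]
        y = y.astype(int)
        # Softmax probabilities, shifted for numerical stability.
        exp_F = np.exp(F - F.max(axis=1, keepdims=True))
        probs = exp_F / exp_F.sum(axis=1, keepdims=True)
        # Average negative log-likelihood of the true classes.
        loss = -np.mean(np.log(probs[np.arange(n), y]))
        # Gradient of the loss w.r.t. the logits: (softmax - 1{j = y_i}) / n.
        dF = probs
        dF[np.arange(n), y] -= 1
        dF /= n
        return loss, dF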
four_nn(): 4-layer neural network function
- This function's inputs and outputs are up to you, and it won't be autograded.
- The network must have 4 layers, with (256, 256, 256, num_classes) nodes per layer.
- In this function, you should use all of the helper functions above.
- It should be called inside both minibatch_gd() and test_nn().
- (The handout includes pseudocode for a 3-layer neural network as a reference figure.)
- You should use a learning rate (eta) of 0.1.

minibatch_gd(epoch, w1, w2, w3, w4, b1, b2, b3, b4, x_train, y_train, num_classes, shuffle=True)
- This function implements minibatch gradient descent (model training). Use a batch size of 200; you can assume that len(x_train) will always be divisible by 200.
- Inputs:
  - epoch: number of iterations
  - w1, w2, w3, w4: the weights corresponding to each of the layers
  - b1, b2, b3, b4: the biases corresponding to each of the layers
  - x_train, y_train: numpy arrays of the features and labels
  - num_classes: number of classes. This should be 10 for our dataset; we pass it as a parameter because your model should be able to run on datasets of any size.
  - shuffle: boolean flag indicating whether you should shuffle your data at each epoch. By default this is True. It will be set to False during testing/autograding, so your code should shuffle only when this flag is True.
- (The handout includes reference pseudocode for this function.)
- Outputs:
  - all the modified weights and biases (w1-w4, b1-b4)
  - losses: a list of the total loss at each epoch. Note that this will be a list with length = epoch.

test_nn(w1, w2, w3, w4, b1, b2, b3, b4, x_test, y_test, num_classes)
- This function evaluates how well your trained model classifies test data.
- Inputs:
  - w1-w4, b1-b4: trained weights/biases at each layer
  - x_test, y_test: numpy arrays of the test features and labels
  - num_classes: number of classes (10)
- Outputs: average classification rate, and average classification rate per class (a list with size = num_classes)

Testing

We have provided a unit-testing script, nn_test.py, that checks whether your individual functions produce the right output. Passing these unit tests is a good but incomplete measure of whether your functions are correct. The script also checks whether your minibatch gradient descent works on a smaller dataset, so confirm that you pass these tests before running nn_main.py on the actual MNIST data. Feel free to modify this file as you need; it won't be submitted or graded.

What to Report

Run your minibatch gradient descent for 10, 30, and 50 epochs. Start with freshly generated weights/biases for each run. For each run, report:
1. Confusion matrix
2. Average classification rate
3. Runtime of your minibatch gradient descent function
At the end of the 50-epoch run, include a graph that plots epochs vs. losses. Also report any interesting observations, and possible explanations for them.

Extra Credit Opportunity

A Convolutional Neural Network (CNN) is a different type of neural network that is very popular, especially in image processing. Implement a CNN on the same MNIST fashion dataset using any deep-learning framework of your choice (TensorFlow, PyTorch, Keras, etc.). You may reference online materials, but make sure to cite them as necessary! Write a paragraph on how CNNs work, what you implemented, and the results of your CNN. Compare it to the performance of your fully connected 4-layer network. Report the confusion matrix, average classification rate, and average classification rate per class on the test data once your model converges. The extra credit is capped at 10% of the entire MP grade. Submit your file as nn_extracredit.ext, where ext is the extension of the file. You do not have to submit any runtime information (modules/metadata), but make sure to describe your algorithm in the report.

Warning/Hints

One disadvantage of neural networks is that they can take a long time to compute, so correctness of your algorithms alone may not be sufficient to complete this assignment successfully. Use numpy functions (matrix multiply, broadcasting) as much as possible instead of native Python loops, as this will significantly decrease your runtime (see the example below).

With decently optimized code, we were able to get under 10 seconds per epoch on a 2015 MacBook Pro, and roughly under 80 seconds on EWS. We highly recommend running the code on a machine faster than EWS, but EWS should give you an upper bound on computation time. We will cut off computation for autograding at around 120 seconds per epoch on EWS. Also, running your model for 50 epochs will take somewhere between 10 minutes and 1 hour; if you leave this to the last minute you may not be able to run your computations in time. There will be no excuses for missing the deadline.

We will also grade your code on a different dataset with a different number of rows (still divisible by 200), features, and target classes. So it is very important that you do not hardcode any numbers into your algorithms; your code should support data of any size. The provided hyperparameters (number of layers, number of nodes per layer, learning rate, batch size) should remain the same.
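As a concrete illustration of the hint about preferring numpy over Python loops, the two functions below compute the same per-class classification rates (the kind of quantity test_nn reports); the second replaces the per-example loop with boolean masks. The function and variable names are illustrative only.

    import numpy as np

    # Slow: a native Python loop over every test example.
    def per_class_rate_loop(predictions, y_test, num_classes):
        correct = [0] * num_classes
        total = [0] * num_classes
        for pred, label in zip(predictions, y_test):
            total[int(label)] += 1
            if pred == label:
                correct[int(label)] += 1
        return [c / t if t else 0.0 for c, t in zip(correct, total)]

    # Faster: vectorized comparisons with boolean masks (one short loop over
    # the classes instead of one over every example).
    def per_class_rate_vectorized(predictions, y_test, num_classes):
        rates = np.zeros(num_classes)
        for k in range(num_classes):
            mask = (y_test == k)
            if mask.any():
                rates[k] = np.mean(predictions[mask] == k)
        return rates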
Provided Code Skeleton

We have provided skeleton.zip with the contents described below. For Part 1, do not import any non-standard libraries except pygame (version 1.9.4) and numpy. For Part 2, do not import any non-standard library except numpy. Failure to follow this will result in a 0 in the autograder.

Part 1
- snake.py: defines the snake environment and creates the GUI for the game.
- utils.py: defines some of the discretization constants described above and contains the functions to save and load models.
- agent.py: the file where you will do all of your work. It contains the Agent class, the agent you will implement to act in the snake environment. The instance variables and functions of the Agent class are:
  - self._train: a boolean flag you should use to determine whether the agent is in train or test mode. In train mode, the agent should explore (based on the exploration function) and exploit based on the Q-table. In test mode, the agent should purely exploit and always take the best action.
  - train(): sets self._train to True. Called before the training loop is run in snake_main.py.
  - test(): sets self._train to False. Called before the testing loop is run in snake_main.py.
  - save_model(): saves the self.Q table. Called after the training loop in snake_main.py.
  - load_model(): loads the self.Q table. Called before the testing loop in snake_main.py.
  - act(state, points, dead): the main function you will implement; it is called repeatedly by snake_main.py while games are being run. state is the state from the snake environment and is a list of [snake_head_x, snake_head_y, snake_body, food_x, food_y] (notice that in act you first need to discretize this into the state configuration defined above). points is the number of food the snake has eaten, and dead is a boolean indicating whether the snake is dead; points and dead should be used to define your reward. act should return a number from the set {0, 1, 2, 3}: returning 0 moves the snake up, 1 moves it down, 2 moves it left, and 3 moves it right. If self._train is True, this function should update the Q-table and return an action (again, if the scores of actions from the exploration function are equal, the priority should be right > left > down > up). If self._train is False, the agent should simply return the best action based on the Q-table. (A sketch of an Agent skeleton follows this list.)
- snake_main.py: the main file that starts the program. It runs the snake game with your implemented agent acting in it. The code runs a number of training games, then a number of testing games, and then displays example games at the end. Do not modify the provided code; you will only have to modify agent.py.
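To tie the pieces together, here is an illustrative skeleton of how such an Agent class could be organized, reusing the discretize_state, Q-update, and exploration sketches from Part 1 above. The constructor signature, the Q-table shape, and the test-mode tie-breaking expression are assumptions, not the provided interface.

    import numpy as np

    # Illustrative skeleton only; the real skeleton code defines the actual
    # constructor and helper signatures.
    class Agent:
        def __init__(self, actions, Ne, C, gamma):
            self.actions = actions               # 0: up, 1: down, 2: left, 3: right
            self.Ne, self.C, self.gamma = Ne, C, gamma
            self._train = True
            # One Q entry and one visit count per (state, action) pair.
            self.Q = np.zeros((3, 3, 3, 3, 2, 2, 2, 2, 4))
            self.N = np.zeros((3, 3, 3, 3, 2, 2, 2, 2, 4))
            self.prev_state = None
            self.prev_action = None

        def train(self):
            self._train = True

        def test(self):
            self._train = False

        def save_model(self, model_path):
            np.save(model_path, self.Q)

        def load_model(self, model_path):
            self.Q = np.load(model_path)

        def act(self, state, points, dead):
            s = discretize_state(state)          # see the earlier sketch
            if self._train:
                # Update Q for the previous (state, action) pair (skipped on
                # the first step), handle the dead case by resetting and
                # returning, otherwise choose an action via the exploration
                # function and update the N-table; omitted here.
                ...
            # Test mode: pure exploitation with right > left > down > up
            # tie-breaking (larger action index wins on equal Q-values).
            return max((0, 1, 2, 3), key=lambda a: (self.Q[s + (a,)], a))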
Part 2
- neural_network.py: the only file you will have to modify. Descriptions of its functions are given above.
- nn_test.py: unit-test module for you to use. Simply run it with python nn_test.py.
- nn_main.py: main function that will run your code written in neural_network.py.
- data: folder containing the train and test numpy files, exactly the same as in MP3.
- tests: you don't have to touch this, since all the text parsing has been done for you in nn_test.py.

You can modify the test/main functions as long as your inputs/outputs for all functions inside neural_network.py are consistent with the instructions. Note that only neural_network.py will be submitted and graded.

Deliverables

Please submit only the following files:
- agent.py, with the same exploration policy, state configuration, and reward model mentioned above.
- q_agent.npy, the best numpy array trained by you with the same state configuration mentioned above. (It can be saved by passing --model_name q_agent.npy to snake_main.py.) Note that this model should work without modifying any code files other than agent.py.
- neural_network.py, with all the functions for Part 2.
- report.pdf

Report Checklist

Part 1
1. Briefly describe the implementation of your snake agent. How does the agent act during the train phase? How does it act during the test phase?
2. Use the Ne, C (or a fixed alpha?), and gamma that you believe to be the best. After training has converged, run your algorithm on 1000 test games and report the average points. Give the values of Ne and C (or the fixed alpha) you believe to be the best, report the training convergence time, and report the average points over 1000 test games.
3. Describe the changes you made to your MDP (state configuration, exploration policy, and reward model); at a minimum, change the state configuration. Report the performance (the average points over 1000 test games). Note that training on your modified state space should give you at least 10 points on average over 1000 test games. Explain why these changes are reasonable, observe how the snake acts after the changes, and analyze the positive and negative effects they have. Note again: make sure the agent.py and q_agent.npy you submit are without these changes; your modified MDP should not be submitted.

Part 2
1. Briefly describe any optimizations you performed on your functions for faster runtime.
2. Report the confusion matrix, average classification rate, and runtime of minibatch gradient descent for 10, 30, and 50 epochs. This means you should have 3x3 = 9 total items in this section.
3. Add a graph that plots epochs vs. the loss at each epoch (for the 50-epoch run).
4. Describe any trends you see. Are they expected or surprising? What do you think explains them?
5. Report the extra-credit section, if any.