COMP202 Assignment 4
Due: Dec 4 at 23:59
This is an individual assignment.

Flu Pandemic!

It’s the near-future, and Montreal is having a flu pandemic (oh no!). It’s not going well and somehow now you’re the one in charge of getting data on the pandemic to the epidemiologists.

You’ve been given one large file of raw data about the early days of the pandemic. But the data was recorded by dozens of different people and systems. In the chaos of the pandemic, data was not recorded in consistent ways. Some lines are recorded in French. Others in English. Some lines separate the information with tabs, others with commas.

“Data cleaning” refers to the process of taking raw data and processing it into a state that can be used for empirical analysis. Your task in this assignment is to clean the raw data file you’ve been given.

Instructions

It is very important that you follow the directions as closely as possible. The directions, while perhaps tedious, are designed to make it as easy as possible for the TAs to mark the assignments by letting them run your assignment, in some cases through automated tests. While these tests will never be used to determine your entire grade, they speed up the process significantly, which allows the TAs to provide better feedback and not waste time on administrative details. Plus, if the TA is in a good mood while he or she is grading, then that increases the chance of them giving out partial marks. :)

Up to 30% can be removed for bad indentation of your code as well as omitting comments, or poor coding structure.

To get full marks, you must:
• Follow all directions below
  – In particular, make sure that all function and variable names are spelled exactly as described in this document.
    Else a 50% penalty will be applied.
• Make sure that your code runs.
  – Code with errors will receive a very low mark.
• Write your name and student ID as a comment in all .py files you hand in
• Name your variables and helper functions appropriately
  – The purpose of each variable should be obvious from the name
• Comment your work
  – A comment every line is not needed, but there should be enough comments to fully understand your program

Errata and Frequently Asked Questions

On MyCourses we have a discussion forum titled Assignment 4. A thread will be pinned in the forum with any errata and frequently asked questions. If you are stuck on the assignment, start by checking that thread.

We strongly encourage starting the assignment early. For example, office hours the week before the deadline are not very crowded, but office hours close to the deadline will be.

What To Submit

Please put all your files in a folder called Assignment4. Zip the folder (DO NOT RAR it) and submit it in MyCourses. If you do not know how to zip files, please ask any search engine or friends. Google will be your best friend with this, and a lot of different little problems as well.

Inside your zipped folder, there must be the following files. Do not submit any other files. Any deviation from these requirements may lead to lost marks.

1. initial_clean.py
2. time_series.py
3. time_series.png
4. construct_patients.py
5. fatality_by_age.png
6. README.txt
   In this file, you can tell the TA about any issues you ran into doing this assignment. If you point out an error that you know occurs in your program, it may lead the TA to give you more partial credit.
   This file is also where you should make note of anybody you talked to about the assignment. Remember this is an individual assignment, but you can talk to other students using the Gilligan’s Island Rule: you can’t take any notes/writing/code out of the discussion, and afterwards you must do something inane like watch television for at least 30 minutes.
   If you didn’t talk to anybody and have nothing you want to tell the TA, just say “nothing to report” in the file.

Style

There are 70 marks for completing functions in this assignment.
There are 30 marks for the style of your code.

Some tips:
• Call helper functions rather than copy and paste (or reinvent) code
• Create helper functions where appropriate
• Use descriptive variable names
• Lines of code should NOT require the TA to scroll horizontally to read the whole thing
• Add blank lines between “chunks” of code to improve readability

Your Data File

Every student in the class has a unique file to clean. To download yours, go to https://www.cs.mcgill.ca/~patitsas/comp202/files/YOURSTUDENTNUMBER.txt.

Important: at this url, save the file. To do so, right click and use “save page as”. Do not copy/paste the file contents because your browser could have messed up the accents in the file.

Suggested: the file is rather large. We recommend starting with only the first 5-10 lines of your file, then adding in more lines as you’ve tested your code. A 15-line version of your file can be found at: https://www.cs.mcgill.ca/~patitsas/comp202/files/YOURSTUDENTNUMBER-short.txt

About The Data

Each line of your raw file contains the following information.
1. A number representing who recorded the data
2. A number representing the patient. Each patient has a unique number.
   The first patient diagnosed with the flu has the patient number 0, the second patient has the patient number 1, etc. There could be multiple rows for the same patient: for example, they could be diagnosed on one day, and then die on another day.
3. The date this entry was made
4. The patient’s date of birth
5. The patient’s sex/gender (some have sex recorded, some gender)
6. The patient’s home postal code
7. The patient’s state: Infected / Recovered / Dead (at the time the entry was made)
8. The patient’s temperature at the time the entry was made
9. How many days the patient has been symptomatic

Here’s what it could look like if patient #21 is recorded as infected for three days, and then recovers (and so the number of days symptomatic does not increase in the last entry):

    6 21 2022/11/28 1980/2/14 X H1Z I 40 5
    2 21 2022.11.29 1980.2.14 non-binary H1Z inf 41.3C 6
    6 21 2022/11/30 1980/2/14 X H1Z I 39 7
    1 21 2022-12-01 1980-2-14 genderqueer H1Z Recovered 37.2 7

Safe Assumptions

You can assume:
1. The columns will always appear in the same order
2. Every column will be present
3. Each row is in chronological order
4. There are no spelling mistakes
5. Each recorder records data in a consistent way
6. All dates are recorded in ISO format: year-month-day, where the year is four digits, and the month is the number (e.g. 2019-11-30), but could be delimited with any of ‘.’, ‘/’ or ‘-’.
7. Each unique patient will have an entry in the file for every day when they’re infected, up to the last day of the file.
8. If a patient dies or recovers, the entry that notes they died/recovered will be the last time the patient appears in the file

1 Initial Clean [16 points]

Create a new module initial_clean.py and put your name and student ID at the top. All of the code for this section will go into this module.
You may not import any modules other than doctest.

1.1 Which Delimiter [5 points]

Create, document and test the function which_delimiter:
• Input: one string
• A delimiter is a string that is used to separate columns of data on a single line
• Returns: the most commonly used delimiter in the input string; will be one of space/comma/tab
• Example:
    >>> which_delimiter('0 1 2,3')
    ' '
• Raise an AssertionError exception if there is no space/comma/tab (Note: don’t worry that we have not seen AssertionError in class! We are deliberately using a different kind of error than TypeError/ValueError/etc so the autograder can tell the difference between your raised exceptions and any issues in your code.)
• You can assume that you do not have to deal with ties

1.2 Stage 1: Delimiting and Capitals [6 points]

Create, document and test the function stage_one:
• Two inputs: input_filename and output_filename
• This will open the file with the name input_filename, and read the file line by line
• We will be making changes to each line and then writing the new version of the line to a new file named output_filename
• Because there is French in the files we need to add encoding = 'utf-8' as a parameter to all calls to open, so we can support the accents. This looks like:
    out_file = open(out_filename, 'w', encoding = 'utf-8')
• The changes to make to the data:
  1. Change the most common delimiter to tab (if it is not already tab-delimited)
  2. Change all text to be upper case
  3. Change any / or . in the dates to hyphens (e.g. 2022/11/28 becomes 2022-11-28)
• Return an integer: how many lines were written to output_filename
    >>> stage_one('1111111.txt', 'stage1.tsv')
    3000
• Why do I use .tsv now instead of .txt?
  The data is now all tab separated!
• See the example below for how the data changes
• Example: if we start with data that looks like:
    6 0 2022/11/28 1980/2/14 F H3Z I 40 3
    7 1 2022.11.29 1949.8.24 HOMME H1M2B5 INF 40C 4
    10 0 2022/11/29 1980/2/14 femme h3z3l2 infectée 39,13 C 4
    11,2,2022.11.29,1982.1.24,femme,h3x1r7,morte,39,3 C,3
  After stage one, the start of our output file should look like:
    6 0 2022-11-28 1980-2-14 F H3Z I 40 3
    7 1 2022-11-29 1949-8-24 HOMME H1M2B5 INF 40C 4
    10 0 2022-11-29 1980-2-14 FEMME H3Z3L2 INFECTÉE 39,13 C 4
    11 2 2022-11-29 1982-1-24 FEMME H3X1R7 MORTE 39 3 C 3

1.3 Stage 2: Consistent Columns [5 points]

Create, document and test the function stage_two:
• Two inputs: input_filename and output_filename
• This will open the file with the name input_filename, and read the file line by line
• Like in Stage 1, we will be making changes to each line and then writing the new version of the line to a new file named output_filename
• Because there is French in the files we need to add encoding = 'utf-8' as a parameter to all calls to open, so we can support the accents. This looks like:
    out_file = open(out_filename, 'w', encoding = 'utf-8')
• The changes to make to the data:
  1. All lines should have 9 columns
  2. Any lines with more than 9 columns should be cleaned so the line is now 9 columns. For example, in French the comma is used for decimal points, so the temperature ‘39,2’ could have been broken into 39 and 2.
• Example: if our input file is the output file from Stage 1’s example, we now have:
    6 0 2022-11-28 1980-2-14 F H3Z I 40 3
    7 1 2022-11-29 1949-8-24 HOMME H1M2B5 INF 40C 4
    10 0 2022-11-29 1980-2-14 FEMME H3Z3L2 INFECTÉE 39,13C 4
    11 2 2022-11-29 1982-1-24 FEMME H3X1R7 MORTE 39.3 C 3
• Return an integer: how many lines were written to output_filename
    >>> stage_two('stage1.tsv', 'stage2.tsv')
    3000

2 Pandemic Over Time [18 points]

Create a new module time_series.py and put your name and student ID at the top.
All of the code for this section will go into this module.

You may import the Python modules doctest, datetime, numpy and matplotlib, including their sub-modules (e.g. pyplot).

2.1 Date Diff [5 points]

Create, document and test the function date_diff:
• Input: two strings representing dates in ISO format (e.g. 2019-11-29)
• Returns: how many days apart the two dates are, as an integer
• If the first date is earlier than the second date, the number should be positive; otherwise the number should be negative
• Example:
    >>> date_diff('2019-10-31', '2019-11-2')
    2
• Tip: Python offers a module called datetime that can help you with this. Since we have not covered this module in class, here are some important things to know about it:
  – You can create date objects. Here are a few examples:
        import datetime
        date1 = datetime.date(2019, 10, 31) # Year, month, day
        print(date1.year) # will be 2019
        date2 = datetime.date(2019, 11, 2)
        print(date2.month) # will be 11
        diff = date1 - date2
  – You can subtract two date objects. The result is a timedelta object, whose days attribute tells you how many days apart the two dates are.
• You can read more here: https://docs.python.org/3/library/datetime.html

2.2 Get Age [3 points]

Create, document and test the function get_age:
• Input: two strings representing dates in ISO format (e.g. 2019-11-29)
• Returns: how many complete years apart the two dates are, as an integer
• Assume one year is 365.2425 days
• If the first date is earlier than the second date, the number should be positive; otherwise the number should be negative
• Examples:
    >>> get_age('2018-10-31', '2019-11-2')
    1
    >>> get_age('2018-10-31', '2000-11-2')
    -17

2.3 Stage Three [5 points]

Create, document and test the function stage_three:
• Two inputs: input_filename and output_filename
• This will open the file with the name input_filename, and read the file line by line.
  Remember we want utf-8 encoding like previous stages:
    out_file = open(out_filename, 'w', encoding = 'utf-8')
• We will be making changes to each line and then writing the new version of the line to a new file named output_filename
• First, determine the index date: the first date in the first line of the file (2022-11-28 in our running example)
• The changes to make to the data:
  1. Replace the date of each record with the date_diff of that date and the index date
  2. Replace the date of birth with age at the time of the index date
  3. Replace the status with one of I, R and D. (Representing Infected, Recovered, and Dead; the French words are infecté(e), récupéré(e) and mort(e).)
• Example: if our input file is the output file from Stage 2’s example, we now have:
    6 0 0 42 F H3Z I 40 3
    7 1 1 73 HOMME H1M2B5 I 40C 4
    10 0 1 42 FEMME H3Z3L2 I 39,13 C 4
    11 2 1 40 FEMME H3X1R7 D 39 C 3
• Return: a dictionary. The keys are each day of the pandemic (integer). The values are a dictionary, with how many people are in each state on that day. Example:
    >>> stage_three('stage2.tsv', 'stage3.tsv')
    {0: {'I': 1, 'D': 0, 'R': 0}, 1: {'I': 2, 'D': 1, 'R': 0}}

2.4 Plot Time Series [5 points]

Create, document and test the function plot_time_series:
• Input: a dictionary of dictionaries, formatted as the return value of Stage Three
• Return: a list of lists, where each sublist represents one day of the pandemic. Each sublist is [how many people infected, how many people recovered, how many people dead]
    >>> d = stage_three('stage2.tsv', 'stage3.tsv')
    >>> plot_time_series(d)
    [[1, 0, 0], [2, 0, 1]]
• In the function, also plot that list with matplotlib’s plot function, and save the png as time_series.png
  – Set the xlabel as ‘Days into Pandemic’
  – Set the ylabel as ‘Number of People’
  – Create a legend with Infected, Recovered and Dead.
    You can do this with:
        plt.legend(['Infected', 'Recovered', 'Dead'])
  – Title the plot ‘Time series of early pandemic, by ’ and then append your name
  – Save the file as time_series.png
• You should get a plot with three increasing lines; the slopes will vary from person to person, and could look like:
    [Figure: example time series plot]

3 Patients [34 points]

Create a new module construct_patients.py and put your name and student ID at the top. All of the code for this section will go into this module.

You may import doctest, datetime, numpy and matplotlib, including sub-modules (e.g. pyplot).

3.1 Patient Class

Create, document and test the class Patient. Its methods are:

1. __init__ [15 points]
• Input (all strings): the number of the patient, the day into the pandemic they were diagnosed, the age of the patient, the sex/gender of the patient, the postal code of the patient, the state of the patient, the temperature of the patient, and the days the patient has been symptomatic
• Initialize these attributes:
  – self.num: the number of the patient, an int
  – self.day_diagnosed: which day into the pandemic they were diagnosed, an int
  – self.age: the age of the patient, an int
  – self.sex_gender: the sex/gender of the patient, a string that is either M, F or X.
    ∗ These are for man/male, woman/female or non-binary.
    ∗ The French word for woman is ‘femme’; the French word for man is ‘homme’.
    ∗ The value ‘H’ is short for ‘homme’.
    ∗ Variants like boy/girl may appear in your data.
    ∗ Look up any genders in your data that you do not recognize. A list of nonbinary identities is available here: https://nonbinary.miraheze.org/wiki/List_of_nonbinary_identities
  – self.postal: the first three characters of the patient’s postal code, a string.
    ∗ If they do not have a valid postal code (e.g. ‘N.A.’), use ‘000’.
    ∗ A valid Montreal postal code should start with H, then a number, then a letter. (You do not have to validate the characters after the first three).
  – self.state: the state of the patient.
    Assume the input will be one of I, R or D.
  – self.temps: a list of floats, recording all the temperatures observed for this patient in Celsius (starting with the one given as input).
    ∗ Note: in French, the comma is used for decimal points.
    ∗ The input could be in Fahrenheit, so convert any temperature above 45 to Celsius. Round it to two decimals.
    ∗ If you get a string which does not contain a number (e.g. ‘N.A.’ because the patient died), record this as 0.
  – self.days_symptomatic: how many days the patient has been symptomatic, an int

2. __str__ [4 points]
• Return a string of the following attributes, separated by tabs: self.num, self.age, self.sex_gender, self.postal, self.day_diagnosed, self.state, self.days_symptomatic, and then all the temperatures observed separated by semi-colons
• Example:
    >>> p = Patient('0', '0', '42', 'Woman', 'H3Z2B5', 'I', '102.2', '12')
    >>> print(str(p))
    0 42 F H3Z 0 I 12 39.0

3. update [5 points]
• Input: another Patient object
• You can assume this object is based on an entry that was made after the one the current Patient is based on
• If this other object’s number, sex/gender, and postal code are all the same as the current patient:
  – Update the days the patient is symptomatic to the newer one
  – Update the state of the patient to the newer one
  – Append the new temperature observed about the patient. You can assume the other Patient has only one temperature stored in their temps.
• Example:
    >>> p = Patient('0', '0', '42', 'Woman', 'H3Z2B5', 'I', '102.2', '12')
    >>> p1 = Patient('0', '1', '42', 'F', 'H3Z', 'I', '40,0 C', '13')
    >>> p.update(p1)
    >>> print(str(p))
    0 42 F H3Z 0 I 13 39.0;40.0
• Raise an AssertionError exception if num/sex_gender/postal are not the same

3.2 Stage Four [5 points]

Create, document and test the function stage_four:
• Two inputs: input_filename and output_filename
• This will open the file with the name input_filename, and read the file line by line.
  As with other stages, be sure to set the encoding to utf-8.
• Create a new Patient object for each line. Do not do any conversions here; all the conversions should take place in the Patient initialization.
• Keep (and return) a dictionary of all the patients:
  – Use the patient’s number (as int) for the key, and the Patient objects for the values.
  – Whenever you see a new entry for an existing patient, update the existing Patient object rather than overwrite it.
• Write to the output file: every Patient converted to a string, sorted by patient number (separated by new lines)
• Example: if our input file is the output file from Stage 3’s example, we now have:
    0 42 F H3Z 0 I 12 40.0;39.13;39.45;39.5;39.36;39.2;39.0;39.04;38.82;37.7
    1 73 M H1M 1 I 5 40.0;0.0
    2 40 F H3X 1 I 9 39.0;39.0;39.22;39.2;38.2;37.4;37.4
    3 18 F H1T 2 I 8 39.2;39.93;40.0;38.5
• Return the dictionary of Patients
• Example:
    >>> p = stage_four('stage3.tsv', 'stage4.tsv')
    >>> len(p)
    1716
    >>> print(str(p[0]))
    0 42 F H3Z 0 I 12 40.0;39.13;39.45;39.5;39.36;39.2;39.0;39.04;38.82;37.7

3.3 Fatality Probability by Age [5 points]

Create, document and test the function fatality_by_age:
• Input: a dictionary of Patient objects
• Goal: plot the probability of fatality versus age
• For this plot, round patients’ ages to the nearest 5 (e.g. 23 becomes 25)
• To calculate probability of fatality, for each age group:
    how many people died / (how many people died + how many people recovered)
• Plot info:
  – Save your plot as fatality_by_age.png
  – Set the xlabel as ‘Age’
  – Set the ylabel as ‘Deaths / (Deaths+Recoveries)’
  – Set the y axis range from 0 to 1.2.
    You can do this with:
        plt.ylim((0, 1.2))
  – Title the plot ‘Probabilty of death vs age ’ and then append your name
• You should get a plot with one line, and could look like this:
    [Figure: example plot of fatality probability by age]
• Return: list of probabilities of death by age group.
• Example (matches the plot, not the example files):
    >>> p = stage_four('stage3.tsv', 'stage4.tsv')
    >>> fatality_by_age(p)
    [1.0, 1.0, 0.6875, 0.75, 0.8, 0.7, 0.9285714285714286, 0.6666666666666666,
    0.65, 0.3333333333333333, 0.5714285714285714, 0.7222222222222222,
    0.6923076923076923, 0.5384615384615384, 1.0, 0.875, 0.6666666666666666, 1.0, 0.75]

Closing Notes

There are many more things worth analyzing in the file you’ve cleaned! Some things epidemiologists would look at include:
• Estimating a basic reproduction number: how many people a person with the flu will infect
• Looking at a heat map of infections by part of the city (from postal codes)
• Whether there is a correlation between the average/maximum fever a patient has and their chance of death

If you want more practice with numpy and matplotlib, you might want to try plotting/fitting your data to figure those out!

Getting good data quickly is an important task for epidemiologists, to determine whether a pandemic has started and how to contain it. Speedy containment is vital for stopping pandemics.

Influenza is a family of viruses with many strains that have caused catastrophic pandemics. Spanish Flu in 1918 killed 20-100 million people, far more than World War I. The more recent Asian Flu (1957-8) and Hong Kong Flu (1968-9) killed about a million people each.

Even the ‘ordinary’ seasonal varieties of influenza can kill many people. People with compromised immune systems (e.g. due to cancer, AIDS) are most at risk, which is why herd immunity is important. Get your flu shot!

Real Data Is Ugly

Go back to the ‘Safe Assumptions’ section of the assignment and revisit those assumptions.
Relaxing any of those makes data much harder to clean!

For example, if we don’t restrict the dates to ISO format, you now have to figure out how dates are ordered. Sound tricky? Here’s such a file you can try it out with: https://www.cs.mcgill.ca/~patitsas/comp202/files/challenge.txt

Real world data is often much uglier than what you saw in this assignment: missing entries and misspellings are common. And in some cases you could even have to deal with malicious (fake) entries that you have to try to identify and remove.

Data Science

If you enjoyed this assignment, you might want to find a summer job doing data science! Data cleaning is a huge part of what is called data science: using computational practices to analyse (often unstructured) data.

This skill set is in demand in the workforce! If you’d like to pursue this for work, you’ll also want to take more statistics classes, and more computer science classes like COMP 250 to write code to efficiently process giant data sets. Hope to see you in COMP 250!
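As a rough illustration of the date-ordering problem raised under ‘Real Data Is Ugly’, here is one possible sketch of how the order of the fields could be guessed from a sample of dates. It is not part of the assignment and not an official solution. The function names split_date and guess_date_order are hypothetical, and the sketch assumes that years are written with four digits and that every date in the sample uses the same ordering (for example, all entries come from one recorder).

```python
import re


def split_date(text):
    """Split a date string on '.', '/' or '-' and return its three numbers."""
    parts = re.split(r"[./-]", text.strip())
    if len(parts) != 3:
        raise ValueError("expected three date fields in: " + text)
    return [int(part) for part in parts]


def guess_date_order(dates):
    """Guess the field order of a sample of dates, e.g. 'YMD' or 'DMY'.

    Assumptions (not from the assignment): years have four digits, and
    all dates in the sample share one ordering. Returns None when the
    sample does not contain enough evidence to decide.
    """
    rows = [split_date(date) for date in dates]
    # Largest value seen in each of the three positions.
    maxima = [max(column) for column in zip(*rows)]

    # Only a year can be larger than 31.
    year_positions = [i for i, top in enumerate(maxima) if top > 31]
    if len(year_positions) != 1:
        return None
    year = year_positions[0]

    # Of the two remaining positions, only a day can be larger than 12.
    others = [i for i in range(3) if i != year]
    day_positions = [i for i in others if maxima[i] > 12]
    if len(day_positions) != 1:
        return None
    day = day_positions[0]
    month = [i for i in others if i != day][0]

    labels = [""] * 3
    labels[year], labels[month], labels[day] = "Y", "M", "D"
    return "".join(labels)


if __name__ == "__main__":
    print(guess_date_order(["28/11/2022", "02/12/2022"]))
    print(guess_date_order(["01/02/2022", "03/04/2022"]))
```

The first sample can be resolved because 28 can only be a day; the second cannot, since every non-year field is 12 or less, so the function returns None. Looking at more dates from the same recorder is one way to resolve such cases.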