University of Melbourne COMP90059 Assignment 2 walkthrough
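Task:
The main task is to use Python to read information from a CSV file. The work is divided into five tasks.
Analysis:
Use Python to read the CSV file and build a data structure that holds the data and makes it easy to look up, so that values such as the average and maximum expenditure can be computed.
1. Clean it up: the file contains many erroneous entries, such as badly formatted financial years, invalid areas, and incorrect expenditure amounts. The goal is to write a clean_data(data) function that checks each value and, when a value is wrong or invalid, replaces it with None.
2. Average expenditure: write a function that computes the average expenditure. It takes three arguments: the data, a start financial year, and an end financial year.
3. Distribution of funds: write a function that finds the distribution of funding. It takes three arguments: the data, the number of bins, and an area. The funding range is split evenly into smaller intervals, and the function counts how many of the area's expenditures fall into each interval.
4. Top 5: find the top five areas by number of expenditures within a given expenditure range.
5. Health check: write a main function that reports the average expenditure for each financial year from 1997-98 to 2011-12 and lists the top five areas.
Topics covered:
Python, file streams, data structures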
COMP90059 Introduction to Programming
Semester 2, 2019
School of Computing and Information Systems
The University of Melbourne
Assignment 2
Due date: Sunday 20th October 2019, 11.30pm
This assignment is worth 20 marks, and will count as 20% of your final mark in this subject. The
assignment specification is on the COMP90059 GROK environment, under the heading ‘Assignment 2’
(https://groklearning.com/learn/unimelb-comp900598-2019-s2/ass2/0/).
There are FIVE (5) questions in this assignment. The fifth question will require you to call the functions
you wrote in the first four questions, but we will provide sample versions so that you can complete the fifth
question even if you don’t complete all the others.
This is an individual project. While you may discuss the problems with your classmates, you must not
show written solutions to another student or use written solutions from another student.
You are reminded that your submission for this project is to be your own individual work. For most people,
collaboration will form a natural part of the undertaking of this project. However, it is still an individual task,
and so reuse of code or excessive influence in algorithm choice and development will be considered misconduct. We will check submissions for originality and will invoke the University’s Academic Misconduct
policy (http://academichonesty.unimelb.edu.au/policy.html) where inappropriate collusion or
plagiarism appear to have taken place. Your code will be passed through our plagiarism software.
Late submissions: A 10% penalty will be applied for each ‘late’ day and no late submissions will be
accepted after 5 days from deadline. If you submit after the deadline, your submission will be treated as
late and will be penalised.
Marking rubric: A sample marking rubric is uploaded on the LMS for your information.
Note: Three types of automated test cases will be run against your submission on GROK: (i) example
test cases from the examples given to you in the specification – you will see a tick mark if you pass them;
(ii) hidden test cases—you won’t see the test cases but you will receive a tick if your code has passed each
of them; and (iii) assessment test cases, which you will not see, but the markers will see and use to assess
your project. Be careful! GROK will allow you to submit code that does not pass tests.
Read the specification carefully and follow the instructions for each question.
Only assessment test cases will be used to calculate your mark, as outlined in the marking rubric. Make
sure to use a good programming style that includes relevant comments and formatting of the code. Five (5)
marks out of the 20 will be allocated to style, formatting and approach to solving the questions.
Good Luck!
Dr Antonette Mendoza, semester 2 2019
Background
Things to look out for in solving the questions are:
• don’t be afraid to create extra variables, e.g. to break up the code into conceptual sub-parts, improve
readability, or avoid redundancy in your code
• we also encourage you to write helper functions to simplify your code – you can write as many
functions as you like, as long as one of them is the function you are asked to write
• commenting of code is one thing that you will be marked on; get some practice writing comments in
your code, focusing on:
1. describing key variables when they are first defined, but not things like index variables in for
loops; and
2. describing what ‘chunks’ of code do, i.e. not every line, but chunks of code that perform a
particular operation, such as
#find the maximum value in the list
or
#count the number of vowels.
• be aware of tradeoffs with extra variables, functions, and comments – whilst we encourage these, if
we can’t see the important parts of your code easily, it is a style issue and may be penalized.
• start early and seek help from Grok Tutor messaging if you have trouble understanding what the
question asks you to do, or if your output does not match expectations.
Health expenditure in Australia
Health expenditure occurs where money is spent on health goods and services. It occurs at different levels
of government, as well as by non-government entities such as private health insurers and individuals.
In many cases, funds pass through a number of different entities before they are ultimately spent by
providers (such as hospitals, general practices and pharmacies) on health goods and services.
The term ‘health expenditure’ in this context relates to all funds given to, or for, providers of health goods
and services. It includes the funds provided by the Australian Government to the state and territory governments, as well as the funds provided by the state and territory governments to providers.
Congratulations! You have been appointed by the Australian Institute of Health and Welfare (AIHW) as a
programmer-analyst to make sense of the large volumes of data they collected between 1997 - 2012 and
conduct some data analysis to understand Australia’s health expenditure. As part of your new job, your
manager has asked you to write four (4) functions that perform specific tasks, plus a ‘main’ function that
uses all the other functions.
1 Health expenditure data format
The health expenditure data is given to you in one or more comma-separated values (CSV) files.
CSV is a simple file format which is widely used for storing tabular data (data that consists of columns and
rows). In CSV, columns are separated by commas, and rows are separated by newlines (so every line of the
text file corresponds to a row of the data).
Usually, the first row of the file is a header row which gives names for the columns.
Note: If CSV uses commas to separate columns, what do we do when our data itself contains commas?
The solution is to put quotation marks around the cell containing a comma. Python’s csv library does this
automatically.
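The quoting behaviour described in the note can be seen in a small sketch. The row contents below are made up for illustration; only the csv and io modules from the standard library are used.

```python
import csv
import io

# Write one row whose second cell contains a comma.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["7", "State, local", "120"])
line = buffer.getvalue()
print(line)  # the middle cell is wrapped in double quotes

# Reading the line back recovers the original three cells.
cells = next(csv.reader(io.StringIO(line)))
print(cells)
```

The comma inside the quoted cell is treated as data, so the row still has three columns when it is read back.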
The health expenditure data contains the following columns:
ID                 A unique ID assigned to each expenditure (row).
fin_year           The financial year of the expenditure.
state              The state allocated the funds.
area               The health area where the expenditure was provided.
funding            The broad source of the funding (either Government or Non-government).
detailed_funding   The organisation that provided funding.
expenditure        The amount provided in millions of dollars (rounded to whole numbers).
Here is a sample of the health expenditure data provided to you by the AIHW:
ID,fin_year,state,area,funding,detailed_funding,expenditure
1,1997-98,NSW,Administration,Government,Australian Government,315
2,1997-98,NSW,Administration,Government,State and local,120
3,1997-98,NSW,Administration,Non-government,Private health insurance funds,3145
4,1997-98,VIC,Administration,Government,Australian Government,2035
2 Getting the data into Python
In order to clean up and analyse the data, we need a way to take data from a CSV file and put it into a Python
data structure. Fortunately, Python has a built-in csv library which can do most of the work for us.
You won’t have to use the csv library directly, though. We will provide you with a helper function called
read_data which uses the csv library to read the data and turn it into a dictionary of dictionaries. For
example, suppose the data we saw in the previous example:
ID,fin_year,state,area,funding,detailed_funding,expenditure
1,1997-98,NSW,Administration,Government,Australian Government,315
2,1997-98,NSW,Administration,Government,State and local,120
was stored in a file called sample_data.csv. To work with this data in Python, we would call
read_data("sample_data.csv")
which would return the following Python dictionary:
{
    "1": {
        "fin_year": "1997-98",
        "state": "NSW",
        "area": "Administration",
        "funding": "Government",
        "detailed_funding": "Australian Government",
        "expenditure": "315"
    },
    "2": {
        "fin_year": "1997-98",
        "state": "NSW",
        "area": "Administration",
        "funding": "Government",
        "detailed_funding": "State and local",
        "expenditure": "120"
    }
}
Note: Notice that all of the values in the nested dictionaries are strings, even the numeric values. If you
want to use the values in numerical calculations, you will have to typecast them yourself.
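For the curious, a helper of this kind could be written roughly as follows. This is only a sketch, not the read_data supplied in header.py: the name read_data_sketch is hypothetical, and it takes an open file object rather than a filename so that it can be tried without a file on disk.

```python
import csv
import io


def read_data_sketch(handle):
    """Turn an open CSV file into a dictionary of dictionaries keyed by ID."""
    data = {}
    for row in csv.DictReader(handle):
        # remove the ID column from the row and use it as the outer key
        row_id = row.pop("ID")
        data[row_id] = dict(row)
    return data


# stand-in for an open file, using two rows from the sample above
sample = io.StringIO(
    "ID,fin_year,state,area,funding,detailed_funding,expenditure\n"
    "1,1997-98,NSW,Administration,Government,Australian Government,315\n"
    "2,1997-98,NSW,Administration,Government,State and local,120\n"
)
data = read_data_sketch(sample)
print(data["1"]["expenditure"])        # a string, not an int
print(type(data["2"]["expenditure"]))
```

Because the csv library hands back every cell as text, the expenditure values stay strings until they are converted with int().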
Nested dictionaries can be confusing. Here are some simple examples of how to access data in a nested
dictionary:
# save the data in a variable
data = {
    "1": {
        "fin_year": "1997-98",
        "state": "NSW",
        "area": "Administration",
        "funding": "Government",
        "detailed_funding": "Australian Government",
        "expenditure": "315"
    },
    "2": {
        "fin_year": "1997-98",
        "state": "NSW",
        "area": "Administration",
        "funding": "Government",
        "detailed_funding": "State and local",
        "expenditure": "120"
    }
}
# which state was the first expenditure provided for?
print(data["1"]["state"])
# was the funding for the second expenditure Government or Non-government?
print(data["2"]["funding"])
# what is the difference in millions between the two expenditures?
print(int(data["1"]["expenditure"])-int(data["2"]["expenditure"]))
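The same access pattern extends to a loop over every row. The sketch below uses a trimmed copy of the two rows above (only the state and expenditure keys are kept, purely to save space) and totals the expenditure after typecasting.

```python
# trimmed copy of the two sample rows; the other keys are omitted for brevity
data = {
    "1": {"state": "NSW", "expenditure": "315"},
    "2": {"state": "NSW", "expenditure": "120"},
}

# add up the expenditure of every row, converting each string to an int first
total = 0
for row in data.values():
    total += int(row["expenditure"])
print(total)
```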
3 Workspace files
You’ll notice that in this assignment there are multiple files in the Grok workspace (the area where you
write your code). A quick explanation of these files is in order:
program.py The file where you will write your code. We have included a little bit of code in program.py
to get you started.
header.py A file containing some useful functions and constants. We have already imported the relevant
functions and constants for you in each question.
Various CSV files You will see some files in the workspace called noisy_sample.csv, noisy_data.csv,
cleaned_sample.csv, or cleaned_data.csv. (The exact files vary from question to question.)
These are health expenditure CSV files provided to you by the AIHW. You can use them to test your
functions as you work through the questions. The sample files are quite small (only a few lines),
while the data files are relatively long (1000 lines).
Question 1: Cleaning It Up (3 marks)
You have been provided with large CSV files containing health expenditure data about Australia. Unfortunately, the data is ‘noisy’: some people have made data entry mistakes, or intentionally entered incorrect
data. Your first task as a programmer-analyst is to clean up the noisy data for later analysis.
There are a few particular errors in this data:
• Typos have occurred in the expenditure, resulting in some non-numeric values.
• People have entered expenditure areas that are out-of-date and no longer valid. The valid areas are
listed in a variable called VALID_AREAS, which is given to you.
• Some people have formatted the financial year incorrectly, using words instead of digits (e.g. inputting ‘twenty-ten to eleven’ instead of ‘2010-11’) or using too many or too few digits (e.g. ‘10-11’).
Others have entered years outside the range of the dataset. The data provided is for financial years
within the range 1997-98 to 2011-12.
Write a function clean_data(data) which takes one argument, a dictionary of data in the format returned
by read_data. This data has been read directly from a CSV file, and is noisy! Your function should
construct and return a new data dictionary which is identical to the input dictionary, except that invalid data
values have been replaced with None. You should not modify the argument dictionary, data.
For example, let’s look at the data contained in noisy_sample.csv:
>>> data_noisy = read_data('noisy_sample.csv')
>>> for key, value in sorted(data_noisy.items()):
...     print(key)
...     print(value)
1
{'fin_year': '1997-98', 'state': 'NSW', 'area': 'Administration', 'funding':
'Government', 'detailed_funding': 'Australian Government', 'expenditure': '315'}
2
{'fin_year': 'twenty-ten to eleven', 'state': 'NSW', 'area': 'Administration', 'funding':
'Government', 'detailed_funding': 'State and local', 'expenditure': '120'}
3
{'fin_year': '2000-01', 'state': 'NSW', 'area': 'Miscellenous', 'funding':
'Non-government', 'detailed_funding': 'Private health insurance funds', 'expenditure': '314'}
4
{'fin_year': '98-99', 'state': 'NSW', 'area': 'Aids and appliances', 'funding':
'Government', 'detailed_funding': 'Australian Government', 'expenditure': '4e'}
5
{'fin_year': '1997-98', 'state': 'NSW', 'area': 'Aids and appliances', 'funding':
'Non-government', 'detailed_funding': 'Individuals', 'expenditure': '12'}
Clearly some of the values are invalid! Let’s call clean_data on the data, and look at the result:
>>> data_cleaned = clean_data(data_noisy)
>>> for key, value in sorted(data_cleaned.items()):
...     print(key)
...     print(value)
1
{'fin_year': '1997-98', 'state': 'NSW', 'area': 'Administration', 'funding':
'Government', 'detailed_funding': 'Australian Government', 'expenditure': '315'}
2
{'fin_year': None, 'state': 'NSW', 'area': 'Administration', 'funding':
'Government', 'detailed_funding': 'State and local', 'expenditure': '120'}
3
{'fin_year': '2000-01', 'state': 'NSW', 'area': None, 'funding':
'Non-government', 'detailed_funding': 'Private health insurance funds', 'expenditure': '314'}
4
{'fin_year': None, 'state': 'NSW', 'area': 'Aids and appliances', 'funding':
'Government', 'detailed_funding': 'Australian Government', 'expenditure': None}
5
{'fin_year': '1997-98', 'state': 'NSW', 'area': 'Aids and appliances', 'funding':
'Non-government', 'detailed_funding': 'Individuals', 'expenditure': '12'}
Notice the None values in the nested dictionaries of the cleaned data.
You can assume the following:
• the input data dictionary does not contain None values;
• all numeric expenditure values are valid.
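One possible shape for such a function is sketched below. It is an illustration under stated assumptions, not a reference solution: VALID_AREAS here is a two-item stand-in for the constant supplied in header.py, the helper names valid_fin_year and valid_expenditure are hypothetical, and the check that the two halves of a financial year are consecutive is an assumption that goes slightly beyond the wording of the question.

```python
import re

# Assumption: a small stand-in for the VALID_AREAS constant supplied in header.py.
VALID_AREAS = ["Administration", "Aids and appliances"]
FIRST_YEAR, LAST_YEAR = 1997, 2011  # start years of 1997-98 and 2011-12


def valid_fin_year(value):
    """Check for the form YYYY-YY, consecutive years, inside the dataset range."""
    match = re.fullmatch(r"(\d{4})-(\d{2})", value)
    if match is None:
        return False
    start, end = int(match.group(1)), int(match.group(2))
    # assumption: the second part must be the year after the first (1999-00 is fine)
    return FIRST_YEAR <= start <= LAST_YEAR and end == (start + 1) % 100


def valid_expenditure(value):
    """Check that the value parses as a whole number (negative values allowed)."""
    try:
        int(value)
    except ValueError:
        return False
    return True


def clean_data(data):
    """Return a cleaned copy of data; the argument itself is left untouched."""
    checks = {
        "fin_year": valid_fin_year,
        "area": lambda value: value in VALID_AREAS,
        "expenditure": valid_expenditure,
    }
    cleaned = {}
    for row_id, row in data.items():
        new_row = dict(row)  # copy so the caller's nested dictionary is not modified
        for field, is_valid in checks.items():
            if not is_valid(new_row[field]):
                new_row[field] = None
        cleaned[row_id] = new_row
    return cleaned


# row 4 of noisy_sample.csv, as shown above
noisy = {"4": {"fin_year": "98-99", "state": "NSW", "area": "Aids and appliances",
               "funding": "Government", "detailed_funding": "Australian Government",
               "expenditure": "4e"}}
cleaned = clean_data(noisy)
print(cleaned["4"]["fin_year"], cleaned["4"]["expenditure"])
print(noisy["4"]["fin_year"])  # the original row still holds the noisy value
```

Copying each nested dictionary with dict(row) before editing it is what keeps the argument unchanged; assigning None directly into data would violate the requirement not to modify it.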
Testing your function: We have included some code at the bottom of your file to make it easier for you
to test your code. Don’t worry if you don’t understand all the details. When you open the Grok terminal,
the code we’ve added will read the data from the file specified in test_file into a dictionary called
data_noisy.
To manually test your code, change the value of test_file to the file you want to use to test your code,
then execute the following in the Grok terminal:
>>> data_cleaned = clean_data(data_noisy)
Note: For the remainder of the assignment, we will provide you with nice clean data for analysis, so
you can work on later questions without completing this one. (In the real world, this probably wouldn’t
happen!) Remember that the clean data will contain some None values.
Be careful! When GROK says your submission passed our tests, this only means the submission was
accepted for marking. In this assignment we allow you to submit code that does not pass tests. Such
code will not receive correctness marks. Therefore, pay close attention to the advisory messages that
GROK gives after saying your submission passed our tests.
Question 2: Average expenditure (4 marks)
Write a function called avg_expenditure(data, start, end) which takes three arguments: a dictionary of data in the format returned by read_data, the financial year that starts the range, in the format XXXX-XX,
and the financial year that ends the range, in the same format. You can assume that the financial years given for start and end are valid. The
function calculates the average expenditure within the provided range of financial years and returns the
average rounded to the closest whole number. If start is greater than end, the function should return
-1. You may assume the health data in data is ‘clean’, that is, all invalid values have been replaced by None.
If a nested dictionary contains a None value for the expenditure key or fin_year key, you should ignore
it in your calculation. (If the dictionary has None for a different key, e.g. area, you should still include it
in the calculation.)
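A minimal sketch of one way to approach this follows. It relies on the observation that valid XXXX-XX strings sort in chronological order when compared as text, treats both ends of the range as inclusive, and returns 0 when no row falls in the range; the last two points are assumptions, since the question does not spell them out. Note also that Python's round() sends exact halves to the nearest even number.

```python
def avg_expenditure(data, start, end):
    """Average expenditure over the financial years from start to end (inclusive)."""
    # valid XXXX-XX strings compare chronologically because the 4-digit year leads
    if start > end:
        return -1

    # collect the expenditure of every usable row inside the range
    amounts = []
    for row in data.values():
        year, amount = row["fin_year"], row["expenditure"]
        if year is None or amount is None:
            continue
        if start <= year <= end:
            amounts.append(int(amount))

    if not amounts:
        return 0  # assumption: nothing in range means an average of 0
    return round(sum(amounts) / len(amounts))


# the cleaned sample rows from Question 1, trimmed to the keys used here
sample = {
    "1": {"fin_year": "1997-98", "area": "Administration", "expenditure": "315"},
    "2": {"fin_year": None, "area": "Administration", "expenditure": "120"},
    "3": {"fin_year": "2000-01", "area": None, "expenditure": "314"},
    "4": {"fin_year": None, "area": "Aids and appliances", "expenditure": None},
    "5": {"fin_year": "1997-98", "area": "Aids and appliances", "expenditure": "12"},
}
print(avg_expenditure(sample, "1997-98", "2011-12"))
print(avg_expenditure(sample, "2000-01", "2010-11"))
```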
Here are some examples of what your function should return for different datasets and financial year brackets:
>>> data_cleaned = read_data("cleaned_sample.csv")
>>> avg_expenditure(data_cleaned, '1997-98', '2011-12')
214
>>> avg_expenditure(data_cleaned, '2000-01', '2010-11')
314
>>> data_cleaned = read_data("cleaned_data.csv")
>>> avg_expenditure(data_cleaned, '1997-98', '2011-12')
222
>>> avg_expenditure(data_cleaned, '2000-01', '2010-11')
224
Be careful! When GROK says your submission passed our tests, this only means the submission was
accepted for marking. In this assignment we allow you to submit code that does not pass tests. Such
code will not receive correctness marks. Therefore, pay close attention to the advisory messages that
GROK gives after saying your submission passed our tests.
Question 3: Distribution of funds (5 marks)
Your employers are interested in the distribution of expenditure across different areas of the health sector.
One way to analyse this is to divide the range of possible expenditures into a number of equal-sized ‘bins’
– where a bin is just a subset of the overall range – then count the number of expenditures falling into each
bin (if you’ve ever worked with histograms before, this should be very familiar).
For example, we could divide the total expenditure range [0–5000] into ten bins: 0–499, 500–999, 1000–
1499, and so on, up to 4500–5000. The distribution of expenditures would then be summarised by 10
integers corresponding to the ten bins. In general, we experiment with the number of bins to find the
number that gives the most informative distribution.
Write a function called funding_dist(data, n_bins, area), which calculates the distribution of expenditures greater than or
equal to the minimum expenditure and less than or equal to the maximum expenditure for a given area, by dividing
that range into n_bins bins and counting the number of expenditures that fall into each bin. The bin width
should be an integer. Your function should return a list of ints, with each integer representing the
number of expenditures falling in the corresponding bin.
Here is an example of how your function should behave:
>>> data = read_data("cleaned_data.csv")
>>> funding_dist(data, 3, "Research")
[47, 3, 2]
>>> funding_dist(data, 8, "Private hospitals")
[56, 14, 3, 4, 2, 2, 1, 2]
>>> funding_dist(data, 8, "Research")
[39, 6, 3, 1, 0, 1, 1, 1]
>>> funding_dist(data, 12, "Public hospitals")
[75, 6, 2, 5, 1, 2, 3, 1, 0, 0, 1, 1]
If a nested dictionary in data contains a None value for the expenditure or area key, you should ignore
it in your calculation. (If the dictionary has None for a different key, you should still include it in the
calculation.)
You may assume that n_bins is a positive integer. Notice that including the maximum expenditure in
the last bin may make the last bin slightly ‘wider’ than the others. For example, if max_expend == 101,
min_expend == -20 and n_bins == 6, the bins would be -20 to -1, 0 to 19, 20 to 39, 40 to 59, 60 to 79, and 80 to
101.
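The binning rule can be sketched as follows. This is one reading of the description, not a confirmed solution: the bin width is taken as the range divided by n_bins using integer division, anything beyond the last boundary is folded into the last bin, an area with no usable rows yields all zeros, and a width that would round down to 0 is bumped to 1. The last two choices are assumptions.

```python
def funding_dist(data, n_bins, area):
    """Count the expenditures of one area that fall into each of n_bins bins."""
    # rows with a None area never equal the requested area, so they drop out here
    amounts = [int(row["expenditure"]) for row in data.values()
               if row["area"] == area and row["expenditure"] is not None]

    counts = [0] * n_bins
    if not amounts:
        return counts  # assumption: no usable rows gives an all-zero distribution

    low, high = min(amounts), max(amounts)
    # integer bin width; assumption: never let it drop below 1
    width = max(1, (high - low) // n_bins)

    for amount in amounts:
        # values past the last boundary (including the maximum) go in the last bin
        index = min((amount - low) // width, n_bins - 1)
        counts[index] += 1
    return counts


# made-up rows that reproduce the -20 to 101 example with six bins
values = [-20, -1, 0, 19, 20, 80, 100, 101]
demo = {str(i): {"area": "Research", "expenditure": str(v)}
        for i, v in enumerate(values)}
print(funding_dist(demo, 6, "Research"))
```

With these values the width works out to 20, so -20 and -1 share the first bin while 80, 100 and 101 all land in the last, slightly wider, bin.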
Be careful! When GROK says your submission passed our tests, this only means the submission was
accepted for marking. In this assignment we allow you to submit code that does not pass tests. Such
code will not receive correctness marks. Therefore, pay close attention to the advisory messages that
GROK gives after saying your submission passed our tests.
Question 4: Top 5 (5 marks)
Write a function called area_expenditure_counts(data, lower_spent, upper_spent) which creates a dictionary of the number of expenditures in the given expenditure amount bracket for each area.
That is, each key in the dictionary should be an area name, and the value for that area should be an int
corresponding to the number of expenditures in the area that fall in the expenditure amount bracket specified by lower_spent and upper_spent (inclusive). Your dictionary should have a key for all areas in
VALID_AREAS, even the ones that have no expenditures in the expenditure amount bracket. The function
should return the top 5 areas by expenditure count as a list of tuples [(area, expenditure count),
...]. Top 5 areas should be listed in descending order by count, and ties should be broken by alphabetical order.
In this question, you should ignore any nested dictionary with a None value for the area key or the
expenditure key. None values for other keys are acceptable. You may assume that lower_spent and
upper_spent are positive. If lower_spent > upper_spent, your function should return a list with a
value of 0 for every expenditure.
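The counting and ranking steps might look like the sketch below. VALID_AREAS is a six-item stand-in built from area names that appear in this document, not the real constant from header.py, and the sample rows are the cleaned sample from Question 1 trimmed to the keys used here.

```python
# Assumption: a shortened stand-in for the VALID_AREAS constant in header.py.
VALID_AREAS = ["Administration", "Aids and appliances", "All other medications",
               "Dental services", "Medical services", "Public hospitals"]


def area_expenditure_counts(data, lower_spent, upper_spent):
    """Return the top 5 (area, count) pairs for expenditures inside the bracket."""
    # every valid area starts at 0 so that empty areas still appear
    counts = {area: 0 for area in VALID_AREAS}

    for row in data.values():
        area, amount = row["area"], row["expenditure"]
        if area is None or amount is None:
            continue
        # cleaned data only contains areas from VALID_AREAS, so the key exists
        if lower_spent <= int(amount) <= upper_spent:
            counts[area] += 1

    # sort by descending count, then alphabetically by area name
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:5]


sample = {
    "1": {"fin_year": "1997-98", "area": "Administration", "expenditure": "315"},
    "2": {"fin_year": None, "area": "Administration", "expenditure": "120"},
    "3": {"fin_year": "2000-01", "area": None, "expenditure": "314"},
    "4": {"fin_year": None, "area": "Aids and appliances", "expenditure": None},
    "5": {"fin_year": "1997-98", "area": "Aids and appliances", "expenditure": "12"},
}
print(area_expenditure_counts(sample, 0, 800))
```

Sorting on the tuple (-count, area) handles both ordering rules in one pass: negating the count gives descending order, and the area name breaks ties alphabetically. When lower_spent is greater than upper_spent no row can match, so every count stays at 0.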
Here are some examples of how your function should behave:
>>> data_cleaned = read_data("cleaned_data.csv")
>>> area_expenditure_counts(data_cleaned, 1000, 5000)
[('Medical services', 15), ('Public hospitals', 15),
('Benefit-paid pharmaceuticals', 6), ('All other medications', 4),
('Dental services', 3)]
>>> area_expenditure_counts(data_cleaned, 0, 1000)
[('Patient transport services', 88), ('Dental services', 85),
('Private hospitals', 82), ('Aids and appliances', 81),
('Public hospitals', 80)]
>>> area_expenditure_counts(data_cleaned, 4000, 8000)
[('Public hospitals', 3), ('Medical services', 2), ('Administration', 0),
('Aids and appliances', 0), ('All other medications', 0)]
>>> area_expenditure_counts(data_cleaned, -1000, 0)
[('Community health', 22), ('Medical expense tax rebate', 20),
('Patient transport services', 16), ('Dental services', 5),
('All other medications', 4)]
Be careful! When GROK says your submission passed our tests, this only means the submission was
accepted for marking. In this assignment we allow you to submit code that does not pass tests. Such
code will not receive correctness marks. Therefore, pay close attention to the advisory messages that
GROK gives after saying your submission passed our tests.
Question 5: Health Check (3 marks)
A prestigious Victorian university has asked the AIHW to produce a report on health expenditure in Australia. They have asked you to help them generate some of the data for this report.
Write a function called main(datafile) which takes a filename as an argument, which reads the health
data contained in that file, cleans the data, and uses the data to print out some facts about health expenditure.
You should assume that the data in datafile is noisy. Your function should calculate and print out the
following facts:
1. The average expenditure for each financial year between 1997-98 and 2011-12.
2. A list of the top 5 expenditure areas by count. The report is only interested in expenditures between
0-800 million (inclusive).
Note: You will probably find it useful to call read_data, clean_data, etc. in your main function.
To help, we have provided implementations of all the functions from preceding questions. These are
imported at the top of program.py. You do not need to copy code from previous questions.
Average expenditures should be listed in chronological order by financial year with format XXXX-XX,
and a year should only be listed if it has a non-zero average. Next to the financial year, in brackets, print the average
expenditure for that year.
Top 5 areas should be listed in descending order by count, and only listed if they have at least one
expenditure. Ties should be broken by alphabetical order. Next to the area name, in brackets, print the
number of expenditures in the area.
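Two small pieces of the formatting can be sketched in isolation: generating the financial-year labels in chronological order, and printing an average given in millions as a dollar figure with thousands separators. The helper names fin_year_labels and format_average_line are hypothetical, and multiplying by one million is inferred from the sample output rather than stated in the text.

```python
def fin_year_labels(first_start=1997, last_start=2011):
    """Build 'XXXX-XX' labels for each financial year in chronological order."""
    return [f"{start}-{(start + 1) % 100:02d}"
            for start in range(first_start, last_start + 1)]


def format_average_line(fin_year, avg_millions):
    """Format one report line, converting an average in millions to dollars."""
    return f"{fin_year} ({avg_millions * 1_000_000:,})"


labels = fin_year_labels()
print(labels[0], labels[2], labels[-1])
print(format_average_line("1997-98", 185))
```

The :02d format keeps the leading zero in labels such as 1999-00, and the , format option inserts the thousands separators.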
Here is an example of what your function should print. Make sure your function matches the format
exactly.
>>> main("noisy_data.csv")
Average Australian health expenditure
1997-98 (185,000,000)
1998-99 (90,000,000)
1999-00 (157,000,000)
2000-01 (224,000,000)
2001-02 (237,000,000)
2002-03 (215,000,000)
2003-04 (155,000,000)
2004-05 (247,000,000)
2005-06 (116,000,000)
2006-07 (231,000,000)
2007-08 (198,000,000)
2008-09 (250,000,000)
2009-10 (240,000,000)
2010-11 (335,000,000)
2011-12 (436,000,000)
Top 5 health expenditure areas by count
Patient transport services (88)
Dental services (85)
Aids and appliances (81)
Private hospitals (79)
Public hospitals (76)
>>> main("noisy_sample.csv")
Average Australian health expenditure
1997-98 (164,000,000)
2000-01 (314,000,000)
Top 5 health expenditure areas by count
Administration (2)
Aids and appliances (1)
Be careful! When GROK says your submission passed our tests, this only means the submission was
accepted for marking. In this assignment we allow you to submit code that does not pass tests. Such
code will not receive correctness marks. Therefore, pay close attention to the advisory messages that
GROK gives after saying your submission passed our tests.
End of assignment.