Walkthrough: CITS1401, Python, Verify Authorship, PythonSPSS|Java

CITS1401 Computational Thinking with Python
Project 2 Semester 2 2019

Project 2: Using Stylometry to Verify Authorship

Submission deadlines:
Stage 1: 5:00pm, Friday 18 October 2019 for the pseudocode
Stage 2: 5:00pm, Friday 25 October 2019 for the code.
Value: 20% of CITS1401.
To be done individually.

You should construct a Python 3 program containing your solution to the following problem and submit your program electronically on LMS. No other method of submission is allowed.

You are expected to have read and understood the University's guidelines on academic conduct. In accordance with this policy, you may discuss with other students the general principles required to understand this project, but the work you submit must be the result of your own effort. Plagiarism detection, and other systems for detecting potential malpractice, will therefore be used. Besides, if what you submit is not your own work then you will have learnt little and will therefore, likely, fail the final exam.

You must submit your project before the submission deadline listed above. Following UWA policy, a late penalty of 10% will be deducted for each day (or part day), after the deadline, that the assignment is submitted. However, in order to facilitate marking of the assignments in a timely manner, no submissions will be allowed after 7 days following the deadline.

Overview

UWA, like every university around the country (probably around the galaxy), is very worried about ghost-written submissions for assignments. This is also known as contract cheating. Whatever you call it, ghost-writing is about getting someone else to do your work, but submitting it as if it was only your work. In this case we are concerned with essays. The incidence is believed to be low, but it's clearly not a good thing.

Coming from a different angle, debates have raged at various times about whether different authors' works were actually by those authors. For example, were all the works attributed to William Shakespeare actually by him?
One approach to examining both of these issues is to use stylometry. That is, rather than looking directly at the content of texts, as one does when looking for suspected plagiarism, stylometry looks for stylistic similarities. In other words, similarities in the ways a particular author uses language, rather than similarities in the actual words on the page, on the assumption that an author will use a similar style for similar sorts of content, fiction, non-fiction, etc.

What you will do for this Project is write a program that reads in two text files containing the works to be analysed and builds a profile for each. The two profiles are compared and returned along with a score which reflects the distance between the two works in terms of their style; low scores, down to 0, imply that the same author is likely responsible for both works, while large scores imply different authors.

Specification: What your program will need to do

Input:
Your program must define the function main with the following signature:
def main(textfile1, textfile2, feature)
The first and second arguments are the names of the text files with a work to be analysed. The third argument is the type of feature that will be used to compare the document profiles.
The allowed feature names are: punctuation, unigrams, conjunctions and composite.

Output:
The function is required to return the following outputs in the order provided below:
• the score from a pairwise comparison rounded to four decimal places,
• the dictionary containing the profile of the first file (textfile1), and
• the dictionary containing the profile of the second file (textfile2)

A more detailed specification
• For the purposes of this project, a sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character). Thus:
This is some text. This is yet more text
contains one sentence followed by the start of another sentence.
• You are required to create the profile of input files using dictionaries. The profile for each document will contain the number of occurrences of certain words (case insensitive) and pieces of punctuation.
• The counted words or punctuation marks depend on the input feature, which can be: punctuation, unigrams, conjunctions and composite.
• For conjunctions: your program is required to count the number of occurrences of the following words:
also, although, and, as, because, before, but, for, if, nor, of, or, since, that, though, until, when, whenever, whereas, which, while, yet
• For unigrams: your program is required to count the number of occurrences of each word in the files. Consider the following three lines of text contained in a file:
This is a Document.
This is only a document
A test should not cause problem
The word count will be: a:3, document:2, this:2, is:2, only:1, test:1, should:1, not:1, cause:1, problem:1
• For punctuation: your program should count certain pieces of punctuation: comma and semicolon.
In addition, your program should also count single-quote and hyphen, but only under certain circumstances. Specifically, your program should count single-quote marks, but only when they appear as apostrophes surrounded by letters, i.e. indicating a contraction such as shouldn't or won't. (Apostrophe is being included as an indication of more informal writing, perhaps direct speech.) Your program should count dash (minus) signs, but only when they are surrounded by letters, indicating a compound word, such as compound-word. Any other punctuation or letters, e.g. '.' when not at the end of a sentence, should be regarded as white space, so serve to end words. For these purposes, strings of digits are also words as they convey information. Therefore, in the unlikely event that a floating point number, such as 3.142, appears, that is regarded as two words.
Note: Some of the texts we will use include double hyphen, i.e. --. This is to be regarded as a space character.
• For composite: the profile should contain the number of occurrences of punctuation marks (as explained above) and conjunctions. In addition, your program should also add to the profile two further parameters relating to the text: the average number of words per sentence and the average number of sentences per paragraph, where a paragraph is any number of sentences followed by a blank line or by the end of the text.
• Each of the words and punctuation symbols should be placed, together with their respective counts, in a dictionary, which is called a profile.
• The first output by the main function is the distance between the corresponding profiles, which should be computed using the standard distance formula [formula not reproduced in this copy].
• The second and third outputs returned by the main function are the profiles corresponding to the first and second text files respectively.
The returned profiles are dictionaries in which each word is the key and the value is the number of occurrences of the key, such as {"also": 10, "got": 6} where "also" and "got" are the keys and have occurred 10 and 6 times respectively.

Example:
Download the project2data.zip file from the folder of Project 2 on LMS. An example interaction is provided as a sampleanswers.txt which you can find in sampleresult.txt. The results are based on three files: sample1.txt and sample2.txt, both excerpts taken from Life on the Mississippi, by Mark Twain.

Some Text Files to Examine
Some text files are also included in the zip file for you to try out. All of the texts, apart from Kangaroo, were obtained from Project Gutenberg (www.gutenberg.org). All the files have a long text at the end which contains the Project Gutenberg license and terms of use. I have removed the Gutenberg terms and license in the files rather than left them in the texts because that may affect the profiles.

Author                          Title                           Fiction/Non-fiction
Henry Lawson                    Children of the Bush            Fiction
D. H. Lawrence                  Fantasia of the Unconscious     Non Fiction
Mark Twain                      Life on the Mississippi         Non Fiction
D. H. Lawrence                  Sea and Sardinia                Non Fiction
D. H. Lawrence                  Kangaroo                        Fiction
Mark Twain                      Adventures of Huckleberry Finn  Fiction
Andrew Barton "Banjo" Paterson  Three Elephant Power            Fiction

A small note of warning. If you decide to download your own texts from Project Gutenberg, please be aware that many of the texts include spurious Unicode characters. Unfortunately, the file input-output functions we use in CITS1401 (and I use on a daily basis) only work with the standard ASCII character set, so will cause an exception if Unicode characters are in the text. While Python is well able to deal with Unicode, special input-output functions are needed, which are beyond the scope of this unit.
What I have done is use the Unix command: cat -vet filename to make the Unicode characters visible in the ASCII character set, and then use a text editor to remove them. (Tedious.)

Important:
You will have noticed that you have not been asked to write specific functions. That has been left to you. However, as in Project 1, it is important that your program defines the top-level function main() as described above. main() should then call the other functions. (Of course, these may call further functions.) The reason this is important is that when I test your program, my testing program will call your main() function. So, if you fail to define main(), or define it with a different signature, my program will not be able to test your program.

Things to avoid:
There are a few things for your program to avoid.
• You are not allowed to import any Python module except math or os. While use of other modules is a perfectly sensible thing to do (and the way I often may do it), it takes away much of the point of different aspects of the project, which is about getting practice creating code to accurately extract the parts of strings that you need, and use of basic Python structures, in this case dictionaries.
• Please do not assume that the input file names will end in .txt. File name suffixes such as .csv and .txt are not mandatory in systems other than Microsoft Windows.
• Please make sure your program does NOT call the input() or print() functions. That will cause your program to hang, waiting for input that my automated testing system will not provide. In fact, what will happen is that the marking program detects the call(s), and will not test your code at all.

Submission:
Stage 1:
Submit a single PDF file containing your approach and/or pseudocode for the solution of the problem as per guidelines discussed in Lecture L2 Software Development Process and Project 1 Stage 1 submission.
You need to discuss the document with your lab demonstrator before submission. It is mandatory to submit this file before 5:00pm 18 October 2019 on LMS to avoid a 10% deduction in Project 2 grading. This will be formative feedback on the problem-solving skills you have developed in the course. In case you do not submit the file, 10% of the total marks of the project will be deducted from your obtained grade of the Stage 2 submission.

Stage 2:
Submit a single Python (.py) file containing all of your functions via LMS before 5:00pm 25 October 2019.
You need to contact the unit coordinator if you have special considerations or are making a late submission.

Marking Rubric:
Your program will be marked out of 30 (later scaled to be out of 20% of the final mark).
22 out of 30 marks will be awarded based on how well your program completes a number of tests, reflecting normal use of the program, and also how the program handles various error states, such as the input file not being present. Other than things that you were asked to assume, you need to think creatively about the inputs your program may face.
8 out of 30 marks will be for style (4/8) "the code is clear to read" and efficiency (4/8) "your program is well constructed and runs efficiently". For style, think about use of comments, sensible variable names, your name at the top of the program, etc. (Please look at your lecture notes, where this is discussed.)

Style Rubric:
0 Gibberish, impossible to understand
1-2 Style is really poor
3 Style is good or very good, with small lapses
4 Excellent style, really easy to read and follow

Your program will be traversing text files of various sizes (possibly including large corpora) so try to minimise the number of times your program looks at the same data items.
You may wish to use dictionaries (or sets, if you are prepared to read the documentation), rather than lists.

Efficiency Rubric:
0 Code too incomplete to judge efficiency, or wrong problem tackled
1 Very poor efficiency, additional loops, inappropriate use of readline()
2 Acceptable efficiency, one or more lapses
3 Good efficiency, small lapses
4 Excellent efficiency, should have no problem on large files

Automated testing is being used so that all submitted programs are being tested the same way. Sometimes it happens that there is one mistake in the program that means that no tests are passed. If the marker is able to spot the cause and fix it readily, then they are allowed to do that and your - now fixed - program will score whatever it scores from the tests, minus 2 marks, because other students will not have had the benefit of marker intervention. Still, that's way better than getting zero. On the other hand, if the bug is too hard to fix, the marker needs to move on to other submissions.

Reposted from: http://www.3daixie.com/contents/11/3444.html
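The word and punctuation rules in the detailed specification can be sketched in a single pass over the text, without importing any module. This is only a rough illustration, not the expected solution. The names scan, build_profile and CONJUNCTIONS are hypothetical. Three choices below are assumptions that the specification does not settle: an apostrophe or hyphen surrounded by letters is counted and also kept inside its word, conjunctions that never occur are stored with a count of 0, and an unknown feature name raises ValueError. The composite feature is not covered.

```python
# Hypothetical helper names; none of them come from the assignment text.
CONJUNCTIONS = {
    "also", "although", "and", "as", "because", "before", "but", "for",
    "if", "nor", "of", "or", "since", "that", "though", "until", "when",
    "whenever", "whereas", "which", "while", "yet",
}


def scan(text):
    """Return (words, punctuation_counts) for one text in a single pass."""
    # Words are case insensitive, and a double hyphen acts as a space.
    text = text.lower().replace("--", " ")
    punct = {",": 0, ";": 0, "'": 0, "-": 0}
    words = []
    current = []  # characters of the word being built
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        if ch in ",;":
            punct[ch] += 1
        elif ch in "'-":
            before = text[i - 1] if i > 0 else " "
            after = text[i + 1] if i + 1 < len(text) else " "
            if before.isalpha() and after.isalpha():
                punct[ch] += 1
                # Assumption: the mark stays inside the word ("won't").
                current.append(ch)
                continue
        # Every other character ends the current word, so 3.142 is two words.
        if current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words, punct


def build_profile(text, feature):
    """Build a profile dictionary for punctuation, unigrams or conjunctions."""
    words, punct = scan(text)
    if feature == "punctuation":
        return punct
    if feature == "unigrams":
        profile = {}
        for word in words:
            profile[word] = profile.get(word, 0) + 1
        return profile
    if feature == "conjunctions":
        # Assumption: conjunctions that never occur keep a count of 0.
        profile = {word: 0 for word in CONJUNCTIONS}
        for word in words:
            if word in profile:
                profile[word] += 1
        return profile
    # Assumption: the specification does not say how to report a bad feature.
    raise ValueError("unsupported feature: " + feature)
```

For instance, scan("Don't over-think it, ok?") returns the words ["don't", "over-think", "it", "ok"] together with one comma, one apostrophe, one hyphen and no semicolons. Because each character is examined once and counts live in dictionaries, the sketch follows the efficiency advice about not revisiting the same data.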

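The distance formula itself is missing from this copy of the specification. The sketch below assumes that "the standard distance formula" means the Euclidean distance between the two profiles, treating a key absent from one profile as a count of 0. Both points are assumptions, and the function name distance is hypothetical. It uses only the math module, which the assignment permits.

```python
import math


def distance(profile1, profile2):
    """Euclidean distance between two profile dictionaries (assumed formula)."""
    # Assumption: a key missing from one profile counts as 0 there.
    keys = set(profile1) | set(profile2)
    total = 0.0
    for key in keys:
        diff = profile1.get(key, 0) - profile2.get(key, 0)
        total += diff * diff
    # The specification asks for the score rounded to four decimal places.
    return round(math.sqrt(total), 4)
```

With the example profile {"also": 10, "got": 6} compared against a hypothetical {"also": 7, "got": 2}, the differences are 3 and 4, so the score is 5.0. Two identical profiles score 0.0, matching the statement that low scores suggest the same author.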