The goal of this document is to give you advanced tools for working withdatabase queries. By now you should already know how to connect to a databaseand how to make some nice plots. You will be expected to use e cient queries onexams. In other words the strategies used in the rst SQL lab generally broughtthe entire database into R memory and then performed calculations/actions.Last week and beyond we are working on extracting just what we need and onlyimporting the deisred rows and columns into R.You need to submit the solutions to all of the lab questions by Mondayat 9am. It is recommended that you use R Markdown for working throughassignments. Include code and output for all questions. Just make sure thatyou format the output so as to make it possible for the marker to understandwhat you are doing. Stray pieces of output that the marker needs to hunt forwon’t make the marker happy. Make sure that you clearly present your results.1 Connecting to a DatabaseConnect to the &"stat240Week5lab.sqlite&" database. Then print the list of tablesin the database just for your reference. Remember to call the usual databaseand SQL libraries. You can obtain table information using PRAGMA as we didlast week. This week we have two Olympic datasets, Summer and Winter. Inthe Summer table NOC is a 3 letter country code, see https://en.wikipedia.org/wiki/List_of_IOC_country_codes.2 Moving Averages with SQLIn statistics, a moving average (rolling average or running average) is a calcula-tion to analyze data points by creating series of averages of di erent subsets ofthe full data set. Usually the mean is taken from an equal number of data oneither side of a central value. Sometimes we use the average of the previous fewyears since we do not yet have information moving forward. Let’s look at howto obtain moving with SQL Queries.Lets try to obtain moving averages for total number of athletes who obtainedmedals regardless of the country. For that rst we need to create a Frequencytable where we need a unique list of year with its corresponding total medalcounts. To do this we will create a new virtual table as an interim step. We’llsplit up the task into small pieces so that you can see what is happening.Obtain the medal count within each year.1mov_avg1 = &"SELECT year, Count(year) AS TotalNumber FROMWinterO GROUP BY year&"out = dbGetQuery(con, mov_avg1)Question 1a Make a nice plot of the number of athletes who obtainedmedals per year. Use the option type=&"p&". Make sure axes and the titleare nice and clear (total 5 points: 4 points for the plot + 1 point for yourline of code).Create a virtual table (called a VIEW ) which we will call tot meds tohouse that output from the previous query without bringing the resultsinto R. This will let us do other queries on that virtual table.mov_avg2 = &"CREATE VIEW tot_meds AS SELECT year,Count(year) AS TotalNumber FROM WinterO GROUP BY year&"dbSendQuery(con, mov_avg2)Question 1b Check that the above worked by nding the new virtualtable inside the database (1 point for code and console output).Do this to delete the temporary table so that you don’t get stuck witherrors when you rerun your code. If you run this you will need to re-runthe previous step to re-build the virtual table.dbSendQuery(con, &"drop view tot_meds&")Query the virtual table to extract year and total medal.mov_avg3 = &"SELECT year, TotalNumber FROM tot_meds&"out = dbGetQuery(con, mov_avg3)Question 1c Add a line plot of the total number of athletes who receivedmedals per year from this table as an addition to the plot in part 1a. Makesure you chose a di erent colour for the line. (total 2 points. 1 point forthe line, 1 point for the code)Now we will compute a moving average by de ning an index within theyear and averaging everything within that window. This requires us tode ne two new tables t1 and t2. Remember that we can specify wherea variable comes from by using &".&". So t1.year means that we use thevariable year from table t1.We will use table t1 to de ne an index over year and use that index toaverage everything within the window:WHERE t2.Year BETWEEN (t1.Year-4) AND (t1.Year+4) .Note that the moving average will contain more or less Olympics withinthe window depending on the year. The reason is that the Olympics wereoccasionally cancelled due to world wars.2check = &"SELECT * FROM tot_meds AS t,(SELECT t1.year, AVG(t2.TotalNumber) AS mavgFROM tot_meds AS t1, tot_meds AS t2WHERE t2.year BETWEEN (t1.year-4) AND (t1.year+4)GROUP BY t1.Year) sq WHERE (t.year = sq.year)&"movingAvg = dbGetQuery(con, check)Question 1d Put a line with the moving average on the gureand make a legend(total 4 points; 2 points for the line, 2 pointsfor the legend).Now, make sure that you delete the temporary table.3 The number of medals won by team China in2004Now let’s look at the number of medals won by team China in the summerOlympics.meds = &"SELECT Edition as Year, Count(Edition) AS TotalNumber FROMOlymp_meds GROUP BY NOC, EditionHAVING NOC == CHN AND Year == 2004&"out = dbGetQuery(con, meds)Question 2a, 2 points; max 1 sentence Why does our number dif-fer from the one on https://en.wikipedia.org/wiki/China_at_the_2004_Summer_Olympics? The biggest danger to having a lot of di erent techniqueswith little understanding of what they are actual代写Database编程、Database实验作业代做、SQL code 编程作业帮做、代写Database实验ly doing is that it is easy toget an answer but hard to know if your answer is correct. In assignments and inexams, make sure you know if your answer is correct - or at least how to checkyour work.Question 2b, 6 points (5 for code, 1 for output proving you havethe right answer) Make a SQL query that will provide one row for each medalwon in the 2004 Olympics by team China. (Hint: Make sure that you get theright number of each medal type when comparing to https://en.wikipedia.org/wiki/China_at_the_2004_Summer_Olympics.)Question 3 a) Create a plot of total number of medal events fora country of your choosing (that country must participate regularlyin the Olympics going back to 1948 or earlier and earn at least 1medal per summer or winter Olympics. Be careful about using NOCor Country.) Add a line with the moving average of the current andpast 12 years, so that we are not averaging into the future. Addanother line with the moving average of with a window containingthe past and next 4 years. Finally include a horizontal like with theaverage number of medals for that country across all years. Includethe usual things to make the plot readable and (obviously) make surethat we know what is in the plot (total 8 points.)34 SQL summary statisticsThe objective here is to Extract Summary Statistics from the values in the tableinstead of bringing the table into R and then producing summary statistics.First the obvious summaries by making a VIEW with data from teamCanada. We will consider only the number of athletes who received medals.Some summaries can be obtained directly from SQLite:Canadian = &"CREATE VIEW Can_tot_meds AS SELECT Edition AS Year,Count(Edition) AS TotalNumber FROM Olymp_meds GROUP BY NOC,Edition HAVING NOC == CAN &"dbGetQuery(con, Canadian)# dbGetQuery(con, drop view# Can_tot_meds )summaries = &"SELECT COUNT(Year) AS YearsInOlympics,AVG(TotalNumber) AS AVGmedalcount , MIN(TotalNumber) AS MINmedalcount ,MAX(TotalNumber) AS MAXmedalcount FROM Can_tot_meds&"(out = dbGetQuery(con, summaries))Now to get the median. Recall that to get the median we order all the valuesand take the one in the middle (1=2 of the way through the sorted values). Weactually do that by using LIMIT to limit the results to give us a single valueand then asking for a speci c ordered value using OFFSET to de ne the valuein the middle of the sorted values:getmedian = &"SELECT TotalNumber AS Median FROM Can_tot_medsORDER BY TotalNumber LIMIT 1 OFFSET (SELECT COUNT(TotalNumber)FROM Can_tot_meds) /2&"(out = dbGetQuery(con, getmedian))Note that SQL expects the OFFSET to be an integer and therefore expects(except in very speci c exceptions) that the OFFSET be determined by usingintegers and not decimals. So basically dividing by 10 works but multiplying by.1 doesn’t.Question 4 (5 points for code, 5 points for numbers)Use SQL to obtain the (10;30;50;70;90)th percentiles of the total number ofathletes who won medals earned over all countries with medals (i.e. all countriesin the table ’WinterO’) for the years when the Olympics were in Lake Placid(Hint, this is easiest if you split this into 5 queries).5 Doing ’Fancy’ Math Within RSQLiteThe function initExtension, loads some math extensions into the database sothat we can get SQL to do mathematical queries. For example the StandardDeviation (stdev) is included in those extensions.initExtension(con)dbGetQuery(con, &"SELECT STDEV(TotalNumber) FROM Can_tot_meds&")Find all of the functions this allows us to use by looking at the help le(within R) for initExtension.Let’s Standardize the populations of Canadian regions. To obtain a stan-dardized score we wish to look for deviations from typical values that scaled bywithin the usual variation.Z-Score(x) = x xpvar(x)We can obtain the standardized values of the Canadian populations withineach region but we need to do this calculation on every row of data. We could usea VIEW table. Or we could de ne a temporary table within the SQL argument.We do that here using WITH to de ne the temporary table pop_cnt. Then wecall a variable within that table using the notation pop_cnt.variable. You cancheck partial results by running the SELECT command within the brackets.That will show you what you will get as the temporary table pop_cnt.sql_z = &"WITH pop_cnt AS(SELECT avg(Population__2011) AS mean,stdev(Population__2011) AS sd FROM POP)SELECT Geographic_name, POP.Population__2011,pop_cnt.mean, pop_cnt.sd,(POP.Population__2011 - pop_cnt.mean) / pop_cnt.sd AS z_score_popFROM pop_cnt, POP&"out = dbGetQuery(con, sql_z)Note that the rst value of the POP table is the population of Canada. Wecan use OFFSET to skip ahead to the second value. The problem is that LIMITmust be used with OFFSET. LIMIT returns a limited number of rows, but wewant all of them. Remember how we obtained all remaining rows when usingdbfetch? Use that trick to x the code below which otherwise calculates thestandardized values.Question 5, total 5 points (2 points for making a histogram of thestandardized populations and including a nice title and axis label. 3points for the SQL code. Hint: Make sure you look at the rst fewvalues in the table to make sure you have the population within smallregions and not something much larger. If there are any rows youneed to exclude, for full points your SQL code should do that for you.转自:http://ass.3daixie.com/2018052158549140.html
讲解:Database、Database、SQL code 、Database
©著作权归作者所有,转载或内容合作请联系作者
- 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
- 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
- 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
推荐阅读更多精彩内容
- 安装步骤: 下载地址 解压后进入文件目录,编译 so文件gcc -DMYSQL_DYNAMIC_PLUGIN -f...