Explained: STAT3017, R, Data Statistics

STAT3017 Final Project
Big Data Statistics - Final Project (v1)
Total of 100 marks, due Monday 29 October 2018 at 17:00
Dale Roberts - Australian National University. Last updated: September 21, 2018

In this project we consider how to test hypotheses about covariance matrices and, since everything "eighties" is trendy again, we will look at some multivariate time series papers from that epoch and make them fresh again. There are interesting connections between these two topics. The aim of the project is to take the modern viewpoint and understand what happens in both situations when the dimensionality $p$ of the observations becomes large.

Testing Covariance Matrices

Question 1 [5 marks]

As a warm-up, read Section 6.6 in [C] and reproduce the calculations of Example 6.12 in R. In this example, Box's M-test is used to study nursing home data from Wisconsin (data found in Example 6.10). If your results differ slightly from the book's, briefly explain why.

Question 2 [10 marks]

Box's M-test (a.k.a. Box's $\chi^2$ approximation) is a classic result that is based on a likelihood ratio test (LRT). The general philosophy behind an LRT is to maximise the likelihood under the null hypothesis $H_0$ and also to maximise the likelihood under the alternative hypothesis $H_1$.

Definition 1. If the distribution of the random sample $x = (x_1, \dots, x_n)'$ depends upon a parameter vector $\theta$, and if $H_0 : \theta \in \Omega_0$ and $H_1 : \theta \in \Omega_1$ are any two hypotheses, then the likelihood ratio statistic for testing $H_0$ against $H_1$ is defined as

$$\lambda_1 := \frac{L_0^*}{L_1^*},$$

where $L_i^*$ is the largest value which the likelihood function takes in the region $\Omega_i$, $i = 0, 1$.

At this point it is good to remember that a multivariate Normal distribution is completely characterised by the parameter vector $\theta = (\mu, \Sigma)$, i.e., only the mean vector and the covariance matrix are needed to know the distribution.

The LRT has the following important asymptotic property as $n \to \infty$ that Box leverages to obtain his $\chi^2$ approximation.

Theorem 1. If $\Omega_1 \subset \mathbb{R}^q$ and if $\Omega_0$ is an $r$-dimensional subregion of $\Omega_1$ then (under some technical assumptions) for each $\omega \in \Omega_0$, $-2\log(\lambda_1)$ has an asymptotic $\chi^2_{q-r}$ distribution as $n \to \infty$.

The explanation of why Theorem 1 is true starts in Section 10.2 of [B], where the LRT is derived, culminating in the critical region for $\lambda_1$ given by eq. (9). At this point, no assumptions are made about the distribution of the population covariance matrices $\Sigma_1, \dots, \Sigma_q$ (so we don't know how $\lambda_1$ is distributed). Assumptions are made in Section 10.4: covariances are assumed Wishart distributed, which occurs when the random samples $x_1, \dots, x_n$ are multivariate Normal. Box's $\chi^2$ asymptotic approximation is obtained in Section 10.5 thanks to a formula for the $h$-th moment of $\lambda_1$. As $\mathbb{E}[\lambda_1^h]$ has a specific form (given in terms of ratios of Gamma functions), Theorem 8.5.1 of [B] can be applied to get an approximation of $\mathbb{P}(-2\rho\log(\lambda_1) \le z)$ in terms of the $\chi^2$ distribution.

Now that you understand some of the theory, study the classic "iris" dataset (available in R in the iris variable). The populations are Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each sample consists of 50 observations. Use Box's M-test (or otherwise) to:

(a) [5] Test the hypothesis $\Sigma_1 = \Sigma_2$ at the 5% significance level.
(b) [5] Test the hypothesis $\Sigma_1 = \Sigma_2 = \Sigma_3$ at the 5% significance level.

Note: this is Problem 10.1 from [B].

Question 3 [10 marks]

On page 311 in [C], just above Example 6.12, the authors make the comment that "Box's $\chi^2$ approximation works well if each $n_\ell$ exceeds 20 and if $p$ and $g$ do not exceed 5". Your task is to perform a simulation study (see [J]) to show what happens to Box's $\chi^2$ approximation when $p$ exceeds 5 while holding $g$ fixed, e.g., $g = 2$. This means you have to design an experiment to show how badly Box's test performs for large $p$ by choosing appropriate $\Sigma_1$ and $\Sigma_2$, simulating sample data, etc.
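As a rough starting point for such an experiment, the sketch below computes Box's statistic for $g$ groups, assuming the usual textbook form $M = \big[\sum_\ell (n_\ell - 1)\big]\log|S_{\text{pooled}}| - \sum_\ell (n_\ell - 1)\log|S_\ell|$ with the correction $C = (1 - u)M$ referred to a $\chi^2$ distribution with $p(p+1)(g-1)/2$ degrees of freedom. It is written in plain Python for illustration (the project itself asks for R), and the helper names `sample_cov`, `log_det` and `box_m` are hypothetical.

```python
import math
import random

def sample_cov(rows):
    """Unbiased sample covariance matrix of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def log_det(a):
    """log|a| for a symmetric positive definite matrix, via Cholesky."""
    p = len(a)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(p))

def box_m(groups):
    """Return (C, df): corrected statistic C = (1 - u) * M and chi-square df."""
    g, p = len(groups), len(groups[0][0])
    dofs = [len(rows) - 1 for rows in groups]
    covs = [sample_cov(rows) for rows in groups]
    total = sum(dofs)
    # Pooled covariance: degrees-of-freedom weighted average of group covariances.
    pooled = [[sum(d * s[i][j] for d, s in zip(dofs, covs)) / total
               for j in range(p)] for i in range(p)]
    m = total * log_det(pooled) - sum(d * log_det(s) for d, s in zip(dofs, covs))
    u = ((sum(1.0 / d for d in dofs) - 1.0 / total)
         * (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (g - 1)))
    return (1.0 - u) * m, p * (p + 1) * (g - 1) // 2

# Illustrative data only: two groups of 50 draws from N_3(0, I), so H0 holds.
random.seed(1)
groups = [[[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
          for _ in range(2)]
c_stat, df = box_m(groups)
print(c_stat, df)
```

Repeating the last four lines many times for increasing $p$, and recording how often `c_stat` exceeds the upper 5% point of $\chi^2_{df}$, gives an empirical type I error that can be compared with the nominal level.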
Present your results in a clear manner (see [J] for presentation tips).

Question 4 [10 marks]

We are now going to look at the problem of testing that a covariance matrix is equal to a given matrix. If observations $y_1, \dots, y_n$ are multivariate Normal $N_p(\nu, \Psi)$, we wish to test the hypothesis $H_0 : \Psi = \Psi_0$ where $\Psi_0$ is a given positive definite matrix. Let $Q$ be the matrix such that

$$Q \Psi_0 Q' = I,$$

then set $\mu := Q\nu$ and $\Sigma := Q \Psi Q'$. If we define $x_i := Q y_i$ it follows that $x_1, \dots, x_n$ are observations from $N_p(\mu, \Sigma)$ and the hypothesis $H_0$ is transformed to $H_0 : \Sigma = I$. Using the LRT approach, we can find the test statistic

[display equation for $\lambda_1$ lost in extraction; see [B], Section 10.8.1]

Unfortunately $\lambda_1$ is a biased statistic. The following unbiased estimator (denoted $\lambda_1^*$ in [B]) was proposed

[display equation for $\lambda_1^*$ lost in extraction; see [B], Section 10.8.1, and [A]]

where $N := n - 1$ and $S := A/n$. The distribution of $\lambda_1^*$ has the following $\chi^2$ approximation

$$\mathbb{P}(-2\rho\log\lambda_1^* \le z) = \mathbb{P}(C_f \le z) + \frac{\gamma_2}{\rho^2 (n-1)^2}\Big(\mathbb{P}(C_{f+4} \le z) - \mathbb{P}(C_f \le z)\Big) + O(n^{-3}), \qquad (1)$$

where $C_k \sim \chi^2_k$ (i.e., $\chi^2$ distributed with $k$ degrees of freedom), $f := \tfrac{1}{2}p(p+1)$, $\rho := 1 - (2p^2 + 3p - 1)/[6(n-1)(p+1)]$, and $\gamma_2 := p(2p^4 + 6p^3 + p^2 - 12p - 13)/[288(p+1)]$. All the details can be found in [B] Section 10.8.1, [B] around Eq. (19) on p. 441, and [A].

Perform a simulation study to understand the performance (type I error and power) of (1) for $n = 500$ and $p = 5, 10, 50, 100, 300$; see [K].

Question 5 [10 marks]

Continuing the previous question (and its notation), notice that, up to a constant factor, $-2\log\lambda_1^*$ equals $\operatorname{tr} S - \log|S| - p$. Setting $T_1 := \operatorname{tr} S - \log|S| - p$, prove the following theorem.

Theorem 2. Assume that $n \to \infty$, $p \to \infty$, and $p/n \to y \in (0, 1)$. Then

$$T_1 - p\, d_1(y_N) \xrightarrow{d} N(\mu_1, \sigma_1^2),$$

where $N := n - 1$, $y_N := p/N$ and

$$d_1(y) := 1 + \frac{1-y}{y}\log(1-y), \qquad \mu_1 := -\tfrac{1}{2}\log(1-y), \qquad \sigma_1^2 := -2\log(1-y) - 2y.$$

Hint: Apply the Theorem in Lecture 6 on page 7 with $\tfrac{1}{p}T_1 := F(f)$, where $f(x) = x - \log x - 1$.
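To see how the constants in Theorem 2 would be used in practice, here is a minimal Python sketch (the project itself asks for R). It assumes the centring and scaling terms are those of [D], namely $d_1(y) = 1 + \frac{1-y}{y}\log(1-y)$, $\mu_1 = -\frac{1}{2}\log(1-y)$ and $\sigma_1^2 = -2\log(1-y) - 2y$; the function names, including `corrected_lrt_z`, are hypothetical.

```python
import math

def d1(y):
    """Centring term per dimension: 1 + ((1 - y) / y) * log(1 - y)."""
    return 1.0 + (1.0 - y) / y * math.log(1.0 - y)

def mu1(y):
    """Asymptotic mean: -(1/2) * log(1 - y)."""
    return -0.5 * math.log(1.0 - y)

def sigma1_sq(y):
    """Asymptotic variance: -2 * log(1 - y) - 2 * y."""
    return -2.0 * math.log(1.0 - y) - 2.0 * y

def corrected_lrt_z(t1, p, n):
    """Standardise T1 = tr S - log|S| - p so that it is roughly N(0, 1) under H0."""
    y = p / (n - 1)  # y_N = p / N with N = n - 1
    if not 0.0 < y < 1.0:
        raise ValueError("need 0 < p / (n - 1) < 1")
    return (t1 - p * d1(y) - mu1(y)) / math.sqrt(sigma1_sq(y))

# Example: p = 50, n = 101 gives y_N = 0.5.
print(d1(0.5), mu1(0.5), sigma1_sq(0.5))
```

Since $f(x) = x - \log x - 1$ is minimised at $x = 1$, departures from $\Sigma = I$ tend to inflate $T_1$, so one would typically compare the standardised value with an upper quantile of the standard Normal distribution.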
Also see [D].

Question 6 [10 marks]

Continuing the previous question and notation, use the Theorem to construct an algorithm that tests $H_0 : \Sigma = I$ and perform a simulation study to understand its performance (type I error and power) for $p = 5, 10, 50, 100, 300$. Comment on how it performs compared to (1).

Multivariate Time Series

Let $\mathbb{Z}$ denote the set of integers. A sequence of random vector observations $(x_t : t = 1, \dots, T)$ with values in $\mathbb{R}^p$ is called a $p$-dimensional (vector) time series. We denote the sample mean and sample covariance matrix by

[display equations lost in extraction]

The lag-$\tau$ sample cross-covariance (a.k.a. autocovariance) matrix is defined as

[display equation for $S_\tau$ lost in extraction]

The lag-$\tau$ cross-correlation is given by

$$\rho_\tau = D S_\tau D,$$

where $D = \operatorname{diag}(1/\sqrt{s_{11}}, 1/\sqrt{s_{22}}, \dots, 1/\sqrt{s_{pp}})$ and the values come from $S_0 = [s_{ij}]$. Assuming $\mathbb{E}[x_t] = 0$, some authors (e.g., [H], [I]) omit the sample mean and consider the symmetrised lag-$\tau$ sample cross-covariance given by

[display equation lost in extraction]

Question 7 [12 marks]

Simulation is a helpful way to learn about vector time series. Define the matrices

[display defining the matrices $A$ and $\Sigma$ lost in extraction]

Generate 300 observations from the "vector autoregressive" VAR(1) model

$$x_t = A x_{t-1} + \varepsilon_t \qquad (2)$$

where $\varepsilon_t \sim N_2(0, \Sigma)$, i.e., they are i.i.d. bivariate normal random variables with mean zero and covariance $\Sigma$. Note that when simulating it is customary to omit the first 100 or more observations, and you can start with $x_0 = (0, 0)'$.

Also generate 300 observations from the "vector moving average" VMA(1) model

$$x_t = \varepsilon_t + A \varepsilon_{t-1}. \qquad (3)$$

(a) [1] Plot the time series $x_t$ for the VAR(1) model given by (2).
(b) [1] Obtain the first five lags of sample cross-correlations of $x_t$ for the VAR(1) model, i.e., $\rho_1, \dots, \rho_5$.
(c) [1] Plot the time series $x_t$ for the VMA(1) model given by (3).
(d) [1] Obtain the first two lags of sample cross-correlations of $x_t$ for the VMA(1) model.
(e) [5] Implement the test from [F] and reproduce the simulation experiment given in Section 5. This means you need to generate Table 1 from [F].
(f) [3] The file q-fdebt.txt contains the U.S. quarterly federal debts held by (i) foreign and international investors, (ii) federal reserve banks, and (iii) the public. The data are from the Federal Reserve Bank of St. Louis, from 1970 to 2012 for 171 observations, and not seasonally adjusted. The debts are in billions of dollars. Take the log transformation and the first difference for each time series. Let $(x_t)$ be the differenced log series. Test $H_0 : \rho_1 = \dots = \rho_{10} = 0$ vs $H_a : \rho_\tau \neq 0$ for some $\tau \in \{1, \dots, 10\}$ using the test from [F]. Draw the conclusion using the 5% significance level.

Question 8 [13 marks]

More generally, a $p$-dimensional time series $x_t$ follows a VAR model of order $\ell$, VAR($\ell$), if

$$x_t = a_0 + \sum_{i=1}^{\ell} A_i x_{t-i} + \varepsilon_t \qquad (4)$$

where $a_0$ is a $p$-dimensional constant vector, $A_i$ are $p \times p$ (non-zero) matrices for $i > 0$, and $\varepsilon_t \sim N_p(0, \Sigma)$ i.i.d. for all $t$ with $p \times p$ covariance matrix $\Sigma$.

One day you might want to "build a model" using the VAR($\ell$) framework. One of the first things you need to do is to determine the optimal order $\ell$. Tiao and Box (1981) suggest using sequential likelihood ratio tests; see Section 4 in [G]. Their approach is to compare a VAR($\ell$) model with a VAR($\ell - 1$) model and amounts to considering the hypothesis testing problem

$$H_0 : A_\ell = 0 \quad \text{vs.} \quad H_1 : A_\ell \neq 0.$$

We can do this by determining model parameters using a least-squares approach. We rewrite (4) as

$$x_t' = X_t' \beta + \varepsilon_t',$$

where $X_t = (1, x_{t-1}', \dots, x_{t-\ell}')'$ is a $(p\ell + 1)$-dimensional vector and $\beta' = [a_0, A_1, \dots, A_\ell]$ is a $p \times (p\ell + 1)$ matrix (one $p \times 1$ block for $a_0$ plus $\ell$ blocks of size $p \times p$). With observations at times $t = \ell + 1, \dots, T$, we write the data as

$$\mathbf{X} = X\beta + E \qquad (5)$$

where $\mathbf{X}$ is a $(T - \ell) \times p$ matrix with the $i$th row being $x_{\ell+i}'$, $X$ is a $(T - \ell) \times (p\ell + 1)$ design matrix with the $i$th row being $X_{\ell+i}'$, and $E$ is a $(T - \ell) \times p$ matrix with the $i$th row being $\varepsilon_{\ell+i}'$. The matrix $\beta$ contains the coefficient parameters of the VAR($\ell$) model; let $\Sigma_{\varepsilon,\ell}$ be the corresponding innovation covariance matrix. Under a normality assumption, the likelihood ratio for the testing problem is

[display equation lost in extraction; see [G], Section 4]

The likelihood ratio test of $H_0$ is equivalent to rejecting $H_0$ for large values of

[display equation lost in extraction; see [G], Section 4]

A commonly used statistic is Bartlett's approximation given by

$$M(\ell) = -(T - \ell - 1.5 - p\ell)\,\log\frac{|\hat{\Sigma}_{\varepsilon,\ell}|}{|\hat{\Sigma}_{\varepsilon,\ell-1}|},$$

which follows asymptotically (as $T \to \infty$ with $p$ fixed) a $\chi^2$ distribution with $p^2$ degrees of freedom.

The following methodology is suggested for selecting the order $\ell$:

1. Select a positive integer $P$, which is the maximum VAR order that we would like to consider.
2. Set up the regression framework (5) for the VAR($P$) model. That is, there are $T - P$ observations (i.e., rows) in the $\mathbf{X}$ matrix.
3. For $\ell = 0, \dots, P$ compute the least-squares estimate $\hat{\beta}$ of the AR coefficient matrix $\beta$. For $\ell = 0$, we have $\beta = a_0$. Then compute the ML estimate for $\Sigma_{\varepsilon,\ell}$ given by $\hat{\Sigma}_{\varepsilon,\ell} := \frac{1}{T - P} R_\ell' R_\ell$, where $R_\ell = \mathbf{X} - X\hat{\beta}$ is the residual matrix of the fitted VAR($\ell$) model.
4. For $\ell = 1, \dots, P$, compute the test statistic $M(\ell)$ and its p-value, which is based on the asymptotic $\chi^2_{p^2}$ distribution.
5. Examine the test statistics sequentially starting with $\ell = 1$. If all the p-values of the $M(\ell)$ test statistics are greater than the specified type I error for $\ell > m$, then a VAR($m$) model is specified. This is so because the test rejects the null hypothesis $A_m = 0$, but fails to reject $A_\ell = 0$ for $\ell > m$.

Consider a bivariate time series $x_t$ in which the first component is the change in monthly US treasury bills with maturity 3 months and the second component, the CPI rate, is the inflation rate, in percentage, of the U.S. monthly consumer price index (CPI). The data are from the Federal Reserve Bank of St. Louis.
The CPI rate is 100 times the difference of the log CPI index. The sample period is from January 1947 to December 2012. The data are in the file m-cpib3m.txt.

(a) [1] Plot the time series $x_t$.
(b) [6] Select a VAR order for $x_t$ using the methodology described above.
(c) [6] Drawing on your results obtained in this project and the theory discussed in class, explain and demonstrate (e.g., with a simulation study) what might happen with this methodology if the dimensionality $p$ of the time series becomes large.

Question 9 [20 marks]

The recent paper [H] is concerned with extensions of the classical Marchenko-Pastur law to the time series case. Reproduce their simulation study, which is found in Section 5 and Figure 1.

References

[A] Sugiura, Nagao (1968). Unbiasedness of some test criteria for the equality of one or two covariance matrices. Annals of Mathematical Statistics, Vol. 39, No. 5, 1686–1692.
[B] Anderson (2003). An Introduction to Multivariate Statistical Analysis. Wiley.
[C] Johnson, Wichern (2007). Applied Multivariate Statistical Analysis. Pearson Prentice Hall.
[D] Bai, Jiang, Yao, Zheng (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of Statistics, Vol. 37, No. 6B, 3822–3840.
[E] Zheng, Bai, Yao (2017). CLT for eigenvalue statistics of large-dimensional general Fisher matrices with applications. Bernoulli 23(2), 1130–1178.
[F] Li, McLeod (1981). Distribution of the Residual Autocorrelations in Multivariate ARMA Time Series Models. J. R. Stat. Soc. B 43, No. 2, 231–239.
[G] Tiao and Box (1981). Modelling multiple time series with applications. Journal of the American Statistical Association, 76, 802–816.
[H] Liu, Aue, Paul (2015). On the Marchenko-Pastur Law for Linear Time Series. Annals of Statistics, Vol. 43, No. 2, 675–712.
[I] Liu, Aue, Paul (2017). Spectral analysis of sample autocovariance matrices of a class of linear time series in moderately high dimensions. Bernoulli 23(4A), 2181–2209.
[J] http://www4.stat.ncsu.edu/~davidian/st810a/simulation_handout.pdf
[K] https://stats.stackexchange.com/a/40874

Reposted from: http://ass.3daixie.com/2018103123017568.html
