To elaborate, consider a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, \dots$ with common distribution $F$. In statistics we could be interested in the behaviour of the random variable $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ or some statistic $T_n = T(X_1, \dots, X_n)$ as $n$ grows.

Convergence in $L_p$
As we shall see, convergence in $L_p$ guarantees convergence in probability. However, we will see by counterexample that the converse is not true.
Definition 1.2.4. The sequence of random variables $\{X_n\}_{n \ge 1}$ converges in $L_p$ to a random variable $X$, for $1 \le p < +\infty$, if:
$$\lim_{n \to \infty} E\big[|X_n - X|^p\big] = 0.$$
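For example, if $X_n = X + Z_n$ with $Z_n \sim \mathcal{N}(0, 1/n)$, then $E[|X_n - X|^2] = 1/n \to 0$, so $X_n$ converges to $X$ in $L_2$.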
Theorem 1.2.6. Let $1 \le q < p < +\infty$. If $X_n \to X$ in $L_p$, then $X_n \to X$ in $L_q$.
In order to prove this result, we use the Hölder inequality. Let $1 < s, t < +\infty$ with $1/s + 1/t = 1$; then for any two random variables $Z$ and $Y$ with finite moments of order $s$ and $t$ respectively,
$$E[|ZY|] \le \big(E[|Z|^s]\big)^{1/s}\big(E[|Y|^t]\big)^{1/t}.$$
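Applying this with $Z = |X_n - X|^q$, $Y = 1$ and $s = p/q > 1$ gives
$$E[|X_n - X|^q] \le \big(E[|X_n - X|^p]\big)^{q/p} \longrightarrow 0,$$
which establishes Theorem 1.2.6.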
Theorem 1.2.7. Let $1 \le p < +\infty$. If $X_n \to X$ in $L_p$, then $X_n \to X$ in probability.
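The proof is immediate from the Markov inequality: for any $\epsilon > 0$,
$$P(|X_n - X| > \epsilon) \le \frac{E[|X_n - X|^p]}{\epsilon^p} \longrightarrow 0.$$
A standard counterexample showing the converse fails: let $X_n = n$ with probability $1/n$ and $X_n = 0$ otherwise. Then $P(|X_n| > \epsilon) = 1/n \to 0$, so $X_n \to 0$ in probability, but $E[|X_n|^p] = n^{p-1} \ge 1$ for every $p \ge 1$, so $X_n$ does not converge to $0$ in $L_p$.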

Population, Sample and Models
Often data collected from an experiment are a collection of measurements on a variable of interest. The way one often proceeds in statistics is to hypothesize a model of the data, which we term the population. One then collects data to form a sample. We begin by introducing the notion of a model for data collection.
Definition 2.2.1. The random variables $X_1, \dots, X_n$ are called a random sample of size $n$ from the population $f(x)$ if $X_1, \dots, X_n$ are i.i.d. random variables with PMF or PDF $f(x)$.
In principle, the random variables could be vector-valued, and indeed the i.i.d. property is not really needed per se. In general, one can easily extend this definition to the case where the samples form a Markov chain or exhibit some other type of (interesting) dependence structure. However, to facilitate much of what will follow, we will maintain the assumption that the random variables are i.i.d. In our case we are assuming that the model for the data is the same each time and that our measurements are somehow ‘independent’, which of course excludes some interesting data types, such as those observed in time or space or both. As we have seen already, the joint density (we will use this term interchangeably for PMF or PDF) of the random variables is
$$f(x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i).$$
In much of our work, we will assume that there is a finite-dimensional parameter $\theta$, which is unknown and which characterizes the density $f(x)$; we write this as $f(x|\theta)$. The joint density is thus
$$f(x_1, \dots, x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta).$$
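To fix ideas, for a $\mathcal{N}(\theta, 1)$ population we have $f(x|\theta) = (2\pi)^{-1/2}\exp\{-(x - \theta)^2/2\}$, and the joint density of the sample is $(2\pi)^{-n/2}\exp\{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\}$.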
Much of our work will be on finding good methods to estimate $\theta$ on the basis of observed data (the sample). We have already seen the maximum likelihood method, and this will be considered in more detail later in this course. Other work will focus on constructing formal statistical tests for the parameter $\theta$, where the test has a statistical or physical interpretation: for instance, a test that two variables have ‘no relationship’. Especially in ‘frequentist’ statistics (with which almost all of this course is concerned, despite my own opinion of this type of inference), the test methods, or estimators of $\theta$, may be based on or justified by large-sample (large $n$) properties of the associated model.
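As a small numerical sketch of maximum likelihood in this setting (the population and parameter values below are invented purely for illustration), one can maximize the log-likelihood of a $\mathcal{N}(\theta, 1)$ sample directly:

import numpy as np
from scipy.optimize import minimize

# Simulated i.i.d. sample from a N(theta, 1) population; the true
# value theta = 2 is a hypothetical illustration value.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)

def neg_log_likelihood(theta):
    # -log of prod_i f(x_i | theta) for the N(theta, 1) density,
    # dropping additive constants that do not involve theta
    return 0.5 * np.sum((x - theta) ** 2)

result = minimize(neg_log_likelihood, x0=np.array([0.0]))
print(result.x)    # numerical MLE
print(x.mean())    # closed-form MLE: the sample mean

The numerical optimum agrees with the closed-form answer, the sample mean, which is the MLE for this model.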

We observe data in pairs $(y_1, x_1), \dots, (y_n, x_n)$, with the $y_i$ real-valued and the $x_i$ (possibly vector-valued) covariates. The $y_i$ are termed response variables and the $x_i$ are explanatory variables. It is hypothesised that there exists some functional relationship of the form (although one is not restricted to this scenario):
$$Y_i = g(x_i) + \epsilon_i, \quad i = 1, \dots, n,$$
with the $\epsilon_i$ i.i.d. according to some distribution $F$ ensuring that $E[\epsilon_i] = 0$. That is to say, the variables $x_i$ are somehow relevant for explaining or predicting the $y_i$. To that end, we summarize the functional relationship $g$ by a collection of finite-dimensional parameters $\theta$, which we will seek to estimate from our observed data.
Heart catheterization is sometimes performed on children with congenital heart defects. A Teflon tube is passed into a major vein at the femoral region and pushed up into the heart to obtain information about the heart’s physiology and functional ability. The length of the catheter is typically determined by a physician’s guess. In a small study, the exact catheter length required was determined by using a fluoroscope to check whether the tip of the catheter reached the pulmonary artery. The patients’ heights and weights were recorded. The objective is to see how accurately the catheter length can be determined from these two variables.
We focus upon the case where there are $p$ regressors; that is, for $i \in \{1, \dots, n\}$:
$$Y_i = \theta_1 + \sum_{j=2}^{p} \theta_j x_{ij} + \epsilon_i,$$
where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. zero-mean random variables, $p > 1$. For simplicity we will assume:
$$E[\epsilon_i^2] = \sigma^2 < +\infty; \quad p < n.$$
Neither assumption is absolutely necessary, but we will avoid many mathematical complexities in this manner. In order to introduce a convenient matrix notation we will write $Y = (Y_1, \dots, Y_n)'$ (the $'$ denotes transpose), $\epsilon = (\epsilon_1, \dots, \epsilon_n)'$, and let $X$ be the $n \times p$ matrix with first column of $1$’s and with row $i \in \{1, \dots, n\}$ equal to $(1, x_{i2}, \dots, x_{ip})$. Finally, write $\theta = (\theta_1, \dots, \theta_p)'$; then we have the representation of the linear model as:
$$Y = X\theta + \epsilon.$$
The residual sum of squares (RSS) is defined as
$$\mathrm{RSS}(\theta) = (Y - X\theta)'(Y - X\theta) = \sum_{i=1}^{n}\Big(Y_i - \theta_1 - \sum_{j=2}^{p}\theta_j x_{ij}\Big)^2.$$
The objective is to compute $\theta$ so as to minimize the RSS.
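Setting the gradient of the RSS to zero, $\nabla_\theta \mathrm{RSS}(\theta) = -2X'(Y - X\theta) = 0$, yields the normal equations $X'X\theta = X'Y$; when $X$ has full column rank, the minimizer is $\hat{\theta} = (X'X)^{-1}X'Y$. A minimal numerical sketch (the design, sample size and parameter values below are invented for illustration only):

import numpy as np

# Simulate from Y = X theta + eps with an intercept column of 1's;
# all numbers here are hypothetical illustration values.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta_true = np.array([1.0, 2.0, -0.5])
Y = X @ theta_true + rng.normal(size=n)   # i.i.d. zero-mean errors

# lstsq minimizes RSS(theta) = (Y - X theta)'(Y - X theta), i.e. it
# solves the normal equations X'X theta = X'Y in a stable way.
theta_hat, rss, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
print(theta_hat)   # least squares estimate of theta
print(rss)         # minimized residual sum of squares

Here np.linalg.lstsq is preferred to forming $(X'X)^{-1}$ explicitly, as it is more numerically stable.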
The Likelihood Principle. In making inferences about $\theta$, after the data are observed, all relevant experimental information is contained in the likelihood function for the observed data. Furthermore, two likelihood functions contain the same information about $\theta$ if they are proportional to each other.
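For instance, if $x$ successes are observed in $n$ Bernoulli($\theta$) trials, the binomial likelihood is $\binom{n}{x}\theta^x(1-\theta)^{n-x}$, while if one instead samples until the $x$-th success, the negative binomial likelihood is $\binom{n-1}{x-1}\theta^x(1-\theta)^{n-x}$. The two are proportional as functions of $\theta$, so by the likelihood principle they contain the same information about $\theta$.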
