•Educational and Psychological Measurement
•Zhehan Jiang
•Peking University
•Course No. 06716070
•CTT model and reliability
•EPM
•CTT & Reliability
•Formal definition of the CTT model, assumptions, results
•CTT definition of reliability
•Properties of composite scores
•Standard Error of Measurement
•Debates: soon!
•Should CONSEQUENCES be considered “part” of validity?
–Example: using student test scores to evaluate teacher effectiveness
•Can test scores be sufficiently valid without being reliable?
–Example: a driver’s test includes a reliable written portion and an unreliable performance task. Why?
•CTT “Model”
•Test scores are random variables “sampled” from a hypothetical population
•X = T + E
•Definition of E(X): the expected value (long-run average) of X
•True score for an examinee: Tj = E(Xj) = μXj
•True Score
•The true score is the mean, or expected value, of an examinee’s observed scores obtained from a large (theoretically infinite) number of repeated test administrations.
•Theoretically, every examinee has a distribution of possible observed scores…even though we usually only test once.
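As an illustration of this definition, here is a minimal simulation sketch (not part of the course materials; the true score of 50 and error SD of 4 are arbitrary assumptions). The mean of many hypothetical repeated administrations recovers the examinee’s true score:

import numpy as np

rng = np.random.default_rng(0)
T_j = 50.0      # hypothetical true score for examinee j (assumed)
sigma_E = 4.0   # hypothetical error standard deviation (assumed)

# Simulate a large number of hypothetical repeated administrations: X = T + E
X = T_j + rng.normal(loc=0.0, scale=sigma_E, size=100_000)

print(X.mean())  # ≈ 50: the mean of the observed scores recovers Tj
print(X.std())   # ≈ 4: the spread of observed scores reflects error variability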
•Observed Scores
•What would make the observed scores change from one trial to the next? (Hint: True scores don’t change)
–Errors are random and fluctuate
–An examinee’s distribution of observed scores would be centered around his/her true score.
•Observed Scores
•The observed scores have an SD; it reflects the amount of error variability present.
•A really reliable test would have examinees’ observed scores closely clustered around their true scores, with very little random fluctuation.
•Properties of Error
•For examinee j: Ej = Xj - Tj
•Note that Ej is a new random variable and that E(Ej) = 0.
–Because E(Xj) = Tj
•Interpretation: average of errors for one examinee = 0
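Continuing the same kind of simulation (same arbitrary true score and error SD as in the sketch above), the errors for a single examinee do average to zero:

import numpy as np

rng = np.random.default_rng(1)
T_j, sigma_E = 50.0, 4.0                      # assumed true score and error SD
X = T_j + rng.normal(0.0, sigma_E, 100_000)   # repeated administrations: X = T + E
E = X - T_j                                   # Ej = Xj - Tj
print(E.mean())                               # ≈ 0, because E(Xj) = Tj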
•Assumptions
•Reliability
•“Reliability” refers to the consistency (or reproducibility) of scores over administrations.
–Repeated over time, across parallel forms, between raters, or over tasks within an assessment
•Reliability = Repeatability
•Reliability
•One way to think about it: if examinees’ z-scores stay consistent over administrations, the test scores are reliable.
•Another way: scores are reliable to the extent that they are free of randomness, or error.
•Important Note
•As with validity evidence, remember that tests are not reliable, per se, but rather test scores are reliable.
•A test may be administered to a very different population of examinees and produce very different results…
•How to Quantify Reliability?
•We know it is desirable for scores to be relatively free from random error, and we know X = T + E.
•If T and X are highly related, it implies that E and X are weakly related. If X and T are perfectly related, then all Observed variability is due to True scores.
•Reliability Index
•Reliability Index = Correlation between Observed scores and True scores: ρXT
•Estimating Reliability
•The reliability index is an important result, but it isn’t practical without further assumptions being made.
•We can’t observe True scores, only Observed scores, so how could we ever estimate the correlation between the two?
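In a simulation, where true scores are known by construction, the reliability index can be computed directly; with real data it cannot, which is exactly why further assumptions are needed. A minimal sketch (the true-score and error distributions are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
T = rng.normal(50.0, 8.0, n)   # true scores (unobservable with real data)
E = rng.normal(0.0, 4.0, n)    # random errors, independent of T
X = T + E                      # observed scores

rho_XT = np.corrcoef(X, T)[0, 1]   # reliability index
print(rho_XT)                      # ≈ 0.894 = SQRT(64 / 80)
print(rho_XT ** 2)                 # ≈ 0.80 = σ²(T) / σ²(X), previewing the reliability coefficient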
•Parallel
Forms
•The
CTT estimation of reliability depends on the concept of parallel forms. Two
forms are parallel if:
–Each
examinee has the same true score on both forms of the test: Tj1= Tj2
–Error
variances for both forms are equal: s2(E1)= s2(E2)
–Errors
are uncorrelated across forms
–This
assumes the same construct!
•Parallel Forms
•It is difficult (at best) to construct strictly parallel forms, but the concept is important because it makes reliability estimable!
•What’s important is that it’s theoretically possible to construct strictly parallel forms…
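A minimal sketch of two strictly parallel forms (same true scores, equal error variances, uncorrelated errors; all numbers are assumed for illustration). The correlation between the two observed-score columns previews the reliability coefficient defined below:

import numpy as np

rng = np.random.default_rng(3)
n = 100_000
T = rng.normal(50.0, 8.0, n)    # same true scores on both forms: Tj1 = Tj2
E1 = rng.normal(0.0, 4.0, n)    # form 1 errors
E2 = rng.normal(0.0, 4.0, n)    # form 2 errors: equal variance, independent of E1

X1, X2 = T + E1, T + E2         # observed scores on the two parallel forms

print(np.corrcoef(E1, E2)[0, 1])   # ≈ 0: errors uncorrelated across forms
print(np.corrcoef(X1, X2)[0, 1])   # ≈ 0.80: correlation between parallel forms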
•Not-so-parallel forms
•These definitions of forms that are not strictly parallel will be especially helpful when we discuss the task of equating or linking different forms.
–Tau (τ) equivalence
–Essential tau equivalence
–Congenericity (or “congeneric forms”)
•Tau (τ) equivalence
•Tau (τ) equivalence relaxes the assumption of equal error variances (i.e., error variances may be unequal), but keeps the assumption that true scores are equal: Tj1 = Tj2
•Errors still uncorrelated
•Essential Tau equivalence
•Essential Tau (τ) equivalence further relaxes assumptions
–Error variances are not necessarily equal, and
–True scores across forms only differ by an additive constant: Tj1 = Tj2 + c
•Errors still uncorrelated
•Congeneric forms
•Congenericity further relaxes assumptions to allow for different scales across forms
–Error variances are not necessarily equal, and
–True scores across forms differ by a positive linear function: Tj1 = d·Tj2 + c, where d > 0
•Errors still uncorrelated
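A minimal sketch of how the true scores on two forms relate under each of these models (the constants c and d below are arbitrary assumptions for illustration):

import numpy as np

rng = np.random.default_rng(4)
T1 = rng.normal(50.0, 8.0, 100_000)     # true scores on form 1 (assumed distribution)

# How form 2 true scores relate to form 1 true scores under each model
T2_parallel   = T1.copy()               # parallel / tau-equivalent: Tj1 = Tj2
T2_essential  = T1 - 3.0                # essentially tau-equivalent: Tj1 = Tj2 + c
T2_congeneric = (T1 - 5.0) / 1.25       # congeneric: Tj1 = d·Tj2 + c, with d > 0

# In every case the true scores remain perfectly linearly related
for T2 in (T2_parallel, T2_essential, T2_congeneric):
    print(np.corrcoef(T1, T2)[0, 1])    # 1.0 each time; only origin/scale differ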
•Parallel Forms
•All that is required for CTT to work is that the concept of parallel forms is theoretically possible.
•In practice, we will only need to rely on the assumption of congenericity to deal with estimating reliability and equating multiple forms.
•Reliability Coefficient
•Correlation between observed scores across two parallel forms: ρX1X2
•Reliability Coefficient
•Simple, elegant, enduring concept: ρX1X2 = σ²(T) / σ²(X)
•Coefficient vs. Index
•Rel. Coefficient = (Rel. Index)²
•Rel. Index = SQRT(Rel. Coefficient)
•Variance Components
•As Error variance decreases…
–Ratio of True/Observed variance increases
–Reliability coefficient increases
•Interpretations
•Reliability Coefficient = proportion of Observed score variance due to True score variability.
•Reliability Index = correlation between Observed and True scores.
•Importance? Reliability can now be estimated with observable data!
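Continuing the parallel-forms simulation above (same arbitrary assumptions), the three quantities line up exactly as these slides claim:

import numpy as np

rng = np.random.default_rng(5)
n = 100_000
T = rng.normal(50.0, 8.0, n)              # true scores (assumed)
E1, E2 = rng.normal(0.0, 4.0, (2, n))     # equal-variance, uncorrelated errors
X1, X2 = T + E1, T + E2                   # two parallel forms

rel_coef  = np.corrcoef(X1, X2)[0, 1]     # reliability coefficient ρX1X2
var_ratio = T.var() / X1.var()            # σ²(T) / σ²(X)
rel_index = np.corrcoef(X1, T)[0, 1]      # reliability index ρXT (needs T!)

print(rel_coef, var_ratio, rel_index ** 2)   # all ≈ 0.80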
•Importance
•Through the reliability coefficient, we can determine how much of the variability in observed scores is due to differences among TRUE scores (the thing we’re trying to measure!).
•The higher the value (bounded by zero and one), the less the scores are influenced by random error.
•Example
•Let’s say ρX1X2 = 0.81. Then 81% of the Observed score variance is due to True score variance, and σ²(T) = 0.81·σ²(X).
•If σ(X) = 4, we can predict: σ(T) = SQRT(0.81 × 16) = 3.6
•And the correlation between X and T: ρXT = SQRT(0.81) = 0.9
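The same arithmetic as a quick check, with the values taken from the example above:

import math

rel_coef = 0.81              # reliability coefficient from the example
sd_X = 4.0                   # observed-score standard deviation from the example

var_T = rel_coef * sd_X ** 2      # σ²(T) = 0.81 × 16 = 12.96
sd_T = math.sqrt(var_T)           # σ(T) = 3.6
rho_XT = math.sqrt(rel_coef)      # reliability index ρXT = 0.9

print(var_T, sd_T, rho_XT)        # ≈ 12.96, 3.6, 0.9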
•Standard Error of Measurement
•So, if we have measurements across parallel forms, we can estimate the proportion of True score to Observed score variance…so what?
•If we know the proportion of True score variance, we also know the proportion of Error variance.
•Std. Error of Measurement
•By knowing the Error variance, we can state our confidence that an examinee’s test score accurately reflects his/her true ability (i.e., the True score).
•Influence of Error
•We can’t know how much of any one examinee’s score is due to error, but we can estimate the expected amount of variability for observed scores around each examinee’s true score…THINK: “confidence interval”
•True Score
•Remember, True score is defined as the mean, or expected value, of an examinee’s Observed scores from a large number of repeated test administrations.
•Theoretically, every examinee has a distribution of possible observed scores, even though we only observe one (or two).
•Std. Error of Measurement
•We can’t actually compute the standard deviation of possible observed scores for each examinee, but we can estimate the average error standard deviation…
•This is what we call the Standard Error of Measurement (SEM).
–In a couple of weeks we will talk about conditional SEMs.
•Std. Error of Measurement
•SEM = σ(E) = σ(X)·SQRT(1 − ρX1X2)
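A minimal numeric sketch of that relationship, reusing the numbers from the earlier example (ρX1X2 = 0.81 and σ(X) = 4 are carried over; they are still illustrative assumptions):

import math

rel_coef = 0.81                          # reliability coefficient (earlier example)
sd_X = 4.0                               # observed-score SD (earlier example)

# σ²(E) = σ²(X) − σ²(T) = σ²(X)·(1 − reliability)
sem = sd_X * math.sqrt(1.0 - rel_coef)   # standard error of measurement
print(sem)                               # ≈ 1.744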
•Confidence Intervals
•Assuming Normally distributed errors (common in Regression):
•X ± 1σE → 68% CI
–On repeated testing, 68% of the time X would be in this interval
•X ± 1.96σE → 95% CI
–On repeated testing, 95% of the time X would be in this interval
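A quick sketch of these intervals for a single observed score, using the SEM computed above (the observed score of 43 is an arbitrary assumption):

import math

rel_coef, sd_X = 0.81, 4.0
sem = sd_X * math.sqrt(1.0 - rel_coef)          # ≈ 1.744, as above

X_obs = 43.0                                    # a hypothetical examinee's observed score
ci_68 = (X_obs - 1.0 * sem, X_obs + 1.0 * sem)
ci_95 = (X_obs - 1.96 * sem, X_obs + 1.96 * sem)

print(ci_68)   # ≈ (41.26, 44.74)
print(ci_95)   # ≈ (39.58, 46.42)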
•Statistical Analogy
•Reliability Coefficient: ρX1X2 is just like R² from Regression
•Likewise, the standard error of measurement is just like the standard error of estimate.
•Soon we’ll generalize this to predict T from X.
•Typical Reliability Data
•Correlation between scores from the same form administered to the same group of examinees on two separate occasions (coefficient of stability).
–“Test-retest Reliability”
•Correlation between two different forms administered to the same examinees on one occasion (coefficient of equivalence).
–“Parallel-forms Reliability”
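In practice, either coefficient is just a Pearson correlation between two columns of scores from the same examinees. A minimal sketch with made-up data (the score vectors are purely illustrative):

import numpy as np

# Hypothetical scores for the same five examinees on two occasions (illustrative only)
occasion_1 = np.array([12, 18, 25, 31, 40], dtype=float)
occasion_2 = np.array([14, 17, 27, 30, 38], dtype=float)

# Coefficient of stability ("test-retest reliability"): a Pearson correlation
print(np.corrcoef(occasion_1, occasion_2)[0, 1])   # ≈ 0.99 for these made-up scores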
•Typical Reliability Data
•When examinees respond to parallel components within a single test, the consistency among those component scores is estimated by the coefficient of internal consistency.
–Next week’s topic is Internal Consistency: the reliability of composite scores