2021-04-24

•Educational and Psychological Measurement

•Zhehan Jiang

•Peking University


•Course No. 06716070

•CTT model and reliability

•EPM

•CTT & Reliability

•Formal definition of the CTT model, assumptions, results.

•CTT definition of reliability

•Properties of composite scores

•Standard Error of Measurement

•Debates: soon!

•Should CONSEQUENCES be considered “part” of validity?

–Example: using student test scores to evaluate teacher effectiveness

•Can test scores be sufficiently valid without being reliable?

–Example: Driver’s test includes a reliable written portion and an unreliable performance task. Why?

•CTT “Model”

•Test scores are random variables “sampled” from a hypothetical population

•X = T + E

•Definition of E(X)

•True score for an examinee: $T_j = E(X_j) = \mu_{X_j}$

•True Score

•The true score is the mean, or expected value, of an examinee’s observed scores obtained from a large (theoretically infinite) number of repeated test administrations.

•Theoretically, every examinee has a distribution of possible observed scores… even though we usually only test once.

•Observed Scores

•What would make the observed scores change from one trial to the next? (Hint: True scores don’t change)

–Errors are random and fluctuate

–An examinee’s distribution of observed scores would be centered around his/her true score.

•Observed Scores

•The observed scores have an SD, which reflects the amount of error variability present.

•A really reliable test would have examinees’ observed scores closely clustered around their true scores, with very little random fluctuation.

•Properties of Error

•For examinee j: $E_j = X_j - T_j$

•Note that $E_j$ is a new random variable and that $E(E_j) = 0$.

–Because $E(X_j) = T_j$

•Interpretation: average of errors for one examinee = 0
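The idea of repeated administrations for one examinee can be sketched in a small simulation. This is only an illustration: the true score of 70, the error SD of 5, the Normal error distribution, and the function name `simulate_administrations` are all assumptions made for the example, not values from the course.

```python
import random
import statistics


def simulate_administrations(true_score, error_sd, n_trials, seed=0):
    """Return observed scores X = T + E for one examinee over repeated trials."""
    rng = random.Random(seed)
    # Errors are drawn with mean 0; Normality is an assumption of this sketch.
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_trials)]


true_score = 70.0  # hypothetical T_j
error_sd = 5.0     # hypothetical error SD for this examinee

observed = simulate_administrations(true_score, error_sd, n_trials=100_000)
errors = [x - true_score for x in observed]

mean_observed = statistics.fmean(observed)  # approaches T_j as trials grow
mean_error = statistics.fmean(errors)       # approaches 0
sd_observed = statistics.pstdev(observed)   # approaches the error SD

print(mean_observed, mean_error, sd_observed)
```

Because the true score is fixed across trials, all of the spread in this examinee's observed scores comes from error, which is why the SD of the observed scores matches the error SD.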

•Assumptions

•Reliability

•“Reliability” refers to the consistency (or reproducibility) of scores over administrations.

–Repeated over time, across parallel forms, between raters, or over tasks within an assessment

•Reliability = Repeatability

•Reliability

•One way to think about this: if z-scores for examinees stay consistent over administrations, the test scores are reliable.

•Another way: scores are reliable to the extent that they are free of randomness or error.

•Important Note

•As with validity evidence, remember that tests are not reliable, per se, but rather test scores are reliable.

•A test may be administered to a very different population of examinees and produce very different results…

•How to Quantify Reliability?

•We know it is desirable for scores to be relatively free from random error, and we know X = T + E.

•If T and X are highly related, it implies that E and X are weakly related. If X and T are perfectly related, then all Observed variability is due to True scores.

•Reliability Index

•Reliability Index = Correlation between Observed scores and True scores: $\rho_{XT}$

•Estimating Reliability

•The reliability index is an important result, but it isn’t practical without further assumptions being made.

•We can’t observe True scores, only Observed scores, so how could we ever estimate the correlation between the two?

•Parallel Forms

•The CTT estimation of reliability depends on the concept of parallel forms. Two forms are parallel if:

–Each examinee has the same true score on both forms of the test: $T_{j1} = T_{j2}$

–Error variances for both forms are equal: $\sigma^2(E_1) = \sigma^2(E_2)$

–Errors are uncorrelated across forms

–This assumes the same construct!

•Parallel Forms

•It is difficult (at best) to construct strictly parallel forms, but the concept is important because it makes reliability estimable!

•What’s important is that it’s theoretically possible to construct strictly parallel forms…

•Not-so-parallel forms

•These definitions of forms that are not strictly parallel will be especially helpful when we discuss the task of equating or linking different forms.

–Tau ($\tau$) equivalence

–Essential tau equivalence

–Congenericity (or “congeneric forms”)

•Tau ($\tau$) equivalence

•Tau ($\tau$) equivalence relaxes the assumption of equal error variances (i.e., error variances may be unequal), but keeps the assumption that true scores are equal: $T_{j1} = T_{j2}$

•Errors still uncorrelated

•Essential Tau equivalence

•Essential Tau ($\tau$) equivalence further relaxes assumptions

–Error variances are not necessarily equal, and

–True scores across forms only differ by an additive constant: $T_{j1} = T_{j2} + c$

•Errors still uncorrelated

•Congeneric forms

•Congenericity further relaxes assumptions to allow for different scales across forms

–Error variances are not necessarily equal, and

–True scores across forms differ by a positive linear function: $T_{j1} = d \cdot T_{j2} + c$, where $d > 0$

•Errors still uncorrelated
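The four kinds of forms nest inside one another, and a short sketch may make the nesting easier to see. The `FormRelation` container and the `classify_forms` function below are hypothetical names made up for illustration. The sketch assumes the relation between true scores is written as T_j1 = d·T_j2 + c and that errors are uncorrelated across forms in every case.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FormRelation:
    """Hypothetical description of how form 1 relates to form 2."""
    d: float            # slope in T_j1 = d * T_j2 + c
    c: float            # additive constant in T_j1 = d * T_j2 + c
    error_var_1: float  # error variance of form 1
    error_var_2: float  # error variance of form 2


def classify_forms(rel: FormRelation) -> str:
    """Return the strictest of the four labels that the relation satisfies.

    Uncorrelated errors across forms are assumed throughout.
    """
    if rel.d <= 0:
        raise ValueError("congeneric forms require d > 0")
    same_scale = math.isclose(rel.d, 1.0)
    no_shift = math.isclose(rel.c, 0.0, abs_tol=1e-12)
    equal_errors = math.isclose(rel.error_var_1, rel.error_var_2)
    if same_scale and no_shift:
        return "parallel" if equal_errors else "tau-equivalent"
    if same_scale:
        return "essentially tau-equivalent"
    return "congeneric"


# Made-up values, one per kind of form.
examples = [
    FormRelation(d=1.0, c=0.0, error_var_1=4.0, error_var_2=4.0),
    FormRelation(d=1.0, c=0.0, error_var_1=4.0, error_var_2=9.0),
    FormRelation(d=1.0, c=2.5, error_var_1=4.0, error_var_2=9.0),
    FormRelation(d=0.5, c=2.5, error_var_1=4.0, error_var_2=9.0),
]
labels = [classify_forms(rel) for rel in examples]
print(labels)
```

Each step down the list drops one restriction, so every parallel pair is also tau-equivalent, essentially tau-equivalent, and congeneric.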

•Parallel Forms

•All that is required for CTT to work is that the concept of parallel forms is theoretically possible.

•In practice, we will only need to rely on the assumption of congenericity to deal with estimating reliability and equating multiple forms.

•Reliability Coefficient

•Correlation between observed scores across two parallel forms: $\rho_{X_iX_j}$

•Reliability Coefficient

•Simple, elegant, enduring concept: $\rho_{X_iX_j} = \sigma^2(T) / \sigma^2(X)$

•Coefficient vs. Index

•Rel. Coefficient = (Rel. Index)²

•Rel. Index = SQRT(Rel. Coefficient)
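A short derivation may help show why the coefficient is the square of the index. It is a sketch that assumes two strictly parallel forms, $X_1 = T + E_1$ and $X_2 = T + E_2$, and the usual CTT assumption that true scores are uncorrelated with errors (the extracted text does not list that assumption explicitly).

```latex
\begin{align*}
\operatorname{Cov}(X_1, X_2)
  &= \operatorname{Cov}(T + E_1,\; T + E_2) \\
  &= \sigma^2(T) + \operatorname{Cov}(T, E_2) + \operatorname{Cov}(E_1, T) + \operatorname{Cov}(E_1, E_2) \\
  &= \sigma^2(T) \\[4pt]
\sigma^2(X_1) &= \sigma^2(T) + \sigma^2(E_1) = \sigma^2(T) + \sigma^2(E_2) = \sigma^2(X_2) = \sigma^2(X) \\[4pt]
\rho_{X_1 X_2} &= \frac{\operatorname{Cov}(X_1, X_2)}{\sigma(X_1)\,\sigma(X_2)} = \frac{\sigma^2(T)}{\sigma^2(X)} \\[4pt]
\rho_{XT} &= \frac{\operatorname{Cov}(T + E,\; T)}{\sigma(X)\,\sigma(T)} = \frac{\sigma^2(T)}{\sigma(X)\,\sigma(T)} = \frac{\sigma(T)}{\sigma(X)} \\[4pt]
\rho_{XT}^2 &= \frac{\sigma^2(T)}{\sigma^2(X)} = \rho_{X_1 X_2}
\end{align*}
```

The first three lines use only observable scores from two forms, which is what makes the otherwise unobservable $\rho_{XT}$ estimable.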

•Variance Components

•As Error variance decreases…

–Ratio of True/Observed variance increases

–Reliability coefficient increases

•Interpretations

•Reliability Coefficient = proportion of Observed score variance due to True score variability.

•Reliability Index = correlation between Observed and True scores.

•Importance? Reliability can now be estimated with observable data!
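Both interpretations can be checked in a small simulation of two parallel forms. Everything numeric here is an assumption chosen for illustration: true scores with mean 50 and SD 2, Normal errors with SD 1 on both forms (so the theoretical coefficient is 4 / 5 = 0.8), and a hand-written `pearson` helper.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


rng = random.Random(1)
n_examinees = 50_000
true_sd, error_sd = 2.0, 1.0  # hypothetical values, reliability = 4 / (4 + 1)

# Same true score on both forms, equal error variances, independent errors.
true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_examinees)]
form_1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
form_2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]

coefficient = pearson(form_1, form_2)  # uses observable scores only
index = pearson(form_1, true_scores)   # needs T, available only in simulation
variance_ratio = statistics.pvariance(true_scores) / statistics.pvariance(form_1)

print(coefficient, index ** 2, variance_ratio)
```

All three printed values should land near 0.8. Only the first can be computed from real data, since true scores are never observed outside a simulation.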

•Importance

•Through the reliability coefficient, we can determine how much of the variability in observed scores is due to differences among TRUE scores (the thing we’re trying to measure!).

•The higher the value (bounded by zero and one), the less influenced by random errors the scores are.

•Example

•Let’s say $\rho_{X_iX_j} = 0.81$. 81% of the Observed score variance is due to True score variance, and $\sigma^2(T) = 0.81\,\sigma^2(X)$.

•If $\sigma(X) = 4$, we can predict: $\sigma(T) = \sqrt{0.81 \times 16} = 3.6$

•And, the correlation between X and T: $\rho_{XT} = \sqrt{0.81} = 0.9$
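The arithmetic in this example can be written as a small helper. The function name `decompose_variance` is hypothetical, and the error variance line relies on the standard CTT result that Observed variance is the sum of True and Error variance.

```python
import math


def decompose_variance(observed_sd, reliability):
    """Split Observed variance into True and Error parts from a reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    observed_var = observed_sd ** 2
    true_var = reliability * observed_var
    return {
        "true_variance": true_var,
        "true_sd": math.sqrt(true_var),
        # Assumes Observed variance = True variance + Error variance.
        "error_variance": observed_var - true_var,
        "reliability_index": math.sqrt(reliability),
    }


result = decompose_variance(observed_sd=4.0, reliability=0.81)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With the slide's inputs, the true SD comes out at about 3.6 and the reliability index at about 0.9, matching the hand calculation.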

•Standard Error of Measurement

•So, if we have measurements across parallel forms, we can estimate the proportion of True score to Observed score variance… so what?

•If we know the proportion of True score variance, we also know the proportion of Error variance.

•Std. Error of Measurement

•By knowing the Error variance, we can use this information to state our confidence that an examinee’s test score accurately reflects his/her true ability (i.e., the True score).

•Influence of Error

•We can’t know how much of any one examinee’s score is due to error, but we can estimate the expected amount of variability for observed scores around each examinee’s true score… THINK: “confidence interval”

•True Score

•Remember, True score is defined as the mean, or expected value, of an examinee’s Observed scores from a large number of repeated test administrations.

•Theoretically, every examinee has a distribution of possible observed scores, even though we only observe one (or two).

•Std. Error of Measurement

•We can’t actually compute the standard deviation of possible observed scores for each examinee, but we can estimate the average error standard deviation…

•This is what we call the Standard Error of Measurement (SEM).

–In a couple of weeks we will talk about conditional SEMs.

•Std. Error of Measurement

•Confidence Intervals

•Assuming Normally distributed errors (common in Regression):

•$X \pm 1\sigma_E \rightarrow 68\%$

–On repeated testing, about 68% of intervals built this way would contain the true score T

•$X \pm 1.96\sigma_E \rightarrow 95\%$

–On repeated testing, about 95% of intervals built this way would contain the true score T
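A minimal sketch of how these bands could be computed follows. It assumes the standard CTT expression SEM = σ(X)·√(1 − reliability), which does not appear in the extracted slide text, and it reuses σ(X) = 4 and a reliability of 0.81 from the earlier example. The observed score of 30 and the function names are made up for illustration.

```python
import math


def standard_error_of_measurement(observed_sd, reliability):
    """SEM under the assumed CTT formula: sigma_X * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return observed_sd * math.sqrt(1.0 - reliability)


def score_band(observed_score, sem, z=1.96):
    """Return (lower, upper) for observed_score +/- z * sem, assuming Normal errors."""
    half_width = z * sem
    return (observed_score - half_width, observed_score + half_width)


sem = standard_error_of_measurement(observed_sd=4.0, reliability=0.81)
band_68 = score_band(30.0, sem, z=1.0)   # hypothetical observed score of 30
band_95 = score_band(30.0, sem, z=1.96)

print(f"SEM = {sem:.3f}")
print(f"68% band: {band_68[0]:.2f} to {band_68[1]:.2f}")
print(f"95% band: {band_95[0]:.2f} to {band_95[1]:.2f}")
```

This uses a single average SEM for every examinee, so the band has the same width at every score level.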

•Statistical Analogy

•Reliability Coefficient: $\rho_{X_iX_j}$ is just like $R^2$ from Regression

•Likewise, the standard error of measurement is just like the standard error of estimate.

•Soon we’ll generalize this to predict T from X.

•Typical Reliability Data

•Correlation between scores from the same form administered to the same group of examinees on two separate occasions (coefficient of stability).

–“Test-retest Reliability”

•Correlation between two different forms administered to the same examinees on one occasion (coefficient of equivalence).

–“Parallel-forms Reliability”

•Typical Reliability Data

•Correlation among test scores when examinees respond to parallel components repeatedly is estimated by the coefficient of internal consistency.

–Next week’s topic is Internal Consistency: the reliability of composite scores
