BAYESIAN DIAGNOSTICS FOR TEST DESIGN AND ANALYSIS

This paper attempts to bridge the gap between classical test theory and item response theory. It is demonstrated that the familiar and popular statistics used in classical test theory can be translated into a Bayesian framework where all of the advantages of the Bayesian paradigm can be realized. In particular, prior opinion can be introduced and inferences can be obtained using posterior distributions. In classical test theory, inferential decisions are based on the values of statistics that are calculated from the responses of subjects over various test questions. In the proposed approach, analogous “statistics” are constructed from the output of simulation from the posterior distribution. This leads to population-based inferences which focus on the properties of the test rather than the performance of specific subjects. The use of the JAGS programming language facilitates extensions to more complex scenarios involving the assessment of tests and questionnaires.

Introduction
The important problems of test/questionnaire design and analysis have historically been approached from either the perspective of classical test theory (CTT) or item response theory (IRT). Both of these research areas have an extensive literature where numerous comparative studies have been carried out (e.g. Hambleton and Jones 1993, Fan 1998, Guler, Uyanik and Teker 2014, Kohli, Koran and Henn 2015, Raykov and Marcoulides 2016). As research developments have progressed, the distinction between classical test theory and item response theory has narrowed. However, in a very brief and perhaps oversimplified comparison of the two approaches, CTT is the original testing framework; it essentially concerns the results of test questions on a specific sample of respondents and has few (if any) modeling assumptions. One of the appealing aspects of CTT is that the corresponding statistics are relatively simple and guidelines have been introduced for the assessment of these statistics. In the IRT framework, more complex models are considered where these models have components (i.e. parameters) that distinguish particular aspects of tests and are generalizable to a population of respondents. IRT relies more on statistical theory and is less accessible to some practitioners. IRT has grown in many directions where various models have been proposed. Most notably, Bayesian implementations of IRT now exist (Fox 2010, Levy and Mislevy 2016), and these require another level of statistical sophistication on the part of the practitioner. In this paper, we demonstrate how some of the very simple and still popular statistics of CTT can be directly translated into a Bayesian IRT framework. The advantage to the practitioner is that they may continue using familiar measures but simultaneously take advantage of the utility of the Bayesian paradigm. For example, they can introduce subjective prior opinion (if deemed necessary) and they can view their familiar measures from the perspective of populations (using posterior distributions). In addition, the use of the JAGS programming language (Plummer 2015) facilitates extensions to more complex scenarios involving the assessment of tests and questionnaires.
In Section 2, we provide the background for the typical testing framework involving dichotomous responses arising from test questions. In this context, some of the common statistics used in CTT are provided. This scenario is then embedded into a Bayesian framework and it is demonstrated how the familiar testing measures can be easily translated into Bayesian diagnostics. Initially, a very simple prior distribution is introduced. In this section, we emphasize the advantages of the proposed approach over the use of the familiar statistics used in CTT. We also demonstrate how missing data pose no difficulty.
In Section 3, we examine some real data taken from the aviation industry that consist of the results of multiple-choice questions given to pilots. We compare the traditional statistics with analogous Bayesian diagnostics. We also consider several extensions to the basic model introduced in Section 2. In particular, we introduce a more realistic prior which recognizes that some questions are more/less difficult for most respondents and that some respondents are stronger/weaker across most questions. The prior is also beneficial in that it reduces the effective dimensionality of the parametrization. We also indicate how the model can be extended to account for different instructors who have an effect on the performance of their students. Finally, we provide a discussion in Section 4 and a short conclusion in Section 5.

Printed ISSN: 2336-2375

Materials and Methods
We consider test data presented in an n × k matrix $X = (x_{ij})$ where the n rows correspond to the respondents and the k columns refer to the test questions. The data are dichotomous (binary) where $x_{ij} = 1\,(0)$ specifies that the ith respondent provides a correct (incorrect) answer to the jth question. Therefore, the setup is applicable to true/false questions and to multiple-choice questions. For questions with ordinal grading, it is possible to introduce a threshold that corresponds to pass (fail) so that such questions can also be analyzed within the above framework.
In CTT, there are various statistics that have been proposed to assess the characteristics of test questions and the overall test. We now review three of these statistics. The first statistic, sometimes referred to as the P-value, is calculated on each of the k test questions. For the jth question, its P-value is defined as
$$p_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \tag{1}$$
and is the proportion of correct responses on the jth question. Typically, a question is not viewed as a "good" question if its P-value is either too close to 0 (the question is difficult) or too close to 1 (the question is easy). In such cases, there is little testing taking place since most respondents have the same result.
The second statistic, referred to as the discrimination index, is also calculated for each of the k test questions. For the jth question, its discrimination index is defined as
$$d_j = \frac{N_{Uj} - N_{Lj}}{n/2} \tag{2}$$
where $N_{Uj}$ is the number of 'strong' students who answered the jth question correctly and $N_{Lj}$ is the number of 'weak' students who answered the jth question correctly. The subscripts U and L denote 'upper' and 'lower' respectively. The strong and weak students are categorized into two groups according to their overall test score where the test score for the ith student is given by
$$x_i = \sum_{j=1}^{k} x_{ij}.$$
When n is even and the order statistics $x_{(n/2)}$ and $x_{(n/2+1)}$ of the test scores differ, then the two groups form a partition of the set of the n respondents. In other cases, slight adjustments are made in forming the two groups. The discrimination index lies in the interval (−1, 1) where large positive values are viewed as desirable (strong students do better on the question than weak students), values near zero indicate that the question does not differentiate between strong and weak students, and negative values are viewed as undesirable (weak students do better on the question than strong students).
The third statistic, referred to as Cronbach's alpha, is used to describe the reliability or internal consistency of the overall test. It is defined as
$$\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{j=1}^{k} s_j^2}{s^2} \right) \tag{3}$$
where $s^2$ is the variance of the respondents' test scores, $s_j^2$ is the variance with respect to the jth question, and where values near the upper limit α = 1 are generally preferred (DeVellis 2012). However, we note that various criticisms have been made related to the above interpretation (Sijtsma 2009). For example, if for each subject, the k questions all have the same response, then the questions are redundant, which is obviously not desirable. However, in this case, α = 1.
Before introducing the Bayesian analogue corresponding to CTT, there are two points that we wish to emphasize. First, although IRT has overtaken CTT in various ways, the CTT statistics (1), (2) and (3) are still widely used in practice (see for example, Yuan et al. 2012, Brozova and Rydval 2014). Second, as forcefully argued in the IRT literature (e.g. Hambleton and Jones 1993), an important feature of the more complex IRT models is that item (question) performance is linked to respondent ability. In other words, the results on test questions vary according to the strength of the student. The models and methods introduced in this paper preserve the simplicity of the common CTT statistics yet allow for the interplay between item performance and ability.
Our approach is based on simple Bernoulli models where $x_{ij} \sim \text{Bernoulli}(\theta_{ij})$. The model stipulates that the probability of a correct answer by the ith respondent to the jth question is given by
$$\text{Prob}(x_{ij} = 1) = \theta_{ij}. \tag{4}$$
An immediate reaction to (4) may be that the model is problematic since there are as many parameters nk as there are data values. However, in a Bayesian approach, prior information is available and parameters may "borrow" from one another such that the effective parameterization is reduced. Under (4), the development of measures comparable to the statistics (1), (2) and (3) is straightforward. Instead of calculating (1), (2) and (3) based on the data matrix X, the calculations are carried out on the parameter matrix $\Theta = (\theta_{ij})$.
And herein lies a possible second reaction: the $\theta_{ij}$ are unknown. How can one calculate "statistics" based on Θ?
The answer again relies on the Bayesian formulation. Under a simulation-based Bayesian approach, Θ's are generated from the posterior distribution, and each simulated sample gives rise to the analogous measures. An important added benefit is that we do not have a single observed statistic (p, d, α) as in CTT, but rather, we have a posterior distribution corresponding to our new measures and this facilitates the assessment of variability. These features and other features are emphasized in the real data example presented in Section 4.
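To make the translation concrete, the following minimal Python sketch computes the three measures from any n × k matrix, so that the same functions can be applied either to the data matrix X or to a simulated Θ. It is an illustration rather than the code used for the analyses in this paper: the function names are hypothetical, the discrimination index is taken in its upper-half/lower-half form with n assumed even, and ties at the median test score are broken by row order.

```python
from statistics import variance

def p_values(m):
    """Column means: proportion correct for X, mean probability for Theta."""
    n, k = len(m), len(m[0])
    return [sum(row[j] for row in m) / n for j in range(k)]

def discrimination(m):
    """(N_U - N_L) / (n/2) for each question, splitting rows at the median score.

    Assumes n is even; ties at the median are broken by row order (assumption).
    """
    n, k = len(m), len(m[0])
    half = n // 2
    order = sorted(range(n), key=lambda i: sum(m[i]))  # ascending test score
    lower, upper = order[:half], order[n - half:]
    return [
        (sum(m[i][j] for i in upper) - sum(m[i][j] for i in lower)) / half
        for j in range(k)
    ]

def cronbach_alpha(m):
    """k/(k-1) * (1 - sum of item variances / variance of test scores)."""
    k = len(m[0])
    totals = [sum(row) for row in m]
    item_vars = [variance([row[j] for row in m]) for j in range(k)]
    return k / (k - 1) * (1 - sum(item_vars) / variance(totals))

# Hypothetical toy data: 4 respondents, 3 questions.
X = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
print(p_values(X), discrimination(X), cronbach_alpha(X))
```

Because both variances in Cronbach's alpha use the same divisor, the choice between n and n − 1 does not affect the ratio.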
There is another attractive aspect of the Bayesian formulation. Whereas the statistics (1), (2) and (3) refer to the observed X values, the Bayesian measures refer to the probabilities associated with the questions and the respondents. We suggest that this corresponds to the real problem of interest, where the properties of the questions/respondents are more important to practitioners than the particular sample. The idea of focusing on population quantities (i.e. parameters) rather than statistics (i.e. data) has been previously explored; see for example Swartz (2011) in the context of clustering. We also mention that there is great flexibility in the approach. Not only can the statistics (1), (2) and (3) be translated to Bayesian versions, we can do likewise with any CTT statistic.
The only additional ingredient that is required for the Bayesian implementation is the specification of a prior distribution on the parameters. Initially, we consider a somewhat unrealistic prior where we assume that the $\theta_{ij}$ are independent and identically distributed (iid) Uniform(0, 1) random variables. The Uniform distribution is sometimes referred to as a reference prior; it is flat and has the required domain $\theta_{ij} \in (0, 1)$. Above, we alluded to simulation-based Bayesian software. Accordingly, we use the JAGS programming language which is relatively simple to use and avoids the need for special purpose Markov chain Monte Carlo code. JAGS is open source software (www.mcmc-jags.sourceforge.net) which is very similar to WinBUGS. Details on WinBUGS and an introduction to the Bayesian approach are given by Lunn et al. (2013).
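Under the iid Uniform(0, 1) prior, the Bernoulli model is conjugate: a Uniform(0, 1) prior is a Beta(1, 1) distribution, so each $\theta_{ij}$ has posterior Beta(1 + $x_{ij}$, 2 − $x_{ij}$) and can be drawn directly without MCMC. The sketch below uses this fact as a stand-in for the JAGS run described above; the function names and toy data are hypothetical, and only the Bayesian P-value is computed for brevity.

```python
import random

def draw_theta(x, rng):
    """One posterior draw of Theta under iid Uniform(0, 1) priors.

    Conjugacy: theta_ij | x_ij ~ Beta(1 + x_ij, 2 - x_ij).
    """
    return [[rng.betavariate(1 + xij, 2 - xij) for xij in row] for row in x]

def bayesian_p_values(x, n_sims=1000, seed=1):
    """Return n_sims simulated vectors of Bayesian P-values (column means of Theta)."""
    rng = random.Random(seed)
    n, k = len(x), len(x[0])
    sims = []
    for _ in range(n_sims):
        theta = draw_theta(x, rng)
        sims.append([sum(row[j] for row in theta) / n for j in range(k)])
    return sims

# Hypothetical toy data: question 1 answered correctly by all, question 2 by half.
x = [[1, 1], [1, 0], [1, 1], [1, 0]]
sims = bayesian_p_values(x)
post_mean = [sum(s[j] for s in sims) / len(sims) for j in range(2)]
print(post_mean)
```

Each element of `sims` plays the role of one simulated Bayesian P-value vector; histograms or scatterplots of these draws give the posterior distributions of the measures.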

Relationship of approach to IRT
Various models have been proposed in IRT. In a three-parameter logistic IRT model, we retain the notation above and express
$$\theta_{ij} = c_j + (1 - c_j) \, \frac{1}{1 + \exp(-a_j (p_i - b_j))} \tag{5}$$
where $p_i$ is the ability parameter for the ith respondent and $a_j$, $b_j$ and $c_j$ are characteristics of the jth test question.
The relationship (5) is known as an item response function (IRF).
The IRF is an important feature of IRT and is typically plotted as a function of the ability $p_i$ for estimated test characteristics $\hat{a}_j$, $\hat{b}_j$ and $\hat{c}_j$. One of the notable differences between our approach and IRT is that we allow more freedom in the $\theta_{ij}$ parameters since the $\theta_{ij}$ are assigned a prior probability distribution. In IRT, the functional relationship is fixed according to (5) or by some alternative IRT model. Accordingly, in our framework, measures such as the Bayesian P-value and the Bayesian discrimination are not constrained by functional relationships.
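For readers less familiar with the IRF, a short Python sketch of the standard three-parameter logistic form may help. The function name and the parameter values are illustrative assumptions and are not estimates from this paper.

```python
import math

def irf_3pl(p, a, b, c):
    """Three-parameter logistic IRF: probability of a correct answer.

    p: respondent ability; a: discrimination; b: difficulty; c: guessing level.
    """
    return c + (1 - c) / (1 + math.exp(-a * (p - b)))

# Hypothetical question with a = 1.5, b = 0.0, c = 0.2.
for ability in (-3.0, 0.0, 3.0):
    print(ability, round(irf_3pl(ability, 1.5, 0.0, 0.2), 3))
```

When the ability equals the difficulty b, the probability is halfway between the guessing level c and 1.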

Missing data
The Bayesian model is appealing in its simplicity. Via the simulated parameters $\theta_{ij}$, researchers are able to investigate questions involving both respondents and test questions. One of the added advantages of a Bayesian approach is the elegance and ease with which missing data can be handled. For example, there are exams where test questions are randomly generated from a databank for each student or subsets of students. In these situations, individual students answer only some of the questions. In this sense, there are missing data. We therefore distinguish between the observed data $x_{obs}$ and the missing data $x_{mis}$. Letting [A | B] denote the generic conditional density of A given B, the relevant posterior distribution in this case is
$$[\theta, x_{mis} \mid x_{obs}] \propto [x_{obs}, x_{mis} \mid \theta] \, [\theta]. \tag{6}$$
The key observation from (6) is that $[x_{obs}, x_{mis} \mid \theta] \, [\theta]$ is the unnormalized posterior density that one would obtain if $x_{mis}$ were actually observed. Therefore, one simulates as before except that $x_{mis}$ takes the role of a random parameter rather than a fixed data value. To handle missing data in JAGS, we need only code the unobserved data values with the NA symbol, which requires no change to the model specification.
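The role of $x_{mis}$ as a random parameter can be illustrated with a small Gibbs sampler. The sketch below assumes the iid Uniform(0, 1) prior, codes missing responses as `None` (the analogue of NA), and alternates between drawing Θ given the completed data and redrawing the missing responses given Θ. It is a hand-written illustration with a hypothetical function name, not the sampler that JAGS constructs internally.

```python
import random

def gibbs_with_missing(x, n_sims=2000, seed=1):
    """Posterior draws of Theta when some responses are missing (None)."""
    rng = random.Random(seed)
    n, k = len(x), len(x[0])
    # Initialise each missing response with a coin flip.
    filled = [[rng.randint(0, 1) if v is None else v for v in row] for row in x]
    draws = []
    for _ in range(n_sims):
        # Step 1: theta_ij | x_ij ~ Beta(1 + x_ij, 2 - x_ij) for every cell.
        theta = [[rng.betavariate(1 + v, 2 - v) for v in row] for row in filled]
        # Step 2: each missing x_ij | theta_ij ~ Bernoulli(theta_ij).
        for i in range(n):
            for j in range(k):
                if x[i][j] is None:
                    filled[i][j] = 1 if rng.random() < theta[i][j] else 0
        draws.append(theta)
    return draws

# Hypothetical toy data: respondent 1 did not receive question 2.
x = [[1, None], [0, 1]]
draws = gibbs_with_missing(x)
mean_missing = sum(d[0][1] for d in draws) / len(draws)
mean_observed = sum(d[0][0] for d in draws) / len(draws)
print(mean_missing, mean_observed)
```

With this independent prior, a missing cell carries no information, so its $\theta_{ij}$ draws simply reproduce the prior; under a prior that links respondents and questions, the missing cell would instead borrow from the observed responses.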

Results
We consider the results of a multiple-choice exam given to pilots where there are n = 307 respondents (pilots) and k = 10 test questions. In the aviation industry, safety is of paramount importance, and therefore, the proportion of correct answers must be very high. We first calculate various CTT statistics. For this dataset the vector of P-values is ( ) which indicates that all questions are answered better by the stronger students than by the weaker students. Cronbach's alpha is α = 0.492 which (for many researchers) indicates that the test is reliable. Since the P-value and discrimination index provide properties of the same test, they are sometimes interpreted jointly. In Table 1, we provide guidelines (Skoda, Doulik and Hajerova-Mullerova 2006) that have been proposed for a suitable test and have been endorsed by Brozova and Rydval (2014). Although practitioners may have alternative guidelines for a particular application, here we illustrate the utility of the proposed Bayesian approach with respect to the guidelines provided in Table 1.
We now present some results based on 1000 simulations from the posterior distribution. For each simulation, the Bayesian P-value, the discrimination index and Cronbach's alpha were calculated. In Figure 1, we provide the joint distribution of the Bayesian P-value and the discrimination index for questions 1 and 2. In contrast to the single paired observations $(p_1 = 0.925, d_1 = 0.612)$ and $(p_2 = 0.837, d_2 = 0.788)$, Figure 1 highlights that there is variability associated with each measure and uncertainty is expressed via the posterior distribution. In each of the plots, we have provided bars according to the guidelines in Table 1 which allow us to assess the suitability of the test questions.
We observe a difference between the properties of question 1 and question 2. For example, question 2 is more difficult (i.e. the cloud of points is slightly shifted to the left).We also observe that there is more variability in the discrimination index than in the P-value.
We also observe in Figure 1 that the generated P-values are smaller than the traditional CTT statistics $p_1 = 0.925$ and $p_2 = 0.837$. This is due to the unrealistic $\theta_{ij} \sim \text{Uniform}(0, 1)$ prior distribution which shrinks the posterior distribution of $\theta_{ij}$ towards 0.5. In a particular application, we may have specific knowledge concerning the $\theta_{ij}$ values, and this knowledge can be incorporated into the prior distribution. We illustrate this flexibility in Section 4.
In Figure 2, we provide a density plot of the posterior distribution of the Bayesian version of Cronbach's alpha. Again, the figure highlights that there is variability associated with the measure. One of the frequent discussion points concerning the use of Cronbach's alpha is that its interpretation is subject to the dimension of the n × k data matrix X. With the Bayesian version of Cronbach's alpha, the observed variability depends on the dimension of X. We note that the posterior mean 0.075 in Figure 2 differs from the traditional CTT statistic α = 0.492. In Section 4, we vary the prior and observe changes in the resultant posterior mean.
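The size of the shrinkage in the Bayesian P-value can be worked out in closed form. The short derivation below assumes that every respondent answered every question (no missing data) and uses only the conjugacy of the Uniform(0, 1) prior with the Bernoulli model.

```latex
\begin{align*}
\theta_{ij} \mid x_{ij} &\sim \text{Beta}(1 + x_{ij},\; 2 - x_{ij})
  \quad\Longrightarrow\quad
  E[\theta_{ij} \mid x_{ij}] = \frac{1 + x_{ij}}{3}, \\
E\!\left[\frac{1}{n}\sum_{i=1}^{n} \theta_{ij} \,\Big|\, X\right]
  &= \frac{1}{n}\sum_{i=1}^{n} \frac{1 + x_{ij}}{3}
   = \frac{1 + p_j}{3}, \\
p_1 = 0.925 &\;\Longrightarrow\; \frac{1.925}{3} \approx 0.642,
  \qquad
p_2 = 0.837 \;\Longrightarrow\; \frac{1.837}{3} \approx 0.612 .
\end{align*}
```

Under this prior the posterior mean of the Bayesian P-value is therefore confined to the interval (1/3, 2/3) whatever the data, and the value 0.642 for question 1 agrees with the posterior mean quoted for that question in the discussion of the more realistic prior.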

A more realistic prior
We now turn our attention to the development of a more realistic prior, one which recognizes that some questions are more/less difficult for most respondents and that some respondents are stronger/weaker across most questions. The intention is to introduce a prior distribution that leads to Bayesian CTT statistics that are more in line with the traditional CTT statistics. This allows practitioners to use the same calibration scales with which they are comfortable.
The suggested prior has the following assumed structure
$$\theta_{ij} \sim \text{truncated-Normal}(\mu_{ij}, \sigma_{ij}^2). \tag{7}$$
In (7), the truncation corresponds to the interval (0, 1) and the parameters $\mu_{ij}$ and $\sigma_{ij}^2$ are specified according to an empirical Bayes procedure. The procedure first requires logistic regression involving the original data X where
$$\log\left(\frac{\theta_{ij}}{1 - \theta_{ij}}\right) = \beta_0 + \alpha_i + \gamma_j. \tag{8}$$
Logistic regression provides us with parameter estimates $\hat{\beta}_0$, $\hat{\alpha}_i$ and $\hat{\gamma}_j$. We then invert the logistic function and set
$$\mu_{ij} = \frac{\exp(\hat{\beta}_0 + \hat{\alpha}_i + \hat{\gamma}_j)}{1 + \exp(\hat{\beta}_0 + \hat{\alpha}_i + \hat{\gamma}_j)}.$$
To set $\sigma_{ij}^2$, we make use of the Delta method applied to (8). After some calculations, this yields
$$\sigma_{ij}^2 = \hat{v} \, \mu_{ij}^2 (1 - \mu_{ij})^2$$
where $\hat{v}$ is the sum of the entries in the variance-covariance matrix corresponding to the parameter estimates.
Whereas the calculation of $\mu_{ij}$ and $\sigma_{ij}^2$ may appear daunting for some practitioners, we note that the predict function can be used on a glm object in R to provide the values. This is most convenient when running the rjags package since it provides an interface from R to the JAGS library. In the Appendix, we see that the empirical Bayes procedure requires only three statements of code.
To check the impact of the empirical Bayes prior specification (7), we repeat the Bayesian analysis on the aviation dataset. Recall for question 1, the CTT P-value was 0.925 and the posterior mean of the Bayesian P-value was 0.642. With the new prior that takes into account student ability and test difficulty, the posterior mean of the Bayesian P-value is 0.912. We therefore see that the new value has moved towards the CTT value. Similarly, with Cronbach's alpha, the CTT value was 0.492, the posterior mean of the Bayesian α was 0.075, and the posterior mean of the Bayesian α based on the empirical Bayes prior specification (7) is 0.201. In Figure 3, we provide the joint distribution of the Bayesian P-value and the discrimination index for questions 1 and 2 based on the empirical Bayes prior of Section 4.2. The distribution of values is more in line with the CTT diagnostics. In Figure 4, we provide a density plot of the posterior distribution of the Bayesian version of Cronbach's alpha based on the empirical Bayes prior of Section 4.2. Again, the distribution of values is more in line with the CTT diagnostic.
We repeat that a main advantage of the empirical Bayes procedure is that it takes into account the difficulty of questions and the strength of the respondent. The prior specification in (7) provides only a template of what can be done. For example, one could introduce alternative distributions. One could also introduce more knowledge about students and test questions by modifying the truncated-Normal distribution. In the Appendix, we see that the specification of the prior in JAGS is straightforward (e.g. one line involving the dnorm function).
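The Delta-method step can be sketched in a few lines of Python. Given a fitted linear predictor $\hat{\eta} = \hat{\beta}_0 + \hat{\alpha}_i + \hat{\gamma}_j$ and its estimated variance $\hat{v}$, the prior mean is the inverse logit of $\hat{\eta}$ and the prior variance is $\hat{v}$ multiplied by the squared derivative of the inverse logit. The numerical inputs below are hypothetical, the function names are illustrative, and the rejection sampler draws from the prior only; in practice the logistic regression would be fitted with glm in R and the posterior simulated in JAGS.

```python
import math
import random

def expit(eta):
    """Inverse logistic function."""
    return 1.0 / (1.0 + math.exp(-eta))

def eb_prior_params(eta_hat, v_hat):
    """Delta-method prior mean and variance for theta = expit(eta).

    d/d(eta) expit(eta) = mu * (1 - mu), so var(theta) ~ v_hat * (mu * (1 - mu))**2.
    """
    mu = expit(eta_hat)
    sigma2 = v_hat * (mu * (1.0 - mu)) ** 2
    return mu, sigma2

def draw_truncated_normal(mu, sigma2, rng):
    """Rejection sampling from Normal(mu, sigma2) truncated to (0, 1)."""
    sd = math.sqrt(sigma2)
    while True:
        t = rng.gauss(mu, sd)
        if 0.0 < t < 1.0:
            return t

# Hypothetical fitted values for one respondent/question pair.
mu, sigma2 = eb_prior_params(eta_hat=2.5, v_hat=0.16)
rng = random.Random(1)
prior_draws = [draw_truncated_normal(mu, sigma2, rng) for _ in range(1000)]
print(round(mu, 3), round(sigma2, 5))
```

Because the prior mean is tied to the fitted ability and difficulty effects, a strong respondent on an easy question receives a prior concentrated near 1 rather than one centred at 0.5.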

Generalizing with respect to instructors
We now demonstrate that the Bayesian framework provides advantages that are not available in the classical CTT framework. A possible application is the assessment of instructors. For example, we may have L instructors who are each responsible for a cohort of students. In this case, every observation $x_{ij}$ has an added subscript such that $x_{ijl} = 1\,(0)$ denotes that the ith student has a correct (incorrect) response to the jth question and that this student received instruction on this question by instructor l. We similarly extend the notation for the parameters leading to terms $\theta_{ijl}$. The above setup is also applicable to other situations. For example, a comparison of different groups of students may be of interest where the groups are designated by the index l.
Using either the simple uniform prior or the more realistic prior given by (7) and (8), posterior realizations of $\theta_{ijl}$ are generated as before. Let $S_l = \{\theta_{ijm} : m = l\}$ and let $n_l$ be the number of terms in the set $S_l$. Then an analysis of instructors in the spirit of the CTT Bayesian framework can be based on calculating
$$\bar{\theta}_{\cdot\cdot l} = \frac{1}{n_l} \sum_{\theta_{ijm} \in S_l} \theta_{ijm}$$
which can be interpreted as the average probability of a correct answer for instructor l. One can compare the $\bar{\theta}_{\cdot\cdot l}$ values, l = 1, ..., L, and assess their relative magnitudes by also calculating their corresponding posterior standard deviations.
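As an illustration of this calculation, the Python sketch below averages the simulated probabilities within each instructor's set for every posterior draw and then summarizes the draws by their mean and standard deviation. It makes the simplifying assumption that each respondent has a single instructor for all questions; the function name, labels and toy draws are hypothetical.

```python
from statistics import mean, stdev

def instructor_summary(theta_sims, instructor):
    """Posterior mean and sd of the average success probability per instructor.

    theta_sims: list of posterior draws, each an n x k matrix of probabilities.
    instructor: list of length n; instructor[i] is the label l for respondent i
                (assumes one instructor per respondent across all questions).
    """
    summary = {}
    for label in sorted(set(instructor)):
        per_draw = []
        for theta in theta_sims:
            values = [t for i, row in enumerate(theta)
                      if instructor[i] == label for t in row]
            per_draw.append(mean(values))  # average over the set S_l
        summary[label] = (mean(per_draw), stdev(per_draw))
    return summary

# Hypothetical toy input: two posterior draws, two respondents, two questions.
theta_sims = [
    [[0.2, 0.4], [0.6, 0.8]],
    [[0.4, 0.6], [0.8, 1.0]],
]
result = instructor_summary(theta_sims, instructor=["A", "B"])
print(result)
```

Differences between instructors can then be judged against the posterior standard deviations rather than from point estimates alone.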

Discussion
The two main approaches to questionnaire design and analysis are IRT and CTT. Methods based on IRT require the specification of statistical models and permit the inferential benefits associated with the models. IRT is the dominant approach used in major educational testing initiatives (An and Yung 2014) and IRT software is now widely accessible, including in popular statistical packages such as SAS (Choi 2017). Much recent research has been carried out under the IRT umbrella and there are now many IRT models that can be considered for a given application (Cai et al. 2016). However, despite the popularity of IRT, it has two main drawbacks. First, sometimes the existing statistical models do not adequately characterize the special features of an application and the models need to be modified (if possible) to account for these features. In comparison to CTT, Hambleton and Jones (1993) describe the assumptions related to IRT as 'strong'. Second, the sophistication of the IRT models in terms of model fitting and interpretation is sometimes beyond the technical scope of practitioners. For example, even the simple IRF given in (5) often poses a challenge for a nontechnical audience. On the other hand, CTT approaches consist of few assumptions and are easily adopted by practitioners. These appealing features have led to the continued use of CTT despite the lack of inferential capabilities under CTT. For example, in clinical psychology when there are fewer than 20 test items, Jabrayilov, Emons and Sijtsma (2016) recommend CTT over IRT for detecting change in individuals. In discussing CTT, Hambleton and Jones (1993) write that the dependence of the methodology on the particular test and examinees 'limit the utility of the person and item statistics in practical test development work and complicate any analyses'. The methods proposed in this paper allow practitioners to work under the familiar CTT approach, yet benefit from inferential capabilities. This is accomplished by embedding the CTT structure within a Bayesian framework. The inferential component is accomplished via simulation from posterior distributions where simulated values provide population-level descriptions of questionnaires.
However, the greatest advantage of the proposed approach is its flexibility. We have seen that we can vary the prior to take into account subjective beliefs concerning students and test questions. In addition, the flexibility of applications is facilitated through the availability of the simulated $\theta_{ij}$ values (something that is not immediately available in IRT). For example, we have shown in Section 3 how the introduction of a new subscript can extend an investigation to take into account the effect of instructors. As another example, suppose that a researcher is interested in the performance of students on test questions 6, 7 and 8. Then, for the ith student, the researcher needs only keep track of the simulated outcomes $T_i = \theta_{i6} + \theta_{i7} + \theta_{i8}$. Essentially, with the $\theta_{ij}$ values, the researcher can investigate any aspect of interest regarding students and test questions. Finally, we have used an empirical Bayes procedure based on fitting a logistic regression model according to (8). Nothing prevents us from using a similar procedure based on an alternative parametrization. For example, we could fit a logistic regression model according to the three-parameter IRF (5). This would further tighten the relationship between our Bayesian CTT approach and IRT.

Conclusion
We have made the case that the approach developed in this paper may help bridge the gap between CTT and IRT, by retaining the simplicity of CTT and by providing the inferential advantages of IRT.In particular, when compared to traditional CTT, the proposed approach does not rely on the interpretation of summary statistics.Rather, variability can be assessed via posterior distributions.

Figure 1: Posterior simulations of the Bayesian P-value and discrimination index for questions 1 and 2 using the iid uniform prior. Horizontal lines are drawn to delineate the recommendations from Table 1.

Figure 2: Posterior density plot of the Bayesian version of Cronbach's alpha using the iid uniform prior.

Figure 3: Posterior simulations of the Bayesian P-value and discrimination index for questions 1 and 2 using the empirical Bayes prior of Section 4.2. Horizontal lines are drawn to delineate the recommendations from Table 1.

Figure 4: Posterior density plot of the Bayesian version of Cronbach's alpha using the empirical Bayes prior of Section 4.2.