Common Metrics

Comparing Scores from Different Patient-Reported Outcomes Using Item Response Theory (beta)


In the field of PRO measurement there is a plethora of instruments and questionnaires. For example, it has been estimated that more than 100 instruments have been designed to measure depression, or aspects of it such as depressive severity, among them the

  • Patient Health Questionnaire PHQ-9
  • Beck Depression Inventory BDI-I and II
  • Center for Epidemiologic Studies Depression Scale CESD
  • and many more.
This also applies to various other constructs such as Anxiety, Fatigue, Physical Functioning or Quality of Life. In a nutshell: the more important a construct is considered, the more instruments have probably been developed.

All these instruments differ in various ways, for example, with respect to their underlying philosophies, the psychometric principles of their construction, their emphasis on different aspects of the construct, and their precision and validation. One of the main challenges is that data obtained with different measures are hard to compare. Thus, several measures are often used in a single study for the sake of comparability, resulting in increased respondent burden. In summary, the lack of standardization in the measurement of PROs is a problem that has been widely acknowledged in the literature.


Unfortunately, in the framework of Classical Test Theory (CTT), the scores from different measures are hard to compare. Item-Response Theory (IRT) can help to enhance comparability across instruments by providing so-called common metrics.

A common metric is an IRT model, such as the GRM (Graded Response Model) or the GPCM (Generalized Partial Credit Model), that comprises the parameters of items from various measures of a common variable. Item parameters describe the relation between an item response and the latent variable. With such a statistical model, one can estimate this common variable from subsets of items, e.g. when different measures are used or when data are missing. Such models are usually estimated on large samples and often calibrated to a reference population.
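To illustrate how item parameters link responses to the latent variable, here is a minimal Python sketch of the GRM's category probabilities. The parameter values are made up for illustration; the app itself uses calibrated parameters and the R package mirt, not this code.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Graded Response Model: probability of each response category.

    P*(k) = logistic(a * (theta - b_k)) is the probability of responding
    in category k or higher; category probabilities are differences of
    adjacent P* values.
    """
    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(X >= k), framed by P(X >= 0) = 1 and 0
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Illustrative (made-up) parameters for one 4-category item:
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.0, 1.5])
print([round(p, 3) for p in probs])  # four probabilities that sum to 1
```

Items from different questionnaires simply contribute different `a` and `thresholds` values to the same model, which is what makes a joint calibration a "common metric".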

Several such common metrics have been developed over the past years. Further below you can find a short description including the full reference of the ones we included in the score conversion app.

A Graded Response Model calibrating the BDI-II (n = 748), CES-D (n = 747) and PHQ-9 (n = 1,120) on the US PROMIS Depression metric. The model was developed within the PROsetta Stone Project.

Choi SW, Schalet BD, Cook KF, et al. Establishing a common metric for depressive symptoms: linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychological Assessment 2014;26:513-27.
A Generalized Partial Credit Model covering three anxiety scales (GAD-7, HADS-A, PROMIS-Anxiety 8 Item Short Form). It is based on a rather small dataset (n = 194) of patients with heart failure.

Fischer HF, Klug C, Roeper K, et al. Screening for mental disorders in heart failure patients using computer-adaptive tests. Quality of Life Research doi:10.1007/s11136-013-0599-y
A Generalized Partial Credit Model covering the Physical Functioning scale of the SF-36 and the Health Assessment Questionnaire Disability Index, estimated in 1,791 Dutch patients with rheumatoid arthritis. Please note that two scoring algorithms (with/without aids) for the HAQ-DI are available.

ten Klooster PM, Oude Voshaar MAH, Gandek B, et al. Development and evaluation of a crosswalk between the SF-36 physical functioning scale and Health Assessment Questionnaire disability index in rheumatoid arthritis. Health and Quality of Life Outcomes 2013;11:199.
A Graded Response Model covering six measures of personality disorders, developed in a German age- and gender-representative sample (N = 849) and anchored to the German general population.

Zimmermann, J., Müller, S., Bach, B., Hutsebaut, J., Hummelen, B., Fischer, F. (under review) A common metric for self-reported severity of personality disorder.
If you want to use another common metric that has not yet been implemented for score estimation, this is certainly possible: please contact Felix Fischer.


The application presented here sets up an IRT model with all parameters fixed to the item parameters of the selected common metric. Currently, GRMs and GPCMs are implemented.

The underlying R package mirt uses a marginal maximum likelihood method to estimate the item parameters of IRT models; hence, person parameters can be estimated independently of the item parameters. For person parameter estimation we included the Expected A Posteriori (EAP), Bayes Modal (MAP), Weighted Likelihood Estimation (WLE) and Maximum Likelihood (ML) methods. An EAP estimate for sum scores is available when your data stem from one measure only.
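To make the idea of person parameter estimation concrete, the following Python sketch computes an EAP estimate by numerical integration over a standard normal prior. The item parameters and quadrature grid are made up for illustration; the app delegates this to mirt's `fscores`.

```python
import math

def grm_prob(theta, a, thresholds, k):
    # GRM probability of response category k (0-based) for one item
    def cdf(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    cum = [1.0] + [cdf(b) for b in thresholds] + [0.0]
    return cum[k] - cum[k + 1]

def eap_estimate(responses, items, grid_n=121, lo=-4.0, hi=4.0):
    """EAP: the posterior mean of theta under a standard normal prior,
    approximated on a fixed quadrature grid."""
    step = (hi - lo) / (grid_n - 1)
    num = den = 0.0
    for i in range(grid_n):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # N(0,1) prior, up to a constant
        for (a, thresholds), k in zip(items, responses):
            weight *= grm_prob(theta, a, thresholds, k)
        num += theta * weight
        den += weight
    return num / den

# Illustrative (made-up) parameters: (discrimination, thresholds) per item
items = [(1.8, [-1.0, 0.5]), (1.2, [-0.5, 1.0]), (2.1, [0.0, 1.5])]
theta_hat = eap_estimate(responses=[2, 1, 0], items=items)
print(round(theta_hat, 3))
```

Because the prior pulls the posterior mean towards zero, EAP always yields a finite estimate, even for all-lowest or all-highest response patterns.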

Some general remarks about choice of estimation methods:

  • ML relies only on the item parameters. It might be a good choice if there is no information about the sample distribution. Its disadvantage is that extreme response patterns yield no finite estimate, which makes the calculation of some statistics, such as the mean and SD, impossible.
  • EAP has often been used in clinical applications, e.g. in computer-adaptive tests, mainly because it is easy to compute.
  • In all Bayesian approaches, choice of an appropriate prior is crucial. We offer three choices (see below).
We suggest investigating the impact of different estimation methods on the results of theta score estimation and resulting group statistics.
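The ML problem with extreme response patterns can be seen directly from the likelihood. In the Python sketch below (with made-up item parameters), the log-likelihood of an all-highest-category pattern keeps increasing as theta grows, so there is no finite maximum.

```python
import math

def grm_loglik(theta, responses, items):
    # Log-likelihood of a GRM response pattern at a given theta
    ll = 0.0
    for (a, thresholds), k in zip(items, responses):
        cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b)))
                       for b in thresholds] + [0.0]
        ll += math.log(cum[k] - cum[k + 1])
    return ll

# Illustrative (made-up) parameters for two 3-category items
items = [(1.8, [-1.0, 0.5]), (1.2, [-0.5, 1.0])]
extreme = [2, 2]  # highest category on every item

# The log-likelihood is strictly increasing in theta: no finite ML estimate
print(grm_loglik(2.0, extreme, items) < grm_loglik(4.0, extreme, items))  # True
print(grm_loglik(4.0, extreme, items) < grm_loglik(8.0, extreme, items))  # True
```

Bayesian methods such as EAP and MAP avoid this because the prior makes the posterior proper for every response pattern.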

Using this app you can estimate theta scores from your data with a standard normal prior (mean = 0, variance = 1), a diffuse normal prior (mean = 0, variance = 10), or a normal prior with mean and variance estimated from the data. Please note that after estimation your theta scores are transformed to the popular T-metric (mean = 50, SD = 10).
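The T-metric transformation is a simple linear rescaling of the theta score, sketched here for clarity:

```python
def to_t_score(theta, mean=50.0, sd=10.0):
    # Linear transform of a theta score (mean 0, SD 1) to the T-metric
    return mean + sd * theta

print(to_t_score(0.0))   # 50.0: a theta at the calibration sample mean
print(to_t_score(-1.3))  # 37.0: 1.3 SDs below the mean
```

On this scale, every 10 points corresponds to one standard deviation in the metric's reference population.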

Test-specific standard errors for the precision comparison were calculated using the testinfo() function of mirt for models comprising all items of a single questionnaire. These standard errors match those obtained when theta is estimated with the ML method; hence, a comparison with test precision is only possible under the ML approach.

The application was built using R 3.0.2, shiny and ggplot2. IRT modeling and theta estimation are realized with mirt.

If you want to learn more about the implemented methods, we would like to refer to the following books:

  • Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
  • Thissen D, Wainer H. Test Scoring. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.

Strengths and Limitations

We strongly believe that common metrics offer the chance to set standards in Patient-Reported Outcome measurement independent of the measures used. For example, many common metrics are anchored to meaningful values, e.g. 50 as the general population mean, which facilitates score interpretation.

The particular strengths of the direct estimation of the latent variable from the response pattern as provided here are:

  • All available information is used for estimation.
  • Results have been reported to be slightly more precise compared to the use of cross-walk tables.
  • The precision of each individual score can be assessed.
  • Estimation is also possible in case of missing item responses.

Nonetheless, there are some limitations of this method:

  • Little is known about the validity of these common metrics; so far, they have rarely been validated in external samples.
  • The fewer items provided for theta estimation, the stronger the influence of priors. Hence, when comparing theta across different measures, shorter measures might show a tendency towards the scale mean.

In our opinion, the enhanced comparability of data - especially of data already collected - outweighs these limitations. We encourage you to use our app and share your experiences with us, so that we can further investigate its strengths and limitations for future applications. As the app has not been widely tested yet, we ask you to use it with caution. Please feel free to contact us; we are interested in your experiences and your needs.

Score Conversion

Start App!


Please let us know what you think about this site and feel free to send us your questions. We are grateful for feedback and will provide support!



This website and the score conversion app have been designed and developed by Felix Fischer under the supervision of Matthias Rose at the Charité Universitätsmedizin Berlin.

Furthermore, we would like to thank

  • Janine Devine
  • Thomas Forkmann
  • Peter ten Klooster
  • Gregor Liegl
  • Sandra Nolte
  • Muirne Paap
for feedback, proofreading and application testing.


We would like to draw your attention to some other websites that might be of interest to you.