Introduction
The growing number of people living with multiple long-term conditions (MLTC, multimorbidity) is one of the most significant challenges facing healthcare systems globally. In the UK, more than one in three adults live with MLTC, and in older people and those living in more deprived areas, it is now the norm [1‐3]. People with MLTC are likely to experience poorer health outcomes and lower quality of life (QOL). Healthcare services – which are largely orientated around single conditions – face growing pressure from the increased demand and higher costs associated with MLTC [4‐6].
In light of this challenge, there has been rapid growth in MLTC research over the past two decades, including several trials of interventions to improve outcomes for people with MLTC. However, there remains no clear evidence of benefit from these interventions [7]. It is unclear whether this reflects failure of the interventions themselves or problems with the outcome measures used.
Quality of life is often chosen as the primary outcome for trials because it is an intuitive, holistic concept that is important to patients and influences healthcare utilisation [8]. While single-disease trials often use disease-specific patient-reported outcome measures (PROMs) to measure QOL, there are currently no QOL PROMs designed specifically for people with MLTC. Consequently, MLTC trials have relied on generic measures such as EQ-5D-5L and SF-36 [7, 9‐11]. These generic PROMs are considered broadly applicable and have well-established validity in many contexts [12]. However, their development often predates the widespread use of modern psychometric techniques (modern test theory), and they have certain well-recognised problems. EQ-5D-5L, for example, exhibits limited sensitivity to change [13, 14] and is prone to differential item functioning in heterogeneous populations (that is, unwanted variation in the psychometric performance of items between demographic groups) [15]. The limitations of using generic PROMs in MLTC research in particular have been highlighted previously [16, 17].
The development of bespoke outcome measures for people with MLTC has therefore been identified as a research priority [8, 17‐19]. Although a number of MLTC-specific PROMs have been developed in recent years, these measure concepts such as treatment burden and illness perception rather than QOL per se, and so are less suitable as primary outcome measures in trials. Moreover, in a recent systematic review, all of these MLTC-specific PROMs exhibited risk of bias according to the COSMIN checklist (which assesses the quality of PROM design and validation according to modern test theory) [17].
Development of the Multimorbidity Questionnaire (MMQ)
In response to this, the Multimorbidity Questionnaire (MMQ) – a Danish-language PROM for people with MLTC – was recently developed and validated using modern test theory methods at the University of Copenhagen. Its qualitative groundwork, conceptualisation and development have been described elsewhere [20‐22]. It has two parts: MMQ1, measuring needs-based QOL; and MMQ2, measuring self-perceived inequity. These were conceived as independent but coherent constructs, designed to be used either separately or together [23]. The present paper focusses only on the translation and validation of MMQ1, given its explicit focus on QOL. The translation and validation of MMQ2 will be reported separately.
MMQ1, measuring needs-based QOL, is made up of 37 items across six domains: physical ability, self-determination, security, social life, self-image and personal finances. Needs-based QOL is a well-established concept that is widely used in disease-specific QOL PROMs [24], and is based on Maslow’s Hierarchy of Needs and the assumption that life gains quality from a person’s ability to achieve their goals and fulfil their needs [25]. As such, it is a more patient-centred concept than health-related QOL (as measured by EQ-5D-5L), and has also been shown to be more responsive [26]. In the Danish validation study, MMQ1 demonstrated strong psychometric properties using Rasch analysis [22]. It has since been used as an outcome measure in a large-scale trial of a new model of care for people with MLTC in Denmark (the MM600 trial) [27]. The six domains of MMQ1 were developed as independent scales, resulting in six separate scores rather than a single sum score.
Aim
The aim of this study was to translate and validate MMQ1 for use in the UK.
Methods
Phase 1: Translation, adaptation and content validation
The translation of MMQ1 from Danish to English was conducted using a two-panel method [28]. The first panel took place in Copenhagen in January 2023. It consisted of three native English speakers living in Denmark, overseen by members of the research team (KB and JBB) who were also bilingual. This group produced a preliminary ‘expert’ translation of MMQ. The second panel took place in Edinburgh in March 2023 and consisted of a focus group with six members of a Patient and Public Involvement (PPI) network established to support research within the Advanced Care Research Centre (ACRC) at the University of Edinburgh. Participants had diverse socioeconomic backgrounds and levels of education, and all had experience of living with long-term conditions; however, because this was a PPI group rather than formally recruited research participants, consent was not obtained to record or report their characteristics. The focus group was also attended by the Danish research team (KB, JBB, ABRJ), in order to maintain consistency with the original meaning of the items, and was facilitated by a member of the Edinburgh research team (KS). The process used a think-aloud approach to review the preliminary translation line-by-line, with verbal probing used to explore any semantic ambiguity and improve the translation for a lay audience [29, 30]. Changes were made to the translation if lay participants agreed that the suggested changes improved clarity and the Danish research team felt that the original meaning was preserved. Formal back translation was not subsequently performed; the bilingual fluency of the Danish team negated the need for this, as translation accuracy could be checked in real time. In addition to refining the translation, participants’ impressions of how understandable, appropriate and relevant the measure seemed (its face validity) were also discussed.
The resulting translation was then piloted in six cognitive interviews, which took place in August and September 2023, with patients purposively sampled from two GP practices in Lothian: one in an urban area of high deprivation and one in a rural area of mixed deprivation. Suitable patients, identified by GPs, were those living with MLTC whose health conditions the GP felt were likely to be affecting QOL. Sampling was also targeted so that there was a balance of genders and a weighting towards participants from more deprived areas and/or with experience of mental-physical multimorbidity. This weighting was chosen because it was hypothesized that the impact on QOL might be greater in these groups, and therefore questionnaire items and responses reflecting this could be suitably piloted. Nine suitable patients were identified by their GP, either opportunistically or based on clinical familiarity, and contacted by their GP by telephone. All nine agreed to be contacted by the research team with further information. Of these, six responded to contact from the research team and agreed to proceed to interview. Cognitive interviewing uses qualitative techniques (including think-aloud and verbal probing) on a one-to-one basis to assess how a respondent processes each item, thereby reducing measurement error [29, 30]. The six interviews were conducted in two rounds, with initial analysis and minor amendments made after the first four interviews, prior to further testing. Verbal probing explored the importance, relevance and clarity of each item, concept coverage (content validity) and responder burden. Interviews were recorded and transcribed, and field notes documented.
Phase 2: Psychometric validation in a survey
The translated questionnaire was then included in a survey of 597 adults recruited from eight GP practices in Lothian. Full details of the sampling process, including search criteria, are included in the Supplementary File. In summary, practices volunteered in response to an advert circulated by the NHS Research Scotland Primary Care Network. In the eight practices recruited, practice lists were searched electronically for adult patients on two or more chronic disease registers, or on four or more repeat medications (a method used previously by SWM to identify patients with MLTC [31]). Patients were excluded if they had already been recruited to another research study by the Primary Care Network. Using these criteria, 11,860 potentially eligible patients were identified across the eight practices. From these, a random sample of 2,800 patients was selected for screening. The size of the screening list in each practice ranged from 200 to 400 (mean 350) and was determined at the discretion of the participating GP. The sampling lists were screened to remove any patients deemed inappropriate for survey inclusion (such as those with dementia or approaching end of life). Of the 2,800 randomly sampled patients, 2,753 were retained for invitation to complete the survey (47 were deemed unsuitable by their GP). From the 2,753 distributed surveys, 597 responses were received (22%). The number of responses exceeded the target of 400, which was chosen for consistency with the Danish MMQ validation study [22]. Surveys were posted in November and December 2023 and collection stopped at the end of January 2024. No reminders were sent, and no reimbursement was provided.
Survey packs included a cover letter, a participant information sheet, the questionnaire and a pre-paid return envelope. The questionnaire itself (see Appendix 1) included the translated MMQ1 and MMQ2; two demographic items (age and gender); a checklist of 17 common chronic conditions (to assess multimorbidity) [32]; a bespoke single-item global QOL rating (“I would say my overall quality of life is: very good / good / acceptable / poor / very poor”); a bespoke feedback item to assess responder burden (“Thinking about the questions answered so far, I would say the questionnaire felt: far too long / too long / about right / too short / far too short”); two comparator QOL measures (EQ-5D-5L and ICE-CAP) to assess concurrent validity with MMQ1 [33, 34]; and one comparator measure for MMQ2 (the CARE measure) [35].
Analysis
Data were analysed using SPSS version 27 and R version 4.2.2. Practices were grouped according to whether they served mainly deprived, mixed or affluent populations. Psychometric analysis of MMQ1 using classical test theory was guided by the International Society for Quality of Life Research (ISOQOL) minimum reporting standards [36] and the COSMIN Risk of Bias checklist [37], as detailed below. Analyses for each scale were based on complete responses for that scale: responses with a missing value in a given domain were excluded from analyses of that scale, so the number of complete responses included in the analysis varied slightly between scales.
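This per-scale complete-case approach can be sketched as follows. This is an illustrative sketch only (the study's analysis was conducted in SPSS and R); the column names and values are hypothetical.

```python
import pandas as pd

# Per-scale complete-case analysis: a response is dropped only from the
# scales in which it has a missing item, not from the whole dataset.
# Column names and data are hypothetical.
def scale_scores(df: pd.DataFrame, scales: dict[str, list[str]]) -> dict[str, pd.Series]:
    """Return one score series per scale, using completers for that scale only."""
    scores = {}
    for name, items in scales.items():
        complete = df[items].dropna()          # completers for this scale
        scores[name] = complete.sum(axis=1)    # scale score = sum of its items
    return scores

df = pd.DataFrame({
    "phys1": [0, 1, None, 2],
    "phys2": [1, 0, 1, 2],
    "fin1":  [0, None, 0, 3],
})
s = scale_scores(df, {"physical": ["phys1", "phys2"], "finances": ["fin1"]})
print(len(s["physical"]), len(s["finances"]))
```

Note that a respondent missing a "physical" item still contributes to the "finances" score, which is why the number of analysable responses differs slightly between scales.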
Conceptual framework
Described elsewhere [20] and summarised above.
Translation
Two-panel method described above.
Face and content validity
The importance, relevance, clarity and coverage of items were initially explored in the focus group, and then in greater depth during cognitive interviews (see Phase 1 above). Cognitive interview transcripts were thematically analysed using a framework approach [38].
Scale properties
Completion rates and floor effects (the proportion of responses scoring zero) were compared with those of the comparator measures (EQ-5D-5L, ICE-CAP).
Dimensionality (structural validity)
Confirmatory factor analysis (CFA) was performed in R using the lavaan package, with a diagonally weighted least squares estimator (given the ordinal nature of the data) and an independent clustering model. Six separate CFA models were fitted to the six scales of MMQ1, given that these were developed to be used independently, generating six separate scores. Factor loadings and overall measures of fit (Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA) and Standardized Root Mean Square Residual (SRMR)) were calculated. CFI > 0.95, RMSEA < 0.06 and SRMR < 0.08 indicate good fit [39].
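As a minimal illustration of how these cut-offs are applied (the CFA itself was fitted in R with lavaan; the index values below are made up):

```python
# Applying the cited cut-offs for good model fit to a set of fit indices.
# The thresholds are those given in the text; the example values are hypothetical.
def good_fit(cfi: float, rmsea: float, srmr: float) -> bool:
    """CFI > 0.95, RMSEA < 0.06 and SRMR < 0.08 indicate good fit."""
    return cfi > 0.95 and rmsea < 0.06 and srmr < 0.08

print(good_fit(cfi=0.97, rmsea=0.04, srmr=0.05))  # True
print(good_fit(cfi=0.92, rmsea=0.08, srmr=0.05))  # False
```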
Concurrent validity
EQ-5D-5L and ICE-CAP are well-established QOL PROMs, for which high positive correlation with MMQ1 was hypothesized. Correlation was assessed using Spearman correlation coefficients (due to the non-linearity of the variables) in a matrix including the six MMQ1 scale scores, the individual item scores from the comparator measures (five items in EQ-5D-5L and five in ICE-CAP) and the sum scores for the comparator measures. The correlation matrix also included MMQ1 total scores (i.e., the sum of all six scale scores, even though these are technically independent scales), as well as the single-item global rating of QOL, in order to further examine concurrent validity. EQ-5D-5L may be considered a gold standard measure of health-related quality of life (HRQOL), so this comparison assessed both construct and criterion validity as defined by ISOQOL.
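A Spearman correlation matrix of this kind can be sketched as below. This is illustrative only (the study's analysis used SPSS/R); the column names and values are hypothetical.

```python
import pandas as pd

# Pairwise Spearman rank correlations between scale scores and comparators.
# Column names and data are hypothetical stand-ins for the survey variables.
df = pd.DataFrame({
    "mmq1_physical": [0, 3, 7, 12, 20],
    "mmq1_social":   [1, 2, 5, 9, 15],
    "eq5d_sum":      [5, 7, 10, 14, 19],
    "global_qol":    [1, 2, 3, 4, 5],
})
corr = df.corr(method="spearman")  # full rank-correlation matrix
print(corr.loc["mmq1_physical", "eq5d_sum"])
```

Spearman coefficients depend only on rank order, so monotonic but non-linear relationships between the measures are captured without assuming interval-level scoring.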
Convergent and discriminant validity
Multiscale analysis was also performed, calculating the correlation (Spearman coefficient) between items in a given scale and the remaining sum score for that scale (corrected item-total correlation), and comparing this with the correlation between the same items and the sum scores for the other scales [40]. Items are expected to demonstrate higher correlation with their native scale (convergent validity) than with other scales within the measure (discriminant validity). Inter-scale correlations were also examined, with high values (> 0.6) suggesting substantial overlap between domains.
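The corrected item-total correlation described above can be sketched as follows, under the assumption of hypothetical item columns (the study's computation followed ref [40] and was run in SPSS/R):

```python
import pandas as pd

# Corrected item-total correlation: correlate an item with the sum of the
# *other* items in its scale, so the item is not correlated with itself.
def corrected_item_total(df: pd.DataFrame, item: str) -> float:
    rest = df.drop(columns=item).sum(axis=1)   # "remaining" sum score
    return df[item].corr(rest, method="spearman")

# Hypothetical three-item scale
scale = pd.DataFrame({
    "q1": [0, 1, 2, 3, 4],
    "q2": [0, 1, 1, 3, 4],
    "q3": [1, 0, 2, 2, 3],
})
print(round(corrected_item_total(scale, "q1"), 2))
```

For the multiscale comparison, the same item would also be correlated against the sum scores of the other five scales, and convergent validity is supported when the native-scale value is the largest.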
Reliability
Internal consistency was tested by way of an inter-item correlation matrix, Cronbach’s alpha, and scale reliability with items dropped. Cronbach’s alpha values above 0.7 were interpreted as indicating good internal consistency reliability [39]. Scale reliability should be reduced, rather than improved, when items are dropped. Optimal inter-item correlation is between 0.20 and 0.40, with higher values suggesting homogeneity and redundancy of items within the scale [41].
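Cronbach's alpha follows the standard formula alpha = k/(k−1) · (1 − Σ item variances / variance of the sum score). A minimal sketch with simulated data (the study computed alpha in SPSS/R on the survey responses):

```python
import numpy as np

# Cronbach's alpha from its standard variance-based formula.
def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated scale: four items sharing one latent trait plus noise,
# which should yield alpha in the vicinity of 0.8.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
items = latent[:, None] + rng.normal(size=(200, 4))
alpha = cronbach_alpha(items)
print(round(alpha, 2))
```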
Responsiveness
This was not assessable as it requires repeat measurement. However, future work is planned to assess the performance of MMQ1 in longitudinal surveys.
Interpretability of scores
Scores for MMQ1 are reported as six individual scale scores. To aid interpretation of these values in future studies and to assess the discriminative ability of each scale, the distributions of MMQ1 scale scores were plotted with data grouped according to responses to the single-item global QOL rating. The mean scale scores (± SD) were compared for each global item response category (very good, good, acceptable, poor or very poor). Discriminative ability is reported as the number of individuals needed in a t-test, with 5% significance and 80% power, to find a clinically meaningful difference between known groups; here, this was taken as the difference between consecutive global item response categories (i.e., very good/good, good/acceptable, acceptable/poor, poor/very poor). A low number of individuals (< 75) indicates a highly discriminative scale. The same analysis was performed in the Danish MMQ validation study, allowing for direct comparison.
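This sample-size figure can be approximated with the usual two-sample t-test formula, n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardised difference between adjacent global-rating groups. A sketch with hypothetical group means and SD (not values from the study):

```python
from statistics import NormalDist

# Approximate per-group sample size for a two-sample t-test at the stated
# significance (5%, two-sided) and power (80%). Means and SD are hypothetical.
def n_per_group(mean1: float, mean2: float, sd: float,
                alpha: float = 0.05, power: float = 0.80) -> float:
    d = abs(mean1 - mean2) / sd                            # standardised effect size
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * z ** 2 / d ** 2

# e.g. adjacent categories differing by 0.65 SD:
print(round(n_per_group(mean1=10.0, mean2=16.5, sd=10.0)))  # 37, well under 75
```

Larger standardised differences between adjacent categories thus translate directly into smaller required samples, which is what the < 75 criterion captures.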
Responder burden
Responder views regarding the length of the questionnaire were explored during cognitive interviews and by way of a feedback item included in the survey (which used a 5-point Likert scale to assess agreement with the statement “The questionnaire felt too long”). Investigator burden was not assessed in this study.
Discussion
In this study we have translated, adapted and validated a multi-scale PROM for needs-based QOL in people living with MLTC in the UK. MMQ1 contains 37 items across six independent scales: Physical ability, Concerns and worries, Limitations in daily life, Social life, Personal finances and Self-image. These scales have a strong conceptual basis, and demonstrate acceptable content validity, internal consistency, structural validity and concurrent validity. All of the scales, with the exception of Personal finances (which contains only three items), have strong discriminative ability for detecting clinically meaningful differences in QOL as measured by a single global item. However, a substantial floor effect was noted in all scales, and high inter-item and inter-scale correlations point to item redundancy and scale overlap.
The finding of high floor effects may, in part, be attributable to the sampling method, and in particular to how it differed from that of the Danish validation study. While the Danish study excluded patients who rated their health as “good” or “very good” [22], no such exclusion existed in the present study. Instead, we used an electronic sampling method to randomly identify adult patients with MLTC, possibly resulting in a “healthier” sample population than in the Danish study. This discrepancy highlights wider inconsistencies in the concept and definition of MLTC/multimorbidity and raises the question of whether MMQ1 is more suitable for use in those with so-called “complex multimorbidity” [42]. The highest floor effect (61%) was seen in the Personal finances scale. This might partly be because it is the shortest scale, with only three items, making a score of zero more likely than in longer scales. Additionally, this domain may reflect a higher severity of QOL impact than the other scales, in so far as limitations in daily life, or concerns and worries, are likely to arise before an individual’s personal finances are materially affected. Although the six scales of MMQ1 are not designed to produce a single sum score, they are intended to be used alongside each other, and so it is notable that only 7% of respondents scored zero in all six domains, which was lower than the floor effect found in EQ-5D-5L and ICE-CAP.
CFA demonstrated acceptable structural validity in the six scales of MMQ1, although potential local dependence between two items in scale six (Self-image) led to the inclusion of a correlation term in the model for that scale. Although the pattern of item-total correlations (Supplementary Table 3) supported the scales’ discriminant and convergent validity, there were also high inter-item and inter-scale correlations (Supplementary Tables 1 and 5), suggesting item redundancy and overlap between domains. Respondent feedback also favoured shortening the measure. Taken together, these findings suggest that future work to refine and abbreviate MMQ1 may be of value. This may involve Rasch analysis of the study data, which would provide greater item-level detail and, coupled with our understanding of the items’ content validity, help determine whether MMQ1 can be shortened without compromising its measurement properties. Future analysis may also involve testing for differential item functioning in relation to key variables such as age, sex, socioeconomic status and multimorbidity, a recognised limitation of generic outcome measures in heterogeneous populations such as people with MLTC.
The use of six independent scales in this PROM has both advantages and disadvantages, particularly in the context of clinical trials. Each scale addresses a different aspect of QOL, allowing for a more nuanced and multifaceted assessment of this broad concept, measurement of which is not well served by reducing it to a single score. This design has the potential to enhance sensitivity by detecting domain-specific changes that a single score might miss. It also offers flexibility, with the option to only use, or to prioritise, relevant scales according to a study’s objectives. On the other hand, this design also adds complexity for researchers in terms of score interpretation, statistical analysis, and power calculations. However, the analysis of discriminative ability (Table 5) shows that the required sample sizes to discriminate between Good/Acceptable or Acceptable/Poor on the global QOL item are less than 75 for five of the six scales.
In general, the psychometric validation of PROMs at an aggregate, population level does not guarantee their reliability at an individual level, where greater fluctuation and error are expected [23, 43]. In keeping with this, MMQ1 was not designed for use in a clinical setting, but rather as a research tool for use in observational and intervention studies. Nevertheless, in practice PROMs are often used by clinicians to explore patients’ views and priorities, and to facilitate discussions around treatment options [44]. Reflecting this, the original Danish MMQ1 has been used as both a communication aid and an outcome measure in the pilot study of an intervention for people with severe mental illness and MLTC in Denmark (the SOFIA trial) [45].
Strengths and limitations
This study used a robust approach to translation, adaptation and content validation, and involved a survey with a large sample, using a battery of questionnaires: MMQ1, EQ-5D-5L and ICE-CAP, plus MMQ2 and CARE (not reported in this paper). The response rate was relatively low (22%) and representativeness could not be assessed, so the possibility of response bias could not be excluded. Future work comparing sample and population characteristics will be required to explore this.
This psychometric validation study, in line with the recommended use of MMQ1, treated a scale score as missing if one or more of its items had missing values. Hence, validity was assessed for completers only. Including incomplete scale values would have necessitated strong assumptions and introduced potential distortion, while the potential gain in statistical power was minimal: completion rates for the 37 items were between 97 and 99% (1–3% missing), and completion rates for the six scales were between 96 and 99% (1–4% missing). When using MMQ1 as an outcome measure in future studies, however, it will be necessary to impute missing values in order to avoid bias and preserve statistical power. Since missing values occur at the scale level, the validity established for completers implicitly extends to non-completers.
Feedback pertaining to the length and response burden of MMQ1 should also be considered in the context of the overall response burden of the survey pack. EQ-5D-5L and ICE-CAP were used to assess concurrent validity, but it should be noted that these measure health-related QOL, which is conceptually distinct from the needs-based QOL measured by MMQ1. Finally, there was no opportunity to assess responsiveness, measurement error, or test–retest reliability, as this study involved only a single cross-sectional survey. However, the high values of Cronbach’s alpha are reassuring, given that alpha is considered a lower bound for reliability [46].
Conclusion
MMQ1 is a multi-scale PROM for needs-based QOL in people with MLTC, translated and adapted from the original Danish and validated in a UK setting. It demonstrated acceptable psychometric properties using classical test theory, albeit with evidence of item redundancy and scale overlap which, along with responder feedback, supports refining and shortening the measure, if psychometrically feasible. There also remains a need to assess responsiveness in a before-and-after survey. To our knowledge, this is the first English-language measure of QOL bespoke to people with MLTC. As such, it has the potential to improve the measurement of QOL in MLTC intervention trials.