
  • Perspective
  • Published: 12 February 2024

Understanding metric-related pitfalls in image analysis validation

  • Annika Reinke   ORCID: orcid.org/0000-0003-4363-1876 1 , 2 , 3   na1 ,
  • Minu D. Tizabi   ORCID: orcid.org/0000-0003-3687-6381 1 , 4   na1 ,
  • Michael Baumgartner   ORCID: orcid.org/0000-0003-4455-9917 5 ,
  • Matthias Eisenmann 1 ,
  • Doreen Heckmann-Nötzel 1 , 4 ,
  • A. Emre Kavur   ORCID: orcid.org/0000-0002-9328-8140 1 , 5 , 6 ,
  • Tim Rädsch   ORCID: orcid.org/0000-0003-3518-0315 1 , 2 ,
  • Carole H. Sudre 7 , 8 ,
  • Laura Acion   ORCID: orcid.org/0000-0001-5213-6012 9 ,
  • Michela Antonelli   ORCID: orcid.org/0000-0002-3005-4523 8 , 10 ,
  • Tal Arbel   ORCID: orcid.org/0000-0001-8870-3007 11 ,
  • Spyridon Bakas   ORCID: orcid.org/0000-0001-8734-6482 12 , 13 ,
  • Arriel Benis   ORCID: orcid.org/0000-0002-9125-8300 14 , 15 ,
  • Florian Buettner 16 , 17 , 18 , 19 , 20 ,
  • M. Jorge Cardoso   ORCID: orcid.org/0000-0003-1284-2558 8 ,
  • Veronika Cheplygina   ORCID: orcid.org/0000-0003-0176-9324 21 ,
  • Jianxu Chen   ORCID: orcid.org/0000-0002-8500-1357 22 ,
  • Evangelia Christodoulou 1 ,
  • Beth A. Cimini   ORCID: orcid.org/0000-0001-9640-9318 23 ,
  • Keyvan Farahani 24 ,
  • Luciana Ferrer 25 ,
  • Adrian Galdran 26 , 27 ,
  • Bram van Ginneken 28 , 29 ,
  • Ben Glocker   ORCID: orcid.org/0000-0002-4897-9356 30 ,
  • Patrick Godau   ORCID: orcid.org/0000-0002-0365-7265 1 , 3 , 4 ,
  • Daniel A. Hashimoto   ORCID: orcid.org/0000-0003-4725-3104 31 , 32 ,
  • Michael M. Hoffman   ORCID: orcid.org/0000-0002-4517-1562 33 , 34 , 35 , 36 ,
  • Merel Huisman 37 ,
  • Fabian Isensee 5 , 6 ,
  • Pierre Jannin   ORCID: orcid.org/0000-0002-7415-071X 38 , 39 ,
  • Charles E. Kahn   ORCID: orcid.org/0000-0002-6654-7434 40 ,
  • Dagmar Kainmueller 41 , 42 ,
  • Bernhard Kainz 43 , 44 ,
  • Alexandros Karargyris   ORCID: orcid.org/0000-0002-1930-3410 45 ,
  • Jens Kleesiek   ORCID: orcid.org/0000-0001-8686-0682 46 ,
  • Florian Kofler 47 ,
  • Thijs Kooi 48 ,
  • Annette Kopp-Schneider   ORCID: orcid.org/0000-0002-1810-0267 49 ,
  • Michal Kozubek   ORCID: orcid.org/0000-0001-7902-589X 50 ,
  • Anna Kreshuk   ORCID: orcid.org/0000-0003-1334-6388 51 ,
  • Tahsin Kurc 52 ,
  • Bennett A. Landman   ORCID: orcid.org/0000-0001-5733-2127 53 ,
  • Geert Litjens   ORCID: orcid.org/0000-0003-1554-1291 54 ,
  • Amin Madani 55 ,
  • Klaus Maier-Hein 5 , 56 ,
  • Anne L. Martel   ORCID: orcid.org/0000-0003-1375-5501 34 , 57 ,
  • Erik Meijering   ORCID: orcid.org/0000-0001-8015-8358 58 ,
  • Bjoern Menze   ORCID: orcid.org/0000-0003-4136-5690 59 ,
  • Karel G. M. Moons 60 ,
  • Henning Müller   ORCID: orcid.org/0000-0001-6800-9878 61 , 62 ,
  • Brennan Nichyporuk   ORCID: orcid.org/0009-0006-8087-6089 63 ,
  • Felix Nickel 64 ,
  • Jens Petersen 5 ,
  • Susanne M. Rafelski   ORCID: orcid.org/0000-0002-1399-5970 65 ,
  • Nasir Rajpoot   ORCID: orcid.org/0000-0001-6760-1271 66 ,
  • Mauricio Reyes 67 , 68 ,
  • Michael A. Riegler   ORCID: orcid.org/0000-0002-3153-2064 69 , 70 ,
  • Nicola Rieke   ORCID: orcid.org/0000-0003-0241-9334 71 ,
  • Julio Saez-Rodriguez   ORCID: orcid.org/0000-0002-8552-8976 72 , 73 ,
  • Clara I. Sánchez 74 ,
  • Shravya Shetty 75 ,
  • Ronald M. Summers   ORCID: orcid.org/0000-0001-8081-7376 76 ,
  • Abdel A. Taha 77 ,
  • Aleksei Tiulpin   ORCID: orcid.org/0000-0002-7852-4141 78 , 79 ,
  • Sotirios A. Tsaftaris 80 ,
  • Ben Van Calster 81 , 82 ,
  • Gaël Varoquaux   ORCID: orcid.org/0000-0003-1076-5122 83 ,
  • Ziv R. Yaniv   ORCID: orcid.org/0000-0003-0315-7727 84 ,
  • Paul F. Jäger   ORCID: orcid.org/0000-0002-6243-2568 2 , 85   na2 &
  • Lena Maier-Hein   ORCID: orcid.org/0000-0003-4910-9368 1 , 2 , 3 , 4 , 73   na2  

Nature Methods volume 21, pages 182–194 (2024)


  • Medical research

Validation metrics are key for tracking scientific progress and bridging the current chasm between artificial intelligence research and its translation into practice. However, increasing evidence shows that, particularly in image analysis, metrics are often chosen inadequately. Although taking into account the individual strengths, weaknesses and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multistage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides a reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Although focused on biomedical image analysis, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. The work serves to enhance global comprehension of a key topic in image analysis validation.


Data availability

No data were used in this study.

Code availability

We provide reference implementations for all Metrics Reloaded metrics within the MONAI open-source framework. They are accessible at https://github.com/Project-MONAI/MetricsReloaded.
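For orientation, the following minimal sketch illustrates two of the overlap-based metrics covered by the framework, the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU), computed for binary segmentation masks with plain NumPy. It is an illustrative stand-in only, not the MetricsReloaded package API; the function names are our own.

```python
import numpy as np

def dsc(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice Similarity Coefficient of two binary masks (True/1 = foreground)."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom > 0 else float("nan")

def iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection over Union (Jaccard index) of two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    return np.logical_and(pred, ref).sum() / union if union > 0 else float("nan")

# Toy example: a 2-pixel prediction against a 3-pixel reference.
ref = np.array([[0, 1, 1, 1]])
pred = np.array([[0, 0, 1, 1]])
print(dsc(pred, ref), iou(pred, ref))  # 0.8 0.666...
```

References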

Maier-Hein, L. et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat. Commun. 9, 1–13 (2018). With this comprehensive analysis of biomedical image analysis competitions (challenges), the authors initiated a shift in how such challenges are designed, performed and reported in the biomedical domain. Its concepts and guidelines have been adopted by reputable organizations such as the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society.


Gooding, M. J. et al. Comparative evaluation of autocontouring in clinical practice: a practical method using the Turing test. Med. Phys. 45 , 5105–5115 (2018).


Kofler, F. et al. Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the Dice coefficient. Preprint at arXiv https://doi.org/10.48550/arXiv.2103.06205 (2021).

Vaassen, F. et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys. Imaging Radiat. Oncol. 13 , 1–6 (2020).

Maier-Hein L. et al. Metrics reloaded: recommendations for image analysis validation. Nat. Methods https://doi.org/10.1038/s41592-023-02151-z (2024).

Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 1–13 (2020).


Chicco, D., Tötsch, N. & Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 14, 1–22 (2021). The manuscript addresses the challenge of evaluating binary classifications. It compares MCC with other metrics, explaining their mathematical relationships and providing use cases where MCC offers more informative results.

Grandini, M., Bagli, E. & Visani, G. Metrics for multi-class classification: an overview. Preprint at arXiv https://doi.org/10.48550/arXiv.2008.05756 (2020).

Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. imaging 15 , 1–28 (2015). The paper discusses the importance of effective metrics for evaluating the accuracy of 3D medical image segmentation algorithms. The authors analyze existing metrics, propose a selection methodology, and develop a tool to aid researchers in choosing appropriate evaluation metrics based on the specific characteristics of the segmentation task.


Taha, A. A., Hanbury, A. & del Toro, O. A. J. A formal method for selecting evaluation metrics for image segmentation. In 2014 IEEE International Conference on Image Processing 932–936 (IEEE, 2014).

Lin, T.-Y. et al. Microsoft COCO: common objects in context. In European Conference on Computer Vision 740–755 (Springer, 2014).

Reinke, A. et al. Common limitations of image processing metrics: a picture story. Preprint at arXiv https://doi.org/10.48550/arXiv.2104.05642 (2021).

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The Pascal Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 88 , 303–338 (2010).

Howard, A. et al. Sartorius—cell instance segmentation. Kaggle https://www.kaggle.com/c/sartorius-cell-instance-segmentation (2021).

Schmidt, U., Weigert, M., Broaddus, C. & Myers, G. Cell detection with star-convex polygons. In International Conference on Medical Image Computing and Computer-Assisted Intervention 265–273 (Springer, 2018).

Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18, 100–106 (2021).


Hirling, D. et al. Segmentation metric misinterpretations in bioimage analysis. Nat. Methods https://doi.org/10.1038/s41592-023-01942-8 (2023).

Brown, B. B. Delphi Process: A Methodology Used for the Elicitation of Opinions of Experts (RAND Corporation, 1968).

Nasa, P., Jain, R. & Juneja, D. Delphi methodology in healthcare research: how to decide its appropriateness. World J. Methodol. 11, 116 (2021).


Yeghiazaryan, V. & Voiculescu, I. D. Family of boundary overlap metrics for the evaluation of medical image segmentation. J. Med. Imaging 5 , 015006 (2018).

Gruber, S. & Buettner, F. Better uncertainty calibration via proper scores for classification and beyond. Preprint at arXiv https://doi.org/10.48550/arXiv.2203.07835 (2022).

Gooding, M. J., Boukerroui, D., Osorio, E. V., Monshouwer, R. & Brunenberg, E. Multicenter comparison of measures for quantitative evaluation of contouring in radiotherapy. Phys. Imaging Radiat. Oncol. 24 , 152–158 (2022).

Cordts, M. et al. The Cityscapes dataset. In CVPR Workshop on The Future of Datasets in Vision (2015).

Muschelli, J. ROC and AUC with a binary predictor: a potentially misleading metric. J. Classif. 37 , 696–708 (2020).


Bilic, P. et al. The liver tumor segmentation benchmark (LiTS). Med. Image Anal. 84, 102680 (2023).

Tran, T. N. et al. Sources of performance variability in deep learning-based polyp detection. Preprint at arXiv https://doi.org/10.48550/arXiv.2211.09708 (2022).

Wiesenfarth, M. et al. Methods and open-source toolkit for analyzing and visualizing challenge results. Sci. Rep. 11 , 1–15 (2021).

Lennerz, J. K., Green, U., Williamson, D. F. K. & Mahmood, F. A unifying force for the realization of medical AI. NPJ Digit. Med. 5, 172 (2022).

Correia, P. & Pereira, F. Video object relevance metrics for overall segmentation quality evaluation. EURASIP J. Adv. Signal Process. 2006 , 1–11 (2006).

Honauer, K., Maier-Hein, L. & Kondermann, D. The HCI stereo metrics: geometry-aware performance analysis of stereo algorithms. In Proceedings of the IEEE International Conference on Computer Vision 2120–2128 (IEEE, 2015).

Konukoglu, E., Glocker, B., Ye, D. H., Criminisi, A. & Pohl, K. M. Discriminative segmentation-based evaluation through shape dissimilarity. IEEE Trans. Med. Imaging 31 , 2278–2289 (2012).

Margolin, R., Zelnik-Manor, L. & Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 248–255 (2014).

Carbonell, A., De la Pena, M., Flores, R. & Gago, S. Effects of the trinucleotide preceding the self-cleavage site on eggplant latent viroid hammerheads: differences in co- and post-transcriptional self-cleavage may explain the lack of trinucleotide AUC in most natural hammerheads. Nucleic Acids Res. 34 , 5613–5622 (2006).


Di Sabatino, A. & Corazza, G. R. Nonceliac gluten sensitivity: sense or sensibility? Ann. Intern. Med. 156 , 309–311 (2012).

Roberts, B. et al. Systematic gene tagging using CRISPR/Cas9 in human stem cells to illuminate cell organization. Mol. Biol. Cell 28, 2854–2874 (2017).

Chen, J. et al. The Allen Cell and Structure Segmenter: a new open source toolkit for segmenting 3D intracellular structures in fluorescence microscopy images. Preprint at bioRxiv https://doi.org/10.1101/491035 (2020).

Ounkomol, C., Seshamani, S., Maleckar, M. M., Collman, F. & Johnson, G. R. Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. Nat. Methods 15 , 917–920 (2018).

Viana, M. P. et al. Integrated intracellular organization and its variations in human IPS cells. Nature 613 , 345–354 (2023).


Acknowledgements

This work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Incubator (HI), the Medical Image Computing and Computer Assisted Interventions Special Interest Group for biomedical image analysis challenges, and the benchmarking working group of the MONAI initiative. It has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 101002198, NEURAL SPICING) and the Surgical Oncology Program of the National Center for Tumor Diseases (NCT) Heidelberg. It was further supported in part by the Intramural Research Program of the National Institutes of Health Clinical Center as well as by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the National Institutes of Health (NIH), under award numbers NCI:U01CA242871, NCI:U24CA279629 and NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH. T.A. acknowledges the Canada Institute for Advanced Research (CIFAR) AI Chairs program, and the Natural Sciences and Engineering Research Council of Canada. F.B. was co-funded by the European Union (ERC, TAIPO, 101088594). The views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. M.J.C. acknowledges funding from the Wellcome/EPSRC Centre for Medical Engineering (WT203148/Z/16/Z), the Wellcome Trust (WT213038/Z/18/) and the InnovateUK-funded London AI Centre for Value-Based Healthcare. J.C. is supported by the Federal Ministry of Education and Research (BMBF) under the funding reference 161L0272. V.C. acknowledges funding from the NovoNordisk Foundation (NNF21OC0068816) and Independent Research Council Denmark (1134-00017B). B.A.C. was supported by NIH grant P41 GM135019 and grant 2020-225720 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. G.S.C. was supported by Cancer Research UK (programme grant: C49297/A27294). M.M.H. is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-05134). A. Karargyris is supported by French State Funds managed by the ‘Agence Nationale de la Recherche (ANR)’ - ‘Investissements d’Avenir’ (Investments for the Future), grant ANR-10-IAHU-02 (IHU Strasbourg). M.K. was funded by the Ministry of Education, Youth and Sports of the Czech Republic (Project LM2018129). T. Kurc was supported in part by 4UH3-CA225021-03, 1U24CA180924-01A1, 3U24CA215109 and 1UG3-CA225-021-01 grants from the National Institutes of Health. G.L. receives research funding from the Dutch Research Council, the Dutch Cancer Association, HealthHolland, the European Research Council, the European Union and the Innovative Medicine Initiative. S.M.R. wishes to acknowledge the Allen Institute for Cell Science founder P. G. Allen for his vision, encouragement and support. M.R. is supported by Innosuisse grant number 31274.1 and Swiss National Science Foundation grant number 205320_212939. C.H.S. is supported by an Alzheimer’s Society Junior Fellowship (AS-JF-17-011). R.M.S. is supported by the Intramural Research Program of the NIH Clinical Center. A.T. acknowledges support from the Academy of Finland (Profi6 336449 funding program), University of Oulu strategic funding, the Finnish Foundation for Cardiovascular Research, Wellbeing Services County of North Ostrobothnia (VTR project K62716) and the Terttu Foundation. S.A.T. acknowledges the support of Canon Medical and the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme (grant RCSRF1819\8\25). We would like to thank P. Bankhead, G. S. Collins, R. Haase, F. Hamprecht, A. Karthikesalingam, H. Kenngott, P. Mattson, D. Moher, B. Stieltjes and M. Wiesenfarth for fruitful discussions on this work. We would like to thank S. Engelhardt, S. Koehler, M. A. Noyan, G. Polat, H. Rivaz, J. Schroeter, A. Saha, L. Sharan, P. Hirsch and M. Viana for suggesting additional illustrations that can be found in refs. 12,35.

Author information

These authors contributed equally: Annika Reinke, Minu D. Tizabi.

These authors jointly supervised this work: Paul F. Jäger, Lena Maier-Hein.

Authors and Affiliations

German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems, Heidelberg, Germany

Annika Reinke, Minu D. Tizabi, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Emre Kavur, Tim Rädsch, Evangelia Christodoulou, Patrick Godau & Lena Maier-Hein

German Cancer Research Center (DKFZ) Heidelberg, HI Helmholtz Imaging, Heidelberg, Germany

Annika Reinke, Tim Rädsch, Paul F. Jäger & Lena Maier-Hein

Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, Germany

Annika Reinke, Patrick Godau & Lena Maier-Hein

National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and University Medical Center Heidelberg, Heidelberg, Germany

Minu D. Tizabi, Doreen Heckmann-Nötzel, Patrick Godau & Lena Maier-Hein

German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Heidelberg, Germany

Michael Baumgartner, A. Emre Kavur, Fabian Isensee, Klaus Maier-Hein & Jens Petersen

German Cancer Research Center (DKFZ) Heidelberg, HI Applied Computer Vision Lab, Heidelberg, Germany

A. Emre Kavur & Fabian Isensee

MRC Unit for Lifelong Health and Ageing at UCL and Centre for Medical Image Computing, Department of Computer Science, University College London, London, UK

Carole H. Sudre

School of Biomedical Engineering and Imaging Science, King’s College London, London, UK

Carole H. Sudre, Michela Antonelli & M. Jorge Cardoso

Instituto de Cálculo, CONICET – Universidad de Buenos Aires, Buenos Aires, Argentina

Laura Acion

Centre for Medical Image Computing, University College London, London, UK

Michela Antonelli

Centre for Intelligent Machines and MILA (Quebec Artificial Intelligence Institute), McGill University, Montréal, Quebec, Canada

Division of Computational Pathology, Dept of Pathology & Laboratory Medicine, Indiana University School of Medicine, Indianapolis, IN, USA

Spyridon Bakas

Center for Biomedical Image Computing and Analytics (CBICA), University of Pennsylvania, Philadelphia, PA, USA

Department of Digital Medical Technologies, Holon Institute of Technology, Holon, Israel

Arriel Benis

European Federation for Medical Informatics, Le Mont-sur-Lausanne, Switzerland

German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership between DKFZ and UCT Frankfurt-Marburg, Frankfurt am Main, Germany

Florian Buettner

German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany

Goethe University Frankfurt, Department of Medicine, Frankfurt am Main, Germany

Goethe University Frankfurt, Department of Informatics, Frankfurt am Main, Germany

Frankfurt Cancer Institute, Frankfurt am Main, Germany

Department of Computer Science, IT University of Copenhagen, Copenhagen, Denmark

Veronika Cheplygina

Leibniz-Institut für Analytische Wissenschaften – ISAS – e.V., Dortmund, Germany

Jianxu Chen

Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Beth A. Cimini

Center for Biomedical Informatics and Information Technology, National Cancer Institute, Bethesda, MD, USA

Keyvan Farahani

Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Ciudad Autónoma de Buenos Aires, Buenos Aires, Argentina

Luciana Ferrer

Universitat Pompeu Fabra, Barcelona, Spain

Adrian Galdran

University of Adelaide, Adelaide, South Australia, Australia

Fraunhofer MEVIS, Bremen, Germany

Bram van Ginneken

Radboud Institute for Health Sciences, Radboud University Medical Center, Nijmegen, the Netherlands

Department of Computing, Imperial College London, South Kensington Campus, London, UK

Ben Glocker

Department of Surgery, Perelman School of Medicine, Philadelphia, PA, USA

Daniel A. Hashimoto

General Robotics Automation Sensing and Perception Laboratory, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA

Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada

Michael M. Hoffman

Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada

Michael M. Hoffman & Anne L. Martel

Department of Computer Science, University of Toronto, Toronto, Ontario, Canada

Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

Department of Radiology and Nuclear Medicine, Radboud University Medical Center, Nijmegen, the Netherlands

Merel Huisman

Laboratoire Traitement du Signal et de l’Image – UMR_S 1099, Université de Rennes 1, Rennes, France

Pierre Jannin

INSERM, Paris, France

Department of Radiology and Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA

Charles E. Kahn

Max-Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Biomedical Image Analysis and HI Helmholtz Imaging, Berlin, Germany

Dagmar Kainmueller

University of Potsdam, Digital Engineering Faculty, Potsdam, Germany

Department of Computing, Faculty of Engineering, Imperial College London, London, UK

Bernhard Kainz

Department AIBE, Friedrich-Alexander-Universität (FAU), Erlangen-Nürnberg, Germany

IHU Strasbourg, Strasbourg, France

Alexandros Karargyris

Translational Image-guided Oncology (TIO), Institute for AI in Medicine (IKIM), University Medicine Essen, Essen, Germany

Jens Kleesiek

Helmholtz AI, Oberschleißheim, Germany

Florian Kofler

Lunit, Seoul, South Korea

German Cancer Research Center (DKFZ) Heidelberg, Division of Biostatistics, Heidelberg, Germany

Annette Kopp-Schneider

Centre for Biomedical Image Analysis and Faculty of Informatics, Masaryk University, Brno, Czech Republic

Michal Kozubek

Cell Biology and Biophysics Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany

Anna Kreshuk

Department of Biomedical Informatics, Stony Brook University, Health Science Center, Stony Brook, NY, USA

Tahsin Kurc

Electrical Engineering, Vanderbilt University, Nashville, TN, USA

Bennett A. Landman

Department of Pathology, Radboud University Medical Center, Nijmegen, the Netherlands

Geert Litjens

Department of Surgery, University Health Network, Philadelphia, PA, USA

Amin Madani

Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany

Klaus Maier-Hein

Physical Sciences, Sunnybrook Research Institute, Toronto, Ontario, Canada

Anne L. Martel

School of Computer Science and Engineering, University of New South Wales, UNSW Sydney, Kensington, New South Wales, Australia

Erik Meijering

Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland

Bjoern Menze

Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht University, Utrecht, the Netherlands

Karel G. M. Moons

Information Systems Institute, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland

Henning Müller

Medical Faculty, University of Geneva, Geneva, Switzerland

MILA (Quebec Artificial Intelligence Institute), Montréal, Quebec, Canada

Brennan Nichyporuk

Department of General, Visceral and Thoracic Surgery, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

Felix Nickel

Allen Institute for Cell Science, Seattle, WA, USA

Susanne M. Rafelski

Tissue Image Analytics Laboratory, Department of Computer Science, University of Warwick, Coventry, UK

Nasir Rajpoot

ARTORG Center for Biomedical Engineering Research, University of Bern, Bern, Switzerland

Mauricio Reyes

Department of Radiation Oncology, University Hospital Bern, University of Bern, Bern, Switzerland

Simula Metropolitan Center for Digital Engineering, Oslo, Norway

Michael A. Riegler

UiT The Arctic University of Norway, Tromsø, Norway

NVIDIA GmbH, München, Germany

Nicola Rieke

Institute for Computational Biomedicine, Heidelberg University, Heidelberg, Germany

Julio Saez-Rodriguez

Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany

Julio Saez-Rodriguez & Lena Maier-Hein

Informatics Institute, Faculty of Science, University of Amsterdam, Amsterdam, the Netherlands

Clara I. Sánchez

Google Health, Google, Palo Alto, CA, USA

Shravya Shetty

National Institutes of Health Clinical Center, Bethesda, MD, USA

Ronald M. Summers

Institute of Information Systems Engineering, TU Wien, Vienna, Austria

Abdel A. Taha

Research Unit of Health Sciences and Technology, Faculty of Medicine, University of Oulu, Oulu, Finland

Aleksei Tiulpin

Neurocenter Oulu, Oulu University Hospital, Oulu, Finland

School of Engineering, The University of Edinburgh, Edinburgh, Scotland

Sotirios A. Tsaftaris

Department of Development and Regeneration and EPI-centre, KU Leuven, Leuven, Belgium

Ben Van Calster

Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands

Parietal project team, INRIA Saclay-Île de France, Palaiseau, France

Gaël Varoquaux

National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA

Ziv R. Yaniv

German Cancer Research Center (DKFZ) Heidelberg, Interactive Machine Learning Group, Heidelberg, Germany

  • Paul F. Jäger


Contributions

A.R. initiated and led the study, was a member of the Delphi core team, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops, suggested pitfalls, and designed all figures. M.D.T. was a member of the extended Delphi core team and wrote and reviewed the manuscript. P.F.J. initiated and led the study, was a member of the Delphi core team, led the Object Detection (ObD) and Instance Segmentation (InS) expert group, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops, suggested pitfalls and participated in surveys. L.M.-H. initiated and led the study, was a member of the Delphi core team, wrote and reviewed the manuscript, prepared and evaluated all surveys, organized all workshops and suggested pitfalls. M.B. was a member of the extended Delphi core team, was an active member of the ObD and InS expert group, wrote and reviewed the manuscript and participated in surveys and workshops. M.E. was a member of the extended Delphi core team, reviewed the document, assisted in survey preparation and participated in surveys and workshops. D.H.-N. was a member of the extended Delphi core team and prepared all surveys. A.E.K. was a member of the extended Delphi core team and participated in surveys. T.R. was a member of the extended Delphi core team, was an active member of the ObD and InS expert group, reviewed the document, assisted in survey preparation, tested all metric examples, suggested pitfalls and participated in surveys and workshops. C.H.S. was an active member of the ObD and InS expert group, reviewed the manuscript, suggested pitfalls, tested all metric examples and participated in surveys and workshops. L.A. reviewed the manuscript and participated in surveys and workshops. M.A. was an active member of the Semantic Segmentation (SemS) expert group and participated in surveys and workshops. T.A. was an active member of the ObD and InS expert group, suggested pitfalls, reviewed the manuscript and participated in surveys and workshops. S.B. co-led the SemS expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. A.B. was an active member of the biomedical and cross-topic expert groups, reviewed the manuscript and participated in surveys and workshops. F.B. led the calibration expert group, suggested pitfalls, reviewed the manuscript, and participated in surveys. M.J.C. was an active member of the Image-level Classification (ImLC) expert group and participated in surveys and workshops. V.C. was an active member of the ImLC and cross-topic expert groups, reviewed the manuscript and participated in surveys and workshops. J.C. reviewed the manuscript, suggested pitfalls and participated in surveys. E.C. led the cross-topic expert group, was a member of the extended Delphi core team, wrote and reviewed the manuscript, suggested pitfalls and participated in surveys. B.A.C. was an active member of the ObD and InS expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. K.F. was an active member of the biomedical and cross-topic expert groups and participated in surveys and workshops. L.F. was an active member of the calibration expert group, reviewed the manuscript, suggested pitfalls and participated in surveys. A.G. was an active member of the calibration expert group, reviewed the manuscript, suggested pitfalls and participated in surveys. B.V.G. participated in surveys and workshops. B.G. 
led the cross-topic expert group and was an active member of the SemS expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. P.G. led the ImLC expert group, was a member of the extended Delphi core team, wrote and reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. D.A.H. was an active member of the biomedical and cross-topic expert groups, reviewed the manuscript suggested pitfalls and participated in surveys and workshops. M.M.H. was an active member of the ImLC expert group, reviewed the manuscript and participated in surveys and workshops. M.H. co-led the biomedical expert group, was an active member of the cross-topic expert group, reviewed the manuscript and participated in surveys and workshops. F.I. led the SemS expert group, reviewed the manuscript and participated in surveys and workshops. P.J. co-led the cross-topic expert group, was an active member of the ObD and InS expert group, reviewed the manuscript and participated in surveys and workshops. C.E.K. was an active member of the biomedical expert group, reviewed the manuscript and participated in surveys and workshops. D.K. suggested pitfalls and participated in surveys. B.K. suggested pitfalls and participated in surveys. J.K. led the biomedical expert group, reviewed the manuscript and participated in surveys and workshops. F.K. suggested pitfalls and participated in surveys. T. Kooi suggested pitfalls and participated in surveys. A.K.-S. was a member of the extended Delphi core team and was an active member of the cross-topic group. M.K. led the ObD and InS expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. A. Kreshuk was an active member of the biomedical expert group, reviewed the manuscript and participated in surveys and workshops. T. Kurc participated in surveys and workshops. B.A.L. was an active member of the SemS expert group and participated in surveys and workshops. G.L. was an active member of the ImLC expert group, reviewed the manuscript and participated in surveys and workshops. A.M. was an active member of the biomedical and SemS expert groups, suggested pitfalls and participated in surveys and workshops. K.M.-H. was an active member of the SemS expert group, reviewed the manuscript and participated in surveys and workshops. A.L.M. participated in surveys and workshops. E.M. was an active member of the ImLC expert group, reviewed the manuscript and participated in surveys. B.M. participated in surveys and workshops. K.G.M.M. was an active member of the cross-topic expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. H.M. was an active member of the ImLC expert group, reviewed the manuscript and participated in surveys and workshops. B.N. was an active member of the ObD and InS expert group, and participated in surveys. F.N. was an active member of the biomedical expert group and participated in surveys and workshops. J.P. participated in surveys and workshops. S.M.R. reviewed the manuscript, suggested pitfalls and participated in surveys. N. Rajpoot participated in surveys and workshops. M.R. led the SemS expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. M.A.R. led the ImLC expert group, reviewed the manuscript, suggested pitfalls and participated in surveys and workshops. N. Rieke was an active member of the SemS expert group and participated in surveys and workshops. R.M.S. 
was an active member of the ObD and InS, the biomedical and the cross-topic expert groups, reviewed the manuscript, and participated in surveys and workshops. A.A.T. co-led the SemS expert group, suggested pitfalls, and participated in surveys and workshops. A.T. was an active member of the calibration group, reviewed the manuscript and participated in surveys. S.A.T. was an active member of the ObD and InS expert group, reviewed the manuscript and participated in surveys and workshops. B.V.C. was an active member of the cross-topic expert group and participated in surveys. G.V. was an active member of the ImLC and cross-topic expert groups, reviewed the manuscript and suggested pitfalls. Z.R.Y. suggested pitfalls and participated in surveys. A. Karargyris, J.S.-R., C.I.S. and S.S. served on the expert Delphi panel and participated in workshops and surveys.

Corresponding authors

Correspondence to Annika Reinke , Minu D. Tizabi , Paul F. Jäger or Lena Maier-Hein .

Ethics declarations

Competing interests.

F.B. is an employee of Siemens (Munich, Germany). B.V.G. is a shareholder of Thirona (Nijmegen, the Netherlands). B.G. is an employee of HeartFlow (California, USA) and Kheiron Medical Technologies (London, UK). M.M.H. received an Nvidia GPU Grant. T. Kooi is an employee of Lunit (Seoul, South Korea). G.L. is on the advisory board of Canon Healthcare IT (Minnesota, USA) and is a shareholder of Aiosyn (Nijmegen, the Netherlands). N. Rajpoot is the founder and CSO of Histofy (New York, USA). N. Rieke is an employee of Nvidia (Munich, Germany). J.S.-R. reports funding from GSK (Heidelberg, Germany), Pfizer (New York, USA) and Sanofi (Paris, France) and fees from Travere Therapeutics (California, USA), Stadapharm (Bad Vilbel, Germany), Astex Therapeutics (Cambridge, UK), Pfizer (New York, USA) and Grunenthal (Aachen, Germany). R.M.S. receives patent royalties from iCAD (New Hampshire, USA), ScanMed (Nebraska, USA), Philips (Amsterdam, the Netherlands), Translation Holdings (Alabama, USA) and PingAn (Shenzhen, China); his lab received research support from PingAn through a Cooperative Research and Development Agreement. S.A.T. receives financial support from Canon Medical Research Europe (Edinburgh, Scotland). The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature Methods thanks Pingkun Yan for their contribution to the peer review of this work. Primary Handling editor: Rita Strack, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 [P2.2] Disregard of the properties of the target structures.

(a) Small structure sizes. The predictions of two algorithms (Prediction 1/2) differ in only a single pixel. In the case of the small structure (bottom row), this has a substantial effect on the corresponding Dice Similarity Coefficient (DSC) metric value (and similarly on the Intersection over Union (IoU)). This pitfall is also relevant for other overlap-based metrics such as the centerline Dice Similarity Coefficient (clDice), and for localization criteria such as Box/Approx/Mask IoU and Intersection over Reference (IoR). (b) Complex structure shapes. Common overlap-based metrics (here: DSC) are unaware of complex structure shapes and treat Predictions 1 and 2 equally. The clDice uncovers the fact that Prediction 1 misses the fine-granular branches of the reference and favors Prediction 2, which focuses on the center line of the object. This pitfall is also relevant for other overlap-based metrics such as the IoU and the pixel-level Fβ Score, as well as for localization criteria such as Box/Approx/Mask IoU, Center Distance, Mask IoU > 0, Point inside Mask/Box/Approx, and IoR.
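The effect described in panel (a) is easy to reproduce numerically. The short sketch below is illustrative only; it uses a plain NumPy Dice implementation rather than any particular toolkit, and adds the same single-pixel error to a large and a small structure before comparing the resulting DSC values.

```python
import numpy as np

def dsc(pred, ref):
    """Dice Similarity Coefficient of two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    return 2 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum())

def square_mask(shape, top_left, size):
    """Binary mask containing a filled square of the given size."""
    m = np.zeros(shape, dtype=bool)
    r, c = top_left
    m[r:r + size, c:c + size] = True
    return m

shape = (50, 50)
for size in (20, 2):                       # large vs. small target structure
    ref = square_mask(shape, (10, 10), size)
    pred = ref.copy()
    pred[10, 10 + size] = True             # prediction differs by exactly one pixel
    print(f"structure size {size}x{size}: DSC = {dsc(pred, ref):.3f}")
# The identical one-pixel error barely affects the large structure (DSC ~0.999)
# but noticeably lowers the DSC of the small one (~0.889), mirroring the figure.
```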

Extended Data Fig. 2 [P2.4] Disregard of the properties of the algorithm output.

(a) Possibility of overlapping predictions. If multiple structures of the same type can be seen within the same image (here: reference objects R1 and R2), it is generally advisable to phrase the problem as instance segmentation (InS; right) rather than semantic segmentation (SemS; left). This way, issues with boundary-based metrics resulting from comparing a given structure boundary to the boundary of the wrong instance in the reference can be avoided. In the provided example, the distance of the red boundary pixel to the reference, as measured by a boundary-based metric in SemS problems, would be zero, because different instances of the same structure cannot be distinguished. This problem is overcome by phrasing the problem as InS. In this case, (only) the boundary of the matched instance (here: R2) is considered for distance computation. (b) Possibility of empty prediction or reference. Each column represents a potential scenario for per-image validation of objects, categorized by whether True Positives (TPs), False Negatives (FNs), and False Positives (FPs) are present (n > 0) or not (n = 0) after matching/assignment. The sketches on the top showcase each scenario when setting ‘n > 0’ to ‘n = 1’. For each scenario, Sensitivity, Positive Predictive Value (PPV), and the F1 Score are calculated. Some scenarios yield undefined values (Not a Number (NaN)).
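As a companion to panel (b), the following sketch (plain Python, not tied to any specific validation library) tabulates Sensitivity, PPV and the F1 Score for per-image TP/FN/FP counts, including the empty-prediction and empty-reference scenarios in which some of the metrics are undefined.

```python
import math

def counting_metrics(tp: int, fn: int, fp: int) -> dict:
    """Per-image Sensitivity, PPV and F1 Score from matched object counts.
    Returns NaN where a metric is undefined (division by zero)."""
    sens = tp / (tp + fn) if (tp + fn) > 0 else math.nan   # undefined if the reference is empty
    ppv = tp / (tp + fp) if (tp + fp) > 0 else math.nan    # undefined if the prediction is empty
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else math.nan
    return {"Sensitivity": sens, "PPV": ppv, "F1": f1}

# Scenarios analogous to the figure: each count is either present (1) or absent (0).
for tp, fn, fp in [(1, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]:
    print(f"TP={tp} FN={fn} FP={fp} ->", counting_metrics(tp, fn, fp))
# An empty reference and empty prediction (last row) leaves all three metrics undefined,
# so the handling of NaN values must be specified before aggregating per-image scores.
```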

Supplementary information

Supplementary Information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article.

Reinke, A., Tizabi, M.D., Baumgartner, M. et al. Understanding metric-related pitfalls in image analysis validation. Nat Methods 21, 182–194 (2024). https://doi.org/10.1038/s41592-023-02150-0


Received: 09 February 2023

Accepted: 12 December 2023

Published: 12 February 2024

Issue Date: February 2024

DOI: https://doi.org/10.1038/s41592-023-02150-0


Questionnaire validation practice: a protocol for a systematic descriptive literature review of health literacy assessments

BMJ Open, Volume 9, Issue 10

  • http://orcid.org/0000-0001-5704-0490 Melanie Hawkins 1 ,
  • Gerald R Elsworth 1 ,
  • Richard H Osborne 2
  • 1 School of Health and Social Development, Faculty of Health , Deakin University , Burwood , Victoria , Australia
  • 2 Global Health and Equity, Faculty of Health, Arts and Design , Swinburne University of Technology , Hawthorn , Victoria , Australia
  • Correspondence to Melanie Hawkins; melanie.hawkins{at}deakin.edu.au

Introduction Contemporary validity testing theory holds that validity lies in the extent to which a proposed interpretation and use of test scores is justified, the evidence for which is dependent on both quantitative and qualitative research methods. Despite this, we hypothesise that development and validation studies for assessments in the field of health primarily report a limited range of statistical properties, and that a systematic theoretical framework for validity testing is rarely applied. Using health literacy assessments as an exemplar, this paper outlines a protocol for a systematic descriptive literature review about types of validity evidence being reported and if the evidence is reported within a theoretical framework.

Methods and analysis A systematic descriptive literature review of qualitative and quantitative research will be used to investigate the scope of validation practice in the rapidly growing field of health literacy assessment. This review method employs a frequency analysis to reveal potentially interpretable patterns of phenomena in a research area; in this study, patterns in types of validity evidence reported, as assessed against the criteria of the 2014 Standards for Educational and Psychological Testing , and in the number of studies using a theoretical validity testing framework. The search process will be consistent with the Preferred Reporting Items for Systematic Reviews and Meta-analyses statement. Outcomes of the review will describe patterns in reported validity evidence, methods used to generate the evidence and theoretical frameworks underpinning validation practice and claims. This review will inform a theoretical basis for future development and validity testing of health assessments in general.

Ethics and dissemination Ethics approval is not required for this systematic review because only published research will be examined. Dissemination of the review findings will be through publication in a peer-reviewed journal, at conference presentations and in the lead author’s doctoral thesis.

  • validity testing theory
  • health literacy
  • health assessment
  • measurement

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

https://doi.org/10.1136/bmjopen-2019-030753


Strengths and limitations of this study

This is the first systematic literature review to examine types of validity evidence for a range of health literacy assessments within the framework of the authoritative reference for validity testing theory, The Standards for Educational and Psychological Testing .

The review is grounded in the contemporary definition of validity as a quality of the interpretations and inferences made from measurement scores rather than as solely based on the properties of a measurement instrument.

The search for the review will be limited only by the end search date (March 2019) because health literacy is a relatively new field and publications are not expected to date back more than about 30 years.

All definitions of health literacy and all types of health literacy assessment instruments will be included.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias.

Introduction

Historically, the focus of validation practice has been on the statistical properties of a test or other measurement instrument, and this has been adopted as the basis of validity testing for individual and population assessments in the field of health. 1 However, advancements in validity testing theory hold that validity lies in the justification of a proposed interpretation of test scores for an intended purpose, the evidence for which includes but is not limited to the test’s statistical properties. 2–7 Therefore, to validate means to investigate , through a range of methods, the extent to which a proposed interpretation and use of test scores is justified. 7–9 The term ‘test’ in this paper is used in the same sense as Cronbach uses it in his 1971 Test Validation chapter 8 to refer to all procedures for collecting data about individuals and populations. In health, these procedures include objective tests (eg, clinical assessments) and subjective tests (eg, patient questionnaires) or a combination of both and may involve quantitative (eg, questionnaire) or qualitative methods (eg, interview). The act of testing results in data that require interpretation. In the field of health, such interpretations are usually used for making decisions about individuals or populations. The process of validation needs to provide evidence that these interpretations and decisions are credible, and a theoretical framework to guide this process is warranted. 1 2 10

The authoritative reference for validity testing theory comes from education and psychology: the Standards for Educational and Psychological Testing (the Standards ). 3 The Standards define validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’ and that ‘the process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations’ (p.11). 3 A test’s proposed score interpretation and use is described in Kane’s argument-based approach to validation as an interpretation/use argument (IUA; also called an interpretive argument). 11 12 Validity testing theory requires test developers and users to generate and evaluate a range of validity evidence such that a validity argument can determine the plausibility of the IUA. 3 7 9 11 12 Despite this contemporary stance on validity testing theory and practice, the application of validity testing theory and methodology is not common practice for individual and population assessments in the field of health. 1 Furthermore, there are calls for developers, users and translators/adapters of health assessments to establish theoretically driven validation plans for IUAs such that validity evidence can be systematically collected and evaluated. 1 2 7 10

The Standards provide a theoretical framework that can be used or adapted to form a validation plan for development of a new test or to evaluate the validity of an IUA for a new context. 1 2 Based on the notion that construct validity is the foundation of test development and use, the theoretical framework of the Standards outlines five sources of evidence on which validity arguments should be founded: (1) test content, (2) response processes, (3) internal structure, (4) relationship of scores to other variables and (5) validity and the consequences of testing ( table 1 ). 3

Table 1. The five sources of validity evidence. 3

Validity testing in the health context

Two of the five sources of validity evidence defined by the Standards (internal structure and relationship of scores to other variables) have a focus on the statistical properties of a test. However, the other three (test content, response processes and consequences of testing) are strongly reliant on evidence based on qualitative research methods. Greenhalgh et al have called for more credence and publication space to be given to qualitative research in the health sciences. 13 Zumbo and Chan (p.350, 2014) call specifically for more validity evidence from qualitative and mixed methods. 1 It is time to systematically assess if test developers and users in health are generating and integrating a range of quantitative and qualitative evidence to support inferences made from these data. 1

In chapter 1 of their book, Zumbo and Chan report the results of a systematic search of validation studies from the 1960s to 2010. Results from this search for the health sciences categories of ‘life satisfaction, well-being or quality of life’ and ‘health or medicine’ show a dramatic increase since the 1990s in the publication of validation studies that primarily produce what is classified as construct validity. 1 Given this was a snapshot review of validation practice during these years, the authors do not delve into the methods used to generate evidence for construct validity. However, Barry et al., in a systematic review investigating the frequency with which psychometric properties were reported for validity and reliability in health education and behaviour (also published in 2014), found that the primary methods used to generate evidence for construct validity were factor analysis, correlation coefficients and χ2. 14 This limited view of construct validity as simply correlation between items or tests measuring the same or similar constructs is at odds with the Standards, where evaluation and integration of evidence from perhaps several other sources (ie, test content, response processes, internal structure, relationships with theoretically predicted external variables, and intended and unintended consequences) is needed to determine the degree to which a construct is represented by score interpretations (p.11). 3

Health literacy

This literature review will examine validity evidence for health literacy assessments. Health literacy is a relatively new area of measurement, and there has been a rapid development in the definition and measurement of this multi-dimensional concept. 15–18 Health literacy is now a priority of the WHO, 19 and many countries have incorporated it into health policy, 20–24 and are including it in national health surveys. 25–27

Definitions of health literacy include those for functional health literacy (ie, a focus on comprehension and numeric abilities) to multi-dimensional definitions such as that used by the WHO: ‘the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand and use information in ways which promote and maintain good health’. 28 The general purpose of health literacy assessment is to determine pathways to facilitate access to and improve understanding and use of health information and services, as well as to improve or support the health literacy responsiveness of health services. 28–31 However, these two uses of data (in general, to improve patient outcomes and to improve organisational procedures) may require evaluative integration of different types of evidence to justify score interpretations to inform patient interventions or organisational change. 3 7 9 11 32 A strong and coherent evidence-based conception of the health literacy construct is required to support score interpretations. 14 33–35 Decisions that arise from measurements of health literacy will affect individuals and populations and, as such, there must be strong argument for the validity of score interpretations for each measurement purpose.

To enhance the quality and transparency of the proposed systematic descriptive literature review, this protocol paper outlines the scope and purpose of the review. 36 37 Using the theoretical framework of the five sources of validity evidence of the Standards , and health literacy assessments as an exemplar, the results of this systematic descriptive literature review will indicate current validation practice. The assumptions that underlie this literature review are that, despite the advancement of contemporary validity testing theory in education and psychology, a systematic theoretical framework for validity testing has not been applied in the field of health, and that validation practice for health assessments remains centred on general psychometric properties that typically provide insufficient evidence that the test is fit for its intended use. The purpose of the review is to investigate quantitative and qualitative validity evidence reported for the development and testing of health literacy assessments to describe patterns in the types of validity evidence reported, 38–45 and identify use of theory for validation practice. Specifically, the review will address the following questions:

What is being reported as validity evidence for health literacy assessment data?

Do the studies place the validity evidence within a validity testing framework, such as that offered by the Standards ?

Methods and analysis

Review method.

This review is designed to provide the basis for a critique of validation practice for health literacy assessments within the context of the validity testing framework of the Standards. It is not an evaluation of the specific arguments that authors have made about validity from the data that have been gathered for individual measurement instruments. The review is intended to quantify the types of validity evidence being reported, so a systematic descriptive literature review was chosen as the most appropriate review technique. Described by King and He (2005) 42 as belonging towards the qualitative end of a continuum of review techniques, a descriptive literature review nevertheless employs a frequency analysis to reveal interpretable patterns in a research area; in this review, these are patterns in the types of validity evidence being reported for health literacy assessments and in the number of studies that refer to a validity testing framework. A descriptive literature review can include qualitative and quantitative research and is based on a systematic and exhaustive review method. 38–41 43 44 The method for this review will be guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. 46
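To make the planned frequency analysis concrete, the sketch below shows, with entirely hypothetical study labels and codings (not data from this review), how reported evidence could be tallied against the five sources of validity evidence defined by the Standards.

```python
from collections import Counter

# Five sources of validity evidence defined by the Standards (2014).
SOURCES = ["test content", "response processes", "internal structure",
           "relations to other variables", "consequences of testing"]

# Hypothetical coding of included studies: study ID -> evidence sources reported.
coded_studies = {
    "study_01": ["internal structure", "relations to other variables"],
    "study_02": ["test content", "internal structure"],
    "study_03": ["internal structure"],
}

# Tally how many studies report each source of evidence.
counts = Counter(src for sources in coded_studies.values() for src in sources)
for source in SOURCES:
    print(f"{source}: {counts.get(source, 0)}/{len(coded_studies)} studies")
```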

Eligibility criteria

This literature review is not an assessment of participant data but a collation of reported validity evidence. As such, the focus is not on the participants in the studies but on the evidence presented in support of the validity of interpretations and uses of health literacy assessment data. This means that it will be the type of study that is considered for inclusion rather than the type of study participant. Inclusion criteria are as follows:

Development/application/validation studies about health literacy assessments: We expect to find many papers that describe the development and initial validation studies of health literacy assessments. Papers that use an existing health literacy assessment to measure outcomes but do not claim to conduct validity testing will not be included. Studies of comparison (eg, participant groups) or of prediction (eg, health literacy and hospital admissions) will be included only if the authors openly claim that the study results contribute validation evidence for the health literacy assessment instrument.

Not limited by date: There will be no start date to the search, so that papers about validation and health literacy assessments from the early days of health literacy measurement will be included. Health literacy is a relatively new concept and the earliest papers are expected to date back only about 30 years. The search end date was March 2019.

Studies published and health literacy assessments developed in the English language: Due to resource limitations, the search will be restricted to studies published in the English language and instruments developed in the English language. Translated instruments will be excluded. We realise that these exclusions introduce an English language and culture bias, and note that a similar descriptive review of published studies about health literacy assessments developed in or translated to other languages is warranted.

Qualitative and quantitative research methods: Given that comprehensive validity testing includes both qualitative and quantitative methods, studies employing either or both will be included.

All definitions of health literacy: Definitions of health literacy have been accumulating over the past 30 years and reflect a range of health literacy testing methods as well as contexts, interpretations and uses of the data. We include all definitions of health literacy and all types of health literacy assessment instruments, which may include objective, subjective, uni-dimensional and multi-dimensional measurement instruments.

Exclusion criteria

Systematic reviews and other types of reviews captured by the search will not be included in the analysis. However, before being excluded, the reference lists will be checked for articles that may have been missed by the database search. Predictive, association or other comparative studies that do not explicitly claim in the abstract to contribute validity evidence will also not be included. Instruments developed in languages other than English, and translation studies, will be excluded as noted previously.

Information sources

Systematic electronic searches of the following databases will be conducted in EBSCOhost: MEDLINE Complete, Global Health, CINAHL Complete, PsycINFO and Academic Search Complete. EMBASE will also be searched. The electronic database search will be supplemented by searching for dissertations and theses through proquest.com, dissertation.com and openthesis.org. Reference lists of pertinent systematic reviews that are identified in the search will be scanned, as well as article reference lists and the authors’ personal reference lists, to ensure all relevant articles have been captured. The search terms will use medical subject headings and text words related to types of assessment instruments, health literacy, validation and validity testing. Peer reviewed full articles and examined theses will be included in the search.

Search strategy

An expert university librarian has been consulted as part of planning the literature search strategy. The strategy will focus on health literacy, types of assessment instruments, validation and validity, and methods used to determine the validity of interpretation and use of data from health literacy assessments. The search terms have been determined through scoping searches and examining search terms from other measurement and health literacy systematic reviews. The database searches were completed in March 2019 and the search terms used are described in online supplementary file 1 .


Study selection

Literature search results will be saved and the titles and abstracts downloaded to Endnote Reference Manager X9. Titles and abstracts of the search results will be screened for duplicates and according to the inclusion and exclusion criteria. The full texts of articles that seem to meet the eligibility criteria or that are potentially eligible will then be obtained and screened. Excluded articles and reasons for exclusions will be recorded. The PRISMA flow diagram will be used to document the review process. 46

Data extraction

The data extraction framework will be adapted from tables in Hawkins et al 2 (p.1702) and Cox and Owen (p.254). 47 Data extraction from eligible articles will be conducted by one reviewer (MH) and comprehensively checked by a second reviewer (GE).

Subjective and objective health literacy assessments will be identified along with those that combine objective and subjective items or scales. Data to be extracted will include the date and source of publication; the context of the study (eg, country, type of organisation/institution, type of investigation, representative population); statements about the use of a theoretical validity testing framework; the types of validity evidence reported; the methods used to generate the evidence; and the validation claims made by the authors of the papers, as based on their reported evidence.

Data synthesis and analysis

A descriptive analysis of extracted data, as based on the theoretical framework of the Standards , will be used to identify patterns in the types of validity evidence being reported, the methods used to generate the evidence and theoretical frameworks underlying validation practice. Where possible and relevant to the concept of validity, changes in validation practice and assessment of health literacy over time will be explored. It is possible that one study may use more than one method and generate more than one type of validity evidence. Statements about a theoretical underpinning to the generation of validity evidence will be collated.
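To give a sense of how such a frequency analysis might be implemented, the sketch below tallies evidence types from a hypothetical extraction table (the column names and categories are illustrative assumptions, not taken from the protocol):

```python
import pandas as pd

# Hypothetical extraction table: one row per included study, with the types of
# validity evidence reported (column names and values are illustrative only).
studies = pd.DataFrame({
    "study_id": [1, 2, 3, 4],
    "evidence_types": [
        ["content", "internal structure"],
        ["relations to other variables"],
        ["content", "response processes", "internal structure"],
        ["internal structure"],
    ],
    "cites_validity_framework": [True, False, False, True],
})

# Frequency of each source of validity evidence across studies
evidence_counts = studies.explode("evidence_types")["evidence_types"].value_counts()
print(evidence_counts)

# Proportion of studies that place their evidence within a validity testing framework
print(studies["cites_validity_framework"].mean())
```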

Patient and public involvement

Patients and the public were not involved in the development or design of this literature review.

With the increasing use of health assessment data for decision-making, the health of individuals and populations relies on test developers and users to provide evidence for validity arguments for the interpretations and uses of these data. This systematic descriptive literature review will collate existing validity evidence for health literacy assessments developed in English, identify patterns of reporting frequency according to the five sources of evidence in the Standards, and establish whether the validity evidence is being placed within a theoretical framework for validation planning. 3 The potential implications of this review include finding that, when assessed against the Standards’ theoretical framework, current validation practice in health literacy (and possibly in health assessment in general) has limited capacity for determining valid score interpretation and use. The Standards’ framework challenges the long-held perception in health assessment that validity refers to an assessment tool rather than to the interpretation of data for a specific use. 48 49

The validity of decisions based on research data is a critical aspect of health services research. Our understanding of the phenomena we research depends on the quality of our measurement of the constructs of interest, which, in turn, affects the validity of the inferences we make and the actions we take from data interpretations. 6 7 Too often, measurement quality is considered separately from the decisions that need to be made. 6 50 However, questionable measurement (perhaps through use of an instrument that was developed using suboptimal methods, was inappropriately applied, or through gaps in validity testing) cannot lead to valid inferences. 3 50 To make appropriate and responsible decisions for individuals, communities, health services and policy development, we must consider the integrity of the instruments, and the context and purpose of measurement, to justify decisions and actions based on the data.

A limitation of the review is that the search will be restricted to studies published and instruments developed in the English language, and this may introduce an English language and culture bias. A similar review of health literacy assessments developed in or translated to other languages is warranted. A further limitation is that we rely on the information authors provide in identified articles. It is possible that some authors have an incomplete understanding of the specific methods they are using and reporting, and may not accurately or clearly provide details on validity testing procedures employed. Documentation for decisions made during data extraction will be kept by the researchers.

Health literacy is a relatively new area of research. We are fortunate to be at the start of a burgeoning field and can include all publications about validity testing of English-language health literacy assessments. The inclusion of the earliest to the most recent publications provides the opportunity to understand changes and advancements in health literacy measurement and methods of analysis since the introduction of the concept of health literacy. Using health literacy assessments as an exemplar, the outcomes of this review will guide and inform a theoretical basis for the future practice of validity testing of health assessments in general to ensure, as far as is possible, the integrity of the inferences made from data for individual and population benefits.

Acknowledgments

The authors acknowledge and thank Rachel West, Deakin University Liaison Librarian, for her expertise and advice during the preparation of this systematic literature review.


Contributors MH and RHO conceptualised the research question and analytical plan. Under supervision from RHO, MH led the development of the search strategy, selection criteria, data extraction criteria and analysis method, which was then comprehensively assessed and checked by GRE. MH drafted the initial manuscript and led subsequent drafts. GRE and RHO read and provided feedback on manuscript iterations. All authors approved the final manuscript. RHO is the guarantor.

Funding MH is funded by a National Health and Medical Research Council (NHMRC) of Australia Postgraduate Scholarship (APP1150679). RHO is funded in part through a National Health and Medical Research Council (NHMRC) of Australia Principal Research Fellowship (APP1155125).

Competing interests None declared.

Patient consent for publication Not required.

Ethics approval Ethics approval is not required for this systematic review because only published research will be examined. Dissemination will be through publication in a peer-reviewed journal and at conference presentations, and in the lead author’s doctoral thesis.

Provenance and peer review Not commissioned; externally peer reviewed.


The 4 Types of Validity in Research | Definitions & Examples

Published on September 6, 2019 by Fiona Middleton. Revised on June 22, 2023.

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

  • Construct validity : Does the test measure the concept that it’s intended to measure?
  • Content validity : Is the test fully representative of what it aims to measure?
  • Face validity : Does the content of the test appear to be suitable to its aims?
  • Criterion validity : Do the results accurately measure the concrete outcome they are designed to measure?

In quantitative research , you have to consider the reliability and validity of your methods and measurements.

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity, which deal with the experimental design and the generalizability of results.

Table of contents

  • Construct validity
  • Content validity
  • Face validity
  • Criterion validity
  • Frequently asked questions about types of validity

Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed, but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organizations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

There is no objective, observable entity called “depression” that we can measure directly. But based on existing psychological research and theory, we can measure depression based on a collection of symptoms and indicators, such as low self-confidence and low energy levels.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.


Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened and the research is likely suffering from omitted variable bias.

A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every form of algebra that was taught in the class. If some types of algebra are left out, then the results may not be an accurate indication of students’ understanding of the subject. Similarly, if she includes questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

You create a survey to measure the regularity of people’s dietary habits. You review the survey items, which ask questions about every meal of the day and snacks eaten in between for every day of the week. On its surface, the survey seems like a good representation of what you want to test, so you consider it to have high face validity.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a “gold standard” measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.

A university professor creates a new test to measure applicants’ English writing ability. To assess how well the test really does measure students’ writing ability, she finds an existing test that is considered a valid measurement of English writing ability, and compares the results when the same group of students take both tests. If the outcomes are very similar, the new test has high criterion validity.
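To make that calculation concrete, here is a minimal sketch of the correlation step using made-up scores (the numbers and variable names are purely illustrative):

```python
from scipy.stats import pearsonr

# Made-up scores for the same group of students on the new writing test and on
# an established ("criterion") writing test.
new_test = [62, 75, 81, 58, 90, 70, 66, 85]
criterion_test = [65, 72, 84, 55, 92, 74, 63, 88]

r, p_value = pearsonr(new_test, criterion_test)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
# A high positive r suggests the new test has good criterion validity.
```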


Face validity and content validity are similar in that they both evaluate how suitable the content of a test is. The difference is that face validity is subjective, and assesses content at surface level.

When a test has strong face validity, anyone would agree that the test’s questions appear to measure what they are intended to measure.

For example, looking at a 4th grade math test consisting of problems in which students have to add and multiply, most people would agree that it has strong face validity (i.e., it looks like a math test).

On the other hand, content validity evaluates how well a test represents all the aspects of a topic. Assessing content validity is more systematic and relies on expert evaluation of each question, analyzing whether each one covers the aspects that the test was designed to cover.

A 4th grade math test would have high content validity if it covered all the skills taught in that grade. Experts (in this case, math teachers) would have to evaluate the content validity by comparing the test to the learning objectives.

Criterion validity evaluates how well a test measures the outcome it was designed to measure. An outcome can be, for example, the onset of a disease.

Criterion validity consists of two subtypes depending on the time at which the two measures (the criterion and your test) are obtained:

  • Concurrent validity is a validation strategy where the scores of a test and the criterion are obtained at the same time.
  • Predictive validity is a validation strategy where the criterion variables are measured after the scores of the test.

Convergent validity and discriminant validity are both subtypes of construct validity . Together, they help you evaluate whether a test measures the concept it was designed to measure.

  • Convergent validity indicates whether a test that is designed to measure a particular construct correlates with other tests that assess the same or similar construct.
  • Discriminant validity indicates whether two tests that should not be highly related to each other are indeed not related. This type of validity is also called divergent validity .

You need to assess both in order to demonstrate construct validity. Neither one alone is sufficient for establishing construct validity.
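A common way to check both at once is to correlate the new measure with a similar measure and with an unrelated one; the sketch below uses hypothetical scale names and made-up scores purely for illustration:

```python
import pandas as pd

# Hypothetical scores: the new scale, an existing scale measuring a similar construct
# (convergent check), and a scale measuring an unrelated construct (discriminant check).
scores = pd.DataFrame({
    "new_depression_scale":      [10, 14, 22, 8, 30, 18],
    "existing_depression_scale": [12, 15, 20, 9, 28, 17],
    "extraversion_scale":        [25, 11, 18, 30, 14, 22],
})

print(scores.corr().round(2))
# Expect a high correlation with the existing depression scale (convergent validity)
# and a low correlation with the extraversion scale (discriminant validity).
```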

The purpose of theory-testing mode is to find evidence in order to disprove, refine, or support a theory. As such, generalizability is not the aim of theory-testing mode.

Due to this, the priority of researchers in theory-testing mode is to eliminate alternative causes for relationships between variables . In other words, they prioritize internal validity over external validity , including ecological validity .

It’s often best to ask a variety of people to review your measurements. You can ask experts, such as other researchers, or laypeople, such as potential participants, to judge the face validity of tests.

While experts have a deep understanding of research methods , the people you’re studying can provide you with valuable insights you may have missed otherwise.


Why is data validation important in research?


Data collection and analysis are among the most important aspects of conducting research. High-quality data allow researchers to interpret findings accurately, serve as a foundation for future studies, and lend credibility to the research. Research is therefore often scrutinized to rule out suspicions of fraud and data falsification; at times, even unintentional errors in data could be viewed as research misconduct. Hence, data integrity is essential to protect your reputation and the reliability of your study.

Owing to the very nature of research and the sheer volume of data collected in large-scale studies, errors are bound to occur. One way to avoid “bad” or erroneous data is through data validation.

What is data validation?

Data validation is the process of examining the quality and accuracy of the collected data before processing and analysing it. It not only ensures the accuracy but also confirms the completeness of your data. However, data validation is time-consuming and can delay analysis significantly. So, is this step really important?

Importance of data validation

Data validation is important for several aspects of a well-conducted study:

  • To ensure a robust dataset: The primary aim of data validation is to ensure an error-free dataset for further analysis. This is especially important if you or other researchers plan to use the dataset for future studies or to train machine learning models.
  • To get a clearer picture of the data: Data validation also includes ‘cleaning-up’ of data, i.e., removing inputs that are incomplete, not standardized, or not within the range specified for your study. This process could also shed light on previously unknown patterns in the data and provide additional insights regarding the findings.
  • To get accurate results: If your dataset has discrepancies, it will impact the final results and lead to inaccurate interpretations. Data validation can help identify errors, thus increasing the accuracy of your results.
  • To mitigate the risk of forming incorrect hypotheses: Only those inferences and hypotheses that are backed by solid data are considered valid. Thus, data validation can help you form logical and reasonable hypotheses.
  • To ensure the legitimacy of your findings: The integrity of your study is often determined by how reproducible it is. Data validation can enhance the reproducibility of your findings.

Data validation in research

Data validation is necessary for all types of research. For quantitative research, which utilizes measurable data points, the quality of data can be enhanced by selecting the correct methodology, avoiding biases in the study design, choosing an appropriate sample size and type, and conducting suitable statistical analyses.

In contrast, qualitative research, which includes surveys or behavioural studies, is prone to the use of incomplete and/or poor-quality data. This is because survey responses may be inaccurate and observational studies are subjective by nature. Thus, it is extremely important to validate data by incorporating a range of clear and objective questions in surveys, bullet-proofing multiple-choice questions, and setting standard parameters for data collection.

Importantly, for studies that utilize machine learning approaches or mathematical models, validating the data model is as important as validating the data inputs. Thus, for the generation of automated data validation protocols, one must rely on appropriate data structures, content, and file types to avoid errors due to automation.
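As a rough illustration of what an automated validation protocol can look like, the sketch below runs completeness, range, type, and duplicate checks on a hypothetical survey file (the column names, value ranges, and file name are assumptions made for this example, not recommendations from the article):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems found in a hypothetical survey dataset."""
    problems = []
    # Completeness: flag incomplete records
    missing = df.isna().sum()
    problems += [f"{col}: {n} missing values" for col, n in missing.items() if n > 0]
    # Range check: values outside the range specified for the study (illustrative range)
    if not df["age"].between(13, 18).all():
        problems.append("age values outside the 13-18 range specified for the study")
    # Standardisation/type check
    if not pd.api.types.is_numeric_dtype(df["score"]):
        problems.append("score column is not numeric")
    # Duplicate check
    if df.duplicated(subset="participant_id").any():
        problems.append("duplicate participant_id values")
    return problems

df = pd.read_csv("survey_responses.csv")   # hypothetical input file
for issue in validate(df):
    print("WARNING:", issue)
```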

Although data validation may seem like an unnecessary or time-consuming step, it is critical to the integrity of your study and well worth the effort. To learn more about how to validate data effectively, head over to Elsevier Author Services!



Research Data Management: Validate Data


What is Data Validation?

Data validation is important for ensuring regular monitoring of your data and for assuring all stakeholders that your data are of a high quality that reliably meets research integrity standards. It is also a crucial aspect of Yale's Research Data and Materials Policy, which states: "The University deems appropriate stewardship of research data as fundamental to both high-quality research and academic integrity and therefore seeks to attain the highest standards in the generation, management, retention, preservation, curation, and sharing of research data."

Data Validation Methods

Basic methods to ensure data quality — all researchers should follow these practices:

  • Be consistent and follow other data management best practices, such as data organization and documentation
  • Document any data inconsistencies you encounter
  • Check all datasets for duplicates and errors
  • Use data validation tools (such as those in Excel and other software) where possible

Advanced methods to ensure data quality — the following methods may be useful in more computationally-focused research:

  • Establish processes to routinely inspect small subsets of your data
  • Perform statistical validation using software and/or programming languages (see the sketch after this list)
  • Use data validation applications at point of deposit in a data repository
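For instance, the subset inspection and statistical validation steps above might be scripted along these lines (a minimal sketch; the file name, column name, and outlier threshold are hypothetical):

```python
import pandas as pd

df = pd.read_csv("measurements.csv")        # hypothetical dataset

# Routinely inspect a small random subset by eye
print(df.sample(n=10, random_state=0))

# Simple statistical validation: flag values more than 3 standard deviations
# from the mean as candidates for review (threshold chosen for illustration).
z = (df["measurement"] - df["measurement"].mean()) / df["measurement"].std()
outliers = df[z.abs() > 3]
print(f"{len(outliers)} potential outliers flagged for review")

# Duplicate check
print(f"{df.duplicated().sum()} exact duplicate rows found")
```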

Additional Resources for Data Validation

Data validation and quality assurance is often discipline-specific, and expectations and standards may vary. To learn more about data validation and data quality assurance, consider the information from the following U.S. government entities producing large amounts of public data:

  • U.S. Census Bureau Information Quality Guidelines
  • U.S. Geological Survey Data-Quality Management


The Role of Academic Validation in Developing Mattering and Academic Success

  • Published: 24 March 2022
  • Volume 63, pages 1368–1393 (2022)


  • Elise Swanson   ORCID: orcid.org/0000-0002-4529-9646 1 &
  • Darnell Cole 2  


We use survey data from three four-year campuses to explore the relationship between academic validation and student outcomes during students’ first 3 years in college using structural equation modeling. We examine both a psychosocial outcome (mattering to campus) and an academic outcome (cumulative GPA). We find that both frequency of interactions with faculty and feelings of academic validation from faculty are positively related to students’ feelings of mattering to campus and cumulative GPA in their third year. Our results suggest that academic validation, beyond the frequency of faculty–student interactions, is an important predictor of students’ psychosocial and academic success.


Data Availability

The data used for this analysis are restricted-use and under the purview of the Promoting At-promise Student Success project. Interested researchers may apply to access the data. The survey used for this research was compiled by researchers at the Pullias Center for Higher Education. Certain scales on the survey were used with permission from other research organizations; the survey instrument used for this study may not be used without appropriate permissions for all scales on the survey.

Code Availability

All analyses were conducted in Stata; code is available from the authors upon request.

A concern with this modeling decision is that our estimates of the relationships between validation and faculty interactions, respectively, and third-year GPA may include the indirect relationship between prior (e.g., T1) validation and faculty as well as the direct relationship between the T2 measurements and third-year GPA. When we include students’ high school, first semester, first year, second year, and third year GPA, we find no significant relationship between students’ first-year faculty interactions and second-year GPA and a small, marginally significant relationship between first-year validation and second-year GPA, mitigating this concern. We also estimate the model including lagged direct paths between first-year validation and faculty interactions and third-year GPA; we find similar results to those presented below affirming the importance of second-year validation for predicting third-year GPA, again mitigating concerns of bias in our main estimates. However, a conservative interpretation of our results is as the cumulative relationship between second-year student-initiated interactions with faculty and feelings of academic validation with GPA. Goodness-of-fit measures are similar across specifications.
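For readers who want to see the general shape of such a path model in code, the sketch below specifies a simplified two-outcome structural model with the open-source semopy package (assuming its Model/fit/inspect interface); the variable names and file are hypothetical, and this is not the authors' Stata specification, which includes measurement models and controls omitted here:

```python
import pandas as pd
import semopy

# Hypothetical column names standing in for the constructs discussed above; the
# model form is a simplification for illustration, not the authors' model.
model_desc = """
mattering_y3 ~ validation_y2 + faculty_interaction_y2
gpa_y3       ~ validation_y2 + faculty_interaction_y2 + hs_gpa
"""

data = pd.read_csv("survey_scores.csv")     # hypothetical analysis file
model = semopy.Model(model_desc)
model.fit(data)

print(model.inspect())           # path coefficients and standard errors
print(semopy.calc_stats(model))  # fit indices such as CFI and RMSEA
```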


This project received support from the Susan Thompson Buffett Foundation.

Author information

Authors and affiliations

Harvard University, 50 Church St, Fourth Floor, Cambridge, MA, 02138, USA

Elise Swanson

University of Southern California, Los Angeles, USA

Darnell Cole


Corresponding author

Correspondence to Elise Swanson.

Ethics declarations

Conflict of interest.

The authors have no conflicts of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We would like to thank Adrianna Kezar, Tatiana Melguizo, Ronald Hallett, Gwendelyn Rivera, KC Culver, Joseph Kitchen, Rosemary Perez, Robert Reason, Matt Soldner, Mark Masterton, Evan Nielsen, Cameron McPhee, Samantha Nieman, and all the other members of the broader mixed-methods evaluation team for designing and implementing the Longitudinal Survey of Thompson Scholars, for helping us get a better understanding of the program and providing feedback on previous versions of this manuscript. We would also like to thank Gregory Hancock for his assistance with the structural equation modeling. Finally, we would also like to thank the staff at the Thompson Scholars Learning Communities for their reflections and continued work to support at-promise students. This study received financial support from the Susan Thompson Buffett Foundation. Opinions are those of the authors alone and do not necessarily reflect those of the granting agency or of the authors’ home institutions.

See Tables 5, 6, and 7.


About this article

Swanson, E., Cole, D. The Role of Academic Validation in Developing Mattering and Academic Success. Res High Educ 63 , 1368–1393 (2022). https://doi.org/10.1007/s11162-022-09686-8

Download citation

Received : 03 March 2021

Accepted : 08 March 2022

Published : 24 March 2022

Issue Date : December 2022

DOI : https://doi.org/10.1007/s11162-022-09686-8


  • Academic achievement
  • Longitudinal analysis
  • Structural equation modeling


CORE MACHINE LEARNING

Revisiting feature prediction for learning visual representations from video.

February 15, 2024

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
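As an aside for readers unfamiliar with frozen-backbone evaluation, the sketch below shows the general pattern of training only a small task head on top of a fixed, pretrained encoder; it is a generic PyTorch illustration with placeholder modules and dimensions, not the V-JEPA code or architecture:

```python
import torch
import torch.nn as nn

# `encoder` stands in for a pretrained video encoder whose weights stay fixed
# while only a small task head is trained (all shapes here are placeholders).
encoder = nn.Linear(1024, 768)
for p in encoder.parameters():
    p.requires_grad = False             # freeze the pretrained representation

head = nn.Linear(768, 400)              # e.g. a 400-way classifier for Kinetics-400
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(clip_features: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():               # no gradients flow into the frozen encoder
        feats = encoder(clip_features)
    loss = loss_fn(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random stand-in data:
loss = probe_step(torch.randn(8, 1024), torch.randint(0, 400, (8,)))
```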

Adrien Bardes

Quentin Garrido

Xinlei Chen

Michael Rabbat

Mido Assran

Nicolas Ballas

Research Topics

Core Machine Learning




Published on 14.2.2024 in Vol 26 (2024)

Machine Learning–Based Prediction of Suicidality in Adolescents With Allergic Rhinitis: Derivation and Validation in 2 Independent Nationwide Cohorts

Authors of this article:


Original Paper

  • Hojae Lee 1, 2 * , MSc   ; 
  • Joong Ki Cho 3 * , MD   ; 
  • Jaeyu Park 1, 2 * , MSc   ; 
  • Hyeri Lee 1, 2 * , MSc   ; 
  • Guillaume Fond 4 , MD, PhD   ; 
  • Laurent Boyer 4 , MD, PhD   ; 
  • Hyeon Jin Kim 1, 2 , MSc   ; 
  • Seoyoung Park 5 , BS   ; 
  • Wonyoung Cho 1 , PhD   ; 
  • Hayeon Lee 2, 5 , PhD   ; 
  • Jinseok Lee 5, 6 , PhD   ; 
  • Dong Keon Yon 1, 2, 7 , MD, PhD  

1 Department of Regulatory Science, Kyung Hee University, Seoul, Republic of Korea

2 Center for Digital Health, Medical Science Research Institute, Kyung Hee University College of Medicine, Seoul, Republic of Korea

3 Department of Pediatrics, Columbia University Irving Medical Center, New York, NY, United States

4 Assistance Publique-Hôpitaux de Marseille, Research Centre on Health Services and Quality of Life, Aix Marseille University, Marseille, France

5 Department of Biomedical Engineering, Kyung Hee University, Yongin, Republic of Korea

6 Department of Electronics and Information Convergence Engineering, Kyung Hee University, Yongin, Republic of Korea

7 Department of Pediatrics, Kyung Hee University College of Medicine, Seoul, Republic of Korea

*these authors contributed equally

Corresponding Author:

Dong Keon Yon, MD, PhD

Department of Regulatory Science, Kyung Hee University

23 Kyungheedae-ro, Dongdaemun-gu

Seoul, 02447

Republic of Korea

Phone: 82 2 6935 2476

Fax:82 504 478 0201

Email: [email protected]

Background: Given the additional risk of suicide-related behaviors in adolescents with allergic rhinitis (AR), it is important to use the growing field of machine learning (ML) to evaluate this risk.

Objective: This study aims to evaluate the validity and usefulness of an ML model for predicting suicide risk in patients with AR.

Methods: We used data from 2 independent survey studies, the Korea Youth Risk Behavior Web-based Survey (KYRBS; n=299,468) as the original data set and the Korea National Health and Nutrition Examination Survey (KNHANES; n=833) as the external validation data set, to predict suicide risk in adolescents with AR aged 13 to 18 years; 3.45% (10,341/299,468) and 1.4% (12/833) of the patients had attempted suicide in the KYRBS and KNHANES studies, respectively. The outcome of interest was the risk of a suicide attempt. We selected various ML-based models with hyperparameter tuning in the discovery phase and performed area under the receiver operating characteristic curve (AUROC) analyses on the training, test, and external validation data.

Results: The study data set included 299,468 (KYRBS; original data set) and 833 (KNHANES; external validation data set) patients with AR recruited between 2005 and 2022. The best-performing ML model was the random forest model, with a mean AUROC of 84.12% (95% CI 83.98%-84.27%) in the original data set. Applying this model to the external validation data set yielded the best performance among the models, with an AUROC of 89.87% (sensitivity 83.33%, specificity 82.58%, accuracy 82.59%, and balanced accuracy 82.96%). In terms of feature importance, the 5 most important features for predicting suicide attempts in adolescent patients with AR were depression, stress status, academic achievement, age, and alcohol consumption.
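For illustration, the kinds of external-set metrics and feature importances reported above could be computed as in the following sketch (scikit-learn, with randomly generated placeholder data and feature names taken from the list above; this is not the authors' code or data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# Placeholder data standing in for the training (KYRBS) and external (KNHANES) sets.
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)
X_ext, y_ext = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
feature_names = ["depression", "stress", "academic_achievement", "age", "alcohol_use"]

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Threshold the predicted probabilities to obtain the kinds of metrics quoted above
y_pred = (rf.predict_proba(X_ext)[:, 1] >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_ext, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("balanced accuracy:", balanced_accuracy_score(y_ext, y_pred))

# Impurity-based feature importances; ranking these yields a list like the one reported
# (the paper's exact importance method is not specified in this excerpt).
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(name, round(imp, 3))
```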

Conclusions: This study emphasizes the potential of ML models in predicting suicide risks in patients with AR, encouraging further application of these models in other conditions to enhance adolescent health and decrease suicide rates.

Introduction

Allergic rhinitis (AR) is a common atopic disorder that affects approximately 14% of the global population [ 1 - 3 ]. AR, called allergic rhinoconjunctivitis when the eyes are involved, is an inflammatory condition characterized by at least one of the following symptoms: nasal congestion, rhinorrhea, itching, and sneezing [ 4 , 5 ]. In addition, it has been found to affect quality of life measures such as sleep, physical and social functioning, and learning and memory [ 6 ]. Furthermore, AR has been found to be associated with depressive symptoms, including suicidal ideation and suicide attempts [ 7 - 13 ]. Suicide rates in adolescents continue to increase, and suicide is the leading cause of adolescent death in South Korea [ 14 ]. Adolescents are by nature susceptible to mental health problems owing to the many transitions involved in this period of their life, including changes in school and living situations, the pressures of fitting into peer groups, and building their own identity; these can invoke helplessness, insecurity, stress, and a loss of control [ 14 ]. Owing to this substantial burden in adolescents, it is important to further understand and investigate potential methods of mitigating the risk that AR adds to an already pressing issue. One possible explanation is that suicide rates increase because of the decrease in quality of life among adolescents with AR [ 15 ]. Therefore, we aimed to predict suicide attempts among adolescents with AR using machine learning (ML) models.

Suicide prediction is elusive and thus adds to the challenges of suicide prevention worldwide. No practical methods for anticipating individual suicides or stratifying individuals according to risk have been well established [ 16 , 17 ]; however, ML-based models are a potential means of more accurately identifying adolescents at risk of suicide. A systematic review on the prediction of self-injurious thoughts and behaviors with ML determined that, despite its limited application, ML has enabled significant advances in suicide prediction [ 18 ]. Another review found that ML has the potential to improve suicide prediction compared with traditional suicide prediction models [ 18 ]. Such studies illustrate the rapidly growing potential of ML.

Given the additional risk of suicide-related behaviors in adolescents with AR, it would be relevant to use the growing field of ML to better evaluate this risk. Using nationwide population data, this study aimed to develop an ML-based model to predict suicide attempts among patients with AR using 2 independent nationwide cohorts in South Korea. We expect that this ML model produced from these data will have a high balanced accuracy and area under the receiver operating characteristic curve (AUROC) and consequently assist in better understanding suicide risk in adolescents with AR.

Study Design and Participants

This study aimed to develop an ML model to predict suicidality in Korean adolescents aged 13 to 18 years using clinical features extracted from 2 large independent data sets: the Korea Youth Risk Behavior Web-based Survey (KYRBS) and the Korea National Health and Nutrition Examination Survey (KNHANES) [ 19 , 20 ]. Figure 1 shows the workflow diagrams of the KYRBS and KNHANES data sets, which both offer nationally representative samples and estimates of the total adolescent population in South Korea. The original sample size for KYRBS was 1,067,169; after excluding 767,701 participants who did not meet the inclusion criteria, the final study population in the KYRBS data set was 299,468. Similarly, the KNHANES data set initially comprised 152,791 participants. After excluding 140,724 individuals aged <13 or >19 years, 869 individuals with missing values on school performance, and 10,365 individuals without AR, the final study population in the KNHANES data set was 833.

[Figure 1: participant selection workflow for the KYRBS and KNHANES data sets]

We included Korean adolescents aged between 13 and 18 years who completed the survey between 2005 and 2021 in KYRBS and between 1998 and 2021 in KNHANES. The outcome, suicidality, was defined as having attempted suicide more than once within 1 year [ 12 ], and covariates included age, sex, BMI (kg/m 2 ), residential area, household income, parents’ level of education, academic achievement, smoking status, stress status, and feelings of sadness and despair.

We trained, validated, and externally tested the ML model’s predictive accuracy and potential clinical efficacy in identifying the presence of mental health conditions using data from adolescents who met the same inclusion and exclusion criteria as those in the KYRBS data set. This study followed the guidelines outlined in the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement (Table S1 in Multimedia Appendix 1 ).

Ethical Considerations

The study protocol was approved by the institutional review board of the Korean Centers for Disease Control and Prevention Agency and Kyung Hee University (2022-06-042), and all participants provided written informed consent.

Variables and Algorithm Selection

To address the class imbalance in our data set, we used the synthetic minority oversampling technique (SMOTE) to balance the training data. SMOTE synthesizes new samples from existing data using k-nearest neighbors and inserts them into the original data set [ 21 ]. The resulting data set was randomly divided at a 4:1 ratio into a base training set (462,203/577,854, 79.98%) and a base test set (115,651/577,854, 20.01%), with an equal distribution of the classes of patient data. Because this study aimed to develop a predictive model with a small number of variables and good performance, a model trained on the base training set was needed for comparison with models trained on fewer variables, and a corresponding test set was required to evaluate each training set, including the base training set. Continuous variables were compared using the 2-tailed t test or Mann-Whitney U test, and categorical variables were compared using the chi-square test [ 22 ]. The odds ratios of the variables were determined by logistic regression (method: enter). Data set variables were analyzed using SAS software (version 9.3; SAS Institute Inc).
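As a concrete illustration of this step, the following is a minimal sketch of SMOTE rebalancing followed by a stratified 4:1 split, consistent with the sample counts reported above. It is not the authors' code; the column name suicide_attempt and the assumption that all features are already numerically encoded are illustrative.

```python
# Minimal sketch (not the authors' code): SMOTE rebalancing followed by a
# stratified 4:1 train/test split. Column names are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

def rebalance_and_split(df: pd.DataFrame, target: str = "suicide_attempt", seed: int = 42):
    X, y = df.drop(columns=[target]), df[target]

    # SMOTE synthesizes minority-class records from their k-nearest neighbors,
    # assuming all features have already been numerically encoded.
    X_res, y_res = SMOTE(random_state=seed).fit_resample(X, y)

    # 4:1 split with class proportions preserved via stratification.
    return train_test_split(X_res, y_res, test_size=0.2, stratify=y_res, random_state=seed)

# Usage (df = a preprocessed survey table, assumed):
# X_train, X_test, y_train, y_test = rebalance_and_split(df)
```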

In this study, as depicted in Figure 2 , we analyzed the original KYRBS data set. The data set was divided into training and test data sets using a 4:1 ratio, with the training data set being used for model development and the test data set being used for model evaluation. We applied various ML algorithms to the training data set and assessed their performance based on the AUROC scores on the test data set. Models that exhibited high performance were selected for further investigation.
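A sketch of this screening step is shown below. The candidate tree-based models and the AUROC ranking mirror the description above, while the specific settings (random seeds, default hyperparameters) and the variable names carried over from the previous sketch are assumptions.

```python
# Sketch of the model-screening step: fit each candidate tree-based model on
# the training split and rank the candidates by AUROC on the held-out test
# split. X_train, y_train, X_test, y_test follow from the previous sketch.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=42),
    "lightgbm": LGBMClassifier(random_state=42),
}

aurocs = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    positive_proba = model.predict_proba(X_test)[:, 1]   # probability of the positive class
    aurocs[name] = roc_auc_score(y_test, positive_proba)

print(sorted(aurocs.items(), key=lambda kv: kv[1], reverse=True))
```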

[Figure 2: model development and evaluation workflow for the KYRBS data set]

For external validation, we used an additional data set from the KNHANES, which contained the same columns as the original KYRBS data set. To enhance our understanding of the model’s performance and its variability, we applied bootstrapping techniques. Bootstrapping was repeated 10,000 times to evaluate the model’s performance on the external data set [ 23 ]. This process involved creating numerous resamples from the data set, with each sample being used to calculate the model’s performance metrics. We calculated the mean and SE of these performance metrics across all the bootstrap samples. This technical approach provided a robust measure of the model’s performance, accounting for variability and uncertainty in the external data set. The performance of the selected models on this external data set was evaluated, and their performance metrics were compared with those obtained from the training and test data sets. In summary, the train results were derived from the training data set, the test results were derived from the test data set, and the external results were derived from the external validation data set provided by KNHANES.
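The bootstrap summary described here could look roughly like the sketch below, shown for AUROC only. The resampling scheme and the use of the bootstrap standard deviation as the SE are standard choices rather than details taken from the paper, and X_ext and y_ext (the KNHANES feature matrix and labels) are assumed.

```python
# Sketch of the 10,000-iteration bootstrap used to summarize performance on
# the external (KNHANES) data set, shown here for AUROC only.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(model, X_ext, y_ext, n_boot=10_000, seed=42):
    X_ext, y_ext = np.asarray(X_ext), np.asarray(y_ext)
    rng = np.random.default_rng(seed)
    scores = []
    n = len(y_ext)
    while len(scores) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample records with replacement
        if len(np.unique(y_ext[idx])) < 2:        # AUROC needs both classes present
            continue
        proba = model.predict_proba(X_ext[idx])[:, 1]
        scores.append(roc_auc_score(y_ext[idx], proba))
    scores = np.asarray(scores)
    # Bootstrap SE is conventionally the SD of the bootstrap replicates.
    return scores.mean(), scores.std(ddof=1)
```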

As shown in Figure 1 , in the data preprocessing phase of this study, we took several steps to clean and prepare the data for efficient analysis. We used SAS software for data processing, which included categorizing variables and handling missing values in the survey data sets. Moreover, to address variables present in KYRBS but absent in KNHANES, we filled the missing values in KNHANES with the median values from KYRBS. These steps included handling missing values, categorizing categorical variables, and scaling numerical features using SAS software. The preprocessing steps aimed to ensure that the data were in a suitable format for the subsequent application of various ML algorithms. Effective preprocessing is crucial, as it can substantially affect the performance and generalizability of the models being developed. Moreover, we used a 10-fold cross-validation approach to assess the performance of the ML models more reliably. This method partitions the original data set into 10 equal-sized subsets, with each subset used as a test data set once while the remaining subsets serve as the training data set. The process was repeated 10 times, and the performance metrics, such as the AUROC score, sensitivity, specificity, accuracy, balanced accuracy score, precision, and F 1 -score, were averaged over these iterations.
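A minimal sketch of such a 10-fold cross-validation with the metric panel named above follows. It assumes the rebalanced training data from the earlier sketches and uses scikit-learn scorers, with specificity obtained via a custom scorer because it has no built-in name.

```python
# Sketch of 10-fold cross-validation reporting the metric panel named above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

scoring = {
    "auroc": "roc_auc",
    "sensitivity": "recall",                                # recall of the positive class
    "specificity": make_scorer(recall_score, pos_label=0),  # recall of the negative class
    "accuracy": "accuracy",
    "balanced_accuracy": "balanced_accuracy",
    "precision": "precision",
    "f1": "f1",
}

cv_results = cross_validate(RandomForestClassifier(random_state=42),
                            X_train, y_train, cv=10, scoring=scoring)
for name in scoring:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.4f}, sd={fold_scores.std(ddof=1):.4f}")
```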

To estimate the uncertainty and variability of our results, we calculated 95% CIs for each performance metric, including the AUROC score, sensitivity, specificity, accuracy, balanced accuracy score, precision, and F 1 -score, during the 10-fold cross-validation process applied to the train and test data sets. The 95% CI provides a range of plausible values for the performance metrics and is a useful tool for determining the stability and generalizability of the models. Data processing was performed using SAS software, and ML analysis was performed using Python (version 3.9.16), TensorFlow-gpu (version 2.6.0), Keras (version 2.6.0), NumPy (version 1.23.5), pandas (version 1.5.3), scikit-learn (version 1.2.2), Matplotlib (version 3.7.1), and shap (version 0.42.1). The ML models used were tree-based models: random forest, XGBoost, AdaBoost, and light gradient boosting. We tuned the model hyperparameters using GridSearch, with the objective of maximizing the AUROC score. GridSearch is an exhaustive search method that systematically explores a range of hyperparameter combinations, evaluating the performance of each combination on the given data set. From the range of hyperparameters explored for the random forest model, we selected 100 estimators, a maximum depth of 6, and sqrt as the maximum features. This approach aimed to improve the performance and generalizability of our models, ultimately leading to more accurate and reliable predictions.
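The selected values (100 estimators, maximum depth 6, "sqrt" maximum features) come from the text above; the rest of the grid in the following sketch is an illustrative assumption, not the authors' actual search space.

```python
# Sketch of the grid search over random forest hyperparameters, scored by AUROC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],        # illustrative candidate values
    "max_depth": [4, 6, 8],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)  # reported selection: n_estimators=100, max_depth=6, max_features="sqrt"
```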

Subsequently, we focused on a detailed analysis of feature impact within the model. Feature importance was assessed using the mean decrease in impurity within the random forest model, indicating how each feature contributes to more uniform node splits. In addition, the Seaborn library was used for visualizing this importance, enhancing the interpretability of the results, and aiding in the identification of the most impactful features for the model’s predictive accuracy. This approach provides effective feature selection and model optimization.
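A sketch of this ranking and its Seaborn visualization follows, assuming X_train is a pandas DataFrame and best_rf is the tuned random forest from the grid search sketch above.

```python
# Sketch of the mean-decrease-in-impurity feature ranking and its bar plot.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

best_rf = search.best_estimator_          # tuned random forest (assumed from the grid search)
importances = (pd.Series(best_rf.feature_importances_, index=X_train.columns)
               .sort_values(ascending=False))

sns.barplot(x=importances.values, y=importances.index, color="steelblue")
plt.xlabel("Mean decrease in impurity")
plt.tight_layout()
plt.show()
```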

To interpret and gain insights into the model’s predictions, we calculated the Shapley Additive Explanations (SHAP) values for the random forest model. SHAP is a popular model-agnostic, local explanation approach designed to explain any given classifier. Lundberg and Lee [ 24 ] proposed the SHAP value as a unified approach to explaining the output of any ML model. We used the force plot and waterfall plot of the random forest model. The force plot visualizes the contribution of each feature to the model’s prediction for a specific instance, showing how each feature pushes the model’s output away from the base value. In contrast, the waterfall plot provides a detailed, step-by-step breakdown of how each feature contributes to moving the model’s output from the expected value to the actual prediction.
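The force and waterfall plots could be produced roughly as in the sketch below. The choice of the first test instance and of the positive class (index 1) are illustrative assumptions, and the exact calls may vary slightly with the shap version.

```python
# Sketch of SHAP explanations for one instance of the tuned random forest.
import shap

explainer = shap.TreeExplainer(best_rf)
explanation = explainer(X_test)           # Explanation with shape (samples, features, classes)

# Waterfall plot: stepwise contribution of each feature for one example,
# explaining the predicted probability of the positive class.
shap.plots.waterfall(explanation[0, :, 1])

# Force plot for the same example via the legacy API: base value and SHAP
# values for class 1, rendered with matplotlib.
shap.force_plot(explainer.expected_value[1], explanation.values[0, :, 1],
                X_test.iloc[0], matplotlib=True)
```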

Software and Libraries

All computations, model training, and evaluations were executed using Python (version 3.9.16), TensorFlow-gpu (version 2.6.0), Keras (version 2.6.0), NumPy (version 1.23.5), pandas (version 1.5.3), scikit-learn (version 1.2.2), and Matplotlib (version 3.7.1) for ML tasks and data wrangling. Visualization was facilitated using Matplotlib (version 3.7.2), Seaborn (version 0.12.2), and shap (version 0.42.1).

Demographic Characteristics

This study was conducted using nationwide population data from 2 independent cohorts in South Korea to develop and investigate an ML-based model for predicting suicide attempts in patients with AR. The demographic characteristics of the study population were as follows: both cohorts consisted of patients with AR, with the KYRBS cohort including 299,468 patients and the KNHANES cohort comprising 833 adolescents aged 13 to 18 years. Table 1 shows the baseline characteristics of the KYRBS and KNHANES. In the original KYRBS cohort, the sex distribution revealed that among 299,468 patients, 152,789 (51.02%) were male patients and 146,679 (48.98%) were female patients. In the externally validated KNHANES cohort, comprising 833 patients, 492 (59.06%) were male patients and 341 (40.94%) were female patients. The patient samples in both cohorts encompassed diverse socioeconomic backgrounds, including varying levels of education, income, and occupation. Overall, the study included a total of 300,301 patients with AR from diverse demographic backgrounds, ensuring a representative sample for the development and evaluation of the ML model. By considering these demographic characteristics, this study aimed to provide valuable insights into the risk of suicide attempts among individuals with AR, with a particular focus on the adolescent population.

a AR: allergic rhinitis.

b According to Asia-Pacific guidelines, BMI is divided into 4 groups: underweight (<18.5 kg/m 2 ), normal (18.5-22.9 kg/m 2 ), overweight (23.0-24.9 kg/m 2 ), and obese (≥25.0 kg/m 2 ).

c N/A: not applicable.

d Stress was defined by receipt of mental health counseling owing to stress.

ML Model Results

As shown in Figure 3 and Figure S1 in Multimedia Appendix 1 , upon conducting extensive model evaluations, it was found that the random forest model was the best model in predicting suicide attempts in patients with AR. The train data results revealed that the random forest model achieved a sensitivity of 76.83 (95% CI 76.31-77.35), a specificity of 75.62 (95% CI 75.04-76.20), an accuracy of 76.22 (95% CI 76.07-76.38), a balanced accuracy of 76.22 (95% CI 76.07-76.38), a precision of 75.91 (95% CI 75.57-76.25), an F 1 -score of 76.37 (95% CI 76.19-76.54), and an AUROC of 84.12 (95% CI 83.98-84.27).

[Figure 3: performance comparison of the ML models]

In contrast, the AdaBoost model yielded slightly different results with a sensitivity of 78.29 (95% CI 78.11-78.46), a specificity of 75.02 (95% CI 74.82-75.22), an accuracy of 76.65 (95% CI 76.56-76.75), a precision of 75.81 (95% CI 75.68-75.94), an F 1 -score of 77.03 (95% CI 76.93-77.12), and an AUROC of 84.42 (95% CI 84.34-84.50).

However, when these models were evaluated on a separate test set, their performance varied. The random forest model obtained a sensitivity of 77.61 (95% CI 77.43-77.79), a specificity of 75.03 (95% CI 74.83-75.23), an accuracy of 76.32 (95% CI 76.16-76.49), a balanced accuracy of 76.32 (95% CI 76.16-76.49), a precision of 75.66 (95% CI 75.49-75.83), an F 1 -score of 76.62 (95% CI 76.46-76.78), and an AUROC of 84.18 (95% CI 84.07-84.28). Conversely, the AdaBoost model showed a sensitivity of 78.14 (95% CI 77.95-78.33), a specificity of 75.18 (95% CI 74.78-75.58), an accuracy of 76.66 (95% CI 76.40-76.92), a balanced accuracy of 76.66 (95% CI 76.40-76.92), a precision of 75.89 (95% CI 75.57-76.21), an F 1 -score of 77.00 (95% CI 76.77-77.23), and an AUROC of 84.27 (95% CI 84.06-84.48).

For external validation, an independent data set, KNHANES, was used. The random forest model achieved a sensitivity of 91.72 (95% CI 91.55-91.88), a specificity of 77.36 (95% CI 77.33-77.39), an accuracy of 77.57 (95% CI 77.54-77.60), a balanced accuracy of 84.54 (95% CI 84.46-84.62), and an AUROC of 89.84 (95% CI 89.78-89.90). Meanwhile, the AdaBoost model’s external validation results revealed a sensitivity of 75.09 (95% CI 74.84-75.34), a specificity of 82.09 (95% CI 82.06-82.11), an accuracy of 81.99 (95% CI 81.96-82.01), a balanced accuracy of 78.59 (95% CI 78.46-78.72), and an AUROC of 89.12 (95% CI 89.06-89.18).

On the basis of the comprehensive results, the random forest model demonstrated superior performance compared with the AdaBoost model when evaluated on both internal and external data sets. In addition, the area under the precision-recall curve for the random forest model, a measure of model performance under conditions of class imbalance, was 81.98 (95% CI 79.88-84.08), as shown in Figure S1 in Multimedia Appendix 1 . This indicates the model’s robust ability to maintain precision across various levels of recall.

Feature Importance

Table 2 shows that the random forest model identified sadness and despair (53.1%) as the most influential feature in predicting suicide attempts in patients with AR, followed by stress status (28.35%), academic achievement (5.18%), age (4.08%), alcohol consumption (2.96%), household income (1.65%), sex (1.56%), smoking status (1.33%), BMI (kg/m 2 ; 0.69%), region (0.47%), parents’ highest educational level (0.27%), atopic dermatitis (0.23%), and asthma (0.14%) in descending order of importance.

a According to Asia-Pacific guidelines, BMI is divided into 4 groups: underweight (<18.5 kg/m 2 ), normal (18.5-22.9 kg/m 2 ), overweight (23.0-24.9 kg/m 2 ), and obese (≥25.0 kg/m 2 ).

We performed a deeper visual interpretation of the SHAP values within our ML model. Figure S2 in Multimedia Appendix 1 shows a waterfall plot, distinctively showcasing the cumulative contribution of each feature to a single prediction. We interpreted individual predictions by starting from the initial estimate and sequentially incorporating the influence of each feature to reach the final prediction. E[f(X)] refers to the average predicted output of the model across the entire data set, providing insight into the model’s overall prediction tendency. The starting point of the illustration, denoted as E[f(X)]=0.50, represents the model’s average prediction for the given data set. Among the variables, sadness and despair stood out, boosting the prediction by 0.16 and ranking as the most influential factor. Conversely, stress status, school performance, and sex reduced the prediction by 0.1, 0.02, and 0.01, respectively. This visualization offers clear insight into the influence each feature wields in predicting adolescent suicidal thinking. Our ML model notably relies on the sadness and despair and stress status features. Moreover, in the force plot, features pushing the prediction higher are shown in red, whereas those pushing the prediction lower are shown in blue, clearly displaying the push and pull effect of each feature on the model’s prediction. This type of visualization allows us to see the balance of these effects for each individual prediction, further clarifying the roles of sadness and despair, stress status, and other features in assessing the risk of adolescent suicide attempts.

Code Availability

On the basis of the results of the ML model, we established a web-based application for policy makers and health system managers to support their decision-making in cases involving suicide attempts in adolescents with AR [ 25 ]. An example of the web interface and its results is shown in Figure S3 in Multimedia Appendix 1 . Custom code for the website is available on the internet [ 26 ].

Principal Findings

The study results showed that ML models can predict suicide attempts in patients with AR with relatively high accuracy. The random forest model was the best ML model for predicting suicide attempts among Korean adolescents with AR, with an AUROC of 84.12% (original data set) and 89.87% (external validation data set). In the feature importance analysis, the 5 most important features for predicting suicide attempts in adolescent patients with AR were depression, stress status, academic achievement, age, and alcohol consumption.

To our knowledge, this is the first study to use an ML model in the context of patients with AR and suicidality, especially at this population level. These results reinforce the importance for clinicians to pay close attention to atopic conditions such as AR when screening for suicide risk as well as to help identify the most important risk factors in such patients.

Comparison With Previous Studies

In concordance with our findings, studies on suicide-related behavior prediction using ML have shown great promise. A study found that ML for suicide risk prediction in children and adolescents with electronic health records was able to detect 53% to 62% of suicide-positive participants with 90% specificity [ 27 ], and a case-control study of first-time suicide attempts with a cohort of >45,000 patients demonstrated accurate and robust first-time suicide attempt prediction [ 28 ], with the best predicting model achieving an AUROC of 0.932. A study that used the Korea Welfare Panel Study to develop an ML algorithm determined that >80% of individuals at risk of suicide-related behaviors could be predicted by various mental and socioeconomic characteristics of the respondents [ 29 ]. In addition, ML together with in-person screening has been found to result in the best suicide risk prediction [ 30 ], illustrating its potential to be used by clinicians in the medical field. These studies, as well as our study, support the continued need to build and improve ML models for predicting suicide risk, especially for at-risk patients.

As discussed, adolescents remain at a high risk for suicide-related behaviors because of their unique social situation. A study identified that significant risk factors for suicide in youth include a history of mental disorders, previous suicide attempts, impulsivity, family structure or environment, interpersonal strain, and school problems and academic stress, among others [ 31 ]. This is reflected in the findings of this study, which showed adolescents’ age, academic achievement, BMI group, and household income to be among the most important contributors to suicide attempt prediction in adolescents with AR. In addition, it has been observed that atopic dermatitis and asthma, although less common than the other risk factors, contribute to this already high burden of risk.

Plausible Mechanism

This study shows the importance of understanding what puts adolescents at risk for suicide, especially in the context of AR. There is a proposed pathogenic mechanism that connects atopy and its associated risk of increased suicidality. Allergic inflammatory mediators, interleukin (IL)-4, IL-5, and IL-13, are released and perpetuated by “allergic” helper T subtype 2 (TH2) cells [ 32 , 33 ]. Along with atopic dermatitis and asthma, AR is associated with systemic increases in such cytokines [ 34 ]. Early life overexposure to IL-4, which can occur because of TH2 sensitization from allergic disease, has been reported to reduce myelination and lead to cognitive impairment and developmental delays [ 35 ], and these effects have been found to be inhibited with IL-4 neutralization [ 36 ]. Allergy-mediated cytokines can also lead to aberrations in rapid eye movement (REM) sleep, increased REM latency, increased arousal, and decreased REM duration [ 37 ], thereby reducing sleep quality, quality of life, and overall happiness. In addition, TH2 sensitization has been shown to possibly lead to negative effects on the developing brain, leading to increased attention-deficit/hyperactivity disorder, depression, anxiety, and suicidal ideation [ 38 ]. These mechanisms of action point toward a functional correlation between atopy and psychological disorders, including depression. This study helps to elucidate which other risk factors further contribute to such patients’ already increased risk.

Strengths and Limitations

This study had several limitations. As the data sets only contained Korean adolescents, this model may not extend to the global adolescent population. South Korea’s unique cultural and environmental setting may particularly affect the generalizability of the study. In the validation data set, participants aged <13 and >19 years were treated as missing data, and only patients with AR were analyzed, resulting in a low figure of 1%. This is because the KNHANES data set targets all ages, and not just adolescents; hence, it lacks the specificity for adolescents compared with the KYRBS data set. However, this data set represents South Korea and is used in studies as an external validation for the KYRBS data set [ 39 - 41 ]. In addition, the risk calculator that we produced was purely created for academic purposes, and its application should be limited to that scope. It is to be used as an example of what could be developed in the future with further refinement of ML models and our understanding of AR and suicide risk. The observed results are not intended to guide clinical management at this stage. As for the strengths of our study, to the best of our knowledge, this is the first study to create an ML model to predict suicide attempts in adolescents with AR. It is important to continue to investigate the role of atopic conditions in exacerbating suicide risk and to further understand how such patients may be affected by their disease process in the context of depression and suicide-related behaviors. Our model has the potential to make a significant impact on improving suicide risk assessment, early identification, and effective interventions for patients with AR, and it is important to further investigate the usefulness of ML and suicidality in patients with atopic diseases.

Clinical and Policy Implications

The findings of this study have various implications for clinicians and policy makers. They reveal the importance of screening adolescent patients with allergic diseases such as AR for suicidality. It can be assumed that as symptoms of atopy worsen, the risk for suicide also increases, and thus, it is important to encourage physicians to treat and pay close attention to their patients with allergies. These findings may also encourage psychiatrists to begin screening for allergies in their patients with depression or those at risk of suicide attempts. Moreover, this ML-derived algorithm can be used by clinicians to independently screen their patients for the risk of suicide attempts. This can be done quickly and efficiently in the outpatient setting and can be incorporated as a tool to improve health outcomes for patients with atopic diseases. The relatively high accuracy of this model encourages further research into developing similar ML models for other atopic diseases, including atopic dermatitis and asthma. For policy makers, these findings illustrate the importance of raising awareness of the contribution of allergic diseases, especially untreated ones, to increased rates of suicide. As part of this awareness, it is critical to highlight the potential benefits of interventions such as anti-inflammatory diets and fasting on mental health [ 42 , 43 ]. If the general population understands the risks, they can become more diligent in bringing adolescent patients to their clinicians to be screened and treated for their allergies. Clinicians will then be able to use an ML model such as the one developed in this study to understand each patient’s individualized suicide risk based on their various risk factors, including academic achievement and stress status. From this perspective, this ML algorithm holds great potential for improving the lives of adolescents affected by AR and its consequences.

Conclusions

ML modeling is a new and innovative field of study that shows great potential in predicting suicide, a task that has proven difficult for clinicians thus far. This study confirms this potential and demonstrates its accuracy in the context of AR. Being able to identify patients with a specific risk factor and understand their unique risks for suicide is important and relevant for clinicians across specialties. This encourages the development of ML models not only in other atopic conditions such as asthma and atopic dermatitis but also in conditions outside of atopy. Further research should be conducted to investigate the utility of these ML models in the clinical field with the goal of decreasing suicide rates in the already vulnerable adolescent population. Although much research remains to be done, this new and exciting field of ML holds promise for improving the health of patients with atopic diseases.

Acknowledgments

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant HV22C0233). The funders had no role in the study design, data collection, data analysis, data interpretation, or writing of the manuscript.

Data Availability

Data are available upon reasonable request. Study protocol and statistical code are available from DKY. The data set is available from the Korea Disease Control and Prevention Agency through a data use agreement.

Authors' Contributions

DKY had full access to all the data in the study and took responsibility for the integrity of the data and the accuracy of the data analysis. All the authors approved the final version of the manuscript before submission. Hojae L, JKC, JP, Hyeri L, Hayeon L, JL, and DKY proposed the study concept and design; Hojae L, JKC, JP, Hyeri L, Hayeon L, JL, and DKY performed the acquisition, analysis, or interpretation of data; Hojae L, JKC, JP, Hyeri L, Hayeon L, JL, and DKY drafted the manuscript; all authors critically revised the manuscript for important intellectual content; Hojae L, JKC, JP, Hyeri L, Hayeon L, JL, and DKY did the statistical analysis; and Hayeon L, JL, and DKY supervised the study. DKY is the guarantor for this study. Hojae L, JKC, JP, and Hyeri L contributed equally as joint first authors. Hayeon L, JL, and DKY contributed equally as corresponding authors. Hayeon L is a senior author. The corresponding authors (Hayeon L, JL, and DKY) attest that all listed authors met the authorship criteria and that no others meeting the criteria have been omitted.

Conflicts of Interest

None declared.

Supplementary material and Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement.

  • Koo MJ, Kwon R, Lee SW, Choi YS, Shin YH, Rhee SY, et al. National trends in the prevalence of allergic diseases among Korean adolescents before and during COVID-19, 2009-2021: a serial analysis of the national representative study. Allergy. Jun 2023;78(6):1665-1670. [ CrossRef ] [ Medline ]
  • Lee K, Lee H, Kwon R, Shin YH, Yeo SG, Lee YJ, et al. Global burden of vaccine-associated anaphylaxis and their related vaccines, 1967-2023: a comprehensive analysis of the international pharmacovigilance database. Allergy (Forthcoming). Dec 10, 2023 [ CrossRef ] [ Medline ]
  • Shin YH, Hwang J, Kwon R, Lee SW, Kim MS, GBD 2019 Allergic Disorders Collaborators; et al. Global, regional, and national burden of allergic disorders and their risk factors in 204 countries and territories, from 1990 to 2019: a systematic analysis for the Global Burden of Disease Study 2019. Allergy. Aug 2023;78(8):2232-2254. [ CrossRef ] [ Medline ]
  • Brown T. Diagnosis and management of allergic rhinitis in children. Pediatr Ann. Dec 01, 2019;48(12):e485-e488. [ CrossRef ] [ Medline ]
  • Yon DK, Hwang S, Lee SW, Jee HM, Sheen YH, Kim JH, et al. Indoor exposure and sensitization to formaldehyde among inner-city children with increased risk for asthma and rhinitis. Am J Respir Crit Care Med. Aug 01, 2019;200(3):388-393. [ CrossRef ] [ Medline ]
  • Noh H, An J, Kim MJ, Sheen YH, Yoon J, Welsh B, et al. Sleep problems increase school accidents related to allergic diseases. Pediatr Allergy Immunol. Jan 2020;31(1):98-103. [ CrossRef ] [ Medline ]
  • Timonen M, Jokelainen J, Hakko H, Silvennoinen-Kassinen S, Meyer-Rochow VB, Herva A, et al. Atopy and depression: results from the Northern Finland 1966 Birth Cohort Study. Mol Psychiatry. Aug 2003;8(8):738-744. [ CrossRef ] [ Medline ]
  • Guzman A, Tonelli LH, Roberts D, Stiller JW, Jackson MA, Soriano JJ, et al. Mood-worsening with high-pollen-counts and seasonality: a preliminary report. J Affect Disord. Aug 2007;101(1-3):269-274. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Timonen M, Jokelainen J, Silvennoinen-Kassinen S, Herva A, Zitting P, Xu B, et al. Association between skin test diagnosed atopy and professionally diagnosed depression: a Northern Finland 1966 Birth Cohort study. Biol Psychiatry. Aug 15, 2002;52(4):349-355. [ CrossRef ] [ Medline ]
  • Postolache TT, Stiller JW, Herrell R, Goldstein MA, Shreeram SS, Zebrak R, et al. Tree pollen peaks are associated with increased nonviolent suicide in women. Mol Psychiatry. Mar 2005;10(3):232-235. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Marshall PS, O'Hara C, Steinberg P. Effects of seasonal allergic rhinitis on fatigue levels and mood. Psychosom Med. 2002;64(4):684-691. [ CrossRef ] [ Medline ]
  • Woo HG, Park S, Yon H, Lee SW, Koyanagi A, Jacob L, et al. National trends in sadness, suicidality, and COVID-19 pandemic-related risk factors among South Korean adolescents from 2005 to 2021. JAMA Netw Open. May 01, 2023;6(5):e2314838. [ CrossRef ] [ Medline ]
  • Amritwar AU, Lowry CA, Brenner LA, Hoisington AJ, Hamilton R, Stiller JW, et al. Mental health in allergic rhinitis: depression and suicidal behavior. Curr Treat Options Allergy. Mar 2017;4(1):71-97. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kim N, Song JY, Yang H, Kim MJ, Lee K, Shin YH, et al. National trends in suicide-related behaviors among youths between 2005-2020, including COVID-19: a Korean representative survey of one million adolescents. Eur Rev Med Pharmacol Sci. Feb 2023;27(3):1192-1202. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lee KH, Yon DK, Suh DI. Prevalence of allergic diseases among Korean adolescents during the COVID-19 pandemic: comparison with pre-COVID-19 11-year trends. Eur Rev Med Pharmacol Sci. Apr 2022;26(7):2556-2568. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Franklin JC, Ribeiro JD, Fox KR, Bentley KH, Kleiman EM, Huang X, et al. Risk factors for suicidal thoughts and behaviors: a meta-analysis of 50 years of research. Psychol Bull. Feb 2017;143(2):187-232. [ CrossRef ] [ Medline ]
  • Large M, Kaneson M, Myles N, Myles H, Gunaratne P, Ryan C. Meta-analysis of longitudinal cohort studies of suicide risk assessment among psychiatric patients: heterogeneity in results and lack of improvement over time. PLoS One. Jun 2016;11(6):e0156322. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Burke TA, Ammerman BA, Jacobucci R. The use of machine learning in the study of suicidal and non-suicidal self-injurious thoughts and behaviors: a systematic review. J Affect Disord. Feb 15, 2019;245:869-884. [ CrossRef ] [ Medline ]
  • Park S, Yon H, Ban CY, Shin H, Eum S, Lee SW, et al. National trends in alcohol and substance use among adolescents from 2005 to 2021: a Korean serial cross-sectional study of one million adolescents. World J Pediatr. Nov 2023;19(11):1071-1081. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kweon S, Kim Y, Jang MJ, Kim Y, Kim K, Choi S, et al. Data resource profile: the Korea National Health and Nutrition Examination Survey (KNHANES). Int J Epidemiol. Feb 2014;43(1):69-77. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Ren Y, Wu D, Tong Y, López-DeFede A, Gareau S. Issue of data imbalance on low birthweight baby outcomes prediction and associated risk factors identification: establishment of benchmarking key machine learning models with data rebalancing strategies. J Med Internet Res. May 31, 2023;25:e44081. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lee SW. Methods for testing statistical differences between groups in medical research: statistical standard and guideline of life cycle committee. Life Cycle. Jan 24, 2022;2:e1. [ FREE Full text ] [ CrossRef ]
  • Koo JH, Park YH, Kang DR. Factors predicting older people's acceptance of a personalized health care service app and the effect of chronic disease: cross-sectional questionnaire study. JMIR Aging. Jun 21, 2023;6:e41429. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lundberg SM, Lee SI. A unified approach to interpreting model predictions. arXiv Preprint posted online May 22, 2017. [ FREE Full text ] [ CrossRef ]
  • Streamlit. URL: https://predictsuicidality.streamlit.app [accessed 2024-01-26]
  • Machine learning-based prediction of suicidality in adolescents with allergic rhinitis. GitHub. URL: https://github.com/CenterForDH/suicidality [accessed 2024-01-29]
  • Su C, Aseltine R, Doshi R, Chen K, Rogers SC, Wang F. Machine learning for suicide risk prediction in children and adolescents with electronic health records. Transl Psychiatry. Nov 26, 2020;10(1):413. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Tsui FR, Shi L, Ruiz V, Ryan ND, Biernesser C, Iyengar S, et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open. Jan 2021;4(1):ooab011. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lee J, Pak TY. Machine learning prediction of suicidal ideation, planning, and attempt among Korean adults: a population-based study. SSM Popul Health. Sep 2022;19:101231. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Wilimitis D, Turer RW, Ripperger M, McCoy AB, Sperry SH, Fielstein EM, et al. Integration of face-to-face screening with real-time machine learning to predict risk of suicide among adults. JAMA Netw Open. May 02, 2022;5(5):e2212095. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bilsen J. Suicide and youth: risk factors. Front Psychiatry. 2018;9:540. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Akdis SH. Allergy and the immunologic basis of atopic disease. In: Kliegman RM, Stanton BF, St. Geme JW, Schor NF, Behrman RE, editors. Nelson Textbook of Pediatrics: Expert Consult. Philadelphia, PA. Elsevier Academic Press; Aug 01, 2019;388-393.
  • Mukundan TH. Screening for allergic disease in a child with sleep disorder and screening for sleep disturbance in allergic disease. In: Fishbein A, Sheldon SH, editors. Allergy and Sleep: Basic Principles and Clinical Practice. Cham, Switzerland. Springer; Dec 10, 2023;77-85.
  • Kim HJ, Kim YJ, Lee SH, Yu J, Jeong SK, Hong SJ. Effects of Lactobacillus rhamnosus on allergic march model by suppressing Th2, Th17, and TSLP responses via CD4(+)CD25(+)Foxp3(+) Tregs. Clin Immunol. Jul 2014;153(1):178-186. [ CrossRef ] [ Medline ]
  • Gadani SP, Cronk JC, Norris GT, Kipnis J. IL-4 in the brain: a cytokine to remember. J Immunol. Nov 01, 2012;189(9):4213-4219. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Zanno AE, Romer MA, Fox L, Golden T, Jaeckle-Santos L, Simmons RA, et al. Reducing Th2 inflammation through neutralizing IL-4 antibody rescues myelination in IUGR rat brain. J Neurodev Disord. Dec 16, 2019;11(1):34. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Thompson A, Sardana N, Craig TJ. Sleep impairment and daytime sleepiness in patients with allergic rhinitis: the role of congestion and inflammation. Ann Allergy Asthma Immunol. Dec 2013;111(6):446-451. [ CrossRef ] [ Medline ]
  • Jackson-Cowan L, Cole EF, Arbiser JL, Silverberg JI, Lawley LP. TH2 sensitization in the skin-gut-brain axis: how early-life Th2-mediated inflammation may negatively perpetuate developmental and psychologic abnormalities. Pediatr Dermatol. Sep 2021;38(5):1032-1039. [ CrossRef ] [ Medline ]
  • Lee H, Park J, Lee M, Kim HJ, Kim M, Kwon R, et al. National trends in allergic rhinitis and chronic rhinosinusitis and COVID-19 pandemic-related factors in South Korea, from 1998 to 2021. Int Arch Allergy Immunol. Jan 05, 2024:1-7. [ CrossRef ] [ Medline ]
  • Kwon R, Lee H, Kim MS, Lee J, Yon DK. Machine learning-based prediction of suicidality in adolescents during the COVID-19 pandemic (2020-2021): derivation and validation in two independent nationwide cohorts. Asian J Psychiatr. Oct 2023;88:103704. [ CrossRef ] [ Medline ]
  • Kang J, Park J, Lee H, Lee M, Kim S, Koyanagi A, et al. National trends in depression and suicide attempts and COVID-19 pandemic-related factors, 1998-2021: a nationwide study in South Korea. Asian J Psychiatr. Oct 2023;88:103727. [ CrossRef ] [ Medline ]
  • Berthelot E, Etchecopar-Etchart D, Thellier D, Lancon C, Boyer L, Fond G. Fasting interventions for stress, anxiety and depressive symptoms: a systematic review and meta-analysis. Nutrients. Nov 05, 2021;13(11):3947. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Fond G, Young AH, Godin O, Messiaen M, Lançon C, Auquier P, et al. Improving diet for psychiatric patients : high potential benefits and evidence for safety. J Affect Disord. Mar 15, 2020;265:567-569. [ CrossRef ] [ Medline ]

Abbreviations

AR: allergic rhinitis
AUROC: area under the receiver operating characteristic curve
IL: interleukin
KNHANES: Korea National Health and Nutrition Examination Survey
KYRBS: Korea Youth Risk Behavior Web-based Survey
ML: machine learning
REM: rapid eye movement
SHAP: Shapley Additive Explanations
SMOTE: synthetic minority oversampling technique

Edited by Q Jin; submitted 01.08.23; peer-reviewed by HG Woo, V Ruiz, B Montezano; comments to author 31.10.23; revised version received 24.12.23; accepted 16.01.24; published 14.02.24.

©Hojae Lee, Joong Ki Cho, Jaeyu Park, Hyeri Lee, Guillaume Fond, Laurent Boyer, Hyeon Jin Kim, Seoyoung Park, Wonyoung Cho, Hayeon Lee, Jinseok Lee, Dong Keon Yon. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 14.02.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

A once-ignored community of science sleuths now has the research community on its heels


A community of sleuths hunting for errors in scientific research has sent shockwaves through some of the most prestigious research institutions in the world — and the science community at large.

High-profile cases of alleged image manipulation in papers authored by the former president of Stanford University and leaders at the Dana-Farber Cancer Institute have made national media headlines, and some top science leaders think this could be just the start.

“At the rate things are going, we expect another one of these to come up every few weeks,” said Holden Thorp, the editor-in-chief of the Science family of scientific journals, whose namesake publication is one of the two most influential in the field. 

The sleuths argue their work is necessary to correct the scientific record and prevent generations of researchers from pursuing dead-end topics because of flawed papers. And some scientists say it’s time for universities and academic publishers to reform how they address flawed research. 

“I understand why the sleuths finding these things are so pissed off,” said Michael Eisen, a biologist, the former editor of the journal eLife and a prominent voice of reform in scientific publishing. “Everybody — the author, the journal, the institution, everybody — is incentivized to minimize the importance of these things.” 

For about a decade, science sleuths unearthed widespread problems in scientific images in published papers, publishing concerns online but receiving little attention. 

That began to change last summer after then-Stanford President Marc Tessier-Lavigne, who is a neuroscientist, stepped down from his post after scrutiny of alleged image manipulations in studies he helped author and a report criticizing his laboratory culture. Tessier-Lavigne was not found to have engaged in misconduct himself, but members of his lab appeared to manipulate images in dubious ways, a report from a scientific panel hired to examine the allegations said. 

In January, a scathing post from a blogger exposed questionable work from top leaders at the Dana-Farber Cancer Institute , which subsequently asked journals to retract six articles and issue corrections for dozens more. 

In a resignation statement , Tessier-Lavigne noted that the panel did not find that he knew of misconduct and that he never submitted papers he didn’t think were accurate. In a statement from its research integrity officer, Dana-Farber said it took decisive action to correct the scientific record and that image discrepancies were not necessarily evidence an author sought to deceive. 

“We’re certainly living through a moment — a public awareness — that really hit an inflection when the Marc Tessier-Lavigne matter happened and has continued steadily since then, with Dana-Farber being the latest,” Thorp said. 

Now, the long-standing problem is in the national spotlight, and new artificial intelligence tools are only making it easier to spot problems that range from decades-old errors and sloppy science to images enhanced unethically in photo-editing software.  

This heightened scrutiny is reshaping how some publishers are operating. And it’s pushing universities, journals and researchers to reckon with new technology, a potential backlog of undiscovered errors and how to be more transparent when problems are identified. 

This comes at a fraught time in academic halls. Bill Ackman, a venture capitalist, in a post on X last month discussed weaponizing artificial intelligence to identify plagiarism of leaders at top-flight universities where he has had ideological differences, raising questions about political motivations in plagiarism investigations. More broadly, public trust in scientists and science has declined steadily in recent years, according to the Pew Research Center .

Eisen said he didn’t think sleuths’ concerns over scientific images had veered into “McCarthyist” territory.

“I think they’ve been targeting a very specific type of problem in the literature, and they’re right — it’s bad,” Eisen said. 

Scientific publishing builds the base of what scientists understand about their disciplines, and it’s the primary way that researchers with new findings outline their work for colleagues. Before publication, scientific journals consider submissions and send them to outside researchers in the field for vetting and to spot errors or faulty reasoning, which is called peer review. Journal editors will review studies for plagiarism and for copy edits before they’re published. 

That system is not perfect and still relies on good-faith efforts by researchers to not manipulate their findings.

Over the past 15 years, scientists have grown increasingly concerned that some researchers were digitally altering images in their papers to skew or emphasize results. Discovering irregularities in images — typically of experiments involving mice, gels or blots — has become a larger priority of scientific journals’ work.

Jana Christopher, an expert on scientific images who works for the Federation of European Biochemical Societies and its journals, said the field of image integrity screening has grown rapidly since she began working in it about 15 years ago. 

At the time, “nobody was doing this and people were kind of in denial about research fraud,” Christopher said. “The common view was that it was very rare and every now and then you would find someone who fudged their results.” 

Today, scientific journals have entire teams dedicated to dealing with images and trying to ensure their accuracy. More papers are being retracted than ever — with a record 10,000-plus pulled last year, according to a Nature analysis . 

A loose group of scientific sleuths have added outside pressure. Sleuths often discover and flag errors or potential manipulations on the online forum PubPeer. Some sleuths receive little or no payment or public recognition for their work.

“To some extent, there is a vigilantism around it,” Eisen said. 

An analysis of comments on more than 24,000 articles posted on PubPeer found that more than 62% of the comments were related to image manipulation.

For years, sleuths relied on sharp eyes, keen pattern recognition and an understanding of photo manipulation tools. In the past few years, rapidly developing artificial intelligence tools, which can scan papers for irregularities, are supercharging their work. 

Now, scientific journals are adopting similar technology to try to prevent errors from reaching publication. In January, Science announced that it was using an artificial intelligence tool called Proofig to scan papers that were being edited and peer-reviewed for publication. 

Thorp, the Science editor-in-chief, said the family of six journals added the tool “quietly” into its workflow about six months before that January announcement. Before, the journal was reliant on eye-checks to catch these types of problems. 

Thorp said Proofig identified several papers late in the editorial process that were not published because of problematic images that were difficult to explain and other instances in which authors had “logical explanations” for issues they corrected before publication.

“The serious errors that cause us not to publish a paper are less than 1%,” Thorp said.

In a statement, Chris Graf, the research integrity director at the publishing company Springer Nature, said his company is developing and testing “in-house AI image integrity software” to check for image duplications. Graf’s research integrity unit currently uses Proofig to help assess articles if concerns are raised after publication. 

Graf said processes varied across its journals, but that some Springer Nature publications manually check images for manipulations with Adobe Photoshop tools and look for inconsistencies in raw data for experiments that visualize cell components or common scientific experiments.

“While the AI-based tools are helpful in speeding up and scaling up the investigations, we still consider the human element of all our investigations to be crucial,” Graf said, adding that image recognition software is not perfect and that human expertise is required to protect against false positives and negatives. 

No tool will catch every mistake or cheat. 

“There’s a lot of human beings in that process. We’re never going to catch everything,” Thorp said. “We need to get much better at managing this when it happens, as journals, institutions and authors.”

Many science sleuths had grown frustrated after their concerns seemed to be ignored or as investigations trickled along slowly and without a public resolution.  

Sholto David, who publicly exposed concerns about Dana-Farber research in a blog post, said he largely “gave up” on writing letters to journal editors about errors he discovered because their responses were so insufficient. 

Elisabeth Bik, a microbiologist and longtime image sleuth, said she has frequently flagged image problems and “nothing happens.” 

Leaving public comments questioning research figures on PubPeer can start a public conversation over questionable research, but authors and research institutions often don’t respond directly to the online critiques. 

While journals can issue corrections or retractions, it’s typically a research institution’s or a university’s responsibility to investigate cases. When cases involve biomedical research supported by federal funding, the federal Office of Research Integrity can investigate. 

Thorp said the institutions need to move more swiftly to take responsibility when errors are discovered and speak plainly and publicly about what happened to earn the public’s trust.  

“Universities are so slow at responding and so slow at running through their processes, and the longer that goes on, the more damage that goes on,” Thorp said. “We don’t know what happened if instead of launching this investigation Stanford said, ‘These papers are wrong. We’re going to retract them. It’s our responsibility. But for now, we’re taking the blame and owning up to this.’” 

Some scientists worry that image concerns are only scratching the surface of science’s integrity issues — problems in images are simply much easier to spot than data errors in spreadsheets. 

And while policing bad papers and seeking accountability is important, some scientists think those measures will be treating symptoms of the larger problem: a culture that rewards the careers of those who publish the most exciting results, rather than the ones that hold up over time. 

“The scientific culture itself does not say we care about being right; it says we care about getting splashy papers,” Eisen said. 

Evan Bush is a science reporter for NBC News. He can be reached at [email protected].

GigaScience, 6(5), May 2017

Using and understanding cross-validation strategies. Perspectives on Saeb et al.

Max A. Little

1 Department of Mathematics, Aston University, Aston Triangle, B4 7ET, Birmingham, UK

Gael Varoquaux

2 Parietal, INRIA, NeuroSpin, bat 145 CEA Saclay, 91191, Gif sur Yvette, France

Sohrab Saeb

3 Department of Preventive Medicine, Northwestern University, 750 N Lake Shore Dr, 60611, Chicago, USA

Luca Lonini

4 Rehabilitation Institute of Chicago, 345 E Superior, 60611, Chicago, USA

Arun Jayaraman

David C. Mohr

Konrad P. Kording

This three-part review takes a detailed look at the complexities of cross-validation, fostered by the peer review of Saeb et al.’s paper entitled “The need to approximate the use-case in clinical machine learning.” It contains perspectives by reviewers and by the original authors that touch upon cross-validation: the suitability of different strategies and their interpretation.

This review is organized in three sections, each presenting a different view on the suitability of different cross-validation strategies: one by M.A. Little, one by G. Varoquaux (who both also reviewed the original paper), and one by Saeb et al.

Perspective by M. A. Little: an important problem that subject-wise cross-validation does not fix

In their important article, Saeb et al. 2017 [ 1 ] propose, on the basis of empirical evidence and a simulation study, that one should use leave-subject-out (or “subject-wise”) cross-validation (CV) rather than basic CV (what we are calling “record-wise CV”) in clinical diagnostic application settings of machine learning predictors, where there are multiple observations from each individual and small numbers of individuals. The reason is that complex predictors can pick up a confounding relationship between identity and diagnostic status and so produce unrealistically high prediction accuracy, and this is not correctly reported by record-wise CV, but it is for subject-wise CV. From Saeb et al.’s article [ 1 ], I interpret the following claims: (i) subject-wise CV mimics the “use-case”: in usage, predictions will always be made on new subjects for which we have no observations, so observations from individuals in the training set must not appear in the test set, (ii) subject-wise CV will correctly estimate the out-of-sample prediction error under these circumstances, and (iii) record-wise CV creates dependence between training and test sets due to shared identities across train/test sets, so it will produce biased estimates (e.g., prediction errors are underestimated).

When I was asked to critically review this paper originally, I could not really grasp the assumptions of their probabilistic arguments, which were not made explicit. I guess if I were quicker I would have gotten it immediately, but fortunately, later during the review process, they relieved my stupidity and made it clear that they intend a specific model for the data in which dependency exists between the observations from each subject, but where subjects – and the full dependency between and distributional structure of observations – are drawn independently and identically distributed (i.i.d.) from the population as a whole. But whether this model is applicable to any specific data set is empirically testable, and they do not propose a direct test for this, so below I want to posit a different model where Saeb et al.’s [ 1 ] claims (i-iii) above do not hold. This model is grounded in my practical experience with this kind of data and evidence from real data in this discipline (see Fig. 1 ). This model allows me to test the scope of Saeb et al.’s [ 1 ] claims.

Figure 1.

Features from clinical data sets of the kind discussed by Saeb et al. are often highly clustered, often not effectively sharing the same distribution across individuals. When subject-wise CV is used with this kind of data, its assumption that the training and test sets come from the same distribution is effectively violated. Here the leave-one-subject-out train/test distribution mismatch is 100% for all subjects for this feature (two-sample Kolmogorov-Smirnov test at 0.05 significance, Bonferroni corrected). This means that leave-one-subject-out CV cannot provide a consistent out-of-sample estimator for these data. The mean absolute serial (Pearson) correlation within subjects, across features and subjects, is a negligible 0.06, with no correlations reaching significance (at the 0.05 level, Bonferroni corrected), barring only one feature that is weakly correlated just above significance for only 10% of subjects. This means that observations within subjects are effectively independent, even though such dependence is cited as the main motivation for using subject-wise CV. Data from [ 3 ].
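As a concrete illustration (not part of the original figure analysis), the two diagnostics described in this caption could be computed roughly as follows, assuming a long-format pandas DataFrame with a subject column and one feature column; the column names are assumptions.

```python
# Sketch of the two diagnostics in the caption: (1) per-subject two-sample
# Kolmogorov-Smirnov tests of the left-out subject's values against everyone
# else's (Bonferroni corrected), and (2) the mean absolute lag-1 (Pearson)
# serial correlation within subjects.
import pandas as pd
from scipy.stats import ks_2samp

def leave_one_subject_out_mismatch_rate(df, feature="feature", alpha=0.05):
    subjects = df["subject"].unique()
    alpha_bonf = alpha / len(subjects)                 # Bonferroni correction
    mismatches = 0
    for s in subjects:
        held_out = df.loc[df["subject"] == s, feature]
        rest = df.loc[df["subject"] != s, feature]
        if ks_2samp(held_out, rest).pvalue < alpha_bonf:
            mismatches += 1
    return mismatches / len(subjects)                  # fraction of subjects with train/test mismatch

def mean_abs_serial_correlation(df, feature="feature"):
    lag1 = df.groupby("subject")[feature].apply(lambda x: x.autocorr(lag=1))
    return lag1.abs().mean()
```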

Subject-wise CV is not always a valid substitute for record-wise CV in diagnostic applications

Take a collection of i.i.d. random variables and observations generated by them. Now split these observations into two groups according to any uniformly random selection that is independent of the data. The groups are now independent, and each observation is identically distributed to every other observation in each group. This i.i.d. situation describes exactly what record-wise CV assumes [ 2 ]. So any uniformly random way of splitting the data into test and training subsets, where the split is independent of the data, must leave each observation in each group independent of, and identically distributed to, every other observation in either group. The consequence is that record-wise CV should not depend upon how we split the data, provided that we split in a way that is independent of the data. Any estimator of any quantity computed on either group of this split must be independent of the way in which the split is performed [ 1 ].
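As a minimal sketch of this point (assuming Python with NumPy and scikit-learn, and purely hypothetical simulated data), a record-wise split touches only the row indices and never the data values:

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical feature matrix
y = rng.integers(0, 2, size=200)       # hypothetical binary labels

record_wise = KFold(n_splits=5, shuffle=True, random_state=0)
test_blocks = [test for _, test in record_wise.split(X)]
# The split is driven only by shuffled row indices, never by the values in
# X or y, so every fold inherits the i.i.d. structure of the full sample.
assert np.array_equal(np.sort(np.concatenate(test_blocks)), np.arange(len(X)))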

Let us now split the data in a way that is conditioned on one of the variables in the data, or on some other variable upon which the data depend. The split is, in general, no longer independent of the data. A simple example is splitting on subject identity. This modification of record-wise CV is precisely subject-wise CV, and we would ideally like it to inherit the i.i.d. assumption of record-wise CV, for then we could simply borrow the applicable theory wholesale. But we will find that, for some kinds of data, we can create a split that violates the "identically distributed" assumption of record-wise CV: this happens where the data are clustered in value by identity, yet the data are still i.i.d. For this kind of data, observations belonging to one identity have a different distribution from those belonging to other identities. This is a commonly encountered situation in the context of Saeb et al.'s article [ 1 ]; indeed, this kind of clustering is sufficient to cause identity confounding for complex predictors (see Fig.  1 for a real example). Such a model is known as an i.i.d. 'mixture model' in the statistics and machine learning literature [ 2 ].
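For concreteness, a minimal sketch (assuming NumPy and entirely hypothetical parameters) of such an identity-clustered i.i.d. mixture model might look as follows:

import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_obs = 20, 30
subject_means = rng.normal(0.0, 5.0, size=n_subjects)   # one mixture component per subject
X = np.concatenate([rng.normal(m, 1.0, size=n_obs) for m in subject_means])
subject_id = np.repeat(np.arange(n_subjects), n_obs)
# Values cluster strongly by subject (between-subject spread >> within-subject
# spread), yet, conditional on subject_id, every observation is drawn
# independently: clustering without dependence between observations.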

But does subject-wise CV work for this data? No: we generally get different values of the estimator in the training set and in the test set, because they are based on data with different distributions, even though the original data are still i.i.d. To get the same estimator in both groups, we need different samples from the same distribution across the train/test split, not different samples from different distributions. This contradicts claim (ii) above: in the identity-clustered case (which can cause identity-confounded predictions), subject-wise CV does not produce a consistent estimate of the out-of-sample prediction error. This distributional mismatch is clear in Fig.  1 .
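The kind of check summarized in Fig. 1 can be sketched as follows (a rough illustration assuming SciPy and NumPy, with simulated data standing in for the real recordings):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
n_subjects, n_obs = 20, 30
means = rng.normal(0.0, 5.0, size=n_subjects)
data = [rng.normal(m, 1.0, size=n_obs) for m in means]      # one feature per subject

alpha = 0.05 / n_subjects                                   # Bonferroni correction
mismatched = 0
for s in range(n_subjects):
    held_out = data[s]
    train = np.concatenate([d for i, d in enumerate(data) if i != s])
    # Two-sample Kolmogorov-Smirnov test: does the left-out subject share the
    # training subjects' distribution?
    if ks_2samp(held_out, train).pvalue < alpha:
        mismatched += 1
print(f"{mismatched}/{n_subjects} leave-one-subject-out folds are mismatched")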

Clustering does not necessarily imply dependence

Staying for now with this simple i.i.d. mixture model, we can ask about the independence of training and test sets, which bears on claim (iii) above. Does the fact that the data are clustered imply that observations are dependent? The answer is no, because it is an i.i.d. model. While observations within a cluster may take values more similar to other observations in that cluster than to observations in other clusters, each observation is still independent of every other. What about within clusters? Still no, because it is an i.i.d. model. No dependence between observations is created by clustering alone. To be clear: in this model, observations do depend upon their identity variable (that is what we mean by clustering on identity here), but they do not depend upon each other [ 3 ].

This means that no matter how we uniformly, randomly, and independently split the data generated by a simple i.i.d. mixture model, the data in each group are still independent of each other, and they are still independent across groups. But this is the record-wise CV splitting method, and the “independent” part is satisfied. Claim (iii) is contradicted: clustering in this model does not by itself necessarily invalidate record-wise CV due to dependence across a split between training and test sets. And this of course entirely obviates the need for subject-wise CV, contradicting claim (i). However, if we use subject-wise CV anyway, we end up violating the “identically distributed” part by splitting on the clustering variable, contradicting claim (ii).

For a simple model that exhibits the identity-confounding effect, subject-wise CV does not work

With just the simple i.i.d. mixture model, the data is clustered and can exhibit identity confounding for complex predictors, but the data is still i.i.d., so no dependence is created between training and testing subsets, which eliminates the theoretical justification for using subject-wise CV. However, to avoid the confounding discovered empirically, subject-wise CV is proposed, which splits on the clustering variable. But this leads inevitably to violating the identically distributed assumption, which is required for subject-wise CV to produce consistent estimates of the out-of-sample prediction accuracy.

Saeb et al. [ 1 ] of course may not agree with these model assumptions, particularly the lack of within-subject dependence. It may seem “intuitively obvious” that there must be such dependence. But this is an intuition that is testable, and in my experience, real data I have encountered in this context do not exhibit any detectable signs of within-subject dependence between observations (see, for example, Fig.  1 , where the serial correlation within features is effectively non-existent) [ 4 ].
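The within-subject dependence check mentioned above can be sketched in the same spirit (again assuming SciPy and NumPy, with simulated, hypothetical data):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n_subjects, n_obs = 20, 30
series = [rng.normal(rng.normal(0.0, 5.0), 1.0, size=n_obs) for _ in range(n_subjects)]

alpha = 0.05 / n_subjects                                   # Bonferroni correction
flagged = 0
for x in series:
    r, p = pearsonr(x[:-1], x[1:])                          # lag-1 serial correlation
    if p < alpha:
        flagged += 1
# With independently drawn observations, essentially no subject is flagged,
# mirroring the negligible within-subject correlations reported for Fig. 1.
print(f"{flagged}/{n_subjects} subjects show significant serial correlation")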

In my view, if one cannot actually detect any dependence in practice, then Occam's razor suggests that we go with the simpler model that explains the data, i.e., the i.i.d. mixture model. One could still insist that there must be some kind of dependence and that we simply have not looked hard enough for it. But then these hidden dependencies cause effective clustering by value within subject, and if the dependence is strong enough to cause clustering, which in turn causes identity confounding for complex predictors, then it is likely to be strong enough for subject-wise CV to empirically violate the identically distributed assumption in any finite sample.

As a proposed solution to identity confounding for complex predictors, subject-wise CV, for some quite simple models that fit the description given by Saeb et al. [ 1 ], fit the data, and cause identity confounding, introduces a new problem that undermines its consistency as an estimator of out-of-sample prediction accuracy, either in theory or empirically, or both. Subject-wise CV does not always work, and in some realistic cases it fails where record-wise CV does not, even when the data are clustered on identity. This places a significant limitation on the applicability of Saeb et al.'s [ 1 ] subject-wise CV proposal.

The real problem is confounded predictions, not CV

CV is such a seductively simple and universal technique that it is easy to overstate what it can really tell us or do for us in practice. As Saeb et al.’s [ 1 ] empirical evidence shows quite clearly, there are practical situations in which there is an apparently systematic difference between prediction error estimates obtained by record-wise CV and by subject-wise CV. But if we strip this down, all that is actually shown is indirect evidence from prediction errors. The synthetic simulation model arguments proposed in the article are interesting, but this is just a toy model, carefully constructed to highlight the identity-confounding flaw with complex predictors. The evidence is again all indirect through prediction errors. So there is no direct evidence in the article that the critical assumptions of various forms of CV are being violated in empirical studies found in the literature.

What is not in doubt, as Saeb et al. [ 1 ] show, is that special properties of the data (here, clustering in feature space) interact with specific properties of some prediction algorithms to cause the predictor to settle on a confounded relationship between the input features and the output variable. Because this problem does not lie with the CV method itself, we should find the actual source of the confound and fix that first.

Adapting CV in an attempt to fix a particular confound that it cannot fix risks causing additional problems; in this particular case, an inadvertent mismatch between training and test distributions for real data. Modern nonlinear predictors are particularly sensitive to such "data drift," a well-known problem [ 4 ]. Because of this mismatch, subject-wise CV causes systematic underfitting in practice in this situation, an alternative explanation for Saeb et al.'s empirical findings that we cannot rule out. It is not hard to see why this underfitting might occur: if a critical part of the pattern that relates features to the output variable is missing from the training data, because an entire set of observations has been removed simply because they belong to a specific individual, then no predictor can be expected to make accurate predictions.

Leave-one-subject-out CV is not a panacea and suffers from additional problems. It often results in estimators with large variance because of the considerable heterogeneity between individuals in practice and the often small number of individuals available to each study. It is dubious to interpret a prediction error estimate with large spread.
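A minimal sketch of this variance issue (assuming scikit-learn and a hypothetical simulated cohort; the classifier choice is arbitrary) is to look at the spread of per-subject scores under leave-one-subject-out CV:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(4)
n_subjects, n_obs = 12, 25
groups = np.repeat(np.arange(n_subjects), n_obs)
subject_shift = rng.normal(0.0, 3.0, size=n_subjects)       # between-subject heterogeneity
y = rng.integers(0, 2, size=n_subjects * n_obs)
X = (y + subject_shift[groups] + rng.normal(size=y.size)).reshape(-1, 1)

scores = cross_val_score(LogisticRegression(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
# Each score is the accuracy on one held-out subject; the spread across
# subjects is what makes the overall estimate hard to interpret.
print(f"per-subject accuracy: mean={scores.mean():.2f}, sd={scores.std():.2f}")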

If one truly can identify CV as the problem, then it would be important to choose another CV method, as Saeb et al. [ 1 ] suggest. Theoretical CV research has moved on considerably since record-wise CV was first described, and there are now adapted CV techniques for dealing with many scenarios where record-wise or subject-wise CV would not work. Indeed, subject-wise CV has recently been theoretically investigated for the case where the observations from each individual are jointly, mutually dependent and individuals are i.i.d. samples from this joint distribution [ 5 ]. (I thank the authors for making me aware of this work.) There are also adapted methods for dealing with mismatched distributions [ 6 ] or with serial dependencies between observations when we want to make prognostic predictions (e.g., the time-series or longitudinal setting; see [ 2 ]). I would also like to make readers aware of the related ideas of domain adaptation [ 7 ], a very active topic addressing the many issues that arise in deployment, where the deployment data differ substantially in distribution and other respects from the train/test data.
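As a rough sketch of two such adapted schemes (assuming scikit-learn; the data and group labels are hypothetical), grouped CV handles within-subject dependence and a forward-chaining split handles serial dependence:

import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 4))                               # hypothetical features
groups = np.repeat(np.arange(12), 10)                       # hypothetical subject labels

# Grouped CV: no subject straddles the train/test boundary.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

# Forward-chaining split: test observations always lie in the "future".
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()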

Mandating subject-wise CV could make things worse

While I find the pleasing symmetry of a simple idea such as 'the CV method must match or approximate the use-case', of which Saeb et al.'s subject-wise CV proposal is a special case, very appealing, I do not believe it is best to follow this prescription uncritically. In practice, we usually face multiple known and unknown potential confounds.

Such perplexing confounds seem to have taxed the minds of experts in the field for a long time. They have, in particular, been the subject of substantial investigation in the data mining community, where they are known as "leakages" (see [ 8 ] for many concrete examples, including cases where attempts to fix them actually exacerbate the problem). For a hypothetical but entirely plausible example, consider the case where all the healthy subjects in the data set are younger than those with the condition we want to diagnose. Since many aspects of aging can be detected in features, we could have an obvious age-related confound in the predictor, and again, subject-wise CV will not fix this. Worse, everyone could come to believe that the results are technically sound solely because the study uses subject-wise CV according to prescription.

It may be that a purely pragmatic solution, where appropriate given the probabilistic dependency and distributional structure of the data, is to try out both record-wise and subject-wise CV on a problem just to see what you find. If an unexpected discrepancy is found, this may indicate that some kind of identity-based confound is at work; one should then find and fix the confound itself.
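A minimal sketch of this pragmatic comparison (assuming scikit-learn, with a deliberately identity-confounded simulated data set; the classifier is an arbitrary choice) could look like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(6)
n_subjects, n_obs = 20, 20
groups = np.repeat(np.arange(n_subjects), n_obs)
y = np.repeat(rng.integers(0, 2, size=n_subjects), n_obs)    # one label per subject
X = rng.normal(size=(groups.size, 3)) + groups[:, None]      # identity leaks into the features

clf = RandomForestClassifier(n_estimators=100, random_state=0)
record_wise = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(5))
# A large gap between the two estimates hints at an identity-related confound.
print(f"record-wise: {record_wise.mean():.2f}   subject-wise: {subject_wise.mean():.2f}")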

In summary, this article contains some important observations and interesting empirical evidence, and we should be grateful to Saeb et al. [ 1 ] for their work, which opens up an important discussion. However, I do not agree with the uncritical application of this simplified prescription, because whether subject-wise CV is applicable depends upon the dependency and distributional structure of the data, and this may or may not coincide with the intended 'use-case'. Finally, to avoid any doubt, I do not of course disagree with Saeb et al. [ 1 ] on the need to identify such obvious confounds with complex predictors, but I do not believe they are necessarily caused by train/test set dependence and thereby fixed simply by using a different CV method.

Max A. Little

Perspective by G. Varoquaux: cross-validation is important and tricky

Saeb et al. [ 1 ] discuss an important point that is well summarized by their title: the need to approximate the use-case when evaluating machine learning methods for clinical applications. The central aspect of the problem is cross-validation: How should it be done? How should the results be interpreted?

The reviewers of this important paper worried that readers might retain overly simple messages, perhaps originating from the paper's efforts at being didactic. Focused on diagnostic applications, the paper stresses the importance of subject-wise cross-validation, often performed as leave-one-subject-out. But there is no one-size-fits-all solution to methodological mistakes. In addition, leave-one-out cross-validation, frequent in medical informatics communities, is fragile to confounds. Here, I give my take on some issues raised by the original paper [ 1 ] and by Dr. Little's comments.

Confounded predictions are indeed an important problem

Given data and a target, a good predictor predicts; in other words, it tries to find statistical associations between the data and the target. Machine learning techniques can, and will, draw their predictions from effects of non-interest (confounds) or from stratifications in the data. Given two visits of the same patient with a chronic disease, who would not be tempted to settle on a diagnosis upon recognizing the patient? If a diagnostic method is meant to be applied to new patients, it must be evaluated as such. Hence, the cross-validation strategy must test generalization to new subjects, as recommended by Saeb et al. [ 1 ].

However, if the method is meant for prognosis from continuous measurements, it may fine-tune to a subject's data. In such a case, the cross-validation must be performed by leaving out future measurements, not full subjects. The central consideration in choosing how to separate training and test data is the dependence structure between them. While most mathematical studies of cross-validation assume i.i.d. data, applications often have dependencies across observations due to unobserved confounding effects: samples belong to multiple subjects, and movement differs across populations. Whether these effects are confounds or not depends on the scientific or clinical question.
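A minimal sketch of such a prognostic split (pure NumPy, with hypothetical subject and visit indices) is to hold out each subject's latest measurements:

import numpy as np

rng = np.random.default_rng(7)
n_subjects, n_visits = 10, 8
subject = np.repeat(np.arange(n_subjects), n_visits)
visit = np.tile(np.arange(n_visits), n_subjects)             # time order within each subject

train_mask = visit < 6                                       # past visits of every subject
test_mask = visit >= 6                                       # future visits of every subject
# Every subject appears on both sides of the split, but the test data always
# lie in the future relative to that subject's training data.
print(f"{train_mask.sum()} past records for training, {test_mask.sum()} future records for testing")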

I have recently encountered two interesting situations that warranted specific cross-validation settings. In Abraham et al. [ 9 ], we were interested in diagnostics from imaging data acquired at multiple sites. An important question was whether the predictive biomarkers would carry over across sites. To test this, we measured prediction performance by leaving out full sites in the cross-validation. In a different application, Liem et al. [ 10 ] showed prediction of brain age from magnetic resonance images. However, it is known that elderly people tend to move more in the scanner, and that this movement has a systematic effect on the images. To demonstrate that the prediction of brain age was not driven by movement, they showed prediction on a subset of the data specifically crafted so that age and movement were uncorrelated.
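The leave-one-site-out idea can be sketched with a grouped CV iterator (assuming scikit-learn; the data, labels, and site assignments are hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(240, 6))                                # hypothetical imaging features
y = rng.integers(0, 2, size=240)                             # hypothetical diagnostic labels
site = rng.integers(0, 4, size=240)                          # four hypothetical acquisition sites

# Each fold holds out one full site, testing whether predictions carry over.
scores = cross_val_score(LogisticRegression(), X, y,
                         groups=site, cv=LeaveOneGroupOut())
print("per-site accuracy:", np.round(scores, 2))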

In the first section of this review, Dr. Little correctly points out that, in the presence of a confound, changing the cross-validation method can measure the impact of the confound but cannot fix it. Moreover, it may lead the predictors to underfit, in other words, to use only a fraction of the information available for prediction. For instance, to demonstrate prediction of brain age independent of movement, training predictive models on a data set crafted to contain only subjects with a given amount of movement would give less powerful predictors. Indeed, the required data culling would deplete the training set. In addition, it would lead to predictors easily fooled by movement: they would be applicable only to data with the same amount of movement.

Avoid leave-one-out: cross-validation with small test sets is fragile

Beyond the dependency structure of the data, another important aspect of choosing a cross-validation strategy is to have large test sets. Unfortunately, in communities such as medical imaging, leave-one-out is the norm. In such a situation, there is only one sample in the test set, which leads to a subpar measure of the error. An erroneous intuition is that this strategy creates many folds that will compensate. The standard recommendation in machine learning is to use test sets of 10% to 20% of the data (see [ 11 , 12 ] for historical references and [ 13 ], section 7.10, for a modern textbook). Experiments on brain imaging confirm this recommendation [ 14 ].

Intuitions about cross-validation are challenging to build because there are multiple sources of randomness: the training data, which govern the expectancy over the family of models learned, and the test data, which govern the expectancy over the data to which the models will be applied. However, the example of creating a balanced test set in age prediction [ 10 ] highlights the importance of having many observations in the test set. It is impossible to accumulate rich statistics on errors in small test sets. Nor is it legitimate to accumulate statistics over the union of all test sets across cross-validation folds: this would break their independence from the training sets and open the door to information leaking from the training sets into the corresponding statistics [ 5 ].

Leave-one-out must be avoided. Cross-validation strategies with large test sets, typically 10% of the data, can be more robust to confounding effects. Keeping the number of folds large is still possible with strategies known as repeated train-test splits, shuffle-split, repeated k-fold, or Monte Carlo cross-validation [ 2 , 14 ].
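A minimal sketch of such strategies (assuming scikit-learn and hypothetical subject labels), both record-wise and group-aware:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 4))                                # hypothetical features
groups = np.repeat(np.arange(30), 10)                        # hypothetical subject labels

record_wise = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
subject_wise = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
for name, cv in [("record-wise", record_wise), ("subject-wise", subject_wise)]:
    sizes = [len(test) for _, test in cv.split(X, groups=groups)]
    # Many folds, each with a test set of roughly 20% of the data.
    print(name, "test-set sizes:", min(sizes), "-", max(sizes))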

Modeling cross-validation in heterogeneous data

Discussions are made more complex by the difficult question of what exactly a cross-validation strategy is measuring. In particular, Dr. Little mentions in the first section that models may underfit or show large variation when predicting across subjects. There is some truth to these considerations. Yet, in a clear application setting, as for a clinical purpose, the intended usage should dictate the cross-validation setting. For research purposes, understanding what drives cross-validation results is important, and this may be difficult, as stressed by Dr. Little. To complement these arguments with a different perspective, I lay out below conceptual tools that I find useful for thinking about this rich set of questions, though I do not claim to answer them. Intuitions can enable easy communication, but here I need to rely on mathematical formalism.

Write $y$ for the target to predict (e.g., diagnostic status), $X$ for the observed data, and $Z$ for a confounding variable such as subject identity. A model of the data-generation process can be written as

$$y = f(X, Z) + e, \qquad (1)$$

where $e$ is observation noise. In such a model, $e$ may be i.i.d. even though the relationship between $y$ and $X$ is not, e.g., changing from subject to subject as in Saeb et al. [ 1 ]. This formalism is a classic way of modeling confounding effects, used, e.g., in statistical testing for linear models [ 15 ]. The confound may also shape the observed data themselves,

$$X = g(Z) + \nu, \qquad (2)$$

and cross-validation estimates the expected prediction error

$$\mathbb{E}\!\left[\,\ell\!\left(y, \hat{f}(X)\right)\right]. \qquad (3)$$

Given models (1) and (2), which include a confound $Z$, the expectancy in (3), which gives the cross-validation error, must be refined to include $Z$, marginally or conditionally on $Z$. If $Z$ models subjects and the goal is to predict on new subjects, the expectancies must be marginal with respect to $Z$. This tells us that all the data from a given subject should be either in the training set or in the test set. If the prediction is a prognosis knowing the subject's past, the expectancies are then conditional on $Z$, and a subject's data should be spread between the training and test sets.

Gaël Varoquaux

The authors’ perspective

We do not live in a perfect world, and thus we always face trade-offs when analyzing the performance of machine learning algorithms. All cross-validation (CV) methods, at best, estimate the true prediction error; acquiring the true prediction error would require infinite amounts of use-case data. In this sense, we completely agree with Dr. Varoquaux on the complexity of the process, which is partially reflected in the review by Dr. Little as well. We have also encountered similarly complex scenarios in our own work, where subject-specific models needed to be cross-validated across time, e.g., in cases where tracking, not diagnosis, was the objective [ 16 , 17 ]. We thank both Dr. Varoquaux and Dr. Little for bringing these additional discussions and didactics to this important problem. However, in his review, Dr. Little has criticized three of our claims: (i) that subject-wise CV mimics the use-case, (ii) that subject-wise CV will, under the right assumptions, estimate the out-of-sample prediction error, and (iii) that record-wise CV underestimates use-case prediction errors. We critically review these three points here.

When subject-wise CV mimics the use-case scenario

If we are to use machine learning to diagnose a disease, we want to generalize from known subjects, some having the disease and some not, to new patients who have to be diagnosed in the future. In this setting, subject-wise CV obviously mimics the use-case scenario by dividing the data set into known (train) and future (test) subjects. Therefore, it is not clear to us why Dr. Little asks whether "subject-wise can be a replacement for record-wise." For such a wide set of use-case scenarios, that is not even a question. Record-wise CV would, on a reasonably small data set, readily allow diagnosing a disease using any feature that can identify the individual. For example, we would be able to diagnose Parkinson's disease from subjects' nose length.
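A minimal sketch of this thought experiment (assuming scikit-learn; the "nose length" feature, cohort size, and classifier are all hypothetical) shows the effect:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(10)
n_subjects, n_records = 20, 15
groups = np.repeat(np.arange(n_subjects), n_records)
diagnosis = np.repeat(rng.integers(0, 2, size=n_subjects), n_records)   # per-subject label
nose_length = np.repeat(rng.normal(5.0, 1.0, size=n_subjects), n_records)
X = (nose_length + rng.normal(0.0, 0.01, size=groups.size)).reshape(-1, 1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
record_wise = cross_val_score(clf, X, diagnosis, cv=KFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, diagnosis, groups=groups, cv=GroupKFold(5))
# Record-wise CV rewards memorizing each subject's nose length; subject-wise
# CV exposes that the feature carries no diagnostic information.
print(f"record-wise: {record_wise.mean():.2f}   subject-wise: {subject_wise.mean():.2f}")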

Now, there are also other scenarios: we may want to ask whether a patient will develop a disease, based on past baseline measures from the same patient together with new data. In this case, using data from the same patient is mandatory; our machine learning system needs information from the same subject to predict their future state. However, record-wise CV, which randomly splits data into train and test sets regardless of time, still does not mimic the use-case scenario: using record-wise CV here would mean detecting the disease based on future knowledge about the disease state of a subject.

When subject-wise CV correctly estimates the out-of-sample prediction error

When we build a machine learning system that diagnoses a disease, we must consider how often it will misdiagnose new subjects. Provided that subjects are recruited randomly from the target population, a subject that has not been seen by the algorithm is, for all practical purposes, indistinguishable from other people in the population who are not in the study. However, there are situations where this assumption does not hold: if our data set is small, then even if the algorithm has not seen the held-out subjects, the algorithm developer certainly has, and may therefore bias the meta-parameters of the algorithm. As such, the algorithm will implicitly contain information from all data points, which means that whenever we try something on our test set, we are using the test set's information. Therefore, the most important issue with subject-wise CV is the implicit re-use of test data. To deal with this problem, we can use large data sets where train and test sets are completely isolated from the beginning, or run a pre-registered replication study. In this sense, we agree with Dr. Little that even subject-wise CV is not sufficiently conservative. However, subject-wise CV will still yield a much more meaningful metric than record-wise CV.

When record-wise CV underestimates the use-case prediction error

Quite simply, whenever there is any subject-specific component in the data, there will likely be a bias. In fact, Fig. 3 of our paper shows this dependency in simulation. The idea is also intuitive: in a small cohort, we would be able to use machine learning to “diagnose” any disease based on any subject-specific feature (e.g., the nose length), because the algorithm could, at least partially, identify subjects based on those features. This would give us an obviously misguided belief that our machine learning system will work perfectly on a new cohort. For simple confounds, such as linear ones, we can use a formalism similar to that of Dr. Varoquaux and mathematically derive the biases. Obviously, we can construct cases where there is no bias despite subject-specific variation, e.g., when the subject-specific variance is impossible for the machine learning system to learn or represent. In years of experience working with biomedical data, we have yet to see signs of such variance. In other words, the biases are mathematically predicted, are intuitively obvious, and are experimentally demonstrated by our paper.

In summary, we agree with Dr. Little and Dr. Varoquaux on the complexity of the cross-validation problem and are thankful to them for bringing it up. We cannot, however, agree with Dr. Little's three criticisms of our paper. We strongly urge scientists and clinicians who want to diagnose diseases to avoid record-wise cross-validation. Or, if they do use it, we would like to be given the opportunity to short the stocks of the resulting start-ups.

Sohrab Saeb, Luca Lonini, Arun Jayaraman, David C. Mohr, and Konrad P. Kording

Conflicts of interest

The authors declare no competing interests.

Supplementary Material

Giga-d-17-00022_original_submission.pdf, supplement files.
