Faster Medical AI Validation = Quicker FDA Submission

Overview

AI technology is being rapidly developed and is already being used in sectors such as finance, healthcare, and energy. For these and other critical sectors, there is immense public interest in validating that AI is trustworthy and accurate. The software industry developing these tools is accustomed to getting products to market quickly, and is unlikely to self-regulate at a socially acceptable level. After all, this is the industry that gave us the mantra “move fast and break things.”

There is little debate at this point that there is real value in using AI to expedite the development of cutting-edge solutions that solve real problems across virtually every industry. But in the rush to build and deliver those products, we cannot lose sight of the pitfalls that arise when underlying models are not subject to a rigorous validation process.

The Biden Administration recognizes the importance and urgency of this issue too, which is why it recently issued an Executive Order that “establishes new standards for AI safety and security, protects Americans’ privacy, advances equity and civil rights, stands up for consumers and workers, promotes innovation and competition, advances American leadership around the world, and more.” As part of that Order, the U.S. Department of Health and Human Services (HHS) is “directed to establish an AI safety program to work in partnership with Patient Safety Organizations” in order to ensure AI solutions are safe and equitable before they hit the market. The goal of the Order is clear, but there is not enough detail about what the structure of the AI safety program should look like, nor is there clarity on exactly what should be measured.

The Office of the National Coordinator for Health Information Technology (ONC) has recently announced new rules affecting AI vendors and clinicians who use HHS-certified decision support software, effective by the end of 2024. These rules mirror the FDA's Good Machine Learning Practices and require both vendors and providers to adopt an AI lifecycle perspective, so that models are trained, developed, validated, and monitored in a transparent and rigorous manner.

Structure

While the safety program could be established independently by regulatory agencies on a sector-by-sector basis (FDA, Federal Energy Regulatory Commission, U.S. Securities and Exchange Commission, etc.), the commonalities across AI systems present a strong argument for a centralized AI Validation Agency (AIVA) that operates a central validation platform. The problem with a single, centralized agency, however, is that no government body has the expertise, tools, or human resources required to manage AI validation at scale.

To keep pace with the velocity at which AI is being developed, we need a hub-and-spoke framework that enables an AIVA to coordinate with relevant government agencies as well as regional and sector-specific AI Validation Platform Entities (AIVPE) that report to AIVA. This design is appropriate because it allows for 1) regional office specialization based on location (e.g., financial AI in New York City), and 2) independent, redundant validation from separate offices to increase confidence in the conclusions. The focus of the central agency is AI validation, but it cannot maintain expertise in every relevant application domain, nor can any single organization handle the sheer volume of new models that need testing.

For example, in the case of medical AI, the spoke office would test and assure the quality of AI tools in a manner compliant with current healthcare regulatory and quality standards. The spoke office would also create a repository to report and track clinical errors associated with the use of the technology in order to fine-tune future models. This repository (and other domain-specific ones) should be available to the public for transparency.

The efficacy of the hub-and-spoke model will also depend on robust public-private data partnerships that leverage the best available domain expertise. The best way to test and validate new medical AI models is for hospitals and patient safety organizations to donate real-world medical data for testing. The Agency for Healthcare Research and Quality (AHRQ), part of the U.S. Department of Health and Human Services (HHS), can facilitate this.

Measurement

AI validation has to be done by experts equipped with appropriate testing tools, ones that do not require coding skills, so that they can inspect and audit AI trustworthiness. Depending on the design and purpose of the AI tool, trustworthiness can be checked either by a) using test data that allows the tester to give the tool known inputs and verify that its outputs match those in the test dataset, or b) using a subject matter expert (for example, a radiologist for medical imaging AI validation) to assess the quality of the outputs from a variety of inputs. AIVA's focus on validating outcomes is of utmost importance. Regulating internal model specifics would likely have unintended consequences as the technology matures.

AIVA would act as a centralized entity that maintains the central AI validation platform (AIVP) and concentrates expertise in AI testing, conducting three general activities:

Quantitative validation: Using test datasets with known outputs to assess AI performance using statistical measures (accuracy, precision, recall, etc.). This includes detection of discriminatory behaviors or other socially problematic outputs.
Qualitative validation: Connecting with subject matter experts to rate the quality of the AI outputs (for example, doctors when assessing medical AI, or Federal Energy Regulatory Commission (FERC) experts when assessing electricity dispatch AI). Experts will also advise on subject-specific risks or other concerns that should be assessed for each AI application.
Communication/dissemination: Similar to the way the FDA maintains detailed information about drug trial design and results, or the EPA offers detailed emissions data, an important function of the AIVA is to maintain an AIVP that reports methods and results. Once privacy and security concerns have been addressed, AIVA should strive for transparency in operations and results.
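
To make the quantitative validation step concrete, here is a minimal sketch in Python of how a test dataset with known outputs could be scored. It is an illustration only, not a method prescribed by any agency: the function names (binary_metrics, metrics_by_group, recall_gap) are hypothetical, the task is assumed to be binary classification, and using the gap in recall between patient groups as a signal of discriminatory behavior is one assumed choice among many possible fairness measures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float


def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Metrics(accuracy, precision, recall)


def metrics_by_group(y_true, y_pred, groups):
    """Compute the same metrics separately for each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: binary_metrics(ts, ps) for g, (ts, ps) in buckets.items()}


def recall_gap(per_group):
    """Largest difference in recall between any two subgroups (assumed disparity measure)."""
    recalls = [m.recall for m in per_group.values()]
    return max(recalls) - min(recalls)


# Toy, made-up data: known outcomes, model outputs, and a subgroup label per case.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = binary_metrics(y_true, y_pred)
per_group = metrics_by_group(y_true, y_pred, groups)
print(overall)
print(per_group)
print("recall gap:", recall_gap(per_group))
```

In this toy data the overall numbers look uniform, yet the model misses positive cases in group B that it catches in group A. Aggregate metrics alone would hide that difference, which is why a validator would likely want per-group breakdowns alongside the headline figures.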

The validation process also needs regulatory oversight. The National Institute of Standards and Technology (NIST) is leading the pack here so far with its AI Risk Management Framework, though there is no universally accepted standard yet. There are roughly 200 standards being discussed globally, but we need to continue to move toward a single set of accepted benchmarks for all applications.

Cost

Depending on the design of the relevant regulatory structures, AIVA's costs could be shared with the regulated entities, similar to the cost-sharing approach used by the FDA and by credit rating agencies. Around half of FDA funding comes from “user fees” paid by the companies seeking fast and effective regulation of drugs or medical devices. In credit ratings, the big three (S&P Global, Moody's Corporation, Fitch Ratings) are financial services firms that offer government-sanctioned credit ratings in the US, with their rating services funded through payments from the issuers of the rated credit instruments. In both cases, the value of efficient and effective ratings or approvals supports the cost-sharing model.

Conclusion

There is simply too much collective benefit at stake across multiple industries, and we risk not realizing it without a concerted effort to validate AI models and ensure their efficacy and equity in the market. AI has the power to improve human outcomes, save lives, and make things so much more efficient that our government and society must treat it as seriously as we do education, energy, or commerce. The best way to ensure that AI is safe, fair, and equitable is to have a central agency supported by a hub-and-spoke network that serves as a critical watchdog layer.

About the Author

Enes HOSGOR

CEO at Gesund.ai

Dr. Enes Hosgor is an engineer by training and an AI entrepreneur by trade, driven to unlock scientific and technological breakthroughs. He has spent the last 10+ years building AI products and companies in high-compliance environments. After selling his first ML company, based on his Ph.D. work at Carnegie Mellon University, he joined a digital surgery company named Caresyntax to found and lead its ML division. His penchant for healthcare comes from his family of physicians, including his late father, his sister, and his wife. He was formerly a Fulbright Scholar at the University of Texas at Austin, and some of his published scientific work can be found in Medical Image Analysis, International Journal of Computer Assisted Radiology and Surgery, Nature Scientific Reports, and British Journal of Surgery, among other peer-reviewed outlets.