Cambridge Prof Claims Synthetic Data Superior in Healthcare AI
2024-11-27
In my previous reports on the difficulties of integrating AI into healthcare, a consistent theme has emerged: the essential requirement for researchers to uphold patient confidentiality when dealing with clinical data. While this data is valuable for training AI, it may directly identify an individual, or allow re-identification from supposedly anonymized sources. A related issue is that systems trained on datasets lacking patient diversity tend to produce more accurate, detailed results for the majority group and less accurate ones for minorities. Clearly, this poses a problem when researching treatments that should be effective for everyone. As explored in my October 2024 report on AI and medical devices, the same problem extends to technologies predominantly designed, tested, and calibrated on a dominant group of data subjects, potentially leading to less accurate readings for others. Optical sensors are one such tool. My October report examined pulse oximeters, commonly used in blood-oxygen testing, which gave less accurate readings for people with darker skin tones. During the COVID pandemic, this may have contributed to a higher mortality rate among Black, Asian, and minority ethnic (BAME) patients, some of whom were sent home rather than hospitalized on the basis of inaccurate readings.

Overcoming Challenges and Ensuring Data Safety

Using AI-Improved Data and Synthetic Data

Professor Mihaela van der Schaar, John Humphrey Plummer Professor of Machine Learning, AI, and Medicine at the University of Cambridge and Director of the Cambridge Centre for AI in Medicine, believes that AI can enhance the quality of the data itself. Whether it comes from electronic health records, biobanks, or clinical registries, high-quality data is crucial for both AI and epidemiology. Medical data presents many challenges: it is complex, real-world data; it is multimodal and needs aggregating; and it may be biased, noisy, or missing informative elements. In the case of rare diseases, data may be scarce, or too private to share. Data also changes over time as practices and demographics shift and new diseases such as COVID emerge.

With AI, she argues, we can improve data quality at every stage of model design. This includes imputing missing data, reducing noise, and dealing with "hard" examples. We can harmonize datasets from different clinical trials, and reconcile trial data with electronic health records. At the model-training stage, we can divide data into subgroups for more robust training or data-informed model selection. We can also test models using new data-centric approaches and address data shift and drift.
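To make one of these ideas concrete, here is a minimal sketch of model-based missing-data imputation on toy, EHR-style records. It illustrates the general technique only, not van der Schaar's own pipeline; the choice of scikit-learn's IterativeImputer and the toy columns are my assumptions.

```python
# Minimal sketch of missing-data imputation on toy EHR-style records.
# Illustrative only: scikit-learn's IterativeImputer and the columns
# below are assumptions, not van der Schaar's actual pipeline.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy patient records; np.nan marks missing measurements.
records = pd.DataFrame({
    "age":   [34, 61, np.nan, 47, 58],
    "sbp":   [118, 141, 132, np.nan, 150],  # systolic blood pressure
    "hba1c": [5.4, np.nan, 6.1, 5.9, 7.2],  # glycated haemoglobin
})

# IterativeImputer models each column as a function of the others and
# fills gaps with model-based estimates rather than crude column means,
# which matters when values are missing for informative reasons.
imputer = IterativeImputer(random_state=0)
completed = pd.DataFrame(imputer.fit_transform(records),
                         columns=records.columns)
print(completed.round(1))
```

The same data-centric mindset carries through to the later stages she describes: the completed table can then be stratified into subgroups before training, so that underrepresented patients are not drowned out by the majority.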

Addressing the Limitations of Synthetic Data

The broad area of synthetic data, along with the emerging challenge of AIs being trained by other AIs on AI-generated data, alarms some commentators. Last year, researcher Jathan Sadowski coined the term 'Habsburg AI' to describe this problem: just as photocopying a photocopy degrades the original image, models trained on machine-made output degrade over successive generations, even as synthetic content threatens to overwhelm human-made content online. Generative AI's 'hallucinations' compound the problem, as humans relying on AI for information may find themselves in a world of untrustworthy data.

My recent report on pulse oximeters shows how the human data underlying synthetic data may itself be flawed. During the COVID pandemic, BAME patients were reportedly at higher risk because of inaccurate readings from devices tested mainly on white skin tones; the British government has acknowledged the problem, but no adjustments have been made. Synthetic data generated from such sources may amplify a majority view and smooth away significant anomalies in the human data. Professor van der Schaar rejects the Habsburg comparison, arguing that synthetic data is a powerful creation that can improve data quality and simulate forward-looking scenarios. However, there is good synthetic data and bad synthetic data, and human researchers need to be able to distinguish between them.
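What might distinguishing good from bad synthetic data look like in practice? Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test to compare a synthetic feature against its real counterpart. The test choice and the toy data are my assumptions, not a method from van der Schaar's group.

```python
# Minimal sketch of a fidelity check on synthetic data: compare the
# distribution of a synthetic feature against the real one with a
# two-sample Kolmogorov-Smirnov test. Illustrative assumptions only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(loc=120, scale=15, size=1000)  # e.g. real SBP readings

# "Good" synthetic data tracks the real distribution; "bad" synthetic
# data has collapsed toward the mean, losing the rare-case tails.
good_synth = rng.normal(loc=120, scale=15, size=1000)
bad_synth = rng.normal(loc=120, scale=3, size=1000)

for name, synth in [("good", good_synth), ("bad", bad_synth)]:
    stat, p = ks_2samp(real, synth)
    print(f"{name}: KS statistic = {stat:.3f}, p-value = {p:.3g}")
```

A large KS statistic (with a vanishing p-value) flags synthetic data whose distribution has drifted from the real one, for example by smoothing away the tails where rare but clinically important cases live: precisely the amplified-majority failure mode described above.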

The Role of Clinicians in AI-Enabled Ecosystems

Van der Schaar's hope is to build an AI-empowered clinical ecosystem offering a range of analytics, designed and thought through with clinicians, who know their own needs and how to test the systems. As research progresses, though, caution is needed about the emerging world of AI. By 2026, some analysts predict, most online content will be synthetic. AIs will be trained by other AIs on AI-generated data, potentially leading to a Habsburg-style future in which technology refers more to itself than to humans. At that point, we will need humans who can step outside the system and identify its flaws. In the meantime, diginomica will continue to report on these issues. After all, we are only human.