By Matt Fisher, General Counsel, Carium
Twitter: @matt_r_fisher
Twitter: @cariumcares
Host of Healthcare de Jure – #HCdeJure
Generative AI and large language models (LLMs) continue to garner a lot of press, attention, and investment in healthcare. The promise is that such tools will free up a lot of time by offloading some tasks or potentially filling roles that currently sit empty. However, can the accuracy of the tools be trusted? How will the tools be trained, and on what data? Those are valid considerations that must be appropriately addressed before widespread or in-depth use can really occur.
The Accuracy Issue
A recent study compared ChatGPT with Google for questions and searches relating to dementia and other cognitive decline concerns. The objective was to compare the results each tool returned for the same dementia-related questions, which were a mix of informational and service-delivery queries. The responses were then evaluated by domain experts based upon the following criteria: (i) currency of the information, (ii) reliability of the information sources, (iii) objectivity, (iv) relevance to the actual question posed, and (v) similarity of the responses between the two tools.
After evaluating the results, the researchers found positives for both options. Google was determined to provide more current and reliable responses, whereas ChatGPT was assessed as more objective. A bigger differential appeared in response relevance, where ChatGPT performed better. Readability was assessed poorly for both tools, as the average grade level of the responses was in the high school range. Similarity of content between the tools varied widely, with most responses rated as medium or low similarity.
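For readers who want a more concrete picture of how an evaluation like this works, the sketch below shows one way rubric-based expert ratings might be tabulated and compared. The criteria names mirror the study's, but the scores, scale, and averaging are invented purely for illustration and do not reflect the study's actual data or methods.

```python
# A minimal, hypothetical sketch of tabulating rubric-based ratings.
# The criteria mirror the article; the scores and weighting are invented.
from statistics import mean

CRITERIA = ["currency", "reliability", "objectivity", "relevance", "similarity"]

# Hypothetical expert ratings on a 1-5 scale for one dementia-related question.
ratings = {
    "google":  {"currency": 4, "reliability": 4, "objectivity": 3, "relevance": 2, "similarity": 2},
    "chatgpt": {"currency": 3, "reliability": 2, "objectivity": 4, "relevance": 4, "similarity": 2},
}

def summarize(tool_ratings: dict) -> float:
    """Average the per-criterion ratings into a single comparison score."""
    return mean(tool_ratings[c] for c in CRITERIA)

for tool, scores in ratings.items():
    print(f"{tool}: mean rating {summarize(scores):.2f} across {len(CRITERIA)} criteria")
```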
The researchers concluded that both Google and ChatGPT have strengths and weaknesses. Some of the biggest issues for ChatGPT are ones commonly identified in coverage of its capabilities. Specifically, the biggest weakness is not providing a source for the information presented, which means it can be difficult to assess the accuracy of a response even when it seems high quality and very useful. For Google, the relevancy of the responses could be improved, though arguably pointing users to referrals or references for helpful resources may be the better approach.
The research is helpful for understanding the current shortcomings of the tools and could provide some insight into how improvement can occur. A very important factor to keep in mind is that neither Google nor ChatGPT is a healthcare-specific tool. Both are designed for broad, generalized use and are not trained for the nuances of healthcare or the healthcare industry. Could better training make a difference? The answer is likely yes, but that leads into the next issue.
How to Train for Healthcare
If tools like Google and ChatGPT are not healthcare-specific, how can they be made healthcare-specific? Specialized training is one of the clearer answers. But that raises its own question: what healthcare-specific data will be used for that training?
One aspect of the training would be feeding generalized medical information into the tools from publicly available sources. Those sources would likely include government documents, journal articles, scientific papers, and other evidence-based, verifiably accurate sources that actual clinicians rely upon. One further issue on that front is how to “correct” the training as evidence and knowledge evolve. Even humans are not necessarily the best at immediately acting upon or internalizing new data and breaking from old habits. Would the same biases or limitations be inherent in a generative AI or LLM tool? Admittedly, without a technical background, this author cannot fully address that query here, but hopefully others can enter the discussion and provide a more nuanced and informed understanding.
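To make the “correction” problem a bit more tangible, here is a minimal sketch of how a curated training corpus could track provenance and supersession so that outdated guidance can be retired as evidence evolves. The record fields, source names, and filtering logic are assumptions for illustration only, not a description of how any actual model is trained.

```python
# A simplified sketch of a curated, evidence-based training corpus where
# entries can be marked as superseded when newer guidance replaces them.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CorpusEntry:
    source: str                          # e.g., a journal article or government guideline
    text: str                            # the evidence-based content fed into training
    published: date                      # when the guidance was issued
    superseded_by: Optional[str] = None  # set when newer evidence replaces this entry

corpus = [
    CorpusEntry("Screening guideline (hypothetical, 2015)", "Older screening recommendation...",
                date(2015, 1, 1), superseded_by="Screening guideline (hypothetical, 2023)"),
    CorpusEntry("Screening guideline (hypothetical, 2023)", "Current screening recommendation...",
                date(2023, 6, 1)),
]

# Only entries that have not been superseded would feed the next retraining run,
# which is one (very simplified) way to keep a model from repeating outdated guidance.
training_set = [e for e in corpus if e.superseded_by is None]
print(f"{len(training_set)} of {len(corpus)} entries eligible for the next training pass")
```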
Another aspect of the training is a bit more complicated. The tougher area is how to train tools to understand the nuances and idiosyncrasies of patient and clinician communication, documentation, and related interactive components of healthcare that come naturally to individuals. What information sources can be used to train technology on that front? A relatively common answer has been electronic medical records and other troves of data being created through digital interactions between patients and clinicians.
While requests for that data usually occur on a de-identified basis, it raises the perpetual question of whether combining so much data means it will actually remain de-identified. Terms for acquiring the data may also seek to keep it indefinitely, with no ability to return or delete it. Before an entity shares data in that scenario, it should be very clear on the conditions it attached when collecting the data, as well as the potential uses that were identified for it. If care is not taken, an entity could very easily create a very big headache for itself.
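As a rough illustration of what de-identification involves at the simplest level, the sketch below strips a few obvious identifiers from a clinical note using pattern matching, loosely in the spirit of HIPAA's Safe Harbor identifiers. The patterns and example note are invented; real de-identification requires far more than a handful of patterns, and, as noted above, combining large datasets can still create re-identification risk even after fields like these are removed.

```python
# A minimal, illustrative sketch of pattern-based redaction of obvious identifiers.
# Not sufficient for actual de-identification; shown only to make the concept concrete.
import re

PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "date":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def redact(note: str) -> str:
    """Replace matched identifiers with bracketed placeholders."""
    for label, pattern in PATTERNS.items():
        note = pattern.sub(f"[{label.upper()} REDACTED]", note)
    return note

note = "Patient seen 03/14/2024, call back at 555-867-5309, email jane.doe@example.com."
print(redact(note))
```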
The other aspect of sharing so much patient data, even if permissible under law and contract, is the impact on the individuals whose information is being shared. The discourse around privacy and data sharing in recent years has focused on individuals gaining more control over their data, being more clearly informed of potential uses, or being allowed to participate in the benefits derived from use of the data. It is likely that none of those scenarios would play out in sharing data for purposes of training a generative AI or LLM tool.
Should that happen? Arguably it is more of an ethical dilemma than a legal one (at least assuming all of the legal and regulatory checkboxes have been ticked). There is no easy or clear answer, but the question should be brought to the fore before too much data exchanges hands and gets loosed into the wild.
Looking Ahead
Training, development, release, and use of generative AI and LLM tools will not stop. Given that reality, it is essential to establish more robust parameters guiding those efforts and what will happen with data. Absent a thoughtful approach, backlash can be expected, which could undermine valuable tools and solutions.
This article was originally published on The Pulse blog and is republished here with permission.