William Hersh, MD, Professor and Chair, OHSU
Blog: Informatics Professor
I have written a number of postings over the last year about various aspects of electronic health record (EHR) data, from the transition of the work of informatics from implementation to analytics to the problems that still prevent us from making optimal use of data, such as the difficulties of data entry. One of my themes has been that knowledge will not just fall out of the data; we will need to improve the quality and completeness of data to learn from it. The requirements for getting better data include widespread adherence to data standards, engaging and motivating those who enter data to improve it, making it easier for those individuals to enter quality data, and evolving our healthcare system to value this data. If we are not able to meet these challenges with our current data, it is unlikely we will be able to do so when we have “big data,” i.e., data that is orders of magnitude larger and more complex than what we have now. No field has devoted more thought, research, or evaluation to the challenges of clinical and health data than informatics. Thus, whether the issue is how to implement systems in complex clinical settings, how to meet the needs of clinicians, patients, and others, or how to maximize the quality of data, the road to making the best use of (big or non-big) data must pass through informatics.
An example of the fact that knowledge will not just fall out of the data comes from some research activity I have been involved in over the last couple of years, the Text Retrieval Conference (TREC) Medical Records Track [1]. As those familiar with the field of information retrieval (IR) know, TREC is an annual “challenge evaluation” sponsored by the National Institute of Standards and Technology (NIST) [2]. Challenge evaluations bring research groups with common interests and use cases together to apply their systems to a common task or set of tasks, using a common data set, and to compare results using agreed-upon metrics (ideally in a scholarly and not an overly competitive forum). TREC operates on a yearly cycle, consisting of 5-7 “tracks” that each represent a specific focus of IR research. TREC began with the straightforward tasks of “ad hoc” retrieval (a user entering queries into a search engine seeking relevant documents) and “routing” (a user seeking relevant documents from a new stream of documents based on knowledge of previous relevant documents). In subsequent years, TREC evolved to its current state of diverse tracks representing newer problems in IR, such as Web search, video searching, question-answering, cross-language retrieval, and user studies. (Some of these tracks have spawned their own challenge evaluations, especially in the area of cross-language evaluation, an important issue in Europe and Asia.) Virtually all tracks have focused on generic content, typically newswire or Web content, with very few being “domain specific,” although I have been involved in two domain-specific tracks, in the areas of genomics literature [3] and medical records [1].
In TREC and IR jargon, test collections consist of an adequately large and realistic collection of content, such as documents, medical records, Web pages, etc. [2]. Test collections also include a set of topics, usually at least 25-50 for statistical reliability [4], that are instances of the task being studied. A final component is human relevance judgments, or assessments, over the content items, indicating which are relevant and should be retrieved for each topic. Success is usually measured by some sort of aggregate statistic that combines the base measures of recall (the proportion of relevant content items in the test collection that are retrieved) and precision (the proportion of retrieved content items that are relevant). (For those familiar with medical diagnostic test characteristics, these correspond to sensitivity and positive predictive value. The reciprocal of precision is also sometimes called the number needed to retrieve, since it measures how many documents overall must be read or viewed for each relevant one retrieved.)
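To make these base measures concrete, here is a minimal sketch in Python of how recall, precision, and the number needed to retrieve are calculated; the document IDs and judgments are made up for illustration and do not come from any actual TREC collection:

```python
# Made-up document IDs and relevance judgments, purely for illustration.
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]   # what the system returned
relevant = {"doc2", "doc4", "doc7", "doc9"}            # judged relevant in the collection

relevant_retrieved = [d for d in retrieved if d in relevant]

recall = len(relevant_retrieved) / len(relevant)       # 2/4 = 0.50
precision = len(relevant_retrieved) / len(retrieved)   # 2/5 = 0.40
number_needed_to_retrieve = 1 / precision              # 2.5 items viewed per relevant one

print(f"recall={recall:.2f}, precision={precision:.2f}, NNR={number_needed_to_retrieve:.1f}")
```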
The use case for the TREC Medical Records Track was identifying patients from a collection of medical records who might be candidates for clinical studies. This is a real-world task for which automated retrieval systems could greatly aid the ability to carry out clinical research, quality measurement and improvement, or other “secondary uses” of clinical data [4]. The metric used to measure the systems employed was inferred normalized discounted cumulative gain (infNDCG), which takes into account factors such as the incomplete judgment of all documents retrieved by all research groups.
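infNDCG itself is an inferred estimate designed for pools of retrieved items that are only partially judged, so I will not attempt to reproduce it here. The sketch below shows only the ordinary normalized discounted cumulative gain on which it builds, with hypothetical relevance gains chosen purely for illustration:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each ranked gain is discounted by log2 of its rank."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains, all_gains):
    """Normalize the DCG of a ranking by the DCG of an ideal (best-possible) ordering."""
    ideal = sorted(all_gains, reverse=True)[:len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical gains for the top 5 visits returned by a system
# (e.g., 2 = relevant, 1 = partially relevant, 0 = not relevant).
system_gains = [2, 0, 1, 0, 2]
all_judged_gains = [2, 2, 2, 1, 1, 0, 0, 0]  # all judged visits for the topic

print(f"nDCG@5 = {ndcg(system_gains, all_judged_gains):.3f}")
```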
The data for the track was a corpus of de-identified medical records developed by the University of Pittsburgh Medical Center. Records containing data, text, and ICD-9 codes were grouped into “visits,” or patient encounters with the health system. (Due to the de-identification process, it is impossible to know whether one or more visits might emanate from the same patient.) There were 93,551 documents mapped into 17,264 visits.
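Because the retrieval unit was the visit rather than the individual report, systems generally had to aggregate report-level text into visit-level units before indexing. The sketch below illustrates that general idea only; the file name, the two-column report-to-visit mapping format, and the simple concatenation approach are my own assumptions for illustration, not the actual track distribution format:

```python
import csv
from collections import defaultdict

def load_report_to_visit_map(path):
    """Read a hypothetical two-column CSV of (report_id, visit_id) pairs."""
    mapping = {}
    with open(path, newline="") as f:
        for report_id, visit_id in csv.reader(f):
            mapping[report_id] = visit_id
    return mapping

def group_reports_by_visit(report_texts, mapping):
    """report_texts: dict of report_id -> report text; returns visit_id -> combined text."""
    visits = defaultdict(list)
    for report_id, text in report_texts.items():
        visit_id = mapping.get(report_id)
        if visit_id is not None:
            visits[visit_id].append(text)
    # Concatenate each visit's reports into one retrievable unit.
    return {visit_id: "\n\n".join(texts) for visit_id, texts in visits.items()}
```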
I was involved in a number of aspects of organizing this track. I contributed both to guiding the task (or use case) and to leading some of the track infrastructure activities, namely the development of search topics and relevance assessments. This work has been aided greatly by students with medical and other expertise in the OHSU Biomedical Informatics Graduate Program.
The results of the TREC Medical Records Track provide a good example of why the road to big data passes through informatics, or in other words, why there is still considerable work to be done from an informatics standpoint before knowledge simply falls out of data. While the performance of systems in the track has been good from an IR standpoint, they also show these systems and approaches have a considerable way to go before we can just turn the data analytics crank and have medical knowledge emanate. The magnitude of how far we need to go is evident in the precision at various levels of retrieval (e.g., precision at 10 retrieved, 50 retrieved, 100 retrieved, etc.), which demonstrates how many nonrelevant visits are retrieved. In the case of typical ad hoc IR, we can probably quickly dispense with documents that are relatively easy to identify as not relevant, but this may be a more difficult task for complex patients and complex records.
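For readers who want the arithmetic spelled out, precision at a fixed cutoff is simply the fraction of the top-ranked items that are relevant. The sketch below uses invented visit IDs and judgments purely for illustration:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for item in top_k if item in relevant_ids) / k

# Illustrative numbers only: a ranked list of visit IDs and the judged-relevant set.
ranked = ["v12", "v98", "v31", "v54", "v07", "v66", "v21", "v83", "v45", "v02"]
relevant = {"v98", "v54", "v21", "v02"}

for k in (5, 10):
    print(f"P@{k} = {precision_at_k(ranked, relevant, k):.2f}")
```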
A failure analysis over the data from the 2011 track carried out at OHSU demonstrated why there are still many challenges that need to be overcome [5]. This analysis found a number of reasons why visits frequently retrieved were not relevant:
- Notes contained a very similar term confused with the topic
- Topic symptom/condition/procedure done in the past
- Most, but not all, criteria present
- All criteria present but not in the time/sequence specified by the topic description
- Topic terms mentioned as future possibility
- Topic terms not present; could not determine why the record was retrieved
- Irrelevant reference in record to topic terms
- Topic terms denied or ruled out
The analysis also found reasons why visits that were rarely retrieved were actually relevant:
- Topic terms present in record but overlooked in search
- Visit notes used a synonym for topic terms
- Topic terms not named and must be derived
- Topic terms present in diagnosis list but not visit notes
A number of research groups used a variety of techniques, such as synonym and query expansion, machine learning algorithms, and matching against ICD-9 codes, but they still did not achieve results better than manually constructed queries (which also require a form of informatics expertise in knowing how to query the clinical domain). The results also show that this is a challenging task, as the performance of different systems varied widely across topics.
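To give a flavor of one of these techniques, the sketch below shows synonym-based query expansion with a small hand-built synonym table. The terms and mappings are my own illustrative assumptions; actual participating systems typically drew on much larger vocabularies and resources rather than a hard-coded dictionary:

```python
# Hand-built synonym table for illustration only.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "mi"],
    "high blood pressure": ["hypertension", "htn"],
    "stroke": ["cerebrovascular accident", "cva"],
}

def expand_query(query):
    """Return the query plus any known synonyms of phrases it contains."""
    expanded = [query]
    lowered = query.lower()
    for phrase, alternates in SYNONYMS.items():
        if phrase in lowered:
            expanded.extend(alternates)
    return expanded

print(expand_query("Patients admitted with a heart attack"))
# ['Patients admitted with a heart attack', 'myocardial infarction', 'mi']
```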
From my perspective, these results show that successful use of big data will not come just from smart algorithms and fast computer hardware. It will also require the informatics expertise to design and implement EHRs, high-quality and complete clinical data, and a proper understanding of the clinical/health domain to make most effective use of the data. As such, achieving the value of big data passes through informatics.
References
1. Voorhees, E and Hersh, W (2012). Overview of the TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology.
2. Voorhees, EM and Harman, DK, Eds. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA, MIT Press.
3. Hersh, W and Voorhees, E (2009). TREC genomics special issue overview. Information Retrieval. 12: 1-15.
4. Buckley, C and Voorhees, E (2000). Evaluating evaluation measure stability. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece. ACM Press. 33-40.
5. Edinger, T, Cohen, AM, et al. (2012). Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC Medical Records Track. AMIA 2012 Annual Symposium, Chicago, IL, 180-188.