William Hersh, MD, Professor and Chair, OHSU
Blog: Informatics Professor
Twitter: @williamhersh
Like many academic health science universities, my institution has undertaken a planning process around data science. In the process of figuring out how to merge our various data-related silos, we looked at what other universities were doing. One high-profile effort has been launched at the University of Michigan. The formation of that program and others like it inspired a statistician, David Donoho, to look at data science from the perspective of his field, 50 years after famed statistician John Tukey had called for reformulation of the discipline into a science of learning from data. Donoho’s resulting paper [1] motivated me to look at data science from the perspective of my field, biomedical and health informatics.
Statistics has of course been around for centuries, although Donoho drew from an event 50 years ago, a lecture by John Tukey. The informatics field has not been around nearly as long, but one summary of its history by Fourman credits the origin of the term to Philippe Dreyfus in 1962 [2]. However, the Wikipedia entry for informatics attributes the term to a German computer scientist, Karl Steinbuch, in 1956. Fourman also notes that the heaviest use of the term informatics comes from its attachment to various biomedical and health terms [2].
If the informatics field is indeed 60 years old, I have been working in it for about half of its existence, since I started my National Library of Medicine (NLM) medical informatics fellowship in 1987. I have certainly devoted a part of my career to raising awareness of the term informatics, making the case for it as a discipline [3]. Clearly the discipline has become recognized, with many academic departments, mostly in health science universities, and a new physician subspecialty devoted to it [4].
And now comes data science. What are we in informatics to make of this new field? Is it the same as informatics? If not, how does it differ? I have written about this before.
Donoho’s paper does offer some interesting insights [1]. I get a kick out of one tongue-in-cheek definition he gives of a data scientist, whom he defines as a “person who is better at statistics than any software engineer and better at software engineering than any statistician.” Perhaps we could substitute informatician for software engineer, i.e., a data scientist is someone who is better at statistics than any informatician and is better at informatics than any statistician?
Donoho does later provide a more serious definition of data science, which is that it is “the science of learning from data; it studies the methods involved in the analysis and processing of data and proposes technology to improve methods in an evidence-based manner.” He further notes, “the scope and impact of this science will expand enormously in coming decades as scientific data and data about science itself become ubiquitously available.”
Donoho goes on to note six key aspects (he calls them “divisions” of “greater data science”) that I believe further serve to define the work of the field:
- Data Exploration and Preparation
- Data Representation and Transformation
- Computing with Data
- Data Modeling
- Data Visualization and Presentation
- Science about Data Science
Clearly data is important to informatics. But is it everything? We can begin to answer this question by thinking about the activities of informatics where data, or at least “Big Data,” is not central. While I suppose it could be argued that all applications of informatics make use of some amount of data, there are aspects of those applications where data is not the central element. Consider the many complaints that have emerged around the adoption of electronic health records, such as poor usability, impeding of workflow, and even concerns around patient safety [5]. Academic health science leaders can lead the charge in use of data but must do so in the context of a framework that protects the rights of patients, clinicians, and others [6].
Like many informaticians, I do remain enthusiastic about the prospect of the growing quantity of data advancing our understanding of human health and disease, and how to treat the latter better. But I also have some caveats. I have concerns that some data scientists read too much into correlations and associations, especially when so much medical data capture is imprecise, standards are not widely adopted, and data is inaccessible when not structured well (which can lead us to try to “unscramble eggs”).
It is clear that informatics cannot ignore data science, but our field must also be among the leaders in determining its proper place and usage, especially in health-related areas. We must recognize the overlap as well as appreciate the areas where informatics can be synergistic with data science.
References
1. Donoho, D (2015). 50 years of Data Science. Princeton NJ, Tukey Centennial Workshop.
2. Fourman, M (2002). Informatics. In International Encyclopedia of Information and Library Science, 2nd Edition. J. Feather and P. Sturges. London, England, Routledge: 237-244.
3. Hersh, W (2009). A stimulus to define informatics and health information technology. BMC Medical Informatics & Decision Making.
4. Detmer, DE and Shortliffe, EH (2014). Clinical informatics: prospects for a new medical subspecialty. Journal of the American Medical Association. 311: 2067-2068.
5. Rosenbaum, L (2015). Transitional chaos or enduring harm? The EHR and the disruption of medicine. New England Journal of Medicine. 373: 1585-1588.
6. Koster, J, Stewart, E, et al. (2016). Health care transformation: a strategy rooted in data and analytics. Academic Medicine. Epub ahead of print.
This article first appeared on The Informatics Professor. Dr. Hersh is a frequent contributing expert to HITECH Answers.