William Hersh, MD, Professor and Chair, OHSU
Blog: Informatics Professor
Twitter: @williamhersh
Earlier this year, I submitted a response (and posted it on this blog) to a National Institutes of Health (NIH) Request for Information (RFI) on a draft of their Strategic Plan for Data Science. My main concern was that, while there was nothing in the report I disagreed with, I believed there needed to be more attention to the science of data science.
In October, the NIH released another RFI, this one entitled Proposed Provisions for a Draft Data Management and Sharing Policy for NIH Funded or Supported Research. As with the Strategic Plan for Data Science, most of what is in this draft plan is, in my opinion, reasonable. But, as with the earlier RFI, what concerns me more is what has been left out.
My main concerns have to do with the definition and use of “scientific data.” Early on, the plan defines “scientific data” as “the recorded factual material commonly accepted in the scientific community as necessary to validate and replicate research findings including, but not limited to, data used to support scholarly publications.” The draft further notes that “scientific data do not include laboratory notebooks, preliminary analyses, completed case report forms, drafts of scientific papers, plans for future research, peer reviews, communications with colleagues, or physical objects, such as laboratory specimens. For the purposes of a possible Policy, scientific data may include certain individual-level and summary or aggregate data, as well as metadata. NIH expects that reasonable efforts should be made to digitize all scientific data.”
The draft report then runs through the various provisions. Among them are:
- Data Management and Sharing Plans – new requirements to make sure data is FAIR (findable, accessible, interoperable, and reusable)
- Related Tools, Software and/or Code – documentation of all the tools used to analyze the data, with a preference toward open-source software (or documentation of reasons why open-source software is not used)
- Standards – what standards, including data formats, data identifiers, definitions, and other data documentation, are employed
- Data Preservation and Access – processes and descriptions for how data is preserved and made available for access
- Timelines – for access, including whether any data are held back to allow publication(s) by those who collected them
- Data Sharing Agreements, Licensing, and Intellectual Property – which of these are used and how so
All of the above are reasonable. However, my main concern is what appears to be a relatively narrow scope of what constitutes scientific data. As such, what follows is what I submitted in my comments to the draft policy. (These comments were also incorporated into a larger response by the Clinical Data to Health [CD2H] Project, of which I am part.)
The definition of scientific data implies that such data is only that which is collected in active experimentation or observation. This ignores the increasing amount of scientific research that does not come from experiments, but rather is derived from real-world measurements of health and disease. This includes everything from data routinely collected by mobile or wearable devices to social media to the electronic health record (EHR). A growing amount of research analyzes and makes inferences using such data.
It could be argued that this sort of data derived “from the wild” should adhere to the provisions above. However, this data is also highly personal and usually highly private. Would you or I want our raw EHR in a data repository? Perhaps connected to our genome data? But if such data are not accessible at all, then the chances for reproducibility are slim.
There is also another twist on this, which concerns data used for informatics research. In a good deal of informatics research, such as the patient cohort retrieval work I do in my own research¹, we use raw, identifiable EHR data. We then evaluate the performance of our systems and algorithms with this data. Obviously, we want this research to be reproducible as well.
There are solutions to these problems, such as Evaluation as a Service² approaches, which protect such data by allowing researchers to send their systems to the data in walled-off containers and receive only aggregate results. One approach in this instance might be to maintain encrypted snapshots of the data that could be decrypted under highly controlled circumstances.
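To make the idea concrete, below is a minimal sketch of what such a harness might look like in Python, assuming the cryptography library's Fernet primitive for the encrypted snapshot. The function names, the toy records, and the precision/recall metrics are illustrative assumptions of mine, not anything specified by the draft policy or by the Evaluation-as-a-Service literature. The point is simply that the protected data are decrypted only inside the evaluation environment, and only aggregate results ever leave it.

```python
# Hypothetical Evaluation-as-a-Service style harness (illustrative only).
# The protected EHR snapshot stays encrypted at rest; the submitted system
# sees decrypted records only inside this process, and only aggregate
# metrics are returned to the researcher.

import json
from cryptography.fernet import Fernet


def encrypt_snapshot(records: list[dict], key: bytes) -> bytes:
    """Serialize and encrypt a data snapshot for storage at rest."""
    return Fernet(key).encrypt(json.dumps(records).encode("utf-8"))


def evaluate_submission(predict, encrypted_snapshot: bytes, key: bytes) -> dict:
    """Run a submitted prediction function against the protected data,
    returning only aggregate results, never individual records."""
    records = json.loads(Fernet(key).decrypt(encrypted_snapshot))
    relevant = retrieved = hits = 0
    for record in records:
        label = record["relevant"]          # gold-standard judgment
        decision = predict(record["text"])  # the submitted system's output
        relevant += label
        retrieved += decision
        hits += label and decision
    return {
        "precision": hits / retrieved if retrieved else 0.0,
        "recall": hits / relevant if relevant else 0.0,
        "n_records": len(records),
    }


if __name__ == "__main__":
    key = Fernet.generate_key()
    snapshot = encrypt_snapshot(
        [{"text": "chest pain, troponin elevated", "relevant": 1},
         {"text": "routine wellness visit", "relevant": 0}],
        key,
    )
    # A toy "submitted system": flag any note mentioning chest pain.
    print(evaluate_submission(lambda text: int("chest pain" in text), snapshot, key))
```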
In any case, the NIH Data Management and Sharing Policy for NIH Funded or Supported Research is a great starting point, but it should take a broader view of scientific data and develop policies to ensure that research is reproducible. Research done with data that does not originate as scientific data should be accounted for, including when that data is used for informatics research.
References
¹ Wu, S, Liu, S, et al. (2017). Intra-institutional EHR collections for patient-level information retrieval. Journal of the Association for Information Science and Technology, 68: 2636-2648.
² Hanbury, A, Müller, H, et al. (2015). Evaluation-as-a-service: overview and outlook. arXiv preprint arXiv:1512.07454.
This post first appeared on The Informatics Professor. Dr. Hersh is a frequent contributing expert to HITECH Answers.