By Abhinav Shashank, Chief Executive Officer & Co-Founder, Innovaccer
Twitter: @abhinavshashank
Twitter: @innovaccer
Unstructured data has always been a complication everywhere, and nowhere more so than in the healthcare industry. Invaluable information is locked away in unstructured data that is difficult to access, which makes it hard to keep pace with rapidly changing expectations around preventive care, wellness, and speedy diagnosis and treatment.
In a quickly transforming healthcare industry that is becoming an integral part of the “Information Generation,” people have tried time and again to break the ice and come up with ways to facilitate data sharing, but the limited capabilities of data warehouses have limited that growth.
Although the data lake is a relatively new term, tagged by some as a ‘dream’, many organizations have already made it a reality. Google, Facebook and Yahoo have not only used data lakes but used them to innovate in an agile way, creating value chains that can process enormous amounts of data quickly and reliably, at much lower cost.
Data Lake: How Is It Different from a Data Warehouse?
In 2010, James Dixon came up with a new architecture, and his cutting-edge idea, known as the ‘Data Lake’, gained considerable attention and regard. In simple terms, a data lake is an easily accessible, centralized storage repository that holds large volumes of structured and unstructured data with a ‘store-everything’ approach. The key differences between a data lake and a data warehouse are:
- A data warehouse only stores structured data, while a data lake can accommodate structured as well as unstructured data.
- Data must be modeled before it can be loaded into a data warehouse. With a data lake, you just load the raw data and it’s ready to use; a schema is applied only when the data is read (see the sketch after this list).
- Since a data warehouse is a highly structured repository, it is difficult, if not impossible, to change its underlying structure. A data lake, on the other hand, can easily be reconfigured on the go.
- A data lake built on a Hadoop platform costs less to store data, and because Hadoop is open source, it doesn’t require any licensing fees.
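To make the schema-on-read difference concrete, here is a minimal PySpark sketch. The HDFS path and the field names (patient_id, encounter_date, diagnosis_code) are illustrative assumptions, not a reference to any particular system.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# A warehouse requires the table schema before any data can be loaded.
# In a lake the raw files are already "loaded"; a schema is applied only on read.
encounter_schema = StructType([
    StructField("patient_id", StringType()),
    StructField("encounter_date", StringType()),
    StructField("diagnosis_code", StringType()),
])

encounters = (spark.read
              .schema(encounter_schema)
              .json("hdfs:///lake/raw/encounters/"))  # hypothetical landing zone

# If requirements change, re-read the same raw files with a different schema;
# nothing has to be migrated or reloaded.
encounters.select("patient_id", "diagnosis_code").show(5)
```

The same raw files can later be read with a broader or entirely different schema, which is what makes the easy reconfiguration mentioned above possible.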
How is the data lake turning the tables in healthcare?
The HITECH Act of 2009 dramatically increased the rate of EHR adoption in the United States. Before it, doctors making notes on thick paper charts wasn’t an unusual sight; by 2015, 96% of non-federal acute care hospitals had adopted EHRs, compared to 71.9% in 2011. This points to the vast amount of data now flowing through the healthcare industry: some of it sits in a relational structure, or will be pulled into one, and data lakes are what integrate this Big Data across the board.
1.) Incoming data from diverse sources
We can broadly categorize all the data in the healthcare industry into two sources: claims data and clinical data.
Claims data comes from the payers and is extremely uniform and structured, covering the patients receiving care, their demographics and the care setting they are in. Because it is generated for reimbursement, it is relatively complete and captures much of the information needed for billing. But since the data is first abstracted and then summarized to surface only what matters for provider reimbursement, it doesn’t list everything and reads more like a general diagnosis than a full record.
The second source of data in healthcare is clinical data. True, the healthcare industry has been one of the last to go electronic, yet there is an immense amount of data from the providers’ end to be processed. Patients’ important and critical information about diagnoses, claims and medical history is stored in EHRs and is used to analyze a patient’s health across every time frame at once, an approach we call Patient 360.
2.) How does the data lake come into play?
Both claims data and clinical data first arrive in a raw, abstract form that has to be summarized and analyzed before it yields anything meaningful. The data is pulled into the data lake, where each data element is assigned a unique identifier and a set of metadata tags.
Most of the time, the data lake sits on the Hadoop Distributed File System (HDFS), a cost-effective repository that can accommodate data from disparate sources, structured or unstructured. This data is then handled with extract, load and transform (ELT) methods for collection and integration, and can later be processed by Apache Spark, a distributed analytics framework, as sketched below.
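As a rough sketch of that load-then-transform flow, the PySpark example below lands a raw claims feed in the lake, tags each record with a unique identifier and basic metadata, and then derives a small curated view. The paths, column names and tagging scheme are assumptions made for illustration; they are not a description of any specific production pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-ingest").getOrCreate()

# Extract + Load: land the raw claims feed in the lake as-is.
raw_claims = spark.read.json("hdfs:///lake/raw/claims/")  # hypothetical path

# Tag each element with a unique identifier and basic metadata on the way in.
tagged = (raw_claims
          .withColumn("record_id", F.expr("uuid()"))
          .withColumn("source_system", F.lit("payer_feed"))
          .withColumn("ingested_at", F.current_timestamp()))
tagged.write.mode("append").parquet("hdfs:///lake/tagged/claims/")

# Transform: downstream jobs shape only the slice they need, for example
# total allowed amount per member.
per_member = (tagged
              .groupBy("member_id")                      # assumed field name
              .agg(F.sum("allowed_amount").alias("total_allowed")))
per_member.write.mode("overwrite").parquet("hdfs:///lake/curated/cost_per_member/")
```

Because the raw feed is written to the lake before any transformation, downstream jobs can reshape the same data in different ways without going back to the source systems.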
3.) Dealing with future requirements
With the healthcare industry’s transition from fee-for-service to value-based care, the amount of data to be processed is going to increase exponentially. Ever since the ACA was signed into law, more than 600 Accountable Care Organizations have been formed, serving about 20.5 million people. Needless to say, this data will skyrocket in the coming years.
A data lake provides a massive environment that can hold bulk data in its raw form, and when equipped with strategic and analytic tools, that data becomes not only machine-readable but also easy for providers and payers alike to use. Using a scalable data lake as a repository also lets huge volumes of data be kept and processed in aggregate rather than in silos, which facilitates analysis and drawing insights.
Countless Possibilities with an Integrated Data Lake
As a flexible and reliable platform, the data lake offers endless possibilities in healthcare: from keeping pace with the transition to value-based care and providing transparency, to scaling with growth and delivering a holistic view of care services. Its uses include, but are not limited to:
1.) Building a Population Health Management model:
An effective PHM model is based on four pillars: technology, analytics, care coordination and patient engagement. Using integrated data lakes will result in better-equipped providers, able to make accurate decisions – leading to reduced readmissions, comprehensive care management and quality health care.
2.) Optimizing and managing treatments in real-time:
Data integration and sharing empower a network of PCPs, specialists and patients; combined with analysis of each patient’s clinical and claims data, this helps provide patients with the right care at the right time.
3.) Processing chunks of data at once:
A data lake is no respecter of data: it can hold all kinds, from structured to unstructured, and offers the agility to reconfigure the underlying schema, a flexibility a data warehouse doesn’t provide. Moreover, the raw data stored in a data lake is never lost; it is retained in its original format for further analytics and processing.
4.) Accelerated query processing:
Since data governance comes into effect on the way out, i.e. the schema is applied when data is read rather than when it is ingested, the user doesn’t need prior knowledge of how the data was ingested. This not only increases efficiency but also supports high concurrency, faster query processing and complex joins.
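A hedged sketch of what that can look like in practice, again in PySpark with assumed paths and column names: an analyst registers views over the lake and joins claims with clinical encounters at query time, without knowing anything about the jobs that ingested them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Expose two zones of the lake as SQL views without needing to know how the
# underlying feeds were ingested.
spark.read.parquet("hdfs:///lake/tagged/claims/").createOrReplaceTempView("claims")
spark.read.parquet("hdfs:///lake/tagged/encounters/").createOrReplaceTempView("encounters")

# A join across claims and clinical data, resolved entirely at query time.
recent_utilization = spark.sql("""
    SELECT c.member_id,
           COUNT(DISTINCT e.encounter_id) AS encounters_90d,
           SUM(c.allowed_amount)          AS spend_90d
    FROM claims c
    JOIN encounters e ON e.patient_id = c.member_id
    WHERE e.encounter_date >= date_sub(current_date(), 90)
    GROUP BY c.member_id
""")
recent_utilization.show(10)
```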
5.) Cost Effective:
A data lake can store massive amounts of otherwise siloed data, with the flexibility to grow and shrink as needed. Since it is typically implemented on Hadoop, an open-source platform, it costs less while still performing efficiently.
The amount of unstructured data in the healthcare industry is immense, and with this data growing at a rate of 48% per year, we need to make healthcare a data-driven industry with greater scalability, performance and analytic capability. We have only scratched the surface of what data lakes can do; in the future, when medical imaging becomes an essential part of diagnosis, the hitch of unstructured data will be far easier to manage with comprehensive use of data lakes. In the future of healthcare, the data lake will be a prominent component, growing across the enterprise.
This article was originally published on Innovaccer and is republished here with permission.