To Hadoop or Not to Hadoop and the Impact on Big Data
Riddle me this. Hadoop is:
- An open-source software platform for storing and processing large-scale data sets on computer clusters.
- A processing platform that can be deployed in both traditional on-site data centers and in the cloud.
- One of the most significant data processing platforms for handling big data analytics in healthcare.
- Named after the toy elephant belonging to the son of one of the platform’s original developers at Yahoo!.
The answer to all, of course, is yes.
Hadoop hit my radar a few months back while I was researching various health IT conferences to cover, one of which was the 2014 Hadoop Summit in San Jose. To be frank, I am about as far from the programmer world and mindset as it gets, but as I dug a little deeper into Hadoop, it became clear, even to my layman’s eyes, that the platform is having a significant impact on big data processing and analytics.
To help gain a better understanding of just how much of an impact, I reached out to Jared Crapo, Vice President at Health Catalyst, to answer a few questions. Earlier this year, he wrote a guest blog post on Big Data in Healthcare in which he touched on Hadoop.
1. First and foremost, what is Hadoop?
When people talk about Hadoop, they can be talking about a couple of different things, which often makes it confusing. Hadoop is an open-source distributed data storage and analysis application that was developed by Yahoo! based on research papers published by Google. Hadoop implements Google’s MapReduce algorithm by divvying up a large query into many parts, sending those respective parts to many different processing nodes, and then combining the results from each node.
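As a rough illustration of that divide-process-combine pattern (not Hadoop’s actual API), the sketch below counts words in plain Python, with a local process pool standing in for the cluster’s processing nodes; the sample data and the four-way split are assumptions made purely for the example.

```python
# Minimal sketch of the MapReduce pattern: split the input into parts,
# "map" each part on a separate worker, then "reduce" (combine) the partial
# results. A local process pool stands in for Hadoop's processing nodes.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count the words in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce step: combine the partial counts from every worker."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    data = ["hadoop stores data", "hadoop processes data", "clusters process data"]
    chunks = [data[i::4] for i in range(4)]            # divvy the work into parts
    with Pool(4) as pool:
        partial_counts = pool.map(map_chunk, chunks)   # send parts to the workers
    print(reduce_counts(partial_counts))               # combine each worker's result
```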
Hadoop also refers to the ecosystem of tools and software that works with and enhances the core storage and processing components:
- Hive – a SQL-like query language for Hadoop
- Pig – a high-level scripting language for writing MapReduce jobs
- HBase – a columnar data store that runs on top of the Hadoop distributed file storage mechanism
- Spark – a general-purpose cluster computing framework (a short query sketch follows this list)
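As a small, hedged illustration of what those tools enable, the sketch below uses PySpark to run a Hive-style SQL query on top of Spark; the admissions.csv file, its columns, and the local Spark setup are assumptions for the example only.

```python
# Hedged sketch: a SQL-style aggregation run through Spark, the kind of query
# Hive and Spark SQL make possible over Hadoop-scale data. Assumes PySpark is
# installed and a hypothetical admissions.csv (with a "diagnosis" column) exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AdmissionsExample").getOrCreate()

# Load the (hypothetical) admissions file into a distributed DataFrame.
admissions = spark.read.csv("admissions.csv", header=True, inferSchema=True)
admissions.createOrReplaceTempView("admissions")

# The query is written in familiar SQL, but Spark plans and runs it across the cluster.
spark.sql("""
    SELECT diagnosis, COUNT(*) AS admission_count
    FROM admissions
    GROUP BY diagnosis
    ORDER BY admission_count DESC
""").show()

spark.stop()
```

In principle the same code runs on a laptop or a full cluster; only the data location and the cluster configuration change.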
Unlike many data management tools, Hadoop was designed from the beginning as a distributed processing and storage platform. This moves primary responsibility for dealing with hardware failure into the software, optimizing Hadoop for use on large clusters of commodity hardware.
2. Yahoo!, of course, runs applications using Hadoop, as do Facebook and many of our largest tech companies. What are some of the key reasons behind the platform’s adoption?
Large companies have rapidly adopted Hadoop for two reasons: enormous data sets and cost. Most have data sets that are simply too large for traditional database management applications. In the summer of 2011, Eric Baldeschwieler, formerly VP of Hadoop engineering at Yahoo! and then CEO of Hortonworks (a company that provides commercial support for Hadoop), said that Yahoo! had 42,000 nodes in several different Hadoop clusters with a combined capacity of about 200 petabytes (200,000 terabytes).
Even if existing database applications could accommodate these large data sets, the cost of typical enterprise hardware and disk storage becomes prohibitive. Hadoop was designed from the beginning to run on commodity hardware with frequent failures. This substantially reduces the need for expensive hardware infrastructure to host a Hadoop cluster. Because Hadoop is open source, there are no licensing fees for the software either, another substantial savings.
I think it’s important to note that both of these companies started using traditional database management systems and didn’t start leveraging Hadoop until they had no more scaling options.
3. In February of this year, HIMSS Journal released a report on big data, “Big data analytics in healthcare: promise and potential.” In the report, the authors list Hadoop as the most significant data processing platform for big data analytics in healthcare. How do you see Hadoop impacting and/or changing healthcare analytics?
With Hadoop, researchers can now work with data sets that were previously impossible to handle. A team in Colorado is correlating air quality data with asthma admissions. Life sciences companies use genomic and proteomic data to speed drug development. The Hadoop data processing and storage platform opens up entirely new research domains for discovery. Computers are great at finding correlations in data sets with many variables, a task for which humans are ill-suited.
However, for most healthcare providers, the data processing platform is not the real problem, and most healthcare providers don’t have “big data”. A hospital CIO I know plans for future storage growth by estimating 100MB of data generated per patient, per year. A large 600-bed hospital can keep a 20-year data history in a couple hundred terabytes.
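A quick back-of-the-envelope check of that estimate (the annual patient volume below is an assumption implied by the figures above, not a number from the interview):

```python
# Back-of-the-envelope check of the storage estimate above. The 100 MB/patient/year
# figure comes from the text; the annual patient volume is an assumed round number
# consistent with the "couple hundred terabytes over 20 years" claim.
mb_per_patient_per_year = 100        # CIO's planning figure
patients_per_year = 100_000          # assumed volume for a large hospital
years = 20

total_mb = mb_per_patient_per_year * patients_per_year * years
total_tb = total_mb / 1_000_000      # 1 TB = 1,000,000 MB
print(f"~{total_tb:.0f} TB over {years} years")   # ~200 TB: a couple hundred terabytes
```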
Every day, there are more than 4.75 billion content items shared on Facebook (including status updates, wall posts, photos, videos, and comments), more than 4.5 billion “Likes,” and more than 10 billion messages sent. More than 250 billion photos have been uploaded to Facebook, and more than 350 million photos are uploaded every day on average. Facebook adds 500 terabytes a day to their Hadoop warehouse.
Southwest’s fleet of 607 Boeing 737 aircraft generates 262,224 terabytes of data every day. They don’t store it all (yet), but the planes’ instrumentation produces that much data.
Healthcare analytics is generally not being held back by the capability of the data processing platforms. There are a few exceptions in the life sciences, and genomics provides another interesting use case for big data. But for most healthcare providers, the limiting factor is our willingness and ability to let data inform and change the way we deliver care. Today, it takes more than a decade for compelling clinical evidence to become common clinical practice. We have known for a long time that babies born at 37 weeks are twice as likely to die from complications like pneumonia and respiratory distress as those born at 39 weeks. Yet 8 percent of births are non-medically necessary pre-term deliveries (i.e., before 39 weeks).
The problem we should be talking about in healthcare analytics is not what the latest data processing platform can do for us. We should be talking about how we can use data to engage clinicians and help them provide higher-quality care. It’s not how much data you have that matters, but how you use it. At our upcoming September Healthcare Analytics Summit, national experts and healthcare executives will lead an interactive discussion on how healthcare analytics has gone from a “nice to have” to a “must have” in order to support the requirements of healthcare transformation.
4. Keying in on that statement of usage, can you give us an example of how clinicians would use data sources outside their own environment?
Data from other clinical providers in your geography can be very useful. Claims data gives a broad picture but not a deep one. Data from non-traditional sources also has surprising relevance; in some cases, it’s a better predictor than clinical data. For example:
- EPA data on geographical toxic chemical load adds insight to cancer rates for long-term residents.
- The CMS-HCC risk adjustment model can help providers understand why patients in their area seem to have higher or lower risk for certain disease conditions.
- A household size of one increases the risk of readmission because there is no other caregiver in the home.
5. What are the drawbacks of Hadoop and what do CTOs, CIOs and other IT leaders need to consider?
Compared with typical enterprise infrastructure, Hadoop is a very young technology, and its capabilities and tools are relatively immature.
So is the pool of people with deep Hadoop experience. The only people with 10 years of experience are the two guys at Yahoo! who created it. If Hadoop solves a data analysis problem for your organization, make sure you plan for enough skilled people to help deploy, manage, and query data from it. Remember, your competition for these resources will be large technology and financial services companies, and people with Hadoop experience are in high demand.
You will probably also need to consider an alternative hardware maintenance approach. Hadoop was designed for commodity hardware, with its attendant higher failure rates. Instead of purchasing maintenance on the hardware and having someone else come fix or replace it when it breaks, plan to have spare nodes sitting in the closet, or even racked up in the data center. Deploying Hadoop on expensive enterprise hardware with SAN-based disk and 24×7 maintenance coverage reduces the value proposition of the technology.
The good news is that the commercial database vendors, including Microsoft, Oracle, and Teradata, are all racing to integrate Hadoop into their offerings. These integrations will make it much easier to utilize Hadoop’s unique capabilities while leveraging existing infrastructure and data assets.
6. Finally, where do you see development of this platform going and what will be its ongoing impact on big data?
Fifteen years ago, we didn’t capture data unless we knew we needed it. The cost to capture and store it was just too high. Fifteen years from now, reductions in the cost to capture and store data will likely mean that we will capture and store everything. Hadoop is a huge leap forward in our ability to efficiently store and process large quantities of data. This allows more people to spend more time thinking about interesting questions and how to apply the resulting answers in a meaningful and useful way.
Jared Crapo joined Health Catalyst in February 2013 as a Vice President. Prior to coming to Catalyst, he worked for Medicity as Chief of Staff to the CEO. During his tenure at Medicity, he was also the Director of Product Management and the Director of Product Strategy. Jared co-founded Allviant, a Medicity spin-out that created consumer health management tools. Earlier in his career, he developed physician accounting systems and health claims payment systems.