Hadoop Drives Down Costs, Drives Up Usability With SQL Convergence
eweek.com | Posted 2013-04-22
SPECIAL FEATURE: As more enterprises begin to adopt the Hadoop big data wrangling technology, there is a growing need for SQL convergence.
In 2011, Charles Boicey looked at Twitter, Facebook, Yahoo and other major Web entities and said to himself, "Why do those guys get to have all the fun?"
Boicey, an informatics solutions architect at the UC Irvine Medical Center, said he could see that the big data technologies underlying the major Web companies could also help the medical center's IT environment.
Boicey told eWEEK, "I was intrigued by the volume of data and the speed with which they could access it, and I said, 'Why can't we do that' in healthcare?"
"We came to the conclusion that healthcare data although domain specific is structurally not much different than a tweet, Facebook posting or LinkedIn profile and that the environment powering these applications should be able to do the same with health care data," he wrote in a 2012 blog post.
Moreover, "A lab result is not that much different than a Twitter message," he told eWEEK. "Pathology and radiology reports share the same basic structure as a LinkedIn profile with headers, sections and subsections. A medical record shares characteristics of Facebook in that both represent events over time."
Indeed, health care data shares many qualities with the data found in the large Web properties. Both have a seemingly infinite volume of data to ingest, and it comes in all types and formats, including structured, unstructured, video and audio, Boicey said. "We also noticed the near zero latency in which data was not only ingested but also rendered back to users was important. Intelligence was also apparent in that algorithms were employed to make suggestions such as people you may know."
That was the beginning of a strategy to employ Hadoop at the medical center. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports the running of applications on large clusters of commodity hardware. Hadoop, derived from Google's MapReduce and Google File System (GFS) papers, became a target technology because of its attractive scale-to-cost ratio and because it is open source.
The UCI Medical Center's first big data project was to build an environment capable of receiving Continuity of Care Documents (CCDs) via a JSON pipeline, storing them in MongoDB and then rendering them via a Web user interface that had search capabilities. From there, the new system, known as Saritor, went online.
Boicey said Saritor became a necessity because electronic medical record (EMR) systems cannot handle complex operations such as anomaly detection, machine learning, complex algorithms or pattern set recognition. The Enterprise Data Warehouse (EDW), for its part, supports quality control, operations, clinicians and researchers, but it has a limitation of its own.
"We, like many organizations with data warehouses, run ETL [extract, transform, load] processes at night to minimize the load on the production systems," Boicey said in his post. "We have some real-time interfaces with the data warehouse, but not all data is ingested in real time. In turn, our data suffers from a latency factor of up to 24 hours in many cases, making this environment suboptimal."
The Hadoop environment stores a wide range of health care data, including EMR-generated data, genomic data, financial data, patient and caregiver data, smart pump data and ventilator data. "Any electronically generated data in a health care environment can be ingested and stored in Hadoop and, most importantly, on commodity hardware," Boicey said.
Hadoop also enabled the medical center to pull its legacy data into Saritor, covering 1.2 million patients and more than 9 million records. "We had over 20 years of legacy data that was costing us $100,000 to maintain," said Boicey.
Putting the data into Saritor also enabled UCI Medical Center's researchers to have access to 23 years of data that they didn't have ready access to before. "They can now ask questions of the data that they couldn't before," Boicey said.
Boicey, who architected Saritor, had 15 years of experience in trauma and critical care nursing before he switched careers to IT. Saritor now is being updated with a system to acquire home monitoring data. The center also is working on a pilot to enhance patient monitoring in the hospital and to support patient self-monitoring.
In addition, a UCI student project is underway to develop a sentiment analysis dashboard to better understand the social media environment external to UC Irvine Health, Boicey said in a post earlier this month. As part of the patient experience feedback loop, the medical center will be able to reach out and connect with patients to better understand their concerns and enhance the patient experience.
Saritor uses the Hortonworks Hadoop distribution and surrounding tools. Boicey said they chose Hortonworks because it suited their requirements for an open-source platform without proprietary features. "When you're standing up something like this," he said, "you don't want to be paying through the nose." Plus, he added, Hortonworks was "really interested" in working with the health care industry.
"Health care has been an area of interest for us," said Shaun Connolly, vice president of corporate strategy at Hortonworks. "Over the last six to nine months, one of the trends we've seen with Hadoop is that people are moving from kicking tires to more targeted pilots and proofs-of-concepts." And although the use cases vary, as was the case with UCI, "they are usually driven by a desire to grab new forms of data and weave it in with their existing data," he said.
Meanwhile, Neustar, a provider of real-time information and analytics for the Internet, telecommunications, entertainment and marketing industries as well as a provider of clearinghouse and directory services to the global communications and Internet industries, also is using Hadoop to expand its data warehousing environment.
Mike Peterson, vice president of platforms and data architecture for Neustar, said the company used Hadoop to transition from an environment consisting of Oracle databases, Netezza appliances and Teradata technology that was becoming cost prohibitive and restrictive.
With that system, "We could only keep 1 percent of our data for 60 days, but we were tasked with keeping 100 percent of our data for a year, and that led us to Hadoop," Peterson told eWEEK.
Because they were bleeding money, the team wanted a cost-effective solution. "Our target was $500 per terabyte. We were at $100,000 per terabyte with the old system," Peterson said. "With our Hadoop cluster, we're now at $900 per terabyte."
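To put Peterson's figures in perspective, the arithmetic can be worked through in a few lines of Python. This is only an illustrative sketch: the three dollar amounts are the ones Peterson quoted, while the variable and function names are made up for the example.

```python
# Dollar figures per terabyte, as quoted by Peterson.
OLD_COST_PER_TB = 100_000     # previous data warehouse environment
HADOOP_COST_PER_TB = 900      # current Hadoop cluster
TARGET_COST_PER_TB = 500      # Neustar's stated goal


def reduction_factor(old: float, new: float) -> float:
    """How many times cheaper the new cost is than the old one."""
    return old / new


def percent_saved(old: float, new: float) -> float:
    """Share of the old per-terabyte cost that has been eliminated."""
    return (old - new) / old * 100


factor = reduction_factor(OLD_COST_PER_TB, HADOOP_COST_PER_TB)
saved = percent_saved(OLD_COST_PER_TB, HADOOP_COST_PER_TB)
gap_to_target = HADOOP_COST_PER_TB - TARGET_COST_PER_TB

print(f"Roughly {factor:.0f} times cheaper ({saved:.1f}% lower per terabyte)")
print(f"Still ${gap_to_target} per terabyte above the ${TARGET_COST_PER_TB} target")
```

In other words, the move cut the per-terabyte cost by a factor of about 111, even though the cluster had not yet reached the $500 goal.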
In addition to saving major financial resources, Neustar's Hortonworks-based Hadoop system captures all the data—structured and unstructured—enables real data science, and provides more data to push into the reporting and analytics layer, Peterson said.
Yet another Hortonworks Hadoop user, Luminar, is an analytics and modeling provider that serves the U.S. Hispanic market, transforming Hispanic consumer data into insights and business intelligence.
Franklin Rios, president of Luminar, told eWEEK the company has more than 140 million consumer records with transactional data added to them. Rather than use sample data to tap into Latino consumer sentiment, Luminar took a big data approach, he said. Like Neustar, Luminar's existing environment could not handle the growth of data, and to scale that environment as it was would mean adding more expensive hardware, software and personnel. "But that was not feasible from a financial point of view," he said.
"At Luminar, we use analytical modeling, technology and data processing to help our clients fine-tune their marketing strategies [and] tailor their messages to Hispanic consumers," Rios said.
The Hadoop system "became like an 'easy' button for my team," he said, noting that to date Luminar has 150 terabytes of data—70 percent of which is structured data and 30 percent of which is unstructured. Going in, Rios said, Luminar's projections were that the company would gain 13 to 15 percent cost efficiency, but so far that figure is at 28 percent based on the company's calculations.
But it is not enough that Hadoop can save enterprises money and help them scale apps; it also must be accessible. As enterprises continue to adopt Hadoop, a key trend emerging is the convergence of Hadoop and SQL (Structured Query Language).
"This trend is important because most BI [business intelligence] tools want to speak SQL, and Hive, which allows that, works through MapReduce [and] is too slow," Andrew Brust, president of Blue Badge Insights and a big data guru, told eWEEK. "The various technologies involved tend to query Hadoop's Distributed File System (HDFS) data directly, bypassing MapReduce. This works better with BI data discovery tools because you can interactively query without waiting forever between queries. It is also important because SQL skills are widespread and MapReduce skills are most definitely not."
Brust noted some technologies in this space, including Cloudera with Impala, Teradata Aster with SQL-H, Hadapt, ParAccel with ODI and Microsoft with PolyBase—a component of SQL Server Parallel Data Warehouse. And there are many others.
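The skills gap Brust describes is easier to see side by side. The following is a minimal sketch in Python, not code from any of the products named above: it counts records per key once as hand-written map, shuffle and reduce steps, and once as a single SQL query. The sample records, the table name and the function names are hypothetical, and SQLite from the Python standard library stands in for a SQL-on-Hadoop engine purely for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, product) pairs, one per sale.
RECORDS = [
    ("west", "widget"),
    ("east", "gadget"),
    ("west", "gadget"),
    ("east", "gadget"),
    ("west", "widget"),
]


def map_phase(records):
    """Emit a (key, 1) pair for every record, as a mapper would."""
    for region, _product in records:
        yield region, 1


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_with_mapreduce(records):
    return reduce_phase(shuffle_phase(map_phase(records)))


def count_with_sql(records):
    """The same aggregation as one declarative query (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT region, COUNT(*) FROM sales GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(rows)


mapreduce_counts = count_with_mapreduce(RECORDS)
sql_counts = count_with_sql(RECORDS)
print(mapreduce_counts)
print(sql_counts)
```

Both approaches return the same counts. The difference is that the SQL version states what is wanted in one statement, while the MapReduce-style version spells out how to compute it, which is the kind of work that has traditionally required a programmer.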
"The excitement around adding the SQL layer is aimed at empowering all the people out there that are knowledgeable in SQL," Hortonworks' Connolly said. "We do see that as an important use case, and we're investing in Hive."
"Sqoop, an Apache Hadoop tool, allows us to extract data from the relational database into HDFS."
"As crappy as it is, SQL is a language everybody uses and you're not going to change that," Jim Kaskade, CEO of Infochimps, told eWEEK. Infochimps delivers a cloud service solution for big data that eliminates the struggle to master all the new big data technologies. Infochimps claims its solutions make it faster, easier and less complex to build and manage big data applications and quickly deliver actionable insights. Infochimps offers an elastic Hadoop cloud for massively parallel data analysis.
"So much of the world understands SQL and BI. … For crying out loud, you can connect Excel to SQL," said Fred Gallagher, general manager of Vectorwise, an analytic SQL database for big data from Actian. Actian enables organizations to transform big data into business value with data management solutions to connect, analyze and take automated action across their business operations. The Vectorwise 3.0 analytic database offers companies high-performance integration for Hadoop with the Vectorwise Hadoop Connector.
"The trend toward SQL is the ability for non-engineering-level users to access data from Hadoop," said Lloyd Tabb, CEO and co-founder of Looker, a BI software company with a SQL-based solution. "Traditionally, to access Hadoop you had to be a programmer. But SQL is exactly what it is defined as, a Structured Query Language, and Hadoop vendors are moving toward SQL to make Hadoop more accessible. This lowers the barrier to entry."
Steve Hillion, chief data officer at Alpine Data Labs, which enables end-to-end analytics on combined data from Hadoop and relational databases, said he understands the need for SQL on Hadoop, as "MapReduce is not the friendliest technology." Alpine, however, looks to get the best performance, "so we get down to the lowest level interface, which is MapReduce," he said.
Last year, Ovum analyst Tony Baer began taking notice of the convergence between Hadoop and SQL. In a blog post from October 2012, Baer wrote, "The Hadoop and SQL worlds are converging, and you're going to be able to perform interactive BI analytics on it.
"SQL convergence is the next major battleground for Hadoop," he added.
"Hadoop has not until now been for the SQL-minded," Baer said. "The initial path was, find someone to do data exploration inside Hadoop, but once you're ready to do repeatable analysis, ETL [extract, transform, load] it into a SQL data warehouse."
Cloudera's Impala came out of customer demand, said Amr Awadallah, founder, CTO and vice president of engineering of the Hadoop distributor.
With Impala, you can query data stored in HDFS or Apache HBase in real time, using SELECT, JOIN and aggregate functions, said Justin Erickson, senior product manager at Cloudera. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries, Erickson said in a blog post co-authored with Marcel Kornacker, the architect of Cloudera Impala.
"We see a high majority of our customers using Impala," which is currently in beta, Erickson told eWEEK. Hadoop is valuable for three primary purposes: scaling systems, cost efficiency and flexibility, he said, adding that providing SQL support enhances the flexibility.
Spark and Shark that are better. They are out of UC Berkeley and completely open. There is much more interest in the Berkeley projects because anybody can come and participate. Impala is open and on GitHub, but it is dominated by Cloudera engineers. So there is open source and there is open source."
The Spark and Shark technologies caught the attention of Monte Zweben, co-founder and CEO of Splice Machine, a maker of the Splice SQL Engine, which is a SQL-compliant database designed for big data applications.
Zweben said the explosion of data being generated by apps, sites, devices and users has overwhelmed traditional Relational Database Management Systems (RDBMSes). In response, many companies have turned to big data or NoSQL solutions that are highly scalable on commodity hardware. However, these databases come at a big cost—they have very limited SQL support, often causing rewrites of existing apps or BI reports.
Built on the Hadoop stack, the Splice SQL Engine enables application developers to build hyper-personalized Web, mobile and social applications that truly scale while leveraging the ubiquity of SQL tools and skill sets in the marketplace. The Splice SQL Engine also scales to handle business intelligence and analysis, and works turnkey with tools like MicroStrategy and Tableau.
Zweben told eWEEK he considered integrating the Spark and Shark technologies into his solution, which is now in beta.
"The NoSQL community threw out the baby with the bath water," Zweben said. "They got it right with flexible schemas and distributed, auto-sharded architectures, but it was a mistake to discard SQL. The Splice SQL Engine enables companies to get the cost-effective scalability, flexibility and availability their big data, mobile and Web applications require—while capitalizing on the prevalence of the proven SQL tools and experts that are ubiquitous in the industry."
For his part, Ravi Chandran, CTO and co-founder of XtremeData, maker of XtremeData dbX, a massively scalable DBMS for big data warehouses, said that by providing an economical SQL solution that complements Hadoop, dbX accelerates the adoption of massively parallel processing (MPP) SQL solutions and makes it easier for organizations to roll out large-scale data environments and high-performance analytics.
As it runs on commodity hardware, dbX is less expensive than MPP options like Teradata, Exadata and Netezza, but is not as cost-effective as Hadoop.
"What is better depends on the application," said Blue Badge's Brust. "Hadoop runs on cheap commodity hardware and uses commodity storage. MPP databases tend to run on expensive appliances and use expensive enterprise storage. Hadoop can keep scaling as needed, whereas many appliance-based MPP solutions can only scale to what's inside the cabinet. But, SQL/relational—and therefore MPP—has much more ecosystem and skill set availability."