Hewlett Packard Enterprise this week rolled out a version of its Vertica analytical database system intended to improve Apache Kafka pipeline management, as well as Apache Spark and Hadoop integration.
The updates are part of HPE’s effort to adapt to a data management space that has seen major open source tool proliferation since Vertica first appeared.
Vertica could already access Hadoop data, but with Vertica 8.0 the analytical engine can work with Hadoop data in place, reducing data movement.
That is part of a general trend with such engines, according to IDC analyst Carl Olofson. Still, he cautioned that Hadoop is far from a replacement for analytical databases such as Vertica.
“This means you can expand the types of data that you query. But it doesn’t mean Hadoop takes over,” Olofson said. “It’s not an either-or situation.”
Instead, he continued, better links between Vertica and Hadoop show that the different data processing types can co-exist. High-performance querying capabilities of Vertica, he said, can in effect “reach into [Hadoop] data and bring valid result sets back to the database environment.”
The in-place processing update for Hadoop, along with new links to Apache Spark, is intended to enable Vertica to work alongside open source Hadoop and Spark tools. While less mature, the open source offerings are finding use for new types of analytics, especially ones dealing with massive amounts of web data.
To this end, HPE Vertica 8 supports faster data loading, visual monitoring of Apache Kafka data streams, and in-database machine learning libraries. The new Apache Spark connector is said to support faster data exchange between Vertica and Spark systems.
Also on tap is support for the Apache Parquet storage format that complements the ORC Hadoop data format support already in place. The Vertica enhancements were discussed at the company’s Big Data 2016 conference in Boston.
Analytical database field is crowded
Highly scalable analytical databases like Vertica arose during the past 10 years as an alternative to general-purpose relational database management systems for some types of data warehousing and analytics number crunching.
Based largely on fast column-store architectures powered by massively parallel processing, the early field also included Netezza, Greenplum, ParAccel and others in addition to Vertica. Collectively, they made a mark in data management by running queries more quickly than established databases and data warehouses, where many such jobs were taking too long. Large vendors quickly took notice and bought up technologies one by one — for example, IBM acquired Netezza, EMC purchased Greenplum and HP took over Vertica.
But with all the entries, the analytical database field is crowded and competitive — and sales as a whole haven’t lived up to the original optimistic expectations. That combination was enough to drive one vendor out of the market: Actian Corp. this week confirmed that it’s pulling the plug on its Actian Analytics Platform, which includes analytical database Actian Matrix, in order to focus on operational data management and data integration technologies. Actian Matrix was based on technology the company gained with its 2013 acquisition of analytical DBMS startup ParAccel.
Some of what could be called “the ParAccel torch” is carried forward in the increasingly popular Amazon Redshift cloud data warehouse, which is based in large part on ParAccel technology.
Looking at Spark
That an analytical database like Vertica often exists among other, diverse data technologies is shown in a quick inventory of Etsy Inc., an online marketplace for artisans. Rafe Colburn, Etsy’s director of engineering, lists Kafka, AWS, Scalding (for developing machine learning routines), Hadoop MapReduce and Parquet as just some of the software the company employs along with Vertica, not to mention that Etsy is “looking into Spark.”
Colburn said Etsy is on Version 7.1 of Vertica, and looking at 7.2 features. He added that Vertica is used for supporting internal dashboards and financial reporting, among other jobs, and has improved users’ ability to query Etsy customer activity over an earlier Postgres DB implementation.
Vertica 8.0’s support of Parquet is of interest, Colburn said, because his shop has begun to work with the Parquet format. “Parquet is the data format of the future for us,” he explained, while acknowledging that the future may hold still more data formats to support.
Vertica, he said, provided welcome horizontal scalability, wasn’t difficult to install in Etsy’s data center and has made it relatively easy to ingest data into the system. He said HPE’s engineering improvements to the Spark-Vertica connector showed promise in terms of performance.
Enter machine learning
SQL queries requiring high concurrency have been a sweet spot for analytical databases like Vertica. Where they may be challenged going forward, according to some analysts, is in the statistically oriented machine learning approaches now making headway among some big web companies.
Analytical RDBMSs were successful because initially they offered radical price-performance advantages over existing database and data warehouse alternatives in analytical SQL, according to Curt Monash, president of Monash Research.
“They scaled out well to many nodes, and a number either started out with columnar systems or added columnar capabilities early on,” he said. But as a result, the incumbents did cut prices and improve their capabilities for analytical SQL use cases, in Monash’s view.
In a recent blog post that ponders the future of the analytical RDBMS, Monash said the systems still excelled at key business intelligence jobs such as complex ad hoc queries and high-concurrency reporting and dashboards. But he also suggested that new types of advanced analytics, such as machine learning, may find a better home in Spark.