What is Really Behind the Success of Data
Warehouse Appliances?
Over the past four years, data warehouse appliances have become a
disruptive force in the data warehousing market, increasingly
displacing systems built on traditional computing architectures. The
market is characterized by tremendous growth, projected to increase
at a compound annual growth rate (CAGR) of 70 percent through 2010,
and its success is borne out by large brand-name companies
worldwide, now numbering well over 100, who are implementing data
warehouse appliances as a key component of their strategic business
intelligence initiatives [1]. According to Gartner, data
warehouse appliances are projected for mainstream market adoption
within two to five years [2].
But what is behind all this growth? Why are data warehouse
appliances consistently able to outperform general-purpose systems
to uncover deeply buried customer and operational trends? What
accounts for their low total cost of ownership and ease of use in
the data center?
In this article, you will see why data warehouse appliances are a
proven long-term enterprise solution with advantages in performance,
cost, administrative simplicity, space, power and cooling - key
factors for enterprises that depend on high-performance analytics.
I'll show what's "under the hood" of a true data warehouse appliance
- how a design based on data streaming rather than general-purpose
architectures is ideal for data analytics. And you'll see why an
integrated appliance built with commodity components provides a
sustainable and ever-increasing performance advantage.
Meeting a Market Need
Data warehouse appliances emerged in response to a market need:
general-purpose processing architectures (those based on Intel
chips, for example) are not designed for the type of processing
required in data warehousing. As the data volumes, analytic demands
and number of users grow, query performance slows - not a good thing
for businesses that depend on fast, complex analytics. And because
of the limitations of the underlying architecture for data
warehousing, the systems are expensive to scale, complex to
administer and time-consuming to deploy.
General-purpose servers operate by reading data off disk,
bringing it across an I/O connection and loading it into memory for
processing. This approach has been used by generations of PCs and
general-purpose servers for standard computing applications -
whether processing an invoice or accessing a Web site - that are
characterized by operations on individual data elements. But the
method works poorly when shuttling huge blocks of data back and
forth across backplanes and I/O channels for analysis, with
bottlenecks in these shared resources adversely impacting
performance. Continuing advances in chip technology have not been
able to resolve this limitation of the underlying architecture. As a
result, data warehouse systems bound to general-purpose
architectures can't keep pace with today's BI demands.
Unlike general-purpose computing, analytic applications require
examining massive volumes of data as smoothly and efficiently as
possible. Data warehouse appliances were created specifically to
meet this need, with an architecture designed for quickly querying
terabytes of data and integrating components best suited for this
task.
What Constitutes a Data Warehouse Appliance?
Since the data warehouse appliance approach began receiving
widespread attention in 2003, a number of new entrants (from
industry behemoths to the smallest startups) have tried to stake
their claims to it. One way to recognize a true data warehouse
appliance is to see if it meets a few basic criteria - which, in
many ways, reflect the qualities of any good appliance:
Purpose-built for performance. A data warehouse appliance is
a fully integrated device built for a single purpose: to enable
real-time BI and analytics on terabytes of data. As such, it is not
bound to general-purpose computing architectures, but starts with a
clean slate to meet the challenge in the most effective and
efficient way.
Based on commodity components. The
architecture makes the best use of commodity hardware for powerful
and economic deployment. Off-the-shelf processors, storage and
other modules can be replaced as more powerful versions emerge,
allowing continuing increase in performance without being bound to a
particular vendor.
Simple to install and use. A true
data warehouse appliance requires no tuning, indexing, partitioning
or aggregations. Like any good household appliance, it's easy to
deploy and maintain, with installation in hours and the ability to
have a large data warehouse up and running in a day or so.
Low acquisition and ongoing costs.
Appliances are just less costly to own and maintain - even for a
large enterprise data warehouse implementation of 100 terabytes or
more.
Enterprise compatibility. A data
warehouse appliance uses standards-based interfaces and
plug-and-play integration with all major BI and data integration
vendors.
Low power, cooling and space consumption. A true appliance delivers
high performance in a compact footprint at a modest consumption of
electrical power. Heat generation is a fraction of conventional
architectures, eliminating the need for skip-a-row equipment
patterns to keep the system cool.
Under the Hood: Where Does the Performance Edge Come From?
Like many conventional large data warehouse systems, a data
warehouse appliance derives its processing power from a massively
parallel processing (MPP) array of nodes. The nodes are deployed in
a "shared nothing" configuration, which provides a very efficient
way of combining many nodes in a highly parallel environment. Unlike
traditional MPP solutions, however, where the
cost of each node and the added complexity of every additional node
prevent any high degree of parallelism-by-hardware, it is very
common for a data warehouse appliance to deploy dozens, hundreds or
even more query processing nodes in a single appliance package.
At the front end of an appliance, one or more Linux hosts are
responsible for managing and prioritizing the workload among the
nodes and aggregating the results. In addition to optimizing overall
query performance, the host gives the data warehouse appliance broad
enterprise compatibility, running a powerful MPP architecture within
a standard Linux box that is simple to integrate into a company's IT
infrastructure.
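The division of labor between host and nodes can be illustrated with a minimal sketch. It is not any vendor's implementation: the table contents, the function names (node_scan, host_query) and the use of threads to stand in for physical nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of a sales table held on
# one node's own disk (shared nothing: no node sees another node's slice).
NODE_SLICES = [
    [("east", 120), ("west", 80)],
    [("east", 50), ("north", 30)],
    [("west", 70), ("north", 10)],
]


def node_scan(rows, min_amount):
    """Runs on one node: filter and pre-aggregate only the local slice."""
    partial = {}
    for region, amount in rows:
        if amount >= min_amount:
            partial[region] = partial.get(region, 0) + amount
    return partial


def host_query(slices, min_amount):
    """Runs on the host: fan the query out, then merge the partial results."""
    # Threads stand in for independent nodes in this sketch.
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(lambda rows: node_scan(rows, min_amount), slices))
    totals = {}
    for partial in partials:
        for region, amount in partial.items():
            totals[region] = totals.get(region, 0) + amount
    return totals


if __name__ == "__main__":
    print(host_query(NODE_SLICES, 50))
```

Because each node returns only a small partial result, the host merges a few summary rows rather than the raw data, which is the point of pushing the work out to the nodes.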
The goal of a data warehouse appliance should be to eliminate the
traditional bottlenecks of business analytic systems - I/O, memory,
processing and network. While some data warehouse appliance
architectures may still separate storage from processing, the
optimal design for a high-performance MPP node in a data warehouse
appliance would have a very low ratio of disks to CPU and memory,
with a highly effective bandwidth transfer rate from the disks.
Ideally these would be configured with a 1:1 ratio (one disk drive
per CPU) in a direct-attached configuration to simplify data
movement.
In general, data warehouse appliances offer a much better
CPU-to-disk ratio, providing more processing power per unit of
user data than conventional solutions, and at a lower cost. One
potential cost-saving element in the MPP node is the use of
low-cost, commodity, embedded CPU technology instead of more
expensive high-end processors such as those used in blades and
conventional servers. These CPUs can make a lot of sense in the
purpose-built design of data warehouse appliances because there is
no need to run full operating systems or other applications on the
nodes and they are more suitable for the data streaming requirements
of the data warehouse versus general-purpose computing. These
devices also tend to use as little as one-twentieth of the power of
high-end CPUs, which allows for much denser, power-efficient
packaging.
An even more effective, high-performance design of a data
warehouse appliance may include a field-programmable gate array
(FPGA) in the parallel processing node for query performance
acceleration. In this architecture, each query-processing node
contains an FPGA together with the CPU, memory and direct-attached
storage device.
The approach can be seen as bringing the query to the data
- a recognition that a streaming architecture that moves
processing intelligence to a data stream as it is flowing off disk
produces results much faster than the opposite (and conventional)
approach of moving vast amounts of data across expensive I/O
interconnects into memory. It's a built-in performance advantage for
powering the complex queries at the heart of business analytics. A
common off-the-shelf device about the size of a thumbnail, the FPGA
filters and performs processing operations on data streaming through
the device at high speed, without interrupting the flow. In
addition, the performance gains of FPGAs are actually outpacing CPU
technology - where Moore's Law suggests a doubling of CPU
performance approximately every 18-24 months, FPGAs are progressing
much faster [3]. On the query-processing node, the FPGA can filter
more than 90 percent of the initial data as it streams off the disk,
greatly accelerating application performance over "brute force"
CPU-based processing.
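The idea of filtering data in flight can be sketched in software. This is only an analogy: a real FPGA does this work in hardware, and the table, column names and predicate below are hypothetical. Python generators are used so that records pass through the filter one at a time without being buffered.

```python
def read_from_disk(rows):
    """Stand-in for a disk: yields records one at a time as they stream off."""
    for row in rows:
        yield row


def stream_filter(stream, predicate, columns):
    """Stand-in for the filtering stage: drops non-matching rows and unneeded
    columns while the data is in flight, without buffering the stream."""
    for row in stream:
        if predicate(row):
            yield {name: row[name] for name in columns}


def run_query(rows):
    """Downstream CPU work only ever sees the rows that survive the filter."""
    survivors = stream_filter(
        read_from_disk(rows),
        predicate=lambda r: r["region"] == "east" and r["amount"] > 100,
        columns=("customer", "amount"),
    )
    return list(survivors)


# Hypothetical table: 1,000 wide rows, of which only a minority match.
TABLE = [
    {"customer": i, "region": "east" if i % 10 == 0 else "west",
     "amount": i, "notes": "x" * 20}
    for i in range(1000)
]

if __name__ == "__main__":
    result = run_query(TABLE)
    print(len(result), "of", len(TABLE), "rows reach the CPU stage")
```

In this made-up table most rows are discarded before any downstream processing, and the surviving rows carry only the two columns the query needs.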
This fully integrated architecture, built with inexpensive
commodity components, provides a dramatic performance advantage - 10
to 100 times faster than data warehousing systems based on
general-purpose architectures. The architecture accounts for the low
purchase price of the data warehouse appliance as well as its
administrative simplicity because there's no indexing, partitioning
or other traditional tuning required to tweak performance. It also
accounts for the low power and cooling requirements because
processors are not straining to handle overwhelming amounts of
data.
Can Other Approaches Catch Up?
How have other data warehouse vendors responded to this
disruptive force? While multipurpose servers continue to increase in
performance, their technology path for data warehousing remains
hindered by I/O and memory bottlenecks. Many vendors are trying to
apply the latest innovations in general-purpose computing - from
higher-speed, multi-core processors to faster interconnect
technologies - to squeeze greater query performance from the
underlying architecture. Other common approaches combine server
clusters with rack-mounted storage in the same cabinet or some other
"hybrid" blade approach combining CPUs with disk drives.
Whether these systems are marketed as appliances or not, they are
simply not designed to handle deep analysis of massive amounts of
data - and remain at an inherent disadvantage to a true data
warehouse appliance. Furthermore, by relying on multiple cores and
ever-faster clock rates as a "brute-force" answer, these attempted
solutions are further limited by growing power and cooling concerns
in the data center.
New processing technologies continue to emerge, but none have
been able to overcome these basic handicaps.
Staying Power for Unconstrained Analytics
Data warehouse appliances are rapidly growing their share of the
data warehousing systems market. The model of an integrated
appliance built with commodity components has also shown that it can
sustain its performance advantage, easily incorporating new
technology for continuous improvement to keep up with growing BI
demands. Since the first models were released, query performance has
already increased by orders of magnitude as new components for
streaming, processing and storage have come on the market. And
because the development of faster FPGAs is outpacing CPUs, the
performance gap between appliances and systems built with
general-purpose architectures appears to be widening.
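The reasoning behind a widening gap is simple compounding, sketched below. The 24-month CPU doubling period is the upper end of the range quoted earlier; the 12-month FPGA doubling period is purely a hypothetical value chosen for illustration, not a figure from this article or its references.

```python
def growth_factor(months, doubling_period_months):
    """Performance multiple after `months` if it doubles every period."""
    return 2 ** (months / doubling_period_months)


def relative_gap(months, cpu_period_months, fpga_period_months):
    """How far the faster-doubling technology has pulled ahead of the slower."""
    return (growth_factor(months, fpga_period_months)
            / growth_factor(months, cpu_period_months))


if __name__ == "__main__":
    # Assumed periods: CPU doubles every 24 months, FPGA every 12 (hypothetical).
    for months in (24, 48):
        print(months, relative_gap(months, cpu_period_months=24, fpga_period_months=12))
```

Under these assumed periods the relative advantage itself doubles every 24 months, so any fixed difference in doubling rates produces a gap that grows over time rather than staying constant.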
What really matters is the impact that a true data warehouse
appliance can have on an enterprise, allowing users to perform
unconstrained analytics on all their business data, even in
extremely busy mixed-workload environments. Companies can run
existing queries faster and more deeply, but even more importantly,
they can perform new, previously impossible analyses to drive
business growth. The impact goes even further: from changing the way
companies think about staffing their data warehouse to helping
mid-tier businesses solve critical BI needs that were previously out
of reach. Today, data warehouse appliances are fundamentally
changing the way people operate their businesses, allowing them to
fully leverage BI for competitive advantage - because now they
can.
References:
1. IDC Report: "Business Analytics Appliances Are Here to Stay."
June 2006.
2. Ted Friedman, et al. "Hype Cycle for Data Management, 2006."
Gartner Research, July 2006.
3. Aussie Schnore & Malachy Devlin. OpenFPGA BOF presentation
at SC05, GE Global Research & Nallatech, 16 Nov 2005.
Phil Francisco is the director of Product Marketing for Netezza.
He may be reached at pfrancisco@netezza.com.