How is it done? In a number of ways. One of the more
promising strategies, Massively Parallel Processing (MPP),
involves breaking up a query so that multiple processors can run it
against multiple storage devices, then reassembling the responses to
produce an answer. An alternative is SMP (Symmetric
Multiprocessing), in which multiple processors juggle tasks
using caching techniques and a common pool of memory.
What's the benefit? Quicker access to information in a
world of huge databases. With MPP, adding processors improves access
time at a nearly linear rate: A 32-processor machine can query more
than 3 terabytes of data in about the same time that a single
processor could query 100 gigabytes. While the scalability and
performance of SMP systems keep improving, MPP architectures still
dominate very large data warehousing applications.
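The arithmetic behind that claim is easy to check. The sketch below assumes an idealized model in which every processor scans the same amount of data in the same time; the `efficiency` parameter is a hypothetical fudge factor for "nearly linear" scaling and does not come from any vendor's figures.

```python
# Idealized scaling model: each processor scans a fixed amount of data in a
# fixed time, so the data covered in that time grows with the processor count.
GB_PER_PROCESSOR = 100  # from the article: one processor queries 100 gigabytes


def data_scanned_gb(processors: int, efficiency: float = 1.0) -> float:
    """Gigabytes scanned in the time one processor needs for 100 GB.

    `efficiency` is an assumed factor (1.0 means perfectly linear scaling).
    """
    return processors * GB_PER_PROCESSOR * efficiency


ideal_tb = data_scanned_gb(32) / 1000              # perfectly linear
near_linear_tb = data_scanned_gb(32, 0.95) / 1000  # assumed 5% overhead

print(f"32 processors, ideal: {ideal_tb:.2f} TB")
print(f"32 processors, 95% efficient: {near_linear_tb:.2f} TB")
```

Under either assumption, 32 processors cover a little more than 3 terabytes, which matches the figure quoted above.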
Who invented it? In the data warehousing market, NCR's
Teradata unit has been MPP's biggest proponent. The largest Teradata
warehouses run on the company's own WorldMark server hardware and
its own version of Unix, using a database management system designed
specifically for the MPP environment.
Because supporting MPP requires tweaks to the database management
system, operating system and server hardware, many vendors have
preferred to push the limits of what they can achieve with SMP.
However, IBM is supporting MPP with its Regatta servers (RS/6000 SP)
and in its DB2 Extended Enterprise Edition.
In September, startup Netezza introduced its Netezza Performance
Server, a refrigerator-sized "data warehouse appliance" aimed at
providing MPP performance at a lower price by using open-source
software like Linux and the Postgres database. Netezza uses
specialized query-processing chips installed on each hard disk. Each
of these "snippet processors" scans the disk it is responsible for,
finds data matching the query parameters, and sends the results back
to the database responsible for assembling the answer. This cuts
down on the transmission of irrelevant data within the server
cabinet, minimizing performance bottlenecks and lessening the
workload on the central database.
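To see why filtering at the disk helps, compare how many rows cross the server's internal network under each approach. This is only an illustrative sketch: the function names, the row layout and the tiny data set are invented for the example and say nothing about how Netezza's hardware is actually programmed.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Predicate = Callable[[Row], bool]


def ship_everything(disks: List[List[Row]], predicate: Predicate) -> Tuple[List[Row], int]:
    """Baseline: every row travels to the central database, which then filters."""
    transferred = [row for disk in disks for row in disk]
    matches = [row for row in transferred if predicate(row)]
    return matches, len(transferred)


def filter_at_disk(disks: List[List[Row]], predicate: Predicate) -> Tuple[List[Row], int]:
    """Each disk's own processor filters locally and sends only matching rows."""
    matches: List[Row] = []
    transferred = 0
    for disk in disks:
        local_matches = [row for row in disk if predicate(row)]
        transferred += len(local_matches)  # only matches leave the disk
        matches.extend(local_matches)
    return matches, transferred


# Hypothetical data: 4 disks, 20 rows each, amounts 0, 1000, ..., 19000.
disks = [[{"disk": d, "amount": a} for a in range(0, 20_000, 1_000)] for d in range(4)]


def big_purchase(row: Row) -> bool:
    return row["amount"] > 10_000


central_matches, central_rows_moved = ship_everything(disks, big_purchase)
local_matches, local_rows_moved = filter_at_disk(disks, big_purchase)

print(central_rows_moved, local_rows_moved)  # rows moved under each approach
```

Both approaches return the same answer; the difference is that the second moves only the matching rows, which is the bottleneck the appliance design is meant to avoid.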
Who's using it? Teradata has a blue-chip customer base,
including Wal-Mart in retailing and Whirlpool in manufacturing.
Lloyd's of London is using IBM's MPP solution to analyze claims and
other insurance data.
Netezza has captured a handful of early customers. Vibrant
Solutions, which works with companies such as Nextel on call-data
analysis, says it will be able to support much more data, with
faster query response, by employing Netezza's technology. "It's very
similar to a lot of the other massively parallel architectures that
have been around for a while, but they brought the price into a
reasonable window," says Vibrant CTO Rick Mahuson.
What are the drawbacks? MPP systems tend to cost more,
both in purchase price and in ongoing administration. Teradata says the
long-term cost of ownership is favorable, however, particularly when
scattered data marts (departmental data warehouses) are consolidated
into a central, company-wide data warehouse.
Netezza is trying to change the price equation (at $2.5 million,
even its 18-terabyte server is a fraction of the cost of comparable
MPP systems) and claims its appliance will run with minimal
administration. "Netezza's product shows great promise," says Giga
Information Group analyst Philip Russom, but he suspects many
enterprise customers will be scared of entrusting multi-terabyte
applications to open-source technology.
REFERENCE: ONE QUERY, MANY PATHS
Even with today's
superfast machines, it can take days to generate a report from a
multi-terabyte warehouse. Here's how using Massively Parallel
Processing can speed up the task.
1. A 3-terabyte data warehouse receives a request for a list of
all customer purchases that were greater than $10,000.
2. It passes on the query to 10 "nodes." Each node has its own
processors and also controls one or more storage devices. Each
storage device, in turn, contains a subset of the 3-terabyte
warehouse. In this example, each node queries one storage device
that holds 100,000 records.
3. Each node sends back its list of matching purchases. The data
warehouse consolidates the responses into a single result that took
hours instead of days to build.
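The three steps above can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation: the record layout, the round-robin partitioning and the 1,000-record data set are assumptions made for illustration, and threads stand in for what would be separate processors and disks in a real MPP system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Record = Dict[str, int]

NODE_COUNT = 10     # the sidebar's 10 nodes
THRESHOLD = 10_000  # purchases greater than $10,000


def build_partitions(records: List[Record], node_count: int) -> List[List[Record]]:
    """Spread the warehouse across nodes (round-robin is an assumed scheme)."""
    return [records[i::node_count] for i in range(node_count)]


def query_node(partition: List[Record]) -> List[Record]:
    """Step 2: one node scans only the subset of data it controls."""
    return [r for r in partition if r["amount"] > THRESHOLD]


def run_query(partitions: List[List[Record]]) -> List[Record]:
    """Steps 2 and 3: fan the query out to every node, then consolidate."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(pool.map(query_node, partitions))
    return [r for part in partial_results for r in part]


# Hypothetical, scaled-down warehouse with deterministic purchase amounts.
records = [{"customer": i, "amount": (i * 37) % 20_000} for i in range(1_000)]

partitions = build_partitions(records, NODE_COUNT)
result = run_query(partitions)
print(f"{len(result)} purchases over ${THRESHOLD:,} found across {NODE_COUNT} nodes")
```

Because Python threads share a single interpreter, this shows the shape of the fan-out and consolidation rather than the speedup; the performance gain in a real system comes from each node having its own processors and storage.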
Background Reading
Not convinced you need to process in parallel? See Sun
Microsystems' white paper on the advantages of a symmetric
architecture.