Identifying and Eliminating the Dreaded Cartesian Product

September 3, 2010

Cartesian Products usually don't provide useful information and often result in mistakes that can hurt your database developer career. Learn to spot Cartesian Joins and banish them from your SELECT queries forever.

I’ll never forget one of my first SQL assignments as a young junior developer in the Federal Government. I had to produce a report for one of our clients to help them generate some statistics for the minister’s office. When the report numbers varied from their own estimates by a wide margin, they called a meeting between with my boss and me. I had no idea where or that I even had gone wrong, until my boss blurted out “Robert, you created a Cartesian Product.” After she finished berating me in front of our clients, the meeting was adjourned and I went off to rewrite my query. I came up with what they were looking for eventually, but my boss never let me live it down, and brought up in every meeting thereafter: “Robert, remember the time that you created a Cartesian Product…”. I wound up transferring to a different department about six months later. Now, it’s true that my boss wasn’t an especially forgiving person, but mistakes like those can really hurt your career. Today, I’d like to share with you what I’ve learned about Cartesian Products over the years, so that you can spot them and banish them from your SELECT queries forever.

How to Generate a Cartesian Product

The following query extracts data from two tables without filtering of any sort. Omitting the WHERE clause can be useful in situations where you want to see all the rows in a table but wish to reorder or hide non-relevant columns:

SELECT name, 
       gender, 
       CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' 
FROM   employees,
       shops; 

The problem here is that the query selects from multiple tables. Without any explicit table joins, we wind up with a kind of default join called a Cartesian Join (or Cross Join). The Cross Join name refers to the fact that it joins every row of the first table to every row of the second table. In other words, Cartesian Joins represent the sum of the number of columns of the input tables plus the product of the number of rows of the input tables.

You can see in the results that every row in the first (employees) table is returned for every row in the second (shops) table. Since there are three rows in the shops table, the query produces three of each row from the employees table:

name

gender

Monthly Salary

Jon Simpson

M

$4,500.00

Jon Simpson

M

$4,500.00

Jon Simpson

M

$4,500.00

Barbara Breitenmoser

F

(NULL)

Barbara Breitenmoser

F

(NULL)

Barbara Breitenmoser

F

(NULL)

Kirsten Ruegg

F

$5,600.00

Kirsten Ruegg

F

$5,600.00

Kirsten Ruegg

F

$5,600.00

Ralph Teller

M

$5,100.00

Ralph Teller

M

$5,100.00

Ralph Teller

M

$5,100.00

Peter Jonson

M

$5,200.00

Peter Jonson

M

$5,200.00

Peter Jonson

M

$5,200.00

That’s a lot of rows for two little tables. The results grow exponentially where more rows and/or tables are involved. Because of the strain that such a query puts on system resources and that the resulting data set contains way too much information for the query writer to select what is interesting, Cartesian Joins are almost always performed by accident. As we learned in my own cautionary tale, it’s good to know how to spot one before your clients or supervisor comes back to you with questions as to why there are so many duplicate rows. Indeed, the presence of many duplicates, combined with an unusually large result set, is a telltale sign that you might have a Cartesian Product on your hands.

How Filtering Criteria Can Mask a Cartesian Product

SELECT statements which contain a WHERE clause can easily hide a Cartesian Product because not all rows will appear in duplicate. Here’s an innocent enough looking query in which someone forgot to include a table join:

SELECT name, 
       gender, 
       CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' 
FROM   shops, 
       employees
WHERE  shops.shop = 'Zurich';

Since the shop doesn’t appear in the result set, it would be easy to accept the output to be accurate. However, we can easily verify that only the first two employees in the list work in Zurich. The other three appear to have escaped our filtering criteria!

name

gender

Monthly Salary

Jon Simpson

m

$4,500.00

Barbara Breitenmoser

f

(NULL)

Kirsten Ruegg

f

$5,600.00

Ralph Teller

m

$5,100.00

Peter Jonson

m

$5,200.00

Outputting all of the columns shows more clearly what’s going on. The filter did in fact only return the Zurich shop. However, without a proper table join the query produces a record for every employee, whether or not they are linked to the Zurich shop. This makes sense when you consider that the employees are not associated with any shop without a join. Hence, the shop_id field in the employees table has nothing to do with the one in the shops one. What the query is saying is “Fetch me all of the rows from the shops table where the name match ‘Zurich’ and all of the rows from the employees table”:

shop_id

shop

id

shop_id_1

gender

name

salary

1

Zurich

1

1

m

Jon Simpson

4500

1

Zurich

2

1

f

Barbara Breitenmoser

(NULL)

1

Zurich

3

2

f

Kirsten Ruegg

5600

1

Zurich

4

3

m

Ralph Teller

5100

1

Zurich

5

3

m

Peter Jonson

5200

Similarly, narrowing down the results from the employee table produces only rows from that table which match the criteria and all rows from the other table. Here’s a query that filters the employees by salary:

SELECT name, 
       gender, 
       CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' 
FROM   shops, 
       employees
WHERE  employees.salary > 5500;

It matches one row in the employees table, which is displayed once for each row of the shops table:

name

gender

Monthly Salary

Kirsten Ruegg

f

$5,600.00

Kirsten Ruegg

f

$5,600.00

Kirsten Ruegg

f

$5,600.00

Again, including all of the columns confirms this assumption:

id

shop_id

gender

name

salary

shop_id_1

shop

3

2

f

Kirsten Ruegg

5600

1

Zurich

3

2

f

Kirsten Ruegg

5600

2

New York

3

2

f

Kirsten Ruegg

5600

3

London

SELECT name, 
       gender, 
       CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' 
FROM   shops, 
       employees
WHERE  shops.shop = 'Zurich' 
OR     employees.salary > 5500;

Now these results don’t immediately appear suspect because, depending on the criteria and which columns are displayed, duplicate values are not an uncommon occurrence:

name

gender

Monthly Salary

Jon Simpson

M

$4,500.00

Barbara Breitenmoser

f

(NULL)

Kirsten Ruegg

f

$5,600.00

Kirsten Ruegg

f

$5,600.00

Kirsten Ruegg

f

$5,600.00

Ralph Teller

m

$5,100.00

Peter Jonson

m

$5,200.00

The query is still displaying an employee for every row in the shops table, but this time it displays an employee for each row of the shops table as well as three rows for the one matching employee. In other words, all five employees are displayed for matching shops (the Zurich one) and the employee record whose income exceeds $5500 (Kristen Ruegg) is repeated for every row in the shops table.

As we saw today, Cartesian Products don’t tend to provide useful information. Therefore, the moral of the story is this: avoid Cartesian Joins at all costs unless you have a crystal clear reason for doing it.

Related Articles

Getting the Right Data with SQL Joins

» See All Articles by Columnist Rob Gravelle








The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers