Identifying and Eliminating the Dreaded Cartesian Product
September 3, 2010
Cartesian Products usually don't provide useful information and often result in mistakes that can hurt your database developer career. Learn to spot Cartesian Joins and banish them from your SELECT queries forever.
I’ll never forget one of my first SQL assignments as a young junior developer in the Federal Government. I had to produce a report for one of our clients to help them generate some statistics for the minister’s office. When the report numbers varied from their own estimates by a wide margin, they called a meeting between with my boss and me. I had no idea where or that I even had gone wrong, until my boss blurted out “Robert, you created a Cartesian Product.” After she finished berating me in front of our clients, the meeting was adjourned and I went off to rewrite my query. I came up with what they were looking for eventually, but my boss never let me live it down, and brought up in every meeting thereafter: “Robert, remember the time that you created a Cartesian Product…”. I wound up transferring to a different department about six months later. Now, it’s true that my boss wasn’t an especially forgiving person, but mistakes like those can really hurt your career. Today, I’d like to share with you what I’ve learned about Cartesian Products over the years, so that you can spot them and banish them from your SELECT queries forever.
How to Generate a Cartesian Product
The following query extracts data from two tables without filtering of any sort. Omitting the WHERE clause can be useful in situations where you want to see all the rows in a table but wish to reorder or hide non-relevant columns:
SELECT name, gender, CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' FROM employees, shops;
The problem here is that the query selects from multiple tables. Without any explicit table joins, we wind up with a kind of default join called a Cartesian Join (or Cross Join). The Cross Join name refers to the fact that it joins every row of the first table to every row of the second table. In other words, Cartesian Joins represent the sum of the number of columns of the input tables plus the product of the number of rows of the input tables.
You can see in the results that every row in the first (employees) table is returned for every row in the second (shops) table. Since there are three rows in the shops table, the query produces three of each row from the employees table:
That’s a lot of rows for two little tables. The results grow exponentially where more rows and/or tables are involved. Because of the strain that such a query puts on system resources and that the resulting data set contains way too much information for the query writer to select what is interesting, Cartesian Joins are almost always performed by accident. As we learned in my own cautionary tale, it’s good to know how to spot one before your clients or supervisor comes back to you with questions as to why there are so many duplicate rows. Indeed, the presence of many duplicates, combined with an unusually large result set, is a telltale sign that you might have a Cartesian Product on your hands.
How Filtering Criteria Can Mask a Cartesian Product
SELECT statements which contain a WHERE clause can easily hide a Cartesian Product because not all rows will appear in duplicate. Here’s an innocent enough looking query in which someone forgot to include a table join:
SELECT name, gender, CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' FROM shops, employees WHERE shops.shop = 'Zurich';
Since the shop doesn’t appear in the result set, it would be easy to accept the output to be accurate. However, we can easily verify that only the first two employees in the list work in Zurich. The other three appear to have escaped our filtering criteria!
Outputting all of the columns shows more clearly what’s going on. The filter did in fact only return the Zurich shop. However, without a proper table join the query produces a record for every employee, whether or not they are linked to the Zurich shop. This makes sense when you consider that the employees are not associated with any shop without a join. Hence, the shop_id field in the employees table has nothing to do with the one in the shops one. What the query is saying is “Fetch me all of the rows from the shops table where the name match ‘Zurich’ and all of the rows from the employees table”:
Similarly, narrowing down the results from the employee table produces only rows from that table which match the criteria and all rows from the other table. Here’s a query that filters the employees by salary:
SELECT name, gender, CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' FROM shops, employees WHERE employees.salary > 5500;
It matches one row in the employees table, which is displayed once for each row of the shops table:
Again, including all of the columns confirms this assumption:
SELECT name, gender, CONCAT('$', FORMAT(salary, 2)) AS 'Monthly Salary' FROM shops, employees WHERE shops.shop = 'Zurich' OR employees.salary > 5500;
Now these results don’t immediately appear suspect because, depending on the criteria and which columns are displayed, duplicate values are not an uncommon occurrence:
The query is still displaying an employee for every row in the shops table, but this time it displays an employee for each row of the shops table as well as three rows for the one matching employee. In other words, all five employees are displayed for matching shops (the Zurich one) and the employee record whose income exceeds $5500 (Kristen Ruegg) is repeated for every row in the shops table.
As we saw today, Cartesian Products don’t tend to provide useful information. Therefore, the moral of the story is this: avoid Cartesian Joins at all costs unless you have a crystal clear reason for doing it.