Hashing in Oracle
October 24, 2007
The internal workings of the Oracle RDBMS are a marvel to behold. Think about all of the things you do know with respect to what it takes to issue a select statement, perform an update, or create a table. Latches, locks, serialization, caching, pinning, flushing, checkpointing, sorting, space management, undo, redo, and the list goes on and on. Aside from visible functions, there are a slew of transparent operations. Many of these transparent operations, as the adjective implies, not only work behind the scenes, but also stay there. There are also operations whose work is largely hidden, but whose results are exposed to us. One such operation found in several key areas of how Oracle works is hashing.
What is Hashing?
Hash isnt a formal term, but more of a description for what takes place: chop and mix. A hash function returns a value for an input, and the output is generally shorter or more compact than the input. For a relatively simple hash function, there is no guarantee that inputs map one to one for outputs, that is, two different inputs can have the same hashed output value. Add more sophistication to the hash function, or add yet another hashing, and the probability of two distinct inputs having the same hashed output can be greatly decreased.
Before getting more into where hashing takes place in Oracle, lets look at some examples of how hashing is used. A good general overview of hashing and various types of hashing can be found at a Mathematica site. The table of names takes the sum of the ASCII values of each letter, sums them, and then lists the mod 13 remainder. The table is reproduced below with an additional column to help illustrate what is taking place. I also provided some code to help reproduce the results.
create table hash_example (lastname varchar2(12), ascii_sum number, hashed_value number); insert into hash_example (lastname) values ('Brecker'); insert into hash_example (lastname) values ('Corea'); insert into hash_example (lastname) values ('Davis'); insert into hash_example (lastname) values ('Hancock'); insert into hash_example (lastname) values ('Harris'); insert into hash_example (lastname) values ('Marsalis'); insert into hash_example (lastname) values ('Parker'); commit; declare cursor c is select lower(lastname) lastname from hash_example for update; l_name hash_example.lastname%type; l_sum hash_example.ascii_sum%type := 0; l_tmp number := 0; begin for r in c loop for i in 1..length(r.lastname) loop l_tmp := ascii(substr(r.lastname,i,1)); l_sum := l_sum + l_tmp; end loop; update hash_example set ascii_sum = l_sum, hashed_value = mod(l_sum,13) where current of c; l_tmp := 0; l_sum := 0; end loop; end; / PL/SQL procedure successfully completed. SQL> select * from hash_example; LASTNAME ASCII_SUM HASHED_VALUE ------------ ---------- ------------ Brecker 734 6 Corea 522 2 Davis 535 2 Hancock 727 12 Harris 649 12 Marsalis 860 2 Parker 645 8
Overall, this was a pretty simple example. Given that you can pass in a value and get an output based on the hashing algorithm, and that the output is like a fingerprint of the original data, how do you think this can applied to a real table? Suppose you have a table that contains employee identification numbers along with salaries. Aside from using an already built in hashing function (e.g., SHA-1), you could devise a function to hash the combined employee ID and salary values. This hashed value could be stored elsewhere, and then auditing of the table would compare the current values and the stored values. Any (unknown) change in the fingerprint would be cause for concern. This is just one safeguard, and no doubt, youd want to have some auditing (before the fact, not after!) on a sensitive table like this.
Hashing in a Database
So far, weve seen two uses of hashing that can be applied to a database. In the first example, a hashed value of some key attribute can be used for indexing (the hashed value points back to several values), and in the second example, hashing can be used as a comparison tool. These two concepts lend themselves to extensive use within Oracle.
How does Oracle know if a SQL statement is new or already parsed? A parsed statement has a hash value that is by and large unique enough (collisions, or same output for different inputs can happen, but are rare). Shown below is an extract from a trace file, and the lines show each statements identifying information.
Statements 1, 2 and 3 are different, so each has its own hashed value (the hv value). Statements 1 and 4 are the same, and note both have the same hv value. One of the more desirable characteristics of a hashing function is that it should be quite fast and inexpensive to perform. If your statement has already been parsed (has a hash ID value), then a comparison is all that is needed. As pointed out in numerous other sources, hard parsing is undesirable. Re-use of statements can be accomplished two ways. One is via the use of bind variables, and the other is to embed commonly used statements in functions or procedures.
Where else is hashing used? Hash partitioning is useful for data that doesnt cleanly fall into categories such as range or list. Composite partitioning such as range-hash can be used to further categorized and distribute data. In these examples, the indexing feature or benefit is being used. Since this is such a highly performant arrangement of data, the use of hash clusters would be a natural extension. Overview of Hash Clusters in the Concepts guide provides a good description of how useful clusters can be. Clustered data is easier to find, and for sorted data, the benefits are even better.
Maybe not as obvious for usage as in the past, given how much better the cost based optimizer is, but hash joins are a frequently used means of accessing data. Incorporating a hint in a statement can result in seeing the hashing step in an execution plan. Without looking at execution plans, and by letting the optimizer do its thing, Oracle isnt going to tell you that a hash join was used.
Now its time for a behind the scene use of hashing. In fact, this particular use of hashing is so under the radar that you would probably never even know it existed in Oracle unless you encountered one of several bugs. From note 4960265.9 we have, In some cases when using a join filter in a RAC configuration (parallel hash-join) wrong results can be produced. The stated fix for this is to set the undocumented parameter _BLOOM_FILTER_ENABLED to false. What is a Bloom filter?
There are lots of hashing techniques, and General Purpose Hash Function Algorithms, by Arash Partow, gives a decent summary without delving too deep into some esoteric computer science. The definition of a Bloom filter given in the paper states the following:
So now, we know another technique Oracle uses to store data. In a sense, the data can be over stored, that is, we want to make really, really sure it exists in one place or another. We may be wrong about the place (and then go find an accurate location), but the main point is that it exists. With this type of hashing, it will be difficult to say we dont have it anywhere, when in fact we do.
Hashing is but one of many cleverly implemented functions embedded in Oracle. What would be interesting is being able to document all of the hash function implementation used in Oracle. For example, how does Oracle come up with 810835083 from select ename from scott.emp where empno = 7369? And why do hashed partitions work better (better = data spread more evenly) when the number of buckets is a power of two?