Hashing in Oracle

The internal workings of the
Oracle RDBMS are a marvel to behold. Think about all of the things you do know
with respect to what it takes to issue a select statement, perform an update,
or create a table. Latches, locks, serialization, caching, pinning, flushing, checkpointing,
sorting, space management, undo, redo, and the list goes on and on. Aside from
visible functions, there are a slew of transparent operations. Many of these
transparent operations, as the adjective implies, not only work behind the
scenes, but also stay there. There are also operations whose work is largely
hidden, but whose results are exposed to us. One such operation found in
several key areas of how Oracle works is hashing.

What is Hashing?

Hash isn’t a formal term,
but more of a description for what takes place: chop and mix. A hash function
returns a value for an input, and the output is generally shorter or more
compact than the input. For a relatively simple hash function, there is no
guarantee that inputs map one to one for outputs, that is, two different inputs
can have the same hashed output value. Add more sophistication to the hash
function, or add yet another hashing, and the probability of two distinct
inputs having the same hashed output can be greatly decreased.

Before getting more into
where hashing takes place in Oracle, let’s look at some examples of how hashing
is used. A good general overview of hashing and various types of hashing can be
found at a Mathematica
site
. The table of names takes the sum of the ASCII values of each letter,
sums them, and then lists the mod 13 remainder. The table is reproduced below
with an additional column to help illustrate what is taking place. I also
provided some code to help reproduce the results.

create table hash_example
(lastname varchar2(12),
  ascii_sum number,
  hashed_value number);
insert into hash_example (lastname) values ('Brecker');
insert into hash_example (lastname) values ('Corea');
insert into hash_example (lastname) values ('Davis');
insert into hash_example (lastname) values ('Hancock');
insert into hash_example (lastname) values ('Harris');
insert into hash_example (lastname) values ('Marsalis');
insert into hash_example (lastname) values ('Parker');
commit;

declare
  cursor c is
  select lower(lastname) lastname 
  from hash_example 
  for update;
  l_name hash_example.lastname%type;
  l_sum hash_example.ascii_sum%type := 0;
  l_tmp number := 0;
begin
  for r in c loop
    for i in 1..length(r.lastname) loop
      l_tmp := ascii(substr(r.lastname,i,1));
      l_sum := l_sum + l_tmp;
    end loop;
    update hash_example
    set ascii_sum = l_sum,
        hashed_value = mod(l_sum,13)
    where current of c;
    l_tmp := 0;
    l_sum := 0;
  end loop;
end;
/

PL/SQL procedure successfully completed.

SQL> select * from hash_example;

LASTNAME      ASCII_SUM HASHED_VALUE
------------ ---------- ------------
Brecker             734            6
Corea               522            2
Davis               535            2
Hancock             727           12
Harris              649           12
Marsalis            860            2
Parker              645            8

Overall, this was a pretty
simple example. Given that you can pass in a value and get an output based on
the hashing algorithm, and that the output is like a fingerprint of the
original data, how do you think this can applied to a real table? Suppose you
have a table that contains employee identification numbers along with salaries.
Aside from using an already built in hashing function (e.g., SHA-1), you could
devise a function to hash the combined employee ID and salary values. This
hashed value could be stored elsewhere, and then auditing of the table would
compare the current values and the stored values. Any (unknown) change in the
fingerprint would be cause for concern. This is just one safeguard, and no doubt,
you’d want to have some auditing (before the fact, not after!) on a sensitive
table like this.

Hashing in a Database

So far, we’ve seen two uses
of hashing that can be applied to a database. In the first example, a hashed
value of some key attribute can be used for indexing (the hashed value points
back to several values), and in the second example, hashing can be used as a
comparison tool. These two concepts lend themselves to extensive use within
Oracle.

How does Oracle know if a
SQL statement is new or already parsed? A parsed statement has a hash value
that is by and large unique enough (collisions, or same output for different
inputs can happen, but are rare). Shown below is an extract from a trace file,
and the lines show each statement’s identifying information.

1

PARSING IN CURSOR #1 len=46
dep=0 uid=5 hv=810835083 ad=’1d9cd494′

select ename from scott.emp
where empno = 7369

2

PARSING IN CURSOR #2 len=46
dep=0 uid=5 hv=1290101743 ad=’1d9cca00′

select ename from scott.emp
where empno = 7499

3

PARSING IN CURSOR #1 len=46
dep=0 uid=5 hv=1589217292 ad=’1d9cc7c4′

SELECT ENAME FROM
SCOTT.EMP WHERE 7499 = EMPNO

4

PARSING IN CURSOR #2 len=46
dep=0 uid=5 hv=810835083 ad=’1d9cd494′

select ename from scott.emp
where empno = 7369

Statements 1, 2 and 3 are
different, so each has its own hashed value (the hv value). Statements 1 and 4
are the same, and note both have the same hv value. One of the more desirable
characteristics of a hashing function is that it should be quite fast and
inexpensive to perform. If your statement has already been parsed (has a hash
ID value), then a comparison is all that is needed. As pointed out in numerous
other sources, hard parsing is undesirable. Re-use of statements can be
accomplished two ways. One is via the use of bind variables, and the other is
to embed commonly used statements in functions or procedures.

Where else is hashing used? Hash
partitioning is useful for data that doesn’t cleanly fall into categories such
as range or list. Composite partitioning such as range-hash can be used to
further categorized and distribute data. In these examples, the indexing
feature or benefit is being used. Since this is such a highly performant
arrangement of data, the use of hash clusters would be a natural extension. “Overview
of Hash Clusters
” in the Concepts guide provides a good description of how
useful clusters can be. Clustered data is easier to find, and for sorted data,
the benefits are even better.

Maybe not as obvious for
usage as in the past, given how much better the cost based optimizer is, but
hash joins are a frequently used means of accessing data. Incorporating a hint in
a statement can result in seeing the hashing step in an execution plan. Without
looking at execution plans, and by letting the optimizer do its thing, Oracle isn’t
going to “tell” you that a hash join was used.

Hidden Hashing

Now it’s time for a behind the scene use of hashing. In fact, this particular use of hashing is so under the radar that you would probably never even know it existed in Oracle unless you encountered one of several bugs. From note 4960265.9 we have, “In some cases when using a join filter in a RAC configuration (parallel hash-join) wrong results can be produced.” The stated fix for this is to set the undocumented parameter _BLOOM_FILTER_ENABLED to false. What is a Bloom filter?

There are lots of hashing techniques, and “General Purpose Hash Function Algorithms,” by Arash Partow, gives a decent summary without delving too deep into some esoteric computer science. The definition of a Bloom filter given in the paper states the following:

A Bloom filter allows for the state of existence of a very large set of possible type values to be repsented with a much smaller piece of memory. This is achieved through the use of multiple distinct hash functions and also by allowing the result of a query for the existence of a particular type to have a certain amount of error. This error can be controlled by varying the size of the table used for the Bloom filter and also by varying the number of hash functions.

So now, we know another technique Oracle uses to store data. In a sense, the data can be “over stored,” that is, we want to make really, really sure it exists in one place or another. We may be wrong about the place (and then go find an accurate location), but the main point is that it exists. With this type of hashing, it will be difficult to say we don’t have it anywhere, when in fact we do.

In Closing

Hashing is but one of many cleverly implemented functions embedded in Oracle. What would be interesting is being able to document all of the hash function implementation used in Oracle. For example, how does Oracle come up with 810835083 from “select ename from scott.emp where empno = 7369?” And why do hashed partitions work better (better = data spread more evenly) when the number of buckets is a power of two?

»


See All Articles by Columnist
Steve Callan

Steve Callan
Steve Callan
Steve is an Oracle DBA (OCP 8i and 9i)/developer working in Denver. His Oracle experience also includes Forms and Reports, Oracle9iAS and Oracle9iDS.

Latest Articles