Searching for text within a database can be like searching for the proverbial needle in a haystack. Users depend on the database administrator to solve complex text search requirements. Do you respond with clever methods such as LIKE, regular expressions and function-based indexes? Is there something more available to you and your users?
As a database user, searching for text within a database can be like searching for the proverbial needle in a haystack. As a DBA, you are more than likely going to be the primary resource in helping a user solve a complex text search requirement. Are you left with clever ways to use LIKE, regular expressions and function-based indexes, or is there something more available to you and your users?
In prior articles (Searching and Sorting Strings in Oracle, Sorting Strings in Oracle and Multilingual linguistic searching and sorting in Oracle), I covered linguistic searching and sorting, well, at least how Oracle does it anyway. The search for strings can be trivial, but at the same time, the importance of finding a particular string – or something close enough to it – can have extreme importance. We’ve all read or heard about the horror stories of airline travelers who are mistakenly placed on the no-fly list. On the other hand, and luckily enough so far, the consequences of a mistake the other way (someone getting on board who shouldn’t have been able to in the first place) haven’t been too severe. Of course, this doesn’t address the issue of when the information is available, but isn’t checked, or not checked often enough.
Typical searches for strings are against one column within one table in a schema. We also generally know what we are looking for, so if we’re not able to code an exact match in a WHERE clause, we can get lucky to some degree by using the LIKE condition or operator. In fact, when using the LIKE condition, did you know there is more than one LIKE-like operator? Oracle allows the use of LIKEC, LIKE2 and LIKE4, with their meanings described in the SQL Reference Guide. (10g version here).
What LIKE offers us is a near-match exact hit based on the operand (what we’re searching for). With the wildcard (%), we can find matches containing or including the base string. The example below is pretty clear-cut in terms of searching for a name containing the string AK.
SQL> select ename 2 from emp 3 where ename like '%AK%'; ENAME ---------- BLAKE
Just out of curiosity, is the following query equivalent?
SQL> select ename 2 from emp 3 where '%AK%' like ename;
The answer is no, because the operand for LIKE has to follow LIKE; it cannot come before.
SQL> select ename 2 from emp 3 where '%AK%' like ename; no rows selected
In these examples, we know where to look, and maybe aren’t quite sure what we’re looking for, but we’re pretty close (near match) to begin with. What happens when the search is based on strings where the parts are not “close enough” together, or where we are looking for keywords out of a larger set of data? This is where Oracle Text comes into play.
In Oracle8, Oracle Text appeared as a data cartridge and was named Oracle ConText. In 8i days, it was renamed to Oracle interMedia Text, and since Oracle9i and on, it has been know as Oracle Text. One improvement in this feature is that Oracle Text is now included with the RDBMS software installation and does not require any additional licensing or module-like plug-ins. Throughout its history, the schema associated with Oracle Text has been CTXSYS. One change is that CTXSYS’s objects, like many other utility types of accounts, are in the SYSAUX tablespace.
You’ve seen other SYS-like accounts during installation of Oracle, many of which have no use or relevance for most database-driven applications. If you need to expand text searching, including crawling a web page for keywords or examining an XML document, Oracle Text is exactly what you need. Understanding how and where Oracle Text fits into the big scheme of things requires a sampling of some white papers available at Oracle Technology Network (and yes, you’ll see references to older versions of Oracle even though Text appears under the 11g banner).
In Oracle’s documentation, Oracle Text is described as the following.
Oracle Text is a technology that enables you to build text query applications and document classification applications. Oracle Text provides indexing, word and theme searching, and viewing capabilities for text.
This description does a disservice to what Text can do, as a white paper description shows.
Oracle Text uses standard SQL to index, search, and analyze text and documents stored in the Oracle database, in files, and on the Web. Oracle Text can perform linguistic analysis on documents; search text using a variety of strategies including keyword searching, contextual queries, Boolean operations, pattern matching, mixed thematic queries, HTML/XML section searching, etc. Oracle Text excels at mixed queries, i.e. those that involve structured relational attributes as well as text.
Do you subscribe to newsletters or RSS feeds? How does the application know when to send you notification of an event? In a financial setting, you don’t want to be inundated with every finance-related news item, but rather only receive notifications of interest. If your interest happens to be Oracle and phrases containing “market share,” then a rule-based decision can be made to send you news when Oracle and market share appear. The architecture of such a system is shown below.
So, what are some differences between Oracle Text and “normal” SQL queries? Not a whole lot, really. As with many statements, the object type, which helps to speed up queries, is our friendly index. Just as in regular DML and queries, indexes help to speed up searches in Oracle Text. The type of index and how they are used/created is a bit different, but once you understand some of the terminology in Text, their meaning becomes clearer.
One major variation between normal indexes and the indexes used by Oracle Text is how or when an index is updated. In normal (regular SQL) DML, an index is updated upon insert into a table (obviously assuming the column of interest is indexed). In Oracle Text, this is not always the case, and it is up to you to perform a synchronization. Statement syntax is also another significant variation. The good news is that the output looks like the output from your run of the mill SELECT statement.
The four index types (generally domain indexes for that matter) and their query operators are:
- CONTEXT, using CONTAINS
- CTXCAT, using CATSEARCH
- CTXRULE, using MATCHES
- CTXPATH, using existsNode().
Some of the indexes also use parameters, and those will be examined in subsequent articles. The query operator names are somewhat intuitive in how they support the index type. In a context search (based on large coherent documents), we want to know if the text contains what we’re looking for. For many, but smaller documents (also of various types), we have to search through a catalog, so we perform a catalog search. In most contexts, we know that rules require a match, and one thing you can’t escape in XML is searching a path to see if something exists (at a node).
Overall, the idea of Oracle Text is simple, but some of its use and implementation can be complex, but like many other topics in Oracle, once you see some working examples, the complexity generally tends to go away.
In subsequent articles, we’ll cover the basic setup of using Oracle Text and examine examples shown in the documentation and some of the sample code available at OTN.