Oracle Text Architecture (Classes, Objects, Preferences, Attributes)
An
Oracle text index can be created on the following column types:
CHAR,
VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB or BFILE.
Text
index will store words and the documents in which these words occur.
The
Oracle Text engine uses special preference system for providing this service.
Text
indexing is managed in several stages called "classes". Defined
classes on the Oracle 8.1.7.2 database are:
SQL> select * from ctx_classes;
CLA_NAME CLA_DESCRIPTION
------------------------------ -----------------------------DATASTORE Data store Class
FILTER Filter Class
SECTION_GROUP Section Group
LEXER Lexer Class
WORDLIST Word List Class
STOPLIST Stop List Class
STORAGE Storage Class
INDEX_SET Index Set
Datastore is a document data selector table from which the CTX
engine reads column data and returns document data.
Filter takes the document data from the datastore class and
filters it to readable text.
Sectioner takes a text format, as input, and generates two outputs
(the section boundaries and plaintext).
Lexer gets plaintext from the sectioner and splits it into
words (discrete tokens).
Wordlist, Stoplist, and storage classes
STOPLIST
Class holds a list of stop words, which are filtered out during indexing
WORDLIST Class holds fuzzy and steam expansion settings used at the query time
STORAGE
Class holds storage parameters for the underlying index tables and their
indexes.
For
example, a text index KB_INDEX will consist of the following objects:
DR$KB_INDEX$I
- the tokens table, indexed tokens
DR$KB_INDEX$K
- the docid mapping table, where text keys occur
DR$KB_INDEX$N - the negative row
table, documents marked for deletion
DR$KB_INDEX$R
- the rowid mapping table
DR$KB_INDEX$P
- the substring index table
Every class has a defined list of
objects, which are only templates.
A Class list on the Oracle 8.1.7.2
database version:
SQL> select * from ctx_objects;
OBJ_CLASS OBJ_NAME OBJ_DESCRIPTION
--------------- ------------------- -----------------------------------------
DATASTORE DIRECT_DATASTORE Documents are stored in the column
DATASTORE DETAIL_DATASTORE Documents are split into multiple lines
DATASTORE FILE_DATASTORE Documents are stored in files,column is file name
DATASTORE URL_DATASTORE Documents are web pages, column is URL
DATASTORE USER_DATASTORE Documents are stored in the column
DATASTORE NESTED_DATASTORE Documents are stored in a column in the nested table
DATASTORE MULTI_COLUMN_DATASTORE Documents are stored in multiple columns
FILTER NULL_FILTER Null filter
FILTER USER_FILTER User-defined filter
FILTER CHARSET_FILTER character set converting filter
FILTER INSO_FILTER filter for binary document formats
FILTER PROCEDURE_FILTER Procedure filter
SECTION_GROUP NULL_SECTION_GROUP null section group
SECTION_GROUP BASIC_SECTION_GROUP basic section group
SECTION_GROUP HTML_SECTION_GROUP html section group
SECTION_GROUP XML_SECTION_GROUP xml section group
SECTION_GROUP NEWS_SECTION_GROUP news section group
SECTION_GROUP AUTO_SECTION_GROUP auto section group
LEXER BASIC_LEXER Lexer for alphabetic languages
LEXER JAPANESE_VGRAM_LEXER V-gram lexer for Japanese
LEXER KOREAN_LEXER Dictionary-based lexer for Korean
LEXER CHINESE_VGRAM_LEXER V-GRAM lexer for Chinese
LEXER CHINESE_LEXER Chinese lexer
LEXER MULTI_LEXER Multi-language lexer
WORDLIST BASIC_WORDLIST basic wordlist
STOPLIST BASIC_STOPLIST basic stoplist
STOPLIST MULTI_STOPLIST multi-language stoplist
STORAGE BASIC_STORAGE text-index storage
INDEX_SET BASIC_INDEX_SET basic index set
Instead of directly using objects, we are
creating a "preference" from one of the object templates, which is further
customized by setting "attributes".
When we create the Oracle Text index, the
indexing engine will read the defined preference for
the index and invoke the attribute for each indexed document.