Document Management with Oracle Text - Page 2

June 12, 2003

Oracle Text Architecture (Classes, Objects, Preferences, Attributes)

An Oracle text index can be created on the following column types:

CHAR, VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB or BFILE.

Text index will store words and the documents in which these words occur.

The Oracle Text engine uses special preference system for providing this service.


click for larger image

Text indexing is managed in several stages called "classes". Defined classes on the Oracle 8.1.7.2 database are:

SQL> select * from ctx_classes;
 
CLA_NAME                       CLA_DESCRIPTION
------------------------------ -----------------------------DATASTORE    Data store Class
FILTER                          Filter Class
SECTION_GROUP                   Section Group
LEXER                           Lexer Class
WORDLIST                        Word List Class
STOPLIST                        Stop List Class
STORAGE                         Storage Class
INDEX_SET                       Index Set

Datastore is a document data selector table from which the CTX engine reads column data and returns document data.

Filter takes the document data from the datastore class and filters it to readable text.

Sectioner takes a text format, as input, and generates two outputs (the section boundaries and plaintext).

Lexer gets plaintext from the sectioner and splits it into words (discrete tokens).

Wordlist, Stoplist, and storage classes

STOPLIST Class holds a list of stop words, which are filtered out during indexing WORDLIST Class holds fuzzy and steam expansion settings used at the query time

STORAGE Class holds storage parameters for the underlying index tables and their indexes.

For example, a text index KB_INDEX will consist of the following objects:

DR$KB_INDEX$I - the tokens table, indexed tokens

DR$KB_INDEX$K - the docid mapping table, where text keys occur

DR$KB_INDEX$N - the negative row table, documents marked for deletion

DR$KB_INDEX$R - the rowid mapping table

DR$KB_INDEX$P - the substring index table

Every class has a defined list of objects, which are only templates.

A Class list on the Oracle 8.1.7.2 database version:

SQL> select * from ctx_objects;
 
OBJ_CLASS           OBJ_NAME             OBJ_DESCRIPTION
---------------   ------------------- -----------------------------------------
DATASTORE         DIRECT_DATASTORE      Documents are stored in the column
DATASTORE         DETAIL_DATASTORE      Documents are split into multiple lines
DATASTORE         FILE_DATASTORE        Documents are stored in files,column is file name
DATASTORE         URL_DATASTORE         Documents are web pages, column is URL
DATASTORE         USER_DATASTORE        Documents are stored in the column
DATASTORE         NESTED_DATASTORE      Documents are stored in a column in the nested table
DATASTORE     MULTI_COLUMN_DATASTORE    Documents are stored in multiple columns
FILTER            NULL_FILTER           Null filter
FILTER            USER_FILTER           User-defined filter
FILTER            CHARSET_FILTER        character set converting filter
FILTER            INSO_FILTER           filter for binary document formats
FILTER            PROCEDURE_FILTER      Procedure filter
SECTION_GROUP     NULL_SECTION_GROUP    null section group
SECTION_GROUP     BASIC_SECTION_GROUP   basic section group
SECTION_GROUP     HTML_SECTION_GROUP    html section group
SECTION_GROUP     XML_SECTION_GROUP     xml section group
SECTION_GROUP     NEWS_SECTION_GROUP    news section group
SECTION_GROUP     AUTO_SECTION_GROUP    auto section group
LEXER             BASIC_LEXER           Lexer for alphabetic languages
LEXER             JAPANESE_VGRAM_LEXER  V-gram lexer for Japanese
LEXER             KOREAN_LEXER          Dictionary-based lexer for Korean
LEXER             CHINESE_VGRAM_LEXER   V-GRAM lexer for Chinese
LEXER             CHINESE_LEXER         Chinese lexer
LEXER             MULTI_LEXER           Multi-language lexer
WORDLIST          BASIC_WORDLIST        basic wordlist
STOPLIST          BASIC_STOPLIST        basic stoplist
STOPLIST          MULTI_STOPLIST        multi-language stoplist
STORAGE           BASIC_STORAGE         text-index storage
INDEX_SET         BASIC_INDEX_SET       basic index set

Instead of directly using objects, we are creating a "preference" from one of the object templates, which is further customized by setting "attributes".

When we create the Oracle Text index, the indexing engine will read the defined preference for the index and invoke the attribute for each indexed document.








The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers