Document Management with Oracle Text - Page 2June 12, 2003 Oracle Text Architecture (Classes, Objects, Preferences, Attributes)An Oracle text index can be created on the following column types: CHAR, VARCHAR, VARCHAR2, LONG, LONG RAW, BLOB, CLOB or BFILE. Text index will store words and the documents in which these words occur. The Oracle Text engine uses special preference system for providing this service. Text indexing is managed in several stages called "classes". Defined classes on the Oracle 8.1.7.2 database are: SQL> select * from ctx_classes; CLA_NAME CLA_DESCRIPTION ------------------------------ -----------------------------DATASTORE Data store Class FILTER Filter Class SECTION_GROUP Section Group LEXER Lexer Class WORDLIST Word List Class STOPLIST Stop List Class STORAGE Storage Class INDEX_SET Index Set Datastore is a document data selector table from which the CTX engine reads column data and returns document data. Filter takes the document data from the datastore class and filters it to readable text. Sectioner takes a text format, as input, and generates two outputs (the section boundaries and plaintext). Lexer gets plaintext from the sectioner and splits it into words (discrete tokens). Wordlist, Stoplist, and storage classesSTOPLIST Class holds a list of stop words, which are filtered out during indexing WORDLIST Class holds fuzzy and steam expansion settings used at the query time STORAGE Class holds storage parameters for the underlying index tables and their indexes. For example, a text index KB_INDEX will consist of the following objects: DR$KB_INDEX$I - the tokens table, indexed tokens DR$KB_INDEX$K - the docid mapping table, where text keys occur DR$KB_INDEX$N - the negative row table, documents marked for deletion DR$KB_INDEX$R - the rowid mapping table DR$KB_INDEX$P - the substring index table Every class has a defined list of objects, which are only templates. A Class list on the Oracle 8.1.7.2 database version: SQL> select * from ctx_objects; OBJ_CLASS OBJ_NAME OBJ_DESCRIPTION --------------- ------------------- ----------------------------------------- DATASTORE DIRECT_DATASTORE Documents are stored in the column DATASTORE DETAIL_DATASTORE Documents are split into multiple lines DATASTORE FILE_DATASTORE Documents are stored in files,column is file name DATASTORE URL_DATASTORE Documents are web pages, column is URL DATASTORE USER_DATASTORE Documents are stored in the column DATASTORE NESTED_DATASTORE Documents are stored in a column in the nested table DATASTORE MULTI_COLUMN_DATASTORE Documents are stored in multiple columns FILTER NULL_FILTER Null filter FILTER USER_FILTER User-defined filter FILTER CHARSET_FILTER character set converting filter FILTER INSO_FILTER filter for binary document formats FILTER PROCEDURE_FILTER Procedure filter SECTION_GROUP NULL_SECTION_GROUP null section group SECTION_GROUP BASIC_SECTION_GROUP basic section group SECTION_GROUP HTML_SECTION_GROUP html section group SECTION_GROUP XML_SECTION_GROUP xml section group SECTION_GROUP NEWS_SECTION_GROUP news section group SECTION_GROUP AUTO_SECTION_GROUP auto section group LEXER BASIC_LEXER Lexer for alphabetic languages LEXER JAPANESE_VGRAM_LEXER V-gram lexer for Japanese LEXER KOREAN_LEXER Dictionary-based lexer for Korean LEXER CHINESE_VGRAM_LEXER V-GRAM lexer for Chinese LEXER CHINESE_LEXER Chinese lexer LEXER MULTI_LEXER Multi-language lexer WORDLIST BASIC_WORDLIST basic wordlist STOPLIST BASIC_STOPLIST basic stoplist STOPLIST MULTI_STOPLIST multi-language stoplist STORAGE BASIC_STORAGE text-index storage INDEX_SET BASIC_INDEX_SET basic index set Instead of directly using objects, we are creating a "preference" from one of the object templates, which is further customized by setting "attributes". When we create the Oracle Text index, the indexing engine will read the defined preference for the index and invoke the attribute for each indexed document. |