Oracle ConText Option Application Developer's Guide

CHAPTER 1. Text Concepts and Definitions


This chapter explains the fundamental concepts that underlie ConText Option text processing.

The following topics are covered in this chapter:

What Is a Text Application?

A text application is any program that locates, retrieves, analyzes, or otherwise manipulates non-structured alpha-numeric data. Two examples of typical text applications are:

Documents

In this manual, the terms documents and text are used interchangeably. However, text is a more general term referring to any collection of unstructured data stored in a database column or in an external system file.

The term document, however, has two specific and distinct meanings:

In this manual, the word document refers only to the second definition above.

Text Storage

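ConText Option supports two methods of text storage: internal storage, in which documents are stored in a text column inside the database, and external storage, in which documents are stored in operating system files that are referenced from a text column.

For more information about text storage in ConText Option, see Oracle ConText Option Administrator's Guide.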

Internal Storage

Documents stored inside the database reside in a text column. A text column can be any standard column that stores unstructured textual data within an Oracle database table.

Documents in a text column can consist of plain text (for example, ASCII) or formatted text (for example, Microsoft Word or WordPerfect). In addition, each document in a text column can be in a different format.

External Storage

Besides storing text in an Oracle database, ConText Option can process text that is stored in operating system files. ConText Option considers this an indirect data store, because the text column for the table contains a pointer to the external file rather than the actual text.

The pointer can be:

Querying, retrieval, and linguistic processing for external files are identical to the processing for documents stored internally. However, because external documents have no direct link back to the column in the database, a change made to a document is not recorded automatically in the table.

When a change to an external document requires the text index to be updated, ConText Option cannot notify the DML Queue automatically. Notification of the DML Queue can be accomplished through two methods:

Text Retrieval

The objective of a query is to identify documents that are most relevant to the user's needs by searching for text in the document collection and then retrieving those documents for the user.

This section discusses:

Search Options

There are several search options available for querying text, including:

Any of these options can be combined using logical operators. For example, you can use the AND operator to search for only those documents that contain both the term night and the term day.
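
The effect of combining terms with a logical operator can be pictured as a set operation on the documents that match each term. The following Python sketch is only an illustration of that idea, not ConText Option syntax; the sample documents and the function names docs_containing, query_and, and query_or are hypothetical.

```python
# Toy document collection; the keys stand in for textkeys (hypothetical data).
documents = {
    1: "night falls and the day ends",
    2: "a long night of work",
    3: "day after day",
}


def docs_containing(term):
    """Return the keys of all documents that contain the term as a whole word."""
    return {key for key, text in documents.items() if term in text.lower().split()}


def query_and(*terms):
    """AND: keep only the documents that contain every term."""
    result = set(documents)
    for term in terms:
        result &= docs_containing(term)
    return result


def query_or(*terms):
    """OR: keep the documents that contain at least one term."""
    result = set()
    for term in terms:
        result |= docs_containing(term)
    return result


print(sorted(query_and("night", "day")))  # [1]
print(sorted(query_or("night", "day")))   # [1, 2, 3]
```

Only document 1 contains both words, so it is the only hit for the AND combination.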

Text Queries

A text query is a means for encoding search criteria so that the text can be searched efficiently and relevant documents retrieved. Before you can execute a query on a text column, you must index the column.

For more information about creating text indexes for columns, see Oracle ConText Option Administrator's Guide.

To retrieve relevant documents, a text query must accomplish three tasks:

The first two tasks produce a list of documents that meet the search criteria with the corresponding score for each document. This list is called the hitlist. The third task returns to the user selected rows and columns of the text table for each document in the hitlist.

The three tasks required to retrieve documents can be accomplished using a two-step query, a one-step query, or an in-memory query. All three methods produce exactly the same results. You choose a method depending on the needs of the application.

In addition, ConText Option allows you to return the number of hits for a query in place of the actual hitlist. This can be useful for queries that produce very long hitlists.

Theme Queries

In addition to querying English-language documents by words or phrases (text query), you can query these documents by theme, or by their main concepts.

Theme queries work similarly to text queries in that you must create an index (a theme index) for the documents before you can query. Theme queries differ from text queries in that you need not provide the word patterns for the search. ConText Option interprets your query conceptually according to its view of the world and returns an appropriate document hitlist based on theme, along with a measure of how relevant each document is to the query.

You can use the standard query methods to perform theme queries, namely one-step, two-step, and in-memory. In a theme query, you can use most of the operators you use in regular text queries.

For more information about theme queries, see "Theme Queries (Chapter 4)."

For more information about creating theme indexes for columns, see Oracle ConText Option Administrator's Guide.

Query Methods

ConText Option supports three different methods for performing queries: two-step queries, one-step queries, and in-memory queries.

In addition, ConText Option provides a method for counting query hits without performing an actual query.

Two-step Queries

Two-step queries use a PL/SQL procedure in the first step to create a hitlist and store the results in a specified hitlist result table.

The second step uses a SELECT statement to select the results from the result table. In addition, the hitlist table can be joined with the original table to return more detailed document information. In the two-step method, the physical hitlist table is available to the application program.
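
The flow of a two-step query can be sketched with a small stand-in. The following Python example uses an in-memory SQLite database rather than Oracle, and the table names (articles, hitlist), the column names, and the function fill_hitlist are all hypothetical. It shows only the shape of the method: the application allocates a result table, a procedure fills it, and a SELECT statement joins it to the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        (1, "Tides", "the tide rises and the tide falls"),
        (2, "Harbors", "ships wait for the tide"),
        (3, "Deserts", "sand and wind"),
    ],
)

# The application allocates the hitlist result table before running the query.
conn.execute("CREATE TABLE hitlist (textkey INTEGER, score INTEGER)")


def fill_hitlist(conn, term):
    """Step 1: stand-in for the procedure that writes the hits to the result table."""
    conn.execute("DELETE FROM hitlist")
    for key, body in conn.execute("SELECT id, body FROM articles").fetchall():
        count = body.lower().split().count(term)
        if count:
            conn.execute("INSERT INTO hitlist VALUES (?, ?)", (key, count))


fill_hitlist(conn, "tide")

# Step 2: select from the result table, joined to the original table for details.
rows = conn.execute(
    "SELECT a.id, a.title, h.score "
    "FROM hitlist h JOIN articles a ON a.id = h.textkey "
    "ORDER BY h.score DESC"
).fetchall()
print(rows)  # [(1, 'Tides', 2), (2, 'Harbors', 1)]
```

Because the hitlist is an ordinary table in this sketch, the application can query it again, join it differently, or keep it for later use, which is the main practical difference from the one-step method.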

One-step Queries

In a one-step query, you create a single SQL statement that uses the ConText Option query functions to search for relevant documents and return a record set of selected rows and columns of the text table directly to the user.

The hitlist is processed by ConText Option using internal result tables. As a result, you do not have to create result tables before running a one-step query; however, the internal result tables are not available to the application program.

In-memory Queries

In-memory queries use a buffer and a CONTAINS cursor to the buffer to return query results, rather than the result tables used in two-step and one-step queries. As a result, in-memory queries are generally faster than two-step and one-step queries for shorter hitlists.

In an in-memory query, you open a cursor to the query buffer and run a query. ConText Option writes the results of the query to the buffer. You fetch the results, then close the cursor.

Results can be returned in order of their textkeys or sorted by score.
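
The open, fetch, and close sequence described above can be modeled in a few lines. The following Python sketch is a conceptual model only; the class HitCursor, its methods, and the sample hits are hypothetical and do not correspond to the ConText Option PL/SQL interface.

```python
class HitCursor:
    """Toy model of a cursor over an in-memory buffer of (textkey, score) hits."""

    def __init__(self, hits, sort_by_score=False):
        if sort_by_score:
            # Highest score first; ties are broken by textkey.
            self._buffer = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
        else:
            # Default order: by textkey.
            self._buffer = sorted(hits, key=lambda hit: hit[0])
        self._position = 0
        self._open = True

    def fetch(self):
        """Return the next hit, or None when the buffer is exhausted."""
        if not self._open:
            raise RuntimeError("cursor is closed")
        if self._position >= len(self._buffer):
            return None
        hit = self._buffer[self._position]
        self._position += 1
        return hit

    def close(self):
        """Release the buffer; further fetches are an error."""
        self._open = False
        self._buffer = []


def fetch_all(cursor):
    """Fetch hits one at a time until none remain, then close the cursor."""
    hits = []
    hit = cursor.fetch()
    while hit is not None:
        hits.append(hit)
        hit = cursor.fetch()
    cursor.close()
    return hits


raw_hits = [(30, 80), (10, 55), (20, 12)]
print(fetch_all(HitCursor(raw_hits)))                      # [(10, 55), (20, 12), (30, 80)]
print(fetch_all(HitCursor(raw_hits, sort_by_score=True)))  # [(30, 80), (10, 55), (20, 12)]
```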

Query Hits Counting

In addition to two-step, one-step, and in-memory queries, you can use the CTX_QUERY.COUNT_HITS function to return the number of hits for a query without generating scores for the hits or returning the textkeys for the documents. The documents can be stored in a local or remote database. Counting query hits is generally much faster than performing a full query and can be used to audit queries to ensure large and unmanageable hitlists are not returned.

Counting query hits can be performed in two modes: estimate and exact. The modes are based on the method ConText Option uses to record deleted documents in a text index.

In exact mode, hits are returned only for those documents that satisfy the conditions of the query expression and are currently in the text column of the table.

In estimate mode, hits may be included for documents that satisfy the query condition, but have been deleted from the text column or have been updated so that they no longer satisfy the query expression. This can occur when the text index for the column has not been optimized and the internal document IDs are still present in the index.

In general, the inaccuracy of the results returned by COUNT_HITS in estimate mode is proportional to the amount of DML that has been performed on a text column.

Note: If the index being queried has been optimized and no further DML has been performed on the text column, estimate mode will return accurate results.
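
The difference between the two modes can be illustrated with a toy index. The following Python sketch assumes, for illustration only, that deleted document IDs remain in the index until it is optimized and that a separate list of deleted IDs is kept; the class ToyIndex and its methods are hypothetical and do not describe the internal structure of a ConText Option text index.

```python
class ToyIndex:
    """Toy text index in which deleted document IDs linger until optimization."""

    def __init__(self):
        self.postings = {}    # term -> set of document IDs
        self.deleted = set()  # IDs deleted from the column but still in postings

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)

    def delete(self, doc_id):
        # The document is only marked as deleted; the index is not rewritten.
        self.deleted.add(doc_id)

    def optimize(self):
        # Optimization removes the lingering IDs from the index.
        for doc_ids in self.postings.values():
            doc_ids.difference_update(self.deleted)
        self.deleted.clear()

    def count_hits(self, term, exact):
        doc_ids = self.postings.get(term, set())
        if exact:
            return len(doc_ids - self.deleted)  # filter out deleted documents
        return len(doc_ids)                     # estimate: count whatever is indexed


index = ToyIndex()
index.add(1, "oracle text")
index.add(2, "oracle database")
index.add(3, "text index")
index.delete(2)

print(index.count_hits("oracle", exact=False))  # 2 (still counts deleted document 2)
print(index.count_hits("oracle", exact=True))   # 1

index.optimize()
print(index.count_hits("oracle", exact=False))  # 1 (accurate after optimization)
```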

For more information about text indexing, DML, and optimization, see Oracle ConText Option Administrator's Guide.

Query Expressions

Query expressions are made up of words and phrases (query terms) combined with operators and other special characters to produce search criteria. Operators specify the relative importance of the query terms, define relationships between those terms, control how the search is performed, and determine how much output is returned.

The most basic kind of query expression is a single word or phrase, which returns documents with a score based on the number of occurrences of that word or phrase. More complex expressions allow the user to weight certain terms, search for words that sound like each other, and find all of the words based on a particular root.

ConText Option provides a rich vocabulary of operators and special characters that can be used to create highly sophisticated query expressions that meet many complex user needs.

For more information about query expressions, see "Understanding Query Expressions (Chapter 3)."

Stored Query Expressions

A stored query expression (SQE) is a named query expression that has been stored in database tables along with the results of the query.

You can combine queries by referencing an SQE within the query expression of another query. Using an SQE in a query results in faster execution of the query because the results are already stored in the database.

Stored query expressions can also be used to perform interactive queries, in which an initial query is refined using one or more additional queries.

Process Model

The process model for using SQEs is:

Administration of SQEs can be performed using the REFRESH_SQE, REMOVE_SQE, and PURGE_SQE procedures in the CTX_QUERY PL/SQL package.

SQE Tables

Each SQE is stored in two tables: a central or system table owned by CTXSYS and a text index table attached to the policy for which the SQE was created.

The table owned by CTXSYS is an internal table which stores the SQE definitions for all the SQEs that have been created for all existing policies. It cannot be accessed directly, but can be viewed through two views, CTX_SQES (users with CTXADMIN role) and CTX_USER_SQES (users with CTXAPP and CTXADMIN roles).

The table used to store the results of an SQE for a text column (the SQR table) is part of the text index for the column and is created automatically by ConText Option during the initial text indexing of the column. However, the SQR table is populated only when an SQE is created and stored, and it is updated only when an SQE is re-evaluated.

The tablespace, storage clause, and other parameters used to create the SQR table are specified by the Engine preference in the policy for the text column of the SQE.

Note: Similar to the other ConText index tables, the SQR table is an internal table that is accessed only by ConText Option when an SQE is processed in a query.

For more information about policies, preferences, text indexing, and the structure of the SQE tables and views, see Oracle ConText Option Administrator's Guide.

Session and System SQEs

When you initially create an SQE, you can specify whether the SQE is for the current session or for all sessions (system SQE).

You can use session SQEs only in the current session. These SQEs are stored only for the duration of the session. When a session is terminated, all session SQEs created during the session are deleted from the SQE tables. If you want to use a session SQE in another session, you must recreate the SQE.

System SQEs can be used in all sessions, including concurrent sessions. When a session is terminated, system SQEs created during the session are not deleted from the SQE tables and can be used in future sessions.

Query Expressions in SQEs

SQEs support all of the query expression operators, with the following exceptions:

SQEs also support all of the special characters and other components that can be used in a query expression, including PL/SQL functions and other SQEs.

For example, an SQE could be created (SQE 1) that stores the results of a query for term A and term B. Then, a second SQE could be created (SQE 2) that stores the results of a query for SQE 1 and term C. Finally, SQE 2 could be called in a query to return all of the documents that contain term A, term B, and term C.
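
The nesting in this example can be sketched as follows. The Python code below is a conceptual model only: stored results are kept in a dictionary, and the sample documents and the function names term, store_sqe, and sqe are hypothetical. The words storm, warning, and coast play the roles of term A, term B, and term C.

```python
# Toy document collection keyed by textkey (hypothetical data).
documents = {
    1: "storm warning coast",
    2: "storm warning",
    3: "storm coast",
    4: "coast",
    5: "storm warning coast flood",
}

stored_results = {}  # SQE name -> stored set of textkeys


def term(word):
    """Return the textkeys of all documents that contain the word."""
    return {key for key, text in documents.items() if word in text.split()}


def store_sqe(name, result):
    """Store the results of a query under a name."""
    stored_results[name] = set(result)


def sqe(name):
    """Use the stored results of a named query inside another query."""
    return stored_results[name]


store_sqe("sqe1", term("storm") & term("warning"))  # term A AND term B
store_sqe("sqe2", sqe("sqe1") & term("coast"))      # SQE 1 AND term C

print(sorted(sqe("sqe1")))  # [1, 2, 5]
print(sorted(sqe("sqe2")))  # [1, 5]
```

Calling the second stored query returns exactly the documents that contain all three terms, without repeating the search for the first two.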

Re-evaluation of SQEs

If the text column referenced by an SQE has been modified since the SQE was created, the SQE results may be out-of-date. Before returning the results of an SQE in a query expression, ConText Option verifies that the results are current. If they are not current, ConText Option automatically evaluates the differences and updates the results.

ConText Option also verifies that any SQEs nested within an SQE have up-to-date results.
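
The check for current results can be pictured with a version counter. The following Python sketch is an assumption-laden illustration, not a description of how ConText Option tracks changes: the classes Column and StoredQuery are hypothetical, and the sketch re-runs the whole query when the results are stale rather than evaluating only the differences.

```python
class Column:
    """Toy text column that counts every modification."""

    def __init__(self):
        self.docs = {}
        self.version = 0

    def put(self, key, text):
        self.docs[key] = text
        self.version += 1


class StoredQuery:
    """Toy stored query that re-evaluates only when the column has changed."""

    def __init__(self, column, word):
        self.column = column
        self.word = word
        self.results = set()
        self.seen_version = None  # column version at the last evaluation
        self.evaluations = 0

    def current_results(self):
        if self.seen_version != self.column.version:
            # Results are out of date: re-evaluate and record the new version.
            self.results = {
                key for key, text in self.column.docs.items()
                if self.word in text.split()
            }
            self.seen_version = self.column.version
            self.evaluations += 1
        return self.results


column = Column()
column.put(1, "index tuning")
column.put(2, "query tuning")

stored = StoredQuery(column, "tuning")
print(sorted(stored.current_results()), stored.evaluations)  # [1, 2] 1
print(sorted(stored.current_results()), stored.evaluations)  # [1, 2] 1

column.put(3, "tuning guide")
print(sorted(stored.current_results()), stored.evaluations)  # [1, 2, 3] 2
```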

Note: ConText Option does not verify whether PL/SQL functions in SQEs have been updated. If a PL/SQL function in an SQE has been updated, the SQE must be manually re-evaluated.

Result lists in SQE tables may become fragmented by consecutive re-evaluations. You can resolve fragmentation by calling CTX_QUERY.REFRESH_SQE.

Hitlists

Whenever a query is executed, ConText Option generates a list of all the documents that meet the search criteria together with a score to indicate the relative importance of the document with regard to the search criteria. This is a hitlist.

In a two-step query, the hitlist is created explicitly and returned to the user as a result table that must have been allocated by the application program.

In a one-step query, the hitlist is generated and processed internally by ConText Option. The results of the query, including the generated scores, are returned to the user as a record set of selected documents; the hitlist is not available as a separate table.

In an in-memory query, the hitlist is stored in memory and is returned through a loop that fetches the individual hits from memory.

Scoring

Scoring is the method ConText Option uses to indicate which of the documents in the hitlist are most closely related to the user's needs based on the search criteria. The score is based on a numerical analysis of the occurrences of the query expression.

For example, a document that contains the search expression 10 times is considered more relevant than one that only contains the expression 5 times.

In basic queries, the score is calculated as the number of times a chosen search word appears in the document, and the score can be used to order the hitlist so that the highest scoring documents appear first. In more complex queries, the score is affected by various relationships between words and phrases; weights applied to various elements of the search expression also affect the score by giving more or less emphasis to the occurrence of those terms within the document.

Scores are generated by the general purpose text engine during queries (text or theme). The engine calculates a relevance score for each cell in the text column that meets the search criteria. The upper bound of the score value is 100, and each row meeting the criteria is assigned a score between 1 and 100.
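
For a basic single-word query, the scoring described above can be sketched as an occurrence count with an upper bound of 100. The following Python example is a simplified model under stated assumptions: capping the count with min is an assumption about how the upper bound is applied, and the function names basic_score and build_hitlist and the sample documents are hypothetical.

```python
def basic_score(text, term, upper_bound=100):
    """Score a document by occurrences of the term, capped at the upper bound."""
    count = text.lower().split().count(term.lower())
    return min(count, upper_bound)  # assumption: counts above the bound are capped


def build_hitlist(docs, term):
    """Return (textkey, score) pairs for matching documents, highest score first."""
    hits = [(key, basic_score(text, term)) for key, text in docs.items()]
    hits = [hit for hit in hits if hit[1] > 0]  # documents with no match are not hits
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


docs = {
    1: "cat " * 10,   # contains the term 10 times
    2: "cat " * 5,    # contains the term 5 times
    3: "dog",         # does not contain the term
    4: "cat " * 150,  # more occurrences than the upper bound
}

print(build_hitlist(docs, "cat"))  # [(4, 100), (1, 10), (2, 5)]
```

The document with 10 occurrences ranks above the one with 5, and the document with 150 occurrences receives the maximum score of 100.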

In two-step queries, the score is generated by the CONTAINS procedure and stored in a result table called the hitlist table.

In one-step queries, the score is generated internally by the CONTAINS function and returned by the SCORE function.

In in-memory queries, score is one of the output arguments specified when running the query and is returned when the hits are retrieved.

Result Tables

Result tables are storage areas used by ConText Option to store output from user queries. These tables are allocated by the application program or procedure and exist until they are released by the application.

Result tables are conceptually distinguished from normal database tables in that they have specific meaning only when applied to specific ConText Option functions, specifically in the following two situations:

Result tables are also used in one-step queries; however, the tables used in one-step queries are internal tables that are allocated by ConText Option and cannot be accessed from the application program.

You can create result tables using the SQL command CREATE TABLE or using functions provided in the CTX_QUERY PL/SQL package.

For more information about the structure of result tables, see "Result Tables (Chapter 12)".

For more information about CTX_QUERY, see "PL/SQL Packages (Chapter 11)".




Copyright © 1996 Oracle Corporation.
All Rights Reserved.