Full-Text Search Functions

Full-Text Search Functions
Prev	Chapter 13. Functions and Operators	Next

MATCH (col1,col2,...) AGAINST (expr [IN BOOLEAN MODE | WITH QUERY EXPANSION]): As of MySQL 3.23.23, MySQL has support for full-text indexing and searching. A full-text index in MySQL is an index of type FULLTEXT. FULLTEXT indexes are used with MyISAM tables only and can be created from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, it will be much faster to load your data into a table that has no FULLTEXT index, then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index could be significantly slower. Constraints on full-text searching are listed in the section called “Full-Text Restrictions”.

Full-text searching is performed with the MATCH() function.

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles (title,body) VALUES
    -> ('MySQL Tutorial','DBMS stands for DataBase ...'),
    -> ('How To Use MySQL Well','After you went through a ...'),
    -> ('Optimizing MySQL','In this tutorial we will show ...'),
    -> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> ('MySQL vs. YourSQL','In the following database comparison ...'),
    -> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

The MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For every row in the table, MATCH() returns a relevance value, that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

By default, the search is performed in case-insensitive fashion. In MySQL 4.1 and up, you can make a full-text search by using a binary collation for the indexed columns. For example, a column that has a character set of latin1 can be assigned a collation of latin1_bin to make it case sensitive for full-text searches.

When MATCH() is used in a WHERE clause, as in the preceding example, the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

For natural-language full-text searches, it is a requirement that the columns named in the MATCH() function be the same columns included in some FULLTEXT index in your table. For the preceding query, note that the columns named in the MATCH() function (title and body) are the same as those named in the definition of the article table's FULLTEXT index. If you wanted to search the title or body separately, you would need to create FULLTEXT indexes for each column.

It is also possible to perform a boolean search or a search with query expansion. These search types are described in the section called “Boolean Full-Text Searches” and the section called “Full-Text Searches with Query Expansion”.

The preceding example is a basic illustration showing how to use the MATCH() function where rows are returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the SELECT statement includes neither WHERE nor ORDER BY clauses:

mysql> SELECT id, MATCH (title,body) AGAINST ('Tutorial')
    -> FROM articles;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Tutorial') |
+----+-----------------------------------------+
|  1 |                        0.65545833110809 |
|  2 |                                       0 |
|  3 |                        0.66266459226608 |
|  4 |                                       0 |
|  5 |                                       0 |
|  6 |                                       0 |
+----+-----------------------------------------+
6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance values and it also sorts the rows in order of decreasing relevance. To achieve this result, you should specify MATCH() twice: once in the SELECT list and once in the WHERE clause. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH() calls are identical and invokes the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

MySQL uses a very simple parser to split text into words. A “word” is any sequence of characters consisting of letters, digits, ‘'’, or ‘_’. Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that will be found by full-text searches is four characters.
Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list.

The default minimum word length and stopword list can be changed as described in the section called “Fine-Tuning MySQL Full-Text Search”.

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. This way, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word “MySQL” is present in every row of the articles table, a search for the word produces no results:

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

The search result is empty because the word “MySQL” is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large datasets, this is the most desirable behavior—a natural language query should not return every second row from a 1GB table. For small datasets, it may be less desirable.

A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular dataset in which they occur. A given word may exceed the 50% threshold in one dataset but not another.

The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more.

Boolean Full-Text Searches

As of Version 4.0.1, MySQL can also perform boolean full-text searches using the IN BOOLEAN MODE modifier.

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+

This query retrieves all the rows that contain the word “MySQL” but that do not contain the word “YourSQL”.

Boolean full-text searches have these characteristics:

They do not use the 50% threshold.
They do not automatically sort rows in order of decreasing relevance. You can see this from the preceding query result: The row with the highest relevance is the one that contains “MySQL” twice, but it is listed last, not first.
They can work even without a FULLTEXT index, although this would be slow.
The minimum and maximum word length full-text parameters apply.
The stopword list applies.

The boolean full-text search capability supports the following operators:

+: A leading plus sign indicates that this word must be present in every row returned.
-: A leading minus sign indicates that this word must not be present in any row returned.
(no operator): By default (when neither + nor - is specified) the word is optional, but the rows that contain it will be rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.
> <: These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it. See the example below.
( ): Parentheses are used to group words into subexpressions. Parenthesized groups can be nested.
~: A leading tilde acts as a negation operator, causing the word's contribution to the row relevance to be negative. It's useful for marking noise words. A row that contains such a word will be rated lower than others, but will not be excluded altogether, as it would be with the - operator.
*: An asterisk is the truncation operator. Unlike the other operators, it should be appended to the word.
": A phrase that is enclosed within double quote (‘"’) characters matches only rows that contain the phrase literally, as it was typed.

The following examples demonstrate some search strings that use boolean full-text operators:

'apple banana': Find rows that contain at least one of the two words.
'+apple +juice': Find rows that contain both words.
'+apple macintosh': Find rows that contain the word “apple”, but rank rows higher if they also contain “macintosh”.
'+apple -macintosh': Find rows that contain the word “apple” but not “macintosh”.
'+apple +(>turnover <strudel)': Find rows that contain the words “apple” and “turnover”, or “apple” and “strudel” (in any order), but rank “apple turnover” higher than “apple strudel”.
'apple*': Find rows that contain words such as “apple”, “apples”, “applesauce”, or “applet”.
'"some words"': Find rows that contain the exact phrase “some words” (for example, rows that contain “some words of wisdom” but not “some noise words”). Note that the ‘"’ characters that surround the phrase are operator characters that delimit the phrase. They are not the quotes that surround the search string itself.

Full-Text Searches with Query Expansion

As of MySQL 4.1.1, full-text search supports query expansion (in particular, its variant “blind query expansion”). This is generally useful when a search phrase is too short, which often means that the user is relying on implied knowledge that the full-text search engine usually lacks. For example, a user searching for “database” may really mean that “MySQL”, “Oracle”, “DB2”, and “RDBMS” all are phrases that should match “databases” and should be returned, too. This is implied knowledge.

Blind query expansion (also known as automatic relevance feedback) is enabled by adding WITH QUERY EXPANSION following the search phrase. It works by performing the search twice, where the search phrase for the second search is the original search phrase concatenated with the few top found documents from the first search. Thus, if one of these documents contains the word “databases” and the word “MySQL”, the second search will find the documents that contain the word “MySQL” even if they do not contain the word “database”. The following example shows this difference:

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('database');
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' WITH QUERY EXPANSION);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  3 | Optimizing MySQL  | In this tutorial we will show ...        |
+----+-------------------+------------------------------------------+
3 rows in set (0.00 sec)

Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” will find only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion will find all books with the word “Maigret” on the second pass.

Note: Because blind query expansion tends to increase noise significantly by returning non-relevant documents, it's only meaningful to use when a search phrase is rather short.

Full-Text Restrictions

Full-text searches are supported for MyISAM tables only.
As of MySQL 4.1.1, full-text searches can be used with most multi-byte character sets. The exception is that for Unicode, the utf8 character set can be used, but not the ucs2 character set.
As of MySQL 4.1, the use of multiple character sets within a single table is supported. However, all columns in a FULLTEXT index must have the same character set and collation.
The MATCH() column list must exactly match the column list in some FULLTEXT index definition for the table, unless this MATCH() is IN BOOLEAN MODE.
The argument to AGAINST() must be a constant string.

Fine-Tuning MySQL Full-Text Search

The MySQL full-text search capability has few user-tunable parameters yet, although adding more is very high on the TODO. You can exert more control over full-text searching behavior if you have a MySQL source distribution because some changes require source code modfications. See the section called “MySQL Installation Using a Source Distribution”.

Note that full-text search was carefully tuned for the best searching effectiveness. Modifying the default behavior will, in most cases, make the search results worse. Do not alter the MySQL sources unless you know what you are doing!

Most full-text variables described in the following items must be set at server startup time. For these variables, a server restart is required to change them and you cannot modify them dynamically while the server is running.

Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing this are given at the end of this section.

The minimum and maximum length of words to be indexed is defined by the ft_min_word_len and ft_max_word_len system variables (available as of MySQL 4.0.0). See the section called “Server System Variables”. The default minimum value is four characters. The default maximum depends on your version of MySQL. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:
```
[mysqld]
ft_min_word_len=3
```
Then restart the server and rebuild your FULLTEXT indexes. Also note particularly the remarks regarding myisamchk in the instructions following this list.
To override the default stopword list, set the ft_stopword_file system variable (available as of MySQL 4.0.10). See the section called “Server System Variables”. The variable value should be the pathname of the file containing the stopword list, or the empty string to disable stopword filtering. After changing the value, rebuild your FULLTEXT indexes.
The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in myisam/ftdefs.h:
```
#define GWS_IN_USE GWS_PROB
```
Change the line to this:
```
#define GWS_IN_USE GWS_FREQ
```
Then recompile MySQL. There is no need to rebuild the indexes in this case. Note: By doing this you severely decrease MySQL's ability to provide adequate relevance values for the MATCH() function. If you really need to search for such common words, it would be better to search using IN BOOLEAN MODE instead, which does not observe the 50% threshold.
To change the operators used for boolean full-text searches, set the ft_boolean_syntax system variable (available as of MySQL 4.0.1). The variable also can be changed while the server is running, but you must have the SUPER privilege to do so. No index rebuilding is necessary. the section called “Server System Variables” describes the rules that define how to set this variable.

If you modify full-text variables that affect indexing (ft_min_word_len, ft_max_word_len, or ft_stopword_file), you must rebuild your FULLTEXT indexes after making the changes and restarting the server. To rebuild the indexes in this case, it's sufficient to do a QUICK repair operation:

mysql> REPAIR TABLE tbl_name QUICK;

With regard specifically to using the IN BOOLEAN MODE capability, if you upgrade from MySQL 3.23 to 4.0 or later, it's necessary to replace the index header as well. To do this, do a USE_FRM repair operation:

mysql> REPAIR TABLE tbl_name USE_FRM;

This is necessary because boolean full-text searches require a flag in the index header that was not present in MySQL 3.23, and that is not added if you do only a QUICK repair. If you attempt a boolean full-text search without rebuilding the indexes this way, the search will return incorrect results.

Note that if you use myisamchk to perform an operation that modifies table indexes (such as repair or analyze), the FULLTEXT indexes are rebuilt using the default full-text parameter values for minimum and maximum word length and the stopword file unless you specify otherwise. This can result in queries failing.

The problem occurs because these parameters are known only by the server. They are not stored in MyISAM index files. To avoid the problem if you have modified the minimum or maximum word length or the stopword file in the server, specify the same ft_min_word_len, ft_max_word_len, and ft_stopword_file values to myisamchk that you use for mysqld. For example, if you have set the minimum word length to 3, you can repair a table with myisamchk like this:

shell> myisamchk --recover --ft_min_word_len=3 tbl_name.MYI

To ensure that myisamchk and the server use the same values for full-text parameters, you can place each one in both the [mysqld] and [myisamchk] sections of an option file:

[mysqld]
ft_min_word_len=3

[myisamchk]
ft_min_word_len=3

An alternative to using myisamchk is to use the REPAIR TABLE, ANALYZE TABLE, OPTIMIZE TABLE, or ALTER TABLE. These statements are performed by the server, which knows the proper full-text parameter values to use.

Full-Text Search TODO

Improved performance for all FULLTEXT operations.
Proximity operators.
Support for “always-index words.” These could be any strings the user wants to treat as words, such as “C++”, “AS/400”, or “TCP/IP”.
Support for full-text search in MERGE tables.
Support for the ucs2 character set.
Make the stopword list dependent on the language of the dataset.
Stemming (dependent on the language of the dataset).
Generic user-suppliable UDF preparser.
Make the model more flexible (by adding some adjustable parameters to FULLTEXT in CREATE TABLE and ALTER TABLE statements).

Prev	Up	Next
Date and Time Functions	Home	Cast Functions and Operators