PostgreSQL 全文搜索
# 安装postgresql
sudo yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo yum update
sudo yum install -y postgresql15-server
# 初始化 postgresql
sudo /usr/pgsql-15/bin/postgresql-15-setup initdb
sudo systemctl enable postgresql-15
sudo systemctl start postgresql-15
# 检查服务
sudo systemctl status postgresql-15
术语 | 解释 |
---|---|
tsvector | 待检索的文档 |
tsquery | 用于检索的查询,包含了一系列的搜索词(search term)和操作符 |
# 如何使用
# 准备数据
-- 建表
create table t_docs (
id int primary key,
title varchar(256),
content varchar(4096)
);
insert into t_docs select 1, 'Learing PG Full Text Search', 'PG support Full Text search. Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.';
insert into t_docs select 2, 'Learing PG Index', 'Although indexes in PostgreSQL do not need maintenance or tuning, it is still important to check which indexes are actually used by the real-life query workload. Examining index usage for an individual query is done with the EXPLAIN command; its application for this purpose is illustrated in Section 14.1. It is also possible to gather overall statistics about index usage in a running server, as described in Section 28.2.';
insert into t_docs select 3, 'The SQL Language', 'This part describes the use of the SQL language in PostgreSQL. We start with describing the general syntax of SQL, then explain how to create the structures to hold data, how to populate the database, and how to query it. The middle part lists the available data types and functions for use in SQL commands. The rest treats several aspects that are important for tuning a database for optimal performance.';
insert into t_docs select 4, 'What Is PostgreSQL', 'PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. POSTGRES pioneered many concepts that only became available in some commercial database systems much later.';
insert into t_docs select 5, 'A Brief History of PostgreSQL', 'The object-relational database management system now known as PostgreSQL is derived from the POSTGRES package written at the University of California at Berkeley. With over two decades of development behind it, PostgreSQL is now the most advanced open-source database available anywhere.';
全文搜索可以不依赖索引,如:
-- 关键字 `open-source`
fence=# select id,title from t_docs where to_tsvector(content) @@ to_tsquery('open-source');
id | title
----+-------------------------------
5 | A Brief History of PostgreSQL
(1 row)
-- 关键字 `postgres`
fence=# select id,title from t_docs where to_tsvector(content) @@ to_tsquery('postgres');
id | title
----+-------------------------------
4 | What Is PostgreSQL
5 | A Brief History of PostgreSQL
(2 rows)
-- 关键字 `postgresql`
fence=# select id,title from t_docs where to_tsvector(content) @@ to_tsquery('postgresql');
id | title
----+-------------------------------
2 | Learing PG Index
3 | The SQL Language
4 | What Is PostgreSQL
5 | A Brief History of PostgreSQL
(4 rows)
-- 两个关键字,从多列中检索
-- 下例中如果不从title中检索,则 id 为 2 的记录不会返回
fence=# select id,title from t_docs where to_tsvector(content || ' ' || title) @@ to_tsquery('postgres | pg');
id | title
----+-------------------------------
1 | Learing PG Full Text Search
2 | Learing PG Index
4 | What Is PostgreSQL
5 | A Brief History of PostgreSQL
(4 rows)
由此可知,在postgresql的默认配置下:
- 全文检索时关键字的大小写不影响搜索结果
- 在默认词典中,
postgres
和postgresql
不是近义词,也没有派生关系
使用索引可以加速查询,全文搜索支持两种索引:
- GIN
- GIST
创建 GIN
索引
索引列必须指定config name(即下例中的
english
),如果不指定,则会使用 default_text_search_config (opens new window) 构建索引,若后续该配置被修改,则之前已经建立的索引将无法使用。
-- 复制 10000 行数据
fence=# insert into t_docs select t*5+d.id as id, d.title, d.content from generate_series(1, 2000) as t, t_docs d;
INSERT 0 10000
-- 创建索引
create index i_docs__content on t_docs using GIN (to_tsvector('english', content));
-- 检查,已经可以击中索引
fence=# explain select id,title from t_docs where to_tsvector('english', content) @@ to_tsquery('postgres');
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t_docs (cost=12.64..188.55 rows=50 width=520)
Recheck Cond: (to_tsvector('english'::regconfig, (content)::text) @@ to_tsquery('postgres'::text))
-> Bitmap Index Scan on i_docs__content (cost=0.00..12.63 rows=50 width=0)
Index Cond: (to_tsvector('english'::regconfig, (content)::text) @@ to_tsquery('postgres'::text))
(4 rows)
-- 试验可见,创建索引之后,速度明显提升, 从 723 毫秒提升到 1 毫秒
fence=# \timing
Timing is on.
fence=# select count(1) from t_docs where to_tsvector('english', content) @@ to_tsquery('postgres');
count
-------
4002
(1 row)
Time: 723.538 ms
fence=# create index i_docs__content on t_docs using GIN (to_tsvector('english', content));
CREATE INDEX
Time: 911.129 ms
fence=# select count(1) from t_docs where to_tsvector('english', content) @@ to_tsquery('postgres');
count
-------
4002
(1 row)
Time: 1.329 ms
索引可以建立在多个列上
fence=# create index i_docs__title_content on t_docs using GIN (to_tsvector('english', title || ' ' || content));
CREATE INDEX
我们还可以将 to_tsvector
的输出单独保存一列,保持其可以随着源数据保持更新。在其之上创建索引,以用于检索。
-- 新增索引列
fence=# alter table t_docs add column search_title_content_index_col tsvector generated always as (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''))) stored;
ALTER TABLE
-- 检查索引列数据
fence=# select search_title_content_index_col from t_docs limit 1;
sear
ch_title_content_index_col
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------
'applic':80 'capabl':20 'common':43 'consid':84 'contain':52 'depend':76 'document':26,51,101 'find':49 'flexibl':74 'frequenc':95 'full':3,8,11 'given':53 'identifi':22 'languag':25 'lear':1 'natur':24 'natural-languag':23 'notion':67 '
option':32 'order':60 'pg':2,6 'provid':18 'queri':30,40,54,66,69,85,97 'relev':37 'return':57 'satisfi':28 'search':5,10,13,17,46,83 'set':88 'similar':63,71,92 'simplest':82 'sort':34 'specif':79 'support':7 'term':55 'text':4,9,12,16 '
type':44 'word':90,98
(1 row)
-- 在索引列上创建索引
fence=# create index i_docs__title_content_index_col on t_docs using GIN(search_title_content_index_col);
CREATE INDEX
-- 检验是否击中索引
fence=# explain select count(1) from t_docs where search_title_content_index_col @@ to_tsquery('postgres');
QUERY PLAN
------------------------------------------------------------------------------------------------------
Aggregate (cost=195.90..195.91 rows=1 width=8)
-> Bitmap Heap Scan on t_docs (cost=12.64..195.77 rows=50 width=0)
Recheck Cond: (search_title_content_index_col @@ to_tsquery('postgres'::text))
-> Bitmap Index Scan on i_docs__title_content_index_col (cost=0.00..12.63 rows=50 width=0)
Index Cond: (search_title_content_index_col @@ to_tsquery('postgres'::text))
(5 rows)
-- 检索数据
fence=# select count(1) from t_docs where search_title_content_index_col @@ to_tsquery('postgres');
count
-------
4002
上次更新: 2023/12/08, 06:34:37