This is an English translation of a Japanese blog. Some content may not be fully translated.
PostgreSQL

Trying PostgreSQL's Full-Text Search Engine pg_trgm

For cases where you want to perform “partial match search” using an index in PostgreSQL. With regular B-Tree indexes, partial match searches cannot use the index. By using pg_trgm, PostgreSQL’s full-text search engine, partial match searches can be performed easily. There is also pg_bigm and PGroonga, but this time I will try pg_trgm, which is included in contrib.

Version

postgres=# select version();
                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 10.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

Enabling pg_trgm

postgres=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
postgres=# \dx
                                    List of installed extensions
  Name   | Version |   Schema   |                            Description
---------+---------+------------+-------------------------------------------------------------------
 pg_trgm | 1.3     | public     | text similarity measurement and index searching based on trigrams
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)
Create some test data (100 million text records)
CREATE TABLE t1 AS
SELECT to_char(num,'FM0000000000000000') textdata
FROM   generate_series(1,10000000) num
;
Insert target text data for searching
insert into t1 values('test1test2test3');
insert into t1 values('メロスは激怒した。必ず、かの 邪智暴虐 ( じゃちぼうぎゃく ) の王を除かなければならぬと決意した。');

At this point, 100,000,002 records are stored in the table.

select count(*) from t1;

postgres=# select count(*) from t1;
   count
-----------
 100000002
(1 row)
Partial match search for “test2”
explain analyze select * from t1 where textdata like '%test2%';
EXPLAIN ANALYZE Results

The search took about Execution time: 10061.160 ms, approximately 10 seconds. It’s slow because it performs a full table scan.

postgres=# explain analyze select * from t1 where textdata like '%test2%';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1159776.58 rows=10000 width=17) (actual time=10059.633..10061.097 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..1157776.58 rows=4167 width=17) (actual time=10053.097..10053.098 rows=0 loops=3)
         Filter: (textdata ~~ '%test2%'::text)
         Rows Removed by Filter: 33333334
 Planning time: 0.938 ms
 Execution time: 10061.160 ms
(8 rows)
Partial match search for “邪智暴虐”
explain analyze select * from t1 where textdata like '%邪智暴虐%';
EXPLAIN ANALYZE Results

The search took about Execution time: 9982.465 ms, again approximately 10 seconds.

postgres=# explain analyze select * from t1 where textdata like '%邪智暴虐%';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1159776.58 rows=10000 width=17) (actual time=9982.347..9982.403 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..1157776.58 rows=4167 width=17) (actual time=9977.282..9977.282 rows=0 loops=3)
         Filter: (textdata ~~ '%邪智暴虐%'::text)
         Rows Removed by Filter: 33333334
 Planning time: 0.052 ms
 Execution time: 9982.465 ms
(8 rows)

Creating a full-text search index:

create index pg_trgm_idx on t1 USING gin(textdata gin_trgm_ops);

It took about 300 seconds to complete.

[2020-03-16 21:27:36 JST]postgres postgres 4539[53] LOG:  statement: create index pg_trgm_idx on t1 USING gin(textdata gin_trgm_ops);
[2020-03-16 21:32:39 JST]postgres postgres 4539[54] LOG:  duration: 303535.309 ms

The table size is about 5217.8MB = approximately 5GB. The index size is about 1,461MB = approximately 1GB. I expected the index size to be larger than the table size for a full-text search index, but that may be due to the test data.

SELECT
    objectname,
    TO_CHAR(pg_relation_size(objectname::regclass),'999,999,999,999') AS bytes
FROM
    (
    SELECT
        tablename AS objectname
    FROM
        pg_tables
    WHERE
        schemaname = 'public'
    UNION
    SELECT
        indexname AS objectname
    FROM
        pg_indexes
    WHERE
        schemaname = 'public'
    ) AS objectlist
ORDER BY
    bytes DESC;

 objectname  |      bytes
-------------+------------------
 t1          |    5,217,837,056
 pg_trgm_idx |    1,461,542,912
(2 rows)
Partial match search for “test2”
explain analyze select * from t1 where textdata like '%test2%';
EXPLAIN ANALYZE Results

The search that took Execution time: 10061.160 ms now completes in 0.090 ms.

postgres=# explain analyze select * from t1 where textdata like '%test2%';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=285.50..36386.84 rows=10000 width=17) (actual time=0.021..0.022 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%test2%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..283.00 rows=10000 width=0) (actual time=0.016..0.016 rows=1 loops=1)
         Index Cond: (textdata ~~ '%test2%'::text)
 Planning time: 1.148 ms
 Execution time: 0.090 ms
(7 rows)
Partial match search for “邪智暴虐”
explain analyze select * from t1 where textdata like '%邪智暴虐%';
EXPLAIN ANALYZE Results

The search that took Execution time: 9982.465 ms now completes in 0.044 ms.

postgres=# explain analyze select * from t1 where textdata like '%邪智暴虐%';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=217.50..36318.84 rows=10000 width=17) (actual time=0.022..0.022 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%邪智暴虐%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..215.00 rows=10000 width=0) (actual time=0.017..0.017 rows=1 loops=1)
         Index Cond: (textdata ~~ '%邪智暴虐%'::text)
 Planning time: 0.107 ms
 Execution time: 0.044 ms
(7 rows)

Note that due to the nature of trigrams (tri-gram), searches with 2 or fewer characters will be extremely slow, so be careful.

postgres=# explain analyze select * from t1 where textdata like '%邪智%';
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=758005.50..794106.84 rows=10000 width=17) (actual time=169524.760..169524.761 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%邪智%'::text)
   Rows Removed by Index Recheck: 100000001
   Heap Blocks: exact=636943
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..758003.00 rows=10000 width=0) (actual time=26370.882..26370.882 rows=100000002 loops=1)
         Index Cond: (textdata ~~ '%邪智%'::text)
 Planning time: 0.261 ms
 Execution time: 169524.849 ms
(8 rows)

postgres=#

About Recheck Cond

The following article explains this very clearly.

Explanation of the Recheck process in Bitmap Heap Scan after Bitmap Index Scan - Pat Team Leader’s Diary https://taityo-diary.hatenablog.jp/entry/2018/07/07/071928

postgresql - What does “Recheck Cond” in Explain result mean? - Stack Overflow https://stackoverflow.com/questions/50959814/what-does-recheck-cond-in-explain-result-mean/50961326#50961326

Addendum

When I changed the data to text data from the Aozora Bunko, the result was as expected: the index size became larger than the table size.

postgres=# SELECT * FROM pgstattuple('aozoradata');
-[ RECORD 1 ]------+----------
table_len          | 807690240
tuple_count        | 3533485
tuple_len          | 774167704
tuple_percent      | 95.85
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 3066496
free_percent       | 0.38

postgres=# \d aozoradata
                            Table "public.aozoradata"
 Column |  Type   | Collation | Nullable |                Default
--------+---------+-----------+----------+----------------------------------------
 id     | integer |           | not null | nextval('aozoradata_id_seq'::regclass)
 data   | text    |           | not null |
postgres=# SELECT
postgres-#     objectname,
postgres-#     TO_CHAR(pg_relation_size(objectname::regclass),'999,999,999,999') AS bytes
postgres-# FROM
postgres-#     (
postgres(#     SELECT
postgres(#         tablename AS objectname
postgres(#     FROM
postgres(#         pg_tables
postgres(#     WHERE
postgres(#         schemaname = 'public'
postgres(#     UNION
postgres(#     SELECT
postgres(#         indexname AS objectname
postgres(#     FROM
postgres(#         pg_indexes
postgres(#     WHERE
postgres(#         schemaname = 'public'
postgres(#     ) AS objectlist
postgres-# ORDER BY
postgres-#     bytes DESC;
   objectname   |      bytes
----------------+------------------
 aozoradata     |      807,690,240
 aozoradata_idx |    1,156,775,936
(2 rows)
Suggest an edit on GitHub