Trying PostgreSQL's Full-Text Search Engine pg_trgm

For cases where you want to perform “partial match search” using an index in PostgreSQL. With regular B-Tree indexes, partial match searches cannot use the index. By using pg_trgm, PostgreSQL’s full-text search engine, partial match searches can be performed easily. There is also pg_bigm and PGroonga, but this time I will try pg_trgm, which is included in contrib.

Version

postgres=# select version();
                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 10.11 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit

Enabling pg_trgm

postgres=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
postgres=# \dx
                                    List of installed extensions
  Name   | Version |   Schema   |                            Description
---------+---------+------------+-------------------------------------------------------------------
 pg_trgm | 1.3     | public     | text similarity measurement and index searching based on trigrams
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

Create some test data (100 million text records)

CREATE TABLE t1 AS
SELECT to_char(num,'FM0000000000000000') textdata
FROM   generate_series(1,10000000) num
;

Insert target text data for searching

insert into t1 values('test1test2test3');
insert into t1 values('メロスは激怒した。必ず、かの 邪智暴虐 （ じゃちぼうぎゃく ） の王を除かなければならぬと決意した。');

At this point, 100,000,002 records are stored in the table.

select count(*) from t1;

postgres=# select count(*) from t1;
   count
-----------
 100000002
(1 row)

Partial match search for “test2”

explain analyze select * from t1 where textdata like '%test2%';

EXPLAIN ANALYZE Results

The search took about Execution time: 10061.160 ms, approximately 10 seconds. It’s slow because it performs a full table scan.

postgres=# explain analyze select * from t1 where textdata like '%test2%';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1159776.58 rows=10000 width=17) (actual time=10059.633..10061.097 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..1157776.58 rows=4167 width=17) (actual time=10053.097..10053.098 rows=0 loops=3)
         Filter: (textdata ~~ '%test2%'::text)
         Rows Removed by Filter: 33333334
 Planning time: 0.938 ms
 Execution time: 10061.160 ms
(8 rows)

Partial match search for “邪智暴虐”

explain analyze select * from t1 where textdata like '%邪智暴虐%';

EXPLAIN ANALYZE Results

The search took about Execution time: 9982.465 ms, again approximately 10 seconds.

postgres=# explain analyze select * from t1 where textdata like '%邪智暴虐%';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1159776.58 rows=10000 width=17) (actual time=9982.347..9982.403 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..1157776.58 rows=4167 width=17) (actual time=9977.282..9977.282 rows=0 loops=3)
         Filter: (textdata ~~ '%邪智暴虐%'::text)
         Rows Removed by Filter: 33333334
 Planning time: 0.052 ms
 Execution time: 9982.465 ms
(8 rows)

Creating a full-text search index:

create index pg_trgm_idx on t1 USING gin(textdata gin_trgm_ops);

It took about 300 seconds to complete.

[2020-03-16 21:27:36 JST]postgres postgres 4539[53] LOG:  statement: create index pg_trgm_idx on t1 USING gin(textdata gin_trgm_ops);
[2020-03-16 21:32:39 JST]postgres postgres 4539[54] LOG:  duration: 303535.309 ms

The table size is about 5217.8MB = approximately 5GB. The index size is about 1,461MB = approximately 1GB. I expected the index size to be larger than the table size for a full-text search index, but that may be due to the test data.

SELECT
    objectname,
    TO_CHAR(pg_relation_size(objectname::regclass),'999,999,999,999') AS bytes
FROM
    (
    SELECT
        tablename AS objectname
    FROM
        pg_tables
    WHERE
        schemaname = 'public'
    UNION
    SELECT
        indexname AS objectname
    FROM
        pg_indexes
    WHERE
        schemaname = 'public'
    ) AS objectlist
ORDER BY
    bytes DESC;

 objectname  |      bytes
-------------+------------------
 t1          |    5,217,837,056
 pg_trgm_idx |    1,461,542,912
(2 rows)

Partial match search for “test2”

explain analyze select * from t1 where textdata like '%test2%';

EXPLAIN ANALYZE Results

The search that took Execution time: 10061.160 ms now completes in 0.090 ms.

postgres=# explain analyze select * from t1 where textdata like '%test2%';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=285.50..36386.84 rows=10000 width=17) (actual time=0.021..0.022 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%test2%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..283.00 rows=10000 width=0) (actual time=0.016..0.016 rows=1 loops=1)
         Index Cond: (textdata ~~ '%test2%'::text)
 Planning time: 1.148 ms
 Execution time: 0.090 ms
(7 rows)

Partial match search for “邪智暴虐”

explain analyze select * from t1 where textdata like '%邪智暴虐%';

EXPLAIN ANALYZE Results

The search that took Execution time: 9982.465 ms now completes in 0.044 ms.

postgres=# explain analyze select * from t1 where textdata like '%邪智暴虐%';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=217.50..36318.84 rows=10000 width=17) (actual time=0.022..0.022 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%邪智暴虐%'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..215.00 rows=10000 width=0) (actual time=0.017..0.017 rows=1 loops=1)
         Index Cond: (textdata ~~ '%邪智暴虐%'::text)
 Planning time: 0.107 ms
 Execution time: 0.044 ms
(7 rows)

Note that due to the nature of trigrams (tri-gram), searches with 2 or fewer characters will be extremely slow, so be careful.

postgres=# explain analyze select * from t1 where textdata like '%邪智%';
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t1  (cost=758005.50..794106.84 rows=10000 width=17) (actual time=169524.760..169524.761 rows=1 loops=1)
   Recheck Cond: (textdata ~~ '%邪智%'::text)
   Rows Removed by Index Recheck: 100000001
   Heap Blocks: exact=636943
   ->  Bitmap Index Scan on pg_trgm_idx  (cost=0.00..758003.00 rows=10000 width=0) (actual time=26370.882..26370.882 rows=100000002 loops=1)
         Index Cond: (textdata ~~ '%邪智%'::text)
 Planning time: 0.261 ms
 Execution time: 169524.849 ms
(8 rows)

postgres=#

About Recheck Cond

The following article explains this very clearly.

Explanation of the Recheck process in Bitmap Heap Scan after Bitmap Index Scan - Pat Team Leader’s Diary https://taityo-diary.hatenablog.jp/entry/2018/07/07/071928

postgresql - What does “Recheck Cond” in Explain result mean? - Stack Overflow https://stackoverflow.com/questions/50959814/what-does-recheck-cond-in-explain-result-mean/50961326#50961326

Addendum

When I changed the data to text data from the Aozora Bunko, the result was as expected: the index size became larger than the table size.

postgres=# SELECT * FROM pgstattuple('aozoradata');
-[ RECORD 1 ]------+----------
table_len          | 807690240
tuple_count        | 3533485
tuple_len          | 774167704
tuple_percent      | 95.85
dead_tuple_count   | 0
dead_tuple_len     | 0
dead_tuple_percent | 0
free_space         | 3066496
free_percent       | 0.38

postgres=# \d aozoradata
                            Table "public.aozoradata"
 Column |  Type   | Collation | Nullable |                Default
--------+---------+-----------+----------+----------------------------------------
 id     | integer |           | not null | nextval('aozoradata_id_seq'::regclass)
 data   | text    |           | not null |

postgres=# SELECT
postgres-#     objectname,
postgres-#     TO_CHAR(pg_relation_size(objectname::regclass),'999,999,999,999') AS bytes
postgres-# FROM
postgres-#     (
postgres(#     SELECT
postgres(#         tablename AS objectname
postgres(#     FROM
postgres(#         pg_tables
postgres(#     WHERE
postgres(#         schemaname = 'public'
postgres(#     UNION
postgres(#     SELECT
postgres(#         indexname AS objectname
postgres(#     FROM
postgres(#         pg_indexes
postgres(#     WHERE
postgres(#         schemaname = 'public'
postgres(#     ) AS objectlist
postgres-# ORDER BY
postgres-#     bytes DESC;
   objectname   |      bytes
----------------+------------------
 aozoradata     |      807,690,240
 aozoradata_idx |    1,156,775,936
(2 rows)

Trying PostgreSQL's Full-Text Search Engine pg_trgm

Version

Enabling pg_trgm

Create some test data (100 million text records)

Insert target text data for searching

Partial match search for “test2”

EXPLAIN ANALYZE Results

Partial match search for “邪智暴虐”

EXPLAIN ANALYZE Results

Partial match search for “test2”

EXPLAIN ANALYZE Results

Partial match search for “邪智暴虐”

EXPLAIN ANALYZE Results

About Recheck Cond

Addendum

Related articles