I referenced this site for guidance. Since the SQL statement creation part didn’t work in my environment, I switched to a different approach.
Text Mining with Aozora Bunko Data (Preparation) | ACALL BLOG https://blog.acall.jp/2019/11/aozorabunko-textmining/
Note that I simply wanted large-volume Japanese text data without proper formatting, so please be aware of that.
Clone from the Aozora Bunko Repository
#sudo yum -y install git
git clone --depth 1 https://github.com/aozorabunko/aozorabunko.git
Install nkf
wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/n/nkf-2.1.3-5.el7.x86_64.rpm
sudo rpm -ivh nkf-2.1.3-5.el7.x86_64.rpm
Merge Text Files
cd aozorabunko-master
mdkir text
find cards -name '*.zip' -exec cp {} text \;
cd text
unzip '*.zip'
mkdir matome
cp *.txt matome
cd matome
find . -name "*.txt" -exec cat {} >>matome.txt +
nkf -w --overwrite matome.txt
Create SQL from Text File for Execution
# Remove single quotes from text file. Single quotes cause incorrect delimiter detection.
sed -e "s/'/ /g" matome.txt > matome1.sql
sed "s/^/insert into aozoradata (data) values ('/g" matome1.sql > matome2.sql
sed "s/$/');/g" matome2.sql > matome3.sql
Prepare PostgreSQL
CREATE TABLE aozoradata(
id serial not null,
data text not null
);
Execute SQL File (This takes a very long time)
psql -d postgres -U postgres -f matome3.sql
The number of rows becomes 3,533,485.
postgres=# \d+ aozoradata;
Table "public.aozoradata"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+----------------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('aozoradata_id_seq'::regclass) | plain | |
data | text | | not null | | extended | |
Indexes:
"aozoradata_idx" gin (data gin_trgm_ops)
postgres=# select count(*) from aozoradata;
count
---------
3533485
(1 row)
Checking with pgstattuple shows the following:
postgres=# SELECT * FROM pgstattuple('aozoradata');
-[ RECORD 1 ]------+----------
table_len | 807690240
tuple_count | 3533485
tuple_len | 774167704
tuple_percent | 95.85
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 3066496
free_percent | 0.38
Checking from statistics (pg_stats). The avg_width is 190, which feels surprisingly small.
postgres=# select schemaname,tablename,attname,null_frac,avg_width,n_distinct,correlation from pg_stats where tablename='aozoradata';
-[ RECORD 1 ]------------
schemaname | public
tablename | aozoradata
attname | id
null_frac | 0
avg_width | 4
n_distinct | -1
correlation | 0.999998
-[ RECORD 2 ]------------
schemaname | public
tablename | aozoradata
attname | data
null_frac | 0
avg_width | 190
n_distinct | 114652
correlation | -0.00665814
Aozora Bunko annotations, blank lines, and hyphen separators remain in the data, so you’ll likely need to remove them as needed. In that case, please refer to the middle section onwards of the reference URL.
Text Mining with Aozora Bunko Data (Preparation) | ACALL BLOG https://blog.acall.jp/2019/11/aozorabunko-textmining/