Loading Data into Redshift

This assumes that the IAM role has already been created and attached to the cluster.

Upload the Load File to S3

aws s3 cp aozora_data.csv s3://xxxx/load/

The load file is about 6 GB and contains approximately 87.7 million rows (Aozora Bunko text data).

[ec2-user@bastin ~]$ ls -lh aozora_data.csv
-rw-rw-r-- 1 ec2-user ec2-user 6.1G Dec 16  2012 aozora_data.csv
[ec2-user@bastin ~]$ wc -l aozora_data.csv
87701673 aozora_data.csv

Create the Destination Table in Redshift

CREATE TABLE aozora_data (
    file        VARCHAR(100),
    num         INT,
    row         INT,
    word        TEXT,
    subtype1    VARCHAR(100),
    subtype2    VARCHAR(100),
    subtype3    VARCHAR(100),
    subtype4    VARCHAR(100),
    conjtype    VARCHAR(30),
    conjugation VARCHAR(30),
    basic       TEXT,
    ruby        TEXT,
    pronunce    TEXT
);

Execute Data Load

copy aozora_data from 's3://xxxxx/load/aozora_data.csv'
iam_role 'arn:aws:iam::xxxxx:role/myRedshiftRole'
csv;

The approach above is an anti-pattern: a single large file is loaded by only one slice, so the load does not make effective use of Redshift’s MPP architecture. As the best practices linked below recommend, it is better to split the file into multiple pieces and compress them before loading.

Amazon Redshift Data Loading Best Practices - Amazon Redshift https://docs.aws.amazon.com/ja_jp/redshift/latest/dg/c_loading-data-best-practices.html
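
The best practices page linked above recommends that the number of files be a multiple of the number of slices in the cluster, with each file between roughly 1 MB and 1 GB after compression. As a rough sketch of how a piece count could be chosen, the snippet below picks the smallest multiple of the slice count that keeps each piece under 1 GiB before compression (a conservative bound). The slice count of 8 is an assumption for illustration, not a value from this cluster, and the byte size is only an approximation of the 6.1G reported by ls.

```shell
# Assumed values for illustration; adjust to the actual cluster and file.
slices=8                                   # hypothetical slice count (e.g. SELECT COUNT(*) FROM stv_slices)
file_bytes=6549825126                      # approximately 6.1 GiB, the size of aozora_data.csv
max_piece_bytes=$((1024 * 1024 * 1024))    # keep each uncompressed piece under 1 GiB

# Start at one piece per slice and add a full set of slices at a time,
# so the result is always a multiple of the slice count.
pieces=$slices
while [ $((file_bytes / pieces)) -gt "$max_piece_bytes" ]; do
  pieces=$((pieces + slices))
done

echo "split into $pieces pieces"
```

With these assumed numbers each piece is about 780 MiB, so 8 pieces are already enough.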

Data Splitting

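split -n l/8 -d aozora_data.csv part-

The l/8 form (GNU split) divides the file into 8 pieces on line boundaries, so no row is cut across two files. Plain -n 8 splits by byte count and can break a row in the middle.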
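
The split, compress, and verify steps can be sketched as follows. This is a minimal example that runs against a small stand-in file (sample.csv, a hypothetical name) rather than the real aozora_data.csv, and it assumes GNU split, whose `-n l/8` form splits on line boundaries. The upload line is left commented out because it needs AWS credentials and a real bucket.

```shell
# Work in a scratch directory with a small stand-in file (hypothetical sample data).
workdir=$(mktemp -d)
cd "$workdir"
seq 1 1000 | sed 's/$/,sample/' > sample.csv

# Split into 8 pieces on line boundaries (GNU split), producing part-00 .. part-07.
split -n l/8 -d sample.csv part-

# Compress each piece; gzip replaces part-NN with part-NN.gz.
gzip part-*

# Check that no rows were lost or cut in half by the split.
original_rows=$(wc -l < sample.csv)
split_rows=$(gzip -dc part-*.gz | wc -l)
piece_count=$(ls part-*.gz | wc -l)
echo "pieces=$piece_count original_rows=$original_rows split_rows=$split_rows"

# Upload only the compressed pieces (commented out: requires credentials and a bucket).
# aws s3 cp . s3://xxxx/load/ --recursive --exclude "*" --include "part-*.gz"
```

Once the pieces are in S3, COPY can be pointed at the common key prefix (for example 's3://xxxx/load/part-') with the gzip option added, so that all pieces are loaded in one command.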