Trying Amazon Neptune: From Instance Creation to Data Loading and Query Execution

Introduction

I decided to try Neptune, AWS’s fully managed graph database service. This article walks through creating an instance, loading RDF data, and finally running a simple SPARQL query.

I’ll cover what a graph database is and what Amazon Neptune is in a separate article.

Steps to be performed:

Create the instance
Create an IAM role, attach the role to Neptune, configure the S3 VPC endpoint
Load data from S3
Verify the loaded data using the RDF4J console and HTTP REST endpoint

Prerequisites:

A VPC and an S3 bucket created in advance

Creating the Instance

Select “Create Database”

Fill in “Specify DB Details”

For this test, I specified “Neptune 1.0.2.1.R4,” the latest version at the time. Note that changing to Multi-AZ after instance creation is not currently possible, so select it at this stage if needed.

Continue Filling In Details

The input fields are similar to RDS and Aurora.

Click the “Create Database” button to start creation, then wait for it to finish.

Creation completed in roughly 5 to 10 minutes.

Setting Up IAM Role and S3 VPC Endpoint

As preparation for data loading, configure the IAM role and S3 VPC endpoint.

Prerequisite: IAM Role and Amazon S3 Access - Amazon Neptune https://docs.aws.amazon.com/ja_jp/neptune/latest/userguide/bulk-load-tutorial-IAM.html

From the IAM screen, select “Create Role.”

Select S3.

Select “AmazonS3ReadOnlyAccess” to attach the policy.

Fill in the remaining fields as needed.

The role name was set to “NeptuneLoadFromS3.”

Navigate to the created role’s screen.

Go to “Trust relationships” - “Edit trust relationship” and paste the following, overwriting the existing content.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "rds.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
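If you would rather generate the trust policy than paste it by hand, the same document can be built in a few lines of Python. This is only a sketch: the function name build_trust_policy is my own and not part of any AWS SDK, and the output is simply the JSON shown above.

```python
import json


def build_trust_policy(service: str = "rds.amazonaws.com") -> dict:
    """Return a trust policy that lets the given service assume the role.

    Hypothetical helper for illustration; Neptune uses the
    rds.amazonaws.com service principal, as in the JSON above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": [service]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Serialize to the text that goes into "Edit trust relationship".
policy_json = json.dumps(build_trust_policy(), indent=2)
print(policy_json)
```

Keeping the policy in code makes it easier to review and reuse when the role is created for more than one environment.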

Add an IAM Role to the Amazon Neptune Cluster

Go to the Neptune cluster and select “Manage IAM roles.”

Add the IAM role just created (NeptuneLoadFromS3).

Create an S3 VPC Endpoint

Loading data from S3 into Neptune requires an S3 VPC endpoint, so create one in the VPC where the Neptune cluster runs.

On the endpoint creation screen, select “com.amazonaws.ap-northeast-1.s3.” (Since this is the Tokyo region, it’s ap-northeast-1; other regions will have a different region name.)

Specify the VPC and route table.

Loading Data from S3 into Neptune

Everything is now in place to load data from S3 into Neptune. The sample data is the RDF dataset published at http://rdf.geospecies.org (a gzipped RDF/XML file). Download it and upload it to the S3 bucket.

[ec2-user@bastin nep-tool]$ curl -O http://rdf.geospecies.org/geospecies.rdf.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 8891k  100 8891k    0     0  3405k      0  0:00:02  0:00:02 --:--:-- 3404k
[ec2-user@bastin nep-tool]$
[ec2-user@bastin nep-tool]$ ls -l geospecies.rdf.gz
-rw-rw-r-- 1 ec2-user ec2-user 9105109 Jan 28 08:16 geospecies.rdf.gz
[ec2-user@bastin nep-tool]$ aws s3 cp geospecies.rdf.gz s3://nep-s3-xxxx/
upload: ./geospecies.rdf.gz to s3://nep-s3-xxxx/geospecies.rdf.gz
[ec2-user@bastin nep-tool]$

Load the data with the following command. Update endpoint, source, format, and iamRoleArn as needed.

For RDF, the format can also be turtle, ntriples, and others.

Load Data Formats - Amazon Neptune https://docs.aws.amazon.com/ja_jp/neptune/latest/userguide/bulk-load-tutorial-format.html

curl -X POST \
    -H 'Content-Type: application/json' \
    https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/loader -d '
    {
      "source" : "s3://nep-s3-xxxx/geospecies.rdf.gz",
      "format" : "rdfxml",
      "iamRoleArn" : "arn:aws:iam::xxxxxxxxx:role/NeptuneLoadFromS3",
      "region" : "ap-northeast-1",
      "failOnError" : "FALSE",
      "parallelism" : "HIGH"
    }'
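For scripting the load instead of typing the curl command, the same request can be assembled with Python's standard library. The following is a minimal sketch, not an official client: build_load_request and its parameter names are hypothetical, and the body fields simply mirror the curl example above. It only builds the request object and does not send it.

```python
import json
import urllib.request


def build_load_request(endpoint, source, role_arn, region,
                       data_format="rdfxml", port=8182):
    """Build (but do not send) a POST request for the Neptune /loader API.

    Hypothetical helper; the JSON fields are the ones used in the
    curl command above.
    """
    body = {
        "source": source,
        "format": data_format,
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "HIGH",
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:{port}/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; replace them with your own endpoint, bucket and role.
request = build_load_request(
    endpoint="your-neptune-endpoint",
    source="s3://your-bucket/geospecies.rdf.gz",
    role_arn="arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    region="ap-northeast-1",
)
print(request.get_method(), request.full_url)
```

To actually start the load, the request would be passed to urllib.request.urlopen from a host inside the VPC, since the Neptune endpoint is not reachable from outside it.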

After execution, the following is displayed. Note the loadId as it’s needed to check status.

{
    "status" : "200 OK",
    "payload" : {
        "loadId" : "eff1268f-17ab-473a-b845-c2d91a317c01"
    }
}

Check the data load status. Specify the loadId obtained earlier.

curl -G 'https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/loader/eff1268f-17ab-473a-b845-c2d91a317c01'
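When the load is driven from a script, the loadId can be pulled out of the response and turned into the status URL automatically. A small sketch, assuming the response has the shape shown above; extract_load_id and status_url are names I made up for illustration.

```python
import json


def extract_load_id(response_text: str) -> str:
    """Return the loadId from the JSON body returned by POST /loader."""
    return json.loads(response_text)["payload"]["loadId"]


def status_url(endpoint: str, load_id: str, port: int = 8182) -> str:
    """Build the Get-Status URL for a given load."""
    return f"https://{endpoint}:{port}/loader/{load_id}"


# Response body copied from the example above.
sample_response = (
    '{"status": "200 OK", '
    '"payload": {"loadId": "eff1268f-17ab-473a-b845-c2d91a317c01"}}'
)
load_id = extract_load_id(sample_response)
print(status_url("your-neptune-endpoint", load_id))
```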

In-Progress Output

[ec2-user@bastin nep-tool]$ curl -G 'https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/loader/eff1268f-17ab-473a-b845-c2d91a317c01'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_IN_PROGRESS" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://nep-s3-xxxx/geospecies.rdf.gz",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_IN_PROGRESS",
            "totalTimeSpent" : 148,
            "startTime" : 1580199498,
            "totalRecords" : 2130000,
            "totalDuplicates" : 0,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
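Rather than re-running the status command by hand, the check can be wrapped in a polling loop. The sketch below is an assumption-laden illustration: wait_for_load is a hypothetical helper, the caller supplies fetch_status (any function that returns the parsed status JSON), and the loop treats every status other than LOAD_IN_PROGRESS as final, which is a simplification based on the two statuses that appear in this article.

```python
import time


def wait_for_load(fetch_status, interval_seconds=5.0, max_polls=120,
                  sleep=time.sleep):
    """Poll until the load leaves LOAD_IN_PROGRESS, then return overallStatus.

    fetch_status: callable returning the parsed Get-Status response (dict).
    sleep: injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        overall = fetch_status()["payload"]["overallStatus"]
        if overall["status"] != "LOAD_IN_PROGRESS":
            return overall
        sleep(interval_seconds)
    raise TimeoutError(f"load still in progress after {max_polls} polls")


# Dry run with canned responses shaped like the outputs in this article.
canned = iter([
    {"payload": {"overallStatus": {"status": "LOAD_IN_PROGRESS",
                                   "totalRecords": 2130000}}},
    {"payload": {"overallStatus": {"status": "LOAD_COMPLETED",
                                   "totalRecords": 2201532}}},
])
final_status = wait_for_load(lambda: next(canned), sleep=lambda seconds: None)
print(final_status["status"])
```

In real use, fetch_status would issue the GET request against the status URL and parse the body with json.loads.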

Load Complete Output

[ec2-user@bastin nep-tool]$ curl -G 'https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/loader/eff1268f-17ab-473a-b845-c2d91a317c01'
{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_COMPLETED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://nep-s3-xxxx/geospecies.rdf.gz",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_COMPLETED",
            "totalTimeSpent" : 149,
            "startTime" : 1580199498,
            "totalRecords" : 2201532,
            "totalDuplicates" : 0,
            "parsingErrors" : 0,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}

The field descriptions are as follows. In this example, 2,201,532 records were loaded in 149 seconds.

Neptune Loader Get-Status API - Amazon Neptune https://docs.aws.amazon.com/ja_jp/neptune/latest/userguide/load-api-reference-status.html

Field	Description
fullUri	The URI of one or more files to be loaded. Format: s3://bucket/key
runNumber	The number of runs for this load or feed. This increments when the load is resumed.
retryNumber	The number of retries for this load or feed. This increments when the loader automatically retries a feed or load.
status	The returned status for this load or feed. LOAD_COMPLETED indicates the load succeeded without issues.
totalTimeSpent	Time spent (in seconds) parsing and inserting data for this load or feed. Does not include time spent retrieving the list of source files.
totalRecords	Total records loaded or attempted to be loaded.
totalDuplicates	Number of duplicate records encountered.
parsingErrors	Number of parsing errors encountered.
datatypeMismatchErrors	Number of records where the specified data did not match the data type.
insertErrors	Number of records that could not be inserted due to errors.
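The fields above are enough to decide whether a load went cleanly and how fast it ran. As a rough sketch (summarize_load is a hypothetical name, and "succeeded" here means LOAD_COMPLETED with all three error counters at zero, which is my own definition):

```python
def summarize_load(overall: dict) -> dict:
    """Condense an overallStatus block into a pass/fail flag and throughput."""
    errors = (
        overall["parsingErrors"]
        + overall["datatypeMismatchErrors"]
        + overall["insertErrors"]
    )
    seconds = overall["totalTimeSpent"]
    return {
        "succeeded": overall["status"] == "LOAD_COMPLETED" and errors == 0,
        "errors": errors,
        # Guard against division by zero for very short loads.
        "records_per_second": overall["totalRecords"] / seconds if seconds else None,
    }


# Values taken from the "Load Complete Output" in this article.
completed = {
    "status": "LOAD_COMPLETED",
    "totalTimeSpent": 149,
    "totalRecords": 2201532,
    "totalDuplicates": 0,
    "parsingErrors": 0,
    "datatypeMismatchErrors": 0,
    "insertErrors": 0,
}
summary = summarize_load(completed)
print(summary)
```

For the load in this article, this works out to roughly 14,775 records per second.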

Issuing Queries to Neptune

Now that data is loaded, let’s issue queries.

Using the HTTP REST Endpoint

Connecting to a Neptune DB Instance Using the HTTP REST Endpoint - Amazon Neptune https://docs.aws.amazon.com/ja_jp/neptune/latest/userguide/access-graph-sparql-http-rest.html

curl -X POST --data-binary 'query=select ?s ?p ?o where {?s ?p ?o} limit 10' https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/sparql
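The same query can be issued from Python. The sketch below only builds the request and does not send it; build_sparql_request is a hypothetical helper, and the Accept header requesting SPARQL JSON results is my addition rather than something the curl command sets. Unlike the curl example, the query text is URL-encoded here, which matters once a query contains characters such as & or +.

```python
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query, port=8182):
    """Build (but do not send) a form-encoded SPARQL query request."""
    form = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(
        url=f"https://{endpoint}:{port}/sparql",
        data=form,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


sparql = "select ?s ?p ?o where {?s ?p ?o} limit 10"
sparql_request = build_sparql_request("your-neptune-endpoint", sparql)
print(sparql_request.full_url)
print(sparql_request.data.decode("ascii"))
```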

Execution Result

[ec2-user@bastin nep-tool]$ curl -X POST --data-binary 'query=select ?s ?p ?o where {?s ?p ?o} limit 10' https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/sparql
{
  "head" : {
    "vars" : [ "s", "p", "o" ]
  },
  "results" : {
    "bindings" : [ {
      "s" : {
        "type" : "uri",
        "value" : "http://lod.geospecies.org/ses/uRtpv"
      },
      "p" : {
        "type" : "uri",
        "value" : "http://rdf.geospecies.org/ont/geospecies#isUnexpectedIn"
      },
      "o" : {
        "type" : "uri",
        "value" : "http://sws.geonames.org/5001836/"
      }
(remaining results omitted)
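The JSON result nests each value under its variable name, which is awkward to read. A short sketch of flattening it into rows, assuming the response follows the structure shown above (bindings_to_rows is a name chosen for illustration):

```python
def bindings_to_rows(result: dict) -> list:
    """Flatten SPARQL JSON results into tuples ordered by head.vars.

    Variables that are unbound in a solution become None.
    """
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append(tuple(
            binding[name]["value"] if name in binding else None
            for name in variables
        ))
    return rows


# First binding from the execution result above.
sample_result = {
    "head": {"vars": ["s", "p", "o"]},
    "results": {"bindings": [{
        "s": {"type": "uri", "value": "http://lod.geospecies.org/ses/uRtpv"},
        "p": {"type": "uri",
              "value": "http://rdf.geospecies.org/ont/geospecies#isUnexpectedIn"},
        "o": {"type": "uri", "value": "http://sws.geonames.org/5001836/"},
    }]},
}
for row in bindings_to_rows(sample_result):
    print(row)
```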

Using the RDF4J Console

Connecting to a Neptune DB Instance Using the RDF4J Console - Amazon Neptune https://docs.aws.amazon.com/ja_jp/neptune/latest/userguide/access-graph-sparql-rdf4j-console.html

Download the RDF4J SDK from the RDF4J site.

Upload the downloaded zip file to the EC2 instance you will connect to Neptune from.

[ec2-user@bastin nep-tool]$ ls -l
total 104740
-rw-r--r-- 1 ec2-user ec2-user 98147430 Jan 25 06:16 eclipse-rdf4j-3.0.4-sdk.zip
-rw-rw-r-- 1 ec2-user ec2-user  9105109 Jan 28 08:16 geospecies.rdf.gz

After unzipping, run console.sh located under bin/.

[ec2-user@bastin nep-tool]$ ./eclipse-rdf4j-3.0.4/bin/console.sh
08:37:35.639 [main] DEBUG org.eclipse.rdf4j.common.platform.PlatformFactory - os.name = linux
08:37:35.652 [main] DEBUG org.eclipse.rdf4j.common.platform.PlatformFactory - Detected Posix platform
Connected to default data directory
RDF4J Console 3.0.4+47737c0

3.0.4+47737c0
Type 'help' for help.
>

Create a SPARQL repository for the Neptune DB instance.

create sparql

You will be prompted to enter the following information. (I have not verified this, but if you create a read replica, it probably makes sense to point the “SPARQL query endpoint” at the read replica and the “SPARQL update endpoint” at the primary instance.)

Variable	Value
SPARQL query endpoint	https://your-neptune-endpoint:port/sparql
SPARQL update endpoint	https://your-neptune-endpoint:port/sparql
Local repository ID [endpoint@localhost]	neptune
Repository title [SPARQL endpoint repository @localhost]	Neptune DB instance

> create sparql
Please specify values for the following variables:
SPARQL query endpoint: https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/sparql
SPARQL update endpoint: https://neptest.xxxxxxxxxxxx.ap-northeast-1.neptune.amazonaws.com:8182/sparql
Local repository ID [endpoint@localhost]: neptune
Repository title [SPARQL endpoint repository @localhost]: neptune

Repository created

Connect to the Neptune instance. After connecting, the local repository ID appears in the prompt.

> open neptune
Opened repository 'neptune'
neptune>

Run the same query as with the HTTP REST endpoint approach.

neptune> sparql select ?s ?p ?o where {?s ?p ?o} limit 10
Evaluating SPARQL query...
+------------------------+------------------------+------------------------+
| s                      | p                      | o                      |
+------------------------+------------------------+------------------------+
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#hasScientificNameAuthorship>| "(LeConte, 1866)"^^xsd:string|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#hasScientificName>| "Iphthiminus opacus (LeConte, 1866)"^^xsd:string|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isExpectedIn>| <http://sws.geonames.org/6255149/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isExpectedIn>| <http://sws.geonames.org/5279468/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#hasNomenclaturalCode>| <http://rdf.geospecies.org/ont/geospecies#NomenclaturalCode_ICZN>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isUnknownAboutIn>| <http://sws.geonames.org/4862182/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isUnknownAboutIn>| <http://sws.geonames.org/5037779/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isUnknownAboutIn>| <http://sws.geonames.org/5001836/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#isUnknownAboutIn>| <http://sws.geonames.org/2635167/>|
| <http://lod.geospecies.org/ses/zJIK4>| <http://rdf.geospecies.org/ont/geospecies#hasSubfamilyName>| "Coelometopinae"^^xsd:string|
+------------------------+------------------------+------------------------+
10 result(s) (1268 ms)
neptune>