GraphLake File Format Specification

Version: DRAFT 0.1.0

Date: November 19, 2024

Author: Graham Moore

Contact: graham.moore@dataplatformsolutions.com

Introduction

Inspired by formats such as Apache Iceberg[1], Delta Lake[2], and Parquet[3], GraphLake is a file format and a set of metadata structures designed to support high-performance, highly parallel processing of graph queries. GraphLake is graph native in how it stores and partitions data. It is defined as an open specification to encourage others to adopt this approach. Because the data is kept separate from the compute, graph engines built on GraphLake can scale in many different ways.

Goals

The following goals have helped to shape the GraphLake format:

File Structure Overview

GraphLake stores data in files within a directory structure that is designed to support partitioning and metadata storage. The directory structure can be treated as virtual, which aligns nicely with object storage systems such as S3, MinIO and Azure Blob Storage.

Data is partitioned first by graph and then by predicate, and the triples are stored in data files. Each data file has a corresponding metadata construct that holds information such as a Bloom filter and the minimum and maximum values. Each triple is written to two files: the subject-object (SO) file, in which triples are sorted by subject, and the object-subject (OS) file, in which they are sorted by object.

To reduce duplicated data, each data file has metadata that maintains a mapping between the URI prefix and an integer value. This integer value is then used in the binary representation. Full URIs can be reconstituted on read.
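
The prefix mapping described above can be sketched as follows. This is a minimal illustration, not part of the specification: the `PrefixMap` class, the `split_uri` helper and the rule of splitting at the last `#` or `/` are all assumptions made for the example. ID 0 is reserved for "no prefix", in line with the binary encoding defined later in this document.

```python
from __future__ import annotations


def split_uri(uri: str) -> tuple[str, str]:
    """Split a URI into (prefix, local part) at the last '#' or '/'.

    Assumption: the specification does not say where the split happens.
    """
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


class PrefixMap:
    """Hypothetical per-file mapping between URI prefixes and integer IDs."""

    def __init__(self) -> None:
        self.ids: dict[str, int] = {}   # prefix -> integer ID
        self.prefixes: list[str] = []   # position i holds the prefix with ID i + 1

    def encode(self, uri: str) -> tuple[int, str]:
        """Return (prefix ID, local part); ID 0 means the URI has no prefix."""
        prefix, local = split_uri(uri)
        if prefix == "":
            return 0, local
        if prefix not in self.ids:
            self.prefixes.append(prefix)
            self.ids[prefix] = len(self.prefixes)  # IDs start at 1
        return self.ids[prefix], local

    def decode(self, prefix_id: int, local: str) -> str:
        """Reconstitute the full URI on read."""
        if prefix_id == 0:
            return local
        return self.prefixes[prefix_id - 1] + local


pm = PrefixMap()
encoded = pm.encode("http://example.com/name")
restored = pm.decode(*encoded)  # the original URI again
```

Two URIs that share a prefix, such as `http://example.com/name` and `http://example.com/age`, receive the same integer ID, so the prefix text is held once in the metadata rather than once per triple.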

GraphLake reifies triples by computing a hash256 value from the complete data of the triple (excluding the graph). Because this value can always be recomputed, it is not stored explicitly with the triple. To aid retrieval of triples by ID, additional Bloom filters are added to the data file metadata structure.
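
As an illustration of a computed triple identifier, the sketch below derives an ID from the subject, predicate and object, leaving out the graph. Two things are assumptions rather than statements of the specification: that the hash is SHA-256, and the way the fields are serialised before hashing (UTF-8 with a 4-byte length prefix per field). The function name `triple_id` is hypothetical.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str, datatype: str = "") -> str:
    """Return a hex digest identifying a triple; the graph is deliberately excluded.

    Assumption: SHA-256 over length-prefixed UTF-8 fields. The length prefix
    ensures that moving a character from one field to the next changes the
    bytes that are hashed.
    """
    h = hashlib.sha256()
    for field in (subject, predicate, obj, datatype):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "little"))
        h.update(data)
    return h.hexdigest()


tid = triple_id("http://example.com/alice", "http://example.com/name", "Alice")
```

Because the same inputs always produce the same digest, a reader can recompute the ID from any triple it decodes, which is why the ID does not need to be stored alongside the triple.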

Deletes are supported through tombstone files, which use exactly the same data format as the SO and OS data files. When querying, triples should first be located in the data files and then checked against the tombstone files; a triple that appears in a tombstone file is treated as deleted. Processing applications are free to rewrite data files to remove deleted triples.
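
The read path for deletes can be sketched as follows, assuming the triples from the data files and the tombstone files have already been decoded into comparable tuples. The function name `visible_triples` and the in-memory set are choices made for this example only; a real reader would more likely stream both files.

```python
def visible_triples(data_triples, tombstones):
    """Return the triples from the data files that are not tombstoned.

    Both arguments are iterables of hashable triples, for example
    (subject, object) tuples read from one predicate partition.
    """
    deleted = set(tombstones)
    return [t for t in data_triples if t not in deleted]


data = [("ex:alice", "Alice"), ("ex:bob", "Bob"), ("ex:carol", "Carol")]
tombstoned = [("ex:bob", "Bob")]
remaining = visible_triples(data, tombstoned)
```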

All files are immutable.

Directory & File Structure


        /root
            store.json
            /snapshots
                snapshot1.json
                ..
                snapshotN.json
            /graphs
                /graph1
                    /predicate1
                        data_file1.bin
                        data_file2.bin
                        data_file3.bin
                /graph2
                    /predicate1
                        data_file4.bin
                        data_file5.bin
                        data_file6.bin
    

Store File

The store file contains metadata about the store, including the snapshots that comprise it. The store file is a JSON file of the following form:


        {
            "format_version": "0.1",
            "location" : "/graphlake/store1",
            "snapshots": [
                {
                    "id": "snapshot1 - id that corresponds to the snapshot file prefix",
                    "version" : "incrementing integer int64",
                    "timestamp": "milliseconds_since_epoch int64"
                }
            ]
        }
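
A reader would typically start from the store file to find the current snapshot. The sketch below assumes that `version` is stored as a JSON integer and that the highest version is the current one. It also assumes that a snapshot file lives at `<location>/snapshots/<id>.json`, which is inferred from the directory layout and is not stated explicitly. The function names are hypothetical.

```python
import json


def latest_snapshot(store: dict) -> dict:
    """Return the snapshot entry with the highest version number."""
    return max(store["snapshots"], key=lambda s: s["version"])


def snapshot_path(store: dict, snapshot: dict) -> str:
    """Build the assumed path of a snapshot file from its ID."""
    return f"{store['location'].rstrip('/')}/snapshots/{snapshot['id']}.json"


store = json.loads("""
{
    "format_version": "0.1",
    "location": "/graphlake/store1",
    "snapshots": [
        {"id": "snapshot1", "version": 1, "timestamp": 1732000000000},
        {"id": "snapshot2", "version": 2, "timestamp": 1732000600000}
    ]
}
""")
current = latest_snapshot(store)
path = snapshot_path(store, current)
```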
    

Snapshot File

The snapshot file contains the details of the graphs and the sets of files that comprise a given snapshot.

The following JSON schema defines the shape of the snapshot file:

        
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Snapshot Schema",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Snapshot ID that corresponds to the snapshot file prefix"
    },
    "version": {
      "type": "integer",
      "description": "Incrementing integer int64"
    },
    "timestamp": {
      "type": "integer",
      "description": "Milliseconds since epoch int64"
    },
    "graphs": {
      "type": "object",
      "additionalProperties": {
        "$ref": "#/definitions/graph"
      }
    }
  },
  "required": ["id", "version", "timestamp", "graphs"],
  "additionalProperties": false,
  "definitions": {
    "graph": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string",
          "description": "Internal ID - a GUID or similar"
        },
        "predicate_index": {
          "type": "object",
          "additionalProperties": {
            "type": "string",
            "description": "Partition identifier or GUID"
          }
        },
        "partitions": {
          "type": "object",
          "additionalProperties": {
            "$ref": "#/definitions/partition"
          }
        },
        "so_data_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        },
        "os_data_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        },
        "tombstone_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        }
      },
      "required": ["id", "predicate_index"]
    },
    "partition": {
      "type": "object",
      "properties": {
        "files": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      },
      "required": ["files"]
    },
    "dataFile": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "predicate": { "type": "string" },
        "snapshot_version": { "type": "integer" },
        "is_inverse": { "type": "boolean" },
        "path": { "type": "string" },
        "size": { "type": "integer" },
        "triple_count": { "type": "integer" },
        "max_value": { "type": "string" },
        "min_value": { "type": "string" },
        "bloom_filter": {
          "type": "string",
          "description": "Base64-encoded string representing the Bloom filter data"
        },
        "graph": { "type": "string" },
        "original_graph": { "type": "string" },
        "prefix_map": {
          "type": "object",
          "additionalProperties": {
            "type": "integer"
          }
        },
        "prefixes": {
          "type": "array",
          "items": { "type": "string" }
        }
      },
      "required": [
        "id",
        "predicate",
        "snapshot_version",
        "is_inverse",
        "path",
        "size",
        "triple_count",
        "max_value",
        "min_value",
        "bloom_filter",
        "graph",
        "original_graph",
        "prefix_map",
        "prefixes"
      ]
    }
  }
}
        
    

The following table defines the structure of the snapshot file:

| Property | Type | Description | Comment |
|---|---|---|---|
| id | string | Snapshot ID that corresponds to the snapshot file prefix | Required |
| version | integer | Incrementing integer (int64) | Required |
| timestamp | integer | Milliseconds since epoch (int64) | Required |
| graphs | object | Details of the graphs in the snapshot | Required; additional properties reference #/definitions/graph |

Graph Definition

The graph object is defined as follows:

| Property | Type | Description | Comment |
|---|---|---|---|
| id | string | Internal ID - a GUID or similar | Required |
| predicate_index | object | Mapping of predicates to partitions | Required; additional properties are strings |
| partitions | object | Details of partitions | Additional properties reference #/definitions/partition |
| so_data_files | array | List of subject-object data files | Items reference #/definitions/dataFile |
| os_data_files | array | List of object-subject data files | Items reference #/definitions/dataFile |
| tombstone_files | array | List of tombstone files | Items reference #/definitions/dataFile |

Partition Definition

The partition object is defined as follows:

| Property | Type | Description | Comment |
|---|---|---|---|
| files | array | List of file identifiers | Required; items are strings |

Data File Definition

The dataFile object is defined as follows:

| Property | Type | Description | Comment |
|---|---|---|---|
| id | string | Unique identifier for the data file | Required |
| predicate | string | Predicate associated with the data file | Required |
| snapshot_version | integer | Version of the snapshot | Required |
| is_inverse | boolean | Indicates if the file is an inverse file | Required |
| path | string | Path to the data file | Required |
| size | integer | Size of the file in bytes | Required |
| triple_count | integer | Number of triples in the file | Required |
| max_value | string | Maximum value in the file | Required |
| min_value | string | Minimum value in the file | Required |
| bloom_filter | string | Base64-encoded Bloom filter data | Required |
| graph | string | Graph URI associated with the file | Required |
| original_graph | string | Original graph URI | Required |
| prefix_map | object | Mapping of prefixes to integers | Required; additional properties are integers |
| prefixes | array | List of prefixes | Required; items are strings |
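
The `min_value` and `max_value` properties allow a reader to skip files that cannot contain a given value. The sketch below shows the idea under one assumption: that plain string comparison matches the sort order used inside the data files, which this document does not define. The Bloom filter is left out because its encoding is not specified here. The function name `candidate_files` is hypothetical.

```python
def candidate_files(data_files: list, value: str) -> list:
    """Return the data file entries whose [min_value, max_value] range covers value.

    Assumption: Python string ordering matches the ordering used in the files.
    """
    return [f for f in data_files if f["min_value"] <= value <= f["max_value"]]


data_files = [
    {"id": "file123", "min_value": "alpha", "max_value": "zeta"},
    {"id": "file124", "min_value": "m", "max_value": "p"},
]
matches = [f["id"] for f in candidate_files(data_files, "beta")]
```

Only the files returned by this check need to be opened. A file whose range covers the value may still not contain it, so the range test narrows the search without replacing the lookup itself.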

The following is an informative example of a snapshot file:


        {
            "id": "snapshot1 - id that corresponds to the snapshot file prefix",
            "version" : "incrementing integer int64",
            "timestamp": "milliseconds_since_epoch int64",
            "graphs": {
                "http://data.dataplatformsolutions.com/graph1": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "partition1 - a local non URI identifier"
                    },
                    "partitions" : {
                        "partition1" : {
                            "files": [
                                "data_file1",
                                "data_file2"
                            ]
                        }
                    }
                },
                "http://data.dataplatformsolutions.com/graph2": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "a guid"
                    },
                    "so_data_files": [
                        {
                            "id": "file123",
                            "predicate": "http://example.org/predicate",
                            "snapshot_version": 2,
                            "is_inverse": false,
                            "path": "/data/triples/file123.triples",
                            "size": 1048576,
                            "triple_count": 5000,
                            "max_value": "zeta",
                            "min_value": "alpha",
                            "bloom_filter": "SGVsbG8sIFdvcmxkIQ==",
                            "graph": "http://example.org/graph",
                            "original_graph": "http://original.example.org/graph",
                            "prefix_map": {
                                "ex": 1,
                                "foaf": 2,
                                "dc": 3
                            },
                            "prefixes": [
                                "ex",
                                "foaf",
                                "dc"
                            ]
                        }
                    ],
                    "os_data_files": [
                        
                    ],
                    "tombstone_files": [
                        
                    ]
                }
            }            
        }
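
Given a parsed snapshot of this shape, the files for a predicate can be found by following `predicate_index` to a partition and then reading that partition's `files` list. The sketch below is one possible reading of that lookup. The function name `files_for_predicate` is hypothetical, and returning an empty list for an unknown graph or predicate is a choice made for the example.

```python
def files_for_predicate(snapshot: dict, graph_uri: str, predicate: str) -> list:
    """Resolve graph -> predicate -> partition -> file identifiers."""
    graph = snapshot.get("graphs", {}).get(graph_uri)
    if graph is None:
        return []
    partition_id = graph["predicate_index"].get(predicate)
    if partition_id is None:
        return []
    partition = graph.get("partitions", {}).get(partition_id, {})
    return list(partition.get("files", []))


snapshot = {
    "graphs": {
        "http://data.dataplatformsolutions.com/graph1": {
            "id": "internal-id",
            "predicate_index": {"http://example.com/name": "partition1"},
            "partitions": {"partition1": {"files": ["data_file1", "data_file2"]}},
        }
    }
}
files = files_for_predicate(
    snapshot,
    "http://data.dataplatformsolutions.com/graph1",
    "http://example.com/name",
)
```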
    

Triple Data File Binary Format

The binary format is designed for streaming, to support query joins and the retrieval of specific triples. Each data file has the following binary form.

Each TripleDataFile is a sequence of encoded triples, sorted by either Subject or Object.

Each triple is encoded as the following fields, in order:

  1. Subject Prefix ID (int32, 4 bytes, little-endian)
  2. Subject Local Part Length (int32, 4 bytes, little-endian)
  3. Subject Local Part ([]byte)
  4. IsLiteral (byte, 1 byte)
  5. If IsLiteral == 1:
    1. Object String Length (int32, 4 bytes, little-endian)
      • Length of the literal value.
    2. Object String ([]byte)
      • The literal value, UTF-8 encoded.
    3. Datatype Prefix ID (int32, 4 bytes, little-endian)
      • Integer ID for the datatype's prefix, or 0 if none.
    4. Datatype Local Part Length (int32, 4 bytes, little-endian)
      • Length of the datatype's local part.
    5. Datatype Local Part ([]byte)
      • The datatype's local part, UTF-8 encoded.
  6. If IsLiteral == 0:
    1. Object Prefix ID (int32, 4 bytes, little-endian)
      • Integer ID for the object's prefix, or 0 if none.
    2. Object Local Part Length (int32, 4 bytes, little-endian)
      • Length of the object's local part.
    3. Object Local Part ([]byte)
      • The object's local part, UTF-8 encoded.
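
The field layout above can be sketched with Python's `struct` module. The sketch makes several assumptions that the list does not settle: that int32 values are signed, that lengths count bytes rather than characters, and that the subject local part is UTF-8 encoded like the other strings. The function names are hypothetical, and handling of the end of the file is omitted.

```python
import struct
from io import BytesIO


def _write_bytes(out, data: bytes) -> None:
    """Write a little-endian int32 length followed by the bytes themselves."""
    out.write(struct.pack("<i", len(data)))
    out.write(data)


def _read_bytes(src) -> bytes:
    """Read a little-endian int32 length and then that many bytes."""
    (length,) = struct.unpack("<i", src.read(4))
    return src.read(length)


def encode_triple(out, subject, is_literal, obj, datatype=(0, "")):
    """subject and datatype are (prefix_id, local_part) pairs.

    obj is a str when is_literal is True, otherwise a (prefix_id, local_part) pair.
    """
    out.write(struct.pack("<i", subject[0]))
    _write_bytes(out, subject[1].encode("utf-8"))
    out.write(struct.pack("<B", 1 if is_literal else 0))
    if is_literal:
        _write_bytes(out, obj.encode("utf-8"))
        out.write(struct.pack("<i", datatype[0]))
        _write_bytes(out, datatype[1].encode("utf-8"))
    else:
        out.write(struct.pack("<i", obj[0]))
        _write_bytes(out, obj[1].encode("utf-8"))


def decode_triple(src):
    """Return (subject, is_literal, object, datatype); datatype is None for resources."""
    (subj_prefix,) = struct.unpack("<i", src.read(4))
    subj_local = _read_bytes(src).decode("utf-8")
    is_literal = src.read(1) == b"\x01"
    if is_literal:
        value = _read_bytes(src).decode("utf-8")
        (dt_prefix,) = struct.unpack("<i", src.read(4))
        dt_local = _read_bytes(src).decode("utf-8")
        return (subj_prefix, subj_local), True, value, (dt_prefix, dt_local)
    (obj_prefix,) = struct.unpack("<i", src.read(4))
    obj_local = _read_bytes(src).decode("utf-8")
    return (subj_prefix, subj_local), False, (obj_prefix, obj_local), None


buf = BytesIO()
encode_triple(buf, (1, "alice"), False, (1, "bob"))
encode_triple(buf, (1, "alice"), True, "Alice", (2, "string"))
buf.seek(0)
first = decode_triple(buf)
second = decode_triple(buf)
```

Every variable-length field carries its own length, so a reader can decode triples one after another from a stream without an index, which suits the sequential scans used for joins.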

Notes: