GraphLake File Format Specification

Version: DRAFT 0.1.0

Date: November 19, 2024

Author: Graham Moore

Contact: graham.moore@dataplatformsolutions.com

Introduction

Inspired by formats such as Apache Iceberg[1], DeltaLake[2], and Parquet[3], GraphLake is a file format and set of metadata structures designed to support high-performance, highly parallel processing of graph queries. GraphLake is graph native in how it stores and partitions data. It is defined as an open specification to encourage others to utilise and adopt this approach. Because the data is kept separate from the compute, GraphLake enables graph engines to scale in many different ways.

Goals

The following goals have helped to shape the GraphLake format:

Support point in time queries
Support for zero cost branching
Enable stateless readers - e.g. with access to only the files, a query engine can perform a query.
Support for analytical and transactional workloads
Schemaless Graph Support
Support for multiple graphs in a store
Support property graphs and RDF*

File Structure Overview

GraphLake is a file format that stores data in a directory structure. The directory structure is designed to support partitioning and metadata storage, and the data is stored in files within it. The directory structure can be considered virtual, which aligns nicely with object storage systems such as S3, MinIO and Azure Blob storage.

The approach is to partition on graphs, then on predicates, and then to store triples in data files. Each data file has a corresponding metadata construct that contains useful metadata such as a bloom filter and min and max values. Each triple is added to two files: the subject-object (SO) file and the object-subject (OS) file. The triples in these files are sorted by subject and by object respectively.

To reduce duplicated data, each data file has metadata that maintains a mapping between the URI prefix and an integer value. This integer value is then used in the binary representation. Full URIs can be reconstituted on read.
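
The prefix mapping described above can be sketched as follows. This is a minimal illustration, not part of the specification: the rule for splitting a URI into prefix and local part (after the last `#` or `/`), the `split_uri` function and the `PrefixMap` class are all assumptions. ID 0 is kept to mean "no prefix", matching the binary format's use of 0 for that case.

```python
def split_uri(uri: str) -> tuple[str, str]:
    """Split a URI into (prefix, local part) after the last '#' or '/'.

    Assumption: the spec does not say where to split; this is one common choice.
    """
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


class PrefixMap:
    """Hypothetical helper: assigns an integer to each distinct URI prefix."""

    def __init__(self) -> None:
        self.prefix_to_id: dict[str, int] = {}
        self.id_to_prefix: dict[int, str] = {}

    def encode(self, uri: str) -> tuple[int, str]:
        """Return (prefix ID, local part); ID 0 means the value has no prefix."""
        prefix, local = split_uri(uri)
        if not prefix:
            return 0, local
        if prefix not in self.prefix_to_id:
            new_id = len(self.prefix_to_id) + 1
            self.prefix_to_id[prefix] = new_id
            self.id_to_prefix[new_id] = prefix
        return self.prefix_to_id[prefix], local

    def decode(self, prefix_id: int, local: str) -> str:
        """Reconstitute the full URI on read."""
        return self.id_to_prefix.get(prefix_id, "") + local


pm = PrefixMap()
encoded = pm.encode("http://example.com/name")
restored = pm.decode(*encoded)
```

Two URIs that share a prefix, such as `http://example.com/name` and `http://example.com/age`, receive the same integer, so the prefix string is held once in the metadata rather than once per triple.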

GraphLake does triple reification by creating a hash256 value from the complete data of the triple (excluding the graph). As this value can always be computed, it is not explicitly stored with the triple. To aid retrieval of triples by id, additional bloom filters are added to the data file metadata structure.
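
As an illustration of why the identifier need not be stored, the sketch below recomputes it from the triple's data. It rests on two assumptions that this specification does not fix: that "hash256" means SHA-256, and that the triple is serialised by joining its UTF-8 encoded fields with a NUL byte. The function name `triple_id` is hypothetical.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str,
              is_literal: bool = False, datatype: str = "") -> str:
    """Derive a deterministic identifier for a triple (graph excluded)."""
    # Assumption: fields are UTF-8 encoded and separated by a NUL byte,
    # so that ("ab", "c") and ("a", "bc") do not produce the same payload.
    fields = [subject, predicate, obj, "1" if is_literal else "0", datatype]
    payload = b"\x00".join(f.encode("utf-8") for f in fields)
    return hashlib.sha256(payload).hexdigest()


first = triple_id("http://example.com/s", "http://example.com/p", "http://example.com/o")
again = triple_id("http://example.com/s", "http://example.com/p", "http://example.com/o")
```

Because the same input always yields the same digest, any reader can derive the identifier from the triple itself, whichever graph the triple appears in.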

Deletes are supported with the tombstone construct. A tombstone file uses exactly the same data format as the SO and OS data files. When querying, triples should first be located in the data files and then checked against the tombstone files; a triple that appears in a tombstone file is treated as deleted. Processing applications are free to re-write data files and remove deleted triples.
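
The lookup order described above can be sketched as follows. `FileView` is a hypothetical in-memory stand-in for a data or tombstone file, and the sketch assumes that the min and max values of an SO file bound its subjects. The bloom filter check is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class FileView:
    """Hypothetical in-memory stand-in for a data or tombstone file."""
    min_value: str
    max_value: str
    triples: set[tuple[str, str]] = field(default_factory=set)  # (subject, object)

    def may_contain(self, key: str) -> bool:
        # Range pruning from the file metadata; a bloom filter test could be added here.
        return self.min_value <= key <= self.max_value


def triple_exists(subject: str, obj: str,
                  so_files: list[FileView], tombstones: list[FileView]) -> bool:
    """Return True if the triple is in a data file and not in a tombstone file."""
    def found_in(files: list[FileView]) -> bool:
        return any(f.may_contain(subject) and (subject, obj) in f.triples
                   for f in files)

    return found_in(so_files) and not found_in(tombstones)


data = [FileView("a", "m", {("alice", "bob"), ("carol", "dave")})]
dead = [FileView("c", "c~", {("carol", "dave")})]
```

Files whose range excludes the subject are skipped without being scanned, which is the purpose of the min and max metadata.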

All files are immutable.

Directory & File Structure


        /root
            store.json
            /snapshots
                snapshot1.json
                ..
                snapshotN.json
            /graphs
                /graph1
                    /predicate1
                            data_file1.bin
                            data_file2.bin
                            data_file3.bin
                /graph2
                    /predicate1
                            data_file4.bin
                            data_file5.bin
                            data_file6.bin

Store File

The store file (store.json) contains metadata about the store, including information about the snapshots that comprise it. The store file is a JSON file of the following form:


        {
            "format_version": "0.1",
            "location" : "/graphlake/store1",
            "snapshots": [
                {
                    "id": "snapshot1 - id that corresponds to the snapshot file prefix",
                    "version" : "incrementing integer int64",
                    "timestamp": "milliseconds_since_epoch int64"
                }
            ]
        }
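
To illustrate how the store file can support point in time queries, the sketch below picks the newest snapshot at or before a given time. The store contents are hypothetical, with `version` and `timestamp` as JSON integers. The naming of the snapshot file as `<id>.json` under `/snapshots` is an assumption drawn from the directory listing above.

```python
import json
from typing import Optional

# Hypothetical store file contents.
STORE_JSON = """
{
  "format_version": "0.1",
  "location": "/graphlake/store1",
  "snapshots": [
    {"id": "snapshot1", "version": 1, "timestamp": 1700000000000},
    {"id": "snapshot2", "version": 2, "timestamp": 1700000600000}
  ]
}
"""


def snapshot_at(store: dict, as_of_ms: int) -> Optional[dict]:
    """Return the highest-version snapshot entry with timestamp <= as_of_ms."""
    candidates = [s for s in store["snapshots"] if s["timestamp"] <= as_of_ms]
    return max(candidates, key=lambda s: s["version"], default=None)


def snapshot_path(store: dict, snapshot: dict) -> str:
    # Assumption: the snapshot file is named "<id>.json" under /snapshots.
    return f"{store['location']}/snapshots/{snapshot['id']}.json"


store = json.loads(STORE_JSON)
chosen = snapshot_at(store, 1700000300000)
```

A reader needs only the store file and the chosen snapshot file to start a query, which fits the stateless reader goal.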

Snapshot File

The snapshot file contains the details of the graphs and the sets of files that comprise a given snapshot.

The following JSON schema defines the shape of the snapshot file:

        
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Snapshot Schema",
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "description": "Snapshot ID that corresponds to the snapshot file prefix"
    },
    "version": {
      "type": "integer",
      "description": "Incrementing integer int64"
    },
    "timestamp": {
      "type": "integer",
      "description": "Milliseconds since epoch int64"
    },
    "graphs": {
      "type": "object",
      "additionalProperties": {
        "$ref": "#/definitions/graph"
      }
    }
  },
  "required": ["id", "version", "timestamp", "graphs"],
  "additionalProperties": false,
  "definitions": {
    "graph": {
      "type": "object",
      "properties": {
        "id": {
          "type": "string",
          "description": "Internal ID - a GUID or similar"
        },
        "predicate_index": {
          "type": "object",
          "additionalProperties": {
            "type": "string",
            "description": "Partition identifier or GUID"
          }
        },
        "partitions": {
          "type": "object",
          "additionalProperties": {
            "$ref": "#/definitions/partition"
          }
        },
        "so_data_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        },
        "os_data_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        },
        "tombstone_files": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/dataFile"
          }
        }
      },
      "required": ["id", "predicate_index"]
    },
    "partition": {
      "type": "object",
      "properties": {
        "files": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      },
      "required": ["files"]
    },
    "dataFile": {
      "type": "object",
      "properties": {
        "id": { "type": "string" },
        "predicate": { "type": "string" },
        "snapshot_version": { "type": "integer" },
        "is_inverse": { "type": "boolean" },
        "path": { "type": "string" },
        "size": { "type": "integer" },
        "triple_count": { "type": "integer" },
        "max_value": { "type": "string" },
        "min_value": { "type": "string" },
        "bloom_filter": {
          "type": "string",
          "description": "Base64-encoded string representing the Bloom filter data"
        },
        "graph": { "type": "string" },
        "original_graph": { "type": "string" },
        "prefix_map": {
          "type": "object",
          "additionalProperties": {
            "type": "integer"
          }
        },
        "prefixes": {
          "type": "array",
          "items": { "type": "string" }
        }
      },
      "required": [
        "id",
        "predicate",
        "snapshot_version",
        "is_inverse",
        "path",
        "size",
        "triple_count",
        "max_value",
        "min_value",
        "bloom_filter",
        "graph",
        "original_graph",
        "prefix_map",
        "prefixes"
      ]
    }
  }
}

The following table defines the structure of the snapshot file:

Property	Type	Description	Comment
`id`	string	Snapshot ID that corresponds to the snapshot file prefix	Required
`version`	integer	Incrementing integer (int64)	Required
`timestamp`	integer	Milliseconds since epoch (int64)	Required
`graphs`	object	Details of the graphs in the snapshot, keyed by graph URI	Required. Additional properties reference `#/definitions/graph`

Graph Definition

The graph object is defined as follows:

Property	Type	Description	Comment
`id`	string	Internal ID - a GUID or similar	Required
`predicate_index`	object	Mapping of predicates to partition identifiers	Required. Additional properties are strings
`partitions`	object	Details of partitions	Additional properties reference `#/definitions/partition`
`so_data_files`	array	List of subject-object data files	Items reference `#/definitions/dataFile`
`os_data_files`	array	List of object-subject data files	Items reference `#/definitions/dataFile`
`tombstone_files`	array	List of tombstone files	Items reference `#/definitions/dataFile`

Partition Definition

The partition object is defined as follows:

Property	Type	Description	Comment
`files`	array	List of file identifiers	Items are strings

Data File Definition

The dataFile object is defined as follows:

Property	Type	Description	Comment
`id`	string	Unique identifier for the data file	Required
`predicate`	string	Predicate associated with the data file	Required
`snapshot_version`	integer	Version of the snapshot	Required
`is_inverse`	boolean	Indicates if the file is an inverse file	Required
`path`	string	Path to the data file	Required
`size`	integer	Size of the file in bytes	Required
`triple_count`	integer	Number of triples in the file	Required
`max_value`	string	Maximum value in the file	Required
`min_value`	string	Minimum value in the file	Required
`bloom_filter`	string	Base64-encoded Bloom filter data	Required
`graph`	string	Graph URI associated with the file	Required
`original_graph`	string	Original graph URI	Required
`prefix_map`	object	Mapping of prefixes to integers	Required. Additional properties are integers
`prefixes`	array	List of prefixes	Required. Items are strings

The following is an informative example of a snapshot file:


        {
            "id": "snapshot1 - id that corresponds to the snapshot file prefix",
            "version" : "incrementing integer int64",
            "timestamp": "milliseconds_since_epoch int64",
            "graphs": {
                "http://data.dataplatformsolutions.com/graph1": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "partition1 - a local non URI identifier"
                    },
                    "partitions" : {
                        "partition1" : {
                            "files": [
                                "data_file1",
                                "data_file2"
                            ]
                        }
                    }
                },
                "http://data.dataplatformsolutions.com/graph2": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "a guid"
                    },
                    "so_data_files": [
                        {
                            "id": "file123",
                            "predicate": "http://example.org/predicate",
                            "snapshot_version": 2,
                            "is_inverse": false,
                            "path": "/data/triples/file123.triples",
                            "size": 1048576,
                            "triple_count": 5000,
                            "max_value": "zeta",
                            "min_value": "alpha",
                            "bloom_filter": "SGVsbG8sIFdvcmxkIQ==",
                            "graph": "http://example.org/graph",
                            "original_graph": "http://original.example.org/graph",
                            "prefix_map": {
                                "ex": 1,
                                "foaf": 2,
                                "dc": 3
                            },
                            "prefixes": [
                                "ex",
                                "foaf",
                                "dc"
                            ]
                        }
                    ],
                    "os_data_files": [
                        
                    ],
                    "tombstone_files": [
                        
                    ]
                }
            }            
        }
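
The example above shows two ways a graph can reference its files: through `predicate_index` and `partitions`, or through `dataFile` entries listed directly on the graph. The sketch below resolves the files for one predicate under either layout. The function name `files_for_predicate` and the choice to return file identifiers for the first layout and paths for the second are assumptions made for illustration.

```python
def files_for_predicate(snapshot: dict, graph_uri: str, predicate: str) -> list[str]:
    """List the file identifiers or paths holding one predicate of one graph."""
    graph = snapshot["graphs"].get(graph_uri)
    if graph is None:
        return []
    found: list[str] = []
    # Layout 1: predicate_index -> partition identifier -> files.
    partition_id = graph.get("predicate_index", {}).get(predicate)
    partition = graph.get("partitions", {}).get(partition_id)
    if partition is not None:
        found.extend(partition["files"])
    # Layout 2: dataFile entries listed directly on the graph.
    for key in ("so_data_files", "os_data_files"):
        for data_file in graph.get(key, []):
            if data_file["predicate"] == predicate:
                found.append(data_file["path"])
    return found
```

A query engine would then open only the returned files, using each file's min, max and bloom filter metadata to skip those that cannot match.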

Triple Data File Binary Format

The binary form is designed to optimise streaming, in support of query joins and specific triple retrieval. The data file has the following binary form.

Each TripleDataFile is a sequence of encoded triples, sorted by either Subject or Object.

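Each triple is encoded as the following fields, in order. The predicate is not repeated per triple; it is given by the `predicate` property of the data file's metadata.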

Subject Prefix ID (int32, 4 bytes, little-endian)
- Integer ID for the subject's prefix, or 0 if none.
Subject Local Part Length (int32, 4 bytes, little-endian)
- Length of the subject's local part in bytes.
Subject Local Part ([]byte)
- The subject's local part, UTF-8 encoded.
IsLiteral (byte, 1 byte)
- 1 if the object is a literal, 0 otherwise.
If IsLiteral == 1:
1. Object String Length (int32, 4 bytes, little-endian)
  - Length of the literal value.
2. Object String ([]byte)
  - The literal value, UTF-8 encoded.
3. Datatype Prefix ID (int32, 4 bytes, little-endian)
  - Integer ID for the datatype's prefix, or 0 if none.
4. Datatype Local Part Length (int32, 4 bytes, little-endian)
  - Length of the datatype's local part.
5. Datatype Local Part ([]byte)
  - The datatype's local part, UTF-8 encoded.
If IsLiteral == 0:
1. Object Prefix ID (int32, 4 bytes, little-endian)
  - Integer ID for the object's prefix, or 0 if none.
2. Object Local Part Length (int32, 4 bytes, little-endian)
  - Length of the object's local part.
3. Object Local Part ([]byte)
  - The object's local part, UTF-8 encoded.

Notes:

All integer values are little-endian.
Prefix IDs are mapped to actual prefix strings in the file's metadata.
The file is a concatenation of all encoded triples, with no explicit delimiter.
Triples are ordered by Subject in the SO data files and by Object in the OS data files.
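
The encoding above can be sketched as a small writer and streaming reader. This is an illustrative sketch rather than a reference implementation: the `Triple` tuple and the function names are hypothetical, and lengths are treated as signed 32-bit integers (`<i`), which is an assumption about "int32".

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class Triple(NamedTuple):
    subject_prefix: int
    subject_local: str
    is_literal: bool
    object_prefix: int        # 0 and unused when is_literal is True
    object_value: str         # literal value, or the object's local part
    datatype_prefix: int = 0  # only written for literals
    datatype_local: str = ""  # only written for literals


def _pack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    return struct.pack("<i", len(data)) + data  # length prefix, then bytes


def encode_triple(t: Triple) -> bytes:
    out = struct.pack("<i", t.subject_prefix) + _pack_str(t.subject_local)
    out += struct.pack("<B", 1 if t.is_literal else 0)
    if t.is_literal:
        out += _pack_str(t.object_value)
        out += struct.pack("<i", t.datatype_prefix) + _pack_str(t.datatype_local)
    else:
        out += struct.pack("<i", t.object_prefix) + _pack_str(t.object_value)
    return out


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated triple data file")
    return data


def _read_int(stream: BinaryIO) -> int:
    return struct.unpack("<i", _read_exact(stream, 4))[0]


def _read_str(stream: BinaryIO) -> str:
    return _read_exact(stream, _read_int(stream)).decode("utf-8")


def decode_triples(stream: BinaryIO) -> Iterator[Triple]:
    """Yield triples until the stream ends; there is no delimiter between them."""
    while True:
        head = stream.read(4)
        if not head:
            return  # clean end of file
        if len(head) != 4:
            raise EOFError("truncated triple data file")
        subject_prefix = struct.unpack("<i", head)[0]
        subject_local = _read_str(stream)
        if _read_exact(stream, 1) == b"\x01":
            value = _read_str(stream)
            dt_prefix = _read_int(stream)
            dt_local = _read_str(stream)
            yield Triple(subject_prefix, subject_local, True, 0, value, dt_prefix, dt_local)
        else:
            object_prefix = _read_int(stream)
            yield Triple(subject_prefix, subject_local, False, object_prefix, _read_str(stream))


# Round trip: one URI object and one literal object, concatenated.
triples = [Triple(1, "alice", False, 1, "bob"),
           Triple(1, "alice", True, 0, "Alice", 2, "string")]
blob = b"".join(encode_triple(t) for t in triples)
decoded = list(decode_triples(io.BytesIO(blob)))
```

The reader never needs to seek, so a file can be consumed as a stream and merged with other sorted streams during a join.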