GraphLake File Format Specification

Version: DRAFT 0.1.0

Date: November 19, 2024

Author: Graham Moore

Contact: graham.moore@dataplatformsolutions.com

Introduction

Inspired by formats such as Apache Iceberg[1], DeltaLake[2], and Parquet[3], GraphLake is a file format and set of metadata structures designed to support high-performance, highly parallel processing of graph queries. GraphLake is graph-native in how it stores and partitions data. It is defined as an open specification to encourage others to adopt this approach. Because the data is kept separate from the compute, graph engines built on GraphLake can scale in many different ways.

Goals

The following goals have helped to shape the GraphLake format:

File Structure Overview

GraphLake stores data in a directory structure designed to support partitioning and metadata storage. The data itself is stored in files within that structure. The directory structure can be treated as virtual, which aligns nicely with object storage systems such as S3, MinIO, and Azure Blob Storage.

The approach is to partition on predicates and then to store triples in data files. Each data file has a corresponding metadata element that contains useful metadata such as a Bloom filter and min and max values. Each triple is added to two files: the subject-object (SO) file and the object-subject (OS) file. The SO file is sorted by subject and the OS file is sorted by object.
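
The partitioning described above can be illustrated with a minimal in-memory sketch. It is not the binary data file encoding; it only shows triples being grouped by predicate and each group being kept in two orderings. The function name `build_partitions`, the tuple representation, and the use of the second component as a tie-breaker in the sort are assumptions made for illustration.

```python
from collections import defaultdict


def build_partitions(triples):
    """Group (subject, predicate, object) triples by predicate and
    keep each group in two orderings (illustrative model only)."""
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, o))

    partitions = {}
    for p, pairs in by_predicate.items():
        partitions[p] = {
            # SO ordering: (subject, object) pairs sorted by subject.
            # Using object as the tie-breaker is an assumption.
            "so": sorted(pairs),
            # OS ordering: the same pairs flipped to (object, subject)
            # and sorted by object.
            "os": sorted((o, s) for s, o in pairs),
        }
    return partitions


triples = [
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", "Alice"),
]
partitions = build_partitions(triples)
```

Because the predicate is implied by the partition, each entry only needs to carry the subject and the object, and every triple appears once in each ordering.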

Deletes are supported with the tombstone construct. A tombstone file uses exactly the same data format as the SO and OS data files.
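
As a rough illustration of how a reader might use tombstones, the sketch below filters the pairs read from a data file against the pairs read from a tombstone file. This section does not define how tombstones are applied or how they interact with snapshot versions, so the rule used here (a tombstone entry hides an identical pair in the same graph and predicate) is an assumption, as is the helper name `visible_pairs`.

```python
def visible_pairs(data_pairs, tombstone_pairs):
    """Return the pairs from a data file that no tombstone entry matches.

    Assumption: a tombstone hides a pair only when it is identical to it.
    """
    deleted = set(tombstone_pairs)
    return [pair for pair in data_pairs if pair not in deleted]


so_pairs = [("ex:alice", "ex:carol"), ("ex:bob", "ex:alice")]
tombstones = [("ex:bob", "ex:alice")]
remaining = visible_pairs(so_pairs, tombstones)
```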

Directory & File Structure


        /root
            store.json
            /snapshots
                snapshot1.json
                ..
                snapshotN.json
            /graphs
                /graph1
                    /predicate1
                        data_file1.bin
                        data_file2.bin
                        data_file3.bin
                /graph2
                    /predicate1
                        data_file4.bin
                        data_file5.bin
                        data_file6.bin
    

Store File

The store file contains metadata about the store. It contains information about the snapshots that comprise the store. The store file is a JSON file of the following form:


        {
            "format_version": "0.1",
            "location" : "/graphlake/store1",
            "snapshots": [
                {
                    "id": "snapshot1 - id that corresponds to the snapshot file prefix",
                    "version" : "incrementing integer int64",
                    "timestamp": "milliseconds_since_epoch int64"
                }
            ]
        }
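
A reader would typically start from the store file to find the snapshot to load. The sketch below parses a store file and picks the entry with the highest `version`. It assumes that `version` and `timestamp` are stored as JSON integers, that the highest version is the current snapshot, and that the snapshot file lives at `<location>/snapshots/<id>.json`. None of these is stated normatively above, and the helper names and sample values are hypothetical.

```python
import json

# Hypothetical store file content with concrete values filled in.
STORE_JSON = """
{
    "format_version": "0.1",
    "location": "/graphlake/store1",
    "snapshots": [
        {"id": "snapshot1", "version": 1, "timestamp": 1732000000000},
        {"id": "snapshot2", "version": 2, "timestamp": 1732000600000}
    ]
}
"""


def latest_snapshot(store):
    """Return the snapshot entry with the highest version, or None if there are none."""
    snapshots = store.get("snapshots", [])
    return max(snapshots, key=lambda s: s["version"], default=None)


def snapshot_path(store, snapshot):
    """Build the assumed snapshot file path from the store location and snapshot id."""
    return f'{store["location"]}/snapshots/{snapshot["id"]}.json'


store = json.loads(STORE_JSON)
current = latest_snapshot(store)
path = snapshot_path(store, current)
```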
    

Snapshot File

The snapshot file contains the details of the graphs and the sets of files that comprise a given snapshot.

The following JSON schema defines the shape of the snapshot file:

        
            {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "title": "Snapshot Schema",
                "type": "object",
                "properties": {
                  "id": {
                    "type": "string",
                    "description": "Snapshot ID that corresponds to the snapshot file prefix"
                  },
                  "version": {
                    "type": "integer",
                    "description": "Incrementing integer int64"
                  },
                  "timestamp": {
                    "type": "integer",
                    "description": "Milliseconds since epoch int64"
                  },
                  "graphs": {
                    "type": "object",
                    "patternProperties": {
                      "^.*$": {
                        "type": "object",
                        "properties": {
                          "id": {
                            "type": "string",
                            "description": "Internal ID - a GUID or similar"
                          },
                          "predicate_index": {
                            "type": "object",
                            "patternProperties": {
                              "^.*$": {
                                "type": "string",
                                "description": "Partition identifier or GUID"
                              }
                            },
                            "additionalProperties": false
                          },
                          "partitions": {
                            "type": "object",
                            "patternProperties": {
                              "^.*$": {
                                "type": "object",
                                "properties": {
                                  "files": {
                                    "type": "array",
                                    "items": {
                                      "type": "string"
                                    }
                                  }
                                },
                                "required": ["files"],
                                "additionalProperties": false
                              }
                            },
                            "additionalProperties": false
                          },
                          "so_data_files": {
                            "type": "array",
                            "items": {
                              "$ref": "#/definitions/dataFile"
                            }
                          },
                          "os_data_files": {
                            "type": "array",
                            "items": {
                              "$ref": "#/definitions/dataFile"
                            }
                          },
                          "tombstone_files": {
                            "type": "array",
                            "items": {
                              "$ref": "#/definitions/dataFile"
                            }
                          }
                        },
                        "required": ["id", "predicate_index"],
                        "additionalProperties": false
                      }
                    },
                    "additionalProperties": false
                  }
                },
                "required": ["id", "version", "timestamp", "graphs"],
                "additionalProperties": false,
                "definitions": {
                  "dataFile": {
                    "type": "object",
                    "properties": {
                      "id": { "type": "string" },
                      "predicate": { "type": "string" },
                      "snapshot_version": { "type": "integer" },
                      "is_inverse": { "type": "boolean" },
                      "path": { "type": "string" },
                      "size": { "type": "integer" },
                      "triple_count": { "type": "integer" },
                      "max_value": { "type": "string" },
                      "min_value": { "type": "string" },
                      "bloom_filter": {
                        "type": "string",
                        "description": "Base64-encoded string representing the Bloom filter data"
                      },
                      "graph": { "type": "string" },
                      "original_graph": { "type": "string" },
                      "prefix_map": {
                        "type": "object",
                        "patternProperties": {
                          "^.*$": {
                            "type": "integer"
                          }
                        },
                        "additionalProperties": false
                      },
                      "prefixes": {
                        "type": "array",
                        "items": { "type": "string" }
                      }
                    },
                    "required": [
                      "id",
                      "predicate",
                      "snapshot_version",
                      "is_inverse",
                      "path",
                      "size",
                      "triple_count",
                      "max_value",
                      "min_value",
                      "bloom_filter",
                      "graph",
                      "original_graph",
                      "prefix_map",
                      "prefixes"
                    ],
                    "additionalProperties": false
                  }
                }
              }
        
    

The following is an informative example:


        {
            "id": "snapshot1 - id that corresponds to the snapshot file prefix",
            "version" : "incrementing integer int64",
            "timestamp": "milliseconds_since_epoch int64",
            "graphs": {
                "http://data.dataplatformsolutions.com/graph1": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "partition1 - a local non URI identifier"
                    },
                    "partitions" : {
                        "partition1" : {
                            "files": [
                                "data_file1",
                                "data_file2"
                            ]
                        }
                    }
                },
                "http://data.dataplatformsolutions.com/graph2": {
                    "id": "internal_id - a guid or similar",
                    "predicate_index" : {
                        "http://example.com/name" : "a guid"
                    },
                    "so_data_files": [
                        {
                            "id": "file123",
                            "predicate": "http://example.org/predicate",
                            "snapshot_version": 2,
                            "is_inverse": false,
                            "path": "/data/triples/file123.triples",
                            "size": 1048576,
                            "triple_count": 5000,
                            "max_value": "zeta",
                            "min_value": "alpha",
                            "bloom_filter": "SGVsbG8sIFdvcmxkIQ==",
                            "graph": "http://example.org/graph",
                            "original_graph": "http://original.example.org/graph",
                            "prefix_map": {
                                "ex": 1,
                                "foaf": 2,
                                "dc": 3
                            },
                            "prefixes": [
                                "ex",
                                "foaf",
                                "dc"
                            ]
                        }
                    ],
                    "os_data_files": [],
                    "tombstone_files": []
                }
            }            
        }
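
To show how a reader might use this structure, the sketch below resolves a predicate to its partition files through `predicate_index` and `partitions`, and narrows a list of data file entries using `min_value` and `max_value`. It assumes that the min and max values describe the file's sort key and compare as plain strings, neither of which is stated in this section. The function names and the sample data are hypothetical.

```python
def files_for_predicate(snapshot, graph_name, predicate):
    """Resolve a predicate to the file names listed for its partition."""
    graph = snapshot["graphs"][graph_name]
    partition_id = graph["predicate_index"].get(predicate)
    if partition_id is None:
        return []
    partition = graph.get("partitions", {}).get(partition_id)
    return list(partition["files"]) if partition else []


def candidate_data_files(data_files, predicate, key):
    """Keep data files for the predicate whose [min_value, max_value] range
    could contain key. Assumption: values compare as plain strings."""
    return [
        f for f in data_files
        if f["predicate"] == predicate and f["min_value"] <= key <= f["max_value"]
    ]


# Hypothetical snapshot fragment following the shape of the example above.
snapshot = {
    "graphs": {
        "http://data.dataplatformsolutions.com/graph1": {
            "id": "g1",
            "predicate_index": {"http://example.com/name": "partition1"},
            "partitions": {"partition1": {"files": ["data_file1", "data_file2"]}},
        }
    }
}
so_data_files = [
    {"id": "file123", "predicate": "http://example.org/predicate",
     "min_value": "alpha", "max_value": "zeta"},
]

files = files_for_predicate(
    snapshot, "http://data.dataplatformsolutions.com/graph1", "http://example.com/name"
)
hits = candidate_data_files(so_data_files, "http://example.org/predicate", "gamma")
misses = candidate_data_files(so_data_files, "http://example.org/predicate", "zulu")
```

A range check like this can only rule files out. A file that passes may still not contain the key, which is where the Bloom filter in the file metadata would be consulted next.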
    

Data File

The data files are a binary representation of the following form: