Version: DRAFT 0.1.0
Date: November 19, 2024
Author: Graham Moore
Contact: graham.moore@dataplatformsolutions.com
Inspired by formats such as Apache Iceberg[1], DeltaLake[2], and Parquet[3], GraphLake is a file format and a set of metadata structures designed to support high-performance, highly parallel processing of graph queries. GraphLake is graph native in how it stores and partitions data. It is defined as an open specification to encourage others to adopt this approach. Because the data is kept separate from the compute, GraphLake lets graph engines scale in many different ways.
The following goals have helped to shape the GraphLake format:
GraphLake is a file format that stores data in a directory structure. The directory structure is designed to support partitioning and metadata storage, and the data is stored in files within it. The directory structure can be treated as virtual, which aligns nicely with object storage systems such as S3, MinIO and Azure Blob Storage.
The approach is to partition on predicates and then to store triples in data files. Each data file has a corresponding metadata element that contains useful metadata such as a bloom filter and min and max values. Each triple is added to two files, the subject-object (SO) file and the object-subject (OS) file. The SO file is sorted by subject and the OS file is sorted by object.
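The partitioning described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the GraphLake writer: the function name `build_partitions` and the sample triples are hypothetical, and breaking ties on the second element of each pair is an assumption, since the text only states the primary sort key.

```python
from collections import defaultdict


def build_partitions(triples):
    """Group (subject, predicate, object) triples by predicate and
    return, per predicate, the pairs in SO order and in OS order."""
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        by_predicate[p].append((s, o))

    partitions = {}
    for predicate, pairs in by_predicate.items():
        partitions[predicate] = {
            # Subject-object ordering: sorted by subject
            # (ties broken by object, which is an assumption).
            "so": sorted(pairs),
            # Object-subject ordering: the same pairs flipped and
            # sorted by object (ties broken by subject).
            "os": sorted((o, s) for s, o in pairs),
        }
    return partitions


# Hypothetical sample data.
triples = [
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:name", "Alice"),
]
partitions = build_partitions(triples)
```

Keeping both orderings lets a reader answer "what does this subject point to" and "what points to this object" with a sorted scan of a single file.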
Deletes are supported with the tombstone construct. A tombstone file uses exactly the same data format as the SO and OS data files.
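One way a reader might apply tombstones is sketched below. The text above does not say how tombstones are reconciled with data files, so this assumes the simplest rule: a (subject, object) pair recorded in a tombstone file for a predicate hides the matching pair in that predicate's data files. The function name `visible_pairs` is hypothetical.

```python
def visible_pairs(so_pairs, tombstone_pairs):
    """Return the (subject, object) pairs not masked by a tombstone.

    Assumption: a tombstone entry hides every matching pair,
    regardless of which snapshot wrote it.
    """
    deleted = set(tombstone_pairs)
    return [pair for pair in so_pairs if pair not in deleted]


# Hypothetical sample data for one predicate partition.
so_pairs = [("ex:alice", "ex:carol"), ("ex:bob", "ex:alice")]
tombstones = [("ex:bob", "ex:alice")]
remaining = visible_pairs(so_pairs, tombstones)
```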
/root
  store.json
  /snapshots
    snapshot1.json
    ..
    snapshotN.json
  /graphs
    /graph1
      /predicate1
        data_file1.bin
        data_file2.bin
        data_file3.bin
    /graph2
      /predicate1
        data_file4.bin
        data_file5.bin
        data_file6.bin
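Because the layout is virtual, a client can derive file locations as plain keys rather than walking a real file system. The sketch below builds such keys with the standard library. The helper names are hypothetical, and naming the snapshot file `<id>.json` is an assumption based on the example layout.

```python
from pathlib import PurePosixPath


def snapshot_key(root, snapshot_id):
    """Key of a snapshot file, assuming it is named <snapshot_id>.json."""
    return str(PurePosixPath(root) / "snapshots" / f"{snapshot_id}.json")


def data_file_key(root, graph_dir, predicate_dir, file_name):
    """Key of a data file under /graphs/<graph>/<predicate>/."""
    return str(PurePosixPath(root) / "graphs" / graph_dir / predicate_dir / file_name)


key = data_file_key("/root", "graph1", "predicate1", "data_file1.bin")
```

`PurePosixPath` is used so that keys always use forward slashes, which matches how object stores such as S3 treat key prefixes.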
The store file contains metadata about the store. It contains information about the snapshots that comprise the store. The store file is a JSON file of the following form:
{
"format_version": "0.1",
"location" : "/graphlake/store1",
"snapshots": [
{
"id": "snapshot1 - id that corresponds to the snapshot file prefix",
"version" : "incrementing integer int64",
"timestamp": "milliseconds_since_epoch int64"
}
]
}
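As an illustration of how a reader might use the store file, the following sketch parses it and picks the snapshot with the highest version. The function name `latest_snapshot` and the concrete ids, versions and timestamps are hypothetical. Treating the highest version as the current snapshot is an assumption.

```python
import json


def latest_snapshot(store_json_text):
    """Return the snapshot entry with the highest version, or None."""
    store = json.loads(store_json_text)
    snapshots = store.get("snapshots", [])
    if not snapshots:
        return None
    return max(snapshots, key=lambda snap: snap["version"])


# Hypothetical store file with concrete values in place of the
# descriptive placeholders used above.
example_store = """
{
  "format_version": "0.1",
  "location": "/graphlake/store1",
  "snapshots": [
    {"id": "snapshot1", "version": 1, "timestamp": 1732000000000},
    {"id": "snapshot2", "version": 2, "timestamp": 1732003600000}
  ]
}
"""
latest = latest_snapshot(example_store)
```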
The snapshot file contains the details of the graphs and the sets of files that comprise a given snapshot.
The following JSON schema defines the shape of the snapshot file:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Snapshot Schema",
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Snapshot ID that corresponds to the snapshot file prefix"
},
"version": {
"type": "integer",
"description": "Incrementing integer int64"
},
"timestamp": {
"type": "integer",
"description": "Milliseconds since epoch int64"
},
"graphs": {
"type": "object",
"patternProperties": {
"^.*$": {
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Internal ID - a GUID or similar"
},
"predicate_index": {
"type": "object",
"patternProperties": {
"^.*$": {
"type": "string",
"description": "Partition identifier or GUID"
}
},
"additionalProperties": false
},
"partitions": {
"type": "object",
"patternProperties": {
"^.*$": {
"type": "object",
"properties": {
"files": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["files"],
"additionalProperties": false
}
},
"additionalProperties": false
},
"so_data_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
},
"os_data_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
},
"tombstone_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
}
},
"required": ["id", "predicate_index"],
"additionalProperties": false
}
},
"additionalProperties": false
}
},
"required": ["id", "version", "timestamp", "graphs"],
"additionalProperties": false,
"definitions": {
"dataFile": {
"type": "object",
"properties": {
"id": { "type": "string" },
"predicate": { "type": "string" },
"snapshot_version": { "type": "integer" },
"is_inverse": { "type": "boolean" },
"path": { "type": "string" },
"size": { "type": "integer" },
"triple_count": { "type": "integer" },
"max_value": { "type": "string" },
"min_value": { "type": "string" },
"bloom_filter": {
"type": "string",
"description": "Base64-encoded string representing the Bloom filter data"
},
"graph": { "type": "string" },
"original_graph": { "type": "string" },
"prefix_map": {
"type": "object",
"patternProperties": {
"^.*$": {
"type": "integer"
}
},
"additionalProperties": false
},
"prefixes": {
"type": "array",
"items": { "type": "string" }
}
},
"required": [
"id",
"predicate",
"snapshot_version",
"is_inverse",
"path",
"size",
"triple_count",
"max_value",
"min_value",
"bloom_filter",
"graph",
"original_graph",
"prefix_map",
"prefixes"
],
"additionalProperties": false
}
}
}
The following is an informative example of a snapshot file:
{
"id": "snapshot1 - id that corresponds to the snapshot file prefix",
"version" : "incrementing integer int64",
"timestamp": "milliseconds_since_epoch int64",
"graphs": {
"http://data.dataplatformsolutions.com/graph1": {
"id": "internal_id - a guid or similar",
"predicate_index" : {
"http://example.com/name" : "partition1 - a local non URI identifier"
},
"partitions" : {
"partition1" : {
"files": [
"data_file1",
"data_file2"
]
}
}
},
"http://data.dataplatformsolutions.com/graph2": {
"id": "internal_id - a guid or similar",
"predicate_index" : {
"http://example.com/name" : "a guid"
},
"so_data_files": [
{
"id": "file123",
"predicate": "http://example.org/predicate",
"snapshot_version": 2,
"is_inverse": false,
"path": "/data/triples/file123.triples",
"size": 1048576,
"triple_count": 5000,
"max_value": "zeta",
"min_value": "alpha",
"bloom_filter": "SGVsbG8sIFdvcmxkIQ==",
"graph": "http://example.org/graph",
"original_graph": "http://original.example.org/graph",
"prefix_map": {
"ex": 1,
"foaf": 2,
"dc": 3
},
"prefixes": [
"ex",
"foaf",
"dc"
]
}
],
"os_data_files": [
],
"tombstone_files": [
]
}
}
}
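To show how the pieces of a snapshot fit together, the sketch below resolves a predicate to its files by following `predicate_index` to a partition identifier and then reading that partition's `files` list. The function name `files_for_predicate` is hypothetical, and the snapshot literal is a trimmed copy of the example with concrete values in place of the descriptive placeholders.

```python
def files_for_predicate(snapshot, graph_uri, predicate_uri):
    """Return the file names holding a predicate's triples in a graph."""
    graph = snapshot["graphs"].get(graph_uri)
    if graph is None:
        return []
    # predicate_index maps a predicate URI to a local partition id.
    partition_id = graph["predicate_index"].get(predicate_uri)
    if partition_id is None:
        return []
    partition = graph.get("partitions", {}).get(partition_id, {})
    return list(partition.get("files", []))


# Hypothetical snapshot with concrete values.
snapshot = {
    "id": "snapshot1",
    "version": 1,
    "timestamp": 1732000000000,
    "graphs": {
        "http://data.dataplatformsolutions.com/graph1": {
            "id": "g1",
            "predicate_index": {"http://example.com/name": "partition1"},
            "partitions": {
                "partition1": {"files": ["data_file1", "data_file2"]}
            },
        }
    },
}
files = files_for_predicate(
    snapshot,
    "http://data.dataplatformsolutions.com/graph1",
    "http://example.com/name",
)
```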
The data files are a binary representation of the following form: