Version: DRAFT 0.1.0
Date: November 19, 2024
Author: Graham Moore
Contact: graham.moore@dataplatformsolutions.com
Inspired by formats such as Apache Iceberg[1], DeltaLake[2], and Parquet[3], GraphLake is a file format and a set of metadata structures designed to support high-performance, highly parallel processing of graph queries. GraphLake is graph native in how it stores and partitions data. It is defined as an open specification to encourage others to utilise and adopt this approach. Because the data is kept separate from the compute, graph engines built on GraphLake can scale in many different ways.
The following goals have helped to shape the GraphLake format:
GraphLake stores data in files within a directory structure that is designed to support partitioning and metadata storage. The directory structure can be treated as virtual, which aligns nicely with object storage systems such as S3, MinIO and Azure Blob Storage.
The approach is to partition on graphs then predicates, and then to store triples in data files. Each data file has a corresponding metadata construct that contains useful metadata such as a bloom filter, min and max values. Each triple is added to two files, the subject-object file and the object-subject file. The triples in these files are sorted based on either subject or object.
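As a rough illustration of the paragraph above, the following sketch builds the two sort orders for the triples of one graph/predicate partition and derives min and max values for the metadata. The `Triple` type, the function names and the choice of the secondary sort key are assumptions made for the example; they are not defined by this specification.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

def build_so_and_os(triples):
    """Return (so_rows, os_rows) for the triples of one graph/predicate partition.

    Every triple appears in both lists: once ordered by subject (SO file)
    and once ordered by object (OS file). The tie-break key is an assumption.
    """
    so_rows = sorted(triples, key=lambda t: (t.subject, t.obj))
    os_rows = sorted(triples, key=lambda t: (t.obj, t.subject))
    return so_rows, os_rows

def min_max(rows, key):
    """Compute the min and max of the sort key, as might be kept in file metadata."""
    values = [key(r) for r in rows]
    return min(values), max(values)

triples = [
    Triple("http://example.com/bob", "http://example.com/knows", "http://example.com/alice"),
    Triple("http://example.com/alice", "http://example.com/knows", "http://example.com/carol"),
]
so_rows, os_rows = build_so_and_os(triples)
so_min, so_max = min_max(so_rows, lambda t: t.subject)
```

Keeping both orders means a lookup by subject and a lookup by object can each be served by a sequential scan of a sorted file.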
To reduce duplicated data, each data file has metadata that maintains a mapping between the URI prefix and an integer value. This integer value is then used in the binary representation. Full URIs can be reconstituted on read.
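A minimal sketch of such a prefix mapping is shown below. The rule for splitting a URI (at the last `#` or `/`), the `PrefixMap` class and its method names are assumptions for illustration only; the specification text above does not fix how a prefix is chosen or how integers are assigned.

```python
def split_uri(uri):
    """Split a URI into (prefix, local name) at the last '#' or '/' (assumed rule)."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]

class PrefixMap:
    """Hypothetical per-file mapping between URI prefixes and integer ids."""

    def __init__(self):
        self.prefix_to_id = {}  # prefix string -> integer id
        self.prefixes = []      # integer id -> prefix string

    def encode(self, uri):
        """Return (prefix_id, local_name), registering the prefix if it is new."""
        prefix, local = split_uri(uri)
        if prefix not in self.prefix_to_id:
            self.prefix_to_id[prefix] = len(self.prefixes)
            self.prefixes.append(prefix)
        return self.prefix_to_id[prefix], local

    def decode(self, prefix_id, local):
        """Reconstitute the full URI on read."""
        return self.prefixes[prefix_id] + local

pm = PrefixMap()
name_ref = pm.encode("http://example.com/name")
age_ref = pm.encode("http://example.com/age")
```

Both example URIs share one prefix, so the long string `http://example.com/` would be stored once in the metadata rather than once per triple.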
GraphLake supports triple reification by deriving a hash256 value from the complete data of the triple (excluding the graph). Because this value can always be computed, it is not explicitly stored with the triple. To aid retrieval of triples by id, additional bloom filters are added to the data file metadata structure.
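The sketch below shows one way a reader or writer could compute such an identifier. It assumes that "hash256" means SHA-256 and that the triple is serialised as length-prefixed UTF-8 strings in subject, predicate, object order; neither detail is stated in this section, so treat both as placeholders.

```python
import hashlib

def triple_id(subject, predicate, obj):
    """Derive a 256-bit identifier from a triple, ignoring the graph.

    Assumptions: SHA-256 as the hash, and a length-prefixed UTF-8
    serialisation of the three parts.
    """
    h = hashlib.sha256()
    for part in (subject, predicate, obj):
        data = part.encode("utf-8")
        # The length prefix keeps ("ab", "c") distinct from ("a", "bc").
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()

example_id = triple_id(
    "http://example.com/alice",
    "http://example.com/knows",
    "http://example.com/bob",
)
```

Since the identifier is a pure function of the triple, any reader can recompute it on demand, which is why it does not need to be stored.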
Deletes are supported with the tombstone construct. A tombstone file uses exactly the same data format as the SO and OS data files. When querying for triples, they should first be located in the data files and then checked for existence in the tombstone files. Processing applications are free to re-write data files and remove deleted triples.
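The read path described above can be sketched as a merge over two sorted inputs. The function name `live_triples` and the representation of rows as tuples are assumptions for the example; the sketch only relies on data and tombstone rows sharing the same sort order, and it treats a triple found in a tombstone file as deleted.

```python
def live_triples(data_rows, tombstone_rows):
    """Yield rows from data_rows that do not appear in tombstone_rows.

    Both inputs must be sorted in the same order, which lets the check
    run as a single forward pass over each input.
    """
    tombstones = iter(tombstone_rows)
    current = next(tombstones, None)
    for row in data_rows:
        # Skip tombstones that sort before the current data row.
        while current is not None and current < row:
            current = next(tombstones, None)
        if current is None or current != row:
            yield row

data = [("alice", "bob"), ("alice", "carol"), ("dave", "erin")]
tombstones = [("alice", "carol")]
remaining = list(live_triples(data, tombstones))
```

Because files are immutable, the deleted triple stays in its data file until a processing application chooses to re-write that file.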
All files are immutable.
/root
    store.json
    /snapshots
        snapshot1.json
        ..
        snapshotN.json
    /graphs
        /graph1
            /predicate1
                data_file1.bin
                data_file2.bin
                data_file3.bin
        /graph2
            /predicate1
                data_file4.bin
                data_file5.bin
                data_file6.bin
The store file contains metadata about the store, including information about the snapshots that comprise the store. The store file is a JSON file of the following form:
{
"format_version": "0.1",
"location" : "/graphlake/store1",
"snapshots": [
{
"id": "snapshot1 - id that corresponds to the snapshot file prefix",
"version" : "incrementing integer int64",
"timestamp": "milliseconds_since_epoch int64"
}
]
}
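As a hedged sketch of how a reader might use this structure, the code below picks the entry with the highest `version` and builds a path to its snapshot file. The helper names are hypothetical, and the path layout (`location`, then `/snapshots/`, then the id with a `.json` suffix) is an assumption inferred from the directory structure shown earlier.

```python
import json

def latest_snapshot(store_doc):
    """Return the snapshot entry with the highest version, or None if there are none."""
    snapshots = store_doc.get("snapshots", [])
    if not snapshots:
        return None
    return max(snapshots, key=lambda s: int(s["version"]))

def snapshot_path(store_doc, snapshot):
    """Build the snapshot file path (assumed layout: <location>/snapshots/<id>.json)."""
    return f'{store_doc["location"]}/snapshots/{snapshot["id"]}.json'

store_doc = json.loads("""
{
  "format_version": "0.1",
  "location": "/graphlake/store1",
  "snapshots": [
    {"id": "snapshot1", "version": 1, "timestamp": 1732000000000},
    {"id": "snapshot2", "version": 2, "timestamp": 1732000600000}
  ]
}
""")
newest = latest_snapshot(store_doc)
newest_path = snapshot_path(store_doc, newest)
```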
The snapshot file contains the details of the graphs and the sets of files that comprise a given snapshot.
The following JSON schema defines the shape of the snapshot file:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Snapshot Schema",
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Snapshot ID that corresponds to the snapshot file prefix"
},
"version": {
"type": "integer",
"description": "Incrementing integer int64"
},
"timestamp": {
"type": "integer",
"description": "Milliseconds since epoch int64"
},
"graphs": {
"type": "object",
"additionalProperties": {
"$ref": "#/definitions/graph"
}
}
},
"required": ["id", "version", "timestamp", "graphs"],
"additionalProperties": false,
"definitions": {
"graph": {
"type": "object",
"properties": {
"id": {
"type": "string",
"description": "Internal ID - a GUID or similar"
},
"predicate_index": {
"type": "object",
"additionalProperties": {
"type": "string",
"description": "Partition identifier or GUID"
}
},
"partitions": {
"type": "object",
"additionalProperties": {
"$ref": "#/definitions/partition"
}
},
"so_data_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
},
"os_data_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
},
"tombstone_files": {
"type": "array",
"items": {
"$ref": "#/definitions/dataFile"
}
}
},
"required": ["id", "predicate_index"]
},
"partition": {
"type": "object",
"properties": {
"files": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["files"]
},
"dataFile": {
"type": "object",
"properties": {
"id": { "type": "string" },
"predicate": { "type": "string" },
"snapshot_version": { "type": "integer" },
"is_inverse": { "type": "boolean" },
"path": { "type": "string" },
"size": { "type": "integer" },
"triple_count": { "type": "integer" },
"max_value": { "type": "string" },
"min_value": { "type": "string" },
"bloom_filter": {
"type": "string",
"description": "Base64-encoded string representing the Bloom filter data"
},
"graph": { "type": "string" },
"original_graph": { "type": "string" },
"prefix_map": {
"type": "object",
"additionalProperties": {
"type": "integer"
}
},
"prefixes": {
"type": "array",
"items": { "type": "string" }
}
},
"required": [
"id",
"predicate",
"snapshot_version",
"is_inverse",
"path",
"size",
"triple_count",
"max_value",
"min_value",
"bloom_filter",
"graph",
"original_graph",
"prefix_map",
"prefixes"
]
}
}
}
The following table defines the structure of the snapshot file:
Property | Type | Description | Comment |
---|---|---|---|
id | string | Snapshot ID that corresponds to the snapshot file prefix | Required |
version | integer | Incrementing integer (int64) | Required |
timestamp | integer | Milliseconds since epoch (int64) | Required |
graphs | object | Details of the graphs in the snapshot | Additional properties reference #/definitions/graph |
The graph object is defined as follows:
Property | Type | Description | Comment |
---|---|---|---|
id | string | Internal ID - a GUID | Required |
predicate_index | object | Mapping of predicates to partitions | Additional properties are strings |
partitions | object | Details of partitions | Additional properties reference #/definitions/partition |
so_data_files | array | List of subject-object data files | Items reference #/definitions/dataFile |
os_data_files | array | List of object-subject data files | Items reference #/definitions/dataFile |
tombstone_files | array | List of tombstone files | Items reference #/definitions/dataFile |
The partition object is defined as follows:
Property | Type | Description | Comment |
---|---|---|---|
files | array | List of file identifiers | Items are strings |
The dataFile object is defined as follows:
Property | Type | Description | Comment |
---|---|---|---|
id | string | Unique identifier for the data file | Required |
predicate | string | Predicate associated with the data file | Required |
snapshot_version | integer | Version of the snapshot | Required |
is_inverse | boolean | Indicates if the file is an inverse file | Required |
path | string | Path to the data file | Required |
size | integer | Size of the file in bytes | Required |
triple_count | integer | Number of triples in the file | Required |
max_value | string | Maximum value in the file | Required |
min_value | string | Minimum value in the file | Required |
bloom_filter | string | Base64-encoded Bloom filter data | Required |
graph | string | Graph URI associated with the file | Required |
original_graph | string | Original graph URI | Required |
prefix_map | object | Mapping of prefixes to integers | Additional properties are integers |
prefixes | array | List of prefixes | Items are strings |
The following is an informative example of a snapshot file:
{
"id": "snapshot1 - id that corresponds to the snapshot file prefix",
"version" : "incrementing integer int64",
"timestamp": "milliseconds_since_epoch int64",
"graphs": {
"http://data.dataplatformsolutions.com/graph1": {
"id": "internal_id - a guid or similar",
"predicate_index" : {
"http://example.com/name" : "partition1 - a local non URI identifier"
},
"partitions" : {
"partition1" : {
"files": [
"data_file1",
"data_file2"
]
}
}
},
"http://data.dataplatformsolutions.com/graph2": {
"id": "internal_id - a guid or similar",
"predicate_index" : {
"http://example.com/name" : "a guid"
},
"so_data_files": [
{
"id": "file123",
"predicate": "http://example.org/predicate",
"snapshot_version": 2,
"is_inverse": false,
"path": "/data/triples/file123.triples",
"size": 1048576,
"triple_count": 5000,
"max_value": "zeta",
"min_value": "alpha",
"bloom_filter": "SGVsbG8sIFdvcmxkIQ==",
"graph": "http://example.org/graph",
"original_graph": "http://original.example.org/graph",
"prefix_map": {
"ex": 1,
"foaf": 2,
"dc": 3
},
"prefixes": [
"ex",
"foaf",
"dc"
]
}
],
"os_data_files": [
],
"tombstone_files": [
]
}
}
}
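To show how the snapshot metadata might be used during query planning, the sketch below resolves a predicate to its partition files and prunes data files by their min and max values. The function names are hypothetical, and it is an assumption that `min_value` and `max_value` can be compared as plain strings against the sort key. The Bloom filter is not consulted because its encoding is not defined in this section.

```python
def files_for_predicate(snapshot, graph_uri, predicate):
    """Resolve graph -> predicate_index -> partition -> file identifiers."""
    graph = snapshot["graphs"].get(graph_uri)
    if graph is None:
        return []
    partition_id = graph["predicate_index"].get(predicate)
    if partition_id is None:
        return []
    partition = graph.get("partitions", {}).get(partition_id, {})
    return list(partition.get("files", []))

def candidate_data_files(graph, predicate, value, inverse=False):
    """Keep only the data files whose [min_value, max_value] range could contain value."""
    key = "os_data_files" if inverse else "so_data_files"
    return [
        f for f in graph.get(key, [])
        if f["predicate"] == predicate and f["min_value"] <= value <= f["max_value"]
    ]

snapshot = {
    "graphs": {
        "http://data.dataplatformsolutions.com/graph1": {
            "id": "g1",
            "predicate_index": {"http://example.com/name": "partition1"},
            "partitions": {"partition1": {"files": ["data_file1", "data_file2"]}},
            "so_data_files": [
                {"id": "file123", "predicate": "http://example.com/name",
                 "min_value": "alpha", "max_value": "zeta"},
            ],
        }
    }
}
graph1 = snapshot["graphs"]["http://data.dataplatformsolutions.com/graph1"]
files = files_for_predicate(
    snapshot, "http://data.dataplatformsolutions.com/graph1", "http://example.com/name"
)
hits = candidate_data_files(graph1, "http://example.com/name", "beta")
```

Files whose range excludes the lookup value can be skipped without being opened, which is the purpose of keeping min and max values in the metadata.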
The binary form is designed to optimise streaming in support of query joins and specific triple retrieval. The data file is a binary format in the following form.
Each TripleDataFile is a sequence of encoded triples, sorted by either Subject or Object.