Data
hyperbench.data
Dataset
Bases: Dataset
A dataset class for loading and processing hypergraph data.
Args:
hdata: The processed hypergraph data in HData format.
sampling_strategy: The strategy used for sampling sub-hypergraphs (e.g., by node IDs or hyperedge IDs).
If not provided, defaults to SamplingStrategy.HYPEREDGE.
Source code in hyperbench/data/dataset.py
__init__(hdata=None, sampling_strategy=SamplingStrategy.HYPEREDGE)
Initialize the Dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hdata | HData \| None | Optional HData object to initialize the dataset with. If provided, the dataset will be initialized with this data instead of loading and processing from HIF. Must be provided if prepare is set to | None |
| sampling_strategy | SamplingStrategy | The sampling strategy to use for the dataset. If not provided, defaults to SamplingStrategy.HYPEREDGE. | HYPEREDGE |
Source code in hyperbench/data/dataset.py
__getitem__(index)
Sample a sub-hypergraph based on the sampling strategy and return it as HData. If:
- Sampling by node IDs, the sub-hypergraph will contain all hyperedges incident to the sampled nodes and all nodes incident to those hyperedges.
- Sampling by hyperedge IDs, the sub-hypergraph will contain all nodes incident to the sampled hyperedges.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| list[int] | An integer or a list of integers representing node or hyperedge IDs to sample, depending on the sampling strategy. | required |
Returns:
| Type | Description |
|---|---|
| HData | An HData instance containing the sampled sub-hypergraph. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the provided index is invalid (e.g., empty list or list length exceeds number of nodes/hyperedges). |
| IndexError | If any node/hyperedge ID is out of bounds. |
Source code in hyperbench/data/dataset.py
from_hdata(hdata, sampling_strategy=SamplingStrategy.HYPEREDGE)
classmethod
Create a Dataset instance from an HData object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hdata | HData | The HData object. | required |
| sampling_strategy | SamplingStrategy | The sampling strategy to use for the dataset. If not provided, defaults to SamplingStrategy.HYPEREDGE. | HYPEREDGE |
Returns:
| Type | Description |
|---|---|
| Dataset | The Dataset instance. |
Source code in hyperbench/data/dataset.py
from_url(url, sampling_strategy=SamplingStrategy.HYPEREDGE, save_on_disk=False)
classmethod
Create a Dataset instance by loading a hypergraph from a URL pointing to a .json or .json.zst file in HIF format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str | The URL to the .json or .json.zst file containing the HIF hypergraph data. | required |
| sampling_strategy | SamplingStrategy | The sampling strategy to use for the dataset. If not provided, defaults to SamplingStrategy.HYPEREDGE. | HYPEREDGE |
| save_on_disk | bool | Whether to save the downloaded file on disk. | False |
Returns:
| Type | Description |
|---|---|
| Dataset | The Dataset instance. |
Source code in hyperbench/data/dataset.py
from_path(filepath, sampling_strategy=SamplingStrategy.HYPEREDGE)
classmethod
Create a Dataset instance by loading a hypergraph from a local file path pointing to a .json or .json.zst file in HIF format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filepath | str | The local file path to the .json or .json.zst file containing the HIF hypergraph data. | required |
| sampling_strategy | SamplingStrategy | The sampling strategy to use for the dataset. If not provided, defaults to SamplingStrategy.HYPEREDGE. | HYPEREDGE |
Returns:
| Type | Description |
|---|---|
| Dataset | The Dataset instance. |
Source code in hyperbench/data/dataset.py
enrich_node_features(enricher, enrichment_mode=None)
Enrich node features using the provided node feature enricher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| enricher | NodeEnricher | An instance of NodeEnricher to generate structural node features from hypergraph topology. | required |
| enrichment_mode | EnrichmentMode \| None | How to combine generated features with existing | None |
Source code in hyperbench/data/dataset.py
enrich_node_features_from(dataset_with_features, node_space_setting='transductive', fill_value=None)
Enrich node features from another dataset by copying features by global_node_ids.
Examples:
In a transductive setting, the full node space is preserved across datasets:
In an inductive setting, missing node features can be filled with 0.0:
>>> test_dataset.enrich_node_features_from(
... train_dataset,
... node_space_setting="inductive",
... fill_value=0.0, # torch.tensor(0.0) also works and will be broadcast to the appropriate shape
... )
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_with_features | Dataset | Source dataset providing node features. | required |
| node_space_setting | NodeSpaceSetting | The setting for the node space, determining how nodes are handled. | 'transductive' |
| fill_value | NodeSpaceFiller \| None | Scalar or vector used to fill missing node features when | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the source dataset's node features cannot be aligned with the target dataset's nodes. |
Source code in hyperbench/data/dataset.py
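The copy-by-ID behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not hyperbench's implementation: the function `copy_features_by_global_id` and its dict/list data layout are hypothetical, and raising `ValueError` for an unseen node when no fill value is given is an assumption drawn from the Raises table.

```python
from typing import Optional, Sequence, Union

Fill = Union[float, Sequence[float]]


def copy_features_by_global_id(
    target_ids: Sequence[int],
    source_features: dict[int, list[float]],
    num_features: int,
    fill_value: Optional[Fill] = None,
) -> list[list[float]]:
    """Return one feature row per target node, looked up by global node ID."""
    rows = []
    for node_id in target_ids:
        if node_id in source_features:
            rows.append(list(source_features[node_id]))
        elif fill_value is None:
            # Assumption: without a fill value, unseen nodes cannot be aligned.
            raise ValueError(f"no source features for global node ID {node_id}")
        elif isinstance(fill_value, (int, float)):
            # A scalar is broadcast across the feature dimension.
            rows.append([float(fill_value)] * num_features)
        else:
            if len(fill_value) != num_features:
                raise ValueError("fill vector length must match num_features")
            rows.append([float(v) for v in fill_value])
    return rows


# Hypothetical data: the source knows nodes 10 and 11; node 42 is unseen.
train_features = {10: [1.0, 2.0], 11: [3.0, 4.0]}
test_rows = copy_features_by_global_id(
    [11, 42], train_features, num_features=2, fill_value=0.0
)
print(test_rows)
```

In this sketch the unseen node receives a zero row, mirroring the inductive example with `fill_value=0.0`.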
enrich_hyperedge_attr(enricher, enrichment_mode=None)
Enrich hyperedge features using the provided hyperedge feature enricher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| enricher | HyperedgeEnricher | An instance of HyperedgeEnricher to generate structural hyperedge features from hypergraph topology. | required |
| enrichment_mode | EnrichmentMode \| None | How to combine generated features with existing | None |
Source code in hyperbench/data/dataset.py
enrich_hyperedge_weights(enricher, enrichment_mode=None)
Enrich hyperedge weights using the provided hyperedge weight enricher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| enricher | HyperedgeEnricher | An instance of HyperedgeEnricher to generate structural hyperedge features from hypergraph topology. | required |
| enrichment_mode | EnrichmentMode \| None | How to combine generated features with existing | None |
Source code in hyperbench/data/dataset.py
update_from_hdata(hdata)
Create a Dataset instance from an HData object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hdata | HData | The HData object. | required |
Returns:
| Type | Description |
|---|---|
| Dataset | The Dataset instance. |
Source code in hyperbench/data/dataset.py
add_negative_samples(negative_sampler, seed=None)
Create a new Dataset with sampled negative hyperedges added.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| negative_sampler | NegativeSampler | Sampler used to generate negative hyperedges from this dataset's | required |
| seed | int \| None | Optional random seed used for both negative sampling and the final shuffle. | None |
Returns:
| Type | Description |
|---|---|
| Dataset | A new Dataset with the sampled negative hyperedges added. |
Source code in hyperbench/data/dataset.py
remove_hyperedges_with_fewer_than_k_nodes(k)
Remove hyperedges that have fewer than k incident nodes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| k | int | The minimum number of nodes a hyperedge must have to be retained. | required |
Source code in hyperbench/data/dataset.py
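A rough sketch of the filtering rule, written against a plain two-row index (nodes on the first row, hyperedge IDs on the second) rather than HData. The helper name `remove_small_hyperedges` is hypothetical, and two details are assumptions: nodes are counted as distinct members, and surviving hyperedges keep their original IDs.

```python
from collections import defaultdict


def remove_small_hyperedges(hyperedge_index, k):
    """Keep only incidences whose hyperedge has at least k distinct nodes."""
    node_row, edge_row = hyperedge_index
    members = defaultdict(set)
    for node, edge in zip(node_row, edge_row):
        members[edge].add(node)
    kept = [(n, e) for n, e in zip(node_row, edge_row) if len(members[e]) >= k]
    return [[n for n, _ in kept], [e for _, e in kept]]


# Hyperedge 0 = {0, 1}, hyperedge 1 = {1, 2}, hyperedge 2 = {3}.
index = [[0, 1, 1, 2, 3], [0, 0, 1, 1, 2]]
filtered = remove_small_hyperedges(index, k=2)
print(filtered)  # [[0, 1, 1, 2], [0, 0, 1, 1]]
```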
split(ratios, shuffle=False, seed=None, node_space_setting='transductive', assign_node_space_to='first')
Split the dataset by hyperedges into partitions with contiguous 0-based hyperedge IDs.
Boundaries are computed using cumulative floor to prevent early splits from over-consuming edges. The last split absorbs any rounding remainder.
Examples:
Transductive split keeping the full node space only on the first split (default):
>>> train, test = dataset.split([0.8, 0.2])
>>> train.hdata.num_nodes == dataset.hdata.num_nodes
>>> test.hdata.num_nodes <= dataset.hdata.num_nodes
Transductive split keeping the full node space on all splits:
>>> train, test = dataset.split(
... [0.8, 0.2],
... node_space_setting="transductive",
... assign_node_space_to="all",
... )
>>> train.hdata.num_nodes == dataset.hdata.num_nodes
>>> test.hdata.num_nodes == dataset.hdata.num_nodes
Inductive split:
>>> train, test = dataset.split(
... [0.8, 0.2],
... node_space_setting="inductive",
... assign_node_space_to=None,
... )
>>> train.hdata.num_nodes <= dataset.hdata.num_nodes
>>> test.hdata.num_nodes <= dataset.hdata.num_nodes
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ratios | list[float] | List of floats summing to | required |
| shuffle | bool \| None | Whether to shuffle hyperedges before splitting. Defaults to False. | False |
| seed | int \| None | Optional random seed for reproducibility. Ignored if shuffle is set to | None |
| node_space_setting | NodeSpaceSetting | Whether to preserve the full node space in the splits. | 'transductive' |
| assign_node_space_to | NodeSpaceAssignment \| None | Which split(s) preserve the full node space when | 'first' |
Returns:
| Type | Description |
|---|---|
| list[Dataset] | List of Dataset objects, one per split, each with contiguous IDs. |
Source code in hyperbench/data/dataset.py
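The cumulative-floor boundary rule can be illustrated with a small standalone function. This is a sketch of the rule as described in the text, not the library's code; `split_boundaries` is a hypothetical name and it works on a hyperedge count only.

```python
import math


def split_boundaries(num_hyperedges, ratios):
    """Return (start, end) hyperedge ranges, one per ratio, using cumulative floor."""
    boundaries = []
    start = 0
    cumulative = 0.0
    for i, ratio in enumerate(ratios):
        cumulative += ratio
        if i == len(ratios) - 1:
            # The last split absorbs any rounding remainder.
            end = num_hyperedges
        else:
            # Floor of the cumulative share, not of each individual share.
            end = math.floor(num_hyperedges * cumulative)
        boundaries.append((start, end))
        start = end
    return boundaries


print(split_boundaries(10, [0.8, 0.2]))      # [(0, 8), (8, 10)]
print(split_boundaries(7, [0.5, 0.3, 0.2]))  # [(0, 3), (3, 5), (5, 7)]
```

Flooring the running total keeps each boundary at or below its exact fractional position, so rounding never hands an early split more hyperedges than its share.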
to(device)
Move the dataset's HData to the specified device.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| device | device | The target device (e.g., | required |
Returns:
| Type | Description |
|---|---|
| Dataset | The Dataset instance moved to the specified device. |
Source code in hyperbench/data/dataset.py
stats()
Compute statistics for the dataset.
This method currently delegates to the underlying HData's stats method.
The fields returned in the dictionary include:
- shape_x: The shape of the node feature matrix x.
- shape_hyperedge_attr: The shape of the hyperedge attribute matrix, or None if hyperedge attributes are not present.
- num_nodes: The number of nodes in the hypergraph.
- num_hyperedges: The number of hyperedges in the hypergraph.
- avg_degree_node_raw: The average degree of nodes, calculated as the mean number of hyperedges each node belongs to.
- avg_degree_node: The floored node average degree.
- avg_degree_hyperedge_raw: The average size of hyperedges, calculated as the mean number of nodes each hyperedge contains.
- avg_degree_hyperedge: The floored hyperedge average size.
- node_degree_max: The maximum degree of any node in the hypergraph.
- hyperedge_degree_max: The maximum size of any hyperedge in the hypergraph.
- node_degree_median: The median degree of nodes in the hypergraph.
- hyperedge_degree_median: The median size of hyperedges in the hypergraph.
- distribution_node_degree: A list where the value at index i represents the count of nodes with degree i.
- distribution_hyperedge_size: A list where the value at index i represents the count of hyperedges with size i.
- distribution_node_degree_hist: A dictionary where the keys are node degrees and the values are the count of nodes with that degree.
- distribution_hyperedge_size_hist: A dictionary where the keys are hyperedge sizes and the values are the count of hyperedges with that size.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary containing various statistics about the hypergraph. |
Source code in hyperbench/data/dataset.py
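To make the node-degree fields in the list above concrete, here is a minimal pure-Python sketch that derives them from a two-row incidence index. The function `node_degree_stats` is hypothetical, and it assumes each incidence counts once toward a node's degree; the library's exact numeric types and conventions may differ.

```python
import math
from collections import Counter
from statistics import median


def node_degree_stats(hyperedge_index, num_nodes):
    """Compute node-degree statistics from a [node_row, edge_row] index."""
    node_row, _ = hyperedge_index
    counts = Counter(node_row)
    degrees = [counts.get(n, 0) for n in range(num_nodes)]
    avg_raw = sum(degrees) / num_nodes
    hist = Counter(degrees)
    return {
        "avg_degree_node_raw": avg_raw,
        "avg_degree_node": math.floor(avg_raw),
        "node_degree_max": max(degrees),
        "node_degree_median": median(degrees),
        # Value at position i is the number of nodes with degree i.
        "distribution_node_degree": [hist.get(d, 0) for d in range(max(degrees) + 1)],
        # Same information keyed by degree, omitting empty bins.
        "distribution_node_degree_hist": dict(sorted(hist.items())),
    }


# Hyperedge 0 = {0, 1}, hyperedge 1 = {1, 2}: node degrees are 1, 2, 1.
stats = node_degree_stats([[0, 1, 1, 2], [0, 0, 1, 1]], num_nodes=3)
print(stats["distribution_node_degree"])       # [0, 2, 1]
print(stats["distribution_node_degree_hist"])  # {1: 2, 2: 1}
```

The hyperedge-size fields follow the same pattern with the second row of the index in place of the first.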
HIFLoader
A utility class to load hypergraphs from HIF format.
Source code in hyperbench/data/hif.py
load_from_url(url, save_on_disk=False)
classmethod
Load a hypergraph from a given URL pointing to a .json or .json.zst file in HIF format.
Args:
url (str): The URL to the .json or .json.zst file containing the HIF hypergraph data.
save_on_disk (bool): Whether to save the downloaded file on disk.
Returns:
HData: The loaded hypergraph object.
Source code in hyperbench/data/hif.py
load_from_path(filepath)
classmethod
Load a hypergraph from a local file path pointing to a .json or .json.zst file in HIF format.
Args:
filepath (str): The local file path to the .json or .json.zst file containing the HIF hypergraph data.
Returns:
HData: The loaded hypergraph object.
Source code in hyperbench/data/hif.py
HIFProcessor
A utility class to process HIF hypergraph data into HData format.
Source code in hyperbench/data/hif.py
transform_attrs(attrs, attr_keys=None)
staticmethod
Extract and encode numeric attributes to tensor.
Non-numeric attributes are discarded. Missing attributes are filled with 0.0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attrs | dict[str, Any] | Dictionary of attributes. | required |
| attr_keys | list[str] \| None | Optional list of attribute keys to encode. If provided, ensures consistent ordering and fills missing attributes with 0.0. | None |
Returns:
| Type | Description |
|---|---|
| Tensor | Tensor of numeric attribute values. |
Source code in hyperbench/data/hif.py
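The encoding rule (keep numeric values, drop the rest, fill missing keys with 0.0) might look roughly like the following. This sketch returns a plain list of floats instead of a Tensor so it runs without extra dependencies. `transform_attrs_sketch` is a hypothetical stand-in, and two behaviours are assumptions: booleans are treated as non-numeric, and without `attr_keys` the values keep dictionary insertion order.

```python
from numbers import Real
from typing import Any, Optional


def transform_attrs_sketch(
    attrs: dict[str, Any], attr_keys: Optional[list[str]] = None
) -> list[float]:
    """Encode the numeric attributes of one node or hyperedge as floats."""

    def is_numeric(value: Any) -> bool:
        # Assumption: bool is excluded even though it subclasses int.
        return isinstance(value, Real) and not isinstance(value, bool)

    if attr_keys is None:
        return [float(v) for v in attrs.values() if is_numeric(v)]
    # With explicit keys, the output order is fixed and gaps become 0.0.
    return [float(attrs[k]) if is_numeric(attrs.get(k)) else 0.0 for k in attr_keys]


attrs = {"weight": 2, "label": "a", "score": 1.5}
print(transform_attrs_sketch(attrs))                                  # [2.0, 1.5]
print(transform_attrs_sketch(attrs, ["score", "missing", "weight"]))  # [1.5, 0.0, 2.0]
```

Passing the same `attr_keys` for every node or hyperedge gives rows of equal length, which is what allows them to be stacked into one matrix.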
process_hypergraph(hypergraph)
classmethod
Process the loaded hypergraph into HData format, mapping HIF structure to tensors.
Returns:
| Type | Description |
|---|---|
| HData | The processed hypergraph data. |
Source code in hyperbench/data/hif.py
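As a rough idea of what mapping HIF structure to an index involves, the sketch below turns the `incidences` array of a HIF JSON document into a two-row index with contiguous 0-based IDs. It is not the library's processing code: `hif_to_hyperedge_index` is a hypothetical helper, IDs are assigned in order of first appearance (an assumption), and attributes, weights, and `.json.zst` decompression are left out.

```python
import json


def hif_to_hyperedge_index(hif_text):
    """Map HIF incidences to [node_row, edge_row] with contiguous integer IDs."""
    hif = json.loads(hif_text)
    node_ids, edge_ids = {}, {}
    node_row, edge_row = [], []
    for incidence in hif["incidences"]:
        # setdefault assigns the next free ID the first time a label is seen.
        node_row.append(node_ids.setdefault(incidence["node"], len(node_ids)))
        edge_row.append(edge_ids.setdefault(incidence["edge"], len(edge_ids)))
    return [node_row, edge_row], node_ids, edge_ids


hif_text = json.dumps(
    {
        "incidences": [
            {"edge": "e1", "node": "a"},
            {"edge": "e1", "node": "b"},
            {"edge": "e2", "node": "b"},
            {"edge": "e2", "node": "c"},
        ]
    }
)
index, node_ids, edge_ids = hif_to_hyperedge_index(hif_text)
print(index)  # [[0, 1, 1, 2], [0, 0, 1, 1]]
```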
__collect_attr_keys(attr_keys)
classmethod
Collect unique numeric attribute keys from a list of attribute dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attr_keys | list[dict[str, Any]] | List of attribute dictionaries. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of unique numeric attribute keys. |
Source code in hyperbench/data/hif.py
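Collecting the unique numeric keys can be sketched as a single pass over the attribute dictionaries. The name `collect_numeric_attr_keys` is hypothetical, and both the first-seen ordering and the exclusion of booleans are assumptions of this sketch rather than documented behaviour.

```python
from numbers import Real


def collect_numeric_attr_keys(attr_dicts):
    """Return keys that hold a numeric value in at least one dictionary."""
    keys = {}
    for attrs in attr_dicts:
        for key, value in attrs.items():
            if isinstance(value, Real) and not isinstance(value, bool):
                # A dict is used as an ordered set: first-seen order is kept.
                keys.setdefault(key, None)
    return list(keys)


attr_dicts = [{"weight": 1, "name": "x"}, {"score": 0.5, "weight": 2}]
numeric_keys = collect_numeric_attr_keys(attr_dicts)
print(numeric_keys)  # ['weight', 'score']
```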
DataLoader
Bases: DataLoader
Source code in hyperbench/data/loader.py
collate(batch)
Collates a list of HData objects into a single batched HData object.
This function combines multiple separate samples into a single batched representation suitable for mini-batch training. It handles:
- Concatenating node features from all samples.
- Concatenating and offsetting hyperedges from all samples.
- Concatenating hyperedge attributes from all samples, if present.
- Concatenating hyperedge weights from all samples, if present.
Examples:
Given batch = [HData_0, HData_1]:
For node features:
>>> HData_0.x.shape # (3, 64) — 3 nodes with 64 features
>>> HData_1.x.shape # (2, 64) — 2 nodes with 64 features
>>> x.shape # (5, 64) — all 5 nodes concatenated
For hyperedge index:
HData_0(3 nodes, 2 hyperedges):
>>> hyperedge_index = [[0, 1, 1, 2], # Nodes 0, 1, 1, 2
... [0, 0, 1, 1]] # Hyperedge 0 contains {0,1}, Hyperedge 1 contains {1,2}
HData_1(2 nodes, 1 hyperedge):
>>> hyperedge_index = [[0, 1], # Nodes 0, 1
... [0, 0]] # Hyperedge 0 contains {0,1}
Batched result:
>>> hyperedge_index = [[0, 1, 1, 2, 3, 4], # Node indices: original then offset by 3
... [0, 0, 1, 1, 2, 2]] # Hyperedge IDs: original then offset by 2
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch | list[HData] | List of HData objects to collate. | required |
Returns:
| Type | Description |
|---|---|
| HData | A single batched HData object. |
Source code in hyperbench/data/loader.py
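The offsetting step for the hyperedge index can be reproduced with plain lists. This is a minimal sketch of the idea only: `collate_hyperedge_indices` is a hypothetical helper that takes `(hyperedge_index, num_nodes, num_hyperedges)` tuples instead of HData objects and ignores features, attributes, and weights.

```python
def collate_hyperedge_indices(batch):
    """Concatenate indices, shifting node and hyperedge IDs per sample."""
    nodes, edges = [], []
    node_offset = edge_offset = 0
    for (node_row, edge_row), num_nodes, num_hyperedges in batch:
        nodes.extend(n + node_offset for n in node_row)
        edges.extend(e + edge_offset for e in edge_row)
        # Later samples start after every node and hyperedge seen so far.
        node_offset += num_nodes
        edge_offset += num_hyperedges
    return [nodes, edges]


batch = [
    ([[0, 1, 1, 2], [0, 0, 1, 1]], 3, 2),  # 3 nodes, 2 hyperedges
    ([[0, 1], [0, 0]], 2, 1),              # 2 nodes, 1 hyperedge
]
batched = collate_hyperedge_indices(batch)
print(batched)  # [[0, 1, 1, 2, 3, 4], [0, 0, 1, 1, 2, 2]]
```

The offsets come from the per-sample node and hyperedge counts rather than from the largest ID present, so a sample with isolated nodes still shifts later samples correctly.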
BaseSampler
Bases: ABC
Source code in hyperbench/data/sampling.py
sample(index, hdata)
abstractmethod
Sample a sub-hypergraph and return HData with global IDs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| list[int] | An integer or list of integers specifying which items to sample. | required |
| hdata | HData | The original HData to sample from. | required |
Returns:
| Type | Description |
|---|---|
| HData | A new HData instance containing only the sampled items and their associated data. |
Source code in hyperbench/data/sampling.py
HyperedgeSampler
Bases: BaseSampler
Source code in hyperbench/data/sampling.py
sample(index, hdata)
Sample hyperedges by their IDs and return the sub-hypergraph containing only those hyperedges and their incident nodes.
Examples:
>>> hyperedge_index = [[0, 0, 1, 2, 3, 4],
...                    [0, 0, 0, 1, 2, 2]]
>>> hdata = HData.from_hyperedge_index(hyperedge_index)
>>> strategy = HyperedgeSampler()
>>> sampled_hdata = strategy.sample([0, 2], hdata)
>>> sampled_hdata.hyperedge_index
tensor([[0, 0, 1, 3, 4],
        [0, 0, 0, 2, 2]])
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| list[int] | An integer or a list of integers representing hyperedge IDs to sample. | required |
| hdata | HData | The original HData to sample from. | required |
Returns:
| Type | Description |
|---|---|
| HData | An HData instance containing only the sampled hyperedges and their incident nodes. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the provided index is invalid (e.g., empty list or list length exceeds number of hyperedges). |
| IndexError | If any hyperedge ID is out of bounds. |
Source code in hyperbench/data/sampling.py
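The selection logic and the two error cases from the Raises table can be approximated on a plain two-row index. This is an illustrative sketch, not the HyperedgeSampler source: `sample_by_hyperedges` is a hypothetical function, the number of hyperedges is inferred from the largest ID, and rejecting negative IDs is an assumption.

```python
def sample_by_hyperedges(hyperedge_index, index):
    """Keep the incidences of the requested hyperedges, preserving global IDs."""
    node_row, edge_row = hyperedge_index
    num_hyperedges = max(edge_row) + 1
    ids = [index] if isinstance(index, int) else list(index)
    if not ids or len(ids) > num_hyperedges:
        raise ValueError("index must be non-empty and no longer than the number of hyperedges")
    if any(i < 0 or i >= num_hyperedges for i in ids):
        raise IndexError("hyperedge ID out of bounds")
    wanted = set(ids)
    kept = [(n, e) for n, e in zip(node_row, edge_row) if e in wanted]
    return [[n for n, _ in kept], [e for _, e in kept]]


hyperedge_index = [[0, 0, 1, 2, 3, 4], [0, 0, 0, 1, 2, 2]]
sampled = sample_by_hyperedges(hyperedge_index, [0, 2])
print(sampled)  # [[0, 0, 1, 3, 4], [0, 0, 0, 2, 2]]
```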
NodeSampler
Bases: BaseSampler
Source code in hyperbench/data/sampling.py
sample(index, hdata)
Sample nodes by their IDs and return the sub-hypergraph containing all hyperedges incident to those nodes and all nodes incident to those hyperedges.
Examples:
>>> hyperedge_index = [[0, 0, 1, 2, 3, 4],
...                    [0, 0, 0, 1, 2, 2]]
>>> hdata = HData.from_hyperedge_index(hyperedge_index)
>>> strategy = NodeSampler()
>>> sampled_hdata = strategy.sample([0, 3], hdata)
>>> sampled_hdata.hyperedge_index
tensor([[0, 0, 1, 3, 4],
        [0, 0, 0, 2, 2]])
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int \| list[int] | An integer or a list of integers representing node IDs to sample. | required |
| hdata | HData | The original HData to sample from. | required |
Returns:
| Type | Description |
|---|---|
| HData | An HData instance containing the hyperedges incident to the sampled nodes and all nodes incident to those hyperedges. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the provided index is invalid (e.g., empty list or list length exceeds number of nodes). |
| IndexError | If any node ID is out of bounds. |
Source code in hyperbench/data/sampling.py
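Node-based sampling expands in two steps: find the hyperedges touching the requested nodes, then keep every incidence of those hyperedges. The sketch below shows that expansion on a plain two-row index. `sample_by_nodes` is a hypothetical function and input validation is omitted for brevity.

```python
def sample_by_nodes(hyperedge_index, index):
    """Keep all incidences of every hyperedge that touches a requested node."""
    node_row, edge_row = hyperedge_index
    wanted_nodes = {index} if isinstance(index, int) else set(index)
    # Step 1: hyperedges incident to at least one requested node.
    incident_edges = {e for n, e in zip(node_row, edge_row) if n in wanted_nodes}
    # Step 2: all nodes of those hyperedges come along, requested or not.
    kept = [(n, e) for n, e in zip(node_row, edge_row) if e in incident_edges]
    return [[n for n, _ in kept], [e for _, e in kept]]


hyperedge_index = [[0, 0, 1, 2, 3, 4], [0, 0, 0, 1, 2, 2]]
sampled_nodes = sample_by_nodes(hyperedge_index, [0, 3])
print(sampled_nodes)  # [[0, 0, 1, 3, 4], [0, 0, 0, 2, 2]]
```

Nodes 1 and 4 appear in the result although they were not requested, because they share hyperedges 0 and 2 with the sampled nodes.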
create_sampler_from_strategy(strategy)
Factory function to create a sampler instance based on the provided sampling strategy type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| strategy | SamplingStrategy | An instance of SamplingStrategy enum indicating which sampling strategy to use. | required |
Returns:
| Type | Description |
|---|---|
| BaseSampler | An instance of a subclass of BaseSampler corresponding to the specified strategy. If strategy is not recognized, defaults to |