Quickstart Guide
This guide will help you get started with ColliderML datasets from HuggingFace Hub.
Installation
Install ColliderML using pip:
bash
pip install colliderml
Or install from source:
bash
git clone https://github.com/OpenDataDetector/ColliderML.git
cd ColliderML
pip install -e .
Loading Your First Dataset
The ColliderML dataset is hosted on HuggingFace Hub and can be loaded using the standard datasets library:
python
from datasets import load_dataset
# Load the ttbar particles dataset (no pileup)
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train"
)
print(f"Loaded {len(dataset)} events")
Prefer a local CLI + loader workflow?
The HuggingFace datasets approach is great for quick access. For analysis workflows, ColliderML also provides a more convenient pattern:
- colliderml.load("ttbar_pu0"): a one-liner that downloads on first call, then caches
- explode event tables into flat tables with colliderml.polars.explode_*
python
import colliderml
from colliderml.polars import explode_particles
# Downloads on first call, reads from the local cache afterwards.
frames = colliderml.load("ttbar_pu0", tables=["particles"], max_events=200)
particles_flat = explode_particles(frames["particles"])
See the Library overview.
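To make "exploding" concrete: an event table holds one row per event with list-valued columns, while a flat table holds one row per particle. The pure-Python sketch below illustrates that reshaping only. It is not how colliderml.polars is implemented, and the helper explode_rows and the column names event_id, pdg_id, and energy are hypothetical.

```python
def explode_rows(events, list_columns):
    """Turn one row per event into one row per list element.

    `events` is a list of dicts; every column in `list_columns` must hold
    lists of equal length within an event. Other columns are repeated.
    """
    flat = []
    for event in events:
        lengths = {len(event[col]) for col in list_columns}
        if len(lengths) > 1:
            raise ValueError("list columns differ in length within an event")
        # Scalar (per-event) columns are copied onto every flat row.
        scalars = {k: v for k, v in event.items() if k not in list_columns}
        for values in zip(*(event[col] for col in list_columns)):
            flat.append({**scalars, **dict(zip(list_columns, values))})
    return flat


# Hypothetical toy events, not real ColliderML data.
events = [
    {"event_id": 0, "pdg_id": [11, -11], "energy": [45.2, 44.8]},
    {"event_id": 1, "pdg_id": [22], "energy": [12.5]},
]
for row in explode_rows(events, ["pdg_id", "energy"]):
    print(row)
```

Each flat row keeps its event_id, so per-particle results can still be grouped back by event.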
Generate events yourself
New in v0.4.0. Instead of downloading pre-generated data, you can simulate your own events locally (inside the ODD software container via Docker or Podman) or submit a job to the SaaS backend:
python
import colliderml
# Local: needs `pip install "colliderml[sim]"` plus Docker or Podman.
result = colliderml.simulate(preset="ttbar-quick")
print(result.run_dir) # parquet outputs land here
# Remote: needs `pip install "colliderml[remote]"` and an HF token.
result = colliderml.simulate(preset="higgs-portal-quick", remote=True)
print(result.remote_request_id)
Full details in the Local Simulation and Remote Simulation guides.
Score your model against a benchmark task
New in v0.4.0. Six built-in benchmark tasks (tracking, jets, anomaly, and three systems tasks) let you compare any model on equal footing:
python
import colliderml.tasks
# What's available?
print(colliderml.tasks.list_tasks())
# Score local predictions
scores = colliderml.tasks.evaluate("tracking", "my_preds.parquet")
print(scores)
# Upload to the leaderboard (earns credits on new bests)
colliderml.tasks.submit("tracking", "my_preds.parquet")
See the Benchmark Tasks guide for the full workflow.
Understanding Dataset Structure
The ColliderML dataset is organized with configurations that combine:
- Process: The physics process being simulated (e.g., ttbar, ggf, dihiggs)
- Pileup: The pileup condition (e.g., pu0 for no pileup, pu200 for 200 pileup)
- Object Type: The detector data type or hierarchy level
Available Configurations
Each configuration name follows the pattern {process}_{pileup}_{object_type}. For example:
- ttbar_pu0_particles
- ggf_pu200_calo_hits
- dihiggs_pu0_tracks
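The naming pattern above can be captured in a pair of small helpers. This is a minimal sketch, not part of the ColliderML API: config_name and split_config_name are hypothetical names introduced here, and the parsing assumes that process and pileup labels contain no underscores (object types such as calo_hits may).

```python
def config_name(process: str, pileup: str, object_type: str) -> str:
    """Join the three parts into a configuration name, e.g. ttbar_pu0_particles."""
    return f"{process}_{pileup}_{object_type}"


def split_config_name(name: str) -> tuple[str, str, str]:
    """Split a configuration name back into (process, pileup, object_type).

    Assumes process and pileup contain no underscores, so only the first
    two underscores are treated as separators.
    """
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"not a {{process}}_{{pileup}}_{{object_type}} name: {name!r}")
    process, pileup, object_type = parts
    return process, pileup, object_type


print(config_name("ggf", "pu200", "calo_hits"))
print(split_config_name("dihiggs_pu0_tracks"))
```

Building names this way makes it easy to loop over several pileup conditions or object types for the same process.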
ColliderML provides multiple views of collision events:
- particles: Truth-level particle information (Monte Carlo truth)
- tracker_hits: Detector measurements in the tracking system
- calo_hits: Detector measurements in the calorimeter
- tracks: Reconstructed particle tracks
Loading Different Configurations
python
from datasets import load_dataset
# Load truth-level particles
particles = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train"
)
# Load tracker hits (detector measurements)
tracker_hits = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_tracker_hits",
    split="train"
)
# Load reconstructed tracks
tracks = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_tracks",
    split="train"
)
Accessing Event Data
Single Event
python
# Get the first event
event = dataset[0]
# Inspect available fields
print("Event fields:", list(event.keys()))
# Access specific fields
for key, value in event.items():
    if hasattr(value, '__len__'):
        print(f"{key}: {len(value)} items")
    else:
        print(f"{key}: {value}")
Batch Loading
Load multiple events at once for efficient processing:
python
# Load first 10 events as a batch
batch = dataset[:10]
# batch is a dictionary where each value is a list
print("Batch keys:", list(batch.keys()))
# Process batch data
for key, values in batch.items():
    if hasattr(values, '__len__'):
        print(f"{key}: batch of {len(values)} events")
Iteration
Iterate through the dataset:
python
# Iterate over all events
for event in dataset:
    # Process each event
    print(f"Processing event with {len(event.keys())} fields")
    # Your analysis code here
    break  # Remove this to process all events
Streaming Mode
For large datasets that don't fit in memory, use streaming mode:
python
from datasets import load_dataset
# Load in streaming mode
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train",
    streaming=True  # Enable streaming
)
# Iterate without loading everything into memory
for i, event in enumerate(dataset):
    if i >= 10:  # Process first 10 events
        break
    print(f"Event {i}: {list(event.keys())}")
Available Physics Processes
ColliderML includes multiple physics processes:
Top Quark Pair Production (ttbar)
python
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train"
)
Gluon-Gluon Fusion / Higgs (ggf)
python
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ggf_pu0_particles",
    split="train"
)
Di-Higgs Production (dihiggs)
python
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "dihiggs_pu0_particles",
    split="train"
)
Check the CERN/Colliderml-release-1 dataset page for a complete list of available configurations.
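The configuration list can also be retrieved programmatically. The sketch below uses get_dataset_config_names from the datasets library (this call needs network access) and groups the returned names by process. The helpers group_by_process and list_remote_configs are hypothetical, and the grouping assumes every name follows the {process}_{pileup}_{object_type} pattern with no underscore in the process label.

```python
from collections import defaultdict


def group_by_process(config_names):
    """Group configuration names by their leading process label."""
    grouped = defaultdict(list)
    for name in config_names:
        process = name.split("_", 1)[0]
        grouped[process].append(name)
    return dict(grouped)


def list_remote_configs(repo_id="CERN/Colliderml-release-1"):
    """Fetch configuration names from the Hub (requires network access)."""
    # Imported lazily so the grouping helper works without `datasets` installed.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(repo_id)


# Offline demonstration with names taken from this guide:
example = ["ttbar_pu0_particles", "ttbar_pu0_tracks", "ggf_pu200_calo_hits"]
print(group_by_process(example))
# To query the Hub instead: group_by_process(list_remote_configs())
```

Only the commented-out last line contacts the Hub; the demonstration above it runs offline.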
Data Inspection Example
Here's a complete example of loading and inspecting ColliderML data:
python
from datasets import load_dataset
import numpy as np
# Load dataset
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train"
)
# Return NumPy arrays instead of Python lists so the shape/dtype checks below apply
dataset = dataset.with_format("numpy")
print("\nDataset Information:")
print(f"  Total events: {len(dataset)}")
print(f"  Features: {dataset.features}")
# Inspect first event
event = dataset[0]
print("\nFirst Event Structure:")
for key, value in event.items():
    print(f"  {key}:")
    print(f"    Type: {type(value)}")
    if hasattr(value, 'shape'):
        print(f"    Shape: {value.shape}")
        print(f"    Dtype: {value.dtype}")
        # Print statistics for numeric arrays
        if np.issubdtype(value.dtype, np.number) and value.size > 0:
            print(f"    Range: [{np.min(value):.3f}, {np.max(value):.3f}]")
            print(f"    Mean: {np.mean(value):.3f}")
Using with PyTorch or TensorFlow
The datasets library integrates seamlessly with popular ML frameworks:
PyTorch
python
from datasets import load_dataset
dataset = load_dataset(
    "CERN/Colliderml-release-1",
    "ttbar_pu0_particles",
    split="train"
)
# Convert to PyTorch format
dataset.set_format(type='torch', columns=['your_feature_columns'])
# Use with PyTorch DataLoader
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
TensorFlow
python
# Convert to TensorFlow format
tf_dataset = dataset.to_tf_dataset(
    columns=['your_feature_columns'],
    batch_size=32,
    shuffle=True
)
Next Steps
- Explore the Data Structure documentation for detailed field descriptions
- Learn about Data Management for caching and optimization
- Check out the Examples for complete analysis workflows
- Read about the Physics Processes available in ColliderML
Getting Help
If you encounter issues:
- Check the FAQ
- Visit our GitHub Issues
- Consult the CERN/Colliderml-release-1 dataset page