Group Theory Collection

A comprehensive collection of permutation composition datasets for various mathematical groups, organized by computational complexity classes. This dataset is designed for studying the "Illusion of State" phenomenon in state-space models and transformer architectures.

Overview

This dataset provides 94 individual permutation group datasets spanning 10 different group families, systematically organized to facilitate research on the computational boundaries between solvable and non-solvable groups. The organization reflects the fundamental distinction between TC⁰-computable (solvable groups) and NC¹-complete (non-solvable groups) problems.

Research Motivation

Recent theoretical work demonstrates that TC⁰ models, including Transformers and standard State-Space Models (SSMs), cannot solve NC¹-complete problems such as composing permutations in non-solvable groups. This dataset enables researchers to:

Empirically verify theoretical computational complexity boundaries
Study the "Illusion of State" phenomenon in neural architectures
Benchmark mathematical reasoning capabilities of sequence models
Investigate generalization patterns across different group structures
Analyze the relationship between model architecture and algebraic computation

Dataset Structure

The dataset is organized in three complementary ways to support different research approaches:

1. Flat Organization (data/)

All 94 individual group datasets are available for direct access in a flat structure, facilitating straightforward loading and comparison across groups.

2. TC⁰ Complexity Class (TC0/)

Contains 58 solvable groups that can theoretically be computed by constant-depth threshold circuits. These groups serve as positive controls where current neural architectures should succeed.

3. NC¹ Complexity Class (NC1/)

Contains 36 non-solvable groups requiring logarithmic-depth circuits for computation. These groups represent problems that are provably beyond the computational capacity of TC⁰ models.

Usage

Basic Loading

from datasets import load_dataset

# Load specific group datasets using config names
s5_data = load_dataset("BeeGass/Group-Theory-Collection", name="s5")
a4_data = load_dataset("BeeGass/Group-Theory-Collection", name="a4")
m11_data = load_dataset("BeeGass/Group-Theory-Collection", name="m11")

# Alternative: Load from data directories
s5_data = load_dataset("BeeGass/Group-Theory-Collection", data_dir="data/s5")
tc0_cyclic = load_dataset("BeeGass/Group-Theory-Collection", data_dir="TC0/c10")
nc1_symmetric = load_dataset("BeeGass/Group-Theory-Collection", data_dir="NC1/s7")

# Access train/test splits
train_data = s5_data["train"]
test_data = s5_data["test"]

Data Format

Each example contains the following fields:

{
    'input_sequence': "123 456 789 ...",             # Space-separated permutation IDs (variable length)
    'target': "234",                                  # Result of composition as string
    'sequence_length': 512,                           # Length of input sequence (varies from 3 to 1024)
    'group_degree': 7,                                # Degree of the permutation group (e.g., S7 acts on 7 elements)
    'group_order': 5040,                              # Order (size) of the group (e.g., |S7| = 7!)
    'group_type': "symmetric"                         # Type of the group
}

Note: Sequences contain a variable number of permutation IDs (uniformly distributed between 3 and 1024). The provided target is the composition of all permutations in the input sequence.

Working with Different Sequence Lengths

The dataset already contains sequences of varying lengths (3 to 1024). You can filter or analyze based on sequence length:

# Load full dataset
dataset = load_dataset("BeeGass/Group-Theory-Collection", name="s5")

# Example: Filter for specific sequence lengths
short_sequences = dataset['train'].filter(lambda x: x['sequence_length'] <= 32)
medium_sequences = dataset['train'].filter(lambda x: 32 < x['sequence_length'] <= 256)
long_sequences = dataset['train'].filter(lambda x: x['sequence_length'] > 256)

# Analyze sequence length distribution
import numpy as np
lengths = np.array(dataset['train']['sequence_length'])
print(f"Min length: {lengths.min()}, Max length: {lengths.max()}")
print(f"Mean length: {lengths.mean():.1f}, Std: {lengths.std():.1f}")

Group Inventory

TC⁰ Groups (Solvable) - 58 Groups

Group Family	Groups	Orders	Mathematical Properties
Symmetric	S3, S4	6, 24	Solvable for n ≤ 4
Alternating	A3, A4	3, 12	Solvable for n ≤ 4
Cyclic	C2-C30 (all)	2-30	Abelian groups
Dihedral	D3-D20 (all)	6-40	Symmetries of regular polygons
Klein	V4	4	Smallest non-cyclic abelian group (isomorphic to Z₂²)
Quaternion	Q8, Q16, Q32	8, 16, 32	Non-abelian 2-groups
Elementary Abelian	Z2^[1-5], Z3^[1-4], Z5^[1-4]	Various	Direct products of cyclic groups
Frobenius	F20, F21	20, 21	Transitive permutation groups
Projective Special Linear	PSL(2,2), PSL(2,3)	6, 12	Solvable PSL groups

NC¹ Groups (Non-Solvable) - 36 Groups

Group Family	Groups	Orders	Mathematical Properties
Symmetric	S5, S6, S7, S8, S9	120-362,880	Non-solvable for n ≥ 5
Alternating	A5, A6, A7, A8, A9	60-181,440	Simple groups for n ≥ 5
Projective Special Linear	PSL(2,4), PSL(2,5), PSL(2,7), PSL(2,8), PSL(2,9), PSL(2,11), PSL(3,2), PSL(3,3), PSL(3,4), PSL(3,5)	Various	Simple groups (PSL(2,4) ≅ A5)
Mathieu	M11, M12	7,920, 95,040	Sporadic simple groups

Technical Specifications

Permutation Representation

Each permutation is assigned a unique integer identifier within its group
Mappings between IDs and permutation arrays are consistent across train/test splits
Permutation composition follows right-to-left convention (standard in mathematics)

Dataset Statistics

Train/Test Split: 80/20 ratio for all groups
Sequence Lengths: Variable lengths from 3 to 1024 permutations per example
File Format: Apache Arrow for efficient data loading and memory mapping
Total Size: Varies by group order and maximum sequence length

Composition Convention

For an input sequence [p₁, p₂, p₃], the target is computed as:

Mathematical notation: p₃ ∘ p₂ ∘ p₁
Operational interpretation: First apply p₁, then p₂, then p₃

Dataset Generation

The code used to generate this dataset is available at https://github.com/BeeGass/Group-Dataset-Generator. The repository includes:

Complete implementation of all permutation groups
Dataset generation scripts with configurable parameters
Verification and testing utilities
Documentation for extending the dataset with additional groups

Research Applications

This dataset supports various research directions:

Computational Complexity Theory: Empirical validation of TC⁰/NC¹ separation in neural networks
State-Space Model Analysis: Testing fundamental limitations of linear recurrent architectures
Transformer Architecture Studies: Investigating attention mechanism constraints
Mathematical Reasoning: Benchmarking symbolic manipulation capabilities
Generalization Studies: Cross-length and cross-group generalization patterns
Representation Learning: Understanding how models encode algebraic structures

Citation

When using this dataset in academic work, please cite:

@dataset{gass2024permutation,
  author = {Gass, Bryan},
  title = {Group Theory Collection},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/BeeGass/Group-Theory-Collection},
  note = {Organized by computational complexity classes (TC⁰/NC¹)}
}

@software{gass2024generator,
  author = {Gass, Bryan},
  title = {Group Dataset Generator},
  year = {2024},
  url = {https://github.com/BeeGass/Group-Dataset-Generator}
}

@article{merrill2024illusion,
  title = {The Illusion of State in State-Space Models},
  author = {Merrill, William and Jackson, Ashish and Goldstein, Yoav and Weiss, Gail and Angluin, Dana},
  journal = {arXiv preprint arXiv:2404.08819},
  year = {2024}
}

Acknowledgments

This dataset was inspired by the theoretical work of William Merrill and colleagues on "The Illusion of State in State-Space Models" (arXiv:2404.08819), which establishes fundamental computational limitations of state-space models through group-theoretic analysis.

License

This dataset is released under the MIT License.

Contact

For questions, issues, or contributions, please use the Hugging Face dataset repository's discussion forum or contact Bryan Gass directly.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support