NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

NucEL is a nucleotide language model for DNA sequence analysis. It is pre-trained at single-nucleotide resolution with an ELECTRA-style objective, provides embeddings for DNA sequences, and can be fine-tuned for a variety of downstream genomic tasks.

Model Details

  • Model Type: Transformer-based sequence model
  • Domain: Genomics and Nucleotide Sequences
  • Architecture: Based on the ModernBERT architecture, adapted for nucleotide sequences
  • Parameters: 92.3M (F32 weights, Safetensors format)
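ELECTRA-style pre-training trains a discriminator to detect which positions in a corrupted sequence were replaced by a small generator, rather than predicting masked tokens directly. The sketch below shows how replaced-token-detection (RTD) labels are constructed at single-nucleotide resolution; it is illustrative only, and the `make_rtd_labels` helper is not part of the NucEL API:

```python
# Sketch of ELECTRA-style replaced-token detection (RTD) label construction.
# Illustrative only: NucEL's actual generator/discriminator setup may differ.

def make_rtd_labels(original, corrupted):
    """Label each position 1 if the nucleotide was replaced, else 0."""
    assert len(original) == len(corrupted)
    return [int(o != c) for o, c in zip(original, corrupted)]

original = "ATCGATCG"
corrupted = "ATCGATAG"  # a generator replaced position 6 (C -> A)
labels = make_rtd_labels(original, corrupted)
print(labels)  # [0, 0, 0, 0, 0, 0, 1, 0]
```

The discriminator is trained with a binary loss on these per-position labels, which gives a learning signal at every nucleotide instead of only at masked positions.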

Features

  • Nucleotide-level tokenization and embedding
  • Pre-trained on human genome
  • Optimized for biological sequence understanding
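Single-nucleotide tokenization maps each base to its own token, rather than grouping bases into k-mers or BPE chunks. A minimal sketch of the idea (the vocabulary and IDs here are illustrative; the actual NucEL tokenizer, loaded via `AutoTokenizer`, defines its own vocabulary and special tokens):

```python
# Illustrative single-nucleotide tokenizer: one token per base.
# The real NucEL tokenizer assigns its own IDs and special tokens.

VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "A": 3, "C": 4, "G": 5, "T": 6}

def tokenize(sequence):
    """Wrap a DNA string in [CLS]/[SEP] and map each base to a token ID."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(base, VOCAB["[UNK]"]) for base in sequence.upper()]
    ids.append(VOCAB["[SEP]"])
    return ids

print(tokenize("ACGT"))  # [0, 3, 4, 5, 6, 1]
```

Because every base is its own token, model outputs align one-to-one with genomic positions, which is what makes per-nucleotide interpretation possible.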

Usage

Basic Usage

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)

# Example DNA sequence
sequence = "ATCGATCGATCGATCG"

# Tokenize and encode
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)

# Get per-nucleotide embeddings: (batch, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
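To reduce the per-nucleotide embeddings above to a single fixed-size vector per sequence (e.g. for classification or retrieval), a common approach is attention-mask-weighted mean pooling. A pure-Python sketch of the computation (illustrative; in practice you would apply the same logic to `outputs.last_hidden_state` with `torch`):

```python
# Attention-mask-weighted mean pooling, shown with plain Python lists.
# In practice, apply the same computation to outputs.last_hidden_state with torch.

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings over positions where attention_mask == 1."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Three token embeddings (hidden size 2); the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # [2.0, 3.0]
```

Masking out padding positions matters when sequences in a batch have different lengths; otherwise pad embeddings dilute the pooled vector.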

Installation

```shell
pip install transformers torch
# Install any additional dependencies for your specific use case
```

Requirements

  • transformers >= 4.21.0
  • torch >= 1.9.0
  • Python >= 3.7

Citation

If you use NucEL in your research, please cite:

```bibtex
@misc{nucel2025,
  title={NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations},
  author={Ding, Ke and Parker, Brian and Wen, Jiayu},
  year={2025},
  howpublished={\url{https://huggingface.co/FreakingPotato/NucEL}}
}
```

License

This model is released under the Apache 2.0 License.
