# Multi-Scale Multimodal Pose HAR
A custom PyTorch model for Human Activity Recognition (HAR) that integrates:
- Short-term pose transformers (factorized temporal + spatial attention)
- Long-term temporal aggregation
- Optional multimodal fusion with RGB images
- Multi-stage self-supervised + supervised training pipeline
## Model Overview

### Architecture Summary

**Pose stream**
- Input: `(B, L, T, J, C)`
- Short-term encoder: `PoseFormerFactorized`
  - Temporal attention (per joint)
  - Spatial attention (per frame)
- Long-term encoder: Transformer over segment-level features
**Image stream (optional)**
- Backbone: ResNet18 / ResNet50
- Temporal pooling per segment
**Fusion**

- `concat` (default): feature concatenation + MLP
- `xattn`: shallow cross-attention (pose tokens → image token); a minimal sketch of this pattern appears after the Output list below
**Output**
- Activity classification logits
- Optional intermediate embeddings / tokens
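
The internals of `MMFusionCrossAttnShallow` are not reproduced here, so the following is only a minimal sketch of the general pattern, assuming flattened pose tokens as queries and one pooled image token as key/value; the class name, dimensions, and head count are illustrative, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class ShallowCrossAttnFusionSketch(nn.Module):
    """Illustrative stand-in for shallow cross-attention fusion:
    pose tokens attend to a single pooled image token."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pose_tokens: torch.Tensor, img_token: torch.Tensor) -> torch.Tensor:
        # pose_tokens: (B, N, D) -- flattened pose tokens
        # img_token:   (B, 1, D) -- pooled image feature
        fused, _ = self.attn(query=pose_tokens, key=img_token, value=img_token)
        return self.norm(pose_tokens + fused)  # residual + LayerNorm

# Dummy shapes: 2 clips, 30 frames x 17 joints of pose tokens, dim 128
fusion = ShallowCrossAttnFusionSketch()
out = fusion(torch.randn(2, 30 * 17, 128), torch.randn(2, 1, 128))
```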
## Model Components
| Module | Description |
|---|---|
| `PoseFormerFactorized` | Short-term pose transformer |
| `LongTermTemporalBlock` | Long-range temporal modeling |
| `ImageEncoder` | CNN-based RGB feature extractor |
| `MMFusionConcatLN` | Concatenation-based multimodal fusion |
| `MMFusionCrossAttnShallow` | Cross-attention multimodal fusion |
| `SSLHeads` | Contrastive, reconstruction, and temporal-order heads |
## Input Format

### Pose Input

```
(B, L, T, J, C)
```

- `B`: batch size
- `L`: number of temporal segments
- `T`: frames per segment
- `J`: number of joints (e.g. 17)
- `C`: joint channels (2 for 2D, 3 for 3D poses)
### Image Input (optional)

```
(B, L, T, 3, H, W)
```

- `3`: RGB channels
- `H`: image height
- `W`: image width
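
For concreteness, here is a minimal sketch constructing dummy inputs with these shapes; the sizes below are arbitrary examples, not the training configuration:

```python
import torch

B, L, T, J, C = 2, 4, 30, 17, 3  # example sizes only
H = W = 224                      # example image resolution

pose_seq = torch.randn(B, L, T, J, C)    # pose input
img_seq = torch.randn(B, L, T, 3, H, W)  # optional RGB input
```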
## Usage

### 1. Load Model Code
```python
from model_har_final import (
    PoseFormerFactorized,
    MultiScaleTemporalModel,
)
```
### 2. Build Model
```python
pose_backbone = PoseFormerFactorized(
    joints=17,
    in_ch=3,
    dim=128,
    layers=4,
    num_classes=6,
    return_tokens=True,
)

model = MultiScaleTemporalModel(
    short_seq_model=pose_backbone,
    num_classes=6,
    enable_long_term=True,
    multimodal=True,
    fusion_mode="concat",  # or "xattn"
)
```
### 3. Load Weights
```python
import torch

ckpt = torch.load("best_stage3_dual_sched.pth", map_location="cpu")
model.load_state_dict(ckpt)
model.eval()
```
> **Note:** The checkpoint is a plain `state_dict`, not a pickle-serialized model object, for maximum compatibility. On recent PyTorch versions you can also pass `weights_only=True` to `torch.load` for safer deserialization.
### 4. Inference
```python
with torch.no_grad():
    logits = model(pose_seq, img_seq)
    preds = logits.argmax(dim=-1)
```
## Training Strategy
The model is trained in three stages:
### Stage 1: Pose SSL + Weak Supervision

- Masked joint modeling (MJM)
- Contrastive learning (InfoNCE); a loss sketch follows this list
- Temporal order prediction
- Optional labeled supervision
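
The actual SSL objectives live in `SSLHeads`, which is not reproduced here. As an illustration only, a standard InfoNCE loss over two augmented views of the same clips can be written as below; the temperature and embedding size are assumed values, not the repository's:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE: matching rows of z1/z2 are positive pairs.

    z1, z2: (B, D) embeddings of two augmentations of the same clips.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```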
### Stage 2: Pose-only SSL Refinement

- Contrastive + reconstruction losses
- Temporal attention disabled for stability
### Stage 3: Multimodal Supervised Fine-tuning

- Label-smoothing cross-entropy
- Semantic prototype distillation
- Metric learning (triplet); see the sketch after this list
- Optional knowledge distillation
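
Triplet mining details are not specified here; below is a minimal sketch of the metric-learning term using PyTorch's built-in triplet loss, with an assumed margin and embedding size:

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

anchor = torch.randn(8, 128)    # embeddings of anchor clips
positive = torch.randn(8, 128)  # clips of the same activity
negative = torch.randn(8, 128)  # clips of a different activity

loss = triplet(anchor, positive, negative)
```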
## Outputs
| Output | Shape |
|---|---|
| Logits | `(B, num_classes)` |
| Pose embedding | `(B, D)` |
| Pose tokens (optional) | `(B, T, J, D)` |
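
The pose embedding can be used beyond classification, e.g. for nearest-neighbor retrieval. A minimal sketch, assuming you have already collected a gallery of `(N, D)` embeddings (how the embedding is exposed by the forward pass depends on the model configuration):

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for real embeddings.
query = F.normalize(torch.randn(1, 128), dim=-1)      # (1, D) query clip
gallery = F.normalize(torch.randn(100, 128), dim=-1)  # (N, D) stored clips

scores = query @ gallery.t()     # cosine similarities, (1, N)
nearest = scores.argmax(dim=-1)  # index of the most similar clip
```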
## Limitations

- Not compatible with `AutoModel.from_pretrained`
- Requires custom code to instantiate the architecture
- Input pose format must match the training configuration
## License
Apache License 2.0
## Citation
If you use this work, please cite:
```bibtex
@misc{kim2025multiscalehar,
  title        = {Multi-Scale Multimodal Pose Transformer for Human Activity Recognition},
  author       = {Minjae Kim},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/m97j/har-safety-model}}
}
```