Video-Text-to-Text
Transformers
Safetensors
sam2
English
vica_qwen
text-generation
multimodal
vision-language
video understanding
visuospatial cognition
spatial reasoning
vlm
llava
qwen
siglip
hiera
dual-encoder
Instructions to use nkkbr/ViCA2-stage2-onevision-ft with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use nkkbr/ViCA2-stage2-onevision-ft with Transformers:
```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("nkkbr/ViCA2-stage2-onevision-ft", dtype="auto")
```
- sam2
How to use nkkbr/ViCA2-stage2-onevision-ft with sam2:
```python
# Use SAM2 with images
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("nkkbr/ViCA2-stage2-onevision-ft")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

```python
# Use SAM2 with videos
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("nkkbr/ViCA2-stage2-onevision-ft")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points(state, <your_prompts>)

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```
- Notebooks
- Google Colab
- Kaggle
- Xet hash: c23306a5e6f6b7eb36779848836106899d6332f3618f85e67c4b1799f0610636
- Size of remote file: 7.86 kB
- SHA256: b09586d488f17decf92f3a95b02067357ed083b3bd7bc83a658eb0b5815b1c30
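A downloaded copy of the remote file can be checked against the SHA256 value listed above. The following is a minimal sketch using only Python's standard `hashlib`. The helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local path is whatever location you saved the file to; the page does not name it.

```python
import hashlib
from pathlib import Path

# Digest copied from the SHA256 field listed above.
EXPECTED_SHA256 = "b09586d488f17decf92f3a95b02067357ed083b3bd7bc83a658eb0b5815b1c30"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Hypothetical helper: True if the file's digest equals `expected`."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected` with the path of your local copy returns `True` only when the bytes on disk hash to the listed digest; a `False` result suggests a truncated or altered download.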
Xet efficiently stores Large Files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads.