Miguel Rivas
rivasmig
AI & ML interests
ai
Recent Activity
updated a collection 11 days ago: Methods
upvoted a paper 17 days ago: Scaling RL to Long Videos
updated a collection 18 days ago: Simulations
Organizations
None yet
Simulations
- MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
  Paper • 2504.07830 • Published • 18
- WORLDMEM: Long-term Consistent World Simulation with Memory
  Paper • 2504.12369 • Published • 34
- Towards a Unified Copernicus Foundation Model for Earth Vision
  Paper • 2503.11849 • Published • 4
- VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
  Paper • 2506.18903 • Published • 22
Medical
Methods
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
  Paper • 2411.04952 • Published • 30
- Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models
  Paper • 2411.05005 • Published • 13
- M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
  Paper • 2411.04075 • Published • 17
- Self-Consistency Preference Optimization
  Paper • 2411.04109 • Published • 19
Utility
- StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
  Paper • 2411.05738 • Published • 15
- A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents
  Paper • 2410.22476 • Published • 29
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 51
- Training-free Regional Prompting for Diffusion Transformers
  Paper • 2411.02395 • Published • 26
Datasets
- SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
  Paper • 2504.09081 • Published • 17
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
  Paper • 2504.13180 • Published • 17
- Sekai: A Video Dataset towards World Exploration
  Paper • 2506.15675 • Published • 65
- WorldVLA: Towards Autoregressive Action World Model
  Paper • 2506.21539 • Published • 39
Discussions
- Scaling Laws for Native Multimodal Models
  Paper • 2504.07951 • Published • 29
- Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
  Paper • 2504.08003 • Published • 49
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
  Paper • 2504.11468 • Published • 29
- Towards Learning to Complete Anything in Lidar
  Paper • 2504.12264 • Published • 10
VLMs
- Task Vectors are Cross-Modal
  Paper • 2410.22330 • Published • 11
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
  Paper • 2502.01341 • Published • 39
- DASH: Detection and Assessment of Systematic Hallucinations of VLMs
  Paper • 2503.23573 • Published • 13
- Kimi-VL Technical Report
  Paper • 2504.07491 • Published • 133
Psychology
- What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
  Paper • 2410.23743 • Published • 64
- Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
  Paper • 2411.03562 • Published • 69
- Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
  Paper • 2411.03884 • Published • 29
- MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
  Paper • 2502.00698 • Published • 24