Alex
aslessor
AI & ML interests
None yet
Recent Activity
liked a model 14 days ago
lightonai/LightOnOCR-1B-1025
updated a collection about 1 month ago
Document conversion
updated a collection 2 months ago
Datasets
Organizations
None yet
Prompts
CoT
Agents
- AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
  Paper • 2410.20424 • Published • 40
- WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
  Paper • 2502.08047 • Published • 28
- Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems
  Paper • 2502.11098 • Published • 13
- EnvX: Agentize Everything with Agentic AI
  Paper • 2509.08088 • Published • 8
Text to image papers
- UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
  Paper • 2311.09257 • Published • 48
- VideoPoet: A Large Language Model for Zero-Shot Video Generation
  Paper • 2312.14125 • Published • 47
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
  Paper • 2312.16862 • Published • 31
- VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
  Paper • 2401.01256 • Published • 21
Vision
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  Paper • 2406.16860 • Published • 63
- Understanding Alignment in Multimodal LLMs: A Comprehensive Study
  Paper • 2407.02477 • Published • 24
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos
  Paper • 2408.10188 • Published • 52
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 133
Evaluation
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
  Paper • 2408.00765 • Published • 14
- Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent
  Paper • 2407.21646 • Published • 18
- LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection
  Paper • 2408.04284 • Published • 26
- Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability
  Paper • 2408.07852 • Published • 16
Speech
RAG
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
  Paper • 2409.12941 • Published • 24
- LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
  Paper • 2411.19638 • Published • 6
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
  Paper • 2412.02592 • Published • 24
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
  Paper • 2412.10704 • Published • 16
UI-to-Code
Image
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
  Paper • 2506.18095 • Published • 66
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
  Paper • 2507.04590 • Published • 16
- Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
  Paper • 2509.00428 • Published • 17
Medical
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 104
- MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
  Paper • 2502.19634 • Published • 63
- MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications
  Paper • 2409.07314 • Published • 56
- On the Compositional Generalization of Multimodal LLMs for Medical Imaging
  Paper • 2412.20070 • Published • 45
Synthetic Data
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
  Paper • 2410.09732 • Published • 55
- How to Synthesize Text Data without Model Collapse?
  Paper • 2412.14689 • Published • 52
- WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
  Paper • 2501.18511 • Published • 20
- Synthetic Data RL: Task Definition Is All You Need
  Paper • 2505.17063 • Published • 10
Datasets
- MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
  Paper • 2405.07526 • Published • 21
- nabla^2DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
  Paper • 2406.14347 • Published • 102
- SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
  Paper • 2509.09676 • Published • 31
Audio
- FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
  Paper • 2407.04051 • Published • 39
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
  Paper • 2408.16532 • Published • 50
- PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
  Paper • 2409.10831 • Published • 5
Video
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
  Paper • 2408.16767 • Published • 32
- DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
  Paper • 2411.16657 • Published • 20
- Autoregressive Video Generation without Vector Quantization
  Paper • 2412.14169 • Published • 14
- Progressive Multimodal Reasoning via Active Retrieval
  Paper • 2412.14835 • Published • 73
Fine tuning
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
  Paper • 2409.11355 • Published • 31
- Training Language Models to Self-Correct via Reinforcement Learning
  Paper • 2409.12917 • Published • 140
- Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs
  Paper • 2409.14988 • Published • 23
- LLM Pretraining with Continuous Concepts
  Paper • 2502.08524 • Published • 29
Document conversion