Kseniase posted an update Jun 8
12 Foundational AI Model Types

Let’s refresh some fundamentals today to stay fluent in what we all work with. Here are some of the most popular model types that shape the vast world of AI (with examples in brackets):

1. LLM - Large Language Model (GPT, LLaMA) -> Large Language Models: A Survey (2402.06196)
+ history of LLMs: https://www.turingpost.com/t/The%20History%20of%20LLMs
Trained on massive text datasets to understand and generate human language. LLMs are mostly built on the Transformer architecture, predicting the next token, and they scale by increasing overall parameter count across all components (layers, attention heads, MLPs, etc.)
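
As a sketch of that next-token objective: below, a hand-written bigram table (all words and probabilities invented for illustration) stands in for a trained Transformer, and greedy decoding repeatedly appends the most probable next token.

```python
# Toy illustration of next-token prediction: a bigram table stands in
# for a trained Transformer. All probabilities here are made up.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(prompt: str, max_new_tokens: int = 3) -> str:
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1])
        if dist is None:  # no known continuation: stop
            break
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```

A real LLM replaces the lookup table with a neural network over its whole context, but the generation loop is the same shape.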

2. SLM - Small Language Model (TinyLLaMA, Phi models, SmolLM) -> A Survey of Small Language Models (2410.20011)
A lightweight LM optimized for efficiency, low memory use, fast inference, and edge use. SLMs work on the same principles as LLMs

3. VLM - Vision-Language Model (CLIP, Flamingo) -> An Introduction to Vision-Language Modeling (2405.17247)
Processes and understands both images and text. VLMs map images and text into a shared embedding space or generate captions/descriptions from both
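
The shared-embedding-space idea can be sketched with hand-picked toy vectors standing in for the outputs of real image and text encoders (everything below is invented for illustration; a model like CLIP learns these vectors):

```python
import math

# Toy shared embedding space: in a real VLM like CLIP these vectors come
# from trained image and text encoders; here they are hand-picked.
IMAGE_EMB = {"photo_of_cat": [0.9, 0.1, 0.0]}
TEXT_EMB = {
    "a cat":   [0.8, 0.2, 0.1],
    "a plane": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_caption(image_key: str) -> str:
    # Retrieval: the caption whose embedding is closest to the image's.
    img = IMAGE_EMB[image_key]
    return max(TEXT_EMB, key=lambda t: cosine(img, TEXT_EMB[t]))

print(best_caption("photo_of_cat"))  # a cat
```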

4. MLLM - Multimodal Large Language Model (Gemini) -> A Survey on Multimodal Large Language Models (2306.13549)
A large-scale model that can understand and process multiple types of data (modalities) — usually text + other formats, like images, videos, audio, structured data, 3D or spatial inputs. MLLMs can be LLMs extended with modality adapters or trained jointly across vision, text, audio, etc.
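
The modality-adapter idea can be sketched as follows, with trivial stand-in encoders (every function here is an invented toy): each input type gets its own small encoder that projects into one shared token sequence for the language-model backbone.

```python
# Sketch of the "modality adapter" idea behind many MLLMs: each modality
# is projected into the backbone's shared embedding space, then processed
# as one token sequence. Both encoders below are invented stand-ins.
def encode_text(s: str) -> list[float]:
    return [float(ord(c) % 7) for c in s]      # toy per-character text tokens

def image_adapter(pixels: list[float]) -> list[float]:
    return [sum(pixels) / len(pixels)]         # toy one-token image embedding

def mllm_input(text: str, pixels: list[float]) -> list[float]:
    # The fused sequence is what the LLM backbone would consume.
    return encode_text(text) + image_adapter(pixels)

seq = mllm_input("hi", [0.2, 0.4, 0.6])
print(len(seq))  # 3 tokens: 2 text + 1 image
```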

5. LAM - Large Action Model (InstructDiffusion, RT-2) -> Large Action Models: From Inception to Implementation (2412.10047)
Understands and generates action sequences by predicting action tokens (discrete/continuous instructions) that guide agents. Trained on behavior datasets, LAMs generalize across tasks, environments, and modalities - video, sensor data, etc.
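
Action-token generation can be illustrated with a toy loop where a hand-written lookup table (all states and actions invented) stands in for the learned policy network:

```python
# Toy "action model": instead of words, the model emits action tokens
# that drive an agent. The policy and transition tables are invented;
# a real LAM predicts actions with a network trained on behavior data.
POLICY = {"at_door": "open", "door_open": "walk_through", "inside": "stop"}
TRANSITIONS = {("at_door", "open"): "door_open",
               ("door_open", "walk_through"): "inside"}

def act(state: str) -> list[str]:
    """Roll out action tokens until the policy emits 'stop'."""
    actions = []
    while True:
        a = POLICY[state]
        actions.append(a)
        if a == "stop":
            return actions
        state = TRANSITIONS[(state, a)]

print(act("at_door"))  # ['open', 'walk_through', 'stop']
```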

Read about LRM, MoE, SSM, RNN, CNN, SAM and LNN below👇

Also, subscribe to the Turing Post: https://www.turingpost.com/subscribe
6. LRM - Large Reasoning Model (DeepSeek-R1, OpenAI's o3) -> https://huggingface.co/papers/2501.09686
Advanced AI systems specifically optimized for multi-step logical reasoning, complex problem-solving, and structured thinking. LRMs incorporate test-time scaling, Chain-of-Thought reasoning, tool use, external memory, strong math and code capabilities, and a more modular design for reliable decision-making
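
One test-time-scaling trick, self-consistency (sample several reasoning chains, then majority-vote the final answer), can be sketched with a deterministic stand-in solver (the solver and its error pattern are invented for illustration):

```python
# Self-consistency in miniature: sample several "reasoning chains" and
# majority-vote the answer. The solver below is an invented stand-in for
# an LRM's sampled chains; it errs deterministically on every 3rd sample.
def noisy_solver(i: int) -> int:
    # Correct answer is 4; every third sampled chain makes an error.
    return 3 if i % 3 == 0 else 4

def solve_with_voting(samples: int = 9) -> int:
    answers = [noisy_solver(i) for i in range(samples)]
    return max(set(answers), key=answers.count)  # majority vote

print(solve_with_voting())  # 4 - the vote outweighs the faulty chains
```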

7. MoE - Mixture of Experts (e.g. Mixtral) -> https://www.turingpost.com/p/moe
Uses many sub-networks called experts, but activates only a few per input, enabling massive scaling with sparse computation
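
Top-k routing can be sketched in a few lines, with toy lambda "experts" and invented gate logits (a real MoE layer learns the gate and runs full sub-networks):

```python
import math

# Sketch of sparse Mixture-of-Experts routing: a gate scores every expert
# but only the top-k are actually run. Experts and gate logits are toys.
EXPERTS = {
    0: lambda x: x + 1.0,   # stand-ins for expert sub-networks
    1: lambda x: x * 2.0,
    2: lambda x: x - 0.5,
}
GATE_LOGITS = {0: 0.1, 1: 2.0, 2: 1.5}  # per-expert scores for this input

def moe_forward(x: float, top_k: int = 2) -> float:
    # Pick the top-k experts by gate logit ...
    chosen = sorted(GATE_LOGITS, key=GATE_LOGITS.get, reverse=True)[:top_k]
    # ... softmax only over the chosen logits (the sparse part) ...
    z = sum(math.exp(GATE_LOGITS[e]) for e in chosen)
    # ... and mix the chosen experts' outputs by normalized gate weight.
    return sum(math.exp(GATE_LOGITS[e]) / z * EXPERTS[e](x) for e in chosen)

print(moe_forward(3.0))  # experts 1 and 2 are run; expert 0 is skipped
```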

8. SSM - State Space Model (Mamba, RetNet) -> https://huggingface.co/papers/2111.00396
+ our overview of SSMs and Mamba: https://www.turingpost.com/p/mamba
A neural network that models the sequence as a continuous dynamical system, describing how hidden state vectors change in response to inputs over time. SSMs are parallelizable and efficient for long contexts
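
The discretized recurrence behind SSMs can be shown in the scalar case (A, B, C below are arbitrary constants; a real SSM like S4 or Mamba learns them, derived from a continuous-time system):

```python
# Minimal discretized state-space recurrence, scalar case:
#   h_t = A * h_{t-1} + B * x_t    (state update)
#   y_t = C * h_t                  (readout)
# A, B, C are arbitrary constants here; a real SSM learns them.
A, B, C = 0.5, 1.0, 2.0

def ssm_scan(xs: list[float]) -> list[float]:
    h, ys = 0.0, []
    for x in xs:
        h = A * h + B * x
        ys.append(C * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0]))  # [2.0, 1.0, 0.5] - the input decays through the state
```

Because the recurrence is linear, it can also be computed in parallel as a convolution, which is where the long-context efficiency comes from.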
9. RNN - Recurrent Neural Network (advanced variants: LSTM, GRU) -> https://huggingface.co/papers/1912.05911
+ detailed article about LSTM: https://www.turingpost.com/p/xlstm
Processes sequences one step at a time, passing information through a hidden state that acts as memory. RNNs were widely used in early NLP and time-series tasks but struggle with long-range dependencies compared to newer architectures
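
A single-unit vanilla RNN step makes the hidden-state memory concrete (the weights below are arbitrary constants; a trained RNN would learn them):

```python
import math

# One-unit vanilla RNN: the hidden state h carries memory across steps.
# Weights are arbitrary constants, not trained values.
W_X, W_H, BIAS = 1.0, 0.5, 0.0

def rnn(xs: list[float]) -> float:
    h = 0.0
    for x in xs:
        h = math.tanh(W_X * x + W_H * h + BIAS)  # new state mixes input and memory
    return h

# The final state depends on the whole sequence, not just the last input:
print(rnn([1.0, 0.0]))  # nonzero - the earlier 1.0 survives in the hidden state
print(rnn([0.0, 0.0]))  # 0.0
```

Note how the earlier input only reaches the output through repeated tanh squashing, which is exactly why long-range dependencies fade.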
10. CNN - Convolutional Neural Network (MobileNet, EfficientNet) -> https://huggingface.co/papers/1511.08458
Automatically learns patterns from visual data, using convolutional layers to detect features like edges, textures, or shapes. Less dominant now, but still used in edge applications and visual processing
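
The core convolution idea fits in a few lines: sliding a small kernel over the input. A [-1, 1] kernel over a toy 1-D "brightness" signal acts as an edge detector (in a CNN the kernel values are learned, not hand-picked like here):

```python
# A single 1-D convolution with kernel [-1, 1] acts as an edge detector:
# the output is large exactly where neighboring values jump.
def conv1d(signal: list[float], kernel: list[float]) -> list[float]:
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 0, 9, 9, 9]        # a step "edge" in brightness
print(conv1d(signal, [-1, 1]))     # [0, 0, 9, 0, 0] - spike at the edge
```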

11. SAM - Segment Anything Model (developed by Meta AI) -> https://huggingface.co/papers/2304.02643
A foundation model trained on over 1 billion segmentation masks. Given a prompt (like a point or box), it segments the relevant object
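
Only SAM's interface is mimicked below: a point prompt goes in, a binary mask comes out. A flood fill over a toy binary image is a deliberately crude stand-in for SAM's actual ViT-based mask decoder:

```python
# Toy version of SAM's interface: point prompt in, binary mask out.
# The real model is a large neural network; flood fill over a binary
# image merely mimics the prompt -> mask behavior for illustration.
def segment_from_point(image, seed):
    rows, cols = len(image), len(image[0])
    mask = [[0] * cols for _ in range(rows)]
    stack, target = [seed], image[seed[0]][seed[1]]
    while stack:
        r, c = stack.pop()
        if 0 <= r < rows and 0 <= c < cols and not mask[r][c] and image[r][c] == target:
            mask[r][c] = 1  # grow the mask over the connected region
            stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return mask

image = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 0]]
print(segment_from_point(image, (0, 2)))  # mask covers the connected "1" blob
```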

12. LNN - Liquid Neural Network (LFMs - Liquid Foundation Models by Liquid AI) -> https://arxiv.org/pdf/2006.04439
+ more about LFMs: https://www.turingpost.com/p/liquidhyena
LNNs use differential equations to model neuronal dynamics, adapting their behavior in real time. They continuously update their internal state, which makes them well suited to time-series data, robotics, and real-world decision making
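
The ODE view can be sketched with a single "liquid-style" neuron integrated by Euler steps (tau, w, and dt below are invented constants; in a real LNN these are learned, and the time constant can itself depend on the input):

```python
import math

# Liquid-style neuron: the state follows an ODE,
#   dh/dt = -h / tau + tanh(w * x),
# integrated here with simple Euler steps. tau, w, dt are invented
# constants; a real LNN learns them (and tau can depend on the input).
TAU, W, DT = 1.0, 1.0, 0.1

def liquid_step(h: float, x: float) -> float:
    dh = -h / TAU + math.tanh(W * x)  # leak toward 0, driven by the input
    return h + DT * dh

h = 0.0
for x in [1.0] * 20:   # constant input: the state rises toward equilibrium
    h = liquid_step(h, x)
print(round(h, 3))
```

Because the state evolves in continuous time rather than in fixed discrete jumps, the same neuron naturally handles irregularly sampled signals.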

Honestly, I don't understand why we are trying to cram everything into one model instead of building them in slices: a slice for vision, a slice for interaction, a slice for art or image generation, a slice for audio or voice, a slice for math, a slice for history, and so on and so forth
