Kseniase posted an update Jun 8
12 Foundational AI Model Types

Let’s refresh some fundamentals today to stay fluent in what we all work with. Here are some of the most popular model types that shape the vast world of AI (with examples in brackets):

1. LLM - Large Language Model (GPT, LLaMA) -> Large Language Models: A Survey (2402.06196)
+ history of LLMs: https://www.turingpost.com/t/The%20History%20of%20LLMs
Trained on massive text datasets to understand and generate human language. LLMs are mostly built on the Transformer architecture, predicting the next token, and they scale by increasing overall parameter count across all components (layers, attention heads, MLPs, etc.)
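
As a sketch of that next-token objective: below, a hand-written bigram table (all words and probabilities invented for illustration) stands in for a trained Transformer, and greedy decoding repeatedly appends the most probable next token.

```python
# Toy illustration of next-token prediction: a bigram table stands in
# for a trained Transformer. All probabilities here are made up.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 1.0},
}

def generate(prompt: str, max_new_tokens: int = 3) -> str:
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        dist = BIGRAM_PROBS.get(tokens[-1])
        if dist is None:  # no known continuation: stop
            break
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```

A real LLM replaces the lookup table with a neural network over its whole context, but the generation loop is the same shape.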

2. SLM - Small Language Model (TinyLLaMA, Phi models, SmolLM) -> A Survey of Small Language Models (2410.20011)
A lightweight LM optimized for efficiency, low memory use, fast inference, and edge use. SLMs work on the same principles as LLMs

3. VLM - Vision-Language Model (CLIP, Flamingo) -> An Introduction to Vision-Language Modeling (2405.17247)
Processes and understands both images and text. VLMs map images and text into a shared embedding space or generate captions/descriptions from both
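
The shared-embedding-space idea can be sketched with hand-picked toy vectors standing in for the outputs of real image and text encoders (everything below is invented for illustration; a model like CLIP learns these vectors):

```python
import math

# Toy shared embedding space: in a real VLM like CLIP these vectors come
# from trained image and text encoders; here they are hand-picked.
IMAGE_EMB = {"photo_of_cat": [0.9, 0.1, 0.0]}
TEXT_EMB = {
    "a cat":   [0.8, 0.2, 0.1],
    "a plane": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_caption(image_key: str) -> str:
    # Retrieval: the caption whose embedding is closest to the image's.
    img = IMAGE_EMB[image_key]
    return max(TEXT_EMB, key=lambda t: cosine(img, TEXT_EMB[t]))

print(best_caption("photo_of_cat"))  # a cat
```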

4. MLLM - Multimodal Large Language Model (Gemini) -> A Survey on Multimodal Large Language Models (2306.13549)
A large-scale model that can understand and process multiple types of data (modalities) — usually text + other formats, like images, videos, audio, structured data, 3D or spatial inputs. MLLMs can be LLMs extended with modality adapters or trained jointly across vision, text, audio, etc.
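
The modality-adapter idea can be sketched as follows, with trivial stand-in encoders (every function here is an invented toy): each input type gets its own small encoder that projects into one shared token sequence for the language-model backbone.

```python
# Sketch of the "modality adapter" idea behind many MLLMs: each modality
# is projected into the backbone's shared embedding space, then processed
# as one token sequence. Both encoders below are invented stand-ins.
def encode_text(s: str) -> list[float]:
    return [float(ord(c) % 7) for c in s]      # toy per-character text tokens

def image_adapter(pixels: list[float]) -> list[float]:
    return [sum(pixels) / len(pixels)]         # toy one-token image embedding

def mllm_input(text: str, pixels: list[float]) -> list[float]:
    # The fused sequence is what the LLM backbone would consume.
    return encode_text(text) + image_adapter(pixels)

seq = mllm_input("hi", [0.2, 0.4, 0.6])
print(len(seq))  # 3 tokens: 2 text + 1 image
```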

5. LAM - Large Action Model (InstructDiffusion, RT-2) -> Large Action Models: From Inception to Implementation (2412.10047)
Understands and generates action sequences by predicting action tokens (discrete/continuous instructions) that guide agents. Trained on behavior datasets, LAMs generalize across tasks, environments, and modalities - video, sensor data, etc.
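
Action-token generation can be illustrated with a toy loop where a hand-written lookup table (all states and actions invented) stands in for the learned policy network:

```python
# Toy "action model": instead of words, the model emits action tokens
# that drive an agent. The policy and transition tables are invented;
# a real LAM predicts actions with a network trained on behavior data.
POLICY = {"at_door": "open", "door_open": "walk_through", "inside": "stop"}
TRANSITIONS = {("at_door", "open"): "door_open",
               ("door_open", "walk_through"): "inside"}

def act(state: str) -> list[str]:
    """Roll out action tokens until the policy emits 'stop'."""
    actions = []
    while True:
        a = POLICY[state]
        actions.append(a)
        if a == "stop":
            return actions
        state = TRANSITIONS[(state, a)]

print(act("at_door"))  # ['open', 'walk_through', 'stop']
```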

Read about LRM, MoE, SSM, RNN, CNN, SAM and LNN below👇

Also, subscribe to the Turing Post: https://www.turingpost.com/subscribe
6. LRM - Large Reasoning Model (DeepSeek-R1, OpenAI's o3) -> https://huggingface.co/papers/2501.09686
Advanced AI systems specifically optimized for multi-step logical reasoning, complex problem-solving, and structured thinking. LRMs incorporate test-time scaling, Chain-of-Thought reasoning, tool use, external memory, strong math and code capabilities, and a more modular design for reliable decision-making
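
One test-time-scaling trick, self-consistency (sample several reasoning chains, then majority-vote the final answer), can be sketched with a deterministic stand-in solver (the solver and its error pattern are invented for illustration):

```python
# Self-consistency in miniature: sample several "reasoning chains" and
# majority-vote the answer. The solver below is an invented stand-in for
# an LRM's sampled chains; it errs deterministically on every 3rd sample.
def noisy_solver(i: int) -> int:
    # Correct answer is 4; every third sampled chain makes an error.
    return 3 if i % 3 == 0 else 4

def solve_with_voting(samples: int = 9) -> int:
    answers = [noisy_solver(i) for i in range(samples)]
    return max(set(answers), key=answers.count)  # majority vote

print(solve_with_voting())  # 4 - the vote outweighs the faulty chains
```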

7. MoE - Mixture of Experts (e.g. Mixtral) -> https://www.turingpost.com/p/moe
Uses many sub-networks called experts, but activates only a few per input, enabling massive scaling with sparse computation
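
Top-k routing can be sketched in a few lines, with toy lambda "experts" and invented gate logits (a real MoE layer learns the gate and runs full sub-networks):

```python
import math

# Sketch of sparse Mixture-of-Experts routing: a gate scores every expert
# but only the top-k are actually run. Experts and gate logits are toys.
EXPERTS = {
    0: lambda x: x + 1.0,   # stand-ins for expert sub-networks
    1: lambda x: x * 2.0,
    2: lambda x: x - 0.5,
}
GATE_LOGITS = {0: 0.1, 1: 2.0, 2: 1.5}  # per-expert scores for this input

def moe_forward(x: float, top_k: int = 2) -> float:
    # Pick the top-k experts by gate logit ...
    chosen = sorted(GATE_LOGITS, key=GATE_LOGITS.get, reverse=True)[:top_k]
    # ... softmax only over the chosen logits (the sparse part) ...
    z = sum(math.exp(GATE_LOGITS[e]) for e in chosen)
    # ... and mix the chosen experts' outputs by normalized gate weight.
    return sum(math.exp(GATE_LOGITS[e]) / z * EXPERTS[e](x) for e in chosen)

print(moe_forward(3.0))  # experts 1 and 2 are run; expert 0 is skipped
```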

8. SSM - State Space Model (Mamba, RetNet) -> https://huggingface.co/papers/2111.00396
+ our overview of SSMs and Mamba: https://www.turingpost.com/p/mamba
A neural network that models the sequence as a continuous dynamical system, describing how hidden state vectors change in response to inputs over time. SSMs are parallelizable and efficient for long contexts
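
The discretized recurrence behind SSMs can be shown in the scalar case (A, B, C below are arbitrary constants; a real SSM like S4 or Mamba learns them, derived from a continuous-time system):

```python
# Minimal discretized state-space recurrence, scalar case:
#   h_t = A * h_{t-1} + B * x_t    (state update)
#   y_t = C * h_t                  (readout)
# A, B, C are arbitrary constants here; a real SSM learns them.
A, B, C = 0.5, 1.0, 2.0

def ssm_scan(xs: list[float]) -> list[float]:
    h, ys = 0.0, []
    for x in xs:
        h = A * h + B * x
        ys.append(C * h)
    return ys

print(ssm_scan([1.0, 0.0, 0.0]))  # [2.0, 1.0, 0.5] - the input decays through the state
```

Because the recurrence is linear, it can also be computed in parallel as a convolution, which is where the long-context efficiency comes from.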
9. RNN - Recurrent Neural Network (advanced variants: LSTM, GRU) -> https://huggingface.co/papers/1912.05911
+ detailed article about LSTM: https://www.turingpost.com/p/xlstm
Processes sequences one step at a time, passing information through a hidden state that acts as memory. RNNs were widely used in early NLP and time-series tasks but struggle with long-range dependencies compared to newer architectures
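
A single-unit vanilla RNN step makes the hidden-state memory concrete (the weights below are arbitrary constants; a trained RNN would learn them):

```python
import math

# One-unit vanilla RNN: the hidden state h carries memory across steps.
# Weights are arbitrary constants, not trained values.
W_X, W_H, BIAS = 1.0, 0.5, 0.0

def rnn(xs: list[float]) -> float:
    h = 0.0
    for x in xs:
        h = math.tanh(W_X * x + W_H * h + BIAS)  # new state mixes input and memory
    return h

# The final state depends on the whole sequence, not just the last input:
print(rnn([1.0, 0.0]))  # nonzero - the earlier 1.0 survives in the hidden state
print(rnn([0.0, 0.0]))  # 0.0
```

Note how the earlier input only reaches the output through repeated tanh squashing, which is exactly why long-range dependencies fade.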
10. CNN - Convolutional Neural Network (MobileNet, EfficientNet) -> https://huggingface.co/papers/1511.08458
Automatically learns patterns from visual data, using convolutional layers to detect features like edges, textures, or shapes. Less dominant now, but still used in edge applications and visual processing
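
The core convolution idea fits in a few lines: sliding a small kernel over the input. A [-1, 1] kernel over a toy 1-D "brightness" signal acts as an edge detector (in a CNN the kernel values are learned, not hand-picked like here):

```python
# A single 1-D convolution with kernel [-1, 1] acts as an edge detector:
# the output is large exactly where neighboring values jump.
def conv1d(signal: list[float], kernel: list[float]) -> list[float]:
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 0, 9, 9, 9]        # a step "edge" in brightness
print(conv1d(signal, [-1, 1]))     # [0, 0, 9, 0, 0] - spike at the edge
```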

11. SAM - Segment Anything Model (developed by Meta AI) -> https://huggingface.co/papers/2304.02643
A foundation model trained on over 1 billion segmentation masks. Given a prompt (like a point or box), it segments the relevant object
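
Only SAM's interface is mimicked below: a point prompt goes in, a binary mask comes out. A flood fill over a toy binary image is a deliberately crude stand-in for SAM's actual ViT-based mask decoder:

```python
# Toy version of SAM's interface: point prompt in, binary mask out.
# The real model is a large neural network; flood fill over a binary
# image merely mimics the prompt -> mask behavior for illustration.
def segment_from_point(image, seed):
    rows, cols = len(image), len(image[0])
    mask = [[0] * cols for _ in range(rows)]
    stack, target = [seed], image[seed[0]][seed[1]]
    while stack:
        r, c = stack.pop()
        if 0 <= r < rows and 0 <= c < cols and not mask[r][c] and image[r][c] == target:
            mask[r][c] = 1  # grow the mask over the connected region
            stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return mask

image = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 0]]
print(segment_from_point(image, (0, 2)))  # mask covers the connected "1" blob
```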

12. LNN - Liquid Neural Network (LFMs - Liquid Foundation Models by Liquid AI) -> https://arxiv.org/pdf/2006.04439
+ more about LFMs: https://www.turingpost.com/p/liquidhyena
LNNs use differential equations to model neuronal dynamics, adapting their behavior in real time. They continuously update their internal state, which makes them well suited to time-series data, robotics, and real-world decision making
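
The ODE view can be sketched with a single "liquid-style" neuron integrated by Euler steps (tau, w, and dt below are invented constants; in a real LNN these are learned, and the time constant can itself depend on the input):

```python
import math

# Liquid-style neuron: the state follows an ODE,
#   dh/dt = -h / tau + tanh(w * x),
# integrated here with simple Euler steps. tau, w, dt are invented
# constants; a real LNN learns them (and tau can depend on the input).
TAU, W, DT = 1.0, 1.0, 0.1

def liquid_step(h: float, x: float) -> float:
    dh = -h / TAU + math.tanh(W * x)  # leak toward 0, driven by the input
    return h + DT * dh

h = 0.0
for x in [1.0] * 20:   # constant input: the state rises toward equilibrium
    h = liquid_step(h, x)
print(round(h, 3))
```

Because the state evolves in continuous time rather than in fixed discrete jumps, the same neuron naturally handles irregularly sampled signals.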

Honestly, I don't understand why we are trying to cram everything into one model instead of building them in slices: a slice for vision, a slice for interaction, a slice for art or image generation, a slice for audio or voice, a slice for math, a slice for history, and so on and so forth
