Jaward Sesay (Jaward)
405 followers · 24 following
https://github.com/Jaykef · JawardSesay_ · Jaykef
AI & ML interests
Building Lectūra AI | CS Grad Student @BIT | AI/ML Research: Autonomous Agents, LLMs | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model Karpathy
Recent Activity
posted an update 1 day ago
What a wild, beautiful read! > google: we are not done with the transformer yet. > They showed that an LLM can in fact update its pre-trained weights based solely on patterns learned from a given prompt/context (in real time, with no training/finetuning required). This means real-time vanilla in-context learning/knowledge update 🤯 The transformer is so much more than we think :) Paper: https://arxiv.org/pdf/2507.16003
upvoted a paper 2 days ago:
Group Sequence Policy Optimization
posted an update 9 days ago
Towards batch sizes too small to meter 🎉 beautiful work! And my personal favorite so far - I adore peak performance at small/nano scale. Everyone deserves to run/train AGI locally :) our data, our god model! They showed that:
- you can train LLMs (up to 1B params) with a batch size as low as batch_size=1. This is unconventional, given that small batch sizes can lead to unstable/spiky training runs.
- you can have a stable training run with just vanilla SGD (stochastic gradient descent), no momentum required 🤯
- small batch sizes are more robust to hyperparameters (i.e., fewer worries about initialization)
- smaller batch sizes outperform larger ones ("better per-Flops performance")
"We recommend that practitioners training large models in memory-constrained settings exploit the benefits of small batch sizes rather than trying to emulate the large batch size setting (e.g., through gradient accumulation) typically used in industry."
I've been doing this for ages - my mantra: all my experiments must scale on my 8GB RAM M2 before moving to GPU. IOW, I love being GPU poor. Check out my nano AI algorithms repo: https://github.com/Jaykef/ai-algorithms - all notebooks run on as little as 8GB of RAM.
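As a rough illustration of the recipe described in the post above (batch_size=1 with plain SGD and no momentum), here is a minimal PyTorch sketch on a toy regression task. The model, data, and hyperparameters are assumptions for illustration only; they are not taken from the paper or the linked repo.

```python
# Minimal sketch (not the paper's code): one parameter update per single
# example, using vanilla SGD with its default momentum of 0.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Tiny synthetic regression task standing in for a real LM objective.
X = torch.randn(512, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=1, shuffle=True)  # batch_size=1

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

# Vanilla SGD: momentum defaults to 0, so no momentum buffer is kept.
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(3):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()  # one update per example, no gradient accumulation
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Because each step sees a single example, peak memory is bounded by one forward/backward pass, which is the memory-constrained setting the quoted recommendation targets.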
Jaward's models (4)
Jaward/CodeOptimus-Instruct-Mistral-7B-v0.1.gguf · 7B · Updated Mar 13 · 32 downloads · 1 like
Jaward/smollm2_360m_grpo_gsm8k_reasoner · Text Generation · 0.4B · Updated Mar 4 · 3 downloads · 1 like
Jaward/phi-3-mini-4k-instruct.Q4_0.gguf · Text Generation · 4B · Updated Apr 27, 2024 · 1.28k downloads · 3 likes
Jaward/mlx-bge-small-en · Updated Apr 17, 2024 · 3