Hugging Face
Mert Bozkurt (mertbozkurt)
3 followers · 22 following
AI & ML interests: None yet
Recent Activity
Liked a model about 18 hours ago: google/gemma-3-270m-it
Reacted to flozi00's post with ❤️ 1 day ago:
When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes you need to slice the matrices themselves. My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP): we look at how to shard individual operations across devices so that a cluster functions as one massive accelerator. This isn't high-level theory; it's a look at the bare-metal implementation.

Here is what the deep dive covers:

The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).
- Column-linear: splits the weights by columns. Requires an All-Gather to reconstruct the output.
- Row-linear: splits the weights by rows. Requires an All-Reduce to sum the partial results.

The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a column-parallel layer and a row-parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.

The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path; the CUDA cores effectively stall while waiting for the ring-reduce to finish.
- Intra-node: works well, because NVLink provides enough bandwidth to hide this latency.
- Inter-node: fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs TP requires.

The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.

Read the full hardware-centric guide here: https://flozi.net/en/guides/ai/scaling/tensor_parallel
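The Megatron-style trick described in the post can be checked numerically on a single process. The sketch below (my own illustration with NumPy, not the article's torch.distributed code; all variable names are mine) splits the first MLP weight by columns and the second by rows, applies GeLU locally on each "device" shard, and shows that a single final sum, the stand-in for the one All-Reduce, reproduces the unsharded forward pass:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (elementwise)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # input activations
A = rng.standard_normal((8, 16))    # first MLP weight  -> column-parallel
B = rng.standard_normal((16, 8))    # second MLP weight -> row-parallel

# Reference: single-device forward pass
ref = gelu(X @ A) @ B

# Simulated 2-way tensor parallelism
A0, A1 = np.split(A, 2, axis=1)     # shard A by columns
B0, B1 = np.split(B, 2, axis=0)     # shard B by rows

# Each "device" computes its partial result. GeLU is elementwise,
# so it can be applied locally to each column shard of X @ A:
# no synchronization is needed between the two layers.
partial0 = gelu(X @ A0) @ B0
partial1 = gelu(X @ A1) @ B1

# The only communication point: one All-Reduce (here, a plain sum).
out = partial0 + partial1

assert np.allclose(out, ref)
```

Because the non-linearity commutes with the column split, the Column-Linear → GeLU → Row-Linear sandwich needs exactly one collective per MLP block instead of two, which is where the 50% reduction in communication events comes from.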
Upvoted an article 3 days ago: Continuous batching from first principles
Spaces (16)
- 🐢 Fast Api Omr (Sleeping)
- 📈 Llama2 Turkish Recipe (Runtime error)
- 🦀 License Plate Detector (Running)
- 📈 Demo1 (Runtime error)
- 👀 Add ChatdataTR (No application file)
- ⚡ Optical Mark Recognition (Runtime error)
Models (4)
- mertbozkurt/speecht5_finetuned_voxpopuli_it · 0.1B · Updated May 12 · 2
- mertbozkurt/whisper-tiny-en · Automatic Speech Recognition · 37.8M · Updated May 12
- mertbozkurt/distilhubert-finetuned-gtzan · Audio Classification · 23.7M · Updated May 10 · 3
- mertbozkurt/Llama-2-7b-TR-recipe · Text Generation · Updated Jan 30, 2024 · 7 · 2
Datasets (5)
- mertbozkurt/llama2-TR-recipe · Viewer · Updated Jan 30, 2024 · 10.5k · 20 · 7
- mertbozkurt/turkish-recipe · Viewer · Updated Dec 19, 2023 · 149k · 105 · 8
- mertbozkurt/Souhbet_chatTR · Viewer · Updated Nov 30, 2023 · 1 · 10
- mertbozkurt/school_data · Preview · Updated Nov 1, 2023 · 3
- mertbozkurt/quotes_philosophers · Updated Dec 20, 2022 · 141 · 6