nyuuzyou
AI & ML interests: None yet
Recent Activity
updated a dataset about 2 hours ago: nyuuzyou/birdwatch
upvoted a changelog 2 days ago: Introducing a better Hugging Face CLI
updated a dataset 3 days ago: nyuuzyou/pastvu
posted an update 9 days ago

reacted to m-ric's post with 🔥 10 days ago
Open-source is catching up on Deep Research! 🔥 An Alibaba team has published a new data + RL recipe that lets open models compete with OpenAI’s Deep Research.
This is one of the best papers I’ve read on fine-tuning LLMs for agentic use-cases.
Deep Research use cases are those where you task an agent to go very broad in its search on a topic, sometimes launching hundreds of web searches to refine the answer. Here’s an example: “Between 1990 and 1994 inclusive, what teams played in a soccer match with a Brazilian referee that had four yellow cards, two for each team, where three of the four were not issued during the first half, and four substitutions, one of which was for an injury in the first 25 minutes of the match?” (answer: Ireland v Romania)
Open-source models just weren’t performing that well. The team from Alibaba posited that the main cause was that Deep Research-like tasks were simply missing from training data. Indeed, our usual agentic training data of a few tool calls hardly covers this “many-steps-with-unclear-entities” type of query.
So the researchers decided to fill the gap and create a high-quality dataset for Deep Research.
My highlights from the paper:
1 - The data: by smartly leveraging an ontology of knowledge as entities linked in a graph, they can choose an arbitrarily big subgraph to craft an arbitrarily difficult request. This process produced SailorfogQA, a high-quality training dataset for Deep Research.
2 - The training methods: they start from Qwen 2.5. After fine-tuning on their dataset, the researchers apply a round of RL with a reward on format + answer (scored by an LLM judge), and it increases performance by ~4% across all benchmarks.
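As a toy illustration of the data recipe (all entity names, relations, and function shapes below are my own invention, not the paper's code): sample a subgraph from a knowledge graph, then turn the conjunction of its facts into a question whose central entity is obfuscated.

```python
import random

# Toy knowledge graph: entities linked by (relation, target) edges.
# Entities and relations here are made up for illustration only.
GRAPH = {
    "match_1990_IRL_ROU": [("referee", "brazilian_referee"),
                           ("yellow_cards", "4"),
                           ("substitutions", "4")],
    "brazilian_referee": [("nationality", "Brazil")],
}

def sample_subgraph(graph, start, hops, rng):
    """Random-walk outward from a seed entity, collecting facts.
    A bigger subgraph gives a harder question, since the answer entity
    is pinned down only by the conjunction of many facts."""
    facts, frontier = [], [start]
    for _ in range(hops):
        node = rng.choice(frontier)
        for relation, target in graph.get(node, []):
            facts.append((node, relation, target))
            frontier.append(target)
    return facts

rng = random.Random(0)
facts = sample_subgraph(GRAPH, "match_1990_IRL_ROU", hops=2, rng=rng)
# Obfuscate the seed entity to get a Deep-Research-style question:
question = "Which match had " + " and ".join(
    f"{relation} {target}" for _, relation, target in facts) + "?"
```

The agent then has to recover the hidden entity from the scattered facts, which is exactly the "many-steps-with-unclear-entities" shape of query described above.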
I'm still amazed by the quality produced by Alibaba-NLP (makers of Qwen) - keep these papers coming!

reacted to openfree's post with 👍 10 days ago
🎯 AGI NOVEL Generator: The First Step Toward True AI Creativity
openfree/AGI-NOVEL
Can AI Write a 100,000-Word Novel?
What's the ultimate test for AGI (Artificial General Intelligence)? Calculation? Logic? Or creativity?
We tackled the hardest creative challenge: A single AI writing a full-length novel with consistent voice from beginning to end.
🚀 Core Innovations
Single Writer System: Not fragmented texts from multiple AIs, but a genuine novel by one author
Immediate Critique System: Real-time literary critique and revision for each part
170 Quadrillion Themes: Infinite creative possibilities (about 4.7 trillion years at 100 novels/day!)
Philosophical Depth: Nobel Prize-level existential exploration and social insight
🎲 Infinite Possibilities
"The day my father died, I discovered he had another family he'd hidden all his life."
One random click generates a powerful opening sentence and a completely new story begins.
📊 Technical Achievements
8,000-word novella auto-generation (approximately 20 minutes)
10 organically structured parts: Perfect narrative arc from introduction to resolution
Real-time progress tracking: Session recovery for uninterrupted creation
DOCX/TXT export: Korean standard book format (152x225mm) support
🌟 Journey Toward AGI
This project goes beyond simple text generation. Sustained memory, causal reasoning, emotional nuance, ethical self-censorship, originality - it tests all capabilities required for AGI.
Experience it now! Your unique story awaits.

reacted to merve's post with ❤️ 11 days ago
Fine-tune Gemma3n on videos with audios inside with Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!
keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in under 40 GB of VRAM
stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
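For intuition about what the audio-resampling step does, here is a naive pure-Python stand-in (not the notebook's code; a real pipeline would use torchaudio or librosa with proper anti-aliasing filtering):

```python
def resample(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler: map each output index to a
    fractional position in the source and blend the two nearest samples."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Halving the sample rate halves the number of samples (and the memory).
halved = resample([0.0, 1.0, 0.0, -1.0], 48_000, 24_000)
print(len(halved))  # 2
```

Resampling audio (and dropping video frames) shrinks the token count per example, which is what makes the sub-40 GB budget reachable.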

reacted to jsulz's post with 🔥 12 days ago
We've moved over 20PB from Git LFS to Xet on the Hub without downtime or data loss. Having things "just work" on a migration of this scale is about as good as it gets.
Now, we're migrating the rest of the Hub https://huggingface.co/blog/migrating-the-hub-to-xet
But how did we get here?
In the early days of joining Hugging Face, we made a few key design decisions:
* There would be no "hard cut-over" from Git LFS to Xet
* A Xet-enabled repository should be able to contain both Xet and LFS files
* Repository migrations from LFS to Xet can run in the background without disrupting downloads or uploads
These were largely driven by our desire to ensure the community could keep working without interruption.
We cover the infrastructure making this all go in this post, specifically:
* An integral piece of infrastructure known internally as the Git LFS Bridge
* Background content migrations that run around the clock
To skip the wait and join Xet now, sign up here https://huggingface.co/join/xet
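For intuition on why Xet-style storage dedupes where whole-file hashing (as in Git LFS) cannot: content-defined chunking cuts a file at positions determined by the bytes themselves, so identical regions produce identical chunks even after edits shift their offsets. A toy sketch, assuming nothing about Xet's actual algorithm (real systems use a proper rolling hash such as gearhash, plus minimum/maximum chunk-size bounds):

```python
def chunk_boundaries(data: bytes, mask: int = 0x3F) -> list[int]:
    """Cut wherever a rolling value over recent bytes matches a mask.
    The mask sets the expected chunk size (~64 bytes here; real systems
    target tens of KB). This is a toy hash, not Xet's."""
    boundaries, h = [], 0
    for i, byte in enumerate(data):
        h = ((h << 1) ^ byte) & 0xFFFF   # toy 16-bit rolling hash
        if h & mask == mask:
            boundaries.append(i + 1)
            h = 0
    return boundaries

blob = bytes(range(256)) * 4
# The same content always yields the same cut points (deterministic),
# which is what lets repeated content dedupe across separate uploads.
assert chunk_boundaries(blob) == chunk_boundaries(blob)
```

Chunk-level dedupe is also what makes a background LFS-to-Xet migration cheap: bytes already present in the store never need to move again.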

reacted to danielhanchen's post with 🔥 13 days ago
posted an update 16 days ago
🎵 UwUpad Soundboard Audio Dataset - nyuuzyou/uwupad
Collection of 48,366 audio files from the uwupad.me soundboard platform, featuring:
- Diverse audio content including memes, music clips, voice samples, sound effects, and entertainment audio
- Includes metadata: unique identifiers, descriptive titles, associated imagery URLs, original source URLs, and descriptive tags
- Contains high-quality MP3 audio files suitable for soundboard applications and audio processing research
- Organized in webdataset format with 61 tar shards and compressed JSONL metadata
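For anyone new to the webdataset layout: each tar shard holds samples as plain files that share a key and differ by extension (here, an .mp3 payload plus a .json metadata file). A minimal stdlib round-trip sketch; the file names and metadata fields are illustrative, not the dataset's exact schema:

```python
import io
import json
import tarfile

# Build a tiny in-memory shard shaped like a webdataset tar shard.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("000001.mp3", b"\xff\xfb fake-mp3-bytes"),
        ("000001.json", json.dumps({"title": "example clip"}).encode()),
    ]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read it back the way a webdataset loader would: group members by key.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        key, ext = member.name.rsplit(".", 1)
        samples.setdefault(key, {})[ext] = tar.extractfile(member).read()

print(sorted(samples["000001"]))  # ['json', 'mp3']
```

In practice you would stream the real shards with the `webdataset` library rather than raw `tarfile`, but the on-disk layout is exactly this.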

reacted to mlabonne's post with 🔥 17 days ago

Based on a new hybrid architecture, these 350M, 700M, and 1.2B models are both fast and performant, ideal for on-device deployment.
I recommend fine-tuning them to power your next edge application. We already provide Colab notebooks to guide you. More to come soon!
📝 Blog post: https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models
🤗 Models: LiquidAI/lfm2-686d721927015b2ad73eaa38

reacted to nroggendorff's post with 🚀 17 days ago
They should add this to the huggingface-cli arguments or even make a separate button in the repository settings

reacted to anakin87's post with 🤗 27 days ago
🧰 Free up space on the Hub with super_squash_history
As you may know, the Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PROs).
This weekend I did some cleanup on my private repos and went from 1.58 TB down to 1 GB. 😅
Besides deleting old, unused models, the main tool I used was a lesser-known command: super_squash_history.
When you train a model, you often push multiple checkpoints to the Hub.
Each checkpoint = a commit.
A 2.6B model in BF16 is ~5 GB.
So 10 checkpoints = 50 GB. That adds up fast.
While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.
In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.
https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history
⚠️ super_squash_history is a non-revertible operation. Once squashed, the commit history cannot be retrieved.
Hope this is useful to others.
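A minimal sketch of the call via huggingface_hub (the repo name is a placeholder; double-check you really don't need the history before running, since the operation cannot be undone):

```python
from huggingface_hub import HfApi

def squash(repo_id: str, repo_type: str = "model") -> None:
    """Collapse the repo's entire commit history into a single commit.
    Non-revertible: intermediate checkpoint commits become unrecoverable."""
    api = HfApi()  # picks up your saved token / HF_TOKEN from the environment
    api.super_squash_history(repo_id=repo_id, repo_type=repo_type)

# squash("your-username/your-old-experiment")  # uncomment when you're sure
```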

reacted to openfree's post with 🔥 28 days ago
🧠 SOMA: The Core Architecture for AGI Level 1 🚀
VIDraft/SOMA-AGI
🎯 The First Step Toward AGI
SOMA (Self-Orchestrating Modular Architect) is a revolutionary architecture that fulfills the essential requirements for AGI (Artificial General Intelligence) Level 1. It perfectly implements the common AGI prerequisites emphasized by Yann LeCun (Meta), OpenAI, and Google DeepMind within a single LLM.
📋 AGI Level 1 Core Requirements = SOMA's Perfect Implementation ✅
🎯 Planning Capability
→ Supervisor AI autonomously designs and executes comprehensive analysis roadmaps
🧩 Role Differentiation & Modularity
→ A single LLM instantly differentiates into 5 expert AIs for collaboration
🔄 Self-reflection & Feedback Loops
→ Evaluator AI continuously validates and directs improvements
🛠️ Tool-use & Autonomy
→ Full automation from web search to report generation
🎮 Long-term Agency Structure
→ Completes complex 11-stage collaborative processes end-to-end
🔷 SOMA's Three Core Structures
🧭 Self-Orchestrating
The ability to define problems and distribute roles without external instructions is fundamental. This is the actual implementation of OpenAI's "Agentic AI" concept, with built-in real-time self-regulation mechanisms.
🧩 Modular
A single LLM internally creates multiple personas:
🎯 Planner = Supervisor AI establishes strategies
💡 Creator = Presents innovative solutions
📚 Analyzer = Collects and analyzes data
⚖️ Evaluator = Performs critical assessments
📊 Executor = Final synthesis and implementation
This perfectly realizes Meta AI's proposed "World Model + Planner + Memory + Actor" structure.
🧠 Architect
Capable of high-level thinking and problem structuring beyond simple execution. It actually implements the plan-adapt-multitask capabilities required by DeepMind's Gemini series, systematically decomposing and reconstructing complex problems.
💫 SOMA = The Embodiment of AGI Level 1
Any plans to migrate all repositories to Xet?

reacted to jsulz's post with 🚀 about 1 month ago
It's been a bit since I took a step back and looked at the xet-team's progress migrating Hugging Face from Git LFS to Xet, but every time I do, it boggles the mind.
A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today?
🤗 700,000 users/orgs
📈 350,000 repos
🚀 15PB
Meanwhile, our migrations have pushed throughput to numbers that are bonkers. In June, we hit upload speeds of 577Gb/s (crossing 500Gb/s for the first time).
These are hard numbers to put into context, but let's try:
The latest run of the Common Crawl from commoncrawl was 471 TB.
We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.
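Sanity-checking that estimate (note the terabytes-to-gigabits unit conversion):

```python
crawl_tb = 471     # latest Common Crawl snapshot, in terabytes
peak_gbps = 577    # peak upload throughput, in gigabits per second

# TB -> Tb (x8) -> Gb (x1000), then divide by throughput in Gb/s.
seconds = crawl_tb * 8 * 1000 / peak_gbps
hours = seconds / 3600
print(round(hours, 1))  # 1.8 -- i.e. "about two hours"
```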
We're moving to a new phase in the process, so stay tuned.
This shift in gears means it's also time to roll up our sleeves and look at all the bytes we have and the value we're adding to the community.
I already have some homework from @RichardErkhov to look at the dedupe across their uploads, and I'll be doing the same for other early adopters, big models/datasets, and frequent uploaders (looking at you @bartowski 👀)
Let me know if there's anything you're interested in; happy to dig in!


reacted to albertvillanova's post with 🚀 about 1 month ago
🚀 SmolAgents v1.19.0 is live!
This release brings major improvements to agent flexibility, UI usability, streaming architecture, and developer experience: making it easier than ever to build smart, interactive AI agents. Here's what's new:
🔧 Agent Upgrades
- Support for managed agents in ToolCallingAgent
- Context manager support for cleaner agent lifecycle handling
- Output formatting now uses XML tags for consistency
🖥️ UI Enhancements
- GradioUI now supports reset_agent_memory: perfect for fresh starts in dev & demos.
🔄 Streaming Refactor
- Streaming event aggregation moved off the Model class
- ➡️ Better architecture & maintainability
📦 Output Tracking
- CodeAgent outputs are now stored in ActionStep
- ✅ More visibility and structure to agent decisions
🐛 Bug Fixes
- Smarter planning logic
- Cleaner Docker logs
- Better prompt formatting for additional_args
- Safer internal functions and final answer matching
📚 Docs Improvements
- Added quickstart examples with tool usage
- One-click Colab launch buttons
- Expanded reference docs (AgentMemory, GradioUI docstrings)
- Fixed broken links and migrated to .md format
🔗 Full release notes:
https://github.com/huggingface/smolagents/releases/tag/v1.19.0
💬 Try it out, explore the new features, and let us know what you build!
#smolagents #opensource #AIagents #LLM #HuggingFace

reacted to pagezyhf's post with 🔥 about 1 month ago
Hackathons in Paris on July 5th and 6th!
Hugging Face just wrapped 4 months of deep work with AMD to push kernel-level optimization on their MI300X GPUs. Now, it's time to share everything we learned.
Join us in Paris at STATION F for a hands-on weekend of workshops and a hackathon focused on making open-source LLMs faster and more efficient on AMD.
Prizes, amazing host speakers, ... if you want more details, navigate to https://lu.ma/fmvdjmur!

reacted to openfree's post with 🔥 about 1 month ago
🎯 Open GAMMA - AI PPT Generator 'GamJa'
🚀 Project Introduction
Revolutionary AI presentation generator presented by OpenFree AI Community! Create professional-level PPTs with just a few clicks.
🆓 Completely FREE! Create Premium PPTs with Free GAMMA! 🎉
DEMO: openfree/Open-GAMMA
✨ Key Features
🤖 Powered by FACTS Grounding Leaderboard 2nd RANK LLM
Base Model: vidraft/gemma-3-R1984-27B
Perfect support for English/Korean/Multi-language
Automatic speaker notes generation
🎨 Premium Visuals
3D style AI image generation
5 design themes (Professional, Modern, Nature, Creative, Minimal)
FLUX style diagram images
Automatic emoji bullet points
📊 Smart Diagrams
Process Flow, Concept Map, WBS, Radial, Synoptic Chart
Content analysis-based automatic diagram generation
Perfect Korean font support
💡 Main Features
📝 Intelligent Content Generation
Auto-generate 3-20 slides just by entering a topic
Latest information through web search
Reference PDF, CSV, TXT files
🖼️ Visual Automation
3D images for cover & conclusion slides
Auto-generate 2 content-based diagrams
Add 2 FLUX style images
🎯 Customizable Design
5 professional themes
3 layout styles
Automatic emoji mapping system
💰 Premium Features for FREE!
Create professional-grade presentations with Free GAMMA (Open GAMMA) that rivals paid PPT generation services! 🚀

posted an update about 1 month ago
💻 NotaBug.org Code Dataset - nyuuzyou/notabug-code
Collection of source code from 19,449 repositories, featuring:
- Diverse programming languages including Python, JavaScript, C++, Java, Go, Rust, and dozens of other languages
- Includes metadata: repository names, file paths, programming language detection, licensing information, and file sizes
- Contains high-quality source code files with line length filtering for optimal processing
- Organized in compressed JSONL format with Zstandard compression