What is the Hugging Face Community Building?
A Deep Dive into the Open Source AI Ecosystem
The narrative around AI development often centers on a handful of major players and their flagship large language models. But this view misses the rich, diverse ecosystem of innovation happening across thousands of organizations, researchers, and developers worldwide. The Hugging Face Hub offers a unique window into this broader landscape — serving as the largest repository of AI models, datasets, and applications, capturing contributions from organizations with varying levels of openness and different approaches to AI development.
With over 1.8 million models, 450,000 datasets, and 560,000 applications hosted on the Hub, we have unprecedented visibility into who's building what in AI. Unlike other platforms that focus on specific domains or require particular licenses, the Hub captures the full spectrum of AI development: from completely open research to artifacts from major tech companies, from cutting-edge LLMs to specialized time-series models for specific industries.
Let's explore what this data reveals about the true state of AI development, the surprising diversity of contributions, and the research opportunities hiding in plain sight.
Exploring the Hub: What the Data Reveals
The Hub captures rich metadata on every model, dataset, and application: download counts, likes, user versus organization contributions, creation dates, and derivative relationships. While individual metrics like all-time downloads or recent likes tell part of the story, no single measure captures the full complexity of AI development patterns: a model with modest downloads might spawn hundreds of derivatives, or a dataset with few likes might be fundamental to entire research domains. To make sense of this complexity, we've built several analytical tools: the ModelVerse Explorer maps organizational contributions to model development, the DataVerse Explorer does the same for datasets, and the Organization HeatMap visualizes activity patterns across contributors over time.
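As a concrete starting point, much of this metadata can be pulled directly with the `huggingface_hub` client. A minimal sketch (the sort fields and limits are illustrative, and exact attribute availability can vary by library version):

```python
# Minimal sketch: pull the kind of repo-level metadata the explorers build on.
from huggingface_hub import HfApi

api = HfApi()

# Most-downloaded models of all time, with downloads, likes, and creation dates
for model in api.list_models(sort="downloads", direction=-1, limit=10):
    print(model.id, model.downloads, model.likes, model.created_at)

# The same view for datasets
for ds in api.list_datasets(sort="downloads", direction=-1, limit=10):
    print(ds.id, ds.downloads, ds.likes)
```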
The ModelVerse: A More Distributed Ecosystem Than You Think
The ModelVerse Explorer reveals fascinating patterns in model contributions across organizations. While the headlines focus on OpenAI, Google, and Anthropic, the data shows a much more distributed landscape:
- Small models dominate downloads: Even when organizations release both large and small variants of the same model family, the smaller versions consistently see higher download numbers. This suggests practical deployment considerations often outweigh the pursuit of maximum capability (a quick way to check this is sketched below).
- Legacy models persist: GPT-2 and BERT remain among the most downloaded models despite being years old, showing that modern chat interfaces represent just one slice of AI application.
- Rapid community response: Models like DeepSeek's recent releases can accumulate thousands of likes and forks within days, demonstrating the community's quick adoption of promising new approaches.
Very soon after its release, DeepSeek-R1 became the most liked model on Hugging Face.
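The first pattern is easy to sanity-check yourself: compare download counts across size variants of a single family. A minimal sketch (the repo IDs are just examples; swap in any family you care about):

```python
# Sketch: compare downloads across size variants of one model family.
from huggingface_hub import HfApi

api = HfApi()
family = [
    "Qwen/Qwen2.5-0.5B-Instruct",   # example repo IDs; substitute your own
    "Qwen/Qwen2.5-7B-Instruct",
    "Qwen/Qwen2.5-72B-Instruct",
]

for repo_id in family:
    info = api.model_info(repo_id)
    print(f"{repo_id}: {info.downloads:,} downloads, {info.likes:,} likes")
```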
The DataVerse: The Foundation Layer
The DataVerse Explorer tells an even more interesting story about the distributed nature of AI development:
- Evaluation datasets rule: The most downloaded datasets of all time are evaluation benchmarks, reflecting the community's priority for rigorous testing and comparison.
- Open actors dominate foundational data: While closed companies may train on proprietary data, the datasets that serve as the foundation for most AI development come from universities, research institutions, and open organizations.
- Domain-specific specialization: Beyond the general-purpose datasets that make headlines, there's a thriving ecosystem of specialized datasets for finance, healthcare, robotics, and other domains (a tag-filter sketch follows this list).
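To get a feel for that specialized ecosystem, you can filter datasets by tag. A rough sketch (the tag names are illustrative; tagging conventions vary across datasets):

```python
# Sketch: peek at specialized corners of the DataVerse by filtering on tags.
from huggingface_hub import HfApi

api = HfApi()
for tag in ["finance", "medical", "robotics"]:   # illustrative tags
    print(f"\nTop '{tag}' datasets by downloads:")
    for ds in api.list_datasets(filter=tag, sort="downloads", direction=-1, limit=5):
        print(f"  {ds.id}: {ds.downloads:,} downloads")
```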
Organization Activity: The Leading Contributors
The Organization HeatMap reveals which entities contribute most frequently on the Hub (a rough recreation from the public API is sketched after the list):
- AI2 leads the pack: The Allen Institute for AI emerges as one of the most active contributors, demonstrating the continued importance of research institutions.
- Big Tech's varied approaches: Companies like IBM, NVIDIA, and Apple show significant activity that might not be visible in mainstream AI discussions. Microsoft's presence through various research divisions adds another layer of complexity.
- International diversity: Organizations from China, Europe, and other regions contribute significantly, highlighting AI's global nature.
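A rough way to approximate the heatmap is to bucket each organization's uploads by month using repo creation dates. A sketch (the organization handles are illustrative, and `created_at` availability can depend on the library version):

```python
# Sketch: approximate an organization activity heatmap from model creation dates.
from collections import Counter
from huggingface_hub import HfApi

api = HfApi()
for org in ["allenai", "ibm-granite", "nvidia"]:   # illustrative org handles
    months = Counter()
    # Lists every public model repo for the org, so this may take a moment
    for model in api.list_models(author=org):
        if model.created_at:
            months[model.created_at.strftime("%Y-%m")] += 1
    # Show the six most recent months of activity
    print(org, dict(sorted(months.items())[-6:]))
```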
What People Are Researching: Hidden Insights
The Hub is much more than just a repository — it is a living laboratory for AI research. Here are some research directions that become visible when you look beyond LLMs.
Domain-Specific Innovation
- Time Series Models: Amazon and Salesforce lead in time series forecasting, with significant contributions from Monash University, HF's LeRobot team, and the AutoGluon project. This represents billions in economic value that rarely makes AI headlines.
- Biology and Life Sciences: Cambridge, Microsoft Research, and numerous biotechnology startups are quietly building the models that may revolutionize drug discovery and biological research.
- Robotics: Open-source robotics projects alongside NVIDIA's contributions are creating the foundation for the next generation of autonomous systems.
- Audio and Speech: While OpenAI's Whisper gets attention, download patterns show a strong preference for open alternatives in many applications (a task-tag sketch of these domain ecosystems follows this list).
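One way to surface these domain ecosystems is to filter models by task. A sketch using Hub pipeline tags (the selection of tasks is illustrative):

```python
# Sketch: list the most-downloaded models for a few domain-specific tasks.
from huggingface_hub import HfApi

api = HfApi()
tasks = ["time-series-forecasting", "robotics", "automatic-speech-recognition"]
for task in tasks:
    print(f"\nTop '{task}' models by downloads:")
    for model in api.list_models(pipeline_tag=task, sort="downloads",
                                 direction=-1, limit=5):
        print(f"  {model.id}: {model.downloads:,} downloads")
```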
Model Evolution and Derivatives
The model tree statistics space uses self-reported parent-model tags to investigate how AI models evolve through community contributions:
- Organizations benefit differently from derivative works.
- Some models become platforms for innovation while others remain isolated.
- The most successful models, such as Qwen, Llama, and Gemma, have spawned entire ecosystems of specialized variants.
For a more comprehensive analysis of multiple generations of models derived from a single model, check out the base model explorer space.
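If you want to follow these parent-model relationships programmatically, the self-reported `base_model` field in model card metadata (which also surfaces as repo tags like `base_model:finetune:<parent>`) is the place to start. A hedged sketch (the repo IDs are examples, and the tag-filter syntax is an assumption worth verifying against the current Hub API):

```python
# Sketch: trace self-reported parent-model relationships.
from huggingface_hub import HfApi

api = HfApi()

# Read a derivative's declared parent from its card metadata
child = api.model_info("HuggingFaceH4/zephyr-7b-beta")   # example derivative
print("declared parent:", child.card_data.base_model if child.card_data else None)

# Count models that declare a given parent as their finetuning base
# (tag-filter syntax is an assumption; verify against the current API)
parent = "meta-llama/Llama-3.1-8B-Instruct"              # example parent
n = sum(1 for _ in api.list_models(filter=f"base_model:finetune:{parent}"))
print(f"{n} models declare {parent} as their finetune base")
```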
Research Opportunities
The data suggests several underexplored research directions:
- Cross-domain transfer learning: With models spanning dozens of domains, there are opportunities to study how capabilities transfer between fields.
- Collaborative development patterns: The derivative model networks reveal how distributed teams actually collaborate on AI development.
- Long-term model viability: By tracking downloads and usage over time, we can understand which model architectures have staying power.
Resources for Deeper Exploration
Interactive Spaces
- Cumulative Hub Stats: Track growth trends over time!
Cumulative Hub Stats space as of Jul 14, 2025
- Semantic Search: Explore models and datasets by capability using free text search!
- Model Atlas: Explore models and their relationships via an interactive graph visualization.
Research Datasets
For researchers wanting to dig deeper:
- Hub Statistics: Comprehensive stats dataset at https://huggingface.co/datasets/cfahlgren1/hub-stats
- Weekly Snapshots: Longitudinal data for trend analysis at https://huggingface.co/datasets/hfmlsoc/hub_weekly_snapshots
- Model Cards with Metadata: Structured model documentation at https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata
- Dataset Cards with Metadata: Comprehensive dataset documentation at https://huggingface.co/datasets/librarian-bots/dataset_cards_with_metadata
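These can all be loaded with the `datasets` library for offline analysis. A minimal sketch (the config and split names are assumptions; check each dataset card for what it actually exposes):

```python
# Sketch: load Hub metadata dumps for offline analysis.
from datasets import load_dataset

# Point-in-time statistics for model repos (config/split names assumed)
hub_stats = load_dataset("cfahlgren1/hub-stats", "models", split="train")
print(hub_stats)

# Weekly snapshots, useful for longitudinal questions such as which
# architectures keep accumulating downloads over time
snapshots = load_dataset("hfmlsoc/hub_weekly_snapshots", "models", split="train")
print(snapshots)
```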
Academic Research
Several academic papers have already begun using Hub data to understand AI development:
- "The Brief and Wondrous Life of Open Models"
- "The AI Community Building the Future? A Quantitative Analysis of Development Activity on Hugging Face Hub"
- "Systematic analysis of 32,111 AI model cards characterizes documentation practice in AI"
- "Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face"
- "How do Machine Learning Models Change?"
- "Responsible AI in Open Ecosystems: Reconciling Innovation with Risk Assessment and Disclosure"
- "We Should Chart an Atlas of All the World’s Models"
Looking Forward: What This Means for AI
The data reveals that AI development is far more distributed, diverse, and collaborative than popular narratives suggest. While attention focuses on frontier models and billion-dollar companies, the real innovation often happens in specialized domains, through community collaboration, and via iterative improvement of existing models.
For researchers, this represents an opportunity to study AI development as it actually happens—not just as it's portrayed in press releases. For developers, it highlights the importance of looking beyond the latest model releases to find tools that actually solve real problems. For policymakers, it suggests that understanding AI's impact requires looking at the full ecosystem, not just the most visible players.
The Hub continues to evolve as a research platform, with new tools and datasets regularly added. Whether you're studying innovation patterns, analyzing model capabilities, or exploring the social dynamics of AI development, the data is here waiting to be explored.
What will you discover? We'd love to hear about exciting projects using Hub data! Explore the spaces and datasets mentioned above to start your own analysis, then publish a paper or blog post, or tag us on social media with what you find!