SomosNLP

non-profit

https://somosnlp.org/

SomosNLP_

somosnlp

AI & ML interests

Democratizing NLP in Spanish and encouraging its application to generate social impact 💛

Recent Activity

mariagrandury published a dataset about 1 month ago

somosnlp/babylm-es

dianags authored a paper about 2 months ago

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

dianags authored a paper about 2 months ago

Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction

haritzpuerto

authored 2 papers 7 days ago

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Paper • 2506.15674 • Published Jun 18

C-SEO Bench: Does Conversational SEO Work?

Paper • 2506.11097 • Published Jun 6

suchirsalhan

authored a paper about 1 month ago

ByteSpan: Information-Driven Subword Tokenisation

Paper • 2506.18639 • Published Jun 23 • 3

mariagrandury

published a dataset about 1 month ago

somosnlp/babylm-es

Updated Jun 19 • 6

dvilasuero

posted an update about 2 months ago

Post

2729

Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.

In a nutshell:

📁 Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
🖼️ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
💯 Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!

3 replies

haritzpuerto

posted an update 2 months ago

Post

319

📜 Accepted at ACL 2025! Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs
We propose to fine-tune LLMs to generate diverse chains of thought (DCoT) in a single inference step. This enables within-inference refinement of the CoTs, with no external feedback needed!
🔗 https://arxiv.org/abs/2407.03181

reddrex

in somosnlp/LingComp_QA 3 months ago

How use the dataset to train my model GPT

#1 opened 3 months ago by luisaarias

ouhenio

updated a Space 3 months ago

Mapa Blend-es

🌍

Check out the collective progress of blend-es 😊

suchirsalhan

authored a paper 4 months ago

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Paper • 2410.22886 • Published Oct 30, 2024 • 2

plaguss

authored a paper 6 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 240

gabrielmbmb

authored a paper 6 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 240

mariagrandury

updated 3 collections 6 months ago

haritzpuerto

posted an update 6 months ago

Post

642

I just got my first ChatGPT review on ARR! 😅 Any advice on how to prove it's AI-generated? Thanks!

3 replies

haritzpuerto

posted an update 6 months ago

Post

1484

I'm excited to announce that my internship paper at Parameter Lab was accepted to Findings of #NAACL2025 🎉
TLDR: Stating that an LLM was trained on a given sentence might not be possible 😥, but it is possible for large enough amounts of tokens, such as long documents or multiple documents! 🤯
Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models (2411.00154)
🔗 https://github.com/parameterlab/mia-scaling

nataliaElv

posted an update 6 months ago

Post

1532

New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

https://huggingface.co/learn/nlp-course/chapter10

nataliaElv

posted an update 7 months ago

Post

565

Do you want to easily save annotations to a Dataset in the Hub?

In the latest version of Argilla (v2.6.0), you can export your data directly from the UI to the Hub.

Check all the changes and update to the latest version: https://github.com/argilla-io/argilla/releases/tag/v2.6.0

nataliaElv

posted an update 7 months ago

Post

1674

If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU

nataliaElv

posted an update 8 months ago

Post

1310

How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my teammates 😂

Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: https://huggingface.co/spaces/HuggingFaceFW/discussion

Team members 297
