Introduction
LLMs have the potential to support representative democracy by providing constituents with valuable information about their representatives. According to the polling aggregator 538, Congress’ approval rating is 21.7% favorable and 61.4% unfavorable at the time of writing. Further, according to a Pew survey, only 22% of US adults said that they trust the federal government to do the right thing. With trust in institutions at historic lows, it is important to explore novel ways to help solve that problem. The goal of this model is to take provided input text and tell the user what stance the author of that text would hold toward a given topic or claim. My hope is that similar approaches can give voters a deeper understanding of their representatives’ positions. This is a challenging problem that LLMs need additional training to solve effectively, as the task involves both classification and chain-of-thought reasoning. For this project, I began with the Qwen2.5-7B-Instruct-1M model and performed Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA). The model was trained specifically on the stance classification portion of the task. Ultimately, the results were inconclusive: the post-training model performed only very slightly better at the classification task, with accuracy of about 47% both before and after training. Further training attempts would be necessary to develop a model that can be considered truly successful at this task.
Training Data
The primary dataset used to train and evaluate this model is EZ-STANCE. It contains labeled stances on a variety of topics across politics and pop culture. The following fields are relevant for this project.
Text: The source Tweet that stances will be generated from.
Target: A topic or claim about which the author of the original text could have a specific stance.
Stance: The stance label for the Target text (FAVOR, AGAINST, or NONE).
Using this dataset, I provided the model with the source text and asked it to determine whether the author of that text would have a favorable, unfavorable, or no stance toward the target topic or claim. I did not modify those fields in the training dataset, other than adding structure around the data in the prompt to clarify what I wanted the model to provide (sketched below). The second component of the task was to have the model provide step-by-step reasoning behind the stance it chose. This reasoning was not included in the training dataset, but I thought it was important to have the model generate it: since the original motivation behind this model is to help build trust, it matters that the user has an explanation to reference.
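For illustration, a hypothetical template showing how a training example might be structured from the dataset fields (the exact wording used during training is not reproduced here):

# Hypothetical template; field names mirror the EZ-STANCE columns described above.
def format_training_example(text: str, target: str, stance: str) -> str:
    return (
        f"Statement: {text}\n"
        f"Topic or Claim: {target}\n"
        f"Stance: {stance}"  # one of FAVOR, AGAINST, NONE
    )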
The dataset did not have a train/test split, so I randomly shuffled it and split it into training and validation sets at 80%/20%, respectively, using the code below.
from sklearn.model_selection import train_test_split

# Split into 80% train, 20% validation
train, val = train_test_split(responses_df, test_size=0.2, random_state=42)
Training Method
The base model used for this project was Qwen2.5-7B-Instruct-1M. I chose this model because it could handle large context windows, was instruction tuned, and its relatively low parameter count made it more efficient to train. The final model was trained on the stance classification task using the LoRA method of PEFT. Then, few-shot chain-of-thought prompting was used to ask the final model for the reasoning behind the stances it generated. When reviewing the model's output on my task, I observed that few-shot prompting alone went a very long way in improving the model's explanations of its reasoning, which is why I only trained the model on the stance classification component of the task. I used PEFT over full fine-tuning because I did not want to drastically change the model, since it was already performing well on the reasoning task. Also, since I am using a 7B-parameter model and my desired model output is open-ended, I had concerns about the efficiency of full fine-tuning. My aim was to take a targeted training approach to assist the model on its classification task.
That left me deciding between PEFT and prompt tuning. My model was already performing well without any tuning, which led me to first consider prompt tuning as the least invasive approach. However, my task involves a somewhat specific stance classification step in addition to generating reasoning, so I thought the more in-depth approach of PEFT could be useful. Also, since my model is small to medium sized at 7B parameters, I did not have the same resource-usage concerns with PEFT as I did with full fine-tuning. Therefore, I decided to take the middle-ground approach of PEFT. Within PEFT, I chose LoRA because it is a common approach with many resources and much guidance available, which gave me confidence in my ability to implement it effectively. LoRA is also much more efficient than full fine-tuning and has been shown to perform almost as well, including on logical reasoning tasks.
LoRA hyperparameters used were as follows:
from peft import LoraConfig

LORA_R = 64
LORA_ALPHA = 64
LORA_DROPOUT = 0.05

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)
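For reference, a minimal sketch of how this config can be wrapped around the base model with the peft library (the training loop itself is omitted):

from transformers import AutoModelForCausalLM
from peft import get_peft_model

# Load the base model and attach the LoRA adapters defined above.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M")
model = get_peft_model(base_model, lora_config)

# Only the low-rank adapter weights are trainable, a small fraction of the 7B total.
model.print_trainable_parameters()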
Finally, when prompting the model for the reasoning behind the stance it selected, I used few-shot prompting. Min et al. found that giving the model about 16 examples in the prompt resulted in the best performance on classification and multiple-choice tasks. Since I have three possible stance options (FAVOR, AGAINST, NONE), I provided the model with 15 examples (5 for each stance). The 15 examples included in the prompt were hand-written by me, since no training data existed for the logical reasoning portion of this task. A sketch of how such a prompt can be assembled is below.
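A hypothetical helper illustrating the assembly of the few-shot prompt (the example structure follows the Prompt Format section below; names and wording are illustrative, not the exact template):

# Hypothetical assembly of the few-shot reasoning prompt. `examples` holds the
# 15 hand-written demonstrations (5 per stance label), each a dict with
# 'text', 'target', 'stance', and 'response' keys.
def build_few_shot_prompt(examples, statement, target):
    blocks = [
        "Example:\n"
        f"Statement: {ex['text']}\n"
        f"Topic or Claim: {ex['target']}\n"
        f"Stance: {ex['stance']}\n"
        f"Response: {ex['response']}"
        for ex in examples
    ]
    instruction = (
        "Now, using this statement, evaluate the speaker's stance toward the "
        "given topic or claim. Provide the stance (Favor, Against, or None) and "
        "a response explaining why you chose that stance."
    )
    return "\n\n".join(blocks + [instruction,
                                 f"Statement: {statement}",
                                 f"Topic or Claim: {target}"])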
Evaluation
The benchmarks I chose were Hellaswag, TruthfulQA, and Winogrande. I chose Hellaswag and Winogrande because I wanted to make sure the model could perform well on commonsense reasoning tasks and that reasoning performance did not degrade after training. Logical reasoning is important because the model needs to be able to explain how it arrived at its chosen stance. Next, I chose TruthfulQA as the third benchmark based on the original purpose of the model: my hope is that LLMs can eventually help people learn more about political candidates or public figures who impact their lives, so I wanted to test the model's truthfulness. Next, I measured the accuracy of stance classification pre- and post-training. Lastly, I used few-shot prompting to have the model provide logical reasoning behind the stance it provided for a given statement/topic pair, with 15 examples per prompt (5 for each stance classification). The table below shows performance on the three benchmarking tasks and the stance classification accuracy:
| Metric | Base Model | Post-Training | Mistral-7B-Instruct-v0.1 | DeepSeek-R1-Distill-Qwen-1.5B |
|---|---|---|---|---|
| Hellaswag | 0.58 | 0.58 | 0.50 | 0.40 |
| Winogrande | 0.70 | 0.68 | 0.62 | 0.56 |
| TruthfulQA (MC1) | 0.46 | 0.44 | 0.44 | 0.32 |
| TruthfulQA (BLEU ACC) | 0.44 | 0.36 | 0.34 | 0.40 |
| Stance Accuracy | 0.4781 | 0.4792 | 0.3362 | 0.3516 |
I chose to evaluate task performance on two additional models: the Mistral 7B instruction-tuned model, to compare against another instruction-tuned model of a similar size, and the DeepSeek 1.5B-parameter model, to test a smaller model that is still in the small-to-medium size category. Overall, my base and post-training models performed best on both the benchmarks and the stance accuracy task. I was encouraged that benchmark performance did not degrade significantly after training, indicating that the model did not lose logical reasoning capability. However, even after PEFT, stance classification accuracy remained virtually unchanged. If I were starting this project from the beginning, I would attempt to either train the model for significantly longer or use full fine-tuning. The Qwen 7B-parameter base model and the post-training model both performed on par with, or better than, the comparison models on all tasks.
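For reference, a minimal sketch of how benchmark scores like these can be collected with EleutherAI's lm-evaluation-harness (shown as an illustration; the exact harness configuration here is an assumption):

import lm_eval

# Run the three benchmarks against the fine-tuned model on the Hugging Face Hub.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=jbrandin/stance_classification,dtype=bfloat16",
    tasks=["hellaswag", "winogrande", "truthfulqa_mc1"],
    batch_size=8,
)
print(results["results"])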
Usage and Intended Uses
The intended use of the model is to take input text like a tweet or public statement along with a specific topic or claim and generate two key outputs: the stance classification and the reasoning behind the classification.
Loading the model:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jbrandin/stance_classification")
model = AutoModelForCausalLM.from_pretrained(
    "jbrandin/stance_classification",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Prompt Format
The prompt format should ideally include good examples of this task and then provide the model with the statement and the target topic or claim. From there, the model can generate the expected stance and its reasoning. For example:
Example:
Statement: And the worst part is that in the end, this decision is only going to enhance inequality, since it will encourage more families that can afford it to move towards the private education system.
Topic or Claim: Inequality will decrease as a result of the decision's encouragement of families to choose public school.
Stance: AGAINST
Response: The author is against the claim that inequality will decrease because they explicitly state that the decision will “enhance inequality.” They argue that it will drive wealthier families toward private education, implying that the public-private divide will widen, directly contradicting the idea that the decision reduces inequality.
#Provide additional examples as desired
Now, using this statement, evaluate the speaker's stance toward the given topic or claim. Provide the stance (Favor, Against, or None) and a response explaining why you chose that stance. Your response should not exceed two sentences:
Statement: So much is happening to tackle climate change - renewable energy, electric vehicles, conservation agriculture, community action, environmental protection - but we really need the political will to harness the work to get emissions down and keep heating below 1.5-2 degrees.
Topic or Claim: we really need to get emissions down and keep heating below 15 degrees.
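With the full few-shot prompt assembled into a single string (here called prompt), generation follows the standard chat-template flow; a minimal sketch, assuming the tokenizer and model loaded above:

# Wrap the prompt in the model's chat template and generate the stance + reasoning.
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))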
Expected Output Format
An example of the expected output format for the prompt above is:
Stance: AGAINST
Response: The author is against the claim that emissions need to be kept below 15 degrees Celsius. The statement emphasizes the importance of political will and comprehensive efforts to tackle climate change, but the target temperature of 15 degrees is not aligned with the widely accepted scientific goal of limiting global warming to 1.5-2 degrees Celsius.
Limitations
The primary limitation encountered was improving stance classification accuracy via training. Often, the input statement was poorly written, with slang, typos, or shorthand, which could make it more difficult for the model to parse meaning. The model also seemed to have difficulty distinguishing between stance and sentiment: it is possible for the author of a statement with a positive sentiment to have an unfavorable stance toward a topic or claim. The model also struggled to correctly identify neutral stances toward the topic. Given that the model is picking up on sentiment, it may default to sentiment when no clear stance is present. This was evident when I looked more closely at the classification accuracy, broken down by class in the table below. The model correctly classified only about 20% of examples where the true stance was “None”. Interestingly, it did not perform that well on “Favor” examples (42%) but performed extremely well (83%) on “Against” examples. I am not sure why this is the case, but a future step in working on this model could be to focus training efforts more on None and Favor examples.
| Stance | Items | Correct | Accuracy |
|---|---|---|---|
| Favor | 2,354 | 992 | 0.4214 |
| Against | 2,098 | 1,750 | 0.8341 |
| None | 2,187 | 432 | 0.1975 |
| Total | 6,639 | 3,174 | 0.4781 |
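A per-class breakdown like the table above can be computed from the validation predictions; a sketch, assuming a pandas DataFrame preds_df with gold and pred label columns (names are illustrative):

# Per-class counts and accuracy from gold labels vs. model predictions.
per_class = (
    preds_df.assign(correct=preds_df["gold"] == preds_df["pred"])
            .groupby("gold")["correct"]
            .agg(items="count", correct="sum", accuracy="mean")
)
print(per_class)

# Overall accuracy across all classes.
print("Total accuracy:", (preds_df["gold"] == preds_df["pred"]).mean())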
Another limitation of this approach is that I was only training on the stance classification task. Given that the minimally invasive approach of PEFT did not have much impact, it could be worth exploring more impactful approaches like full fine-tuning. In that case, there is a risk that training the entire model will damage its ability to reason logically.
Lastly, I include some examples of poor responses from the model to demonstrate its limitations:
- A good example of why “None” is hard:
Statement: This had the effect of taking the fringe religious doctrine of NI and making it much more common and American evangelical teaching but also taught a lot of churches that private education is a massive cash cow money laundry tax dodge on top of being a church
Topic or Claim: Private education institutions could face scrutiny in the future for potential financial irregularities, leading to investigations into money laundering and tax evasion practices.
Stance: FAVOR
Response: The author supports the claim by linking private education institutions to financial irregularities, including money laundering and tax evasion, suggesting that these institutions could indeed face future scrutiny and investigations. This aligns with the claim that private education may be at risk due to such practices.
- Example of the model perhaps confusing sentiment and stance, and struggling with slang:
Statement: Let s aim to recycle and reuse as much as we can. . . . . noplastic plasticfree plasticpollution environment environmentalawareness environmentalfriendly savetheplanet plasticpollution noplasticbags noplasticwaste makeachange makeachangetoday
Topic or Claim: plasticpollution
Stance: FAVOR
Response: The author is in favor of reducing plastic pollution by promoting recycling, reusing, and minimizing plastic use. Phrases like "make a change today" and "save the planet" indicate a strong commitment to addressing environmental issues related to plastic pollution.
- Example of getting the right answer for the wrong reason. The author does not suggest a favorable view of the men's situation, but would likely agree with the claim simply because it is a restatement of the original text:
Statement: Two young Russian men decided to flee military mobilization and took the boat across the Bering Strait to a small American island off the coast of Alaska. They have now applied for asylum as refugees. Will the U.S. see a new wave of migrants in the north? 99 582 2,367
Topic or Claim: Two Russian men have fled military mobilization and taken a boat across the Bering Strait to Alaska, seeking asylum as refugees.
Stance: FAVOR
Response: The author supports the claim by highlighting the Russian men's actions and the potential for increased migration, suggesting a favorable view of their situation and the broader implications. This indicates an overall positive stance toward the topic.
Sources and References
- Congress: Approval polls. FiveThirtyEight. (2025, January 28). https://projects.fivethirtyeight.com/polls/approval/congress/
- Deane, C. (2024, October 17). Americans’ deepening mistrust of institutions. The Pew Charitable Trusts. https://www.pewtrusts.org/en/trend/archive/fall-2024/americans-deepening-mistrust-of-institutions
- Chenye Zhao and Cornelia Caragea. 2024. EZ-STANCE: A Large Dataset for English Zero-Shot Stance Detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15697–15714, Bangkok, Thailand. Association for Computational Linguistics.
- Tan, J. (2024, November 27). RationaLlama: Fine-tuning an LLM for Logical Reasoning, and Why it’s Hard. . . Medium. https://medium.com/@jaspertan_49883/rationallama-fine-tuning-an-llm-for-logical-reasoning-and-why-its-hard-c590ff4081fc
- Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., & Zettlemoyer, L. (2022, February 25). Rethinking the role of demonstrations: What makes In-Context Learning work? arXiv.org. https://arxiv.org/abs/2202.12837