Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeSemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and spoken across various continents. The data instances are multi-labeled into six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) emotion labels in monolingual settings, (b) emotion intensity scores, and (c) emotion labels in cross-lingual settings. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available.
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video models to generate dynamic 3D scenes. However, current text-to-4D methods face a three-way tradeoff between the quality of scene appearance, 3D structure, and motion. For example, text-to-image models and their 3D-aware variants are trained on internet-scale image datasets and can be used to produce scenes with realistic appearance and 3D structure -- but no motion. Text-to-video models are trained on relatively smaller video datasets and can produce scenes with motion, but poorer appearance and 3D structure. While these models have complementary strengths, they also have opposing weaknesses, making it difficult to combine them in a way that alleviates this three-way tradeoff. Here, we introduce hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates benefits of each for high-fidelity text-to-4D generation. Using hybrid SDS, we demonstrate synthesis of 4D scenes with compelling appearance, 3D structure, and motion.
Dice Semimetric Losses: Optimizing the Dice Score with Soft Labels
The soft Dice loss (SDL) has taken a pivotal role in numerous automated segmentation pipelines in the medical imaging community. Over the last years, some reasons behind its superior functioning have been uncovered and further optimizations have been explored. However, there is currently no implementation that supports its direct utilization in scenarios involving soft labels. Hence, a synergy between the use of SDL and research leveraging the use of soft labels, also in the context of model calibration, is still missing. In this work, we introduce Dice semimetric losses (DMLs), which (i) are by design identical to SDL in a standard setting with hard labels, but (ii) can be employed in settings with soft labels. Our experiments on the public QUBIQ, LiTS and KiTS benchmarks confirm the potential synergy of DMLs with soft labels (e.g.\ averaging, label smoothing, and knowledge distillation) over hard labels (e.g.\ majority voting and random selection). As a result, we obtain superior Dice scores and model calibration, which supports the wider adoption of DMLs in practice. The code is available at https://github.com/zifuwanggg/JDTLosses{https://github.com/zifuwanggg/JDTLosses}.
Goal-Conditioned Imitation Learning using Score-based Diffusion Policies
We propose a new policy representation based on score-based diffusion models (SDMs). We apply our new policy representation in the domain of Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose goal-specified policies from large uncurated datasets without rewards. Our new goal-conditioned policy architecture "BEhavior generation with ScOre-based Diffusion Policies" (BESO) leverages a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process, and, hence allows for fast sampling strategies to generate goal-specified behavior in just 3 denoising steps, compared to 30+ steps of other diffusion based policies. Furthermore, BESO is highly expressive and can effectively capture multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can even be used to learn a goal-independent policy from play-data using classifier-free guidance. To the best of our knowledge this is the first work that a) represents a behavior policy based on such a decoupled SDM b) learns an SDM based policy in the domain of GCIL and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play-data. We evaluate BESO through detailed simulation and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation. Demonstrations and Code are available at https://intuitive-robots.github.io/beso-website/
Effectively Unbiased FID and Inception Score and where to find them
This paper shows that two commonly used evaluation metrics for generative models, the Fr\'echet Inception Distance (FID) and the Inception Score (IS), are biased -- the expected value of the score computed for a finite sample set is not the true value of the score. Worse, the paper shows that the bias term depends on the particular model being evaluated, so model A may get a better score than model B simply because model A's bias term is smaller. This effect cannot be fixed by evaluating at a fixed number of samples. This means all comparisons using FID or IS as currently computed are unreliable. We then show how to extrapolate the score to obtain an effectively bias-free estimate of scores computed with an infinite number of samples, which we term textrm{FID}_infty and textrm{IS}_infty. In turn, this effectively bias-free estimate requires good estimates of scores with a finite number of samples. We show that using Quasi-Monte Carlo integration notably improves estimates of FID and IS for finite sample sets. Our extrapolated scores are simple, drop-in replacements for the finite sample scores. Additionally, we show that using low discrepancy sequence in GAN training offers small improvements in the resulting generator.
VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores
Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding. This means they might output similar scores even if the original caption is rearranged into a different semantic statement. To address this, we propose to use the {bf V}isual {bf G}enerative {bf P}re-{bf T}raining Score ({bf VisualGPTScore}) of P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks like ARO and Crepe that assess compositional reasoning. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the Pointwise Mutual Information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing-where LLMs construct and explain answers-better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA-robustness, biases, and unfaithful explanations-showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
Using Perturbation to Improve Goodness-of-Fit Tests based on Kernelized Stein Discrepancy
Kernelized Stein discrepancy (KSD) is a score-based discrepancy widely used in goodness-of-fit tests. It can be applied even when the target distribution has an unknown normalising factor, such as in Bayesian analysis. We show theoretically and empirically that the KSD test can suffer from low power when the target and the alternative distributions have the same well-separated modes but differ in mixing proportions. We propose to perturb the observed sample via Markov transition kernels, with respect to which the target distribution is invariant. This allows us to then employ the KSD test on the perturbed sample. We provide numerical evidence that with suitably chosen transition kernels the proposed approach can lead to substantially higher power than the KSD test.
Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation.
PrivPAS: A real time Privacy-Preserving AI System and applied ethics
With 3.78 billion social media users worldwide in 2021 (48% of the human population), almost 3 billion images are shared daily. At the same time, a consistent evolution of smartphone cameras has led to a photography explosion with 85% of all new pictures being captured using smartphones. However, lately, there has been an increased discussion of privacy concerns when a person being photographed is unaware of the picture being taken or has reservations about the same being shared. These privacy violations are amplified for people with disabilities, who may find it challenging to raise dissent even if they are aware. Such unauthorized image captures may also be misused to gain sympathy by third-party organizations, leading to a privacy breach. Privacy for people with disabilities has so far received comparatively less attention from the AI community. This motivates us to work towards a solution to generate privacy-conscious cues for raising awareness in smartphone users of any sensitivity in their viewfinder content. To this end, we introduce PrivPAS (A real time Privacy-Preserving AI System) a novel framework to identify sensitive content. Additionally, we curate and annotate a dataset to identify and localize accessibility markers and classify whether an image is sensitive to a featured subject with a disability. We demonstrate that the proposed lightweight architecture, with a memory footprint of a mere 8.49MB, achieves a high mAP of 89.52% on resource-constrained devices. Furthermore, our pipeline, trained on face anonymized data, achieves an F1-score of 73.1%.
Vision-Language Modeling in PET/CT for Visual Grounding of Positive Findings
Vision-language models can connect the text description of an object to its specific location in an image through visual grounding. This has potential applications in enhanced radiology reporting. However, these models require large annotated image-text datasets, which are lacking for PET/CT. We developed an automated pipeline to generate weak labels linking PET/CT report descriptions to their image locations and used it to train a 3D vision-language visual grounding model. Our pipeline finds positive findings in PET/CT reports by identifying mentions of SUVmax and axial slice numbers. From 25,578 PET/CT exams, we extracted 11,356 sentence-label pairs. Using this data, we trained ConTEXTual Net 3D, which integrates text embeddings from a large language model with a 3D nnU-Net via token-level cross-attention. The model's performance was compared against LLMSeg, a 2.5D version of ConTEXTual Net, and two nuclear medicine physicians. The weak-labeling pipeline accurately identified lesion locations in 98% of cases (246/251), with 7.5% requiring boundary adjustments. ConTEXTual Net 3D achieved an F1 score of 0.80, outperforming LLMSeg (F1=0.22) and the 2.5D model (F1=0.53), though it underperformed both physicians (F1=0.94 and 0.91). The model achieved better performance on FDG (F1=0.78) and DCFPyL (F1=0.75) exams, while performance dropped on DOTATE (F1=0.58) and Fluciclovine (F1=0.66). The model performed consistently across lesion sizes but showed reduced accuracy on lesions with low uptake. Our novel weak labeling pipeline accurately produced an annotated dataset of PET/CT image-text pairs, facilitating the development of 3D visual grounding models. ConTEXTual Net 3D significantly outperformed other models but fell short of the performance of nuclear medicine physicians. Our study suggests that even larger datasets may be needed to close this performance gap.
SPADE: Enhancing Adaptive Cyber Deception Strategies with Generative AI and Structured Prompt Engineering
The rapid evolution of modern malware presents significant challenges to the development of effective defense mechanisms. Traditional cyber deception techniques often rely on static or manually configured parameters, limiting their adaptability to dynamic and sophisticated threats. This study leverages Generative AI (GenAI) models to automate the creation of adaptive cyber deception ploys, focusing on structured prompt engineering (PE) to enhance relevance, actionability, and deployability. We introduce a systematic framework (SPADE) to address inherent challenges large language models (LLMs) pose to adaptive deceptions, including generalized outputs, ambiguity, under-utilization of contextual information, and scalability constraints. Evaluations across diverse malware scenarios using metrics such as Recall, Exact Match (EM), BLEU Score, and expert quality assessments identified ChatGPT-4o as the top performer. Additionally, it achieved high engagement (93%) and accuracy (96%) with minimal refinements. Gemini and ChatGPT-4o Mini demonstrated competitive performance, with Llama3.2 showing promise despite requiring further optimization. These findings highlight the transformative potential of GenAI in automating scalable, adaptive deception strategies and underscore the critical role of structured PE in advancing real-world cybersecurity applications.
Characterization of Magnetic Labyrinthine Structures through Junctions and Terminals Detection using Template Matching and CNN
In material sciences, characterizing faults in periodic structures is vital for understanding material properties. To characterize magnetic labyrinthine patterns, it is necessary to accurately identify junctions and terminals, often featuring over a thousand closely packed defects per image. This study introduces a new technique called TM-CNN (Template Matching - Convolutional Neural Network) designed to detect a multitude of small objects in images, such as defects in magnetic labyrinthine patterns. TM-CNN was used to identify these structures in 444 experimental images, and the results were explored to deepen the understanding of magnetic materials. It employs a two-stage detection approach combining template matching, used in initial detection, with a convolutional neural network, used to eliminate incorrect identifications. To train a CNN classifier, it is necessary to create a large number of training images. This difficulty prevents the use of CNN in many practical applications. TM-CNN significantly reduces the manual workload for creating training images by automatically making most of the annotations and leaving only a small number of corrections to human reviewers. In testing, TM-CNN achieved an impressive F1 score of 0.988, far outperforming traditional template matching and CNN-based object detection algorithms.
Probabilistic Adaptation of Text-to-Video Models
Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a pretrained large model can be prohibitively expensive. Inspired by how a small modifiable component (e.g., prompts, prefix-tuning) can adapt a large language model to perform new tasks without requiring access to the model weights, we investigate how to adapt a large pretrained text-to-video model to a variety of downstream domains and tasks without finetuning. In answering this question, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model that is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. More videos can be found on the website https://video-adapter.github.io/.
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders
Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments use encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features to supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice-versa far outside of it, however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations orthogonal to existing methods such as kNN. Additionally, we find the silhouette score when measured in a UMAP-reduced space is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at https://github.com/scottclowe/zs-ssl-clustering/.
Predicting Gender by First Name Using Character-level Machine Learning
Predicting gender by the first name is not a simple task. In many applications, especially in the natural language processing (NLP) field, this task may be necessary, mainly when considering foreign names. In this paper, we examined and implemented several machine learning algorithms, such as extra trees, KNN, Naive Bayes, SVM, random forest, gradient boosting, light GBM, logistic regression, ridge classifier, and deep neural network models, such as MLP, RNN, GRU, CNN, and BiLSTM, to classify gender through the first name. A dataset of Brazilian names is used to train and evaluate the models. We analyzed the accuracy, recall, precision, f1 score, and confusion matrix to measure the models' performances. The results indicate that the gender prediction can be performed from the feature extraction strategy looking at the names as a set of strings. Some models accurately predict gender in more than 95% of the cases. The recurrent models overcome the feedforward models in this binary classification problem.
What can Large Language Models Capture about Code Functional Equivalence?
Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing SeqCoBench, a benchmark for systematically assessing how Code-LLMs can capture code functional equivalence. SeqCoBench contains over 20 code transformations that either preserve or alter the semantics of Python programs. We conduct extensive evaluations in different settings, including zero-shot and parameter-efficient finetuning methods on state-of-the-art (Code)-LLMs to see if they can discern semantically equivalent or different pairs of programs in SeqCoBench. We find that the performance gap between these LLMs and classical match-based retrieval scores is minimal, with both approaches showing a concerning lack of depth in understanding code semantics.
A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation
Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address above challenges, we propose Parts-aware Audio-driven Human Animation, PAHA, a unit enhancement and guidance framework for audio-driven upper-body animation. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.
Augmenting Perceptual Super-Resolution via Image Quality Predictors
Super-resolution (SR), a classical inverse problem in computer vision, is inherently ill-posed, inducing a distribution of plausible solutions for every input. However, the desired result is not simply the expectation of this distribution, which is the blurry image obtained by minimizing pixelwise error, but rather the sample with the highest image quality. A variety of techniques, from perceptual metrics to adversarial losses, are employed to this end. In this work, we explore an alternative: utilizing powerful non-reference image quality assessment (NR-IQA) models in the SR context. We begin with a comprehensive analysis of NR-IQA metrics on human-derived SR data, identifying both the accuracy (human alignment) and complementarity of different metrics. Then, we explore two methods of applying NR-IQA models to SR learning: (i) altering data sampling, by building on an existing multi-ground-truth SR framework, and (ii) directly optimizing a differentiable quality score. Our results demonstrate a more human-centric perception-distortion tradeoff, focusing less on non-perceptual pixel-wise distortion, instead improving the balance between perceptual fidelity and human-tuned NR-IQA measures.
Planning with Large Language Models for Code Generation
Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner to generate candidate programs and test them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; 2) it enables controllable code generation, such as concise codes and highly-commented codes by optimizing modified objective.
Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models
To what extent do vision-and-language foundation models possess a realistic world model (observation times action rightarrow observation) and a dynamics model (observation times observation rightarrow action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
The Liver Tumor Segmentation Benchmark (LiTS)
In this work, we report the set-up and results of the Liver Tumor Segmentation Benchmark (LiTS), which was organized in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) 2017 and the International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2017 and 2018. The image dataset is diverse and contains primary and secondary tumors with varied sizes and appearances with various lesion-to-background levels (hyper-/hypo-dense), created in collaboration with seven hospitals and research institutions. Seventy-five submitted liver and liver tumor segmentation algorithms were trained on a set of 131 computed tomography (CT) volumes and were tested on 70 unseen test images acquired from different patients. We found that not a single algorithm performed best for both liver and liver tumors in the three events. The best liver segmentation algorithm achieved a Dice score of 0.963, whereas, for tumor segmentation, the best algorithms achieved Dices scores of 0.674 (ISBI 2017), 0.702 (MICCAI 2017), and 0.739 (MICCAI 2018). Retrospectively, we performed additional analysis on liver tumor detection and revealed that not all top-performing segmentation algorithms worked well for tumor detection. The best liver tumor detection method achieved a lesion-wise recall of 0.458 (ISBI 2017), 0.515 (MICCAI 2017), and 0.554 (MICCAI 2018), indicating the need for further research. LiTS remains an active benchmark and resource for research, e.g., contributing the liver-related segmentation tasks in http://medicaldecathlon.com/. In addition, both data and online evaluation are accessible via www.lits-challenge.com.
LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers
Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc
A region-wide, multi-year set of crop field boundary labels for Africa
African agriculture is undergoing rapid transformation. Annual maps of crop fields are key to understanding the nature of this transformation, but such maps are currently lacking and must be developed using advanced machine learning models trained on high resolution remote sensing imagery. To enable the development of such models, we delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 across the continent using a custom labeling platform with built-in procedures for assessing and mitigating label error. We collected 42,403 labels, including 7,204 labels arising from tasks dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more labellers were tasked to map the same location (Class 4). Class 1 labels were used to calculate labeller-specific quality scores, while Class 1 and 4 sites mapped by at least 3 labellers were used to further evaluate label uncertainty using a Bayesian risk metric. Quality metrics showed that label quality was moderately high (0.75) for measures of total field extent, but low regarding the number of individual fields delineated (0.33), and the position of field edges (0.05). These values are expected when delineating small-scale fields in 3-5 m resolution imagery, which can be too coarse to reliably distinguish smaller fields, particularly in dense croplands, and therefore requires substantial labeller judgement. Nevertheless, previous work shows that such labels can train effective field mapping models. Furthermore, this large, probabilistic sample on its own provides valuable insight into regional agricultural characteristics, highlighting variations in the median field size and density. The imagery and vectorized labels along with quality information is available for download from two public repositories.
MAGR: Manifold-Aligned Graph Regularization for Continual Action Quality Assessment
Action Quality Assessment (AQA) evaluates diverse skills but models struggle with non-stationary data. We propose Continual AQA (CAQA) to refine models using sparse new data. Feature replay preserves memory without storing raw inputs. However, the misalignment between static old features and the dynamically changing feature manifold causes severe catastrophic forgetting. To address this novel problem, we propose Manifold-Aligned Graph Regularization (MAGR), which first aligns deviated old features to the current feature manifold, ensuring representation consistency. It then constructs a graph jointly arranging old and new features aligned with quality scores. Experiments show MAGR outperforms recent strong baselines with up to 6.56%, 5.66%, 15.64%, and 9.05% correlation gains on the MTL-AQA, FineDiving, UNLV-Dive, and JDM-MSA split datasets, respectively. This validates MAGR for continual assessment challenges arising from non-stationary skill variations.
Learning Iterative Reasoning through Energy Diffusion
We introduce iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason for a variety of tasks by formulating reasoning and decision-making problems with energy-based optimization. IRED learns energy functions to represent the constraints between input conditions and desired outputs. After training, IRED adapts the number of optimization steps during inference based on problem difficulty, enabling it to solve problems outside its training distribution -- such as more complex Sudoku puzzles, matrix completion with large value magnitudes, and pathfinding in larger graphs. Key to our method's success is two novel techniques: learning a sequence of annealed energy landscapes for easier inference and a combination of score function and energy landscape supervision for faster and more stable training. Our experiments show that IRED outperforms existing methods in continuous-space reasoning, discrete-space reasoning, and planning tasks, particularly in more challenging scenarios. Code and visualizations at https://energy-based-model.github.io/ired/
Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference
We propose a unified framework for Singing Voice Synthesis (SVS) and Conversion (SVC), addressing the limitations of existing approaches in cross-domain SVS/SVC, poor output musicality, and scarcity of singing data. Our framework enables control over multiple aspects, including language content based on lyrics, performance attributes based on a musical score, singing style and vocal techniques based on a selector, and voice identity based on a speech sample. The proposed zero-shot learning paradigm consists of one SVS model and two SVC models, utilizing pre-trained content embeddings and a diffusion-based generator. The proposed framework is also trained on mixed datasets comprising both singing and speech audio, allowing singing voice cloning based on speech reference. Experiments show substantial improvements in timbre similarity and musicality over state-of-the-art baselines, providing insights into other low-data music tasks such as instrumental style transfer. Examples can be found at: everyone-can-sing.github.io.
Baybayin Character Instance Detection
The Philippine Government recently passed the "National Writing System Act," which promotes using Baybayin in Philippine texts. In support of this effort to promote the use of Baybayin, we present a computer vision system which can aid individuals who cannot easily read Baybayin script. In this paper, we survey the existing methods of identifying Baybayin scripts using computer vision and machine learning techniques and discuss their capabilities and limitations. Further, we propose a Baybayin Optical Character Instance Segmentation and Classification model using state-of-the-art Convolutional Neural Networks (CNNs) that detect Baybayin character instances in an image then outputs the Latin alphabet counterparts of each character instance in the image. Most existing systems are limited to character-level image classification and often misclassify or not natively support characters with diacritics. In addition, these existing models often have specific input requirements that limit it to classifying Baybayin text in a controlled setting, such as limitations in clarity and contrast, among others. To our knowledge, our proposed method is the first end-to-end character instance detection model for Baybayin, achieving a mAP50 score of 93.30%, mAP50-95 score of 80.50%, and F1-Score of 84.84%.
Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding
Large Language Models (LLMs) present immense potential in the medical field, yet concerns over data privacy, regulatory compliance, and model stability restrict their widespread adoption. Although the distillation of high-performing closed-source LLMs has proven effective for general tasks, their application in healthcare is limited due to reduced domain knowledge and remnants of alignment behavior hindering clinical tasks. To address these challenges, we propose Dialogue-Based Knowledge Encoding (DBKE). DBKE enhances models' implicit knowledge base and primes them for conversational recall, augmenting their conversational capabilities and enabling a soft alignment for subsequent use cases. By transforming dense academic source text into synthetic dialogue, DBKE broadens the model's knowledge base and enables a soft alignment that guides downstream behaviours. We present Clinical Camel, an open-source, healthcare-focused conversational model, to showcase the effectiveness of DBKE. Clinical Camel outperforms GPT-3.5 on the United States Medical Licensing Examination (USMLE) Step 1 and Step 3 with scores of 53.2 % and 58.2 %, respectively, compared to GPT-3.5's scores of 36.1 % and 55.7 %. Clinical Camel adeptly handles multi-stage clinical case problems, provides adaptive counseling, and generates clinical notes. However, it is prone to hallucinations, which pose a significant obstacle in safety-critical settings. The performance of Clinical Camel underscores the importance of continued research and development of open-source models for the safe and effective integration of LLMs in healthcare settings.
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: https://github.com/gorjanradevski/dave
OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy
We present OrbNet Denali, a machine learning model for electronic structure that is designed as a drop-in replacement for ground-state density functional theory (DFT) energy calculations. The model is a message-passing neural network that uses symmetry-adapted atomic orbital features from a low-cost quantum calculation to predict the energy of a molecule. OrbNet Denali is trained on a vast dataset of 2.3 million DFT calculations on molecules and geometries. This dataset covers the most common elements in bio- and organic chemistry (H, Li, B, C, N, O, F, Na, Mg, Si, P, S, Cl, K, Ca, Br, I) as well as charged molecules. OrbNet Denali is demonstrated on several well-established benchmark datasets, and we find that it provides accuracy that is on par with modern DFT methods while offering a speedup of up to three orders of magnitude. For the GMTKN55 benchmark set, OrbNet Denali achieves WTMAD-1 and WTMAD-2 scores of 7.19 and 9.84, on par with modern DFT functionals. For several GMTKN55 subsets, which contain chemical problems that are not present in the training set, OrbNet Denali produces a mean absolute error comparable to those of DFT methods. For the Hutchison conformers benchmark set, OrbNet Denali has a median correlation coefficient of R^2=0.90 compared to the reference DLPNO-CCSD(T) calculation, and R^2=0.97 compared to the method used to generate the training data (wB97X-D3/def2-TZVP), exceeding the performance of any other method with a similar cost. Similarly, the model reaches chemical accuracy for non-covalent interactions in the S66x10 dataset. For torsional profiles, OrbNet Denali reproduces the torsion profiles of wB97X-D3/def2-TZVP with an average MAE of 0.12 kcal/mol for the potential energy surfaces of the diverse fragments in the TorsionNet500 dataset.
PropMolFlow: Property-guided Molecule Generation with Geometry-Complete Flow Matching
Molecule generation is advancing rapidly in chemical discovery and drug design. Flow matching methods have recently set the state of the art (SOTA) in unconditional molecule generation, surpassing score-based diffusion models. However, diffusion models still lead in property-guided generation. In this work, we introduce PropMolFlow, a novel approach for property-guided molecule generation based on geometry-complete SE(3)-equivariant flow matching. Integrating five different property embedding methods with a Gaussian expansion of scalar properties, PropMolFlow outperforms previous SOTA diffusion models in conditional molecule generation across various properties while preserving the stability and validity of the generated molecules, consistent with its unconditional counterpart. Additionally, it enables faster inference with significantly fewer time steps compared to baseline models. We highlight the importance of validating the properties of generated molecules through DFT calculations performed at the same level of theory as the training data. Specifically, our analysis identifies properties that require DFT validation and others where a pretrained SE(3) geometric vector perceptron regressors provide sufficiently accurate predictions on generated molecules. Furthermore, we introduce a new property metric designed to assess the model's ability to propose molecules with underrepresented property values, assessing its capacity for out-of-distribution generalization. Our findings reveal shortcomings in existing structural metrics, which mistakenly validate open-shell molecules or molecules with invalid valence-charge configurations, underscoring the need for improved evaluation frameworks. Overall, this work paves the way for developing targeted property-guided generation methods, enhancing the design of molecular generative models for diverse applications.
Image Restoration with Mean-Reverting Stochastic Differential Equations
This paper presents a stochastic differential equation (SDE) approach for general-purpose image restoration. The key construction consists in a mean-reverting SDE that transforms a high-quality image into a degraded counterpart as a mean state with fixed Gaussian noise. Then, by simulating the corresponding reverse-time SDE, we are able to restore the origin of the low-quality image without relying on any task-specific prior knowledge. Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing us to compute the ground truth time-dependent score and learn it with a neural network. Moreover, we propose a maximum likelihood objective to learn an optimal reverse trajectory that stabilizes the training and improves the restoration results. The experiments show that our proposed method achieves highly competitive performance in quantitative comparisons on image deraining, deblurring, and denoising, setting a new state-of-the-art on two deraining datasets. Finally, the general applicability of our approach is further demonstrated via qualitative results on image super-resolution, inpainting, and dehazing. Code is available at https://github.com/Algolzw/image-restoration-sde.
Large Dual Encoders Are Generalizable Retrievers
It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we challenge this belief by scaling up the size of the dual encoder model {\em while keeping the bottleneck embedding size fixed.} With multi-stage training, surprisingly, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform %ColBERT~khattab2020colbert and existing sparse and dense retrievers on the BEIR dataset~thakur2021beir significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10\% of MS Marco supervised data to achieve the best out-of-domain performance. All the GTR models are released at https://tfhub.dev/google/collections/gtr/1.
PiCSAR: Probabilistic Confidence Selection And Ranking
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
Fine-Tuning Language Models from Human Preferences
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
HateCheck: Functional Tests for Hate Speech Detection Models
Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck's utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
Reliable Tuberculosis Detection using Chest X-ray with Deep Learning, Segmentation and Visualization
Tuberculosis (TB) is a chronic lung disease that occurs due to bacterial infection and is one of the top 10 leading causes of death. Accurate and early detection of TB is very important, otherwise, it could be life-threatening. In this work, we have detected TB reliably from the chest X-ray images using image pre-processing, data augmentation, image segmentation, and deep-learning classification techniques. Several public databases were used to create a database of 700 TB infected and 3500 normal chest X-ray images for this study. Nine different deep CNNs (ResNet18, ResNet50, ResNet101, ChexNet, InceptionV3, Vgg19, DenseNet201, SqueezeNet, and MobileNet), which were used for transfer learning from their pre-trained initial weights and trained, validated and tested for classifying TB and non-TB normal cases. Three different experiments were carried out in this work: segmentation of X-ray images using two different U-net models, classification using X-ray images, and segmented lung images. The accuracy, precision, sensitivity, F1-score, specificity in the detection of tuberculosis using X-ray images were 97.07 %, 97.34 %, 97.07 %, 97.14 % and 97.36 % respectively. However, segmented lungs for the classification outperformed than whole X-ray image-based classification and accuracy, precision, sensitivity, F1-score, specificity were 99.9 %, 99.91 %, 99.9 %, 99.9 %, and 99.52 % respectively. The paper also used a visualization technique to confirm that CNN learns dominantly from the segmented lung regions results in higher detection accuracy. The proposed method with state-of-the-art performance can be useful in the computer-aided faster diagnosis of tuberculosis.
Expert-level validation of AI-generated medical text with scalable language models
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( https://github.com/StanfordMIMI/MedVAL ), 2) MedVAL-Bench ( https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench ), and 3) MedVAL-4B ( https://huggingface.co/stanfordmimi/MedVAL-4B ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping
A key component of deep learning (DL) for natural language processing (NLP) is word embeddings. Word embeddings that effectively capture the meaning and context of the word that they represent can significantly improve the performance of downstream DL models for various NLP tasks. Many existing word embeddings techniques capture the context of words based on word co-occurrence in documents and text; however, they often cannot capture broader domain-specific relationships between concepts that may be crucial for the NLP task at hand. In this paper, we propose a method to integrate external knowledge from medical terminology ontologies into the context captured by word embeddings. Specifically, we use a medical knowledge graph, such as the unified medical language system (UMLS), to find connections between clinical terms in cancer pathology reports. This approach aims to minimize the distance between connected clinical concepts. We evaluate the proposed approach using a Multitask Convolutional Neural Network (MT-CNN) to extract six cancer characteristics -- site, subsite, laterality, behavior, histology, and grade -- from a dataset of ~900K cancer pathology reports. The results show that the MT-CNN model which uses our domain informed embeddings outperforms the same MT-CNN using standard word2vec embeddings across all tasks, with an improvement in the overall micro- and macro-F1 scores by 4.97\%and 22.5\%, respectively.
The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up
We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.
Bonafide at LegalLens 2024 Shared Task: Using Lightweight DeBERTa Based Encoder For Legal Violation Detection and Resolution
In this work, we present two systems -- Named Entity Resolution (NER) and Natural Language Inference (NLI) -- for detecting legal violations within unstructured textual data and for associating these violations with potentially affected individuals, respectively. Both these systems are lightweight DeBERTa based encoders that outperform the LLM baselines. The proposed NER system achieved an F1 score of 60.01\% on Subtask A of the LegalLens challenge, which focuses on identifying violations. The proposed NLI system achieved an F1 score of 84.73\% on Subtask B of the LegalLens challenge, which focuses on resolving these violations by matching them with pre-existing legal complaints of class action cases. Our NER system ranked sixth and NLI system ranked fifth on the LegalLens leaderboard. We release the trained models and inference scripts.
Overview of Memotion 3: Sentiment and Emotion Analysis of Codemixed Hinglish Memes
Analyzing memes on the internet has emerged as a crucial endeavor due to the impact this multi-modal form of content wields in shaping online discourse. Memes have become a powerful tool for expressing emotions and sentiments, possibly even spreading hate and misinformation, through humor and sarcasm. In this paper, we present the overview of the Memotion 3 shared task, as part of the DeFactify 2 workshop at AAAI-23. The task released an annotated dataset of Hindi-English code-mixed memes based on their Sentiment (Task A), Emotion (Task B), and Emotion intensity (Task C). Each of these is defined as an individual task and the participants are ranked separately for each task. Over 50 teams registered for the shared task and 5 made final submissions to the test set of the Memotion 3 dataset. CLIP, BERT modifications, ViT etc. were the most popular models among the participants along with approaches such as Student-Teacher model, Fusion, and Ensembling. The best final F1 score for Task A is 34.41, Task B is 79.77 and Task C is 59.82.
UPB at SemEval-2022 Task 5: Enhancing UNITER with Image Sentiment and Graph Convolutional Networks for Multimedia Automatic Misogyny Identification
In recent times, the detection of hate-speech, offensive, or abusive language in online media has become an important topic in NLP research due to the exponential growth of social media and the propagation of such messages, as well as their impact. Misogyny detection, even though it plays an important part in hate-speech detection, has not received the same attention. In this paper, we describe our classification systems submitted to the SemEval-2022 Task 5: MAMI - Multimedia Automatic Misogyny Identification. The shared task aimed to identify misogynous content in a multi-modal setting by analysing meme images together with their textual captions. To this end, we propose two models based on the pre-trained UNITER model, one enhanced with an image sentiment classifier, whereas the second leverages a Vocabulary Graph Convolutional Network (VGCN). Additionally, we explore an ensemble using the aforementioned models. Our best model reaches an F1-score of 71.4% in Sub-task A and 67.3% for Sub-task B positioning our team in the upper third of the leaderboard. We release the code and experiments for our models on GitHub
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining
Self-supervised pretraining on large-scale satellite data has raised great interest in building Earth observation (EO) foundation models. However, many important resources beyond pure satellite imagery, such as land-cover-land-use products that provide free global semantic information, as well as vision foundation models that hold strong knowledge of the natural world, tend to be overlooked. In this work, we show these free additional resources not only help resolve common contrastive learning bottlenecks, but also significantly boost the efficiency and effectiveness of EO pretraining. Specifically, we first propose soft contrastive learning that optimizes cross-scene soft similarity based on land-cover-generated multi-label supervision, naturally solving the issue of multiple positive samples and too strict positive matching in complex scenes. Second, we explore cross-domain continual pretraining for both multispectral and SAR imagery, building efficient EO foundation models from strongest vision models such as DINOv2. Integrating simple weight-initialization and Siamese masking strategies into our soft contrastive learning framework, we demonstrate impressive continual pretraining performance even when the input channels and modalities are not aligned. Without prohibitive training, we produce multispectral and SAR foundation models that achieve significantly better results in 9 out of 10 downstream tasks than most existing SOTA models. For example, our ResNet50/ViT-S achieve 84.8/85.0 linear probing mAP scores on BigEarthNet-10\% which are better than most existing ViT-L models; under the same setting, our ViT-B sets a new record of 86.8 in multispectral, and 82.5 in SAR, the latter even better than many multispectral models. Dataset and models are available at https://github.com/zhu-xlab/softcon.
Towards Quantifiable Dialogue Coherence Evaluation
Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), while humans give Likert-type multi-level coherence scores, dubbed as "quantifiable"; (b) their predicted coherence scores cannot align with the actual human rating standards due to the absence of human guidance during training. To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework aiming to train a quantifiable dialogue coherence metric that can reflect the actual human rating standards. Specifically, QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is proposed for enabling the model to learn the coarse judgement of coherence degrees. Then, during KD fine-tuning, the pretrained model is further finetuned to learn the actual human rating standards with only very few human-annotated data. To advocate the generalizability even with limited fine-tuning data, a novel KD regularization is introduced to retain the knowledge learned at the pre-training stage. Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Aligning large language models (LLMs) with human values and intents critically involves the use of human or AI feedback. While dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score Response A on a scale of 1-7) and rankings (e.g., is Response A better than Response B?). In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree 60% for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomena, such as human annotators would rate denser responses higher while preferring accuracy during pairwise judgments. To our surprise, we also observe that the choice of feedback protocol also has a significant effect on the evaluation of aligned LLMs. In particular, we find that LLMs that leverage rankings data for alignment (say model X) are preferred over those that leverage ratings data (say model Y), with a rank-based evaluation protocol (is X/Y's response better than reference response?) but not with a rating-based evaluation protocol (score Rank X/Y's response on a scale of 1-7). Our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. Our code and data are available at https://github.com/Hritikbansal/sparse_feedback.
GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM's ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
From Words to Code: Harnessing Data for Program Synthesis from Natural Language
Creating programs to correctly manipulate data is a difficult task, as the underlying programming languages and APIs can be challenging to learn for many users who are not skilled programmers. Large language models (LLMs) demonstrate remarkable potential for generating code from natural language, but in the data manipulation domain, apart from the natural language (NL) description of the intended task, we also have the dataset on which the task is to be performed, or the "data context". Existing approaches have utilized data context in a limited way by simply adding relevant information from the input data into the prompts sent to the LLM. In this work, we utilize the available input data to execute the candidate programs generated by the LLMs and gather their outputs. We introduce semantic reranking, a technique to rerank the programs generated by LLMs based on three signals coming the program outputs: (a) semantic filtering and well-formedness based score tuning: do programs even generate well-formed outputs, (b) semantic interleaving: how do the outputs from different candidates compare to each other, and (c) output-based score tuning: how do the outputs compare to outputs predicted for the same task. We provide theoretical justification for semantic interleaving. We also introduce temperature mixing, where we combine samples generated by LLMs using both high and low temperatures. We extensively evaluate our approach in three domains, namely databases (SQL), data science (Pandas) and business intelligence (Excel's Power Query M) on a variety of new and existing benchmarks. We observe substantial gains across domains, with improvements of up to 45% in top-1 accuracy and 34% in top-3 accuracy.
Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding
Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations - a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences, requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) - a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: Given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pairs similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieve the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization
Compositional generalization is crucial for artificial intelligence agents to solve complex vision-language reasoning tasks. Neuro-symbolic approaches have demonstrated promise in capturing compositional structures, but they face critical challenges: (a) reliance on predefined predicates for symbolic representations that limit adaptability, (b) difficulty in extracting predicates from raw data, and (c) using non-differentiable operations for combining primitive concepts. To address these issues, we propose NeSyCoCo, a neuro-symbolic framework that leverages large language models (LLMs) to generate symbolic representations and map them to differentiable neural computations. NeSyCoCo introduces three innovations: (a) augmenting natural language inputs with dependency structures to enhance the alignment with symbolic representations, (b) employing distributed word representations to link diverse, linguistically motivated logical predicates to neural modules, and (c) using the soft composition of normalized predicate scores to align symbolic and differentiable reasoning. Our framework achieves state-of-the-art results on the ReaSCAN and CLEVR-CoGenT compositional generalization benchmarks and demonstrates robust performance with novel concepts in the CLEVR-SYN benchmark.
Making Large Language Models Better Reasoners with Alignment
Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of the artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an Assessment Misalignment problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. To address this problem, we introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question, and categorizing them into positive and negative ones based on whether they achieve the correct answer; 3) calibrating the scores of positive and negative responses given by LLMs with a novel constraint alignment loss. Specifically, the constraint alignment loss has two objectives: a) Alignment, which guarantees that positive scores surpass negative scores to encourage answers with high-quality COTs; b) Constraint, which keeps the negative scores confined to a reasonable range to prevent the model degradation. Beyond just the binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to the ranking situations when ranking feedback is accessible. Furthermore, we also delve deeply into recent ranking-based alignment methods, such as DPO, RRHF, and PRO, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance. Extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of AFT.
MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
Mixture-of-Experts large language models (MoE-LLMs) marks a significant step forward of language models, however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the current activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate the MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors on activation reconstruction error, routing scores, and activated frequencies, highlighting their differing importance, and b) not all tokens are equally important -- only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve an extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates the adaptive bit-width allocation as a Linear Programming problem, where the objective function balances multi-factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically select activated experts for other tokens during inference to optimize efficiency while maintaining performance. Our MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
Personality Style Recognition via Machine Learning: Identifying Anaclitic and Introjective Personality Styles from Patients' Speech
In disentangling the heterogeneity observed in psychopathology, personality of the patients is considered crucial. While it has been demonstrated that personality traits are reflected in the language used by a patient, we hypothesize that this enables automatic inference of the personality type directly from speech utterances, potentially more accurately than through a traditional questionnaire-based approach explicitly designed for personality classification. To validate this hypothesis, we adopt natural language processing (NLP) and standard machine learning tools for classification. We test this on a dataset of recorded clinical diagnostic interviews (CDI) on a sample of 79 patients diagnosed with major depressive disorder (MDD) -- a condition for which differentiated treatment based on personality styles has been advocated -- and classified into anaclitic and introjective personality styles. We start by analyzing the interviews to see which linguistic features are associated with each style, in order to gain a better understanding of the styles. Then, we develop automatic classifiers based on (a) standardized questionnaire responses; (b) basic text features, i.e., TF-IDF scores of words and word sequences; (c) more advanced text features, using LIWC (linguistic inquiry and word count) and context-aware features using BERT (bidirectional encoder representations from transformers); (d) audio features. We find that automated classification with language-derived features (i.e., based on LIWC) significantly outperforms questionnaire-based classification models. Furthermore, the best performance is achieved by combining LIWC with the questionnaire features. This suggests that more work should be put into developing linguistically based automated techniques for characterizing personality, however questionnaires still to some extent complement such methods.
LogiGAN: Learning Logical Reasoning via Adversarial Pre-training
We present LogiGAN, an unsupervised adversarial pre-training framework for improving logical reasoning abilities of language models. Upon automatic identifying logical reasoning phenomena in massive text corpus via detection heuristics, we train language models to predict the masked-out logical statements. Inspired by the facilitation effect of reflective thinking in human learning, we analogically simulate the learning-thinking process with an adversarial Generator-Verifier architecture to assist logic learning. LogiGAN implements a novel sequential GAN approach that (a) circumvents the non-differentiable challenge of the sequential GAN by leveraging the Generator as a sentence-level generative likelihood scorer with a learning objective of reaching scoring consensus with the Verifier; (b) is computationally feasible for large-scale pre-training with arbitrary target length. Both base and large size language models pre-trained with LogiGAN demonstrate obvious performance improvement on 12 datasets requiring general reasoning abilities, revealing the fundamental role of logic in broad reasoning, as well as the effectiveness of LogiGAN. Ablation studies on LogiGAN components reveal the relative orthogonality between linguistic and logic abilities and suggest that reflective thinking's facilitation effect might also generalize to machine learning.
Development of an NLP-driven computer-based test guide for visually impaired students
In recent years, advancements in Natural Language Processing (NLP) techniques have revolutionized the field of accessibility and exclusivity of testing, particularly for visually impaired students (VIS). CBT has shown in years back its relevance in terms of administering exams electronically, making the test process easier, providing quicker and more accurate results, and offering greater flexibility and accessibility for candidates. Yet, its relevance was not felt by the visually impaired students as they cannot access printed documents. Hence, in this paper, we present an NLP-driven Computer-Based Test guide for visually impaired students. It employs a speech technology pre-trained methods to provide real-time assistance and support to visually impaired students. The system utilizes NLP technologies to convert the text-based questions and the associated options in a machine-readable format. Subsequently, the speech technology pre-trained model processes the converted text enabling the VIS to comprehend and analyze the content. Furthermore, we validated that this pre-trained model is not perverse by testing for accuracy using sample audio datasets labels (A, B, C, D, E, F, G) to compare with the voice recordings obtained from 20 VIS which is been predicted by the system to attain values for precision, recall, and F1-scores. These metrics are used to assess the performance of the pre-trained model and have indicated that it is proficient enough to give its better performance to the evaluated system. The methodology adopted for this system is Object Oriented Analysis and Design Methodology (OOADM) where Objects are discussed and built by modeling real-world instances.
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
Prompt Tuning is emerging as a scalable and cost-effective method to fine-tune Pretrained Language Models (PLMs), which are often referred to as Large Language Models (LLMs). This study benchmarks the performance and computational efficiency of Prompt Tuning and baselines for multi-label text classification. This is applied to the challenging task of classifying companies into an investment firm's proprietary industry taxonomy, supporting their thematic investment strategy. Text-to-text classification is frequently reported to outperform task-specific classification heads, but has several limitations when applied to a multi-label classification problem where each label consists of multiple tokens: (a) Generated labels may not match any label in the label taxonomy; (b) The fine-tuning process lacks permutation invariance and is sensitive to the order of the provided labels; (c) The model provides binary decisions rather than appropriate confidence scores. Limitation (a) is addressed by applying constrained decoding using Trie Search, which slightly improves classification performance. All limitations (a), (b), and (c) are addressed by replacing the PLM's language head with a classification head, which is referred to as Prompt Tuned Embedding Classification (PTEC). This improves performance significantly, while also reducing computational costs during inference. In our industrial application, the training data is skewed towards well-known companies. We confirm that the model's performance is consistent across both well-known and less-known companies. Our overall results indicate the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of PLMs with strong generalization abilities. We release our codebase and a benchmarking dataset at https://github.com/EQTPartners/PTEC.
BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition
Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world, situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR, however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were places in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.
Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist
Skiplists are widely used for in-memory indexing in many key-value stores, such as RocksDB and LevelDB, due to their ease of implementation and simple concurrency control mechanisms. However, traditional skiplists suffer from poor cache locality, as they store only a single element per node, leaving performance on the table. Minimizing last-level cache misses is key to maximizing in-memory index performance, making high cache locality essential. In this paper, we present a practical concurrent B-skiplist that enhances cache locality and performance while preserving the simplicity of traditional skiplist structures and concurrency control schemes. Our key contributions include a top-down, single-pass insertion algorithm for B-skiplists and a corresponding simple and efficient top-down concurrency control scheme. On 128 threads, the proposed concurrent B-skiplist achieves between 2x-9x higher throughput compared to state-of-the-art concurrent skiplist implementations, including Facebook's concurrent skiplist from Folly and the Java ConcurrentSkipListMap. Furthermore, we find that the B-skiplist achieves competitive (0.9x-1.7x) throughput on point workloads compared to state-of-the-art cache-optimized tree-based indices (e.g., Masstree). For a more complete picture of the performance, we also measure the latency of skiplist and tree-based indices and find that the B-skiplist achieves between 3.5x-103x lower 99% latency compared to other concurrent skiplists and between 0.85x-64x lower 99% latency compared to tree-based indices on point workloads with inserts.
RAR-b: Reasoning as Retrieval Benchmark
Semantic textual similartiy (STS) and information retrieval tasks (IR) tasks have been the two major avenues to record the progress of embedding models in the past few years. Under the emerging Retrieval-augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that without specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from being competent for playing the role of assisting LLMs, especially in reasoning-intensive tasks. Moreover, albeit trained to be aware of instructions, instruction-aware IR models are often better off without instructions in inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so to bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.
A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models
Recent advancements in large language models (LLMs) have propelled Artificial Intelligence (AI) to new heights, enabling breakthroughs in various tasks such as writing assistance, code generation, and machine translation. A significant distinction of advanced LLMs, such as ChatGPT, is their demonstrated ability to "reason." However, evaluating the reasoning ability of LLMs remains a challenge as most existing evaluations focus on their accuracy on the downstream tasks rather than directly assessing their reasoning processes. Efforts have been made to develop benchmarks and metrics to assess reasoning in LLMs, but they suffer from data leakage or limited scope. In this paper, we introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs under a set of atomic reasoning skills based on propositional and predicate logic. The results provide insights into LLMs' reasoning abilities and reveal the logical rules the LLMs did not learn well. We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco. The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs with a rate of 25\% - 94\%. In addition, the test cases of LogicAsker can be further used to design demonstration examples for in-context learning, which effectively improves the logical reasoning ability of LLMs, e.g., 10\% for GPT-4. As far as we know, our work is the first to create prompts based on testing results to improve LLMs' formal reasoning ability effectively. All the code, data, and results will be released for reproduction and future research.
Is ChatGPT a Good Teacher Coach? Measuring Zero-Shot Performance For Scoring and Providing Actionable Insights on Classroom Instruction
Coaching, which involves classroom observation and expert feedback, is a widespread and fundamental part of teacher training. However, the majority of teachers do not have access to consistent, high quality coaching due to limited resources and access to expertise. We explore whether generative AI could become a cost-effective complement to expert feedback by serving as an automated teacher coach. In doing so, we propose three teacher coaching tasks for generative AI: (A) scoring transcript segments based on classroom observation instruments, (B) identifying highlights and missed opportunities for good instructional strategies, and (C) providing actionable suggestions for eliciting more student reasoning. We recruit expert math teachers to evaluate the zero-shot performance of ChatGPT on each of these tasks for elementary math classroom transcripts. Our results reveal that ChatGPT generates responses that are relevant to improving instruction, but they are often not novel or insightful. For example, 82% of the model's suggestions point to places in the transcript where the teacher is already implementing that suggestion. Our work highlights the challenges of producing insightful, novel and truthful feedback for teachers while paving the way for future research to address these obstacles and improve the capacity of generative AI to coach teachers.
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation
Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robustness analysis results are available publicly on the NL-Augmenter repository (https://github.com/GEM-benchmark/NL-Augmenter).
Euclid Quick Data Release (Q1): First visual morphology catalogue
We present a detailed visual morphology catalogue for Euclid's Quick Release 1 (Q1). Our catalogue includes galaxy features such as bars, spiral arms, and ongoing mergers, for the 378000 bright (IE < 20.5) or extended (area geq 700,pixels) galaxies in Q1. The catalogue was created by finetuning the Zoobot galaxy foundation models on annotations from an intensive one month campaign by Galaxy Zoo volunteers. Our measurements are fully automated and hence fully scaleable. This catalogue is the first 0.4% of the approximately 100 million galaxies where Euclid will ultimately resolve detailed morphology.
Euclid Quick Data Release (Q1): From images to multiwavelength catalogues: the Euclid MERge Processing Function
The Euclid satellite is an ESA mission that was launched in July 2023. \Euclid is working in its regular observing mode with the target of observing an area of 14,000~deg^2 with two instruments, the Visible Camera (VIS) and the Near IR Spectrometer and Photometer (NISP) down to I_{rm E} = 24.5~mag (10, sigma) in the Euclid Wide Survey. Ground-based imaging data in the ugriz bands complement the \Euclid data to enable photo-z determination and VIS PSF modeling for week lensing analysis. Euclid investigates the distance-redshift relation and the evolution of cosmic structures by measuring shapes and redshifts of galaxies and clusters of galaxies out to zsim 2. Generating the multi-wavelength catalogues from \Euclid and ground-based data is an essential part of the \Euclid data processing system. In the framework of the \Euclid Science Ground Segment (SGS), the aim of the MER Processing Function (PF) pipeline is to detect objects in the \Euclid imaging data, measure their properties, and MERge them into a single multi-wavelength catalogue. The MER PF pipeline performs source detection on both visible (VIS) and near-infrared (NIR) images and offers four different photometric measurements: Kron total flux, aperture photometry on PSF-matched images, template fitting photometry, and S\'ersic fitting photometry. Furthermore, the MER PF pipeline measures a set of ancillary quantities, spanning from morphology to quality flags, to better characterise all detected sources. In this paper, we show how the MER PF pipeline is designed, detailing its main steps, and we show that the pipeline products meet the tight requirements that Euclid aims to achieve on photometric accuracy. We also present the other measurements (e.g. morphology) that are included in the OU-MER output catalogues and we list all output products coming out of the MER PF pipeline.
The Atacama Cosmology Telescope: DR6 Constraints on Extended Cosmological Models
We use new cosmic microwave background (CMB) primary temperature and polarization anisotropy measurements from the Atacama Cosmology Telescope (ACT) Data Release 6 (DR6) to test foundational assumptions of the standard cosmological model and set constraints on extensions to it. We derive constraints from the ACT DR6 power spectra alone, as well as in combination with legacy data from Planck. To break geometric degeneracies, we include ACT and Planck CMB lensing data and baryon acoustic oscillation data from DESI Year-1, and further add supernovae measurements from Pantheon+ for models that affect the late-time expansion history. We verify the near-scale-invariance (running of the spectral index d n_s/dln k = 0.0062 pm 0.0052) and adiabaticity of the primordial perturbations. Neutrino properties are consistent with Standard Model predictions: we find no evidence for new light, relativistic species that are free-streaming (N_{rm eff} = 2.86 pm 0.13, which combined with external BBN data becomes N_{rm eff} = 2.89 pm 0.11), for non-zero neutrino masses (sum m_nu < 0.082 eV at 95% CL), or for neutrino self-interactions. We also find no evidence for self-interacting dark radiation (N_{rm idr} < 0.134), early-universe variation of fundamental constants, early dark energy, primordial magnetic fields, or modified recombination. Our data are consistent with standard BBN, the FIRAS-inferred CMB temperature, a dark matter component that is collisionless and with only a small fraction allowed as axion-like particles, a cosmological constant, and the late-time growth rate predicted by general relativity. We find no statistically significant preference for a departure from the baseline LambdaCDM model. In general, models introduced to increase the Hubble constant or to decrease the amplitude of density fluctuations inferred from the primary CMB are not favored by our data.
Humanity's Last Exam
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy
Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code for incorporating entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, method body completion/code summarization conditioned on file-level context.
PennyLane: Automatic differentiation of hybrid quantum-classical computations
PennyLane is a Python 3 software framework for differentiable programming of quantum computers. The library provides a unified architecture for near-term quantum computing devices, supporting both qubit and continuous-variable paradigms. PennyLane's core feature is the ability to compute gradients of variational quantum circuits in a way that is compatible with classical techniques such as backpropagation. PennyLane thus extends the automatic differentiation algorithms common in optimization and machine learning to include quantum and hybrid computations. A plugin system makes the framework compatible with any gate-based quantum simulator or hardware. We provide plugins for hardware providers including the Xanadu Cloud, Amazon Braket, and IBM Quantum, allowing PennyLane optimizations to be run on publicly accessible quantum devices. On the classical front, PennyLane interfaces with accelerated machine learning libraries such as TensorFlow, PyTorch, JAX, and Autograd. PennyLane can be used for the optimization of variational quantum eigensolvers, quantum approximate optimization, quantum machine learning models, and many other applications.
Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography
The lack of large and diverse training data on Computer-Aided Diagnosis (CAD) in breast cancer detection has been one of the concerns that impedes the adoption of the system. Recently, pre-training with large-scale image text datasets via Vision-Language models (VLM) (\eg CLIP) partially addresses the issue of robustness and data efficiency in computer vision (CV). This paper proposes Mammo-CLIP, the first VLM pre-trained on a substantial amount of screening mammogram-report pairs, addressing the challenges of dataset diversity and size. Our experiments on two public datasets demonstrate strong performance in classifying and localizing various mammographic attributes crucial for breast cancer detection, showcasing data efficiency and robustness similar to CLIP in CV. We also propose Mammo-FActOR, a novel feature attribution method, to provide spatial interpretation of representation with sentence-level granularity within mammography reports. Code is available publicly: https://github.com/batmanlab/Mammo-CLIP.
MatchMiner-AI: An Open-Source Solution for Cancer Clinical Trial Matching
Clinical trials drive improvements in cancer treatments and outcomes. However, most adults with cancer do not participate in trials, and trials often fail to enroll enough patients to answer their scientific questions. Artificial intelligence could accelerate matching of patients to appropriate clinical trials. Here, we describe the development and evaluation of the MatchMiner-AI pipeline for clinical trial searching and ranking. MatchMiner-AI focuses on matching patients to potential trials based on core criteria describing clinical "spaces," or disease contexts, targeted by a trial. It aims to accelerate the human work of identifying potential matches, not to fully automate trial screening. The pipeline includes modules for extraction of key information from a patient's longitudinal electronic health record; rapid ranking of candidate trial-patient matches based on embeddings in vector space; and classification of whether a candidate match represents a reasonable clinical consideration. Code and synthetic data are available at https://huggingface.co/ksg-dfci/MatchMiner-AI . Model weights based on synthetic data are available at https://huggingface.co/ksg-dfci/TrialSpace and https://huggingface.co/ksg-dfci/TrialChecker . A simple cancer clinical trial search engine to demonstrate pipeline components is available at https://huggingface.co/spaces/ksg-dfci/trial_search_alpha .
JumpStarter: Human-AI Planning with Task-Structured Context Curation
Human-AI planning for complex goals remains challenging with current large language models (LLMs), which rely on linear chat histories and simplistic memory mechanisms. Despite advances in long-context prompting, users still manually manage information, leading to a high cognitive burden. Hence, we propose JumpStarter, a system that enables LLMs to collaborate with humans on complex goals by dynamically decomposing tasks to help users manage context. We specifically introduce task-structured context curation, a novel framework that breaks down a user's goal into a hierarchy of actionable subtasks, and scopes context to localized decision points, enabling finer-grained personalization and reuse. The framework is realized through three core mechanisms: context elicitation, selection, and reuse. We demonstrate that task-structured context curation significantly improves plan quality by 16% over ablations. Our user study shows that JumpStarter helped users generate plans with 79% higher quality compared to ChatGPT.
UEyes: An Eye-Tracking Dataset across User Interface Types
Different types of user interfaces differ significantly in the number of elements and how they are displayed. To examine how such differences affect the way users look at UIs, we collected and analyzed a large eye-tracking-based dataset, UEyes (62 participants, 1,980 UI screenshots, near 20K eye movement sequences), covering four major UI types: webpage, desktop UI, mobile UI, and poster. Furthermore, we analyze and discuss the differences in important factors, such as color, location, and gaze direction across UI types, individual viewing strategies and potential future directions. This position paper is a derivative of our recent paper with a particular focus on the UEyes dataset.
From structure mining to unsupervised exploration of atomic octahedral networks
Networks of atom-centered coordination octahedra commonly occur in inorganic and hybrid solid-state materials. Characterizing their spatial arrangements and characteristics is crucial for relating structures to properties for many materials families. The traditional method using case-by-case inspection becomes prohibitive for discovering trends and similarities in large datasets. Here, we operationalize chemical intuition to automate the geometric parsing, quantification, and classification of coordination octahedral networks. We find axis-resolved tilting trends in ABO_{3} perovskite polymorphs, which assist in detecting oxidation state changes. Moreover, we develop a scale-invariant encoding scheme to represent these networks, which, combined with human-assisted unsupervised machine learning, allows us to taxonomize the inorganic framework polytypes in hybrid iodoplumbates (A_xPb_yI_z). Consequently, we uncover a violation of Pauling's third rule and the design principles underpinning their topological diversity. Our results offer a glimpse into the vast design space of atomic octahedral networks and inform high-throughput, targeted screening of specific structure types.
Real-time quantum error correction beyond break-even
The ambition of harnessing the quantum for computation is at odds with the fundamental phenomenon of decoherence. The purpose of quantum error correction (QEC) is to counteract the natural tendency of a complex system to decohere. This cooperative process, which requires participation of multiple quantum and classical components, creates a special type of dissipation that removes the entropy caused by the errors faster than the rate at which these errors corrupt the stored quantum information. Previous experimental attempts to engineer such a process faced an excessive generation of errors that overwhelmed the error-correcting capability of the process itself. Whether it is practically possible to utilize QEC for extending quantum coherence thus remains an open question. We answer it by demonstrating a fully stabilized and error-corrected logical qubit whose quantum coherence is significantly longer than that of all the imperfect quantum components involved in the QEC process, beating the best of them with a coherence gain of G = 2.27 pm 0.07. We achieve this performance by combining innovations in several domains including the fabrication of superconducting quantum circuits and model-free reinforcement learning.
Scaling Artificial Intelligence for Multi-Tumor Early Detection with More Reports, Fewer Masks
Early tumor detection save lives. Each year, more than 300 million computed tomography (CT) scans are performed worldwide, offering a vast opportunity for effective cancer screening. However, detecting small or early-stage tumors on these CT scans remains challenging, even for experts. Artificial intelligence (AI) models can assist by highlighting suspicious regions, but training such models typically requires extensive tumor masks--detailed, voxel-wise outlines of tumors manually drawn by radiologists. Drawing these masks is costly, requiring years of effort and millions of dollars. In contrast, nearly every CT scan in clinical practice is already accompanied by medical reports describing the tumor's size, number, appearance, and sometimes, pathology results--information that is rich, abundant, and often underutilized for AI training. We introduce R-Super, which trains AI to segment tumors that match their descriptions in medical reports. This approach scales AI training with large collections of readily available medical reports, substantially reducing the need for manually drawn tumor masks. When trained on 101,654 reports, AI models achieved performance comparable to those trained on 723 masks. Combining reports and masks further improved sensitivity by +13% and specificity by +8%, surpassing radiologists in detecting five of the seven tumor types. Notably, R-Super enabled segmentation of tumors in the spleen, gallbladder, prostate, bladder, uterus, and esophagus, for which no public masks or AI models previously existed. This study challenges the long-held belief that large-scale, labor-intensive tumor mask creation is indispensable, establishing a scalable and accessible path toward early detection across diverse tumor types. We plan to release our trained models, code, and dataset at https://github.com/MrGiovanni/R-Super
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory
SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference
This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications. These models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. However, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. To address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. The original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. This enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. After adaptation, models are fine-tuned on encrypted versions of existing training datasets. Experimental evaluation employing adapted versions of renowned models (e.g., BERT, RoBERTa) across established benchmark English and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. This serves to safeguard performance, privacy, and security cohesively.
Superposed Episodic and Semantic Memory via Sparse Distributed Representation
The abilities to perceive, learn, and use generalities, similarities, classes, i.e., semantic memory (SM), is central to cognition. Machine learning (ML), neural network, and AI research has been primarily driven by tasks requiring such abilities. However, another central facet of cognition, single-trial formation of permanent memories of experiences, i.e., episodic memory (EM), has had relatively little focus. Only recently has EM-like functionality been added to Deep Learning (DL) models, e.g., Neural Turing Machine, Memory Networks. However, in these cases: a) EM is implemented as a separate module, which entails substantial data movement (and so, time and power) between the DL net itself and EM; and b) individual items are stored localistically within the EM, precluding realizing the exponential representational efficiency of distributed over localist coding. We describe Sparsey, an unsupervised, hierarchical, spatial/spatiotemporal associative memory model differing fundamentally from mainstream ML models, most crucially, in its use of sparse distributed representations (SDRs), or, cell assemblies, which admits an extremely efficient, single-trial learning algorithm that maps input similarity into code space similarity (measured as intersection). SDRs of individual inputs are stored in superposition and because similarity is preserved, the patterns of intersections over the assigned codes reflect the similarity, i.e., statistical, structure, of all orders, not simply pairwise, over the inputs. Thus, SM, i.e., a generative model, is built as a computationally free side effect of the act of storing episodic memory traces of individual inputs, either spatial patterns or sequences. We report initial results on MNIST and on the Weizmann video event recognition benchmarks. While we have not yet attained SOTA class accuracy, learning takes only minutes on a single CPU.
DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation
Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (\DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact \{3, 8\}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, \DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making \DeAR a highly effective and interpretable solution for modern reranking systems.Dataset and code available at https://github.com/DataScienceUIBK/DeAR-Reranking..
