new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Sep 11

Baichuan4-Finance Technical Report

Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning, yet their potential in finance remains underexplored due to the complexity and specialization of financial knowledge. In this work, we report the development of the Baichuan4-Finance series, including a comprehensive suite of foundational Baichuan4-Finance-Base and an aligned language model Baichuan4-Finance, which are built upon Baichuan4-Turbo base model and tailored for finance domain. Firstly, we have dedicated significant effort to building a detailed pipeline for improving data quality. Moreover, in the continual pre-training phase, we propose a novel domain self-constraint training strategy, which enables Baichuan4-Finance-Base to acquire financial knowledge without losing general capabilities. After Supervised Fine-tuning and Reinforcement Learning from Human Feedback and AI Feedback, the chat model Baichuan4-Finance is able to tackle various financial certification questions and real-world scenario applications. We evaluate Baichuan4-Finance on many widely used general datasets and two holistic financial benchmarks. The evaluation results show that Baichuan4-Finance-Base surpasses almost all competitive baselines on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. At the same time, Baichuan4-Finance demonstrates even more impressive performance on financial application scenarios, showcasing its potential to foster community innovation in the financial LLM field.

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Large Language Models (LLMs) exhibit considerable promise in financial applications; however, prevailing models frequently demonstrate limitations when confronted with scenarios that necessitate sophisticated reasoning capabilities, stringent trustworthiness criteria, and efficient adaptation to domain-specific requirements. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task label system with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, tow-stage training pipeline, and dynamic attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including Fineva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA-diamond. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications. The Finova bench is available at https://github.com/antgroup/Finova.

Approaching Emergent Risks: An Exploratory Study into Artificial Intelligence Risk Management within Financial Organisations

Globally, artificial intelligence (AI) implementation is growing, holding the capability to fundamentally alter organisational processes and decision making. Simultaneously, this brings a multitude of emergent risks to organisations, exposing vulnerabilities in their extant risk management frameworks. This necessitates a greater understanding of how organisations can position themselves in response. This issue is particularly pertinent within the financial sector with relatively mature AI applications matched with severe societal repercussions of potential risk events. Despite this, academic risk management literature is trailing behind the speed of AI implementation. Adopting a management perspective, this study aims to contribute to the understanding of AI risk management in organisations through an exploratory empirical investigation into these practices. In-depth insights are gained through interviews with nine practitioners from different organisations within the UK financial sector. Through examining areas of organisational convergence and divergence, the findings of this study unearth levels of risk management framework readiness and prevailing approaches to risk management at both a processual and organisational level. Whilst enhancing the developing literature concerning AI risk management within organisations, the study simultaneously offers a practical contribution, providing key areas of guidance for practitioners in the operational development of AI risk management frameworks.

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, an bilingual open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark. The characteristics of our benchmark are finance and expertise, which include constructing charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creating questions according to the preferences in financial domain inquiries, and annotating questions by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is first introduced in the multi-modal evaluation process. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks cannot do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare performance of MLLMs under a Chinese context.

FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models

As financial institutions and professionals increasingly incorporate Large Language Models (LLMs) into their workflows, substantial barriers, including proprietary data and specialized knowledge, persist between the finance sector and the AI community. These challenges impede the AI community's ability to enhance financial tasks effectively. Acknowledging financial analysis's critical role, we aim to devise financial-specialized LLM-based toolchains and democratize access to them through open-source initiatives, promoting wider AI adoption in financial decision-making. In this paper, we introduce FinRobot, a novel open-source AI agent platform supporting multiple financially specialized AI agents, each powered by LLM. Specifically, the platform consists of four major layers: 1) the Financial AI Agents layer that formulates Financial Chain-of-Thought (CoT) by breaking sophisticated financial problems down into logical sequences; 2) the Financial LLM Algorithms layer dynamically configures appropriate model application strategies for specific tasks; 3) the LLMOps and DataOps layer produces accurate models by applying training/fine-tuning techniques and using task-relevant data; 4) the Multi-source LLM Foundation Models layer that integrates various LLMs and enables the above layers to access them directly. Finally, FinRobot provides hands-on for both professional-grade analysts and laypersons to utilize powerful AI techniques for advanced financial analysis. We open-source FinRobot at https://github.com/AI4Finance-Foundation/FinRobot.

Revolutionizing Finance with LLMs: An Overview of Applications and Insights

In recent years, Large Language Models (LLMs) like ChatGPT have seen considerable advancements and have been applied in diverse fields. Built on the Transformer architecture, these models are trained on extensive datasets, enabling them to understand and generate human language effectively. In the financial domain, the deployment of LLMs is gaining momentum. These models are being utilized for automating financial report generation, forecasting market trends, analyzing investor sentiment, and offering personalized financial advice. Leveraging their natural language processing capabilities, LLMs can distill key insights from vast financial data, aiding institutions in making informed investment choices and enhancing both operational efficiency and customer satisfaction. In this study, we provide a comprehensive overview of the emerging integration of LLMs into various financial tasks. Additionally, we conducted holistic tests on multiple financial tasks through the combination of natural language instructions. Our findings show that GPT-4 effectively follow prompt instructions across various financial tasks. This survey and evaluation of LLMs in the financial domain aim to deepen the understanding of LLMs' current role in finance for both financial practitioners and LLM researchers, identify new research and application prospects, and highlight how these technologies can be leveraged to solve practical challenges in the finance industry.

InvestLM: A Large Language Model for Investment using Financial Domain Instruction Tuning

We present a new financial domain large language model, InvestLM, tuned on LLaMA-65B (Touvron et al., 2023), using a carefully curated instruction dataset related to financial investment. Inspired by less-is-more-for-alignment (Zhou et al., 2023), we manually curate a small yet diverse instruction dataset, covering a wide range of financial related topics, from Chartered Financial Analyst (CFA) exam questions to SEC filings to Stackexchange quantitative finance discussions. InvestLM shows strong capabilities in understanding financial text and provides helpful responses to investment related questions. Financial experts, including hedge fund managers and research analysts, rate InvestLM's response as comparable to those of state-of-the-art commercial models (GPT-3.5, GPT-4 and Claude-2). Zero-shot evaluation on a set of financial NLP benchmarks demonstrates strong generalizability. From a research perspective, this work suggests that a high-quality domain specific LLM can be tuned using a small set of carefully curated instructions on a well-trained foundation model, which is consistent with the Superficial Alignment Hypothesis (Zhou et al., 2023). From a practical perspective, this work develops a state-of-the-art financial domain LLM with superior capability in understanding financial texts and providing helpful investment advice, potentially enhancing the work efficiency of financial professionals. We release the model parameters to the research community.

Bridging Language Models and Financial Analysis

The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

FinWorld: An All-in-One Open-Source Platform for End-to-End Financial AI Research and Deployment

Financial AI holds great promise for transforming modern finance, with the potential to support a wide range of tasks such as market forecasting, portfolio management, quantitative trading, and automated analysis. However, existing platforms remain limited in task coverage, lack robust multimodal data integration, and offer insufficient support for the training and deployment of large language models (LLMs). In response to these limitations, we present FinWorld, an all-in-one open-source platform that provides end-to-end support for the entire financial AI workflow, from data acquisition to experimentation and deployment. FinWorld distinguishes itself through native integration of heterogeneous financial data, unified support for diverse AI paradigms, and advanced agent automation, enabling seamless development and deployment. Leveraging data from 2 representative markets, 4 stock pools, and over 800 million financial data points, we conduct comprehensive experiments on 4 key financial AI tasks. These experiments systematically evaluate deep learning and reinforcement learning algorithms, with particular emphasis on RL-based finetuning for LLMs and LLM Agents. The empirical results demonstrate that FinWorld significantly enhances reproducibility, supports transparent benchmarking, and streamlines deployment, thereby providing a strong foundation for future research and real-world applications. Code is available at Github~https://github.com/DVampire/FinWorld.

From Scores to Skills: A Cognitive Diagnosis Framework for Evaluating Financial Large Language Models

Large Language Models (LLMs) have shown promise for financial applications, yet their suitability for this high-stakes domain remains largely unproven due to inadequacies in existing benchmarks. Existing benchmarks solely rely on score-level evaluation, summarizing performance with a single score that obscures the nuanced understanding of what models truly know and their precise limitations. They also rely on datasets that cover only a narrow subset of financial concepts, while overlooking other essentials for real-world applications. To address these gaps, we introduce FinCDM, the first cognitive diagnosis evaluation framework tailored for financial LLMs, enabling the evaluation of LLMs at the knowledge-skill level, identifying what financial skills and knowledge they have or lack based on their response patterns across skill-tagged tasks, rather than a single aggregated number. We construct CPA-QKA, the first cognitively informed financial evaluation dataset derived from the Certified Public Accountant (CPA) examination, with comprehensive coverage of real-world accounting and financial skills. It is rigorously annotated by domain experts, who author, validate, and annotate questions with high inter-annotator agreement and fine-grained knowledge labels. Our extensive experiments on 30 proprietary, open-source, and domain-specific LLMs show that FinCDM reveals hidden knowledge gaps, identifies under-tested areas such as tax and regulatory reasoning overlooked by traditional benchmarks, and uncovers behavioral clusters among models. FinCDM introduces a new paradigm for financial LLM evaluation by enabling interpretable, skill-aware diagnosis that supports more trustworthy and targeted model development, and all datasets and evaluation scripts will be publicly released to support further research.

Quantitative Risk Management in Volatile Markets with an Expectile-Based Framework for the FTSE Index

This research presents a framework for quantitative risk management in volatile markets, specifically focusing on expectile-based methodologies applied to the FTSE 100 index. Traditional risk measures such as Value-at-Risk (VaR) have demonstrated significant limitations during periods of market stress, as evidenced during the 2008 financial crisis and subsequent volatile periods. This study develops an advanced expectile-based framework that addresses the shortcomings of conventional quantile-based approaches by providing greater sensitivity to tail losses and improved stability in extreme market conditions. The research employs a dataset spanning two decades of FTSE 100 returns, incorporating periods of high volatility, market crashes, and recovery phases. Our methodology introduces novel mathematical formulations for expectile regression models, enhanced threshold determination techniques using time series analysis, and robust backtesting procedures. The empirical results demonstrate that expectile-based Value-at-Risk (EVaR) consistently outperforms traditional VaR measures across various confidence levels and market conditions. The framework exhibits superior performance during volatile periods, with reduced model risk and enhanced predictive accuracy. Furthermore, the study establishes practical implementation guidelines for financial institutions and provides evidence-based recommendations for regulatory compliance and portfolio management. The findings contribute significantly to the literature on financial risk management and offer practical tools for practitioners dealing with volatile market environments.

FinRobot: AI Agent for Equity Research and Valuation with Large Language Models

As financial markets grow increasingly complex, there is a rising need for automated tools that can effectively assist human analysts in equity research, particularly within sell-side research. While Generative AI (GenAI) has attracted significant attention in this field, existing AI solutions often fall short due to their narrow focus on technical factors and limited capacity for discretionary judgment. These limitations hinder their ability to adapt to new data in real-time and accurately assess risks, which diminishes their practical value for investors. This paper presents FinRobot, the first AI agent framework specifically designed for equity research. FinRobot employs a multi-agent Chain of Thought (CoT) system, integrating both quantitative and qualitative analyses to emulate the comprehensive reasoning of a human analyst. The system is structured around three specialized agents: the Data-CoT Agent, which aggregates diverse data sources for robust financial integration; the Concept-CoT Agent, which mimics an analysts reasoning to generate actionable insights; and the Thesis-CoT Agent, which synthesizes these insights into a coherent investment thesis and report. FinRobot provides thorough company analysis supported by precise numerical data, industry-appropriate valuation metrics, and realistic risk assessments. Its dynamically updatable data pipeline ensures that research remains timely and relevant, adapting seamlessly to new financial information. Unlike existing automated research tools, such as CapitalCube and Wright Reports, FinRobot delivers insights comparable to those produced by major brokerage firms and fundamental research vendors. We open-source FinRobot at https://github. com/AI4Finance-Foundation/FinRobot.

FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain

The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9\%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.

Golden Touchstone: A Comprehensive Bilingual Benchmark for Evaluating Financial Large Language Models

As large language models become increasingly prevalent in the financial sector, there is a pressing need for a standardized method to comprehensively assess their performance. However, existing finance benchmarks often suffer from limited language and task coverage, as well as challenges such as low-quality datasets and inadequate adaptability for LLM evaluation. To address these limitations, we propose "Golden Touchstone", the first comprehensive bilingual benchmark for financial LLMs, which incorporates representative datasets from both Chinese and English across eight core financial NLP tasks. Developed from extensive open source data collection and industry-specific demands, this benchmark includes a variety of financial tasks aimed at thoroughly assessing models' language understanding and generation capabilities. Through comparative analysis of major models on the benchmark, such as GPT-4o Llama3, FinGPT and FinMA, we reveal their strengths and limitations in processing complex financial information. Additionally, we open-sourced Touchstone-GPT, a financial LLM trained through continual pre-training and financial instruction tuning, which demonstrates strong performance on the bilingual benchmark but still has limitations in specific tasks.This research not only provides the financial large language models with a practical evaluation tool but also guides the development and optimization of future research. The source code for Golden Touchstone and model weight of Touchstone-GPT have been made publicly available at https://github.com/IDEA-FinAI/Golden-Touchstone, contributing to the ongoing evolution of FinLLMs and fostering further research in this critical area.

PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance

Although large language models (LLMs) has shown great performance on natural language processing (NLP) in the financial domain, there are no publicly available financial tailtored LLMs, instruction tuning datasets, and evaluation benchmarks, which is critical for continually pushing forward the open-source development of financial artificial intelligence (AI). This paper introduces PIXIU, a comprehensive framework including the first financial LLM based on fine-tuning LLaMA with instruction data, the first instruction data with 136K data samples to support the fine-tuning, and an evaluation benchmark with 5 tasks and 9 datasets. We first construct the large-scale multi-task instruction data considering a variety of financial tasks, financial document types, and financial data modalities. We then propose a financial LLM called FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks. To support the evaluation of financial LLMs, we propose a standardized benchmark that covers a set of critical financial tasks, including five financial NLP tasks and one financial prediction task. With this benchmark, we conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks. The model, datasets, benchmark, and experimental results are open-sourced to facilitate future research in financial AI.

Advancing Investment Frontiers: Industry-grade Deep Reinforcement Learning for Portfolio Optimization

This research paper delves into the application of Deep Reinforcement Learning (DRL) in asset-class agnostic portfolio optimization, integrating industry-grade methodologies with quantitative finance. At the heart of this integration is our robust framework that not only merges advanced DRL algorithms with modern computational techniques but also emphasizes stringent statistical analysis, software engineering and regulatory compliance. To the best of our knowledge, this is the first study integrating financial Reinforcement Learning with sim-to-real methodologies from robotics and mathematical physics, thus enriching our frameworks and arguments with this unique perspective. Our research culminates with the introduction of AlphaOptimizerNet, a proprietary Reinforcement Learning agent (and corresponding library). Developed from a synthesis of state-of-the-art (SOTA) literature and our unique interdisciplinary methodology, AlphaOptimizerNet demonstrates encouraging risk-return optimization across various asset classes with realistic constraints. These preliminary results underscore the practical efficacy of our frameworks. As the finance sector increasingly gravitates towards advanced algorithmic solutions, our study bridges theoretical advancements with real-world applicability, offering a template for ensuring safety and robust standards in this technologically driven future.

NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-task Financial Forecasting

Financial forecasting has been an important and active area of machine learning research because of the challenges it presents and the potential rewards that even minor improvements in prediction accuracy or forecasting may entail. Traditionally, financial forecasting has heavily relied on quantitative indicators and metrics derived from structured financial statements. Earnings conference call data, including text and audio, is an important source of unstructured data that has been used for various prediction tasks using deep earning and related approaches. However, current deep learning-based methods are limited in the way that they deal with numeric data; numbers are typically treated as plain-text tokens without taking advantage of their underlying numeric structure. This paper describes a numeric-oriented hierarchical transformer model to predict stock returns, and financial risk using multi-modal aligned earnings calls data by taking advantage of the different categories of numbers (monetary, temporal, percentages etc.) and their magnitude. We present the results of a comprehensive evaluation of NumHTML against several state-of-the-art baselines using a real-world publicly available dataset. The results indicate that NumHTML significantly outperforms the current state-of-the-art across a variety of evaluation metrics and that it has the potential to offer significant financial gains in a practical trading context.

FinGPT: Democratizing Internet-scale Data for Financial Large Language Models

Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating human-like texts, which may potentially revolutionize the finance industry. However, existing LLMs often fall short in the financial field, which is mainly attributed to the disparities between general text data and financial text data. Unfortunately, there is only a limited number of financial text datasets available, and BloombergGPT, the first financial LLM (FinLLM), is close-sourced (only the training logs were released). In light of this, we aim to democratize Internet-scale financial data for LLMs, which is an open challenge due to diverse data sources, low signal-to-noise ratio, and high time-validity. To address the challenges, we introduce an open-sourced and data-centric framework, Financial Generative Pre-trained Transformer (FinGPT), that automates the collection and curation of real-time financial data from 34 diverse sources on the Internet, providing researchers and practitioners with accessible and transparent resources to develop their FinLLMs. Additionally, we propose a simple yet effective strategy for fine-tuning FinLLM using the inherent feedback from the market, dubbed Reinforcement Learning with Stock Prices (RLSP). We also adopt the Low-rank Adaptation (LoRA, QLoRA) method that enables users to customize their own FinLLMs from general-purpose LLMs at a low cost. Finally, we showcase several FinGPT applications, including robo-advisor, sentiment analysis for algorithmic trading, and low-code development. FinGPT aims to democratize FinLLMs, stimulate innovation, and unlock new opportunities in open finance. The codes have been open-sourced.

MiMIC: Multi-Modal Indian Earnings Calls Dataset to Predict Stock Prices

Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, encompassing companies representing the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and visual modalities, thereby enabling a holistic approach to feature representation and analysis. This multi-modal approach demonstrates the potential for integrating diverse information sources to enhance financial forecasting accuracy. To promote further research in computational economics, we have made the MiMIC dataset publicly available under the CC-NC-SA-4.0 licence. Our work contributes to the growing body of literature on market reactions to corporate communications and highlights the efficacy of multi-modal machine learning techniques in financial analysis.

Solving the optimal stopping problem with reinforcement learning: an application in financial option exercise

The optimal stopping problem is a category of decision problems with a specific constrained configuration. It is relevant to various real-world applications such as finance and management. To solve the optimal stopping problem, state-of-the-art algorithms in dynamic programming, such as the least-squares Monte Carlo (LSMC), are employed. This type of algorithm relies on path simulations using only the last price of the underlying asset as a state representation. Also, the LSMC was thinking for option valuation where risk-neutral probabilities can be employed to account for uncertainty. However, the general optimal stopping problem goals may not fit the requirements of the LSMC showing auto-correlated prices. We employ a data-driven method that uses Monte Carlo simulation to train and test artificial neural networks (ANN) to solve the optimal stopping problem. Using ANN to solve decision problems is not entirely new. We propose a different architecture that uses convolutional neural networks (CNN) to deal with the dimensionality problem that arises when we transform the whole history of prices into a Markovian state. We present experiments that indicate that our proposed architecture improves results over the previous implementations under specific simulated time series function sets. Lastly, we employ our proposed method to compare the optimal exercise of the financial options problem with the LSMC algorithm. Our experiments show that our method can capture more accurate exercise opportunities when compared to the LSMC. We have outstandingly higher (above 974\% improvement) expected payoff from these exercise policies under the many Monte Carlo simulations that used the real-world return database on the out-of-sample (test) data.

FinTruthQA: A Benchmark Dataset for Evaluating the Quality of Financial Information Disclosure

Accurate and transparent financial information disclosure is essential in accounting and finance, fostering trust and enabling informed investment decisions that drive economic development. Among many information disclosure platforms, the Chinese stock exchanges' investor interactive platform provides a novel and interactive way for listed firms to disclose information of interest to investors through an online question-and-answer (Q&A) format. However, it is common for listed firms to respond to questions with limited or no substantive information, and automatically evaluating the quality of financial information disclosure on large amounts of Q&A pairs is challenging. In this study, our interdisciplinary team of AI and finance professionals proposed FinTruthQA, a benchmark designed to evaluate advanced natural language processing (NLP) techniques for the automatic quality assessment of information disclosure in financial Q&A data. It comprises 6,000 real-world financial Q&A entries and each Q&A was manually annotated based on four key evaluation criteria. We benchmarked various NLP techniques on FinTruthQA, including large language models(LLMs). Experiments showed that existing NLP models have strong predictive ability for question identification and question relevance tasks, but are suboptimal for answer readability and answer relevance tasks. By establishing this benchmark, we provide a robust foundation for the automatic evaluation of information disclosure, demonstrating how AI can be leveraged for social good by promoting transparency, fairness, and investor protection in financial disclosure practices. FinTruthQA can be used by auditors, regulators, and financial analysts for real-time monitoring and data-driven decision-making, as well as by researchers for advanced studies in accounting and finance, ultimately fostering greater trust and efficiency in the financial markets.

FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance

Enterprise Resource Planning (ERP) systems serve as the digital backbone of modern financial institutions, yet they continue to rely on static, rule-based workflows that limit adaptability, scalability, and intelligence. As business operations grow more complex and data-rich, conventional ERP platforms struggle to integrate structured and unstructured data in real time and to accommodate dynamic, cross-functional workflows. In this paper, we present the first AI-native, agent-based framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents (GBPAs) that bring autonomy, reasoning, and dynamic optimization to enterprise workflows. The proposed system integrates generative AI with business process modeling and multi-agent orchestration, enabling end-to-end automation of complex tasks such as budget planning, financial reporting, and wire transfer processing. Unlike traditional workflow engines, GBPAs interpret user intent, synthesize workflows in real time, and coordinate specialized sub-agents for modular task execution. We validate the framework through case studies in bank wire transfers and employee reimbursements, two representative financial workflows with distinct complexity and data modalities. Results show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance by enabling parallelism, risk control insertion, and semantic reasoning. These findings highlight the potential of GBPAs to bridge the gap between generative AI capabilities and enterprise-grade automation, laying the groundwork for the next generation of intelligent ERP systems.

Stock Performance Evaluation for Portfolio Design from Different Sectors of the Indian Stock Market

The stock market offers a platform where people buy and sell shares of publicly listed companies. Generally, stock prices are quite volatile; hence predicting them is a daunting task. There is still much research going to develop more accuracy in stock price prediction. Portfolio construction refers to the allocation of different sector stocks optimally to achieve a maximum return by taking a minimum risk. A good portfolio can help investors earn maximum profit by taking a minimum risk. Beginning with Dow Jones Theory a lot of advancement has happened in the area of building efficient portfolios. In this project, we have tried to predict the future value of a few stocks from six important sectors of the Indian economy and also built a portfolio. As part of the project, our team has conducted a study of the performance of various Time series, machine learning, and deep learning models in stock price prediction on selected stocks from the chosen six important sectors of the economy. As part of building an efficient portfolio, we have studied multiple portfolio optimization theories beginning with the Modern Portfolio theory. We have built a minimum variance portfolio and optimal risk portfolio for all the six chosen sectors by using the daily stock prices over the past five years as training data and have also conducted back testing to check the performance of the portfolio. We look forward to continuing our study in the area of stock price prediction and asset allocation and consider this project as the first stepping stone.

Foundation Models and Fair Use

Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning

In the field of quantitative trading, it is common practice to transform raw historical stock data into indicative signals for the market trend. Such signals are called alpha factors. Alphas in formula forms are more interpretable and thus favored by practitioners concerned with risk. In practice, a set of formulaic alphas is often used together for better modeling precision, so we need to find synergistic formulaic alpha sets that work well together. However, most traditional alpha generators mine alphas one by one separately, overlooking the fact that the alphas would be combined later. In this paper, we propose a new alpha-mining framework that prioritizes mining a synergistic set of alphas, i.e., it directly uses the performance of the downstream combination model to optimize the alpha generator. Our framework also leverages the strong exploratory capabilities of reinforcement learning~(RL) to better explore the vast search space of formulaic alphas. The contribution to the combination models' performance is assigned to be the return used in the RL process, driving the alpha generator to find better alphas that improve upon the current set. Experimental evaluations on real-world stock market data demonstrate both the effectiveness and the efficiency of our framework for stock trend forecasting. The investment simulation results show that our framework is able to achieve higher returns compared to previous approaches.

Deep Reinforcement Learning for Quantitative Trading

Artificial Intelligence (AI) and Machine Learning (ML) are transforming the domain of Quantitative Trading (QT) through the deployment of advanced algorithms capable of sifting through extensive financial datasets to pinpoint lucrative investment openings. AI-driven models, particularly those employing ML techniques such as deep learning and reinforcement learning, have shown great prowess in predicting market trends and executing trades at a speed and accuracy that far surpass human capabilities. Its capacity to automate critical tasks, such as discerning market conditions and executing trading strategies, has been pivotal. However, persistent challenges exist in current QT methods, especially in effectively handling noisy and high-frequency financial data. Striking a balance between exploration and exploitation poses another challenge for AI-driven trading agents. To surmount these hurdles, our proposed solution, QTNet, introduces an adaptive trading model that autonomously formulates QT strategies through an intelligent trading agent. Incorporating deep reinforcement learning (DRL) with imitative learning methodologies, we bolster the proficiency of our model. To tackle the challenges posed by volatile financial datasets, we conceptualize the QT mechanism within the framework of a Partially Observable Markov Decision Process (POMDP). Moreover, by embedding imitative learning, the model can capitalize on traditional trading tactics, nurturing a balanced synergy between discovery and utilization. For a more realistic simulation, our trading agent undergoes training using minute-frequency data sourced from the live financial market. Experimental findings underscore the model's proficiency in extracting robust market features and its adaptability to diverse market conditions.

FNSPID: A Comprehensive Financial News Dataset in Time Series

Financial market predictions utilize historical data to anticipate future stock prices and market trends. Traditionally, these predictions have focused on the statistical analysis of quantitative factors, such as stock prices, trading volumes, inflation rates, and changes in industrial production. Recent advancements in large language models motivate the integrated financial analysis of both sentiment data, particularly market news, and numerical factors. Nonetheless, this methodology frequently encounters constraints due to the paucity of extensive datasets that amalgamate both quantitative and qualitative sentiment analyses. To address this challenge, we introduce a large-scale financial dataset, namely, Financial News and Stock Price Integration Dataset (FNSPID). It comprises 29.7 million stock prices and 15.7 million time-aligned financial news records for 4,775 S&P500 companies, covering the period from 1999 to 2023, sourced from 4 stock market news websites. We demonstrate that FNSPID excels existing stock market datasets in scale and diversity while uniquely incorporating sentiment information. Through financial analysis experiments on FNSPID, we propose: (1) the dataset's size and quality significantly boost market prediction accuracy; (2) adding sentiment scores modestly enhances performance on the transformer-based model; (3) a reproducible procedure that can update the dataset. Completed work, code, documentation, and examples are available at github.com/Zdong104/FNSPID. FNSPID offers unprecedented opportunities for the financial research community to advance predictive modeling and analysis.

MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model

Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite significant efforts to build real-world simulators, the application of generative models to virtual worlds, like financial markets, remains under-explored. In financial markets, generative models can simulate complex market effects of participants with various behaviors, enabling interaction under different market conditions, and training strategies without financial risk. This simulation relies on the finest structured data in financial market like orders thus building the finest realistic simulation. We propose Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the domain-specific need for realistic, interactive and controllable order generation. Key observations include LMM's strong scalability across data size and model complexity, and MarS's robust and practicable realism in controlled generation with market impact. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment, thus demonstrating MarS's "paradigm shift" potential for a variety of financial applications. We release the code of MarS at https://github.com/microsoft/MarS/.

Harmful Terms and Where to Find Them: Measuring and Modeling Unfavorable Financial Terms and Conditions in Shopping Websites at Scale

Terms and conditions for online shopping websites often contain terms that can have significant financial consequences for customers. Despite their impact, there is currently no comprehensive understanding of the types and potential risks associated with unfavorable financial terms. Furthermore, there are no publicly available detection systems or datasets to systematically identify or mitigate these terms. In this paper, we take the first steps toward solving this problem with three key contributions. First, we introduce TermMiner, an automated data collection and topic modeling pipeline to understand the landscape of unfavorable financial terms. Second, we create ShopTC-100K, a dataset of terms and conditions from shopping websites in the Tranco top 100K list, comprising 1.8 million terms from 8,251 websites. Consequently, we develop a taxonomy of 22 types from 4 categories of unfavorable financial terms -- spanning purchase, post-purchase, account termination, and legal aspects. Third, we build TermLens, an automated detector that uses Large Language Models (LLMs) to identify unfavorable financial terms. Fine-tuned on an annotated dataset, TermLens achieves an F1 score of 94.6\% and a false positive rate of 2.3\% using GPT-4o. When applied to shopping websites from the Tranco top 100K, we find that 42.06\% of these sites contain at least one unfavorable financial term, with such terms being more prevalent on less popular websites. Case studies further highlight the financial risks and customer dissatisfaction associated with unfavorable financial terms, as well as the limitations of existing ecosystem defenses.

Measuring Large Language Models Capacity to Annotate Journalistic Sourcing

Since the launch of ChatGPT in late 2022, the capacities of Large Language Models and their evaluation have been in constant discussion and evaluation both in academic research and in the industry. Scenarios and benchmarks have been developed in several areas such as law, medicine and math (Bommasani et al., 2023) and there is continuous evaluation of model variants. One area that has not received sufficient scenario development attention is journalism, and in particular journalistic sourcing and ethics. Journalism is a crucial truth-determination function in democracy (Vincent, 2023), and sourcing is a crucial pillar to all original journalistic output. Evaluating the capacities of LLMs to annotate stories for the different signals of sourcing and how reporters justify them is a crucial scenario that warrants a benchmark approach. It offers potential to build automated systems to contrast more transparent and ethically rigorous forms of journalism with everyday fare. In this paper we lay out a scenario to evaluate LLM performance on identifying and annotating sourcing in news stories on a five-category schema inspired from journalism studies (Gans, 2004). We offer the use case, our dataset and metrics and as the first step towards systematic benchmarking. Our accuracy findings indicate LLM-based approaches have more catching to do in identifying all the sourced statements in a story, and equally, in matching the type of sources. An even harder task is spotting source justifications.

When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments

Can AI Agents simulate real-world trading environments to investigate the impact of external factors on stock trading activities (e.g., macroeconomics, policy changes, company fundamentals, and global events)? These factors, which frequently influence trading behaviors, are critical elements in the quest for maximizing investors' profits. Our work attempts to solve this problem through large language model based agents. We have developed a multi-agent AI system called StockAgent, driven by LLMs, designed to simulate investors' trading behaviors in response to the real stock market. The StockAgent allows users to evaluate the impact of different external factors on investor trading and to analyze trading behavior and profitability effects. Additionally, StockAgent avoids the test set leakage issue present in existing trading simulation systems based on AI Agents. Specifically, it prevents the model from leveraging prior knowledge it may have acquired related to the test data. We evaluate different LLMs under the framework of StockAgent in a stock trading environment that closely resembles real-world conditions. The experimental results demonstrate the impact of key external factors on stock market trading, including trading behavior and stock price fluctuation rules. This research explores the study of agents' free trading gaps in the context of no prior knowledge related to market data. The patterns identified through StockAgent simulations provide valuable insights for LLM-based investment advice and stock recommendation. The code is available at https://github.com/MingyuJ666/Stockagent.

Show me your NFT and I tell you how it will perform: Multimodal representation learning for NFT selling price prediction

Non-Fungible Tokens (NFTs) represent deeds of ownership, based on blockchain technologies and smart contracts, of unique crypto assets on digital art forms (e.g., artworks or collectibles). In the spotlight after skyrocketing in 2021, NFTs have attracted the attention of crypto enthusiasts and investors intent on placing promising investments in this profitable market. However, the NFT financial performance prediction has not been widely explored to date. In this work, we address the above problem based on the hypothesis that NFT images and their textual descriptions are essential proxies to predict the NFT selling prices. To this purpose, we propose MERLIN, a novel multimodal deep learning framework designed to train Transformer-based language and visual models, along with graph neural network models, on collections of NFTs' images and texts. A key aspect in MERLIN is its independence on financial features, as it exploits only the primary data a user interested in NFT trading would like to deal with, i.e., NFT images and textual descriptions. By learning dense representations of such data, a price-category classification task is performed by MERLIN models, which can also be tuned according to user preferences in the inference phase to mimic different risk-return investment profiles. Experimental evaluation on a publicly available dataset has shown that MERLIN models achieve significant performances according to several financial assessment criteria, fostering profitable investments, and also beating baseline machine-learning classifiers based on financial features.

Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal Lens

AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of Multi-Modal Time-Series Forecasting Models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method on a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings. Our work will help different stakeholders of time-series forecasting understand the models` behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making.

A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

Deep learning has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with accuracy that often surpasses human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense out of the flood of data that now inundates our society. As public awareness of the efficacy of DL increases so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer ANNs that comprise DL. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where DL has emerged as a game-changing technology: anomalous behavior detection in financial applications or in financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the DL community as well as to provide a reference to researchers seeking to use it in their work for what it does best: statistical pattern recognition with unparalleled learning capacity with the ability to scale with information.

Hedging Properties of Algorithmic Investment Strategies using Long Short-Term Memory and Time Series models for Equity Indices

This paper proposes a novel approach to hedging portfolios of risky assets when financial markets are affected by financial turmoils. We introduce a completely novel approach to diversification activity not on the level of single assets but on the level of ensemble algorithmic investment strategies (AIS) built based on the prices of these assets. We employ four types of diverse theoretical models (LSTM - Long Short-Term Memory, ARIMA-GARCH - Autoregressive Integrated Moving Average - Generalized Autoregressive Conditional Heteroskedasticity, momentum, and contrarian) to generate price forecasts, which are then used to produce investment signals in single and complex AIS. In such a way, we are able to verify the diversification potential of different types of investment strategies consisting of various assets (energy commodities, precious metals, cryptocurrencies, or soft commodities) in hedging ensemble AIS built for equity indices (S&P 500 index). Empirical data used in this study cover the period between 2004 and 2022. Our main conclusion is that LSTM-based strategies outperform the other models and that the best diversifier for the AIS built for the S&P 500 index is the AIS built for Bitcoin. Finally, we test the LSTM model for a higher frequency of data (1 hour). We conclude that it outperforms the results obtained using daily data.

Toward a traceable, explainable, and fairJD/Resume recommendation system

In the last few decades, companies are interested to adopt an online automated recruitment process in an international recruitment environment. The problem is that the recruitment of employees through the manual procedure is a time and money consuming process. As a result, processing a significant number of applications through conventional methods can lead to the recruitment of clumsy individuals. Different JD/Resume matching model architectures have been proposed and reveal a high accuracy level in selecting relevant candidatesfor the required job positions. However, the development of an automatic recruitment system is still one of the main challenges. The reason is that the development of a fully automated recruitment system is a difficult task and poses different challenges. For example, providing a detailed matching explanation for the targeted stakeholders is needed to ensure a transparent recommendation. There are several knowledge bases that represent skills and competencies (e.g, ESCO, O*NET) that are used to identify the candidate and the required job skills for a matching purpose. Besides, modernpre-trained language models are fine-tuned for this context such as identifying lines where a specific feature was introduced. Typically, pre-trained language models use transfer-based machine learning models to be fine-tuned for a specific field. In this proposal, our aim is to explore how modern language models (based on transformers) can be combined with knowledge bases and ontologies to enhance the JD/Resume matching process. Our system aims at using knowledge bases and features to support the explainability of the JD/Resume matching. Finally, given that multiple software components, datasets, ontology, andmachine learning models will be explored, we aim at proposing a fair, ex-plainable, and traceable architecture for a Resume/JD matching purpose.

A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a data-driven alternative to traditional risk management strategies, the computational burden of training neural networks with first-order methods remains a significant impediment to practical implementation. The proposed architecture couples Long Short-Term Memory (LSTM) networks with K-FAC second-order optimization, specifically addressing the challenges of sequential financial data and curvature estimation in recurrent networks. Empirical validation using simulated paths from a calibrated Heston stochastic volatility model demonstrates that the K-FAC implementation achieves marked improvements in convergence dynamics and hedging efficacy. The methodology yields a 78.3% reduction in transaction costs (t = 56.88, p < 0.001) and a 34.4% decrease in profit and loss (P&L) variance compared to Adam optimization. Moreover, the K-FAC-enhanced model exhibits superior risk-adjusted performance with a Sharpe ratio of 0.0401, contrasting with -0.0025 for the baseline model. These results provide compelling evidence that second-order optimization methods can materially enhance the tractability of Deep Hedging implementations. The findings contribute to the growing literature on computational methods in quantitative finance while highlighting the potential for advanced optimization techniques to bridge the gap between theoretical frameworks and practical applications in financial markets.

Empirical Study of Market Impact Conditional on Order-Flow Imbalance

In this research, we have empirically investigated the key drivers affecting liquidity in equity markets. We illustrated how theoretical models, such as Kyle's model, of agents' interplay in the financial markets, are aligned with the phenomena observed in publicly available trades and quotes data. Specifically, we confirmed that for small signed order-flows, the price impact grows linearly with increase in the order-flow imbalance. We have, further, implemented a machine learning algorithm to forecast market impact given a signed order-flow. Our findings suggest that machine learning models can be used in estimation of financial variables; and predictive accuracy of such learning algorithms can surpass the performance of traditional statistical approaches. Understanding the determinants of price impact is crucial for several reasons. From a theoretical stance, modelling the impact provides a statistical measure of liquidity. Practitioners adopt impact models as a pre-trade tool to estimate expected transaction costs and optimize the execution of their strategies. This further serves as a post-trade valuation benchmark as suboptimal execution can significantly deteriorate a portfolio performance. More broadly, the price impact reflects the balance of liquidity across markets. This is of central importance to regulators as it provides an all-encompassing explanation of the correlation between market design and systemic risk, enabling regulators to design more stable and efficient markets.

Evaluation of OpenAI o1: Opportunities and Challenges of AGI

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.

aiSTROM -- A roadmap for developing a successful AI strategy

A total of 34% of AI research and development projects fails or are abandoned, according to a recent survey by Rackspace Technology of 1,870 companies. We propose a new strategic framework, aiSTROM, that empowers managers to create a successful AI strategy based on a thorough literature review. This provides a unique and integrated approach that guides managers and lead developers through the various challenges in the implementation process. In the aiSTROM framework, we start by identifying the top n potential projects (typically 3-5). For each of those, seven areas of focus are thoroughly analysed. These areas include creating a data strategy that takes into account unique cross-departmental machine learning data requirements, security, and legal requirements. aiSTROM then guides managers to think about how to put together an interdisciplinary artificial intelligence (AI) implementation team given the scarcity of AI talent. Once an AI team strategy has been established, it needs to be positioned within the organization, either cross-departmental or as a separate division. Other considerations include AI as a service (AIaas), or outsourcing development. Looking at new technologies, we have to consider challenges such as bias, legality of black-box-models, and keeping humans in the loop. Next, like any project, we need value-based key performance indicators (KPIs) to track and validate the progress. Depending on the company's risk-strategy, a SWOT analysis (strengths, weaknesses, opportunities, and threats) can help further classify the shortlisted projects. Finally, we should make sure that our strategy includes continuous education of employees to enable a culture of adoption. This unique and comprehensive framework offers a valuable, literature supported, tool for managers and lead developers.

Learning to Predict Short-Term Volatility with Order Flow Image Representation

Introduction: The paper addresses the challenging problem of predicting the short-term realized volatility of the Bitcoin price using order flow information. The inherent stochastic nature and anti-persistence of price pose difficulties in accurate prediction. Methods: To address this, we propose a method that transforms order flow data over a fixed time interval (snapshots) into images. The order flow includes trade sizes, trade directions, and limit order book, and is mapped into image colour channels. These images are then used to train both a simple 3-layer Convolutional Neural Network (CNN) and more advanced ResNet-18 and ConvMixer, with additionally supplementing them with hand-crafted features. The models are evaluated against classical GARCH, Multilayer Perceptron trained on raw data, and a naive guess method that considers current volatility as a prediction. Results: The experiments are conducted using price data from January 2021 and evaluate model performance in terms of root mean square error (RMSPE). The results show that our order flow representation with a CNN as a predictive model achieves the best performance, with an RMSPE of 0.85+/-1.1 for the model with aggregated features and 1.0+/-1.4 for the model without feature supplementation. ConvMixer with feature supplementation follows closely. In comparison, the RMSPE for the naive guess method was 1.4+/-3.0.

Challenges and Complexities in Machine Learning based Credit Card Fraud Detection

Credit cards play an exploding role in modern economies. Its popularity and ubiquity have created a fertile ground for fraud, assisted by the cross boarder reach and instantaneous confirmation. While transactions are growing, the fraud percentages are also on the rise as well as the true cost of a dollar fraud. Volume of transactions, uniqueness of frauds and ingenuity of the fraudster are main challenges in detecting frauds. The advent of machine learning, artificial intelligence and big data has opened up new tools in the fight against frauds. Given past transactions, a machine learning algorithm has the ability to 'learn' infinitely complex characteristics in order to identify frauds in real-time, surpassing the best human investigators. However, the developments in fraud detection algorithms has been challenging and slow due the massively unbalanced nature of fraud data, absence of benchmarks and standard evaluation metrics to identify better performing classifiers, lack of sharing and disclosure of research findings and the difficulties in getting access to confidential transaction data for research. This work investigates the properties of typical massively imbalanced fraud data sets, their availability, suitability for research use while exploring the widely varying nature of fraud distributions. Furthermore, we show how human annotation errors compound with machine classification errors. We also carry out experiments to determine the effect of PCA obfuscation (as a means of disseminating sensitive transaction data for research and machine learning) on algorithmic performance of classifiers and show that while PCA does not significantly degrade performance, care should be taken to use the appropriate principle component size (dimensions) to avoid overfitting.

TRADES: Generating Realistic Market Simulations with Diffusion Models

Financial markets are complex systems characterized by high statistical noise, nonlinearity, and constant evolution. Thus, modeling them is extremely hard. We address the task of generating realistic and responsive Limit Order Book (LOB) market simulations, which are fundamental for calibrating and testing trading strategies, performing market impact experiments, and generating synthetic market data. Previous works lack realism, usefulness, and responsiveness of the generated simulations. To bridge this gap, we propose a novel TRAnsformer-based Denoising Diffusion Probabilistic Engine for LOB Simulations (TRADES). TRADES generates realistic order flows conditioned on the state of the market, leveraging a transformer-based architecture that captures the temporal and spatial characteristics of high-frequency market data. There is a notable absence of quantitative metrics for evaluating generative market simulation models in the literature. To tackle this problem, we adapt the predictive score, a metric measured as an MAE, by training a stock price predictive model on synthetic data and testing it on real data. We compare TRADES with previous works on two stocks, reporting an x3.27 and x3.47 improvement over SoTA according to the predictive score, demonstrating that we generate useful synthetic market data for financial downstream tasks. We assess TRADES's market simulation realism and responsiveness, showing that it effectively learns the conditional data distribution and successfully reacts to an experimental agent, giving sprout to possible calibrations and evaluations of trading strategies and market impact experiments. We developed DeepMarket, the first open-source Python framework for market simulation with deep learning. Our repository includes a synthetic LOB dataset composed of TRADES's generates simulations. We release the code at github.com/LeonardoBerti00/DeepMarket.

Topological Components in a Community Currency Network

Transaction data from digital payment systems can be used to study economic processes at such a detail that was not possible previously. Here, we analyse the data from Sarafu token network, a community inclusion currency in Kenya. During the COVID-19 emergency, the Sarafu was disbursed as part of a humanitarian aid project. In this work, the transactions are analysed using network science. A topological categorisation is defined to identify cyclic and acyclic components. Furthermore, temporal aspects of circulation taking place within these components are considered. The significant presence of different types of strongly connected components as compared to randomized null models shows the importance of cycles in this economic network. Especially, indicating their key role in currency recirculation. In some acyclic components, the most significant triad suggests the presence of a group of users collecting currency from accounts active only once, hinting at a misuse of the system. In some other acyclic components, small isolated groups of users were active only once, suggesting the presence of users only interested in trying out the system. The methods used in this paper can answer specific questions related to user activities, currency design, and assessment of monetary interventions. Our methodology provides a general quantitative tool for analysing the behaviour of users in a currency network.

Feature Learning for Stock Price Prediction Shows a Significant Role of Analyst Rating

To reject the Efficient Market Hypothesis a set of 5 technical indicators and 23 fundamental indicators was identified to establish the possibility of generating excess returns on the stock market. Leveraging these data points and various classification machine learning models, trading data of the 505 equities on the US S&P500 over the past 20 years was analysed to develop a classifier effective for our cause. From any given day, we were able to predict the direction of change in price by 1% up to 10 days in the future. The predictions had an overall accuracy of 83.62% with a precision of 85% for buy signals and a recall of 100% for sell signals. Moreover, we grouped equities by their sector and repeated the experiment to see if grouping similar assets together positively effected the results but concluded that it showed no significant improvements in the performance rejecting the idea of sector-based analysis. Also, using feature ranking we could identify an even smaller set of 6 indicators while maintaining similar accuracies as that from the original 28 features and also uncovered the importance of buy, hold and sell analyst ratings as they came out to be the top contributors in the model. Finally, to evaluate the effectiveness of the classifier in real-life situations, it was backtested on FAANG equities using a modest trading strategy where it generated high returns of above 60% over the term of the testing dataset. In conclusion, our proposed methodology with the combination of purposefully picked features shows an improvement over the previous studies, and our model predicts the direction of 1% price changes on the 10th day with high confidence and with enough buffer to even build a robotic trading system.

On the Opportunities and Risks of Foundation Models

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles(e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities,and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

Two Case Studies of Experience Prototyping Machine Learning Systems in the Wild

Throughout the course of my Ph.D., I have been designing the user experience (UX) of various machine learning (ML) systems. In this workshop, I share two projects as case studies in which people engage with ML in much more complicated and nuanced ways than the technical HCML work might assume. The first case study describes how cardiology teams in three hospitals used a clinical decision-support system that helps them decide whether and when to implant an artificial heart to a heart failure patient. I demonstrate that physicians cannot draw on their decision-making experience by seeing only patient data on paper. They are also confused by some fundamental premises upon which ML operates. For example, physicians asked: Are ML predictions made based on clinicians' best efforts? Is it ethical to make decisions based on previous patients' collective outcomes? In the second case study, my collaborators and I designed an intelligent text editor, with the goal of improving authors' writing experience with NLP (Natural Language Processing) technologies. We prototyped a number of generative functionalities where the system provides phrase-or-sentence-level writing suggestions upon user request. When writing with the prototype, however, authors shared that they need to "see where the sentence is going two paragraphs later" in order to decide whether the suggestion aligns with their writing; Some even considered adopting machine suggestions as plagiarism, therefore "is simply wrong". By sharing these unexpected and intriguing responses from these real-world ML users, I hope to start a discussion about such previously-unknown complexities and nuances of -- as the workshop proposal states -- "putting ML at the service of people in a way that is accessible, useful, and trustworthy to all".

Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings

Foundation models could eventually introduce several pathways for undermining state security: accidents, inadvertent escalation, unintentional conflict, the proliferation of weapons, and the interference with human diplomacy are just a few on a long list. The Confidence-Building Measures for Artificial Intelligence workshop hosted by the Geopolitics Team at OpenAI and the Berkeley Risk and Security Lab at the University of California brought together a multistakeholder group to think through the tools and strategies to mitigate the potential risks introduced by foundation models to international security. Originating in the Cold War, confidence-building measures (CBMs) are actions that reduce hostility, prevent conflict escalation, and improve trust between parties. The flexibility of CBMs make them a key instrument for navigating the rapid changes in the foundation model landscape. Participants identified the following CBMs that directly apply to foundation models and which are further explained in this conference proceedings: 1. crisis hotlines 2. incident sharing 3. model, transparency, and system cards 4. content provenance and watermarks 5. collaborative red teaming and table-top exercises and 6. dataset and evaluation sharing. Because most foundation model developers are non-government entities, many CBMs will need to involve a wider stakeholder community. These measures can be implemented either by AI labs or by relevant government actors.

OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine

The emerging trend of advancing generalist artificial intelligence, such as GPTv4 and Gemini, has reshaped the landscape of research (academia and industry) in machine learning and many other research areas. However, domain-specific applications of such foundation models (e.g., in medicine) remain untouched or often at their very early stages. It will require an individual set of transfer learning and model adaptation techniques by further expanding and injecting these models with domain knowledge and data. The development of such technologies could be largely accelerated if the bundle of data, algorithms, and pre-trained foundation models were gathered together and open-sourced in an organized manner. In this work, we present OpenMEDLab, an open-source platform for multi-modality foundation models. It encapsulates not only solutions of pioneering attempts in prompting and fine-tuning large language and vision models for frontline clinical and bioinformatic applications but also building domain-specific foundation models with large-scale multi-modal medical data. Importantly, it opens access to a group of pre-trained foundation models for various medical image modalities, clinical text, protein engineering, etc. Inspiring and competitive results are also demonstrated for each collected approach and model in a variety of benchmarks for downstream tasks. We welcome researchers in the field of medical artificial intelligence to continuously contribute cutting-edge methods and models to OpenMEDLab, which can be accessed via https://github.com/openmedlab.

Explainable Deep Behavioral Sequence Clustering for Transaction Fraud Detection

In e-commerce industry, user behavior sequence data has been widely used in many business units such as search and merchandising to improve their products. However, it is rarely used in financial services not only due to its 3V characteristics - i.e. Volume, Velocity and Variety - but also due to its unstructured nature. In this paper, we propose a Financial Service scenario Deep learning based Behavior data representation method for Clustering (FinDeepBehaviorCluster) to detect fraudulent transactions. To utilize the behavior sequence data, we treat click stream data as event sequence, use time attention based Bi-LSTM to learn the sequence embedding in an unsupervised fashion, and combine them with intuitive features generated by risk experts to form a hybrid feature representation. We also propose a GPU powered HDBSCAN (pHDBSCAN) algorithm, which is an engineering optimization for the original HDBSCAN algorithm based on FAISS project, so that clustering can be carried out on hundreds of millions of transactions within a few minutes. The computation efficiency of the algorithm has increased 500 times compared with the original implementation, which makes flash fraud pattern detection feasible. Our experimental results show that the proposed FinDeepBehaviorCluster framework is able to catch missed fraudulent transactions with considerable business values. In addition, rule extraction method is applied to extract patterns from risky clusters using intuitive features, so that narrative descriptions can be attached to the risky clusters for case investigation, and unknown risk patterns can be mined for real-time fraud detection. In summary, FinDeepBehaviorCluster as a complementary risk management strategy to the existing real-time fraud detection engine, can further increase our fraud detection and proactive risk defense capabilities.