Gabs1346 committed · Commit 31012a6 · verified · 1 Parent(s): 826dfcc

Update README.md

Files changed (1): README.md (+5 -2)
README.md CHANGED
@@ -13,5 +13,8 @@ Clinical-BR-Mistral-7B-v0.2 is a fine-tuned language model specifically designed
  ## Fine-Tuning Approach
  To enhance memory efficiency and reduce computational demands, we implemented LoRA with 16-bit precision on the q_proj and v_proj projections. We configured LoRA with R set to 8, Alpha to 16, and Dropout to 0.1, allowing the model to adapt effectively while preserving output quality. For optimization, the AdamW optimizer was used with parameters β1 = 0.9 and β2 = 0.999, achieving a balance between fast convergence and training stability. This careful tuning process ensures robust performance in generating accurate and contextually appropriate clinical text in Portuguese.

- ## Data:
- The data used to fine-tune Clinical-BR-Mistral-7B-v0.2 was sourced from three distinct clinical datasets, totaling 2.4GB of text and 309,151,121 tokens. The first dataset comes from the SemClinBr project, which includes 2,100,546 clinical narrative entries from multiple Brazilian hospitals, featuring various document types such as discharge summaries, ambulatory notes, and nursing notes, as well as a wide range of medical specialties including cardiology, nephrology, and endocrinology. This electronic health record (EHR) data was de-identified and approved by the PUCPR Research Ethics Committee, with the certificate of presentation for ethical appreciation number 51376015.4.0000.0020. Additionally, the BRATECA dataset was incorporated, consisting of 73,040 admission notes from 10 Brazilian hospitals, associated with various medical departments such as obstetrics, surgery, emergency, COVID-19, intensive care, and ambulatory care. All data in this dataset was anonymized and ethically approved by the National Research Ethics Committee under the number 46652521.9.0000.5530, and it is available under PhysioNet Credentialed Health Data Use. Finally, the project also utilized data from the Lopes et al., 2019 study, which comprises 3,678 clinical texts mainly focused on neurology cases collected from medical journals written in European Portuguese. This dataset provides additional linguistic diversity and is publicly accessible for research purposes. Combining these datasets allowed Clinical-BR-Mistral-7B-v0.2 to be finely tuned for generating accurate and contextually appropriate clinical notes in Portuguese.
+ ## Data
+ The fine-tuning of Clinical-BR-Mistral-7B-v0.2 utilized 2.4GB of text from three clinical datasets. The SemClinBr project provided diverse clinical narratives from Brazilian hospitals, while the BRATECA dataset contributed admission notes from various departments in 10 hospitals. Additionally, data from Lopes et al., 2019, added neurology-focused texts from European Portuguese medical journals. These datasets collectively improved the model’s ability to generate accurate clinical notes in Portuguese.
+
+
+ ## Citation
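
The LoRA setup described in the Fine-Tuning Approach section maps directly onto the Hugging Face `peft` API. Below is a minimal sketch of that configuration, not the authors' actual training script: only the hyperparameters (r = 8, alpha = 16, dropout = 0.1, q_proj/v_proj targets, AdamW with β1 = 0.9 and β2 = 0.999, 16-bit precision) come from the README, while the base checkpoint name and learning rate are assumptions for illustration.

```python
# Minimal sketch of the described LoRA configuration, assuming the `peft` and
# `transformers` libraries. Hyperparameters come from the README; the base
# checkpoint and learning rate are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # assumed base checkpoint
    torch_dtype=torch.bfloat16,            # 16-bit precision, as stated
)

lora_config = LoraConfig(
    r=8,                                   # R = 8
    lora_alpha=16,                         # Alpha = 16
    lora_dropout=0.1,                      # Dropout = 0.1
    target_modules=["q_proj", "v_proj"],   # adapt only these projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the LoRA adapters train

# AdamW with the stated betas; the learning rate is illustrative only.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
```

Restricting the adapters to q_proj and v_proj keeps the trainable parameter count small, which is what makes the memory savings claimed in the README plausible on a 7B-parameter model.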