---
license: cc-by-nc-4.0
language:
- ru
- en
pipeline_tag: document-question-answering
tags:
- DocumentQA
- QuestionAnswering
- NLP
- DeepLearning
- Transformers
- Multimodal
- HuggingFace
- ruBert
- MachineLearning
- DeepQA
- AIForDocs
- Docs
- NeuralNetworks
- torch
- pytorch
library_name: transformers
metrics:
- accuracy
- f1
- recall
- exact_match
- precision
base_model:
- ai-forever/ruBert-base
---
![Official Kaleidoscope Logo](https://huggingface.co/LaciaStudio/Kaleidoscope_small_v1/resolve/main/Kaleidoscope.png)
# Document Question Answering Model - Kaleidoscope_small_v1
This model is a fine-tuned version of ai-forever/ruBert-base (formerly sberbank-ai/ruBert-base) designed for document question answering. It is adapted specifically for extracting answers from a provided document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.
# Key Features
* Objective: Extract answers from documents based on user questions.
* Base Model: ai-forever/ruBert-base.
* Dataset: A custom JSON file with fields: context, question, and answer.
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments (see the sketch below).
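A minimal sketch of what one dataset record might look like and how the question/context pair is assembled for the model; the record contents and the `max_length` value are illustrative assumptions, not the actual training data.

```python
from transformers import AutoTokenizer

# Hypothetical training record; the field names follow the description above,
# the contents are only an illustration.
example = {
    "context": "Альберт Эйнштейн разработал теорию относительности.",
    "question": "Кто разработал теорию относительности?",
    "answer": "Альберт Эйнштейн",
}

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")

# Passing the question and the context as a pair concatenates them into one
# input sequence of the form [CLS] question [SEP] context [SEP].
encoded = tokenizer(example["question"], example["context"], truncation=True, max_length=384)
print(tokenizer.decode(encoded["input_ids"]))
```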
# Training Settings
* Number of epochs: 20.
* Batch size: 4 per device.
* Warmup: 10% of total steps (warmup ratio 0.1).
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on a single RTX 3070.
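The settings above roughly correspond to the following `TrainingArguments`; this is a sketch, not the exact training script, and details such as `output_dir` and the evaluation/save strategy are assumptions.

```python
import torch
from transformers import TrainingArguments

# Sketch of training arguments matching the listed settings; output_dir and the
# evaluation/save strategy are illustrative assumptions.
training_args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                    # 10% of total steps used for warmup
    fp16=torch.cuda.is_available(),      # mixed precision only when CUDA is available
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping on validation loss
)
```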
# Description
The model was fine-tuned with the Transformers library using a custom training pipeline. Key aspects of the training process include:
* Custom Dataset: a loader reads a JSON file containing context, question, and answer triples.
* Feature Preparation: the document and question are tokenized with a sliding-window approach to handle long texts (see the sketch after this list).
* Training Process: mixed-precision training with the AdamW optimizer.
* Evaluation and Checkpointing: the training script evaluates performance on a validation set, saves checkpoints, and applies early stopping based on validation loss.

This model is well suited to interactive document question answering, making it useful for applications such as customer support, document search, and automated Q&A systems.
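A minimal sketch of the sliding-window tokenization mentioned above, using the tokenizer's standard `return_overflowing_tokens`/`stride` mechanism; the window size and stride values are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")

question = "Кто разработал теорию относительности?"
long_context = "..."  # imagine a document longer than the model's maximum input length

# Long contexts are split into overlapping windows so that an answer near a
# window boundary is not cut off; max_length and stride are illustrative values.
features = tokenizer(
    question,
    long_context,
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,
    return_offsets_mapping=True,     # maps tokens back to character positions
)
print(f"{len(features['input_ids'])} window(s) produced")
```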
**While the model is primarily focused on Russian texts, it also accepts English inputs; English support has not been formally evaluated.**
# Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)

# Load the document whose content will serve as the QA context.
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Encode the question/context pair and move the tensors to the chosen device.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Pick the most likely start and end token positions and decode that span.
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
```
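For quick experiments the model should also work with the standard `question-answering` pipeline from transformers, which handles tokenization and answer decoding internally; this usage is not part of the original card, so treat it as an untested convenience.

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="LaciaStudio/Kaleidoscope_small_v1",
    device=0,  # use -1 to run on CPU
)

result = qa(
    question="Кто разработал теорию относительности?",
    context="Альберт Эйнштейн разработал теорию относительности.",
)
print(result["answer"], round(result["score"], 3))
```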
# Example Answers
**RU**
*Context:*
```
Альберт Эйнштейн разработал теорию относительности.
```
*Question:*
```
Кто разработал теорию относительности?
```
*Answer:*
```
альберт эинштеин
```
**EN**
*Context:*
```
I had a red car.
```
*Question:*
```
What kind of car did I have?
```
*Answer:*
```
a red car
```
**Finetuned by LaciaStudio | LaciaAI**