|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- ru |
|
- en |
|
pipeline_tag: document-question-answering |
|
tags: |
|
- DocumentQA |
|
- QuestionAnswering |
|
- NLP |
|
- DeepLearning |
|
- Transformers |
|
- Multimodal |
|
- HuggingFace |
|
- ruBert |
|
- MachineLearning |
|
- DeepQA |
|
- AIForDocs |
|
- Docs |
|
- NeuralNetworks |
|
- torch |
|
- pytorch |
|
library_name: transformers |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- recall |
|
- exact_match |
|
- precision |
|
base_model: |
|
- ai-forever/ruBert-base |
|
--- |
|
|
|
 |
|
|
|
# Document Question Answering Model - Kaleidoscope_small_v1 |
|
This model is a fine-tuned version of sberbank-ai/ruBert-base designed for the task of document question answering. It has been adapted specifically for extracting answers from a provided document context and fine-tuned on a custom JSON dataset containing context, question, and answer triples. |
|
|
|
# Key Features |
|
* Objective: Extract answers from documents based on user questions. |
|
* Base Model: sberbank-ai/ruBert-base. |
|
* Dataset: A custom JSON file with fields: context, question, and answer. |
|
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments. |
|
# Training Settings: |
|
* Number of epochs: 20. |
|
* Batch size: 4 per device. |
|
* Warmup steps: 0.1 of total steps. |
|
* FP16 training enabled (if CUDA is available). |
|
* Hardware: Training was performed on an 1xRTX 3070. |
|
|
|
# Description |
|
The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include: |
|
|
|
Custom Dataset: A loader reads a JSON file containing context, question, and answer triples. |
|
|
|
* *Feature Preparation: The script tokenizes the document and question with a sliding window approach to handle long texts.* |
|
* *Training Process: Leveraging mixed precision training and the AdamW optimizer to improve optimization.* |
|
* *Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss.* |
|
* *This model is ideal for interactive document question answering tasks, making it a powerful tool for applications such as customer support, document search, and automated Q&A systems.* |
|
|
|
While primarily focused on Russian texts, the model also supports English language inputs. |
|
**The model also supports English language, but its support was not tested** |
|
|
|
# Example Usage |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForQuestionAnswering |
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1") |
|
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1") |
|
model.to(device) |
|
|
|
file_path = input("Enter document path: ") |
|
with open(file_path, "r", encoding="utf-8") as f: |
|
context = f.read() |
|
|
|
while True: |
|
question = input("Enter question (or 'exit' to quit): ") |
|
if question.lower() == "exit": |
|
break |
|
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384) |
|
inputs = {k: v.to(device) for k, v in inputs.items()} |
|
outputs = model(**inputs) |
|
start_logits = outputs.start_logits |
|
end_logits = outputs.end_logits |
|
start_index = torch.argmax(start_logits) |
|
end_index = torch.argmax(end_logits) |
|
answer_tokens = inputs["input_ids"][0][start_index:end_index + 1] |
|
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True) |
|
print("Answer:", answer) |
|
``` |
|
|
|
# Example of answering |
|
**RU** |
|
*Context:* |
|
|
|
``` |
|
Альберт Эйнштейн разработал теорию относительности. |
|
``` |
|
|
|
*Question:* |
|
|
|
``` |
|
Кто разработал теорию относительности? |
|
``` |
|
|
|
*Answer:* |
|
|
|
``` |
|
альберт эинштеин |
|
``` |
|
**EN** |
|
*Context:* |
|
|
|
``` |
|
I had a red car. |
|
``` |
|
|
|
*Question:* |
|
|
|
``` |
|
What kind of car did I have? |
|
``` |
|
|
|
*Answer:* |
|
|
|
``` |
|
a red car |
|
``` |
|
|
|
**Finetuned by LaciaStudio | LaciaAI** |