---
license: cc-by-nc-4.0
language:
- ru
- en
pipeline_tag: document-question-answering
tags:
- DocumentQA
- QuestionAnswering
- NLP
- DeepLearning
- Transformers
- Multimodal
- HuggingFace
- ruBert
- MachineLearning
- DeepQA
- AIForDocs
- Docs
- NeuralNetworks
- torch
- pytorch
library_name: transformers
metrics:
- accuracy
- f1
- recall
- exact_match
- precision
base_model:
- ai-forever/ruBert-base
---
![Official Kaleidoscope Logo](https://huggingface.co/LaciaStudio/Kaleidoscope_small_v1/resolve/main/Kaleidoscope.png)
# Document Question Answering Model - Kaleidoscope_small_v1
This model is a fine-tuned version of ai-forever/ruBert-base (formerly sberbank-ai/ruBert-base) designed for document question answering. It is adapted specifically for extracting answers from a provided document context and was fine-tuned on a custom JSON dataset of context, question, and answer triples.
# Key Features
* Objective: Extract answers from documents based on user questions.
* Base Model: ai-forever/ruBert-base.
* Dataset: A custom JSON file with fields: context, question, and answer.
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments (see the sketch below).
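A minimal sketch of what one dataset record might look like and how the question/context pair is assembled for the model; the record contents and the `max_length` value are illustrative assumptions, not the actual training data.

```python
from transformers import AutoTokenizer

# Hypothetical training record; the field names follow the description above,
# the contents are only an illustration.
example = {
    "context": "Альберт Эйнштейн разработал теорию относительности.",
    "question": "Кто разработал теорию относительности?",
    "answer": "Альберт Эйнштейн",
}

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")

# Passing the question and the context as a pair concatenates them into one
# input sequence of the form [CLS] question [SEP] context [SEP].
encoded = tokenizer(example["question"], example["context"], truncation=True, max_length=384)
print(tokenizer.decode(encoded["input_ids"]))
```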
# Training Settings
* Number of epochs: 20.
* Batch size: 4 per device.
* Warmup: 10% of total steps (warmup ratio 0.1).
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on a single RTX 3070.
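The settings above roughly correspond to the following `TrainingArguments`; this is a sketch, not the exact training script, and details such as `output_dir` and the evaluation/save strategy are assumptions.

```python
import torch
from transformers import TrainingArguments

# Sketch of training arguments matching the listed settings; output_dir and the
# evaluation/save strategy are illustrative assumptions.
training_args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                    # 10% of total steps used for warmup
    fp16=torch.cuda.is_available(),      # mixed precision only when CUDA is available
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping on validation loss
)
```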
# Description
The model was fine-tuned with the Transformers library using a custom training pipeline. Key aspects of the training process include:
* Custom Dataset: a loader reads a JSON file containing context, question, and answer triples.
* Feature Preparation: the document and question are tokenized with a sliding-window approach to handle long texts (see the sketch after this list).
* Training Process: mixed-precision training with the AdamW optimizer.
* Evaluation and Checkpointing: the training script evaluates performance on a validation set, saves checkpoints, and applies early stopping based on validation loss.

This model is well suited to interactive document question answering, making it useful for applications such as customer support, document search, and automated Q&A systems.
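A minimal sketch of the sliding-window tokenization mentioned above, using the tokenizer's standard `return_overflowing_tokens`/`stride` mechanism; the window size and stride values are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")

question = "Кто разработал теорию относительности?"
long_context = "..."  # imagine a document longer than the model's maximum input length

# Long contexts are split into overlapping windows so that an answer near a
# window boundary is not cut off; max_length and stride are illustrative values.
features = tokenizer(
    question,
    long_context,
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,
    return_offsets_mapping=True,     # maps tokens back to character positions
)
print(f"{len(features['input_ids'])} window(s) produced")
```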
**While the model is primarily focused on Russian texts, it also accepts English inputs; English support has not been formally evaluated.**
# Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)

# Load the document whose content will serve as the QA context.
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Encode the question/context pair and move the tensors to the chosen device.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # Pick the most likely start and end token positions and decode that span.
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
```
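For quick experiments the model should also work with the standard `question-answering` pipeline from transformers, which handles tokenization and answer decoding internally; this usage is not part of the original card, so treat it as an untested convenience.

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="LaciaStudio/Kaleidoscope_small_v1",
    device=0,  # use -1 to run on CPU
)

result = qa(
    question="Кто разработал теорию относительности?",
    context="Альберт Эйнштейн разработал теорию относительности.",
)
print(result["answer"], round(result["score"], 3))
```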
# Example Answers
**RU**
*Context:*
```
Альберт Эйнштейн разработал теорию относительности.
```
*Question:*
```
Кто разработал теорию относительности?
```
*Answer:*
```
альберт эинштеин
```
**EN**
*Context:*
```
I had a red car.
```
*Question:*
```
What kind of car did I have?
```
*Answer:*
```
a red car
```
**Finetuned by LaciaStudio | LaciaAI**