update model
Browse files
- README.md +65 -11
- adapter_model.safetensors +1 -1
- ggml-adapter-model.bin +1 -1
README.md
CHANGED
@@ -71,10 +71,10 @@ wandb_log_model:
 
 gradient_accumulation_steps: 8
 micro_batch_size: 1
-num_epochs:
+num_epochs: 4
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
-learning_rate: 0.
+learning_rate: 0.00002
 
 train_on_inputs: false
 group_by_length: false
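For context: with these values, the effective batch size works out to gradient_accumulation_steps × micro_batch_size = 8 × 1 = 8 samples per optimizer step (per device), with the learning rate peaking at 2e-5 and decaying on a cosine schedule over the 4 epochs.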
@@ -118,25 +118,79 @@ This is a LoRA for the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 
 ## Model description
 
-Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+
+Reaches an emission value extraction accuracy of 65% (up from 46% for the base model) and a source citation accuracy of 69% (base model: 52%) on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset. For more information, refer to the [GitHub repo](https://github.com/nopperl/corporate_emission_reports).
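Since the output is plain JSON, downstream code can consume it directly. A minimal sketch of parsing the sample output above; the `EmissionsExample` model and its field types are illustrative assumptions, and the package's own `Emissions` type (shown further below) is the canonical schema:

```
# Parse the sample output from the model description above.
from pydantic import BaseModel

class EmissionsExample(BaseModel):  # hypothetical stand-in for the package's Emissions type
    scope_1: float
    scope_2: float
    scope_3: float
    sources: list[int]  # report pages the values were found on

sample = '{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}'
emissions = EmissionsExample.model_validate_json(sample)
print(emissions.scope_1)  # 202290.0
```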
 
 ## Intended uses & limitations
 
-The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [
+The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [accompanying python package](https://github.com/nopperl/corporate_emission_reports). The script ensures that the prompt string and token ids exactly match the ones used for training.
+
+### Example usage
+
+#### CLI
+
+Using [transformers](https://github.com/huggingface/transformers) as inference engine:
 
-python inference.py --
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --lora nopperl/emissions-extraction-lora --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
 Compare to base model without LoRA:
 
-python inference.py --
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
+Alternatively, it is possible to use [llama.cpp](https://github.com/ggerganov/llama.cpp) as inference engine. In this case, follow the installation instructions in the [package readme](https://github.com/nopperl/corporate_emission_reports/blob/main/README.md). In particular, the model needs to be downloaded beforehand. Then:
+
+python -m corporate_emission_reports.inference --model mistral --lora ./emissions-extraction-lora/ggml-adapter-model.bin https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+
+Compare to base model without LoRA:
+
+python -m corporate_emission_reports.inference --model mistral https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
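The two CLI invocations above lend themselves to a quick side-by-side comparison. A sketch, assuming the package is installed and that the inference module writes the extracted JSON to stdout (this README does not state where the output lands):

```
# Run the with-LoRA and base-model commands above and compare their outputs.
import subprocess

REPORT = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
CMD = ["python", "-m", "corporate_emission_reports.inference",
       "--model_path", "mistralai/Mistral-7B-Instruct-v0.2",
       "--model_context_size", "32768", "--engine", "hf"]

with_lora = subprocess.run(CMD + ["--lora", "nopperl/emissions-extraction-lora", REPORT],
                           capture_output=True, text=True, check=True)
base_only = subprocess.run(CMD + [REPORT], capture_output=True, text=True, check=True)
print("with LoRA:", with_lora.stdout.strip())
print("base model:", base_only.stdout.strip())
```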
+#### Programmatically
+
+The package also provides a function for inference from python code:
 
+from corporate_emission_reports.inference import extract_emissions
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+model_kwargs = {}  # Optional arguments which are passed to the HF model
+emissions = extract_emissions(document_path, "mistralai/Mistral-7B-Instruct-v0.2", lora="nopperl/emissions-extraction-lora", engine="hf", **model_kwargs)
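The function call above extends naturally to batch processing. A sketch relying only on the `extract_emissions` signature shown in the snippet (the return value is printed as-is, since its exact type is documented only in the package itself):

```
# Extract emissions from several reports with the LoRA-augmented model.
from corporate_emission_reports.inference import extract_emissions

reports = [
    "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf",
    # add further report URLs or local PDF paths here
]
for url in reports:
    emissions = extract_emissions(url, "mistralai/Mistral-7B-Instruct-v0.2",
                                  lora="nopperl/emissions-extraction-lora", engine="hf")
    print(url, emissions)
```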
+
+It's also possible to use it directly with [transformers](https://github.com/huggingface/transformers):
+
+```
+from corporate_emission_reports.inference import construct_prompt
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+lora_path = "nopperl/emissions-extraction-lora"
+tokenizer = AutoTokenizer.from_pretrained(lora_path)
+prompt_text = construct_prompt(document_path, tokenizer)
+model = AutoPeftModelForCausalLM.from_pretrained(lora_path)
+prompt_tokenized = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120)
+output = outputs[0][prompt_tokenized.shape[1]:]
+```
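The block above stops at the raw generated token ids. A minimal continuation that decodes and parses them with plain `json` (the lm-format-enforcer variant below instead guarantees well-formed output):

```
# Decode the newly generated tokens and parse the JSON answer.
import json

text = tokenizer.decode(output, skip_special_tokens=True)
result = json.loads(text)  # raises json.JSONDecodeError on malformed output
print(result.get("scope_1"), result.get("sources"))
```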
+
+Additionally, it is possible to enforce valid JSON output and convert it into a Pydantic object using [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer):
+
+```
+from corporate_emission_reports.pydantic_types import Emissions
+from lmformatenforcer import JsonSchemaParser
+from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
+...
+parser = JsonSchemaParser(Emissions.model_json_schema())
+prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120, prefix_allowed_tokens_fn=prefix_function)
+output = outputs[0][prompt_tokenized.shape[1]:]
+if tokenizer.eos_token:
+    output = output[:-1]
+output = tokenizer.decode(output)
+emissions = Emissions.model_validate_json(output, strict=True)
+```
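The validated `Emissions` object should then expose the values as typed attributes; the attribute names below are inferred from the JSON keys in the example output above:

```
print(emissions.scope_1, emissions.scope_2, emissions.scope_3)  # extracted scope values
print(emissions.sources)  # pages containing the information, e.g. [88, 89]
```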
+
+## Training and evaluation data
 
+Finetuned on the [sustainability-report-emissions-instruction-style](https://huggingface.co/datasets/nopperl/sustainability-report-emissions-instruction-style) dataset and evaluated on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset.
 
 ## Training procedure
 
@@ -169,4 +223,4 @@ The following hyperparameters were used during training:
 - Transformers 4.37.1
 - Pytorch 2.0.1
 - Datasets 2.16.1
-- Tokenizers 0.15.0
+- Tokenizers 0.15.0
adapter_model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c9cf7ae7c20b80e1a17041b5e0f8b12788db0bc46943fa01b4ebeb96f8059615
 size 167832688
ggml-adapter-model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:d915ae9cd2bd2f1909eea73bdba5a7b14ac5b423e5f32d5ab45f82c4ffbfccf8
 size 335572992