update model
Browse files
- README.md +65 -11
- adapter_model.safetensors +1 -1
- ggml-adapter-model.bin +1 -1
README.md
CHANGED
@@ -71,10 +71,10 @@ wandb_log_model:
 
 gradient_accumulation_steps: 8
 micro_batch_size: 1
-num_epochs:
+num_epochs: 4
 optimizer: adamw_bnb_8bit
 lr_scheduler: cosine
-learning_rate: 0.
+learning_rate: 0.00002
 
 train_on_inputs: false
 group_by_length: false
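For context: with these values, the effective batch size works out to gradient_accumulation_steps × micro_batch_size = 8 × 1 = 8 samples per optimizer step (per device), with the learning rate peaking at 2e-5 and decaying on a cosine schedule over the 4 epochs.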
@@ -118,25 +118,79 @@ This is a LoRA for the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 
 ## Model description
 
-Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+Given text extracted from pages of a sustainability report, this model extracts the scope 1, 2 and 3 emissions in JSON format. The JSON object also contains the pages containing this information. For example, the [2022 sustainability report by the Bristol-Myers Squibb Company](https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf) leads to the following output: `{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}`.
+
+Reaches an emission value extraction accuracy of 65% (up from 46% for the base model) and a source citation accuracy of 69% (base model: 52%) on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset. For more information, refer to the [GitHub repo](https://github.com/nopperl/corporate_emission_reports).
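Since the output is plain JSON, downstream code can consume it directly. A minimal sketch of parsing the sample output above; the `EmissionsExample` model and its field types are illustrative assumptions, and the package's own `Emissions` type (shown further below) is the canonical schema:

```
# Parse the sample output from the model description above.
from pydantic import BaseModel

class EmissionsExample(BaseModel):  # hypothetical stand-in for the package's Emissions type
    scope_1: float
    scope_2: float
    scope_3: float
    sources: list[int]  # report pages the values were found on

sample = '{"scope_1":202290,"scope_2":161907,"scope_3":1696100,"sources":[88,89]}'
emissions = EmissionsExample.model_validate_json(sample)
print(emissions.scope_1)  # 202290.0
```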
 
 ## Intended uses & limitations
 
-The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [
+The model is intended to be used together with the [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model using the `inference.py` script from the [accompanying python package](https://github.com/nopperl/corporate_emission_reports). The script ensures that the prompt string and token ids exactly match the ones used for training.
+
+### Example usage
+
+#### CLI
+
+Using [transformers](https://github.com/huggingface/transformers) as inference engine:
 
-python inference.py --
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --lora nopperl/emissions-extraction-lora --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
 Compare to base model without LoRA:
 
-python inference.py --
+python -m corporate_emission_reports.inference --model_path mistralai/Mistral-7B-Instruct-v0.2 --model_context_size 32768 --engine hf https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
 
+Alternatively, it is possible to use [llama.cpp](https://github.com/ggerganov/llama.cpp) as inference engine. In this case, follow the installation instructions in the [package readme](https://github.com/nopperl/corporate_emission_reports/blob/main/README.md). In particular, the model needs to be downloaded beforehand. Then:
+
+python -m corporate_emission_reports.inference --model mistral --lora ./emissions-extraction-lora/ggml-adapter-model.bin https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
+
+Compare to base model without LoRA:
+
+python -m corporate_emission_reports.inference --model mistral https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf
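The two CLI invocations above lend themselves to a quick side-by-side comparison. A sketch, assuming the package is installed and that the inference module writes the extracted JSON to stdout (this README does not state where the output lands):

```
# Run the with-LoRA and base-model commands above and compare their outputs.
import subprocess

REPORT = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
CMD = ["python", "-m", "corporate_emission_reports.inference",
       "--model_path", "mistralai/Mistral-7B-Instruct-v0.2",
       "--model_context_size", "32768", "--engine", "hf"]

with_lora = subprocess.run(CMD + ["--lora", "nopperl/emissions-extraction-lora", REPORT],
                           capture_output=True, text=True, check=True)
base_only = subprocess.run(CMD + [REPORT], capture_output=True, text=True, check=True)
print("with LoRA:", with_lora.stdout.strip())
print("base model:", base_only.stdout.strip())
```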
+#### Programmatically
+
+The package also provides a function for inference from python code:
 
+from corporate_emission_reports.inference import extract_emissions
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+model_kwargs = {}  # Optional arguments which are passed to the HF model
+emissions = extract_emissions(document_path, "mistralai/Mistral-7B-Instruct-v0.2", lora="nopperl/emissions-extraction-lora", engine="hf", **model_kwargs)
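The function call above extends naturally to batch processing. A sketch relying only on the `extract_emissions` signature shown in the snippet (the return value is printed as-is, since its exact type is documented only in the package itself):

```
# Extract emissions from several reports with the LoRA-augmented model.
from corporate_emission_reports.inference import extract_emissions

reports = [
    "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf",
    # add further report URLs or local PDF paths here
]
for url in reports:
    emissions = extract_emissions(url, "mistralai/Mistral-7B-Instruct-v0.2",
                                  lora="nopperl/emissions-extraction-lora", engine="hf")
    print(url, emissions)
```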
+
+It's also possible to use it directly with [transformers](https://github.com/huggingface/transformers):
+
+```
+from corporate_emission_reports.inference import construct_prompt
+from peft import AutoPeftModelForCausalLM
+from transformers import AutoTokenizer
+document_path = "https://www.bms.com/assets/bms/us/en-us/pdf/bmy-2022-esg-report.pdf"
+lora_path = "nopperl/emissions-extraction-lora"
+tokenizer = AutoTokenizer.from_pretrained(lora_path)
+prompt_text = construct_prompt(document_path, tokenizer)
+model = AutoPeftModelForCausalLM.from_pretrained(lora_path)
+prompt_tokenized = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120)
+output = outputs[0][prompt_tokenized.shape[1]:]
+```
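The block above stops at the raw generated token ids. A minimal continuation that decodes and parses them with plain `json` (the lm-format-enforcer variant below instead guarantees well-formed output):

```
# Decode the newly generated tokens and parse the JSON answer.
import json

text = tokenizer.decode(output, skip_special_tokens=True)
result = json.loads(text)  # raises json.JSONDecodeError on malformed output
print(result.get("scope_1"), result.get("sources"))
```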
+
+Additionally, it is possible to enforce valid JSON output and convert it into a Pydantic object using [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer):
+
+```
+from corporate_emission_reports.pydantic_types import Emissions
+from lmformatenforcer import JsonSchemaParser
+from lmformatenforcer.integrations.transformers import build_transformers_prefix_allowed_tokens_fn
+...
+parser = JsonSchemaParser(Emissions.model_json_schema())
+prefix_function = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)
+outputs = model.generate(prompt_tokenized, max_new_tokens=120, prefix_allowed_tokens_fn=prefix_function)
+output = outputs[0][prompt_tokenized.shape[1]:]
+if tokenizer.eos_token:
+    output = output[:-1]
+output = tokenizer.decode(output)
+emissions = Emissions.model_validate_json(output, strict=True)
+```
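The validated `Emissions` object should then expose the values as typed attributes; the attribute names below are inferred from the JSON keys in the example output above:

```
print(emissions.scope_1, emissions.scope_2, emissions.scope_3)  # extracted scope values
print(emissions.sources)  # pages containing the information, e.g. [88, 89]
```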
+
+## Training and evaluation data
 
+Finetuned on the [sustainability-report-emissions-instruction-style](https://huggingface.co/datasets/nopperl/sustainability-report-emissions-instruction-style) dataset and evaluated on the [corporate-emission-reports](https://huggingface.co/datasets/nopperl/corporate-emission-reports) dataset.
 
 ## Training procedure
 
@@ -169,4 +223,4 @@ The following hyperparameters were used during training:
 - Transformers 4.37.1
 - Pytorch 2.0.1
 - Datasets 2.16.1
-- Tokenizers 0.15.0
+- Tokenizers 0.15.0
adapter_model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:c9cf7ae7c20b80e1a17041b5e0f8b12788db0bc46943fa01b4ebeb96f8059615
 size 167832688
ggml-adapter-model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:d915ae9cd2bd2f1909eea73bdba5a7b14ac5b423e5f32d5ab45f82c4ffbfccf8
 size 335572992