juliensimon/xlm-v-base-language-id


Julien Simon committed on Feb 9, 2023

Commit 27c7dcd · 1 parent: 0bfaffa

- Add training script

- Add details to model card

Files changed (2)

README.md +4 -6
train-xlm.py +1 -6

README.md CHANGED

@@ -34,20 +34,18 @@ It achieves the following results on the evaluation set:
 - Loss: 0.0241
 - Accuracy: 0.9930
-## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
+The model can accurately detect 102 languages.
 ## Training and evaluation data
-More information needed
+The model has been trained and evaluated on the complete google/fleurs training and validation sets.
 ## Training procedure
+The training script is included in the repository. The model has been trained on an p3dn.24xlarge instance on AWS (8 NVIDIA V100 GPUs).
 ### Training hyperparameters
 The following hyperparameters were used during training:
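
The README text added here says what the model detects but not how to call it. The following is a minimal sketch, not part of the commit. It assumes the model is available on the Hugging Face Hub under the repository name of this page and that the `transformers` library is installed. The helper names `best_label` and `detect_language` are illustrative.

```python
def best_label(predictions):
    """Return the label with the highest score.

    `predictions` is a list of {"label": str, "score": float} dicts, the
    shape a text-classification pipeline returns for a single input.
    """
    return max(predictions, key=lambda p: p["score"])["label"]


def detect_language(text, model_id="juliensimon/xlm-v-base-language-id"):
    # Imported lazily so best_label() stays usable without transformers.
    from transformers import pipeline

    # Assumption: the model id above matches the repository on the Hub.
    classifier = pipeline("text-classification", model=model_id)
    # top_k=None asks the pipeline for a score for every label.
    return best_label(classifier(text, top_k=None))
```

Calling `detect_language("Bonjour tout le monde")` downloads the model on first use. The label strings it returns depend on the `id2label` mapping stored in the model configuration.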

train-xlm.py CHANGED

@@ -24,9 +24,7 @@ columns_to_remove = [
     "lang_group_id",
 ]
-train, val = load_dataset(
-    dataset_id, "all", split=["train", "validation"], ignore_verifications=True
-)
+train, val = load_dataset(dataset_id, "all", split=["train", "validation"], ignore_verifications=True)
 # Build the label2id and id2label dictionaries
@@ -54,11 +52,9 @@ val = val.shuffle(seed=42)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 def preprocess(data):
     return tokenizer(data["text"], truncation=True)
 processed_train = train.map(preprocess, batched=True)
 processed_val = val.map(preprocess, batched=True)
@@ -111,4 +107,3 @@ trainer = Trainer(
 trainer.train()
-trainer.save_model("./my_model")
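
The first hunk keeps the comment about building `label2id` and `id2label`, but the diff does not show the code behind it. The following sketch shows one common way to build such mappings; it is not the script's actual code. The function name `build_label_maps` and the three label names are hypothetical, and the real script presumably derives its labels from the dataset.

```python
def build_label_maps(label_names):
    """Map each distinct label name to an integer id, and back."""
    # Sorting makes the id assignment deterministic across runs.
    unique = sorted(set(label_names))
    label2id = {name: idx for idx, name in enumerate(unique)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


# Hypothetical label names, not taken from google/fleurs.
label2id, id2label = build_label_maps(["French", "English", "Swahili", "French"])
```

Mappings of this shape are typically passed to `AutoModelForSequenceClassification.from_pretrained` through its `label2id` and `id2label` arguments, so that predictions come back as language names rather than integer ids.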