onecxi committed · verified · Commit 338b39f · 1 Parent(s): 46976e4

Initial commit
README.md ADDED
@@ -0,0 +1,144 @@
+ ---
+ language:
+ - en
+ - hi
+ - or
+ - bn
+ - ta
+ - te
+ - kn
+ - ml
+ - mr
+ - gu
+ - pa
+ - as
+ license: apache-2.0
+ pipeline_tag: audio-classification
+ library_name: transformers
+ tags:
+ - language-identification
+ - indian-languages
+ - multilingual
+ - speech
+ - asr-preprocessing
+ - callcenter-ai
+ - speech-analytics
+ - audio-classification
+ - wav2vec2
+ - transformers
+ - pytorch
+ - huggingface
+ ---
+
+ # **Vakgyata**
+
+ **Language Identification for Indian Languages from Speech**
+
+ ---
+
+ ## **Model Overview**
+
+ `vakgyata` is an open-source language identification model that classifies Indian languages from raw speech audio. It is built on the pretrained [`Harveenchadha/wav2vec2-pretrained-clsril-23-10k`](https://huggingface.co/Harveenchadha/wav2vec2-pretrained-clsril-23-10k) checkpoint, with additional **Layer Normalization** integrated to improve training stability and performance on audio classification tasks.
+
+ ---
+
+ ## **Variants and Model Sizes**
+
+ | Variant          | Parameters | Accuracy |
+ | ---------------- | ---------- | -------- |
+ | `vakgyata-base`  | 95M        | 95.88%   |
+ | `vakgyata-small` | 52M        | 95.06%   |
+ | `vakgyata-mini`  | 38M        | 95.06%   |
+ | `vakgyata-tiny`  | 24M        | 93.63%   |
+
+ ---
+
+ ## **Supported Languages**
+
+ | Language        | Code  |
+ | --------------- | ----- |
+ | English (India) | en-IN |
+ | Hindi           | hi-IN |
+ | Odia            | or-IN |
+ | Bengali         | bn-IN |
+ | Tamil           | ta-IN |
+ | Telugu          | te-IN |
+ | Kannada         | kn-IN |
+ | Malayalam       | ml-IN |
+ | Marathi         | mr-IN |
+ | Gujarati        | gu-IN |
+ | Punjabi         | pa-IN |
+ | Assamese        | as-IN |
+
+ ---
+
+ ## **Specifications**
+
+ * **Supported Sampling Rate:** 16 kHz (16,000 Hz)
+ * **Recommended Audio Format:** 16 kHz, 16-bit PCM, mono
+
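+ Audio in any other format should be converted before inference. A minimal conversion sketch using `torchaudio` (the file paths here are placeholders):
+
+ ```python
+ import torchaudio
+ import torchaudio.functional as F
+
+ TARGET_SR = 16000  # the model's supported sampling rate
+
+ # Load any audio file; waveform shape is (channels, samples).
+ waveform, sr = torchaudio.load("path/to/audio.wav")
+
+ # Downmix to mono by averaging channels.
+ if waveform.shape[0] > 1:
+     waveform = waveform.mean(dim=0, keepdim=True)
+
+ # Resample to 16 kHz if needed.
+ if sr != TARGET_SR:
+     waveform = F.resample(waveform, orig_freq=sr, new_freq=TARGET_SR)
+
+ # Write out as 16-bit PCM mono WAV.
+ torchaudio.save("audio_16k_mono.wav", waveform, TARGET_SR, encoding="PCM_S", bits_per_sample=16)
+ ```
+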
+ ---
+
+ ## **Installation**
+
+ ```bash
+ pip install transformers torchaudio
+ ```
+
+ ---
+
+ ## **Usage**
+
+ ```python
+ from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
+ import torch
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ model_id = "onecxi/vakgyata-mini"
+
+ # The feature extractor and the classifier ship in the same repository.
+ processor = AutoFeatureExtractor.from_pretrained(model_id)
+ model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id).to(device)
+ model.eval()
+ ```
+
+ ---
+
+ ## **Inference Example**
+
+ ```python
+ import torchaudio
+
+ # Load the audio (must be 16 kHz mono; convert it first if it is not)
+ audio, sr = torchaudio.load("path/to/audio.wav")
+
+ # Preprocess
+ inputs = processor(audio.squeeze(), sampling_rate=sr, return_tensors="pt").to(device)
+
+ # Inference
+ with torch.no_grad():
+     logits = model(**inputs).logits
+
+ # Softmax to get probabilities
+ probs = logits.softmax(dim=-1).cpu().numpy()
+
+ # Predicted language: map the argmax index through id2label
+ # (indexing with [] instead of .get() so a bad index fails loudly)
+ predicted_id = int(probs.argmax(axis=-1)[0])
+ language = model.config.id2label[predicted_id]
+ print("Predicted Language:", language)
+ ```
+
+ ---
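+
+ ## **ONNX Inference (Quantized)**
+
+ This commit also adds a quantized ONNX export at `onnx/model_quantized.onnx`. The sketch below runs it with `onnxruntime` (installed separately via `pip install onnxruntime`); it assumes the export takes the feature extractor's `input_values` as its only input and returns logits, so the input name is read from the session rather than hard-coded:
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ import torchaudio
+ from transformers import AutoConfig, AutoFeatureExtractor
+
+ model_id = "onecxi/vakgyata-mini"
+ config = AutoConfig.from_pretrained(model_id)
+ processor = AutoFeatureExtractor.from_pretrained(model_id)
+
+ # Load a 16 kHz mono clip and extract features as numpy tensors.
+ audio, sr = torchaudio.load("path/to/audio.wav")
+ inputs = processor(audio.squeeze().numpy(), sampling_rate=sr, return_tensors="np")
+
+ session = ort.InferenceSession("onnx/model_quantized.onnx")
+ input_name = session.get_inputs()[0].name  # assumed single input (input_values)
+ logits = session.run(None, {input_name: inputs["input_values"]})[0]
+
+ predicted_id = int(np.argmax(logits, axis=-1)[0])
+ print("Predicted Language:", config.id2label[predicted_id])
+ ```
+
+ ---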
130
+
131
+ ## **Citation**
132
+
133
+ If you use this model in your research or application, please consider citing the model and its base source:
134
+
135
+ ```
136
+ @misc{vakgyata2024,
137
+ title={vakgyata: Language Identification for Indian Speech},
138
+ author={OneCXI},
139
+ year={2024},
140
+ url={https://huggingface.co/onecxi/vakgyata-base}
141
+ }
142
+ ```
143
+
144
+ ---
config.json ADDED
@@ -0,0 +1,143 @@
+ {
+   "_name_or_path": "onecxi/vakgyata-mini",
+   "activation_dropout": 0.1,
+   "adapter_attn_dim": null,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2ForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token": "<s>",
+   "bos_token_id": 1,
+   "classifier_proj_size": 1024,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": false,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "diversity_loss_weight": 0.1,
+   "do_lower_case": false,
+   "do_stable_layer_norm": true,
+   "eos_token": "</s>",
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_norm": "group",
+   "feat_proj_dropout": 0.1,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "en-IN",
+     "1": "hi-IN",
+     "2": "or-IN",
+     "3": "bn-IN",
+     "4": "ta-IN",
+     "5": "te-IN",
+     "6": "kn-IN",
+     "7": "ml-IN",
+     "8": "mr-IN",
+     "9": "gu-IN",
+     "10": "pa-IN",
+     "11": "as-IN"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "as-IN": 11,
+     "bn-IN": 3,
+     "en-IN": 0,
+     "gu-IN": 9,
+     "hi-IN": 1,
+     "kn-IN": 6,
+     "ml-IN": 7,
+     "mr-IN": 8,
+     "or-IN": 2,
+     "pa-IN": 10,
+     "ta-IN": 4,
+     "te-IN": 5
+   },
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.1,
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_prob": 0.05,
+   "model_name": "vakgyata",
+   "model_type": "wav2vec2",
+   "num_adapter_layers": 3,
+   "num_attention_heads": 12,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 4,
+   "num_negatives": 100,
+   "output_hidden_size": 768,
+   "pad_token": "[PAD]",
+   "pad_token_id": 0,
+   "proj_codevector_dim": 256,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "unk_token": "[UNK]",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 12,
+   "word_delimiter_token": "|",
+   "xvector_output_dim": 512
+ }
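This config describes the `mini` variant: 4 transformer layers (`num_hidden_layers`) at `hidden_size` 768, with a 12-way classification head (`id2label`). A minimal sketch for cross-checking it against the variants table in the README:

```python
from transformers import Wav2Vec2ForSequenceClassification

# Loading the full model downloads the weights; use AutoConfig for config-only checks.
model = Wav2Vec2ForSequenceClassification.from_pretrained("onecxi/vakgyata-mini")

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~38M for vakgyata-mini per the README table
print(model.config.id2label)                # 12 language codes, indices 0-11
```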
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:64737b2f254b46e13b471c9090299e886a6b5ccb98b02a989e5ef840dc1924e9
+ size 153884320
onnx/model_quantized.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18d0db5d829d46e8e26bf028d101a33d388aac1c92dedb571a800b4b7ff35a48
+ size 154004514
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
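These settings match a stock `Wav2Vec2FeatureExtractor`. A minimal sketch, using a dummy one-second waveform, of the zero-mean/unit-variance normalization implied by `"do_normalize": true`:

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

# Instantiate directly from the values in preprocessor_config.json.
fe = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# Dummy 1-second waveform at 16 kHz.
wave = (0.1 * np.random.randn(16000)).astype(np.float32)

out = fe(wave, sampling_rate=16000, return_tensors="np")
x = out["input_values"][0]
print(round(float(x.mean()), 4), round(float(x.std()), 4))  # ~0.0 and ~1.0
```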