---
title: SATE
emoji: 
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
short_description: Speech Annotation and Transcription Enhancer
---


# SATE: Speech Annotation and Transcription Enhancer (MVP)

This is the **Minimum Viable Product (MVP)** version of **SATE**, a unified pipeline framework that integrates audio segmentation, speaker diarization, transcription, and linguistic annotation into a single application.

---

## Overview

- **Main Entry**: `main_socket.py`
- **Input**: Entire audio file (`.mp3`, `.wav`, etc.)
- **Output**: Word-level timestamped transcription with annotations such as pauses, repetitions, filler words, mispronunciations and syllables.

- **Preprocessing**:
  - Audio segmentation
  - Speaker diarization
  - Transcription using CrisperWhisper

- **Annotation**:
  - Pause
  - Repetition
  - Filler Words
  - Syllable Structure
  - Mispronunciation Sequence (requires the PLM container)

- **Feature Extraction**
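
As a rough illustration of the pause annotation listed above, the sketch below flags gaps between consecutive words in a word-level timestamped transcript. The `Word` structure, the `find_pauses` helper, and the "gap at or above the threshold" rule are assumptions made for this example; they are not taken from SATE's code, and the actual output schema may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A word with start/end times in seconds (hypothetical structure, not SATE's schema)."""
    text: str
    start: float
    end: float


def find_pauses(words: List[Word], pause_threshold: float = 0.25) -> List[dict]:
    """Return the gaps between consecutive words lasting at least pause_threshold seconds."""
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt.start - prev.end  # silence between the two words
        if gap >= pause_threshold:  # assumed rule: inclusive threshold
            pauses.append(
                {"after": prev.text, "before": nxt.text, "duration": round(gap, 3)}
            )
    return pauses


# Made-up timestamps, for illustration only.
words = [
    Word("so", 0.00, 0.20),
    Word("um", 0.30, 0.55),
    Word("hello", 1.00, 1.40),
]
pauses = find_pauses(words)
print(pauses)
```

With the default threshold of 0.25 seconds, only the gap between "um" and "hello" is reported; the shorter gap after "so" is ignored.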

---


## Getting Started

#### Installation

##### 1. Clone the repo
```bash
git clone https://github.com/SwenHou/SATE.git
```
##### 2. Install packages
```bash
conda env create -f environment_sate_0.11.yml
```
##### 3. Start the inference API on your local machine
Set your Hugging Face token:
```bash
export HF_TOKEN=<your_token_here>
```
Start API:
```bash
python main_socket.py
```
#### Usage
##### 1. Get Annotations

```bash
curl -X POST http://localhost:7860/process \
  -F "audio_file=@<your local path to audio file>" \
  -F "device=cuda" \
  -F "pause_threshold=0.25"
```
The annotation file is also available in `SATE/session_data/`.
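
The same request can be sent from Python. This is a minimal sketch that mirrors the form fields of the `curl` call above; it assumes the third-party `requests` package is installed. The helper names `build_form_fields` and `annotate` and the 600-second timeout are choices made for this example, not part of SATE. The response format is not documented here, so the raw response object is returned as-is.

```python
from pathlib import Path


def build_form_fields(device: str = "cuda", pause_threshold: float = 0.25) -> dict:
    """Form fields matching the curl example; multipart values are sent as strings."""
    return {"device": device, "pause_threshold": str(pause_threshold)}


def annotate(audio_path: str, base_url: str = "http://localhost:7860", **fields):
    """POST an audio file to the /process endpoint and return the raw response."""
    import requests  # third-party dependency: pip install requests

    with Path(audio_path).open("rb") as audio:
        response = requests.post(
            f"{base_url}/process",
            files={"audio_file": audio},        # same field name as in the curl call
            data=build_form_fields(**fields),
            timeout=600,                        # assumed generous limit for long audio
        )
    response.raise_for_status()  # surface HTTP errors instead of failing silently
    return response
```

For example, `annotate("<your local path to audio file>", pause_threshold=0.25)` sends the request to the local API.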

---


## 🐳 Use Docker

### 1. Build Docker Image
In `Dockerfile`, delete `ENV HF_HOME=/data/.huggingface` and add `ENV HF_TOKEN=<your_token_here>`.

Run the following command in the project root directory:

```bash
docker build -t sate_0.11 .
```

### 2. Run the Docker Container
```bash
docker run --gpus all -it --rm \
  -p 7860:7860 \
  sate_0.11
```

### 3. Usage
Usage is the same as with the local API, but the annotation file is deleted when the container exits.

```bash
curl -X POST http://localhost:7860/process \
  -F "audio_file=@<your local path to audio file>" \
  -F "device=cuda" \
  -F "pause_threshold=0.25"
```


---


## 🤗 Use API from Hugging Face Spaces

```bash
curl -X POST https://Sven33-SATE.hf.space/process \
  -F "audio_file=@<your local path to audio file>" \
  -F "device=cuda" \
  -F "pause_threshold=0.25"
```
##### Hugging Face Space URL: `https://huggingface.co/spaces/Sven33/SATE`

Due to Hugging Face's GPU scheduling latency, the first request after a cold start takes around 5-8 minutes. If the Space receives no requests within five minutes after startup, it goes back to sleep.

For a 10-minute audio sample, inference on a T4 small GPU takes roughly two minutes or less.
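
Because the Space may be asleep when the first request arrives, a client may want to retry for several minutes before giving up. The helper below is one possible sketch of such a loop. The name `retry_until_ready`, the 600-second budget, and the 30-second interval are assumptions for this example; the budget is simply chosen to cover the 5-8 minute startup mentioned above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_ready(
    call: Callable[[], T],
    max_wait: float = 600.0,   # assumed budget, covers the 5-8 minute cold start
    interval: float = 30.0,    # assumed delay between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `call` until it succeeds or waiting again would exceed max_wait seconds."""
    deadline = clock() + max_wait
    while True:
        try:
            return call()
        except Exception:
            # Re-raise the last error once another wait would pass the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

Pass in any zero-argument function that sends the request to `https://Sven33-SATE.hf.space/process` and raises on failure; the helper returns its result as soon as one attempt succeeds.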