---
title: Podcast Search
emoji: 🚀
colorFrom: green
colorTo: gray
sdk: streamlit
sdk_version: 1.41.1
app_file: src/app.py
pinned: false
license: mit
short_description: Search the terapyon channel
---

# podcast-search

A system for searching the terapyon channel podcast.


## Usage


### Title list

- Place the following file in the `store` folder:
  - `title-list-202301-202501.parquet`
- The file has the following columns:
  - id: int
  - date: str (2023-01-09)
  - length: int
  - audio: str (audio file URL)
  - title: str

Example of the title list file:

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id</th>
      <th>date</th>
      <th>length</th>
      <th>audio</th>
      <th>title</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>69</td>
      <td>2023-01-09</td>
      <td>20993616</td>
      <td>https://anchor.fm/s/14480e04/podcast/play/6323...</td>
      <td>#69 2023年新年挨拶から 2022年の振り返りと2023年の抱負</td>
    </tr>
    <tr>
      <th>1</th>
      <td>70</td>
      <td>2023-03-09</td>
      <td>103287296</td>
      <td>https://anchor.fm/s/14480e04/podcast/play/6621...</td>
      <td>#70 PyCon JP Association代表理事退任と今後の展望をIqbalさんと語る</td>
    </tr>
    <tr>
      <th>2</th>
      <td>71</td>
      <td>2023-03-22</td>
      <td>116393694</td>
      <td>https://anchor.fm/s/14480e04/podcast/play/6706...</td>
      <td>#71 hirokikyさんをゲストに 自然言語処理系AI Chat GPT / Whisp...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>72</td>
      <td>2023-05-04</td>
      <td>49642320</td>
      <td>https://anchor.fm/s/14480e04/podcast/play/6976...</td>
      <td>#72 PyCon US 2023 ひとり振り返り</td>
    </tr>
    <tr>
      <th>4</th>
      <td>73</td>
      <td>2023-05-24</td>
      <td>150643013</td>
      <td>https://anchor.fm/s/14480e04/podcast/play/7094...</td>
      <td>#73 Nyohoさんをゲストに Scratchからディープラーニングや数学の話</td>
    </tr>
  </tbody>
</table>
</div>
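
As a rough illustration of the column list above, the following sketch checks a single row against the listed names and types using only the standard library. The function name `validate_title_row` and the dict-based row representation are assumptions made for this example; they are not part of this repository.

```python
import re
from datetime import date

# Column names and types as listed in the title list section.
TITLE_COLUMNS = {"id": int, "date": str, "length": int, "audio": str, "title": str}


def validate_title_row(row: dict) -> list[str]:
    """Return a list of problems found in one title-list row (empty if none)."""
    problems = []
    for name, expected in TITLE_COLUMNS.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif type(row[name]) is not expected:
            problems.append(f"{name}: expected {expected.__name__}")
    # The date column is a string such as "2023-01-09".
    value = row.get("date")
    if isinstance(value, str):
        try:
            if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
                raise ValueError(value)
            date.fromisoformat(value)  # rejects impossible dates like 2023-13-01
        except ValueError:
            problems.append("date: not in YYYY-MM-DD format")
    return problems
```

A check like this could be run over each row before writing the parquet file, so that schema mistakes surface before the database is built.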

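### Creating the text data

- Create a `data` folder (at the same level as `src`)
- Put the srt files in the `data` folder
  - (Following the rules below lets the ID be extracted from the srt file name)
  - Use the `.srt` extension
  - The file name must contain exactly one ID (an integer)
  - The ID must be delimited by `-` or `_` before and after it
- Run the following script. One `parquet` file per srt file is created in the `store` folder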

```
% python src/episode.py
```
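
The file-naming rules above can be sketched as a small helper. This is only an illustration, not the logic in `src/episode.py`: the function name `extract_episode_id` is hypothetical, and the sketch assumes that the start and end of the file name also count as boundaries around the ID.

```python
import re
from pathlib import Path


def extract_episode_id(filename: str) -> int:
    """Extract the single integer ID from an srt file name."""
    path = Path(filename)
    if path.suffix != ".srt":
        raise ValueError(f"not an .srt file: {filename}")
    # Split the name (without extension) on the allowed delimiters.
    tokens = re.split(r"[-_]", path.stem)
    # Keep only tokens made up entirely of ASCII digits.
    ids = [token for token in tokens if re.fullmatch(r"[0-9]+", token)]
    if len(ids) != 1:
        raise ValueError(f"expected exactly one ID, found {len(ids)}: {filename}")
    return int(ids[0])
```

For example, `extract_episode_id("terapyon-69-transcript.srt")` returns `69`, while a name containing two separate integers raises `ValueError`.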

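### Creating the database

The following command builds the persistent DuckDB database, from table creation through loading the three required kinds of data: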

```
% python src/store.py all
```

Details of the command above:

- Create the tables (create table)
  - `python src/store.py create`
- Insert the title list
  - `python src/store.py podcastinsert`
- Insert the episodes and their text
  - `python src/store.py episodeinsert`
- Vectorize the text (embedding)
  - `python src/store.py embed`
- Index the vector data
  - `python src/store.py index`
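
The relationship between `all` and the individual subcommands might look like the sketch below. It assumes that `all` simply runs the five listed steps in the order shown; the actual `src/store.py` may be organized differently, and the names `STEPS`, `plan`, and `main` are hypothetical. Each step is a placeholder that only prints its name.

```python
import argparse

# Step names as listed above; assumed to run in this order for "all".
STEPS = ["create", "podcastinsert", "episodeinsert", "embed", "index"]


def plan(command: str) -> list[str]:
    """Return the steps that a given subcommand would run."""
    if command == "all":
        return list(STEPS)
    if command in STEPS:
        return [command]
    raise ValueError(f"unknown command: {command}")


def main(argv=None) -> list[str]:
    parser = argparse.ArgumentParser(description="Build the DuckDB store (sketch)")
    parser.add_argument("command", choices=["all", *STEPS])
    args = parser.parse_args(argv)
    steps = plan(args.command)
    for step in steps:
        # Placeholder: a real script would do the work for each step here.
        print(f"running step: {step}")
    return steps
```

Calling `main(["all"])` walks through every step, while `main(["embed"])` runs only the embedding step, which is convenient when a single stage needs to be repeated.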


### Search UI

```
% streamlit run src/app.py
```

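- Select one or more podcast titles. If none is selected, all titles are searched
- Enter the search words in the text box
- Ten candidate sentences are shown
- Click the left side of the table to show the text at the bottom
- The audio timing (minutes and seconds) is displayed
- The audio at that timing can be played on the spot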
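
The timing display described above could be derived from srt timestamps roughly as follows. This is a minimal sketch under assumptions: the helper names `srt_time_to_seconds` and `format_timing` are hypothetical, and the `12m 34s` output format is chosen for illustration rather than taken from `src/app.py`.

```python
def srt_time_to_seconds(timestamp: str) -> float:
    """Convert an srt timestamp such as "00:12:34,500" to seconds."""
    hms, _, millis = timestamp.partition(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis or 0) / 1000


def format_timing(seconds: float) -> str:
    """Format an offset in seconds as minutes and seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs:02d}s"
```

Streamlit's `st.audio` accepts a `start_time` argument, which is one way playback from that offset could be wired up; whether the app does so is not stated here.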


https://github.com/user-attachments/assets/98e85be4-a633-4bdc-900d-9a7c06818d9b