Upload folder using huggingface_hub
- .gitattributes +9 -34
- LICENSE.md +60 -0
- Notice-jp.md +36 -0
- Notice.md +36 -0
- README-jp.md +147 -0
- README.md +154 -0
- checkpoints/.gitkeep +1 -0
- checkpoints/japanese-dialog-transformer-1.6B-empdial50k.pt +3 -0
- checkpoints/japanese-dialog-transformer-1.6B-persona50k.pt +3 -0
- checkpoints/japanese-dialog-transformer-1.6B.pt +3 -0
- data/dicts/sp_oall_32k.model +3 -0
- data/dicts/sp_oall_32k.txt +0 -0
- data/dicts/sp_oall_32k.vocab +0 -0
- data/persona_chat/binary/dict.dst.txt +0 -0
- data/persona_chat/binary/dict.src.txt +0 -0
- data/persona_chat/binary/preprocess.log +14 -0
- data/persona_chat/binary/test.src-dst.dst.bin +3 -0
- data/persona_chat/binary/test.src-dst.dst.idx +0 -0
- data/persona_chat/binary/test.src-dst.src.bin +3 -0
- data/persona_chat/binary/test.src-dst.src.idx +0 -0
- data/persona_chat/binary/train.src-dst.dst.bin +3 -0
- data/persona_chat/binary/train.src-dst.dst.idx +3 -0
- data/persona_chat/binary/train.src-dst.src.bin +3 -0
- data/persona_chat/binary/train.src-dst.src.idx +3 -0
- data/persona_chat/binary/valid.src-dst.dst.bin +3 -0
- data/persona_chat/binary/valid.src-dst.dst.idx +0 -0
- data/persona_chat/binary/valid.src-dst.src.bin +3 -0
- data/persona_chat/binary/valid.src-dst.src.idx +0 -0
- data/persona_chat/raw/persona_data.pkl +3 -0
- data/persona_chat/raw/rest.dst +0 -0
- data/persona_chat/raw/rest.src +3 -0
- data/persona_chat/raw/test.dst +0 -0
- data/persona_chat/raw/test.src +0 -0
- data/persona_chat/raw/train.dst +0 -0
- data/persona_chat/raw/train.src +0 -0
- data/persona_chat/raw/valid.dst +0 -0
- data/persona_chat/raw/valid.src +0 -0
- data/persona_chat/tokenized/sp_oall_32k/sp_oall_32k.model +3 -0
- data/persona_chat/tokenized/sp_oall_32k/test.dst +0 -0
- data/persona_chat/tokenized/sp_oall_32k/test.src +0 -0
- data/persona_chat/tokenized/sp_oall_32k/train.dst +0 -0
- data/persona_chat/tokenized/sp_oall_32k/train.src +0 -0
- data/persona_chat/tokenized/sp_oall_32k/valid.dst +0 -0
- data/persona_chat/tokenized/sp_oall_32k/valid.src +0 -0
- data/sample/bin/dict.dst.txt +0 -0
- data/sample/bin/dict.src.txt +0 -0
- data/sample/bin/test.src-dst.dst.bin +3 -0
- data/sample/bin/test.src-dst.dst.idx +0 -0
- data/sample/bin/test.src-dst.src.bin +3 -0
- data/sample/bin/test.src-dst.src.idx +0 -0
.gitattributes
CHANGED

@@ -1,35 +1,10 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
-*.
-*.
-*.
-
-*.
-
-
-
-
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.txt filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.idx filter=lfs diff=lfs merge=lfs -text
+*.src filter=lfs diff=lfs merge=lfs -text
+*.dst filter=lfs diff=lfs merge=lfs -text
+data/dicts/sp_oall_32k.model filter=lfs diff=lfs merge=lfs -text
+data/persona_chat/raw/persona_data.pkl filter=lfs diff=lfs merge=lfs -text
+data/persona_chat/tokenized/sp_oall_32k/sp_oall_32k.model filter=lfs diff=lfs merge=lfs -text
+japanese_persona_chat.xlsx filter=lfs diff=lfs merge=lfs -text
LICENSE.md
ADDED

## DIALOGUE MODEL AND DATA EVALUATION LICENSE AGREEMENT

This DIALOGUE MODEL AND DATA EVALUATION LICENSE AGREEMENT (this "Agreement") is a legal contract between a person who uses or otherwise accesses the Data ("User(s)"), and Nippon Telegraph and Telephone Corporation ("NTT").

READ THE TERMS AND CONDITIONS OF THIS AGREEMENT CAREFULLY BEFORE USING OR OTHERWISE ACCESSING NTT'S PROPRIETARY DIALOGUE MODEL AND DATA (INCLUDING THE MODEL AND DATA MODIFIED PURSUANT AND SUBJECT TO SECTION 1) ACCOMPANIED BY THIS AGREEMENT (the "DATA"). THE DATA IS COPYRIGHTED AND IT IS LICENSED TO USER UNDER THIS AGREEMENT, NOT SOLD TO USER. BY USING OR OTHERWISE ACCESSING THE DATA, USER ACKNOWLEDGES THAT USER HAS READ THIS AGREEMENT, THAT USER UNDERSTANDS IT, AND THAT USER ACCEPTS AND AGREES TO BE BOUND BY ITS TERMS. IF AT ANY TIME USER IS NOT WILLING TO BE BOUND BY THE TERMS OF THIS AGREEMENT, USER SHOULD IMMEDIATELY CEASE AND REFRAIN FROM ACCESSING OR USING THE DATA AND DELETE ANY COPIES USER MAY HAVE. THIS AGREEMENT REPRESENTS THE ENTIRE AGREEMENT BETWEEN USER AND NTT CONCERNING THE DATA.

### BACKGROUND

A. NTT is the owner of all rights, including all patent rights, copyrights and trade secret rights, in and to the Data and related documentation listed in Exhibit A to this Agreement.

B. User wishes to obtain a royalty free license to use the Data to enable User to evaluate, and NTT wishes to grant such a license to User, pursuant and subject to the terms and conditions of this Agreement.

C. As a condition to NTT's provision of the Data to User, NTT has required User to execute this Agreement.

In consideration of these premises, and the mutual promises and conditions in this Agreement, the parties hereby agree as follows:

1. <u>Grant of Evaluation License.</u> NTT hereby grants to User, and User hereby accepts, under the terms and conditions of this Agreement, a royalty free, nontransferable and nonexclusive license to use and modify the Data only for the purposes of testing, analyzing, and evaluating the methods or mechanisms as shown in the research paper submitted by NTT to a certain academy. User may use the Data for providing dialogue service to end-users only for the purpose of testing, analyzing, and evaluating of the Data. User may make a reasonable number of backup copies of the Data solely for User's internal use pursuant to the license granted in this Section 1.

2. <u>Shipment and Installation.</u> NTT will ship or deliver the Data by any method that NTT deems appropriate. User shall be solely responsible for proper use of the Data.

3. <u>Term.</u> This Agreement is effective whichever is earlier (i) upon User's acceptance of the Agreement, or (ii) upon User's using and accessing the Data, even if User has not expressly accepted this Agreement. Without prejudice to any other rights, NTT may terminate this Agreement without notice to User (i) if User breaches or fails to comply with any of the limitations or other requirements described herein, (ii) if User uses the Data in an ethically or socially unacceptable manner, and (iii) if NTT receives a notice from the academy stating that the research paper would not be published, and in any such case User agrees that NTT may, in addition to any other remedies it may have at law or in equity, remotely disable the use of the Data. User may terminate this Agreement at any time by User's decision to terminate the Agreement to NTT and ceasing use of the Data. Upon any termination or expiration of this Agreement for any reason, User agrees to remove the Data and either return to NTT the Data and all copies thereof, or to destroy all such materials and provide written verification of such destruction to NTT upon request of NTT.

4. <u>Proprietary Rights.</u>

(a) The Data is the valuable, confidential, and proprietary property of NTT, and NTT shall retain exclusive title to this property both during the term and after the termination of this Agreement. Without limitation, User acknowledges that all patent rights, copyrights and trade secret rights in the Data shall remain the exclusive property of NTT at all times. User shall use not less than reasonable care in safeguarding the confidentiality of the Data.

(b) USER SHALL NOT, IN WHOLE OR IN PART, AT ANY TIME DURING THE TERM OF OR AFTER THE TERMINATION OF THIS AGREEMENT: (i) SELL, ASSIGN, LEASE, DISTRIBUTE, OR OTHERWISE TRANSFER THE DATA TO ANY THIRD PARTY; (ii) EXCEPT AS OTHERWISE PROVIDED HEREIN, COPY OR REPRODUCE THE DATA IN ANY MANNER; (iii) DISCLOSE THE DATA TO ANY THIRD PARTY, EXCEPT TO USER'S EMPLOYEES WHO REQUIRE ACCESS TO THE DATA FOR THE PURPOSES OF THIS AGREEMENT; (iv) USE THE DATA FOR ANY THIRD PARTY EXCEPT FOR THE NON-COMMERCIAL EVALUATION PURPOSE OF THIS AGREEMENT; (v) DISASSEMBLE, DECOMPILE, REVERSE ENGINEER OR TRANSLATE THE DATA; OR (vi) ALLOW ANY PERSON OR ENTITY TO COMMIT ANY OF THE ACTIONS DESCRIBED IN (i) THROUGH (v) ABOVE.

(c) User shall take appropriate action, by instruction, agreement, or otherwise, with respect to its employees permitted under this Agreement to have access to the Data to ensure that all of User's obligations under this Section 4 shall be satisfied.

5. <u>Indemnity.</u>

User shall defend, indemnify and hold harmless NTT, its agents and employees, from any loss, damage, or liability arising in connection with User's improper or unauthorized use of the Data. NTT SHALL HAVE THE SOLE RIGHT TO CONDUCT AND DEFEND ANY ACTION RELATING TO THE DATA.

6. <u>Disclaimer.</u> THE DATA IS LICENSED TO USER "AS IS," WITHOUT ANY TRAINING, MAINTENANCE, OR SERVICE OBLIGATIONS WHATSOEVER ON THE PART OF NTT. NTT MAKES NO EXPRESS OR IMPLIED WARRANTIES OF ANY TYPE WHATSOEVER, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF MERCHANTABILITY, OF FITNESS FOR A PARTICULAR PURPOSE AND OF NON-INFRINGEMENT ON COPYRIGHT OR ANY OTHER RIGHT OF THIRD PARTIES. USER ASSUMES ALL RISKS ASSOCIATED WITH ITS USE OF THE DATA, INCLUDING WITHOUT LIMITATION RISKS RELATING TO QUALITY, PERFORMANCE, DATA LOSS, AND UTILITY IN A PRODUCTION ENVIRONMENT.

7. <u>Limitation of Liability.</u> IN NO EVENT SHALL NTT BE LIABLE TO USER OR TO ANY THIRD PARTY FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO DAMAGES FOR PERSONAL INJURY, PROPERTY DAMAGE, LOST PROFITS, OR OTHER ECONOMIC LOSS, ARISING IN CONNECTION WITH USER'S USE OF OR INABILITY TO USE THE DATA, IN CONNECTION WITH NTT'S PROVISION OF OR FAILURE TO PROVIDE SERVICES PERTAINING TO THE DATA, OR AS A RESULT OF ANY DEFECT IN THE DATA. THIS DISCLAIMER OF LIABILITY SHALL APPLY REGARDLESS OF THE FORM OF ACTION THAT MAY BE BROUGHT AGAINST NTT, WHETHER IN CONTRACT OR TORT, INCLUDING WITHOUT LIMITATION ANY ACTION FOR NEGLIGENCE. USER'S SOLE REMEDY IN THE EVENT OF ANY BREACH OF THIS AGREEMENT BY NTT SHALL BE TERMINATION PURSUANT TO SECTION 3.

8. <u>No Assignment or Sublicense.</u> Neither this Agreement nor any right or license under this Agreement, nor the Data, may be sublicensed, assigned, or otherwise transferred by User without NTT's prior written consent.

9. <u>General.</u>

(a) If any provision, or part of a provision, of this Agreement is or becomes illegal, unenforceable, or invalidated, by operation of law or otherwise, that provision or part shall to that extent be deemed omitted, and the remainder of this Agreement shall remain in full force and effect.

(b) This Agreement is the complete and exclusive statement of the agreement between the parties with respect to the subject matter hereof, and supersedes all written and oral contracts, proposals, and other communications between the parties relating to that subject matter.

(c) Subject to Section 8, this Agreement shall be binding on, and shall inure to the benefit of, the respective successors and assigns of NTT and User.

(d) If either party to this Agreement initiates a legal action or proceeding to enforce or interpret any part of this Agreement, the prevailing party in such action shall be entitled to recover, as an element of the costs of such action and not as damages, its attorneys' fees and other costs associated with such action or proceeding.

(e) This Agreement shall be governed by and interpreted under the laws of Japan, without reference to conflicts of law principles. All disputes arising out of or in connection with this Agreement shall be finally settled by arbitration in Tokyo in accordance with the Commercial Arbitration Rules of the Japan Commercial Arbitration Association. The arbitration shall be conducted by three (3) arbitrators and in Japanese. The award rendered by the arbitrators shall be final and binding upon the parties. Judgment upon the award may be entered in any court having jurisdiction thereof.

(f) NTT shall not be liable to the User or to any third party for any delay or failure to perform NTT's obligation set forth under this Agreement due to any cause beyond NTT's reasonable control.

(g) This Agreement shall be governed by the laws of Japan, without regard to the conflict of laws principles. NTT and User agree that any claim asserted in any legal proceeding by one party against the other will be commenced and maintained exclusively in the Tokyo District Court in Japan.

<u>EXHIBIT A</u>

* Pre-trained dialogue model
* Fine-tuned dialogue model
* Dialogue datasets (EmpatheticDialogues, PersonaChats)
Notice-jp.md
ADDED

# NTTが提供する対話モデルの利用に関する注意文

## 対話モデルの提供について

NTT コミュニケーション科学基礎研究所(以下,NTT)が提供する本対話モデルは,対話的な応答文を生成するプログラム(対話プログラム)の動作に必要な,パラメータデータです.本対話モデルは,人同士の対話データを利用して学習(パラメータ調整)されています.本対話モデルを対話プログラムと合わせて利用することで,学習に用いられた対話データに含まれる応答関係の出現確率に基づいて,応答文を確率的に生成することができます.

対話プログラムの研究開発に携わる多くの方々に本モデルを適切にご利用いただくことで,様々な技術的課題を解決し,対話システム研究を大きく進展させることができると考え,このたび評価・検証目的のご利用に限り,本モデルを無償で公開することとなりました.

## 提供モデルの利用用途の制限

本対話モデルおよび本対話データの利用用途は,本対話モデルおよび本対話データで学習した対話モデルの性能評価・検証と,本対話データの分析目的に限られます.評価・検証を目的とする限りにおいて,営利団体・非営利団体・個人を問わず,本モデルを一般ユーザとの対話にもご利用いただけます.
一方で,営利団体・非営利団体・個人を問わず,また商用・非商用を問わず,対話サービスの提供自体を目的とする用途への利用はご遠慮いただいております.そうした目的へのご利用を検討される場合は,末尾の問い合わせ先へご連絡ください.

## 注意事項

本モデルを利用した対話プログラムを通じて生成される文には,原理的に学習データに含まれる内容の偏りが反映されます.本対話モデルの学習に利用した対話データにはWeb上の公開情報から収集した文が含まれるため,Web上の文の分布を反映して,非社会的な発話,実在人物・組織等に対する誹謗中傷,ヘイトスピーチ等の,不適切な内容を含む文が生成される可能性があります.
実際にどのような文が生成されるかは,対話プログラムのアルゴリズムや使い方に大きく依存します.そのため,本モデルの利用者(対話プログラムと本モデルを利用して,文を生成した者)は,生成される文が不適切な内容を含む可能性を考慮した上で本モデルを利用し,生成された文によって被害が生じないよう万全の配慮と対策をお願いいたします.利用者は,そのような配慮と対策をおこない,適切・不適切を問わず生成した文に対する責任を負うことに同意する場合に限り,本モデルをご利用いただけます.

## 不適切文生成への対策例

NTTが本モデルを実際の対話に適用する場合には,一般に以下の対策を取っております.ただし現状では不適切文生成への対応自体も解決すべき技術的課題のひとつであり,下記の例が十分な対策であることをNTTが保証するものではありません.あくまでも参考としてお考えください.

1. モデル利用時に想定される入力文や,不適切な文の生成を誘発させうる入力文を事前に与えて応答を十分にテストする
2. 実際に生成文を見る人に「不適切な文が出力されうる」ことを告知し了解を得る
3. 高い確率で特定の不適切表現が生成される場合には,キーワードフィルタで当該表現を含む文の出力を抑制する
4. 生成文を見た人が不快に感じた場合には即座に対話を中止する

## 提供対話モデルの詳細

### 事前学習モデル

Twitter内の応答ツイートペアのみを用いて,幅広い言語表現や応答関係を不適切な内容も含めて学習したモデルです.不適切な内容(例えば犯罪行為)を除去したデータでモデルを学習した場合,入力された情報の適切さを判断する基準そのものがモデルから失われてしまいます.そのため,あえて不適切な内容を除去せずに学習しています.
後述する「ファインチューン済みモデル」の素材として用いられるモデルであり,本モデルをそのまま実際の対話へ適用することは想定されていません.実際の対話へ適用する場合は,適切なデータでファインチューン(追加学習)を行ってからご利用ください.

### ファインチューン済みモデル

事前学習モデルを下地として,不適切な発話が含まれない対話データ(NTTが収集した日本語版「Persona-chat」「Empathetic dialog」)で追加学習(ファインチューン)したモデルです.十分な量の対話データを持たない利用者が,本対話モデルの性能を予備的に検証する目的で提供します.
ファインチューンに利用した対話データの性質が反映されるため,不適切な文の生成確率は事前学習モデルに比べて大幅に低下していますが,完全には抑制されていません.特に,不適切な内容がモデルに入力された場合には,その返答として不適切な内容が生成される確率が上昇します.注意事項や不適切文生成への対策例をご参考いただき,モデル利用者が必要と考える対策を施した上でご利用ください.

## お問合せ先

dialog-transformer-ml < at > ntt.com
Notice.md
ADDED

# Notice on the use of the dialogue models provided by NTT

## About the provided dialogue models

The dialogue models provided by NTT Communication Science Laboratories (hereafter, NTT) are the parameter data required to run a program that generates dialogue responses (a dialogue program). The models were trained (parameter-tuned) on dialogue data between humans. Combined with a dialogue program, the models generate responses probabilistically, based on the occurrence probabilities of the response relations contained in the training dialogue data.

We believe that appropriate use of these models by the many people engaged in the research and development of dialogue programs will help solve a variety of technical problems and greatly advance dialogue-system research. We have therefore decided to release the models free of charge, limited to use for evaluation and verification purposes.

## Restrictions on use

The dialogue models and dialogue data may be used only for evaluating and verifying the performance of these models (or of models trained on the data) and for analyzing the dialogue data. As long as the purpose is evaluation or verification, the models may also be used for dialogue with general users, whether by a commercial organization, a non-profit organization, or an individual.
On the other hand, use for the purpose of providing a dialogue service itself is not permitted, whether commercial or non-commercial. If you are considering such use, please contact the address at the end of this document.

## Cautions

Sentences generated through a dialogue program using these models inherently reflect biases in the training data. Because the training data include sentences collected from publicly available information on the Web, the models may, reflecting the distribution of text on the Web, generate sentences with inappropriate content, such as antisocial utterances, slander of real people or organizations, and hate speech.
What is actually generated depends heavily on the algorithm and usage of the dialogue program. Users of the models (those who generate sentences using a dialogue program and these models) must therefore use the models with the possibility of inappropriate output in mind, and take every possible precaution to prevent harm caused by the generated sentences. You may use the models only if you agree to take such precautions and to accept responsibility for the generated sentences, whether appropriate or inappropriate.

## Examples of countermeasures against inappropriate output

When NTT applies these models to actual dialogue, we generally take the following measures. Note that handling inappropriate output is itself still an open technical problem, and NTT does not guarantee that the examples below are sufficient countermeasures; please treat them only as a reference.

1. Test responses thoroughly in advance with the inputs expected when the model is used, including inputs likely to induce inappropriate output.
2. Inform the people who will actually see the generated sentences that inappropriate sentences may be output, and obtain their consent.
3. If a specific inappropriate expression is generated with high probability, suppress sentences containing that expression with a keyword filter.
4. Stop the dialogue immediately if a person who sees the generated sentences feels uncomfortable.

## Details of the provided dialogue models

### Pre-trained model

A model trained only on pairs of reply tweets on Twitter, covering a wide range of linguistic expressions and response relations, including inappropriate content. If a model is trained on data from which inappropriate content (for example, criminal acts) has been removed, the model loses the very basis for judging whether an input is appropriate. For this reason, inappropriate content was deliberately not removed from the training data.
This model is intended as the base for the "fine-tuned models" described below, and is not meant to be applied to actual dialogue as-is. If you apply it to actual dialogue, please fine-tune it on appropriate data first.

### Fine-tuned models

Models obtained by fine-tuning the pre-trained model on dialogue data containing no inappropriate utterances (the Japanese versions of "Persona-chat" and "Empathetic dialog" collected by NTT). They are provided so that users without a sufficient amount of dialogue data can perform preliminary verification of model performance.
Because the properties of the fine-tuning data are reflected, the probability of generating inappropriate sentences is much lower than for the pre-trained model, but it is not completely suppressed. In particular, when inappropriate content is input to a model, the probability that inappropriate content is generated in response increases. Please refer to the cautions and countermeasure examples above, and use the models after taking whatever measures you deem necessary.

## Contact

dialog-transformer-ml < at > ntt.com
README-jp.md
ADDED

# japanese-dialog-transformers

このリポジトリでは,NTTが提供する,日本語Transformer Encoder-decoder対話モデルを,[fairseq](https://github.com/pytorch/fairseq)上で評価するために必要な情報を公開しています.

---

| 本ページの項目一覧 |
|-|
| [更新履歴](#更新履歴) |
| [ご利用の前に](#ご利用の前に) |
| [モデルのダウンロード](#モデルのダウンロード) |
| [提供モデルのご利用方法](#提供モデルのご利用方法) |
| [利用規約](LICENSE.md) |

---

## 更新履歴

* 2021/09/17 対話モデル(fairseq版 `japanese-dialog-transformer-1.6B`),データセット(`JEmpatheticDialogues`および`JPersonaChat`),および評価用コードを公開しました.

---

## ご利用の前に

提供する対話モデルは,モデル性能の評価・検証用です.
これらのモデルをダウンロードする前に,[利用規約](LICENSE.md)と[注意文書](Notice-jp.md)をご確認ください.以下の3点にご同意いただける場合に限り,本モデルをダウンロード・ご利用いただけます.

1. [利用規約](LICENSE.md)
2. 本モデルを評価・検証目的のみに利用し,対話サービスの提供自体を目的とする用途へ利用しないこと
3. 生成された文によって被害が生じないよう万全の配慮と対策をおこない,適切・不適切を問わず生成した文に対する責任を負うこと

### BibTeX

本モデルを利用した結果を公開する場合には,以下の論文をご引用ください.

```BibTeX
@misc{sugiyama2021empirical,
      title={Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems},
      author={Hiroaki Sugiyama and Masahiro Mizukami and Tsunehiro Arimoto and Hiromi Narimatsu and Yuya Chiba and Hideharu Nakajima and Toyomi Meguro},
      year={2021},
      eprint={2109.05217},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

---

## モデルのダウンロード

- 事前学習モデルのダウンロードは[こちら](https://www.dropbox.com/s/k3ugxmr7nw6t86l/japanese-dialog-transformer-1.6B.pt?dl=0)
- PersonaChatでファインチューンしたモデルのダウンロードは[こちら](https://www.dropbox.com/s/e5ib6rhsbldup3v/japanese-dialog-transformer-1.6B-persona50k.pt?dl=0)
- EmpatheticDialoguesでファインチューンしたモデルのダウンロードは[こちら](https://www.dropbox.com/s/laqz0jcgxvpxiy0/japanese-dialog-transformer-1.6B-empdial50k.pt?dl=0)

---

## 提供モデルのご利用方法

本ページで公開するモデルは,fairseqに含まれるスクリプトを用いて,発話文生成や追加のfine-tuningを行うことができます.

### 依存ライブラリのインストール

検証環境は以下の通りです.[miniconda](https://repo.anaconda.com/miniconda/Miniconda3-py38_4.10.3-Linux-x86_64.sh)上で検証しています.

- Python 3.8.10
- CUDA 11.1/10.2
- PyTorch 1.8.2(インストールコマンドは必ず[公式ページ](https://pytorch.org/get-started/locally/)をご確認ください.pipを推奨します.)
- fairseq 1.0.0a0(正式リリースではなく,`git clone`でのみダウンロード可能です.検証時のcommit IDは8adff65ab30dd5f3a3589315bbc1fafad52943e7です.)
- sentencepiece 0.1.96

fairseqのインストールに当たっては,公式ページの[Requirements and Installation](https://github.com/pytorch/fairseq#requirements-and-installation)をご確認いただき,`git clone`を利用して最新版もしくは検証済み版(8adff65)をインストールしてください.通常のpip installでは旧バージョンの0.10.2までしか入りません.

また,独自のデータでfinetuneを行う場合は,sentencepieceのスタンドアローン版のインストールが必要です.

### 一問一答の対話(fairseq-interactive)

一問一答形式の対話を行います.fairseq-interactiveは文脈を保持する手段がなく,入力された文のみを見て応答を生成します.Finetune時・論文実験時の文脈を利用する設定とは異なるため,やや不適切な発話が生成されやすくなっております.ご注意ください.

以下のコマンドでは,beam・nbest(出力候補数)について,結果を見やすくするため,小さめの値(10)を利用しています.実際に利用する場合は,20以上に設定するほうがよい結果を得られると思います.

~~~
fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --beam 10 \
 --seed 0 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 10 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0
~~~

### 文脈を保持する対話(dialog.py)

4発話程度の文脈を保持して対話を行います.Finetune時・論文実験時相当の設定です.

~~~
python scripts/dialog.py data/sample/bin/ \
 --path checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt \
 --beam 80 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 80 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0 \
 --show-nbest 5
~~~
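dialog.py が行う文脈保持の考え方は,次のような最小スケッチで表せます.これは dialog.py の実装そのものではなく,説明用の仮定に基づく例です(区切りトークン `[SEP]` は仮の値です).

```python
from collections import deque

# 文脈保持の最小スケッチ(dialog.py の実装そのものではありません).
# 直近の最大4発話を保持し,区切りトークンで連結してモデルへの入力文字列を作ります.
# 区切りトークン "[SEP]" は説明用の仮の値です.
class ContextBuffer:
    def __init__(self, max_turns: int = 4, sep: str = "[SEP]"):
        # maxlen を超えて追加すると,最も古い発話が自動的に捨てられる
        self.turns = deque(maxlen=max_turns)
        self.sep = sep

    def add(self, utterance: str) -> None:
        """新しい発話(ユーザ・システムのいずれも)を文脈に追加する"""
        self.turns.append(utterance)

    def model_input(self) -> str:
        """保持中の発話を区切りトークンで連結した入力文字列を返す"""
        return self.sep.join(self.turns)


buf = ContextBuffer()
for u in ["こんにちは", "こんにちは,いい天気ですね", "そうですね", "散歩に行きませんか"]:
    buf.add(u)
buf.add("いいですね")  # 5発話目を追加すると最古の発話が落ちる
print(buf.model_input())  # 直近4発話のみが連結されて残る
```

このように文脈を入力に含めることで,Finetune時と同じ条件で応答を生成できます.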

### 特定データセット上でのパープレキシティ計算

特定データセット上でのパープレキシティ(ppl)の計算を行います.
pplが低いほど,そのデータセットでの対話をモデルが表現できていると評価することができます.

~~~
fairseq-validate $DATA_PATH \
 --path $MODEL_PATH \
 --task translation \
 --source-lang src \
 --target-lang dst \
 --batch-size 2 \
 --ddp-backend no_c10d \
 --valid-subset test \
 --skip-invalid-size-inputs-valid-test
~~~
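参考までに,パープレキシティと平均負対数尤度の関係を確認する最小スケッチを示します.fairseq の出力から ppl を計算する際は loss 値の底(2 か e か)が実装に依存するため,ここでは一般式のみを示す説明用の例です.

```python
import math

# パープレキシティ(ppl)の最小スケッチ:
# ppl = exp(平均負対数尤度)(対数の底が e の場合).
# fairseq-validate の loss 値から求める場合は,loss の底に注意してください.
def perplexity(token_log_probs):
    """token_log_probs: 各トークンの対数尤度(自然対数)のリスト"""
    nll = -sum(token_log_probs) / len(token_log_probs)  # 平均負対数尤度 (nats)
    return math.exp(nll)


# 全トークンに一様に確率 1/8 を割り当てるモデルの ppl は 8 になる
logp = [math.log(1 / 8)] * 10
print(round(perplexity(logp), 6))  # 8.0
```

ppl が低いほど,モデルが各トークンに高い確率を割り当てられていることを意味します.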

### Persona-chat, EmpatheticDialoguesを利用したFinetune

PretrainモデルをPersona-chatやEmpatheticDialoguesでFinetuneすることで,提供するFinetune済みモデルとほぼ同じモデルを作成することができます.

また,独自の対話データをお持ちの場合は,data/*/raw内に同じ形式でデータを配置することで,そのデータでFinetuneを行うことも可能です.ただし,Finetuneを施したモデルを公開・配布することは利用規約上許可しておりませんのでご注意ください(独自データを公開し,第三者に本モデルからFinetuneさせることは可能です).

#### データセットのダウンロードと変換

* [JEmpatheticDialogues](https://www.dropbox.com/s/rkzyeu58p48ndz3/japanese_empathetic_dialogues.xlsx?dl=0)
* [JPersonaChat](https://www.dropbox.com/s/sda9wzexh7ntlij/japanese_persona_chat.xlsx?dl=0)

データをExcelからシンプルな入力文(src)・出力文(dst)の形式に変換します.srcとdstの同じ行が対応する入出力のペアとなります.50000行をtrainとして分割出力します.

~~~
python scripts/extract_ed.py japanese_empathetic_dialogues.xlsx data/empdial/raw/
~~~
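上記の src/dst 形式(同じ行番号同士が入出力のペア)の考え方は,次のような最小スケッチで表せます.scripts/extract_ed.py の実装そのものではなく,説明用の仮定に基づく例です.

```python
# 対話の発話列から src(入力)/ dst(出力)の行対応ペアを作る最小スケッチ.
# scripts/extract_ed.py の実装そのものではありません.
def to_pairs(utterances):
    """連続する発話 (u0, u1), (u1, u2), ... を (src, dst) ペアにする"""
    return list(zip(utterances[:-1], utterances[1:]))


dialog = ["こんにちは", "こんにちは!", "お元気ですか", "元気です"]
for src, dst in to_pairs(dialog):
    # src ファイルと dst ファイルでは,同じ行番号の行同士が対応する
    print(src, "->", dst)
```

実際の変換スクリプトは,この形式の src / dst ファイルを train・valid・test に分割して出力します.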

---

## License

[利用規約](LICENSE.md)
README.md
ADDED
|
@@ -0,0 +1,154 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# japanese-dialog-transformers
|
| 2 |
+
|
| 3 |
+
**[日本語の説明文はこちら](README-jp.md)**
|
| 4 |
+
|
| 5 |
+
This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialogue model provided by NTT on [fairseq](https://github.com/pytorch/fairseq).
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
| Table of contents. |
|
| 11 |
+
|-|
|
| 12 |
+
| [Update log](#update-log) |
|
| 13 |
+
| [Notice for using the codes](#before-using) |
|
| 14 |
+
| [Model download](#model-download) |
|
| 15 |
+
| [Quick start](#quick-start) |
|
| 16 |
+
| [LICENSE](LICENSE.md) |
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
## Update log
|
| 21 |
+
|
| 22 |
+
* Sept. 17, 2021: Published dialogue models (fairseq version `japanese-dialog-transformer-1.6B`), datasets(`JEmpatheticDialogues` and `JPersonaChat`) and evaluation codes.
|
| 23 |
+
|
| 24 |
+
---
|
| 25 |
+
|
| 26 |
+
## Notice for using the codes
|
| 27 |
+
The dialogue models provided are for evaluation and verification of model performance.
|
| 28 |
+
Before downloading these models, please read the [LICENSE](LICENSE.md) and [CAUTION](Notice-jp.md) documents. You can download and use these models only if you agree to the following three points.
|
| 29 |
+
|
| 30 |
+
1. [LICENSE](LICENSE.md)
|
| 31 |
+
2. To be used only for the purpose of evaluation and verification of this model, and not for the purpose of providing dialogue service itself.
|
| 32 |
+
3. Take all possible care and measures to prevent damage caused by the generated text, and take responsibility for the text you generate, whether appropriate or inappropriate.
|
| 33 |
+
|
| 34 |
+
### BibTeX
|
| 35 |
+
When publishing results using this model, please cite the following paper.
|
| 36 |
+
<!-- You can use the following BibTeX entry for citation if you find our method useful. -->
|
| 37 |
+
```BibTeX
|
| 38 |
+
@misc{sugiyama2021empirical,
|
| 39 |
+
title={Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems},
|
| 40 |
+
author={Hiroaki Sugiyama and Masahiro Mizukami and Tsunehiro Arimoto and Hiromi Narimatsu and Yuya Chiba and Hideharu Nakajima and Toyomi Meguro},
|
| 41 |
+
year={2021},
|
| 42 |
+
eprint={2109.05217},
|
| 43 |
+
archivePrefix={arXiv},
|
| 44 |
+
primaryClass={cs.CL}
|
| 45 |
+
}
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
---
|
| 49 |
+
## Model download
|
| 50 |
+
- [Pre-trained model](https://www.dropbox.com/s/k3ugxmr7nw6t86l/japanese-dialog-transformer-1.6B.pt?dl=0)
|
| 51 |
+
- [Finetuned model with JPersonaChat](https://www.dropbox.com/s/e5ib6rhsbldup3v/japanese-dialog-transformer-1.6B-persona50k.pt?dl=0)
|
| 52 |
+
- [Finetuned model with JEmpatheticDialogues](https://www.dropbox.com/s/laqz0jcgxvpxiy0/japanese-dialog-transformer-1.6B-empdial50k.pt?dl=0)
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## Quick start
|
| 57 |
+
|
| 58 |
+
The models published on this page can be used for utterance generation and additional fine-tuning using the scripts included in fairseq.
|
| 59 |
+
|
| 60 |
+
### Install dependent libraries
|
| 61 |
+
The verification environment is as follows.
|
| 62 |
+
|
| 63 |
+
- Python 3.8.10 on [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-py38_4.10.3-Linux-x86_64.sh)
- CUDA 11.1/10.2
- PyTorch 1.8.2 (for the installation commands, be sure to check the [official page](https://pytorch.org/get-started/locally/); we recommend using pip)
- fairseq 1.0.0a0 (available only from GitHub; the validated commit ID was 8adff65ab30dd5f3a3589315bbc1fafad52943e7)
- sentencepiece 0.1.96

When installing fairseq, please check [Requirements and Installation](https://github.com/pytorch/fairseq#requirements-and-installation) on the official page and install the latest or the verified version (8adff65) via `git clone`. A normal pip install will only give you the older version 0.10.2.
If you want to fine-tune the model with your own data, you also need to install the standalone version of sentencepiece.

### fairseq-interactive
Since fairseq-interactive has no way to keep the dialogue context, it generates responses from the input sentence alone. This differs from the setting used for fine-tuning and in the paper's experiments, where the context is used, so inappropriate utterances are generated more easily.

In the following command, a small value (10) is used for beam and nbest (the number of output candidates) to make the results easier to read. In actual use, setting them to 20 or more gives better results.

~~~
fairseq-interactive data/sample/bin/ \
 --path checkpoints/persona50k-flat_1.6B_33avog1i_4.16.pt \
 --beam 10 \
 --seed 0 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 10 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0
~~~
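The `--sampling --sampling-topp 0.9` options enable nucleus (top-p) sampling: at each step, sampling is restricted to the smallest set of tokens whose probabilities sum to at least 0.9. A minimal sketch of the idea in plain Python (an illustration, not fairseq's actual implementation; the toy distribution is made up):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, renormalize them, and zero out everything else."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Toy next-token distribution over five tokens (made up for illustration).
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(top_p_filter(probs, p=0.9))  # low-probability tail tokens are zeroed out
```

Lowering `--sampling-topp` makes the output safer but less diverse; `--temperature` similarly trades diversity against reliability.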

### dialog.py
The system utilizes a context of about four utterances, which is equivalent to the setting used for fine-tuning and in the paper's experiments.

~~~
python scripts/dialog.py data/sample/bin/ \
 --path checkpoints/dials5_1e-4_1li20zh5_tw5.143_step85.pt \
 --beam 80 \
 --min-len 10 \
 --source-lang src \
 --target-lang dst \
 --tokenizer space \
 --bpe sentencepiece \
 --sentencepiece-model data/dicts/sp_oall_32k.model \
 --no-repeat-ngram-size 3 \
 --nbest 80 \
 --sampling \
 --sampling-topp 0.9 \
 --temperature 1.0 \
 --show-nbest 5
~~~
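The four-utterance context handling can be pictured as a rolling window over the conversation history. This is a sketch of the idea only, not the actual code in scripts/dialog.py; the `[SEP]` separator and class name are made-up placeholders:

```python
from collections import deque

class DialogContext:
    """Keep only the most recent utterances and join them into a single
    model input line (the ' [SEP] ' separator is an assumed placeholder)."""
    def __init__(self, max_utterances=4):  # roughly the four-utterance setting
        self.history = deque(maxlen=max_utterances)

    def add(self, utterance):
        # deque(maxlen=N) silently drops the oldest turn once full
        self.history.append(utterance)

    def model_input(self):
        return " [SEP] ".join(self.history)

ctx = DialogContext()
for turn in ["こんにちは", "こんにちは、いい天気ですね", "そうですね",
             "週末は何をしますか", "散歩に行きます"]:
    ctx.add(turn)
print(ctx.model_input())  # only the last four turns remain
```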

### Perplexity calculation on a specific dataset
Computes the perplexity (ppl) on a particular dataset.
The lower the ppl, the better the model represents the dialogue in that dataset.

~~~
fairseq-validate $DATA_PATH \
 --path $MODEL_PATH \
 --task translation \
 --source-lang src \
 --target-lang dst \
 --batch-size 2 \
 --ddp-backend no_c10d \
 --valid-subset test \
 --skip-invalid-size-inputs-valid-test
~~~
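Perplexity is the exponentiated average negative log-likelihood per token, so it can be read as "how many tokens the model is effectively choosing between at each step". A self-contained sketch with made-up per-token probabilities (illustrating the definition, not fairseq-validate's internals):

```python
import math

def perplexity(token_probs):
    """ppl = exp(-(1/N) * sum(log p_i)); lower means the model assigns
    higher probability to the tokens actually observed in the dataset."""
    n = len(token_probs)
    nll = -sum(math.log(p) for p in token_probs) / n  # avg negative log-likelihood
    return math.exp(nll)

# Made-up per-token probabilities for two hypothetical models on the same text.
good_model = [0.4, 0.5, 0.3, 0.6]
poor_model = [0.05, 0.1, 0.02, 0.08]
print(perplexity(good_model))   # lower ppl: dataset is well represented
print(perplexity(poor_model))   # higher ppl: dataset is poorly represented
```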

### Finetuning with Persona-chat and EmpatheticDialogues
By fine-tuning the pretrained model on JPersonaChat or JEmpatheticDialogues, you can create a model that is almost identical to the fine-tuned models provided above.

If you have your own dialogue data, you can place it in the same format under data/*/raw and fine-tune on that data. Please note, however, that the LICENSE does not allow releasing or distributing fine-tuned models. You can release your own data and let a third party run the fine-tuning from this model.

#### Downloading and converting datasets

* [JEmpatheticDialogues](https://www.dropbox.com/s/rkzyeu58p48ndz3/japanese_empathetic_dialogues.xlsx?dl=0)
* [JPersonaChat](https://www.dropbox.com/s/sda9wzexh7ntlij/japanese_persona_chat.xlsx?dl=0)

Convert the data from Excel to a simple input-sentence (src) / output-sentence (dst) format, where the same row in src and dst forms a corresponding input/output pair. 50,000 rows are split off and output as the training set.

~~~
python scripts/extract_ed.py japanese_empathetic_dialogues.xlsx data/empdial/raw/
~~~
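The src/dst layout pairs line i of src (the context) with line i of dst (the response to predict). A hedged sketch of producing such a pair of files from one dialogue; the utterances and the space-joined context are illustrative assumptions, not the exact output of extract_ed.py:

```python
# Each dialogue is flattened into (context, response) pairs:
# src line i = the conversation so far, dst line i = the next utterance.
dialogue = ["こんにちは", "こんにちは、元気ですか", "元気です"]

src_lines, dst_lines = [], []
for i in range(1, len(dialogue)):
    src_lines.append(" ".join(dialogue[:i]))  # context up to turn i (assumed join)
    dst_lines.append(dialogue[i])             # the response to predict

with open("train.src", "w", encoding="utf-8") as f:
    f.write("\n".join(src_lines) + "\n")
with open("train.dst", "w", encoding="utf-8") as f:
    f.write("\n".join(dst_lines) + "\n")
```

After conversion, the files are tokenized with the sentencepiece model and binarized with fairseq-preprocess before training.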

---

## License

[LICENSE](LICENSE.md)
checkpoints/.gitkeep ADDED
    #gitkeep file
checkpoints/japanese-dialog-transformer-1.6B-empdial50k.pt ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:345df64292a56eae2c26b8b7ba776d20b1e291dff106592018788a104b4082a1
    size 6767110505
checkpoints/japanese-dialog-transformer-1.6B-persona50k.pt ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:7925190c61179b75559aa0dd7526e8a33417fad18dc9942380379bf77d3e2f29
    size 6767107861
checkpoints/japanese-dialog-transformer-1.6B.pt ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:4552d565683fe1e3329e1a0881f1fc54449626c65d4413dc995be9fafae9553e
    size 6767149869
data/dicts/sp_oall_32k.model ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:234c36dd85db6d160c859c2846938cf623c376974bcd403c6ee1ce7e8a6f5104
    size 854270
data/dicts/sp_oall_32k.txt ADDED
    (diff too large to render; see raw diff)
data/dicts/sp_oall_32k.vocab ADDED
    (diff too large to render; see raw diff)
data/persona_chat/binary/dict.dst.txt ADDED
    (diff too large to render; see raw diff)
data/persona_chat/binary/dict.src.txt ADDED
    (diff too large to render; see raw diff)
data/persona_chat/binary/preprocess.log ADDED
    Namespace(aim_repo=None, aim_run_hash=None, align_suffix=None, alignfile=None, all_gather_list_size=16384, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/binary', dict_only=False, empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, on_cpu_convert_precision=False, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, source_lang='src', srcdict='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/dicts/sp_oall_32k.txt', suppress_crashes=False, target_lang='dst', task='translation', tensorboard_logdir=None, testpref='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/test', tgtdict='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/dicts/sp_oall_32k.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer='space', tpu=False, trainpref='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/train', use_plasma_view=False, user_dir=None, validpref='/mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/valid', wandb_project=None, workers=1)
    [src] Dictionary: 32002 types
    [src] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/train.src: 15008 sents, 747131 tokens, 0.144% replaced (by <unk>)
    [src] Dictionary: 32002 types
    [src] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/valid.src: 3000 sents, 148090 tokens, 0.133% replaced (by <unk>)
    [src] Dictionary: 32002 types
    [src] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/test.src: 3576 sents, 185803 tokens, 0.2% replaced (by <unk>)
    [dst] Dictionary: 32002 types
    [dst] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/train.dst: 15008 sents, 278196 tokens, 0.135% replaced (by <unk>)
    [dst] Dictionary: 32002 types
    [dst] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/valid.dst: 3000 sents, 53550 tokens, 0.136% replaced (by <unk>)
    [dst] Dictionary: 32002 types
    [dst] /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/tokenized/sp_oall_32k/test.dst: 3576 sents, 67946 tokens, 0.203% replaced (by <unk>)
    Wrote preprocessed data to /mnt/disk6/Takisan/NLP/DeepConvResponse/data/persona_chat/binary
data/persona_chat/binary/test.src-dst.dst.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:0f42d2641011ac37990ca727c3ae22512210c776598df346fb5aa2788ef25a6d
    size 135892
data/persona_chat/binary/test.src-dst.dst.idx ADDED
    Binary file (42.9 kB)
data/persona_chat/binary/test.src-dst.src.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:cd47d54d313810a1bfb64607d4148a3fa49de13605f0e6371e853a1cbf849040
    size 371606
data/persona_chat/binary/test.src-dst.src.idx ADDED
    Binary file (42.9 kB)
data/persona_chat/binary/train.src-dst.dst.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:d1cec47a26809b4a6bf88945e31e0d2e34dfdfd63a68c6666abefe5ddd7252d7
    size 556392
data/persona_chat/binary/train.src-dst.dst.idx ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:9aea0d0324e620c3f92c588ac495cdde737ed6ef90d5ed397dd2a394ff78bca9
    size 180122
data/persona_chat/binary/train.src-dst.src.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:549a0068143a3075191f0b8288c9f6778b845fa0eb550a4d9d57f355097c3a41
    size 1494262
data/persona_chat/binary/train.src-dst.src.idx ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:9194ca82c85d3e548ca57fd94dfa78d5cddb85faaa8e9fc110847463fe7a7ae6
    size 180122
data/persona_chat/binary/valid.src-dst.dst.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:70051b714bdd429990ce819ac099110d7bb8dbbffa11f430c862d17bd91aaefb
    size 107100
data/persona_chat/binary/valid.src-dst.dst.idx ADDED
    Binary file (36 kB)
data/persona_chat/binary/valid.src-dst.src.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:89304e93b4d3e08604ed8c82613606d090a82b96e2dbd6ae3f2090ccdf1038be
    size 296180
data/persona_chat/binary/valid.src-dst.src.idx ADDED
    Binary file (36 kB)
data/persona_chat/raw/persona_data.pkl ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:3156b5ba3ed624672c59b14978484ec4d119cd4c36f709c56a16ff1249f7a42e
    size 8814636
data/persona_chat/raw/rest.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/rest.src ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:3779874a70d131062c9b7e2557e3be3b87f5b5e4f90d9e11f01def2832a30f71
    size 11807928
data/persona_chat/raw/test.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/test.src ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/train.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/train.src ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/valid.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/raw/valid.src ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/sp_oall_32k.model ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:234c36dd85db6d160c859c2846938cf623c376974bcd403c6ee1ce7e8a6f5104
    size 854270
data/persona_chat/tokenized/sp_oall_32k/test.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/test.src ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/train.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/train.src ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/valid.dst ADDED
    (diff too large to render; see raw diff)
data/persona_chat/tokenized/sp_oall_32k/valid.src ADDED
    (diff too large to render; see raw diff)
data/sample/bin/dict.dst.txt ADDED
    (diff too large to render; see raw diff)
data/sample/bin/dict.src.txt ADDED
    (diff too large to render; see raw diff)
data/sample/bin/test.src-dst.dst.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:0f42d2641011ac37990ca727c3ae22512210c776598df346fb5aa2788ef25a6d
    size 135892
data/sample/bin/test.src-dst.dst.idx ADDED
    Binary file (42.9 kB)
data/sample/bin/test.src-dst.src.bin ADDED
    version https://git-lfs.github.com/spec/v1
    oid sha256:cd47d54d313810a1bfb64607d4148a3fa49de13605f0e6371e853a1cbf849040
    size 371606
data/sample/bin/test.src-dst.src.idx ADDED
    Binary file (42.9 kB)