mohammadmahdinouri committed on
Commit 43acf85 · Parent: 2955de7

added codes and tokenizers

This view is limited to 50 files because it contains too many changes. See raw diff.
Files changed (50)
  1. CODE_OF_CONDUCT.md +80 -0
  2. CONTRIBUTING.md +31 -0
  3. LICENSE +116 -0
  4. MODEL_CARD.md +81 -0
  5. README.md +55 -0
  6. assets/spiritlm_overview.png +0 -0
  7. checkpoints/README.md +52 -0
  8. checkpoints/speech_tokenizer/hifigan_spiritlm_base/config.json +55 -0
  9. checkpoints/speech_tokenizer/hifigan_spiritlm_base/generator.pt +3 -0
  10. checkpoints/speech_tokenizer/hifigan_spiritlm_base/speakers.txt +4 -0
  11. checkpoints/speech_tokenizer/hifigan_spiritlm_base/styles.txt +34 -0
  12. checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/config.json +60 -0
  13. checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/generator.pt +3 -0
  14. checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/speakers.txt +4 -0
  15. checkpoints/speech_tokenizer/hubert_25hz/L11_quantizer_500.pt +3 -0
  16. checkpoints/speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt +3 -0
  17. checkpoints/speech_tokenizer/style_encoder_w2v2/config.json +321 -0
  18. checkpoints/speech_tokenizer/style_encoder_w2v2/pytorch_model.bin +3 -0
  19. checkpoints/speech_tokenizer/vqvae_f0_quantizer/config.yaml +59 -0
  20. checkpoints/speech_tokenizer/vqvae_f0_quantizer/model.pt +3 -0
  21. data/examples/pred.jsonl +5 -0
  22. data/examples/ref.jsonl +5 -0
  23. env.yml +19 -0
  24. examples/audio/7143-88743-0029.flac +0 -0
  25. examples/distributed_inference_recipe/multi_nodes.slurm +24 -0
  26. examples/distributed_inference_recipe/run_dist.py +89 -0
  27. examples/speech_generation/spirit_model.ipynb +0 -0
  28. examples/speech_tokenizer/spiritlm_speech_tokenizer.ipynb +0 -0
  29. requirements.dev.txt +1 -0
  30. requirements.txt +9 -0
  31. setup.py +60 -0
  32. spiritlm.egg-info/PKG-INFO +84 -0
  33. spiritlm.egg-info/SOURCES.txt +32 -0
  34. spiritlm.egg-info/dependency_links.txt +1 -0
  35. spiritlm.egg-info/not-zip-safe +1 -0
  36. spiritlm.egg-info/requires.txt +15 -0
  37. spiritlm.egg-info/top_level.txt +2 -0
  38. spiritlm/__init__.py +5 -0
  39. spiritlm/__pycache__/__init__.cpython-310.pyc +0 -0
  40. spiritlm/eval/README.md +92 -0
  41. spiritlm/eval/eval_stsp.py +87 -0
  42. spiritlm/eval/load_data.py +50 -0
  43. spiritlm/eval/stsp/few_shot_prompt.py +101 -0
  44. spiritlm/eval/stsp/predict_stsp.py +299 -0
  45. spiritlm/eval/stsp/sanity_check_download.py +30 -0
  46. spiritlm/eval/stsp/sentiment_classifiers.py +37 -0
  47. spiritlm/eval/stsp/stsp_constants.py +12 -0
  48. spiritlm/eval/stsp/utils.py +122 -0
  49. spiritlm/eval/utils.py +17 -0
  50. spiritlm/model/README.md +82 -0
CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,80 @@
1
+ # Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to make participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
9
+ level of experience, education, socio-economic status, nationality, personal
10
+ appearance, race, religion, or sexual identity and orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies within all project spaces, and it also applies when
49
+ an individual is representing the project or its community in public spaces.
50
+ Examples of representing a project or community include using an official
51
+ project e-mail address, posting via an official social media account, or acting
52
+ as an appointed representative at an online or offline event. Representation of
53
+ a project may be further defined and clarified by project maintainers.
54
+
55
+ This Code of Conduct also applies outside the project spaces when there is a
56
+ reasonable belief that an individual's behavior may have a negative impact on
57
+ the project or its community.
58
+
59
+ ## Enforcement
60
+
61
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
62
+ reported by contacting the project team at <[email protected]>. All
63
+ complaints will be reviewed and investigated and will result in a response that
64
+ is deemed necessary and appropriate to the circumstances. The project team is
65
+ obligated to maintain confidentiality with regard to the reporter of an incident.
66
+ Further details of specific enforcement policies may be posted separately.
67
+
68
+ Project maintainers who do not follow or enforce the Code of Conduct in good
69
+ faith may face temporary or permanent repercussions as determined by other
70
+ members of the project's leadership.
71
+
72
+ ## Attribution
73
+
74
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
75
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
76
+
77
+ [homepage]: https://www.contributor-covenant.org
78
+
79
+ For answers to common questions about this code of conduct, see
80
+ https://www.contributor-covenant.org/faq
CONTRIBUTING.md ADDED
@@ -0,0 +1,31 @@
1
+ # Contributing to spiritlm
2
+ We want to make contributing to this project as easy and transparent as
3
+ possible.
4
+
5
+ ## Pull Requests
6
+ We actively welcome your pull requests.
7
+
8
+ 1. Fork the repo and create your branch from `main`.
9
+ 2. If you've added code that should be tested, add tests.
10
+ 3. If you've changed APIs, update the documentation.
11
+ 4. Ensure the test suite passes.
12
+ 5. Make sure your code lints.
13
+ 6. If you haven't already, complete the Contributor License Agreement ("CLA").
14
+
15
+ ## Contributor License Agreement ("CLA")
16
+ In order to accept your pull request, we need you to submit a CLA. You only need
17
+ to do this once to work on any of Facebook's open source projects.
18
+
19
+ Complete your CLA here: <https://code.facebook.com/cla>
20
+
21
+ ## Issues
22
+ We use GitHub issues to track public bugs. Please ensure your description is
23
+ clear and has sufficient instructions to be able to reproduce the issue.
24
+
25
+ Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
26
+ disclosure of security bugs. In those cases, please go through the process
27
+ outlined on that page and do not file a public issue.
28
+
29
+ ## License
30
+ By contributing to spiritlm, you agree that your contributions will be licensed
31
+ under the LICENSE file in the root directory of this source tree.
LICENSE ADDED
@@ -0,0 +1,116 @@
1
+ FAIR Noncommercial Research License
2
+ Last Updated: October 18, 2024
3
+
4
+ “Acceptable Use Policy” means the FAIR Acceptable Use Policy, applicable to Research Materials, that is incorporated into this Agreement.
5
+
6
+ “Agreement” means the terms and conditions for use, reproduction, distribution and modification of the Research Materials set forth herein.
7
+
8
+ “Documentation” means the specifications, manuals and documentation accompanying
9
+ Research Materials distributed by Meta.
10
+
11
+ “Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
12
+
13
+ “Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
14
+
15
+ “Noncommercial Research Uses” means noncommercial research use cases related to research, development, education, processing, or analysis and in each case, is not primarily intended for commercial advantage or monetary compensation to you or others.
16
+
17
+ “Research Materials” means, collectively, Documentation and the models, software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code, demonstration materials and other elements of the foregoing distributed by Meta and made available under this Agreement.
18
+
19
+ By clicking “I Accept” below or by using or distributing any portion or element of the Research Materials, you agree to be bound by this Agreement.
20
+
21
+
22
+ 1. License Rights and Redistribution.
23
+
24
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Research Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Research Materials.
25
+
26
+ b. Redistribution and Use.
27
+ i. You will not use the Research Materials or any outputs or results of the Research Materials in connection with any commercial uses or for any uses other than Noncommercial Research Uses;
28
+
29
+ ii. Distribution of Research Materials, and any derivative works thereof, are subject to the terms of this Agreement. If you distribute or make the Research Materials, or any derivative works thereof, available to a third party, you may only do so under the terms of this Agreement. You shall also provide a copy of this Agreement to such third party.
30
+
31
+ iii. If you submit for publication the results of research you perform on, using, or otherwise in connection with Research Materials, you must acknowledge the use of Research Materials in your publication.
32
+
33
+ iv. Your use of the Research Materials must comply with applicable laws and regulations (including Trade Control Laws) and adhere to the FAIR Acceptable Use Policy, which is hereby incorporated by reference into this Agreement.
34
+
35
+ 2. User Support. Your Noncommercial Research Use of the Research Materials is done at your own discretion; Meta does not process any information nor provide any service in relation to such use. Meta is under no obligation to provide any support services for the Research Materials. Any support provided is “as is”, “with all faults”, and without warranty of any kind.
36
+
37
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE RESEARCH MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE RESEARCH MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE RESEARCH MATERIALS AND ANY OUTPUT AND RESULTS.
38
+
39
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT OR INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
40
+
41
+ 5. Intellectual Property.
42
+
43
+ a. Subject to Meta’s ownership of Research Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the Research Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
44
+
45
+ b. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Research Materials, outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Research Materials.
46
+
47
+ 6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the Research Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the Research Materials. Sections 5, 6 and 9 shall survive the termination of this Agreement.
48
+
49
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
50
+
51
+ 8. Modifications and Amendments. Meta may modify this Agreement from time to time by posting a revised version at [https://github.com/facebookresearch/spiritlm/blob/main/LICENSE]; provided that they are similar in spirit to the current version of the Agreement, but may differ in detail to address new problems or concerns. All such changes will be effective immediately. Your continued use of the Research Materials after any modification to this Agreement constitutes your agreement to such modification. Except as provided in this Agreement, no modification or addition to any provision of this Agreement will be binding unless it is in writing and signed by an authorized representative of both you and Meta.
52
+
53
+
54
+ FAIR Acceptable Use Policy
55
+
56
+ The Fundamental AI Research (FAIR) team at Meta seeks to further understanding of new and existing research domains with the mission of advancing the state-of-the-art in artificial intelligence through open research for the benefit of all.
57
+
58
+ As part of this mission, Meta makes certain research materials available for noncommercial research use. Meta is committed to promoting the safe and responsible use of such research materials.
59
+
60
+
61
+ Prohibited Uses
62
+
63
+ You agree you will not use, or allow others to use, Research Materials to:
64
+
65
+ 1. Violate the law or others’ rights, including to:
66
+ a. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
67
+ i. Violence or terrorism
68
+ ii. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
69
+ iii. Human trafficking, exploitation, and sexual violence
70
+ iv. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
71
+ v. Sexual solicitation
72
+ vi. Any other criminal activity
73
+
74
+ b. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
75
+
76
+ c. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
77
+
78
+ d. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
79
+
80
+ e. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
81
+
82
+ f. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any technology using FAIR research materials
83
+
84
+ g. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
85
+
86
+ 2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of research artifacts related to the following:
87
+
88
+ a. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
89
+
90
+ b. Guns and illegal weapons (including weapon development)
91
+
92
+ c. Illegal drugs and regulated/controlled substances
93
+
94
+ d. Operation of critical infrastructure, transportation technologies, or heavy machinery
95
+
96
+ e. Self-harm or harm to others, including suicide, cutting, and eating disorders
97
+
98
+ f. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
99
+
100
+ 3. Intentionally deceive or mislead others, including use of FAIR Research Materials related to the following:
101
+
102
+ a. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
103
+
104
+ b. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
105
+
106
+ c. Generating, promoting, or further distributing spam
107
+
108
+ d. Impersonating another individual without consent, authorization, or legal right
109
+
110
+ e. Representing that outputs of FAIR research materials or outputs from technology using FAIR research materials are human-generated
111
+
112
+ f. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
113
+
114
+ 4. Fail to appropriately disclose to end users any known dangers of your Research Materials.
115
+
116
+ Please report any violation of this Policy or other problems that could lead to a violation of this Policy by submitting a report here [https://docs.google.com/forms/d/e/1FAIpQLSeb11cryAopJ7LNrC4nxEUXrHY26hfkXQMf_uH-oFgA3WlYZQ/viewform].
MODEL_CARD.md ADDED
@@ -0,0 +1,81 @@
1
+ # Meta Spirit LM Model Card
2
+
3
+ ## Model Details
4
+
5
+ *Note: Use of this model is governed by the FAIR Noncommercial Research License.*
6
+
7
+ Spirit LM is a multimodal language model that freely mixes text and speech. The model can be prompted with either text or speech and is capable of generating outputs in either modality, while preserving the expressivity of the input prompt. The model is also able to learn new tasks across modalities such as automatic speech recognition, text-to-speech, and speech classification in a few-shot manner.
8
+
9
+ ## Model Developers
10
+ Meta
11
+
12
+ ## Variations
13
+ Spirit LM comes in two versions: Spirit LM Base that uses speech phonetic tokens and Spirit LM Expressive that models expressivity using pitch and style tokens in addition to the phonetic tokens.
14
+
15
+ ## Input
16
+ The models take as input text, speech, or a mixed sequence of the two.
17
+
18
+ ## Output
19
+ The models generate text, speech, or a mixed sequence of the two.
20
+
21
+ ## Model Architecture
22
+ ### Speech Tokenizer
23
+ Spirit LM uses 3 types of speech tokenizers: Phonetic Tokenizer (HuBERT), Pitch Tokenizer (VQ-VAE) and Style Tokenizer (Speechprop or Wav2vec2). We use Hifi-GAN to convert the speech tokens back to audio.
24
+
25
+ It is worth noting that in the associated paper, for Spirit LM Expressive, we used Speechprop to extract style tokens, while we use a Wav2vec2 model to extract style tokens in this release.
26
+
27
+ | | Model | Parameters | Input | Output |
28
+ |------------------------|--------------------------|------------|---------------------|--------------------|
29
+ | Phonetic Tokenizer | HuBERT+LinearQuantizer | 96M | Waveform | Phonetic Tokens |
30
+ | Pitch Tokenizer | VQ-VAE | 0.2M | Extracted F0 | Pitch Tokens |
31
+ | Style Tokenizer | Wav2vec2+LinearProjection| 95M | Waveform | Style Tokens |
32
+ | Base Speech Decoder | Hifi-GAN | 14M | Phonetic Tokens | Waveform |
33
+ | Expressive Speech Decoder | Hifi-GAN | 15M | Phonetic, Pitch, Style Tokens | Waveform |
34
+
35
+ ### Language Model
36
+ Spirit LM is initialized from the Llama-2 7B model.
37
+
38
+ | | Architecture | Parameters | Input/Output Tokens | Vocab Size |
39
+ |----------------------|----------------|------------|----------------------------------------------------------|------------|
40
+ | Spirit LM Base | Llama-2 7B | 7B | Text Tokens, Phonetic Tokens | 32512 |
41
+ | Spirit LM Expressive | Llama-2 7B | 7B | Text Tokens, Phonetic Tokens, Pitch Tokens, Style Tokens | 32768 |
42
+
43
+ ### Release Date
44
+ The models were trained between October and December 2023. The research paper was released on February 8th 2024. We released the model on October 18th 2024.
45
+
46
+ ### Status
47
+ This is a static model trained on an offline dataset.
48
+
49
+ ### License
50
+ We release the model under the FAIR Noncommercial Research License found in the [LICENSE](LICENSE) file in the root directory of this repo.
51
+
52
+ ### Research Paper
53
+ More information can be found in the paper ["SpiRit-LM: Interleaved Spoken and Written Language Model"](https://arxiv.org/pdf/2402.05755.pdf).
54
+
55
+ ## Hardware and Software
56
+ ### Training Factors
57
+ We used custom training libraries. The training of the released models has been performed on Meta’s Research Clusters.
58
+
59
+ The training of each model (Spirit LM Base and Spirit LM Expressive) takes 21K GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W), not including the training of Llama-2.
60
+
61
+ ## Training Data
62
+ We trained the models on a combination of text-only datasets, speech-only datasets and aligned speech-text datasets. All the speech datasets are publicly available. Here are the statistics of the datasets we used:
63
+
64
+ | | Hours | Speech Tokens | Text Tokens |
65
+ |--------------|-------|---------------|-------------|
66
+ | Speech-only | 458K | 28.2B | - |
67
+ | Speech+Text | 111K | 7.0B | 1.4B |
68
+ | Text-only | - | - | 307B |
69
+
70
+ ## Evaluation Results
71
+ See evaluations for our models and detailed ablations in Section 4 and 5, and safety evaluations in Section 6 of the [research paper](https://arxiv.org/pdf/2402.05755.pdf).
72
+
73
+ ## Intended Use
74
+ ### Intended Use Cases
75
+ Spirit LM is intended for noncommercial research use in English.
76
+
77
+ ### Out-of-Scope Uses
78
+ Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English. Use in any other way that is prohibited by the FAIR Noncommercial Research License and Acceptable Use Policy.
79
+
80
+ ## Ethical Considerations and Limitations
81
+ This model is built on Llama 2, which carries risks with use. Testing conducted to date has been in English and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. The model’s speech capabilities are designed to analyze speaker-agnostic qualities of any input speech and to output speech in one of four pre-set voices. The model is meant for noncommercial research purposes only and should not be deployed in any consumer-facing applications.
README.md ADDED
@@ -0,0 +1,55 @@
1
+ # Meta Spirit LM: Interleaved Spoken and Written Language Model
2
+
3
+ This repository contains the model weights, inference code and evaluation scripts for the Spirit LM [paper](https://arxiv.org/pdf/2402.05755.pdf). You can find more generation samples on our [demo page](https://speechbot.github.io/spiritlm/).
4
+
5
+ ## Spirit LM Model Overview
6
+ <img src="assets/spiritlm_overview.png">
7
+
8
+ ## Installation Setup
9
+ ### Conda
10
+ ```
11
+ conda env create -f env.yml
12
+ pip install -e '.[eval]'
13
+
14
+ ```
15
+ ### Pip
16
+ ```
17
+ pip install -e '.[eval]'
18
+ ```
19
+
20
+ ### Dev
21
+ (Optional: only needed if you want to run the tests.)
22
+ ```
23
+ pip install -e '.[dev]'
24
+ ```
25
+
26
+ ## Checkpoints Setup
27
+ See [checkpoints/README.md](checkpoints/README.md)
28
+
29
+ ## Quick Start
30
+ ### Speech Tokenization
31
+ See [spiritlm/speech_tokenizer/README.md](spiritlm/speech_tokenizer/README.md)
32
+ ### Spirit LM Generation
33
+ See [spiritlm/model/README.md](spiritlm/model/README.md)
34
+ ### Speech-Text Sentiment Preservation benchmark (STSP)
35
+ See [spiritlm/eval/README.md](spiritlm/eval/README.md)
36
+
37
+ ## Model Card
38
+ More details of the model can be found in [MODEL_CARD.md](MODEL_CARD.md).
39
+
40
+ ## License
41
+ The present code is provided under the **FAIR Noncommercial Research License** found in [LICENSE](LICENSE).
42
+
43
+ ## Citation
44
+ ```
45
+ @misc{nguyen2024spiritlminterleavedspokenwritten,
46
+ title={SpiRit-LM: Interleaved Spoken and Written Language Model},
47
+ author={Tu Anh Nguyen and Benjamin Muller and Bokai Yu and Marta R. Costa-jussa and Maha Elbayad and Sravya Popuri and Paul-Ambroise Duquenne and Robin Algayres and Ruslan Mavlyutov and Itai Gat and Gabriel Synnaeve and Juan Pino and Benoit Sagot and Emmanuel Dupoux},
48
+ year={2024},
49
+ eprint={2402.05755},
50
+ archivePrefix={arXiv},
51
+ primaryClass={cs.CL},
52
+ url={https://arxiv.org/abs/2402.05755},
53
+ }
54
+ ```
55
+
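A minimal, non-distributed generation sketch is shown below for readers browsing this diff. It reuses the classes imported in examples/distributed_inference_recipe/run_dist.py later in this commit (`Spiritlm`, `GenerationInput`, `ContentType`, `OutputModality`); treat it as an illustration under those assumptions rather than the documented API in spiritlm/model/README.md.
```
# Illustrative sketch only; the supported usage is documented in spiritlm/model/README.md.
# Class and argument names mirror examples/distributed_inference_recipe/run_dist.py.
import torchaudio
from transformers import GenerationConfig
from spiritlm.model.spiritlm_model import (
    ContentType,
    GenerationInput,
    OutputModality,
    Spiritlm,
)

# 16 kHz example waveform shipped with the repo.
wav = torchaudio.load("examples/audio/7143-88743-0029.flac")[0].squeeze()

# Requires checkpoints/spiritlm_model/spirit-lm-expressive-7b (see checkpoints/README.md).
spirit_lm = Spiritlm("spirit-lm-expressive-7b")

outs = spirit_lm.generate(
    output_modality=OutputModality.ARBITRARY,  # model may continue in text and/or speech
    interleaved_inputs=[GenerationInput(content=wav, content_type=ContentType.SPEECH)],
    generation_config=GenerationConfig(
        temperature=0.9, top_p=0.95, max_new_tokens=200, do_sample=True
    ),
)
print(outs)
```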
assets/spiritlm_overview.png ADDED
checkpoints/README.md ADDED
@@ -0,0 +1,52 @@
1
+ # Spirit LM Checkpoints
2
+
3
+ ## Download Checkpoints
4
+ To access and download Spirit LM Checkpoints, please request the model artifacts in this link:
5
+
6
+ [https://ai.meta.com/resources/models-and-libraries/spirit-lm-downloads/](https://ai.meta.com/resources/models-and-libraries/spirit-lm-downloads/)
7
+
8
+ Upon approval, you will then receive an email with download links to each model artifact.
9
+
10
+ Please note that Spirit LM is made available under the **FAIR Noncommercial Research License**
11
+ found in the [LICENSE](../LICENSE) file in the root directory of this source tree, together with the accompanying Acceptable Use Policy.
12
+
13
+ ## Structure
14
+ The checkpoints directory should look like this:
15
+ ```
16
+ checkpoints/
17
+ ├── README.md
18
+ ├── speech_tokenizer
19
+ │   ├── hifigan_spiritlm_base
20
+ │   │   ├── config.json
21
+ │   │   ├── generator.pt
22
+ │   │   ├── speakers.txt
23
+ │   │   └── styles.txt
24
+ │   ├── hifigan_spiritlm_expressive_w2v2
25
+ │   │   ├── config.json
26
+ │   │   ├── generator.pt
27
+ │   │   └── speakers.txt
28
+ │   ├── hubert_25hz
29
+ │   │   ├── L11_quantizer_500.pt
30
+ │   │   └── mhubert_base_25hz.pt
31
+ │   ├── style_encoder_w2v2
32
+ │   │   ├── config.json
33
+ │   │   └── pytorch_model.bin
34
+ │   └── vqvae_f0_quantizer
35
+ │   ├── config.yaml
36
+ │   └── model.pt
37
+ └── spiritlm_model
38
+ ├── spirit-lm-base-7b
39
+ │   ├── config.json
40
+ │   ├── generation_config.json
41
+ │   ├── pytorch_model.bin
42
+ │   ├── special_tokens_map.json
43
+ │   ├── tokenizer_config.json
44
+ │   └── tokenizer.model
45
+ └── spirit-lm-expressive-7b
46
+ ├── config.json
47
+ ├── generation_config.json
48
+ ├── pytorch_model.bin
49
+ ├── special_tokens_map.json
50
+ ├── tokenizer_config.json
51
+ └── tokenizer.model
52
+ ```
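As a lightweight companion to the tree above (the repository also ships spiritlm/eval/stsp/sanity_check_download.py for the evaluation data), a quick check that the required checkpoint files are in place could look like the sketch below; the path list is copied from the tree and everything else is an assumption.
```
from pathlib import Path

# Key weight files from the directory tree above (configs omitted for brevity).
EXPECTED = [
    "speech_tokenizer/hifigan_spiritlm_base/generator.pt",
    "speech_tokenizer/hifigan_spiritlm_expressive_w2v2/generator.pt",
    "speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt",
    "speech_tokenizer/hubert_25hz/L11_quantizer_500.pt",
    "speech_tokenizer/style_encoder_w2v2/pytorch_model.bin",
    "speech_tokenizer/vqvae_f0_quantizer/model.pt",
    "spiritlm_model/spirit-lm-base-7b/pytorch_model.bin",
    "spiritlm_model/spirit-lm-expressive-7b/pytorch_model.bin",
]


def check(root: str = "checkpoints") -> None:
    # Report any expected checkpoint file that is missing under `root`.
    missing = [p for p in EXPECTED if not (Path(root) / p).exists()]
    if missing:
        print("Missing checkpoint files:")
        for p in missing:
            print(f"  - {p}")
    else:
        print("All expected checkpoint files found.")


if __name__ == "__main__":
    check()
```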
checkpoints/speech_tokenizer/hifigan_spiritlm_base/config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+
3
+ "resblock": "1",
4
+ "num_gpus": 0,
5
+ "batch_size": 16,
6
+ "learning_rate": 0.0002,
7
+ "adam_b1": 0.8,
8
+ "adam_b2": 0.99,
9
+ "lr_decay": 0.999,
10
+ "seed": 1234,
11
+
12
+ "upsample_rates": [5,4,4,4,2],
13
+ "upsample_kernel_sizes": [11,8,8,8,4],
14
+ "upsample_initial_channel": 512,
15
+ "resblock_kernel_sizes": [3,7,11],
16
+ "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
17
+ "num_embeddings": 501,
18
+ "embedding_dim": 128,
19
+ "model_in_dim": 384,
20
+
21
+ "segment_size": 8960,
22
+ "code_hop_size": 640,
23
+ "f0": false,
24
+ "num_mels": 80,
25
+ "num_freq": 1025,
26
+ "n_fft": 1024,
27
+ "hop_size": 256,
28
+ "win_size": 1024,
29
+
30
+ "multispkr": "from_input_file",
31
+ "num_speakers": 4,
32
+ "multistyle": "from_input_file",
33
+ "num_styles": 34,
34
+
35
+ "dur_prediction_weight": 1.0,
36
+ "dur_predictor_params": {
37
+ "encoder_embed_dim": 128,
38
+ "var_pred_hidden_dim": 128,
39
+ "var_pred_kernel_size": 3,
40
+ "var_pred_dropout": 0.5
41
+ },
42
+
43
+ "sampling_rate": 16000,
44
+
45
+ "fmin": 0,
46
+ "fmax": 8000,
47
+ "fmax_for_loss": null,
48
+
49
+ "num_workers": 4,
50
+
51
+ "dist_config": {
52
+ "dist_backend": "nccl",
53
+ "dist_url": "env://"
54
+ }
55
+ }
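A note on the numbers above: at sampling_rate 16000, code_hop_size 640 means one phonetic unit every 640 / 16000 = 0.04 s, i.e. the 25 Hz rate of the hubert_25hz tokenizer, and num_embeddings 501 lines up with the 500-cluster L11_quantizer_500.pt (presumably plus one reserved index). num_speakers 4 and num_styles 34 match the speakers.txt and styles.txt files that follow.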
checkpoints/speech_tokenizer/hifigan_spiritlm_base/generator.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d66c49067aeff93b14b038f143c2dc9ed981671956512d8e702897416f13c459
3
+ size 57512631
checkpoints/speech_tokenizer/hifigan_spiritlm_base/speakers.txt ADDED
@@ -0,0 +1,4 @@
1
+ ex04
2
+ ex02
3
+ ex03
4
+ ex01
checkpoints/speech_tokenizer/hifigan_spiritlm_base/styles.txt ADDED
@@ -0,0 +1,34 @@
1
+ read-default
2
+ read-enunciated
3
+ read-confused
4
+ read-laughing
5
+ read-whisper
6
+ read-sad
7
+ read-happy
8
+ conv-projected
9
+ conv-default
10
+ conv-sympathetic
11
+ conv-fast
12
+ conv-disgusted
13
+ conv-laughing
14
+ conv-calm
15
+ conv-sarcastic
16
+ conv-whisper
17
+ conv-angry
18
+ conv-sad
19
+ conv-happy
20
+ conv-enunciated
21
+ conv-awe
22
+ read-singing
23
+ conv-confused
24
+ conv-fearful
25
+ conv-narration
26
+ conv-sleepy
27
+ conv-child
28
+ conv-animal
29
+ conv-childdir
30
+ conv-animaldir
31
+ conv-bored
32
+ conv-desire
33
+ conv-nonverbal
34
+ read-narration
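(The 34 entries above match num_styles: 34 in the hifigan_spiritlm_base config, and the four speaker IDs in speakers.txt match num_speakers: 4.)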
checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/config.json ADDED
@@ -0,0 +1,60 @@
1
+ {
2
+
3
+ "resblock": "1",
4
+ "num_gpus": 0,
5
+ "batch_size": 128,
6
+ "learning_rate": 0.0002,
7
+ "adam_b1": 0.8,
8
+ "adam_b2": 0.99,
9
+ "lr_decay": 0.999,
10
+ "seed": 1234,
11
+
12
+ "upsample_rates": [5,4,4,4,2],
13
+ "upsample_kernel_sizes": [11,8,8,8,4],
14
+ "upsample_initial_channel": 512,
15
+ "resblock_kernel_sizes": [3,7,11],
16
+ "resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
17
+
18
+ "multispkr": "from_input_file",
19
+ "multistyle": null,
20
+
21
+ "dur_prediction_weight": 1.0,
22
+ "dur_predictor_params": {
23
+ "encoder_embed_dim": 128,
24
+ "var_pred_hidden_dim": 128,
25
+ "var_pred_kernel_size": 3,
26
+ "var_pred_dropout": 0.5
27
+ },
28
+
29
+ "segment_size": 17920,
30
+ "code_hop_size": 640,
31
+ "f0_hop_size": 1280,
32
+ "style_hop_size": 16000,
33
+
34
+ "num_embeddings": 501,
35
+ "num_f0_tokens": 64,
36
+ "num_style_tokens": 100,
37
+ "num_speakers": 4,
38
+
39
+ "embedding_dim": 128,
40
+ "model_in_dim": 512,
41
+
42
+ "num_mels": 80,
43
+ "num_freq": 1025,
44
+ "n_fft": 1024,
45
+ "hop_size": 256,
46
+ "win_size": 1024,
47
+
48
+ "sampling_rate": 16000,
49
+
50
+ "fmin": 0,
51
+ "fmax": 8000,
52
+ "fmax_for_loss": null,
53
+
54
+ "num_workers": 4,
55
+
56
+ "dist_config": {
57
+ "dist_backend": "nccl",
58
+ "dist_url": "env://"
59
+ }
60
+ }
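With the same 16 kHz sampling rate, the hop sizes above imply one phonetic token every 640 / 16000 = 40 ms (25 Hz), one pitch token every 1280 / 16000 = 80 ms (12.5 Hz, drawn from num_f0_tokens = 64), and one style token every style_hop_size = 16000 samples, i.e. one per second (drawn from num_style_tokens = 100).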
checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/generator.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9de97cbc336e6113c27f17560988a22e76df25af6a8e944ad01676aabec31326
3
+ size 59414584
checkpoints/speech_tokenizer/hifigan_spiritlm_expressive_w2v2/speakers.txt ADDED
@@ -0,0 +1,4 @@
1
+ ex04
2
+ ex02
3
+ ex03
4
+ ex01
checkpoints/speech_tokenizer/hubert_25hz/L11_quantizer_500.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:06b408c0a487f0218e8aba52ca30d9de54e0c36af8bebfd151265647d221080b
3
+ size 5222060
checkpoints/speech_tokenizer/hubert_25hz/mhubert_base_25hz.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1421f060cf92b9d2ea72dcecd7f30ce549e1112e24b62b43f2ec3026301051bb
3
+ size 383333938
checkpoints/speech_tokenizer/style_encoder_w2v2/config.json ADDED
@@ -0,0 +1,321 @@
1
+ {
2
+ "_name_or_path": "facebook/wav2vec2-base",
3
+ "activation_dropout": 0.0,
4
+ "adapter_kernel_size": 3,
5
+ "adapter_stride": 2,
6
+ "add_adapter": false,
7
+ "apply_spec_augment": true,
8
+ "architectures": [
9
+ "Wav2Vec2ForPooledSequenceClassification"
10
+ ],
11
+ "attention_dropout": 0.1,
12
+ "bos_token_id": 1,
13
+ "classifier_proj_size": 256,
14
+ "codevector_dim": 256,
15
+ "contrastive_logits_temperature": 0.1,
16
+ "conv_bias": false,
17
+ "conv_dim": [
18
+ 512,
19
+ 512,
20
+ 512,
21
+ 512,
22
+ 512,
23
+ 512,
24
+ 512
25
+ ],
26
+ "conv_kernel": [
27
+ 10,
28
+ 3,
29
+ 3,
30
+ 3,
31
+ 3,
32
+ 2,
33
+ 2
34
+ ],
35
+ "conv_stride": [
36
+ 5,
37
+ 2,
38
+ 2,
39
+ 2,
40
+ 2,
41
+ 2,
42
+ 2
43
+ ],
44
+ "ctc_loss_reduction": "sum",
45
+ "ctc_zero_infinity": false,
46
+ "diversity_loss_weight": 0.1,
47
+ "do_stable_layer_norm": false,
48
+ "eos_token_id": 2,
49
+ "feat_extract_activation": "gelu",
50
+ "feat_extract_norm": "group",
51
+ "feat_proj_dropout": 0.1,
52
+ "feat_quantizer_dropout": 0.0,
53
+ "final_dropout": 0.0,
54
+ "freeze_feat_extract_train": true,
55
+ "hidden_act": "gelu",
56
+ "hidden_dropout": 0.1,
57
+ "hidden_size": 768,
58
+ "id2label": {
59
+ "0": "0",
60
+ "1": "1",
61
+ "10": "10",
62
+ "11": "11",
63
+ "12": "12",
64
+ "13": "13",
65
+ "14": "14",
66
+ "15": "15",
67
+ "16": "16",
68
+ "17": "17",
69
+ "18": "18",
70
+ "19": "19",
71
+ "2": "2",
72
+ "20": "20",
73
+ "21": "21",
74
+ "22": "22",
75
+ "23": "23",
76
+ "24": "24",
77
+ "25": "25",
78
+ "26": "26",
79
+ "27": "27",
80
+ "28": "28",
81
+ "29": "29",
82
+ "3": "3",
83
+ "30": "30",
84
+ "31": "31",
85
+ "32": "32",
86
+ "33": "33",
87
+ "34": "34",
88
+ "35": "35",
89
+ "36": "36",
90
+ "37": "37",
91
+ "38": "38",
92
+ "39": "39",
93
+ "4": "4",
94
+ "40": "40",
95
+ "41": "41",
96
+ "42": "42",
97
+ "43": "43",
98
+ "44": "44",
99
+ "45": "45",
100
+ "46": "46",
101
+ "47": "47",
102
+ "48": "48",
103
+ "49": "49",
104
+ "5": "5",
105
+ "50": "50",
106
+ "51": "51",
107
+ "52": "52",
108
+ "53": "53",
109
+ "54": "54",
110
+ "55": "55",
111
+ "56": "56",
112
+ "57": "57",
113
+ "58": "58",
114
+ "59": "59",
115
+ "6": "6",
116
+ "60": "60",
117
+ "61": "61",
118
+ "62": "62",
119
+ "63": "63",
120
+ "64": "64",
121
+ "65": "65",
122
+ "66": "66",
123
+ "67": "67",
124
+ "68": "68",
125
+ "69": "69",
126
+ "7": "7",
127
+ "70": "70",
128
+ "71": "71",
129
+ "72": "72",
130
+ "73": "73",
131
+ "74": "74",
132
+ "75": "75",
133
+ "76": "76",
134
+ "77": "77",
135
+ "78": "78",
136
+ "79": "79",
137
+ "8": "8",
138
+ "80": "80",
139
+ "81": "81",
140
+ "82": "82",
141
+ "83": "83",
142
+ "84": "84",
143
+ "85": "85",
144
+ "86": "86",
145
+ "87": "87",
146
+ "88": "88",
147
+ "89": "89",
148
+ "9": "9",
149
+ "90": "90",
150
+ "91": "91",
151
+ "92": "92",
152
+ "93": "93",
153
+ "94": "94",
154
+ "95": "95",
155
+ "96": "96",
156
+ "97": "97",
157
+ "98": "98",
158
+ "99": "99"
159
+ },
160
+ "initializer_range": 0.02,
161
+ "intermediate_size": 3072,
162
+ "label2id": {
163
+ "0": "0",
164
+ "1": "1",
165
+ "10": "10",
166
+ "11": "11",
167
+ "12": "12",
168
+ "13": "13",
169
+ "14": "14",
170
+ "15": "15",
171
+ "16": "16",
172
+ "17": "17",
173
+ "18": "18",
174
+ "19": "19",
175
+ "2": "2",
176
+ "20": "20",
177
+ "21": "21",
178
+ "22": "22",
179
+ "23": "23",
180
+ "24": "24",
181
+ "25": "25",
182
+ "26": "26",
183
+ "27": "27",
184
+ "28": "28",
185
+ "29": "29",
186
+ "3": "3",
187
+ "30": "30",
188
+ "31": "31",
189
+ "32": "32",
190
+ "33": "33",
191
+ "34": "34",
192
+ "35": "35",
193
+ "36": "36",
194
+ "37": "37",
195
+ "38": "38",
196
+ "39": "39",
197
+ "4": "4",
198
+ "40": "40",
199
+ "41": "41",
200
+ "42": "42",
201
+ "43": "43",
202
+ "44": "44",
203
+ "45": "45",
204
+ "46": "46",
205
+ "47": "47",
206
+ "48": "48",
207
+ "49": "49",
208
+ "5": "5",
209
+ "50": "50",
210
+ "51": "51",
211
+ "52": "52",
212
+ "53": "53",
213
+ "54": "54",
214
+ "55": "55",
215
+ "56": "56",
216
+ "57": "57",
217
+ "58": "58",
218
+ "59": "59",
219
+ "6": "6",
220
+ "60": "60",
221
+ "61": "61",
222
+ "62": "62",
223
+ "63": "63",
224
+ "64": "64",
225
+ "65": "65",
226
+ "66": "66",
227
+ "67": "67",
228
+ "68": "68",
229
+ "69": "69",
230
+ "7": "7",
231
+ "70": "70",
232
+ "71": "71",
233
+ "72": "72",
234
+ "73": "73",
235
+ "74": "74",
236
+ "75": "75",
237
+ "76": "76",
238
+ "77": "77",
239
+ "78": "78",
240
+ "79": "79",
241
+ "8": "8",
242
+ "80": "80",
243
+ "81": "81",
244
+ "82": "82",
245
+ "83": "83",
246
+ "84": "84",
247
+ "85": "85",
248
+ "86": "86",
249
+ "87": "87",
250
+ "88": "88",
251
+ "89": "89",
252
+ "9": "9",
253
+ "90": "90",
254
+ "91": "91",
255
+ "92": "92",
256
+ "93": "93",
257
+ "94": "94",
258
+ "95": "95",
259
+ "96": "96",
260
+ "97": "97",
261
+ "98": "98",
262
+ "99": "99"
263
+ },
264
+ "layer_norm_eps": 1e-05,
265
+ "layerdrop": 0.0,
266
+ "mask_channel_length": 10,
267
+ "mask_channel_min_space": 1,
268
+ "mask_channel_other": 0.0,
269
+ "mask_channel_prob": 0.0,
270
+ "mask_channel_selection": "static",
271
+ "mask_feature_length": 10,
272
+ "mask_feature_min_masks": 0,
273
+ "mask_feature_prob": 0.0,
274
+ "mask_time_length": 10,
275
+ "mask_time_min_masks": 2,
276
+ "mask_time_min_space": 1,
277
+ "mask_time_other": 0.0,
278
+ "mask_time_prob": 0.05,
279
+ "mask_time_selection": "static",
280
+ "model_type": "wav2vec2",
281
+ "no_mask_channel_overlap": false,
282
+ "no_mask_time_overlap": false,
283
+ "num_adapter_layers": 3,
284
+ "num_attention_heads": 12,
285
+ "num_codevector_groups": 2,
286
+ "num_codevectors_per_group": 320,
287
+ "num_conv_pos_embedding_groups": 16,
288
+ "num_conv_pos_embeddings": 128,
289
+ "num_feat_extract_layers": 7,
290
+ "num_hidden_layers": 12,
291
+ "num_negatives": 100,
292
+ "output_hidden_size": 768,
293
+ "pad_token_id": 0,
294
+ "proj_codevector_dim": 256,
295
+ "tdnn_dilation": [
296
+ 1,
297
+ 2,
298
+ 3,
299
+ 1,
300
+ 1
301
+ ],
302
+ "tdnn_dim": [
303
+ 512,
304
+ 512,
305
+ 512,
306
+ 512,
307
+ 1500
308
+ ],
309
+ "tdnn_kernel": [
310
+ 5,
311
+ 3,
312
+ 3,
313
+ 1,
314
+ 1
315
+ ],
316
+ "torch_dtype": "float32",
317
+ "transformers_version": "4.25.1",
318
+ "use_weighted_layer_sum": false,
319
+ "vocab_size": 32,
320
+ "xvector_output_dim": 512
321
+ }
checkpoints/speech_tokenizer/style_encoder_w2v2/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:819b7a87bfcf4041252c3f1825915924e10d5a06a2a66da546beaca98c7ab8bc
3
+ size 378451177
checkpoints/speech_tokenizer/vqvae_f0_quantizer/config.yaml ADDED
@@ -0,0 +1,59 @@
1
+
2
+ seed: 1234
3
+
4
+ # Data
5
+ f0_path: ''
6
+ p_train: 0.95
7
+ min_frames: null
8
+ batch_size: 128
9
+ features: f0_interp,vuv
10
+ out_features: norm_f0_interp,vuv
11
+ segment_size: null
12
+ segment_multi: 16
13
+ num_workers: 4
14
+ vuv_scale: 2
15
+ speaker_stats: ''
16
+ recon_loss_fn: l1_loss
17
+
18
+
19
+ # Optimization
20
+ learning_rate: 0.0002
21
+ adam_b1: 0.8
22
+ adam_b2: 0.99
23
+ lr_decay: 0.999
24
+ lambda_commit: 0.02
25
+
26
+ # VQ params
27
+ vq_params:
28
+ l_bins: 64
29
+ emb_width: 128
30
+ mu: 0.99
31
+ levels: 1
32
+
33
+ # Encoder params
34
+ encoder_params:
35
+ input_emb_width: 2
36
+ output_emb_width: 128
37
+ levels: 1
38
+ downs_t:
39
+ - 4
40
+ strides_t:
41
+ - 2
42
+ width: 32
43
+ depth: 4
44
+ m_conv: 1.0
45
+ dilation_growth_rate: 3
46
+
47
+ # Decoder params
48
+ decoder_params:
49
+ input_emb_width: 2
50
+ output_emb_width: 128
51
+ levels: 1
52
+ downs_t:
53
+ - 4
54
+ strides_t:
55
+ - 2
56
+ width: 32
57
+ depth: 4
58
+ m_conv: 1.0
59
+ dilation_growth_rate: 3
checkpoints/speech_tokenizer/vqvae_f0_quantizer/model.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f4321b0c19b47279ab3d76c4b6e85bbc439156a4dd6478919856accaf4180382
3
+ size 2600601
data/examples/pred.jsonl ADDED
@@ -0,0 +1,5 @@
1
+ {"pred": "angry", "id": 4792320029370491913}
2
+ {"pred": "neutral", "id": -5682350483296949563}
3
+ {"pred": "amused", "id": -8754508989367964614}
4
+ {"pred": "angry", "id": -9018665079841831624}
5
+ {"pred": "neutral", "id": 1159246029716120600}
data/examples/ref.jsonl ADDED
@@ -0,0 +1,5 @@
1
+ {"emotion": "angry", "sentiment": "negative", "wav_path": "emov/sam/Angry/anger_281-308_0286.wav", "split": "test", "speaker": "sam", "id": 4792320029370491913}
2
+ {"emotion": "neutral", "sentiment": "neutral", "wav_path": "emov/sam/Neutral/neutral_281-308_0286.wav", "split": "test", "speaker": "sam", "id": -5682350483296949563}
3
+ {"emotion": "amused", "sentiment": "positive", "wav_path": "emov/sam/Amused/amused_281-308_0286.wav", "split": "test", "speaker": "sam", "id": -8754508989367964614}
4
+ {"emotion": "angry", "sentiment": "negative", "wav_path": "emov/jenie/Angry/anger_57-84_0084.wav", "split": "test", "speaker": "jenie", "id": -9018665079841831624}
5
+ {"emotion": "neutral", "sentiment": "neutral", "wav_path": "emov/jenie/Neutral/neutral_57-84_0084.wav", "split": "test", "speaker": "jenie", "id": 1159246029716120600}
env.yml ADDED
@@ -0,0 +1,19 @@
1
+ name: spiritlm_test
2
+ channels:
3
+ - pytorch
4
+ - nvidia
5
+ dependencies:
6
+ - python=3.9
7
+ - pip
8
+ - pytorch-cuda=11.8
9
+ - pytorch
10
+ - torchaudio
11
+ - pip:
12
+ - omegaconf==2.2.0
13
+ - librosa~=0.10
14
+ - local-attention~=1.9
15
+ - encodec~=0.1
16
+ - transformers
17
+ - fairscale~=0.4
18
+ - sentencepiece
19
+ - torchfcpe~=0.0.4
examples/audio/7143-88743-0029.flac ADDED
Binary file (147 kB).
 
examples/distributed_inference_recipe/multi_nodes.slurm ADDED
@@ -0,0 +1,24 @@
1
+ #!/bin/bash
2
+
3
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
4
+ # All rights reserved.
5
+ #
6
+ # This source code is licensed under the FAIR Noncommercial Research License
7
+ # found in the LICENSE file in the root directory of this source tree.
8
+
9
+ #SBATCH --job-name=spiritlm
10
+ #SBATCH --ntasks-per-node=1
11
+ #SBATCH --gpus-per-node=8
12
+ #SBATCH --nodes=2
13
+ #SBATCH --cpus-per-task=12
14
+ #SBATCH --output=./logs/%j.stdout
15
+ #SBATCH --error=./logs/%j.stderr
16
+ #SBATCH --time=01:00:00
17
+
18
+ set -e
19
+
20
+ srun bash -c 'torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc-per-node $SLURM_GPUS_ON_NODE \
21
+ --node-rank $SLURM_PROCID \
22
+ --master-addr $(scontrol show hostnames $SLURM_NODELIST | head -n1) \
23
+ --master-port 12345 \
24
+ examples/distributed_inference_recipe/run_dist.py'
examples/distributed_inference_recipe/run_dist.py ADDED
@@ -0,0 +1,89 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ """
8
+ Usage example:
9
+
10
+ cd {SPIRITLM ROOT FOLDER}
11
+ export PYTHONPATH=.
12
+
13
+ Single node, multi-gpus:
14
+ (Assume that your machine has 8 GPUs)
15
+ torchrun --nnodes 1 --nproc-per-node 8 examples/distributed_inference_recipe/run_dist.py
16
+
17
+ Multi-nodes, multi-gpus:
18
+ (2 nodes, 8 GPUs per node, via sbatch)
19
+ mkdir -p logs
20
+ sbatch examples/distributed_inference_recipe/multi_nodes.slurm
21
+ """
22
+
23
+ import os
24
+
25
+ import torch
26
+ import torch.distributed as dist
27
+ import torchaudio
28
+ from spiritlm.model.spiritlm_model import (
29
+ ContentType,
30
+ GenerationInput,
31
+ OutputModality,
32
+ Spiritlm,
33
+ )
34
+ from torch.utils.data import TensorDataset
35
+ from torch.utils.data.distributed import DistributedSampler
36
+ from transformers import GenerationConfig, set_seed
37
+
38
+
39
+ def run(seed: int = 0):
40
+ world_size = int(os.environ["WORLD_SIZE"])
41
+ world_rank = int(os.environ["RANK"])
42
+ print(
43
+ f"Running distributed inference with world_size: {world_size}, world_rank: {world_rank}"
44
+ )
45
+ dist.init_process_group("nccl", rank=world_rank, world_size=world_size)
46
+
47
+ set_seed(seed)
48
+
49
+ wav = torchaudio.load("examples/audio/7143-88743-0029.flac")[0].squeeze()
50
+
51
+ # fake repeated dataset
52
+ dataset = TensorDataset(wav.repeat(32, 1))
53
+
54
+ sampler = DistributedSampler(dataset=dataset)
55
+ loader = torch.utils.data.DataLoader(
56
+ dataset=dataset,
57
+ batch_size=1, # don't change
58
+ sampler=sampler,
59
+ num_workers=4,
60
+ )
61
+
62
+ spirit_lm = Spiritlm("spirit-lm-expressive-7b")
63
+
64
+ for _, data in enumerate(loader):
65
+ outs = spirit_lm.generate(
66
+ output_modality=OutputModality.ARBITRARY,
67
+ interleaved_inputs=[
68
+ GenerationInput(
69
+ content=data[0], # 0 because of batch size 1
70
+ content_type=ContentType.SPEECH,
71
+ )
72
+ ],
73
+ generation_config=GenerationConfig(
74
+ temperature=0.9,
75
+ top_p=0.95,
76
+ max_new_tokens=200,
77
+ do_sample=True,
78
+ ),
79
+ )
80
+ print(f"outs: {outs}")
81
+
82
+
83
+ def setup_env():
84
+ os.environ["OMP_NUM_THREADS"] = "1"
85
+
86
+
87
+ if __name__ == "__main__":
88
+ setup_env()
89
+ run()
examples/speech_generation/spirit_model.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
examples/speech_tokenizer/spiritlm_speech_tokenizer.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
requirements.dev.txt ADDED
@@ -0,0 +1 @@
1
+ pytest
requirements.txt ADDED
@@ -0,0 +1,9 @@
1
+ omegaconf>=2.2.0
2
+ librosa>=0.10
3
+ local-attention>=1.9
4
+ encodec>=0.1
5
+ transformers
6
+ fairscale>=0.4
7
+ sentencepiece
8
+ pyarrow>=14.0
9
+ torchfcpe>=0.0.4
setup.py ADDED
@@ -0,0 +1,60 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import os
8
+ from pathlib import Path
9
+
10
+ from setuptools import find_packages, setup
11
+
12
+ NAME = "spiritlm"
13
+ VERSION = "0.1.0"
14
+ DESCRIPTION = "Interleaved Spoken and Written Language Model"
15
+ URL = "https://github.com/facebookresearch/spiritlm"
16
+ KEYWORDS = [
17
+ "Language Model, Speech Language Model, Multimodal, Crossmodal, Expressivity Modeling"
18
+ ]
19
+ LICENSE = "FAIR Noncommercial Research License"
20
+
21
+
22
+ def _get_long_description():
23
+ with (Path(__file__).parent / "README.md").open(encoding="utf-8") as file:
24
+ long_description = file.read()
25
+ return long_description
26
+
27
+
28
+ def _read_reqs(relpath):
29
+ fullpath = os.path.join(os.path.dirname(__file__), relpath)
30
+ with open(fullpath) as f:
31
+ return [
32
+ s.strip() for s in f.readlines() if (s.strip() and not s.startswith("#"))
33
+ ]
34
+
35
+
36
+ setup(
37
+ name=NAME,
38
+ version=VERSION,
39
+ description=DESCRIPTION,
40
+ long_description=_get_long_description(),
41
+ long_description_content_type="text/plain",
42
+ url=URL,
43
+ license=LICENSE,
44
+ author="Meta",
45
+ keywords=KEYWORDS,
46
+ classifiers=[
47
+ "Intended Audience :: Science/Research",
48
+ "License :: FAIR Noncommercial Research License",
49
+ "Topic :: Multimedia :: Sound/Audio",
50
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
51
+ ],
52
+ packages=find_packages(),
53
+ zip_safe=False,
54
+ python_requires=">=3.9",
55
+ install_requires=_read_reqs("requirements.txt"),
56
+ extras_require={
57
+ "dev": ["pytest"],
58
+ "eval": ["pandas"],
59
+ },
60
+ )
spiritlm.egg-info/PKG-INFO ADDED
@@ -0,0 +1,84 @@
1
+ Metadata-Version: 2.1
2
+ Name: spiritlm
3
+ Version: 0.1.0
4
+ Summary: Interleaved Spoken and Written Language Model
5
+ Home-page: https://github.com/facebookresearch/spiritlm
6
+ Author: Meta
7
+ License: FAIR Noncommercial Research License
8
+ Keywords: Language Model, Speech Language Model, Multimodal, Crossmodal, Expressivity Modeling
9
+ Classifier: Intended Audience :: Science/Research
10
+ Classifier: License :: FAIR Noncommercial Research License
11
+ Classifier: Topic :: Multimedia :: Sound/Audio
12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
13
+ Requires-Python: >=3.9
14
+ Description-Content-Type: text/plain
15
+ License-File: LICENSE
16
+ Requires-Dist: omegaconf>=2.2.0
17
+ Requires-Dist: librosa>=0.10
18
+ Requires-Dist: local-attention>=1.9
19
+ Requires-Dist: encodec>=0.1
20
+ Requires-Dist: transformers
21
+ Requires-Dist: fairscale>=0.4
22
+ Requires-Dist: sentencepiece
23
+ Requires-Dist: pyarrow>=14.0
24
+ Requires-Dist: torchfcpe>=0.0.4
25
+ Provides-Extra: dev
26
+ Requires-Dist: pytest; extra == "dev"
27
+ Provides-Extra: eval
28
+ Requires-Dist: pandas; extra == "eval"
29
+
30
+ # Meta Spirit LM: Interleaved Spoken and Written Language Model
31
+
32
+ This repository contains the model weights, inference code and evaluation scripts for the Spirit LM [paper](https://arxiv.org/pdf/2402.05755.pdf). You can find more generation samples on our [demo page](https://speechbot.github.io/spiritlm/).
33
+
34
+ ## Spirit LM Model Overview
35
+ <img src="assets/spiritlm_overview.png">
36
+
37
+ ## Installation Setup
38
+ ### Conda
39
+ ```
40
+ conda env create -f env.yml
41
+ pip install -e '.[eval]'
42
+
43
+ ```
44
+ ### Pip
45
+ ```
46
+ pip install -e '.[eval]'
47
+ ```
48
+
49
+ ### Dev
50
+ (Optional: only needed if you want to run the tests.)
51
+ ```
52
+ pip install -e '.[dev]'
53
+ ```
54
+
55
+ ## Checkpoints Setup
56
+ See [checkpoints/README.md](checkpoints/README.md)
57
+
58
+ ## Quick Start
59
+ ### Speech Tokenization
60
+ See [spiritlm/speech_tokenizer/README.md](spiritlm/speech_tokenizer/README.md)
61
+ ### Spirit LM Generation
62
+ See [spiritlm/model/README.md](spiritlm/model/README.md)
63
+ ### Speech-Text Sentiment Preservation benchmark (STSP)
64
+ See [spiritlm/eval/README.md](spiritlm/eval/README.md)
65
+
66
+ ## Model Card
67
+ More details of the model can be found in [MODEL_CARD.md](MODEL_CARD.md).
68
+
69
+ ## License
70
+ The present code is provided under the **FAIR Noncommercial Research License** found in [LICENSE](LICENSE).
71
+
72
+ ## Citation
73
+ ```
74
+ @misc{nguyen2024spiritlminterleavedspokenwritten,
75
+ title={SpiRit-LM: Interleaved Spoken and Written Language Model},
76
+ author={Tu Anh Nguyen and Benjamin Muller and Bokai Yu and Marta R. Costa-jussa and Maha Elbayad and Sravya Popuri and Paul-Ambroise Duquenne and Robin Algayres and Ruslan Mavlyutov and Itai Gat and Gabriel Synnaeve and Juan Pino and Benoit Sagot and Emmanuel Dupoux},
77
+ year={2024},
78
+ eprint={2402.05755},
79
+ archivePrefix={arXiv},
80
+ primaryClass={cs.CL},
81
+ url={https://arxiv.org/abs/2402.05755},
82
+ }
83
+ ```
84
+
spiritlm.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,32 @@
1
+ LICENSE
2
+ README.md
3
+ setup.py
4
+ spiritlm/__init__.py
5
+ spiritlm.egg-info/PKG-INFO
6
+ spiritlm.egg-info/SOURCES.txt
7
+ spiritlm.egg-info/dependency_links.txt
8
+ spiritlm.egg-info/not-zip-safe
9
+ spiritlm.egg-info/requires.txt
10
+ spiritlm.egg-info/top_level.txt
11
+ spiritlm/model/__init__.py
12
+ spiritlm/model/spiritlm_model.py
13
+ spiritlm/model/utils.py
14
+ spiritlm/speech_tokenizer/__init__.py
15
+ spiritlm/speech_tokenizer/spiritlm_tokenizer.py
16
+ spiritlm/speech_tokenizer/f0/__init__.py
17
+ spiritlm/speech_tokenizer/f0/f0_extractor.py
18
+ spiritlm/speech_tokenizer/f0/f0_tokenizer.py
19
+ spiritlm/speech_tokenizer/f0/vqvae.py
20
+ spiritlm/speech_tokenizer/hifigan/__init__.py
21
+ spiritlm/speech_tokenizer/hifigan/hifigan_vocoder.py
22
+ spiritlm/speech_tokenizer/hubert/__init__.py
23
+ spiritlm/speech_tokenizer/hubert/hubert_tokenizer.py
24
+ spiritlm/speech_tokenizer/hubert/quantizer_model.py
25
+ spiritlm/speech_tokenizer/hubert/hubert_model/__init__.py
26
+ spiritlm/speech_tokenizer/hubert/hubert_model/hubert_model.py
27
+ spiritlm/speech_tokenizer/hubert/hubert_model/wav2vec2_model.py
28
+ spiritlm/speech_tokenizer/style_encoder/__init__.py
29
+ spiritlm/speech_tokenizer/style_encoder/w2v2_encoder.py
30
+ tests/__init__.py
31
+ tests/test_spirit_model.py
32
+ tests/test_tokenizer.py
spiritlm.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
1
+
spiritlm.egg-info/not-zip-safe ADDED
@@ -0,0 +1 @@
1
+
spiritlm.egg-info/requires.txt ADDED
@@ -0,0 +1,15 @@
1
+ omegaconf>=2.2.0
2
+ librosa>=0.10
3
+ local-attention>=1.9
4
+ encodec>=0.1
5
+ transformers
6
+ fairscale>=0.4
7
+ sentencepiece
8
+ pyarrow>=14.0
9
+ torchfcpe>=0.0.4
10
+
11
+ [dev]
12
+ pytest
13
+
14
+ [eval]
15
+ pandas
spiritlm.egg-info/top_level.txt ADDED
@@ -0,0 +1,2 @@
1
+ spiritlm
2
+ tests
spiritlm/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
spiritlm/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (162 Bytes).
 
spiritlm/eval/README.md ADDED
@@ -0,0 +1,92 @@
1
+ # STSP Evaluation
2
+ The Speech-Text Sentiment Preservation (STSP) benchmark consists of a collection of speech and text prompts with a positive, negative, or neutral sentiment.
+ Given a spoken or written prompt, the task is to generate a text or speech sequence of tokens that preserves the sentiment of the prompt.
+
+ The sentiment of the generated sequence is evaluated automatically with a sentiment/emotion classifier in speech or text, depending on the output modality.
+ From these predictions, we derive an STSP accuracy score.
7
+
8
+ ## Data Download
9
+ Download the data as well as the speech/text classifier checkpoints via this [link](https://dl.fbaipublicfiles.com/textless_nlp/spiritlm/stsp.tar.gz),
+ then extract the data into the folder `{spiritlm ROOT FOLDER}/data/stsp_data`:
11
+ ```
12
+ cd {spiritlm ROOT FOLDER}
13
+ mkdir data/stsp_data
14
+ tar -xvzf stsp.tar.gz -C data/stsp_data --strip-components=1
15
+ ```
16
+ Run the following script to check that the dataset is correctly in place:
17
+ ```
18
+ python spiritlm/eval/stsp/sanity_check_download.py
19
+ ```
20
+ ## Data structure
21
+ The dataset contains 3 folders:
22
+ - `data`: raw audio files
23
+ - `manifest`: data splits
24
+ - `model`: speech/text classifier checkpoints
25
+ ### Data
26
+ The raw audio files for:
27
+ - `emov`: EMOV
28
+ - `expresso/conversational`: EXPRESSO-ASR
29
+ - `expresso/read`: EXPRESSO-READ
30
+
31
+ ### Manifest
32
+ The train/validation/test splits. Concretely, we have:
33
+
34
+ #### EMOV
35
+ - 1053 records for emov train split at `manifest/emov/emov.train.jsonl`
36
+ - 351 records for emov dev split at `manifest/emov/emov.dev.jsonl`
37
+ - 351 records for emov test split at `manifest/emov/emov.test.jsonl`
38
+
39
+ #### EXPRESSO-ASR
40
+ - 1373 records for EXPRESSO-ASR train split at `manifest/expresso/expresso_asr.train`
41
+ - 479 records for EXPRESSO-ASR dev at `manifest/expresso/expresso_asr.dev.jsonl`
42
+ - 462 records for EXPRESSO-ASR test split at `manifest/expresso/expresso_asr.test.jsonl`
43
+
44
+ #### EXPRESSO-READ
45
+ - 1024 records for EXPRESSO-READ train split at `manifest/expresso/expresso_read.train`
46
+ - 60 records for EXPRESSO-READ dev at `manifest/expresso/expresso_read.dev.jsonl`
47
+ - 54 records for EXPRESSO-READ test split at `manifest/expresso/expresso_read.test.jsonl`
48
+
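+ For reference, each manifest line is a JSON record; the evaluation code reads the `id`, `wav_path` (speech input), `asr` (text input) and `sentiment`/`emotion` fields. An illustrative (not verbatim) record looks like:
+ ```
+ {"id": "0001", "wav_path": "emov/xxx.wav", "asr": "the transcribed prompt text", "emotion": "happy"}
+ ```
+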
49
+ #### Few-shot Samples
50
+ The subset of the EXPRESSO-ASR training set used for the few-shot experiments:
51
+ - `s2s.jsonl`: S -> S direction
52
+ - `s2t.jsonl`: S -> T direction
53
+ - `t2t.jsonl`: T -> T direction
54
+ - `t2s.jsonl`: T -> S direction
55
+
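+ Each few-shot manifest line contains a `prompt`, a `generation` and a `sentiment` field (for the speech directions, `prompt` and `generation` point to wav files under the data folder). An illustrative (not verbatim) record for the T -> T direction looks like:
+ ```
+ {"prompt": "the prompt text", "generation": "a continuation with the same sentiment", "sentiment": "positive"}
+ ```
+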
56
+ ### Auto-Eval Speech And Text Classifiers
57
+
58
+ The sentiment of the generated sequence is estimated in an auto-eval fashion with speech and text classifiers. We refer to the [paper](https://arxiv.org/abs/2402.05755) for details on these classifiers.
59
+
60
+
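+ For reference, both auto-eval classifiers can be loaded directly with the helpers shipped in this repository. The sketch below assumes the archive was extracted to the default `data/stsp_data` layout and that a GPU is available (the speech classifier is loaded on GPU):
+ ```python
+ from spiritlm.eval.stsp.sentiment_classifiers import (
+     get_text_sentiment_prediction,
+     load_sentiment_classifier,
+ )
+ from spiritlm.eval.stsp.utils import load_emotion_classifier, wav2emotion_and_sentiment
+
+ # Text classifier: returns a (score, label) pair, e.g. (0.98, "positive")
+ text_classifier = load_sentiment_classifier("data/stsp_data/model/text_classifier")
+ score, sentiment = get_text_sentiment_prediction("I am so happy to see you!", text_classifier)
+
+ # Speech classifier: maps a wav (path or array) to an (emotion, sentiment) pair
+ speech_classifier = load_emotion_classifier("data/stsp_data/model/speech_classifier")
+ emotion, sentiment = wav2emotion_and_sentiment("examples/audio/7143-88743-0029.flac", speech_classifier)
+ ```
+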
61
+ ## Prediction & Evaluation of Spirit LM on STSP (Speech/Text)
62
+
63
+ First, set the Python path from the repository root: `export PYTHONPATH=.`
64
+
65
+ Set `spiritlm` to the model you want to evaluate, e.g. `spirit-lm-base-7b` or `spirit-lm-expressive-7b`:
66
+
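+ For example:
+ ```
+ export spiritlm=spirit-lm-base-7b  # or: export spiritlm=spirit-lm-expressive-7b
+ ```
+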
67
+ #### Speech to Text
+ ```
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --model $spiritlm --eval_manifest_path data/stsp_data/manifest/emov/emov.test.jsonl --eval --write_pred ./pred_s_t.jsonl --input_output speech_text
+ ```
+ #### Text to Text
+ ```
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --model $spiritlm --eval_manifest_path data/stsp_data/manifest/emov/emov.test.jsonl --eval --write_pred ./pred_t_t.jsonl --input_output text_text
+ ```
+ #### Text to Speech
+ ```
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --model $spiritlm --eval_manifest_path data/stsp_data/manifest/emov/emov.test.jsonl --eval --write_pred ./pred_t_s.jsonl --input_output text_speech
+ ```
+ #### Speech to Speech
+ ```
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --model $spiritlm --eval_manifest_path data/stsp_data/manifest/emov/emov.test.jsonl --eval --write_pred ./pred_s_s.jsonl --input_output speech_speech
+ ```
75
+
76
+
77
+ ### Post-hoc Evaluation
78
+
79
+ To evaluate the performance of a model other than Spirit LM, you can use the following evaluation script, which takes a prediction jsonl file as input.
80
+
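+ The prediction file must contain one JSON record per evaluated utterance, with an `id` matching the reference and the predicted label under `pred`; an illustrative (not verbatim) line looks like:
+ ```
+ {"id": "0001", "pred": "positive"}
+ ```
+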
81
+ ```
82
+ python spiritlm/eval/eval_stsp.py --ref_file $REF_FILE --pred_file $pred_file
83
+ ```
84
+
85
+ e.g.
86
+
87
+ ```
88
+ python spiritlm/eval/eval_stsp.py \
89
+ --ref_file ./data/examples/demo.jsonl \
90
+ --pred_file ./data/examples/pred.jsonl
91
+ > Accuracy: 100.00% for predictions ./data/examples/pred.jsonl
92
+ ```
spiritlm/eval/eval_stsp.py ADDED
@@ -0,0 +1,87 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import argparse
8
+ import json
9
+ from typing import Dict, Union
10
+
11
+ import pandas as pd
12
+ from spiritlm.eval.stsp.utils import EMOTION_2_SENTIMENT
13
+
14
+
15
+ def load_pred(predictions):
16
+ ret = {}
17
+ with open(predictions) as f:
18
+ for line in f:
19
+ pred = json.loads(line)
20
+ ret[str(pred["id"])] = pred["pred"]
21
+
22
+ assert sum(1 for _ in open(predictions)) == len(ret)
23
+
24
+ return ret
25
+
26
+
27
+ def eval(
28
+ gold_records: str, predictions: Union[str, Dict], info_data="", label="sentiment"
29
+ ):
30
+ n_gold_records = sum(1 for _ in open(gold_records))
31
+ n_lines_pred = (
32
+ sum(1 for _ in open(predictions))
33
+ if isinstance(predictions, str)
34
+ else len(predictions)
35
+ )
36
+ assert (
37
+ n_gold_records == n_lines_pred
38
+ ), f"Mismatch between prediction ({n_lines_pred} samples in {predictions}) and reference ({n_gold_records} in {gold_records})"
39
+
40
+ pred_dic = load_pred(predictions) if isinstance(predictions, str) else predictions
41
+ scores = []
42
+
43
+ with open(gold_records) as gold:
44
+ for line in gold:
45
+ ref = json.loads(line)
46
+ try:
47
+ if label in ref:
48
+ scores.append(pred_dic[str(ref["id"])] == ref[label])
49
+ else:
50
+ assert label == "sentiment" and "emotion" in ref, ref
51
+ sentiment = EMOTION_2_SENTIMENT[ref["emotion"]]
52
+ scores.append(pred_dic[str(ref["id"])] == sentiment)
53
+ except Exception as e:
54
+ print(
55
+ f"ERROR in matching the predicted labels with the gold ones: {e}: ref['id'] do not match any key in {pred_dic}', {ref['id']}: "
56
+ )
57
+ # TODO: add other metrics if needed : F1 per class, etc.
58
+ report = pd.DataFrame({"Correct": scores})
59
+ if isinstance(predictions, str):
60
+ info_data += f"from {predictions}"
61
+ print(
62
+ f"Accuracy: {(report['Correct']==1).sum()/len(report)*100:0.2f}% for predictions {info_data}"
63
+ )
64
+
65
+
66
+ if __name__ == "__main__":
67
+ parser = argparse.ArgumentParser()
68
+
69
+ parser.add_argument(
70
+ "--ref_file",
71
+ type=str,
72
+ help="Path to reference record",
73
+ )
74
+ parser.add_argument(
75
+ "--pred_file",
76
+ type=str,
77
+ help="Path to prediction: should be jsonl with each entry {'pred': , 'id': }",
78
+ )
79
+ parser.add_argument(
80
+ "--label",
81
+ type=str,
82
+ default="sentiment",
83
+ help="sentiment or emotion",
84
+ )
85
+ args = parser.parse_args()
86
+
87
+ eval(args.ref_file, args.pred_file, label=args.label)
spiritlm/eval/load_data.py ADDED
@@ -0,0 +1,50 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import json
8
+ from pathlib import Path
9
+
10
+ import torch
11
+ import torchaudio
12
+
13
+
14
+ class SpeechData(torch.utils.data.Dataset):
15
+ def __init__(self, manifest_dir, root_dir=None):
16
+ if root_dir is None:
17
+ root_dir = "."
18
+ self.root_dir = Path(root_dir)
19
+ self.manifest_dir = self.root_dir / manifest_dir
20
+ self.wav_field = "wav_path"
21
+ self.manifest = [json.loads(line.strip()) for line in open(manifest_dir)]
22
+
23
+ def __getitem__(self, idx):
24
+ wav_path = self.root_dir / self.manifest[idx][self.wav_field]
25
+ return {
26
+ "wav": torchaudio.load(wav_path)[0].squeeze(0),
27
+ "id": str(self.manifest[idx]["id"]),
28
+ }
29
+
30
+ def __len__(self):
31
+ return len(self.manifest)
32
+
33
+
34
+ class TextData(torch.utils.data.Dataset):
35
+ def __init__(self, manifest_dir, root_dir=None):
36
+ if root_dir is None:
37
+ root_dir = "."
38
+ self.root_dir = Path(root_dir)
39
+ self.manifest_dir = self.root_dir / manifest_dir
40
+ self.text_field = "asr"
41
+ self.manifest = [json.loads(line.strip()) for line in open(manifest_dir)]
42
+
43
+ def __getitem__(self, idx):
44
+ return {
45
+ "text": self.manifest[idx][self.text_field],
46
+ "id": str(self.manifest[idx]["id"]),
47
+ }
48
+
49
+ def __len__(self):
50
+ return len(self.manifest)
spiritlm/eval/stsp/few_shot_prompt.py ADDED
@@ -0,0 +1,101 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import math
8
+ from typing import Union
9
+
10
+ import pandas as pd
11
+ import torch
12
+ import torchaudio
13
+ from spiritlm.eval.stsp.stsp_constants import STSP_DATA_ROOT, STSP_MANIFEST_ROOT
14
+ from spiritlm.model.spiritlm_model import Spiritlm
15
+
16
+ FEW_SHOT_MANIFEST_DIR = STSP_MANIFEST_ROOT / "few_shot"
17
+ FEW_SHOT_TEMPLATE = "{prompt}{generation}"
18
+
19
+
20
+ def wav_prompt(spiritlm_model: Spiritlm, wav: Union[str, torch.Tensor]) -> str:
21
+ return spiritlm_model.SPEECH_PROMPT_PREFIX + spiritlm_model.speech_tokenizer(wav)
22
+
23
+
24
+ def text_prompt(spiritlm_model: Spiritlm, text: str) -> str:
25
+ return spiritlm_model.TEXT_PROMPT_PREFIX + text
26
+
27
+
28
+ def _load_half_wav(wav_path: str, load_first_half: bool) -> torch.Tensor:
29
+ wav_path = STSP_DATA_ROOT / wav_path
30
+ wav = torchaudio.load(wav_path)[0].squeeze(0)
31
+ size = wav.size()[0]
32
+ half_size = size // 2
33
+ if load_first_half:
34
+ wav = wav[:half_size]
35
+ else:
36
+ wav = wav[half_size:]
37
+ return wav
38
+
39
+
40
+ def build_few_shot_prompt(
41
+ spiritlm_model: Spiritlm,
42
+ input_output: str,
43
+ n_shots: int = 3,
44
+ ) -> str:
45
+ """
46
+ Build the few-shot prompt by simply concatenating a set of examples.
47
+
48
+ E.g., a 3-shots T->S prompt would like this:
49
+ "[Text]text1[Speech]speech_tokens1\n[Text]text2[Speech]speech_tokens2\n[Text]text3[Speech]speech_tokens3\n"
50
+ """
51
+ manifset_file_mapping = {
52
+ "text_text": "t2t",
53
+ "speech_text": "s2t",
54
+ "text_speech": "t2s",
55
+ "speech_speech": "s2s",
56
+ }
57
+ manifest_path = (
58
+ FEW_SHOT_MANIFEST_DIR / f"{manifset_file_mapping[input_output]}.jsonl"
59
+ )
60
+ df = pd.read_json(manifest_path, lines=True)
61
+ assert n_shots <= len(df)
62
+
63
+ # ensure a balanced sampels for each sentiment
64
+ nb_samples_per_sentiment = math.ceil(n_shots / 3)
65
+ df = df.groupby("sentiment").sample(n=nb_samples_per_sentiment)
66
+
67
+ prompts = []
68
+ for _, row in df.iterrows():
69
+ prompt = row["prompt"]
70
+ generation = row["generation"]
71
+ if input_output == "text_text":
72
+ prompt = FEW_SHOT_TEMPLATE.format(
73
+ prompt=text_prompt(spiritlm_model, prompt),
74
+ generation=text_prompt(spiritlm_model, generation),
75
+ )
76
+ elif input_output == "text_speech":
77
+ prompt = FEW_SHOT_TEMPLATE.format(
78
+ prompt=text_prompt(spiritlm_model, prompt),
79
+ generation=wav_prompt(
80
+ spiritlm_model, _load_half_wav(generation, load_first_half=False)
81
+ ),
82
+ )
83
+ elif input_output == "speech_text":
84
+ prompt = FEW_SHOT_TEMPLATE.format(
85
+ prompt=wav_prompt(
86
+ spiritlm_model, _load_half_wav(prompt, load_first_half=True)
87
+ ),
88
+ generation=text_prompt(spiritlm_model, generation),
89
+ )
90
+ elif input_output == "speech_speech":
91
+ prompt = FEW_SHOT_TEMPLATE.format(
92
+ prompt=wav_prompt(
93
+ spiritlm_model, _load_half_wav(prompt, load_first_half=True)
94
+ ),
95
+ generation=wav_prompt(
96
+ spiritlm_model, _load_half_wav(generation, load_first_half=False)
97
+ ),
98
+ )
99
+ prompts.append(prompt)
100
+ print(f"prompts: {prompts}")
101
+ return "\n".join(prompts) + "\n"
spiritlm/eval/stsp/predict_stsp.py ADDED
@@ -0,0 +1,299 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ """
8
+ Usage example:
9
+
10
+ cd {SPIRITLM ROOT FOLDER}
11
+ export PYTHONPATH=.
12
+
13
+ # Speech to Text
14
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --eval_manifest_path data/examples/ref.jsonl --eval --write_pred ./pred_s_t.jsonl --input_output speech_text
15
+ # Text to Text
16
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --eval_manifest_path data/examples/ref.jsonl --eval --write_pred ./pred_t_t.jsonl --input_output text_text
17
+ # Text to Speech#
18
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --eval_manifest_path data/examples/ref.jsonl --eval --write_pred ./pred._t_s.jsonl --input_output text_speech
19
+ # Speech to Speech
20
+ torchrun --nnodes 1 --nproc-per-node 1 spiritlm/eval/stsp/predict_stsp.py --eval_manifest_path data/examples/ref.jsonl --eval --write_pred ./pred_s_s.jsonl --input_output speech_speech
21
+
22
+ """
23
+
24
+ import argparse
25
+ import json
26
+ import os
27
+ import uuid
28
+ from pathlib import Path
29
+ from typing import Union
30
+
31
+ import torch
32
+ import torch.distributed as dist
33
+ import torchaudio
34
+ from spiritlm.eval.eval_stsp import eval
35
+ from spiritlm.eval.load_data import SpeechData, TextData
36
+ from spiritlm.eval.stsp.few_shot_prompt import build_few_shot_prompt
37
+ from spiritlm.eval.stsp.sentiment_classifiers import (
38
+ get_text_sentiment_prediction,
39
+ load_sentiment_classifier,
40
+ )
41
+ from spiritlm.eval.stsp.stsp_constants import STSP_DATA_ROOT, STSP_MODEL_ROOT
42
+ from spiritlm.eval.stsp.utils import (
43
+ ExpressoEmotionClassifier,
44
+ load_emotion_classifier,
45
+ wav2emotion_and_sentiment,
46
+ )
47
+ from spiritlm.model.spiritlm_model import (
48
+ ContentType,
49
+ GenerationInput,
50
+ InterleavedOutputs,
51
+ OutputModality,
52
+ Spiritlm,
53
+ )
54
+ from torch.utils.data.distributed import DistributedSampler
55
+ from tqdm import tqdm
56
+ from transformers import AutoModelForSequenceClassification, GenerationConfig, set_seed
57
+
58
+ SPEECH_CLASSIFIER = STSP_MODEL_ROOT / "speech_classifier"
59
+ TEXT_CLASSIFIER = STSP_MODEL_ROOT / "text_classifier"
60
+
61
+ NB_RETRIES = 3
62
+
63
+
64
+ def get_eval_classifier(args):
65
+ if args.input_output.endswith("speech"):
66
+ return load_emotion_classifier(str(SPEECH_CLASSIFIER))
67
+ elif args.input_output.endswith("text"):
68
+ return load_sentiment_classifier(str(TEXT_CLASSIFIER))
69
+ else:
70
+ raise (Exception(f"{args.input_output} not supported"))
71
+
72
+
73
+ def get_sentiment(
74
+ input_output,
75
+ generation,
76
+ classifier: Union[AutoModelForSequenceClassification, ExpressoEmotionClassifier],
77
+ ):
78
+ if input_output.endswith("speech"):
79
+ _, pred_sentiment = wav2emotion_and_sentiment(generation, classifier)
80
+ elif input_output.endswith("text"):
81
+ _, pred_sentiment = get_text_sentiment_prediction(generation, classifier)
82
+ return pred_sentiment
83
+
84
+
85
+ def write_jsonl(dir: str, predictions: dict):
86
+ Path(dir).parent.mkdir(exist_ok=True, parents=True)
87
+ with open(dir, "w") as f:
88
+ for id, result_dict in predictions.items():
89
+ record = {"id": id, **result_dict}
90
+ json_string = json.dumps(record)
91
+ f.write(json_string + "\n") # Add a newline to separate JSON objects
92
+ print(f"{dir} written")
93
+
94
+
95
+ def write_wav(
96
+ wav,
97
+ save_dir: Path,
98
+ sample_rate: int = 16_000,
99
+ ) -> str:
100
+ """Save wav under `save_dir` with a random name and return the full path."""
101
+ save_dir.mkdir(exist_ok=True, parents=True)
102
+ random_path = save_dir / (str(uuid.uuid4()) + ".wav")
103
+ torchaudio.save(
104
+ random_path, torch.from_numpy(wav).unsqueeze(0), sample_rate=sample_rate
105
+ )
106
+ return str(random_path)
107
+
108
+
109
+ def run(args):
110
+ world_size = int(os.environ["WORLD_SIZE"])
111
+ world_rank = int(os.environ["RANK"])
112
+ print(
113
+ f"Running distributed inference with world_size: {world_size}, world_rank: {world_rank}"
114
+ )
115
+ dist.init_process_group("nccl", rank=world_rank, world_size=world_size)
116
+ set_seed(args.seed)
117
+ spiritlm_model = Spiritlm(args.model)
118
+ evaluation_classifier = get_eval_classifier(args)
119
+ input_output = args.input_output
120
+ eval_manifest_path = args.eval_manifest_path
121
+ write_wav_output = args.write_wav_output
122
+
123
+ if args.few_shot > 0:
124
+ prompt = build_few_shot_prompt(
125
+ spiritlm_model=spiritlm_model,
126
+ input_output=args.input_output,
127
+ n_shots=args.few_shot,
128
+ )
129
+ else:
130
+ prompt = None
131
+
132
+ # load
133
+ if input_output.startswith("speech"):
134
+ eval_dataset = SpeechData(eval_manifest_path, root_dir=STSP_DATA_ROOT)
135
+ elif input_output.startswith("text"):
136
+ eval_dataset = TextData(eval_manifest_path, root_dir=STSP_DATA_ROOT)
137
+
138
+ sampler = DistributedSampler(dataset=eval_dataset)
139
+ loader = torch.utils.data.DataLoader(
140
+ dataset=eval_dataset,
141
+ batch_size=1, # large batch size is not supported yet
142
+ sampler=sampler,
143
+ num_workers=4,
144
+ )
145
+ predictions = {}
146
+ if input_output.endswith("speech"):
147
+ output_modality = OutputModality.SPEECH
148
+ max_new_tokens = 300
149
+ else:
150
+ output_modality = OutputModality.TEXT
151
+ max_new_tokens = 50
152
+ for _, data in tqdm(
153
+ enumerate(loader),
154
+ desc=f"Predict {eval_manifest_path}",
155
+ total=eval_dataset.__len__() // world_size,
156
+ ):
157
+ # retry the generation multiple times because sometime it does not generate hubert tokens
158
+ for i in range(NB_RETRIES):
159
+ try:
160
+ out: InterleavedOutputs = spiritlm_model.generate(
161
+ output_modality=output_modality,
162
+ interleaved_inputs=[
163
+ GenerationInput(
164
+ content=(
165
+ data["wav"][0]
166
+ if input_output.startswith("speech")
167
+ else data["text"][0]
168
+ ), # 0 because of batch size 1
169
+ content_type=(
170
+ ContentType.SPEECH
171
+ if input_output.startswith("speech")
172
+ else ContentType.TEXT
173
+ ),
174
+ )
175
+ ],
176
+ generation_config=GenerationConfig(
177
+ temperature=0.8,
178
+ top_p=0.95,
179
+ max_new_tokens=max_new_tokens,
180
+ do_sample=True,
181
+ ),
182
+ prompt=prompt,
183
+ )
184
+ except Exception as e:
185
+ print(f"Got an exception when generating: {e}")
186
+ if i == NB_RETRIES - 1:
187
+ raise Exception(f"Failed to generate after {NB_RETRIES}")
188
+ else:
189
+ break
190
+ assert len(out) == 1
191
+ generated_output = out[0].content
192
+ detected_sentiment = get_sentiment(
193
+ input_output, generated_output, evaluation_classifier
194
+ )
195
+ if output_modality == OutputModality.TEXT:
196
+ generation = generated_output
197
+ elif write_wav_output and output_modality == OutputModality.SPEECH:
198
+ generation = write_wav(generated_output, Path(write_wav_output))
199
+ else:
200
+ generation = None
201
+ result_dict = {"pred": detected_sentiment}
202
+ if generation is not None:
203
+ result_dict["generation"] = generation
204
+ predictions[str(data["id"][0])] = result_dict
205
+
206
+ if args.eval:
207
+ gathered_predictions = [None for _ in range(world_size)]
208
+ dist.gather_object(
209
+ predictions, gathered_predictions if world_rank == 0 else None, dst=0
210
+ )
211
+ if world_rank == 0:
212
+ all_predictions = {k: v for d in gathered_predictions for k, v in d.items()}
213
+ eval(
214
+ eval_manifest_path,
215
+ {k: v["pred"] for k, v in all_predictions.items()},
216
+ info_data=f"{eval_manifest_path}, input-output {input_output}",
217
+ label="sentiment",
218
+ )
219
+
220
+ if args.write_pred is not None and world_rank == 0:
221
+ write_jsonl(args.write_pred, all_predictions)
222
+
223
+
224
+ def setup_env():
225
+ os.environ["OMP_NUM_THREADS"] = "1"
226
+
227
+
228
+ if __name__ == "__main__":
229
+ parser = argparse.ArgumentParser()
230
+ parser.add_argument(
231
+ "--eval_manifest_path", # data/examples/ref.jsonl
232
+ type=str,
233
+ help="Path to reference record",
234
+ required=True,
235
+ )
236
+
237
+ parser.add_argument(
238
+ "--data_root_dir", # data/stsp_data
239
+ type=str,
240
+ help=f"Path to root data folder, default to {str(STSP_DATA_ROOT)}",
241
+ default=str(STSP_DATA_ROOT),
242
+ required=False,
243
+ )
244
+
245
+ parser.add_argument(
246
+ "--model",
247
+ type=str,
248
+ default="spirit-lm-expressive-7b",
249
+ help="Model name (spirit-lm-base-7b or spirit-lm-expressive-7b) or path to model",
250
+ required=False,
251
+ )
252
+ parser.add_argument(
253
+ "--few_shot",
254
+ type=int,
255
+ default=0,
256
+ help="Number of few shot examples, 3/6/9",
257
+ required=False,
258
+ )
259
+ parser.add_argument(
260
+ "--input_output",
261
+ type=str,
262
+ default="speech_speech",
263
+ help="speech_speech speech_text text_speech text_text",
264
+ required=False,
265
+ )
266
+ parser.add_argument(
267
+ "--eval_type",
268
+ type=str,
269
+ default="emotion",
270
+ required=False,
271
+ )
272
+ parser.add_argument(
273
+ "--write_pred",
274
+ type=str,
275
+ default=None,
276
+ help="Path to save the predictions output",
277
+ required=False,
278
+ )
279
+ parser.add_argument(
280
+ "--write_wav_output",
281
+ type=str,
282
+ default=None,
283
+ help="Path to save the generated audio if the output is speech",
284
+ required=False,
285
+ )
286
+ parser.add_argument(
287
+ "--eval",
288
+ default=False,
289
+ action="store_true",
290
+ )
291
+ parser.add_argument(
292
+ "--seed",
293
+ default=0,
294
+ type=int,
295
+ )
296
+
297
+ args = parser.parse_args()
298
+ setup_env()
299
+ run(args)
spiritlm/eval/stsp/sanity_check_download.py ADDED
@@ -0,0 +1,30 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import json
8
+
9
+ from spiritlm.eval.stsp.stsp_constants import STSP_DATA_ROOT, STSP_MANIFEST_ROOT
10
+
11
+
12
+ def check_all_datasets():
13
+ for dataset_manifest in STSP_MANIFEST_ROOT.glob("**/*jsonl"):
14
+ records_checked = 0
15
+ print(f"dataset_manifest: {dataset_manifest}")
16
+ with dataset_manifest.open() as f:
17
+ for record in f:
18
+ record = json.loads(record)
19
+ for wav_key in ["wav_path", "prompt", "generation"]:
20
+ if wav_key in record and record[wav_key].endswith(".wav"):
21
+ wav_path = STSP_DATA_ROOT / record[wav_key]
22
+ assert (
23
+ wav_path.is_file()
24
+ ), f"Record {record[wav_key]} not found in {str(wav_path)} and listed in {dataset_manifest}"
25
+ records_checked += 1
26
+ print(f"{records_checked} records checked for {dataset_manifest.stem} split")
27
+
28
+
29
+ if __name__ == "__main__":
30
+ check_all_datasets()
spiritlm/eval/stsp/sentiment_classifiers.py ADDED
@@ -0,0 +1,37 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ from typing import Any, Dict, List, Tuple
8
+
9
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
10
+
11
+
12
+ def pred_to_label(
13
+ sentiment_prediction_scores: List[List[Dict[str, Any]]],
14
+ ) -> Tuple[float, str]:
15
+ if isinstance(sentiment_prediction_scores[0], list):
16
+ sentiment_prediction_scores = sentiment_prediction_scores[0]
17
+ item_with_max_score = max(
18
+ sentiment_prediction_scores, key=lambda _dict: _dict["score"]
19
+ )
20
+ score = item_with_max_score["score"]
21
+ return score, item_with_max_score["label"].lower()
22
+
23
+
24
+ def get_text_sentiment_prediction(text: str, sentiment_classifier) -> Tuple[float, str]:
25
+ return pred_to_label(sentiment_classifier(text))
26
+
27
+
28
+ def load_sentiment_classifier(model_dir: str):
29
+ classifier = pipeline(
30
+ task="text-classification",
31
+ model=AutoModelForSequenceClassification.from_pretrained(model_dir),
32
+ tokenizer=AutoTokenizer.from_pretrained(
33
+ "j-hartmann/sentiment-roberta-large-english-3-classes"
34
+ ),
35
+ top_k=None,
36
+ )
37
+ return classifier
spiritlm/eval/stsp/stsp_constants.py ADDED
@@ -0,0 +1,12 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ from pathlib import Path
8
+
9
+ STSP_ROOT = Path(__file__).parents[3] / "data" / "stsp_data"
10
+ STSP_DATA_ROOT = STSP_ROOT / "data"
11
+ STSP_MODEL_ROOT = STSP_ROOT / "model"
12
+ STSP_MANIFEST_ROOT = STSP_ROOT / "manifest"
spiritlm/eval/stsp/utils.py ADDED
@@ -0,0 +1,122 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ from dataclasses import dataclass
8
+ from functools import cache
9
+ from typing import List, Optional, Tuple
10
+
11
+ import torch
12
+ import torchaudio
13
+ from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
14
+
15
+ EXPRESSO_EMOTION_2_SENTIMENT = {
16
+ "happy": "positive",
17
+ "angry": "negative",
18
+ "sad": "negative",
19
+ "default": "neutral",
20
+ }
21
+
22
+ EMOTION_2_SENTIMENT = {
23
+ "happy": "positive",
24
+ "angry": "negative",
25
+ "sad": "negative",
26
+ "default": "neutral",
27
+ "neutral": "neutral",
28
+ "amused": "positive",
29
+ }
30
+
31
+
32
+ @cache
33
+ def emotions2new_label_names_and_indices(
34
+ emotions_to_select: Tuple[str],
35
+ label_names: Tuple[str],
36
+ ) -> Tuple[List[int], List[str]]:
37
+ emotion2index = {e: i for i, e in enumerate(label_names)}
38
+ sorted_indices_emotions = sorted(
39
+ [(emotion2index[emotion], emotion) for emotion in emotions_to_select]
40
+ )
41
+ zipped = list(zip(*sorted_indices_emotions))
42
+ return zipped
43
+
44
+
45
+ def expresso_emotion2_sentiment(emotion: str):
46
+ return EXPRESSO_EMOTION_2_SENTIMENT[emotion]
47
+
48
+
49
+ @dataclass
50
+ class ExpressoEmotionClassifier:
51
+ feature_extractor: AutoFeatureExtractor
52
+ model: AutoModelForAudioClassification
53
+ label_names: List[str]
54
+
55
+
56
+ def load_emotion_classifier(checkpoint_path: str) -> ExpressoEmotionClassifier:
57
+ feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint_path)
58
+ model = (
59
+ AutoModelForAudioClassification.from_pretrained(checkpoint_path).cuda().eval()
60
+ )
61
+ label_names = [model.config.id2label[i] for i in range(model.config.num_labels)]
62
+ print(f"Classification model loaded from {checkpoint_path} !")
63
+ return ExpressoEmotionClassifier(feature_extractor, model, label_names)
64
+
65
+
66
+ @torch.inference_mode()
67
+ def predict_audio(
68
+ audio,
69
+ expresso_emotion_classifier: ExpressoEmotionClassifier,
70
+ emotions_to_predict: Optional[List[str]] = None,
71
+ ):
72
+ if isinstance(audio, str):
73
+ speech, _ = torchaudio.load(audio)
74
+ resampler = torchaudio.transforms.Resample(
75
+ expresso_emotion_classifier.feature_extractor.sampling_rate
76
+ )
77
+ speech = resampler(speech).squeeze().numpy()
78
+ else:
79
+ speech = audio
80
+
81
+ features = expresso_emotion_classifier.feature_extractor(
82
+ speech,
83
+ sampling_rate=expresso_emotion_classifier.feature_extractor.sampling_rate,
84
+ return_tensors="pt",
85
+ )
86
+ features["input_values"] = features["input_values"].cuda()
87
+
88
+ logits = expresso_emotion_classifier.model(**features).logits
89
+ if emotions_to_predict is not None:
90
+ (indices, label_names) = emotions2new_label_names_and_indices(
91
+ tuple(emotions_to_predict), tuple(expresso_emotion_classifier.label_names)
92
+ )
93
+ logits = logits[:, indices]
94
+ else:
95
+ label_names = expresso_emotion_classifier.label_names
96
+ pred_id = torch.argmax(logits, dim=-1)[0].item()
97
+
98
+ return label_names[pred_id], logits.detach().cpu().numpy()
99
+
100
+
101
+ def wav2emotion(
102
+ wav,
103
+ expresso_emotion_classifier: ExpressoEmotionClassifier,
104
+ emotions_to_predict: Optional[List[str]] = None,
105
+ ) -> str:
106
+ label_logits = predict_audio(
107
+ audio=wav,
108
+ expresso_emotion_classifier=expresso_emotion_classifier,
109
+ emotions_to_predict=emotions_to_predict,
110
+ )
111
+ pred_emotion = label_logits[0]
112
+ return pred_emotion
113
+
114
+
115
+ def wav2emotion_and_sentiment(
116
+ wav,
117
+ expresso_emotion_classifier: ExpressoEmotionClassifier,
118
+ emotions_to_predict: Optional[List[str]] = None,
119
+ ) -> Tuple[str, str]:
120
+ pred_emotion = wav2emotion(wav, expresso_emotion_classifier, emotions_to_predict)
121
+ mapped_sentiment = expresso_emotion2_sentiment(pred_emotion)
122
+ return pred_emotion, mapped_sentiment
spiritlm/eval/utils.py ADDED
@@ -0,0 +1,17 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the FAIR Noncommercial Research License
5
+ # found in the LICENSE file in the root directory of this source tree.
6
+
7
+ import torchaudio
8
+ from spiritlm.model.spiritlm_model import Spiritlm
9
+
10
+
11
+ def wav_prompt(spiritlm_model: Spiritlm, wav_path: str) -> str:
12
+ wav = torchaudio.load(wav_path)[0].squeeze(0)
13
+ return spiritlm_model.SPEECH_PROMPT_PREFIX + spiritlm_model.speech_tokenizer(wav)
14
+
15
+
16
+ def text_prompt(spiritlm_model: Spiritlm, text: str) -> str:
17
+ return spiritlm_model.TEXT_PROMPT_PREFIX + text
spiritlm/model/README.md ADDED
@@ -0,0 +1,82 @@
1
+ # Model for Spirit LM
2
+ This directory contains the Spirit LM model wrapper.
3
+
4
+ ## Usage examples
5
+
6
+ ### Model Loading
7
+ ```python
8
+ from spiritlm.model.spiritlm_model import Spiritlm
9
+
10
+ # Spirit LM Base 7B
11
+ spirit_lm = Spiritlm("spirit-lm-base-7b")
12
+
13
+ # Spirit LM Expressive 7B
14
+ spirit_lm = Spiritlm("spirit-lm-expressive-7b")
15
+ ```
16
+
17
+ ### Generation examples
18
+ ```python
19
+ from spiritlm.model.spiritlm_model import OutputModality, GenerationInput, ContentType
20
+ from transformers import GenerationConfig
21
+
22
+ # Generate only text
23
+ spirit_lm.generate(
24
+ output_modality=OutputModality.TEXT,
25
+ interleaved_inputs=[
26
+ GenerationInput(
27
+ content="The largest country in the world is",
28
+ content_type=ContentType.TEXT,
29
+ )
30
+ ],
31
+ generation_config=GenerationConfig(
32
+ temperature=0.9,
33
+ top_p=0.95,
34
+ max_new_tokens=50,
35
+ do_sample=True,
36
+ ),
37
+ )
38
+
39
+ # Expected output format:
40
+ # [GenerationOuput(content='Russia, with an area of ...', content_type=<ContentType.TEXT: 'TEXT'>)]
41
+
42
+ # Generate only speech
43
+ spirit_lm.generate(
44
+ output_modality=OutputModality.SPEECH,
45
+ interleaved_inputs=[
46
+ GenerationInput(
47
+ content="examples/audio/7143-88743-0029.flac",
48
+ content_type=ContentType.SPEECH,
49
+ )
50
+ ],
51
+ generation_config=GenerationConfig(
52
+ temperature=0.9,
53
+ top_p=0.95,
54
+ max_new_tokens=200,
55
+ do_sample=True,
56
+ ),
57
+ )
58
+
59
+ # Expected output format:
60
+ # [GenerationOuput(content=array([ 3.6673620e-05, 2.6468514e-04, 1.0735081e-03, ...,], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>)]
61
+
62
+
63
+ # Arbitrary generation
64
+ spirit_lm.generate(
65
+ output_modality=OutputModality.ARBITRARY,
66
+ interleaved_inputs=[
67
+ GenerationInput(
68
+ content="examples/audio/7143-88743-0029.flac",
69
+ content_type=ContentType.SPEECH,
70
+ )
71
+ ],
72
+ generation_config=GenerationConfig(
73
+ temperature=0.9,
74
+ top_p=0.95,
75
+ max_new_tokens=200,
76
+ do_sample=True,
77
+ ),
78
+ )
79
+ # Expected output format is a list of GenerationOuput where content type could be `ContentType.TEXT' or `ContentType.SPEECH`:
80
+ # [GenerationOuput(content='xxx', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([ 0.00553902, -0.03210586, ... ], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>), GenerationOuput(content='yyy', content_type=<ContentType.TEXT: 'TEXT'>), GenerationOuput(content=array([0.04051103, 0.03596291, 0.03381396, ..., 0.05103811, 0.05429034, ..,,], dtype=float32), content_type=<ContentType.SPEECH: 'SPEECH'>)]
81
+ ```
82
+ See more examples with other modalities in [examples/speech_generation/spirit_model.ipynb](../../examples/speech_generation/spirit_model.ipynb).