Alibaba-Research-Intelligence-Computing/Tora

@@ -91,32 +91,6 @@ All videos are available in this [Link](https://cloudbook-public-daily.oss-cn-ha
 - [x] Release diffusers version and optimize the GPU memory usage
 - [x] Release complete version of Tora
-## 🧨 Diffusers verision
-Please refer to [the diffusers version](diffusers-version/README.md) for details.
-## 🐍 Installation
-Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.
-```bash
-# Clone this repository.
-git clone https://github.com/alibaba/Tora.git
-cd Tora
-# Install Pytorch (we use Pytorch 2.4.0) and torchvision following the official instructions: https://pytorch.org/get-started/previous-versions/. For example:
-conda create -n tora python==3.10
-conda activate tora
-conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia
-# Install requirements
-cd modules/SwissArmyTransformer
-pip install -e .
-cd ../../sat
-pip install -r requirements.txt
-cd ..
-```
 ## 📦 Model Weights
 ### Folder Structure
@@ -182,91 +156,6 @@ git clone https://www.modelscope.cn/xiaoche/Tora.git
     - T5: [text_encoder](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/text_encoder), [tokenizer](https://huggingface.co/THUDM/CogVideoX-2b/tree/main/tokenizer)
 - Tora t2v model weights: [Link](https://cloudbook-public-daily.oss-cn-hangzhou.aliyuncs.com/Tora_t2v/mp_rank_00_model_states.pt). Downloading this weight requires following the [CogVideoX License](CogVideoX_LICENSE).
-## 🔄 Inference
-### Text to Video
-It requires around 30 GiB GPU memory tested on NVIDIA A100.
-```bash
-cd sat
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
-```
-You can change the `--input-file` and `--point_path` to your own prompts and trajectory points files. Please note that the trajectory is drawn on a 256x256 canvas.
-Replace `$N_GPU` with the number of GPUs you want to use.
-### Image to Video
-```bash
-cd sat
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora_i2v.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/i2v --output-dir samples --point_path trajs/sawtooth.txt --input-file assets/text/i2v/examples.txt --img_dir assets/images --image2video
-```
-The first frame images should be placed in the `--img_dir`. The names of these images should be specified in the corresponding text prompt in `--input-file`, seperated by `@@`.
-### Recommendations for Text Prompts
-For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
-You can refer to the following resources for guidance:
-- [CogVideoX Documentation](https://github.com/THUDM/CogVideo/blob/main/inference/convert_demo.py)
-- [OpenSora Scripts](https://github.com/hpcaitech/Open-Sora/blob/main/scripts/inference.py)
-## 🖥️ Gradio Demo
-Usage:
-```bash
-cd sat
-python app.py --load ckpts/tora/t2v
-```
-## 🧠 Training
-### Data Preparation
-Following this guide https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset, structure the datasets as follows:
-```
-.
-├── labels
-│   ├── 1.txt
-│   ├── 2.txt
-│   ├── ...
-└── videos
-    ├── 1.mp4
-    ├── 2.mp4
-    ├── ...
-```
-Training data examples are in `sat/training_examples`
-### Text to Video
-It requires around 60 GiB GPU memory tested on NVIDIA A100.
-Replace `$N_GPU` with the number of GPUs you want to use.
-- Stage 1
-```bash
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_dense.yaml --experiment-name "t2v-stage1"
-```
-- Stage 2
-```bash
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU train_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/train_sparse.yaml --experiment-name "t2v-stage2"
-```
-## 🎯 Troubleshooting
-### 1. ValueError: Non-consecutive added token...
-Upgrade the transformers package to 4.44.2. See [this](https://github.com/THUDM/CogVideo/issues/213) issue.
 ## 🤝 Acknowledgements
 We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
