Update README.md
README.md
CHANGED
@@ -48,6 +48,8 @@ InternLM2.5-7B-Chat-1M is the 1M-long-context version of InternLM2.5-7B-Chat. Si
 
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 
+Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -57,7 +59,12 @@ You can run batch inference locally with the following python code:
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G.
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
@@ -69,6 +76,7 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 ### Import from Transformers
 
+Since Transformers does not support 1M long context, we only show the usage of non-long context.
 To load the InternLM2 7B Chat model using Transformers, use the following code:
 
 ```python
@@ -114,6 +122,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
@@ -164,6 +174,8 @@ InternLM2.5-7B-Chat-1M supports ultra-long-context inference over 1 million characters, with performance on par with In
 
 LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a complete toolkit for compressing, deploying, and serving LLMs.
 
+Here is an example of 1M-context inference. **Note: 1M context requires 4xA100-80G!**
+
 ```bash
 pip install lmdeploy
 ```
@@ -174,8 +186,13 @@ pip install lmdeploy
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
 
-backend_config = TurbomindEngineConfig(
-
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G.
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
@@ -183,6 +200,8 @@ print(response)
 
 ### Import from Transformers
 
+Since Transformers does not support 1M long-context inference, only non-long-context usage is shown here.
+
 Load the InternLM2.5 7B Chat 1M model with the following code:
 
 ```python
@@ -228,6 +247,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
 
+If you encounter OOM, reduce `--max-model-len` or increase `--tensor-parallel-size`.
+
 Then you can send a chat request to the server:
 
 ```bash
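The 4xA100-80G requirement that this commit adds for 1M-context LMDeploy inference can be sanity-checked with rough arithmetic. The sketch below estimates the fp16 key/value cache for one session of 1,048,576 tokens. The architecture numbers (32 layers, 8 key/value heads, head dimension 128) are assumptions about InternLM2.5-7B and do not come from this commit; check the model's `config.json` before relying on them.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` tokens."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


SESSION_LEN = 1048576  # matches session_len in the commit

# Assumed architecture values, NOT taken from the commit.
cache = kv_cache_bytes(SESSION_LEN, layers=32, kv_heads=8, head_dim=128)
cache_gib = cache / 2**30
print(f"KV cache for one 1M-token session: {cache_gib:.0f} GiB")
```

Under these assumptions the cache alone comes to about 128 GiB, more than a single 80 GB card can hold. That is consistent with splitting the model across GPUs (`tp=4`) and serving one session at a time (`max_batch_size=1`), though the real footprint also depends on weights, activations, and how LMDeploy applies `cache_max_entry_count`.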
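The OOM advice for vLLM amounts to adding two flags to the server command shown in the diff. A minimal sketch that assembles that command with the flags attached; the helper name `vllm_serve_args` and the example values (262144 tokens, 4 GPUs) are hypothetical and should be tuned to the available hardware.

```python
import shlex


def vllm_serve_args(model, max_model_len=None, tensor_parallel_size=None):
    """Build the argv for the vLLM OpenAI-compatible server."""
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--trust-remote-code",
    ]
    if max_model_len is not None:
        # A smaller context window shrinks the KV cache vLLM must reserve.
        args += ["--max-model-len", str(max_model_len)]
    if tensor_parallel_size is not None:
        # More GPUs spread the weights and the cache across devices.
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return args


# Hypothetical values for illustration only.
cmd = vllm_serve_args("internlm/internlm2_5-7b-chat-1m",
                      max_model_len=262144, tensor_parallel_size=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` to launch the server, or the printed line can be pasted into a shell.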
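The chat request that the README sends after starting the vLLM server can also be built from Python. This sketch only constructs the request, so it runs without a server. The port 8000 and the `/v1/chat/completions` path are the usual defaults of vLLM's OpenAI-compatible server and are assumptions here, as are the helper name `build_chat_request` and the sample prompt. The default model name matches `--served-model-name` in the English README command.

```python
import json
import urllib.request


def build_chat_request(prompt, model="internlm2_5-7b-chat-1m",
                       base_url="http://localhost:8000"):
    """Create a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Introduce deep learning to me.")
print(request.full_url)
# With the server running, send it and read the reply:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

For the command in the Chinese section, which omits `--served-model-name`, the `model` argument would presumably need to be the full `internlm/internlm2_5-7b-chat-1m` identifier.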