RangiLyu committed on Jul 1, 2024

Commit 7c42fba (verified) · 1 Parent(s): 19f48c2

Update README.md

Files changed (1)

README.md +24 -3

@@ -48,6 +48,8 @@ InternLM2.5-7B-Chat-1M is the 1M-long-context version of InternLM2.5-7B-Chat. Si
 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
 ```bash
 pip install lmdeploy
 ```
@@ -57,7 +59,12 @@ You can run batch inference locally with the following python code:
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.5, session_len=1048576)
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
@@ -69,6 +76,7 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 ### Import from Transformers
 To load the InternLM2 7B Chat model using Transformers, use the following code:
 ```python
@@ -114,6 +122,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
 Then you can send a chat request to the server:
 ```bash
@@ -164,6 +174,8 @@ InternLM2.5-7B-Chat-1M 支持 1 百万字超长上下文推理，且性能和 In
 LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a complete toolkit for compressing, deploying, and serving LLMs.
 ```bash
 pip install lmdeploy
 ```
@@ -174,8 +186,13 @@ pip install lmdeploy
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
-backend_config = TurbomindEngineConfig(rope_scaling_factor=2.5, session_len=1048576)
-pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
@@ -183,6 +200,8 @@ print(response)
 ### Load with Transformers
 Load the InternLM2.5 7B Chat 1M model with the following code:
 ```python
@@ -228,6 +247,8 @@ pip install vllm
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
 Then you can send a chat request to the server:
 ```bash

 LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.
+Here is an example of 1M-long context inference. **Note: 1M context length requires 4xA100-80G!**
 ```bash
 pip install lmdeploy
 ```
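The 4xA100-80G requirement noted above can be sanity-checked with a rough KV-cache estimate. The sketch below is back-of-the-envelope arithmetic, not a measurement. The architecture numbers (32 layers, 8 key/value heads, head dimension 128), the 16-bit cache, the approximate 15.4 GB weight footprint, and the reading of `cache_max_entry_count` as a fraction of the memory left free after loading weights are all assumptions that this README does not state. The helper names `kv_cache_bytes` and `cache_budget_bytes` are hypothetical.

```python
# Rough KV-cache sizing for a 1M-token session.
# ASSUMPTIONS (not taken from this README): 32 layers, 8 key/value heads,
# head dimension 128, 16-bit cache values, about 15.4 GB of 16-bit weights,
# and 80e9 bytes of memory per GPU. Activations and runtime overhead are ignored.

def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to store keys and values for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return tokens * per_token

def cache_budget_bytes(num_gpus, gpu_bytes=80e9, weight_bytes=15.4e9, cache_fraction=0.7):
    """Memory available to the KV cache across `num_gpus` devices.

    Assumes `cache_fraction` applies to the memory left free after the
    weights are loaded (an assumed reading of cache_max_entry_count).
    """
    free = num_gpus * gpu_bytes - weight_bytes
    return cache_fraction * free

session_len = 1048576  # same value as session_len in the config that follows
needed = kv_cache_bytes(session_len)
print(f"KV cache for {session_len} tokens: {needed / 2**30:.0f} GiB")

for gpus in (1, 2, 4):
    budget = cache_budget_bytes(gpus)
    verdict = "fits" if budget >= needed else "does not fit"
    print(f"{gpus} x 80 GB: cache budget {budget / 2**30:.0f} GiB -> {verdict}")
```

Under these assumptions a single 1M-token sequence needs about 128 GiB of cache, which exceeds the budget on one or two 80 GB cards and fits on four. That is consistent with `tp=4`, `max_batch_size=1`, and `cache_max_entry_count=0.7` in the configuration that follows.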
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G.
 pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 ### Import from Transformers
+Since Transformers does not support 1M long context, we only show the usage of non-long context.
 To load the InternLM2 7B Chat model using Transformers, use the following code:
 ```python
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --served-model-name internlm2_5-7b-chat-1m --trust-remote-code
 ```
+If you encounter OOM, try to reduce `--max-model-len` or increase `--tensor-parallel-size`.
 Then you can send a chat request to the server:
 ```bash
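 LMDeploy, jointly developed by the MMDeploy and MMRazor teams, is a complete toolkit for compressing, deploying, and serving LLMs.
+Here is an example of 1M-context inference. **Note: 1M context requires 4xA100-80G!**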
 ```bash
 pip install lmdeploy
 ```
 ```python
 from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
+backend_config = TurbomindEngineConfig(
+        rope_scaling_factor=2.5,
+        session_len=1048576,  # 1M context length
+        max_batch_size=1,
+        cache_max_entry_count=0.7,
+        tp=4)  # 4xA100-80G.
+pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
 prompt = 'Use a long prompt to replace this sentence'
 response = pipe(prompt)
 print(response)
 ### Load with Transformers
+Since Transformers cannot support 1M long-context inference, only non-long-context usage is shown here.
 Load the InternLM2.5 7B Chat 1M model with the following code:
 ```python
 python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-7b-chat-1m --trust-remote-code
 ```
+If you encounter OOM, reduce `--max-model-len` or increase `--tensor-parallel-size`.
 Then you can send a chat request to the server:
 ```bash