Degraded point performance on some examples compared to the playground
When I run this query on the image below in the playground, I get very good results: the majority of the spectators in the image are detected.
In contrast, running the code below locally on the same image finds only a single spectator. Does anyone know why? What is the difference between the playground and the model from Hugging Face?
import torch
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM

moondream_model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device_map={"": "cuda"},
)

image = Image.open("baseball.png")
result = moondream_model.point(image, "spectator")

for i, point in enumerate(result["points"]):
    print(f"Point {i+1}: x={point['x']:.3f}, y={point['y']:.3f}")
One thing that comes to mind is that the playground runs the model in bfloat16 -- since it was trained in that precision it's possible running in float32 causes issues?
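If you want to test that, it should only require changing the dtype passed to from_pretrained. A minimal sketch (untested, keeping the rest of your script unchanged):

from transformers import AutoModelForCausalLM
import torch

moondream_model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream3-preview",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # match the training precision instead of float32
    device_map={"": "cuda"},
)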
Me too. The performance on their website is much better than when I download the model and test it locally.
I ran the model in torch.bfloat16 but still got different results compared to the playground.
Hi, have you found a solution to this issue?
Can you share the image and your results from running it locally?
Thanks for your reply. The image is attached, and the text for the point query is "all floor areas".
The code I use locally is:
from transformers import AutoModelForCausalLM
from PIL import Image
import torch
import time
import matplotlib.pyplot as plt

if __name__ == '__main__':
    model = AutoModelForCausalLM.from_pretrained(
        "moondream/moondream3-preview",
        trust_remote_code=True,
        dtype=torch.bfloat16,
        device_map="cuda"
    )
    model.eval()
    model.compile()

    with torch.inference_mode():
        image = Image.open("/media/jcx/SSD_2T/NeRF_Dataset/Replica/room0/rgb/frame000000.jpg")
        encoded_image = model.encode_image(image)
        result = model.point(encoded_image, "all floor areas")
        points = result["points"]
        print(f"Found {len(points)} all floor areas")

        # Visualize the points
        plt.figure(figsize=(10, 10))
        plt.imshow(image)
        for point in points:
            # Convert normalized coordinates to pixel values
            x = point["x"] * image.width
            y = point["y"] * image.height

            # Plot the point
            plt.plot(x, y, 'ro', markersize=15, alpha=0.7)
            plt.text(
                x + 10, y, "Face",
                color='white', fontsize=12,
                bbox=dict(facecolor='red', alpha=0.5)
            )

        plt.axis('off')
        plt.savefig("output_with_points.jpg")
        plt.show()
And the results from the local code are:
I would also be very interested to know why performance differs so much between the playground and local deployment, because the playground version runs really well.
@JCX1999 Thanks for sharing your code. The Cloud API (and by extension the playground) uses our new inference engine (Kestrel), which could be contributing to a slight change in outputs. Has the playground been consistently better, or is it just this example?
Aside: can you give me some context as to your use case?
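If it helps to isolate whether the gap comes from the inference engine rather than your local setup, you could run the same image and query against the Cloud API from Python and compare the returned points with your local output. A rough sketch, assuming the `moondream` pip package's cloud client (`md.vl`) and its `point` method; check the current Cloud API docs for the exact interface, and note the API key is a placeholder:

import moondream as md
from PIL import Image

# Assumption: the `moondream` package exposes a cloud client via md.vl(api_key=...).
cloud_model = md.vl(api_key="YOUR_API_KEY")

image = Image.open("/media/jcx/SSD_2T/NeRF_Dataset/Replica/room0/rgb/frame000000.jpg")
result = cloud_model.point(image, "all floor areas")

# Compare the number and positions of points against the local run
print(f"Cloud API found {len(result['points'])} points")
for p in result["points"]:
    print(f"x={p['x']:.3f}, y={p['y']:.3f}")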
@err805 Thank you for your reply. I’ve found the Playground consistently better. I’m using Moondream for semantic labeling and was wondering if it’s possible to run the model locally with the new Kestrel inference engine.


