27 changes: 27 additions & 0 deletions vllm/entrypoints/context.py
@@ -266,11 +266,38 @@ def __init__(
         self.chat_template = chat_template
         self.chat_template_content_format = chat_template_content_format

+        self.input_messages: list[ResponseRawMessageAndToken] = []
+        self.output_messages: list[ResponseRawMessageAndToken] = []
+
     def append_output(self, output: RequestOutput) -> None:
         self.num_prompt_tokens = len(output.prompt_token_ids or [])
         self.num_cached_tokens = output.num_cached_tokens or 0
         self.num_output_tokens += len(output.outputs[0].token_ids or [])
         self.parser.process(output.outputs[0])
+        output_prompt = output.prompt or ""
+        output_prompt_token_ids = output.prompt_token_ids or []
+        if len(self.input_messages) == 0:
+            self.input_messages.append(
+                ResponseRawMessageAndToken(
+                    message=output_prompt,
+                    tokens=output_prompt_token_ids,
+                )
+            )
Comment on lines +279 to +285
Contributor (severity: high):
The input_messages list is only populated during the first call to append_output due to the if len(self.input_messages) == 0: condition. In a multi-turn conversation, subsequent prompts (which are part of output.prompt) will not be added to input_messages. This will result in input_messages not accurately reflecting all prompts sent to the model across turns, which is likely unintended for a comprehensive input message log.

            self.input_messages.append(
                ResponseRawMessageAndToken(
                    message=output_prompt,
                    tokens=output_prompt_token_ids,
                )
            )
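A minimal sketch of one possible direction (hedged: `RawMessageAndToken` and `TurnLog` below are simplified stand-ins for `ResponseRawMessageAndToken` and the conversation context, and it assumes `prompt_token_ids` replays the entire conversation on every turn): track how many prompt tokens have already been logged and append only the per-turn delta, so later turns are not dropped.

```python
from dataclasses import dataclass, field


@dataclass
class RawMessageAndToken:
    """Stand-in for ResponseRawMessageAndToken (message text plus token ids)."""
    message: str
    tokens: list[int]


@dataclass
class TurnLog:
    """Simplified stand-in for the conversation context in context.py."""
    input_messages: list[RawMessageAndToken] = field(default_factory=list)
    output_messages: list[RawMessageAndToken] = field(default_factory=list)
    _seen_prompt_tokens: int = 0  # prompt tokens already recorded in earlier turns

    def append_output(self, prompt: str, prompt_token_ids: list[int],
                      output_text: str, output_token_ids: list[int]) -> None:
        # Record only the prompt tokens that are new this turn, so every
        # turn's input is captured instead of just the first one.
        new_tokens = prompt_token_ids[self._seen_prompt_tokens:]
        if new_tokens:
            self.input_messages.append(
                RawMessageAndToken(message=prompt, tokens=new_tokens)
            )
            self._seen_prompt_tokens = len(prompt_token_ids)
        self.output_messages.append(
            RawMessageAndToken(message=output_text, tokens=output_token_ids)
        )
```

Slicing on token counts avoids re-parsing the replayed prompt, though `message` here still carries the full prompt text; splitting the text itself per turn would need the parser fixes mentioned in the TODOs below.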

+        else:
+            # TODO: merge them in properly together
+            # TODO: responsesParser doesn't parse kimi k2 sentences correctly
Comment on lines +287 to +288
Contributor (severity: high):
These TODO comments indicate known issues with merging messages and parsing kimi k2 sentences. Since output_messages is directly used in serving_responses.py (line 663), these unresolved issues could lead to incorrect or incomplete data being returned in the API response, especially in multi-turn scenarios. This impacts the correctness of the API output.
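To make the concern concrete, a hedged illustration (plain dicts stand in for `ResponseRawMessageAndToken`; this only abstracts the branching visible above): on a later turn the replayed prompt is appended to `output_messages` as if it were model output.

```python
output_messages: list[dict] = []


def append_output(prompt: str, completion: str, first_turn: bool) -> None:
    # Abstraction of the else-branch above: on non-first turns the whole
    # replayed prompt is logged alongside the completions.
    if not first_turn:
        output_messages.append({"message": prompt})
    output_messages.append({"message": completion})


append_output("user: hi", "hello!", first_turn=True)
append_output("user: hi / assistant: hello! / user: and?", "sure.", first_turn=False)

# output_messages now interleaves the full second-turn prompt with the two
# completions, so a client reading it as pure model output would be misled.
print([m["message"] for m in output_messages])
```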

+            self.output_messages.append(
+                ResponseRawMessageAndToken(
+                    message=output_prompt,
+                    tokens=output_prompt_token_ids,
+                )
+            )
+        self.output_messages.append(
+            ResponseRawMessageAndToken(
+                message=output.outputs[0].text,
+                tokens=output.outputs[0].token_ids,
+            )
+        )

     def append_tool_output(self, output: list[ResponseInputOutputItem]) -> None:
         self.parser.response_messages.extend(output)
1 change: 1 addition & 0 deletions vllm/entrypoints/openai/serving_engine.py
@@ -1339,6 +1339,7 @@ async def _generate_with_builtin_tools(
             )
             engine_prompt = engine_prompts[0]
             request_prompt = request_prompts[0]
+            prompt_text, _, _ = self._get_prompt_components(request_prompt)
Contributor (severity: high):
The variable prompt_text is extracted here but is not used anywhere in the subsequent code within this elif block. This is dead code and should be removed to maintain code cleanliness and prevent confusion.


             # Update the sampling params.
             sampling_params.max_tokens = self.max_model_len - len(
9 changes: 4 additions & 5 deletions vllm/entrypoints/openai/serving_responses.py
@@ -318,6 +318,8 @@
         if maybe_validation_error is not None:
             return maybe_validation_error

+        fbvscode.set_trace()
+
Check failure on line 321 in vllm/entrypoints/openai/serving_responses.py
GitHub Actions / pre-commit: Name "fbvscode" is not defined [name-defined]
GitHub Actions / pre-commit: Ruff (F821) vllm/entrypoints/openai/serving_responses.py:321:9: F821 Undefined name `fbvscode`
Contributor (severity: critical):
A debugger breakpoint (fbvscode.set_trace()) has been left in the code. This should be removed before merging to avoid unexpected behavior or blocking execution in a production environment.


         # If the engine is dead, raise the engine's DEAD_ERROR.
         # This is required for the streaming case, where we return a
         # success status before we actually start generating text :).

@@ -656,12 +658,9 @@
         ]
         output = make_response_output_items_from_parsable_context(response_messages)

-        # TODO: context for non-gptoss models doesn't use messages
-        # so we can't get them out yet
         if request.enable_response_messages:
-            raise NotImplementedError(
-                "enable_response_messages is currently only supported for gpt-oss"
-            )
+            input_messages = context.input_messages
+            output_messages = context.output_messages
Comment on lines +662 to +663
Contributor (severity: high):
The input_messages and output_messages from the context are being directly assigned here. As noted in vllm/entrypoints/context.py, input_messages might not be fully populated for multi-turn conversations, and output_messages has unresolved TODOs regarding proper merging and parsing. This could lead to incomplete or incorrect data being returned when enable_response_messages is true.
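Until those TODOs are resolved, a small defensive sketch (an assumption, not part of this PR; `collect_response_messages` is a hypothetical helper, and `context` is only assumed to expose the `input_messages`/`output_messages` attributes added in this diff) could at least flag a potentially incomplete log instead of returning it silently:

```python
import logging

logger = logging.getLogger(__name__)


def collect_response_messages(context, enable_response_messages: bool):
    """Return the raw message logs, warning when they look incomplete.

    Heuristic only: with the current append_output logic, a multi-turn request
    yields more output entries than input entries, because replayed prompts are
    logged as output while only the first prompt is logged as input.
    """
    if not enable_response_messages:
        return None, None
    input_messages = context.input_messages
    output_messages = context.output_messages
    if len(output_messages) > len(input_messages) + 1:
        logger.warning(
            "Response message log may be incomplete or mixed for multi-turn "
            "requests; see TODOs in vllm/entrypoints/context.py."
        )
    return input_messages, output_messages
```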


         # TODO: Calculate usage.
         # assert final_res.prompt_token_ids is not None