vllm.model_executor.layers.quantization.utils.quant_utils ¶
This module is used by /tests and /benchmarks.
kDynamic128Scale module-attribute ¶
kDynamic128Scale = ScaleDesc(
float32, False, GroupShape(1, 128)
)
kFp8Dynamic128Sym module-attribute ¶
kFp8Dynamic128Sym = QuantKey(
FP8_DTYPE, kDynamic128Scale, symmetric=True
)
kFp8Dynamic64Sym module-attribute ¶
kFp8Dynamic64Sym = QuantKey(
FP8_DTYPE, kDynamic64Scale, symmetric=True
)
kFp8DynamicTensorSym module-attribute ¶
kFp8DynamicTensorSym = QuantKey(
FP8_DTYPE, kDynamicTensorScale, symmetric=True
)
kFp8DynamicTokenSym module-attribute ¶
kFp8DynamicTokenSym = QuantKey(
FP8_DTYPE, kDynamicTokenScale, symmetric=True
)
kFp8StaticTensorSym module-attribute ¶
kFp8StaticTensorSym = QuantKey(
FP8_DTYPE, kStaticTensorScale, symmetric=True
)
kNvfp4GroupScale module-attribute ¶
kNvfp4GroupScale = ScaleDesc(
FP8_DTYPE, False, GroupShape(1, 16)
)
kNvfp4Quant module-attribute ¶
kNvfp4Quant = QuantKey(
FP4_DTYPE,
scale=kNvfp4GroupScale,
scale2=kStaticTensorScale,
)
GroupShape ¶
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
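A minimal pure-Python sketch of such a group-shape type, using a `-1` sentinel to mean "the full extent of that dimension" (the sentinel convention and the exact static members are assumptions, inferred from the per-tensor/per-token cases named above):

```python
from typing import NamedTuple

class GroupShapeSketch(NamedTuple):
    """Sketch of a quantization group shape: (rows, cols) covered by one scale.

    A -1 entry is assumed to mean "the full extent of that dimension".
    """
    row: int
    col: int

# Static members for the common shapes mentioned in the description:
PER_TENSOR = GroupShapeSketch(-1, -1)  # one scale for the whole tensor
PER_TOKEN = GroupShapeSketch(1, -1)    # one scale per row (token)
```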
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
QuantKey dataclass ¶
Class for identifying the type of quantization.

- dtype: quantized data type
- scale: scale descriptor
- scale2: second-level scale descriptor
- symmetric: symmetric if True, asymmetric if False
ScaleDesc dataclass ¶
Class for describing a single quantization scaling factor.

- dtype: data type of the scale
- static: static scale if True, dynamic if False
- group_shape: group shape of the scale
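How `ScaleDesc` and `QuantKey` compose can be sketched in plain Python, mirroring the `kFp8Dynamic128Sym` attribute listed above (the field names come from these docs; the stand-in string dtypes are placeholders for the real torch dtypes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScaleDescSketch:
    """Sketch: one scaling factor, per the field docs above."""
    dtype: str              # stand-in for a torch dtype
    static: bool            # static scale if True, dynamic if False
    group_shape: tuple      # group shape of the scale

@dataclass(frozen=True)
class QuantKeySketch:
    """Sketch: identifies a quantization scheme by dtype + scale descriptors."""
    dtype: str
    scale: ScaleDescSketch
    scale2: Optional[ScaleDescSketch] = None
    symmetric: bool = True

# Mirrors kFp8Dynamic128Sym = QuantKey(FP8_DTYPE, kDynamic128Scale, symmetric=True):
kDynamic128ScaleSketch = ScaleDescSketch("float32", False, (1, 128))
kFp8Dynamic128SymSketch = QuantKeySketch("fp8_e4m3", kDynamic128ScaleSketch, symmetric=True)
```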
__str__ ¶
_GroupShape ¶
_normalize_quant_group_shape ¶
_normalize_quant_group_shape(
x: Tensor, group_shape: GroupShape
)
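The likely job of this helper, sketched in plain Python: resolve `-1` sentinels in a `(rows, cols)` group shape against the tensor's last two dimensions (this resolution behavior is an assumption based on the per-tensor/per-token sentinel convention):

```python
def normalize_quant_group_shape_sketch(shape, group_shape):
    """Sketch: replace -1 entries in (rows, cols) with the tensor's
    actual last-two-dim sizes (assumed behavior of the real helper)."""
    rows, cols = shape[-2], shape[-1]
    g_rows, g_cols = group_shape
    return (rows if g_rows == -1 else g_rows,
            cols if g_cols == -1 else g_cols)
```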
awq_pack ¶
convert_bf16_scales_to_fp8 ¶
Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
convert_packed_uint4b8_to_signed_int4_inplace ¶
Convert int4b8 (packed to int32) to signed int4
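Reading "int4b8" as 4-bit values stored with a bias of 8 (an unsigned nibble `u` encodes the signed value `u - 8` — this interpretation is an assumption), the per-word conversion can be sketched as:

```python
def uint4b8_word_to_signed_sketch(packed_word):
    """Sketch: split one 32-bit word into eight 4-bit fields and remove
    the assumed +8 bias, yielding signed int4 values in [-8, 7]."""
    out = []
    for i in range(8):
        nibble = (packed_word >> (4 * i)) & 0xF  # lowest nibble first
        out.append(nibble - 8)
    return out
```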
cutlass_fp4_supported ¶
cutlass_fp4_supported() -> bool
get_fp8_min_max ¶
Get the min and max values for FP8 quantization.
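For the common FP8 formats the finite ranges are symmetric: `float8_e4m3fn` tops out at 448.0 and `float8_e5m2` at 57344.0. A stdlib-only sketch (the real function presumably derives these from the torch dtype via `torch.finfo`):

```python
def get_fp8_min_max_sketch(fmt="e4m3fn"):
    """Sketch: finite (min, max) for common FP8 formats.
    e4m3fn has no inf encoding and a max of 448.0; e5m2 maxes at 57344.0."""
    fp8_max = {"e4m3fn": 448.0, "e5m2": 57344.0}[fmt]
    return -fp8_max, fp8_max
```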
get_pack_factor ¶
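Since the packing helpers in this module target 32-bit words, the pack factor is presumably how many quantized values of a given bit width fit in one `int32` — a one-liner sketch:

```python
def get_pack_factor_sketch(num_bits):
    """Sketch: number of num_bits-wide values that fit in a 32-bit word."""
    assert 32 % num_bits == 0, "only bit widths that divide 32"
    return 32 // num_bits
```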
gptq_pack ¶
gptq_quantize_weights ¶
gptq_quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int,
act_order: bool,
test_perm: Tensor | None = None,
)
group_broadcast ¶
is_layer_skipped ¶
is_layer_skipped(
prefix: str,
ignored_layers: list[str],
fused_mapping: Mapping[
str, list[str]
] = MappingProxyType({}),
*,
skip_with_substr: bool = False,
) -> bool
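The signature suggests a name-based skip check: a layer whose `prefix` appears in `ignored_layers` is excluded from quantization, with `skip_with_substr` loosening exact matching to substring matching. A simplified sketch (the `fused_mapping` handling, which presumably expands fused module names, is omitted):

```python
def is_layer_skipped_sketch(prefix, ignored_layers, skip_with_substr=False):
    """Sketch: decide whether a layer (by dotted name prefix) is excluded
    from quantization. With skip_with_substr, substring matches count too."""
    if skip_with_substr:
        return any(s in prefix for s in ignored_layers)
    return prefix in ignored_layers
```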
pack_cols ¶
pack_quantized_values_into_int32 ¶
pack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)
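For the 4-bit case, packing eight values per `int32` can be sketched in plain Python (the lowest-nibble-first bit layout is an assumption; the real function works on tensors and supports other widths via the scalar type):

```python
def pack_int4_into_int32_sketch(values):
    """Sketch: pack unsigned 4-bit values (0..15) into 32-bit words,
    eight per word, lowest nibble first (bit layout is assumed)."""
    assert len(values) % 8 == 0, "length must be a multiple of the pack factor"
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)
    return words
```

`unpack_quantized_values_into_int32` (below) is presumably the inverse operation.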
pack_rows ¶
permute_rows ¶
quantize_weights ¶
quantize_weights(
w: Tensor,
quant_type: ScalarType,
group_size: int | None,
zero_points: bool = False,
ref_zero_points_after_scales: bool = False,
)
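The core of symmetric group quantization (the `zero_points=False` path) can be sketched on a 1-D list: each group of `group_size` values gets `scale = max|w| / qmax`, and values are rounded to integers in that scale. This is a conceptual sketch, not the tensorized implementation:

```python
def quantize_weights_sketch(w, group_size, qmax=7):
    """Sketch: symmetric per-group quantization of a 1-D weight list.
    Each group: scale = max|w| / qmax; q = round(w / scale).
    qmax=7 corresponds to a signed 4-bit range."""
    scales, q = [], []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        scale = max(abs(x) for x in group) / qmax or 1.0  # guard all-zero groups
        scales.append(scale)
        q.extend(round(x / scale) for x in group)
    return q, scales
```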
scaled_dequantize ¶
scaled_dequantize(
x_q: Tensor,
x_s: Tensor,
group_shape: GroupShape | None = None,
out_dtype: dtype = float32,
) -> tuple[Tensor, Tensor]
scaled_quantize ¶
scaled_quantize(
x: Tensor,
group_shape: GroupShape,
quant_dtype: dtype,
compute_dtype: dtype | None = None,
) -> tuple[Tensor, Tensor]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor to quantize | required |
| `group_shape` | `GroupShape` | Shape of quantization groups | required |
| `quant_dtype` | `dtype` | Target quantized dtype (e.g., `torch.float8_e4m3fn`) | required |
| `compute_dtype` | `dtype \| None` | Optional dtype for intermediate computations. If `None`, uses the input dtype. Use `torch.float32` for higher precision. | `None` |
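The quantize/dequantize pair can be illustrated per-tensor in plain Python: quantization divides by a scale chosen so values fit the target range (448.0 mimics FP8 e4m3), and dequantization multiplies the scale back. A conceptual sketch, not the tensorized vLLM implementation:

```python
def scaled_quantize_sketch(x, qmax=448.0):
    """Sketch: per-tensor scaled quantization; returns (x_q, scale).
    qmax=448.0 mimics the FP8 e4m3 finite range."""
    scale = max(abs(v) for v in x) / qmax
    return [v / scale for v in x], scale

def scaled_dequantize_sketch(x_q, scale):
    """Sketch: inverse of the above -- multiply the scale back in."""
    return [v * scale for v in x_q]
```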
sort_weights ¶
swizzle_blockscale ¶
Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.
Parameters¶
scale: torch.Tensor
Returns¶
torch.Tensor: The swizzled tensor with the same logical shape as `scale`.
unpack_cols ¶
unpack_quantized_values_into_int32 ¶
unpack_quantized_values_into_int32(
w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)