ggml-cpu: enable IBM NNPA Vector Intrinsics (llama/14317)
* ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
* ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
* ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
* ggml-cpu: better variable names
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
* docs: update s390x docs
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
* ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove sigint from fp16 store
For some reason, the function is not getting a hit when debugged with
gdb. We will need to investigate further.
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: rework noop
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: disable fp32->fp16 nnpa conversions for now
There are some conversion failures in NNPA that require the eyes of an
IBM STSM. A separate PR will introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix typo
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add missing __func__
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: diagnose why __NNPA__ macro is not being defined
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: update macro tests
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: test more macros
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add debug prints
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 157f856c34589566151630e294563a420702db39)
* ggml-cpu: move things around
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo <[email protected]>
* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo <[email protected]>
* docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo <[email protected]>
* ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo <[email protected]>
* ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: remove mistaken fallback macro
Fallback logic was already implemented, but I was too sleepy to realise it.
Signed-off-by: Aaron Teo <[email protected]>
* ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo <[email protected]>
* ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
* ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo <[email protected]>
* ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo <[email protected]>
* Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo <[email protected]>
* fix ggml time initialization
* fix f32_f16 table init
* remove extra line
---------
Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by:
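For reference, the fp16->fp32 lookup table that the last several commits move back and forth has roughly this shape; a sketch assuming the final layout keeps the table in ggml-cpu.c behind C linkage (the exact declaration and init path may differ):

#include <stdint.h>
#include <string.h>

// Shared fp16 -> fp32 lookup table. extern "C" keeps a single symbol visible
// to both the C translation units (ggml-cpu.c, quants.c) and the C++ ones
// (repack.cpp, ops.cpp, amx/mmq.cpp), which is what the extern c commit fixes.
#ifdef __cplusplus
extern "C" {
#endif

// one precomputed f32 per fp16 bit pattern; filled once during CPU init
// (cf. the "fix ggml time initialization" / "fix f32_f16 table init" fixes)
extern float ggml_table_f32_f16[1 << 16];

#ifdef __cplusplus
}
#endif

static inline float ggml_lookup_fp16_to_fp32(uint16_t f) {
    uint16_t s;
    memcpy(&s, &f, sizeof(uint16_t)); // bit-copy, avoids type-punning UB
    return ggml_table_f32_f16[s];
}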
- ggml/CMakeLists.txt +1 -0
- ggml/include/ggml-cpu.h +1 -0
- ggml/src/ggml-cpu/CMakeLists.txt +8 -0
- ggml/src/ggml-cpu/amx/mmq.cpp +10 -9
- ggml/src/ggml-cpu/arch/arm/quants.c +109 -108
- ggml/src/ggml-cpu/arch/arm/repack.cpp +13 -12
- ggml/src/ggml-cpu/arch/loongarch/quants.c +53 -52
- ggml/src/ggml-cpu/arch/powerpc/quants.c +56 -55
- ggml/src/ggml-cpu/arch/riscv/quants.c +42 -41
- ggml/src/ggml-cpu/arch/riscv/repack.cpp +24 -23
- ggml/src/ggml-cpu/arch/s390/quants.c +29 -28
- ggml/src/ggml-cpu/arch/wasm/quants.c +30 -29
- ggml/src/ggml-cpu/arch/x86/quants.c +83 -82
- ggml/src/ggml-cpu/arch/x86/repack.cpp +20 -19
- ggml/src/ggml-cpu/common.h +3 -2
- ggml/src/ggml-cpu/ggml-cpu-impl.h +9 -3
- ggml/src/ggml-cpu/ggml-cpu.c +59 -16
- ggml/src/ggml-cpu/ggml-cpu.cpp +3 -0
- ggml/src/ggml-cpu/llamafile/sgemm.cpp +3 -2
- ggml/src/ggml-cpu/ops.cpp +48 -48
- ggml/src/ggml-cpu/quants.c +25 -24
- ggml/src/ggml-cpu/repack.cpp +15 -14
- ggml/src/ggml-cpu/simd-mappings.h +211 -33
- ggml/src/ggml-cpu/vec.cpp +2 -2
- ggml/src/ggml-cpu/vec.h +45 -45
- ggml/src/ggml-impl.h +61 -183
- ggml/src/ggml.c +0 -11
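Before the per-file diffs, here is the shape of the NNPA fp16<->fp32 conversion the first commits introduce. A minimal sketch, assuming the z16 vecintrin.h intrinsics (vec_convert_from_fp16, vec_extend_to_fp32_hi, vec_round_from_fp32, vec_convert_to_fp16) and the vector typedefs mentioned in the commit log; this illustrates the approach, not necessarily the exact merged code:

#if defined(__NNPA__)
#include <vecintrin.h>

// s390x vector typedefs (cf. "ggml-cpu: change to typedef vector types")
typedef unsigned short uint16x8_t  __attribute__((vector_size(16)));
typedef float          float32x4_t __attribute__((vector_size(16)));

// fp16 -> fp32: splat the half into a vector, convert through the NNPA
// halfword format, widen the high lanes to fp32, and take lane 0
static inline float nnpa_fp16_to_fp32(unsigned short h) {
    uint16x8_t v_h  = vec_splats(h);
    uint16x8_t v_hd = vec_convert_from_fp16(v_h, 0);
    return vec_extend_to_fp32_hi(v_hd, 0)[0];
}

// fp32 -> fp16: round fp32 lanes down to the NNPA halfword format,
// then convert to IEEE fp16 and extract lane 0
static inline unsigned short nnpa_fp32_to_fp16(float f) {
    float32x4_t v_f    = vec_splats(f);
    float32x4_t v_zero = vec_splats(0.0f);
    uint16x8_t  v_fd   = vec_round_from_fp32(v_f, v_zero, 0);
    uint16x8_t  v_h    = vec_convert_to_fp16(v_fd, 0);
    return v_h[0];
}
#endif // __NNPA__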
ggml/CMakeLists.txt
@@ -131,6 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
 option(GGML_RV_ZFH       "ggml: enable riscv zfh"    OFF)
 option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
 option(GGML_VXE          "ggml: enable vxe"          ON)
+option(GGML_NNPA         "ggml: enable nnpa"         ON)

 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
ggml/include/ggml-cpu.h
@@ -101,6 +101,7 @@ extern "C" {
     GGML_BACKEND_API int ggml_cpu_has_riscv_v   (void);
     GGML_BACKEND_API int ggml_cpu_has_vsx       (void);
     GGML_BACKEND_API int ggml_cpu_has_vxe       (void);
+    GGML_BACKEND_API int ggml_cpu_has_nnpa      (void);
     GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
     GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
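A hypothetical caller-side check against the new entry point (the print helper is illustrative; only the ggml_cpu_has_* functions come from the header above):

#include <stdio.h>
#include "ggml-cpu.h"

// report the s390x vector features this build/CPU combination exposes,
// in the same spirit as llama.cpp's system-info string
static void print_s390x_features(void) {
    printf("VXE  = %d\n", ggml_cpu_has_vxe());
    printf("NNPA = %d\n", ggml_cpu_has_nnpa());
}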
ggml/src/ggml-cpu/CMakeLists.txt
@@ -448,6 +448,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)

         # TODO: Separation to determine activation of VX/VXE/VXE2
         if (${S390X_M} MATCHES "8561|8562")
+            set(GGML_NNPA OFF)
             message(STATUS "z15 target")
             list(APPEND ARCH_FLAGS -march=z15)
         elseif (${S390X_M} MATCHES "3931")
@@ -464,7 +465,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         endif()

         if (GGML_VXE)
+            message(STATUS "VX/VXE/VXE2 enabled")
             list(APPEND ARCH_FLAGS -mvx -mzvector)
+            list(APPEND ARCH_DEFINITIONS GGML_VXE)
+        endif()
+
+        if (GGML_NNPA)
+            message(STATUS "NNPA enabled")
+            list(APPEND ARCH_DEFINITIONS GGML_NNPA)
         endif()
     elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
         message(STATUS "Wasm detected")
ggml/src/ggml-cpu/amx/mmq.cpp
@@ -8,6 +8,7 @@
 #include "mmq.h"
 #include "ggml-impl.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "quants.h"
 #include "ggml-quants.h"
 #include <algorithm>
@@ -453,7 +454,7 @@ void quantize_row_q8_K_vnni(const float * RESTRICT x, void * RESTRICT vy, int64_

     // Quantize these floats
     const float iscale = 127.f / amax;
-    y[i].d = GGML_FP32_TO_FP16(1 / iscale);
+    y[i].d = GGML_CPU_FP32_TO_FP16(1 / iscale);
     const float id = ( amax != 0.0f ) ? iscale : 0.f;
     const __m512 vscale = _mm512_set1_ps(id);

@@ -1090,7 +1091,7 @@ struct acc_C<block_q8_0, block_q4_0, is_acc> {
     const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

     for (int m = 0; m < nr; ++m) {
-        const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+        const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
         const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

         __m512 vsum;
@@ -1113,8 +1114,8 @@ struct acc_C<block_q8_1, block_q4_1, is_acc> {
     const __m512 vm0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset + TILE_N * sizeof(ggml_half))));

     for (int m = 0; m < nr; ++m) {
-        const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
-        const __m512 vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].s));
+        const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
+        const __m512 vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].s));
         const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

         __m512 vsum;
@@ -1137,7 +1138,7 @@ struct acc_C<block_q8_0, block_q8_0, is_acc> {
     const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

     for (int m = 0; m < nr; ++m) {
-        const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+        const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
         const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

         __m512 vsum;
@@ -1437,7 +1438,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q4_0, float, BLOCK_M, BLOCK_N, BLO
         va[k] = _mm512_set1_epi32(a_ptr[k]);
         vcomp = _mm512_dpbusd_epi32(vcomp, off, va[k]);
     }
-    vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+    vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
 }

 // load b
@@ -1498,8 +1499,8 @@ struct tinygemm_kernel_vnni<block_q8_1, block_q4_1, float, 1, BLOCK_N, BLOCK_K>
     for (int k = 0; k < 8; ++k) {
         va[k] = _mm512_set1_epi32(a_ptr[k]);
     }
-    vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
-    vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].s));
+    vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
+    vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].s));
 }

 // load b
@@ -1571,7 +1572,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q8_0, float, BLOCK_M, BLOCK_N, BLO
     va[k] = _mm512_set1_epi32(a_ptr[k]);
     va[k] = _mm512_add_epi8(va[k], off);
     }
-    vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+    vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
 }

 // load b
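The GGML_CPU_FP16_TO_FP32 calls above resolve per architecture inside simd-mappings.h. A rough sketch of that dispatch, with the guards and helper names as assumptions (the merged header may order or name things differently):

// compile-time selection of the fastest fp16 -> fp32 conversion available
#if defined(__F16C__)
    #define GGML_CPU_FP16_TO_FP32(x) _cvtsh_ss(x)                  // x86 F16C
#elif defined(__NNPA__)
    #define GGML_CPU_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)  // s390x NNPA (this change)
#else
    #define GGML_CPU_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)   // table-based fallback
#endif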
@@ -3,6 +3,7 @@
|
|
| 3 |
#include "ggml-quants.h"
|
| 4 |
#include "ggml-impl.h"
|
| 5 |
#include "ggml-cpu.h"
|
|
|
|
| 6 |
|
| 7 |
#include "../../quants.h"
|
| 8 |
#include "../../ggml-cpu-impl.h"
|
|
@@ -62,7 +63,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
|
|
| 62 |
const float d = amax / ((1 << 7) - 1);
|
| 63 |
const float id = d ? 1.0f/d : 0.0f;
|
| 64 |
|
| 65 |
-
y[i].d =
|
| 66 |
|
| 67 |
for (int j = 0; j < 8; j++) {
|
| 68 |
const float32x4_t v = vmulq_n_f32(srcv[j], id);
|
|
@@ -104,7 +105,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
|
|
| 104 |
const float d = amax / ((1 << 7) - 1);
|
| 105 |
const float id = d ? 1.0f/d : 0.0f;
|
| 106 |
|
| 107 |
-
y[i].d =
|
| 108 |
|
| 109 |
int32x4_t accv = vdupq_n_s32(0);
|
| 110 |
|
|
@@ -120,7 +121,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
|
|
| 120 |
accv = vaddq_s32(accv, vi);
|
| 121 |
}
|
| 122 |
|
| 123 |
-
y[i].s =
|
| 124 |
}
|
| 125 |
#else
|
| 126 |
GGML_UNUSED(nb);
|
|
@@ -194,10 +195,10 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 194 |
const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
|
| 195 |
|
| 196 |
float32_t _scale[4] = {
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
};
|
| 202 |
float32x4_t scale = vld1q_f32(_scale);
|
| 203 |
|
|
@@ -274,10 +275,10 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 274 |
// dot product
|
| 275 |
sumv0 = svmla_n_f32_x(ph4, sumv0, svcvt_f32_s32_x(ph4, svadd_x(ph4,
|
| 276 |
svdot_s32(svdup_n_s32(0), qx0ls, qy0l),
|
| 277 |
-
svdot_s32(svdup_n_s32(0), qx0hs, qy0h))),
|
| 278 |
sumv1 = svmla_n_f32_x(ph4, sumv1, svcvt_f32_s32_x(ph4, svadd_x(ph4,
|
| 279 |
svdot_s32(svdup_n_s32(0), qx1ls, qy1l),
|
| 280 |
-
svdot_s32(svdup_n_s32(0), qx1hs, qy1h))),
|
| 281 |
}
|
| 282 |
|
| 283 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
@@ -313,9 +314,9 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 313 |
|
| 314 |
// dot product
|
| 315 |
sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
|
| 316 |
-
svdot_s32(svdup_n_s32(0), qx0s, qy0)),
|
| 317 |
sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
|
| 318 |
-
svdot_s32(svdup_n_s32(0), qx1s, qy1)),
|
| 319 |
}
|
| 320 |
|
| 321 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
@@ -354,9 +355,9 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 354 |
|
| 355 |
// dot product
|
| 356 |
sumv0 = svmla_n_f32_x(ph32, sumv0, svcvt_f32_s32_x(ph32,
|
| 357 |
-
svdot_s32(svdup_n_s32(0), qx0s, qy0)),
|
| 358 |
sumv1 = svmla_n_f32_x(ph32, sumv1, svcvt_f32_s32_x(ph32,
|
| 359 |
-
svdot_s32(svdup_n_s32(0), qx1s, qy1)),
|
| 360 |
}
|
| 361 |
|
| 362 |
sumf = svaddv_f32(ph32, svadd_f32_x(ph32, sumv0, sumv1));
|
|
@@ -404,8 +405,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 404 |
const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0l), v0_0hs, v1_0h);
|
| 405 |
const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1l), v0_1hs, v1_1h);
|
| 406 |
|
| 407 |
-
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0),
|
| 408 |
-
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1),
|
| 409 |
}
|
| 410 |
|
| 411 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
|
|
@@ -423,7 +424,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 423 |
}
|
| 424 |
|
| 425 |
int sumi = sumi0 + sumi1;
|
| 426 |
-
sumf += sumi*
|
| 427 |
}
|
| 428 |
|
| 429 |
*s = sumf;
|
|
@@ -464,10 +465,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 464 |
const block_q8_1 * GGML_RESTRICT b_y1 = &vy1[i];
|
| 465 |
|
| 466 |
float32_t summs_t[4] = {
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
|
| 470 |
-
|
| 471 |
};
|
| 472 |
summs0 = vaddq_f32(summs0, vld1q_f32(summs_t));
|
| 473 |
|
|
@@ -490,10 +491,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 490 |
|
| 491 |
// mmla into int32x4_t
|
| 492 |
float32_t _scale[4] = {
|
| 493 |
-
|
| 494 |
-
|
| 495 |
-
|
| 496 |
-
|
| 497 |
};
|
| 498 |
float32x4_t scale = vld1q_f32(_scale);
|
| 499 |
|
|
@@ -539,7 +540,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 539 |
const block_q8_1 * GGML_RESTRICT y0 = &y[ib + 0];
|
| 540 |
const block_q8_1 * GGML_RESTRICT y1 = &y[ib + 1];
|
| 541 |
|
| 542 |
-
summs +=
|
| 543 |
|
| 544 |
const uint8x16_t m4b = vdupq_n_u8(0x0F);
|
| 545 |
|
|
@@ -562,8 +563,8 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 562 |
const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0l, v1_0l), v0_0h, v1_0h);
|
| 563 |
const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1l, v1_1l), v0_1h, v1_1h);
|
| 564 |
|
| 565 |
-
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0),
|
| 566 |
-
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1),
|
| 567 |
}
|
| 568 |
|
| 569 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs;
|
|
@@ -582,7 +583,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 582 |
}
|
| 583 |
|
| 584 |
int sumi = sumi0 + sumi1;
|
| 585 |
-
sumf += (
|
| 586 |
}
|
| 587 |
|
| 588 |
*s = sumf;
|
|
@@ -666,10 +667,10 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 666 |
|
| 667 |
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
|
| 668 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
|
| 669 |
-
ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))),
|
| 670 |
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
|
| 671 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
|
| 672 |
-
ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))),
|
| 673 |
}
|
| 674 |
|
| 675 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
|
|
@@ -694,7 +695,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 694 |
}
|
| 695 |
|
| 696 |
int sumi = sumi0 + sumi1;
|
| 697 |
-
sumf += (
|
| 698 |
}
|
| 699 |
|
| 700 |
*s = sumf;
|
|
@@ -739,8 +740,8 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 739 |
|
| 740 |
const uint8x16_t m4b = vdupq_n_u8(0x0F);
|
| 741 |
|
| 742 |
-
summs0 +=
|
| 743 |
-
summs1 +=
|
| 744 |
|
| 745 |
// extract the 5th bit via lookup table ((b) << 4)
|
| 746 |
memcpy(&qh0, x0->qh, sizeof(qh0));
|
|
@@ -784,10 +785,10 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 784 |
|
| 785 |
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
|
| 786 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
|
| 787 |
-
ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))),
|
| 788 |
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
|
| 789 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
|
| 790 |
-
ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))),
|
| 791 |
}
|
| 792 |
|
| 793 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs0 + summs1;
|
|
@@ -812,7 +813,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 812 |
}
|
| 813 |
|
| 814 |
int sumi = sumi0 + sumi1;
|
| 815 |
-
sumf += (
|
| 816 |
}
|
| 817 |
|
| 818 |
*s = sumf;
|
|
@@ -864,10 +865,10 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 864 |
const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
|
| 865 |
|
| 866 |
float32_t _scale[4] = {
|
| 867 |
-
|
| 868 |
-
|
| 869 |
-
|
| 870 |
-
|
| 871 |
};
|
| 872 |
float32x4_t scale = vld1q_f32(_scale);
|
| 873 |
|
|
@@ -934,10 +935,10 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 934 |
|
| 935 |
sumv0 = svmla_n_f32_x(pl16, sumv0, svcvt_f32_s32_x(pl16, svadd_x(pl16,
|
| 936 |
svdot_s32(svdup_n_s32(0), qx0_0, qy0_0),
|
| 937 |
-
svdot_s32(svdup_n_s32(0), qx0_1, qy0_1))),
|
| 938 |
sumv1 = svmla_n_f32_x(pl16, sumv1, svcvt_f32_s32_x(pl16, svadd_x(pl16,
|
| 939 |
svdot_s32(svdup_n_s32(0), qx1_0, qy1_0),
|
| 940 |
-
svdot_s32(svdup_n_s32(0), qx1_1, qy1_1))),
|
| 941 |
}
|
| 942 |
|
| 943 |
sumf = svaddv_f32(pl16, svadd_f32_x(pl16, sumv0, sumv1));
|
|
@@ -960,9 +961,9 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 960 |
const svint8_t qy1 = svld1_s8(svptrue_b8(), y1->qs);
|
| 961 |
|
| 962 |
sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
|
| 963 |
-
svdot_s32(svdup_n_s32(0), qx0, qy0)),
|
| 964 |
sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
|
| 965 |
-
svdot_s32(svdup_n_s32(0), qx1, qy1)),
|
| 966 |
}
|
| 967 |
|
| 968 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
@@ -1002,8 +1003,8 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1002 |
qy_64 = svadd_s8_x(svptrue_b8(), qy_32, qy_64);
|
| 1003 |
|
| 1004 |
// scale creation
|
| 1005 |
-
const float32_t deq1 =
|
| 1006 |
-
const float32_t deq2 =
|
| 1007 |
|
| 1008 |
// duplicate deq1 in first half of vector and deq2 in second half of vector
|
| 1009 |
const svfloat32_t temp = svdup_f32_m(svdup_f32_z(ph8, deq1), pl8, deq2);
|
|
@@ -1043,11 +1044,11 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1043 |
|
| 1044 |
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
|
| 1045 |
ggml_vdotq_s32(vdupq_n_s32(0), x0_0, y0_0),
|
| 1046 |
-
ggml_vdotq_s32(vdupq_n_s32(0), x0_1, y0_1))),
|
| 1047 |
|
| 1048 |
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
|
| 1049 |
ggml_vdotq_s32(vdupq_n_s32(0), x1_0, y1_0),
|
| 1050 |
-
ggml_vdotq_s32(vdupq_n_s32(0), x1_1, y1_1))),
|
| 1051 |
}
|
| 1052 |
|
| 1053 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
|
|
@@ -1059,7 +1060,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1059 |
sumi += x[ib].qs[j]*y[ib].qs[j];
|
| 1060 |
}
|
| 1061 |
|
| 1062 |
-
sumf += sumi*(
|
| 1063 |
}
|
| 1064 |
|
| 1065 |
*s = sumf;
|
|
@@ -1217,7 +1218,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 1217 |
const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
|
| 1218 |
const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
|
| 1219 |
|
| 1220 |
-
const float d =
|
| 1221 |
|
| 1222 |
#if defined(__ARM_FEATURE_DOTPROD)
|
| 1223 |
sumi0 = vaddq_s32(sumi0, sumi1);
|
|
@@ -1269,7 +1270,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 1269 |
}
|
| 1270 |
}
|
| 1271 |
|
| 1272 |
-
sumf += (float) sum * (
|
| 1273 |
}
|
| 1274 |
|
| 1275 |
*s = sumf;
|
|
@@ -1362,7 +1363,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 1362 |
const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
|
| 1363 |
const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
|
| 1364 |
|
| 1365 |
-
const float d =
|
| 1366 |
|
| 1367 |
#if defined(__ARM_FEATURE_DOTPROD)
|
| 1368 |
sumi0 = vaddq_s32(sumi0, sumi1);
|
|
@@ -1393,7 +1394,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 1393 |
}
|
| 1394 |
}
|
| 1395 |
|
| 1396 |
-
const float d = y[i].d *
|
| 1397 |
|
| 1398 |
sumf += (float) sumi * d;
|
| 1399 |
}
|
|
@@ -1425,9 +1426,9 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1425 |
switch (vector_length) {
|
| 1426 |
case 128:
|
| 1427 |
for (int i = 0; i < nb; ++i) {
|
| 1428 |
-
const float d = y[i].d *
|
| 1429 |
svfloat32_t d_broad = svdup_n_f32((float32_t)d);
|
| 1430 |
-
const float dmin = -y[i].d *
|
| 1431 |
svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
|
| 1432 |
|
| 1433 |
const uint8_t * GGML_RESTRICT q2 = x[i].qs;
|
|
@@ -1570,9 +1571,9 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1570 |
case 256:
|
| 1571 |
case 512:
|
| 1572 |
for (int i = 0; i < nb; ++i) {
|
| 1573 |
-
const float d = y[i].d *
|
| 1574 |
svfloat32_t d_broad = svdup_n_f32((float32_t)d);
|
| 1575 |
-
const float dmin = -y[i].d *
|
| 1576 |
svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
|
| 1577 |
|
| 1578 |
const uint8_t * GGML_RESTRICT q2 = x[i].qs;
|
|
@@ -1671,8 +1672,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1671 |
float sum = 0;
|
| 1672 |
|
| 1673 |
for (int i = 0; i < nb; ++i) {
|
| 1674 |
-
const float d = y[i].d *
|
| 1675 |
-
const float dmin = -y[i].d *
|
| 1676 |
|
| 1677 |
const uint8_t * GGML_RESTRICT q2 = x[i].qs;
|
| 1678 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
|
@@ -1742,8 +1743,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1742 |
summs += y[i].bsums[j] * (sc[j] >> 4);
|
| 1743 |
}
|
| 1744 |
|
| 1745 |
-
const float dall = y[i].d *
|
| 1746 |
-
const float dmin = y[i].d *
|
| 1747 |
|
| 1748 |
int isum = 0;
|
| 1749 |
int is = 0;
|
|
@@ -1805,7 +1806,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1805 |
|
| 1806 |
for (int i = 0; i < nb; ++i) {
|
| 1807 |
|
| 1808 |
-
const float d = y[i].d *
|
| 1809 |
|
| 1810 |
const uint8_t * GGML_RESTRICT q3_sv = x[i].qs;
|
| 1811 |
const uint8_t * GGML_RESTRICT qh_sv = x[i].hmask;
|
|
@@ -1981,7 +1982,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 1981 |
|
| 1982 |
for (int i = 0; i < nb; ++i) {
|
| 1983 |
|
| 1984 |
-
const float d = y[i].d *
|
| 1985 |
|
| 1986 |
const uint8_t * GGML_RESTRICT q3 = x[i].qs;
|
| 1987 |
const uint8_t * GGML_RESTRICT qh = x[i].hmask;
|
|
@@ -2112,7 +2113,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2112 |
for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
|
| 2113 |
q8 += 8; a += 8;
|
| 2114 |
}
|
| 2115 |
-
const float d =
|
| 2116 |
for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
|
| 2117 |
}
|
| 2118 |
for (int l = 0; l < 8; ++l) sumf += sums[l];
|
|
@@ -2258,18 +2259,18 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2258 |
bias[3] = vaddvq_s32(vaddq_s32(vmull_s16(vget_low_s16(y1_sums), vget_low_s16(x1_mins)),
|
| 2259 |
vmull_s16(vget_high_s16(y1_sums), vget_high_s16(x1_mins))));
|
| 2260 |
const float32x4_t dmins = {
|
| 2261 |
-
|
| 2262 |
-
|
| 2263 |
-
|
| 2264 |
-
|
| 2265 |
};
|
| 2266 |
vfsum = vmlsq_f32(vfsum, vcvtq_f32_s32(vld1q_s32(bias)), dmins);
|
| 2267 |
|
| 2268 |
const float32x4_t superblock_scale = {
|
| 2269 |
-
|
| 2270 |
-
|
| 2271 |
-
|
| 2272 |
-
|
| 2273 |
};
|
| 2274 |
vfsum = vmlaq_f32(vfsum, vcvtq_f32_s32(visum), superblock_scale);
|
| 2275 |
}
|
|
@@ -2289,8 +2290,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2289 |
float sumf = 0;
|
| 2290 |
for (int i = 0; i < nb; ++i) {
|
| 2291 |
|
| 2292 |
-
const float d = y[i].d *
|
| 2293 |
-
const float dmin = y[i].d *
|
| 2294 |
|
| 2295 |
const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
|
| 2296 |
|
|
@@ -2377,8 +2378,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2377 |
|
| 2378 |
for (int i = 0; i < nb; ++i) {
|
| 2379 |
|
| 2380 |
-
const float d = y[i].d *
|
| 2381 |
-
const float dmin = y[i].d *
|
| 2382 |
|
| 2383 |
const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
|
| 2384 |
|
|
@@ -2478,9 +2479,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2478 |
for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
|
| 2479 |
q8 += 8; a += 8;
|
| 2480 |
}
|
| 2481 |
-
const float d =
|
| 2482 |
for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
|
| 2483 |
-
const float dmin =
|
| 2484 |
sumf -= dmin * sumi;
|
| 2485 |
}
|
| 2486 |
for (int l = 0; l < 8; ++l) sumf += sums[l];
|
|
@@ -2520,8 +2521,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2520 |
|
| 2521 |
for (int i = 0; i < nb; ++i) {
|
| 2522 |
|
| 2523 |
-
const float d = y[i].d *
|
| 2524 |
-
const float dmin = y[i].d *
|
| 2525 |
|
| 2526 |
const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
|
| 2527 |
|
|
@@ -2630,9 +2631,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2630 |
for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
|
| 2631 |
q8 += 8; a += 8;
|
| 2632 |
}
|
| 2633 |
-
const float d =
|
| 2634 |
for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
|
| 2635 |
-
const float dmin =
|
| 2636 |
sumf -= dmin * sumi;
|
| 2637 |
}
|
| 2638 |
for (int l = 0; l < 8; ++l) sumf += sums[l];
|
|
@@ -2827,10 +2828,10 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2827 |
const int32x4_t vibias = vmulq_n_s32(vld1q_s32(bias), 32);
|
| 2828 |
|
| 2829 |
const float32x4_t superblock_scale = {
|
| 2830 |
-
|
| 2831 |
-
|
| 2832 |
-
|
| 2833 |
-
|
| 2834 |
};
|
| 2835 |
|
| 2836 |
visum = vsubq_s32(visum, vibias);
|
|
@@ -2858,7 +2859,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 2858 |
svuint8_t q6h_1, q6h_2, q6h_3, q6h_4;
|
| 2859 |
|
| 2860 |
for (int i = 0; i < nb; ++i) {
|
| 2861 |
-
const float d_all =
|
| 2862 |
|
| 2863 |
const uint8_t * GGML_RESTRICT q6 = x[i].ql;
|
| 2864 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
|
@@ -3011,7 +3012,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 3011 |
|
| 3012 |
for (int i = 0; i < nb; ++i) {
|
| 3013 |
|
| 3014 |
-
const float d_all =
|
| 3015 |
|
| 3016 |
const uint8_t * GGML_RESTRICT q6 = x[i].ql;
|
| 3017 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
|
@@ -3128,7 +3129,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
|
|
| 3128 |
for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
|
| 3129 |
q8 += 8; a += 8;
|
| 3130 |
}
|
| 3131 |
-
const float d =
|
| 3132 |
for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
|
| 3133 |
}
|
| 3134 |
for (int l = 0; l < 8; ++l) sumf += sums[l];
|
|
@@ -3199,7 +3200,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
|
|
| 3199 |
|
| 3200 |
float sumf = 0;
|
| 3201 |
for (int i = 0; i < nb; ++i) {
|
| 3202 |
-
const float d =
|
| 3203 |
const uint16_t * GGML_RESTRICT q2 = x[i].qs;
|
| 3204 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
| 3205 |
float sumf1 = 0, sumf2 = 0;
|
|
@@ -3234,7 +3235,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
|
|
| 3234 |
|
| 3235 |
float sumf = 0.f;
|
| 3236 |
for (int i = 0; i < nb; ++i) {
|
| 3237 |
-
const float d =
|
| 3238 |
const uint16_t * GGML_RESTRICT q2 = x[i].qs;
|
| 3239 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
| 3240 |
int32_t bsum = 0;
|
|
@@ -3284,7 +3285,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 3284 |
|
| 3285 |
float sumf = 0;
|
| 3286 |
for (int i = 0; i < nb; ++i) {
|
| 3287 |
-
const float d =
|
| 3288 |
const uint16_t * GGML_RESTRICT q2 = x[i].qs;
|
| 3289 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
| 3290 |
const uint8x8_t scales8 = vld1_u8(x[i].scales);
|
|
@@ -3329,7 +3330,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 3329 |
|
| 3330 |
float sumf = 0.f;
|
| 3331 |
for (int i = 0; i < nb; ++i) {
|
| 3332 |
-
const float d =
|
| 3333 |
const uint16_t * GGML_RESTRICT q2 = x[i].qs;
|
| 3334 |
const uint8_t * GGML_RESTRICT sc = x[i].scales;
|
| 3335 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
|
@@ -3398,7 +3399,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3398 |
float sumf = 0;
|
| 3399 |
for (int i = 0; i < nb; ++i) {
|
| 3400 |
|
| 3401 |
-
const float d =
|
| 3402 |
|
| 3403 |
const uint8_t * GGML_RESTRICT qs = x[i].qs;
|
| 3404 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
|
@@ -3458,7 +3459,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3458 |
float sumf = 0;
|
| 3459 |
for (int i = 0; i < nb; i++) {
|
| 3460 |
|
| 3461 |
-
const float d =
|
| 3462 |
const int8_t * q8 = y[i].qs;
|
| 3463 |
const uint8_t * qs = x[i].qs;
|
| 3464 |
const uint8_t * qh = x[i].qh;
|
|
@@ -3521,7 +3522,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
|
|
| 3521 |
|
| 3522 |
float sumf = 0;
|
| 3523 |
for (int i = 0; i < nb; ++i) {
|
| 3524 |
-
const float d =
|
| 3525 |
const uint8_t * GGML_RESTRICT q3 = x[i].qs;
|
| 3526 |
const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
|
| 3527 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
|
@@ -3557,7 +3558,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
|
|
| 3557 |
|
| 3558 |
float sumf = 0.f;
|
| 3559 |
for (int i = 0; i < nb; ++i) {
|
| 3560 |
-
const float d =
|
| 3561 |
const uint8_t * GGML_RESTRICT q3 = x[i].qs;
|
| 3562 |
const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
|
| 3563 |
const int8_t * GGML_RESTRICT q8 = y[i].qs;
|
|
@@ -3630,7 +3631,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3630 |
|
| 3631 |
float sumf = 0;
|
| 3632 |
for (int i = 0; i < nb; ++i) {
|
| 3633 |
-
const float d =
|
| 3634 |
const uint8_t * GGML_RESTRICT qs = x[i].qs;
|
| 3635 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
| 3636 |
const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
|
|
@@ -3691,7 +3692,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3691 |
|
| 3692 |
float sumf = 0.f;
|
| 3693 |
for (int i = 0; i < nb; ++i) {
|
| 3694 |
-
const float d =
|
| 3695 |
const uint8_t * GGML_RESTRICT qs = x[i].qs;
|
| 3696 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
| 3697 |
const uint8_t * GGML_RESTRICT signs = x[i].signs;
|
|
@@ -3786,7 +3787,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3786 |
|
| 3787 |
}
|
| 3788 |
|
| 3789 |
-
sumf += y[i].d *
|
| 3790 |
}
|
| 3791 |
|
| 3792 |
*s = sumf;
|
|
@@ -3817,7 +3818,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3817 |
qs += 4;
|
| 3818 |
}
|
| 3819 |
|
| 3820 |
-
sumf +=
|
| 3821 |
}
|
| 3822 |
|
| 3823 |
*s = sumf;
|
|
@@ -3905,7 +3906,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3905 |
|
| 3906 |
}
|
| 3907 |
|
| 3908 |
-
sumf += y[i].d *
|
| 3909 |
}
|
| 3910 |
|
| 3911 |
*s = sumf;
|
|
@@ -3952,7 +3953,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 3952 |
qh += 2;
|
| 3953 |
}
|
| 3954 |
|
| 3955 |
-
sumf +=
|
| 3956 |
}
|
| 3957 |
|
| 3958 |
*s = sumf;
|
|
@@ -4003,13 +4004,13 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 4003 |
prod_2 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), q4b.val[2], q8b.val[2]), q4b.val[3], q8b.val[3]);
|
| 4004 |
|
| 4005 |
sumf +=
|
| 4006 |
-
|
| 4007 |
-
|
| 4008 |
}
|
| 4009 |
|
| 4010 |
#endif
|
| 4011 |
for (; ib < nb; ++ib) {
|
| 4012 |
-
const float d =
|
| 4013 |
int sumi1 = 0, sumi2 = 0;
|
| 4014 |
for (int j = 0; j < QK4_NL/2; ++j) {
|
| 4015 |
sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
|
|
@@ -4071,7 +4072,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 4071 |
|
| 4072 |
}
|
| 4073 |
|
| 4074 |
-
sumf +=
|
| 4075 |
}
|
| 4076 |
|
| 4077 |
*s = sumf;
|
|
@@ -4079,7 +4080,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 4079 |
#else
|
| 4080 |
float sumf = 0;
|
| 4081 |
for (int ibl = 0; ibl < nb; ++ibl) {
|
| 4082 |
-
const float d4d8 =
|
| 4083 |
uint16_t h = x[ibl].scales_h;
|
| 4084 |
const uint8_t * qs = x[ibl].qs;
|
| 4085 |
const int8_t * q8 = y[ibl].qs;
|
|
|
|
| 3 |
#include "ggml-quants.h"
|
| 4 |
#include "ggml-impl.h"
|
| 5 |
#include "ggml-cpu.h"
|
| 6 |
+
#include "simd-mappings.h"
|
| 7 |
|
| 8 |
#include "../../quants.h"
|
| 9 |
#include "../../ggml-cpu-impl.h"
|
|
|
|
| 63 |
const float d = amax / ((1 << 7) - 1);
|
| 64 |
const float id = d ? 1.0f/d : 0.0f;
|
| 65 |
|
| 66 |
+
y[i].d = GGML_CPU_FP32_TO_FP16(d);
|
| 67 |
|
| 68 |
for (int j = 0; j < 8; j++) {
|
| 69 |
const float32x4_t v = vmulq_n_f32(srcv[j], id);
|
|
|
|
| 105 |
const float d = amax / ((1 << 7) - 1);
|
| 106 |
const float id = d ? 1.0f/d : 0.0f;
|
| 107 |
|
| 108 |
+
y[i].d = GGML_CPU_FP32_TO_FP16(d);
|
| 109 |
|
| 110 |
int32x4_t accv = vdupq_n_s32(0);
|
| 111 |
|
|
|
|
| 121 |
accv = vaddq_s32(accv, vi);
|
| 122 |
}
|
| 123 |
|
| 124 |
+
y[i].s = GGML_CPU_FP32_TO_FP16(d * vaddvq_s32(accv));
|
| 125 |
}
|
| 126 |
#else
|
| 127 |
GGML_UNUSED(nb);
|
|
|
|
| 195 |
const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
|
| 196 |
|
| 197 |
float32_t _scale[4] = {
|
| 198 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 199 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
|
| 200 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 201 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
|
| 202 |
};
|
| 203 |
float32x4_t scale = vld1q_f32(_scale);
|
| 204 |
|
|
|
|
| 275 |
// dot product
|
| 276 |
sumv0 = svmla_n_f32_x(ph4, sumv0, svcvt_f32_s32_x(ph4, svadd_x(ph4,
|
| 277 |
svdot_s32(svdup_n_s32(0), qx0ls, qy0l),
|
| 278 |
+
svdot_s32(svdup_n_s32(0), qx0hs, qy0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 279 |
sumv1 = svmla_n_f32_x(ph4, sumv1, svcvt_f32_s32_x(ph4, svadd_x(ph4,
|
| 280 |
svdot_s32(svdup_n_s32(0), qx1ls, qy1l),
|
| 281 |
+
svdot_s32(svdup_n_s32(0), qx1hs, qy1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 282 |
}
|
| 283 |
|
| 284 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
|
|
| 314 |
|
| 315 |
// dot product
|
| 316 |
sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
|
| 317 |
+
svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 318 |
sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
|
| 319 |
+
svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 320 |
}
|
| 321 |
|
| 322 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
|
|
| 355 |
|
| 356 |
// dot product
|
| 357 |
sumv0 = svmla_n_f32_x(ph32, sumv0, svcvt_f32_s32_x(ph32,
|
| 358 |
+
svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 359 |
sumv1 = svmla_n_f32_x(ph32, sumv1, svcvt_f32_s32_x(ph32,
|
| 360 |
+
svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 361 |
}
|
| 362 |
|
| 363 |
sumf = svaddv_f32(ph32, svadd_f32_x(ph32, sumv0, sumv1));
|
|
|
|
| 405 |
const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0l), v0_0hs, v1_0h);
|
| 406 |
const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1l), v0_1hs, v1_1h);
|
| 407 |
|
| 408 |
+
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 409 |
+
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 410 |
}
|
| 411 |
|
| 412 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
|
|
|
|
| 424 |
}
|
| 425 |
|
| 426 |
int sumi = sumi0 + sumi1;
|
| 427 |
+
sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
|
| 428 |
}
|
| 429 |
|
| 430 |
*s = sumf;
|
|
|
|
| 465 |
const block_q8_1 * GGML_RESTRICT b_y1 = &vy1[i];
|
| 466 |
|
| 467 |
float32_t summs_t[4] = {
|
| 468 |
+
GGML_CPU_FP16_TO_FP32(b_x0->m) * GGML_CPU_FP16_TO_FP32(b_y0->s),
|
| 469 |
+
GGML_CPU_FP16_TO_FP32(b_x1->m) * GGML_CPU_FP16_TO_FP32(b_y0->s),
|
| 470 |
+
GGML_CPU_FP16_TO_FP32(b_x0->m) * GGML_CPU_FP16_TO_FP32(b_y1->s),
|
| 471 |
+
GGML_CPU_FP16_TO_FP32(b_x1->m) * GGML_CPU_FP16_TO_FP32(b_y1->s)
|
| 472 |
};
|
| 473 |
summs0 = vaddq_f32(summs0, vld1q_f32(summs_t));
|
| 474 |
|
|
|
|
| 491 |
|
| 492 |
// mmla into int32x4_t
|
| 493 |
float32_t _scale[4] = {
|
| 494 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 495 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
|
| 496 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 497 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
|
| 498 |
};
|
| 499 |
float32x4_t scale = vld1q_f32(_scale);
|
| 500 |
|
|
|
|
| 540 |
const block_q8_1 * GGML_RESTRICT y0 = &y[ib + 0];
|
| 541 |
const block_q8_1 * GGML_RESTRICT y1 = &y[ib + 1];
|
| 542 |
|
| 543 |
+
summs += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s) + GGML_CPU_FP16_TO_FP32(x1->m) * GGML_CPU_FP16_TO_FP32(y1->s);
|
| 544 |
|
| 545 |
const uint8x16_t m4b = vdupq_n_u8(0x0F);
|
| 546 |
|
|
|
|
| 563 |
const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0l, v1_0l), v0_0h, v1_0h);
|
| 564 |
const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1l, v1_1l), v0_1h, v1_1h);
|
| 565 |
|
| 566 |
+
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 567 |
+
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 568 |
}
|
| 569 |
|
| 570 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs;
|
|
|
|
| 583 |
}
|
| 584 |
|
| 585 |
int sumi = sumi0 + sumi1;
|
| 586 |
+
sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
|
| 587 |
}
|
| 588 |
|
| 589 |
*s = sumf;
|
|
|
|
| 667 |
|
| 668 |
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
|
| 669 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
|
| 670 |
+
ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 671 |
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
|
| 672 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
|
| 673 |
+
ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 674 |
}
|
| 675 |
|
| 676 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
|
|
|
|
| 695 |
}
|
| 696 |
|
| 697 |
int sumi = sumi0 + sumi1;
|
| 698 |
+
sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
|
| 699 |
}
|
| 700 |
|
| 701 |
*s = sumf;
|
|
|
|
| 740 |
|
| 741 |
const uint8x16_t m4b = vdupq_n_u8(0x0F);
|
| 742 |
|
| 743 |
+
summs0 += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);
|
| 744 |
+
summs1 += GGML_CPU_FP16_TO_FP32(x1->m) * GGML_CPU_FP16_TO_FP32(y1->s);
|
| 745 |
|
| 746 |
// extract the 5th bit via lookup table ((b) << 4)
|
| 747 |
memcpy(&qh0, x0->qh, sizeof(qh0));
|
|
|
|
| 785 |
|
| 786 |
sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
|
| 787 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
|
| 788 |
+
ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 789 |
sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
|
| 790 |
ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
|
| 791 |
+
ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 792 |
}
|
| 793 |
|
| 794 |
sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs0 + summs1;
|
|
|
|
| 813 |
}
|
| 814 |
|
| 815 |
int sumi = sumi0 + sumi1;
|
| 816 |
+
sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
|
| 817 |
}
|
| 818 |
|
| 819 |
*s = sumf;
|
|
|
|
| 865 |
const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
|
| 866 |
|
| 867 |
float32_t _scale[4] = {
|
| 868 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 869 |
+
GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
|
| 870 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
|
| 871 |
+
GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
|
| 872 |
};
|
| 873 |
float32x4_t scale = vld1q_f32(_scale);
|
| 874 |
|
|
|
|
| 935 |
|
| 936 |
sumv0 = svmla_n_f32_x(pl16, sumv0, svcvt_f32_s32_x(pl16, svadd_x(pl16,
|
| 937 |
svdot_s32(svdup_n_s32(0), qx0_0, qy0_0),
|
| 938 |
+
svdot_s32(svdup_n_s32(0), qx0_1, qy0_1))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 939 |
sumv1 = svmla_n_f32_x(pl16, sumv1, svcvt_f32_s32_x(pl16, svadd_x(pl16,
|
| 940 |
svdot_s32(svdup_n_s32(0), qx1_0, qy1_0),
|
| 941 |
+
svdot_s32(svdup_n_s32(0), qx1_1, qy1_1))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 942 |
}
|
| 943 |
|
| 944 |
sumf = svaddv_f32(pl16, svadd_f32_x(pl16, sumv0, sumv1));
|
|
|
|
| 961 |
const svint8_t qy1 = svld1_s8(svptrue_b8(), y1->qs);
|
| 962 |
|
| 963 |
sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
|
| 964 |
+
svdot_s32(svdup_n_s32(0), qx0, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
|
| 965 |
sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
|
| 966 |
+
svdot_s32(svdup_n_s32(0), qx1, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
|
| 967 |
}
|
| 968 |
|
| 969 |
sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
|
|
|
|
| 1003 |
qy_64 = svadd_s8_x(svptrue_b8(), qy_32, qy_64);
|
| 1004 |
|
| 1005 |
// scale creation
|
| 1006 |
+
const float32_t deq1 = GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d);
|
| 1007 |
+
1007 |             const float32_t deq2 = GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d);
1008 |
1009 |             // duplicate deq1 in first half of vector and deq2 in second half of vector
1010 |             const svfloat32_t temp = svdup_f32_m(svdup_f32_z(ph8, deq1), pl8, deq2);

1044 |
1045 |         sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
1046 |                         ggml_vdotq_s32(vdupq_n_s32(0), x0_0, y0_0),
1047 | +                       ggml_vdotq_s32(vdupq_n_s32(0), x0_1, y0_1))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
1048 |
1049 |         sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
1050 |                         ggml_vdotq_s32(vdupq_n_s32(0), x1_0, y1_0),
1051 | +                       ggml_vdotq_s32(vdupq_n_s32(0), x1_1, y1_1))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
1052 |     }
1053 |
1054 |     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);

1060 |             sumi += x[ib].qs[j]*y[ib].qs[j];
1061 |         }
1062 |
1063 | +       sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
1064 |     }
1065 |
1066 |     *s = sumf;

1218 |         const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
1219 |         const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
1220 |
1221 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1222 |
1223 | #if defined(__ARM_FEATURE_DOTPROD)
1224 |         sumi0 = vaddq_s32(sumi0, sumi1);

1270 |             }
1271 |         }
1272 |
1273 | +       sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
1274 |     }
1275 |
1276 |     *s = sumf;

1363 |         const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
1364 |         const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
1365 |
1366 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1367 |
1368 | #if defined(__ARM_FEATURE_DOTPROD)
1369 |         sumi0 = vaddq_s32(sumi0, sumi1);

1394 |             }
1395 |         }
1396 |
1397 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1398 |
1399 |         sumf += (float) sumi * d;
1400 |     }

1426 |     switch (vector_length) {
1427 |         case 128:
1428 |             for (int i = 0; i < nb; ++i) {
1429 | +               const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1430 |                 svfloat32_t d_broad = svdup_n_f32((float32_t)d);
1431 | +               const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1432 |                 svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
1433 |
1434 |                 const uint8_t * GGML_RESTRICT q2 = x[i].qs;

1571 |         case 256:
1572 |         case 512:
1573 |             for (int i = 0; i < nb; ++i) {
1574 | +               const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1575 |                 svfloat32_t d_broad = svdup_n_f32((float32_t)d);
1576 | +               const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1577 |                 svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
1578 |
1579 |                 const uint8_t * GGML_RESTRICT q2 = x[i].qs;

1672 |     float sum = 0;
1673 |
1674 |     for (int i = 0; i < nb; ++i) {
1675 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1676 | +       const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1677 |
1678 |         const uint8_t * GGML_RESTRICT q2 = x[i].qs;
1679 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;

1743 |             summs += y[i].bsums[j] * (sc[j] >> 4);
1744 |         }
1745 |
1746 | +       const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1747 | +       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1748 |
1749 |         int isum = 0;
1750 |         int is = 0;

1806 |
1807 |     for (int i = 0; i < nb; ++i) {
1808 |
1809 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1810 |
1811 |         const uint8_t * GGML_RESTRICT q3_sv = x[i].qs;
1812 |         const uint8_t * GGML_RESTRICT qh_sv = x[i].hmask;

1982 |
1983 |     for (int i = 0; i < nb; ++i) {
1984 |
1985 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1986 |
1987 |         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
1988 |         const uint8_t * GGML_RESTRICT qh = x[i].hmask;

2113 |             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
2114 |             q8 += 8; a += 8;
2115 |         }
2116 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2117 |         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
2118 |     }
2119 |     for (int l = 0; l < 8; ++l) sumf += sums[l];
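Note: the hunks above all touch the same pattern. Each quantized block stores its scale as fp16; the kernel accumulates an integer dot product, then converts the scales to fp32 once per block through the new GGML_CPU_FP16_TO_FP32 spelling and folds them into the float sum. A minimal scalar sketch of that pattern, with hypothetical names (blk_q8, f16_to_f32) standing in for the repository's types and macros:

#include <stdint.h>

/* Illustrative sketch only, not the repository's code: `d` is a raw IEEE
 * binary16 scale, and f16_to_f32 stands in for GGML_CPU_FP16_TO_FP32. */
typedef uint16_t f16_t;
extern float f16_to_f32(f16_t h);

typedef struct { f16_t d; int8_t qs[32]; } blk_q8;

float dot_q8_blocks(const blk_q8 *x, const blk_q8 *y, int nb) {
    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ++ib) {
        int sumi = 0;
        for (int j = 0; j < 32; ++j) {
            sumi += x[ib].qs[j] * y[ib].qs[j];   /* integer part */
        }
        /* one fp16->fp32 conversion per block scale, as in the hunks above */
        sumf += sumi * (f16_to_f32(x[ib].d) * f16_to_f32(y[ib].d));
    }
    return sumf;
}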
2259 |         bias[3] = vaddvq_s32(vaddq_s32(vmull_s16(vget_low_s16(y1_sums), vget_low_s16(x1_mins)),
2260 |                                        vmull_s16(vget_high_s16(y1_sums), vget_high_s16(x1_mins))));
2261 |         const float32x4_t dmins = {
2262 | +           GGML_CPU_FP16_TO_FP32(x0->dmin) * y0->d,
2263 | +           GGML_CPU_FP16_TO_FP32(x0->dmin) * y1->d,
2264 | +           GGML_CPU_FP16_TO_FP32(x1->dmin) * y0->d,
2265 | +           GGML_CPU_FP16_TO_FP32(x1->dmin) * y1->d,
2266 |         };
2267 |         vfsum = vmlsq_f32(vfsum, vcvtq_f32_s32(vld1q_s32(bias)), dmins);
2268 |
2269 |         const float32x4_t superblock_scale = {
2270 | +           GGML_CPU_FP16_TO_FP32(x0->d) * y0->d,
2271 | +           GGML_CPU_FP16_TO_FP32(x0->d) * y1->d,
2272 | +           GGML_CPU_FP16_TO_FP32(x1->d) * y0->d,
2273 | +           GGML_CPU_FP16_TO_FP32(x1->d) * y1->d,
2274 |         };
2275 |         vfsum = vmlaq_f32(vfsum, vcvtq_f32_s32(visum), superblock_scale);
2276 |     }

2290 |     float sumf = 0;
2291 |     for (int i = 0; i < nb; ++i) {
2292 |
2293 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
2294 | +       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
2295 |
2296 |         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
2297 |

2378 |
2379 |     for (int i = 0; i < nb; ++i) {
2380 |
2381 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
2382 | +       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
2383 |
2384 |         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
2385 |

2479 |             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
2480 |             q8 += 8; a += 8;
2481 |         }
2482 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2483 |         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
2484 | +       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
2485 |         sumf -= dmin * sumi;
2486 |     }
2487 |     for (int l = 0; l < 8; ++l) sumf += sums[l];

2521 |
2522 |     for (int i = 0; i < nb; ++i) {
2523 |
2524 | +       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
2525 | +       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
2526 |
2527 |         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
2528 |

2631 |             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
2632 |             q8 += 8; a += 8;
2633 |         }
2634 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2635 |         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
2636 | +       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
2637 |         sumf -= dmin * sumi;
2638 |     }
2639 |     for (int l = 0; l < 8; ++l) sumf += sums[l];

2828 |         const int32x4_t vibias = vmulq_n_s32(vld1q_s32(bias), 32);
2829 |
2830 |         const float32x4_t superblock_scale = {
2831 | +           GGML_CPU_FP16_TO_FP32(x0->d) * y0->d,
2832 | +           GGML_CPU_FP16_TO_FP32(x0->d) * y1->d,
2833 | +           GGML_CPU_FP16_TO_FP32(x1->d) * y0->d,
2834 | +           GGML_CPU_FP16_TO_FP32(x1->d) * y1->d,
2835 |         };
2836 |
2837 |         visum = vsubq_s32(visum, vibias);

2859 |     svuint8_t q6h_1, q6h_2, q6h_3, q6h_4;
2860 |
2861 |     for (int i = 0; i < nb; ++i) {
2862 | +       const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);
2863 |
2864 |         const uint8_t * GGML_RESTRICT q6 = x[i].ql;
2865 |         const uint8_t * GGML_RESTRICT qh = x[i].qh;

3012 |
3013 |     for (int i = 0; i < nb; ++i) {
3014 |
3015 | +       const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);
3016 |
3017 |         const uint8_t * GGML_RESTRICT q6 = x[i].ql;
3018 |         const uint8_t * GGML_RESTRICT qh = x[i].qh;

3129 |             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
3130 |             q8 += 8; a += 8;
3131 |         }
3132 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3133 |         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
3134 |     }
3135 |     for (int l = 0; l < 8; ++l) sumf += sums[l];

3200 |
3201 |     float sumf = 0;
3202 |     for (int i = 0; i < nb; ++i) {
3203 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3204 |         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3205 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;
3206 |         float sumf1 = 0, sumf2 = 0;

3235 |
3236 |     float sumf = 0.f;
3237 |     for (int i = 0; i < nb; ++i) {
3238 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3239 |         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3240 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;
3241 |         int32_t bsum = 0;

3285 |
3286 |     float sumf = 0;
3287 |     for (int i = 0; i < nb; ++i) {
3288 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3289 |         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3290 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;
3291 |         const uint8x8_t scales8 = vld1_u8(x[i].scales);

3330 |
3331 |     float sumf = 0.f;
3332 |     for (int i = 0; i < nb; ++i) {
3333 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3334 |         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3335 |         const uint8_t * GGML_RESTRICT sc = x[i].scales;
3336 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;

3399 |     float sumf = 0;
3400 |     for (int i = 0; i < nb; ++i) {
3401 |
3402 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3403 |
3404 |         const uint8_t * GGML_RESTRICT qs = x[i].qs;
3405 |         const uint8_t * GGML_RESTRICT qh = x[i].qh;

3459 |     float sumf = 0;
3460 |     for (int i = 0; i < nb; i++) {
3461 |
3462 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3463 |         const int8_t * q8 = y[i].qs;
3464 |         const uint8_t * qs = x[i].qs;
3465 |         const uint8_t * qh = x[i].qh;

3522 |
3523 |     float sumf = 0;
3524 |     for (int i = 0; i < nb; ++i) {
3525 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3526 |         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
3527 |         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
3528 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;

3558 |
3559 |     float sumf = 0.f;
3560 |     for (int i = 0; i < nb; ++i) {
3561 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3562 |         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
3563 |         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
3564 |         const int8_t * GGML_RESTRICT q8 = y[i].qs;

3631 |
3632 |     float sumf = 0;
3633 |     for (int i = 0; i < nb; ++i) {
3634 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3635 |         const uint8_t * GGML_RESTRICT qs = x[i].qs;
3636 |         const uint8_t * GGML_RESTRICT qh = x[i].qh;
3637 |         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;

3692 |
3693 |     float sumf = 0.f;
3694 |     for (int i = 0; i < nb; ++i) {
3695 | +       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3696 |         const uint8_t * GGML_RESTRICT qs = x[i].qs;
3697 |         const uint8_t * GGML_RESTRICT qh = x[i].qh;
3698 |         const uint8_t * GGML_RESTRICT signs = x[i].signs;

3787 |
3788 |         }
3789 |
3790 | +       sumf += y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d) * (sumi1 + sumi2 + IQ1S_DELTA * sumi3);
3791 |     }
3792 |
3793 |     *s = sumf;

3818 |             qs += 4;
3819 |         }
3820 |
3821 | +       sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
3822 |     }
3823 |
3824 |     *s = sumf;

3906 |
3907 |         }
3908 |
3909 | +       sumf += y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16) * (vaddvq_s32(sumi1) + IQ1M_DELTA * vaddvq_s32(sumi2));
3910 |     }
3911 |
3912 |     *s = sumf;

3953 |             qh += 2;
3954 |         }
3955 |
3956 | +       sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
3957 |     }
3958 |
3959 |     *s = sumf;

4004 |         prod_2 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), q4b.val[2], q8b.val[2]), q4b.val[3], q8b.val[3]);
4005 |
4006 |         sumf +=
4007 | +           GGML_CPU_FP16_TO_FP32(x[ib+0].d) * GGML_CPU_FP16_TO_FP32(y[ib + 0].d) * vaddvq_s32(prod_1) +
4008 | +           GGML_CPU_FP16_TO_FP32(x[ib+1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) * vaddvq_s32(prod_2);
4009 |     }
4010 |
4011 | #endif
4012 |     for (; ib < nb; ++ib) {
4013 | +       const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
4014 |         int sumi1 = 0, sumi2 = 0;
4015 |         for (int j = 0; j < QK4_NL/2; ++j) {
4016 |             sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];

4072 |
4073 |         }
4074 |
4075 | +       sumf += GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
4076 |     }
4077 |
4078 |     *s = sumf;

4080 | #else
4081 |     float sumf = 0;
4082 |     for (int ibl = 0; ibl < nb; ++ibl) {
4083 | +       const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
4084 |         uint16_t h = x[ibl].scales_h;
4085 |         const uint8_t * qs = x[ibl].qs;
4086 |         const int8_t * q8 = y[ibl].qs;
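Note: the K-quant tails above (q2_K through q5_K) repeat a second pattern: the superblock carries two fp16 scales, `d` for the quantized values and `dmin` for the per-subblock minima, and the quant contribution is added while the minima contribution is subtracted. A hedged sketch of that step, with illustrative names only (f16_to_f32 stands in for GGML_CPU_FP16_TO_FP32):

#include <stdint.h>

/* Sketch of the d/dmin superblock finish, not the repository's code. */
extern float f16_to_f32(uint16_t h);

float finish_superblock(uint16_t d_f16, uint16_t dmin_f16, float y_d,
                        int sumi, int summs, float sumf) {
    const float d    = y_d * f16_to_f32(d_f16);     /* scale for the quants */
    const float dmin = y_d * f16_to_f32(dmin_f16);  /* scale for the minima */
    /* add the quant term, subtract the minima term, as in the hunks above */
    return sumf + d * (float) sumi - dmin * (float) summs;
}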
@@ -6,6 +6,7 @@
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "traits.h"

 #include <cmath>
@@ -51,7 +52,7 @@ void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTR
             const float d = amax / ((1 << 7) - 1);
             id[row_iter] = d ? 1.0f / d : 0.0f;

-            y[i].d[row_iter] =
+            y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
         }

         for (int j = 0; j < 8; j++) {
@@ -102,7 +103,7 @@ void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTR
             const float d = amax / ((1 << 7) - 1);
             id[row_iter] = d ? 1.0f / d : 0.0f;

-            y[i].d[row_iter] =
+            y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
         }

         for (int j = 0; j < QK8_0 * 4; j++) {
@@ -145,7 +146,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
             const float d = amax / ((1 << 7) - 1);
             id[row_iter] = d ? 1.0f / d : 0.0f;

-            y[i].d[row_iter] =
+            y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
         }

         for (int j = 0; j < 4; j++) {
@@ -221,7 +222,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
             const float d = amax / ((1 << 7) - 1);
             id[row_iter] = d ? 1.0f / d : 0.0f;

-            y[i].d[row_iter] =
+            y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
         }

         for (int j = 0; j < QK8_0 * 4; j++) {
@@ -311,7 +312,7 @@ void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi *
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -399,7 +400,7 @@ void ggml_gemv_q4_0_4x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi *
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -514,7 +515,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi *
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -608,7 +609,7 @@ void ggml_gemv_iq4_nl_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
                     const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
                 }
-                sumf[j] += sumi *
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -1117,7 +1118,7 @@ void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                             sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                      (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                         }
-                        sumf[m][j] += sumi *
+                        sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                     }
                 }
             }
@@ -1570,7 +1571,7 @@ void ggml_gemm_q4_0_4x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                             sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                      (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                         }
-                        sumf[m][j] += sumi *
+                        sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                     }
                 }
             }
@@ -2039,7 +2040,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
                             sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                      (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                         }
-                        sumf[m][j] += sumi *
+                        sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                     }
                 }
             }
@@ -2147,7 +2148,7 @@ void ggml_gemm_iq4_nl_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
                             sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                      (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
                         }
-                        sumf[m][j] += sumi *
+                        sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                     }
                 }
             }
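Note: the quantize_mat hunks above all patch the same statement: after finding the row's absolute maximum, the scale d = amax/127 is stored in fp16 via the new GGML_CPU_FP32_TO_FP16 spelling. A self-contained sketch of that step under assumed names (quantize_block_q8 and f32_to_f16 are illustrative, not the repository's packed 4x4 layout):

#include <math.h>
#include <stdint.h>

/* f32_to_f16 stands in for GGML_CPU_FP32_TO_FP16. */
extern uint16_t f32_to_f16(float f);

void quantize_block_q8(const float *src, int n, int8_t *qs, uint16_t *d_out) {
    float amax = 0.0f;                        /* absolute max of the block */
    for (int j = 0; j < n; ++j) {
        const float v = fabsf(src[j]);
        if (v > amax) amax = v;
    }
    const float d  = amax / ((1 << 7) - 1);   /* same expression as the diff */
    const float id = d ? 1.0f / d : 0.0f;
    *d_out = f32_to_f16(d);                   /* the line the patch rewrites */
    for (int j = 0; j < n; ++j) {
        qs[j] = (int8_t) roundf(src[j] * id);
    }
}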
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -474,7 +475,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

         // Quantize these floats
         const float d = max_scalar / 127.f;
-        y[i].d =
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
         const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
         const __m256 mul = (__m256)__lasx_xvreplfr2vr_s( id );

@@ -548,7 +549,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

         // Quantize these floats
         const float d = max_scalar / 127.f;
-        y[i].d =
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
         const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
         const __m256 mul = __lasx_xvreplfr2vr_s( id );

@@ -576,7 +577,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         // Compute the sum of the quants and set y[i].s
         const __m128i s0 = __lsx_vadd_w(__lsx_vadd_w(ni0, ni1), __lsx_vadd_w(ni2, ni3));
         const __m128i s1 = __lsx_vadd_w(__lsx_vadd_w(ni4, ni5), __lsx_vadd_w(ni6, ni7));
-        y[i].s =
+        y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_4(__lsx_vadd_w(s0, s1)));

         // Convert int32 to int16
         ni0 = lsx_packs_w( ni0, ni1 );
@@ -667,7 +668,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         /* Compute combined scale for the block */
-        const __m256 d = __lasx_xvreplfr2vr_s(
+        const __m256 d = __lasx_xvreplfr2vr_s( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);

@@ -699,7 +700,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     for (; ib + 1 < nb; ib += 2) {

         // Compute combined scale for the block 0 and 1
-        const __m128 d_0_1 = (__m128)__lsx_vreplgr2vr_w(
+        const __m128 d_0_1 = (__m128)__lsx_vreplgr2vr_w( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );

         const __m128i tmp_0_1 = __lsx_vld((const __m128i *)x[ib].qs, 0);

@@ -717,7 +718,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         //_mm_prefetch(&y[ib] + 2 * sizeof(block_q8_0), _MM_HINT_T0);

         // Compute combined scale for the block 2 and 3
-        const __m128 d_2_3 = (__m128)__lsx_vreplgr2vr_w(
+        const __m128 d_2_3 = (__m128)__lsx_vreplgr2vr_w( GGML_CPU_FP16_TO_FP32(x[ib + 1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) );

         const __m128i tmp_2_3 = __lsx_vld((const __m128i *)x[ib + 1].qs, 0);

@@ -766,7 +767,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += sumi*
+        sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
     }

     *s = sumf;
@@ -797,10 +798,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi

     // Main loop
     for (; ib < nb; ++ib) {
-        const float d0 =
-        const float d1 =
+        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
+        const float d1 = GGML_CPU_FP16_TO_FP32(y[ib].d);

-        summs +=
+        summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

         const __m256 d0v = __lasx_xvreplfr2vr_s( d0 );
         const __m256 d1v = __lasx_xvreplfr2vr_s( d1 );
@@ -834,7 +835,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }

     *s = sumf;
@@ -865,7 +866,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         /* Compute combined scale for the block */
-        const __m256 d = __lasx_xvreplfr2vr_s(
+        const __m256 d = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d)); //FIXME

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);
         __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -902,7 +903,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
     }

     *s = sumf;
@@ -934,16 +935,16 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi

     // Main loop
     for (; ib < nb; ++ib) {
-        const __m256 dx = __lasx_xvreplfr2vr_s(
+        const __m256 dx = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d));

-        summs +=
+        summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);
         __m256i bxhi = bytes_from_bits_32(x[ib].qh);
         bxhi = __lasx_xvand_v(bxhi, __lasx_xvreplgr2vr_b(0x10));
         qx = __lasx_xvor_v(qx, bxhi);

-        const __m256 dy = __lasx_xvreplfr2vr_s(
+        const __m256 dy = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib].d));
         const __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);

         const __m256 q = mul_sum_us8_pairs_float(qx, qy);
@@ -973,7 +974,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }

     *s = sumf;
@@ -1003,7 +1004,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         // Compute combined scale for the block
-        const __m256 d = __lasx_xvreplfr2vr_s(
+        const __m256 d = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
         __m256i qx = __lasx_xvld((const __m256i *)x[ib].qs, 0);
         __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);

@@ -1023,7 +1024,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }

-        sumf += sumi*(
+        sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
     }

     *s = sumf;
@@ -1047,8 +1048,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d *
-        const float dmin = -y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         const uint8_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1116,8 +1117,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             summs += y[i].bsums[j] * (sc[j] >> 4);
         }

-        const float dall = y[i].d *
-        const float dmin = y[i].d *
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         int isum = 0;
         int is = 0;
@@ -1170,7 +1171,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         // Set up scales
@@ -1294,7 +1295,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1330,8 +1331,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d *
-        const float dmin = -y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         memcpy(utmp, x[i].scales, 12);
         utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1438,9 +1439,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin =
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1477,8 +1478,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const uint8_t * GGML_RESTRICT q5 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;

-        const float d = y[i].d *
-        const float dmin = -y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         memcpy(utmp, x[i].scales, 12);
         utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1593,9 +1594,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin =
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1624,7 +1625,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         const uint8_t * GGML_RESTRICT q4 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1713,7 +1714,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1780,7 +1781,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = (__m256)__lasx_xvldi(0);
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         __m256i sumi1 = __lasx_xvldi(0);
@@ -1820,7 +1821,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         int32_t bsum = 0;
@@ -1895,7 +1896,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     __m256 accumf = (__m256)__lasx_xvldi(0);
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;

@@ -1980,7 +1981,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const uint8_t * GGML_RESTRICT sc = x[i].scales;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2049,7 +2050,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = (__m256)__lasx_xvldi(0);
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -2108,7 +2109,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     float sumf = 0;
     for (int i = 0; i < nb; i++) {

-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const int8_t * q8 = y[i].qs;
         const uint8_t * qs = x[i].qs;
         const uint8_t * qh = x[i].qh;
@@ -2168,7 +2169,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = (__m256)__lasx_xvldi(0);
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2213,7 +2214,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2279,7 +2280,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = (__m256)__lasx_xvldi(0);
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -2340,7 +2341,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -2451,7 +2452,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
         }

-        const float d = y[i].d *
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
         accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(d), __lasx_xvffint_s_w(sumi), accum);
         accum1 += d * sumi1;
     }
@@ -2484,7 +2485,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qs += 4;
         }

-        sumf +=
+        sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
     }

     *s = sumf;
@@ -2530,9 +2531,9 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
         const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
         const __m256i p_1 = lasx_madd_h(p16_1, mone);
         const __m256i p_2 = lasx_madd_h(p16_2, mone);
-        accum1 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(
+        accum1 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib + 0].d)*GGML_CPU_FP16_TO_FP32(x[ib + 0].d)),
                 __lasx_xvffint_s_w(p_1), accum1);
-        accum2 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(
+        accum2 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib + 1].d)*GGML_CPU_FP16_TO_FP32(x[ib + 1].d)),
                 __lasx_xvffint_s_w(p_2), accum2);
     }

@@ -2540,7 +2541,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v

 #endif
     for (; ib < nb; ++ib) {
-        const float d =
+        const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
         int sumi1 = 0, sumi2 = 0;
         for (int j = 0; j < QK4_NL/2; ++j) {
             sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -2595,7 +2596,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
             sumi1 = __lasx_xvadd_w(p_1, sumi1);
             sumi2 = __lasx_xvadd_w(p_2, sumi2);
         }
-        accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(
+        accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
                 __lasx_xvffint_s_w(__lasx_xvadd_w(sumi1, sumi2)), accum);
     }

@@ -2604,7 +2605,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 #else
     float sumf = 0;
     for (int ibl = 0; ibl < nb; ++ibl) {
-        const float d4d8 =
+        const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
         uint16_t h = x[ibl].scales_h;
         const uint8_t * qs = x[ibl].qs;
         const int8_t * q8 = y[ibl].qs;
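Note: every call site above now spells the conversion through the GGML_CPU_* macros, so the CPU backend can substitute an arch-specific fast path (such as the NNPA route described in the commit messages) without touching the kernels. For reference, a minimal scalar sketch of what an fp16-to-fp32 conversion computes; this is an assumption-labeled illustration, not the repository's implementation, and it does not treat Inf/NaN specially:

#include <stdint.h>
#include <string.h>

/* Illustrative binary16 -> binary32 conversion: shift the sign, exponent
 * and mantissa into fp32 bit positions, then rebias the exponent with a
 * power-of-two multiply (2^(127-15) = 2^112), which also renormalizes
 * fp16 subnormals. Inf/NaN inputs are not handled in this sketch. */
static inline float fp16_to_fp32_scalar(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* sign -> bit 31 */
    uint32_t bits = (uint32_t)(h & 0x7fffu) << 13;       /* exp+mantissa   */
    float f;
    memcpy(&f, &bits, sizeof f);
    f *= 0x1.0p+112f;                                    /* rebias exponent */
    memcpy(&bits, &f, sizeof f);
    bits |= sign;
    memcpy(&f, &bits, sizeof f);
    return f;
}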
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -67,7 +68,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float id = d ? 1.0f/d : 0.0f;
         const vector float vid = vec_splats(id);

-        y[i].d =

         for (int j = 0; j < 8; j++) {
             const vector float v = vec_round(vec_mul(srcv[j], vid));
@@ -112,7 +113,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float id = d ? 1.0f/d : 0.0f;
         const vector float vid = vec_splats(id);

-        y[i].d =

         vector int accv = vec_splats(0);

@@ -127,7 +128,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

         accv = vec_add(accv, vec_sld(accv, accv, 4));
         accv = vec_add(accv, vec_sld(accv, accv, 8));
-        y[i].s =
     }

 #else
@@ -170,8 +171,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         __builtin_prefetch(x[ib].qs, 0, 1);
         __builtin_prefetch(y[ib].qs, 0, 1);

-        vector float vxd = vec_splats(
-        vector float vyd = vec_splats(
         vector float vd = vec_mul(vxd, vyd);

         vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -214,7 +215,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += sumi*
     }

     *s = sumf;
@@ -249,12 +250,12 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         __builtin_prefetch(x[ib].qs, 0, 1);
         __builtin_prefetch(y[ib].qs, 0, 1);

-        vector float vxd = vec_splats(
-        vector float vyd = vec_splats(
         vector float vd = vec_mul(vxd, vyd);

-        vector float vxmin = vec_splats(
-        vector float vys = {
         vsumf0 = vec_madd(vxmin, vys, vsumf0);

         vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -291,7 +292,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
     }

     *s = sumf;
@@ -326,8 +327,8 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         __builtin_prefetch(x[ib].qs, 0, 1);
         __builtin_prefetch(y[ib].qs, 0, 1);

-        vector float vxd = vec_splats(
-        vector float vyd = vec_splats(
         vector float vd = vec_mul(vxd, vyd);

         vector signed long long aux64x2_0 = {(uint64_t)(table_b2b_1[x[ib].qh[0]]), (uint64_t)(table_b2b_1[x[ib].qh[1]])};
@@ -379,7 +380,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
     }

     *s = sumf;
@@ -415,12 +416,12 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         __builtin_prefetch(x[ib].qs, 0, 1);
         __builtin_prefetch(y[ib].qs, 0, 1);

-        vector float vxd = vec_splats(
-        vector float vyd = vec_splats(
         vector float vd = vec_mul(vxd, vyd);

-        vector float vxmin = vec_splats(
-        vector float vys = {
         vsumf0 = vec_madd(vxmin, vys, vsumf0);

         vector unsigned long long aux64x2_0 = {(uint64_t)(table_b2b_0[x[ib].qh[0]]), (uint64_t)(table_b2b_0[x[ib].qh[1]])};
@@ -470,7 +471,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (
     }

     *s = sumf;
@@ -502,8 +503,8 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         __builtin_prefetch(x[ib].qs, 0, 1);
         __builtin_prefetch(y[ib].qs, 0, 1);

-        vector float vxd = vec_splats(
-        vector float vyd = vec_splats(
         vector float vd = vec_mul(vxd, vyd);

         vector signed char q8x0 = vec_xl( 0, x[ib].qs);
@@ -542,7 +543,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }

-        sumf += sumi*(
     }

     *s = sumf;
@@ -574,11 +575,11 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

-        vector float vxmin = vec_splats(
         vector float vdmin = vec_mul(vxmin, vyd);

         vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);
@@ -708,8 +709,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             summs += y[i].bsums[j] * (sc[j] >> 4);
         }

-        const float dall = y[i].d *
-        const float dmin = y[i].d *

         int isum = 0;
         int is = 0;
@@ -770,7 +771,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -962,7 +963,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1005,11 +1006,11 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

-        vector float vxmin = vec_splats(
         vector float vdmin = vec_mul(vxmin, vyd);

         vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);
@@ -1177,9 +1178,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin =
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1222,11 +1223,11 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

-        vector float vxmin = vec_splats(
         vector float vdmin = vec_mul(vxmin, vyd);

         UNUSED(kmask1);
@@ -1394,9 +1395,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin =
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1432,7 +1433,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -1591,7 +1592,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d =
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1659,7 +1660,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
     const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -1742,7 +1743,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         int32_t bsum = 0;
@@ -1790,7 +1791,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
     const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -1871,7 +1872,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const uint8_t * GGML_RESTRICT sc = x[i].scales;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1939,7 +1940,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -2033,7 +2034,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     float sumf = 0;
     for (int i = 0; i < nb; i++) {

-        const float d =
         const int8_t * q8 = y[i].qs;
         const uint8_t * qs = x[i].qs;
         const uint8_t * qh = x[i].qh;
@@ -2096,7 +2097,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
     vector float vsumf3 = vec_splats(0.0f);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -2176,7 +2177,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d =
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2236,7 +2237,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);

     for (int i = 0; i < nb; ++i) {
-        vector float vxd = vec_splats(
         vector float vyd = vec_splats(y[i].d);
         vector float vd = vec_mul(vxd, vyd);

@@ -2329,7 +2330,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 2329 |
|
| 2330 |
float sumf = 0.f;
|
| 2331 |
for (int i = 0; i < nb; ++i) {
|
| 2332 |
-
const float d =
|
| 2333 |
const uint8_t * GGML_RESTRICT qs = x[i].qs;
|
| 2334 |
const uint8_t * GGML_RESTRICT qh = x[i].qh;
|
| 2335 |
const uint8_t * GGML_RESTRICT signs = x[i].signs;
|
|
@@ -2394,7 +2395,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 2394 |
vector float vsumf3 = vec_splats(0.0f);
|
| 2395 |
|
| 2396 |
for (int i = 0; i < nb; ++i) {
|
| 2397 |
-
vector float vxd = vec_splats(
|
| 2398 |
vector float vyd = vec_splats(y[i].d);
|
| 2399 |
vector float vd = vec_mul(vxd, vyd);
|
| 2400 |
|
|
@@ -2505,7 +2506,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
|
|
| 2505 |
qs += 4;
|
| 2506 |
}
|
| 2507 |
|
| 2508 |
-
sumf +=
|
| 2509 |
}
|
| 2510 |
|
| 2511 |
*s = sumf;
|
|
@@ -2546,8 +2547,8 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 2546 |
__builtin_prefetch(y[ib].qs, 0, 1);
|
| 2547 |
|
| 2548 |
|
| 2549 |
-
vector float vxd = vec_splats(
|
| 2550 |
-
vector float vyd = vec_splats(
|
| 2551 |
vector float vd = vec_mul(vxd, vyd);
|
| 2552 |
|
| 2553 |
vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
|
|
@@ -2582,7 +2583,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 2582 |
|
| 2583 |
#endif
|
| 2584 |
for (; ib < nb; ++ib) {
|
| 2585 |
-
const float d =
|
| 2586 |
int sumi1 = 0, sumi2 = 0;
|
| 2587 |
for (int j = 0; j < QK4_NL/2; ++j) {
|
| 2588 |
sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
|
|
@@ -2620,7 +2621,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 2620 |
|
| 2621 |
for (int ibl = 0; ibl < nb; ++ibl) {
|
| 2622 |
|
| 2623 |
-
vector float vxd = vec_splats(
|
| 2624 |
vector float vyd = vec_splats(y[ibl].d);
|
| 2625 |
vector float vd = vec_mul(vxd, vyd);
|
| 2626 |
|
|
@@ -2697,7 +2698,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
|
|
| 2697 |
#else
|
| 2698 |
float sumf = 0;
|
| 2699 |
for (int ibl = 0; ibl < nb; ++ibl) {
|
| 2700 |
-
const float d4d8 =
|
| 2701 |
uint16_t h = x[ibl].scales_h;
|
| 2702 |
const uint8_t * qs = x[ibl].qs;
|
| 2703 |
const int8_t * q8 = y[ibl].qs;
|
|
|
|
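The pattern in the K-quant hunks above repeats throughout: each super-block carries two fp16 scalars, a scale d and a minimum dmin, and both are widened and folded with the activation scale y[i].d before the integer sums are applied. A minimal standalone sketch of that epilogue, with the fp16 fields shown as plain float and the function name purely illustrative:

    #include <stdio.h>

    // q2_K-style block result: dall scales the quant*activation dot product,
    // dmin scales the correction accumulated from the per-block minima.
    static float q2k_block_result(float y_d, float x_d, float x_dmin,
                                  int isum, int summs) {
        const float dall = y_d * x_d;
        const float dmin = y_d * x_dmin;
        return dall * (float)isum - dmin * (float)summs;
    }

    int main(void) {
        printf("%f\n", q2k_block_result(0.05f, 0.02f, 0.001f, 1200, 300));
        return 0;
    }

The hunks below cover the earlier part of the same file: the new include and the q8 quantizer and dot-product epilogues.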
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -67,7 +68,7 @@ void quantize_row_q8_0
        const float id = d ? 1.0f/d : 0.0f;
        const vector float vid = vec_splats(id);

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        for (int j = 0; j < 8; j++) {
            const vector float v = vec_round(vec_mul(srcv[j], vid));
@@ -112,7 +113,7 @@ void quantize_row_q8_1
        const float id = d ? 1.0f/d : 0.0f;
        const vector float vid = vec_splats(id);

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        vector int accv = vec_splats(0);

@@ -127,7 +128,7 @@ void quantize_row_q8_1

        accv = vec_add(accv, vec_sld(accv, accv, 4));
        accv = vec_add(accv, vec_sld(accv, accv, 8));
-       y[i].s = GGML_FP32_TO_FP16(d * vec_extract(accv, 0));
+       y[i].s = GGML_CPU_FP32_TO_FP16(d * vec_extract(accv, 0));
    }

 #else
@@ -170,8 +171,8 @@ void ggml_vec_dot_q4_0_q8_0
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
-       vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+       vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
        vector float vd = vec_mul(vxd, vyd);

        vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -214,7 +215,7 @@ void ggml_vec_dot_q4_0_q8_0
        }

        int sumi = sumi0 + sumi1;
-       sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+       sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
    }

    *s = sumf;
@@ -249,12 +250,12 @@ void ggml_vec_dot_q4_1_q8_1
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
-       vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+       vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
        vector float vd = vec_mul(vxd, vyd);

-       vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[ib].m));
-       vector float vys = {GGML_FP16_TO_FP32(y[ib].s), 0.0f, 0.0f, 0.0f};
+       vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].m));
+       vector float vys = {GGML_CPU_FP16_TO_FP32(y[ib].s), 0.0f, 0.0f, 0.0f};
        vsumf0 = vec_madd(vxmin, vys, vsumf0);

        vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -291,7 +292,7 @@ void ggml_vec_dot_q4_1_q8_1
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

    *s = sumf;
@@ -326,8 +327,8 @@ void ggml_vec_dot_q5_0_q8_0
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
-       vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+       vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
        vector float vd = vec_mul(vxd, vyd);

        vector signed long long aux64x2_0 = {(uint64_t)(table_b2b_1[x[ib].qh[0]]), (uint64_t)(table_b2b_1[x[ib].qh[1]])};
@@ -379,7 +380,7 @@ void ggml_vec_dot_q5_0_q8_0
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
    }

    *s = sumf;
@@ -415,12 +416,12 @@ void ggml_vec_dot_q5_1_q8_1
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
-       vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+       vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
        vector float vd = vec_mul(vxd, vyd);

-       vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[ib].m));
-       vector float vys = {GGML_FP16_TO_FP32(y[ib].s), 0.f, 0.f, 0.f};
+       vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].m));
+       vector float vys = {GGML_CPU_FP16_TO_FP32(y[ib].s), 0.f, 0.f, 0.f};
        vsumf0 = vec_madd(vxmin, vys, vsumf0);

        vector unsigned long long aux64x2_0 = {(uint64_t)(table_b2b_0[x[ib].qh[0]]), (uint64_t)(table_b2b_0[x[ib].qh[1]])};
@@ -470,7 +471,7 @@ void ggml_vec_dot_q5_1_q8_1
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

    *s = sumf;
@@ -502,8 +503,8 @@ void ggml_vec_dot_q8_0_q8_0
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
-       vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+       vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
        vector float vd = vec_mul(vxd, vyd);

        vector signed char q8x0 = vec_xl( 0, x[ib].qs);
@@ -542,7 +543,7 @@ void ggml_vec_dot_q8_0_q8_0
            sumi += x[ib].qs[j]*y[ib].qs[j];
        }

-       sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+       sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
    }

    *s = sumf;
@@ -574,11 +575,11 @@ void ggml_vec_dot_q2_K_q8_K
    vector float vsumf3 = vec_splats(0.0f);

    for (int i = 0; i < nb; ++i) {
-       vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
+       vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
        vector float vyd = vec_splats(y[i].d);
        vector float vd = vec_mul(vxd, vyd);

-       vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[i].dmin));
+       vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].dmin));
        vector float vdmin = vec_mul(vxmin, vyd);

        vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);
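All of these call sites now go through GGML_CPU_FP16_TO_FP32 / GGML_CPU_FP32_TO_FP16 from simd-mappings.h, which resolve to native conversions where the target provides them (the NNPA instructions on s390x being the point of this commit). As a reference for what the fp16-to-fp32 direction computes, here is a self-contained scalar sketch; the helper name is illustrative and this is not the ggml implementation:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    // Convert one IEEE-754 binary16 value (stored as uint16_t) to float.
    static float fp16_to_fp32(uint16_t h) {
        const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
        uint32_t exp  = (h >> 10) & 0x1f;
        uint32_t mant = h & 0x3ff;
        uint32_t bits;
        if (exp == 0) {
            if (mant == 0) {
                bits = sign; // signed zero
            } else {
                // subnormal: renormalize the mantissa, tracking the shift
                int s = 0;
                while (!(mant & 0x400)) { mant <<= 1; s++; }
                mant &= 0x3ff;
                bits = sign | ((uint32_t)(113 - s) << 23) | (mant << 13);
            }
        } else if (exp == 0x1f) {
            bits = sign | 0x7f800000u | (mant << 13); // inf / NaN
        } else {
            bits = sign | ((exp + 112) << 23) | (mant << 13); // rebias 15 -> 127
        }
        float f;
        memcpy(&f, &bits, sizeof(f));
        return f;
    }

    int main(void) {
        // 0x3c00 is 1.0 and 0x4170 is 2.71875 in binary16
        printf("%f %f\n", fp16_to_fp32(0x3c00), fp16_to_fp32(0x4170));
        return 0;
    }

The reverse direction additionally needs round-to-nearest-even handling, which is why the per-arch macros matter for both speed and exactness.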
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -45,7 +46,7 @@ void quantize_row_q8_0
        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);

@@ -85,7 +86,7 @@ void quantize_row_q8_1
        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f/d : 0.0f;

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);

@@ -102,7 +103,7 @@ void quantize_row_q8_1

        // set y[i].s
        int sum = __riscv_vmv_x_s_i16m1_i16(vwrs);
-       y[i].s = GGML_FP32_TO_FP16(sum*d);
+       y[i].s = GGML_CPU_FP32_TO_FP16(sum*d);
    }

 #else
@@ -160,7 +161,7 @@ void ggml_vec_dot_q4_0_q8_0

        int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);

-       sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+       sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
    }

 #endif
@@ -177,7 +178,7 @@ void ggml_vec_dot_q4_0_q8_0
        }

        int sumi = sumi0 + sumi1;
-       sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+       sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
    }

    *s = sumf;
@@ -225,7 +226,7 @@ void ggml_vec_dot_q4_1_q8_1

        int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);

-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

 #endif
@@ -242,7 +243,7 @@ void ggml_vec_dot_q4_1_q8_1
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

    *s = sumf;
@@ -293,7 +294,7 @@ void ggml_vec_dot_q5_0_q8_0
        vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
        int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);

-       sumf += (GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
    }

 #endif
@@ -316,7 +317,7 @@ void ggml_vec_dot_q5_0_q8_0
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
    }

    *s = sumf;
@@ -366,7 +367,7 @@ void ggml_vec_dot_q5_1_q8_1
        vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
        int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);

-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

 #endif
@@ -389,7 +390,7 @@ void ggml_vec_dot_q5_1_q8_1
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

    *s = sumf;
@@ -427,7 +428,7 @@ void ggml_vec_dot_q8_0_q8_0

        int sumi = __riscv_vmv_x_s_i32m1_i32(v_sum);

-       sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+       sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
    }

 #endif
@@ -438,7 +439,7 @@ void ggml_vec_dot_q8_0_q8_0
            sumi += x[ib].qs[j]*y[ib].qs[j];
        }

-       sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+       sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
    }

    *s = sumf;
@@ -465,8 +466,8 @@ void ggml_vec_dot_q2_K_q8_K
        const uint8_t * q2 = x[i].qs;
        const int8_t  * q8 = y[i].qs;
        const uint8_t * sc = x[i].scales;
-       const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
        uint8_t *patmp = atmp;
        int vsums;
        int tmp;
@@ -569,8 +570,8 @@ void ggml_vec_dot_q2_K_q8_K
        const int8_t  * q8 = y[i].qs;
        const uint8_t * sc = x[i].scales;

-       const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        size_t vl = 16;

@@ -644,8 +645,8 @@ void ggml_vec_dot_q2_K_q8_K
        const uint8_t * q2 = x[i].qs;
        const int8_t  * q8 = y[i].qs;
        const uint8_t * sc = x[i].scales;
-       const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
        uint8_t *patmp = atmp;
        int vsums;
        int tmp;
@@ -750,8 +751,8 @@ void ggml_vec_dot_q2_K_q8_K
            summs += y[i].bsums[j] * (sc[j] >> 4);
        }

-       const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        int isum = 0;
        int is = 0;
@@ -916,7 +917,7 @@ void ggml_vec_dot_q3_K_q8_K
            q3 += 32; q8 += 128; scale += 8;
        }

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        sumf += d * isum;
    }

@@ -1017,7 +1018,7 @@ void ggml_vec_dot_q3_K_q8_K

        }

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;

        sumf += d*sum_t;

@@ -1134,7 +1135,7 @@ void ggml_vec_dot_q3_K_q8_K
            q3 += 32; q8 += 128; scale += 8;
        }

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        sumf += d * isum;
    }
    break;
@@ -1202,7 +1203,7 @@ void ggml_vec_dot_q3_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1239,8 +1240,8 @@ void ggml_vec_dot_q4_K_q8_K
    float sumf = 0;

    for (int i = 0; i < nb; ++i) {
-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        int tmp, tmp2, sumi;
        __asm__ __volatile__(
@@ -1361,8 +1362,8 @@ void ggml_vec_dot_q4_K_q8_K

        size_t vl = 8;

-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        vint16mf2_t q8sums_0 = __riscv_vlse16_v_i16mf2(y[i].bsums, 4, vl);
        vint16mf2_t q8sums_1 = __riscv_vlse16_v_i16mf2(y[i].bsums+1, 4, vl);
@@ -1422,8 +1423,8 @@ void ggml_vec_dot_q4_K_q8_K
        break;
    case 128:
        for (int i = 0; i < nb; ++i) {
-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        int tmp, tmp2, sumi;
        __asm__ __volatile__(
@@ -1580,9 +1581,9 @@ void ggml_vec_dot_q4_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-       const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
        sumf -= dmin * sumi;
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1627,8 +1628,8 @@ void ggml_vec_dot_q5_K_q8_K
        const uint8_t * GGML_RESTRICT hm = x[i].qh;
        const int8_t  * GGML_RESTRICT q8 = y[i].qs;

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
-       const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;

        vint16m1_t q8sums_0 = __riscv_vlse16_v_i16m1(y[i].bsums, 4, vl);
        vint16m1_t q8sums_1 = __riscv_vlse16_v_i16m1(y[i].bsums+1, 4, vl);
@@ -1749,9 +1750,9 @@ void ggml_vec_dot_q5_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-       const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
        sumf -= dmin * sumi;
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1778,7 +1779,7 @@ void ggml_vec_dot_q6_K_q8_K

    for (int i = 0; i < nb; ++i) {

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;

        const uint8_t * restrict q6 = x[i].ql;
        const uint8_t * restrict qh = x[i].qh;
@@ -1862,7 +1863,7 @@ void ggml_vec_dot_q6_K_q8_K
    case 256:
        for (int i = 0; i < nb; ++i) {

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;

        const uint8_t * GGML_RESTRICT q6 = x[i].ql;
        const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1943,7 +1944,7 @@ void ggml_vec_dot_q6_K_q8_K
    case 128:
        for (int i = 0; i < nb; ++i) {

-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;

        const uint8_t * restrict q6 = x[i].ql;
        const uint8_t * restrict qh = x[i].qh;
@@ -2058,7 +2059,7 @@ void ggml_vec_dot_q6_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
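The quantize_row_q8_0 hunks above only touch how the block scale d is stored, not how it is computed. For reference, a scalar sketch of that scale logic, which the RISC-V intrinsics vectorize; the block size and helper name are simplified stand-ins, not ggml's block_q8_0 layout:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define QK 32

    static void quantize_block_q8_0_ref(const float *x, int8_t *q, float *d_out) {
        float amax = 0.0f; // absolute max of the block
        for (int j = 0; j < QK; j++) {
            const float v = fabsf(x[j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / ((1 << 7) - 1); // map amax to +/-127
        const float id = d ? 1.0f/d : 0.0f;     // guard the all-zero block
        for (int j = 0; j < QK; j++) {
            q[j] = (int8_t) roundf(x[j] * id);
        }
        *d_out = d; // the kernels store this as fp16 via GGML_CPU_FP32_TO_FP16(d)
    }

    int main(void) {
        float x[QK], d;
        int8_t q[QK];
        for (int j = 0; j < QK; j++) x[j] = (float)(j - 16) / 4.0f;
        quantize_block_q8_0_ref(x, q, &d);
        printf("d = %f, q[0] = %d, q[%d] = %d\n", d, q[0], QK - 1, q[QK - 1]);
        return 0;
    }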
@@ -6,6 +6,7 @@
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "traits.h"

 #include <cmath>
@@ -90,16 +91,16 @@ void ggml_gemv_q4_0_8x8_q8_0
            const vfloat32m1_t facc = __riscv_vfcvt_f_x_v_f32m1(sumi_h8, vl / 4);

            // vector version needs Zvfhmin extension
-           const float a_scale = GGML_FP16_TO_FP32(a_ptr[l].d);
+           const float a_scale = GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
            const float b_scales[8] = {
-               GGML_FP16_TO_FP32(b_ptr[l].d[0]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[1]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[2]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[3]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[4]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[5]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[6]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[7])
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[0]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[1]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[2]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[3]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[4]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[5]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[6]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[7])
            };
            const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);
            const vfloat32m1_t tmp1 = __riscv_vfmul_vf_f32m1(facc, a_scale, vl / 4);
@@ -129,7 +130,7 @@ void ggml_gemv_q4_0_8x8_q8_0
                    const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                    sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                }
-               sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+               sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
            }
        }
    }
@@ -181,20 +182,20 @@ void ggml_gemm_q4_0_8x8_q8_0

            // vector version needs Zvfhmin extension
            const float a_scales[4] = {
-               GGML_FP16_TO_FP32(a_ptr[l].d[0]),
-               GGML_FP16_TO_FP32(a_ptr[l].d[1]),
-               GGML_FP16_TO_FP32(a_ptr[l].d[2]),
-               GGML_FP16_TO_FP32(a_ptr[l].d[3])
+               GGML_CPU_FP16_TO_FP32(a_ptr[l].d[0]),
+               GGML_CPU_FP16_TO_FP32(a_ptr[l].d[1]),
+               GGML_CPU_FP16_TO_FP32(a_ptr[l].d[2]),
+               GGML_CPU_FP16_TO_FP32(a_ptr[l].d[3])
            };
            const float b_scales[8] = {
-               GGML_FP16_TO_FP32(b_ptr[l].d[0]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[1]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[2]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[3]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[4]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[5]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[6]),
-               GGML_FP16_TO_FP32(b_ptr[l].d[7])
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[0]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[1]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[2]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[3]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[4]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[5]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[6]),
+               GGML_CPU_FP16_TO_FP32(b_ptr[l].d[7])
            };
            const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);

@@ -382,7 +383,7 @@ void ggml_gemm_q4_0_8x8_q8_0
                        sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                    }
-                   sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+                   sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                }
            }
        }
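In the repacked GEMV/GEMM paths above, the integer dot products are accumulated first and each column's result is then rescaled once by the product of the two block scales. A scalar view of that scale-application step, with the fp16 scales assumed to be already widened to float as the kernel does; names here are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    static void apply_scales(const int32_t *sumi, const float *b_scales,
                             float a_scale, float *sumf, int ncols) {
        for (int j = 0; j < ncols; j++) {
            // one float multiply-add per column after the integer accumulation
            sumf[j] += (float)sumi[j] * b_scales[j] * a_scale;
        }
    }

    int main(void) {
        int32_t sumi[8] = {10, -3, 7, 0, 5, -8, 2, 1};
        float   b[8]    = {0.5f, 0.5f, 0.25f, 1.0f, 0.5f, 0.125f, 1.0f, 2.0f};
        float   out[8]  = {0};
        apply_scales(sumi, b, 0.01f, out, 8);
        for (int j = 0; j < 8; j++) printf("%g ", out[j]);
        printf("\n");
        return 0;
    }

Keeping the accumulation in integers and deferring the float work to this single step is what makes widening the fp16 scales the only conversion the hot loop needs.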
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -49,7 +50,7 @@ void quantize_row_q8_0
        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f / d : 0.0f;

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        for (int j = 0; j < 8; j++) {
            const __vector float v = vec_mul(srcv[j], vec_splats(id));
@@ -94,7 +95,7 @@ void quantize_row_q8_1
        const float d = amax / ((1 << 7) - 1);
        const float id = d ? 1.0f / d : 0.0f;

-       y[i].d = GGML_FP32_TO_FP16(d);
+       y[i].d = GGML_CPU_FP32_TO_FP16(d);

        __vector int32_t acc = vec_splats(0);

@@ -110,7 +111,7 @@ void quantize_row_q8_1
            acc = vec_add(acc, vi);
        }

-       y[i].s = GGML_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
+       y[i].s = GGML_CPU_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
    }
 #else
    GGML_UNUSED(nb);
@@ -164,7 +165,7 @@ void ggml_vec_dot_q4_0_q8_0
        __vector int16_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);

        const __vector float v_xy = vec_float(vec_unpackh(v_xy_));
-       const __vector float v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+       const __vector float v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));

        acc = vec_madd(v_xy, v_d, acc);
    }
@@ -185,7 +186,7 @@ void ggml_vec_dot_q4_0_q8_0
        }

        int sumi = sumi0 + sumi1;
-       sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+       sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
    }

    *s = sumf;
@@ -219,7 +220,7 @@ void ggml_vec_dot_q4_1_q8_1
        __builtin_prefetch(x[ib].qs, 0, 1);
        __builtin_prefetch(y[ib].qs, 0, 1);

-       summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+       summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

        const uint8x16_t v_x = vec_xl(0, x[ib].qs);
        const int8x16_t v_xl = (const int8x16_t)(v_x & v_m);
@@ -231,7 +232,7 @@ void ggml_vec_dot_q4_1_q8_1
        const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
        const float32x4_t v_xy = vec_float(v_xy_);

-       const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+       const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));

        acc = vec_madd(v_xy, v_d, acc);
    }
@@ -252,7 +253,7 @@ void ggml_vec_dot_q4_1_q8_1
        }

        int sumi = sumi0 + sumi1;
-       sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+       sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
    }

    *s = sumf;
@@ -290,7 +291,7 @@ void ggml_vec_dot_q8_0_q8_0

        const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
        const float32x4_t v_xy = vec_float(v_xy_);
-       const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+       const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));

        acc = vec_madd(v_xy, v_d, acc);
    }
@@ -305,7 +306,7 @@ void ggml_vec_dot_q8_0_q8_0
            sumi += x[ib].qs[j]*y[ib].qs[j];
        }

-       sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+       sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
    }

    *s = sumf;
@@ -348,7 +349,7 @@ void ggml_vec_dot_q3_K_q8_K
    float sum = 0;

    for (int i = 0; i < nb; ++i) {
-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

        const uint8_t * restrict x0l = x[i].qs;
        const uint8_t * restrict x0h = x[i].hmask;
@@ -497,7 +498,7 @@ void ggml_vec_dot_q3_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -537,8 +538,8 @@ void ggml_vec_dot_q4_K_q8_K
    float sumf = 0;

    for (int i = 0; i < nb; ++i) {
-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
        const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
@@ -647,9 +648,9 @@ void ggml_vec_dot_q4_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-       const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
        sumf -= dmin * sumi;
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -698,8 +699,8 @@ void ggml_vec_dot_q5_K_q8_K
    float sumf = 0;

    for (int i = 0; i < nb; ++i) {
-       const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-       const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+       const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+       const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

        const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
        const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
@@ -819,9 +820,9 @@ void ggml_vec_dot_q5_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-       const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+       const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
        sumf -= dmin * sumi;
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -859,7 +860,7 @@ void ggml_vec_dot_q6_K_q8_K
    int8x16_t v_y[4];

    for (int i = 0; i < nb; ++i) {
-       const float d_all = GGML_FP16_TO_FP32(x[i].d);
+       const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);

        const uint8_t * GGML_RESTRICT x0l = x[i].ql;
        const uint8_t * GGML_RESTRICT x0h = x[i].qh;
@@ -1004,7 +1005,7 @@ void ggml_vec_dot_q6_K_q8_K
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
        }
-       const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+       const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
        for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
    }
    for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1071,7 +1072,7 @@ void ggml_vec_dot_q6_K_q8_K
    // float sumf = 0;

    // for (int i = 0; i < nb; ++i) {
-   //     const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+   //     const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
    //     const uint16_t * GGML_RESTRICT q2 = x[i].qs;
    //     const int8_t   * GGML_RESTRICT q8 = y[i].qs;

@@ -1121,7 +1122,7 @@ void ggml_vec_dot_q6_K_q8_K

    // float sumf = 0.f;
    // for (int i = 0; i < nb; ++i) {
-   //     const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+   //     const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
    //     const uint16_t * GGML_RESTRICT q2 = x[i].qs;
    //     const int8_t   * GGML_RESTRICT q8 = y[i].qs;
    //     int32_t bsum = 0;
@@ -1182,12 +1183,12 @@ void ggml_vec_dot_iq4_nl_q8_0
        const int8x16_t v_yh = vec_xl(QK8_0/2, y0->qs);
        const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);

-       sumf += GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
+       sumf += GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
    }

 #endif
    for (; ib < nb; ++ib) {
-       const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
+       const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
        int sumi1 = 0, sumi2 = 0;
        for (int j = 0; j < QK4_NL/2; ++j) {
            sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -1257,7 +1258,7 @@ void ggml_vec_dot_iq4_xs_q8_K
            sumi2 += (vsumi1[0] + vsumi1[1] + vsumi1[2] + vsumi1[3]) * ls2;
        }

-       sumf += GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
+       sumf += GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
    }

    *s = sumf;
@@ -1265,7 +1266,7 @@ void ggml_vec_dot_iq4_xs_q8_K
 #else
    float sumf = 0;
    for (int ibl = 0; ibl < nb; ++ibl) {
-       const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
+       const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
        uint16_t h = x[ibl].scales_h;
        const uint8_t * qs = x[ibl].qs;
        const int8_t  * q8 = y[ibl].qs;
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -65,7 +66,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float d = amax / ((1 << 7) - 1);
         const float id = d ? 1.0f/d : 0.0f;

-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);

         for (int j = 0; j < 8; j++) {
             const v128_t v = wasm_f32x4_mul(srcv[j], wasm_f32x4_splat(id));
@@ -110,7 +111,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float d = amax / ((1 << 7) - 1);
         const float id = d ? 1.0f/d : 0.0f;

-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);

         v128_t accv = wasm_i32x4_splat(0);

@@ -126,7 +127,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
             accv = wasm_i32x4_add(accv, vi);
         }

-        y[i].s = GGML_FP32_TO_FP16(
+        y[i].s = GGML_CPU_FP32_TO_FP16(
             d * (wasm_i32x4_extract_lane(accv, 0) +
                  wasm_i32x4_extract_lane(accv, 1) +
                  wasm_i32x4_extract_lane(accv, 2) +
@@ -324,8 +325,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         );

         // Accumulate results with scaling
-        float scale0 = GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d);
-        float scale1 = GGML_FP16_TO_FP32(x1->d) * GGML_FP16_TO_FP32(y1->d);
+        float scale0 = GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d);
+        float scale1 = GGML_CPU_FP16_TO_FP32(x1->d) * GGML_CPU_FP16_TO_FP32(y1->d);

         sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(dp0), wasm_f32x4_splat(scale0)));
         sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(dp1), wasm_f32x4_splat(scale1)));
@@ -348,7 +349,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+        sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
     }

     *s = sumf;
@@ -428,7 +429,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
                         wasm_i32x4_dot_i16x8(v0lfh, v1lh)),
                     wasm_i32x4_add(wasm_i32x4_dot_i16x8(v0hfl, v1hl),
                         wasm_i32x4_dot_i16x8(v0hfh, v1hh)))),
-                wasm_f32x4_splat(GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d))));
+                wasm_f32x4_splat(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d))));
     }

     sumf = wasm_f32x4_extract_lane(sumv, 0) + wasm_f32x4_extract_lane(sumv, 1) +
@@ -454,7 +455,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
     }

     *s = sumf;
@@ -491,7 +492,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const block_q5_1 * GGML_RESTRICT x0 = &x[ib];
         const block_q8_1 * GGML_RESTRICT y0 = &y[ib];

-        summs += GGML_FP16_TO_FP32(x0->m) * GGML_FP16_TO_FP32(y0->s);
+        summs += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);

         const v128_t m4b = wasm_i8x16_splat(0x0F);

@@ -538,7 +539,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
                         wasm_i32x4_dot_i16x8(v0lfh, v1lh)),
                     wasm_i32x4_add(wasm_i32x4_dot_i16x8(v0hfl, v1hl),
                         wasm_i32x4_dot_i16x8(v0hfh, v1hh)))),
-                wasm_f32x4_splat(GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d))));
+                wasm_f32x4_splat(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d))));
     }

     sumf = wasm_f32x4_extract_lane(sumv, 0) + wasm_f32x4_extract_lane(sumv, 1) +
@@ -564,7 +565,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }

     *s = sumf;
@@ -620,7 +621,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             const v128_t sum_dots = wasm_i32x4_add(wasm_i32x4_add(dx0_0, dx0_1), wasm_i32x4_add(dx1_0, dx1_1));

             // Convert to float and accumulate
-            const float scale = GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d);
+            const float scale = GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d);
             sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(sum_dots), wasm_f32x4_splat(scale)));
         }

@@ -635,7 +636,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }

-        sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+        sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
     }

     *s = sumf;
@@ -746,8 +747,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             isum += wasm_i32x4_extract_lane(isum_vec, 0);
         }

-        const float dall = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dall = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf += dall * isum - dmin * summs;
     }

@@ -768,8 +769,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
            summs += y[i].bsums[j] * (sc[j] >> 4);
         }

-        const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         int isum = 0;
         int is = 0;
@@ -880,7 +881,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         // Accumulate results
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const v128_t v_d = wasm_f32x4_splat(d);
         v128_t v_sum = wasm_f32x4_add(
             wasm_f32x4_mul(wasm_f32x4_convert_i32x4(v_acc0), v_d),
@@ -957,7 +958,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -991,8 +992,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     float sumf = 0;

     for (int i = 0; i < nb; ++i) {
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin); // Corrected sign
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin); // Corrected sign

         const uint8_t * GGML_RESTRICT q4 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -1136,9 +1137,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1170,8 +1171,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     float sumf = 0;

     for (int i = 0; i < nb; ++i) {
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin); // Fixed sign
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin); // Fixed sign

         const uint8_t * GGML_RESTRICT q5 = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1331,9 +1332,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1420,7 +1421,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             wasm_v128_store(&aux32[0], acc0);
             wasm_v128_store(&aux32[4], acc1);

-            const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+            const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
             for (int l = 0; l < 8; ++l) {
                 sums[l] += d * aux32[l];
             }
@@ -1470,7 +1471,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
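Nearly every hunk in these per-arch files is the same mechanical change: the half-precision per-block scales (d, dmin, m, s) are widened or narrowed through the GGML_CPU_-prefixed conversion macros instead of the old GGML_-prefixed ones. As a point of reference only, below is a minimal, self-contained sketch of what an fp16-to-fp32 conversion such as GGML_CPU_FP16_TO_FP32 has to compute on a target with no hardware fp16 support. The function name is hypothetical and this is not ggml's actual implementation; real builds dispatch to native paths (F16C, NEON, NNPA, or a lookup table) where available, which is exactly what routing the macros through the ggml-cpu layer (see the simd-mappings.h include above) makes possible.

#include <stdint.h>
#include <string.h>

// Hypothetical illustration only -- not ggml's implementation.
// Widen an IEEE-754 binary16 value (given as raw bits) to binary32.
static float fp16_to_fp32_sketch(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000u) << 16; // sign: bit 15 -> bit 31
    uint32_t exp  = (h >> 10) & 0x1fu;                   // 5-bit exponent, bias 15
    uint32_t frac = h & 0x3ffu;                          // 10-bit fraction
    uint32_t bits;

    if (exp == 0x1fu) {                      // inf / NaN: max out the exponent
        bits = sign | 0x7f800000u | (frac << 13);
    } else if (exp != 0) {                   // normal: rebias 15 -> 127
        bits = sign | ((exp + 112u) << 23) | (frac << 13);
    } else if (frac != 0) {                  // subnormal: renormalize the fraction
        exp = 113u;
        while ((frac & 0x400u) == 0) {       // shift until the implicit bit appears
            frac <<= 1;
            exp--;
        }
        bits = sign | (exp << 23) | ((frac & 0x3ffu) << 13);
    } else {                                 // signed zero
        bits = sign;
    }

    float f;
    memcpy(&f, &bits, sizeof f);             // bit-cast without aliasing UB
    return f;
}

The macro rename does not change this math; it only moves the choice of implementation behind the ggml-cpu SIMD mappings so each backend can substitute its fastest variant.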
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"

 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -256,9 +257,9 @@ static inline __m256 mul_sum_i8_quad_float(const __m128i x_1_0, const __m128i x_

 // quad fp16 delta calculation
 static inline __m256 quad_fp16_delta_float(const float x0, const float y0, const float x1, const float y1) {
-    // GGML_FP16_TO_FP32 is faster than Intel F16C
-    return _mm256_set_m128(_mm_set1_ps(GGML_FP16_TO_FP32(x1) * GGML_FP16_TO_FP32(y1)),
-                           _mm_set1_ps(GGML_FP16_TO_FP32(x0) * GGML_FP16_TO_FP32(y0)));
+    // GGML_CPU_FP16_TO_FP32 is faster than Intel F16C
+    return _mm256_set_m128(_mm_set1_ps(GGML_CPU_FP16_TO_FP32(x1) * GGML_CPU_FP16_TO_FP32(y1)),
+                           _mm_set1_ps(GGML_CPU_FP16_TO_FP32(x0) * GGML_CPU_FP16_TO_FP32(y0)));
 }
 #endif
 #elif defined(__SSSE3__)
@@ -305,7 +306,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

         // Quantize these floats
         const float d = maxScalar / 127.f;
-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
         const float id = ( maxScalar != 0.0f ) ? 127.f / maxScalar : 0.0f;
         const __m256 mul = _mm256_set1_ps( id );
@@ -401,7 +402,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

         // Quantize these floats
         const float d = max_scalar / 127.f;
-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
         const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
         const __m256 mul = _mm256_set1_ps( id );
@@ -425,7 +426,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i

 #if defined(__AVX2__)
         // Compute the sum of the quants and set y[i].s
-        y[i].s = GGML_FP32_TO_FP16(d * hsum_i32_8(_mm256_add_epi32(_mm256_add_epi32(i0, i1), _mm256_add_epi32(i2, i3))));
+        y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_8(_mm256_add_epi32(_mm256_add_epi32(i0, i1), _mm256_add_epi32(i2, i3))));

         // Convert int32 to int16
         i0 = _mm256_packs_epi32( i0, i1 ); // 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15
@@ -455,7 +456,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         // Compute the sum of the quants and set y[i].s
         const __m128i s0 = _mm_add_epi32(_mm_add_epi32(ni0, ni1), _mm_add_epi32(ni2, ni3));
         const __m128i s1 = _mm_add_epi32(_mm_add_epi32(ni4, ni5), _mm_add_epi32(ni6, ni7));
-        y[i].s = GGML_FP32_TO_FP16(d * hsum_i32_4(_mm_add_epi32(s0, s1)));
+        y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_4(_mm_add_epi32(s0, s1)));

         // Convert int32 to int16
         ni0 = _mm_packs_epi32( ni0, ni1 );
@@ -552,7 +553,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         /* Compute combined scale for the block */
-        const __m256 d = _mm256_set1_ps( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
+        const __m256 d = _mm256_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);

@@ -613,7 +614,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         _mm_prefetch(&y[ib] + sizeof(block_q8_0), _MM_HINT_T0);

         // Compute combined scale for the block 0 and 1
-        const __m128 d_0_1 = _mm_set1_ps( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
+        const __m128 d_0_1 = _mm_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );

         const __m128i tmp_0_1 = _mm_loadu_si128((const __m128i *)x[ib].qs);

@@ -631,7 +632,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         _mm_prefetch(&y[ib] + 2 * sizeof(block_q8_0), _MM_HINT_T0);

         // Compute combined scale for the block 2 and 3
-        const __m128 d_2_3 = _mm_set1_ps( GGML_FP16_TO_FP32(x[ib + 1].d) * GGML_FP16_TO_FP32(y[ib + 1].d) );
+        const __m128 d_2_3 = _mm_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib + 1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) );

         const __m128i tmp_2_3 = _mm_loadu_si128((const __m128i *)x[ib + 1].qs);

@@ -680,7 +681,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+        sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
     }

     *s = sumf;
@@ -711,10 +712,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi

     // Main loop
     for (; ib < nb; ++ib) {
-        const float d0 = GGML_FP16_TO_FP32(x[ib].d);
-        const float d1 = GGML_FP16_TO_FP32(y[ib].d);
+        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
+        const float d1 = GGML_CPU_FP16_TO_FP32(y[ib].d);

-        summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+        summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

         const __m256 d0v = _mm256_set1_ps( d0 );
         const __m256 d1v = _mm256_set1_ps( d1 );
@@ -752,7 +753,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }

     *s = sumf;
@@ -783,7 +784,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         /* Compute combined scale for the block */
-        const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+        const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);
         __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -807,7 +808,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         /* Compute combined scale for the block */
-        const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+        const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));

         __m256i bx_0 = bytes_from_nibbles_32(x[ib].qs);
         const __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -851,7 +852,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
     }

     *s = sumf;
@@ -883,16 +884,16 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi

     // Main loop
     for (; ib < nb; ++ib) {
-        const __m256 dx = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d));
+        const __m256 dx = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d));

-        summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+        summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

         __m256i qx = bytes_from_nibbles_32(x[ib].qs);
         __m256i bxhi = bytes_from_bits_32(x[ib].qh);
         bxhi = _mm256_and_si256(bxhi, _mm256_set1_epi8(0x10));
         qx = _mm256_or_si256(qx, bxhi);

-        const __m256 dy = _mm256_set1_ps(GGML_FP16_TO_FP32(y[ib].d));
+        const __m256 dy = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib].d));
         const __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);

         const __m256 q = mul_sum_us8_pairs_float(qx, qy);
@@ -910,9 +911,9 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi

     // Main loop
     for (; ib < nb; ++ib) {
-        const __m256 dx = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d));
+        const __m256 dx = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d));

-        summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+        summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);

         __m256i bx_0 = bytes_from_nibbles_32(x[ib].qs);
         const __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -926,7 +927,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         bxh = _mm_or_si128(bxh, bxhih);
         bx_0 = MM256_SET_M128I(bxh, bxl);

-        const __m256 dy = _mm256_set1_ps(GGML_FP16_TO_FP32(y[ib].d));
+        const __m256 dy = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib].d));
         const __m256i by_0 = _mm256_loadu_si256((const __m256i *)y[ib].qs);

         const __m256 q = mul_sum_us8_pairs_float(bx_0, by_0);
@@ -956,7 +957,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }

         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }

     *s = sumf;
@@ -986,7 +987,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
     // Main loop
     for (; ib < nb; ++ib) {
         // Compute combined scale for the block
-        const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+        const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
         __m256i qx = _mm256_loadu_si256((const __m256i *)x[ib].qs);
         __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);

@@ -1025,7 +1026,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }

-        sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+        sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
     }

     *s = sumf;
@@ -1144,7 +1145,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
         }

         const __m256i ysum = _mm256_loadu_si256((const __m256i *) y[i].bsums);
-        const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(x[i].d));
+        const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d));

         sumi0 = _mm256_sub_epi16(sumi0, ysum);
         sumi0 = _mm256_add_epi16(sumi0, _mm256_add_epi16(sumi1, sumi2));
@@ -1190,7 +1191,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             }
         }

-        sumf += (float) sum * (GGML_FP16_TO_FP32(x[i].d) * y[i].d);
+        sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
     }

     *s = sumf;
@@ -1244,7 +1245,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
         }

         const __m256i ysum = _mm256_loadu_si256((const __m256i *) y[i].bsums);
-        const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(x[i].d));
+        const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d));

         sumi0 = _mm256_add_epi16(sumi0, sumi1);
         sumi0 = _mm256_sub_epi16(sumi0, ysum);
@@ -1269,7 +1270,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             }
         }

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         sumf += (float) sumi * d;
     }
@@ -1299,8 +1300,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         const uint8_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -1366,8 +1367,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         const uint8_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -1477,8 +1478,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             summs += y[i].bsums[j] * (sc[j] >> 4);
         }

-        const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         int isum = 0;
         int is = 0;
@@ -1533,7 +1534,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -1638,7 +1639,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -1824,7 +1825,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1862,8 +1863,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         memcpy(utmp, x[i].scales, 12);
         utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1928,8 +1929,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         const uint8_t * GGML_RESTRICT q4 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -2049,9 +2050,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2092,8 +2093,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const uint8_t * GGML_RESTRICT q5 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         memcpy(utmp, x[i].scales, 12);
         utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -2170,8 +2171,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);

         const uint8_t * GGML_RESTRICT q5 = x[i].qs;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -2311,9 +2312,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2344,7 +2345,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         const uint8_t * GGML_RESTRICT q4 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -2422,7 +2423,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi

     for (int i = 0; i < nb; ++i) {

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);

         const uint8_t * GGML_RESTRICT q4 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -2555,7 +2556,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2622,7 +2623,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
         __m256i sumi1 = _mm256_setzero_si256();
@@ -2663,7 +2664,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
         __m128i sumi1_0 = _mm_setzero_si128();
@@ -2717,7 +2718,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
         int32_t bsum = 0;
@@ -2792,7 +2793,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;

@@ -2913,7 +2914,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;

@@ -3035,7 +3036,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const uint8_t  * GGML_RESTRICT sc = x[i].scales;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
@@ -3104,7 +3105,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -3177,7 +3178,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -3253,7 +3254,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     float sumf = 0;
     for (int i = 0; i < nb; i++) {

-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const int8_t  * q8 = y[i].qs;
         const uint8_t * qs = x[i].qs;
         const uint8_t * qh = x[i].qh;
@@ -3313,7 +3314,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -3358,7 +3359,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -3414,7 +3415,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -3480,7 +3481,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -3565,7 +3566,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     __m256 accumf = _mm256_setzero_ps();
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -3648,7 +3649,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo

     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -3753,7 +3754,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
         }

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
         accum = _mm256_fmadd_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(sumi), accum);
         accum1 += d * sumi1;

@@ -3801,7 +3802,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
                     + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
         }

-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
         accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(MM256_SET_M128I(sumi1_1, sumi1_0))), accum);
         accum1 += d * sumi1;

@@ -3835,7 +3836,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qs += 4;
         }

-        sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
+        sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
     }

     *s = sumf;
@@ -3947,7 +3948,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qs += 8; qh += 4;
         }

-        const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(scale.f16));
+        const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16));

         accum1 = _mm256_fmadd_ps(d, _mm256_cvtepi32_ps(sumi1), accum1);
         accum2 = _mm256_fmadd_ps(d, _mm256_cvtepi32_ps(sumi2), accum2);
@@ -4033,7 +4034,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qs += 8; qh += 4;
         }

-        const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(scale.f16));
+        const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16));

         accum1 = _mm256_add_ps(_mm256_mul_ps(d, _mm256_cvtepi32_ps(MM256_SET_M128I(sumi1_1, sumi1_0))), accum1);
         accum2 = _mm256_add_ps(_mm256_mul_ps(d, _mm256_cvtepi32_ps(MM256_SET_M128I(sumi2_1, sumi2_0))), accum2);
@@ -4083,7 +4084,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qh += 2;
         }

-        sumf += GGML_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
+        sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
     }

     *s = sumf;
@@ -4129,9 +4130,9 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
         const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
         const __m256i p_1 = _mm256_madd_epi16(p16_1, mone);
         const __m256i p_2 = _mm256_madd_epi16(p16_2, mone);
-        accum1 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(y[ib + 0].d)*GGML_FP16_TO_FP32(x[ib + 0].d)),
+        accum1 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 0].d)*GGML_CPU_FP16_TO_FP32(x[ib + 0].d)),
                 _mm256_cvtepi32_ps(p_1), accum1);
-        accum2 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(y[ib + 1].d)*GGML_FP16_TO_FP32(x[ib + 1].d)),
+        accum2 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 1].d)*GGML_CPU_FP16_TO_FP32(x[ib + 1].d)),
                 _mm256_cvtepi32_ps(p_2), accum2);
     }

@@ -4164,7 +4165,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v

 #endif
     for (; ib < nb; ++ib) {
-        const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
+        const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
         int sumi1 = 0, sumi2 = 0;
         for (int j = 0; j < QK4_NL/2; ++j) {
             sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -4219,7 +4220,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
             sumi1 = _mm256_add_epi32(p_1, sumi1);
             sumi2 = _mm256_add_epi32(p_2, sumi2);
         }
-        accum = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
+        accum = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
                 _mm256_cvtepi32_ps(_mm256_add_epi32(sumi1, sumi2)), accum);
     }

@@ -4267,7 +4268,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
         }
         __m128i sumi12_0 = _mm_add_epi32(sumi1_0, sumi2_0);
         __m128i sumi12_1 = _mm_add_epi32(sumi1_1, sumi2_1);
-        accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
+        accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
                 _mm256_cvtepi32_ps(MM256_SET_M128I(sumi12_1, sumi12_0))), accum);
     }

@@ -4276,7 +4277,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 #else
     float sumf = 0;
     for (int ibl = 0; ibl < nb; ++ibl) {
-        const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
+        const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
         uint16_t h = x[ibl].scales_h;
         const uint8_t * qs = x[ibl].qs;
         const int8_t  * q8 = y[ibl].qs;
|
| 4088 |
}
|
| 4089 |
|
| 4090 |
*s = sumf;
|
|
|
|
| 4130 |
const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
|
| 4131 |
const __m256i p_1 = _mm256_madd_epi16(p16_1, mone);
|
| 4132 |
const __m256i p_2 = _mm256_madd_epi16(p16_2, mone);
|
| 4133 |
+
accum1 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 0].d)*GGML_CPU_FP16_TO_FP32(x[ib + 0].d)),
|
| 4134 |
_mm256_cvtepi32_ps(p_1), accum1);
|
| 4135 |
+
accum2 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 1].d)*GGML_CPU_FP16_TO_FP32(x[ib + 1].d)),
|
| 4136 |
_mm256_cvtepi32_ps(p_2), accum2);
|
| 4137 |
}
|
| 4138 |
|
|
|
|
| 4165 |
|
| 4166 |
#endif
|
| 4167 |
for (; ib < nb; ++ib) {
|
| 4168 |
+
const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
|
| 4169 |
int sumi1 = 0, sumi2 = 0;
|
| 4170 |
for (int j = 0; j < QK4_NL/2; ++j) {
|
| 4171 |
sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
|
|
|
|
| 4220 |
sumi1 = _mm256_add_epi32(p_1, sumi1);
|
| 4221 |
sumi2 = _mm256_add_epi32(p_2, sumi2);
|
| 4222 |
}
|
| 4223 |
+
accum = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
|
| 4224 |
_mm256_cvtepi32_ps(_mm256_add_epi32(sumi1, sumi2)), accum);
|
| 4225 |
}
|
| 4226 |
|
|
|
|
| 4268 |
}
|
| 4269 |
__m128i sumi12_0 = _mm_add_epi32(sumi1_0, sumi2_0);
|
| 4270 |
__m128i sumi12_1 = _mm_add_epi32(sumi1_1, sumi2_1);
|
| 4271 |
+
accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
|
| 4272 |
_mm256_cvtepi32_ps(MM256_SET_M128I(sumi12_1, sumi12_0))), accum);
|
| 4273 |
}
|
| 4274 |
|
|
|
|
| 4277 |
#else
|
| 4278 |
float sumf = 0;
|
| 4279 |
for (int ibl = 0; ibl < nb; ++ibl) {
|
| 4280 |
+
const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
|
| 4281 |
uint16_t h = x[ibl].scales_h;
|
| 4282 |
const uint8_t * qs = x[ibl].qs;
|
| 4283 |
const int8_t * q8 = y[ibl].qs;
|
|
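Every hunk above makes the same substitution inside the quantized dot-product kernels: the per-block FP16 scale is widened once with GGML_CPU_FP16_TO_FP32 and then multiplies the block's integer accumulator. A minimal scalar sketch of that pattern (block_t and dot_blocks are illustrative stand-ins, not the real ggml block structs):

// sketch: per-block scaled integer dot product, the shape shared by the kernels above
typedef struct { ggml_fp16_t d; int8_t qs[32]; } block_t;

static float dot_blocks(const block_t * x, const block_t * y, int nb) {
    float sumf = 0.0f;
    for (int i = 0; i < nb; ++i) {
        int sumi = 0;
        for (int j = 0; j < 32; ++j) {
            sumi += x[i].qs[j] * y[i].qs[j];          // integer math stays integer
        }
        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) // widen each FP16 scale once
                      * GGML_CPU_FP16_TO_FP32(y[i].d);
        sumf += d * sumi;                             // one float multiply per block
    }
    return sumf;
}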
@@ -6,6 +6,7 @@
  #include "ggml-impl.h"
  #include "ggml-cpu.h"
  #include "ggml-cpu-impl.h"
+ #include "simd-mappings.h"
  #include "traits.h"

  #include <cmath>

@@ -39,11 +40,11 @@ static inline __m512 __avx512_f32cx8x2_load(ggml_fp16_t *x, ggml_fp16_t *y) {
  float tmp[16];

  for (int i = 0; i < 8; i++) {
- tmp[i] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  for (int i = 0; i < 8; i++) {
- tmp[i + 8] =
+ tmp[i + 8] = GGML_CPU_FP16_TO_FP32(y[i]);
  }

  return _mm512_loadu_ps(tmp);

@@ -54,10 +55,10 @@ static inline __m512 __avx512_repeat_f32cx16_load(__m128i x) {
  _mm_storeu_si128((__m128i*)tmphalf, x);

  for (int i = 0; i < 4; i++) {
- tmp[i] =
- tmp[i + 4] =
- tmp[i + 8] =
- tmp[i + 12] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 4] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 8] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 12] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
  }

  return _mm512_loadu_ps(tmp);

@@ -67,7 +68,7 @@ static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
  float tmp[8];

  for (int i = 0; i < 8; i++) {
- tmp[i] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  return _mm256_loadu_ps(tmp);

@@ -76,8 +77,8 @@ static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
  float tmp[8];

  for (int i = 0; i < 4; i++) {
- tmp[i] =
- tmp[i + 4] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
+ tmp[i + 4] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  return _mm256_loadu_ps(tmp);

@@ -88,7 +89,7 @@ static inline __m256 __avx_rearranged_f32cx8_load(ggml_fp16_t *x, __m128i arrangeMask) {

  _mm_storeu_si128((__m128i*)tmphalf, _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *) x), arrangeMask));
  for (int i = 0; i < 8; i++) {
- tmp[i] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
  }

  return _mm256_loadu_ps(tmp);

@@ -211,7 +212,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
  id[row_iter] = ( maxScalar != 0.0f ) ? 127.f / maxScalar : 0.0f; //d ? 1.0f / d : 0.0f;

  // Store the scale for the individual block
- y[i].d[row_iter] =
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);

  // Store the values in blocks of eight values - Aim is to use these later for block interleaving
  srcv[row_iter][0] = v0;

@@ -297,7 +298,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
  const float d = amax / ((1 << 7) - 1);
  id[row_iter] = d ? 1.0f / d : 0.0f;

- y[i].d[row_iter] =
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
  }

  for (int j = 0; j < QK8_0 * 4; j++) {

@@ -647,7 +648,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  const __m256 col_scale_f32 = GGML_F32Cx8_REARRANGE_LOAD(b_ptr[b].d, changemask);

  // Load and convert to FP32 scale from block_q8_0
- const __m256 row_scale_f32 = _mm256_set1_ps(
+ const __m256 row_scale_f32 = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(a_ptr[b].d));

  // Load the block values in block_q8_0 in batches of 16 bytes and replicate the same across 256 bit vector
  __m256i lhs_vec_0 = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a_ptr[b].qs));

@@ -706,7 +707,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
  }
- sumf[j] += sumi *
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
  }
  }
  }

@@ -972,13 +973,13 @@ void ggml_gemv_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi2 = sumi2 * scales_1[j];
  sumi += sumi1 + sumi2;
  }
- sumf[j] += sumi *
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
  }
  }
  for (int sb = 0; sb < 8; sb++) {
  uint8_t *mins = (uint8_t*) utmp + 8 + sb * 16;
  for (int j = 0; j < ncols_interleaved; j++) {
- sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) *
+ sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
  }
  }
  }

@@ -1755,7 +1756,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
  }
- sumf[m][j] += sumi *
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
  }
  }
  }

@@ -3259,7 +3260,7 @@ void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi2 = sumi2 * scales_1[j];
  sumi += sumi1 + sumi2;
  }
- sumf[m][j] += sumi *
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
  }
  }
  }

@@ -3268,7 +3269,7 @@ void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  for(int m = 0; m < 4; m++) {
  const int16_t *bsums = a_ptr[l].bsums + (sb * 8) + (m * 4) - ((sb % 2) * 6);
  for(int j = 0; j < ncols_interleaved; j++) {
- sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) *
+ sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
  }
  }
  }
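Note: helpers like __avx_f32cx8_load widen element by element through a scalar tmp buffer, which is the portable fallback path; where F16C is available the same eight-element widening is a single instruction. A sketch for contrast (f16x8_to_f32x8 is an illustrative name, not part of this patch):

#include <immintrin.h>

// eight fp16 values -> eight fp32 lanes via F16C
static inline __m256 f16x8_to_f32x8(const uint16_t * x) {
    const __m128i h = _mm_loadu_si128((const __m128i *) x); // 8 x fp16 bit patterns
    return _mm256_cvtph_ps(h);                              // hardware widen
}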
@@ -4,6 +4,7 @@
  #include "traits.h"
  #include "ggml-cpu-impl.h"
  #include "ggml-impl.h"
+ #include "simd-mappings.h"

  #ifdef __cplusplus

@@ -12,11 +13,11 @@
  // convenience functions/macros for use in template calls
  // note: these won't be required after the 'traits' lookup table is used.
  static inline ggml_fp16_t f32_to_f16(float x) {
- return
+ return GGML_CPU_FP32_TO_FP16(x);
  }

  static inline float f16_to_f32(ggml_fp16_t x) {
- return
+ return GGML_CPU_FP16_TO_FP32(x);
  }

  static inline ggml_bf16_t f32_to_bf16(float x) {
@@ -62,11 +62,17 @@ struct ggml_compute_params {
  #if defined(__s390x__) && defined(__VEC__)
  #ifndef __VXE__
  #define __VXE__
- #endif
+ #endif // __VXE__
  #ifndef __VXE2__
  #define __VXE2__
- #endif
- #endif
+ #endif // __VXE2__
+ #endif // __s390x__ && __VEC__
+
+ #if defined(__s390x__) && defined(GGML_NNPA)
+ #ifndef __NNPA__
+ #define __NNPA__
+ #endif // __NNPA__
+ #endif // __s390x__ && GGML_NNPA

  #if defined(__ARM_FEATURE_SVE)
  #include <sys/prctl.h>
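Note: the compiler defines vector-facility macros such as __VEC__ on its own, while NNPA support is opted into through the GGML_NNPA build flag; the block above folds both cases onto canonical feature macros so later code can test a single symbol. The conversion routines then gate on it with the usual ladder, sketched here (the path comments are illustrative):

// sketch: one canonical macro per feature keeps the dispatch ladder flat
#if defined(__F16C__)
    // x86: hardware fp16<->fp32 converts
#elif defined(__NNPA__)
    // s390x: NNPA converts via DLFloat16
#else
    // portable scalar fallback
#endif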
@@ -72,6 +72,9 @@
  #define UNUSED GGML_UNUSED
  #define SWAP(x, y, T) do { T SWAP = x; (x) = y; (y) = SWAP; } while (0)

+ // precomputed f32 table for f16 (256 KB) (simd-mappings.h)
+ float ggml_table_f32_f16[1 << 16];
+
  #if defined(__ARM_ARCH)
  struct ggml_arm_arch_features_type {
  int sve_cnt;

@@ -736,7 +739,7 @@ struct ggml_tensor * ggml_set_i32 (struct ggml_tensor * tensor, int32_t value) {
  {
  assert(tensor->nb[0] == sizeof(ggml_fp16_t));
  for (int i = 0; i < n; i++) {
- ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1),
+ ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_CPU_FP32_TO_FP16(value));
  }
  } break;
  case GGML_TYPE_BF16:

@@ -795,7 +798,7 @@ struct ggml_tensor * ggml_set_f32(struct ggml_tensor * tensor, float value) {
  {
  assert(tensor->nb[0] == sizeof(ggml_fp16_t));
  for (int i = 0; i < n; i++) {
- ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1),
+ ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_CPU_FP32_TO_FP16(value));
  }
  } break;
  case GGML_TYPE_BF16:

@@ -846,7 +849,7 @@ int32_t ggml_get_i32_1d(const struct ggml_tensor * tensor, int i) {
  case GGML_TYPE_F16:
  {
  GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
- return
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
  }
  case GGML_TYPE_BF16:
  {

@@ -891,7 +894,7 @@ void ggml_set_i32_1d(const struct ggml_tensor * tensor, int i, int32_t value) {
  case GGML_TYPE_F16:
  {
  GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
- ((ggml_fp16_t *)(tensor->data))[i] =
+ ((ggml_fp16_t *)(tensor->data))[i] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {

@@ -920,7 +923,7 @@ int32_t ggml_get_i32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i
  case GGML_TYPE_I32:
  return ((int32_t *) data)[0];
  case GGML_TYPE_F16:
- return
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
  case GGML_TYPE_BF16:
  return GGML_BF16_TO_FP32(((ggml_bf16_t *) data)[0]);
  case GGML_TYPE_F32:

@@ -947,7 +950,7 @@ void ggml_set_i32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(data))[0] =
+ ((ggml_fp16_t *)(data))[0] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {

@@ -985,7 +988,7 @@ float ggml_get_f32_1d(const struct ggml_tensor * tensor, int i) {
  }
  case GGML_TYPE_F16:
  {
- return
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
  }
  case GGML_TYPE_BF16:
  {

@@ -1024,7 +1027,7 @@ void ggml_set_f32_1d(const struct ggml_tensor * tensor, int i, float value) {
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(tensor->data))[i] =
+ ((ggml_fp16_t *)(tensor->data))[i] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {

@@ -1051,7 +1054,7 @@ float ggml_get_f32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  case GGML_TYPE_I32:
  return ((int32_t *) data)[0];
  case GGML_TYPE_F16:
- return
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
  case GGML_TYPE_BF16:
  return GGML_BF16_TO_FP32(((ggml_bf16_t *) data)[0]);
  case GGML_TYPE_F32:

@@ -1078,7 +1081,7 @@ void ggml_set_f32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(data))[0] =
+ ((ggml_fp16_t *)(data))[0] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {
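Usage sketch for the scalar accessors above: values round-trip through fp16 storage, so the set path narrows with GGML_CPU_FP32_TO_FP16 and the get path widens with GGML_CPU_FP16_TO_FP32 (assumes ctx is an initialized ggml_context):

struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 8);
ggml_set_f32_1d(t, 0, 1.5f);           // stored as GGML_CPU_FP32_TO_FP16(1.5f)
const float v = ggml_get_f32_1d(t, 0); // widened back; exact here, 1.5 is representable in fp16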
@@ -3141,9 +3144,24 @@ void ggml_cpu_fp32_to_fp16(const float * x, ggml_fp16_t * y, int64_t n) {
  __m128i y_vec = _mm_cvtps_ph(x_vec, _MM_FROUND_TO_NEAREST_INT);
  _mm_storel_epi64((__m128i *)(y + i), y_vec);
  }
+ #elif defined(__NNPA__)
+ for (; i + 7 < n; i += 8) {
+ float32x4_t v_xh = vec_xl(0, (const float *)(x + i + 0));
+ float32x4_t v_xl = vec_xl(0, (const float *)(x + i + 4));
+ uint16x8_t v_yd = vec_round_from_fp32(v_xh, v_xl, 0);
+ uint16x8_t v_y = vec_convert_to_fp16(v_yd, 0);
+ vec_xst(v_y, 0, (ggml_fp16_t *)(y + i));
+ }
+ for (; i + 3 < n; i += 4) {
+ float32x4_t v_x = vec_xl(0, (const float *)(x + i));
+ float32x4_t v_zero = vec_splats(0.0f);
+ uint16x8_t v_yd = vec_round_from_fp32(v_x, v_zero, 0);
+ uint16x8_t v_y = vec_convert_to_fp16(v_yd, 0);
+ vec_xst(v_y, 0, (ggml_fp16_t *)(y + i));
+ }
  #endif
  for (; i < n; ++i) {
- y[i] =
+ y[i] = GGML_CPU_FP32_TO_FP16(x[i]);
  }
  }

@@ -3167,9 +3185,25 @@ void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n) {
  __m128 y_vec = _mm_cvtph_ps(x_vec);
  _mm_storeu_ps(y + i, y_vec);
  }
+ #elif defined(__NNPA__)
+ for (; i + 7 < n; i += 8) {
+ uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)(x + i));
+ uint16x8_t v_yd = vec_convert_from_fp16(v_x, 0);
+ float32x4_t v_yh = vec_extend_to_fp32_hi(v_yd, 0);
+ float32x4_t v_yl = vec_extend_to_fp32_lo(v_yd, 0);
+ vec_xst(v_yh, 0, (float *)(y + i + 0));
+ vec_xst(v_yl, 0, (float *)(y + i + 4));
+ }
+ for (; i + 3 < n; i += 4) {
+ uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)(x + i));
+ uint16x8_t v_yd = vec_convert_from_fp16(v_x, 0);
+ float32x4_t v_yh = vec_extend_to_fp32_hi(v_yd, 0);
+ vec_xst(v_yh, 0, (float *)(y + i));
+ }
  #endif
+
  for (; i < n; ++i) {
- y[i] =
+ y[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }
  }

@@ -3369,6 +3403,14 @@ int ggml_cpu_has_vxe(void) {
  #endif
  }

+ int ggml_cpu_has_nnpa(void) {
+ #if defined(GGML_NNPA)
+ return 1;
+ #else
+ return 0;
+ #endif
+ }
+
  int ggml_cpu_has_neon(void) {
  #if defined(__ARM_ARCH) && defined(__ARM_NEON)
  return 1;

@@ -3418,7 +3460,7 @@ int ggml_cpu_has_sme(void) {
  }

  void ggml_cpu_init(void) {
- // needed to initialize
+ // needed to initialize ggml_time
  {
  struct ggml_init_params params = { 0, NULL, false };
  struct ggml_context * ctx = ggml_init(params);

@@ -3439,9 +3481,10 @@ void ggml_cpu_init(void) {
  uint16_t u16;
  ggml_fp16_t fp16;
  } u = {i};
- float f =
-
-
+ float f = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
+ ggml_table_f32_f16[i] = f;
+ ggml_table_gelu_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_f32(f));
+ ggml_table_gelu_quick_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_quick_f32(f));
  }

  const uint64_t t_end = ggml_time_us(); UNUSED(t_end);
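Note: NNPA's native format is DLFloat16 (the dlf16 mentioned in the commit log), so each direction above is a two-stage conversion: vec_round_from_fp32 packs two fp32 vectors into one dlf16 vector, vec_convert_to_fp16 re-encodes that as IEEE fp16, and the reverse path mirrors it with vec_convert_from_fp16 plus the hi/lo widening extracts. The loops peel 8 elements, then 4 (padding the unused half with vec_splats(0.0f)), and leave any remainder to the scalar macro. A minimal usage sketch of the converted entry points (buffer contents are illustrative):

float src[10] = { 0.0f, 1.0f, -1.0f, 0.5f, 3.14159f, 65504.0f, 1e-4f, -2.0f, 8.0f, 42.0f };
ggml_fp16_t tmp[10];
float dst[10];

ggml_cpu_fp32_to_fp16(src, tmp, 10); // n = 10: one 8-wide NNPA iteration, then two scalar tail elements
ggml_cpu_fp16_to_fp32(tmp, dst, 10); // same split on the way back

The scalar tail relies on GGML_CPU_FP16_TO_FP32, which on platforms without hardware converts is backed by the ggml_table_f32_f16 array initialized above: every one of the 65536 fp16 bit patterns gets a precomputed fp32 value, so conversion becomes a single indexed load. A sketch of the lookup idea (f16_lookup is an illustrative name; the real accessor lives in simd-mappings.h):

#include <stdint.h>
#include <string.h>

extern float ggml_table_f32_f16[1 << 16];

static inline float f16_lookup(ggml_fp16_t h) {
    uint16_t bits;
    memcpy(&bits, &h, sizeof(bits)); // reinterpret fp16 storage as its bit pattern
    return ggml_table_f32_f16[bits]; // one load instead of exponent/mantissa math
}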
@@ -578,6 +578,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
  if (ggml_cpu_has_vxe()) {
  features.push_back({ "VXE", "1" });
  }
+ if (ggml_cpu_has_nnpa()) {
+ features.push_back({ "NNPA", "1" });
+ }
  if (ggml_cpu_has_wasm_simd()) {
  features.push_back({ "WASM_SIMD", "1" });
  }
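With the hook above, NNPA shows up next to VXE in the CPU backend's feature list. A small sketch of querying it at runtime:

#include <stdio.h>

if (ggml_cpu_has_nnpa()) {
    printf("NNPA fp16<->fp32 conversion paths compiled in\n"); // mirrors the "NNPA" feature entry
}

Note that ggml_cpu_has_nnpa() reports compile-time support (whether GGML_NNPA was set at build time), not a runtime CPU probe.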
@@ -52,6 +52,7 @@
  #include "ggml-impl.h"
  #include "ggml-cpu-impl.h"
  #include "ggml-quants.h"
+ #include "simd-mappings.h"

  #include <array>
  #include <type_traits>

@@ -73,7 +74,7 @@
  namespace {

  inline float unhalf(ggml_fp16_t d) {
- return
+ return GGML_CPU_FP16_TO_FP32(d);
  }

  ////////////////////////////////////////////////////////////////////////////////////////////////////

@@ -252,7 +253,7 @@ template <> inline float32x4_t load(const ggml_fp16_t * p) {
  float tmp[4];

  for (int i = 0; i < 4; i++) {
- tmp[i] =
+ tmp[i] = GGML_CPU_FP16_TO_FP32(p[i]);
  }

  return vec_xl(0, (const float *)(tmp));
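Note: the sgemm templates funnel every scalar fp16 read through unhalf(), so this one-line change switches the whole matmul fallback path at once. A trivial sketch of how it is used inside the kernels (dot2 is illustrative, not a real kernel):

// sketch: unhalf() is the single scalar fp16 -> fp32 choke point in sgemm
static inline float dot2(const ggml_fp16_t * a, const ggml_fp16_t * b) {
    return unhalf(a[0]) * unhalf(b[0]) + unhalf(a[1]) * unhalf(b[1]);
}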
@@ -108,7 +108,7 @@ static void ggml_compute_forward_dup_f16(
  for (int i01 = ir0; i01 < ir1; i01++) {
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
  for (int i00 = 0; i00 < ne00; i00++) {
- dst_ptr[id] =
+ dst_ptr[id] = GGML_CPU_FP16_TO_FP32(src0_ptr[i00]);
  id++;
  }
  }

@@ -130,7 +130,7 @@ static void ggml_compute_forward_dup_f16(
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);

  for (int i00 = 0; i00 < ne00; i00++) {
- src0_f32[i00] =
+ src0_f32[i00] = GGML_CPU_FP16_TO_FP32(src0_ptr[i00]);
  }

  quantize_row_q(src0_f32, dst_ptr + id, ne00);

@@ -156,7 +156,7 @@ static void ggml_compute_forward_dup_f16(
  for (int i00 = 0; i00 < ne00; i00++) {
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] =
+ dst_ptr[id] = GGML_CPU_FP16_TO_FP32(*src0_ptr);
  id++;
  }
  }

@@ -267,7 +267,7 @@ static void ggml_compute_forward_dup_f16(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(float *) dst_ptr =
+ *(float *) dst_ptr = GGML_CPU_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);

  if (++i10 == ne0) {
  i10 = 0;

@@ -372,7 +372,7 @@ static void ggml_compute_forward_dup_bf16(
  for (int i01 = ir0; i01 < ir1; i01++) {
  const ggml_bf16_t * src0_ptr = (ggml_bf16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
  for (int i00 = 0; i00 < ne00; i00++) {
- dst_ptr[id] =
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(src0_ptr[i00]));
  id++;
  }
  }

@@ -473,7 +473,7 @@ static void ggml_compute_forward_dup_bf16(
  for (int i00 = 0; i00 < ne00; i00++) {
  const ggml_bf16_t * src0_ptr = (ggml_bf16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] =
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(*src0_ptr));
  id++;
  }
  }

@@ -566,7 +566,7 @@ static void ggml_compute_forward_dup_bf16(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(ggml_fp16_t *) dst_ptr =
+ *(ggml_fp16_t *) dst_ptr = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(*(const ggml_bf16_t *) src0_ptr));

  if (++i10 == ne0) {
  i10 = 0;

@@ -765,7 +765,7 @@ static void ggml_compute_forward_dup_f32(
  for (int i00 = 0; i00 < ne00; i00++) {
  const float * src0_ptr = (float *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] =
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(*src0_ptr);
  id++;
  }
  }

@@ -878,7 +878,7 @@ static void ggml_compute_forward_dup_f32(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(ggml_fp16_t *) dst_ptr =
+ *(ggml_fp16_t *) dst_ptr = GGML_CPU_FP32_TO_FP16(*(const float *) src0_ptr);

  if (++i10 == ne0) {
  i10 = 0;

@@ -1419,7 +1419,7 @@ static void ggml_compute_forward_add1_f16_f32(
  ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 );
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01);
  for (int i = 0; i < ne0; i++) {
- dst_ptr[i] =
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(src0_ptr[i]) + v);
  }
  }
  }

@@ -1435,7 +1435,7 @@ static void ggml_compute_forward_add1_f16_f16(
  GGML_ASSERT(ggml_is_scalar(src1));

  // scalar to add
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(*(ggml_fp16_t *) src1->data);

  const int ith = params->ith;
  const int nth = params->nth;

@@ -1467,7 +1467,7 @@ static void ggml_compute_forward_add1_f16_f16(
  ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 );
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01);
  for (int i = 0; i < ne0; i++) {
- dst_ptr[i] =
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(src0_ptr[i]) + v);
  }
  }
  }

@@ -1889,7 +1889,7 @@ static void ggml_compute_forward_sum_f16(
  }
  }
  }
- ((ggml_fp16_t *) dst->data)[0] =
+ ((ggml_fp16_t *) dst->data)[0] = GGML_CPU_FP32_TO_FP16(sum);
  }

  static void ggml_compute_forward_sum_bf16(

@@ -2660,7 +2660,7 @@ static void ggml_compute_forward_gelu_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));

@@ -2763,7 +2763,7 @@ static void ggml_compute_forward_gelu_erf_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));

@@ -2866,7 +2866,7 @@ static void ggml_compute_forward_gelu_quick_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));

@@ -2969,7 +2969,7 @@ static void ggml_compute_forward_silu_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*(dst->nb[1])))[k];
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));

@@ -3163,7 +3163,7 @@ static void ggml_compute_forward_silu_back_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const float x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v =
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -4500,7 +4500,7 @@ static void ggml_compute_forward_get_rows_back_f32_f16(

  for (int j = 0; j < nc; ++j) {
  ggml_fp16_t v = ((ggml_fp16_t *) ((char *) src0->data + i*src0->nb[1]))[j];
- ((float *) ((char *) dst->data + r*dst->nb[1]))[j] +=
+ ((float *) ((char *) dst->data + r*dst->nb[1]))[j] += GGML_CPU_FP16_TO_FP32(v);
  }
  }
  }

@@ -4792,7 +4792,7 @@ static void ggml_compute_forward_soft_max_f32(
  if (mp_f32) {
  if (use_f16) {
  for (int i = 0; i < nc; ++i) {
- wp[i] += slope*
+ wp[i] += slope*GGML_CPU_FP16_TO_FP32(mp_f16[i]);
  }
  } else {
  for (int i = 0; i < nc; ++i) {

@@ -5018,8 +5018,8 @@ static void ggml_compute_forward_clamp_f16(
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + j*nb01);

  for (int i = 0; i < nc; i++) {
- float v =
- dst_ptr[i] =
+ float v = GGML_CPU_FP16_TO_FP32(src0_ptr[i]);
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(MAX(MIN(v, max), min));
  }
  }
  }

@@ -5476,11 +5476,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 =
- const float x1 =
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims]);

- dst_data[0] =
- dst_data[n_dims] =
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  } else {
  for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {

@@ -5492,11 +5492,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 =
- const float x1 =
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims/2]);

- dst_data[0] =
- dst_data[n_dims/2] =
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims/2] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  }
  } else {

@@ -5507,11 +5507,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + i0*nb0);

- const float x0 =
- const float x1 =
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[1]);

- dst_data[0] =
- dst_data[1] =
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[1] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  }

@@ -5525,11 +5525,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 =
- const float x1 =
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims]);

- dst_data[0] =
- dst_data[n_dims] =
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  } else {
  for (int64_t i0 = n_dims; i0 < ne0; i0 += 2) {

@@ -5640,7 +5640,7 @@ static void ggml_compute_forward_conv_transpose_1d_f16_f32(
  for (int64_t i11 = 0; i11 < ne11; i11++) {
  const float * const src = (float *)((char *) src1->data + i11*nb11);
  for (int64_t i10 = 0; i10 < ne10; i10++) {
- dst_data[i10*ne11 + i11] =
+ dst_data[i10*ne11 + i11] = GGML_CPU_FP32_TO_FP16(src[i10]);
  }
  }
  }

@@ -5933,7 +5933,7 @@ static void ggml_compute_forward_im2col_f16(
  if (iih < 0 || iih >= IH || iiw < 0 || iiw >= IW) {
  dst_data[iic*(KH*KW) + ikh*KW + ikw] = 0;
  } else {
- dst_data[iic*(KH*KW) + ikh*KW + ikw] =
+ dst_data[iic*(KH*KW) + ikh*KW + ikw] = GGML_CPU_FP32_TO_FP16(src_data[iih*IW + iiw]);
  }
  }
  }

@@ -6109,7 +6109,7 @@ void ggml_compute_forward_conv_transpose_2d(
  const float * const src = (float *)((char *) src1->data + i12*nb12 + i11*nb11);
  ggml_fp16_t * dst_data = wdata + i11*ne10*ne12;
  for (int i10 = 0; i10 < ne10; i10++) {
- dst_data[i10*ne12 + i12] =
+ dst_data[i10*ne12 + i12] = GGML_CPU_FP32_TO_FP16(src[i10]);
  }
  }
  }

@@ -6358,7 +6358,7 @@ static void ggml_compute_forward_pool_1d_sk_p0(
  case GGML_OP_POOL_COUNT: GGML_ABORT("fatal error");
  }
  for (int ki = 0; ki < k; ++ki) {
- const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] :
+ const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
  switch (op) {
  case GGML_OP_POOL_AVG: drow[i] += srow_j; break;
  case GGML_OP_POOL_MAX: if (srow_j > drow[i]) drow[i] = srow_j; break;

@@ -6450,7 +6450,7 @@ void ggml_compute_forward_pool_2d(
  for (int kx = 0; kx < k0; ++kx) {
  int j = ix + kx;
  if (j < 0 || j >= src->ne[0]) continue;
- const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] :
+ const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
  switch (op) {
  case GGML_OP_POOL_AVG: *out += srow_j; break;
  case GGML_OP_POOL_MAX: if (srow_j > *out) *out = srow_j; break;

@@ -6538,7 +6538,7 @@ void ggml_compute_forward_pool_2d_back(
  }

  const float val = dst->type == GGML_TYPE_F32 ?
- ((const float *) drowf)[j] :
+ ((const float *) drowf)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t *) drowf)[j]);
  if (val <= maxval) {
  continue;
  }

@@ -6558,7 +6558,7 @@ void ggml_compute_forward_pool_2d_back(
  if (dst->type == GGML_TYPE_F32) {
  ((float *) drow)[j] += grad0;
  } else {
- ((ggml_fp16_t *) drow)[j] =
+ ((ggml_fp16_t *) drow)[j] = GGML_CPU_FP32_TO_FP16(grad0 + GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t *) drow)[j]));
  }
  } else if (op == GGML_OP_POOL_AVG) {
  const float grad = grad0 / ka;

@@ -6577,7 +6577,7 @@ void ggml_compute_forward_pool_2d_back(
  if (dst->type == GGML_TYPE_F32) {
  ((float *) drow)[j] += grad;
  } else {
- ((ggml_fp16_t *) drow)[j] +=
+ ((ggml_fp16_t *) drow)[j] += GGML_CPU_FP32_TO_FP16(grad);
  }
  }
  }

@@ -7147,7 +7147,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
  // loop over n_kv and n_head_kv
  // ref: https://arxiv.org/pdf/2112.05682.pdf
  for (int64_t ic = 0; ic < nek1; ++ic) {
- const float mv = mp ? slope*
+ const float mv = mp ? slope*GGML_CPU_FP16_TO_FP32(mp[ic]) : 0.0f;
  if (mv == -INFINITY) {
  continue;
  }

@@ -7215,7 +7215,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(

  if (v->type == GGML_TYPE_F16) {
  for (int64_t d = 0; d < DV; ++d) {
- VKQ32[d] =
+ VKQ32[d] = GGML_CPU_FP16_TO_FP32(VKQ16[d]);
  }
  }
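Note on the ggml_compute_forward_rope_f16 hunks above: the fp16 path widens each (x0, x1) pair, applies the 2D rotation x0' = x0*cos_theta - x1*sin_theta, x1' = x0*sin_theta + x1*cos_theta, and narrows the results back to fp16. The per-pair body, extracted as a sketch (rope_pair_f16 is an illustrative name):

// sketch: widen, rotate, narrow - the per-pair body of the fp16 rope path
static void rope_pair_f16(ggml_fp16_t * v0, ggml_fp16_t * v1,
                          float cos_theta, float sin_theta) {
    const float x0 = GGML_CPU_FP16_TO_FP32(*v0);
    const float x1 = GGML_CPU_FP16_TO_FP32(*v1);
    *v0 = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
    *v1 = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
}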
@@ -2,6 +2,7 @@
 #include "ggml-common.h"
 
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "ggml-quants.h"
 #include "quants.h"
 
@@ -137,7 +138,7 @@ void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+        sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
     }
 
     *s = sumf;
@@ -174,7 +175,7 @@ void ggml_vec_dot_q4_1_q8_1_generic(int n, float * GGML_RESTRICT s, size_t bs, c
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }
 
     *s = sumf;
@@ -217,7 +218,7 @@ void ggml_vec_dot_q5_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
     }
 
     *s = sumf;
@@ -260,7 +261,7 @@ void ggml_vec_dot_q5_1_q8_1_generic(int n, float * GGML_RESTRICT s, size_t bs, c
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }
 
     *s = sumf;
@@ -290,7 +291,7 @@ void ggml_vec_dot_q8_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }
 
-        sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+        sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
     }
 
     *s = sumf;
@@ -342,7 +343,7 @@ void ggml_vec_dot_tq1_0_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
             }
         }
 
-        sumf += (float) sum * (GGML_FP16_TO_FP32(x[i].d) * y[i].d);
+        sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
     }
 
     *s = sumf;
@@ -372,7 +373,7 @@ void ggml_vec_dot_tq2_0_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
             }
         }
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
         sumf += (float) sumi * d;
     }
@@ -405,8 +406,8 @@ void ggml_vec_dot_q2_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
             summs += y[i].bsums[j] * (sc[j] >> 4);
         }
 
-        const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         int isum = 0;
         int is = 0;
@@ -504,7 +505,7 @@ void ggml_vec_dot_q3_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -577,9 +578,9 @@ void ggml_vec_dot_q4_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -657,9 +658,9 @@ void ggml_vec_dot_q5_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -714,7 +715,7 @@ void ggml_vec_dot_q6_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -739,7 +740,7 @@ void ggml_vec_dot_iq2_xxs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
         int32_t bsum = 0;
@@ -778,7 +779,7 @@ void ggml_vec_dot_iq2_xs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const uint8_t  * GGML_RESTRICT sc = x[i].scales;
         const int8_t   * GGML_RESTRICT q8 = y[i].qs;
@@ -829,7 +830,7 @@ void ggml_vec_dot_iq2_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
     float sumf = 0;
     for (int i = 0; i < nb; i++) {
 
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const int8_t  * q8 = y[i].qs;
         const uint8_t * qs = x[i].qs;
         const uint8_t * qh = x[i].qh;
@@ -882,7 +883,7 @@ void ggml_vec_dot_iq3_xxs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t  * GGML_RESTRICT q8 = y[i].qs;
@@ -924,7 +925,7 @@ void ggml_vec_dot_iq3_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -1002,7 +1003,7 @@ void ggml_vec_dot_iq1_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
             qs += 4;
         }
 
-        sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
+        sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
     }
 
     *s = sumf;
@@ -1063,7 +1064,7 @@ void ggml_vec_dot_iq1_m_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
             qh += 2;
         }
 
-        sumf += GGML_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
+        sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
     }
 
     *s = sumf;
@@ -1087,7 +1088,7 @@ void ggml_vec_dot_iq4_nl_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
     float sumf = 0;
 
     for (; ib < nb; ++ib) {
-        const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
+        const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
         int sumi1 = 0, sumi2 = 0;
         for (int j = 0; j < QK4_NL/2; ++j) {
             sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -1113,7 +1114,7 @@ void ggml_vec_dot_iq4_xs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
 
     float sumf = 0;
     for (int ibl = 0; ibl < nb; ++ibl) {
-        const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
+        const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
         uint16_t h = x[ibl].scales_h;
         const uint8_t * qs = x[ibl].qs;
         const int8_t  * q8 = y[ibl].qs;
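Every `_generic` dot product above touches fp16 only for the per-block scales, so the conversion cost is one widen per block of 32-256 weights. A self-contained sketch of the q8_0 case (the block layout follows ggml-common.h, assuming QK8_0 == 32; fp16_to_fp32 is a portable stand-in for GGML_CPU_FP16_TO_FP32, written out so the sketch compiles anywhere):

    #include <stdint.h>
    #include <string.h>

    #define QK8_0 32

    typedef uint16_t ggml_fp16_t;

    typedef struct {
        ggml_fp16_t d;          // block scale (fp16)
        int8_t      qs[QK8_0];  // quantized values
    } block_q8_0;

    // bit-exact IEEE binary16 -> binary32 widening (stand-in, not ggml API)
    float fp16_to_fp32(ggml_fp16_t h) {
        uint32_t sign = (uint32_t)(h >> 15) << 31;
        uint32_t exp  = (h >> 10) & 0x1f;
        uint32_t man  =  h        & 0x3ff;
        uint32_t bits;
        if (exp == 0x1f) {
            bits = sign | 0x7f800000u | (man << 13);          // inf/NaN
        } else if (exp != 0) {
            bits = sign | ((exp + 112) << 23) | (man << 13);  // normal: rebias 15 -> 127
        } else if (man != 0) {
            int e = -1;                                       // subnormal: renormalize
            do { e++; man <<= 1; } while ((man & 0x400) == 0);
            bits = sign | ((uint32_t)(112 - e) << 23) | ((man & 0x3ff) << 13);
        } else {
            bits = sign;                                      // +/- zero
        }
        float f;
        memcpy(&f, &bits, sizeof(f));
        return f;
    }

    float dot_q8_0_block(const block_q8_0 * x, const block_q8_0 * y) {
        int sumi = 0;
        for (int j = 0; j < QK8_0; j++) {
            sumi += x->qs[j]*y->qs[j];   // integer inner product
        }
        // one fp16 widen per block scale, one fp multiply per block
        return sumi*(fp16_to_fp32(x->d)*fp16_to_fp32(y->d));
    }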
@@ -6,6 +6,7 @@
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "traits.h"
 
 #include "arch-fallback.h"
@@ -72,7 +73,7 @@ void ggml_quantize_mat_q8_0_4x4_generic(const float * GGML_RESTRICT x, void * GG
         const float d = amax / ((1 << 7) - 1);
         id[row_iter] = d ? 1.0f / d : 0.0f;
 
-        y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
+        y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
     }
 
     for (int j = 0; j < QK8_0 * 4; j++) {
@@ -110,7 +111,7 @@ void ggml_quantize_mat_q8_0_4x8_generic(const float * GGML_RESTRICT x, void * GG
         const float d = amax / ((1 << 7) - 1);
         id[row_iter] = d ? 1.0f / d : 0.0f;
 
-        y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
+        y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
     }
 
     for (int j = 0; j < QK8_0 * 4; j++) {
@@ -236,7 +237,7 @@ void ggml_gemv_q4_0_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -280,7 +281,7 @@ void ggml_gemv_q4_0_4x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -325,7 +326,7 @@ void ggml_gemv_q4_0_8x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                     const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
                 }
-                sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -396,13 +397,13 @@ void ggml_gemv_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
                     sumi2 = sumi2 * scales_1[j];
                     sumi += sumi1 + sumi2;
                 }
-                sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
             }
         }
         for (int sb = 0; sb < 8; sb++) {
             uint8_t *mins = (uint8_t*) utmp + 8 + sb * 16;
             for (int j = 0; j < ncols_interleaved; j++) {
-                sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
+                sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
             }
         }
     }
@@ -449,7 +450,7 @@ void ggml_gemv_iq4_nl_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs
                     const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
                     sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
                 }
-                sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+                sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
             }
         }
     }
@@ -500,7 +501,7 @@ void ggml_gemm_q4_0_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                         sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                     }
-                    sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+                    sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                 }
             }
         }
@@ -555,7 +556,7 @@ void ggml_gemm_q4_0_4x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                         sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                     }
-                    sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+                    sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                 }
             }
         }
@@ -609,7 +610,7 @@ void ggml_gemm_q4_0_8x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
                         sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
                     }
-                    sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+                    sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                 }
             }
         }
@@ -688,7 +689,7 @@ void ggml_gemm_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
                         sumi2 = sumi2 * scales_1[j];
                         sumi += sumi1 + sumi2;
                     }
-                    sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
+                    sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
                 }
             }
         }
@@ -697,7 +698,7 @@ void ggml_gemm_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
             for(int m = 0; m < 4; m++) {
                 const int16_t *bsums = a_ptr[l].bsums + (sb * 8) + (m * 4) - ((sb % 2) * 6);
                 for(int j = 0; j < ncols_interleaved; j++) {
-                    sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
+                    sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
                 }
             }
         }
@@ -753,7 +754,7 @@ void ggml_gemm_iq4_nl_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs
                         sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
                                  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
                     }
-                    sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+                    sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
                 }
            }
         }
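The interleaved (repacked) kernels differ from the scalar ones only in that b carries ncols_interleaved fp16 scales per block while a carries one, hence the b_ptr[l].d[j] indexing above. A sketch of that scale application (apply_scales is a hypothetical helper for illustration, not repack.cpp code; fp16_to_fp32 is the portable stand-in from the earlier q8_0 sketch):

    #include <stdint.h>

    #define NCOLS 4

    typedef uint16_t ggml_fp16_t;

    float fp16_to_fp32(ggml_fp16_t h);  // see the q8_0 sketch above

    static void apply_scales(float sumf[NCOLS], const int sumi[NCOLS],
                             const ggml_fp16_t b_d[NCOLS], ggml_fp16_t a_d) {
        const float ad = fp16_to_fp32(a_d);  // widen the shared activation scale once
        for (int j = 0; j < NCOLS; j++) {
            sumf[j] += sumi[j] * fp16_to_fp32(b_d[j]) * ad;
        }
    }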
@@ -2,10 +2,167 @@
 
 #include "ggml-cpu-impl.h"
 
+#ifdef __ARM_FEATURE_SVE
+#include <arm_sve.h>
+#endif // __ARM_FEATURE_SVE
+
+#if defined(__ARM_NEON) && !defined(__CUDACC__) && !defined(__MUSACC__)
+// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
+//
+//   $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
+//
+#include <arm_neon.h>
+#endif
+
+#if defined(__F16C__)
+#include <immintrin.h>
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 //
 // simd mappings
 //
 
+// FP16 to FP32 conversion
+
+// 16-bit float
+// on Arm, we use __fp16
+// on x86, we use uint16_t
+//
+// for old CUDA compilers (<= 11), we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/10616
+// for     MUSA compilers        , we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/11843
+//
+#if defined(__ARM_NEON) && !(defined(__CUDACC__) && __CUDACC_VER_MAJOR__ <= 11) && !defined(__MUSACC__)
+    #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) neon_compute_fp16_to_fp32(x)
+    #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) neon_compute_fp32_to_fp16(x)
+
+    #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
+
+    static inline float neon_compute_fp16_to_fp32(ggml_fp16_t h) {
+        __fp16 tmp;
+        memcpy(&tmp, &h, sizeof(ggml_fp16_t));
+        return (float)tmp;
+    }
+
+    static inline ggml_fp16_t neon_compute_fp32_to_fp16(float f) {
+        ggml_fp16_t res;
+        __fp16 tmp = f;
+        memcpy(&res, &tmp, sizeof(ggml_fp16_t));
+        return res;
+    }
+#elif defined(__F16C__)
+    #ifdef _MSC_VER
+        #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
+        #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
+    #else
+        #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
+        #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
+    #endif
+#elif defined(__POWER9_VECTOR__)
+    #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) power_compute_fp16_to_fp32(x)
+    #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) power_compute_fp32_to_fp16(x)
+    /* the inline asm below is about 12% faster than the lookup method */
+    #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
+    #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
+
+    static inline float power_compute_fp16_to_fp32(ggml_fp16_t h) {
+        float f;
+        double d;
+        __asm__(
+            "mtfprd %0,%2\n"
+            "xscvhpdp %0,%0\n"
+            "frsp %1,%0\n" :
+            /* temp */ "=d"(d),
+            /* out */  "=f"(f):
+            /* in */   "r"(h));
+        return f;
+    }
+
+    static inline ggml_fp16_t power_compute_fp32_to_fp16(float f) {
+        double d;
+        ggml_fp16_t r;
+        __asm__( /* xscvdphp can work on double or single precision */
+            "xscvdphp %0,%2\n"
+            "mffprd %1,%0\n" :
+            /* temp */ "=d"(d),
+            /* out */  "=r"(r):
+            /* in */   "f"(f));
+        return r;
+    }
+#elif defined(__riscv) && defined(__riscv_zfhmin)
+    static inline float riscv_compute_fp16_to_fp32(ggml_fp16_t h) {
+        float f;
+        __asm__(
+            "fmv.h.x %[f], %[h]\n\t"
+            "fcvt.s.h %[f], %[f]"
+            : [f] "=&f" (f)
+            : [h] "r" (h)
+        );
+        return f;
+    }
+
+    static inline ggml_fp16_t riscv_compute_fp32_to_fp16(float f) {
+        ggml_fp16_t res;
+        __asm__(
+            "fcvt.h.s %[f], %[f]\n\t"
+            "fmv.x.h %[h], %[f]"
+            : [h] "=&r" (res)
+            : [f] "f" (f)
+        );
+        return res;
+    }
+
+    #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) riscv_compute_fp16_to_fp32(x)
+    #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) riscv_compute_fp32_to_fp16(x)
+    #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
+    #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
+#elif defined(__NNPA__)
+    #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)
+    #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) nnpa_compute_fp32_to_fp16(x)
+
+    #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
+    #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
+
+    static inline float nnpa_compute_fp16_to_fp32(ggml_fp16_t h) {
+        uint16x8_t v_h = vec_splats(h);
+        uint16x8_t v_hd = vec_convert_from_fp16(v_h, 0);
+        return vec_extend_to_fp32_hi(v_hd, 0)[0];
+    }
+
+    static inline ggml_fp16_t nnpa_compute_fp32_to_fp16(float f) {
+        float32x4_t v_f = vec_splats(f);
+        float32x4_t v_zero = vec_splats(0.0f);
+        uint16x8_t v_hd = vec_round_from_fp32(v_f, v_zero, 0);
+        uint16x8_t v_h = vec_convert_to_fp16(v_hd, 0);
+        return vec_extract(v_h, 0);
+    }
+#endif
+
+// precomputed f32 table for f16 (256 KB)
+// defined in ggml-cpu.c, initialized in ggml_cpu_init()
+extern float ggml_table_f32_f16[1 << 16];
+
+// On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
+// so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
+// This is also true for POWER9.
+#if !defined(GGML_CPU_FP16_TO_FP32)
+inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
+    uint16_t s;
+    memcpy(&s, &f, sizeof(uint16_t));
+    return ggml_table_f32_f16[s];
+}
+
+#define GGML_CPU_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
+#endif
+
+#if !defined(GGML_CPU_FP32_TO_FP16)
+#define GGML_CPU_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
+#endif
+
+
 // we define a common set of C macros which map to specific intrinsics based on the current architecture
 // we then implement the fundamental computation operations below using only these macros
 // adding support for new architectures requires to define the corresponding SIMD macros
@@ -415,7 +572,7 @@ static inline __m256 __avx_f32cx8_load(const ggml_fp16_t * x) {
     float tmp[8];
 
     for (int i = 0; i < 8; i++) {
-        tmp[i] = GGML_FP16_TO_FP32(x[i]);
+        tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
     }
 
     return _mm256_loadu_ps(tmp);
@@ -426,7 +583,7 @@ static inline void __avx_f32cx8_store(ggml_fp16_t *x, __m256 y) {
     _mm256_storeu_ps(arr, y);
 
     for (int i = 0; i < 8; i++)
-        x[i] = GGML_FP32_TO_FP16(arr[i]);
+        x[i] = GGML_CPU_FP32_TO_FP16(arr[i]);
 }
 #define GGML_F32Cx8_LOAD(x)     __avx_f32cx8_load(x)
 #define GGML_F32Cx8_STORE(x, y) __avx_f32cx8_store(x, y)
@@ -574,10 +731,10 @@ static inline unsigned char ggml_endian_byte(int i) {
 inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
     float tmp[4];
 
-    tmp[0] = GGML_FP16_TO_FP32(p[0]);
-    tmp[1] = GGML_FP16_TO_FP32(p[1]);
-    tmp[2] = GGML_FP16_TO_FP32(p[2]);
-    tmp[3] = GGML_FP16_TO_FP32(p[3]);
+    tmp[0] = GGML_CPU_FP16_TO_FP32(p[0]);
+    tmp[1] = GGML_CPU_FP16_TO_FP32(p[1]);
+    tmp[2] = GGML_CPU_FP16_TO_FP32(p[2]);
+    tmp[3] = GGML_CPU_FP16_TO_FP32(p[3]);
 
     return wasm_v128_load(tmp);
 }
@@ -587,10 +744,10 @@ inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
 
     wasm_v128_store(tmp, x);
 
-    p[0] = GGML_FP32_TO_FP16(tmp[0]);
-    p[1] = GGML_FP32_TO_FP16(tmp[1]);
-    p[2] = GGML_FP32_TO_FP16(tmp[2]);
-    p[3] = GGML_FP32_TO_FP16(tmp[3]);
+    p[0] = GGML_CPU_FP32_TO_FP16(tmp[0]);
+    p[1] = GGML_CPU_FP32_TO_FP16(tmp[1]);
+    p[2] = GGML_CPU_FP32_TO_FP16(tmp[2]);
+    p[3] = GGML_CPU_FP32_TO_FP16(tmp[3]);
 }
 
 #define GGML_F16x4 v128_t
@@ -690,10 +847,10 @@ inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
 static inline __m128 __sse_f16x4_load(const ggml_fp16_t * x) {
     float tmp[4];
 
-    tmp[0] = GGML_FP16_TO_FP32(x[0]);
-    tmp[1] = GGML_FP16_TO_FP32(x[1]);
-    tmp[2] = GGML_FP16_TO_FP32(x[2]);
-    tmp[3] = GGML_FP16_TO_FP32(x[3]);
+    tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
+    tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
+    tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
+    tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
 
     return _mm_loadu_ps(tmp);
 }
@@ -703,10 +860,10 @@ static inline void __sse_f16x4_store(ggml_fp16_t * x, __m128 y) {
 
     _mm_storeu_ps(arr, y);
 
-    x[0] = GGML_FP32_TO_FP16(arr[0]);
-    x[1] = GGML_FP32_TO_FP16(arr[1]);
-    x[2] = GGML_FP32_TO_FP16(arr[2]);
-    x[3] = GGML_FP32_TO_FP16(arr[3]);
+    x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
+    x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
+    x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
+    x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
 }
 
 #define GGML_F32Cx4 __m128
@@ -828,7 +985,7 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
 #define GGML_F32x4_ZERO __lsx_vldi(0)
 #define GGML_F32x4_SET1(x) __lsx_vinsgr2vr_w(__lsx_vldi(0),(x), 0)
 #define GGML_F32x4_LOAD(x) __lsx_vld((x), 0)
-#define GGML_F32x4_STORE((x),(y))   __lsx_vst((y), (x), 0)
+#define GGML_F32x4_STORE(x, y) __lsx_vst(y, x, 0)
 #define GGML_F32x4_FMA(a, b, c) __lsx_vfmadd_s(b, c, a)
 #define GGML_F32x4_ADD __lsx_vfadd_s
 #define GGML_F32x4_MUL __lsx_vfmul_s
@@ -874,10 +1031,10 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
 static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
     float tmp[4];
 
-    tmp[0] = GGML_FP16_TO_FP32(x[0]);
-    tmp[1] = GGML_FP16_TO_FP32(x[1]);
-    tmp[2] = GGML_FP16_TO_FP32(x[2]);
-    tmp[3] = GGML_FP16_TO_FP32(x[3]);
+    tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
+    tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
+    tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
+    tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
 
     return __lsx_vld(tmp, 0);
 }
@@ -887,10 +1044,10 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
 
     __lsx_vst(y, arr, 0);
 
-    x[0] = GGML_FP32_TO_FP16(arr[0]);
-    x[1] = GGML_FP32_TO_FP16(arr[1]);
-    x[2] = GGML_FP32_TO_FP16(arr[2]);
-    x[3] = GGML_FP32_TO_FP16(arr[3]);
+    x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
+    x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
+    x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
+    x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
 }
 
 #define GGML_F32Cx4 __m128
@@ -922,7 +1079,7 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
 #define GGML_F32_STEP 32
 #define GGML_F32_EPR  4
 
-#define GGML_F32x4 __vector float
+#define GGML_F32x4 float32x4_t
 #define GGML_F32x4_ZERO vec_splats(0.0f)
 #define GGML_F32x4_SET1 vec_splats
 #define GGML_F32x4_LOAD(p) vec_xl(0, p)
@@ -962,28 +1119,45 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
 #define GGML_F16_STEP GGML_F32_STEP
 #define GGML_F16_EPR  GGML_F32_EPR
 
-static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
+static inline float32x4_t __lzs_f16cx4_load(const ggml_fp16_t * x) {
+#if defined(__NNPA__)
+    uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)x);
+    uint16x8_t v_xd = vec_convert_from_fp16(v_x, 0);
+    return vec_extend_to_fp32_hi(v_xd, 0);
+#else
     float tmp[4];
 
    for (int i = 0; i < 4; i++) {
-        tmp[i] = GGML_FP16_TO_FP32(x[i]);
+        tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
     }
 
     // note: keep type-cast here to prevent compiler bugs
     // see: https://github.com/ggml-org/llama.cpp/issues/12846
     return vec_xl(0, (const float *)(tmp));
+#endif
 }
 
-static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
+static inline void __lzs_f16cx4_store(ggml_fp16_t * x, float32x4_t v_y) {
+#if defined(__NNPA__)
+    float32x4_t v_zero = vec_splats(0.0f);
+    uint16x8_t v_xd = vec_round_from_fp32(v_y, v_zero, 0);
+    uint16x8_t v_x = vec_convert_to_fp16(v_xd, 0);
+
+    x[0] = vec_extract(v_x, 0);
+    x[1] = vec_extract(v_x, 1);
+    x[2] = vec_extract(v_x, 2);
+    x[3] = vec_extract(v_x, 3);
+#else
     float arr[4];
 
     // note: keep type-cast here to prevent compiler bugs
     // see: https://github.com/ggml-org/llama.cpp/issues/12846
-    vec_xst(y, 0, (float *)(arr));
+    vec_xst(v_y, 0, (float *)(arr));
 
     for (int i = 0; i < 4; i++) {
-        x[i] = GGML_FP32_TO_FP16(arr[i]);
+        x[i] = GGML_CPU_FP32_TO_FP16(arr[i]);
     }
+#endif
 }
 
 #define GGML_F16_VEC GGML_F32x4
@@ -1004,3 +1178,7 @@ static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
 #define GGML_F32_ARR (GGML_F32_STEP/GGML_F32_EPR)
 #define GGML_F16_ARR (GGML_F16_STEP/GGML_F16_EPR)
 #endif
+
+#ifdef __cplusplus
+}
+#endif
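A sketch of a round-trip check for the scalar NNPA conversions above. Assumptions: built for s390x with NNPA available so that __NNPA__ is defined and the GGML_CPU_* macros resolve to the nnpa_compute_* helpers. The NNPA path converts through an intermediate vector format, so this measures, rather than asserts, how many fp16 bit patterns survive fp16 -> fp32 -> fp16 intact:

    #include <stdint.h>
    #include <stdio.h>
    #include "simd-mappings.h"

    int main(void) {
        int changed = 0;
        for (uint32_t bits = 0; bits < (1u << 16); ++bits) {
            ggml_fp16_t h = (ggml_fp16_t) bits;
            if (((bits >> 10) & 0x1f) == 0x1f) continue;  // skip inf/NaN encodings
            float       f = GGML_CPU_FP16_TO_FP32(h);
            ggml_fp16_t r = GGML_CPU_FP32_TO_FP16(f);
            if (r != h) changed++;
        }
        printf("non-round-tripping fp16 patterns: %d / 63488\n", changed);
        return 0;
    }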
@@ -219,11 +219,11 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
 
     // leftovers
     for (int i = np; i < n; ++i) {
-        sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
+        sumf += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[i])*GGML_CPU_FP16_TO_FP32(y[i]));
     }
 #else
     for (int i = 0; i < n; ++i) {
-        sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
+        sumf += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[i])*GGML_CPU_FP16_TO_FP32(y[i]));
     }
 #endif
 
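A scalar reference for the fp16 dot product above (a checking aid for the SIMD path, not ggml API); accumulating in double mirrors ggml_float, so it stays bit-identical to the scalar #else branch. fp16_to_fp32 is the portable stand-in from the earlier q8_0 sketch:

    #include <stdint.h>

    typedef uint16_t ggml_fp16_t;

    float fp16_to_fp32(ggml_fp16_t h);  // see the q8_0 sketch above

    double vec_dot_f16_ref(int n, const ggml_fp16_t * x, const ggml_fp16_t * y) {
        double sumf = 0.0;
        for (int i = 0; i < n; ++i) {
            sumf += (double)(fp16_to_fp32(x[i])*fp16_to_fp32(y[i]));
        }
        return sumf;
    }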
@@ -58,7 +58,7 @@ inline static void ggml_vec_set_bf16(const int n, ggml_bf16_t * x, const ggml_bf
|
|
| 58 |
inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] + y[i]; }
|
| 59 |
inline static void ggml_vec_add_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 60 |
for (int i = 0; i < n; ++i) {
|
| 61 |
-
z[i] =
|
| 62 |
}
|
| 63 |
}
|
| 64 |
inline static void ggml_vec_add1_f32(const int n, float * z, const float * x, const float v) { for (int i = 0; i < n; ++i) z[i] = x[i] + v; }
|
|
@@ -67,7 +67,7 @@ inline static void ggml_vec_acc1_f32(const int n, float * y, const float v)
|
|
| 67 |
inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] - y[i]; }
|
| 68 |
inline static void ggml_vec_sub_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 69 |
for (int i = 0; i < n; ++i) {
|
| 70 |
-
z[i] =
|
| 71 |
}
|
| 72 |
}
|
| 73 |
inline static void ggml_vec_set_f32 (const int n, float * x, const float v) { for (int i = 0; i < n; ++i) x[i] = v; }
|
|
@@ -75,20 +75,20 @@ inline static void ggml_vec_cpy_f32 (const int n, float * y, const float * x)
|
|
| 75 |
inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = -x[i]; }
|
| 76 |
inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 77 |
for (int i = 0; i < n; ++i) {
|
| 78 |
-
y[i] =
|
| 79 |
}
|
| 80 |
}
|
| 81 |
|
| 82 |
inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]*y[i]; }
|
| 83 |
inline static void ggml_vec_mul_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 84 |
for (int i = 0; i < n; ++i) {
|
| 85 |
-
z[i] =
|
| 86 |
}
|
| 87 |
}
|
| 88 |
inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]/y[i]; }
|
| 89 |
inline static void ggml_vec_div_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 90 |
for (int i = 0; i < n; ++i) {
|
| 91 |
-
z[i] =
|
| 92 |
}
|
| 93 |
}
|
| 94 |
|
|
@@ -131,13 +131,13 @@ inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GG
|
|
| 131 |
// leftovers
|
| 132 |
for (int i = np; i < n; ++i) {
|
| 133 |
for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
|
| 134 |
-
sumf[j] += (ggml_float)(
|
| 135 |
}
|
| 136 |
}
|
| 137 |
#else
|
| 138 |
for (int i = 0; i < n; ++i) {
|
| 139 |
for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
|
| 140 |
-
sumf[j] += (ggml_float)(
|
| 141 |
}
|
| 142 |
}
|
| 143 |
#endif
|
|
@@ -280,12 +280,12 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * GGML_RESTRICT y,
|
|
| 280 |
|
| 281 |
// leftovers
|
| 282 |
for (int i = np; i < n; ++i) {
|
| 283 |
-
y[i] =
|
| 284 |
}
|
| 285 |
#else
|
| 286 |
// scalar
|
| 287 |
for (int i = 0; i < n; ++i) {
|
| 288 |
-
y[i] =
|
| 289 |
}
|
| 290 |
#endif
|
| 291 |
}
|
|
@@ -430,12 +430,12 @@ inline static void ggml_vec_scale_f16(const int n, ggml_fp16_t * y, const float
|
|
| 430 |
|
| 431 |
// leftovers
|
| 432 |
for (int i = np; i < n; ++i) {
|
| 433 |
-
y[i] =
|
| 434 |
}
|
| 435 |
#else
|
| 436 |
// scalar
|
| 437 |
for (int i = 0; i < n; ++i) {
|
| 438 |
-
y[i] =
|
| 439 |
}
|
| 440 |
#endif
|
| 441 |
}
|
|
@@ -444,103 +444,103 @@ inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) {
|
|
| 444 |
inline static void ggml_vec_sqr_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
|
| 445 |
inline static void ggml_vec_sqr_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 446 |
for (int i = 0; i < n; ++i) {
|
| 447 |
-
float v =
|
| 448 |
-
y[i] =
|
| 449 |
}
|
| 450 |
}
|
| 451 |
inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrtf(x[i]); }
|
| 452 |
inline static void ggml_vec_sqrt_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 453 |
for (int i = 0; i < n; ++i) {
|
| 454 |
-
y[i] =
|
| 455 |
}
|
| 456 |
}
|
| 457 |
inline static void ggml_vec_log_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = logf(x[i]); }
|
| 458 |
inline static void ggml_vec_log_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 459 |
for (int i = 0; i < n; ++i) {
|
| 460 |
-
y[i] =
|
| 461 |
}
|
| 462 |
}
|
| 463 |
inline static void ggml_vec_sin_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sinf(x[i]); }
|
| 464 |
inline static void ggml_vec_sin_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 465 |
for (int i = 0; i < n; ++i) {
|
| 466 |
-
y[i] =
|
| 467 |
}
|
| 468 |
}
|
| 469 |
inline static void ggml_vec_cos_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = cosf(x[i]); }
|
| 470 |
inline static void ggml_vec_cos_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 471 |
for (int i = 0; i < n; ++i) {
|
| 472 |
-
y[i] =
|
| 473 |
}
|
| 474 |
}
|
| 475 |
inline static void ggml_vec_abs_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
|
| 476 |
inline static void ggml_vec_abs_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 477 |
for (int i = 0; i < n; ++i) {
|
| 478 |
-
y[i] =
|
| 479 |
}
|
| 480 |
}
|
| 481 |
inline static void ggml_vec_sgn_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
|
| 482 |
inline static void ggml_vec_sgn_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 483 |
for (int i = 0; i < n; ++i) {
|
| 484 |
-
float v =
|
| 485 |
-
y[i] =
|
| 486 |
}
|
| 487 |
}
|
| 488 |
inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
|
| 489 |
inline static void ggml_vec_step_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 490 |
for (int i = 0; i < n; ++i) {
|
| 491 |
-
y[i] =
|
| 492 |
}
|
| 493 |
}
|
| 494 |
inline static void ggml_vec_tanh_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = tanhf(x[i]); }
|
| 495 |
inline static void ggml_vec_tanh_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 496 |
for (int i = 0; i < n; ++i) {
|
| 497 |
-
y[i] =
|
| 498 |
}
|
| 499 |
}
|
| 500 |
inline static void ggml_vec_elu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : expm1f(x[i]); }
|
| 501 |
inline static void ggml_vec_elu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 502 |
for (int i = 0; i < n; ++i) {
|
| 503 |
-
y[i] =
|
| 504 |
}
|
| 505 |
}
|
| 506 |
inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }
|
| 507 |
inline static void ggml_vec_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 508 |
for (int i = 0; i < n; ++i) {
|
| 509 |
-
float v =
|
| 510 |
-
y[i] =
|
| 511 |
}
|
| 512 |
}
|
| 513 |
inline static void ggml_vec_leaky_relu_f32 (const int n, float * y, const float * x, const float ns) { for (int i = 0; i < n; ++i) y[i] = ((x[i] > 0.f) ? x[i] : 0.f) + ns * ((x[i] < 0.0f) ? x[i] : 0.f); }
|
| 514 |
inline static void ggml_vec_leaky_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x, const float ns) {
|
| 515 |
for (int i = 0; i < n; ++i) {
|
| 516 |
-
float v =
|
| 517 |
-
y[i] =
|
| 518 |
}
|
| 519 |
}
|
| 520 |
inline static void ggml_vec_sigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = 1.f / (1.f + expf(-x[i])); }
|
| 521 |
inline static void ggml_vec_sigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 522 |
for (int i = 0; i < n; ++i) {
|
| 523 |
-
y[i] =
|
| 524 |
}
|
| 525 |
}
|
| 526 |
// TODO: optimize performance
|
| 527 |
inline static void ggml_vec_hardswish_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
|
| 528 |
inline static void ggml_vec_hardswish_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 529 |
for (int i = 0; i < n; ++i) {
|
| 530 |
-
float v =
|
| 531 |
-
y[i] =
|
| 532 |
}
|
| 533 |
}
|
| 534 |
inline static void ggml_vec_hardsigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
|
| 535 |
inline static void ggml_vec_hardsigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 536 |
for (int i = 0; i < n; ++i) {
|
| 537 |
-
y[i] =
|
| 538 |
}
|
| 539 |
}
|
| 540 |
inline static void ggml_vec_exp_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = expf(x[i]); }
|
| 541 |
inline static void ggml_vec_exp_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 542 |
for (int i = 0; i < n; ++i) {
|
| 543 |
-
y[i] =
|
| 544 |
}
|
| 545 |
}
|
| 546 |
|
|
@@ -562,9 +562,9 @@ inline static void ggml_vec_gelu_f16(const int n, ggml_fp16_t * y, const ggml_fp
|
|
| 562 |
|
| 563 |
inline static void ggml_vec_gelu_erf_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 564 |
for (int i = 0; i < n; ++i) {
|
| 565 |
-
float xi =
|
| 566 |
float res = 0.5f*xi*(1.0f + erff(xi*SQRT_2_INV));
|
| 567 |
-
y[i] =
|
| 568 |
}
|
| 569 |
}
|
| 570 |
|
|
@@ -577,9 +577,9 @@ inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
|
|
| 577 |
} else if (x[i] >= 10.0f) {
|
| 578 |
y[i] = x[i];
|
| 579 |
} else {
|
| 580 |
-
ggml_fp16_t fp16 =
|
| 581 |
memcpy(&t, &fp16, sizeof(uint16_t));
|
| 582 |
-
y[i] =
|
| 583 |
}
|
| 584 |
}
|
| 585 |
}
|
|
@@ -613,9 +613,9 @@ inline static float ggml_gelu_quick_f32(float x) {
|
|
| 613 |
inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
|
| 614 |
uint16_t t;
|
| 615 |
for (int i = 0; i < n; ++i) {
|
| 616 |
-
ggml_fp16_t fp16 =
|
| 617 |
memcpy(&t, &fp16, sizeof(uint16_t));
|
| 618 |
-
y[i] =
|
| 619 |
}
|
| 620 |
}
|
| 621 |
#else
|
|
@@ -628,8 +628,8 @@ inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float *
|
|
| 628 |
|
| 629 |
inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 630 |
for (int i = 0; i < n; ++i) {
|
| 631 |
-
float v =
|
| 632 |
-
y[i] =
|
| 633 |
}
|
| 634 |
}
|
| 635 |
|
|
@@ -638,8 +638,8 @@ inline static float ggml_silu_f32(float x) {
|
|
| 638 |
return x/(1.0f + expf(-x));
|
| 639 |
}
|
| 640 |
inline static ggml_fp16_t ggml_silu_f16(ggml_fp16_t x) {
|
| 641 |
-
float v =
|
| 642 |
-
return
|
| 643 |
}
|
| 644 |
|
| 645 |
#if __FINITE_MATH_ONLY__
|
|
@@ -888,9 +888,9 @@ inline static float ggml_silu_backward_f32(float x, float dy) {
|
|
| 888 |
}
|
| 889 |
|
| 890 |
inline static ggml_fp16_t ggml_silu_backward_f16(ggml_fp16_t x, ggml_fp16_t dy) {
|
| 891 |
-
const float v =
|
| 892 |
const float s = 1.0f/(1.0f + expf(-v));
|
| 893 |
-
return
|
| 894 |
}
|
| 895 |
|
| 896 |
inline static void ggml_vec_silu_backward_f32(const int n, float * dx, const float * x, const float * dy) {
|
|
@@ -928,7 +928,7 @@ inline static void ggml_vec_sum_f32_ggf(const int n, ggml_float * s, const float
|
|
| 928 |
inline static void ggml_vec_sum_f16_ggf(const int n, float * s, const ggml_fp16_t * x) {
|
| 929 |
float sum = 0.0f;
|
| 930 |
for (int i = 0; i < n; ++i) {
|
| 931 |
-
sum +=
|
| 932 |
}
|
| 933 |
*s = sum;
|
| 934 |
}
|
|
|
|
| 58 |
inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] + y[i]; }
|
| 59 |
inline static void ggml_vec_add_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 60 |
for (int i = 0; i < n; ++i) {
|
| 61 |
+
z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) + GGML_CPU_FP16_TO_FP32(y[i]));
|
| 62 |
}
|
| 63 |
}
|
| 64 |
inline static void ggml_vec_add1_f32(const int n, float * z, const float * x, const float v) { for (int i = 0; i < n; ++i) z[i] = x[i] + v; }
|
|
|
|
| 67 |
inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] - y[i]; }
|
| 68 |
inline static void ggml_vec_sub_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 69 |
for (int i = 0; i < n; ++i) {
|
| 70 |
+
z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) - GGML_CPU_FP16_TO_FP32(y[i]));
|
| 71 |
}
|
| 72 |
}
|
| 73 |
inline static void ggml_vec_set_f32 (const int n, float * x, const float v) { for (int i = 0; i < n; ++i) x[i] = v; }
|
|
|
|
| 75 |
inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = -x[i]; }
|
| 76 |
inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
|
| 77 |
for (int i = 0; i < n; ++i) {
|
| 78 |
+
y[i] = GGML_CPU_FP32_TO_FP16(-GGML_CPU_FP16_TO_FP32(x[i]));
|
| 79 |
}
|
| 80 |
}
|
| 81 |
|
| 82 |
inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]*y[i]; }
|
| 83 |
inline static void ggml_vec_mul_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 84 |
for (int i = 0; i < n; ++i) {
|
| 85 |
+
z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) * GGML_CPU_FP16_TO_FP32(y[i]));
|
| 86 |
}
|
| 87 |
}
|
| 88 |
inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]/y[i]; }
|
| 89 |
inline static void ggml_vec_div_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
|
| 90 |
for (int i = 0; i < n; ++i) {
|
| 91 |
+
z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) / GGML_CPU_FP16_TO_FP32(y[i]));
|
| 92 |
}
|
| 93 |
}
|
| 94 |
|
|
|
|
| 131 |
// leftovers
|
| 132 |
for (int i = np; i < n; ++i) {
|
| 133 |
for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
|
| 134 |
+
sumf[j] += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[j][i])*GGML_CPU_FP16_TO_FP32(y[i]));
|
| 135 |
}
|
| 136 |
}
|
| 137 |
#else
|
| 138 |
for (int i = 0; i < n; ++i) {
|
| 139 |
for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
|
| 140 |
+
sumf[j] += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[j][i])*GGML_CPU_FP16_TO_FP32(y[i]));
|
| 141 |
}
|
| 142 |
}
|
| 143 |
#endif
|
|
|
|
| 280 |
|
| 281 |
// leftovers
|
| 282 |
for (int i = np; i < n; ++i) {
|
| 283 |
+
y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
|
| 284 |
}
|
| 285 |
#else
|
| 286 |
// scalar
|
| 287 |
for (int i = 0; i < n; ++i) {
|
| 288 |
+
y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
|
| 289 |
}
|
| 290 |
#endif
|
| 291 |
}
|
|
|
|
| 430 |
|
| 431 |
// leftovers
|
| 432 |
for (int i = np; i < n; ++i) {
|
| 433 |
+
y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i])*v);
|
| 434 |
}
|
| 435 |
#else
|
| 436 |
// scalar
|
| 437 |
for (int i = 0; i < n; ++i) {
|
| 438 |
+
y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i])*v);
|
| 439 |
}
|
| 440 |
#endif
|
| 441 |
}
  inline static void ggml_vec_sqr_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
  inline static void ggml_vec_sqr_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16(v*v);
      }
  }
  inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrtf(x[i]); }
  inline static void ggml_vec_sqrt_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(sqrtf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_log_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = logf(x[i]); }
  inline static void ggml_vec_log_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(logf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_sin_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sinf(x[i]); }
  inline static void ggml_vec_sin_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(sinf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_cos_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = cosf(x[i]); }
  inline static void ggml_vec_cos_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(cosf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_abs_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
  inline static void ggml_vec_abs_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(fabsf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_sgn_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
  inline static void ggml_vec_sgn_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16((v > 0.f) ? 1.f : ((v < 0.f) ? -1.f : 0.f));
      }
  }
  inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
  inline static void ggml_vec_step_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16((GGML_CPU_FP16_TO_FP32(x[i]) > 0.f) ? 1.f : 0.f);
      }
  }
  inline static void ggml_vec_tanh_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = tanhf(x[i]); }
  inline static void ggml_vec_tanh_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(tanhf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_elu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : expm1f(x[i]); }
  inline static void ggml_vec_elu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(expm1f(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
  inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }
  inline static void ggml_vec_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16((v > 0.f) ? v : 0.f);
      }
  }
  inline static void ggml_vec_leaky_relu_f32 (const int n, float * y, const float * x, const float ns) { for (int i = 0; i < n; ++i) y[i] = ((x[i] > 0.f) ? x[i] : 0.f) + ns * ((x[i] < 0.0f) ? x[i] : 0.f); }
  inline static void ggml_vec_leaky_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x, const float ns) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16(((v > 0.f) ? v : 0.f) + ns * ((v < 0.0f) ? v : 0.f));
      }
  }
  inline static void ggml_vec_sigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = 1.f / (1.f + expf(-x[i])); }
  inline static void ggml_vec_sigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(1.f / (1.f + expf(-GGML_CPU_FP16_TO_FP32(x[i]))));
      }
  }
  // TODO: optimize performance
  inline static void ggml_vec_hardswish_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
  inline static void ggml_vec_hardswish_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16(v * fminf(1.0f, fmaxf(0.0f, (v + 3.0f) / 6.0f)));
      }
  }
  inline static void ggml_vec_hardsigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
  inline static void ggml_vec_hardsigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(fminf(1.0f, fmaxf(0.0f, (GGML_CPU_FP16_TO_FP32(x[i]) + 3.0f) / 6.0f)));
      }
  }
  inline static void ggml_vec_exp_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = expf(x[i]); }
  inline static void ggml_vec_exp_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         y[i] = GGML_CPU_FP32_TO_FP16(expf(GGML_CPU_FP16_TO_FP32(x[i])));
      }
  }
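Every fp16 wrapper above has the same shape: widen to fp32, apply the fp32 math, narrow the result. A hedged sketch of how such a family could be stamped out with one macro; the macro and type names are made up for illustration (ggml itself writes each function out explicitly, which is what keeps per-function diffs like the ones above readable):

    #include <math.h>

    typedef _Float16 fp16_t; // illustrative stand-in for ggml_fp16_t + conversion macros

    // v names the widened fp32 value inside each generated loop body
    #define DEFINE_VEC_UNARY_F16(name, expr)                      \
        static void name(int n, fp16_t *y, const fp16_t *x) {     \
            for (int i = 0; i < n; ++i) {                         \
                float v = (float)x[i];  /* widen to fp32 */       \
                y[i] = (fp16_t)(expr);  /* narrow the result */   \
            }                                                     \
        }

    DEFINE_VEC_UNARY_F16(vec_sqrt_f16, sqrtf(v))
    DEFINE_VEC_UNARY_F16(vec_tanh_f16, tanhf(v))
    DEFINE_VEC_UNARY_F16(vec_relu_f16, (v > 0.f) ? v : 0.f)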
  inline static void ggml_vec_gelu_erf_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float xi = GGML_CPU_FP16_TO_FP32(x[i]);
          float res = 0.5f*xi*(1.0f + erff(xi*SQRT_2_INV));
+         y[i] = GGML_CPU_FP32_TO_FP16(res);
      }
  }
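For reference, with SQRT_2_INV $= 1/\sqrt{2}$ the expression above is the exact, erf-based GELU rather than a tanh or sigmoid approximation:

$$ \mathrm{GELU}(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right), $$

evaluated in fp32 per element, with a single narrowing at the store.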
          } else if (x[i] >= 10.0f) {
              y[i] = x[i];
          } else {
+             ggml_fp16_t fp16 = GGML_CPU_FP32_TO_FP16(x[i]);
              memcpy(&t, &fp16, sizeof(uint16_t));
+             y[i] = GGML_CPU_FP16_TO_FP32(ggml_table_gelu_f16[t]);
          }
      }
  }
  inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
      uint16_t t;
      for (int i = 0; i < n; ++i) {
+         ggml_fp16_t fp16 = GGML_CPU_FP32_TO_FP16(x[i]);
          memcpy(&t, &fp16, sizeof(uint16_t));
+         y[i] = GGML_CPU_FP16_TO_FP32(ggml_table_gelu_quick_f16[t]);
      }
  }
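Both table paths use the raw fp16 bit pattern of the (quantized) input as a direct index into a 65536-entry table, so the activation costs one conversion and one load per element. A self-contained sketch of that technique under illustrative names, assuming a compiler with native _Float16 — the table, its init function, and the coefficient are assumptions for this example, not the patch's code:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    #define QUICK_COEF (-1.702f) // sigmoid-approximation constant for quick GELU

    static float gelu_quick_table[1 << 16]; // one fp32 result per fp16 bit pattern (256 KB)

    static void gelu_quick_table_init(void) {
        for (uint32_t t = 0; t < (1u << 16); ++t) {
            _Float16 h;
            uint16_t bits = (uint16_t)t;
            memcpy(&h, &bits, sizeof(h)); // reinterpret the index as an fp16 value
            float v = (float)h;
            gelu_quick_table[t] = v*(1.0f/(1.0f + expf(QUICK_COEF*v)));
        }
    }

    static float gelu_quick_lookup(float x) {
        _Float16 h = (_Float16)x; // quantize the input to fp16 first
        uint16_t t;
        memcpy(&t, &h, sizeof(t)); // its bit pattern is the table index
        return gelu_quick_table[t];
    }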
  #else

  inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
      for (int i = 0; i < n; ++i) {
+         float v = GGML_CPU_FP16_TO_FP32(x[i]);
+         y[i] = GGML_CPU_FP32_TO_FP16(v*(1.0f/(1.0f+expf(GELU_QUICK_COEF*v))));
      }
  }
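For reference, with GELU_QUICK_COEF $= -1.702$ (the value ggml's CPU vec code uses), the loop body is the sigmoid approximation of GELU:

$$ \mathrm{GELU}_{\mathrm{quick}}(x) = x\,\sigma(1.702\,x) = \frac{x}{1 + e^{-1.702\,x}}. $$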
      return x/(1.0f + expf(-x));
  }
  inline static ggml_fp16_t ggml_silu_f16(ggml_fp16_t x) {
+     float v = GGML_CPU_FP16_TO_FP32(x);
+     return GGML_CPU_FP32_TO_FP16(v/(1.0f + expf(-v)));
  }

  #if __FINITE_MATH_ONLY__
  }

  inline static ggml_fp16_t ggml_silu_backward_f16(ggml_fp16_t x, ggml_fp16_t dy) {
+     const float v = GGML_CPU_FP16_TO_FP32(x);
      const float s = 1.0f/(1.0f + expf(-v));
+     return GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(dy)*s*(1.0f + v*(1.0f - s)));
  }

  inline static void ggml_vec_silu_backward_f32(const int n, float * dx, const float * x, const float * dy) {
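The backward expression is the product rule applied to $\mathrm{SiLU}(v) = v\,\sigma(v)$. With $s = \sigma(v)$ and $\sigma'(v) = s(1-s)$:

$$ \frac{d}{dv}\,v\,\sigma(v) = \sigma(v) + v\,\sigma(v)\bigl(1 - \sigma(v)\bigr) = s\bigl(1 + v\,(1 - s)\bigr), $$

which is exactly the factor multiplying the widened dy before the single narrowing at the return.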
  inline static void ggml_vec_sum_f16_ggf(const int n, float * s, const ggml_fp16_t * x) {
      float sum = 0.0f;
      for (int i = 0; i < n; ++i) {
+         sum += GGML_CPU_FP16_TO_FP32(x[i]);
      }
      *s = sum;
  }
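The accumulator here stays in fp32 on purpose: adding many small fp16 terms into an fp16 accumulator stalls once the running sum grows large relative to fp16 spacing. A tiny standalone demonstration, assuming a compiler with native _Float16 (not part of the patch):

    #include <stdio.h>

    int main(void) {
        _Float16 acc16 = 0;
        float    acc32 = 0.0f;
        for (int i = 0; i < 4096; ++i) {
            acc16 = (_Float16)((float)acc16 + 0.25f); // rounds to fp16 every step
            acc32 += 0.25f;                           // rounds only at fp32 scale
        }
        // fp16 spacing reaches 0.5 at 512, so +0.25 ties-to-even and the sum stalls
        printf("fp16 accumulator: %g\n", (double)(float)acc16); // 512
        printf("fp32 accumulator: %g\n", (double)acc32);        // 1024
        return 0;
    }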
@@ -317,203 +317,81 @@ struct ggml_cgraph ggml_graph_view(struct ggml_cgraph * cgraph, int i0, int i1);
  GGML_API void * ggml_aligned_malloc(size_t size);
  GGML_API void ggml_aligned_free(void * ptr, size_t size);

- // FP16
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
-
- #define GGML_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
-
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
-     __fp16 tmp;
-     memcpy(&tmp, &h, sizeof(ggml_fp16_t));
-     return (float)tmp;
- }
-
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
-     ggml_fp16_t res;
-     __fp16 tmp = f;
-     memcpy(&res, &tmp, sizeof(ggml_fp16_t));
-     return res;
- }
-
- #elif defined(__F16C__)
-
- #ifdef _MSC_VER
- #define GGML_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
- #define GGML_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
- #else
- #define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
- #define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
- #endif
-
- #elif defined(__POWER9_VECTOR__)
-
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
- /* the inline asm below is about 12% faster than the lookup method */
- #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
- #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
-
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
-     float f;
-     double d;
-     __asm__(
-         "mtfprd %0,%2\n"
-         "xscvhpdp %0,%0\n"
-         "frsp %1,%0\n" :
-         /* temp */ "=d"(d),
-         /* out */  "=f"(f):
-         /* in */   "r"(h));
-     return f;
- }
-
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
-     double d;
-     ggml_fp16_t r;
-     __asm__( /* xscvdphp can work on double or single precision */
-         "xscvdphp %0,%2\n"
-         "mffprd %1,%0\n" :
-         /* temp */ "=d"(d),
-         /* out */  "=r"(r):
-         /* in */   "f"(f));
-     return r;
- }
-
- #elif defined(__riscv) && defined(__riscv_zfhmin)
-
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
-     float f;
-     __asm__(
-         "fmv.h.x %[f], %[h]\n\t"
-         "fcvt.s.h %[f], %[f]"
-         : [f] "=&f" (f)
-         : [h] "r" (h)
-     );
-     return f;
- }
-
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
-     ggml_fp16_t res;
-     __asm__(
-         "fcvt.h.s %[f], %[f]\n\t"
-         "fmv.x.h %[h], %[f]"
-         : [h] "=&r" (res)
-         : [f] "f" (f)
-     );
-     return res;
- }
-
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
- #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
- #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
-
- #else
-
- static inline float fp32_from_bits(uint32_t w) {
-     union {
-         uint32_t as_bits;
-         float as_value;
-     } fp32;
-     fp32.as_bits = w;
-     return fp32.as_value;
- }
-
- static inline uint32_t fp32_to_bits(float f) {
-     union {
-         float as_value;
-         uint32_t as_bits;
-     } fp32;
-     fp32.as_value = f;
-     return fp32.as_bits;
- }
-
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
-     const uint32_t w = (uint32_t) h << 16;
-     const uint32_t sign = w & UINT32_C(0x80000000);
-     const uint32_t two_w = w + w;
-
-     const uint32_t exp_offset = UINT32_C(0xE0) << 23;
- #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
-     const float exp_scale = 0x1.0p-112f;
- #else
-     const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
- #endif
-     const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;
-
-     const uint32_t magic_mask = UINT32_C(126) << 23;
-     const float magic_bias = 0.5f;
-     const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;
-
-     const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
-     const uint32_t result = sign |
-         (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
-     return fp32_from_bits(result);
- }
-
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
- #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
-     const float scale_to_inf = 0x1.0p+112f;
-     const float scale_to_zero = 0x1.0p-110f;
- #else
-     const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
-     const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
- #endif
-     float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
-
-     const uint32_t w = fp32_to_bits(f);
-     const uint32_t shl1_w = w + w;
-     const uint32_t sign = w & UINT32_C(0x80000000);
-     uint32_t bias = shl1_w & UINT32_C(0xFF000000);
-     if (bias < UINT32_C(0x71000000)) {
-         bias = UINT32_C(0x71000000);
-     }
-
-     base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
-     const uint32_t bits = fp32_to_bits(base);
-     const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
-     const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
-     const uint32_t nonsign = exp_bits + mantissa_bits;
-     return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
- }
-
- #endif
-
- // defined in ggml.c, initialized in ggml_init()
- GGML_API float ggml_table_f32_f16[1 << 16];
-
- // On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
- // so we define GGML_FP16_TO_FP32 and GGML_FP32_TO_FP16 elsewhere for NEON.
- // This is also true for POWER9.
- #if !defined(GGML_FP16_TO_FP32)
- inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
-     uint16_t s;
-     memcpy(&s, &f, sizeof(uint16_t));
-     return ggml_table_f32_f16[s];
- }
-
- #define GGML_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
- #endif
-
- #if !defined(GGML_FP32_TO_FP16)
- #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
- #endif
+ // FP16 <-> FP32
+ // ref: https://github.com/Maratyszcza/FP16
+
+ static inline float fp32_from_bits(uint32_t w) {
+     union {
+         uint32_t as_bits;
+         float as_value;
+     } fp32;
+     fp32.as_bits = w;
+     return fp32.as_value;
+ }
+
+ static inline uint32_t fp32_to_bits(float f) {
+     union {
+         float as_value;
+         uint32_t as_bits;
+     } fp32;
+     fp32.as_value = f;
+     return fp32.as_bits;
+ }
+
+ static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
+     const uint32_t w = (uint32_t) h << 16;
+     const uint32_t sign = w & UINT32_C(0x80000000);
+     const uint32_t two_w = w + w;
+
+     const uint32_t exp_offset = UINT32_C(0xE0) << 23;
+ #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
+     const float exp_scale = 0x1.0p-112f;
+ #else
+     const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
+ #endif
+     const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;
+
+     const uint32_t magic_mask = UINT32_C(126) << 23;
+     const float magic_bias = 0.5f;
+     const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;
+
+     const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
+     const uint32_t result = sign |
+         (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
+     return fp32_from_bits(result);
+ }
+
+ static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
+ #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
+     const float scale_to_inf = 0x1.0p+112f;
+     const float scale_to_zero = 0x1.0p-110f;
+ #else
+     const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
+     const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
+ #endif
+     float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
+
+     const uint32_t w = fp32_to_bits(f);
+     const uint32_t shl1_w = w + w;
+     const uint32_t sign = w & UINT32_C(0x80000000);
+     uint32_t bias = shl1_w & UINT32_C(0xFF000000);
+     if (bias < UINT32_C(0x71000000)) {
+         bias = UINT32_C(0x71000000);
+     }
+
+     base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
+     const uint32_t bits = fp32_to_bits(base);
+     const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
+     const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
+     const uint32_t nonsign = exp_bits + mantissa_bits;
+     return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
+ }
+
+ #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
+ #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
+
+ #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
+ #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)

  /**
   * Converts brain16 to float32.
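A note on the magic numbers in the retained conversion (this derivation follows the referenced FP16 repository; it is commentary, not part of the patch). fp16 stores exponents with bias 15 and fp32 with bias 127, so a normal half with biased exponent $E$ needs fp32 biased exponent $E + (127 - 15) = E + 112$. After `two_w >> 4` aligns the half's exponent field into the fp32 slot, adding `exp_offset = 0xE0 << 23` contributes $+224$ in the exponent, and the multiply by `exp_scale = 0x1.0p-112f` contributes $-112$:

$$ (E + 224) - 112 = E + 112 = E + (127 - 15). $$

Routing part of the rebias through a floating-point multiply, rather than pure integer adds, is what makes fp16 infinities and NaNs (biased exponent 31, which lands on fp32 exponent 255) survive the conversion for free. Subnormal halves take the `magic_mask` path instead: `(two_w >> 17) | (126 << 23)` constructs the float $\tfrac{1}{2} + m \cdot 2^{-24}$ from the 10 mantissa bits $m$, and subtracting `magic_bias = 0.5f` leaves

$$ m \cdot 2^{-24} = \frac{m}{2^{10}} \cdot 2^{-14}, $$

which is exactly the value an fp16 subnormal encodes.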
@@ -61,9 +61,6 @@
  #define m512i(p) (__m512i)(p)
  #endif

- // precomputed f32 table for f16 (256 KB) (ggml-impl.h)
- float ggml_table_f32_f16[1 << 16];
-
  #if defined(__linux__) || \
      defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
      (defined(__APPLE__) && !TARGET_OS_TV && !TARGET_OS_WATCH)

@@ -1422,14 +1419,6 @@ struct ggml_context * ggml_init(struct ggml_init_params params) {
      // initialize time system (required on Windows)
      ggml_time_init();

-     for (int i = 0; i < (1 << 16); ++i) {
-         union {
-             uint16_t u16;
-             ggml_fp16_t fp16;
-         } u = {i};
-         ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
-     }
-
      is_first_call = false;
  }
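With the 256 KB table and its init loop removed, every fp16 -> fp32 conversion is computed on the fly. A hedged standalone check, not part of the patch, that exhaustively compares the computed path against hardware conversion for all 65536 fp16 bit patterns; it assumes ggml_compute_fp16_to_fp32 from the hunk above is in scope, that ggml_fp16_t is a uint16_t bit pattern, and that the compiler provides _Float16:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        for (uint32_t i = 0; i < (1u << 16); ++i) {
            uint16_t bits = (uint16_t)i;

            float computed = ggml_compute_fp16_to_fp32(bits);

            _Float16 h;
            memcpy(&h, &bits, sizeof(h));
            float reference = (float)h;

            uint32_t a, b;
            memcpy(&a, &computed,  sizeof(a));
            memcpy(&b, &reference, sizeof(b));
            // NaN payloads may legitimately differ between the two paths, so
            // accept any NaN-for-NaN result and require bit equality otherwise
            if (a != b && !(isnan(computed) && isnan(reference))) {
                printf("mismatch at fp16 pattern 0x%04x\n", bits);
                return 1;
            }
        }
        printf("all 65536 fp16 patterns convert identically\n");
        return 0;
    }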