taronaeo and slaren committed
Commit fea8f94 · 1 Parent(s): e7b2e19

ggml-cpu: enable IBM NNPA Vector Intrinsics (llama/14317)

* ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

* ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
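
(For context, a minimal sketch of what this NNPA fp16->fp32 scalar path looks like. This is an assumption-laden illustration, not the committed code: it presumes the s390x `<vecintrin.h>` NNPA intrinsics `vec_convert_from_fp16` and `vec_extend_to_fp32_hi`, and the `uint16x8_t`/`float32x4_t` vector typedefs that appear later in this series.)

```c
#include <vecintrin.h>  // s390x vector intrinsics; build with -mvx -mzvector

typedef __vector unsigned short uint16x8_t;   // assumed typedefs, mirroring the
typedef __vector float          float32x4_t;  // ones added later in this series

// Sketch: one IEEE fp16 -> fp32 conversion through NNPA's DLFloat16 format.
static inline float fp16_to_fp32_nnpa(unsigned short h) {
    uint16x8_t  v_h  = vec_splats(h);                   // broadcast the raw fp16 bits
    uint16x8_t  v_hd = vec_convert_from_fp16(v_h, 0);   // IEEE fp16 -> DLFloat16
    float32x4_t v_f  = vec_extend_to_fp32_hi(v_hd, 0);  // DLFloat16 -> fp32 (high lanes)
    return v_f[0];                                      // extract lane 0
}
```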

* ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

* ggml-cpu: better variable names

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

* docs: update s390x docs

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

* ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: remove sigint from fp16 store

For some reason, the function is not getting a hit when debugged with
gdb. We will need to investigate further.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo <[email protected]>
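
(Roughly, the 8-element loop this converges on looks like the sketch below. It assumes the scalar helper sketched earlier plus the `vec_xl`/`vec_xst` unaligned vector load/store intrinsics; the hi/lo lane ordering is an assumption.)

```c
// Sketch: convert n fp16 values to fp32, 8 lanes at a time, storing results
// with vec_xst instead of per-element writes. Remaining elements fall back
// to the scalar path.
static void fp16_to_fp32_row_nnpa(const unsigned short * x, float * y, int n) {
    int i = 0;
    for (; i + 7 < n; i += 8) {
        uint16x8_t  v_x  = vec_xl(0, x + i);               // load 8 fp16 values
        uint16x8_t  v_xd = vec_convert_from_fp16(v_x, 0);  // -> DLFloat16
        float32x4_t v_hi = vec_extend_to_fp32_hi(v_xd, 0); // lanes 0..3 (assumed order)
        float32x4_t v_lo = vec_extend_to_fp32_lo(v_xd, 0); // lanes 4..7
        vec_xst(v_hi, 0, y + i + 0);
        vec_xst(v_lo, 0, y + i + 4);
    }
    for (; i < n; ++i) {
        y[i] = fp16_to_fp32_nnpa(x[i]);  // scalar tail (see earlier sketch)
    }
}
```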

* ggml-cpu: rework noop

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: disable fp32->fp16 nnpa conversions for now

There are some conversion failures in NNPA that require the eyes of an
IBM STSM. Will create a separate PR to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo <[email protected]>
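
(The reverse direction pairs a round-to-DLFloat16 step with a DLFloat16-to-fp16 convert. A hedged sketch, assuming `vec_round_from_fp32` and `vec_convert_to_fp16` are the corresponding NNPA intrinsics and reusing the typedefs above:)

```c
// Sketch: one fp32 -> IEEE fp16 conversion through DLFloat16.
static inline unsigned short fp32_to_fp16_nnpa(float f) {
    float32x4_t v_f    = vec_splats(f);
    float32x4_t v_zero = vec_splats(0.0f);                    // unused second half
    uint16x8_t  v_fd   = vec_round_from_fp32(v_f, v_zero, 0); // 2x fp32 -> DLFloat16
    uint16x8_t  v_h    = vec_convert_to_fp16(v_fd, 0);        // DLFloat16 -> fp16
    return v_h[0];
}
```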

* ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add missing __func__

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: diagnose why __NNPA__ macro is not being defined

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: update macro tests

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: test more macros

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add debug prints

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 157f856c34589566151630e294563a420702db39)

* ggml-cpu: move things around

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <[email protected]>

* docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo <[email protected]>

* ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo <[email protected]>
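
(A sketch of the resulting naming scheme, with the ggml-base macros kept as the scalar fallback; the real definitions live in ggml-cpu/simd-mappings.h, and the NNPA helpers here are the hypothetical ones sketched earlier:)

```c
// Sketch: CPU-backend conversion macros prefixed with GGML_CPU_, dispatching
// to the NNPA helpers when available and to the generic ggml-base
// conversions otherwise.
#if defined(__NNPA__)
    #define GGML_CPU_FP16_TO_FP32(x) fp16_to_fp32_nnpa(x)  // sketched earlier
    #define GGML_CPU_FP32_TO_FP16(x) fp32_to_fp16_nnpa(x)
#else
    #define GGML_CPU_FP16_TO_FP32(x) GGML_FP16_TO_FP32(x)  // ggml-base fallback
    #define GGML_CPU_FP32_TO_FP16(x) GGML_FP32_TO_FP16(x)
#endif
```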

* ggml-cpu: remove mistaken fallback macro

Fallback logic was already implemented, but I was too sleepy to realise it.

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo <[email protected]>
(cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

* ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo <[email protected]>
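
(For reference, the table being moved here is the usual 64K-entry fp16->fp32 lookup; a minimal sketch of the pattern, with `lookup_fp16_to_fp32` a hypothetical name for illustration:)

```c
#include <stdint.h>

// Sketch: a 65536-entry fp32 table indexed by the raw fp16 bit pattern, so a
// conversion becomes a single load. Defined once in ggml-cpu.c and declared
// extern "C" for C++ translation units.
extern float ggml_table_f32_f16[1 << 16];

static inline float lookup_fp16_to_fp32(uint16_t h) {
    return ggml_table_f32_f16[h];  // h is the fp16 bit pattern
}
```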

* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

We rely on the variable declaration in ggml-cpu.c instead.

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo <[email protected]>

* ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo <[email protected]>

* Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo <[email protected]>

* fix ggml time initialization

* fix f32_f16 table init

* remove extra line

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: slaren <[email protected]>

ggml/CMakeLists.txt CHANGED
@@ -131,6 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
 option(GGML_RV_ZFH       "ggml: enable riscv zfh"       OFF)
 option(GGML_XTHEADVECTOR "ggml: enable xtheadvector"    OFF)
 option(GGML_VXE          "ggml: enable vxe"             ON)
+option(GGML_NNPA         "ggml: enable nnpa"            ON)
 
 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
ggml/include/ggml-cpu.h CHANGED
@@ -101,6 +101,7 @@ extern "C" {
     GGML_BACKEND_API int ggml_cpu_has_riscv_v   (void);
     GGML_BACKEND_API int ggml_cpu_has_vsx       (void);
     GGML_BACKEND_API int ggml_cpu_has_vxe       (void);
+    GGML_BACKEND_API int ggml_cpu_has_nnpa      (void);
     GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
     GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
 
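(A minimal usage sketch of the new feature query, mirroring how the other `ggml_cpu_has_*` helpers are typically checked:)

```c
#include <stdio.h>
#include "ggml-cpu.h"

int main(void) {
    // Reports 1 when the CPU backend was built with GGML_NNPA enabled
    // on hardware that supports it, 0 otherwise.
    printf("NNPA = %d\n", ggml_cpu_has_nnpa());
    return 0;
}
```
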
ggml/src/ggml-cpu/CMakeLists.txt CHANGED
@@ -448,6 +448,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
 
         # TODO: Separation to determine activation of VX/VXE/VXE2
         if (${S390X_M} MATCHES "8561|8562")
+            set(GGML_NNPA OFF)
             message(STATUS "z15 target")
             list(APPEND ARCH_FLAGS -march=z15)
         elseif (${S390X_M} MATCHES "3931")
@@ -464,7 +465,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         endif()
 
         if (GGML_VXE)
+            message(STATUS "VX/VXE/VXE2 enabled")
             list(APPEND ARCH_FLAGS -mvx -mzvector)
+            list(APPEND ARCH_DEFINITIONS GGML_VXE)
+        endif()
+
+        if (GGML_NNPA)
+            message(STATUS "NNPA enabled")
+            list(APPEND ARCH_DEFINITIONS GGML_NNPA)
         endif()
     elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
         message(STATUS "Wasm detected")
ggml/src/ggml-cpu/amx/mmq.cpp CHANGED
@@ -8,6 +8,7 @@
 #include "mmq.h"
 #include "ggml-impl.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "quants.h"
 #include "ggml-quants.h"
 #include <algorithm>
@@ -453,7 +454,7 @@ void quantize_row_q8_K_vnni(const float * RESTRICT x, void * RESTRICT vy, int64_
 
         // Quantize these floats
         const float iscale = 127.f / amax;
-        y[i].d = GGML_FP32_TO_FP16(1 / iscale);
+        y[i].d = GGML_CPU_FP32_TO_FP16(1 / iscale);
         const float id = ( amax != 0.0f ) ? iscale : 0.f;
         const __m512 vscale = _mm512_set1_ps(id);
 
@@ -1090,7 +1091,7 @@ struct acc_C<block_q8_0, block_q4_0, is_acc> {
         const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));
 
         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));
 
             __m512 vsum;
@@ -1113,8 +1114,8 @@ struct acc_C<block_q8_1, block_q4_1, is_acc> {
         const __m512 vm0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset + TILE_N * sizeof(ggml_half))));
 
         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
-            const __m512 vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].s));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].s));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));
 
             __m512 vsum;
@@ -1137,7 +1138,7 @@ struct acc_C<block_q8_0, block_q8_0, is_acc> {
         const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));
 
         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));
 
             __m512 vsum;
@@ -1437,7 +1438,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q4_0, float, BLOCK_M, BLOCK_N, BLO
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
                 vcomp = _mm512_dpbusd_epi32(vcomp, off, va[k]);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
         }
 
         // load b
@@ -1498,8 +1499,8 @@ struct tinygemm_kernel_vnni<block_q8_1, block_q4_1, float, 1, BLOCK_N, BLOCK_K>
             for (int k = 0; k < 8; ++k) {
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
-            vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].s));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
+            vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].s));
         }
 
         // load b
@@ -1571,7 +1572,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q8_0, float, BLOCK_M, BLOCK_N, BLO
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
                 va[k] = _mm512_add_epi8(va[k], off);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
         }
 
         // load b
ggml/src/ggml-cpu/arch/arm/quants.c CHANGED
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+#include "simd-mappings.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -62,7 +63,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float d = amax / ((1 << 7) - 1);
         const float id = d ? 1.0f/d : 0.0f;
 
-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
         for (int j = 0; j < 8; j++) {
             const float32x4_t v = vmulq_n_f32(srcv[j], id);
@@ -104,7 +105,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
         const float d = amax / ((1 << 7) - 1);
         const float id = d ? 1.0f/d : 0.0f;
 
-        y[i].d = GGML_FP32_TO_FP16(d);
+        y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
         int32x4_t accv = vdupq_n_s32(0);
 
@@ -120,7 +121,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
             accv = vaddq_s32(accv, vi);
         }
 
-        y[i].s = GGML_FP32_TO_FP16(d * vaddvq_s32(accv));
+        y[i].s = GGML_CPU_FP32_TO_FP16(d * vaddvq_s32(accv));
     }
 #else
     GGML_UNUSED(nb);
@@ -194,10 +195,10 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
 
             float32_t _scale[4] = {
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y1->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y1->d)
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
             };
             float32x4_t scale = vld1q_f32(_scale);
 
@@ -274,10 +275,10 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
                 // dot product
                 sumv0 = svmla_n_f32_x(ph4, sumv0, svcvt_f32_s32_x(ph4, svadd_x(ph4,
                                 svdot_s32(svdup_n_s32(0), qx0ls, qy0l),
-                                svdot_s32(svdup_n_s32(0), qx0hs, qy0h))), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                                svdot_s32(svdup_n_s32(0), qx0hs, qy0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
                 sumv1 = svmla_n_f32_x(ph4, sumv1, svcvt_f32_s32_x(ph4, svadd_x(ph4,
                                 svdot_s32(svdup_n_s32(0), qx1ls, qy1l),
-                                svdot_s32(svdup_n_s32(0), qx1hs, qy1h))), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                                svdot_s32(svdup_n_s32(0), qx1hs, qy1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
             }
 
             sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
@@ -313,9 +314,9 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
                 // dot product
                 sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
-                                svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                                svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
                 sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
-                                svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                                svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
             }
 
             sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
@@ -354,9 +355,9 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
                 // dot product
                 sumv0 = svmla_n_f32_x(ph32, sumv0, svcvt_f32_s32_x(ph32,
-                                svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                                svdot_s32(svdup_n_s32(0), qx0s, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
                 sumv1 = svmla_n_f32_x(ph32, sumv1, svcvt_f32_s32_x(ph32,
-                                svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                                svdot_s32(svdup_n_s32(0), qx1s, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
             }
 
             sumf = svaddv_f32(ph32, svadd_f32_x(ph32, sumv0, sumv1));
@@ -404,8 +405,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0ls, v1_0l), v0_0hs, v1_0h);
         const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1ls, v1_1l), v0_1hs, v1_1h);
 
-        sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
-        sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+        sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
+        sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
     }
 
     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
@@ -423,7 +424,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+        sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
     }
 
     *s = sumf;
@@ -464,10 +465,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
             const block_q8_1 * GGML_RESTRICT b_y1 = &vy1[i];
 
             float32_t summs_t[4] = {
-                GGML_FP16_TO_FP32(b_x0->m) * GGML_FP16_TO_FP32(b_y0->s),
-                GGML_FP16_TO_FP32(b_x1->m) * GGML_FP16_TO_FP32(b_y0->s),
-                GGML_FP16_TO_FP32(b_x0->m) * GGML_FP16_TO_FP32(b_y1->s),
-                GGML_FP16_TO_FP32(b_x1->m) * GGML_FP16_TO_FP32(b_y1->s)
+                GGML_CPU_FP16_TO_FP32(b_x0->m) * GGML_CPU_FP16_TO_FP32(b_y0->s),
+                GGML_CPU_FP16_TO_FP32(b_x1->m) * GGML_CPU_FP16_TO_FP32(b_y0->s),
+                GGML_CPU_FP16_TO_FP32(b_x0->m) * GGML_CPU_FP16_TO_FP32(b_y1->s),
+                GGML_CPU_FP16_TO_FP32(b_x1->m) * GGML_CPU_FP16_TO_FP32(b_y1->s)
            };
            summs0 = vaddq_f32(summs0, vld1q_f32(summs_t));
 
@@ -490,10 +491,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
             // mmla into int32x4_t
             float32_t _scale[4] = {
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y1->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y1->d)
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
            };
            float32x4_t scale = vld1q_f32(_scale);
 
@@ -539,7 +540,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const block_q8_1 * GGML_RESTRICT y0 = &y[ib + 0];
         const block_q8_1 * GGML_RESTRICT y1 = &y[ib + 1];
 
-        summs += GGML_FP16_TO_FP32(x0->m) * GGML_FP16_TO_FP32(y0->s) + GGML_FP16_TO_FP32(x1->m) * GGML_FP16_TO_FP32(y1->s);
+        summs += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s) + GGML_CPU_FP16_TO_FP32(x1->m) * GGML_CPU_FP16_TO_FP32(y1->s);
 
         const uint8x16_t m4b = vdupq_n_u8(0x0F);
 
@@ -562,8 +563,8 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const int32x4_t p_0 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_0l, v1_0l), v0_0h, v1_0h);
         const int32x4_t p_1 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), v0_1l, v1_1l), v0_1h, v1_1h);
 
-        sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
-        sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+        sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(p_0), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
+        sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(p_1), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
     }
 
     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs;
@@ -582,7 +583,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }
 
     *s = sumf;
@@ -666,10 +667,10 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
         sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
-                        ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
         sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
-                        ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
     }
 
     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
@@ -694,7 +695,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
     }
 
     *s = sumf;
@@ -739,8 +740,8 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
         const uint8x16_t m4b = vdupq_n_u8(0x0F);
 
-        summs0 += GGML_FP16_TO_FP32(x0->m) * GGML_FP16_TO_FP32(y0->s);
-        summs1 += GGML_FP16_TO_FP32(x1->m) * GGML_FP16_TO_FP32(y1->s);
+        summs0 += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);
+        summs1 += GGML_CPU_FP16_TO_FP32(x1->m) * GGML_CPU_FP16_TO_FP32(y1->s);
 
         // extract the 5th bit via lookup table ((b) << 4)
         memcpy(&qh0, x0->qh, sizeof(qh0));
@@ -784,10 +785,10 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
         sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), v0_0lf, v1_0l),
-                        ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), v0_0hf, v1_0h))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
         sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), v0_1lf, v1_1l),
-                        ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), v0_1hf, v1_1h))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
     }
 
     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1) + summs0 + summs1;
@@ -812,7 +813,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
         }
 
         int sumi = sumi0 + sumi1;
-        sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+        sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
     }
 
     *s = sumf;
@@ -864,10 +865,10 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             const int8x16_t y1_h = vld1q_s8(b_y1->qs + 16);
 
             float32_t _scale[4] = {
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x0->d)*GGML_FP16_TO_FP32(b_y1->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y0->d),
-                GGML_FP16_TO_FP32(b_x1->d)*GGML_FP16_TO_FP32(b_y1->d)
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x0->d)*GGML_CPU_FP16_TO_FP32(b_y1->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y0->d),
+                GGML_CPU_FP16_TO_FP32(b_x1->d)*GGML_CPU_FP16_TO_FP32(b_y1->d)
            };
            float32x4_t scale = vld1q_f32(_scale);
 
@@ -934,10 +935,10 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
                 sumv0 = svmla_n_f32_x(pl16, sumv0, svcvt_f32_s32_x(pl16, svadd_x(pl16,
                                 svdot_s32(svdup_n_s32(0), qx0_0, qy0_0),
-                                svdot_s32(svdup_n_s32(0), qx0_1, qy0_1))), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                                svdot_s32(svdup_n_s32(0), qx0_1, qy0_1))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
                 sumv1 = svmla_n_f32_x(pl16, sumv1, svcvt_f32_s32_x(pl16, svadd_x(pl16,
                                 svdot_s32(svdup_n_s32(0), qx1_0, qy1_0),
-                                svdot_s32(svdup_n_s32(0), qx1_1, qy1_1))), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                                svdot_s32(svdup_n_s32(0), qx1_1, qy1_1))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
             }
 
             sumf = svaddv_f32(pl16, svadd_f32_x(pl16, sumv0, sumv1));
@@ -960,9 +961,9 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
                 const svint8_t qy1 = svld1_s8(svptrue_b8(), y1->qs);
 
                 sumv0 = svmla_n_f32_x(svptrue_b32(), sumv0, svcvt_f32_s32_x(svptrue_b32(),
-                                svdot_s32(svdup_n_s32(0), qx0, qy0)), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                                svdot_s32(svdup_n_s32(0), qx0, qy0)), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
                 sumv1 = svmla_n_f32_x(svptrue_b32(), sumv1, svcvt_f32_s32_x(svptrue_b32(),
-                                svdot_s32(svdup_n_s32(0), qx1, qy1)), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                                svdot_s32(svdup_n_s32(0), qx1, qy1)), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
             }
 
             sumf = svaddv_f32(svptrue_b32(), svadd_f32_x(svptrue_b32(), sumv0, sumv1));
@@ -1002,8 +1003,8 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
                 qy_64 = svadd_s8_x(svptrue_b8(), qy_32, qy_64);
 
                 // scale creation
-                const float32_t deq1 = GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d);
-                const float32_t deq2 = GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d);
+                const float32_t deq1 = GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d);
+                const float32_t deq2 = GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d);
 
                 // duplicate deq1 in first half of vector and deq2 in second half of vector
                 const svfloat32_t temp = svdup_f32_m(svdup_f32_z(ph8, deq1), pl8, deq2);
@@ -1043,11 +1044,11 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
         sumv0 = vmlaq_n_f32(sumv0, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), x0_0, y0_0),
-                        ggml_vdotq_s32(vdupq_n_s32(0), x0_1, y0_1))), GGML_FP16_TO_FP32(x0->d)*GGML_FP16_TO_FP32(y0->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), x0_1, y0_1))), GGML_CPU_FP16_TO_FP32(x0->d)*GGML_CPU_FP16_TO_FP32(y0->d));
 
         sumv1 = vmlaq_n_f32(sumv1, vcvtq_f32_s32(vaddq_s32(
                         ggml_vdotq_s32(vdupq_n_s32(0), x1_0, y1_0),
-                        ggml_vdotq_s32(vdupq_n_s32(0), x1_1, y1_1))), GGML_FP16_TO_FP32(x1->d)*GGML_FP16_TO_FP32(y1->d));
+                        ggml_vdotq_s32(vdupq_n_s32(0), x1_1, y1_1))), GGML_CPU_FP16_TO_FP32(x1->d)*GGML_CPU_FP16_TO_FP32(y1->d));
     }
 
     sumf = vaddvq_f32(sumv0) + vaddvq_f32(sumv1);
@@ -1059,7 +1060,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
             sumi += x[ib].qs[j]*y[ib].qs[j];
         }
 
-        sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+        sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
     }
 
     *s = sumf;
@@ -1217,7 +1218,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
         const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
         const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
 
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 #if defined(__ARM_FEATURE_DOTPROD)
         sumi0 = vaddq_s32(sumi0, sumi1);
@@ -1269,7 +1270,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             }
         }
 
-        sumf += (float) sum * (GGML_FP16_TO_FP32(x[i].d) * y[i].d);
+        sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
     }
 
     *s = sumf;
@@ -1362,7 +1363,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
         const int16x8_t ysum0 = vld1q_s16(y[i].bsums);
         const int16x8_t ysum1 = vld1q_s16(y[i].bsums + 8);
 
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 #if defined(__ARM_FEATURE_DOTPROD)
         sumi0 = vaddq_s32(sumi0, sumi1);
@@ -1393,7 +1394,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             }
         }
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
         sumf += (float) sumi * d;
     }
@@ -1425,9 +1426,9 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     switch (vector_length) {
     case 128:
         for (int i = 0; i < nb; ++i) {
-            const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+            const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
             svfloat32_t d_broad = svdup_n_f32((float32_t)d);
-            const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+            const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
             svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
 
             const uint8_t * GGML_RESTRICT q2 = x[i].qs;
@@ -1570,9 +1571,9 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     case 256:
     case 512:
         for (int i = 0; i < nb; ++i) {
-            const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+            const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
             svfloat32_t d_broad = svdup_n_f32((float32_t)d);
-            const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+            const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
             svfloat32_t dmin_broad = svdup_n_f32((float32_t)dmin);
 
             const uint8_t * GGML_RESTRICT q2 = x[i].qs;
@@ -1671,8 +1672,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     float sum = 0;
 
     for (int i = 0; i < nb; ++i) {
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         const uint8_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1742,8 +1743,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             summs += y[i].bsums[j] * (sc[j] >> 4);
         }
 
-        const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         int isum = 0;
         int is = 0;
@@ -1805,7 +1806,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
     for (int i = 0; i < nb; ++i) {
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
         const uint8_t * GGML_RESTRICT q3_sv = x[i].qs;
         const uint8_t * GGML_RESTRICT qh_sv = x[i].hmask;
@@ -1981,7 +1982,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
     for (int i = 0; i < nb; ++i) {
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].hmask;
@@ -2112,7 +2113,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2258,18 +2259,18 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
         bias[3] = vaddvq_s32(vaddq_s32(vmull_s16(vget_low_s16(y1_sums), vget_low_s16(x1_mins)),
                              vmull_s16(vget_high_s16(y1_sums), vget_high_s16(x1_mins))));
         const float32x4_t dmins = {
-            GGML_FP16_TO_FP32(x0->dmin) * y0->d,
-            GGML_FP16_TO_FP32(x0->dmin) * y1->d,
-            GGML_FP16_TO_FP32(x1->dmin) * y0->d,
-            GGML_FP16_TO_FP32(x1->dmin) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x0->dmin) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x0->dmin) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x1->dmin) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x1->dmin) * y1->d,
        };
        vfsum = vmlsq_f32(vfsum, vcvtq_f32_s32(vld1q_s32(bias)), dmins);
 
        const float32x4_t superblock_scale = {
-            GGML_FP16_TO_FP32(x0->d) * y0->d,
-            GGML_FP16_TO_FP32(x0->d) * y1->d,
-            GGML_FP16_TO_FP32(x1->d) * y0->d,
-            GGML_FP16_TO_FP32(x1->d) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x0->d) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x0->d) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x1->d) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x1->d) * y1->d,
        };
        vfsum = vmlaq_f32(vfsum, vcvtq_f32_s32(visum), superblock_scale);
     }
@@ -2289,8 +2290,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
 
@@ -2377,8 +2378,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
     for (int i = 0; i < nb; ++i) {
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
 
@@ -2478,9 +2479,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2520,8 +2521,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
     for (int i = 0; i < nb; ++i) {
 
-        const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
-        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+        const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
         const int16x8_t q8sums = vpaddq_s16(vld1q_s16(y[i].bsums), vld1q_s16(y[i].bsums + 8));
 
@@ -2630,9 +2631,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
            for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
            q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
-        const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+        const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
         sumf -= dmin * sumi;
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2827,10 +2828,10 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
         const int32x4_t vibias = vmulq_n_s32(vld1q_s32(bias), 32);
 
         const float32x4_t superblock_scale = {
-            GGML_FP16_TO_FP32(x0->d) * y0->d,
-            GGML_FP16_TO_FP32(x0->d) * y1->d,
-            GGML_FP16_TO_FP32(x1->d) * y0->d,
-            GGML_FP16_TO_FP32(x1->d) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x0->d) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x0->d) * y1->d,
+            GGML_CPU_FP16_TO_FP32(x1->d) * y0->d,
+            GGML_CPU_FP16_TO_FP32(x1->d) * y1->d,
        };
 
        visum = vsubq_s32(visum, vibias);
@@ -2858,7 +2859,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
     svuint8_t q6h_1, q6h_2, q6h_3, q6h_4;
 
     for (int i = 0; i < nb; ++i) {
-        const float d_all = GGML_FP16_TO_FP32(x[i].d);
+        const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);
 
         const uint8_t * GGML_RESTRICT q6 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -3011,7 +3012,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
     for (int i = 0; i < nb; ++i) {
 
-        const float d_all = GGML_FP16_TO_FP32(x[i].d);
+        const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);
 
         const uint8_t * GGML_RESTRICT q6 = x[i].ql;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -3128,7 +3129,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
             for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
             q8 += 8; a += 8;
         }
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
     }
     for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -3199,7 +3200,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         float sumf1 = 0, sumf2 = 0;
@@ -3234,7 +3235,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
     float sumf = 0.f;
    for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         int32_t bsum = 0;
@@ -3284,7 +3285,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
         const uint8x8_t scales8 = vld1_u8(x[i].scales);
@@ -3329,7 +3330,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint16_t * GGML_RESTRICT q2 = x[i].qs;
         const uint8_t * GGML_RESTRICT sc = x[i].scales;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3398,7 +3399,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
 
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -3458,7 +3459,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
     float sumf = 0;
     for (int i = 0; i < nb; i++) {
 
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const int8_t * q8 = y[i].qs;
         const uint8_t * qs = x[i].qs;
         const uint8_t * qh = x[i].qh;
@@ -3521,7 +3522,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3557,7 +3558,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT q3 = x[i].qs;
         const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
         const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3630,7 +3631,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
     float sumf = 0;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -3691,7 +3692,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
     float sumf = 0.f;
     for (int i = 0; i < nb; ++i) {
-        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+        const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
         const uint8_t * GGML_RESTRICT qs = x[i].qs;
         const uint8_t * GGML_RESTRICT qh = x[i].qh;
         const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -3786,7 +3787,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
        }
 
-        sumf += y[i].d * GGML_FP16_TO_FP32(x[i].d) * (sumi1 + sumi2 + IQ1S_DELTA * sumi3);
+        sumf += y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d) * (sumi1 + sumi2 + IQ1S_DELTA * sumi3);
     }
 
     *s = sumf;
@@ -3817,7 +3818,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qs += 4;
        }
 
-        sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
+        sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
     }
 
     *s = sumf;
@@ -3905,7 +3906,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
        }
 
-        sumf += y[i].d * GGML_FP16_TO_FP32(scale.f16) * (vaddvq_s32(sumi1) + IQ1M_DELTA * vaddvq_s32(sumi2));
+        sumf += y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16) * (vaddvq_s32(sumi1) + IQ1M_DELTA * vaddvq_s32(sumi2));
     }
 
     *s = sumf;
@@ -3952,7 +3953,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
             qh += 2;
        }
 
-        sumf += GGML_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
+        sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
     }
 
     *s = sumf;
@@ -4003,13 +4004,13 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
         prod_2 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), q4b.val[2], q8b.val[2]), q4b.val[3], q8b.val[3]);
 
         sumf +=
-            GGML_FP16_TO_FP32(x[ib+0].d) * GGML_FP16_TO_FP32(y[ib + 0].d) * vaddvq_s32(prod_1) +
-            GGML_FP16_TO_FP32(x[ib+1].d) * GGML_FP16_TO_FP32(y[ib + 1].d) * vaddvq_s32(prod_2);
+            GGML_CPU_FP16_TO_FP32(x[ib+0].d) * GGML_CPU_FP16_TO_FP32(y[ib + 0].d) * vaddvq_s32(prod_1) +
+            GGML_CPU_FP16_TO_FP32(x[ib+1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) * vaddvq_s32(prod_2);
     }
 
 #endif
     for (; ib < nb; ++ib) {
-        const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
+        const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
         int sumi1 = 0, sumi2 = 0;
         for (int j = 0; j < QK4_NL/2; ++j) {
             sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -4071,7 +4072,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
        }
 
-        sumf += GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
+        sumf += GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
     }
 
     *s = sumf;
@@ -4079,7 +4080,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 #else
     float sumf = 0;
     for (int ibl = 0; ibl < nb; ++ibl) {
-        const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
+        const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
         uint16_t h = x[ibl].scales_h;
         const uint8_t * qs = x[ibl].qs;
         const int8_t * q8 = y[ibl].qs;
 
3129
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
3130
  q8 += 8; a += 8;
3131
  }
3132
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3133
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
3134
  }
3135
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
3200
 
3201
  float sumf = 0;
3202
  for (int i = 0; i < nb; ++i) {
3203
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3204
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3205
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
3206
  float sumf1 = 0, sumf2 = 0;
 
3235
 
3236
  float sumf = 0.f;
3237
  for (int i = 0; i < nb; ++i) {
3238
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3239
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3240
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
3241
  int32_t bsum = 0;
 
3285
 
3286
  float sumf = 0;
3287
  for (int i = 0; i < nb; ++i) {
3288
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3289
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3290
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
3291
  const uint8x8_t scales8 = vld1_u8(x[i].scales);
 
3330
 
3331
  float sumf = 0.f;
3332
  for (int i = 0; i < nb; ++i) {
3333
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3334
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
3335
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
3336
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
3399
  float sumf = 0;
3400
  for (int i = 0; i < nb; ++i) {
3401
 
3402
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3403
 
3404
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
3405
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
 
3459
  float sumf = 0;
3460
  for (int i = 0; i < nb; i++) {
3461
 
3462
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3463
  const int8_t * q8 = y[i].qs;
3464
  const uint8_t * qs = x[i].qs;
3465
  const uint8_t * qh = x[i].qh;
 
3522
 
3523
  float sumf = 0;
3524
  for (int i = 0; i < nb; ++i) {
3525
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3526
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
3527
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
3528
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
3558
 
3559
  float sumf = 0.f;
3560
  for (int i = 0; i < nb; ++i) {
3561
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3562
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
3563
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
3564
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
3631
 
3632
  float sumf = 0;
3633
  for (int i = 0; i < nb; ++i) {
3634
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3635
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
3636
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
3637
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
 
3692
 
3693
  float sumf = 0.f;
3694
  for (int i = 0; i < nb; ++i) {
3695
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
3696
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
3697
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
3698
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
 
3787
 
3788
  }
3789
 
3790
+ sumf += y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d) * (sumi1 + sumi2 + IQ1S_DELTA * sumi3);
3791
  }
3792
 
3793
  *s = sumf;
 
3818
  qs += 4;
3819
  }
3820
 
3821
+ sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
3822
  }
3823
 
3824
  *s = sumf;
 
3906
 
3907
  }
3908
 
3909
+ sumf += y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16) * (vaddvq_s32(sumi1) + IQ1M_DELTA * vaddvq_s32(sumi2));
3910
  }
3911
 
3912
  *s = sumf;
 
3953
  qh += 2;
3954
  }
3955
 
3956
+ sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
3957
  }
3958
 
3959
  *s = sumf;
 
4004
  prod_2 = ggml_vdotq_s32(ggml_vdotq_s32(vdupq_n_s32(0), q4b.val[2], q8b.val[2]), q4b.val[3], q8b.val[3]);
4005
 
4006
  sumf +=
4007
+ GGML_CPU_FP16_TO_FP32(x[ib+0].d) * GGML_CPU_FP16_TO_FP32(y[ib + 0].d) * vaddvq_s32(prod_1) +
4008
+ GGML_CPU_FP16_TO_FP32(x[ib+1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) * vaddvq_s32(prod_2);
4009
  }
4010
 
4011
  #endif
4012
  for (; ib < nb; ++ib) {
4013
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
4014
  int sumi1 = 0, sumi2 = 0;
4015
  for (int j = 0; j < QK4_NL/2; ++j) {
4016
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
 
4072
 
4073
  }
4074
 
4075
+ sumf += GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
4076
  }
4077
 
4078
  *s = sumf;
 
4080
  #else
4081
  float sumf = 0;
4082
  for (int ibl = 0; ibl < nb; ++ibl) {
4083
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
4084
  uint16_t h = x[ibl].scales_h;
4085
  const uint8_t * qs = x[ibl].qs;
4086
  const int8_t * q8 = y[ibl].qs;
ggml/src/ggml-cpu/arch/arm/repack.cpp CHANGED
@@ -6,6 +6,7 @@
6
  #include "ggml-impl.h"
7
  #include "ggml-cpu.h"
8
  #include "ggml-cpu-impl.h"
 
9
  #include "traits.h"
10
 
11
  #include <cmath>
@@ -51,7 +52,7 @@ void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTR
51
  const float d = amax / ((1 << 7) - 1);
52
  id[row_iter] = d ? 1.0f / d : 0.0f;
53
 
54
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
55
  }
56
 
57
  for (int j = 0; j < 8; j++) {
@@ -102,7 +103,7 @@ void ggml_quantize_mat_q8_0_4x4(const float * GGML_RESTRICT x, void * GGML_RESTR
102
  const float d = amax / ((1 << 7) - 1);
103
  id[row_iter] = d ? 1.0f / d : 0.0f;
104
 
105
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
106
  }
107
 
108
  for (int j = 0; j < QK8_0 * 4; j++) {
@@ -145,7 +146,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
145
  const float d = amax / ((1 << 7) - 1);
146
  id[row_iter] = d ? 1.0f / d : 0.0f;
147
 
148
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
149
  }
150
 
151
  for (int j = 0; j < 4; j++) {
@@ -221,7 +222,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
221
  const float d = amax / ((1 << 7) - 1);
222
  id[row_iter] = d ? 1.0f / d : 0.0f;
223
 
224
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
225
  }
226
 
227
  for (int j = 0; j < QK8_0 * 4; j++) {
@@ -311,7 +312,7 @@ void ggml_gemv_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
311
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
312
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
313
  }
314
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
315
  }
316
  }
317
  }
@@ -399,7 +400,7 @@ void ggml_gemv_q4_0_4x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
399
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
400
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
401
  }
402
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
403
  }
404
  }
405
  }
@@ -514,7 +515,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
514
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
515
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
516
  }
517
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
518
  }
519
  }
520
  }
@@ -608,7 +609,7 @@ void ggml_gemv_iq4_nl_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
608
  const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
609
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
610
  }
611
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
612
  }
613
  }
614
  }
@@ -1117,7 +1118,7 @@ void ggml_gemm_q4_0_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
1117
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
1118
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
1119
  }
1120
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
1121
  }
1122
  }
1123
  }
@@ -1570,7 +1571,7 @@ void ggml_gemm_q4_0_4x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
1570
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
1571
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
1572
  }
1573
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
1574
  }
1575
  }
1576
  }
@@ -2039,7 +2040,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
2039
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
2040
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
2041
  }
2042
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
2043
  }
2044
  }
2045
  }
@@ -2147,7 +2148,7 @@ void ggml_gemm_iq4_nl_4x4_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const
2147
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
2148
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
2149
  }
2150
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
2151
  }
2152
  }
2153
  }
 
6
  #include "ggml-impl.h"
7
  #include "ggml-cpu.h"
8
  #include "ggml-cpu-impl.h"
9
+ #include "simd-mappings.h"
10
  #include "traits.h"
11
 
12
  #include <cmath>
 
52
  const float d = amax / ((1 << 7) - 1);
53
  id[row_iter] = d ? 1.0f / d : 0.0f;
54
 
55
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
56
  }
57
 
58
  for (int j = 0; j < 8; j++) {
 
103
  const float d = amax / ((1 << 7) - 1);
104
  id[row_iter] = d ? 1.0f / d : 0.0f;
105
 
106
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
107
  }
108
 
109
  for (int j = 0; j < QK8_0 * 4; j++) {
 
146
  const float d = amax / ((1 << 7) - 1);
147
  id[row_iter] = d ? 1.0f / d : 0.0f;
148
 
149
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
150
  }
151
 
152
  for (int j = 0; j < 4; j++) {
 
222
  const float d = amax / ((1 << 7) - 1);
223
  id[row_iter] = d ? 1.0f / d : 0.0f;
224
 
225
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
226
  }
227
 
228
  for (int j = 0; j < QK8_0 * 4; j++) {
 
312
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
313
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
314
  }
315
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
316
  }
317
  }
318
  }
 
400
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
401
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
402
  }
403
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
404
  }
405
  }
406
  }
 
515
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
516
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
517
  }
518
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
519
  }
520
  }
521
  }
 
609
  const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
610
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
611
  }
612
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
613
  }
614
  }
615
  }
 
1118
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
1119
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
1120
  }
1121
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
1122
  }
1123
  }
1124
  }
 
1571
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
1572
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
1573
  }
1574
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
1575
  }
1576
  }
1577
  }
 
2040
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
2041
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
2042
  }
2043
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
2044
  }
2045
  }
2046
  }
 
2148
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
2149
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
2150
  }
2151
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
2152
  }
2153
  }
2154
  }
ggml/src/ggml-cpu/arch/loongarch/quants.c CHANGED
@@ -3,6 +3,7 @@
3
  #include "ggml-quants.h"
4
  #include "ggml-impl.h"
5
  #include "ggml-cpu.h"
 
6
 
7
  #include "../../quants.h"
8
  #include "../../ggml-cpu-impl.h"
@@ -474,7 +475,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
474
 
475
  // Quantize these floats
476
  const float d = max_scalar / 127.f;
477
- y[i].d = GGML_FP32_TO_FP16(d);
478
  const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
479
  const __m256 mul = (__m256)__lasx_xvreplfr2vr_s( id );
480
 
@@ -548,7 +549,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
548
 
549
  // Quantize these floats
550
  const float d = max_scalar / 127.f;
551
- y[i].d = GGML_FP32_TO_FP16(d);
552
  const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
553
  const __m256 mul = __lasx_xvreplfr2vr_s( id );
554
 
@@ -576,7 +577,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
576
  // Compute the sum of the quants and set y[i].s
577
  const __m128i s0 = __lsx_vadd_w(__lsx_vadd_w(ni0, ni1), __lsx_vadd_w(ni2, ni3));
578
  const __m128i s1 = __lsx_vadd_w(__lsx_vadd_w(ni4, ni5), __lsx_vadd_w(ni6, ni7));
579
- y[i].s = GGML_FP32_TO_FP16(d * hsum_i32_4(__lsx_vadd_w(s0, s1)));
580
 
581
  // Convert int32 to int16
582
  ni0 = lsx_packs_w( ni0, ni1 );
@@ -667,7 +668,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
667
  // Main loop
668
  for (; ib < nb; ++ib) {
669
  /* Compute combined scale for the block */
670
- const __m256 d = __lasx_xvreplfr2vr_s( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
671
 
672
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
673
 
@@ -699,7 +700,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
699
  for (; ib + 1 < nb; ib += 2) {
700
 
701
  // Compute combined scale for the block 0 and 1
702
- const __m128 d_0_1 = (__m128)__lsx_vreplgr2vr_w( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
703
 
704
  const __m128i tmp_0_1 = __lsx_vld((const __m128i *)x[ib].qs, 0);
705
 
@@ -717,7 +718,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
717
  //_mm_prefetch(&y[ib] + 2 * sizeof(block_q8_0), _MM_HINT_T0);
718
 
719
  // Compute combined scale for the block 2 and 3
720
- const __m128 d_2_3 = (__m128)__lsx_vreplgr2vr_w( GGML_FP16_TO_FP32(x[ib + 1].d) * GGML_FP16_TO_FP32(y[ib + 1].d) );
721
 
722
  const __m128i tmp_2_3 = __lsx_vld((const __m128i *)x[ib + 1].qs, 0);
723
 
@@ -766,7 +767,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
766
  }
767
 
768
  int sumi = sumi0 + sumi1;
769
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
770
  }
771
 
772
  *s = sumf;
@@ -797,10 +798,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
797
 
798
  // Main loop
799
  for (; ib < nb; ++ib) {
800
- const float d0 = GGML_FP16_TO_FP32(x[ib].d);
801
- const float d1 = GGML_FP16_TO_FP32(y[ib].d);
802
 
803
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
804
 
805
  const __m256 d0v = __lasx_xvreplfr2vr_s( d0 );
806
  const __m256 d1v = __lasx_xvreplfr2vr_s( d1 );
@@ -834,7 +835,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
834
  }
835
 
836
  int sumi = sumi0 + sumi1;
837
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
838
  }
839
 
840
  *s = sumf;
@@ -865,7 +866,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
865
  // Main loop
866
  for (; ib < nb; ++ib) {
867
  /* Compute combined scale for the block */
868
- const __m256 d = __lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d)); //FIXME
869
 
870
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
871
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -902,7 +903,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
902
  }
903
 
904
  int sumi = sumi0 + sumi1;
905
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
906
  }
907
 
908
  *s = sumf;
@@ -934,16 +935,16 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
934
 
935
  // Main loop
936
  for (; ib < nb; ++ib) {
937
- const __m256 dx = __lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(x[ib].d));
938
 
939
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
940
 
941
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
942
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
943
  bxhi = __lasx_xvand_v(bxhi, __lasx_xvreplgr2vr_b(0x10));
944
  qx = __lasx_xvor_v(qx, bxhi);
945
 
946
- const __m256 dy = __lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(y[ib].d));
947
  const __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);
948
 
949
  const __m256 q = mul_sum_us8_pairs_float(qx, qy);
@@ -973,7 +974,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
973
  }
974
 
975
  int sumi = sumi0 + sumi1;
976
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
977
  }
978
 
979
  *s = sumf;
@@ -1003,7 +1004,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
1003
  // Main loop
1004
  for (; ib < nb; ++ib) {
1005
  // Compute combined scale for the block
1006
- const __m256 d = __lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
1007
  __m256i qx = __lasx_xvld((const __m256i *)x[ib].qs, 0);
1008
  __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);
1009
 
@@ -1023,7 +1024,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
1023
  sumi += x[ib].qs[j]*y[ib].qs[j];
1024
  }
1025
 
1026
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
1027
  }
1028
 
1029
  *s = sumf;
@@ -1047,8 +1048,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1047
 
1048
  for (int i = 0; i < nb; ++i) {
1049
 
1050
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1051
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
1052
 
1053
  const uint8_t * GGML_RESTRICT q2 = x[i].qs;
1054
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1116,8 +1117,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1116
  summs += y[i].bsums[j] * (sc[j] >> 4);
1117
  }
1118
 
1119
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1120
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
1121
 
1122
  int isum = 0;
1123
  int is = 0;
@@ -1170,7 +1171,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1170
 
1171
  for (int i = 0; i < nb; ++i) {
1172
 
1173
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1174
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
1175
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1176
  // Set up scales
@@ -1294,7 +1295,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1294
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
1295
  q8 += 8; a += 8;
1296
  }
1297
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1298
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1299
  }
1300
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1330,8 +1331,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1330
 
1331
  for (int i = 0; i < nb; ++i) {
1332
 
1333
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1334
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
1335
 
1336
  memcpy(utmp, x[i].scales, 12);
1337
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1438,9 +1439,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1438
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1439
  q8 += 8; a += 8;
1440
  }
1441
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1442
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1443
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
1444
  sumf -= dmin * sumi;
1445
  }
1446
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1477,8 +1478,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1477
  const uint8_t * GGML_RESTRICT q5 = x[i].qs;
1478
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1479
 
1480
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1481
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
1482
 
1483
  memcpy(utmp, x[i].scales, 12);
1484
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1593,9 +1594,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1593
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1594
  q8 += 8; a += 8;
1595
  }
1596
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1597
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1598
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
1599
  sumf -= dmin * sumi;
1600
  }
1601
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1624,7 +1625,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1624
 
1625
  for (int i = 0; i < nb; ++i) {
1626
 
1627
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
1628
 
1629
  const uint8_t * GGML_RESTRICT q4 = x[i].ql;
1630
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1713,7 +1714,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1713
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1714
  q8 += 8; a += 8;
1715
  }
1716
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1717
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1718
  }
1719
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1780,7 +1781,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
1780
 
1781
  __m256 accumf = (__m256)__lasx_xvldi(0);
1782
  for (int i = 0; i < nb; ++i) {
1783
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1784
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1785
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1786
  __m256i sumi1 = __lasx_xvldi(0);
@@ -1820,7 +1821,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
1820
 
1821
  float sumf = 0.f;
1822
  for (int i = 0; i < nb; ++i) {
1823
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1824
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1825
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1826
  int32_t bsum = 0;
@@ -1895,7 +1896,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
1895
 
1896
  __m256 accumf = (__m256)__lasx_xvldi(0);
1897
  for (int i = 0; i < nb; ++i) {
1898
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1899
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1900
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1901
 
@@ -1980,7 +1981,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
1980
 
1981
  float sumf = 0.f;
1982
  for (int i = 0; i < nb; ++i) {
1983
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1984
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1985
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
1986
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2049,7 +2050,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2049
 
2050
  __m256 accumf = (__m256)__lasx_xvldi(0);
2051
  for (int i = 0; i < nb; ++i) {
2052
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2053
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2054
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2055
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -2108,7 +2109,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2108
  float sumf = 0;
2109
  for (int i = 0; i < nb; i++) {
2110
 
2111
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2112
  const int8_t * q8 = y[i].qs;
2113
  const uint8_t * qs = x[i].qs;
2114
  const uint8_t * qh = x[i].qh;
@@ -2168,7 +2169,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
2168
 
2169
  __m256 accumf = (__m256)__lasx_xvldi(0);
2170
  for (int i = 0; i < nb; ++i) {
2171
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2172
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
2173
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
2174
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2213,7 +2214,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
2213
 
2214
  float sumf = 0.f;
2215
  for (int i = 0; i < nb; ++i) {
2216
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2217
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
2218
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
2219
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2279,7 +2280,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2279
 
2280
  __m256 accumf = (__m256)__lasx_xvldi(0);
2281
  for (int i = 0; i < nb; ++i) {
2282
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2283
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2284
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2285
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -2340,7 +2341,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2340
 
2341
  float sumf = 0.f;
2342
  for (int i = 0; i < nb; ++i) {
2343
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
2344
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2345
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2346
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -2451,7 +2452,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2451
  + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
2452
  }
2453
 
2454
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
2455
  accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(d), __lasx_xvffint_s_w(sumi), accum);
2456
  accum1 += d * sumi1;
2457
  }
@@ -2484,7 +2485,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
2484
  qs += 4;
2485
  }
2486
 
2487
- sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
2488
  }
2489
 
2490
  *s = sumf;
@@ -2530,9 +2531,9 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
2530
  const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
2531
  const __m256i p_1 = lasx_madd_h(p16_1, mone);
2532
  const __m256i p_2 = lasx_madd_h(p16_2, mone);
2533
- accum1 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(y[ib + 0].d)*GGML_FP16_TO_FP32(x[ib + 0].d)),
2534
  __lasx_xvffint_s_w(p_1), accum1);
2535
- accum2 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(y[ib + 1].d)*GGML_FP16_TO_FP32(x[ib + 1].d)),
2536
  __lasx_xvffint_s_w(p_2), accum2);
2537
  }
2538
 
@@ -2540,7 +2541,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
2540
 
2541
  #endif
2542
  for (; ib < nb; ++ib) {
2543
- const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
2544
  int sumi1 = 0, sumi2 = 0;
2545
  for (int j = 0; j < QK4_NL/2; ++j) {
2546
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -2595,7 +2596,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
2595
  sumi1 = __lasx_xvadd_w(p_1, sumi1);
2596
  sumi2 = __lasx_xvadd_w(p_2, sumi2);
2597
  }
2598
- accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
2599
  __lasx_xvffint_s_w(__lasx_xvadd_w(sumi1, sumi2)), accum);
2600
  }
2601
 
@@ -2604,7 +2605,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
2604
  #else
2605
  float sumf = 0;
2606
  for (int ibl = 0; ibl < nb; ++ibl) {
2607
- const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
2608
  uint16_t h = x[ibl].scales_h;
2609
  const uint8_t * qs = x[ibl].qs;
2610
  const int8_t * q8 = y[ibl].qs;
 
3
  #include "ggml-quants.h"
4
  #include "ggml-impl.h"
5
  #include "ggml-cpu.h"
6
+ #include "simd-mappings.h"
7
 
8
  #include "../../quants.h"
9
  #include "../../ggml-cpu-impl.h"
 
475
 
476
  // Quantize these floats
477
  const float d = max_scalar / 127.f;
478
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
479
  const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
480
  const __m256 mul = (__m256)__lasx_xvreplfr2vr_s( id );
481
 
 
549
 
550
  // Quantize these floats
551
  const float d = max_scalar / 127.f;
552
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
553
  const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
554
  const __m256 mul = __lasx_xvreplfr2vr_s( id );
555
 
 
577
  // Compute the sum of the quants and set y[i].s
578
  const __m128i s0 = __lsx_vadd_w(__lsx_vadd_w(ni0, ni1), __lsx_vadd_w(ni2, ni3));
579
  const __m128i s1 = __lsx_vadd_w(__lsx_vadd_w(ni4, ni5), __lsx_vadd_w(ni6, ni7));
580
+ y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_4(__lsx_vadd_w(s0, s1)));
581
 
582
  // Convert int32 to int16
583
  ni0 = lsx_packs_w( ni0, ni1 );
 
668
  // Main loop
669
  for (; ib < nb; ++ib) {
670
  /* Compute combined scale for the block */
671
+ const __m256 d = __lasx_xvreplfr2vr_s( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );
672
 
673
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
674
 
 
700
  for (; ib + 1 < nb; ib += 2) {
701
 
702
  // Compute combined scale for the block 0 and 1
703
+ const __m128 d_0_1 = (__m128)__lsx_vreplgr2vr_w( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );
704
 
705
  const __m128i tmp_0_1 = __lsx_vld((const __m128i *)x[ib].qs, 0);
706
 
 
718
  //_mm_prefetch(&y[ib] + 2 * sizeof(block_q8_0), _MM_HINT_T0);
719
 
720
  // Compute combined scale for the block 2 and 3
721
+ const __m128 d_2_3 = (__m128)__lsx_vreplgr2vr_w( GGML_CPU_FP16_TO_FP32(x[ib + 1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) );
722
 
723
  const __m128i tmp_2_3 = __lsx_vld((const __m128i *)x[ib + 1].qs, 0);
724
 
 
767
  }
768
 
769
  int sumi = sumi0 + sumi1;
770
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
771
  }
772
 
773
  *s = sumf;
 
798
 
799
  // Main loop
800
  for (; ib < nb; ++ib) {
801
+ const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
802
+ const float d1 = GGML_CPU_FP16_TO_FP32(y[ib].d);
803
 
804
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
805
 
806
  const __m256 d0v = __lasx_xvreplfr2vr_s( d0 );
807
  const __m256 d1v = __lasx_xvreplfr2vr_s( d1 );
 
835
  }
836
 
837
  int sumi = sumi0 + sumi1;
838
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
839
  }
840
 
841
  *s = sumf;
 
866
  // Main loop
867
  for (; ib < nb; ++ib) {
868
  /* Compute combined scale for the block */
869
+ const __m256 d = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d)); //FIXME
870
 
871
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
872
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
 
903
  }
904
 
905
  int sumi = sumi0 + sumi1;
906
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
907
  }
908
 
909
  *s = sumf;
 
935
 
936
  // Main loop
937
  for (; ib < nb; ++ib) {
938
+ const __m256 dx = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d));
939
 
940
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
941
 
942
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
943
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
944
  bxhi = __lasx_xvand_v(bxhi, __lasx_xvreplgr2vr_b(0x10));
945
  qx = __lasx_xvor_v(qx, bxhi);
946
 
947
+ const __m256 dy = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib].d));
948
  const __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);
949
 
950
  const __m256 q = mul_sum_us8_pairs_float(qx, qy);
 
974
  }
975
 
976
  int sumi = sumi0 + sumi1;
977
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
978
  }
979
 
980
  *s = sumf;
 
1004
  // Main loop
1005
  for (; ib < nb; ++ib) {
1006
  // Compute combined scale for the block
1007
+ const __m256 d = __lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
1008
  __m256i qx = __lasx_xvld((const __m256i *)x[ib].qs, 0);
1009
  __m256i qy = __lasx_xvld((const __m256i *)y[ib].qs, 0);
1010
 
 
1024
  sumi += x[ib].qs[j]*y[ib].qs[j];
1025
  }
1026
 
1027
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
1028
  }
1029
 
1030
  *s = sumf;
 
1048
 
1049
  for (int i = 0; i < nb; ++i) {
1050
 
1051
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1052
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1053
 
1054
  const uint8_t * GGML_RESTRICT q2 = x[i].qs;
1055
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
1117
  summs += y[i].bsums[j] * (sc[j] >> 4);
1118
  }
1119
 
1120
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1121
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1122
 
1123
  int isum = 0;
1124
  int is = 0;
 
1171
 
1172
  for (int i = 0; i < nb; ++i) {
1173
 
1174
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1175
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
1176
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1177
  // Set up scales
 
1295
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
1296
  q8 += 8; a += 8;
1297
  }
1298
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1299
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1300
  }
1301
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
1331
 
1332
  for (int i = 0; i < nb; ++i) {
1333
 
1334
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1335
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1336
 
1337
  memcpy(utmp, x[i].scales, 12);
1338
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
 
1439
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1440
  q8 += 8; a += 8;
1441
  }
1442
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1443
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1444
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
1445
  sumf -= dmin * sumi;
1446
  }
1447
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
1478
  const uint8_t * GGML_RESTRICT q5 = x[i].qs;
1479
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1480
 
1481
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1482
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
1483
 
1484
  memcpy(utmp, x[i].scales, 12);
1485
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
 
1594
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1595
  q8 += 8; a += 8;
1596
  }
1597
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1598
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1599
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
1600
  sumf -= dmin * sumi;
1601
  }
1602
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
1625
 
1626
  for (int i = 0; i < nb; ++i) {
1627
 
1628
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
1629
 
1630
  const uint8_t * GGML_RESTRICT q4 = x[i].ql;
1631
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
 
1714
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1715
  q8 += 8; a += 8;
1716
  }
1717
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1718
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1719
  }
1720
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
1781
 
1782
  __m256 accumf = (__m256)__lasx_xvldi(0);
1783
  for (int i = 0; i < nb; ++i) {
1784
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1785
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1786
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1787
  __m256i sumi1 = __lasx_xvldi(0);
 
1821
 
1822
  float sumf = 0.f;
1823
  for (int i = 0; i < nb; ++i) {
1824
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1825
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1826
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1827
  int32_t bsum = 0;
 
1896
 
1897
  __m256 accumf = (__m256)__lasx_xvldi(0);
1898
  for (int i = 0; i < nb; ++i) {
1899
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1900
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1901
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
1902
 
 
1981
 
1982
  float sumf = 0.f;
1983
  for (int i = 0; i < nb; ++i) {
1984
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
1985
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
1986
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
1987
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
2050
 
2051
  __m256 accumf = (__m256)__lasx_xvldi(0);
2052
  for (int i = 0; i < nb; ++i) {
2053
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2054
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2055
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2056
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
 
2109
  float sumf = 0;
2110
  for (int i = 0; i < nb; i++) {
2111
 
2112
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2113
  const int8_t * q8 = y[i].qs;
2114
  const uint8_t * qs = x[i].qs;
2115
  const uint8_t * qh = x[i].qh;
 
2169
 
2170
  __m256 accumf = (__m256)__lasx_xvldi(0);
2171
  for (int i = 0; i < nb; ++i) {
2172
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2173
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
2174
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
2175
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
2214
 
2215
  float sumf = 0.f;
2216
  for (int i = 0; i < nb; ++i) {
2217
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2218
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
2219
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
2220
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
2280
 
2281
  __m256 accumf = (__m256)__lasx_xvldi(0);
2282
  for (int i = 0; i < nb; ++i) {
2283
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2284
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2285
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2286
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
 
2341
 
2342
  float sumf = 0.f;
2343
  for (int i = 0; i < nb; ++i) {
2344
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
2345
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
2346
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
2347
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
 
2452
  + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
2453
  }
2454
 
2455
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
2456
  accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(d), __lasx_xvffint_s_w(sumi), accum);
2457
  accum1 += d * sumi1;
2458
  }
 
2485
  qs += 4;
2486
  }
2487
 
2488
+ sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
2489
  }
2490
 
2491
  *s = sumf;
 
2531
  const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
2532
  const __m256i p_1 = lasx_madd_h(p16_1, mone);
2533
  const __m256i p_2 = lasx_madd_h(p16_2, mone);
2534
+ accum1 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib + 0].d)*GGML_CPU_FP16_TO_FP32(x[ib + 0].d)),
2535
  __lasx_xvffint_s_w(p_1), accum1);
2536
+ accum2 = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(y[ib + 1].d)*GGML_CPU_FP16_TO_FP32(x[ib + 1].d)),
2537
  __lasx_xvffint_s_w(p_2), accum2);
2538
  }
2539
 
 
2541
 
2542
  #endif
2543
  for (; ib < nb; ++ib) {
2544
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
2545
  int sumi1 = 0, sumi2 = 0;
2546
  for (int j = 0; j < QK4_NL/2; ++j) {
2547
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
 
2596
  sumi1 = __lasx_xvadd_w(p_1, sumi1);
2597
  sumi2 = __lasx_xvadd_w(p_2, sumi2);
2598
  }
2599
+ accum = __lasx_xvfmadd_s(__lasx_xvreplfr2vr_s(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
2600
  __lasx_xvffint_s_w(__lasx_xvadd_w(sumi1, sumi2)), accum);
2601
  }
2602
 
 
2605
  #else
2606
  float sumf = 0;
2607
  for (int ibl = 0; ibl < nb; ++ibl) {
2608
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
2609
  uint16_t h = x[ibl].scales_h;
2610
  const uint8_t * qs = x[ibl].qs;
2611
  const int8_t * q8 = y[ibl].qs;
ggml/src/ggml-cpu/arch/powerpc/quants.c CHANGED
@@ -3,6 +3,7 @@
3
  #include "ggml-quants.h"
4
  #include "ggml-impl.h"
5
  #include "ggml-cpu.h"
 
6
 
7
  #include "../../quants.h"
8
  #include "../../ggml-cpu-impl.h"
@@ -67,7 +68,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
67
  const float id = d ? 1.0f/d : 0.0f;
68
  const vector float vid = vec_splats(id);
69
 
70
- y[i].d = GGML_FP32_TO_FP16(d);
71
 
72
  for (int j = 0; j < 8; j++) {
73
  const vector float v = vec_round(vec_mul(srcv[j], vid));
@@ -112,7 +113,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
112
  const float id = d ? 1.0f/d : 0.0f;
113
  const vector float vid = vec_splats(id);
114
 
115
- y[i].d = GGML_FP32_TO_FP16(d);
116
 
117
  vector int accv = vec_splats(0);
118
 
@@ -127,7 +128,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
127
 
128
  accv = vec_add(accv, vec_sld(accv, accv, 4));
129
  accv = vec_add(accv, vec_sld(accv, accv, 8));
130
- y[i].s = GGML_FP32_TO_FP16(d * vec_extract(accv, 0));
131
  }
132
 
133
  #else
@@ -170,8 +171,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
170
  __builtin_prefetch(x[ib].qs, 0, 1);
171
  __builtin_prefetch(y[ib].qs, 0, 1);
172
 
173
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
174
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
175
  vector float vd = vec_mul(vxd, vyd);
176
 
177
  vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -214,7 +215,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
214
  }
215
 
216
  int sumi = sumi0 + sumi1;
217
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
218
  }
219
 
220
  *s = sumf;
@@ -249,12 +250,12 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
249
  __builtin_prefetch(x[ib].qs, 0, 1);
250
  __builtin_prefetch(y[ib].qs, 0, 1);
251
 
252
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
253
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
254
  vector float vd = vec_mul(vxd, vyd);
255
 
256
- vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[ib].m));
257
- vector float vys = {GGML_FP16_TO_FP32(y[ib].s), 0.0f, 0.0f, 0.0f};
258
  vsumf0 = vec_madd(vxmin, vys, vsumf0);
259
 
260
  vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -291,7 +292,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
291
  }
292
 
293
  int sumi = sumi0 + sumi1;
294
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
295
  }
296
 
297
  *s = sumf;
@@ -326,8 +327,8 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
326
  __builtin_prefetch(x[ib].qs, 0, 1);
327
  __builtin_prefetch(y[ib].qs, 0, 1);
328
 
329
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
330
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
331
  vector float vd = vec_mul(vxd, vyd);
332
 
333
  vector signed long long aux64x2_0 = {(uint64_t)(table_b2b_1[x[ib].qh[0]]), (uint64_t)(table_b2b_1[x[ib].qh[1]])};
@@ -379,7 +380,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
379
  }
380
 
381
  int sumi = sumi0 + sumi1;
382
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
383
  }
384
 
385
  *s = sumf;
@@ -415,12 +416,12 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
415
  __builtin_prefetch(x[ib].qs, 0, 1);
416
  __builtin_prefetch(y[ib].qs, 0, 1);
417
 
418
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
419
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
420
  vector float vd = vec_mul(vxd, vyd);
421
 
422
- vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[ib].m));
423
- vector float vys = {GGML_FP16_TO_FP32(y[ib].s), 0.f, 0.f, 0.f};
424
  vsumf0 = vec_madd(vxmin, vys, vsumf0);
425
 
426
  vector unsigned long long aux64x2_0 = {(uint64_t)(table_b2b_0[x[ib].qh[0]]), (uint64_t)(table_b2b_0[x[ib].qh[1]])};
@@ -470,7 +471,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
470
  }
471
 
472
  int sumi = sumi0 + sumi1;
473
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
474
  }
475
 
476
  *s = sumf;
@@ -502,8 +503,8 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
502
  __builtin_prefetch(x[ib].qs, 0, 1);
503
  __builtin_prefetch(y[ib].qs, 0, 1);
504
 
505
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
506
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
507
  vector float vd = vec_mul(vxd, vyd);
508
 
509
  vector signed char q8x0 = vec_xl( 0, x[ib].qs);
@@ -542,7 +543,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
542
  sumi += x[ib].qs[j]*y[ib].qs[j];
543
  }
544
 
545
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
546
  }
547
 
548
  *s = sumf;
@@ -574,11 +575,11 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
574
  vector float vsumf3 = vec_splats(0.0f);
575
 
576
  for (int i = 0; i < nb; ++i) {
577
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
578
  vector float vyd = vec_splats(y[i].d);
579
  vector float vd = vec_mul(vxd, vyd);
580
 
581
- vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[i].dmin));
582
  vector float vdmin = vec_mul(vxmin, vyd);
583
 
584
  vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);
@@ -708,8 +709,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
708
  summs += y[i].bsums[j] * (sc[j] >> 4);
709
  }
710
 
711
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
712
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
713
 
714
  int isum = 0;
715
  int is = 0;
@@ -770,7 +771,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
770
  vector float vsumf3 = vec_splats(0.0f);
771
 
772
  for (int i = 0; i < nb; ++i) {
773
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
774
  vector float vyd = vec_splats(y[i].d);
775
  vector float vd = vec_mul(vxd, vyd);
776
 
@@ -962,7 +963,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
962
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
963
  q8 += 8; a += 8;
964
  }
965
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
966
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
967
  }
968
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1005,11 +1006,11 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1005
  vector float vsumf3 = vec_splats(0.0f);
1006
 
1007
  for (int i = 0; i < nb; ++i) {
1008
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
1009
  vector float vyd = vec_splats(y[i].d);
1010
  vector float vd = vec_mul(vxd, vyd);
1011
 
1012
- vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[i].dmin));
1013
  vector float vdmin = vec_mul(vxmin, vyd);
1014
 
1015
  vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);
@@ -1177,9 +1178,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1177
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
1178
  q8 += 8; a += 8;
1179
  }
1180
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
1181
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
1182
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
1183
  sumf -= dmin * sumi;
1184
  }
1185
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1222,11 +1223,11 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
1222
  vector float vsumf3 = vec_splats(0.0f);
1223
 
1224
  for (int i = 0; i < nb; ++i) {
1225
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
1226
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
- vector float vxmin = vec_splats(GGML_FP16_TO_FP32(x[i].dmin));
 vector float vdmin = vec_mul(vxmin, vyd);
 
 UNUSED(kmask1);
@@ -1394,9 +1395,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1432,7 +1433,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -1591,7 +1592,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1659,7 +1660,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -1742,7 +1743,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
 int32_t bsum = 0;
@@ -1790,7 +1791,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -1871,7 +1872,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 const uint8_t * GGML_RESTRICT sc = x[i].scales;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1939,7 +1940,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -2033,7 +2034,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 float sumf = 0;
 for (int i = 0; i < nb; i++) {
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 const int8_t * q8 = y[i].qs;
 const uint8_t * qs = x[i].qs;
 const uint8_t * qh = x[i].qh;
@@ -2096,7 +2097,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -2176,7 +2177,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint8_t * GGML_RESTRICT q3 = x[i].qs;
 const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2236,7 +2237,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -2329,7 +2330,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint8_t * GGML_RESTRICT qs = x[i].qs;
 const uint8_t * GGML_RESTRICT qh = x[i].qh;
 const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -2394,7 +2395,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -2505,7 +2506,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 qs += 4;
 }
 
- sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
 }
 
 *s = sumf;
@@ -2546,8 +2547,8 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
 __builtin_prefetch(y[ib].qs, 0, 1);
 
 
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ib].d));
- vector float vyd = vec_splats(GGML_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
 vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);
@@ -2582,7 +2583,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
 
 #endif
 for (; ib < nb; ++ib) {
- const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
 int sumi1 = 0, sumi2 = 0;
 for (int j = 0; j < QK4_NL/2; ++j) {
 sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -2620,7 +2621,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
 for (int ibl = 0; ibl < nb; ++ibl) {
 
- vector float vxd = vec_splats(GGML_FP16_TO_FP32(x[ibl].d));
 vector float vyd = vec_splats(y[ibl].d);
 vector float vd = vec_mul(vxd, vyd);
 
@@ -2697,7 +2698,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 #else
 float sumf = 0;
 for (int ibl = 0; ibl < nb; ++ibl) {
- const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
 uint16_t h = x[ibl].scales_h;
 const uint8_t * qs = x[ibl].qs;
 const int8_t * q8 = y[ibl].qs;
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+ #include "simd-mappings.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"

 const float id = d ? 1.0f/d : 0.0f;
 const vector float vid = vec_splats(id);
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 for (int j = 0; j < 8; j++) {
 const vector float v = vec_round(vec_mul(srcv[j], vid));

 const float id = d ? 1.0f/d : 0.0f;
 const vector float vid = vec_splats(id);
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 vector int accv = vec_splats(0);
 

 
 accv = vec_add(accv, vec_sld(accv, accv, 4));
 accv = vec_add(accv, vec_sld(accv, accv, 8));
+ y[i].s = GGML_CPU_FP32_TO_FP16(d * vec_extract(accv, 0));
 }
 
 #else

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
 vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
 }
 
 *s = sumf;

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
+ vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].m));
+ vector float vys = {GGML_CPU_FP16_TO_FP32(y[ib].s), 0.0f, 0.0f, 0.0f};
 vsumf0 = vec_madd(vxmin, vys, vsumf0);
 
 vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
 vector signed long long aux64x2_0 = {(uint64_t)(table_b2b_1[x[ib].qh[0]]), (uint64_t)(table_b2b_1[x[ib].qh[1]])};

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
 }
 
 *s = sumf;

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
+ vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].m));
+ vector float vys = {GGML_CPU_FP16_TO_FP32(y[ib].s), 0.f, 0.f, 0.f};
 vsumf0 = vec_madd(vxmin, vys, vsumf0);
 
 vector unsigned long long aux64x2_0 = {(uint64_t)(table_b2b_0[x[ib].qh[0]]), (uint64_t)(table_b2b_0[x[ib].qh[1]])};

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
 vector signed char q8x0 = vec_xl( 0, x[ib].qs);

 sumi += x[ib].qs[j]*y[ib].qs[j];
 }
 
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
 }
 
 *s = sumf;

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
+ vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].dmin));
 vector float vdmin = vec_mul(vxmin, vyd);
 
 vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);

 summs += y[i].bsums[j] * (sc[j] >> 4);
 }
 
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 int isum = 0;
 int is = 0;

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
+ vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].dmin));
 vector float vdmin = vec_mul(vxmin, vyd);
 
 vector signed short q8ysums0 = vec_xl( 0, y[i].bsums);

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 
+ vector float vxmin = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].dmin));
 vector float vdmin = vec_mul(vxmin, vyd);
 
 UNUSED(kmask1);

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
 int32_t bsum = 0;

 const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 const uint8_t * GGML_RESTRICT sc = x[i].scales;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;

 const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 float sumf = 0;
 for (int i = 0; i < nb; i++) {
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 const int8_t * q8 = y[i].qs;
 const uint8_t * qs = x[i].qs;
 const uint8_t * qh = x[i].qh;

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint8_t * GGML_RESTRICT q3 = x[i].qs;
 const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;

 const vector signed char mask2 = (vector signed char)vec_xl( 0, k_mask2);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 
 float sumf = 0.f;
 for (int i = 0; i < nb; ++i) {
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 const uint8_t * GGML_RESTRICT qs = x[i].qs;
 const uint8_t * GGML_RESTRICT qh = x[i].qh;
 const uint8_t * GGML_RESTRICT signs = x[i].signs;

 vector float vsumf3 = vec_splats(0.0f);
 
 for (int i = 0; i < nb; ++i) {
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[i].d));
 vector float vyd = vec_splats(y[i].d);
 vector float vd = vec_mul(vxd, vyd);
 

 qs += 4;
 }
 
+ sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
 }
 
 *s = sumf;

 __builtin_prefetch(y[ib].qs, 0, 1);
 
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d));
+ vector float vyd = vec_splats(GGML_CPU_FP16_TO_FP32(y[ib].d));
 vector float vd = vec_mul(vxd, vyd);
 
 vector signed char qxs = (vector signed char)vec_xl( 0, x[ib].qs);

 
 #endif
 for (; ib < nb; ++ib) {
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
 int sumi1 = 0, sumi2 = 0;
 for (int j = 0; j < QK4_NL/2; ++j) {
 sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];

 
 for (int ibl = 0; ibl < nb; ++ibl) {
 
+ vector float vxd = vec_splats(GGML_CPU_FP16_TO_FP32(x[ibl].d));
 vector float vyd = vec_splats(y[ibl].d);
 vector float vd = vec_mul(vxd, vyd);
 

 #else
 float sumf = 0;
 for (int ibl = 0; ibl < nb; ++ibl) {
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
 uint16_t h = x[ibl].scales_h;
 const uint8_t * qs = x[ibl].qs;
 const int8_t * q8 = y[ibl].qs;
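Editor's note: every hunk in this commit follows the same mechanical pattern — each `GGML_FP16_TO_FP32`/`GGML_FP32_TO_FP16` call site in the per-architecture kernels moves to the CPU-scoped `GGML_CPU_FP16_TO_FP32`/`GGML_CPU_FP32_TO_FP16` macros, and each file gains `#include "simd-mappings.h"`, the header that owns those definitions. A minimal sketch of the dispatch shape this enables is below; only the macro names come from the diff, the fallback routing and helper names are assumptions for illustration.

```c
// Sketch only: the rough dispatch shape behind the renamed macros.
// The real definitions live in ggml/src/ggml-cpu/simd-mappings.h;
// nnpa_fp16_to_fp32/nnpa_fp32_to_fp16 are hypothetical helper names.
#if defined(__NNPA__)
    // s390x with NNPA: hardware FP16<->FP32 conversion
    #define GGML_CPU_FP16_TO_FP32(x) nnpa_fp16_to_fp32(x)
    #define GGML_CPU_FP32_TO_FP16(x) nnpa_fp32_to_fp16(x)
#else
    // all other targets: fall back to the generic ggml conversions
    #define GGML_CPU_FP16_TO_FP32(x) GGML_FP16_TO_FP32(x)
    #define GGML_CPU_FP32_TO_FP16(x) GGML_FP32_TO_FP16(x)
#endif
```

With every kernel funneled through this one header, the s390x build can swap in NNPA conversions without touching any of the call sites edited above.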
ggml/src/ggml-cpu/arch/riscv/quants.c CHANGED
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -45,7 +46,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f/d : 0.0f;
 
- y[i].d = GGML_FP32_TO_FP16(d);
 
 vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);
 
@@ -85,7 +86,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f/d : 0.0f;
 
- y[i].d = GGML_FP32_TO_FP16(d);
 
 vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);
 
@@ -102,7 +103,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 
 // set y[i].s
 int sum = __riscv_vmv_x_s_i16m1_i16(vwrs);
- y[i].s = GGML_FP32_TO_FP16(sum*d);
 }
 
 #else
@@ -160,7 +161,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);
 
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
 }
 
 #endif
@@ -177,7 +178,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
 }
 
 *s = sumf;
@@ -225,7 +226,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);
 
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
 }
 
 #endif
@@ -242,7 +243,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;
@@ -293,7 +294,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
 int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);
 
- sumf += (GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d)) * sumi;
 }
 
 #endif
@@ -316,7 +317,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
 }
 
 *s = sumf;
@@ -366,7 +367,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
 int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);
 
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
 }
 
 #endif
@@ -389,7 +390,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;
@@ -427,7 +428,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 int sumi = __riscv_vmv_x_s_i32m1_i32(v_sum);
 
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
 }
 
 #endif
@@ -438,7 +439,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 sumi += x[ib].qs[j]*y[ib].qs[j];
 }
 
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
 }
 
 *s = sumf;
@@ -465,8 +466,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 const uint8_t * q2 = x[i].qs;
 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 uint8_t *patmp = atmp;
 int vsums;
 int tmp;
@@ -569,8 +570,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
 
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 size_t vl = 16;
 
@@ -644,8 +645,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 const uint8_t * q2 = x[i].qs;
 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 uint8_t *patmp = atmp;
 int vsums;
 int tmp;
@@ -750,8 +751,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 summs += y[i].bsums[j] * (sc[j] >> 4);
 }
 
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 int isum = 0;
 int is = 0;
@@ -916,7 +917,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 q3 += 32; q8 += 128; scale += 8;
 }
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 sumf += d * isum;
 }
 
@@ -1017,7 +1018,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 }
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 
 sumf += d*sum_t;
 
@@ -1134,7 +1135,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 q3 += 32; q8 += 128; scale += 8;
 }
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 sumf += d * isum;
 }
 break;
@@ -1202,7 +1203,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1239,8 +1240,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 int tmp, tmp2, sumi;
 __asm__ __volatile__(
@@ -1361,8 +1362,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 size_t vl = 8;
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 vint16mf2_t q8sums_0 = __riscv_vlse16_v_i16mf2(y[i].bsums, 4, vl);
 vint16mf2_t q8sums_1 = __riscv_vlse16_v_i16mf2(y[i].bsums+1, 4, vl);
@@ -1422,8 +1423,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 break;
 case 128:
 for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 int tmp, tmp2, sumi;
 __asm__ __volatile__(
@@ -1580,9 +1581,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1627,8 +1628,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 const uint8_t * GGML_RESTRICT hm = x[i].qh;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 
 vint16m1_t q8sums_0 = __riscv_vlse16_v_i16m1(y[i].bsums, 4, vl);
 vint16m1_t q8sums_1 = __riscv_vlse16_v_i16m1(y[i].bsums+1, 4, vl);
@@ -1749,9 +1750,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1778,7 +1779,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 for (int i = 0; i < nb; ++i) {
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * restrict q6 = x[i].ql;
 const uint8_t * restrict qh = x[i].qh;
@@ -1862,7 +1863,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 case 256:
 for (int i = 0; i < nb; ++i) {
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * GGML_RESTRICT q6 = x[i].ql;
 const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1943,7 +1944,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 case 128:
 for (int i = 0; i < nb; ++i) {
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * restrict q6 = x[i].ql;
 const uint8_t * restrict qh = x[i].qh;
@@ -2058,7 +2059,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+ #include "simd-mappings.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"

 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f/d : 0.0f;
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);
 

 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f/d : 0.0f;
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 vfloat32m8_t x0 = __riscv_vfmul_vf_f32m8(v_x, id, vl);
 

 
 // set y[i].s
 int sum = __riscv_vmv_x_s_i16m1_i16(vwrs);
+ y[i].s = GGML_CPU_FP32_TO_FP16(sum*d);
 }
 
 #else

 
 int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);
 
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
 }
 
 #endif

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
 }
 
 *s = sumf;

 
 int sumi = __riscv_vmv_x_s_i32m1_i32(vs2);
 
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 #endif

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;

 vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
 int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);
 
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
 }
 
 #endif

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
 }
 
 *s = sumf;

 vint32m1_t sum = __riscv_vwredsum_vs_i16m4_i32m1(mul, zero, vl);
 int32_t sumi = __riscv_vmv_x_s_i32m1_i32(sum);
 
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 #endif

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;

 
 int sumi = __riscv_vmv_x_s_i32m1_i32(v_sum);
 
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
 }
 
 #endif

 sumi += x[ib].qs[j]*y[ib].qs[j];
 }
 
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
 }
 
 *s = sumf;

 const uint8_t * q2 = x[i].qs;
 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 uint8_t *patmp = atmp;
 int vsums;
 int tmp;

 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
 
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 size_t vl = 16;
 

 const uint8_t * q2 = x[i].qs;
 const int8_t * q8 = y[i].qs;
 const uint8_t * sc = x[i].scales;
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 uint8_t *patmp = atmp;
 int vsums;
 int tmp;

 summs += y[i].bsums[j] * (sc[j] >> 4);
 }
 
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 int isum = 0;
 int is = 0;

 q3 += 32; q8 += 128; scale += 8;
 }
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 sumf += d * isum;
 }
 

 
 }
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 sumf += d*sum_t;
 

 q3 += 32; q8 += 128; scale += 8;
 }
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 sumf += d * isum;
 }
 break;

 for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 int tmp, tmp2, sumi;
 __asm__ __volatile__(

 
 size_t vl = 8;
 
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 vint16mf2_t q8sums_0 = __riscv_vlse16_v_i16mf2(y[i].bsums, 4, vl);
 vint16mf2_t q8sums_1 = __riscv_vlse16_v_i16mf2(y[i].bsums+1, 4, vl);

 break;
 case 128:
 for (int i = 0; i < nb; ++i) {
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 int tmp, tmp2, sumi;
 __asm__ __volatile__(

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 const uint8_t * GGML_RESTRICT hm = x[i].qh;
 const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 
 vint16m1_t q8sums_0 = __riscv_vlse16_v_i16m1(y[i].bsums, 4, vl);
 vint16m1_t q8sums_1 = __riscv_vlse16_v_i16m1(y[i].bsums+1, 4, vl);

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 
 for (int i = 0; i < nb; ++i) {
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * restrict q6 = x[i].ql;
 const uint8_t * restrict qh = x[i].qh;

 case 256:
 for (int i = 0; i < nb; ++i) {
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * GGML_RESTRICT q6 = x[i].ql;
 const uint8_t * GGML_RESTRICT qh = x[i].qh;

 case 128:
 for (int i = 0; i < nb; ++i) {
 
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 
 const uint8_t * restrict q6 = x[i].ql;
 const uint8_t * restrict qh = x[i].qh;

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
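Editor's note: the `quantize_row_q8_0`/`quantize_row_q8_1` hunks above only touch where the per-block scale is stored. For orientation, a hedged scalar sketch of that step; `ggml_fp16_t` and the conversion macro are assumed from the ggml CPU headers, and the helper name is hypothetical.

```c
/* Hedged sketch of the per-block q8_0 scale logic the hunks above edit:
 * amax is the largest |x| in the block; d maps int8 values in [-127,127]
 * back to float, and is stored in half precision via the CPU-scoped macro. */
static inline ggml_fp16_t q8_0_block_scale(float amax) {
    const float d = amax / ((1 << 7) - 1);   /* 127 = max int8 magnitude */
    return GGML_CPU_FP32_TO_FP16(d);
}
```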
ggml/src/ggml-cpu/arch/riscv/repack.cpp CHANGED
@@ -6,6 +6,7 @@
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 #include "ggml-cpu-impl.h"
 #include "traits.h"
 
 #include <cmath>
@@ -90,16 +91,16 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
 const vfloat32m1_t facc = __riscv_vfcvt_f_x_v_f32m1(sumi_h8, vl / 4);
 
 // vector version needs Zvfhmin extension
- const float a_scale = GGML_FP16_TO_FP32(a_ptr[l].d);
 const float b_scales[8] = {
- GGML_FP16_TO_FP32(b_ptr[l].d[0]),
- GGML_FP16_TO_FP32(b_ptr[l].d[1]),
- GGML_FP16_TO_FP32(b_ptr[l].d[2]),
- GGML_FP16_TO_FP32(b_ptr[l].d[3]),
- GGML_FP16_TO_FP32(b_ptr[l].d[4]),
- GGML_FP16_TO_FP32(b_ptr[l].d[5]),
- GGML_FP16_TO_FP32(b_ptr[l].d[6]),
- GGML_FP16_TO_FP32(b_ptr[l].d[7])
 };
 const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);
 const vfloat32m1_t tmp1 = __riscv_vfmul_vf_f32m1(facc, a_scale, vl / 4);
@@ -129,7 +130,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
 const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
 sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
 }
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
 }
 }
 }
@@ -181,20 +182,20 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
 // vector version needs Zvfhmin extension
 const float a_scales[4] = {
- GGML_FP16_TO_FP32(a_ptr[l].d[0]),
- GGML_FP16_TO_FP32(a_ptr[l].d[1]),
- GGML_FP16_TO_FP32(a_ptr[l].d[2]),
- GGML_FP16_TO_FP32(a_ptr[l].d[3])
 };
 const float b_scales[8] = {
- GGML_FP16_TO_FP32(b_ptr[l].d[0]),
- GGML_FP16_TO_FP32(b_ptr[l].d[1]),
- GGML_FP16_TO_FP32(b_ptr[l].d[2]),
- GGML_FP16_TO_FP32(b_ptr[l].d[3]),
- GGML_FP16_TO_FP32(b_ptr[l].d[4]),
- GGML_FP16_TO_FP32(b_ptr[l].d[5]),
- GGML_FP16_TO_FP32(b_ptr[l].d[6]),
- GGML_FP16_TO_FP32(b_ptr[l].d[7])
 };
 const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);
 
@@ -382,7 +383,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
 sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
 (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
 }
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
 }
 }
 }
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 #include "ggml-cpu-impl.h"
+ #include "simd-mappings.h"
 #include "traits.h"
 
 #include <cmath>

 const vfloat32m1_t facc = __riscv_vfcvt_f_x_v_f32m1(sumi_h8, vl / 4);
 
 // vector version needs Zvfhmin extension
+ const float a_scale = GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
 const float b_scales[8] = {
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[0]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[1]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[2]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[3]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[4]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[5]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[6]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[7])
 };
 const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);
 const vfloat32m1_t tmp1 = __riscv_vfmul_vf_f32m1(facc, a_scale, vl / 4);

 const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
 sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
 }
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
 }
 }
 }

 
 // vector version needs Zvfhmin extension
 const float a_scales[4] = {
+ GGML_CPU_FP16_TO_FP32(a_ptr[l].d[0]),
+ GGML_CPU_FP16_TO_FP32(a_ptr[l].d[1]),
+ GGML_CPU_FP16_TO_FP32(a_ptr[l].d[2]),
+ GGML_CPU_FP16_TO_FP32(a_ptr[l].d[3])
 };
 const float b_scales[8] = {
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[0]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[1]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[2]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[3]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[4]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[5]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[6]),
+ GGML_CPU_FP16_TO_FP32(b_ptr[l].d[7])
 };
 const vfloat32m1_t b_scales_vec = __riscv_vle32_v_f32m1(b_scales, vl / 4);
 

 sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
 (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
 }
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
 }
 }
 }
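Editor's note: in the repack hunks above, the fp16 block scales are widened to fp32 one element at a time precisely because, as the in-diff comment says, the vectorized conversion would need the Zvfhmin extension. A hedged sketch of that scalar widening pattern; `ggml_fp16_t` and the macro are assumed from the ggml CPU headers.

```c
/* Hedged sketch of the scale-widening pattern in ggml_gemv/ggml_gemm above:
 * eight fp16 column scales are expanded element-wise so they can then be
 * loaded as an fp32 vector with __riscv_vle32_v_f32m1. */
static inline void widen_scales_8(const ggml_fp16_t d[8], float out[8]) {
    for (int k = 0; k < 8; ++k) {
        out[k] = GGML_CPU_FP16_TO_FP32(d[k]);   /* scalar FP16 -> FP32 */
    }
}
```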
ggml/src/ggml-cpu/arch/s390/quants.c CHANGED
@@ -3,6 +3,7 @@
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"
@@ -49,7 +50,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f / d : 0.0f;
 
- y[i].d = GGML_FP32_TO_FP16(d);
 
 for (int j = 0; j < 8; j++) {
 const __vector float v = vec_mul(srcv[j], vec_splats(id));
@@ -94,7 +95,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f / d : 0.0f;
 
- y[i].d = GGML_FP32_TO_FP16(d);
 
 __vector int32_t acc = vec_splats(0);
 
@@ -110,7 +111,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 acc = vec_add(acc, vi);
 }
 
- y[i].s = GGML_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
 }
 #else
 GGML_UNUSED(nb);
@@ -164,7 +165,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 __vector int16_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);
 
 const __vector float v_xy = vec_float(vec_unpackh(v_xy_));
- const __vector float v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }
@@ -185,7 +186,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
 }
 
 *s = sumf;
@@ -219,7 +220,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
 
 const uint8x16_t v_x = vec_xl(0, x[ib].qs);
 const int8x16_t v_xl = (const int8x16_t)(v_x & v_m);
@@ -231,7 +232,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 const float32x4_t v_xy = vec_float(v_xy_);
 
- const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }
@@ -252,7 +253,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 }
 
 int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;
@@ -290,7 +291,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 const float32x4_t v_xy = vec_float(v_xy_);
- const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }
@@ -305,7 +306,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
 sumi += x[ib].qs[j]*y[ib].qs[j];
 }
 
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
 }
 
 *s = sumf;
@@ -348,7 +349,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 float sum = 0;
 
 for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
 
 const uint8_t * restrict x0l = x[i].qs;
 const uint8_t * restrict x0h = x[i].hmask;
@@ -497,7 +498,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -537,8 +538,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
 const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
@@ -647,9 +648,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -698,8 +699,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
 
 const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
 const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
@@ -819,9 +820,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -859,7 +860,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 int8x16_t v_y[4];
 
 for (int i = 0; i < nb; ++i) {
- const float d_all = GGML_FP16_TO_FP32(x[i].d);
 
 const uint8_t * GGML_RESTRICT x0l = x[i].ql;
 const uint8_t * GGML_RESTRICT x0h = x[i].qh;
@@ -1004,7 +1005,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1071,7 +1072,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 // float sumf = 0;
 
 // for (int i = 0; i < nb; ++i) {
- // const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 // const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 // const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
@@ -1121,7 +1122,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
 // float sumf = 0.f;
 // for (int i = 0; i < nb; ++i) {
- // const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
 // const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 // const int8_t * GGML_RESTRICT q8 = y[i].qs;
 // int32_t bsum = 0;
@@ -1182,12 +1183,12 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
 const int8x16_t v_yh = vec_xl(QK8_0/2, y0->qs);
 const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 
- sumf += GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
 }
 
 #endif
 for (; ib < nb; ++ib) {
- const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
 int sumi1 = 0, sumi2 = 0;
 for (int j = 0; j < QK4_NL/2; ++j) {
 sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -1257,7 +1258,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 sumi2 += (vsumi1[0] + vsumi1[1] + vsumi1[2] + vsumi1[3]) * ls2;
 }
 
- sumf += GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
 }
 
 *s = sumf;
@@ -1265,7 +1266,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 #else
 float sumf = 0;
 for (int ibl = 0; ibl < nb; ++ibl) {
- const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
 uint16_t h = x[ibl].scales_h;
 const uint8_t * qs = x[ibl].qs;
 const int8_t * q8 = y[ibl].qs;
 #include "ggml-quants.h"
 #include "ggml-impl.h"
 #include "ggml-cpu.h"
+ #include "simd-mappings.h"
 
 #include "../../quants.h"
 #include "../../ggml-cpu-impl.h"

 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f / d : 0.0f;
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 for (int j = 0; j < 8; j++) {
 const __vector float v = vec_mul(srcv[j], vec_splats(id));

 const float d = amax / ((1 << 7) - 1);
 const float id = d ? 1.0f / d : 0.0f;
 
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
 
 __vector int32_t acc = vec_splats(0);
 

 acc = vec_add(acc, vi);
 }
 
+ y[i].s = GGML_CPU_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
 }
 #else
 GGML_UNUSED(nb);

 __vector int16_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);
 
 const __vector float v_xy = vec_float(vec_unpackh(v_xy_));
+ const __vector float v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
 }
 
 *s = sumf;

 __builtin_prefetch(x[ib].qs, 0, 1);
 __builtin_prefetch(y[ib].qs, 0, 1);
 
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
 
 const uint8x16_t v_x = vec_xl(0, x[ib].qs);
 const int8x16_t v_xl = (const int8x16_t)(v_x & v_m);

 const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 const float32x4_t v_xy = vec_float(v_xy_);
 
+ const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }

 }
 
 int sumi = sumi0 + sumi1;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
 }
 
 *s = sumf;

 
 const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 const float32x4_t v_xy = vec_float(v_xy_);
+ const float32x4_t v_d = vec_splats(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
 
 acc = vec_madd(v_xy, v_d, acc);
 }

 sumi += x[ib].qs[j]*y[ib].qs[j];
 }
 
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
 }
 
 *s = sumf;

 float sum = 0;
 
 for (int i = 0; i < nb; ++i) {
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
 const uint8_t * restrict x0l = x[i].qs;
 const uint8_t * restrict x0h = x[i].hmask;

 for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
 const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 float sumf = 0;
 
 for (int i = 0; i < nb; ++i) {
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
 const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
 const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
 sumf -= dmin * sumi;
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 int8x16_t v_y[4];
 
 for (int i = 0; i < nb; ++i) {
+ const float d_all = GGML_CPU_FP16_TO_FP32(x[i].d);
 
 const uint8_t * GGML_RESTRICT x0l = x[i].ql;
 const uint8_t * GGML_RESTRICT x0h = x[i].qh;

 for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
 q8 += 8; a += 8;
 }
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
 }
 for (int l = 0; l < 8; ++l) sumf += sums[l];

 // float sumf = 0;
 
 // for (int i = 0; i < nb; ++i) {
+ // const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 // const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 // const int8_t * GGML_RESTRICT q8 = y[i].qs;
 

 
 // float sumf = 0.f;
 // for (int i = 0; i < nb; ++i) {
+ // const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
 // const uint16_t * GGML_RESTRICT q2 = x[i].qs;
 // const int8_t * GGML_RESTRICT q8 = y[i].qs;
 // int32_t bsum = 0;

 const int8x16_t v_yh = vec_xl(QK8_0/2, y0->qs);
 const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
 
+ sumf += GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
 }
 
 #endif
 for (; ib < nb; ++ib) {
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
 int sumi1 = 0, sumi2 = 0;
 for (int j = 0; j < QK4_NL/2; ++j) {
 sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];

 sumi2 += (vsumi1[0] + vsumi1[1] + vsumi1[2] + vsumi1[3]) * ls2;
 }
 
+ sumf += GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
 }
 
 *s = sumf;

 #else
 float sumf = 0;
 for (int ibl = 0; ibl < nb; ++ibl) {
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
 uint16_t h = x[ibl].scales_h;
 const uint8_t * qs = x[ibl].qs;
 const int8_t * q8 = y[ibl].qs;
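Editor's note: the s390 file above is where the CPU-scoped macros actually pay off, since NNPA provides hardware FP16<->FP32 conversion. For contrast, a hedged sketch of what a purely software FP16->FP32 widening costs per element on targets without such support; this is a generic IEEE-754 half-to-single conversion written for this note, not ggml's exact fallback code.

```c
#include <stdint.h>
#include <string.h>

/* Generic IEEE-754 binary16 -> binary32 widening (illustration only).
 * On s390x with NNPA the GGML_CPU_* macros can replace all of this
 * branching and bit-twiddling with a single vector convert. */
static inline float fp16_bits_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {                   /* zero or subnormal */
        if (mant == 0) {
            bits = sign;
        } else {                      /* renormalize the subnormal */
            exp = 127 - 15 + 1;
            while ((mant & 0x400u) == 0) { mant <<= 1; exp--; }
            mant &= 0x3FFu;
            bits = sign | (exp << 23) | (mant << 13);
        }
    } else if (exp == 0x1Fu) {        /* inf or NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    } else {                          /* normal: rebias exponent 15 -> 127 */
        bits = sign | ((exp + 112) << 23) | (mant << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);      /* bit-exact reinterpretation */
    return f;
}
```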
ggml/src/ggml-cpu/arch/wasm/quants.c CHANGED
@@ -3,6 +3,7 @@
3
  #include "ggml-quants.h"
4
  #include "ggml-impl.h"
5
  #include "ggml-cpu.h"
 
6
 
7
  #include "../../quants.h"
8
  #include "../../ggml-cpu-impl.h"
@@ -65,7 +66,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
65
  const float d = amax / ((1 << 7) - 1);
66
  const float id = d ? 1.0f/d : 0.0f;
67
 
68
- y[i].d = GGML_FP32_TO_FP16(d);
69
 
70
  for (int j = 0; j < 8; j++) {
71
  const v128_t v = wasm_f32x4_mul(srcv[j], wasm_f32x4_splat(id));
@@ -110,7 +111,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
110
  const float d = amax / ((1 << 7) - 1);
111
  const float id = d ? 1.0f/d : 0.0f;
112
 
113
- y[i].d = GGML_FP32_TO_FP16(d);
114
 
115
  v128_t accv = wasm_i32x4_splat(0);
116
 
@@ -126,7 +127,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
126
  accv = wasm_i32x4_add(accv, vi);
127
  }
128
 
129
- y[i].s = GGML_FP32_TO_FP16(
130
  d * (wasm_i32x4_extract_lane(accv, 0) +
131
  wasm_i32x4_extract_lane(accv, 1) +
132
  wasm_i32x4_extract_lane(accv, 2) +
@@ -324,8 +325,8 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  );
 
  // Accumulate results with scaling
- float scale0 = GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d);
- float scale1 = GGML_FP16_TO_FP32(x1->d) * GGML_FP16_TO_FP32(y1->d);
+ float scale0 = GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d);
+ float scale1 = GGML_CPU_FP16_TO_FP32(x1->d) * GGML_CPU_FP16_TO_FP32(y1->d);
 
  sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(dp0), wasm_f32x4_splat(scale0)));
  sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(dp1), wasm_f32x4_splat(scale1)));
@@ -348,7 +349,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
  }
 
  *s = sumf;
@@ -428,7 +429,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  wasm_i32x4_dot_i16x8(v0lfh, v1lh)),
  wasm_i32x4_add(wasm_i32x4_dot_i16x8(v0hfl, v1hl),
  wasm_i32x4_dot_i16x8(v0hfh, v1hh)))),
- wasm_f32x4_splat(GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d))));
+ wasm_f32x4_splat(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d))));
  }
 
  sumf = wasm_f32x4_extract_lane(sumv, 0) + wasm_f32x4_extract_lane(sumv, 1) +
@@ -454,7 +455,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
  }
 
  *s = sumf;
@@ -491,7 +492,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  const block_q5_1 * GGML_RESTRICT x0 = &x[ib];
  const block_q8_1 * GGML_RESTRICT y0 = &y[ib];
 
- summs += GGML_FP16_TO_FP32(x0->m) * GGML_FP16_TO_FP32(y0->s);
+ summs += GGML_CPU_FP16_TO_FP32(x0->m) * GGML_CPU_FP16_TO_FP32(y0->s);
 
  const v128_t m4b = wasm_i8x16_splat(0x0F);
 
@@ -538,7 +539,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  wasm_i32x4_dot_i16x8(v0lfh, v1lh)),
  wasm_i32x4_add(wasm_i32x4_dot_i16x8(v0hfl, v1hl),
  wasm_i32x4_dot_i16x8(v0hfh, v1hh)))),
- wasm_f32x4_splat(GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d))));
+ wasm_f32x4_splat(GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d))));
  }
 
  sumf = wasm_f32x4_extract_lane(sumv, 0) + wasm_f32x4_extract_lane(sumv, 1) +
@@ -564,7 +565,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
  }
 
  *s = sumf;
@@ -620,7 +621,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  const v128_t sum_dots = wasm_i32x4_add(wasm_i32x4_add(dx0_0, dx0_1), wasm_i32x4_add(dx1_0, dx1_1));
 
  // Convert to float and accumulate
- const float scale = GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d);
+ const float scale = GGML_CPU_FP16_TO_FP32(x0->d) * GGML_CPU_FP16_TO_FP32(y0->d);
  sumv = wasm_f32x4_add(sumv, wasm_f32x4_mul(wasm_f32x4_convert_i32x4(sum_dots), wasm_f32x4_splat(scale)));
  }
 
@@ -635,7 +636,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  sumi += x[ib].qs[j]*y[ib].qs[j];
  }
 
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
  }
 
  *s = sumf;
@@ -746,8 +747,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  isum += wasm_i32x4_extract_lane(isum_vec, 0);
  }
 
- const float dall = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+ const float dall = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
  sumf += dall * isum - dmin * summs;
  }
 
@@ -768,8 +769,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  summs += y[i].bsums[j] * (sc[j] >> 4);
  }
 
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  int isum = 0;
  int is = 0;
@@ -880,7 +881,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  // Accumulate results
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const v128_t v_d = wasm_f32x4_splat(d);
  v128_t v_sum = wasm_f32x4_add(
  wasm_f32x4_mul(wasm_f32x4_convert_i32x4(v_acc0), v_d),
@@ -957,7 +958,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -991,8 +992,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  float sumf = 0;
 
  for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin); // Corrected sign
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin); // Corrected sign
 
  const uint8_t * GGML_RESTRICT q4 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1136,9 +1137,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
  sumf -= dmin * sumi;
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1170,8 +1171,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  float sumf = 0;
 
  for (int i = 0; i < nb; ++i) {
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin); // Fixed sign
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin); // Fixed sign
 
  const uint8_t * GGML_RESTRICT q5 = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -1331,9 +1332,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
  sumf -= dmin * sumi;
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1420,7 +1421,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  wasm_v128_store(&aux32[0], acc0);
  wasm_v128_store(&aux32[4], acc1);
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) {
  sums[l] += d * aux32[l];
  }
@@ -1470,7 +1471,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
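Note: every arch backend now includes simd-mappings.h and spells the scalar conversions GGML_CPU_FP16_TO_FP32 / GGML_CPU_FP32_TO_FP16, so each target can route them to its fastest instruction (NNPA on s390x, F16C on x86, and so on) behind one name. A hedged sketch of the dispatch idea, with made-up macro and helper names rather than the header's actual definitions:

    /* Illustrative per-arch dispatch; names and intrinsics are placeholders. */
    #if defined(__NNPA__)
        /* s390x: convert through DLFLOAT16 with NNPA vector instructions */
        #define MY_CPU_FP16_TO_FP32(x) my_nnpa_fp16_to_fp32(x)
    #elif defined(__F16C__)
        #include <immintrin.h>
        /* x86: scalar F16C round trip through the low vector lane */
        #define MY_CPU_FP16_TO_FP32(x) \
            _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128((x))))
    #else
        /* portable fallback, e.g. a lookup table or bit manipulation */
        #define MY_CPU_FP16_TO_FP32(x) my_scalar_fp16_to_fp32(x)
    #endif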
ggml/src/ggml-cpu/arch/x86/quants.c CHANGED
@@ -3,6 +3,7 @@
  #include "ggml-quants.h"
  #include "ggml-impl.h"
  #include "ggml-cpu.h"
+ #include "simd-mappings.h"
 
  #include "../../quants.h"
  #include "../../ggml-cpu-impl.h"
@@ -256,9 +257,9 @@ static inline __m256 mul_sum_i8_quad_float(const __m128i x_1_0, const __m128i x_
 
  // quad fp16 delta calculation
  static inline __m256 quad_fp16_delta_float(const float x0, const float y0, const float x1, const float y1) {
- // GGML_FP16_TO_FP32 is faster than Intel F16C
- return _mm256_set_m128(_mm_set1_ps(GGML_FP16_TO_FP32(x1) * GGML_FP16_TO_FP32(y1)),
- _mm_set1_ps(GGML_FP16_TO_FP32(x0) * GGML_FP16_TO_FP32(y0)));
+ // GGML_CPU_FP16_TO_FP32 is faster than Intel F16C
+ return _mm256_set_m128(_mm_set1_ps(GGML_CPU_FP16_TO_FP32(x1) * GGML_CPU_FP16_TO_FP32(y1)),
+ _mm_set1_ps(GGML_CPU_FP16_TO_FP32(x0) * GGML_CPU_FP16_TO_FP32(y0)));
  }
  #endif
  #elif defined(__SSSE3__)
@@ -305,7 +306,7 @@ void quantize_row_q8_0(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 
  // Quantize these floats
  const float d = maxScalar / 127.f;
- y[i].d = GGML_FP32_TO_FP16(d);
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
  const float id = ( maxScalar != 0.0f ) ? 127.f / maxScalar : 0.0f;
  const __m256 mul = _mm256_set1_ps( id );
 
@@ -401,7 +402,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 
  // Quantize these floats
  const float d = max_scalar / 127.f;
- y[i].d = GGML_FP32_TO_FP16(d);
+ y[i].d = GGML_CPU_FP32_TO_FP16(d);
  const float id = ( max_scalar != 0.0f ) ? 127.f / max_scalar : 0.0f;
  const __m256 mul = _mm256_set1_ps( id );
 
@@ -425,7 +426,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
 
  #if defined(__AVX2__)
  // Compute the sum of the quants and set y[i].s
- y[i].s = GGML_FP32_TO_FP16(d * hsum_i32_8(_mm256_add_epi32(_mm256_add_epi32(i0, i1), _mm256_add_epi32(i2, i3))));
+ y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_8(_mm256_add_epi32(_mm256_add_epi32(i0, i1), _mm256_add_epi32(i2, i3))));
 
  // Convert int32 to int16
  i0 = _mm256_packs_epi32( i0, i1 ); // 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15
@@ -455,7 +456,7 @@ void quantize_row_q8_1(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, i
  // Compute the sum of the quants and set y[i].s
  const __m128i s0 = _mm_add_epi32(_mm_add_epi32(ni0, ni1), _mm_add_epi32(ni2, ni3));
  const __m128i s1 = _mm_add_epi32(_mm_add_epi32(ni4, ni5), _mm_add_epi32(ni6, ni7));
- y[i].s = GGML_FP32_TO_FP16(d * hsum_i32_4(_mm_add_epi32(s0, s1)));
+ y[i].s = GGML_CPU_FP32_TO_FP16(d * hsum_i32_4(_mm_add_epi32(s0, s1)));
 
  // Convert int32 to int16
  ni0 = _mm_packs_epi32( ni0, ni1 );
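Note: the comment kept in quad_fp16_delta_float is a deliberate design note — for just two fp16 products it is cheaper to convert with the scalar GGML_CPU_FP16_TO_FP32 macro and splat than to round-trip through F16C. For contrast, an F16C variant of the same helper could look like this (illustrative only, not what the kernel uses; requires -mf16c):

    #include <immintrin.h>
    #include <stdint.h>

    /* F16C alternative to quad_fp16_delta_float: convert four fp16 scales at
     * once, then splat x0*y0 into the low 128 bits and x1*y1 into the high. */
    static inline __m256 quad_fp16_delta_f16c(uint16_t x0, uint16_t y0,
                                              uint16_t x1, uint16_t y1) {
        const __m128i h = _mm_set_epi16(0, 0, 0, 0,
                                        (short) y1, (short) x1,
                                        (short) y0, (short) x0);
        float v[4];
        _mm_storeu_ps(v, _mm_cvtph_ps(h));                /* v = {x0, y0, x1, y1} */
        return _mm256_set_m128(_mm_set1_ps(v[2] * v[3]),  /* x1*y1 -> high lanes */
                               _mm_set1_ps(v[0] * v[1])); /* x0*y0 -> low lanes  */
    }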
@@ -552,7 +553,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  // Main loop
  for (; ib < nb; ++ib) {
  /* Compute combined scale for the block */
- const __m256 d = _mm256_set1_ps( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
+ const __m256 d = _mm256_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );
 
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
 
@@ -613,7 +614,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  _mm_prefetch(&y[ib] + sizeof(block_q8_0), _MM_HINT_T0);
 
  // Compute combined scale for the block 0 and 1
- const __m128 d_0_1 = _mm_set1_ps( GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d) );
+ const __m128 d_0_1 = _mm_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d) );
 
  const __m128i tmp_0_1 = _mm_loadu_si128((const __m128i *)x[ib].qs);
 
@@ -631,7 +632,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  _mm_prefetch(&y[ib] + 2 * sizeof(block_q8_0), _MM_HINT_T0);
 
  // Compute combined scale for the block 2 and 3
- const __m128 d_2_3 = _mm_set1_ps( GGML_FP16_TO_FP32(x[ib + 1].d) * GGML_FP16_TO_FP32(y[ib + 1].d) );
+ const __m128 d_2_3 = _mm_set1_ps( GGML_CPU_FP16_TO_FP32(x[ib + 1].d) * GGML_CPU_FP16_TO_FP32(y[ib + 1].d) );
 
  const __m128i tmp_2_3 = _mm_loadu_si128((const __m128i *)x[ib + 1].qs);
 
@@ -680,7 +681,7 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
  }
 
  *s = sumf;
@@ -711,10 +712,10 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  // Main loop
  for (; ib < nb; ++ib) {
- const float d0 = GGML_FP16_TO_FP32(x[ib].d);
- const float d1 = GGML_FP16_TO_FP32(y[ib].d);
+ const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
+ const float d1 = GGML_CPU_FP16_TO_FP32(y[ib].d);
 
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
 
  const __m256 d0v = _mm256_set1_ps( d0 );
  const __m256 d1v = _mm256_set1_ps( d1 );
@@ -752,7 +753,7 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
  }
 
  *s = sumf;
@@ -783,7 +784,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  // Main loop
  for (; ib < nb; ++ib) {
  /* Compute combined scale for the block */
- const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+ const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
 
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -807,7 +808,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  // Main loop
  for (; ib < nb; ++ib) {
  /* Compute combined scale for the block */
- const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+ const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
 
  __m256i bx_0 = bytes_from_nibbles_32(x[ib].qs);
  const __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -851,7 +852,7 @@ void ggml_vec_dot_q5_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
  }
 
  *s = sumf;
@@ -883,16 +884,16 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  // Main loop
  for (; ib < nb; ++ib) {
- const __m256 dx = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d));
+ const __m256 dx = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d));
 
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
 
  __m256i qx = bytes_from_nibbles_32(x[ib].qs);
  __m256i bxhi = bytes_from_bits_32(x[ib].qh);
  bxhi = _mm256_and_si256(bxhi, _mm256_set1_epi8(0x10));
  qx = _mm256_or_si256(qx, bxhi);
 
- const __m256 dy = _mm256_set1_ps(GGML_FP16_TO_FP32(y[ib].d));
+ const __m256 dy = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib].d));
  const __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);
 
  const __m256 q = mul_sum_us8_pairs_float(qx, qy);
@@ -910,9 +911,9 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  // Main loop
  for (; ib < nb; ++ib) {
- const __m256 dx = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d));
+ const __m256 dx = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d));
 
- summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+ summs += GGML_CPU_FP16_TO_FP32(x[ib].m) * GGML_CPU_FP16_TO_FP32(y[ib].s);
 
  __m256i bx_0 = bytes_from_nibbles_32(x[ib].qs);
  const __m256i bxhi = bytes_from_bits_32(x[ib].qh);
@@ -926,7 +927,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  bxh = _mm_or_si128(bxh, bxhih);
  bx_0 = MM256_SET_M128I(bxh, bxl);
 
- const __m256 dy = _mm256_set1_ps(GGML_FP16_TO_FP32(y[ib].d));
+ const __m256 dy = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib].d));
  const __m256i by_0 = _mm256_loadu_si256((const __m256i *)y[ib].qs);
 
  const __m256 q = mul_sum_us8_pairs_float(bx_0, by_0);
@@ -956,7 +957,7 @@ void ggml_vec_dot_q5_1_q8_1(int n, float * GGML_RESTRICT s, size_t bs, const voi
  }
 
  int sumi = sumi0 + sumi1;
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
  }
 
  *s = sumf;
@@ -986,7 +987,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  // Main loop
  for (; ib < nb; ++ib) {
  // Compute combined scale for the block
- const __m256 d = _mm256_set1_ps(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+ const __m256 d = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
  __m256i qx = _mm256_loadu_si256((const __m256i *)x[ib].qs);
  __m256i qy = _mm256_loadu_si256((const __m256i *)y[ib].qs);
 
@@ -1025,7 +1026,7 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const voi
  sumi += x[ib].qs[j]*y[ib].qs[j];
  }
 
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
  }
 
  *s = sumf;
@@ -1144,7 +1145,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  }
 
  const __m256i ysum = _mm256_loadu_si256((const __m256i *) y[i].bsums);
- const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(x[i].d));
+ const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d));
 
  sumi0 = _mm256_sub_epi16(sumi0, ysum);
  sumi0 = _mm256_add_epi16(sumi0, _mm256_add_epi16(sumi1, sumi2));
@@ -1190,7 +1191,7 @@ void ggml_vec_dot_tq1_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  }
  }
 
- sumf += (float) sum * (GGML_FP16_TO_FP32(x[i].d) * y[i].d);
+ sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
  }
 
  *s = sumf;
@@ -1244,7 +1245,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  }
 
  const __m256i ysum = _mm256_loadu_si256((const __m256i *) y[i].bsums);
- const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(x[i].d));
+ const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d));
 
  sumi0 = _mm256_add_epi16(sumi0, sumi1);
  sumi0 = _mm256_sub_epi16(sumi0, ysum);
@@ -1269,7 +1270,7 @@ void ggml_vec_dot_tq2_0_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  }
  }
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
  sumf += (float) sumi * d;
  }
@@ -1299,8 +1300,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  const uint8_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1366,8 +1367,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  const uint8_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1477,8 +1478,8 @@ void ggml_vec_dot_q2_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  summs += y[i].bsums[j] * (sc[j] >> 4);
  }
 
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  int isum = 0;
  int is = 0;
@@ -1533,7 +1534,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1638,7 +1639,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -1824,7 +1825,7 @@ void ggml_vec_dot_q3_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -1862,8 +1863,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  memcpy(utmp, x[i].scales, 12);
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -1928,8 +1929,8 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  const uint8_t * GGML_RESTRICT q4 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2049,9 +2050,9 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
  sumf -= dmin * sumi;
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2092,8 +2093,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  const uint8_t * GGML_RESTRICT q5 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  memcpy(utmp, x[i].scales, 12);
  utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
@@ -2170,8 +2171,8 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
- const float dmin = -y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
+ const float dmin = -y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
 
  const uint8_t * GGML_RESTRICT q5 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -2311,9 +2312,9 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
  sumf -= dmin * sumi;
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2344,7 +2345,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
  const uint8_t * GGML_RESTRICT q4 = x[i].ql;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -2422,7 +2423,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
 
  for (int i = 0; i < nb; ++i) {
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
 
  const uint8_t * GGML_RESTRICT q4 = x[i].ql;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
@@ -2555,7 +2556,7 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const voi
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
  q8 += 8; a += 8;
  }
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
  }
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -2622,7 +2623,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
  __m256i sumi1 = _mm256_setzero_si256();
@@ -2663,7 +2664,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
  __m128i sumi1_0 = _mm_setzero_si128();
@@ -2717,7 +2718,7 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  float sumf = 0.f;
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
  int32_t bsum = 0;
@@ -2792,7 +2793,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
@@ -2913,7 +2914,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
@@ -3035,7 +3036,7 @@ void ggml_vec_dot_iq2_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
 
  float sumf = 0.f;
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3104,7 +3105,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -3177,7 +3178,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)(x[i].qs + QK_K/8);
@@ -3253,7 +3254,7 @@ void ggml_vec_dot_iq2_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  float sumf = 0;
  for (int i = 0; i < nb; i++) {
 
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const int8_t * q8 = y[i].qs;
  const uint8_t * qs = x[i].qs;
  const uint8_t * qh = x[i].qh;
@@ -3313,7 +3314,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3358,7 +3359,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3414,7 +3415,7 @@ void ggml_vec_dot_iq3_xxs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const
 
  float sumf = 0.f;
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -3480,7 +3481,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -3565,7 +3566,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
  __m256 accumf = _mm256_setzero_ps();
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
  const uint16_t * GGML_RESTRICT signs = (const uint16_t *)x[i].signs;
@@ -3648,7 +3649,7 @@ void ggml_vec_dot_iq3_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
 
  float sumf = 0.f;
  for (int i = 0; i < nb; ++i) {
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -3753,7 +3754,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
  }
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
  accum = _mm256_fmadd_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(sumi), accum);
  accum1 += d * sumi1;
 
@@ -3801,7 +3802,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  + (y[i].bsums[2*ib+2] + y[i].bsums[2*ib+3]) * (qh[ib+1] & 0x8000 ? -1 : 1) * ls2;
  }
 
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
  accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(MM256_SET_M128I(sumi1_1, sumi1_0))), accum);
  accum1 += d * sumi1;
 
@@ -3835,7 +3836,7 @@ void ggml_vec_dot_iq1_s_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  qs += 4;
  }
 
- sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
+ sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
  }
 
  *s = sumf;
@@ -3947,7 +3948,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  qs += 8; qh += 4;
  }
 
- const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(scale.f16));
+ const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16));
 
  accum1 = _mm256_fmadd_ps(d, _mm256_cvtepi32_ps(sumi1), accum1);
  accum2 = _mm256_fmadd_ps(d, _mm256_cvtepi32_ps(sumi2), accum2);
@@ -4033,7 +4034,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  qs += 8; qh += 4;
  }
 
- const __m256 d = _mm256_set1_ps(y[i].d * GGML_FP16_TO_FP32(scale.f16));
+ const __m256 d = _mm256_set1_ps(y[i].d * GGML_CPU_FP16_TO_FP32(scale.f16));
 
  accum1 = _mm256_add_ps(_mm256_mul_ps(d, _mm256_cvtepi32_ps(MM256_SET_M128I(sumi1_1, sumi1_0))), accum1);
  accum2 = _mm256_add_ps(_mm256_mul_ps(d, _mm256_cvtepi32_ps(MM256_SET_M128I(sumi2_1, sumi2_0))), accum2);
@@ -4083,7 +4084,7 @@ void ggml_vec_dot_iq1_m_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  qh += 2;
  }
 
- sumf += GGML_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
+ sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
  }
 
  *s = sumf;
@@ -4129,9 +4130,9 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
  const __m256i p16_2 = mul_add_epi8(q4b_2, q8b_2);
  const __m256i p_1 = _mm256_madd_epi16(p16_1, mone);
  const __m256i p_2 = _mm256_madd_epi16(p16_2, mone);
- accum1 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(y[ib + 0].d)*GGML_FP16_TO_FP32(x[ib + 0].d)),
+ accum1 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 0].d)*GGML_CPU_FP16_TO_FP32(x[ib + 0].d)),
  _mm256_cvtepi32_ps(p_1), accum1);
- accum2 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(y[ib + 1].d)*GGML_FP16_TO_FP32(x[ib + 1].d)),
+ accum2 = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y[ib + 1].d)*GGML_CPU_FP16_TO_FP32(x[ib + 1].d)),
  _mm256_cvtepi32_ps(p_2), accum2);
  }
 
@@ -4164,7 +4165,7 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const v
 
  #endif
  for (; ib < nb; ++ib) {
- const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
  int sumi1 = 0, sumi2 = 0;
  for (int j = 0; j < QK4_NL/2; ++j) {
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -4219,7 +4220,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
  sumi1 = _mm256_add_epi32(p_1, sumi1);
  sumi2 = _mm256_add_epi32(p_2, sumi2);
  }
- accum = _mm256_fmadd_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
+ accum = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
  _mm256_cvtepi32_ps(_mm256_add_epi32(sumi1, sumi2)), accum);
  }
 
@@ -4267,7 +4268,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
  }
  __m128i sumi12_0 = _mm_add_epi32(sumi1_0, sumi2_0);
  __m128i sumi12_1 = _mm_add_epi32(sumi1_1, sumi2_1);
- accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(GGML_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
+ accum = _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(x[ibl].d)*y[ibl].d),
  _mm256_cvtepi32_ps(MM256_SET_M128I(sumi12_1, sumi12_0))), accum);
  }
 
@@ -4276,7 +4277,7 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const v
  #else
  float sumf = 0;
  for (int ibl = 0; ibl < nb; ++ibl) {
- const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
  uint16_t h = x[ibl].scales_h;
  const uint8_t * qs = x[ibl].qs;
  const int8_t * q8 = y[ibl].qs;
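Note: a shape worth calling out in the K-quant hunks above — each superblock carries two fp16 scales, d for the quantized values and dmin for the subblock minima, and the block update is sumf += dall * isum - dmin * summs. Reduced to scalars (hypothetical signature; fp16 values passed as raw uint16_t):

    #include <stdint.h>

    /* Per-superblock accumulation shared by the q2_K..q5_K paths above.
     * isum:  integer dot product of the quants with the q8 values
     * summs: integer dot product of the subblock mins with the q8 block sums */
    static float kquant_block_update(float sumf,
                                     uint16_t d_f16, uint16_t dmin_f16,
                                     float yd, int isum, int summs,
                                     float (*fp16_to_fp32)(uint16_t)) {
        const float dall = yd * fp16_to_fp32(d_f16);
        const float dmin = yd * fp16_to_fp32(dmin_f16);
        return sumf + dall * (float) isum - dmin * (float) summs;
    }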
ggml/src/ggml-cpu/arch/x86/repack.cpp CHANGED
@@ -6,6 +6,7 @@
  #include "ggml-impl.h"
  #include "ggml-cpu.h"
  #include "ggml-cpu-impl.h"
+ #include "simd-mappings.h"
  #include "traits.h"

  #include <cmath>
@@ -39,11 +40,11 @@ static inline __m512 __avx512_f32cx8x2_load(ggml_fp16_t *x, ggml_fp16_t *y) {
  float tmp[16];

  for (int i = 0; i < 8; i++) {
- tmp[i] = GGML_FP16_TO_FP32(x[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  for (int i = 0; i < 8; i++) {
- tmp[i + 8] = GGML_FP16_TO_FP32(y[i]);
+ tmp[i + 8] = GGML_CPU_FP16_TO_FP32(y[i]);
  }

  return _mm512_loadu_ps(tmp);
@@ -54,10 +55,10 @@ static inline __m512 __avx512_repeat_f32cx16_load(__m128i x) {
  _mm_storeu_si128((__m128i*)tmphalf, x);

  for (int i = 0; i < 4; i++) {
- tmp[i] = GGML_FP16_TO_FP32(tmphalf[i]);
- tmp[i + 4] = GGML_FP16_TO_FP32(tmphalf[i]);
- tmp[i + 8] = GGML_FP16_TO_FP32(tmphalf[i]);
- tmp[i + 12] = GGML_FP16_TO_FP32(tmphalf[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 4] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 8] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
+ tmp[i + 12] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
  }

  return _mm512_loadu_ps(tmp);
@@ -67,7 +68,7 @@ static inline __m256 __avx_f32cx8_load(ggml_fp16_t *x) {
  float tmp[8];

  for (int i = 0; i < 8; i++) {
- tmp[i] = GGML_FP16_TO_FP32(x[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  return _mm256_loadu_ps(tmp);
@@ -76,8 +77,8 @@ static inline __m256 __avx_repeat_f32cx8_load(ggml_fp16_t *x) {
  float tmp[8];

  for (int i = 0; i < 4; i++) {
- tmp[i] = GGML_FP16_TO_FP32(x[i]);
- tmp[i + 4] = GGML_FP16_TO_FP32(x[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
+ tmp[i + 4] = GGML_CPU_FP16_TO_FP32(x[i]);
  }

  return _mm256_loadu_ps(tmp);
@@ -88,7 +89,7 @@ static inline __m256 __avx_rearranged_f32cx8_load(ggml_fp16_t *x, __m128i arrang

  _mm_storeu_si128((__m128i*)tmphalf, _mm_shuffle_epi8(_mm_loadu_si128((const __m128i *) x), arrangeMask));
  for (int i = 0; i < 8; i++) {
- tmp[i] = GGML_FP16_TO_FP32(tmphalf[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(tmphalf[i]);
  }

  return _mm256_loadu_ps(tmp);
@@ -211,7 +212,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
  id[row_iter] = ( maxScalar != 0.0f ) ? 127.f / maxScalar : 0.0f; //d ? 1.0f / d : 0.0f;

  // Store the scale for the individual block
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);

  // Store the values in blocks of eight values - Aim is to use these later for block interleaving
  srcv[row_iter][0] = v0;
@@ -297,7 +298,7 @@ void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTR
  const float d = amax / ((1 << 7) - 1);
  id[row_iter] = d ? 1.0f / d : 0.0f;

- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
  }

  for (int j = 0; j < QK8_0 * 4; j++) {
@@ -647,7 +648,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  const __m256 col_scale_f32 = GGML_F32Cx8_REARRANGE_LOAD(b_ptr[b].d, changemask);

  // Load and convert to FP32 scale from block_q8_0
- const __m256 row_scale_f32 = _mm256_set1_ps(GGML_FP16_TO_FP32(a_ptr[b].d));
+ const __m256 row_scale_f32 = _mm256_set1_ps(GGML_CPU_FP16_TO_FP32(a_ptr[b].d));

  // Load the block values in block_q8_0 in batches of 16 bytes and replicate the same across 256 bit vector
  __m256i lhs_vec_0 = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)a_ptr[b].qs));
@@ -706,7 +707,7 @@ void ggml_gemv_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
  }
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
  }
  }
  }
@@ -972,13 +973,13 @@ void ggml_gemv_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi2 = sumi2 * scales_1[j];
  sumi += sumi1 + sumi2;
  }
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
  }
  }
  for (int sb = 0; sb < 8; sb++) {
  uint8_t *mins = (uint8_t*) utmp + 8 + sb * 16;
  for (int j = 0; j < ncols_interleaved; j++) {
- sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
+ sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
  }
  }
  }
@@ -1755,7 +1756,7 @@ void ggml_gemm_q4_0_8x8_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
  }
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
  }
  }
  }
@@ -3259,7 +3260,7 @@ void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  sumi2 = sumi2 * scales_1[j];
  sumi += sumi1 + sumi2;
  }
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
  }
  }
  }
@@ -3268,7 +3269,7 @@ void ggml_gemm_q4_K_8x8_q8_K(int n, float * GGML_RESTRICT s, size_t bs, const vo
  for(int m = 0; m < 4; m++) {
  const int16_t *bsums = a_ptr[l].bsums + (sb * 8) + (m * 4) - ((sb % 2) * 6);
  for(int j = 0; j < ncols_interleaved; j++) {
- sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
+ sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
  }
  }
  }
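All of the __avx*_load helpers above stage fp16 scales through a float buffer with GGML_CPU_FP16_TO_FP32. For reference, here is a self-contained C sketch of what an fp16-to-fp32 decode computes; this is a generic IEEE binary16 converter written for this note, not the macro's actual definition in simd-mappings.h:

#include <stdint.h>
#include <string.h>

/* scalar IEEE binary16 -> binary32; handles zeros, subnormals, inf and NaN */
float fp16_to_fp32_scalar(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                  /* +/- zero */
        } else {                                          /* subnormal: renormalize */
            int e = -1;
            do { man <<= 1; e++; } while ((man & 0x400) == 0);
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | ((man & 0x3ff) << 13);
        }
    } else if (exp == 0x1f) {
        bits = sign | 0x7f800000 | (man << 13);           /* inf / NaN */
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);                          /* bit-cast, no UB */
    return f;
}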
ggml/src/ggml-cpu/common.h CHANGED
@@ -4,6 +4,7 @@
  #include "traits.h"
  #include "ggml-cpu-impl.h"
  #include "ggml-impl.h"
+ #include "simd-mappings.h"

  #ifdef __cplusplus
@@ -12,11 +13,11 @@
  // convenience functions/macros for use in template calls
  // note: these won't be required after the 'traits' lookup table is used.
  static inline ggml_fp16_t f32_to_f16(float x) {
- return GGML_FP32_TO_FP16(x);
+ return GGML_CPU_FP32_TO_FP16(x);
  }

  static inline float f16_to_f32(ggml_fp16_t x) {
- return GGML_FP16_TO_FP32(x);
+ return GGML_CPU_FP16_TO_FP32(x);
  }

  static inline ggml_bf16_t f32_to_bf16(float x) {
ggml/src/ggml-cpu/ggml-cpu-impl.h CHANGED
@@ -62,11 +62,17 @@ struct ggml_compute_params {
  #if defined(__s390x__) && defined(__VEC__)
  #ifndef __VXE__
  #define __VXE__
- #endif
+ #endif // __VXE__
  #ifndef __VXE2__
  #define __VXE2__
- #endif
- #endif
+ #endif // __VXE2__
+ #endif // __s390x__ && __VEC__
+
+ #if defined(__s390x__) && defined(GGML_NNPA)
+ #ifndef __NNPA__
+ #define __NNPA__
+ #endif // __NNPA__
+ #endif // __s390x__ && GGML_NNPA

  #if defined(__ARM_FEATURE_SVE)
  #include <sys/prctl.h>
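The pattern above is worth noting: CMake's GGML_NNPA option only sets a preprocessor define, and this header canonicalizes it into __NNPA__ so the rest of the CPU backend can test a single symbol, exactly as it already does for __VXE__/__VXE2__. A trivial sketch of consuming that gate (the program and message are illustrative):

#include <stdio.h>

int main(void) {
#if defined(__NNPA__)
    /* only defined on s390x builds configured with GGML_NNPA */
    puts("fp16<->fp32 conversions: NNPA vector path");
#else
    puts("fp16<->fp32 conversions: scalar or lookup-table path");
#endif
    return 0;
}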
ggml/src/ggml-cpu/ggml-cpu.c CHANGED
@@ -72,6 +72,9 @@
  #define UNUSED GGML_UNUSED
  #define SWAP(x, y, T) do { T SWAP = x; (x) = y; (y) = SWAP; } while (0)

+ // precomputed f32 table for f16 (256 KB) (simd-mappings.h)
+ float ggml_table_f32_f16[1 << 16];
+
  #if defined(__ARM_ARCH)
  struct ggml_arm_arch_features_type {
  int sve_cnt;
@@ -736,7 +739,7 @@ struct ggml_tensor * ggml_set_i32 (struct ggml_tensor * tensor, int32_t value) {
  {
  assert(tensor->nb[0] == sizeof(ggml_fp16_t));
  for (int i = 0; i < n; i++) {
- ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_FP32_TO_FP16(value));
+ ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_CPU_FP32_TO_FP16(value));
  }
  } break;
  case GGML_TYPE_BF16:
@@ -795,7 +798,7 @@ struct ggml_tensor * ggml_set_f32(struct ggml_tensor * tensor, float value) {
  {
  assert(tensor->nb[0] == sizeof(ggml_fp16_t));
  for (int i = 0; i < n; i++) {
- ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_FP32_TO_FP16(value));
+ ggml_vec_set_f16(nc, (ggml_fp16_t *)(data + i*n1), GGML_CPU_FP32_TO_FP16(value));
  }
  } break;
  case GGML_TYPE_BF16:
@@ -846,7 +849,7 @@ int32_t ggml_get_i32_1d(const struct ggml_tensor * tensor, int i) {
  case GGML_TYPE_F16:
  {
  GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
- return GGML_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
  }
  case GGML_TYPE_BF16:
  {
@@ -891,7 +894,7 @@ void ggml_set_i32_1d(const struct ggml_tensor * tensor, int i, int32_t value) {
  case GGML_TYPE_F16:
  {
  GGML_ASSERT(tensor->nb[0] == sizeof(ggml_fp16_t));
- ((ggml_fp16_t *)(tensor->data))[i] = GGML_FP32_TO_FP16(value);
+ ((ggml_fp16_t *)(tensor->data))[i] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {
@@ -920,7 +923,7 @@ int32_t ggml_get_i32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i
  case GGML_TYPE_I32:
  return ((int32_t *) data)[0];
  case GGML_TYPE_F16:
- return GGML_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
  case GGML_TYPE_BF16:
  return GGML_BF16_TO_FP32(((ggml_bf16_t *) data)[0]);
  case GGML_TYPE_F32:
@@ -947,7 +950,7 @@ void ggml_set_i32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(data))[0] = GGML_FP32_TO_FP16(value);
+ ((ggml_fp16_t *)(data))[0] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {
@@ -985,7 +988,7 @@ float ggml_get_f32_1d(const struct ggml_tensor * tensor, int i) {
  }
  case GGML_TYPE_F16:
  {
- return GGML_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *)(tensor->data))[i]);
  }
  case GGML_TYPE_BF16:
  {
@@ -1024,7 +1027,7 @@ void ggml_set_f32_1d(const struct ggml_tensor * tensor, int i, float value) {
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(tensor->data))[i] = GGML_FP32_TO_FP16(value);
+ ((ggml_fp16_t *)(tensor->data))[i] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {
@@ -1051,7 +1054,7 @@ float ggml_get_f32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  case GGML_TYPE_I32:
  return ((int32_t *) data)[0];
  case GGML_TYPE_F16:
- return GGML_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
+ return GGML_CPU_FP16_TO_FP32(((ggml_fp16_t *) data)[0]);
  case GGML_TYPE_BF16:
  return GGML_BF16_TO_FP32(((ggml_bf16_t *) data)[0]);
  case GGML_TYPE_F32:
@@ -1078,7 +1081,7 @@ void ggml_set_f32_nd(const struct ggml_tensor * tensor, int i0, int i1, int i2,
  } break;
  case GGML_TYPE_F16:
  {
- ((ggml_fp16_t *)(data))[0] = GGML_FP32_TO_FP16(value);
+ ((ggml_fp16_t *)(data))[0] = GGML_CPU_FP32_TO_FP16(value);
  } break;
  case GGML_TYPE_BF16:
  {
@@ -3141,9 +3144,24 @@ void ggml_cpu_fp32_to_fp16(const float * x, ggml_fp16_t * y, int64_t n) {
  __m128i y_vec = _mm_cvtps_ph(x_vec, _MM_FROUND_TO_NEAREST_INT);
  _mm_storel_epi64((__m128i *)(y + i), y_vec);
  }
+ #elif defined(__NNPA__)
+ for (; i + 7 < n; i += 8) {
+ float32x4_t v_xh = vec_xl(0, (const float *)(x + i + 0));
+ float32x4_t v_xl = vec_xl(0, (const float *)(x + i + 4));
+ uint16x8_t v_yd = vec_round_from_fp32(v_xh, v_xl, 0);
+ uint16x8_t v_y = vec_convert_to_fp16(v_yd, 0);
+ vec_xst(v_y, 0, (ggml_fp16_t *)(y + i));
+ }
+ for (; i + 3 < n; i += 4) {
+ float32x4_t v_x = vec_xl(0, (const float *)(x + i));
+ float32x4_t v_zero = vec_splats(0.0f);
+ uint16x8_t v_yd = vec_round_from_fp32(v_x, v_zero, 0);
+ uint16x8_t v_y = vec_convert_to_fp16(v_yd, 0);
+ vec_xst(v_y, 0, (ggml_fp16_t *)(y + i));
+ }
  #endif
  for (; i < n; ++i) {
- y[i] = GGML_FP32_TO_FP16(x[i]);
+ y[i] = GGML_CPU_FP32_TO_FP16(x[i]);
  }
  }

@@ -3167,9 +3185,25 @@ void ggml_cpu_fp16_to_fp32(const ggml_fp16_t * x, float * y, int64_t n) {
  __m128 y_vec = _mm_cvtph_ps(x_vec);
  _mm_storeu_ps(y + i, y_vec);
  }
+ #elif defined(__NNPA__)
+ for (; i + 7 < n; i += 8) {
+ uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)(x + i));
+ uint16x8_t v_yd = vec_convert_from_fp16(v_x, 0);
+ float32x4_t v_yh = vec_extend_to_fp32_hi(v_yd, 0);
+ float32x4_t v_yl = vec_extend_to_fp32_lo(v_yd, 0);
+ vec_xst(v_yh, 0, (float *)(y + i + 0));
+ vec_xst(v_yl, 0, (float *)(y + i + 4));
+ }
+ for (; i + 3 < n; i += 4) {
+ uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)(x + i));
+ uint16x8_t v_yd = vec_convert_from_fp16(v_x, 0);
+ float32x4_t v_yh = vec_extend_to_fp32_hi(v_yd, 0);
+ vec_xst(v_yh, 0, (float *)(y + i));
+ }
  #endif
+
  for (; i < n; ++i) {
- y[i] = GGML_FP16_TO_FP32(x[i]);
+ y[i] = GGML_CPU_FP16_TO_FP32(x[i]);
  }
  }

@@ -3369,6 +3403,14 @@ int ggml_cpu_has_vxe(void) {
  #endif
  }

+ int ggml_cpu_has_nnpa(void) {
+ #if defined(GGML_NNPA)
+ return 1;
+ #else
+ return 0;
+ #endif
+ }
+
  int ggml_cpu_has_neon(void) {
  #if defined(__ARM_ARCH) && defined(__ARM_NEON)
  return 1;
@@ -3418,7 +3460,7 @@
  }

  void ggml_cpu_init(void) {
- // needed to initialize f16 tables
+ // needed to initialize ggml_time
  {
  struct ggml_init_params params = { 0, NULL, false };
  struct ggml_context * ctx = ggml_init(params);
@@ -3439,9 +3481,10 @@ void ggml_cpu_init(void) {
  uint16_t u16;
  ggml_fp16_t fp16;
  } u = {i};
- float f = GGML_FP16_TO_FP32(u.fp16);
- ggml_table_gelu_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_f32(f));
- ggml_table_gelu_quick_f16[i] = GGML_FP32_TO_FP16(ggml_gelu_quick_f32(f));
+ float f = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
+ ggml_table_f32_f16[i] = f;
+ ggml_table_gelu_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_f32(f));
+ ggml_table_gelu_quick_f16[i] = GGML_CPU_FP32_TO_FP16(ggml_gelu_quick_f32(f));
  }

  const uint64_t t_end = ggml_time_us(); UNUSED(t_end);
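Two things happen in the ggml-cpu.c hunks above: the NNPA vector loops convert 8 (then 4) elements per iteration, and the scalar tails plus the new 256 KB ggml_table_f32_f16 cover the rest. As a companion to the fp16-to-fp32 sketch earlier, here is a self-contained sketch of the opposite direction (round-to-nearest-even) and of filling a 64 K-entry table the way ggml_cpu_init now does; the _scalar/_demo names are illustrative, not ggml API:

#include <stdint.h>
#include <string.h>

/* scalar IEEE binary32 -> binary16, round-to-nearest-even; a sketch, not the repo's code */
uint16_t fp32_to_fp16_scalar(float f) {
    uint32_t x; memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000;
    int32_t  e    = (int32_t)((x >> 23) & 0xff) - 127 + 15;
    uint32_t man  = x & 0x7fffff;
    if (((x >> 23) & 0xff) == 0xff) {                    /* inf / NaN */
        return (uint16_t)(sign | 0x7c00 | (man ? 0x200 : 0));
    }
    if (e >= 0x1f) {                                     /* overflow -> inf */
        return (uint16_t)(sign | 0x7c00);
    }
    if (e <= 0) {                                        /* subnormal or zero */
        if (e < -10) return (uint16_t) sign;             /* too small: flush to zero */
        man |= 0x800000;                                 /* make the implicit 1 explicit */
        uint32_t shift = (uint32_t)(14 - e);             /* 14..24 */
        uint16_t h = (uint16_t)(sign | (man >> shift));
        uint32_t rem  = man & ((1u << shift) - 1);
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (h & 1))) h++; /* ties to even */
        return h;
    }
    uint16_t h = (uint16_t)(sign | ((uint32_t)e << 10) | (man >> 13));
    uint32_t rem = man & 0x1fff;
    if (rem > 0x1000 || (rem == 0x1000 && (h & 1))) h++; /* carry into exp is correct */
    return h;
}

/* fill a 64 K-entry decode table once, as ggml_cpu_init fills ggml_table_f32_f16;
 * fp16_to_fp32_scalar is the converter from the earlier sketch */
float fp16_to_fp32_scalar(uint16_t h);

static float table_f32_f16_demo[1 << 16];

static void init_table_demo(void) {
    for (uint32_t u = 0; u < (1u << 16); u++) {
        table_f32_f16_demo[u] = fp16_to_fp32_scalar((uint16_t) u);
    }
    /* after this, an fp16 decode is a single indexed load: table_f32_f16_demo[h] */
}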
ggml/src/ggml-cpu/ggml-cpu.cpp CHANGED
@@ -578,6 +578,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
  if (ggml_cpu_has_vxe()) {
  features.push_back({ "VXE", "1" });
  }
+ if (ggml_cpu_has_nnpa()) {
+ features.push_back({ "NNPA", "1" });
+ }
  if (ggml_cpu_has_wasm_simd()) {
  features.push_back({ "WASM_SIMD", "1" });
  }
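With the feature exposed both as a backend feature entry and as a query function, callers can branch on it at runtime. A minimal usage sketch; the prototype is copied from the ggml-cpu.c hunk above, and in a real build it comes from the ggml-cpu header rather than being re-declared:

#include <stdio.h>

int ggml_cpu_has_nnpa(void);  /* from ggml-cpu; re-declared here only to keep the sketch standalone */

int main(void) {
    printf("NNPA available: %s\n", ggml_cpu_has_nnpa() ? "yes" : "no");
    return 0;
}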
ggml/src/ggml-cpu/llamafile/sgemm.cpp CHANGED
@@ -52,6 +52,7 @@
  #include "ggml-impl.h"
  #include "ggml-cpu-impl.h"
  #include "ggml-quants.h"
+ #include "simd-mappings.h"

  #include <array>
  #include <type_traits>
@@ -73,7 +74,7 @@
  namespace {

  inline float unhalf(ggml_fp16_t d) {
- return GGML_FP16_TO_FP32(d);
+ return GGML_CPU_FP16_TO_FP32(d);
  }

  ////////////////////////////////////////////////////////////////////////////////////////////////////
@@ -252,7 +253,7 @@ template <> inline float32x4_t load(const ggml_fp16_t * p) {
  float tmp[4];

  for (int i = 0; i < 4; i++) {
- tmp[i] = GGML_FP16_TO_FP32(p[i]);
+ tmp[i] = GGML_CPU_FP16_TO_FP32(p[i]);
  }

  return vec_xl(0, (const float *)(tmp));
ggml/src/ggml-cpu/ops.cpp CHANGED
@@ -108,7 +108,7 @@ static void ggml_compute_forward_dup_f16(
  for (int i01 = ir0; i01 < ir1; i01++) {
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
  for (int i00 = 0; i00 < ne00; i00++) {
- dst_ptr[id] = GGML_FP16_TO_FP32(src0_ptr[i00]);
+ dst_ptr[id] = GGML_CPU_FP16_TO_FP32(src0_ptr[i00]);
  id++;
  }
  }
@@ -130,7 +130,7 @@ static void ggml_compute_forward_dup_f16(
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);

  for (int i00 = 0; i00 < ne00; i00++) {
- src0_f32[i00] = GGML_FP16_TO_FP32(src0_ptr[i00]);
+ src0_f32[i00] = GGML_CPU_FP16_TO_FP32(src0_ptr[i00]);
  }

  quantize_row_q(src0_f32, dst_ptr + id, ne00);
@@ -156,7 +156,7 @@ static void ggml_compute_forward_dup_f16(
  for (int i00 = 0; i00 < ne00; i00++) {
  const ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] = GGML_FP16_TO_FP32(*src0_ptr);
+ dst_ptr[id] = GGML_CPU_FP16_TO_FP32(*src0_ptr);
  id++;
  }
  }
@@ -267,7 +267,7 @@ static void ggml_compute_forward_dup_f16(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(float *) dst_ptr = GGML_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);
+ *(float *) dst_ptr = GGML_CPU_FP16_TO_FP32(*(const ggml_fp16_t *) src0_ptr);

  if (++i10 == ne0) {
  i10 = 0;
@@ -372,7 +372,7 @@ static void ggml_compute_forward_dup_bf16(
  for (int i01 = ir0; i01 < ir1; i01++) {
  const ggml_bf16_t * src0_ptr = (ggml_bf16_t *) ((char *) src0->data + i01*nb01 + i02*nb02 + i03*nb03);
  for (int i00 = 0; i00 < ne00; i00++) {
- dst_ptr[id] = GGML_FP32_TO_FP16(GGML_BF16_TO_FP32(src0_ptr[i00]));
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(src0_ptr[i00]));
  id++;
  }
  }
@@ -473,7 +473,7 @@ static void ggml_compute_forward_dup_bf16(
  for (int i00 = 0; i00 < ne00; i00++) {
  const ggml_bf16_t * src0_ptr = (ggml_bf16_t *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] = GGML_FP32_TO_FP16(GGML_BF16_TO_FP32(*src0_ptr));
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(*src0_ptr));
  id++;
  }
  }
@@ -566,7 +566,7 @@ static void ggml_compute_forward_dup_bf16(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(ggml_fp16_t *) dst_ptr = GGML_FP32_TO_FP16(GGML_BF16_TO_FP32(*(const ggml_bf16_t *) src0_ptr));
+ *(ggml_fp16_t *) dst_ptr = GGML_CPU_FP32_TO_FP16(GGML_BF16_TO_FP32(*(const ggml_bf16_t *) src0_ptr));

  if (++i10 == ne0) {
  i10 = 0;
@@ -765,7 +765,7 @@ static void ggml_compute_forward_dup_f32(
  for (int i00 = 0; i00 < ne00; i00++) {
  const float * src0_ptr = (float *) ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);

- dst_ptr[id] = GGML_FP32_TO_FP16(*src0_ptr);
+ dst_ptr[id] = GGML_CPU_FP32_TO_FP16(*src0_ptr);
  id++;
  }
  }
@@ -878,7 +878,7 @@ static void ggml_compute_forward_dup_f32(
  const char * src0_ptr = ((char *) src0->data + i00*nb00 + i01*nb01 + i02*nb02 + i03*nb03);
  char * dst_ptr = ((char *) dst->data + i10*nb0 + i11*nb1 + i12*nb2 + i13*nb3);

- *(ggml_fp16_t *) dst_ptr = GGML_FP32_TO_FP16(*(const float *) src0_ptr);
+ *(ggml_fp16_t *) dst_ptr = GGML_CPU_FP32_TO_FP16(*(const float *) src0_ptr);

  if (++i10 == ne0) {
  i10 = 0;
@@ -1419,7 +1419,7 @@ static void ggml_compute_forward_add1_f16_f32(
  ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 );
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01);
  for (int i = 0; i < ne0; i++) {
- dst_ptr[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(src0_ptr[i]) + v);
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(src0_ptr[i]) + v);
  }
  }
  }
@@ -1435,7 +1435,7 @@ static void ggml_compute_forward_add1_f16_f16(
  GGML_ASSERT(ggml_is_scalar(src1));

  // scalar to add
- const float v = GGML_FP16_TO_FP32(*(ggml_fp16_t *) src1->data);
+ const float v = GGML_CPU_FP16_TO_FP32(*(ggml_fp16_t *) src1->data);

  const int ith = params->ith;
  const int nth = params->nth;
@@ -1467,7 +1467,7 @@ static void ggml_compute_forward_add1_f16_f16(
  ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 );
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01);
  for (int i = 0; i < ne0; i++) {
- dst_ptr[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(src0_ptr[i]) + v);
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(src0_ptr[i]) + v);
  }
  }
  }
@@ -1889,7 +1889,7 @@ static void ggml_compute_forward_sum_f16(
  }
  }
  }
- ((ggml_fp16_t *) dst->data)[0] = GGML_FP32_TO_FP16(sum);
+ ((ggml_fp16_t *) dst->data)[0] = GGML_CPU_FP32_TO_FP16(sum);
  }

  static void ggml_compute_forward_sum_bf16(
@@ -2660,7 +2660,7 @@ static void ggml_compute_forward_gelu_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v = GGML_FP16_TO_FP32(x);
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -2763,7 +2763,7 @@ static void ggml_compute_forward_gelu_erf_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v = GGML_FP16_TO_FP32(x);
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -2866,7 +2866,7 @@ static void ggml_compute_forward_gelu_quick_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v = GGML_FP16_TO_FP32(x);
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -2969,7 +2969,7 @@ static void ggml_compute_forward_silu_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const ggml_fp16_t x = ((ggml_fp16_t *) ((char *) dst->data + i1*(dst->nb[1])))[k];
- const float v = GGML_FP16_TO_FP32(x);
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -3163,7 +3163,7 @@ static void ggml_compute_forward_silu_back_f16(
  #ifndef NDEBUG
  for (int k = 0; k < nc; k++) {
  const float x = ((ggml_fp16_t *) ((char *) dst->data + i1*( dst->nb[1])))[k];
- const float v = GGML_FP16_TO_FP32(x);
+ const float v = GGML_CPU_FP16_TO_FP32(x);
  GGML_UNUSED(v);
  assert(!isnan(v));
  assert(!isinf(v));
@@ -4500,7 +4500,7 @@ static void ggml_compute_forward_get_rows_back_f32_f16(

  for (int j = 0; j < nc; ++j) {
  ggml_fp16_t v = ((ggml_fp16_t *) ((char *) src0->data + i*src0->nb[1]))[j];
- ((float *) ((char *) dst->data + r*dst->nb[1]))[j] += GGML_FP16_TO_FP32(v);
+ ((float *) ((char *) dst->data + r*dst->nb[1]))[j] += GGML_CPU_FP16_TO_FP32(v);
  }
  }
  }
@@ -4792,7 +4792,7 @@ static void ggml_compute_forward_soft_max_f32(
  if (mp_f32) {
  if (use_f16) {
  for (int i = 0; i < nc; ++i) {
- wp[i] += slope*GGML_FP16_TO_FP32(mp_f16[i]);
+ wp[i] += slope*GGML_CPU_FP16_TO_FP32(mp_f16[i]);
  }
  } else {
  for (int i = 0; i < nc; ++i) {
@@ -5018,8 +5018,8 @@ static void ggml_compute_forward_clamp_f16(
  ggml_fp16_t * src0_ptr = (ggml_fp16_t *) ((char *) src0->data + j*nb01);

  for (int i = 0; i < nc; i++) {
- float v = GGML_FP16_TO_FP32(src0_ptr[i]);
- dst_ptr[i] = GGML_FP32_TO_FP16(MAX(MIN(v, max), min));
+ float v = GGML_CPU_FP16_TO_FP32(src0_ptr[i]);
+ dst_ptr[i] = GGML_CPU_FP32_TO_FP16(MAX(MIN(v, max), min));
  }
  }
  }
@@ -5476,11 +5476,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 = GGML_FP16_TO_FP32(src[0]);
- const float x1 = GGML_FP16_TO_FP32(src[n_dims]);
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims]);

- dst_data[0] = GGML_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
- dst_data[n_dims] = GGML_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  } else {
  for (int64_t i0 = 0; i0 < n_dims; i0 += 2) {
@@ -5492,11 +5492,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 = GGML_FP16_TO_FP32(src[0]);
- const float x1 = GGML_FP16_TO_FP32(src[n_dims/2]);
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims/2]);

- dst_data[0] = GGML_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
- dst_data[n_dims/2] = GGML_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims/2] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  }
  } else {
@@ -5507,11 +5507,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + i0*nb0);

- const float x0 = GGML_FP16_TO_FP32(src[0]);
- const float x1 = GGML_FP16_TO_FP32(src[1]);
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[1]);

- dst_data[0] = GGML_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
- dst_data[1] = GGML_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[1] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  }

@@ -5525,11 +5525,11 @@ static void ggml_compute_forward_rope_f16(
  const ggml_fp16_t * const src = (ggml_fp16_t *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + ic*nb00);
  ggml_fp16_t * dst_data = (ggml_fp16_t *)((char *) dst->data + i3*nb3 + i2*nb2 + i1*nb1 + ic*nb0);

- const float x0 = GGML_FP16_TO_FP32(src[0]);
- const float x1 = GGML_FP16_TO_FP32(src[n_dims]);
+ const float x0 = GGML_CPU_FP16_TO_FP32(src[0]);
+ const float x1 = GGML_CPU_FP16_TO_FP32(src[n_dims]);

- dst_data[0] = GGML_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
- dst_data[n_dims] = GGML_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
+ dst_data[0] = GGML_CPU_FP32_TO_FP16(x0*cos_theta - x1*sin_theta);
+ dst_data[n_dims] = GGML_CPU_FP32_TO_FP16(x0*sin_theta + x1*cos_theta);
  }
  } else {
  for (int64_t i0 = n_dims; i0 < ne0; i0 += 2) {
@@ -5640,7 +5640,7 @@ static void ggml_compute_forward_conv_transpose_1d_f16_f32(
  for (int64_t i11 = 0; i11 < ne11; i11++) {
  const float * const src = (float *)((char *) src1->data + i11*nb11);
  for (int64_t i10 = 0; i10 < ne10; i10++) {
- dst_data[i10*ne11 + i11] = GGML_FP32_TO_FP16(src[i10]);
+ dst_data[i10*ne11 + i11] = GGML_CPU_FP32_TO_FP16(src[i10]);
  }
  }
  }
@@ -5933,7 +5933,7 @@ static void ggml_compute_forward_im2col_f16(
  if (iih < 0 || iih >= IH || iiw < 0 || iiw >= IW) {
  dst_data[iic*(KH*KW) + ikh*KW + ikw] = 0;
  } else {
- dst_data[iic*(KH*KW) + ikh*KW + ikw] = GGML_FP32_TO_FP16(src_data[iih*IW + iiw]);
+ dst_data[iic*(KH*KW) + ikh*KW + ikw] = GGML_CPU_FP32_TO_FP16(src_data[iih*IW + iiw]);
  }
  }
  }
@@ -6109,7 +6109,7 @@ void ggml_compute_forward_conv_transpose_2d(
  const float * const src = (float *)((char *) src1->data + i12*nb12 + i11*nb11);
  ggml_fp16_t * dst_data = wdata + i11*ne10*ne12;
  for (int i10 = 0; i10 < ne10; i10++) {
- dst_data[i10*ne12 + i12] = GGML_FP32_TO_FP16(src[i10]);
+ dst_data[i10*ne12 + i12] = GGML_CPU_FP32_TO_FP16(src[i10]);
  }
  }
  }
@@ -6358,7 +6358,7 @@ static void ggml_compute_forward_pool_1d_sk_p0(
  case GGML_OP_POOL_COUNT: GGML_ABORT("fatal error");
  }
  for (int ki = 0; ki < k; ++ki) {
- const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
+ const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
  switch (op) {
  case GGML_OP_POOL_AVG: drow[i] += srow_j; break;
  case GGML_OP_POOL_MAX: if (srow_j > drow[i]) drow[i] = srow_j; break;
@@ -6450,7 +6450,7 @@ void ggml_compute_forward_pool_2d(
  for (int kx = 0; kx < k0; ++kx) {
  int j = ix + kx;
  if (j < 0 || j >= src->ne[0]) continue;
- const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
+ const float srow_j = (src->type == GGML_TYPE_F32) ? ((const float*)srow)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t*)srow)[j]);
  switch (op) {
  case GGML_OP_POOL_AVG: *out += srow_j; break;
  case GGML_OP_POOL_MAX: if (srow_j > *out) *out = srow_j; break;
@@ -6538,7 +6538,7 @@ void ggml_compute_forward_pool_2d_back(
  }

  const float val = dst->type == GGML_TYPE_F32 ?
- ((const float *) drowf)[j] : GGML_FP16_TO_FP32(((const ggml_fp16_t *) drowf)[j]);
+ ((const float *) drowf)[j] : GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t *) drowf)[j]);
  if (val <= maxval) {
  continue;
  }
@@ -6558,7 +6558,7 @@ void ggml_compute_forward_pool_2d_back(
  if (dst->type == GGML_TYPE_F32) {
  ((float *) drow)[j] += grad0;
  } else {
- ((ggml_fp16_t *) drow)[j] = GGML_FP32_TO_FP16(grad0 + GGML_FP16_TO_FP32(((const ggml_fp16_t *) drow)[j]));
+ ((ggml_fp16_t *) drow)[j] = GGML_CPU_FP32_TO_FP16(grad0 + GGML_CPU_FP16_TO_FP32(((const ggml_fp16_t *) drow)[j]));
  }
  } else if (op == GGML_OP_POOL_AVG) {
  const float grad = grad0 / ka;
@@ -6577,7 +6577,7 @@ void ggml_compute_forward_pool_2d_back(
  if (dst->type == GGML_TYPE_F32) {
  ((float *) drow)[j] += grad;
  } else {
- ((ggml_fp16_t *) drow)[j] += GGML_FP32_TO_FP16(grad);
+ ((ggml_fp16_t *) drow)[j] += GGML_CPU_FP32_TO_FP16(grad);
  }
  }
  }
@@ -7147,7 +7147,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
  // loop over n_kv and n_head_kv
  // ref: https://arxiv.org/pdf/2112.05682.pdf
  for (int64_t ic = 0; ic < nek1; ++ic) {
- const float mv = mp ? slope*GGML_FP16_TO_FP32(mp[ic]) : 0.0f;
+ const float mv = mp ? slope*GGML_CPU_FP16_TO_FP32(mp[ic]) : 0.0f;
  if (mv == -INFINITY) {
  continue;
  }
@@ -7215,7 +7215,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(

  if (v->type == GGML_TYPE_F16) {
  for (int64_t d = 0; d < DV; ++d) {
- VKQ32[d] = GGML_FP16_TO_FP32(VKQ16[d]);
+ VKQ32[d] = GGML_CPU_FP16_TO_FP32(VKQ16[d]);
  }
  }

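Every f16 op touched above follows the same widen-compute-narrow shape that the clamp hunk shows most plainly: decode each element to fp32, apply the op, re-encode. A compact sketch of that idiom, reusing the two illustrative converters from the earlier sketches (they stand in for GGML_CPU_FP16_TO_FP32/GGML_CPU_FP32_TO_FP16 and are not ggml API):

#include <stdint.h>

float    fp16_to_fp32_scalar(uint16_t h);  /* from the earlier sketches */
uint16_t fp32_to_fp16_scalar(float f);

/* widen-compute-narrow over one row, the shape of ggml_compute_forward_clamp_f16 */
static void clamp_f16_row(uint16_t * row, int n, float min, float max) {
    for (int i = 0; i < n; i++) {
        float v = fp16_to_fp32_scalar(row[i]);
        v = v > max ? max : v;      /* MIN(v, max) */
        v = v < min ? min : v;      /* MAX(.., min) */
        row[i] = fp32_to_fp16_scalar(v);
    }
}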
ggml/src/ggml-cpu/quants.c CHANGED
@@ -2,6 +2,7 @@
2
  #include "ggml-common.h"
3
 
4
  #include "ggml-cpu-impl.h"
 
5
  #include "ggml-quants.h"
6
  #include "quants.h"
7
 
@@ -137,7 +138,7 @@ void ggml_vec_dot_q4_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
137
  }
138
 
139
  int sumi = sumi0 + sumi1;
140
- sumf += sumi*GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d);
141
  }
142
 
143
  *s = sumf;
@@ -174,7 +175,7 @@ void ggml_vec_dot_q4_1_q8_1_generic(int n, float * GGML_RESTRICT s, size_t bs, c
174
  }
175
 
176
  int sumi = sumi0 + sumi1;
177
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
178
  }
179
 
180
  *s = sumf;
@@ -217,7 +218,7 @@ void ggml_vec_dot_q5_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
217
  }
218
 
219
  int sumi = sumi0 + sumi1;
220
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d)) * sumi;
221
  }
222
 
223
  *s = sumf;
@@ -260,7 +261,7 @@ void ggml_vec_dot_q5_1_q8_1_generic(int n, float * GGML_RESTRICT s, size_t bs, c
260
  }
261
 
262
  int sumi = sumi0 + sumi1;
263
- sumf += (GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d))*sumi + GGML_FP16_TO_FP32(x[ib].m)*GGML_FP16_TO_FP32(y[ib].s);
264
  }
265
 
266
  *s = sumf;
@@ -290,7 +291,7 @@ void ggml_vec_dot_q8_0_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs, c
290
  sumi += x[ib].qs[j]*y[ib].qs[j];
291
  }
292
 
293
- sumf += sumi*(GGML_FP16_TO_FP32(x[ib].d)*GGML_FP16_TO_FP32(y[ib].d));
294
  }
295
 
296
  *s = sumf;
@@ -342,7 +343,7 @@ void ggml_vec_dot_tq1_0_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
342
  }
343
  }
344
 
345
- sumf += (float) sum * (GGML_FP16_TO_FP32(x[i].d) * y[i].d);
346
  }
347
 
348
  *s = sumf;
@@ -372,7 +373,7 @@ void ggml_vec_dot_tq2_0_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
372
  }
373
  }
374
 
375
- const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
376
 
377
  sumf += (float) sumi * d;
378
  }
@@ -405,8 +406,8 @@ void ggml_vec_dot_q2_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
405
  summs += y[i].bsums[j] * (sc[j] >> 4);
406
  }
407
 
408
- const float dall = y[i].d * GGML_FP16_TO_FP32(x[i].d);
409
- const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
410
 
411
  int isum = 0;
412
  int is = 0;
@@ -504,7 +505,7 @@ void ggml_vec_dot_q3_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
504
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
505
  q8 += 8; a += 8;
506
  }
507
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
508
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
509
  }
510
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -577,9 +578,9 @@ void ggml_vec_dot_q4_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
577
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
578
  q8 += 8; a += 8;
579
  }
580
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
581
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
582
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
583
  sumf -= dmin * sumi;
584
  }
585
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -657,9 +658,9 @@ void ggml_vec_dot_q5_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
657
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
658
  q8 += 8; a += 8;
659
  }
660
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
661
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
662
- const float dmin = GGML_FP16_TO_FP32(x[i].dmin) * y[i].d;
663
  sumf -= dmin * sumi;
664
  }
665
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -714,7 +715,7 @@ void ggml_vec_dot_q6_K_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs, c
714
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
715
  q8 += 8; a += 8;
716
  }
717
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
718
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
719
  }
720
  for (int l = 0; l < 8; ++l) sumf += sums[l];
@@ -739,7 +740,7 @@ void ggml_vec_dot_iq2_xxs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs
739
 
740
  float sumf = 0.f;
741
  for (int i = 0; i < nb; ++i) {
742
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
743
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
744
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
745
  int32_t bsum = 0;
@@ -778,7 +779,7 @@ void ggml_vec_dot_iq2_xs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
778
 
779
  float sumf = 0.f;
780
  for (int i = 0; i < nb; ++i) {
781
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
782
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
783
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
784
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -829,7 +830,7 @@ void ggml_vec_dot_iq2_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
829
  float sumf = 0;
830
  for (int i = 0; i < nb; i++) {
831
 
832
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
833
  const int8_t * q8 = y[i].qs;
834
  const uint8_t * qs = x[i].qs;
835
  const uint8_t * qh = x[i].qh;
@@ -882,7 +883,7 @@ void ggml_vec_dot_iq3_xxs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs
882
 
883
  float sumf = 0.f;
884
  for (int i = 0; i < nb; ++i) {
885
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
886
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
887
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
888
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
@@ -924,7 +925,7 @@ void ggml_vec_dot_iq3_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
924
 
925
  float sumf = 0.f;
926
  for (int i = 0; i < nb; ++i) {
927
- const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
928
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
929
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
930
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
@@ -1002,7 +1003,7 @@ void ggml_vec_dot_iq1_s_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
1002
  qs += 4;
1003
  }
1004
 
1005
- sumf += GGML_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
1006
  }
1007
 
1008
  *s = sumf;
@@ -1063,7 +1064,7 @@ void ggml_vec_dot_iq1_m_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
1063
  qh += 2;
1064
  }
1065
 
1066
- sumf += GGML_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
1067
  }
1068
 
1069
  *s = sumf;
@@ -1087,7 +1088,7 @@ void ggml_vec_dot_iq4_nl_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
1087
  float sumf = 0;
1088
 
1089
  for (; ib < nb; ++ib) {
1090
- const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
1091
  int sumi1 = 0, sumi2 = 0;
1092
  for (int j = 0; j < QK4_NL/2; ++j) {
1093
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
@@ -1113,7 +1114,7 @@ void ggml_vec_dot_iq4_xs_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
1113
 
1114
  float sumf = 0;
1115
  for (int ibl = 0; ibl < nb; ++ibl) {
1116
- const float d4d8 = GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
1117
  uint16_t h = x[ibl].scales_h;
1118
  const uint8_t * qs = x[ibl].qs;
1119
  const int8_t * q8 = y[ibl].qs;
 
2
  #include "ggml-common.h"
3
 
4
  #include "ggml-cpu-impl.h"
5
+ #include "simd-mappings.h"
6
  #include "ggml-quants.h"
7
  #include "quants.h"
8
 
 
138
  }
139
 
140
  int sumi = sumi0 + sumi1;
141
+ sumf += sumi*GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d);
142
  }
143
 
144
  *s = sumf;
 
175
  }
176
 
177
  int sumi = sumi0 + sumi1;
178
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
179
  }
180
 
181
  *s = sumf;
 
218
  }
219
 
220
  int sumi = sumi0 + sumi1;
221
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d)) * sumi;
222
  }
223
 
224
  *s = sumf;
 
261
  }
262
 
263
  int sumi = sumi0 + sumi1;
264
+ sumf += (GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d))*sumi + GGML_CPU_FP16_TO_FP32(x[ib].m)*GGML_CPU_FP16_TO_FP32(y[ib].s);
265
  }
266
 
267
  *s = sumf;
 
291
  sumi += x[ib].qs[j]*y[ib].qs[j];
292
  }
293
 
294
+ sumf += sumi*(GGML_CPU_FP16_TO_FP32(x[ib].d)*GGML_CPU_FP16_TO_FP32(y[ib].d));
295
  }
296
 
297
  *s = sumf;
 
343
  }
344
  }
345
 
346
+ sumf += (float) sum * (GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d);
347
  }
348
 
349
  *s = sumf;
 
373
  }
374
  }
375
 
376
+ const float d = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
377
 
378
  sumf += (float) sumi * d;
379
  }
 
406
  summs += y[i].bsums[j] * (sc[j] >> 4);
407
  }
408
 
409
+ const float dall = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].d);
410
+ const float dmin = y[i].d * GGML_CPU_FP16_TO_FP32(x[i].dmin);
411
 
412
  int isum = 0;
413
  int is = 0;
 
505
  for (int l = 0; l < 8; ++l) aux32[l] += (scales[j] - 32) * aux16[l];
506
  q8 += 8; a += 8;
507
  }
508
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
509
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
510
  }
511
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
578
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
579
  q8 += 8; a += 8;
580
  }
581
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
582
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
583
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
584
  sumf -= dmin * sumi;
585
  }
586
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
658
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
659
  q8 += 8; a += 8;
660
  }
661
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
662
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
663
+ const float dmin = GGML_CPU_FP16_TO_FP32(x[i].dmin) * y[i].d;
664
  sumf -= dmin * sumi;
665
  }
666
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
715
  for (int l = 0; l < 8; ++l) aux32[l] += scale * aux16[l];
716
  q8 += 8; a += 8;
717
  }
718
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
719
  for (int l = 0; l < 8; ++l) sums[l] += d * aux32[l];
720
  }
721
  for (int l = 0; l < 8; ++l) sumf += sums[l];
 
740
 
741
  float sumf = 0.f;
742
  for (int i = 0; i < nb; ++i) {
743
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
744
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
745
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
746
  int32_t bsum = 0;
 
779
 
780
  float sumf = 0.f;
781
  for (int i = 0; i < nb; ++i) {
782
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
783
  const uint16_t * GGML_RESTRICT q2 = x[i].qs;
784
  const uint8_t * GGML_RESTRICT sc = x[i].scales;
785
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
830
  float sumf = 0;
831
  for (int i = 0; i < nb; i++) {
832
 
833
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
834
  const int8_t * q8 = y[i].qs;
835
  const uint8_t * qs = x[i].qs;
836
  const uint8_t * qh = x[i].qh;
 
883
 
884
  float sumf = 0.f;
885
  for (int i = 0; i < nb; ++i) {
886
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
887
  const uint8_t * GGML_RESTRICT q3 = x[i].qs;
888
  const uint8_t * GGML_RESTRICT gas = x[i].qs + QK_K/4;
889
  const int8_t * GGML_RESTRICT q8 = y[i].qs;
 
925
 
926
  float sumf = 0.f;
927
  for (int i = 0; i < nb; ++i) {
928
+ const float d = GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d;
929
  const uint8_t * GGML_RESTRICT qs = x[i].qs;
930
  const uint8_t * GGML_RESTRICT qh = x[i].qh;
931
  const uint8_t * GGML_RESTRICT signs = x[i].signs;
 
1003
  qs += 4;
1004
  }
1005
 
1006
+ sumf += GGML_CPU_FP16_TO_FP32(x[i].d) * y[i].d * (sumi + IQ1S_DELTA * sumi1);
1007
  }
1008
 
1009
  *s = sumf;
 
1064
  qh += 2;
1065
  }
1066
 
1067
+ sumf += GGML_CPU_FP16_TO_FP32(scale.f16) * y[i].d * (sumi1 + IQ1M_DELTA * sumi2);
1068
  }
1069
 
1070
  *s = sumf;
 
1088
  float sumf = 0;
1089
 
1090
  for (; ib < nb; ++ib) {
1091
+ const float d = GGML_CPU_FP16_TO_FP32(y[ib].d)*GGML_CPU_FP16_TO_FP32(x[ib].d);
1092
  int sumi1 = 0, sumi2 = 0;
1093
  for (int j = 0; j < QK4_NL/2; ++j) {
1094
  sumi1 += y[ib].qs[j+ 0] * kvalues_iq4nl[x[ib].qs[j] & 0xf];
 
1114
 
1115
  float sumf = 0;
1116
  for (int ibl = 0; ibl < nb; ++ibl) {
1117
+ const float d4d8 = GGML_CPU_FP16_TO_FP32(x[ibl].d) * y[ibl].d;
1118
  uint16_t h = x[ibl].scales_h;
1119
  const uint8_t * qs = x[ibl].qs;
1120
  const int8_t * q8 = y[ibl].qs;
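
The hunks above are mechanical: every place that expands a per-block fp16 scale now goes through the CPU-backend macro GGML_CPU_FP16_TO_FP32 instead of the old GGML_FP16_TO_FP32. A minimal sketch of the pattern, assuming the block_q8_0 layout from ggml-common.h (fp16 scale d plus QK8_0 int8 quants); the helper name is illustrative, not part of the patch:

    static float dot_q8_0_scalar(const block_q8_0 * x, const block_q8_0 * y, int nb) {
        float sumf = 0.0f;
        for (int ib = 0; ib < nb; ++ib) {
            int sumi = 0;
            for (int j = 0; j < QK8_0; ++j) {
                sumi += x[ib].qs[j] * y[ib].qs[j];  // integer dot product per block
            }
            // widen both fp16 block scales through the CPU-backend macro
            sumf += sumi * (GGML_CPU_FP16_TO_FP32(x[ib].d) * GGML_CPU_FP16_TO_FP32(y[ib].d));
        }
        return sumf;
    }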
ggml/src/ggml-cpu/repack.cpp CHANGED
@@ -6,6 +6,7 @@
6
  #include "ggml-impl.h"
7
  #include "ggml-cpu.h"
8
  #include "ggml-cpu-impl.h"
 
9
  #include "traits.h"
10
 
11
  #include "arch-fallback.h"
@@ -72,7 +73,7 @@ void ggml_quantize_mat_q8_0_4x4_generic(const float * GGML_RESTRICT x, void * GG
72
  const float d = amax / ((1 << 7) - 1);
73
  id[row_iter] = d ? 1.0f / d : 0.0f;
74
 
75
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
76
  }
77
 
78
  for (int j = 0; j < QK8_0 * 4; j++) {
@@ -110,7 +111,7 @@ void ggml_quantize_mat_q8_0_4x8_generic(const float * GGML_RESTRICT x, void * GG
110
  const float d = amax / ((1 << 7) - 1);
111
  id[row_iter] = d ? 1.0f / d : 0.0f;
112
 
113
- y[i].d[row_iter] = GGML_FP32_TO_FP16(d);
114
  }
115
 
116
  for (int j = 0; j < QK8_0 * 4; j++) {
@@ -236,7 +237,7 @@ void ggml_gemv_q4_0_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
236
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
237
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
238
  }
239
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
240
  }
241
  }
242
  }
@@ -280,7 +281,7 @@ void ggml_gemv_q4_0_4x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
280
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
281
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
282
  }
283
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
284
  }
285
  }
286
  }
@@ -325,7 +326,7 @@ void ggml_gemv_q4_0_8x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
325
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
326
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
327
  }
328
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
329
  }
330
  }
331
  }
@@ -396,13 +397,13 @@ void ggml_gemv_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
396
  sumi2 = sumi2 * scales_1[j];
397
  sumi += sumi1 + sumi2;
398
  }
399
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
400
  }
401
  }
402
  for (int sb = 0; sb < 8; sb++) {
403
  uint8_t *mins = (uint8_t*) utmp + 8 + sb * 16;
404
  for (int j = 0; j < ncols_interleaved; j++) {
405
- sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
406
  }
407
  }
408
  }
@@ -449,7 +450,7 @@ void ggml_gemv_iq4_nl_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs
449
  const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
450
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
451
  }
452
- sumf[j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d);
453
  }
454
  }
455
  }
@@ -500,7 +501,7 @@ void ggml_gemm_q4_0_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
500
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
501
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
502
  }
503
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
504
  }
505
  }
506
  }
@@ -555,7 +556,7 @@ void ggml_gemm_q4_0_4x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
555
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
556
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
557
  }
558
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
559
  }
560
  }
561
  }
@@ -609,7 +610,7 @@ void ggml_gemm_q4_0_8x8_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs,
609
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
610
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
611
  }
612
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
613
  }
614
  }
615
  }
@@ -688,7 +689,7 @@ void ggml_gemm_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
688
  sumi2 = sumi2 * scales_1[j];
689
  sumi += sumi1 + sumi2;
690
  }
691
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
692
  }
693
  }
694
  }
@@ -697,7 +698,7 @@ void ggml_gemm_q4_K_8x8_q8_K_generic(int n, float * GGML_RESTRICT s, size_t bs,
697
  for(int m = 0; m < 4; m++) {
698
  const int16_t *bsums = a_ptr[l].bsums + (sb * 8) + (m * 4) - ((sb % 2) * 6);
699
  for(int j = 0; j < ncols_interleaved; j++) {
700
- sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
701
  }
702
  }
703
  }
@@ -753,7 +754,7 @@ void ggml_gemm_iq4_nl_4x4_q8_0_generic(int n, float * GGML_RESTRICT s, size_t bs
753
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
754
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
755
  }
756
- sumf[m][j] += sumi * GGML_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_FP16_TO_FP32(a_ptr[l].d[m]);
757
  }
758
  }
759
  }
 
6
  #include "ggml-impl.h"
7
  #include "ggml-cpu.h"
8
  #include "ggml-cpu-impl.h"
9
+ #include "simd-mappings.h"
10
  #include "traits.h"
11
 
12
  #include "arch-fallback.h"
 
73
  const float d = amax / ((1 << 7) - 1);
74
  id[row_iter] = d ? 1.0f / d : 0.0f;
75
 
76
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
77
  }
78
 
79
  for (int j = 0; j < QK8_0 * 4; j++) {
 
111
  const float d = amax / ((1 << 7) - 1);
112
  id[row_iter] = d ? 1.0f / d : 0.0f;
113
 
114
+ y[i].d[row_iter] = GGML_CPU_FP32_TO_FP16(d);
115
  }
116
 
117
  for (int j = 0; j < QK8_0 * 4; j++) {
 
237
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
238
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
239
  }
240
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
241
  }
242
  }
243
  }
 
281
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
282
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
283
  }
284
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
285
  }
286
  }
287
  }
 
326
  const int v1 = (int8_t) (b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] & 0xF0);
327
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2])) >> 4;
328
  }
329
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
330
  }
331
  }
332
  }
 
397
  sumi2 = sumi2 * scales_1[j];
398
  sumi += sumi1 + sumi2;
399
  }
400
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d;
401
  }
402
  }
403
  for (int sb = 0; sb < 8; sb++) {
404
  uint8_t *mins = (uint8_t*) utmp + 8 + sb * 16;
405
  for (int j = 0; j < ncols_interleaved; j++) {
406
+ sum_minf[j] += mins[j] * (a_ptr[l].bsums[sb * 2] + a_ptr[l].bsums[sb * 2 + 1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d;
407
  }
408
  }
409
  }
 
450
  const int v1 = kvalues_iq4nl[b_ptr[l].qs[k * ncols_interleaved * blocklen + j * blocklen + i] >> 4];
451
  sumi += ((v0 * a_ptr[l].qs[k * blocklen + i]) + (v1 * a_ptr[l].qs[k * blocklen + i + qk / 2]));
452
  }
453
+ sumf[j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d);
454
  }
455
  }
456
  }
 
501
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
502
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
503
  }
504
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
505
  }
506
  }
507
  }
 
556
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
557
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
558
  }
559
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
560
  }
561
  }
562
  }
 
610
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
611
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4])) >> 4;
612
  }
613
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
614
  }
615
  }
616
  }
 
689
  sumi2 = sumi2 * scales_1[j];
690
  sumi += sumi1 + sumi2;
691
  }
692
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * a_ptr[l].d[m];
693
  }
694
  }
695
  }
 
698
  for(int m = 0; m < 4; m++) {
699
  const int16_t *bsums = a_ptr[l].bsums + (sb * 8) + (m * 4) - ((sb % 2) * 6);
700
  for(int j = 0; j < ncols_interleaved; j++) {
701
+ sum_minf[m][j] += mins[j] * (bsums[0] + bsums[1]) * GGML_CPU_FP16_TO_FP32(b_ptr[l].dmin[j]) * a_ptr[l].d[m];
702
  }
703
  }
704
  }
 
754
  sumi += ((v0 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i]) +
755
  (v1 * a_ptr[l].qs[k * 4 * blocklen + m * blocklen + i + qk / 2 * 4]));
756
  }
757
+ sumf[m][j] += sumi * GGML_CPU_FP16_TO_FP32(b_ptr[l].d[j]) * GGML_CPU_FP16_TO_FP32(a_ptr[l].d[m]);
758
  }
759
  }
760
  }
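
In repack.cpp the change is the same rename plus a new #include "simd-mappings.h" to provide the macros. The q4_K kernels keep two accumulators: sumf collects the scaled integer dot products, and sum_minf collects the dmin-times-bsums correction mandated by the q4_K format. A hedged sketch of how they combine, assuming the elided epilogue writes their difference per output column:

    for (int j = 0; j < ncols_interleaved; j++) {
        // scale term minus the block-min correction accumulated above
        s[j] = sumf[j] - sum_minf[j];
    }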
ggml/src/ggml-cpu/simd-mappings.h CHANGED
@@ -2,10 +2,167 @@
2
 
3
  #include "ggml-cpu-impl.h"
4
 
 
5
  //
6
  // simd mappings
7
  //
8
 
9
  // we define a common set of C macros which map to specific intrinsics based on the current architecture
10
  // we then implement the fundamental computation operations below using only these macros
11
  // adding support for new architectures requires to define the corresponding SIMD macros
@@ -415,7 +572,7 @@ static inline __m256 __avx_f32cx8_load(const ggml_fp16_t * x) {
415
  float tmp[8];
416
 
417
  for (int i = 0; i < 8; i++) {
418
- tmp[i] = GGML_FP16_TO_FP32(x[i]);
419
  }
420
 
421
  return _mm256_loadu_ps(tmp);
@@ -426,7 +583,7 @@ static inline void __avx_f32cx8_store(ggml_fp16_t *x, __m256 y) {
426
  _mm256_storeu_ps(arr, y);
427
 
428
  for (int i = 0; i < 8; i++)
429
- x[i] = GGML_FP32_TO_FP16(arr[i]);
430
  }
431
  #define GGML_F32Cx8_LOAD(x) __avx_f32cx8_load(x)
432
  #define GGML_F32Cx8_STORE(x, y) __avx_f32cx8_store(x, y)
@@ -574,10 +731,10 @@ static inline unsigned char ggml_endian_byte(int i) {
574
  inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
575
  float tmp[4];
576
 
577
- tmp[0] = GGML_FP16_TO_FP32(p[0]);
578
- tmp[1] = GGML_FP16_TO_FP32(p[1]);
579
- tmp[2] = GGML_FP16_TO_FP32(p[2]);
580
- tmp[3] = GGML_FP16_TO_FP32(p[3]);
581
 
582
  return wasm_v128_load(tmp);
583
  }
@@ -587,10 +744,10 @@ inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
587
 
588
  wasm_v128_store(tmp, x);
589
 
590
- p[0] = GGML_FP32_TO_FP16(tmp[0]);
591
- p[1] = GGML_FP32_TO_FP16(tmp[1]);
592
- p[2] = GGML_FP32_TO_FP16(tmp[2]);
593
- p[3] = GGML_FP32_TO_FP16(tmp[3]);
594
  }
595
 
596
  #define GGML_F16x4 v128_t
@@ -690,10 +847,10 @@ inline static void __wasm_f16x4_store(ggml_fp16_t * p, v128_t x) {
690
  static inline __m128 __sse_f16x4_load(const ggml_fp16_t * x) {
691
  float tmp[4];
692
 
693
- tmp[0] = GGML_FP16_TO_FP32(x[0]);
694
- tmp[1] = GGML_FP16_TO_FP32(x[1]);
695
- tmp[2] = GGML_FP16_TO_FP32(x[2]);
696
- tmp[3] = GGML_FP16_TO_FP32(x[3]);
697
 
698
  return _mm_loadu_ps(tmp);
699
  }
@@ -703,10 +860,10 @@ static inline void __sse_f16x4_store(ggml_fp16_t * x, __m128 y) {
703
 
704
  _mm_storeu_ps(arr, y);
705
 
706
- x[0] = GGML_FP32_TO_FP16(arr[0]);
707
- x[1] = GGML_FP32_TO_FP16(arr[1]);
708
- x[2] = GGML_FP32_TO_FP16(arr[2]);
709
- x[3] = GGML_FP32_TO_FP16(arr[3]);
710
  }
711
 
712
  #define GGML_F32Cx4 __m128
@@ -828,7 +985,7 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
828
  #define GGML_F32x4_ZERO __lsx_vldi(0)
829
  #define GGML_F32x4_SET1(x) __lsx_vinsgr2vr_w(__lsx_vldi(0),(x), 0)
830
  #define GGML_F32x4_LOAD(x) __lsx_vld((x), 0)
831
- #define GGML_F32x4_STORE((x),(y)) __lsx_vst((y), (x), 0)
832
  #define GGML_F32x4_FMA(a, b, c) __lsx_vfmadd_s(b, c, a)
833
  #define GGML_F32x4_ADD __lsx_vfadd_s
834
  #define GGML_F32x4_MUL __lsx_vfmul_s
@@ -874,10 +1031,10 @@ static inline void __lasx_f32cx8_store(ggml_fp16_t * x, __m256 y) {
874
  static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
875
  float tmp[4];
876
 
877
- tmp[0] = GGML_FP16_TO_FP32(x[0]);
878
- tmp[1] = GGML_FP16_TO_FP32(x[1]);
879
- tmp[2] = GGML_FP16_TO_FP32(x[2]);
880
- tmp[3] = GGML_FP16_TO_FP32(x[3]);
881
 
882
  return __lsx_vld(tmp, 0);
883
  }
@@ -887,10 +1044,10 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
887
 
888
  __lsx_vst(y, arr, 0);
889
 
890
- x[0] = GGML_FP32_TO_FP16(arr[0]);
891
- x[1] = GGML_FP32_TO_FP16(arr[1]);
892
- x[2] = GGML_FP32_TO_FP16(arr[2]);
893
- x[3] = GGML_FP32_TO_FP16(arr[3]);
894
  }
895
 
896
  #define GGML_F32Cx4 __m128
@@ -922,7 +1079,7 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
922
  #define GGML_F32_STEP 32
923
  #define GGML_F32_EPR 4
924
 
925
- #define GGML_F32x4 __vector float
926
  #define GGML_F32x4_ZERO vec_splats(0.0f)
927
  #define GGML_F32x4_SET1 vec_splats
928
  #define GGML_F32x4_LOAD(p) vec_xl(0, p)
@@ -962,28 +1119,45 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
962
  #define GGML_F16_STEP GGML_F32_STEP
963
  #define GGML_F16_EPR GGML_F32_EPR
964
 
965
- static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
966
  float tmp[4];
967
 
968
  for (int i = 0; i < 4; i++) {
969
- tmp[i] = GGML_FP16_TO_FP32(x[i]);
970
  }
971
 
972
  // note: keep type-cast here to prevent compiler bugs
973
  // see: https://github.com/ggml-org/llama.cpp/issues/12846
974
  return vec_xl(0, (const float *)(tmp));
 
  }
976
 
977
- static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
978
  float arr[4];
979
 
980
  // note: keep type-cast here to prevent compiler bugs
981
  // see: https://github.com/ggml-org/llama.cpp/issues/12846
982
- vec_xst(y, 0, (float *)(arr));
983
 
984
  for (int i = 0; i < 4; i++) {
985
- x[i] = GGML_FP32_TO_FP16(arr[i]);
986
  }
 
987
  }
988
 
989
  #define GGML_F16_VEC GGML_F32x4
@@ -1004,3 +1178,7 @@ static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
1004
  #define GGML_F32_ARR (GGML_F32_STEP/GGML_F32_EPR)
1005
  #define GGML_F16_ARR (GGML_F16_STEP/GGML_F16_EPR)
1006
  #endif
 
2
 
3
  #include "ggml-cpu-impl.h"
4
 
5
+ #ifdef __ARM_FEATURE_SVE
6
+ #include <arm_sve.h>
7
+ #endif // __ARM_FEATURE_SVE
8
+
9
+ #if defined(__ARM_NEON) && !defined(__CUDACC__) && !defined(__MUSACC__)
10
+ // if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
11
+ //
12
+ // $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
13
+ //
14
+ #include <arm_neon.h>
15
+ #endif
16
+
17
+ #if defined(__F16C__)
18
+ #include <immintrin.h>
19
+ #endif
20
+
21
+ #ifdef __cplusplus
22
+ extern "C" {
23
+ #endif
24
+
25
  //
26
  // simd mappings
27
  //
28
 
29
+ // FP16 to FP32 conversion
30
+
31
+ // 16-bit float
32
+ // on Arm, we use __fp16
33
+ // on x86, we use uint16_t
34
+ //
35
+ // for old CUDA compilers (<= 11), we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/10616
36
+ // for MUSA compilers, we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/11843
37
+ //
38
+ #if defined(__ARM_NEON) && !(defined(__CUDACC__) && __CUDACC_VER_MAJOR__ <= 11) && !defined(__MUSACC__)
39
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) neon_compute_fp16_to_fp32(x)
40
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) neon_compute_fp32_to_fp16(x)
41
+
42
+ #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
43
+
44
+ static inline float neon_compute_fp16_to_fp32(ggml_fp16_t h) {
45
+ __fp16 tmp;
46
+ memcpy(&tmp, &h, sizeof(ggml_fp16_t));
47
+ return (float)tmp;
48
+ }
49
+
50
+ static inline ggml_fp16_t neon_compute_fp32_to_fp16(float f) {
51
+ ggml_fp16_t res;
52
+ __fp16 tmp = f;
53
+ memcpy(&res, &tmp, sizeof(ggml_fp16_t));
54
+ return res;
55
+ }
56
+ #elif defined(__F16C__)
57
+ #ifdef _MSC_VER
58
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
59
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
60
+ #else
61
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
62
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
63
+ #endif
64
+ #elif defined(__POWER9_VECTOR__)
65
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) power_compute_fp16_to_fp32(x)
66
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) power_compute_fp32_to_fp16(x)
67
+ /* the inline asm below is about 12% faster than the lookup method */
68
+ #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
69
+ #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
70
+
71
+ static inline float power_compute_fp16_to_fp32(ggml_fp16_t h) {
72
+ float f;
73
+ double d;
74
+ __asm__(
75
+ "mtfprd %0,%2\n"
76
+ "xscvhpdp %0,%0\n"
77
+ "frsp %1,%0\n" :
78
+ /* temp */ "=d"(d),
79
+ /* out */ "=f"(f):
80
+ /* in */ "r"(h));
81
+ return f;
82
+ }
83
+
84
+ static inline ggml_fp16_t power_compute_fp32_to_fp16(float f) {
85
+ double d;
86
+ ggml_fp16_t r;
87
+ __asm__( /* xscvdphp can work on double or single precision */
88
+ "xscvdphp %0,%2\n"
89
+ "mffprd %1,%0\n" :
90
+ /* temp */ "=d"(d),
91
+ /* out */ "=r"(r):
92
+ /* in */ "f"(f));
93
+ return r;
94
+ }
95
+ #elif defined(__riscv) && defined(__riscv_zfhmin)
96
+ static inline float riscv_compute_fp16_to_fp32(ggml_fp16_t h) {
97
+ float f;
98
+ __asm__(
99
+ "fmv.h.x %[f], %[h]\n\t"
100
+ "fcvt.s.h %[f], %[f]"
101
+ : [f] "=&f" (f)
102
+ : [h] "r" (h)
103
+ );
104
+ return f;
105
+ }
106
+
107
+ static inline ggml_fp16_t riscv_compute_fp32_to_fp16(float f) {
108
+ ggml_fp16_t res;
109
+ __asm__(
110
+ "fcvt.h.s %[f], %[f]\n\t"
111
+ "fmv.x.h %[h], %[f]"
112
+ : [h] "=&r" (res)
113
+ : [f] "f" (f)
114
+ );
115
+ return res;
116
+ }
117
+
118
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) riscv_compute_fp16_to_fp32(x)
119
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) riscv_compute_fp32_to_fp16(x)
120
+ #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
121
+ #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
122
+ #elif defined(__NNPA__)
123
+ #define GGML_CPU_COMPUTE_FP16_TO_FP32(x) nnpa_compute_fp16_to_fp32(x)
124
+ #define GGML_CPU_COMPUTE_FP32_TO_FP16(x) nnpa_compute_fp32_to_fp16(x)
125
+
126
+ #define GGML_CPU_FP16_TO_FP32(x) GGML_CPU_COMPUTE_FP16_TO_FP32(x)
127
+ #define GGML_CPU_FP32_TO_FP16(x) GGML_CPU_COMPUTE_FP32_TO_FP16(x)
128
+
129
+ static inline float nnpa_compute_fp16_to_fp32(ggml_fp16_t h) {
130
+ uint16x8_t v_h = vec_splats(h);
131
+ uint16x8_t v_hd = vec_convert_from_fp16(v_h, 0);
132
+ return vec_extend_to_fp32_hi(v_hd, 0)[0];
133
+ }
134
+
135
+ static inline ggml_fp16_t nnpa_compute_fp32_to_fp16(float f) {
136
+ float32x4_t v_f = vec_splats(f);
137
+ float32x4_t v_zero = vec_splats(0.0f);
138
+ uint16x8_t v_hd = vec_round_from_fp32(v_f, v_zero, 0);
139
+ uint16x8_t v_h = vec_convert_to_fp16(v_hd, 0);
140
+ return vec_extract(v_h, 0);
141
+ }
142
+ #endif
143
+
144
+ // precomputed f32 table for f16 (256 KB)
145
+ // defined in ggml-cpu.c, initialized in ggml_cpu_init()
146
+ extern float ggml_table_f32_f16[1 << 16];
147
+
148
+ // On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
149
+ // so we define GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 elsewhere for NEON.
150
+ // This is also true for POWER9.
151
+ #if !defined(GGML_CPU_FP16_TO_FP32)
152
+ inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
153
+ uint16_t s;
154
+ memcpy(&s, &f, sizeof(uint16_t));
155
+ return ggml_table_f32_f16[s];
156
+ }
157
+
158
+ #define GGML_CPU_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
159
+ #endif
160
+
161
+ #if !defined(GGML_CPU_FP32_TO_FP16)
162
+ #define GGML_CPU_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
163
+ #endif
164
+
165
+
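
With the fallbacks above, GGML_CPU_FP16_TO_FP32 and GGML_CPU_FP32_TO_FP16 are defined on every path, hardware-accelerated or not. A hedged self-test sketch (not part of the patch; assumes <assert.h>, <stdint.h> and <string.h>): every finite fp16 bit pattern should survive an fp16 -> fp32 -> fp16 round trip, since fp32 represents all fp16 values exactly.

    static void check_cpu_fp16_roundtrip(void) {
        for (uint32_t u = 0; u < (1u << 16); ++u) {
            const uint16_t bits = (uint16_t) u;
            ggml_fp16_t h;
            memcpy(&h, &bits, sizeof(h));  // 16-bit copy, safe on big-endian s390x too
            const float f = GGML_CPU_FP16_TO_FP32(h);
            if (f != f) continue;          // NaN payloads need not round-trip bit-exactly
            const ggml_fp16_t h2 = GGML_CPU_FP32_TO_FP16(f);
            assert(GGML_CPU_FP16_TO_FP32(h2) == f);
        }
    }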
166
  // we define a common set of C macros which map to specific intrinsics based on the current architecture
167
  // we then implement the fundamental computation operations below using only these macros
168
  // adding support for new architectures requires to define the corresponding SIMD macros
 
572
  float tmp[8];
573
 
574
  for (int i = 0; i < 8; i++) {
575
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
576
  }
577
 
578
  return _mm256_loadu_ps(tmp);
 
583
  _mm256_storeu_ps(arr, y);
584
 
585
  for (int i = 0; i < 8; i++)
586
+ x[i] = GGML_CPU_FP32_TO_FP16(arr[i]);
587
  }
588
  #define GGML_F32Cx8_LOAD(x) __avx_f32cx8_load(x)
589
  #define GGML_F32Cx8_STORE(x, y) __avx_f32cx8_store(x, y)
 
731
  inline static v128_t __wasm_f16x4_load(const ggml_fp16_t * p) {
732
  float tmp[4];
733
 
734
+ tmp[0] = GGML_CPU_FP16_TO_FP32(p[0]);
735
+ tmp[1] = GGML_CPU_FP16_TO_FP32(p[1]);
736
+ tmp[2] = GGML_CPU_FP16_TO_FP32(p[2]);
737
+ tmp[3] = GGML_CPU_FP16_TO_FP32(p[3]);
738
 
739
  return wasm_v128_load(tmp);
740
  }
 
744
 
745
  wasm_v128_store(tmp, x);
746
 
747
+ p[0] = GGML_CPU_FP32_TO_FP16(tmp[0]);
748
+ p[1] = GGML_CPU_FP32_TO_FP16(tmp[1]);
749
+ p[2] = GGML_CPU_FP32_TO_FP16(tmp[2]);
750
+ p[3] = GGML_CPU_FP32_TO_FP16(tmp[3]);
751
  }
752
 
753
  #define GGML_F16x4 v128_t
 
847
  static inline __m128 __sse_f16x4_load(const ggml_fp16_t * x) {
848
  float tmp[4];
849
 
850
+ tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
851
+ tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
852
+ tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
853
+ tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
854
 
855
  return _mm_loadu_ps(tmp);
856
  }
 
860
 
861
  _mm_storeu_ps(arr, y);
862
 
863
+ x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
864
+ x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
865
+ x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
866
+ x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
867
  }
868
 
869
  #define GGML_F32Cx4 __m128
 
985
  #define GGML_F32x4_ZERO __lsx_vldi(0)
986
  #define GGML_F32x4_SET1(x) __lsx_vinsgr2vr_w(__lsx_vldi(0),(x), 0)
987
  #define GGML_F32x4_LOAD(x) __lsx_vld((x), 0)
988
+ #define GGML_F32x4_STORE(x, y) __lsx_vst(y, x, 0)
989
  #define GGML_F32x4_FMA(a, b, c) __lsx_vfmadd_s(b, c, a)
990
  #define GGML_F32x4_ADD __lsx_vfadd_s
991
  #define GGML_F32x4_MUL __lsx_vfmul_s
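
Note the STORE line above is also a genuine fix: the old form #define GGML_F32x4_STORE((x),(y)) parenthesized its macro parameters, which is not valid C (parameter names must be plain identifiers), so the definition could not even preprocess on the LoongArch path. The corrected macro expands as expected:

    // GGML_F32x4_STORE(p, v)  ->  __lsx_vst(v, p, 0)
    // i.e. store vector v at float pointer p, byte offset 0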
 
1031
  static inline __m128 __lsx_f16x4_load(const ggml_fp16_t * x) {
1032
  float tmp[4];
1033
 
1034
+ tmp[0] = GGML_CPU_FP16_TO_FP32(x[0]);
1035
+ tmp[1] = GGML_CPU_FP16_TO_FP32(x[1]);
1036
+ tmp[2] = GGML_CPU_FP16_TO_FP32(x[2]);
1037
+ tmp[3] = GGML_CPU_FP16_TO_FP32(x[3]);
1038
 
1039
  return __lsx_vld(tmp, 0);
1040
  }
 
1044
 
1045
  __lsx_vst(y, arr, 0);
1046
 
1047
+ x[0] = GGML_CPU_FP32_TO_FP16(arr[0]);
1048
+ x[1] = GGML_CPU_FP32_TO_FP16(arr[1]);
1049
+ x[2] = GGML_CPU_FP32_TO_FP16(arr[2]);
1050
+ x[3] = GGML_CPU_FP32_TO_FP16(arr[3]);
1051
  }
1052
 
1053
  #define GGML_F32Cx4 __m128
 
1079
  #define GGML_F32_STEP 32
1080
  #define GGML_F32_EPR 4
1081
 
1082
+ #define GGML_F32x4 float32x4_t
1083
  #define GGML_F32x4_ZERO vec_splats(0.0f)
1084
  #define GGML_F32x4_SET1 vec_splats
1085
  #define GGML_F32x4_LOAD(p) vec_xl(0, p)
 
1119
  #define GGML_F16_STEP GGML_F32_STEP
1120
  #define GGML_F16_EPR GGML_F32_EPR
1121
 
1122
+ static inline float32x4_t __lzs_f16cx4_load(const ggml_fp16_t * x) {
1123
+ #if defined(__NNPA__)
1124
+ uint16x8_t v_x = vec_xl(0, (const ggml_fp16_t *)x);
1125
+ uint16x8_t v_xd = vec_convert_from_fp16(v_x, 0);
1126
+ return vec_extend_to_fp32_hi(v_xd, 0);
1127
+ #else
1128
  float tmp[4];
1129
 
1130
  for (int i = 0; i < 4; i++) {
1131
+ tmp[i] = GGML_CPU_FP16_TO_FP32(x[i]);
1132
  }
1133
 
1134
  // note: keep type-cast here to prevent compiler bugs
1135
  // see: https://github.com/ggml-org/llama.cpp/issues/12846
1136
  return vec_xl(0, (const float *)(tmp));
1137
+ #endif
1138
  }
1139
 
1140
+ static inline void __lzs_f16cx4_store(ggml_fp16_t * x, float32x4_t v_y) {
1141
+ #if defined(__NNPA__)
1142
+ float32x4_t v_zero = vec_splats(0.0f);
1143
+ uint16x8_t v_xd = vec_round_from_fp32(v_y, v_zero, 0);
1144
+ uint16x8_t v_x = vec_convert_to_fp16(v_xd, 0);
1145
+
1146
+ x[0] = vec_extract(v_x, 0);
1147
+ x[1] = vec_extract(v_x, 1);
1148
+ x[2] = vec_extract(v_x, 2);
1149
+ x[3] = vec_extract(v_x, 3);
1150
+ #else
1151
  float arr[4];
1152
 
1153
  // note: keep type-cast here to prevent compiler bugs
1154
  // see: https://github.com/ggml-org/llama.cpp/issues/12846
1155
+ vec_xst(v_y, 0, (float *)(arr));
1156
 
1157
  for (int i = 0; i < 4; i++) {
1158
+ x[i] = GGML_CPU_FP32_TO_FP16(arr[i]);
1159
  }
1160
+ #endif
1161
  }
1162
 
1163
  #define GGML_F16_VEC GGML_F32x4
 
1178
  #define GGML_F32_ARR (GGML_F32_STEP/GGML_F32_EPR)
1179
  #define GGML_F16_ARR (GGML_F16_STEP/GGML_F16_EPR)
1180
  #endif
1181
+
1182
+ #ifdef __cplusplus
1183
+ }
1184
+ #endif
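
On s390x these two helpers back the GGML_F16_VEC load/store macros for the whole SIMD layer. An illustrative use, assuming the float32x4_t typedef above and vec_mul from vecintrin.h; note the NNPA load reads a full 16-byte vector (eight fp16 values) even though only four are widened, so at least eight halfwords must be readable at x:

    static void scale4_f16(ggml_fp16_t * x, const float v) {
        float32x4_t vx = __lzs_f16cx4_load(x);  // fp16 -> fp32, NNPA fast path if compiled in
        vx = vec_mul(vx, vec_splats(v));        // arithmetic happens in fp32
        __lzs_f16cx4_store(x, vx);              // fp32 -> fp16 on the way back
    }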
ggml/src/ggml-cpu/vec.cpp CHANGED
@@ -219,11 +219,11 @@ void ggml_vec_dot_f16(int n, float * GGML_RESTRICT s, size_t bs, ggml_fp16_t * G
219
 
220
  // leftovers
221
  for (int i = np; i < n; ++i) {
222
- sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
223
  }
224
  #else
225
  for (int i = 0; i < n; ++i) {
226
- sumf += (ggml_float)(GGML_FP16_TO_FP32(x[i])*GGML_FP16_TO_FP32(y[i]));
227
  }
228
  #endif
229
 
 
219
 
220
  // leftovers
221
  for (int i = np; i < n; ++i) {
222
+ sumf += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[i])*GGML_CPU_FP16_TO_FP32(y[i]));
223
  }
224
  #else
225
  for (int i = 0; i < n; ++i) {
226
+ sumf += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[i])*GGML_CPU_FP16_TO_FP32(y[i]));
227
  }
228
  #endif
229
 
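For context, the patched loops are only the scalar tails of ggml_vec_dot_f16; the SIMD body is untouched. The surrounding structure, sketched in comments (np is the largest multiple of GGML_F16_STEP not exceeding n):

    #if defined(GGML_SIMD)
        const int np = (n & ~(GGML_F16_STEP - 1));
        // ... vectorized main loop over [0, np) via the GGML_F16_VEC_* macros ...
        // leftovers: the scalar tail over [np, n) shown in the hunk above
    #else
        // no SIMD available: the scalar loop covers the whole range [0, n)
    #endif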
ggml/src/ggml-cpu/vec.h CHANGED
@@ -58,7 +58,7 @@ inline static void ggml_vec_set_bf16(const int n, ggml_bf16_t * x, const ggml_bf
58
  inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] + y[i]; }
59
  inline static void ggml_vec_add_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
60
  for (int i = 0; i < n; ++i) {
61
- z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) + GGML_FP16_TO_FP32(y[i]));
62
  }
63
  }
64
  inline static void ggml_vec_add1_f32(const int n, float * z, const float * x, const float v) { for (int i = 0; i < n; ++i) z[i] = x[i] + v; }
@@ -67,7 +67,7 @@ inline static void ggml_vec_acc1_f32(const int n, float * y, const float v)
67
  inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] - y[i]; }
68
  inline static void ggml_vec_sub_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
69
  for (int i = 0; i < n; ++i) {
70
- z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) - GGML_FP16_TO_FP32(y[i]));
71
  }
72
  }
73
  inline static void ggml_vec_set_f32 (const int n, float * x, const float v) { for (int i = 0; i < n; ++i) x[i] = v; }
@@ -75,20 +75,20 @@ inline static void ggml_vec_cpy_f32 (const int n, float * y, const float * x)
75
  inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = -x[i]; }
76
  inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
77
  for (int i = 0; i < n; ++i) {
78
- y[i] = GGML_FP32_TO_FP16(-GGML_FP16_TO_FP32(x[i]));
79
  }
80
  }
81
 
82
  inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]*y[i]; }
83
  inline static void ggml_vec_mul_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
84
  for (int i = 0; i < n; ++i) {
85
- z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) * GGML_FP16_TO_FP32(y[i]));
86
  }
87
  }
88
  inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]/y[i]; }
89
  inline static void ggml_vec_div_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
90
  for (int i = 0; i < n; ++i) {
91
- z[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(x[i]) / GGML_FP16_TO_FP32(y[i]));
92
  }
93
  }
94
 
@@ -131,13 +131,13 @@ inline static void ggml_vec_dot_f16_unroll(const int n, const int xs, float * GG
131
  // leftovers
132
  for (int i = np; i < n; ++i) {
133
  for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
134
- sumf[j] += (ggml_float)(GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]));
135
  }
136
  }
137
  #else
138
  for (int i = 0; i < n; ++i) {
139
  for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
140
- sumf[j] += (ggml_float)(GGML_FP16_TO_FP32(x[j][i])*GGML_FP16_TO_FP32(y[i]));
141
  }
142
  }
143
  #endif
@@ -280,12 +280,12 @@ inline static void ggml_vec_mad_f16(const int n, ggml_fp16_t * GGML_RESTRICT y,
280
 
281
  // leftovers
282
  for (int i = np; i < n; ++i) {
283
- y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
284
  }
285
  #else
286
  // scalar
287
  for (int i = 0; i < n; ++i) {
288
- y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i]) + GGML_FP16_TO_FP32(x[i])*v);
289
  }
290
  #endif
291
  }
@@ -430,12 +430,12 @@ inline static void ggml_vec_scale_f16(const int n, ggml_fp16_t * y, const float
430
 
431
  // leftovers
432
  for (int i = np; i < n; ++i) {
433
- y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i])*v);
434
  }
435
  #else
436
  // scalar
437
  for (int i = 0; i < n; ++i) {
438
- y[i] = GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(y[i])*v);
439
  }
440
  #endif
441
  }
@@ -444,103 +444,103 @@ inline static void ggml_vec_norm_f32 (const int n, float * s, const float * x) {
444
  inline static void ggml_vec_sqr_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
445
  inline static void ggml_vec_sqr_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
446
  for (int i = 0; i < n; ++i) {
447
- float v = GGML_FP16_TO_FP32(x[i]);
448
- y[i] = GGML_FP32_TO_FP16(v*v);
449
  }
450
  }
451
  inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrtf(x[i]); }
452
  inline static void ggml_vec_sqrt_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
453
  for (int i = 0; i < n; ++i) {
454
- y[i] = GGML_FP32_TO_FP16(sqrtf(GGML_FP16_TO_FP32(x[i])));
455
  }
456
  }
457
  inline static void ggml_vec_log_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = logf(x[i]); }
458
  inline static void ggml_vec_log_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
459
  for (int i = 0; i < n; ++i) {
460
- y[i] = GGML_FP32_TO_FP16(logf(GGML_FP16_TO_FP32(x[i])));
461
  }
462
  }
463
  inline static void ggml_vec_sin_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sinf(x[i]); }
464
  inline static void ggml_vec_sin_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
465
  for (int i = 0; i < n; ++i) {
466
- y[i] = GGML_FP32_TO_FP16(sinf(GGML_FP16_TO_FP32(x[i])));
467
  }
468
  }
469
  inline static void ggml_vec_cos_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = cosf(x[i]); }
470
  inline static void ggml_vec_cos_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
471
  for (int i = 0; i < n; ++i) {
472
- y[i] = GGML_FP32_TO_FP16(cosf(GGML_FP16_TO_FP32(x[i])));
473
  }
474
  }
475
  inline static void ggml_vec_abs_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
476
  inline static void ggml_vec_abs_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
477
  for (int i = 0; i < n; ++i) {
478
- y[i] = GGML_FP32_TO_FP16(fabsf(GGML_FP16_TO_FP32(x[i])));
479
  }
480
  }
481
  inline static void ggml_vec_sgn_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
482
  inline static void ggml_vec_sgn_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
483
  for (int i = 0; i < n; ++i) {
484
- float v = GGML_FP16_TO_FP32(x[i]);
485
- y[i] = GGML_FP32_TO_FP16((v > 0.f) ? 1.f : ((v < 0.f) ? -1.f : 0.f));
486
  }
487
  }
488
  inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
489
  inline static void ggml_vec_step_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
490
  for (int i = 0; i < n; ++i) {
491
- y[i] = GGML_FP32_TO_FP16((GGML_FP16_TO_FP32(x[i]) > 0.f) ? 1.f : 0.f);
492
  }
493
  }
494
  inline static void ggml_vec_tanh_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = tanhf(x[i]); }
495
  inline static void ggml_vec_tanh_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
496
  for (int i = 0; i < n; ++i) {
497
- y[i] = GGML_FP32_TO_FP16(tanhf(GGML_FP16_TO_FP32(x[i])));
498
  }
499
  }
500
  inline static void ggml_vec_elu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : expm1f(x[i]); }
501
  inline static void ggml_vec_elu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
502
  for (int i = 0; i < n; ++i) {
503
- y[i] = GGML_FP32_TO_FP16(expm1f(GGML_FP16_TO_FP32(x[i])));
504
  }
505
  }
506
  inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }
507
  inline static void ggml_vec_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
508
  for (int i = 0; i < n; ++i) {
509
- float v = GGML_FP16_TO_FP32(x[i]);
510
- y[i] = GGML_FP32_TO_FP16((v > 0.f) ? v : 0.f);
511
  }
512
  }
513
  inline static void ggml_vec_leaky_relu_f32 (const int n, float * y, const float * x, const float ns) { for (int i = 0; i < n; ++i) y[i] = ((x[i] > 0.f) ? x[i] : 0.f) + ns * ((x[i] < 0.0f) ? x[i] : 0.f); }
514
  inline static void ggml_vec_leaky_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x, const float ns) {
515
  for (int i = 0; i < n; ++i) {
516
- float v = GGML_FP16_TO_FP32(x[i]);
517
- y[i] = GGML_FP32_TO_FP16(((v > 0.f) ? v : 0.f) + ns * ((v < 0.0f) ? v : 0.f));
518
  }
519
  }
520
  inline static void ggml_vec_sigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = 1.f / (1.f + expf(-x[i])); }
521
  inline static void ggml_vec_sigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
522
  for (int i = 0; i < n; ++i) {
523
- y[i] = GGML_FP32_TO_FP16(1.f / (1.f + expf(-GGML_FP16_TO_FP32(x[i]))));
524
  }
525
  }
526
  // TODO: optimize performance
527
  inline static void ggml_vec_hardswish_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
528
  inline static void ggml_vec_hardswish_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
529
  for (int i = 0; i < n; ++i) {
530
- float v = GGML_FP16_TO_FP32(x[i]);
531
- y[i] = GGML_FP32_TO_FP16(v * fminf(1.0f, fmaxf(0.0f, (v + 3.0f) / 6.0f)));
532
  }
533
  }
534
  inline static void ggml_vec_hardsigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
535
  inline static void ggml_vec_hardsigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
536
  for (int i = 0; i < n; ++i) {
537
- y[i] = GGML_FP32_TO_FP16(fminf(1.0f, fmaxf(0.0f, (GGML_FP16_TO_FP32(x[i]) + 3.0f) / 6.0f)));
538
  }
539
  }
540
  inline static void ggml_vec_exp_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = expf(x[i]); }
541
  inline static void ggml_vec_exp_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
542
  for (int i = 0; i < n; ++i) {
543
- y[i] = GGML_FP32_TO_FP16(expf(GGML_FP16_TO_FP32(x[i])));
544
  }
545
  }
546
 
@@ -562,9 +562,9 @@ inline static void ggml_vec_gelu_f16(const int n, ggml_fp16_t * y, const ggml_fp
562
 
563
  inline static void ggml_vec_gelu_erf_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
564
  for (int i = 0; i < n; ++i) {
565
- float xi = GGML_FP16_TO_FP32(x[i]);
566
  float res = 0.5f*xi*(1.0f + erff(xi*SQRT_2_INV));
567
- y[i] = GGML_FP32_TO_FP16(res);
568
  }
569
  }
570
 
@@ -577,9 +577,9 @@ inline static void ggml_vec_gelu_f32(const int n, float * y, const float * x) {
577
  } else if (x[i] >= 10.0f) {
578
  y[i] = x[i];
579
  } else {
580
- ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
581
  memcpy(&t, &fp16, sizeof(uint16_t));
582
- y[i] = GGML_FP16_TO_FP32(ggml_table_gelu_f16[t]);
583
  }
584
  }
585
  }
@@ -613,9 +613,9 @@ inline static float ggml_gelu_quick_f32(float x) {
613
  inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
614
  uint16_t t;
615
  for (int i = 0; i < n; ++i) {
616
- ggml_fp16_t fp16 = GGML_FP32_TO_FP16(x[i]);
617
  memcpy(&t, &fp16, sizeof(uint16_t));
618
- y[i] = GGML_FP16_TO_FP32(ggml_table_gelu_quick_f16[t]);
619
  }
620
  }
621
  #else
@@ -628,8 +628,8 @@ inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float *
628
 
629
  inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
630
  for (int i = 0; i < n; ++i) {
631
- float v = GGML_FP16_TO_FP32(x[i]);
632
- y[i] = GGML_FP32_TO_FP16(v*(1.0f/(1.0f+expf(GELU_QUICK_COEF*v))));
633
  }
634
  }
635
 
@@ -638,8 +638,8 @@ inline static float ggml_silu_f32(float x) {
638
  return x/(1.0f + expf(-x));
639
  }
640
  inline static ggml_fp16_t ggml_silu_f16(ggml_fp16_t x) {
641
- float v = GGML_FP16_TO_FP32(x);
642
- return GGML_FP32_TO_FP16(v/(1.0f + expf(-v)));
643
  }
644
 
645
  #if __FINITE_MATH_ONLY__
@@ -888,9 +888,9 @@ inline static float ggml_silu_backward_f32(float x, float dy) {
888
  }
889
 
890
  inline static ggml_fp16_t ggml_silu_backward_f16(ggml_fp16_t x, ggml_fp16_t dy) {
891
- const float v = GGML_FP16_TO_FP32(x);
892
  const float s = 1.0f/(1.0f + expf(-v));
893
- return GGML_FP32_TO_FP16(GGML_FP16_TO_FP32(dy)*s*(1.0f + v*(1.0f - s)));
894
  }
895
 
896
  inline static void ggml_vec_silu_backward_f32(const int n, float * dx, const float * x, const float * dy) {
@@ -928,7 +928,7 @@ inline static void ggml_vec_sum_f32_ggf(const int n, ggml_float * s, const float
928
  inline static void ggml_vec_sum_f16_ggf(const int n, float * s, const ggml_fp16_t * x) {
929
  float sum = 0.0f;
930
  for (int i = 0; i < n; ++i) {
931
- sum += GGML_FP16_TO_FP32(x[i]);
932
  }
933
  *s = sum;
934
  }
 
58
  inline static void ggml_vec_add_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] + y[i]; }
59
  inline static void ggml_vec_add_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
60
  for (int i = 0; i < n; ++i) {
61
+ z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) + GGML_CPU_FP16_TO_FP32(y[i]));
62
  }
63
  }
64
  inline static void ggml_vec_add1_f32(const int n, float * z, const float * x, const float v) { for (int i = 0; i < n; ++i) z[i] = x[i] + v; }
 
67
  inline static void ggml_vec_sub_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i] - y[i]; }
68
  inline static void ggml_vec_sub_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
69
  for (int i = 0; i < n; ++i) {
70
+ z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) - GGML_CPU_FP16_TO_FP32(y[i]));
71
  }
72
  }
73
  inline static void ggml_vec_set_f32 (const int n, float * x, const float v) { for (int i = 0; i < n; ++i) x[i] = v; }
 
75
  inline static void ggml_vec_neg_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = -x[i]; }
76
  inline static void ggml_vec_neg_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
77
  for (int i = 0; i < n; ++i) {
78
+ y[i] = GGML_CPU_FP32_TO_FP16(-GGML_CPU_FP16_TO_FP32(x[i]));
79
  }
80
  }
81
 
82
  inline static void ggml_vec_mul_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]*y[i]; }
83
  inline static void ggml_vec_mul_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
84
  for (int i = 0; i < n; ++i) {
85
+ z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) * GGML_CPU_FP16_TO_FP32(y[i]));
86
  }
87
  }
88
  inline static void ggml_vec_div_f32 (const int n, float * z, const float * x, const float * y) { for (int i = 0; i < n; ++i) z[i] = x[i]/y[i]; }
89
  inline static void ggml_vec_div_f16 (const int n, ggml_fp16_t * z, const ggml_fp16_t * x, const ggml_fp16_t * y) {
90
  for (int i = 0; i < n; ++i) {
91
+ z[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(x[i]) / GGML_CPU_FP16_TO_FP32(y[i]));
92
  }
93
  }
94
 
 
131
  // leftovers
132
  for (int i = np; i < n; ++i) {
133
  for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
134
+ sumf[j] += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[j][i])*GGML_CPU_FP16_TO_FP32(y[i]));
135
  }
136
  }
137
  #else
138
  for (int i = 0; i < n; ++i) {
139
  for (int j = 0; j < GGML_VEC_DOT_UNROLL; ++j) {
140
+ sumf[j] += (ggml_float)(GGML_CPU_FP16_TO_FP32(x[j][i])*GGML_CPU_FP16_TO_FP32(y[i]));
141
  }
142
  }
143
  #endif
 
280
 
281
  // leftovers
282
  for (int i = np; i < n; ++i) {
283
+ y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
284
  }
285
  #else
286
  // scalar
287
  for (int i = 0; i < n; ++i) {
288
+ y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i]) + GGML_CPU_FP16_TO_FP32(x[i])*v);
289
  }
290
  #endif
291
  }
 
430
 
431
  // leftovers
432
  for (int i = np; i < n; ++i) {
433
+ y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i])*v);
434
  }
435
  #else
436
  // scalar
437
  for (int i = 0; i < n; ++i) {
438
+ y[i] = GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(y[i])*v);
439
  }
440
  #endif
441
  }
 
444
  inline static void ggml_vec_sqr_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i]*x[i]; }
445
  inline static void ggml_vec_sqr_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
446
  for (int i = 0; i < n; ++i) {
447
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
448
+ y[i] = GGML_CPU_FP32_TO_FP16(v*v);
449
  }
450
  }
451
  inline static void ggml_vec_sqrt_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sqrtf(x[i]); }
452
  inline static void ggml_vec_sqrt_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
453
  for (int i = 0; i < n; ++i) {
454
+ y[i] = GGML_CPU_FP32_TO_FP16(sqrtf(GGML_CPU_FP16_TO_FP32(x[i])));
455
  }
456
  }
457
  inline static void ggml_vec_log_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = logf(x[i]); }
458
  inline static void ggml_vec_log_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
459
  for (int i = 0; i < n; ++i) {
460
+ y[i] = GGML_CPU_FP32_TO_FP16(logf(GGML_CPU_FP16_TO_FP32(x[i])));
461
  }
462
  }
463
  inline static void ggml_vec_sin_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = sinf(x[i]); }
464
  inline static void ggml_vec_sin_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
465
  for (int i = 0; i < n; ++i) {
466
+ y[i] = GGML_CPU_FP32_TO_FP16(sinf(GGML_CPU_FP16_TO_FP32(x[i])));
467
  }
468
  }
469
  inline static void ggml_vec_cos_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = cosf(x[i]); }
470
  inline static void ggml_vec_cos_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
471
  for (int i = 0; i < n; ++i) {
472
+ y[i] = GGML_CPU_FP32_TO_FP16(cosf(GGML_CPU_FP16_TO_FP32(x[i])));
473
  }
474
  }
475
  inline static void ggml_vec_abs_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fabsf(x[i]); }
476
  inline static void ggml_vec_abs_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
477
  for (int i = 0; i < n; ++i) {
478
+ y[i] = GGML_CPU_FP32_TO_FP16(fabsf(GGML_CPU_FP16_TO_FP32(x[i])));
479
  }
480
  }
481
  inline static void ggml_vec_sgn_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : ((x[i] < 0.f) ? -1.f : 0.f); }
482
  inline static void ggml_vec_sgn_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
483
  for (int i = 0; i < n; ++i) {
484
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
485
+ y[i] = GGML_CPU_FP32_TO_FP16((v > 0.f) ? 1.f : ((v < 0.f) ? -1.f : 0.f));
486
  }
487
  }
488
  inline static void ggml_vec_step_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? 1.f : 0.f; }
489
  inline static void ggml_vec_step_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
490
  for (int i = 0; i < n; ++i) {
491
+ y[i] = GGML_CPU_FP32_TO_FP16((GGML_CPU_FP16_TO_FP32(x[i]) > 0.f) ? 1.f : 0.f);
492
  }
493
  }
494
  inline static void ggml_vec_tanh_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = tanhf(x[i]); }
495
  inline static void ggml_vec_tanh_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
496
  for (int i = 0; i < n; ++i) {
497
+ y[i] = GGML_CPU_FP32_TO_FP16(tanhf(GGML_CPU_FP16_TO_FP32(x[i])));
498
  }
499
  }
500
  inline static void ggml_vec_elu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : expm1f(x[i]); }
501
  inline static void ggml_vec_elu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
502
  for (int i = 0; i < n; ++i) {
503
+ y[i] = GGML_CPU_FP32_TO_FP16(expm1f(GGML_CPU_FP16_TO_FP32(x[i])));
504
  }
505
  }
506
  inline static void ggml_vec_relu_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = (x[i] > 0.f) ? x[i] : 0.f; }
507
  inline static void ggml_vec_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
508
  for (int i = 0; i < n; ++i) {
509
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
510
+ y[i] = GGML_CPU_FP32_TO_FP16((v > 0.f) ? v : 0.f);
511
  }
512
  }
513
  inline static void ggml_vec_leaky_relu_f32 (const int n, float * y, const float * x, const float ns) { for (int i = 0; i < n; ++i) y[i] = ((x[i] > 0.f) ? x[i] : 0.f) + ns * ((x[i] < 0.0f) ? x[i] : 0.f); }
514
  inline static void ggml_vec_leaky_relu_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x, const float ns) {
515
  for (int i = 0; i < n; ++i) {
516
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
517
+ y[i] = GGML_CPU_FP32_TO_FP16(((v > 0.f) ? v : 0.f) + ns * ((v < 0.0f) ? v : 0.f));
518
  }
519
  }
520
  inline static void ggml_vec_sigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = 1.f / (1.f + expf(-x[i])); }
521
  inline static void ggml_vec_sigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
522
  for (int i = 0; i < n; ++i) {
523
+ y[i] = GGML_CPU_FP32_TO_FP16(1.f / (1.f + expf(-GGML_CPU_FP16_TO_FP32(x[i]))));
524
  }
525
  }
526
  // TODO: optimize performance
527
  inline static void ggml_vec_hardswish_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
528
  inline static void ggml_vec_hardswish_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
529
  for (int i = 0; i < n; ++i) {
530
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
531
+ y[i] = GGML_CPU_FP32_TO_FP16(v * fminf(1.0f, fmaxf(0.0f, (v + 3.0f) / 6.0f)));
532
  }
533
  }
534
  inline static void ggml_vec_hardsigmoid_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f)); }
535
  inline static void ggml_vec_hardsigmoid_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
536
  for (int i = 0; i < n; ++i) {
537
+ y[i] = GGML_CPU_FP32_TO_FP16(fminf(1.0f, fmaxf(0.0f, (GGML_CPU_FP16_TO_FP32(x[i]) + 3.0f) / 6.0f)));
538
  }
539
  }
540
  inline static void ggml_vec_exp_f32 (const int n, float * y, const float * x) { for (int i = 0; i < n; ++i) y[i] = expf(x[i]); }
541
  inline static void ggml_vec_exp_f16 (const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
542
  for (int i = 0; i < n; ++i) {
543
+ y[i] = GGML_CPU_FP32_TO_FP16(expf(GGML_CPU_FP16_TO_FP32(x[i])));
544
  }
545
  }
546
 
 
562
 
563
  inline static void ggml_vec_gelu_erf_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
564
  for (int i = 0; i < n; ++i) {
565
+ float xi = GGML_CPU_FP16_TO_FP32(x[i]);
566
  float res = 0.5f*xi*(1.0f + erff(xi*SQRT_2_INV));
567
+ y[i] = GGML_CPU_FP32_TO_FP16(res);
568
  }
569
  }
570
 
 
577
  } else if (x[i] >= 10.0f) {
578
  y[i] = x[i];
579
  } else {
580
+ ggml_fp16_t fp16 = GGML_CPU_FP32_TO_FP16(x[i]);
581
  memcpy(&t, &fp16, sizeof(uint16_t));
582
+ y[i] = GGML_CPU_FP16_TO_FP32(ggml_table_gelu_f16[t]);
583
  }
584
  }
585
  }
 
613
  inline static void ggml_vec_gelu_quick_f32(const int n, float * y, const float * x) {
614
  uint16_t t;
615
  for (int i = 0; i < n; ++i) {
616
+ ggml_fp16_t fp16 = GGML_CPU_FP32_TO_FP16(x[i]);
617
  memcpy(&t, &fp16, sizeof(uint16_t));
618
+ y[i] = GGML_CPU_FP16_TO_FP32(ggml_table_gelu_quick_f16[t]);
619
  }
620
  }
621
  #else
 
628
 
629
  inline static void ggml_vec_gelu_quick_f16(const int n, ggml_fp16_t * y, const ggml_fp16_t * x) {
630
  for (int i = 0; i < n; ++i) {
631
+ float v = GGML_CPU_FP16_TO_FP32(x[i]);
632
+ y[i] = GGML_CPU_FP32_TO_FP16(v*(1.0f/(1.0f+expf(GELU_QUICK_COEF*v))));
633
  }
634
  }
635
 
 
638
  return x/(1.0f + expf(-x));
639
  }
640
  inline static ggml_fp16_t ggml_silu_f16(ggml_fp16_t x) {
641
+ float v = GGML_CPU_FP16_TO_FP32(x);
642
+ return GGML_CPU_FP32_TO_FP16(v/(1.0f + expf(-v)));
643
  }
644
 
645
  #if __FINITE_MATH_ONLY__
 
888
  }
889
 
890
  inline static ggml_fp16_t ggml_silu_backward_f16(ggml_fp16_t x, ggml_fp16_t dy) {
891
+ const float v = GGML_CPU_FP16_TO_FP32(x);
892
  const float s = 1.0f/(1.0f + expf(-v));
893
+ return GGML_CPU_FP32_TO_FP16(GGML_CPU_FP16_TO_FP32(dy)*s*(1.0f + v*(1.0f - s)));
894
  }
895
 
896
  inline static void ggml_vec_silu_backward_f32(const int n, float * dx, const float * x, const float * dy) {
 
928
  inline static void ggml_vec_sum_f16_ggf(const int n, float * s, const ggml_fp16_t * x) {
929
  float sum = 0.0f;
930
  for (int i = 0; i < n; ++i) {
931
+ sum += GGML_CPU_FP16_TO_FP32(x[i]);
932
  }
933
  *s = sum;
934
  }
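
All of the f16 element-wise kernels above share one shape: widen to fp32, apply the operation in fp32, narrow back to fp16. A generic sketch of that shape (ggml_vec_map_f16 is hypothetical, not part of the patch):

    static inline void ggml_vec_map_f16(const int n, ggml_fp16_t * y,
                                        const ggml_fp16_t * x, float (*op)(float)) {
        for (int i = 0; i < n; ++i) {
            y[i] = GGML_CPU_FP32_TO_FP16(op(GGML_CPU_FP16_TO_FP32(x[i])));
        }
    }
    // e.g. ggml_vec_sqrt_f16 is ggml_vec_map_f16(n, y, x, sqrtf)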
ggml/src/ggml-impl.h CHANGED
@@ -317,203 +317,81 @@ struct ggml_cgraph ggml_graph_view(struct ggml_cgraph * cgraph, int i0, int i1);
317
  GGML_API void * ggml_aligned_malloc(size_t size);
318
  GGML_API void ggml_aligned_free(void * ptr, size_t size);
319
 
320
- // FP16 to FP32 conversion
 
321
 
322
- // 16-bit float
323
- // on Arm, we use __fp16
324
- // on x86, we use uint16_t
325
- //
326
- // for old CUDA compilers (<= 11), we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/10616
327
- // for MUSA compilers , we use uint16_t: ref https://github.com/ggml-org/llama.cpp/pull/11843
328
- //
329
- #if defined(__ARM_NEON) && !(defined(__CUDACC__) && __CUDACC_VER_MAJOR__ <= 11) && !defined(__MUSACC__)
330
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
331
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
332
-
333
- #define GGML_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
334
-
335
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
336
- __fp16 tmp;
337
- memcpy(&tmp, &h, sizeof(ggml_fp16_t));
338
- return (float)tmp;
339
- }
340
-
341
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
342
- ggml_fp16_t res;
343
- __fp16 tmp = f;
344
- memcpy(&res, &tmp, sizeof(ggml_fp16_t));
345
- return res;
346
- }
347
-
348
- #elif defined(__F16C__)
349
-
350
- #ifdef _MSC_VER
351
- #define GGML_COMPUTE_FP16_TO_FP32(x) _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(x)))
352
- #define GGML_COMPUTE_FP32_TO_FP16(x) _mm_extract_epi16(_mm_cvtps_ph(_mm_set_ss(x), 0), 0)
353
- #else
354
- #define GGML_COMPUTE_FP16_TO_FP32(x) _cvtsh_ss(x)
355
- #define GGML_COMPUTE_FP32_TO_FP16(x) _cvtss_sh(x, 0)
356
- #endif
357
-
358
- #elif defined(__POWER9_VECTOR__)
359
-
360
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
361
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
362
- /* the inline asm below is about 12% faster than the lookup method */
363
- #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
364
- #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
365
-
366
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
367
- float f;
368
- double d;
369
- __asm__(
370
- "mtfprd %0,%2\n"
371
- "xscvhpdp %0,%0\n"
372
- "frsp %1,%0\n" :
373
- /* temp */ "=d"(d),
374
- /* out */ "=f"(f):
375
- /* in */ "r"(h));
376
- return f;
377
- }
378
-
379
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
380
- double d;
381
- ggml_fp16_t r;
382
- __asm__( /* xscvdphp can work on double or single precision */
383
- "xscvdphp %0,%2\n"
384
- "mffprd %1,%0\n" :
385
- /* temp */ "=d"(d),
386
- /* out */ "=r"(r):
387
- /* in */ "f"(f));
388
- return r;
389
- }
390
-
391
- #elif defined(__riscv) && defined(__riscv_zfhmin)
392
-
393
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
394
- float f;
395
- __asm__(
396
- "fmv.h.x %[f], %[h]\n\t"
397
- "fcvt.s.h %[f], %[f]"
398
- : [f] "=&f" (f)
399
- : [h] "r" (h)
400
- );
401
- return f;
402
- }
403
 
404
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
405
- ggml_fp16_t res;
406
- __asm__(
407
- "fcvt.h.s %[f], %[f]\n\t"
408
- "fmv.x.h %[h], %[f]"
409
- : [h] "=&r" (res)
410
- : [f] "f" (f)
411
- );
412
- return res;
413
- }
414
 
415
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
416
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
417
- #define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
418
- #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
419
 
420
  #else
421
 
422
- // FP16 <-> FP32
423
- // ref: https://github.com/Maratyszcza/FP16
424
-
425
- static inline float fp32_from_bits(uint32_t w) {
426
- union {
427
- uint32_t as_bits;
428
- float as_value;
429
- } fp32;
430
- fp32.as_bits = w;
431
- return fp32.as_value;
432
- }
433
-
434
- static inline uint32_t fp32_to_bits(float f) {
435
- union {
436
- float as_value;
437
- uint32_t as_bits;
438
- } fp32;
439
- fp32.as_value = f;
440
- return fp32.as_bits;
441
- }
442
-
443
- static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
444
- const uint32_t w = (uint32_t) h << 16;
445
- const uint32_t sign = w & UINT32_C(0x80000000);
446
- const uint32_t two_w = w + w;
447
-
448
- const uint32_t exp_offset = UINT32_C(0xE0) << 23;
449
- #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
450
- const float exp_scale = 0x1.0p-112f;
451
- #else
452
- const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
453
- #endif
454
- const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;
455
-
456
- const uint32_t magic_mask = UINT32_C(126) << 23;
457
- const float magic_bias = 0.5f;
458
- const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;
459
 
460
- const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
461
- const uint32_t result = sign |
462
- (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
463
- return fp32_from_bits(result);
464
- }
465
-
466
- static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
467
- #if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
468
- const float scale_to_inf = 0x1.0p+112f;
469
- const float scale_to_zero = 0x1.0p-110f;
470
- #else
471
- const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
472
- const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
473
- #endif
474
- float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
475
-
476
- const uint32_t w = fp32_to_bits(f);
477
- const uint32_t shl1_w = w + w;
478
- const uint32_t sign = w & UINT32_C(0x80000000);
479
- uint32_t bias = shl1_w & UINT32_C(0xFF000000);
480
- if (bias < UINT32_C(0x71000000)) {
481
- bias = UINT32_C(0x71000000);
482
- }
483
 
484
- base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
485
- const uint32_t bits = fp32_to_bits(base);
486
- const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
487
- const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
488
- const uint32_t nonsign = exp_bits + mantissa_bits;
489
- return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
 
 
 
 
 
 
 
 
 
 
490
  }
491
 
492
- #define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
493
- #define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)
494
-
495
- #endif // defined(__ARM_NEON) && !(defined(__CUDACC__) && __CUDACC_VER_MAJOR__ <= 11) && !defined(__MUSACC__)
496
-
497
- // precomputed f32 table for f16 (256 KB)
498
- // defined in ggml.c, initialized in ggml_init()
499
- GGML_API float ggml_table_f32_f16[1 << 16];
500
-
501
- // On ARM NEON, it's quicker to directly convert x -> x instead of calling into ggml_lookup_fp16_to_fp32,
502
- // so we define GGML_FP16_TO_FP32 and GGML_FP32_TO_FP16 elsewhere for NEON.
503
- // This is also true for POWER9.
504
- #if !defined(GGML_FP16_TO_FP32)
505
- inline static float ggml_lookup_fp16_to_fp32(ggml_fp16_t f) {
506
- uint16_t s;
507
- memcpy(&s, &f, sizeof(uint16_t));
508
- return ggml_table_f32_f16[s];
509
  }
510
 
511
- #define GGML_FP16_TO_FP32(x) ggml_lookup_fp16_to_fp32(x)
512
- #endif
513
 
514
- #if !defined(GGML_FP32_TO_FP16)
515
  #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)
516
- #endif
517
 
518
  /**
519
  * Converts brain16 to float32.
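The bit-level fp16 -> fp32 conversion being relocated here is easy to misread, so a worked trace of one input may help. The snippet below is an illustration only, not part of the diff; it assumes the ggml_compute_fp16_to_fp32 helper defined in this header is in scope, and the constants in the comments follow the code verbatim.

/* sketch only: compile in a translation unit where the header above is visible */
#include <assert.h>

int main(void) {
    /* 0x3C00 is fp16 1.0:
       w     = 0x3C00 << 16        = 0x3C000000
       sign  = w & 0x80000000      = 0
       two_w = w + w               = 0x78000000
       (two_w >> 4) + (0xE0 << 23) = 0x77800000, i.e. 2^112 as a float,
       and 2^112 * 0x1.0p-112f    == 1.0f;
       two_w >= (1 << 27), so the normalized branch is taken. */
    assert(ggml_compute_fp16_to_fp32((ggml_fp16_t) 0x3C00) == 1.0f);
    return 0;
}

The other branch (two_w < 1 << 27, i.e. an fp16 exponent field of zero) handles subnormals and zero via the 0.5f magic-bias trick instead.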
 
 GGML_API void * ggml_aligned_malloc(size_t size);
 GGML_API void ggml_aligned_free(void * ptr, size_t size);

+// FP16 <-> FP32
+// ref: https://github.com/Maratyszcza/FP16

+static inline float fp32_from_bits(uint32_t w) {
+    union {
+        uint32_t as_bits;
+        float as_value;
+    } fp32;
+    fp32.as_bits = w;
+    return fp32.as_value;
+}

+static inline uint32_t fp32_to_bits(float f) {
+    union {
+        float as_value;
+        uint32_t as_bits;
+    } fp32;
+    fp32.as_value = f;
+    return fp32.as_bits;
+}

+static inline float ggml_compute_fp16_to_fp32(ggml_fp16_t h) {
+    const uint32_t w = (uint32_t) h << 16;
+    const uint32_t sign = w & UINT32_C(0x80000000);
+    const uint32_t two_w = w + w;

+    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
+#if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
+    const float exp_scale = 0x1.0p-112f;
 #else
+    const float exp_scale = fp32_from_bits(UINT32_C(0x7800000));
+#endif
+    const float normalized_value = fp32_from_bits((two_w >> 4) + exp_offset) * exp_scale;

+    const uint32_t magic_mask = UINT32_C(126) << 23;
+    const float magic_bias = 0.5f;
+    const float denormalized_value = fp32_from_bits((two_w >> 17) | magic_mask) - magic_bias;

+    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
+    const uint32_t result = sign |
+        (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value));
+    return fp32_from_bits(result);
+}

+static inline ggml_fp16_t ggml_compute_fp32_to_fp16(float f) {
+#if (defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 199901L) || defined(__GNUC__) && !defined(__STRICT_ANSI__)) && (!defined(__cplusplus) || __cplusplus >= 201703L)
+    const float scale_to_inf = 0x1.0p+112f;
+    const float scale_to_zero = 0x1.0p-110f;
+#else
+    const float scale_to_inf = fp32_from_bits(UINT32_C(0x77800000));
+    const float scale_to_zero = fp32_from_bits(UINT32_C(0x08800000));
+#endif
+    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;
+
+    const uint32_t w = fp32_to_bits(f);
+    const uint32_t shl1_w = w + w;
+    const uint32_t sign = w & UINT32_C(0x80000000);
+    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
+    if (bias < UINT32_C(0x71000000)) {
+        bias = UINT32_C(0x71000000);
     }

+    base = fp32_from_bits((bias >> 1) + UINT32_C(0x07800000)) + base;
+    const uint32_t bits = fp32_to_bits(base);
+    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
+    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
+    const uint32_t nonsign = exp_bits + mantissa_bits;
+    return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
 }

+#define GGML_COMPUTE_FP16_TO_FP32(x) ggml_compute_fp16_to_fp32(x)
+#define GGML_COMPUTE_FP32_TO_FP16(x) ggml_compute_fp32_to_fp16(x)

+#define GGML_FP16_TO_FP32(x) GGML_COMPUTE_FP16_TO_FP32(x)
 #define GGML_FP32_TO_FP16(x) GGML_COMPUTE_FP32_TO_FP16(x)

 /**
  * Converts brain16 to float32.
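With the scalar helpers now defined unconditionally at this spot, a minimal round-trip sketch follows. It is illustrative and not part of the commit: it treats ggml-impl.h as directly includable (in-tree it is an internal header) and relies on ggml_fp16_t being ggml's uint16_t storage typedef.

#include <stdio.h>

#include "ggml-impl.h" /* the header shown above */

int main(void) {
    const float in[] = { 0.0f, 1.0f, -2.5f, 65504.0f /* fp16 max */, 1e30f };
    for (size_t i = 0; i < sizeof in / sizeof in[0]; ++i) {
        const ggml_fp16_t h = GGML_FP32_TO_FP16(in[i]);
        const float back = GGML_FP16_TO_FP32(h);
        printf("%-12g -> 0x%04x -> %g\n", in[i], (unsigned) h, back);
    }
    return 0;
}

Note that values above fp16 range, such as 1e30f, come back as infinity (0x7C00) rather than being clamped to the maximum finite value, which is the documented behavior of the Maratyszcza-style conversion.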
ggml/src/ggml.c CHANGED
@@ -61,9 +61,6 @@
 #define m512i(p) (__m512i)(p)
 #endif

-// precomputed f32 table for f16 (256 KB) (ggml-impl.h)
-float ggml_table_f32_f16[1 << 16];
-
 #if defined(__linux__) || \
     defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
     (defined(__APPLE__) && !TARGET_OS_TV && !TARGET_OS_WATCH)
@@ -1422,14 +1419,6 @@ struct ggml_context * ggml_init(struct ggml_init_params params) {
         // initialize time system (required on Windows)
         ggml_time_init();

-        for (int i = 0; i < (1 << 16); ++i) {
-            union {
-                uint16_t u16;
-                ggml_fp16_t fp16;
-            } u = {i};
-            ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(u.fp16);
-        }
-
         is_first_call = false;
     }
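For context on the second hunk: the deleted loop filled ggml_table_f32_f16 with the fp32 value of every possible fp16 bit pattern, so that GGML_FP16_TO_FP32 could be a single array load at the cost of 256 KB ((1 << 16) entries x 4 bytes) of static data plus a one-time initialization in ggml_init(). A condensed sketch of that retired pattern, assuming ggml_fp16_t and ggml_compute_fp16_to_fp32 from the header above are in scope:

#include <stdint.h>

/* sketch of the removed lookup-table approach, not part of the commit */
static float table_f32_f16[1 << 16]; /* 65536 entries * 4 bytes = 256 KB */

/* one-time init: enumerate every fp16 bit pattern, as ggml_init() used to */
static void init_table(void) {
    for (uint32_t i = 0; i < (1u << 16); ++i) {
        table_f32_f16[i] = ggml_compute_fp16_to_fp32((ggml_fp16_t) i);
    }
}

/* conversion then degenerates to a single indexed load */
static inline float lookup_fp16_to_fp32(ggml_fp16_t h) {
    return table_f32_f16[h];
}

With conversions now computed directly (or dispatched to the F16C, NEON, or NNPA fast paths where available), both the table and its per-process initialization cost go away.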