ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)
* ggml: add s390x ARCH_FLAGS for compilation
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add SIMD for s390x using vector intrinsics
SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16
SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix missing escape character in GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix s390x GGML_F32x4_REDUCE
Signed-off-by: Aaron Teo <[email protected]>
* ggml: full SIMD activation for F32,F16 s390x
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add option to disable s390x VXE/VXE2
Signed-off-by: Aaron Teo <[email protected]>
* ggml: change vecintrin.h include to ggml-cpu-impl
* add __VXE__ and __VXE2__ macros
Signed-off-by: Aaron Teo <[email protected]>
* cmake: add s390x target detection for VX/VXE/VXE2
Signed-off-by: Aaron Teo <[email protected]>
* ggml: move s390x vector intrinsics to ggml-cpu-impl.h
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x Q8_0 SIMD
Signed-off-by: Aaron Teo <[email protected]>
* ggml: correct documentation for Q8_0
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x reduce code complexity Q8_0
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x bugfix typo Q8_0
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activated for Q4_1
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x inline vec_reve
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q4_0
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add VXE backend feature
Signed-off-by: Aaron Teo <[email protected]>
* ggml: remove test.py
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for quantize_row_q8_0
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for quantize_row_q8_1
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for iq4_xs
Signed-off-by: Aaron Teo <[email protected]>
* ggml: bugfix iq4_xs
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for iq4_nl
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add float, double, and long vector data type
Signed-off-by: Aaron Teo <[email protected]>
* ggml: clean up iq4_xs SIMD
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix improper use of restrict keyword
Signed-off-by: Aaron Teo <[email protected]>
* ggml: update warning message for ggml_vec_tbl
Signed-off-by: Aaron Teo <[email protected]>
* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K
Signed-off-by: Aaron Teo <[email protected]>
* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs
Signed-off-by: Aaron Teo <[email protected]>
* ggml: switch to restrict for iq4_nl
Signed-off-by: Aaron Teo <[email protected]>
* ggml: slight dot product speed improvement for q4_1_q8_1
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for q6_K
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add missing `_t` to ggml_int8x16x4_t
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix missing `_t` for ggml_vec_xl_s8x4
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix more missing `_t`
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add unroll and prefetch to Q8_0
increase of 3.86% for prompt processing and 32.22% for token generation
Signed-off-by: Aaron Teo <[email protected]>
* ggml: patch Q8_0 to use proper vector sizes
Signed-off-by: Aaron Teo <[email protected]>
* ggml: optimise Q8_0 dot prod compute kernel further
Signed-off-by: Aaron Teo <[email protected]>
* ggml: add unroll and prefetch to Q4_1
Signed-off-by: Aaron Teo <[email protected]>
* ggml: refactor Q6_K variable naming for readability
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q6_K typos
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q5_K
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix wrong char*x16_t naming
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q5_K y0 wrong signedness
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q5_K invalid uchar type
Signed-off-by: Aaron Teo <[email protected]>
* ggml: s390x SIMD activation for Q4_K
Signed-off-by: Aaron Teo <[email protected]>
* ggml: fix Q4_K invalid vector intrinsics
Signed-off-by: Aaron Teo <[email protected]>
* ggml: simplify ggml_padd_s16 compute kernel
Signed-off-by: Aaron Teo <[email protected]>
* ggml: correct ggml-cpu vxe wording
Signed-off-by: Aaron Teo <[email protected]>
* ggml: change ggml_aligned_malloc alignment to 256
256 is the cache line size for s390x platforms
Signed-off-by: Aaron Teo <[email protected]>
* ggml: resolve pr merge via cherry-pick 225bbbf
Signed-off-by: Aaron Teo <[email protected]>
* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)
* ggml: resolve pr merge via cherry-pick 4571953
Signed-off-by: Aaron Teo <[email protected]>
* ggml: cmake remove fork when determining s390x machine type
thank you
@ericcurtin
Signed-off-by: Aaron Teo <[email protected]>
---------
Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
- ggml/CMakeLists.txt +1 -0
- ggml/include/ggml-cpu.h +1 -0
- ggml/src/ggml-cpu/CMakeLists.txt +21 -0
- ggml/src/ggml-cpu/ggml-cpu-impl.h +151 -0
- ggml/src/ggml-cpu/ggml-cpu-quants.c +554 -1
- ggml/src/ggml-cpu/ggml-cpu.c +91 -0
- ggml/src/ggml-cpu/ggml-cpu.cpp +3 -0
- ggml/src/ggml.c +4 -0
ggml/CMakeLists.txt
@@ -122,6 +122,7 @@ endif()
 option(GGML_LASX "ggml: enable lasx" ON)
 option(GGML_LSX  "ggml: enable lsx" ON)
 option(GGML_RVV  "ggml: enable rvv" ON)
+option(GGML_VXE  "ggml: enable vxe" ON)
 
 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
ggml/include/ggml-cpu.h
@@ -99,6 +99,7 @@ extern "C" {
     // other
     GGML_BACKEND_API int ggml_cpu_has_riscv_v   (void);
     GGML_BACKEND_API int ggml_cpu_has_vsx       (void);
+    GGML_BACKEND_API int ggml_cpu_has_vxe       (void);
     GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
     GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
 
ggml/src/ggml-cpu/CMakeLists.txt
@@ -306,6 +306,27 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         if (GGML_RVV)
             list(APPEND ARCH_FLAGS -march=rv64gcv -mabi=lp64d)
         endif()
+    elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "s390x")
+        message(STATUS "s390x detected")
+        file(READ "/proc/cpuinfo" CPUINFO_CONTENTS)
+        string(REGEX REPLACE "machine[ \t\r\n]*=[ \t\r\n]*([0-9]+)" "\\1" S390X_M ${CPUINFO_CONTENTS})
+
+        # TODO: Separation to determine activation of VX/VXE/VXE2
+        if (${S390X_M} MATCHES "8561|8562")
+            message(STATUS "z15 target")
+            list(APPEND ARCH_FLAGS -march=z15 -mtune=z15)
+        elseif (${S390X_M} MATCHES "3931")
+            message(STATUS "z16 target")
+            list(APPEND ARCH_FLAGS -march=z16 -mtune=z16)
+        else()
+            message(STATUS "Unknown target")
+            message(WARNING "Unknown target. If you are compiling for z14 and earlier, you might have to add -DGGML_VXE=OFF.")
+            list(APPEND ARCH_FLAGS -march=native -mtune=native)
+        endif()
+
+        if (GGML_VXE)
+            list(APPEND ARCH_FLAGS -mvx -mzvector)
+        endif()
     else()
         message(STATUS "Unknown architecture")
     endif()
ggml/src/ggml-cpu/ggml-cpu-impl.h
@@ -59,6 +59,15 @@ struct ggml_compute_params {
 #endif
 #endif
 
+#if defined(__s390x__) && defined(__VEC__)
+#ifndef __VXE__
+#define __VXE__
+#endif
+#ifndef __VXE2__
+#define __VXE2__
+#endif
+#endif
+
 #if defined(__ARM_FEATURE_SVE)
 #include <arm_sve.h>
 #include <sys/prctl.h>
@@ -359,6 +368,148 @@ inline static int32x4_t ggml_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b)
 #endif
 #endif
 
+#if defined(__VXE__) || defined(__VXE2__)
+#include <vecintrin.h>
+
+#define vec_neg(a)    (-(a))                // Vector Negate
+#define vec_add(a, b) ((a) + (b))           // Vector Add
+#define vec_sub(a, b) ((a) - (b))           // Vector Subtract
+#define vec_mul(a, b) ((a) * (b))           // Vector Multiply
+#define vec_div(a, b) ((a) / (b))           // Vector Divide
+#define vec_sl(a, b)  ((a) << (b))          // Vector Shift Left
+#define vec_sra(a, b) ((a) >> (b))          // Vector Shift Right
+#define vec_sr(a, b)  ((a) >> (b))          // Vector Shift Right Algebraic
+#define vec_slo(a, b) vec_slb(a, (b) << 64) // Vector Shift Left by Octet
+#define vec_sro(a, b) vec_srb(a, (b) << 64) // Vector Shift Right by Octet
+
+#ifndef vec_and
+#define vec_and(a, b) ((a) & (b)) // Vector AND
+#endif
+
+#ifndef vec_or
+#define vec_or(a, b)  ((a) | (b)) // Vector OR
+#endif
+
+#ifndef vec_xor
+#define vec_xor(a, b) ((a) ^ (b)) // Vector XOR
+#endif
+
+typedef signed   char char8x16_t  __attribute__((vector_size(16)));
+typedef unsigned char uchar8x16_t __attribute__((vector_size(16)));
+
+typedef int8_t  int8x16_t __attribute__((vector_size(16)));
+typedef int16_t int16x8_t __attribute__((vector_size(16)));
+typedef int32_t int32x4_t __attribute__((vector_size(16)));
+
+typedef uint8_t  uint8x16_t __attribute__((vector_size(16)));
+typedef uint16_t uint16x8_t __attribute__((vector_size(16)));
+typedef uint32_t uint32x4_t __attribute__((vector_size(16)));
+
+typedef float  float32x4_t  __attribute__((vector_size(16)));
+typedef double double64x2_t __attribute((vector_size(16)));
+
+typedef signed   long long long64x2_t  __attribute((vector_size(16)));
+typedef unsigned long long ulong64x2_t __attribute__((vector_size(16)));
+
+typedef struct ggml_uint8x16x2_t {
+    uint8x16_t val[2];
+} ggml_uint8x16x2_t;
+
+inline static ggml_uint8x16x2_t ggml_vec_xl_u8x2(const uint8_t * ptr) {
+    ggml_uint8x16x2_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+
+    return res;
+}
+
+typedef struct ggml_uint8x16x4_t {
+    uint8x16_t val[4];
+} ggml_uint8x16x4_t;
+
+inline static ggml_uint8x16x4_t ggml_vec_xl_u8x4(const uint8_t * ptr) {
+    ggml_uint8x16x4_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+    res.val[2] = vec_xl(32, ptr);
+    res.val[3] = vec_xl(48, ptr);
+
+    return res;
+}
+
+typedef struct ggml_int8x16x4_t {
+    int8x16_t val[4];
+} ggml_int8x16x4_t;
+
+inline static ggml_int8x16x4_t ggml_vec_xl_s8x4(const int8_t * ptr) {
+    ggml_int8x16x4_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+    res.val[2] = vec_xl(32, ptr);
+    res.val[3] = vec_xl(48, ptr);
+
+    return res;
+}
+
+typedef struct ggml_int16x8x2_t {
+    int16x8_t val[2];
+} ggml_int16x8x2_t;
+
+inline static ggml_int16x8x2_t ggml_vec_xl_s16x2(const int16_t * ptr) {
+    ggml_int16x8x2_t res;
+
+    res.val[0] = vec_xl( 0, ptr);
+    res.val[1] = vec_xl(16, ptr);
+
+    return res;
+}
+
+/*
+    ! WARNING: Very slow. Use vec_perm if possible. Refer to iq4_xs
+    !          or iq4_nl for example implementation.
+*/
+inline static int8x16_t ggml_vec_tbl(int8x16_t a, uint8x16_t b) {
+    int8x16_t res;
+
+    res[ 0] = a[b[ 0]];
+    res[ 1] = a[b[ 1]];
+    res[ 2] = a[b[ 2]];
+    res[ 3] = a[b[ 3]];
+    res[ 4] = a[b[ 4]];
+    res[ 5] = a[b[ 5]];
+    res[ 6] = a[b[ 6]];
+    res[ 7] = a[b[ 7]];
+    res[ 8] = a[b[ 8]];
+    res[ 9] = a[b[ 9]];
+    res[10] = a[b[10]];
+    res[11] = a[b[11]];
+    res[12] = a[b[12]];
+    res[13] = a[b[13]];
+    res[14] = a[b[14]];
+    res[15] = a[b[15]];
+
+    return res;
+}
+
+inline static int16x8_t vec_padd_s16(int16x8_t a, int16x8_t b) {
+    const uchar8x16_t v_maske = { 0, 1, 4, 5, 8, 9, 12, 13,
+                                 16, 17, 20, 21, 24, 25, 28, 29 };
+
+    const int16x8_t v_abo = vec_pack((int32x4_t)a, (int32x4_t)b);
+    const int16x8_t v_abe = vec_perm(a, b, v_maske);
+    return v_abo + v_abe;
+}
+
+inline static int32x4_t ggml_vec_dot(int32x4_t acc, int8x16_t a, int8x16_t b) {
+    const int16x8_t p = vec_mule(a, b) + vec_mulo(a, b);
+    return acc + (vec_unpackh(p) + vec_unpackl(p));
+}
+
+#endif
+
 #if defined(__loongarch_asx)
 /* float type data load instructions */
 static __m128 __lsx_vreplfr2vr_s(const float val) {
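For orientation, here is a scalar model (not part of the patch, names hypothetical) of what the ggml_vec_dot helper above is used for by the quantized kernels below. The helper folds the 16 byte products into four int32 lanes via vec_mule/vec_mulo and vec_unpackh/vec_unpackl; the kernels only ever consume the sum of those four lanes, which is the plain 16-element dot product.

#include <stdint.h>

// Reference: sum of the four accumulator lanes produced by
// ggml_vec_dot(vec_splats(0), a, b) equals this value.
static int32_t dot_s8x16_ref(const int8_t a[16], const int8_t b[16]) {
    int32_t sum = 0;
    for (int k = 0; k < 16; ++k) {
        sum += (int32_t)a[k] * (int32_t)b[k];
    }
    return sum;
}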
ggml/src/ggml-cpu/ggml-cpu-quants.c
@@ -1011,6 +1011,38 @@ void quantize_row_q8_0(const float * restrict x, void * restrict vy, int64_t k)
         __lsx_vst(ni4, (__m128i *)(y[i].qs + 16), 0);
 
     }
+#elif defined(__VXE__) || defined(__VXE2__)
+    for (int i = 0; i < nb; i++) {
+        __vector float srcv [8];
+        __vector float asrcv[8];
+        __vector float amaxv[8];
+
+        for (int j = 0; j < 8; j++) srcv[j] = vec_xl(0, x + i*32 + 4*j);
+        for (int j = 0; j < 8; j++) asrcv[j] = vec_abs(srcv[j]);
+        for (int j = 0; j < 4; j++) amaxv[2*j] = vec_max(asrcv[2*j], asrcv[2*j+1]);
+        for (int j = 0; j < 2; j++) amaxv[4*j] = vec_max(amaxv[4*j], amaxv[4*j+2]);
+        for (int j = 0; j < 1; j++) amaxv[8*j] = vec_max(amaxv[8*j], amaxv[8*j+4]);
+
+        const float amax = MAX(MAX(vec_extract(amaxv[0], 0),
+                                   vec_extract(amaxv[0], 1)),
+                               MAX(vec_extract(amaxv[0], 2),
+                                   vec_extract(amaxv[0], 3)));
+
+        const float d = amax / ((1 << 7) - 1);
+        const float id = d ? 1.0f / d : 0.0f;
+
+        y[i].d = GGML_FP32_TO_FP16(d);
+
+        for (int j = 0; j < 8; j++) {
+            const __vector float v = vec_mul(srcv[j], vec_splats(id));
+            const __vector int32_t vi = vec_signed(v);
+
+            y[i].qs[4*j + 0] = vec_extract(vi, 0);
+            y[i].qs[4*j + 1] = vec_extract(vi, 1);
+            y[i].qs[4*j + 2] = vec_extract(vi, 2);
+            y[i].qs[4*j + 3] = vec_extract(vi, 3);
+        }
+    }
 #else
     GGML_UNUSED(nb);
     // scalar
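As a reading aid, the per-block arithmetic of the VXE path above reduces to the following scalar sketch (hypothetical helper, written to mirror the vector code; the exact rounding of vec_signed is not asserted here, the cast below simply truncates like a plain C conversion):

#include <math.h>
#include <stdint.h>

// Q8_0 block: 32 floats -> one scale d and 32 int8 quants.
static void quantize_block_q8_0_ref(const float x[32], float * d_out, int8_t q[32]) {
    float amax = 0.0f;
    for (int k = 0; k < 32; ++k) {
        amax = fmaxf(amax, fabsf(x[k]));      // running max of |x|, like the vec_max tree
    }
    const float d  = amax / ((1 << 7) - 1);   // map the largest magnitude onto 127
    const float id = d ? 1.0f / d : 0.0f;
    for (int k = 0; k < 32; ++k) {
        q[k] = (int8_t)(x[k] * id);           // integer conversion mirrors vec_signed(v) above
    }
    *d_out = d;                               // stored as GGML_FP32_TO_FP16(d) in the real block
}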
@@ -1337,6 +1369,44 @@ void quantize_row_q8_1(const float * restrict x, void * restrict vy, int64_t k)
         __lsx_vst(ni0, (__m128i *)(y[i].qs +  0), 0);
         __lsx_vst(ni4, (__m128i *)(y[i].qs + 16), 0);
     }
+#elif defined(__VXE__) || defined(__VXE2__)
+    for (int i = 0; i < nb; i++) {
+        __vector float srcv [8];
+        __vector float asrcv[8];
+        __vector float amaxv[8];
+
+        for (int j = 0; j < 8; j++) srcv[j] = vec_xl(0, x + i*32 + 4*j);
+        for (int j = 0; j < 8; j++) asrcv[j] = vec_abs(srcv[j]);
+        for (int j = 0; j < 4; j++) amaxv[2*j] = vec_max(asrcv[2*j], asrcv[2*j+1]);
+        for (int j = 0; j < 2; j++) amaxv[4*j] = vec_max(amaxv[4*j], amaxv[4*j+2]);
+        for (int j = 0; j < 1; j++) amaxv[8*j] = vec_max(amaxv[8*j], amaxv[8*j+4]);
+
+        const float amax = MAX(MAX(vec_extract(amaxv[0], 0),
+                                   vec_extract(amaxv[0], 1)),
+                               MAX(vec_extract(amaxv[0], 2),
+                                   vec_extract(amaxv[0], 3)));
+
+        const float d = amax / ((1 << 7) - 1);
+        const float id = d ? 1.0f / d : 0.0f;
+
+        y[i].d = GGML_FP32_TO_FP16(d);
+
+        __vector int32_t acc = vec_splats(0);
+
+        for (int j = 0; j < 8; j++) {
+            const __vector float v = vec_mul(srcv[j], vec_splats(id));
+            const __vector int32_t vi = vec_signed(v);
+
+            y[i].qs[4*j + 0] = vec_extract(vi, 0);
+            y[i].qs[4*j + 1] = vec_extract(vi, 1);
+            y[i].qs[4*j + 2] = vec_extract(vi, 2);
+            y[i].qs[4*j + 3] = vec_extract(vi, 3);
+
+            acc = vec_add(acc, vi);
+        }
+
+        y[i].s = GGML_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
+    }
 #else
     GGML_UNUSED(nb);
     // scalar
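The only difference from the Q8_0 path is the extra per-block field: the block also stores the scaled sum of its quants in y[i].s, which the Q4_1 kernel further down uses to fold in the per-block minimum. A hedged scalar view of that field (hypothetical helper):

#include <stdint.h>

// Q8_1 additionally keeps s = d * sum(q[k]), see y[i].s in the block above.
static float q8_1_block_sum_ref(float d, const int8_t q[32]) {
    int32_t acc = 0;
    for (int k = 0; k < 32; ++k) {
        acc += q[k];
    }
    return d * (float)acc;   // stored as GGML_FP32_TO_FP16(...) in the real block
}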
@@ -2488,6 +2558,37 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, size_t bs, const void * r
     }
 
     sumf = hsum_float_4x4(acc_0, acc_1, acc_2, acc_3);
+#elif defined(__VXE__) || defined(__VXE2__)
+    __vector float acc = vec_splats(0.0f);
+
+    const __vector uint8_t v_m = vec_splats((const uint8_t)0x0F);
+    const __vector int8_t  v_s = vec_splats( (const int8_t)0x08);
+
+    for (; ib < nb; ++ib) {
+        const __vector uint8_t v_x  = vec_xl(0, x[ib].qs);
+        const __vector int8_t  v_xl = (const __vector int8_t)(v_x & v_m);
+        const __vector int8_t  v_xh = (const __vector int8_t)(v_x >> 4);
+
+        const __vector int8_t v_xls = vec_sub(v_xl, v_s);
+        const __vector int8_t v_xhs = vec_sub(v_xh, v_s);
+
+        const __vector int8_t v_yl = vec_xl(0      , y[ib].qs);
+        const __vector int8_t v_yh = vec_xl(QK8_0/2, y[ib].qs);
+
+        const __vector int16_t v_xylso = vec_mulo(v_xls, v_yl);
+        const __vector int16_t v_xylse = vec_mule(v_xls, v_yl);
+        const __vector int16_t v_xyhso = vec_mulo(v_xhs, v_yh);
+        const __vector int16_t v_xyhse = vec_mule(v_xhs, v_yh);
+
+        __vector int16_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);
+
+        const __vector float v_xy = vec_float(vec_unpackh(v_xy_));
+        const __vector float v_d  = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+
+        acc = vec_madd(v_xy, v_d, acc);
+    }
+
+    sumf = acc[0] + acc[1] + acc[2] + acc[3];
 #endif
     for (; ib < nb; ++ib) {
         int sumi0 = 0;
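The unpacking in the block above corresponds to this per-byte scalar sketch (hypothetical helper): each byte of x[ib].qs carries two 4-bit quants, and subtracting 8 recenters them to the signed range Q4_0 uses (v_xls / v_xhs in the vector code).

#include <stdint.h>

// Q4_0 decode: low nibbles give the first 16 values, high nibbles the last 16,
// both shifted from [0,15] to [-8,7].
static void q4_0_unpack_ref(const uint8_t qs[16], int8_t out[32]) {
    for (int k = 0; k < 16; ++k) {
        out[k]      = (int8_t)(qs[k] & 0x0F) - 8;
        out[k + 16] = (int8_t)(qs[k] >> 4)   - 8;
    }
}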
@@ -2781,6 +2882,35 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * restrict s, size_t bs, const void * r
     }
 
     sumf = hsum_float_8(acc) + summs;
+#elif defined(__VXE__) || defined(__VXE2__)
+    float summs = 0;
+    float32x4_t acc = vec_splats(0.0f);
+
+    const uint8x16_t v_m = vec_splat_u8(0x0F);
+
+#pragma GCC unroll 4
+    for (; ib < nb; ++ib) {
+        __builtin_prefetch(x[ib].qs, 0, 1);
+        __builtin_prefetch(y[ib].qs, 0, 1);
+
+        summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
+
+        const uint8x16_t v_x  = vec_xl(0, x[ib].qs);
+        const int8x16_t  v_xl = (const int8x16_t)(v_x & v_m);
+        const int8x16_t  v_xh = (const int8x16_t)(v_x >> 4);
+
+        const int8x16_t v_yl = vec_xl(0      , y[ib].qs);
+        const int8x16_t v_yh = vec_xl(QK8_1/2, y[ib].qs);
+
+        const int32x4_t   v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
+        const float32x4_t v_xy  = vec_float(v_xy_);
+
+        const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+
+        acc = vec_madd(v_xy, v_d, acc);
+    }
+
+    sumf = acc[0] + acc[1] + acc[2] + acc[3] + summs;
 #endif
     for (; ib < nb; ++ib) {
         int sumi0 = 0;
@@ -3915,6 +4045,27 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * restrict s, size_t bs, const void * r
     }
 
     sumf = hsum_float_8(acc);
+#elif defined(__VXE__) || defined(__VXE2__)
+    __vector float acc = vec_splats(0.0f);
+
+#pragma GCC unroll 8
+    for (; ib < nb; ++ib) {
+        __builtin_prefetch(x[ib].qs, 0, 1);
+        __builtin_prefetch(y[ib].qs, 0, 1);
+
+        const int8x16_t v_xl = vec_xl(0      , x[ib].qs);
+        const int8x16_t v_xh = vec_xl(QK8_0/2, x[ib].qs);
+        const int8x16_t v_yl = vec_xl(0      , y[ib].qs);
+        const int8x16_t v_yh = vec_xl(QK8_0/2, y[ib].qs);
+
+        const int32x4_t   v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
+        const float32x4_t v_xy  = vec_float(v_xy_);
+        const float32x4_t v_d   = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
+
+        acc = vec_madd(v_xy, v_d, acc);
+    }
+
+    sumf = acc[0] + acc[1] + acc[2] + acc[3];
 #endif
     for (; ib < nb; ++ib) {
         int sumi = 0;
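Per block, the kernel above computes the same quantity as this scalar sketch (hypothetical helper): the int8 dot product of the 32 quants, scaled by both block scales.

#include <stdint.h>

// One Q8_0 x Q8_0 block: contribution to sumf is d_x * d_y * sum_k x_q[k] * y_q[k].
static float q8_0_block_dot_ref(float dx, float dy, const int8_t xq[32], const int8_t yq[32]) {
    int32_t sumi = 0;
    for (int k = 0; k < 32; ++k) {
        sumi += (int32_t)xq[k] * (int32_t)yq[k];
    }
    return dx * dy * (float)sumi;
}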
@@ -6797,6 +6948,77 @@ void ggml_vec_dot_q4_K_q8_K(int n, float * restrict s, size_t bs, const void * r
 
 
     *s = hsum_float_8(acc) + ((v4f32)acc_m)[0];
+#elif defined(__VXE__) || defined(__VXE2__)
+    const uint8x16_t v_lm = vec_splat_u8(0x0F);
+    const int32x4_t  v_z  = vec_splat_s32(0);
+
+    uint8x16_t v_x[2];
+    int8x16_t  v_xl[2];
+    int8x16_t  v_y[2];
+
+    float sumf = 0;
+
+    for (int i = 0; i < nb; ++i) {
+        const float d    = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+
+        const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
+        const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
+        const int16x8_t v_ysums  = vec_padd_s16(v_ysumsl, v_ysumsh);
+
+        memcpy(utmp, x[i].scales, 12);
+
+        uint32x4_t v_mins8 = { 0 };
+        v_mins8 = vec_insert(utmp[1] & kmask1, v_mins8, 0);
+        v_mins8 = vec_insert(((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4), v_mins8, 1);
+
+        utmp[1] = (utmp[2] & kmask2) | (((utmp[0] >> 6) & kmask3) << 4);
+        utmp[0] &= kmask1;
+
+        const int16x8_t v_minsh = (int16x8_t)vec_unpackh((uint8x16_t)v_mins8);
+
+        const int32x4_t v_minso = vec_mulo(v_ysums, v_minsh);
+        const int32x4_t v_minse = vec_mule(v_ysums, v_minsh);
+        const int32x4_t v_mins  = v_minso + v_minse;
+        sumf -= dmin * (v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3]);
+
+        const uint8_t * scales = (const uint8_t *)utmp;
+        const uint8_t * restrict x0 = x[i].qs;
+        const int8_t  * restrict y0 = y[i].qs;
+
+        int32_t sumi1 = 0;
+        int32_t sumi2 = 0;
+
+        for (int j = 0; j < QK_K/64; ++j) {
+            v_x[0] = vec_xl(0 , x0);
+            v_x[1] = vec_xl(16, x0);
+            x0 += 32;
+
+            v_y[0] = vec_xl(0 , y0);
+            v_y[1] = vec_xl(16, y0);
+            y0 += 32;
+
+            v_xl[0] = (int8x16_t)vec_and(v_x[0], v_lm);
+            v_xl[1] = (int8x16_t)vec_and(v_x[1], v_lm);
+
+            const int32x4_t p1 = ggml_vec_dot(ggml_vec_dot(v_z, v_xl[0], v_y[0]), v_xl[1], v_y[1]);
+            sumi1 += (p1[0] + p1[1] + p1[2] + p1[3]) * scales[2*j+0];
+
+            v_y[0] = vec_xl(0 , y0);
+            v_y[1] = vec_xl(16, y0);
+            y0 += 32;
+
+            v_xl[0] = (int8x16_t)vec_sr(v_x[0], 4);
+            v_xl[1] = (int8x16_t)vec_sr(v_x[1], 4);
+
+            const int32x4_t p2 = ggml_vec_dot(ggml_vec_dot(v_z, v_xl[0], v_y[0]), v_xl[1], v_y[1]);
+            sumi2 += (p2[0] + p2[1] + p2[2] + p2[3]) * scales[2*j+1];
+        }
+
+        sumf += d * (sumi1 + sumi2);
+    }
+
+    *s = sumf;
 #else
 
     const uint8_t * scales = (const uint8_t*)&utmp[0];
@@ -7526,7 +7748,94 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * restrict s, size_t bs, const void * r
     acc_m = __lsx_vfadd_s(acc_m, (__m128)__lsx_vbsrl_v(acc_m, 4));
 
     *s = hsum_float_8(acc) + ((v4f32)acc_m)[0];
-
+#elif defined(__VXE__) || defined(__VXE2__)
+    const uint8x16_t v_lm = vec_splat_u8(0x0F);
+    const uint8x16_t v_1m = vec_splat_u8(0x01);
+    const uint8x16_t v_2m = vec_splat_u8(0x02);
+
+    const int32x4_t v_z = vec_splat_s32(0);
+
+    const uchar8x16_t v_minsm = {
+        0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
+        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
+    };
+
+    int8x16_t  q5b[4];
+    uint8x16_t q5h[4];
+
+    uint8x16_t v_xl[2];
+    uint8x16_t v_xh[2];
+    int8x16_t  v_y[4];
+
+    float sumf = 0;
+
+    for (int i = 0; i < nb; ++i) {
+        const float d    = y[i].d * GGML_FP16_TO_FP32(x[i].d);
+        const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
+
+        const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
+        const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
+        const int16x8_t v_ysums  = vec_padd_s16(v_ysumsl, v_ysumsh);
+
+        memcpy(utmp, x[i].scales, 12);
+        utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
+        const uint32_t uaux = utmp[1] & kmask1;
+        utmp[1] = (utmp[2] & kmask2) | (((utmp[0] >> 6) & kmask3) << 4);
+        utmp[2] = uaux;
+        utmp[0] &= kmask1;
+
+        const uint8x16_t v_mins16 = vec_xl(0, (const uint8_t *)utmp);
+        const uint8x16_t v_mins8  = vec_perm(v_mins16, v_mins16, v_minsm);
+        const int16x8_t  v_minsh  = (int16x8_t)vec_unpackh(v_mins8);
+
+        const int32x4_t v_minsho = vec_mulo(v_ysums, v_minsh);
+        const int32x4_t v_minshe = vec_mule(v_ysums, v_minsh);
+        const int32x4_t v_mins   = vec_add(v_minsho, v_minshe);
+        const int32_t   mins     = v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3];
+
+        const uint8_t * scales = (const uint8_t *)utmp;
+        const uint8_t * restrict x0l = x[i].qs;
+        const uint8_t * restrict x0h = x[i].qh;
+        const int8_t  * restrict y0  = y[i].qs;
+
+        v_xh[0] = vec_xl(0 , x0h);
+        v_xh[1] = vec_xl(16, x0h);
+
+        int32_t sumi = 0;
+        for (int j = 0; j < QK_K/64; ++j) {
+            v_xl[0] = vec_xl(0 , x0l);
+            v_xl[1] = vec_xl(16, x0l);
+            x0l += 32;
+
+            v_y[0] = vec_xl(0 , y0);
+            v_y[1] = vec_xl(16, y0);
+            v_y[2] = vec_xl(32, y0);
+            v_y[3] = vec_xl(48, y0);
+            y0 += 64;
+
+            q5h[0] = vec_sl(vec_and(v_1m, v_xh[0]), 4);
+            q5h[1] = vec_sl(vec_and(v_1m, v_xh[1]), 4);
+            q5h[2] = vec_sl(vec_and(v_2m, v_xh[0]), 3);
+            q5h[3] = vec_sl(vec_and(v_2m, v_xh[1]), 3);
+            v_xh[0] = vec_sr(v_xh[0], 2);
+            v_xh[1] = vec_sr(v_xh[1], 2);
+
+            q5b[0] = (int8x16_t)vec_or(vec_and(v_xl[0], v_lm), q5h[0]);
+            q5b[1] = (int8x16_t)vec_or(vec_and(v_xl[1], v_lm), q5h[1]);
+            q5b[2] = (int8x16_t)vec_or(vec_sr(v_xl[0], 4), q5h[2]);
+            q5b[3] = (int8x16_t)vec_or(vec_sr(v_xl[1], 4), q5h[3]);
+
+            int32x4_t sumi0 = ggml_vec_dot(ggml_vec_dot(v_z, q5b[0], v_y[0]), q5b[1], v_y[1]);
+            int32x4_t sumi1 = ggml_vec_dot(ggml_vec_dot(v_z, q5b[2], v_y[2]), q5b[3], v_y[3]);
+
+            sumi += (sumi0[0] + sumi0[1] + sumi0[2] + sumi0[3]) * *scales++;
+            sumi += (sumi1[0] + sumi1[1] + sumi1[2] + sumi1[3]) * *scales++;
+        }
+
+        sumf += d * sumi - dmin * mins;
+    }
+
+    *s = sumf;
 #else
 
     const uint8_t * scales = (const uint8_t*)&utmp[0];
@@ -8243,7 +8552,130 @@ void ggml_vec_dot_q6_K_q8_K(int n, float * restrict s, size_t bs, const void * r
     }
 
     *s = hsum_float_8(acc);
-
+#elif defined(__VXE__) || defined(__VXE2__)
+    float sum = 0;
+
+    // Lower 4-bit and upper 2-bit masks
+    const uint8x16_t v_lm = vec_splat_u8(0x0F);
+    const uint8x16_t v_um = vec_splat_u8(0x03);
+
+    const int32x4_t v_z = vec_splat_s32(0);
+
+    int8x16_t  q6b[4];
+    uint8x16_t q6h[4];
+
+    uint8x16_t v_xl[4];
+    uint8x16_t v_xh[2];
+    int8x16_t  v_y[4];
+
+    for (int i = 0; i < nb; ++i) {
+        const float d_all = GGML_FP16_TO_FP32(x[i].d);
+
+        const uint8_t * restrict x0l = x[i].ql;
+        const uint8_t * restrict x0h = x[i].qh;
+        const int8_t  * restrict y0  = y[i].qs;
+
+        const int8_t * restrict scale = x[i].scales;
+
+        const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
+        const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
+
+        const int8x16_t  v_scale  = vec_xl(0, scale);
+        const int16x8_t  v_scalel = vec_unpackh(v_scale);
+        const int16x8_t  v_scaleh = vec_unpackl(v_scale);
+
+        const int32x4_t v_minslo = vec_mulo(v_ysumsl, v_scalel);
+        const int32x4_t v_minsle = vec_mule(v_ysumsl, v_scalel);
+        const int32x4_t v_minsho = vec_mulo(v_ysumsh, v_scaleh);
+        const int32x4_t v_minshe = vec_mule(v_ysumsh, v_scaleh);
+        const int32x4_t v_mins   = v_minslo + v_minsle + v_minsho + v_minshe;
+
+        const int32_t mins = v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3];
+
+        int32_t isum = 0;
+        for (int j = 0; j < QK_K/128; ++j) {
+            // Load model upper 2 bits
+            v_xh[0] = vec_xl(0 , x0h);
+            v_xh[1] = vec_xl(16, x0h);
+            x0h += 32;
+
+            // Load model lower 4 bits
+            v_xl[0] = vec_xl(0 , x0l);
+            v_xl[1] = vec_xl(16, x0l);
+            v_xl[2] = vec_xl(32, x0l);
+            v_xl[3] = vec_xl(48, x0l);
+            x0l += 64;
+
+            // Load activation quants
+            v_y[0] = vec_xl(0 , y0);
+            v_y[1] = vec_xl(16, y0);
+            v_y[2] = vec_xl(32, y0);
+            v_y[3] = vec_xl(48, y0);
+            y0 += 64;
+
+            q6h[0] = vec_sl(vec_and(v_um, v_xh[0]), 4);
+            q6h[1] = vec_sl(vec_and(v_um, v_xh[1]), 4);
+            uint8x16_t shifted = vec_sr(v_xh[0], 2);
+            q6h[2] = vec_sl(vec_and(v_um, shifted), 4);
+            shifted = vec_sr(v_xh[1], 2);
+            q6h[3] = vec_sl(vec_and(v_um, shifted), 4);
+
+            q6b[0] = (int8x16_t)(vec_or(vec_and(v_xl[0], v_lm), q6h[0]));
+            q6b[1] = (int8x16_t)(vec_or(vec_and(v_xl[1], v_lm), q6h[1]));
+            q6b[2] = (int8x16_t)(vec_or(vec_and(v_xl[2], v_lm), q6h[2]));
+            q6b[3] = (int8x16_t)(vec_or(vec_and(v_xl[3], v_lm), q6h[3]));
+
+            int32x4_t summs0 = ggml_vec_dot(v_z, q6b[0], v_y[0]);
+            int32x4_t summs1 = ggml_vec_dot(v_z, q6b[1], v_y[1]);
+            int32x4_t summs2 = ggml_vec_dot(v_z, q6b[2], v_y[2]);
+            int32x4_t summs3 = ggml_vec_dot(v_z, q6b[3], v_y[3]);
+
+            isum += (summs0[0] + summs0[1] + summs0[2] + summs0[3]) * scale[0] +
+                    (summs1[0] + summs1[1] + summs1[2] + summs1[3]) * scale[1] +
+                    (summs2[0] + summs2[1] + summs2[2] + summs2[3]) * scale[2] +
+                    (summs3[0] + summs3[1] + summs3[2] + summs3[3]) * scale[3];
+
+            scale += 4;
+
+
+            // Load activation quants
+            v_y[0] = vec_xl(0 , y0);
+            v_y[1] = vec_xl(16, y0);
+            v_y[2] = vec_xl(32, y0);
+            v_y[3] = vec_xl(48, y0);
+            y0 += 64;
+
+            shifted = vec_sr(v_xh[0], 4);
+            q6h[0] = vec_sl(vec_and(v_um, shifted), 4);
+            shifted = vec_sr(v_xh[1], 4);
+            q6h[1] = vec_sl(vec_and(v_um, shifted), 4);
+            shifted = vec_sr(v_xh[0], 6);
+            q6h[2] = vec_sl(vec_and(v_um, shifted), 4);
+            shifted = vec_sr(v_xh[1], 6);
+            q6h[3] = vec_sl(vec_and(v_um, shifted), 4);
+
+            q6b[0] = (int8x16_t)(vec_or(vec_sr(v_xl[0], 4), q6h[0]));
+            q6b[1] = (int8x16_t)(vec_or(vec_sr(v_xl[1], 4), q6h[1]));
+            q6b[2] = (int8x16_t)(vec_or(vec_sr(v_xl[2], 4), q6h[2]));
+            q6b[3] = (int8x16_t)(vec_or(vec_sr(v_xl[3], 4), q6h[3]));
+
+            summs0 = ggml_vec_dot(v_z, q6b[0], v_y[0]);
+            summs1 = ggml_vec_dot(v_z, q6b[1], v_y[1]);
+            summs2 = ggml_vec_dot(v_z, q6b[2], v_y[2]);
+            summs3 = ggml_vec_dot(v_z, q6b[3], v_y[3]);
+
+            isum += (summs0[0] + summs0[1] + summs0[2] + summs0[3]) * scale[0] +
+                    (summs1[0] + summs1[1] + summs1[2] + summs1[3]) * scale[1] +
+                    (summs2[0] + summs2[1] + summs2[2] + summs2[3]) * scale[2] +
+                    (summs3[0] + summs3[1] + summs3[2] + summs3[3]) * scale[3];
+
+            scale += 4;
+        }
+
+        sum += d_all * y[i].d * (isum - 32 * mins);
+    }
+
+    *s = sum;
 #else
 
     int8_t aux8[QK_K];
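As I read the kernel, the final isum - 32 * mins step relies on a standard rearrangement: the Q6_K quants are decoded biased by +32, and rather than subtracting 32 from every value, the bias is removed once via the scale-weighted sums of the activation quants taken from y[i].bsums. A hedged scalar check of the identity for one group (hypothetical helper):

#include <stdint.h>

// For a group with scale s_g:
//   s_g * sum_k (q6[k] - 32) * q8[k]
//     == s_g * sum_k q6[k] * q8[k]  -  32 * s_g * sum_k q8[k]
// where sum_k q8[k] per group is what y[i].bsums holds.
static int32_t q6_group_ref(int8_t s_g, const uint8_t q6[16], const int8_t q8[16]) {
    int32_t direct = 0, prod = 0, bsum = 0;
    for (int k = 0; k < 16; ++k) {
        direct += (int32_t)s_g * ((int32_t)q6[k] - 32) * q8[k];
        prod   += (int32_t)s_g * (int32_t)q6[k] * q8[k];
        bsum   += q8[k];
    }
    // direct == prod - 32 * s_g * bsum
    return direct;
}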
@@ -8604,7 +9036,57 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * restrict s, size_t bs, const void
     }
 
     *s = 0.125f * hsum_float_8(accumf);
-
+//#elif defined(__VXE__) || defined(__VXE2__)
+//    const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
+//
+//    uint32_t aux32[4];
+//    const uint8_t * aux8 = (const uint8_t *)aux32;
+//
+//    float sumf = 0;
+//
+//    for (int i = 0; i < nb; ++i) {
+//        const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
+//        const uint16_t * restrict q2 = x[i].qs;
+//        const int8_t   * restrict q8 = y[i].qs;
+//
+//        float sumf1 = 0, sumf2 = 0;
+//
+//        for (int ib32 = 0; ib32 < QK_K/32; ib += 2) {
+//            int8x16_t q8b0 = vec_xl( 0, q8);
+//            int8x16_t qb81 = vec_xl(16, q8);
+//            int8x16_t q8b2 = vec_xl(32, q8);
+//            int8x16_t q8b3 = vec_xl(48, q8);
+//            q8 += 64;
+//
+//            memcpy(aux32, q2, 4 * sizeof(uint32_t));
+//            q2 += 8;
+//
+//            int8x16_t q2u0 = { *(const int64_t *)(iq2xxs_grid + aux8[ 0]), *(const int64_t *)(iq2xxs_grid + aux8[ 1]) };
+//            int8x16_t q2u1 = { *(const int64_t *)(iq2xxs_grid + aux8[ 2]), *(const int64_t *)(iq2xxs_grid + aux8[ 3]) };
+//            int8x16_t q2u2 = { *(const int64_t *)(iq2xxs_grid + aux8[ 8]), *(const int64_t *)(iq2xxs_grid + aux8[ 9]) };
+//            int8x16_t q2u3 = { *(const int64_t *)(iq2xxs_grid + aux8[10]), *(const int64_t *)(iq2xxs_grid + aux8[11]) };
+//
+//            int8x16_t q2s0 = { *(const int64_t *)(signs64 + ((aux32[1] >>  0) & 127)), *(const int64_t *)(signs64 + ((aux32[1] >>  7) & 127)) };
+//            int8x16_t q2s1 = { *(const int64_t *)(signs64 + ((aux32[1] >> 14) & 127)), *(const int64_t *)(signs64 + ((aux32[1] >> 21) & 127)) };
+//            int8x16_t q2s2 = { *(const int64_t *)(signs64 + ((aux32[3] >>  0) & 127)), *(const int64_t *)(signs64 + ((aux32[3] >>  7) & 127)) };
+//            int8x16_t q2s3 = { *(const int64_t *)(signs64 + ((aux32[3] >> 14) & 127)), *(const int64_t *)(signs64 + ((aux32[3] >> 21) & 127)) };
+//
+//            q2u0 = vec_mul(q2u0, q2s0);
+//            q2u1 = vec_mul(q2u1, q2s1);
+//            q2u2 = vec_mul(q2u2, q2s2);
+//            q2u3 = vec_mul(q2u3, q2s3);
+//
+//            const int32x4_t p1 = ggml_vec_dot(ggml_vec_dot(vec_splat_s32(0), q2u0, q8b0), q2u1, q8b1);
+//            const int32x4_t p2 = ggml_vec_dot(ggml_vec_dot(vec_splat_s32(0), q2u2, q8b2), q2u3, q8b3);
+//
+//            sumf1 += (p1[0] + p1[1] + p1[2] + p1[3]) * (0.5f + (aux32[1] >> 28));
+//            sumf2 += (p2[0] + p2[1] + p2[2] + p2[3]) * (0.5f + (aux32[3] >> 28));
+//        }
+//
+//        sumf += d * (sumf1 + sumf2);
+//    }
+//
+//    *s = 0.25f * sumf;
 #else
 
     uint32_t aux32[2];
@@ -11365,6 +11847,27 @@ void ggml_vec_dot_iq4_nl_q8_0(int n, float * restrict s, size_t bs, const void *
 
     sumf = hsum_float_8(__lasx_xvfadd_s(accum1, accum2));
 
+#elif defined(__VXE__) || defined(__VXE2__)
+    const int8x16_t  v_k = vec_xl(0, kvalues_iq4nl);
+    const uint8x16_t v_m = vec_splat_u8(0x0F);
+
+    for (; ib < nb; ++ib) {
+        const block_iq4_nl * restrict x0 = &x[ib];
+        const block_q8_0   * restrict y0 = &y[ib];
+
+        const uint8x16_t v_x = vec_xl(0, x0->qs);
+        int8x16_t v_xl = (int8x16_t)vec_and(v_x, v_m);
+        int8x16_t v_xh = (int8x16_t)vec_sr(v_x, 4);
+
+        v_xl = vec_perm(v_k, v_k, (uchar8x16_t)v_xl);
+        v_xh = vec_perm(v_k, v_k, (uchar8x16_t)v_xh);
+
+        const int8x16_t v_yl = vec_xl(0      , y0->qs);
+        const int8x16_t v_yh = vec_xl(QK8_0/2, y0->qs);
+        const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
+
+        sumf += GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
+    }
 #endif
     for (; ib < nb; ++ib) {
         const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
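The two vec_perm calls above act as a 16-entry table lookup: each 4-bit index selects one entry of the kvalues_iq4nl codebook. A scalar sketch of that decode step (hypothetical helper):

#include <stdint.h>

// IQ4_NL decode: each nibble indexes the signed codebook values.
static void iq4_nl_unpack_ref(const int8_t kvalues[16], const uint8_t qs[16], int8_t out[32]) {
    for (int k = 0; k < 16; ++k) {
        out[k]      = kvalues[qs[k] & 0x0F];
        out[k + 16] = kvalues[qs[k] >> 4];
    }
}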
@@ -11643,6 +12146,56 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * restrict s, size_t bs, const void *
     }
 
     *s = hsum_float_8(accum);
+#elif defined(__VXE__) || defined(__VXE2__)
+    const int8x16_t  v_k = vec_xl(0, kvalues_iq4nl);
+    const uint8x16_t v_m = vec_splat_u8(0x0F);
+
+    float sumf = 0;
+
+    for (int ibl = 0; ibl < nb; ++ibl) {
+        const uint8_t * restrict q4 = x[ibl].qs;
+        const int8_t  * restrict q8 = y[ibl].qs;
+
+        uint16_t h = x[ibl].scales_h;
+
+        int sumi1 = 0, sumi2 = 0;
+        for (int ib = 0; ib < QK_K/64; ++ib) {
+            const uint8x16_t v_x0 = vec_xl(0       , q4);
+            const uint8x16_t v_x1 = vec_xl(QK4_NL/2, q4);
+            q4 += 32;
+
+            int8x16_t v_x0l = (int8x16_t)vec_and(v_x0, v_m);
+            int8x16_t v_x0h = (int8x16_t)vec_sr(v_x0, 4);
+            int8x16_t v_x1l = (int8x16_t)vec_and(v_x1, v_m);
+            int8x16_t v_x1h = (int8x16_t)vec_sr(v_x1, 4);
+
+            v_x0l = vec_perm(v_k, v_k, (uchar8x16_t)v_x0l);
+            v_x0h = vec_perm(v_k, v_k, (uchar8x16_t)v_x0h);
+            v_x1l = vec_perm(v_k, v_k, (uchar8x16_t)v_x1l);
+            v_x1h = vec_perm(v_k, v_k, (uchar8x16_t)v_x1h);
+
+            const int8x16_t v_y0 = vec_xl( 0, q8);
+            const int8x16_t v_y1 = vec_xl(16, q8);
+            const int8x16_t v_y2 = vec_xl(32, q8);
+            const int8x16_t v_y3 = vec_xl(48, q8);
+            q8 += 64;
+
+            int32x4_t vsumi0 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x0l, v_y0), v_x0h, v_y1);
+            int32x4_t vsumi1 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x1l, v_y2), v_x1h, v_y3);
+
+            int ls1 = ((x[ibl].scales_l[ib] & 0xF) | ((h << 4) & 0x30)) - 32;
+            int ls2 = ((x[ibl].scales_l[ib] >>  4) | ((h << 2) & 0x30)) - 32;
+
+            h >>= 4;
+
+            sumi1 += (vsumi0[0] + vsumi0[1] + vsumi0[2] + vsumi0[3]) * ls1;
+            sumi2 += (vsumi1[0] + vsumi1[1] + vsumi1[2] + vsumi1[3]) * ls2;
+        }
+
+        sumf += GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
+    }
+
+    *s = sumf;
 
 #else
     float sumf = 0;
ggml/src/ggml-cpu/ggml-cpu.c
@@ -237,6 +237,8 @@ typedef pthread_t ggml_thread_t;
 #else
 #if defined(__POWER9_VECTOR__)
 #define CACHE_LINE_SIZE 128
+#elif defined(__VXE__) || defined(__VXE2__)
+#define CACHE_LINE_SIZE 256
 #else
 #define CACHE_LINE_SIZE 64
 #endif
@@ -1211,6 +1213,87 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
 #define GGML_F16_VEC_MUL            GGML_F32Cx4_MUL
 #define GGML_F16_VEC_REDUCE         GGML_F32Cx4_REDUCE
 
+#elif defined(__VXE__) || defined(__VXE2__)
+
+#define GGML_SIMD
+
+// F32 s390x
+
+#define GGML_F32_STEP 32
+#define GGML_F32_EPR  4
+
+#define GGML_F32x4              __vector float
+#define GGML_F32x4_ZERO         vec_splats(0.0f)
+#define GGML_F32x4_SET1         vec_splats
+#define GGML_F32x4_LOAD(p)      vec_xl(0, p)
+#define GGML_F32x4_STORE(p, r)  vec_xst(r, 0, p)
+#define GGML_F32x4_FMA(a, b, c) vec_madd(b, c, a)
+#define GGML_F32x4_ADD          vec_add
+#define GGML_F32x4_MUL          vec_mul
+#define GGML_F32x4_REDUCE(res, x)                \
+{                                                \
+    int offset = GGML_F32_ARR >> 1;              \
+    for (int i = 0; i < offset; ++i) {           \
+        x[i] = vec_add(x[i], x[offset + i]);     \
+    }                                            \
+    offset >>= 1;                                \
+    for (int i = 0; i < offset; ++i) {           \
+        x[i] = vec_add(x[i], x[offset + i]);     \
+    }                                            \
+    offset >>= 1;                                \
+    for (int i = 0; i < offset; ++i) {           \
+        x[i] = vec_add(x[i], x[offset + i]);     \
+    }                                            \
+    res = vec_extract(x[0], 0) +                 \
+          vec_extract(x[0], 1) +                 \
+          vec_extract(x[0], 2) +                 \
+          vec_extract(x[0], 3);                  \
+}
+
+#define GGML_F32_VEC        GGML_F32x4
+#define GGML_F32_VEC_ZERO   GGML_F32x4_ZERO
+#define GGML_F32_VEC_SET1   GGML_F32x4_SET1
+#define GGML_F32_VEC_LOAD   GGML_F32x4_LOAD
+#define GGML_F32_VEC_STORE  GGML_F32x4_STORE
+#define GGML_F32_VEC_FMA    GGML_F32x4_FMA
+#define GGML_F32_VEC_ADD    GGML_F32x4_ADD
+#define GGML_F32_VEC_MUL    GGML_F32x4_MUL
+#define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
+
+// F16 s390x
+#define GGML_F16_STEP GGML_F32_STEP
+#define GGML_F16_EPR  GGML_F32_EPR
+
+static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
+    float tmp[4];
+
+    for (int i = 0; i < 4; i++) {
+        tmp[i] = GGML_FP16_TO_FP32(x[i]);
+    }
+
+    return vec_xl(0, tmp);
+}
+
+static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
+    float arr[4];
+
+    vec_xst(y, 0, arr);
+
+    for (int i = 0; i < 4; i++) {
+        x[i] = GGML_FP32_TO_FP16(arr[i]);
+    }
+}
+
+#define GGML_F16_VEC                GGML_F32x4
+#define GGML_F16_VEC_ZERO           GGML_F32x4_ZERO
+#define GGML_F16_VEC_SET1           GGML_F32x4_SET1
+#define GGML_F16_VEC_LOAD(p, i)     __lzs_f16cx4_load(p)
+#define GGML_F16_VEC_STORE(p, r, i) __lzs_f16cx4_store(p, r[i])
+#define GGML_F16_VEC_FMA            GGML_F32x4_FMA
+#define GGML_F16_VEC_ADD            GGML_F32x4_ADD
+#define GGML_F16_VEC_MUL            GGML_F32x4_MUL
+#define GGML_F16_VEC_REDUCE         GGML_F32x4_REDUCE
+
 #endif
 
 // GGML_F32_ARR / GGML_F16_ARR
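GGML_F32x4_REDUCE above folds the array of partial-sum vectors down to one float with three pairwise halving passes plus a final horizontal add. Assuming the usual GGML_F32_ARR = GGML_F32_STEP / GGML_F32_EPR definition that follows this block (32 / 4 = 8 accumulators here), the same reduction in scalar form looks like this hypothetical sketch:

// Tree-reduce 8 four-wide accumulators (modelled as float[8][4]) to one float,
// mirroring the halving passes and the final vec_extract sum in the macro.
static float f32x4_reduce_ref(float x[8][4]) {
    for (int offset = 8 >> 1; offset > 0; offset >>= 1) {
        for (int i = 0; i < offset; ++i) {
            for (int l = 0; l < 4; ++l) {
                x[i][l] += x[offset + i][l];
            }
        }
    }
    return x[0][0] + x[0][1] + x[0][2] + x[0][3];
}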
@@ -14419,6 +14502,14 @@ int ggml_cpu_has_vsx(void) {
 #endif
 }
 
+int ggml_cpu_has_vxe(void) {
+#if defined(__VXE__) || defined(__VXE2__)
+    return 1;
+#else
+    return 0;
+#endif
+}
+
 int ggml_cpu_has_neon(void) {
 #if defined(__ARM_ARCH) && defined(__ARM_NEON)
     return ggml_arm_arch_features.has_neon;
ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -557,6 +557,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
         if (ggml_cpu_has_vsx()) {
             features.push_back({ "VSX", "1" });
         }
+        if (ggml_cpu_has_vxe()) {
+            features.push_back({ "VXE", "1" });
+        }
         if (ggml_cpu_has_wasm_simd()) {
             features.push_back({ "WASM_SIMD", "1" });
         }
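With the declaration added to ggml-cpu.h above, callers can probe the feature at runtime. A minimal usage sketch (build setup not shown; it assumes the patched ggml-cpu.h is on the include path):

#include <stdio.h>
#include "ggml-cpu.h"

int main(void) {
    // Prints 1 when the CPU backend was compiled with the s390x VXE/VXE2 paths.
    printf("VXE support: %d\n", ggml_cpu_has_vxe());
    return 0;
}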
ggml/src/ggml.c
@@ -240,7 +240,11 @@ void ggml_log_callback_default(enum ggml_log_level level, const char * text, voi
 
 
 void * ggml_aligned_malloc(size_t size) {
+#if defined(__s390x__)
+    const int alignment = 256;
+#else
     const int alignment = 64;
+#endif
 
 #if defined(_MSC_VER) || defined(__MINGW32__)
     return _aligned_malloc(size, alignment);