Aaron Teo, Jinyang He, and junchao-zhao committed
Commit 4aa54ec · 1 Parent(s): fbc5f16

ggml-cpu: Support s390x SIMD Instruction Set (llama/12019)


* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <[email protected]>
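
For illustration, the activated F32 path maps onto the same handful of intrinsics that the patch wires into the GGML_F32x4_* macros (vec_xl, vec_madd, vec_extract), compiled with the -mvx -mzvector flags added by the CMake change below. A minimal sketch, not part of the patch; the function name and the assumption that n is a multiple of 4 are hypothetical:

    #include <vecintrin.h>

    // Sketch: F32 dot product with z/Architecture vector intrinsics,
    // assuming n is a multiple of 4. Mirrors GGML_F32x4_LOAD/FMA/REDUCE.
    static float vxe_dot_f32_sketch(const float * x, const float * y, int n) {
        __vector float acc = vec_splats(0.0f);
        for (int i = 0; i < n; i += 4) {
            const __vector float vx = vec_xl(0, x + i);   // GGML_F32x4_LOAD
            const __vector float vy = vec_xl(0, y + i);
            acc = vec_madd(vx, vy, acc);                  // GGML_F32x4_FMA
        }
        return vec_extract(acc, 0) + vec_extract(acc, 1) +
               vec_extract(acc, 2) + vec_extract(acc, 3); // GGML_F32x4_REDUCE
    }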

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>
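
As a usage note, the new switch is an ordinary CMake option: passing -DGGML_VXE=OFF at configure time (e.g. cmake -B build -DGGML_VXE=OFF) turns the VXE/VXE2 paths off, which is what the CMake warning added below recommends when it cannot recognise the machine type (z14 and earlier).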

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <[email protected]>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <[email protected]>
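
The new feature is exposed through the usual ggml_cpu_has_* query added in ggml-cpu.h below. A tiny usage sketch; the printing wrapper is illustrative, not part of the patch:

    #include <stdio.h>
    #include "ggml-cpu.h"

    // Illustrative only: report whether this build advertises the VXE feature.
    int main(void) {
        printf("VXE: %d\n", ggml_cpu_has_vxe());
        return 0;
    }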

* ggml: remove test.py

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <[email protected]>
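
These are ordinary 16-byte GCC vector types declared with the vector_size attribute; the element counts follow from the fixed 16-byte width, e.g. (names as added to ggml-cpu-impl.h below):

    typedef float            float32x4_t  __attribute__((vector_size(16))); // 4 x f32
    typedef double           double64x2_t __attribute__((vector_size(16))); // 2 x f64
    typedef signed long long long64x2_t   __attribute__((vector_size(16))); // 2 x i64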

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <[email protected]>
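
The pattern is a GCC unroll pragma plus read prefetches on both operands, as seen in the Q8_0 kernel below. A self-contained stand-in on plain float arrays; the real kernel prefetches x[ib].qs / y[ib].qs and accumulates with ggml_vec_dot, and the prefetch distance here is illustrative:

    // Sketch of the unroll + prefetch idea; numbers and names are illustrative.
    static float dot_unrolled_prefetch(const float * x, const float * y, int n) {
        float sum = 0.0f;
        #pragma GCC unroll 8
        for (int i = 0; i < n; ++i) {
            __builtin_prefetch(x + i + 32, 0, 1); // read, low temporal locality
            __builtin_prefetch(y + i + 32, 0, 1);
            sum += x[i] * y[i];
        }
        return sum;
    }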

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <[email protected]>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml: Q5_K y0 wrong signness

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <[email protected]>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <[email protected]>
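
Concretely, ggml_aligned_malloc now selects the alignment at compile time, as shown in ggml.c below:

    #if defined(__s390x__)
        const int alignment = 256;   // s390x cache line size
    #else
        const int alignment = 64;
    #endif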

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <[email protected]>

* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <[email protected]>

* ggml: cmake remove fork when determining s390x machine type

Thank you @ericcurtin

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>

ggml/CMakeLists.txt CHANGED
@@ -122,6 +122,7 @@ endif()
122
  option(GGML_LASX "ggml: enable lasx" ON)
123
  option(GGML_LSX "ggml: enable lsx" ON)
124
  option(GGML_RVV "ggml: enable rvv" ON)
125
+ option(GGML_VXE "ggml: enable vxe" ON)
126
 
127
  option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
128
  set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
ggml/include/ggml-cpu.h CHANGED
@@ -99,6 +99,7 @@ extern "C" {
99
  // other
100
  GGML_BACKEND_API int ggml_cpu_has_riscv_v (void);
101
  GGML_BACKEND_API int ggml_cpu_has_vsx (void);
102
+ GGML_BACKEND_API int ggml_cpu_has_vxe (void);
103
  GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
104
  GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
105
 
ggml/src/ggml-cpu/CMakeLists.txt CHANGED
@@ -306,6 +306,27 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
306
  if (GGML_RVV)
307
  list(APPEND ARCH_FLAGS -march=rv64gcv -mabi=lp64d)
308
  endif()
309
+ elseif (${CMAKE_SYSTEM_PROCESSOR} MATCHES "s390x")
310
+ message(STATUS "s390x detected")
311
+ file(READ "/proc/cpuinfo" CPUINFO_CONTENTS)
312
+ string(REGEX REPLACE "machine[ \t\r\n]*=[ \t\r\n]*([0-9]+)" "\\1" S390X_M ${CPUINFO_CONTENTS})
313
+
314
+ # TODO: Separation to determine activation of VX/VXE/VXE2
315
+ if (${S390X_M} MATCHES "8561|8562")
316
+ message(STATUS "z15 target")
317
+ list(APPEND ARCH_FLAGS -march=z15 -mtune=z15)
318
+ elseif (${S390X_M} MATCHES "3931")
319
+ message(STATUS "z16 target")
320
+ list(APPEND ARCH_FLAGS -march=z16 -mtune=z16)
321
+ else()
322
+ message(STATUS "Unknown target")
323
+ message(WARNING "Unknown target. If you are compiling for z14 and earlier, you might have to add -DGGML_VXE=OFF.")
324
+ list(APPEND ARCH_FLAGS -march=native -mtune=native)
325
+ endif()
326
+
327
+ if (GGML_VXE)
328
+ list(APPEND ARCH_FLAGS -mvx -mzvector)
329
+ endif()
330
  else()
331
  message(STATUS "Unknown architecture")
332
  endif()
ggml/src/ggml-cpu/ggml-cpu-impl.h CHANGED
@@ -59,6 +59,15 @@ struct ggml_compute_params {
59
  #endif
60
  #endif
61
 
62
+ #if defined(__s390x__) && defined(__VEC__)
63
+ #ifndef __VXE__
64
+ #define __VXE__
65
+ #endif
66
+ #ifndef __VXE2__
67
+ #define __VXE2__
68
+ #endif
69
+ #endif
70
+
71
  #if defined(__ARM_FEATURE_SVE)
72
  #include <arm_sve.h>
73
  #include <sys/prctl.h>
@@ -359,6 +368,148 @@ inline static int32x4_t ggml_vdotq_s32(int32x4_t acc, int8x16_t a, int8x16_t b)
368
  #endif
369
  #endif
370
 
371
+ #if defined(__VXE__) || defined(__VXE2__)
372
+ #include <vecintrin.h>
373
+
374
+ #define vec_neg(a) (-(a)) // Vector Negate
375
+ #define vec_add(a, b) ((a) + (b)) // Vector Add
376
+ #define vec_sub(a, b) ((a) - (b)) // Vector Subtract
377
+ #define vec_mul(a, b) ((a) * (b)) // Vector Multiply
378
+ #define vec_div(a, b) ((a) / (b)) // Vector Divide
379
+ #define vec_sl(a, b) ((a) << (b)) // Vector Shift Left
380
+ #define vec_sra(a, b) ((a) >> (b)) // Vector Shift Right
381
+ #define vec_sr(a, b) ((a) >> (b)) // Vector Shift Right Algebraic
382
+ #define vec_slo(a, b) vec_slb(a, (b) << 64) // Vector Shift Left by Octet
383
+ #define vec_sro(a, b) vec_srb(a, (b) << 64) // Vector Shift Right by Octet
384
+
385
+ #ifndef vec_and
386
+ #define vec_and(a, b) ((a) & (b)) // Vector AND
387
+ #endif
388
+
389
+ #ifndef vec_or
390
+ #define vec_or(a, b) ((a) | (b)) // Vector OR
391
+ #endif
392
+
393
+ #ifndef vec_xor
394
+ #define vec_xor(a, b) ((a) ^ (b)) // Vector XOR
395
+ #endif
396
+
397
+ typedef signed char char8x16_t __attribute__((vector_size(16)));
398
+ typedef unsigned char uchar8x16_t __attribute__((vector_size(16)));
399
+
400
+ typedef int8_t int8x16_t __attribute__((vector_size(16)));
401
+ typedef int16_t int16x8_t __attribute__((vector_size(16)));
402
+ typedef int32_t int32x4_t __attribute__((vector_size(16)));
403
+
404
+ typedef uint8_t uint8x16_t __attribute__((vector_size(16)));
405
+ typedef uint16_t uint16x8_t __attribute__((vector_size(16)));
406
+ typedef uint32_t uint32x4_t __attribute__((vector_size(16)));
407
+
408
+ typedef float float32x4_t __attribute__((vector_size(16)));
409
+ typedef double double64x2_t __attribute((vector_size(16)));
410
+
411
+ typedef signed long long long64x2_t __attribute((vector_size(16)));
412
+ typedef unsigned long long ulong64x2_t __attribute__((vector_size(16)));
413
+
414
+ typedef struct ggml_uint8x16x2_t {
415
+ uint8x16_t val[2];
416
+ } ggml_uint8x16x2_t;
417
+
418
+ inline static ggml_uint8x16x2_t ggml_vec_xl_u8x2(const uint8_t * ptr) {
419
+ ggml_uint8x16x2_t res;
420
+
421
+ res.val[0] = vec_xl( 0, ptr);
422
+ res.val[1] = vec_xl(16, ptr);
423
+
424
+ return res;
425
+ }
426
+
427
+ typedef struct ggml_uint8x16x4_t {
428
+ uint8x16_t val[4];
429
+ } ggml_uint8x16x4_t;
430
+
431
+ inline static ggml_uint8x16x4_t ggml_vec_xl_u8x4(const uint8_t * ptr) {
432
+ ggml_uint8x16x4_t res;
433
+
434
+ res.val[0] = vec_xl( 0, ptr);
435
+ res.val[1] = vec_xl(16, ptr);
436
+ res.val[2] = vec_xl(32, ptr);
437
+ res.val[3] = vec_xl(48, ptr);
438
+
439
+ return res;
440
+ }
441
+
442
+ typedef struct ggml_int8x16x4_t {
443
+ int8x16_t val[4];
444
+ } ggml_int8x16x4_t;
445
+
446
+ inline static ggml_int8x16x4_t ggml_vec_xl_s8x4(const int8_t * ptr) {
447
+ ggml_int8x16x4_t res;
448
+
449
+ res.val[0] = vec_xl( 0, ptr);
450
+ res.val[1] = vec_xl(16, ptr);
451
+ res.val[2] = vec_xl(32, ptr);
452
+ res.val[3] = vec_xl(48, ptr);
453
+
454
+ return res;
455
+ }
456
+
457
+ typedef struct ggml_int16x8x2_t {
458
+ int16x8_t val[2];
459
+ } ggml_int16x8x2_t;
460
+
461
+ inline static ggml_int16x8x2_t ggml_vec_xl_s16x2(const int16_t * ptr) {
462
+ ggml_int16x8x2_t res;
463
+
464
+ res.val[0] = vec_xl( 0, ptr);
465
+ res.val[1] = vec_xl(16, ptr);
466
+
467
+ return res;
468
+ }
469
+
470
+ /*
471
+ ! WARNING: Very slow. Use vec_perm if possible. Refer to iq4_xs
472
+ ! or iq4_nl for example implementation.
473
+ */
474
+ inline static int8x16_t ggml_vec_tbl(int8x16_t a, uint8x16_t b) {
475
+ int8x16_t res;
476
+
477
+ res[ 0] = a[b[ 0]];
478
+ res[ 1] = a[b[ 1]];
479
+ res[ 2] = a[b[ 2]];
480
+ res[ 3] = a[b[ 3]];
481
+ res[ 4] = a[b[ 4]];
482
+ res[ 5] = a[b[ 5]];
483
+ res[ 6] = a[b[ 6]];
484
+ res[ 7] = a[b[ 7]];
485
+ res[ 8] = a[b[ 8]];
486
+ res[ 9] = a[b[ 9]];
487
+ res[10] = a[b[10]];
488
+ res[11] = a[b[11]];
489
+ res[12] = a[b[12]];
490
+ res[13] = a[b[13]];
491
+ res[14] = a[b[14]];
492
+ res[15] = a[b[15]];
493
+
494
+ return res;
495
+ }
496
+
497
+ inline static int16x8_t vec_padd_s16(int16x8_t a, int16x8_t b) {
498
+ const uchar8x16_t v_maske = { 0, 1, 4, 5, 8, 9, 12, 13,
499
+ 16, 17, 20, 21, 24, 25, 28, 29 };
500
+
501
+ const int16x8_t v_abo = vec_pack((int32x4_t)a, (int32x4_t)b);
502
+ const int16x8_t v_abe = vec_perm(a, b, v_maske);
503
+ return v_abo + v_abe;
504
+ }
505
+
506
+ inline static int32x4_t ggml_vec_dot(int32x4_t acc, int8x16_t a, int8x16_t b) {
507
+ const int16x8_t p = vec_mule(a, b) + vec_mulo(a, b);
508
+ return acc + (vec_unpackh(p) + vec_unpackl(p));
509
+ }
510
+
511
+ #endif
512
+
513
  #if defined(__loongarch_asx)
514
  /* float type data load instructions */
515
  static __m128 __lsx_vreplfr2vr_s(const float val) {
ggml/src/ggml-cpu/ggml-cpu-quants.c CHANGED
@@ -1011,6 +1011,38 @@ void quantize_row_q8_0(const float * restrict x, void * restrict vy, int64_t k)
1011
  __lsx_vst(ni4, (__m128i *)(y[i].qs + 16), 0);
1012
 
1013
  }
1014
+ #elif defined(__VXE__) || defined(__VXE2__)
1015
+ for (int i = 0; i < nb; i++) {
1016
+ __vector float srcv [8];
1017
+ __vector float asrcv[8];
1018
+ __vector float amaxv[8];
1019
+
1020
+ for (int j = 0; j < 8; j++) srcv[j] = vec_xl(0, x + i*32 + 4*j);
1021
+ for (int j = 0; j < 8; j++) asrcv[j] = vec_abs(srcv[j]);
1022
+ for (int j = 0; j < 4; j++) amaxv[2*j] = vec_max(asrcv[2*j], asrcv[2*j+1]);
1023
+ for (int j = 0; j < 2; j++) amaxv[4*j] = vec_max(amaxv[4*j], amaxv[4*j+2]);
1024
+ for (int j = 0; j < 1; j++) amaxv[8*j] = vec_max(amaxv[8*j], amaxv[8*j+4]);
1025
+
1026
+ const float amax = MAX(MAX(vec_extract(amaxv[0], 0),
1027
+ vec_extract(amaxv[0], 1)),
1028
+ MAX(vec_extract(amaxv[0], 2),
1029
+ vec_extract(amaxv[0], 3)));
1030
+
1031
+ const float d = amax / ((1 << 7) - 1);
1032
+ const float id = d ? 1.0f / d : 0.0f;
1033
+
1034
+ y[i].d = GGML_FP32_TO_FP16(d);
1035
+
1036
+ for (int j = 0; j < 8; j++) {
1037
+ const __vector float v = vec_mul(srcv[j], vec_splats(id));
1038
+ const __vector int32_t vi = vec_signed(v);
1039
+
1040
+ y[i].qs[4*j + 0] = vec_extract(vi, 0);
1041
+ y[i].qs[4*j + 1] = vec_extract(vi, 1);
1042
+ y[i].qs[4*j + 2] = vec_extract(vi, 2);
1043
+ y[i].qs[4*j + 3] = vec_extract(vi, 3);
1044
+ }
1045
+ }
1046
  #else
1047
  GGML_UNUSED(nb);
1048
  // scalar
@@ -1337,6 +1369,44 @@ void quantize_row_q8_1(const float * restrict x, void * restrict vy, int64_t k)
1369
  __lsx_vst(ni0, (__m128i *)(y[i].qs + 0), 0);
1370
  __lsx_vst(ni4, (__m128i *)(y[i].qs + 16), 0);
1371
  }
1372
+ #elif defined(__VXE__) || defined(__VXE2__)
1373
+ for (int i = 0; i < nb; i++) {
1374
+ __vector float srcv [8];
1375
+ __vector float asrcv[8];
1376
+ __vector float amaxv[8];
1377
+
1378
+ for (int j = 0; j < 8; j++) srcv[j] = vec_xl(0, x + i*32 + 4*j);
1379
+ for (int j = 0; j < 8; j++) asrcv[j] = vec_abs(srcv[j]);
1380
+ for (int j = 0; j < 4; j++) amaxv[2*j] = vec_max(asrcv[2*j], asrcv[2*j+1]);
1381
+ for (int j = 0; j < 2; j++) amaxv[4*j] = vec_max(amaxv[4*j], amaxv[4*j+2]);
1382
+ for (int j = 0; j < 1; j++) amaxv[8*j] = vec_max(amaxv[8*j], amaxv[8*j+4]);
1383
+
1384
+ const float amax = MAX(MAX(vec_extract(amaxv[0], 0),
1385
+ vec_extract(amaxv[0], 1)),
1386
+ MAX(vec_extract(amaxv[0], 2),
1387
+ vec_extract(amaxv[0], 3)));
1388
+
1389
+ const float d = amax / ((1 << 7) - 1);
1390
+ const float id = d ? 1.0f / d : 0.0f;
1391
+
1392
+ y[i].d = GGML_FP32_TO_FP16(d);
1393
+
1394
+ __vector int32_t acc = vec_splats(0);
1395
+
1396
+ for (int j = 0; j < 8; j++) {
1397
+ const __vector float v = vec_mul(srcv[j], vec_splats(id));
1398
+ const __vector int32_t vi = vec_signed(v);
1399
+
1400
+ y[i].qs[4*j + 0] = vec_extract(vi, 0);
1401
+ y[i].qs[4*j + 1] = vec_extract(vi, 1);
1402
+ y[i].qs[4*j + 2] = vec_extract(vi, 2);
1403
+ y[i].qs[4*j + 3] = vec_extract(vi, 3);
1404
+
1405
+ acc = vec_add(acc, vi);
1406
+ }
1407
+
1408
+ y[i].s = GGML_FP32_TO_FP16(d * (acc[0] + acc[1] + acc[2] + acc[3]));
1409
+ }
1410
  #else
1411
  GGML_UNUSED(nb);
1412
  // scalar
@@ -2488,6 +2558,37 @@ void ggml_vec_dot_q4_0_q8_0(int n, float * restrict s, size_t bs, const void * r
2558
  }
2559
 
2560
  sumf = hsum_float_4x4(acc_0, acc_1, acc_2, acc_3);
2561
+ #elif defined(__VXE__) || defined(__VXE2__)
2562
+ __vector float acc = vec_splats(0.0f);
2563
+
2564
+ const __vector uint8_t v_m = vec_splats((const uint8_t)0x0F);
2565
+ const __vector int8_t v_s = vec_splats( (const int8_t)0x08);
2566
+
2567
+ for (; ib < nb; ++ib) {
2568
+ const __vector uint8_t v_x = vec_xl(0, x[ib].qs);
2569
+ const __vector int8_t v_xl = (const __vector int8_t)(v_x & v_m);
2570
+ const __vector int8_t v_xh = (const __vector int8_t)(v_x >> 4);
2571
+
2572
+ const __vector int8_t v_xls = vec_sub(v_xl, v_s);
2573
+ const __vector int8_t v_xhs = vec_sub(v_xh, v_s);
2574
+
2575
+ const __vector int8_t v_yl = vec_xl(0 , y[ib].qs);
2576
+ const __vector int8_t v_yh = vec_xl(QK8_0/2, y[ib].qs);
2577
+
2578
+ const __vector int16_t v_xylso = vec_mulo(v_xls, v_yl);
2579
+ const __vector int16_t v_xylse = vec_mule(v_xls, v_yl);
2580
+ const __vector int16_t v_xyhso = vec_mulo(v_xhs, v_yh);
2581
+ const __vector int16_t v_xyhse = vec_mule(v_xhs, v_yh);
2582
+
2583
+ __vector int16_t v_xy_ = v_xylso + v_xylse + v_xyhso + v_xyhse; v_xy_ += vec_reve(v_xy_);
2584
+
2585
+ const __vector float v_xy = vec_float(vec_unpackh(v_xy_));
2586
+ const __vector float v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
2587
+
2588
+ acc = vec_madd(v_xy, v_d, acc);
2589
+ }
2590
+
2591
+ sumf = acc[0] + acc[1] + acc[2] + acc[3];
2592
  #endif
2593
  for (; ib < nb; ++ib) {
2594
  int sumi0 = 0;
@@ -2781,6 +2882,35 @@ void ggml_vec_dot_q4_1_q8_1(int n, float * restrict s, size_t bs, const void * r
2882
  }
2883
 
2884
  sumf = hsum_float_8(acc) + summs;
2885
+ #elif defined(__VXE__) || defined(__VXE2__)
2886
+ float summs = 0;
2887
+ float32x4_t acc = vec_splats(0.0f);
2888
+
2889
+ const uint8x16_t v_m = vec_splat_u8(0x0F);
2890
+
2891
+ #pragma GCC unroll 4
2892
+ for (; ib < nb; ++ib) {
2893
+ __builtin_prefetch(x[ib].qs, 0, 1);
2894
+ __builtin_prefetch(y[ib].qs, 0, 1);
2895
+
2896
+ summs += GGML_FP16_TO_FP32(x[ib].m) * GGML_FP16_TO_FP32(y[ib].s);
2897
+
2898
+ const uint8x16_t v_x = vec_xl(0, x[ib].qs);
2899
+ const int8x16_t v_xl = (const int8x16_t)(v_x & v_m);
2900
+ const int8x16_t v_xh = (const int8x16_t)(v_x >> 4);
2901
+
2902
+ const int8x16_t v_yl = vec_xl(0 , y[ib].qs);
2903
+ const int8x16_t v_yh = vec_xl(QK8_1/2, y[ib].qs);
2904
+
2905
+ const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
2906
+ const float32x4_t v_xy = vec_float(v_xy_);
2907
+
2908
+ const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
2909
+
2910
+ acc = vec_madd(v_xy, v_d, acc);
2911
+ }
2912
+
2913
+ sumf = acc[0] + acc[1] + acc[2] + acc[3] + summs;
2914
  #endif
2915
  for (; ib < nb; ++ib) {
2916
  int sumi0 = 0;
@@ -3915,6 +4045,27 @@ void ggml_vec_dot_q8_0_q8_0(int n, float * restrict s, size_t bs, const void * r
4045
  }
4046
 
4047
  sumf = hsum_float_8(acc);
4048
+ #elif defined(__VXE__) || defined(__VXE2__)
4049
+ __vector float acc = vec_splats(0.0f);
4050
+
4051
+ #pragma GCC unroll 8
4052
+ for (; ib < nb; ++ib) {
4053
+ __builtin_prefetch(x[ib].qs, 0, 1);
4054
+ __builtin_prefetch(y[ib].qs, 0, 1);
4055
+
4056
+ const int8x16_t v_xl = vec_xl(0 , x[ib].qs);
4057
+ const int8x16_t v_xh = vec_xl(QK8_0/2, x[ib].qs);
4058
+ const int8x16_t v_yl = vec_xl(0 , y[ib].qs);
4059
+ const int8x16_t v_yh = vec_xl(QK8_0/2, y[ib].qs);
4060
+
4061
+ const int32x4_t v_xy_ = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
4062
+ const float32x4_t v_xy = vec_float(v_xy_);
4063
+ const float32x4_t v_d = vec_splats(GGML_FP16_TO_FP32(x[ib].d) * GGML_FP16_TO_FP32(y[ib].d));
4064
+
4065
+ acc = vec_madd(v_xy, v_d, acc);
4066
+ }
4067
+
4068
+ sumf = acc[0] + acc[1] + acc[2] + acc[3];
4069
  #endif
4070
  for (; ib < nb; ++ib) {
4071
  int sumi = 0;
 
6948
 
6949
 
6950
  *s = hsum_float_8(acc) + ((v4f32)acc_m)[0];
6951
+ #elif defined(__VXE__) || defined(__VXE2__)
6952
+ const uint8x16_t v_lm = vec_splat_u8(0x0F);
6953
+ const int32x4_t v_z = vec_splat_s32(0);
6954
+
6955
+ uint8x16_t v_x[2];
6956
+ int8x16_t v_xl[2];
6957
+ int8x16_t v_y[2];
6958
+
6959
+ float sumf = 0;
6960
+
6961
+ for (int i = 0; i < nb; ++i) {
6962
+ const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
6963
+ const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
6964
+
6965
+ const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
6966
+ const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
6967
+ const int16x8_t v_ysums = vec_padd_s16(v_ysumsl, v_ysumsh);
6968
+
6969
+ memcpy(utmp, x[i].scales, 12);
6970
+
6971
+ uint32x4_t v_mins8 = { 0 };
6972
+ v_mins8 = vec_insert(utmp[1] & kmask1, v_mins8, 0);
6973
+ v_mins8 = vec_insert(((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4), v_mins8, 1);
6974
+
6975
+ utmp[1] = (utmp[2] & kmask2) | (((utmp[0] >> 6) & kmask3) << 4);
6976
+ utmp[0] &= kmask1;
6977
+
6978
+ const int16x8_t v_minsh = (int16x8_t)vec_unpackh((uint8x16_t)v_mins8);
6979
+
6980
+ const int32x4_t v_minso = vec_mulo(v_ysums, v_minsh);
6981
+ const int32x4_t v_minse = vec_mule(v_ysums, v_minsh);
6982
+ const int32x4_t v_mins = v_minso + v_minse;
6983
+ sumf -= dmin * (v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3]);
6984
+
6985
+ const uint8_t * scales = (const uint8_t *)utmp;
6986
+ const uint8_t * restrict x0 = x[i].qs;
6987
+ const int8_t * restrict y0 = y[i].qs;
6988
+
6989
+ int32_t sumi1 = 0;
6990
+ int32_t sumi2 = 0;
6991
+
6992
+ for (int j = 0; j < QK_K/64; ++j) {
6993
+ v_x[0] = vec_xl(0 , x0);
6994
+ v_x[1] = vec_xl(16, x0);
6995
+ x0 += 32;
6996
+
6997
+ v_y[0] = vec_xl(0 , y0);
6998
+ v_y[1] = vec_xl(16, y0);
6999
+ y0 += 32;
7000
+
7001
+ v_xl[0] = (int8x16_t)vec_and(v_x[0], v_lm);
7002
+ v_xl[1] = (int8x16_t)vec_and(v_x[1], v_lm);
7003
+
7004
+ const int32x4_t p1 = ggml_vec_dot(ggml_vec_dot(v_z, v_xl[0], v_y[0]), v_xl[1], v_y[1]);
7005
+ sumi1 += (p1[0] + p1[1] + p1[2] + p1[3]) * scales[2*j+0];
7006
+
7007
+ v_y[0] = vec_xl(0 , y0);
7008
+ v_y[1] = vec_xl(16, y0);
7009
+ y0 += 32;
7010
+
7011
+ v_xl[0] = (int8x16_t)vec_sr(v_x[0], 4);
7012
+ v_xl[1] = (int8x16_t)vec_sr(v_x[1], 4);
7013
+
7014
+ const int32x4_t p2 = ggml_vec_dot(ggml_vec_dot(v_z, v_xl[0], v_y[0]), v_xl[1], v_y[1]);
7015
+ sumi2 += (p2[0] + p2[1] + p2[2] + p2[3]) * scales[2*j+1];
7016
+ }
7017
+
7018
+ sumf += d * (sumi1 + sumi2);
7019
+ }
7020
+
7021
+ *s = sumf;
7022
  #else
7023
 
7024
  const uint8_t * scales = (const uint8_t*)&utmp[0];
@@ -7526,7 +7748,94 @@ void ggml_vec_dot_q5_K_q8_K(int n, float * restrict s, size_t bs, const void * r
7748
  acc_m = __lsx_vfadd_s(acc_m, (__m128)__lsx_vbsrl_v(acc_m, 4));
7749
 
7750
  *s = hsum_float_8(acc) + ((v4f32)acc_m)[0];
7751
+ #elif defined(__VXE__) || defined(__VXE2__)
7752
+ const uint8x16_t v_lm = vec_splat_u8(0x0F);
7753
+ const uint8x16_t v_1m = vec_splat_u8(0x01);
7754
+ const uint8x16_t v_2m = vec_splat_u8(0x02);
7755
+
7756
+ const int32x4_t v_z = vec_splat_s32(0);
7757
+
7758
+ const uchar8x16_t v_minsm = {
7759
+ 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
7760
+ 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
7761
+ };
7762
+
7763
+ int8x16_t q5b[4];
7764
+ uint8x16_t q5h[4];
7765
+
7766
+ uint8x16_t v_xl[2];
7767
+ uint8x16_t v_xh[2];
7768
+ int8x16_t v_y[4];
7769
+
7770
+ float sumf = 0;
7771
+
7772
+ for (int i = 0; i < nb; ++i) {
7773
+ const float d = y[i].d * GGML_FP16_TO_FP32(x[i].d);
7774
+ const float dmin = y[i].d * GGML_FP16_TO_FP32(x[i].dmin);
7775
+
7776
+ const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
7777
+ const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
7778
+ const int16x8_t v_ysums = vec_padd_s16(v_ysumsl, v_ysumsh);
7779
+
7780
+ memcpy(utmp, x[i].scales, 12);
7781
+ utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
7782
+ const uint32_t uaux = utmp[1] & kmask1;
7783
+ utmp[1] = (utmp[2] & kmask2) | (((utmp[0] >> 6) & kmask3) << 4);
7784
+ utmp[2] = uaux;
7785
+ utmp[0] &= kmask1;
7786
+
7787
+ const uint8x16_t v_mins16 = vec_xl(0, (const uint8_t *)utmp);
7788
+ const uint8x16_t v_mins8 = vec_perm(v_mins16, v_mins16, v_minsm);
7789
+ const int16x8_t v_minsh = (int16x8_t)vec_unpackh(v_mins8);
7790
+
7791
+ const int32x4_t v_minsho = vec_mulo(v_ysums, v_minsh);
7792
+ const int32x4_t v_minshe = vec_mule(v_ysums, v_minsh);
7793
+ const int32x4_t v_mins = vec_add(v_minsho, v_minshe);
7794
+ const int32_t mins = v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3];
7795
+
7796
+ const uint8_t * scales = (const uint8_t *)utmp;
7797
+ const uint8_t * restrict x0l = x[i].qs;
7798
+ const uint8_t * restrict x0h = x[i].qh;
7799
+ const int8_t * restrict y0 = y[i].qs;
7800
+
7801
+ v_xh[0] = vec_xl(0 , x0h);
7802
+ v_xh[1] = vec_xl(16, x0h);
7803
+
7804
+ int32_t sumi = 0;
7805
+ for (int j = 0; j < QK_K/64; ++j) {
7806
+ v_xl[0] = vec_xl(0 , x0l);
7807
+ v_xl[1] = vec_xl(16, x0l);
7808
+ x0l += 32;
7809
+
7810
+ v_y[0] = vec_xl(0 , y0);
7811
+ v_y[1] = vec_xl(16, y0);
7812
+ v_y[2] = vec_xl(32, y0);
7813
+ v_y[3] = vec_xl(48, y0);
7814
+ y0 += 64;
7815
 
7816
+ q5h[0] = vec_sl(vec_and(v_1m, v_xh[0]), 4);
7817
+ q5h[1] = vec_sl(vec_and(v_1m, v_xh[1]), 4);
7818
+ q5h[2] = vec_sl(vec_and(v_2m, v_xh[0]), 3);
7819
+ q5h[3] = vec_sl(vec_and(v_2m, v_xh[1]), 3);
7820
+ v_xh[0] = vec_sr(v_xh[0], 2);
7821
+ v_xh[1] = vec_sr(v_xh[1], 2);
7822
+
7823
+ q5b[0] = (int8x16_t)vec_or(vec_and(v_xl[0], v_lm), q5h[0]);
7824
+ q5b[1] = (int8x16_t)vec_or(vec_and(v_xl[1], v_lm), q5h[1]);
7825
+ q5b[2] = (int8x16_t)vec_or(vec_sr(v_xl[0], 4), q5h[2]);
7826
+ q5b[3] = (int8x16_t)vec_or(vec_sr(v_xl[1], 4), q5h[3]);
7827
+
7828
+ int32x4_t sumi0 = ggml_vec_dot(ggml_vec_dot(v_z, q5b[0], v_y[0]), q5b[1], v_y[1]);
7829
+ int32x4_t sumi1 = ggml_vec_dot(ggml_vec_dot(v_z, q5b[2], v_y[2]), q5b[3], v_y[3]);
7830
+
7831
+ sumi += (sumi0[0] + sumi0[1] + sumi0[2] + sumi0[3]) * *scales++;
7832
+ sumi += (sumi1[0] + sumi1[1] + sumi1[2] + sumi1[3]) * *scales++;
7833
+ }
7834
+
7835
+ sumf += d * sumi - dmin * mins;
7836
+ }
7837
+
7838
+ *s = sumf;
7839
  #else
7840
 
7841
  const uint8_t * scales = (const uint8_t*)&utmp[0];
 
8552
  }
8553
 
8554
  *s = hsum_float_8(acc);
8555
+ #elif defined(__VXE__) || defined(__VXE2__)
8556
+ float sum = 0;
8557
+
8558
+ // Lower 4-bit and upper 2-bit masks
8559
+ const uint8x16_t v_lm = vec_splat_u8(0x0F);
8560
+ const uint8x16_t v_um = vec_splat_u8(0x03);
8561
+
8562
+ const int32x4_t v_z = vec_splat_s32(0);
8563
+
8564
+ int8x16_t q6b[4];
8565
+ uint8x16_t q6h[4];
8566
+
8567
+ uint8x16_t v_xl[4];
8568
+ uint8x16_t v_xh[2];
8569
+ int8x16_t v_y[4];
8570
+
8571
+ for (int i = 0; i < nb; ++i) {
8572
+ const float d_all = GGML_FP16_TO_FP32(x[i].d);
8573
+
8574
+ const uint8_t * restrict x0l = x[i].ql;
8575
+ const uint8_t * restrict x0h = x[i].qh;
8576
+ const int8_t * restrict y0 = y[i].qs;
8577
+
8578
+ const int8_t * restrict scale = x[i].scales;
8579
+
8580
+ const int16x8_t v_ysumsl = vec_xl(0 , y[i].bsums);
8581
+ const int16x8_t v_ysumsh = vec_xl(16, y[i].bsums);
8582
+
8583
+ const int8x16_t v_scale = vec_xl(0, scale);
8584
+ const int16x8_t v_scalel = vec_unpackh(v_scale);
8585
+ const int16x8_t v_scaleh = vec_unpackl(v_scale);
8586
+
8587
+ const int32x4_t v_minslo = vec_mulo(v_ysumsl, v_scalel);
8588
+ const int32x4_t v_minsle = vec_mule(v_ysumsl, v_scalel);
8589
+ const int32x4_t v_minsho = vec_mulo(v_ysumsh, v_scaleh);
8590
+ const int32x4_t v_minshe = vec_mule(v_ysumsh, v_scaleh);
8591
+ const int32x4_t v_mins = v_minslo + v_minsle + v_minsho + v_minshe;
8592
+
8593
+ const int32_t mins = v_mins[0] + v_mins[1] + v_mins[2] + v_mins[3];
8594
+
8595
+ int32_t isum = 0;
8596
+ for (int j = 0; j < QK_K/128; ++j) {
8597
+ // Load model upper 2 bits
8598
+ v_xh[0] = vec_xl(0 , x0h);
8599
+ v_xh[1] = vec_xl(16, x0h);
8600
+ x0h += 32;
8601
+
8602
+ // Load model lower 4 bits
8603
+ v_xl[0] = vec_xl(0 , x0l);
8604
+ v_xl[1] = vec_xl(16, x0l);
8605
+ v_xl[2] = vec_xl(32, x0l);
8606
+ v_xl[3] = vec_xl(48, x0l);
8607
+ x0l += 64;
8608
+
8609
+ // Load activation quants
8610
+ v_y[0] = vec_xl(0 , y0);
8611
+ v_y[1] = vec_xl(16, y0);
8612
+ v_y[2] = vec_xl(32, y0);
8613
+ v_y[3] = vec_xl(48, y0);
8614
+ y0 += 64;
8615
+
8616
+ q6h[0] = vec_sl(vec_and(v_um, v_xh[0]), 4);
8617
+ q6h[1] = vec_sl(vec_and(v_um, v_xh[1]), 4);
8618
+ uint8x16_t shifted = vec_sr(v_xh[0], 2);
8619
+ q6h[2] = vec_sl(vec_and(v_um, shifted), 4);
8620
+ shifted = vec_sr(v_xh[1], 2);
8621
+ q6h[3] = vec_sl(vec_and(v_um, shifted), 4);
8622
+
8623
+ q6b[0] = (int8x16_t)(vec_or(vec_and(v_xl[0], v_lm), q6h[0]));
8624
+ q6b[1] = (int8x16_t)(vec_or(vec_and(v_xl[1], v_lm), q6h[1]));
8625
+ q6b[2] = (int8x16_t)(vec_or(vec_and(v_xl[2], v_lm), q6h[2]));
8626
+ q6b[3] = (int8x16_t)(vec_or(vec_and(v_xl[3], v_lm), q6h[3]));
8627
+
8628
+ int32x4_t summs0 = ggml_vec_dot(v_z, q6b[0], v_y[0]);
8629
+ int32x4_t summs1 = ggml_vec_dot(v_z, q6b[1], v_y[1]);
8630
+ int32x4_t summs2 = ggml_vec_dot(v_z, q6b[2], v_y[2]);
8631
+ int32x4_t summs3 = ggml_vec_dot(v_z, q6b[3], v_y[3]);
8632
+
8633
+ isum += (summs0[0] + summs0[1] + summs0[2] + summs0[3]) * scale[0] +
8634
+ (summs1[0] + summs1[1] + summs1[2] + summs1[3]) * scale[1] +
8635
+ (summs2[0] + summs2[1] + summs2[2] + summs2[3]) * scale[2] +
8636
+ (summs3[0] + summs3[1] + summs3[2] + summs3[3]) * scale[3];
8637
+
8638
+ scale += 4;
8639
+
8640
 
8641
+ // Load activation quants
8642
+ v_y[0] = vec_xl(0 , y0);
8643
+ v_y[1] = vec_xl(16, y0);
8644
+ v_y[2] = vec_xl(32, y0);
8645
+ v_y[3] = vec_xl(48, y0);
8646
+ y0 += 64;
8647
+
8648
+ shifted = vec_sr(v_xh[0], 4);
8649
+ q6h[0] = vec_sl(vec_and(v_um, shifted), 4);
8650
+ shifted = vec_sr(v_xh[1], 4);
8651
+ q6h[1] = vec_sl(vec_and(v_um, shifted), 4);
8652
+ shifted = vec_sr(v_xh[0], 6);
8653
+ q6h[2] = vec_sl(vec_and(v_um, shifted), 4);
8654
+ shifted = vec_sr(v_xh[1], 6);
8655
+ q6h[3] = vec_sl(vec_and(v_um, shifted), 4);
8656
+
8657
+ q6b[0] = (int8x16_t)(vec_or(vec_sr(v_xl[0], 4), q6h[0]));
8658
+ q6b[1] = (int8x16_t)(vec_or(vec_sr(v_xl[1], 4), q6h[1]));
8659
+ q6b[2] = (int8x16_t)(vec_or(vec_sr(v_xl[2], 4), q6h[2]));
8660
+ q6b[3] = (int8x16_t)(vec_or(vec_sr(v_xl[3], 4), q6h[3]));
8661
+
8662
+ summs0 = ggml_vec_dot(v_z, q6b[0], v_y[0]);
8663
+ summs1 = ggml_vec_dot(v_z, q6b[1], v_y[1]);
8664
+ summs2 = ggml_vec_dot(v_z, q6b[2], v_y[2]);
8665
+ summs3 = ggml_vec_dot(v_z, q6b[3], v_y[3]);
8666
+
8667
+ isum += (summs0[0] + summs0[1] + summs0[2] + summs0[3]) * scale[0] +
8668
+ (summs1[0] + summs1[1] + summs1[2] + summs1[3]) * scale[1] +
8669
+ (summs2[0] + summs2[1] + summs2[2] + summs2[3]) * scale[2] +
8670
+ (summs3[0] + summs3[1] + summs3[2] + summs3[3]) * scale[3];
8671
+
8672
+ scale += 4;
8673
+ }
8674
+
8675
+ sum += d_all * y[i].d * (isum - 32 * mins);
8676
+ }
8677
+
8678
+ *s = sum;
8679
  #else
8680
 
8681
  int8_t aux8[QK_K];
@@ -8604,7 +9036,57 @@ void ggml_vec_dot_iq2_xxs_q8_K(int n, float * restrict s, size_t bs, const void
9036
  }
9037
 
9038
  *s = 0.125f * hsum_float_8(accumf);
9039
+ //#elif defined(__VXE__) || defined(__VXE2__)
9040
+ // const uint64_t * signs64 = (const uint64_t *)keven_signs_q2xs;
9041
+ //
9042
+ // uint32_t aux32[4];
9043
+ // const uint8_t * aux8 = (const uint8_t *)aux32;
9044
+ //
9045
+ // float sumf = 0;
9046
+ //
9047
+ // for (int i = 0; i < nb; ++i) {
9048
+ // const float d = GGML_FP16_TO_FP32(x[i].d) * y[i].d;
9049
+ // const uint16_t * restrict q2 = x[i].qs;
9050
+ // const int8_t * restrict q8 = y[i].qs;
9051
+ //
9052
+ // float sumf1 = 0, sumf2 = 0;
9053
+ //
9054
+ // for (int ib32 = 0; ib32 < QK_K/32; ib += 2) {
9055
+ // int8x16_t q8b0 = vec_xl( 0, q8);
9056
+ // int8x16_t qb81 = vec_xl(16, q8);
9057
+ // int8x16_t q8b2 = vec_xl(32, q8);
9058
+ // int8x16_t q8b3 = vec_xl(48, q8);
9059
+ // q8 += 64;
9060
+ //
9061
+ // memcpy(aux32, q2, 4 * sizeof(uint32_t));
9062
+ // q2 += 8;
9063
+ //
9064
+ // int8x16_t q2u0 = { *(const int64_t *)(iq2xxs_grid + aux8[ 0]), *(const int64_t *)(iq2xxs_grid + aux8[ 1]) };
9065
+ // int8x16_t q2u1 = { *(const int64_t *)(iq2xxs_grid + aux8[ 2]), *(const int64_t *)(iq2xxs_grid + aux8[ 3]) };
9066
+ // int8x16_t q2u2 = { *(const int64_t *)(iq2xxs_grid + aux8[ 8]), *(const int64_t *)(iq2xxs_grid + aux8[ 9]) };
9067
+ // int8x16_t q2u3 = { *(const int64_t *)(iq2xxs_grid + aux8[10]), *(const int64_t *)(iq2xxs_grid + aux8[11]) };
9068
+ //
9069
+ // int8x16_t q2s0 = { *(const int64_t *)(signs64 + ((aux32[1] >> 0) & 127)), *(const int64_t *)(signs64 + ((aux32[1] >> 7) & 127)) };
9070
+ // int8x16_t q2s1 = { *(const int64_t *)(signs64 + ((aux32[1] >> 14) & 127)), *(const int64_t *)(signs64 + ((aux32[1] >> 21) & 127)) };
9071
+ // int8x16_t q2s2 = { *(const int64_t *)(signs64 + ((aux32[3] >> 0) & 127)), *(const int64_t *)(signs64 + ((aux32[3] >> 7) & 127)) };
9072
+ // int8x16_t q2s3 = { *(const int64_t *)(signs64 + ((aux32[3] >> 14) & 127)), *(const int64_t *)(signs64 + ((aux32[3] >> 21) & 127)) };
9073
+ //
9074
+ // q2u0 = vec_mul(q2u0, q2s0);
9075
+ // q2u1 = vec_mul(q2u1, q2s1);
9076
+ // q2u2 = vec_mul(q2u2, q2s2);
9077
+ // q2u3 = vec_mul(q2u3, q2s3);
9078
+ //
9079
+ // const int32x4_t p1 = ggml_vec_dot(ggml_vec_dot(vec_splat_s32(0), q2u0, q8b0), q2u1, q8b1);
9080
+ // const int32x4_t p2 = ggml_vec_dot(ggml_vec_dot(vec_splat_s32(0), q2u2, q8b2), q2u3, q8b3);
9081
+ //
9082
+ // sumf1 += (p1[0] + p1[1] + p1[2] + p1[3]) * (0.5f + (aux32[1] >> 28));
9083
+ // sumf2 += (p2[0] + p2[1] + p2[2] + p2[3]) * (0.5f + (aux32[3] >> 28));
9084
+ // }
9085
+ //
9086
+ // sumf += d * (sumf1 + sumf2);
9087
+ // }
9088
+ //
9089
+ // *s = 0.25f * sumf;
9090
  #else
9091
 
9092
  uint32_t aux32[2];
 
11847
 
11848
  sumf = hsum_float_8(__lasx_xvfadd_s(accum1, accum2));
11849
 
11850
+ #elif defined(__VXE__) || defined(__VXE2__)
11851
+ const int8x16_t v_k = vec_xl(0, kvalues_iq4nl);
11852
+ const uint8x16_t v_m = vec_splat_u8(0x0F);
11853
+
11854
+ for (; ib < nb; ++ib) {
11855
+ const block_iq4_nl * restrict x0 = &x[ib];
11856
+ const block_q8_0 * restrict y0 = &y[ib];
11857
+
11858
+ const uint8x16_t v_x = vec_xl(0, x0->qs);
11859
+ int8x16_t v_xl = (int8x16_t)vec_and(v_x, v_m);
11860
+ int8x16_t v_xh = (int8x16_t)vec_sr(v_x, 4);
11861
+
11862
+ v_xl = vec_perm(v_k, v_k, (uchar8x16_t)v_xl);
11863
+ v_xh = vec_perm(v_k, v_k, (uchar8x16_t)v_xh);
11864
+
11865
+ const int8x16_t v_yl = vec_xl(0 , y0->qs);
11866
+ const int8x16_t v_yh = vec_xl(QK8_0/2, y0->qs);
11867
+ const int32x4_t v_xy = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_xl, v_yl), v_xh, v_yh);
11868
+
11869
+ sumf += GGML_FP16_TO_FP32(x0->d) * GGML_FP16_TO_FP32(y0->d) * (v_xy[0] + v_xy[1] + v_xy[2] + v_xy[3]);
11870
+ }
11871
  #endif
11872
  for (; ib < nb; ++ib) {
11873
  const float d = GGML_FP16_TO_FP32(y[ib].d)*GGML_FP16_TO_FP32(x[ib].d);
@@ -11643,6 +12146,56 @@ void ggml_vec_dot_iq4_xs_q8_K(int n, float * restrict s, size_t bs, const void *
12146
  }
12147
 
12148
  *s = hsum_float_8(accum);
12149
+ #elif defined(__VXE__) || defined(__VXE2__)
12150
+ const int8x16_t v_k = vec_xl(0, kvalues_iq4nl);
12151
+ const uint8x16_t v_m = vec_splat_u8(0x0F);
12152
+
12153
+ float sumf = 0;
12154
+
12155
+ for (int ibl = 0; ibl < nb; ++ibl) {
12156
+ const uint8_t * restrict q4 = x[ibl].qs;
12157
+ const int8_t * restrict q8 = y[ibl].qs;
12158
+
12159
+ uint16_t h = x[ibl].scales_h;
12160
+
12161
+ int sumi1 = 0, sumi2 = 0;
12162
+ for (int ib = 0; ib < QK_K/64; ++ib) {
12163
+ const uint8x16_t v_x0 = vec_xl(0 , q4);
12164
+ const uint8x16_t v_x1 = vec_xl(QK4_NL/2, q4);
12165
+ q4 += 32;
12166
+
12167
+ int8x16_t v_x0l = (int8x16_t)vec_and(v_x0, v_m);
12168
+ int8x16_t v_x0h = (int8x16_t)vec_sr(v_x0, 4);
12169
+ int8x16_t v_x1l = (int8x16_t)vec_and(v_x1, v_m);
12170
+ int8x16_t v_x1h = (int8x16_t)vec_sr(v_x1, 4);
12171
+
12172
+ v_x0l = vec_perm(v_k, v_k, (uchar8x16_t)v_x0l);
12173
+ v_x0h = vec_perm(v_k, v_k, (uchar8x16_t)v_x0h);
12174
+ v_x1l = vec_perm(v_k, v_k, (uchar8x16_t)v_x1l);
12175
+ v_x1h = vec_perm(v_k, v_k, (uchar8x16_t)v_x1h);
12176
+
12177
+ const int8x16_t v_y0 = vec_xl( 0, q8);
12178
+ const int8x16_t v_y1 = vec_xl(16, q8);
12179
+ const int8x16_t v_y2 = vec_xl(32, q8);
12180
+ const int8x16_t v_y3 = vec_xl(48, q8);
12181
+ q8 += 64;
12182
+
12183
+ int32x4_t vsumi0 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x0l, v_y0), v_x0h, v_y1);
12184
+ int32x4_t vsumi1 = ggml_vec_dot(ggml_vec_dot(vec_splats(0), v_x1l, v_y2), v_x1h, v_y3);
12185
+
12186
+ int ls1 = ((x[ibl].scales_l[ib] & 0xF) | ((h << 4) & 0x30)) - 32;
12187
+ int ls2 = ((x[ibl].scales_l[ib] >> 4) | ((h << 2) & 0x30)) - 32;
12188
+
12189
+ h >>= 4;
12190
+
12191
+ sumi1 += (vsumi0[0] + vsumi0[1] + vsumi0[2] + vsumi0[3]) * ls1;
12192
+ sumi2 += (vsumi1[0] + vsumi1[1] + vsumi1[2] + vsumi1[3]) * ls2;
12193
+ }
12194
+
12195
+ sumf += GGML_FP16_TO_FP32(x[ibl].d) * y[ibl].d * (sumi1 + sumi2);
12196
+ }
12197
+
12198
+ *s = sumf;
12199
 
12200
  #else
12201
  float sumf = 0;
ggml/src/ggml-cpu/ggml-cpu.c CHANGED
@@ -237,6 +237,8 @@ typedef pthread_t ggml_thread_t;
237
  #else
238
  #if defined(__POWER9_VECTOR__)
239
  #define CACHE_LINE_SIZE 128
240
+ #elif defined(__VXE__) || defined(__VXE2__)
241
+ #define CACHE_LINE_SIZE 256
242
  #else
243
  #define CACHE_LINE_SIZE 64
244
  #endif
@@ -1211,6 +1213,87 @@ static inline void __lsx_f16x4_store(ggml_fp16_t * x, __m128 y) {
1213
  #define GGML_F16_VEC_MUL GGML_F32Cx4_MUL
1214
  #define GGML_F16_VEC_REDUCE GGML_F32Cx4_REDUCE
1215
 
1216
+ #elif defined(__VXE__) || defined(__VXE2__)
1217
+
1218
+ #define GGML_SIMD
1219
+
1220
+ // F32 s390x
1221
+
1222
+ #define GGML_F32_STEP 32
1223
+ #define GGML_F32_EPR 4
1224
+
1225
+ #define GGML_F32x4 __vector float
1226
+ #define GGML_F32x4_ZERO vec_splats(0.0f)
1227
+ #define GGML_F32x4_SET1 vec_splats
1228
+ #define GGML_F32x4_LOAD(p) vec_xl(0, p)
1229
+ #define GGML_F32x4_STORE(p, r) vec_xst(r, 0, p)
1230
+ #define GGML_F32x4_FMA(a, b, c) vec_madd(b, c, a)
1231
+ #define GGML_F32x4_ADD vec_add
1232
+ #define GGML_F32x4_MUL vec_mul
1233
+ #define GGML_F32x4_REDUCE(res, x) \
1234
+ { \
1235
+ int offset = GGML_F32_ARR >> 1; \
1236
+ for (int i = 0; i < offset; ++i) { \
1237
+ x[i] = vec_add(x[i], x[offset + i]); \
1238
+ } \
1239
+ offset >>= 1; \
1240
+ for (int i = 0; i < offset; ++i) { \
1241
+ x[i] = vec_add(x[i], x[offset + i]); \
1242
+ } \
1243
+ offset >>= 1; \
1244
+ for (int i = 0; i < offset; ++i) { \
1245
+ x[i] = vec_add(x[i], x[offset + i]); \
1246
+ } \
1247
+ res = vec_extract(x[0], 0) + \
1248
+ vec_extract(x[0], 1) + \
1249
+ vec_extract(x[0], 2) + \
1250
+ vec_extract(x[0], 3); \
1251
+ }
1252
+
1253
+ #define GGML_F32_VEC GGML_F32x4
1254
+ #define GGML_F32_VEC_ZERO GGML_F32x4_ZERO
1255
+ #define GGML_F32_VEC_SET1 GGML_F32x4_SET1
1256
+ #define GGML_F32_VEC_LOAD GGML_F32x4_LOAD
1257
+ #define GGML_F32_VEC_STORE GGML_F32x4_STORE
1258
+ #define GGML_F32_VEC_FMA GGML_F32x4_FMA
1259
+ #define GGML_F32_VEC_ADD GGML_F32x4_ADD
1260
+ #define GGML_F32_VEC_MUL GGML_F32x4_MUL
1261
+ #define GGML_F32_VEC_REDUCE GGML_F32x4_REDUCE
1262
+
1263
+ // F16 s390x
1264
+ #define GGML_F16_STEP GGML_F32_STEP
1265
+ #define GGML_F16_EPR GGML_F32_EPR
1266
+
1267
+ static inline __vector float __lzs_f16cx4_load(const ggml_fp16_t * x) {
1268
+ float tmp[4];
1269
+
1270
+ for (int i = 0; i < 4; i++) {
1271
+ tmp[i] = GGML_FP16_TO_FP32(x[i]);
1272
+ }
1273
+
1274
+ return vec_xl(0, tmp);
1275
+ }
1276
+
1277
+ static inline void __lzs_f16cx4_store(ggml_fp16_t * x, __vector float y) {
1278
+ float arr[4];
1279
+
1280
+ vec_xst(y, 0, arr);
1281
+
1282
+ for (int i = 0; i < 4; i++) {
1283
+ x[i] = GGML_FP32_TO_FP16(arr[i]);
1284
+ }
1285
+ }
1286
+
1287
+ #define GGML_F16_VEC GGML_F32x4
1288
+ #define GGML_F16_VEC_ZERO GGML_F32x4_ZERO
1289
+ #define GGML_F16_VEC_SET1 GGML_F32x4_SET1
1290
+ #define GGML_F16_VEC_LOAD(p, i) __lzs_f16cx4_load(p)
1291
+ #define GGML_F16_VEC_STORE(p, r, i) __lzs_f16cx4_store(p, r[i])
1292
+ #define GGML_F16_VEC_FMA GGML_F32x4_FMA
1293
+ #define GGML_F16_VEC_ADD GGML_F32x4_ADD
1294
+ #define GGML_F16_VEC_MUL GGML_F32x4_MUL
1295
+ #define GGML_F16_VEC_REDUCE GGML_F32x4_REDUCE
1296
+
1297
  #endif
1298
 
1299
  // GGML_F32_ARR / GGML_F16_ARR
@@ -14419,6 +14502,14 @@ int ggml_cpu_has_vsx(void) {
14502
  #endif
14503
  }
14504
 
14505
+ int ggml_cpu_has_vxe(void) {
14506
+ #if defined(__VXE__) || defined(__VXE2__)
14507
+ return 1;
14508
+ #else
14509
+ return 0;
14510
+ #endif
14511
+ }
14512
+
14513
  int ggml_cpu_has_neon(void) {
14514
  #if defined(__ARM_ARCH) && defined(__ARM_NEON)
14515
  return ggml_arm_arch_features.has_neon;
ggml/src/ggml-cpu/ggml-cpu.cpp CHANGED
@@ -557,6 +557,9 @@ static ggml_backend_feature * ggml_backend_cpu_get_features(ggml_backend_reg_t r
  if (ggml_cpu_has_vsx()) {
558
  features.push_back({ "VSX", "1" });
559
  }
560
+ if (ggml_cpu_has_vxe()) {
561
+ features.push_back({ "VXE", "1" });
562
+ }
563
  if (ggml_cpu_has_wasm_simd()) {
564
  features.push_back({ "WASM_SIMD", "1" });
565
  }
ggml/src/ggml.c CHANGED
@@ -240,7 +240,11 @@ void ggml_log_callback_default(enum ggml_log_level level, const char * text, voi
240
 
241
 
242
  void * ggml_aligned_malloc(size_t size) {
243
+ #if defined(__s390x__)
244
+ const int alignment = 256;
245
+ #else
246
  const int alignment = 64;
247
+ #endif
248
 
249
  #if defined(_MSC_VER) || defined(__MINGW32__)
250
  return _aligned_malloc(size, alignment);