Commit 5c4d767

chore: Fix markdown warnings (ggml-org#6625)
1 parent ef21ce4 commit 5c4d767

File tree

8 files changed: +98 −99 lines changed

README-sycl.md

Lines changed: 47 additions & 47 deletions
````diff
@@ -8,9 +8,9 @@
 - [Linux](#linux)
 - [Windows](#windows)
 - [Environment Variable](#environment-variable)
-- [Known Issue](#known-issue)
-- [Q&A](#q&a)
-- [Todo](#todo)
+- [Known Issue](#known-issues)
+- [Q&A](#qa)
+- [TODO](#todo)

 ## Background

````

````diff
@@ -54,10 +54,10 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,

 ## OS

-|OS|Status|Verified|
-|-|-|-|
-|Linux|Support|Ubuntu 22.04, Fedora Silverblue 39|
-|Windows|Support|Windows 11|
+| OS      | Status  | Verified                           |
+|---------|---------|------------------------------------|
+| Linux   | Support | Ubuntu 22.04, Fedora Silverblue 39 |
+| Windows | Support | Windows 11                         |


 ## Hardware
````

````diff
@@ -66,13 +66,13 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,

 **Verified devices**

-|Intel GPU| Status | Verified Model|
-|-|-|-|
-|Intel Data Center Max Series| Support| Max 1550|
-|Intel Data Center Flex Series| Support| Flex 170|
-|Intel Arc Series| Support| Arc 770, 730M|
-|Intel built-in Arc GPU| Support| built-in Arc GPU in Meteor Lake|
-|Intel iGPU| Support| iGPU in i5-1250P, i7-1260P, i7-1165G7|
+| Intel GPU                     | Status  | Verified Model                        |
+|-------------------------------|---------|---------------------------------------|
+| Intel Data Center Max Series  | Support | Max 1550                              |
+| Intel Data Center Flex Series | Support | Flex 170                              |
+| Intel Arc Series              | Support | Arc 770, 730M                         |
+| Intel built-in Arc GPU        | Support | built-in Arc GPU in Meteor Lake       |
+| Intel iGPU                    | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |

 *Notes:*

````

````diff
@@ -89,10 +89,10 @@ The BLAS acceleration on Nvidia GPU through oneAPI can be obtained using the Nvi

 **Verified devices**

-|Nvidia GPU| Status | Verified Model|
-|-|-|-|
-|Ampere Series| Support| A100, A4000|
-|Ampere Series *(Mobile)*| Support| RTX 40 Series|
+| Nvidia GPU               | Status  | Verified Model |
+|--------------------------|---------|----------------|
+| Ampere Series            | Support | A100, A4000    |
+| Ampere Series *(Mobile)* | Support | RTX 40 Series  |

 *Notes:*
 - Support for Nvidia targets through oneAPI is currently limited to Linux platforms.
````

````diff
@@ -167,7 +167,7 @@ Platform #0: Intel(R) OpenCL HD Graphics

 - **Nvidia GPU**

-In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cublas)-* are installed.
+In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cuda)-* are installed.
 Installation can be verified by running the following:
 ```sh
 nvidia-smi
````

````diff
@@ -313,10 +313,10 @@ found 6 SYCL devices:
 | 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 24|67108864| 64| 67064815616|
 ```

-|Attribute|Note|
-|-|-|
-|compute capability 1.3|Level-zero driver/runtime, recommended |
-|compute capability 3.0|OpenCL driver/runtime, slower than level-zero in most cases|
+| Attribute              | Note                                                        |
+|------------------------|-------------------------------------------------------------|
+| compute capability 1.3 | Level-zero driver/runtime, recommended                      |
+| compute capability 3.0 | OpenCL driver/runtime, slower than level-zero in most cases |

 4. Launch inference

````

````diff
@@ -325,10 +325,10 @@ There are two device selection modes:
 - Single device: Use one device target specified by the user.
 - Multiple devices: Automatically select the devices with the same largest Max compute-units.

-|Device selection|Parameter|
-|-|-|
-|Single device|--split-mode none --main-gpu DEVICE_ID |
-|Multiple devices|--split-mode layer (default)|
+| Device selection | Parameter                               |
+|------------------|-----------------------------------------|
+| Single device    | --split-mode none --main-gpu DEVICE_ID  |
+| Multiple devices | --split-mode layer (default)            |

 Examples:

````
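
For context, the table above selects the device mode at run time. A minimal sketch of both modes (binary and model paths are illustrative, not part of this diff):

```sh
# Single-device mode: pin all work to SYCL device 0
./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -ngl 33 --split-mode none --main-gpu 0

# Multi-device mode (the default): split layers across the matching GPUs
./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Hello" -ngl 33 --split-mode layer
```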

````diff
@@ -486,10 +486,10 @@ found 6 SYCL devices:
 ```

-|Attribute|Note|
-|-|-|
-|compute capability 1.3|Level-zero running time, recommended |
-|compute capability 3.0|OpenCL running time, slower than level-zero in most cases|
+| Attribute              | Note                                                      |
+|------------------------|-----------------------------------------------------------|
+| compute capability 1.3 | Level-zero running time, recommended                      |
+| compute capability 3.0 | OpenCL running time, slower than level-zero in most cases |


 4. Launch inference
````

````diff
@@ -499,10 +499,10 @@ There are two device selection modes:
 - Single device: Use one device assigned by user.
 - Multiple devices: Automatically choose the devices with the same biggest Max compute units.

-|Device selection|Parameter|
-|-|-|
-|Single device|--split-mode none --main-gpu DEVICE_ID |
-|Multiple devices|--split-mode layer (default)|
+| Device selection | Parameter                               |
+|------------------|-----------------------------------------|
+| Single device    | --split-mode none --main-gpu DEVICE_ID  |
+| Multiple devices | --split-mode layer (default)            |

 Examples:

````
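
The same selection flags apply to the Windows build as well; a hypothetical invocation (paths are illustrative):

```sh
# Windows: single-device mode on device 0 (paths are illustrative)
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Hello" -ngl 33 --split-mode none --main-gpu 0
```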

````diff
@@ -540,20 +540,20 @@ use 1 SYCL GPUs: [0] with Max compute units:512

 #### Build

-|Name|Value|Function|
-|-|-|-|
-|LLAMA_SYCL|ON (mandatory)|Enable build with SYCL code path.|
-|LLAMA_SYCL_TARGET | INTEL *(default)* \| NVIDIA|Set the SYCL target device type.|
-|LLAMA_SYCL_F16|OFF *(default)* \|ON *(optional)*|Enable FP16 build with SYCL code path.|
-|CMAKE_C_COMPILER|icx|Set *icx* compiler for SYCL code path.|
-|CMAKE_CXX_COMPILER|icpx *(Linux)*, icx *(Windows)*|Set `icpx/icx` compiler for SYCL code path.|
+| Name               | Value                             | Function                                    |
+|--------------------|-----------------------------------|---------------------------------------------|
+| LLAMA_SYCL         | ON (mandatory)                    | Enable build with SYCL code path.           |
+| LLAMA_SYCL_TARGET  | INTEL *(default)* \| NVIDIA       | Set the SYCL target device type.            |
+| LLAMA_SYCL_F16     | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path.      |
+| CMAKE_C_COMPILER   | icx                               | Set *icx* compiler for SYCL code path.      |
+| CMAKE_CXX_COMPILER | icpx *(Linux)*, icx *(Windows)*   | Set `icpx/icx` compiler for SYCL code path. |

 #### Runtime

-|Name|Value|Function|
-|-|-|-|
-|GGML_SYCL_DEBUG|0 (default) or 1|Enable log function by macro: GGML_SYCL_DEBUG|
-|ZES_ENABLE_SYSMAN| 0 (default) or 1|Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer|
+| Name              | Value            | Function                                                                                                                    |
+|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
+| GGML_SYCL_DEBUG   | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG                                                                               |
+| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer   |

 ## Known Issues
````
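
To show how the two tables are consumed: the Build names are CMake cache variables, the Runtime names are environment variables. A minimal sketch, assuming the oneAPI environment is already sourced and with illustrative paths:

```sh
# Build-time options (CMake cache variables from the Build table)
cmake -B build -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release

# Run-time options (environment variables from the Runtime table)
GGML_SYCL_DEBUG=1 ZES_ENABLE_SYSMAN=1 ./build/bin/main -m models/model.gguf -p "Hello" -ngl 33
```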

````diff
@@ -591,6 +591,6 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 ### **GitHub contribution**:
 Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.

-## Todo
+## TODO

 - Support row layer split for multiple card runs.
````

README.md

Lines changed: 19 additions & 19 deletions
````diff
@@ -485,14 +485,14 @@ Building the program with BLAS support may lead to some performance improvements

 The environment variable [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) can be used to specify which GPU(s) will be used. The following compilation options are also available to tweak performance:

-| Option | Legal values | Default | Description |
-|--------------------------------|------------------------|---------|-------------|
-| LLAMA_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
-| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
-| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
-| LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
-| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
-| LLAMA_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
+| Option | Legal values | Default | Description |
+|--------------------------------|------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LLAMA_CUDA_FORCE_DMMV | Boolean | false | Force the use of dequantization + matrix vector multiplication kernels instead of using kernels that do matrix vector multiplication on quantized data. By default the decision is made based on compute capability (MMVQ for 6.1/Pascal/GTX 1000 or higher). Does not affect k-quants. |
+| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the CUDA dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
+| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the CUDA mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. |
+| LLAMA_CUDA_F16 | Boolean | false | If enabled, use half-precision floating point arithmetic for the CUDA dequantization + mul mat vec kernels and for the q4_1 and q5_1 matrix matrix multiplication kernels. Can improve performance on relatively recent GPUs. |
+| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per CUDA thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
+| LLAMA_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |

 - #### hipBLAS
````
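
As a usage sketch: the options in this table are build-time switches, while `CUDA_VISIBLE_DEVICES` acts at run time. The make-variable form and the paths below are assumptions for illustration, not part of this diff:

```sh
# Build with tweaked kernel parameters (option names from the table above)
make LLAMA_CUDA=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_MMV_Y=2

# Run on the first GPU only
CUDA_VISIBLE_DEVICES=0 ./main -m models/model.gguf -p "Hello" -ngl 33
```
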
````diff
@@ -534,11 +534,11 @@ Building the program with BLAS support may lead to some performance improvements
 If your GPU is not officially supported you can use the environment variable [`HSA_OVERRIDE_GFX_VERSION`] set to a similar GPU, for example 10.3.0 on RDNA2 (e.g. gfx1030, gfx1031, or gfx1035) or 11.0.0 on RDNA3.
 The following compilation options are also available to tweak performance (yes, they refer to CUDA, not HIP, because it uses the same code as the cuBLAS version above):

-| Option | Legal values | Default | Description |
-|-------------------------|------------------------|---------|-------------|
-| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
-| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
-| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |
+| Option | Legal values | Default | Description |
+|-------------------------|------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| LLAMA_CUDA_DMMV_X | Positive integer >= 32 | 32 | Number of values in x direction processed by the HIP dequantization + matrix vector multiplication kernel per iteration. Increasing this value can improve performance on fast GPUs. Power of 2 heavily recommended. Does not affect k-quants. |
+| LLAMA_CUDA_MMV_Y | Positive integer | 1 | Block size in y direction for the HIP mul mat vec kernels. Increasing this value can improve performance on fast GPUs. Power of 2 recommended. Does not affect k-quants. |
+| LLAMA_CUDA_KQUANTS_ITER | 1 or 2 | 2 | Number of values processed per iteration and per HIP thread for Q2_K and Q6_K quantization formats. Setting this value to 1 can improve performance for slow GPUs. |

 - #### CLBlast
````
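
For reference, the `HSA_OVERRIDE_GFX_VERSION` workaround mentioned above is set per invocation; a sketch with illustrative paths:

```sh
# Treat an unsupported RDNA2 card (e.g. gfx1031) as gfx1030 at run time
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./main -m models/model.gguf -p "Hello" -ngl 33
```
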
````diff
@@ -746,19 +746,19 @@ From the unzipped folder, open a terminal/cmd window here and place a pre-conver
 As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

 | Model | Original size | Quantized size (Q4_0) |
-|------:|--------------:|-----------------------:|
-| 7B | 13 GB | 3.9 GB |
-| 13B | 24 GB | 7.8 GB |
-| 30B | 60 GB | 19.5 GB |
-| 65B | 120 GB | 38.5 GB |
+|------:|--------------:|----------------------:|
+| 7B | 13 GB | 3.9 GB |
+| 13B | 24 GB | 7.8 GB |
+| 30B | 60 GB | 19.5 GB |
+| 65B | 120 GB | 38.5 GB |

 ### Quantization

 Several quantization methods are supported. They differ in the resulting model disk size and inference speed.

 *(outdated)*

-| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
+| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
 |------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|
 | 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
 | 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
````

SECURITY.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -49,11 +49,11 @@ If you intend to run multiple models in parallel with shared memory, it is your

 1. Tenant Isolation: Models should run separately with strong isolation methods to prevent unwanted data access. Separating networks is crucial for isolation, as it prevents unauthorized access to data or models and malicious users from sending graphs to execute under another tenant's identity.

-1. Resource Allocation: A denial of service caused by one model can impact the overall system health. Implement safeguards like rate limits, access controls, and health monitoring.
+2. Resource Allocation: A denial of service caused by one model can impact the overall system health. Implement safeguards like rate limits, access controls, and health monitoring.

-1. Model Sharing: In a multitenant model sharing design, tenants and users must understand the security risks of running code provided by others. Since there are no reliable methods to detect malicious models, sandboxing the model execution is the recommended approach to mitigate the risk.
+3. Model Sharing: In a multitenant model sharing design, tenants and users must understand the security risks of running code provided by others. Since there are no reliable methods to detect malicious models, sandboxing the model execution is the recommended approach to mitigate the risk.

-1. Hardware Attacks: GPUs or TPUs can also be attacked. [Researches](https://scholar.google.com/scholar?q=gpu+side+channel) has shown that side channel attacks on GPUs are possible, which can make data leak from other models or processes running on the same system at the same time.
+4. Hardware Attacks: GPUs or TPUs can also be attacked. [Researches](https://scholar.google.com/scholar?q=gpu+side+channel) has shown that side channel attacks on GPUs are possible, which can make data leak from other models or processes running on the same system at the same time.

 ## Reporting a vulnerability

````

examples/llava/MobileVLM-README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -22,7 +22,7 @@ After building, run: `./llava-cli` to see the usage. For example:

 ## Model conversion

-- Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally:
+1. Clone `mobileVLM-1.7B` and `clip-vit-large-patch14-336` locally:

 ```sh
 git clone https://huggingface.co/mtgv/MobileVLM-1.7B
````

examples/llava/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -24,7 +24,7 @@ After building, run: `./llava-cli` to see the usage. For example:

 ## LLaVA 1.5

-- Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:
+1. Clone a LLaVA and a CLIP model ([available options](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)). For example:

 ```sh
 git clone https://huggingface.co/liuhaotian/llava-v1.5-7b
````
