Verify NVLink® on Linux
Before verifying NVLink® support in the operating system, install the NVIDIA® drivers by following our guide Install NVIDIA® driver on Linux. You will also need the CUDA® toolkit installed to compile the sample applications. In this short guide, we have collected some useful commands.
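Before continuing, you can quickly confirm that both components are in place; the two commands below only assume that the driver and the toolkit are on the default PATH:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version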
Basic commands
Check your system's physical topology. This command shows all GPUs and their interconnections:
nvidia-smi topo -m
In the resulting matrix, entries such as NV1 or NV2 mean that the GPUs are connected by one or two NVLink® links, while values like PIX, PXB, PHB, or SYS indicate PCIe paths. To display the status of the links, run the following command:
nvidia-smi nvlink -s
This command shows the speed of each link. To display the link capabilities of a single GPU, for example GPU 0, add the -i option:
nvidia-smi nvlink -i 0 -c
Without the -i option, information about the connections of all GPUs is displayed:
nvidia-smi nvlink -c
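On a server with several GPUs you can check every link of every GPU in one pass. A minimal sketch using only standard nvidia-smi query options:
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do nvidia-smi nvlink -s -i "$i"; done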
Install CUDA® samples
A good way to test performance is to use the NVIDIA® sample applications. Their source code is published on GitHub and is available to everyone. Start by cloning the repository onto the server:
git clone https://github.com/NVIDIA/cuda-samples.git
Change into the downloaded repository:
cd cuda-samples
Check out the tag that matches the installed CUDA® version. For example, if you have CUDA® 12.2:
git checkout tags/v12.2
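If the checkout fails because the tag does not exist, list the tags available in the repository and pick the closest match to your toolkit version:
git tag -l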
Install some prerequisites that will be used during the build process:
sudo apt -y install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev libglfw3-dev libgles2-mesa-dev
Now you can compile any sample. Go to the Samples directory:
cd Samples
Take a quick look at the contents:
ls -la
total 40
drwxrwxr-x 10 usergpu usergpu 4096 Sep 13 14:54 .
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 ..
drwxrwxr-x 55 usergpu usergpu 4096 Sep 13 14:54 0_Introduction
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 1_Utilities
drwxrwxr-x 36 usergpu usergpu 4096 Sep 13 14:54 2_Concepts_and_Techniques
drwxrwxr-x 25 usergpu usergpu 4096 Sep 13 14:54 3_CUDA_Features
drwxrwxr-x 41 usergpu usergpu 4096 Sep 13 14:54 4_CUDA_Libraries
drwxrwxr-x 52 usergpu usergpu 4096 Sep 13 14:54 5_Domain_Specific
drwxrwxr-x 6 usergpu usergpu 4096 Sep 13 14:54 6_Performance
drwxrwxr-x 11 usergpu usergpu 4096 Sep 13 14:54 7_libNVVM
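Instead of compiling samples one at a time, you can also build the entire collection from the repository root. A sketch assuming the Makefile-based layout of the v12.2 tag (the full build takes a long time):
make -C .. -j$(nproc)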
Let's test the GPU bandwidth. Change the directory:
cd 1_Utilities/bandwidthTest
Compile the application:
make
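If make fails because nvcc is not on your PATH, the sample Makefiles accept the CUDA_PATH variable; the path below is an assumption, adjust it to your installation:
make CUDA_PATH=/usr/local/cuda-12.2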
Run tests
Start the tests by running the application by its name:
./bandwidthTest
The output may look like this:
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA RTX A6000
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 6.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 6.6
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 569.2
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
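Quick Mode measures a single 32 MB transfer. For a fuller picture you can sweep a range of transfer sizes on all GPUs; the options below are part of the bandwidthTest sample itself:
./bandwidthTest --device=all --memory=pinned --mode=shmoo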
Alternatively, you can compile and launch p2pBandwidthLatencyTest:
cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest
This application shows detailed information about your GPUs' bandwidth in P2P mode. Example output:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A6000, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A6000, pciBusID: 4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 590.51 6.04
1 6.02 590.51
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 589.40 52.75
1 52.88 592.53
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 593.88 8.55
1 8.55 595.32
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 595.69 101.68
1 101.97 595.69
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.61 28.66
1 18.49 1.53
CPU 0 1
0 2.27 6.06
1 6.12 2.23
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.62 1.27
1 1.17 1.55
CPU 0 1
0 2.27 1.91
1 1.90 2.34
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
In the two-GPU example above, enabling P2P raises the unidirectional bandwidth between the GPUs from about 6 GB/s to about 52 GB/s and cuts the GPU-to-GPU latency from tens of microseconds to roughly 1.2 µs, confirming that the traffic goes over NVLink®. For a configuration with multiple GPUs, the output may look like this:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA H100 PCIe, pciBusID: 30, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA H100 PCIe, pciBusID: 3f, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA H100 PCIe, pciBusID: 40, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA H100 PCIe, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA H100 PCIe, pciBusID: b0, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA H100 PCIe, pciBusID: b1, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA H100 PCIe, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA H100 PCIe, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1629.83 38.43 38.39 37.66 38.51 38.19 38.09 37.92
1 38.22 1637.04 35.52 35.59 38.15 38.38 38.08 37.55
2 37.76 35.62 1635.32 35.45 38.59 38.21 38.77 37.94
3 37.88 35.50 35.60 1639.45 38.49 37.43 38.72 38.49
4 36.87 37.03 37.00 36.90 1635.86 34.48 38.06 37.22
5 37.27 37.06 36.92 37.06 34.51 1636.18 37.80 37.50
6 37.05 36.95 37.45 37.15 37.51 37.96 1630.79 34.94
7 36.98 36.91 36.95 36.87 37.83 38.02 34.73 1633.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1635.22 34.42 33.84 256.54 27.74 28.68 28.00 28.41
1 34.66 1636.93 256.16 17.97 71.58 71.64 71.65 71.61
2 34.78 256.81 1655.79 30.29 70.34 70.42 70.37 70.33
3 256.65 30.65 70.67 1654.53 70.66 70.69 70.70 70.73
4 28.26 30.80 69.99 70.04 1630.36 256.45 69.97 70.02
5 28.10 31.08 71.60 71.59 256.47 1654.31 71.62 71.54
6 28.37 30.96 70.99 70.93 70.91 70.96 1632.12 257.11
7 27.66 30.87 70.30 70.40 70.30 70.39 256.72 1649.57
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1673.16 51.88 51.95 51.76 51.61 51.44 52.07 51.30
1 52.04 1676.28 39.06 39.21 51.62 51.62 51.98 51.36
2 52.11 39.27 1674.62 39.16 51.42 51.21 51.72 51.71
3 51.74 39.70 39.22 1672.77 51.50 51.27 51.70 51.24
4 52.14 52.41 51.38 52.14 1671.54 38.81 46.76 45.72
5 51.82 52.65 52.30 51.67 38.57 1676.33 46.90 45.96
6 52.92 52.66 53.02 52.68 46.23 46.31 1672.74 38.91
7 52.61 52.74 52.79 52.64 45.90 46.35 39.07 1673.16
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1670.31 52.41 140.69 508.68 139.85 141.88 141.71 140.55
1 141.69 1673.30 509.23 141.22 139.91 143.28 141.71 140.61
2 140.64 508.90 1669.67 140.68 139.93 140.61 140.67 140.50
3 509.14 141.36 140.61 1682.65 139.93 141.45 141.45 140.67
4 140.01 140.03 140.07 139.94 1670.68 508.37 140.01 139.90
5 141.92 143.17 140.50 141.19 508.92 1670.73 141.72 140.52
6 141.72 141.72 140.60 141.31 139.66 141.85 1671.51 510.03
7 140.62 140.71 140.66 140.63 140.02 140.72 509.77 1668.28
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.35 17.23 17.13 13.38 12.86 21.15 21.39 21.12
1 17.54 2.32 12.95 13.78 21.05 21.23 21.31 21.37
2 16.85 14.83 2.35 16.07 12.71 12.80 21.23 12.79
3 14.98 16.06 14.64 2.41 13.35 12.81 13.60 21.36
4 21.31 21.31 20.49 21.32 2.62 12.33 12.66 12.98
5 20.36 21.22 20.17 12.79 16.74 2.58 12.41 12.93
6 17.51 12.84 12.79 12.70 17.63 18.78 2.36 13.69
7 21.23 12.71 19.41 21.09 14.69 13.79 15.52 2.59
CPU 0 1 2 3 4 5 6 7
0 1.73 4.99 4.88 4.85 5.17 5.18 5.18 5.33
1 5.04 1.71 4.74 4.82 5.04 5.14 5.10 5.19
2 4.86 4.75 1.66 4.78 5.08 5.09 5.11 5.17
3 4.80 4.72 4.73 1.63 5.09 5.11 5.06 5.10
4 5.07 5.00 5.03 4.96 1.77 5.33 5.34 5.38
5 5.12 4.94 5.00 4.96 5.31 1.77 5.38 5.41
6 5.09 4.97 5.09 5.01 5.35 5.39 1.80 5.42
7 5.18 5.09 5.02 5.00 5.39 5.40 5.40 1.76
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.33 2.15 2.11 2.76 2.07 2.11 2.07 2.12
1 2.07 2.30 2.77 2.07 2.12 2.06 2.06 2.10
2 2.09 2.75 2.34 2.12 2.09 2.08 2.08 2.12
3 2.78 2.10 2.13 2.40 2.13 2.14 2.14 2.13
4 2.18 2.23 2.23 2.17 2.59 2.82 2.15 2.16
5 2.15 2.17 2.15 2.20 2.82 2.56 2.17 2.16
6 2.13 2.18 2.21 2.17 2.15 2.17 2.36 2.85
7 2.19 2.21 2.19 2.22 2.19 2.19 2.86 2.61
CPU 0 1 2 3 4 5 6 7
0 1.78 1.32 1.29 1.40 1.33 1.34 1.34 1.33
1 1.32 1.69 1.34 1.35 1.35 1.34 1.40 1.33
2 1.38 1.37 1.73 1.36 1.36 1.35 1.35 1.34
3 1.34 1.42 1.35 1.66 1.34 1.34 1.35 1.33
4 1.53 1.41 1.40 1.40 1.77 1.43 1.48 1.47
5 1.46 1.43 1.43 1.42 1.47 1.84 1.51 1.56
6 1.53 1.45 1.45 1.45 1.45 1.44 1.85 1.47
7 1.54 1.47 1.47 1.47 1.45 1.44 1.50 1.84
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
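As a cross-check that does not require compiling anything, recent nvidia-smi versions can print a P2P status matrix directly; the -p2p option may be missing on older drivers (here, r queries read capability):
nvidia-smi topo -p2p r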
Updated: 28.03.2025
Published: 06.05.2024