You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?
You are deploying an AI workload on a Kubernetes cluster that requires access to GPUs for training deep learning models. However, the pods are not able to detect the GPUs on the nodes.
What would be the first step to troubleshoot this issue?
You are configuring cloudbursting for your on-premises cluster using BCM, and you plan to extend the cluster into both AWS and Azure.
What is a key requirement for enabling cloudbursting across multiple cloud providers?
A system administrator needs to configure and manage multiple installations of NVIDIA hardware ranging from single DGX BasePOD to SuperPOD.
Which software stack should be used?
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
A system administrator needs to optimize the delivery of their AI applications to the edge.
What NVIDIA platform should be used?
Which two (2) ways does the pre-configured GPU Operator in NVIDIA Enterprise Catalog differ from the GPU Operator in the public NGC catalog? (Choose two.)
An administrator requires full access to the NGC Base Command Platform CLI.
Which command should be used to accomplish this action?
A DGX H100 system in a cluster is showing performance issues when running jobs.
Which command should be run to generate system logs related to the health report?
A system administrator notices that jobs are failing intermittently on Base Command Manager due to incorrect GPU configurations in Slurm. The administrator needs to ensure that jobs utilize GPUs correctly.
How should they troubleshoot this issue?
A cloud engineer is looking to provision a virtual machine for machine learning using the NVIDIA Virtual Machine Image (VMI) and Rapids.
What technology stack will be set up for the development team automatically when the VMI is deployed?
An administrator is troubleshooting issues with an NVIDIA Unified Fabric Manager Enterprise (UFM) installation and notices that the UFM server is unable to communicate with InfiniBand switches.
What step should be taken to address the issue?
A system administrator is looking to set up virtual machines in an HGX environment with NVIDIA Fabric Manager.
What three (3) tasks will Fabric Manager accomplish? (Choose three.)