triggerGHR/config/nhc_fault_dictionary.cfg (21 lines of code) (raw):
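# Entry format, inferred from the entries below (an assumption; the code that reads this file is not shown here):
#   <NHC code> = <fault category> #<description of when to report it>
# Lines that start with '#' are treated here as comments.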
NHC2001 = Resource.Hpc.Unhealthy.HpcGenericFailure #Use this to report issues that do not fall under any other HPC categories.
NHC2002 = Resource.Hpc.Unhealthy.MissingIB #Use this to report an HPC node with missing InfiniBand.
NHC2003 = Resource.Hpc.Unhealthy.IBPerformance #Use this to report an HPC node with degraded IB performance.
NHC2004 = Resource.Hpc.Unhealthy.IBPortDown #Use this to report when an IB port is persistently down.
NHC2005 = Resource.Hpc.Unhealthy.IBPortFlapping #Use this to report when an IB port is flapping.
NHC2007 = Resource.Hpc.Unhealthy.HpcRowRemapFailure #Use this to report an HPC node with GPU row remap failure.
NHC2008 = Resource.Hpc.Unhealthy.HpcInforomCorruption #Use this to report an HPC node with GPU infoROM corruption.
NHC2009 = Resource.Hpc.Unhealthy.HpcMissingGpu #Use this to report an HPC node with missing GPUs.
NHC2010 = Resource.Hpc.Unhealthy.ManualInvestigation #Use this to report issues that require further manual investigation by the HPC team.
NHC2011 = Resource.Hpc.Unhealthy.XID95UncontainedECCError #Use this to report an HPC node with NVRM Xid 95 error.
NHC2012 = Resource.Hpc.Unhealthy.XID94ContainedECCError #Use this to report an HPC node with NVRM Xid 94 error.
NHC2013 = Resource.Hpc.Unhealthy.XID79FallenOffBus #Use this to report an HPC node that may have NVRM Xid 79 error.
NHC2014 = Resource.Hpc.Unhealthy.XID48DoubleBitECC #Use this to report an HPC node that may have NVRM Xid 48 error.
NHC2015 = Resource.Hpc.Unhealthy.UnhealthyGPUNvidiasmi #Use this to report an HPC node that may experience Nvidia-smi hang and may not recover.
NHC2016 = Resource.Hpc.Unhealthy.NvLink #Use this to report an HPC node where NvLink may be down.
NHC2017 = Resource.Hpc.Unhealthy.HpcDcgmiThermalReport #Use this to report an HPC node that may have thermal violation reports from a DCGMI run.
NHC2018 = Resource.Hpc.Unhealthy.ECCPageRetirementTableFull #Use this to report an HPC node where double-bit ECC error page retirements may be reaching threshold.
NHC2019 = Resource.Hpc.Unhealthy.DBEOverLimit #Use this to report an HPC node that may have more than 10 DBE retired pages in a week.
NHC2020 = Resource.Hpc.Unhealthy.HpcGpuDcgmDiagFailure #Use this to report an HPC node with GPU DCGMI diagnostic failure.
NHC2021 = Resource.Hpc.Unhealthy.GPUMemoryBWFailure #Use this to report an HPC node with a GPU memory bandwidth issue.
NHC2022 = Resource.Hpc.Unhealthy.CPUPerformance #Use this to report an HPC node with a CPU performance issue.
#NHCNA = NHC setup issue. This is not an impact category; it is used to indicate that the NHC was not set up properly.