triggerGHR/config/nhc_fault_dictionary.cfg (21 lines of code) (raw):

NHC2001 = Resource.Hpc.Unhealthy.HpcGenericFailure #Use this to report issues that do not fall under any other HPC categories.
NHC2002 = Resource.Hpc.Unhealthy.MissingIB #Use this to report an HPC node with missing InfiniBand.
NHC2003 = Resource.Hpc.Unhealthy.IBPerformance #Use this to report an HPC node with degraded IB performance.
NHC2004 = Resource.Hpc.Unhealthy.IBPortDown #Use this to report when an IB port is persistently down.
NHC2005 = Resource.Hpc.Unhealthy.IBPortFlapping #Use this to report when an IB port is flapping.
NHC2007 = Resource.Hpc.Unhealthy.HpcRowRemapFailure #Use this to report an HPC node with GPU row remap failure.
NHC2008 = Resource.Hpc.Unhealthy.HpcInforomCorruption #Use this to report an HPC node with GPU infoROM corruption.
NHC2009 = Resource.Hpc.Unhealthy.HpcMissingGpu #Use this to report an HPC node with missing GPUs.
NHC2010 = Resource.Hpc.Unhealthy.ManualInvestigation #Use this to report issues that require further manual investigation by the HPC team.
NHC2011 = Resource.Hpc.Unhealthy.XID95UncontainedECCError #Use this to report an HPC node with NVRM Xid 95 error.
NHC2012 = Resource.Hpc.Unhealthy.XID94ContainedECCError #Use this to report an HPC node with NVRM Xid 94 error.
NHC2013 = Resource.Hpc.Unhealthy.XID79FallenOffBus #Use this to report an HPC node that may have NVRM Xid 79 error.
NHC2014 = Resource.Hpc.Unhealthy.XID48DoubleBitECC #Use this to report an HPC node that may have NVRM Xid 48 error.
NHC2015 = Resource.Hpc.Unhealthy.UnhealthyGPUNvidiasmi #Use this to report an HPC node that may experience an nvidia-smi hang and may not recover.
NHC2016 = Resource.Hpc.Unhealthy.NvLink #Use this to report an HPC node where NvLink may be down.
NHC2017 = Resource.Hpc.Unhealthy.HpcDcgmiThermalReport #Use this to report an HPC node that may have thermal violation reports from a DCGMI run.
NHC2018 = Resource.Hpc.Unhealthy.ECCPageRetirementTableFull #Use this to report an HPC node where double-bit ECC error page retirements may be reaching threshold.
NHC2019 = Resource.Hpc.Unhealthy.DBEOverLimit #Use this to report an HPC node that may have more than 10 DBE retired pages in a week.
NHC2020 = Resource.Hpc.Unhealthy.HpcGpuDcgmDiagFailure #Use this to report an HPC node with GPU DCGMI diagnostic failure.
NHC2021 = Resource.Hpc.Unhealthy.GPUMemoryBWFailure #Use this to report an HPC node with a GPU memory bandwidth issue.
NHC2022 = Resource.Hpc.Unhealthy.CPUPerformance #Use this to report an HPC node with a CPU performance issue.
#NHCNA = NHC Set Up issue error. This is not an impact category; it is used to indicate that the NHC was not set up properly.
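
A minimal sketch of how a consumer might load this dictionary, assuming the file on disk holds one `CODE = Category #comment` entry per line and that lines starting with `#` (such as the `NHCNA` note) are comments. The names `FaultEntry` and `parse_fault_dictionary` are hypothetical and are not taken from the triggerGHR code.

```python
from typing import Dict, NamedTuple


class FaultEntry(NamedTuple):
    """Hypothetical container for one dictionary entry."""
    category: str     # e.g. Resource.Hpc.Unhealthy.IBPortDown
    description: str  # text of the trailing '#' comment, if any


def parse_fault_dictionary(text: str) -> Dict[str, FaultEntry]:
    """Map each NHC code to its impact category and description.

    Assumes one 'CODE = Category #comment' entry per line; blank lines
    and lines that start with '#' are skipped.
    """
    entries: Dict[str, FaultEntry] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment such as #NHCNA
        # Split off the inline comment first, then the key from the value.
        body, _, comment = line.partition("#")
        code, sep, category = body.partition("=")
        if not sep:
            continue  # not a key = value line; ignore it
        entries[code.strip()] = FaultEntry(category.strip(), comment.strip())
    return entries


# Sample input copied from entries in the file above.
sample = """\
NHC2002 = Resource.Hpc.Unhealthy.MissingIB #Use this to report an HPC node with missing InfiniBand.
NHC2009 = Resource.Hpc.Unhealthy.HpcMissingGpu #Use this to report an HPC node with missing GPUs
#NHCNA = NHC Set Up issue error
"""

faults = parse_fault_dictionary(sample)
print(faults["NHC2009"].category)
```

Splitting on the first `#` before the first `=` keeps any `=` that appears inside a description from being mistaken for the key separator.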