There are many good answers to this question. I like Amazon's description of DevOps: "DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability ...
TORQUE, Slurm, and other schedulers/resource managers can run a periodic "node health check" on each compute node to verify that the node is working properly. Nodes which are ...
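
To make the idea of a node health check concrete, here is a minimal sketch in Python. It is not taken from TORQUE or Slurm: the function names (evaluate_node, check_local_node), the two checks (free disk space and load average), and the thresholds are all assumptions chosen for illustration. How a failure is reported back to the scheduler differs between resource managers and is not shown here.

```python
import os
import shutil


def evaluate_node(free_bytes, load_per_cpu,
                  min_free_bytes=1 << 30, max_load_per_cpu=4.0):
    """Return a list of problem descriptions; an empty list means healthy.

    The thresholds are illustrative defaults, not scheduler settings.
    """
    problems = []
    if free_bytes < min_free_bytes:
        problems.append(f"low disk space: {free_bytes} bytes free")
    if load_per_cpu > max_load_per_cpu:
        problems.append(f"high load: {load_per_cpu:.2f} per CPU")
    return problems


def check_local_node(path="/"):
    """Gather measurements from this machine and evaluate them.

    os.getloadavg() is only available on Unix-like systems.
    """
    free_bytes = shutil.disk_usage(path).free
    load_1min = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    return evaluate_node(free_bytes, load_1min / cpus)
```

A wrapper script could call check_local_node() and print each returned problem. Keeping the measurement step separate from the evaluation step means the thresholds can be tested without touching a real node.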