- It’s rare, but in the context of Gemini we can expect SDC events to impact training every week or two.
- Rapidly detecting and removing faulty hardware required several new techniques that exploit deterministic replay to isolate incorrect computations.
- A proactive SDC scanner runs on idle machines and hot standbys.
This could be an incorrect computation issue that the software doesn’t check for.
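
To illustrate how deterministic replay can isolate an incorrect computation, here is a minimal sketch. It is not the actual Gemini tooling: `training_step`, `digest`, and `find_suspect_devices` are hypothetical names, and the "devices" are simulated with a flag that injects a small error. The idea is that if a step is fully deterministic given its seed, replaying it on several machines must give bit-identical results, so any machine whose output digest disagrees with the majority is a suspect.

```python
import hashlib
import random
from collections import Counter


def training_step(seed: int, faulty: bool = False) -> list[float]:
    """Hypothetical deterministic step: same seed gives the same output."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(8)]
    out = [v * 2.0 + 1.0 for v in values]
    if faulty:
        # Simulated silent corruption: no exception, just a wrong number.
        out[3] += 1e-3
    return out


def digest(values: list[float]) -> str:
    """Hash the exact float representation so any bit difference shows up."""
    return hashlib.sha256(repr(values).encode()).hexdigest()


def find_suspect_devices(seed: int, devices: dict[str, bool]) -> list[str]:
    """Replay the same step on every device and flag those that disagree
    with the majority digest. `devices` maps a name to a simulated
    'is faulty' flag (an assumption made only for this sketch)."""
    digests = {
        name: digest(training_step(seed, faulty))
        for name, faulty in devices.items()
    }
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority)


if __name__ == "__main__":
    fleet = {"host-a": False, "host-b": True, "host-c": False}
    print(find_suspect_devices(seed=0, devices=fleet))
```

A majority vote needs at least three replicas to be unambiguous. With only two, a mismatch shows that something is wrong but not which machine is at fault.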