Talk: Faster, Better Training for Recommendation Systems
3 - 3:30pm
DLRM (Deep Learning Recommendation Model) is a deep learning-based recommendation model introduced and open-sourced by Facebook. It is a state-of-the-art model and part of the MLPerf training benchmark. The DLRM workload poses unique challenges for single-socket and multi-socket distributed training because it must balance a mixture of compute-bound, memory-bound, and I/O-bound operations. To tackle this, we implemented an efficient scale-out solution for DLRM training on Intel Xeon clusters. It includes innovative data and model parallelization, new hybrid splitSGD + LAMB optimizers, efficient hyperparameter tuning so the model converges with a much larger global batch size, and novel data loader techniques to support scale-up and scale-out. According to the MLPerf v1.0 training results, we can train DLRM with 64 Xeon Cooper Lake 8376H processors in 15 minutes, a 3X improvement over our MLPerf v0.7 submission with 16 Xeon Cooper Lake 8380 processors. In this talk, Ke will discuss DLRM, the unique challenges associated with it, and the optimizations that drive this training performance acceleration.