Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models. We only include BERT-Large models.
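To make the masking change concrete, here is a minimal sketch of whole word masking over WordPiece tokens: when any sub-token of a word is chosen for masking, all of its sub-tokens (the pieces prefixed with `##`) are masked together. The helper name `whole_word_mask`, the masking rate, and the fixed seed below are illustrative assumptions, not the repo's actual `create_pretraining_data.py` logic.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Mask all WordPiece sub-tokens of a selected word together.

    tokens: WordPiece tokens, where continuation pieces start with "##".
    Returns a copy of tokens with whole words replaced by [MASK].
    """
    rng = rng or random.Random(12345)  # fixed seed for reproducibility (illustrative)
    # Group token indices into whole words: a "##" piece belongs to the
    # preceding word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"  # every piece of the word is masked at once
    return masked

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens, mask_prob=0.3))
```

With standard masking, `phil`, `##am`, and `##mon` could be masked independently; with whole word masking, either all three are masked or none are, so the model can no longer recover a masked piece from the rest of the same word.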
Note that the English result is worse than the 84.2 MultiNLI baseline because this training used Multilingual BERT rather than English-only BERT. This implies that for high-resource languages, the Multilingual model is somewhat worse than a single-language model.