GPT models retain "meaning knowledge" as long-term memory in the FFN sublayers (especially their nonlinear (NL) layers) and as short-term memory in the self-attention (SA) sublayers (especially the value memories {V_i}). In the transformer layer's [ SA-RC → (LN-FFN)-RC ...
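The bracketed [ SA-RC → (LN-FFN)-RC ] notation can be read as two residually connected (RC) sublayers: self-attention first, then an FFN preceded by layer normalization (LN). Below is a minimal numpy sketch of that structure; the function names, weight shapes, and the ReLU nonlinearity are illustrative assumptions, not the model's actual parameterization.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: normalize each token vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # SA: the value rows {V_i} act as the short-term memory slots
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # softmax over keys
    return w @ v

def ffn(x, W1, b1, W2, b2):
    # FFN: key-value long-term memory; the nonlinearity (here ReLU,
    # an assumption) gates which "keys" fire for a given input
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_w, ffn_w):
    # [ SA-RC -> (LN-FFN)-RC ]: each sublayer wrapped in a residual
    x = x + self_attention(x, *attn_w)     # SA-RC
    x = x + ffn(layer_norm(x), *ffn_w)     # (LN-FFN)-RC
    return x
```

The residual stream is what lets both "memories" contribute additively: each sublayer writes its output back into the same vector it read from.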