Which queries should we include so offline results match online behavior? How do we prompt an LLM to label query–document pairs reliably and with low bias? Which metrics (e.g., NDCG, MAP, Recall) best ...