The answer is that raw latency and perceived latency are different engineering problems.
Raw latency is backend execution time. It includes retrieval, reranking, orchestration, model inference, queueing, serialization, and network overhead.
A silent 1.5-second wait can feel worse than a streamed 2.5-second answer because the first interface creates uncertainty and the second creates progress.
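To make the comparison concrete, here is a minimal sketch that measures both numbers for a response stream. It assumes that time to first visible chunk is a reasonable proxy for perceived latency, which is an assumption of this example rather than a metric the text names. The names `measure_stream`, `simulated_stream`, and `SimulatedClock` are hypothetical, and the per-chunk delays are invented so that the totals match the 1.5-second and 2.5-second cases above. A simulated clock replaces real sleeping so the result is deterministic.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


class SimulatedClock:
    """Hypothetical stand-in for a real clock, so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulated_stream(
    steps: List[Tuple[float, str]], clock: SimulatedClock
) -> Iterator[str]:
    """Yield each chunk after its (illustrative) backend delay has elapsed."""
    for delay, chunk in steps:
        clock.sleep(delay)
        yield chunk


def measure_stream(
    chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Tuple[float, float, str]:
    """Consume a stream and return (time_to_first_chunk, total_time, text)."""
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            # The moment the user first sees something on screen.
            first_chunk_at = clock() - start
        parts.append(chunk)
    total = clock() - start
    if first_chunk_at is None:
        # Empty stream: nothing was ever shown before the end.
        first_chunk_at = total
    return first_chunk_at, total, "".join(parts)


# Silent interface: nothing is shown until the full answer arrives at 1.5 s.
silent_clock = SimulatedClock()
silent_first, silent_total, _ = measure_stream(
    simulated_stream([(1.5, "full answer")], silent_clock), silent_clock
)

# Streamed interface: slower overall (2.5 s), but the first chunk lands at 0.25 s.
streamed_clock = SimulatedClock()
streamed_first, streamed_total, _ = measure_stream(
    simulated_stream(
        [(0.25, "partial "), (1.0, "answer, "), (1.25, "now complete")],
        streamed_clock,
    ),
    streamed_clock,
)

print(f"silent:   first={silent_first:.2f}s total={silent_total:.2f}s")
print(f"streamed: first={streamed_first:.2f}s total={streamed_total:.2f}s")
```

With a real backend, the same `measure_stream` function could be called with its default `time.perf_counter` clock on any iterable of chunks. Logging both numbers side by side shows which one a given optimization actually moves.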
This distinction matters because we often optimize the wrong number. We focus only on total execution time, while the user is reacting to something more specific:
About the author

Cyril Noirot
Lead Data Scientist
Freelance data scientist. I design and deploy decision systems: forecasting, pricing, marketing measurement, optimization.