**Some notes on perceived DNS resolver performance**

Last updated: 13th of March 2017 / [bert hubert](https://twitter.com/powerdns_bert), with thanks to [Håkan Lindqvist](https://twitter.com/HakanLindqvist).

Perceived DNS performance
=========================

The perceived performance of a DNS resolver, defined as the average time taken to deliver an answer, can be modeled as follows: we define \( t_{auth} \) as the average response time of a remote authoritative server, and \( t_{cached} \) as how fast our resolver implementation responds in the cached case. If we define \( P_{cached} \) as the chance of a cache hit, the average response time \( t_{resolve} \) becomes:

$$ t_{resolve} = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth} $$

This in itself is not yet overly insightful, but it becomes more so if we model \( P_{cached} \). If we assume that an entry stays in the cache until its TTL has expired, and we take into account the average query rate, we can estimate the number of cache hits we will get after the initial miss. This assumes that the query rate is high enough that we will receive a significant number of queries before the TTL expires.

A special case is formed by domains for which the reciprocal query rate is longer than the TTL, but such cases are either rare (because of a low query rate) or pathological (because of an unwisely low choice of TTL). In this case, \( P_{cached} = 0 \), leading to \( t_{resolve} = t_{auth} \).

For the normal case, if the query rate is \( R \) and the average TTL is \( T \), it follows that per time period \( T \):

$$ \begin{aligned} N_{hits} & = R \cdot T - 1 \\ N_{misses} & = 1 \end{aligned} $$

This leads to:

$$ \begin{aligned} P_{cached} & = \frac{N_{hits}}{N_{hits} + N_{misses}} = \frac{R \cdot T - 1}{R \cdot T} = 1 - \frac{1}{R \cdot T} \end{aligned} $$

From this it follows that the average resolving response time is:

$$ \begin{aligned} t_{resolve} & = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth} \\ & = \left(1 - \frac{1}{R \cdot T}\right) \cdot t_{cached} + \frac{1}{R \cdot T} \cdot t_{auth} \end{aligned} $$

\( t_{cached} \) can be assumed to be sub-millisecond, whereas authoritative servers respond within tens or hundreds of milliseconds, so usually \( t_{cached} \ll t_{auth} \), allowing for:

$$ t_{resolve} \approx \frac{t_{auth}}{R \cdot T} $$

From this, it can be seen that to improve response times without artificially refetching, we can work on lowering \( t_{auth} \), by picking remote servers that respond quickly, or on raising the effective query rate \( R \). One avenue for doing so is 'query concentration': making sure that similar queries end up on the same node within a load balanced setup (as [dnsdist](http://dnsdist.org) offers). Another way of achieving a higher effective \( R \) is adding 'cache peeking' to other nodes within a resolver constellation. Another cheap way is to artificially increase the minimum \( T \) above a certain, non-pathological, value.
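To make this concrete, here is a minimal Python sketch of the per-domain model above; all the numbers (query rate, TTL, timings) are invented for illustration:

```python
# Minimal sketch of the per-domain cache model; all numbers are
# invented for illustration.

def t_resolve(query_rate, ttl, t_auth, t_cached=0.0002):
    """Average response time in seconds for a single domain.

    query_rate -- queries per second for this domain (R)
    ttl        -- record TTL in seconds (T)
    t_auth     -- average authoritative response time in seconds
    t_cached   -- cached response time in seconds (sub-millisecond)
    """
    if query_rate * ttl <= 1:
        # Rare or pathological case: fewer than one query per TTL
        # period, so every query is a cache miss.
        return t_auth
    p_cached = 1 - 1 / (query_rate * ttl)
    return p_cached * t_cached + (1 - p_cached) * t_auth

# One query per second, a 30 second TTL and a 100 ms authoritative server:
exact = t_resolve(1, 30, 0.100)
approx = 0.100 / (1 * 30)  # t_auth / (R * T)
print(f"exact: {exact * 1000:.2f} ms, approximated: {approx * 1000:.2f} ms")
# -> exact: 3.53 ms, approximated: 3.33 ms
```

As the output shows, the approximation tracks the exact value as long as the cached term stays small relative to the cache-miss term.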
Generalizing this from individual domains to real traffic
=========================================================

The equations above all work for single domains, but it is the sum of these domains, each with their own \( R_{domain} \), \( t_{{auth}_{domain}} \) and \( T_{domain} \), that makes up the total experience. It would be tempting to simply average these three and do the math, but that would be highly misleading: domains with a low \( R \) have a negligible impact on the total experience. Meanwhile, popular domains are more likely than unpopular domains to be hosted on nameservers nearby.

To get real numbers, therefore, we need to average the actual numbers over all domains individually:

$$ t_{resolve} = \frac{\displaystyle \sum_{domains} R_{domain} \cdot \frac{t_{{auth}_{domain}}}{R_{domain} \cdot T_{domain}}}{R} $$

This weighs each domain based on its individual query rate. And this is where something interesting happens. The above can be simplified to:

$$ t_{resolve} = \frac{\displaystyle \sum_{domains} \frac{t_{{auth}_{domain}}}{T_{domain}}}{R} $$

The per-domain query rate has fallen out of the equation! If this formula is true, only the bulk query rate matters, and concentration would have no effect. This interpretation is incorrect, however. For each resolver,

$$ \displaystyle\sum_{domains} \frac{t_{{auth}_{domain}}}{T_{domain}} $$

goes down as it receives fewer unique domains, meaning each target of concentration achieves better perceived performance. And as the combined performance is the average, and not the sum, of the concentrated performances, this is a net win.

Actual latency, travel time to resolver
=======================================

For actual resolver latency, a fixed new term enters the equation:

$$ t_{resolve} \approx t_{travel} + \frac{t_{auth}}{R \cdot T} $$

where \( t_{travel} \) stands for the time a query takes to travel to the actual resolver, and back. The minimal \( t_{travel} \) depends critically on the latency of the chosen access technology: below 20 ms for DSL, somewhat less for cable, and single milliseconds for fiber. To this must be added the physical speed-of-light (in fiber) round-trip time to the resolver, which is roughly 1 millisecond per 100 km of distance (200 km of actual travel). For an optimized DNS resolver, actual resolution time will be dominated by such network latencies. In other words, it is rarely possible to get effective home user DNS latencies below several milliseconds.
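To tie the above together, here is a minimal Python sketch that evaluates the traffic-weighted formula and illustrates the concentration argument; the three-domain traffic mix is invented, and the fixed \( t_{travel} \) term would simply be added to each result:

```python
# Toy illustration of the traffic-weighted formula and of query
# concentration; the domain mix below is invented.

domains = [
    # (R_d: queries/s, T_d: TTL in s, t_auth_d: auth latency in s)
    (100.0, 300, 0.020),  # popular domain, fast nearby nameserver
    (10.0,   60, 0.050),
    (1.0,    30, 0.120),  # unpopular domain, far-away nameserver
]

def t_resolve(doms, rate_share=1.0):
    """Per-node t_resolve = sum(t_auth_d / T_d) / R, where rate_share < 1
    models a node that only sees that share of every domain's traffic
    (i.e. random load balancing)."""
    total_rate = rate_share * sum(r for r, _, _ in doms)
    return sum(t_auth / ttl for _, ttl, t_auth in doms) / total_rate

# A single resolver handling all traffic:
single = t_resolve(domains)

# Two nodes behind a random load balancer: each node still sees every
# unique domain, but at half the query rate, doubling its t_resolve.
random_lb = t_resolve(domains, rate_share=0.5)

# Two nodes with query concentration: every domain goes wholly to one
# node; the combined result is the rate-weighted average of both nodes.
part_a, part_b = domains[:2], domains[2:]
rate_a = sum(r for r, _, _ in part_a)
rate_b = sum(r for r, _, _ in part_b)
concentrated = (rate_a * t_resolve(part_a) + rate_b * t_resolve(part_b)) \
    / (rate_a + rate_b)

print(f"single resolver:   {single * 1000:.4f} ms")
print(f"random balancing:  {random_lb * 1000:.4f} ms")
print(f"concentrated:      {concentrated * 1000:.4f} ms")
# -> random balancing doubles the perceived latency, while
#    concentration restores the single-resolver result.
```

In this toy setup, splitting traffic randomly over two nodes doubles every node's sum-over-unique-domains relative to its query rate, while concentration keeps each domain's effective \( R \) intact, which is precisely the net win described above.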