**Some notes on perceived DNS resolver performance**
Last updated: 13th of March 2017 / [bert hubert](https://twitter.com/powerdns_bert), with thanks to [HÃ¥kan Lindqvist](https://twitter.com/HakanLindqvist).
Perceived DNS performance
=========================
The perceived performance of a DNS resolver, defined as the average time taken
to deliver an answer, can be modeled as follows:
We define $t_{auth}$ as the average response time of a remote
authoritative server, and \( t_{cached} \) as how fast our resolver
implementation responds in the cached case.
If we set define \( P_{cached} \) as the chance of a cache hit, the average
response time \( t_{resolve} \) becomes:
$$
t_{resolve} = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth}
$$
This in itself is not yet overly insightful, but this becomes more so if we
model \( P_{cached} \). If we assume that an entry stays in the cache until the
TTL has expired, and we take into account the average query rate, we can
guess the amount of cache hits we will get after the initial miss. This
assumes that the query rate is high enough that we will get a significant
number of queries before the TTL expires.
A special case are domains for which the reciprocal query rate is longer
than the TTL, but such cases are either rare (because of a low query rate),
or pathological (because of an unwisely low choice of TTL). In this case,
\(P_{cached} = 0\), leading to \(t_{resolve} = t_{auth}\).
For the normal case, if the Query rate is \( R \) and the average TTL is \(
T \), it follows that per time period \( T \):
$$
\begin {aligned}
N_{hits} & = R \cdot T - 1 \\
N_{misses} & = 1
\end {aligned}
$$
This leads to:
$$ \begin{aligned}
P_{cached} & = \frac{N_{hits}}{N_{hits} + N_{misses}} = \frac{R \cdot T - 1}{R \cdot T} = 1 - \frac{1}{R \cdot T}
\end{aligned}
$$
From this it follows that the average resolving response time is:
$$
\begin {aligned}
t_{resolve} & = P_{cached} \cdot t_{cached} + (1-P_{cached}) \cdot t_{auth} \\
& = (1 - \frac{1}{R \cdot T}) \cdot t_{cached} + ( \frac{1}{R \cdot T} )
\cdot t_{auth}
\end {aligned}
$$
$t_{cached}$ can be assumed to be sub-millisecond, whereas authoritative
servers respond within 10s or 100s of milliseconds, so usually \( t_{cached}
\ll t_{auth} \), allowing for:
$$ t_{resolve} \approx \frac{t_{auth}}{R \cdot T} $$
From this, it can be seen that to improve response times without
artificially refetching, we can work on lowering \(t_{auth}\), by picking
remote servers that respond quickly, or by raising the effective query rate
\(R\).
One avenue for doing that is 'query concentration', making sure that similar
queries end up on the same node within a load balanced setup (like
[dnsdist](http://dnsdist.org) offers).
Another way of achieving higher effective \( R \) is by adding 'cache
peeking' to other nodes within a resolver constellation.
Another cheap way is to artificially increase minimum \( T \) above a
certain, non-pathological, value.
Generalizing this from individual domains to real traffic
=========================================================
The equations above all work for single domains, but it is the sum of these
domains, each with their own \(R_{domain}\), \(t_{{auth}_{domain}}\) and
\(T_{domain}\) that make up the total experience.
It would be tempting to simply average these three, and to the math, but
that would be highly misleading - domains with low \( R \) form a negligable
impact on the total experience. Meanwhile, popular domains are more likely
than unpopular domains to be hosted on nameservers nearby.
To get real numbers therefore, we need to average the actual numbers over
all domains individually:
$$
t_{resolve} = \frac{\displaystyle \sum_{domains} R_{domain} \cdot \frac{t_{{auth}_{domain}}}{R_{domain} \cdot
T_{domain}}}{R}
$$
This weighs each domain based on its individual query rate. And this is
where something interesting happens. The above can be simplified to:
$$
t_{resolve} = \frac{\displaystyle \sum_{domains} \frac{t_{{auth}_{domain}}}{
T_{domain}}}{R}
$$
The per-domain query rate has fallen out of the equation! If this formula is
true, only the bulk query rate matters, and concentration would have no
effect.
This interpretation is incorrect however. For each resolver,
$$
\displaystyle\sum_{domains} \frac{t_{{auth}_{domainD}}}{T_{domain}}
$$ goes down as it receives less unique domains,
meaning each target of concentration achieves better perceived performance.
And as the combined performance is the average and not sum of
contencentrated performances, this is a net win.
Actual latency, travel time to resolver
=======================================
For actual resolver latency, a fixed new term enters the equation:
$$ t_{resolve} \approx t_{travel} + \frac{t_{auth}}{R \cdot T} $$
Where \(t_{travel}\) stands for the amount of time a query takes to travel to
the actual resolver, and back.
The minimal $t_{travel}$ depends critically on chosen access technology
latency: <20ms for DSL, somewhat less for cable and milliseconds for
fiber. To this must be added the physical light speed (in fiber) travel
return time to the resolver, which is 1 millisecond per 100km (for 200km of
travel).
For an optimized DNS resolver, actual resolution time will be dominated by
such network latencies. In other words, it is rarely possible to get
effective home user DNS latencies below several milliseconds.