#+TITLE: VerifyHostKeyDNS
#+SUBTITLE: ... or how I enroll new hosts into my infrastructure.
#+DATE: 2023-01-15
* Prologue
I run my own infrastructure. I self-host my email, DNS, this website,
a [[https://git.tlakh.xyz/explore/repos][git server]], [[https://restic.net/][backups]], and probably a bunch of other stuff that I
forgot about. Ah yes, [[https://icinga.com/][monitoring]], Ubiquiti UniFi for my Wi-Fi access
points at home, and probably even more stuff.
All of it runs [[https://openbsd.org/][OpenBSD]], except for one machine running [[https://debian.org/][Debian]]. It's
all tied together with [[https://www.ansible.com/][ansible]][fn:: I started out with ansible,
switched to SaltStack and moved back to ansible. Because reasons.].
So far it's eight machines. I was reinstalling and consolidating some
VMs and physical machines the other day, and hooking up new machines
became annoying because of ssh host-keys.
* StrictHostKeyChecking
My ansible orchestration host needs to be able to talk to new machines
over ssh. New machines need to talk to the backup server over ssh and
submit passive check results over ssh to the monitoring server. The
monitoring server needs to talk to new hosts over ssh[fn:: I don't
trust nrpe. I have seen the code. Instead I use ~by_ssh~ to monitor
hosts. Ansible adds an ssh public-key to a monitoring user with a
force-command. The force-command is a shell-script switching over
~${SSH_ORIGINAL_COMMAND}~ to run specific check_commands; a sketch of
such a script is shown below. It does not trust the remote ssh at
all.].
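
The force-command from that footnote could look roughly like this. A
minimal sketch; the check names, thresholds, and plugin paths are made
up, adjust them to your monitoring setup:
#+begin_src shell
#!/bin/sh
# Force-command sketch: only a fixed set of checks is allowed.
# sshd puts the command requested by the client into
# SSH_ORIGINAL_COMMAND; we never execute it directly, we only
# switch over it.
case "${SSH_ORIGINAL_COMMAND}" in
check_disk)
	exec /usr/local/libexec/nagios/check_disk -w 10% -c 5% -p /
	;;
check_load)
	exec /usr/local/libexec/nagios/check_load -w 5,4,3 -c 10,8,6
	;;
*)
	echo "UNKNOWN: unexpected command"
	exit 3	# UNKNOWN, in monitoring-plugins convention
	;;
esac
#+end_src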
So we have the issue of existing infrastructure needing to verify
host-keys of new hosts, and of new hosts needing to verify host-keys
of existing infrastructure. One way to deal with this is to run a [[https://www.lorier.net/docs/ssh-ca.html][CA,
sign host-keys with it and roll certificates out]].
I, on the other hand, prefer to use DNS[fn:: I have a laptop sticker
and a travel mug with "We reject kings, presidents and voting. We
believe in rough consensus and running code." crossed out with "Fuck
that! Just put it in DNS." I also have a RUN DNS sticker. I am
biased.]. [[https://www.rfc-editor.org/rfc/rfc4255][RFC4255]] provides facilities to store host-keys in SSHFP
resource records in DNS, and we can secure those with DNSSEC.
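
An SSHFP record is tiny. In the zone it looks roughly like this
(hypothetical name, elided fingerprint; the two numbers are the key
algorithm, e.g. 4 for Ed25519, and the fingerprint type, e.g. 2 for
SHA-256):
#+begin_example
host.example.com. IN SSHFP 4 2 ...
#+end_example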
* VerifyHostKeyDNS
[[https://man.openbsd.org/ssh_config.5#VerifyHostKeyDNS][ssh_config(5)]] explains how [[https://man.openbsd.org/ssh.1][ssh(1)]] can use SSHFP records to verify
host-keys:
+ *VerifyHostKeyDNS* :: Specifies whether to verify the remote key using
DNS and SSHFP resource records. If this option is set to yes, the
client will implicitly trust keys that match a secure fingerprint
from DNS. Insecure fingerprints will be handled as if this option
was set to ask. If this option is set to ask, information on
fingerprint match will be displayed, but the user will still need to
confirm new host keys according to the StrictHostKeyChecking option.
The default is no.
One problem with this is that if you put
#+begin_example
Host *
VerifyHostKeyDNS yes
#+end_example
into your =.ssh/config=, it will not work. The magic phrase is /secure
fingerprint/. What the documentation means is that the DNS answer for
the SSHFP query needs to have the /Authentic Data (AD)/ flag set. The
flag gets set by a validating name-server if it can DNSSEC-validate
the SSHFP records.
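
You can check what the name-server hands out with dig, which talks to
the name-server directly; if it validates, ~ad~ shows up in the flags
line of the answer (hypothetical hostname):
#+begin_src shell
dig +dnssec host.example.com SSHFP
#+end_src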

But when the libc stub resolver[fn:: The thingy[fn:: Thingy is a
technical term, don't worry about it.] that ssh uses to talk to the
validating name-server. On OpenBSD that is [[https://man.openbsd.org/man3/asr_run.3][asr]].] gets that answer, it
will strip the AD flag for security reasons. You see, it does not
know that it can trust the validating name-server. One way to have a
trustworthy validating name-server is to run one on localhost.
[[http://man.openbsd.org/resolv.conf#trust-ad][resolv.conf(5)]] explains the *trust-ad* option:
+ *trust-ad* :: A name server indicating that it performed DNSSEC
validation by setting the Authentic Data (AD) flag in the answer can
only be trusted if the name server itself is trusted and the network
path is trusted. Generally this is not the case and the AD flag is
cleared in the answer. The trust-ad option lets the system
administrator indicate that the name server and the network path are
trusted. This option is automatically enabled if resolv.conf only
lists name servers on localhost.
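
So with a trusted validating name-server that is not on localhost, you
would have to opt in explicitly. A minimal sketch of such an
=/etc/resolv.conf=, where 192.0.2.53 is a placeholder for a validating
resolver reached over a trusted path:
#+begin_example
nameserver 192.0.2.53
options trust-ad
#+end_example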
The easiest way is to run [[https://man.openbsd.org/unwind.8][unwind(8)]]:
#+begin_src shell
doas rcctl enable unwind
doas rcctl start unwind
#+end_src
[[https://man.openbsd.org/resolvd.8][resolvd(8)]] will then add =nameserver 127.0.0.1= to
=/etc/resolv.conf= and comment out all other dynamically learned name
servers. Just make sure that you are not using any statically configured
name servers[fn:: I use ~! route nameserver $if 149.112.112.9
2620:fe::9 9.9.9.9 2620:fe::fe:9~ in my main [[http://man.openbsd.org/hostname.if.5][hostname.if(5)]] to add
some static name servers in case unwind(8) crashes[fn:: Not sure why
it would do that though. Sounds unpleasant.].] because you really want
to have only =nameserver 127.0.0.1= in there.
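
The result should look roughly like this, where 192.0.2.1 stands in
for whatever name server the network handed out:
#+begin_example
nameserver 127.0.0.1
#nameserver 192.0.2.1
#+end_example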
* Putting it all together
When I install a new host I have out-of-band access in one way or
another. It might be a serial console, a fake HTML5 console, or some
KVM contraption. Heck, I even used [[https://blog.eulinux.org/2018/07/hetzner-install.html][qemu]] to get OpenBSD running on a
Hetzner physical machine.
On the installed machine I use said out-of-band access to run
#+begin_src shell
ssh-keygen -l -f /etc/ssh/ssh_host_ed25519_key.pub
#+end_src
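
The output looks roughly like this, with the fingerprint elided; I
note it down so I can compare it on first login:
#+begin_example
256 SHA256:... root@newhost (ED25519)
#+end_example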
This gives me the fingerprint of one ssh host-key, so I can log in
over ssh and verify it by hand. I have to add IPv6 and legacy-IP
addresses to DNS for the machine anyway, so I also grab the SSHFP
records to add them at the same time:
#+begin_src shell
ls /etc/ssh/*.pub | xargs -n1 ssh-keygen -r $(hostname) -f
#+end_src
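
The output is ready to paste into the zone. It looks roughly like
this, one record per key and fingerprint type, fingerprints elided:
#+begin_example
newhost.example.com IN SSHFP 1 1 ...
newhost.example.com IN SSHFP 1 2 ...
newhost.example.com IN SSHFP 4 1 ...
newhost.example.com IN SSHFP 4 2 ...
#+end_example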
While still logged in, I install python3 and add an ssh-key for
2023-01-14 19:32:00 +01:00
ansible. I then add the host to the ansible inventory. The ansible
orchestrator can now finish the installation of the host over ssh
while trusting the SSHFP records it finds in DNS.
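
For that to work, the orchestrator's ssh needs VerifyHostKeyDNS
enabled, e.g. in its =.ssh/config=; a sketch with a hypothetical host
pattern:
#+begin_example
Host *.example.com
	VerifyHostKeyDNS yes
#+end_example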
Ansible also hooks the host up to my monitoring system, and the
monitoring system can connect to the new host over ssh, again trusting
that it talks to the correct host because of the SSHFP records in DNS.
The newly installed host knows that it's talking to my backup and
monitoring servers by checking their published SSHFP records.
* Epilogue
I have some ideas on how to streamline this even more, but I do not
install new machines that often. The current setup strikes a
reasonable balance between manual work and working on automation.
It's probably best to leave it like this.