Today I deployed a new service at work that is based on zeromq. I never liked zeromq because it does not provide any feedback about what is going on, and this is exactly what caused problems. After deployment the service worked as expected on all servers except one. The daemon was starting, creating a zeromq socket and waiting for messages, but zeromq did not establish a connection and of course did not report any errors. It is supposed to simply work, and if it does not… well, it will pretend that everything is fine. The zeromq developers seem to assume that it is better to do nothing than to return an error or at least print a warning.
Strace revealed that zeromq's attempt to create a socket fails with EINVAL, because the kernel does not support SOCK_CLOEXEC; it then waits a bit and tries again. Now, I would understand that connect() may fail now and succeed two minutes later, but if socket() fails with EINVAL, then it does not make any sense to call it again and again with the same arguments. What do you expect, a hot kernel upgrade? This idea of not reporting any errors really does suck. If the only way to detect an error is to use tcpdump and strace, then something is wrong with the interface.
So they kinda fixed the problem with a compile-time check. Great. But sometimes a server runs in a virtual container with an older kernel, and then their compile-time check doesn't help. Thanks to the Ubuntu guys for the package with runtime checks: it compiled nicely on squeeze and fixed our problem. Till the next time, I guess…
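For the curious, here is roughly what a runtime check could look like. This is only a sketch of the general technique, not the actual zeromq or Ubuntu patch (I have not reproduced their code here), and the function name open_cloexec_socket is made up for illustration. The idea: try socket() with SOCK_CLOEXEC, and if the running kernel rejects it with EINVAL, fall back to a plain socket() followed by fcntl(). Any other error goes straight back to the caller instead of being retried forever.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of zeromq): create a socket with the
 * close-on-exec flag set, deciding at RUNTIME whether the kernel
 * understands SOCK_CLOEXEC. Returns the fd, or -1 with errno set. */
int open_cloexec_socket(int domain, int type, int protocol)
{
    int fd;
    int flags;

#ifdef SOCK_CLOEXEC
    /* The headers know the flag, but the running kernel may not
     * (Linux only supports it since 2.6.27). */
    fd = socket(domain, type | SOCK_CLOEXEC, protocol);
    if (fd >= 0)
        return fd;
    if (errno != EINVAL)
        return -1; /* a real error: report it, do not retry */
    /* EINVAL may mean "flag unknown": fall through to the old way. */
#endif

    fd = socket(domain, type, protocol);
    if (fd < 0)
        return -1; /* still failing without the flag: report it */

    /* Set close-on-exec by hand, as was done before SOCK_CLOEXEC. */
    flags = fcntl(fd, F_GETFD);
    if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

The fallback path has a small window between socket() and fcntl() where a fork and exec in another thread could leak the descriptor, which is the whole reason SOCK_CLOEXEC exists. On an old kernel there is no better option, though, and it still beats retrying the same failing call in silence.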