Today I deployed a new service at work that is based on zeromq. I never liked zeromq because it does not provide any feedback about what is going on, and this is exactly what caused the problems. After deployment, the service worked as expected on all servers except one. The daemon was starting, creating a zeromq socket and waiting for messages, but zeromq did not establish a connection and of course did not report any errors. It is supposed to simply work, and if it does not… well, it will pretend that everything is fine. The zeromq developers seem to assume that it is better to do nothing than to return an error or at least print a warning.

Strace revealed that zeromq tries to create a socket with the SOCK_CLOEXEC flag, gets EINVAL because the kernel does not support SOCK_CLOEXEC, waits a bit, and tries again. Now, I would understand that connect may fail now and succeed two minutes later, but if socket fails with EINVAL, then it does not make any sense to call it again and again with the same arguments. What do you expect, a hot kernel upgrade? This idea of not reporting any errors really does suck. If the only way to detect an error is to use tcpdump and strace, then something is wrong with the interface.
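To make the point about retries concrete, here is a minimal sketch in Python of what I would have expected: retry only errors that can plausibly go away, and report everything else immediately. The helper name `retry_transient` and the exact set of "transient" errno values are my own assumptions for illustration, not anything taken from zeromq.

```python
import errno
import time

# Assumption: which errno values count as transient is a judgment call.
TRANSIENT_ERRNOS = {
    errno.ECONNREFUSED,
    errno.ETIMEDOUT,
    errno.EHOSTUNREACH,
    errno.ENETUNREACH,
    errno.EAGAIN,
    errno.EINTR,
}


def retry_transient(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Call operation(), retrying only on errors that may go away by themselves.

    Anything else (EINVAL, for example) is raised on the first failure,
    because repeating the same call with the same arguments cannot help.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if exc.errno not in TRANSIENT_ERRNOS or attempt == attempts:
                raise  # permanent error or out of attempts: report it
            sleep(delay)
```

With something like this, a refused connection is retried a few times, while EINVAL from socket surfaces right away instead of being swallowed in a loop.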

So they kinda fixed the problem with a compile-time check. Great. But sometimes a server runs in a virtual container with an older kernel than the one on the build machine, and then their compile-time check doesn't help. Thanks to the Ubuntu guys for the package with runtime checks: it compiled nicely on squeeze and fixed our problem. Till the next time, I guess…
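For the record, a runtime check is not hard. Below is a rough sketch in Python of the general idea (I have not read the Ubuntu patch, so this is not their code): try SOCK_CLOEXEC first, and if the kernel answers EINVAL, create a plain socket and set FD_CLOEXEC with fcntl instead. The helper `open_cloexec_socket` and its `factory` parameter are hypothetical names that exist only for this example, and it assumes a Unix system where the fcntl module is available.

```python
import errno
import fcntl
import socket

# SOCK_CLOEXEC is only defined on some platforms; 0 means "not available".
SOCK_CLOEXEC = getattr(socket, "SOCK_CLOEXEC", 0)


def open_cloexec_socket(family=socket.AF_INET, kind=socket.SOCK_STREAM,
                        factory=socket.socket):
    """Hypothetical helper: create a close-on-exec socket, checking at runtime."""
    if SOCK_CLOEXEC:
        try:
            return factory(family, kind | SOCK_CLOEXEC)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise  # a different failure: report it, do not hide it
            # EINVAL: assume the running kernel does not know the flag.
    sock = factory(family, kind)
    # Fallback: set close-on-exec on the descriptor after creation.
    flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
    fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    return sock
```

The fallback leaves a small window between creating the socket and setting the flag, which is the reason SOCK_CLOEXEC exists in the first place, but that still beats a daemon that silently never connects.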