Nexus Group

Certificate Manager 7.6, Performance and Availability

Blog post   •   Feb 03, 2014 12:29 CET

As some of you know, I spent the fall in Colombia. I had a really good time, but now I'm back, and it is a very exciting time because we are releasing the new version of neXus Certificate Manager (CM), which we worked hard on during 2013.

As you can imagine, many improvements were made in CM 7.6 to make it the best CM released so far. I will not go into all the details of the improvements in this blog post; instead, I will focus on one big improvement. As the title implies, I will focus on performance and availability, and I will do so in the light of the new active-active high availability (HA) configuration feature. For a complete list of improvements, have a look at the release documentation.

Certificate Manager has long supported an active-passive cluster configuration. With CM 7.6 we go one step further by adding active-active support, which opens up a new set of possibilities. I will try to guide you through the implications of this improvement, from installation and configuration to setup examples and monitoring.

Let's start by looking at how this type of installation differs from an active-passive clustered installation. Previously, HA with CM required third-party cluster software that would trigger on failure of the active node and start up the passive node. With the new CM and an active-active configuration, the installation is as simple as setting up two separate CM installations and pointing them to the same database. After the setup, clients can connect to either installation, and all changes will be reflected in the other one since all operations are stored in the database.
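
To illustrate why both installations see the same state, here is a minimal sketch that uses SQLite purely as a stand-in for the shared database. The `Node` class, the table layout, and the node names are hypothetical and do not reflect how CM stores its data; the only point is that two independent processes writing to one database observe each other's changes.

```python
import sqlite3


class Node:
    """Stand-in for one CM installation; all state lives in the shared database."""

    def __init__(self, name: str, db_path: str) -> None:
        self.name = name
        self._db = sqlite3.connect(db_path)
        # Hypothetical table, created by whichever node starts first.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS certificates "
            "(serial TEXT PRIMARY KEY, issued_by TEXT)"
        )
        self._db.commit()

    def issue(self, serial: str) -> None:
        """Record an issued certificate and commit so other nodes can see it."""
        self._db.execute(
            "INSERT INTO certificates VALUES (?, ?)", (serial, self.name)
        )
        self._db.commit()

    def serials(self) -> list:
        """Return every serial in the shared database, whoever issued it."""
        rows = self._db.execute(
            "SELECT serial FROM certificates ORDER BY serial"
        )
        return [row[0] for row in rows]

    def close(self) -> None:
        self._db.close()
```

Two `Node` objects created with the same `db_path` behave like the two installations: a certificate issued through one shows up in `serials()` of the other.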

With the active-passive configuration, it was the cluster software that detected service failure and initiated failover. In the active-active setup you will in most cases put a load balancer in front of CM to distribute the traffic over the nodes and, in case of failure, direct the traffic to the functioning node until the faulty one has been repaired and restored. How to do the failure detection depends on the kind of integration possibilities the load balancer supports.
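
As a rough illustration of this kind of failure detection, the sketch below shows a round-robin balancer that skips any node whose health probe fails. The node names and the probe functions are assumptions made for the example; a real load balancer would use whatever check it supports (a TCP connect, an HTTP request, SNMP and so on).

```python
from itertools import cycle
from typing import Callable, Dict, List


class RoundRobinBalancer:
    """Distributes requests over nodes, skipping those that fail a health probe."""

    def __init__(self, probes: Dict[str, Callable[[], bool]]) -> None:
        # probes maps a (hypothetical) node name to a function that
        # returns True when the node answers its health check.
        self._probes = probes
        self._order = cycle(list(probes))

    def healthy_nodes(self) -> List[str]:
        """Return the names of all nodes that currently pass their probe."""
        return [name for name, probe in self._probes.items() if probe()]

    def pick(self) -> str:
        """Return the next healthy node in round-robin order."""
        for _ in range(len(self._probes)):
            name = next(self._order)
            if self._probes[name]():
                return name
        raise RuntimeError("no healthy CM node available")
```

When one probe starts returning `False`, `pick()` keeps returning the remaining node; when the probe recovers, the node rejoins the rotation without any further action.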

Performance has historically not been a problem with CM, but the number of issuances is increasing drastically: as security becomes increasingly important, more user certificates are issued, machines (routers, antennas etc.) require certificates to connect to the Internet of Things (IoT), and signature services issue certificates for one-time use. With the active-active setup it is possible to double or triple the performance of the system by just adding a new set or pair of nodes. Performance increases in this way since all components except the database can be duplicated, and the database is normally not a problem since modern database clusters perform very well.
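
A back-of-the-envelope way to think about this scaling is sketched below. It assumes, as a simplification, that every node contributes the same issuance rate and that the shared database puts a ceiling on the total. The function name and both rates are illustrative inputs, not measured CM figures.

```python
def cluster_throughput(nodes: int, per_node_rate: float, database_rate: float) -> float:
    """Estimate issuances per second for an active-active cluster.

    Assumes each node contributes per_node_rate independently and that the
    shared database caps the total at database_rate. Both rates are
    hypothetical values supplied by the caller.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return min(nodes * per_node_rate, database_rate)
```

With a per-node rate of 100 and a database ceiling of 1000, going from one node to two or three doubles or triples the estimate, and only well beyond that does the database become the limiting factor.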

It has long been possible to run multiple Certificate Issuing Systems (CIS) to provide better availability of the HSM connection. That setup could look as follows:

With this setup, Certificate Factory (CF) can fail over to the secondary CIS, which makes the connection to the HSMs more reliable. To support this, the CIS has an option to do a graceful shutdown in case of problems with the connection to the HSM. This makes it easy for CF to notice problems and continue operation with the other CIS until the primary one is back in operation.
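
The failover behaviour can be sketched as trying each CIS in priority order. This is a minimal illustration and not CF's actual implementation: `CisUnavailable`, `issue_with_failover`, and the callables standing in for CIS connections are all hypothetical names.

```python
from typing import Callable, Optional, Sequence


class CisUnavailable(Exception):
    """Raised by a CIS client when the CIS has shut down or is unreachable."""


def issue_with_failover(
    request: str, cis_clients: Sequence[Callable[[str], str]]
) -> str:
    """Try each CIS in priority order and return the first successful result.

    Because the primary is always tried first, traffic returns to it
    automatically once it is back in operation.
    """
    last_error: Optional[Exception] = None
    for sign in cis_clients:
        try:
            return sign(request)
        except CisUnavailable as exc:
            last_error = exc  # remember the failure and try the next CIS
    raise CisUnavailable("no CIS available") from last_error
```

A CIS that has shut down gracefully fails fast with a clear error, which is what lets a loop like this move on to the secondary without waiting for timeouts.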

We might instead have a setup without failover between CISs, for example because the setups are placed in separate physical locations and security zones that are not allowed to be connected to each other. It can then look like this:

Here the CIS can shut down if the connection to the HSM is lost, but this makes it hard for the load balancer to notice problems if it does not monitor the CIS service too. That is, if the CIS was down, CF could still answer as expected from the load balancer's point of view. To solve this we added an option in CF to also do a graceful shutdown in case of connection problems with the CIS. With this enabled, it is easy for the load balancer to shift the traffic to the running node.
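
The chain of graceful shutdowns can be modelled in a few lines. The `Component` class below is a hypothetical model, not CM code: each component stops itself when the component it depends on is no longer running, so a lost HSM connection eventually becomes visible at the CF, which is the only thing the load balancer probes.

```python
from typing import Optional


class Component:
    """A service that shuts itself down when its dependency is gone."""

    def __init__(self, name: str, depends_on: Optional["Component"] = None) -> None:
        self.name = name
        self.depends_on = depends_on
        self.running = True

    def supervise(self) -> bool:
        """Check the dependency chain and stop if the dependency is not running.

        A stopped component stays stopped in this model; restarting it is
        assumed to be a separate, deliberate action.
        """
        if self.depends_on is not None and not self.depends_on.supervise():
            self.running = False
        return self.running
```

Building the chain as `hsm`, `cis = Component("cis", hsm)` and `cf = Component("cf", cis)` and then setting `hsm.running = False` makes `cf.supervise()` return `False`, which is the signal the load balancer needs to shift traffic.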

Until now I have only described the most obvious setup possibilities. Let's continue with some more imaginative setups, e.g. separate roles and a test system.

One of the most critical parts of a PKI is CRL production, since without a valid CRL it is not possible to know whether a certificate can be trusted, i.e. operations will be down. With the new active-active possibility we could add one or more CFs that do not have any clients connecting to them and are only responsible for producing the CRLs. Another operation that might be very sensitive is the issuing of machine certificates, since if it fails, routers might fail to operate correctly, which would be incredibly bad for operations. It might therefore be desirable to have one or several CFs running for the issuing of machine certificates only.
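
One way to keep such a role separation honest is to write the assignment down and check that dedicated nodes really are dedicated. The sketch below is only an illustration; the role names, the node names, and the `shared_nodes` helper are hypothetical and not part of CM.

```python
from typing import Dict, List

# Hypothetical assignment of CF nodes to roles.
ROLES: Dict[str, List[str]] = {
    "client-traffic": ["cf-1", "cf-2"],
    "crl-production": ["cf-crl"],
    "machine-certificates": ["cf-machine"],
}


def shared_nodes(roles: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return nodes that appear in more than one role, i.e. are not dedicated."""
    seen: Dict[str, List[str]] = {}
    for role, nodes in roles.items():
        for node in nodes:
            seen.setdefault(node, []).append(role)
    return {node: assigned for node, assigned in seen.items() if len(assigned) > 1}
```

An empty result from `shared_nodes(ROLES)` means that no CRL or machine-certificate node is also serving client traffic.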

When making critical changes to the CM configuration, it is desirable to test the configuration before applying it to production. At the same time, it is important that the test environment is as similar as possible to production so that we minimize the number of surprises when moving changes into production. With a separate CF for testing configuration changes, this can be done close to production but without interrupting it.

Even if we don’t want to admit it, all software can fall into strange states where it seems to work from the outside but does not operate as it should. With infrastructure software such as CM it is paramount to detect this as soon as possible and act. We work hard to provide these possibilities in CM. Last year we made a major improvement to the SNMP support, and earlier we added the ping functionality, which runs through the whole certificate issuing procedure without issuing a certificate and just reports the result. We have also made CM more informative about operational issues: the Distribution Manager now tries to publish several times before giving up, and if it comes to that, it can send an email to notify the administrator of what is happening.
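
The retry-then-notify pattern described for the Distribution Manager can be sketched as follows. This is an assumption-laden illustration, not the actual implementation: `PublishError`, `publish_with_retries`, the default of three attempts, and the `notify` callback (standing in for sending an email) are all hypothetical, and a real implementation would normally wait between attempts.

```python
from typing import Callable, Optional


class PublishError(Exception):
    """Raised by the (hypothetical) publish function when publication fails."""


def publish_with_retries(
    publish: Callable[[], None],
    notify: Callable[[str], None],
    attempts: int = 3,
) -> bool:
    """Call publish up to `attempts` times; notify the administrator if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[PublishError] = None
    for _ in range(attempts):
        try:
            publish()
            return True
        except PublishError as exc:
            last_error = exc  # keep the most recent failure for the report
    notify(f"Publishing failed after {attempts} attempts: {last_error}")
    return False
```

A transient failure is absorbed silently by the retries, while a persistent one produces exactly one notification rather than one per attempt.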

So where do we go from here? Some configuration is still file-based and has to be synchronized between the nodes. Even though this separation can have benefits, it is desirable to move most configuration to a shared space, e.g. the database, to lessen the risk of unintentional differences in configuration between the nodes. It is also desirable to add additional monitoring to make the decision to fail over more reliable and give the load balancer more information to act on.
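
Until the configuration lives in a shared space, unintentional differences between nodes can at least be detected by comparing fingerprints of the file-based configuration. The sketch below assumes each node's configuration has been collected as a mapping from file name to file content; the function names and file names are hypothetical.

```python
import hashlib
from typing import Dict, List


def config_fingerprint(files: Dict[str, bytes]) -> str:
    """Return a SHA-256 fingerprint over file names and contents.

    Files are processed in sorted order and every field is length-prefixed,
    so the result does not depend on dictionary order or on where one
    field ends and the next begins.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        for field in (name.encode("utf-8"), files[name]):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()


def drifted_nodes(reference: str, nodes: Dict[str, Dict[str, bytes]]) -> List[str]:
    """Return the nodes whose configuration differs from the reference node."""
    expected = config_fingerprint(nodes[reference])
    return sorted(
        name for name, files in nodes.items()
        if config_fingerprint(files) != expected
    )
```

Run periodically, a check like this turns a silent configuration difference into something a monitoring system can alert on.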