In this guide
- Disaster Recovery
- Growth Models for Reliability
This guide is an overview of models and frameworks to help you achieve more reliable deployments with Nexus Repository. We’ll focus on two concepts – (1) disaster recovery (DR) and (2) availability – how they differ, and how they both serve the purpose of fault tolerance in your production deployments.
You can use Nexus Repository to achieve outcomes with both concepts, starting with a model where you stand up one node. Then as your usage grows, you can move to more advanced models. Employing the models in this guide can help ward off both predictable and unforeseen events related to system disruption and maintaining data integrity.
Ultimately, there’s no one-size-fits-all deployment. This guide will help you become more familiar with a variety of models to help you meet the best scenario for your organization.
By the end of this guide, you should be able to:
- Define availability and disaster recovery.
- Distinguish the outcomes between availability and disaster recovery.
- Identify the tools and services needed to build deployments resistant to data loss.
- Choose a deployment framework that best suits your organization.
Availability
Availability describes the amount of time over a given period that the resources in a deployment are operational. To achieve availability, ancillary configurations – e.g. replication or backup – need to be well-designed and fully tested in your chosen environment. This way, you can mature toward a highly available system that is more resistant to loss and more flexible during recovery periods.
When you start with an environment containing a single node, you won’t meet availability goals. David Clinton, a technology writer and Linux administrator, says in this Hacker Noon article, “[One node is equivalent to a single server,] running operations independently on a [storage area network].” In other words, a single node can fail; meeting availability goals therefore requires additional, replicated Nexus servers that keep operating when the lone node dies.
Disaster Recovery
Disaster recovery (DR) is a set of tools, procedures, and policies that let you continue operating through a major outage – widespread, long-lasting, destructive, or all three.
Availability and disaster recovery are often treated as the same thing. In fact, disaster recovery is one strategy for achieving better availability metrics, and the two concepts overlap in the goal of failover. Organizations often implement availability procedures to push toward a higher uptime percentage; DR, by contrast, is intended to handle issues once a system is already down. DR helps protect your deployment against worst-case scenarios, ranging from data corruption to all-out node failure.
Also, when you introduce disaster recovery into your deployment, you add redundancy to operations for improved reliability. Reliability underscores disaster recovery design because it suggests that your system – in theory – keeps running free of error. At a glance, better DR design relies on these factors:
- data replication
- data backups and restoration
- traffic management
Replication
Replication is the repeated creation of critical data stores. It keeps data synchronized when you access multiple server nodes, and it ensures critical data survives disasters. Replication adds another layer of reliability to your disaster recovery architecture, and it can be performed for single- and multi-node deployments alike.
Let’s say you have a single-node deployment. You’ll need to create a replica of the databases, component metadata, and custom configurations in the node, then store them inside your network-attached storage. As you build out your DR plan, schedule regular database exports and synchronize copies to the storage volumes inside the network-attached media. See an example in Model 1, below.
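As a sketch of that schedule, the script below mirrors the export task’s output directory to a NAS mount and shows a cron entry to run it nightly. All paths, and the script name in the cron line, are placeholders; the demo setup lines exist only so the sketch runs end to end.

```shell
#!/bin/sh
# Hypothetical paths -- point these at your real sonatype-work directory
# and NAS mount before use.
WORK_DIR="${WORK_DIR:-/tmp/sonatype-work-demo}"   # e.g. /opt/sonatype-work/nexus3
NAS_DIR="${NAS_DIR:-/tmp/nas-demo}"               # e.g. /mnt/nexus-backup

# Demo setup so the sketch runs end to end; remove in real use.
mkdir -p "$WORK_DIR/backup" "$NAS_DIR"
echo placeholder > "$WORK_DIR/backup/nexus-demo.bak"

# Mirror the in-app export task's output to the NAS. --delete keeps the
# NAS copy in lockstep with the export directory.
if command -v rsync >/dev/null 2>&1; then
  rsync -a --delete "$WORK_DIR/backup/" "$NAS_DIR/db-exports/"
else
  mkdir -p "$NAS_DIR/db-exports" && cp -R "$WORK_DIR/backup/." "$NAS_DIR/db-exports/"
fi

# Example crontab entry: sync nightly at 02:30, after the export task runs.
# 30 2 * * * /usr/local/bin/sync-db-exports.sh
```

Schedule the sync to fire shortly after the in-app export task so the NAS always holds the most recent snapshot.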
Replication is also practiced in multi-node deployments. For example, you’ll stand up one instance of Nexus Repository for production, then create a duplicate instance in a disaster recovery site. The latter site provides continuity amid disasters that put your main site’s operation or physical existence at stake. To be successful with replication, your DR strategy requires constant testing by your teams. This ensures you can fail over to your alternate mission-critical data, as described in Model 1, Model 2, and Model 3, below.
In “The Production Environment at Google, from the Viewpoint of an SRE,” a chapter of Site Reliability Engineering: How Google Runs Production Systems, the writers suggest that storage “is responsible for offering users easy and reliable access to [external disks available for single- or multi-node servers].” If you’re working in a simpler environment – one node, for example – you could configure a local sandbox for testing. There you can experiment with scheduled tasks and build separate storage volumes to synchronize database exports and other mission-critical data.
To ensure the survival of Nexus local file system data, copy it to a storage mechanism such as network-attached storage (NAS). You’ll benefit from improved uptime if you choose this type of device, because NAS boxes use a redundancy feature called Redundant Array of Independent Disks (RAID). RAID guards against individual drive failure by mirroring or striping your data, with parity, across several drives – though it complements rather than replaces a standard backup. This raises the bar on availability and reliability in your deployment.
If you plan to install a shared drive for multi-node redundancy, provision the first node to connect to a local storage device. There, you’ll host all application data, configurations, security data, and database exports. Keep your blob stores on an external drive, however, to ensure failover and prevent the loss of mission-critical binaries. If a node fails, the data on the external drive remains intact, provided it’s synchronized regularly.
Backups and Restoration
To help maintain the integrity of your systems, you can configure your architecture to schedule exports or use third-party tooling to transfer and back up files from one location to another.
Performance in Nexus Repository relies on the speed of input/output (I/O) requests. To ensure quick response times, we recommend placing your local Nexus installation on either a storage-area network (SAN) or a solid-state drive (SSD); both tend to be high-speed options, even if you decide to connect another node in the future. If storage space or cost is a constraint, you can migrate (or replicate) your environment to a NAS instead. If you choose this option, use NFSv4 or later.
Adding storage media to back up mission-critical content enhances the design of Nexus Repository deployments. Nexus Repository provides a scheduled task to create database snapshots and relocate them to a target disk. All remaining directories in your local instance (or instances) should also be copied and rebuilt on a backup disk. In short, adding network-attached media to your deployments is an integral part of a disaster recovery plan that keeps repository contents safe from loss.
Backing up your blob store happens outside of Nexus Repository. If you’re using a file-based blob store, use backup media to host the mission-critical data; alternatively, S3 is a viable option if your organization’s infrastructure is based in the AWS cloud. Either way, reliable backup media is required. The process itself can be performed in either of two ways:
- Keep your instance or instances running and configure the scheduled task to automate backups.
- Shut down Nexus Repository, copy single or multiple instances to a new location, then restart the application.
If you decide to use the task, it exports component metadata, general administrative configurations, and security access content for users, and it places the DB backup files inside the sonatype-work directory. However, if you’re required to shut down your repository instances, choose a replication tool to transfer files elsewhere. The disadvantage of this method is longer downtime during backups.
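The second option – the cold backup – can be sketched as below. The service name and paths are assumptions (a typical systemd install is shown in comments), and the demo setup lines exist only so the sketch runs without a live instance.

```shell
#!/bin/sh
# Cold-backup sketch. NEXUS_SERVICE, WORK_DIR, and DEST_DIR are
# placeholders -- adjust them for your installation.
NEXUS_SERVICE="nexus"                      # hypothetical systemd unit name
WORK_DIR="${WORK_DIR:-/tmp/work-demo}"     # e.g. /opt/sonatype-work/nexus3
DEST_DIR="${DEST_DIR:-/tmp/coldbackup-demo}"
STAMP=$(date +%Y%m%d)

mkdir -p "$WORK_DIR" "$DEST_DIR"           # demo setup; remove in real use
echo demo > "$WORK_DIR/config.properties"

# 1. Stop the instance so files are quiescent (commented for the demo):
# systemctl stop "$NEXUS_SERVICE"

# 2. Copy the working directory to the backup location.
cp -R "$WORK_DIR" "$DEST_DIR/work-$STAMP"

# 3. Restart the application:
# systemctl start "$NEXUS_SERVICE"
echo "backup at $DEST_DIR/work-$STAMP"
```

Because the instance is down for the duration of the copy, schedule this during your quietest window.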
NOTE: Bear in mind that the export simply writes the backup files to the path configured in the task. Make sure the file system and blob store are also copied to a more resilient disaster recovery site.
Naturally, you want a deployment well-equipped with data protection, so you’ll need to find the right cadence for loss prevention. For example, if you’re working with a two-node, active-passive deployment, the standby node, replicated from the node in production, acts as a backup. You’ll need to run syncs regularly, especially when data at the production site is modified. Optionally, for additional resilience, you could install a backup disk at your local production site to protect against site failure.
If you prefer your deployments to remain operational, perform the first option above while the system is running. Be advised that the task in this method blocks all write functions, preventing your team and/or CI servers from uploading or modifying contents in your repositories. As a general note, we recommend you test your backups often – especially on running instances – to keep an eye out for synchronization issues.
If you’re comfortable with command-line operations, you can use the read-only REST API available from the Administration menu. Use the freeze endpoint so you can perform file transfers and back up your data, then use the release endpoint to resume operations.
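A minimal freeze-and-release sketch with curl is shown below. The base URL and credentials are placeholders, and the endpoint paths follow Nexus Repository 3’s read-only REST API – verify them against your version’s API reference before relying on them.

```shell
#!/bin/sh
# Put Nexus into read-only mode, run file transfers, then release.
BASE_URL="${BASE_URL:-http://localhost:8081}"   # placeholder host/port
AUTH="admin:admin123"                           # placeholder credentials

freeze()  { curl -sf -u "$AUTH" -X POST "$BASE_URL/service/rest/v1/read-only/freeze"; }
release() { curl -sf -u "$AUTH" -X POST "$BASE_URL/service/rest/v1/read-only/release"; }

if freeze; then
  # ... run your rsync/copy jobs here while writes are blocked ...
  release
  echo "backup window closed"
else
  echo "could not reach $BASE_URL; skipping"
fi
```

Wrapping the transfer between freeze and release keeps the copied files consistent without a full shutdown.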
As mentioned in Replication, we encourage you to test failover processes often. This ensures the restoration steps run effectively and that you can validate the integrity of your data. Additionally, frequent tests ensure your team can perform failover-and-restore activities in a controlled environment, which will increase the likelihood that the restorations will be successful in true failure scenarios.
To restore your data, follow the step-by-step restore procedures documented in our help docs, then reconcile any discrepancies between the database and blob store. If you restore content from a disaster recovery site, additional steps are needed; consider the procedures in the chapter Restore Exported Databases.
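A rough shape of that restore flow is sketched below. It assumes the OrientDB-era convention of a restore-from-backup directory inside sonatype-work; your version’s help docs are authoritative, and all paths here are placeholders with demo setup lines so the sketch runs on its own.

```shell
#!/bin/sh
# Restore sketch -- WORK_DIR and BACKUP_DIR are placeholders.
WORK_DIR="${WORK_DIR:-/tmp/restore-work-demo}"     # e.g. /opt/sonatype-work/nexus3
BACKUP_DIR="${BACKUP_DIR:-/tmp/nas-demo/db-exports}"

mkdir -p "$BACKUP_DIR" "$WORK_DIR/restore-from-backup"   # demo setup
touch "$BACKUP_DIR/component-demo.bak"

# 1. Stop the instance (commented for the demo):
# systemctl stop nexus

# 2. Place the exported .bak files where Nexus looks for them on startup.
cp "$BACKUP_DIR"/*.bak "$WORK_DIR/restore-from-backup/"

# 3. Restart; the databases are imported on startup. Afterwards, run the
#    reconcile task from the UI to resolve database/blob store drift:
# systemctl start nexus
ls "$WORK_DIR/restore-from-backup"
```

Rehearse this in a sandbox first so the team knows the sequence before a real failure.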
To finalize the restoration, you’ll need to reconfigure traffic routing in your load balancer so end users’ requests are directed to the alternate site. You’ll also need to reverse the direction of any transfers, copies, and database exports. This ensures that all critical data at your alternate site gets properly migrated to your network-attached backup media.
Traffic Management
A larger deployment means more throughput – more requests passing through your systems. So, in a more mature, multi-node scenario, you’ll need to configure a load balancer to manage traffic. This helps your deployment handle bulk requests and potential latency issues while ensuring fault tolerance.
More to the point, the load balancer distributes requests among healthy nodes in your local and/or external data centers. Our help docs show you how to configure and test tools such as Apache HTTP Server, Nginx, or AWS ELB to handle requests among multiple nodes.
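The health probe a load balancer (or a watchdog script) runs against each node can be sketched like this. The node hostnames are placeholders, and the status paths follow Nexus Repository 3’s status API – check them against your version.

```shell
#!/bin/sh
# Probe each node's status endpoint; a 200 means the node can serve reads,
# and /status/writable additionally confirms writes are possible.
NODES="${NODES:-http://nexus-a:8081 http://nexus-b:8081}"   # placeholder hosts

probed=0
for node in $NODES; do
  if curl -sf -o /dev/null --max-time 5 "$node/service/rest/v1/status/writable"; then
    echo "$node healthy"
  else
    echo "$node unhealthy"
  fi
  probed=$((probed+1))
done
echo "$probed nodes probed"
```

Most load balancers can be pointed at the same endpoint directly as their health-check URL, removing unhealthy nodes from rotation automatically.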
Monitoring Node Health
Part of your disaster recovery plan should include performance monitoring. Whether you’re tracking metrics for worst-case failure scenarios or general node activity, Nexus Repository exposes metrics from its Java Virtual Machine to help you visualize node health. In the UI, you’ll get a cumulative look at memory use, thread lifecycle, requests, and responses. You can also parse and itemize data from each individual node using REST APIs.
Nexus Repository’s REST APIs can be used to track the status of nodes, including:
- blob stores, to oversee disk management quotas and capacity – e.g. space used or remaining
- maintenance, to inspect and manipulate the state of a node. This is often used for troubleshooting and recovery operations
- ping, to test the reachability of a host on an Internet Protocol (IP) network
- status, to determine if your instances can validate read-only and read-write states
- transactions, to inspect database transaction timeouts if such an event occurs
- threads, to troubleshoot conflicts involving access to shared resources such as random access memory, disk storage, cache memory, internal communication within your network (buses), or external network devices
To access remote Nexus instances, you can configure them to allow Java Management Extensions (JMX) connections. This may be useful for deeper research on thread activity, memory consumption, CPU cycles, I/O use, and other performance-related data needed to maintain your fault-tolerant environment. You can refer to this support article for enabling JMX, if desired, and review examples of how to retrieve server health metrics from these endpoints.
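As an illustration, JMX can be enabled by adding the standard com.sun.management.jmxremote JVM flags to a node’s vmoptions file. The file path and port below are assumptions, and the unauthenticated settings are for a lab only – secure the port before exposing it.

```shell
#!/bin/sh
# Append standard JMX flags to the vmoptions file. The default path here is
# a demo placeholder; a real install typically uses <install-dir>/bin/nexus.vmoptions.
VMOPTIONS="${VMOPTIONS:-/tmp/nexus.vmoptions}"

cat >> "$VMOPTIONS" <<'EOF'
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=1099
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
EOF

# Restart the node, then attach a client, e.g.:
# jconsole <host>:1099
grep -c jmxremote "$VMOPTIONS"
```

After restarting, tools such as jconsole or jcmd can attach to the port to inspect threads, memory, and CPU use.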
Growth Models for Reliability
We encourage you to roll out your site architecture step-by-step – or, more explicitly, node-by-node. You can configure deployments starting with one node, then add instances as your needs for capacity, performance, and resiliency grow, as shown in this diagram:
Figure 1: A Sample Network of Multiple Nodes
As mentioned earlier, there’s no “one-size-fits-all” solution. The diagram offers a sample of where to get started and how your deployment can mature over time, based on your organization’s business outcomes. As you review each model breakdown, below, think about what equipment, services, routines, and solutions are needed to maintain business operations. All examples are environment-agnostic, so you can set up your deployments either on-prem or in the cloud.
Model 1: Single Node with Backup
Figure 2: Single Instance with Backup
When you provision your first installation of Nexus Repository (single node), consider these recommendations for your deployment:
1. Deploy a single Nexus instance to a storage-area network (SAN) or network SSD.
2. Add storage media for backup, such as a fast-performing NAS, in the same region as your Nexus server.
3. Create a “shared” blob store on your NAS if you plan to grow your team beyond single-node capacity.
4. Depending on your operating system, use the appropriate tools to synchronize the remainder of the data directory and custom configurations in your installation directory to the NAS.
5. Regularly use the in-app export task – mentioned in Backups and Restoration – to back up your database to the network-attached media.
If your teams grow to a size greater than this model, consider installing a second server as a passive, synchronized node in your deployment.
Model 2: Dual Node Active-Passive
NOTE: This scenario shows two nodes in separate data centers.
When your single-node deployment needs to serve more users, you’ll need to shift its design closer to better uptime. If this is the case, consider adding a second, standby node in a separate data center (or region). As in any disaster-proof scenario, you’ll need to make sure all mission-critical data in the standby node is kept up-to-date with the active node.
Figure 3: One Instance in Active Mode, One Offsite in Standby Mode
For this type of deployment, consider these recommendations:
- Configure the server ports of each node on a load balancer, to handle requests and port forwarding.
- Use a sync tool to transfer modified repository contents from the local, active instance to the replicated, passive instance.
- If you have a NAS in place, configure the passive instance to share the same blob store path, as in the first model above.
- Follow steps 4 and 5 from Model 1 on a regular cadence.
The first instance (Node 1) is the primary production center where the business normally operates. If some sort of calamity occurs, the duplicate instance (Node 2) becomes your disaster recovery site. A load balancer server is also installed and connected to handle requests between Node 1 and Node 2. If Node 1 fails, the load balancer re-routes traffic to Node 2.
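The active-to-passive sync in this model can be sketched as below. Host names, user, and paths are placeholders; in the demo, the “remote” is a local directory so the script runs without a second machine, while the commented line shows the real over-the-network form.

```shell
#!/bin/sh
# Sync the active node's working data to the passive node.
SRC="${SRC:-/tmp/node1-work-demo}"     # active node's sonatype-work (placeholder)
DEST="${DEST:-/tmp/node2-work-demo}"   # passive node; normally user@host:/path

mkdir -p "$SRC" "$DEST" && echo data > "$SRC/file.db"   # demo setup

# Over a real network you would sync via SSH, e.g.:
#   rsync -az --delete /opt/sonatype-work/nexus3/ nexus@dr-node:/opt/sonatype-work/nexus3/
if command -v rsync >/dev/null 2>&1; then
  rsync -a --delete "$SRC/" "$DEST/"
else
  cp -R "$SRC/." "$DEST/"              # fallback if rsync is absent
fi
echo "passive node synced"
```

Run the sync on a cadence matched to how often production data changes, so the standby node stays close to the active one.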
Model 3: The Star Pattern
If you find that your teams are getting even larger, or perhaps geographically distributed, consider the star-pattern design for your growing deployment architecture. The star pattern connects teams to a central “hub,” where they read contents that pass through a master proxy server. Along with the master’s backup instance, each server connected to the hub is a carbon copy; the copies represent the “points” of the star.
Figure 4: The Star Pattern
The scenario above depicts a higher-performance network built from these components:
- a backup instance, copied from the master
- two proxies in two separate regions, in read-only mode and on standby
- load balancers installed in each region, as well as next to the dispersed, read-only proxies
- a global load balancer that sits in front of all servers, routing client requests to all servers capable of fulfilling requests from users
Without additional replicas of the master, it alone would be the single point of failure. Fortunately, with all synchronized instances on standby across regions, this type of deployment will ensure a greater degree of availability.
If you have further questions, we have you covered. Check out: