Dokuwiki Load Balancing

Load balancing is usually something adopted by services of a certain size to serve more requests. This wiki, which I have been using for a long time, has grown much larger than I initially expected, and as I have become increasingly dependent on it, I have started adding requirements of my own. For example, the chance of losing data had to be reduced, and the service had to stay available. The chance of data loss is reduced by keeping the data on a disk that can be detached from the instance and by taking daily snapshots of the entire instance, including this disk. The requirement that the service always stay available, however, was not easy to meet. Past failures taught me that Lightsail instances can fail more often than I thought, and I also tend to experiment on the wiki and the web server myself and break the service. More than once I broke the wiki with my own tinkering and was frustrated that I had no way to use it until it was fixed.

So I came to the conclusion that to meet the second requirement, that the service should always stay available, I had to add servers and configure load balancing. I wrote about adding servers and setting up load balancing1) after the last failure, but I have never written about the Dokuwiki load balancing itself. No matter how much I googled this setup, it was hard to find an article by someone who had tried the same thing, so I decided to write one that covers the topic directly. This article is about running Dokuwiki behind load balancing.

Background

Dokuwiki works on a file system basis. This is both an advantage and a disadvantage. It is simple to install and use, but it is platform-dependent, and it makes it hard to pick a proven scaling method, such as database replication, when you try to scale through load balancing. There is a document on the official Dokuwiki website that addresses scalability given Dokuwiki's file-system-based operation2), but it does not actually explain how to grow Dokuwiki into a larger service. It mostly lists how large some Dokuwiki installations in the world are and shows that the software can handle that many documents; it does not explain how to keep the wiki running when load increases or a failure occurs.

Scale

This wiki is divided into public and private parts. Dokuwiki provides a feature called a farm, which runs multiple wikis from a single set of scripts with different data directories, but I am not using it. Instead, some namespaces are public and some are private, separated by the ACLs that Dokuwiki itself supports. To raise the level of security, access to the private part additionally goes through Cloudflare Access, which requires extra authentication and a specific VPN to log in3). None of this is required for the public part. Excluding cache files, the wiki holds about 20 gigabytes of text and images on a 32-gigabyte disk. There are 10 or fewer registered users, and the average number of edits per day is about 50.
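As an illustration of this split, Dokuwiki's ACL file could contain lines roughly like the ones below. This is only a sketch; the namespace name and permission levels are placeholders, not my actual configuration.

    # conf/acl.auth.php
    *            @ALL    1    # public part: anonymous users can read
    private:*    @ALL    0    # private part: no access without logging in
    private:*    @user   8    # logged-in users can read, edit, and upload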

Subjects

Load balancing

If the wiki were simply getting bigger or gaining more users, I could fix that by moving to a larger Lightsail bundle. I am using the 2GB memory bundle, and so far it has not felt lacking for the features I need from Dokuwiki. In particular, Cloudflare sits in front of the web server and handles requests that the web server does not need to receive directly4), so I do not feel the need for a bigger machine. As I said above, what I wanted was to reduce service outages and to keep using the wiki during an outage, including one caused by me, and I came to the conclusion that this requirement called for more instances rather than a larger one.

Among the load balancing services that are easy to find right now are the ones provided by Lightsail and Cloudflare. With Lightsail, you can build load balancing at a fixed cost of $18 per month regardless of the number of servers, requests, or traffic5). Cloudflare lets you start load balancing with two servers for $5 per month, but the cost grows as you add servers or requests6). Because my scale is small, Cloudflare's pricing was cheaper, so I chose it. If the wiki grows someday, there is room to migrate to the flat-rate load balancing offered by Lightsail.

I do not use the Cloudflare API, so the load balancer needs as many IP addresses as there are servers in order to work without setting DNS separately on the Lightsail side. I am currently building load balancing with two Lightsail instances; each has its own IP address and is reachable through Cloudflare's DNS settings. Each server is registered as an A record and can be reached separately through its own sub-domain, and load balancing is configured on top of that. The setup first creates a server pool and then adds each server to that pool. What you get for $5 per month is one pool with up to two servers in it, which covers exactly what I am doing right now. Increasing the number of pools or servers beyond that raises the cost by about $5 per step.
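Roughly, the whole arrangement looks like the sketch below. The host names and IP addresses are placeholders for illustration, not my actual values.

    ; Cloudflare DNS: one A record per origin server
    wiki1.example.com    A    203.0.113.10    ; Lightsail instance 1
    wiki2.example.com    A    203.0.113.11    ; Lightsail instance 2

    ; Cloudflare load balancer
    hostname: wiki.example.com
    pool: wiki-pool (wiki1.example.com, wiki2.example.com)
    session affinity: enabled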

Once set up, it is deployed within seconds, and when you enter the domain name, one of the two servers starts responding. With the session affinity7) setting, even with multiple servers, the server that responded first keeps responding to the same client for a certain period, which makes it easy to keep logins working and gives synchronization time to catch up.

Synchronization

Because Dokuwiki works on a file system basis, expanding it this way raises a number of concerns. When I googled, I found articles8) from people who wanted to scale Dokuwiki but were not sure what to do precisely because it is file-system based. Perhaps the experts who knew how to handle this situation had already left Dokuwiki, or had already solved it with a network file system, but such articles did not show up on Google. I was not expert enough to set up a network file system, and I am not going to leave Dokuwiki for a while, so I had to find my own way. To be precise, I had to decide how to synchronize the file system between the two servers and keep that running.

As I said in the previous article, I put unison9) into crontab and run it on a short cycle. In principle, I think synchronizing only when the file system actually changes would be the safest and least wasteful approach. But for some reason, the people with the same requirement that I found on Google were mostly using tricks in crontab10), which by itself only goes down to one-minute resolution, to run synchronization on a shorter cycle. That looked much easier to do right away, so I did the same, and there have been no incidents so far.
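For reference, the crontab trick looks roughly like the lines below. The profile name is a placeholder, and the 10-second offsets simply match my current interval; each line fires once a minute, and the staggered sleeps turn that into one run roughly every 10 seconds.

    * * * * * unison -batch wiki
    * * * * * sleep 10; unison -batch wiki
    * * * * * sleep 20; unison -batch wiki
    * * * * * sleep 30; unison -batch wiki
    * * * * * sleep 40; unison -batch wiki
    * * * * * sleep 50; unison -batch wiki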

However, it can potentially cause problems if a single sync run takes longer than the interval between runs. unison runs on only one of the two servers; the other one is only synchronized to and does not run it itself. If I need more instances in the future, I plan to add synchronization settings for the new instances on this 'synchronizing side' instance and run synchronization from there.
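A minimal sketch of what such a one-sided unison profile could look like, assuming SSH between the instances; the paths, host name, and conflict preference are assumptions for illustration, not my actual profile.

    # ~/.unison/wiki.prf on the synchronizing side
    root = /var/www/dokuwiki/data
    root = ssh://wiki2.example.com//var/www/dokuwiki/data

    batch = true          # never ask questions when run from cron
    prefer = newer        # on conflict, keep the more recently modified file
    ignore = Path cache   # skip Dokuwiki's cache directory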

Certificates

To keep the whole service secure, certificates are needed in three sections. One is between the client and Cloudflare, where I use the free certificate provided by Cloudflare. In a desktop browser you can click the padlock icon next to the address to view the certificate, and there I see a Cloudflare domain name instead of my own. I decided that does not matter; it fully achieves the original purpose of encrypting the traffic in this section.

The other two sections are between the two servers and Cloudflare. Initially I issued and used Let's Encrypt certificates, and it was possible to automate issuing and renewing a certificate for each sub-domain. Later I switched to the origin certificate provided by Cloudflare. This origin certificate is signed by Cloudflare and is valid in itself, but the chain up to a root certificate is not publicly trusted. So if you use this certificate without going through Cloudflare, the connection is marked as broken. However, since all requests always go through Cloudflare, the client never sees a broken connection, and because the certificate stays valid for a long time, it takes less administration while still encrypting the traffic between Cloudflare and the two origin servers.
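This post does not depend on a particular web server, but as an illustration, an nginx server block using a Cloudflare origin certificate could look roughly like the sketch below; the host name and file paths are assumptions, and the PHP handling for Dokuwiki is omitted.

    server {
        listen 443 ssl;
        server_name wiki1.example.com;

        # Cloudflare origin certificate: trusted by Cloudflare, not by browsers
        ssl_certificate     /etc/ssl/cloudflare/origin.pem;
        ssl_certificate_key /etc/ssl/cloudflare/origin.key;

        root  /var/www/dokuwiki;
        index doku.php;
    }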

Code distribution

Data and settings are synchronized on a short cycle, but I did not want the Dokuwiki scripts to be handled the same way. I frequently experiment with the web server and the wiki, and a failed experiment can temporarily stop the service. If the Dokuwiki scripts were also synchronized, a problem on one server could quickly spread to the other. Still, I do modify the code and needed a comfortable way to deploy it. Perhaps the experts have another way, but I deploy the code through a repository on Github11). First, I create an instance from a snapshot and modify and test the code there. If it seems fine, I push it to Github, deploy it to only one of the two servers, and leave it alone for a while. If it still looks fine, I deploy it to the other one.
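In practice, deploying amounts to little more than pulling the tested code on each server, one at a time. A minimal sketch, assuming the Dokuwiki directory is a clone of the repository; the path and branch name are placeholders.

    # on the first server only
    cd /var/www/dokuwiki
    git fetch origin
    git checkout tested-branch

    # watch the wiki for a while, then repeat the same steps on the second server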

I used this method a while ago when applying12) a new RC13) of Dokuwiki. First I created a branch, applied the new code, reviewed it, and tested it for some time on the newly created instance only. There were no big problems, and by the third RC I judged that the remaining changes were only code cleanup, so I deployed it to the two servers about a day apart. Of course, this only works because the major version update does not migrate the file system. If a future version migrates the file system, it will not be possible to deploy one server at a time in this way.

Problems

Synchronization interval

File system synchronization happens on a schedule, currently once every 10 seconds. With the initial one-minute interval, there were cases where a page written on one server had not yet been reflected on the other, so an address I shared could point to a page that was not available. For a while, whenever I shared an address, I first opened the sub-domains of all the servers to check that the post appeared on both sides. Since reducing the synchronization interval it has been working reliably enough that I no longer check every server before sharing an address.

If the number of users grows and the same page is modified on different servers within less than the synchronization interval, the changes made on the side that was modified earlier will be lost. Automatically merging edits that happen on both sides at the same time is very hard, so I simply let the later modification overwrite the earlier one. At my scale there have been no problems yet, but it is not problem-free: a modified page can potentially lose its modifications.

External edit

Dokuwiki has a feature that detects edits it does not know about through file system timestamps. If someone, or some other app, opens and edits a Dokuwiki data file directly, the timestamp Dokuwiki recorded for the latest revision no longer matches the timestamp of the file itself. The next time you try to edit such a page, Dokuwiki first records the changes it did not know about as a new revision and then opens that revision for editing. In an environment where files are synchronized, however, the side that receives the synchronized modifications always ends up with a mismatched timestamp: on one server the recorded timestamp matches the file, while on the other server the file's own timestamp points slightly into the future, leaving an 'External edit' record all the time. Since the two synchronized servers should always show the same revision history, I did not want these revisions to be registered, so I stopped calling the detectExternalEdit function14). Of course, this also means that when an external edit actually happens, that revision will not be recorded either.
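You can see this mismatch directly by comparing the page file's mtime with the last timestamp in its changelog. A rough sketch assuming the default data layout; the page name is a placeholder.

    # mtime of the page file itself
    stat -c %Y /var/www/dokuwiki/data/pages/wiki/somepage.txt

    # timestamp Dokuwiki recorded for the latest revision
    # (.changes files are tab-separated, timestamp first)
    tail -n 1 /var/www/dokuwiki/data/meta/wiki/somepage.changes | cut -f 1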

Another external edit issue is that when the latest revision of a page has arrived via synchronization from the other server, the user who modified the page is shown as 'external edit'. This is not recorded as a revision, and the next modification is recorded normally, but for the newest revision one server shows the editing user's information while the other shows 'external edit' instead. This could probably be fixed in the template output code, but I left it alone: from that server's point of view the page was changed without any known user, which is exactly what 'external edit' is supposed to mean.

Session affinity

While working through this external edit issue, I could see the session affinity feature at work. Session affinity keeps routing a user who reached one server through the load balancer to that same server. Without it, a user who just logged in on one server might be routed to the other server immediately afterwards and be asked to log in again. In connection with the external edit problem above, when I edit and save a page my user information appears in the page history, but when I come back to the page a while later it has sometimes changed to 'external edit'. That means that somewhere along the way, while refreshing and using the wiki, I was routed from one server to the other.

Conclusion

If Dokuwiki used a database, most of this load balancing story would have been about database replication. Instead of a separate tool like unison, I could have used the database's own features to build more stable load balancing without the potential problems discussed above. At my scale, however, the Dokuwiki load balancing I built works anyway, potential issues included, and it has been running without problems for a while. I can increase the number of synchronized servers, and to reduce the chance of data loss I can back up the synchronized servers as well. If one server is lost in the future, it can be recovered from the other server's data or from data in another Availability Zone. I believe I have, to a reasonable extent, achieved the two goals mentioned above.