Parallels produces a very nice clustered storage solution for its container-based web hosting environment. Having used it for a year now, I have found it to be a robust, reliable cloud hosting platform for Linux containers and Windows VMs (VPSs), and I’ve worked out some good ways of setting it up so that it purrs along nicely.
PCS uses a very well thought out clustered storage design that takes your directly attached storage and presents it as unified clustered storage, which appears as a mounted file system on every hardware node (server). PCS storage can be thought of as a bit like a single LARGE datastore in VMware ESX that grows as you license it and add more disks to it.
PCS storage is presented over the network using CIFS and appears as a FUSE mount point, just like GlusterFS. At the lower level, the file system is made up of blocks from physical disks, but these are delivered as self-replicating chunks. Each chunk has 3 replicas (you can tune this as needed, but it’s set at 3 and that’s now the industry standard). The “chunks” are managed by daemons called chunk servers. A single chunk server manages a single presented disk, and that disk looks just like any other disk at the OS level, so nothing scary there. In the cluster I am using in this example there are 34 chunk servers.
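To put the 3-replica rule into numbers, here is a rough sketch of how raw disk space translates into usable space. The 4000 GB disk size is an assumption for illustration only (the real cluster mixes disk sizes), and the sum ignores metadata and file system overhead.

```shell
# Rough usable capacity when every chunk is stored `replicas` times.
# Integer arithmetic; ignores metadata and file system overhead.
usable_gb() {
  raw_gb=$1
  replicas=$2
  echo $((raw_gb / replicas))
}

# Hypothetical example: 34 chunk servers, each on an assumed 4000 GB disk.
raw=$((34 * 4000))
echo "raw: ${raw} GB, usable at 3 replicas: $(usable_gb "$raw" 3) GB"
```

In other words, budget roughly three times the disk space you actually want to sell.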
Another interesting point is that at the container or VM level everything is thin-provisioned, so storage is provided as needed. Unused storage is scavenged and returned to the cluster.
The other really neat thing about chunk servers, apart from automatic replication, is auto-balancing within the storage tier. After a new chunk server is provisioned, the cluster rapidly moves chunks onto it until its free space is in line with the other servers (using a percentage scaling algorithm and a “cost” metric). With all disks the same, the metric levels out over time and the amount of free storage ends up about the same for every chunk server, percentage-wise.
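The sketch below is not the PCS balancing algorithm, which I have not seen. It only illustrates why balancing on percentage used, rather than absolute free space, lets disks of different sizes sit side by side. The chunk server names and sizes are made up.

```shell
# Read lines of "name total_gb used_gb" and print the chunk server with
# the lowest percentage used, i.e. the one a percentage-based balancer
# would fill next. Illustration only, not the real PCS logic.
least_full() {
  awk '
    $2 > 0 {
      pct = 100 * $3 / $2
      if (best == "" || pct < best_pct) { best = $1; best_pct = pct }
    }
    END { if (best != "") printf "%s %.1f\n", best, best_pct }
  '
}

# Hypothetical cluster: two large SATA disks and one small, new SAS disk.
least_full <<'EOF'
cs1 4000 2000
cs2 4000 2100
cs3 300 30
EOF
```

The small disk is picked even though it has far less free space in absolute terms, because it is the emptiest in percentage terms.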
You can have multiple storage tiers, but it’s easier to just present a single tier, as the chunk servers work out the “cost” of the storage: slow disks like 4TB SATA will have a higher cost than a 300GB 15K SAS disk, and an SSD will have a lower cost again. If you are a perfectionist and want to make life hard, go ahead and put all SSDs into tier 0 storage, SAS into tier 1, and so on. Over the last year I have worked out that it’s not worth the effort, and 3 rebuilds later all my disks are in 1 tier.
Setting Up Chunks
In legacy pre-cloud storage models the use of RAID was “Industry Standard”, and I used to set up Dell, IBM and EMC SANs with RAID-1 and RAID-5 LUNs daily. Most clustered cloud storage solutions now automatically implement an N=3 replication scheme to ensure that there are 3 copies of all the data in the cluster, so RAID actually adds unwanted latency to the storage equation. Originally I used to set up disks for storage in RAID-5 configurations (a failure just meant swapping a disk and letting it rebuild automatically), but over the course of the last 12 months I have come to the conclusion that this is not the way to do it for cloud storage (it’s still perfect for legacy SAN technology, so don’t get confused here!).
In Parallels PCS, disks are best provisioned un-RAIDed and presented to the OS as raw initialized disks, which you, the “system engineer”, partition and format as standard ext4. To partition the disks I use “parted”, then I mount them under /pstorage/cs0…cs1 etc.
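As a minimal sketch of that preparation step, the function below prints the commands rather than running them, since they destroy whatever is on the disk. The device name, the GPT label and the single full-size partition are my assumptions for illustration; adjust them to your hardware before running anything as root.

```shell
# Print (not run) the commands to prepare one raw disk for a chunk server.
#   $1 = whole-disk device (placeholder, e.g. /dev/sdf)
#   $2 = mount point for the new file system
prepare_disk_cmds() {
  dev=$1
  mnt=$2
  echo "parted -s $dev mklabel gpt"
  echo "parted -s -a optimal $dev mkpart primary ext4 0% 100%"
  echo "mkfs.ext4 ${dev}1"
  echo "mkdir -p $mnt"
  echo "mount ${dev}1 $mnt"
}

prepare_disk_cmds /dev/sdf /pstorage/my-cloud-cs4
```

Review the printed commands, then paste them into a root shell or pipe them to `sh` once you are sure the device name is right. You will also want an /etc/fstab entry so the mount survives a reboot.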
To provision the disk to the cluster you use the pstorage tool with the “make-cs” command, which creates a chunk server on it. After you run the command line below you end up with usable space for the cluster:
pstorage -c my-cloud make-cs -r /pstorage/my-cloud-cs4 -j /cluster/csjournal/cs4-my-cloud-journal -s 5124
Chunk servers can use journalling if it’s enabled in the creation command; that’s the “-j” option above. On the operating system SSD disks I create a 5G journal file for each chunk server. The journal files live in /cluster/csjournals, which is a logical volume created with LVM on the SSD disks. If you don’t have SSDs then journalling might not give you any benefit, and rebuilding the node with SSDs is worth considering.
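Here is a rough sketch of sizing and creating that journal volume. It assumes the “-s” value is in megabytes (5124 is roughly 5G), and the volume group name, chunk server count and 10% headroom for file system overhead are all hypothetical. As before, the commands are printed, not run.

```shell
# Print (not run) the commands to create an LVM volume for chunk server
# journals, sized for `cs_count` journals of `journal_mb` each plus 10%.
#   $1 = volume group on the SSDs (hypothetical name below)
#   $2 = number of chunk servers on this node
#   $3 = journal size per chunk server, assumed to be in MB
journal_lv_cmds() {
  vg=$1
  cs_count=$2
  journal_mb=$3
  total_mb=$((cs_count * journal_mb * 110 / 100))
  echo "lvcreate -L ${total_mb}M -n csjournals $vg"
  echo "mkfs.ext4 /dev/$vg/csjournals"
  echo "mkdir -p /cluster/csjournals"
  echo "mount /dev/$vg/csjournals /cluster/csjournals"
}

journal_lv_cmds vg_ssd 4 5124
```

Sizing the volume up front matters because every extra chunk server on the node needs its own journal file on the same SSD volume.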
PCS storage is presented via the network to most nodes. I assume that if a node has local access to the required storage it will fetch it via the OS directly, but I have not verified this. With a 10 node cluster and 2x 1G Ethernet set up as an LACP trunk back to a common Gigabit switch, network usage is low with 1000 containers. If the load goes up and the network begins to hit 50% utilization then I will look at 10G Ethernet to each host, but that looks like a long way off yet.
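The 50% rule of thumb is easy to check by hand. The sketch below turns a byte-counter delta into a utilization percentage. On Linux the counters can be read from /sys/class/net/<interface>/statistics/rx_bytes and tx_bytes; the sample figures here are invented.

```shell
# Percentage of link capacity used, from a byte-counter delta.
#   $1 = bytes transferred during the sample
#   $2 = sample length in seconds
#   $3 = link speed in Mbit/s (2000 for a 2x 1G trunk)
link_util_pct() {
  echo $(( $1 * 8 * 100 / ($2 * $3 * 1000000) ))
}

# Hypothetical sample: 7.5 GB moved in 60 seconds over the 2x 1G trunk.
pct=$(link_util_pct 7500000000 60 2000)
if [ "$pct" -ge 50 ]; then
  echo "utilization ${pct}%: time to look at 10G"
else
  echo "utilization ${pct}%: fine for now"
fi
```

Sample over a busy period rather than a quiet one, otherwise the number will flatter the network.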
“pstorage” is a CLI tool supplied by Parallels that presents volumes of info on what’s going on in the cluster. The example below shows tiers which I discarded a while ago.
The image below shows a single-tier PCS cluster with two chunk servers and 1 metadata server. Ideally a minimal cluster is 3 physical servers, 3 chunk servers and 3 metadata servers. You can narrow down on different subsystems using key toggles like “c” for just chunk servers, “v” for verbose etc.
What to do when a Chunk Server Goes Bad
So if the disks aren’t RAIDed, what happens when a disk fails?
Well, that’s easy: the chunk server dies and the data on it is lost! Yep, gone. But don’t worry, you have 2 other copies, and the cluster will start to replicate a third copy immediately, so there is NO DATA LOSS if more disks fail. Also, a server failure means that only a single copy of any chunk is lost, because no two replicas of the same chunk are stored on the same server (so a minimal config is ALWAYS 3 servers).
You will have to do a few things to get back a healthy cluster:
- Remove the chunk server using the --force option.
- Identify the physical disk at the OS level (like /dev/sdf1)
- Replace the physical disk (always keep spares!).
- Initialize the disk via the BIOS web interface (Dell has OMSA web Interface and it works GREAT).
- Partition the newly presented disk (it should present back as the same disk, like /dev/sdf).
- Format the disk (I use mkfs.ext4 /dev/sdf1)
- Build a new Chunk server (see command earlier)
Once the new chunk server comes online the cluster will automatically rebalance and everything just keeps working.
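The recovery steps above can be sketched as a dry-run script that prints each command in order. The “rm-cs” subcommand name and the position of --force are my assumption, so check the pstorage documentation before relying on them. The chunk server ID 1025 and the device name are placeholders, and the make-cs line is the one used earlier.

```shell
# Print (not run) the recovery sequence for one failed chunk server.
# All arguments are placeholders; the rm-cs syntax is an assumption.
replace_cs_cmds() {
  cluster=$1   # cluster name, e.g. my-cloud
  cs_id=$2     # ID of the failed chunk server (hypothetical value below)
  dev=$3       # replacement disk, e.g. /dev/sdf
  repo=$4      # mount point used as the chunk server repository
  journal=$5   # journal file on the SSD volume
  echo "pstorage -c $cluster rm-cs --force $cs_id"
  echo "# replace the physical disk and initialize it in the controller, then:"
  echo "parted -s $dev mklabel gpt"
  echo "parted -s -a optimal $dev mkpart primary ext4 0% 100%"
  echo "mkfs.ext4 ${dev}1"
  echo "mount ${dev}1 $repo"
  echo "pstorage -c $cluster make-cs -r $repo -j $journal -s 5124"
}

replace_cs_cmds my-cloud 1025 /dev/sdf /pstorage/my-cloud-cs4 /cluster/csjournal/cs4-my-cloud-journal
```

Printing first is deliberate: a typo in the device name at this point would wipe a healthy disk, and with one replica already gone that is not a risk worth taking.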
Summary of Key Points
- Use SSDs at least for your OS, and put /cluster/csjournals and /cluster/readcache on them.
- Do not RAID your disks; this is cloud storage now, not SAN LUNs presented over iSCSI (PCS does support this also 😉).
- Buy servers with lots of local directly attached SAS disks.
- Trunk your storage network. LACP works well in PCS Cloud Linux (even though it’s CloudLinux, it uses CentOS commands to set it up).
- Server manufacturers are releasing more server offerings with LOTS of DAS disk storage for this very purpose. Look seriously at upgrading them over time.
- SANs are now obsolete in a web hosting environment.