Mini-HOWTO: Multi Disk System Tuning

Version 0.6
Date    960806
By      Stein Gjoen

This document was written for two reasons, mainly because I got hold
of 3 old SCSI disks to set up my Linux system on and I was pondering
how best to utilise the inherent possibilities of parallelizing in a
SCSI system. Secondly I hear there is a prize for people who write
docs...

This is intended to be read in conjunction with the Linux File System
Standard (FSSTND). It does not in any way replace it but tries to
suggest where physically to place the directories detailed in the
FSSTND, in terms of drives, partitions, types, RAID, file system (fs),
physical sizes and other parameters that should be considered and
tuned in a Linux system, ranging from single home systems to large
servers on the Internet.

This is also a learning experience for myself and I hope I can start
the ball rolling with this Mini-HOWTO and that it perhaps can evolve
into a larger, more detailed and hopefully even more correct HOWTO.
Notes in square brackets indicate where I need more information.

Note that this is a guide on how to design and map logical partitions
onto multiple disks and tune for performance and reliability, NOT how
to actually partition or format the disks - yet.

This is the third update, still without much in the way of input...
Nevertheless this Mini-HOWTO seems to be growing regardless and I
expect I will have to turn it into a fully fledged HOWTO one of these
days, I just need to learn the format.

Hot news: I have been asked to add information on physical storage
media as well as partitioning and make it all into a full sized
HOWTO. This will take a bit of time to complete, which means that
other than bug fixes this Mini-HOWTO will not be updated much until
the new HOWTO is ready. In the meantime I will of course still be
interested in feedback.

More news: there has been a fair bit of interest in new kinds of file
systems in the comp.os.linux newsgroups, in particular logging,
journaling and inherited file systems. Watch out for updates.

The latest version number of this document can be gleaned from my
plan entry if you do "finger sgjoen@nox.nyx.net".

In this version I have the pleasure of acknowledging even more people
who have contributed in one way or another:

  ronnej@ucs.orst.edu
  cm@kukuruz.ping.at
  armbru@pond.sub.org
  nakano@apm.seikei.ac.jp  (who is also doing the Japanese translation)
  R.P.Blake@open.ac.uk
  neuffer@goofy.zdv.Uni-Mainz.de
  sjmudd@phoenix.ea4els.ampr.org

Not many still, so please read through this document, make a
contribution and join the elite. If I have forgotten anyone, please
let me know.

So let's cut to the chase where swap and /tmp are racing along the
hard drive...

---------------------------------------------------------------

1. Considerations

The starting point in this will be to consider where you are and what
you want to do. The typical home system starts out with existing
hardware and the newly converted Linux user will want to get the most
out of that hardware. Someone setting up a new system for a specific
purpose (such as an Internet provider) will instead have to consider
what the goal is and buy accordingly. Being ambitious, I will try to
cover the entire range.

Various purposes will also have different requirements regarding file
system placement on the drives; a large multiuser machine, for
example, would probably be best off with the /home directory on a
separate disk.
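To make the /home example concrete, here is a minimal sketch of what
a few /etc/fstab lines could look like on such a machine. The device
names, partitions and choice of directories are only illustrations,
not a recommendation:

  # device        mount point       type   options    dump pass
  /dev/sda1       /                 ext2   defaults   1    1
  /dev/sda2       none              swap   sw         0    0
  /dev/sdb1       /home             ext2   defaults   1    2
  /dev/sdc1       /var/spool/news   ext2   defaults   1    2

Exactly which directories deserve their own drive or partition is
what the rest of this document tries to help you decide.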
In general, for performance it is advantageous to split most things
over as many disks as possible, but there is a limit to the number of
devices that can live on a SCSI bus, and cost is naturally also a
factor. Equally important, file system maintenance becomes more
complicated as the number of partitions and physical drives
increases.

1.1 File system features

The various parts of the FSSTND have different requirements regarding
speed, reliability and size; for instance losing root is a pain but
can easily be recovered, while losing /var/spool/mail is a rather
different issue. Here is a quick summary of some essential parts and
their properties and requirements. [This REALLY needs some beefing
up.]

1.1.1 Swap

Speed: Maximum! Though if you rely too much on swap you should
consider buying some more RAM.

Size: Similar to the amount of RAM. Quick and dirty algorithm, just
as for tea: 16M for the machine and 2M for each user. The smallest
kernels run in 1M but that is tight; use 4M for general work and
light applications, 8M for X11 or GCC, or 16M to be comfortable.
[The author is known to brew a rather powerful cuppa tea...] Some
suggest that swap space should be 1-2 times the size of the RAM,
pointing out that the locality of the programs determines how
effective your added swap space is. Note that using the same
algorithm as for 4BSD is slightly incorrect as Linux does not
allocate space for pages in core. [More on this is coming soon.]

Reliability: Medium. When it fails you know it pretty quickly and the
failure will cost you some lost work. You save often, don't you?

Note 1: Linux offers the possibility of interleaved swapping across
multiple devices, a feature that can gain you much; see "man 8
swapon" for more details (a sketch follows below). However, software
RAID across multiple devices adds more overhead than you gain.

Note 2: Some people use a RAM disk for swapping or for other file
systems. However, unless you have some very unusual requirements or
setups you are unlikely to gain much from this, as it cuts into the
memory available for caching and buffering.
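To illustrate Note 1, here is a rough sketch of how interleaved swap
could be set up across two disks, assuming a version of swapon that
supports priorities (check "man 8 swapon" on your system); the device
names are only examples:

  # in /etc/fstab, give both swap partitions the same priority so the
  # kernel spreads pages across them:
  /dev/sda2   none   swap   sw,pri=1   0 0
  /dev/sdb2   none   swap   sw,pri=1   0 0

  # or by hand:
  swapon -p 1 /dev/sda2
  swapon -p 1 /dev/sdb2

With equal priorities the kernel allocates swap pages from the
devices in round robin fashion, which is what gives the interleaving
effect.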
1.1.2 /tmp and /var/tmp

Speed: Very high. Putting these on a separate disk or partition will
also reduce fragmentation generally, though ext2fs handles
fragmentation rather well.

Size: Hard to tell. Small systems run happily with just a few MB, but
these directories are notorious hiding places for stashing files away
from prying eyes and quota enforcement, and can grow without control
on larger machines. Suggested: small machines 8M, large machines up
to 500M (the machine used by the author at work has 1100 users and a
300M /tmp directory).

Reliability: Low. Often programs will warn or fail gracefully when
these areas fail or fill up. Random file errors will of course be
more serious, no matter what file area they hit.

(* That was 50 lines, I am home and dry! *)

1.1.3 Spool areas (/var/spool/news, /var/spool/mail)

Speed: High, especially on large news servers. News transfer and
expiring are disk intensive and will benefit from fast drives. Print
spools: low. Consider RAID0 for news.

Size: For news/mail servers: whatever you can afford. For single user
systems a few MB will be sufficient if you read continuously; joining
a list server and then taking a holiday is, on the other hand, not a
good idea. (Again, the machine I use at work has 100M reserved for
the entire /var/spool.)

Reliability: Mail: very high, news: medium, print spool: low. If your
mail is very important (isn't it always?) consider RAID for
reliability. [Is mail spool failure frequent? I have never
experienced it but there are people catering to this market of
reliability...]

Note: Some of the news documentation suggests putting all the
.overview files on a drive separate from the news files; check the
various news FAQs for more information.

1.1.4 Home directories (/home)

Speed: Medium. Although many programs use /tmp for temporary storage,
others, such as some newsreaders, frequently update files in the home
directory, which can be noticeable on large multiuser systems. For
small systems this is not a critical issue.

Size: Tricky! On some systems people pay for storage so this is
usually a question of finance. Large systems such as nyx.net (a free
Internet service with mail, news and WWW services) run successfully
with a suggested limit of 100K per user and 300K as a maximum. If,
however, you are writing books or doing design work the requirements
balloon quickly.

Reliability: Variable. Losing /home on a single user machine is
annoying, but when 2000 users call to tell you their home directories
are gone it is more than just annoying. For some, their livelihood
relies on what is here. You do regular backups, of course?

Note: You might consider RAID for either speed or reliability. If you
want extremely high speed and reliability you might be looking at
other OSes and platforms anyway (fault tolerance etc.).

1.1.5 Main binaries (/usr/bin and /usr/local/bin)

Speed: Low. Often the data is bigger than the programs, which are
demand loaded anyway, so this is not speed critical. Witness the
success of live file systems on CD-ROM.

Size: The sky is the limit, but 200M should give you most of what you
want for a comprehensive system. (The machine I use has about 800M
here, including the libraries.)

Reliability: Low. This is usually mounted under root where all the
essentials are collected. Nevertheless, losing all the binaries is a
pain...

1.1.6 Libraries (/usr/lib and /usr/local/lib)

Speed: Medium. These are large chunks of data loaded often, ranging
from object files to fonts, all susceptible to bloating. Often they
are also loaded in their entirety, so speed is of some use here.

Size: Variable. This is, for instance, where word processors store
their immense font files. The few who have given me feedback on this
report about 70M in their various lib directories. Some of the
largest disk hogs are GCC, Emacs, TeX/LaTeX, X11 and Perl.

Reliability: Low. See point 1.1.5.

1.1.7 Root

Speed: Quite low: only the bare minimum is here, much of which is
only run at startup time.

Size: Relatively small. However, it is a good idea to keep some
essential rescue files and utilities on the root partition, and some
keep several kernel versions there. Feedback suggests that about 20M
would be sufficient.

Reliability: High. A failure here will possibly cause a bit of grief
and it can take a little time to rescue your boot partition. You do
have a rescue disk, naturally?

1.2 Explanation of terms

Naturally the faster the better, but often the happy installer of
Linux has several disks of varying speed and reliability, so even
though this document describes performance as 'fast' and 'slow' that
is only a rough guide; no finer granularity is feasible. Even so,
there are a few details that should be kept in mind:

1.2.1 Speed

This is really a rather woolly mix of several terms: CPU load,
transfer setup overhead, disk seek time and transfer rate. It is in
the very nature of tuning that there is no fixed optimum, and in most
cases price is the dictating factor.
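Of these factors the raw transfer rate is the easiest to get a rough
figure for yourself. A simple, if crude, sketch is to time a large
sequential read straight from the device (run as root on an otherwise
idle machine; the device name and sizes are only examples):

  time dd if=/dev/sda of=/dev/null bs=64k count=1024

This reads 64M from the start of the first SCSI disk; divide the
amount read by the elapsed time to get an approximate transfer rate.
The other factors, seek time and CPU load, are harder to measure
directly.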
CPU load is only significant for IDE systems, where the CPU does the
transfer itself [more details needed here!]; it is generally low for
SCSI, see the SCSI documentation for actual numbers. Disk seek time
is also small, usually in the millisecond range. This is not a
problem if you use command queueing on SCSI, where commands are
overlapped to keep the bus busy all the time. News spools are a
special case, consisting of a huge number of normally small files, so
in this case seek time can become more significant.

1.2.2 Reliability

Naturally no one wants low reliability disks, but one might be better
off regarding old disks as unreliable. Also, for RAID purposes (see
the relevant docs) it is suggested to use a mixed set of disks so
that simultaneous disk crashes become less likely.

1.3 Technologies

In order to decide how to get the most out of your devices you need
to know what technologies are available and what their implications
are. As always there can be tradeoffs with respect to speed,
reliability, power, flexibility, ease of use and complexity.

1.3.1 RAID

This is a method of increasing reliability, speed or both by using
multiple disks in parallel, thereby decreasing access time and
increasing transfer speed. A checksum or mirroring system can be used
to increase reliability. Large servers can take advantage of such a
setup, but it might be overkill for a single user system unless you
already have a large number of disks available. See other docs and
FAQs for more information.

For Linux one can set up a RAID system using either software (the md
module in the kernel) or hardware, using a Linux compatible
controller. Check the documentation for which controllers can be
used. A hardware solution is usually faster, and perhaps also safer,
but comes at a significant cost.

1.3.2 AFS, Veritas and Other Volume Management Systems

Although multiple partitions and disks have the advantage of more
space and higher speed and reliability, there is a significant snag:
if, for instance, the /tmp partition is full you are in trouble even
if the news spool is empty, as it is not easy to retransfer quotas
across disks. Volume management is a system that addresses exactly
this problem, and AFS and Veritas are two of the best known examples.
Some also offer other file systems, such as log file systems and
others optimised for reliability or speed.

Note that Veritas is not available (yet) for Linux and it is not
certain they can sell kernel modules without providing source for
their proprietary code; this is mentioned just for information on
what is out there. Still, you can check their web page
http://www.veritas.com to see how such systems function.

Derek Atkins, of MIT, ported AFS to Linux and has also set up a
mailing list for this: linux-afs@mit.edu, which is open to the
public. Requests to join the list go to linux-afs-request@mit.edu and
bug reports should go to linux-afs-bugs@mit.edu. Important: as AFS
uses encryption it is restricted software and cannot easily be
exported from the US. AFS is now sold by Transarc and they have set
up a www site. The directory structure there has been reorganised
recently, so I cannot give a more accurate URL than just
http://www.transarc.com which lands you in the root of the web site.
There you can also find much general information as well as a FAQ.

1.3.3 Linux md Kernel Patch

There is however one kernel project that attempts to do some of this,
md, which has been part of the kernel distributions since 1.3.69.
Currently providing spanning and RAID, it is still in early
development and people are reporting varying degrees of success as
well as total wipeouts. Use with caution.

1.3.4 General File System Considerations

In the Linux world ext2fs is well established as a general purpose
file system. Still, for some purposes others can be a better choice.
News spools lend themselves to a log file based system, whereas high
reliability data might need other formats. This is a hotly debated
topic and there are currently few choices available, but work is
underway. Log file systems also have the advantage of very fast file
checking; mail servers in the 100G class can suffer file checks
taking several days before becoming operational after rebooting.

[I believe someone from Yggdrasil mentioned a log file based system
once, details? And AFS is available for Linux I think, sources
anyone?]

There is room for access control lists (ACLs) and other unimplemented
features in the existing ext2fs; stay tuned for future updates. There
has been some talk about adding on-the-fly compression too.

1.3.5 Compression

Disk versus file compression is a hotly debated topic, especially
regarding the added danger of file corruption. Nevertheless there are
several options available for the adventurous administrator. These
take many forms, from kernel modules and patches to extra libraries,
but note that most suffer various forms of limitations such as being
read-only. As development takes place at breakneck speed the specs
have undoubtedly changed by the time you read this. As always: check
the latest updates yourself. Here only a few references are given:

 - DouBle features file compression with some limitations.

 - Zlibc adds transparent on-the-fly decompression of files as they
   load.

 - There are many modules available for reading compressed files or
   partitions that are native to various other operating systems,
   though currently most of these are read-only.

Also there is the user file system, which allows FTP based file
systems and some compression (arcfs), plus fast prototyping and many
other features.

Recent kernels feature the loop or loopback device which can be used
to put a complete file system within a file. There are some
possibilities for using this for making new file systems with
compression, tarring etc. Note that this device is unrelated to the
network loopback device.

1.3.6 Physical Sector Positioning

Some seek time reduction can be achieved by positioning frequently
accessed sectors in the middle of the disk, so that the average seek
distance, and therefore the seek time, is short. This can be done
either by using fdisk or cfdisk to make a partition on the middle
sectors, or by first making a file (using dd) equal to half the size
of the entire disk before creating the frequently accessed files,
after which the dummy file can be deleted. Both cases assume starting
from an empty disk.

This little trick can be used both on ordinary drives and on RAID
systems. In the latter case the calculation for centering the sectors
will be different, if it is possible at all; consult the latest RAID
manual.
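As a rough sketch of the dd variant of this trick, assuming a freshly
made and otherwise empty 1G file system mounted on /data (the mount
point and sizes are only examples):

  # fill roughly the first half of the file system with a dummy file
  dd if=/dev/zero of=/data/dummy bs=1024k count=500

  # ... now create the directories and files that will be accessed
  # frequently, so they get allocated after the dummy file ...

  # finally remove the dummy file to get the space back
  rm /data/dummy

Whether the frequently accessed files really end up in the middle
depends on how the file system allocates blocks, so treat this as an
approximation rather than a guarantee.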
2. Disk Layout

With all this in mind we are now ready to embark on the layout [and
no doubt the controversy]. I have based this on my own method,
developed when I got hold of 3 old SCSI disks and boggled over the
possibilities.

2.1 Selection

Determine your needs and set up a list of all the parts of the file
system you want on separate partitions; sort them in descending order
of speed requirement and note how much space you want to give each
partition.

If you plan to use RAID, make a note of the disks you want to use and
which partitions you want to RAID. Remember that various RAID
solutions offer different speeds and degrees of reliability.

(Just to make it simple I'll assume we have a set of identical SCSI
disks and no RAID.)

2.2 Mapping

Then we want to place the partitions onto physical disks. The point
of the following algorithm is to maximise parallelizing and bus
capacity. In this example the drives are A, B and C and the
partitions are 987654321, where 9 is the partition with the highest
speed requirement. Starting at one drive we 'meander' the partition
line back and forth over the drives in this way:

        A : 9 4 3
        B : 8 5 2
        C : 7 6 1

This makes the 'sum of speed requirements' as equal as possible
across the drives.

2.3 Optimizing

After this there are usually a few partitions that have to be
'shuffled' over the drives, either to make them fit or because of
special considerations regarding speed, reliability, special file
systems etc. Nevertheless this gives [what this author believes is] a
good starting point for the complete setup of the drives and the
partitions. In the end it is actual use that will determine the real
needs, after we have made so many assumptions. After commencing
operations one should assume that a time will come when
repartitioning will be beneficial.

2.4 Pitfalls

The dangers of splitting everything up into separate partitions are
briefly mentioned in the section about volume management. Still,
several people have asked me to emphasize this point more strongly:
when one partition fills up it cannot grow any further, no matter how
much space there is in other partitions. In particular, look out for
explosive growth in the news spool (/var/spool/news). For multi user
machines with quotas keep an eye on /tmp and /var/tmp as some people
try to hide their files there; just look out for filenames ending in
gif or jpeg...

In fact, for single physical drives this scheme offers very little
gain at all, other than making file growth monitoring easier (using
'df'), as there is no scope for parallel disk access. A freely
available volume management system would solve this, but that is
still some time in the future.

Partitions and disks are easily monitored using 'df' and this should
be done frequently, perhaps using a cron job or some other general
system management tool. [Is any such tool currently available?]
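As a minimal sketch of such a cron job, the following crontab entry
(the schedule and recipient are only examples) would mail a disk
usage report to root every morning:

  # run 'df' at 07:00 every day and mail the output to root
  0 7 * * * df | mail -s "disk usage report" root

Anything more elaborate, such as warning only when a partition passes
some threshold, is easily scripted around the df output.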
3. Further Information

There is a wealth of information one should go through when setting
up a major system, for instance for a news or general Internet
service provider. The FAQs in the following groups are useful:

News groups: comp.arch.storage, comp.sys.ibm.pc.hardware.storage,
alt.filesystems.afs, comp.periphs.scsi ...

Mailing lists: raid, scsi ... Many mailing lists are hosted at
vger.rutgers.edu, but this machine is notoriously overloaded, so try
to find a mirror. Some lists are mirrored at http://www.redhat.com.
[More references please!]

Remember that you can also use the web search engines and that some,
like Altavista, can also search Usenet news.

[much more info needed here]

4. Concluding Remarks

Disk tuning and partition decisions are difficult to make, and there
are no hard rules here. Nevertheless it is a good idea to work more
on this as the payoffs can be considerable. Maximizing usage of one
drive only while the others are idle is unlikely to be optimal; watch
the drive lights, they are not there just for decoration. For a
properly set up system the lights should look like Christmas in a
disco.

Linux offers software RAID but also supports some hardware based SCSI
RAID controllers. Check what is available. As your system and
experience evolve you are likely to repartition, and you might then
look at this document again. Additions are always welcome.

Currently the only supported hardware SCSI RAID controllers are the
SmartCache I/III/IV and SmartRAID I/III/IV controller families from
DPT. These controllers are supported by the EATA-DMA driver in the
standard kernel. The company also has an informative web page at
http://www.dpt.com which describes various general aspects of RAID
and SCSI in addition to the product related information.

[Please let me know if there are other hardware RAID controllers
available for Linux.]