Steven (unzeugmatic) wrote,

Thinking Like an Arrogant Sysadmin

This is work stuff. I'm not bothering too much with the "you might not be a sysadmin" filter. Just like at a LISA conference.

Years ago I had the rare opportunity of being present the first time somebody attempted to use the XVM volume manager (at a "partner site", so this wasn't even somebody internal) -- and thus the first time somebody was going to use the XVM documentation I was working on. This was by no means a usability test for the manual; it was more a chance for me to see how somebody might use the product (and actually the product in question was not XVM, but it required XVM to run). This was my first meeting with somebody I later worked with (at both SGI and Red Hat, interestingly enough), and while in most key ways he was an excellent co-worker, he definitely had a plow-ahead-at-all-costs approach to things. Anyway, I watched as he started to configure a system without any regard for the documentation and without asking any questions -- and I couldn't figure out what he was doing. Thinking he might know more about what he was doing than I did, I asked him, out of curiosity, why he had done what he'd done. He looked up and said, "Well, it's just like XLV, isn't it?"

No, it was nothing at all like XLV -- not in pretty much any administrative way, ranging from where the storage volume information was kept to how commands were executed to the basic notion that XVM owns and manages its own disks. The only similarity was that it was a tool for configuring logical volumes on IRIX systems, and even the names of the volume components were different. So I told him what he needed to do and he said "oh" and picked things up immediately and went on from there.

I learned something key that day: that I had to understand, as backtext, that I was not just documenting a product, but documenting it for people who came to it with an advance misunderstanding of how it worked, along with a certainty that they knew everything about it. I can't tell you, specifically, what this means (well, in this case it meant that I went out and learned a lot more about XLV, which I had not previously paid attention to), but it definitely was something to keep in mind and something I try to remember to this day.

The awful thing is that at this point I sometimes fall victim to the same thing -- knowing a little bit about something causes me to make mistakes I would never make if I thought I knew nothing about it.

What I'm doing this week is what I call a "sanity test" of a manual -- in this case the GFS manual, which has been in the field for years in one form or another. This means that I'm simply entering every example command as documented in the book, to be sure that each one does what we say it does (which, for nearly every command, involves a fair bit of setup). I'm not testing the product, I'm testing the documentation. For example, I might enter a command to set a quota for a user. If entering the command exactly as documented doesn't yield an error, I might enter a command to list the current quotas that have been set, to see if they reflect the quota I've just set. What I don't do, however, is then log in as that user and try to exceed that quota to see if the feature works. That's product testing (and presumably that was done years ago). When I do find things where the problem is with the feature and not the documentation, I file bugs and pass things along to the testers, and I do sometimes find these things even with mature products (where new things broke old things).
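The set-then-list check described above can be sketched as a small Python harness. This is only an illustration of the idea, not a tool anyone in this story actually used: the `DocExample` and `sanity_check` names are made up for the sketch, and the demo uses harmless stand-in commands (Python one-liners that write and read a temporary file) rather than real GFS commands, whose exact syntax should come from the manual being tested.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class DocExample:
    """One documented command plus a read-only command that displays its effect."""
    name: str
    command: list   # argv of the command, exactly as printed in the manual
    verify: list    # argv of a follow-up command that displays the result
    expect: str     # text the verify output should contain


def run(argv):
    """Run a command and return (exit status, combined stdout and stderr)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def sanity_check(example):
    """Return (passed, reason). This tests the documentation, not the feature."""
    status, output = run(example.command)
    if status != 0:
        return False, "documented command failed: " + output.strip()
    status, output = run(example.verify)
    if status != 0:
        return False, "verify command failed: " + output.strip()
    if example.expect not in output:
        return False, "verify output does not show " + repr(example.expect)
    return True, "ok"


# Stand-in commands (assumption: these merely imitate "set a quota" and
# "list quotas" by writing and reading a scratch file).
path = os.path.join(tempfile.mkdtemp(), "quota.txt")
good = DocExample(
    name="set a quota (stand-in)",
    command=[sys.executable, "-c",
             "import pathlib, sys; pathlib.Path(sys.argv[1]).write_text('user1 limit 10')",
             path],
    verify=[sys.executable, "-c",
            "import pathlib, sys; print(pathlib.Path(sys.argv[1]).read_text())",
            path],
    expect="user1 limit 10",
)
print(good.name, sanity_check(good))
```

Note what the harness deliberately leaves out: it never tries to exceed the quota. It only confirms that the documented command runs and that the documented way of displaying the result shows what was set.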

One thing I find through sanity testing a document is areas where I think we could clarify or provide more information -- things that do not jump out at you when you are just reading a document but which come up when you actually run the commands. For example, we document many commands that set filesystem parameters, and I think it would be nice if after each example we provided an example of a command to display the results of the previous command. If I document a command that sets a file attribute, it's nice to provide a command right in the same part of the document that will show what attributes have been set for a file. This is not as obvious as you might think. I tried lsattr (which other Linux filesystems use, including GFS2), which yielded an ioctl error -- something I don't think an admin should ever see as a result of a completely non-destructive command that's just a request for information. No, the command is not lsattr -- it's "gfs_tool stat". Intuitive, huh? (I tried to find out if we could fix lsattr, but apparently that's not maintained by my company -- ah, open source...)

The problem I had today was that, with GFS2 coming down the pike, I had learned a little bit about how GFS and GFS2 differ in how they set up their journals. In GFS2 you can add journals on the fly, just as if they were files, so if you don't specify enough journals when you create the filesystem it's fairly easy to add more. In GFS(1), you can't do this -- you have to expand the underlying logical volume first, and then the journals are added to the end of the filesystem. And that's my problem -- that I knew that journals get added to the end of the filesystem in GFS1 (that is, journals you add -- journals you define from the beginning are in the middle of the filesystem; don't ask).

So I tested what we document for the gfs_jadd command. And yeah, there's some introductory stuff about extending your underlying logical volume first, but I know all about that so I didn't read it too carefully. I extended the underlying volume, then I grew the filesystem (thinking that this would give me the space at the end of the filesystem for the new journals I was going to add). I'm so clever.

When I tried to add a journal, however, I got an error message that there was not enough space. But I had just added 12G of space to the filesystem! That's one honking big journal. So I checked the available space and saw there was tons and tons of space available for data -- more than a reasonable man would ever want or use. But there was no space available for journals.

Because you don't grow the filesystem to add journals, you just extend the underlying volume. The new journals come after the filesystem, but they are not part of the filesystem, or at least not part of the usable section of the filesystem. (The developer who told me that the journals come "at the end of the filesystem" was being a little bit loose with his explanation.) So you add the journals first. You can then grow the filesystem to fill the remaining space in the logical volume, if you feel like it.
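Why the order matters can be shown with a toy model in Python. This is a deliberately simplified sketch, not how GFS lays things out on disk: the `ToyVolume` class, its method names, and the sizes (a 20 GB volume, a 128 MB journal) are all assumptions made up for the illustration. The only rule it encodes is the one above: a new journal needs space in the logical volume that the filesystem's data area has not already claimed.

```python
class NoSpaceForJournal(Exception):
    """Raised when the volume has no unclaimed space left for a new journal."""


class ToyVolume:
    """A logical volume holding one filesystem; all sizes are in MB.

    Simplification: space is either claimed by the data area, claimed by
    journals added after creation, or unclaimed.
    """

    def __init__(self, volume_size, data_size):
        self.volume_size = volume_size
        self.data_size = data_size     # usable (data) section of the filesystem
        self.journal_size = 0          # journals added after creation

    def unclaimed(self):
        return self.volume_size - self.data_size - self.journal_size

    def extend_volume(self, amount):
        """Extend the underlying logical volume."""
        self.volume_size += amount

    def grow_filesystem(self):
        """Grow the data area to fill whatever space is still unclaimed."""
        self.data_size += self.unclaimed()

    def add_journal(self, size):
        """Add a journal, which must fit in unclaimed space."""
        if size > self.unclaimed():
            raise NoSpaceForJournal(
                "need %d MB, only %d MB unclaimed" % (size, self.unclaimed()))
        self.journal_size += size


# Wrong order: extend, grow, then try to add a journal.
wrong = ToyVolume(volume_size=20480, data_size=20480)
wrong.extend_volume(12288)
wrong.grow_filesystem()            # the data area swallows all the new space
try:
    wrong.add_journal(128)
except NoSpaceForJournal as err:
    print("wrong order:", err)

# Right order: extend, add the journal, then grow into what is left.
right = ToyVolume(volume_size=20480, data_size=20480)
right.extend_volume(12288)
right.add_journal(128)
right.grow_filesystem()
print("right order: data area is now", right.data_size, "MB")
```

In the wrong order the model ends up with plenty of data space and nothing for journals, which matches the symptom described above.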

This took a bit of time for me to work out (because I was so certain I knew what I was doing that I was looking in the wrong places for the cause of the error). And you know, nowhere in the manual does it say to grow the filesystem before adding journals -- the documentation is not wrong. But, like the good sysadmins I emulate, I don't bother to read the manual too carefully because, after all, I know all about the filesystem.

This was all quite useful to me, because now I have to figure out whether to bother noting that you don't need to do something that it wouldn't occur to you to do in the first place unless I mentioned it as something you should not do.

I get paid for this, by the way.