Steven (unzeugmatic) wrote,

The USENIX-LISA Conference Report

Since apparently it is not possible to read something on Facebook without a Facebook account, no matter how I tag it, I am putting the conference report I wrote for my department here. It's long in part because my company paid my whole way this year (which is unusual) and I want them to be sure they got their money's worth.

On November 9-12 I attended the USENIX-LISA (Large Installation System Administration) Conference in San Jose, California. This is a conference that is run by and for system administrators. It offers training programs and technical presentations and informal gatherings. It is a total immersion conference; it would be possible never to leave the conference hotel (connected to the Convention Center) for a week, and never concern yourself with anything but system administration and the people who live it. What this means, contradictorily, is that the attendees get to leave their workplaces behind for a week so that they can spend all their time on their work. Surprisingly enough, people find this inspiring, including me and (as I often note) I'm not even a system administrator.

My intention in writing this report is to try to explain why a conference for system administrators is a good conference for a technical writer to attend, and to encourage the other writers to become involved in their own local system administrator communities and organizations. I think that we are often isolated in our jobs, and don't have an easy way of keeping up with what is changing out in the data centers that use our documentation. Just hanging around with people who live in those data centers (if only virtually) pulls you into that world for a week. These are the people who make decisions about purchasing hardware and who develop passionate attachments to particular tools. Learning about technical issues in the context of the LISA conference is like hanging around in a sports bar to learn what the current sports rivalries and controversies are -- even down to the arguments over beer.

I did not register for any of the training classes, which do not specifically relate to my work. My goal for the conference was to get a general sense of the current state of system administration, so I focused on attending the "Technical Sessions": Invited talks, "Guru" sessions on a few topics, and the opening and closing sessions. As always, I attended informal evening sessions, and had many conversations with the other attendees about their work.

Some General Themes

There were a few general things I took from the conference, themes that recurred.

* Huge Amounts of Data

This is no surprise (particularly for a conference geared towards administrators of large installations), but over and over during the conference I learned about huge massive vast incomprehensible quantities of data that must be administered. It seems that in nearly every talk I attended one of the slides was a data over time chart, with a line gradually rising until the most recent data points which involved a sudden huge tenfold increase. This chart appeared in a talk about the Large Hadron Collider, in a talk about storage management at the Weta Digital studios, and (by implication) in a talk about Reliability at Facebook. I wouldn't be surprised if the same chart appeared in the talk about operations at Twitter and a talk about NFS at Dreamworks.

As I say, this was no surprise, but as I look at the issues I need to address in my own storage documentation (and as I learn about our file system plans at Red Hat) it is useful for me to have a sense of just how important it is to be able to manage petabytes of data, and how quickly and how universally this need has grown.

* This is a Serious Conference

I admit up front that I have a good deal of plain old fun at this conference, and over the years I have made many friends (let's call them "professional contacts I network with") who attend as well. But the people who attend the conference are very serious about the conference. They want to get the most out of their sessions, and to bring what they learn back to their jobs. They show up on time in the morning, they attend even the arcane presentations, they pay close attention and ask serious questions. They also do not have much patience for presentations or guru sessions they feel are not technically solid. This is not a conference for armchair pontification; this is a conference that requires statistical proof for your musings.

How does this relate to my work? It means my audience doesn't want or need a lot of verbal padding around what I document. They want to know what features we provide and how to implement them. Of course I knew this, but I think about what a tough audience the LISA attendees can be when you have a technical case to make, and I realize that this tough audience is my audience.

* Red Hat at LISA

Each time I go to LISA it seems that more and more of the attendees now have to support Red Hat systems as part of their (heterogeneous) work environment. Which means that people tell me of issues they have with GFS, or (in one case), with their desire for Red Hat cluster software to continuously monitor individual services rather than entire nodes. I'm not sure what to do with those requests ("File an RFE!" I say), but at least I'm in the conversation. And once again people are specifically happy that I documented the device mappings in the LVM manual. That gives me a warm feeling and makes me want to be sure that appendix is up to date (people rely on it, after all, and trust it).

I attended an evening BOF (Birds of a Feather) session to talk about Fedora (they gave away balloons and t-shirts). I met Red Hat's own Karsten Wade as well as three system administrators who are based in Raleigh and we chatted for a good long while. We are so spread out and disconnected at Red Hat that it is great when there's an opportunity like this to learn a little bit more about other areas of the company, and to make the personal connections that I miss from an isolated satellite office.

A friend of mine I have met at previous conferences is currently involved with the Fedora documentation project, working on SELinux documentation. We made a point of going to lunch one day and the entire lunch conversation was about her work with this project. It could be a good thing for us to have this connection, as we work on our own SELinux documentation.

* Vendor Swag

Because I am not in a professional position that requires me to provide input in the purchase of hardware and software tools (and because I am not currently seeking employment as a system administrator), I do not spend a lot of time in the Vendor Exhibit. But I want to point out that the days when you could outfit yourself for a year with free t-shirts are long gone. Even Google was giving away only chapstick -- chapstick! -- emblazoned with a bit of the Google logo. Facebook, on the other hand, was giving away refrigerator magnets in the shape of a "like" thumbs-up hand on Facebook.

* LOPSA, Sysadmins, and Tech. Writers

LOPSA, the League of Professional System Administrators, formed a few years ago as a separate association from USENIX (which, as an organization, has a different set of goals). I always attend the LOPSA annual meeting at LISA, and this year I finally just paid up my dues and joined.

Why join this professional organization if I am not a system administrator? Because system administration is now a vast area. There are storage administrators, and network administrators, and system architects, and database administrators. Technical writers are just another piece of the puzzle, although I think it's a paradigm shift for us to think of our roles in that way.

LOPSA, as an organization, has had some success in organizing regional days and weekends for system administrators. Since through the LISA conference I know many of the people involved in the organization, on a regular basis I'm invited to participate in these events. I have been asked to revive a talk I have given a few times called "Why the Documentation Sucks" and, more importantly, I have been asked my advice on what I might be able to provide in my area of expertise. With this in mind, I have vague plans to see about putting together small workshops in documentation. My first idea is to teach a small class in "writing clear procedures": Exercises in what it means to have one task per item and parallel structure, and in breaking down long procedures into separate ones. But I think that all of us could probably put together small sessions that could help the system administrators. In Australia I'm sure the local SAGE-AU groups would appreciate this. This all falls under the category of long-term goals.

The Invited Talks

What I want to provide here is an informal summary of the talks along with some musings about what I can take from this talk in my work. In some cases this is very long, but most of the talks were just packed with information. Believe me, this is edited down from the notes I took.

* Keynote Address: The LHC Computing Challenge: Preparation, Reality, and Future Outlook
Tony Cass, CERN

This talk was about the work of the IT department at CERN that supports the Large Hadron Collider (LHC) in Switzerland. It was very much a "gee whiz isn't that amazing" sort of talk for geeks; Mr. Cass opened the talk by saying that the scientists were correct and "We did not destroy the universe". The LHC is the "fastest racetrack on the planet", accelerating protons to almost the speed of light. It requires an empty vacuum -- emptier than the moon. It is one of the coldest places in the universe (-271 C), just 1.9 degrees above absolute zero, colder than outer space. At one point Mr. Cass noted that the experiments that the LHC runs have recreated all of the physics that has been discovered, and they are on the verge of creating new physics. Gee whiz, isn't that amazing?

But what does all that high-level frontier-breaking work look like from the only point of view that matters: That of a system administrator?

The computing challenge they face is that the experiments they perform reconstruct tracks from digital images, which takes a lot of computing power. Then they analyze millions of these pictures. This generates lots of data, which is reduced by online computers to 100-1000 megabytes per second. The current forecast is that this will require 23-25 PetaBytes per year. The archive will need to store 0.1 EB in 2014, and approximately 1 billion files in 2015. The data is distributed worldwide. I could go on.

There are cross-site issues. There are short-term logs and long-term archives. The long-term archive contains 2 trillion records and is growing by 4 billion a day. They need to be able to support twice the apparent data needs for recovery purposes. That's a lot of stuff to manage.

They do all of this with an Oracle RAC system configured with application scalability. They use CVMFS (CernVM-FS), a caching file system. They have had various operational issues, and frequent hardware failures -- there are so many systems in the worldwide grid that disks are failing every hour. Infrastructure failures are a fact of life. But they have 98% reliability, and even if a job fails it gets retried.

Mr. Cass discussed some of the problems they have had, and some features of the system they developed that haven't worked out well. He also discussed his hopes for how things might change so that file sizes, file placement policies, and access patterns will interact better. He has hopes for sharing images between sites, using more of a cloud computing model.

The project has been long and technically and sociologically challenging, but on the whole successful. Now that they have real data they feel they can improve the system.

What do I take from this? Nothing specific for my work, except the general idea that as data needs grow larger and larger the models of sharing data (and providing backup) across networks could gain traction. Interoperability becomes more key than ever.

* Storage Performance Management at Weta Digital
Matt Provost

Weta Digital is the movie special effects production company in New Zealand that did the production for the Lord of the Rings Trilogy and Avatar. So, given the glamor and appeal of movie-making, and all of the amazing things and insider dope you might learn from people who work for this company, what would you say is the most interesting topic a representative from this company could address? Why, storage management of course.

Interesting facts for sysadmins that will probably never make it as line-items on Entertainment Tonight:

* Weta Digital is one of the biggest visual effect facilities in the world, and they use mostly commercial off-the-shelf software.

* They must run the same operating system everywhere, since processing is done on a small scale (by individual artists) and rendering is done on a huge scale. You cannot have kernel and software incompatibility.

* There is an iterative workline: People work all day, and their work needs to be rendered overnight (and needs to be done by morning). So the computing needs require a mixture of high-performance and high-availability.

* Disk usage over time: Fellowship of the Ring - 2 TB. Lord of the Rings -- ten times as much. Avatar -- 900 TB. 3D of course adds tremendously to the amount of storage you need.

* For Lord of the Rings, they used a giant SGI system for the "rendering wall". Now they run in parallel on commodity hardware. There was a tremendous spike in purchases to get through Avatar.

* The systems at Weta Digital have free cooling because of Wellington's climate. New Zealand is a small country, this is a big computer, so New Zealand has the highest peak gigaflops per capita in the world.

* The media compression between a film in a theater and Blu-Ray is huge -- and DVDs are even smaller. The block diagram showed a skyscraper for amount of data on a film compared to a small building for Blu-Ray compared to a line for DVD. The best way to see this is still in the cinema. "Don't make me waste all this disk."

* Avatar took 56 million render hours to produce. That's 6400 years, done in parallel.

How do they do it? With a disk space management system, a file system called linkfarm, which is just directories and symlinks. You can't really use a true parallel file system, since that would require striping files across multiple file systems, which they cannot do.

* Parallel file systems are metadata heavy, and require space to be reserved for the metadata; this would get huge for a project like Avatar.

* Their requirements involve lots of small files, with not a lot of big files. For a parallel file system striping across multiple files, you have to pick a stripe size. When your files are tiny, this doesn't buy you much.

* Their files can be snapshotted and backed up onto tape. How fast is a restore from tape vs. fsck? In some cases it is better to restore from tape.

* They can move directories around by moving the symlink, and applications don't need to know about it. They watch the atimes on files to decide when to move to tape, at which point they copy the directory to a new file system and do rsyncs on a five second delay to see if anything changes. If it changes that means it is in use and they bail out. If nothing changes they update the symlink and delete the original. But it is important to keep inactive files off the disk because it can take months to empty the disks at the end of a project and you can't get it all off in time for the next project.

* The source has to be good for interactive use, since the employees are good expensive artists whose time is valuable. It is hard for an artist to commit to something being final, so they have a "do you think you're done with it" system that migrates files to slow disk before moving them to tape. If somebody starts rendering, you know this immediately and you move the file back to the fast disk.

* A frame can involve hitting several file systems, but all legacy tools expect everything to be in one directory, with one directory for each frame. So they make a master directory with symlinks to individual frames so that legacy tools can go to a single directory and find everything they want.
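The directory move described in the list above (copy, wait, check for changes, repoint the symlink, delete the original) can be sketched roughly as follows. This is only an illustration, not Weta Digital's tooling: the function names, the `.tmp` suffix, and the use of file sizes and modification times as a stand-in for their repeated rsync passes are all assumptions, and it presumes a POSIX system where renaming over a symlink is atomic.

```python
import os
import shutil
import time


def snapshot(root):
    """Map each file's relative path to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return state


def migrate(link, new_parent, settle_seconds=5.0):
    """Move the directory behind symlink `link` under `new_parent`.

    Returns True if the symlink was repointed, False if the source
    changed during the settle window and the move was abandoned.
    (Hypothetical helper; the change check stands in for rsync.)
    """
    source = os.path.realpath(link)
    target = os.path.join(new_parent, os.path.basename(source))

    before = snapshot(source)
    shutil.copytree(source, target)
    time.sleep(settle_seconds)

    if snapshot(source) != before:
        # Something changed, so the directory is in use: bail out.
        shutil.rmtree(target)
        return False

    # Create the new symlink beside the old one and rename it into
    # place, so applications never see the link missing.
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)
    shutil.rmtree(source)
    return True
```

Applications keep opening the same path under the link directory throughout; only the symlink's target changes, which is the point of the design.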

What they learned with the massive data storage needs for Avatar: There are problems when you start to fill a file system; with performance there is a tipping point. Above 90% it grinds to a halt as it tries to find free blocks. This affects all file systems. Runaway jobs can fill up the file system and destroy performance. So they allocate a percentage of the file system that will remain unused, which is hidden. The file system reads as full when there are still free blocks.

The future:

* The Hobbit
* Maybe Avatar 2 and 3
* 3D is the future, which makes storage difficult (as it requires twice the amount of storage)
* Better cameras with higher resolution are coming out all the time. For Avatar, there were 12 MB per frame. For the new digital movie cameras, it is significantly larger.
* There is an increased frame rate for stereo 3D (which makes the movie look better): 48 frames per second for each eye.

So you see, the future of special effects is not about artistry, but about storage management needs.

How does this relate to my job? Oh my gosh, I've been writing about file systems for years, so to see some actual real world issues with large file systems and how people resolve those issues is eye-opening for me. This takes my work from the level of the abstract to the level of the practical to the level of the movies.

"But what is this for?", I'm always asking myself as a subtext to my documentation. I rarely get an answer as specific as this talk.

* Storage over Ethernet: What's in it for Me?
Stephen Foskett

This talk provided an overview of what is happening in the world of Ethernet storage. Hardware remains a gap in my knowledge, although the more I learn about this area the more I realize that the area of most concern to me should be storage protocols rather than specific models of specific pieces of hardware. For that matter, networking and network protocols is another gap in my knowledge. So for now I attend talks like this so that I can stay abreast of the latest buzzwords. The buzzword these days is virtualization.

Some snippets:

* Data centers now rely on standard ingredients, sharing wires with multiple protocols. What will connect these systems together? IP and Ethernet are logical choices.

* Intel's CPU is driving the commodification of the storage industry, as is open source. Open systems are on the rise; rather than using specialized embedded OSs, companies are using Linux or BSD.

* Virtualization is driving greater network and storage I/O. So are data-driven applications, and applications that require massive throughput. The move away from serial I/O is a nightmare for storage admins, and this is driving convergence of

* Storage protocols are hitching to the Ethernet bandwagon rather than the fibre channel bandwagon. iSCSI works great on the Ethernet bandwagon, while fibre channel is left behind. Eventually we're going to be using storage over Ethernet whether it is good or not. Fibre channel is fast and secure; iSCSI is cheap and insecure (but it turns out iSCSI is fast now). What this means is that Ethernet actually works as a storage mechanism.

* In addition to speed: Ethernet offers simplified connectivity, new network architecture, and virtual machine mobility. A couple of 10gig ports replace smaller networks and cluster and storage. This changes the data center, as the placement and cabling of SAN switches and adapters dictates where to install servers.

* They are reinventing everything, transforming Ethernet for better flow control and bandwidth management and congestion notification. Much of this is not standard or production ready yet.

There was some brief summary talk about other topics:

* SCSI, and SCSI over fibre channel over Ethernet.
* Fibre channel over Ethernet -- being pushed by various companies, although it is unproven and expensive and end-to-end is nonexistent.
* NFS: pNFS ("What if we added everything to NFS?"). Server-to-server control protocol isn't agreed on. Linux client supports files, and work is being done on blocks.

Server managers, network managers, and storage managers will each get different benefits from storage over converged networking.

Mr. Foskett wants to encourage FCoTR: Fibre Channel on Token Ring. He even gave away badges with a logo for this.


* Ethernet will come to dominate.
* iSCSI will continue to grow, but will probably not take over from fibre channel for a long time.
* FCoE is likely but not guaranteed and not ready for prime time.
* Future of NFS is unclear: NFS v4 should be adopted, pNFS is strongly supported by storage vendors.

What did I get from this talk? I need a general awareness of what the issues are in storage, and what developers are talking about when they bring these issues up in the context of cluster documentation. Just listening to people who live in this world talk about these concerns is a priceless (if informal) overview for me. I can't really pick all this up myself, since I'm not working on any actual systems or deciding on network architecture, so listening to a talk like this is like looking over the shoulder of a network architect, or sitting in silently in a planning meeting.

* Rethinking Passwords
William Cheswick

This talk, on current absurdities in the password protocols that reduce the quality of our lives, was a little bit less sysadmin-oriented than the other invited talks, but that was because it was a replacement talk for a speaker who was unable to make it. William Cheswick, however, is a major figure in the world of system administration, in the area of security in particular, and he always puts on a good show.

Bill's premise was that our passwords are in an awful mess, and he wants to add his voice to this major problem that needs fixing. He began by showing a series of slides showing various password rules instituted by many organizations -- each seemingly more absurd and complex than the next. He referred to this throughout his talk as the "eye of newt" rules.

Bill blames himself for some of this mess, as people are using (and extending) guidelines he noted in his firewalls book: Use a different password on each target system, change them frequently, don't write them down. But this leads to an impossible situation. These rules come from the past when the stakes were lower, before there were keystroke loggers and phishing attacks and password database compromise. The rules do not make things more secure in the face of most current threats.

Bill made various points:

* You need to count and limit password choices, to prevent dictionary attacks. If you don't allow the most common words, you can still allow just plain words. Research is needed on account locking: What does a lost password cost, how long will a user wait for an unlock?

* Better solution: Get out of the game, using SecurID or RSA Softkey. Use a challenge/response password.

* How many bits do you need to type? Facebook and Twitter want 20 bits. Banks want 30. Government in the mid 40s and up. If you want 60 bits of entropy? "value part peter sense some computer". "blissrubbery uncial Iris". 2e3059156c9e378. But these don't pass "eye of newt" entropy rules. And who is going to remember these things for a year?

* Perhaps you could use passpoints rather than passwords: Pick something in a picture. Passfaces. Computer-generated art. Gestures. Pass graphs: generated on the computer.

* Bill's idea: passmaps where you zoom in on something, so where you end up is not shown on the original screen. The hope is that you can remember this. It's hard to shoulder surf this, but is it memorable for a year?

* Challenge response with obfuscations. There is related literature going back to 1967.

* Procedural passwords, using math -- so far none seem usable.
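To make the "how many bits" point in the list above concrete, here is a small sketch of the arithmetic, assuming each character or word is picked uniformly and independently. The function names are hypothetical, and the 1024-word list is my own assumption for illustration; the talk summary does not say how its word examples were generated.

```python
import math


def entropy_bits(choices, length):
    """Entropy, in bits, of `length` independent uniform picks from `choices` options."""
    return length * math.log2(choices)


def symbols_needed(bits, choices):
    """Smallest number of uniform picks from `choices` options giving at least `bits` bits."""
    return math.ceil(bits / math.log2(choices))


# 15 hexadecimal digits, like the "2e3059156c9e378" example: 4 bits each.
hex_bits = entropy_bits(16, 15)

# Six words drawn from an assumed 1024-word list: 10 bits each.
word_bits = entropy_bits(1024, 6)

# Random lowercase letters needed to reach the same 60 bits.
letters = symbols_needed(60, 26)

print(hex_bits, word_bits, letters)  # prints: 60.0 60.0 13
```

Under these assumptions the hex string and the six-word phrase carry the same 60 bits, which is why neither form is "stronger"; the difference is in how easy each is to type and remember.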

What you want is strong authentication, not strong passwords.

What I got out of this? Nothing for my job in particular, but certainly, like everybody else, this concerned a major aspect of my daily life and it was good to hear what experts in the field are now saying.

* System Administrators in the Wild: An Outsider's View of Your World and Work
Eben Haber

This talk provided an overview summary of a book in progress that came from an IBM research project to determine just what it is that system administrators do and how they do it, as part of their research in developing autonomic computing (systems that fix themselves). They used tools of ethnography to study system administrators: Living amongst the natives, participating in their daily activities, understanding what their life is like.

Ethnography is about stories, stories of real people. So this talk focused on some films and recordings of system administrators at crisis times.

One film showed two administrators working together on a dry run of a system update that nearly accidentally brought the whole system down. What became clear in the film were the vast and complex techniques and procedures they had developed over time to do their work, and how esoteric (but necessary) those procedures were. This example showed collaboration, and coping methods.

The second (lengthy) recording showed an apprentice system administrator having a really bad day, and the difficulties he had setting up a new server. So many things went wrong in so many ways that the example provided an enormous amount of fodder for the talk attendees to discuss what had gone wrong and how and how to prevent that. The big issue at hand was how people can get caught in a particular thought pattern and can't get out of it to see something -- that sometimes the system administrator needs to be "rebooted". Communication and collaboration are critical, and people can't solve things alone.

So what is system administration? An environment of large scale very high complexity and risk. With coping mechanisms of specialization, collaboration, improvisation, automation, and tool building.

Is it possible to tame the complex technology of system administration, and to automate it? The complexity curve has not yet flattened out. IT exists to serve human ends, and human ends are not specified in machine-readable format. Until computer systems can understand the natural language in which human ends and business requirements are specified, we can't have automatic systems and it will be the job of the sysadmins to translate.

What did I get from this talk, as it relates to my job? I often say that I go to LISA to observe sysadmins in their native environment myself, to get a sense of who my audience is. The videos of administrators at times of stress gave me the clearest picture in my head of the world in which my audience lives. It is my job -- my mission -- to try to address or prevent the sorts of problems that come cascading down around the administrators. The speaker emphasized the collaborative and community nature of system administration. My goal is to be part of that collaboration.

* Er What? Requirements, Specifications, and Reality: Distilling Truth from Friction
Cat Okita

My friend Cat Okita gives humorous talks about serious issues of system and network administration. She fills her talks with illustrations, wit, and a bit of snark. She encourages audience participation, and the audience also shows a bit of wit and snark. And then comes the q&a session when people get up and make serious points about the topic.

This year's talk was on specifications: What it means to provide clear requirements from which you can tell that you've met a project goal.

Advice: Participate if you don't want requirements inflicted from above. Cooperate. Call meetings. Document. Use brief and appropriate language. Ask, don't tell.

Cat talked a bit about specifying the why, what, and how of project specifications, with some negative examples of buzzwords and vagueness and some advice on being realistic and specific and limiting scope.

As an example to work through, Cat proposed the project of building a death star. Out of pumpkins. Using a spoon. ("Are these requirements or specifications?") Somebody proposed a specification of "must destroy 300 planets an hour".

More examples: "We're moving everything into the cloud." The cloud? Everything? "We're moving into the 21st century." But who's involved? What does this mean? Who might this affect? What are the assumptions?

During q&a I noted that goals in my area don't shift so much as get compromised and I asked how one deals with this. Cat's answer: Redefine, make goals iterative. And ask myself: Would the project have been better if I hadn't done anything? If somebody else had done it? (Actually it was the friend of mine sitting in front of me who suggested the latter.) Good answers.

I know this summary makes it seem as if this was a light talk, and it was intended to be an end-of-conference talk when your brain is fried. But just spending time to stop and think about these common issues and problems with people who have the same general concerns you do about doing your job well when it often seems impossible is, to me, inspiring. And fun.

* Reliability at Massive Scale: Lessons Learned at Facebook
Robert Johnson

The Director of Engineering at Facebook, along with three of his colleagues, gave a talk about a recent outage at Facebook. They described the basic setup of their system, summarized what went wrong, and talked about what they learned from this.

Reliability is important on Facebook, even though they are dealing with huge numbers: 2 trillion objects cached. The longest they've ever been down is 2.5 hours, even though "Every day we have our biggest day ever".

On Facebook, every page is dynamic: Depending on who the viewer is and what they are allowed to see, each page is assembled out of hundreds or thousands of pieces of information fetched from a MySQL database, where the persistent copy is stored, and then cached on a memcache server. That is where scaling problems show up.

The cache is client-controlled. The memcache and MySQL have no knowledge of each other. The client puts values from the database into the cache. The system is designed to fix issues when there is inconsistent data by removing the bad data from the memcache server and getting the value from the database. But in the case of the failure under discussion, the problem was in the database, the result of a bug they had not encountered before. So each client would delete the value from the memcache server and the next client would put the bad value back. Soon you had thousands of clients pounding a handful of databases. The problem cascades. So they had to shut the site down, stopping all traffic, which turned out to be harder than anticipated. Now they sanity-check values before putting a new value in the cache.
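The failure loop and the fix can be illustrated with a toy read-through client. This is a sketch under loud assumptions, not Facebook's code: plain dicts stand in for memcache and the database, and `read_through`, `is_valid`, and `check_before_caching` are hypothetical names. It shows only the sanity check itself; a real system would presumably also need to limit how hard clients retry the database.

```python
def read_through(cache, database, key, is_valid, check_before_caching=True):
    """Return the value for `key`, filling the cache from the database on a miss.

    `cache` and `database` are plain dicts standing in for the real
    services; `is_valid` is a caller-supplied sanity check.
    """
    value = cache.get(key)
    if value is not None:
        if is_valid(value):
            return value
        # Bad cached value: evict it and fall through to the database.
        del cache[key]

    value = database[key]
    if check_before_caching and not is_valid(value):
        # With the check on, a bad persistent value is never put back.
        raise ValueError(f"refusing to cache invalid value for {key!r}")
    cache[key] = value
    return value


database = {"config": -1}           # the persistent copy itself is bad


def is_valid(value):
    return value >= 0


cache = {}
# Without the check, every read re-caches the bad value, so the next
# client evicts it and goes back to the database again.
read_through(cache, database, "config", is_valid, check_before_caching=False)
print(cache)                        # prints: {'config': -1}
```

With `check_before_caching` left at its default, the same call raises `ValueError` and the cache stays empty, so the bad value stops circulating between clients.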

As a generalization: If a large number of machines call on a small number of machines, there will be problems. You always want the small number of machines to initiate processes.

The rest of the talk summarized some general things they try to implement at Facebook, which apply to systems in general: Avoid single points of failure, provide redundancy, make sure there is spare capacity at every tier, design clusters that can fail independently, account for the fact that software itself can be a single point of failure, automate steps involved in failover but leave the decision on whether to pull the trigger to a human. Measure everything. Always do a post mortem.

What do I take from this? A lot of what I try to follow on the internal mailing lists are the descriptions of how our customers set up their systems (or, more usually, want to set up their systems). This talk was a pretty good example of how very large systems can be configured, and what sorts of things the system architects need to be concerned with. In some key ways this adds to my system vocabulary, and my understanding of the world for which I write my documents.

"GURU IS IN" Sessions

For each time block of the conference, there is a "guru" session: An informal roundtable session with a particular subject matter expert. Some of these sessions involve more in the way of formal presentation than others, but they all focus on participation by the attendees. I attended two of these sessions.

* Project Management
Strata Rose Chalup

I attended this session in part to see if the issues I face in my position (where I am not technically a project manager) are similar to the issues faced by system administrators, and the answer is a resounding yes. Strata began the session by noting that "everything people do that takes more than an afternoon can take advantage of project management techniques."

Much of what Strata discussed are things that to a large extent I have learned to implement over the years, both on the level of document planning and even on the level of maintaining documents for point releases. Some snippets:

* Have a risk management plan. Each risk should have an owner, with a specific condition that will pull the trigger.
* The project manager should have a subject matter expert team.
* Look where you want to go, not where you don't want to go.
* Keep after people to quantify what they are going to produce as deliverables.
* Sit down with people who you need to produce things, talk with them, write up what they say, get past the blank page syndrome -- one person at a time.

A participant asked the question: So much of our work is figuring out what to do and how to do it, and so little is actually doing it. How do you quantify that in project management? Ans: Build a design, architecture and discovery phase into the plan, possibly including "create draft for plan" as a phase.

There was some general discussion about specific bad projects people had worked on, with some analysis about how to prevent these things (give meetings a specific purpose, bring everybody in from the beginning, provide written support material, template things that are set in tickets, parallelize tasks).

The discussion ended with talk about the importance of getting things documented -- not just what, but why. Words of wisdom for us all.

* Production Documentation
Matthew Sachs

Production documentation is not the same as the technical documentation I do, but since this was a session with the word "documentation" in the title I thought it would do me well to attend. Production documentation means documenting the setup of your production system.

But in general production documentation is very difficult to maintain. It is difficult to keep track of server farm buildouts, and in general there is not enough time in a sysadmin's day to maintain documentation. Still, if only one person knows something, it starts to kill your efficiency.

Matthew Sachs, who ran this session, is working on ways of automating production documentation.

Mr. Sachs noted that the most powerful thing that system administrators have is the ability of human expression. Through human expression you gain understanding and can better build systems. Code and configuration files require a technical understanding, but when you transform that into something more like a conversation between two people you better understand the system. That's something I should be thinking about in documenting configuration files.

Without automation and standardized documentation, a human has to read a wiki page to generate a configuration file. Release management, development, and operations teams all have to agree on the components of the structure (such as a server farm), but each group has its own structure for storing information. The system administrator has to interpret the design document.

Mr. Sachs is working on a "handling engine" -- a lexical parser -- that can read an infrastructure design document and automatically generate xml and then run validation. The manual process of a person reading a wiki page and then running validation manually is going to be packaged into this process, which should apply to any configuration management system. A translation engine will turn human expression into code.

With standardized infrastructure design documents, documentation can be functional, and can be used to power your process. Documentation holds a great wealth of power by fusing the expressiveness of documentation with automated systems. Documentation can improve the process, as it is no longer just notetaking but is part of the process. Everything necessary to build something can be in the document, and once it is in xml you can build the application.
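As a rough illustration of the idea (and only that), here is a small Python sketch of turning a standardized design document into xml and then validating it. Nothing here is Mr. Sachs's actual engine or format: the line-based document layout, the `farm` and `server` keywords, and the required `role` and `os` attributes are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical design document format, invented for this sketch.
DESIGN_DOC = """\
farm: web-frontend
server: web01 role=web os=linux
server: web02 role=web os=linux
server: db01 role=database os=linux
"""

# Assumed rule: every server must declare these attributes.
REQUIRED_ATTRS = {"role", "os"}


def parse_design(text):
    """Turn the line-based design document into an xml element tree."""
    root = ET.Element("farm")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, _, rest = line.partition(":")
        fields = rest.split()
        if kind == "farm":
            root.set("name", fields[0])
        elif kind == "server":
            attrs = dict(field.split("=", 1) for field in fields[1:])
            ET.SubElement(root, "server", name=fields[0], **attrs)
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return root


def validate(root):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    if not root.get("name"):
        problems.append("farm has no name")
    for server in root.findall("server"):
        for attr in sorted(REQUIRED_ATTRS - set(server.attrib)):
            problems.append(f"{server.get('name')}: missing {attr}")
    return problems


farm = parse_design(DESIGN_DOC)
print(ET.tostring(farm, encoding="unicode"))
print(validate(farm))
```

The point of the sketch is the shape of the pipeline: the document a person would otherwise read by hand becomes the input, the xml is generated from it, and validation runs on the generated result rather than on somebody's interpretation of a wiki page.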

This all sounds intriguing. Good luck, I say.

The rest of the session was a general discussion on what problems the people in the session have with producing documentation:

* No time to update.
* Need for better communication among departments.
* Documentation that was written by somebody no longer at the site.
* Unclear ownership of document.

But here's the cool thing: Somebody mentioned that his department has a fulltime technical writer to manage the documentation. The technical writer hounds people to make sure they follow standards and format.

There was a general discussion on ways to make documentation easier to maintain:

* Set aside a small period of time, maybe once a week at the end of the week, to document.
* Document in small chunks.
* Define your audience: is the intended reader an expert or a novice?
* Maintain a personal blog or a company blog.
* Measure the number of errors made in systems that were not documented as compared to systems that were and show that to management to get the resources to document. One group instituted a policy whereby if you didn't write up something before the weekly meeting you had to bring candy bars for everybody.

The concluding comments could specifically have been directed at me:

* The better the technical writers do, the lazier system administrators can be.
* It is not good just to repeat vendor documentation; document differences and one-offs from standard vendor information.
* Searching hundreds of pdf pages is not realistic; anything that is not easily findable has to be redocumented.
* Things need to be concise, easy to find, and directly applicable.

Closing Session

* Look! Up in the Sky! It's a Bird! It's a Plane! It's a Sysadmin!
David Blank-Edelman

The final session of the LISA conference is usually something on the lighter side, to combat the weariness we are all experiencing by that point. David Blank-Edelman has been giving a series of talks over the years in which he examines a particular profession which faces similar challenges to system administration and researches how that profession addresses the challenges. How are veterinarians trained to diagnose animals (systems) that don't speak? How do cookbook writers put procedures together?

This year David's talk was on superheroes, and how system administrators have special powers and talents that can be likened to those of comic book characters. Utility belts and system toolkits. Specialized knowledge that helps the world. Plus: superheroes and system administrators have long-suffering spouses.

What did I take from this? What David wanted me to: A sense that technical knowledge is a superpower.


This year's conference included many more talks and sessions that provided a large gee-whiz factor for me, particularly as I learned the details of how various large and super-large and super-duper-large configurations are implemented. System administration is splitting up into various specialties, and technical writing for system administrators is one of those specialties. I am not a guest at this conference; this is my conference as well. You should consider whether it is yours.
