The Book of Xen - Part 18

To actually use the framebuffer within a domain, you'll need to specify it in the config file. Recent versions of Xen have improved the syntax somewhat. The vfb= option controls all aspects of the virtual framebuffer, just as the vif= and disk= lines control virtual interfaces and virtual block devices. For example:

    vfb = ['type=vnc,vncunused=1']

Here we specify a VNC VFB and tell the VNC server to listen on the first unused port that's over the given number. (We go into more detail on the options available in Appendix B.) Or, if you're feeling adventurous, there's the SDL version:

    vfb = ['type=sdl']

Simple.
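Because Xen domain config files use Python syntax, the vfb= line is just a list of strings, each holding comma-separated key=value pairs. The following is a minimal sketch of how such a string breaks down. The helper name parse_vfb is our own invention for illustration and is not part of Xen, whose parser may behave differently:

```python
def parse_vfb(spec):
    """Split a vfb option string such as 'type=vnc,vncunused=1' into a dict.

    Hypothetical helper for illustration only; not Xen's own parser.
    """
    options = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


vfb = ['type=vnc,vncunused=1']          # same syntax as in the domain config
parsed = [parse_vfb(entry) for entry in vfb]
```

A quick check like this can catch a mistyped option name before the config ever reaches xm.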

Use of the XenStore for Fun and Profit

The XenStore is the configuration database in which Xen stores information on the running domUs. Although Xen uses the XenStore internally for vital matters like setting up virtual devices, you can also write arbitrary data to it from domUs as well as from dom0. Think of it as some sort of interdomain socket.

This opens up all sorts of possibilities. For example, domains could, in theory, negotiate among themselves for access to shared resources. Or you could have something like the talk system on the shared UNIX machines of yore: multiuser chat between people running on the same host. You could use it to propagate host-specific messages, for example, warning people of impending backups or migration. For the most part, though, such applications remain to be written.

It's a little inconvenient to interact with the XenStore manually because no one's gotten around to providing a handy shell-style interface. In the meantime, we have to make do with tools that interrogate single keys.

To look at the XenStore, you can use the xenstore-list command. Here's a shell script from the Xen wiki that dumps keys from the xenstore recursively with xenstore-list:

    #!/bin/bash
    # 'function' and 'local' are bash features, so run this under bash.

    function dumpkey() {
        local param=${1}
        local key
        local result
        result=$(xenstore-list ${param})
        if [ "${result}" != "" ] ; then
            # The key has children: recurse into each of them.
            for key in ${result} ; do dumpkey ${param}/${key} ; done
        else
            # Leaf key: print it as path=value.
            echo -n ${param}'='
            xenstore-read ${param}
        fi
    }

    for key in /vm /local/domain /tool ; do dumpkey ${key} ; done

You'll see that we have three hard-coded top-level keys: vm, local/domain, and tool. These each have a well-defined purpose to the hypervisor: vm stores domain information by UUID; local/domain stores domain information by ID (one might say that vm exports domain data in a form suitable for migration, and local/domain stores it for local use); and tool stores tool-specific information.
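The same recursive walk can be sketched in Python 3 (3.7 or later) by shelling out to the xenstore-list and xenstore-read tools named above. The function names here are our own, and the list and read functions are passed in as parameters so that the traversal can be exercised without a running Xen host. Treat this as an illustration of the recursion rather than a supported tool:

```python
import subprocess


def xenstore_list(path):
    """Return the child key names under path, via the xenstore-list tool."""
    out = subprocess.run(["xenstore-list", path], capture_output=True,
                         text=True, check=False).stdout
    return out.split()


def xenstore_read(path):
    """Return the value stored at path, via the xenstore-read tool."""
    out = subprocess.run(["xenstore-read", path], capture_output=True,
                         text=True, check=False).stdout
    return out.rstrip("\n")


def dump_keys(path, list_fn=xenstore_list, read_fn=xenstore_read):
    """Yield (path, value) for every leaf key below path, depth first."""
    children = list_fn(path)
    if children:
        # Interior node: recurse into each child key.
        for child in children:
            yield from dump_keys(path + "/" + child, list_fn, read_fn)
    else:
        # Leaf node: report its value.
        yield path, read_fn(path)
```

On a dom0, something like `for path, value in dump_keys("/local/domain"): print(path + "=" + value)` would print the same path=value pairs as the shell version.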

Poke around, look at the keys and how they map the information that you already know about the domain from other sources, like xm list --long. For example, to get the memory usage target for the domain, run:

    # xenstore-read /local/domain/15/memory/target
    1048576

Many of the keys in the XenStore are also writable. Although we don't recommend adjusting memory usage by writing to the XenStore, see the next section for an example of interdomain communication via writable XenStore keys.
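The memory target returned in the example above is expressed in KiB, so 1048576 corresponds to 1024 MiB. As a small worked example (the helper names are ours, chosen for illustration):

```python
def kib_to_mib(kib):
    """Convert a KiB count, as stored under memory/target, to MiB."""
    return int(kib) // 1024


def memory_target_path(domid):
    """Build the XenStore path that holds a domain's memory target."""
    return "/local/domain/%d/memory/target" % domid


target_mib = kib_to_mib("1048576")   # the value read in the example above
```

Here target_mib works out to 1024, which should agree with the memory figure that xm list reports for the same domain.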

Automatically Connecting to the VNC Console on Domain Boot

One neat feature of the Xen LiveCD is that Xen domains, when started, will automatically pop up a VNC window when they've finished booting. The infrastructure that makes this possible is a script in the domU, a listener in the dom0, and the XenBus between them.

The script in the domU, vnc-advertiser, fires off from the domU startup scripts and waits for an Xvnc session to start. When it finds one, it writes to the XenStore:

    xenstore-write /tool/vncwatch/${domid} ${local_addr}${screen}

In the dom0, a corresponding script watches for writes to the XenStore. On the LiveCD, it's named vnc-watcher.py. This script is a good example of general-purpose uses for the XenStore, so we've copied it wholesale here, with verbose annotations:

    #!/usr/bin/env python
    ###
    # VNC watch utility
    # Copyright (C) 2005 XenSource Ltd
    #
    # This file is subject to the terms and conditions of the GNU General
    # Public License. See the file "COPYING" in the main directory of
    # this archive for more details.
    ###

    # Watches for VNC appearing in guests and fires up a local VNC
    # viewer to that guest.

    # Import libraries necessary to interact with the xenstore. Xswatch
    # watches a xenstore node and activates a script-defined function
    # when the node changes, while xstransact supports standard read and
    # write operations.

    from xen.xend.xenstore import xswatch
    from xen.xend.xenstore.xstransact import xstransact
    from os import system

    def main():
        # first make the node:
        xstransact.Mkdir("/tool/vncwatch")
        xstransact.SetPermissions("/tool/vncwatch",
                                  {"dom": 0,
                                   "read": True,
                                   "write": True})
        active_connections = {}

        # The watchFired method does the actual work of the script. When the
        # watcher notes changes to the path "/tool/vncwatch/", it calls
        # watchFired with the path (and arguments, which are unused in this
        # script).

        def watchFired(path, *args, **nargs):
            if path == "/tool/vncwatch":
                # not interested:
                return 1

            # If we reach this point, something's changed under our path of
            # interest. Let's read the value at the path.

            vncaddr = xstransact.Read(path)
            print vncaddr

            # When the vnc-advertiser notices that Xvnc's shut down in the domU,
            # it removes the value from the xenstore. If that happens, the
            # watcher then removes the connection from its internal list (because
            # presumably the VNC session no longer exists).

            if vncaddr == None:
                # server terminated, remove from connection list:
                if path in active_connections:
                    del active_connections[path]
            else:
                # server started or changed, find out what happened:
                if (not active_connections.has_key(path)) or \
                        active_connections[path] != vncaddr:

                    # Recall that the vnc-advertiser script writes ${domid}
                    # ${local_addr}${screen} to the path /tool/vncwatch/. The
                    # watcher takes that information and uses it to execute the
                    # vncviewer command with appropriate arguments.

                    active_connections[path] = vncaddr
                    system("vncviewer -truecolour " + vncaddr + " &")
            return 1

        # Associate the watchFired event with a watcher on the path
        # "/tool/vncwatch"

        mywatch = xswatch.xswatch("/tool/vncwatch", watchFired)
        xswatch.watchThread.join()

    if __name__ == "__main__":
        main()
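The decision logic inside watchFired can be separated from the Xen bindings, which makes it easier to reason about and to test away from a dom0. The following Python 3 sketch is our own restatement, not code from the LiveCD. The function name handle_watch_event and the choice to return the viewer command as a list (rather than running it) are assumptions made for illustration:

```python
def handle_watch_event(active_connections, path, vncaddr,
                       root="/tool/vncwatch"):
    """Update active_connections and return a viewer command, or None.

    vncaddr is the value read from the XenStore at path, or None if the
    key has been removed.
    """
    if path == root:
        return None                          # the watched node itself: ignore
    if vncaddr is None:
        active_connections.pop(path, None)   # server went away: forget it
        return None
    if active_connections.get(path) != vncaddr:
        active_connections[path] = vncaddr   # new or changed server
        return ["vncviewer", "-truecolour", vncaddr]
    return None                              # repeat notification: do nothing
```

A caller would pass the returned list to something like subprocess.Popen. Keeping the state in a plain dictionary means that repeated watch notifications for the same address don't spawn duplicate viewers.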

There are a couple of other sections that we would have loved to include here but that aren't ready as of this writing: for example, the ongoing open source efforts to build an Amazon EC2 clone, or the high-availability work being done by Project Kemari.

Anyway, please visit our website (http://prgmr.com/xen/) for more on the cool yet frightfully everyday things that we do with Xen.

Also, if you've broken your system trying to upgrade Xen from source, there's no better time than the present to take a look at the next chapter.

Chapter 15. TROUBLESHOOTING

With any luck, you're just reading this chapter for fun, not because your server has just erupted in a tower of flame. Of course, sysadmins being almost comically lazy, it's most likely the latter, but the former is at least vaguely possible, right?

If the machine is in fact already broken, don't panic. Xen is complex, but the issues discussed here are fixable problems with known solutions. There's a vast arsenal of tools, a great deal of information to work with, and a lot of expertise available.

In this section, we'll outline a number of troubleshooting steps and techniques, with particular reference to Xen's peculiarities. We'll include explanations for some of the vague error messages that you might come across, and we'll make some suggestions about where to get help if all else fails.

Let's start with a general overview of our approach to troubleshooting, which will help to put the specific discussion of Xen-related problems in context.

The most important thing when troubleshooting is to get a clear idea of the machine's state: what it's doing, what problems it's having, what telegraphic errors it's spitting out, and where the errors are coming from. This is doubly important in Xen because its modular, standards-based design brings together diverse and unrelated tools, each with its own methods of logging and error handling.

Our usual troubleshooting technique is to:

1. Reproduce the problem.

2. If the problem generates an error message, use that as a starting point.

3. If the error message doesn't provide enough information to solve the problem, consult the logs.

4. If the logs don't help, use set -x to make sure the scripts are firing correctly, and closely examine the control flow of the non-Xen-specific parts of the system.

5. Use strace or pdb to track the flow of execution in the more Xen-specific bits and see what's failing.

If you get truly stuck, you might want to think about asking for help. Xen has a couple of excellent mailing lists (xen-devel and xen-users) and a useful IRC channel, #xen on irc.oftc.net. For more information about how and where to get help, see the end of the chapter.

Troubleshooting Phase 1: Error Messages

The first sign that something's amiss is likely to be an error message and an abrupt exit. These usually occur in response to some action: booting the machine, perhaps, or creating a domU.

Xen's error messages can be, frankly, infuriating. They're somewhat vague and developer oriented, and they usually come from somewhere deep in the bowels of the code where it's difficult to determine what particular class of user error is responsible, or even if it's user error at all.

Better admins than us have been driven mad, have thrown their machines out the window and vowed to spend the rest of their lives wearing animal skins, killing dinner with fire-hardened spears. And who can say they are wrong?

Regardless, the error messages are a useful diagnostic and often provide enough information to solve the problem.

Errors at Dom0 Boot

The first place to look for information about system-wide problems (if only because there's nothing else to do while the machine boots) is the boot output, both from the hypervisor and the dom0 kernel.

READING BOOT ERROR MESSAGES

When a machine's broken badly enough that it can't boot, it often reboots itself immediately. This can lead to difficulty when trying to diagnose the problem. We suggest using a serial console with some sort of scrollback buffer to preserve the messages on another computer. This also makes it easy to log output, for example by using GNU screen.

If you refuse to use serial consoles, or if you wish to otherwise do something before the box reboots, you can append noreboot to both the Xen and Linux kernel lines in GRUB. (If you miss either, it'll reboot. It's finicky that way.)

Many of the Xen-specific problems we've encountered at boot have to do with kernel/hypervisor mismatches. The Xen hypervisor must match the dom0 kernel in terms of PAE support, and if the hypervisor is 64 bit, the dom0 must be 64 bit or i386-PAE. Of course, if the hypervisor is 32 bit, so must be the dom0.

You can run an i386-PAE dom0 with an x86_64 hypervisor and x86_64 domUs, but only on recent Xen kernels (in fact, this is what some versions of the Citrix Xen product do). In no case can you mismatch the PAE-ness. Modern versions of Xen don't even include the compile-time option to run in i386 non-PAE mode, causing all sorts of problems if you want to run older operating systems, such as NetBSD 4.
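The matching rules above can be written down as a small truth function. This is only a restatement of the rules as the text gives them. The architecture labels and the function name are our own, and the sketch doesn't model the further restriction that an i386-PAE dom0 on a 64-bit hypervisor needs a recent version of Xen:

```python
def dom0_can_boot(hypervisor, dom0):
    """Return True if the rules in the text allow this hypervisor/dom0 pair.

    Both arguments are one of "x86_64", "i386-pae", or "i386" (non-PAE).
    """
    valid = ("x86_64", "i386-pae", "i386")
    if hypervisor not in valid or dom0 not in valid:
        raise ValueError("unknown architecture label")
    if hypervisor == "x86_64":
        # A 64-bit hypervisor accepts a 64-bit or an i386-PAE dom0.
        return dom0 in ("x86_64", "i386-pae")
    # A 32-bit hypervisor needs a 32-bit dom0 with the same PAE setting.
    return hypervisor == dom0
```

For example, dom0_can_boot("x86_64", "i386") is False because the PAE settings differ, while dom0_can_boot("i386-pae", "i386-pae") is True.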

Of course, many of the problems that we've had at boot aren't especially Xen-specific; for example, the machine may not boot properly if the initrd isn't correctly matched to the kernel. This often causes people trouble when moving to the Xen.org kernel because it puts the drivers for the root device into an initrd, rather than into the kernel.

If your distro expects an initrd, you probably want to use your distro's initrd creation script after installing the Xen.org kernel. With CentOS, after installing the Xen.org kernel, make sure that /etc/modprobe.conf correctly describes your root device (with an entry like alias scsi_hostadapter sata_nv), then run something like:

    # mkinitrd /boot/initrd-2.6.18.8-xen.img 2.6.18.8-xen

Replace /boot/initrd-2.6.18.8-xen.img with the desired filename of your new initrd, and replace 2.6.18.8-xen with the output of uname -r for the kernel that you're building the initrd for. (Other options, such as --preload, may also come in handy. Refer to the distro manual for more information.)

Assuming you've booted successfully, there are a variety of informative error messages that Xen can give you. Usually these are in response to an attempt to do something, like starting xend or creating a domain.

DomU Preboot Errors

If you're using PyGRUB (or another bootloader, such as pypxeboot), you may see the message:

    VmError: Boot loader didn't return any data!

This means that PyGRUB, for some reason, wasn't able to find a kernel. Usually this is either because the disks aren't specified properly or because there isn't a valid GRUB configuration in the domU. Check the disk configuration and make sure that /boot/grub/menu.lst exists in the filesystem on the first domU VBD.

Note: There's some leeway; PyGRUB will check a bunch of filenames, including but not limited to /boot/grub/menu.lst, /boot/grub/grub.conf, /grub/menu.lst, and /grub/grub.conf. Remember that PyGRUB is a good emulation of GRUB, but it's not exact.
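If the domU's first VBD is mounted somewhere in dom0, a quick check for the filenames listed in the note might look like the following sketch. The function name and the idea of a mount point argument are our assumptions. The list covers only the four paths named above, whereas PyGRUB itself checks more:

```python
import os

# Candidate paths named in the note above; PyGRUB itself checks more.
GRUB_CONFIG_CANDIDATES = [
    "boot/grub/menu.lst",
    "boot/grub/grub.conf",
    "grub/menu.lst",
    "grub/grub.conf",
]


def find_grub_config(mount_point):
    """Return the first candidate GRUB config under mount_point, or None."""
    for relative in GRUB_CONFIG_CANDIDATES:
        candidate = os.path.join(mount_point, relative)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A result of None for the mounted filesystem suggests that the "Boot loader didn't return any data!" error comes from a missing GRUB configuration rather than from a bad disk= line.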

You can troubleshoot PyGRUB problems by running PyGRUB manually:

    # /usr/bin/pygrub type:/path/to/disk/image

This should give you a PyGRUB boot menu. When you choose a kernel from the menu, PyGRUB exits with a message like:

    Linux (kernel /var/lib/xen/boot_kerne.hH9kEk)(args "bootdev=xbd1")

This means that PyGRUB successfully loaded a kernel and placed it in the dom0 filesystem. Check the listed location to make sure it's actually there.

PyGRUB is quite picky about the terminal it's connected to. If PyGRUB exits, complaining about libncurses, or if PyGRUB on the same domain works for some people and not for others, you might have a problem with the terminal.

For example, with the version of PyGRUB that comes with CentOS 5.1, you can repeatedly get a failure by executing xm create -c from a terminal window less than 19 lines long. If you suspect this may be the problem, resize your console to 80 x 24 and try again.

PyGRUB will also expect to find your terminal type (the value of the TERM variable) in the terminfo database. Manually setting TERM=vt100 before creating the domain is usually sufficient.
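Both terminal conditions are easy to check before running xm create -c. The sketch below uses only the Python 3 standard library. The 19-line threshold is the one observed above for the CentOS 5.1 PyGRUB and may not apply to other versions, and the check only tests whether TERM is set, not whether it appears in the terminfo database:

```python
import os
import shutil


def terminal_warnings(lines, term):
    """Return a list of reasons PyGRUB might dislike this terminal."""
    warnings = []
    if lines < 19:
        warnings.append("terminal has %d lines; resize to 80 x 24" % lines)
    if not term:
        warnings.append("TERM is unset; try TERM=vt100")
    return warnings


def check_current_terminal():
    """Apply terminal_warnings to the terminal this process is running in."""
    size = shutil.get_terminal_size(fallback=(80, 24))
    return terminal_warnings(size.lines, os.environ.get("TERM", ""))
```

An empty list from check_current_terminal() means that neither of the two known problems applies, so the cause probably lies elsewhere.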

Creating Domains in Low-Memory Conditions

This is one of the most informative error messages in Xen's arsenal:

    XendError: Error creating domain: I need 131072 KiB, but dom0_min_mem
    is 262144 and shrinking to 262144 KiB would leave only -16932 KiB
    free.

The error means that the system doesn't have enough memory to create the domU as requested. (The system in this case had only 384MiB, so the error really isn't surprising.) The solution is to adjust dom0_min_mem to compensate or adjust the domU to require less memory. Or, as in this case, do both (and possibly add more memory).
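The arithmetic behind this kind of failure can be sketched with a simplified model. This is not xend's code, and xend may derive the figure in its message differently. The function name and the sample numbers below are hypothetical, chosen only to show how a negative result arises when dom0 can't give up enough memory:

```python
def free_after_create(free_kib, dom0_kib, dom0_min_kib, needed_kib):
    """KiB left if dom0 balloons down to its minimum and the domU is created.

    Simplified model for illustration; a negative result means the
    domain cannot be created without changing one of the inputs.
    """
    reclaimable = max(dom0_kib - dom0_min_kib, 0)
    return free_kib + reclaimable - needed_kib


# Hypothetical host: 20000 KiB free, dom0 currently using 300000 KiB.
shortfall = free_after_create(free_kib=20000, dom0_kib=300000,
                              dom0_min_kib=262144, needed_kib=131072)
```

With these made-up figures the result is negative, so either dom0_min_mem or the domU's memory request has to come down, which is the same conclusion the text reaches.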

Configuring Devices in the DomU

Most likely, if the domU fails to start because of missing devices, the problem is tied to storage. (Broken network setups don't usually cause the boot to fail outright, although they can render your VM less than useful after booting.)

Sometimes the domU will load its kernel and get through the first part of its boot sequence but then complain about not being able to access its root device, despite a correctly specified root kernel parameter. Most likely, the problem is that the domU doesn't have the root device node in the /dev directory in the initrd.

This can lead to trouble when attempting to use the semantically more correct xvd* devices. Because many distros don't include the appropriate device nodes, they'll fail to boot. The solution, then, is to use the hd* or sd* devices in the disk= line, thus:

    disk = ['phy:/dev/tempest/sebastian,sda1,r']
    root = "/dev/sda1"

After starting the domain successfully, you can create the xvd devices properly or edit your udev configuration.

The Xen block driver may also have trouble attaching to virtual drives that use the sdX naming convention if the domU kernel includes a SCSI driver. In that case, use the xvdX convention, like this:

    disk = ['phy:/dev/tempest/sebastian,xvda1,r']
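When hunting for typos in a disk= line, it can help to pull the specification apart into its three comma-separated fields: backing device, device name as seen by the domU, and access mode. The following sketch handles only the simple phy: and file: forms. The function name and the dictionary keys are our own:

```python
def parse_disk_spec(spec):
    """Split a disk spec like 'phy:/dev/vg/lv,sda1,r' into its parts.

    Illustrative only; handles the simple 'type:path,device,mode' form.
    """
    backend, frontend_device, mode = spec.split(",")
    backend_type, _, backend_path = backend.partition(":")
    return {
        "type": backend_type,        # e.g. "phy" or "file"
        "path": backend_path,        # backing device or file in dom0
        "device": frontend_device,   # name the domU sees, e.g. "xvda1"
        "mode": mode,                # access mode as written in the config
    }


spec = parse_disk_spec('phy:/dev/tempest/sebastian,xvda1,r')
```

A specification with a missing field raises ValueError at the split, which is usually a clearer hint than the error xm produces for the same mistake.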

Troubleshooting Disks

Most disk-related errors will cause the domU creation to fail immediately. This makes them fairly easy to troubleshoot. Here are some examples:

    Error: DestroyDevice() takes exactly 3 arguments (2 given)

These pop up frequently and usually mean that something's wrong in the device specification. Check the config file for typos in the vif= and disk= lines. If the message refers to a block device, the problem is often that you're referring to a nonexistent device or file.

There are a few other errors that have similar causes. For example:

    Error: Unable to find number for device (cdrom)

This, too, is usually caused by a phy: device with an incorrectly specified backing device.

However, this isn't the only possible cause. If you're using file-backed block devices, rather than LVM volumes, the kernel may have run out of block loops on which to mount these devices. (In this case, the message is particularly frustrating because it seems entirely independent of the domain's config.) You can confirm this by looking for an error in the logs like:

    Error: Device 769 (vbd) could not be connected. Backend device not found.

Although this message usually means that you've mistyped the name of the domain's backing storage device, it may instead mean that you've run out of block loops. The default loop driver only creates seven of the things, barely enough for three domains with root and swap devices.

We might suggest that you move to LVM, but that's probably overkill. The more direct answer is to make more loops. If your loop driver is a module, edit /etc/modules.conf and add:

    options loop max_loop=64

or another number of your choice; each domU file-backed VBD will require one loop device in dom0. (Do this in whatever domain is used as the backend, usually dom0, although Xen's new stub domains promise to make non-dom0 driver domains much more prevalent.) Then reload the module. Shut down all domains that use loop devices (and detach loops from the dom0) and then run:

    # rmmod loop
    # insmod loop

If the loop driver is built into the kernel, you can add the max_loop option to the dom0 kernel command line. For example, in /boot/grub/menu.lst:

    module linux-2.6-xen0 max_loop=64

Reboot and the problem should go away.
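Since each file-backed VBD consumes one loop device, you can estimate a sensible max_loop value by counting the file: entries across your domain configs. The sketch below assumes that you've already collected each domain's disk= list. The function name and the example paths are hypothetical:

```python
def loops_needed(domain_disks):
    """Count file-backed VBDs across several domains' disk= lists."""
    count = 0
    for disks in domain_disks:
        for spec in disks:
            if spec.startswith("file:"):
                count += 1          # one loop device per file-backed VBD
    return count


# Hypothetical configs: two file-backed domains and one LVM-backed domain.
example = [
    ["file:/var/xen/a/root.img,sda1,w", "file:/var/xen/a/swap.img,sda2,w"],
    ["file:/var/xen/b/root.img,sda1,w", "file:/var/xen/b/swap.img,sda2,w"],
    ["phy:/dev/tempest/sebastian,sda1,w"],
]
needed = loops_needed(example)
```

Here needed is 4. Leave some headroom above that figure when choosing max_loop, because any loop devices that dom0 uses for its own purposes count against the same limit.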

VM Restarting Too Fast

Disk problems, if they don't announce themselves through a specific error message, often manifest in log entries like the following:

    [2007-08-23 16:06:51 xend.XendDomainInfo 2889] ERROR
    (XendDomainInfo:1675) VM sebastian restarting too fast (4.260192
    seconds since the last restart). Refusing to restart to avoid loops.

This one is really just Xen's way of asking for help; the domain is stuck in a reboot cycle. Start the domain with the -c option (for console autoconnect) and look at what's causing it to die on startup. In this case, the domain booted and immediately panicked for lack of a root device.

Note: In this case, the VM is restarting every 4.2 seconds, long enough to get console output. If the restarting too fast number is less than 1 or 2 seconds, often xm create -c shows no output. If this happens, check the logs for informative messages. See later sections of this chapter for more details on Xen's logging.
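To pull the domain name and restart interval out of such log entries, a regular expression over the message text is enough. This sketch assumes that each entry sits on a single line in the log file and that the wording matches the example above. The pattern and function name are our own:

```python
import re

# Matches the "restarting too fast" message shown in the example log entry.
RESTART_RE = re.compile(
    r"VM (?P<name>\S+) restarting too fast "
    r"\((?P<seconds>[0-9.]+) seconds since the last restart\)")


def parse_restart_message(line):
    """Return (domain_name, seconds) from a log line, or None if no match."""
    match = RESTART_RE.search(line)
    if match is None:
        return None
    return match.group("name"), float(match.group("seconds"))
```

An interval under a second or two suggests that the console won't show anything useful, so go straight to the logs, as the note above recommends.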

Troubleshooting Xen's Networking

In our experience, troubleshooting Xen's networking is a straightforward process, given some general networking knowledge. Unless you've modified the networking scripts, Xen will fairly reliably create the vif devices. However, if you have problems, here are some general guidelines. (We'll focus on network-bridge here, although similar steps apply to network-route and network-nat.)

To troubleshoot networking, you really need to understand how Xen does networking. There are a number of scripts and systems working together, and it's important to decompose each problem and isolate it to the appropriate components. Check Chapter 5 for a general overview of Xen's network components.

The first thing to do is run the network script with the status argument. For example, if you're using network-bridge, /etc/xen/scripts/network-bridge status will provide a helpful dump of the state of your network as seen in dom0. At this point you can use brctl show to examine the network in more detail, and use the xm vnet-create and vnet-delete commands in conjunction with the rest of the userspace tools to get a properly set up bridge and Xen virtual network devices.

When you've got the backend sorted, you can address the frontend. Check the logs and check dmesg from within the domU to make sure that the domU is initializing its network devices.

If these look normal, we usually attack the problem more systematically, from bottom to top. First, make sure that the relevant devices show up in the domU. Xen creates these pretty reliably. If they aren't there, check the domU config and the logs for relevant-looking error messages.

At the next level (because we know that the dom0's networking works, right?) we want to check that the link is functioning. Our basic tool for that is arping from within the domU, combined with tcpdump -i [interface] on the domU's interface in the dom0.