Problems booting OpenSUSE on IBM system p machines

Lucas Meneghel Rodrigues: Linux on Power - Tue, 07/22/2008 - 15:47
One of these days I was trying to boot the latest OpenSUSE image on one of my p series machines, to see how the new release was doing on power. However, I hit a problem: I was not able to boot the installation program on my openfirmware based machine. My friend Carlos figured out what [...]

RHEL 5.2 and HPC performance hints

Bill Buros: Improving Performance on Linux - Thu, 07/17/2008 - 07:10
Building on the SLES 10 sp2 kernel build post from a couple of weeks ago, we got the equivalent RHEL 5.2 page posted under the developerWorks umbrella. Mostly the same conceptual steps, but a little different in the specifics. And of course, in the RHEL 5.2 example, we reverse the example from SLES 10 by building a 4KB kernel where "normally" the RHEL 5.2 kernel is based on the 64KB pages. It's a good experiment to play with when you want to see the performance gains that emerge from leveraging larger page sizes.
We linked this in under the HPC Central wiki page where several of us are playing around with adding descriptive how-to's for HPC workloads based on practical experience.

See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.

asynchronous spu contexts, initial designs

Jeremy Kerr - Wed, 07/16/2008 - 00:21

I've recently been working on some changes to the spufs code, and thought I'd write-up some of the details.

At present, the spu_run syscall (used to run a SPU context) blocks until the SPU program has exited (or some other event has happened, such as a non-serviceable fault). This means that to take advantage of the SPUs, you really need to start a new thread for each SPU context that you create, otherwise your application will be sitting around waiting for each SPU context to complete.

In fact, we have an invariant in the spufs code at the moment that only contexts that are currently being spu_run will ever be runnable (and, at the moment, schedulable).

Ben H and I have been chatting about some ideas about asynchronous spu contexts. This means that the userspace app can start the context, then later retrieve the status of the SPU context (to see if it has stopped, faulted, or whatever). We can then use standard POSIX semantics like poll() to see if a context is still running or has generated any "events", then handle these events when they become available.

In effect, this is similar to spu_run: currently, the spu_run syscall runs the SPU, then blocks until an event happens, which is then returned to userspace as the return value of spu_run. The main difference is that we don't block in the kernel while the SPU is running.

So, I've been coding up an experimental change to spufs. Firstly, we have to explicitly tell the kernel that we want a context to operate in asynchronous mode, so I've added a new flag to the spu_create syscall: SPU_CREATE_ASYNC.

I've opted for a file-based interface to these asynchronous contexts - SPU events are retrieved by reading from a file. Contexts that are created with the SPU_CREATE_ASYNC flag have an extra file present (called something like "event") in their context directory in the spufs mount. Reading from this file allows applications to retrieve events that the SPU program has raised.

We need to define a format for the data read from this events file, so here's something to get started with:

struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};

- where the event member specifies which event happened - a stop-and-signal for example.

The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.

So, users of this interface may look something like this:

uint32_t npc = 0;
struct context {
        int fd;
        int event_fd;
} context;

/* create the context */
context.fd = spu_create("/spu/ctx", NULL, SPU_CREATE_ASYNC);

/* open the events file */
context.event_fd = openat(context.fd, "event", O_RDWR);

/* start the context running. unlike the spu_run syscall,
 * this function does not block for the duration of the
 * spu program */
run_context(&context, npc);

for (;;) {
        struct spu_event event;

        /* get the next event caused by the SPU */
        read(context.event_fd, &event, sizeof(event));

        if (event.event == SPU_EVENT_STOP)
                break;

        /* handle other event ... */
}

Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.

This isn't far from the API provided by the current spu_run syscall, except that we're not waiting in the kernel while the SPU is running.
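
As a rough sketch of the poll() idea mentioned earlier: since events arrive through a file descriptor, an application could wait for them with a timeout instead of blocking in read(). The code below is only an illustration under stated assumptions. The value of SPU_EVENT_STOP is made up (the post names the constant but gives no number), wait_for_event is a hypothetical helper, and whether the prototype's event file actually supports poll() is not confirmed here.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Event layout proposed in the post. */
struct spu_event {
        uint32_t event;
        uint32_t status;
        uint32_t npc;
};

/* Hypothetical value: the post names SPU_EVENT_STOP but gives no number. */
#define SPU_EVENT_STOP 1

/*
 * Hypothetical helper: wait up to timeout_ms for one event on event_fd.
 * Returns 1 if an event was read into *ev, 0 on timeout, and -1 on a
 * poll error, an unexpected poll result, or a short read.
 */
int wait_for_event(int event_fd, struct spu_event *ev, int timeout_ms)
{
        struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
        int rc;

        assert(ev != NULL);

        rc = poll(&pfd, 1, timeout_ms);
        if (rc <= 0)
                return rc;      /* 0: timeout, -1: poll error */
        if (!(pfd.revents & POLLIN))
                return -1;      /* hang-up or error on the descriptor */
        if (read(event_fd, ev, sizeof(*ev)) != (ssize_t)sizeof(*ev))
                return -1;
        return 1;
}
```

With a helper like this, the for-loop in the example above could call wait_for_event(context.event_fd, &event, 100) and do other work whenever it returns 0, rather than sleeping in read().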

Also, we're going to need to control the SPU somehow - for example, we need to implement the run_context function in the pseudocode above. Rather than overloading the spu_run syscall, I've opted to use the same event file - writes to this file will allow userspace to control the SPU. I'm still working out the exact format of these writes, but the way I've implemented it at the moment is that the application can write structures of this layout to the file:

struct spu_control {
	uint32_t op;
	char data[];
};

The contents of the data member depend on the operation requested (specified by the op member). For example, a 'start spu' operation would have four extra bytes - a uint32_t containing the NPC to start the SPU execution from. A 'stop spu' operation doesn't require any extra parameters, so the data member would be 0 bytes long.

This would allow us to implement the run_context function as follows:

void run_context(struct context *context, uint32_t npc)
{
        uint32_t buf[2];

        buf[0] = SPU_CONTROL_START_SPU;
        buf[1] = npc;

        /* context is a pointer here, so use -> rather than . */
        write(context->event_fd, buf, sizeof(buf));
}
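
The variable-length layout can also be packed through the struct itself rather than a uint32_t array. The following is a sketch only. The numeric values of the op constants are invented (the post does not assign any), SPU_CONTROL_STOP_SPU and the helpers send_control, start_spu and stop_spu are hypothetical names, and the write format is, as noted above, still being worked out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Control layout proposed in the post. */
struct spu_control {
        uint32_t op;
        char data[];
};

/* Hypothetical values: the post does not assign numbers to operations. */
#define SPU_CONTROL_START_SPU 1
#define SPU_CONTROL_STOP_SPU  2

/*
 * Write one control message: the op word followed by data_len payload
 * bytes. Returns 0 on success, -1 on allocation failure or short write.
 */
int send_control(int event_fd, uint32_t op, const void *data, size_t data_len)
{
        size_t len = sizeof(struct spu_control) + data_len;
        struct spu_control *msg;
        ssize_t written;

        assert(data != NULL || data_len == 0);

        msg = malloc(len);
        if (!msg)
                return -1;

        msg->op = op;
        if (data_len)
                memcpy(msg->data, data, data_len);

        written = write(event_fd, msg, len);
        free(msg);
        return written == (ssize_t)len ? 0 : -1;
}

/* 'start spu' carries a 4-byte NPC as its payload. */
int start_spu(int event_fd, uint32_t npc)
{
        return send_control(event_fd, SPU_CONTROL_START_SPU, &npc, sizeof(npc));
}

/* 'stop spu' carries no payload, so only the op word is written. */
int stop_spu(int event_fd)
{
        return send_control(event_fd, SPU_CONTROL_STOP_SPU, NULL, 0);
}
```

Checking the return value of write() matters more here than in a throwaway example, since a short write would leave the kernel with a truncated control message.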

There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!

The building blocks of HPC

Bill Buros: Improving Performance on Linux - Tue, 07/15/2008 - 07:04
Top 500 again. Linpack HPL. Hitting half a teraflop on a single system.

Using RHEL 5.2 on a single IBM Power 575 system, Linux was able to hit half a teraflop with Linpack. These water-cooled systems are pretty nice. Thirty-two POWER6 cores packed into a fairly dense 2U rack form factor. These systems are designed for clusters, so 14 nodes (14 single systems) can be loaded into a single rack. Water piping winds its way into each system and over the cores (we of course had to pop one open to see how things looked and worked). The systems can be loaded with 128GB or 256GB of memory. A colleague provided a nice summary of the Linpack result over on IBM's developerWorks.

For Linux, there are several interesting pieces, especially as we look at Linpack as one of the key workloads that takes advantage of easy HPC building blocks. RHEL 5.2 comes with 64KB pages. The 64KB pages provide easy performance gains out of the box. The commercial compilers and math libraries provide the tailored and easy exploitation of the POWER6 systems. Running Linpack on clusters is the whole basis for the Top 500 workloads.

It's easy to take advantage of the building blocks in RHEL 5.2. OpenMPI in particular, the Infiniband stack, libraries tuned for the POWER hardware are all included. When we fire up a cluster node, we automatically install these components.
  • openmpi including -devel packages
  • openib
  • libehca
  • libibverbs-utils
  • openssl
These building blocks allow us to take the half-a-teraflop single system Linpack result and begin running it "out-of-the-box" on multiple nodes. There are cluster experts around that I'm learning from. Lots of interesting new challenges in the interconnect technologies and configurations. In this realm, I'm learning that one of the technology shifts emerging is the 10GbE (10 Gigabit Ethernet) interconnect vs Infiniband. Infiniband has all sorts of learning curves associated with it. Every time I try to do something with Infiniband, I'm finding another thing to learn. It'll be interesting to see whether the 10GbE technology will be more like simply plugging in an Ethernet cable and off we go. A good summer project...

GCC Summit 2008 - been there, got the t-shirt

Thiago Bauermann: Linux on Power - Mon, 07/14/2008 - 01:05

I have finally found some time to write a bit about the GCC Developer’s Summit 2008, which happened one month ago in Ottawa, Canada (well, I didn’t really find time, since it’s past 1:30 AM now but still…).

In summary, I had a blast there! I was in last year’s summit and enjoyed it and learned a lot from it. But this time I already knew GDB people and they knew me, and I am involved in a couple of current developments and have more experience with the project, all of which made some difference. And everybody there is very friendly, of course, even if they never heard of you before. :-) In fact, Ian Taylor in his welcome presentation urged people to be friendly to newcomers since GCC and the GNU toolchain need new blood.

There was a good number of GDB-related events: a GDB talk and two debugging information talks, a debug information BoF, an informal GDB get-together and a GDB BoF. Unfortunately I know squat about GCC internals (I intend to learn more about it, but didn’t have a chance yet) so the two debugging information talks were above my head, and I absorbed little. The GDB talk was interesting but since I follow the GDB mailing lists I already knew most of what was presented.

The debug information BoF was interesting, especially since the discussion didn’t focus so much on the two competing approaches to improve debug information (which was the original point of the BoF), but mostly on what should be expected from debug information generated by GCC (i.e., what a debugger should be able to do with it, especially at higher optimization levels), and how its quality can be tested in the GCC testsuite.

The most interesting events for me were of course the GDB get-together and the GDB BoF. The former was a table reserved for us at lunch one day (thanks for organizing this, Joel Brobecker!) where folks interested in GDB would get to see each other’s faces and talk about random stuff (GDB-related or not). It was fun, and we were able to throw some ideas around about things such as conversion of the GDB repository from CVS to Subversion, the patch review process, and even about rewriting GDB in C++ (which is a hot thread in the GDB mailing list today!). I have a picture of the event:

If you follow the link you can see the notes with the name of each person in the photo above.

The GDB BoF was very interesting, and it felt weird to be at the front (thanks for inviting me Daniel!) discussing current GDB issues with Daniel Jacobowitz, Tom Tromey, Pedro Alves (the other people at the front) and the other GDB maintainers and developers in the room.

We nailed down some pending issues that were being discussed in the mailing list at the time regarding Python scripting support (man, it’s so much easier to decide things face to face rather than by e-mail!), and also discussed a bit of reversible debugging, multithreading GDB itself, GDB scalability, what to do regarding the next release (in a nutshell: wait about a year from the last release so that all the cool stuff which is being worked on right now gets in and settles down), moving the bugs database from GNATS to bugzilla (thanks for doing this Tromey!) etc.

Also after the BoF Pedro Alves gave a very good improvised tutorial on the GDB event loop which he has been studying for the past few months. It felt like cheating, to get all that knowledge in what, half an hour? :-) Thanks so much Pedro, it was awesome.

And of course all the interaction with the people who were there, like Joel Brobecker (playing tennis is more serious than I thought!), Gaius Mulley (Pink Floyd!), Anmol Paralkar, Ramana Radhakrishnan and many others (I don’t even try to enumerate, just a random sample).

I shared a suite in Ottawa with David Edelsohn and Kenneth Zadeck, which was an interesting thing in itself. Heading back to the hotel felt like going to an extended GCC summit. :-) I almost learned something about GCC internals (SSA, LTO, register allocation) and also had very interesting conversations in general.

And of course my one week of backpacking in Canada after the summit, which was another blast. :-)

Browsing your Creative Zen Vision M mp3 player with MTPfs

Written by Daniel Felix Ferber (techtavern.wordpress.com) This article explains how to browse your Creative Zen mp3 player from Linux command line, using MTPfs. MTPfs is a fuse file system implementation that wraps libmtp. libmtp is an open source implementation of the MTP (Media Transfer Protocol), intended to transfer media files from/to players. With MTPfs, you can [...]

My Book World: Linux NAS done well (almost...)

Darren Hart: Real Time - Sun, 07/06/2008 - 00:18

I picked up a 1 TB Western Digital My Book World from Fry's today - that's a 1 Terabyte Gigabit Network Attached Storage box - for $220. I've seen a few consumer technical appliances that run Linux under the covers, and haven't been terribly impressed with many of them. So far however, this box is slick. While WD doesn't support Linux officially (bad WD!) it is pretty trivial to get the box into a techie-friendly state.

With only an hour or so of tinkering, I was able to enable ssh, disable the java server, add my own users, and mount my partitions via sshfs. The info on doing this is already covered by "Paul" at his How to setup My Book World Edition II page so I won't duplicate that here. Should also probably link to Martin Hinner's My Book Howto directly as he appears to have done the hard work. I'll spend the next few days figuring out how to best make use of this device; most likely looking into something like rdiff-backup.

Something they don't mention are the specs of this little beauty, for those of you who might care, read on...


SystemTap Without Debug Info

Michael Strosaker: Power RAS - Wed, 07/02/2008 - 16:31
Just a short post today to mention a new feature in SystemTap that I should have mentioned a while ago.  A primary barrier to the adoption of SystemTap has been the requirement that SystemTap have access to the DWARF debug information for the kernel and modules.  This is no longer the case; as of [...]

Ext4 is now the primary filesystem on my laptop

Ted Tso: Kernel Hacker - Mon, 06/30/2008 - 14:06

Over the weekend, I converted my laptop to use the ext4 filesystem.  So far so good!  So far I’ve found one bug as a result of my using ext4 in production (if delayed allocation is enabled, i_blocks doesn’t get updated until the block allocation takes place, so files can appear to have 0k blocksize right after they are created, which is confusing/unfortunate), but nothing super serious yet.  I will be doing backups a bit more frequently until I’m absolutely sure things are rock solid, though!
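
To see what this kind of report looks like from userspace, the block count in question is the st_blocks field returned by stat(2), which Linux reports in 512-byte units. The sketch below uses a hypothetical helper, write_and_stat, which writes a small file and stats it immediately. It assumes nothing about ext4 itself. On a filesystem with delayed allocation the count may be 0 at that point and non-zero after a later sync, which is the confusing behaviour described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) path, write len bytes from
 * buf, and fill *st with what fstat() reports straight afterwards.
 * Returns 0 on success, -1 on any error.
 */
int write_and_stat(const char *path, const char *buf, size_t len,
                   struct stat *st)
{
        int fd;
        int rc = -1;

        assert(path != NULL && buf != NULL && st != NULL);

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;

        /* st_size reflects the write at once; st_blocks may lag behind. */
        if (write(fd, buf, len) == (ssize_t)len && fstat(fd, st) == 0)
                rc = 0;

        close(fd);
        return rc;
}
```

Comparing st.st_blocks from this call with the value seen after running sync gives a quick way to check whether a given kernel still shows the discrepancy.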

I am using the latest ext4 patches and the tip of the e2fsprogs git repository.  Hopefully when we get the bulk of the patches merged into the mainline kernel after the 2.6.26 ships and the 2.6.27 merge window opens, and after I ship out e2fsprogs 1.41 (I have one work-in-progress pre-release, with another coming soon), it’ll be ready for much more wide-spread testing.

In addition to the excellent crew of ext4 developers, I’d like to call out for special thanks Gary Howco and Holger Kiehl, two early users/benchmarkers of ext4 who tried our latest code, and reported bugs that had previously escaped attention by developers (who had been mostly testing the code via the same old test suites); their additional workloads and benchmarks flushed out a few additional bugs.   Thanks, guys!!

Hopefully after a few weeks of my using ext4 for real-life work, I’ll find the last few bugs to be fixed, and/or feel much more confident it’s ready for me to recommend to others for their production data.

debian on a qs22 cell blade

Jeremy Kerr - Sun, 06/29/2008 - 23:25

Seeing as the QS22s are out, here's a short guide to getting debian installed.

[jk@qs22 ~]$ grep -m1 ^cpu /proc/cpuinfo
cpu             : Cell Broadband Engine, altivec supported
[jk@qs22 ~]$ lsb_release -d
Description:    Debian GNU/Linux unstable (sid)

kernel

You'll need a kernel that has support for the IBM Cell blades. If you configure your kernel with the 'cell_defconfig' target, you should have all the necessary options:

[jk@pingu linux-2.6.25]$ make cell_defconfig

Specifically, you need:

  • CONFIG_PPC_IBM_CELL_BLADE;
  • CONFIG_SERIAL_OF_PLATFORM;
  • CONFIG_FUSION_SAS;
  • CONFIG_ROOT_NFS;
  • CONFIG_IP_PNP_DHCP and
  • CONFIG_SPU_FS.

root filesystem

The QS22s have no internal disk (they're compute nodes, right?), so you'll have to either:

  • use a remote root filesystem, like NFS; or
  • add an LSI SAS adaptor to the blade, and use an external SAS disk for the root filesystem.

The installation process will be different depending on which you choose, so just skip to the appropriate section here.

NFS root

For the first option, there's a number of NFS-root howtos around. First up, we need to build the actual debian filesystem, using debootstrap. For example:

[jk@pingu ~]$ sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/

This will create an entire debian filesystem in /srv/nfs/qs22. We need to make a few modifications though:

  1. add the following line to /etc/inittab:
    T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100
  2. add a couple of extra device nodes:
    [jk@pingu ~]$ cd /srv/nfs/qs22/dev
    [jk@pingu dev]$ sudo mknod console c 5 1
    [jk@pingu dev]$ sudo mknod ttyS0 c 4 64

Once this is done, we need to complete the bootstrap on the QS22. Set up your NFS server, and export the appropriate directory. Boot the QS22 with the nfs root kernel options, plus "rw init=/bin/sh" (eg root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw init=/bin/sh). Then, once the machine has booted:

sh-3.2# PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage

This should finish the bootstrap. After it has completed (it should finish with "I: Base system installed successfully"), reboot the machine with the same kernel command line, minus the rw init=/bin/sh arguments. Once it boots, you should have the debian login prompt. Log in as root (there will be no password, but don't forget to set one) and away you go.

SAS disk

If you're using SAS, the install is much more straightforward, as you can just use the standard debian installer. However, you may need to use a custom kernel which supports the QS22s. This is a matter of building your own kernel, using the powerpc64 debian installer image as the initramfs:

[jk@pingu linux-2.6.25]$ wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz
[jk@pingu linux-2.6.25]$ gunzip -c < initrd.gz > initrd
[jk@pingu linux-2.6.25]$ make cell_defconfig
[jk@pingu linux-2.6.25]$ sed -i -e 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config
[jk@pingu linux-2.6.25]$ make

Then, just boot the kernel in arch/powerpc/boot/zImage.pseries. The debian installer should start, and guide you through the rest of the installation. Since you're netbooting, you can ignore any messages about not having a bootstrap partition, or not being able to install a kernel or yaboot.

software

Entirely optional, but you'll probably get the most out of your QS22 with a few extra packages:

[jk@qs22 ~]$ sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential

Building a distro kernel on Power - not so bad

Bill Buros: Improving Performance on Linux - Sun, 06/29/2008 - 13:58
This should be simple. And when you know all the steps, it is. But I was surprised how challenging it's been to find easy examples of the steps to re-build a commercially shipping "distro" kernel, in this case the SLES 10 sp1 kernel.

It's probably documented cleverly in the end user documentation - but I'm far too addicted to the ease of googling compared to the inevitable drudgery of digging through user documentation. I always wonder when the "documentation community" will simply shift to wiki pages to document, but more importantly, maintain the correctness and accessibility, of end user documentation.

For this exercise, turns out we wanted to do something simple to a SLES 10 kernel shipping on Power. In our case, we wanted to see if we could re-build the distro kernel to support the 64KB pages available in the Power6 hardware systems. For the performance angle, 64KB pages can often significantly improve the performance of applications. Normally, when working with the Linux community, we simply snag the latest mainline kernel and work with that, but in this case, we were really interested in the specific performance differences between 4KB today and the expected performance of 64KB pages on the same base.

Out of that exercise, we created a new wiki page which documented the steps to re-build the SLES 10 kernel. A peer, Peter Wong, has already documented the RHEL 5.2 steps; we're just waiting for some web site maintenance to complete on the IBM developerWorks infrastructure to get that page posted as well.

For the SLES 10 sp1 (and sp2) kernel re-build instructions, see
Recently Jon Tollefson was playing around on the SLES 10 sp2 kernel and found that there's a missing file in the SLES 10 sp2 kernel package, so we had to comment out a line in the kernel-ppc64.spec file (modprobe.d-qeth).

One interesting aspect is I had thought the kernel re-build process would be precise and seamless. But there were a few tricks that had to be done to make it work.

One of them was adding control to be able to run the make on all of the CPUs seen by Linux.
%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)
We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.

Second, and there's probably a more clever way to do this, but we ended up having to unpack, modify, and re-pack the config.tar.bz2 file for the platform.

The last interesting aspect was the built-in "kabi" protections. When we first re-built the kernel, the build failed because this failed the kabi tolerance level. Very clever. I assume various kernel interfaces are flagged with KABI values, which when changed, cause the build to fail. In our case, we knew it would change things in the kernel, so we modified the tolerance value to allow for the kernel re-build.

So. Easy to do, easy to make changes, and for a performance team, easy to minimize how much is changing from one step to the next. By starting with a known entity in the distro kernel, we make one change, verify the performance differences, and then proceed to the next change. Simple. Methodical. Straight-forward.

Episode 10 - Embedded PowerPC with Josh Boyer

Original Planet LTC Linux Podcasts - Fri, 06/27/2008 - 08:18

Embedded 4xx PowerPC Linux Kernel maintainer Josh Boyer discusses activity in the embedded linux space.

Additional Information

  • Embedded Linux kernel mailing list at http://vger.kernel.org/vger-lists.html#linux-embedded
  • Embedded and powerpc developers lists at http://ozlabs.org/mailman/listinfo
  • Embedded PowerPC info at http://penguinppc.org/embedded/

Shameless Analyst Report Plug: “IBM & Linux – 9 Years Later”

A colleague sent me a link to this analyst paper today that takes a look at whether IBM has made good on the Linux promises it made back in 1999. I’m obviously biased, but I’m interested in hearing if anyone has thoughts on this topic.

Here’s the report: ftp://ftp.software.ibm.com/linux/pdfs/GCG_IBM_and_Linux-9_years_later.pdf

The opening teaser:

In 1999, IBM issued a series of announcements fully committing the company to supporting Linux. IBM vowed to Linux-enable all of their hardware platforms, including their non-x86 based mainframe, mini, and RISC-based systems. They also promised to release Linux versions of their software products and develop
Linux-centric service practices. Moreover, they pledged significant resources to the Linux community with the goal of advancing Linux and open source technology.

So, nine years later, did IBM deliver on these promises? Was their commitment to Linux genuine or just lip service? This report examines IBM’s current Linux products, services, and community support in light of the promises they made in 1999…

While I think it’s obvious IBM has been a huge investor in the Linux community, one thing that I noticed reading the report is just how much IBM is actually different from other community members. There are some noticeable differences in the investments and approach to supporting the Linux platform and community. I often forget to just take in all the Linux technologies IBM has been heavily involved in from Xen, KVM and libvirt to filesystems, to systemtap, kprobes and then there’s RAS, scalability and performance enhancements.

Another interesting thought to reflect on is just how important it has been that there are multiple investors in this field. If this report captures just what IBM did, think of the industry combined. IBM couldn’t have done anything this big with Linux if it weren’t for co-creating with a community of enthusiasts, researchers, governments, Intel, AMD, Google, Nokia, Motorola, Oracle and thousands more. What would the report look like if you compiled all the investments and work the entire community leveraged across the industry? Linux is “bigger than huge” when you stop to think about it. This is also why I’ve said for a couple years now when you extend the investment model 3 to 5 years into the future, Sun and its anti-Linux, Solaris push against the tide of the industry loses in the end. I think we’re starting to witness that now. Sure, OpenSolaris is a great idea… it’s just 9 years late and it’s too late to matter now.

I’m interested in outside perspectives too - where do you think IBM stands? Has the community development and investment model worked? Where will this lead in the future and what will be the next evolution of the model? Red Hat seems to think the model will evolve to include increased customer co-creation - I tend to agree. Why? Because the incentive model to invest aligns very well - and when you have alignment, it almost naturally will happen.

WAN-accelerator for NFS

Ronnie Sahlberg: CTDB - Tue, 06/24/2008 - 11:45
I've been working a bit with WAN-acceleration for CTDB. Actually with two different approaches for two different purposes.


WAN-accelerator #1 (general purpose)

The first approach was to add new "capabilities" to the CTDB daemon so that you could have a cluster of CTDB nodes where some nodes were located at a very remote site, across a high-latency WAN-link. This was tricky to solve since even though you have nodes that participate synchronously in the cluster you do not want the high WAN-link latency to affect performance on the nodes in the main datacentre.

Initial tests seem to indicate that it works quite well. Surprisingly well.

But this is not really a WAN-accelerator. A classic WAN-accelerator is more a device that performs a man-in-the-middle attack on the CIFS/NFS protocols and performs some (sometimes unsafe) caching.

In the CTDB approach above there is no man-in-the-middle attack, nor is it really a WAN-accelerator. It is conceptually more like one single multihomed CIFS server where one of the NICs (the remote site) happens to be a few hundred ms away. Thus we don't have to play any tricks, nor do any questionable caching: we are still a single CIFS server, with fully and 100% correct CIFS semantics, it's just that this CIFS server is spread out across multiple sites.

I.e. clients on the remote site talk to the genuine real CIFS server. Not a man-in-the-middle imposter that may or may not provide correct semantics.

WAN accelerator #2 (nfs)

A different solution was based on FUSE and providing very aggressive caching of data and metadata for NFS. This one also seems to perform really well but is obviously less cool than "a single multihomed cifs server spanning multiple sites".

Parked up in Portland for a bit

Mel Gorman: Kernel Spanner - Tue, 06/24/2008 - 01:22
I am just back online after being absent for two weeks. If you were trying to get in touch with me in that time and I am apparently ignoring mails, nudge me again because it is lost in the mess. The start of the absence was due to presenting a paper at ISMM 2008 in Tucson, Arizona. The conference is fairly in-depth and was of significant interest, even though many of the discussions were about managed languages and garbage collection, which I do not ordinarily focus on. Luckily for me, attending next year will be relatively handy as it is due to be held in Dublin, Ireland.

Post conference, I drove to Portland, Oregon over the course of two weeks stopping off at various places along the way such as Vegas (a unique place), Bagdad (in Arizona, not the other one), Los Angeles and Yosemite park. I'm now parked up in Portland for the next 4 weeks so if anyone in the area wants to meet up that has met me in the past, drop a line and we'll make something happen.

linux.conf.au hackfest: the solution, part three

Jeremy Kerr - Sun, 06/22/2008 - 18:20

In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...

Our baseline performance is 40.7 seconds to generate the sample fractal (using the sample fractal parameters).

The initial SPE-based fractal renderer used only one SPE. However, we have more available:

Number of SPEs available in current CBEA machines

  Machine                   SPEs available
  Sony Playstation 3        6
  IBM QS21 / QS22 blades    16 (8 per CPU)

So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.

The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.

[Image: SPE fractal divisions]

In order to split up the work, we first need to tell each SPE which part of the fractal it is to render. We'll add two variables to the spe_args structure (which is passed to the SPE during program setup), to provide the total number of threads (n_threads) and the index of this SPE thread (thread_idx).

struct spe_args {
        struct fractal_params fractal;
        int n_threads;
        int thread_idx;
};

spe changes

On the SPE side, we use these parameters to alter the invocations of render_fractal. We set up another couple of convenience variables:

        rows_per_spe = args.fractal.rows / args.n_threads;
        start_row = rows_per_spe * args.thread_idx;

And just alter our for-loop to only render rows_per_spe rows, rather than the entire fractal:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }
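One caveat with this split: the integer division used for rows_per_spe drops any leftover rows when fractal.rows is not an exact multiple of n_threads. The following is a minimal sketch of one way around that, not the code from the tarball. The struct strip type and the strip_for_thread() helper are hypothetical names introduced here for illustration; the idea is simply to give the first few threads one extra row each.

```c
#include <assert.h>

/* Hypothetical type, not from the original source: the contiguous
 * strip of rows assigned to one SPE thread. */
struct strip {
        int start_row;
        int n_rows;
};

/* Split `rows` rows between `n_threads` threads. The first
 * (rows % n_threads) threads take one extra row each, so every row
 * is assigned to exactly one thread. */
struct strip strip_for_thread(int rows, int n_threads, int thread_idx)
{
        struct strip s;
        int base, extra;

        assert(n_threads > 0 && thread_idx >= 0 && thread_idx < n_threads);

        base = rows / n_threads;
        extra = rows % n_threads;

        s.n_rows = base + (thread_idx < extra ? 1 : 0);
        s.start_row = base * thread_idx +
                        (thread_idx < extra ? thread_idx : extra);
        return s;
}
```

For 10 rows and 4 threads this gives strips of 3, 3, 2 and 2 rows, starting at rows 0, 3, 6 and 8. The sketch ignores rows_per_dma: a strip whose length is not a multiple of rows_per_dma would also need a shorter final transfer in the loop above.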

ppe changes

The changes to the PPE code are fairly simple - instead of just creating one thread, create n threads.

First though, let's add a '-n' argument to the program to specify the number of threads to start:

        while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
                switch (opt) {

                /* other options omitted */

                case 'n':
                        n_threads = atoi(optarg);
                        break;
                }
        }
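Since atoi() has no way to report a parse error, a stricter version of the '-n' handling may be worth having. The sketch below is an assumption about how that could look, not part of the original program: parse_n_threads() is a hypothetical helper, and max_threads is a limit the caller has to supply (for example, the number of SPEs on the machine).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a thread count from a command-line
 * argument. Returns the count (1..max_threads), or -1 if the string
 * is not a valid number in that range. */
int parse_n_threads(const char *str, int max_threads)
{
        char *end;
        long val;

        assert(str != NULL);

        errno = 0;
        val = strtol(str, &end, 10);

        /* reject overflow, empty input and trailing garbage */
        if (errno != 0 || end == str || *end != '\0')
                return -1;

        if (val < 1 || val > max_threads)
                return -1;

        return (int)val;
}
```

The 'n' case would then call parse_n_threads(optarg, limit) and print a usage message if it returns -1.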

Rather than just creating one SPE thread, we create n_threads. Also, we have to set the thread-specific arguments for each thread:

        for (i = 0; i < n_threads; i++) {
                /* copy the fractal data into this thread's args */
                memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));

                /* set thread-specific arguments */
                threads[i].args.n_threads = n_threads;
                threads[i].args.thread_idx = i;

                threads[i].ctx = spe_context_create(0, NULL);

                spe_program_load(threads[i].ctx, &spe_fractal);

                pthread_create(&threads[i].pthread, NULL,
                                spethread_fn, &threads[i]);
        }

Now, the SPEs should be running, each generating its own slice of the fractal. We just have to wait for them all to finish:

        /* wait for the threads to finish */
        for (i = 0; i < n_threads; i++)
                pthread_join(threads[i].pthread, NULL);

If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.

That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n-threaded program will take 1/n of the time of a single-threaded implementation. Let's see how that fares with reality...

performance

Let's compare invocations of our multi-threaded fractal renderer, with different values for the n_threads parameter.

Performance of multi-threaded SPE fractal renderer
SPE threads    Running time (sec)
1              40.72
2              30.14
4              18.84
6              12.72
8              10.89

Not too bad, but we're definitely not seeing linear scalability here; we could expect the 8 SPE case to take around 5 seconds, rather than 11.
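To put numbers on "not linear", the usual measures are speedup (single-thread time divided by n-thread time) and parallel efficiency (speedup divided by n). A small sketch of those calculations follows; the function names are ours for illustration, not anything in the renderer source.

```c
#include <assert.h>

/* Running time we would see with perfect linear scaling. */
double ideal_time(double t1, int n)
{
        assert(t1 > 0 && n > 0);
        return t1 / n;
}

/* Speedup of the n-thread run over the single-thread run. */
double speedup(double t1, double tn)
{
        assert(t1 > 0 && tn > 0);
        return t1 / tn;
}

/* Parallel efficiency: 1.0 means perfect linear scaling. */
double efficiency(double t1, double tn, int n)
{
        assert(n > 0);
        return speedup(t1, tn) / n;
}
```

With the figures from the table, ideal_time(40.72, 8) is about 5.09 seconds, speedup(40.72, 10.89) is about 3.74, and efficiency(40.72, 10.89, 8) is about 0.47, so the 8-thread run is getting a little under half of ideal linear scaling.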

what's slowing us down?

A little investigation into the fractal generator will show that some SPE threads are finishing long before others. This is due to the variability in calculation time between pixels. To establish that a point (ie, pixel) in the fractal does not diverge towards infinity (and gets coloured blue), we need to do the full i_max tests in render_fractal (i_max is 10,000 in our sample fractal). Other pixels may diverge much more quickly (possibly in under 10 iterations), and so are orders of magnitude faster to calculate.

We end up with slices that are very quick to calculate, and others that take longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.

So, let's take another approach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fit into a single DMA). This should even out the work between SPE threads.

Interleaved SPE fractal divisions
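To see why interleaving helps, it is enough to model each row as having some cost and compare the busiest thread under each scheme, since the overall runtime is set by the slowest thread. The sketch below does that on the host with plain C. The function names and the row_cost array are assumptions made for illustration; any costs you feed it are a model, not measurements from the renderer.

```c
#include <assert.h>

/* Largest per-thread load when the rows are split into n_threads
 * contiguous strips, as in the first version of the SPE loop. */
long max_load_contiguous(const long *row_cost, int rows, int n_threads)
{
        int rows_per_spe, t, r;
        long load, max = 0;

        assert(n_threads > 0 && rows % n_threads == 0);
        rows_per_spe = rows / n_threads;

        for (t = 0; t < n_threads; t++) {
                load = 0;
                for (r = 0; r < rows_per_spe; r++)
                        load += row_cost[t * rows_per_spe + r];
                if (load > max)
                        max = load;
        }
        return max;
}

/* Largest per-thread load when chunks of rows_per_chunk rows are
 * dealt out to the threads in turn (the interleaved scheme). */
long max_load_interleaved(const long *row_cost, int rows, int n_threads,
                int rows_per_chunk)
{
        int t, row, r;
        long load, max = 0;

        assert(n_threads > 0 && rows_per_chunk > 0);

        for (t = 0; t < n_threads; t++) {
                load = 0;
                for (row = rows_per_chunk * t; row < rows;
                                row += rows_per_chunk * n_threads)
                        for (r = row; r < row + rows_per_chunk && r < rows; r++)
                                load += row_cost[r];
                if (load > max)
                        max = load;
        }
        return max;
}
```

As a made-up example, take 8 rows with costs {1, 1, 1, 1, 100, 100, 100, 100} and 2 threads. The contiguous split leaves one thread with a load of 400, while interleaving single-row chunks gives each thread 202. Note that chunks as large as a whole strip bring back the imbalance, so smaller chunks even things out better.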

This is just a matter of changing the SPE for-loop a little. Rather than the current code, which divides the work into n_threads contiguous strips:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

We change this to render every n_threads-th chunk, starting from chunk thread_idx, which gives us the interleaved pattern:

        for (row = rows_per_dma * args.thread_idx;
                        row < args.fractal.rows;
                        row += rows_per_dma * args.n_threads) {

                render_fractal(&args.fractal, row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }
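A quick way to convince yourself that this loop renders every row exactly once is to replay its bounds on the host for each thread_idx and count the visits. The following is a sketch under our own naming (interleave_covers_all() is hypothetical, not part of the renderer). It also flags the case where the row count is not a multiple of rows_per_dma, in which the last chunk would run past the end of the image.

```c
#include <assert.h>
#include <stdlib.h>

/* Replay the interleaved loop bounds for every thread. Returns 1 if
 * each row in [0, rows) is visited exactly once and no chunk runs
 * past the last row, 0 otherwise. */
int interleave_covers_all(int rows, int rows_per_dma, int n_threads)
{
        int *seen;
        int thread_idx, row, r, ok = 1;

        assert(rows > 0 && rows_per_dma > 0 && n_threads > 0);

        seen = calloc(rows, sizeof(*seen));
        if (!seen)
                return 0;

        for (thread_idx = 0; thread_idx < n_threads; thread_idx++) {
                /* same bounds as the SPE loop above */
                for (row = rows_per_dma * thread_idx; row < rows;
                                row += rows_per_dma * n_threads) {
                        for (r = row; r < row + rows_per_dma; r++) {
                                if (r >= rows)
                                        ok = 0; /* chunk overruns the image */
                                else
                                        seen[r]++;
                        }
                }
        }

        for (r = 0; r < rows; r++)
                if (seen[r] != 1)
                        ok = 0;

        free(seen);
        return ok;
}
```

For example, interleave_covers_all(64, 4, 8) returns 1, while interleave_covers_all(10, 4, 2) returns 0 because the chunk starting at row 8 would extend to row 11.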

An updated renderer is available in fractal.4.tar.gz.

Making this small change gives some better performance figures:

Performance of multi-threaded, interleaved SPE fractal renderer
SPE threads    Running time (sec)
1              40.72
2              20.75
4              10.78
6              7.44
8              5.81

We're doing much better now, but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...

Episode 9 - WBEM Intro with Chris Buccella

Original Planet LTC Linux Podcasts - Fri, 06/20/2008 - 14:46

Chris gives an intro to WBEM, sblim and other system management tools for Linux and other OSes.

Additional Information

  • SBLIM home page at http://sblim.wiki.sourceforge.net
  • Open Pegasus at http://openpegasus.org
  • WBEM from Python at http://pywbem.sourceforge.net
  • WS-MAN information at http://openwsman.org

Real-Time for the Masses

Darren Hart: Real Time - Fri, 06/20/2008 - 00:14

Today at the Red Hat Summit in Boston, Red Hat announced the official release of Red Hat Enterprise MRG V1 (Messaging Realtime Grid) [1]. A couple snippets of note:

"The Realtime component of Red Hat Enterprise MRG comprises numerous kernel enhancements that provide deterministic performance for time-critical and latency-sensitive applications."

"IBM has worked together closely with Red Hat on the development of the real time Linux kernel and has optimized both WebSphere Real Time and BladeCenter servers on Red Hat Enterprise MRG. We are delighted that IBM and Raytheon have been recognized by Red Hat for this innovation which has led to the largest deployment of real time technology in the next generation of US Navy Destroyers." --Jeff Smith, vice president, Open Source and Linux Middleware at IBM

Canberra and the world's fastest computer

Michael Ellerman - Wed, 06/18/2008 - 20:30
As Jeremy mentioned, the IBM QS22 was released a few weeks ago. The QS22 is the newest Cell processor based blade server, sporting the new PowerXCell 8i chip, and up to 32 GB of memory. Because the QS22 can support larger amounts of memory, Linux needs to enable the IOMMU, whereas on previous blades that was [...]

Red Hat announces "next-generation" virtualization based on KVM

Anthony Liguori: Tales of a Code Monkey - Wed, 06/18/2008 - 16:21

Today, at the Red Hat Summit, Red Hat announced three virtualization initiatives including oVirt. The press release is here.

Some choice quotage:

KVM technology has rapidly emerged as the next-generation virtualization technology, following on from the highly successful Xen implementation.

Another good one:

We continue to see huge improvements in functionality, performance and time to market because of our close relationship with our open source partners. For example, Intel and IBM have worked with us for many years covering virtualization technologies that span from Red Hat Enterprise Linux 5 to today's KVM-based announcements.

And of course:

"IBM works closely with Red Hat and the open source community to drive innovation within the Linux kernel," said Daniel Frye, vice president, open systems development at IBM. "IBM has a heterogenous approach toward virtualization, with KVM one of several options. KVM leverages the core features of the Linux kernel, including paravirtualization interfaces contributed by IBM engineers. By combining Linux virtualization infrastructure with open management interfaces such as CIM and libvirt, we gain a solution that eliminates lock-in and open source community innovations, we are able to offer our customers a solution with outstanding performance, scalability and agility."

If you want to see what all the fuss is about, check out KVM.
