See HPC Central, follow the link to the Red Hat Enterprise Linux page, which is where the kernel page is linked in. We plan to replicate these pieces for SUSE Linux Enterprise Server next month.
I've recently been working on some changes to the spufs code, and thought I'd write-up some of the details.
At present, the spu_run syscall (used to run a SPU context) blocks until the SPU program has exited (or some other event has happened, such as a non-serviceable fault). This means that to take advantage of the SPUs, you really need to start a new thread for each SPU context that you create, otherwise your application will be sitting around waiting for each SPU context to complete.
In fact, we have an invariant in the spufs code at the moment that only contexts that are currently being spu_run will ever be runnable (and, at the moment, schedulable).
Ben H and I have been chatting about some ideas about asynchronous spu contexts. This means that the userspace app can start the context, then later retrieve the status of the SPU context (to see if it has stopped, faulted, or whatever). We can then use standard POSIX semantics like poll() to see if a context is still running or has generated any "events", then handle these events when they become available.
In effect, this is similar to spu_run: currently, the spu_run syscall runs the SPU, then blocks until an event happens, which is then returned to userspace as the return value of spu_run. The main difference is that we don't block in the kernel while the SPU is running.
So, I've been coding up an experimental change to spufs. Firstly, we have to explicitly tell the kernel that we want a context to operate in asynchronous mode, so I've added a new flag to the spu_create syscall: SPU_CREATE_ASYNC.
I've opted for a file-based interface to these asynchronous contexts - SPU events are retrieved by reading from a file. Contexts that are created with the SPU_CREATE_ASYNC flag have an extra file present (called something like "event") in their context directory in the spufs mount. Reading from this file allows applications to retrieve events that the SPU program has raised.
We need to define a format for the data read from this events file, so here's something to get started with:
struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};
- where the event member specifies which event happened - a stop-and-signal for example.
The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.
So, users of this interface may look something like this:
uint32_t npc = 0;
struct context {
	int fd;
	int event_fd;
} context;

/* create the context */
context.fd = spu_create("/spu/ctx", NULL, SPU_CREATE_ASYNC);

/* open the events file */
context.event_fd = openat(context.fd, "event", O_RDWR);

/* start the context running. unlike the spu_run syscall,
 * this function does not block for the duration of the
 * spu program */
run_context(&context, npc);

for (;;) {
	struct spu_event event;

	/* get the next event caused by the SPU */
	read(context.event_fd, &event, sizeof(event));

	if (event.event == SPU_EVENT_STOP)
		break;

	/* handle other event ... */
}
Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.
This isn't far from the API provided by the current spu_run syscall, except that we're not waiting in the kernel while the SPU is running.
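The poll() idea mentioned earlier could look roughly like the sketch below. This is only an illustration, not code from the prototype: wait_for_event is a hypothetical helper, the struct spu_event layout is the one proposed above, and it assumes the event file supports poll() and hands back one whole event per read.

```c
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Event layout as proposed in the post; not a finalised kernel ABI. */
struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};

/*
 * Hypothetical helper: wait up to timeout_ms for an event on event_fd.
 * Returns 1 if an event was read into *ev, 0 on timeout, and -1 on
 * error (including a short read).
 */
int wait_for_event(int event_fd, struct spu_event *ev, int timeout_ms)
{
	struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
	ssize_t len;
	int rc;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc <= 0)
		return rc;	/* 0: timed out, -1: poll failed */

	if (!(pfd.revents & POLLIN))
		return -1;	/* error or hangup with nothing to read */

	len = read(event_fd, ev, sizeof(*ev));
	return len == (ssize_t)sizeof(*ev) ? 1 : -1;
}
```

An application with several contexts could put all of their event file descriptors into a single poll() call instead, which is the main attraction over one blocked thread per context.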
Also, we're going to need to control the SPU somehow - for example, we need to implement the run_context function in the pseudocode above. Rather than overloading the spu_run syscall, I've opted to use the same event file - writes to this file will allow userspace to control the SPU. I'm still working out the exact format of these writes, but the way I've implemented it at the moment is that the application can write structures of this layout to the file:
struct spu_control {
	uint32_t op;
	char data[];
};
The contents of the data member depends on the operation requested (specified by the op member). For example, a 'start spu' operation would have four extra bytes - a uint32_t containing the NPC to start the SPU execution from. A 'stop spu' operation doesn't require any extra parameters, so the data member would be 0 bytes long.
This would allow us to implement the run_context function as follows:
void run_context(struct context *context, uint32_t npc)
{
	uint32_t buf[2];

	buf[0] = SPU_CONTROL_START_SPU;
	buf[1] = npc;

	/* context is a pointer here, so use -> rather than . */
	write(context->event_fd, buf, sizeof(buf));
}
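A slightly more general take on the same idea, built on the struct spu_control layout, might look like the sketch below. Treat it as an assumption-laden illustration: spu_control_write, spu_start and spu_stop are hypothetical helper names, the opcode values are placeholders (the format of these writes is still being worked out), and SPU_CONTROL_STOP_SPU is an invented name for the 'stop spu' operation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Placeholder opcode values: the post does not define the real ones. */
enum {
	SPU_CONTROL_START_SPU = 1,
	SPU_CONTROL_STOP_SPU = 2,	/* hypothetical name */
};

/* Control message layout as proposed in the post. */
struct spu_control {
	uint32_t op;
	char data[];
};

/*
 * Hypothetical helper: send one control message with an op-specific
 * payload. Returns 0 on success, -1 on allocation failure, write error
 * or short write.
 */
int spu_control_write(int event_fd, uint32_t op, const void *data,
		      size_t data_len)
{
	size_t len = sizeof(struct spu_control) + data_len;
	struct spu_control *msg = malloc(len);
	ssize_t written;

	if (!msg)
		return -1;

	msg->op = op;
	if (data_len)
		memcpy(msg->data, data, data_len);

	/* One write() per message, so it arrives as a single unit. */
	written = write(event_fd, msg, len);
	free(msg);

	return written == (ssize_t)len ? 0 : -1;
}

/* 'start spu' carries a 4-byte NPC; 'stop spu' carries no payload. */
int spu_start(int event_fd, uint32_t npc)
{
	return spu_control_write(event_fd, SPU_CONTROL_START_SPU,
				 &npc, sizeof(npc));
}

int spu_stop(int event_fd)
{
	return spu_control_write(event_fd, SPU_CONTROL_STOP_SPU, NULL, 0);
}
```

Checking the return value of write() matters here: a short or failed write would otherwise leave the application believing the SPU had been started.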
There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!
I have finally found some time to write a bit about the GCC Developer’s Summit 2008, which happened one month ago in Ottawa, Canada (well, I didn’t really find time, since it’s past 1:30 AM now but still…).
In summary, I had a blast there! I was in last year’s summit and enjoyed it and learned a lot from it. But this time I already knew GDB people and they knew me, and I am involved in a couple of current developments and have more experience with the project, all of which made some difference. And everybody there is very friendly, of course, even if they never heard of you before.
In fact, Ian Taylor in his welcome presentation urged people to be friendly to newcomers since GCC and the GNU toolchain need new blood.
There was a good number of GDB-related events: a GDB talk and two debugging information talks, a debug information BoF, an informal GDB get-together and a GDB BoF. Unfortunately I know squat about GCC internals (I intend to learn more about it, but didn’t have a chance yet) so the two debugging information talks were above my head, and I absorbed little. The GDB talk was interesting but since I follow the GDB mailing lists I already knew most of what was presented.
The debug information BoF was interesting, especially since the discussion didn’t focus so much on the two competing approaches to improve debug information (which was the original point of the BoF), but mostly on what should be expected from debug information generated by GCC (i.e., what a debugger should be able to do with it, especially at higher optimization levels), and how its quality can be tested in the GCC testsuite.
The most interesting events for me were of course the GDB get-together and the GDB BoF. The former was a table reserved for us at lunch one day (thanks for organizing this, Joel Brobecker!) where folks interested in GDB would get to see each other’s faces and talk about random stuff (GDB-related or not). It was fun, and we were able to throw some ideas around about things such as conversion of the GDB repository from CVS to Subversion, the patch review process, and even about rewriting GDB in C++ (which is a hot thread in the GDB mailing list today!). I have a picture of the event:
If you follow the link you can see the notes with the name of each person in the photo above.
The GDB BoF was very interesting, and it felt weird to be at the front (thanks for inviting me Daniel!) discussing current GDB issues with Daniel Jacobowitz, Tom Tromey, Pedro Alves (the other people at the front) and the other GDB maintainers and developers in the room.
We nailed down some pending issues that were being discussed in the mailing list at the time regarding Python scripting support (man, it’s so much easier to decide things face to face rather than by e-mail!), and also discussed a bit of reversible debugging, multithreading GDB itself, GDB scalability, what to do regarding the next release (in a nutshell: wait about a year from the last release so that all the cool stuff which is being worked on right now gets in and settles down), moving the bugs database from GNATS to bugzilla (thanks for doing this Tromey!) etc.
Also after the BoF Pedro Alves gave a very good improvised tutorial on the GDB event loop which he has been studying for the past few months. It felt like cheating, to get all that knowledge in what, half an hour?
Thanks so much Pedro, it was awesome.
And of course all the interaction with the people who were there, like Joel Brobecker (playing tennis is more serious than I thought!), Gaius Mulley (Pink Floyd!), Anmol Paralkar, Ramana Radhakrishnan and many others (I don’t even try to enumerate, just a random sample).
I shared a suite in Ottawa with David Edelsohn and Kenneth Zadeck, which was an interesting thing in itself. Heading back to the hotel felt like going to an extended GCC summit.
I almost learned something about GCC internals (SSA, LTO, register allocation) and also had very interesting conversations in general.
And of course my one week of backpacking in Canada after the summit, which was another blast.

I picked up a 1 TB Western Digital My Book World from Fry's today - that's a 1 Terabyte Gigabit Network Attached Storage box - for $220. I've seen a few consumer technical appliances that run Linux under the covers, and haven't been terribly impressed with many of them. So far however, this box is slick. While WD doesn't support Linux officially (bad WD!) it is pretty trivial to get the box into a techie-friendly state.
With only an hour or so of tinkering, I was able to enable ssh, disable the java server, add my own users, and mount my partitions via sshfs. The info on doing this is already covered by "Paul" at his How to setup My Book World Edition II page so I won't duplicate that here. Should also probably link to Martin Hinner's My Book Howto directly as he appears to have done the hard work. I'll spend the next few days figuring out how to best make use of this device; most likely looking into something like rdiff-backup.
Something they don't mention are the specs of this little beauty, for those of you who might care, read on...
Over the weekend, I converted my laptop to use the ext4 filesystem. So far so good! So far I’ve found one bug as a result of my using ext4 in production (if delayed allocation is enabled, i_blocks doesn’t get updated until the block allocation takes place, so files can appear to have 0k blocksize right after they are created, which is confusing/unfortunate), but nothing super serious yet. I will be doing backups a bit more frequently until I’m absolutely sure things are rock solid, though!
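For anyone who wants to look for the i_blocks behaviour on their own machine, here is a minimal sketch (not the actual test case I used). write_and_stat is a hypothetical helper that writes a new file and stats it immediately, before any writeback has been forced.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or truncate) path, write len bytes from buf, and stat the
 * file straight away, without an fsync() in between.
 * Returns 0 on success and fills *st, or -1 on error.
 */
int write_and_stat(const char *path, const void *buf, size_t len,
		   struct stat *st)
{
	int fd, rc = -1;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (write(fd, buf, len) == (ssize_t)len)
		rc = fstat(fd, st);

	close(fd);
	return rc;
}
```

On Linux, st_blocks is counted in 512-byte units. With delayed allocation enabled, the expectation is that st_size is already non-zero while st_blocks may still read 0; calling stat() again after an fsync() should then show the allocated blocks.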
I am using the latest ext4 patches and the tip of the e2fsprogs git repository. Hopefully when we get the bulk of the patches merged into the mainline kernel after the 2.6.26 ships and the 2.6.27 merge window opens, and after I ship out e2fsprogs 1.41 (I have one work-in-progress pre-release, with another coming soon), it’ll be ready for much more wide-spread testing.
In addition to the excellent crew of ext4 developers, I’d like to call out for special thanks Gary Howco and Holger Kiehl, two early users/benchmarkers of ext4 who tried our latest code, and reported bugs that had previously escaped attention by developers (who had been mostly testing the code via the same old test suites); their additional workloads and benchmarks flushed out a few additional bugs. Thanks, guys!!
Hopefully after a few weeks of my using ext4 for real-life work, I’ll find the last few bugs to be fixed, and/or feel much more confident it’s ready for me to recommend to others for their production data.
Seeing as the QS22s are out, here's a short guide to getting Debian installed.
[jk@qs22 ~]$ grep -m1 ^cpu /proc/cpuinfo
cpu : Cell Broadband Engine, altivec supported
[jk@qs22 ~]$ lsb_release -d
Description: Debian GNU/Linux unstable (sid)
You'll need a kernel that has support for the IBM Cell blades. If you configure your kernel with the 'cell_defconfig' target, you should have all the necessary options:
[jk@pingu linux-2.6.25]$ make cell_defconfig
Specifically, you need:
The QS22s have no internal disk (they're compute nodes, right?), so you'll have to either:
The installation process will be different depending on which you choose, so just skip to the appropriate section here.
For the first option, there's a number of NFS-root howtos around. First up, we need to build the actual debian filesystem, using debootstrap. For example:
[jk@pingu ~]$ sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/
This will create an entire debian filesystem in /srv/nfs/qs22. We need to make a few modifications though:
T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100
[jk@pingu ~]$ cd /srv/nfs/qs22/dev
[jk@pingu dev]$ sudo mknod console c 5 1
[jk@pingu dev]$ sudo mknod ttyS0 c 4 64
Once this is done, we need to complete the bootstrap on the QS22. Set up your NFS server, and export the appropriate directory. Boot the QS22 with the nfs root kernel options, plus "rw init=/bin/sh" (eg root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw init=/bin/sh). Then, once the machine has booted:
sh-3.2# PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage
This should finish the bootstrap. After it has completed (it should finish with "I: Base system installed successfully"), reboot the machine with the same kernel command line, minus the rw init=/bin/sh arguments. Once it boots, you should have the debian login prompt. Login as root (there will be no password, but don't forget to set one) and away you go.
If you're using SAS, the install is much more straightforward, as you can just use the standard debian installer. However, you may need to use a custom kernel which supports the QS22s. This is a matter of building your own kernel, using the powerpc64 debian installer image as the initramfs:
[jk@pingu linux-2.6.25]$ wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz
[jk@pingu linux-2.6.25]$ gunzip -c < initrd.gz > initrd
[jk@pingu linux-2.6.25]$ make cell_defconfig
[jk@pingu linux-2.6.25]$ sed -ie 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config
[jk@pingu linux-2.6.25]$ make
Then, just boot the kernel in arch/powerpc/boot/zImage.pseries. The debian installer should start, and guide you through the rest of the installation. Since you're netbooting, you can ignore any messages about not having a bootstrap partition, or not being able to install a kernel or yaboot.
Entirely optional, but you'll probably get the most out of your QS22 with a few extra packages:
[jk@qs22 ~]$ sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential
%define jobs %(cat /proc/cpuinfo | grep processor | wc -l)
We've been playing recently on some of the sweet top-of-the-line POWER6 systems, in one case the Power 575 system with 32 cores. When running with SMT enabled, that's 64 CPUs that Linux controls. The kernel build goes very fast on that system.
Embedded 4xx PowerPC Linux Kernel maintainer Josh Boyer discusses activity in the embedded linux space.
A colleague sent me a link to this analyst paper today that takes a look at whether IBM has made good on the Linux promises it made back in 1999. I’m obviously biased, but I’m interested in hearing if anyone has thoughts on this topic.
Here’s the report: ftp://ftp.software.ibm.com/linux/pdfs/GCG_IBM_and_Linux-9_years_later.pdf
The opening teaser:
In 1999, IBM issued a series of announcements fully committing the company to supporting Linux. IBM vowed to Linux-enable all of their hardware platforms, including their non-x86 based mainframe, mini, and RISC-based systems. They also promised to release Linux versions of their software products and develop Linux-centric service practices. Moreover, they pledged significant resources to the Linux community with the goal of advancing Linux and open source technology.
So, nine years later, did IBM deliver on these promises? Was their commitment to Linux genuine or just lip service? This report examines IBM’s current Linux products, services, and community support in light of the promises they made in 1999…
While I think it’s obvious IBM has been a huge investor in the Linux community, one thing that I noticed reading the report is just how much IBM is actually different from other community members. There are some noticeable differences in the investments and approach to supporting the Linux platform and community. I often forget to just take in all the Linux technologies IBM has been heavily involved in from Xen, KVM and libvirt to filesystems, to systemtap, kprobes and then there’s RAS, scalability and performance enhancements.
Another interesting thought to reflect on is just how important it has been that there are multiple investors in this field. If this report captures just what IBM did, think of the industry combined. IBM couldn’t have done anything this big with Linux if it weren’t for co-creating with a community of enthusiasts, researchers, governments, Intel, AMD, Google, Nokia, Motorola, Oracle and thousands more. What would the report look like if you compiled all the investments and work the entire community leveraged across the industry? Linux is “bigger than huge” when you stop to think about it. This is also why I’ve said for a couple of years now that when you extend the investment model 3 to 5 years into the future, Sun and its anti-Linux, Solaris push against the tide of the industry loses in the end. I think we’re starting to witness that now. Sure, OpenSolaris is a great idea… it’s just 9 years late and it’s too late to matter now.
I’m interested in outside perspectives too - where do you think IBM stands? Has the community development and investment model worked? Where will this lead in the future and what will be the next evolution of the model? Red Hat seems to think the model will evolve to include increased customer co-creation - I tend to agree. Why? Because the incentive model to invest aligns very well - and when you have alignment, it almost naturally will happen.
In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...
Our baseline performance is 40.7 seconds to generate the sample fractal (using the sample fractal parameters).
The initial SPE-based fractal renderer used only one SPE. However, we have more available:
| Machine | SPEs available |
|---|---|
| Sony Playstation 3 | 6 |
| IBM QS21 / QS22 blades. | 16 (8 per CPU) |
So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.
The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.
In order to split up the work, we first need to tell each SPE which part of the fractal it is to render. We'll add two variables to the spe_args structure (which is passed to the SPE during program setup), to provide the total number of threads (n_threads) and the index of this SPE thread (thread_idx).
struct spe_args { struct fractal_params fractal; int n_threads; int thread_idx; };
On the SPE side, we use these parameters to alter the invocations of render_fractal. We set up another couple of convenience variables:
rows_per_spe = args.fractal.rows / args.n_threads;
start_row = rows_per_spe * args.thread_idx;
And just alter our for-loop to only render rows_per_spe rows, rather than the entire fractal:
for (row = 0; row < rows_per_spe; row += rows_per_dma) {
render_fractal(&args.fractal, start_row + row,
rows_per_dma);
mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
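One thing to watch with the integer division above: if the number of rows isn't a multiple of n_threads, the leftover rows never get rendered. The sample fractal presumably divides evenly, but a more defensive split could be sketched like this. strip_for_thread and struct strip are hypothetical names, not part of the fractal source, and the DMA loop would still need each strip to be a whole number of rows_per_dma chunks.

```c
/* Half-open range of rows [start, end) assigned to one SPE thread. */
struct strip {
	int start;
	int end;
};

/*
 * Split `rows` rows into n_threads contiguous strips. The first
 * (rows % n_threads) strips get one extra row, so every row is covered
 * even when rows is not a multiple of n_threads.
 */
struct strip strip_for_thread(int rows, int n_threads, int thread_idx)
{
	int base = rows / n_threads;
	int extra = rows % n_threads;
	struct strip s;

	s.start = base * thread_idx + (thread_idx < extra ? thread_idx : extra);
	s.end = s.start + base + (thread_idx < extra ? 1 : 0);
	return s;
}
```

For example, 10 rows over 4 threads gives strips of 3, 3, 2 and 2 rows, where the plain division would give four strips of 2 and leave the last two rows blank.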
The changes to the PPE code are fairly simple - instead of just creating one thread, create n threads.
First though, let's add a '-n' argument to the program to specify the number of threads to start:
while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
switch (opt) {
/* other options omitted */
case 'n':
n_threads = atoi(optarg);
break;
Rather than just creating one SPE thread, we create n_threads. Also, we have to set the thread-specific arguments for each thread:
for (i = 0; i < n_threads; i++) {
/* copy the fractal data into this thread's args */
memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));
/* set thread-specific arguments */
threads[i].args.n_threads = n_threads;
threads[i].args.thread_idx = i;
threads[i].ctx = spe_context_create(0, NULL);
spe_program_load(threads[i].ctx, &spe_fractal);
pthread_create(&threads[i].pthread, NULL,
spethread_fn, &threads[i]);
}
Now, the SPEs should be running, and generating their own slice of the fractal. We just have to wait for them all to finish:
/* wait for the threads to finish */
for (i = 0; i < n_threads; i++)
pthread_join(threads[i].pthread, NULL);
If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.
That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n-threaded program will take 1/n of the time of a single-threaded implementation. Let's see how that fares with reality...
Let's compare invocations of our multi-threaded fractal renderer, with different values for the n_threads parameter.
| SPE threads | Running time (sec) |
|---|---|
| 1 | 40.72 |
| 2 | 30.14 |
| 4 | 18.84 |
| 6 | 12.72 |
| 8 | 10.89 |
Not too bad, but we're definitely not seeing linear scalability here; we could expect the 8 SPE case to take around 5 seconds, rather than 11.
A little investigation into the fractal generator will show that some SPE threads are finishing long before others. This is due to the variability in calculation time between pixels. In order to see if a point (ie, pixel) in the fractal does not diverge towards infinity (and gets coloured blue), we need to do the full i_max tests in render_fractal (i_max is 10,000 in our sample fractal). Other pixels may diverge much more quickly (possibly in under 10 iterations), and so are orders of magnitude faster to calculate.
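To make that variability concrete, here is a small sketch of a per-pixel iteration count. It assumes a Mandelbrot-style escape-time fractal; escape_iterations is a hypothetical stand-in, not the actual inner loop of render_fractal.

```c
/*
 * Number of iterations of z = z*z + c completed before |z| exceeds 2,
 * capped at i_max. Points that never escape cost the full i_max.
 */
int escape_iterations(double cr, double ci, int i_max)
{
	double zr = 0.0, zi = 0.0;
	int i;

	for (i = 0; i < i_max; i++) {
		double tmp;

		/* |z|^2 > 4 means the point has escaped */
		if (zr * zr + zi * zi > 4.0)
			break;

		tmp = zr * zr - zi * zi + cr;
		zi = 2.0 * zr * zi + ci;
		zr = tmp;
	}
	return i;
}
```

With i_max set to 10,000, the point c = 0 never escapes and costs all 10,000 iterations, while c = 1 is done after 3. A strip made up mostly of the first kind of pixel will keep its SPE busy long after the others have finished.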
We end up with slices that are very quick to calculate, and others that take longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.
So, let's take another approach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fit into a single DMA). This should even-out the work between SPE threads.
This is just a matter of changing the SPE for-loop a little. Rather than the current code, which divides the work into n_threads contiguous chunks:
for (row = 0; row < rows_per_spe; row += rows_per_dma) {
render_fractal(&args.fractal, start_row + row,
rows_per_dma);
mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
We change this to render every n_threads-th chunk, starting from thread_idx, which gives us the interleaved pattern:
for (row = rows_per_dma * args.thread_idx;
row < args.fractal.rows;
row += rows_per_dma * args.n_threads) {
render_fractal(&args.fractal, row,
rows_per_dma);
mfc_put(buf, ppe_buf + row * bytes_per_row,
bytes_per_row * rows_per_dma,
0, 0, 0);
/* Wait for the DMA to complete */
mfc_write_tag_mask(1 << 0);
mfc_read_tag_status_all();
}
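To see why interleaving helps, here is a toy model of the two schemes. It is only a sketch: the function names are hypothetical, the per-chunk costs are made up by the caller, and it assumes at most 16 threads and a chunk count that is a multiple of n_threads.

```c
/*
 * Which thread renders a given chunk under the interleaved scheme:
 * thread t takes chunks t, t + n_threads, t + 2 * n_threads, ...
 */
int interleaved_owner(int chunk, int n_threads)
{
	return chunk % n_threads;
}

/* Same question for the earlier contiguous scheme (equal-sized strips). */
int contiguous_owner(int chunk, int n_chunks, int n_threads)
{
	return chunk / (n_chunks / n_threads);
}

/*
 * Total cost of the busiest thread, given a per-chunk cost array. This
 * is what bounds the overall run time. `interleaved` selects the scheme.
 * Assumes n_threads <= 16 and n_chunks is a multiple of n_threads.
 */
long slowest_thread_cost(const int *cost, int n_chunks, int n_threads,
			 int interleaved)
{
	long total[16] = { 0 };
	long max = 0;
	int i, t;

	for (i = 0; i < n_chunks; i++) {
		t = interleaved ? interleaved_owner(i, n_threads)
				: contiguous_owner(i, n_chunks, n_threads);
		total[t] += cost[i];
	}

	for (t = 0; t < n_threads; t++)
		if (total[t] > max)
			max = total[t];

	return max;
}
```

With eight chunks costing 1, 1, 1, 1, 9, 9, 9, 9 and two threads, the contiguous split leaves one thread with 36 units of work, while the interleaved split gives each thread 20.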
An updated renderer is available in fractal.4.tar.gz.
Making this small change gives some better performance figures:
| SPE threads | Running time (sec) |
|---|---|
| 1 | 40.72 |
| 2 | 20.75 |
| 4 | 10.78 |
| 6 | 7.44 |
| 8 | 5.81 |
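A quick way to put numbers on "better" is to compute the speedup and parallel efficiency from the two tables. The helpers below are just a sketch of that arithmetic, with hypothetical names.

```c
/* Speedup relative to the single-thread running time. */
double speedup(double t1, double tn)
{
	return t1 / tn;
}

/* Parallel efficiency: the fraction of ideal linear scaling achieved. */
double efficiency(double t1, double tn, int n_threads)
{
	return speedup(t1, tn) / n_threads;
}
```

Plugging in the 8-thread figures: the contiguous split managed 40.72 / 10.89, or about a 3.7x speedup (roughly 47% efficiency), while the interleaved split reaches 40.72 / 5.81, or about 7.0x (roughly 88% efficiency).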
We're doing much better now, but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...
Chris gives an intro to WBEM, sblim and other system management tools for Linux and other OSes.
Today at the Red Hat Summit in Boston, Red Hat announced the official release of Red Hat Enterprise MRG V1 (Messaging Realtime Grid) [1]. A couple snippets of note:
"The Realtime component of Red Hat Enterprise MRG comprises numerous kernel enhancements that provide deterministic performance for time-critical and latency-sensitive applications."
"IBM has worked together closely with Red Hat on the development of the real time Linux kernel and has optimized both WebSphere Real Time and BladeCenter servers on Red Hat Enterprise MRG. We are delighted that IBM and Raytheon have been recognized by Red Hat for this innovation which has led to the largest deployment of real time technology in the next generation of US Navy Destroyers." --Jeff Smith, vice president, Open Source and Linux Middleware at IBM
Today, at the Red Hat Summit, Red Hat announced three virtualization initiatives including oVirt. The press release is here.
Some choice quotage:
KVM technology has rapidly emerged as the next-generation virtualization technology, following on from the highly successful Xen implementation.
Another good one:
We continue to see huge improvements in functionality, performance and time to market because of our close relationship with our open source partners. For example, Intel and IBM have worked with us for many years covering virtualization technologies that span from Red Hat Enterprise Linux 5 to today's KVM-based announcements.
And of course:
"IBM works closely with Red Hat and the open source community to drive innovation within the Linux kernel," said Daniel Frye, vice president, open systems development at IBM. "IBM has a heterogenous approach toward virtualization, with KVM one of several options. KVM leverages the core features of the Linux kernel, including paravirtualization interfaces contributed by IBM engineers. By combining Linux virtualization infrastructure with open management interfaces such as CIM and libvirt, we gain a solution that eliminates lock-in and open source community innovations, we are able to offer our customers a solution with outstanding performance, scalability and agility."
If you want to see what all the fuss is about, check out KVM.