Introducing TeleMetrum v0.2

Bdale and I (mostly Bdale, of course) finished the TeleMetrum v0.2 design work in December, and this weekend we got boards made and parts ordered and Bdale sat down with his trusty electric skillet and built 3 new boards. The new design has an integrated GPS receiver and patch antenna, and is otherwise fairly similar in design to v0.1.

TeleMetrum v0.2 Hardware

Here’s the front side of the board:

From the left, you’ll see a connector for an external power switch and the two ejection charge circuits, a battery connector for a single 3.7V lipo cell, the GPS patch antenna, a 4-pin debug connector, the piezo buzzer and the new 8-pin companion board connector. We weren’t happy with the connectors used on the v0.1 board and finally found these Tyco Micro-MaTch parts which take up a modest amount of board space (more than pico-blade connectors, less than regular pin blocks), have a locking option and crimp on to standard ribbon cable. They’re also bright red and surprisingly low in profile.

And here’s the back side:

Elements on this side include the new 100μF cap in the upper left corner which sits on the 3.3V supply to try and keep the CPU alive through minor power glitches. Below that is a new package containing a pair of FETs for the ejection circuits. We used discrete FETs in v0.1, but this device has better specs for our needs (lower on resistance, etc). The USB connector was pulled in-board far enough to keep it from hanging over the edge. Right of that is the new data logging chip, and right of that is a U.FL connector in case you want to use an external GPS antenna. We supply power to that connector as most external GPS antennas include their own LNA. And, of course, to the right of that is the Skytraq Venus 634 GPS receiver.

Below and to the right of the GPS receiver is the cc1111, to the left lies the accelerometer and then the barometric pressure sensor above the 5V boost regulator which powers the accelerometer. We haven’t found any high-G accelerometers that run on 3.3V yet. Finally the two tiny 5-pin chips are the USB LiPo charger and the 3.3V regulator. What you can’t see easily are a pile of 0402 passive components scattered across the board. Even close up, they’re hard to pick out by eye.

The only hardware ‘bug’ was in the reset logic — the new board was designed with a much larger capacitor on the reset line than the old board. The debug code would only hold the reset line low for a brief instant, sufficient for the old capacitor value but not the new one. Instead of fixing the code, Bdale decided to try a smaller capacitor value and found that it worked just fine. After that, the board came up just fine and the updated firmware was flashed into the CPU.

TeleMetrum v0.2 Software

The only significant software change was that the data logging part changed from a 25LC1024 1Mbit eeprom to an AT45DB161D 16Mbit DataFlash. This required writing a new driver, but fortunately much of the code could be copied from the 25LC1024 driver. Because the AT45DB161D comes from a family of similar-but-different parts ranging from 1Mbit to 64Mbits, I decided to make the code automatically adapt to the installed part, detecting which one was attached and adjusting the driver.
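The adapt-to-the-installed-part idea can be sketched as a small lookup table. This is only an illustration, not the actual TeleMetrum driver: the density codes, page sizes and page counts below are made-up placeholders (not datasheet values), and the names flash_geometry, flash_detect, flash_total_bytes and flash_config_page are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Geometry of one member of a flash family. All values in the table
 * below are illustrative placeholders, NOT taken from a datasheet. */
struct flash_geometry {
    uint8_t  density_code;  /* hypothetical ID value read from the part */
    uint16_t page_bytes;    /* bytes per page */
    uint16_t page_count;    /* number of pages on the part */
};

static const struct flash_geometry flash_table[] = {
    { 0x01,  256,  512 },   /* placeholder for a  1Mbit part */
    { 0x02,  512, 4096 },   /* placeholder for a 16Mbit part */
    { 0x03, 1024, 8192 },   /* placeholder for a 64Mbit part */
};

/* Look up the geometry for the code read from the installed part.
 * Returns NULL when the part is not recognized. */
const struct flash_geometry *flash_detect(uint8_t density_code)
{
    size_t i;

    for (i = 0; i < sizeof flash_table / sizeof flash_table[0]; i++)
        if (flash_table[i].density_code == density_code)
            return &flash_table[i];
    return NULL;
}

/* Total capacity in bytes, derived from the detected geometry. */
uint32_t flash_total_bytes(const struct flash_geometry *g)
{
    assert(g != NULL);
    return (uint32_t) g->page_bytes * g->page_count;
}

/* Configuration data lives in the last block of the part, so its
 * location depends on which part was detected. */
uint32_t flash_config_page(const struct flash_geometry *g)
{
    assert(g != NULL && g->page_count > 0);
    return (uint32_t) g->page_count - 1;
}
```

The rest of a driver would call flash_detect once and use the returned geometry everywhere else, so nothing depends on a compile-time part size.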

The story here is that the configuration data didn’t appear to be getting preserved across reboots — we use the last block of the data logging part to hold configuration data, including call sign, sensor calibration values and flight parameters. A bit of testing and we found that the code to read/write the device worked perfectly. It turns out that a premature optimization in detecting which kind of flash part was installed had a race condition when multiple threads were trying to access storage at the same time, resulting in the configuration data being left uninitialized. Oops!

The TeleMetrum firmware has a clever hack for selecting between idle mode (for fetching data from the device or altering the configuration) and flight mode (prepared to fly the rocket). It switches between these by detecting whether the board is upright (flight mode) or not (idle mode). However, the accelerometer must be calibrated to tell the difference. What never occurred to us was that if the calibration data was broken enough, the device might always come up in flight mode. In that mode, it isn’t listening to either USB or the radio link, so it’s impossible to fix the accelerometer calibration data.

A bit of brainstorming led to a fairly simple hack: check to see if one of the pins on the companion connector was shorted to ground at power-on time and, if so, force the computer to enter idle mode. Pin 1 of the companion connector is ground, and fortunately, pin 2 was the SPI clock pin, normally output-only, so we could safely use that in this mode as any companion device shouldn’t ever pull that low.

Future Events

As of this evening, three boards are built and mostly tested; the radios appear to work, GPS tracks satellites and the beeper makes plenty of noise. Still to check is whether the deployment circuits will fire an ematch (we’ve tested the design before, just not this specific implementation).

Next weekend, we’re off to linux.conf.au in Wellington, New Zealand where we’re scheduled to give a presentation on the hardware and software in TeleMetrum. We’ll have v0.2 boards to show off, so come and see them in person.

With v0.1, we used the same board design for both flight computer and ground station, TeleDongle. For TeleDongle, we just left most of the components off of the board and loaded alternate firmware. For v0.2, we’re planning on building a separate TeleDongle board; that design is finished but no boards are made yet.

Once we’re happy with the design, we’ve got big plans to get more boards made so we can let a few friends buy them for use in their own rocket projects. That should happen in the next month or so. Once we’ve gotten enough testing done, and made sure that other people can actually operate them without hand-holding from us, we’ll make them available for sale to the general rocket-flying public.

Beyond that, we’ve got plans to build more stuff:

  1. A stand-alone ground station, called TeleTerra, that would include an LCD readout and flight data recording so you wouldn’t need a laptop during the flight.

  2. A companion board, called TelePyro, to control 8 additional pyro channels. These could be used for almost anything from air starts to staging or any other whacky plans.

Posted Sun Jan 10 19:33:09 2010 Tags:

This is an article introducing a new email reading system called notmuch, written by Carl Worth with comments from me (and a few minor patches).

Abandon Fail Boat

Almost two months ago, when I updated my debian system to the latest and greatest bits, I happened to get a new version of evolution, 2.28. As has become the tradition with new versions of evolution, a few more things broke.

I’ve suffered through evolution ‘upgrades’ several times and had slowly reduced my usage of evolution features to try and keep it working. This time, I got stuck. The accumulated bugs in this mailer made it impossible for me to get my work done any more.

And, yes, it’s a sad commentary on the Linux desktop that the most important feature for many people using Linux has no credible GUI application (yes, I’ve tried a lot of email applications; I have too much mail for them to cope).

Exploring Sup

Carl had given up on Evolution a few weeks before and was using sup. From his description, and from a brief bit of experimentation, I decided to give it a try. Sup has four main features:

  1. It is entirely search based. All messages are indexed by a ‘real’ indexing system, xapian which provides reasonable full text search for email.

  2. You can mark (automatically, or manually) messages with labels; the ‘inbox’ view just shows the results of a search for messages with the ‘inbox’ label.

  3. It never modifies the actual mail store. All state is stored inside the database in the form of labels.

  4. Most operations act on threads, not messages. Viewing a thread shows you the unread messages in the whole thread in a single page, making following the conversation easy.

This feature set is exactly what I’ve been trying to get Evolution to use for several years; I used the virtual folders to automatically sort mail into several ‘categories’. Unfortunately, the evolution vfolder support was terrible to start with (way too slow to be actually useful) and has gotten far worse over time (no more nested vfolders?).

Sup works quite well for a small amount of email. With my message store (dating back to 1984), it took “a while” to do the initial scan to construct the database. After that, searches are zippy fast.

Sup has a couple of fairly serious mis-features though:

  1. It’s written in ruby. Yet another language disaster in my book; syntax horror-show similar to perl, and a lack of static typechecking means that obvious bugs in the program wouldn’t be caught until you happened to execute that particular line of code. Ruby is also no speed demon. I spend a lot of my day reading email; waiting for ruby is not on my list of desired activities.

  2. It has a magic curses UI. This is actually pretty good for reading email, but it’s not scriptable at all (scripting would be useful for mass patch-application), and it completely fails when composing new mail as it forks off emacs and waits for it to complete, meaning that you cannot see any mail while composing a message.

  3. It saves a bunch of label changes inside the application, and Xapian saves most of the database changes too. Having sup crash often means re-viewing a lot of mail.

Carl and I started fixing sup in various ways; making the mime-viewer run asynchronously (so you could see attachments while viewing the rest of the message), sorting the inbox oldest first and various other changes. Nothing serious, but it did show us how sup was built and just how simple it was inside.

It turns out that sup is just a bit of UI goo over a powerful full-text database; the complicated code is not the UI but the database. Of course, the sup UI is great for viewing mail, but that’s fortunately easy to clone.

A Minimal Mail Reader

Having seen just how easy it was to build a really nice mail reading system, Carl and I sat down and sketched out what the foundations of our ‘ideal’ system would look like:

  1. Xapian based. I haven’t seen anything close to Xapian in terms of features or performance. It has only one serious bug—it’s written in C++. Fortunately, we can wrap the C++ mess with a simple C wrapper and ignore that aspect.

  2. Command line driven. Any UI would be constructed on top of the command line interface. And, by UI, we mean emacs major mode. If someone wants to write a GUI, we won’t stop them though.

  3. Otherwise, work a lot like Sup (thread-based, immutable mail store, user-defined tags).

Carl started by playing with Xapian, using the existing sup database; one possibility would have been to retain compatibility with the sup format and just provide a new interface. Unfortunately, there were a lot of ‘ruby-isms’ in the sup database, and reconstructing that would have been pretty difficult from a non-ruby application.

Introducing Notmuch: Not much of an email program

Notmuch really isn’t much of an email program; it doesn’t talk to mail servers to receive or send mail, it doesn’t even really know what Maildir should look like. All it does is construct a database for all of your mail messages and allow you to search and show email messages.

Notmuch has two pieces—a C program that uses Xapian to search and tag mail messages, and an emacs major mode which provides a fairly simple user interface. Like git, the notmuch C program places a bunch of commands within a single executable:

  1. setup
    Interactively setup notmuch for first use.

  2. new
    Find and import any new messages.

  3. search search-term […]
    Search for threads matching the given search terms.

  4. reply search-terms […]
    Formats a reply from a set of existing messages.

  5. show search-terms […]
    Shows all messages matching the search terms.

  6. tag +tag|-tag […] [--] search-term […]
    Add/remove tags for all messages matching the search terms.

  7. dump [filename]
    Create a plain-text dump of the tags for each message.

  8. restore filename
    Restore the tags from the given dump file (see ‘dump’).

  9. help [command]
    This message, or more detailed help for the named command.

(The above text was taken directly from notmuch itself and was written by Carl).

As you can see, all of the commands which talk about messages take an arbitrary search pattern. The search command outputs thread identifiers in search-term form, so you can easily script things by pulling that out of the search output and passing it to additional notmuch commands. Learning how to do searching in notmuch is the key to using it successfully.
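As a sketch of that kind of scripting, the helper below pulls the thread term off the front of one line of search output so it can be handed to another notmuch command. The exact layout of a search output line is an assumption here (a leading "thread:ID" token followed by whitespace), and extract_thread_term is a hypothetical name, not part of notmuch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the leading "thread:ID" token from one line of search output
 * into out. Assumes (not confirmed by notmuch itself) that each line
 * starts with the thread term followed by whitespace.
 *
 * Returns 0 on success, -1 if the line does not start with "thread:"
 * or the token does not fit in out. */
int extract_thread_term(const char *line, char *out, size_t out_size)
{
    static const char prefix[] = "thread:";
    size_t len;

    assert(line != NULL && out != NULL);

    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return -1;

    /* The token ends at the first blank, tab or newline. */
    len = strcspn(line, " \t\n");
    if (len + 1 > out_size)
        return -1;

    memcpy(out, line, len);
    out[len] = '\0';
    return 0;
}
```

The extracted string is already a valid search term, so it can be passed unchanged as the argument to show or tag.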

Xapian Search Terms

Matching words anyplace in the message is fairly simple; just list the set of words you want to match. Notmuch also adds some special syntax to direct the match at specific header fields:

  • tag:tag
    match messages with the specified tag

  • thread:thread-id
    match messages associated with the specified thread

  • id:id
    match the message with the given id. Message ids are those set by the message sender in the Message-Id: header field.

  • from:word
    match messages with word in the from address field.

  • to:word
    match messages with word in either the To: or Cc: headers.

  • attachment:word
    match messages with word in an attachment filename.

  • subject:word
    match messages with word in the subject field.

Aside from these additions, notmuch uses standard Xapian search syntax, including support for AND, OR etc. Xapian’s query parser is not the most robust piece of code though, so sometimes you need to mess with the query to get it to do what you want.

Notmuch emacs mode

There are a lot of email clients available for emacs; notmuch adds only the email reading part and uses the existing ‘message’ module for composing and sending mail. Even so, notmuch.el is almost 1000 lines long. It offers two different modes: the search display, where a list of email threads is presented, and the thread display, where a single thread is displayed.

The search display presents the output of ‘notmuch search’ in a window, eliding the thread id. When a thread is selected, a thread display buffer is constructed with the thread contents as formatted by ‘notmuch show’.

‘notmuch show’ structures the thread to make the display more useful in emacs; it splits messages into headers and bodies and marks the thread depth of each message. The header of each message will be shrunk to a single line (in reverse video). Previously read portions of the thread will be hidden by default, along with signature lines, quotations and attachments. Each of these can be viewed by use of a suitable command. Carl stole much of this from Sup and adapted it for use inside emacs, along with some of the key bindings.

How well does it work right now?

Frankly, notmuch is pretty rough today; I’m using it to read email, but I’m finding lots of stuff to fix. Fortunately, most of the fixes are pretty simple at this point. The good news is that it’s plenty fast, fast enough that I can count how many threads I’ve exchanged with my good friend Bart in the past 25 years (2686) in only a few seconds.

The biggest performance issue is some lazy code within Xapian. When you want to change the set of tags related to a document in the database (a single mail message), Xapian replaces the entire document. Try removing the ‘inbox’ tag from half a million messages and Xapian will carefully rewrite 5GB of data. That takes a while. The Xapian developers have suggested that this shouldn’t be hard to fix though, at which point re-tagging messages should get a lot faster.

For those interested in playing along, the notmuch sources are available from the notmuch web site along with a pointer to the mailing list.

Posted Tue Nov 17 02:14:21 2009 Tags:

In case you’ve been hiding under a rock for the last several months, I’d like to remind you that the Linux Plumbers Conference is currently soliciting submissions for the following tracks:

  1. Audio: Lennart Poettering
  2. Boot and Init: Dave Jones
  3. Embedded Systems: Greg Kroah-Hartman and David Woodhouse
  4. Energy Efficiency, Performance, and Power Management
  5. Inter-Distributor Cooperation: James Bottomley
  6. Kernel/Userspace/User Interfaces: Jim Gettys
  7. Networking: Steve Hemminger
  8. Security: James Morris and Paul Moore
  9. Storage: Matthew Wilcox
  10. Video Input Infrastructure
  11. X Window System: Keith Packard

I’m particularly interested in submissions around the changes in the Linux desktop, past, present and future. We’re seeing all kinds of new Linux-based user interfaces around these days, and I’d like to hear about where things are going, from both the hardware and software perspective. It’s Plumbers, so sessions which will generate active discussion among the participants are the best kinds.

As appears to be the tradition with Linux conferences, we’ve received numerous requests for “just a bit more time” in the submission process, and so the deadline has been extended from today until next Monday, June 22nd. Please head on over to the submission page and make sure we know you’re interested in contributing.

Posted Mon Jun 15 19:06:27 2009

To build code for TeleMetrum, we’re using SDCC, the Small Device C Compiler, as the CPU inside the cc1111 is an 8051 clone, an 8-bit microprocessor for which SDCC has excellent support (more about the flight software later).

SDCC version 2.9.0 was recently uploaded to Debian unstable, and when I built our flight software with the new version, I discovered a bug in the display of strings formatted by printf. First assuming that the bug was in my source code, I tried to figure out what I’d done wrong, but then I eventually looked at the 8051 assembly output (ick) and discovered that the compiler was generating the wrong code for pointers when passed to a varargs function. A bit of hacking and I soon had a short test case that demonstrated the bug:

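/* Any varargs function will do; the bug is in how the caller passes s. */
extern void f(char *x, ...);

void
func(__xdata char *s)
{
    /* s is an __xdata pointer passed through the variable argument list */
    f("hi", s);
}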

I filed a brief bug report and attached the test case, then went to download the current source code to see if I couldn’t uncover the source of the bug. I have to say that reading through the SDCC source code was reasonably pleasant; a competent compiler in very little code that was easy to grasp. I eventually located the bug, and discovered that it was from a change made last December as part of a pointer-related optimization, and I posted a patch that I found would fix the specific problem I had found.

The nicest part came next: once I’d posted the patch, a reasonably lively discussion between Maarten Brock, Borut Ražem and Raphael Neider came to a quick consensus about what the desired behavior in this case would be.

Then, Maarten Brock applied my patch to the project and, much to my amazement, he included a regression test that verified the desired behaviour in both the case that I had uncovered and several other cases as well.

I just want to applaud these developers for building a great compiler and running a great project.

Posted Fri May 1 13:40:50 2009 Tags:

This week, we finished up our 2009 Q1 release of the Intel driver. Most of the effort for this quarter has been to stabilize the recent work, focusing on serious bugs and testing as many combinations as we could manage.

For the last year or so, we’ve been busy rebuilding the driver, adding new ways of managing memory, setting modes and communicating between user space and the kernel. Because all of these changes cross multiple projects (X/Mesa/Linux), we’ve tried to make sure we supported all of the possible combinations. Let’s see what options we’ve got:

Mode Setting

  1. User mode. The entire output side of the driver stack is in user mode; all of the output detection, monitor detection, EDID parsing etc. This has some significant limitations, the worst of which is that the kernel has no idea what’s going on, so you cannot show any kernel messages unless the X server relinquishes control of the display. In particular, panic messages are lost to the user. If the X server crashes, the user gets to reboot the machine. A more subtle limitation is that the driver couldn’t handle interrupts, so there wasn’t any hot-plug monitor support. That’s becoming increasingly important as people want hot-plug projector support, and as systems start including DisplayPort, which requires driver intervention when the video cable gets kicked out of the machine.

  2. Kernel mode. All of that code moves into the kernel, where it is exposed both as a part of the DRI interface and also through the frame buffer APIs for use by the fb console or any other frame buffer applications. Lots of benefits here, but the development environment is entirely different from user mode, and so porting the code is a fair bit of work. Dave Airlie has some pie-in-the-sky ideas about making the kernel mode setting code run in user mode by recompiling it with suitable user-mode emulation of the necessary kernel APIs.

Direct Rendering

  1. None. In this mode, the system doesn’t support direct rendering at all, so all rendering must go through the X server. GL calls are generally implemented as a software rasterizer inside the X server.

  2. DRI1. In this mode, applications share access to a single set of front/back/depth/stencil buffers. They must carefully ensure that all of their drawing operations are clipped to the subset of each buffer that their window occupies. Each application performs its own buffer swapping by copying contents from suitable regions of each buffer. Synchronization with the X server is done through signals and a shared memory buffer. While any application is executing, all other applications are locked out of the hardware, even if they wouldn’t be conflicting.

  3. DRI2. This gives each application private back/depth/stencil buffers; they draw without taking any locks as the kernel mediates access to each object. The real front buffer for each window is owned by the window system, and so requests to draw into it are directed through the window system API, using the DRI2 extension in the case of the X window system. When applications ask to draw to the ‘front’ buffer, they get a fake buffer allocated, which operates almost exactly like the back buffer, except that copies to the ‘real’ front buffer are automatically performed at suitable synchronization points.

Memory Management

  1. X server + Old-style DRI. In this mode, the X server asks for a fixed amount of memory which is then permanently bound to the graphics aperture and treated like the memory on a discrete graphics card. The X server allocates pixmaps from this fixed pool; when that runs out, it uses regular virtual memory. It may move objects back and forth by copying them between the aperture and virtual memory.

    This mode also supports direct rendering. While the direct rendering application holds the DRI1 lock (remember that from above?), it has exclusive access to an area within the aperture which is granted to it by the X server. Pages for this area are statically allocated (by the X server). Whenever the application loses the DRI1 lock, any or all of the data stored in those pages may be kicked out, so the application must be prepared to lose the data without notice. For data like textures and vertex lists, which are generated by the application and not (generally) written by the GPU, this works fairly well; the application has a copy of the data already and can re-upload it should it disappear.

  2. GEM. Here, no pages are statically allocated for exclusive use by the graphics system. Instead, individual objects (“buffer objects”, or “bo’s”) are allocated as chunks of pages from regular virtual memory. When not in use, these objects can be paged out. Furthermore, applications aren’t limited to the graphics aperture space; they allocate from the system pool of virtual memory instead. As objects are used by applications, the kernel dynamically maps them into the graphics aperture.

2D acceleration

  1. None. The X server has a complete software 2D rendering system (fb), and if the driver doesn’t provide any accelerated drawing mechanism, the X server can use that software stack to provide all of the necessary drawing operations. While it seems like this would be terribly slow, in reality, it’s not that bad as long as the target rendering surface is present in cached memory, and not accessed through a device in write-combining or uncached modes.

  2. XAA. This is the old XFree86 rendering architecture, and heavily focuses on ‘classic’ X drawing operations, including zero-width lines, core text and even core wide lines and arcs. It does not support accelerated drawing to anything other than the screen or pixmaps which precisely match the screen pixel format. Pixmaps are allocated from a subset of the frame buffer, as if they were actually on the screen. This causes huge problems for chips which have limited 2D addressing abilities (like Intel 8xx-945 and older ATI chips) as they cannot use any memory beyond a single 2D allocation.

  3. EXA. This code was lifted from kdrive, where it was designed as a minimal graphics acceleration architecture for embedded X servers. In that original design, all pixmaps were allocated in system memory as the target systems had essentially no off-screen memory available to the graphics accelerator. In addition, with the goal of bringing up an X server quickly on simple hardware, the only accelerated operations were 2D solid fills and 2D blits. However, one key feature of that code was that it provided a uniform API for drawing with arbitrary pixel formats. As that code was moved into the core X server, it was changed so that pixmaps could be allocated from graphics memory. In addition, acceleration for the Render extension was added so that modern applications could get reasonable performance for anti-aliased text and composited images. However, it can only accelerate rendering to objects stored in graphics memory, and that memory must be pre-allocated by the X server (see the Memory Management section above). Once you run out of that memory, the X server is stuck trying to figure out what to do. This single issue has been the focus of EXA development for the last couple of years — when to move data between virtual memory and graphics memory. Objects in graphics memory are drawn fastest with the GPU and objects in virtual memory can only be drawn with the CPU. The key problem here is that reading data from graphics memory is horribly expensive, so the cost of moving an object from graphics memory to virtual memory is high. When everything is in the right memory space, EXA runs fast. When you start thrashing things around, EXA runs slow.

  4. UXA. Assume your GPU can draw to arbitrary memory. Now assume that EXA’s basic drawing operations are sound, and do a reasonable job of supporting 2D applications (as long as they fit within graphics memory). UXA comes from the combination of these two assumptions — GEM provides the first and the EXA drawing code provides the second. UXA doesn’t need any of the (ugly) pixmap ‘migration’ code because pixmaps never move — they stay in their own little set of pages and the GEM code maps them in and out of the aperture as needed. So, UXA and EXA are not far apart in style or substance, UXA simply skips the parts of EXA which are not necessary in a GEM world.

Pick One From Each Column

Now, many of the above choices can be made independently — you can use User mode setting with DRI1, classic memory management and XAA. Or you can select Kernel mode setting with DRI1, GEM and EXA. With 2 × 3 × 2 × 4 = 48 combinations, you can imagine that:

  • Some of them can’t work together
  • Some of them haven’t been tested
  • Some of them haven’t been tuned for performance
  • Some work well on i915, and poorly on 965GM
  • Others work well on 965GM and poorly on 855
  • None of them (yet) work perfectly well everywhere

Two years ago, you had a lot fewer choices: only user mode setting, none or DRI1 direct rendering, only X server memory management and only none, XAA or EXA acceleration (1 × 2 × 1 × 3 = 6 choices). Even then, choosing between XAA and EXA was quite contentious: EXA would thrash memory badly, while XAA would effectively disable acceleration for pixmaps as soon as it ran out of its (tiny) off-screen space.

In moving towards our eventual goal of a KMS/GEM/DRI2 world, we’ve felt obligated to avoid removing options until that goal worked best for as many people as possible. So, instead of forcing people to switch to brand new code that hasn’t been entirely stable or fast, we’ve tried to make sure that each release of the driver has at least continued to work with the older options.

However, some of the changes we’ve made have caused performance regressions in these older options, which doesn’t exactly make people happy — the old code runs slow, and the new code isn’t quite ready for prime time in all situations. One option here would be to stop shipping code and sit around working on the ‘perfect’ driver, to be released soon after the heat-death of the universe.

Instead, we decided (without much discussion, I’ll have to admit) to keep shipping stuff, make it work as well as we knew how, and engage the community in helping us make this fairly significant transition to our new world order. We did, however, make a very conscious choice to push out new code quickly — getting exposure to real users is often the best way to make sure you’re not making terrible mistakes in the design. The thinking was that users could always switch back to the ‘old’ code if the new code caused problems. Of course, sometimes that ‘old’ code saw fairly significant changes while the new code was integrated…

You can imagine that our internal testing people haven’t been entirely happy with this plan either — our count of bugs has been far too high for far too long, and while we spent the last three months doing nothing but fixing things, it’s still a lot higher than I’d like to see.

Performance Differences

Only a few things in the above lists have obvious performance implications — choose XAA and your performance for modern applications will suffer as it offers no acceleration for the Render extension. So, why does switching from EXA to UXA change the performance characteristics of the X server so much? The simple answer is that UXA, GEM and KMS haven’t been tweaked on every platform yet.

For example, hardware rendering performance is affected by how memory is accessed by the drawing engine. There are two ways of mapping pixels, “linear” and “tiled”. In linear mode, pixels are stored in sequential addresses all the way across each scanline, and subsequent scanlines are at ever higher addresses. A simple plan, and all of the software rendering code in the X server assumes this model. In tiled mode, rectangular chunks of the screen are stored in adjacent areas in memory; a block of 128x8 pixels forms an ‘X tile’ in the Intel hardware. Drawing to vertically adjacent pixels in this mode means touching the same page, reducing PTE thrashing compared with linear mode. For systems with a limited number of PTEs and limited caches inside the graphics hardware, tiled mode offers tremendous performance improvements. However, getting everything lined up to hit tiled mode is a pain, and on some hardware, in some configurations, it doesn’t happen, so you see a huge drop in performance.
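To make the difference concrete, here is a small sketch of the two address calculations. It assumes 32-bit pixels (so a 128x8 tile is exactly 4096 bytes), 4096-byte pages, and a surface width that is a multiple of the tile width; it ignores any address swizzling real hardware may apply. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define BYTES_PER_PIXEL 4u      /* assumption: 32 bits per pixel */
#define TILE_WIDTH_PX   128u    /* tile size from the text: 128x8 pixels */
#define TILE_HEIGHT_PX  8u
#define TILE_BYTES      (TILE_WIDTH_PX * TILE_HEIGHT_PX * BYTES_PER_PIXEL)
#define PAGE_BYTES      4096u   /* assumption: 4096-byte pages */

/* Linear layout: each scanline follows the previous one in memory. */
uint32_t linear_offset(uint32_t x, uint32_t y, uint32_t width_px)
{
    return (y * width_px + x) * BYTES_PER_PIXEL;
}

/* Tiled layout: each 128x8 block of pixels is stored contiguously,
 * with tiles laid out left to right, then top to bottom. */
uint32_t tiled_offset(uint32_t x, uint32_t y, uint32_t width_px)
{
    uint32_t tiles_per_row, tile_index, within_tile;

    assert(width_px % TILE_WIDTH_PX == 0);

    tiles_per_row = width_px / TILE_WIDTH_PX;
    tile_index = (y / TILE_HEIGHT_PX) * tiles_per_row + x / TILE_WIDTH_PX;
    within_tile = ((y % TILE_HEIGHT_PX) * TILE_WIDTH_PX + x % TILE_WIDTH_PX)
                  * BYTES_PER_PIXEL;

    return tile_index * TILE_BYTES + within_tile;
}

/* Which page a byte offset falls in. */
uint32_t page_of(uint32_t offset)
{
    return offset / PAGE_BYTES;
}
```

On a 1024-pixel-wide surface, the pixels at (0,0) and (0,1) are a full 4096 bytes apart in linear mode and so land on different pages, while in tiled mode all eight rows of a tile share one page.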

Similarly, mapping pages in and out of the GTT sometimes requires that the contents be flushed from CPU or GPU caches. Now, GPU cache flushing isn’t cheap, but we end up doing it all the time as that’s how rendering contents are guaranteed to become visible on the screen. CPU cache flushing, on the other hand, is something you’re never “supposed” to do, as all I/O operations over PCI and communication between CPU cores is cache-coherent. Except for the GPU. So, we end up using some fairly dire slow-paths in the CPU whenever we end up doing this. UXA isn’t supposed to hit cache flushing paths while drawing, but sometimes it still happens. So, you get UXA performance loss sometimes. On the other hand, failing to dynamically map objects into the GTT means that some objects don’t fit, and so EXA spends a huge amount of time copying data around, in which case EXA suffers.

The difference between DRI1 and DRI2 is due in part to the context switch necessary to get buffer swap commands from the DRI2 application to the X server which owns the ‘real’ front buffer. For an application like glxgears which draws almost nothing, and spends most of its time clearing and swapping, the impact can be significant (note, glxgears is not a benchmark, this is just one of many reasons). On the other hand, having private back buffers means that partially obscured applications will draw faster, not having to loop over clip rectangles in the main rendering loop.

The obvious result here is that we’re at a point where application performance goes all over the map, depending on the hardware platform and particular set of configuration options selected.

Light at Tunnel’s End

The good news is that our redesign is now complete, and we have the architecture we want in place throughout the system — global graphics memory management, kernel mode setting and per-window 3D buffers. This means that the rate of new code additions to the driver has dropped dramatically; almost to zero. Going forward, users should expect this ‘perfect’ combination to work more reliably, faster and better as time goes by.

Right now, we continue to spend all of our time stabilizing the code and fixing bugs. A minor but important piece of this work is to get UXA running without GEM so that we have EXA-like performance on older kernels. That should be fairly straightforward as UXA shares all of the same basic EXA acceleration code, and the EXA pixmap migration stuff works best when it works in the most simplistic fashion possible (move to GPU when drawing, move out only under memory pressure), something which we can provide in the GEM emulation layer already present under UXA.

Our overall plan is to focus our efforts on the ‘one true configuration’. The best way to do that is to work on reducing the number of supported configurations until we get to just that one. First on the block are XAA and EXA. XAA because no-one should have to use that anymore, and EXA because it’s just UXA with some pixmap management stuff we don’t need. There’s no reason UXA should be slower than EXA, once the various hidden performance bugs are fixed.

At the same time, DRI1 support will be removed. We cannot support compositing managers under DRI1, nor can we support frame buffer resize and a host of other new features. You’ll still get a desktop without DRI1, you just won’t get accelerated OpenGL. With the necessary infrastructure in the kernel and X server already released, this seems like the right time to switch off a huge pile of code.

Initial measurements from this work show that we’ll be shrinking our codebase by about 10%.

Moving beyond this next quarterly release, the remaining ‘legacy’ piece is the user mode setting code. Something like 50% of the code in the 2D driver relates to this, so removing it will rather significantly reduce our code base. You can only imagine how excited we are about this prospect.

The goal is to take the driver we’ve got and produce a leaner, faster, more stable driver in the next few releases to come.

Posted Fri Apr 24 17:12:45 2009

Now that I’ve got USB working on the cc1111, I’m feeling like it’s time to start thinking about actually building the flight software. I’ve written up some very rough ideas as a starting point based on what I know about the hardware we’ve got.

Posted Sat Feb 21 22:52:22 2009 Tags:

My good friend Bdale Garbee and I are working on a new rocket flight computer called TeleMetrum. It’s using a TI CC1111 microcontroller, which contains a digital RF transceiver along with a tiny microprocessor based on the universally loved Intel 8051.

The CC1111 has the usual array of microcontroller I/O ports: A to D converters, GPIO pins, SPI, I2C, I2S and regular serial ports. It also has a USB device controller, which is why we selected it over the otherwise identical CC1110.

I hadn’t ever written a USB device controller before, and my USB experience had been limited to writing a debug interface driver using libusb for this project, which should be the subject of another blog posting someday. USB appears to have been designed by mean people; just getting a simple two-way bytestream involves a huge pile of code.

Starting with FreeRTOS, I found a compatible USB stack written for the LPC2148 processor, lpcusb. Fortunately, that stack is fairly cleanly written, with a narrow interface between the stack and the device. I figured I could replace the LPC2148 bits with CC1111 bits and have it running in short order. Of course, nothing is as simple as it should be.

After about three weeks, I managed to get packets flowing from host to device and started to debug the USB setup stuff. All of my difficulties here relate to the slightly brain damaged way USB signals the end of Setup data flowing from device to host (IN data). This is done by sending a packet which is strictly less than the maximum size advertised by the device, in my case 32 bytes (that’s all the CC1111 can handle). If the data to be sent is a multiple of this max size, you send a zero-length packet afterwards.
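The end-of-data rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the firmware code; the function name is made up, and a real control transfer also takes the length requested by the host into account, which this sketch ignores.

```python
MAX_PACKET = 32  # maximum packet size mentioned in the text


def split_in_data(data, max_packet=MAX_PACKET):
    """Split IN data into packets, ending with a short packet.

    The last packet must be strictly shorter than max_packet, so when
    the data length is an exact multiple of max_packet (including
    zero), a zero-length packet is appended.
    """
    packets = [data[i:i + max_packet] for i in range(0, len(data), max_packet)]
    if len(data) % max_packet == 0:
        packets.append(b"")
    return packets


if __name__ == "__main__":
    for size in (18, 32, 40, 64):
        print(size, [len(p) for p in split_in_data(bytes(size))])
```

For example, 40 bytes go out as a 32-byte packet followed by an 8-byte packet, while 64 bytes need two full packets and then a zero-length one.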

The first bug was that my code simply delivered a zero length packet every time it was done sending data, in response to the ‘a packet has been delivered’ interrupt. It should have been obvious, but this ended up flooding the USB link with zero length packets. Once that was fixed, I had the first few setup packets working correctly. Next, I had to fix the code that would chunk up larger setup replies into multiple packets. With that done, I had the initial setup working correctly and the device appeared in the lsusb output.

While debugging this, I had noticed that my ISR was getting called ‘a lot’, and I found out that none of the USB interrupt status bits were on. I guessed that this meant the master USB interrupt bit was stuck on. Which confused me, as the other interrupt bits I’d played with on the CC1111 all automatically cleared themselves when the ISR was invoked. Hurray for inconsistent hardware, but it turns out that this is not true for all of the interrupts, only some of them. With the interrupts turned back off, I’ve now got a device which correctly responds to the USB setup and then sits idle:

idVendor           0xfffe 
idProduct          0x000a 
bcdDevice            1.00
iManufacturer           1 altusmetrum.org
iProduct                2 TeleMetrum
iSerial                 3 tele-0

The next task is to figure out how to send NAK packets back when the host asks for data and I have none to send. That may make it possible to send data back and forth, at which point I can write a simple command interpreter for the CC1111 so we can poke at it via USB.

All of this code is in my freertos git repository; I’ll see if the freertos or lpcusb people are interested in the code once it’s working.

Posted Wed Jan 28 23:49:00 2009 Tags:

My daughter bought me a BeeLine TX radio direction finding beacon for Christmas. The plan is to mount it inside the payload bay of various rockets so that I can find them after launch. This uses a PIC 16F688 processor and a CC1050 transmitter and sends FM-encoded beeps and Morse code ident strings. It’s tiny and runs for a long time off a Li-Po battery.

The BeeLine TX came pre-programmed to transmit at 433.920MHz and ident as ‘KD7SQG’. I wanted to move it to 440.700MHz and ident as ‘KD7SQG ROCKET’, but the configuration utility provided was Windows-only.

Fortunately, Greg Clark, the person behind Big Red Bee, released the source code for the firmware under the GPLv2 and provided full schematics as well.

I was able to read through the code and construct a simplistic programming utility, also released under the GPLv2 and available via git as beelinetx. It doesn’t do much yet, just allows the configuration of the frequency and transmitted message string.

I’d like to thank Greg for building the BeeLine TX, making the sources and schematics available and also for answering questions over email about some subtle aspects of the frequency calibration.

Now to wait for the weather to clear and go take it flying.

Posted Fri Dec 26 16:43:40 2008 Tags:

What’s up this week

The last week has certainly been entertaining. We’re quickly merging a pile of new code into the driver and trying to get everything building in one place so that people can play with stuff before we release.

Getting 2D on top of GEM

One of the big missing pieces last week was getting the 2D driver working with Pixmaps as GEM objects. This is critical as we move towards unified kernel memory management for rendering resources to allow us to use objects across multiple APIs. The most pressing need here is to enable the GLX_EXT_texture_from_pixmap extension in an efficient fashion.

So, what’s the plan then? Fairly simple; allocate GEM objects for every pixmap and then use GEM relocations to manage access to them. No need for the 2D driver to even know what’s bound to the GTT; it can treat every Pixmap exactly alike and let the kernel manage the low-level hardware details. Our experience with the 3D driver has been quite good; GEM is easy to use and reasonably efficient.

The initial thought was that we’d use EXA’s ability to forward pixmap creation back to the driver and have our driver call-back create the GEM object. However, in looking at that, it turns out to have a terrible (and incomplete) API. The driver has no say in the pixmap layout; it must use the EXA-enforced pixel organization. In a land of tiled pixmaps, that’s not OK. Further enquiry showed a wealth of other code which is useless in our uniform Pixmap environment. Damage tracking and enforced hardware synchronization are wasteful, performance-robbing activities.

Ok, so if EXA isn’t what we want, then what is? Well, I like the basic EXA acceleration plan — accelerate solid fills, copy area and the composite operation and leave everything else to software. In fact, the whole EXA drawing API is just fine, it’s just the wasteful EXA code that isn’t necessary.

UXA — the UMA Acceleration Architecture

Ok, so instead of hacking up EXA and trying to make it work for the GEM driver and existing drivers, I decided to just make it work for GEM on UMA hardware and see what it looked like. The hope is that we’ll find some way to either patch EXA or at least find a way to share the low-level rendering code between UXA and EXA. For now, UXA lives in the intel driver itself; once we figure out how we want the X server rendering infrastructure to work, we’ll merge whatever results back into the core server.

I started UXA by just copying the existing EXA code and running an edit script to change all of the names. Then, I went through the code and removed everything dealing with pixmap migration, damage computation or explicit global hardware synchronization. The only synchronization primitive left is the prepare_access/finish_access pair which signals the start and end of software drawing. The hardware driver is expected to deal with all other synchronization issues itself.
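As a rough illustration of how a prepare_access/finish_access pair brackets software drawing, here is a small Python sketch. It is not the UXA code (which is C inside the driver); `FakeDriver`, `software_access` and `software_fill` are hypothetical names, and the "pixmap" is just a bytearray.

```python
from contextlib import contextmanager


class FakeDriver:
    """Hypothetical stand-in for a hardware driver; it only records calls."""

    def __init__(self):
        self.log = []

    def prepare_access(self, pixmap):
        # A real driver would wait here for the hardware to finish with this pixmap.
        self.log.append("prepare")

    def finish_access(self, pixmap):
        # A real driver would hand the pixmap back to the hardware here.
        self.log.append("finish")


@contextmanager
def software_access(driver, pixmap):
    """Bracket software drawing with the prepare/finish pair."""
    driver.prepare_access(pixmap)
    try:
        yield pixmap
    finally:
        # Always runs, even if the software drawing raises.
        driver.finish_access(pixmap)


def software_fill(driver, pixmap, value):
    """Software fallback: fill every byte of the pixmap with value."""
    with software_access(driver, pixmap) as pixels:
        for i in range(len(pixels)):
            pixels[i] = value
```

The point of the pairing is that only the objects being drawn need to be synchronized, and the end-of-access call cannot be forgotten on an error path.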

Oddly, GEM does rendering synchronization automatically when rendering with the hardware, and provides simple primitives to provide for software fallbacks. The key here is that we never need to idle the whole chip, we only need to wait for it to finish working on whatever objects are currently being drawn with. The goal is to avoid artificial serialization.

The result is less than 5000 lines of code, as compared to EXA which has about 7500 lines.

Yeah, but does it work?

The short answer is “Yes, it works”. The longer answer is “Yes, with limitations”. The biggest limitation right now is that GEM objects can only be mapped directly by the CPU. For lots of operations, this is exactly what you want; a fully cached view into the objects as it offers full performance for CPU-bound rendering operations.

However, it has one performance problem and one functional limitation.

The performance problem is that using the CPU cache with these objects means flushing the CPU cache whenever switching between CPU and GPU rendering. CPU cache flushing is horribly expensive, enough so that it’s often far better to take the huge performance penalty of using un-cached reads if the number of reads is small.

Yes, we could create write-combining PTEs for this direct mapping, but constructing write-combining PTEs is also really expensive as that involves flushing those PTEs from every CPU TLB, which requires an inter-processor interrupt. Of course, you can’t just create a write-combining PTE, you have to make sure that the page it maps is not in any CPU cache, so you have to perform a CPU cache flush as well.

Someday maybe this won’t be true; there are plans afoot within the Linux kernel to make this reasonably efficient. Perhaps this will happen before we get our flying cars.

So, it’s a performance problem; we can deal with that.

Tiled Surfaces

What we can’t deal with is how tiled surfaces work under a CPU map. A normal surface has an entire scanline mapped to a linear section of memory. This places vertically adjacent pixels a fair distance apart in memory. Drawing a vertical line means touching two different cache lines and two different pages. Even a large cache and TLB will not help much if you draw tall objects. Tiled surfaces arrange for nearby screen pixels to be nearby in memory, usually by constructing the surface from a set of rectangular page-sized tiles. Vertically adjacent pixels will then be in the same page, and can even be in the same cache line in some cases.

The performance benefits for tiled surfaces are obvious; fewer cache and TLB misses. The cost to the hardware is fairly small; just some gates to stir addresses around when fetching and storing pixels. However, the cost to software is fairly large; computing the address of a pixel now involves some fairly ugly computation.

We already managed to make Mesa deal with tiled surfaces. That was fairly easy as Mesa has a single span-based pixel fetch and store architecture. Write new span accessing functions and the rest of the sw rendering code just works.

Fixing the X server 2D software rendering code is another matter entirely — there’s a lot of it, and it all wants to touch memory in a linear fashion. Aaron Plattner from nVidia actually did go and whack fb to make it work; every pixel fetch or store goes through a function call which is passed the nominal linear address of the pixel. These accessor/setter functions then munge that address into the actual tiled address. However, that’s yet another huge performance impact for software rendering.
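The accessor idea can be sketched as follows: callers keep computing ordinary linear addresses, and every fetch or store passes through a function that translates the address into the tiled layout. This Python sketch is not the fb wrapper code; the names are hypothetical, the 128x8 tile and 4 bytes per pixel are assumptions, and the tile layout is a simplified one without hardware swizzling.

```python
TILE_W, TILE_H = 128, 8            # assumed tile size in pixels
BPP = 4                            # assumed bytes per pixel
TILE_BYTES = TILE_W * TILE_H * BPP


def linear_to_tiled(addr, stride):
    """Translate a nominal linear byte address into a tiled byte address."""
    y, x_bytes = divmod(addr, stride)
    x, sub = divmod(x_bytes, BPP)          # sub is the byte within the pixel
    tiles_per_row = stride // (TILE_W * BPP)
    tile = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within = ((y % TILE_H) * TILE_W + (x % TILE_W)) * BPP
    return tile * TILE_BYTES + within + sub


class TiledSurface:
    """Memory stored in tiled order, accessed through linear addresses."""

    def __init__(self, width, height):
        if width % TILE_W or height % TILE_H:
            raise ValueError("surface must be a whole number of tiles")
        self.stride = width * BPP
        self.mem = bytearray(self.stride * height)

    def store(self, addr, value):
        # Every store pays for one address translation.
        self.mem[linear_to_tiled(addr, self.stride)] = value

    def fetch(self, addr):
        # Every fetch pays for one address translation.
        return self.mem[linear_to_tiled(addr, self.stride)]
```

Even in this toy version, each byte touched costs a function call plus several divisions, which gives a feel for why this approach hurts software rendering performance.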

Hardware De-Tiling

A better solution is to just use the hardware. When a tiled surface is bound to the GTT, it is visible to everyone using linear addresses; those addresses are swizzled in the hardware and head out to memory in tiled form. There’s no performance benefit from the CPU as its TLBs and caches all see the linear address, but it doesn’t have to deal in a non-linear space.

The second benefit of the GTT map is that it lives under a write-combining MTRR, so all accesses to memory are write-combining and not write-back. This eliminates all of the CPU cache coherence issues and leaves us back with the old performance that we know and love — fast writes and really slow reads, but no penalty for switching rapidly between GPU and CPU.

What’s Next?

So, the basic Pixmaps-in-GEM code is up and running in the gem-pixmap branch of my driver repository, git://people.freedesktop.org/~keithp/xf86-video-intel. The next step will be to integrate Carl Worth’s 965 render changes which place all of the temporary data that it uses into GEM objects as well. That will finish the DRI2 enabling work and allow us to provide zero-copy texture-from-pixmap support.

However, before that can really go main-stream, we need to get the GTT object mapping to fix tiled surface support and get back some performance lost to the CPU cache flushing. We’ll see if Kristian is ready with DRI2 tomorrow, if not, I’ll probably spend the day figuring out enough additional parts of the Linux MM code to get my GTT maps working.

Posted Tue Aug 5 23:28:32 2008

Ok, so I didn’t get a lot of time for coding last week. And, this week there’s OSCON, so coding time will be short again. I figured I should spend some time writing up a brief report about where X output is at today.

Output Hotplug

I think this stuff is fairly solid these days, although we don’t have much in the way of auto-detection of monitor connect/disconnect. There are two reasons here:

  1. The hardware notifies the operating system via an interrupt. Given mode setting code in user space, dealing with interrupts is a huge pain and hence hasn’t been hooked up yet (see below).

  2. Analog outputs (VGA, TV) do detection using impedance changes in the output signal path. This means we have to keep them active if we want to detect a connection. That takes a lot of power (about 1W to light up the VGA output without a monitor connected). What we could do is detect when a monitor was unplugged; that’s free.

There are a few other random improvements that are coming soon, like CEA additions to the EDID parsing code. These are additional data blocks that follow the standard EDID data and are used for ‘consumer electronics’ devices. Supporting these should make more HDMI monitors ‘just work’.

Initial Mode Selection

Detecting connected monitors is fine, but one thing we haven’t really solved is what to do when you have more than one connected when the server starts. My initial code would pick one ‘primary’ monitor, light that up at its preferred size and then pick modes for the other monitors which were as close as possible to the primary monitor size without being larger. Obviously, I liked that as it meant my laptop always came up looking correct on the LVDS and my external VGA would show most of the screen.

However, this was reported to confuse a lot of users. I can imagine that starting the X server with one of the outputs connected but not turned on would make for some ‘interesting’ support calls. So, now the X server picks a mode which all outputs can support and uses that everywhere. Sadly, this means that my laptop panel gets some random scaled mode (usually 1024x768) which looks quite awful.

I think we need something better than either of these choices, but I’m not quite sure how it should work.
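The two strategies described above can be sketched in Python to show how they differ. This is an illustration, not the X server code: the function names and the mode lists are made up, "closest without being larger" is interpreted as the largest-area mode that fits inside the primary mode, and each output's preferred mode is assumed to be listed first.

```python
def area(mode):
    """Pixel count of a (width, height) mode."""
    return mode[0] * mode[1]


def common_mode(outputs):
    """Newer behaviour: the largest mode that every output supports."""
    if not outputs:
        return None
    shared = set.intersection(*(set(modes) for modes in outputs.values()))
    return max(shared, key=lambda m: (area(m), m)) if shared else None


def primary_based(outputs, primary):
    """Older behaviour: primary at its preferred mode, others fit inside it."""
    pw, ph = outputs[primary][0]          # preferred mode assumed to be first
    chosen = {primary: (pw, ph)}
    for name, modes in outputs.items():
        if name == primary:
            continue
        fitting = [m for m in modes if m[0] <= pw and m[1] <= ph]
        chosen[name] = max(fitting, key=lambda m: (area(m), m)) if fitting else None
    return chosen


if __name__ == "__main__":
    # Hypothetical mode lists for a laptop panel and an external monitor.
    outputs = {
        "LVDS": [(1440, 900), (1024, 768), (800, 600)],
        "VGA": [(1600, 1200), (1280, 1024), (1024, 768), (800, 600)],
    }
    print(common_mode(outputs))
    print(primary_based(outputs, "LVDS"))
```

With these made-up lists, the common-mode strategy puts both outputs at 1024x768, while the primary-based strategy keeps the panel at its native 1440x900 and gives the VGA output 1024x768.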

Kernel Mode Setting

A bunch of people, including Jesse Barnes and Dave Airlie, have been hacking to move the output configuration code into the kernel. This will solve lots of little problems, like how to display kernel panic messages, and how to deal with interrupts for output hotplug.

This code is up and running fairly well these days, but depends on a kernel memory manager to deal with frame buffers. The integration of GEM into the kernel is blocking this work, but I’m hopeful that this will be sorted out in the next couple of weeks.

GEM — the Graphics Execution Manager

Work here was stalled for a few weeks while we sorted out memory channel interleaving issues. Now things are moving again, and we’re working on getting it stable enough to merge into master. That means fixing a few more critical bugs that the Intel QA team has identified.

One of these bugs is that our GL conformance tests weren’t working right; that turned out to be caused by tests reading back data from the frame buffer one pixel at a time. Our read-back path passed through the GEM memory domain code to pull objects back from GTT space to CPU space. That meant flushing the front, back and depth buffers from the CPU cache. With each of those at 16MB, reading a single pixel took long enough that the tests would time-out. Increasing the timeouts to ‘way too long’ is making them run, but tests which would complete in a few hours are now taking days.

We’ve got two different plans for fixing the read-back path:

  1. Use pread to access precisely the data we need. This would involve flushing a single cache line for the tests above.

  2. Mapping the back buffer through the GTT. This would eliminate the need to clflush anything as the GTT mappings are write combining and so reads bypass the cache.

Eric is working on the former, and I’m working on the latter. More news later (this week?) when we see which one wins.
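A back-of-the-envelope calculation shows why the single-pixel read-back is so slow and why the pread plan helps. The three 16MB buffers come from the text; the 64-byte cache line and the 4-byte pixel are assumptions, and 16MB is taken to mean 16 * 1024 * 1024 bytes.

```python
CACHE_LINE = 64                    # assumed cache line size in bytes
BUFFER_BYTES = 16 * 1024 * 1024    # each buffer is 16MB, from the text
BUFFERS = 3                        # front, back and depth


def lines_flushed_full():
    """Cache lines flushed when all three buffers are flushed whole."""
    return BUFFERS * BUFFER_BYTES // CACHE_LINE


def lines_flushed_pread(offset, nbytes=4):
    """Cache lines covering nbytes starting at a byte offset."""
    first = offset // CACHE_LINE
    last = (offset + nbytes - 1) // CACHE_LINE
    return last - first + 1


if __name__ == "__main__":
    print(lines_flushed_full())        # 786432
    print(lines_flushed_pread(0))      # 1
```

Under these assumptions, reading one aligned pixel drops from flushing 786432 cache lines to flushing one.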

Composite Acceleration

With Owen Taylor’s change to the glyph management code in the server, Eric and Carl were able to change the driver to batch multiple glyph drawing operations into a command buffer. Once Carl had this working, we went from 13000 glyphs/sec to 103000 glyphs/sec. Obviously we’re hoping for even larger improvements as a pure software solution is well over 1 million glyphs/sec. Even still, 103000 glyphs/sec is enough to make my desktop vastly more usable, and using the software path means losing a lot of other useful acceleration.

DRI2 — Redirected Direct Rendering

Right now, direct rendered GL applications (which is the fastest way we can do GL at present) get drawn to a giant screen-sized back buffer and then copied from there to the screen at swap buffers time. Because everyone shares the same back buffer, you get to clip your drawing as if you were drawing directly to the screen. While this normally doesn’t matter much (aside from some performance costs associated with lots of clip rectangles), when you’re running a compositing manager (like compiz), the 3D applications end up ignoring the per-window offscreen pixmap and spam their output directly to the real frame buffer.

DRI2, written by Kristian Høgsberg, solves this by changing how direct rendering works and giving everyone a private back-buffer to draw to. Now, at buffer swap time, that private back-buffer can be copied to the window’s pixmap and compiz is happy.

This work has been around for a few months, but depends on a TTM-based memory manager. That dependency isn’t very strong, and krh has promised to fix it shortly. Once that’s done, getting the GEM driver to support DRI2 won’t take long, and we’ll have our fully composited desktop running. With luck, that’ll happen before September.

Final Words

As you can see, we’re nearing the end of our long X output rework saga, with most of the pieces falling into place in the next month or two.

Posted Mon Jul 21 18:25:43 2008
