A tempest in a tty pot

By Jonathan Corbet
July 29, 2009

There are dark areas of the kernel where only the bravest hackers dare to tread. Places where the code is twisted, the requirements are complex, and everything depends on ancient code which has seen little change over the years because even the most qualified developers fear the consequences. Arguably, no part of the kernel is darker and scarier than the serial terminal (TTY) code. Recently, this code was getting a much-needed update, but it now appears that a disconnect within the community has brought that work to a halt and thrown TTY back into the “unmaintained” column – at a time when that code has known regressions in the 2.6.31-rc kernel.

At first glance, the TTY layer wouldn’t seem like it should be all that challenging. It is, after all, just a simple char device which is charged with transferring byte-oriented data streams between two well-defined points. But the problem is harder than it looks. Much of the TTY code has roots in ancient hardware implementing the RS-232 standard – one of the loosest, most variable standards out there. TTY drivers also have to monitor the data stream and extract information from it; this duty can include ^S/^Q flow control, parity checking, and detection of control characters. Control characters may turn into out-of-band information which must be communicated to user space; ^D may become an end-of-file when the application reads to the appropriate point in the data stream, while other characters map onto signals. So the TTY code has to deal with complex signal delivery as well – never a path to a simple code base. Echoing of data – possibly transforming it in the process – must be handled. With the addition of pseudo terminals (PTYs), the TTY code has also become a sort of interprocess communication mechanism, with all of the weird TTY semantics preserved. The TTY code also needs to support networking protocols like PPP without creating performance bottlenecks.
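To make the control-character and data-transformation duties a little more concrete, here is a minimal user-space sketch in Python. It is an illustration only, not kernel code: the function name `describe_pty` is made up for this example, and the values it reports depend on the platform's default terminal settings. It opens a PTY pair, reads back the characters currently assigned to the interrupt, end-of-file, stop, and start roles, and then writes a line through the slave side to see what reaches the master.

```python
import os
import pty
import termios


def describe_pty():
    """Open a fresh PTY pair and report a few line-discipline settings.

    Hypothetical helper for illustration; returns (settings, received).
    """
    master_fd, slave_fd = pty.openpty()
    try:
        attrs = termios.tcgetattr(slave_fd)
        oflag = attrs[1]   # output mode flags
        cc = attrs[6]      # table of control characters
        settings = {
            "intr": cc[termios.VINTR],    # commonly ^C, mapped to SIGINT
            "eof": cc[termios.VEOF],      # commonly ^D, mapped to end-of-file
            "stop": cc[termios.VSTOP],    # commonly ^S, pauses output
            "start": cc[termios.VSTART],  # commonly ^Q, resumes output
            "onlcr": bool(oflag & termios.ONLCR),
        }
        # Write a line into the slave side and collect what the
        # master side sees after output processing.
        os.write(slave_fd, b"hello\n")
        received = b""
        while not received.endswith(b"\n"):
            received += os.read(master_fd, 1024)
        return settings, received
    finally:
        os.close(slave_fd)
        os.close(master_fd)


if __name__ == "__main__":
    settings, received = describe_pty()
    print(settings)
    print(repr(received))
```

When the ONLCR output flag is set, the newline written on one side arrives at the other as a carriage return followed by a newline, which is a small example of data being transformed in transit.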

All told, it’s a complicated problem. It is also a problem which seems to interest relatively few developers. The top of drivers/char/tty_io.c still reads “Copyright (C) 1991, 1992, Linus Torvalds.” Much of the code is still dependent on the big kernel lock. There are deadlocks and race conditions to be found. Almost nobody wants to touch it, but it still mostly works.

Alan, you are a true wizard :-) The tty layer is one of the very few pieces of kernel code that scares the hell out of me :-)
Ingo Molnar, July, 2007

In recent times, though, an energetic TTY maintainer has stepped forward: Alan Cox. One could almost hear the sighs of relief across the net when this happened; if anybody could clean out that particular set of Augean Stables, it would certainly be Alan, who has the combination of technical skill and attention to detail needed to avoid breaking things. Over the last year, it has been clear that fixing the TTY code has stressed even Alan’s skills; the work has been slow and apparently laborious. But it has also been successful at getting the TTY code into better shape while preserving it as a functioning subsystem.

At least, that was the case until 2.6.31, where the combination of significant changes and some last-minute tweaks led to regressions. Users started to report that the kdesu application stopped working. The emacs compile mode started losing output. And so on. It turns out that there were a few separate bugs, not all of which were in the tty layer:

  • The problem with kdesu appears to be a KDE bug; the application would read too much data, then wonder why the next read didn’t have what it wanted. This code worked with the older TTY code, but broke with 2.6.31. There is probably no way to fix it which doesn’t saddle the kernel with maintaining weird legacy bug-compatibility code – something the TTY layer does not need more of.
  • The emacs problem is different. In this case, the compile process would finish its work (writing its final output to the PTY) and exit. Emacs would try to read that final output, but would get a failed read resulting from the SIGCHLD signal sent by the exiting compile process. That failure was unexpected and caused emacs to drop the data. In essence, emacs expected that, by the time the compile process had completed its close() of the PTY file descriptor, the data written to that descriptor had been pushed through to the other end and would be available for reading. The 2.6.31 changes broke that assumption.

The second problem results from the complex nature of TTY data processing. It’s not just a serial stream of data; instead, there is the line discipline processing in the middle. In 2.6.31, data written to a PTY will have been queued up for line discipline attention by the time a close() is allowed to complete, but there’s no assurance that the line discipline code will have actually run and passed the data through to the other end. So the SIGCHLD signal can pass the data and arrive first.
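As a rough illustration of the ordering issue, the Python sketch below shows one way a reader of a PTY could avoid depending on the order in which the child's exit notification and its final output arrive: it keeps reading from the master side until the kernel reports that the other end is gone, and only then collects the exit status. This is an assumption-laden sketch, not what emacs does and not the kernel fix discussed here; the function name `run_on_pty` is hypothetical, and the treatment of `EIO` as end of stream reflects Linux behavior.

```python
import errno
import os
import pty


def run_on_pty(argv):
    """Run argv on a new PTY; return (output_bytes, wait_status).

    Hypothetical helper for illustration only.
    """
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: stdin, stdout, and stderr are now the PTY slave.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    chunks = []
    while True:
        try:
            # Since Python 3.5, os.read() is retried automatically when
            # interrupted by a signal whose handler does not raise.
            data = os.read(master_fd, 4096)
        except OSError as exc:
            if exc.errno == errno.EIO:
                break  # Linux reports EIO once the slave side is closed
            raise
        if not data:
            break  # other systems may report a plain end-of-file
        chunks.append(data)

    os.close(master_fd)
    # Reap the child only after the output has been drained.
    _, status = os.waitpid(pid, 0)
    return b"".join(chunks), status


if __name__ == "__main__":
    output, status = run_on_pty(["echo", "final line"])
    print(repr(output), status)
```

The design choice is simply to treat end of stream on the PTY, rather than the arrival of SIGCHLD, as the signal that no more output is coming.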

Alan thinks this behavior is reasonable; it complies with the applicable standards and can be implemented in a relatively straightforward way. Making a close() on a PTY block until the other end has received the data might make emacs work better, but it also risks deadlock if both sides write data and close their file descriptors at the same time. Even so, Alan posted an “elegant in all the wrong ways” patch which fixed the problem, but also made it clear that he thought emacs was buggy and that the real fix belonged there.

Linus merged a version of this patch, but he was not happy about it. He believes that emacs is correct in its assumptions, and would like to see a better fix which makes the ordering of events clear and deterministic. He made his frustration clear:

Why? Why blame emacs? Why call user land buggy, when the bug was introduced by you, and was in the kernel? Why are you fighting it? Why did it take so long to admit that all the regressions were kernel problems? Why were you trying to dismiss the regression report as a user-land bug from the very beginning?

At that point, it was Alan’s turn to express frustration; he did not hold back:

I’ve been working on fixing it. I have spent a huge amount of time working on the tty stuff trying to gradually get it sane without breaking anything and fixing security holes along the way as they came up. I spent the past two evenings working on the tty regressions.

However I’ve had enough. If you think that problem is easy to fix you fix it. Have fun.

The message included a patch removing Alan as the maintainer of the TTY layer.

And that is where things stand, as of this writing. The TTY code is unmaintained again, a promising rework has halted partway through, and the person most qualified to fix the problems has thrown up his hands and left the building (though it should be noted that he is participating in the conversation on how the next maintainer, whoever that might be, can fix things). Kernel development will go on, but development in this area will go rather more slowly; the TTY layer has claimed another victim.



A tempest in a tty pot

Posted Jul 29, 2009 16:36 UTC (Wed) by Baylink (subscriber, #755) [Link]

With all due respect to Linus, I don’t think it’s the tty layer that claimed this victim… I think it’s Linus.
Sure, this is the NFL, wear your pads, and all that crap.

As several people just pointed out to me, rather strenuously, in the sexism thread, that’s really not a good enough answer. My experience of Alan is not that he’s a shrinking violet; perhaps Linus could have approached that in a… slightly more mellow fashion?

I happen to be on the “if there’s really a ‘right’ way for the kernel to do it, and that breaks user apps that made questionable assumptions, then break them” side, myself, but I don’t think it’s black and white, either.

Particularly in this case, where the odds of that kernel getting into distributions are effectively non-existent, I don’t think this was resolved in the best way for all concerned. Though perhaps that observation is obvious enough to be puerile…

A tempest in a tty pot

Posted Jul 29, 2009 20:33 UTC (Wed) by AlexHudson (subscriber, #41828) [Link]

I think you’re basically right.
In other areas of the kernel you wouldn’t notice – people have thrown up their hands in places like IDE even after substantial work, it gets chucked out, and other people willingly step up.

It means that there is really very little compelling reason to ‘play nice’. Until it comes to keeping the completer-finisher types for places like tty…

A tempest in a tty pot

Posted Jul 29, 2009 17:07 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

Would it not make some sense to move the tty code out into a CUSE driver which would itself access the console driver, possibly modified to suit the user-space driver’s needs? I am sure that all this stuff is not so performance-critical that it needs to be in the kernel, or that it would not cope with a couple of extra context switches.

A tempest in a tty pot

Posted Jul 29, 2009 17:36 UTC (Wed) by BrucePerens (subscriber, #2510) [Link]

The console must work during boot, when the kernel panics, etc.

A tempest in a tty pot

Posted Jul 29, 2009 18:23 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

I didn’t talk about moving the console driver into user space, just the tty driver – I actually suggested that the tty driver could use the console driver.

A tempest in a tty pot

Posted Jul 29, 2009 18:31 UTC (Wed) by BrucePerens (subscriber, #2510) [Link]

Yeah, you don’t really need the tty discipline for the console. But you do need the ANSI terminal to framebuffer code, and the SysReq code for the keyboard, and both of those have to work when the kernel isn’t running processes. So, you end up not having too much that can be moved out of the kernel.

A tempest in a tty pot

Posted Jul 29, 2009 19:35 UTC (Wed) by michaeljt (subscriber, #39183) [Link]

Why the ANSI terminal to framebuffer code? I would have thought that simple line printing, along with a couple of ioctls for moving the cursor and selecting colours would be enough. And of course all the virtual terminal stuff is not needed either for printing kernel messages.

A tempest in a tty pot

Posted Jul 29, 2009 19:41 UTC (Wed) by BrucePerens (subscriber, #2510) [Link]

Well, the color is set with ANSI codes. And there’s a lot of code that backspaces over itself and reprints while indicating something in process.

So, you end up with a reasonably large piece of function-duplicating code if you split the kernel console and a more elaborate user console.

A tempest in a tty pot

Posted Jul 29, 2009 19:57 UTC (Wed) by nix (subscriber, #2304) [Link]

And you *do* need the line discipline code for booting on serial consoles,
serial logging of panics, and so on.

A tempest in a tty pot

Posted Jul 30, 2009 8:18 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

Does that require the line discipline code to be in the kernel, or would running it from the boot loader to set up the line suffice?

A tempest in a tty pot

Posted Jul 30, 2009 8:21 UTC (Thu) by michaeljt (subscriber, #39183) [Link]

Not that I can immediately think of a single argument as to why having the code in the boot loader would be better than having it in the kernel…

A tempest in a tty pot

Posted Jul 29, 2009 17:07 UTC (Wed) by iabervon (subscriber, #722) [Link]

There’s the additional oddity that TTYs are, abstractly, a non-local communications channel with limited bandwidth and non-zero latency. The other end of a serial connection could send data and signal it was done, and that signal could actually arrive while the data was propagating down the copper (or the RS232 interrupt could have lower priority). The two ends of a real TTY are different observers and can disagree on the ordering of events. PTYs pretend that there’s magical ideal hardware connecting the two ends. Emacs makes the assumption that this ideal hardware is perfectly fast, such that no other signals can arrive while the data is on the wire. The new TTY code allows other stuff to be faster than the receiving side virtual serial port decoding. In particular, SIGCHLD can arrive while there’s data that has been buffered for decoding but not yet decoded, because the sending side doesn’t wait and the receiving side has a step before the data registers as available. If this were a pipe or a socket, it would be there already, but a real TTY wouldn’t be so certain, and a PTY is underspecified between these models.

A tempest in a tty pot

Posted Jul 29, 2009 17:19 UTC (Wed) by allesfresser (subscriber, #216) [Link]

By the way, happy birthday, Alan. :)

A tempest in a tty pot

Posted Jul 29, 2009 17:56 UTC (Wed) by nowster (subscriber, #67) [Link]

The opening of this article reminded me of the opening of an episode of HHGTTG:

… when men were real men, women were real women, and large beardy creatures from Alffa Sentŵri (near Swansea) were real large beardy …

GNU + Linux = broken development model

Posted Jul 29, 2009 17:29 UTC (Wed) by mheily (subscriber, #27123) [Link]

This illustrates why dividing the kernel and the userland into two totally separate projects is a broken development model that doesn’t scale. I’m not trying to start a flamewar by bringing up BSD, but I’ve always thought that their approach of releasing a complete operating system allows you to make major infrastructure improvements that cross the boundary between kernel and userland.
If one of the BSDs wanted to fix their TTY code, they would coordinate the changes in the kernel, core userland, and ports. This allows you to change existing bad behavior with minimal risk of breakage, since users are encouraged to upgrade their entire operating system (kernel, userland, and ports) at the same time.

Perhaps there should be a “reference implementation” of a complete GNU/Linux operating system that is blessed by both Richard Stallman and Linus Torvalds, and where extra effort is made to ensure that all the pieces work together. Existing distributions would be free to fork this implementation and add their own differentiating features, like Ubuntu forks Debian every six months and polishes it up for desktop users.

GNU + Linux = broken development model

Posted Jul 29, 2009 17:39 UTC (Wed) by adicarlo (subscriber, #8463) [Link]

Tighter coupling between kernel and userland is better? You gotta be kidding me.
That means you gotta run the kernel your distributor provides you. Want that new driver for that wifi card or USB dongle? Too bad.

The GNU + Linux development model is proven and it’s winning.

GNU + Linux = broken development model

Posted Jul 29, 2009 18:27 UTC (Wed) by mheily (subscriber, #27123) [Link]

> Tighter coupling between kernel and userland is better? You gotta be kidding me. That means you gotta run the kernel your distributor provides you. Want that new driver for that wifi card or USB dongle? Too bad.
Ever heard of loadable kernel modules? They let you run experimental code that isn’t shipped with the distributor’s kernel. Of course, it helps to have a stable kernel API, but the Linux devs have made their distaste for interface stability perfectly clear.

GNU + Linux = broken development model

Posted Jul 29, 2009 21:34 UTC (Wed) by adicarlo (subscriber, #8463) [Link]

>> Tighter coupling between kernel and userland is better? You gotta be kidding me. That means you gotta run the kernel your distributor provides you. Want that new driver for that wifi card or USB dongle? Too bad.
> Ever heard of loadable kernel modules? They let you run experimental code that isn’t shipped with the distributor’s kernel. Of course, it helps to have a stable kernel API, but the Linux devs have made their distaste for interface stability perfectly clear.

Sure I’ve heard of them. Good luck loading ath5k.ko from 2.6.30 in your vendor-supplied 2.6.26 kernel tho.

GNU + Linux = broken development model

Posted Jul 29, 2009 22:43 UTC (Wed) by jond (subscriber, #37669) [Link]

To be fair to the parent (who I also disagree with) that particular problem only exists due to the lack of a stable kernel API.

GNU + Linux = broken development model

Posted Jul 29, 2009 23:52 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]

And the main reason Microsoft sucks rocks, despite all of the brilliant people who work there, is stable ABIs. For everything that they’ve ever done. No matter how stupid it was. In many cases, even things that they never documented can never be broken, because critical apps (including Microsoft apps) depend on this undocumented behavior. They are crushed under the weight of all that legacy.

GNU + Linux = broken development model

Posted Jul 30, 2009 1:28 UTC (Thu) by mikov (subscriber, #33179) [Link]

I am sorry but that is simply wrong.
I have written and maintained several large and very non-standard Windows kernel drivers on my own through the years. They worked with minimal effort (if any) across NT4, W2K and WXP (I switched to Linux development before Vista was out). This stability was an immense help to me.

I would never have had enough time to do it if I had to track incompatible kernel changes every few months. Plus, of course the changes are never clearly documented and the only way to find them is by observing broken compilation or worse broken behavior. Or read LWN, of course.

This is madness! It takes a full time job just to track the kernel, let alone have time to do other constructive stuff. This is not a sustainable development model and I think that reality has proven me right. Nobody can afford to waste this kind of resources on Linux and … nobody is.

No software vendor (Red Hat, Novell, etc) tests _all_ or even _most_ of the drivers which are included in the kernel for every new release. Most smaller hardware vendors in turn can’t afford to employ kernel developers just for this so they don’t test their own products either.

Compare this to Windows where a company can hire a one time consultant to write a driver and can be reasonably sure that the driver will continue to work for the foreseeable future without _any_ maintenance.

The argument for including stuff in the mainline kernel is of course completely ridiculous:

- One, it would be extremely hard, probably impossible, to include a driver for custom hardware which is not widely available.

- Two, including it in the kernel doesn’t guarantee anything more than successful compilation in future versions. As I said, nobody tests the drivers before a kernel release.

In the end it is much easier to write a new Linux kernel driver, but I suspect that in the long term it is more expensive than Windows to maintain.

GNU + Linux = broken development model

Posted Jul 30, 2009 2:10 UTC (Thu) by foom (subscriber, #14868) [Link]

> - One, it would be extremely hard, probably impossible, to include a driver for custom hardware which is not widely available.

Apparently not actually true, there are such drivers in the kernel, and the kernel devs seem to be happy to include them. (I too find this surprising and a bit strange…)

> - Two, including it in the kernel doesn’t guarantee anything more than successful compilation in future versions. As I said, nobody tests the drivers before a kernel release.

Same thing is true of windows drivers — they promise not to break the ABI, but microsoft isn’t going to test your little driver for custom hardware with their new OS…sure it might still load, but then maybe it’ll immediately crash. There’s no guarantee against that!

GNU + Linux = broken development model

Posted Jul 30, 2009 2:53 UTC (Thu) by mikov (subscriber, #33179) [Link]

> Same thing is true of windows drivers — they promise not to break the ABI, but microsoft isn’t going to test your little driver for custom hardware with their new OS…sure it might still load, but then maybe it’ll immediately crash. There’s no guarantee against that!

No, it is not the same thing by far. While Microsoft will not test my driver (unless I have submitted it for WHQL, which BTW is much easier for a business than submitting to the Linux kernel), the amount of time I would have to spend making sure it works is minuscule by comparison. Plus:

  • Microsoft at least tests all the drivers that are shipped with the OS. This is a huge difference with any Linux distribution.
  • The above alone guarantees a very high level of API compatibility.
  • They don’t ship new kernels with frequency anywhere near Linux.

Even if a 100% reliable stable API is not realistic, aiming for one is a really immense help for developers and businesses alike. Especially for smaller businesses where the need for an extra high salaried employee would make a huge difference.

In my opinion it would be sufficient if Linux maintained a source-level API with explicit versioning, and with explicit documentation for each change. Change the API at every release if you like, but increment the version and maintain a formal documentation of the changes (and I don’t mean a GIT log).

Sigh… Alas, that will never happen. I am convinced that it is one of the big reasons why we will never see mass adoption of Linux on the desktop. Specialized huge volume devices with fixed hardware like phones, netbooks, game consoles – perhaps, but a PC – not until this changes.

GNU + Linux = broken development model

Posted Jul 30, 2009 4:24 UTC (Thu) by rahvin (subscriber, #16953) [Link]

> They don’t ship new kernels with frequency anywhere near Linux.

Why not admit that this is what you are complaining about, not the lack of an API? The fast moving kernel is the issue that makes it difficult to track, not the lack of an API as you state. The simple solution is to do what many businesses do and develop for kernel releases shipped with distributions. RHEL, Debian, Ubuntu, etc… all ship with specific kernels, there is often little need to track the newest kernel for out of kernel drivers.

GNU + Linux = broken development model

Posted Jul 30, 2009 5:04 UTC (Thu) by mikov (subscriber, #33179) [Link]

Sigh. You are missing the point.
The same Windows kernel driver can work from NT 3.51, released in 1995, to at least Windows XP, and for many classes of drivers Vista and Windows 7. During that time the Windows kernel has undergone numerous fundamental improvements (plug and play, power management, etc), and yet the core kernel _concepts_ have been kept stable. Can you point me to a Linux distribution that maintains a kernel for 15 years?

The frequency of kernel releases would be immaterial if the API was stable or the changes formally tracked and documented. The latter is actually preferable because it wouldn’t stop the innovation.

Oh, yes, the commercial distributions are trying to maintain a stable kernel for some time (although far from 15 years) – of course they would; this is what the vast majority of businesses and users really want and need regardless of what the stupid document “stable_api_nonsense.txt” says. Alas, it is actually causing further incompatibilities and problems between distributions – everybody has a different version, with different patches, etc. It is a nightmare.

Sometimes I cynically suspect that this is on purpose, or otherwise expensive support contracts wouldn’t be that necessary.

GNU + Linux = broken development model

Posted Jul 30, 2009 8:12 UTC (Thu) by farnz (subscriber, #17727) [Link]

> The same Windows kernel driver can work from NT 3.51, released in 1995, to at least Windows XP, and for many classes of drivers Vista and Windows 7. During that time the Windows kernel has undergone numerous fundamental improvements (plug and play, power management, etc), and yet the core kernel _concepts_ have been kept stable. Can you point me to a Linux distribution that maintains a kernel for 15 years?

I’m sorry, but practical experience tells me you’re wrong. I had a driver for a SCSI card that was fine in NT 4, fine in 2000, did not even load on XP; when I followed instructions to force load the driver, it immediately bluescreened. I’ve since thrown the card out, because the chip wasn’t supported in Linux, either.

Oh, and the card had different drivers for NT 3.51; those didn’t work with 2000 or XP (I had no access to NT 4 to check if the NT 3.51 driver worked there).

So, throwing it back at you, can you show me a driver written for NT 3.51 that does something involving DMA (e.g. a SCSI driver, an IDE driver, a TV card driver) that still works fine in XP? If not, what about one written for NT 4?

GNU + Linux = broken development model

Posted Jul 30, 2009 14:46 UTC (Thu) by Richard_DCS (subscriber, #56565) [Link]

This sounds like a nice theory though I’ve seen lots of counter examples, so I don’t think it works.
Also you miss the point that most users are making the transition from 9x to XP 32bit and then onto Windows 7 64bit in the future. That path stops drivers working on every transition, Linux never had that problem.

Drivers work best when they are open source and in the mainline kernel where they can be fixed when needed.

GNU + Linux = broken development model

Posted Jul 30, 2009 13:13 UTC (Thu) by cortana (subscriber, #24596) [Link]

To be fair, the WHQL testing is worthless. The WHQL-passed driver for my RT61-based wlan interface causes 64-bit Vista and Windows 7 systems to crash whenever it deals with a non-trivial amount of traffic. This is extremely easy to reproduce: just running the Steam server browser, Left 4 Dead or Bittorrent is enough to trigger the crash. But the drivers were rubber stamped, no problem…

GNU + Linux = broken development model

Posted Jul 30, 2009 4:06 UTC (Thu) by elanthis (subscriber, #6227) [Link]

Microsoft breaks ABIs and userland apps all the fucking time. Why do you think so many people complain about Vista? It doesn’t run a lot of apps that worked perfectly in XP (even if you install/run them in administrator mode or in XP compatibility mode). Hell, do any of you actually remember when XP came out? People bitched about it just as much because it broke a ton of apps that worked just fine in 95/98. Windows 7 can’t run some apps that Vista handled just fine.
Windows maintains stable ABIs *where it can* because Real Users kind of prefer not having half their desktop stack break every 6 months while proclaiming the endless breakage as “true innovation!”

GNU + Linux = broken development model

Posted Jul 30, 2009 11:43 UTC (Thu) by mb (subscriber, #50428) [Link]

> To be fair to the parent (who I also disagree with) that particular problem only exists due to the lack of a stable kernel API.
No that is completely wrong. The required API was not available (because it was not yet developed) in 2.6.26.
Having a stable API/ABI does not help this at all.

GNU + Linux = broken development model

Posted Jul 29, 2009 17:41 UTC (Wed) by fuhchee (subscriber, #40059) [Link]

> This illustrates why dividing the kernel and the userland into two totally separate projects is a broken development model that doesn’t scale.

Tell that to Microsoft, or almost any other system software vendor, which values backward binary compatibility.

GNU + Linux = broken development model

Posted Jul 29, 2009 17:41 UTC (Wed) by me@jasonclinton.com (subscriber, #52701) [Link]

I’ll try to ignore the political land-mines all over your text and focus specifically on your claim that $OS does this better. The simple fact is that there are *thousands* of user space consumers of the TTY kernel interfaces out there in the wild, even on *BSD.
For a counter example of how this “un-scalable” development model can actually work just fine–regardless of OS–look no further than the DRI2 and KMS changes coordinated across X.org, other user space and the Linux kernel, all at the same time with many hands involved in the work and with very little user-visible disruption.

Your (primary) claim just doesn’t square with the reality, anywhere.

GNU + Linux = broken development model

Posted Jul 29, 2009 18:18 UTC (Wed) by mheily (subscriber, #27123) [Link]

> For a counter example of how this “un-scalable” development model can actually work just fine–regardless of OS–look no further than the DRI2 and KMS changes coordinated across X.org, other user space and the Linux kernel, all at the same time with many hands involved in the work and with very little user-visible disruption.
Your counter-example actually illustrates my point. It is relatively easy to coordinate changes between the kernel and a single userspace project such as X.org. Once you try to make kernel changes that impact many userspace programs, it becomes very difficult to coordinate the necessary changes. The story of what happened with the Linux TTY changes is concrete evidence of the drawbacks of the split kernel vs. userspace development model.

GNU + Linux = broken development model

Posted Jul 29, 2009 18:23 UTC (Wed) by me@jasonclinton.com (subscriber, #52701) [Link]

You ignored the part of my response where I pointed out that there are thousands of TTY consumers in *BSD. At this point, I’m just going to assume that you’re trolling and not respond to this thread any further.

GNU + Linux = broken development model

Posted Jul 29, 2009 18:55 UTC (Wed) by mheily (subscriber, #27123) [Link]

Fine. Please ignore the rest of this comment because anything you don’t agree with must be a troll. Continuing with the hypothetical example of TTY changes to a BSD-based operating system…
In order to coordinate changes to the kernel TTY code that potentially impact thousands of userland programs, you would need a team of developers and testers to go through the code looking for problems. First, you fix problems in the core userland programs (aka. the “base system”). The source code for everything is under /usr/src, so you can start with:

$ find /usr/src -type f -exec grep -l '#include <pty.h>' {} \;

After you have a patch for the base system, you install the entire ports tree and repeat the search. Ports are installed under /usr/ports, so the command is:

$ find /usr/ports -type f -exec grep -l '#include <pty.h>' {} \;

Once you have a list of potentially problematic ports, you send out a notice to the ports maintainers and users of the -CURRENT branch asking them to test the affected ports against the experimental TTY patch for the base system. Some ports, such as Emacs, may rely on the old behavior and will need patches to work with the new kernel. Other ports may not need any changes at all.

Once all the changes are tested and reviewed (kernel, base system, and ports), the combined patchset is applied to the -STABLE branch for inclusion in the next stable release. All of this development and testing is costly, so hopefully the kernel changes were worth it :)

GNU + Linux = broken development model

Posted Jul 29, 2009 19:12 UTC (Wed) by bronson (subscriber, #4806) [Link]

KDE and Emacs, the two things that broke in the article, aren’t a part of BSD. It’s true that fixing code in ports is easy, just like patching a .deb or .rpm is easy.
That’s not where the problem lies. Here are the hard parts:

- testing the app, figuring out if there are bugs
- finding the bugs, fixing the bugs
- getting code review, regressing your fixes
- upstreaming your patches. All they do is work around a new and obscure kernel bug? Good luck with that!

And THAT is why the kernel <-> user space API should be stable. I like ports as much as anybody else but they just don’t help much here.

GNU + Linux = broken development model

Posted Jul 29, 2009 20:02 UTC (Wed) by nix (subscriber, #2304) [Link]

And if TTY code was that easy to grep for, maybe it would be simple: we
have distributors who can point to such code.
But it is not. TTY users can be people who accept file descriptors via
pipes and have no idea there are TTYs at the other end: they can be people
who use the Unix98 or the old BSD pty interface (which still has users!);
every use of ioctl() has to be audited: the signal handling in TTY users
has to be checked; it ties in with process groups…

The TTY stuff introduced in the early BSD Unix is, let’s be blunt, a
bloody design mess, and a pervasive one. It’s not a nice simple <pty.h>
interface, by any means, although it should have been.
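As a rough illustration of why a single grep for <pty.h> undercounts, here is a minimal Python sketch that widens the search to a few other textual hints. The pattern list and the function name find_tty_users are assumptions made up for this example, not an established audit tool, and no textual search of this kind can catch a program that merely inherits or receives a TTY file descriptor.

```python
import os
import re

# Heuristic, deliberately incomplete list of textual hints that a source
# file touches the TTY layer. These patterns are illustrative assumptions.
TTY_HINTS = re.compile(
    rb"pty\.h|termios\.h|openpty|forkpty|posix_openpt|/dev/ptmx|tcsetattr|TIOC[A-Z]+"
)


def find_tty_users(root):
    """Return a sorted list of files under root containing any TTY hint."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    if TTY_HINTS.search(f.read()):
                        hits.append(path)
            except OSError:
                # Unreadable file (permissions, dangling symlink): skip it.
                continue
    return sorted(hits)
```

Even with a wider net like this, code that only calls read(), write() or ioctl() on a descriptor it was handed shows up nowhere in the results.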

Passing fds via pipes?!

Posted Jul 30, 2009 0:01 UTC (Thu) by i3839 (subscriber, #31386) [Link]

> TTY users can be people who accept file descriptors via
> pipes and have no idea there are TTYs at the other end
How is this possible? Did you mean unix domain sockets instead of pipes?

Passing fds via pipes?!

Posted Jul 30, 2009 0:37 UTC (Thu) by nix (subscriber, #2304) [Link]

Wrong way round. An fd to a TTY can be passed over a unix-domain socket
and then used (which will trigger line discipline magic even though the
app has no idea it’s using it), so it’s using a TTY even though it never
opened it or looked at /dev/ptmx. (This is probably not common, but it
makes a comprehensive audit of TTY users ridiculously hard, because use of
AF_UNIX sockets *is* common and fd passing is not particularly rare. One
variation of it, in which the fd is passed into the application as one of
fds 0 to 2, is of course exceedingly common. You don’t even need AF_UNIX
sockets for that.)
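A minimal sketch of the fd-passing case, assuming Python 3.9 or later (for socket.send_fds and socket.recv_fds) and a Unix system with pseudo terminals available. The function names are invented for this example. The receiving side never opens a TTY device itself, yet ends up holding one.

```python
import os
import pty
import socket


def received_fd_is_tty(fd_to_send):
    """Pass fd_to_send over an AF_UNIX socket pair and report whether the
    descriptor that arrives on the other side refers to a TTY."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # At least one byte of ordinary data must accompany the descriptor.
        socket.send_fds(sender, [b"x"], [fd_to_send])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
        try:
            return os.isatty(fds[0])
        finally:
            for fd in fds:
                os.close(fd)
    finally:
        sender.close()
        receiver.close()


def demo():
    """Create a pty pair and hand the slave end across the socket."""
    master, slave = pty.openpty()
    try:
        return received_fd_is_tty(slave)
    finally:
        os.close(master)
        os.close(slave)
```

The far more common variation needs no socket at all: os.isatty(0) in any program started from a terminal reports a TTY that the program never opened.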

GNU + Linux = broken development model

Posted Jul 29, 2009 22:46 UTC (Wed) by jond (subscriber, #37669) [Link]

…and then your user’s custom code, not part of the OS, breaks when you release.

GNU + Linux = broken development model

Posted Jul 30, 2009 4:47 UTC (Thu) by daniels (subscriber, #16193) [Link]

On GNU userspace, you could use grep -r. ;)
(This is a blatant, content-free, troll.)

GNU + Linux = broken development model.

Posted Jul 29, 2009 19:02 UTC (Wed) by berndp (subscriber, #52035) [Link]

Your perception of the kernel development model is broken: It’s not that 2.6.30-rc5 (or 2.6.31) hits the desktop of Joe Plumber the day
after publication on http://www.kernel.org/ of distributions. That’s the job of distributors (and it’s the same in the
BSD world – they are distributing the complete OS so they must be compared to the Linux distributors).
Assuming that “kdesu” is broken, there is plenty of time for “kdesu” to get fixed (and pushed expedited downstream because it’s a bug fix!).
Maintaining bug compatibility is plain simply not worth the effort (and it is IMHO violating any sane open-source development model
where it’s at least possible to *fix* bugs and not maintain an enormous amount of bug compatibility code – and maintenance effort – for ages.
Why else should Windows need exponentially more resources with each release?).

If someone wants eternal backward bug compatibility, please *do* it – but do not complain to others or even ask them.
Especially not for some random buggy app which just happens to not trigger a race condition.

GNU + Linux = broken development model.

Posted Jul 29, 2009 19:32 UTC (Wed) by mheily (subscriber, #27123) [Link]

Sometimes it’s not clear which codebase is “broken” and needs to be fixed. Alan Cox’s changes may have been totally legal according to the RS-232 standards, but if Emacs and many other programs depend on the old behavior, it’s difficult to say that all userspace programs must be changed. One person’s “bug” is another person’s “feature” :)
I do agree with Linus that increased kernel parallelism should avoid impacting userland code that depends (rightly or wrongly) on serialization.

GNU + Linux = broken development model

Posted Jul 29, 2009 20:30 UTC (Wed) by nevyn (subscriber, #33129) [Link]

> This illustrates why dividing the kernel and the userland into two totally separate projects is a broken development model that doesn’t scale. I’m not trying to start a flamewar by bringing up BSD, but I’ve always thought that their approach of releasing a complete operating system allows you to make major infrastructure improvements that cross the boundary between kernel and userland.

Ahh, how right you are … which is why Linux development hasn’t scaled as well as *BSD development. Also I know a bunch of Linux customers who would be much happier to be told to just “get their custom enterprise apps. into the main Linux git repo.” so that the APIs in the kernel can be broken more often.

*shakes head and wakes up* … ooohh, bad nightmare there for a minute.

> If one of the BSDs wanted to fix their TTY code, they would coordinate the changes in the kernel, core userland, and ports. This allows you to change existing bad behavior with minimal risk of breakage

Last time I checked emacs and kde weren’t in BSD.

> Existing distributions would be free to fork this implementation and add their own differentiating features, like Ubuntu forks Debian every six months and polishes it up for desktop users.

Except this “description” of Ubuntu vs. Debian is not based on reality either. Zero for three, doing well today little troll.

Please

Posted Jul 29, 2009 20:42 UTC (Wed) by corbet (editor, #1) [Link]

One might well disagree with the opinions expressed in the parent post, but it was not a troll. Please, let’s try to avoid that kind of language and focus on the real issues, OK?

Please

Posted Jul 30, 2009 5:17 UTC (Thu) by chad.netzer (subscriber, #4257) [Link]

“GNU + Linux = broken development model”. It should be no surprise that LWN readers (including many Linux developers) won’t take kindly to their style of development being called “broken”. That’s a statement that requires some compelling evidence to back up; one is essentially saying Linux has progressed so far *in spite of* such a broken model.

A tempest in a tty pot

Posted Jul 29, 2009 17:36 UTC (Wed) by adicarlo (subscriber, #8463) [Link]

This is a case where Linus came down too hard on Alan. He swung the axe, not on the basis of lack of work, lack of progress, or any technical issue whatsoever. AFAICT, Linus stomped based only on his perception of Alan’s attitude.
Sure, I understand why Linus would do that. He has to cope with bug denial from hackers and maintainers all the time. Yet it seems to me he’s started to develop a hair-trigger towards developers complaining about the abuses of userspace.

Sure, Alan grumbled about the behavior of TTY consumers in the wild. Even so, he never showed any resistance to fixing these issues. My perception is that things had been moving along well, with regressions being closed left and right.

IMHO, I think Linus should “man up” and offer an apology.

A tempest in a tty pot

Posted Jul 30, 2009 0:22 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Agreed. After all, Alan has been working on the Linux kernel for what, fifteen years? (Maybe longer.)

While I often find value in Linus’ pearls of wisdom (often told with an acerbic but magisterial tone), I am troubled by how he snubbed Alan this way. I pray and hope that the two of them can work out their differences of opinion as (1) our editor has already pointed out that the TTY code is a harsh beast, and (2) Alan is (was) at least willing to try to tame it.

A tempest in a tty pot

Posted Jul 29, 2009 19:16 UTC (Wed) by Tobu (subscriber, #24111) [Link]

Ignoring the “I don’t like your attitude” mismatch, Linus uses sloppy terminology with his “bug”.

> Why? Why blame emacs? Why call user land buggy, when the bug was
> introduced by you, and was in the kernel? Why are you fighting it? Why did
> it take so long to admit that all the regressions were kernel problems?
> Why were you trying to dismiss the regression report as a user-land bug
> from the very beginning?

While the emacs+tty code together show unexpected and unwanted behaviour (one meaning of bug), it isn’t up to him to decide that the tty code has inconsistent semantics (another meaning of bug). He could have been more precise and less insulting.

iabervon’s comment makes a good point on where the semantics come from.

(Where to fix things is an entirely different matter of short vs long term maintainability, from the point of view of people who freeze things vs people who update things.)

A tempest in a tty pot

Posted Jul 29, 2009 21:38 UTC (Wed) by dougg (subscriber, #1894) [Link]

My vote would be that the original copyright holder of the tty subsystem, if still active in kernel development, should be actioned to do the re-rewrite. Then the best kernel code reviewer (IMO Alan Cox) should be actioned to decide how subtle differences in semantics should be handled. This will only work if the reviewer is given a veto over re-rewriter and the latter has some semblance of humility. … just day dreaming.

A tempest in a tty pot

Posted Jul 29, 2009 22:10 UTC (Wed) by tzafrir (subscriber, #11501) [Link]

To quote the article:
‘The top of drivers/char/tty_io.c still reads “Copyright (C) 1991, 1992, Linus Torvalds.”‘

Looks like the original author to me. He also appears to still be active. I also bet he wouldn’t be flamed that badly by Linus.

Sounds like a plan.

Linus and folks need to take a different approach here.

Posted Jul 29, 2009 22:22 UTC (Wed) by drag (subscriber, #31333) [Link]

Not a huge difference. A small difference.
The idea that a kernel can maintain perfect userland ABI/API across all versions for ever and ever is just _wrong_ thinking. They can’t do it. It’s impossible and is something that they have consistently failed at over and over again.

Ideally it should be like that, but it isn’t and won’t ever be. Sorry. Can’t do it. It’s impossible. Nobody is that perfect. Soooo…. What should they do differently?

Version Numbering the ABI

Every 4 years or so introduce another new userland-ABI compatibility version. This gives people a chance to clean up /proc, /sys, bad sys call designs, odd tty insanity, break compatibility with old X drivers, clean up file system behavior, etc etc.

Then try to do your best to support the previous ABI. So that the kernel supports 2 versions at most. So that gives you about 8 years of ABI/API compatibility.

Because right now what is happening is that the Linux kernel isn’t supporting 1 ABI version.. it’s supporting _hundreds_. Each time a kernel developer introduces a bug or an odd behavior and applications start taking advantage of that, then that is a new feature that Linux will have to support for the rest of time. (which it will fail at eventually)

Linus and folks need to take a different approach here.

Posted Jul 29, 2009 23:55 UTC (Wed) by nix (subscriber, #2304) [Link]

Consistent ABIs are impossible.
Also, the X Window System doesn’t exist.

Linus and folks need to take a different approach here.

Posted Jul 30, 2009 6:44 UTC (Thu) by zlynx (subscriber, #2285) [Link]

The X ABI has been broken hundreds (probably thousands) of times.
I unfondly remember the days of the CU computer lab where some applications would not display properly from the HP to the X terminals but worked from the Sun. I recall it was xv the image viewer that was usually good for breaking this stuff.

I remember some odd problems with that second, higher performing proprietary (I think it was?) X server Red Hat included with the boxed version 5.2.

the unknown factor

Posted Jul 29, 2009 22:47 UTC (Wed) by sebas (subscriber, #51660) [Link]

If KDE or Emacs depend on a bug in the kernel, they need to be fixed. In
those cases it’s actually relatively easy, since the bug has already been
triaged and the components that need fixing have been identified.
Whether or not the kernel should keep compatibility (i.e. the fix for the
regressions) is a different question. If two widely used applications show
problems with a new kernel version, then there’s a non-zero chance other
applications are “broken” as well. This $unknown needs to be offset with
the cost of maintaining the incorrect behaviour.

This again has little to do with personal issues, which play an important
role in this story.

A tempest in a tty pot

Posted Jul 29, 2009 23:40 UTC (Wed) by nowster (subscriber, #67) [Link]

Just in:

commit e043e42bdb66885b3ac10d27a01ccb9972e2b0a3
Author: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Date:   Wed Jul 29 12:15:56 2009 -0700

    pty: avoid forcing 'low_latency' tty flag

    We really don't want to mark the pty as a low-latency device, because as
    Alan points out, the ->write method can be called from an IRQ (ppp?),
    and that means we can't use ->low_latency=1 as we take mutexes in the
    low_latency case.

    So rather than using low_latency to force the written data to be pushed
    to the ldisc handling at 'write()' time, just make the reader side (or
    the poll function) do the flush when it checks whether there is data to
    be had.

    This also fixes the problem with lost data in an emacs compile buffer
    (bugzilla 13815), and we can thus revert the low_latency pty hack
    (commit 3a54297478e6578f96fd54bf4daa1751130aca86: "pty: quickfix for the
    pty ENXIO timing problems").

    Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
    Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    [ Modified to do the tty_flush_to_ldisc() inside input_available_p() so
      that it triggers for both read and poll()  - Linus]
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
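
The approach described in the commit message can be modelled in a few lines. This is a toy Python model, not kernel code: the class ToyPty and its method names are invented for illustration. Only the names tty_flush_to_ldisc() and input_available_p() come from the commit message, and the real functions are C with locking that this sketch ignores.

```python
from collections import deque


class ToyPty:
    """Toy model: the writer only queues data; the reader side flushes."""

    def __init__(self):
        self.pending = deque()   # written, not yet pushed to the "ldisc"
        self.readable = deque()  # data the reader can actually see

    def write(self, data):
        # The writer never flushes here, mirroring the point that write
        # may run in a context where taking a mutex is not allowed.
        self.pending.append(data)

    def flush_to_ldisc(self):
        # Move everything queued by writers to the reader-visible queue.
        while self.pending:
            self.readable.append(self.pending.popleft())

    def input_available(self):
        # Both read and poll go through this check, so flushing here
        # covers both paths.
        self.flush_to_ldisc()
        return bool(self.readable)

    def read(self):
        if not self.input_available():
            return b""
        return self.readable.popleft()
```

The point of the design is that a reader asking "is there data?" always sees everything written so far, without the writer having to push it synchronously.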

Alan, please reconsider

Posted Jul 30, 2009 0:04 UTC (Thu) by Felix_the_Mac (subscriber, #32242) [Link]

Alan, you don’t know me, I am just a Linux fanboy, but out here in Ubuntuland, I have been rooting for you.

You’re the one they sent in where lesser mortals fear to tread.
Nobody else could possibly grok the tangle of code that is the TTY layer.
To me you are the ‘asset’ dispatched to eliminate the BKL.

Back in the real world :-) … you are one of the elders of GNU/Linux and you have devoted (almost) half of your life to it because you believe in it.

Today, at this moment of stress, when you have been publicly slighted by the boss, please just remember that we, the people, really are counting on you to do a difficult job, that there really is nobody else eager to step in to your shoes and take the huge workload and _apparently_ small reward.

Most of all please know that your efforts are more widely appreciated than you realise.

Alan, please reconsider

Posted Jul 30, 2009 0:44 UTC (Thu) by johnflux (subscriber, #58833) [Link]

I was 15 when I first met Alan Cox. I asked him to sign my Tux teddy bear :-D He did try, but using a permanent marker on a fluffy toy just didn’t work as well as I’d thought it would.
He seemed very amused by the whole thing.

Alan, please reconsider

Posted Jul 30, 2009 2:25 UTC (Thu) by pflugstad (subscriber, #224) [Link]

+5, ditto, “what he said” and all that.
When Linus flies off the handle (and IMO, he did so here), I always look to Alan’s words for a glimpse of sanity.

Getting a new tty pot

Posted Jul 30, 2009 1:10 UTC (Thu) by PaulWay (subscriber, #45600) [Link]

It occurs to me that what the TTY interface needs is a replanning. Something like:
1: Work up a new interface (’nu-serial’) that communicates serial data and related out-of-band information in a sensible, linear way. Don’t design this to comply with RS-232, RS-422 or anything – design it to work with data and userspace requirements.
2: Convert userspace programs across to this new interface.
3: Write the TTY driver to connect to the nu-serial interface.
4: Optimise the nu-serial interface to work well with threads and across cores.
5: Phase out the old TTY interface.
6: Profit.

The problem seems to me to be that everyone’s got a slightly different interpretation of what a TTY should do, and it’s being bolted into a whole bunch of things like ppp, KDE and emacs that are needing quite different things from the serial interface. If there’s already something like nu-serial, then so much the better.

I’m not a kernel developer, though, so I have no ability to either implement it or guide the project. Some might argue that I shouldn’t be saying anything. It does seem to me, however, that this is the approach that’s been taken in the past with other things – libata, memory allocators, or schedulers being the first (bad) examples that pop into my head.

Have fun,

Paul

Getting a new tty pot

Posted Jul 30, 2009 1:20 UTC (Thu) by corbet (editor, #1) [Link]

The real problem here, of course, is step 2. The current TTY interface is a little brutal at times, but it’s standard and a lot of programs have been written to it. A new user-space API would be painful to bring in, and we’d still have to support the old one forever.

Past history in this area is not greatly encouraging either. Solaris tried more-or-less what you suggest; the result was called “streams.” If you’ve never had to deal with those, count your blessings.

Getting a new tty pot

Posted Jul 30, 2009 8:08 UTC (Thu) by nix (subscriber, #2304) [Link]

The *problem* is also step 2, of course. The TTY layer isn’t complicated
by its need to conform to RS232 or whatever. It’s complicated by its
horrendously arcane userspace interface (what is it, *two* chapters in
APUE?), which is the very thing whose behaviour must be preserved.

Getting a new tty pot

Posted Jul 30, 2009 7:39 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

The problem is in number 2, as many applications try to be os independent (or at least unix-flavor
independent).
This is the point that most people don’t seem to grasp and perhaps why Linus was so harsh: Emacs
works on a large number of unixes, so why is a breakage on Linux the fault of Emacs? Sure, it may be
a bug, but that would make it a bug that exists outside of Linux, and the main purpose of Linux is and
was to be an independent unix-like posix compatible operating system.
This also means that Emacs (by virtue of being a widely used application) was probably just the first
to be impacted by this behavior. And since other applications are probably also directly impacted,
this change of semantics (read: effective ABI change) is not desirable.
The KDE bug is different in that there is a clear indication that it would also not work on other
unixes and probably went unnoticed due to different defaults in distributing KDE which caused that
part to be hardly used (or perhaps some ports have a local patch to fix it which never got upstream).

Getting a new tty pot

Posted Jul 30, 2009 8:07 UTC (Thu) by nix (subscriber, #2304) [Link]

Just because something works doesn’t mean it’s not buggy. I don’t know
about Emacs, but XEmacs’s TTY code has had data-losing bugs before, and
worked on heaps of platforms until Linux 2.6 came along, when it started
doing exactly what we see Emacs doing now, losing the end of compilations
(but for a different underlying reason). It can happen.

Getting a new tty pot

Posted Jul 30, 2009 12:50 UTC (Thu) by foom (subscriber, #14868) [Link]

I’ve seen that same buggy behavior of emacs losing the tail ends of compilations on various versions of OSX. I think it’s fixed now, I have no idea if it was fixed in the kernel or in emacs, and I don’t know if it’s the same underlying cause or not…but it seems suspicious.

So the claim that this new version of Linux is fundamentally breaking the Unix ABI and that emacs worked fine with all other OSes is not at all obvious!

When is it a kernel or a userspace bug?

Posted Jul 30, 2009 7:52 UTC (Thu) by simlo (subscriber, #10866) [Link]

As far as I see it it all boils down to that question:
Linus wants to make sure all applications work even though you can say they are buggy. Alan says that applications can’t rely on certain old behaviour and must be fixed.

I disagree with Linus: You must be allowed to change things once in a while. If the applications assume some non-standard behaviour, they are broken. To determine what non-standard behaviour is, look at how it behaves on OTHER kernels. What can Emacs assume when it runs on BSD, Solaris, etc.? If they also comply with the old behavior of TTY then the new behavior is wrong. But if Emacs already has code to handle the new behaviour because the other kernels already do that, then the new behaviour is ok (this is unlikely, as Emacs would then probably have no problem with the new tty code).

In short: If there is no written standard, look at the other OSes. Applications should not expect special behaviour from the Linux kernel over other Unix kernels unless there is a really good reason to.

The TTY demystified

Posted Jul 30, 2009 9:16 UTC (Thu) by bakterie (subscriber, #37541) [Link]

I found the following link very informative. Maybe other readers will as well.
http://www.linusakesson.net/programming/tty/index.php

A tempest in a tty pot

Posted Jul 30, 2009 10:44 UTC (Thu) by djm (subscriber, #11651) [Link]

FWIW Cox is completely correct: a user program must be prepared to cope with short reads and EAGAIN at all times, especially around process exit where delivery of SIGCHLD is highly likely to interrupt pending reads.
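
For what it's worth, a defensive read loop along those lines might look like the following minimal Python sketch. The function name read_all is made up for this example. Treating EIO as end-of-data reflects how a Linux pty master behaves once the slave side is closed, and is an assumption about the platform rather than portable behaviour.

```python
import errno
import os
import select


def read_all(fd, chunk=4096):
    """Read from fd until EOF, tolerating short reads, EAGAIN and EINTR."""
    parts = []
    while True:
        try:
            data = os.read(fd, chunk)  # may return fewer bytes than asked
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK on a non-blocking fd: wait, then retry.
            select.select([fd], [], [])
            continue
        except InterruptedError:
            # EINTR, e.g. from SIGCHLD. Python 3.5+ normally retries by
            # itself; this branch only matters if a handler raised.
            continue
        except OSError as e:
            if e.errno == errno.EIO:
                # Linux pty master after the slave closed: no more data.
                break
            raise
        if not data:
            break  # a zero-length read means EOF
        parts.append(data)
    return b"".join(parts)
```

The loop never assumes that one read() returns everything the child wrote, which is the assumption that bites around process exit.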

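Linux TTY maintainer Alan Cox quits, saying “I’ve had enough”

British programmer Alan Cox is a veteran kernel hacker who was once regarded as the number two figure after Linus Torvalds. In January 2009 he left Red Hat, where he had worked for 10 years, and joined Intel. Besides maintaining the TTY subsystem of the Linux kernel, Alan Cox has also been involved in the GNOME and X.Org projects. On July 28, 2009, after being harshly criticized by Linus Torvalds, Alan Cox announced that he was stepping down, saying: “I’ve had enough. If you think that problem is easy to fix, you fix it. Have fun. I’ve zapped the tty merge queue, so anyone with patches for the tty layer can send them to the new maintainer.”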
