Subject: from TidBITS#186/26-Jul-93
Software Acceleration
---------------------
by Roy K. McDonald, Connectix -- connectix@applelink.apple.com
Presented at the Sumeria Technologies & Issues Conference
Hardware gets faster every year. We've all come to expect it. And
a huge amount of work is going on right now to ensure that the
same thing happens next year.
Software gets more features. And unfortunately, all too often, the
presumption that fast hardware will take up the slack has meant
that inelegant software design needlessly eats up performance
advances. The irony is that software improvements are often far
more dramatic in their impact than hardware improvements. Hardware
is the tortoise, advancing relentlessly in tens of percents per
year; software is the hare - on occasion it leaps orders of
magnitude.
This article reviews what has been done in software acceleration
on the Mac, highlighting how much more could be done right now. I
aim to persuade you to think about Mac performance as a hybrid of
hardware and software acceleration and perhaps shift your
priorities a little in favor of pushing the envelope on code
rather than silicon.
Decade of Macintosh Hardware Advances
Let's start by seeing what can be done with hardware. How has
Macintosh hardware improved in performance over the past 10 years?
The original 128K Mac had an effective speed of roughly 1/2 MIPS.
Today's Quadra 950 provides about 8 MIPS. Of course, the Quadra
950 is relatively expensive, so on a real $/MIPS basis, the growth
is only eight-fold, equivalent to a yearly average improvement of
26 percent.
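The 26 percent figure is a compound annual growth rate, and a short sketch makes the arithmetic explicit. The function name is hypothetical, and the nine-year span (1984 to 1993) is an assumption: the talk says "10 years", but nine years of compounding is what reproduces the quoted rate.

```python
def annual_growth_rate(total_factor, years):
    """Return the constant yearly rate that compounds to total_factor."""
    return total_factor ** (1.0 / years) - 1.0


# Assumption: nine years between the 128K Mac (1984) and this talk (1993).
price_performance_rate = annual_growth_rate(8, 9)
print(f"{price_performance_rate:.0%}")  # prints 26%
```

The same function can be applied to the other multi-year factors quoted in this section.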
SCSI, NuBus, and AppleTalk speeds have changed less. SCSI may be
about twice as fast as it originally was. The new Cyclone NuBus
standard will give a four times performance boost. AppleTalk is
basically unchanged. And, although EtherTalk has led to a high-
speed network standard bandwidth that is roughly twenty times
better than what we had in 1984, actual throughput is roughly only
a factor of five better.
Typical RAM installation has grown from 128K to the current
average of 6 MB, a 50 times growth, or about 50 percent per year.
Access speeds of main storage have only improved about a factor of
two (although caching has mitigated this otherwise fatal
limitation).
Common hard drives seek an average of about five times faster and
have ten times the capacity they had when drives first shipped for
the Mac Plus. The average transfer rate hasn't improved by much
more than a factor of two.
Overall, we might imagine a "Speedometer" increase of as much as a
factor of 20 over the past decade (with perhaps much more than
that for floating-point operations).
That's not to say that hardware can't make occasional big leaps,
too. RISC processors will provide a roughly three times
performance jump on one-third the die size, for an overall price-
performance step of ten times in what will probably be a two to
three year transition period. DSP can also accelerate certain
processes by an order of magnitude.
But, taken all together, typical jobs on a constant-priced Mac
have been able to be performed roughly 25 percent faster every
year, solely because of technical advances in hardware and
increased performance for the price. This means hardware
performance doubles roughly every three years, a rate likely to
continue for the foreseeable future.
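The step from "25 percent faster every year" to "doubles roughly every three years" is ordinary compound growth. A minimal sketch, with a hypothetical function name:

```python
import math


def doubling_time(annual_rate):
    """Years needed to double at a constant compound annual rate."""
    return math.log(2) / math.log(1.0 + annual_rate)


years_to_double = doubling_time(0.25)
print(f"{years_to_double:.1f} years")  # prints 3.1 years
```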
Software Advances
While hardware advances are relentless and pervasive, software
improvements are often more specific in their impact. The
performance results, however, can be dramatic.
For a familiar example, consider the case of 'Find File' running
under System 6 versus System 7. For fun, we recently took a Mac
Plus running System 7 and raced it against a Mac IIci using System
6. The System 7 software was running on hardware five years older
than the System 6 version. Still, Find File went slightly faster
on the Plus, because Find File is roughly ten times faster in its
current form.
Unfortunately, it often takes a long time for well-known software
techniques to enter the commercial sector. For instance, it was
many years after the introduction of the first spreadsheet
(VisiCalc) before sparse and virtual array techniques were used.
If you wanted a 50 by 1,000 cell spreadsheet, you had to have
50,000 cells worth of RAM (say, 800K), even if most cells were
empty.
Sparse techniques would have allowed you to use only the amount of
memory taken by full cells, and virtual techniques to use disk
space as well, at the cost of slower calculation. But the
marketing war focussed on porting to new platforms and adding new
features, not on saving RAM. A few engineer-years could have saved
users tens of millions of dollars worth of RAM.
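To illustrate the sparse technique, here is a minimal sketch in Python. It is not how any shipping spreadsheet was written; the class and method names are hypothetical. It keeps only the filled cells in a dictionary keyed by (row, column), so an almost empty 50 by 1,000 sheet costs memory for a handful of entries rather than 50,000.

```python
class SparseSheet:
    """Spreadsheet grid that stores only the cells that hold a value."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self._cells = {}  # (row, col) -> value, filled cells only

    def _check(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell outside the sheet")

    def set(self, row, col, value):
        self._check(row, col)
        if value is None:
            # Clearing a cell releases its storage.
            self._cells.pop((row, col), None)
        else:
            self._cells[(row, col)] = value

    def get(self, row, col):
        self._check(row, col)
        return self._cells.get((row, col))  # None for an empty cell

    def stored_cells(self):
        return len(self._cells)


sheet = SparseSheet(rows=1000, cols=50)
sheet.set(0, 0, "Total")
sheet.set(0, 1, 1234.5)
sheet.set(999, 49, 42)
print(sheet.stored_cells(), "of", sheet.rows * sheet.cols)  # prints 3 of 50000
```

The trade-off is the one the text names: each lookup goes through a hash table instead of a direct array index, which is somewhat slower per cell.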
Many new technologies which seem to arrive because of hardware
advances are in fact largely enabled by software breakthroughs. We
did a rough analysis of the increased performance in a variety of
frontier technologies over the past five years and tried to assess
what fraction of speed improvements came from software as opposed
to hardware. We concluded that the software components for the
various technologies were:
* Voice recognition 80%
* Handwriting recognition 80%
* Dynamic 3D graphics 60%
* Compression 50%
In all cases, some hardware improvement (e.g. DSP) was necessary
to make the technologies practical, but better software,
particularly better algorithms, was the most important enabling
technology.
Components of Speed
Where does the speed come from? You can break the software design
process into three components: algorithms, implementation, and
compilation.
The largest range of performance difference comes from algorithm
selection. This may also be the area of poorest performance in the
industry today. Factors of 10 and 100 losses in performance are
common. Why is this?
Consider the basic Order theory of algorithms. Every computer
algorithm can be classed by Order. For example, an Order N
algorithm takes twice as long when you run it on twice as much
data. An Order N-squared algorithm takes four times as long. Lots
of computational problems are easy to code as N-squared
algorithms, but can be rewritten with difficulty to scale as
NlogN.
A famous example was the introduction of the Fast Fourier
Transform in the mid-60's, an NlogN algorithm that replaced the
previous N-squared algorithm.
A 1,024 point transform could thus be performed 100 times faster
by this new software method. So this advance was comparable in
speed to over 20 years of general-purpose hardware speed
improvement. And, it was accomplished through a software change
which, once developed, had no marginal cost over the prior
solution.
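The 100-times figure can be checked with a short sketch. This is an illustration, not code from the talk: naive_dft, fft, and operation_ratio are hypothetical names, the FFT is the textbook recursive radix-2 form and assumes the input length is a power of two, and the operation counts ignore constant factors.

```python
import cmath
import math


def naive_dft(samples):
    """Order N-squared transform: every output sums over every input."""
    n = len(samples)
    return [
        sum(samples[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft(samples):
    """Order NlogN transform (recursive radix-2); length must be a power of two."""
    n = len(samples)
    if n == 1:
        return list(samples)
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for j in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * j / n) * odd[j]
        out[j] = even[j] + twiddle
        out[j + n // 2] = even[j] - twiddle
    return out


def operation_ratio(n):
    """Rough N-squared over N*log2(N) operation count, constants ignored."""
    return (n * n) / (n * math.log2(n))


print(operation_ratio(1024))  # prints 102.4
```

Both functions compute the same transform; only the amount of work differs, which is the point of the example.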
Unfortunately, plenty of commercial software ships every day
containing inefficient algorithms. Sorting records in a database
is a familiar example where NlogN algorithms can be used but
aren't always. When you scale your data from 10 to 100 records,
pixels, or whatever, it means the algorithm may take 100 times
longer to run, when it only needs to take twenty times longer.
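The 100-versus-20 comparison follows directly from the two growth laws. A minimal sketch, with hypothetical function names and constant factors ignored:

```python
import math


def quadratic(n):
    """Cost model for an Order N-squared algorithm."""
    return n * n


def linearithmic(n):
    """Cost model for an Order NlogN algorithm."""
    return n * math.log2(n)


def slowdown(cost, n_small, n_large):
    """How many times longer the larger data set takes under a cost model."""
    return cost(n_large) / cost(n_small)


print(f"{slowdown(quadratic, 10, 100):.1f}")     # prints 100.0
print(f"{slowdown(linearithmic, 10, 100):.1f}")  # prints 20.0
```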
It's easy to see why it happens. From the technical perspective,
debugging and benchmarking are often done on limited data sets that
don't reveal how badly the code will bog down in real world
applications. And the real world constantly increases data set
size, often at an exponential rate. Screen diagonal and pixel
resolution are two common parameters which quadruple data set size
when the parameters double.
Over in marketing, they know that software is not as rigorously
benchmarked for speed as hardware, because comparisons are often
more difficult to apply. So feature lists and time-to-market
become disproportionately important factors.
Good algorithms are not enough. Implementation counts as well. For
example, suppose you need code for looking up records in a
database. An efficient algorithm for this is Order log N - twice as
many records means only one more step in the search.
The usual way to accomplish this is to index the records in a
binary tree. Then you need to do log(2) N index lookups to get the
location. To find a single record in a 1,000 record data base
requires 10 lookups.
But, if each of these lookups involves a separate hard drive
access, the implementation is poor, even though the algorithm is
optimal. A better (and more typical) implementation would bring
some or all of the directory information into RAM at the time of
the first disk hit and cache it there for the next nine lookups.
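The difference between the two implementations can be sketched by counting disk accesses. Everything here is hypothetical scaffolding: CountingDisk stands in for a disk-resident sorted index, a binary search over that sorted index stands in for the binary tree, and the cached version assumes the whole index fits in RAM and can be fetched in a single access.

```python
import bisect


class CountingDisk:
    """Stand-in for an on-disk sorted index that counts accesses."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self.reads = 0

    def __len__(self):
        return len(self._keys)

    def read_entry(self, position):
        self.reads += 1  # one disk access per index entry
        return self._keys[position]

    def read_whole_index(self):
        self.reads += 1  # simplification: one access loads the index
        return list(self._keys)


def find_uncached(disk, key):
    """Binary search that goes to disk for every probe."""
    low, high = 0, len(disk) - 1
    while low <= high:
        middle = (low + high) // 2
        entry = disk.read_entry(middle)
        if entry == key:
            return middle
        if entry < key:
            low = middle + 1
        else:
            high = middle - 1
    return None


def find_cached(index_in_ram, key):
    """The same search, run against a copy of the index held in RAM."""
    position = bisect.bisect_left(index_in_ram, key)
    if position < len(index_in_ram) and index_in_ram[position] == key:
        return position
    return None


disk = CountingDisk(range(1000))
find_uncached(disk, 999)
uncached_reads = disk.reads

disk.reads = 0
index_in_ram = disk.read_whole_index()
find_cached(index_in_ram, 999)
cached_reads = disk.reads

print(uncached_reads, cached_reads)  # prints 10 1
```

The algorithm is identical in both cases; only where the index lives changes.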
Whether or not you use an optimized algorithm, if the
implementation is three times slower than necessary, the overall
performance suffers by the same ratio.
Good implementation is often a matter of deep familiarity with the
target hardware platform, a familiarity which is increasingly
difficult to achieve as technology life cycles shrink ever
shorter.
Also, the code we write is not the code the system runs. Between
the two stands a compiler.
Within the Mac world one can find a range of commercial C
compilers that vary by as much as 30 percent or more in ultimate
compiled code performance. To do better than that, one must write
in assembler, and here the variations are even greater. To put it
bluntly, it's not hard to do a lot better than MPW.
Looking beyond the Mac, we must face the fact that much more
effort has gone into optimizing 80x86 compilers than 680x0
products. As Windows has gained market share, more and more
cross-platform benchmarks are being published of essentially
identical source code compiled for Windows versus Mac and run on
similarly powered CPUs. The Windows products tend to run faster
because the compilers are, by and large, a little bit better. The
most striking example I've seen was a recent PC Magazine benchmark
of WordPerfect where the Windows advantage was substantial. This
is not because of a superior operating system, but because of the
availability of a better optimized compiler.
With the move from CISC to RISC architecture, and especially with
the move to superscalar pipelines, ever more burden is placed upon
the compiler. If sloppy compilers can be written for CISC
machines, time-to-market pressures could produce RISC compilers
that cost even more performance.
The trend in the software industry today runs in the opposite
direction. We are all sacrificing performance in
favor of time-to-market. Object Oriented Programming is the
epitome of this trade-off. Now, there's nothing wrong with OOP,
and it's great that we'll all soon be writing Newton applications
by dragging and dropping resources from the object pool.
But OOP is an obvious formula for inefficient code. Witness the
feel of the Finder in System 6 vs. System 7. In many applications
I'll guess that early products will be sketched in OOP and later,
more mature products or versions will be coded at lower levels.
Lately we've been thinking about starting a development house that
specializes in knocking off popular OOP-based products with C or
assembler-based me-too versions. We'd be second to market but we'd
win the benchmark wars every time.
System Software
System software is particularly important because of its pervasive
impact on performance. Well-written, native-mode system calls are
critical to good performance for a wide range of software
products, and can to some extent overcome limitations imposed by
inefficient compilers. If most of the computer's time is spent in
highly-optimized system calls, the inefficiencies of the calling
program can easily be overlooked.
On the downside, many advances in system software have undermined
performance. Windowing systems and multitasking both advance
overall productivity, but add overhead which slows routine
operation. The user gets new functionality, but it doesn't come
for free, and it affects all applications.
Moreover, advances often improve performance in ways that are
difficult to define quantitatively. Both virtual memory and RAM
disk technology can significantly enhance Mac productivity, but
it's hard to benchmark their contributions. For example, Connectix
end-user studies of Virtual and MAXIMA customers indicate that
either product can increase total work output per session by 5-20
percent, but results vary widely according to the type of work
performed and the system configuration.
An area of particular interest to Connectix is the use of
advanced, dynamic disk caching techniques, utilizing all of the
often "wasted" RAM on computers to avoid unnecessary disk access.
The benefits of this are two-fold:
First, disk accesses are usually a hundred to a thousand times
slower than RAM accesses, so tremendous speed improvements can be
achieved. Preliminary benchmarks on our Velocity caching product
show an overall work throughput increase of about 25 percent.
That's not bad for a low-cost software extension considering what
it costs to accomplish the same boost in hardware.
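Why the hit rate matters so much when disk is a hundred to a thousand times slower than RAM can be seen from the standard weighted-average access time. This sketch is illustrative only: the function name is hypothetical, times are in arbitrary units with RAM taken as 1 and disk as 1,000 (the upper end of the range above), and it does not model the 25 percent throughput figure.

```python
def effective_access_time(hit_rate, ram_time, disk_time):
    """Average cost of one access when a fraction hit_rate is served from RAM."""
    return hit_rate * ram_time + (1.0 - hit_rate) * disk_time


# Assumed units: RAM access = 1, disk access = 1000.
for hit_rate in (0.0, 0.9, 0.99):
    average = effective_access_time(hit_rate, 1, 1000)
    print(f"hit rate {hit_rate:.2f}: average cost {average:.2f}")
```

Under these assumptions, even a cache that catches nine accesses in ten still leaves the average dominated by the remaining disk hits, so each further point of hit rate is worth a great deal.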
Second, caching has become increasingly important because of
portable computing. PowerBook users will enjoy considerable
battery life extension through the elimination of unneeded disk
spin-ups, which typically account for 10 percent of power use in a
battery-powered PowerBook session. Many PowerBook users also
complain that their PowerBooks seem sluggish compared to
comparable desktop systems - mainly, it appears, because of the
random annoying delays of drive spin up.
The key to a successful caching strategy involves maximizing the
available cache size and filling it with the data most likely to
be called for next by the CPU. Velocity incorporates unique
advances in both of these areas, which I look forward to
discussing in the future.
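As a generic illustration of "filling the cache with the data most likely to be called for next", here is a least-recently-used (LRU) block cache. This is a common textbook policy, not a description of Velocity, whose techniques are not given here; the class name, the read_block callback, and the block numbering are all hypothetical.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Keeps the most recently used disk blocks in a fixed amount of RAM."""

    def __init__(self, capacity, read_block):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._read_block = read_block  # function that fetches a block from disk
        self._blocks = OrderedDict()   # block number -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self._blocks:
            self.hits += 1
            self._blocks.move_to_end(block_no)  # mark as most recently used
            return self._blocks[block_no]
        self.misses += 1
        data = self._read_block(block_no)       # the slow path: go to disk
        self._blocks[block_no] = data
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)    # evict the least recently used
        return data


cache = LRUBlockCache(capacity=2, read_block=lambda n: f"block-{n}")
for block_no in (1, 2, 1, 3, 2):
    cache.read(block_no)
print(cache.hits, cache.misses)  # prints 1 4
```

LRU bets that recently used data will be used again soon; a larger capacity or a better prediction of the next access raises the hit count, which matches the two levers named above.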
Input/Output
One of the most productive areas for software acceleration is in
the I/O domain, both internal to the system, and over a network.
After all, processing has three major steps - you get the
information, then you process it, then you spit out the results.
Two thirds I/O, one third processing.
Consider the following thought experiment: Watch a typical user
for an hour. She opens files, launches applications, enters
alphanumeric data, spell checks, calculates, sends email, closes
windows. Now, double the processor speed. Maybe she'll save 5
minutes out of the hour. Instead, suppose you double the I/O
speeds - SCSI, ADB, AppleTalk, and NuBus. How much does she save
then? Our testing indicates it's also about five minutes, and it's
certainly within a factor of two of that either way for most
sessions.
Moreover, a lot of the time saved will occur during periods when
the user would be especially annoyed at delays. Most people are
prepared to watch their clock spin a few seconds when calculating,
but have less patience when saving or opening a document. The
system just doesn't seem to be working as hard then.
Hardware I/O speeds are generally not improving quite as fast as
raw computation speeds. But a lot can be done in software here.
Many I/O bottlenecks slow operations by 10 to 1 or even 100 to 1.
Even though they affect system operation only a small fraction of
the time, say 10 percent, addressing these bottlenecks can have a
big impact. If you want a graphic example of this,
compare benchmark data of third-party 25 versus 33 MHz accelerator
boards. With a 33 percent higher clock speed, you often see
benchmarks only 10 or 20 percent better, because I/O is setting
the pace.
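The accelerator-board observation fits the relation usually called Amdahl's law (a name the talk does not use): speeding up one part of a job helps only in proportion to the time spent in that part. In the sketch below the function name is hypothetical, and the 50 percent processor-bound share is an assumption chosen for illustration, not a measurement.

```python
def overall_speedup(accelerated_fraction, factor):
    """Whole-job speedup when only accelerated_fraction of the time runs factor times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


clock_ratio = 33 / 25  # 33 MHz board versus 25 MHz board

# Assumption: half of the benchmark time is processor-bound, the rest is I/O.
gain = overall_speedup(0.5, clock_ratio) - 1.0
print(f"{gain:.0%}")  # prints 14%
```

Only a job that is entirely processor-bound would see the full 32 percent from the faster clock.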
Networks
Enormous increases in network bandwidth are becoming available
because of the introduction of new technologies, particularly
optical transmission. The underlying structure of network data
transmission on the Mac is starting to be strained by these
capabilities.
I recently spoke with a vendor who successfully developed an
attractive low-cost, high-performance FDDI card with about ten
times the effective speed of today's Ethernet systems. It failed
as a product, however, because the throughput of the network was
bottlenecked at both ends of the link by packet creation and
decoding time. This seems like an area ripe for new software
paradigms.
Video
There has been little improvement in the software that drives Mac
video over the years. This reflects the fact that the Mac started
with an excellent foundation, the original version of QuickDraw.
Subsequent versions have improved screen draw times by about a
factor of two, and big improvements in the future seem unlikely.
User/System
Finally, there is one bandwidth limitation which dominates all
others in importance, one link in the I/O chain responsible for 99
percent of the wasted clock cycles in every Macintosh. This, of
course, is the interface between the user and the system. Far
outweighing the compiler and the implementation, and swamping even
the effect of new algorithms, is how efficiently a user can
communicate her wishes to the machine, and how well in turn the
machine can let the user understand or appreciate the results and
implications of those actions. The ultimate bandwidth limitation,
and the single most important way to improve the total performance
of the user-system combination, is the user interface metaphor.
The Mac established its special position in the industry by virtue
of its unique ability to address this one issue. Essentially, the
key technology that enabled it to do so was software. But more
remains to be done, and the pace of improvement in the last five
years has not been particularly impressive. For all the two
thousand engineer years that went into its development, is the Mac
a lot easier to use under System 7 than it was before? I don't
believe so, and I hope we're in for some paradigm-shifting
breakthroughs here. Personal computing could use such a shot in
the arm today.
Conclusion
Time-to-market and feature list forces are driving software
developers to work in ever higher-level programming languages and
to pay less and less attention to the efficiency of the underlying
code. Because hardware speed has increased over the years, they
have been able to get away with this for some time.
But considering how much effort goes into pushing the speed
envelope of the hardware, it seems like users would be well served
if more emphasis were placed on software acceleration. In
everything from mainstream applications to system software, users
do care about speed, and software will often be the best
price-performance technology to provide it.