In-Circuit Emulation: How the Microprocessor Evolved Over Time

Thursday, June 11, 2009

Microprocessors weren't always designed with in-circuit emulation in mind. But as the microprocessors evolved, the need to support in-circuit emulation within the microprocessors became obvious. Without microprocessor support, it would be very difficult, if not impossible, to halt the microprocessor anywhere on a specified breakpoint event, let alone reconstruct an instruction disassembly trace. As time went on, many more emulation features were built into the microprocessor. On the 80186 a few pins were implemented. On the 80286, a few pins and a few instructions were implemented. The 80386 expanded these support pins, added a few more instructions, some debug registers, and a few special bus cycles. The 80486 refined these same features, while the Pentium scrapped them all, and completely redesigned all of the ICE features.

In the July "Undocumented Corner," I outlined the differences and advantages that in-circuit emulators held over their software-based debugging counterparts. I delved into the history of Intel's ICE offerings, and promised to continue my ICE discussion by describing how the microprocessor design has evolved with in-circuit emulation in mind. This month, I will continue my discussion of ICEs by describing these changes that occurred within the microprocessor.

ICE Evolution: Internal Hardware Support

Intel's x86 microprocessors seem to have made three evolutionary changes for ICE support. In the beginning, the ICE used a standard microprocessor. This type of ICE achieved its capabilities strictly by monitoring the microprocessor bus. I'm not sure exactly how breakpoints were signaled to the CPU. Most likely they were signaled through some type of non-maskable interrupt, or by asserting a pin that puts the CPU into a hold state. This type of ICE would include the 8088/8086.

The next type of ICE used a modified CPU, commonly called a "bond-out CPU." The bond-out CPU was an ordinary microprocessor with some extraordinary capabilities. This type of CPU was given its name because certain pins were bonded out from the silicon to the external microprocessor pins. These pins were marked as no-connects by Intel, and were usually not connected to anything on the ordinary production silicon. The bond-out version of the microprocessor connected some of these pins to pads of the silicon, giving the microprocessor its special ICE features. Bond-out CPUs were produced for all Intel x86 processors from the 80186/88 to the 80486.

The third evolutionary change occurred with the advent of the Pentium processor, and remains in the Pentium Pro processor as well. These microprocessors have "Probe Mode," a special debug mode which is implemented by connecting a handful of signals to the ICE host. These signals are collectively called the probe port, or debug port. (Intel manuals most commonly use the term debug port.) Using the debug port, the ICE may communicate with the CPU at any time, but not necessarily perform all ICE functions until the CPU is in "Probe Mode."

Clearly, Intel was searching for the perfect balance of cost and features needed to support in-circuit emulation. A drastic evolutionary change occurred between the 8086 and 80186. Another big evolutionary change occurred between the 80486 and the Pentium. I believe Intel hit gold with their current ICE implementation. It's a very simple, elegant, and powerful solution. Intel reduced complexity without sacrificing a single ICE-mode feature - the best kind of trade-off.

ICE Evolution: Triggering and Resuming

I know very little about the earliest days of in-circuit emulation support for Intel x86 microprocessors. I believe the 8086 didn't have any direct support for in-circuit emulation. Instead, I believe that breakpoints were triggered by monitoring the bus, and signaling the CPU to hold, or generate a certain type of non-maskable interrupt. Minimal ICE support didn't begin until the 80186 implementation. Full-blown ICE support didn't begin until the 80286 implementation.

The 80286 didn't have any means to trigger a breakpoint on an exact trigger event. Instead, a breakpoint event was specified by the user to the ICE software. The ICE hardware would monitor the bus for this event and assert a special microprocessor pin to indicate when a breakpoint match had occurred. In return, the 80286 would halt emulation and enter ICE mode. Because of the lack of microprocessor breakpoint support, it was impossible to halt emulation at the exact point of the breakpoint event. Instead, the halt occurred a few instructions later. This was the best the 80286 could do. Once in ICE mode, the 80286 stored its internal microprocessor state to a hard-coded memory location within the ICE host hardware. For the purposes of this discussion, I'll call this special memory location the "state save map." The ICE host software could query and modify the internal processor state by modifying values in the state save map.

Resuming from ICE mode was accomplished by executing the undocumented LOADALL instruction. LOADALL restores the microprocessor state from the state save map that is saved during the transition from user mode to ICE mode. LOADALL loads enough of the microprocessor state to ensure return to any processor operating mode. LOADALL is a very powerful instruction in its own right, which explains why it has been the topic of many magazine articles (mine included), and chapters in books. After LOADALL is executed, the CPU exits ICE mode, and returns to user mode.
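The save-and-restore round trip described above can be modeled in a few lines of Python. This is only a toy sketch: the register names, the dictionary standing in for the state save map, and the function names enter_ice_mode and loadall are illustrative assumptions, not the real 80286 state layout or instruction behavior.

```python
# Toy model of the ICE-mode round trip. The register set and all names
# here are hypothetical; the real 80286 state save map layout is not shown.

def enter_ice_mode(cpu_state, state_save_map):
    """On a breakpoint, the CPU dumps its internal state into the map."""
    state_save_map.clear()
    state_save_map.update(cpu_state)


def loadall(state_save_map):
    """Resume: rebuild the CPU state from whatever the map now holds."""
    return dict(state_save_map)


cpu = {"ip": 0x1234, "cs": 0x0800, "flags": 0x0002, "ax": 0x0001}
save_map = {}

enter_ice_mode(cpu, save_map)      # breakpoint hit, state is saved
save_map["ax"] = 0xBEEF            # ICE host software edits a register
cpu = loadall(save_map)            # state is restored, back to user mode

print(hex(cpu["ax"]), hex(cpu["ip"]))
```

The point of the model is that the host never touches the CPU directly: it edits the map, and the resume step picks up whatever the map contains at that moment.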

The 80386 greatly expanded the ICE support features. Like the 80286, the 80386 ICEs supported bus event breakpoints (those events that required monitoring the microprocessor bus signals to detect breakpoint events). However, the 80386 expanded its breakpoint capabilities by introducing a series of breakpoint event registers (collectively known as debug registers) and other debugging support features. The debug registers could store up to four different breakpoint events. When a code execution breakpoint event occurred, the microprocessor immediately stopped execution without overshooting the breakpoint event. This was a dramatic improvement over the 80286 bus-snooping method. The debug registers also had the ability to trigger a breakpoint on a limited set of memory-access events. In addition to the debug register breakpoints, the 80386 also had other hard-coded breakpoint events. There are various breakpoint events that can occur on the 80386 (and later) microprocessors, including:

  • An event match on any of the four debug registers.
  • An attempt to write to any of the debug registers when the DR7.GD bit equals one (bit 13 of debug register 7). This breakpoint type was primarily designed to prevent user code from modifying the breakpoint events (in the debug registers) while the ICE was in use.
  • Switching to a 386-style task whose T-Bit is set in the task state segment. (The T-bit is unique to 32-bit task state segments.)
  • Executing any instruction while the EFLAGS Trap Flag bit is set (EFLAGS.TF=1).
  • Executing the ICE BreakPoint instruction (ICEBP). This undocumented instruction was implemented specifically to allow for a convenient way to halt the ICE. It's strange that it has never been officially documented.
  • Executing an INT-01 instruction (opcode CD 01).

Under normal operating conditions (when an ICE isn't connected), any of these events will cause the microprocessor to invoke the debug exception handler (exception type 01). However, when the ICE is connected, any of these events causes the ICE to halt emulation - that is, when the ICE is instructed to halt emulation. When the ICE is connected, these exceptions are optionally diverted to the ICE, instead of being vectored to the INT1 handler. This behavior is governed by an undocumented bit in the DR7 register - bit 12. This bit is automatically set by the ICE host software whenever the user issues a command that instructs the ICE to expect a breakpoint event to occur. Therefore, the user needn't know about undocumented debug register bits, or write any software to set them.
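The routing rule in the paragraph above can be sketched as a simple bit test. The positions of the GD bit (13) and the undocumented redirect bit (12) come from the text, and L0 at bit 0 follows the documented DR7 layout. The constant name DR7_ICE_REDIRECT and the function debug_event_target are made up for this illustration, and the model ignores everything else the real hardware does.

```python
# Sketch of how a debug event is routed, per the description above.
# DR7_ICE_REDIRECT is a hypothetical name for undocumented bit 12.

DR7_L0 = 1 << 0                 # local enable for breakpoint 0 (documented)
DR7_ICE_REDIRECT = 1 << 12      # undocumented: divert debug events to the ICE
DR7_GD = 1 << 13                # trap on writes to the debug registers


def debug_event_target(dr7, ice_connected):
    """Return where a debug event goes under this simplified model."""
    if dr7 & DR7_ICE_REDIRECT:
        # With no ICE hardware attached there is nothing to answer the
        # request, so the CPU appears to hang.
        return "ICE" if ice_connected else "hang"
    return "INT1 handler"


dr7 = DR7_L0 | DR7_GD
print(debug_event_target(dr7, ice_connected=False))
print(debug_event_target(dr7 | DR7_ICE_REDIRECT, ice_connected=True))
print(debug_event_target(dr7 | DR7_ICE_REDIRECT, ice_connected=False))
```

The third call models a production chip with bit 12 set and no emulator behind it.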

External breakpoints are triggered when the ICE hardware asserts an undocumented microprocessor pin. When asserted, the CPU finishes executing its current instruction, then immediately enters ICE mode. Like the 80286, the ICE hardware connects to the entire microprocessor bus, and has the ability to recognize bus events (such as special CPU cycles) and trigger the CPU as a means to enter ICE mode.

Resuming from ICE mode is accomplished in the same way as the 80286. The ICE host executes the 80386 version of the LOADALL instruction to exit from ICE mode and return to user mode. The 80386 version of the LOADALL instruction has a different data format and opcode than its 80286 counterpart.

The 80486 has virtually the same ICE support as the 80386. Entering and exiting ICE mode is virtually identical between these processors. A similar set of undocumented pins supports ICE breakpoint events, and the same LOADALL instruction is used to resume from ICE mode to user mode. Even though the ICE support is virtually the same, the 80486 had one major change - the LOADALL instruction could no longer be executed outside of ICE mode. The A-step of the 80486 allowed LOADALL execution in user mode, but Intel decided that was too dangerous. Instead, attempting to execute LOADALL in user mode caused an invalid opcode exception. When executing in ICE mode, LOADALL worked like it always did. The Pentium processor slightly expanded the debug architecture, and completely redesigned the ICE support features.

The Pentium processor sported a new breakpoint event - I/O breakpoints. An I/O breakpoint could be triggered when an I/O instruction read, wrote, or accessed a specific I/O port register. This was a welcome addition, though it was very limited in its abilities. If you wanted to specify a breakpoint on a specific port value, or port size (byte, word, or double-word), then a full-blown ICE with bus event recognition circuitry was needed.

The ICE support features of the Pentium processor were completely redesigned. The Pentium still has an internal and external means to enter ICE mode, but it is completely different from its older x86 brethren. From the outside world, special probe-mode instructions trigger the entrance and exit of ICE mode. Internally, the Pentium can be triggered into ICE mode in much the same fashion as the 80386 and 80486 (though the undocumented bits in DR7 don't have a role in this function). I'll cover the Pentium's new ICE features in my next column.

ICE Evolution: ICE Mode

Once a breakpoint is triggered on processors ranging from the 80186 to 80486, the CPU enters an alternate operating state called "ICE mode." During ICE mode, all memory cycles appear to stop, and the CPU appears to be dormant. This behavior is only a facade. Actually, the CPU is executing special ICE software. The CPU appears to be dormant because the normal pins that indicate bus activity - ADS and RDY - aren't used. In ICE mode, the microprocessor uses alternate ADS and RDY (bus cycle) signals. When viewing bus activity on a logic analyzer, the lack of the normal ADS/RDY handshake gives the appearance of the lack of bus activity.

Once in ICE mode, the microprocessor asserts an ICE-MODE output pin. This output pin is decoded by the ICE host hardware. Its purpose is to indicate to the ICE hardware to use an alternate address space. The state save map is saved to this alternate address space, and the ICE kernel code resides there too. Some people have described the ICE-MODE pin as a 33rd address line (which would be A32, in CPU nomenclature). When A32 is asserted, all memory accesses occur to this alternate memory space.
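One way to picture the "33rd address line" idea is to treat the ICE-MODE pin as an extra bit above A31. The sketch below assumes a 32-bit address bus, as on the 80386 and 80486; the function name extended_address and the notion of a single 33-bit number are illustrative only, not how the ICE hardware is actually wired.

```python
# Model the ICE-MODE pin as an extra address bit above A31.
# The function name and the 33-bit "extended address" are illustrative only.

ICE_MODE_BIT = 1 << 32   # "A32" in the nomenclature used above


def extended_address(addr, ice_mode):
    """Combine a 32-bit address with the state of the ICE-MODE pin."""
    addr &= 0xFFFFFFFF
    return (addr | ICE_MODE_BIT) if ice_mode else addr


print(hex(extended_address(0x1000, ice_mode=False)))  # user memory space
print(hex(extended_address(0x1000, ice_mode=True)))   # alternate ICE space
```

The same 32-bit address lands in two different places depending on the pin, which is all the ICE host hardware needs in order to keep the two memory spaces apart.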

Even though ICE mode executes out of a protected memory space, the ICE kernel has the ability to read and write memory to the user's memory space. This is done through the UMOV undocumented instruction. UMOV was introduced on the 80386, and remained on the 80486. UMOV transferred the data contents between a register and user memory. When executed, UMOV always asserted the standard ADS/RDY signals for the memory transfer. Using the standard ADS/RDY handshake guaranteed that user memory was always accessed. As a side-effect of using ADS/RDY, UMOV would operate like any other MOV instruction when executed outside of the ICE mode environment.
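The difference between MOV and UMOV can be shown with a toy bus model. Everything here is an assumption made for illustration: the ToyBus class, its two dictionaries, and the method names are not part of any real tool, and only the store direction is modeled.

```python
# Two separate memories: one selected by the normal ADS/RDY handshake,
# one selected by the alternate ICE handshake. All names are hypothetical.

class ToyBus:
    def __init__(self):
        self.user_memory = {}   # reached with the standard ADS/RDY signals
        self.ice_memory = {}    # reached with the alternate ICE signals
        self.ice_mode = False

    def mov_store(self, addr, value):
        """Ordinary MOV: goes to whichever space the current mode selects."""
        target = self.ice_memory if self.ice_mode else self.user_memory
        target[addr] = value

    def umov_store(self, addr, value):
        """UMOV: always uses the standard handshake, so always user memory."""
        self.user_memory[addr] = value


bus = ToyBus()
bus.ice_mode = True
bus.mov_store(0x100, 0xAA)    # lands in the ICE address space
bus.umov_store(0x100, 0x55)   # reaches the user's memory from ICE mode
print(bus.ice_memory, bus.user_memory)
```

With ice_mode set to False, both methods write to user_memory, which mirrors the side-effect noted above: outside of ICE mode, UMOV behaves like any other MOV.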

ICE Mode on the Pentium is completely different than any of its predecessors, and I will discuss it in my next column.

ICE Evolution: Code Tracing

Most ICEs have the ability to reconstruct an instruction trace, which is done with the aid of the microprocessor. There are two components that aid in trace reconstruction:

  • A special (undocumented) bus cycle is generated - called a Branch Trace Message (BTM). The address and data bus of this special cycle contain the source and/or destination address of any instruction discontinuities. This data aids in trace reconstruction by allowing the trace reconstruction software to ignore spurious instruction pre-fetches that contain code that was never executed.
  • The length of the instruction is also helpful, though I don't think it is absolutely necessary for trace reconstruction.

The destination branch is emitted by the microprocessor as a BTM. BTMs are officially documented in the Pentium and Pentium Pro manuals, but they remain undocumented for the 80386 and 80486 microprocessors. The 80386 and 80486 both implement branch trace messages as special microprocessor bus cycles using the ICE-ADS signal as a qualifier. Qualifying BTMs with ICE-ADS enables the trace reconstruction software to easily recognize these special cycles, which are stored in the ICE trace memory. Generating these cycles is optional to the microprocessor, and may be enabled and disabled. For normal operation (without an ICE connected), the default operation is BTMs disabled. They are enabled and disabled on the 80386 and 80486 by writing to an undocumented bit in DR7 (bit 14 of debug register 7), or by writing to a bit in the TR12 register in the Pentium.

Knowing the instruction length is also very helpful in reconstructing the instruction execution trace. For this purpose, the 80286 through 80486 have a set of four pins that are sampled at every clock boundary. When an instruction completes execution, these four pins are encoded with the instruction opcode length. Because x86 instructions cannot exceed 15 bytes, four pins are precisely enough for a binary encoding of 0 to 15. These pins are always inactive on non-execution boundaries (when an instruction is currently executing). As far as I know, this feature was removed on the Pentium processor because the internal execution of the Pentium runs at a faster speed than its external input clock. The asynchronous relationship of the input clock to the core clock makes it impossible to sample a similar set of pins on an instruction boundary.
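To show how the two trace aids fit together, here is a deliberately simplified reconstruction loop. The event format is my own assumption for this sketch: ("len", n) stands for the length pins reporting a completed n-byte instruction, and ("btm", dest) stands for a branch trace message carrying the destination address. A real ICE works from captured bus cycles and is far more involved.

```python
# Simplified trace reconstruction. The event format is an assumption made
# for this sketch: ("len", n) marks an instruction of n bytes completing,
# and ("btm", dest) marks a branch trace message giving the new address.

def reconstruct_trace(start, events):
    """Return the address of every executed instruction, in order."""
    executed = []
    address = start
    for kind, value in events:
        if kind == "len":
            executed.append(address)
            address += value        # fall through to the next instruction
        elif kind == "btm":
            address = value         # discontinuity: continue at the target
        else:
            raise ValueError(f"unknown trace event: {kind}")
    return executed


events = [("len", 3), ("len", 2), ("btm", 0x2000), ("len", 1), ("len", 5)]
trace = reconstruct_trace(0x1000, events)
print([hex(a) for a in trace])
```

Bytes that were prefetched after the branch at 0x1003 never show up in the result, because the loop only advances on completed instructions and jumps whenever a BTM arrives.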

ICE Evolution: Instruction Set Support

The 80286 was the first Intel x86 microprocessor to have special bond-out instructions. The LOADALL instruction was used for in-circuit emulation support, and possibly chip testing purposes. LOADALL continued its evolution on the 80386 and 80486 microprocessors. For these processors, LOADALL took on a new opcode (it changed from 0F05 to 0F07), had expanded functionality (saved and restored 32-bit registers, and so on), and had slightly different semantics (ES:EDI pointed to the LOADALL table instead of the table being hard-coded at a fixed address). The 80386 also added two more instructions that aided ICEs - ICEBP provided a software mechanism to trigger an ICE breakpoint event (though it remains undocumented today), and UMOV provided a means to exchange memory between ICE mode and USER mode. Most of this was removed in the Pentium and replaced with a completely different ICE architecture (though the ICEBP instruction still remains).

Conclusion

As you can see, the microprocessor has undergone many changes in support of in-circuit emulation. Special versions of the microprocessor, called bond-out versions, provided these specialized functions that appear to be non-existent in the production versions of the chip. In truth, the production versions are the same as the bond-out versions, except these ICE features aren't bonded out to the external pins of the microprocessor package. Just try setting DR7.bit12=1 and executing an INT-01 and see how fast you hang your 80386 or 80486. It hangs because the CPU is attempting to invoke the ICE hardware.

As the CPU evolved, extra pins were added, and extra opcodes were implemented - all in support of in-circuit emulation. Even though these features required quite a lot of work on Intel's part, they threw all of it away when they implemented the Pentium processor. The Pentium abandoned all of these prior ICE functions, and completely redesigned the ICE architecture. Next time, I will continue my ICE discussion by examining the Pentium, and discussing its version of ICE support.

8051 Microprocessor

The second component of the computer programming curriculum is a simulation of an Intel 8051 microprocessor. This particular microprocessor outsells all others, including the microprocessors used in desktop computers. There are likely to be a few of these in your house and a few more in your car. Just as with the RPN calculator, you get a complete integrated development environment where you construct your program and then watch it run. In this case you will be programming in assembly language, which is still widely used in industry when performance counts.

Microprocessors are used to control intelligent machines such as CD audio players, sewing machines, camcorders, etc. To give you real-world experience with these types of applications, this 8051 simulator allows you to write 8051 programs that interface with a scrolling electronic signboard, a motorized mechanical mouse that searches through a maze looking for cheese, and an LED bar graph used to display the instantaneous volume peaks in arbitrary .WAV files (i.e., songs) being played through the speakers. Thanks to 54,000 words of on-line Help documentation and 24 fully-explained example programs you will learn to program the 8051 microprocessor and use it to control each of these three unique and fun environments.

A screen capture of the 8051 Simulator when configured with the scrolling electronic signboard is shown below:

Another screen capture of the 8051 Simulator when configured with the maze is shown below:

Another screen capture of the 8051 Simulator when configured with the audio peak detector is shown below. Note that you can employ any .WAV file, including ones you rip from your own CD collection.

Microprocessor

The microprocessor, (or CPU), is the brain of the computer. The picture above shows a slot 1 processor with heatsinks and a fan, which prevent it from overheating. Below is the processor without the heatsinks and fan, being inserted into a slot 1 motherboard connection. Slot 1 processors have the microprocessor and level 2 cache memory mounted on a circuit board, (or card), which is enclosed inside of a protective shell.

The enclosed slot 1 processor card contains the central processing unit (or CPU), with its level 1 cache memory. The central processing unit also contains the control unit and the arithmetic/logic unit, both working together as a team to process the computer's commands. The control unit controls the flow of events inside the processor. It fetches instructions from memory and decodes them into commands that the computer can understand. The arithmetic/logic unit handles all of the math calculations and logical comparisons. It takes the commands from the control unit and executes them, storing the results back into memory. These 4 steps (fetch, decode, execute, and store) are what's called the "machine cycle" of a computer. These 4 basic steps are how the computer runs each and every program. The microprocessor's level 1 cache memory is memory that is contained within the CPU itself. It stores the most frequently used instructions and data. The CPU can access the cache memory much faster than it can access the RAM (or Random Access Memory). Below is a picture of what's inside of a Pentium 3 processor. The control unit, arithmetic/logic unit, and level 1 cache are contained within the center CPU chip. Level 2 cache memory is visible on the right-hand side of the processor card.
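The four steps of the machine cycle can be acted out with a very small toy program. This is only a sketch: the instruction format (operation, source, source, destination), the two operations, and the dictionary used as memory are invented for the example and do not correspond to any real processor.

```python
# Toy illustration of the four-step machine cycle. The instruction format
# (operation, source, source, destination) is invented for this sketch.

def run(program, memory):
    pc = 0                                   # program counter
    while pc < len(program):
        instruction = program[pc]            # 1. fetch
        op, src1, src2, dest = instruction   # 2. decode
        if op == "ADD":                      # 3. execute (the ALU's job)
            result = memory[src1] + memory[src2]
        elif op == "SUB":
            result = memory[src1] - memory[src2]
        else:
            raise ValueError(f"unknown operation: {op}")
        memory[dest] = result                # 4. store
        pc += 1
    return memory


memory = {"a": 7, "b": 5, "c": 0, "d": 0}
program = [("ADD", "a", "b", "c"), ("SUB", "c", "a", "d")]
print(run(program, memory))
```

Each pass through the loop is one machine cycle: the first pass stores 12 in c, and the second pass uses that result to store 5 in d.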

Level 1 cache memory is memory that is included inside of the CPU itself. It is usually smaller and faster than level 2 cache memory. Level 2 cache memory is memory between the RAM and CPU. It is used when the level 1 cache memory is full or is too small to hold the intended data. Originally it was not directly on the CPU chip itself. *Read the update at the bottom of this page.* The photo above shows level 2 cache memory on the processor card, beside the CPU. Below are two photos of a CPU. The photo on the bottom is a view of the CPU chip from the outside. The photo on the top is a large map of the inside of the CPU, showing the different areas and what their function is. See if you can find the areas that fetch, decode, and execute the instructions. Can you also find the level 1 cache areas that store information? The pipelined floating point area, logic areas, and superscalar integer execution units area are part of what? Did you guess the arithmetic/logic unit? If so, you're right!

At the top you can also see the clock driver. The clock driver is what times, or sets the pace, for the computer. The clock's speed is how CPUs are rated. Each machine cycle consists of two beats. On each beat the control unit fetches and decodes data, which is called the "instruction cycle." At the same time the arithmetic/logic unit executes and stores data, which is called the "execution cycle." The speed of a clock is rated by how many beats per second it can accomplish. 1 billion beats per second is referred to as 1 GHz. For every beat (except the very first), a machine cycle is completed. Common CPUs available today perform at 3 GHz and faster. In this simplified model, that means a 3 GHz CPU can execute about 3,000,000,000 instructions in a single second!
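The arithmetic behind that figure can be written out directly. The sketch below only restates the simplified two-beat model described above, in which the first beat completes nothing and every later beat completes one machine cycle; the function names are made up for the example, and real processors do not behave this neatly.

```python
# Arithmetic for the simplified two-beat model described above.
# Real processors differ; this only restates the text's model.

def beats_per_second(gigahertz):
    """Convert a clock rating in GHz to beats per second."""
    return int(gigahertz * 1_000_000_000)


def completed_machine_cycles(beats):
    """Machine cycles finished after a number of clock beats.

    The first beat only fetches and decodes, so nothing completes on it;
    every later beat completes one machine cycle.
    """
    return max(beats - 1, 0)


one_second = beats_per_second(3)
print(one_second, completed_machine_cycles(one_second))
```

So a 3 GHz clock gives 3,000,000,000 beats in a second and, in this model, one fewer completed machine cycle, which is where the round figure of 3 billion comes from.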

*Update*

The slot 1 processor is no longer being produced. Below are two photos of an AMD Athlon 64 FX socket 939 processor and one photo of a Pentium 4 Extreme Edition socket 775 processor. These are later model processors than the slot 1. Currently AMD is using the socket 939, socket 940, and socket 754 processors. Intel is using the socket 775 and socket 478 processors. All of these processors look similar, but they do have some differences, including the number of contact points (or pins) that they have. Another difference in some of the newer processors is that the level 2 cache memory is located directly on the CPU chip itself. Any cache memory located outside of a CPU like this is called level 3 cache memory. The usage is still the same though. Level 1 cache memory is still located closest to the core of the CPU and is still usually smaller and faster than the level 2 cache memory. Some of the newer processors even have level 3 cache memory located directly on the CPU itself. Any cache memory located outside of a CPU like this is called level 4 cache memory. As with the other levels of cache memory, the higher the level, the further away from the core of the CPU it is located. The higher levels of cache memory also are usually larger and slower than the lower levels. The first photo below shows the front and back of a Pentium 4 Extreme Edition socket 775 processor. It has level 3 cache memory located directly on the CPU itself. The second photo below shows the front and back of an AMD Athlon 64 FX socket 939 processor. It has level 2 cache memory located directly on the CPU itself. The third photo below shows the AMD processor installed on a motherboard with a heatsink and fan.

CGaAs FXU (PowerPC Integer Processor)

Tuesday, March 24, 2009

Abstract:

The primary focus of this project is the development of a radiation-hard complementary GaAs (CGaAs), PowerPC microprocessor with flip-chip, area I/O packaging. This processor, called PUMA, is ideally suited to space applications because of its low power-delay product and excellent radiation hardness. We have partially tested a CMOS prototype of the PUMA architecture, and are ready to begin testing the CGaAs processor. We have analyzed the CGaAs technology to determine the most cost-effective scaling factor for each design rule; in this effort, we have developed a methodology and tools to help engineers also scale CMOS processes non-linearly.

We have developed and tested low-jitter PLL clock generators, current-mode I/O, and CAD tools for better leaf-cell design, logic synthesis, and minimization of cross-talk. We have developed new packaging capabilities, including a gold bumping process which produces bumps with pitches as small as 50 µm. Assembly of MCMs has begun at 3-M. The remainder of this project and an accompanying AASERT will complete the system design and demonstrate the prototype in a desktop computer.

This project is supported by the Advanced Research Projects Agency under ARPA/ARO Contract Number DAAH04-94-G-0327.

Microprocessor

The microprocessor is the center of your computer. It processes instructions and communicates with outside devices, controlling most of the operation of the computer. The microprocessor usually has a large heat sink attached to it. Some microprocessors come in a package with a heat sink and a fan included as a part of the package. Other microprocessors require you to install the heat sink and fan separately. This is not a difficult problem, but can be a bit daunting when the buyer wants to make sure they get the correct parts to fit their microprocessor. Also the buyer needs to make sure they will get the motherboard that their microprocessor will work with. This section will explain some of the differences in microprocessors and ways to be sure your parts match.

Microprocessors and Mounting

The mounting method refers to the type of connection the microprocessor makes with the motherboard. The following table lists the various mounting packages and some of the well known microprocessors that are mounted for that package.

  • Socket 7 - AMD K5, K6, Intel Pentium 75-200 MHz, IBM
  • Socket 370 - Some Intel Celerons
  • Slot 1 - Intel Pentium II, Pentium III, Some Celeron 266-533
  • Slot II - Intel Xeon
  • Slot A - AMD Athlon

The Socket 7 processors are becoming less popular. We currently recommend Socket 370 through Slot A microprocessors. The prices on Socket 370 microprocessors are currently very low considering the performance of the systems. I recently bought a Celeron 500 MHz microprocessor with a 66 MHz front-side bus for under $120, with a motherboard for $84. When buying a microprocessor, make sure you get the socket type you expect, since some processors, such as the Celeron, are made for more than one socket type. Be sure of one of the following:

  1. The socket type is stated at the vendor's website.
  2. There is a microprocessor part number stated at the vendor's website that can be traced to the manufacturer's website, which specifies the mounting package you want.

It would be no fun to get a Slot 1 motherboard and a Socket 370 microprocessor.
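The mounting list above can also be kept as a small lookup, which makes the Celeron problem easy to see. This is just a sketch: the table repeats only the entries from the list, and the function names mountings_for and parts_match, along with the simple substring match, are assumptions made for illustration.

```python
# Lookup built from the mounting list above. The function names and the
# simple substring match are assumptions for illustration only.

MOUNTINGS = {
    "Socket 7": ["AMD K5", "AMD K6", "Intel Pentium 75-200 MHz", "IBM"],
    "Socket 370": ["Intel Celeron (some models)"],
    "Slot 1": ["Intel Pentium II", "Intel Pentium III",
               "Intel Celeron 266-533 (some models)"],
    "Slot II": ["Intel Xeon"],
    "Slot A": ["AMD Athlon"],
}


def mountings_for(cpu_name):
    """List every mounting type that has a processor matching cpu_name."""
    wanted = cpu_name.lower()
    return [mounting for mounting, cpus in MOUNTINGS.items()
            if any(wanted in cpu.lower() for cpu in cpus)]


def parts_match(cpu_mounting, motherboard_mounting):
    """A CPU and motherboard fit only if both use the same known mounting."""
    return cpu_mounting in MOUNTINGS and cpu_mounting == motherboard_mounting


print(mountings_for("Celeron"))
print(parts_match("Socket 370", "Slot 1"))
```

The Celeron shows up under two mounting types, which is exactly why the processor name alone is not enough to pick a motherboard.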

Microprocessor Heat Sinks and Fans

Being sure you get the correct heat sink and fan for your microprocessor can be a bit daunting. Who wants to get a $300 microprocessor, and risk it with an incorrect mounting of a heatsink or fan? Who wants to find out that they have purchased the wrong heatsink for their processor and spend days or weeks trying to sort it out? My solution is to purchase the microprocessor with the heatsink in the same package. Usually you get a better warranty and return policy this way and you don't need to worry about whether the two are compatible. I do not believe you can save enough money buying the heatsink and fan from anyone other than the vendor selling the microprocessor, because of the time it takes for the additional research required and the potential trouble. The best solution to this problem is simply to buy a Slot 1, Slot II or Slot A microprocessor with the package that includes the fan and heatsink. These would be one of the Pentium II, Pentium III, Athlon, or Xeon packages. All that is required in this case is to slide the microprocessor carefully into its slot. With the exception of processors such as the Athlon, which have a larger heat sink requiring an extra plastic clip mechanism to help stabilize the heatsink, it is easier to install one of these processors than it is to install the computer's RAM or a hard drive.