Read about the available trace concepts:
The Bus Trace chapter covers trace use on iSYSTEM debug and test tools featuring a so-called bus trace. Bus trace means that the trace hardware can sample and capture the microcontroller address and data bus values at every CPU cycle. Bus trace provides very detailed insight into microcontroller behavior during program execution.
Bus trace exists on nearly all iSYSTEM in-circuit emulation tools. Additionally, it’s also available on iSYSTEM on-chip debug & test tools supporting Nexus trace or ARM ETM trace, on top of which iSYSTEM proprietary RTR (Real-Time Trace Reconstruction) technology is integrated.
List of debug & test tool configurations featuring bus trace:
•iC1000 PowerEmulator (all supported architectures)
•iC2000 PowerEmulator including PowerAnalyzer module (all supported architectures)
•EM Microelectronic CoolRISC Active POD
•Freescale 68HC08 Active PODs II
•Freescale 68HCS12 Active PODs
•Freescale MC9S12X ActivePRO/GT PODs
•Freescale 68K ActiveGT PODs
•Freescale 68332 ActiveGT POD
•Renesas V850ES/Fx3 ActivePRO and ActiveGT POD
•Renesas 78K0R and RL78 ActiveGT POD
•Renesas 78K0 ActivePRO POD
•Renesas R8C/3x ActivePRO POD
•iTRACE PRO/GT & ARM ARM7 ETM (RTR)
•iTRACE PRO/GT & Freescale MPC5553/5554, MPC5561/5565/5566/5567 (RTR)
•iTRACE GT & Freescale MPC56x (RTR)
OCT ARM ETM chapter covers ETM trace on ARM microcontrollers.
OCT Cortex chapter covers On-Chip trace on Cortex based microcontrollers.
OCT MPC5xxx SPC5x Nexus L2+ chapter covers trace usage on Freescale MPC5xxx and ST SPC5x microcontrollers featuring Nexus Class 2+ interface.
OCT MPC5xxx SPC5x Nexus L3+ chapter covers trace usage on Freescale MPC5xxx and ST SPC5x microcontrollers featuring Nexus Class 3+ interface.
The User Trace Port chapter covers trace possibilities on CPUs where a trace port is not available (e.g. V850E2/Fx4L).
Slow Run mode offers program execution and data trace capabilities on MCUs without a trace port. It is available on all architectures.
This document describes how real-time CPU activity is acquired and analyzed. It also shows limitations of different tracing techniques.
•Emulation Technical Notes for the respective CPU provide more specific information. Not all CPUs and emulation tools provide all the tracing capabilities which are discussed in this document.
Program flow is used most often. It is instrumental in finding bugs in real-time execution, measuring code performance and in code coverage.
Accesses to memory (mainly write accesses) are typically used for data acquisition and profiling OS events.
On some On-Chip Trace architectures, the application can use special op-codes or writes to dedicated register(s) to generate instrumentation trace messages. This can reduce the amount of trace data considerably, and is usually the only trace available on low-end OCT implementations.
Apart from the above activities, the CPU might implement some advanced trigger functionality, which will generate a watchpoint message when a certain complex condition is met.
Some CPUs can signal entry into and wake-up from a low-power state.
Next to the trace stream originating from the CPU, an emulator can provide auxiliary inputs which can be traced alongside CPU. The advantage here is that events inside the CPU can be well correlated to events outside the CPU.
The auxiliary inputs are typically digital, but analog signals or network protocol messages can be traced too.
One of the most important parts of the trace is a real-time timestamp assigned to every event as it is recorded. Without a timestamp mechanism, real-time analysis and performance measurement is not possible.
The amount of information generated by a fast CPU is enormous. Despite trace stream compression on modern OCT architectures, well over 100MB/s of data can be generated for a single core running at 100MHz. A faster SoC with multiple cores will generate yet more data.
Without any filtering mechanism, the trace session would inevitably be limited in time, and the high-level analysis would take a lot of time.
To reduce the trace data bandwidth, both ICE and OCT architectures implement a Qualifier condition, which defines what trace data is recorded and what gets discarded. One of the simplest qualifier examples is to ‘Record Program Only’ – here only program activity is recorded, but the much more bandwidth consuming data trace is discarded.
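As an illustration only, the effect of a 'Record Program Only' qualifier can be modeled as a filter over a stream of trace events. The `TraceEvent` type and the "program"/"data" labels below are hypothetical and do not correspond to any tool API; a real qualifier is evaluated in hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    kind: str      # "program" or "data" (hypothetical labels)
    address: int


def record_program_only(event: TraceEvent) -> bool:
    """Qualifier: keep program activity, discard data trace."""
    return event.kind == "program"


def apply_qualifier(events, qualifier):
    """Return only the events that the qualifier lets through."""
    return [event for event in events if qualifier(event)]


events = [
    TraceEvent("program", 0x1000),
    TraceEvent("data", 0x2000_0000),
    TraceEvent("program", 0x1004),
    TraceEvent("data", 0x2000_0004),
]
recorded = apply_qualifier(events, record_program_only)
print(len(recorded), "of", len(events), "events recorded")
```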
The trace can always be activated independently of the CPU application state. This is not desirable when the activity of interest occurs infrequently: an immediate recording start could fill the trace buffer before the activity occurs, and searching for the activity in a vast recording is too time-consuming.
Similar to the qualifier, the Trigger condition, which defines when to start recording, can be defined.
One of the typical triggers is a write access to some program variable, upon which all the program trace is recorded – thus providing insight into what parts of the program access that variable.
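The trigger idea can be sketched in a few lines, assuming a simplified model in which events are `(kind, address)` tuples and everything before the trigger is discarded. The watched address and the event labels are made up for the example.

```python
WATCHED_ADDRESS = 0x2000_0010  # hypothetical address of a program variable


def write_to_watched_variable(event):
    """Trigger condition: a write access to the watched variable."""
    kind, address = event
    return kind == "write" and address == WATCHED_ADDRESS


def record_after_trigger(events, trigger):
    """Discard events until the trigger fires, then record everything."""
    recorded = []
    triggered = False
    for event in events:
        if not triggered and trigger(event):
            triggered = True
        if triggered:
            recorded.append(event)
    return recorded


events = [
    ("exec", 0x1000),
    ("exec", 0x1004),
    ("write", 0x2000_0010),
    ("exec", 0x1008),
    ("exec", 0x100C),
]
recorded = record_after_trigger(events, write_to_watched_variable)
print(recorded)
```

In this model the first two executed op-codes never reach the buffer; recording starts with the write access itself.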
Simple CPUs (mainly older ones) used external memory. The address, data and control buses were visible on the device pins.
An in-circuit emulator replaces the CPU on the target with an emulation POD, where the same (or an emulation) version of the CPU is located, but all external busses are visible to the emulator. Most CPU activities (apart from internal memory and register transfers) can thus be recorded.
The ICE typically implements trigger and qualifier logic in an FPGA circuit. The timestamp is generated internally in the emulator and assigned to every recorded trace sample, so the timing accuracy is the best possible.
The drawbacks of this approach become apparent when CPUs employ an internal instruction pipeline, where externally fetched instructions are not always executed. An instruction or data cache could completely obscure the CPU activities.
Once CPUs started integrating memory, instruction pipelines and caches, an ICE trace became impossible. The solution to this problem is to integrate a trace module on the chip itself. This module can observe all CPU activities. It implements its own trigger and qualifier logic. In addition to that, the trace data is compressed to reduce the bandwidth and thus the number of pins dedicated to trace.
Once the compressed data package is ready, it can be either
•streamed through a trace port to an OCT tool, or
•stored internally in an On-Chip Trace Buffer (OCTB)
A trace port must be designed to sustain the most typical bandwidth. Initially, standard digital I/Os were used, but since such ports do not exceed 200MHz in practice, 16 or more such pins were required (on CPUs in the 200MHz range), making the CPU package larger and more expensive. High-speed pin toggling also increases power consumption considerably, which puts design requirements on the entire ECU.
As trace bandwidth increased, it became obvious that digital I/O ports could not keep up, and Aurora, an LVDS-based, PCI Express-like protocol, was employed.
A trace port has several significant advantages:
•Duration of the recording is not limited by the CPU. If the OCT tool can keep up with the incoming trace stream, the session can be unlimited in length – an important factor in system tests.
•Timestamp is available – as it is generated by the OCT tool.
There are some inevitable drawbacks compared to an ICE or an OCTB, but these are either minor or can usually be avoided or compensated for:
Since the CPU activity is compressed and qualifier is employed, trace messages are not generated at a constant rate. While the trace port is sized to sustain an average bandwidth, at times the rate of generated messages will exceed this.
To compensate, the OCT module uses a message FIFO (typically 16-64 entries deep). Still, if the qualifier is set too widely, the FIFO can fill up and subsequent OCT messages cannot be inserted.
Such a situation can be handled in one of these ways:
•Trace is stopped, indicating FIFO overflow
•Trace messages (usually the data trace) are suppressed until FIFO is freed to some level, then they resume.
•CPU is stalled until one FIFO entry is free.
To avoid this condition these measures can be taken:
•Define a finer qualifier
•Increase the trace port bandwidth – either by using a wider port or a higher trace clock.
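The three overflow strategies listed above can be compared with a toy simulation. This is a rough sketch under simplifying assumptions: the port drains exactly one message per cycle, the FIFO keeps draining after a stop, and suppression resumes once the fill level drops to a hypothetical `resume_level`. Real OCT modules differ in all of these details.

```python
def simulate_fifo(produced_per_cycle, depth=16, policy="stop", resume_level=8):
    """Simulate an OCT message FIFO drained at one message per cycle.

    produced_per_cycle: number of messages generated in each CPU cycle.
    policy: "stop", "suppress" or "stall" (names chosen for this sketch).
    """
    if policy not in ("stop", "suppress", "stall"):
        raise ValueError(f"unknown policy: {policy}")
    fill = sent = lost = stall_cycles = 0
    stopped = suppressing = False
    for produced in produced_per_cycle:
        if stopped:
            lost += produced          # trace is off, nothing is recorded
            continue
        if fill > 0:                  # the trace port sends one message
            fill -= 1
            sent += 1
        if suppressing:
            if fill > resume_level:
                lost += produced      # still suppressing messages
                continue
            suppressing = False       # FIFO freed enough, resume
        free = depth - fill
        if produced <= free:
            fill += produced
            continue
        overflow = produced - free
        fill = depth
        if policy == "stall":
            # CPU waits one cycle per missing entry; the port sends meanwhile.
            stall_cycles += overflow
            sent += overflow
        else:
            lost += overflow
            stopped = policy == "stop"
            suppressing = policy == "suppress"
    sent += fill                      # remaining FIFO content drains at the end
    return {"sent": sent, "lost": lost,
            "stall_cycles": stall_cycles, "stopped": stopped}


burst = [4] * 10  # hypothetical burst: 4 messages per cycle for 10 cycles
for policy in ("stop", "suppress", "stall"):
    print(policy, simulate_fifo(burst, policy=policy))
```

Only the stall strategy keeps every message, at the cost of changing the real-time behavior of the application.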
The timestamp in an OCT configuration is generated by the OCT tool when a trace message is received and stored in the buffer. The problem is that the message propagation delay from the event occurrence to the port output is not constant: one message might stream immediately, while another has to wait multiple cycles in the FIFO.
While this reduces the time accuracy considerably, it is in practice less noticeable because:
•FIFO load is relatively constant - and usually low, if sustained traces are required.
•If program trace is recorded, the tool will (in the analysis stage) interpolate the time based on executed op-codes and thus compensate for deviations.
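A minimal sketch of such an interpolation, assuming every op-code between two trace messages takes the same time. This is a simplification; an actual tool may weight op-codes differently.

```python
def interpolate_timestamps(t_start, t_end, opcode_count):
    """Spread op-codes evenly between two message timestamps.

    The last op-code receives t_end, the time the closing message arrived.
    """
    if opcode_count <= 0:
        return []
    step = (t_end - t_start) / opcode_count
    return [t_start + step * (i + 1) for i in range(opcode_count)]


# Two messages received 40 time units apart, 4 op-codes executed in between.
print(interpolate_timestamps(100.0, 140.0, 4))
```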
Since any kind of trace port is expensive in terms of pins and power consumption, some CPUs implement a dedicated trace buffer, where the compressed stream is stored instead of being pushed out of the trace port.
This has several advantages:
•FIFO overflows don’t occur
•A low-end tool can read-out the buffer via the standard debug port (e.g. JTAG)
But several important disadvantages:
•The trace buffer RAM is relatively small compared to an external tool. Only short trace sessions are possible, which makes it unsuitable for system tests.
•Most OCTB systems don’t support streaming mode, so even when configuring a strong qualifier (yielding low bandwidth), activity can’t be displayed as it happens, but only after the trace is stopped.
•Most OCTB systems don’t provide an internal time stamp. On those that do, it typically records CPU cycles instead of a real time base; it is narrow and can thus overflow on strong qualifiers.
•Parallel trace of auxiliary events is not possible
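To see why a narrow cycle counter overflows with a strong qualifier, consider how quickly it wraps. The 16-bit width and 100MHz clock below are hypothetical values chosen for the arithmetic, not figures for any particular CPU.

```python
def wrap_period_s(counter_bits, cpu_hz):
    """Time after which a free-running cycle counter wraps around."""
    return (1 << counter_bits) / cpu_hz


def elapsed_cycles(previous, current, counter_bits):
    """Cycles between two counter readings.

    Only correct if the counter wrapped at most once between the readings.
    """
    return (current - previous) % (1 << counter_bits)


print(wrap_period_s(16, 100e6))            # seconds until a 16-bit counter wraps
print(elapsed_cycles(0xFFF0, 0x0010, 16))  # reading taken across one wrap
```

If a strong qualifier lets two consecutive recorded messages lie further apart than the wrap period, the number of wraps in between cannot be recovered from the counter values alone.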
•ETM = always program trace, sometimes also data trace
•DWT = data trace
•ITM = instrumentation trace
•HTM = AHB bus trace
The term ETM refers to an OCT ‘Macrocell’, but is for historical reasons often used to describe the entire OCT system as well as the OCT external port.
•level 1 = no trace, just debugging, usually IEEE1149 JTAG
•level 2 = program trace
•level 2+ = program trace and instrumentation trace (OTM)
•level 3 = program, instrumentation and data trace
•level 4 = level 3, plus capability to ‘emulate’ memory locations where the external tool supplies the data to the requested memory location.
If no trace is provided by a CPU, the last resort to getting at any real-time CPU activity is to use a free digital output port. The application must be instrumented to manipulate the port on the places where events of interest are known to occur, for example:
•Function entries and exits – this allows code profiling
•Task hook routines – this allows task profiling
•IRQ service routines – this allows IRQ profiling
User Trace Port is a good option if a larger package version of the CPU is available. In this case an emulation adapter is built, hosting the larger CPU; the additional ports are routed to the trace tools and the smaller package pins are routed to the CPU socket on the target board.
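On the analysis side, profiling from such a port comes down to measuring how long a pin stays high. The sketch below assumes the application sets one port bit on function entry and clears it on exit, and that the tool delivers `(timestamp, port_value)` samples; both the convention and the sample format are assumptions made for this example.

```python
def high_durations(samples, bit):
    """Return the durations for which `bit` of the port was high.

    samples: list of (timestamp, port_value) tuples sorted by timestamp.
    An interval still open at the end of the recording is ignored.
    """
    mask = 1 << bit
    durations = []
    rise_time = None
    for timestamp, value in samples:
        is_high = bool(value & mask)
        if is_high and rise_time is None:
            rise_time = timestamp                      # function entry
        elif not is_high and rise_time is not None:
            durations.append(timestamp - rise_time)    # function exit
            rise_time = None
    return durations


# Bit 0 marks one function, bit 1 another (hypothetical assignment).
samples = [(0, 0b00), (10, 0b01), (25, 0b00), (40, 0b01), (42, 0b11), (50, 0b10)]
print(high_durations(samples, bit=0))
```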
The trace data acquired in an ICE or an OCT system is stored in the tool’s own trace buffer, which can be several GB in size. The buffer stores raw data acquired from the CPU, along with AUX data and timestamps.
Buffer consumption depends strongly on the type of trace. On an ICE every recorded cycle will consume 8-16 bytes, whereas a compressed message on an OCT could consume 8 bytes for several hundred op-codes executed.
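The difference can be put into numbers with a short calculation. The 1 GiB buffer size and the figure of 200 op-codes per message are assumptions picked for illustration; the per-sample sizes are the ones quoted above.

```python
def ice_cycles(buffer_bytes, bytes_per_cycle):
    """Number of CPU cycles an ICE can record in the buffer."""
    return buffer_bytes // bytes_per_cycle


def oct_opcodes(buffer_bytes, bytes_per_message, opcodes_per_message):
    """Number of op-codes covered by compressed OCT messages in the buffer."""
    return (buffer_bytes // bytes_per_message) * opcodes_per_message


BUFFER_BYTES = 1 << 30  # hypothetical 1 GiB trace buffer

print(ice_cycles(BUFFER_BYTES, 16))        # ICE, 16 bytes per recorded cycle
print(oct_opcodes(BUFFER_BYTES, 8, 200))   # OCT, 8 bytes per 200 op-codes
```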
The hardware tool implements no further analysis of the data. When the host PC requests the data, it will upload it.
If an optimal hardware configuration (including the PC) is used, over 40 MB/s can be streamed to the PC via USB2 interface. If the average trace data bandwidth is lower than this, the trace buffer acts as a very large FIFO and the trace session can last indefinitely.
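Whether a session can run indefinitely depends on the net fill rate of the buffer. A rough estimate, assuming constant average rates and a hypothetical buffer size, could look like this:

```python
import math


def session_length_s(buffer_bytes, trace_rate, upload_rate):
    """Seconds until the buffer overflows, or math.inf if it never does.

    trace_rate and upload_rate are average rates in bytes per second.
    """
    net_fill_rate = trace_rate - upload_rate
    if net_fill_rate <= 0:
        return math.inf           # the buffer acts as a FIFO that never fills
    return buffer_bytes / net_fill_rate


BUFFER_BYTES = 10**9  # hypothetical 1 GB buffer

print(session_length_s(BUFFER_BYTES, 30e6, 40e6))   # trace slower than upload
print(session_length_s(BUFFER_BYTES, 100e6, 40e6))  # trace faster than upload
```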
The trace data on an ICE system consists of a series of CPU cycles, where state of address, data and control bus is known.
Trace data on an OCT system consists of a series of trace messages, containing compressed information about CPU activities.
Either of these must be reconstructed to the level required by the user. The analyzer window will display these levels of CPU activity:
•Raw data (CPU cycles or OCT messages)
•Disassembly – program flow on object code level, displayed in CPU mnemonics
•Source – program flow on source code level
•Symbols – program flow and data accesses to locations where known symbols are located
Note: Advanced analyzer functionalities, like profiler or coverage, implement additional processing on this reconstructed data stream.
Analysis on an ICE is relatively simple. Using basic object-code analysis, the tool determines which op-codes were executed. The op-code data is available on the data bus and the disassembly can be derived from that alone.
Data accesses are available directly without further analysis.
The CPU employs a compression technology to reduce the traffic over the OCT port. A typical program trace message is a branch message, which can say as little as “a branch was executed after 4 sequential instructions”. The tool must then reconstruct the program flow from its knowledge of the downloaded code and the CPU state from the previous message.
Modern OCT protocols can compress as much as 500 op-code executions in a single message. Once these are reconstructed, each op-code is assigned an interpolated time-stamp.
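The reconstruction step can be illustrated with a toy model. The code image, the fixed 4-byte instruction size and the message format (just a count of sequential instructions preceding a taken direct branch) are all assumptions for this sketch; real protocols also handle indirect branches, exceptions and variable-length op-codes.

```python
INSTR_SIZE = 4  # assumed fixed op-code size

# Hypothetical downloaded code: address -> (mnemonic, direct branch target)
CODE = {
    0x100: ("mov", None),
    0x104: ("add", None),
    0x108: ("cmp", None),
    0x10C: ("bne", 0x200),
    0x110: ("nop", None),
    0x200: ("sub", None),
    0x204: ("b", 0x100),
}


def reconstruct(start_pc, messages, code=CODE):
    """Rebuild the executed addresses from branch messages.

    Each message is the number of sequential instructions executed
    before a taken branch.
    """
    pc = start_pc
    flow = []
    for sequential_count in messages:
        for _ in range(sequential_count):
            flow.append(pc)
            pc += INSTR_SIZE
        mnemonic, target = code[pc]
        if target is None:
            raise ValueError(f"{mnemonic} at {pc:#x} is not a branch")
        flow.append(pc)   # the branch instruction itself
        pc = target       # continue at the branch target
    return flow


flow = reconstruct(0x100, [3, 1])
print([hex(address) for address in flow])
```

Two short messages are enough to recover six executed op-codes, which is where the compression comes from.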
Data accesses require minor reconstruction as the address is reported in differential form to the previous data message. Data trace is much less compressible than program trace, and an unfiltered trace is likely to overflow.
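As a sketch of what differential address reporting means, the example below assumes an XOR-style delta against the previous address, so that accesses to neighboring locations produce small values that need few bits on the port. The exact encoding is protocol specific and may differ from this model.

```python
def encode_addresses(addresses):
    """Encode each address as the XOR difference to the previous one."""
    previous = 0
    deltas = []
    for address in addresses:
        deltas.append(address ^ previous)
        previous = address
    return deltas


def decode_addresses(deltas):
    """Recover full addresses from the XOR differences."""
    previous = 0
    addresses = []
    for delta in deltas:
        previous ^= delta
        addresses.append(previous)
    return addresses


accesses = [0x2000_1000, 0x2000_1004, 0x2000_1008]
deltas = encode_addresses(accesses)
print([hex(delta) for delta in deltas])
print(decode_addresses(deltas) == accesses)
```

Because every address depends on the one before it, a lost message makes the following addresses undecodable until a full address is reported again.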
Instrumentation trace is similar to data trace, but the trace message is explicitly generated by execution of a special CPU op-code or by writing to a dedicated register. To generate instrumentation messages, the application must be modified (instrumented) to generate them at appropriate locations.
•On ARM CPUs this is called ITM.
•On CPUs with Nexus OCT port, this is called OTM. On most PowerPC processors this message is emitted when MMU configuration is modified and its usage is thus restricted to applications which do not make use of MMU.
Some newer Nexus CPUs also implement the DQM protocol which, much like the ARM’s ITM, is dedicated to instrumentation trace.
The size of the data transmitted in an instrumentation message depends on the CPU. Sizes range from 8 to 32 bits.
Instrumentation trace requires less data to be transmitted over the OCT port (the address is not given, just the data value) and is by its nature generated less frequently than memory writes. Most CPUs with program trace also implement instrumentation trace, without having to increase the number of OCT pins.
Note: Program execution is queued until a full-length message is streamed out. Data and instrumentation trace messages are streamed immediately on occurrence. When observing the trace stream, data accesses will therefore appear to occur before the accessing program code has actually executed.
Disclaimer: iSYSTEM assumes no responsibility for any errors which may appear in this document, reserves the right to change devices or specifications detailed herein at any time without notice, and does not make any commitment to update the information herein.
© iSYSTEM . All rights reserved.