Using COMDAT Sections to Reduce the Size of DWARF Debug Information

Cary Coutant Modified: October 30, 2008

Objective

DWARF debugging information for a typical C++ application can consume a large amount of disk space in both the relocatable object files and the final executable or shared library. Depending on the application and compilation options, the debug information can consume as much as 75% of the object file.

The bulk of the debug information is in the .debug_info section, the bulk of that section contains type information, and the bulk of the type information is made up of duplicate copies of types that are emitted by the compiler in each compilation unit.

There are several approaches to reducing the overhead of the debug information.

The GNU compiler supports several options, such as -femit-struct-debug-baseonly, that reduce the total amount of debug information generated at compile time, but these are heuristics that can often produce insufficient debug information.

One common approach, used by Sun, Apple, and HP, has been to leave most of the debug information in the relocatable objects, copying only a summary of the types defined in each object into the linker’s output file. This approach works well for some, but obviously requires that the original relocatable objects remain accessible to the debugger. It also imposes additional complexity on the debugger itself, which must be able to identify which object file contains the desired information, then must apply the relocations in order to use the data.

Another possible approach is to post-process the linker’s output, eliminating duplicate information in the debug information, and rewriting it. While this approach can achieve the desired reduction in size, the time required to link the application in the first place is not reduced at all, and additional time after the link is required to compress the debug information.

Ideally, the linker would be able to eliminate the duplicate debug information during the link, at the least avoiding the extra time it would take to write out the duplicate data. The structure of DWARF, however, makes this a difficult and expensive task, and results in much longer link times.

Attempts have been made to use ELF COMDAT section groups to help the linker identify and discard duplicate information, and the DWARF specification contains a discussion of how this may be implemented (see Appendix E of the draft DWARF-3 specification). The suggested mechanism, however, relies on partitioning the debug information by header file, so that the debug information produced by the compiler for a particular header file is equivalent in each separate compilation. This allows the linker to discard all but one copy of the debug information for each header file using its existing (and efficient) COMDAT mechanism. This scheme has three drawbacks. First, it requires the compiler to keep track of the debug information by header file, which is not always practical. Second, it requires the compiler to produce substantially the same debug information each time it encounters that header file; this requirement cannot always be met, as conditional compilation can change what source the compiler sees from one compilation to the next. Third, it requires the compiler to output debug information for the entire contents of the header file, rather than just those declarations that are actually referenced by the rest of the source file; this can cause the size of the relocatable objects to grow significantly, even if the final linked output might be smaller.

The design presented here takes advantage of the linker’s handling of COMDAT sections, but without the disadvantages of the approach described above. Rather than use a COMDAT group for each header file, it uses a COMDAT group for each type definition (except for base types and other trivial definitions). This allows the compiler to trim unused debug information from its output, while allowing the linker to remove duplicate type definitions without processing the contents of the DWARF sections.

This design also makes it convenient to implement a hybrid scheme where subprogram and variable definitions are copied to the output file at link time, along with line number tables, but the type definitions, living in separate sections, are not copied. This scheme would allow for full stack traces with line numbers at substantially reduced space overhead, while access to the original relocatable object files would be required only for more detailed debugging. Furthermore, type definitions in DWARF generally do not require link-time relocation, so they can be left in the relocatable objects without requiring the debugger to process relocations at debug time.

Design Highlights

The .debug_types Section

A new debug section, .debug_types, is used to hold DWARF type definitions. This section is structurally similar to the .debug_info section, consisting of a header followed by a tree of debug information entries (DIEs) describing a single type. No type definition is required to be placed in this section, nor is it advisable to place all type definitions in this section. Instead, only type definitions whose DWARF description is large enough to merit this treatment should be placed in the .debug_types section. The likelihood of duplication across multiple compilation units and the additional overhead of separating the type definition should be taken into account when determining what type definitions to move out of the .debug_info section. In practice, structure and union types declared in header files are good candidates. Incomplete types and declarations are not suitable.

When the compiler determines that a type definition should be moved out of the .debug_info section, it places the type definition in a new .debug_types section and makes that section a member of a COMDAT group whose key is a signature of the type definition itself. The linker’s existing ability to discard duplicate COMDAT groups based on the key will be used to eliminate duplicate definitions of that type. In the linked output file, a single .debug_types section will contain the concatenated contents of the input sections that were not discarded.
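The effect of keying each COMDAT group on the type signature can be illustrated with a small model. This is only a sketch of the idea, not how any real linker is implemented: the `TypeSection` class, the `link_debug_types` function, and the placeholder section contents are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TypeSection:
    """Hypothetical model of one .debug_types input section in a COMDAT group."""
    object_file: str   # relocatable object the section came from
    signature: int     # COMDAT group key: the 64-bit type signature
    contents: bytes    # raw bytes of the type unit (placeholder data here)


def link_debug_types(inputs):
    """Keep the first section seen for each key and concatenate the survivors."""
    seen = set()
    output = bytearray()
    for section in inputs:
        if section.signature in seen:
            continue  # duplicate group: discarded without reading its contents
        seen.add(section.signature)
        output += section.contents
    return bytes(output)


inputs = [
    TypeSection("a.o", 0x1122334455667788, b"<unit:Point>"),
    TypeSection("b.o", 0x1122334455667788, b"<unit:Point>"),  # same type again
    TypeSection("b.o", 0x99AABBCCDDEEFF00, b"<unit:Shape>"),
]
merged = link_debug_types(inputs)
```

Note that the decision to discard a section depends only on the key, which is why the linker never has to parse the DWARF data itself.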

In DWARF, a DIE that describes any typed object contains a DW_AT_type attribute that refers to a target DIE describing the type itself. For types defined in the .debug_info section, this reference is made using a reference-class form, which provides a direct offset of the target DIE within the .debug_info section. For types defined in the .debug_types section, the reference is made with the form DW_FORM_sig8, which provides the type signature. This form is a member of the reference class, as it is still used to reference debugging information entries. The DWARF consumer must scan the .debug_types section and construct a table that maps from a signature to the location of the DIE that provides the type definition.

When a class type is moved to a separate section, the compiler may find it necessary to leave a declaration of that class in the main compile unit DIE in the .debug_info section. For example, if the compile unit contains a definition of a member function of the class, the definition of that member function must be represented as a member of the class, so a brief declaration of the containing class will be kept in that compile unit. To help the DWARF consumer, the compiler may add a new attribute to that declaration: DW_AT_signature. This attribute is used to provide the signature of the class so that the consumer can easily match the declaration to the definition in the .debug_types section.

Each type definition is preceded by a type header. Similar to the compilation unit header (as described in Section 7.5.1 of the DWARF spec), it consists of the following fields:

  1. unit_length (initial length)
  2. version (uhalf)
  3. debug_abbrev_offset (section offset)
  4. address_size (ubyte)
  5. type_signature (8-byte unsigned integer)
  6. type_offset (section offset)

The first four fields are the same as those of a normal compilation unit header, as described in Section 7.5.1 of the DWARF spec.

Like a compilation unit, the DIEs following the header are associated with a particular abbreviations table. While the .debug_types section may use its own abbreviation table, it may also use the same abbreviation table as the corresponding compilation unit.

The header allows a DWARF consumer to scan the .debug_types section for the signatures quickly, without having to process the DIEs themselves.

The type_signature field contains the 8-byte signature of the type described immediately following the header.

The type_offset field contains the section offset of the DIE for this type definition. Because the type may be nested inside a namespace or other structures, it may not be the first or only DIE in the unit.
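As a rough illustration of how a consumer might build the signature table from the headers alone, the sketch below walks a .debug_types section and records each signature. It assumes the 32-bit DWARF format (4-byte initial length and 4-byte section offsets) and little-endian data; the function names `scan_debug_types` and `make_unit` and the sample values are hypothetical, and the DIE bytes are placeholders rather than real DWARF.

```python
import struct

# Assumed layout (32-bit DWARF, little-endian): unit_length, version,
# debug_abbrev_offset, address_size, type_signature, type_offset.
HEADER = struct.Struct("<IHIBQI")


def scan_debug_types(data):
    """Map each type signature to (offset of its unit header, type_offset)."""
    table = {}
    pos = 0
    while pos + HEADER.size <= len(data):
        unit_length, _version, _abbrev, _addr_size, signature, type_offset = (
            HEADER.unpack_from(data, pos)
        )
        if unit_length == 0xFFFFFFFF:
            raise ValueError("64-bit DWARF is not handled in this sketch")
        table[signature] = (pos, type_offset)
        # unit_length counts the bytes that follow the length field itself,
        # so the next header starts 4 + unit_length bytes further on.
        pos += 4 + unit_length
    return table


def make_unit(signature, type_offset, die_bytes):
    """Build a fake type unit for the demonstration (placeholder DIE bytes)."""
    length_after_field = HEADER.size - 4 + len(die_bytes)
    return HEADER.pack(length_after_field, 4, 0, 8, signature, type_offset) + die_bytes


section = make_unit(0xAAAA, 23, b"\x01\x02\x03") + make_unit(0xBBBB, 49, b"\x04")
table = scan_debug_types(section)
```

The scan never decodes a DIE: it reads the fixed-size header and then skips ahead by unit_length.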

The first DIE following the type header is a DW_TAG_type_unit DIE, which serves as the root of the tree of DIEs in the unit. The type unit DIE typically will have one attribute: DW_AT_language. It will have at least one child: the DIE that describes the type, and to which the type_offset field refers. It may have additional children as well. In the case of a type that is nested within a namespace or another type, there may be a declaration tree establishing the context, and the actual type DIE will be a specification referring to the declaration within that tree. If the type’s definition contains references to other types that have not been given type units of their own (e.g., base types or pointer types), definitions or declarations for those types may also be present as additional children of the type unit DIE.

Computing a Type Signature

The method for computing a type signature does not need to be formally specified, because the DWARF producer needs only a unique identifier that it can use to label the type in the .debug_types section and to reference the type from elsewhere. If two or more separate compilers are used in the same application, however, the use of differing methods will lessen the effectiveness of duplicate identification, so a suggested method is presented here.

To compute the signature for a type definition, start with the top-level DIE T0 and compute a byte-stream signature for the type, using the function S(T0, {T0}), defined below. If the type is nested within another type or namespace, also compute a byte-stream signature for that context, as described below, and insert that signature at the beginning of the type signature. Finally, generate an MD5 hash of that byte stream and use the low-order 64 bits of the result.
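The final hashing step might look like the following sketch. The function name `signature_from_stream` is hypothetical. The text does not say which bytes of the 16-byte MD5 digest count as "low-order"; this sketch assumes the digest is read as a big-endian integer, so the low-order 64 bits are its last 8 bytes. A producer and consumer using a different convention would still work, as long as every producer in the application agrees.

```python
import hashlib


def signature_from_stream(type_stream: bytes, context_stream: bytes = b"") -> int:
    """Hash the context bytes followed by the type bytes; keep the low 64 bits."""
    digest = hashlib.md5(context_stream + type_stream).digest()  # 16 bytes
    # Assumption: interpret the digest as a big-endian integer and mask off
    # the low-order 64 bits, which are the last 8 bytes of the digest.
    return int.from_bytes(digest, "big") & 0xFFFFFFFFFFFFFFFF


sig = signature_from_stream(b"abc")
```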

The byte stream for the context of a nested type is formed by starting at the outermost enclosing namespace or type DIE, initializing the stream s with the one-byte tag of the enclosing DIE followed by the name (from the DW_AT_name attribute), not including the trailing null byte. If a DIE has no DW_AT_name attribute, use the empty string as the name. Repeat this step for each enclosing layer, from outermost to innermost.
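A minimal sketch of the context stream follows. The function name `context_stream` and the representation of each enclosing layer as a `(tag, name)` pair are assumptions made for illustration; the tag values are the standard DWARF codes, and names are encoded as UTF-8 here, which the text does not specify.

```python
DW_TAG_class_type = 0x02
DW_TAG_namespace = 0x39


def context_stream(enclosing):
    """Build the context bytes from (tag, name) pairs, outermost layer first.

    A name of None stands for a DIE with no DW_AT_name attribute and is
    treated as the empty string.
    """
    s = bytearray()
    for tag, name in enclosing:
        s.append(tag)                      # one-byte tag of the enclosing DIE
        s += (name or "").encode("utf-8")  # name without a trailing null byte
    return bytes(s)


# Context for a type nested as geom::Shape::<type>.
ctx = context_stream([(DW_TAG_namespace, "geom"), (DW_TAG_class_type, "Shape")])
```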

The byte stream for a type DIE is generated by the function S(T, V), where T is a DWARF DIE and V is a list of DIEs representing visited type definitions, as follows:

  1. Initialize the stream s with the single byte representing the DIE’s tag (e.g., DW_TAG_structure_type).
  2. If the DIE has a DW_AT_name attribute, append the attribute code DW_AT_name, followed by the name itself (not including the trailing null byte), to s.
  3. If the DIE represents a pointer or reference type, or a DW_TAG_friend entry, find the name of the target type and append the DW_AT_name attribute code, followed by the name itself (not including the trailing null byte), to s. Skip Steps 4-6.
  4. For each of the following attributes that are present in the DIE, in the order listed, append the one-byte attribute code (e.g., DW_AT_encoding) and the value of the attribute to s:
  5. If the DIE is a pointer or reference type, and a DW_AT_type attribute is present in the DIE, and the referenced type has a DW_AT_name attribute, append the attribute code DW_AT_name, followed by the name itself (not including the trailing null byte), to s.
  6. Otherwise, if a DW_AT_type attribute is present in the DIE, and the referenced type is in the list V, append the attribute code DW_AT_type, followed by the index of the type in V, to s.
  7. Otherwise, if a DW_AT_type attribute is present in the DIE, append the attribute code DW_AT_type to s, then generate a byte stream S(Tn, V + {Tn}) for the referenced type and append it to s.
  8. For each child DIE Cn, generate a byte stream S(Cn, V) and append it to s.
  9. Return the value of s as the result of the function.
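The sketch below shows one possible reading of the function S(T, V), using a simplified in-memory DIE model rather than real DWARF data. Several things here are assumptions and not part of the proposal: the `DIE` class and its fields; treating Steps 3 and 5 as the same rule (a pointer, reference, or friend DIE whose target has a name contributes only that name); passing the Step 4 attribute list in as the parameter `attr_order`; storing attribute values as already-encoded bytes; and writing the index in V as a single zero-based byte.

```python
from dataclasses import dataclass, field
from typing import Optional

DW_AT_name, DW_AT_byte_size, DW_AT_type = 0x03, 0x0B, 0x49
DW_TAG_member, DW_TAG_pointer_type, DW_TAG_reference_type = 0x0D, 0x0F, 0x10
DW_TAG_structure_type, DW_TAG_base_type, DW_TAG_friend = 0x13, 0x24, 0x2A
NAME_ONLY_TAGS = {DW_TAG_pointer_type, DW_TAG_reference_type, DW_TAG_friend}


@dataclass(eq=False)  # compare DIEs by identity, which is what V needs
class DIE:
    """Hypothetical, simplified DIE used only for this illustration."""
    tag: int
    name: Optional[str] = None
    attrs: dict = field(default_factory=dict)  # code -> encoded value bytes
    type_ref: Optional["DIE"] = None           # target of DW_AT_type (or friend)
    children: list = field(default_factory=list)


def S(die, visited, attr_order):
    s = bytearray([die.tag])                                   # Step 1
    if die.name is not None:                                   # Step 2
        s.append(DW_AT_name)
        s += die.name.encode("utf-8")
    target = die.type_ref
    if die.tag in NAME_ONLY_TAGS and target is not None and target.name is not None:
        s.append(DW_AT_name)                                   # Steps 3 and 5
        s += target.name.encode("utf-8")
    else:
        for code in attr_order:                                # Step 4
            if code in die.attrs:
                s.append(code)
                s += die.attrs[code]
        if target is not None:
            s.append(DW_AT_type)
            index = next((i for i, v in enumerate(visited) if v is target), None)
            if index is not None:                              # Step 6
                s.append(index)  # assumption: one byte, zero-based
            else:                                              # Step 7
                s += S(target, visited + [target], attr_order)
    for child in die.children:                                 # Step 8
        s += S(child, visited, attr_order)
    return bytes(s)                                            # Step 9


# struct Node { int value; Node *next; };
int_type = DIE(DW_TAG_base_type, "int", {DW_AT_byte_size: b"\x04"})
node = DIE(DW_TAG_structure_type, "Node", {DW_AT_byte_size: b"\x10"})
node.children = [
    DIE(DW_TAG_member, "value", type_ref=int_type),
    DIE(DW_TAG_member, "next", type_ref=DIE(DW_TAG_pointer_type, type_ref=node)),
]
stream = S(node, [node], attr_order=[DW_AT_byte_size])
```

Because the pointer member contributes only the name of its target, the self-reference in `Node` does not cause unbounded recursion in this model.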

Scope

These changes are being proposed as an extension of DWARF-3, to appear in the DWARF-4 specification.

In gcc, the ability to separate type information into .debug_types sections will be conditioned on a compile-time option -gdwarf-4. Most of the source changes to provide this functionality will be in the file dwarf2out.c, and will be of a similar nature to the existing functionality for separating debug info into COMDAT groups based on include files.

In gdb, most of the source changes to support the new section will be in the file dwarf2read.c. It will need to read the .debug_types section and make a quick scan of the section to record the signatures contained therein. When processing an attribute of form DW_FORM_sig8, it should look up the signature and convert the attribute into an equivalent reference form with a pointer to the DIE as read from the .debug_types section. The referenced DIE will need to be treated as belonging to a separate compilation unit, which will need to be loaded if it has not already been loaded.

The readelf and objdump utilities, and the DWARF support in bfd, will also require changes to support this extension.
