This is a proposal to fix some minor glitches in the way FORMs are currently used, and to try to prevent more such glitches in the future. The basic problem being addressed is the inability to correctly interpret the data associated with an attribute without a lookup table that says how to interpret the data for each specific attribute. I believe that FORMs were originally invented to avoid the need for this kind of lookup; otherwise dwarf would only have encoded the length of the data, and wouldn't have bothered defining different forms. I have also included one new form that seems like a clever way to reduce space consumption but is not related to solving these other problems (that is, DW_FORM_implicit_flag).
need to know attribute to understand data
Using a lookup table is difficult for several reasons:
1) In any consumer there will be lower level routines and higher level routines. The low level routines used to scan the binary tag structure would like to be able to associate some semantic meaning with the data in a generic way without needing to use a lookup table. This is basic efficiency and cleanliness of code, and it minimizes the dependencies between different parts of the dwarf reader.
2) In a dwarf dumper, there will sometimes be vendor extensions that are not known to the dumper. In this case, it's helpful for a human being to be able to see and understand the attribute even if the dumper program itself doesn't know what it means. Being able to differentiate between an external string and a location list makes a big difference in readability.
3) Even when the producer and consumer are tightly linked, and if we temporarily ignore issues of interoperability with other vendors, there are still version skew problems within a single set of products, for example between version 12.3 of the compiler and version 12.2 of the dumper. It's very convenient for me not to have to rebuild my dumper every time I add a new attribute to the compiler.
Besides the need for a lookup table, having overly-broad FORMs causes a problem with the overloading that is common in dwarf 3 attributes. The meaning of the data attached to an attribute is understood by looking at the FORM of the data. If we use the same FORM for two kinds of data, then attributes cannot be multi-tasked to contain both kinds of data, because the two kinds could not be distinguished based on their FORM. We've run into this recently where data4/data8 can be used in class 'constant' and class 'loclistptr' (or other section pointers). Some attributes currently take both constant and loclistptr (DW_AT_data_member_location). Those cannot use data4/data8 values for the constant. Some attributes currently take only 'constant' (DW_AT_start_scope). Those attributes cannot safely be extended to also include 'loclistptr' because some implementations might already use data4/data8 forms for the constant values.
Because dwarf 3 uses this overloading so heavily, it's very important that forms be able to accurately and precisely define the meaning of the data attached to an attribute.
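To make the data4 overloading concrete, here is a minimal sketch of the attribute-keyed lookup a dwarf 3 consumer ends up needing. The table contents and the helper name class_of_data4 are illustrative assumptions, not taken from any real reader; the numeric codes are the standard dwarf 3 values.

```python
# Standard DWARF 3 codes; the table and helper below are illustrative only.
DW_FORM_data4 = 0x06
DW_AT_location = 0x02
DW_AT_byte_size = 0x0B
DW_AT_stmt_list = 0x10
DW_AT_data_member_location = 0x38

# Attributes for which a data4 payload is a section offset, not a constant.
SECTION_POINTER_ATTRS = {
    DW_AT_location: "loclistptr",
    DW_AT_stmt_list: "lineptr",
    DW_AT_data_member_location: "loclistptr",
}


def class_of_data4(attribute: int) -> str:
    """Guess the class of a DW_FORM_data4 value from the attribute alone."""
    # Anything missing from the table is assumed to be a plain constant,
    # which is exactly the guess that can be wrong for an unknown
    # vendor attribute.
    return SECTION_POINTER_ATTRS.get(attribute, "constant")


known_constant = class_of_data4(DW_AT_byte_size)
known_pointer = class_of_data4(DW_AT_data_member_location)
vendor_guess = class_of_data4(0x2001)  # hypothetical vendor attribute
```

The form alone says nothing here: the same 4 bytes are a constant under one attribute and a .debug_loc offset under another, and the fallback for an attribute the table has never seen is only a guess.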
For the reasons above, I think it's worth our time to clean up the kinds of different FORMs we have in dwarf 3 so that they can be used more precisely by producers and understood better by consumers (dumpers, debuggers, and other consumers).
There are several kinds of ambiguity that I can see in dwarf3 tools:
- not knowing if a data value is signed or unsigned
- not knowing if a data4/8 is a section offset or a constant
- not knowing what section a section pointer points to
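As a small illustration of the first ambiguity, the sketch below decodes the same data1 payload both ways. The helper name decode_data is hypothetical, and little-endian byte order is assumed for the multi-byte cases.

```python
def decode_data(raw: bytes, signed: bool) -> int:
    """Interpret a fixed-size data1/2/4/8 payload (little-endian assumed)."""
    return int.from_bytes(raw, byteorder="little", signed=signed)


payload = b"\xff"  # one byte, as it would be stored under DW_FORM_data1

as_unsigned = decode_data(payload, signed=False)  # 255
as_signed = decode_data(payload, signed=True)     # -1
```

Nothing in the form tells a dumper which of the two readings the producer intended.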
all blocks must be expressions
If someone were to try and use a block form to represent something besides an expression there would be no way to tell it from an expression. We should fix this.
explicit section forms
- we should introduce explicit forms for each elf section in dwarf4
- we should allocate a range of form values for section-offset vendor extensions
- we should combine these so that future standards can add new section pointers in a compatible way
- (see ranges below) all values between 0x40-0x7f can be parsed as 4/8 bytes
- forms that are known to the consumer can be either printed or processed internally without knowing the attribute.
new implicit flag form
Flag forms can be encoded without any bytes of data at all. This essentially uses the abbreviation table to store the data. Producers have a choice: a) use the 1-byte flag form and create a single abbrev entry shared by tags with 1-values and tags with 0-values, or b) create one abbrev entry with the implicit form and another abbrev entry without the implicit form. Option b) uses more abbrev entry space, but uses less space for each tag. The implementation decision is probably determined by the design of the producer. The libdwarf library automatically creates abbrev entries based on the attributes. Unless the caller explicitly creates tags with flag==0, two abbreviation entries will be created.
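The space trade-off between the two choices can be estimated with a rough sketch like the one below. It assumes every ULEB in an abbrev entry (code, tag, attribute name, form) fits in one byte, and it reads option b) as a second entry that simply omits the flag attribute. The function names are made up for illustration.

```python
def abbrev_entry_size(num_attrs: int) -> int:
    """Bytes for one abbrev entry, assuming every ULEB fits in one byte."""
    code, tag, has_children = 1, 1, 1
    terminator = 2  # the closing (0, 0) attribute/form pair
    return code + tag + has_children + 2 * num_attrs + terminator


def cost_explicit_flag(num_attrs: int, num_dies: int) -> int:
    """Option a): one shared abbrev entry, one flag byte in every DIE."""
    return abbrev_entry_size(num_attrs) + num_dies * 1


def cost_implicit_flag(num_attrs: int) -> int:
    """Option b): two abbrev entries, no flag bytes in any DIE."""
    with_flag = abbrev_entry_size(num_attrs)
    without_flag = abbrev_entry_size(num_attrs - 1)
    return with_flag + without_flag


# 1000 DIEs of one tag, each with 4 attributes, one of them a flag.
option_a = cost_explicit_flag(num_attrs=4, num_dies=1000)  # 13 + 1000 = 1013
option_b = cost_implicit_flag(num_attrs=4)                 # 13 + 11 = 24
```

Under these assumptions the extra abbrev entry pays for itself once more than a dozen or so tags share it; a producer that emits only a handful of such tags gains little.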
Proposed entries in FORM table
- ... (existing forms)
- 0x16 : FORM_indirect (last in dwarf3)
- 0x17 : FORM_implicit_flag (zero length, implies a "true" value)
- 0x18 : FORM_exp_block (same as FORM_block, but explicitly an expression)
- 0x40 : FORM_sec_debug_info -- (replaces FORM_ref_addr)
- 0x41 : FORM_sec_debug_strp -- (replaces FORM_strp)
- 0x42 : FORM_sec_debug_line -- (replaces class lineptr)
- 0x43 : FORM_sec_debug_loc -- (replaces class loclistptr)
- 0x44 : FORM_sec_debug_macinfo -- (replaces class macptr)
- 0x45 : FORM_sec_debug_ranges -- (replaces class rangelistptr)
- 0x60 : begin vendor section pointer range
- 0x7f : end vendor section pointer range
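The table above lets a low-level reader size and label a section pointer from the form value alone. The following is a sketch under two assumptions: that "4/8 bytes" means 4 bytes in 32-bit dwarf and 8 bytes in 64-bit dwarf, and that unassigned values in 0x46-0x5f are reserved for future standards. The function names are hypothetical.

```python
# Form values copied from the proposed table; everything else is illustrative.
PROPOSED_SECTION_FORMS = {
    0x40: "FORM_sec_debug_info",
    0x41: "FORM_sec_debug_strp",
    0x42: "FORM_sec_debug_line",
    0x43: "FORM_sec_debug_loc",
    0x44: "FORM_sec_debug_macinfo",
    0x45: "FORM_sec_debug_ranges",
}
SECTION_FORM_FIRST = 0x40
SECTION_FORM_LAST = 0x7F
VENDOR_FORM_FIRST = 0x60


def section_offset_size(form: int, is_64bit_dwarf: bool):
    """Return the payload size in bytes for a section-pointer form, else None."""
    if SECTION_FORM_FIRST <= form <= SECTION_FORM_LAST:
        return 8 if is_64bit_dwarf else 4
    return None


def describe_form(form: int, is_64bit_dwarf: bool) -> str:
    """Label a form without consulting the attribute it is attached to."""
    size = section_offset_size(form, is_64bit_dwarf)
    if size is None:
        return "not a section pointer"
    name = PROPOSED_SECTION_FORMS.get(form)
    if name is None:
        if form >= VENDOR_FORM_FIRST:
            name = "vendor section pointer"
        else:
            name = "unassigned section pointer"
    return f"{name}: {size}-byte section offset"
```

For example, describe_form(0x65, False) yields "vendor section pointer: 4-byte section offset", so a dumper can skip over and print an unknown vendor form correctly even though it cannot name the target section.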
new expression block form
We have to decide if adding an explicit expression block form means we need to add FORM_exp_block, FORM_exp_block1, FORM_exp_block2, FORM_exp_block4 and FORM_exp_block8. It's hard for me to justify keeping the explicit block forms in dwarf 4, but not adding explicitly sized FORM_exp_block variants. I want to require the new expression block forms to be used with all expressions (which is all uses of blocks in dwarf 3, I think).
So this is an open issue.
Because of the subtle distinction between "expressions" and "locations", a name should be chosen that reflects all the proper usages of this form. The standard itself should establish a common term for these "operator strings" and the name of the FORM should use that common term.
Deprecate block1, block2, block4 in favor of block
deprecate use of data1/data2/data4/data8 for signed/unsigned integer types
Note: data1/2/4/8 will need to be kept for opaque, small, fixed-size data like floats, doubles, etc.
We should do one of:
- add udata1/udata2/udata4/udata8, sdata1/sdata2/sdata4/sdata8
- or just require the use of sdata and udata.
The reason for this is that consumers have to use an attribute specific lookup table to guess the signed/unsigned nature of the data. This violates the reason for having FORMs in the first place.
I have a weak preference for adding the (s,u)data(N) forms, but on alternate Tuesdays I think we should just require sdata/udata.
If a compiler wants to record a 16-bit constant value as the initial value for a C short type, it's a little suboptimal to ask the compiler to encode this as an LEB instead of just allowing it to put the exact constant inline. So that's an argument in favor of adding 8 new FORMs. Of course, if the actual number stored in the 4 bytes is '3', for example, then you get to save 3 bytes by using an LEB. So I'm not completely convinced either way.
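The size trade-off can be checked with a short sketch of the standard ULEB128/SLEB128 encodings. The function names are my own; the encoding itself is the usual 7-bits-per-byte scheme.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # Python's >> keeps the sign, as SLEB128 requires
        sign_bit_set = bool(byte & 0x40)
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)  # more bytes follow


small = len(encode_uleb128(3))           # 1 byte, versus 4 for a fixed data4
short_max = len(encode_sleb128(32767))   # 3 bytes, versus 2 for a fixed 16-bit form
short_min = len(encode_sleb128(-32768))  # 3 bytes as well
```

So the LEB wins for small values and loses by one byte at the extremes of a 16-bit range, which is the tension described above.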
We can't reuse the existing data1/2/4/8 forms for unsigned because they currently don't stand for unsigned numbers. They currently stand for numbers which could be signed or unsigned.
Deprecate ref1, ref2 in favor of ref_udata, ref4, ref8
These forms point from one die to another die. A producer (assembler or otherwise) cannot generate a .debug_info section block without support for ULEB and SLEB numbers (unless it is generated as raw hex, in which case it's not an issue here). There is no reason an implementation must use the fixed-size variations. Based on gcc 3.4.5, the assembly output for dwarf uses precooked hex numbers for all the LEB numbers, and for internal references. The precooked hex number is currently using ref4, but I don't see why it couldn't use ref_udata instead. Some of the other precooked numbers are LEBs already.
At first I wanted to get rid of ref1/ref2/ref4/ref8, but Matt pointed out something I missed. When generating forward references, you have to be able to allocate bytes for a reference without knowing how far forward it points. There are techniques for coping with that, but they are not something we'd want to force on users. I still think that means we should get rid of ref1 and ref2, since I don't see how to use them for forward references. It might also be nice to give guidance to implementors that we expect ref4/8 to be used for forward references, and ref_udata to be used for backward references.
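The forward-reference problem can be sketched with a toy writer that reserves a fixed 4-byte slot and patches it once the target offset is known. The DieWriter class and its method names are hypothetical, byte order is assumed little-endian, and offsets are counted from the start of the buffer, standing in for the start of the compilation unit.

```python
import struct


class DieWriter:
    """Toy .debug_info writer that backpatches forward ref4 references."""

    def __init__(self):
        self.buf = bytearray()
        self.labels = {}   # label -> offset of the DIE it names
        self.pending = []  # (patch_position, label) pairs still unresolved

    def define_label(self, label: str) -> None:
        """Record that the DIE named by `label` starts at the current offset."""
        self.labels[label] = len(self.buf)

    def emit_bytes(self, data: bytes) -> None:
        self.buf += data

    def emit_ref4(self, label: str) -> None:
        """Emit a 4-byte reference, reserving space if the target is unknown."""
        if label in self.labels:
            self.buf += struct.pack("<I", self.labels[label])
        else:
            self.pending.append((len(self.buf), label))
            self.buf += b"\x00\x00\x00\x00"  # fixed-size placeholder

    def finish(self) -> bytes:
        """Fill in every reserved slot now that all labels are defined."""
        for position, label in self.pending:
            struct.pack_into("<I", self.buf, position, self.labels[label])
        return bytes(self.buf)


writer = DieWriter()
writer.emit_bytes(b"\x01")      # stand-in for an abbrev code
writer.emit_ref4("later")       # forward reference, target not yet known
writer.emit_bytes(b"\x02\x02")  # unrelated payload
writer.define_label("later")    # target DIE starts at offset 7
writer.emit_bytes(b"\x03")
image = writer.finish()
```

The placeholder works only because its width is fixed up front. With ref_udata the encoded width depends on the target offset, which in turn depends on the widths of every earlier forward reference, so a single patch pass is not enough.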
new list of classes
The purpose of having classes is to streamline the table that describes which attributes can use which forms. Every attribute that can have a DW_FORM_string might also want to use the form DW_FORM_strp. In some recent email from Gary it seemed that goal might not have been clear. It results in forms with very different implementations being lumped together into the same class because of how they are used.
- address - DW_FORM_addr
- block - DW_FORM_block
- constant - DW_FORM_sdata, DW_FORM_udata
- possibly also: sdata1/sdata2/sdata4/sdata8/udata1/udata2/udata4/udata8
- flag - DW_FORM_flag, DW_FORM_implicit_flag
- string - DW_FORM_string
- lineptr - DW_FORM_sec_debug_line
- loclistptr - DW_FORM_sec_debug_loc
- macptr - DW_FORM_sec_debug_macinfo
- rangelistptr - DW_FORM_sec_debug_ranges
- reference (either internal or external)
- internal: DW_FORM_ref_udata (possibly also ref1/ref2/ref4/ref8)
- external: DW_FORM_sec_debug_info
- string (either internal or external)
- internal: DW_FORM_string
- external: DW_FORM_sec_debug_strp