Skip to content

Term naming is non-injective #808

@ivg

Description

@ivg

That means, that we can have two functions (sub terms) having the same name. In practice, this hits with ELF shared libraries, that have both a PLT entry stub and an implementation for a particular stub. Our existing symbolizers, have a tradition to treat the PLT entry stub as a real function, and it works fine with regular binaries, that do no expose and use functions at the same time. But with the shared libraries our symbolizers (objdump and IDA) tells us the same name for two different addresses. Careful thinking tells us, that we can't actually fix this on the symbolizer level. Neither we actually do want to fix symbolizers. We should enforce, however, an invariant that for each name we have at most one address/term. Or speaking mathematically, that our naming function is injective.

The desired properties

We should enforce an invariant, that for each name we have at most one address, while several names can share the same address. That means that the address : N -> A function is surjective, where N is the set of names and A is the set of addresses. And name : A -> N is injective. This pair of functions also induces a partitioning of names, principal : N -> N = name . address, i.e., it establishes a principal name for each name, that is the representative for the class of names of the given name. We use the principal name as the name of the symbol and treat the rest of the names in the class as aliases.

Our current implementation

  1. We use symbolizers, that provide the name function. We do not check that this function is injective, i.e., that for each name we have at most one address.

  2. On the IR level, we allow arbitrary renaming. We do not control how terms are associated with addresses and vice verse. We apply renaming during different phases of reconstruction and disassembling, and in general every pass can rename a term.

Implementation Limitations

  1. We cannot enforce injectivity on the name function, so we must assume that this function is neither injective nor surjective.

  2. We use ugly Term.set_name function that uses internal hashtable to establish a mapping between terms and names.

Proposed Solution

  1. We will enforce the naming invariant only on the IR level. Lower levels, such as symtab, disassembly, etc represent what we had as input, so it is better to leave them as they are. Thus we will enforce the invariant on the domain of terms, not the set of addresses. But given that term : A -> T is injective, our invariant is even more strict, than the desired one.

  2. We will not try very hard to enforce the invariant on the Term.set_name function. However, it will work correctly, unless it was abused. We shall consider removing this function in BAP 2.0, and we should mark it as obsolete asap, to refrain users from touching it. We shall also dedocument it.

  3. To choose the principal name, we will treat the last proposed name as the more important one, assuming that every consequent analysis pass has more information than previous. In other words, the last set name, is the principal name.

Suggested Implementation

We need to keep track for the following three sources of naming in the IR:

  1. Sub.name
  2. Sub.aliases
  3. Term.name

We will not enforce the invariant on the Term.name assuming that an incorrect usage of the Term.set_name can break it. Given that the set_name function is technically internal (and we will warn about this) that's OK. Basically, we put the responsibility of preserving the invariant on the set_name caller.

Next, we need a function fix_names: ?old_names:string Tid.Map.t -> program -> program that will enforce the desired predicates. We shall look carefully in the aliases, and check the invariant on each of them. When a conflict is detected, i.e., we have a name, that is already assigned an address, then we need to rename one of the two names. We need to give a preference to one or another, trying to preserve the new name and rename the old one. The name is old if it is in the old_names and has the same name. Otherwise it is new one. If both names are new, then pick the rename a name that corresponds to a tid that is smaller.

The renaming procedure, should mangle the name to make it unique. I think the easiest approach would be just to mangle it with the address, e.g., malloc@4000efc, though nicer approaches, such as mangling it with the name of the section to which it belongs are appreciated, e.g., malloc@plt, though they are harder to implement, since it is not guaranteed that there are no duplicate symbol names in one section, and the section name is not easy to obtain.

The fix_names function should also ensure that the Term.name function points to the correct name, by calling the Term.set_name function for all terms. Finally, the fix_names function is called every time the new sub term is inserted or updated. Note: the name and the type of the function is just a suggestion. This is an internal function, so it doesn't really matter.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions