Term naming is non-injective

That means, that we can have two functions (sub terms) having the same name. In practice, this hits with ELF shared libraries, that have both a PLT entry stub and an implementation for a particular stub. Our existing symbolizers, have a tradition to treat the PLT entry stub as a real function, and it works fine with regular binaries, that do no expose and use functions at the same time. But with the shared libraries our symbolizers (objdump and IDA) tells us the same name for two different addresses. Careful thinking tells us, that we can't actually fix this on the symbolizer level. Neither we actually do want to fix symbolizers. We should enforce, however, an invariant that for each name we have at most one address/term. Or speaking mathematically, that our naming function is injective. 

## The desired properties

We should enforce an invariant, that for each name we have at most one address, while several names can share the same address. That means that the `address : N -> A` function is surjective, where `N` is the set of names and `A` is the set of addresses. And `name : A -> N` is injective. This pair of functions also induces a partitioning of names, `principal : N -> N = name . address`, i.e., it establishes a principal name for each name, that is the representative for the class of names of the given name. We use the principal name as the name of the symbol and treat the rest of the names in the class as aliases. 

## Our current implementation

1. We use symbolizers, that provide the `name` function. We do not check that this function is injective, i.e., that for each name we have at most one address.

2. On the IR level, we allow arbitrary renaming. We do not control how terms are associated with addresses and vice verse. We apply renaming during different phases of reconstruction and disassembling, and in general every pass can rename a term. 

## Implementation Limitations

1. We cannot enforce injectivity on the `name` function, so we must assume that this function is neither injective nor surjective. 

2. We use ugly `Term.set_name` function that uses internal hashtable to establish a mapping between terms and names. 


## Proposed Solution

1. We will enforce the naming invariant only on the IR level. Lower levels, such as symtab, disassembly, etc represent what we had as input, so it is better to leave them as they are. Thus we will enforce the invariant on the domain of terms, not the set of addresses. But given that `term : A -> T` is injective, our invariant is even more strict, than the desired one. 

2. We will not try very hard to enforce the invariant on the `Term.set_name` function. However, it will work correctly, unless it was abused. We shall consider removing this function in BAP 2.0, and we should mark it as obsolete asap, to refrain users from touching it. We shall also dedocument it. 

3. To choose the principal name, we will treat the last proposed name as the more important one, assuming that every consequent analysis pass has more information than previous. In other words, the last set name, is the principal name. 



## Suggested Implementation 

We need to keep track for the following three sources of naming in the IR:
1. Sub.name
2. Sub.aliases
3. Term.name

We will not enforce the invariant on the `Term.name` assuming that an incorrect usage of the `Term.set_name` can break it. Given that the `set_name` function is technically internal (and we will warn about this) that's OK. Basically, we put the responsibility of preserving the invariant on the `set_name` caller. 

Next, we need a function `fix_names: ?old_names:string Tid.Map.t ->  program -> program` that will enforce the desired predicates. We shall look carefully in the aliases, and check the invariant on each of them. When a conflict is detected, i.e., we have a name, that is already assigned an address, then we need to rename one of the two names. We need to give a preference to one or another, trying to preserve the new name and rename the old one. The name is old if it is in the `old_names` and has the same name. Otherwise it is new one. If both names are new, then pick the rename a name that corresponds to a tid that is smaller. 

The renaming procedure, should mangle the name to make it unique. I think the easiest approach would be just to mangle it with the address, e.g., `malloc@4000efc`, though nicer approaches, such as mangling it with the name of the section to which it belongs are appreciated, e.g., `malloc@plt`, though they are harder to implement, since it is not guaranteed that there are no duplicate symbol names in one section, and the section name is not easy to obtain. 

The `fix_names` function should also ensure that the `Term.name` function points to the correct name, by calling the `Term.set_name` function for all terms. Finally, the `fix_names` function is called every time the new `sub` term is inserted or updated. Note: the name and the type of the function is just a suggestion. This is an internal function, so it doesn't really matter. 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Term naming is non-injective #808

The desired properties

Our current implementation

Implementation Limitations

Proposed Solution

Suggested Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Term naming is non-injective #808

Description

The desired properties

Our current implementation

Implementation Limitations

Proposed Solution

Suggested Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions