Skip to content

Conversation

gitoleg
Copy link
Contributor

@gitoleg gitoleg commented Feb 21, 2019

This PR enhances the CFG reconstruction algorithm and weakens the notion of call, by allowing calls to be fall-though edges, not only of real branches. This enables support for functions with multiple entries and lessens the impact of Byteweight (or any other rooter) false positives. In particular, the new algorithm will correctly reconstruct the following three subroutines:

entry_debug:
   enable_debugging
entry_checked:
   enable_checking
entry:
    do_work

fixes #631

@ivg ivg self-requested a review February 21, 2019 14:52
@@ -0,0 +1,87 @@
open Core_kernel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate a little bit on the purposes and tasks of this module, and the overall algorithm description? It would be even nice to have this as a comment in the module.

Copy link
Contributor Author

@gitoleg gitoleg Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's describe, what we did here. First of all, we gently touched the disassembler to allow fall-through edges leading to roots. And this affects the cfg only and doesn't affect subroutines on the IR level since our sema lifter doen't take into account whole program cfg at all, it deals only with subroutines cfgs, that's not sufficient for this task.

So we need to finish lifting procedure by taking into account every fall-through branch coming into subroutine entry. And by finish, I mean to edit a call site.

So algorithm, in general:

  1. find the fall-through branch to a subroutine, i.e. to it's entry block
  2. find a source IR block where it came from
  3. create and insert a call to the target subroutine after such block
  4. if the source block contains no-returns then redirect them to
    a brand new call, since the only reason it was created as noreturn
    is a lack of information about fall-through branches in sema lifter.

Will provide an example, like what is was before and how it looks like now.
E.g. main subroutine in /bin/ls. As far as I can see the only purpose is to push R15, that makes the name of this function a little bit too serious:

000003c6: sub main(main_argc, main_argv, main_result)
00015327: main_argc :: in u32 = RDI
00015328: main_argv :: in out u64 = RSI
00015329: main_result :: out u32 = RAX
000003c2: 
000003c3: v17136 := R15
000003c4: RSP := RSP - 8
000003c5: mem := mem with [RSP, el]:u64 <- v17136

after applying the algorithm in question we will see a continuation:

00015255: 
00015256: call @sub_402a02 with noreturn

Considering the place where to apply this transformation, we can still
to find a better place:

  • a separate pass
  • module in bap_sema library

Does it make any sence for you?

@ivg
Copy link
Member

ivg commented Feb 21, 2019

So as far as I understand, you totally forgot about the reconstructor, and the Symtab data structure will end up in an incorrect state, am I right?

@gitoleg
Copy link
Contributor Author

gitoleg commented Feb 21, 2019

well, not really. But now it works with the help of Symtab data

@gitoleg
Copy link
Contributor Author

gitoleg commented Feb 26, 2019

There are lot's of changes and better to describe them explicitly.

Reconstructor

It turned out that current reconstructor violates an invariant that one block belongs only to one subroutine.
On IR level it means that mapping from address to tid is ambiguous. The reason is in our approach: we traverse a whole program cfg starting from every known root, and if some node is reachable from two roots then it appears in both subroutines in a symbol table. The better solution is to update a set of roots simultaneously with traversing of cfg. So, some node considered as root if it is in the set of known
roots or it has an input edge, the source node of which doesn't belong to the subroutine's cfg.

Also, we update symtab with information about fall edges that leads to roots.

Symtab

We need information about fall-through edges that end up with roots. So far, we were storing only callees names in symtab, that is useful for resolving calls when we're dealing with relocations and external calls.
But I think we need something new for the problem in question. That's why there is a couple of straightforward functions in the symtab for storing/loading address of fall destination.

Sema lifter

For every block, sema lifter tries to find a fall destination in a subroutine's cfg. We slightly extend this and also will try to find a fall destination in symtab and create a synthetic jump if such fall is found.

Copy link
Member

@ivg ivg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. you're definitely abusing symtab here, we need to discuss, what you really want to use it for
  2. the question from the previous review is still there, I don't see any code that will fixes the callgraph

val add_call_name : t -> block -> string -> t

(* remembers a fall from [addr] to the given block *)
val add_fall_addr : t -> block -> addr -> t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is totally non-obvious why a symtab shall remember fall addresses. The name of this function implies that we store all falls in the callgraph, which doesn't make any sense. So this should be refined.

This interface should be definitely refined.

val find_call_name : t -> addr -> string option

(* finds if there are any falls to the given [addr] *)
val find_fall_addr : t -> addr -> addr option
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any falls or falls that calls?

@gitoleg gitoleg force-pushed the cfg-reconstructor branch from 1f783c8 to cf75d9f Compare March 18, 2019 20:52
@gitoleg gitoleg merged commit e598acc into BinaryAnalysisPlatform:master Apr 2, 2019
@gitoleg gitoleg deleted the cfg-reconstructor branch May 13, 2020 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deal with byteweight false-positives
2 participants