From c29e7c39b3470326a0295d2480d77f76ed72d294 Mon Sep 17 00:00:00 2001
From: Leah Hanson
Date: Mon, 30 Jun 2014 21:52:04 -0500
Subject: [PATCH 1/6] Added draft of intro sections of chapter
---
 static-analysis/StaticAnalysisChapter.md | 156 +++++++++++++++++++++++
 1 file changed, 156 insertions(+)
 create mode 100644 static-analysis/StaticAnalysisChapter.md

diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md
new file mode 100644
index 000000000..25bdecdda
--- /dev/null
+++ b/static-analysis/StaticAnalysisChapter.md
@@ -0,0 +1,156 @@

# Static Analysis
by Leah Hanson for *500 Lines or Less*

Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code.

There are three phases to implementing static analysis:

1. Deciding what you want to check for

   This refers to the general problem you'd like to solve, in terms that a user of the programming language would recognize. Examples include:

   * Finding misspelled variable names
   * Finding race conditions in parallel code
   * Finding calls to unimplemented functions

2. Deciding how exactly to check for it

   While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what "misspelled" means here. One option would be to claim that variable names should be composed of English words from the dictionary; another, more useful, option is to look for variables that are only used once (the one time you mis-typed it).

   Now that we know we're looking for variables only used once, we can talk about kinds of variable usages (having their value assigned vs. read) and what code would or would not trigger a warning.

3. Implementation details

   This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis.

We're going to work through these steps for each of the individual checks implemented in this chapter. Step 1 requires enough understanding of the language we're analyzing to empathize with the kinds of problems its users face. All the code in this chapter is Julia code, written to analyze Julia code.

## A Brief Introduction to Julia

Julia is a young language aimed at technical computing. It was released at version 0.1 in the spring of 2012; as of the summer of 2014, it is reaching version 0.3. Julia is a procedural language; it is not object-oriented, and while it has functional features (anonymous functions, higher-order functions, immutable data), it does not enforce a functional style of coding. The feature that most programmers will find novel in Julia is multiple dispatch, which is also central to the design of most of its APIs.

Here is a snippet of Julia code:

~~~jl
# A comment about increment
function increment(x::Int)
    return x + 1
end

increment(5)
~~~

This code defines a method of the function `increment` that takes one argument, named `x`, of type `Int`. The method returns the value of `x + 1`. Then, this freshly defined method is called with the value `5`; the function call, as you may have guessed, will evaluate to `6`.

The name `increment` refers to a generic function, which may have many methods. We have just defined one method of it.
Let's define another:

~~~jl
# Increment x by y
function increment(x::Int, y::Number)
    return x + y
end

increment(5)   # => 6
increment(5,4) # => 9
~~~

Now `increment` has two methods. Julia decides which method to run for a given call based on the number and types of the arguments; this is called dynamic multiple dispatch.

* *dynamic* because it's based on the types of the values used at run-time
* *multiple* because it looks at the types and order of all the arguments. Object-oriented languages use single dispatch because they only consider the first argument. (In `x.foo(y)`, the first argument is `x`.)
* *dispatch* because this is a way of matching function calls to method definitions.

We haven't really seen the "multiple" part yet, but if you're curious about Julia you'll have to look that up on your own. We need to move on to a few implementation details.

## Introspection in Julia

When you or I introspect, we're thinking about how and why we think and feel. When code introspects, it examines the representation or execution properties of code in the same language (possibly its own code). When code's introspection extends to modifying the examined code, it's called metaprogramming (programs that write or modify programs).

Julia makes it easy to introspect. There are four functions built in to let us see what the compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST).

Let's look at the output of `code_typed` for that first method of `increment` that we defined.

~~~jl
# code_typed takes the function and argument types of the method you want
code_typed(increment,(Int,))
~~~

The output looks like this:

~~~
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{},{{:x,Int64,0}},{}}, :(begin  # REPL, line 12:
        return top(box)(Int64,top(add_int)(x::Int64,1))::Int64
    end::Int64))))
~~~

The output is an `Array` because you can call `code_typed` with an ambiguous tuple of types -- which would result in multiple methods matching and being returned as results. The `Array` contains `Expr`s, which represent expressions.

An `Expr` has three fields:

* `head`, which is a `Symbol`. In our example, it is `:lambda`.
* `typ`, which is a `Type`. The outer `Expr`s from `code_typed` always set this to `Any`.
* `args`, which is a structure of nested `Array{Any,1}`. It is made of untyped, nested lists.

For `Expr`s from `code_typed`, there are always three elements in `args`:

1. A list of argument names
2. A list of metadata about variables used in the method
3. The body of the function, as another `Expr`

This body `Expr` stores the list of expressions in the body of the method in its `args` field; the inferred return type of the method is stored in its `typ` field. We can define some helper functions to make these simple operations more readable; both of these functions will expect an `Expr` of the form returned by `code_typed`.
~~~jl
# given an Expr representing a method, return its inferred return type
returntype(e::Expr) = e.args[3].typ # arrays index from 1

# given an Expr representing a method, return an Array of Exprs representing its body
body(e::Expr) = e.args[3].args
~~~

We can run these on our first `increment` method:

~~~jl
returntype(code_typed(increment,(Int,))[1]) # => Int64
body(code_typed(increment,(Int,))[1]) # => 2-element Array{Any,1}:
                                      #     :( # REPL, line 12:)
                                      #     :(return top(box)(Int64,top(add_int)(x::Int64,1))::Int64)
~~~

The `head` of an `Expr` indicates what type of `Expr` it is. For example, `:=` indicates an assignment, like `x = 5`. If we wanted to find all the places a method might return, we'd look for head values of `:return`. We can use the `body` helper function that we just wrote to write a function that takes an `Expr` from `code_typed` and returns all the return statements in its body.

~~~jl
# given an Expr representing a method, return all of the return statements in its body
returns(e::Expr) = filter(x-> typeof(x) == Expr && x.head==:return,body(e))

returns(code_typed(increment,(Int,))[1]) # => 1-element Array{Any,1}:
                                         #     :(return top(box)(Int64,top(add_int)(x::Int64,1))
~~~

This `code_typed(increment,(Int,))[1]` stuff is getting rather tedious. Let's write a couple of helper methods so that we can run `code_typed` on a whole function at once.

~~~jl
# return the type-inferred AST for one method of a generic function
function Base.code_typed(m::Method)
    linfo = m.func.code
    (tree,ty) = Base.typeinf(linfo,m.sig,())
    if !isa(tree,Expr)
        ccall(:jl_uncompress_ast, Any, (Any,Any), linfo, tree)
    else
        tree
    end
end

# return the type-inferred AST for each method of a generic function
function Base.code_typed(f::Function)
    Expr[code_typed(m) for m in f.env]
end
~~~

Once we have a `code_typed` that handles `Method`s, handling whole `Function`s just requires an array comprehension over the methods of the given function. For a given `Function` `f`, we can get the methods using `f.env`. Handling a `Method` involves more details; the implementation is modeled closely on the existing built-in implementation.

`m.func.code` gives us the implementation of the method; `m.sig` gives us the types of its arguments. Given these, `Base.typeinf` should return the type-inferred AST. However, if it was saved in a compressed state, we'll need to call one of the C functions used to implement parts of Julia, specifically `jl_uncompress_ast`, to get the `Expr` value we want to return.

~~~jl
[returntype(e) for e in code_typed(increment)] # => 2-element Array{Any,1}:
                                               #     Int64
                                               #     Any
~~~


From b68be679471111d89a1f68f85544de2b27b8f9d4 Mon Sep 17 00:00:00 2001
From: Leah Hanson
Date: Fri, 4 Jul 2014 14:53:20 -0500
Subject: [PATCH 2/6] Added first part of Check Loop Variable Types section.
 Moved Introspection in Julia section down to serve as reference when I write
 the first implementation section.
---
 static-analysis/StaticAnalysisChapter.md | 121 ++++++++++++++++++++-
 1 file changed, 117 insertions(+), 4 deletions(-)

diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md
index 25bdecdda..7036b742e 100644
--- a/static-analysis/StaticAnalysisChapter.md
+++ b/static-analysis/StaticAnalysisChapter.md
@@ -20,11 +20,12 @@ There are three phases to implementing static analysis:

 3.
Implementation details

-   This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis.
+   This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis. This could involve reading in a file of code, parsing it to understand the structure, and then making your specific check on the structure. Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows.
+   In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly. Besides all the work we'll save by not having to understand the code by ourselves, working with the same data structures the compiler uses means that our checks will be based on an accurate assessment of the compiler's understanding -- which means they'll be accurate to how the code will actually run.

 We're going to work through these steps for each of the individual checks implemented in this chapter. Step 1 requires enough understanding of the language we're analyzing to empathize with the kinds of problems its users face. All the code in this chapter is Julia code, written to analyze Julia code.

-## A Brief Introduction to Julia
+# A Very Brief Introduction to Julia

 Julia is a young language aimed at technical computing. It was released at version 0.1 in the spring of 2012; as of the summer of 2014, it is reaching version 0.3. Julia is a procedural language; it is not object-oriented, and while it has functional features (anonymous functions, higher-order functions, immutable data), it does not enforce a functional style of coding. The feature that most programmers will find novel in Julia is multiple dispatch, which is also central to the design of most of its APIs.

@@ -59,8 +60,115 @@ Now increment has two methods. Julia decides which method to run for a given cal

 * *multiple* because it looks at the types and order of all the arguments. Object-oriented languages use single dispatch because they only consider the first argument. (In `x.foo(y)`, the first argument is `x`.)
 * *dispatch* because this is a way of matching function calls to method definitions.

-We haven't really seen the "multiple" part yet, but if you're curious about Julia you'll have to look that up on your own. We need to move on to a few implementation details.
+We haven't really seen the "multiple" part yet, but if you're curious about Julia you'll have to look that up on your own. We need to move on to our first check.

+# Checking the Types of Variables in Loops

A feature of Julia that sets it apart from other high-level languages is its speed. As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. In Julia, an important part of helping the compiler create fast code for you is writing type-stable code. When the compiler can see that a variable in a section of code will always be the same specific type, it can do more optimizations than if it believes (correctly or not) that there are many possible types for that variable.
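To make "type-stable" concrete before we dig in, here is a minimal sketch. (The function names `stable_f` and `unstable_f` are made up for this illustration; they are not part of the chapter's running example.)

~~~jl
# Both branches produce an Int, so the compiler knows the result type: type-stable.
stable_f(flag::Bool) = flag ? 1 : 2

# One branch produces an Int and the other a Float64: type-unstable.
unstable_f(flag::Bool) = flag ? 1 : 2.0

typeof(stable_f(false))   # => Int64
typeof(unstable_f(false)) # => Float64
~~~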
For example, let's write a function that takes an `Int` and then increases it by some amount. If the number is small (less than 10), let's increase it by a big number (50), but if it's big, let's only increase it by a little (0.5).

~~~jl
function increment(x::Int)
    if x < 10
        x = x + 50
    else
        x = x + 0.5
    end
    return x
end
~~~

This function looks pretty straightforward, but the type of `x` is unstable. At the end of this function, `return x` might return an `Int` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64`, like `22.5`. This means that `5` will become `55` (an `Int`), but `22` will become `22.5` (a `Float64`). If there were more involved code after this function, then it would have to handle both types for `x`, since the compiler expects to need to handle both.

As with most efficiency problems, this issue is more pronounced when it happens during loops. Code inside for-loops and while-loops is run many, many times, so making it fast is more important than speeding up code that is only run a couple of times. Our first check is going to look for variables inside loops that have unstable types. First, let's look at an example of what we want to catch.

We'll be looking at two functions. Each of them sums the numbers 1 to 100, but instead of summing the whole numbers, it divides each one by 2 before summing it. Both functions will get the same answer (`2525.0`); both will return the same type (`Float64`). However, the first function, `unstable`, suffers from type-instability, while the second one, `stable`, does not.

~~~jl
function unstable()
    sum = 0
    for i=1:100
        sum += i/2
    end
    return sum
end
~~~

~~~.jl
function stable()
    sum = 0.0
    for i=1:100
        sum += i/2
    end
    return sum
end
~~~

The only textual difference between the two functions is in the initialization of `sum`: `sum = 0` vs `sum = 0.0`. In Julia, `0` is an `Int` literal and `0.0` is a `Float64` literal. How big of a difference could this tiny change even make?

Because Julia is Just-In-Time (JIT) compiled, the first run of a function will take longer than subsequent runs (because the first run includes the time it takes to compile it). When we benchmark functions, we have to be sure to run them once (or precompile them) before timing them.

~~~jl
julia> unstable()
2525.0

julia> stable()
2525.0

julia> @time unstable()
elapsed time: 9.517e-6 seconds (3248 bytes allocated)
2525.0

julia> @time stable()
elapsed time: 2.285e-6 seconds (64 bytes allocated)
2525.0
~~~

The `@time` macro prints out how long the function took to run and how many bytes were allocated while it was running. The number of bytes allocated increases every time new memory is needed; it does not decrease when the garbage collector vacuums up memory that's no longer being used. This means that the bytes allocated is related to the amount of time we spend allocating memory, but does not imply that we had all of that memory in use at the same time.

If we wanted to get solid numbers for `stable` vs `unstable` we would need to make the loop much longer or run the functions many times. However, we can already see that `unstable` seems to be slower. More interestingly, we can see a large gap in the number of bytes allocated; `unstable` has allocated around 3kb of memory, where `stable` is using 64 bytes.
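One rough way to "run the functions many times", as suggested above, is to repeat the call in a loop after a warm-up run. This is only a sketch -- the `benchmark` helper below is not part of the chapter's code, and the exact timings will vary by machine:

~~~jl
# Time n repeated calls to f, excluding f's compilation from the measurement.
function benchmark(f::Function, n::Int)
    f()          # warm-up call, so compilation isn't timed
    @time for i=1:n
        f()
    end
end

benchmark(unstable, 1000)
benchmark(stable, 1000)
~~~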
Since we can see how simple `unstable` is, we might guess that this allocation is happening in the loop. To test this, we can make the loop longer and see if the allocations increase accordingly. Let's make the loop go from 1 to 10000, which is 100 times more iterations; we'll look for the number of bytes allocated to also increase about 100 times, to around 300kb.

~~~jl
function unstable()
    sum = 0
    for i=1:10000
        sum += i/2
    end
    return sum
end
~~~

Since we redefined the function, we'll need to run it to have it compiled before we measure it. We expect to get a different, larger answer from the new function definition, since it's summing more numbers now.

~~~jl
julia> unstable()
2.50025e7

julia> @time unstable()
elapsed time: 0.000667613 seconds (320048 bytes allocated)
2.50025e7
~~~

The new `unstable` allocated about 320kb, which is what we would expect if the allocations are happening in the loop. This difference between `unstable` and `stable` is because `unstable`'s `sum` must be boxed while `stable`'s `sum` can be unboxed. Boxed values consist of a type tag and the actual bits that represent the value; unboxed values only have their actual bits. The type tag is small, so that's not why boxing values allocates a lot more memory. The difference comes from what optimizations the compiler can make. When a variable has a concrete, immutable type, the compiler can unbox it inside the function. If that's not the case, then the variable must be allocated on the heap, and participate in the garbage collector. Immutable types are usually types that represent values, rather than collections of values; most numeric types, including `Int` and `Float64`, are immutable, while `Array`s and `Dict`s are mutable. Because immutable types cannot be modified, you must make a new copy every time you change one. For example `4 + 6` must make a new `Int` to hold the result. In contrast, the members of a mutable type can be updated in-place.

Because `sum` in `stable` has a concrete type (`Float64`), the compiler knows that it can store it unboxed locally in the function and mutate its value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`. Because `sum` in `unstable` does not have a concrete type, the compiler allocates it on the heap. Every time we modify `sum`, we allocate a new value on the heap. All this time spent allocating values on the heap (and retrieving them every time we want to read the value of `sum`) is expensive.

Using `0` vs `0.0` is an easy mistake to make, especially when you're new to Julia. Automatically checking that variables used in loops are type-stable helps programmers get more insight into what the types of their variables are, in performance-critical sections of their code.

The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one type. A `UnionType` can have more than two possible types. The specific thing that we're going to look for is `UnionType`d variables inside loops.

## Implementation

In order to find those variables, we'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable way.
* How do we find loops in Exprs
* How do we find the types of variables
* How do we print the results

// Saving for first implementation section

## Introspection in Julia

When you or I introspect, we're thinking about how and why we think and feel. When code introspects, it examines the representation or execution properties of code in the same language (possibly its own code). When code's introspection extends to modifying the examined code, it's called metaprogramming (programs that write or modify programs).

Julia makes it easy to introspect. There are four functions built in to let us see what the compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST).

Let's look at the output of `code_typed` for that first method of `increment` that we defined.

~~~jl
# code_typed takes the function and argument types of the method you want
code_typed(increment,(Int,))
~~~

The output looks like this:

~~~
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{},{{:x,Int64,0}},{}}, :(begin  # REPL, line 12:
        return top(box)(Int64,top(add_int)(x::Int64,1))::Int64
    end::Int64))))
~~~

The output is an `Array` because you can call `code_typed` with an ambiguous tuple of types -- which would result in multiple methods matching and being returned as results. The `Array` contains `Expr`s, which represent expressions.

An `Expr` has three fields:

* `head`, which is a `Symbol`. In our example, it is `:lambda`.
* `typ`, which is a `Type`. The outer `Expr`s from `code_typed` always set this to `Any`.
* `args`, which is a structure of nested `Array{Any,1}`. It is made of untyped, nested lists.

For `Expr`s from `code_typed`, there are always three elements in `args`:

1. A list of argument names
2. A list of metadata about variables used in the method
3. The body of the function, as another `Expr`

This body `Expr` stores the list of expressions in the body of the method in its `args` field; the inferred return type of the method is stored in its `typ` field. We can define some helper functions to make these simple operations more readable; both of these functions will expect an `Expr` of the form returned by `code_typed`.

~~~jl
# given an Expr representing a method, return its inferred return type
returntype(e::Expr) = e.args[3].typ # arrays index from 1

# given an Expr representing a method, return an Array of Exprs representing its body
body(e::Expr) = e.args[3].args
~~~

We can run these on our first `increment` method:

~~~jl
returntype(code_typed(increment,(Int,))[1]) # => Int64
body(code_typed(increment,(Int,))[1]) # => 2-element Array{Any,1}:
                                      #     :( # REPL, line 12:)
                                      #     :(return top(box)(Int64,top(add_int)(x::Int64,1))::Int64)
~~~

The `head` of an `Expr` indicates what type of `Expr` it is. For example, `:=` indicates an assignment, like `x = 5`. If we wanted to find all the places a method might return, we'd look for head values of `:return`. We can use the `body` helper function that we just wrote to write a function that takes an `Expr` from `code_typed` and returns all the return statements in its body.

~~~jl
# given an Expr representing a method, return all of the return statements in its body
returns(e::Expr) = filter(x-> typeof(x) == Expr && x.head==:return,body(e))

returns(code_typed(increment,(Int,))[1]) # => 1-element Array{Any,1}:
                                         #     :(return top(box)(Int64,top(add_int)(x::Int64,1))
~~~

This `code_typed(increment,(Int,))[1]` stuff is getting rather tedious. Let's write a couple of helper methods so that we can run `code_typed` on a whole function at once.

~~~jl
# return the type-inferred AST for one method of a generic function
function Base.code_typed(m::Method)
    linfo = m.func.code
    (tree,ty) = Base.typeinf(linfo,m.sig,())
    if !isa(tree,Expr)
        ccall(:jl_uncompress_ast, Any, (Any,Any), linfo, tree)
    else
        tree
    end
end

# return the type-inferred AST for each method of a generic function
function Base.code_typed(f::Function)
    Expr[code_typed(m) for m in f.env]
end
~~~

Once we have a `code_typed` that handles `Method`s, handling whole `Function`s just requires an array comprehension over the methods of the given function. For a given `Function` `f`, we can get the methods using `f.env`. Handling a `Method` involves more details; the implementation is modeled closely on the existing built-in implementation.

`m.func.code` gives us the implementation of the method; `m.sig` gives us the types of its arguments. Given these, `Base.typeinf` should return the type-inferred AST. However, if it was saved in a compressed state, we'll need to call one of the C functions used to implement parts of Julia, specifically `jl_uncompress_ast`, to get the `Expr` value we want to return.

~~~jl
[returntype(e) for e in code_typed(increment)] # => 2-element Array{Any,1}:
                                               #     Int64
                                               #     Any
~~~

# Looking for Unused Variables

# Checking Functions for Type Stability

# Tools for Insight into Variable Types


From c40fe696ac5ae083da4d9246adb61ba1cb0a2606 Mon Sep 17 00:00:00 2001
From: Leah Hanson
Date: Mon, 22 Sep 2014 09:34:47 -0500
Subject: [PATCH 3/6] Updated first section, added outline.
---
 static-analysis/StaticAnalysisChapter.md | 445 ++++++++++++++++++-----
 1 file changed, 356 insertions(+), 89 deletions(-)

diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md
index 7036b742e..19dc5b75d 100644
--- a/static-analysis/StaticAnalysisChapter.md
+++ b/static-analysis/StaticAnalysisChapter.md
@@ -9,44 +9,50 @@ There are three phases to implementing static analysis:

   This refers to the general problem you'd like to solve, in terms that a user of the programming language would recognize. Examples include:

-   * Finding misspelled variable names
-   * Finding race conditions in parallel code
-   * Finding calls to unimplemented functions
+   * Finding misspelled variable names
+   * Finding race conditions in parallel code
+   * Finding calls to unimplemented functions
+
 2. Deciding how exactly to check for it

   While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what "misspelled" means here. One option would be to claim that variable names should be composed of English words from the dictionary; another, more useful, option is to look for variables that are only used once (the one time you mis-typed it).

-   Now that we know we're looking for variables only used once, we can talk about kinds of variable usages (having their value assigned vs. read) and what code would or would not trigger a warning.
+   Now that we know we're looking for variables that are only used once, we can talk about kinds of variable usages (having their value assigned vs. read) and what code would or would not trigger a warning.

 3. Implementation details

-   This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis. This could involve reading in a file of code, parsing it to understand the structure, and then making your specific check on the structure. Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows.
-   In this chapter, we'll be depending on internal datastructures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly. Besdies all the work we'll save by not having to understand the code by ourselves, working with the same datastructures the compiler uses means that our checks will be based on an accurate assesment of the compilers understanding -- which means they'll be accurate to how the code will actually run.
+   This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis. This could involve reading in a file of code, parsing it to understand the structure, and then making your specific check on that structure.
+
+   Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows. In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly.
+
+   Besides all the work we'll save by not having to parse the code by ourselves, working with the same data structures that the compiler uses means that our checks will be based on an accurate assessment of the compiler's understanding -- which means our check will be accurate to how the code actually runs.

 We're going to work through these steps for each of the individual checks implemented in this chapter. Step 1 requires enough understanding of the language we're analyzing to empathize with the kinds of problems its users face. All the code in this chapter is Julia code, written to analyze Julia code.

 # A Very Brief Introduction to Julia

-Julia is a young language aimed at technical computing. It was released at version 0.1 in the spring of 2012; as of the summer of 2014, it is reaching version 0.3. Julia is a procedural language; it is not object-oriented, and while it has functional features (anonymous functions, higher-order functions, immutable data), it does not enforce a functional style of coding. The feature that most programmers will find novel in Julia is multiple dispatch, which is also central to the design of most of its APIs.
+Julia is a young language aimed at technical computing. It was released at version 0.1 in the spring of 2012; as of the summer of 2014, it has reached version 0.3. In general, Julia looks a lot like Python, but with some type annotations and without any object-oriented stuff. The feature that most programmers will find novel in Julia is multiple dispatch, which has a pervasive impact on both API design and on other design choices in the language.

 Here is a snippet of Julia code:

 ~~~jl
 # A comment about increment
-function increment(x::Int)
+function increment(x::Int64)
     return x + 1
 end

 increment(5)
 ~~~

-This code defines a method of the function `increment` that takes one argument, named `x`, of type `Int`. The method returns the value of `x + 1`.
+This code defines a method of the function `increment` that takes one argument, named `x`, of type `Int64`. The method returns the value of `x + 1`.
Then, this freshly defined method is called with the value `5`; the function call, as you may have guessed, will evaluate to `6`.

+`Int64` is a type whose values are signed integers represented in memory by 64 bits; they are the integers that your hardware understands if your computer has a 64-bit processor. Types in Julia define the representation of data in memory, in addition to influencing method dispatch.

 The name `increment` refers to a generic function, which may have many methods. We have just defined one method of it. Let's define another:

 ~~~jl
 # Increment x by y
-function increment(x::Int, y::Number)
+function increment(x::Int64, y::Number)
     return x + y
 end

 increment(5)   # => 6
 increment(5,4) # => 9
 ~~~

 Now increment has two methods. Julia decides which method to run for a given call based on the number and types of the arguments; this is called dynamic multiple dispatch.

 * *dynamic* because it's based on the types of the values used at run-time
-* *multiple* because it looks at the types and order of all the arguments. Object-oriented languages use single dispatch because they only consider the first argument (In `x.foo(y)`, the first argument is `x`.)
-* *dispatch* because this a way of matching function calls to method definitions.
+* *multiple* because it looks at the types and order of all the arguments. Object-oriented languages use single dispatch because they only consider the first argument. (In `x.foo(y)`, the first argument is `x`.) [This is true for Python and Ruby, but not Java and C++, which can have multiple methods of the same name within a class.]
+* *dispatch* because this is a way of matching function calls to method definitions.

-We haven't really seen the "multiple" part yet, but if you're curious about Julia you'll have to look that up on your own. We need to move on to our first check.
+We haven't really seen the "multiple" part yet, but if you're curious about Julia, you'll have to look that up on your own. We need to move on to our first check.

 # Checking the Types of Variables in Loops

 A feature of Julia that sets it apart from other high-level languages is its speed. As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. In Julia, an important part of helping the compiler create fast code for you is writing type-stable code. When the compiler can see that a variable in a section of code will always contain the same specific type, the compiler can do more optimizations than if it believes (correctly or not) that there are many possible types for that variable.

-For example, let's write a function that takes an `Int` and then increases it by some amount. If the number is small (less than 10), let's increase it by a big number (50), but if it's big, let's only increase it by a little (0.5).
+For example, let's write a function that takes an `Int64` and then increases it by some amount.
If the number is small (less than 10), let's increase it by a big number (50), but if it's big, let's only increase it by a little (0.5).

 ~~~jl
-function increment(x::Int)
+function increment(x::Int64)
     if x < 10
         x = x + 50
     else
         x = x + 0.5
     end
     return x
 end
 ~~~

-This function looks pretty straight-forward, but the type of `x` is unstable. At the end of this function, `return x` might return an `Int` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64`, like `22.5`. This means that `5` will become `55` (an `Int`), but `22` will become `22.5` (a `Float64`). If there were more involved code after this function, then it would have to handle both types for `x`, since the compiler expects to need to handle both.
+This function looks pretty straightforward, but the type of `x` is unstable. At the end of this function, `return x` might return an `Int64` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int64`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64` (`22.5`).
+
+`Float64` is a type that represents floating-point values stored in 64 bits; in C, it is called a `double`. This is one of the floating-point types that 64-bit processors understand.
+
+In this definition of `increment`, this means that `5` will become `55` (an `Int64`), but `22` will become `22.5` (a `Float64`). If there were more code in or after this function, then it would have to handle both possible types for `x`, since the compiler (correctly) expects to need to handle both.

-As with most efficiency problems, this issue is more pronounced when it happens during loops. Code inside for-loops and while-loops is run many, many times, so making it fast is more important than speeding up code that is only run a couple of times. Our first check is going to look for variables inside loops that have unstable types. First, let's look at an example of what we want to catch.
+As with most efficiency problems, this issue is more pronounced when it happens during loops. Code inside for-loops and while-loops is run many, many times, so making it fast is more important than speeding up code that is only run once or twice. Therefore, our first check is going to look for variables inside loops that have unstable types.
+
+First, let's look at an example of what we want to catch. We'll be looking at two functions. Each of them sums the numbers 1 to 100, but instead of summing the whole numbers, they divide each one by 2 before summing it. Both functions will get the same answer (`2525.0`); both will return the same type (`Float64`). However, the first function, `unstable`, suffers from type-instability, while the second one, `stable`, does not.

 ~~~jl
 function unstable()
     sum = 0
     for i=1:100
         sum += i/2
     end
     return sum
 end
 ~~~

 ~~~.jl
 function stable()
     sum = 0.0
     for i=1:100
         sum += i/2
     end
     return sum
 end
 ~~~

-The only textual difference between the two functions is in the initialization of `sum`: `sum = 0` vs `sum = 0.0`. In Julia, `0` is an `Int` literal and `0.0` is a `Float64` literal. How big of a difference could this tiny change even make?
+The only textual difference between the two functions is in the initialization of `sum`: `sum = 0` vs `sum = 0.0`. In Julia, `0` is an `Int64` literal and `0.0` is a `Float64` literal. How big of a difference could this tiny change even make?
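If you want to convince yourself of how these literals are typed, here is a quick check in the REPL (the exact types below assume a 64-bit machine):

~~~jl
julia> typeof(0)
Int64

julia> typeof(0.0)
Float64

julia> typeof(0 + 0.5)  # mixing the two promotes the result to Float64
Float64
~~~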
+The only textual difference between the two functions is in the initialization of `sum`: `sum = 0` vs `sum = 0.0`. In Julia, `0` is an `Int64` literal and `0.0` is a `Float64` literal. How big of a difference could this tiny change even make? -Because Julia is Just-In-Time (JIT) compiled, the first run of a function will take longer than subsequent runs (because the first run includes the time it takes to compile it). When we benchmark functions, we have to be sure to run them once (or precompile them) before timing them. +Because Julia is Just-In-Time (JIT) compiled, the first run of a function will take longer than subsequent runs. (The first run includes the time it takes to compile the function for these argument types.) When we benchmark functions, we have to be sure to run them once (or precompile them) before timing them. ~~~jl julia> unstable() @@ -125,9 +135,9 @@ elapsed time: 2.285e-6 seconds (64 bytes allocated) 2525.0 ~~~ -The `@time` macro prints out how long the function took to run and how many bytes were allocated while it was running. The number of bytes allocated increases every time new memory is needed; it does not decrease when the garbage collector vacuums up memory that's no longer being used. This means that the bytes allocated is related to the amount of time we spend allocating memory, but does not imply that we had all of that memory in-use at the same time. +The `@time` macro prints out how long the function took to run and how many bytes were allocated while it was running. The number of bytes allocated increases every time new memory is needed; it does not decrease when the garbage collector vacuums up memory that's no longer being used. This means that the bytes allocated is related to the amount of time we spend allocating and managing memory, but does not imply that we had all of that memory in-use at the same time. -If we wanted to get solid numbers for `stable` vs `unstable` we would need to make the loop much longer or run the functions many times. However, we can already see that `unstable` seems to be slower. More interestingly, we can see a large gap in the number of bytes allocated; `unstable` is allocated around 3kb of memory, where `stable` is using 64 bytes. +If we wanted to get solid numbers for `stable` vs `unstable` we would need to make the loop much longer or run the functions many times. However, it looks like `unstable` is probably slower. More interestingly, we can see a large gap in the number of bytes allocated; `unstable` has allocated around 3kb of memory, where `stable` is using 64 bytes. Since we can see how simple `unstable` is, we might guess that this allocation is happening in the loop. To test this, we can make the loop longer and see if the allocations increase accordingly. Let's make the loop go from 1 to 10000, which is 100 times more iterations; we'll look for the number of bytes allocated to also increase about 100 times, to around 300kb. @@ -141,7 +151,7 @@ function unstable() end ~~~ -Since we redefined the function, we'll need to run it to have it compiled before we measure it. We expect to get a different, larger answer from the new function defintion, since it's summing more numbers now. +Since we redefined the function, we'll need to run it so it gets compiled before we measure it. We expect to get a different, larger answer from the new function definition, since it's summing more numbers now. 
~~~jl
 julia> unstable()
 2.50025e7

 julia> @time unstable()
 elapsed time: 0.000667613 seconds (320048 bytes allocated)
 2.50025e7
 ~~~

-The new `unstable` allocated about 320kb, which is what we would expect if the allocations are happening in the loop. This difference between `unstable` and `stable` is because `unstable`'s `sum` must be boxed while `stable`'s `sum` can be unboxed. Boxed values consist of a type tag and the actual bits that represent the value; unboxed values only have their actual bits. The type tag is small, so that's not why boxing values allocates a lot more memory. The difference comes from what optimizations the compiler can make. When a variable has a concrete, immutable type, the compiler can unbox it inside the function. If that's not the case, then the variable must be allocated on the heap, and participate in the garbage collector. Immutable types are usually types that represent values, rather than collections of values; most numeric types, including `Int` and `Float64` are immutable, while `Array`s and `Dict`s are mutable. Because immutable types cannot be modified, you must make a new copy every time you change one. For example `4 + 6` must make a new `Int` to hold the result. In constract, the members of a mutable type can be updated in-place.
+The new `unstable` allocated about 320kb, which is what we would expect if the allocations are happening in the loop. To explain what's going on here, we're going to dive into how Julia works under the hood.
+
+This difference between `unstable` and `stable` is because `unstable`'s `sum` must be boxed while `stable`'s `sum` can be unboxed. Boxed values consist of a type tag and the actual bits that represent the value; unboxed values only have their actual bits. The type tag is small, so that's not why boxing values allocates a lot more memory.
+
+The difference comes from what optimizations the compiler can make. When a variable has a concrete, immutable type, the compiler can unbox it inside the function. If that's not the case, then the variable must be allocated on the heap, and participate in the garbage collector. Immutable types are usually types that represent values, rather than collections of values; most numeric types, including `Int64` and `Float64`, are immutable. Because immutable types cannot be modified, you must make a new copy every time you change one. For example `4 + 6` must make a new `Int64` to hold the result. In contrast, the members of a mutable type can be updated in-place; this means you don't have to make a copy of the whole thing to make a change.

-Because `sum` in `stable` has a concrete type (`Flaot64`), the compiler know that it can store it unboxed locally in the function and mutate it's value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`. Because `sum` in `unstable` does not have a concrete type, the compiler allocates it on the heap. Every time we modify sum, we allocated a new value on the heap. All this time spent allocating values on the heap (and retrieving them everytime we want to read the value of `sum`) is expensive.
+Because `sum` in `stable` has a concrete type (`Float64`), the compiler knows that it can store it unboxed locally in the function and mutate its value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`.
+
+Because `sum` in `unstable` does not have a concrete type, the compiler allocates it on the heap. Every time we modify `sum`, we allocate a new value on the heap. All this time spent allocating values on the heap (and retrieving them every time we want to read the value of `sum`) is expensive.

-Using `0` vs `0.0` is an easy mistake to make, especially when you're new to Julia.
Automatically checking that variables used in loops are type-stable helps programmers get more insight into what the types of their variables are, in performance-critical sections of their code.
+Using `0` vs `0.0` is an easy mistake to make, especially when you're new to Julia. Automatically checking that variables used in loops are type-stable helps programmers get more insight into what the types of their variables are in performance-critical sections of their code.

-The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one type. A `UnionType` can have more than two possible values. The specific thing that we're going to look for is `UnionType`d variables inside loops.
+The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one of those types. A `UnionType` can join any number of types (e.g. `UnionType(Float64, Int64, Int32)` joins three types). The specific thing that we're going to look for is `UnionType`d variables inside loops.

-## Implmentation
+## Implementation

-In order to find those variables, we'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable way.
+In order to find those variables, we'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable format.

-* How do we find loops in Exprs
+* How do we find loops in `Expr`s
 * How do we find the types of variables
 * How do we print the results

-// Saving for first implementation section
-## Introspection in Julia.
+This process of examining Julia code and finding information about it, from other Julia code, is called introspection. When you or I introspect, we're thinking about how and why we think and feel. When code introspects, it examines the representation or execution properties of code in the same language (possibly its own code).
When code's introspection extends to modifying the examined code, it's called metaprogramming (programs that write or modify programs).

Julia makes it easy to introspect. There are four functions built in to let us see what the compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST).

Anyway, we need to detect those loosely typed loop variables. To implement this, we'll be using some built-in data structures. There is a function that exposes the type-inferred and optimized AST: `code_typed`.

`code_typed` takes two arguments: the function of interest, and a tuple of argument types. For example, if we wanted to see the AST for a function `foo` when called with two `Int64`s, then we would call `code_typed(foo, (Int64,Int64))`.

~~~jl
function foo(x,y)
    z = x + y
    return 2 * z
end

code_typed(foo,(Int64,Int64))
~~~

This is the structure that `code_typed` would return:

~~~
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x,:y}, {{:z},{{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}},{}}, :(begin  # none, line 2:
        z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3:
        return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64
    end::Int64))))
~~~

First, this is an `Array`; this allows `code_typed` to return multiple matching methods. Some combinations of functions and argument types may not completely determine which method should be called. For example, you could pass in a type like `Any`, which is the type at the top of the type hierarchy; all types are subtypes of `Any` (including `Any`). If we included `Any`s in our tuple of argument types, and had multiple potentially matching methods, then the `Array` from `code_typed` would have more than one element in it.

The structure we're interested in is inside the `Array`: it is an `Expr`. Julia uses `Expr`s (short for expression) to represent its AST. (An abstract syntax tree is how the compiler thinks about the meaning of your code; it's kind of like when you had to diagram sentences in grade school.) The `Expr` we get back represents one method. It has some metadata (about the variables that appear in the method) and the expressions that make up the body of the method.

First, let's pull our example `Expr` out to make it easier to talk about.
~~~jl
julia> e = code_typed(foo,(Int64,Int64))[1]
:($(Expr(:lambda, {:x,:y}, {{:z},{{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}},{}}, :(begin  # none, line 2:
        z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3:
        return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64
    end::Int64))))
~~~

Now we can ask some questions about `e`:

~~~.jl
julia> names(e)
3-element Array{Symbol,1}:
 :head
 :args
 :typ

julia> e.head
:lambda

julia> e.args
3-element Array{Any,1}:
 {:x,:y}
 {{:z},{{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}},{}}
 :(begin  # none, line 2:
    z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3:
    return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64
end::Int64)

julia> e.typ
Any
~~~

We just asked `e` what names it has, and then asked what value each name corresponds to. An `Expr` has three properties: `head`, `typ` and `args`.

* `head` tells us what kind of expression this is; normally, you'd use separate types for this in Julia, but this is a type that models the structure used in the Lisp parser. Anyway, `head` tells us how the rest of the `Expr` is structured, and what it represents.
* `typ` is the inferred return type of the expression; every expression in Julia results in some value when evaluated. `typ` is the type of the value that the expression will evaluate to. For nearly all `Expr`s, this value will be `Any`. Only the `body` of type-inferred methods and most expressions inside them will have their `typ`s set to something else. (Because `type` is a keyword, this field can't use that word as its name.)
* `args` is the most complicated part of the `Expr`; its structure varies based on `head`. It's always an `Array{Any}` of `Array{Any}`s. This means it's an untyped list of lists (very Lisp-y).

In this case, there will be three elements in `e.args`:

~~~jl
julia> e.args[1] # names of arguments as symbols
2-element Array{Any,1}:
 :x
 :y

julia> e.args[2] # three lists of variable metadata (names of locals, (variable name, type, bitflags) tuples, and captured variable names)
3-element Array{Any,1}:
 {:z}
 {{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}}
 {}

julia> e.args[3] # an Expr containing the body of the method
:(begin  # none, line 2:
    z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3:
    return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64
end::Int64)
~~~

While the metadata is very interesting, it isn't necessary right now. The important part is the body of the method, which is the third argument. This is another `Expr`.
~~~.jl
julia> body = e.args[3]
:(begin  # none, line 2:
    z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3:
    return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64
end::Int64)

julia> body.head
:body

julia> body.type
ERROR: type Expr has no field type

julia> body.typ
Int64

julia> body.args
4-element Array{Any,1}:
 :( # none, line 2:)
 :(z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64)
 :( # line 3:)
 :(return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64)
~~~

This `Expr` has head `:body` because it's the body of the method. The `typ` is the inferred return type of the method. The `args` holds the list of expressions in the method definition.

There are a couple of annotations of line numbers, but most of it is setting the value of `z` (`z = x + y`) and returning `2 * z`. Notice that these operations have been replaced by `Int64`-specific intrinsic functions. The `top(function-name)` indicates an intrinsic function; something that is implemented in Julia's code generation, rather than in Julia.

The metadata gave us the names and types of all variables appearing in this function. Now we need to look at a function body with a loop, in order to see what that looks like.

~~~jl
julia> function lloop(x)
           for x = 1:100
               x *= 2
           end
       end
lloop (generic function with 1 method)

julia> code_typed(lloop, (Int,))[1].args[3]
:(begin  # none, line 2:
        #s120 = $(Expr(:new, UnitRange{Int64}, 1, :(((top(getfield))(Intrinsics,:select_value))((top(sle_int))(1,100)::Bool,100,(top(box))(Int64,(top(sub_int))(1,1))::Int64)::Int64)))::UnitRange{Int64}
        #s119 = (top(getfield))(#s120::UnitRange{Int64},:start)::Int64
        unless (top(box))(Bool,(top(not_int))(#s119::Int64 === (top(box))(Int64,(top(add_int))((top(getfield))(#s120::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool goto 1
        2:
        _var0 = #s119::Int64
        _var1 = (top(box))(Int64,(top(add_int))(#s119::Int64,1))::Int64
        x = _var0::Int64
        #s119 = _var1::Int64 # line 3:
        x = (top(box))(Int64,(top(mul_int))(x::Int64,2))::Int64
        3:
        unless (top(box))(Bool,(top(not_int))((top(box))(Bool,(top(not_int))(#s119::Int64 === (top(box))(Int64,(top(add_int))((top(getfield))(#s120::UnitRange{Int64},:stop)::Int64,1))::Int64::Bool))::Bool))::Bool goto 2
        1:
        0:
        return
    end::Nothing)
~~~

I skipped straight to the method body here. You'll notice there's no `for` or `while` loop keyword. Instead, the loop has been lowered to `label`s and `goto`s. The `goto` has a number in it; each `label` also has a number. The `goto` jumps to the `label` with the same number. We're going to find loops by looking for `goto`s that jump backwards.

First, we'll need to find the labels and gotos, and figure out which ones match.
~~~.jl
# This is a function for trying to detect loops in the body of a Method
# Returns lines that are inside one or more loops
function loopcontents(e::Expr)
    b = body(e)
    loops = Int[]
    nesting = 0
    lines = {}
    for i in 1:length(b)
        if typeof(b[i]) == LabelNode
            l = b[i].label
            jumpback = findnext(
                x-> (typeof(x) == GotoNode && x.label == l) || (Base.is_expr(x,:gotoifnot) && x.args[end] == l),
                b, i)
            if jumpback != 0
                push!(loops,jumpback)
                nesting += 1
            end
        end
        if nesting > 0
            push!(lines,(i,b[i]))
        end

        if typeof(b[i]) == GotoNode && in(i,loops)
            splice!(loops,findfirst(loops,i))
            nesting -= 1
        end
    end
    lines
end
~~~

Above, we start by getting all the expressions in the body of the method, as an `Array`.

~~~.jl
# Return the body of a Method.
# Takes an Expr representing a Method,
# returns Vector{Expr}.
body(e::Expr) = e.args[3].args
~~~

`loops` is an `Array` of label line numbers where `GoTo`s that are loops occur. `nesting` indicates the number of loops we are currently inside. `lines` is an `Array` of (index, `Expr`) tuples.

We look at each expression in the body of `e`. If it is a label, we check to see if there is a `goto` that jumps to this label (and occurs after the current index). If the result of `findnext` is greater than zero, then such a goto node exists, so we'll add that to `loops` (the `Array` of loops we are currently in) and increment our `nesting` level.

If we're currently inside a loop, we push the current line to our array of lines to return.

If we're at a `GotoNode`, then we check to see if it's the end of a loop. If so, we remove the entry from `loops` and reduce our nesting level.

~~~.jl
# given `lr`, a Vector of expressions (Expr + literals, etc)
# try to find all occurrences of variables in `lr`
# and determine their types
function loosetypes(lr::Vector)
    symbols = SymbolNode[]
    for (i,e) in lr
        if typeof(e) == Expr
            es = copy(e.args)
            while !isempty(es)
                e1 = pop!(es)
                if typeof(e1) == Expr
                    append!(es,e1.args)
                elseif typeof(e1) == SymbolNode
                    push!(symbols,e1)
                end
            end
        end
    end
    loose_types = SymbolNode[]
    for symnode in symbols
        if !isleaftype(symnode.typ) && typeof(symnode.typ) == UnionType
            push!(loose_types, symnode)
        end
    end
    return loose_types
end
~~~

We'll pass the output of `loopcontents` into `loosetypes`. The goal of this function is to find all the variables and their types in our lines-from-inside-loops input `Vector`.

In each expression that occurred inside a loop, `loosetypes` searches for occurrences of symbols and their associated types. Variable usages show up as `SymbolNode`s in the AST; `SymbolNode`s hold the name and inferred type of the variable.
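If you want to see what a `SymbolNode` holds before we go hunting for them, you can ask for its fields in the REPL -- a quick sketch, assuming the 0.3-era layout of this internal type:

~~~jl
julia> names(SymbolNode)
2-element Array{Symbol,1}:
 :name
 :typ
~~~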
+ +We can't just check each expression that `loopcontents` collected to see if it's a `SymbolNode`. The problem is that each `Expr` may contain one or more `Expr`; each `Expr` may contain one or more `SymbolNode`s. This means we need to pull out any nested `Expr`s, so that we can look in each of them for `SymbolNode`s. + +The while loop goes through the guts of all the `Expr`s, recursively, until it's seen all the `Expr`s (and hopefully all the `SymbolNode`s). Every time the loop finds a `SymbolNode`, it adds it to the vector `symbols`. -# return the type-inferred AST for each method of a generic function -function Base.code_typed(f::Function) - Expr[code_typed(m) for m in f.env] +Now we have a list of variables and their types, so it's easy to check if a type is loose. `loosetypes` does that by looking for a specific kind of non-concrete type, a `UnionType`. We get a lot more "failing" results when we consider all non-concrete types to be "failing". This is because we're evaluating each method with it's annotated argument types -- which are likely to be abstract. + +Now that we can do the check on an expression, we should make it easier to call on a users's code: + +~~~.jl +# for a given Function, run checklooptypes on each Method +function checklooptypes(f::Callable;kwargs...) + lrs = LoopResult[] + for e in code_typed(f) + lr = checklooptypes(e) + if length(lr.lines) > 0 push!(lrs,lr) end + end + LoopResults(f.env.name,lrs) end + +# for an Expr representing a Method, +# check that the type of each variable used in a loop +# has a concrete type +checklooptypes(e::Expr;kwargs...) = LoopResult(MethodSignature(e),loosetypes(loopcontents(e))) ~~~ -Once we have a `code_typed` that handles `Method`s, handling whole `Function`s is just requires an array-comprehension over the methods of the given function. For a given `Function` `f`, we can get the methods using `f.env`. Handling a `Method` has more details to handle; the implementation is modeled closely on the existing built-in implementation. +Now we have two ways to call `checklooptypes`: -`m.func.code` gives us the implementation of the method; `m.sig` gives us the types of it's arguments. Given these, `Base.typeinf` should return the type-inferred AST. However, if it was saved in a compressed state, we'll need to call one of the C functions used to implement parts of Julia, specifically `jl_uncompress_ast`, to get the `Expr` value we want to return. +1. On a whole function; this will check each method of the given function. -~~~jl -[returntype(e) for e in code_typed(increment)] # => 2-element Array{Any,1}: - # Int64 - # Any +2. On a specific expression; this will work if the user extracts the results of `code_typed` themselves. + +We can see both options work about the same for a function with one method: + +~~~.jl +julia> using TypeCheck + +julia> function foo(x::Int) + s = 0 + for i = 1:x + s += i/2 + end + return s + end +foo (generic function with 1 method) + +julia> checklooptypes(foo) +foo(Int64)::Union(Int64,Float64) + s::Union(Int64,Float64) + s::Union(Int64,Float64) + + + +julia> checklooptypes(code_typed(foo,(Int,))[1]) +(Int64)::Union(Int64,Float64) + s::Union(Int64,Float64) + s::Union(Int64,Float64) ~~~ +I've skipped an implementation detail here: how did we get the results to print out to the REPL like that? -# Looking for Unused Variables +The `checklooptypes` function returns a special type, `LoopResults`. This type has a function called `show` defined for it. 
The REPL calls `display` on values it wants to display; `display` will then call our `show` implementation. -# Checking Functions for Type Statbility +`LoopResults` is the result of checking a whole function; it has the function name and the results for each method. `LoopResult` is the result of checking one method; it has the argument types and the loosely typed variables. + +~~~.jl +type LoopResult + msig::MethodSignature + lines::Vector{SymbolNode} + LoopResult(ms::MethodSignature,ls::Vector{SymbolNode}) = new(ms,unique(ls)) +end + +function Base.show(io::IO, x::LoopResult) + display(x.msig) + for snode in x.lines + println(io,"\t",string(snode.name),"::",string(snode.typ)) + end +end + +type LoopResults + name::Symbol + methods::Vector{LoopResult} +end + +function Base.show(io::IO, x::LoopResults) + for lr in x.methods + print(io,string(x.name)) + display(lr) + end +end +~~~ + +# Looking For Unused Variables + +* Example of the problem we're checking for +* LHS vs RHS variable usages +* Looking for single-use variables +* Checking for `x += 2` single usages + +# Checking Functions for Type Stability + +* What does type stability mean +* Example of failing function +* Checking function argument & return types +* Where this fails # Tools for Insight into Variable Types + +* Implementing `whos` for functions + + What is `whos` (modules) + + Getting the variables in a function + + implementation + +* Implicit interfaces + + Walking the type hierarchy + + Getting the methods implemented for a type + + Implementing the function + From eddbb1eb01491b57076de5fdccd78e9a6310f60a Mon Sep 17 00:00:00 2001 From: Leah Hanson Date: Tue, 28 Oct 2014 08:49:45 -0500 Subject: [PATCH 4/6] Update StaticAnalysisChapter.md Added a lot of new explanation. Made some edits. The formatting on the lists is bad, but this probably isn't the time to fight with it. I'm not sure what to do for the conclusion (i.e. what the interface should look like). --- static-analysis/StaticAnalysisChapter.md | 539 +++++++++++++++++++---- 1 file changed, 447 insertions(+), 92 deletions(-) diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md index 19dc5b75d..0040c32aa 100644 --- a/static-analysis/StaticAnalysisChapter.md +++ b/static-analysis/StaticAnalysisChapter.md @@ -1,13 +1,13 @@ # Static Analysis by Leah Hanson for *500 Lines or Less* -Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. +Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. To implement a static analysis check, you need to now what you want to do and how to do it. -There are three phases to implementing static analysis: +We can get more specific about what you need to know by describing the process as having three stages: 1. Deciding what you want to check for - This refers to the general problem you'd like to solve, in terms that a user of the programming language would recognize. Examples include: + You should be able to explain the general problem you'd like to solve, in terms that a user of the programming language would recognize. Examples include: * Finding misspelled variable names * Finding race conditions in parallel code @@ -15,7 +15,7 @@ There are three phases to implementing static analysis: 2. 
Deciding how exactly to check for it - While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what misspelled means here. One option would be to claim variable names should be composed of English words from the dictionary; another, more useful, option is to look for variables that are only used once (the one time you mis-typed it). + While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what misspelled means here. One option would be to claim variable names should be composed of English words from the dictionary; another option is to look for variables that are only used once (the one time you mis-typed it). Now that we know we're looking for variables that are only used once, we can talk about kinds of variable usages (having their value assigned vs. read) and what code would or would not trigger a warning. @@ -23,15 +23,11 @@ There are three phases to implementing static analysis: This covers the actual act of writing the code, the time spent reading the documentation for libraries you use, and figuring out how to get at the information you need to write the analysis. This could involve reading in a file of code, parsing it to understand the structure, and then making your specific check on that structure. - Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows. In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly. - - Besides all the work we'll save by not having to parse the code by ourselves, working with the same data structures that the compiler uses means that our checks will be based on an accurate assessment of the compilers understanding -- which means our check will be accurate to how the code actually runs. - -We're going to work through these steps for each of the individual checks implemented in this chapter. Step 1 requires enough understanding of the language we're analyzing to empathize with the kinds of problems it's users face. All the code in this chapter is Julia code, written to analyze Julia code. +We're going to work through these steps for each of the individual checks implemented in this chapter. Step 1 requires enough understanding of the language we're analyzing to empathize with the problems its users face. All the code in this chapter is Julia code, written to analyze Julia code. # A Very Brief Introduction to Julia -Julia is a young language aimed at technical computing. It was released at version 0.1 in the Spring of 2012; as of the summer of 2014, it has reached version 0.3. In general, Julia looks a lot like Python, but with some type annotations and without any object-oriented stuff. The feature that most programmers will find novel in Julia is multiple dispatch, which has a pervasive impact on both API design and on other design choices in the language. +Julia is a young language aimed at technical computing. It was released at version 0.1 in the Spring of 2012; as of the summer of 2014, it has reached version 0.3. 
In general, Julia looks a lot like Python, but with some optional type annotations and without any object-oriented stuff. The feature that most programmers will find novel in Julia is multiple dispatch, which has a pervasive impact on both API design and on other design choices in the language. Here is a snippet of Julia code: @@ -60,17 +56,21 @@ increment(5) # => 6 increment(5,4) # => 9 ~~~ -Now increment has two methods. Julia decides which method to run for a given call based on the number and types of the arguments; this is called dynamic multiple dispatch. +Now the function `increment` has two methods. Julia decides which method to run for a given call based on the number and types of the arguments; this is called dynamic multiple dispatch. * *dynamic* because it's based on the types of the values used at run-time -* *multiple* because it looks at the types and order of all the arguments. Object-oriented languages use single dispatch because they only consider the first argument (In `x.foo(y)`, the first argument is `x`.) [This is true for Python and Ruby, but not Java and C++ which can have multiple methods of the same name within a class.] +* *multiple* because it looks at the types and order of all the arguments. * *dispatch* because this is a way of matching function calls to method definitions. +To put this in context with languages you may already know, object-oriented languages use single dispatch because they only consider the first argument (In `x.foo(y)`, the first argument is `x`.) + We haven't really seen the "multiple" part yet, but if you're curious about Julia, you'll have to look that up on your own. We need to move on to our first check. # Checking the Types of Variables in Loops -A feature of Julia that sets it apart from other high-level languages is its speed. As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. In Julia, an important part of helping the compiler create fast code for you is writing type-stable code. When the compiler can see that a variable in a section of code will always contain the same specific type, the compiler can do more optimizations than if it believes (correctly or not) that there are many possible types for that variable. +A feature of Julia that sets it apart from other high-level languages is its speed. As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. In Julia, an important part of helping the compiler create fast code for you is writing type-stable code. When the compiler can see that a variable in a section of code will always contain the same specific type, the compiler can do more optimizations than if it believes (correctly or not) that there are multiple possible types for that variable. + +## Why This is Important For example, let's write a function that takes an `Int64` and then increases it by some amount. If the number is small (less than 10), let's increase it by a big number (50), but if it's big, let's only increase it by a little (0.5). @@ -85,13 +85,11 @@ function increment(x::Int64) end ~~~ -This function looks pretty straight-forward, but the type of `x` is unstable. At the end of this function, `return x` might return an `Int64` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int64`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64` (`22.5`). 
+This function looks pretty straight-forward, but the type of `x` is unstable. I selected two numbers 50, an `Int64`, and 0.5, a `Float64`; depending on the value of `x`, it might be added to either one of them. At the end of this function, `return x` might return an `Int64` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int64`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64` (`22.5`). `Float64` is a type that represents floating-point values stored in 64 bits; in C, it is called a `double`. This is one of the floating-point types that 64-bit processors understand. -In this definition of `increment`, this means that `5` will become `55` (an `Int64`), but `22` will become `22.5` (a `Float64`). If there were more code in or after this function, then it would have to handle both possible types for `x`, since the compiler (correctly) expects to need to handle both. - -As with most efficiency problems, this issue is more pronounced when it happens during loops. Code inside for-loops and while-loops is run many, many times, so making it fast is more important than speeding up code that is only run once or twice. Therefore, our first check is going to look for variables inside loops that have unstable types. +As with most efficiency problems, this issue is more pronounced when it happens during loops. Code inside for-loops and while-loops is run many, many times, so making it fast is more important than speeding up code that is only run once or twice. Therefore, our first check is to look for variables that have unstable types inside loops. First, let's look at an example of what we want to catch. We'll be looking at two functions. Each of them sums the numbers 1 to 100, but instead of summing the whole numbers, they divide each one by 2 before summing it. Both functions will get the same answer (`2525.0`); both will return the same type (`Float64`). However, the first function, `unstable`, suffers from type-instability, while the second one, `stable`, does not. @@ -174,23 +172,28 @@ Because `sum` in `unstable` does not have a concrete type, the compiler allocate Using `0` vs `0.0` is an easy mistake to make, especially when you're new to Julia. Automatically checking that variables used in loops are type-stable helps programmers get more insight into what the types of their variables are in performance-critical sections of their code. +## Implementation Details + The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one of those types. A `UnionType` join any number of types (e.g. `UnionType(Float64, Int64, Int32)` joins three types). The specific thing that we're going to look for is `UnionType`d variables inside loops. -## Implementation +We'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable format. -In order to find those variables, we'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable format. +* How do we find loops? +* How do we find variables in loops? +* How do we find the types of a variable? +* How do we print the results? 
-* How do we find loops in `Expr`s -* How do we find the types of variables -* How do we print the results +Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows. In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly. -This process of examining Julia code and finding information about, from other Julia code, is called introspection. When you or I introspect, we're thinking about how and why we think and feel. When code introspects, it examines the representation or execution properties of code in the same language (possibly it's own code). When code's introspection extends to modifying the examined code, it's called metaprogramming (programs that write or modify programs). +Besides all the work we'll save by not having to parse the code by ourselves, working with the same data structures that the compiler uses means that our checks will be based on an accurate assessment of the compilers understanding -- which means our check will be accurate to how the code actually runs. -Julia makes it easy to introspect. There are four functions built-in to let us see what that compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). +This process of examining Julia code from Julia code is called introspection. When you or I introspect, we're thinking about how and why we think and feel. When code introspects, it examines the representation or execution properties of code in the same language (possibly it's own code). When code's introspection extends to modifying the examined code, it's called metaprogramming (programs that write or modify programs). -Anyway, we need to detect those pesky mistyped variable names. To implement this, we'll be using some built-in data structures. There is a function that exposes the type-inferred and optimized AST: `code_typed`. +### Introspection in Julia -`code_typed` takes two arguments: the function of interest, and a tuple of argument types. For example, if we wanted to see the AST for a function `foo` when called with two Int64`s, then we would call `code_typed(foo, (Int64,Int64))`. +Julia makes it easy to introspect. There are four functions built-in to let us see what that compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). + +`code_typed` takes two arguments: the function of interest, and a tuple of argument types. For example, if we wanted to see the AST for a function `foo` when called with two `Int64`s, then we would call `code_typed(foo, (Int64,Int64))`. 
~~~jl function foo(x,y) @@ -210,11 +213,9 @@ This is the structure that code_typed_ would return: end::Int64)))) ~~~ -First, this is an `Array`; this allows `code_typed` to return multiple matching methods. Some combinations of functions and argument types may not completely determine which method should be called. For exmaple, you could pass in an type like `Any`, which is the type at the top of the type hierarchy; all types are subtypes of `Any` (including `Any`). If we included `Any`s in our tuple of argument types, and had multiple potentially matching methods, then the `Array` from `code_typed` would have more than one element in it. - -The structure we're interested in is inside the `Array`: it is an `Expr`. Julia uses `Expr`s (short for expression) to represent its AST. (An abstract syntax tree is how the compiler thinks about the meaning of your code; it's kind of like when you had to diagram sentences in grade school.) The `Expr` we get back represents one method. It has some metadata (about the variables that appear in the method) and the expressions that make up the body of the method. +This is an `Array`; this allows `code_typed` to return multiple matching methods. Some combinations of functions and argument types may not completely determine which method should be called. For example, you could pass in a type like `Any` (instead of `Int64`). `Any` is the type at the top of the type hierarchy; all types are subtypes of `Any` (including `Any`). If we included `Any`s in our tuple of argument types, and had multiple matching methods, then the `Array` from `code_typed` would have more than one element in it; it would have one element per matching method. -First, let's pull our example `Expr` out to make it easier to talk about. +Let's pull our example `Expr` out to make it easier to talk about. ~~~jl julia> e = code_typed(foo,(Int64,Int64))[1] :($(Expr(:lambda, {:x,:y}, {{:z},{{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}},{}}, :(begin # none, line 2: @@ -223,17 +224,29 @@ julia> e = code_typed(foo,(Int64,Int64))[1] end::Int64)))) ~~~ -Now we can ask some questions about `e`: +The structure we're interested in is inside the `Array`: it is an `Expr`. Julia uses `Expr`s (short for expression) to represent its AST. (An abstract syntax tree is how the compiler thinks about the meaning of your code; it's kind of like when you had to diagram sentences in grade school.) The `Expr` we get back represents one method. It has some metadata (about the variables that appear in the method) and the expressions that make up the body of the method. + +Now we can ask some questions about `e`. + +We can ask what properties an `Expr` has by using the `names` function. The `names` function, which works on any Julia value or type, returns an `Array` of names defined by that type (or the type of the value). + ~~~.jl julia> names(e) 3-element Array{Symbol,1}: :head :args :typ +~~~ +We just asked `e` what names it has, and now we can ask what value each name corresponds to. An `Expr` has three properties: `head`, `typ` and `args`. + +~~~.jl julia> e.head :lambda +julia> e.typ +Any + julia> e.args 3-element Array{Any,1}: {:x,:y} @@ -242,39 +255,44 @@ julia> e.args z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3: return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64 end::Int64) - -julia> e.typ -Any ~~~ -We just asked `e` what names it has, and then asked what value each name corresponds to. An `Expr` has three properties: `head`, `typ` and `args`. 
+We just saw some values printed out, but that doesn't tell us much about what they mean or how they're used. -* `head` tells us what kind of expression this is; normally, you'd use ex separate types for this in Julia, but this is a type that models the structure used in the Lisp parser. Anyway, head tells us how the rest of the `Expr` is structured, and what it represents. -* `typ` is the inferred return type of the expression; every expresision in Julia results in some value when evaluated. `typ` is the type of the value that the expression will evaluate to. For nearly all `Expr`s, this value will be `Any`. Only the `body` of type-inferred methods and most expressions inside them will have their `typ`s set to something else. (Because `type` is a keyword, this field can't use that word as its name.) -* `args` is the most complicated part of Expr; its structure varies based on `head`. It's always an `Array{Any}` of `Array{Any}`s . This is means it's an untyped list of lists (very Lisp-y). +* `head` tells us what kind of expression this is; normally, you'd use separate types for this in Julia, but `Expr` is a type that models the structure used in the Lisp parser. `head` tells us how the rest of the `Expr` is structured and what it represents. +* `typ` is the inferred return type of the expression; when you evaluate any expression, it results in some value. `typ` is the type of the value that the expression will evaluate to. For nearly all `Expr`s, this value will be `Any` (which is always correct, since every possible type is a subtype of `Any`). Only the `body` of type-inferred methods and most expressions inside them will have their `typ`s set to something more specific. (Because `type` is a keyword, this field can't use that word as its name.) +* `args` is the most complicated part of `Expr`; its structure varies based on the value of `head`. It's always an `Array{Any}` (an untyped array), but beyond that the structure changes. -In this case, there will be three elements in `e.args`: +In an `Expr` representing a method, there will be three elements in `e.args`: ~~~jl julia> e.args[1] # names of arguments as symbols 2-element Array{Any,1}: :x :y +~~~ -julia> e.args[2] # three lists of variable metadata (names of locals, (variable name, type, bitflags) tuples, and captured variable names) +Symbols are a special type for representing the names of variables, constants, functions, and modules. They are a different type from strings because the specifically represent the name of a program construct. + +~~~ +julia> e.args[2] # three lists of variable metadata 3-element Array{Any,1}: {:z} {{:x,Int64,0},{:y,Int64,0},{:z,Int64,18}} {} +~~~ -julia> e.args[3] # an Expr containing the body of the method +The first list above contains the names of all local variables; we only have one (`z`) here. The second list contains a tuple for each variable in and argument to the method; each tuple has the variable name, the variable's inferred type, and a number. The number conveys information about how the variable is used, in a machine (rather than human) friendly way. The last list is of captured variable names; it's empty in this example. + +~~~ +julia> e.args[3] # the body of the method :(begin # none, line 2: z = (top(box))(Int64,(top(add_int))(x::Int64,y::Int64))::Int64 # line 3: return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64 end::Int64) ~~~ -While the metadata is very interesting, it isn't necessary right now. The important part is the body of the method, which is the third argument. 
This is another `Expr`. +The first two `args` elements are metadata about the third. While the metadata is very interesting, it isn't necessary right now. The important part is the body of the method, which is the third element. This is another `Expr`. ~~~.jl julia> body = e.args[3] @@ -285,13 +303,18 @@ julia> body = e.args[3] julia> body.head :body +~~~ -julia> body.type -ERROR: type Expr has no field type +This `Expr` has head `:body` because it's the body of the method. +~~~ julia> body.typ Int64 +~~~ + +The `typ` is the inferred return type of the method. +~~~ julia> body.args 4-element Array{Any,1}: :( # none, line 2:) @@ -300,11 +323,9 @@ julia> body.args :(return (top(box))(Int64,(top(mul_int))(2,z::Int64))::Int64) ~~ ~~~ -This `Expr` has head `:body` because it's the body of the method. The `typ` is the inferred return type of the method. The `args` holds a list of expressions; the list of expressions in the method definition. +The `args` holds a list of expressions; the list of expressions in the method's body. There are a couple of annotations of line numbers (i.e. `:( # line 3:)`), but most of the body is setting the value of `z` (`z = x + y`) and returning `2 * z`. Notice that these operations have been replaced by `Int64`-specific intrinsic functions. The `top(function-name)` indicates an intrinsic function; something that is implemented in Julia's code generation, rather than in Julia. -There are a couple of annotations of line numbers, but most of it is setting the value of `z` (`z = x + y`) and returning `2 * z`. Notice that these operations have been replaced by `Int64`-specific intrinsic functions. The `top(function-name)` indicates an intrinsic function; something that is implemented in Julia's code generation, rather in Julia. - -The metadata gave us the names and types of all variables appearing in this function. Now we need to look at a function body with a loop, in order to see what that looks like. +We haven't seen what a loop looks like yet, so let's try that. ~~~jl julia> function lloop(x) @@ -331,9 +352,13 @@ julia> code_typed(lloop, (Int,))[1].args[3] end::Nothing) ~~~ -I skipped straight to the method body here. You'll notice there's no `for` or `while` loop keyword. Instead, the loop has been lowered to `label`s and `goto`s. The `goto` has a number in it; each `label` also has a number. The `goto` jumps to the the `label` with the same number. We're going to find loops by looking for `goto`s that jump backwards. +You'll notice there's no `for` or `while` loop in the body. Instead, the loop has been lowered to `label`s and `goto`s. The `goto` has a number in it; each `label` also has a number. The `goto` jumps to the the `label` with the same number. + +### Detecting & Extracting Loops -First, we'll need to find the labels and gotos, and figure out where which ones match. +We're going to find loops by looking for `goto`s that jump backwards. + +We'll need to find the labels and gotos, and figure out which ones match. I'm going to give you the full implementation first. After the wall of code, we'll take this apart in smaller pieces. ~~~~.jl # This is a function for trying to detect loops in the body of a Method @@ -367,22 +392,77 @@ function loopcontents(e::Expr) end ~~~ -Above, we start by getting all the expressions in the body of method, as an `Array`. +And now to explain in pieces: -~~~.jl -# Return the body of a Method. -# Takes an Expr representing a Method, -# returns Vector{Expr}. -body(e::Expr) = e.args[3].args +1. 
~~~.jl +b = body(e) ~~~ -`loops` is an `Array` of label line numbers where `GoTo`s that are loops occur. `nesting` indicates the number of loops we are currently inside. `lines` is an `Array` of (index, `Expr`) tuples. + We start by getting all the expressions in the body of method, as an `Array`. `body` is a function that I've already implemented: -We look at each expression in the body of `e`. If it is a lable, we check to see if there is a `goto` that jubmps to this label (and occurs after the current index). If the result of `findnext` is greater than zero, then such a goto node exists, so we'll add that to `loops` (the `Array` of loops we are currently in) and increment our `nesting` level. + ~~~.jl + # Return the body of a Method. + # Takes an Expr representing a Method, + # returns Vector{Expr}. + function body(e::Expr) + return e.args[3].args + end + ~~~ + +2. ~~~.jl + loops = Int[] + nesting = 0 + lines = {} +~~~ -If we're currently inside a loop, we push the current line to our array of lines to return. + `loops` is an `Array` of label line numbers where `GoTo`s that are loops occur. `nesting` indicates the number of loops we are currently inside. `lines` is an `Array` of (index, `Expr`) tuples. -If we're at a GotoNode, then we check to see if it's the end of a loop. If so, we remove the entry from loops and reduce our nesting level. + +3. ~~~.jl + for i in 1:length(b) + if typeof(b[i]) == LabelNode + l = b[i].label + jumpback = findnext( + x-> (typeof(x) == GotoNode && x.label == l) || (Base.is_expr(x,:gotoifnot) && x.args[end] == l), + b, i) + if jumpback != 0 + push!(loops,jumpback) + nesting += 1 + end + end +~~~ + + We look at each expression in the body of `e`. If it is a label, we check to see if there is a `goto` that jumps to this label (and occurs after the current index). If the result of `findnext` is greater than zero, then such a goto node exists, so we'll add that to `loops` (the `Array` of loops we are currently in) and increment our `nesting` level. + +4. ~~~.jl + if nesting > 0 + push!(lines,(i,b[i])) + end +~~~ + + If we're currently inside a loop, we push the current line to our array of lines to return. + +5. ~~~.jl + if typeof(b[i]) == GotoNode && in(i,loops) + splice!(loops,findfirst(loops,i)) + nesting -= 1 + end + end + lines +end +~~~ + + If we're at a GotoNode, then we check to see if it's the end of a loop. If so, we remove the entry from loops and reduce our nesting level. + +^TODO: Let's look at what we get from this! + +### Finding and Typing Variables + +We just finished a function `loopcontents`, which returns the `Expr`s that are inside loops. Our next function will be `loosetypes`, which takes a list of `Expr`s and returns a list of variables that are loosely typed. Later, we'll pass the output of `loopcontents` into `loosetypes`. + +In each expression that occurred inside a loop, `loosetypes` searches for occurrences of symbols and their associated types. Variable usages show up as `SymbolNode`s in the AST; `SymbolNode`s hold the name and inferred type of the variable. + +We can't just check each expression that `loopcontents` collected to see if it's a `SymbolNode`. The problem is that each `Expr` may contain one or more `Expr`; each `Expr` may contain one or more `SymbolNode`s. This means we need to pull out any nested `Expr`s, so that we can look in each of them for `SymbolNode`s. 
~~~.jl # given `lr`, a Vector of expressions (Expr + literals, etc) @@ -413,18 +493,47 @@ function loosetypes(lr::Vector) end ~~~~ -We'll pass the output of `loopcontents` into `loosetypes`. The goal of this function is to find all the variables and their types in our lines-from-inside-loops input `Vector`. - -In each expression that occurred inside a loop, `loosetypes` searches for occurrences of symbols and their associated types. Variable usages show up as `SymbolNode`s in the AST; `SymbolNode`s hold the name and inferred type of the variable. -We can't just check each expression that `loopcontents` collected to see if it's a `SymbolNode`. The problem is that each `Expr` may contain one or more `Expr`; each `Expr` may contain one or more `SymbolNode`s. This means we need to pull out any nested `Expr`s, so that we can look in each of them for `SymbolNode`s. +1. ~~~.jl + symbols = SymbolNode[] + for (i,e) in lr + if typeof(e) == Expr + es = copy(e.args) + while !isempty(es) + e1 = pop!(es) + if typeof(e1) == Expr + append!(es,e1.args) + elseif typeof(e1) == SymbolNode + push!(symbols,e1) + end + end + end + end +~~~ + The while loop goes through the guts of all the `Expr`s, recursively, until it's seen all the `Expr`s (and hopefully all the `SymbolNode`s). Every time the loop finds a `SymbolNode`, it adds it to the vector `symbols`. -The while loop goes through the guts of all the `Expr`s, recursively, until it's seen all the `Expr`s (and hopefully all the `SymbolNode`s). Every time the loop finds a `SymbolNode`, it adds it to the vector `symbols`. +2. ~~~.jl + loose_types = SymbolNode[] + for symnode in symbols + if !isleaftype(symnode.typ) && typeof(symnode.typ) == UnionType + push!(loose_types, symnode) + end + end + return loose_types +end +~~~ + Now we have a list of variables and their types, so it's easy to check if a type is loose. `loosetypes` does that by looking for a specific kind of non-concrete type, a `UnionType`. We get a lot more "failing" results when we consider all non-concrete types to be "failing". This is because we're evaluating each method with it's annotated argument types -- which are likely to be abstract. -Now we have a list of variables and their types, so it's easy to check if a type is loose. `loosetypes` does that by looking for a specific kind of non-concrete type, a `UnionType`. We get a lot more "failing" results when we consider all non-concrete types to be "failing". This is because we're evaluating each method with it's annotated argument types -- which are likely to be abstract. +### Making This Usable Now that we can do the check on an expression, we should make it easier to call on a users's code: +Now we have two ways to call `checklooptypes`: + +1. On a whole function; this will check each method of the given function. + +2. On a specific expression; this will work if the user extracts the results of `code_typed` themselves. + ~~~.jl # for a given Function, run checklooptypes on each Method function checklooptypes(f::Callable;kwargs...) @@ -442,12 +551,6 @@ end checklooptypes(e::Expr;kwargs...) = LoopResult(MethodSignature(e),loosetypes(loopcontents(e))) ~~~ -Now we have two ways to call `checklooptypes`: - -1. On a whole function; this will check each method of the given function. - -2. On a specific expression; this will work if the user extracts the results of `code_typed` themselves. 
- We can see both options work about the same for a function with one method: ~~~.jl @@ -475,11 +578,14 @@ julia> checklooptypes(code_typed(foo,(Int,))[1]) s::Union(Int64,Float64) ~~~ +#### Pretty Printing I've skipped an implementation detail here: how did we get the results to print out to the REPL like that? -The `checklooptypes` function returns a special type, `LoopResults`. This type has a function called `show` defined for it. The REPL calls `display` on values it wants to display; `display` will then call our `show` implementation. +First, I made some new types. `LoopResults` is the result of checking a whole function; it has the function name and the results for each method. `LoopResult` is the result of checking one method; it has the argument types and the loosely typed variables. + +The `checklooptypes` function returns a `LoopResults`. This type has a function called `show` defined for it. The REPL calls `display` on values it wants to display; `display` will then call our `show` implementation. -`LoopResults` is the result of checking a whole function; it has the function name and the results for each method. `LoopResult` is the result of checking one method; it has the argument types and the loosely typed variables. +This code is important for making this static analysis usable, but it is not doing static analysis. You should use the preferred method for pretty-printing types/output in your implementation language; this is just how it's done in Julia. ~~~.jl type LoopResult @@ -508,29 +614,278 @@ function Base.show(io::IO, x::LoopResults) end ~~~ + # Looking For Unused Variables -* Example of the problem we're checking for -* LHS vs RHS variable usages -* Looking for single-use variables -* Checking for `x += 2` single usages +Sometimes, as you're typing in your program, you type a variable -- and sometimes, you mistype the name. When you mistype it, the program can't tell that you meant the same variable as the other times. It sees a variable used only one time, where you might see a variable name misspelled. + +We can find misspelled variable names (and other unused variables) by looking for variables that are only used once -- or only used one way. + +Here is an example of a little bit of code with one misspelled name. + +~~~.jl +function foo(variable_name::Int) + sum = 0 + for i=1:variable_name + sum += variable_name + end + variable_nme = sum + return variable_name +end +~~~ + +This kind of mistake can cause problems in your code that are only discovered when it's run. Let's assume you miss-spell each variable name only once. We can separate variable usages into writes and reads. If the misspelling is a write (i.e. `worng = 5`), then no error will be thrown; you'll just be silently putting the value in the wrong variable -- and it could be frustrating to find the bug. If the misspelling is a read (i.e. `right = worng + 2`), then you'll get a run-time error when the code is run; we'd like to have a static warning for this, so that you can find this error sooner, but you will still have to wait until you run the code to see the problem. + +As code becomes longer and more complicated, it becomes harder to spot the mistake -- unless you have the help of static analysis. + +## Left-hand side and Right-hand side + +Another way to talk about "read" and "write" usages is to call them "right-hand side" (RHS) and "left-hand side" (LHS) usages. This refers to where the variable is relative to the `=` sign. 
+ +Here are some usages of `x`: +* Left-hand side: + + `x = 2` + + `x = y + 22` + + `x = x + y + 2` + + `x += 2` (which de-sugars to `x = x + 2`) +* Right-hand side: + + `y = x + 22` + + `x = x + y + 2` + + `x += 2` (which de-sugars to `x = x + 2`) + + `2 * x` + + `x` + +Notice that expressions like `x = x + y + 2` and `x += 2` appear in both sections, since `x` appears on both sides of the `=` sign. + +## Looking for single-use variables + +There are two cases we need to look for: -# Checking Functions for Type Stability +1. Variables used once. +2. Variables used only on the LHS or only on the RHS. -* What does type stability mean -* Example of failing function -* Checking function argument & return types -* Where this fails +We'll look for all variable usages, but we'll look for LHS and RHS usages separately, to cover both cases. -# Tools for Insight into Variable Types +### Finding LHS usages -* Implementing `whos` for functions - + What is `whos` (modules) - + Getting the variables in a function - + implementation +To be on the LHS, a variable needs to have an `=` sign to be to the left of. This means we can look for `=` signs in the AST, and then look to the left of them to find the relevant variable. -* Implicit interfaces - + Walking the type hierarchy - + Getting the methods implemented for a type - + Implementing the function +In the AST, an `=` is an `Expr` with the head `:(=)`. (The parenthesises are there to make it clear that this is the symbol for `=` and not another operator, `:=`.) The first value in `args` will be the variable name on its LHS. Because we're looking at an AST that the compiler has already cleaned up, there will always be just a single symbol to the left of our `=` sign. +//TODO fact check (on `a[5] = 10`) + +Let's see what that means in code: +~~~.jl +julia> :(x = 5) +:(x = 5) + +julia> :(x = 5).head +:(=) + +julia> :(x = 5).args +2-element Array{Any,1}: + :x + 5 + +julia> :(x = 5).args[1] +:x +~~~ + +Below is the full implementation, followed by an explanation. + +~~~.jl +# Return a list of all variables used on the left-hand-side of assignment (=) +# +# Arguments: +# e: an Expr representing a Method, as from code_typed +# +# Returns: +# a Set{Symbol}, where each element appears on the LHS of an assignment in e. +# +function find_lhs_variables(e::Expr) + output = Set{Symbol}() + for ex in body(e) + if Base.is_expr(ex,:(=)) + push!(output,ex.args[1]) + end + end + return output +end +~~~ + +1. ~~~.jl + output = Set{Symbol}() +~~~ + We have a set of Symbols; those are variables names we've found on the LHS. + +2. ~~~.jl + for ex in body(e) + if Base.is_expr(ex,:(=)) + push!(output,ex.args[1]) + end + end +~~~ + We aren't digging deeper into the expressions, because the code_typed AST is pretty flat; loops and ifs have been converted to flat statements with gotos for control flow. There won't be any assignments hiding inside function calls' arguments. +3. ~~~.jl + push!(output,ex.args[1]) +~~~ + When we find a LHS variable usage, we `push!` the variable name into the `Set`. The `Set` will make sure that we only have one copy of each name. + +### Finding RHS usages + +To find all the other variable usages, we also need to look at each `Expr`. This is a bit more involved, because we care about basically all the `Expr`s, not just the `:(=)` ones and because we have to dig into nested `Expr`s (to handle nested function calls). + +Here is the full implementation, with explanation following. 
+~~~.jl +# Given an Expression, finds variables used in it (on right-hand-side) +# +# Arguments: e: an Expr +# +# Returns: a Set{Symbol}, where each e is used in a rhs expression in e +# +function find_rhs_variables(e::Expr) + output = Set{Symbol}() + + if e.head == :lambda + for ex in body(e) + union!(output,find_rhs_variables(ex)) + end + elseif e.head == :(=) + for ex in e.args[2:end] # skip lhs + union!(output,find_rhs_variables(ex)) + end + elseif e.head == :return + output = find_rhs_variables(e.args[1]) + elseif e.head == :call + start = 2 # skip function name + e.args[1] == TopNode(:box) && (start = 3) # skip type name + for ex in e.args[start:end] + union!(output,find_rhs_variables(ex)) + end + elseif e.head == :if + for ex in e.args # want to check condition, too + union!(output,find_rhs_variables(ex)) + end + elseif e.head == :(::) + output = find_rhs_variables(e.args[1]) + end + + return output +end +~~~ + +The main structure of this function is a large if-else statement, where each case handles a different head-symbol. + +* ~~~.jl + output = Set{Symbol}() +~~~ + + `output` is the set of variable names, which we will return at the end of the function. Since we only care about the fact that each of these variables has be read at least once, using a `Set` frees us from worrying about the uniqueness of each name. + +* ~~~.jl + if e.head == :lambda + for ex in body(e) + union!(output,find_rhs_variables(ex)) + end +~~~ + + This is the first condition in the if-else statement. A `:lambda` represents the body of a function. We recurse on the body of the definition, which should get all the RHS variable usages in the definition. + +* ~~~.jl + elseif e.head == :(=) + for ex in e.args[2:end] # skip lhs + union!(output,find_rhs_variables(ex)) + end +~~~ + + If the head is `:(=)`, then the expression is an assignment. We skip the first element of `args` because that's the variable being assigned to. For each of the remaining expressions, we recursively find the RHS variables and add them to our set. + +* ~~~.jl + elseif e.head == :return + output = find_rhs_variables(e.args[1]) +~~~ + + If this is a return statement, then the first element of `args` is the expression whose value is returned; we'll add any variables in their into our set. + +* ~~~.jl + elseif e.head == :call + # skip function name + for ex in e.args[2:end] + union!(output,find_rhs_variables(ex)) + end +~~~ + + For function calls, we want to get all variables used in all the arguments to the call. We skip the function name, which is the first element of `args`. + +* ~~~.jl + elseif e.head == :if + for ex in e.args # want to check condition, too + union!(output,find_rhs_variables(ex)) + end +~~~ + An `Expr` representing an if-statment has the `head` value `:if`. We want to get variable usages from all the expressions in the body of the if-statement, so we recurse on each element of `args`. + +* ~~~.jl + elseif e.head == :(::) + output = find_rhs_variables(e.args[1]) + end +~~~ + +The `:(::)` operator is used to add type annotations. The first argument is the expression or variable being annotated; we check for variable usages in the annotated expression. + +* ~~~.jl + return output +~~~ + + At the end of the function, we return the set of RHS variable usages. + + +There's a little more code that simplifies the method above. Because the version above only handles `Expr`s, but some of the values that get passed recursively may not be `Expr`s, we need a few more methods to handle the other possible types appropriately. 
+ +~~~.jl +# Recursive Base Cases, to simplify control flow in the Expr version +find_rhs_variables(a) = Set{Symbol}() # unhandled, should be an immediate value, like an Int. +find_rhs_variables(s::Symbol) = Set{Symbol}([s]) +find_rhs_variables(s::SymbolNode) = Set{Symbol}([s.name]) +~~~ + +### Putting It Together + +Now that we have the two functions defined above, we can use them together to find variables that are either only read from or only written to. The function that finds them will be called `unused_locals`. + +~~~.jl +function unused_locals(e::Expr) + lhs = find_lhs_variables(e) + rhs = find_rhs_variables(e) + setdiff(lhs,rhs) +end +~~~ + +`unused_locals` will return a set of variable names. It's easy to write a function that determines whether the output of `unused_locals` counts as a "pass" or not. If the set is empty, the method passes. If all the methods of a function pass, then the function passes. The function `check_locals` below implements this logic. + +~~~.jl +check_locals(f::Callable) = all([check_locals(e) for e in code_typed(f)]) +check_locals(e::Expr) = isempty(unused_locals(e)) +~~~ + +# Making a Analysis Pipeline + +Now that we have multiple checks to run, we'd like to be able to run them on all our code automatically. We want that to happen automatically every time we compile or make a pull request; we also want to be able to request that it happen -- at a press-just-one-button level of convenience. + +We also want to make it easy for other people to use our checks with their tools, regardless of it they use the same tools as us. + +In this section, we're going to write an interface that makes it easy to runs all the available checks over a given piece of code. This will allow other tools to easily interface with our analysis. Each text editor or other display tool will need to write a little glue code to call our interface, but it should be relatively minimal. + +~~~.jl +const checks = [check_locals, check_loops] +function runchecks(e::Expr) + output = Any[] + for c in checks + push!(output, c(e)) + end + return output +end + +* A common interface +* Printing output +* How to add new checks From 70328c4b3a86a2ce8d0fbc1aca64f21e26ef31b5 Mon Sep 17 00:00:00 2001 From: Leah Hanson Date: Sun, 9 Nov 2014 18:49:29 -0600 Subject: [PATCH 5/6] Make more edits. This seems generally finished. I should make another pass specifically to check that the usages of "function" and "method" are correct for the definitions I gave; I can get sloppy about that sometimes. There are probably also more things that I'll need to change after getting feedback. :) --- static-analysis/StaticAnalysisChapter.md | 65 +++++++++++------------- 1 file changed, 30 insertions(+), 35 deletions(-) diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md index 0040c32aa..1c9f9fd06 100644 --- a/static-analysis/StaticAnalysisChapter.md +++ b/static-analysis/StaticAnalysisChapter.md @@ -1,7 +1,9 @@ # Static Analysis by Leah Hanson for *500 Lines or Less* -Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. To implement a static analysis check, you need to now what you want to do and how to do it. +You may be familiar with a fancy IDE that draws red-underlines under parts of your code that don't compile. You may have run a linter on your code to check for formatting or style problems. 
You might run your compiler in super-picky mode with all the warnings turned on. All of these tools are applications of static analysis. + +Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. When you've used the tools I mentioned above, it may have felt like magic. But those tools are just programs -- they are made of source code that was written by a person, a programmer like you. In this chapter, we're going to talk about how to implement a couple static analysis checks. In order to do this, we need to now what we want the check to do and how we want to do it. We can get more specific about what you need to know by describing the process as having three stages: @@ -44,7 +46,9 @@ This code defines a method of the function `increment` that takes one argument, `Int64` is a type whose values are signed integers represented in memory by 64 bits; they are the integers that your hardware understands if your computer has a 64-bit processor. Types in Julia define the representation of data in memory, in addition to influencing method dispatch. -The name `increment` refers to a generic function, which may have many methods. We have just defined one method of it. Let's define another: +The name `increment` refers to a generic function, which may have many methods. We have just defined one method of it. In many languages, the terms "function" and "method" are used interchangeably; in Julia, they have distinct meanings. This chapter will make more sense if you are careful to understand "function" as a named collection of methods, where "method"s are specific implementations for specific type signatures. + +Let's define another method of the `increment` function: ~~~jl # Increment x by y @@ -62,7 +66,9 @@ Now the function `increment` has two methods. Julia decides which method to run * *multiple* because it looks at the types and order of all the arguments. * *dispatch* because this is a way of matching function calls to method definitions. -To put this in context with languages you may already know, object-oriented languages use single dispatch because they only consider the first argument (In `x.foo(y)`, the first argument is `x`.) +To put this in context with languages you may already know, object-oriented languages use single dispatch because they only consider the first argument (In `x.foo(y)`, the first argument is `x`). + +Both single and multiple dispatch are based on the types of the arguments. The `x::Int64` above is a type annotation purely for dispatch. In Julia's dynamic type system, you could assign a value of any type to `x` during the function without an error. We haven't really seen the "multiple" part yet, but if you're curious about Julia, you'll have to look that up on your own. We need to move on to our first check. @@ -85,7 +91,7 @@ function increment(x::Int64) end ~~~ -This function looks pretty straight-forward, but the type of `x` is unstable. I selected two numbers 50, an `Int64`, and 0.5, a `Float64`; depending on the value of `x`, it might be added to either one of them. At the end of this function, `return x` might return an `Int64` or it might return a `Float64`. This is because of the `else` clause; if you add an `Int64`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64` (`22.5`). +This function looks pretty straight-forward, but the type of `x` is unstable. 
I selected two numbers 50, an `Int64`, and 0.5, a `Float64`; depending on the value of `x`, it might be added to either one of them. If you add an `Int64`, like `22`, to `0.5`, which is a `Float64`, then you'll get a `Float64` (`22.5`). Because the type of a variable in the function (`x`) could change depending on the value of the arguments to the function (`x`), this method of `increment` and specifically the variable `x` are type unstable. `Float64` is a type that represents floating-point values stored in 64 bits; in C, it is called a `double`. This is one of the floating-point types that 64-bit processors understand. @@ -164,9 +170,13 @@ The new `unstable` allocated about 320kb, which is what we would expect if the a This difference between `unstable` and `stable` is because `unstable`'s `sum` must be boxed while `stable`'s `sum` can be unboxed. Boxed values consist of a type tag and the actual bits that represent the value; unboxed values only have their actual bits. The type tag is small, so that's not why boxing values allocates a lot more memory. -The difference comes from what optimizations the compiler can make. When a variable has a concrete, immutable type, the compiler can unbox it inside the function. If that's not the case, then the variable must be allocated on the heap, and participate in the garbage collector. Immutable types are usually types that represent values, rather than collections of values; most numeric types, including `Int64` and `Float64`, are immutable. Because immutable types cannot be modified, you must make a new copy every time you change one. For example `4 + 6` must make a new `Int64` to hold the result. In contrast, the members of a mutable type can be updated in-place; this means you don't have to make a copy of the whole thing to make a change. +The difference comes from what optimizations the compiler can make. When a variable has a concrete, immutable type, the compiler can unbox it inside the function. If that's not the case, then the variable must be allocated on the heap, and participate in the garbage collector. Immutable types are a concept specific to Julia. When you make a value of a type that's immutable, the value can't be changed. + +Immutable types are usually types that represent values, rather than collections of values. For example, most numeric types, including `Int64` and `Float64`, are immutable. (Numeric types in Julia are normal types, not special primitive types; you could define a new `MyInt64` that's the same as the provided one.) Because immutable types cannot be modified, you must make a new copy every time you want change one. For example `4 + 6` must make a new `Int64` to hold the result. In contrast, the members of a mutable type can be updated in-place; this means you don't have to make a copy of the whole thing to make a change. + +The idea of `x = x + 2` allocating memory probably sounds pretty weird; why would you make such a basic operation slow by making `Int64`s immutable? This is where those compiler optimizations come in: using immutable types doesn't (usually) slow this down. If `x` has a stable, concrete type (such as `Int64`), then the compiler is free to allocate `x` on the stack and mutate `x` in place. The problem is only when `x` has an unstable type (so the compiler doesn't know how big or what type it will be); once `x` is boxed and on the heap, the compiler isn't complete sure that some other piece of code isn't using the value, and thus can't edit it. 
-Because `sum` in `stable` has a concrete type (`Flaot64`), the compiler know that it can store it unboxed locally in the function and mutate it's value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`. +Because `sum` in `stable` has a concrete type (`Float64`), the compiler knows that it can store it unboxed locally in the function and mutate its value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`. Because `sum` in `unstable` does not have a concrete type, the compiler allocates it on the heap. Every time we modify `sum`, we allocate a new value on the heap. All this time spent allocating values on the heap (and retrieving them every time we want to read the value of `sum`) is expensive. @@ -174,14 +184,18 @@ Using `0` vs `0.0` is an easy mistake to make, especially when you're new to Jul ## Implementation Details -The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one of those types. A `UnionType` join any number of types (e.g. `UnionType(Float64, Int64, Int32)` joins three types). The specific thing that we're going to look for is `UnionType`d variables inside loops. - We'll need to find what variables are used inside of loops and we'll need to find the types of those variables. After we have those results, we'll need to decide how to print them in a human-readable format. * How do we find loops? * How do we find variables in loops? * How do we find the types of a variable? * How do we print the results? +* How do we tell if the type is unstable? + +I'm going to tackle the last question first, since this whole endevour hinges on it. We've looked at an unstable function and seen as programmers how to identify an unstable variable, but we need our program to find them. This sounds like it would require simulating the function to look for variables whose values might change; this sounds like it would take some work. Luckily for us, Julia's type inference already traces through the function's execution to determine the types. + +The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`, a special kind of type that indicates the variable may hold any one of a set of types of values. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one of those types. A `UnionType` joins any number of types (e.g. `UnionType(Float64, Int64, Int32)` joins three types). The specific thing that we're going to look for is `UnionType`d variables inside loops. Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows. In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly.
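+Before we write the real check, here's a quick REPL sketch of the property we'll be testing for; `UnionType` is what Julia 0.3 calls the type of these unions, so recognizing one comes down to an `isa` check:

~~~jl
julia> t = Union(Float64, Int64)
Union(Float64,Int64)

julia> isa(t, UnionType)     # a union of several types is a UnionType
true

julia> isa(Int64, UnionType) # a single concrete type is not
false
~~~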
@@ -191,7 +205,7 @@ This process of examining Julia code from Julia code is called introspection. Wh ### Introspection in Julia -Julia makes it easy to introspect. There are four functions built-in to let us see what that compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). +Julia makes it easy to introspect. There are four functions built-in to let us see what that compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). [TODO: point to other 500 lines chapters that use ASTs] `code_typed` takes two arguments: the function of interest, and a tuple of argument types. For example, if we wanted to see the AST for a function `foo` when called with two `Int64`s, then we would call `code_typed(foo, (Int64,Int64))`. @@ -394,7 +408,8 @@ end And now to explain in pieces: -1. ~~~.jl b = body(e) ~~~ +1. +~~~.jl b = body(e) ~~~ @@ -454,11 +469,11 @@ end If we're at a GotoNode, then we check to see if it's the end of a loop. If so, we remove the entry from loops and reduce our nesting level. -^TODO: Let's look at what we get from this! +The result of this function is the `lines` array; that's an array of (index, value) tuples. Each tuple pairs an index into the method-body-`Expr`'s body with the expression found at that index. Each element of `lines` is an expression that occurred inside a loop. ### Finding and Typing Variables -We just finished a function `loopcontents`, which returns the `Expr`s that are inside loops. Our next function will be `loosetypes`, which takes a list of `Expr`s and returns a list of variables that are loosely typed. Later, we'll pass the output of `loopcontents` into `loosetypes`. +We just finished the function `loopcontents`, which returns the `Expr`s that are inside loops. Our next function will be `loosetypes`, which takes a list of `Expr`s and returns a list of variables that are loosely typed. Later, we'll pass the output of `loopcontents` into `loosetypes`. In each expression that occurred inside a loop, `loosetypes` searches for occurrences of symbols and their associated types. Variable usages show up as `SymbolNode`s in the AST; `SymbolNode`s hold the name and inferred type of the variable. @@ -526,9 +541,7 @@ end ### Making This Usable -Now that we can do the check on an expression, we should make it easier to call on a users's code: - -Now we have two ways to call `checklooptypes`: +Now that we can do the check on an expression, we should make it easier to call on a users's code. We'll create two ways to call `checklooptypes`: 1. On a whole function; this will check each method of the given function.
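+As a preview of where this is heading, here is a sketch of driving the expression-level check by hand; it assumes the method of `checklooptypes` that takes a single type-inferred `Expr`, which is the form described above:

~~~.jl
# a sketch: pull out the type-inferred AST for the zero-argument
# method of `unstable`, then run the expression-level check on it
e = code_typed(unstable, ())[1]
checklooptypes(e)
~~~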
@@ -868,24 +881,6 @@ check_locals(f::Callable) = all([check_locals(e) for e in code_typed(f)]) check_locals(e::Expr) = isempty(unused_locals(e)) ~~~ -# Making a Analysis Pipeline - -Now that we have multiple checks to run, we'd like to be able to run them on all our code automatically. We want that to happen automatically every time we compile or make a pull request; we also want to be able to request that it happen -- at a press-just-one-button level of convenience. - -We also want to make it easy for other people to use our checks with their tools, regardless of it they use the same tools as us. - -In this section, we're going to write an interface that makes it easy to runs all the available checks over a given piece of code. This will allow other tools to easily interface with our analysis. Each text editor or other display tool will need to write a little glue code to call our interface, but it should be relatively minimal. +# Conclusion -~~~.jl -const checks = [check_locals, check_loops] -function runchecks(e::Expr) - output = Any[] - for c in checks - push!(output, c(e)) - end - return output -end - -* A common interface -* Printing output -* How to add new checks +We've just done two distinct analyses of Julia code by writing Julia code. Hopefully, your new understanding of how static analysis tools are written will help you understand the tools you use on your code, and maybe inspire your to write one of your own. From 8a8a39030b4552822e0f2b19c4a59dcd7ea45af6 Mon Sep 17 00:00:00 2001 From: Leah Hanson Date: Sat, 28 Mar 2015 20:54:07 -0500 Subject: [PATCH 6/6] Update StaticAnalysisChapter.md Revised, mostly spelling mistakes, and added a better conclusion. --- static-analysis/StaticAnalysisChapter.md | 50 ++++++++++++++---------- 1 file changed, 29 insertions(+), 21 deletions(-) diff --git a/static-analysis/StaticAnalysisChapter.md b/static-analysis/StaticAnalysisChapter.md index 1c9f9fd06..18ede4a9a 100644 --- a/static-analysis/StaticAnalysisChapter.md +++ b/static-analysis/StaticAnalysisChapter.md @@ -3,7 +3,7 @@ by Leah Hanson for *500 Lines or Less* You may be familiar with a fancy IDE that draws red-underlines under parts of your code that don't compile. You may have run a linter on your code to check for formatting or style problems. You might run your compiler in super-picky mode with all the warnings turned on. All of these tools are applications of static analysis. -Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. When you've used the tools I mentioned above, it may have felt like magic. But those tools are just programs -- they are made of source code that was written by a person, a programmer like you. In this chapter, we're going to talk about how to implement a couple static analysis checks. In order to do this, we need to now what we want the check to do and how we want to do it. +Static Analysis is a way to check for problems in your code without running it. "Static" means at compile-time, rather than at run-time, and "analysis" because we're analyzing the code. When you've used the tools I mentioned above, it may have felt like magic. But those tools are just programs -- they are made of source code that was written by a person, a programmer like you. In this chapter, we're going to talk about how to implement a couple static analysis checks. 
In order to do this, we need to know what we want the check to do and how we want to do it. We can get more specific about what you need to know by describing the process as having three stages: @@ -17,7 +17,7 @@ We can get more specific about what you need to know by describing the process a 2. Deciding how exactly to check for it - While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what misspelled means here. One option would be to claim variable names should be composed of English words from the dictionary; another option is to look for variables that are only used once (the one time you mis-typed it). + While we could ask a friend to do one of the tasks listed above, they aren't specific enough to explain to a computer. To tackle "misspelled variable names", for example, we'd need to decide what misspelled means here. One option would be to claim variable names should be composed of English words from the dictionary; another option is to look for variables that are only used once (the one time you mistyped it). Now that we know we're looking for variables that are only used once, we can talk about kinds of variable usages (having their value assigned vs. read) and what code would or would not trigger a warning. @@ -29,7 +29,7 @@ We're going to work through these steps for each of the individual checks implem # A Very Brief Introduction to Julia -Julia is a young language aimed at technical computing. It was released at version 0.1 in the Spring of 2012; as of the summer of 2014, it has reached version 0.3. In general, Julia looks a lot like Python, but with some optional type annotations and without any object-oriented stuff. The feature that most programmers will find novel in Julia is multiple dispatch, which has a pervasive impact on both API design and on other design choices in the language. +Julia is a young language aimed at technical computing. It was released at version 0.1 in the Spring of 2012; as of the start of 2015, it has reached version 0.3. In general, Julia looks a lot like Python, but with some optional type annotations and without any object-oriented stuff. The feature that most programmers will find novel in Julia is multiple dispatch, which has a pervasive impact on both API design and on other design choices in the language. Here is a snippet of Julia code: @@ -74,7 +74,7 @@ We haven't really seen the "multiple" part yet, but if you're curious about Juli # Checking the Types of Variables in Loops -A feature of Julia that sets it apart from other high-level languages is its speed. As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. In Julia, an important part of helping the compiler create fast code for you is writing type-stable code. When the compiler can see that a variable in a section of code will always contain the same specific type, the compiler can do more optimizations than if it believes (correctly or not) that there are multiple possible types for that variable. +As in most programming languages, writing very fast code in Julia involves an understanding of how the computer works and how Julia works. An important part of helping the compiler create fast code for you is writing type-stable code; this is important in Julia and Javascript, and is also helpful in other JIT’d languages. 
When the compiler can see that a variable in a section of code will always contain the same specific type, the compiler can do more optimizations than if it believes (correctly or not) that there are multiple possible types for that variable. You can read more about why type stability (also called “monomorphism”) is important for Javascript here: http://mrale.ph/blog/2015/01/11/whats-up-with-monomorphism.html ## Why This is Important @@ -143,7 +143,7 @@ The `@time` macro prints out how long the function took to run and how many byte If we wanted to get solid numbers for `stable` vs `unstable` we would need to make the loop much longer or run the functions many times. However, it looks like `unstable` is probably slower. More interestingly, we can see a large gap in the number of bytes allocated; `unstable` has allocated around 3kb of memory, where `stable` is using 64 bytes. -Since we can see how simple `unstable` is, we might guess that this allocation is happening in the loop. To test this, we can make the loop longer and see if the allocations increase accordingly. Let's make the loop go from 1 to 10000, which is 100 times more iterations; we'll look for the number of bytes allocated to also increase about 100 times, to around 300kb. +Since we can see how simple `unstable` is, we might guess that this allocation is happening in the loop. To test this, we can make the loop longer and see if the allocations increase accordingly. Let's make the loop go from 1 to 10000, which is 100 times more iterations; we'll look for the number of bytes allocated to also increase about 100 times, to around 300 kb. ~~~jl function unstable() @@ -174,7 +174,7 @@ The difference comes from what optimizations the compiler can make. When a varia Immutable types are usually types that represent values, rather than collections of values. For example, most numeric types, including `Int64` and `Float64`, are immutable. (Numeric types in Julia are normal types, not special primitive types; you could define a new `MyInt64` that's the same as the provided one.) Because immutable types cannot be modified, you must make a new copy every time you want to change one. For example, `4 + 6` must make a new `Int64` to hold the result. In contrast, the members of a mutable type can be updated in-place; this means you don't have to make a copy of the whole thing to make a change. -The idea of `x = x + 2` allocating memory probably sounds pretty weird; why would you make such a basic operation slow by making `Int64`s immutable? This is where those compiler optimizations come in: using immutable types doesn't (usually) slow this down. If `x` has a stable, concrete type (such as `Int64`), then the compiler is free to allocate `x` on the stack and mutate `x` in place. The problem is only when `x` has an unstable type (so the compiler doesn't know how big or what type it will be); once `x` is boxed and on the heap, the compiler isn't complete sure that some other piece of code isn't using the value, and thus can't edit it. +The idea of `x = x + 2` allocating memory probably sounds pretty weird; why would you make such a basic operation slow by making `Int64`s immutable? This is where those compiler optimizations come in: using immutable types doesn't (usually) slow this down. If `x` has a stable, concrete type (such as `Int64`), then the compiler is free to allocate `x` on the stack and mutate `x` in place.
The problem is only when `x` has an unstable type (so the compiler doesn't know how big or what type it will be); once `x` is boxed and on the heap, the compiler isn't completely sure that some other piece of code isn't using the value, and thus can't edit it. Because `sum` in `stable` has a concrete type (`Float64`), the compiler knows that it can store it unboxed locally in the function and mutate its value; `sum` will not be allocated on the heap and new copies don't have to be made every time we add `i/2`. @@ -192,7 +192,7 @@ We'll need to find what variables are used inside of loops and we'll need to fin * How do we print the results? * How do we tell if the type is unstable? -I'm going to tackle the last question first, since this whole endevour hinges on it. We've looked at an unstable function and seen as programmers how to identify an unstable variable, but we need our program to find them. This sounds like it would require simulating the function to look for variables whose values might change; this sounds like it would take some work. Luckily for us, Julia's type inference already traces through the function's execution to determine the types. +I'm going to tackle the last question first, since this whole endeavour hinges on it. We've looked at an unstable function and seen, as programmers, how to identify an unstable variable, but we need our program to find them. This sounds like it would require simulating the function to look for variables whose values might change, which would take some work. Luckily for us, Julia's type inference already traces through the function's execution to determine the types. The type of `sum` in `unstable` is `Union(Float64,Int64)`. This is a `UnionType`, a special kind of type that indicates the variable may hold any one of a set of types of values. A variable of type `Union(Float64,Int64)` can hold values of type `Int64` or `Float64`; a value can only have one of those types. A `UnionType` joins any number of types (e.g. `UnionType(Float64, Int64, Int32)` joins three types). The specific thing that we're going to look for is `UnionType`d variables inside loops. Parsing code into a representative structure is a complicated business, and gets more complicated as the language grows. In this chapter, we'll be depending on internal data structures used by the compiler. This means that we don't have to worry about reading files or parsing them, but it does mean we have to work with data structures that are not in our control and that sometimes feel clumsy or ugly. @@ -205,7 +205,7 @@ This process of examining Julia code from Julia code is called introspection. Wh ### Introspection in Julia -Julia makes it easy to introspect. There are four functions built-in to let us see what that compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`. Those are listed in order of what step in the compilation process their output is from; the left-most one is closest to the code we'd type in and the right-most one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). [TODO: point to other 500 lines chapters that use ASTs] +Julia makes it easy to introspect. There are four functions built-in to let us see what the compiler is thinking: `code_lowered`, `code_typed`, `code_llvm`, and `code_native`.
Those are listed in order of what step in the compilation process their output is from; the leftmost one is closest to the code we'd type in and the rightmost one is the closest to what the CPU runs. For this chapter, we'll focus on `code_typed`, which gives us the optimized, type-inferred abstract syntax tree (AST). [TODO: point to other 500 lines chapters that use ASTs] `code_typed` takes two arguments: the function of interest, and a tuple of argument types. For example, if we wanted to see the AST for a function `foo` when called with two `Int64`s, then we would call `code_typed(foo, (Int64,Int64))`. @@ -273,7 +273,7 @@ julia> e.args We just saw some values printed out, but that doesn't tell us much about what they mean or how they're used. -* `head` tells us what kind of expression this is; normally, you'd use separate types for this in Julia, but `Expr` is a type that models the structure used in the Lisp parser. `head` tells us how the rest of the `Expr` is structured and what it represents. +* `head` tells us what kind of expression this is; normally, you'd use separate types for this in Julia, but `Expr` is a type that models the structure used in the parser. The parser is written in a dialect of Scheme, which structures everything as nested lists. `head` tells us how the rest of the `Expr` is organized and what kind of expression it represents. * `typ` is the inferred return type of the expression; when you evaluate any expression, it results in some value. `typ` is the type of the value that the expression will evaluate to. For nearly all `Expr`s, this value will be `Any` (which is always correct, since every possible type is a subtype of `Any`). Only the `body` of type-inferred methods and most expressions inside them will have their `typ`s set to something more specific. (Because `type` is a keyword, this field can't use that word as its name.) * `args` is the most complicated part of `Expr`; its structure varies based on the value of `head`. It's always an `Array{Any}` (an untyped array), but beyond that the structure changes. @@ -366,9 +366,9 @@ julia> code_typed(lloop, (Int,))[1].args[3] end::Nothing) ~~~ -You'll notice there's no `for` or `while` loop in the body. Instead, the loop has been lowered to `label`s and `goto`s. The `goto` has a number in it; each `label` also has a number. The `goto` jumps to the the `label` with the same number. +You'll notice there's no `for` or `while` loop in the body. As the compiler transforms the code from what we wrote to the binary instructions the CPU understands, features that are useful to humans but that are not understood by the CPU (like loops) are removed. The loop has been rewritten as `label`s and `goto`s. The `goto` has a number in it; each `label` also has a number. The `goto` jumps to the `label` with the same number. -### Detecting & Extracting Loops +### Detecting and Extracting Loops We're going to find loops by looking for `goto`s that jump backwards. @@ -481,7 +481,7 @@ We can't just check each expression that `loopcontents` collected to see if it's ~~~.jl # given `lr`, a Vector of expressions (Expr + literals, etc) -# try to find all occurances of a variables in `lr` +# try to find all occurrences of variables in `lr` # and determine their types function loosetypes(lr::Vector) symbols = SymbolNode[] @@ -541,7 +541,7 @@ end ### Making This Usable -Now that we can do the check on an expression, we should make it easier to call on a users's code.
We'll create two ways to call `checklooptypes`: +Now that we can do the check on an expression, we should make it easier to call on a user's code. We'll create two ways to call `checklooptypes`: 1. On a whole function; this will check each method of the given function. @@ -630,7 +630,7 @@ # Looking For Unused Variables -Sometimes, as you're typing in your program, you type a variable -- and sometimes, you mistype the name. When you mistype it, the program can't tell that you meant the same variable as the other times. It sees a variable used only one time, where you might see a variable name misspelled. +Sometimes, as you're typing in your program, you type a variable -- and sometimes, you mistype the name. When you mistype it, the program can't tell that you meant the same variable as the other times. It sees a variable used only one time, where you might see a variable name misspelled. Languages that require variable declarations naturally catch these misspellings, but many dynamic languages don’t require declarations and thus need an extra layer of analysis to catch them. We can find misspelled variable names (and other unused variables) by looking for variables that are only used once -- or only used one way. @@ -647,7 +647,7 @@ function foo(variable_name::Int) end ~~~ -This kind of mistake can cause problems in your code that are only discovered when it's run. Let's assume you miss-spell each variable name only once. We can separate variable usages into writes and reads. If the misspelling is a write (i.e. `worng = 5`), then no error will be thrown; you'll just be silently putting the value in the wrong variable -- and it could be frustrating to find the bug. If the misspelling is a read (i.e. `right = worng + 2`), then you'll get a run-time error when the code is run; we'd like to have a static warning for this, so that you can find this error sooner, but you will still have to wait until you run the code to see the problem. +This kind of mistake can cause problems in your code that are only discovered when it's run. Let's assume you misspell each variable name only once. We can separate variable usages into writes and reads. If the misspelling is a write (i.e. `worng = 5`), then no error will be thrown; you'll just be silently putting the value in the wrong variable -- and it could be frustrating to find the bug. If the misspelling is a read (i.e. `right = worng + 2`), then you'll get a runtime error when the code is run; we'd like to have a static warning for this, so that you can find this error sooner, but you will still have to wait until you run the code to see the problem. As code becomes longer and more complicated, it becomes harder to spot the mistake -- unless you have the help of static analysis. @@ -683,9 +683,7 @@ We'll look for all variable usages, but we'll look for LHS and RHS usages separa To be on the LHS, a variable needs to be to the left of an `=` sign. This means we can look for `=` signs in the AST, and then look to the left of them to find the relevant variable. -In the AST, an `=` is an `Expr` with the head `:(=)`. (The parenthesises are there to make it clear that this is the symbol for `=` and not another operator, `:=`.) The first value in `args` will be the variable name on its LHS. Because we're looking at an AST that the compiler has already cleaned up, there will always be just a single symbol to the left of our `=` sign. - -//TODO fact check (on `a[5] = 10`) +In the AST, an `=` is an `Expr` with the head `:(=)`.
(The parentheses are there to make it clear that this is the symbol for `=` and not another operator, `:=`.) The first value in `args` will be the variable name on its LHS. Because we're looking at an AST that the compiler has already cleaned up, there will (nearly) always be just a single symbol to the left of our `=` sign. Let's see what that means in code: ~~~.jl @@ -738,7 +736,8 @@ end end end ~~~ - We aren't digging deeper into the expressions, because the code_typed AST is pretty flat; loops and ifs have been converted to flat statements with gotos for control flow. There won't be any assignments hiding inside function calls' arguments. + We aren't digging deeper into the expressions, because the code_typed AST is pretty flat; loops and ifs have been converted to flat statements with gotos for control flow. There won't be any assignments hiding inside function calls' arguments. This code will fail if anything more than a symbol is on the left of the equal sign; this misses two specific edge cases -- array accesses (like `a[5]`, which will be represented as a `:ref` expression) and properties (like `a.head`, which will be represented as a `:.` expression). These will still always have the relevant symbol as the first value in their `args`; it might just be buried a bit (as in `a.property.name.head.other_property`). This code doesn’t handle those cases, but a couple lines of code inside the `if` statement could fix that. + 3. ~~~.jl push!(output,ex.args[1]) ~~~ @@ -836,7 +835,7 @@ The main structure of this function is a large if-else statement, where each cas union!(output,find_rhs_variables(ex)) end ~~~ - An `Expr` representing an if-statment has the `head` value `:if`. We want to get variable usages from all the expressions in the body of the if-statement, so we recurse on each element of `args`. + An `Expr` representing an if-statement has the `head` value `:if`. We want to get variable usages from all the expressions in the body of the if-statement, so we recurse on each element of `args`. * ~~~.jl elseif e.head == :(::) @@ -882,5 +881,14 @@ check_locals(e::Expr) = isempty(unused_locals(e)) ~~~ # Conclusion +We’ve done two static analyses of Julia code -- one based on types and one based on variable usages. + +Statically-typed languages already do the kind of work our type-based analysis did; additional type-based static analysis is mostly useful in dynamically typed languages. There have been (mostly research) projects to build static type inference systems for languages including Python, Ruby, and Lisp. These systems are usually built around optional type annotations; you can have static types when you want them, and fall back to dynamic typing when you don’t. This is especially helpful for integrating some static typing into existing code bases. + +Non-type-based checks, like our variable-usage one, are applicable to both dynamically- and statically-typed languages. However, many statically-typed languages, like C++ and Java, require you to declare variables and already give basic warnings like the ones we created. There are still custom checks that can be written -- for example, checks that are specific to your project’s style guide, or extra safety precautions based on security policies. + +While Julia does have great tools for enabling static analysis, it’s not alone. Lisp, of course, is famous for having the code be a data structure of nested lists, so it tends to be easy to get at the AST. Java also exposes its AST, although the AST is much more complicated than Lisp’s.
Some languages or language toolchains are not designed to allow mere users to poke around at internal representations. For open-source toolchains (especially well-commented ones), one option is to add the hooks you want to pull out the AST. + +In cases where that won’t work, the final fallback is writing a parser yourself; this is to be avoided when possible. It’s a lot of work to cover the full grammar of most programming languages, and you’ll have to update it yourself as new features are added to the language (rather than getting the updates automatically from upstream). Depending on the checks you want to do, you may be able to get away with parsing only some lines or a subset of language features, which would greatly decrease the cost of writing your own parser. -We've just done two distinct analyses of Julia code by writing Julia code. Hopefully, your new understanding of how static analysis tools are written will help you understand the tools you use on your code, and maybe inspire your to write one of your own. +Hopefully, your new understanding of how static analysis tools are written will help you understand the tools you use on your code, and maybe inspire you to write one of your own.