Last updated: 2020-05-19
Scope graphs provide a single framework for performing name resolution, while abstracting away the specific name resolution rules for each of the languages that we want to support. The basic idea is to represent the definitions and references in a program using graph. A binding maps a reference to all of the possible definitions that the reference could refer to. Because we’ve represented definitions and references as a graph, bindings are represented by paths within that graph.
Unfortunately, scope graphs are not incremental, at least as presented in the original papers. As one example, Scopes as Types presents some rewrite rules that let you handle field access in a record or object type — for instance, being able to resolve the foo
part of A.foo
. In Scopes as Types, we’d handle this by performing the path-finding algorithm when constructing the graph, to find out what A
resolves to. Once we find its definition, we then attach the reference to foo
to that result.
This is not incremental because the reference to foo
“lives” over in some unrelated part of the graph. If A
is defined in a different file (or worse, a different package), then we have to update that distant part of the graph with the node representing the reference. We end up cluttering the graph for a class definition with nodes representing all of its field references, which is especially problematic for popular packages with lots of downstream dependencies.
Our key insight is to recognize that when resolving A.foo
, we have to “pause” the resolution of foo
, start a resolution of A
, and once we’ve found the binding for A
, resume the original resolution of foo
. This describes a stack! So instead of having our path-finding algorithm look for exactly one symbol, we can keep track of a stack of things to look for. To correctly resolve foo
, we do still have to eventually make our way over to the part of the graph where A
is defined. But by having a stack of path-finding targets, we can resolve the reference to A.foo
with a single execution of the path-finding algorithm. And most importantly, each “chunk” of the overall graph only depends on “local” information from the original source file. (a.k.a., it’s incremental!)
Because of this focus on a stack of things to look for, we’re calling our modification of scope graphs stack graphs. There’s enough different between the two frameworks (especially in terms of the “gadgets” of graph that you create for a particular chunk of code) that we think it’s worth having distinct names for them.
While searching for a path in an incremental stack graph, we keep track of two stacks: a symbol stack and a scope stack. Broadly speaking, the symbol stack keeps track of what symbols we’re trying to resolve, while the scope stack gives us control over which particular scopes we look for those symbols in.
There are several different kinds of node that can appear in a stack graph; as we search for a path representing a binding, each kind of node has different rules for how it interacts with the two stacks.
This is a singleton node — there is exactly one of them in the entire stack graph. When we start to compute stack graphs incrementally, the root node in each subgraph represents a single root node: this is the node where the subgraphs are linked together into a single large overall graph.
A scope that is needed to represent the name binding structure of the content of the current file, but that we don’t need to refer to by name in other parts of the stack graph.
A scope that can appear on the scope stack. Used to represent, for example, the parameters passed into a function call.
The symbol stack keeps track of all of the things that we are currently looking for. These will typically be symbols that come from the source code: when the user clicks on a reference, we push the reference’s symbol onto the stack so that we can search for the corresponding definition (which will be represented by a node that pops that symbol off the stack, leaving it empty). However, we will sometimes need to search for other things. For instance, we’ll handle field access (like A.foo
) by searching for a fake symbol called .
.
Colored green. Pushes the content of the node onto the symbol stack, indicating a new thing that we have to start searching for. If the node is dark green with a solid border (sometimes called a “reference node”), then this is a “real” reference that the user can click on and that exists in the syntax tree of the source file. If the node is light green with a dashed border (sometimes called an “internal push node”), then we’re using this node to push something onto the stack, but it doesn’t represent a real reference in the source code. (Note that fake symbols like .
will only appear in internal push nodes, since by definition those can’t come from the user’s source code.)
Colored red. Pops the content of the node off of the symbol stack. (If the symbol stack is empty, or if the top of it is something else, then the current potential path is not valid.). If the node is dark red with a solid thick border (sometimes called a “definition node”), then this is a “real” definition that exists in the syntax tree of the source files, and that could be the target of a jump-to-def binding. If the node is light red with a dashed/dotted border (sometimes called an “internal pop node”), then we’re using this node to pop something off of the stack, but it doesn’t represent a real definition in the source code. (Note that fake symbols like .
will only appear in internal pop nodes, since by definition those can’t come from the user’s source code.)
The scope stack gives us the ability to change the part of the stack graph where we’re currently looking for things. This is used for things like function calls, where we need to package up a context (containing the actual parameters) and send it over to another part of the source code. (The definition of function doesn’t know in advance which parameters will be passed in — and can receive different parameters for different calls.)
Pops a scope reference from the top of the scope stack and jumps to it. We resume the path-finding algorithm from the popped scope’s node.
Removes all scope references from the scope stack, ignoring all of them.
We’ve talked about how we can do things with the scopes on the scope stack, but we haven’t talked about to put things onto the scope stack. The nodes in this section interact with both stacks.
Just like the push symbol nodes, but for scoped symbols. A scoped symbol is a symbol with a list of scopes attached to it.
These nodes create a new scoped symbol, and pushes it onto the symbol stack. The scoped symbol’s scope list is the current contents of the scope stack, prepended with the scope mentioned in the push node.
Just like the pop symbol nodes, but for scoped symbols. If the symbol stack is empty, the current potential path is not valid. If the top of the symbol stack is not a scoped symbol, or if the symbol’s content doesn’t match what’s in the node, the current potential path is not valid.
These nodes pop off the top of the symbol stack, and replace the contents of the scope stack with the scoped symbol’s scope list.
This section describes several patterns for constructing the subgraph that corresponds to a particular snippet of code.
# a/b.py defines module "a.b"
...
We have to know the name of the Python module that the file defines. We create nested “pop symbol” nodes for each element of the module name. Only the final element is exported as an actual definition, the others are internal only. (That way the a
of import a.b
doesn’t resolve to the file that defines b
.)
In Python, top-level definitions are only visible later in the file, and within the bodies of any function or method definitions. To handle this, we’ll create a separate scope for each definition, and link them together so that each definition’s scope can see into the scopes of all of the definitions that precede it.
The “T” scope is the head of that list, and represents what’s visible at the “top of file” (i.e., nothing). The “B” scope will be the end of that list, and represents what’s visible at the “bottom of file”. We’ll want that to include all of the top-level definitions (ignoring __all__
clauses for now).
The definition node of the module points at the “bottom” of file scope, since the module should contain all of the definitions defined in the file.
import a.b
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
After this import statement, a
and a.b
are both available in the file’s scope. The definition nodes ensure that references later in the file to a
or a.b
have the import statement itself as one possible definition. This branch also “re-pushes” the symbols back onto the symbol stack before linking back up with the singleton node that all files share. This allows us to resolve references to a
and a.b
by looking in other files for the actual definitions of those modules. (Note that here we don’t need to know in advance which file — if any! — the module is defined in! If there is one, the path-finding algorithm will find it, and if not, it won’t!) In addition, those “re-push” nodes are proper references that the user can click on, so that the a
and a.b
in the import statement itself also (hopefully) resolve to the modules’ definitions.
from a.b import *
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
This kind of import doesn’t make a
or a.b
available in the rest of the file, but does make every symbol exported by a.b
available. We handle that by automatically injecting the nodes necessary to transform a lookup of foo
into a lookup of a.b.foo
, and then linking to the singleton root node to hand off the rewritten lookup to any other available files. The module names are encoded as reference nodes, which allows the user to click on the a.b
part of the import statement, and have that resolve to the file that defines the imported module.
from a.b import foo
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
This is exactly like the “import star” pattern, but we add additional nodes so that this path can only be used when resolving foo
. (We still transform that into a lookup of a.b.foo
before linking up with the singleton root node.) The imported symbol (foo
) is encoded as both a definition node and a reference node. The definition node ensures that references to foo
later in the file resolve to the import statement as well as to the symbol’s real definition over in a.b
. The reference node allows the user to click on the foo
in the import statement and have it resolve to its real definition.
from a.b import foo as bar
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
This is exactly like the “import symbol” pattern, but with different symbol names in the definition and reference nodes for the imported symbol. The user-visible behavior is the same: a reference to bar
later in the file will resolve both to the import statement and to the real definition (of foo
) over in a.b
.
foo = expr
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
We create a definition node for the symbol being assigned to. We also link this node to whatever subgraph we create for the expr
that defines the value being assigned. That ensures that if the expression is a function call or field access, that is “passed through” to the newly aliased symbol.
class A:
...
The gray “C” scope (given) is the “current position” scope for the previous definition (or the “top of file” scope if this is the first definition).
The black “C” scope is the new “current position” scope that we create for this definition.
We create an exported “pop symbol” node for the class being defined. Because it’s exported this will show up as a definition for the class name.
We create two scopes that will contain the class members (“CM”) and instance members (“IM”) of the class. The “IM” scope links to the “CM” scope, since class members are always visible anywhere that instance members are, but not vice versa.
You can use field access syntax (represented by the special .
symbol) on the class name to get access to the class members (“CM” scope) of the class.
The class name can be called to invoke the class’s constructor, so we add a path of nodes that starts by popping off a scoped ()
symbol. The result is a new “S” scope (for “self”), which will represent instances of this class. You can use field access syntax to get access to the instance members (“IM” scope) of the instance.
class A:
foo = expr
The gray “CM” scope (given) is the “class members” scope for the containing class.
The gray “C” scope (given) is the “current position” scope for where the class definition appears.
This looks very much like a top-level definition; the main difference is that the definition node is added to the “CM” scope, but the expression’s subgraph is calculated against the “C” scope, since the field definition can see anything defined before the class definition, but not anything else inside the class definition.
class A:
def foo(...):
...
The gray “IM” scope (given) is the “instance members” scope for the containing class.
The gray “B” scope (given) is the “bottom of file” scope for the module.
The gray “S” scope (given) is the “self” scope for the containing class.
We create a definition node for the method’s name, and add it to the class’s “instance members” scope. Add a scoped ()
node to indicate that it can be called, and create a new “return values” scope (“R”) for the result. For every return
statement in the body, create the subgraph for the return value expression and link it to this scope.
The new “P” scope is the “parent” scope that we use to calculate the subgraph for any expression in the function body.
One path leads through the new “lexical context” scope (“L”), which drops any parameters that are passed into the function and looks up the reference outside of the class definition. (We use the “bottom of file” scope because the inside of a method definition can see all of the definitions in the module, even ones that occur later in the file.)
Another path leads through the new “formal parameters” scope (“F”). We create a first entry for the method’s self
parameter, which drops the parameters passed in via the method call and just goes directly to the class’s “self” scope. We’ll add additional entries for each of the method’s formal parameters.
class A:
def ...(..., foo, ...):
...
The gray “F” scope (given) is the method’s “formal parameters” scope.
We first add a definition node for the parameter, which is what the parameter’s name will resolve to inside of the method body. (That means it doesn’t need to know anything about, and can drop, the actual parameter values passed in.)
We also add an internal pop node that lets the symbol “pass through” information from the actual parameter value. The caller will provide actual values as a scope that is pushed onto the scope stack before calling the method. We use a “jump” node to get back to that passed-in scope. The caller can provide a value either by name or by position, so there are two ways to get through to the “jump” node. (For the position node, the 0
should be replaced with the real index of the formal parameter.)
foo
expr.foo
First create whatever subgraph is needed for the expression that produces the object whose fields needed to be accessed (expr
).
Add an exported “push symbol” node for the field name, and an internal .
push node to indicate that we want to extract a member of the object.
foo(...)
First create whatever subgraph is needed for the expression that produces the function is being called (expr
).
Create a new “actual parameters” scope (1). We’ll add entries to this scope for each parameter passed into the function.
Push a scoped internal ()
symbol onto the symbol stack to “call” the function. The attached scope is the “actual parameters” scope that we just created.
...(..., expr, ...)
The gray “A” scope (given) is the “actual parameters” scope for the function call.
First create whatever subgraph is needed for the expression that provides the value of the parameter (expr
).
Add an internal “pop symbol” node with the index of the parameter within the function call. Positional parameters must be looked up by index, since we don’t know what name the function uses for the corresponding formal parameter.
...(..., foo=expr, ...)
The gray “A” scope (given) is the “actual parameters” scope for the function call.
First create whatever subgraph is needed for the expression that provides the value of the parameter (expr
).
Add an internal “pop symbol” node with the name of the parameter. Keyword parameters should be looked up by name, since the name of the actual parameter in the function call must match the name of the formal parameter in the function body.
Using the patterns described above, we can construct the full stack graph for several interesting programs:
# main.py₀
import a₁
import b₂
print(a₃.foo₄)
# a.py₅
foo₆ = 1
# b.py₇
foo₈ = 2
# main.py₀
from a₁ import *
from b₂ import *
print(foo₃)
# a.py₄
foo₅ = 1
# b.py₆
foo₇ = 2
# main.py₀
from a₁ import *
print(foo₂)
# a.py₃
from b₄ import *
# b.py₅
foo₆ = 2
# main.py₀
class A₁:
foo₂ = 1
print(A₃.foo₄)
# main.py₀
bar₁ = 1
class A₂:
foo₃ = bar₄
print(A₅.foo₆)
# main.py₀
class A₁:
bar₂ = 1
def foo₃(x₄):
return A₅
print(foo₆(A₇)₈.bar₉)
# main.py₀
class A₁:
bar₂ = 1
def foo₃(x₄):
return x₅
print(foo₆(A₇)₈.bar₉)
With this set of rules for constructing stack graphs, bindings between references and definitions are represented by paths, where each edge leaves the symbol and scope stacks in a well-defined state, and where both stacks are empty at the beginning and end of the path.
For instance, in the “sequenced import star” example:
# main.py₀
from a₁ import *
print(foo₂)
# a.py₃
from b₄ import *
# b.py₅
foo₆ = 2
We can resolve the foo2
reference in
main.py
to the definition of foo6
in
b.py
because the following well-formed path exists:
Let’s formalize it!
Specification text can be dense, so throughout we’ll use commentary like this to explain what’s going on.
A symbol is a string of characters that represents a portion of source code.
A stack graph is a graph representing the name binding structure of a program. It consists of a list of nodes and a list of edges.
A node is any of the following:
A push symbol node is also called a reference node if it has a location.
A pop symbol node is also called a definition node if it has a location.
In the graphical notation, we make a clear visual distinction between “exported” and “internal” push and pop nodes. The only difference between an exported and internal node is whether it represents a reference or definition that the user can click on. Therefore, in this formalization, we don’t have separate node types for exported vs internal nodes; instead, the location field of a push or pop node is optional. If it’s present, then it’s a real reference or definition that the user can click on; if it’s missing, then it’s not!
An edge consists of a source node, a sink node, and a label.
The source node and sink node of every edge in a stack graph must be one of the nodes in the stack graph.
There is no edge in a stack graph whose source node is a jump to scope node.
A scoped symbol consists of a symbol and an optional nonempty list of scope identifiers.
A path consists of:
The source node of each edge in a path must be the same as the sink node of the previous edge.
If a path’s edges list is empty, then its starting node must be the same as its ending node.
If a path’s edges list is nonempty, then the starting node of the path must be the same as the source node of the first edge in the path, and the ending node of the path must be the same as the sink node of the last edge in the path.
In the formalization, every element of a symbol stack is called a “scoped symbol”, even if there’s no attached scope stack. For “non-scoped” symbols, the scope identifiers is just empty.
A path is complete if all of the following are true:
Its starting node is a reference node.
Its ending node is a definition node.
Its symbol stack and scope stack are empty.
A complete path is what represents a full name binding that resolves a reference to a definition.
This algorithm looks long, but that’s just because there are a bunch of different node types that we have to handle. Our goal is to add a new edge to a path, while ensuring that all of the “rules” are followed. (For instance, you can’t enter a pop node if the top of the symbol stack doesn’t match.)
We can attempt to append an edge to a path as follows:
If the source node of edge is not the same as the ending node of path, then abort.
Let node be the sink node of edge.
If node is a push symbol node, then:
Let scoped symbol be a new scoped symbol whose:
Push scoped symbol onto the beginning of the path’s symbol stack.
If node is a push scoped symbol node, then:
Let scoped symbol be a new scoped symbol whose:
Push scoped symbol onto the beginning of the path’s symbol stack.
If node is a pop symbol node, then:
If path’s symbol stack is empty, then abort.
Let top be the first element of path’s symbol stack.
If top’s symbol is not equal to node’s symbol, then abort.
If top’s scope identifier list is not missing, then abort.
Otherwise, pop top off the beginning of path’s symbol stack.
If node is a pop scoped symbol node, then:
If path’s symbol stack is empty, then abort.
Let top be the first element of path’s symbol stack.
If top’s symbol is not equal to node’s symbol, then abort.
If top’s scope identifier list is missing, then abort.
Otherwise, pop top off the beginning of path’s symbol stack.
Replace path’s scope stack with top’s scope identifier list.
If node is a drop scopes node, then:
Append edge to path’s edges list.
Set path’s ending node to be node.
If a path ends with a jump to scope node, then we have to actually perform the jump! There had better be a scope on the scope stack for us to jump to…
We can attempt to resolve a path as follows:
Let node be the ending node of path.
If node is a jump to scope node:
If path’s scope stack is empty, then the jump to scope node is invalid. Abort.
Pop scope identifier from the beginning of path’s scope stack.
Let resolved node be a new exported scope node whose scope identifier is scope identifier.
Let jump edge be a new edge whose:
jump
Append jump edge to path’s edges list.
Set path’s ending node to be resolved node.
a.k.a. “the path-finding algorithm”
This is a fixed-point algorithm — we start with a set of initial paths, and keep trying to grow each one for as long as we can. Whenever we encounter a complete path during this process, we stash it away as one of our final results.
Given a stack graph, we can find the set of name binding paths for references, which is a list of (some of the) reference nodes in stack graph, as follows:
Let current paths and complete paths be empty sets of paths.
For each reference in references:
Let initial path be a new path whose:
Append initial path to current paths.
While current paths is not empty:
Remove an arbitrary path from current paths.
Let extensions be all of the edges in stack graph whose source node is the same as path’s ending node.
For each edge in extensions:
If edge already appears in path’s edges list, skip it. [TODO: Investigate a better cycle detection heuristic.]
Let new path be a copy of path.
Attempt to append edge to new path. If unsuccessful, skip this edge.
Attempt to resolve new path. If unsuccessful, skip this edge.
If new path is complete, append it to complete paths.
Append new path to current paths.
Return complete paths.
Given the above definitions and algorithms, we can find the name binding path from foo2
to foo6
with the following sequence of paths:
Starting node | Ending node | Symbol stack | Scope stack |
---|---|---|---|
foo2 |
foo2 |
foo |
|
foo2 |
.1 |
. foo |
|
foo2 |
a1 |
a . foo |
|
foo2 |
a3 |
. foo |
|
foo2 |
.3 |
foo |
|
foo2 |
.4 |
. foo |
|
foo2 |
b4 |
b . foo |
|
foo2 |
b5 |
. foo |
|
foo2 |
.5 |
foo |
|
foo2 |
foo6 |
The final path is complete, and is therefore represents a valid name binding in the source program.
The stack graphs presented here are incremental, since we can produce the subgraph for each file without having to look at the contents of any other file in the repo, or of any upstream or downstream dependencies.
For instance, revisiting the “sequenced import star” example again, we can create separate subgraphs for each file:
# main.py₀
from a₁ import *
print(foo₂)
# a.py₃
from b₄ import *
# b.py₅
foo₆ = 2
This is great, because it means that when we receive a new commit for a repository, we only have to examine, and generate new stack subgraphs for, the files that are changed as part of that commit. We need to keep track of which per-file subgraphs belong to each commit, just like git has to do for the raw file content.
Having done that, one possible way to find name binding paths would be to load in all of the subgraphs that belong to the repo’s most recent commit, union them together into the combined graph for the commit, and run the path-finding algorithm described above. However, we think that this will require too much computation at query time.
Instead, we want to precompute parts of the path-finding algorithm, by calculating partial paths for each file. Because stack graphs have limited places where a path can cross from one file into another, we can calculate all of the possible partial paths that reach those “import/export” points. As we will see later, there is an efficient way to query for “compatible” partial paths, and to concatenate them together, producing the original “full” paths that represent our name bindings.
A file is a reference to a source code file.
Every node in a stack graph except for the root node belongs to a file. The root node belongs to every file.
A partial path consists of:
Every scope identifier that appears in a partial path’s symbol stack postcondition or scope stack postcondition must appear somewhere in its symbol stack precondition or scope stack precondition.
Each partial path belongs to a single file. The source node and sink node of all of the edges in the partial path belong to this file.
The main difference between a path and a partial path is that the latter can be used “in the middle of” on existing path. That means we have to keep track of the “precondition” of the partial path — that is, what needs to be true of the current symbol and scope stacks when we “paused” the path-finding algorithm, for this partial path to represent a valid sequence of next steps? As one example, the partial path might start with a pop node. In a regular path, that would be invalid — you can’t start with a pop node, since there’s nothing on the symbol stack to pop off! However, in a partial path, that’s perfectly fine! We just have to make sure that the “incoming” symbol stack starts with the correct symbol, which the partial path will try to immediately pop off.
A precondition symbol consists of:
A postcondition symbol consists of:
An attached scope list consists of:
An attached scope list is empty if its list of scope identifiers is empty and its scope list variable is missing.
A partial path is as complete as possible if all of the following are true:
Its starting node is the root node, an exported scope node, or a reference node:
Its ending node is the root node, a jump to scope node, or a definition node.
If its starting node is a reference node, then its symbol stack precondition is empty and its scope stack precondition is missing.
If its ending node is a definition node, then its symbol stack postcondition is empty and its scope stack postcondition is empty.
In our regular path-finding algorithm, we were able to keep growing each path until we ran out of eligible edges, because the full scope graph for the entire program was available to us. When we try to find patrial paths, that won’t be true — we need to grow the partial path “as much as we can” with the set of nodes and edges available in the current file.
A partial path is productive if any of the following are true:
Its starting node and ending node are different.
Its symbol stack precondition and symbol stack postcondition are different.
Its scope stack precondition and scope stack postcondition are different.
This bit isn’t fully fleshed out yet, to be honest, but it will be used to ensure that we’re not getting into cycles when stitching partial paths together.
This is very much like algorithm 1. The main difference is that we have to build up the preconditions for our partial paths as we’re “growing” them. For instance, when growing a path, trying to enter a pop node with an empty symbol stack is an error. When growing a partial path, that adds a new element to the partial path’s precondition.
We can attempt to append an edge to a partial path as follows:
If the source node of edge is not the same as the ending node of path, then abort.
Let node be the sink node of edge.
If node is a push symbol node, then:
Let postcondition symbol be a new postcondition symbol whose:
Push postcondition symbol onto the beginning of the path’s symbol stack postcondition.
If node is a push scoped symbol node, then:
Let attached scope list be a copy of path’s scope stack postcondition.
Prepend node’s scope identifier to attached scope list’s list of scope identifiers.
Let postcondition symbol be a new postcondition symbol whose:
Push postcondition symbol onto the beginning of the path’s symbol stack postcondition.
If node is a pop symbol node, then:
If path’s symbol stack postcondition is empty, then:
Let precondition symbol be a new precondition symbol whose:
Push precondition symbol onto the end of path’s symbol stack precondition.
Otherwise:
Let top be the first element of path’s symbol stack postcondition.
If top’s symbol is not equal to node’s symbol, then abort.
If top’s attached scope list is not missing, then abort.
Otherwise, pop top off the beginning of path’s symbol stack postcondition.
If node is a pop scoped symbol node, then:
If path’s symbol stack postcondition is empty, then:
Let precondition symbol be a new precondition symbol whose:
Push precondition symbol onto the end of path’s symbol stack precondition.
Replace path’s scope stack postcondition with an attached scope list whose list of scope identifiers is empty, and whose scope list variable is precondition symbol’s scope list variable.
Otherwise:
Let top be the first element of path’s symbol stack postcondition.
If top’s symbol is not equal to node’s symbol, then abort.
If top’s attached scope list is missing, then abort.
Otherwise, pop top off the beginning of path’s symbol stack postcondition.
Replace path’s scope stack postcondition with top’s attached scope list.
If node is a drop scopes node, then:
Append edge to path’s edges list.
Set path’s ending node to be node.
This is very similar to algorithm 2. The main difference is that we might not know in advance which scope we’re going to jump to. (That can happen when the partial path has a scope stack precondition — if the scope stack needs to be set up correctly before entering the partial path, then the scope that we jump to will be different depending on what other paths we stitch this partial path together with.)
We can attempt to resolve a partial path as follows:
Let node be the ending node of path.
If node is a jump to scope node:
Let postcondition be path’s scope stack postcondition.
If postcondition is empty, then the jump to scope node is invalid. Abort.
If postcondition’s list of scope identifiers is empty, then we don’t have enough information to resolve the jump to scope node yet. Return.
Otherwise, pop scope identifier from the beginning of postcondition’s list of scope identifiers.
Let resolved node be a new exported scope node whose scope identifier is scope identifier.
Let jump edge be a new edge whose:
jump
Append jump edge to path’s edges list.
Set path’s ending node to be resolved node.
And this is almost a carbon-copy of algorithm 3; we just use our new partial-path-related definitions of appending and resolving.
We can find the set of partial paths for a stack subgraph as follows:
Let current paths and complete paths be empty lists of partial paths.
For each node in stack subgraph that is the root node, a reference node, or an exported scope node whose scope identifier is the scope identifier of some push scoped symbol node in stack subgraph:
Let initial scope stack be a fresh scope list variable.
Let initial path be a new partial path whose:
If node is a reference node, then:
Let initial symbol be a new postcondition symbol whose:
Append initial symbol to initial path’s symbol stack postcondition.
Clear initial path’s scope stack precondition and scope stack postcondition.
Append initial path to current paths.
While current paths is not empty:
Remove an arbitrary path from current paths.
Let extensions be all of the edges in stack graph whose source node is the same as path’s ending node.
For each edge in extensions:
If edge already appears in path’s edges list, skip it. [TODO: Investigate a better cycle detection heuristic.]
Let new path be a copy of path.
Attempt to append edge to new path. If unsuccessful, skip this edge.
Attempt to resolve new path. If unsuccessful, skip this edge.
If new path is as complete as possible and productive, append it to complete paths.
Append new path to current paths.
Return complete paths.
Having described how to generate partial paths for a stack graph, we can now talk about what we want to do with them. Our core operation is that we can concatenate partial paths together. Our conjecture is that for any complete path in the original stach graph (which included nodes and edges for all of the files that we care about), there is some sequence of partial paths (each of which comes from exactly one file) that we can concatenate together to yield the complete path. Moreover, as we are building up a potential path, we need it to be easy to query our database to find “eligible” partial paths that we can concatenate on next. If all of that is true, then we have an efficient incremental version of our original path-finding algorithm.
A path (left) is compatible with a partial path (right) if all of the following are true:
Either of the following are true:
The ending node of left and the starting node of right are both the root node.
The ending node of left and the starting node of right are both exported scope nodes, whose scope identifiers are the same.
The symbol stack of left satisfies the symbol stack precondition of right.
The scope stack of left satisfies the scope stack precondition of right.
A symbol stack satisfies a symbol stack precondition if all of the following are true:
The length of symbol stack is ≥ the length of symbol stack precondition.
For each precondition symbol in symbol stack precondition, and corresponding scoped symbol in symbol stack, all of the following are true:
The precondition symbol’s symbol and the scoped symbol’s symbol are the same.
The precondition symbol’s scope list variable and the scoped symbol’s list of scope identifiers are either both missing or both present.
A scope stack satisfies a scope stack precondition if the following is true:
If the scope stack precondition’s scope list variable is missing, then the scope stack is empty.
or, equivalently
If the scope stack is not empty, then the scope stack precondition’s scope list variable is not missing.
These three definitions — compatibility, symbol stack satisfaction, and scope stack satisfaction — are probably the most important parts of this specification, because they translate into efficient database queries if you happen to store your pile of partial paths in a relational database!
We can attempt to promote a partial path to a path as follows:
If partial path’s symbol stack precondition is not empty, abort.
If partial path’s scope stack precondition is not missing, abort.
Note: Because of the above two checks, we know that there are no scope identifiers in partial path’s precondition. That means that there cannot be any scope identifiers in its postcondition, either.
Let symbol stack be a new empty list of scoped symbols.
For each postcondition symbol in partial path’s symbol stack postcondition:
Let scoped symbol be a new scoped symbol whose:
If postcondition symbol’s attached scope list is present, set scoped symbol’s list of scope identifiers to be the same as attached scope list’s list of scope identifiers.
Append scoped symbol to symbol stack.
Let path be a new path whose:
Return path.
This is admittedly a bit dense. Conceptually, it’s not that bad — we just want to append the edges from right onto the edges of left, giving us a new path. However, we also have to make sure that the partial path’s precondition is satisfied. And because the precondition can contain scope list variables, we also have to figure out what parts of the incoming symbol and scope stacks those variables “capture”. And then we have to construct the new symbol and scope stacks using those bindings. Which…takes a few steps to plumb everything through.
We can attempt to concatenate a path left and a partial path right as follows:
If left is not compatible with right, abort.
Let bindings be a new empty mapping from scope list variables to lists of scope identifiers.
Let new symbol stack be a copy of left’s symbol stack.
For each precondition symbol in right’s symbol stack precondition, and corresponding scoped symbol in left’s symbol stack:
Remove scoped symbol from new symbol stack.
If precondition symbol’s scope list variable is not missing, then:
Let scope list variable be precondition symbol’s scope list variable.
Let scope identifiers be scoped symbol’s list of scope identifiers.
(We know that this field must be present because we’ve already checked that left and right are compatible.)
Add a mapping from scope list variable to scope identifers to bindings.
(We know that there can’t already be an entry for scope list variable because scope list variables are required to be unique across the entirety of a partial path.)
If right’s scope stack precondition’s scope list variable is not missing, then:
Let scope list variable be right’s scope stack precondition’s scope list variable.
Let scope identifiers be left’s scope stack.
Add a mapping from scope list variable to scope identifers to bindings.
(We know that there can’t already be an entry for scope list variable because scope list variables are required to be unique across the entirety of a partial path.)
For each postcondition symbol in right’s symbol stack postcondition, in reverse order:
Let scoped symbol be a new scoped symbol whose:
If postcondition symbol’s attached scope list is not missing:
Let attached scope list be postcondition symbol’s attached scope list.
Let scope identifiers be a copy of attached scope list’s list of scope identifiers.
If attached scope list’s scope list variable is not missing:
Let bound scope identifiers be the result of looking up attached scope list’s scope list variable in bindings.
We know this entry must be present because in a well-formed partial path, any scope list variable mentioned in the postcondition must also be present in the precondition.
Append bound scope identifiers to the end of scope identifiers.
Set scoped symbol’s list of scope identifiers to be scope identifiers.
Prepend scoped symbol to the beginning of new symbol stack.
With scope stack postcondition as right’s scope stack postcondition, do the following:
Let new scope stack be a copy of scope stack postcondition’s list of scope identifiers.
If scope stack postcondition’s scope list variable is not missing:
Let bound scope identifiers be the result of looking up scope stack postcondition’s scope list variable in bindings.
We know this entry must be present because in a well-formed partial path, any scope list variable mentioned in the postcondition must also be present in the precondition.
Append bound scope identifiers to the end of new scope stack.
Let new path be a new path whose:
Given a set of files and a set of reference nodes, we can find the set of paths that define which definitions each of those references map to:
For the in-repository case, files will be all of the files of the current repo at the current commit. For cross-repository, files will contain that, as well as all of the files in any upstream dependency’s repository (at the most recent commit that satisfies the current repo’s package manifest).
Let database be the set of partial paths that belong to any file in files.
Let current paths and complete paths be new empty sets of paths.
For each partial path in database whose starting node is an element of reference nodes:
Try to promote partial path into an initial path. If unsuccessful, skip this partial path.
Add initial path to current paths.
While current paths is not empty:
Remove an arbitrary path from current paths.
Let extensions be the set of partial paths in database that are compatible with path.
For each partial path in extensions:
[TODO: We need a cycle detection heuristic here too.]
Let new path be the result of attempting to concatenate partial path onto path. If unsuccessful, skip this partial path.
Attempt to resolve new path. If unsuccessful, skip this partial path.
If new path is complete, append it to complete paths.
Append new path to current paths.
Return complete paths.
Do the same thing as above, but you know, backwards. [TODO: Do this for real — we’ll need equivalents for algorithms 8 and 9, I think.]
Given the above, we can find all of the partial paths in our three example files:
# main.py₀
from a₁ import *
print(foo₂)
Symbols
Symbol | Location | Type |
---|---|---|
__main__0 |
null | Def |
a1 |
1,5 - 1,6 | Ref |
foo2 |
2,6 - 2,9 | Ref |
The definition for the file’s module has a null location because it’s the file as a whole that defines the module; the textual name of the module doesn’t actually appear anywhere in the file.
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
a1 |
P | root | a |
|||
foo2 |
P×5 | root | a . foo |
|||
__main__ |
σ | root | P | __main__0 |
σ | |
__main__ . |
σ | root | P×8 | root | σ | a . |
Note that when a partial path starts at a reference, or ends at a definition, what we’re storing is a reference to the corresponding symbol entry, not the text of the symbol.
# a.py₃
from b₄ import *
Symbols
Symbol | Location | Type |
---|---|---|
a3 |
null | Def |
b4 |
1,5 - 1,6 | Ref |
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
b4 |
P | root | b |
|||
a |
σ | root | P | a3 |
σ | |
a . |
σ | root | P×7 | root | σ | b . |
# b.py₅
foo₆ = 2
Symbols
Symbol | Location | Type |
---|---|---|
b5 |
null | Def |
foo6 |
1,0 - 1,3 | Def |
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
b |
σ | root | P | b5 |
σ | |
b . foo |
σ | root | PPPPP | foo6 |
σ |
User clicks on foo2
in main.py
. Load in all of the partial paths that start with that reference, and promote them to paths. The current paths set starts off as:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
1 | foo2 |
P×5 | root | a . foo |
Path 1 has two compatible partial paths in a.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
1a | a |
σ | root | P | a3 |
σ | |
1b | a . |
σ | root | P×7 | root | σ | b . |
Concatenating path 1 with partial path 1a results in an invalid path (it ends in a definition but doesn’t have an empty ending symbol stack), so we discard it.
After concatenating path 1 with the partial path 1b, the current paths list at the end of this iteration is:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
2 | foo2 |
P×12 | root | b . foo |
In the binding set that we create while concatenating, σ is mapped to an empty scope stack.
Path 2 has two compatible partial paths in b.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
2a | b |
σ | root | P | b5 |
σ | |
2b | b . foo |
σ | root | P×5 | foo6 |
σ |
Concatenating path 2 with partial path 2a results in an invalid path (it ends in a definition but doesn’t have an empty ending symbol stack), so we discard it.
After concatenating path 2 with partial path 2b, the current paths list at the end of this iteration is:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
3 | foo2 |
P×17 | foo6 |
In the binding set that we create while concatenating, σ is mapped to an empty scope stack.
All of the current paths are valid and complete, and so we have our set of resolved definitions.
We can split the “print the value of a class field through a function parameter” example into separate files:
# main.py₀
from a₁ import *
from b₂ import *
print(foo₃(A₄)₅.bar₆)
Symbols
Symbol | Location | Type |
---|---|---|
__main__0 |
null | Def |
a1 |
1,5 - 1,6 | Ref |
b2 |
2,5 - 2,6 | Ref |
foo3 |
3,6 - 3,9 | Ref |
A4 |
3,10 - 3,11 | Ref |
bar6 |
3,13 - 3,16 | Ref |
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
a1 |
P | root | a |
|||
b2 |
P | root | b |
|||
foo3 |
P×5 | root | b . foo |
|||
foo3 |
P×6 | root | a . foo |
|||
A4 |
P×5 | root | b . A |
|||
A4 |
P×6 | root | a . A |
|||
bar6 |
P×8 | root | b . foo () /5 . bar |
|||
bar6 |
P×9 | root | a . foo () /5 . bar |
|||
0 |
σ | (5) | P×7 | root | σ | b . A |
0 |
σ | (5) | P×8 | root | σ | a . A |
__main__ |
σ | root | P | __main__0 |
σ | |
__main__ . |
σ | root | P×8 | root | σ | b . |
__main__ . |
σ | root | P×9 | root | σ | a . |
# a.py₇
def foo₈(x₉):
return x₁₀
Symbols
Symbol | Location | Type |
---|---|---|
a7 |
null | Def |
foo8 |
1,4 - 1,7 | Def |
x9D |
1,8 - 1,9 | Def |
x10 |
2,11 - 2,12 | Ref |
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
x10 |
P×4 | x9D |
||||
a |
σ | root | P | a7 |
σ | |
a . foo |
σ | root | P×5 | foo8 |
σ | |
a . foo () /σ1 |
σ | root | P×13 | (•) | σ1 | x |
a . foo () /σ1 |
σ | root | P×13 | (•) | σ1 | 0 |
# b.py₁₁
class A₁₂:
bar₁₃ = 1
Symbols
Symbol | Location | Type |
---|---|---|
b11 |
null | Def |
A12 |
1,6 - 1,7 | Def |
bar13 |
2,4 - 2,7 | Def |
Partial paths
Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|
b |
σ | root | P | b11 |
σ | |
b . A |
σ | root | P×5 | A12 |
σ | |
b . A . bar |
σ | root | P×8 | bar13 |
σ | |
b . A () /σ1 . bar |
σ | root | P×11 | bar13 |
σ1 |
User clicks on bar6
in main.py
. Load in all of the partial paths that start with that reference, and promote them to paths. The current paths set starts off as:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
1 | bar6 |
P×8 | root | b . foo () /5 . bar |
|
2 | bar6 |
P×9 | root | a . foo () /5 . bar |
Path 1 has one compatible partial path in b.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
1a | b |
root | P | b11 |
However, concatenating the two together results in an invalid path (since it ends at definition b11
but does not have an empty ending symbol stack). So we discard path 1.
Path 2 has four compatible partial paths in a.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
2a | a |
σ | root | P | a7 |
σ | |
2b | a . foo |
σ | root | P×5 | foo8 |
σ | |
2c | a . foo () /σ1 |
σ | root | P×13 | (•) | σ1 | x |
2d | a . foo () /σ1 |
σ | root | P×13 | (•) | σ1 | 0 |
Concatenating path 2 with partial paths 2a or 2b results in invalid paths (since they end at definition a7
and foo8
but do not have an empty ending symbol stack or scope stacks). So we discard partial paths 2a and 2b.
After concatenating path 2 with partial paths 2c and 2d respectively, the current paths list at the end of this iteration is:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
3’ | bar6 |
P×9, jump , P×13 |
(•) | 5 | x . bar |
4’ | bar6 |
P×9, jump , P×13 |
(•) | 5 | 0 . bar |
In the binding set that we create while concatenating, σ is mapped to an empty scope stack, and σ1 is mapped to the scope stack ⟨5⟩.
Each of these paths end in a jump to scope node, which we then resolve:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
3 | bar6 |
P×9, jump , P×13 |
(5) | x . bar |
|
4 | bar6 |
P×9, jump , P×13 |
(5) | 0 . bar |
Path 3 has no compatible partial paths in any file, so we discard it.
Path 4 has two compatible partial paths in main.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
4a | 0 |
σ | (5) | P×7 | root | σ | b . A |
4b | 0 |
σ | (5) | P×8 | root | σ | a . A |
After concatenating path 4 with partial paths 4a and 4b respectively, the current paths list at the end of this iteration is:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
5 | bar6 |
P×9, jump , P×20 |
root | b . A . bar |
|
6 | bar6 |
P×9, jump , P×21 |
root | a . A . bar |
In the binding set that we create while concatenating, σ is mapped to an empty scope stack.
Path 6 has one compatible partial path in a.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
6a | a |
σ | root | P | a7 |
σ |
However, concatenating the two together results in an invalid path (since it ends at definition a7
but does not have an empty ending symbol stack). So we discard path 6.
Path 5 has three compatible partial paths in b.py
:
# | Symbols | Scopes | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|---|---|
5a | b |
σ | root | P | b11 |
σ | |
5b | b . A |
σ | root | P×5 | A12 |
σ | |
5c | b . A . bar |
σ | root | P×8 | bar13 |
σ |
Concatenating path 5 with partial paths 5a or 5b results in invalid paths, so we discard them.
After concatenating path 5 with partial path 5c, the current paths list at the end of this iteration is:
# | From | Edges | To | Scopes | Symbols |
---|---|---|---|---|---|
7 | bar6 |
P×9, jump , P×28 |
bar13 |
In the binding set that we create while concatenating, σ is mapped to an empty scope stack.
All of the current paths are valid and complete, and so we have our set of resolved definitions.