The main Kite repo (originally `kiteco/kiteco`) was intended for private use. It has been lightly adapted for publication here by replacing private information with `XXXXXXX`. As a result, many components here may not work out of the box.
Q: How did we analyze all the code on Github?
We used a variety of infrastructure, on a mix of cloud platforms depending on what was the most economical, though it was mostly on AWS.
We mostly used map-reduce to manage workflows that needed to run over large datasets. You can see a list of some of our map-reduce jobs here (`local-pipelines`) and here (`emr-pipelines`). I believe tasks in `local-pipelines` are intended to be run on single instances, whereas EMR is AWS's map-reduce infrastructure.
Here are some example tasks, with a particular focus on Python analysis:
- Github crawler
- Package exploration, which imports and introspects Python packages (as you can imagine, this is very tricky code to get right); see the README.md in that directory
- Related: import exploration, which, if I understand correctly, runs package exploration and other logic to e.g. upload results to s3
- As an example of what package exploration produces, I'm attaching (see the `readme_assets` folder) the graph it produced for numpy 1.14.0. For example, if you load the JSON object as `o`, then `o["shards"]["numpy"]["139784536077376"]` will give you the object `numpy.polynomial.tests.test_laguerre.TestEvaluation.c1d` (I just picked one node from the graph at random), and you will see that it has members like `min`, `conjugate`, and `tofile`, with pointers for each of those to their node from the graph.
- (also extracts docstrings)
- Type induction (the logic), which statistically estimates the return types of functions based on attributes accessed on their return values across Github
- I'm attaching (see the `readme_assets` folder) `type-induction-numpy.json.zip`, which contains the output of type induction for numpy. For example, if this JSON file is loaded as `o`, then at `o[17]` you will see that there is a 51.9% probability that `numpy.matrix.argmin` returns a `numpy.matrix`.
- Dynamic analysis, which runs short Python scripts like this one (we have a set of 2-3k of these that cover 50% of all open-source Python libraries when weighted by usage/popularity) and extracts type information at runtime
- I'm attaching (see the `readme_assets` folder) `20190320.json.zip`, which contains all of the information extracted from this process. As an example, the JSON object on line 34 tells us that `jinja2.environment.Template.render` is a function that returned a `__builtin__.unicode`.
- Extraction of return types from docs
- I'm attaching (see the `readme_assets` folder) `docs_returntypes.json.zip`, which should be pretty self-explanatory. My main comment is that, as in other datasets, the large numbers are pointers to other nodes in the graph.
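To make the shape of the package-exploration output described above more concrete, here is a minimal sketch. It is not Kite's explorer: it walks a package with `inspect`, keys each node by `id()` (an assumption, suggested by the large integer pointers in the datasets), and stores members as pointers to other nodes. The `explore` function and the `name`/`type`/`doc`/`members` field names are hypothetical.

```python
import inspect
import json  # explored below only because it is a small stdlib package


def explore(root, max_nodes=500):
    """Breadth-first walk of a package, returning {node_id: node}."""
    prefix = root.__name__
    graph = {}
    queue = [(prefix, root)]
    while queue and len(graph) < max_nodes:
        name, obj = queue.pop(0)
        node_id = str(id(obj))
        if node_id in graph:
            continue  # already reached under another name
        node = {
            "name": name,
            "type": type(obj).__name__,
            "doc": inspect.getdoc(obj),  # docstring, or None
            "members": {},
        }
        graph[node_id] = node
        # Descend only into classes and into modules of the package itself.
        in_package = inspect.ismodule(obj) and (
            obj.__name__ == prefix or obj.__name__.startswith(prefix + ".")
        )
        if not (in_package or inspect.isclass(obj)):
            continue
        for attr, value in vars(obj).items():
            if attr.startswith("_"):
                continue
            node["members"][attr] = str(id(value))  # pointer, not a copy
            queue.append((f"{name}.{attr}", value))
    return graph


graph = explore(json)
root = graph[str(id(json))]
dumps_node = graph[root["members"]["dumps"]]  # follow a member pointer
print(dumps_node["name"], dumps_node["type"])
```

Following `root["members"]["dumps"]` mirrors the lookup pattern described for the numpy graph: a member name maps to an opaque node key, which is then resolved in the same graph.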
Several return type sources are unified in this command.
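As a rough illustration of one of those sources, the type-induction idea described above (estimating a function's return type from the attributes accessed on its return value) can be sketched with a naive-Bayes-style score. This is an assumption-laden toy, not Kite's model: the candidate types, their attribute sets, the `miss` smoothing value, and the observed counts are all made up for the example.

```python
from collections import Counter
from math import exp, log

# Hypothetical candidate types and the attributes each one exposes.
CANDIDATES = {
    "numpy.matrix": {"argmin", "T", "shape", "getA", "tolist"},
    "numpy.ndarray": {"argmin", "T", "shape", "tolist"},
    "int": {"bit_length", "real"},
}


def induce_return_type(observed, candidates=CANDIDATES, miss=0.01):
    """Return {type: probability} given attribute-access counts.

    Each access is treated as independent evidence: an attribute the type
    has is assumed to be drawn uniformly from its attributes, and one it
    lacks gets the small probability `miss`. Priors are uniform.
    """
    log_scores = {}
    for type_name, attrs in candidates.items():
        score = 0.0
        for attr, count in observed.items():
            p = (1.0 - miss) / len(attrs) if attr in attrs else miss
            score += count * log(p)
        log_scores[type_name] = score
    # Normalize in log space to avoid underflow on large counts.
    top = max(log_scores.values())
    weights = {t: exp(s - top) for t, s in log_scores.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Made-up counts of attributes accessed on some function's return value.
observed = Counter({"shape": 40, "T": 25, "getA": 10})
probs = induce_return_type(observed)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The point of the sketch is only that usage evidence (here, accesses to `getA`, which only one candidate has) can separate otherwise similar types; a production system would need real priors and a far larger candidate set.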
A lot of this pipeline seems to be orchestrated through this Makefile. It is documented at a high level here.
This pipeline results in a number of files per package::version, with the following elements:
- SymbolGraph (graph of entities)
- ArgSpec (function signatures)
- PopularSignatures (function-call patterns that are popular on Github, e.g. "how do people most commonly call matplotlib.pyplot.plot?")
- SignatureStats
- Documentation
- SymbolCounts
- aside: here is an example SymbolCounts entry from numpy: `{"Symbol":"numpy.distutils.fcompiler","Data":{"Import":901,"Name":780,"Attribute":711,"Expr":1501,"ImportThis":406,"ImportAliases":{"FC":2,"_fcompiler":2,"fcompiler":8}}}`
- this means that `numpy.distutils.fcompiler` is imported 901 times, used in an expression 1501 times, and is most commonly imported under the alias "fcompiler", although sometimes as "FC" or "_fcompiler"
- Kwargs
- ReturnTypes
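For readers who want to poke at these resources, here is a small sketch of reading a SymbolCounts entry, using the numpy example quoted in the list above. The `summarize` helper and the assumption that each entry is a standalone JSON object are mine; only the field names that appear in the quoted entry are taken from the data.

```python
import json

# The example entry quoted above, verbatim.
LINE = (
    '{"Symbol":"numpy.distutils.fcompiler","Data":{"Import":901,"Name":780,'
    '"Attribute":711,"Expr":1501,"ImportThis":406,'
    '"ImportAliases":{"FC":2,"_fcompiler":2,"fcompiler":8}}}'
)


def summarize(line):
    """Pull out the counts discussed above from one SymbolCounts entry."""
    entry = json.loads(line)
    data = entry["Data"]
    aliases = data.get("ImportAliases", {})
    # The most frequent "import ... as <alias>" name, if any were seen.
    top_alias = max(aliases, key=aliases.get) if aliases else None
    return {
        "symbol": entry["Symbol"],
        "imports": data["Import"],
        "expr_uses": data["Expr"],
        "top_alias": top_alias,
    }


summary = summarize(LINE)
print(summary)
```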
I'm attaching (see the `readme_assets` folder) the final resource build for numpy here as "resource-manager-numpy.zip". You can download the 800MB zip file with all the Python open-source packages here.
The bullet list of resources above is from the code here. You can "find references" to see how these files get loaded from disk. In the Kite client the resource manager's main entry point is here. Note this class includes code for dynamically loading and unloading packages' data into memory to conserve end-user memory.
By the way, we are happy to share any of our open-source-derived data. Our Github crawl is about 20 TB, but for the most part the intermediate and final pipeline outputs are pretty reasonably-sized. Although please let me know soon if you want anything because we will likely end up archiving all of this.
To reiterate, we invested a few $million into our Python tech, so you should find it to be pretty robust and high quality, which is why I'm doing some moonlight work to try to give it a shot at not getting lost.
Q: Is this infrastructure incremental?
Generally, no. Fortunately it didn't really need to be. I can't recall how long it took to run the full Python analysis end to end --- it was more than a day but I think less than a week.
Q: How often did you re-run data collection and analysis of GitHub code?
We ran several Github crawls throughout our time. I think there were something like ~4 successive crawls during a ~7 year period. Things do change, but not super frequently. The other Python package exploration is much cheaper to run so we ran it more often.
Q: How do you deploy your ML models?
Here are some highlights:
- Everything is in a repeatable, code-defined pipeline
- Some (most?) resources don't need to change often, so we didn't build everything from scratch on every build
- Especially for ML models, we wanted to do human review of model performance for every model we shipped; we shipped a new version of the client weekly, so all the more reason that we couldn't retrain from scratch for every build
- We used incremental updates, basically binary patches, to reduce the bandwidth consumption of every update
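The last point, shipping binary patches instead of full files, can be illustrated with a toy copy/insert patch format. This is only a sketch of the general idea, built on the standard library's `difflib`; the op format and function names are hypothetical, and it says nothing about the patch format Kite's updater actually used.

```python
from difflib import SequenceMatcher


def make_patch(old: bytes, new: bytes):
    """Encode `new` as copy-from-old ranges plus literal inserted bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse bytes the client already has
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # only these bytes are shipped
        # "delete" needs no op: the old bytes are simply not copied
    return ops


def apply_patch(old: bytes, ops) -> bytes:
    """Rebuild the new file from the old file and the patch ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Made-up "model files" that differ only in a version byte and a suffix.
old = b"model-v1:" + bytes(range(256)) * 4
new = b"model-v2:" + bytes(range(256)) * 4 + b"extra"
patch = make_patch(old, new)
shipped = sum(len(op[1]) for op in patch if op[0] == "data")
rebuilt = apply_patch(old, patch)
print(f"full size: {len(new)} bytes, literal bytes in patch: {shipped}")
```

When most of a resource is unchanged between weekly builds, the literal bytes in the patch are a small fraction of the full file, which is where the bandwidth saving comes from.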
Q: How did you measure the quality of your models?
I'm not sure I can shed much light here, but here's a rough pass:
- Background: We used TensorFlow to train models offline, and to do online inference on end-users' machines. (I know TabNine used TensorFlow to train, and rolled their own inference on the client side)
- We used TensorBoard to monitor the training process
- I know the team invested in scrappy solutions for managing training workflows, e.g. Slack notifications when builds failed or finished, etc.
- I don't know the technical details of how we did cross-validation, metrics for model success, etc. You may be able to find it in the Github repo.
In terms of the infrastructure and code:
- Since my answer to the top question above is mostly focused on our python analysis pipeline, rather than ML pipelines, here is where you can find the code and scripts related to training our "lexical" (GPT-2) models.
- To save on cloud costs we bought our own GPU-equipped machine for training from Lambda Labs. One of our engineers used it as a room heater in his apartment during the COVID lockdown 🙂
Btw we also trained a simple model to mix lexical/GPT-2 and other completions. (short product spec attached as "Product Spec_ Multi-provider Completions.pdf")
(Bonus: I'm attaching (see the `readme_assets` folder) our product spec for lexical completions here as "Product Spec_ Lexical Completions.pdf")
Q: Did you implement your own parsers or reuse existing ones?
We implemented our own Python parser in Golang. It is robust to syntax errors, e.g. it can parse partial function calls. It can be found here.
We also did some parser / formatter work with JavaScript, but did not finish it. We ended up using tree-sitter for some things after it came out.
Q: Could you do code linting and refactorings, given that the data about API usages you collect is never complete?
We did not try to do this very much. We did some experimentation with linting, but to your point having a noisy linter can be worse than no linter at all. I think it's harder to use ML for linting than completions or other use cases for this reason.
Q: Did you try to pivot to other usages of ML code analysis like automatic code reviews, security checks, etc?
Yes we did some experimentation on a number of different ideas in late 2020 / early 2021.
- Synthesizing status summaries: From an ML perspective, the idea is to use Github PR titles to train a model that can generate "PR titles" from code changes, making it easier for developers to share descriptions of the work they've been doing.
- ML-enhanced code search and navigation (see attached (see the `readme_assets` folder) "Code search product analysis and roadmap.pdf"): one of the key ideas being using ML to annotate a graph of relationships between code entities, so you could e.g. right-click on a string referring to a file and see "See three references to this file". (see the image below.) There was also a playbook for using a presence on developer desktops to get widespread adoption across teams.
- We built some prototype tech for mapping between code, issues, and wiki content. These models performed pretty well.
- We went on a long product strategy process wherein we spoke with something like ~50 individual developers, eng managers, etc. You can see some of the ideas that made it the furthest in the attached (see the `readme_assets` folder) "Kite - Wave 5 - Product concepts for Engineering.pdf". They included smart error reporting, dynamic logs, log debugging helper, code nav/search (mentioned above), and semantic/keyword code search.
- In case either of these resonates, here are another couple lenses on some subset of the ideas we brainstormed:
Our codebase is primarily located at github.com/kiteco/kiteco (http://github.com/kiteco/kiteco). There are a few auxiliary repositories that host very experimental code, but the goal is to make the “kiteco” repository the point of truth for all of our services.
- Our codebase is primarily Go. (`kite-go`, `kite-golib` directories)
- Infrastructure uses Terraform (for AWS) provisioning, and Fabric/shell scripts for deployment and management of remote hosts (`devops` directory)
- You need VPN credentials to access any of our remote AWS (or Azure) hosts.
- Platform-specific logic & instructions live in subdirectories `osx`, `windows`, `linux`. You probably don't need these.
We use Git LFS to store our various `bindata.go` files. You will need to install the command line tool to get the contents of those files when you pull the repository. Installation instructions are on their website, but for macOS you can install it by running (from inside the `kiteco` repository):
brew update
brew install git-lfs
git lfs install
Then do a `git pull` to get the bindata.go files. If they do not download from LFS, try running `git lfs pull` (you should only need to do this once; subsequent `git pull`s should update the bindata correctly).
`git lfs install` installs a smudge filter that automatically downloads and replaces the contents of newly checked out "pointer files" with their content.
By default smudge filters operate on checked out blobs in sequence, so cannot download in batch as would typically happen when running `git lfs pull`.
Furthermore, by default, git checkouts will block on downloading the new LFS files, which can be annoying.
You might prefer to disable the smudge filter (this can be run even if you've already run the regular `git lfs install`):
git lfs install --skip-smudge
git lfs pull
Then, when building after a new checkout, you may see an error of the form "expected package got ident."
This occurs because `go` reads some Go files and sees the Git LFS pointers instead of the actual data file.
At this point, you can download the latest files with `git lfs pull`, and rebuilding should work.
Nothing needs to be done when pushing LFS blobs. That will still happen automatically.
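If you hit the build error described above and want to confirm that unsmudged pointers are the cause, a small check like the following can help. It is a sketch, not part of the repo: the helper names are hypothetical, and it relies only on the fact that Git LFS pointer files begin with a `version https://git-lfs.github.com/spec/v1` line.

```python
import os
import tempfile

LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file still holds a Git LFS pointer instead of real content."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


def find_pointers(root, filename="bindata.go"):
    """Yield paths under `root` named `filename` that are unsmudged pointers."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            if is_lfs_pointer(path):
                yield path


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as repo:
    pointer = os.path.join(repo, "bindata.go")
    with open(pointer, "wb") as f:
        f.write(LFS_HEADER + b"\noid sha256:" + b"0" * 64 + b"\nsize 123\n")
    stale = list(find_pointers(repo))
    print(stale)
```

In a real checkout you would call `find_pointers` on the repository root; any path it reports is a file that `git lfs pull` still needs to fetch.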
The bulk of our code is currently in Go. This can be found at github.com/kiteco/kiteco/kite-go (http://github.com/kiteco/kiteco/kite-go). To get started working in this part of the codebase, first make sure you have your Go environment set up correctly (i.e. Go is installed, $GOPATH is set, etc.).
Locally, however, you will need to install Go 1.15.3. The following steps will get you going.
Set `$GOPATH` in your .profile / .bashrc / .bash_profile / .zshrc, e.g.:
export GOROOT=/usr/local/go
export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
Make sure to create these directories as well:
mkdir $HOME/go
mkdir $HOME/go/src $HOME/go/bin $HOME/go/pkg
If you are on a Mac and set the above in either .bashrc or .zshrc, make sure to load it in either your .profile or .bash_profile. See this for an explanation.
It would be useful to become familiar with how `go` code is organized. Check out https://golang.org/doc/code.html for more on this topic.
Navigate to where the `kiteco` repo will live in your `GOPATH`, and clone the repo.
# Create kiteco directory within GOPATH, and clone the repo there
mkdir -p ~/go/src/github.com/kiteco
cd ~/go/src/github.com/kiteco
git clone git@github.com:kiteco/kiteco
To install the latest version of Go that's compatible with our codebase, run:
cd ~/go/src/github.com/kiteco/kiteco
cd devops/scripts
./install-golang.sh
From here, just run `make install-deps` from the root of the `kiteco` repo to get basic utilities installed.
# Install dependencies
make install-deps
Use `./scripts/update-golang-version.sh` if you'd like to make Kite require a newer version of Golang.
For development builds (see below), you may need to have Tensorflow installed globally on your system.
make install-libtensorflow
You're now ready to build Kite! First, build the sidebar for your platform:
./osx/build_electron.sh force
# ./linux/build_electron.sh force
# ./windows/build_electron.sh force
This process is asynchronous to the Kite daemon build, so you must manually rebuild the sidebar as needed.
Now build and run Kite:
make run-standalone
Note that this is not a full Kite build, but is the recommended approach for development, as it is much faster. Some functionality is disabled in the development build (depending on the platform):
- Kite system tray icon
- Updater service
You should be able to develop, build, and test Kite entirely on your local machine. However, we do have cloud instances & VMs available for running larger jobs and for testing our cloud services.
We use the Go Modules system for dependency management.
General tips:
- make sure you are in `~/go/src/github.com/kiteco/kiteco` and not a symlink
- make sure deps are updated to the versions in `go.mod`: `go mod download`
- Set `$GOPRIVATE` in your .profile / .bashrc / .bash_profile / .zshrc, e.g.: `export GOPRIVATE=github.com/kiteco/*`.
To add or update a dependency, all you need to do is `go get` it, which will automatically update the `go.mod` and `go.sum` files. To remove a dependency, remove references to it in the code and run `go mod tidy`. In general, make sure to run `go mod tidy` to make sure all new dependencies have been added and unused ones have been removed before committing any dependency changes.
The process for updating a dependency is:
- `go get -u github.com/foo/bar`
- (optional) run any `go` command, such as `go build`, `go test`
- `go mod tidy`
- `git add go.mod go.sum`
- `git commit ...`
The process for adding a dependency is:
- `go get github.com/foo/bar`
- edit code to import "github.com/foo/bar"
- `go mod tidy`
- `git add go.mod go.sum`
- `git commit ...`
`godep` may attempt to clone private repositories via HTTPS, requiring manual authentication.
Instead, you can add the following section to your `~/.gitconfig` in order to force SSH authentication:
[url "git@github.com:"]
insteadOf = https://github.com/
We bundle a lot of pre-computed datasets & machine learning models into the Kite app through the use of a custom filemap & encoding on top of go-bindata. The data, located in `kite-go/client/datadeps`, is kept in Git-LFS.
All needed data files are first stored on S3. There are pointers at various places in our codebase to S3 URIs. After updating references to these datasets, the datadeps file must be manually rebuilt:
$ ./scripts/build_datadeps.sh
This will bundle all data that is loaded at Kite initialization time. You must ensure the needed data is loaded at initialization, otherwise it will not be included!
Some logs are displayed in Xcode, but most are written to a log file:
tail -F ~/.kite/logs/client.log
Your Go code should pass several quality criteria before being allowed into the master branch. Travis CI (https://travis-ci.org/) acts as the gatekeeper between pull requests and merging. You can test your code before pushing to a pull request to speed up the process by navigating to the `kite-go` directory and running `make *` commands directly (any of `make (fmt|lint|vet|bin-check|build|test)`).
You will need access to our VPN to connect to our backend hosts.
- Get VPN credentials (*.ovpn file) from @tarak (You will need to type in a password IRL - don't IM/chat it)
- Install Tunnelblick for OS X (https://code.google.com/p/tunnelblick/)
- Double click on the “.ovpn” file that contains your credentials.
- Tunnelblick should automatically apply the configuration. Look for the icon on the OS X status bar.
- Click on the Tunnelblick icon, select your config, and enter your VPN password. (NOTE: Tunnelblick will complain saying the IP hasn't changed. Check the box to disable the message and continue.)
- Ping 'test-0.kite.com' and make sure it resolves. It's okay if the pings time out; ICMP is disabled by default on AWS instances.
Kite's Dropbox has ssh credentials for all the machines on AWS and Azure under Shared > Engineering > keys > kite-dev.pem and Shared > Engineering > keys > kite-dev-azure. Place both of these in your .ssh directory, i.e. ~/.ssh/kite-dev.pem. As a convenience, you should add the following to your `~/.ssh/config`:
Host *.kite.com
ForwardAgent yes
IdentityFile ~/.ssh/kite-dev.pem
User ubuntu
# Test instances are on Azure
Host test-*.kite.com
User ubuntu
IdentityFile ~/.ssh/kite-dev-azure
Don't forget to set appropriate permissions on the credential files (e.g. 700)