Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Enable reading workspace contents on GitHub.dev #46421

Closed
mjbvz opened this issue Oct 19, 2021 · 10 comments
Closed

[Discussion] Enable reading workspace contents on GitHub.dev #46421

mjbvz opened this issue Oct 19, 2021 · 10 comments
Labels
Domain: TSServer Issues related to the TSServer In Discussion Not yet reached consensus Needs Proposal This issue needs a plan that clarifies the finer details of how it could be implemented. Suggestion An idea for TypeScript

Comments

@mjbvz
Copy link
Contributor

mjbvz commented Oct 19, 2021

Problem

On serverless web, the TypeScript server currently only knows about the files that VS Code opens as editors. This means that features such as cross-file navigation/IntelliSense will not work unless you open editor for each of the files involved

We've faced this problem with two other features for VS Code: search and our any code implementation (which uses tree sitter to provide limited IntelliSense for languages that don't yet have web enabled extensions). For those cases, we now download a copy of the entire repository to the user's browser so that we can index/process all files in the workspace

Since we already have a copy of the workspace contents, we should see if TypeScript is also able to use this data to provided better intellisense on github.dev. This issue lays out a few potential paths for implementing this.

Goals

  • Let TS provide cross-file IntelliSense in the browser
  • Let TS pick up on tsconfig files
  • Avoid keeping around multiple copies of the workspace contents (i.e. TS should not need another copy of the workspace contents)
  • Don't require making the TS Server async :)

Out of scope

  • File watching. For now let's assume that only open files are changed.
  • Reading files out of node_modules. For this discussions, files under node_modules are not considered part of the workspace contents

Idea 1) Initialize TS with the workspace contents

VS Code constructs a data structure representing the workspace contents and transfers this to TypeScript. TypeScript consults this data structure to read any unopened files.

To avoid creating multiple copies of the workspace contents, the data structure could be stored inside a SharedArrayBuffer. However we'd need to figure out how to represent the workspace data structure since TS would fundamentally just be getting a big blob of binary data that it needs to interpret

Advantages

  • After the initial transfer, TS can work synchronously
  • Single transfer. The entire workspace contents would become available all at once.
  • We should be able to share the memory for the workspace contents between multiple different consumers

Idea 2) VS Code as file reader

TypeScript delegates all file operations back through VS Code's TypeScript extension.

The first challenge with this approach that calls from TS back to VS Code would need to be synchronous. This could be implemented using a SharedArrayBuffer to synchronize message passing

Advantages

  • We can construct a nicer api instead of expositing a data structure directly
  • TS can request the files on demand instead of dealing with the entire workspace contents
@mjbvz
Copy link
Contributor Author

mjbvz commented Oct 19, 2021

@JacksonKearl @jrieken On the VS Code side, do you currently have any plans for how to make a readonly view of the workspace contents available to other extensions?

Initializing TS with a single blob seems like it would be the easiest approach for us. If the majority of workspaces only have a few MB or so of code, for the first iteration we could even skip using a SharedArrayBuffer too and instead send typescript a simple JSON tree structure that contains all the files and their contents

@jrieken
Copy link
Member

jrieken commented Oct 19, 2021

/cc @joyceerhl who is owning the remote hub file system implementation.

For anycode this is currently opaque and it simply consumes things via vscode.workspace.fs. That makes it very nice and simple for me but I actually never really know when I am benefitting from blob-caching or causing individual fetch-requests.

For the SharedArrayBuffer approach there is Safari which doesn't support it... A long, long time back we used indexed-db to share data between various workers since all can access data without postMessage-calls. Tho, there is still some deserialization happening.

@JacksonKearl
Copy link

The implementation side of RemoteHub's indexing is just a .tar.gz kept in IndexedDB, reading from IDB can't be sync, but once it's read and decompressed, all the data should be there in that one ArrayBuffer. I don't have much experience with SharedArrayBuffers, but I imagine we could pull the tarball into one of them and let many threads have read-only access to it?

@joyceerhl
Copy link

From talking with @eamodio my understanding is that the tar approach we have is essentially a workaround for the limitations of the GitHub text search APIs and should go away when the GitHub search APIs improve. It's not clear when the search APIs will be improved, but I slightly worry about whether we should really be exposing that and having TypeScript depend on that implementation.

@mjbvz
Copy link
Contributor Author

mjbvz commented Oct 20, 2021

@joyceerhl I agree we should consider if exposing the whole contents of the workspace to extension makes sense. With any code also needing this data though, I think it would be nice to see if we can come up with a generic solution instead of having each extension/language re-implement this functionality

I'll keep talking with the TS team about our requirements for this. For experimentation purposes, it sounds like I may be able to hack something together using the remote hub file system provider today though?

@JacksonKearl and @jrieken Good suggestion to look into IndexedDB. @joyceerhl/ @JacksonKearl Out of curiosity, is the tar stored as a big binary blob in indexeddb, or does Remote Hub generate a more useful data structure to work with it?

@mjbvz
Copy link
Contributor Author

mjbvz commented Oct 21, 2021

One quick note from discussions today: the GitHub.dev case is kind of just a special case of working with a virtual file system.

When working with arbitrary virtual file systems, we might not be able to easily or efficiently construct workspace contents blob that can be shared with TSServer. Idea 2 would likely work better in those cases then

@eamodio
Copy link

eamodio commented Oct 22, 2021

The TARs were originally just to deal with the lacking GitHub search apis, but they also became useful to avoid thrashing the GitHub api (and causing lots of latency) when an extension uses the workspace.fs apis (like anycode). IMO, it would be great to leverage for this scenario as well -- though I'm not sure if a TAR is the "right" way to go or not. As I'm not sure if Azure Repos or other virtual fs' could support it.

@JacksonKearl
Copy link

JacksonKearl commented Oct 22, 2021

@mjbvz the on-indexeddb representation is always a .tar.gz. We deal with it on the RemoteHub side here: https://github.com/microsoft/vscode-remotehub/blob/main/src/search/tar-pit.ts, the package we use for parsing tar's is here: https://github.com/JacksonKearl/TarFileExplorer. We do currently do more post-processing on the remotehub side than is maybe ideal, for instance:

  • the tar entries come from GH with the repo name prefixed onto the path, we remove that on RH side of things)
  • the decompressed file is stored in memory via a sort of debounced WeakRef

It could be helpful for the vscode extension API to provide some sort of:

type Contents =  {path: string, uri, etc. ; contents: UInt8Array}; // maybe a Map instead
requestFullRepositoryContents(reason: string): Promise<Contents | null>; /* requestHighBandwidthMode? */

Where io-heavy operations could be gated on VS Code presenting some sort of UI to prompt for entering "high bandwidth" mode, perhaps with a list of the operations that have pending requests for the full contents and why (sorta like workspace trust). They could then either use workspace.fs (ideal, no code change for them), or read from the returned Contents directly if its a ts-like situation where they need synchronous file reads..

@joyceerhl
Copy link

joyceerhl commented Oct 24, 2021

As we're thinking about bringing Azure Repos support to parity with GitHub Repos support and generalizing the VirtualProvider concept, I prefer idea 2 for the flexibility it offers as well. (FWIW, the AzDO API supports returning blobs as zip files)

In RemoteHub we have an internal FileTree structure which represents the repository contents. (We're currently only using this for GitHub repos, but we should implement it for Azure Repos as well.) We fill out the tree using the tar contents and could expose a method like this on RemoteHub's extension API for the TypeScript extension to access the repo contents:

getFileTree(workspaceUri: vscode.Uri): FileTree | undefined;

@DanielRosenwasser DanielRosenwasser added In Discussion Not yet reached consensus Needs Proposal This issue needs a plan that clarifies the finer details of how it could be implemented. Suggestion An idea for TypeScript Domain: TSServer Issues related to the TSServer labels Nov 4, 2021
@DanielRosenwasser
Copy link
Member

It seems like the current direction to power this up is #47600. I'll close this issue and if we need to in the future, we can come back to this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Domain: TSServer Issues related to the TSServer In Discussion Not yet reached consensus Needs Proposal This issue needs a plan that clarifies the finer details of how it could be implemented. Suggestion An idea for TypeScript
Projects
None yet
Development

No branches or pull requests

6 participants