
LSIF spec could use some extra clarity around embedded contents #1139

Open
@jasonmalinowski


The LSIF spec states that the contents of a file embedded in an LSIF index are base64 encoded:

It can be valuable to embed the contents of a document or project file into the dump as well. For example, if the content of the document is a virtual document generated from program meta data. The index format, therefore, supports an optional contents property on the document and project vertex. If used the content needs to be base64 encoded.
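As a minimal sketch of what the quoted paragraph describes, the snippet below builds a document vertex with an embedded `contents` property. The helper name `make_document_vertex`, the URI, and the sample source text are hypothetical and only serve the illustration. Note that the helper has to take bytes, which is exactly where the spec is silent.

```python
import base64
import json


def make_document_vertex(vertex_id, uri, language_id, raw_bytes):
    """Build a document vertex whose optional contents are base64 encoded.

    raw_bytes is whatever byte sequence the indexer decided to embed;
    the spec paragraph above does not say how those bytes are chosen.
    """
    return {
        "id": vertex_id,
        "type": "vertex",
        "label": "document",
        "uri": uri,
        "languageId": language_id,
        # base64 works on bytes, so the text must already be encoded here.
        "contents": base64.b64encode(raw_bytes).decode("ascii"),
    }


# Hypothetical document, serialized as one JSON line of the dump.
vertex = make_document_vertex(
    1, "file:///example/sample.ts", "typescript", b"let x = 1;\n"
)
line = json.dumps(vertex)
```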

Given that base64 is an encoding of a binary stream, this implies an open question about text encoding. So some questions:

  1. Should the binary stream be the raw file on disk, in whatever text encoding it happens to use? That makes every consumer responsible for encoding sniffing, which may reach a different conclusion (and therefore different contents) than the indexer did. The alternative is for the indexer to re-encode the text in some preferred or specified encoding before base64 encoding it, although that still raises other fun questions around binary file inputs to compilers.
  2. For files that have no "native" encoding because the indexer generated them directly in memory, which encoding should be preferred?
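To make the ambiguity concrete, here is a small sketch showing that the same text produces different base64 payloads depending on the text encoding applied first, and that a consumer who guesses a different encoding recovers different contents. The sample string and variable names are made up for illustration, and the three encodings are just examples, not ones the spec names.

```python
import base64

text = "café"  # hypothetical file contents with one non-ASCII character

# The same text, encoded three different ways before base64 encoding.
utf8_payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
utf16_payload = base64.b64encode(text.encode("utf-16-le")).decode("ascii")
latin1_payload = base64.b64encode(text.encode("latin-1")).decode("ascii")

# An indexer embeds the raw Latin-1 bytes, and a consumer assumes UTF-8.
# The invalid byte is replaced, so the consumer sees different contents.
misread = base64.b64decode(latin1_payload).decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes read as Latin-1 decode without any error,
# which makes the mismatch silent.
mojibake = base64.b64decode(utf8_payload).decode("latin-1")

# Only a shared assumption about the encoding round-trips the text.
roundtrip = base64.b64decode(utf8_payload).decode("utf-8")
```

If the spec pinned down one encoding for `contents`, the round trip in the last line would be the only case a consumer has to handle.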
