Protobuf Template Name Compression Explained

Templight traces are, by default, produced in a protobuf format (Google Protocol Buffers) with a dictionary-based compression scheme that exploits the natural structure of template instantiation names to minimize the amount of repetition in the base names (types, arguments, traits, etc.).

The compression is designed so that it is very easy to read (decompress), regardless of how sophisticated the compression is (or might become in the future).

Currently, the compression works as follows.

The elements of the proto message definitions that relate to it are:

message TemplightEntry {
  
  message TemplateName {
    //..
    optional uint32 dict_id = 3;  // index of dictionary-entry for the name
  }
  
  //..
  
  message Begin {
    //..
    required TemplateName name = 2;  // the name field for the trace entry.
    //..
  }
  
  //..
  optional Begin begin = 1;
}

message DictionaryEntry {
  required string marked_name = 1;  // a compressed string ('\0' characters are placeholders / markers)
  repeated uint32 marker_ids = 2;   // dictionary-entry index for each placeholder / marker found in 'marked_name'.
}

message TemplightTrace {
  required TemplightHeader header = 1;
  repeated TemplightEntry entries = 2;  // all templight entries (or trace entries)
  repeated DictionaryEntry names = 3;   // all dictionary entries
}

In the code that writes (or compresses) the names, the following steps are performed (this might change in the future, but not in a way that would alter how the names are decompressed); a sketch in code follows the list:

  1. From a given template instantiation name, we first look for a dictionary entry that already matches the complete name. If one exists, the index of that entry is used as the "dict_id" and we go to step 5. Otherwise, we start tokenizing the name (e.g., by looking for matching "<" and ">", the top-level commas that separate the individual arguments within them, and so on).
  2. If there is only a single token, we leave the string intact, record it unmodified as the "marked_name" field of a new dictionary entry with an empty "marker_ids" repeated-field, return the index of that new entry as the "dict_id", and go to step 5.
  3. Otherwise, each token (e.g., what appears before "<", what lies between each top-level comma, what appears after ">", also taking care of any "::" that appear) is replaced by the character '\0' in the string that will become the "marked_name" field of its dictionary entry.
  4. For each token, we repeat steps 1 to 4 to create a new dictionary entry for it (or find an existing one, which is where the "compression" effect comes in), and the index of that entry is appended to the "marker_ids" repeated-field. At this point, the "marked_name" from step 3 and the "marker_ids" compiled in this step are put together as a new dictionary entry whose index is returned as the "dict_id".
  5. We take the "dict_id" obtained (from step 1, 2, or 4) and record it in the TemplateName name field of the templight entry.
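
To make the writer side concrete, here is a minimal sketch of steps 1 to 5. This is not the actual templight code: the DictEntry struct, the index_of lookup map, and the simplified tokenize() helper (which only handles names of the form base<arg, ...>, assumes a single space after each top-level comma, and ignores complications like "::" after ">") are all assumptions for illustration.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-ins for the protobuf messages and the writer's lookup state:
struct DictEntry {
  std::string marked_name;              // compressed string, '\0' = placeholder
  std::vector<std::size_t> marker_ids;  // dictionary index for each placeholder
};
std::vector<DictEntry> dictionary;            // becomes the 'names' field
std::map<std::string, std::size_t> index_of;  // full name -> dictionary index

// Split "base<a, b<c>, d>" into {"base", "a", "b<c>", "d"} and build the
// corresponding placeholder pattern "\0<\0, \0, \0>"; a name without template
// arguments comes back as a single token.
std::vector<std::string> tokenize(const std::string& name, std::string& pattern) {
  std::vector<std::string> tokens;
  const std::size_t lt = name.find('<');
  if (lt == std::string::npos || name.back() != '>') {
    tokens.push_back(name);  // single token, nothing to split
    return tokens;
  }
  tokens.push_back(name.substr(0, lt));  // base name before '<'
  pattern.push_back('\0'); pattern.push_back('<');
  int depth = 0;
  std::size_t start = lt + 1;
  for (std::size_t i = lt + 1; i + 1 < name.size(); ++i) {
    if (name[i] == '<') ++depth;
    else if (name[i] == '>') --depth;
    else if (name[i] == ',' && depth == 0) {  // top-level comma
      tokens.push_back(name.substr(start, i - start));
      pattern.push_back('\0'); pattern += ", ";
      start = i + 2;  // skip ", " (assumes one space after the comma)
    }
  }
  tokens.push_back(name.substr(start, name.size() - 1 - start));
  pattern.push_back('\0'); pattern.push_back('>');
  return tokens;
}

// Steps 1-5: return the "dict_id" for 'name', creating entries as needed.
std::size_t compressName(const std::string& name) {
  auto found = index_of.find(name);  // step 1: reuse an existing entry
  if (found != index_of.end())
    return found->second;
  std::string pattern;
  std::vector<std::string> tokens = tokenize(name, pattern);
  DictEntry entry;
  if (tokens.size() == 1) {
    entry.marked_name = name;        // step 2: single token, stored intact
  } else {
    entry.marked_name = pattern;     // step 3: '\0' placeholders
    for (const std::string& tok : tokens)
      entry.marker_ids.push_back(compressName(tok));  // step 4: recurse
  }
  const std::size_t id = dictionary.size();  // step 5: record the new entry
  dictionary.push_back(entry);
  index_of[name] = id;
  return id;
}

Under those assumptions, calling compressName on the std::basic_string name used in the example below produces exactly the seven dictionary entries listed there.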

For example, if you have the template name std::basic_string<char, std::char_traits<char>, std::allocator<char>>, you could end up with the following dictionary entries:

names { // id = 0
  marked_name = "std::basic_string"
}
names { // id = 1
  marked_name = "char"
}
names { // id = 2
  marked_name = "std::char_traits"
}
names { // id = 3
  marked_name = "\0<\0>"
  marker_ids[0] = 2
  marker_ids[1] = 1
}
names { // id = 4
  marked_name = "std::allocator"
}
names { // id = 5
  marked_name = "\0<\0>"
  marker_ids[0] = 4
  marker_ids[1] = 1
}
names { // id = 6
  marked_name = "\0<\0, \0, \0>"
  marker_ids[0] = 0
  marker_ids[1] = 1
  marker_ids[2] = 3
  marker_ids[3] = 5
}
entries {
  ...
  name {
    dict_id = 6
  }
}

But as said before, reading / decompressing this format is a lot easier than writing it. All you have to do to re-create the original names is described below; for instance, expanding entry 6 above means splicing the expansions of entries 0, 1, 3, and 5 in place of its four '\0' markers, which reproduces the original std::basic_string name.

The batch method (sketched in code after the list):

  1. Load all the dictionary entries into an array (e.g., as you would have to if using the Google protobuf library to read the file).
  2. For each dictionary entry (sequentially), go through the "marked_name" string, and for each '\0' character you find, take the next index in "marker_ids" to look up the dictionary entry to insert in place of that zero-character.
  3. When you reach the end of the "marked_name" string and have replaced all the zero-characters, you are done.
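
A minimal sketch of this batch pass, again using a hypothetical DictEntry struct in place of the generated protobuf classes (with the real classes, trace.names(i) plays the same role):

#include <cstddef>
#include <string>
#include <vector>

struct DictEntry {
  std::string marked_name;              // compressed string with '\0' markers
  std::vector<std::size_t> marker_ids;  // dictionary index per marker
};

// Expand every dictionary entry in one forward pass; entries only ever refer
// to earlier entries, so each lookup hits an already-expanded string.
std::vector<std::string> expandAll(const std::vector<DictEntry>& names) {
  std::vector<std::string> expanded;
  expanded.reserve(names.size());
  for (const DictEntry& e : names) {
    std::string out;
    std::size_t next_marker = 0;
    for (char c : e.marked_name) {
      if (c == '\0')
        out += expanded[e.marker_ids[next_marker++]];  // splice in the sub-name
      else
        out.push_back(c);
    }
    expanded.push_back(out);
  }
  return expanded;  // expanded[dict_id] is the fully decompressed name
}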

For the sequential version, you simply do the same as in the batch method, but as you discover each new dictionary entry. This works because of the way the file is generated: any dictionary entry referred to by the "marker_ids" is guaranteed to have been seen already, and therefore already exists in the dictionary being built. (A sketch of this incremental approach is given at the end of this page.)

The most naive way to do it is to use Google's protobuf library and do the following (taken from the code I wrote for Templar, with comments "for dummies"; it is a quick and naive implementation that works without actually building a dictionary, using only what is loaded by the protobuf code):

#include <algorithm>  // for std::find
#include <cstddef>
#include <string>
#include <vector>

struct dict_expansion_task {
  DictionaryEntry const * p_entry;  // dictionary entry being expanded
  std::size_t char_id;              // current position in its 'marked_name'
  std::size_t mark_id;              // next index into its 'marker_ids'
};

void readEntry(TemplightEntry_Begin const &begin, TemplightTrace const &trace) {
  //...
  // Create the "expanded" name (decompressed):
  std::string expanded_name;
  // Set up a non-recursive depth-first traversal starting with "dict_id" of the entry:
  std::vector< dict_expansion_task > tasks;
  tasks.push_back( dict_expansion_task{ &trace.names(begin.name().dict_id()), 0, 0 } );
  while( !tasks.empty() ) {
    const std::string& dict_name = tasks.back().p_entry->marked_name();
    // Find the next zero-character:
    std::size_t new_char_id = 
      std::find(dict_name.begin() + tasks.back().char_id, 
                dict_name.end(), '\0') - dict_name.begin();
    // Copy what is between the last and next zero-char to the output:
    expanded_name.append(dict_name.begin() + tasks.back().char_id, 
                         dict_name.begin() + new_char_id);
    // Check if the zero-char is found (or else, we reached the end):
    if( new_char_id < dict_name.size() ) {
      // Create the next task, to expand the current "marker_ids" dictionary-entry:
      dict_expansion_task next_task{&trace.names(tasks.back().p_entry->marker_ids(tasks.back().mark_id)), 0, 0};
      // Move past the zero-char (and its marker_id) that was just handled:
      tasks.back().char_id = new_char_id + 1;
      tasks.back().mark_id += 1;
      // Push on the next task:
      tasks.push_back(next_task);
    } else {
      // Pop the current task (since we're at the end):
      tasks.pop_back();
    }
  }
  // Set the entry name to the expanded / decompressed name:
  entry->context = expanded_name.c_str();
  //...
}

Of course, the more efficient way to do it is as in the reference implementation (the TemplightProtobufReader class): expand each dictionary entry as it is encountered in the input file, which gets rid of this whole recursive decompression scheme (but takes more memory, of course, since the entire decompressed dictionary has to be kept in memory).
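
Here is a rough sketch of that incremental approach; it is an assumption about the structure, not the actual TemplightProtobufReader code. Each entry is expanded the moment it is read, so every later reference, whether from another dictionary entry or from a trace entry's "dict_id", becomes a plain array lookup:

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class ExpandedDictionary {
 public:
  // Call once per DictionaryEntry, in the order they appear in the file.
  void addEntry(const std::string& marked_name,
                const std::vector<std::size_t>& marker_ids) {
    std::string out;
    std::size_t next_marker = 0;
    for (char c : marked_name) {
      if (c == '\0')
        out += expanded_[marker_ids[next_marker++]];  // earlier entry, already expanded
      else
        out.push_back(c);
    }
    expanded_.push_back(std::move(out));
  }

  // Resolve the "dict_id" stored in a TemplateName field.
  const std::string& name(std::size_t dict_id) const {
    return expanded_[dict_id];
  }

 private:
  std::vector<std::string> expanded_;
};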
