Fixed bug with reversing back half of context. Added <<end_id>> subtokens so that separations are clear between subtokens of adjacent tokens.
Showing 3 changed files with 53 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,9 @@ | ||
## General | ||
* Tokens are split into subtokens so that the representation is agnostic to whether the code uses `camelCase` or `c_style` naming conventions. Subtokens are also case-insensitive after splitting. This unifies tokens written in either style, at the cost of losing _some_ information (such as the original casing). | ||
* When loading contexts, we sample files with probability proportional to file size. File size is proportional to the number of contexts in the file, so files with a high number of contexts are oversampled. I'm still not sure whether this is preferable. | ||
* `<<start_id>>` and `<<end_id>>` are used to guide seq2seq, but we also need `<<end_id>>` to distinguish `ClassName varName` from `classNameVarName`. | ||
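
The subtoken splitting and the role of `<<end_id>>` can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `split_subtokens` and `encode` and the splitting regex are assumptions, and the placement of `<<start_id>>` is left out because the notes do not specify it.

```python
import re

END_ID = "<<end_id>>"  # marker appended after each token's subtokens


def split_subtokens(token):
    """Split an identifier on underscores and capital letters, then lowercase.

    Hypothetical helper: the real splitting rules may differ.
    """
    parts = []
    for piece in token.split("_"):
        # Each match is one letter followed by lowercase letters, or a digit run.
        parts.extend(re.findall(r"[A-Za-z][a-z]*|\d+", piece))
    return [p.lower() for p in parts]


def encode(tokens, with_end_id=True):
    """Flatten tokens into subtokens, optionally closing each token with END_ID."""
    out = []
    for token in tokens:
        out.extend(split_subtokens(token))
        if with_end_id:
            out.append(END_ID)
    return out


# Without the marker, two tokens and one long token collapse to the same sequence.
two_tokens = encode(["ClassName", "varName"], with_end_id=False)
one_token = encode(["classNameVarName"], with_end_id=False)

# With the marker, the token boundary is preserved.
two_tokens_marked = encode(["ClassName", "varName"])
one_token_marked = encode(["classNameVarName"])
```

Here `two_tokens` and `one_token` are identical, while the marked versions differ, which is the ambiguity the `<<end_id>>` subtoken is meant to resolve.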
|
||
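
Sampling by file size, as described in the General notes, might look like the sketch below. The function name `sample_files` and the in-memory `file_sizes` mapping are hypothetical; in practice the sizes could come from something like `os.path.getsize`, and the actual loader may work differently.

```python
import random


def sample_files(file_sizes, k, seed=None):
    """Draw k file names with replacement, weighted by file size.

    file_sizes: dict mapping a file name to its size (assumed to track
    the number of contexts in that file).
    """
    rng = random.Random(seed)
    names = list(file_sizes)
    weights = [file_sizes[name] for name in names]
    # random.choices samples with replacement using relative weights.
    return rng.choices(names, weights=weights, k=k)


# A file with more contexts (larger size) is drawn more often on average.
picks = sample_files({"small.txt": 1, "large.txt": 99}, k=1000, seed=0)
```

A uniform draw over files would avoid the oversampling, at the cost of underusing the contexts in large files.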
## On Processing | ||
* For the future, make sure lone underscores appear as their own characters. | ||
* Also, the current scheme makes something like `_0` appear to be an underscore followed by an integer literal. | ||
* The current scheme makes something like `_0` appear to be an underscore followed by an integer literal. | ||
* We exclude `<unk>` tokens as prediction targets, since they make up only around 3-4% of the dataset and predicting them would not yield any useful output from the network. |
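
One way to leave `<unk>` out of prediction is to drop the affected examples before training. The sketch below assumes examples are `(context, target)` pairs and uses a hypothetical helper `drop_unk_targets`; the actual pipeline might instead mask these targets in the loss.

```python
UNK = "<unk>"


def drop_unk_targets(examples):
    """Keep only the examples whose target token is in the vocabulary."""
    return [(context, target) for context, target in examples if target != UNK]


examples = [
    (["int", "x", "="], "x"),
    (["foo", "(", ")"], UNK),  # out-of-vocabulary target, dropped below
]
kept = drop_unk_targets(examples)
```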