0.2 #126

Merged (101 commits) on Feb 15, 2023
Commits
34f247f
rewrite wordpiece with DAT
chengchingwen Nov 27, 2022
7c8437f
show func for wordpiece
chengchingwen Nov 27, 2022
f01f9b9
update test
chengchingwen Nov 27, 2022
341b614
correct way to get unk
chengchingwen Nov 27, 2022
a096de7
Merge branch 'wp_v2'
chengchingwen Nov 28, 2022
0787086
refactor hgf config
chengchingwen Nov 30, 2022
4f980e7
remove old codes
chengchingwen Dec 2, 2022
9e1ca4d
update to pre-0.2.0
chengchingwen Dec 3, 2022
b51d5fd
rewrite loss; borrow api from Flux and add rrule
chengchingwen Dec 3, 2022
bae559c
refine loss
chengchingwen Dec 3, 2022
5698c66
Merge remote-tracking branch 'origin/master' into 0.2
chengchingwen Dec 13, 2022
eab1087
remove include of removed files
chengchingwen Dec 13, 2022
1ce4ad9
update t5 impl with new design
chengchingwen Dec 13, 2022
6b52994
add new layer design
chengchingwen Dec 13, 2022
71c892b
small refine
chengchingwen Dec 13, 2022
faf7ff4
update bert default config
chengchingwen Dec 15, 2022
c10eab0
update bert impl with new design
chengchingwen Dec 15, 2022
1a9875f
fix new t5 impl
chengchingwen Dec 16, 2022
76859bb
new load model api
chengchingwen Dec 16, 2022
04494aa
add bias to embed decoder
chengchingwen Dec 16, 2022
8c097ee
fix embed method
chengchingwen Dec 16, 2022
e64630b
update huggingface based model validate code with new api
chengchingwen Dec 16, 2022
c9656ed
load gpt2 with new design
chengchingwen Dec 23, 2022
86b3575
load gptj
chengchingwen Dec 23, 2022
db261af
update validate code with confing overwrite
chengchingwen Dec 23, 2022
0981564
fix cfg overwrite
chengchingwen Dec 24, 2022
c25eb47
load gpt_neo
chengchingwen Dec 24, 2022
711aaf8
load model utils; force load as float32
chengchingwen Dec 24, 2022
dacaba5
move pe to NAlib
chengchingwen Dec 25, 2022
a75de7c
move atten ops to Layers
chengchingwen Dec 25, 2022
a4f8287
refine code
chengchingwen Dec 27, 2022
36f58df
remove old model code
chengchingwen Dec 27, 2022
8ebdb44
clip tokenizer w/ new cfg
chengchingwen Dec 27, 2022
4c2fb39
remove old test
chengchingwen Dec 27, 2022
f2b30d0
update huggingface task validate code
chengchingwen Dec 27, 2022
68ab3ea
update tokenizer
chengchingwen Dec 30, 2022
7633cd5
test and fix huggingface load
chengchingwen Dec 31, 2022
7f6277f
self attention construct with op
chengchingwen Dec 31, 2022
046a553
allow fast tkr to load without text encoder registed
chengchingwen Jan 4, 2023
f25378e
update huggingface validate tokenizer code
chengchingwen Jan 4, 2023
44de7cf
update test
chengchingwen Jan 4, 2023
09ccfd6
update env
chengchingwen Jan 4, 2023
4eb6149
fix test
chengchingwen Jan 4, 2023
8146e95
move all text encoders to new module TextEncoders; destruct Basic/Gen…
chengchingwen Jan 9, 2023
497f804
organize code
chengchingwen Jan 9, 2023
2fdd272
update test
chengchingwen Jan 9, 2023
dd69a59
Merge branch 'master' into 0.2
chengchingwen Jan 9, 2023
c53bf5b
refine code; remove JSON
chengchingwen Jan 10, 2023
fa4aaaa
organize code
chengchingwen Jan 10, 2023
acf6614
add set_dropout
chengchingwen Jan 10, 2023
0eab94a
zip the vocab file in test
chengchingwen Jan 10, 2023
0a334e7
remove unsed packages
chengchingwen Jan 10, 2023
0af78e8
update hgf_str and export
chengchingwen Jan 10, 2023
0c09e1f
small refine
chengchingwen Jan 10, 2023
3ba74a7
update copy example
chengchingwen Jan 10, 2023
ea8d2ac
make testmode work with Flux Dropout
chengchingwen Jan 10, 2023
1d69d62
update readme
chengchingwen Jan 10, 2023
60e9440
add missing namespace
chengchingwen Jan 10, 2023
dce90a1
improve type stability of todevice
chengchingwen Jan 11, 2023
575131c
update AIAYN examples
chengchingwen Jan 11, 2023
4392b16
update bert examples
chengchingwen Jan 11, 2023
7f0239d
remove/update some docs
chengchingwen Jan 13, 2023
753bf8f
update export
chengchingwen Jan 13, 2023
e9dd168
update changelogs
chengchingwen Jan 17, 2023
c7eaa08
update docs
chengchingwen Jan 18, 2023
d2915f0
encode for seq2seq setting
chengchingwen Jan 18, 2023
6d5baf3
Layers.WithScore for fmap NAlib.WithScore
chengchingwen Jan 18, 2023
70b6b86
refine set_dropout
chengchingwen Jan 18, 2023
cd89f05
fix & test loss
chengchingwen Jan 23, 2023
3d3518d
refine export
chengchingwen Jan 23, 2023
ca3fcee
refine example
chengchingwen Jan 23, 2023
4c71573
update env
chengchingwen Jan 23, 2023
59d254c
some docs
chengchingwen Jan 23, 2023
afbb401
argument non_differentiable
chengchingwen Jan 24, 2023
6a059c5
fix loss on 1.6
chengchingwen Jan 24, 2023
2f43283
refine & docs for textencoders
chengchingwen Jan 29, 2023
e055e13
refine tutorial
chengchingwen Jan 29, 2023
2fb89b5
docstring for HuggingFace
chengchingwen Jan 30, 2023
c7656c8
test decode
chengchingwen Jan 31, 2023
69b9490
relax Functors compat
chengchingwen Jan 31, 2023
f57e3a7
update example
chengchingwen Jan 31, 2023
e3d87f3
remove example
chengchingwen Jan 31, 2023
e27d750
refine docstring
chengchingwen Jan 31, 2023
704889e
fix hgf tokenizer
chengchingwen Jan 31, 2023
0777550
more docs
chengchingwen Jan 31, 2023
f880e39
refine Layers: organize code and better printing
chengchingwen Feb 8, 2023
e5233b7
refine huggingface show
chengchingwen Feb 8, 2023
55d0371
some docstring
chengchingwen Feb 8, 2023
d3e80df
dense without bias
chengchingwen Feb 9, 2023
8499ce2
fix sequence mask type stability with todevice
chengchingwen Feb 11, 2023
fe2caec
remove InternedString
chengchingwen Feb 11, 2023
307bb6e
use N(0, 1) for embedding init
chengchingwen Feb 14, 2023
b0ddec7
use zeros for embed decoder bias init
chengchingwen Feb 14, 2023
71330c0
make model global var
chengchingwen Feb 15, 2023
754c706
ignore indices func; allow ce passing ApplyEmbed
chengchingwen Feb 15, 2023
aaf2188
fix default text encoder construct func
chengchingwen Feb 15, 2023
f6524ae
add robeta
chengchingwen Feb 15, 2023
24b0b1b
replacce view with selectdim
chengchingwen Feb 15, 2023
a9fde55
roberta use gpt2 tokenizer
chengchingwen Feb 15, 2023
0620870
update readme
chengchingwen Feb 15, 2023
838f2a6
update tutorial
chengchingwen Feb 15, 2023
179 changes: 179 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,179 @@
# ChangeLogs (from 0.1.x to 0.2.0)

v0.2 is a rewrite of the whole package. Most layers and APIs in 0.1 are removed or changed, and some of them are
replaced with new ones. The basic policy is: if a functionality is easily achievable with a well-maintained package,
or there isn't much gain from self-hosting/maintaining it, then we remove that functionality from Transformers.jl.


Here is a list of the changes with brief explanations:

## Transformers.Pretrain

The `Pretrain` module is entirely removed due to the duplication of functionality with `Transformers.HuggingFace`.
We no longer host the small list of the originally released official pretrained weights. Any use case that requires a
pretrained weight should refer to the `HuggingFace` module instead. This table maps the old pretrain names to the
corresponding huggingface model names:

| old pretrain name | corresponding huggingface model name |
|--------------------------------|-----------------------------------------|
| `cased_L-12_H-768_A-12` | `bert-base-cased` |
| `uncased_L-12_H-768_A-12` | `bert-base-uncased` |
| `chinese_L-12_H-768_A-12` | `bert-base-chinese` |
| `multi_cased_L-12_H-768_A-12` | `bert-base-multilingual-cased` |
| `multilingual_L-12_H-768_A-12` | `bert-base-multilingual-uncased` |
| `cased_L-24_H-1024_A-16` | `bert-large-cased` |
| `uncased_L-24_H-1024_A-16` | `bert-large-uncased` |
| `wwm_cased_L-24_H-1024_A-16` | `bert-large-cased-whole-word-masking` |
| `wwm_uncased_L-24_H-1024_A-16` | `bert-large-uncased-whole-word-masking` |
| `scibert_scivocab_cased` | `allenai/scibert_scivocab_cased` |
| `scibert_scivocab_uncased` | `allenai/scibert_scivocab_uncased` |
| `scibert_basevocab_cased` | N/A |
| `scibert_basevocab_uncased` | N/A |
| `OpenAIftlm` | `openai-gpt` |
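
For example, weights that were previously loaded with `pretrain"..."` are now loaded through the `HuggingFace`
module. A minimal migration sketch (the `hgf""` macro returns the text encoder and the model, as shown in the README
example further below):

```julia
using Transformers
using Transformers.HuggingFace

# 0.1: bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
# 0.2: look up the corresponding name in the table above and use the hgf"" macro
textencoder, bert_model = hgf"bert-base-uncased"
```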


## Transformers.Stacks

The `Stacks` module is entirely removed. `Stacks` provided a small DSL for creating nontrivial `Chain`s of layers.
However, the DSL wasn't intuitive enough, and it doesn't seem worth maintaining one. We don't provide a direct
replacement, but for the specific use case of building transformer models, there are a few new constructors/layers
in `Transformers.Layers`.


## Transformers.Basic

The `Basic` module is dismantled and most of its elements are moved to other modules/packages.

1. `Transformer` and `TransformerDecoder`: The `Transformer`/`TransformerDecoder` layers are replaced with the new
   implementation in `Layers` (`Layers.TransformerBlock`, `Layers.TransformerDecoderBlock`, and friends).
2. `MultiheadAttention`: The implementation of the attention operations is moved out to
   [NeuralAttentionlib](https://github.com/chengchingwen/NeuralAttentionlib.jl). In NeuralAttentionlib, you can use
   `multihead_qkv_attention` to do the same computation. Since most transformer variants only use a modified version
   of self or cross attention, we do not provide a `MultiheadAttention` layer type. One should be able to redefine
   the `MultiheadAttention` layer type with Flux and NeuralAttentionlib easily. For example:

```julia
using Flux, Functors
using NeuralAttentionlib: multihead_qkv_attention, CausalMask

struct MultiheadAttention{Q,K,V,O}
    head::Int
    future::Bool
    iqproj::Q
    ikproj::K
    ivproj::V
    oproj::O
end
# only the projection layers carry trainable parameters
@functor MultiheadAttention (iqproj, ikproj, ivproj, oproj)

MultiheadAttention(head, hidden_size, head_size; future = true) =
    MultiheadAttention(head, future,
        Dense(hidden_size, head_size * head),
        Dense(hidden_size, head_size * head),
        Dense(hidden_size, head_size * head),
        Dense(head_size * head, hidden_size),
    )

# project the inputs, run multi-head attention (with a causal mask when
# attending to the future is disallowed), then apply the output projection
(mha::MultiheadAttention)(q, k, v) =
    mha.oproj(multihead_qkv_attention(mha.head,
        mha.iqproj(q), mha.ikproj(k), mha.ivproj(v),
        mha.future ? nothing : CausalMask()))
```

3. `TransformerModel`: This was just a Flux layer bundling an embedding layer, a transformer layer, and a classifier
   layer together. One can define this easily with the Flux/Functors API, so it is removed.
4. `Positionwise`, `PwFFN`, and `@toNd`: These were originally designed for applying `Flux.Dense` to 3-dimensional
   arrays, but since `Flux.Dense` now supports multi-dimensional input, they aren't needed and are removed.
5. `EmbeddingDecoder`: Replaced with `Layers.EmbedDecoder`. Besides the name change, it supports an extra trainable
   `bias` parameter.
6. `PositionEmbedding`: Replaced with `Layers.SinCosPositionEmbed` and `Layers.FixedLenPositionEmbed`, corresponding
   to the two settings of the old `trainable` keyword argument.
7. `crossentropy` with masking: We extend `Flux.logitcrossentropy` and `Flux.crossentropy` with a 3-argument form
   (prediction, label, and mask) and a 4-argument form (`sum` or `mean`, prediction, label, and mask).
8. `kldivergence`: In our use case (i.e. training language models), this is equivalent to cross-entropy, thus removed.
9. `logcrossentropy`/`logkldivergence`: This was a faulty design. Originally a `logsoftmax` was placed on top of the
   prediction head. However, that is not only unnecessary but also increases the amount of memory needed. One should
   use `Flux.logitcrossentropy` directly, without the `logsoftmax`.
10. `Vocabulary`: Replaced with `TextEncodeBase.Vocab`.
11. `with_firsthead_tail`/`segment_and_concat`/`concat`: These can be implemented with `TextEncodeBase.SequenceTemplate`
    and friends, thus removed.
12. `getmask`: The attention mask functionality is moved to NeuralAttentionlib. To manually construct an attention
    mask, use the constructors in `NeuralAttentionlib.Masks`, as sketched below.
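
As mentioned in item 12, attention masks are now constructed from `NeuralAttentionlib.Masks`. A minimal sketch,
assuming the `CausalMask` and `LengthMask` constructors from NeuralAttentionlib v0.2 (see its documentation for the
full set of mask types):

```julia
using NeuralAttentionlib.Masks

causal  = CausalMask()        # disallow attending to future positions
lengths = LengthMask([3, 5])  # valid length of each sequence in a padded batch

# mask types compose with the usual boolean operators
mask = causal & lengths
```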


## Transformers.Layers (new)

The `Layers` module is a new module introduced in v0.2.0. It provides a set of layer types for constructing
transformer model variants.


## Transformers.TextEncoders (new)

The `TextEncoders` module is a new module introduced in v0.2.0. Basically all of the old text-preprocessing
functionality is moved to this module, including `WordPiece`, `Unigram`, `BertTextEncoder`, `GPT2TextEncoder`, etc.
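
A minimal usage sketch (assuming the `bert-base-uncased` text encoder; the workflow mirrors the README example
further below):

```julia
using Transformers.HuggingFace
using Transformers.TextEncoders

textencoder, _ = hgf"bert-base-uncased"

sample = encode(textencoder, [["Peter Piper picked a peck", "Fuzzy Wuzzy was a bear"]])  # one batch of two sentences
tokens = decode(textencoder, sample.token)  # map the one-hot token indices back to token strings
```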

## Transformers.BidirectionalEncoder / Transformers.GenerativePreTrain

These modules are removed since we are switching to `Transformers.HuggingFace` for pretrained models. The text
encoders are moved to `Transformers.TextEncoders`. The weight loading and conversion functionality is removed. If you
need that, use the tools provided by the huggingface transformers python package and make sure the model can be loaded
with pytorch; then the weights can be used in pytorch format.


## Transformers.HuggingFace

The changes in `Transformers.HuggingFace` are mainly about the configurations and models. The tokenizer/textencoder
part is mostly the same, except for the process functions.

### Configuration

For the configuration, the loading mechanism is changed. In the previous version, each model type needed to define a
specific `HGF<XXModelType>Config` struct, where `XXModelType` is the model type name. The reason is that huggingface
transformers doesn't serialize all the configuration values into the file, but relies on its constructors with
pre-defined default values instead. As a result, some models only need the configuration file, while others need the
python code for the defaults as well. The hgf config struct was more like an internal data carrier: you usually
won't (and actually can't) manipulate the model with it.


In v0.2, we tried to make the process of adding models more automatic, and to enable building models with different
configurations. The struct holding the configuration is now a parametric struct with a `Symbol` parameter specifying
the model type (e.g. `HGFConfig{:bert}`). With this, the specific `HGF<XXModelType>Config` can be constructed on the
fly. `HGFConfig` has 2 fields: one storing the read-only deserialized object loaded from the configuration file, and
another storing the overwritten values. This should turn the config struct into a user-level interface.
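
A sketch of the intended usage. The `load_config` helper and the keyword-overwrite constructor shown here are
assumptions about the API; check the `HuggingFace` module docstrings for the exact names:

```julia
using Transformers.HuggingFace

# assumed helper: download/deserialize config.json into an HGFConfig{:bert}
cfg = load_config("bert-base-uncased")
cfg.num_attention_heads  # loaded values are read through property access

# assumed overwrite constructor: the loaded values stay read-only,
# the overrides are stored in the second field
cfg2 = HGFConfig(cfg; num_labels = 3)
```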


### Model

For the model part, the main change is that we no longer make a 1-1 mapping between the python model/layer classes and
our julia layer structs. When one wants to add a new model type, there are actually 2 things that need to be done. One
is defining a model forward method that does the same computation as the python model, and the other is defining a
mapping between the python model and the julia model (so that the model parameters/weights can be transferred between
the 2 languages). In the previous version, we chose to make a 1-1 mapping between the models, so that the
parameter/weight loading process could be fully automatic. However, huggingface transformers does not reuse its
attention or transformer implementation across model types. This means that for different model types, even if they
are actually doing the same computation (i.e. the computation graph is the same), the model layout can be different
(e.g. consider the differences between `Chain(Chain(dense1, dense2), dense3)` and `Chain(dense1, dense2, dense3)`).
As a result, this made implementing the model forward methods a real pain, and also made it hard to apply optimizations.


We noticed that the model forward method is more important and more difficult than the model mapping. On the other
hand, though manually defining the model mapping is tedious, it's less prone to errors. So instead of making a 1-1
mapping for fully automatic model loading, we chose to reduce the work needed for the forward method. In v0.2, the
attention implementation is switched to NeuralAttentionlib's modular implementation, and we build all internal layers
with layers from `Transformers.Layers`. As a result, layers like `FakeTH<XXLayer>` or `HGF<XXModelType>Attention/MLP/...`
are removed; only the outer-most types remain (e.g. `HGFBertModel`, `HGFGPT2LMHeadModel`, ...).


Since we want to make it easy to finetune a pretrained model on a new dataset/task, model loading is a combination of
initialization and parameter/weight loading. In the normal Flux workflow, you would build a completely new model and
then load the parameter/weight values in place into the specific layers/arrays of the model. In v0.2, we combine these
2 steps into one `load_model` function, which takes the model type, configuration, and a state dictionary (the term
comes from PyTorch, where it is an `OrderedDict` mapping variable names to weights). `load_model` either looks up a
variable in the state dictionary or initializes it from the configuration, recursively. As a result,
`load_model!` is removed.
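
A sketch of the combined loading step described above. `load_model` is described here as taking the model type,
configuration, and state dictionary; the `load_config`/`load_state_dict` helper names are assumptions about the API:

```julia
using Transformers.HuggingFace

name = "bert-base-uncased"
cfg = load_config(name)             # assumed helper for the configuration
state_dict = load_state_dict(name)  # assumed helper; PyTorch-style mapping of variable names to weights

# look up weights in the state dict and initialize anything missing from the config
model = load_model(HGFBertModel, cfg, state_dict)
```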


## Behavior Changes

* The process functions of all text encoders (including the `HuggingFace` ones) return a `NamedTuple`. Some field
  names changed: `tok` => `token`, `mask` => `attention_mask`.
* Most layers/models from Transformers.jl take and return `NamedTuple`s.
* For `HuggingFace` models: all input is basically a `NamedTuple`. The field names of the `NamedTuple` returned by the
  forward method are also changed.
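
For instance, reusing the names from the README example below (illustrative; field availability depends on the
encoder/model):

```julia
sample = encode(textencoder, [[text1, text2]])
sample.token           # was `sample.tok` in 0.1
sample.attention_mask  # was `sample.mask` in 0.1

nt = bert_model(sample)  # HuggingFace models take and return NamedTuples
nt.hidden_state
```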
36 changes: 13 additions & 23 deletions Project.toml
@@ -1,12 +1,9 @@
name = "Transformers"
uuid = "21ca0261-441d-5938-ace7-c90938fde4d4"
authors = ["chengchingwen <adgjl5645@hotmail.com>"]
version = "0.1.25"
version = "0.2.0"

[deps]
AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0"
Base64 = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
BytePairEncoding = "a4280ba5-8788-555a-8ca8-4a8c3d966a71"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
@@ -17,25 +14,23 @@ Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
DoubleArrayTries = "abbaa0e5-f788-499c-92af-c35ff4258c82"
Fetch = "bb354801-46f6-40b6-9c3d-d42d7a74c775"
FillArrays = "1a297f60-69ca-5386-bcde-b61e274b549b"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
FuncPipelines = "9ed96fbb-10b6-44d4-99a6-7e2a3dc8861b"
Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
HuggingFaceApi = "3cc741c3-0c9d-4fbe-84fa-cdec264173de"
InternedStrings = "7d512f48-7fb1-5a58-b986-67e6dc259f01"
JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
LightXML = "9c8b4983-aa76-5018-a973-4c85ecc9e179"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MacroTools = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
Mmap = "a63ad114-7e13-5084-954f-fe012c677804"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
NNlibCUDA = "a00861dc-f156-4864-bf3c-e6376f28a68d"
NeuralAttentionlib = "12afc1b8-fad6-47e1-9132-84abc478905f"
Pickle = "fbb45041-c46e-462f-888f-7c521cafbc2c"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
PrimitiveOneHot = "13d12f88-f12b-451e-9b9f-13b97e01cc85"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Requires = "ae029012-a4dd-5104-9daa-d747884805df"
SHA = "ea8e919c-243c-51af-8825-aaa63cd721ce"
Static = "aedffcd0-7271-4cad-89d0-dc628f76c6d3"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
@@ -45,45 +40,40 @@ TextEncodeBase = "f92c20c0-9f2a-4705-8116-881385faba05"
Unicode = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"
ValSplit = "0625e100-946b-11ec-09cd-6328dd093154"
WordTokenizers = "796a5d58-b03d-544a-977e-18100b691f6e"
ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"

[compat]
AbstractTrees = "0.3, 0.4.3"
Adapt = "3.3"
BSON = "0.3.4"
BytePairEncoding = "0.3"
CUDA = "3.10"
ChainRulesCore = "1.15"
DataDeps = "0.7"
DataStructures = "0.18"
DoubleArrayTries = "0.0.3"
Fetch = "0.1.3"
FillArrays = "0.13"
Flux = "0.13.4"
FuncPipelines = "0.2.3"
Functors = "0.2, 0.3"
Functors = "0.2, 0.3, 0.4"
HTTP = "0.9, 1"
HuggingFaceApi = "0.1"
InternedStrings = "0.7"
JSON = "0.21"
LightXML = "0.9"
MacroTools = "0.5"
NNlib = "0.8"
NNlibCUDA = "0.2"
NeuralAttentionlib = "0.1"
NeuralAttentionlib = "0.2.4"
Pickle = "0.3"
PrimitiveOneHot = "0.1"
Requires = "1"
Static = "0.7"
Static = "0.7, 0.8"
StringViews = "1"
StructWalk = "0.2"
TextEncodeBase = "0.5.11"
TextEncodeBase = "0.6"
ValSplit = "0.1"
WordTokenizers = "0.5.6"
ZipFile = "0.9"
julia = "1.6"

[extras]
ChainRulesTestUtils = "cdddcdb0-9152-4a09-a978-84456f9df70a"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"

[targets]
test = ["Test"]
test = ["Test", "Logging", "ZipFile", "ChainRulesTestUtils"]
88 changes: 15 additions & 73 deletions README.md
@@ -6,26 +6,13 @@

Julia implementation of [transformer](https://arxiv.org/abs/1706.03762)-based models, with [Flux.jl](https://github.com/FluxML/Flux.jl).

*Notice: the current version is almost completely different from the 0.1.x version. If you are using the old version, make sure to update your code for the changes or stick to the old version.*

# Installation

In the Julia REPL:

]add Transformers

For using GPU, install & build:

]add CUDA

]build

julia> using CUDA

julia> using Transformers

#run the model below
.
.
.


# Example
@@ -34,75 +21,30 @@ Using pretrained Bert with `Transformers.jl`.

```julia
using Transformers
using Transformers.Basic
using Transformers.Pretrain
using Transformers.TextEncoders
using Transformers.HuggingFace

ENV["DATADEPS_ALWAYS_ACCEPT"] = true
textencoder, bert_model = hgf"bert-base-cased"

bert_model, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"
vocab = Vocabulary(wordpiece)
text1 = "Peter Piper picked a peck of pickled peppers"
text2 = "Fuzzy Wuzzy was a bear"

text1 = "Peter Piper picked a peck of pickled peppers" |> tokenizer |> wordpiece
text2 = "Fuzzy Wuzzy was a bear" |> tokenizer |> wordpiece
text = [[ text1, text2 ]] # 1 batch of contiguous sentences
sample = encode(textencoder, text) # tokenize + pre-process (add special tokens + truncate / padding + one-hot encode)

text = ["[CLS]"; text1; "[SEP]"; text2; "[SEP]"]
@assert text == [
"[CLS]", "peter", "piper", "picked", "a", "peck", "of", "pick", "##led", "peppers", "[SEP]",
@assert reshape(decode(textencoder, sample.token), :) == [
"[CLS]", "peter", "piper", "picked", "a", "peck", "of", "pick", "##led", "peppers", "[SEP]",
"fuzzy", "wu", "##zzy", "was", "a", "bear", "[SEP]"
]

token_indices = vocab(text)
segment_indices = [fill(1, length(text1)+2); fill(2, length(text2)+1)]

sample = (tok = token_indices, segment = segment_indices)

bert_embedding = sample |> bert_model.embed
feature_tensors = bert_embedding |> bert_model.transformers
bert_features = bert_model(sample).hidden_state
```

See `example` folder for the complete example.


# Huggingface

We have some support for the models from [`huggingface/transformers`](https://github.com/huggingface/transformers).

```julia
using Transformers.HuggingFace

# loading a model from huggingface model hub
julia> model = hgf"bert-base-cased:forquestionanswering";
┌ Warning: Transformers.HuggingFace.HGFBertForQuestionAnswering doesn't have field cls.
└ @ Transformers.HuggingFace ~/peter/repo/gsoc2020/src/huggingface/models/models.jl:46
┌ Warning: Some fields of Transformers.HuggingFace.HGFBertForQuestionAnswering aren't initialized with loaded state: qa_outputs
└ @ Transformers.HuggingFace ~/peter/repo/gsoc2020/src/huggingface/models/models.jl:52

```

Current we only support a few model and the tokenizer part is not finished yet.


# For more information

If you want to know more about this package, see the [document](https://chengchingwen.github.io/Transformers.jl/dev/)
and the series of [blog posts](https://nextjournal.com/chengchingwen) I wrote for JSoC and GSoC. You can also
tag me (@chengchingwen) on Julia's slack or discourse if you have any questions, or just create a new Issue on GitHub.


# Roadmap

## What we have before v0.2

- `Transformer` and `TransformerDecoder` support for both 2d & 3d data.
- `PositionEmbedding` implementation.
- `Positionwise` for handling 2d & 3d input.
- docstring for most of the functions.
- runable examples (see `example` folder)
- `Transformers.HuggingFace` for handling pretrains from `huggingface/transformers`

## What we will have in v0.2.0

- Complete tokenizer APIs
- tutorials
- benchmarks
- more examples
If you want to know more about this package, see the [document](https://chengchingwen.github.io/Transformers.jl/dev/)
and read code in the `example` folder. You can also tag me (@chengchingwen) on Julia's slack or discourse if
you have any questions, or just create a new Issue on GitHub.