Commit 2024.01.02

huangrt01 committed Jan 1, 2025
1 parent 106dd70 commit be7935d
Showing 15 changed files with 156 additions and 6 deletions.
5 changes: 1 addition & 4 deletions Notes/AI-Algorithms.md
@@ -267,9 +267,7 @@
* Only 2 expert networks run at inference time
* More brain-like than GPT-3.5

## BERT

![image-20241019021744575](./AI-Algorithms/bert.png)

## GPT-2

@@ -3528,8 +3526,7 @@ https://webkul.com/ai-semantic-search-services/
* Advantages:
* Having the LLM extract concepts is simple
* No need to tune item embs (pretrained embs can be used directly)
* Drawback: one limitation is that lists of concepts are often a coarse representation of a conversation and similar to continuous bag-of-words methods [60] are lossy with respect to word order and other nuances of language, which can negatively affect retrieval quality.
* Thought: rank concepts by information value
* Search API Lookup
* Same advantages as concept-based search
80 changes: 80 additions & 0 deletions Notes/Machine-Learning.md
@@ -517,9 +517,76 @@ Training 量化
* [GELU](https://paperswithcode.com/method/gelu)
* GELUs are used in [GPT-3](https://paperswithcode.com/method/gpt-3), [BERT](https://paperswithcode.com/method/bert), and most other Transformers.

![image-20241019021744575](./Machine-Learning/bert.png)

#### Paper

* Intro
* BERT: Bidirectional Encoder Representations from Transformers.
* Task types: sentence-level / paraphrasing / token-level
* Approaches: feature-based and fine-tuning
* In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
* BERT addresses the previously mentioned unidirectional constraints by proposing new pre-training objectives:
* the "masked language model" (MLM)
* the "next sentence prediction" task

![image-20250102001058277](./Machine-Learning/image-20250102001058277.png)

![image-20250102001246772](./Machine-Learning/image-20250102001246772.png)

* Hyperparameters:
* BERTBASE: L=12, H=768, A=12, Total Parameters=110M
* BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M
* In all cases we set the feed-forward/filter size to be 4H
* Masking setup:
* Mask 15% of tokens; only the masked tokens are predicted
* Pre-training
* We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
* Use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, and dropout of 0.1.
* Fine-tuning
* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 3, 4

* Model
* Embedding initialization: We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. Split word pieces are denoted with ##.
* Design ideas:
* Motivation for masking: let the model see both sides of a token without leaking the answer
* Problem 1: mismatch between pre-training and fine-tuning ([MASK] never appears at fine-tuning time)
* Solution: the 80%/10%/10% split shown below ([MASK] / random token / unchanged); see the sketch after this list
* ![image-20250102001657033](./Machine-Learning/image-20250102001657033.png)
* Problem 2: only 15% of tokens in each batch are predicted, so training is costlier
* The quality gains outweigh the extra training cost
* Task type 2: next sentence prediction, with a 50/50 mix of true and random next sentences
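A minimal sketch of the 80/10/10 rule above (plain PyTorch, not the original BERT code; `vocab_size` and `mask_token_id` are placeholders, and special tokens such as [CLS]/[SEP] are not excluded here for brevity):

```python
import torch

def mask_tokens(input_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """BERT-style MLM masking: pick ~15% of positions to predict; of those,
    80% are replaced by [MASK], 10% by a random token, 10% are left unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob        # the 15% to predict
    labels[~selected] = -100                                   # ignored by the loss

    masked_inputs = input_ids.clone()
    use_mask = selected & (torch.rand(input_ids.shape) < 0.8)  # 80% of selected -> [MASK]
    masked_inputs[use_mask] = mask_token_id

    use_random = selected & ~use_mask & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    masked_inputs[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]
    # the remaining 10% of selected positions keep their original token
    return masked_inputs, labels
```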

* Comparison with GPT
* GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training
* BERT uses a larger training corpus and a larger batch size



#### model finetune

* Paper
* SQuAD task: learn a start vector and an end vector to predict the answer's start and end positions (see the sketch below)
* CoNLL 2003 Named Entity Recognition (NER) dataset
* SWAG task: choose one out of N candidate continuations
* Learn a single vector V used to score each candidate
* ![image-20250102002146508](./Machine-Learning/image-20250102002146508.png)

![image-20250102001936987](./Machine-Learning/image-20250102001936987.png)
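A minimal sketch of the SQuAD-style start/end vectors mentioned above (a hypothetical head on top of BERT's per-token outputs, not the paper's original code):

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Start/end span prediction as in BERT's SQuAD setup: a single linear layer
    yields one start logit and one end logit per token position."""
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)   # rows act as the start/end vectors

    def forward(self, sequence_output):               # (batch, seq_len, hidden_size)
        logits = self.qa_outputs(sequence_output)     # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        # softmax over seq_len (outside this module) gives start/end distributions
        return start_logits, end_logits
```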

* Model finetune builds on the strong general-purpose semantic ability of the pre-trained BERT model: training data from the concrete business scenario is used for fine-tuning, adjusting the network parameters in a targeted way. It is a typical two-stage approach. ([BERT在美团搜索核心排序的探索和实践](https://zhuanlan.zhihu.com/p/158181085))
* With the pre-trained BERT architecture relatively fixed, what engineers actually work on are the model's inputs and outputs. First understand what BERT sees during pre-training: the input is a fusion of token, segment and position embeddings (added or concatenated), with a leading [CLS] token and [SEP] separators marking sentence boundaries; the output is a representation vector at every position. The main fine-tuning setups are sentence-pair classification, single-sentence classification, question answering (QA) and single-sentence tagging. They differ in whether the input is a single sentence or a pair, and in whether the supervised output is the [CLS] representation used for classification, or parts of the per-token output (delimited by the separators) used for language prediction. A sketch of the sentence-pair (query-doc) case follows.
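A minimal sketch of that sentence-pair (query-doc) setup, assuming the Hugging Face `transformers` package is available; the model name and the one-layer head are illustrative, not the Meituan implementation:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed dependency

class RelevanceHead(nn.Module):
    """Score a query-doc pair: feed "[CLS] query [SEP] doc [SEP]" through BERT
    and put a linear classifier on the [CLS] representation."""
    def __init__(self, bert_name="bert-base-chinese"):   # placeholder checkpoint
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]             # the [CLS] position
        return self.classifier(cls_vec).squeeze(-1)       # one relevance score per pair

# usage sketch:
# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# batch = tokenizer(queries, docs, padding=True, truncation=True, return_tensors="pt")
# scores = RelevanceHead()(**batch)
```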
* Applying finetune in search: model finetune is used for query-doc semantic matching, i.e. the search relevance problem and the embedding service. After recall and pre-ranking, BERT returns a relevance score in the fine-ranking stage, which resembles a sentence-pair classification task. Search fine-tuning has the following characteristics:
@@ -587,6 +654,19 @@ Training 量化
* The former says that a document's word frequencies (rather than word order) represent its topic;
* The latter says that two words appearing in similar contexts have similar meanings.

#### Feature-based methods that use embeddings

* Earlier approaches
* non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006)
* neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods
* Various uses
* BERT
* ![image-20250102002230130](./Machine-Learning/image-20250102002230130.png)
* Applications
* These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). [BERT]
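A minimal sketch of the feature-based usage above: keep BERT frozen and export its hidden states as contextual features for a separate downstream model (assumes the Hugging Face `transformers` package; concatenating the last four layers mirrors the paper's feature-based NER experiments):

```python
import torch
from transformers import AutoTokenizer, AutoModel  # assumed dependency

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # frozen: used purely as a feature extractor

@torch.no_grad()
def contextual_features(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = bert(**batch, output_hidden_states=True)
    # concatenate the last four hidden layers per token
    feats = torch.cat(out.hidden_states[-4:], dim=-1)   # (batch, seq_len, 4 * hidden)
    return feats, batch["attention_mask"]
```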

#### Word2Vec: Efficient Estimation of Word Representations in Vector Space

Binary file added Notes/Machine-Learning/bert.png
10 changes: 10 additions & 0 deletions Notes/mathematics.md
@@ -525,6 +525,16 @@

https://arxiv.org/pdf/2206.13446

## Linear algebra

* $$E_{S}=E_CE_U$$ -> $$e_s^x={E_U}^Te_c^x$$
* Least squares:
* Consider the linear system $$Ax = b$$. When it has no solution (i.e. $$b$$ is not in the column space of $$A$$), we look for an $$\hat{x}$$ that minimizes $$\|Ax - b\|^2$$.
* Then $$\hat{x}=(A^TA)^{-1}A^Tb$$ and $$A\hat{x}=A(A^TA)^{-1}A^Tb$$.
* $$A\hat{x}$$ is the projection of $$b$$ onto the column space of $$A$$; among all possible $$Ax$$, it has the smallest error with respect to $$b$$.
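A small numerical check of the projection formula above (NumPy; the 3x2 system is an arbitrary example):

```python
import numpy as np

# Overdetermined system: b is generally not in the column space of A
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # (A^T A)^{-1} A^T b via the normal equations
proj = A @ x_hat                            # projection of b onto col(A)

# The residual b - proj is orthogonal to the column space of A
assert np.allclose(A.T @ (b - proj), 0.0)
# np.linalg.lstsq finds the same minimizer of ||Ax - b||^2
assert np.allclose(np.linalg.lstsq(A, b, rcond=None)[0], x_hat)
```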

* $$E_U^T E_U=\left(U_1,U_2,U_3, ... U_u\right)\left(\begin{array}{c}{U_1}^T\\{U_2}^T\\{U_3}^T\\...\\{U_u}^T\end{array}\right) = \sum_{i=1}^uU_i{U_i}^T$$

## SVD and matrix factorization

### Application
35 changes: 35 additions & 0 deletions Notes/snippets/pytorch-model.py
@@ -304,3 +304,38 @@ def forward(self, timestamps: List) -> torch.tensor:
        # (bs, seq, time_dim * time_num) -> (bs, seq, user_dim)
        time_emb = self.merge_time(time_emb)
        return time_emb


### Attention Pooling

class AdditiveAttention(nn.Module):
    '''Attention pooling used to aggregate news vectors by weighted sum.
    Args:
        d_h: the last dimension of the input
    '''
    def __init__(self, d_h, hidden_size=200):
        super(AdditiveAttention, self).__init__()
        self.att_fc1 = nn.Linear(d_h, hidden_size)
        self.att_fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x, attn_mask=None):
        """
        Args:
            x: batch_size, candidate_size, candidate_vector_dim
            attn_mask: batch_size, candidate_size
        Returns:
            (shape) batch_size, candidate_vector_dim
        """
        bz = x.shape[0]
        e = self.att_fc1(x)        # (bz, candidate_size, hidden_size)
        e = torch.tanh(e)
        alpha = self.att_fc2(e)    # (bz, candidate_size, 1) unnormalized attention scores

        # masked softmax over the candidate dimension
        alpha = torch.exp(alpha)
        if attn_mask is not None:
            alpha = alpha * attn_mask.unsqueeze(2)
        alpha = alpha / (torch.sum(alpha, dim=1, keepdim=True) + 1e-8)

        x = torch.bmm(x.permute(0, 2, 1), alpha)   # weighted sum of candidate vectors
        x = torch.reshape(x, (bz, -1))             # (bz, candidate_vector_dim)
        return x
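
# Usage sketch (hypothetical shapes; assumes torch/nn are imported at the top of
# this snippet file): pool 8 candidate vectors of dim 400 into one 400-dim vector.
pool = AdditiveAttention(d_h=400, hidden_size=200)
news_vecs = torch.randn(32, 8, 400)   # (batch_size, candidate_size, candidate_vector_dim)
mask = torch.ones(32, 8)              # 1 = valid candidate, 0 = padded slot
pooled = pool(news_vecs, mask)        # -> (32, 400)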
4 changes: 2 additions & 2 deletions Notes/snippets/rs-matrix-factorization.py
@@ -1,5 +1,5 @@
### Intro
# Libraries: Surprise, LightFM, and implicit


### ALS + SGD
@@ -77,7 +77,7 @@ def sgd_als(user_item_matrix, num_factors, learning_rate, regularization, iterat
    num_users, num_items = user_item_matrix.shape
    errors = []  # To store RMSE after each iteration

    # Initialize user and item latent factor matrices with small random values (drawn from a normal distribution)
    print("init user and item latent factors")
    user_factors = np.random.normal(scale=1./num_factors, size=(num_users, num_factors))
    item_factors = np.random.normal(scale=1./num_factors, size=(num_items, num_factors))
28 changes: 28 additions & 0 deletions Notes/深度学习推荐系统—王喆.md
@@ -85,6 +85,15 @@

Evolution and relationship diagram of the models (p. 13)

### Comparison of methods

| | Information | Mathematical assumption | Modeling approach | Pros | Cons |
| ------------------------- | ---------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| UserCF | $$E_C$$ | $$d=N, E_U={E_C}^T$$ | Use the User Embedding to retrieve similar Users, then take those Users' Bhv Seq (Behaviour Sequence) | | Too many users to store the matrix; unsuitable when positive feedback is hard to obtain and user-history vectors are sparse |
| ItemCF | $$E_C$$ | $$d=u, E_S=E_C$$ | Use the User Bhv Seq Embedding to retrieve similar Item Embeddings | | Too many items to store the matrix; noticeable head (popularity) effect |
| MF (Matrix Factorization) | $$E_C$$ | $$E_C=E_S{E_U}^T$$ | Use the User Embedding to retrieve similar Item Embeddings | Strong generalization; low space complexity; better scalability and flexibility | Like UserCF/ItemCF, hard to incorporate context information |
| STAR | $$E_C$$, $$E_S$$ | $$E_S^u=\frac{1}{n} (\bold{r\lambda^t})_u^T E_S$$, $$E_C^u=\frac{1}{n} (\bold{r\lambda^t})_u^T E_C$$ | User Bhv Seq -> User Bhv Emb; retrieve with the User Bhv Emb and merge two routes: route 1 retrieves similar Items with $${E_C}^u$$, route 2 with $${E_S}^u$$. Note: the STAR paper's implementation does not actually merge two routes but brute-forces over all items, and applies a recency decay along the User Bhv Seq | Introduces pretrained Item Embeddings, adding semantic information; explicitly brings in collaborative-filtering signal | High computational complexity, challenging to run online; retrieval based only on the User Bhv Seq risks an information cocoon (filter bubble) |
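A minimal sketch of the retrieval pattern shared by the rows above: score a user-side embedding against all item embeddings by inner product and take the top-k (NumPy; shapes are illustrative):

```python
import numpy as np

def topk_items(user_emb, item_embs, k=10):
    """user_emb: (d,), item_embs: (num_items, d) -> indices of the top-k items."""
    scores = item_embs @ user_emb        # inner-product relevance scores
    return np.argsort(-scores)[:k]

# illustrative shapes: 1000 items with 32-dim factors (e.g. from MF)
item_embs = np.random.randn(1000, 32)
user_emb = np.random.randn(32)
print(topk_items(user_emb, item_embs, k=5))
```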

### Collaborative filtering (CF)

[A detailed introductory article](https://zhuanlan.zhihu.com/p/80069337)
@@ -446,6 +455,25 @@ there indeed exists massive noise in original long-term behavior sequences which
* Significance: turns a static setup into a dynamic one, making model learning more real-time
* A trade-off between "heavyweight" and "real-time"



## News recommendation

### Empowering News Recommendation with Pre-trained Language Models

* Model

![image-20250102003946208](./%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E2%80%94%E7%8E%8B%E5%96%86/image-20250102003946208.png)

![image-20250102004310634](./%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E2%80%94%E7%8E%8B%E5%96%86/image-20250102004310634.png)

* Training
* used the titles of news articles as input
* finetuned the last two Transformer layers
* Findings:
* attention pooling > avg pooling > CLS token embedding
* https://github.com/wuch15/PLM4NewsRec/blob/main/model_bert.py

## Traditional search

* History: https://www.vantagediscovery.com/post/ecommerce-search-transcended-for-the-ai-age
