Commit 2024.01.02

huangrt01 committed Jan 1, 2025
1 parent 106dd70 commit be7935d
Showing 15 changed files with 156 additions and 6 deletions.
5 changes: 1 addition & 4 deletions Notes/AI-Algorithms.md
@@ -267,9 +267,7 @@
* Only 2 expert networks run at inference time
* More brain-like than GPT-3.5

## BERT

![image-20241019021744575](./AI-Algorithms/bert.png)

## GPT-2

@@ -3528,8 +3526,7 @@ https://webkul.com/ai-semantic-search-services/
* Advantages:
* Having the LLM extract concepts is simple
* No need to tune item embs (pretrained embs can be used directly)
* Drawback: one limitation is that lists of concepts are often a coarse representation of a conversation and similar to continuous bag-of-words methods [60] are lossy with respect to word order and other nuances of language, which can negatively affect retrieval quality.
* Thought: rank concepts by information value
* Search API Lookup
* Same advantages as concept-based search
80 changes: 80 additions & 0 deletions Notes/Machine-Learning.md
@@ -517,9 +517,76 @@ Training 量化
* [GELU](https://paperswithcode.com/method/gelu)
* GELUs are used in [GPT-3](https://paperswithcode.com/method/gpt-3), [BERT](https://paperswithcode.com/method/bert), and most other Transformers.

![image-20241019021744575](./Machine-Learning/bert.png)

#### Paper

* Intro
* BERT: Bidirectional Encoder Representations from Transformers.
* Task types: sentence-level / paraphrasing / token-level
* Approaches: feature-based and fine-tuning
* In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.
* BERT addresses the previously mentioned unidirectional constraints by proposing new pre-training objectives:
* the "masked language model" (MLM)
* the "next sentence prediction" task

![image-20250102001058277](./Machine-Learning/image-20250102001058277.png)

![image-20250102001246772](./Machine-Learning/image-20250102001246772.png)

* Hyperparameters:
* BERTBASE: L=12, H=768, A=12, Total Parameters=110M
* BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M
* In all cases we set the feed-forward/filter size to be 4H
* Masking setup:
* Mask 15% of tokens; only the masked tokens are predicted
* Pre-training
* We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
* Use Adam with learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, and dropout of 0.1.
* Fine-tuning
* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 3, 4

* Model
* Embedding initialization: We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. Split word pieces are denoted with ##.
* Design ideas:
* Motivation for masking: let the model see both sides of a token without leaking the answer
* Problem 1: mismatch between pre-training and fine-tuning ([MASK] never appears at fine-tuning time)
* Solution: the 80%/10%/10% split shown below ([MASK] / random token / unchanged); see the sketch after this list
* ![image-20250102001657033](./Machine-Learning/image-20250102001657033.png)
* Problem 2: only 15% of tokens in each batch are predicted, so training is costlier
* The quality gains outweigh the extra training cost
* Task type 2: next sentence prediction, with a 50/50 mix of true and random next sentences
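A minimal sketch of the 80/10/10 rule above (plain PyTorch, not the original BERT code; `vocab_size` and `mask_token_id` are placeholders, and special tokens such as [CLS]/[SEP] are not excluded here for brevity):

```python
import torch

def mask_tokens(input_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """BERT-style MLM masking: pick ~15% of positions to predict; of those,
    80% are replaced by [MASK], 10% by a random token, 10% are left unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob        # the 15% to predict
    labels[~selected] = -100                                   # ignored by the loss

    masked_inputs = input_ids.clone()
    use_mask = selected & (torch.rand(input_ids.shape) < 0.8)  # 80% of selected -> [MASK]
    masked_inputs[use_mask] = mask_token_id

    use_random = selected & ~use_mask & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    masked_inputs[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]
    # the remaining 10% of selected positions keep their original token
    return masked_inputs, labels
```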

* Comparison with GPT
* GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training
* BERT uses a larger training corpus and a larger batch size



#### model finetune

* Paper
* SQuAD task: learn a start vector and an end vector to predict the answer's start and end positions (see the sketch below)
* CoNLL 2003 Named Entity Recognition (NER) dataset
* SWAG task: choose one out of N candidate continuations
* Learn a single vector V used to score each candidate
* ![image-20250102002146508](./Machine-Learning/image-20250102002146508.png)

![image-20250102001936987](./Machine-Learning/image-20250102001936987.png)
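A minimal sketch of the SQuAD-style start/end vectors mentioned above (a hypothetical head on top of BERT's per-token outputs, not the paper's original code):

```python
import torch.nn as nn

class SpanHead(nn.Module):
    """Start/end span prediction as in BERT's SQuAD setup: a single linear layer
    yields one start logit and one end logit per token position."""
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)   # rows act as the start/end vectors

    def forward(self, sequence_output):               # (batch, seq_len, hidden_size)
        logits = self.qa_outputs(sequence_output)     # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        # softmax over seq_len (outside this module) gives start/end distributions
        return start_logits, end_logits
```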

* Model finetune builds on the strong general-purpose semantic ability of the pre-trained BERT model: training data from the concrete business scenario is used for fine-tuning, adjusting the network parameters in a targeted way. It is a typical two-stage approach. ([BERT在美团搜索核心排序的探索和实践](https://zhuanlan.zhihu.com/p/158181085))
* With the pre-trained BERT architecture relatively fixed, what engineers actually work on are the model's inputs and outputs. First understand what BERT sees during pre-training: the input is a fusion of token, segment and position embeddings (added or concatenated), with a leading [CLS] token and [SEP] separators marking sentence boundaries; the output is a representation vector at every position. The main fine-tuning setups are sentence-pair classification, single-sentence classification, question answering (QA) and single-sentence tagging. They differ in whether the input is a single sentence or a pair, and in whether the supervised output is the [CLS] representation used for classification, or parts of the per-token output (delimited by the separators) used for language prediction. A sketch of the sentence-pair (query-doc) case follows.
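A minimal sketch of that sentence-pair (query-doc) setup, assuming the Hugging Face `transformers` package is available; the model name and the one-layer head are illustrative, not the Meituan implementation:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer  # assumed dependency

class RelevanceHead(nn.Module):
    """Score a query-doc pair: feed "[CLS] query [SEP] doc [SEP]" through BERT
    and put a linear classifier on the [CLS] representation."""
    def __init__(self, bert_name="bert-base-chinese"):   # placeholder checkpoint
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0]             # the [CLS] position
        return self.classifier(cls_vec).squeeze(-1)       # one relevance score per pair

# usage sketch:
# tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# batch = tokenizer(queries, docs, padding=True, truncation=True, return_tensors="pt")
# scores = RelevanceHead()(**batch)
```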
* Applying finetune in search: model finetune is used for query-doc semantic matching, i.e. the search relevance problem and the embedding service. After recall and pre-ranking, BERT returns a relevance score in the fine-ranking stage, which resembles a sentence-pair classification task. Search fine-tuning has the following characteristics:
@@ -587,6 +654,19 @@ Training 量化
* The former says that a document's word frequencies (rather than word order) represent its topic;
* The latter says that two words appearing in similar contexts have similar meanings.

#### Feature-based methods that use embeddings

* Earlier approaches
* non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006)
* neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods
* Various uses
* BERT
* ![image-20250102002230130](./Machine-Learning/image-20250102002230130.png)
* Applications
* These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). [BERT]
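A minimal sketch of the feature-based usage above: keep BERT frozen and export its hidden states as contextual features for a separate downstream model (assumes the Hugging Face `transformers` package; concatenating the last four layers mirrors the paper's feature-based NER experiments):

```python
import torch
from transformers import AutoTokenizer, AutoModel  # assumed dependency

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()  # frozen: used purely as a feature extractor

@torch.no_grad()
def contextual_features(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = bert(**batch, output_hidden_states=True)
    # concatenate the last four hidden layers per token
    feats = torch.cat(out.hidden_states[-4:], dim=-1)   # (batch, seq_len, 4 * hidden)
    return feats, batch["attention_mask"]
```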

#### Word2Vec: Efficient Estimation of Word Representations in Vector Space

Binary file added Notes/Machine-Learning/bert.png
10 changes: 10 additions & 0 deletions Notes/mathematics.md
@@ -525,6 +525,16 @@

https://arxiv.org/pdf/2206.13446

## Linear algebra

* $$E_{S}=E_CE_U$$ -> $$e_s^x={E_U}^Te_c^x$$
* Least squares:
* Consider the linear system $$Ax = b$$. When it has no solution (i.e. $$b$$ is not in the column space of $$A$$), we look for an $$\hat{x}$$ that minimizes $$\|Ax - b\|^2$$.
* Then $$\hat{x}=(A^TA)^{-1}A^Tb$$ and $$A\hat{x}=A(A^TA)^{-1}A^Tb$$.
* $$A\hat{x}$$ is the projection of $$b$$ onto the column space of $$A$$; among all possible $$Ax$$, it has the smallest error with respect to $$b$$.
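A small numerical check of the projection formula above (NumPy; the 3x2 system is an arbitrary example):

```python
import numpy as np

# Overdetermined system: b is generally not in the column space of A
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # (A^T A)^{-1} A^T b via the normal equations
proj = A @ x_hat                            # projection of b onto col(A)

# The residual b - proj is orthogonal to the column space of A
assert np.allclose(A.T @ (b - proj), 0.0)
# np.linalg.lstsq finds the same minimizer of ||Ax - b||^2
assert np.allclose(np.linalg.lstsq(A, b, rcond=None)[0], x_hat)
```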

* $$E_U^T E_U=\left(U_1,U_2,U_3, ... U_u\right)\left(\begin{array}{c}{U_1}^T\\{U_2}^T\\{U_3}^T\\...\\{U_u}^T\end{array}\right) = \sum_{i=1}^uU_i{U_i}^T$$

## SVD and matrix factorization

### Application
35 changes: 35 additions & 0 deletions Notes/snippets/pytorch-model.py
@@ -304,3 +304,38 @@ def forward(self, timestamps: List) -> torch.tensor:
        # (bs, seq, time_dim * time_num) -> (bs, seq, user_dim)
        time_emb = self.merge_time(time_emb)
        return time_emb


### Attention Pooling

class AdditiveAttention(nn.Module):
    '''Attention pooling used to aggregate news vectors by weighted sum.
    Args:
        d_h: the last dimension of the input
    '''
    def __init__(self, d_h, hidden_size=200):
        super(AdditiveAttention, self).__init__()
        self.att_fc1 = nn.Linear(d_h, hidden_size)
        self.att_fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x, attn_mask=None):
        """
        Args:
            x: batch_size, candidate_size, candidate_vector_dim
            attn_mask: batch_size, candidate_size
        Returns:
            (shape) batch_size, candidate_vector_dim
        """
        bz = x.shape[0]
        e = self.att_fc1(x)        # (bz, candidate_size, hidden_size)
        e = torch.tanh(e)
        alpha = self.att_fc2(e)    # (bz, candidate_size, 1) unnormalized attention scores

        # masked softmax over the candidate dimension
        alpha = torch.exp(alpha)
        if attn_mask is not None:
            alpha = alpha * attn_mask.unsqueeze(2)
        alpha = alpha / (torch.sum(alpha, dim=1, keepdim=True) + 1e-8)

        x = torch.bmm(x.permute(0, 2, 1), alpha)   # weighted sum of candidate vectors
        x = torch.reshape(x, (bz, -1))             # (bz, candidate_vector_dim)
        return x
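
# Usage sketch (hypothetical shapes; assumes torch/nn are imported at the top of
# this snippet file): pool 8 candidate vectors of dim 400 into one 400-dim vector.
pool = AdditiveAttention(d_h=400, hidden_size=200)
news_vecs = torch.randn(32, 8, 400)   # (batch_size, candidate_size, candidate_vector_dim)
mask = torch.ones(32, 8)              # 1 = valid candidate, 0 = padded slot
pooled = pool(news_vecs, mask)        # -> (32, 400)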
4 changes: 2 additions & 2 deletions Notes/snippets/rs-matrix-factorization.py
@@ -1,5 +1,5 @@
### Intro
# Libraries: Surprise, LightFM, and implicit


### ALS + SGD
@@ -77,7 +77,7 @@ def sgd_als(user_item_matrix, num_factors, learning_rate, regularization, iterat
    num_users, num_items = user_item_matrix.shape
    errors = []  # To store RMSE after each iteration

    # Initialize user and item latent factor matrices with small random values (drawn from a normal distribution)
    print("init user and item latent factors")
    user_factors = np.random.normal(scale=1./num_factors, size=(num_users, num_factors))
    item_factors = np.random.normal(scale=1./num_factors, size=(num_items, num_factors))
28 changes: 28 additions & 0 deletions Notes/深度学习推荐系统—王喆.md
@@ -85,6 +85,15 @@

Evolution and relationship diagram of the models (p. 13)

### Comparison of methods

| | Information | Mathematical assumption | Modeling approach | Pros | Cons |
| ------------------------- | ---------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| UserCF | $$E_C$$ | $$d=N, E_U={E_C}^T$$ | Use the User Embedding to retrieve similar Users, then take those Users' Bhv Seq (Behaviour Sequence) | | Too many users to store the matrix; unsuitable when positive feedback is hard to obtain and user-history vectors are sparse |
| ItemCF | $$E_C$$ | $$d=u, E_S=E_C$$ | Use the User Bhv Seq Embedding to retrieve similar Item Embeddings | | Too many items to store the matrix; noticeable head (popularity) effect |
| MF (Matrix Factorization) | $$E_C$$ | $$E_C=E_S{E_U}^T$$ | Use the User Embedding to retrieve similar Item Embeddings | Strong generalization; low space complexity; better scalability and flexibility | Like UserCF/ItemCF, hard to incorporate context information |
| STAR | $$E_C$$, $$E_S$$ | $$E_S^u=\frac{1}{n} (\bold{r\lambda^t})_u^T E_S$$, $$E_C^u=\frac{1}{n} (\bold{r\lambda^t})_u^T E_C$$ | User Bhv Seq -> User Bhv Emb; retrieve with the User Bhv Emb and merge two routes: route 1 retrieves similar Items with $${E_C}^u$$, route 2 with $${E_S}^u$$. Note: the STAR paper's implementation does not actually merge two routes but brute-forces over all items, and applies a recency decay along the User Bhv Seq | Introduces pretrained Item Embeddings, adding semantic information; explicitly brings in collaborative-filtering signal | High computational complexity, challenging to run online; retrieval based only on the User Bhv Seq risks an information cocoon (filter bubble) |
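A minimal sketch of the retrieval pattern shared by the rows above: score a user-side embedding against all item embeddings by inner product and take the top-k (NumPy; shapes are illustrative):

```python
import numpy as np

def topk_items(user_emb, item_embs, k=10):
    """user_emb: (d,), item_embs: (num_items, d) -> indices of the top-k items."""
    scores = item_embs @ user_emb        # inner-product relevance scores
    return np.argsort(-scores)[:k]

# illustrative shapes: 1000 items with 32-dim factors (e.g. from MF)
item_embs = np.random.randn(1000, 32)
user_emb = np.random.randn(32)
print(topk_items(user_emb, item_embs, k=5))
```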

### Collaborative filtering (CF)

[A detailed introductory article](https://zhuanlan.zhihu.com/p/80069337)
@@ -446,6 +455,25 @@ there indeed exists massive noise in original long-term behavior sequences which
* Significance: turns a static setup into a dynamic one, making model learning more real-time
* A trade-off between "heavyweight" and "real-time"



## News recommendation

### Empowering News Recommendation with Pre-trained Language Models

* Model

![image-20250102003946208](./%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E2%80%94%E7%8E%8B%E5%96%86/image-20250102003946208.png)

![image-20250102004310634](./%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E6%8E%A8%E8%8D%90%E7%B3%BB%E7%BB%9F%E2%80%94%E7%8E%8B%E5%96%86/image-20250102004310634.png)

* Training
* used the titles of news articles as input
* finetuned the last two Transformer layers
* Findings:
* attention pooling > avg pooling > CLS token embedding
* https://github.com/wuch15/PLM4NewsRec/blob/main/model_bert.py

## Traditional search

* History: https://www.vantagediscovery.com/post/ecommerce-search-transcended-for-the-ai-age
