Commit 2024.12.10
huangrt01 committed Dec 9, 2024
1 parent 0dd4aaa commit 7746c82
Showing 37 changed files with 867 additions and 693 deletions.
786 changes: 750 additions & 36 deletions Notes/AIGC-Algorithms.md

Binary file added Notes/AIGC-Algorithms/640-20241021002249376
Binary file added Notes/AIGC-Algorithms/GNN.png
Binary file added Notes/AIGC-Algorithms/TAG.png
Binary file added Notes/AIGC-Algorithms/graph-retrieval.png
Binary file added Notes/AIGC-Algorithms/lightrag-example.png
Binary file added Notes/AIGC-Algorithms/lightrag.png
Binary file added Notes/AIGC-Algorithms/mteb.png
Binary file added Notes/AIGC-Algorithms/pai-rag.png
Binary file added Notes/AIGC-Algorithms/rag-fusion.jpeg
Binary file added Notes/AIGC-Algorithms/rag.png
Binary file added Notes/AIGC-Algorithms/sbert-rerank.png
Binary file added Notes/AIGC-Algorithms/table_rag.png
Binary file added Notes/AIGC-Algorithms/vectordb.png
626 changes: 0 additions & 626 deletions Notes/AIGC.md


31 changes: 4 additions & 27 deletions Notes/Editor.md
@@ -270,34 +270,11 @@ Advanced Text Objects - Text objects like searches can also be composed with vim

* Open the terminal: Ctrl + `

#### Plugins

See "dotfiles: README"


* Common:
  * Code Spell Checker, GitLens, EditorConfig for VSCode, String Manipulation, Visual Studio IntelliCode
  * Code Runner
  * Remote - SSH

* C++: [cpplint](https://github.com/cpplint/cpplint), CodeLLDB, Header source switch, Rainbow Brackets, C++ Intellisense
  * CMake, CMake Tools
    * The CMake plugin requires CMake 3.9.4 or newer; users on Ubuntu 16.04 and earlier may need to install a newer CMake manually
  * Clangd
    * [VSCode C/C++ development and debugging setup: Clangd + CodeLLDB](https://zhangjk98.xyz/vscode-c-and-cpp-develop-and-debug-setting/)
    * Tips for reading very large codebases
      * Disable editor.formatOnSave, clang-tidy, all-scopes-completion
    * CodeLLDB usage (TODO)
  * Once clangd is set up, the Microsoft C/C++ extension is no longer needed

* Tabnine: AI-powered autocompletion backed by GPT
* Peacock: gives each workspace its own color
* Increase the memory available to the C++ extension:
  * ![c++](Editor/vscode-c++.png)
  * [How to elegantly develop large C++ projects in VSCode?](https://www.zhihu.com/question/353722203/answer/2564104885)
* LSP (language service provider): vscode clangd
  * Depends on compile_commands.json
    * Generated automatically by the build, or manually: `mkdir build && cd build && cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 ..`
  * If the system glibc is old, patch clangd (with patchelf) to link against a newer glibc


#### Format
12 changes: 12 additions & 0 deletions Notes/MLSys.md
@@ -244,6 +244,18 @@ val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
* Int8
* PQ

### Retrieval acceleration

* Tree-based
  * KD-Tree
  * Annoy: https://github.com/spotify/annoy
* Hashing
  * Locality-Sensitive Hashing: https://falconn-lib.org/

* PQ (Product Quantization)
  * https://github.com/facebookresearch/faiss
* Learning to hash
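As an illustration of the hashing family above, here is a minimal random-projection LSH sketch in NumPy. This is a toy example (all names and dimensions are made up); production systems would use a library such as FALCONN or Faiss:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 16
planes = rng.normal(size=(n_planes, dim))  # one random hyperplane per hash bit

def lsh_signature(v):
    # The sign of the projection onto each hyperplane gives one bit of the hash.
    return (planes @ v > 0).astype(int)

def hamming(s, t):
    return int(np.sum(s != t))

a = rng.normal(size=dim)
b = a + 0.01 * rng.normal(size=dim)  # near-duplicate of a
c = rng.normal(size=dim)             # unrelated vector

# Near-duplicates agree on (almost) all signature bits;
# unrelated vectors agree on roughly half of them.
print(hamming(lsh_signature(a), lsh_signature(b)))
print(hamming(lsh_signature(a), lsh_signature(c)))
```

Candidates are then retrieved by exact lookup on the signature (or on bands of it), turning nearest-neighbor search into hash-bucket probes.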

#### Semantic search

* [OpenAI Embedding Model](https://openai.com/blog/new-and-improved-embedding-model/)
49 changes: 45 additions & 4 deletions Notes/Machine-Learning.md
@@ -279,6 +279,50 @@ train_data, validation_data, test_data = np.split(model_data.sample(frac=1, rand
* on the diagram of thought https://github.com/diagram-of-thought/diagram-of-thought


### Learning To Rank

#### XGBoost

https://xgboost.readthedocs.io/en/stable/tutorials/model.html

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper *Greedy Function Approximation: A Gradient Boosting Machine*, by Friedman.

![image-20241210004602072](./Machine-Learning/image-20241210004602072.png)

![image-20241210004925060](./Machine-Learning/image-20241210004925060.png)

![image-20241210004943879](./Machine-Learning/image-20241210004943879.png)

![image-20241210005132323](./Machine-Learning/image-20241210005132323.png)

![illustration of structure score (fitness)](./Machine-Learning/struct_score.png)
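Written out (following the XGBoost "Introduction to Boosted Trees" tutorial that the screenshots above come from), the second-order objective, the optimal leaf weight, and the structure score are:

```latex
% Second-order approximation of the objective at boosting step t,
% with regularization \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2:
\text{obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i over the
% instances I_j assigned to leaf j, the optimal leaf weight and the
% structure score (the "fitness" in the figure) are:
w_j^{\ast} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{\ast} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

A smaller structure score means a better tree structure, so split candidates are ranked by the gain in this score.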



#### LTR

https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html

* Intro
* The default objective is `rank:ndcg`, based on the `LambdaMART` [[2]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) algorithm, which in turn is an adaptation of the `LambdaRank` [[3]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) framework to gradient boosting trees. For a history and a summary of the algorithm, see [[5]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references).
* 《Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm》
* 调参
* lambdarank_num_pair_per_sample
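Since the default objective optimizes NDCG, a minimal pure-Python sketch of the metric itself (the exponential-gain variant commonly used in LTR) may help make the target concrete:

```python
import math

def dcg(rels):
    # Discounted cumulative gain: exponential gain, log2 position discount.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # decent but imperfect ranking: between 0 and 1
print(ndcg([3, 3, 2, 1, 0]))  # ideal ordering scores 1.0
```

LambdaMART never computes gradients of NDCG directly (it is non-differentiable); instead it weights pairwise gradients by the NDCG change from swapping the two documents.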

### Position Bias

* Intro

* Obtaining real relevance degrees for query results is an expensive and strenuous task, requiring human labelers to label all results one by one. When such a labeling task is infeasible, we might want to train the learning-to-rank model on user click data instead, as it is relatively easy to collect. Another advantage of using click data directly is that it can reflect the most up-to-date user preferences [[1]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references). However, user clicks are often biased, as users tend to choose results displayed in higher positions. User clicks are also noisy: users might accidentally click on irrelevant documents. To ameliorate these issues, XGBoost implements the `Unbiased LambdaMART` [[4]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) algorithm to debias position-dependent click data. The feature can be enabled by the `lambdarank_unbiased` parameter; see [Parameters for learning to rank (rank:ndcg, rank:map, rank:pairwise)](https://xgboost.readthedocs.io/en/stable/parameter.html#ltr-param) for related options and [Getting started with learning to rank](https://xgboost.readthedocs.io/en/stable/python/examples/learning_to_rank.html#sphx-glr-python-examples-learning-to-rank-py) for a worked example with simulated user clicks.
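A hedged sketch of what enabling this might look like. The parameter names come from the XGBoost LTR documentation quoted above; the values are illustrative, not recommendations:

```python
# Parameter sketch only -- names follow the XGBoost LTR documentation;
# the values here are illustrative, not tuned recommendations.
params = {
    "objective": "rank:ndcg",             # default LTR objective (LambdaMART)
    "lambdarank_unbiased": True,          # debias position-dependent clicks
    "lambdarank_num_pair_per_sample": 8,  # pairs sampled per document (pair method "mean")
    "eval_metric": "ndcg@10",
}
print(params["objective"])
```

These would be passed to `xgboost.train` together with a `DMatrix` carrying query-group information.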

### Quantization

#### 模型量化介绍
@@ -351,7 +395,7 @@ Training quantization
#### 训练 Dense Retriever

* Query2Doc paper
* For training dense retrievers, several factors can influence the final performance, such as hard negative mining (Xiong et al., 2021), intermediate pretraining (Gao and Callan, 2021), and knowledge distillation from a cross-encoder based re-ranker (Qu et al., 2021). In this paper, we investigate two settings to gain a more comprehensive understanding of our method. The first setting is training DPR (Karpukhin et al., 2020) models initialized from BERTbase with BM25 hard negatives only
* ![image-20241117211622999](Machine-Learning/image-20241117211622999.png)


@@ -601,10 +645,7 @@ def find_most_similar(input_word):



#### Image-based image search

* Aliyun
* https://help.aliyun.com/zh/image-search/developer-reference/api-searchbypic?spm=a2c4g.11186623.help-menu-66413.d_4_3_1_3.7538364fjOQka0&scm=20140722.H_202282._.OR_help-V_1



Binary file added Notes/Machine-Learning/struct_score.png
56 changes: 56 additions & 0 deletions Notes/snippets/python-multiprocess.py
@@ -0,0 +1,56 @@
# - Basic implementation: https://stackoverflow.com/questions/67363793/correct-way-to-implement-producer-consumer-pattern-in-python-multiprocessing-pool
# - Fixing the logging deadlock:
#   - Industrial-strength version: https://blog.csdn.net/weixin_68789096/article/details/135546285
#   - Python's process pool is fork-based. When a child process is created with
#     fork() alone (without execve() to replace the process image), it shares the
#     parent's memory space, but only the forking thread survives: the parent's
#     other threads are not copied into the child. This is exactly what causes
#     the deadlock:
#   - if a parent thread holds the logging queue's lock at the moment of fork(),
#     the child inherits the lock in its "held" state; the owning thread does not
#     exist in the child, so when the child tries to log it blocks forever on a
#     lock that will never be released.
# - Basic version: https://blog.51cto.com/u_16175479/8903194
#   - To let consumers exit cleanly, use Manager().Queue() instead of JoinableQueue
#   - https://stackoverflow.com/questions/45866698/multiprocessing-processes-wont-join


import queue
import random
from multiprocessing import Process, set_start_method, Manager # JoinableQueue


def consumer(q):  # q is a Manager().Queue() proxy
while True:
try:
res = q.get(block=False)
print(f'Consume {res}')
q.task_done()
except queue.Empty:
pass


def producer(q, food):
for i in range(2):
res = f'{food} {i}'
print(f'Produce {res}')
q.put(res)
q.join()


if __name__ == "__main__":
set_start_method('spawn')
foods = ['apple', 'banana', 'melon', 'salad']
jobs = 2
q = Manager().Queue(maxsize=1024)

producers = [
Process(target=producer, args=(q, random.choice(foods)))
for _ in range(jobs)
]

# daemon=True is important here
consumers = [
Process(target=consumer, args=(q, ), daemon=True)
for _ in range(jobs * 2)
]

# + order here doesn't matter
for p in consumers + producers:
p.start()

for p in producers:
p.join()
File renamed without changes.
