Commit 2024.12.10
huangrt01 committed Dec 9, 2024
1 parent 0dd4aaa commit 7746c82
Showing 37 changed files with 867 additions and 693 deletions.
786 changes: 750 additions & 36 deletions Notes/AIGC-Algorithms.md

Binary file added Notes/AIGC-Algorithms/640-20241021002249376
Binary file added Notes/AIGC-Algorithms/GNN.png
Binary file added Notes/AIGC-Algorithms/TAG.png
Binary file added Notes/AIGC-Algorithms/graph-retrieval.png
Binary file added Notes/AIGC-Algorithms/lightrag-example.png
Binary file added Notes/AIGC-Algorithms/lightrag.png
Binary file added Notes/AIGC-Algorithms/mteb.png
Binary file added Notes/AIGC-Algorithms/pai-rag.png
Binary file added Notes/AIGC-Algorithms/rag-fusion.jpeg
Binary file added Notes/AIGC-Algorithms/rag.png
Binary file added Notes/AIGC-Algorithms/sbert-rerank.png
Binary file added Notes/AIGC-Algorithms/table_rag.png
Binary file added Notes/AIGC-Algorithms/vectordb.png
626 changes: 0 additions & 626 deletions Notes/AIGC.md


31 changes: 4 additions & 27 deletions Notes/Editor.md
@@ -270,34 +270,11 @@ Advanced Text Objects - Text objects like searches can also be composed with vim

* Open the terminal: Ctrl + `

#### Plugins

See "dotfiles: README"


* Common:
  * Code Spell Checker, GitLens, EditorConfig for VSCode, String Manipulation, Visual Studio IntelliCode
  * Code Runner
  * Remote - SSH

* C++: [cpplint](https://github.com/cpplint/cpplint), CodeLLDB, Header source switch, Rainbow Brackets, C++ Intellisense
  * CMake, CMake Tools
    * The CMake plugin requires CMake 3.9.4 or newer; users on Ubuntu 16.04 and earlier may need to install a newer CMake manually
  * Clangd
    * [VSCode C/C++ development and debugging setup: Clangd + CodeLLDB](https://zhangjk98.xyz/vscode-c-and-cpp-develop-and-debug-setting/)
    * Tips for reading very large codebases
      * Disable editor.formatOnSave, clang-tidy, all-scopes-completion
    * CodeLLDB usage (TODO)
  * Once clangd is set up, the Microsoft C/C++ extension is no longer needed

* Tabnine: AI-powered autocompletion backed by GPT
* Peacock: gives each workspace its own color
* Increase the memory available to the C++ extension:
  * ![c++](Editor/vscode-c++.png)
  * [How to elegantly develop large C++ projects in VSCode?](https://www.zhihu.com/question/353722203/answer/2564104885)
* LSP (language service provider): vscode clangd
  * Depends on compile_commands.json
    * Generated automatically by the build, or manually: `mkdir build && cd build && cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 ..`
  * If the system glibc is old, patch clangd (with patchelf) to link against a newer glibc


#### Format
12 changes: 12 additions & 0 deletions Notes/MLSys.md
@@ -244,6 +244,18 @@ val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
* Int8
* PQ

### Retrieval acceleration

* Tree-based
  * KD-Tree
  * Annoy: https://github.com/spotify/annoy
* Hashing
  * Locality-Sensitive Hashing: https://falconn-lib.org/

* PQ (Product Quantization)
  * https://github.com/facebookresearch/faiss
* Learning to hash
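As an illustration of the hashing family above, here is a minimal random-projection LSH sketch in NumPy. This is a toy example (all names and dimensions are made up); production systems would use a library such as FALCONN or Faiss:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 64, 16
planes = rng.normal(size=(n_planes, dim))  # one random hyperplane per hash bit

def lsh_signature(v):
    # The sign of the projection onto each hyperplane gives one bit of the hash.
    return (planes @ v > 0).astype(int)

def hamming(s, t):
    return int(np.sum(s != t))

a = rng.normal(size=dim)
b = a + 0.01 * rng.normal(size=dim)  # near-duplicate of a
c = rng.normal(size=dim)             # unrelated vector

# Near-duplicates agree on (almost) all signature bits;
# unrelated vectors agree on roughly half of them.
print(hamming(lsh_signature(a), lsh_signature(b)))
print(hamming(lsh_signature(a), lsh_signature(c)))
```

Candidates are then retrieved by exact lookup on the signature (or on bands of it), turning nearest-neighbor search into hash-bucket probes.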

#### Semantic search

* [OpenAI Embedding Model](https://openai.com/blog/new-and-improved-embedding-model/)
49 changes: 45 additions & 4 deletions Notes/Machine-Learning.md
@@ -279,6 +279,50 @@ train_data, validation_data, test_data = np.split(model_data.sample(frac=1, rand
* on the diagram of thought https://github.com/diagram-of-thought/diagram-of-thought


### Learning To Rank

#### XGBoost

https://xgboost.readthedocs.io/en/stable/tutorials/model.html

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper *Greedy Function Approximation: A Gradient Boosting Machine*, by Friedman.

![image-20241210004602072](./Machine-Learning/image-20241210004602072.png)

![image-20241210004925060](./Machine-Learning/image-20241210004925060.png)

![image-20241210004943879](./Machine-Learning/image-20241210004943879.png)

![image-20241210005132323](./Machine-Learning/image-20241210005132323.png)

![illustration of structure score (fitness)](./Machine-Learning/struct_score.png)
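Written out (following the XGBoost "Introduction to Boosted Trees" tutorial that the screenshots above come from), the second-order objective, the optimal leaf weight, and the structure score are:

```latex
% Second-order approximation of the objective at boosting step t,
% with regularization \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2:
\text{obj}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i over the
% instances I_j assigned to leaf j, the optimal leaf weight and the
% structure score (the "fitness" in the figure) are:
w_j^{\ast} = -\frac{G_j}{H_j + \lambda},
\qquad
\text{obj}^{\ast} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

A smaller structure score means a better tree structure, so split candidates are ranked by the gain in this score.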



#### LTR

https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html

* Intro
* The default objective is `rank:ndcg`, based on the `LambdaMART` [[2]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) algorithm, which in turn is an adaptation of the `LambdaRank` [[3]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) framework to gradient boosting trees. For a history and a summary of the algorithm, see [[5]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references).
* 《Unbiased LambdaMART: An Unbiased Pairwise Learning-to-Rank Algorithm》
* 调参
* lambdarank_num_pair_per_sample
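Since the default objective optimizes NDCG, a minimal pure-Python sketch of the metric itself (the exponential-gain variant commonly used in LTR) may help make the target concrete:

```python
import math

def dcg(rels):
    # Discounted cumulative gain: exponential gain, log2 position discount.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1]))  # decent but imperfect ranking: between 0 and 1
print(ndcg([3, 3, 2, 1, 0]))  # ideal ordering scores 1.0
```

LambdaMART never computes gradients of NDCG directly (it is non-differentiable); instead it weights pairwise gradients by the NDCG change from swapping the two documents.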

### Position Bias

* Intro

* Obtaining real relevance degrees for query results is an expensive and strenuous task, requiring human labelers to label all results one by one. When such a labeling task is infeasible, we might want to train the learning-to-rank model on user click data instead, as it is relatively easy to collect. Another advantage of using click data directly is that it can reflect the most up-to-date user preferences [[1]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references). However, user clicks are often biased, as users tend to choose results displayed in higher positions. User clicks are also noisy: users might accidentally click on irrelevant documents. To ameliorate these issues, XGBoost implements the `Unbiased LambdaMART` [[4]](https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html#references) algorithm to debias position-dependent click data. The feature can be enabled by the `lambdarank_unbiased` parameter; see [Parameters for learning to rank (rank:ndcg, rank:map, rank:pairwise)](https://xgboost.readthedocs.io/en/stable/parameter.html#ltr-param) for related options and [Getting started with learning to rank](https://xgboost.readthedocs.io/en/stable/python/examples/learning_to_rank.html#sphx-glr-python-examples-learning-to-rank-py) for a worked example with simulated user clicks.
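A hedged sketch of what enabling this might look like. The parameter names come from the XGBoost LTR documentation quoted above; the values are illustrative, not recommendations:

```python
# Parameter sketch only -- names follow the XGBoost LTR documentation;
# the values here are illustrative, not tuned recommendations.
params = {
    "objective": "rank:ndcg",             # default LTR objective (LambdaMART)
    "lambdarank_unbiased": True,          # debias position-dependent clicks
    "lambdarank_num_pair_per_sample": 8,  # pairs sampled per document (pair method "mean")
    "eval_metric": "ndcg@10",
}
print(params["objective"])
```

These would be passed to `xgboost.train` together with a `DMatrix` carrying query-group information.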

### Quantization

#### 模型量化介绍
@@ -351,7 +395,7 @@ Training quantization
#### 训练 Dense Retriever

* Query2Doc paper
* For training dense retrievers, several factors can influence the final performance, such as hard negative mining (Xiong et al., 2021), intermediate pretraining (Gao and Callan, 2021), and knowledge distillation from a cross-encoder based re-ranker (Qu et al., 2021). In this paper, we investigate two settings to gain a more comprehensive understanding of our method. The first setting is training DPR (Karpukhin et al., 2020) models initialized from BERTbase with BM25 hard negatives only
* ![image-20241117211622999](Machine-Learning/image-20241117211622999.png)


@@ -601,10 +645,7 @@ def find_most_similar(input_word):



#### Image-based image search

* Aliyun
* https://help.aliyun.com/zh/image-search/developer-reference/api-searchbypic?spm=a2c4g.11186623.help-menu-66413.d_4_3_1_3.7538364fjOQka0&scm=20140722.H_202282._.OR_help-V_1



Binary file added Notes/Machine-Learning/struct_score.png
56 changes: 56 additions & 0 deletions Notes/snippets/python-multiprocess.py
@@ -0,0 +1,56 @@
# - Basic implementation: https://stackoverflow.com/questions/67363793/correct-way-to-implement-producer-consumer-pattern-in-python-multiprocessing-pool
# - Fixing the logging deadlock:
#   - Industrial-strength version: https://blog.csdn.net/weixin_68789096/article/details/135546285
#   - Python's process pool is fork-based. When a child process is created with
#     fork() alone (without execve() to replace the process image), it shares the
#     parent's memory space, but only the forking thread survives: the parent's
#     other threads are not copied into the child. This is exactly what causes
#     the deadlock:
#   - if a parent thread holds the logging queue's lock at the moment of fork(),
#     the child inherits the lock in its "held" state; the owning thread does not
#     exist in the child, so when the child tries to log it blocks forever on a
#     lock that will never be released.
# - Basic version: https://blog.51cto.com/u_16175479/8903194
#   - To let consumers exit cleanly, use Manager().Queue() instead of JoinableQueue
#   - https://stackoverflow.com/questions/45866698/multiprocessing-processes-wont-join


import queue
import random
from multiprocessing import Process, set_start_method, Manager # JoinableQueue


def consumer(q):  # q is a Manager().Queue() proxy
while True:
try:
res = q.get(block=False)
print(f'Consume {res}')
q.task_done()
except queue.Empty:
pass


def producer(q, food):
for i in range(2):
res = f'{food} {i}'
print(f'Produce {res}')
q.put(res)
q.join()


if __name__ == "__main__":
set_start_method('spawn')
foods = ['apple', 'banana', 'melon', 'salad']
jobs = 2
q = Manager().Queue(maxsize=1024)

producers = [
Process(target=producer, args=(q, random.choice(foods)))
for _ in range(jobs)
]

# daemon=True is important here
consumers = [
Process(target=consumer, args=(q, ), daemon=True)
for _ in range(jobs * 2)
]

# + order here doesn't matter
for p in consumers + producers:
p.start()

for p in producers:
p.join()
File renamed without changes.
