# WiseFlow

**[中文](README_CN.md) | [日本語](README_JP.md) | [Français](README_FR.md) | [Deutsch](README_DE.md)**

**Wiseflow** is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, and social platforms, automatically categorizes them, and uploads them to a database.

We are not short of information; what we need is to filter the noise out of the flood so that valuable information stands out. See how Wiseflow helps you save time, filter out irrelevant information, and organize the key points you care about!

<img alt="sample.png" src="asset/sample.png" width="1024"/>

## 🔥 Major Update V0.3.0

- ✅ Completely rewritten general web content parser, combining statistical learning (via the open-source project GNE) with an LLM, adapted to over 90% of news pages;
- ✅ Brand-new asynchronous task architecture;
- ✅ New information extraction and labeling strategy: more accurate, more refined, and able to perform its tasks perfectly with only a 9B LLM!

## 🌟 Key Features

- 🚀 **Native LLM Application**
  We carefully selected the most suitable 7B~9B open-source models to minimize usage costs and allow data-sensitive users to switch to local deployment at any time.

- 🌱 **Lightweight Design**
  Without using any vector models, the system has minimal overhead and does not require a GPU, making it suitable for any hardware environment.

- 🗃️ **Intelligent Information Extraction and Classification**
  Automatically extracts information from various sources and tags and classifies it according to user interests.

  😄 **Wiseflow is particularly good at extracting information from WeChat official account articles**; for this we have configured a dedicated mp article parser!

- 🌍 **Can Be Integrated into Any RAG Project**
  Wiseflow can serve as a dynamic knowledge base for any RAG project; there is no need to understand Wiseflow's code, just read from the database!

- 📦 **Popular PocketBase Database**
  The database and interface use PocketBase. Besides the web interface, SDKs for Go, JavaScript, and Python are available:

  - Go: https://pocketbase.io/docs/go-overview/
  - JavaScript: https://pocketbase.io/docs/js-overview/
  - Python: https://github.com/vaphes/pocketbase

## 🔄 What are the Differences and Connections between Wiseflow, Common Crawlers, and RAG Projects?

| Feature | Wiseflow | Crawler / Scraper | RAG Projects |
|---------|----------|-------------------|--------------|
| **Main Problem Solved** | Data processing (filtering, extraction, labeling) | Raw data acquisition | Downstream applications |
| **Connection** | | Can be integrated into Wiseflow for more powerful raw data acquisition | Can integrate Wiseflow as a dynamic knowledge base |

## 📥 Installation and Usage

Wiseflow has virtually no hardware requirements, with minimal system overhead; it needs no discrete GPU and no CUDA (when using online LLM services).

1. **Clone the Code Repository**

   😄 Starring and forking is a good habit!

   ```bash
   git clone https://github.com/TeamWiseFlow/wiseflow.git
   cd wiseflow
   ```

2. **Configuration**

   Copy `env_sample` in the repository, rename it to `.env`, and fill in your configuration (such as LLM service tokens):

   - LLM_API_KEY # API key for the LLM inference service (delete this entry if you use the official OpenAI service)
   - LLM_API_BASE # Base URL of the OpenAI-compatible model service (delete this entry if you use the official OpenAI service)
   - WS_LOG="verbose" # Enables debug logging; delete if not needed
   - GET_INFO_MODEL # Model for information extraction and tagging tasks, default gpt-3.5-turbo
   - REWRITE_MODEL # Model for near-duplicate merging and rewriting tasks, default gpt-3.5-turbo
   - HTML_PARSE_MODEL # Web page parsing model (enabled automatically when the GNE algorithm performs poorly), default gpt-3.5-turbo
   - PROJECT_DIR # Location for cache and log files, relative to the code repository; defaults to the repository itself if not specified
   - PB_API_AUTH='email|password' # Admin email and password for the PocketBase database (the email can be fictitious but must be a valid email format)
   - PB_API_BASE # Not required for normal use; only needed if you are not using the default local PocketBase interface (port 8090)
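
   Putting these together, a minimal `.env` might look like the following sketch. All values are placeholders to adjust for your own services; the model names are the ones recommended in the next step, and the base URL assumes SiliconFlow:

   ```bash
   # Sample .env — placeholder values, adjust for your own setup
   LLM_API_KEY="sk-xxxx"                          # your inference-service key
   LLM_API_BASE="https://api.siliconflow.cn/v1"   # delete if using official OpenAI
   GET_INFO_MODEL="zhipuai/glm4-9B-chat"
   REWRITE_MODEL="alibaba/Qwen2-7B-Instruct"
   HTML_PARSE_MODEL="alibaba/Qwen2-7B-Instruct"
   PB_API_AUTH="admin@example.com|your-password"  # PocketBase admin credentials
   # PROJECT_DIR and PB_API_BASE can be omitted to use the defaults
   ```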

3. **Model Recommendation**

   After extensive testing (on both Chinese and English tasks), for overall effect and cost we recommend the following for **GET_INFO_MODEL**, **REWRITE_MODEL**, and **HTML_PARSE_MODEL** respectively: **"zhipuai/glm4-9B-chat"**, **"alibaba/Qwen2-7B-Instruct"**, **"alibaba/Qwen2-7B-Instruct"**.

   These models fit the project well, follow instructions reliably, and generate excellent results; the prompts in this project are also optimized for these three models. (**HTML_PARSE_MODEL** can also be set to **"01-ai/Yi-1.5-9B-Chat"**, which likewise performs excellently in tests.)

   ⚠️ We strongly recommend **SiliconFlow**'s online inference service for lower costs, faster speeds, and higher free quotas! ⚠️

   SiliconFlow's online inference service is compatible with the OpenAI SDK and serves all three open-source models above; just set LLM_API_BASE to "https://api.siliconflow.cn/v1" and configure LLM_API_KEY to use it.

4. **Local Deployment**

   As you can see, this project uses 7B/9B LLMs and no vector models, which means you can fully deploy it locally with just an RTX 3090 (24 GB VRAM).

   Ensure your local LLM service is compatible with the OpenAI SDK and configure LLM_API_BASE accordingly.
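
   For instance, if you serve a model locally behind an OpenAI-compatible endpoint, the relevant `.env` entries might look like the fragment below. The port is purely an assumption for illustration; use whatever address your serving framework actually exposes:

   ```bash
   # Hypothetical local OpenAI-compatible endpoint (port is an assumption)
   LLM_API_BASE="http://127.0.0.1:8000/v1"
   LLM_API_KEY="EMPTY"   # many local servers accept any placeholder key
   ```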

5. **Run the Program**

   **For regular users, it is strongly recommended to run Wiseflow with Docker.**

   📚 For developers, see [/core/README.md](/core/README.md) for more.

   Access the collected data via PocketBase:

   - http://127.0.0.1:8090/_/ - Admin dashboard UI
   - http://127.0.0.1:8090/api/ - REST API
   - See https://pocketbase.io/docs/ for more
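
As an illustration, one page of records can be read over the REST API with nothing but the Python standard library. The collection name `infos` used here is an assumption for the example; check the actual collection names in the Admin dashboard:

```python
import json
import urllib.parse
import urllib.request

PB_BASE = "http://127.0.0.1:8090"  # default local PocketBase address

def records_url(collection: str, page: int = 1, per_page: int = 30) -> str:
    """Build the PocketBase list-records endpoint URL for a collection."""
    query = urllib.parse.urlencode({"page": page, "perPage": per_page})
    return f"{PB_BASE}/api/collections/{collection}/records?{query}"

def list_records(collection: str) -> dict:
    """Fetch one page of records; requires a running PocketBase instance."""
    with urllib.request.urlopen(records_url(collection)) as resp:
        return json.load(resp)
```

With the service running, `list_records("infos")["items"]` would return the stored entries; for anything beyond a quick look, the official SDKs listed above are the better choice.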

6. **Adding Scheduled Source Scanning**

   After starting the program, open the PocketBase Admin dashboard UI (http://127.0.0.1:8090/_/).

   Open the **sites** form.

   Through this form you can specify custom sources; the system will start background tasks to scan, parse, and analyze them locally.

   Description of the sites fields:

   - url: The URL of the source. It does not need to point to a specific article page, just the article list page; the Wiseflow client includes two general page parsers that can effectively acquire and parse over 90% of static news pages.
   - per_hours: Scanning frequency in hours, as an integer (range 1~24; we recommend scanning no more than once a day, i.e. setting it to 24).
   - activated: Whether the source is active. If turned off, the source is ignored; it can be turned on again later. Toggling does not require restarting the Docker container; the change takes effect at the next scheduled task.
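
The field semantics above can be sketched as a simple due-check. This is only an illustration of how per_hours and activated interact, not Wiseflow's actual scheduler code:

```python
from datetime import datetime, timedelta
from typing import Optional

def site_is_due(activated: bool, per_hours: int,
                last_scanned: Optional[datetime], now: datetime) -> bool:
    """Decide whether a source should be scanned again.

    Deactivated sites are skipped; an active site is rescanned once
    every `per_hours` hours (integer, 1~24 per the field description).
    """
    if not activated:
        return False
    if not 1 <= per_hours <= 24:
        raise ValueError("per_hours must be an integer in the range 1~24")
    if last_scanned is None:  # never scanned yet: due immediately
        return True
    return now - last_scanned >= timedelta(hours=per_hours)
```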

## 🛡️ License

This project is open source under the [Apache 2.0](LICENSE) license.

For commercial use and customization cooperation, please contact **Email: 35252986@qq.com**.

- Commercial customers: please contact us to register; the product is promised to be free forever.
- For customized customers, we provide the following services tailored to your sources and business needs:
  - Custom proprietary parsers
  - Customized information extraction and classification strategies
  - Targeted LLM recommendations, or even fine-tuning services
  - Private deployment services
  - UI customization

## 📬 Contact Information

If you have any questions or suggestions, feel free to reach us via an [issue](https://github.com/TeamWiseFlow/wiseflow/issues).

## 🤝 This Project is Based on the Following Excellent Open-source Projects:

- GeneralNewsExtractor (extractor of news web page body based on statistical learning): https://github.com/GeneralNewsExtractor/GeneralNewsExtractor
- json_repair (repair invalid JSON documents): https://github.com/josdejong/jsonrepair/tree/main
- python-pocketbase (PocketBase client SDK for Python): https://github.com/vaphes/pocketbase

# Citation

If you refer to or cite part or all of this project in related work, please credit it as follows:

```
Author: Wiseflow Team
https://openi.pcl.ac.cn/wiseflow/wiseflow
https://github.com/TeamWiseFlow/wiseflow
Licensed under Apache 2.0
```