
Commit e8db4fa

multi-language readme
1 parent 06a6ac1 commit e8db4fa

File tree

10 files changed: +832 -231 lines changed

README.md

Lines changed: 109 additions & 68 deletions
@@ -1,121 +1,162 @@
-# 📈 Chief Intelligence Officer (Wiseflow)
+# WiseFlow

-**Chief Intelligence Officer** (Wiseflow) is an agile information mining tool that distills concise messages from a variety of sources such as social platform posts, WeChat official accounts, and group chats, automatically tags and categorizes them, and uploads them to a database, so you can easily cope with information overload and keep precise track of the content you care about most.
+**[中文](README_CN.md) | [日本語](README_JP.md) | [Français](README_FR.md) | [Deutsch](README_DE.md)**

-## 🌟 Key Features
+**Wiseflow** is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, and social platforms, automatically categorizes them, and uploads them to the database.

-- 🚀 **Native LLM Application**
-  We carefully selected the most suitable 7B~9B open-source models to minimize usage costs and make it easy for data-sensitive users to switch to local deployment at any time.
+We are not short of information; what we need is to filter the noise out of the flood of information so that valuable information stands out! See how Wiseflow, your Chief Intelligence Officer, helps you save time, filter out irrelevant information, and organize the key points you care about!

-- 🌱 **Lightweight Design**
-  Without using any vector models, the system has minimal overhead and does not require a GPU, making it suitable for any hardware environment.
+<img alt="sample.png" src="asset/sample.png" width="1024"/>

-- 🗃️ **Intelligent Information Extraction and Classification**
-  Automatically extracts information from various sources and tags and classifies it according to user interests.
+## 🔥 Major Update V0.3.0

-- 🌍 **Real-time Dynamic Knowledge Base**
-  Can be integrated with existing RAG projects, serving as a dynamic knowledge base to improve knowledge management efficiency.
+- ✅ Completely rewritten general web content parser, using a combination of statistical learning (relying on the open-source project GNE) and LLM, adapted to over 90% of news pages;

-- 📦 **Popular Pocketbase Database**
-  The database and interface use Pocketbase; whether you read the data directly in the web UI or via Go tools, access is convenient.

-## 🔄 Comparison
+- ✅ Brand new asynchronous task architecture;

-| Feature | Chief Intelligence Officer (Wiseflow) | Markdown_crawler | firecrawler | RAG projects |
-| ------- | ------------------------------------- | ---------------- | ----------- | ------------ |
-| **Information extraction** | ✅ Efficient | ❌ Limited to Markdown | ❌ Web pages only | ⚠️ Post-extraction processing |
-| **Information classification** | ✅ Automatic | ❌ Manual | ❌ Manual | ⚠️ Relies on external tools |
-| **Model dependency** | ✅ 7B~9B open-source models | ❌ No model | ❌ No model | ✅ Vector models |
-| **Hardware requirements** | ✅ No GPU required | ✅ No GPU required | ✅ No GPU required | ⚠️ Depends on the implementation |
-| **Integrability** | ✅ Dynamic knowledge base | ❌ Low | ❌ Low | ✅ High |

-## 📥 Installation and Usage
+- ✅ New information extraction and labeling strategy: more accurate, more refined, and able to perform tasks perfectly with only a 9B LLM!

-1. **Clone the Code Repository**
+## 🌟 Key Features

+- 🚀 **Native LLM Application**
+  We carefully selected the most suitable 7B~9B open-source models to minimize usage costs and allow data-sensitive users to switch to local deployment at any time.

+- 🌱 **Lightweight Design**
+  Without using any vector models, the system has minimal overhead and does not require a GPU, making it suitable for any hardware environment.

+- 🗃️ **Intelligent Information Extraction and Classification**
+  Automatically extracts information from various sources, then tags and classifies it according to user interests.

+  😄 **Wiseflow is particularly good at extracting information from WeChat official account articles**; for this, we have configured a dedicated mp article parser!

+- 🌍 **Can Be Integrated into Any RAG Project**
+  Can serve as a dynamic knowledge base for any RAG project, without needing to understand Wiseflow's code; just read from the database!

+- 📦 **Popular PocketBase Database**
+  The database and interface use PocketBase. Besides the web interface, client SDKs for Go, JavaScript, and Python are available:

+  - Go: https://pocketbase.io/docs/go-overview/
+  - JavaScript: https://pocketbase.io/docs/js-overview/
+  - Python: https://github.com/vaphes/pocketbase
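
The Python client linked above is enough to read Wiseflow's output directly from PocketBase. Below is a minimal, illustrative sketch (not part of the repository): it assumes PocketBase is running on the default local port 8090, that the admin account matches the PB_API_AUTH entry described in the configuration step, and it reads the `sites` collection introduced later in this README.

```python
# Illustrative sketch using the python-pocketbase SDK (https://github.com/vaphes/pocketbase).
# Assumptions: local PocketBase on the default port 8090, admin credentials as in PB_API_AUTH.
from pocketbase import PocketBase

client = PocketBase("http://127.0.0.1:8090")
client.admins.auth_with_password("you@example.com", "your-password")

# List the configured sources ("sites" is the collection described in the
# scheduled source scanning step below).
for record in client.collection("sites").get_full_list():
    print(record.id, getattr(record, "url", None), getattr(record, "activated", None))
```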

+## 🔄 How Does Wiseflow Differ from, and Relate to, Common Crawlers and RAG Projects?

+| Feature | Wiseflow | Crawler / Scraper | RAG Projects |
+| ------- | -------- | ----------------- | ------------ |
+| **Main problem solved** | Data processing (filtering, extraction, labeling) | Raw data acquisition | Downstream applications |
+| **Connection** | | Can be integrated into Wiseflow for more powerful raw data acquisition | Can integrate Wiseflow as a dynamic knowledge base |

+## 📥 Installation and Usage

+WiseFlow has virtually no hardware requirements, minimal system overhead, and does not need a discrete GPU or CUDA (when using online LLM services).

+1. **Clone the Code Repository**

+   😄 Liking and forking is a good habit

```bash
-git clone https://github.com/your-username/wiseflow.git
+git clone https://github.com/TeamWiseFlow/wiseflow.git
cd wiseflow
```

-2. **Install Dependencies**

-```bash
-pip install -r requirements.txt
-```

+2. **Configuration**

-3. **Configuration**

+   Copy `env_sample` in the directory, rename it to `.env`, and fill in your configuration information (such as LLM service tokens) as follows:

-   Configure your information sources and focus points in `config.yaml`.

+   - LLM_API_KEY # API key for the large model inference service (if you use the OpenAI service, you can omit this by deleting the entry)
+   - LLM_API_BASE # Base URL for an OpenAI-compatible model service (omit this if using the OpenAI service)
+   - WS_LOG="verbose" # Enable debug logging; delete if not needed
+   - GET_INFO_MODEL # Model for information extraction and tagging tasks, default is gpt-3.5-turbo
+   - REWRITE_MODEL # Model for near-duplicate merging and rewriting tasks, default is gpt-3.5-turbo
+   - HTML_PARSE_MODEL # Web page parsing model (used automatically when the GNE algorithm performs poorly), default is gpt-3.5-turbo
+   - PROJECT_DIR # Location for storing cache and log files, relative to the code repository; defaults to the code repository itself if not specified
+   - PB_API_AUTH='email|password' # Admin email and password for the pb database (must be a valid email format, but it can be a fictitious address)
+   - PB_API_BASE # Not required for normal use; only needed if you are not using the default local PocketBase interface (port 8090)
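
For orientation, the sketch below shows one way these entries could be read with their documented defaults. The use of python-dotenv and the variable handling here are illustrative assumptions, not the repository's actual loading code.

```python
# Illustrative sketch: mapping the .env entries above to settings with the
# documented defaults; this mirrors the description, not Wiseflow's own code.
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the working directory

LLM_API_KEY = os.getenv("LLM_API_KEY", "")
LLM_API_BASE = os.getenv("LLM_API_BASE", "")           # empty means the official OpenAI endpoint
GET_INFO_MODEL = os.getenv("GET_INFO_MODEL", "gpt-3.5-turbo")
REWRITE_MODEL = os.getenv("REWRITE_MODEL", "gpt-3.5-turbo")
HTML_PARSE_MODEL = os.getenv("HTML_PARSE_MODEL", "gpt-3.5-turbo")
PROJECT_DIR = os.getenv("PROJECT_DIR", ".")            # cache and log location
PB_API_AUTH = os.getenv("PB_API_AUTH", "")             # 'email|password' for the PocketBase admin
PB_API_BASE = os.getenv("PB_API_BASE", "http://127.0.0.1:8090")
```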

-4. **Start the Service**

-```bash
-python main.py
-```

+3. **Model Recommendation**

-5. **Access the Web Interface**

+   After extensive testing (on both Chinese and English tasks), for overall effect and cost we recommend the following for **GET_INFO_MODEL**, **REWRITE_MODEL**, and **HTML_PARSE_MODEL** respectively: **"zhipuai/glm4-9B-chat"**, **"alibaba/Qwen2-7B-Instruct"**, **"alibaba/Qwen2-7B-Instruct"**.

-   Open your browser and visit `http://localhost:8000`.

+   These models fit the project well, follow instructions reliably, and generate excellent results. The prompts in this project are also optimized for these three models. (**HTML_PARSE_MODEL** can also use **"01-ai/Yi-1.5-9B-Chat"**, which performed excellently in our tests.)

-## 📚 Documentation and Support

+   ⚠️ We strongly recommend using **SiliconFlow**'s online inference service for lower costs, faster speeds, and higher free quotas! ⚠️

-- [Usage Documentation](docs/usage.md)
-- [Developer Guide](docs/developer.md)
-- [FAQ](docs/faq.md)

+   The SiliconFlow online inference service is compatible with the OpenAI SDK and provides hosted services for the three open-source models above. Just configure LLM_API_BASE as "https://api.siliconflow.cn/v1" and set LLM_API_KEY to use it.
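
To make "compatible with the OpenAI SDK" concrete, here is a minimal, illustrative sketch (not Wiseflow's own wrapper code) that calls such a service with the official openai Python package, reading LLM_API_KEY and LLM_API_BASE from the environment exactly as the .env entries above describe:

```python
# Illustrative sketch: calling an OpenAI-compatible endpoint (e.g. SiliconFlow or a
# locally hosted service) with the official openai Python SDK (v1.x).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("LLM_API_KEY"),
    base_url=os.getenv("LLM_API_BASE", "https://api.siliconflow.cn/v1"),
)

response = client.chat.completions.create(
    model=os.getenv("GET_INFO_MODEL", "zhipuai/glm4-9B-chat"),
    messages=[{"role": "user", "content": "Summarize this article in one sentence: ..."}],
)
print(response.choices[0].message.content)
```

The same pattern applies to the local deployment option below: point LLM_API_BASE at your locally hosted OpenAI-compatible service instead of the SiliconFlow URL.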

-## 🤝 Contribution Guide

-Contributions to the project are welcome! Please read the [Contribution Guide](CONTRIBUTING.md) for details.
+4. **Local Deployment**

-## 🛡️ License
+   As you can see, this project uses 7B/9B LLMs and does not require any vector models, which means you can fully deploy it locally with just an RTX 3090 (24 GB VRAM).

-This project is open-source under the [Apache 2.0](LICENSE) license.
+   Ensure your local LLM service is compatible with the OpenAI SDK, and configure LLM_API_BASE accordingly.

-For commercial use and customization cooperation, please contact Email: 35252986@qq.com

-(Commercial customers, please contact us to register; the product promises to be free forever.)
+5. **Run the Program**

-(For customization customers, we provide proprietary parser development, optimized information extraction and classification strategies, LLM fine-tuning, and private deployment services tailored to your sources and data.)
+   **For regular users, it is strongly recommended to use Docker to run Wiseflow.**

-## 📬 Contact Information
+   📚 For developers, see [/core/README.md](/core/README.md) for more.

-If you have any questions or suggestions, feel free to contact us through an [issue](https://github.com/your-username/wiseflow/issues).
+   Access the data obtained via PocketBase:

+   - http://127.0.0.1:8090/_/ - Admin dashboard UI
+   - http://127.0.0.1:8090/api/ - REST API
+   - See https://pocketbase.io/docs/ for more
86-
## change log
87114
88-
【2024.5.8】增加对openai SDK的支持,现在可以通过调用llms.openai_wrapper使用所有兼容openai SDK的大模型服务,具体见 [client/backend/llms/README.md](client/backend/llms/README.md)
115+
6. **Adding Scheduled Source Scanning**
89116
117+
After starting the program, open the PocketBase Admin dashboard UI (http://127.0.0.1:8090/_/)
90118
91-
## getting started
119+
Open the **sites** form.
92120
93-
首席情报官提供了开箱即用的本地客户端,对于没有二次开发需求的用户可以通过如下简单五个步骤即刻起飞!
121+
Through this form, you can specify custom sources, and the system will start background tasks to scan, parse, and analyze the sources locally.
94122
95-
1、克隆代码仓
123+
Description of the sites fields:
96124
97-
```commandline
98-
git clone git@github.com:TeamWiseFlow/wiseflow.git
99-
cd wiseflow/client
100-
```
125+
- url: The URL of the source. The source does not need to specify the specific article page, just the article list page. Wiseflow client includes two general page parsers that can effectively acquire and parse over 90% of news-type static web pages.
126+
- per_hours: Scanning frequency, in hours, integer type (range 1~24; we recommend a scanning frequency of no more than once per day, i.e., set to 24).
127+
- activated: Whether to activate. If turned off, the source will be ignored; it can be turned on again later. Turning on and off does not require restarting the Docker container and will be updated at the next scheduled task.
101128
102-
4、参考 /client/env_sample 编辑.env文件;
129+
## 🛡️ License
103130
104-
5、运行 `docker compose up -d` 启动(第一次需要build image,时间较长)
131+
This project is open-source under the [Apache 2.0](LICENSE) license.
105132
133+
For commercial use and customization cooperation, please contact **Email: 35252986@qq.com**.
134+
135+
- Commercial customers, please register with us. The product promises to be free forever.
136+
- For customized customers, we provide the following services according to your sources and business needs:
137+
- Custom proprietary parsers
138+
- Customized information extraction and classification strategies
139+
- Targeted LLM recommendations or even fine-tuning services
140+
- Private deployment services
141+
- UI interface customization
142+
143+
## 📬 Contact Information
144+
145+
If you have any questions or suggestions, feel free to contact us through [issue](https://github.com/TeamWiseFlow/wiseflow/issues).
146+
147+
## 🤝 This Project is Based on the Following Excellent Open-source Projects:
148+
149+
- GeneralNewsExtractor (General Extractor of News Web Page Body Based on Statistical Learning) https://github.com/GeneralNewsExtractor/GeneralNewsExtractor
150+
- json_repair (Repair invalid JSON documents) https://github.com/josdejong/jsonrepair/tree/main
151+
- python-pocketbase (PocketBase client SDK for Python) https://github.com/vaphes/pocketbase
106152
107153
# Citation
108154
109-
如果您在相关工作中参考或引用了本项目的部分或全部,请注明如下信息:
155+
If you refer to or cite part or all of this project in related work, please indicate the following information:
110156
111157
```
112-
AuthorWiseflow Team
158+
Author: Wiseflow Team
113159
https://openi.pcl.ac.cn/wiseflow/wiseflow
114160
https://github.com/TeamWiseFlow/wiseflow
115161
Licensed under Apache2.0
116-
```
117-
# After many comparisons, we recommend the following model for the two tasks of this project (combining effectiveness, speed, and cost performance).
118-
# At the same time, we recommend the siliconflow platform, which can provide online reasoning services for the following two models at a more favorable price
119-
# The siliconflow platform is compatible with openai sdk, which makes the program simple
120-
# Therefore, unless you have experimented and found that there are better options for your data, it is not recommended to change the following two parameters
121-
# (although you have the right to make any changes at any time).
162+
```
