【New feature】Add Hackathon assistant #37
HackathonBot/README.md
Outdated
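【序号】:2、3
```

> In this format, `【状态】: 报名` (status: registration) marks the comment as a task `registration`, and `序号` (number) lists the numbers of the tasks being registered for. Multiple task numbers must be separated by the Chinese enumeration comma `、`.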
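We need to decide how to handle replies that do not follow the format. Two options:
- Strictly require the format. When a malformed reply causes the registration to fail, the bot @-mentions the registrant under the issue and asks them to register again.
- Extend compatibility to common format deviations, such as the full-width colon `:` versus the ASCII colon `:`.
(By the way, the format you provided does not distinguish between full-width and ASCII colons either.)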
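- The first option is good; I will revise accordingly.
- The full-width versus ASCII colon is one such case, which is why the code locates the first digit character instead of matching the colon. What matters is that task numbers are separated by `、`. It looks clearly different from its ASCII counterpart, so it should rarely be mistyped. If it is, we fall back to the first option and prompt the user to register again.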
OK on the first point. On the second: if the code looks for the first digit character, then in theory any digit is accepted, isn't it? Could that be a risk?
That is indeed a risk. I will add a check on the task number: if a number is greater than the number of tasks, the bot will notify the user in the comment section.
I suggest that the implementation also accept the ASCII forms of `[]`, `:`, and `,`, and tolerate spaces. That may remove the need to search for the first digit character. For example, the code could do the following:
s = '【序号】: 2, 3'
s = s.replace(' ', '').replace(':', ':').replace(',', '、').replace('[', '【').replace(']', '】')
This yields a cleaner string in a more uniform format.
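The normalization and the task-number check discussed in this thread can be combined into one parsing step. The following is a minimal sketch, not the bot's actual code: the helper names `normalize` and `parse_task_numbers` are hypothetical, and it assumes the task numbers follow a `【序号】` field and that valid numbers run from 1 to `task_count`.

```python
import re
from typing import List

# Map ASCII punctuation to the full-width forms used by the registration format.
_ASCII_TO_FULLWIDTH = str.maketrans({"[": "【", "]": "】", ":": ":", ",": "、"})


def normalize(comment: str) -> str:
    """Remove all whitespace and convert ASCII punctuation to full-width."""
    return re.sub(r"\s+", "", comment).translate(_ASCII_TO_FULLWIDTH)


def parse_task_numbers(comment: str, task_count: int) -> List[int]:
    """Return the task numbers of a registration comment.

    Raises ValueError when the field is missing, an item is not a number,
    or a number falls outside 1..task_count (assumed valid range).
    """
    match = re.search(r"【序号】:([^【]+)", normalize(comment))
    if match is None:
        raise ValueError("no 【序号】 field found")
    numbers = []
    for item in match.group(1).split("、"):
        if not item.isdecimal():
            raise ValueError(f"not a task number: {item!r}")
        number = int(item)
        if not 1 <= number <= task_count:
            raise ValueError(f"task {number} is outside 1..{task_count}")
        numbers.append(number)
    return numbers
```

A caller could catch `ValueError` and use its message as the text of the reminder that asks the user to register again.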
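The scheduled job that updates the table and the logic that detects malformed commands should definitely be separate. The scheduled job runs twice a day, but validation of the registration format should happen in real time, or at least once an hour. Automatic reformatting also seems feasible, and the technical approach is worth investigating. The PaddlePaddle project has a bot, paddle-bot, for automatic replies; we can check whether it can be integrated.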
Didn't we consider this bot at the very beginning, but it might drop events?
Someone is looking into the dropped-event problem now. Can we assume for the moment that it will be fixed?
If events are dropped, the scheduled job can serve as a fallback.
Schedule confirmed: the dropped-event fix can be completed by September 8.
HackathonBot/README.md
Outdated
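Apart from the `registration` status, changes to the other `four statuses` can be tracked by monitoring `PR` status. The implementation logic is as follows:

* Fetch all `PR`s in the `paddle` repository opened `after the start` of the hackathon whose titles contain `Hackathon No.`.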
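Two issues need to be considered here:
- Hackathon tasks are not all submitted to paddle. Some go to other toolkits or ecosystem repos such as PaddleOCR and FastDeploy, and even OpenVINO, so simply crawling PRs from the paddle repository cannot cover them.
- Every repository produces many PRs each day. Taking the paddle repository as an example (20+ per day on average), over the length of a hackathon (usually 3 months) the volume to crawl is large, and so is the chance of missing data.
My suggestion is to maintain a database and crawl the full PR list once a day. Hackathon PRs get a label. For example, under paddle we use the PaddlePaddle Hackathon label to manage hackathon PRs, and other repos can be asked to use the same label.
The issue leaderboard can then be updated by querying the label (or by building a database and looking up PR status in it each time), which also reduces the amount of computation.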
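- We can crawl the PRs of every project under the paddle organization, about 90 projects. When fetching through the official API, we can filter by keyword with `title:*Hackathon*`, so the number of PRs will be small. One paginated request returns at most 100 PRs, so few requests are needed.
- With a database, freshness is hard to guarantee. For example, if a user closes an earlier PR, the stored record cannot be updated promptly, and handling this is rather troublesome.
- If necessary we can still add a database. Without one, fetching and processing should finish within five minutes. We can discuss further whether to add it.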
OpenVINO is not a repo under the PaddlePaddle organization; it is an Intel project. So should we define a repo scope (for example, a list) that can be extended or trimmed flexibly?
The database was just a suggestion. If calling the GitHub API is sufficient, that is completely fine.
OK, we can define a repo scope, so that the monitored repos can be adjusted dynamically.
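One way to express the repo scope agreed on here is a plain list that drives one search query per repository. This is only a sketch under assumptions: the list contents and the helper name `build_search_queries` are illustrative, and the query uses GitHub's `repo:`, `is:pr`, and `in:title` search qualifiers rather than the `title:*Hackathon*` form quoted above.

```python
from typing import List

# Illustrative repo scope; add or remove entries to change what is monitored.
MONITORED_REPOS = [
    "PaddlePaddle/Paddle",
    "PaddlePaddle/PaddleOCR",
    "PaddlePaddle/FastDeploy",
    "openvinotoolkit/openvino",
]


def build_search_queries(repos: List[str], keyword: str = "Hackathon") -> List[str]:
    """Build one search query per repo for PRs whose title contains the keyword."""
    return [f"repo:{repo} is:pr in:title {keyword}" for repo in repos]


if __name__ == "__main__":
    for query in build_search_queries(MONITORED_REPOS):
        print(query)
```

Each query string would be passed as the `q` parameter of the issue search endpoint, with `per_page=100` to keep the number of requests low.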
#### PR format

To trigger a status change, the `PR` title only needs to start with `【Hackathon No.xxx】`. The program extracts the task number automatically and updates the leaderboard.
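Extracting the task number from such a title can be done with a single regular expression. The sketch below is an assumption about how this could look, not the code in this PR: the name `extract_task_number` is hypothetical, and accepting ASCII brackets and extra spaces is a leniency added for illustration.

```python
import re
from typing import Optional

# Accepts 【】 or [] brackets and optional spaces around "No." (assumed leniency).
_TITLE_PATTERN = re.compile(
    r"^\s*[【\[]\s*Hackathon\s+No\.\s*(\d+)\s*[】\]]", re.IGNORECASE
)


def extract_task_number(title: str) -> Optional[int]:
    """Return the task number at the start of a PR title, or None if absent."""
    match = _TITLE_PATTERN.match(title)
    return int(match.group(1)) if match else None
```

Returning `None` instead of raising lets the caller decide whether a non-matching title is simply unrelated or should trigger a reminder.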
Special cases to consider here:
- What if a developer writes the wrong task number (and corrects it later)?
- What if a developer submits an RFC directly without registering?
- What if a task requires an RFC but someone skips the RFC stage and submits a PR?
- What if a task does not require an RFC but someone submits one?
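Taking these in order:
- Under the current logic, every update re-reads all comments. If a developer writes the wrong number and later corrects it, the table ends up showing the corrected number. Likewise, once a registration comment is deleted, the entry disappears from the leaderboard at the next update.
- If an RFC is submitted directly, the leaderboard shows the RFC-submitted status. Some people do submit without registering.
- The logic needs improvement here: when updating PR status in the table, check whether the leaderboard already has an RFC link.
- I think this should be handled in two ways. First, add an extra column to the leaderboard, `RFC required`. Second, have the processing logic @-mention the user in a comment to say that no RFC is needed. The extra column can also be used to handle case 3.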
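So each run of the scheduled job does the following:
- First it scans the replies under the issue and builds a new leaderboard that replaces the existing one (this may leave no room for manual edits, because the next run overwrites them).
- Then it scans the PR list of every repo and updates PR status in the leaderboard by matching titles (this does not seem to cover a closed PR, or the case where the same GitHub ID opens a new PR after closing one).
It works, but it feels a bit brute-force. Incremental updates to the leaderboard would be best, since that also leaves room for manually editing the issue.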
That is indeed the current logic: the leaderboard is rewritten on every run. For incremental updates, we could record the update time after each run and, on the next run, only process replies and PRs newer than the last update time.
If a new PR is opened after a `PR close`, only the newly submitted one is shown in the end, because the query only covers PRs in the `open` or `merge` state.
I do not think changing the query time window is the way to solve this. The windows would certainly need to overlap, and querying everything is fine too. What we need is to record the previous state, compare it with the current one, take the diff, and apply it to the table.
I did not quite follow the last sentence. Does it mean that one GitHub ID only ever shows one PR? What if a task consists of several PRs (all merged)?
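The "record the previous state and take the diff" idea can be sketched as a comparison of two snapshots. Everything here is an assumption for illustration: the snapshot shape (a dict keyed by task number and GitHub ID) and the name `diff_snapshots` are hypothetical, and the status strings are the ones that appear in the code under review.

```python
from typing import Dict, Tuple

# A snapshot maps (task number, GitHub ID) to a status string (assumed shape).
Snapshot = Dict[Tuple[int, str], str]


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Dict[str, Snapshot]:
    """Return entries that were added, removed, or changed between two runs."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: v for k, v in previous.items() if k not in current}
    changed = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    before = {(1, "alice"): "报名", (2, "bob"): "提交PR"}
    after = {(1, "alice"): "提交PR", (3, "carol"): "报名"}
    print(diff_snapshots(before, after))
```

Only the entries in the diff would then be written to the table, which leaves manual edits to other rows untouched.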
Based on the discussion, we can start with the following:
Assume the scheduled job updates twice a day, once in the 0-12h window and once in the 12-24h window.
Then work out the API call frequency. @gouzil says one token allows at most 60 calls per day, so we may need to estimate how many tokens are required.
Also remember to keep logs and raise alerts on exceptions.
Could we consider the repository's `Webhooks` or `actions`? (Both can deliver updates promptly, with no need for a schedule.) Also, when making requests, we could fetch more data per request to reduce the number of requests.
* **Alerting**: automatically report the status of each task, and raise an alert for tasks whose status has not changed for a long time.

* **Dashboard**: a dashboard can be added later for data analysis and to better quantify developer contributions, similar to the [openGauss contribution dashboard](https://datastat.opengauss.org/zh/overview) and [an open-source dashboard for open-source community contributions](https://ost.51cto.com/posts/14589).
We should have a repository for backups, taken daily or before each run, in case the account gets banned.
Sure. At the moment each run saves a backup to a `txt` file.
You could try committing it to a separate branch. Different people may need to find the backup, so it is best to make this part public.
For reference:
OK.
HackathonBot/utils.py
Outdated
import time
import logging

access_token = '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'
The `token` has been exposed; remember to delete it.
Thanks for the reminder. GitHub has already detected the token and marked it as invalid.
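A common way to keep the token out of the source file is to read it from the environment at startup. This is a minimal sketch, not what the PR ended up doing: the function name `load_access_token` and the variable name `GITHUB_TOKEN` are assumptions.

```python
import os


def load_access_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Read the access token from an environment variable (name is assumed)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return token
```

With this in place, `utils.py` would call `load_access_token()` instead of assigning a literal, and the token would be supplied by the job's environment or secret store.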
if 'community' in html_url:
    update_status = {
        'username': username,
        'status': '提交RFC' if state == 'open' else '完成设计文档',
If the state here is `closed`, won't it also be marked as completed? For example, PaddlePaddle/community#524.
`bot.py` checks the `merge_at` attribute here, so only PRs already in the `merge` state are processed.
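The concern in this thread is that a PR closed without merging must not count as completed. The sketch below shows one way to make that explicit in a single function. It is an illustration, not the code in `bot.py`: the name `pr_status` is hypothetical, and it assumes the PR data carries GitHub's `state` and `merged_at` fields.

```python
from typing import Optional


def pr_status(html_url: str, state: str, merged_at: Optional[str]) -> Optional[str]:
    """Map a PR to a leaderboard status, or None when it should be ignored.

    PRs in the community repo are treated as RFCs, as in the code under review.
    """
    is_rfc = "community" in html_url
    if state == "open":
        return "提交RFC" if is_rfc else "提交PR"
    if state == "closed" and merged_at:
        return "完成设计文档" if is_rfc else "完成任务"
    # Closed without merging: not a completion, so do not update the table.
    return None
```

Keeping the merged check next to the status mapping means a reader of `utils.py` does not have to look in another file to see why `closed` is safe.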
else:
    update_status = {
        'username': username,
        'status': '提交PR' if state == 'open' else '完成任务',
Same as above.
`bot.py` checks the `merge_at` attribute here, so only PRs already in the `merge` state are processed.
However, judging from your test case, it does not seem to have been parsed correctly. https://github.com/Tomoko-hjf/paddleviz/issues/1#issuecomment-1655552510
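One thing still seems to be missing: a breakdown by task direction. With that, we can add the overall progress dashboard.
@gouzil Does the task direction need to appear as a table column? It seems the task direction is only used by the overall progress dashboard, and tasks are published by direction in the first place. So could each task's direction be defined in a program variable? For example, tasks 1-10 belong to the first direction, and so on.
How about parsing a table of contents or a comment? We also need to allow for tasks being added to a direction at short notice. Below is a small example; in the end we should probably still check with @Ligoml.
<!--
* Direction 1: 1-2
* Direction 2: 3-5
-->
Table of contents:
Direction 1: Framework
Direction 2: Community
Statistics (a Markdown table would probably work better for this figure)
@Tomoko-hjf @gouzil For task directions I suggest maintaining a dict or list, because tasks do get added later. Our usual practice is to keep numbering new tasks sequentially while placing the task description under its direction. Here is an example from the fourth hackathon.
I would also like to understand the current code logic. If I edit the md document in the issue, does the next automatic run build on my edits? For example, if I add a new task and someone claims it, what happens when the next automatic run starts?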
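The dict suggested here could look like the following. This is a minimal sketch with made-up data: the direction names, the task numbers, and the helpers `direction_of` and `summarize` are all hypothetical. Listing task numbers explicitly, rather than as ranges, lets a task added later keep its sequential number while still belonging to an earlier direction.

```python
from typing import Dict, List, Optional, Set

# Hypothetical mapping from direction to task numbers (not real hackathon data).
TASK_DIRECTIONS: Dict[str, List[int]] = {
    "Framework": [1, 2, 6],  # task 6 was added later but belongs here
    "Community": [3, 4, 5],
}


def direction_of(task_number: int, directions: Dict[str, List[int]]) -> Optional[str]:
    """Return the direction a task belongs to, or None if it is unknown."""
    for name, numbers in directions.items():
        if task_number in numbers:
            return name
    return None


def summarize(
    directions: Dict[str, List[int]], completed: Set[int]
) -> Dict[str, Dict[str, int]]:
    """Count total and completed tasks per direction for the dashboard."""
    summary = {}
    for name, numbers in directions.items():
        done = sum(1 for number in numbers if number in completed)
        summary[name] = {"tasks": len(numbers), "completed": done}
    return summary
```

The per-direction counts returned by `summarize` are the inputs a progress dashboard would need for its completion column.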
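I think the dashboard could be rendered as HTML and then converted to an image. A quick mock-up looks like this:
Python conversion code:
import imgkit
options = {'encoding': 'UTF-8'}
imgkit.from_file('a.html', 'a.jpg', options=options)
HTML code (it looks long, but it is just the same block repeated):
<!DOCTYPE html>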
<html>
<head>
<style>
table {
width: 100%;
border-collapse: collapse;
}
th, td {
border: 1px solid black;
padding: 8px;
text-align: center;
}
th {
background-color: #f2f2f2;
}
.progress {
height: 20px;
width: 100%;
background-color: #f2f2f2;
border: 1px solid #ccc;
position: relative;
}
.progress-inner {
height: 100%;
width: 0; /* Change this value to reflect the completion percentage */
background-color: #4CAF50;
position: absolute;
text-align: center;
color: white;
}
</style>
</head>
<body>
<table>
<tr>
<th>Task direction</th>
<th>Number of tasks</th>
<th>Claimed / Submitted</th>
<th>Submission rate</th>
<th>Completed</th>
<th>Completion rate</th>
</tr>
<tr>
<td>API development</td>
<td>31</td>
<td>31 / 26</td>
<td>84%</td>
<td>24</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 77%;">77%</div>
</div>
</td>
</tr>
<tr>
<td>Operator performance optimization</td>
<td>10</td>
<td>10 / 7</td>
<td>70%</td>
<td>7</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 70%;">70%</div>
</div>
</td>
</tr>
<tr>
<td>Data type extension - float16</td>
<td>10</td>
<td>10 / 10</td>
<td>100%</td>
<td>9</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 90%;">90%</div>
</div>
</td>
</tr>
<tr>
<td>Data type extension - unit tests</td>
<td>10</td>
<td>10 / 10</td>
<td>100%</td>
<td>8</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 80%;">80%</div>
</div>
</td>
</tr>
<tr>
<td>Standalone compilation of the PHI operator library</td>
<td>4</td>
<td>4 / 4</td>
<td>100%</td>
<td>4</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 100%;">100%</div>
</div>
</td>
</tr>
<tr>
<td>TensorRT development</td>
<td>8</td>
<td>8 / 8</td>
<td>100%</td>
<td>7</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 88%;">88%</div>
</div>
</td>
</tr>
<tr>
<td>CINN development</td>
<td>4</td>
<td>4 / 4</td>
<td>100%</td>
<td>4</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 100%;">100%</div>
</div>
</td>
</tr>
<tr>
<td>Open-source community insights</td>
<td>4</td>
<td>4 / 2</td>
<td>50%</td>
<td>2</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 50%;">50%</div>
</div>
</td>
</tr>
<tr>
<td>Other</td>
<td>9</td>
<td>8 / 3</td>
<td>33%</td>
<td>2</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 22%;">22%</div>
</div>
</td>
</tr>
<tr>
<td>Total</td>
<td>90</td>
<td>90 / 74</td>
<td>82%</td>
<td>67</td>
<td>
<div class="progress">
<div class="progress-inner" style="width: 74%;">74%</div>
</div>
</td>
</tr>
</table>
</body>
</html>
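The current logic is roughly as follows:
1. Before each update, fetch the issue content and structure each task's information into a task class (after the md document is edited, the automatic run builds on the manually edited tasks).
2. Update task registration status from the comments (for a new task, a claim is also picked up and updated correctly).
3. Update the remaining four statuses from PRs.
@AndSonder I tried it, and Markdown can render HTML directly, so there should be no need to convert to an image. Would showing it as a table in the issue be enough? A Markdown-syntax table can also show the dashboard information, but from what I found progress bars are hard to render that way, so let's use CSS styling like this.
Can a test version be ready by this Thursday? I would like the hackathon organizing committee to try it out.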
HackathonBot/README.md
Outdated
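To trigger a status change, the `PR` title only needs to start with `【Hackathon No.xxx】`. The program extracts the task number automatically and updates the leaderboard.

### 🚀 Dashboard
The dashboard converts HTML to an image stored in the `./image` folder, so the image link in the issue must point to that image (which raises the question of where the image file should be hosted).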
There are several options:
- Reverse-engineer the `https://github.com/upload/policies/assets` `api` (it does not seem to be a public API, so you would have to dig into it yourself; it is what the image upload button on `issues` calls).
- Use `<img src="data:image/png;base64`.
- Point to the `log` folder of the deployment repository and use an address under `https://raw.githubusercontent.com/`.
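1. I could not find an API behind GitHub's image upload button. Also, the scheduled job would upload images many times; does GitHub limit per-user upload storage?
2. With base64, the issue body becomes very long and is unpleasant to read manually.
3. If the image link in the issue points into the log folder of the PaddleAutoProject repository, each run can update the image in the log folder and push the repository. Alternatively, the server running the scheduled job presumably has a public IP, so pointing directly at the image path on that server might also work, with no need to push to the repository.
I lean towards option 3.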
Hmm, in principle you usually will not be given a server for the job, and whether there is an IP depends on where you deploy. Since the logs have to be uploaded anyway, using the repository's `raw` link should be fine.
> Hmm, in principle you usually will not be given a server for the job, and whether there is an IP depends on where you deploy. Since the logs have to be uploaded anyway, using the repository's `raw` link should be fine.

Using the raw link directly should be more convenient.
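Building the raw link for a file pushed to the repository is a matter of string formatting. The sketch below assumes the usual `raw.githubusercontent.com` layout of owner, repo, branch, and path; the function name `raw_file_url` is hypothetical, and the owner, branch, and file name in the usage note are placeholders.

```python
from urllib.parse import quote


def raw_file_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build the raw.githubusercontent.com URL for a file in a repository."""
    # quote() keeps "/" by default, so nested folders stay intact.
    clean_path = quote(path.lstrip("/"))
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{clean_path}"
```

For example, `raw_file_url("example-org", "PaddleAutoProject", "main", "log/board.jpg")` gives a link that the issue body can reference in an image tag after each push.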
The main issue is that CSS cannot be loaded. If it is plain HTML, it can be placed there directly. @Tomoko-hjf
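| -------- | ------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | -------- | ---- |
| 1 | First extraction from the comment section | 1. Create 2 task lists, each containing 5-10 tasks</br>2. Create at least 2 GitHub accounts and register for several tasks in the required format; all tasks are already published</br>Required format: 【报名】: 2、3, that is, `【报名】:` followed by task numbers; multiple tasks must be separated by the Chinese enumeration comma `、`, and consecutive tasks can be joined with a hyphen, as in `2-5`.</br>3. Run the script for the first time | Registration information is correctly written to the table | | |
| 2 | Second extraction from the comment section | 1. Based on case 1, add several more registrations in the comment section in the required format; all tasks are already published</br>2. Run the script on top of case 1 | New registration information is correctly written to the table | | |
| 3 | Adding/deleting/modifying task list entries | 1. Delete one random task from the task list (struck through in the md)</br>2. Add one random task to the task list (a new row with a unique task number)</br>3. Change the difficulty and issue description text of one random task</br>4. Add the new task number to task_list</br>5. Any GitHub account registers for the deleted task's number</br>6. Any GitHub account registers for the added task's number</br>7. Run the script on top of case 2 | 1. Added/deleted/modified task list entries are preserved</br>2. Registration information is correctly written to the table | | |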
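Case 1 allows consecutive tasks to be written as a range such as `2-5`. Expanding such a field into individual task numbers might look like the sketch below. The name `expand_task_numbers` is hypothetical, and returning a sorted, de-duplicated list is a design choice made for illustration.

```python
from typing import List


def expand_task_numbers(field: str) -> List[int]:
    """Expand a field such as "1、3-5" into [1, 3, 4, 5].

    Items are separated by "、"; an item of the form "a-b" is an inclusive range.
    Raises ValueError for non-numeric items or reversed ranges.
    """
    numbers = []
    for item in field.split("、"):
        item = item.strip()
        if "-" in item:
            start, end = (int(part) for part in item.split("-", 1))
            if start > end:
                raise ValueError(f"reversed range: {item!r}")
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(item))
    return sorted(set(numbers))
```

Because `int()` raises `ValueError` on anything that is not a number, malformed items surface as the same exception type as reversed ranges, which keeps the caller's error handling in one place.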
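For case 3 and case 4, we remind participants that they did not register correctly or did not write the PR title correctly. How to do this is open for discussion. I can think of three ways so far:
- The bot replies to the participant directly, as a reminder in the comment section.
- Email the participant.
- Email the hackathon organizing committee, who then make corrections and manage it.
Which do you all think is best?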
I think the first is better, since the reply can include suggested fixes. For the second, GitHub should already send emails automatically according to the user's notification settings, which also saves us the cost of maintaining mailboxes. With the third, the committee would receive so many emails every day that they would be too hard to handle.
+1 for the bot replying to the participant directly in the comment section.
When a committer or an official maintainer sees it, they can also edit the participant's comment and delete the bot's reply directly.
OK. I have filed a request with paddle-bot for an API we can call, so that paddle-bot sends the reminders by replying directly in the issue. @Tomoko-hjf can start writing this part of the code.
I'll take care of it, hehe~
80384d7
to
f9dbfcc
LGTM
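【New feature】Add Hackathon assistant
Main features:
- Automatically fill in registration information from `issue` replies to complete task claiming.
- Automatically update the table in the `issue` according to `PR` status to complete status changes.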