ans_tes
data
models
usr_data
1_get_emb.py
2_get_oag_emb.py
3_get_feats.py
4_nn_train.py
5_nn_infer.py
6_tree_train.py
7_tree_infer.py
8_ensemble.py
infer.sh
model.py
readme.md
tmp
train.sh

Whoiswho

Team: LGB YYDS (rank 9)

Prerequisites

Windows
python==3.9
pandas==1.4.4
numpy==1.21.6
scikit-learn==1.4.2
gensim==4.1.2
cogdl==0.6
tqdm==4.64.1
lightgbm==4.1.0
xgboost==2.0.2
pytorch==1.13.1+cu117
pyarrow==16.0
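
The pins above can be checked against the active environment before running anything. The following is a minimal sketch, not part of the repository: `check_pins` is a hypothetical helper, and it assumes the `pytorch` entry is installed under the distribution name `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above. "torch" is an assumption: it is the usual
# distribution name for the "pytorch" entry.
PINNED = {
    "pandas": "1.4.4",
    "numpy": "1.21.6",
    "scikit-learn": "1.4.2",
    "gensim": "4.1.2",
    "cogdl": "0.6",
    "tqdm": "4.64.1",
    "lightgbm": "4.1.0",
    "xgboost": "2.0.2",
    "torch": "1.13.1+cu117",
    "pyarrow": "16.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Prefix match, so a pin of "16.0" also accepts "16.0.0".
        if found is None or not found.startswith(expected):
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package was found with a matching version prefix.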

Hardware device

CPU AMD 5600X
GPU 3080Ti 12G
RAM 64G

Parameter count

total ~110,500,000

oagbert-v2 ~110M
mlp ~430k
hand-crafted features ~2k

File structure

data [datasets provided by the organizer]
- train_author.json
- pid_to_info_all.json
- ind_test_author_submit.json
- ind_test_author_filter_public.json
usr_data [intermediate data generated by the code]
models [trained models]
ans_test [single-model answers and the final answer (named ensemble.json)]
code files
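
Before training, it can help to confirm that the organizer files are in place. This is a minimal sketch under the assumption that `data` sits directly under the repository root. `missing_inputs` is a hypothetical helper, not part of the code files.

```python
from pathlib import Path

# File names taken from the "data" entry above.
REQUIRED_INPUTS = [
    "train_author.json",
    "pid_to_info_all.json",
    "ind_test_author_submit.json",
    "ind_test_author_filter_public.json",
]


def missing_inputs(root="."):
    """Return the organizer files that are absent from <root>/data."""
    data_dir = Path(root) / "data"  # assumed location of the data directory
    return [name for name in REQUIRED_INPUTS if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All organizer files found.")
```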

The contents of usr_data, models, and ans_test can be downloaded from:

Link: https://pan.baidu.com/s/1iCtXYY-1jIp51lVAAcduMg Password: fq70

Run code

Train and infer: sh train.sh
Infer only: sh infer.sh

Method

Extract embeddings of the article information with word2vec (w2v) and TF-IDF.
Extract embeddings of the article information with OAG-BERT.
Engineer features: statistical features and distance features.
Train an MLP on the features to obtain out-of-fold (OOF) predictions.
Train XGBoost and LightGBM models on different feature groups plus the MLP OOF predictions, yielding 4 single-model predictions.
Ensemble: a weighted average of the 4 single-model answers.
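
Two of the steps above, the out-of-fold (OOF) prediction and the weighted average, can be illustrated with a small standard-library sketch. It is not the repository's implementation, and every name in it is hypothetical. `fit` stands in for the MLP training step. The answer layout `{author_id: {paper_id: score}}` and the plain shuffled K-fold split are assumptions, and the weights in the usage note are placeholders, since the actual values are not stated here.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Shuffle sample indices and split them into n_splits disjoint folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::n_splits] for i in range(n_splits)]


def oof_predict(X, y, fit, n_splits=5, seed=0):
    """Score every row with a model that was trained without that row.

    `fit(X_train, y_train)` is a hypothetical callable that returns a
    `predict(X_valid)` function; it stands in for the MLP training step.
    """
    oof = [None] * len(X)
    for valid_idx in kfold_indices(len(X), n_splits, seed):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = predict([X[i] for i in valid_idx])
        for i, score in zip(valid_idx, scores):
            oof[i] = score  # usable later as an extra feature for the tree models
    return oof


def weighted_average(answers, weights):
    """Blend answers shaped {author_id: {paper_id: score}} (assumed layout)."""
    total = sum(weights)
    blended = {}
    for author_id, papers in answers[0].items():
        blended[author_id] = {}
        for paper_id in papers:
            weighted_sum = sum(
                w * answer[author_id][paper_id]
                for answer, w in zip(answers, weights)
            )
            blended[author_id][paper_id] = weighted_sum / total
    return blended


def fit_mean(X_train, y_train):
    """Toy stand-in model: always predicts the mean training label."""
    mean = sum(y_train) / len(y_train)
    return lambda X_valid: [mean] * len(X_valid)
```

For example, `oof_predict(X, y, fit_mean)` returns one score per training row, and `weighted_average([a1, a2, a3, a4], [1, 1, 1, 1])` reduces to a plain mean of four answers. Dividing by the sum of the weights keeps the blended scores on the same scale as the inputs.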