articles_generator is a module that provides the ArticleGenerator class. All work should be done through that class.
After cloning the repository, install the dependencies and download the GPT-2 models:
pip3 install -r requirements.txt
python3 download_model.py 117M
python3 download_model.py 345M
There are two ways to run the script:
- via a single method
from articles_generator import ArticleGenerator
default_path = ""
a_gen = ArticleGenerator(default_path=default_path, verbose=1)
a_gen.process_all_steps()
- via a sequence of steps (useful in Colab for inspecting intermediate dataframes)
from articles_generator import ArticleGenerator
default_path = ""
a_gen = ArticleGenerator(default_path=default_path, verbose=1)
a_gen.step_load_data()
a_gen.step_prepare_tf_hub()
a_gen.step_clusterize_questions()
a_gen.step_generate_questions_texts()
a_gen.step_merge_texts_to_questions()
a_gen.step_load_texts()
a_gen.step_extract_sentences()
a_gen.step_find_closest_sentences_to_question()
a_gen.save_articles()
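The step sequence above can also be driven from a list of method names, which makes it easy to time each step or stop after a particular one. The following is only a sketch: run_steps and STEPS are hypothetical helpers that are not part of articles_generator, and the sketch assumes every step is a method that takes no arguments, as in the calls above.

```python
import time


def run_steps(generator, step_names, verbose=True):
    """Call each named zero-argument method on `generator`, in order.

    Hypothetical helper, not part of articles_generator.
    Returns a list of (step_name, elapsed_seconds) pairs.
    """
    timings = []
    for name in step_names:
        # getattr raises AttributeError if a step name is misspelled.
        step = getattr(generator, name)
        started = time.perf_counter()
        step()
        elapsed = time.perf_counter() - started
        timings.append((name, elapsed))
        if verbose:
            print(f"{name}: {elapsed:.2f}s")
    return timings


# Step names copied from the sequence above, in the same order.
STEPS = [
    "step_load_data",
    "step_prepare_tf_hub",
    "step_clusterize_questions",
    "step_generate_questions_texts",
    "step_merge_texts_to_questions",
    "step_load_texts",
    "step_extract_sentences",
    "step_find_closest_sentences_to_question",
    "save_articles",
]
```

With an a_gen instance created as shown above, run_steps(a_gen, STEPS) would run the whole pipeline, and run_steps(a_gen, STEPS[:3]) would stop after clustering so the intermediate dataframes can be inspected.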
The final processed file is saved to /articles_by_question.csv
There are several configuration constants in ArticleGenerator:
- ArticleGenerator.MAX_CLUSTER_SIZE: Maximum number of unique questions allowed in one cluster during question clustering. Default is 250.
- ArticleGenerator.BASE_DBSCAN_EPS: Starting DBSCAN epsilon for question clustering. The smaller the value, the closer the items within a cluster are to each other. Default is 0.4.
- ArticleGenerator.DBSCAN_EPS_MULT_STEP: Used in question clustering. If a cluster larger than MAX_CLUSTER_SIZE is found, clustering runs another cycle with a new epsilon = self.BASE_DBSCAN_EPS/multiplicator. At the start multiplicator = 1, and with each new cycle it is decreased by DBSCAN_EPS_MULT_STEP. Default is 0.3.
- ArticleGenerator.BATCH_SIZE: Batch size for the embedding module. Default is 1000.
- ArticleGenerator.NUM_CLOSEST_SENTENCES: Number of sentences closest to the question that are selected from all sentences. Used in STEP 8. Default is 20.
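To illustrate what BATCH_SIZE and NUM_CLOSEST_SENTENCES control, here is a minimal, dependency-free sketch. It is not the code used by ArticleGenerator: the helper names are hypothetical, and cosine similarity is an assumption about how "closest" is measured.

```python
import math


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_sentences(question_vec, sentence_vecs, k):
    """Return the indices of the k sentence vectors most similar to the question.

    Ties are broken by the lower index, so the result is deterministic.
    """
    scored = [
        (cosine_similarity(question_vec, vec), index)
        for index, vec in enumerate(sentence_vecs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]
```

In this sketch, batched(sentences, 1000) would feed an embedding function 1000 sentences at a time, and closest_sentences(question_vec, sentence_vecs, 20) would keep the 20 best matches, mirroring the two default values.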
See DEVELOPERS.md
See CONTRIBUTORS.md
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}