Premaster #61

Merged: 3 commits, Jun 24, 2020
1,220 changes: 1,220 additions & 0 deletions demos/NumTopics_all_datasets.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions demos/Stability-Demo.ipynb
@@ -2044,9 +2044,9 @@
],
"metadata": {
"kernelspec": {
-"display_name": "topicnet",
+"display_name": "Python 3",
 "language": "python",
-"name": "topicnet"
+"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -2058,7 +2058,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.6.10"
+"version": "3.6.7"
}
},
"nbformat": 4,
16 changes: 16 additions & 0 deletions topnum/configs/20NG.yml
@@ -0,0 +1,16 @@

name: 20NewsGroups
batches_prefix: 20NG
dataset_path: '/data_mil/datasets/20_News_dataset/ /data/datasets/20_News_dataset/20NG_BOW.csv'

word: "@word"

min_num_topics: 10
max_num_topics: 30
Review comment by @Alvant (Collaborator), May 24, 2020:

If each dataset gets its own max_num_topics, then IMO the limit should leave a generous margin, e.g. x2 or x3 or more (so for 20NG that would be 40 or 60). Also, if the range of topic counts differs between datasets, the plots may be harder to read once they are combined into a single figure in TeX.


num_topics_interval: 3
num_fit_iterations: 40
num_restarts: 6



16 changes: 16 additions & 0 deletions topnum/configs/Brown.yml
@@ -0,0 +1,16 @@

name: Brown
batches_prefix: Brown
dataset_path: '/data_mil/datasets/Brown/Brown.csv'

word: "@word"

min_num_topics: 5
max_num_topics: 25

num_topics_interval: 3
num_fit_iterations: 30
num_restarts: 6



16 changes: 16 additions & 0 deletions topnum/configs/PN.yml
@@ -0,0 +1,16 @@

name: PostNauka
batches_prefix: PN
dataset_path: '/data_mil/datasets/postnauka/postnauka.csv'

word: "@word"

min_num_topics: 5
max_num_topics: 50

num_topics_interval: 3
num_fit_iterations: 40
num_restarts: 6



16 changes: 16 additions & 0 deletions topnum/configs/Reuters.yml
@@ -0,0 +1,16 @@

name: Reuters
batches_prefix: Reuters
dataset_path: '/data_mil/datasets/Reuters/Reuters.csv'

word: "@word"

min_num_topics: 5
max_num_topics: 50

num_topics_interval: 3
num_fit_iterations: 40
num_restarts: 6



24 changes: 24 additions & 0 deletions topnum/configs/SO.yml
@@ -0,0 +1,24 @@
name: StackOverflow

dataset_path: '/data_mil/datasets/StackOverflow/SO_vw_bow.txt'
batches_prefix: SO

word: "@lemmatized"

# https://link.springer.com/article/10.1007/s10664-012-9231-y
# Anton Barua, Stephen W. Thomas & Ahmed E. Hassan 2012
# used just 40 topics
#
# Rosen, C., Shihab, E. 2016
# What are mobile developers asking about? A large scale study using stack overflow.
# used 40 topics (but merged them down to 32)

min_num_topics: 5
max_num_topics: 60

num_topics_interval: 5
num_fit_iterations: 40
num_restarts: 6



13 changes: 13 additions & 0 deletions topnum/configs/WikiRef220.yml
@@ -0,0 +1,13 @@
name: WikiRef220

dataset_path: '/data_mil/datasets/WikiRef220/wiki_ref220_bow.csv'
batches_prefix: WRef

word: "@lemmatized"

min_num_topics: 2
max_num_topics: 20

num_topics_interval: 1
num_fit_iterations: 40
num_restarts: 6
21 changes: 21 additions & 0 deletions topnum/configs/ruwikigood.yml
@@ -0,0 +1,21 @@

name: RuWikiGood
batches_prefix: RWG
dataset_path: '/data_mil/datasets/ruwiki_good/good_ruwiki_vw.txt'

word: "@lemmatized"

min_num_topics: 5

# around 10 main categories
# around 87 `ul b` tags
# around 238 <b> tags in total
# max_num_topics: 300?
max_num_topics: 100

num_topics_interval: 5
num_fit_iterations: 40
num_restarts: 4
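
The configs above share one schema: a topic range (`min_num_topics`, `max_num_topics`), a step (`num_topics_interval`), and a restart count (`num_restarts`). As a rough sketch of what those numbers imply, the snippet below expands a config into a grid of topic counts and counts the models that would be trained. The helpers `topic_grid` and `num_models` are hypothetical, and treating `max_num_topics` as inclusive is an assumption; this PR does not show how the search method consumes these fields.

```python
# Hypothetical helpers for illustration; they are not part of topnum.
# Values are copied from topnum/configs/WikiRef220.yml and topnum/configs/SO.yml.
WIKIREF220 = {
    "min_num_topics": 2,
    "max_num_topics": 20,
    "num_topics_interval": 1,
    "num_restarts": 6,
}
STACKOVERFLOW = {
    "min_num_topics": 5,
    "max_num_topics": 60,
    "num_topics_interval": 5,
    "num_restarts": 6,
}


def topic_grid(config):
    """Topic counts to try, ASSUMING max_num_topics is inclusive."""
    return list(range(
        config["min_num_topics"],
        config["max_num_topics"] + 1,
        config["num_topics_interval"],
    ))


def num_models(config):
    """One model per (topic count, restart) pair."""
    return len(topic_grid(config)) * config["num_restarts"]


for name, config in [("WikiRef220", WIKIREF220), ("StackOverflow", STACKOVERFLOW)]:
    grid = topic_grid(config)
    print(name, "tries", len(grid), "topic counts;", num_models(config), "models in total")
```

Under that assumption the grids differ in both range and step from one dataset to the next, which is the situation the reviewer's remark about combining plots into one figure refers to.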



8 changes: 8 additions & 0 deletions topnum/model_constructor.py
@@ -154,6 +154,14 @@ def init_lda(
dataset, modalities_to_use, main_modality, num_topics
)

# TODO: implement this LDA also
# Found in doi.org/10.1007/s10664-015-9379-3
# Rosen, C., Shihab, E. 2016
# What are mobile developers asking about? A large scale study using stack overflow.
#
# "We use the defacto standard heuristics of α=50/K and β=0.01
# (Biggers et al. 2014) for our hyperparameter values"

# what GenSim returns by default (everything is 'symmetric')
# see https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py#L521
if prior == "symmetric":
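
The TODO in this hunk quotes the heuristic α=50/K and β=0.01. A minimal sketch of that arithmetic follows; the function name `heuristic_lda_priors` is hypothetical, and how these values would be wired into `init_lda` (for example, as a new `prior` option) is not shown in this PR.

```python
def heuristic_lda_priors(num_topics, beta=0.01):
    """Return (alpha, beta) under the quoted heuristic alpha = 50 / K.

    Hypothetical helper: the name and signature are assumptions,
    not part of topnum.
    """
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    alpha = 50.0 / num_topics
    return alpha, beta


# With K = 40 topics (the count used in the cited StackOverflow studies):
alpha, beta = heuristic_lda_priors(40)
print(alpha, beta)
```

Because α shrinks as K grows, the document-topic prior becomes sparser for larger models, whereas a fixed symmetric prior stays the same for every K.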
7 changes: 5 additions & 2 deletions topnum/search_methods/optimize_scores_method.py
@@ -1,6 +1,7 @@
import logging
import os
import pandas as pd
+from numpy.random import RandomState
import uuid
import warnings

@@ -87,6 +88,7 @@ def __init__(
self._keys_mean_many.append(key)
self._keys_std_many.append(key)

+# TODO: accept either VowpalWabbitTextCollection or Dataset with modalities
def search_for_optimum(
self,
text_collection: VowpalWabbitTextCollection) -> None:
@@ -95,8 +97,9 @@ def search_for_optimum(

dataset = text_collection._to_dataset()

-# seed == None is too similar to seed == 0
-seeds = [None] + list(range(1, self._num_restarts))
+# TODO: if these sophisticated seeds don't make the models different,
+# revert to the simpler seeds (0, 1, 2, ...)
+seeds = [None] + [abs(RandomState(i).tomaxint()) for i in range(1, self._num_restarts)]
Review comment (Collaborator):

This needs a comment explaining why the seeds are generated in such a convoluted way. Also, if it has no effect, maybe it should be removed?


nums_topics = list(range(
self._min_num_topics,
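
The seed list built in this hunk keeps `None` for the first restart and replaces the small sequential seeds 1, 2, ... with large pseudo-random integers derived from them. The sketch below shows the same idea using only the standard library: `random.Random(i)` stands in for `numpy.random.RandomState(i).tomaxint()`, so the actual seed values differ from the PR's. The function name `make_seeds` and the 31-bit upper bound are assumptions made for illustration.

```python
import random


def make_seeds(num_restarts, max_seed=2**31 - 1):
    """Seeds for repeated model fits.

    The first restart uses None; each later restart i gets a large integer
    drawn from a generator that is itself seeded with i, so the list is
    reproducible across runs.
    Hypothetical helper: stdlib stand-in for the RandomState-based line above.
    """
    derived = [
        random.Random(i).randint(0, max_seed)
        for i in range(1, num_restarts)
    ]
    return [None] + derived


seeds = make_seeds(6)
print(seeds)
```

Whether spread-out seeds actually produce more diverse models than 0, 1, 2, ... is the open question raised by the TODO and the review comment; comparing the fitted models from both seed schemes would be one way to settle it.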