Premaster #61
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
name: 20NewsGroups | ||
batches_prefix: 20NG | ||
dataset_path: '/data_mil/datasets/20_News_dataset/ /data/datasets/20_News_dataset/20NG_BOW.csv' | ||
|
||
word: "@word" | ||
|
||
min_num_topics: 10 | ||
max_num_topics: 30 | ||
|
||
num_topics_interval: 3 | ||
num_fit_iterations: 40 | ||
num_restarts: 6 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
name: Brown | ||
batches_prefix: Brown | ||
dataset_path: '/data_mil/datasets/Brown/Brown.csv' | ||
|
||
word: "@word" | ||
|
||
min_num_topics: 5 | ||
max_num_topics: 25 | ||
|
||
num_topics_interval: 3 | ||
num_fit_iterations: 30 | ||
num_restarts: 6 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
name: PostNauka | ||
batches_prefix: PN | ||
dataset_path: '/data_mil/datasets/postnauka/postnauka.csv' | ||
|
||
word: "@word" | ||
|
||
min_num_topics: 5 | ||
max_num_topics: 50 | ||
|
||
num_topics_interval: 3 | ||
num_fit_iterations: 40 | ||
num_restarts: 6 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
name: Reuters | ||
batches_prefix: Reuters | ||
dataset_path: '/data_mil/datasets/Reuters/Reuters.csv' | ||
|
||
word: "@word" | ||
|
||
min_num_topics: 5 | ||
max_num_topics: 50 | ||
|
||
num_topics_interval: 3 | ||
num_fit_iterations: 40 | ||
num_restarts: 6 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
name: StackOverflow | ||
|
||
dataset_path: '/data_mil/datasets/StackOverflow/SO_vw_bow.txt' | ||
batches_prefix: SO | ||
|
||
word: "@lemmatized" | ||
|
||
# https://link.springer.com/article/10.1007/s10664-012-9231-y | ||
# Anton Barua, Stephen W. Thomas & Ahmed E. Hassan 2012 | ||
# used just 40 topics | ||
# | ||
# Rosen, C., Shihab, E. 2016 | ||
# What are mobile developers asking about? A large scale study using stack overflow. | ||
# used 40 topics (but merged them down to 32) | ||
|
||
min_num_topics: 5 | ||
max_num_topics: 60 | ||
|
||
num_topics_interval: 5 | ||
num_fit_iterations: 40 | ||
num_restarts: 6 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
name: WikiRef220 | ||
|
||
dataset_path: '/data_mil/datasets/WikiRef220/wiki_ref220_bow.csv' | ||
batches_prefix: WRef | ||
|
||
word: "@lemmatized" | ||
|
||
min_num_topics: 2 | ||
max_num_topics: 20 | ||
|
||
num_topics_interval: 1 | ||
num_fit_iterations: 40 | ||
num_restarts: 6 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
|
||
name: RuWikiGood | ||
batches_prefix: RWG | ||
dataset_path: '/data_mil/datasets/ruwiki_good/good_ruwiki_vw.txt' | ||
|
||
word: "@lemmatized" | ||
|
||
min_num_topics: 5 | ||
|
||
# around 10 main categories | ||
# around 87 `ul b` tags | ||
# around 238 <b> tags in total | ||
# max_num_topics: 300? | ||
max_num_topics: 100 | ||
|
||
num_topics_interval: 5 | ||
num_fit_iterations: 40 | ||
num_restarts: 4 | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -154,6 +154,13 @@ def init_lda( | |
dataset, modalities_to_use, main_modality, num_topics | ||
) | ||
|
||
# found in doi.org/10.1007/s10664-015-9379-3 | ||
Review comment: OK, but it is not clear what this comment refers to. Follow-up: Ah, OK. Then it is worth at least adding a TODO along the lines of: implement this LDA also. |
||
# Rosen, C., Shihab, E. 2016 | ||
# What are mobile developers asking about? A large scale study using stack overflow. | ||
# | ||
# "We use the defacto standard heuristics of α=50/K and β=0.01 | ||
# (Biggers et al. 2014) for our hyperparameter values" | ||
|
||
# what GenSim returns by default (everything is 'symmetric') | ||
# see https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py#L521 | ||
if prior == "symmetric": | ||
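The two prior conventions mentioned in the comments above can be compared with a small sketch. The function names here are hypothetical and not part of the PR; the first follows the quoted heuristic (α = 50/K, β = 0.01), the second mirrors Gensim's `'symmetric'` setting, where both priors are 1/K.

```python
def heuristic_priors(num_topics: int) -> tuple:
    """Heuristic quoted from Rosen & Shihab (2016): alpha = 50 / K, beta = 0.01.

    Hypothetical helper for illustration only.
    """
    alpha = 50.0 / num_topics
    beta = 0.01
    return alpha, beta


def symmetric_priors(num_topics: int) -> tuple:
    """Gensim-style 'symmetric' priors: alpha = eta = 1 / K.

    Hypothetical helper for illustration only.
    """
    value = 1.0 / num_topics
    return value, value


if __name__ == "__main__":
    # 40 topics is the count used in the StackOverflow studies cited in the config.
    for k in (10, 40, 60):
        print(k, heuristic_priors(k), symmetric_priors(k))
```

For K = 40 the heuristic gives a document-topic prior 50 times larger than the symmetric default, so the two settings are not interchangeable.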
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
import logging | ||
import os | ||
import pandas as pd | ||
from numpy.random import RandomState | ||
import uuid | ||
import warnings | ||
|
||
|
@@ -84,13 +85,13 @@ def __init__( | |
self._keys_mean_many.append(key) | ||
self._keys_std_many.append(key) | ||
|
||
# TODO: accept either VowpalWabbitTextCollection or Dataset with modalities | ||
def search_for_optimum(self, text_collection: VowpalWabbitTextCollection) -> None: | ||
_logger.info('Starting to search for optimum...') | ||
|
||
dataset = text_collection._to_dataset() | ||
|
||
# seed == None is too similar to seed == 0 | ||
seeds = [None] + list(range(1, self._num_restarts)) | ||
seeds = [None] + [abs(RandomState(i).tomaxint()) for i in range(1, self._num_restarts)] | ||
Review comment: A comment is missing here explaining why this is so convoluted. Also, if it has no effect, maybe it should be removed? |
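To make the seed construction in this hunk easier to reason about, here is a minimal standalone sketch. `make_seeds` is a hypothetical name; the body copies the expression from the diff. `RandomState.tomaxint()` is documented as returning integers in `[0, sys.maxint]`, so `abs` appears to act only as a safeguard.

```python
from numpy.random import RandomState


def make_seeds(num_restarts: int) -> list:
    """Return one seed per restart: None first, then large derived integers.

    Each small index i seeds its own RandomState, and the first integer drawn
    from it becomes the actual seed. The result is deterministic, but the
    seeds are spread over a wide range instead of being 1, 2, 3, ...
    """
    return [None] + [
        abs(RandomState(i).tomaxint()) for i in range(1, num_restarts)
    ]


if __name__ == "__main__":
    for seed in make_seeds(6):
        print(seed)
```

Whether widely spaced seeds change the results compared with `list(range(1, num_restarts))` is exactly what the review comment asks; the sketch only shows what values are produced.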
||
|
||
nums_topics = list(range( | ||
self._min_num_topics, | ||
|
Review comment: If each dataset gets its own max_num_topics, then in my opinion it should be set with a generous margin, for example x2 or x3 or more (so for 20 NG that would be 40 or 60). And if the topic range differs between datasets, the plots may be harder to read when they are combined into one figure in TeX.
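The effect of `min_num_topics`, `max_num_topics` and `num_topics_interval` on the searched grid can be sketched as follows. `topic_grid` is a hypothetical helper, and treating `max_num_topics` as inclusive is an assumption, since the `range(...)` call in the diff is cut off.

```python
def topic_grid(min_num_topics: int, max_num_topics: int, interval: int) -> list:
    """List the topic counts to try, assuming max_num_topics is inclusive."""
    return list(range(min_num_topics, max_num_topics + 1, interval))


if __name__ == "__main__":
    # Values taken from the 20NewsGroups config in this PR.
    current = topic_grid(10, 30, 3)
    # Reviewer's suggestion: raise the upper bound to x2 (60 for 20 NG).
    widened = topic_grid(10, 60, 3)
    print(current)
    print(widened)
```

Under this assumption the 20NewsGroups grid stops at 28 rather than 30, because the interval of 3 does not divide the range evenly; the same happens with the widened bound, which stops at 58.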