[REVIEW]: SIRITVIS: Social Interaction Research Insights Topic Visualisation #6243
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks. For a list of things I can do to help you, just type:

@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
👋 Hi @n3mo @cjbarrie, thank you so much for helping out at JOSS. If you need any pointers, please feel free to look at previous reviews (which can be found by looking at published papers) and the documentation. If you need to comment on the code itself, opening an issue at the repo and then linking to it from here (to help me/others keep track) is the way to go. For comments on the paper, you can also open issues or PRs (say for typos), but those can be directly posted as replies in this issue. Thanks, and feel free to reach out if you need me.
@editorialbot generate pdf
Review checklist for @cjbarrie
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Hi All, I've been going through this and am bumping up against quite a few errors. I detail these in commented lines below:

```python
from SIRITVIS import twitter_streamer, insta_streamer, reddit_streamer, cleaner, topic_model, topic_visualise, topic_mapper
import ssl
import nltk

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')

# Run the streaming process to retrieve raw data based on the specified keywords
client_id = "******************"
client_secret = "**********************"
user_agent = "**************"
keywords = ['and']  # just to get random data
save_path = 'data/'
raw_data = reddit_streamer.RedditStreamer(client_id, client_secret, user_agent, save_path, keywords).run()

cleaner_obj = cleaner.Cleaner(data_source='data/', data_source_type='reddit')
cleaned_file = cleaner_obj.saving('data_cleaned/', data_save_name='reddit_cleaned')
cleaned_file.to_csv('corpus.csv', index=False, header=True)

# gives error if you pass cleaned_file to dataset_source
model = topic_model.TopicModeling(num_topics=10, dataset_source='corpus.csv',
    learning_rate=0.001, batch_size=32, activation='softplus', num_layers=3, num_neurons=100,
    dropout=0.2, num_epochs=100, save_model=False, model_path=None, train_model='NeuralLDA',
    evaluation=['topicdiversity', 'invertedrbo', 'jaccardsimilarity'])
saved_model = model.run()
# gives nltk install error unless you use SSL routine above
# gives following error: An error occurred: max_df corresponds to < documents than min_df

vis_model = topic_visualise.PyLDAvis(data_source='corpus.csv', num_topics=5, text_column='text')
vis_model.visualize()

vis_model = topic_visualise.Wordcloud(data_source='corpus.csv', text_column='text')
vis_model.visualize()  # nothing happens when trying to visualize
```
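For context on the max_df/min_df failure above: that message is the standard scikit-learn CountVectorizer ValueError, raised when a corpus contains fewer documents than the vectorizer's document-frequency cutoffs allow. A minimal sketch reproducing it, assuming (but not confirmed from the SIRITVIS source) that the package vectorizes the corpus with scikit-learn internally:

```python
from sklearn.feature_extraction.text import CountVectorizer

# With only two documents, a min_df of 5 can never be satisfied, so
# fitting raises: "max_df corresponds to < documents than min_df"
docs = ["one short post", "another short post"]
try:
    CountVectorizer(min_df=5, max_df=0.9).fit(docs)
except ValueError as err:
    print(err)
```

In other words, a very small cleaned corpus (like the 15-row file discussed below) is the likely trigger.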
Hello @cjbarrie, Thank you for providing your feedback. I'm pleased to inform you that the issues you mentioned have been addressed in the latest version (1.1.6). To ensure optimal plotting, I suggest using an Integrated Development Environment (IDE), as some plots may fail to save as image or HTML files when code is executed from the command line or a terminal. Additionally, for the best results and visualization appearance, it is recommended to use a csv file of at least 500KB to 1MB.
Thank you for this @CodeEagle22. I'll review again with the latest version. It may be worth adding some of these pointers to the README too.
Hello @cjbarrie, I have updated the README.md file on GitHub to ensure it reflects the latest information. Please inform me if there are any additional requirements needed to finalize the review process.
Hi @CodeEagle22, this version now allows the cleaned_file object to be passed to the model specification function. But every time I run this, I get the following error:

```python
model = topic_model.TopicModeling(num_topics=10, dataset_source=cleaned_file,
    learning_rate=0.001, batch_size=32, activation='softplus', num_layers=3, num_neurons=100,
    dropout=0.2, num_epochs=100, save_model=False, model_path=None, train_model='NeuralLDA',
    evaluation=['topicdiversity', 'invertedrbo', 'jaccardsimilarity'])
```

I think this is because the actual reddit data I've collected is not sufficiently large. But the raw data I collected is >2GB in size. When I generate the cleaned file with:

```python
cleaner_obj = cleaner.Cleaner(data_source='data/', data_source_type='reddit')
cleaned_file = cleaner_obj.saving('data_cleaned/', data_save_name='reddit_cleaned')
```

I get a .csv file with just 15 rows every time. Is there an argument I need to add to make sure that all of the data is combined with this function? It would be helpful to have more documentation so I could answer this myself.
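As an aside, a hypothetical manual combine for anyone hitting the same limit, assuming the cleaner writes numbered part files (the data_cleaned/reddit_cleaned_part_1.csv path used later in this thread suggests it does; the glob pattern is an assumption):

```python
import glob
import pandas as pd

# Concatenate all cleaned part files into a single corpus. The
# 'reddit_cleaned_part_*.csv' pattern is assumed, not documented.
parts = sorted(glob.glob('data_cleaned/reddit_cleaned_part_*.csv'))
combined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
combined.to_csv('corpus.csv', index=False)
print(f"combined {len(parts)} part files into {len(combined)} rows")
```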
Hey @cjbarrie, Regarding the issue you're facing with the cleaned_file object in the model specification function, it seems that the error is due to an empty training set. This is likely because the cleaned file generated from your raw Reddit data contains only 15 rows, which is not sufficient for training the model effectively. Multiple filters are applied during the cleaning process to remove spam posts. While this ensures data quality, it can greatly reduce the number of rows in the cleaned file. If you use the provided sample raw Reddit data, you will find that it generates thousands of lines in the cleaned file; the fact that your streamed data yields only 15 rows suggests it contains very few ham (non-spam) posts. To address this, I suggest considering the following points:
By addressing these points, you can mitigate the issue of an empty training set and ensure that the model specification function receives a sufficiently large and diverse dataset for effective training.
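A quick sanity check before training, sketched under the assumption that the cleaned output was saved to corpus.csv as in the code above; the 500-row threshold is arbitrary:

```python
import pandas as pd

# Check how much data survived the spam filters before training;
# a near-empty corpus is what produces the empty-training-set error.
cleaned = pd.read_csv('corpus.csv')
print(f"{len(cleaned)} rows after cleaning")
if len(cleaned) < 500:  # arbitrary floor, tune to your use case
    print("Warning: very few posts survived cleaning; "
          "collect more raw data before training a topic model.")
```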
Thanks a lot for this, @CodeEagle22. This is helpful. I can use the sample data for tests... One thing: none of this seems to have been documented anywhere. It's hard for us to review when we don't know exactly what a function is doing without documentation (absent going into all of the source code etc.). And this would also make the library a lot more useful for applied users. I strongly recommend that you consider providing documentation for the functions, detailing the different variables and parameters. My apologies if this exists and I just haven't found it.
Hi @CodeEagle22, I've returned to this. I'm now using the sample_dataset rather than collecting my own data. When I do this (in VS Code), I get the following error:

```python
# gives following error: The visualisation is based on Latent Dirichlet Allocation (LDA) model.
# An error occurred: 'NoneType' object has no attribute 'display_formatter'
vis_model = topic_visualise.PyLDAvis(data_source=cleaned_file, num_topics=5, text_column='text')
vis_model.visualize()

# gives following error: The visualisation is based on Latent Dirichlet Allocation (LDA) model.
# An error occurred: 'NoneType' object has no attribute 'display_formatter'
vis_model = topic_visualise.Wordcloud(data_source=cleaned_file, text_column='text')
vis_model.visualize()
```

I therefore decided to try out a different IDE and used a Google Colab notebook. When I did this, I got an import error related to pyLDAvis:

```python
from SIRITVIS import twitter_streamer, insta_streamer, reddit_streamer, cleaner, topic_model, topic_visualise, topic_mapper
import nltk
nltk.download('stopwords')
```

So I decided to try jupyter lab. When I did this, all of the code ran but there was again no visualization output. I think at this stage what I need is a notebook that I can open and run. All of these errors make me a bit nervous and will also mean it is unlikely that your end user is going to have much luck either.
Hey @cjbarrie, sorry for the inconvenience. I've fixed the issues and provided the latest documentation; please install SIRITVIS==1.1.7.1 via pip. This should address all the problems and make the package more robust for end users.
Review checklist for @n3mo
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
@n3mo what is the status of your review now? 😊
Hiya. I just went back to test with the new version of the library that @CodeEagle22 mentioned. I'm still bumping into a lot of errors, unfortunately. As I said above, it would really help if the authors could provide a Colab runthrough of the working library. As written, I just don't have the time to keep testing code that is not functioning as described. This is the fourth or fifth time I've been back to it. I appreciate a lot of work has gone into this software. But realistically, the take-up by any end user will be minimal without further guidance on functionality and without resolving the errors I keep encountering. Here is my code, with annotations of the errors this routine throws:

```python
from SIRITVIS import twitter_streamer, insta_streamer, reddit_streamer, cleaner, topic_model, topic_visualise, topic_mapper
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk
nltk.download('stopwords')

# using the Github sample_dataset data
cleaner_obj = cleaner.Cleaner(data_source='data/', data_source_type='reddit')
cleaned_file = cleaner_obj.saving('data_cleaned/', data_save_name='reddit_cleaned')
cleaned_file.to_csv('corpus.csv', index=False, header=True)

# This no longer takes "cleaned_file" object as parameter
model = topic_model.TopicModeling(num_topics=10, dataset_source='data_cleaned/reddit_cleaned_part_1.csv',
    learning_rate=0.001, batch_size=32, activation='softplus', num_layers=3, num_neurons=100,
    dropout=0.2, num_epochs=100, save_model=False, model_path=None, train_model='NeuralLDA',
    evaluation=['topicdiversity', 'invertedrbo', 'jaccardsimilarity'])
saved_model = model.run()

# gives following error: The visualisation is based on Latent Dirichlet Allocation (LDA) model.
# An error occurred: 'NoneType' object has no attribute 'display_formatter'
vis_model = topic_visualise.PyLDAvis(data_source=cleaned_file, num_topics=5, text_column='text')
vis_model.visualize()

# gives following error: An error occurred: 'ImageDraw' object has no attribute 'textsize'
vis_model = topic_visualise.Wordcloud(data_source=cleaned_file, text_column='text')
vis_model.visualize()
```
I'm running into similar errors as @cjbarrie. The documentation suggests that the package is optimized for python notebooks, but it seems more accurate to say that it currently depends on them. I encountered the same two errors shared by @cjbarrie. The 'display_formatter' error associated with topic_visualise.PyLDAvis occurs only outside of a notebook context. It seems that the pyLDAvis package that does the heavy lifting here offers a way to save the visualization to an HTML file, which might serve users working outside notebooks. Aside from the above, the 'textsize' error associated with topic_visualise.Wordcloud remains.
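For reference, a minimal sketch of that pyLDAvis escape hatch. pyLDAvis.save_html is a real pyLDAvis function, but SIRITVIS does not appear to expose the object it needs, so `prepared` below is a hypothetical stand-in:

```python
import pyLDAvis

# save_html writes a self-contained page that opens in any browser,
# no notebook or display formatter required. `prepared` is hypothetical:
# it stands in for the PreparedData object that pyLDAvis.prepare() returns.
pyLDAvis.save_html(prepared, 'lda_vis.html')
```

On the 'textsize' error: ImageDraw.textsize was removed in Pillow 10.0, so this usually indicates a wordcloud build that predates the removal; pinning Pillow below 10 or upgrading wordcloud tends to resolve it. That is an inference from the error message, not something verified against SIRITVIS's dependency pins.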
I can confirm that the TopicMapper visualization also works for me, but also only in a notebook context. There is no error in an interactive python shell context; instead, an IPython.core.display.HTML object is returned. I believe this can be saved to file for display in a browser. But clearly, a notebook context is preferred here as well.
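A small sketch of that save-to-file idea, where `html_obj` is a hypothetical name for the IPython.core.display.HTML object the topic mapper returns in a plain shell:

```python
# IPython HTML display objects keep their markup in the .data attribute,
# so it can be written out and opened in a browser. `html_obj` is a
# hypothetical name for the object returned by the topic mapper.
with open('topic_map.html', 'w', encoding='utf-8') as f:
    f.write(html_obj.data)
```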
@CodeEagle22 what is the current situation?
@editorialbot generate pdf
Hello @samhforbes, the latest version of the paper is more accurate, as you requested. Could you please generate the PDF again? Thank you!
Dear @samhforbes, @oliviaguest, we rebuilt the paper from scratch and it should now have the expected form. We should be able to move forward now. Sorry for the delay and thank you for your patience!
@editorialbot generate pdf
@editorialbot check references
@xillig @CodeEagle22 the PDF looks good, thanks.
Hey @samhforbes, I've updated the LICENSE file in the Git repo to CC-BY, with all authors correctly listed. Looking forward to any feedback you might have!
Dear @samhforbes, @oliviaguest, we have addressed all the requested changes. I think we are ready for release!
Hi both @CodeEagle22 and @xillig, I just wanted to double check: please note in the submission requirements that JOSS needs OSI-approved licences; see here: https://opensource.org/license
Hi @samhforbes, the license has been updated to MIT on both GitHub and Figshare.
@editorialbot recommend-accept
👋 @openjournals/sbcs-eics, this paper is ready to be accepted and published. Check final proof 👉📄 Download article. If the paper PDF and the deposit XML files look good in openjournals/joss-papers#5739, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept
@editorialbot accept
Ensure proper citation by uploading a plain text CITATION.cff file to the default branch of your repository. If using GitHub, a Cite this repository menu will appear in the About section, containing both APA and BibTeX formats. When exported to Zotero using a browser plugin, Zotero will automatically create an entry using the information contained in the .cff file. You can copy the contents for your CITATION.cff file here: CITATION.cff

If the repository is not hosted on GitHub, a .cff file can still be uploaded to set your preferred citation. Users will be able to manually copy and paste the citation.
🐘🐘🐘 👉 Toot for this paper 👈 🐘🐘🐘
🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨 Here's what you must now do:
Any issues? Notify your editorial technical team...
Dear @arfon, @n3mo, @oliviaguest, @samhforbes, @cjbarrie, Thank you all for your contributions in making this publication possible 😄 All the best!
Many thanks to @arfon, @n3mo, @oliviaguest, @samhforbes, and @cjbarrie for your valuable contributions and support in making this release possible 🎉
Congrats @CodeEagle22 @xillig!
🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉 If you would like to include a link to your paper from your README use the following code snippets:

This is how it will look in your documentation:

We need your help! The Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the following:
Submitting author: @CodeEagle22 (Sagar Narwade)
Repository: https://github.com/CodeEagle22/SIRITVIS/
Branch with paper.md (empty if default branch):
Version: v2.0.0
Editor: @oliviaguest
Reviewers: @n3mo, @cjbarrie
Archive: 10.6084/m9.figshare.26298487
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@n3mo & @cjbarrie, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review.
First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @oliviaguest know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @cjbarrie
📝 Checklist for @n3mo