Bring your own LIWC & matplotlib dependency fix #322
base: dev
TODO (1): Please pass in a version of the text that retains key punctuation. Currently, we are passing in
We DO need to remove capitalization and SOME punctuation, but there are some lexicon words (e.g., emojis) that require punctuation to be retained.
TODO (2): Please check emoji/word boundary rules and run tests on custom LIWC categories
Emily's sample testing code: test_byol.ipynb.zip
I ran a few simple test cases on the new LIWC, and I found that the emojis (perhaps due to word boundary rules) are not fully working as expected; emojis like
More generally, would it be possible to read in the dictionary and generate some test cases? For example, if a category consists of "happy, good, great," let's generate a sentence that samples a combination of words with replacement from the category, and words that are not in the category (e.g., "foo"):
And we can therefore assert the expected value to be the number of category words we sampled: 5 (in this case). It should be relatively easy to do this locally with the dictionary, and confirm that we pass all tests for both the old and new dictionaries. Additionally, let's try variations with punctuation, just to confirm that the new parentheses rules aren't breaking anything. I think it's important to just have this sanity check that everything works as we expect!
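A minimal sketch of the generated test described in this comment, under stated assumptions: `count_category_words` and `make_test_sentence` are hypothetical helpers, not functions from this package. The stand-in counter uses plain `\b` word boundaries, so it only covers alphabetic terms; emoji terms would need the boundary rules under discussion. In a real test, the package's own LIWC counting function would replace the stand-in.

```python
import random
import re


def count_category_words(text, category_words):
    """Hypothetical stand-in counter: whole-word matches of any category word."""
    alternatives = "|".join(re.escape(word) for word in category_words)
    pattern = r"\b(?:" + alternatives + r")\b"
    return len(re.findall(pattern, text.lower()))


def make_test_sentence(category_words, n_hits, fillers=("foo", "bar", "baz"),
                       n_fillers=4, seed=0):
    """Build a sentence with exactly n_hits category words plus filler words."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    hits = rng.choices(list(category_words), k=n_hits)  # sampled with replacement
    noise = rng.choices(list(fillers), k=n_fillers)     # words outside the category
    tokens = hits + noise
    rng.shuffle(tokens)
    return " ".join(tokens), n_hits


category = ["happy", "good", "great"]
sentence, expected = make_test_sentence(category, n_hits=5)
assert count_category_words(sentence, category) == expected

# Punctuation variation: commas between words and a trailing "!" should not change the count.
punctuated = sentence.replace(" ", ", ") + "!"
assert count_category_words(punctuated, category) == expected
```

Looping the same check over every category in both the old and the new dictionary would give the full sanity check requested here.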
Please make the following changes:
- Add the custom LIWC feature names to `self.chat_features` in the FeatureBuilder;
- Update the version of the text passed into the LIWC feature to be preprocessed WITH (some) punctuation;
- Add tests to confirm that all the features work as expected, especially with edge cases (e.g., emojis, other punctuation).
Thanks so much for your amazing progress and persistence on this!!
@@ -47,12 +48,19 @@ def liwc_features(chat_df: pd.DataFrame, message_col) -> pd.DataFrame:
lexicons_dict = pickle.load(lexicons_pickle_file)

# Return the lexical features stacked as columns
return pd.concat(
# return pd.concat(
Leaving a note to double check the commented-out code here
Ah, this is fine; it just comments out the previous code, which directly concats the dataframe, and replaces it with the new code, which calls it a second time if the custom dictionary is present.
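The control flow described in this reply could look roughly like the sketch below. This is an illustration only, not the code in this PR: `count_lexicons` and `liwc_features_sketch` are hypothetical names, the counting is a naive whitespace-token match standing in for the real counting step, and the `_custom` suffix follows the column naming given in the PR description.

```python
import pandas as pd


def count_lexicons(messages, lexicons, suffix=""):
    """Hypothetical stand-in: one count column per lexicon category."""
    columns = {}
    for name, words in lexicons.items():
        vocab = set(words)
        # Naive token match; the real feature uses its own matching rules.
        columns[f"{name}_lexical_wordcount{suffix}"] = messages.apply(
            lambda text, vocab=vocab: sum(tok in vocab for tok in text.lower().split())
        )
    return pd.DataFrame(columns)


def liwc_features_sketch(chat_df, message_col, builtin_lexicons, custom_lexicons=None):
    """Count the built-in lexicons, then count again if a custom dictionary is present."""
    frames = [count_lexicons(chat_df[message_col], builtin_lexicons)]
    if custom_lexicons:
        # Second pass; the suffix keeps custom columns from overwriting built-in ones.
        frames.append(count_lexicons(chat_df[message_col], custom_lexicons, suffix="_custom"))
    return pd.concat(frames, axis=1)


chat_df = pd.DataFrame({"message": ["happy happy day", "sad day"]})
features = liwc_features_sketch(
    chat_df, "message",
    builtin_lexicons={"positive": ["happy"]},
    custom_lexicons={"positive": ["day"]},
)
print(features.columns.tolist())
```

Because both passes share a category name in this example, the suffix is what keeps the two `positive` columns distinct after the concat.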
lexicon = lexicon.strip()
lexicon = lexicon.replace('(', '')
lexicon = lexicon.replace(')', '')
# get rid of parentheses; comment out to keep the emojis like :)
Check the commented out parentheses
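To make the trade-off behind this check concrete, here is a small standalone sketch (not the package's code) of how Python's `re` module treats a term like `:)`. It uses `re.escape` as a stand-in for adding backslashes by hand. The `\b`-wrapped pattern is only an assumption about how terms might be matched, and the lookaround version is one possible alternative rather than a confirmed fix.

```python
import re

text = "nice work :) see you :)"
term = ":)"

# Unescaped, ")" closes a group that was never opened, so the pattern fails to compile.
try:
    re.findall(term, text)
    compiles_unescaped = True
except re.error:
    compiles_unescaped = False

# Escaping lets the parenthesis match literally, so it does not have to be stripped.
escaped = re.escape(term)
plain_matches = re.findall(escaped, text)

# ASSUMPTION: if terms are wrapped in \b, an emoji between spaces cannot match,
# because \b needs a word character on one side and neither ":" nor ")" is one.
boundary_matches = re.findall(r"\b" + escaped + r"\b", text)

# One possible alternative: require whitespace or the string edge around the term.
lookaround_matches = re.findall(r"(?<!\S)" + escaped + r"(?!\S)", text)

print(compiles_unescaped, plain_matches, boundary_matches, lookaround_matches)
```

If the matching code does rely on `\b`, this would be consistent with the emoji test cases not behaving as expected even after the parentheses are escaped.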
@@ -501,6 +522,7 @@ def featurize(self) -> None:
print("All Done!")

# Store column names of what we generated, so that the user can easily access them
# TODO --- this needs to be updated if the user brings their own LIWC, because the custom LIWC features are not in `self.chat_features`.
@sundy1994 Can you please update the `chat_features` property of the FeatureBuilder object whenever the user generates custom LIWC features? Otherwise, those new column names will not show up when the user tries to check the names of the chat features.
Basic Info
What's this pull request about?
Added the bring-your-own-LIWC feature (#281) and fixed the matplotlib error during installation (#319).
BYOL
Now users can upload their own LIWC dictionary like this:
If the custom dictionary is provided, `liwc_features()` will call `get_liwc_count()` again on it. The new columns will be named 'xx_lexical_wordcount_custom' to avoid overwriting the existing 'xx_lexical_wordcount' columns from the pickled dictionary.
It's worth noting that the liwc_2015 dictionary contains emojis like ':)' and ':('. `re.findall()` will throw an unbalanced-parenthesis error for them, so I add a backslash before every single parenthesis in the dictionary terms while reading the custom dictionary.
matplotlib issue
"matplotlib>=3.0.0" is added to both
`requirements.txt` and `project.toml`. Now our package can be installed to an empty environment as before.