Bring your own LIWC & matplotlib dependency fix #322

Open
wants to merge 8 commits into
base: dev

Conversation

sundy1994
Collaborator

@sundy1994 sundy1994 commented Oct 18, 2024

Basic Info

What's this pull request about?
Adds the bring-your-own-LIWC (BYOL) feature (#281) and fixes the matplotlib error during installation (#319).

BYOL

Now users can upload their own LIWC dictionary like this:

testing_chat = FeatureBuilder(
    input_df = chat_df,
    ...
    custom_liwc_dictionary_path = 'my_path/liwc_2015.dic'
)
testing_chat.featurize()

If a custom dictionary is provided, liwc_features() calls get_liwc_count() a second time on it. The new columns are named 'xx_lexical_wordcount_custom' so that they do not overwrite the existing 'xx_lexical_wordcount' columns generated from the pickled dictionary.
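As a rough illustration of this naming scheme, the sketch below merges built-in and custom counts for a single message without letting the custom columns overwrite the built-in ones. It uses plain dictionaries rather than DataFrames, and the helper name `merge_liwc_counts` and the category names are hypothetical, not part of the package.

```python
def merge_liwc_counts(builtin_counts, custom_counts):
    """Combine per-category counts, suffixing custom columns to avoid collisions."""
    merged = {f"{cat}_lexical_wordcount": n for cat, n in builtin_counts.items()}
    # Custom categories get a "_custom" suffix, so a category present in both
    # dictionaries yields two separate columns instead of one overwritten column.
    merged.update(
        {f"{cat}_lexical_wordcount_custom": n for cat, n in custom_counts.items()}
    )
    return merged


# Hypothetical counts for one message.
row = merge_liwc_counts({"positive_affect": 2}, {"positive_affect": 3, "netspeak": 1})
```

Here both `positive_affect_lexical_wordcount` and `positive_affect_lexical_wordcount_custom` survive in `row`.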

Note that the liwc_2015 dictionary contains emojis such as ':)' and ':('. re.findall() raises an unbalanced-parenthesis error when these terms are used as patterns, so I add a backslash before every unescaped parenthesis in the dictionary terms while reading the custom dictionary.
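A minimal sketch of the escaping step, assuming the terms are later used directly as regex patterns and contain no already-escaped parentheses. The helper name `escape_parentheses` and the sample terms are illustrative only and are not the code in this PR.

```python
import re


def escape_parentheses(term):
    """Backslash-escape literal parentheses so the term is a valid regex."""
    return term.replace("(", r"\(").replace(")", r"\)")


raw_terms = [":)", ":(", "happy"]
safe_terms = [escape_parentheses(t) for t in raw_terms]

# The unescaped emoji is not a valid pattern and raises re.error.
try:
    re.findall(":)", "nice work :)")
    raw_term_failed = False
except re.error:
    raw_term_failed = True

# The escaped emoji matches literally.
matches = re.findall(safe_terms[0], "nice work :) see you :)")
```

Escaping only the parentheses, rather than calling `re.escape()` on the whole term, leaves other characters such as a trailing `*` wildcard available for separate handling.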

matplotlib issue

"matplotlib>=3.0.0" is added to both requirements.txt and project.toml, so the package can once again be installed into an empty environment.

@xehu
Collaborator

xehu commented Oct 22, 2024

TODO (1): Please pass in a version of the text that retains key punctuation.

Currently, we are passing in self.message_col, but this is after the message column is preprocessed to remove capitalization and punctuation:

def lexical_features(self) -> None:
    """
    Implement lexical features.

    This driver function calls relevant functions to compute lexical features and appends them to the chat data.

    :return: None
    :rtype: None
    """
    self.chat_data = pd.concat([self.chat_data, liwc_features(self.chat_data, self.message_col, self.custom_liwc_dictionary)], axis=1)

We DO need to remove capitalization and SOME punctuation, but there are some lexicon words (e.g., emojis) that require punctuation to be retained.
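One possible shape for such a preprocessing step is sketched below: it lowercases the text and strips punctuation from ordinary tokens, but leaves whitespace-delimited emoticons intact. The function name `preprocess_keep_emoticons` and the `EMOTICONS` set are assumptions for illustration; in practice the set would presumably be derived from the dictionary itself.

```python
import re

# Hypothetical set of emoticons to protect from punctuation stripping.
EMOTICONS = {":)", ":(", ";)"}


def preprocess_keep_emoticons(text):
    """Lowercase text and strip punctuation, except for standalone emoticons."""
    tokens = []
    for token in text.lower().split():
        if token in EMOTICONS:
            tokens.append(token)  # keep the emoticon exactly as written
        else:
            cleaned = re.sub(r"[^\w\s]", "", token)  # drop punctuation
            if cleaned:
                tokens.append(cleaned)
    return " ".join(tokens)


cleaned_message = preprocess_keep_emoticons("Great job, team! :)")
```

This sketch only protects emoticons that stand alone between spaces; an emoticon attached to a word (for example "great:)") would still lose its punctuation.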

TODO (2): Please check emoji/word boundary rules and run tests on custom LIWC categories

Emily's sample testing code: test_byol.ipynb.zip

I ran a few simple test cases on the new LIWC, and I found that the emojis (perhaps due to word boundary rules) are not fully working as expected; emojis like :), :(, ;), etc., which are listed in the dictionary, do not appear to be showing up in the counts.
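If the matching code wraps each term in `\b` word boundaries (an assumption; this thread does not show how the pattern is built), the behaviour above is expected: `\b` only matches between a word character and a non-word character, and an emoji such as `:)` surrounded by spaces has neither. The sketch below contrasts `\b` with whitespace-based lookarounds as one possible alternative.

```python
import re

text = "that went well :) thanks"
term = re.escape(":)")

# \b needs a word character on one side; " :" and ") " are both
# non-word/non-word pairs, so the emoji is never matched.
with_word_boundaries = re.findall(r"\b" + term + r"\b", text)

# Lookarounds that require "no non-whitespace neighbour" treat the
# emoji as a standalone token.
with_whitespace_boundaries = re.findall(r"(?<!\S)" + term + r"(?!\S)", text)
```

With whitespace lookarounds, a term followed directly by punctuation (for example "happy,") would no longer match unless punctuation is stripped first, so the two boundary rules may need to be applied selectively.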

More generally, would it be possible to read in the dictionary and generate some test cases? For example, if a category consists of "happy, good, great," let's generate a sentence that samples a combination of words with replacement from the category, and words that are not in the category (e.g., "foo"):

happy happy foo foobar good good great

And we can therefore assert the expected value to be the number of category words we sampled: 5 (in this case).
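The idea could be sketched roughly as follows. `make_test_case` and `count_category_words` are hypothetical helpers; the counter is a simple stand-in for whole-token matching and is not the package's get_liwc_count().

```python
import random
import re


def make_test_case(category_words, filler_words, n_category, n_filler, seed=0):
    """Sample words with replacement and return (sentence, expected_count)."""
    rng = random.Random(seed)
    words = rng.choices(category_words, k=n_category)
    words += rng.choices(filler_words, k=n_filler)
    rng.shuffle(words)
    return " ".join(words), n_category


def count_category_words(text, category_words):
    """Stand-in counter: whole-token matches of any category word."""
    pattern = "|".join(re.escape(w) for w in category_words)
    return len(re.findall(r"(?<!\S)(?:" + pattern + r")(?!\S)", text))


category = ["happy", "good", "great"]
sentence, expected = make_test_case(category, ["foo", "foobar"], n_category=5, n_filler=2)
observed = count_category_words(sentence, category)
```

In a real test, the stand-in counter would be replaced by the feature under test, with one generated sentence per category in both the old and new dictionaries.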

It should be relatively easy to do this locally with the dictionary, and confirm that we pass all tests for both the old and new dictionaries.

Additionally, let's try variations with punctuation, just to confirm that the new parentheses rules aren't breaking anything. I think it's important to just have this sanity check that everything works as we expect!

Collaborator

@xehu xehu left a comment

Please make the following changes:

  1. Add the custom LIWC feature names to self.chat_features in the FeatureBuilder;
  2. Update the version of the text passed into the LIWC feature to be preprocessed WITH (some) punctuation;
  3. Add tests to confirm that all the features work as expected, especially with edge cases (e.g., emojis, other punctuation).

Thanks so much for your amazing progress and persistence on this!!

@@ -47,12 +48,19 @@ def liwc_features(chat_df: pd.DataFrame, message_col) -> pd.DataFrame:
lexicons_dict = pickle.load(lexicons_pickle_file)

# Return the lexical features stacked as columns
return pd.concat(
# return pd.concat(
Collaborator

Leaving a note to double check the commented-out code here

Collaborator

Ah, this is fine; it just comments out the previous code, which concatenated the DataFrames directly, and replaces it with new code that runs the count a second time if the custom dictionary is present.

lexicon = lexicon.strip()
lexicon = lexicon.replace('(', '')
lexicon = lexicon.replace(')', '')
# get rid of parentheses; comment out to keep the emojis like :)
Collaborator

Check the commented-out parentheses handling

@@ -501,6 +522,7 @@ def featurize(self) -> None:
print("All Done!")

# Store column names of what we generated, so that the user can easily access them
# TODO --- this needs to be updated if the user brings their own LIWC, because the custom LIWC features are not in `self.chat_features`.
Collaborator

@sundy1994 Can you please update the chat_features property of the FeatureBuilder object whenever the user generates custom LIWC features? Otherwise, those new column names will not show up when the user tries to check the names of the chat features.
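A minimal sketch of what that update might look like, assuming chat_features is a list of column names (the thread does not confirm its type). The helper `extend_feature_names` and the sample column names are hypothetical.

```python
def extend_feature_names(chat_features, custom_columns):
    """Return chat_features plus any custom LIWC columns not already listed."""
    seen = set(chat_features)
    extended = list(chat_features)  # copy so the original list is not mutated
    for name in custom_columns:
        if name not in seen:
            extended.append(name)
            seen.add(name)
    return extended


# Hypothetical column names for illustration.
existing = ["positive_affect_lexical_wordcount"]
custom = ["positive_affect_lexical_wordcount_custom", "netspeak_lexical_wordcount_custom"]
updated = extend_feature_names(existing, custom)
```

Skipping names that are already present keeps repeated calls from adding duplicate entries.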

@sundy1994 sundy1994 requested a review from xehu October 29, 2024 23:52
2 participants