Added Hindi stopwords to NLTK stopwords corpus #238

Open: wants to merge 1 commit into base: gh-pages

Conversation

mridulchdry17

This PR adds a list of common Hindi stopwords to the corpora directory. These stopwords can be useful for preprocessing in NLP tasks involving Hindi text.
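As a rough illustration of the preprocessing use case, the sketch below filters whitespace-separated tokens against a stopword set. The set `HINDI_STOPWORDS` and the helper `remove_stopwords` are hypothetical names introduced here, and the eight words are a tiny hand-picked sample, not the list proposed in this PR.

```python
# Hypothetical, deliberately tiny stopword set for illustration only;
# it is NOT the list proposed in this PR.
HINDI_STOPWORDS = {"का", "की", "के", "को", "में", "से", "है", "और"}


def remove_stopwords(text, stopwords=HINDI_STOPWORDS):
    """Split on whitespace and drop tokens found in the stopword set."""
    return [token for token in text.split() if token not in stopwords]


# "राम बाजार में है" loosely means "Ram is in the market".
print(remove_stopwords("राम बाजार में है"))
```

NLTK's existing lists are loaded with calls such as `nltk.corpus.stopwords.words('english')`; whether and under what name a Hindi list would be exposed depends on how this PR is resolved.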

@mridulchdry17
Author

@tomaarsen @stevenbird
Kindly take a look at this when possible.
Thank you!

@ekaf
Member

ekaf commented Jun 9, 2025

Thank you for the contribution! To help move this PR forward, could you please provide additional context, such as the source or justification for the Hindi stopwords list and any validation or tests performed? Linking to related issues or feature requests would also be helpful.

@mridulchdry17
Author

I generated the initial Hindi stopwords list using ChatGPT, then manually reviewed and cross-checked it against trusted sources like the Indic NLP Library to ensure quality and relevance.

I’m happy to refine the list further or validate it more rigorously based on community feedback.

@ekaf
Member

ekaf commented Jun 10, 2025

To strengthen the PR, you might consider comparing the proposed list with stopwords identified through TF-IDF or other statistical approaches, to check its effectiveness and completeness.
For example, Gemini finds that your list overlaps only about one third with a typical Hindi stopwords list.
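One way such a comparison could be set up is sketched below, using only the standard library. It scores each word by inverse document frequency (the IDF half of TF-IDF), treats low-IDF words as stopword candidates, and measures how much of a reference list a proposed list covers. The function names, the `max_idf` threshold, and the three-sentence toy corpus are all assumptions made for illustration; a real check would need a large, varied Hindi corpus and proper tokenization.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return log(N / document_frequency) for every whitespace token."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document.
        doc_freq.update(set(doc.split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def candidate_stopwords(documents, max_idf=0.5):
    """Words appearing in most documents have low IDF; max_idf is an arbitrary cutoff."""
    return {w for w, score in idf_scores(documents).items() if score <= max_idf}


def overlap_ratio(proposed, reference):
    """Fraction of the reference list that the proposed list covers."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(set(proposed) & reference) / len(reference)


# Toy corpus, far too small for real conclusions.
toy_docs = ["राम बाजार में है", "सीता घर में है", "किताब मेज पर है"]
toy_candidates = candidate_stopwords(toy_docs)
print(sorted(toy_candidates))
print(overlap_ratio({"है", "और"}, toy_candidates))
```

Dividing by the size of the reference list makes the ratio read as coverage of the reference; a symmetric measure such as Jaccard similarity would be a reasonable alternative.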

@ekaf
Member

ekaf commented Jun 19, 2025

Hi @mridulchdry17,

Gemini reviewed your proposed Hindi stopword list and has a few questions about its scope and content, with the aim of making it as comprehensive and accurate as possible for general NLP use.

1. Justification for including certain words as stopwords:
Could you please clarify the rationale for including terms that typically carry significant semantic meaning, are numbers, or appear to be misspellings/non-standard? For example:

  • कहा (said) - A common verb.
  • दो (two) - A cardinal number.
  • हर (every) - A common quantifier.
  • अगेर (appears to be a misspelling of अगर - if).
  • जुका (seems like a non-standard or very uncommon word).

2. Justification for the absence of otherwise common stopwords:
Conversely, some very high-frequency, low-semantic-value words commonly found in Hindi texts seem to be missing from this list. Could you explain their exclusion? For example:

  • मैं (I) - A core first-person pronoun.
  • आज (today) - A common temporal adverb.
  • होगा (will be) - A very frequent auxiliary verb form.
  • किए (did/done) - A common verb form.
  • कहां (where) - A common interrogative/adverb.

Understanding these choices would help align the list with general NLP best practices for stopword removal, which typically focuses on words that are frequent but carry little unique semantic information across diverse contexts.

Thanks!
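The two questions above amount to a set difference in each direction, which can be sketched as follows. The helper `audit_stopwords` is a hypothetical name, and both sample sets are toy data assembled from the words quoted in this comment plus one shared word; they are not the actual proposed or reference lists.

```python
def audit_stopwords(proposed, reference):
    """Report words unique to each list so they can be reviewed by hand."""
    proposed, reference = set(proposed), set(reference)
    return {
        "only_in_proposed": sorted(proposed - reference),
        "missing_from_proposed": sorted(reference - proposed),
    }


# Toy samples for illustration only.
proposed_sample = {"कहा", "दो", "हर", "और"}
reference_sample = {"मैं", "आज", "होगा", "और"}

report = audit_stopwords(proposed_sample, reference_sample)
for label, words in report.items():
    print(label, words)
```

A report like this only surfaces candidates for review. Whether a word such as a numeral or a common verb belongs in the list remains a judgment call for the maintainers and contributors.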

@stevenbird stevenbird self-assigned this Jun 19, 2025

3 participants