Added Hindi stopwords to NLTK stopwords corpus #238

mridulchdry17 · 2025-04-21T17:05:54Z

This PR adds a list of common Hindi stopwords to the corpora directory. These stopwords can be useful for preprocessing in NLP tasks involving Hindi text.
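
As a rough illustration of the preprocessing step mentioned above, here is a minimal sketch of stopword filtering in plain Python. The `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are hypothetical names used only for this example. The six words are a handful of common Hindi function words, not the list proposed in this PR.

```python
# Illustrative subset only: a few common Hindi function words,
# NOT the stopword list proposed in this PR.
SAMPLE_STOPWORDS = {"का", "के", "की", "है", "में", "और"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Split on whitespace and drop tokens found in the stopword set."""
    return [token for token in text.split() if token not in stopwords]


# "भारत की राजधानी दिल्ली है" = "Delhi is the capital of India"
tokens = remove_stopwords("भारत की राजधानी दिल्ली है")
print(tokens)
```

If the list were merged, it would presumably be read through `nltk.corpus.stopwords.words()` under whatever file ID the corpus assigns. That ID is not stated in this thread.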

mridulchdry17 · 2025-05-19T12:13:12Z

@tomaarsen @stevenbird
Kindly take a look at this when possible.
Thank you!

ekaf · 2025-06-09T05:34:29Z

Thank you for the contribution! To help move this PR forward, could you please provide additional context, such as the source or justification for the Hindi stopwords list and any validation or tests performed? Linking to related issues or feature requests would also be helpful.

mridulchdry17 · 2025-06-09T05:46:57Z

I generated the initial Hindi stopwords list using ChatGPT, then manually reviewed and cross-checked it against trusted sources like the Indic NLP Library to ensure quality and relevance.

I’m happy to further refine the list or validate it more rigorously based on community feedback.

ekaf · 2025-06-10T09:33:41Z

To strengthen the PR, you might consider comparing the proposed list with stopwords identified through statistical methods such as TF-IDF, to check its effectiveness and completeness.
For example, Gemini finds that your list overlaps with a typical Hindi stopwords list by only about one third.
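
One way such a comparison could be run is sketched below, using only the standard library. The function names `overlap_report` and `document_frequency_candidates` are hypothetical, and document frequency is used here as a simple stand-in for a statistical method. It is not TF-IDF itself, though words with high document frequency are the ones TF-IDF weights lowest. The sample lists and corpus are placeholders, not the PR's data.

```python
from collections import Counter


def overlap_report(proposed, reference):
    """Compare a proposed stopword list against a reference list."""
    proposed, reference = set(proposed), set(reference)
    shared = proposed & reference
    union = proposed | reference
    return {
        "shared": len(shared),
        # Fraction of the reference list that the proposed list covers.
        "coverage_of_reference": len(shared) / len(reference) if reference else 0.0,
        # Jaccard similarity: shared words over all distinct words.
        "jaccard": len(shared) / len(union) if union else 0.0,
        "only_in_proposed": sorted(proposed - reference),
        "missing_from_proposed": sorted(reference - proposed),
    }


def document_frequency_candidates(documents, top_n):
    """Return the top_n tokens ranked by the number of documents containing them."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))  # count each token once per document
    return [word for word, _ in df.most_common(top_n)]


# Placeholder data for illustration only.
report = overlap_report(["का", "है", "दो"], ["का", "है", "मैं", "में"])
print(report["coverage_of_reference"], report["jaccard"])
```

In practice the `documents` argument would be a sizeable Hindi corpus, and the resulting candidates could serve as the reference list passed to `overlap_report`.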

ekaf · 2025-06-19T03:31:32Z

Hi @mridulchdry17,

Gemini reviewed your proposed Hindi stopword list and has a few questions about its scope and content, with the aim of making it as comprehensive and accurate as possible for general NLP use.

1. Justification for including certain words as stopwords:
Could you please clarify the rationale for including terms that typically carry significant semantic meaning, are numbers, or appear to be misspellings/non-standard? For example:

कहा (said) - A common verb.
दो (two) - A cardinal number.
हर (every) - A common quantifier.
अगेर (appears to be a misspelling of अगर - if).
जुका (seems like a non-standard or very uncommon word).

2. Justification for the absence of otherwise common stopwords:
Conversely, some very high-frequency, low-semantic-value words commonly found in Hindi texts seem to be missing from this list. Could you explain their exclusion? For example:

मैं (I) - A core first-person pronoun.
आज (today) - A common temporal adverb.
होगा (will be) - A very frequent auxiliary verb form.
किए (did/done) - A common verb form.
कहां (where) - A common interrogative/adverb.

Understanding these choices would help align the list with general NLP best practices for stopword removal, which typically focuses on words that are frequent but carry little unique semantic information across diverse contexts.
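
The two sets of examples above could be turned into a quick check against any candidate list, as in this minimal sketch. The `audit` function and the sample input are hypothetical. The two word lists are copied from this comment and are not an exhaustive standard.

```python
# Words quoted in this review comment.
QUESTIONED_INCLUSIONS = ["कहा", "दो", "हर", "अगेर", "जुका"]
EXPECTED_STOPWORDS = ["मैं", "आज", "होगा", "किए", "कहां"]


def audit(stopwords):
    """Report questioned words that are present and expected words that are absent."""
    words = {w.strip() for w in stopwords}
    return {
        "questioned_present": [w for w in QUESTIONED_INCLUSIONS if w in words],
        "expected_absent": [w for w in EXPECTED_STOPWORDS if w not in words],
    }


# Hypothetical candidate list, for illustration only.
result = audit(["का", "दो", "हर", "मैं"])
print(result)
```

An empty `questioned_present` and an empty `expected_absent` would mean the candidate list resolves both sets of examples. It would not by itself show that the list is complete.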

Thanks!

Addded Hindi stopwords to NLTK stopwords corpus

5f9e2cc

mridulchdry17 mentioned this pull request Apr 21, 2025

Hindi Stopwords Missing #223

Open

stevenbird self-assigned this Jun 19, 2025
