
Deficiencies in the Mahābhārata Dataset #6

@VedantMadane


You have done commendable work in curating this repo.

However, a deficiency in the original sacred-texts.com Sanskrit has inadvertently crept in here as well.
The error is in the last letter of each line (ślokārdha): the final halanta (्) is not represented.
नरॊत्तमम should be नरॊत्तमम्
उदीरयेत should be उदीरयेत्
The ् is missing everywhere.

We have two ways to remedy this problem:

  1. Look at the last character of each line and, if the line ends in a bare consonant (i.e. reads as a-kārānta), append the halanta ्. Exceptions: words that are genuinely a-kārānta, such as मम, च, etc.
  2. Use an alternative data source such as https://bombay.indology.info/mahabharata/text/UD/MBh01.txt and others.
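
A minimal sketch of the first option, in Python. It assumes each line is a Unicode Devanagari string that may end with whitespace, daṇḍas (। ॥) or verse numbers. The names `add_final_virama` and `A_KARANTA_EXCEPTIONS` are hypothetical, and the exception set only contains the two words mentioned above. A real run would need a much longer, reviewed list, since many lines legitimately end in -a (vocatives such as भारत, for example), so the output should be checked by hand or against a second source.

```python
import re

VIRAMA = "\u094D"  # Devanagari sign virama (halanta)

# Hypothetical allow-list of words that really do end in -a.
# Only the two examples from this issue are included; extend before use.
A_KARANTA_EXCEPTIONS = {"मम", "च"}

# Trailing material to skip: whitespace, danda, double danda,
# ASCII digits and Devanagari digits (assumed verse numbering).
_TRAILER = re.compile(r"[\s\u0964\u09650-9\u0966-\u096F]*$")


def is_consonant(ch: str) -> bool:
    """True for Devanagari consonant letters (ka..ha and the nukta forms)."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def add_final_virama(line: str) -> str:
    """Append a virama if the line's text ends in a bare consonant."""
    m = _TRAILER.search(line)
    body, trailer = line[:m.start()], line[m.start():]
    # Lines ending in a vowel sign, virama, anusvara, etc. are left alone.
    if not body or not is_consonant(body[-1]):
        return line
    # Skip words known to be genuinely a-karanta.
    if body.split()[-1] in A_KARANTA_EXCEPTIONS:
        return line
    return body + VIRAMA + trailer


for sample in ["नरॊत्तमम", "उदीरयेत ॥", "मम"]:
    print(sample, "->", add_final_virama(sample))
```

Keeping the exceptions in a plain set makes it easy to grow the list as false positives turn up during review.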

If you could upload to the repo the programs you used for scraping the websites, scanning Sanskrit text along with accent markers and OCR, turning PDFs into JSON, etc., I can create a pull request with the additional data sources.
