The main aim of this repository is to extract keywords from medical transcriptions. The data comes from an open medical transcription dataset.
Preprocessing removes symbols, stopwords, empty spaces after commas, multiple spaces, etc. Basically, it keeps only the words, separated by a single space. The cleaned dataset is here
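The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `clean_text` helper, the lowercasing step, and the tiny `STOPWORDS` set are assumptions made for the example. In practice a fuller list such as NLTK's `stopwords.words('english')` would likely be used instead.

```python
import re

# Hypothetical, deliberately small stopword set for illustration only.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "with", "is", "was", "for"}


def clean_text(text: str) -> str:
    """Return the words of `text` separated by single spaces."""
    # Lowercasing is an assumption; the README does not state it.
    text = text.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments collapses runs of whitespace, so joining
    # the remaining tokens yields exactly one space between words.
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("The patient,  presented with   chest pain!"))
# patient presented chest pain
```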
- Pipeline: Vectorize the words -> TF-IDF Transformer -> OneVsRestClassifier with SGD Classifier
- Input:
['a sentence string', 'another sentence string']
- Output:
['keywords separated by a single space', 'more extracted keywords']
- Serialized model here
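The pipeline and its input/output contract could look something like the sketch below. It assumes that "vectorize the words" means scikit-learn's `CountVectorizer` and that each keyword is treated as one label in a multi-label problem, encoded with `MultiLabelBinarizer`. The toy training data and the `extract_keywords` helper are invented for illustration and are not taken from the repository.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy, made-up training data: cleaned transcriptions and their keywords.
texts = [
    "patient complains chest pain shortness breath",
    "knee pain after fall swelling knee",
    "chest pain radiating left arm",
    "swelling knee joint limited motion",
]
keywords = ["chest pain", "knee pain", "chest pain", "knee swelling"]

# Turn each space-separated keyword string into a binary label vector.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([k.split() for k in keywords])

pipeline = Pipeline([
    ("vect", CountVectorizer()),      # word counts
    ("tfidf", TfidfTransformer()),    # reweight counts by TF-IDF
    ("clf", OneVsRestClassifier(SGDClassifier(random_state=0))),  # one binary classifier per keyword
])
pipeline.fit(texts, y)


def extract_keywords(sentences):
    """Map a list of strings to a list of space-separated keyword strings."""
    predictions = pipeline.predict(sentences)
    return [" ".join(labels) for labels in mlb.inverse_transform(predictions)]


print(extract_keywords(["chest pain on exertion", "swelling of the knee"]))
```

The function takes a list of strings and returns a list of the same length, matching the input and output format shown above. With so little training data the predicted keywords are not meaningful; the point is only the shape of the pipeline.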
This project uses a number of open source projects to work properly:
- Datasets - Where the story begins!
- Sklearn - The most used machine learning framework
- NLTK - Linguistic library.
- Pandas, Keras, Numpy, and many others
MIT
Free Software, Hell Yeah!