Refine the text cleaning before embedding the documents in the RAG pipeline

## Overview
Our [current text cleaning method](https://github.com/vmware/versatile-data-kit/blob/main/examples/embed-ingest-job-example/20_clean_and_embed_json_data.py#L19) converts the text to lower case, removes punctuation, lemmatizes, and removes stop words. As discussed [here](https://github.com/vmware/versatile-data-kit/pull/3085#issuecomment-1929483614), transformer models (in our case SentenceTransformer) don't require such extensive preprocessing; it is even recommended to skip it, because it can remove context the model would otherwise use.

**Suggested solution**
Drop the lemmatization and stop words removal from the cleaning.
Check whether the lower case conversion is already done by default by [the transformer model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) we are using.
The cleaning step is something you would expect to have in a pipeline, so we need to figure out how to handle it properly.
Decide on what text cleaning logic might be relevant and add it.
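
One way to approach the lower-casing question is to compare how the tokenizer handles a string and its lower-cased form. The sketch below is only an illustration: `is_case_insensitive`, the two stand-in tokenizers, and `SAMPLES` are hypothetical names, not part of the existing job. With the real model, the `tokenize` method of a Hugging Face tokenizer loaded via `AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")` could be passed in instead, assuming `transformers` is installed.

```python
from typing import Callable, Iterable, List


def is_case_insensitive(
    tokenize: Callable[[str], List[str]], samples: Iterable[str]
) -> bool:
    """Return True if tokenizing each sample gives the same tokens as
    tokenizing its lower-cased form, i.e. the tokenizer ignores case."""
    return all(tokenize(s) == tokenize(s.lower()) for s in samples)


# Stand-in tokenizers (hypothetical) so the check can run without a model.
def lowercasing_tokenizer(text: str) -> List[str]:
    return text.lower().split()


def cased_tokenizer(text: str) -> List[str]:
    return text.split()


SAMPLES = ["VMware Versatile Data Kit", "RAG Pipeline"]

if __name__ == "__main__":
    print(is_case_insensitive(lowercasing_tokenizer, SAMPLES))
    print(is_case_insensitive(cased_tokenizer, SAMPLES))
```

If the check passes for the real tokenizer on representative samples, the explicit lower-casing step in the cleaning code is likely redundant and can be dropped.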

## Acceptance criteria
Remove the extensive NLP preprocessing (lemmatization and stop words removal).
Add relevant text cleaning logic.
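
As a starting point for the discussion, a lighter cleaning step could look like the sketch below. It is an assumption about what "relevant" cleaning might mean, not an agreed design: the function name `light_clean` and the choice of steps (HTML entity decoding, tag stripping, Unicode normalization, control-character removal, whitespace collapsing) are hypothetical. It deliberately keeps case, punctuation, and stop words, and uses only the Python standard library.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")   # naive HTML/XML tag matcher
_WS_RE = re.compile(r"\s+")        # any run of whitespace


def light_clean(text: str) -> str:
    """Remove markup and noise while keeping case, punctuation and stop words."""
    # Decode entities such as &amp; or &nbsp; into plain characters.
    text = html.unescape(text)
    # Replace tags with a space so adjacent words do not get glued together.
    text = _TAG_RE.sub(" ", text)
    # Normalize compatibility characters (e.g. non-breaking space -> space).
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters, but keep newlines and tabs for now.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse all whitespace runs into single spaces.
    return _WS_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(light_clean("<p>Hello,&nbsp;World!</p>\n\n  The cats are running. "))
```

The tag pattern is intentionally simple and would also strip literal text such as `a <b and c> d`, so a proper HTML parser may be a better fit if the source documents contain such content.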


Refine the text cleaning before embedding the documents in the RAG pipeline #3089
