
Commit 18caef8

Revert "Document sampling and fix local run" (microsoft#670)
Reverts microsoft#538. Temporarily reverts the merge until the prerelease branch is merged into dev (microsoft#556).
1 parent 6cd00fa commit 18caef8

12 files changed: +23, -154 lines


01_index.py

Lines changed: 1 addition & 5 deletions
@@ -1,6 +1,5 @@
 import json
 import argparse
-import sys
 
 from rag_experiment_accelerator.run.index import run
 from rag_experiment_accelerator.config.config import Config
@@ -18,12 +17,9 @@
 environment = Environment.from_env_or_keyvault()
 config = Config(environment, args.config_path, args.data_dir)
 
-# Are we running locally and not in AML? We do not want to run sampling on the distributed compute at this stage
-is_local = "01_index.py" in str(sys.argv[0])
-
 file_paths = get_all_file_paths(config.data_dir)
 for index_config in config.index_configs():
-    index_dict = run(environment, config, index_config, file_paths, is_local)
+    index_dict = run(environment, config, index_config, file_paths)
 
 with open(config.GENERATED_INDEX_NAMES_FILE_PATH, "w") as index_name:
     json.dump(index_dict, index_name, indent=4)
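For context, the local-run check that this commit removes from `01_index.py` keyed off the script path. A minimal standalone illustration (not the accelerator's code) of how that removed line behaved:

```python
import sys

# Illustration of the check removed by this revert: when the script is launched
# directly (python 01_index.py), sys.argv[0] contains the script name; an AML
# pipeline step presumably invokes it through a different entry point.
is_local = "01_index.py" in str(sys.argv[0])
print(f"Running locally: {is_local}")
```

After the revert, `run()` no longer receives this flag, so sampling is gated on configuration alone (see `rag_experiment_accelerator/run/index.py` below).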

README.md

Lines changed: 3 additions & 63 deletions
@@ -51,8 +51,6 @@ The custom loader resorts to the simpler 'prebuilt-layout' API model as a fallba
 
 1. **Multi-Lingual**: The tool supports language analyzers for linguistic support on individual languages and specialized (language-agnostic) analyzers for user-defined patterns on search indexes. For more information, see [Types of Analyzers](https://learn.microsoft.com/en-us/azure/search/search-analyzers#types-of-analyzers).
 
-1. **Sampling**: If you have a large dataset and/or want to speed up the experimentation, a sampling process is available to create a small but representative sample of the data for the percentage specified. The data will be clustered by content and a percentage of each cluster will be selected as part of the sample. Results obtained should be roughly indicative of the full dataset within a ~10% margin. Once an approach has been identified, running on the full dataset is recommended for accurate results.
-
 ## Products used
 
 - [Azure AI Search Service](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) (Note: [Semantic Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-semantic?tabs=dotnet) is available in Azure AI Search, at Basic tier or higher.)
@@ -187,21 +185,21 @@ az deployment sub create --location uksouth --template-file infra/main.bicep \
 
 ## How to use
 
-To use the **RAG Experiment Accelerator** locally, follow these steps:
+To use the **RAG Experiment Accelerator**, follow these steps:
 
 1. Copy the provided `config.sample.json` file to a file named `config.json` and change any hyperparameters to tailor to your experiment.
 2. Run `01_index.py` (python 01_index.py) to create Azure AI Search indexes and load data into them.
 ```bash
 python 01_index.py
 -d "The directory holding the configuration files and data. Defaults to current working directory"
---data_dir "The directory holding the data. Defaults to data"
+-dd "The directory holding the data. Defaults to data"
 -cf "JSON config filename. Defaults to config.json"
 ```
 3. Run `02_qa_generation.py` (python 02_qa_generation.py) to generate question-answer pairs using Azure OpenAI.
 ```bash
 python 02_qa_generation.py
 -d "The directory holding the configuration files and data. Defaults to current working directory"
---data_dir "The directory holding the data. Defaults to data"
+-dd "The directory holding the data. Defaults to data"
 -cf "JSON config filename. Defaults to config.json"
 ```
 4. Run `03_querying.py` (python 03_querying.py) to query Azure AI Search to generate context, re-rank items in context, and get response from Azure OpenAI using the new context.
@@ -219,63 +217,6 @@ To use the **RAG Experiment Accelerator** locally, follow these steps:
 
 Alternatively, you can run the above steps (apart from `02_qa_generation.py`) using an Azure ML pipeline. To do so, follow [the guide here](./docs/azureml-pipeline.md).
 
-### Running with sampling
-
-Sampling will be run locally to create a small but representative slice of the data. This helps with rapid experimentation and keeps costs down. Results obtained should be roughly indicative of the full dataset within a ~10% margin. Once an approach has been identified, running on the full dataset is recommended for accurate results.
-
-**Note**: Sampling can only be run locally, at this stage it is not supported on a distributed AML compute cluster. So the process would be to run sampling locally and then use the generated sample dataset to run on AML.
-
-If you have a very large dataset and want to run a similar approach to sample the data, you can use the pyspark in-memory distributed implementation in the [Data Discovery Toolkit](https://github.com/microsoft/Data-Discovery-Toolkit) for [Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview) or [Azure Synapse Analytics](https://learn.microsoft.com/en-gb/azure/synapse-analytics/).
-
-#### Available sampling parameters in the config.json file
-
-```json
-"sampling": {
-    "sample_data": "Set to true to enable sampling",
-    "only_run_sampling": "If set to true, this will only run the sampling step and will not create an index or any subsequent steps, use this if you want to build a small sampled dataset to run in AML",
-    "sample_percentage": "Percentage of the document corpus to sample",
-    "optimum_k": "Set to 'auto' to automatically determine the optimum cluster number or set to a specific value e.g. 15",
-    "min_cluster": "Used by the automated optimum cluster process, this is the minimum number of clusters e.g. 2",
-    "max_cluster": "Used by the automated optimum cluster process, this is the maximum number of clusters e.g. 30",
-},
-```
-
-
-The sampling process will produce the following artifacts in the sampling directory:
-
-1. A directory named after the config value ```job_name``` containing the subset of files sampled, these can be specified as ```--data_dir``` argument when running the entire process on AML.
-2. A 2 dimensional scatter plot of the clustered files (by content) selected as the sampled dataset in the sampling folder.
-![images/all_cluster_predictions_cluster_number_5.jpg](images/all_cluster_predictions_cluster_number_5.jpg)
-3. A .cvs file of the entire dataset with cluster predictions named "all_cluster_predictions..." and a cvs file with the sampled cluster predictions named "sampled_cluster_predictions...". This can be used for further enriching the dataset, for example, creating a meaningful label per cluster and updates all record. See the [Heuristics classifier in the Data Discovery Toolkit as an example](https://github.com/microsoft/Data-Discovery-Toolkit/blob/main/walkthroughs/heuristics/standalone_text_heuristics.ipynb) or [Pixplotml for image data](https://github.com/microsoft/Data-Discovery-Toolkit?tab=readme-ov-file#using-pixplotml-to-rapidly-visualise-and-label-data-for-training).
-4. If the ```"optimum_k": auto``` config value is set to auto, the sampling process will attempt to set the optimum number of clusters automatically. This can be overridden if you know roughly how many broad buckets of content exist in your data. An elbow graph will be generated in the sampling folder.
-![Optimum k elbow graph](images/elbow_5.png)
-
-Two options exist for running sampling, namely:
-
-1. Run the entire process locally with sampling, including the index generation and subsequent steps
-2. Run only the sampling locally and then use the created sampled dataset to execute on AML
-
-#### Run the entire process locally
-
-Set the following values to run the indexing process locally:
-
-```json
-"sampling": {
-    "sample_data": true,
-    "only_run_sampling": false,
-    "sample_percentage": 10,
-    "optimum_k": auto,
-    "min_cluster": 2,
-    "max_cluster": 30
-},
-```
-
-#### Run only the sampling locally and the subsequent steps on AML
-
-If ```only_run_sampling```config value is set to true, this will only run the sampling step, no index will be created and any other subsequent steps will not executed. Set the ```--data_dir``` argument to directory created by the sampling process which will be:
-
-```artifacts/sampling/config.[job_name]``` and execute the [AML pipeline step.](docs/azureml-pipeline.md)
-
 # Description of configuration elements
 
 ```json
@@ -286,7 +227,6 @@ If ```only_run_sampling```config value is set to true, this will only run the sa
 "job_description": "You may provide a description for the current job run which describes in words what you are about to experiment with",
 "sampling": {
     "sample_data": "Set to true to enable sampling",
-    "only_run_sampling": "If set to true, this will only run the sampling step and will not create an index or any subsequent steps, use this if you want to build a small sampled dataset to run in AML",
     "sample_percentage": "Percentage of the document corpus to sample",
     "optimum_k": "Set to 'auto' to automatically determine the optimum cluster number or set to a specific value e.g. 15",
     "min_cluster": "Used by the automated optimum cluster process, this is the minimum number of clusters e.g. 2",

config.sample.json

Lines changed: 0 additions & 8 deletions
@@ -4,14 +4,6 @@
     "job_name": "",
     "job_description": "",
     "preprocess": false,
-    "sampling": {
-        "sample_data": true,
-        "only_run_sampling": true,
-        "sample_percentage": 5,
-        "optimum_k": "auto",
-        "min_cluster": 2,
-        "max_cluster": 30
-    },
     "chunking": {
         "chunk_size": [1000],
         "overlap_size": [200],
Binary file not shown.

images/elbow_5.png

-36.4 KB
Binary file not shown.

rag_experiment_accelerator/config/config.py

Lines changed: 1 addition & 2 deletions
@@ -91,7 +91,7 @@ def __init__(
         self.EF_CONSTRUCTIONS = config_json["ef_construction"]
         self.EF_SEARCHES = config_json["ef_search"]
         self.INDEX_NAME_PREFIX = config_json["index_name_prefix"]
-        self.EXPERIMENT_NAME = self.INDEX_NAME_PREFIX
+        self.EXPERIMENT_NAME = config_json["experiment_name"] or self.INDEX_NAME_PREFIX
         self.JOB_NAME = config_json["job_name"]
         self.JOB_DESCRIPTION = config_json["job_description"]
         self.SEARCH_VARIANTS = config_json["search_types"]
@@ -157,7 +157,6 @@ def __init__(
         self.SAMPLE_OPTIMUM_K = config_json["sampling"]["optimum_k"]
         self.SAMPLE_MIN_CLUSTER = config_json["sampling"]["min_cluster"]
         self.SAMPLE_MAX_CLUSTER = config_json["sampling"]["max_cluster"]
-        self.ONLY_RUN_SAMPLING = config_json["sampling"]["only_run_sampling"]
 
         # log all the configuration settings in debug mode
         for key, value in config_json.items():
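The one-line change in `Config.__init__` restores a fallback: because `or` returns its right-hand operand when the left side is falsy, an empty `experiment_name` falls back to the index name prefix. A minimal sketch in plain Python (not the accelerator's class) of the behaviour being restored:

```python
# Minimal sketch of the restored fallback: an empty "experiment_name" string is
# falsy, so the experiment name falls back to the index name prefix.
config_json = {"index_name_prefix": "idx-prefix", "experiment_name": ""}
experiment_name = config_json["experiment_name"] or config_json["index_name_prefix"]
print(experiment_name)  # -> "idx-prefix"

config_json["experiment_name"] = "my-experiment"
experiment_name = config_json["experiment_name"] or config_json["index_name_prefix"]
print(experiment_name)  # -> "my-experiment"
```

Note that `config_json["experiment_name"]` assumes the key is present; a missing key would raise `KeyError` rather than falling back.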

rag_experiment_accelerator/config/tests/data/config.json

Lines changed: 0 additions & 1 deletion
@@ -7,7 +7,6 @@
     "preprocess": false,
     "sampling": {
         "sample_data": false,
-        "only_run_sampling": false,
         "sample_percentage": 5,
         "optimum_k": "auto",
         "min_cluster": 2,

rag_experiment_accelerator/config/tests/test_config.py

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ def test_config_init(mock_create_embedding_model):
     config.embedding_models = [embedding_model_1, embedding_model_2]
 
     assert config.INDEX_NAME_PREFIX == mock_config_data["index_name_prefix"]
-    assert config.EXPERIMENT_NAME == mock_config_data["index_name_prefix"]
+    assert config.EXPERIMENT_NAME == mock_config_data["experiment_name"]
     assert config.CHUNK_SIZES == mock_config_data["chunking"]["chunk_size"]
    assert config.OVERLAP_SIZES == mock_config_data["chunking"]["overlap_size"]
     assert config.CHUNKING_STRATEGY == mock_config_data["chunking_strategy"]

rag_experiment_accelerator/run/index.py

Lines changed: 1 addition & 6 deletions
@@ -30,7 +30,6 @@ def run(
     config: Config,
     index_config: IndexConfig,
     file_paths: list[str],
-    is_local: bool = False,
 ) -> dict[str]:
     """
     Runs the main experiment loop, which chunks and uploads data to Azure AI Search indexes based on the configuration specified in the Config class.
@@ -65,14 +64,10 @@
         config.AZURE_DOCUMENT_INTELLIGENCE_MODEL,
     )
 
-    if is_local and config.SAMPLE_DATA:
+    if config.SAMPLE_DATA:
         parser = load_parser()
         docs = cluster(docs, config, parser)
 
-        # If run with "ONLY_RUN_SAMPLING" we exit here after creating the sampled dataset for running in AML
-        if config.ONLY_RUN_SAMPLING:
-            return index_dict
-
     docs_ready_to_index = convert_docs_to_vector_db_records(docs)
     embed_chunks(index_config, pre_process, docs_ready_to_index)
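The net effect in `run()` is that sampling is now gated only on `config.SAMPLE_DATA`, with no `is_local` flag and no early `ONLY_RUN_SAMPLING` exit. A rough, self-contained sketch of that control flow using stand-in stubs (the stubs are not the accelerator's implementations):

```python
from types import SimpleNamespace

# Stand-in stubs, only to make the control flow runnable in isolation.
def load_parser():
    return "stub-spacy-parser"

def cluster(docs, config, parser):
    # Pretend to keep roughly 10% of the documents, as the sampler would.
    return docs[: max(1, len(docs) // 10)]

config = SimpleNamespace(SAMPLE_DATA=True)
docs = [f"doc-{i}" for i in range(20)]

if config.SAMPLE_DATA:  # post-revert: no is_local check
    parser = load_parser()
    docs = cluster(docs, config, parser)

# post-revert: no ONLY_RUN_SAMPLING early return; indexing always continues
print(f"{len(docs)} documents continue to chunk conversion and embedding")
```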

rag_experiment_accelerator/sampling/clustering.py

Lines changed: 11 additions & 47 deletions
@@ -1,4 +1,3 @@
-import os
 import warnings
 import numpy as np
 import matplotlib
@@ -11,7 +10,6 @@
 from umap import UMAP
 from scipy.spatial.distance import cdist
 from rag_experiment_accelerator.utils.logging import get_logger
-import shutil
 
 matplotlib.use("Agg")
 plt.style.use("ggplot")
@@ -47,17 +45,14 @@ def spacy_tokenizer(sentence, parser):
         str: The tokenized sentence.
 
     """
-
-    if not isinstance(sentence, str):
-        sentence = sentence["content"]
-
-    tokens = [
+    mytokens = parser(sentence)
+    mytokens = [
         word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_
-        for word in parser(sentence)
+        for word in mytokens
         if not word.is_stop and not word.is_punct
     ]
-    tokenized_sentence = " ".join([token for token in tokens])
-    return tokenized_sentence
+    mytokens = " ".join([i for i in mytokens])
+    return mytokens
 
 
 def determine_optimum_k_elbow(embeddings_2d, X, min_cluster, max_cluster, result_dir):
@@ -177,20 +172,18 @@ def chunk_dict_to_dataframe(all_chunks):
         all_chunks (list[dict]): A list of dictionaries where each dictionary contains a chunk and its corresponding text.
 
     Returns:
-        df (pandas.DataFrame): A DataFrame with three columns - 'chunk', 'text' and 'filename, where 'chunk' contains the chunks and 'text' contains the corresponding text and 'filename' the file name.
+        df (pandas.DataFrame): A DataFrame with two columns - 'chunk' and 'text', where 'chunk' contains the chunks and 'text' contains the corresponding text.
     """
 
     chunks = []
     text = []
-    filename = []
 
     for row in all_chunks:
         key, value = list(row.items())[0]
         chunks.append(key)
         text.append(value)
-        filename.append(value["metadata"]["source"])
 
-    df = pd.DataFrame({"chunk": chunks, "text": text, "filename": filename})
+    df = pd.DataFrame({"chunk": chunks, "text": text})
 
     return df
 
@@ -214,7 +207,6 @@ def cluster_kmeans(embeddings_2d, optimum_k, df, result_dir):
         - chunk (list): Chunk data from the DataFrame.
         - prediction (list): Cluster labels assigned by K-means.
         - prediction_values (list): Unique cluster labels.
-        - filenames (list): File names of the sampled data.
 
     """
     logger.info("Clustering chunks")
@@ -228,19 +220,15 @@
     )
 
     # Save
-    filenames = (
-        x
-    ) = y = text = processed_text = chunk = prediction = prediction_values = []
     x = embeddings_2d[:, 0].tolist()
     y = embeddings_2d[:, 1].tolist()
     text = df["text"].tolist()
     processed_text = df["processed_text"].tolist()
    chunk = df["chunk"].tolist()
     prediction = kmeans.labels_.tolist()
     prediction_values = list(set(kmeans.labels_.tolist()))
-    filenames = list(set(df["filename"].tolist()))
 
-    return x, y, text, processed_text, chunk, prediction, prediction_values, filenames
+    return x, y, text, processed_text, chunk, prediction, prediction_values
 
 
 def cluster(all_chunks, config, parser):
@@ -286,16 +274,9 @@ def cluster(all_chunks, config, parser):
     optimum_k = config.SAMPLE_OPTIMUM_K
 
     # Cluster
-    (
-        x,
-        y,
-        text,
-        processed_text,
-        chunk,
-        prediction,
-        prediction_values,
-        filenames,
-    ) = cluster_kmeans(embeddings_2d, optimum_k, df, config.sampling_output_dir)
+    x, y, text, processed_text, chunk, prediction, prediction_values = cluster_kmeans(
+        embeddings_2d, optimum_k, df, config.sampling_output_dir
+    )
 
     # Capture all predictions
     data = {"x": x, "y": y, "text": text, "prediction": prediction, "chunk": chunk}
@@ -333,21 +314,4 @@
     sampled_chunks = dataframe_to_chunk_dict(df_concat)
     logger.info(f"Sampled Document chunk length {len(sampled_chunks)}")
 
-    # Preserve the sampled files into directory
-    for filename in filenames:
-        try:
-            fn = os.path.basename(filename)
-            os.makedirs(
-                config.sampling_output_dir + "/" + config.JOB_NAME, exist_ok=True
-            )
-            shutil.copy2(
-                filename, config.sampling_output_dir + "/" + config.JOB_NAME + "/" + fn
-            )
-        except OSError as e:
-            logger.info(f"file {filename} could not be copied with metadata {e}")
-            continue
-    logger.info(
-        f"Sampled Documents have been copied to {config.sampling_output_dir + '/' + config.JOB_NAME + '/'}"
-    )
-
     return sampled_chunks
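For reference, assembling the `+` lines of the tokenizer hunk above gives the restored function (docstring omitted). The usage lines and the `en_core_web_sm` model name are illustrative assumptions; note that the restored version only handles plain strings, since the `isinstance` branch for dict-shaped chunks is removed:

```python
import spacy

def spacy_tokenizer(sentence, parser):
    # Lemmatize, lowercase, and drop stop words and punctuation.
    mytokens = parser(sentence)
    mytokens = [
        word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_
        for word in mytokens
        if not word.is_stop and not word.is_punct
    ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

# Illustrative usage; assumes the small English spaCy model is installed.
parser = spacy.load("en_core_web_sm")
print(spacy_tokenizer("The clusters were sampled from the documents.", parser))
```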
