The custom loader resorts to the simpler 'prebuilt-layout' API model as a fallback.
1. **Multi-Lingual**: The tool supports language analyzers for linguistic support on individual languages and specialized (language-agnostic) analyzers for user-defined patterns on search indexes. For more information, see [Types of Analyzers](https://learn.microsoft.com/en-us/azure/search/search-analyzers#types-of-analyzers).
1. **Sampling**: If you have a large dataset and/or want to speed up experimentation, a sampling process is available to create a small but representative sample of the data at the percentage specified. The data will be clustered by content and a percentage of each cluster will be selected as part of the sample. Results obtained should be roughly indicative of the full dataset within a ~10% margin. Once an approach has been identified, running on the full dataset is recommended for accurate results.
## Products used
- [Azure AI Search Service](https://learn.microsoft.com/en-us/azure/search/search-create-service-portal) (Note: [Semantic Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-semantic?tabs=dotnet) is available in Azure AI Search at the Basic tier or higher.)
## How to use
To use the **RAG Experiment Accelerator**, follow these steps:
1. Copy the provided `config.sample.json` file to a file named `config.json` and change any hyperparameters to tailor to your experiment.
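   For example (this assumes you are in the directory containing the sample config; the destination name must be `config.json`):

   ```bash
   cp config.sample.json config.json
   ```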
2. Run `01_index.py` (python 01_index.py) to create Azure AI Search indexes and load data into them.
```bash
python 01_index.py
-d "The directory holding the configuration files and data. Defaults to current working directory"
-dd "The directory holding the data. Defaults to data"
-cf "JSON config filename. Defaults to config.json"
```
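   For example, to index documents kept in a local `data` folder using the default config file (the paths here are illustrative, not prescribed):

   ```bash
   python 01_index.py -d . -dd ./data -cf config.json
   ```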
3. Run `02_qa_generation.py` (python 02_qa_generation.py) to generate question-answer pairs using Azure OpenAI.
```bash
python 02_qa_generation.py
-d "The directory holding the configuration files and data. Defaults to current working directory"
-dd "The directory holding the data. Defaults to data"
-cf "JSON config filename. Defaults to config.json"
```
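   For example, relying on the defaults for both directories and naming the config file explicitly (illustrative):

   ```bash
   python 02_qa_generation.py -cf config.json
   ```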
4. Run `03_querying.py` (python 03_querying.py) to query Azure AI Search to generate context, re-rank the items in the context, and get a response from Azure OpenAI using the new context.
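   This excerpt does not show the arguments for `03_querying.py`; assuming it follows the same argument pattern as the previous scripts (verify against the script's own help output):

   ```bash
   python 03_querying.py -d . -dd ./data -cf config.json
   ```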
Alternatively, you can run the above steps (apart from `02_qa_generation.py`) using an Azure ML pipeline. To do so, follow [the guide here](./docs/azureml-pipeline.md).
### Running with sampling
Sampling will be run locally to create a small but representative slice of the data. This helps with rapid experimentation and keeps costs down. Results obtained should be roughly indicative of the full dataset within a ~10% margin. Once an approach has been identified, running on the full dataset is recommended for accurate results.
**Note**: Sampling can only be run locally; at this stage it is not supported on a distributed AML compute cluster. The process is therefore to run sampling locally and then use the generated sample dataset for the run on AML.
If you have a very large dataset and want to apply a similar approach to sample the data, you can use the PySpark in-memory distributed implementation in the [Data Discovery Toolkit](https://github.com/microsoft/Data-Discovery-Toolkit) for [Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview) or [Azure Synapse Analytics](https://learn.microsoft.com/en-gb/azure/synapse-analytics/).
#### Available sampling parameters in the config.json file
```json
"sampling": {
    "sample_data": "Set to true to enable sampling",
    "only_run_sampling": "If set to true, only the sampling step will run and no index or subsequent steps will be created; use this if you want to build a small sampled dataset to run in AML",
    "sample_percentage": "Percentage of the document corpus to sample",
    "optimum_k": "Set to 'auto' to automatically determine the optimum cluster number or set to a specific value e.g. 15",
    "min_cluster": "Used by the automated optimum cluster process, this is the minimum number of clusters e.g. 2",
    "max_cluster": "Used by the automated optimum cluster process, this is the maximum number of clusters e.g. 30"
}
```
The sampling process will produce the following artifacts in the sampling directory:
1. A directory named after the config value ```job_name``` containing the subset of files sampled; this directory can be passed as the ```--data_dir``` argument when running the entire process on AML.
2. A 2-dimensional scatter plot, saved in the sampling folder, of the clustered files (by content) selected as the sampled dataset.
3. A .csv file of the entire dataset with cluster predictions named "all_cluster_predictions..." and a .csv file with the sampled cluster predictions named "sampled_cluster_predictions...". These can be used to further enrich the dataset, for example by creating a meaningful label per cluster and updating all records. See the [Heuristics classifier in the Data Discovery Toolkit as an example](https://github.com/microsoft/Data-Discovery-Toolkit/blob/main/walkthroughs/heuristics/standalone_text_heuristics.ipynb) or [Pixplotml for image data](https://github.com/microsoft/Data-Discovery-Toolkit?tab=readme-ov-file#using-pixplotml-to-rapidly-visualise-and-label-data-for-training).
4. If ```"optimum_k"``` is set to ```"auto"```, the sampling process will attempt to determine the optimum number of clusters automatically. You can override this if you know roughly how many broad buckets of content exist in your data. An elbow graph will be generated in the sampling folder.

Two options exist for running sampling, namely:
1. Run the entire process locally with sampling, including the index generation and subsequent steps
2. Run only the sampling locally and then use the created sampled dataset to execute on AML
#### Run the entire process locally
259
-
260
-
Set the following values to run the indexing process locally:
261
-
262
-
```json
"sampling": {
    "sample_data": true,
    "only_run_sampling": false,
    "sample_percentage": 10,
    "optimum_k": "auto",
    "min_cluster": 2,
    "max_cluster": 30
}
```
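With these values in place, running the indexing step will sample the data first and then continue with index creation and the subsequent steps, e.g. (illustrative invocation):

```bash
python 01_index.py -cf config.json
```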
#### Run only the sampling locally and the subsequent steps on AML
274
-
275
-
If the ```only_run_sampling``` config value is set to true, only the sampling step will run; no index will be created and the subsequent steps will not be executed. Set the ```--data_dir``` argument to the directory created by the sampling process, which will be:
276
-
277
-
```artifacts/sampling/config.[job_name]```, and then execute the [AML pipeline step](docs/azureml-pipeline.md).
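A minimal sketch of this two-phase flow, assuming a ```job_name``` of `my_experiment` (an illustrative value):

```bash
# Locally: with "only_run_sampling": true in config.json, only the sampling step runs
python 01_index.py -cf config.json

# Then run the AML pipeline (see docs/azureml-pipeline.md), pointing it at the
# sampled dataset: --data_dir artifacts/sampling/config.my_experiment
```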
278
-
279
220
# Description of configuration elements
```json
"job_description": "You may provide a description for the current job run which describes in words what you are about to experiment with",
"sampling": {
    "sample_data": "Set to true to enable sampling",
    "only_run_sampling": "If set to true, only the sampling step will run and no index or subsequent steps will be created; use this if you want to build a small sampled dataset to run in AML",
    "sample_percentage": "Percentage of the document corpus to sample",
    "optimum_k": "Set to 'auto' to automatically determine the optimum cluster number or set to a specific value e.g. 15",
    "min_cluster": "Used by the automated optimum cluster process, this is the minimum number of clusters e.g. 2",
    "max_cluster": "Used by the automated optimum cluster process, this is the maximum number of clusters e.g. 30"
}
```
all_chunks (list[dict]): A list of dictionaries where each dictionary contains a chunk and its corresponding text.

Returns:

df (pandas.DataFrame): A DataFrame with two columns - 'chunk' and 'text', where 'chunk' contains the chunks and 'text' contains the corresponding text.