Merge pull request #28 from linkml/added-mmr-to-search
Added MMR algorithm to search
cmungall authored Aug 24, 2024
2 parents fa360bc + 1b2a215 commit 8b9cfb8
Showing 9 changed files with 536 additions and 108 deletions.
249 changes: 179 additions & 70 deletions docs/how-to/Predict-Missing-Data.ipynb
Expand Up @@ -9,7 +9,9 @@
"\n",
"The framework is designed to support different kinds of inference, including rule-based inference and LLMs. This notebook shows simple ML-based inference using scikit-learn DecisionTrees.\n",
"\n",
"We will use the Iris dataset:"
"This how-to walks through the basic operations of using the `linkml-store` command line tool to perform training and inference using scikit-learn DecisionTrees. This uses the command line interface, but the same operations can be performed programmatically using the Python API, or via the Web API.\n",
"\n",
"We will use a subset of the classic [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), converted to jsonl (JSON Lines) format:"
],
"metadata": {
"collapsed": false
Expand All @@ -18,7 +20,18 @@
},
{
"cell_type": "code",
"execution_count": 18,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl describe"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:15:36.754913Z",
"start_time": "2024-08-23T22:15:33.366042Z"
}
},
"id": "d2ef6e85292b5a20",
"outputs": [
{
"name": "stdout",
Expand All @@ -33,25 +46,111 @@
]
}
],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl describe"
],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## The Infer Command",
"id": "335516b2c129363a"
},
{
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T20:08:06.401967Z",
"start_time": "2024-08-12T20:08:03.933123Z"
"end_time": "2024-08-23T22:20:41.635957Z",
"start_time": "2024-08-23T22:20:38.428284Z"
}
},
"id": "d2ef6e85292b5a20"
"cell_type": "code",
"source": [
"%%bash\n",
"linkml-store infer --help"
],
"id": "e38efeb1addfe697",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Usage: linkml-store infer [OPTIONS]\n",
"\n",
" Predict a complete object from a partial object.\n",
"\n",
" Currently two main prediction methods are provided: RAG and sklearn\n",
"\n",
" ## RAG:\n",
"\n",
" The RAG approach will use Retrieval Augmented Generation to inference the\n",
" missing attributes of an object.\n",
"\n",
" Example:\n",
"\n",
" linkml-store -i countries.jsonl inference -t rag -q 'name: Uruguay'\n",
"\n",
" Result:\n",
"\n",
" capital: Montevideo, code: UY, continent: South America, languages:\n",
" [Spanish]\n",
"\n",
" You can pass in configurations as follows:\n",
"\n",
" linkml-store -i countries.jsonl inference -t\n",
" rag:llm_config.model_name=llama-3 -q 'name: Uruguay'\n",
"\n",
" ## SKLearn:\n",
"\n",
" This uses scikit-learn (defaulting to simple decision trees) to do the\n",
" prediction.\n",
"\n",
" linkml-store -i tests/input/iris.csv inference -t sklearn -q\n",
" '{\"sepal_length\": 5.1, \"sepal_width\": 3.5, \"petal_length\": 1.4,\n",
" \"petal_width\": 0.2}'\n",
"\n",
"Options:\n",
" -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n",
" Output format\n",
" -o, --output PATH Output file path\n",
" -T, --target-attribute TEXT Target attributes for inference\n",
" -F, --feature-attributes TEXT Feature attributes for inference (comma\n",
" separated)\n",
" -Y, --inference-config-file PATH\n",
" Path to inference configuration file\n",
" -E, --export-model PATH Export model to file\n",
" -L, --load-model PATH Load model from file\n",
" -M, --model-format [pickle|onnx|pmml|pfa|joblib|png|linkml_expression|rulebased|rag_index]\n",
" Format for model\n",
" -S, --training-test-data-split <FLOAT FLOAT>...\n",
" Training/test data split\n",
" -t, --predictor-type TEXT Type of predictor [default: sklearn]\n",
" -n, --evaluation-count INTEGER Number of examples to evaluate over\n",
" --evaluation-match-function TEXT\n",
" Name of function to use for matching objects\n",
" in eval\n",
" -q, --query TEXT query term\n",
" --help Show this message and exit.\n"
]
}
],
"execution_count": 5
},
{
"cell_type": "markdown",
"source": [
"## Training and Inference\n",
"\n",
"We can perform training and inference in a single step:"
"We can perform training and inference in a single step. \n",
"\n",
"For feature labels, we use:\n",
"\n",
"- `petal_length`\n",
"- `petal_width`\n",
"- `sepal_length`\n",
"- `sepal_width`\n",
"\n",
"These can be explicitly specified using `-F`, but in this case we are specifying a query, so\n",
"the feature labels are inferred from the query.\n",
"\n",
"We specify the target label using `-T`. In this case, we are predicting the `species` of the iris.\n"
],
"metadata": {
"collapsed": false
Expand All @@ -60,7 +159,18 @@
},
{
"cell_type": "code",
"execution_count": 9,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" "
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:17:38.972690Z",
"start_time": "2024-08-23T22:17:35.558907Z"
}
},
"id": "4984aeb4016df154",
"outputs": [
{
"name": "stderr",
Expand All @@ -76,29 +186,27 @@
"text": [
"predicted_object:\n",
" species: setosa\n",
"confidence: 1.0\n"
"confidence: 1.0\n",
"\n"
]
}
],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" "
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T19:35:08.172872Z",
"start_time": "2024-08-12T19:35:05.095856Z"
}
},
"id": "4984aeb4016df154"
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The data model for the output consists of a `predicted_object` slot and a `confidence`. Note that for standard ML operations, the predicted object will typically have one attribute only, but other kinds of inference (OWL reasoning, LLMs) may be able to predict complex objects.",
"id": "dfcbdae846f56ada"
},
{
"cell_type": "markdown",
"source": [
"## Saving the Model\n",
"\n",
"Performing training and inference in a single step is convenient where training is fast, but more typically we'd want to save the model for later use:"
"Performing training and inference in a single step is convenient where training is fast, but more typically we'd want to save the model for later use.\n",
"\n",
"We can do this with the `-E` option:"
],
"metadata": {
"collapsed": false
Expand Down Expand Up @@ -181,48 +289,29 @@
},
{
"cell_type": "code",
"execution_count": 15,
"outputs": [],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L \"tmp/iris-model.joblib\" -E \"tmp/iris-model.png\""
"linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E input/iris-model.png"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T19:57:43.145521Z",
"start_time": "2024-08-12T19:57:40.441893Z"
"end_time": "2024-08-23T22:23:18.451362Z",
"start_time": "2024-08-23T22:23:15.571984Z"
}
},
"id": "d7d14edd77e9e1fe"
"id": "d7d14edd77e9e1fe",
"outputs": [],
"execution_count": 9
},
{
"cell_type": "markdown",
"source": [
"![img](tmp/iris-model.png)"
],
"source": "![img](input/iris-model.png)",
"metadata": {
"collapsed": false
},
"id": "cca55edf629f8c26"
},
{
"cell_type": "code",
"execution_count": 29,
"outputs": [],
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-12T21:59:26.805316Z",
"start_time": "2024-08-12T21:59:24.343197Z"
}
},
"id": "acb7c57ecb3be9b"
},
{
"cell_type": "markdown",
"source": [
Expand All @@ -244,8 +333,20 @@
"id": "3ef8a6bc39b5e667"
},
{
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-08-23T22:24:16.457340Z",
"start_time": "2024-08-23T22:24:13.977990Z"
}
},
"cell_type": "code",
"execution_count": 30,
"source": [
"%%bash\n",
"linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml\n",
"cat tmp/iris-model.rulebased.yaml"
],
"id": "acb7c57ecb3be9b",
"outputs": [
{
"name": "stdout",
Expand All @@ -266,17 +367,13 @@
]
}
],
"source": [
"%%bash\n",
"cat tmp/iris-model.rulebased.yaml"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2024-08-12T21:59:52.936844Z"
}
},
"id": "4fdea226f501455e"
"execution_count": 10
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We can then apply this model to new data:",
"id": "50f9cd9df60b41c9"
},
{
"cell_type": "code",
Expand Down Expand Up @@ -310,14 +407,26 @@
"id": "4df0d87dff96e667"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## More advanced ML models\n",
"\n",
"Currently only Decision Trees are supported. Additionally, most of the underlying functionality of scikit-learn is hidden.\n",
"\n",
"For more advanced ML, you are encouraged to use linkml-store for *data management* and then export to standard tabular or dataframe formats in order to do more advanced ML in Python. linkml-store is *not* intended as an ML platform. Instead, a limited set of operations is provided to assist with data exploration and with the construction of deterministic rules.\n",
"\n",
"For inference using LLMs and Retrieval Augmented Generation, see the how-to guide on those topics.\n"
],
"id": "d1b583ce2d75c0e0"
},
{
"metadata": {},
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
},
"id": "cef5b6e4ee9cb5f5"
"execution_count": null,
"source": "",
"id": "c8d9e36761d3088d"
}
],
"metadata": {
Expand Down
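
The notebook in this diff drives training and inference through the `linkml-store infer` CLI. As a rough point of comparison, the same kind of prediction can be sketched directly in scikit-learn. This is not linkml-store's implementation: it assumes the full 150-row Iris dataset bundled with scikit-learn rather than the subset in `tests/input/iris.jsonl`, and the `result` dictionary only mimics the `predicted_object` / `confidence` shape shown in the notebook output. Because the training data differs, the predicted species may not match the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled scikit-learn Iris data, not the repository's iris.jsonl subset.
iris = load_iris()
# Column order of iris.data, renamed to the attribute names used in the notebook.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Train a plain decision tree on all rows.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# The same partial object that the notebook passes with -q.
query = {"petal_length": 2.5, "petal_width": 0.5, "sepal_length": 5.0, "sepal_width": 3.5}
row = [[query[name] for name in feature_names]]

# Predicted class label, plus the highest class probability as a confidence value.
predicted = str(iris.target_names[clf.predict(row)[0]])
confidence = float(clf.predict_proba(row).max())

# Illustrative structure only, modelled on the notebook's output.
result = {"predicted_object": {"species": predicted}, "confidence": confidence}
print(result)

# Text rendering of the learned tree, loosely analogous to exporting rules.
print(export_text(clf, feature_names=feature_names))
```

For anything beyond this, the notebook's own advice applies: use linkml-store for data management and export to a dataframe for the actual modelling.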
Binary file added docs/how-to/input/iris-model.png
3 changes: 2 additions & 1 deletion src/linkml_store/api/collection.py
Expand Up @@ -470,6 +470,7 @@ def search(
where: Optional[Any] = None,
index_name: Optional[str] = None,
limit: Optional[int] = None,
mmr_relevance_factor: Optional[float] = None,
**kwargs,
) -> QueryResult:
"""
Expand Down Expand Up @@ -534,7 +535,7 @@ def search(
index_col = ix.index_field
# TODO: optimize this for large indexes
vector_pairs = [(row, np.array(row[index_col], dtype=float)) for row in qr.rows]
results = ix.search(query, vector_pairs, limit=limit)
results = ix.search(query, vector_pairs, limit=limit, mmr_relevance_factor=mmr_relevance_factor, **kwargs)
for r in results:
del r[1][index_col]
new_qr = QueryResult(num_rows=len(results))
Expand Down
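
The hunks above thread a new `mmr_relevance_factor` argument from `Collection.search` through to `ix.search`, but the ranking code itself is not part of this excerpt. As an illustration only, here is a minimal sketch of the standard greedy Maximal Marginal Relevance (MMR) selection. The names `mmr_rank` and `cosine` are hypothetical, and treating the factor as the usual MMR weight (1.0 means pure relevance, lower values penalise redundancy more) is an assumption, not something confirmed by the diff.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rank(
    query_vec: Sequence[float],
    candidates: Sequence[Tuple[Any, Sequence[float]]],
    limit: Optional[int] = None,
    relevance_factor: float = 0.5,
) -> List[Tuple[float, Any]]:
    """Hypothetical helper: greedy MMR over (item, vector) pairs.

    Each step picks the candidate maximising
    relevance_factor * sim(query, c) - (1 - relevance_factor) * max sim(c, picked).
    """
    relevance = [cosine(query_vec, vec) for _, vec in candidates]
    remaining = list(range(len(candidates)))
    selected: List[int] = []
    results: List[Tuple[float, Any]] = []
    n = len(candidates) if limit is None else min(limit, len(candidates))
    while len(selected) < n:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max(
                (cosine(candidates[i][1], candidates[j][1]) for j in selected),
                default=0.0,
            )
            score = relevance_factor * relevance[i] - (1 - relevance_factor) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
        results.append((best_score, candidates[best_idx][0]))
    return results


query = [1.0, 0.0]
# "a" and "b" are near-duplicates; "c" is less relevant but points elsewhere.
docs = [("a", [1.0, 0.1]), ("b", [1.0, 0.12]), ("c", [1.0, -1.0])]
by_relevance = [item for _, item in mmr_rank(query, docs, relevance_factor=1.0)]
diversified = [item for _, item in mmr_rank(query, docs, relevance_factor=0.5)]
print(by_relevance)  # ['a', 'b', 'c']
print(diversified)   # ['a', 'c', 'b']
```

With the factor at 1.0 the ranking is plain similarity order; at 0.5 the near-duplicate `b` drops below `c`, which is the diversification effect MMR is meant to provide.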