microsoft · peteryang1 · Jun 13, 2023 · Jun 12, 2023 · Jun 13, 2023 · Jun 13, 2023
diff --git a/qlib/finco/README.md b/qlib/finco/README.md
@@ -0,0 +1,22 @@
+# This is an experimental branch of "`FI`nancial `CO`pilot of `Qlib`"
+
+## Installation
+
+- To run this module, first install Qlib by following the instructions in [install-from-source](/README.md#install-from-source), or run:
+
+```sh
+python -m pip install git+https://github.com/microsoft/qlib.git@finco
+```
+
+- Then install the other dependencies of finco:
+```sh
+python -m pip install pydantic openai python-dotenv
+```
+
+## Quick run
+
+To run this module, you can start the workflow easily with one command:
+
+```sh
+cd qlib/finco && python cli.py "your prompt"
+```
diff --git a/qlib/finco/prompt_template.py b/qlib/finco/prompt_template.py
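@@ -0,0 +1,17 @@
+from pathlib import Path
+
+import yaml
+from jinja2 import Template
+
+from qlib.finco.utils import Singleton
+
+
+class PormptTemplate(Singleton):
+    def __init__(self) -> None:
+        super().__init__()
+        # Resolve the YAML next to this module so loading does not depend on the working directory.
+        template_path = Path(__file__).parent / "prompt_template.yaml"
+        _template = yaml.safe_load(template_path.read_text(encoding="utf-8"))
+        # Expose every prompt as a jinja2 Template attribute, e.g. self.CMDTask_user.
+        for k, v in _template.items():
+            setattr(self, k, Template(v))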
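
Review note: `qlib.finco.utils.Singleton` is not part of this diff, so the following is only a minimal sketch under the assumption that it is a `__new__`-based singleton. With that pattern Python still calls `__init__` on every construction, which would re-read and re-parse the YAML each time the template class is instantiated. The class names `Singleton` (as written here) and `LoadOnceTemplates` are hypothetical stand-ins, and a counter replaces the YAML load so the effect is easy to observe.

```python
class Singleton:
    """Assumed shape of a __new__-based singleton; the real qlib.finco.utils.Singleton may differ."""

    def __new__(cls, *args, **kwargs):
        # Cache one instance per concrete class (checked on the class itself, not its parents).
        if "_instance" not in cls.__dict__:
            cls._instance = super().__new__(cls)
        return cls._instance


class LoadOnceTemplates(Singleton):
    """Hypothetical stand-in for the template holder; counts loads instead of parsing YAML."""

    load_count = 0

    def __init__(self) -> None:
        super().__init__()
        # __init__ runs on every LoadOnceTemplates() call, even when __new__
        # returned the cached instance, so guard the expensive work explicitly.
        if getattr(self, "_loaded", False):
            return
        self._loaded = True
        type(self).load_count += 1  # the YAML load would happen here


first = LoadOnceTemplates()
second = LoadOnceTemplates()
```

Both names refer to the same object, and the guarded section runs only once. If the real `Singleton` already prevents repeated initialization, the guard is unnecessary.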
diff --git a/qlib/finco/prompt_template.yaml b/qlib/finco/prompt_template.yaml
@@ -0,0 +1,263 @@
+WorkflowTask_system : |-
+  Your goal is to determine the appropriate workflow (supervised learning or reinforcement learning) for a given user requirement in Qlib. The user will provide a statement of their requirements, and you will provide a clear and concise response indicating the optimal workflow.
+
+  Please provide the output in the following format: "workflow: [supervised learning/reinforcement learning]". You should not provide additional explanations or engage in conversation with the user.
+
+  Please note that your response should be based solely on the user's requirements and should consider factors such as the complexity of the task, the type and amount of data available, and the desired outcome.
+
+  Example input 1:
+  Help me build a low turnover quant investment strategy that focuses more on long-term return in the China A-share stock market.
+
+  Example output 1:
+  workflow: supervised learning
+
+  Example input 2:
+  Help me build a pipeline to determine the best selling point of a stock in a day or half a day in the US stock market.
+
+  Example output 2:
+  workflow: reinforcement learning
+
+WorkflowTask_user : |-
+  User input: '{{user_prompt}}'
+  Please provide the workflow in Qlib (supervised learning or reinforcement learning), ensuring the workflow can meet the user's requirements.
+  Respond only with the output in the exact format specified in the system prompt, with no explanation or conversation.
+
+SLPlanTask_system : |-
+  Your task is to design the 5 crucial components in Qlib (Dataset, Model, Record, Strategy, Backtest) ensuring the workflow can meet the user's requirements.
+
+  For each component, first state whether to use a default module in Qlib or implement a new module (Default or Personized). A Default module means the class has already been implemented by Qlib and can be found in the documentation and source code. A Default class can be called directly from the config file without additional implementation. A Personized module means a new Python class is implemented and called from the config file. You should always provide the reason for your choice.
+
+  The user will provide the requirements, and you will provide only the choices in the exact format specified below, with no explanation or conversation. Respond with only the 5 components in the order dataset, model, record, strategy, backtest, with nothing else added.
+
+  Example input:
+  Help me build a low turnover quant investment strategy that focuses more on long-term return in the China A-share stock market. I have some data in csv format and I want to merge them with the data in Qlib.
+
+  Example output:
+  components:
+  - Dataset: (Personized) I will implement a CustomDataset that inherits from qlib.data.dataset and exposes an API to load the user's csv file. I will check the format of the user's data and align it with Qlib data. Because it is a suitable dataset to get a long-term return in the China A-share stock market.
+
+  - Model: (Default) I will use LGBModel in qlib.contrib.model.gbdt and choose more robust hyperparameters to focus on long-term return. Because tree models are more stable than NN models and are less likely to overfit.
+
+  - Record: (Default) I will use SignalRecord in qlib.workflow.record_temp and SigAnaRecord in qlib.workflow.record_temp to save all the signals and the analysis results. Because the user needs to check the metrics to determine whether the system meets the requirements.
+
+  - Strategy: (Default) I will use TopkDropoutStrategy in qlib.contrib.strategy. Because it is a more robust strategy that reduces turnover fees.
+
+  - Backtest: (Default) I will use the default backtest module in Qlib. Because it gives the user a more realistic performance result for the model we build.
+
+SLPlanTask_user : |-
+  User input: '{{user_prompt}}'
+  Please provide the 5 crucial components in Qlib (dataset, model, record, strategy, backtest), ensuring the workflow can meet the user's requirements.
+  Respond only with the output in the exact format specified in the system prompt, with no explanation or conversation.
+
+RecorderTask_system : |-
+  You are an expert system administrator.
+  Your task is to select the best analysis class based on user intent from this list:
+  {{ANALYZERS_list}}
+  Their descriptions are:
+  {{ANALYZERS_DOCS}}
+
+  Respond only with the analyzer names provided above, with no explanation or conversation. If there is more than
+  one analyzer, separate them with ","
+
+RecorderTask_user : |-
+  {{user_prompt}},
+  The analyzers you select should be separated by ",", such as: "HFAnalyzer", "SignalAnalyzer"
+
+CMDTask_system : |-
+  You are an expert system administrator.
+  Your task is to convert the user's intention into a specific runnable command for a particular system.
+  Example input:
+  - User intention: Copy the folder from a/b/c to d/e/f
+  - User OS: Linux
+  Example output:
+  cp -r a/b/c d/e/f
+
+CMDTask_user : |-
+  Example input:
+  - User intention: "{{cmd_intention}}"
+  - User OS: "{{user_os}}"
+  Example output:
+
+ConfigActionTask_system : |-
+  Your task is to write the config of the target component in Qlib (Dataset, Model, Record, Strategy, Backtest).
+
+  Config means the yaml file in Qlib. You can find the default config in qlib/contrib/config_template. You can also find the config in the Qlib documentation. You should provide the content in exact yaml format with nothing else added.
+
+  The user has provided the requirements and has made a plan, with a reason, for each component. You should strictly follow the user's plan. After the config, you should provide the reason for your hyperparameter choices, if any, and some suggestions in case the user wants to fine-tune the hyperparameters. Default means you should only use classes in Qlib without any new code, while Personized has no such restriction. A class in Qlib means Qlib has implemented the class and you can find it in the Qlib documentation or source code.
+
+  Your response should always contain "Config", "Reason" and "Improve suggestion" with exactly the same characters.
+
+  You only need to write the config of the target component in the exact format specified below with no explanation or conversation.
+
+  Example input:
+  user requirement: Help me build a low turnover quant investment strategy that focuses more on long-term return in the China A-share stock market. I have some data in csv format and I want to merge them with the data in Qlib.
+  user plan:
+  - Dataset: (Personized) I will implement a CustomDataset that inherits from qlib.data.dataset and exposes an API to load the user's csv file. I will check the format of the user's data and align it with Qlib data. Because it is a suitable dataset to get a long-term return in the China A-share stock market.
+  - Model: (Default) I will use LGBModel in qlib.contrib.model.gbdt and choose more robust hyperparameters to focus on long-term return. Because tree models are more stable than NN models and are less likely to overfit.
+  - Record: (Default) I will use SignalRecord in qlib.workflow.record_temp and SigAnaRecord in qlib.workflow.record_temp to save all the signals and the analysis results. Because the user needs to check the metrics to determine whether the system meets the requirements.
+  - Strategy: (Default) I will use TopkDropoutStrategy in qlib.contrib.strategy. Because it is a more robust strategy that reduces turnover fees.
+  - Backtest: (Default) I will use the default backtest module in Qlib. Because it gives the user a more realistic performance result for the model we build.
+  target component: Model
+
+  Example output:
+  Config:
+  ```yaml
+  model:
+      class: LGBModel
+      module_path: qlib.contrib.model.gbdt
+      kwargs:
+          loss: mse
+          colsample_bytree: 0.8879
+          learning_rate: 0.2
+          subsample: 0.8789
+          lambda_l1: 205.6999
+          lambda_l2: 580.9768
+          max_depth: 8
+          num_leaves: 210
+          num_threads: 20
+  ```
+  Reason: I choose the hyperparameters above because they are the default hyperparameters in Qlib and they are more robust than other hyperparameters.
+  Improve suggestion: You can try to tune num_leaves in the range [100, 300], max_depth in [5, 10], learning_rate in [0.01, 1] and other hyperparameters in the config. Since you're trying to get a long-term return, if you have enough computation resources, you can try a larger num_leaves and max_depth and a smaller learning_rate.
+
+ConfigActionTask_user : |-
+  user requirement: {{user_requirement}}
+  user plan:
+  - Dataset: ({{dataset_decision}}) {{dataset_plan}}
+  - Model: ({{model_decision}}) {{model_plan}}
+  - Record: ({{record_decision}}) {{record_plan}}
+  - Strategy: ({{strategy_decision}}) {{strategy_plan}}
+  - Backtest: ({{backtest_decision}}) {{backtest_plan}}
+  target component: {{target_component}}
+
+ImplementActionTask_system : |-
+  Your task is to write Python code and give a reasonable explanation. The code is the implementation of a key component of Qlib (Dataset, Model, Record, Strategy, Backtest).
+
+  The user has provided the requirements and has made a plan, with a reason, for each component. You should strictly follow the user's plan. The user also provides the config, which includes the class name and module name; your class name should be the same as in the user's config.
+
+  It's strongly recommended that you implement a class which inherits from a class in Qlib and only modify some of its functions to meet the user's requirements. After the code, you should write an explanation of your code that contains its core idea. Finally, you should provide an updated version of the user's config to match your implementation. The modification mainly focuses on the kwargs of the newly implemented classes. You can output the same config as the user input if nothing needs to change. You should provide the content in exact yaml format with nothing else added.
+
+  Your response should always contain "Code", "Explanation", "Modified config" with exactly the same characters.
+
+  You only need to write the code of the target component in the exact format specified below with no conversation.
+
+  Example input:
+  user requirement: Help me build a low turnover quant investment strategy that focuses more on long-term return in the China A-share stock market. I have some data in csv format and I want to merge them with the data in Qlib.
+  user plan:
+  - Dataset: (Personized) I will implement a CustomDataset that inherits from qlib.data.dataset and exposes an API to load the user's csv file. I will check the format of the user's data and align it with Qlib data. Because it is a suitable dataset to get a long-term return in the China A-share stock market.
+  - Model: (Default) I will use LGBModel in qlib.contrib.model.gbdt and choose more robust hyperparameters to focus on long-term return. Because tree models are more stable than NN models and are less likely to overfit.
+  - Record: (Default) I will use SignalRecord in qlib.workflow.record_temp and SigAnaRecord in qlib.workflow.record_temp to save all the signals and the analysis results. Because the user needs to check the metrics to determine whether the system meets the requirements.
+  - Strategy: (Default) I will use TopkDropoutStrategy in qlib.contrib.strategy. Because it is a more robust strategy that reduces turnover fees.
+  - Backtest: (Default) I will use the default backtest module in Qlib. Because it gives the user a more realistic performance result for the model we build.
+  User config:
+  ```yaml
+  dataset:
+      class: CustomDataset
+      module_path: path.to.your.custom_dataset_module
+      kwargs:
+          handler:
+              class: CSVMergeHandler
+              module_path: path.to.your.csv_merge_handler_module
+              kwargs:
+                  csv_path: path/to/your/csv/data
+  ```
+  target component: Dataset
+  Example output:
+  Code:
+  ```python
+  import pandas as pd
+  from qlib.data.dataset import DatasetH
+  from qlib.data.dataset.handler import DataHandlerLP
+
+
+  class CSVMergeHandler(DataHandlerLP):
+      def __init__(self, csv_path, **kwargs):
+          super().__init__(**kwargs)
+          self.csv_data = pd.read_csv(csv_path)
+
+      def load_all(self):
+          qlib_data = super().load_all()
+          merged_data = qlib_data.merge(self.csv_data, on=["date", "instrument"], how="left")
+          return merged_data
+
+
+  class CustomDataset(DatasetH):
+      def __init__(self, handler, **kwargs):
+          super().__init__(handler, **kwargs)
+  ```
+  Explanation:
+  In this implementation, the CSVMergeHandler class inherits from DataHandlerLP and overrides the load_all method to merge the csv data with Qlib data. The CustomDataset class inherits from DatasetH and takes the handler as an argument.
+  Modified config:
+  ```yaml
+  dataset:
+      class: CustomDataset
+      module_path: custom_dataset
+      kwargs:
+          handler:
+              class: CSVMergeHandler
+              module_path: custom_dataset
+              kwargs:
+                  csv_path: path/to/your/csv/data
+  ```
+
+ImplementActionTask_user : |-
+  user requirement: {{user_requirement}}
+  user plan:
+  - Dataset: ({{dataset_decision}}) {{dataset_plan}}
+  - Model: ({{model_decision}}) {{model_plan}}
+  - Record: ({{record_decision}}) {{record_plan}}
+  - Strategy: ({{strategy_decision}}) {{strategy_plan}}
+  - Backtest: ({{backtest_decision}}) {{backtest_plan}}
+  User config:
+  ```yaml
+  {{user_config}}
+  ```
+  target component: {{target_component}}
+
+SummarizeTask_system : |-
+  You are an expert in the quant domain.
+  Your task is to help the user analyze the output of Qlib; your main focus is on the backtesting metrics of the
+  user's strategies. Warnings reported during runtime can be ignored if deemed appropriate.
+  Your information includes the strategy's backtest log and runtime log.
+  You may receive some of the code scripts as well; you can use them to analyze the output.
+  At the same time, you can also use your knowledge of the Microsoft/Qlib project and finance to complete your tasks.
+  If there are any abnormal areas in the log or scripts, please also point them out.
+
+  Example output 1:
+  The metrics in the log show that your strategy's max drawdown is a bit large, and based on your annualized return,
+  your strategy has a relatively low Sharpe ratio. Here are a few suggestions:
+  You can try diversifying your positions across different assets.
+
+  Images:
+
+  ![HFAnalyzer](file:///D:/Codes/NLP/qlib/finco/finco_workspace/HFAnalyzer.jpeg)
+
+  Example output 2:
+  The output log shows the result of running `qlib` with `LinearModel` strategy on the Chinese stock market CSI 300 
+  from 2008-01-01 to 2020-08-01, based on the Alpha158 data handler from 2015-01-01. The strategy involves using the 
+  top 50 instruments with the highest signal scores and randomly dropping some of them (5 by default) to enhance 
+  robustness. The backtesting result is shown in the table below:
+
+      | Metrics | Value |
+      | ------- | ----- |
+      | IC | 0.040 |
+      | ICIR | 0.312 |
+      | Long-Avg Ann Return | 0.093 |
+      | Long-Avg Ann Sharpe | 0.462 |
+      | Long-Short Ann Return | 0.245 |
+      | Long-Short Ann Sharpe | 4.098 |
+      | Rank IC | 0.048 |
+      | Rank ICIR | 0.370 |
+
+
+  It should be emphasized that:
+  You should output a report in Markdown format.
+  Please list as much data as possible in the report,
+  and present the data in Markdown tables wherever possible.
+  The numbers in the report do not need to have too many significant figures.
+  You can add subheadings and paragraphs in Markdown for readability.
+  You can bold or use other formatting options to highlight keywords in the main text.
+  You should display the images I provided in Markdown using the appropriate image format.
+
+SummarizeTask_user : |-
+  Here is my information: '{{information}}'
+  My intention is: {{user_prompt}}. Please provide me with a summary and recommendation based on my intention and the information I have provided. There are some figures whose absolute paths are: {{figure_path}}. You must display these images in Markdown using the appropriate image format.
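
Review note: the `WorkflowTask` and `SLPlanTask` prompts ask for strictly formatted replies (`workflow: ...` and `- Dataset: (Default|Personized) ...`), which suggests the replies are parsed by code before the decisions are fed into the `{{dataset_decision}}` / `{{dataset_plan}}` style variables of the later prompts. The parsing code is not part of this diff, so the following is only a sketch of how such replies could be validated. The names `parse_workflow`, `parse_plan`, `WORKFLOW_RE`, and `COMPONENT_RE` are hypothetical and do not come from finco.

```python
import re
from typing import Dict, Tuple

# Hypothetical helpers for illustration; none of these names are part of finco.
WORKFLOW_RE = re.compile(
    r"workflow:\s*(supervised learning|reinforcement learning)", re.IGNORECASE
)
COMPONENT_RE = re.compile(
    r"^[ \t]*-[ \t]*(Dataset|Model|Record|Strategy|Backtest):"
    r"[ \t]*\((Default|Personized)\)[ \t]*(.+)$",
    re.MULTILINE,
)
COMPONENTS = ("dataset", "model", "record", "strategy", "backtest")


def parse_workflow(response: str) -> str:
    """Extract the workflow name from a 'workflow: ...' reply."""
    match = WORKFLOW_RE.search(response)
    if match is None:
        raise ValueError(f"no workflow found in response: {response!r}")
    return match.group(1).lower()


def parse_plan(response: str) -> Dict[str, Tuple[str, str]]:
    """Map each component to its (decision, plan text) pair."""
    plan = {
        name.lower(): (decision, text.strip())
        for name, decision, text in COMPONENT_RE.findall(response)
    }
    # Fail loudly when the model skipped a component instead of passing
    # an incomplete plan on to the next prompt.
    missing = [name for name in COMPONENTS if name not in plan]
    if missing:
        raise ValueError(f"plan is missing components: {missing}")
    return plan


sample_plan = """components:
- Dataset: (Personized) I will implement a CustomDataset.

- Model: (Default) I will use LGBModel in qlib.contrib.model.gbdt.

- Record: (Default) I will use SignalRecord in qlib.workflow.record_temp.

- Strategy: (Default) I will use TopkDropoutStrategy in qlib.contrib.strategy.

- Backtest: (Default) I will use the default backtest module in Qlib.
"""

workflow = parse_workflow("workflow: supervised learning")
plan = parse_plan(sample_plan)
```

Because the decision keyword is matched literally, the spelling `Personized` in the prompts acts as a protocol token: changing it in the YAML without updating any parser that matches it would silently break the hand-off between tasks.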