diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000000..ae0843cee5 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,242 @@ +# GitHub Copilot Instructions for FLAML + +## Project Overview + +FLAML (Fast Library for Automated Machine Learning & Tuning) is a lightweight Python library for efficient automation of machine learning and AI operations. It automates workflow based on large language models, machine learning models, etc. and optimizes their performance. + +**Key Components:** + +- `flaml/automl/`: AutoML functionality for classification and regression +- `flaml/tune/`: Generic hyperparameter tuning +- `flaml/default/`: Zero-shot AutoML with default configurations +- `flaml/autogen/`: Legacy autogen code (note: AutoGen has moved to a separate repository) +- `flaml/fabric/`: Microsoft Fabric integration +- `test/`: Comprehensive test suite + +## Build and Test Commands + +### Installation + +```bash +# Basic installation +pip install -e . + +# Install with test dependencies +pip install -e .[test] + +# Install with automl dependencies +pip install -e .[automl] + +# Install with forecast dependencies (Linux only) +pip install -e .[forecast] +``` + +### Running Tests + +```bash +# Run all tests (excluding autogen) +pytest test/ --ignore=test/autogen --reruns 2 --reruns-delay 10 + +# Run tests with coverage +coverage run -a -m pytest test --ignore=test/autogen --reruns 2 --reruns-delay 10 +coverage xml + +# Check dependencies +python test/check_dependency.py +``` + +### Linting and Formatting + +```bash +# Run pre-commit hooks +pre-commit run --all-files + +# Format with black (line length: 120) +black . --line-length 120 + +# Run ruff for linting and auto-fix +ruff check . 
--fix +``` + +## Code Style and Formatting + +### Python Style + +- **Line length:** 120 characters (configured in both Black and Ruff) +- **Formatter:** Black (v23.3.0+) +- **Linter:** Ruff with Pyflakes and pycodestyle rules +- **Import sorting:** Use isort (via Ruff) +- **Python version:** Supports Python >= 3.10 (full support for 3.10, 3.11, 3.12 and 3.13) + +### Code Quality Rules + +- Follow Black formatting conventions +- Keep imports sorted and organized +- Avoid unused imports (F401) - these are flagged but not auto-fixed +- Avoid wildcard imports (F403) where possible +- Complexity: Max McCabe complexity of 10 +- Use type hints where appropriate +- Write clear docstrings for public APIs + +### Pre-commit Hooks + +The repository uses pre-commit hooks for: + +- Checking for large files, AST syntax, YAML/TOML/JSON validity +- Detecting merge conflicts and private keys +- Trailing whitespace and end-of-file fixes +- pyupgrade for Python 3.8+ syntax +- Black formatting +- Markdown formatting (mdformat with GFM and frontmatter support) +- Ruff linting with auto-fix + +## Testing Strategy + +### Test Organization + +- Tests are in the `test/` directory, organized by module +- `test/automl/`: AutoML feature tests +- `test/tune/`: Hyperparameter tuning tests +- `test/default/`: Zero-shot AutoML tests +- `test/nlp/`: NLP-related tests +- `test/spark/`: Spark integration tests + +### Test Requirements + +- Write tests for new functionality +- Ensure tests pass on multiple Python versions (3.10, 3.11, 3.12 and 3.13) +- Tests should work on both Ubuntu and Windows +- Use pytest markers for platform-specific tests (e.g., `@pytest.mark.spark`) +- Tests should be idempotent and not depend on external state +- Use `--reruns 2 --reruns-delay 10` for flaky tests + +### Coverage + +- Aim for good test coverage on new code +- Coverage reports are generated for Python 3.11 builds +- Coverage reports are uploaded to Codecov + +## Git Workflow and Best Practices + +### Branching + +- Main branch: `main` +- Create feature branches from `main` +- PR reviews are required before merging + +### Commit Messages + +- Use clear, descriptive commit messages +- Reference issue numbers when applicable +- ALWAYS run `pre-commit run --all-files` before each commit to avoid formatting issues + +### Pull Requests + +- Ensure all tests pass before requesting review +- Update documentation if adding new features +- Follow the PR template in `.github/PULL_REQUEST_TEMPLATE.md` + +## Project Structure + +``` +flaml/ +├── automl/ # AutoML functionality +├── tune/ # Hyperparameter tuning +├── default/ # Zero-shot AutoML +├── autogen/ # Legacy autogen (deprecated, moved to separate repo) +├── fabric/ # Microsoft Fabric integration +├── onlineml/ # Online learning +└── version.py # Version information + +test/ # Test suite +├── automl/ +├── tune/ +├── default/ +├── nlp/ +└── spark/ + +notebook/ # Example notebooks +website/ # Documentation website +``` + +## Dependencies and Package Management + +### Core Dependencies + +- NumPy >= 1.17 +- Python >= 3.10 (officially supported: 3.10, 3.11, 3.12 and 3.13) + +### Optional Dependencies + +- `[automl]`: lightgbm, xgboost, scipy, pandas, scikit-learn +- `[test]`: Full test suite dependencies +- `[spark]`: PySpark and joblib dependencies +- `[forecast]`: holidays, prophet, statsmodels, hcrystalball, pytorch-forecasting, pytorch-lightning, tensorboardX +- `[hf]`: Hugging Face transformers and datasets +- See `setup.py` for complete list + +### Version Constraints + +- Be mindful of 
Python version-specific dependencies (check setup.py) +- XGBoost versions differ based on Python version +- NumPy 2.0+ only for Python >= 3.13 +- Some features (like vowpalwabbit) only work with older Python versions + +## Boundaries and Restrictions + +### Do NOT Modify + +- `.git/` directory and Git configuration +- `LICENSE` file +- Version information in `flaml/version.py` (unless explicitly updating version) +- GitHub Actions workflows without careful consideration +- Existing test files unless fixing bugs or adding coverage + +### Be Cautious With + +- `setup.py`: Changes to dependencies should be carefully reviewed +- `pyproject.toml`: Linting and testing configuration +- `.pre-commit-config.yaml`: Pre-commit hook configuration +- Backward compatibility: FLAML is a library with external users + +### Security Considerations + +- Never commit secrets or API keys +- Be careful with external data sources in tests +- Validate user inputs in public APIs +- Follow secure coding practices for ML operations + +## Special Notes + +### AutoGen Migration + +- AutoGen has moved to a separate repository: https://github.com/microsoft/autogen +- The `flaml/autogen/` directory contains legacy code +- Tests in `test/autogen/` are ignored in the main test suite +- Direct users to the new AutoGen repository for AutoGen-related issues + +### Platform-Specific Considerations + +- Some tests only run on Linux (e.g., forecast tests with prophet) +- Windows and Ubuntu are the primary supported platforms +- macOS support exists but requires special libomp setup for lgbm/xgboost + +### Performance + +- FLAML focuses on efficient automation and tuning +- Consider computational cost when adding new features +- Optimize for low resource usage where possible + +## Documentation + +- Main documentation: https://microsoft.github.io/FLAML/ +- Update documentation when adding new features +- Provide clear examples in docstrings +- Add notebook examples for significant new features + +## Contributing + +- Follow the contributing guide: https://microsoft.github.io/FLAML/docs/Contribute +- Sign the Microsoft CLA when making your first contribution +- Be respectful and follow the Microsoft Open Source Code of Conduct +- Join the Discord community for discussions: https://discord.gg/Cppx2vSPVP diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml index 11258e54a7..0659b4aa85 100644 --- a/.github/workflows/python-package.yml +++ b/.github/workflows/python-package.yml @@ -40,7 +40,7 @@ jobs: fail-fast: false matrix: os: [ubuntu-latest, windows-latest] - python-version: ["3.10", "3.11", "3.12"] + python-version: ["3.10", "3.11", "3.12", "3.13"] steps: - uses: actions/checkout@v4 - name: Set up Python ${{ matrix.python-version }} @@ -74,6 +74,11 @@ jobs: run: | pip install pyspark==4.0.1 pip list | grep "pyspark" + - name: On Ubuntu python 3.13, install pyspark 4.1.0 + if: matrix.python-version == '3.13' && matrix.os == 'ubuntu-latest' + run: | + pip install pyspark==4.1.0 + pip list | grep "pyspark" # # TODO: support ray # - name: If linux and python<3.11, install ray 2 # if: matrix.os == 'ubuntu-latest' && matrix.python-version < '3.11' diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 161b21ca8f..98f7495c8f 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -36,7 +36,7 @@ repos: - id: black - repo: https://github.com/executablebooks/mdformat - rev: 0.7.17 + rev: 0.7.22 hooks: - id: mdformat additional_dependencies: diff --git a/NOTICE.md b/NOTICE.md 
index 839201ff2f..4da9e0e9e7 100644 --- a/NOTICE.md +++ b/NOTICE.md @@ -4,8 +4,8 @@ This repository incorporates material as listed below or described in the code. ## Component. Ray. -Code in tune/\[analysis.py, sample.py, trial.py, result.py\], -searcher/\[suggestion.py, variant_generator.py\], and scheduler/trial_scheduler.py is adapted from +Code in tune/[analysis.py, sample.py, trial.py, result.py], +searcher/[suggestion.py, variant_generator.py], and scheduler/trial_scheduler.py is adapted from https://github.com/ray-project/ray/blob/master/python/ray/tune/ ## Open Source License/Copyright Notice. diff --git a/README.md b/README.md index 8151fc7cd4..b4313980f1 100644 --- a/README.md +++ b/README.md @@ -34,7 +34,7 @@ FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source, ## Installation -The latest version of FLAML requires **Python >= 3.10 and \< 3.13**. While other Python versions may work for core components, full model support is not guaranteed. FLAML can be installed via `pip`: +The latest version of FLAML requires **Python >= 3.10 and < 3.14**. While other Python versions may work for core components, full model support is not guaranteed. FLAML can be installed via `pip`: ```bash pip install flaml diff --git a/SECURITY.md b/SECURITY.md index 9657262baf..105419313f 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -12,7 +12,7 @@ If you believe you have found a security vulnerability in any Microsoft-owned re Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report). -If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). +If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc). You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc). diff --git a/flaml/automl/automl.py b/flaml/automl/automl.py index edf6c1412b..cb3fe37857 100644 --- a/flaml/automl/automl.py +++ b/flaml/automl/automl.py @@ -118,6 +118,8 @@ def __init__(self, **settings): e.g., 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_weighted', 'roc_auc_ovo_weighted', 'roc_auc_ovr_weighted', 'f1', 'micro_f1', 'macro_f1', 'log_loss', 'mae', 'mse', 'r2', 'mape'. Default is 'auto'. + For a full list of supported built-in metrics, please refer to + https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric If passing a customized metric function, the function needs to have the following input arguments: @@ -154,6 +156,10 @@ def custom_metric( "pred_time": pred_time, } ``` + **Note:** When passing a custom metric function, pass the function itself + (e.g., `metric=custom_metric`), not the result of calling it + (e.g., `metric=custom_metric(...)`). FLAML will call your function + internally during the training process. 
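For illustration only (not part of the patch), a minimal sketch of the intended usage, assuming iris data from scikit-learn and a log-loss-based custom metric that follows the signature documented above:

```python
from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.metrics import log_loss


def custom_metric(X_val, y_val, estimator, labels, X_train, y_train,
                  weight_val=None, weight_train=None, *args, **kwargs):
    # Return (metric_to_minimize, metrics_to_log), as described in the docstring.
    val_loss = log_loss(y_val, estimator.predict_proba(X_val), labels=labels)
    return val_loss, {"val_loss": val_loss}


X, y = load_iris(return_X_y=True)
automl = AutoML()

# Correct: pass the callable itself; FLAML calls it internally for every trial.
automl.fit(X, y, task="classification", metric=custom_metric, time_budget=10)

# Incorrect: metric=custom_metric(...) would pass the *result* of a call
# (a float and a dict), which the new _validate_metric_parameter check
# rejects with a ValueError.
```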
task: A string of the task type, e.g., 'classification', 'regression', 'ts_forecast', 'rank', 'seq-classification', 'seq-regression', 'summarization', @@ -174,6 +180,11 @@ def custom_metric( and 'final_estimator' to specify the passthrough and final_estimator in the stacker. The dict can also contain 'n_jobs' as the key to specify the number of jobs for the stacker. + Note: The hyperparameters of a custom 'final_estimator' are NOT + automatically tuned. If you provide an estimator instance (e.g., + CatBoostClassifier()), it will use the parameters you specified + or their defaults. If 'final_estimator' is not provided, the best + model found during the search will be used as the final estimator. eval_method: A string of resampling strategy, one of ['auto', 'cv', 'holdout']. split_ratio: A float of the valiation data percentage for holdout. @@ -332,6 +343,12 @@ def custom_metric( } ``` skip_transform: boolean, default=False | Whether to pre-process data prior to modeling. + allow_label_overlap: boolean, default=True | For classification tasks with holdout evaluation, + whether to allow label overlap between train and validation sets. When True (default), + uses a fast strategy that adds the first instance of missing labels to the set that is + missing them, which may create some overlap. When False, uses a precise but slower + strategy that intelligently re-splits instances to avoid overlap when possible. + Only affects classification tasks with holdout evaluation method. fit_kwargs_by_estimator: dict, default=None | The user specified keywords arguments, grouped by estimator name. e.g., @@ -362,7 +379,10 @@ def custom_metric( settings["split_ratio"] = settings.get("split_ratio", SPLIT_RATIO) settings["n_splits"] = settings.get("n_splits", N_SPLITS) settings["auto_augment"] = settings.get("auto_augment", True) + settings["allow_label_overlap"] = settings.get("allow_label_overlap", True) settings["metric"] = settings.get("metric", "auto") + # Validate that custom metric is callable if not a string + self._validate_metric_parameter(settings["metric"], allow_auto=True) settings["estimator_list"] = settings.get("estimator_list", "auto") settings["log_file_name"] = settings.get("log_file_name", "") settings["max_iter"] = settings.get("max_iter") # no budget by default @@ -455,6 +475,28 @@ def __setstate__(self, state): except Exception: mi.mlflow_client = None + @staticmethod + def _validate_metric_parameter(metric, allow_auto=True): + """Validate that the metric parameter is either a string or a callable function. + + Args: + metric: The metric parameter to validate. + allow_auto: Whether to allow "auto" as a valid string value. + + Raises: + ValueError: If metric is not a string or callable function. + """ + if allow_auto and metric == "auto": + return + if not isinstance(metric, str) and not callable(metric): + raise ValueError( + f"The 'metric' parameter must be either a string or a callable function, " + f"but got {type(metric).__name__}. " + f"If you defined a custom_metric function, make sure to pass the function itself " + f"(e.g., metric=custom_metric) and not the result of calling it " + f"(e.g., metric=custom_metric(...))." + ) + def get_params(self, deep: bool = False) -> dict: return self._settings.copy() @@ -503,18 +545,135 @@ def best_iteration(self): @property def best_config(self): - """A dictionary of the best configuration.""" + """A dictionary of the best configuration. + + The returned config dictionary can be used to: + 1. Pass as `starting_points` to a new AutoML run. + 2. 
Initialize the corresponding FLAML estimator directly. + 3. Initialize the original model (e.g., LightGBM, XGBoost) after converting + FLAML-specific parameters. + + Note: + The config contains FLAML's search space parameters, which may differ from + the original model's parameters. For example, FLAML uses `log_max_bin` for + LightGBM instead of `max_bin`. Use the FLAML estimator's `config2params()` + method to convert to the original model's parameters. + + Example: + + ```python + from flaml import AutoML + from flaml.automl.model import LGBMEstimator + from lightgbm import LGBMClassifier + from sklearn.datasets import load_iris + + X, y = load_iris(return_X_y=True) + + # Train with AutoML + automl = AutoML() + automl.fit(X, y, task="classification", time_budget=10) + + # Get the best config + best_config = automl.best_config + print("Best config:", best_config) + # Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, + # 'learning_rate': 0.1, 'log_max_bin': 8, ...} + + # Option 1: Use FLAML estimator directly (handles parameter conversion internally) + flaml_estimator = LGBMEstimator(task="classification", **best_config) + flaml_estimator.fit(X, y) + + # Option 2: Convert to original model parameters using config2params() + # This converts FLAML-specific params (e.g., log_max_bin -> max_bin) + original_params = flaml_estimator.params # or use flaml_estimator.config2params(best_config) + print("Original model params:", original_params) + # Example output: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, + # 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin converted to max_bin + + # Now use with original LightGBM + lgbm_model = LGBMClassifier(**original_params) + lgbm_model.fit(X, y) + ``` + """ state = self._search_states.get(self._best_estimator) config = state and getattr(state, "best_config", None) return config and AutoMLState.sanitize(config) @property def best_config_per_estimator(self): - """A dictionary of all estimators' best configuration.""" - return { - e: e_search_state.best_config and AutoMLState.sanitize(e_search_state.best_config) - for e, e_search_state in self._search_states.items() - } + """A dictionary of all estimators' best configuration. + + Returns a dictionary where keys are estimator names (e.g., 'lgbm', 'xgboost') + and values are the best hyperparameter configurations found for each estimator. + The config may include `FLAML_sample_size` which indicates the sample size used + during training. + + This is useful for: + 1. Passing as `starting_points` to a new AutoML run for warm-starting. + 2. Comparing the best configurations across different estimators. + 3. Initializing the original models after converting FLAML-specific parameters. + + Note: + The configs contain FLAML's search space parameters, which may differ from + the original models' parameters. Use each estimator's `config2params()` method + to convert to the original model's parameters. 
+ + Example: + + ```python + from flaml import AutoML + from flaml.automl.model import LGBMEstimator, XGBoostEstimator + from lightgbm import LGBMClassifier + from xgboost import XGBClassifier + from sklearn.datasets import load_iris + + X, y = load_iris(return_X_y=True) + + # Train with AutoML + automl = AutoML() + automl.fit(X, y, task="classification", time_budget=30, + estimator_list=['lgbm', 'xgboost']) + + # Get best configs for all estimators + configs = automl.best_config_per_estimator + print(configs) + # Example output: {'lgbm': {'n_estimators': 4, 'num_leaves': 4, 'log_max_bin': 8, ...}, + # 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}} + + # Use as starting points for a new AutoML run (warm start) + new_automl = AutoML() + new_automl.fit(X, y, task="classification", time_budget=30, + starting_points=configs) + + # Or convert to original model parameters for direct use + if configs.get('lgbm'): + lgbm_config = configs['lgbm'].copy() + lgbm_config.pop('FLAML_sample_size', None) # Remove FLAML internal param + flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config) + original_lgbm_params = flaml_lgbm.params # Converted params (log_max_bin -> max_bin), or use flaml_lgbm.config2params(lgbm_config) + lgbm_model = LGBMClassifier(**original_lgbm_params) + lgbm_model.fit(X, y) + + if configs.get('xgboost'): + xgb_config = configs['xgboost'].copy() + xgb_config.pop('FLAML_sample_size', None) # Remove FLAML internal param + flaml_xgb = XGBoostEstimator(task="classification", **xgb_config) + original_xgb_params = flaml_xgb.params # Converted params + xgb_model = XGBClassifier(**original_xgb_params) + xgb_model.fit(X, y) + ``` + """ + result = {} + for e, e_search_state in self._search_states.items(): + if e_search_state.best_config: + config = e_search_state.best_config.get("ml", e_search_state.best_config).copy() + # Remove internal keys that are not needed for starting_points, but keep FLAML_sample_size + config.pop("learner", None) + config.pop("_choice_", None) + result[e] = config + else: + result[e] = None + return result @property def best_loss_per_estimator(self): @@ -630,7 +789,7 @@ def score( def predict( self, - X: np.array | DataFrame | list[str] | list[list[str]] | psDataFrame, + X: np.ndarray | DataFrame | list[str] | list[list[str]] | psDataFrame, **pred_kwargs, ): """Predict label from features. @@ -696,6 +855,50 @@ def predict_proba(self, X, **pred_kwargs): proba = self._trained_estimator.predict_proba(X, **pred_kwargs) return proba + def preprocess( + self, + X: np.ndarray | DataFrame | list[str] | list[list[str]] | psDataFrame, + ): + """Preprocess data using task-level preprocessing. + + This method applies task-level preprocessing transformations to the input data, + including handling of data types, sparse matrices, and feature transformations + that were learned during the fit phase. This should be called before any + estimator-level preprocessing. + + Args: + X: A numpy array or pandas dataframe or pyspark.pandas dataframe + of featurized instances, shape n * m, + or for time series forecast tasks: + a pandas dataframe with the first column containing + timestamp values (datetime type) or an integer n for + the predict steps (only valid when the estimator is + arima or sarimax). Other columns in the dataframe + are assumed to be exogenous variables (categorical + or numeric). + + Returns: + Preprocessed data in the same format as input (numpy array, DataFrame, etc.). + + Raises: + AttributeError: If the model has not been fitted yet. 
+ + Example: + ```python + automl = AutoML() + automl.fit(X_train, y_train, task="classification") + + # Apply task-level preprocessing to new data + X_test_preprocessed = automl.preprocess(X_test) + ``` + """ + if not hasattr(self, "_state") or self._state is None: + raise AttributeError("AutoML instance has not been fitted yet. Please call fit() first.") + if not hasattr(self, "_transformer"): + raise AttributeError("Transformer not initialized. Please call fit() first.") + + return self._state.task.preprocess(X, self._transformer) + def add_learner(self, learner_name, learner_class): """Add a customized learner. @@ -854,6 +1057,14 @@ def retrain_from_log( the searched learners, such as sample_weight. Below are a few examples of estimator-specific parameters: period: int | forecast horizon for all time series forecast tasks. + This is the number of time steps ahead to forecast (e.g., period=12 means + forecasting 12 steps into the future). This represents the forecast horizon + used during model training. Note: during prediction, the output length + equals the length of X_test. FLAML automatically handles feature + engineering for you - sklearn-based models (lgbm, rf, xgboost, etc.) will have + lagged features created automatically, while time series native models (prophet, + arima, sarimax) use their built-in forecasting capabilities. You do NOT need + to manually create lagged features of the target variable. gpu_per_trial: float, default = 0 | A float of the number of gpus per trial, only used by TransformersEstimator, XGBoostSklearnEstimator, and TemporalFusionTransformerEstimator. @@ -961,6 +1172,7 @@ def retrain_from_log( eval_method = self._decide_eval_method(eval_method, time_budget) self.modelcount = 0 self._auto_augment = auto_augment + self._allow_label_overlap = self._settings.get("allow_label_overlap", True) self._prepare_data(eval_method, split_ratio, n_splits) self._state.time_budget = -1 self._state.free_mem_ratio = 0 @@ -1564,6 +1776,7 @@ def _prepare_data(self, eval_method, split_ratio, n_splits): n_splits, self._df, self._sample_weight_full, + self._allow_label_overlap, ) self.data_size_full = self._state.data_size_full @@ -1620,6 +1833,7 @@ def fit( time_col=None, cv_score_agg_func=None, skip_transform=None, + allow_label_overlap=True, mlflow_logging=None, fit_kwargs_by_estimator=None, mlflow_exp_name=None, @@ -1648,6 +1862,8 @@ def fit( e.g., 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_weighted', 'roc_auc_ovo_weighted', 'roc_auc_ovr_weighted', 'f1', 'micro_f1', 'macro_f1', 'log_loss', 'mae', 'mse', 'r2', 'mape'. Default is 'auto'. + For a full list of supported built-in metrics, please refer to + https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric If passing a customized metric function, the function needs to have the following input arguments: @@ -1684,6 +1900,10 @@ def custom_metric( "pred_time": pred_time, } ``` + **Note:** When passing a custom metric function, pass the function itself + (e.g., `metric=custom_metric`), not the result of calling it + (e.g., `metric=custom_metric(...)`). FLAML will call your function + internally during the training process. task: A string of the task type, e.g., 'classification', 'regression', 'ts_forecast_regression', 'ts_forecast_classification', 'rank', 'seq-classification', @@ -1706,6 +1926,11 @@ def custom_metric( and 'final_estimator' to specify the passthrough and final_estimator in the stacker. 
The dict can also contain 'n_jobs' as the key to specify the number of jobs for the stacker. + Note: The hyperparameters of a custom 'final_estimator' are NOT + automatically tuned. If you provide an estimator instance (e.g., + CatBoostClassifier()), it will use the parameters you specified + or their defaults. If 'final_estimator' is not provided, the best + model found during the search will be used as the final estimator. eval_method: A string of resampling strategy, one of ['auto', 'cv', 'holdout']. split_ratio: A float of the valiation data percentage for holdout. @@ -1895,6 +2120,12 @@ def cv_score_agg_func(val_loss_folds, log_metrics_folds): ``` skip_transform: boolean, default=False | Whether to pre-process data prior to modeling. + allow_label_overlap: boolean, default=True | For classification tasks with holdout evaluation, + whether to allow label overlap between train and validation sets. When True (default), + uses a fast strategy that adds the first instance of missing labels to the set that is + missing them, which may create some overlap. When False, uses a precise but slower + strategy that intelligently re-splits instances to avoid overlap when possible. + Only affects classification tasks with holdout evaluation method. mlflow_logging: boolean, default=None | Whether to log the training results to mlflow. Default value is None, which means the logging decision is made based on AutoML.__init__'s mlflow_logging argument. Not valid if mlflow is not installed. @@ -1928,6 +2159,14 @@ def cv_score_agg_func(val_loss_folds, log_metrics_folds): the searched learners, such as sample_weight. Below are a few examples of estimator-specific parameters: period: int | forecast horizon for all time series forecast tasks. + This is the number of time steps ahead to forecast (e.g., period=12 means + forecasting 12 steps into the future). This represents the forecast horizon + used during model training. Note: during prediction, the output length + equals the length of X_test. FLAML automatically handles feature + engineering for you - sklearn-based models (lgbm, rf, xgboost, etc.) will have + lagged features created automatically, while time series native models (prophet, + arima, sarimax) use their built-in forecasting capabilities. You do NOT need + to manually create lagged features of the target variable. gpu_per_trial: float, default = 0 | A float of the number of gpus per trial, only used by TransformersEstimator, XGBoostSklearnEstimator, and TemporalFusionTransformerEstimator. 
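As a concrete illustration of the forecast-horizon semantics described above (a sketch, not part of the patch; the data and column names are made up):

```python
import numpy as np
import pandas as pd
from flaml import AutoML

# A monthly univariate series: the first column holds timestamps, y is the target.
X_train = pd.DataFrame({"timestamp": pd.date_range("2017-01-01", periods=60, freq="MS")})
y_train = pd.Series(10 + np.sin(np.arange(60) / 6.0), name="value")

automl = AutoML()
automl.fit(
    X_train=X_train,
    y_train=y_train,
    task="ts_forecast",
    period=12,        # forecast horizon used during training
    time_budget=15,
)

# No manual lag features: sklearn-based learners get lagged features created
# automatically, while prophet/arima/sarimax use their native mechanisms.
# The prediction length follows X_test, here the next 12 months.
X_test = pd.DataFrame({"timestamp": pd.date_range("2022-01-01", periods=12, freq="MS")})
y_pred = automl.predict(X_test)
```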
@@ -1964,7 +2203,10 @@ def cv_score_agg_func(val_loss_folds, log_metrics_folds): split_ratio = split_ratio or self._settings.get("split_ratio") n_splits = n_splits or self._settings.get("n_splits") auto_augment = self._settings.get("auto_augment") if auto_augment is None else auto_augment - metric = metric or self._settings.get("metric") + allow_label_overlap = ( + self._settings.get("allow_label_overlap") if allow_label_overlap is None else allow_label_overlap + ) + metric = self._settings.get("metric") if metric is None else metric estimator_list = estimator_list or self._settings.get("estimator_list") log_file_name = self._settings.get("log_file_name") if log_file_name is None else log_file_name max_iter = self._settings.get("max_iter") if max_iter is None else max_iter @@ -2146,6 +2388,7 @@ def cv_score_agg_func(val_loss_folds, log_metrics_folds): self._retrain_in_budget = retrain_full == "budget" and (eval_method == "holdout" and self._state.X_val is None) self._auto_augment = auto_augment + self._allow_label_overlap = allow_label_overlap _sample_size_from_starting_points = {} if isinstance(starting_points, dict): @@ -2203,6 +2446,9 @@ def cv_score_agg_func(val_loss_folds, log_metrics_folds): and (self._min_sample_size * SAMPLE_MULTIPLY_FACTOR < self._state.data_size[0]) ) + # Validate metric parameter before processing + self._validate_metric_parameter(metric, allow_auto=True) + metric = task.default_metric(metric) self._state.metric = metric @@ -2849,7 +3095,7 @@ def _search_sequential(self): ) logger.info( - " at {:.1f}s,\testimator {}'s best error={:.4f},\tbest estimator {}'s best error={:.4f}".format( + " at {:.1f}s,\testimator {}'s best error={:.4e},\tbest estimator {}'s best error={:.4e}".format( self._state.time_from_start, estimator, search_state.best_loss, @@ -3026,6 +3272,10 @@ def _search(self): # the total degree of parallelization = parallelization degree per estimator * parallelization degree of ensemble ) if isinstance(self._ensemble, dict): + # Note: If a custom final_estimator is provided, it is used as-is without + # hyperparameter tuning. The user is responsible for setting appropriate + # parameters or using defaults. If not provided, the best model found + # during the search (self._trained_estimator) is used. final_estimator = self._ensemble.get("final_estimator", self._trained_estimator) passthrough = self._ensemble.get("passthrough", True) ensemble_n_jobs = self._ensemble.get("n_jobs", ensemble_n_jobs) diff --git a/flaml/automl/ml.py b/flaml/automl/ml.py index 9ca3ee7284..1bb33b17dc 100644 --- a/flaml/automl/ml.py +++ b/flaml/automl/ml.py @@ -311,14 +311,14 @@ def get_y_pred(estimator, X, eval_metric, task: Task): else: y_pred = estimator.predict(X) - if isinstance(y_pred, Series) or isinstance(y_pred, DataFrame): + if isinstance(y_pred, (Series, DataFrame)): y_pred = y_pred.values return y_pred def to_numpy(x): - if isinstance(x, Series or isinstance(x, DataFrame)): + if isinstance(x, (Series, DataFrame)): x = x.values else: x = np.ndarray(x) @@ -586,7 +586,7 @@ def _eval_estimator( # TODO: why are integer labels being cast to str in the first place? 
- if isinstance(val_pred_y, Series) or isinstance(val_pred_y, DataFrame) or isinstance(val_pred_y, np.ndarray): + if isinstance(val_pred_y, (Series, DataFrame, np.ndarray)): test = val_pred_y if isinstance(val_pred_y, np.ndarray) else val_pred_y.values if not np.issubdtype(test.dtype, np.number): # some NLP models return a list diff --git a/flaml/automl/model.py b/flaml/automl/model.py index 0c6c47cec8..be99ad8b34 100644 --- a/flaml/automl/model.py +++ b/flaml/automl/model.py @@ -295,6 +295,35 @@ def fit(self, X_train, y_train, budget=None, free_mem_ratio=0, **kwargs): train_time = self._fit(X_train, y_train, **kwargs) return train_time + def preprocess(self, X): + """Preprocess data using estimator-level preprocessing. + + This method applies estimator-specific preprocessing transformations to the input data. + This is the second level of preprocessing that should be applied after task-level + preprocessing (automl.preprocess()). Different estimator types may apply different + preprocessing steps (e.g., sparse matrix conversion, dataframe handling). + + Args: + X: A numpy array or a dataframe of featurized instances, shape n*m. + + Returns: + Preprocessed data ready for the estimator's predict/fit methods. + + Example: + ```python + automl = AutoML() + automl.fit(X_train, y_train, task="classification") + + # First apply task-level preprocessing + X_test_task = automl.preprocess(X_test) + + # Then apply estimator-level preprocessing + estimator = automl.model + X_test_estimator = estimator.preprocess(X_test_task) + ``` + """ + return self._preprocess(X) + def predict(self, X, **kwargs): """Predict label from features. diff --git a/flaml/automl/nlp/utils.py b/flaml/automl/nlp/utils.py index 603e23f676..5eb3066309 100644 --- a/flaml/automl/nlp/utils.py +++ b/flaml/automl/nlp/utils.py @@ -25,9 +25,7 @@ def load_default_huggingface_metric_for_task(task): def is_a_list_of_str(this_obj): - return (isinstance(this_obj, list) or isinstance(this_obj, np.ndarray)) and all( - isinstance(x, str) for x in this_obj - ) + return isinstance(this_obj, (list, np.ndarray)) and all(isinstance(x, str) for x in this_obj) def _clean_value(value: Any) -> str: diff --git a/flaml/automl/state.py b/flaml/automl/state.py index a5897f7234..e5469e1ebe 100644 --- a/flaml/automl/state.py +++ b/flaml/automl/state.py @@ -37,10 +37,9 @@ def valid_starting_point_one_dim(self, value_one_dim, domain_one_dim): if isinstance(domain_one_dim, sample.Domain): renamed_type = list(inspect.signature(domain_one_dim.is_valid).parameters.values())[0].annotation type_match = ( - renamed_type == Any + renamed_type is Any or isinstance(value_one_dim, renamed_type) - or isinstance(value_one_dim, int) - and renamed_type is float + or (renamed_type is float and isinstance(value_one_dim, int)) ) if not (type_match and domain_one_dim.is_valid(value_one_dim)): return False diff --git a/flaml/automl/task/generic_task.py b/flaml/automl/task/generic_task.py index 5b74a3d755..19f80a45ef 100644 --- a/flaml/automl/task/generic_task.py +++ b/flaml/automl/task/generic_task.py @@ -365,6 +365,465 @@ def _train_test_split(state, X, y, first=None, rest=None, split_ratio=0.2, strat X_train, X_val, y_train, y_val = GenericTask._split_pyspark(state, X, y, split_ratio, stratify) return X_train, X_val, y_train, y_val + def _handle_missing_labels_fast( + self, + state, + X_train, + X_val, + y_train, + y_val, + X_train_all, + y_train_all, + is_spark_dataframe, + data_is_df, + ): + """Handle missing labels by adding first instance to the set with missing label. 
+ + This is the faster version that may create some overlap but ensures all labels + are present in both sets. If a label is missing from train, it adds the first + instance to train. If a label is missing from val, it adds the first instance to val. + If no labels are missing, no instances are duplicated. + + Args: + state: The state object containing fit parameters + X_train, X_val: Training and validation features + y_train, y_val: Training and validation labels + X_train_all, y_train_all: Complete dataset + is_spark_dataframe: Whether data is pandas_on_spark + data_is_df: Whether data is DataFrame/Series + + Returns: + Tuple of (X_train, X_val, y_train, y_val) with missing labels added + """ + # Check which labels are present in train and val sets + if is_spark_dataframe: + label_set_train, _ = unique_pandas_on_spark(y_train) + label_set_val, _ = unique_pandas_on_spark(y_val) + label_set_all, first = unique_value_first_index(y_train_all) + else: + label_set_all, first = unique_value_first_index(y_train_all) + label_set_train = np.unique(y_train) + label_set_val = np.unique(y_val) + + # Find missing labels + missing_in_train = np.setdiff1d(label_set_all, label_set_train) + missing_in_val = np.setdiff1d(label_set_all, label_set_val) + + # Add first instance of missing labels to train set + if len(missing_in_train) > 0: + missing_train_indices = [] + for label in missing_in_train: + label_matches = np.where(label_set_all == label)[0] + if len(label_matches) > 0 and label_matches[0] < len(first): + missing_train_indices.append(first[label_matches[0]]) + + if len(missing_train_indices) > 0: + X_missing_train = ( + iloc_pandas_on_spark(X_train_all, missing_train_indices) + if is_spark_dataframe + else X_train_all.iloc[missing_train_indices] + if data_is_df + else X_train_all[missing_train_indices] + ) + y_missing_train = ( + iloc_pandas_on_spark(y_train_all, missing_train_indices) + if is_spark_dataframe + else y_train_all.iloc[missing_train_indices] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[missing_train_indices] + ) + X_train = concat(X_missing_train, X_train) + y_train = concat(y_missing_train, y_train) if data_is_df else np.concatenate([y_missing_train, y_train]) + + # Handle sample_weight if present + if "sample_weight" in state.fit_kwargs: + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and max(missing_train_indices) < len(sample_weight_source): + missing_weights = ( + sample_weight_source[missing_train_indices] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[missing_train_indices] + ) + state.fit_kwargs["sample_weight"] = concat(missing_weights, state.fit_kwargs["sample_weight"]) + + # Add first instance of missing labels to val set + if len(missing_in_val) > 0: + missing_val_indices = [] + for label in missing_in_val: + label_matches = np.where(label_set_all == label)[0] + if len(label_matches) > 0 and label_matches[0] < len(first): + missing_val_indices.append(first[label_matches[0]]) + + if len(missing_val_indices) > 0: + X_missing_val = ( + iloc_pandas_on_spark(X_train_all, missing_val_indices) + if is_spark_dataframe + else X_train_all.iloc[missing_val_indices] + if data_is_df + else X_train_all[missing_val_indices] + ) + y_missing_val = ( + iloc_pandas_on_spark(y_train_all, missing_val_indices) + if is_spark_dataframe + else y_train_all.iloc[missing_val_indices] + if 
isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[missing_val_indices] + ) + X_val = concat(X_missing_val, X_val) + y_val = concat(y_missing_val, y_val) if data_is_df else np.concatenate([y_missing_val, y_val]) + + # Handle sample_weight if present + if ( + "sample_weight" in state.fit_kwargs + and hasattr(state, "weight_val") + and state.weight_val is not None + ): + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and max(missing_val_indices) < len(sample_weight_source): + missing_weights = ( + sample_weight_source[missing_val_indices] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[missing_val_indices] + ) + state.weight_val = concat(missing_weights, state.weight_val) + + return X_train, X_val, y_train, y_val + + def _handle_missing_labels_no_overlap( + self, + state, + X_train, + X_val, + y_train, + y_val, + X_train_all, + y_train_all, + is_spark_dataframe, + data_is_df, + split_ratio, + ): + """Handle missing labels intelligently to avoid overlap when possible. + + This is the slower but more precise version that: + - For single-instance classes: Adds to both sets (unavoidable overlap) + - For multi-instance classes: Re-splits them properly to avoid overlap + + Args: + state: The state object containing fit parameters + X_train, X_val: Training and validation features + y_train, y_val: Training and validation labels + X_train_all, y_train_all: Complete dataset + is_spark_dataframe: Whether data is pandas_on_spark + data_is_df: Whether data is DataFrame/Series + split_ratio: The ratio for splitting + + Returns: + Tuple of (X_train, X_val, y_train, y_val) with missing labels handled + """ + # Check which labels are present in train and val sets + if is_spark_dataframe: + label_set_train, _ = unique_pandas_on_spark(y_train) + label_set_val, _ = unique_pandas_on_spark(y_val) + label_set_all, first = unique_value_first_index(y_train_all) + else: + label_set_all, first = unique_value_first_index(y_train_all) + label_set_train = np.unique(y_train) + label_set_val = np.unique(y_val) + + # Find missing labels + missing_in_train = np.setdiff1d(label_set_all, label_set_train) + missing_in_val = np.setdiff1d(label_set_all, label_set_val) + + # Handle missing labels intelligently + # For classes with only 1 instance: add to both sets (unavoidable overlap) + # For classes with multiple instances: move/split them properly to avoid overlap + + if len(missing_in_train) > 0: + # Process missing labels in training set + for label in missing_in_train: + # Find all indices for this label in the original data + if is_spark_dataframe: + label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist() + else: + label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist() + + num_instances = len(label_indices) + + if num_instances == 1: + # Single instance: must add to both train and val (unavoidable overlap) + X_single = ( + iloc_pandas_on_spark(X_train_all, label_indices) + if is_spark_dataframe + else X_train_all.iloc[label_indices] + if data_is_df + else X_train_all[label_indices] + ) + y_single = ( + iloc_pandas_on_spark(y_train_all, label_indices) + if is_spark_dataframe + else y_train_all.iloc[label_indices] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[label_indices] + ) + X_train = concat(X_single, X_train) + y_train = concat(y_single, y_train) if data_is_df else 
np.concatenate([y_single, y_train]) + + # Handle sample_weight + if "sample_weight" in state.fit_kwargs: + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and label_indices[0] < len(sample_weight_source): + single_weight = ( + sample_weight_source[label_indices] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[label_indices] + ) + state.fit_kwargs["sample_weight"] = concat(single_weight, state.fit_kwargs["sample_weight"]) + else: + # Multiple instances: move some from val to train (no overlap needed) + # Calculate how many to move to train (leave at least 1 in val) + num_to_train = max(1, min(num_instances - 1, int(num_instances * (1 - split_ratio)))) + indices_to_move = label_indices[:num_to_train] + + X_to_move = ( + iloc_pandas_on_spark(X_train_all, indices_to_move) + if is_spark_dataframe + else X_train_all.iloc[indices_to_move] + if data_is_df + else X_train_all[indices_to_move] + ) + y_to_move = ( + iloc_pandas_on_spark(y_train_all, indices_to_move) + if is_spark_dataframe + else y_train_all.iloc[indices_to_move] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[indices_to_move] + ) + + # Add to train + X_train = concat(X_to_move, X_train) + y_train = concat(y_to_move, y_train) if data_is_df else np.concatenate([y_to_move, y_train]) + + # Remove from val (they are currently all in val) + if is_spark_dataframe: + val_mask = ~y_val.isin([label]) + X_val = X_val[val_mask] + y_val = y_val[val_mask] + else: + val_mask = np.asarray(y_val) != label + if data_is_df: + X_val = X_val[val_mask] + y_val = y_val[val_mask] + else: + X_val = X_val[val_mask] + y_val = y_val[val_mask] + + # Add remaining instances back to val + remaining_indices = label_indices[num_to_train:] + if len(remaining_indices) > 0: + X_remaining = ( + iloc_pandas_on_spark(X_train_all, remaining_indices) + if is_spark_dataframe + else X_train_all.iloc[remaining_indices] + if data_is_df + else X_train_all[remaining_indices] + ) + y_remaining = ( + iloc_pandas_on_spark(y_train_all, remaining_indices) + if is_spark_dataframe + else y_train_all.iloc[remaining_indices] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[remaining_indices] + ) + X_val = concat(X_remaining, X_val) + y_val = concat(y_remaining, y_val) if data_is_df else np.concatenate([y_remaining, y_val]) + + # Handle sample_weight + if "sample_weight" in state.fit_kwargs: + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source): + weights_to_move = ( + sample_weight_source[indices_to_move] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[indices_to_move] + ) + state.fit_kwargs["sample_weight"] = concat( + weights_to_move, state.fit_kwargs["sample_weight"] + ) + + if ( + len(remaining_indices) > 0 + and hasattr(state, "weight_val") + and state.weight_val is not None + ): + # Remove and re-add weights for val + if isinstance(state.weight_val, np.ndarray): + state.weight_val = state.weight_val[val_mask] + else: + state.weight_val = state.weight_val[val_mask] + + if max(remaining_indices) < len(sample_weight_source): + remaining_weights = ( + sample_weight_source[remaining_indices] + if isinstance(sample_weight_source, np.ndarray) + else 
sample_weight_source.iloc[remaining_indices] + ) + state.weight_val = concat(remaining_weights, state.weight_val) + + if len(missing_in_val) > 0: + # Process missing labels in validation set + for label in missing_in_val: + # Find all indices for this label in the original data + if is_spark_dataframe: + label_indices = np.where(y_train_all.to_numpy() == label)[0].tolist() + else: + label_indices = np.where(np.asarray(y_train_all) == label)[0].tolist() + + num_instances = len(label_indices) + + if num_instances == 1: + # Single instance: must add to both train and val (unavoidable overlap) + X_single = ( + iloc_pandas_on_spark(X_train_all, label_indices) + if is_spark_dataframe + else X_train_all.iloc[label_indices] + if data_is_df + else X_train_all[label_indices] + ) + y_single = ( + iloc_pandas_on_spark(y_train_all, label_indices) + if is_spark_dataframe + else y_train_all.iloc[label_indices] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[label_indices] + ) + X_val = concat(X_single, X_val) + y_val = concat(y_single, y_val) if data_is_df else np.concatenate([y_single, y_val]) + + # Handle sample_weight + if "sample_weight" in state.fit_kwargs and hasattr(state, "weight_val"): + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and label_indices[0] < len(sample_weight_source): + single_weight = ( + sample_weight_source[label_indices] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[label_indices] + ) + if state.weight_val is not None: + state.weight_val = concat(single_weight, state.weight_val) + else: + # Multiple instances: move some from train to val (no overlap needed) + # Calculate how many to move to val (leave at least 1 in train) + num_to_val = max(1, min(num_instances - 1, int(num_instances * split_ratio))) + indices_to_move = label_indices[:num_to_val] + + X_to_move = ( + iloc_pandas_on_spark(X_train_all, indices_to_move) + if is_spark_dataframe + else X_train_all.iloc[indices_to_move] + if data_is_df + else X_train_all[indices_to_move] + ) + y_to_move = ( + iloc_pandas_on_spark(y_train_all, indices_to_move) + if is_spark_dataframe + else y_train_all.iloc[indices_to_move] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[indices_to_move] + ) + + # Add to val + X_val = concat(X_to_move, X_val) + y_val = concat(y_to_move, y_val) if data_is_df else np.concatenate([y_to_move, y_val]) + + # Remove from train (they are currently all in train) + if is_spark_dataframe: + train_mask = ~y_train.isin([label]) + X_train = X_train[train_mask] + y_train = y_train[train_mask] + else: + train_mask = np.asarray(y_train) != label + if data_is_df: + X_train = X_train[train_mask] + y_train = y_train[train_mask] + else: + X_train = X_train[train_mask] + y_train = y_train[train_mask] + + # Add remaining instances back to train + remaining_indices = label_indices[num_to_val:] + if len(remaining_indices) > 0: + X_remaining = ( + iloc_pandas_on_spark(X_train_all, remaining_indices) + if is_spark_dataframe + else X_train_all.iloc[remaining_indices] + if data_is_df + else X_train_all[remaining_indices] + ) + y_remaining = ( + iloc_pandas_on_spark(y_train_all, remaining_indices) + if is_spark_dataframe + else y_train_all.iloc[remaining_indices] + if isinstance(y_train_all, (pd.Series, psSeries)) + else y_train_all[remaining_indices] + ) + X_train = concat(X_remaining, X_train) + y_train = 
concat(y_remaining, y_train) if data_is_df else np.concatenate([y_remaining, y_train]) + + # Handle sample_weight + if "sample_weight" in state.fit_kwargs: + sample_weight_source = ( + state.sample_weight_all + if hasattr(state, "sample_weight_all") + else state.fit_kwargs.get("sample_weight") + ) + if sample_weight_source is not None and max(indices_to_move) < len(sample_weight_source): + weights_to_move = ( + sample_weight_source[indices_to_move] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[indices_to_move] + ) + if hasattr(state, "weight_val") and state.weight_val is not None: + state.weight_val = concat(weights_to_move, state.weight_val) + + if len(remaining_indices) > 0: + # Remove and re-add weights for train + if isinstance(state.fit_kwargs["sample_weight"], np.ndarray): + state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask] + else: + state.fit_kwargs["sample_weight"] = state.fit_kwargs["sample_weight"][train_mask] + + if max(remaining_indices) < len(sample_weight_source): + remaining_weights = ( + sample_weight_source[remaining_indices] + if isinstance(sample_weight_source, np.ndarray) + else sample_weight_source.iloc[remaining_indices] + ) + state.fit_kwargs["sample_weight"] = concat( + remaining_weights, state.fit_kwargs["sample_weight"] + ) + + return X_train, X_val, y_train, y_val + def prepare_data( self, state, @@ -377,6 +836,7 @@ def prepare_data( n_splits, data_is_df, sample_weight_full, + allow_label_overlap=True, ) -> int: X_val, y_val = state.X_val, state.y_val if issparse(X_val): @@ -505,59 +965,46 @@ def prepare_data( elif self.is_classification(): # for classification, make sure the labels are complete in both # training and validation data - label_set, first = unique_value_first_index(y_train_all) - rest = [] - last = 0 - first.sort() - for i in range(len(first)): - rest.extend(range(last, first[i])) - last = first[i] + 1 - rest.extend(range(last, len(y_train_all))) - X_first = X_train_all.iloc[first] if data_is_df else X_train_all[first] - if len(first) < len(y_train_all) / 2: - # Get X_rest and y_rest with drop, sparse matrix can't apply np.delete - X_rest = ( - np.delete(X_train_all, first, axis=0) - if isinstance(X_train_all, np.ndarray) - else X_train_all.drop(first.tolist()) - if data_is_df - else X_train_all[rest] - ) - y_rest = ( - np.delete(y_train_all, first, axis=0) - if isinstance(y_train_all, np.ndarray) - else y_train_all.drop(first.tolist()) - if data_is_df - else y_train_all[rest] + stratify = y_train_all if split_type == "stratified" else None + X_train, X_val, y_train, y_val = self._train_test_split( + state, X_train_all, y_train_all, split_ratio=split_ratio, stratify=stratify + ) + + # Handle missing labels using the appropriate strategy + if allow_label_overlap: + # Fast version: adds first instance to set with missing label (may create overlap) + X_train, X_val, y_train, y_val = self._handle_missing_labels_fast( + state, + X_train, + X_val, + y_train, + y_val, + X_train_all, + y_train_all, + is_spark_dataframe, + data_is_df, ) else: - X_rest = ( - iloc_pandas_on_spark(X_train_all, rest) - if is_spark_dataframe - else X_train_all.iloc[rest] - if data_is_df - else X_train_all[rest] - ) - y_rest = ( - iloc_pandas_on_spark(y_train_all, rest) - if is_spark_dataframe - else y_train_all.iloc[rest] - if data_is_df - else y_train_all[rest] + # Precise version: avoids overlap when possible (slower) + X_train, X_val, y_train, y_val = self._handle_missing_labels_no_overlap( + state, + 
X_train, + X_val, + y_train, + y_val, + X_train_all, + y_train_all, + is_spark_dataframe, + data_is_df, + split_ratio, ) - stratify = y_rest if split_type == "stratified" else None - X_train, X_val, y_train, y_val = self._train_test_split( - state, X_rest, y_rest, first, rest, split_ratio, stratify - ) - X_train = concat(X_first, X_train) - y_train = concat(label_set, y_train) if data_is_df else np.concatenate([label_set, y_train]) - X_val = concat(X_first, X_val) - y_val = concat(label_set, y_val) if data_is_df else np.concatenate([label_set, y_val]) if isinstance(y_train, (psDataFrame, pd.DataFrame)) and y_train.shape[1] == 1: y_train = y_train[y_train.columns[0]] y_val = y_val[y_val.columns[0]] - y_train.name = y_val.name = y_rest.name + # Only set name if y_train_all is a Series (not a DataFrame) + if isinstance(y_train_all, (pd.Series, psSeries)): + y_train.name = y_val.name = y_train_all.name elif self.is_regression(): X_train, X_val, y_train, y_val = self._train_test_split( diff --git a/flaml/automl/task/time_series_task.py b/flaml/automl/task/time_series_task.py index cd69577a32..9f16840891 100644 --- a/flaml/automl/task/time_series_task.py +++ b/flaml/automl/task/time_series_task.py @@ -386,9 +386,8 @@ def _preprocess(self, X, transformer=None): return X def preprocess(self, X, transformer=None): - if isinstance(X, pd.DataFrame) or isinstance(X, np.ndarray) or isinstance(X, pd.Series): - X = X.copy() - X = normalize_ts_data(X, self.target_names, self.time_col) + if isinstance(X, (pd.DataFrame, np.ndarray, pd.Series)): + X = normalize_ts_data(X.copy(), self.target_names, self.time_col) return self._preprocess(X, transformer) elif isinstance(X, int): return X diff --git a/flaml/automl/time_series/sklearn.py b/flaml/automl/time_series/sklearn.py index ebe18ed743..eb7172b908 100644 --- a/flaml/automl/time_series/sklearn.py +++ b/flaml/automl/time_series/sklearn.py @@ -17,24 +17,30 @@ class PD: def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int): - """Transform input data X, y into autoregressive form - shift - them appropriately based on horizon and create `lags` columns. + """Transform input data X, y into autoregressive form by creating `lags` columns. + + This function is called automatically by FLAML during the training process + to convert time series data into a format suitable for sklearn-based regression + models (e.g., lgbm, rf, xgboost). Users do NOT need to manually call this function + or create lagged features themselves. Parameters ---------- X : pandas.DataFrame - Input features. + Input feature DataFrame, which may contain temporal features and/or exogenous variables. y : array_like, (1d) - Target vector. + Target vector (time series values to forecast). - horizon : int - length of X for `predict` method + lags : int + Number of lagged time steps to use as features. Returns ------- pandas.DataFrame - shifted dataframe with `lags` columns + Shifted dataframe with `lags` columns for each original feature. + The target variable y is also lagged to prevent data leakage + (i.e., we use y(t-1), y(t-2), ..., y(t-lags) to predict y(t)). """ lag_features = [] @@ -55,6 +61,17 @@ def make_lag_features(X: pd.DataFrame, y: pd.Series, lags: int): class SklearnWrapper: + """Wrapper class for using sklearn-based models for time series forecasting. + + This wrapper automatically handles the transformation of time series data into + a supervised learning format by creating lagged features. It trains separate + models for each step in the forecast horizon. 
+ + Users typically don't interact with this class directly - it's used internally + by FLAML when sklearn-based estimators (lgbm, rf, xgboost, etc.) are selected + for time series forecasting tasks. + """ + def __init__( self, model_class: type, diff --git a/flaml/automl/time_series/ts_data.py b/flaml/automl/time_series/ts_data.py index 5e2b603681..625049c601 100644 --- a/flaml/automl/time_series/ts_data.py +++ b/flaml/automl/time_series/ts_data.py @@ -546,14 +546,12 @@ def normalize_ts_data(X_train_all, target_names, time_col, y_train_all=None): def validate_data_basic(X_train_all, y_train_all): - assert isinstance(X_train_all, np.ndarray) or issparse(X_train_all) or isinstance(X_train_all, pd.DataFrame), ( - "X_train_all must be a numpy array, a pandas dataframe, " "or Scipy sparse matrix." - ) + assert isinstance(X_train_all, (np.ndarray, DataFrame)) or issparse( + X_train_all + ), "X_train_all must be a numpy array, a pandas dataframe, or Scipy sparse matrix." - assert ( - isinstance(y_train_all, np.ndarray) - or isinstance(y_train_all, pd.Series) - or isinstance(y_train_all, pd.DataFrame) + assert isinstance( + y_train_all, (np.ndarray, pd.Series, pd.DataFrame) ), "y_train_all must be a numpy array or a pandas series or DataFrame." assert X_train_all.size != 0 and y_train_all.size != 0, "Input data must not be empty, use None if no data" diff --git a/flaml/default/estimator.py b/flaml/default/estimator.py index 5b46150f81..fcb318638e 100644 --- a/flaml/default/estimator.py +++ b/flaml/default/estimator.py @@ -95,6 +95,27 @@ def suggest_hyperparams(self, X, y): def fit(self, X, y, *args, **params): hyperparams, estimator_name, X, y_transformed = self.suggest_hyperparams(X, y) self.set_params(**hyperparams) + + # Transform eval_set if present + if "eval_set" in params and params["eval_set"] is not None: + transformed_eval_set = [] + for eval_X, eval_y in params["eval_set"]: + # Transform features + eval_X_transformed = self._feature_transformer.transform(eval_X) + # Transform labels if applicable + if self._label_transformer and estimator_name in [ + "rf", + "extra_tree", + "xgboost", + "xgb_limitdepth", + "choose_xgb", + ]: + eval_y_transformed = self._label_transformer.transform(eval_y) + transformed_eval_set.append((eval_X_transformed, eval_y_transformed)) + else: + transformed_eval_set.append((eval_X_transformed, eval_y)) + params["eval_set"] = transformed_eval_set + if self._label_transformer and estimator_name in [ "rf", "extra_tree", diff --git a/flaml/onlineml/README.md b/flaml/onlineml/README.md index 36926ba16a..0aa505e07a 100644 --- a/flaml/onlineml/README.md +++ b/flaml/onlineml/README.md @@ -1,6 +1,6 @@ # ChaCha for Online AutoML -FLAML includes *ChaCha* which is an automatic hyperparameter tuning solution for online machine learning. Online machine learning has the following properties: (1) data comes in sequential order; and (2) the performance of the machine learning model is evaluated online, i.e., at every iteration. *ChaCha* performs online AutoML respecting the aforementioned properties of online learning, and at the same time respecting the following constraints: (1) only a small constant number of 'live' models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a 'live' model with a new one, the replaced model can no longer be retrieved. +FLAML includes *ChaCha* which is an automatic hyperparameter tuning solution for online machine learning. 
Online machine learning has the following properties: (1) data comes in sequential order; and (2) the performance of the machine learning model is evaluated online, i.e., at every iteration. *ChaCha* performs online AutoML respecting the aforementioned properties of online learning, and at the same time respecting the following constraints: (1) only a small constant number of 'live' models are allowed to perform online learning at the same time; and (2) no model persistence or offline training is allowed, which means that once we decide to replace a 'live' model with a new one, the replaced model can no longer be retrieved. For more technical details about *ChaCha*, please check our paper. diff --git a/flaml/tune/searcher/blendsearch.py b/flaml/tune/searcher/blendsearch.py index c76a9a162b..6bdecb20dc 100644 --- a/flaml/tune/searcher/blendsearch.py +++ b/flaml/tune/searcher/blendsearch.py @@ -217,7 +217,24 @@ def __init__( if global_search_alg is not None: self._gs = global_search_alg elif getattr(self, "__name__", None) != "CFO": - if space and self._ls.hierarchical: + # Use define-by-run for OptunaSearch when needed: + # - Hierarchical/conditional spaces are best supported via define-by-run. + # - Ray Tune domain/grid specs can trigger an "unresolved search space" warning + # unless we switch to define-by-run. + use_define_by_run = bool(getattr(self._ls, "hierarchical", False)) + if (not use_define_by_run) and isinstance(space, dict) and space: + try: + from .variant_generator import parse_spec_vars + + _, domain_vars, grid_vars = parse_spec_vars(space) + use_define_by_run = bool(domain_vars or grid_vars) + except Exception: + # Be conservative: if we can't determine whether the space is + # unresolved, fall back to the original behavior. + use_define_by_run = False + + self._use_define_by_run = use_define_by_run + if use_define_by_run: from functools import partial gs_space = partial(define_by_run_func, space=space) @@ -487,7 +504,7 @@ def on_trial_complete(self, trial_id: str, result: Optional[Dict] = None, error: self._ls_bound_max, self._subspace.get(trial_id, self._ls.space), ) - if self._gs is not None and self._experimental and (not self._ls.hierarchical): + if self._gs is not None and self._experimental and (not getattr(self, "_use_define_by_run", False)): self._gs.add_evaluated_point(flatten_dict(config), objective) # TODO: recover when supported # converted = convert_key(config, self._gs.space) diff --git a/flaml/tune/searcher/flow2.py b/flaml/tune/searcher/flow2.py index 4764c80d66..7ccf6fa6bb 100644 --- a/flaml/tune/searcher/flow2.py +++ b/flaml/tune/searcher/flow2.py @@ -641,8 +641,10 @@ def config_signature(self, config, space: Dict = None) -> tuple: else: # key must be in space domain = space[key] - if self.hierarchical and not ( - domain is None or type(domain) in (str, int, float) or isinstance(domain, sample.Domain) + if ( + self.hierarchical + and domain is not None + and not isinstance(domain, (str, int, float, sample.Domain)) ): # not domain or hashable # get rid of list type for hierarchical search space. 
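For illustration, a minimal sketch of the decision the updated `BlendSearch.__init__` hunk above now makes when it receives a plain dict search space. It only relies on names that already appear in this patch (`parse_spec_vars` from `flaml.tune.searcher.variant_generator`, the `flaml.tune` domain helpers); the expected output is an assumption for a typical Ray-Tune-style space, not part of the patch itself.

```python
from flaml import tune
from flaml.tune.searcher.variant_generator import parse_spec_vars

# A plain dict whose values are tune domains is an "unresolved" Ray-Tune-style space.
space = {"lr": tune.uniform(1e-3, 1e-1), "depth": tune.randint(1, 10)}

# parse_spec_vars splits the spec into resolved vars, domain vars, and grid vars;
# domains such as tune.uniform / tune.randint land in domain_vars.
_, domain_vars, grid_vars = parse_spec_vars(space)

# Mirrors the new logic: any domain or grid vars => build the global searcher via
# define-by-run instead of handing Optuna the raw dict.
use_define_by_run = bool(domain_vars or grid_vars)
print(use_define_by_run)  # expected: True for this space
```

Under this assumption, such a space is routed through `define_by_run_func`, so Optuna never receives unresolved `Domain` objects and the "unresolved search space" warning (exercised by `test_unresolved_search_space` later in this patch) should not appear.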
diff --git a/flaml/tune/searcher/online_searcher.py b/flaml/tune/searcher/online_searcher.py index 2228752449..dfce8a75d7 100644 --- a/flaml/tune/searcher/online_searcher.py +++ b/flaml/tune/searcher/online_searcher.py @@ -207,7 +207,7 @@ def _query_config_oracle( hyperparameter_config_groups.append(partial_new_configs) # does not have searcher_trial_ids searcher_trial_ids_groups.append([]) - elif isinstance(config_domain, Float) or isinstance(config_domain, Categorical): + elif isinstance(config_domain, (Float, Categorical)): # otherwise we need to deal with them in group nonpoly_config[k] = v if k not in self._space_of_nonpoly_hp: diff --git a/flaml/tune/searcher/search_thread.py b/flaml/tune/searcher/search_thread.py index 5ab7846aa7..6b79abfb50 100644 --- a/flaml/tune/searcher/search_thread.py +++ b/flaml/tune/searcher/search_thread.py @@ -25,6 +25,31 @@ logger = logging.getLogger(__name__) +def _recursive_dict_update(target: Dict, source: Dict) -> None: + """Recursively update target dictionary with source dictionary. + + Unlike dict.update(), this function merges nested dictionaries instead of + replacing them entirely. This is crucial for configurations with nested + structures (e.g., XGBoost params). + + Args: + target: The dictionary to be updated (modified in place). + source: The dictionary containing values to merge into target. + + Example: + >>> target = {'params': {'eta': 0.1, 'max_depth': 3}} + >>> source = {'params': {'verbosity': 0}} + >>> _recursive_dict_update(target, source) + >>> target + {'params': {'eta': 0.1, 'max_depth': 3, 'verbosity': 0}} + """ + for key, value in source.items(): + if isinstance(value, dict) and key in target and isinstance(target[key], dict): + _recursive_dict_update(target[key], value) + else: + target[key] = value + + class SearchThread: """Class of global or local search thread.""" @@ -65,7 +90,7 @@ def suggest(self, trial_id: str) -> Optional[Dict]: try: config = self._search_alg.suggest(trial_id) if isinstance(self._search_alg._space, dict): - config.update(self._const) + _recursive_dict_update(config, self._const) else: # define by run config, self.space = unflatten_hierarchical(config, self._space) diff --git a/flaml/version.py b/flaml/version.py index 54499df347..50062f87c0 100644 --- a/flaml/version.py +++ b/flaml/version.py @@ -1 +1 @@ -__version__ = "2.4.1" +__version__ = "2.5.0" diff --git a/setup.py b/setup.py index 0906b9fc08..3a0527075b 100644 --- a/setup.py +++ b/setup.py @@ -52,8 +52,8 @@ ], "test": [ "numpy>=1.17,<2.0.0; python_version<'3.13'", - "numpy>2.0.0; python_version>='3.13'", - "jupyter; python_version<'3.13'", + "numpy>=1.17; python_version>='3.13'", + "jupyter", "lightgbm>=2.3.1", "xgboost>=0.90,<2.0.0; python_version<'3.11'", "xgboost>=2.0.0; python_version>='3.11'", @@ -68,10 +68,10 @@ "pre-commit", "torch", "torchvision", - "catboost>=0.26; python_version<'3.13'", + "catboost>=0.26", "rgf-python", "optuna>=2.8.0,<=3.6.1", - "openml; python_version<'3.13'", + "openml", "statsmodels>=0.12.2", "psutil", "dataclasses", @@ -82,7 +82,7 @@ "rouge_score", "hcrystalball", "seqeval", - "pytorch-forecasting; python_version<'3.13'", + "pytorch-forecasting", "mlflow-skinny<=2.22.1", # Refer to https://mvnrepository.com/artifact/org.mlflow/mlflow-spark "joblibspark>=0.5.0", "joblib<=1.3.2", @@ -140,7 +140,7 @@ "prophet>=1.1.5", "statsmodels>=0.12.2", "hcrystalball>=0.1.10", - "pytorch-forecasting>=0.10.4; python_version<'3.13'", + "pytorch-forecasting>=0.10.4", "pytorch-lightning>=1.9.0", "tensorboardX>=2.6", ], diff --git 
a/test/automl/test_custom_hp.py b/test/automl/test_custom_hp.py index b06ae9f2c7..fe846071f0 100644 --- a/test/automl/test_custom_hp.py +++ b/test/automl/test_custom_hp.py @@ -72,5 +72,39 @@ def test_custom_hp(): print(automl.best_config_per_estimator) +def test_lgbm_objective(): + """Test that objective parameter can be set via custom_hp for LGBMEstimator""" + import numpy as np + + # Create a simple regression dataset + np.random.seed(42) + X_train = np.random.rand(100, 5) + y_train = np.random.rand(100) * 100 # Scale to avoid division issues with MAPE + + automl = AutoML() + settings = { + "time_budget": 3, + "metric": "mape", + "task": "regression", + "estimator_list": ["lgbm"], + "verbose": 0, + "custom_hp": {"lgbm": {"objective": {"domain": "mape"}}}, # Fixed value, not tuned + } + + automl.fit(X_train, y_train, **settings) + + # Verify that objective was set correctly + assert "objective" in automl.best_config, "objective should be in best_config" + assert automl.best_config["objective"] == "mape", "objective should be 'mape'" + + # Verify the model has the correct objective + if hasattr(automl.model, "estimator") and hasattr(automl.model.estimator, "get_params"): + model_params = automl.model.estimator.get_params() + assert model_params.get("objective") == "mape", "Model should use 'mape' objective" + + print("Test passed: objective parameter works correctly with LGBMEstimator") + + if __name__ == "__main__": test_custom_hp() + test_lgbm_objective() diff --git a/test/automl/test_extra_models.py b/test/automl/test_extra_models.py index 651737a410..b860a9abea 100644 --- a/test/automl/test_extra_models.py +++ b/test/automl/test_extra_models.py @@ -188,7 +188,11 @@ def _test_sparse_matrix_classification(estimator): "n_jobs": 1, "model_history": True, } - X_train = scipy.sparse.random(1554, 21, dtype=int) + # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably + # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger + # integer sampling paths which raise "low is out of bounds for int32". + # A float sparse matrix is sufficient to validate sparse-input support. + X_train = scipy.sparse.random(1554, 21, dtype=np.float32) y_train = np.random.randint(3, size=1554) automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings) diff --git a/test/automl/test_multiclass.py b/test/automl/test_multiclass.py index 9be63cff60..12f8a8aa35 100644 --- a/test/automl/test_multiclass.py +++ b/test/automl/test_multiclass.py @@ -181,6 +181,49 @@ def test_ensemble(self): } automl.fit(X_train=X_train, y_train=y_train, **settings) + def test_ensemble_final_estimator_params_not_tuned(self): + """Test that final_estimator parameters in ensemble are not automatically tuned. + + This test verifies that when a custom final_estimator is provided with specific + parameters, those parameters are used as-is without any hyperparameter tuning. 
+ """ + from sklearn.linear_model import LogisticRegression + + automl = AutoML() + X_train, y_train = load_wine(return_X_y=True) + + # Create a LogisticRegression with specific non-default parameters + custom_params = { + "C": 0.5, # Non-default value + "max_iter": 50, # Non-default value + "random_state": 42, + } + final_est = LogisticRegression(**custom_params) + + settings = { + "time_budget": 5, + "estimator_list": ["rf", "lgbm"], + "task": "classification", + "ensemble": { + "final_estimator": final_est, + "passthrough": False, + }, + "n_jobs": 1, + } + automl.fit(X_train=X_train, y_train=y_train, **settings) + + # Verify that the final estimator in the stacker uses the exact parameters we specified + if hasattr(automl.model, "final_estimator_"): + # The model is a StackingClassifier + fitted_final_estimator = automl.model.final_estimator_ + assert ( + abs(fitted_final_estimator.C - custom_params["C"]) < 1e-9 + ), f"Expected C={custom_params['C']}, but got {fitted_final_estimator.C}" + assert ( + fitted_final_estimator.max_iter == custom_params["max_iter"] + ), f"Expected max_iter={custom_params['max_iter']}, but got {fitted_final_estimator.max_iter}" + print("✓ Final estimator parameters were preserved (not tuned)") + def test_dataframe(self): self.test_classification(True) @@ -235,6 +278,34 @@ def test_custom_metric(self): except ImportError: pass + def test_invalid_custom_metric(self): + """Test that proper error is raised when custom_metric is called instead of passed.""" + from sklearn.datasets import load_iris + + X_train, y_train = load_iris(return_X_y=True) + + # Test with non-callable metric in __init__ + with self.assertRaises(ValueError) as context: + automl = AutoML(metric=123) # passing an int instead of function + self.assertIn("must be either a string or a callable function", str(context.exception)) + self.assertIn("but got int", str(context.exception)) + + # Test with non-callable metric in fit + automl = AutoML() + with self.assertRaises(ValueError) as context: + automl.fit(X_train=X_train, y_train=y_train, metric=[], task="classification", time_budget=1) + self.assertIn("must be either a string or a callable function", str(context.exception)) + self.assertIn("but got list", str(context.exception)) + + # Test with tuple (simulating result of calling a function that returns tuple) + with self.assertRaises(ValueError) as context: + automl = AutoML() + automl.fit( + X_train=X_train, y_train=y_train, metric=(0.5, {"loss": 0.5}), task="classification", time_budget=1 + ) + self.assertIn("must be either a string or a callable function", str(context.exception)) + self.assertIn("but got tuple", str(context.exception)) + def test_classification(self, as_frame=False): automl_experiment = AutoML() automl_settings = { @@ -368,7 +439,11 @@ def test_sparse_matrix_classification(self): "n_jobs": 1, "model_history": True, } - X_train = scipy.sparse.random(1554, 21, dtype=int) + # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably + # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger + # integer sampling paths which raise "low is out of bounds for int32". + # A float sparse matrix is sufficient to validate sparse-input support. 
+ X_train = scipy.sparse.random(1554, 21, dtype=np.float32) y_train = np.random.randint(3, size=1554) automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings) print(automl_experiment.classes_) @@ -531,6 +606,32 @@ def test_fit_w_starting_points_list(self, as_frame=True, n_concurrent_trials=1): print(f"Best accuracy on validation data: {new_automl_val_accuracy:.4g}") # print('Training duration of best run: {0:.4g} s'.format(new_automl_experiment.best_config_train_time)) + def test_starting_points_should_improve_performance(self): + N = 10000 # a large N is needed to see the improvement + X_train, y_train = load_iris(return_X_y=True) + X_train = np.concatenate([X_train + 0.1 * i for i in range(N)], axis=0) + y_train = np.concatenate([y_train] * N, axis=0) + + am1 = AutoML() + am1.fit(X_train, y_train, estimator_list=["lgbm"], time_budget=3, seed=11) + + am2 = AutoML() + am2.fit( + X_train, + y_train, + estimator_list=["lgbm"], + time_budget=2, + seed=11, + starting_points=am1.best_config_per_estimator, + ) + + print(f"am1.best_loss: {am1.best_loss:.4f}") + print(f"am2.best_loss: {am2.best_loss:.4f}") + + assert np.round(am2.best_loss, 4) <= np.round( + am1.best_loss, 4 + ), "Starting points should help improve the performance!" + if __name__ == "__main__": unittest.main() diff --git a/test/automl/test_no_overlap.py b/test/automl/test_no_overlap.py new file mode 100644 index 0000000000..443d8b9980 --- /dev/null +++ b/test/automl/test_no_overlap.py @@ -0,0 +1,272 @@ +"""Test to ensure correct label overlap handling for classification tasks""" +import numpy as np +import pandas as pd +from sklearn.datasets import load_iris, make_classification + +from flaml import AutoML + + +def test_allow_label_overlap_true(): + """Test with allow_label_overlap=True (fast mode, default)""" + # Load iris dataset + dic_data = load_iris(as_frame=True) + iris_data = dic_data["frame"] + + # Prepare data + x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy() + y_train = iris_data["target"] + + # Train with fast mode (default) + automl = AutoML() + automl_settings = { + "max_iter": 5, + "metric": "accuracy", + "task": "classification", + "estimator_list": ["lgbm"], + "eval_method": "holdout", + "split_type": "stratified", + "keep_search_state": True, + "retrain_full": False, + "auto_augment": False, + "verbose": 0, + "allow_label_overlap": True, # Fast mode + } + automl.fit(x_train, y_train, **automl_settings) + + # Check results + input_size = len(x_train) + train_size = len(automl._state.X_train) + val_size = len(automl._state.X_val) + + # With stratified split on balanced data, fast mode may have no overlap + assert ( + train_size + val_size >= input_size + ), f"Inconsistent sizes. Input: {input_size}, Train: {train_size}, Val: {val_size}" + + # Verify all classes are represented in both sets + train_labels = set(np.unique(automl._state.y_train)) + val_labels = set(np.unique(automl._state.y_val)) + all_labels = set(np.unique(y_train)) + + assert train_labels == all_labels, f"Not all labels in train. All: {all_labels}, Train: {train_labels}" + assert val_labels == all_labels, f"Not all labels in val. 
All: {all_labels}, Val: {val_labels}" + + print( + f"✓ Test passed (fast mode): Input: {input_size}, Train: {train_size}, Val: {val_size}, " + f"Overlap: {train_size + val_size - input_size}" + ) + + +def test_allow_label_overlap_false(): + """Test with allow_label_overlap=False (precise mode)""" + # Load iris dataset + dic_data = load_iris(as_frame=True) + iris_data = dic_data["frame"] + + # Prepare data + x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy() + y_train = iris_data["target"] + + # Train with precise mode + automl = AutoML() + automl_settings = { + "max_iter": 5, + "metric": "accuracy", + "task": "classification", + "estimator_list": ["lgbm"], + "eval_method": "holdout", + "split_type": "stratified", + "keep_search_state": True, + "retrain_full": False, + "auto_augment": False, + "verbose": 0, + "allow_label_overlap": False, # Precise mode + } + automl.fit(x_train, y_train, **automl_settings) + + # Check that there's no overlap (or minimal overlap for single-instance classes) + input_size = len(x_train) + train_size = len(automl._state.X_train) + val_size = len(automl._state.X_val) + + # Verify all classes are represented + all_labels = set(np.unique(y_train)) + + # Should have no overlap or minimal overlap + overlap = train_size + val_size - input_size + assert overlap <= len(all_labels), f"Excessive overlap: {overlap}" + + # Verify all classes are represented + train_labels = set(np.unique(automl._state.y_train)) + val_labels = set(np.unique(automl._state.y_val)) + + combined_labels = train_labels.union(val_labels) + assert combined_labels == all_labels, f"Not all labels present. All: {all_labels}, Combined: {combined_labels}" + + print( + f"✓ Test passed (precise mode): Input: {input_size}, Train: {train_size}, Val: {val_size}, " + f"Overlap: {overlap}" + ) + + +def test_uniform_split_with_overlap_control(): + """Test with uniform split and both overlap modes""" + # Load iris dataset + dic_data = load_iris(as_frame=True) + iris_data = dic_data["frame"] + + # Prepare data + x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy() + y_train = iris_data["target"] + + # Test precise mode with uniform split + automl = AutoML() + automl_settings = { + "max_iter": 5, + "metric": "accuracy", + "task": "classification", + "estimator_list": ["lgbm"], + "eval_method": "holdout", + "split_type": "uniform", + "keep_search_state": True, + "retrain_full": False, + "auto_augment": False, + "verbose": 0, + "allow_label_overlap": False, # Precise mode + } + automl.fit(x_train, y_train, **automl_settings) + + input_size = len(x_train) + train_size = len(automl._state.X_train) + val_size = len(automl._state.X_val) + + # Verify all classes are represented + train_labels = set(np.unique(automl._state.y_train)) + val_labels = set(np.unique(automl._state.y_val)) + all_labels = set(np.unique(y_train)) + + combined_labels = train_labels.union(val_labels) + assert combined_labels == all_labels, "Not all labels present with uniform split" + + print(f"✓ Test passed (uniform split): Input: {input_size}, Train: {train_size}, Val: {val_size}") + + +def test_with_sample_weights(): + """Test label overlap handling with sample weights""" + # Create a simple dataset + X, y = make_classification( + n_samples=200, + n_features=10, + n_informative=5, + n_redundant=2, + n_classes=3, + n_clusters_per_class=1, + random_state=42, + ) + + # Create sample weights (giving more weight to some 
samples) + sample_weight = np.random.uniform(0.5, 2.0, size=len(y)) + + # Test fast mode with sample weights + automl_fast = AutoML() + automl_fast.fit( + X, + y, + task="classification", + metric="accuracy", + estimator_list=["lgbm"], + eval_method="holdout", + split_type="stratified", + max_iter=3, + keep_search_state=True, + retrain_full=False, + auto_augment=False, + verbose=0, + allow_label_overlap=True, # Fast mode + sample_weight=sample_weight, + ) + + # Verify all labels present + train_labels_fast = set(np.unique(automl_fast._state.y_train)) + val_labels_fast = set(np.unique(automl_fast._state.y_val)) + all_labels = set(np.unique(y)) + + assert train_labels_fast == all_labels, "Not all labels in train (fast mode with weights)" + assert val_labels_fast == all_labels, "Not all labels in val (fast mode with weights)" + + # Test precise mode with sample weights + automl_precise = AutoML() + automl_precise.fit( + X, + y, + task="classification", + metric="accuracy", + estimator_list=["lgbm"], + eval_method="holdout", + split_type="stratified", + max_iter=3, + keep_search_state=True, + retrain_full=False, + auto_augment=False, + verbose=0, + allow_label_overlap=False, # Precise mode + sample_weight=sample_weight, + ) + + # Verify all labels present + train_labels_precise = set(np.unique(automl_precise._state.y_train)) + val_labels_precise = set(np.unique(automl_precise._state.y_val)) + + combined_labels = train_labels_precise.union(val_labels_precise) + assert combined_labels == all_labels, "Not all labels present (precise mode with weights)" + + print("✓ Test passed with sample weights (fast and precise modes)") + + +def test_single_instance_class(): + """Test handling of single-instance classes""" + # Create imbalanced dataset where one class has only 1 instance + X = np.random.randn(50, 4) + y = np.array([0] * 40 + [1] * 9 + [2] * 1) # Class 2 has only 1 instance + + # Test precise mode - should add single instance to both sets + automl = AutoML() + automl.fit( + X, + y, + task="classification", + metric="accuracy", + estimator_list=["lgbm"], + eval_method="holdout", + split_type="uniform", + max_iter=3, + keep_search_state=True, + retrain_full=False, + auto_augment=False, + verbose=0, + allow_label_overlap=False, # Precise mode + ) + + # Verify all labels present + train_labels = set(np.unique(automl._state.y_train)) + val_labels = set(np.unique(automl._state.y_val)) + all_labels = set(np.unique(y)) + + # Single-instance class should be in both sets + combined_labels = train_labels.union(val_labels) + assert combined_labels == all_labels, "Not all labels present with single-instance class" + + # Check that single-instance class (label 2) is in both sets + assert 2 in train_labels, "Single-instance class not in train" + assert 2 in val_labels, "Single-instance class not in val" + + print("✓ Test passed with single-instance class") + + +if __name__ == "__main__": + test_allow_label_overlap_true() + test_allow_label_overlap_false() + test_uniform_split_with_overlap_control() + test_with_sample_weights() + test_single_instance_class() + print("\n✓ All tests passed!") diff --git a/test/automl/test_preprocess_api.py b/test/automl/test_preprocess_api.py new file mode 100644 index 0000000000..45b9c6143b --- /dev/null +++ b/test/automl/test_preprocess_api.py @@ -0,0 +1,236 @@ +"""Tests for the public preprocessor APIs.""" +import unittest + +import numpy as np +import pandas as pd +from sklearn.datasets import load_breast_cancer, load_diabetes + +from flaml import AutoML + + +class 
TestPreprocessAPI(unittest.TestCase): + """Test cases for the public preprocess() API methods.""" + + def test_automl_preprocess_before_fit(self): + """Test that calling preprocess before fit raises an error.""" + automl = AutoML() + X_test = np.array([[1, 2, 3], [4, 5, 6]]) + + with self.assertRaises(AttributeError) as context: + automl.preprocess(X_test) + # Check that an error is raised about not being fitted + self.assertIn("fit()", str(context.exception)) + + def test_automl_preprocess_classification(self): + """Test task-level preprocessing for classification.""" + # Load dataset + X, y = load_breast_cancer(return_X_y=True) + X_train, y_train = X[:400], y[:400] + X_test = X[400:450] + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "classification", + "metric": "accuracy", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Test task-level preprocessing + X_preprocessed = automl.preprocess(X_test) + + # Verify the output is not None and has the right shape + self.assertIsNotNone(X_preprocessed) + self.assertEqual(X_preprocessed.shape[0], X_test.shape[0]) + + def test_automl_preprocess_regression(self): + """Test task-level preprocessing for regression.""" + # Load dataset + X, y = load_diabetes(return_X_y=True) + X_train, y_train = X[:300], y[:300] + X_test = X[300:350] + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "regression", + "metric": "r2", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Test task-level preprocessing + X_preprocessed = automl.preprocess(X_test) + + # Verify the output + self.assertIsNotNone(X_preprocessed) + self.assertEqual(X_preprocessed.shape[0], X_test.shape[0]) + + def test_automl_preprocess_with_dataframe(self): + """Test task-level preprocessing with pandas DataFrame.""" + # Create a simple dataset + X_train = pd.DataFrame( + { + "feature1": [1, 2, 3, 4, 5] * 20, + "feature2": [5, 4, 3, 2, 1] * 20, + "category": ["a", "b", "a", "b", "a"] * 20, + } + ) + y_train = pd.Series([0, 1, 0, 1, 0] * 20) + + X_test = pd.DataFrame( + { + "feature1": [6, 7, 8], + "feature2": [1, 2, 3], + "category": ["a", "b", "a"], + } + ) + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "classification", + "metric": "accuracy", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Test preprocessing + X_preprocessed = automl.preprocess(X_test) + + # Verify the output - check the number of rows matches + self.assertIsNotNone(X_preprocessed) + preprocessed_len = len(X_preprocessed) if hasattr(X_preprocessed, "__len__") else X_preprocessed.shape[0] + self.assertEqual(preprocessed_len, len(X_test)) + + def test_estimator_preprocess(self): + """Test estimator-level preprocessing.""" + # Load dataset + X, y = load_breast_cancer(return_X_y=True) + X_train, y_train = X[:400], y[:400] + X_test = X[400:450] + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "classification", + "metric": "accuracy", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Get the trained estimator + estimator = automl.model + self.assertIsNotNone(estimator) + + # First apply task-level preprocessing + X_task_preprocessed = automl.preprocess(X_test) + + # Then apply estimator-level preprocessing + X_estimator_preprocessed = 
estimator.preprocess(X_task_preprocessed) + + # Verify the output + self.assertIsNotNone(X_estimator_preprocessed) + self.assertEqual(X_estimator_preprocessed.shape[0], X_test.shape[0]) + + def test_preprocess_pipeline(self): + """Test the complete preprocessing pipeline (task-level then estimator-level).""" + # Load dataset + X, y = load_breast_cancer(return_X_y=True) + X_train, y_train = X[:400], y[:400] + X_test = X[400:450] + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "classification", + "metric": "accuracy", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Apply the complete preprocessing pipeline + X_task_preprocessed = automl.preprocess(X_test) + X_final = automl.model.preprocess(X_task_preprocessed) + + # Verify predictions work with preprocessed data + # The internal predict already does this preprocessing, + # but we verify our manual preprocessing gives consistent results + y_pred_manual = automl.model._model.predict(X_final) + y_pred_auto = automl.predict(X_test) + + # Both should give the same predictions + np.testing.assert_array_equal(y_pred_manual, y_pred_auto) + + def test_preprocess_with_mixed_types(self): + """Test preprocessing with mixed data types.""" + # Create dataset with mixed types + X_train = pd.DataFrame( + { + "numeric1": np.random.rand(100), + "numeric2": np.random.randint(0, 100, 100), + "categorical": np.random.choice(["cat", "dog", "bird"], 100), + "boolean": np.random.choice([True, False], 100), + } + ) + y_train = pd.Series(np.random.randint(0, 2, 100)) + + X_test = pd.DataFrame( + { + "numeric1": np.random.rand(10), + "numeric2": np.random.randint(0, 100, 10), + "categorical": np.random.choice(["cat", "dog", "bird"], 10), + "boolean": np.random.choice([True, False], 10), + } + ) + + # Train AutoML + automl = AutoML() + automl_settings = { + "max_iter": 5, + "task": "classification", + "metric": "accuracy", + "estimator_list": ["lgbm"], + "verbose": 0, + } + automl.fit(X_train, y_train, **automl_settings) + + # Test preprocessing + X_preprocessed = automl.preprocess(X_test) + + # Verify the output + self.assertIsNotNone(X_preprocessed) + + def test_estimator_preprocess_without_automl(self): + """Test that estimator.preprocess() can be used independently.""" + from flaml.automl.model import LGBMEstimator + + # Create a simple estimator + X_train = np.random.rand(100, 5) + y_train = np.random.randint(0, 2, 100) + + estimator = LGBMEstimator(task="classification") + estimator.fit(X_train, y_train) + + # Test preprocessing + X_test = np.random.rand(10, 5) + X_preprocessed = estimator.preprocess(X_test) + + # Verify the output + self.assertIsNotNone(X_preprocessed) + self.assertEqual(X_preprocessed.shape, X_test.shape) + + +if __name__ == "__main__": + unittest.main() diff --git a/test/default/test_defaults.py b/test/default/test_defaults.py index acf50e4ea9..04f0fb70bb 100644 --- a/test/default/test_defaults.py +++ b/test/default/test_defaults.py @@ -183,6 +183,8 @@ def test_lgbm(): def test_xgboost(): + import numpy as np + from flaml.default import XGBClassifier, XGBRegressor X_train, y_train = load_breast_cancer(return_X_y=True, as_frame=True) @@ -200,6 +202,65 @@ def test_xgboost(): regressor.predict(X_train) print(regressor) + # Test eval_set with categorical features (Issue: eval_set not preprocessed) + np.random.seed(42) + n = 500 + df = pd.DataFrame( + { + "num1": np.random.randn(n), + "num2": np.random.rand(n) * 10, + "cat1": np.random.choice(["A", 
"B", "C"], size=n), + "cat2": np.random.choice(["X", "Y"], size=n), + "target": np.random.choice([0, 1], size=n), + } + ) + + X = df.drop(columns="target") + y = df["target"] + + X_train_cat, X_valid_cat, y_train_cat, y_valid_cat = train_test_split(X, y, test_size=0.2, random_state=0) + + # Convert categorical columns to pandas 'category' dtype + for col in X_train_cat.select_dtypes(include="object").columns: + X_train_cat[col] = X_train_cat[col].astype("category") + X_valid_cat[col] = X_valid_cat[col].astype("category") + + # Test XGBClassifier with eval_set + classifier_eval = XGBClassifier( + tree_method="hist", + enable_categorical=True, + eval_metric="logloss", + use_label_encoder=False, + early_stopping_rounds=10, + random_state=0, + n_estimators=10, + ) + classifier_eval.fit(X_train_cat, y_train_cat, eval_set=[(X_valid_cat, y_valid_cat)], verbose=False) + y_pred = classifier_eval.predict(X_valid_cat) + assert len(y_pred) == len(y_valid_cat) + + # Test XGBRegressor with eval_set + y_reg = df["num1"] # Use num1 as target for regression + X_reg = df.drop(columns=["num1", "target"]) + + X_train_reg, X_valid_reg, y_train_reg, y_valid_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=0) + + for col in X_train_reg.select_dtypes(include="object").columns: + X_train_reg[col] = X_train_reg[col].astype("category") + X_valid_reg[col] = X_valid_reg[col].astype("category") + + regressor_eval = XGBRegressor( + tree_method="hist", + enable_categorical=True, + eval_metric="rmse", + early_stopping_rounds=10, + random_state=0, + n_estimators=10, + ) + regressor_eval.fit(X_train_reg, y_train_reg, eval_set=[(X_valid_reg, y_valid_reg)], verbose=False) + y_pred = regressor_eval.predict(X_valid_reg) + assert len(y_pred) == len(y_valid_reg) + def test_nobudget(): X_train, y_train = load_breast_cancer(return_X_y=True, as_frame=True) diff --git a/test/spark/test_multiclass.py b/test/spark/test_multiclass.py index 45f6e5b45a..c9da982449 100644 --- a/test/spark/test_multiclass.py +++ b/test/spark/test_multiclass.py @@ -262,7 +262,11 @@ def test_sparse_matrix_classification(self): "n_concurrent_trials": 2, "use_spark": True, } - X_train = scipy.sparse.random(1554, 21, dtype=int) + # NOTE: Avoid `dtype=int` here. On some NumPy/SciPy combinations (notably + # Windows + Python 3.13), `scipy.sparse.random(..., dtype=int)` may trigger + # integer sampling paths which raise "low is out of bounds for int32". + # A float sparse matrix is sufficient to validate sparse-input support. 
+ X_train = scipy.sparse.random(1554, 21, dtype=np.float32) y_train = np.random.randint(3, size=1554) automl_experiment.fit(X_train=X_train, y_train=y_train, **automl_settings) print(automl_experiment.classes_) diff --git a/test/tune/test_search_thread.py b/test/tune/test_search_thread.py new file mode 100644 index 0000000000..4ca9f5db76 --- /dev/null +++ b/test/tune/test_search_thread.py @@ -0,0 +1,99 @@ +"""Tests for SearchThread nested dictionary update fix.""" + +import pytest + +from flaml.tune.searcher.search_thread import _recursive_dict_update + + +def test_recursive_dict_update_simple(): + """Test simple non-nested dictionary update.""" + target = {"a": 1, "b": 2} + source = {"c": 3} + _recursive_dict_update(target, source) + assert target == {"a": 1, "b": 2, "c": 3} + + +def test_recursive_dict_update_override(): + """Test that source values override target values for non-dict values.""" + target = {"a": 1, "b": 2} + source = {"b": 3} + _recursive_dict_update(target, source) + assert target == {"a": 1, "b": 3} + + +def test_recursive_dict_update_nested(): + """Test nested dictionary merge (the main use case for XGBoost params).""" + target = { + "num_boost_round": 10, + "params": { + "max_depth": 12, + "eta": 0.020168455186106736, + "min_child_weight": 1.4504723523894132, + "scale_pos_weight": 3.794258636185337, + "gamma": 0.4985070123025904, + }, + } + source = { + "params": { + "verbosity": 3, + "booster": "gbtree", + "eval_metric": "auc", + "tree_method": "hist", + "objective": "binary:logistic", + } + } + _recursive_dict_update(target, source) + + # Check that sampled params are preserved + assert target["params"]["max_depth"] == 12 + assert target["params"]["eta"] == 0.020168455186106736 + assert target["params"]["min_child_weight"] == 1.4504723523894132 + assert target["params"]["scale_pos_weight"] == 3.794258636185337 + assert target["params"]["gamma"] == 0.4985070123025904 + + # Check that const params are added + assert target["params"]["verbosity"] == 3 + assert target["params"]["booster"] == "gbtree" + assert target["params"]["eval_metric"] == "auc" + assert target["params"]["tree_method"] == "hist" + assert target["params"]["objective"] == "binary:logistic" + + # Check top-level param is preserved + assert target["num_boost_round"] == 10 + + +def test_recursive_dict_update_deeply_nested(): + """Test deeply nested dictionary merge.""" + target = {"a": {"b": {"c": 1, "d": 2}}} + source = {"a": {"b": {"e": 3}}} + _recursive_dict_update(target, source) + assert target == {"a": {"b": {"c": 1, "d": 2, "e": 3}}} + + +def test_recursive_dict_update_mixed_types(): + """Test that non-dict values in source replace dict values in target.""" + target = {"a": {"b": 1}} + source = {"a": 2} + _recursive_dict_update(target, source) + assert target == {"a": 2} + + +def test_recursive_dict_update_empty_dicts(): + """Test with empty dictionaries.""" + target = {} + source = {"a": 1} + _recursive_dict_update(target, source) + assert target == {"a": 1} + + target = {"a": 1} + source = {} + _recursive_dict_update(target, source) + assert target == {"a": 1} + + +def test_recursive_dict_update_none_values(): + """Test that None values are properly handled.""" + target = {"a": 1, "b": None} + source = {"b": 2, "c": None} + _recursive_dict_update(target, source) + assert target == {"a": 1, "b": 2, "c": None} diff --git a/test/tune/test_searcher.py b/test/tune/test_searcher.py index 4e17594430..9931047e6f 100644 --- a/test/tune/test_searcher.py +++ b/test/tune/test_searcher.py @@ -324,3 +324,26 
@@ def test_no_optuna(): import flaml.tune.searcher.suggestion subprocess.check_call([sys.executable, "-m", "pip", "install", "optuna==2.8.0"]) + + +def test_unresolved_search_space(caplog): + import logging + + from flaml import tune + from flaml.tune.searcher.blendsearch import BlendSearch + + if caplog is not None: + caplog.set_level(logging.INFO) + + BlendSearch(metric="loss", mode="min", space={"lr": tune.uniform(0.001, 0.1), "depth": tune.randint(1, 10)}) + try: + text = caplog.text + except AttributeError: + text = "" + assert ( + "unresolved search space" not in text and text + ), "BlendSearch should not produce warning about unresolved search space" + + +if __name__ == "__main__": + test_unresolved_search_space(None) diff --git a/tutorials/flaml-tutorial-automl-24.md b/tutorials/flaml-tutorial-automl-24.md index c20954b36d..2c1337a146 100644 --- a/tutorials/flaml-tutorial-automl-24.md +++ b/tutorials/flaml-tutorial-automl-24.md @@ -4,7 +4,7 @@ **Date and Time**: 09.09.2024, 15:30-17:00 -Location: Sorbonne University, 4 place Jussieu, 75005 Paris +Location: Sorbonne University, 4 place Jussieu, 75005 Paris Duration: 1.5 hours diff --git a/tutorials/flaml-tutorial-pydata-23.md b/tutorials/flaml-tutorial-pydata-23.md index b93147026e..c85a00a0d5 100644 --- a/tutorials/flaml-tutorial-pydata-23.md +++ b/tutorials/flaml-tutorial-pydata-23.md @@ -4,7 +4,7 @@ **Date and Time**: 04-26, 09:00–10:30 PT. -Location: Microsoft Conference Center, Seattle, WA. +Location: Microsoft Conference Center, Seattle, WA. Duration: 1.5 hours diff --git a/website/docs/Best-Practices.md b/website/docs/Best-Practices.md index 78c057fd51..70c8d75610 100644 --- a/website/docs/Best-Practices.md +++ b/website/docs/Best-Practices.md @@ -1,4 +1,3 @@ -````markdown # Best Practices This page collects practical guidance for using FLAML effectively across common tasks. @@ -16,7 +15,10 @@ from flaml.automl.task.factory import task_factory automl = AutoML() print("Built-in sklearn metrics:", sorted(automl.supported_metrics[0])) -print("classification estimators:", sorted(task_factory("classification").estimators.keys())) +print( + "classification estimators:", + sorted(task_factory("classification").estimators.keys()), +) ``` ## Classification @@ -26,6 +28,35 @@ print("classification estimators:", sorted(task_factory("classification").estima - pass `sample_weight` to `AutoML.fit()`; - consider setting class weights via `custom_hp` / `fit_kwargs_by_estimator` for specific estimators (see [FAQ](FAQ)). - **Probability vs label metrics**: use `roc_auc` / `log_loss` when you care about calibrated probabilities. +- **Label overlap control** (holdout evaluation only): + - By default, FLAML uses a fast strategy (`allow_label_overlap=True`) that ensures all labels are present in both training and validation sets by adding missing labels' first instances to both sets. This is efficient but may create minor overlap. + - For strict no-overlap validation, use `allow_label_overlap=False`. This slower but more precise strategy intelligently re-splits multi-instance classes to avoid overlap while maintaining label completeness. 
+ +```python +from flaml import AutoML + +# Fast version (default): allows overlap for efficiency +automl_fast = AutoML() +automl_fast.fit( + X_train, + y_train, + task="classification", + eval_method="holdout", + allow_label_overlap=True, +) # default + +# Precise version: avoids overlap when possible +automl_precise = AutoML() +automl_precise.fit( + X_train, + y_train, + task="classification", + eval_method="holdout", + allow_label_overlap=False, +) # slower but more precise +``` + +Note: This only affects holdout evaluation. CV and custom validation sets are unaffected. ## Regression @@ -77,7 +108,6 @@ from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from flaml import AutoML - X, y = load_iris(return_X_y=True, as_frame=True) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 @@ -86,7 +116,7 @@ X_train, X_test, y_train, y_test = train_test_split( automl = AutoML() mlflow.set_experiment("flaml") with mlflow.start_run(run_name="flaml_run") as run: - automl.fit(X_train, y_train, task="classification", time_budget=3, retrain_full=False, eval_method="holdout") + automl.fit(X_train, y_train, task="classification", time_budget=3) run_id = run.info.run_id @@ -95,11 +125,11 @@ automl2 = mlflow.sklearn.load_model(f"runs:/{run_id}/model") assert np.array_equal(automl2.predict(X_test), automl.predict(X_test)) ``` -### Option 2: Pickle the full `AutoML` instance (convenient / Fabric) +### Option 2: Pickle the full `AutoML` instance (convenient) Pickling stores the *entire* `AutoML` instance (not just the best estimator). This is useful when you prefer not to rely on MLflow or when you want to reuse additional attributes of the AutoML object without retraining. -In Microsoft Fabric scenarios, this is particularly important for re-plotting visualization figures without requiring model retraining. +In Microsoft Fabric scenarios, these additional attributes are particularly important for re-plotting visualization figures without requiring model retraining. ```python import mlflow @@ -108,7 +138,6 @@ from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from flaml import AutoML - X, y = load_iris(return_X_y=True, as_frame=True) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 @@ -117,7 +146,7 @@ X_train, X_test, y_train, y_test = train_test_split( automl = AutoML() mlflow.set_experiment("flaml") with mlflow.start_run(run_name="flaml_run") as run: - automl.fit(X_train, y_train, task="classification", time_budget=3, retrain_full=False, eval_method="holdout") + automl.fit(X_train, y_train, task="classification", time_budget=3) automl.pickle("automl.pkl") automl2 = AutoML.load_pickle("automl.pkl") @@ -128,5 +157,3 @@ assert automl.mlflow_integration.infos == automl2.mlflow_integration.infos ``` See also: [Task-Oriented AutoML](Use-Cases/Task-Oriented-AutoML) and [FAQ](FAQ). - -```` diff --git a/website/docs/Contribute.md b/website/docs/Contribute.md index fd0ef194c4..0a390b31a9 100644 --- a/website/docs/Contribute.md +++ b/website/docs/Contribute.md @@ -49,7 +49,7 @@ print(flaml.__version__) ``` - Please ensure all **code snippets and error messages are formatted in - appropriate code blocks**. 
See [Creating and highlighting code blocks](https://help.github.com/articles/creating-and-highlighting-code-blocks) for more details. ## Becoming a Reviewer @@ -88,7 +88,7 @@ Run `pre-commit install` to install pre-commit into your git hooks. Before you c ### Coverage -Any code you commit should not decrease coverage. To run all unit tests, install the \[test\] option under FLAML/: +Any code you commit should not decrease coverage. To run all unit tests, install the [test] option under FLAML/: ```bash pip install -e."[test]" diff --git a/website/docs/Examples/AutoML-Classification.md b/website/docs/Examples/AutoML-Classification.md index ab0ca32bee..b47acb4884 100644 --- a/website/docs/Examples/AutoML-Classification.md +++ b/website/docs/Examples/AutoML-Classification.md @@ -2,7 +2,7 @@ ### Prerequisites -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl]" diff --git a/website/docs/Examples/AutoML-NLP.md b/website/docs/Examples/AutoML-NLP.md index 6532985bb8..4c17a428fa 100644 --- a/website/docs/Examples/AutoML-NLP.md +++ b/website/docs/Examples/AutoML-NLP.md @@ -2,7 +2,7 @@ ### Requirements -This example requires GPU. Install the \[automl,hf\] option: +This example requires GPU. Install the [automl,hf] option: ```python pip install "flaml[automl,hf]" diff --git a/website/docs/Examples/AutoML-Rank.md b/website/docs/Examples/AutoML-Rank.md index c3702e004a..868e445e23 100644 --- a/website/docs/Examples/AutoML-Rank.md +++ b/website/docs/Examples/AutoML-Rank.md @@ -2,7 +2,7 @@ ### Prerequisites -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl]" diff --git a/website/docs/Examples/AutoML-Regression.md b/website/docs/Examples/AutoML-Regression.md index 84ca18f88c..dd68b4b1f9 100644 --- a/website/docs/Examples/AutoML-Regression.md +++ b/website/docs/Examples/AutoML-Regression.md @@ -2,7 +2,7 @@ ### Prerequisites -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl]" diff --git a/website/docs/Examples/AutoML-Time series forecast.md b/website/docs/Examples/AutoML-Time series forecast.md index 61ef7cf602..9e28f1004e 100644 --- a/website/docs/Examples/AutoML-Time series forecast.md +++ b/website/docs/Examples/AutoML-Time series forecast.md @@ -2,12 +2,31 @@ ### Prerequisites -Install the \[automl,ts_forecast\] option. +Install the [automl,ts_forecast] option. ```bash pip install "flaml[automl,ts_forecast]" ``` +### Understanding the `period` Parameter + +The `period` parameter (also called **horizon** in the code) specifies the **forecast horizon** - the number of future time steps the model is trained to predict. For example: + +- `period=12` means you want to forecast 12 time steps ahead (e.g., 12 months, 12 days) +- `period=7` means you want to forecast 7 time steps ahead + +**Important Note on Prediction**: During the prediction stage, the output length equals the length of `X_test`. This means you can generate predictions for any number of time steps by providing the corresponding timestamps in `X_test`, regardless of the `period` value used during training. + +#### Automatic Feature Engineering + +**Important**: You do NOT need to manually lag the target variable before training. FLAML handles this automatically: + +- **For sklearn-based models** (lgbm, rf, xgboost, extra_tree, catboost): FLAML automatically creates lagged features of both the target variable and any exogenous variables. 
This transforms the time series forecasting problem into a supervised learning regression problem. + +- **For time series native models** (prophet, arima, sarimax, holt-winters): These models have built-in time series forecasting capabilities and handle temporal dependencies natively. + +The automatic lagging is implemented internally when you call `automl.fit()` with `task="ts_forecast"` or `task="ts_forecast_classification"`, so you can focus on providing clean input data without worrying about feature engineering. + ### Simple NumPy Example ```python diff --git a/website/docs/Examples/AutoML-for-LightGBM.md b/website/docs/Examples/AutoML-for-LightGBM.md index e26918d328..ce5fb32efe 100644 --- a/website/docs/Examples/AutoML-for-LightGBM.md +++ b/website/docs/Examples/AutoML-for-LightGBM.md @@ -2,7 +2,7 @@ ### Prerequisites for this example -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl] matplotlib openml" diff --git a/website/docs/Examples/AutoML-for-XGBoost.md b/website/docs/Examples/AutoML-for-XGBoost.md index 2d264383cb..072dab3289 100644 --- a/website/docs/Examples/AutoML-for-XGBoost.md +++ b/website/docs/Examples/AutoML-for-XGBoost.md @@ -2,7 +2,7 @@ ### Prerequisites for this example -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl] matplotlib openml" diff --git a/website/docs/Examples/Default-Flamlized.md b/website/docs/Examples/Default-Flamlized.md index b89ffe7a8b..5bd011262e 100644 --- a/website/docs/Examples/Default-Flamlized.md +++ b/website/docs/Examples/Default-Flamlized.md @@ -6,7 +6,7 @@ Flamlized estimators automatically use data-dependent default hyperparameter con ### Prerequisites -This example requires the \[autozero\] option. +This example requires the [autozero] option. ```bash pip install flaml[autozero] lightgbm openml @@ -67,6 +67,82 @@ X_test.shape: (5160, 8), y_test.shape: (5160,) [Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/zeroshot_lightgbm.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/zeroshot_lightgbm.ipynb) +## Flamlized LGBMClassifier + +### Prerequisites + +This example requires the [autozero] option. + +```bash +pip install flaml[autozero] lightgbm openml +``` + +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import LGBMClassifier +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./") +lgbm = LGBMClassifier() +lgbm.fit(X_train, y_train) +y_pred = lgbm.predict(X_test) +print( + "flamlized lgbm accuracy", + "=", + 1 - sklearn_metric_loss_score("accuracy", y_pred, y_test), +) +print(lgbm) +``` + +#### Sample output + +``` +load dataset from ./openml_ds1169.pkl +Dataset name: airlines +X_train.shape: (404537, 7), y_train.shape: (404537,); +X_test.shape: (134846, 7), y_test.shape: (134846,) +flamlized lgbm accuracy = 0.6745 +LGBMClassifier(colsample_bytree=0.85, learning_rate=0.05, max_bin=255, + min_child_samples=20, n_estimators=500, num_leaves=31, + reg_alpha=0.01, reg_lambda=0.1, verbose=-1) +``` + +## Flamlized XGBRegressor + +### Prerequisites + +This example requires xgboost, sklearn, openml==0.10.2. 
+ +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import XGBRegressor +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir="./") +xgb = XGBRegressor() +xgb.fit(X_train, y_train) +y_pred = xgb.predict(X_test) +print("flamlized xgb r2", "=", 1 - sklearn_metric_loss_score("r2", y_pred, y_test)) +print(xgb) +``` + +#### Sample output + +``` +load dataset from ./openml_ds537.pkl +Dataset name: houses +X_train.shape: (15480, 8), y_train.shape: (15480,); +X_test.shape: (5160, 8), y_test.shape: (5160,) +flamlized xgb r2 = 0.8542 +XGBRegressor(colsample_bylevel=1, colsample_bytree=0.85, learning_rate=0.05, + max_depth=6, n_estimators=500, reg_alpha=0.01, reg_lambda=1.0, + subsample=0.9) +``` + ## Flamlized XGBClassifier ### Prerequisites @@ -112,3 +188,159 @@ XGBClassifier(base_score=0.5, booster='gbtree', scale_pos_weight=1, subsample=1.0, tree_method='hist', use_label_encoder=False, validate_parameters=1, verbosity=0) ``` + +## Flamlized RandomForestRegressor + +### Prerequisites + +This example requires the [autozero] option. + +```bash +pip install flaml[autozero] scikit-learn openml +``` + +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import RandomForestRegressor +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir="./") +rf = RandomForestRegressor() +rf.fit(X_train, y_train) +y_pred = rf.predict(X_test) +print("flamlized rf r2", "=", 1 - sklearn_metric_loss_score("r2", y_pred, y_test)) +print(rf) +``` + +#### Sample output + +``` +load dataset from ./openml_ds537.pkl +Dataset name: houses +X_train.shape: (15480, 8), y_train.shape: (15480,); +X_test.shape: (5160, 8), y_test.shape: (5160,) +flamlized rf r2 = 0.8521 +RandomForestRegressor(max_features=0.8, min_samples_leaf=2, min_samples_split=5, + n_estimators=500) +``` + +## Flamlized RandomForestClassifier + +### Prerequisites + +This example requires the [autozero] option. + +```bash +pip install flaml[autozero] scikit-learn openml +``` + +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import RandomForestClassifier +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./") +rf = RandomForestClassifier() +rf.fit(X_train, y_train) +y_pred = rf.predict(X_test) +print( + "flamlized rf accuracy", + "=", + 1 - sklearn_metric_loss_score("accuracy", y_pred, y_test), +) +print(rf) +``` + +#### Sample output + +``` +load dataset from ./openml_ds1169.pkl +Dataset name: airlines +X_train.shape: (404537, 7), y_train.shape: (404537,); +X_test.shape: (134846, 7), y_test.shape: (134846,) +flamlized rf accuracy = 0.6701 +RandomForestClassifier(max_features=0.7, min_samples_leaf=3, min_samples_split=5, + n_estimators=500) +``` + +## Flamlized ExtraTreesRegressor + +### Prerequisites + +This example requires the [autozero] option. 
+ +```bash +pip install flaml[autozero] scikit-learn openml +``` + +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import ExtraTreesRegressor +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir="./") +et = ExtraTreesRegressor() +et.fit(X_train, y_train) +y_pred = et.predict(X_test) +print("flamlized et r2", "=", 1 - sklearn_metric_loss_score("r2", y_pred, y_test)) +print(et) +``` + +#### Sample output + +``` +load dataset from ./openml_ds537.pkl +Dataset name: houses +X_train.shape: (15480, 8), y_train.shape: (15480,); +X_test.shape: (5160, 8), y_test.shape: (5160,) +flamlized et r2 = 0.8534 +ExtraTreesRegressor(max_features=0.75, min_samples_leaf=2, min_samples_split=5, + n_estimators=500) +``` + +## Flamlized ExtraTreesClassifier + +### Prerequisites + +This example requires the [autozero] option. + +```bash +pip install flaml[autozero] scikit-learn openml +``` + +### Zero-shot AutoML + +```python +from flaml.automl.data import load_openml_dataset +from flaml.default import ExtraTreesClassifier +from flaml.automl.ml import sklearn_metric_loss_score + +X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./") +et = ExtraTreesClassifier() +et.fit(X_train, y_train) +y_pred = et.predict(X_test) +print( + "flamlized et accuracy", + "=", + 1 - sklearn_metric_loss_score("accuracy", y_pred, y_test), +) +print(et) +``` + +#### Sample output + +``` +load dataset from ./openml_ds1169.pkl +Dataset name: airlines +X_train.shape: (404537, 7), y_train.shape: (404537,); +X_test.shape: (134846, 7), y_test.shape: (134846,) +flamlized et accuracy = 0.6698 +ExtraTreesClassifier(max_features=0.7, min_samples_leaf=3, min_samples_split=5, + n_estimators=500) +``` diff --git a/website/docs/Examples/Integrate - AzureML.md b/website/docs/Examples/Integrate - AzureML.md index 1a46ca6242..85f643d5cf 100644 --- a/website/docs/Examples/Integrate - AzureML.md +++ b/website/docs/Examples/Integrate - AzureML.md @@ -2,7 +2,7 @@ FLAML can be used together with AzureML. On top of that, using mlflow and ray is ### Prerequisites -Install the \[automl,azureml\] option. +Install the [automl,azureml] option. ```bash pip install "flaml[automl,azureml]" diff --git a/website/docs/Examples/Integrate - Scikit-learn Pipeline.md b/website/docs/Examples/Integrate - Scikit-learn Pipeline.md index 58657c76a2..ec19e4e3c5 100644 --- a/website/docs/Examples/Integrate - Scikit-learn Pipeline.md +++ b/website/docs/Examples/Integrate - Scikit-learn Pipeline.md @@ -2,7 +2,7 @@ As FLAML's AutoML module can be used a transformer in the Sklearn's pipeline we ### Prerequisites -Install the \[automl\] option. +Install the [automl] option. ```bash pip install "flaml[automl] openml" diff --git a/website/docs/FAQ.md b/website/docs/FAQ.md index 5f14778694..5f9364c1db 100644 --- a/website/docs/FAQ.md +++ b/website/docs/FAQ.md @@ -8,13 +8,114 @@ ### About `low_cost_partial_config` in `tune`. -- Definition and purpose: The `low_cost_partial_config` is a dictionary of subset of the hyperparameter coordinates whose value corresponds to a configuration with known low-cost (i.e., low computation cost for training the corresponding model). The concept of low/high-cost is meaningful in the case where a subset of the hyperparameters to tune directly affects the computation cost for training the model. 
For example, `n_estimators` and `max_leaves` are known to affect the training cost of tree-based learners. We call this subset of hyperparameters, *cost-related hyperparameters*. In such scenarios, if you are aware of low-cost configurations for the cost-related hyperparameters, you are recommended to set them as the `low_cost_partial_config`. Using the tree-based method example again, since we know that small `n_estimators` and `max_leaves` generally correspond to simpler models and thus lower cost, we set `{'n_estimators': 4, 'max_leaves': 4}` as the `low_cost_partial_config` by default (note that `4` is the lower bound of search space for these two hyperparameters), e.g., in [LGBM](https://github.com/microsoft/FLAML/blob/main/flaml/model.py#L215). Configuring `low_cost_partial_config` helps the search algorithms make more cost-efficient choices. +- Definition and purpose: The `low_cost_partial_config` is a dictionary of subset of the hyperparameter coordinates whose value corresponds to a configuration with known low-cost (i.e., low computation cost for training the corresponding model). The concept of low/high-cost is meaningful in the case where a subset of the hyperparameters to tune directly affects the computation cost for training the model. For example, `n_estimators` and `max_leaves` are known to affect the training cost of tree-based learners. We call this subset of hyperparameters, *cost-related hyperparameters*. In such scenarios, if you are aware of low-cost configurations for the cost-related hyperparameters, you are recommended to set them as the `low_cost_partial_config`. Using the tree-based method example again, since we know that small `n_estimators` and `max_leaves` generally correspond to simpler models and thus lower cost, we set `{'n_estimators': 4, 'max_leaves': 4}` as the `low_cost_partial_config` by default (note that `4` is the lower bound of search space for these two hyperparameters), e.g., in [LGBM](https://github.com/microsoft/FLAML/blob/main/flaml/model.py#L215). Configuring `low_cost_partial_config` helps the search algorithms make more cost-efficient choices. In AutoML, the `low_cost_init_value` in `search_space()` function for each estimator serves the same role. - Usage in practice: It is recommended to configure it if there are cost-related hyperparameters in your tuning task and you happen to know the low-cost values for them, but it is not required (It is fine to leave it the default value, i.e., `None`). - How does it work: `low_cost_partial_config` if configured, will be used as an initial point of the search. It also affects the search trajectory. For more details about how does it play a role in the search algorithms, please refer to the papers about the search algorithms used: Section 2 of [Frugal Optimization for Cost-related Hyperparameters (CFO)](https://arxiv.org/pdf/2005.01571.pdf) and Section 3 of [Economical Hyperparameter Optimization with Blended Search Strategy (BlendSearch)](https://openreview.net/pdf?id=VbLH04pRA3). +### How does FLAML handle missing values? + +FLAML automatically preprocesses missing values in the input data through its `DataTransformer` class (for classification/regression tasks) and `DataTransformerTS` class (for time series tasks). The preprocessing behavior differs based on the column type: + +**Automatic Missing Value Preprocessing:** + +FLAML performs the following preprocessing automatically when you call `AutoML.fit()`: + +1. 
**Numerical/Continuous Columns**: Missing values (NaN) in numerical columns are imputed using `sklearn.impute.SimpleImputer` with the **median strategy**. This preprocessing is applied in the `DataTransformer.fit_transform()` method (see `flaml/automl/data.py` lines 357-369 and `flaml/automl/time_series/ts_data.py` lines 429-440). + +1. **Categorical Columns**: Missing values in categorical columns (object, category, or string dtypes) are filled with a special placeholder value `"__NAN__"`, which is treated as a distinct category. + +**Example of automatic preprocessing:** + +```python +from flaml import AutoML +import pandas as pd +import numpy as np + +# Data with missing values +X_train = pd.DataFrame( + { + "num_feature": [1.0, 2.0, np.nan, 4.0, 5.0], + "cat_feature": ["A", "B", None, "A", "B"], + } +) +y_train = [0, 1, 0, 1, 0] + +# FLAML automatically handles missing values +automl = AutoML() +automl.fit(X_train, y_train, task="classification", time_budget=60) +# Numerical NaNs are imputed with median, categorical None becomes "__NAN__" +``` + +**Estimator-Specific Native Handling:** + +After FLAML's preprocessing, some estimators have additional native missing value handling capabilities: + +- **`lgbm`** (LightGBM): After preprocessing, can still handle any remaining NaN values natively by learning optimal split directions. +- **`xgboost`** (XGBoost): After preprocessing, can handle remaining NaN values by learning the best direction during training. +- **`xgb_limitdepth`** (XGBoost with depth limit): Same as `xgboost`. +- **`catboost`** (CatBoost): After preprocessing, has additional sophisticated missing value handling strategies. See [CatBoost documentation](https://catboost.ai/en/docs/concepts/algorithm-missing-values-processing). +- **`histgb`** (HistGradientBoosting): After preprocessing, can still handle NaN values natively. + +**Estimators that rely on preprocessing:** + +These estimators rely on FLAML's automatic preprocessing since they cannot handle missing values directly: + +- **`rf`** (RandomForest): Requires preprocessing (automatically done by FLAML). +- **`extra_tree`** (ExtraTrees): Requires preprocessing (automatically done by FLAML). +- **`lrl1`**, **`lrl2`** (LogisticRegression): Require preprocessing (automatically done by FLAML). +- **`kneighbor`** (KNeighbors): Requires preprocessing (automatically done by FLAML). +- **`sgd`** (SGDClassifier/Regressor): Require preprocessing (automatically done by FLAML). + +**Advanced: Customizing Missing Value Handling** + +In most cases, FLAML's automatic preprocessing (median imputation for numerical, "__NAN__" for categorical) works well. However, if you need custom preprocessing: + +1. **Skip automatic preprocessing** using the `skip_transform` parameter: + +```python +from flaml import AutoML +from sklearn.impute import SimpleImputer +import numpy as np + +# Custom preprocessing with different strategy +imputer = SimpleImputer(strategy="mean") # Use mean instead of median +X_train_preprocessed = imputer.fit_transform(X_train) +X_test_preprocessed = imputer.transform(X_test) + +# Skip FLAML's automatic preprocessing +automl = AutoML() +automl.fit( + X_train_preprocessed, + y_train, + task="classification", + time_budget=60, + skip_transform=True, # Skip automatic preprocessing +) +``` + +2. 
**Use sklearn Pipeline** for integrated custom preprocessing: + +```python +from flaml import AutoML +from sklearn.pipeline import Pipeline +from sklearn.impute import SimpleImputer, KNNImputer + +# Custom pipeline with KNN imputation +pipeline = Pipeline( + [ + ("imputer", KNNImputer(n_neighbors=5)), # Custom imputation strategy + ("automl", AutoML()), + ] +) + +pipeline.fit(X_train, y_train) +``` + +**Note on time series forecasting**: For time series tasks (`ts_forecast`, `ts_forecast_panel`), the `DataTransformerTS` class applies the same preprocessing approach (median imputation for numerical columns, "__NAN__" for categorical). Missing values handling in the time dimension may require additional consideration depending on your specific forecasting model. + ### How does FLAML handle imbalanced data (unequal distribution of target classes in classification task)? Currently FLAML does several things for imbalanced data. @@ -73,7 +174,9 @@ Optimization history can be checked from the [log](Use-Cases/Task-Oriented-AutoM ### How to get the best config of an estimator and use it to train the original model outside FLAML? -When you finished training an AutoML estimator, you may want to use it in other code w/o depending on FLAML. You can get the `automl.best_config` and convert it to the parameters of the original model with below code: +When you finished training an AutoML estimator, you may want to use it in other code w/o depending on FLAML. The `automl.best_config` contains FLAML's search space parameters, which may differ from the original model's parameters (e.g., FLAML uses `log_max_bin` for LightGBM instead of `max_bin`). You need to convert them using the `config2params()` method. + +**Method 1: Using the trained model instance** ```python from flaml import AutoML @@ -86,10 +189,43 @@ automl.fit(X, y) print(f"{automl.best_estimator=}") print(f"{automl.best_config=}") -print(f"params for best estimator: {automl.model.config2params(automl.best_config)}") +# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, +# 'learning_rate': 0.1, 'log_max_bin': 8, ...} + +# Convert to original model parameters +best_params = automl.model.config2params(automl.best_config) +print(f"params for best estimator: {best_params}") +# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, +# 'learning_rate': 0.1, 'max_bin': 255, ...} # log_max_bin -> max_bin +``` + +**Method 2: Using FLAML estimator classes directly** + +If the automl instance is not accessible and you only have the `best_config`, you can convert it with below code: + +```python +from flaml.automl.model import LGBMEstimator + +best_config = { + "n_estimators": 4, + "num_leaves": 4, + "min_child_samples": 20, + "learning_rate": 0.1, + "log_max_bin": 8, # FLAML-specific parameter + "colsample_bytree": 1.0, + "reg_alpha": 0.0009765625, + "reg_lambda": 1.0, +} + +# Create FLAML estimator - this automatically converts parameters +flaml_estimator = LGBMEstimator(task="classification", **best_config) +best_params = flaml_estimator.params # Converted params ready for original model +print(f"Converted params: {best_params}") +# Example: {'n_estimators': 4, 'num_leaves': 4, 'min_child_samples': 20, +# 'learning_rate': 0.1, 'max_bin': 255, 'verbose': -1, ...} ``` -If the automl instance is not accessible and you've the `best_config`. 
You can also convert it with below code: +**Method 3: Using task_factory (for any estimator type)** ```python from flaml.automl.task.factory import task_factory @@ -107,15 +243,51 @@ model_class = task_factory(task).estimator_class_from_str(best_estimator)(task=t best_params = model_class.config2params(best_config) ``` -Then you can use it to train the sklearn estimators directly: +Then you can use it to train the sklearn/lightgbm/xgboost estimators directly: ```python -from sklearn.ensemble import RandomForestClassifier +from lightgbm import LGBMClassifier -model = RandomForestClassifier(**best_params) +# Using LightGBM directly with converted parameters +model = LGBMClassifier(**best_params) model.fit(X, y) ``` +**Using best_config_per_estimator for multiple estimators** + +```python +from flaml import AutoML +from flaml.automl.model import LGBMEstimator, XGBoostEstimator +from lightgbm import LGBMClassifier +from xgboost import XGBClassifier + +automl = AutoML() +automl.fit( + X, y, task="classification", time_budget=30, estimator_list=["lgbm", "xgboost"] +) + +# Get configs for all estimators +configs = automl.best_config_per_estimator +# Example: {'lgbm': {'n_estimators': 4, 'log_max_bin': 8, ...}, +# 'xgboost': {'n_estimators': 4, 'max_leaves': 4, ...}} + +# Convert and use LightGBM config +if configs.get("lgbm"): + lgbm_config = configs["lgbm"].copy() + lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present + flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config) + lgbm_model = LGBMClassifier(**flaml_lgbm.params) + lgbm_model.fit(X, y) + +# Convert and use XGBoost config +if configs.get("xgboost"): + xgb_config = configs["xgboost"].copy() + xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present + flaml_xgb = XGBoostEstimator(task="classification", **xgb_config) + xgb_model = XGBClassifier(**flaml_xgb.params) + xgb_model.fit(X, y) +``` + ### How to save and load an AutoML object? (`pickle` / `load_pickle`) FLAML provides `AutoML.pickle()` / `AutoML.load_pickle()` as a convenient and robust way to persist an AutoML run. diff --git a/website/docs/Installation.md b/website/docs/Installation.md index af6a63aefd..b971660e68 100644 --- a/website/docs/Installation.md +++ b/website/docs/Installation.md @@ -57,7 +57,7 @@ pip install "flaml[hf]" #### Notebook To run the [notebook examples](https://github.com/microsoft/FLAML/tree/main/notebook), -install flaml with the \[notebook\] option: +install flaml with the [notebook] option: ```bash pip install "flaml[notebook]" diff --git a/website/docs/Use-Cases/Task-Oriented-AutoML.md b/website/docs/Use-Cases/Task-Oriented-AutoML.md index 6f54146d76..602602c77c 100644 --- a/website/docs/Use-Cases/Task-Oriented-AutoML.md +++ b/website/docs/Use-Cases/Task-Oriented-AutoML.md @@ -51,6 +51,7 @@ If users provide the minimal inputs only, `AutoML` uses the default settings for The optimization metric is specified via the `metric` argument. It can be either a string which refers to a built-in metric, or a user-defined function. - Built-in metric. + - 'accuracy': 1 - accuracy as the corresponding metric to minimize. - 'log_loss': default metric for multiclass classification. - 'r2': 1 - r2_score as the corresponding metric to minimize. Default metric for regression. @@ -70,6 +71,40 @@ The optimization metric is specified via the `metric` argument. It can be either - 'ap': minimize 1 - average_precision_score. - 'ndcg': minimize 1 - ndcg_score. - 'ndcg@k': minimize 1 - ndcg_score@k. 
k is an integer. + - 'pr_auc': minimize 1 - precision-recall AUC score. (Spark-specific) + - 'var': minimize variance. (Spark-specific) + +- Built-in HuggingFace metrics (for NLP tasks). + + - 'accuracy': minimize 1 - accuracy. + - 'bertscore': minimize 1 - BERTScore. + - 'bleu': minimize 1 - BLEU score. + - 'bleurt': minimize 1 - BLEURT score. + - 'cer': minimize character error rate. + - 'chrf': minimize ChrF score. + - 'code_eval': minimize 1 - code evaluation score. + - 'comet': minimize 1 - COMET score. + - 'competition_math': minimize 1 - competition math score. + - 'coval': minimize 1 - CoVal score. + - 'cuad': minimize 1 - CUAD score. + - 'f1': minimize 1 - F1 score. + - 'gleu': minimize 1 - GLEU score. + - 'google_bleu': minimize 1 - Google BLEU score. + - 'matthews_correlation': minimize 1 - Matthews correlation coefficient. + - 'meteor': minimize 1 - METEOR score. + - 'pearsonr': minimize 1 - Pearson correlation coefficient. + - 'precision': minimize 1 - precision. + - 'recall': minimize 1 - recall. + - 'rouge': minimize 1 - ROUGE score. + - 'rouge1': minimize 1 - ROUGE-1 score. + - 'rouge2': minimize 1 - ROUGE-2 score. + - 'sacrebleu': minimize 1 - SacreBLEU score. + - 'sari': minimize 1 - SARI score. + - 'seqeval': minimize 1 - SeqEval score. + - 'spearmanr': minimize 1 - Spearman correlation coefficient. + - 'ter': minimize translation error rate. + - 'wer': minimize word error rate. + - User-defined function. A customized metric function that requires the following (input) signature, and returns the input config’s value in terms of the metric you want to minimize, and a dictionary of auxiliary information at your choice: @@ -144,7 +179,7 @@ The estimator list can contain one or more estimator names, each corresponding t - Built-in estimator. - 'lgbm': LGBMEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, num_leaves, min_child_samples, learning_rate, log_max_bin (logarithm of (max_bin + 1) with base 2), colsample_bytree, reg_alpha, reg_lambda. - 'xgboost': XGBoostSkLearnEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_leaves, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda. - - 'xgb_limitdepth': XGBoostLimitDepthEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_depth, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda. + - 'xgb_limitdepth': XGBoostLimitDepthEstimator for task "classification", "regression", "rank", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_depth, min_child_weight, learning_rate, subsample, colsample_bylevel, colsample_bytree, reg_alpha, reg_lambda. - 'rf': RandomForestEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). Starting from v1.1.0, it uses a fixed random_state by default. - 'extra_tree': ExtraTreesEstimator for task "classification", "regression", "ts_forecast" and "ts_forecast_classification". Hyperparameters: n_estimators, max_features, max_leaves, criterion (for classification only). 
Starting from v1.1.0, @@ -207,6 +242,7 @@ To tune a custom estimator that is not built-in, you need to: ```python from flaml.automl.model import SKLearnEstimator + # SKLearnEstimator is derived from BaseEstimator import rgf @@ -215,31 +251,44 @@ class MyRegularizedGreedyForest(SKLearnEstimator): def __init__(self, task="binary", **config): super().__init__(task, **config) - if task in CLASSIFICATION: - from rgf.sklearn import RGFClassifier + if isinstance(task, str): + from flaml.automl.task.factory import task_factory + + task = task_factory(task) + + if task.is_classification(): + from rgf.sklearn import RGFClassifier - self.estimator_class = RGFClassifier + self.estimator_class = RGFClassifier else: - from rgf.sklearn import RGFRegressor + from rgf.sklearn import RGFRegressor - self.estimator_class = RGFRegressor + self.estimator_class = RGFRegressor @classmethod def search_space(cls, data_size, task): space = { - "max_leaf": { - "domain": tune.lograndint(lower=4, upper=data_size), - "low_cost_init_value": 4, - }, - "n_iter": { - "domain": tune.lograndint(lower=1, upper=data_size), - "low_cost_init_value": 1, - }, - "learning_rate": {"domain": tune.loguniform(lower=0.01, upper=20.0)}, - "min_samples_leaf": { - "domain": tune.lograndint(lower=1, upper=20), - "init_value": 20, - }, + "max_leaf": { + "domain": tune.lograndint(lower=4, upper=data_size[0]), + "init_value": 4, + }, + "n_iter": { + "domain": tune.lograndint(lower=1, upper=data_size[0]), + "init_value": 1, + }, + "n_tree_search": { + "domain": tune.lograndint(lower=1, upper=32768), + "init_value": 1, + }, + "opt_interval": { + "domain": tune.lograndint(lower=1, upper=10000), + "init_value": 100, + }, + "learning_rate": {"domain": tune.loguniform(lower=0.01, upper=20.0)}, + "min_samples_leaf": { + "domain": tune.lograndint(lower=1, upper=20), + "init_value": 20, + }, } return space ``` @@ -420,18 +469,40 @@ To use stacked ensemble after the model search, set `ensemble=True` or a dict. W - "final_estimator": an instance of the final estimator in the stacker. - "passthrough": True (default) or False, whether to pass the original features to the stacker. +**Important Note:** The hyperparameters of a custom `final_estimator` are **NOT automatically tuned**. If you provide an estimator instance (e.g., `CatBoostClassifier()`), it will use the parameters you specified or their defaults. To use specific hyperparameters, you must set them when creating the estimator instance. If `final_estimator` is not provided, the best model found during the search will be used as the final estimator (recommended for best performance). + For example, ```python automl.fit( - X_train, y_train, task="classification", - "ensemble": { - "final_estimator": LogisticRegression(), + X_train, + y_train, + task="classification", + ensemble={ + "final_estimator": LogisticRegression(), # Uses default LogisticRegression parameters "passthrough": False, }, ) ``` +Or with custom parameters: + +```python +from catboost import CatBoostClassifier + +automl.fit( + X_train, + y_train, + task="classification", + ensemble={ + "final_estimator": CatBoostClassifier( + iterations=100, depth=6, learning_rate=0.1 + ), + "passthrough": True, + }, +) +``` + ### Resampling strategy By default, flaml decides the resampling automatically according to the data size and the time budget. If you would like to enforce a certain resampling strategy, you can set `eval_method` to be "holdout" or "cv" for holdout or cross-validation. 
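+
+For example, a minimal sketch of enforcing a specific resampling strategy (assuming `X_train` and `y_train` are already loaded, as in the earlier examples):
+
+```python
+from flaml import AutoML
+
+# Enforce 5-fold cross-validation instead of the automatic choice
+automl = AutoML()
+automl.fit(
+    X_train,
+    y_train,
+    task="classification",
+    time_budget=60,
+    eval_method="cv",
+    n_splits=5,
+)
+
+# Or enforce a holdout split that reserves 20% of the data for validation
+automl_holdout = AutoML()
+automl_holdout.fit(
+    X_train,
+    y_train,
+    task="classification",
+    time_budget=60,
+    eval_method="holdout",
+    split_ratio=0.2,
+)
+```
+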
@@ -462,7 +533,7 @@ For both classification and regression tasks more advanced split configurations More in general, `split_type` can also be set as a custom splitter object, when `eval_method="cv"`. It needs to be an instance of a derived class of scikit-learn [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) -and have `split` and `get_n_splits` methods with the same signatures. To disable shuffling, the splitter instance must contain the attribute `shuffle=False`. +and have `split` and `get_n_splits` methods with the same signatures. To disable shuffling, the splitter instance must contain the attribute `shuffle=False`. ### Parallel tuning @@ -552,6 +623,8 @@ automl2.fit( `starting_points` is a dictionary or a str to specify the starting hyperparameter config. (1) When it is a dictionary, the keys are the estimator names. If you do not need to specify starting points for an estimator, exclude its name from the dictionary. The value for each key can be either a dictionary of a list of dictionaries, corresponding to one hyperparameter configuration, or multiple hyperparameter configurations, respectively. (2) When it is a str: if "data", use data-dependent defaults; if "data:path", use data-dependent defaults which are stored at path; if "static", use data-independent defaults. Please find more details about data-dependent defaults in [zero shot AutoML](Zero-Shot-AutoML#combine-zero-shot-automl-and-hyperparameter-tuning). +**Note on sample size preservation**: When using `best_config_per_estimator` as starting points, the configurations now preserve `FLAML_sample_size` (if subsampling was used during the search). This ensures that the warm-started run continues optimization with the same sample sizes that produced the best results in the previous run, leading to more effective warm-starting. + ### Log the trials The trials are logged in a file if a `log_file_name` is passed. @@ -653,6 +726,64 @@ plt.barh( ![png](images/feature_importance.png) +### Preprocess data + +FLAML provides two levels of preprocessing that can be accessed as public APIs: + +1. **Task-level preprocessing** (`automl.preprocess()`): This applies transformations that are specific to the task type, such as handling data types, sparse matrices, and feature transformations learned during training. + +1. **Estimator-level preprocessing** (`estimator.preprocess()`): This applies transformations specific to the estimator type (e.g., LightGBM, XGBoost). + +The task-level preprocessing should be applied before the estimator-level preprocessing. 
+ +#### Task-level preprocessing + +```python +from flaml import AutoML +import numpy as np + +# Train the model +automl = AutoML() +automl.fit(X_train, y_train, task="classification", time_budget=60) + +# Apply task-level preprocessing to new data +X_test_preprocessed = automl.preprocess(X_test) + +# Now you can use this with the estimator +predictions = automl.model.predict(X_test_preprocessed) +``` + +#### Estimator-level preprocessing + +```python +# Get the trained estimator +estimator = automl.model + +# Apply task-level preprocessing first +X_test_task = automl.preprocess(X_test) + +# Then apply estimator-level preprocessing +X_test_estimator = estimator.preprocess(X_test_task) + +# Use the fully preprocessed data with the underlying model +predictions = estimator._model.predict(X_test_estimator) +``` + +#### Complete preprocessing pipeline + +For most use cases, the `predict()` method already handles both levels of preprocessing internally. However, if you need to apply preprocessing separately (e.g., for custom inference pipelines or debugging), you can use: + +```python +# Complete preprocessing pipeline +X_task_preprocessed = automl.preprocess(X_test) +X_final = automl.model.preprocess(X_task_preprocessed) + +# This is equivalent to what happens internally in: +predictions = automl.predict(X_test) +``` + +**Note**: The `preprocess()` methods can only be called after `fit()` has been executed, as they rely on the transformations learned during training. + ### Get best configuration We can find the best estimator's name and best configuration by: @@ -664,6 +795,25 @@ print(automl.best_config) # {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'log_max_bin': 8, 'colsample_bytree': 0.6649148062238498, 'reg_alpha': 0.0009765625, 'reg_lambda': 0.0067613624509965} ``` +**Note**: The config contains FLAML's search space parameters, which may differ from the original model's parameters. For example, FLAML uses `log_max_bin` for LightGBM instead of `max_bin`. To convert to the original model's parameters, use the `config2params()` method: + +```python +from flaml.automl.model import LGBMEstimator + +# Convert FLAML config to original model parameters +flaml_estimator = LGBMEstimator(task="classification", **automl.best_config) +original_params = flaml_estimator.params +print(original_params) +# {'n_estimators': 148, 'num_leaves': 18, 'min_child_samples': 3, 'learning_rate': 0.17402065726724145, 'max_bin': 255, ...} +# Note: 'log_max_bin': 8 is converted to 'max_bin': 255 (2^8 - 1) + +# Now you can use original LightGBM directly +from lightgbm import LGBMClassifier + +lgbm_model = LGBMClassifier(**original_params) +lgbm_model.fit(X_train, y_train) +``` + We can also find the best configuration per estimator. ```python @@ -673,6 +823,40 @@ print(automl.best_config_per_estimator) The `None` value corresponds to the estimators which have not been tried. 
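+
+As a small sketch, you can drop the untried estimators before further processing (assuming `automl` has already been fitted as above):
+
+```python
+# Keep only the estimators that were actually tried (their configs are not None)
+tried_configs = {
+    name: config
+    for name, config in automl.best_config_per_estimator.items()
+    if config is not None
+}
+print(tried_configs)
+```
+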
+**Converting configs for all estimators to original model parameters:** + +```python +from flaml.automl.model import LGBMEstimator, XGBoostEstimator +from lightgbm import LGBMClassifier +from xgboost import XGBClassifier + +configs = automl.best_config_per_estimator + +# Convert and use LightGBM config +if configs.get("lgbm"): + lgbm_config = configs["lgbm"].copy() + lgbm_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present + flaml_lgbm = LGBMEstimator(task="classification", **lgbm_config) + lgbm_model = LGBMClassifier(**flaml_lgbm.params) + lgbm_model.fit(X_train, y_train) + +# Convert and use XGBoost config +if configs.get("xgboost"): + xgb_config = configs["xgboost"].copy() + xgb_config.pop("FLAML_sample_size", None) # Remove FLAML internal param if present + flaml_xgb = XGBoostEstimator(task="classification", **xgb_config) + xgb_model = XGBClassifier(**flaml_xgb.params) + xgb_model.fit(X_train, y_train) +``` + +**Note**: When subsampling is used during the search (e.g., with large datasets), the configurations may also include `FLAML_sample_size` to indicate the sample size used. For example: + +```python +# {'lgbm': {'n_estimators': 729, 'num_leaves': 21, ..., 'FLAML_sample_size': 45000}, ...} +``` + +This information is preserved in `best_config_per_estimator` and is important for warm-starting subsequent runs with the correct sample sizes. + Other useful information: ```python @@ -740,7 +924,7 @@ If you want to get a sense of how much time is needed to find the best model, yo > INFO - Estimated sufficient time budget=145194s. Estimated necessary time budget=2118s. -> INFO - at 2.6s, estimator lgbm's best error=0.4459, best estimator lgbm's best error=0.4459 +> INFO - at 2.6s, estimator lgbm's best error=0.4459, best estimator lgbm's best error=0.4459 You will see that the time to finish the first and cheapest trial is 2.6 seconds. The estimated necessary time budget is 2118 seconds, and the estimated sufficient time budget is 145194 seconds. Note that this is only an estimated range to help you decide your budget. diff --git a/website/docs/Use-Cases/Tune-User-Defined-Function.md b/website/docs/Use-Cases/Tune-User-Defined-Function.md index 8b19043aa0..7a1afb364a 100644 --- a/website/docs/Use-Cases/Tune-User-Defined-Function.md +++ b/website/docs/Use-Cases/Tune-User-Defined-Function.md @@ -23,13 +23,13 @@ Related arguments: - `evaluation_function`: A user-defined evaluation function. - `metric`: A string of the metric name to optimize for. -- `mode`: A string in \['min', 'max'\] to specify the objective as minimization or maximization. +- `mode`: A string in ['min', 'max'] to specify the objective as minimization or maximization. The first step is to specify your tuning objective. To do it, you should first specify your evaluation procedure (e.g., perform a machine learning model training and validation) with respect to the hyperparameters in a user-defined function `evaluation_function`. The function requires a hyperparameter configuration as input, and can simply return a metric value in a scalar or return a dictionary of metric name and metric value pairs. -In the following code, we define an evaluation function with respect to two hyperparameters named `x` and `y` according to $obj := (x-85000)^2 - x/y$. Note that we use this toy example here for more accessible demonstration purposes. In real use cases, the evaluation function usually cannot be written in this closed form, but instead involves a black-box and expensive evaluation procedure. 
Please check out [Tune HuggingFace](/docs/Examples/Tune-HuggingFace), [Tune PyTorch](/docs/Examples/Tune-PyTorch) and [Tune LightGBM](/docs/Getting-Started#tune-user-defined-function) for real examples of tuning tasks.
+In the following code, we define an evaluation function with respect to two hyperparameters named `x` and `y` according to $obj := (x-85000)^2 - x/y$. Note that we use this toy example here for more accessible demonstration purposes. In real use cases, the evaluation function usually cannot be written in this closed form, but instead involves a black-box and expensive evaluation procedure. Please check out [Tune HuggingFace](/docs/Examples/Tune-HuggingFace), [Tune PyTorch](/docs/Examples/Tune-PyTorch) and [Tune LightGBM](/docs/Getting-Started#tune-user-defined-function) for real examples of tuning tasks.

```python
import time

@@ -72,7 +72,7 @@ Related arguments:

The second step is to specify a search space of the hyperparameters through the argument `config`. In the search space, you need to specify valid values for your hyperparameters and can specify how these values are sampled (e.g., from a uniform distribution or a log-uniform distribution).

-In the following code example, we include a search space for the two hyperparameters `x` and `y` as introduced above. The valid values for both are integers in the range of \[1, 100000\]. The values for `x` are sampled uniformly in the specified range (using `tune.randint(lower=1, upper=100000)`), and the values for `y` are sampled uniformly in logarithmic space of the specified range (using `tune.lograndit(lower=1, upper=100000)`).
+In the following code example, we include a search space for the two hyperparameters `x` and `y` as introduced above. The valid values for both are integers in the range of [1, 100000]. The values for `x` are sampled uniformly in the specified range (using `tune.randint(lower=1, upper=100000)`), and the values for `y` are sampled uniformly in logarithmic space of the specified range (using `tune.lograndint(lower=1, upper=100000)`).

```python
from flaml import tune

@@ -181,15 +181,171 @@ config = {

+#### Hierarchical search space
+
+A hierarchical (or conditional) search space allows you to define hyperparameters that depend on the value of other hyperparameters. This is useful when different choices for a categorical hyperparameter require different sets of hyperparameters.
+
+For example, you might be tuning a machine learning pipeline in which different models require different hyperparameters, or the choice of optimizer may determine which optimizer-specific hyperparameters are relevant.
+
+**Syntax**: To create a hierarchical search space, use `tune.choice()` with a list where some elements are dictionaries containing nested hyperparameter definitions.
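+
+As a minimal sketch of this shape (the branch and parameter names below are purely illustrative), a single choice mixes a plain constant with a dictionary that carries its own nested hyperparameters:
+
+```python
+from flaml import tune
+
+space = {
+    "method": tune.choice(
+        [
+            "baseline",  # a constant option with no extra hyperparameters
+            {
+                "name": "regularized",
+                # only sampled when this branch of the choice is selected
+                "alpha": tune.loguniform(1e-4, 1.0),
+            },
+        ]
+    ),
+}
+```
+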
+ +**Example 1: Model selection with model-specific hyperparameters** + +In this example, we have two model types (linear and tree-based), each with their own specific hyperparameters: + +```python +from flaml import tune + +search_space = { + "model": tune.choice( + [ + { + "model_type": "linear", + "learning_rate": tune.loguniform(1e-4, 1e-1), + "regularization": tune.uniform(0, 1), + }, + { + "model_type": "tree", + "n_estimators": tune.randint(10, 100), + "max_depth": tune.randint(3, 10), + }, + ] + ), + # Common hyperparameters for all models + "batch_size": tune.choice([32, 64, 128]), +} + + +def evaluate_config(config): + model_config = config["model"] + if model_config["model_type"] == "linear": + # Use learning_rate and regularization + # train_linear_model() is a placeholder for your actual training code + score = train_linear_model( + lr=model_config["learning_rate"], + reg=model_config["regularization"], + batch_size=config["batch_size"], + ) + else: # tree + # Use n_estimators and max_depth + # train_tree_model() is a placeholder for your actual training code + score = train_tree_model( + n_est=model_config["n_estimators"], + depth=model_config["max_depth"], + batch_size=config["batch_size"], + ) + return {"score": score} + + +# Run tuning +analysis = tune.run( + evaluate_config, + config=search_space, + metric="score", + mode="min", + num_samples=20, +) +``` + +**Example 2: Mixed choices with constants and nested spaces** + +You can also mix constant values with nested hyperparameter spaces in `tune.choice()`: + +```python +search_space = { + "optimizer": tune.choice( + [ + "sgd", # constant value + { + "optimizer_type": "adam", + "beta1": tune.uniform(0.8, 0.99), + "beta2": tune.uniform(0.9, 0.999), + }, + { + "optimizer_type": "rmsprop", + "decay": tune.loguniform(1e-3, 1e-1), + "momentum": tune.uniform(0, 0.99), + }, + ] + ), + "learning_rate": tune.loguniform(1e-5, 1e-1), +} + + +def evaluate_config(config): + optimizer_config = config["optimizer"] + if optimizer_config == "sgd": + optimizer = create_sgd_optimizer(lr=config["learning_rate"]) + elif optimizer_config["optimizer_type"] == "adam": + optimizer = create_adam_optimizer( + lr=config["learning_rate"], + beta1=optimizer_config["beta1"], + beta2=optimizer_config["beta2"], + ) + else: # rmsprop + optimizer = create_rmsprop_optimizer( + lr=config["learning_rate"], + decay=optimizer_config["decay"], + momentum=optimizer_config["momentum"], + ) + # train_model() is a placeholder for your actual training code + return train_model(optimizer) +``` + +**Example 3: Nested hierarchical spaces** + +You can also nest dictionaries within the search space for organizing related hyperparameters: + +```python +search_space = { + "preprocessing": { + "normalize": tune.choice([True, False]), + "feature_selection": tune.choice(["none", "pca", "lda"]), + }, + "model": tune.choice( + [ + { + "type": "neural_net", + "layers": tune.randint(1, 5), + "units_per_layer": tune.randint(32, 256), + }, + { + "type": "ensemble", + "n_models": tune.randint(3, 10), + }, + ] + ), +} + + +def evaluate_config(config): + # Access nested hyperparameters + normalize = config["preprocessing"]["normalize"] + feature_selection = config["preprocessing"]["feature_selection"] + model_config = config["model"] + + # Use the hyperparameters accordingly + # train_with_config() is a placeholder for your actual training code + score = train_with_config(normalize, feature_selection, model_config) + return {"score": score} +``` + +**Notes:** + +- When a configuration 
is sampled, only the selected branch of the hierarchical space will be active. +- The evaluation function should check which choice was selected and use the corresponding nested hyperparameters. +- Hierarchical search spaces work with all FLAML search algorithms (CFO, BlendSearch). +- You can specify `low_cost_partial_config` for hierarchical spaces as well by providing the path to the nested parameters. + #### Cost-related hyperparameters Cost-related hyperparameters are a subset of the hyperparameters which directly affect the computation cost incurred in the evaluation of any hyperparameter configuration. For example, the number of estimators (`n_estimators`) and the maximum number of leaves (`max_leaves`) are known to affect the training cost of tree-based learners. So they are cost-related hyperparameters for tree-based learners. When cost-related hyperparameters exist, the evaluation cost in the search space is heterogeneous. -In this case, designing a search space with proper ranges of the hyperparameter values is highly non-trivial. Classical tuning algorithms such as Bayesian optimization and random search are typically sensitive to such ranges. It may take them a very high cost to find a good choice if the ranges are too large. And if the ranges are too small, the optimal choice(s) may not be included and thus not possible to be found. With our method, you can use a search space with larger ranges in the case of heterogeneous cost. +In this case, designing a search space with proper ranges of the hyperparameter values is highly non-trivial. Classical tuning algorithms such as Bayesian optimization and random search are typically sensitive to such ranges. It may take them a very high cost to find a good choice if the ranges are too large. And if the ranges are too small, the optimal choice(s) may not be included and thus not possible to be found. With our method, you can use a search space with larger ranges in the case of heterogeneous cost. Our search algorithms are designed to finish the tuning process at a low total cost when the evaluation cost in the search space is heterogeneous. -So in such scenarios, if you are aware of low-cost configurations for the cost-related hyperparameters, you are encouraged to set them as the `low_cost_partial_config`, which is a dictionary of a subset of the hyperparameter coordinates whose value corresponds to a configuration with known low cost. Using the example of the tree-based methods again, since we know that small `n_estimators` and `max_leaves` generally correspond to simpler models and thus lower cost, we set `{'n_estimators': 4, 'max_leaves': 4}` as the `low_cost_partial_config` by default (note that 4 is the lower bound of search space for these two hyperparameters), e.g., in LGBM. Please find more details on how the algorithm works [here](#cfo-frugal-optimization-for-cost-related-hyperparameters). +So in such scenarios, if you are aware of low-cost configurations for the cost-related hyperparameters, you are encouraged to set them as the `low_cost_partial_config`, which is a dictionary of a subset of the hyperparameter coordinates whose value corresponds to a configuration with known low cost. 
Using the example of the tree-based methods again, since we know that small `n_estimators` and `max_leaves` generally correspond to simpler models and thus lower cost, we set `{'n_estimators': 4, 'max_leaves': 4}` as the `low_cost_partial_config` by default (note that 4 is the lower bound of search space for these two hyperparameters), e.g., in LGBM. Please find more details on how the algorithm works [here](#cfo-frugal-optimization-for-cost-related-hyperparameters). In addition, if you are aware of the cost relationship between different categorical hyperparameter choices, you are encouraged to provide this information through `cat_hp_cost`. It also helps the search algorithm to reduce the total cost. @@ -202,7 +358,7 @@ Related arguments: - `config_constraints` (optional): A list of config constraints to be satisfied. - `metric_constraints` (optional): A list of metric constraints to be satisfied. e.g., `['precision', '>=', 0.9]`. -The third step is to specify constraints of the tuning task. One notable property of `flaml.tune` is that it is able to finish the tuning process (obtaining good results) within a required resource constraint. A user can either provide the resource constraint in terms of wall-clock time (in seconds) through the argument `time_budget_s`, or in terms of the number of trials through the argument `num_samples`. The following example shows three use cases: +The third step is to specify constraints of the tuning task. One notable property of `flaml.tune` is that it is able to finish the tuning process (obtaining good results) within a required resource constraint. A user can either provide the resource constraint in terms of wall-clock time (in seconds) through the argument `time_budget_s`, or in terms of the number of trials through the argument `num_samples`. The following example shows three use cases: ```python # Set a resource constraint of 60 seconds wall-clock time for the tuning. @@ -295,8 +451,8 @@ Related arguments: Details about parallel tuning with Spark could be found [here](/docs/Examples/Integrate%20-%20Spark#parallel-spark-jobs). -You can perform parallel tuning by specifying `use_ray=True` (requiring flaml\[ray\] option installed) or `use_spark=True` -(requiring flaml\[spark\] option installed). You can also limit the amount of resources allocated per trial by specifying `resources_per_trial`, +You can perform parallel tuning by specifying `use_ray=True` (requiring flaml[ray] option installed) or `use_spark=True` +(requiring flaml[spark] option installed). You can also limit the amount of resources allocated per trial by specifying `resources_per_trial`, e.g., `resources_per_trial={'cpu': 2}` when `use_ray=True`. ```python @@ -409,11 +565,11 @@ analysis = tune.run( You can find more details about this scheduler in [this paper](https://arxiv.org/pdf/1911.04706.pdf). -#### 2. A scheduler of the [`TrialScheduler`](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers) class from `ray.tune`. +#### 2. A scheduler of the [`TrialScheduler`](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-schedulers) class from `ray.tune`. There is a handful of schedulers of this type implemented in `ray.tune`, for example, [ASHA](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler), [HyperBand](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-original-hyperband), [BOHB](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-bohb), etc. 
-To use this type of scheduler you can either (1) set `scheduler='asha'`, which will automatically create an [ASHAScheduler](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler) instance using the provided inputs (`resource_attr`, `min_resource`, `max_resource`, and `reduction_factor`); or (2) create an instance by yourself and provided it via `scheduler`, as shown in the following code example, +To use this type of scheduler you can either (1) set `scheduler='asha'`, which will automatically create an [ASHAScheduler](https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#asha-tune-schedulers-ashascheduler) instance using the provided inputs (`resource_attr`, `min_resource`, `max_resource`, and `reduction_factor`); or (2) create an instance by yourself and provided it via `scheduler`, as shown in the following code example, ```python # require: pip install flaml[ray] @@ -589,7 +745,7 @@ NOTE: ## Hyperparameter Optimization Algorithm -To tune the hyperparameters toward your objective, you will want to use a hyperparameter optimization algorithm which can help suggest hyperparameters with better performance (regarding your objective). `flaml` offers two HPO methods: CFO and BlendSearch. `flaml.tune` uses BlendSearch by default when the option \[blendsearch\] is installed. +To tune the hyperparameters toward your objective, you will want to use a hyperparameter optimization algorithm which can help suggest hyperparameters with better performance (regarding your objective). `flaml` offers two HPO methods: CFO and BlendSearch. `flaml.tune` uses BlendSearch by default when the option [blendsearch] is installed. diff --git a/website/sidebars.js b/website/sidebars.js index 85595ea14b..8dfdab98e8 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -15,6 +15,7 @@ 'Installation', {'Use Cases': [{type: 'autogenerated', dirName: 'Use-Cases'}]}, {'Examples': [{type: 'autogenerated', dirName: 'Examples'}]}, + 'Best-Practices', 'Contribute', 'Research', ],