This repository contains Herbify, a dataset for herb classification and recognition. It was derived from two existing datasets, then manually cleaned and further processed. The dataset is designed to facilitate research on herb (plant) classification and recognition using deep learning techniques.
This dataset was curated as part of the research study titled "Herbify: an ensemble deep learning framework integrating convolutional neural networks and vision transformers for precise herb identification" by Farhan Sheth, Ishika Chatter, Manvendra Jasra, Gireesh Kumar, and Richa Sharma, published in Plant Methods (BMC).
- Paper: Herbify: an ensemble deep learning framework integrating convolutional neural networks and vision transformers for precise herb identification
- Code: Development Code (model and recommendation system)
The application produced by this research is available at:
- Herbify: Herbify
- Server Code: Herbify Server
NOTE: If you use any part of this project (dataset, code, or application), please cite the work as described in the Citation section below.
Herbs have historically been central to medicinal practices, representing one of the earliest forms of therapeutic intervention. While synthetic drugs are often highly effective in treating acute conditions, their use is frequently accompanied by adverse side effects. In addition, the growing dependence on synthetic pharmaceuticals has raised concerns regarding affordability, thereby fostering a renewed interest in herbal medicine as a cost-effective and holistic alternative. In response to this need, the current study introduces a computer vision framework for accurate herb identification. A novel dataset, Herbify, was compiled from two different herb datasets and refined through rigorous cleaning, preprocessing, and quality control procedures. The resulting dataset underwent standardization via the Preprocessing Algorithm for Herb Detection (PAHD), producing a refined dataset of 6104 images, representing 91 distinct herb species, with an average of about 67 images per species. Utilizing transfer learning, the research harnessed pre-trained Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), then integrated these models into an ensemble framework that leverages the unique strengths of each architecture. Experimental results indicate that EfficientNet v2-Large achieved a noteworthy F₁-score of 99.13%, while the ensemble of EfficientNet v2-Large and ViT-Large/16, termed EfficientL-ViTL, attained an even higher F₁-score of 99.56%. Additionally, the research also introduces ‘Herbify’ application, an AI-driven framework designed to identify herbs using the developed model. By directly tackling the principal obstacles in herb identification, the proposed system achieves a highly accurate and operationally viable classification mechanism. The experimental outcomes showcase top-tier performance in herb identification and emphasize the transformative potential of AI-based solutions in supporting botanical applications.
The dataset consists of images of 91 different herbs, with a total of 6,104 images. The structure of the dataset is as follows:
Herbify-Dataset/
    Scientific name (Common name)/
        image1.jpg
        image2.jpg
        ...
    Scientific name (Common name)/
        image1.jpg
        ...
    ...
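Because each label is encoded in a folder name, it can be handy to split it into its two parts. The following is a minimal sketch that assumes every folder follows the 'Scientific name (Common name)' pattern shown above. The helper `split_label` is hypothetical, and the folder name in the example is illustrative rather than copied from the dataset.

```python
import re
from typing import Optional, Tuple

# Assumed folder-name pattern: "Scientific name (Common name)".
_LABEL_RE = re.compile(r"^(?P<scientific>.+?)\s*\((?P<common>.+)\)\s*$")


def split_label(folder_name: str) -> Tuple[str, Optional[str]]:
    """Return (scientific_name, common_name).

    common_name is None when the folder name has no parenthesised part.
    """
    cleaned = folder_name.strip()
    match = _LABEL_RE.match(cleaned)
    if match is None:
        return cleaned, None
    return match.group("scientific").strip(), match.group("common").strip()


# Illustrative folder name (an assumption, not taken from the dataset).
scientific, common = split_label("Azadirachta indica (Neem)")
print(scientific, "|", common)
```

Folder names without parentheses are returned unchanged with `None` as the common name, so the helper does not fail on classes labelled by a single name.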
- Input Parameters:
  - Image: The images of the herbs (JPEG format).
  - Label: The herb labels, given by the folder names in the format 'scientific name (common name)'.
- Output Parameter:
  - Classification: The predicted class (herb name) based on the input image.
- Total Images: 6104 images
- Images Per Class: 7 to 163 images (Average: 67 images)
- Image Format: JPG/JPEG
- Image Resolution: 103 × 94 pixels to 4236 × 4447 pixels (Average: 1267 × 1135 pixels)
- Source: Collected from two herb datasets (DIMPSAR and DeepHerb), then cleaned and processed for consistency.
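The per-class figures above can be checked against a local copy. This is a rough sketch, assuming the class folders sit directly under a local `Herbify-Dataset` directory; the function names `images_per_class` and `summarize` are hypothetical. On the full dataset the totals should come out at 6104 images over 91 classes, which gives the stated average (6104 / 91 ≈ 67.1).

```python
import os
from statistics import mean

# Extensions matching the "JPG/JPEG" image format listed above.
IMAGE_EXTENSIONS = (".jpg", ".jpeg")


def images_per_class(dataset_path):
    """Map each class folder name to the number of JPEG files it contains."""
    counts = {}
    for entry in sorted(os.listdir(dataset_path)):
        folder = os.path.join(dataset_path, entry)
        # Skip plain files and hidden folders such as .git.
        if not os.path.isdir(folder) or entry.startswith("."):
            continue
        counts[entry] = sum(
            1 for name in os.listdir(folder)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


def summarize(counts):
    """Return the total, the class count, and min / max / mean images per class."""
    values = list(counts.values())
    if not values:
        raise ValueError("no class folders found")
    return {
        "total_images": sum(values),
        "classes": len(values),
        "min_per_class": min(values),
        "max_per_class": max(values),
        "mean_per_class": mean(values),
    }


if __name__ == "__main__":
    dataset_path = "Herbify-Dataset"  # assumed local path
    if os.path.isdir(dataset_path):
        print(summarize(images_per_class(dataset_path)))
```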
The Herb Identification Dataset is structured to facilitate the classification of various herbs based on images. Some examples of herb classes from the 91 categories include: Allium cepa, Aloe barbadensis miller, Andrographis paniculata, Annona squamosa, Artocarpus heterophyllus, Azadirachta indica, Bacopa monnieri, Bambusa vulgaris, Basella alba, Brassica oleracea, Calotropis gigantea, ..., Rosa rubiginosa, Ruta graveolens, Saraca asoca, Solanum lycopersicum, Solanum nigrum, Spinacia oleracea, Syzygium cumini, Tagetes, Tamarindus indica, Tecoma stans, Tinospora cordifolia, Wrightia tinctoria, and Zingiber officinale.
The dataset is also available to be downloaded from the following sites:
- Kaggle: Herb (Plant) Classification Dataset
- HuggingFace: Coming soon
To use this dataset for your research or project:
- Clone the repository:
  git clone https://github.com/Phantom-fs/Herbify-Dataset.git
- Download the dataset:
  - Download from Kaggle or HuggingFace (links above), or use the files in this repository.
- Dataset structure:
  - Each folder is named after a herb, in the format 'scientific name (common name)', and contains images of that herb.
- Usage example (Python):
    import os
    from PIL import Image

    dataset_path = 'Herbify-Dataset'

    # Each subfolder is one herb class.
    for herb in os.listdir(dataset_path):
        herb_folder = os.path.join(dataset_path, herb)
        if os.path.isdir(herb_folder):
            for img_file in os.listdir(herb_folder):
                # Skip anything that is not a JPEG image.
                if not img_file.lower().endswith(('.jpg', '.jpeg')):
                    continue
                img_path = os.path.join(herb_folder, img_file)
                with Image.open(img_path) as img:
                    pass  # process image
Check the Code for detailed usage.
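For training, the folder layout can be turned into (image path, class index) pairs and split per class, which matters here because class sizes range from 7 to 163 images. The sketch below is one possible approach and not the split used in the paper: `index_dataset`, `stratified_split`, the 20% validation fraction, and the seed are all assumptions made for illustration.

```python
import os
import random
from collections import defaultdict

IMAGE_EXTENSIONS = (".jpg", ".jpeg")


def index_dataset(dataset_path):
    """Return (class_names, samples); samples are (image_path, class_index) pairs."""
    class_names = sorted(
        name for name in os.listdir(dataset_path)
        if os.path.isdir(os.path.join(dataset_path, name))
        and not name.startswith(".")  # ignore hidden folders such as .git
    )
    samples = []
    for class_index, name in enumerate(class_names):
        folder = os.path.join(dataset_path, name)
        for file_name in sorted(os.listdir(folder)):
            if file_name.lower().endswith(IMAGE_EXTENSIONS):
                samples.append((os.path.join(folder, file_name), class_index))
    return class_names, samples


def stratified_split(samples, val_fraction=0.2, seed=0):
    """Split class by class so every class keeps at least one training image."""
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, val = [], []
    for class_index in sorted(by_class):
        items = by_class[class_index]
        rng.shuffle(items)
        # Never move every image of a class into the validation set.
        n_val = min(int(round(len(items) * val_fraction)), len(items) - 1)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


if __name__ == "__main__":
    dataset_path = "Herbify-Dataset"  # assumed local path
    if os.path.isdir(dataset_path):
        classes, samples = index_dataset(dataset_path)
        train, val = stratified_split(samples)
        print(len(classes), len(train), len(val))
```

Splitting within each class keeps the smallest classes represented in the training set, which a plain random split over all images does not guarantee.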
If you are using the dataset, please cite using this BibTeX:
@Article{Sheth2025Herbify,
author={Sheth, Farhan and Chatter, Ishika and Jasra, Manvendra and Kumar, Gireesh and Sharma, Richa},
title={Herbify: an ensemble deep learning framework integrating convolutional neural networks and vision transformers for precise herb identification},
journal={Plant Methods},
year={2025},
month={Jul},
day={27},
volume={21},
number={1},
pages={104},
issn={1746-4811},
doi={10.1186/s13007-025-01421-5},
url={https://doi.org/10.1186/s13007-025-01421-5}
}
- DIMPSAR: DIMPSAR Dataset
- DeepHerb: DeepHerb Dataset
This dataset is intended for research purposes only. If you use it in your research, please also cite, where possible, the original sources of the datasets used to create it.
This project is licensed under:
Creative Commons Attribution-NonCommercial 4.0 International License.
See the LICENSE file for details.
This work is not to be used for commercial purposes; the dataset is intended for research and educational purposes only. It was sourced from other open-source datasets and is not intended to infringe on any copyrights. If you have any concerns or requests regarding the dataset, please contact the repository owner.